🔍 code-review-agent-grpo

A Qwen2.5-Coder-7B model fine-tuned with GRPO (Group Relative Policy Optimization) to review and fix real-world code bugs across 6 programming languages.

Built for the Meta × HuggingFace × PyTorch OpenEnv Grand Finale, Bangalore 2026.


πŸ† Key Result

A 7B-parameter model, after GRPO training on CodeReviewEnv, outperformed a 70B-parameter baseline by 0.46 reward points on average (0.58 → 1.04).

| Task | Groq llama-3.3-70B (baseline) | Qwen2.5-Coder-7B (GRPO) | Change |
|------|-------------------------------|--------------------------|--------|
| easy | 0.95 | 1.13 | ↑ +0.18 |
| medium | 0.90 | 1.28 | ↑ +0.38 |
| hard | 0.15 | 0.48 | ↑ +0.33 (3x!) |
| api_security | 0.90 | 1.20 | ↑ +0.30 |
| auth_system | 0.00 | 1.13 | ↑ +1.13 (from zero!) |
| AVERAGE | 0.58 | 1.04 | ↑ +0.46 |

🎯 What This Model Does

This model acts as an AI code reviewer that can:

  1. Find bugs – structured comments with line numbers and severity levels
  2. Fix bugs – suggest correct code patches for each issue found
  3. Issue verdicts – approve or request changes with clear reasoning
  4. Handle 6 languages – Python, JavaScript, SQL, React/JSX, Django, Node.js

🚀 Quick Start

```python
from transformers import pipeline

reviewer = pipeline(
    "text-generation",
    model="lucifer0077/code-review-agent-grpo",
    device_map="auto",
)

code_to_review = """
def process_tasks(queue):
    while queue:
        task = queue.pop(0)  # multiple workers sharing this queue
        handle(task)
"""

prompt = f"""Review this code for bugs. For each bug found, provide:
- Line number
- Severity (critical/major/minor)
- Description
- Suggested fix

Code:
{code_to_review}
"""

output = reviewer(
    [{"role": "user", "content": prompt}],
    max_new_tokens=512,
    return_full_text=False,
)[0]

print(output["generated_text"])
```
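The snippet being reviewed is exactly the kind of bug the model should flag: `list.pop(0)` on a list shared between workers is not thread-safe. For reference, a minimal sketch of the fix a reviewer should suggest (this is illustrative, not model output) replaces the plain list with the thread-safe `queue.Queue`:

```python
import queue
import threading

results = []
results_lock = threading.Lock()

def handle(task):
    # Stand-in for real task handling; the lock keeps appends safe.
    with results_lock:
        results.append(task)

def process_tasks(task_queue):
    """Worker loop: queue.Queue.get() is thread-safe, unlike list.pop(0)."""
    while True:
        task = task_queue.get()
        if task is None:        # sentinel tells this worker to shut down
            break
        handle(task)
        task_queue.task_done()

q = queue.Queue()
workers = [threading.Thread(target=process_tasks, args=(q,)) for _ in range(2)]
for w in workers:
    w.start()
for task in range(5):
    q.put(task)
q.join()                        # block until every task is handled
for _ in workers:
    q.put(None)                 # one sentinel per worker
for w in workers:
    w.join()

print(sorted(results))          # → [0, 1, 2, 3, 4]
```

Unlike the original, two workers can now consume from the same queue without losing or double-processing tasks.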

πŸ—οΈ Training Details

Environment: CodeReviewEnv

This model was trained inside CodeReviewEnv, a custom OpenEnv-compliant RL environment with:

  • 13 tasks across 6 languages
  • Curriculum learning – automatically promotes the agent from easy → medium → hard
  • Anti-reward hacking – spam detection, duplicate detection, quality checks
  • Dense reward shaping – rewards at every step, not just at episode end
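The exact promotion rule is not published in this card; a minimal sketch of how such a curriculum gate could work (function name, window size, and threshold are illustrative assumptions):

```python
LEVELS = ["easy", "medium", "hard"]

def next_level(level: str, recent_rewards: list[float],
               window: int = 20, threshold: float = 0.8) -> str:
    """Promote to the next difficulty once the rolling mean reward over
    the last `window` episodes clears `threshold` (illustrative values)."""
    if len(recent_rewards) < window:
        return level                      # not enough evidence yet
    mean_reward = sum(recent_rewards[-window:]) / window
    idx = LEVELS.index(level)
    if mean_reward >= threshold and idx < len(LEVELS) - 1:
        return LEVELS[idx + 1]
    return level

print(next_level("easy", [0.9] * 20))     # → medium
print(next_level("medium", [0.5] * 20))   # → medium
```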

Reward Function

| Signal | Reward |
|--------|--------|
| ✅ Critical bug found | +0.20 |
| ✅ Major bug found | +0.12 |
| ✅ Minor bug found | +0.05 |
| ❌ False positive | −0.08 |
| ✅ Correct verdict | +0.10 |
| ❌ Wrong verdict | −0.15 |
| ✅ Correct fix (critical) | +0.40 |
| ✅ Correct fix (major) | +0.35 |
| ❌ Wrong fix | −0.10 |
| ⏱️ Step penalty | −0.02/step |
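Combining these signals, a simplified reward computation consistent with the table might look like this (a sketch with illustrative argument names; the real environment also applies the spam, duplicate, and quality checks mentioned above):

```python
# Per-signal rewards, taken from the table above.
BUG_REWARD = {"critical": 0.20, "major": 0.12, "minor": 0.05}
FIX_REWARD = {"critical": 0.40, "major": 0.35}

def episode_reward(found_bugs, false_positives, verdict_correct,
                   correct_fixes, wrong_fixes, steps):
    """Sum the shaped reward for one review episode."""
    r = 0.0
    r += sum(BUG_REWARD[sev] for sev in found_bugs)   # true findings
    r -= 0.08 * false_positives
    r += 0.10 if verdict_correct else -0.15
    r += sum(FIX_REWARD[sev] for sev in correct_fixes)
    r -= 0.10 * wrong_fixes
    r -= 0.02 * steps                                  # step penalty
    return round(r, 4)

# One critical bug found and fixed, correct verdict, 3 steps:
print(episode_reward(["critical"], 0, True, ["critical"], 0, 3))  # → 0.64
```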

Hyperparameters

| Parameter | Value |
|-----------|-------|
| Base Model | Qwen2.5-Coder-7B-Instruct |
| GPU | A100 (40GB) |
| Training Steps | 250 |
| Framework | Unsloth + TRL |
| LoRA Rank | 32 |
| Learning Rate | 3e-6 |
| Training Time | 2h 43min |
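These values map onto the current TRL and PEFT APIs roughly as follows. This is a hedged configuration sketch, not the published training script; `lora_alpha`, `target_modules`, and `num_generations` are assumptions not stated in the table:

```python
# Sketch of a GRPO fine-tuning config matching the table above.
from peft import LoraConfig
from trl import GRPOConfig

lora_config = LoraConfig(
    r=32,                          # LoRA rank, from the table
    lora_alpha=64,                 # assumption: 2x rank, a common default
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)

training_args = GRPOConfig(
    output_dir="code-review-agent-grpo",
    learning_rate=3e-6,            # from the table
    max_steps=250,                 # from the table
    num_generations=8,             # assumption: completions sampled per prompt
)
```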

Learning Curve

Reward increased from ~0.60 → ~1.15 over 250 training steps, surpassing the Groq llama-3.3-70B baseline (0.58) early in training.
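The core of GRPO is the group-relative advantage: several completions are sampled per prompt, and each is scored against the group's mean and standard deviation instead of a learned value model. A minimal sketch of that computation (not the training code used for this run):

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each completion's reward within its group:
    A_i = (r_i - mean(group)) / std(group), as in GRPO."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Three completions for one prompt, scored by the reward function:
print(group_relative_advantages([0.6, 1.0, 1.4]))
```

Completions above the group mean get positive advantages and are reinforced; below-mean completions are pushed down, with no critic network needed.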


🌍 Supported Tasks & Languages

| Language | Tasks | Bug Types |
|----------|-------|-----------|
| Python | easy, medium, hard, api_security, auth_system, orm_bugs, data_pipeline | ZeroDivisionError, SQL injection, race conditions, JWT bypass, N+1 queries |
| JavaScript | js_async, node-race | Missing await, callback hell, memory leaks |
| SQL | sql-injection | ORDER BY injection, LIMIT injection |
| React/JSX | react-security | XSS via dangerouslySetInnerHTML, token leaks |
| Django | django-auth | Timing attacks, plaintext comparison |
| Node.js | node-race | Inventory oversell, atomicity bugs |
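To illustrate the sql-injection task family: ORDER BY columns cannot be bound as query parameters, so the standard fix is an allowlist. A hypothetical vulnerable-then-fixed sketch (not taken from the environment's actual task set):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("alice", 30), ("bob", 25)])

# Vulnerable pattern: f"SELECT ... ORDER BY {user_input}" lets an attacker
# inject arbitrary SQL, because ORDER BY columns cannot be bound parameters.
ALLOWED_SORT_COLUMNS = {"name", "age"}  # the allowlist fix

def list_users(sort_by: str):
    if sort_by not in ALLOWED_SORT_COLUMNS:
        raise ValueError(f"invalid sort column: {sort_by!r}")
    # Safe: sort_by is now guaranteed to be a known column name.
    rows = conn.execute(f"SELECT name FROM users ORDER BY {sort_by}")
    return [name for (name,) in rows]

print(list_users("age"))   # → ['bob', 'alice']
```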

🔗 Links


📚 Citation

If you use this model or environment, please cite:

```bibtex
@misc{codereviewenv2026,
  title     = {CodeReviewEnv: A Self-Improving AI Code Review Agent via GRPO},
  author    = {Aditya Sharma},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/lucifer0077/code-review-agent-grpo}
}
```

Framework Citations

```bibtex
@article{shao2024deepseekmath,
  title  = {DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models},
  author = {Shao, Zhihong and others},
  year   = {2024},
  eprint = {arXiv:2402.03300}
}
```

Built at the Meta × HuggingFace × PyTorch OpenEnv Grand Finale, April 2026, Bangalore

Theme 4: Self-Improving Agent | Theme 3.1: Professional Tasks
