# code-review-agent-grpo

A Qwen2.5-Coder-7B model fine-tuned with GRPO (Group Relative Policy Optimization) to review and fix real-world code bugs across 6 programming languages.

Built for the Meta × HuggingFace × PyTorch OpenEnv Grand Finale, Bangalore 2026.
## Key Result

A 7B-parameter model, after GRPO training on CodeReviewEnv, outperformed a 70B-parameter baseline by +0.46 average reward (0.58 → 1.04).
| Task | Groq llama-3.3-70B (baseline) | Qwen2.5-Coder-7B (GRPO) | Change |
|---|---|---|---|
| easy | 0.95 | 1.13 | +0.18 |
| medium | 0.90 | 1.28 | +0.38 |
| hard | 0.15 | 0.48 | +0.33 (3x) |
| api_security | 0.90 | 1.20 | +0.30 |
| auth_system | 0.00 | 1.13 | +1.13 (from zero) |
| **AVERAGE** | **0.58** | **1.04** | **+0.46** |
## What This Model Does

This model acts as an AI code reviewer that can:

- **Find bugs**: structured comments with line numbers and severity levels
- **Fix bugs**: suggest correct code patches for each issue found
- **Issue verdicts**: approve or request changes with clear reasoning
- **Handle 6 languages**: Python, JavaScript, SQL, React/JSX, Django, Node.js
## Quick Start

```python
from transformers import pipeline

reviewer = pipeline(
    "text-generation",
    model="lucifer0077/code-review-agent-grpo",
    device_map="auto",
)

code_to_review = """
def process_tasks(queue):
    while queue:
        task = queue.pop(0)  # multiple workers sharing this queue
        handle(task)
"""

prompt = f"""Review this code for bugs. For each bug found, provide:
- Line number
- Severity (critical/major/minor)
- Description
- Suggested fix

Code:
{code_to_review}
"""

output = reviewer(
    [{"role": "user", "content": prompt}],
    max_new_tokens=512,
    return_full_text=False,
)[0]
print(output["generated_text"])
```
## Training Details

### Environment: CodeReviewEnv

This model was trained inside CodeReviewEnv, a custom OpenEnv-compliant RL environment with:

- **13 tasks** across 6 languages
- **Curriculum learning**: automatically promotes the agent from easy → medium → hard
- **Anti-reward-hacking**: spam detection, duplicate detection, quality checks
- **Dense reward shaping**: rewards at every step, not just at episode end
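The curriculum promotion above can be sketched as a rolling-average gate. This is an illustrative sketch, not the actual CodeReviewEnv implementation; the tier names come from the task list, but the threshold and window size are assumptions.

```python
from collections import deque

TIERS = ["easy", "medium", "hard"]

class Curriculum:
    """Promote to the next tier once recent rewards clear a threshold.

    threshold and window are hypothetical values, not the env's.
    """
    def __init__(self, threshold=0.8, window=20):
        self.tier = 0
        self.threshold = threshold
        self.rewards = deque(maxlen=window)

    @property
    def level(self):
        return TIERS[self.tier]

    def record(self, reward):
        self.rewards.append(reward)
        window_full = len(self.rewards) == self.rewards.maxlen
        if window_full and sum(self.rewards) / len(self.rewards) >= self.threshold:
            if self.tier < len(TIERS) - 1:
                self.tier += 1          # promote: easy -> medium -> hard
                self.rewards.clear()    # restart the window at the new tier

cur = Curriculum(threshold=0.8, window=20)
for _ in range(20):
    cur.record(0.9)  # consistently high reward on "easy"
print(cur.level)      # "medium"
```

Clearing the reward window on promotion prevents stale easy-tier rewards from immediately triggering a second promotion.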
### Reward Function

| Signal | Reward |
|---|---|
| Critical bug found | +0.20 |
| Major bug found | +0.12 |
| Minor bug found | +0.05 |
| False positive | -0.08 |
| Correct verdict | +0.10 |
| Wrong verdict | -0.15 |
| Correct fix (critical) | +0.40 |
| Correct fix (major) | +0.35 |
| Wrong fix | -0.10 |
| Step penalty | -0.02/step |
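The table above can be read as a per-step scoring function. The following is a hedged sketch that mirrors the table values; the actual CodeReviewEnv code may structure this differently.

```python
# Reward weights taken directly from the table above.
BUG_REWARD = {"critical": 0.20, "major": 0.12, "minor": 0.05}
FIX_REWARD = {"critical": 0.40, "major": 0.35}

def step_reward(bugs_found, false_positives, verdict_correct,
                fixes_correct, fixes_wrong):
    """bugs_found / fixes_correct: lists of severity strings.

    Hypothetical signature for illustration, not the env's API.
    """
    r = sum(BUG_REWARD[sev] for sev in bugs_found)
    r -= 0.08 * false_positives
    r += 0.10 if verdict_correct else -0.15
    r += sum(FIX_REWARD.get(sev, 0.0) for sev in fixes_correct)
    r -= 0.10 * fixes_wrong
    r -= 0.02  # flat per-step penalty discourages long episodes
    return round(r, 4)

# One critical bug found and correctly fixed, correct verdict:
print(step_reward(["critical"], 0, True, ["critical"], 0))  # 0.68
```

Note how a correct critical fix (+0.40) dominates the signal, so the agent is pushed toward patching, not just flagging.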
### Hyperparameters
| Parameter | Value |
|---|---|
| Base Model | Qwen2.5-Coder-7B-Instruct |
| GPU | A100 (40GB) |
| Training Steps | 250 |
| Framework | Unsloth + TRL |
| LoRA Rank | 32 |
| Learning Rate | 3e-6 |
| Training Time | 2h 43min |
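For context on what GRPO optimizes: instead of a learned value function, it samples a group of completions per prompt and normalizes each completion's reward against the group. A conceptual sketch of that group-relative advantage (per Shao et al., 2024), not the Unsloth/TRL training loop used here:

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (r_i - mean(group)) / std(group)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# A group of 4 sampled reviews of the same prompt, with env rewards:
advs = group_advantages([1.2, 0.6, 0.6, 0.0])
print([round(a, 2) for a in advs])  # [1.41, 0.0, 0.0, -1.41]
```

Because advantages are relative within each group, completions only need to beat their siblings, which is what makes a single reward scalar from CodeReviewEnv sufficient for training.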
### Learning Curve

Reward increased from ~0.60 to ~1.15 over 250 training steps, surpassing the Groq llama-3.3-70B baseline (0.58) early in training.
## Supported Tasks & Languages
| Language | Tasks | Bug Types |
|---|---|---|
| Python | easy, medium, hard, api_security, auth_system, orm_bugs, data_pipeline | ZeroDivisionError, SQL injection, race conditions, JWT bypass, N+1 queries |
| JavaScript | js_async, node-race | Missing await, callback hell, memory leaks |
| SQL | sql-injection | ORDER BY injection, LIMIT injection |
| React/JSX | react-security | XSS via dangerouslySetInnerHTML, token leaks |
| Django | django-auth | Timing attacks, plaintext comparison |
| Node.js | node-race | Inventory oversell, atomicity bugs |
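To make one of these bug classes concrete, here is a hypothetical snippet in the spirit of the django-auth timing-attack tasks (not taken from the actual task set): comparing secrets with `==` leaks timing information, and the standard fix is a constant-time comparison.

```python
import hmac

def check_token_buggy(supplied: str, stored: str) -> bool:
    # BUG: == short-circuits on the first differing byte,
    # leaking how much of the token an attacker has guessed.
    return supplied == stored

def check_token_fixed(supplied: str, stored: str) -> bool:
    # Constant-time comparison defeats the timing side channel.
    return hmac.compare_digest(supplied, stored)

print(check_token_fixed("s3cret", "s3cret"))  # True
print(check_token_fixed("guess0", "s3cret"))  # False
```

A trained reviewer would be expected to flag the `==` comparison as critical and propose the `hmac.compare_digest` patch.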
## Links
| Resource | Link |
|---|---|
| Live Demo | https://lucifer0077-code-review-env.hf.space |
| Training Space | https://huggingface.co/spaces/lucifer0077/code-review-training |
| Blog Post | https://huggingface.co/spaces/lucifer0077/code-review-env/blob/main/BLOG.md |
| GitHub | https://github.com/Lucifer-cyber007/meta-hackathon-open-env |
## Citation

If you use this model or environment, please cite:

```bibtex
@misc{codereviewenv2026,
  title     = {CodeReviewEnv: A Self-Improving AI Code Review Agent via GRPO},
  author    = {Aditya Sharma},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/lucifer0077/code-review-agent-grpo}
}
```
### Framework Citations

```bibtex
@article{shao2024deepseekmath,
  title  = {DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models},
  author = {Zhihong Shao et al.},
  year   = {2024},
  eprint = {arXiv:2402.03300}
}
```
Built at the Meta × HuggingFace × PyTorch OpenEnv Grand Finale, April 2026, Bangalore

Theme 4: Self-Improving Agent | Theme 3.1: Professional Tasks