# code-review-agent-grpo

A Qwen2.5-Coder-7B model fine-tuned with GRPO (Group Relative Policy Optimization) to review and fix real-world code bugs across 6 programming languages.

Built for the Meta × HuggingFace × PyTorch OpenEnv Grand Finale, Bangalore 2026.
## Key Result

A 7B-parameter model, after GRPO training on CodeReviewEnv, outperformed a 70B-parameter baseline by +0.46 average reward (0.58 → 1.04).
| Task | Groq llama-3.3-70B (baseline) | Qwen2.5-Coder-7B (GRPO) | Change |
|---|---|---|---|
| easy | 0.95 | 1.13 | +0.18 |
| medium | 0.90 | 1.28 | +0.38 |
| hard | 0.15 | 0.48 | +0.33 (3x) |
| api_security | 0.90 | 1.20 | +0.30 |
| auth_system | 0.00 | 1.13 | +1.13 (from zero) |
| **AVERAGE** | **0.58** | **1.04** | **+0.46** |
## What This Model Does

This model acts as an AI code reviewer that can:

- **Find bugs**: structured comments with line numbers and severity levels
- **Fix bugs**: suggest correct code patches for each issue found
- **Issue verdicts**: approve or request changes with clear reasoning
- **Handle 6 languages**: Python, JavaScript, SQL, React/JSX, Django, Node.js
## Quick Start

```python
from transformers import pipeline

reviewer = pipeline(
    "text-generation",
    model="lucifer0077/code-review-agent-grpo",
    device_map="auto",
)

code_to_review = """
def process_tasks(queue):
    while queue:
        task = queue.pop(0)  # multiple workers sharing this queue
        handle(task)
"""

prompt = f"""Review this code for bugs. For each bug found, provide:
- Line number
- Severity (critical/major/minor)
- Description
- Suggested fix

Code:
{code_to_review}
"""

output = reviewer(
    [{"role": "user", "content": prompt}],
    max_new_tokens=512,
    return_full_text=False,
)[0]
print(output["generated_text"])
```
## Training Details

### Environment: CodeReviewEnv

This model was trained inside CodeReviewEnv, a custom OpenEnv-compliant RL environment with:

- **13 tasks** across 6 languages
- **Curriculum learning**: automatically promotes the agent from easy → medium → hard
- **Anti-reward-hacking**: spam detection, duplicate detection, quality checks
- **Dense reward shaping**: rewards at every step, not just at episode end
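The curriculum promotion above can be sketched as a rolling-average gate. This is an illustrative sketch, not the actual CodeReviewEnv implementation; the tier names come from the task list, but the threshold and window size are assumptions.

```python
from collections import deque

TIERS = ["easy", "medium", "hard"]

class Curriculum:
    """Promote to the next tier once recent rewards clear a threshold.

    threshold and window are hypothetical values, not the env's.
    """
    def __init__(self, threshold=0.8, window=20):
        self.tier = 0
        self.threshold = threshold
        self.rewards = deque(maxlen=window)

    @property
    def level(self):
        return TIERS[self.tier]

    def record(self, reward):
        self.rewards.append(reward)
        window_full = len(self.rewards) == self.rewards.maxlen
        if window_full and sum(self.rewards) / len(self.rewards) >= self.threshold:
            if self.tier < len(TIERS) - 1:
                self.tier += 1          # promote: easy -> medium -> hard
                self.rewards.clear()    # restart the window at the new tier

cur = Curriculum(threshold=0.8, window=20)
for _ in range(20):
    cur.record(0.9)  # consistently high reward on "easy"
print(cur.level)      # "medium"
```

Clearing the reward window on promotion prevents stale easy-tier rewards from immediately triggering a second promotion.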
### Reward Function

| Signal | Reward |
|---|---|
| Critical bug found | +0.20 |
| Major bug found | +0.12 |
| Minor bug found | +0.05 |
| False positive | -0.08 |
| Correct verdict | +0.10 |
| Wrong verdict | -0.15 |
| Correct fix (critical) | +0.40 |
| Correct fix (major) | +0.35 |
| Wrong fix | -0.10 |
| Step penalty | -0.02/step |
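The table above can be read as a per-step scoring function. The following is a hedged sketch that mirrors the table values; the actual CodeReviewEnv code may structure this differently.

```python
# Reward weights taken directly from the table above.
BUG_REWARD = {"critical": 0.20, "major": 0.12, "minor": 0.05}
FIX_REWARD = {"critical": 0.40, "major": 0.35}

def step_reward(bugs_found, false_positives, verdict_correct,
                fixes_correct, fixes_wrong):
    """bugs_found / fixes_correct: lists of severity strings.

    Hypothetical signature for illustration, not the env's API.
    """
    r = sum(BUG_REWARD[sev] for sev in bugs_found)
    r -= 0.08 * false_positives
    r += 0.10 if verdict_correct else -0.15
    r += sum(FIX_REWARD.get(sev, 0.0) for sev in fixes_correct)
    r -= 0.10 * fixes_wrong
    r -= 0.02  # flat per-step penalty discourages long episodes
    return round(r, 4)

# One critical bug found and correctly fixed, correct verdict:
print(step_reward(["critical"], 0, True, ["critical"], 0))  # 0.68
```

Note how a correct critical fix (+0.40) dominates the signal, so the agent is pushed toward patching, not just flagging.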
### Hyperparameters
| Parameter | Value |
|---|---|
| Base Model | Qwen2.5-Coder-7B-Instruct |
| GPU | A100 (40GB) |
| Training Steps | 250 |
| Framework | Unsloth + TRL |
| LoRA Rank | 32 |
| Learning Rate | 3e-6 |
| Training Time | 2h 43min |
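For context on what GRPO optimizes: instead of a learned value function, it samples a group of completions per prompt and normalizes each completion's reward against the group. A conceptual sketch of that group-relative advantage (per Shao et al., 2024), not the Unsloth/TRL training loop used here:

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (r_i - mean(group)) / std(group)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# A group of 4 sampled reviews of the same prompt, with env rewards:
advs = group_advantages([1.2, 0.6, 0.6, 0.0])
print([round(a, 2) for a in advs])  # [1.41, 0.0, 0.0, -1.41]
```

Because advantages are relative within each group, completions only need to beat their siblings, which is what makes a single reward scalar from CodeReviewEnv sufficient for training.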
### Learning Curve

Reward increased from ~0.60 to ~1.15 over 250 training steps, surpassing the Groq llama-3.3-70B baseline (0.58) early in training.
## Supported Tasks & Languages
| Language | Tasks | Bug Types |
|---|---|---|
| Python | easy, medium, hard, api_security, auth_system, orm_bugs, data_pipeline | ZeroDivisionError, SQL injection, race conditions, JWT bypass, N+1 queries |
| JavaScript | js_async, node-race | Missing await, callback hell, memory leaks |
| SQL | sql-injection | ORDER BY injection, LIMIT injection |
| React/JSX | react-security | XSS via dangerouslySetInnerHTML, token leaks |
| Django | django-auth | Timing attacks, plaintext comparison |
| Node.js | node-race | Inventory oversell, atomicity bugs |
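To make one of these bug classes concrete, here is a hypothetical snippet in the spirit of the django-auth timing-attack tasks (not taken from the actual task set): comparing secrets with `==` leaks timing information, and the standard fix is a constant-time comparison.

```python
import hmac

def check_token_buggy(supplied: str, stored: str) -> bool:
    # BUG: == short-circuits on the first differing byte,
    # leaking how much of the token an attacker has guessed.
    return supplied == stored

def check_token_fixed(supplied: str, stored: str) -> bool:
    # Constant-time comparison defeats the timing side channel.
    return hmac.compare_digest(supplied, stored)

print(check_token_fixed("s3cret", "s3cret"))  # True
print(check_token_fixed("guess0", "s3cret"))  # False
```

A trained reviewer would be expected to flag the `==` comparison as critical and propose the `hmac.compare_digest` patch.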
## Links
| Resource | Link |
|---|---|
| Live Demo | https://lucifer0077-code-review-env.hf.space |
| Training Space | https://huggingface.co/spaces/lucifer0077/code-review-training |
| Blog Post | https://huggingface.co/spaces/lucifer0077/code-review-env/blob/main/BLOG.md |
| GitHub | https://github.com/Lucifer-cyber007/meta-hackathon-open-env |
## Citation

If you use this model or environment, please cite:

```bibtex
@misc{codereviewenv2026,
  title     = {CodeReviewEnv: A Self-Improving AI Code Review Agent via GRPO},
  author    = {Aditya Sharma},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/lucifer0077/code-review-agent-grpo}
}
```
### Framework Citations

```bibtex
@article{shao2024deepseekmath,
  title  = {DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models},
  author = {Zhihong Shao et al.},
  year   = {2024},
  eprint = {arXiv:2402.03300}
}
```
Built at the Meta × HuggingFace × PyTorch OpenEnv Grand Finale, April 2026, Bangalore

Theme 4: Self-Improving Agent | Theme 3.1: Professional Tasks