AbbasSabra committed
Commit 14bacf3 · 1 Parent(s): db25f30

Introduce the model card

Files changed (1): README.md (+314, -3)

README.md CHANGED
---
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- code
- sft
- reasoning
- fine-tuned
- java
- gpt-oss
- bf16
base_model:
- openai/gpt-oss-20b
base_model_relation: finetune
datasets:
- OpenCoder-LLM/opc-sft-stage1
- OpenCoder-LLM/opc-sft-stage2
metrics:
- code_eval
- accuracy
model-index:
- name: SonarSweep-java-gpt-oss-20b
  results:
  - task:
      type: text-generation
    dataset:
      type: nuprl/MultiPL-E
      name: MultiPL-HumanEval (Java)
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.845
      verified: false
  - task:
      type: text-generation
    dataset:
      type: nuprl/MultiPL-E
      name: MultiPL-MBPP (Java)
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.674
      verified: false
  - task:
      type: text-generation
    dataset:
      type: nuprl/MultiPL-E
      name: MultiPL-HumanEval (Python)
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.4739
      verified: false
  - task:
      type: text-generation
    dataset:
      type: nuprl/MultiPL-E
      name: Sanitized-MBPP (Python)
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.2989
      verified: false
  - task:
      type: text-generation
    dataset:
      type: nuprl/MultiPL-E
      name: MultiPL-MBPP (PHP)
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.658
      verified: false
  - task:
      type: text-generation
    dataset:
      type: nuprl/MultiPL-E
      name: MultiPL-MBPP (TypeScript)
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.73
      verified: false
  - task:
      type: text-generation
    dataset:
      type: nuprl/MultiPL-E
      name: MultiPL-MBPP (Go)
    metrics:
    - name: pass@1
      type: pass@1
      value: 0.369
      verified: false
  - task:
      type: text-generation
    dataset:
      type: cais/mmlu
      name: MMLU
      split: test
    metrics:
    - name: accuracy
      type: accuracy
      value: 0.7812
      verified: false
---
# SonarSweep Java gpt-oss-20b

## Model Details

### Model Description

This is a fine-tuned version of [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b), optimized for high-quality Java code generation. [SonarSweep](https://www.sonarsource.com/products/sonarsweep/) was used to create a high-quality Java dataset. Fine-tuning on this dataset has produced a model that generates expert-level Java patterns while introducing far fewer of the bugs and vulnerabilities observed when benchmarking the base model.

- **Developed by:** SonarSweep Team at Sonar
- **Model type:** Mixture of Experts
- **Languages:** Primarily Java, English
- **License:** Apache 2.0

## Uses

This model is designed primarily as a demonstration of our SonarSweep pipeline, applied here to fine-tuning on Java.

By using SonarSweep for targeted data preprocessing, pass@1 metrics are maintained across all coding benchmarks, while the number of bugs and vulnerabilities is significantly reduced compared to the base model.

Although it is a demonstration model, we have tested it to ensure its responses are helpful, natural, and adhere to instructions, and we have evaluated it on a range of software-engineering and general-purpose benchmarks to confirm it remains widely useful.

We focus on `gpt-oss-20b` in the "low" reasoning setting, which we refer to as `gpt-oss-20b-low`. In this setting, the base model responds quickly and achieves relatively high scores on code generation benchmarks. This configuration is appropriate, for example, for developers using an LLM for in-line completions.

As with the base model, the fine-tuned model was trained on data in OpenAI's [harmony response format](https://github.com/openai/harmony). The model should only be used with the harmony format; it will not work correctly otherwise.

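To see concretely what the harmony format looks like, you can render the chat template without tokenizing. A minimal illustration using the standard Transformers API (the prompt content is arbitrary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("SonarSource/SonarSweep-java-gpt-oss-20b")

messages = [{"role": "user", "content": "Write a Java record for a 2D point."}]

# Render the harmony-formatted prompt as a string instead of token ids.
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```
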
Technical recommendations for usage are included below; see [Getting Started](#getting-started).

### Reasoning Capabilities

This model operates exclusively as a low-reasoning model, derived from `gpt-oss-20b-low`. It is optimized for speed and standard conversational tasks rather than complex chain-of-thought processing.

Please note that specifying or adjusting reasoning effort is not supported. Any parameters attempting to enforce "medium" or "high" reasoning settings will be ignored or may result in an error. The model is hard-coded to a low-reasoning profile.

## Bias, Risks, and Limitations

Despite being trained to output high-quality Java code and achieving substantial improvements on our [code quality benchmarks](#code-quality), the model is still liable to generate bugs and security vulnerabilities. Users must never treat generated code as production-ready without a thorough review using static analysis, for example with [SonarQube](https://www.sonarsource.com/products/sonarqube/).

Our model's (and the base model's) knowledge is static, based on its training data cutoff. We cannot guarantee adherence to the latest Java standards, best practices in newly released libraries, or correct use of private or proprietary APIs.

## Getting Started

You can use the `SonarSweep-java-gpt-oss-20b` model with Transformers. If you use the Transformers chat template, it will automatically apply the harmony response format. If you use `model.generate` directly, you need to apply the harmony format manually using the chat template; a sketch of this is shown after the pipeline example below.

### Minimum Requirements

- **GPU Memory:** 48GB+ VRAM required for the model loaded in bf16 precision
- **Storage:** 100GB

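As a rough guide to the VRAM figure: 20.9 billion parameters at 2 bytes per parameter in bf16 comes to roughly 42 GB for the weights alone, before activations and the KV cache.
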
To get started, install the necessary dependencies to set up your environment:

```shell
pip install -U transformers kernels torch
```

Once installed, you can run the model using the following snippet:

```python
from transformers import pipeline

model_id = "SonarSource/SonarSweep-java-gpt-oss-20b"

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Write a function in Java that creates a two-dimensional array with 5 rows and 2 columns, each element of which is a random number between 1 and 50."},
]

outputs = pipe(
    messages,
    max_new_tokens=2048,
)
print(outputs[0]["generated_text"][-1])
```
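If you prefer to call `model.generate` directly, the chat template applies the harmony format for you. A minimal sketch using the standard Transformers APIs (the prompt and generation parameters are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SonarSource/SonarSweep-java-gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "user", "content": "Write a Java method that reverses a string without using StringBuilder."},
]

# The chat template renders the conversation in the harmony format the model expects.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=1024)

# Note: the raw completion may contain the model's analysis channel before the final answer.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
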
For more details, see the [Transformers section of the gpt-oss-20b model card](https://huggingface.co/openai/gpt-oss-20b#transformers).

## Training Details

### Training Data

We compiled open-source code data from [OpenCoder Datasets](https://huggingface.co/collections/OpenCoder-LLM/opencoder-datasets) and synthetic alignment data generated using [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) to create a Java dataset of 70k examples. We then used SonarSweep to improve the quality of the dataset.

### Training Hyperparameters

We trained LoRA adapters across all linear layers of the experts and attention blocks.

| Parameter | Value |
|-----------|-------|
| Batch Size | 64 |
| Training Epochs | 2 |
| Learning Rate | 1e-4 |
| LR Scheduler | Cosine with 10% Warmup |
| LoRA Rank | 64 |
| LoRA Alpha | 128 |
| Attention Mechanism | [SDPA](https://huggingface.co/docs/transformers/v4.57.3/en/perf_infer_gpu_one#scaled-dot-product-attention-sdpa) |
| Precision | bf16 mixed precision |

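The exact training script is not published; the following is a minimal sketch of how the reported LoRA setup could be expressed with `peft` and Transformers. The `target_modules="all-linear"` shorthand and the attention flag are assumptions based on the table above, not a verbatim reproduction of our configuration.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load the base model in bf16 with SDPA attention, as reported in the table.
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    device_map="auto",
)

# LoRA rank 64, alpha 128. "all-linear" asks peft to adapt every nn.Linear
# projection it can find; covering the fused expert weights in the gpt-oss
# implementation may require additional, model-specific target modules.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # compare against the trainable-parameter count reported below
```
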
### Model Architecture

| Property | Value |
|----------|-------|
| Architecture | gpt-oss (Transformer-based Mixture of Experts) |
| Parameters | 20.9 billion (3.6 billion active) |
| Trainable Parameters | 740M (3.4% of total) |

## Evaluation

### Code Quality

We used SonarQube to evaluate the quality, verbosity, and complexity of Java code generated for the [ComplexCodeEval](https://github.com/ComplexCodeEval/ComplexCodeEval) and [MultiPL-E Java](https://huggingface.co/datasets/nuprl/MultiPL-E/viewer/humaneval-java) benchmarks.

The fine-tuned and base models achieve a similar pass@1 metric for code generation (within one percentage point). Results for other languages are shown in the next subsection.

The fine-tuned model achieves this while generating fewer lines of code.

For code quality, we see a dramatic reduction in both the number and density of Sonar issues, split among bugs, security vulnerabilities, and code smells (see the [Glossary](#glossary) for definitions).

| Metric | Base Model | Fine-tuned Model |
|--------|------------|------------------|
| MultiPL-E Java Pass@1 (%) | 71.49 | 72.37 |
| Lines of Code Generated | 247,895 | 233,031 |
| Bugs Generated | 222 | 123 |
| Bugs per KLOC | 0.9 | 0.53 |
| Security Vulnerabilities | 102 | 56 |
| Vulnerabilities per KLOC | 0.41 | 0.24 |
| Code Smells | 4,968 | 3,796 |
| Code Smells per KLOC | 20.04 | 16.29 |

Fine-tuning also reduced cyclomatic and cognitive complexity, both in total and per thousand lines of code.

| Complexity Metric | Base Model | Fine-tuned Model |
|-------------------|------------|------------------|
| Cyclomatic (Total) | 52,139 | 45,006 |
| Cyclomatic (per KLOC) | 210.33 | 193.13 |
| Cognitive (Total) | 30,871 | 24,419 |
| Cognitive (per KLOC) | 124.53 | 104.79 |

Note: KLOC = thousand lines of code.
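The per-KLOC figures are simply the issue counts divided by the lines of code in thousands; for example, reproducing the bug densities from the table above:

```python
# Per-KLOC density = issue count / (lines of code / 1000), using the totals above.
base_loc, tuned_loc = 247_895, 233_031

def per_kloc(count: int, loc: int) -> float:
    return count / (loc / 1000)

print(round(per_kloc(222, base_loc), 2))   # 0.9  bugs per KLOC, base model
print(round(per_kloc(123, tuned_loc), 2))  # 0.53 bugs per KLOC, fine-tuned model
```
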
### Code Generation

MultiPL-E is a multi-language parallel benchmark for evaluating LLMs on natural-language-to-code generation tasks; for each language, it provides translated versions of HumanEval and MBPP.

We fine-tuned on Java but evaluated a selection of the available languages. For all languages, scores are averages over 10 samples with the temperature set to 0.01.

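For reference, this is how a pass@1 score of that kind can be computed from per-sample test results; the data layout is illustrative, not the MultiPL-E harness itself:

```python
from statistics import mean

# results[problem_id] -> one boolean per generated sample (10 samples per problem here)
results = {
    "humaneval_java_0": [True, True, True, False, True, True, True, True, True, True],
    "humaneval_java_1": [True] * 10,
}

# pass@1 per problem is the fraction of samples that pass; the benchmark score is the mean over problems.
pass_at_1 = mean(sum(samples) / len(samples) for samples in results.values())
print(f"pass@1 = {pass_at_1:.2%}")  # 95.00% for this toy example
```
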
#### Results

The differences are not significant for any language, showing that fine-tuning does not degrade the model's ability to generate functional code.

| Language | Dataset | Num Examples | Base Model Pass@1 | Fine-tuned Model Pass@1 |
|----------|---------|--------------|-------------------|-------------------------|
| Java | HumanEval | 158 | 85.40% | 84.50% |
| Java | MBPP | 386 | 65.80% | 67.40% |
| Python | HumanEval | 164 | 43.06% | 47.39% |
| Python | MBPP* | 257 | 26.50% | 29.89% |
| PHP | MBPP | 397 | 63.20% | 65.80% |
| TypeScript | MBPP | 390 | 74.10% | 73.00% |
| Go | MBPP | 374 | 35.00% | 36.90% |

\* This is the [sanitized MBPP benchmark](https://github.com/google-research/google-research/tree/master/mbpp) from the original [Google Research paper](https://arxiv.org/abs/2108.07732).

### General Ability: MMLU

The MMLU (Massive Multitask Language Understanding) benchmark evaluates the model's general knowledge with 14,042 multiple-choice questions across a wide range of subjects.

| Metric | Base Model | Fine-tuned Model |
|--------|------------|------------------|
| Correct Answers | 11,081 | 10,969 |
| Accuracy | 78.91% | 78.12% |

The fine-tuned model maintains comparable performance on MMLU, with only a 0.79 percentage-point decrease in accuracy, demonstrating that specialization in Java code quality does not significantly impact general knowledge capabilities.

## Glossary

**SonarSweep:** A pipeline that analyzes and remediates code for training datasets. For more details, see the announcement on [sonarsource.com](https://www.sonarsource.com/products/sonarsweep/).

**Lines of Code (LOC):** The number of lines of code generated, not counting comments or blank lines. At the scale of these evaluations, we report densities per KLOC (thousand lines of code).

**Code Quality:** Using static analysis, SonarQube monitors and measures three core [software qualities](https://docs.sonarsource.com/sonarqube-server/user-guide/rules/software-qualities), each of which has an associated issue type:

- **Security:** The protection of your software from unauthorized access, use, or destruction. Detected security issues are called *Vulnerabilities*.
- **Reliability:** A measure of how well your software maintains its level of performance under stated conditions for a stated period of time. Detected reliability issues are called *Bugs*.
- **Maintainability:** The ease with which you can repair, improve, and understand software code. Detected maintainability issues are called *Code Smells*.

SonarQube analysis for Java supports the detection of a wide range of quality issues. For details, see [rules.sonarsource.com](https://rules.sonarsource.com/java/).

## Acknowledgements

- **Base Model:** [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b)
- **Code Quality Assessment Methodology:** [Assessing the Quality and Security of AI-Generated Code: A Quantitative Analysis](https://arxiv.org/abs/2508.14727)
- **Benchmarks:**
  - ComplexCodeEval: [https://arxiv.org/abs/2409.10280](https://arxiv.org/abs/2409.10280)
  - MultiPL-E: [https://arxiv.org/abs/2208.08227](https://arxiv.org/abs/2208.08227)
  - MMLU: [https://arxiv.org/abs/2009.03300](https://arxiv.org/abs/2009.03300)
  - MBPP: [https://arxiv.org/abs/2108.07732](https://arxiv.org/abs/2108.07732) | [Code](https://github.com/google-research/google-research/tree/master/mbpp)
- **OpenCoder:** [https://opencoder-llm.github.io/](https://opencoder-llm.github.io/)

## Model Card Authors

SonarSweep Team

For feedback: [https://community.sonarsource.com/](https://community.sonarsource.com/)