	---
	license: apache-2.0
	base_model: deepseek-ai/deepseek-coder-1.3b-instruct
	tags:
	- security
	- vulnerability-detection
	- penetration-testing
	- code-analysis
	- cybersecurity
	- lora
	- deepseek
	library_name: peft
	pipeline_tag: text-generation
	---

	# Pentest Vulnerability Detector

	## Model Description

	This is a LoRA adapter for DeepSeek-Coder-1.3B-Instruct, fine-tuned to detect and explain security vulnerabilities in source code.

	- Base Model: deepseek-ai/deepseek-coder-1.3b-instruct
	- Training Data: 440 synthetic vulnerability examples
	- Training Method: LoRA (Low-Rank Adaptation) with 4-bit quantization
	- Training Platform: Google Colab (free T4 GPU)
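
For readers who want to reproduce a similar setup, the following is a minimal sketch of how LoRA with 4-bit quantization is commonly configured with PEFT and Transformers. Only the rank (8) and the NF4 quantization type come from the Training Details section of this card. The compute dtype, `lora_alpha`, `lora_dropout`, and `target_modules` are assumptions chosen for illustration and may differ from what was actually used.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization for the frozen base weights (NF4 is stated in this card).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # assumption: fp16, since the T4 lacks bf16 support
)

# Rank-8 LoRA adapter (rank is stated in this card; the rest are assumed values).
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,                        # assumption
    lora_dropout=0.05,                    # assumption
    target_modules=["q_proj", "v_proj"],  # assumption: attention projections
    task_type="CAUSAL_LM",
)
```

In a typical workflow, `bnb_config` would be passed as `quantization_config` to `AutoModelForCausalLM.from_pretrained`, and `lora_config` to `peft.get_peft_model`.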

	## Capabilities

	The model can detect and analyze:
	- SQL Injection
	- Cross-Site Scripting (XSS)
	- Command Injection / RCE
	- Insecure Direct Object Reference (IDOR)
	- Server-Side Request Forgery (SSRF)
	- Authentication Bypass
	- Cross-Site Request Forgery (CSRF)
	- Path Traversal

	## Training Details

	- Examples: 440 vulnerability patterns
	- Epochs: 3
	- Batch Size: 2 (with gradient accumulation)
	- Learning Rate: 2e-4
	- LoRA Rank: 8
	- Quantization: 4-bit (NF4)
	- Training Time: ~45-60 minutes on T4 GPU
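
To get a feel for how short this training run is, the sketch below estimates the number of optimizer steps. The example count, batch size, and epoch count come from the list above. The card does not state the number of gradient accumulation steps, so the value `4` is purely hypothetical.

```python
import math

def optimizer_steps(num_examples, batch_size, grad_accum_steps, epochs):
    """Return (effective_batch_size, total_optimizer_steps) for a training run."""
    micro_batches_per_epoch = math.ceil(num_examples / batch_size)
    steps_per_epoch = math.ceil(micro_batches_per_epoch / grad_accum_steps)
    return batch_size * grad_accum_steps, steps_per_epoch * epochs

# 440 examples, batch size 2, and 3 epochs come from this card;
# grad_accum_steps=4 is a hypothetical value used only for illustration.
effective_batch, total_steps = optimizer_steps(440, 2, grad_accum_steps=4, epochs=3)
print(effective_batch, total_steps)  # prints: 8 165
```

Under that assumption the adapter would see only 165 weight updates, which is consistent with a run that fits in under an hour on a T4.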

	## Usage

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	from peft import PeftModel

	# Load base model
	base_model = "deepseek-ai/deepseek-coder-1.3b-instruct"
	model = AutoModelForCausalLM.from_pretrained(base_model, device_map="auto")
	tokenizer = AutoTokenizer.from_pretrained(base_model)

	# Load LoRA adapter
	model = PeftModel.from_pretrained(model, "YOUR_USERNAME/pentest-vulnerability-detector")

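	# Code to analyze: a query assembled by concatenating untrusted input
	code = 'cursor.execute("SELECT * FROM users WHERE id = " + user_id)'
	prompt = f"System: You are a security expert.\n\nUser: Analyze this code:\n{code}\n\nAssistant:"

	inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
	outputs = model.generate(**inputs, max_new_tokens=200)

	# Decode only the newly generated tokens so the prompt is not echoed back
	prompt_length = inputs["input_ids"].shape[1]
	response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
	print(response)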
	```
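
When scanning more than one snippet, it can help to keep the prompt layout in one place. The helper below is a small sketch that reuses the System/User/Assistant layout from the usage example. The function names and the sample snippets are hypothetical and not part of this repository.

```python
PROMPT_TEMPLATE = (
    "System: You are a security expert.\n\n"
    "User: Analyze this code:\n{code}\n\n"
    "Assistant:"
)

def build_prompt(code: str) -> str:
    """Format one code snippet using the same layout as the usage example."""
    return PROMPT_TEMPLATE.format(code=code.strip())

def build_prompts(snippets: dict) -> dict:
    """Map each snippet name to its ready-to-tokenize prompt."""
    return {name: build_prompt(code) for name, code in snippets.items()}

# Hypothetical snippets, used only to show the call pattern.
snippets = {
    "login.py": 'cursor.execute("SELECT * FROM users WHERE id = " + user_id)',
    "ping.py": 'os.system("ping " + host)',
}
prompts = build_prompts(snippets)
print(prompts["ping.py"])
```

Each resulting prompt can be passed to `tokenizer(...)` and `model.generate(...)` exactly as in the usage example.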

	## Inference Script

	For quick command-line use, run the provided inference script:

	```bash
	python inference_deepseek.py --model ./model --code "YOUR_CODE_HERE"
	```
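
To analyze a whole file rather than an inline string, the script can be driven from Python. This is a rough sketch that relies only on the `--model` and `--code` flags shown above. The helper names are hypothetical, and the script's output format is not documented here, so the sketch simply returns whatever the script prints.

```python
import subprocess
import sys
from pathlib import Path

def build_command(source_path, model_dir="./model", script="inference_deepseek.py"):
    """Build the argument list that passes one file's contents to --code."""
    code = Path(source_path).read_text(encoding="utf-8")
    return [sys.executable, script, "--model", model_dir, "--code", code]

def analyze_file(source_path, **kwargs):
    """Run the script on one file and return whatever it prints to stdout."""
    result = subprocess.run(
        build_command(source_path, **kwargs),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems when the code contains quotes or spaces. Very large files may exceed the operating system's command-line length limit, in which case they would need to be split first. A call would look like `analyze_file("app/views.py")`, where the path is a placeholder.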

	## Model Performance

	This card reports no quantitative metrics. For each code sample, the model's analysis is expected to cover:
	- Vulnerability type identification
	- Severity assessment (CRITICAL/HIGH/MEDIUM/LOW)
	- Detailed attack vector analysis
	- Specific remediation recommendations
	- Code-specific security guidance
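
Because the output is free text, downstream tooling usually needs to pull the severity label out of it. The sketch below assumes the label appears somewhere in the response as one of the four uppercase words listed above. The exact response format is not specified in this card, and the sample string is made up rather than real model output.

```python
import re

# Ordered from least to most severe, using the labels listed in this card.
SEVERITY_ORDER = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]
_SEVERITY_RE = re.compile(r"\b(CRITICAL|HIGH|MEDIUM|LOW)\b")

def extract_severity(response: str):
    """Return the highest uppercase severity label in the response, or None."""
    found = set(_SEVERITY_RE.findall(response))
    for level in reversed(SEVERITY_ORDER):
        if level in found:
            return level
    return None

# Made-up response text, used only to show the call pattern.
sample = "Vulnerability: SQL Injection\nSeverity: HIGH\nFix: use parameterized queries."
print(extract_severity(sample))  # prints: HIGH
```

Matching is case-sensitive on purpose, so ordinary prose such as "high risk" is not mistaken for a label.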

	## Limitations

	- Not 100% accurate: expect both false positives and false negatives, and verify every finding manually
	- Best used as a pre-screening tool that complements, rather than replaces, manual security testing
	- Trained on 440 synthetic examples, so it may need further fine-tuning for real-world codebases or specific use cases
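
One way to quantify false positives and false negatives on your own code is to compare the model's flags against a small hand-reviewed set. The following is a minimal sketch of that calculation. The labels below are invented for illustration and are not measured results for this model.

```python
def precision_recall(predicted, actual):
    """Compute precision and recall from parallel lists of booleans.

    predicted[i] is True when the model flagged sample i as vulnerable;
    actual[i] is True when a human reviewer confirmed a vulnerability.
    """
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up labels for five samples, for illustration only (not measured results).
predicted = [True, True, False, True, False]
actual = [True, False, False, True, True]
precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")  # prints: precision=0.67 recall=0.67
```

Low precision means reviewers spend time on false alarms. Low recall means real issues slip through, which is why the model should not be the only check.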

	## Ethical Use

	This model is intended for:
	- Security research
	- Penetration testing (authorized only)
	- Code review and security auditing
	- Educational purposes

	Do not use for:
	- Unauthorized system access
	- Malicious activities
	- Illegal purposes

	## Training Data

	The model was trained on 440 synthetic vulnerability examples covering:
	- 100 SQL Injection patterns
	- 80 XSS patterns
	- 60 Command Injection patterns
	- 50 IDOR patterns
	- 40 SSRF patterns
	- 40 Authentication Bypass patterns
	- 40 CSRF patterns
	- 30 Path Traversal patterns
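
The per-category counts above can be checked against the stated total and turned into shares with a few lines of Python. This sketch uses only the numbers listed in this section.

```python
TRAINING_COUNTS = {
    "SQL Injection": 100,
    "XSS": 80,
    "Command Injection": 60,
    "IDOR": 50,
    "SSRF": 40,
    "Authentication Bypass": 40,
    "CSRF": 40,
    "Path Traversal": 30,
}

total = sum(TRAINING_COUNTS.values())
shares = {name: count / total for name, count in TRAINING_COUNTS.items()}

# Print one row per category: name, count, and share of the training set.
for name, count in TRAINING_COUNTS.items():
    print(f"{name:<24}{count:>4}  {shares[name]:6.1%}")
print(f"{'Total':<24}{total:>4}")
```

The counts sum to 440. SQL Injection makes up roughly 23% of the examples and Path Traversal under 7%, so the less-represented categories may be detected less reliably.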

	## Citation

	If you use this model, please cite:

	```bibtex
	@misc{pentest-vulnerability-detector,
	author = {YOUR_NAME},
	title = {Pentest Vulnerability Detector},
	year = {2025},
	publisher = {Hugging Face},
	howpublished = {\url{https://huggingface.co/YOUR_USERNAME/pentest-vulnerability-detector}}
	}
	```

	## License

	This model adapter is released under the Apache 2.0 License.

	The base model (DeepSeek-Coder-1.3B-Instruct) is distributed under its own license terms, which still apply when the adapter is used with it.

	### Apache 2.0 License Summary
	- ✅ Commercial use allowed
	- ✅ Modification allowed
	- ✅ Distribution allowed
	- ✅ Patent use allowed
	- ⚠️ Must include license and copyright notice
	- ⚠️ Must state changes made

	See LICENSE file for full terms.

	## Contact

	For questions or issues, please open an issue on the model repository.

	## Acknowledgments

	- Base model: DeepSeek-Coder by DeepSeek AI
	- Training framework: Hugging Face Transformers, PEFT
	- Training platform: Google Colab