Drug Discovery Model - ChemBERTa Finetuned
A ChemBERTa model finetuned to predict drug approval success from a molecule's SMILES string.
Model Description
This model is a finetuned version of seyonec/ChemBERTa-zinc-base-v1 for drug success prediction. Given a molecule's SMILES string, it classifies the molecule as likely to be approved (label 1) or likely to fail or be withdrawn (label 0).
Training Details
- Base Model: ChemBERTa-zinc-base-v1 (~85M parameters)
- Training: Full finetuning (not LoRA)
- Hardware: NVIDIA RTX 3050 6GB
- Optimization: Gradient checkpointing + FP16 mixed precision
- Dataset: ChEMBL, DrugBank, FDA drugs (approved + withdrawn)
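
To give a rough sense of why full finetuning of an ~85M-parameter model can fit on a 6GB card, the sketch below estimates the memory taken by weights, gradients, and optimizer state. The optimizer (AdamW) and the FP32 storage of weights, gradients, and moments are assumptions, not details from this card, and activation memory (which gradient checkpointing reduces) is left out.

```python
# Rough training-memory estimate for full finetuning (activations excluded).
# Assumptions (not stated in the model card): AdamW optimizer, FP32 master
# weights, FP32 gradients, and two FP32 optimizer moments per parameter.
NUM_PARAMS = 85_000_000  # "~85M parameters" from the list above

BYTES_PER_PARAM = {
    "weights_fp32": 4,
    "gradients_fp32": 4,
    "adam_first_moment": 4,
    "adam_second_moment": 4,
}


def static_training_memory_gib(num_params: int) -> float:
    """Return weight + gradient + optimizer-state memory in GiB."""
    total_bytes = num_params * sum(BYTES_PER_PARAM.values())
    return total_bytes / 1024**3


if __name__ == "__main__":
    gib = static_training_memory_gib(NUM_PARAMS)
    print(f"Static training state: ~{gib:.2f} GiB")
```

Under these assumptions the static state is well under the card's capacity, so the remaining budget goes to activations, which depend on batch size and sequence length.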
Performance
| Metric | Score |
|---|---|
| Accuracy | 69.6% |
| F1 Score | 48.8% |
| ROC-AUC | 70.8% |
| PR-AUC | 65.1% |
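
For readers who want to see how metrics of this kind are defined, here is a minimal pure-Python sketch of accuracy, F1, and ROC-AUC for binary labels. It assumes F1 is computed for the positive class (label 1) at a 0.5 threshold, which this card does not state, and the toy labels and scores are illustrative only, not evaluation data from this model.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy values for illustration only; NOT results from this model.
    y_true = [1, 0, 1, 0]
    scores = [0.9, 0.4, 0.3, 0.6]
    y_pred = [int(s > 0.5) for s in scores]
    print(accuracy(y_true, y_pred), f1_score(y_true, y_pred), roc_auc(y_true, scores))
```

Because F1 counts only positive-class hits and misses while accuracy also credits correct negatives, F1 can sit well below accuracy when many approved drugs are missed.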
Comparison with Pretrained Baseline
| Model | Accuracy | ROC-AUC |
|---|---|---|
| Pretrained (no finetuning) | 44.2% | 42.9% |
| Finetuned (this model) | 69.6% | 70.8% |
Usage
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/drug-discovery-chemberta")
model = AutoModelForSequenceClassification.from_pretrained("YOUR_USERNAME/drug-discovery-chemberta")
# Predict for a molecule (Aspirin)
smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"
inputs = tokenizer(smiles, return_tensors="pt", padding=True, truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)
probs = torch.softmax(outputs.logits, dim=-1)
prediction = "Approved" if probs[0][1] > 0.5 else "Failed"
confidence = probs[0][1].item() if prediction == "Approved" else probs[0][0].item()
print(f"Prediction: {prediction} (confidence: {confidence:.2%})")
Labels
- 0: Failed/Withdrawn drug
- 1: Approved drug
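
The mapping from raw logits to one of these labels can be sketched without any dependencies. The helper below mirrors the softmax-and-threshold logic in the Usage section. The names `ID2LABEL`, `softmax`, and `decide` are hypothetical helpers for illustration, not part of the model's API, and the adjustable `threshold` parameter is an assumption added here.

```python
import math
from typing import List, Tuple

# Label ids as listed above; the short names follow the Usage example.
ID2LABEL = {0: "Failed", 1: "Approved"}


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decide(logits: List[float], threshold: float = 0.5) -> Tuple[str, float]:
    """Return (label, probability of that label) for two-class logits."""
    probs = softmax(logits)
    label_id = 1 if probs[1] > threshold else 0
    return ID2LABEL[label_id], probs[label_id]


if __name__ == "__main__":
    # Made-up logits for illustration only; NOT model outputs.
    label, confidence = decide([0.0, 2.0])
    print(f"Prediction: {label} (confidence: {confidence:.2%})")
```

Raising `threshold` above 0.5 trades recall for precision on the approved class, which may suit screening settings where false positives are costly.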
Intended Use
This model is intended for:
- Drug discovery research
- Virtual screening of candidate molecules
- Prioritizing compounds for further development
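
One way the screening and prioritization uses above might look in practice is ranking candidates by predicted approval probability. The sketch below is a minimal illustration: `rank_candidates` and `score_fn` are hypothetical names, and `score_fn` stands in for any callable that returns the model's class-1 probability for a SMILES string.

```python
from typing import Callable, Iterable, List, Tuple


def rank_candidates(
    smiles_list: Iterable[str],
    score_fn: Callable[[str], float],
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Score each SMILES string and return the top_k highest-scoring pairs."""
    scored = [(smiles, score_fn(smiles)) for smiles in smiles_list]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Placeholder scores for illustration only; NOT outputs of this model.
    placeholder_scores = {
        "CCO": 0.2,
        "CC(=O)OC1=CC=CC=C1C(=O)O": 0.7,
        "c1ccccc1": 0.4,
    }
    ranked = rank_candidates(
        placeholder_scores, lambda s: placeholder_scores[s], top_k=2
    )
    for smiles, prob in ranked:
        print(f"{prob:.2f}  {smiles}")
```

Keeping the scorer as a plain callable lets the same ranking code work with batched model inference or cached predictions.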
Limitations
- Trained on small molecule drugs only
- Predictions are probabilistic and should not replace clinical trials
- Performance may vary for novel chemical scaffolds
Citation
If you use this model, please cite:
@misc{drug-discovery-chemberta-2024,
author = {Your Name},
title = {Drug Discovery ChemBERTa: Finetuned for Drug Success Prediction},
year = {2024},
publisher = {Hugging Face},
url = {https://huggingface.co/YOUR_USERNAME/drug-discovery-chemberta}
}
License
MIT License