Drug Discovery Model - ChemBERTa Finetuned

A ChemBERTa model finetuned to predict drug approval success from a molecule's SMILES string.

Model Description

This model is a finetuned version of seyonec/ChemBERTa-zinc-base-v1 for drug success prediction. Given a molecule's SMILES string, it classifies the molecule as likely approved (label 1) or failed/withdrawn (label 0).

Training Details

  • Base Model: ChemBERTa-zinc-base-v1 (~85M parameters)
  • Training: Full finetuning (not LoRA)
  • Hardware: NVIDIA RTX 3050 6GB
  • Optimization: Gradient checkpointing + FP16 mixed precision
  • Dataset: ChEMBL, DrugBank, FDA drugs (approved + withdrawn)
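The memory-saving settings in the list above correspond to keyword arguments of Hugging Face `TrainingArguments`. The sketch below is only an illustration of how they might be combined on a 6GB GPU: `fp16` and `gradient_checkpointing` come from the list, while the batch size, accumulation steps, learning rate, and epoch count are hypothetical placeholders, not the values used to train this model.

```python
# Keyword names are real `transformers.TrainingArguments` parameters.
# Only the first two flags come from the training details above;
# every other value is a hypothetical placeholder.
TRAINING_KWARGS = {
    "fp16": True,                      # FP16 mixed precision (requires a CUDA GPU)
    "gradient_checkpointing": True,    # recompute activations to reduce memory use
    "per_device_train_batch_size": 8,  # hypothetical
    "gradient_accumulation_steps": 4,  # hypothetical
    "learning_rate": 2e-5,             # hypothetical
    "num_train_epochs": 3,             # hypothetical
}


def effective_batch_size(kwargs: dict, num_devices: int = 1) -> int:
    """Number of samples contributing to each optimizer step."""
    return (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_devices
    )


# Usage (not executed here, needs transformers and a GPU):
#   from transformers import TrainingArguments
#   args = TrainingArguments(output_dir="out", **TRAINING_KWARGS)
print(effective_batch_size(TRAINING_KWARGS))
```

Gradient accumulation is a common companion to these two flags on small GPUs, because it keeps the effective batch size up while each forward pass stays small.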

Performance

| Metric   | Score |
|----------|-------|
| Accuracy | 69.6% |
| F1 Score | 48.8% |
| ROC-AUC  | 70.8% |
| PR-AUC   | 65.1% |
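For readers unfamiliar with these metrics, here is a minimal pure-Python sketch of how accuracy, F1, and ROC-AUC are defined for a binary classifier. It assumes label 1 (Approved) is the positive class and a 0.5 decision threshold. The toy labels and scores are invented for illustration and are unrelated to the evaluation set behind the table. PR-AUC is left out for brevity.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, invented purely for illustration.
y_true = [1, 0, 1, 1, 0, 0]
y_score = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7]  # predicted P(label 1)
y_pred = [int(s > 0.5) for s in y_score]  # 0.5 threshold, as in the Usage snippet

print(accuracy(y_true, y_pred), f1_score(y_true, y_pred), roc_auc(y_true, y_score))
```

Accuracy and F1 depend on the chosen threshold, while ROC-AUC is computed from the raw scores, which is one reason the three numbers can differ noticeably for the same model.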

Comparison with Pretrained

| Model                      | Accuracy | ROC-AUC |
|----------------------------|----------|---------|
| Pretrained (no finetuning) | 44.2%    | 42.9%   |
| Finetuned (this model)     | 69.6%    | 70.8%   |

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/drug-discovery-chemberta")
model = AutoModelForSequenceClassification.from_pretrained("YOUR_USERNAME/drug-discovery-chemberta")

# Predict for a molecule (Aspirin)
smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"
inputs = tokenizer(smiles, return_tensors="pt", padding=True, truncation=True, max_length=128)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    prediction = "Approved" if probs[0][1] > 0.5 else "Failed"
    confidence = probs[0][1].item() if prediction == "Approved" else probs[0][0].item()

print(f"Prediction: {prediction} (confidence: {confidence:.2%})")

Labels

  • 0: Failed/Withdrawn drug
  • 1: Approved drug

Intended Use

This model is intended for:

  • Drug discovery research
  • Virtual screening of candidate molecules
  • Prioritizing compounds for further development
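As a rough illustration of the screening and prioritization use cases, candidates can be ranked by their predicted probability of label 1. The sketch below is one possible shape for this, not part of the released model: `rank_candidates` and `dummy_score` are hypothetical names, and `dummy_score` is a stand-in so the example runs without downloading anything. In practice the scoring function would wrap the model call from the Usage section and return `probs[0][1].item()`.

```python
from typing import Callable, Iterable, List, Tuple


def rank_candidates(
    smiles_list: Iterable[str],
    score_fn: Callable[[str], float],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k (SMILES, score) pairs sorted by descending score."""
    scored = [(smi, score_fn(smi)) for smi in smiles_list]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def dummy_score(smiles: str) -> float:
    """Placeholder scorer (string length based); replace with the model's P(label 1)."""
    return min(len(smiles) / 40.0, 1.0)


candidates = [
    "CC(=O)OC1=CC=CC=C1C(=O)O",      # aspirin
    "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",  # caffeine
    "CCO",                           # ethanol
]

ranked = rank_candidates(candidates, dummy_score, top_k=2)
for smi, score in ranked:
    print(f"{score:.2f}  {smi}")
```

Passing the scorer in as a function keeps the ranking logic independent of the model, so it can be tested without a GPU or network access.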

Limitations

  • Trained on small molecule drugs only
  • Predictions are probabilistic and should not replace clinical trials
  • Performance may vary for novel chemical scaffolds

Citation

If you use this model, please cite:

@misc{drug-discovery-chemberta-2024,
  author = {Your Name},
  title = {Drug Discovery ChemBERTa: Finetuned for Drug Success Prediction},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/YOUR_USERNAME/drug-discovery-chemberta}
}

License

MIT License
