InteractBind
A physically grounded, large-scale protein–ligand interaction dataset
for interpretable and interaction-aware binding prediction
Links
- Paper: A Large-Scale Dataset and Benchmark: Do Protein-Ligand Models Learn Binding Sites or Just Binding Likelihood?
- Github: ZhaohanM/InteractBind
Motivation
Most existing protein–ligand binding datasets provide only coarse-grained supervision, such as binary labels or scalar affinity values. While effective for prediction, these signals compress complex molecular interaction processes into a single outcome, limiting interpretability and mechanistic understanding.
InteractBind addresses this limitation by explicitly modelling non-covalent interaction patterns derived from experimentally resolved protein–ligand complexes.
It enables token-level supervision, bridging sequence-based representations with physically meaningful interaction structures.
Dataset Overview
InteractBind is constructed from high-quality experimentally resolved complexes and includes:
- Protein sequences (FASTA and structure-aware sequence)
- Ligand molecular representations (SMILES and SELFIES)
- Binding labels and affinity annotations
- Token-level non-covalent interaction maps
The dataset is designed to support both prediction accuracy and mechanistic interpretability.
Dataset
This repository provides benchmark CSVs with ground-truth residue-level interaction maps for evaluating protein–ligand interaction (PLI) prediction.
| Dataset | Type | Example Use |
|---|---|---|
| InteractBind (affinity) | Binding affinity splits | Evaluate in-domain performance |
| InteractBind-P-25%/28%/31%/33% OOD | Protein OOD splits | Evaluate novel protein generalisation |
Files
The Hugging Face Dataset Viewer is configured to read the CSV subsets under data/:
- `affinity`: the full InteractBind affinity table.
- `p_ood_25`, `p_ood_28`, `p_ood_31`, `p_ood_33`: protein OOD benchmark subsets with `train`, `validation`, and `test` splits.
Each CSV includes seven residue-level binding-site fingerprint columns derived from the interaction maps:
- `Hydrogen bonding_binding_site`
- `Salt Bridges_binding_site`
- `π–π Stacking_binding_site`
- `Cation–π_binding_site`
- `Hydrophobic_binding_site`
- `Van der Waals_binding_site`
- `Overall_binding_site`
Each value is a binary list aligned to the protein FASTA sequence. For example, [0,0,1,0] marks the third residue as a binding-site residue. Negative protein-ligand pairs without contact-map entries are encoded as all-zero fingerprints.
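The fingerprint format can be illustrated with a minimal sketch. It assumes each CSV cell stores the list as a JSON-style string such as `"[0,0,1,0]"`; the helper names `parse_fingerprint` and `binding_residues` and the toy sequence `MKTA` are hypothetical and are not part of the dataset.

```python
import json


def parse_fingerprint(cell: str) -> list[int]:
    """Parse a fingerprint cell such as "[0,0,1,0]" into a list of 0/1 values.

    Assumption: cells are stored as JSON-style list strings.
    """
    values = json.loads(cell)
    if any(v not in (0, 1) for v in values):
        raise ValueError("fingerprint must contain only 0 and 1")
    return values


def binding_residues(sequence: str, fingerprint: list[int]) -> list[tuple[int, str]]:
    """Return (1-based position, residue letter) for every flagged residue."""
    if len(sequence) != len(fingerprint):
        raise ValueError("fingerprint is not aligned to the sequence")
    return [
        (index + 1, residue)
        for index, (residue, bit) in enumerate(zip(sequence, fingerprint))
        if bit
    ]


# Toy example: the third residue is a binding-site residue.
print(binding_residues("MKTA", parse_fingerprint("[0,0,1,0]")))  # [(3, 'T')]

# A negative pair is encoded as all zeros, so no residues are returned.
print(binding_residues("MKTA", parse_fingerprint("[0,0,0,0]")))  # []
```

The length check is worth keeping in real pipelines, since a fingerprint that is not aligned to the FASTA sequence would silently mislabel residues.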
Supported Interaction Types
Structured annotations are provided for major non-covalent interaction categories:
- Hydrogen bonds
- Hydrophobic interactions
- Salt bridges
- π–π stacking
- π–cation interactions
- Van der Waals contacts
Each interaction channel can be used independently or combined for multi-channel supervision.
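One way to combine channels is to stack the per-type fingerprints into a residue-by-channel target matrix. The sketch below assumes the fingerprints have already been parsed into Python lists; `stack_channels` and `union_mask` are hypothetical helper names, and the toy row is invented for illustration. Whether the union of the six channels equals `Overall_binding_site` is not stated here, so the union is treated as a separate quantity.

```python
# Column names of the six interaction-specific fingerprints in each CSV.
CHANNELS = [
    "Hydrogen bonding_binding_site",
    "Salt Bridges_binding_site",
    "π–π Stacking_binding_site",
    "Cation–π_binding_site",
    "Hydrophobic_binding_site",
    "Van der Waals_binding_site",
]


def stack_channels(row: dict[str, list[int]], channels: list[str] = CHANNELS) -> list[list[int]]:
    """Build a target matrix with one row per residue and one column per channel."""
    columns = [row[name] for name in channels]
    length = len(columns[0])
    if any(len(column) != length for column in columns):
        raise ValueError("all channel fingerprints must have the same length")
    return [list(bits) for bits in zip(*columns)]


def union_mask(matrix: list[list[int]]) -> list[int]:
    """Flag a residue if any channel marks it (a simple logical OR)."""
    return [int(any(bits)) for bits in matrix]


# Toy row for a four-residue protein (invented values).
row = {name: [0, 0, 0, 0] for name in CHANNELS}
row["Hydrogen bonding_binding_site"] = [0, 1, 0, 0]
row["Hydrophobic_binding_site"] = [0, 1, 1, 0]

targets = stack_channels(row)
print(targets[1])           # [1, 0, 0, 0, 1, 0]
print(union_mask(targets))  # [0, 1, 1, 0]
```

Passing a shorter `channels` list selects a single channel or a subset, which corresponds to using channels independently.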
Key Features
- Physically grounded supervision: derived from experimentally resolved complexes rather than heuristic attention signals.
- Token-level interaction maps: enables fine-grained modelling of residue–atom interactions.
- Model-agnostic integration: compatible with sequence-based encoders (e.g., ESM, SELFormer, and other protein–ligand models).
- Interpretability support: facilitates binding residue identification and interaction pattern analysis.
- Scalable design: allows large-scale training without requiring full structural modelling during inference.
Research Applications
InteractBind supports a broad range of research directions:
- Protein–ligand binding prediction
- Binding site/pocket localisation
- Interaction-aware representation learning
- Mechanistic hypothesis generation
- Drug discovery and virtual screening
- Explainable AI for molecular modelling
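For binding-site localisation, predicted residues can be scored against a ground-truth fingerprint. The following is a minimal sketch using residue-level precision, recall, and F1; these metrics are an assumption chosen for illustration and are not necessarily the ones used in the paper. The function name `residue_scores` and the example vectors are hypothetical.

```python
def residue_scores(true_bits: list[int], pred_bits: list[int]) -> dict[str, float]:
    """Compare a predicted binary residue mask with a ground-truth fingerprint."""
    if len(true_bits) != len(pred_bits):
        raise ValueError("prediction and fingerprint must have the same length")

    pairs = list(zip(true_bits, pred_bits))
    tp = sum(1 for t, p in pairs if t and p)      # correctly predicted binding residues
    fp = sum(1 for t, p in pairs if not t and p)  # predicted but not in the fingerprint
    fn = sum(1 for t, p in pairs if t and not p)  # in the fingerprint but missed

    # Convention assumed here: undefined ratios (zero denominators) are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = residue_scores([0, 0, 1, 1, 0], [0, 1, 1, 0, 0])
print(scores)  # {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```

Negative pairs have all-zero fingerprints, so recall is undefined for them; how to handle such rows (skip them, or score them separately) is a design choice left to the evaluation protocol.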
Citation
If you use InteractBind in your research, please cite:
@misc{meng2026largescaleinteractbind,
title = {A Large-Scale Dataset and Benchmark: Do Protein-Ligand Models Learn Binding Sites or Just Binding Likelihood?},
author = {Meng, Zhaohan and Bai, Zhen and Yuan, Ke and Ounis, Iadh and Meng, Zaiqiao and Xu, Hao and Loscalzo, Joseph},
year = {2026},
eprint = {2605.24045},
archivePrefix = {arXiv},
primaryClass = {cs.LG},
url = {https://arxiv.org/abs/2605.24045}
}