Buckets:

989 MB
18 files
Updated 4 days ago

InteractBind

A physically grounded, large-scale protein–ligand interaction dataset
for interpretable and interaction-aware binding prediction



Motivation

Most existing protein–ligand binding datasets provide only coarse-grained supervision, such as binary labels or scalar affinity values. While effective for prediction, these signals compress complex molecular interaction processes into a single outcome, limiting interpretability and mechanistic understanding.

InteractBind addresses this limitation by explicitly modelling non-covalent interaction patterns derived from experimentally resolved protein–ligand complexes.

It enables token-level supervision, bridging sequence-based representations with physically meaningful interaction structures.


Dataset Overview

InteractBind is constructed from high-quality experimentally resolved complexes and includes:

  • Protein sequences (FASTA and structure-aware sequence)
  • Ligand molecular representations (SMILES and SELFIES)
  • Binding labels and affinity annotations
  • Token-level non-covalent interaction maps

The dataset is designed to support both prediction accuracy and mechanistic interpretability.


Dataset

This repository provides benchmark CSVs with ground-truth residue-level interaction maps for evaluating protein–ligand interaction (PLI) prediction.

| Dataset | Type | Example Use |
|---|---|---|
| InteractBind (affinity) | Binding affinity splits | Evaluate in-domain performance |
| InteractBind-P-25%/28%/31%/33% OOD | Protein OOD splits | Evaluate novel protein generalisation |

Files

The Hugging Face Dataset Viewer is configured to read the CSV subsets under data/:

  • affinity: the full InteractBind affinity table.
  • p_ood_25, p_ood_28, p_ood_31, p_ood_33: protein OOD benchmark subsets with train, validation, and test splits.
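As a minimal sketch, the subsets listed above could be loaded with the Hugging Face `datasets` library. The repository ID is not stated in this README, so `repo_id` is a placeholder the caller must supply, and `SUBSETS` and `load_subset` are hypothetical helper names introduced only for illustration.

```python
# Subset names as listed in this README. The repository ID is NOT given here,
# so the caller has to pass it in explicitly.
SUBSETS = ["affinity", "p_ood_25", "p_ood_28", "p_ood_31", "p_ood_33"]


def load_subset(repo_id: str, subset: str):
    """Load one InteractBind subset (requires network access and `datasets`)."""
    if subset not in SUBSETS:
        raise ValueError(f"unknown subset {subset!r}; expected one of {SUBSETS}")
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    # The second positional argument selects the configuration (subset) name.
    return load_dataset(repo_id, subset)
```

Calling `load_subset(repo_id, "p_ood_25")` with the real repository ID should return the train, validation, and test splits described above, assuming the viewer configuration names match the subset names.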

Each CSV includes seven residue-level binding-site fingerprint columns derived from the interaction maps:

  • Hydrogen bonding_binding_site
  • Salt Bridges_binding_site
  • π–π Stacking_binding_site
  • Cation–π_binding_site
  • Hydrophobic_binding_site
  • Van der Waals_binding_site
  • Overall_binding_site

Each value is a binary list aligned to the protein FASTA sequence. For example, [0,0,1,0] marks the third residue as a binding-site residue. Negative protein–ligand pairs without contact-map entries are encoded as all-zero fingerprints.
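The alignment described above can be checked with a short sketch. It assumes each CSV cell stores the fingerprint as a JSON-style list string such as `"[0,0,1,0]"`; the helper names and the toy sequence are illustrative and not part of the dataset.

```python
import json


def parse_fingerprint(cell: str, sequence: str) -> list[int]:
    """Parse a binary-list cell and verify it is aligned to the sequence."""
    bits = json.loads(cell)  # assumption: cells are JSON-style list strings
    if len(bits) != len(sequence):
        raise ValueError(
            f"fingerprint length {len(bits)} != sequence length {len(sequence)}"
        )
    if any(b not in (0, 1) for b in bits):
        raise ValueError("fingerprint must contain only 0 and 1")
    return bits


def binding_residues(bits: list[int], sequence: str) -> list[tuple[int, str]]:
    """Return (1-based position, residue letter) for each flagged residue."""
    return [(i + 1, aa) for i, (b, aa) in enumerate(zip(bits, sequence)) if b == 1]


seq = "MKTA"  # toy sequence, not taken from the dataset
bits = parse_fingerprint("[0,0,1,0]", seq)
print(binding_residues(bits, seq))  # [(3, 'T')]
```

An all-zero fingerprint, as used for negative pairs, simply yields an empty list.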

Supported Interaction Types

Structured annotations are provided for major non-covalent interaction categories:

  • Hydrogen bonds
  • Hydrophobic interactions
  • Salt bridges
  • π–π stacking
  • Cation–π interactions
  • Van der Waals contacts

Each interaction channel can be used independently or combined for multi-channel supervision.
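One way to combine the channels for multi-channel supervision is to stack the six per-interaction fingerprints into one multi-hot vector per residue. The sketch below assumes the fingerprints have already been parsed into lists of 0/1; `stack_channels` and `any_contact` are hypothetical helpers. This README does not say whether `Overall_binding_site` equals the union computed here, so treat that as an open question rather than a guarantee.

```python
# Per-interaction column names as listed in this README (Overall is excluded).
CHANNELS = [
    "Hydrogen bonding_binding_site",
    "Salt Bridges_binding_site",
    "π–π Stacking_binding_site",
    "Cation–π_binding_site",
    "Hydrophobic_binding_site",
    "Van der Waals_binding_site",
]


def stack_channels(row: dict[str, list[int]]) -> list[list[int]]:
    """Return one 6-dimensional multi-hot vector per residue."""
    columns = [row[name] for name in CHANNELS]
    if len({len(c) for c in columns}) != 1:
        raise ValueError("all channels must have the same length")
    return [list(per_residue) for per_residue in zip(*columns)]


def any_contact(targets: list[list[int]]) -> list[int]:
    """Per-residue union across channels: 1 if any channel is active."""
    return [int(any(vec)) for vec in targets]


# Toy row with four residues; values are made up for illustration.
row = dict(zip(CHANNELS, [
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 1],
]))
targets = stack_channels(row)
print(targets[2])            # [1, 0, 0, 0, 1, 1]
print(any_contact(targets))  # [0, 1, 1, 1]
```

Using a single channel independently amounts to reading one column of `row` directly.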


Key Features

  • Physically grounded supervision
    Derived from experimentally resolved complexes rather than heuristic attention signals.

  • Token-level interaction maps
    Enable fine-grained modelling of residue–atom interactions.

  • Model-agnostic integration
    Compatible with sequence-based encoders such as ESM and SELFormer, as well as other protein–ligand models.

  • Interpretability support
    Facilitates binding residue identification and interaction pattern analysis.

  • Scalable design
    Allows large-scale training without requiring full structural modelling during inference.
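To illustrate binding residue identification, a model's predicted binding-site fingerprint can be scored against the ground-truth one at the residue level. The following is a minimal sketch using precision, recall, and F1; it is not necessarily the metric used in the accompanying paper, and `residue_prf` is a hypothetical helper name.

```python
def residue_prf(true_bits: list[int], pred_bits: list[int]) -> tuple[float, float, float]:
    """Residue-level precision, recall, and F1 for binary fingerprints."""
    if len(true_bits) != len(pred_bits):
        raise ValueError("fingerprints must have the same length")
    pairs = list(zip(true_bits, pred_bits))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Fall back to 0.0 when a denominator is zero (e.g. all-zero fingerprints).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: one correct residue, one false positive, one missed residue.
print(residue_prf([0, 0, 1, 1, 0], [0, 1, 1, 0, 0]))  # (0.5, 0.5, 0.5)
```

For negative pairs with all-zero ground truth, the zero-denominator fallback returns 0.0, so such pairs may be better reported separately.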


Research Applications

InteractBind supports a broad range of research directions:

  • Protein–ligand binding prediction
  • Binding site/pocket localisation
  • Interaction-aware representation learning
  • Mechanistic hypothesis generation
  • Drug discovery and virtual screening
  • Explainable AI for molecular modelling

Citation

If you use InteractBind in your research, please cite:

@misc{meng2026largescaleinteractbind,
  title = {A Large-Scale Dataset and Benchmark: Do Protein-Ligand Models Learn Binding Sites or Just Binding Likelihood?},
  author = {Meng, Zhaohan and Bai, Zhen and Yuan, Ke and Ounis, Iadh and Meng, Zaiqiao and Xu, Hao and Loscalzo, Joseph},
  year = {2026},
  eprint = {2605.24045},
  archivePrefix = {arXiv},
  primaryClass = {cs.LG},
  url = {https://arxiv.org/abs/2605.24045}
}
Last updated
May 30