Buckets:

Synthyra
/

InteractBind-bucket

989 MB

18 files

Updated 4 days ago

Name	Size	Uploaded	Xet hash
data		4 days ago	13 items
.DS_Store	8.2 kB xet	4 days ago	dea5244a
.gitattributes	3.11 kB xet	4 days ago	0746608f
README.md	4.86 kB xet	4 days ago	6e628f97
split_statistics.csv	582 Bytes xet	4 days ago	4efb7202
split_statistics_table.csv	263 Bytes xet	4 days ago	9e6082e8

README.md

InteractBind

A physically grounded, large-scale protein–ligand interaction dataset
for interpretable and interaction-aware binding prediction

Motivation

Most existing protein–ligand binding datasets provide only coarse-grained supervision, such as binary labels or scalar affinity values. While effective for prediction, these signals compress complex molecular interaction processes into a single outcome, limiting interpretability and mechanistic understanding.

InteractBind addresses this limitation by explicitly modelling non-covalent interaction patterns derived from experimentally resolved protein–ligand complexes.

It enables token-level supervision, bridging sequence-based representations with physically meaningful interaction structures.

Dataset Overview

InteractBind is constructed from high-quality experimentally resolved complexes and includes:

Protein sequences (FASTA and structure-aware sequences)
Ligand molecular representations (SMILES and SELFIES)
Binding labels and affinity annotations
Token-level non-covalent interaction maps

The dataset is designed to support both prediction accuracy and mechanistic interpretability.

Dataset

This repository provides benchmark CSVs with ground-truth residue-level interaction maps for evaluating protein–ligand interaction (PLI) prediction.

Dataset	Type	Example Use
InteractBind (affinity)	Binding affinity splits	Evaluate in-domain performance
InteractBind-P-25%/28%/31%/33% OOD	Protein OOD splits	Evaluate novel protein generalisation

Files

The Hugging Face Dataset Viewer is configured to read the CSV subsets under data/:

affinity: the full InteractBind affinity table.
p_ood_25, p_ood_28, p_ood_31, p_ood_33: protein OOD benchmark subsets with train, validation, and test splits.

Each CSV includes seven residue-level binding-site fingerprint columns derived from the interaction maps:

Hydrogen bonding_binding_site
Salt Bridges_binding_site
π–π Stacking_binding_site
Cation–π_binding_site
Hydrophobic_binding_site
Van der Waals_binding_site
Overall_binding_site

Each value is a binary list aligned to the protein FASTA sequence, with one entry per residue. For example, [0,0,1,0] marks the third residue as a binding-site residue. Negative protein–ligand pairs without contact-map entries are encoded as all-zero fingerprints.
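As a minimal sketch of how such a fingerprint might be read, the snippet below parses one cell into a list of integers, checks that it is aligned to the sequence, and reports the marked residues. It assumes the lists are stored as JSON-style strings such as "[0,0,1,0]", which the README does not confirm. The column name fasta, the sample rows, and the helper names parse_fingerprint and binding_residues are hypothetical; only Overall_binding_site comes from the column list above.

```python
import csv
import io
import json

# Hypothetical two-row sample; the real CSVs live under data/ and their
# sequence column may be named differently ("fasta" is an assumption).
SAMPLE_CSV = '''fasta,Overall_binding_site
MKTA,"[0,0,1,0]"
GAVL,"[0,0,0,0]"
'''


def parse_fingerprint(cell, sequence):
    """Parse a fingerprint cell and check it has one 0/1 entry per residue."""
    mask = json.loads(cell)  # assumes a JSON-style list literal
    if len(mask) != len(sequence):
        raise ValueError("fingerprint length does not match sequence length")
    if any(value not in (0, 1) for value in mask):
        raise ValueError("fingerprint must contain only 0 and 1")
    return mask


def binding_residues(mask, sequence):
    """Return (1-based position, residue letter) for every marked residue."""
    return [(i + 1, sequence[i]) for i, value in enumerate(mask) if value == 1]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for row in rows:
    mask = parse_fingerprint(row["Overall_binding_site"], row["fasta"])
    print(row["fasta"], binding_residues(mask, row["fasta"]))
```

The second sample row illustrates the all-zero encoding described for negative pairs: it parses normally and yields no binding residues.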

Supported Interaction Types

Structured annotations are provided for major non-covalent interaction categories:

Hydrogen bonds
Hydrophobic interactions
Salt bridges
π–π stacking
Cation–π interactions
Van der Waals contacts

Each interaction channel can be used independently or combined for multi-channel supervision.
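One way to combine the channels, sketched under assumptions, is to stack the six per-interaction fingerprints into one label vector per residue and, if a single mask is needed, take their element-wise union. The column names are taken from the list in the Files section. The helper names stack_channels and union_mask and the example row are hypothetical, and the row is assumed to hold already-parsed lists. The README does not state whether Overall_binding_site equals this union, so that should be checked against the data rather than assumed.

```python
# Six per-interaction columns, as named in the Files section.
CHANNELS = [
    "Hydrogen bonding_binding_site",
    "Salt Bridges_binding_site",
    "π–π Stacking_binding_site",
    "Cation–π_binding_site",
    "Hydrophobic_binding_site",
    "Van der Waals_binding_site",
]


def stack_channels(row):
    """Return one label vector of len(CHANNELS) per residue."""
    masks = [row[name] for name in CHANNELS]
    length = len(masks[0])
    if any(len(mask) != length for mask in masks):
        raise ValueError("all channels must have the same length")
    return [list(per_residue) for per_residue in zip(*masks)]


def union_mask(row):
    """Mark a residue if any of the six channels marks it."""
    return [int(any(labels)) for labels in stack_channels(row)]


# Hypothetical parsed row for a four-residue protein.
example_row = {
    "Hydrogen bonding_binding_site": [0, 1, 0, 0],
    "Salt Bridges_binding_site": [0, 0, 0, 0],
    "π–π Stacking_binding_site": [0, 0, 0, 0],
    "Cation–π_binding_site": [0, 0, 0, 0],
    "Hydrophobic_binding_site": [0, 1, 1, 0],
    "Van der Waals_binding_site": [0, 0, 1, 0],
}

print(stack_channels(example_row))
print(union_mask(example_row))
```

The stacked form suits multi-label training targets (one output per channel per residue), while a single channel can be used on its own by indexing the row directly.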

Key Features

Physically grounded supervision: derived from experimentally resolved complexes rather than heuristic attention signals.
Token-level interaction maps: enable fine-grained modelling of residue–atom interactions.
Model-agnostic integration: compatible with sequence-based encoders (e.g., ESM and SELFormer) and other protein–ligand models.
Interpretability support: facilitates binding residue identification and interaction pattern analysis.
Scalable design: allows large-scale training without requiring full structural modelling during inference.
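To illustrate how binding residue identification could be scored against the fingerprints, the sketch below computes per-protein precision, recall, and F1 between a ground-truth mask and a predicted mask. This is a generic metric, not necessarily the protocol used in the InteractBind benchmark. The function name residue_prf and the example masks are hypothetical, and returning 0.0 when a denominator is zero (as happens for all-zero negative pairs) is a convention chosen here.

```python
def residue_prf(true_mask, pred_mask):
    """Return (precision, recall, F1) for two aligned binary masks."""
    if len(true_mask) != len(pred_mask):
        raise ValueError("masks must have the same length")
    pairs = list(zip(true_mask, pred_mask))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Zero denominators map to 0.0 here; this is an assumed convention.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical masks for a five-residue protein.
true_mask = [0, 0, 1, 1, 0]
pred_mask = [0, 1, 1, 0, 0]
print(residue_prf(true_mask, pred_mask))
```

Because negative pairs have all-zero ground truth, recall is undefined for them; an evaluation may prefer to report those pairs separately instead of averaging in the 0.0 placeholder.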

Research Applications

InteractBind supports a broad range of research directions:

Protein–ligand binding prediction
Binding site/pocket localisation
Interaction-aware representation learning
Mechanistic hypothesis generation
Drug discovery and virtual screening
Explainable AI for molecular modelling

Citation

If you use InteractBind in your research, please cite:

@misc{meng2026largescaleinteractbind,
  title = {A Large-Scale Dataset and Benchmark: Do Protein-Ligand Models Learn Binding Sites or Just Binding Likelihood?},
  author = {Meng, Zhaohan and Bai, Zhen and Yuan, Ke and Ounis, Iadh and Meng, Zaiqiao and Xu, Hao and Loscalzo, Joseph},
  year = {2026},
  eprint = {2605.24045},
  archivePrefix = {arXiv},
  primaryClass = {cs.LG},
  url = {https://arxiv.org/abs/2605.24045}
}

Last updated: May 30
