BioMatrix-4B-Base

BioMatrix is a multimodal biological foundation model that natively integrates 1D sequences, 3D structures, and natural language for both molecules and proteins within a single decoder-only architecture.

This is the 4B-parameter Base model, obtained via multimodal continual pretraining of Qwen3-4B-Base on 304.4 billion tokens spanning text, molecular and protein 1D/3D data, and cross-modal corpora. This base checkpoint is intended for further fine-tuning on downstream tasks. For an instruction-tuned model ready for inference, see BioMatrix-4B-SFT.

📄 Paper: BioMatrix: Towards a Comprehensive Biological Foundation Model Spanning the Modality Matrix of Sequences, Structures, and Language
💻 Code: https://github.com/QizhiPei/BioMatrix
🤗 Model & Data Collection: https://huggingface.co/collections/QizhiPei/biomatrix

[Figure: BioMatrix architecture]

Model Overview

BioMatrix maps all biological modalities into a shared discrete token space via a unified tokenization scheme:

- Molecular 1D sequences (both SMILES and SELFIES notations)
- Molecular 3D structures (via MolStrucTok with branch-decoupled decoder)
- Protein 1D sequences (residue-level tokens)
- Protein 3D structures (via GCP-VQVAE backbone tokenizer)
- Natural language (inherited from Qwen3 tokenizer)

All modalities are consumed and produced uniformly under a single next-token prediction objective, with no external encoders, projection adapters, or modality-specific output heads.

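| Model | Molecule 1D | Molecule 3D | Protein 1D | Protein 3D | Natural Language |
| --- | --- | --- | --- | --- | --- |
| ESM3 | ✗ | ✗ | ✓ | ✓ | ✓ |
| 3D-MoLM | ✓ | ✓ | ✗ | ✗ | ✓ |
| AlphaFold3 | ✓ | ✓ | ✓ | ✓ | ✗ |
| BioT5/BioT5+ | ✓ | ✗ | ✓ | ✗ | ✓ |
| BioMedGPT | ✓ | ✗ | ✓ | ✗ | ✓ |
| BioMatrix | ✓ | ✓ | ✓ | ✓ | ✓ |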

[Figure: Molecule and protein structure tokenizers]

Model Details

- Base Architecture: Qwen3-4B-Base
- Parameters: 4B
- Training Stage: Multimodal Continual Pretraining only (not instruction-tuned)
- Training Tokens: 304.4B
- Context Length: 8,192 tokens
- Tokenizer: Extended Qwen3 vocabulary with:
  - 11,294 joint molecular 3D tokens (composed from SELFIES atom × MolStrucTok codes)
  - 4,096 protein 3D tokens (GCP-VQVAE codebook)
  - 26 protein 1D tokens (amino acids + non-standard/unknown)
  - SELFIES atom tokens and modality-specific control tokens

Embedding Initialization

New vocabulary entries are initialized via a description-based scheme: each new token is grounded in the pretrained Qwen3 embedding space by averaging the embeddings of the subword tokens of a short natural-language description (e.g., <A_W> → "Tryptophan"), plus a small isotropic Gaussian perturbation to break symmetry. This provides a more stable starting point than random initialization.
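
The scheme described above can be sketched on a toy embedding table. This is a minimal illustration, not the authors' implementation: the function name `init_from_description`, the toy table, the token ids, and the noise scale `noise_std=0.01` are all assumptions made for the example (the text only says the perturbation is "small").

```python
import random


def init_from_description(description_token_ids, embedding_table, noise_std=0.01, seed=0):
    """Average the embeddings of a description's subword tokens, then add
    isotropic Gaussian noise so that tokens sharing a description differ."""
    rng = random.Random(seed)
    dim = len(embedding_table[0])
    rows = [embedding_table[i] for i in description_token_ids]
    mean = [sum(row[d] for row in rows) / len(rows) for d in range(dim)]
    return [m + rng.gauss(0.0, noise_std) for m in mean]


# Toy "pretrained" embedding table: 4 subword tokens, dimension 3 (hypothetical values).
pretrained = [
    [0.0, 0.0, 0.0],
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [5.0, 5.0, 5.0],
]

# Assume the description "Tryptophan" tokenizes to subword ids [1, 2].
new_row = init_from_description([1, 2], pretrained, noise_std=0.01)
print(new_row)  # close to [2.0, 2.0, 2.0], but not exactly equal
```

In a real setup the averaged rows would be written into the resized embedding matrix of the model at the indices of the newly added tokens.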

Pretraining Corpus (304.4B tokens)

| Category | Tokens | Sources |
| --- | --- | --- |
| Text (105.3B) | General: 25.6B | FineWeb-Edu |
| | Scientific: 79.7B | FineFineWeb (biology/chemistry/medical/health), PubMed Full Articles |
| Molecule (73.7B) | 1D: 36.0B | PubChem, MolTextNet |
| | 3D: 17.6B | PubChem, PCQM4Mv2, PubChemQC |
| | Other: 24.0B | (text descriptions, properties, IUPAC names) |
| Protein (77.4B) | 1D: 17.1B | UniRef50 |
| | 3D: 38.5B | RCSB PDB, AlphaFold DB |
| | Other: 19.5B | Swiss-Prot, TrEMBL annotations |
| | Other (additional): 2.9B | |
| Cross-entity (48.0B) | Interleaved Text: 17.1B | PubMed, bioRxiv, S2ORC, USPTO |
| | 3D: 11.4B | CrossDocked, PPIRef |
| | Other: 19.5B | BindingDB, STITCH, jglaser, AlphaSeq |

[Figure: BioMatrix continual pretraining data]
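
As a quick worked example, the four category totals from the table can be turned into mixture proportions. The snippet below only uses the category-level figures stated in the table; the variable names are illustrative.

```python
# Category totals in billions of tokens, copied from the corpus table.
category_tokens_b = {
    "Text": 105.3,
    "Molecule": 73.7,
    "Protein": 77.4,
    "Cross-entity": 48.0,
}

total_b = sum(category_tokens_b.values())
shares = {name: tokens / total_b for name, tokens in category_tokens_b.items()}

print(f"total: {total_b:.1f}B")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The category totals add up to the stated 304.4B tokens, with text making up roughly a third of the mixture.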

Training Configuration

- Framework: LLaMA-Factory
- Hardware: 64 NVIDIA H100 GPUs
- Global Batch Size: 1,024
- Maximum Sequence Length: 8,192 tokens
- Optimizer: AdamW
- Peak Learning Rate: 2.0 × 10⁻⁴ (cosine schedule)
- Warmup Steps: 2,000
- Total Steps: ~36.4K (1 epoch over the full 304.4B-token corpus)
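
The step count and learning-rate schedule implied by these settings can be sketched as follows. This is a hedged reconstruction rather than the actual training code: it assumes every sequence is packed to the full 8,192 tokens, a linear warmup, and cosine decay to a minimum learning rate of zero, none of which is stated above. The function `learning_rate` and its `min_lr` parameter are hypothetical.

```python
import math

GLOBAL_BATCH_SIZE = 1024   # sequences per optimizer step
MAX_SEQ_LEN = 8192         # tokens per sequence
CORPUS_TOKENS = 304.4e9    # total pretraining tokens
PEAK_LR = 2.0e-4
WARMUP_STEPS = 2000
TOTAL_STEPS = 36_400       # "~36.4K" as reported

# Tokens consumed per step if every sequence is filled to the maximum length.
tokens_per_step = GLOBAL_BATCH_SIZE * MAX_SEQ_LEN
estimated_steps = CORPUS_TOKENS / tokens_per_step


def learning_rate(step, min_lr=0.0):
    """Linear warmup to PEAK_LR, then cosine decay to min_lr (both assumed)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1.0 + math.cos(math.pi * progress))


print(f"tokens per step: {tokens_per_step:,}")
print(f"estimated steps: {estimated_steps:,.0f}")
for step in (0, 1000, 2000, 19_200, 36_400):
    print(step, f"{learning_rate(step):.2e}")
```

Under the full-packing assumption, one pass over the corpus takes about 36.3K steps at roughly 8.4M tokens per step, which is in line with the reported total.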

Intended Use

This Base model is not instruction-tuned. It is suitable for:

- Further fine-tuning on custom biological tasks
- Continued pretraining on domain-specific corpora
- Research on representation learning across biomolecular modalities
- Embedding extraction for downstream classification/regression tasks
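
For the embedding-extraction use case, one common approach (an assumption here, not a method confirmed by the model card) is to mean-pool the final-layer hidden states over non-padding positions. The framework-free sketch below shows the pooling step on plain lists; with Transformers, the per-token vectors would typically come from a forward pass with `output_hidden_states=True`. The function name `masked_mean_pool` is hypothetical.

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors over the positions where attention_mask is 1."""
    dim = len(hidden_states[0])
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy example: 4 token positions, hidden size 2; the last position is padding.
hidden = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
print(masked_mean_pool(hidden, mask))  # [2.0, 2.0]
```

The pooled vector can then be fed to a lightweight classifier or regressor; for a decoder-only model, taking the last non-padding token's vector is another option worth comparing.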

For ready-to-use instruction-following capabilities (e.g., molecule captioning, protein design, property prediction), please use the SFT variant.

The BioMatrix SFT variants cover downstream molecule and protein task families such as generation, prediction, captioning, editing, folding, inverse folding, and structure-aware question answering:

[Figure: BioMatrix molecule tasks]

[Figure: BioMatrix protein tasks]

Quick Start

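from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "QizhiPei/BioMatrix-4B-Base"

# Load the tokenizer with the extended biomolecular vocabulary.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# Load the weights; device_map="auto" requires the accelerate package.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# Example: continue a SMILES sequence. The prompt opens a SMILES span and
# leaves it unclosed so the model completes the molecule.
prompt = "<|mol_smi_start|>CC(=O)"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Keep special tokens so the modality control tokens stay visible.
print(tokenizer.decode(outputs[0], skip_special_tokens=False))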
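
Because a base model may keep generating after it closes the SMILES span, it can help to cut the decoded text at the first end token. The helper below is a small sketch of that post-processing step; `extract_first_span` is a hypothetical name, the sample string is made up for illustration (not real model output), and `<|mol_smi_end|>` is the closing token listed in the Modality Wrapping table.

```python
def extract_first_span(text, start_token, end_token):
    """Return the content between the first start_token and the next end_token.

    If the end token never appears (for example, generation stopped at
    max_new_tokens), return everything after the start token.
    Return None if the start token is missing.
    """
    start = text.find(start_token)
    if start == -1:
        return None
    start += len(start_token)
    end = text.find(end_token, start)
    return text[start:] if end == -1 else text[start:end]


# Hypothetical decoded output, for illustration only.
decoded = "<|mol_smi_start|>CC(=O)O<|mol_smi_end|> some trailing text"
print(extract_first_span(decoded, "<|mol_smi_start|>", "<|mol_smi_end|>"))  # CC(=O)O
```

The extracted string is not guaranteed to be a valid molecule, so validating it with a cheminformatics toolkit before further use is advisable.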

Modality Wrapping

When constructing inputs, biomolecular content must be wrapped with the corresponding control tokens:

| Modality | Wrapping Example |
| --- | --- |
| Molecule SMILES | `<\|mol_smi_start\|>CC#CC#N<\|mol_smi_end\|>` |
| Molecule SELFIES | `<\|mol_sfi_start\|>[C][#C][C][#N]<\|mol_sfi_end\|>` |
| Molecule 3D | `<\|mol_3d_start\|>[H 3][C 0][#C 6]...<\|mol_3d_end\|>` |
| Protein 1D | `<\|prot_aa_start\|><A M><A R><A A>...<\|prot_aa_end\|>` |
| Protein 3D | `<\|prot_3d_start\|><S 4012><S 153><S 2091>...<\|prot_3d_end\|>` |

Natural language text is left unwrapped and serves as the default carrier modality.
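
The wrapping rules can be captured in a small helper so that prompts are built consistently. This is a sketch, not part of the BioMatrix codebase: the dictionary keys (`mol_smiles`, `prot_1d`, and so on) and the function names are hypothetical, and the residue spelling `<A X>` simply follows the Protein 1D example in the table, so it should be checked against the tokenizer's vocabulary before use.

```python
# Control tokens copied from the wrapping table; the keys are illustrative names.
CONTROL_TOKENS = {
    "mol_smiles": ("<|mol_smi_start|>", "<|mol_smi_end|>"),
    "mol_selfies": ("<|mol_sfi_start|>", "<|mol_sfi_end|>"),
    "mol_3d": ("<|mol_3d_start|>", "<|mol_3d_end|>"),
    "prot_1d": ("<|prot_aa_start|>", "<|prot_aa_end|>"),
    "prot_3d": ("<|prot_3d_start|>", "<|prot_3d_end|>"),
}


def wrap(modality, content):
    """Surround biomolecular content with its start and end control tokens."""
    start, end = CONTROL_TOKENS[modality]
    return f"{start}{content}{end}"


def protein_residue_tokens(sequence):
    """Spell each residue as '<A X>', following the table's Protein 1D example."""
    return "".join(f"<A {aa}>" for aa in sequence.upper())


# Natural language stays unwrapped; only the biomolecular spans get control tokens.
prompt = (
    "The molecule "
    + wrap("mol_smiles", "CC#CC#N")
    + " and the protein "
    + wrap("prot_1d", protein_residue_tokens("MRA"))
)
print(prompt)
```

Keeping the token pairs in one table makes it harder to mismatch a start token with the wrong end token when mixing modalities in a single prompt.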

Limitations

- This model is not instruction-tuned and is unlikely to follow natural-language instructions out-of-the-box. Use the SFT variant for instruction-following.
- Molecular and protein 3D structures are tokenized in disjoint geometric reference frames, so the model cannot natively represent biomolecular complexes (e.g., docking poses).
- Heavy domain specialization may erode some general-purpose language capabilities of the underlying Qwen3 backbone.
- Coverage is limited to small molecules and proteins; nucleic acids, carbohydrates, and lipids are not currently supported.

Citation

If you find BioMatrix useful, please cite:

@article{pei2026biomatrix,
  title={BioMatrix: Towards a Comprehensive Biological Foundation Model Spanning the Modality Matrix of Sequences, Structures, and Language},
  author={Pei, Qizhi and Zhou, Zhimeng and Duan, Yi and Zhao, Yiyang and Li, Wei and Guo, Han and He, Liang and Li, Chengping and Hsieh, Chang-Yu and He, Conghui and Yan, Rui and Wu, Lijun},
  journal={arXiv preprint arXiv:2606.22138},
  year={2026}
}