Model Summary
DeepSTARR‑Mouse is a Convolutional Neural Network (CNN) adapted from the previously published DeepSTARR architecture (Nature Genetics, 2022).
This model is designed for use in a transfer‑learning framework to predict enhancer activity in E11.5 mouse embryos.
For each tissue, CNNs are pre‑trained on DNA accessibility data (i.e., ATAC‑seq) and fine‑tuned on a limited set of experimentally validated enhancers (VISTA enhancer browser, https://enhancer.lbl.gov/vista/).
Model Architecture
The DeepSTARR‑Mouse model is a custom TensorFlow 2 (Keras) implementation consisting of convolutional, pooling, and dense layers optimized to extract predictive features from 1,001‑bp DNA sequences.
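As a rough illustration of the input format, the sketch below one-hot encodes a DNA sequence into a 1,001 × 4 matrix. The A/C/G/T channel order, the all-zero rows for ambiguous bases such as N, and the helper name `one_hot_encode` are assumptions made for this example. The preprocessing scripts in the GitHub repository remain the authoritative reference.

```python
SEQ_LEN = 1001  # input length stated in this model card
BASES = "ACGT"  # assumed channel order; check the repository's preprocessing


def one_hot_encode(seq, seq_len=SEQ_LEN):
    """Return a seq_len x 4 nested list of 0.0/1.0 values for a DNA sequence.

    Characters outside A/C/G/T (for example N) become all-zero rows.
    """
    seq = seq.upper()
    if len(seq) != seq_len:
        raise ValueError(f"expected {seq_len} bp, got {len(seq)} bp")
    index = {base: i for i, base in enumerate(BASES)}
    encoded = []
    for base in seq:
        row = [0.0] * len(BASES)
        if base in index:
            row[index[base]] = 1.0
        encoded.append(row)
    return encoded


# Toy sequence: five mixed bases followed by a run of A.
example = one_hot_encode("ACGTN" + "A" * (SEQ_LEN - 5))
print(len(example), len(example[0]))  # 1001 4
```

Wrapping the result in `numpy.array(...)[None, ...]` would give a batch of shape (1, 1001, 4), the layout that Keras 1D convolution layers usually expect.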
The code and architecture definitions are available on GitHub:
➡️ https://github.com/Shenzhi‑Chen/DeepSTARR2‑Mouse
Model Weights
Sequence‑to‑accessibility and sequence‑to‑activity model weights are stored separately in two folders.
Each folder contains three tissue‑specific subfolders: heart, limb, and midbrain (CNS).
Within each tissue folder there are six models, corresponding to three cross‑validation folds and two replicates.
Each model is stored as two files: Model.json (architecture) and Model.h5 (trained parameters).
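A minimal sketch of how such a JSON/HDF5 pair could be loaded with the standard Keras API is shown below. It assumes the architecture uses only built-in Keras layers and that the weights have already been downloaded locally. The helper names `find_model_pairs` and `load_keras_model` are hypothetical, and the search deliberately walks the directory tree rather than assuming specific folder names.

```python
from pathlib import Path


def find_model_pairs(root):
    """Return sorted (Model.json, Model.h5) path pairs found anywhere under root."""
    pairs = []
    for json_path in sorted(Path(root).rglob("Model.json")):
        h5_path = json_path.with_name("Model.h5")
        if h5_path.exists():  # skip folders that lack the weights file
            pairs.append((json_path, h5_path))
    return pairs


def load_keras_model(json_path, h5_path):
    """Rebuild the architecture from JSON, then load the trained weights."""
    # Imported lazily so that listing files does not require TensorFlow.
    from tensorflow.keras.models import model_from_json

    model = model_from_json(Path(json_path).read_text())
    model.load_weights(str(h5_path))
    return model
```

Calling `find_model_pairs` on one tissue folder should return six pairs if the layout matches the description above (three folds × two replicates), and each pair can then be passed to `load_keras_model`.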
Training Objectives and Evaluation Metrics
Accessibility Model (Regression)
- Evaluation Metric: Pearson Correlation Coefficient (Pearson r, reported as PCC)
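For reference, the Pearson correlation coefficient can be computed as in the sketch below. This is a plain-Python illustration of the metric's definition, not the evaluation script used for the reported numbers; the function name `pearson_r` and the toy values are made up for the example.

```python
import math


def pearson_r(predicted, observed):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(predicted) != len(observed):
        raise ValueError("inputs must have the same length")
    n = len(predicted)
    if n < 2:
        raise ValueError("need at least two values")
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    # Sums of centered cross-products and squares; the 1/n factors cancel.
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_o)


# Toy values, not real model output.
print(round(pearson_r([0.1, 0.4, 0.35, 0.8], [0.0, 0.5, 0.3, 0.9]), 3))
```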
Activity Model (Classification)
- Evaluation Metric: Positive Predictive Value (PPV)
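PPV is the fraction of predicted positives that are true positives, TP / (TP + FP). The sketch below illustrates the definition in plain Python. The 0.5 decision threshold is an assumption for the example only, since this card does not state which threshold was used, and the function name and toy data are hypothetical.

```python
def positive_predictive_value(y_true, y_score, threshold=0.5):
    """PPV = TP / (TP + FP) for binary labels and predicted scores."""
    if len(y_true) != len(y_score):
        raise ValueError("inputs must have the same length")
    tp = fp = 0
    for label, score in zip(y_true, y_score):
        if score >= threshold:  # predicted positive
            if label == 1:
                tp += 1
            else:
                fp += 1
    if tp + fp == 0:
        raise ValueError("no positive predictions at this threshold")
    return tp / (tp + fp)


# Toy data: three predicted positives, of which two are true enhancers.
labels = [1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.2, 0.1]
ppv = positive_predictive_value(labels, scores)
print(f"PPV = {ppv:.1%}")
```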
Model Performance
Accessibility Model (Regression)
PCC values were computed between predicted and observed accessibility profiles on the held‑out test chromosome (chromosome 18).
| Tissue | PCC |
|---|---|
| Heart | 0.76 |
| Limb | 0.76 |
| Midbrain | 0.78 |
Activity Model (Classification)
PPV values were computed on the held‑out test set using independent folds.
| Tissue | PPV (%) |
|---|---|
| Heart | 71.5 |
| Limb | 70.6 |
| Midbrain (CNS) | 80.2 |
All reported values are means across all cross‑validation folds and replicate models.
Dataset
Training and evaluation data are available in the companion Hugging Face dataset repository:
👉 Shenzhi‑Chen/DeepSTARR_Mouse_training_dataset
Framework
Implemented in TensorFlow 2 (v2.4.1) / Keras.
All scripts for training and using the models are hosted in the GitHub repository linked above.
Intended Use
This model is released for research and educational purposes.
It can be applied to predict enhancer activity or chromatin accessibility from DNA sequences in mammalian genomes, particularly for E11.5 mouse embryonic tissues.
Limitations
- Trained only on mouse embryonic enhancers (E11.5).
- Generalization beyond this developmental stage or to other species may be limited.
- Not validated for clinical or diagnostic applications.
License
- Model and Code: Apache License 2.0
- Dataset: Creative Commons Attribution 4.0 International (CC BY 4.0)
Citation
If you use this model, please cite:
[Paper citation once published]