Model Summary
DeepSTARR‑Mouse is a Convolutional Neural Network (CNN) adapted from the previously published DeepSTARR architecture (Nature Genetics, 2022). This model is designed for use in a transfer‑learning framework to predict enhancer activity in E11.5 mouse embryos. For each tissue, CNNs are pre‑trained on DNA accessibility data (i.e., ATAC‑seq) and fine‑tuned on a limited set of experimentally validated enhancers (VISTA enhancer browser, https://enhancer.lbl.gov/vista/).


Model Architecture
The DeepSTARR‑Mouse model is a custom TensorFlow 2 (Keras) implementation consisting of convolutional, pooling, and dense layers optimized to extract predictive features from 1,001‑bp DNA sequences.
The code and architecture definitions are available on GitHub:
➡️ https://github.com/Shenzhi‑Chen/DeepSTARR2‑Mouse
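
As a rough illustration of the input format, the sketch below one-hot encodes a DNA sequence of the stated 1,001-bp length. The A, C, G, T channel order and the all-zero encoding of ambiguous bases are assumptions and not taken from this model card, so check the preprocessing code in the repository before relying on them. The function name `one_hot_encode` is hypothetical.

```python
# Assumed channel order; verify against the repository's preprocessing code.
ALPHABET = "ACGT"
SEQ_LEN = 1001  # input length stated in this model card


def one_hot_encode(seq, seq_len=SEQ_LEN):
    """Return a seq_len x 4 nested list; ambiguous bases (e.g. N) become all zeros."""
    seq = seq.upper()
    if len(seq) != seq_len:
        raise ValueError(f"expected {seq_len} bp, got {len(seq)}")
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = []
    for base in seq:
        row = [0.0] * len(ALPHABET)
        if base in index:
            row[index[base]] = 1.0
        encoded.append(row)
    return encoded
```

A batch for a Keras model would typically be built by stacking such matrices, for example with `numpy.asarray([one_hot_encode(s) for s in sequences])`.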


Model Weights
Sequence‑to‑accessibility and sequence‑to‑activity model weights are stored separately in two folders.
Each folder contains three tissue‑specific subfolders: heart, limb, and midbrain (CNS).
Within each tissue folder there are six models, corresponding to three cross‑validation folds and two replicates.
Each model is stored as two files: Model.json (architecture) and Model.h5 (trained weights).
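
A model saved as a JSON architecture plus an HDF5 weights file can usually be restored with standard Keras calls. The following is a minimal sketch under assumptions: the folder and subfolder names passed to `model_files` are hypothetical placeholders, not the actual names used in this repository, and the sketch assumes the architecture uses only built-in Keras layers (otherwise `custom_objects` would have to be passed to `model_from_json`).

```python
from pathlib import Path


def model_files(root, task, tissue, model_name):
    """Build paths to Model.json and Model.h5 under a hypothetical folder layout."""
    model_dir = Path(root) / task / tissue / model_name
    return model_dir / "Model.json", model_dir / "Model.h5"


def load_keras_model(json_path, weights_path):
    """Rebuild the architecture from JSON, then load the trained weights."""
    # Imported lazily so the path helper above works without TensorFlow installed.
    from tensorflow import keras

    with open(json_path) as handle:
        model = keras.models.model_from_json(handle.read())
    model.load_weights(str(weights_path))
    return model
```

With the six models per tissue described above, the same two calls would be repeated once per fold and replicate folder.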


Training Objectives and Evaluation Metrics

Accessibility Model (Regression)

  • Evaluation Metric: Pearson Correlation Coefficient (Pearson r, reported as PCC)
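
For reference, Pearson r between predicted and observed values can be computed as below. This is a plain-Python sketch of the standard formula, not the evaluation script used for this model, and the function name `pearson_r` is hypothetical.

```python
import math


def pearson_r(predicted, observed):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(predicted)
    if n != len(observed) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_o)
```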

Activity Model (Classification)

  • Evaluation Metric: Positive Predictive Value (PPV)
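
PPV is the fraction of predicted positives that are true positives, TP / (TP + FP). The sketch below illustrates the definition. The 0.5 decision threshold is an assumption made for this example and is not a value stated in this model card, and the function name is hypothetical.

```python
def positive_predictive_value(y_true, y_score, threshold=0.5):
    """Return TP / (TP + FP); NaN when nothing is predicted positive."""
    true_pos = 0
    false_pos = 0
    for truth, score in zip(y_true, y_score):
        if score >= threshold:  # predicted positive (assumed threshold)
            if truth == 1:
                true_pos += 1
            else:
                false_pos += 1
    predicted_pos = true_pos + false_pos
    if predicted_pos == 0:
        return float("nan")
    return true_pos / predicted_pos
```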

Model Performance

Accessibility Model (Regression)
PCC values were computed between predicted and observed accessibility on the held‑out test chromosome (chromosome 18).

Tissue            PCC
Heart             0.76
Limb              0.76
Midbrain (CNS)    0.78
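
Holding out a whole chromosome, as done for the accessibility test set, can be sketched as follows. The region tuples, the `chr18` naming style, and the function name are assumptions for illustration. How the remaining chromosomes are divided into training and validation sets is not described here, so the sketch separates only test from non-test regions.

```python
def split_by_chromosome(regions, test_chroms=("chr18",)):
    """Partition (chrom, start, end) tuples into non-test and test lists."""
    train, test = [], []
    for region in regions:
        chrom = region[0]
        if chrom in test_chroms:
            test.append(region)
        else:
            train.append(region)
    return train, test
```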

Activity Model (Classification)
PPV values were computed on the held‑out test set using independent folds.

Tissue            PPV (%)
Heart             71.5
Limb              70.6
Midbrain (CNS)    80.2

All reported values are means across all cross‑validation folds and replicate models.


Dataset
Training and evaluation data are available in the companion Hugging Face dataset repository:
👉 Shenzhi‑Chen/DeepSTARR_Mouse_training_dataset


Framework
Implemented in TensorFlow 2 (v.2.4.1) / Keras.
All scripts for training and using the models are hosted in the GitHub repository linked above.


Intended Use
This model is released for research and educational purposes.
It can be applied to predict enhancer activity or chromatin accessibility from DNA sequences in mammalian genomes, particularly for E11.5 mouse embryonic tissues.


Limitations

  • Trained only on mouse embryonic enhancers (E11.5).
  • Generalization beyond this developmental stage or to other species may be limited.
  • Not validated for clinical or diagnostic applications.

License

  • Model and Code: Apache License 2.0
  • Dataset: Creative Commons Attribution 4.0 International (CC BY 4.0)

Citation
If you use this model, please cite:

[Paper citation once published]
