Models Trained with Pile-CC Only
This is a collection of language models trained only on Pile-CC, each with approximately 1B parameters and a different random seed. The project aims to validate how well the RegMix approach (https://huggingface.co/papers/2407.01492) generalizes from small-scale models (e.g., 1M parameters) to large-scale models (e.g., 1B parameters).
Key Features
- Model Size: 5 separate models trained with different seeds, each with ~1B parameters
- Training Data: The Pile-CC-only data mixture from the RegMix-Data dataset
Dataset
The models were trained using the RegMix-Data dataset, which splits The Pile dataset into its separate domains.
Training Hyperparameters
| Hyperparameter | Value |
|---|---|
| Batch Size | 1M tokens |
| Learning Rate | 4e-4 |
| Minimum Learning Rate | 1e-5 |
| Learning Rate Schedule | Cosine |
| Warmup Ratio | 4% |
| Total Tokens | 25B |
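For orientation, the table above implies roughly 25,000 optimizer steps (25B tokens at 1M tokens per batch), of which about 1,000 are warmup at a 4% ratio. The sketch below shows one common way to compute such a schedule. It assumes linear warmup from zero and that "1M" means 10^6 tokens; neither detail is stated here, and the name `lr_at_step` is illustrative rather than taken from the RegMix code.

```python
import math

PEAK_LR = 4e-4
MIN_LR = 1e-5
TOTAL_TOKENS = 25_000_000_000
BATCH_TOKENS = 1_000_000  # assumption: "1M tokens" read as 10**6
WARMUP_RATIO = 0.04

TOTAL_STEPS = TOTAL_TOKENS // BATCH_TOKENS
WARMUP_STEPS = round(TOTAL_STEPS * WARMUP_RATIO)


def lr_at_step(step: int) -> float:
    """Learning rate at a given optimizer step (hypothetical helper)."""
    if step < WARMUP_STEPS:
        # Assumption: linear warmup from 0 up to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Cosine decay from the peak down to the minimum learning rate.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
        print(step, lr_at_step(step))
```

Under these assumptions the rate peaks at 4e-4 at the end of warmup and reaches 1e-5 at the final step.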
How to Load a Model
You can load any model using the corresponding branch with the Hugging Face Transformers library:
from transformers import AutoModel, AutoTokenizer

# Each seed is stored on its own branch; select it with `revision`.
model = AutoModel.from_pretrained("sail/data-mixture-pile-cc-1b", revision="seed-1")
tokenizer = AutoTokenizer.from_pretrained("sail/data-mixture-pile-cc-1b", revision="seed-1")
Data Mixture
The specific data mixture used to train these 1B models is shown below. It can also be found in our code:
train:
  train_the_pile_pile_cc: 1.0
valid:
  valid_the_pile_pile_cc: 1.0
model_name: tinyllama_1_1b
Model Variants
To access a different model variant, change the revision parameter in the from_pretrained method to the desired seed (e.g., "seed-2", "seed-3"). The maximum seed is 5.
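Switching between variants can be wrapped in a small helper. The sketch below assumes the branches are named "seed-1" through "seed-5", in line with the examples in this card; `revision_for_seed` and `load_variant` are illustrative names, not part of the Transformers API.

```python
REPO_ID = "sail/data-mixture-pile-cc-1b"
MAX_SEED = 5  # the card states that the maximum seed is 5


def revision_for_seed(seed: int) -> str:
    """Map a seed number to its branch name (assumed to be 'seed-<n>')."""
    if not 1 <= seed <= MAX_SEED:
        raise ValueError(f"seed must be between 1 and {MAX_SEED}, got {seed}")
    return f"seed-{seed}"


def load_variant(seed: int):
    """Download and return (model, tokenizer) for one seed."""
    # Imported lazily so revision_for_seed works without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    revision = revision_for_seed(seed)
    model = AutoModel.from_pretrained(REPO_ID, revision=revision)
    tokenizer = AutoTokenizer.from_pretrained(REPO_ID, revision=revision)
    return model, tokenizer
```

Calling `load_variant(3)` would fetch the seed-3 checkpoint; validating the seed first avoids a slower failure when the Hub cannot find the requested branch.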
Model Performance
We evaluated each model using lm-evaluation-harness. The performance metric for each task is the average of the 0-shot to 5-shot acc_norm (normalized accuracy, if available) or acc (accuracy) scores.
| Seed | PIQA | LAMBADA | MultiRC | LogiQA | SocialIQA | Winogrande | RACE | OpenBookQA | COPA | HellaSwag | SciQ | ARC Easy | QQP | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 69.23 | 33.16 | 50.33 | 27.57 | 33.22 | 52.10 | 31.80 | 31.07 | 65.83 | 44.15 | 81.77 | 51.80 | 57.04 | 48.39 |
| 2 | 68.62 | 33.69 | 53.15 | 25.13 | 32.96 | 51.24 | 31.06 | 30.84 | 69.80 | 43.28 | 83.18 | 52.00 | 58.06 | 48.69 |
| 3 | 69.04 | 35.68 | 52.38 | 26.36 | 33.45 | 51.95 | 30.83 | 30.16 | 66.80 | 42.80 | 83.32 | 51.57 | 57.69 | 48.62 |
| 4 | 69.35 | 33.56 | 50.01 | 26.24 | 33.62 | 50.99 | 31.81 | 30.44 | 65.60 | 43.00 | 83.00 | 52.33 | 56.14 | 48.16 |
| 5 | 67.91 | 35.09 | 49.93 | 27.50 | 33.90 | 52.85 | 31.77 | 30.04 | 69.40 | 42.62 | 80.94 | 51.25 | 61.03 | 48.79 |
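To gauge how much the results vary across seeds, the per-seed averages in the table can be summarized with a mean and a sample standard deviation. The sketch below copies the numbers from the table and also recomputes each row's Average from its 13 task scores as a consistency check; the variable names are illustrative.

```python
from statistics import mean, stdev

# Task scores per seed, in the column order of the table above.
SCORES = {
    1: [69.23, 33.16, 50.33, 27.57, 33.22, 52.10, 31.80, 31.07, 65.83, 44.15, 81.77, 51.80, 57.04],
    2: [68.62, 33.69, 53.15, 25.13, 32.96, 51.24, 31.06, 30.84, 69.80, 43.28, 83.18, 52.00, 58.06],
    3: [69.04, 35.68, 52.38, 26.36, 33.45, 51.95, 30.83, 30.16, 66.80, 42.80, 83.32, 51.57, 57.69],
    4: [69.35, 33.56, 50.01, 26.24, 33.62, 50.99, 31.81, 30.44, 65.60, 43.00, 83.00, 52.33, 56.14],
    5: [67.91, 35.09, 49.93, 27.50, 33.90, 52.85, 31.77, 30.04, 69.40, 42.62, 80.94, 51.25, 61.03],
}
REPORTED_AVERAGE = {1: 48.39, 2: 48.69, 3: 48.62, 4: 48.16, 5: 48.79}

# Recompute each row's average from its task scores.
recomputed = {seed: mean(scores) for seed, scores in SCORES.items()}

# Summarize the reported averages across the five seeds.
seed_mean = mean(REPORTED_AVERAGE.values())
seed_stdev = stdev(REPORTED_AVERAGE.values())  # sample standard deviation

if __name__ == "__main__":
    for seed in SCORES:
        print(seed, round(recomputed[seed], 2), REPORTED_AVERAGE[seed])
    print("mean:", round(seed_mean, 2), "stdev:", round(seed_stdev, 2))
```

The recomputed row averages agree with the reported Average column to two decimal places.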
Usage Notes
- These models are primarily intended for research purposes.
- Performance may vary depending on the specific task and domain.
Citation
If you use these models in your research, please cite the RegMix paper:
@article{liu2024regmix,
title={RegMix: Data Mixture as Regression for Language Model Pre-training},
author={Liu, Qian and Zheng, Xiaosen and Muennighoff, Niklas and Zeng, Guangtao and Dou, Longxu and Pang, Tianyu and Jiang, Jing and Lin, Min},
journal={arXiv preprint arXiv:2407.01492},
year={2024}
}
For more information about the RegMix methodology and its applications, please refer to the original paper.