T-lite-it-2.1-int4-ov

OpenVINO INT4 quantized version of t-tech/T-lite-it-2.1.

📥 Quick Start - Download & Run

1. Install the Hugging Face CLI:

pip install -U "huggingface_hub[cli]"

2. Download model:

hf download savvadesogle/T-lite-it-2.1-int4-ov --local-dir ./T-lite-it-2.1-int4-ov
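
If you prefer Python over the CLI, the same download can be done with `huggingface_hub.snapshot_download`. A minimal sketch (assumes `huggingface_hub` is installed and network access is available; the import is kept inside the helper so the module loads even without the package):

```python
REPO_ID = "savvadesogle/T-lite-it-2.1-int4-ov"
LOCAL_DIR = "./T-lite-it-2.1-int4-ov"

def download_model(repo_id: str = REPO_ID, local_dir: str = LOCAL_DIR) -> str:
    """Fetch every file of the model repo into local_dir; returns the local path."""
    from huggingface_hub import snapshot_download  # third-party dependency
    return snapshot_download(repo_id=repo_id, local_dir=local_dir)

if __name__ == "__main__":
    print(download_model())
```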

🔧 Quantization Parameters

Weight compression was performed using optimum-cli export openvino with the following parameters:

optimum-cli export openvino ^
  --model ./T-lite-it-2.1 ^
  --task text-generation-with-past ^
  --weight-format int4 ^
  ./T-lite-it-2.1-int4-ov
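
Once exported, the IR can be loaded directly from Python via optimum-intel's standard API. A sketch, assuming `optimum[openvino]` and `transformers` are installed and the quantized directory exists (imports are local to the helper so defining it has no dependencies):

```python
def run_prompt(path: str = "./T-lite-it-2.1-int4-ov", prompt: str = "Hello") -> str:
    """Load the INT4 OpenVINO IR and generate a short completion."""
    from optimum.intel import OVModelForCausalLM  # third-party dependencies
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained(path)
    model = OVModelForCausalLM.from_pretrained(path)  # compiles for CPU by default
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=64)
    return tok.decode(out[0], skip_special_tokens=True)

if __name__ == "__main__":
    print(run_prompt())
```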

✅ Compatibility

The provided OpenVINO IR model is compatible with:

  • OpenVINO version 2026.0.0.dev20260102
  • Optimum 2.1.0.dev0
  • Optimum Intel 1.27.0.dev0+25fcb63
  • NNCF 3.0.0.dev0+999c5e91
  • OpenArc

🎯 Running with OpenArc

OpenArc is an OpenAI-compatible inference server for OpenVINO models.

Terminal 1 - start the server:

set OPENARC_API_KEY=BIG_KEY
openarc serve start --host 127.0.0.1

Terminal 2 - add the model:

set OPENARC_API_KEY=BIG_KEY
openarc add --model-name T-lite-it-2.1-int4-ov --model-path ./T-lite-it-2.1-int4-ov --engine ovgenai --model-type llm --device GPU


Load the model:

openarc load T-lite-it-2.1-int4-ov

Run a quick benchmark:

openarc bench T-lite-it-2.1-int4-ov

Connect via the OpenAI-compatible API:

http://127.0.0.1:8000/v1/chat/completions

🌐 OpenAI API Example (Windows CMD)

curl http://127.0.0.1:8000/v1/chat/completions ^
  -H "Content-Type: application/json" ^
  -H "Authorization: Bearer BIG_KEY" ^
  -d "{\"model\":\"T-lite-it-2.1-int4-ov\",\"messages\":[{\"role\":\"user\",\"content\":\"Tell me a joke\"}],\"temperature\":0.7,\"max_tokens\":128}"
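
The same request from Python using only the standard library. A sketch with hypothetical helper names (`build_chat_request`, `chat`); it assumes the OpenArc server above is running on 127.0.0.1:8000:

```python
import json
import urllib.request

def build_chat_request(model, prompt, temperature=0.7, max_tokens=128):
    """Mirror the JSON body of the curl call above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

def chat(prompt, base_url="http://127.0.0.1:8000", api_key="BIG_KEY"):
    """POST to the OpenAI-compatible endpoint and return the reply text."""
    body = json.dumps(build_chat_request("T-lite-it-2.1-int4-ov", prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Tell me a joke"))
```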

📊 Performance Metrics

openarc bench T-lite-it-2.1-int4-ov

Results (512 input tokens, 128 new tokens, non-streaming):

  • load_time: 20.36 s
  • ttft: 0.1 s
  • tpot: 24.38149 ms
  • prefill_throughput: 5116.81 tokens/s
  • decode_throughput: 41.01472 tokens/s
  • decode_duration: 3.19661 s
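
The reported numbers are internally consistent: decode throughput is the reciprocal of the time per output token (tpot), and prefill throughput is roughly input tokens divided by TTFT. A quick sanity check in Python, with values copied from the benchmark output:

```python
# tpot (ms per generated token) -> decode throughput (tokens/s)
tpot_ms = 24.38149
decode_tps = 1000.0 / tpot_ms
print(round(decode_tps, 2))  # matches the reported 41.01 tokens/s

# input tokens / ttft -> approximate prefill throughput
input_tokens, ttft_s = 512, 0.1
prefill_tps = input_tokens / ttft_s
print(round(prefill_tps))  # ~5120; the reported 5116.81 uses an unrounded ttft
```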

(Figure: T-lite-it-2.1 inference speed)

⚠️ Limitations

Check the original model card for limitations.

📄 Legal information

Distributed under the same license as the original model.

Model tree for savvadesogle/T-lite-it-2.1-int4-ov

  • Base model: Qwen/Qwen3-8B-Base
  • Finetuned: Qwen/Qwen3-8B
  • Quantized: this model