Instructions to use KandirResearch/CiSiMi-v0.1 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use KandirResearch/CiSiMi-v0.1 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-to-audio", model="KandirResearch/CiSiMi-v0.1")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("KandirResearch/CiSiMi-v0.1")
model = AutoModelForCausalLM.from_pretrained("KandirResearch/CiSiMi-v0.1")

llama-cpp-python

How to use KandirResearch/CiSiMi-v0.1 with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="KandirResearch/CiSiMi-v0.1",
	filename="unsloth.Q4_K_M.gguf",
)

output = llm(
	"Once upon a time,",
	max_tokens=512,
	echo=True
)
print(output)

Local Apps

llama.cpp

How to use KandirResearch/CiSiMi-v0.1 with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf KandirResearch/CiSiMi-v0.1:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf KandirResearch/CiSiMi-v0.1:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf KandirResearch/CiSiMi-v0.1:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf KandirResearch/CiSiMi-v0.1:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf KandirResearch/CiSiMi-v0.1:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf KandirResearch/CiSiMi-v0.1:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf KandirResearch/CiSiMi-v0.1:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf KandirResearch/CiSiMi-v0.1:Q4_K_M

Use Docker

docker model run hf.co/KandirResearch/CiSiMi-v0.1:Q4_K_M
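
Once `llama-server` is running, it can be queried over HTTP. The following is a minimal sketch, assuming the server listens on the default `http://localhost:8080` and exposes the OpenAI-compatible `/v1/completions` endpoint. The prompt template mirrors the one used in the Implementation section of this card, and `build_request` and `complete` are hypothetical helper names. The returned text still contains the model's audio tokens, which have to be decoded separately.

```python
import json
import urllib.request

SERVER_URL = "http://localhost:8080/v1/completions"  # assumed default address


def build_request(instruction, max_tokens=512):
    # Same prompt template as in the Implementation section of this card.
    prompt = f"<|im_start|>\nInstructions:\n{instruction}\n<|im_end|>\nAnswer:\n"
    payload = {"prompt": prompt, "max_tokens": max_tokens, "temperature": 0.6}
    return urllib.request.Request(
        SERVER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(instruction):
    # Send the request and return the raw generated text (text + audio tokens).
    with urllib.request.urlopen(build_request(instruction)) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Example (requires a running server):
# print(complete("Explain to me how gravity works!"))
```

Building the request separately from sending it makes the payload easy to inspect before any network call is made.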

Ollama
How to use KandirResearch/CiSiMi-v0.1 with Ollama:
```
ollama run hf.co/KandirResearch/CiSiMi-v0.1:Q4_K_M
```

Unsloth Studio

How to use KandirResearch/CiSiMi-v0.1 with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for KandirResearch/CiSiMi-v0.1 to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for KandirResearch/CiSiMi-v0.1 to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for KandirResearch/CiSiMi-v0.1 to start chatting

Docker Model Runner
How to use KandirResearch/CiSiMi-v0.1 with Docker Model Runner:
```
docker model run hf.co/KandirResearch/CiSiMi-v0.1:Q4_K_M
```

Lemonade

How to use KandirResearch/CiSiMi-v0.1 with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull KandirResearch/CiSiMi-v0.1:Q4_K_M

Run and chat with the model

lemonade run user.CiSiMi-v0.1-Q4_K_M

List all available models

lemonade list

CiSiMi: A Text-to-Speech (TTS) Model

Overview

CiSiMi is an early prototype of a text-to-audio model that can process text inputs and respond with both text and audio. Built for resource-constrained environments, it's designed to run efficiently on CPU using llama.cpp, making advanced speech synthesis accessible even without powerful GPUs.

"Being GPU poor and slightly disappointed with the csm release and my inability to run it, having to wait for time it takes me to run an ASR+LLM+TTS combo, I decided to ask Mom and Mom gave me CiSiMi At Home!"

This project demonstrates the power of open-source tools to create accessible speech technology. While still in its early stages, it represents a step toward democratizing advanced text-to-audio capabilities.

Technical Details

Model Specifications

Architecture: Based on OuteTTS-0.3-500M
Languages: English
Pipeline: Text-to-audio
Parameters: 500M
Training Dataset Size: ~15k samples
Future Goals: Scale the dataset to 200k-500k samples with multi-turn conversations, train both 500M and 1B parameter model variants, and add streaming for real-time use.

Training Methodology

Dataset Preparation:
- Started with gruhit-patel/alpaca_speech_instruct
- Cleaned by removing code, mathematical expressions, and non-English content
- Filtered to keep only entries whose combined input and output text is 256 tokens or fewer

Audio Generation:
- Converted text outputs to speech using hexgrad/Kokoro-82M
- Verified each generated audio clip using OpenAI Whisper
- Published the resulting dataset as KandirResearch/Speech2Speech

Model Training:
- Preprocessed the dataset using a modified OuteTTS methodology (training details)
- Fine-tuned OuteAI/OuteTTS-0.3-500M using Unsloth SFT
- Trained for 6 epochs, reaching a loss of 2.27, as a proof of concept
- ~~Trained for 3 epochs reaching a loss of 2.42 as a proof of concept~~
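
The filtering and verification steps above can be sketched roughly as follows. This is a minimal illustration, not the author's actual pipeline: `count_tokens` uses whitespace splitting as a stand-in for a real tokenizer, the `input` and `output` field names are assumed, and `transcript_matches` assumes the Whisper transcript is already available as a string.

```python
import re

MAX_TOKENS = 256  # limit stated in the dataset preparation step


def count_tokens(text):
    # Stand-in tokenizer: whitespace split. A real pipeline would use
    # the model's own tokenizer (assumption for illustration only).
    return len(text.split())


def keep_entry(entry, max_tokens=MAX_TOKENS):
    # Keep entries whose combined input+output length fits the limit.
    # The "input"/"output" keys are hypothetical field names.
    total = count_tokens(entry["input"]) + count_tokens(entry["output"])
    return total <= max_tokens


def normalize(text):
    # Lowercase and drop punctuation so minor formatting differences
    # between the source text and the transcript are ignored.
    return re.sub(r"[^a-z0-9 ]+", "", text.lower()).split()


def transcript_matches(reference, transcript):
    # Accept a generated clip only if its transcript contains the same
    # words as the text that was synthesized.
    return normalize(reference) == normalize(transcript)


entries = [
    {"input": "Name a color.", "output": "Blue is a color."},
    {"input": "Write an essay.", "output": "word " * 300},
]
filtered = [e for e in entries if keep_entry(e)]
print(len(filtered))
print(transcript_matches("Blue is a color.", "blue is a color"))
```

An exact word match is a strict criterion. A real verification pass might instead tolerate a small word error rate.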

Usage Guide

Sample

Explain to me how gravity works!

Installation

pip install outetts llama-cpp-python --upgrade
pip install huggingface_hub sounddevice

Implementation

import sys  # needed for the notebook check in the playback section
import torch
import outetts
import numpy as np
from huggingface_hub import hf_hub_download
from outetts.wav_tokenizer.audio_codec import AudioCodec
from outetts.version.v2.prompt_processor import PromptProcessor
from outetts.version.playback import ModelOutput

# Download the model
model_path = hf_hub_download(
    repo_id="KandirResearch/CiSiMi-v0.1",
    filename="unsloth.Q8_0.gguf",
)

# Configure the model
model_config = outetts.GGUFModelConfig_v2(
    model_path=model_path,
    tokenizer_path="KandirResearch/CiSiMi-v0.1",
)

# Initialize components
interface = outetts.InterfaceGGUF(model_version="0.3", cfg=model_config)
audio_codec = AudioCodec()
prompt_processor = PromptProcessor("KandirResearch/CiSiMi-v0.1")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
gguf_model = interface.get_model()

# Helper function to extract audio from tokens
def get_audio(tokens):
    outputs = prompt_processor.extract_audio_from_tokens(tokens)
    if not outputs:
        return None
    audio_tensor = audio_codec.decode(torch.tensor([[outputs]], dtype=torch.int64).to(device))
    return ModelOutput(audio_tensor, audio_codec.sr)

# Helper function to clean text output
def extract_text_from_tts_output(tts_output):
    text = ""
    for line in tts_output.strip().split('\n'):
        if '<|audio_end|>' in line or '<|im_end|>' in line:
            continue
        if '<|' in line:
            word = line.split('<|')[0].strip()
            if word:
                text += word + " "
        else:
            text += line.strip() + " "
    return text.strip()

# Generate response function
def generate_response(instruction):
    prompt = f"<|im_start|>\nInstructions:\n{instruction}\n<|im_end|>\nAnswer:\n"
    gen_cfg = outetts.GenerationConfig(
        text=prompt, 
        temperature=0.6, 
        repetition_penalty=1.1, 
        max_length=4096, 
        speaker=None
    )
    
    input_ids = prompt_processor.tokenizer.encode(prompt)
    tokens = gguf_model.generate(input_ids, gen_cfg)
    
    output_text = prompt_processor.tokenizer.decode(tokens, skip_special_tokens=False)
    
    if "<|audio_end|>" in output_text:
        first_part, _, _ = output_text.partition("<|audio_end|>")
        
        if "<|audio_end|>\n<|im_end|>\n" not in first_part:
            first_part += "<|audio_end|>\n<|im_end|>\n"
            
        extracted_text = extract_text_from_tts_output(first_part)
        
        audio_start_pos = first_part.find("<|audio_start|>\n") + len("<|audio_start|>\n")
        audio_end_pos = first_part.find("<|audio_end|>\n<|im_end|>\n") + len("<|audio_end|>\n<|im_end|>\n")
        
        if audio_start_pos >= len("<|audio_start|>\n") and audio_end_pos > audio_start_pos:
            audio_tokens_text = first_part[audio_start_pos:audio_end_pos]
            audio_tokens = prompt_processor.tokenizer.encode(audio_tokens_text)
            audio_output = get_audio(audio_tokens)
            
            if audio_output is not None and hasattr(audio_output, 'audio') and audio_output.audio is not None:
                audio_numpy = audio_output.audio.cpu().numpy()
                if audio_numpy.ndim > 1:
                    audio_numpy = audio_numpy.squeeze()
                
                return extracted_text, (audio_output.sr, audio_numpy)
    
    return output_text, None

# Example usage
question = "What is the meaning of life?"
response_text, response_audio = generate_response(question)
print(response_text)

# Play audio if available
if response_audio is not None:
    if "ipykernel" in sys.modules:
        from IPython.display import display, Audio
        display(Audio(response_audio[1], rate=response_audio[0], autoplay=True))
    else:
        import sounddevice as sd
        sd.play(response_audio[1], samplerate=response_audio[0])
        sd.wait()
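
If no audio device is available, the result can be written to disk instead of played. The sketch below is a hedged example that uses only the standard library. `save_wav` is a hypothetical helper, and it assumes the samples form a one-dimensional sequence of floats roughly in the range [-1, 1], which may need adjusting for the actual codec output.

```python
import array
import sys
import wave


def save_wav(path, sample_rate, samples):
    # Clip each sample to [-1, 1] and scale it to a 16-bit signed integer.
    pcm = array.array(
        "h", (int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples)
    )
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data must be little-endian
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(int(sample_rate))
        wav_file.writeframes(pcm.tobytes())
    return len(pcm)  # number of samples written


# With the (sample_rate, samples) tuple returned by generate_response:
# save_wav("response.wav", response_audio[0], response_audio[1])
```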

Limitations & Future Work

This early prototype has several areas for improvement:

- Limited training data (~15k samples)
- Basic prompt/chat template structure
- Opportunity to optimize training hyperparameters
- Potential for multi-turn conversation capabilities

Potential Limitation: This type of model quickly fills up the context window because every second of speech consumes many tokens, which makes smaller models generally more practical to run.
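
To make the context-window limitation concrete, here is a rough back-of-the-envelope estimate. It assumes the audio codec emits about 75 tokens per second of speech (an assumption based on the WavTokenizer configuration commonly used with OuteTTS, not a figure stated in this card) and ignores the per-word text and timing tokens, so real capacity is lower.

```python
AUDIO_TOKENS_PER_SECOND = 75  # assumed codec rate, not stated in this card


def max_audio_seconds(context_tokens, prompt_tokens, rate=AUDIO_TOKENS_PER_SECOND):
    # Tokens left after the prompt, converted to seconds of audio.
    remaining = max(context_tokens - prompt_tokens, 0)
    return remaining / rate


# With max_length=4096 from the Implementation section and a
# hypothetical 96-token prompt:
print(max_audio_seconds(4096, 96))
```

Under these assumptions a single response tops out at under a minute of speech, which is why multi-turn conversation puts pressure on the context budget.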

Acknowledgments & Citations

This model builds on the following open-source projects:

- OuteAI/OuteTTS-0.3-500M: base model
- gruhit-patel/alpaca_speech_instruct: initial dataset
- hexgrad/Kokoro-82M: TTS generation
- OpenAI Whisper: speech verification
- Unsloth: training optimization

Downloads last month: 197

Safetensors
Model size: 0.5B params
Tensor type: F16

Model tree for KandirResearch/CiSiMi-v0.1

Base model: OuteAI/OuteTTS-0.3-500M
Quantized (4): this model
Quantizations: 2 models