---
language:
- multilingual
- ko
- en
- zh
- ja
- es
- fr
- de
- ru
- ar
- hi
tags:
- tokenization
- byte-level
- neural-tokenizer
- pattern-learning
- vocabulary-free
license: mit
datasets:
- flores200
metrics:
- accuracy
model-index:
- name: intelligent-tokenizer-v6
  results:
  - task:
      type: token-classification
      name: Language Pattern Learning
    dataset:
      name: flores200
      type: flores200
    metrics:
    - type: accuracy
      value: 0.623
      name: Character Accuracy (Major Languages)
    - type: accuracy
      value: 0.472
      name: Character Accuracy (Minor Languages)
---

# Intelligent Tokenizer v6.0 - Language Pattern Learning

## Model Description

**World's First Language Pattern Learning Tokenizer** - discovers each language's unique patterns through pure learning.

### Key Features

- **No vocabulary files** - only 260 fixed byte values (see the encoding sketch at the end of this card)
- **Language pattern discovery** - learns Korean particles, English morphology, and Chinese characters
- **Equal language processing** - no English bias
- **Semantic unit preservation** - keeps meaning units intact

## Performance (Epoch 22)

| Language Group | Character Accuracy |
|----------------|--------------------|
| English/European | 95-100% |
| Korean | 70% |
| Japanese | 81% |
| Chinese | 7% (still learning) |
| Rare languages | 47% (average) |

## Technical Details

- **Architecture**: 5-layer encoder + 6-layer decoder
- **Parameters**: 105M
- **Input**: raw UTF-8 bytes
- **Output**: compressed semantic units
- **Training**: 22 epochs on the Flores-200 dataset

## Usage

```python
from transformers import AutoModel

# Load the model from the Hugging Face Hub
model = AutoModel.from_pretrained("woo-jinhyun/intelligent-tokenizer-v6")

# ByteTokenizerV6 is the project's custom byte-level tokenizer,
# shipped with the repository rather than loaded via AutoTokenizer
tokenizer = ByteTokenizerV6()

# Process text
text = "안녕하세요"                    # "Hello" in Korean
encoded = tokenizer.encode(text)       # text -> byte IDs
compressed = model.encode(encoded)     # byte IDs -> compressed semantic units
```

## Limitations

- Current chunk size: 256 bytes (POC limitation)
- Chinese and Arabic need more training
- Compression quality is still improving

## Citation

```bibtex
@software{intelligent_tokenizer_2025,
  author = {Woo, Jinhyun and Claude Code},
  title = {Intelligent Tokenizer: Language Pattern Learning},
  year = {2025},
  url = {https://github.com/Woojiggun/intelligent-tokenizer}
}
```

## Contact

- **Author**: Woo Jinhyun
- **Email**: ggunio5782@gmail.com
- **LinkedIn**: [www.linkedin.com/in/namuneup](https://www.linkedin.com/in/namuneup)
- **Paper**: [Read on Zenodo](https://zenodo.org/records/17116281?token=eyJhbGciOiJIUzUxMiJ9.eyJpZCI6ImIyNWZiYTQyLWNiNGEtNDBmNi1iNTczLWVkMDJlNDI1YTQ1OSIsImRhdGEiOnt9LCJyYW5kb20iOiI0OWJkZWMzMjJjZTc3OTIwMTk4NTJlNTY1YmNjOGU1ZiJ9.Z_hXEp160tWBD5Qe2laQv1vhS4Js2a0R5BMWYs2PTG5vJMrc8l-BmPAIMya9O_HiN85jYZp-WOMOHg_DTHrg2A)

## Development

- **Design**: Woo Jinhyun
- **Implementation**: Claude Code collaboration
- **Hardware**: RTX 4070
- **Duration**: 1 month (Aug-Sep 2025)
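
## Appendix: Byte-Level Encoding Sketch

The 260 fixed byte values plausibly decompose as 256 raw byte values plus four special tokens; this split, the special-token IDs, and the chunking behavior below are illustrative assumptions, not the confirmed internals of `ByteTokenizerV6`. A minimal sketch of a vocabulary-free byte-level encoder under those assumptions:

```python
# Minimal sketch of a vocabulary-free byte-level encoder.
# ASSUMPTION: 260 IDs = 256 raw byte values + 4 special tokens;
# the real ByteTokenizerV6 layout may differ.

PAD, BOS, EOS, MASK = 256, 257, 258, 259  # hypothetical special-token IDs

def encode(text: str, max_bytes: int = 256) -> list[int]:
    """Map text to byte IDs: UTF-8 bytes already fall in 0..255,
    so no vocabulary file or lookup table is needed."""
    byte_ids = list(text.encode("utf-8"))[:max_bytes]  # 256-byte chunk (POC limit)
    return [BOS] + byte_ids + [EOS]

def decode(ids: list[int]) -> str:
    """Inverse mapping: drop special IDs and decode the raw bytes."""
    return bytes(i for i in ids if i < 256).decode("utf-8", errors="replace")

ids = encode("안녕하세요")        # 5 Hangul syllables -> 15 UTF-8 bytes
print(len(ids), decode(ids))     # 17 (BOS + 15 bytes + EOS), "안녕하세요"
```

Because the ID space never grows, every language passes through the same 260 values; it is the model, not the tokenizer, that learns where meaningful unit boundaries lie.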