---
license: mit
language:
- en
tags:
- text-generation
- sql
- text-to-sql
- gemma
- fine-tuned
- database
- nlp
base_model: google/gemma-7b
datasets:
- estu-research/sql-training-dataset
metrics:
- accuracy
- exact_match
library_name: transformers
pipeline_tag: text-generation
---

# Gemma-7B SQL Expert (Fine-Tuned)

A fine-tuned version of Google's Gemma-7B model that converts natural-language questions into SQL queries.

## Model Details

- **Base Model**: [google/gemma-7b](https://huggingface.co/google/gemma-7b)
- **Fine-tuned by**: ESTU Research Team (Kulalı, Aydın, Alhan, Fidan)
- **Institution**: Eskisehir Technical University
- **Project**: TÜBİTAK 2209-A Research
- **License**: MIT
- **Language**: English
- **Task**: Natural Language to SQL Translation

## Performance

- **Execution Accuracy**: 76.0%
- **Exact Match**: 65.4%
- **Average Latency**: 500ms
- **Model Size**: 14.1 GB (full) / 183 MB (LoRA adapters)

## Training Details

### Training Data

- **Dataset**: [estu-research/sql-training-dataset](https://huggingface.co/datasets/estu-research/sql-training-dataset)
- **Examples**: 1,000+ natural-language-to-SQL pairs
- **Domain**: Sales database queries (customers, orders, products, employees)

### Training Configuration

```python
{
    "base_model": "google/gemma-7b",
    "method": "LoRA",
    "rank": 16,
    "alpha": 32,
    "dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "epochs": 3,
    "batch_size": 8,
    "learning_rate": 1.5e-4,
    "training_time": "10.8 hours (A100 GPU)"
}
```

### Training Results

```
Epoch 1: Loss 1.456 | Val Loss 1.512 | Accuracy 68.2%
Epoch 2: Loss 0.521 | Val Loss 0.589 | Accuracy 72.8%
Epoch 3: Loss 0.234 | Val Loss 0.267 | Accuracy 76.0%
```

## Usage

### Installation

```bash
pip install transformers torch
```

### Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("estu-research/gemma-7b-sql-ft")
tokenizer = AutoTokenizer.from_pretrained("estu-research/gemma-7b-sql-ft")

# Example query
question = """
Schema: CREATE TABLE customers (customerNumber INT, customerName VARCHAR(50), country VARCHAR(50));
Question: List all customers from France
"""

inputs = tokenizer(question, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
# Slice off the prompt tokens so only the generated SQL is decoded.
sql = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(sql)
# Output: SELECT * FROM customers WHERE country = 'France';
```

### Advanced Usage with Pipeline

```python
from transformers import pipeline

pipe = pipeline("text-generation", model="estu-research/gemma-7b-sql-ft")

result = pipe(
    "Schema: CREATE TABLE products (productName VARCHAR, price DECIMAL);\nQuestion: Show top 10 expensive products",
    max_new_tokens=200,
    do_sample=True,  # temperature only takes effect when sampling is enabled
    temperature=0.1,
    return_full_text=False,  # return only the generated SQL, not the prompt
)
print(result[0]['generated_text'])
```
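### Memory-Efficient Loading (Optional)

The full checkpoint is 14.1 GB, so loading it in float32 can exhaust memory on smaller GPUs. Below is a minimal sketch of 4-bit quantized loading, assuming `bitsandbytes` and `accelerate` are installed; the quantization settings are illustrative defaults, not a configuration shipped with this model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Illustrative 4-bit settings; not an official configuration for this checkpoint.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "estu-research/gemma-7b-sql-ft",
    quantization_config=bnb_config,
    device_map="auto",  # let accelerate spread layers across available devices
)
tokenizer = AutoTokenizer.from_pretrained("estu-research/gemma-7b-sql-ft")

prompt = "Schema: CREATE TABLE customers (customerNumber INT, country VARCHAR(50));\nQuestion: List all customers from France"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

Quantized weights trade a small amount of accuracy for roughly a quarter of the full-precision memory footprint.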
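### Loading via LoRA Adapters (Optional)

The Performance section lists 183 MB of LoRA adapters alongside the 14.1 GB merged checkpoint. If the adapters are published as a separate PEFT repository, they can be applied on top of the base model instead of downloading the merged weights. A sketch assuming `peft` is installed; the adapter repository ID below is hypothetical:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model first, then apply the LoRA adapters on top of it.
base = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Hypothetical adapter repository ID; substitute the actual adapter location.
model = PeftModel.from_pretrained(base, "estu-research/gemma-7b-sql-ft-lora")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b")
```

Once the base model is cached locally, this keeps each update download at adapter size.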
## Example Queries

| Natural Language | Generated SQL |
|------------------|---------------|
| List top 5 customers by sales | `SELECT customerName, SUM(amount) as total FROM customers JOIN orders USING(customerId) GROUP BY customerId ORDER BY total DESC LIMIT 5;` |
| Show products never ordered | `SELECT p.productName FROM products p LEFT JOIN orderDetails od ON p.productCode = od.productCode WHERE od.productCode IS NULL;` |
| Total revenue by country | `SELECT country, SUM(amount) as revenue FROM customers JOIN orders USING(customerId) GROUP BY country ORDER BY revenue DESC;` |

## Comparison with Other Models

| Model | Accuracy | Latency | Cost |
|-------|----------|---------|------|
| **Gemma-7B (FT)** | **76.0%** | 500ms | Free |
| Llama-3-8B (FT) | 78.2% | 450ms | Free |
| GPT-4o-mini (FT) | 97.8% | 800ms | $0.30/1K |
| GPT-3.5 Turbo | 78.9% | 500ms | $0.05/1K |

## Limitations

- Trained primarily on a sales database schema
- May struggle with very complex nested queries
- Performs best on English-language queries
- Requires a GPU for optimal inference speed

## Intended Use

- **Primary**: Natural language to SQL translation for analytics
- **Secondary**: SQL query assistance and education
- **Not for**: Production databases without query validation

## Citation

```bibtex
@misc{gemma7b-sql-ft,
  title={Gemma-7B SQL Expert: Fine-Tuned Model for Text-to-SQL},
  author={Kulalı and Aydın and Alhan and Fidan},
  institution={Eskisehir Technical University},
  year={2024},
  url={https://huggingface.co/estu-research/gemma-7b-sql-ft}
}
```

## Links

- **GitHub**: [Japyh/llm-based-dbms](https://github.com/Japyh/llm-based-dbms)
- **Research Paper**: [docs/research_paper_draft.md](https://github.com/Japyh/llm-based-dbms/blob/main/docs/research_paper_draft.md)
- **Dataset**: [estu-research/sql-training-dataset](https://huggingface.co/datasets/estu-research/sql-training-dataset)
- **Organization**: [estu-research](https://huggingface.co/estu-research)

## Acknowledgments

This work was supported by a TÜBİTAK 2209-A Research Grant at Eskisehir Technical University.

## License

MIT License - see the LICENSE file for details.