---
base_model: unsloth/gemma-2-2b-it-bnb-4bit
tags:
- text-generation-inference
- transformers
- unsloth
- gemma2
- text-to-sql
- qlora
- sql-generation
license: apache-2.0
language:
- en
datasets:
- gretelai/synthetic_text_to_sql
pipeline_tag: text-generation
---

# Gemma-2-2B Text-to-SQL QLoRA Fine-tuned Model

- **Developed by:** rajaykumar12959
- **License:** apache-2.0
- **Finetuned from model:** unsloth/gemma-2-2b-it-bnb-4bit
- **Dataset:** gretelai/synthetic_text_to_sql
- **Task:** Text-to-SQL Generation
- **Fine-tuning Method:** QLoRA (Quantized Low-Rank Adaptation)

This gemma2 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

## Model Description

This model is fine-tuned to generate SQL queries from natural language questions and database schemas. It is designed to handle complex multi-table queries requiring JOINs, aggregations, filtering, and other advanced SQL operations.

### Key Features

- ✅ **Multi-table JOINs** (INNER, LEFT, RIGHT)
- ✅ **Aggregation functions** (SUM, COUNT, AVG, MIN, MAX)
- ✅ **GROUP BY and HAVING clauses**
- ✅ **Complex WHERE conditions**
- ✅ **Subqueries and CTEs**
- ✅ **Date/time operations**
- ✅ **String functions and pattern matching**

## Training Configuration

The model was fine-tuned using QLoRA with the following configuration:

```python
# LoRA Configuration
r = 16                                # Rank: 16 is a good balance for 2B models
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj"]
lora_alpha = 16
lora_dropout = 0
bias = "none"
use_gradient_checkpointing = "unsloth"

# Training Parameters
max_seq_length = 2048
per_device_train_batch_size = 2
gradient_accumulation_steps = 4       # Effective batch size = 8
warmup_steps = 5
max_steps = 100                       # Demo configuration - increase to 300+ for production
learning_rate = 2e-4
optim = "adamw_8bit"                  # 8-bit optimizer for memory efficiency
weight_decay = 0.01
lr_scheduler_type = "linear"
```
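For reference, below is a minimal sketch of how these values would typically be wired into Unsloth's `FastLanguageModel.get_peft_model` and TRL's `SFTTrainer`. This is not the exact notebook code: the formatting function here is a simplified version of the one shown in the Training Process section, and the argument names follow the older TRL `SFTTrainer` API used by the Unsloth notebooks (newer TRL releases move `dataset_text_field` and `max_seq_length` into `SFTConfig`).

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load the 4-bit base model (same settings as the "Loading the Model" section below)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-2-2b-it-bnb-4bit",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)

# Attach LoRA adapters with the configuration listed above
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
)

# Build one training string per example in the same chat format used at
# inference time (simplified version of formatting_prompts_func from the
# "Training Process" section below).
def formatting_prompts_func(examples):
    texts = []
    for schema, question, sql in zip(
        examples["sql_context"], examples["sql_prompt"], examples["sql"]
    ):
        texts.append(
            "<start_of_turn>user\n"
            "You are a powerful text-to-SQL model. Your job is to answer "
            "questions about a database. You are given a question and context "
            "regarding one or more tables.\n\n"
            f"### Schema:\n{schema}\n\n"
            f"### Question:\n{question}<end_of_turn>\n"
            f"<start_of_turn>model\n{sql}" + tokenizer.eos_token
        )
    return {"text": texts}

dataset = load_dataset("gretelai/synthetic_text_to_sql", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",      # column produced by formatting_prompts_func
    max_seq_length = 2048,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 100,
        learning_rate = 2e-4,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        logging_steps = 10,
        output_dir = "outputs",
    ),
)
trainer.train()
```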
## Installation

```bash
pip install unsloth transformers torch trl datasets
```

## Usage

### Loading the Model

```python
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048
dtype = None
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "rajaykumar12959/gemma-2-2b-text-to-sql-qlora",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model)  # Enable faster inference
```

### Inference Function

```python
def inference_text_to_sql(model, tokenizer, schema, question, max_new_tokens=300):
    """
    Generate SQL from a natural language question and a database schema.

    Args:
        model: Fine-tuned Gemma model
        tokenizer: Model tokenizer
        schema: Database schema as string
        question: Natural language question
        max_new_tokens: Maximum tokens to generate

    Returns:
        Generated SQL query as string
    """
    # Format the input prompt using the Gemma chat turn markers
    input_prompt = f"""<start_of_turn>user
You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.

### Schema:
{schema}

### Question:
{question}<end_of_turn>
<start_of_turn>model
"""

    # Tokenize input
    inputs = tokenizer([input_prompt], return_tensors="pt").to("cuda")

    # Generate output
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            use_cache=True,
            do_sample=True,
            temperature=0.1,  # Low temperature for more deterministic output
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id
        )

    # Decode, keep only the model turn, and strip the special tokens
    result = tokenizer.batch_decode(outputs)[0]
    sql_query = (
        result.split("<start_of_turn>model")[-1]
        .replace("<end_of_turn>", "")
        .replace("<eos>", "")
        .strip()
    )
    return sql_query
```

### Example Usage

#### Example 1: Simple Single-Table Query

```python
# Simple employee database
simple_schema = """
CREATE TABLE employees (
    employee_id INT PRIMARY KEY,
    name TEXT,
    department TEXT,
    salary DECIMAL,
    hire_date DATE
);
"""

simple_question = "Find all employees in the 'Engineering' department with salary greater than 75000"

sql_result = inference_text_to_sql(model, tokenizer, simple_schema, simple_question)
print(f"Generated SQL:\n{sql_result}")
```

**Expected Output:**

```sql
SELECT * FROM employees WHERE department = 'Engineering' AND salary > 75000;
```

## Training Details

### Dataset

- **Source:** gretelai/synthetic_text_to_sql
- **Size:** 100,000 synthetic text-to-SQL examples
- **Columns used:**
  - `sql_context`: Database schema
  - `sql_prompt`: Natural language question
  - `sql`: Target SQL query

### Training Process

The model uses a custom formatting function to structure the training data:

```python
def formatting_prompts_func(examples):
    # gemma_prompt (the chat-style prompt template) and EOS_TOKEN
    # (tokenizer.eos_token) are defined earlier in the training notebook.
    schemas = examples["sql_context"]
    questions = examples["sql_prompt"]
    outputs = examples["sql"]
    texts = []
    for schema, question, output in zip(schemas, questions, outputs):
        text = gemma_prompt.format(schema, question, output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}
```

### Hardware Requirements

- **GPU:** Single GPU with 8GB+ VRAM
- **Training Time:** ~30 minutes for 100 steps
- **Memory Optimization:** 4-bit quantization + 8-bit optimizer

## Performance Characteristics

### Strengths

- Excellent performance on multi-table JOINs
- Accurate aggregation and GROUP BY operations
- Proper handling of foreign key relationships
- Good understanding of filtering logic (WHERE/HAVING)

### Model Capabilities Test

The model was tested on a complex 4-table JOIN query requiring:

1. **Multi-table JOINs** (users → orders → order_items → products)
2. **Category filtering** (WHERE p.category = 'Electronics')
3. **User grouping** (GROUP BY user fields)
4. **Aggregation** (SUM of price × quantity)
5. **Aggregate filtering** (HAVING total > 500)

A sketch of this kind of test is shown below.
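To make the test concrete, the following sketch shows the kind of schema and question involved. The table and column definitions here are illustrative assumptions, not the exact ones used in the original test.

```python
# Illustrative 4-table e-commerce schema (column names are assumptions)
join_schema = """
CREATE TABLE users (
    user_id INT PRIMARY KEY,
    name TEXT,
    email TEXT
);
CREATE TABLE products (
    product_id INT PRIMARY KEY,
    product_name TEXT,
    category TEXT,
    price DECIMAL
);
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    user_id INT REFERENCES users(user_id),
    order_date DATE
);
CREATE TABLE order_items (
    order_item_id INT PRIMARY KEY,
    order_id INT REFERENCES orders(order_id),
    product_id INT REFERENCES products(product_id),
    quantity INT
);
"""

join_question = (
    "For each user, find the total amount spent on 'Electronics' products, "
    "and only include users whose total spending is greater than 500."
)

sql_result = inference_text_to_sql(model, tokenizer, join_schema, join_question)
print(f"Generated SQL:\n{sql_result}")
```

A correct answer requires joining all four tables, filtering on `products.category`, grouping by the user columns, summing `price * quantity`, and filtering the aggregate with a HAVING clause.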
## Limitations

- **Training Scale:** Trained with only 100 steps for demonstration. For production use, increase `max_steps` to 300+
- **Context Length:** Limited to a maximum sequence length of 2048 tokens
- **SQL Dialects:** Primarily trained on standard SQL syntax
- **Complex Subqueries:** May require additional fine-tuning for highly complex nested queries

## Reproduction

To reproduce this training:

1. **Clone the notebook:** Use the provided `Fine_tune_qlora.ipynb`
2. **Install dependencies:**
   ```bash
   pip install unsloth transformers torch trl datasets
   ```
3. **Configure training:** Adjust `max_steps` in TrainingArguments for longer training
4. **Run training:** Execute all cells in the notebook

### Production Training Recommendations

```python
# For production use, update these parameters:
max_steps = 300                       # Increase from 100
warmup_steps = 10                     # Increase warmup
per_device_train_batch_size = 4       # If you have more GPU memory
```

## Model Card

| Parameter | Value |
|-----------|-------|
| Base Model | Gemma-2-2B (4-bit quantized) |
| Fine-tuning Method | QLoRA |
| LoRA Rank | 16 |
| Training Steps | 100 (demo) |
| Learning Rate | 2e-4 |
| Batch Size | 8 (effective) |
| Max Sequence Length | 2048 |
| Dataset Size | 100k examples |

## Citation

```bibtex
@misc{gemma-2-2b-text-to-sql-qlora,
  author = {rajaykumar12959},
  title = {Gemma-2-2B Text-to-SQL QLoRA Fine-tuned Model},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/rajaykumar12959/gemma-2-2b-text-to-sql-qlora}},
}
```

## Acknowledgments

- **Base Model:** Google's Gemma-2-2B via Unsloth optimization
- **Dataset:** Gretel AI's synthetic text-to-SQL dataset
- **Framework:** Unsloth for efficient fine-tuning and TRL for training
- **Method:** QLoRA for parameter-efficient training

## License

This model is licensed under Apache 2.0. See the LICENSE file for details.

---

*This model is intended for research and educational purposes. Please ensure compliance with your organization's data and AI usage policies when using it in production environments.*