---
title: PipelineForge MCP
emoji: 🔧
colorFrom: purple
colorTo: pink
sdk: gradio
sdk_version: 6.0.1
app_file: app.py
pinned: true
tags:
  - mcp-in-action-track-enterprise
  - aws-glue
  - etl
  - data-engineering
  - rag
  - chromadb
license: mit
---

# 🔧 PipelineForge MCP

**AWS Glue ETL Optimizer with AI-Powered Workflow**

Transform your ETL development with AI! PipelineForge MCP is an intelligent MCP (Model Context Protocol) server that automates AWS Glue ETL pipeline creation from screenshots to production-ready code.

## 🏆 MCP 1st Birthday Hackathon Submission

**Track:** MCP in Action - Enterprise Category

**Tag:** `mcp-in-action-track-enterprise`

## ✨ Features

### 6 Integrated MCP Tools

1. **🔍 Screenshot Analysis** - Extract ETL requirements from AWS console images using Claude Vision
2. **🚀 Script Generation** - Generate optimized PySpark scripts with AWS Glue best practices
3. **💵 Cost Simulation** - Calculate AWS Glue job costs before deployment
4. **🏭 CDK Infrastructure** - Generate AWS CDK Python code for complete deployment
5. **🎤 Voice Summary** - ElevenLabs TTS narration of your pipeline
6. **📚 Template Library** - RAG-powered similarity search with 5 ETL templates

### 🎯 Key Technologies

- **🧠 Claude AI (Anthropic)**: Vision and text generation
- **🔍 RAG Search**: ChromaDB + SentenceTransformers for template matching
- **🎙️ ElevenLabs**: Text-to-speech for pipeline summaries
- **⚡ Modal**: Serverless cost simulation
- **🎨 Gradio 5+**: Modern, colorful UI with automatic data flow

## 🚀 How It Works

**Automatic Workflow** - Each step uses data from previous steps automatically:

1. **Upload Screenshot** → Extract requirements
2. **Generate Script** → AI creates PySpark code
3. **Simulate Cost** → Calculate AWS expenses
4. **Generate CDK** → Infrastructure as code
5. **Voice Summary** → Audio explanation
6. **Find Templates** → Similar ETL patterns

No copy/paste needed - data flows automatically!

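
As a rough illustration of how one workflow step can read values left by earlier steps, here is a minimal sketch of the cost simulation step. It is not the code in `app.py`: `PipelineState`, `estimate_cost`, and `simulate_cost_step` are hypothetical names, and the DPU counts and the price per DPU-hour are assumed values that should be checked against current AWS Glue pricing for your region.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed values for illustration only: DPUs per worker type and a
# price per DPU-hour. Check current AWS Glue pricing for your region.
DPUS_PER_WORKER = {"G.1X": 1, "G.2X": 2}
PRICE_PER_DPU_HOUR = 0.44


@dataclass
class PipelineState:
    """Hypothetical container for values shared between workflow steps."""
    worker_type: str = "G.1X"
    num_workers: int = 2
    runtime_minutes: float = 10.0
    cost_usd: Optional[float] = None  # filled in by the cost step


def estimate_cost(worker_type: str, num_workers: int, runtime_minutes: float) -> float:
    """Return an estimated job cost in USD from the DPU-hours consumed."""
    if worker_type not in DPUS_PER_WORKER:
        raise ValueError(f"Unknown worker type: {worker_type}")
    dpu_hours = DPUS_PER_WORKER[worker_type] * num_workers * runtime_minutes / 60
    return round(dpu_hours * PRICE_PER_DPU_HOUR, 2)


def simulate_cost_step(state: PipelineState) -> PipelineState:
    """Read job settings stored by earlier steps and record the estimate."""
    state.cost_usd = estimate_cost(
        state.worker_type, state.num_workers, state.runtime_minutes
    )
    return state


if __name__ == "__main__":
    state = PipelineState(worker_type="G.2X", num_workers=5, runtime_minutes=60)
    print(simulate_cost_step(state).cost_usd)
```

Keeping the shared values in one state object is one way to avoid copy/paste between steps: each step only needs the state, not the output text of the step before it.
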
## 💻 Local Setup

```bash
# Clone repository
git clone
cd pipelineforge-mcp

# Install dependencies
pip install -r requirements.txt

# Set environment variables
# ANTHROPIC_API_KEY=your_key
# ELEVENLABS_API_KEY=your_key

# Run app
python app.py
```

Access at: http://127.0.0.1:7861

## 🎨 UI Highlights

- **Purple-to-violet gradient** backgrounds
- **Pink-to-red** accent headers
- **Glassmorphism effects** for modern look
- **Smooth animations** on interactions
- **6 tabbed workflow** with progress indicators

## 🔧 Technical Architecture

### RAG Implementation

- **ChromaDB** for vector storage
- **SentenceTransformers** (all-MiniLM-L6-v2) for embeddings
- **5 pre-loaded templates**: Daily sales, CDC, data quality, multi-source, incremental ETL
- **Semantic similarity search** for pattern matching

### MCP Tools

Each tool is decorated with `@gr.mcp.tool()` and integrated into the Gradio UI:

- Screenshot analysis uses Claude Vision API
- Script generation with Claude 3 Opus
- Cost simulation with configurable worker types
- CDK code generation for deployment
- Voice synthesis with ElevenLabs v2.x API
- RAG template search with relevance scoring

## 📊 Use Cases

Perfect for:

- **Data Engineers** building ETL pipelines
- **DevOps Teams** automating infrastructure
- **Analytics Teams** processing data workflows
- **Cloud Architects** designing data platforms

## 🎯 What Makes It Unique

Unlike generic ChatGPT prompts, PipelineForge MCP provides:

1. **Visual Analysis** - Upload AWS console screenshots
2. **Cost Awareness** - See expenses before deployment
3. **Template Learning** - RAG-powered pattern matching
4. **Voice Explanations** - Audio summaries of pipelines
5. **Production Export** - Ready-to-deploy CDK code
6. **Automatic Flow** - No manual copy/paste between steps

## 📦 Dependencies

- `gradio>=5.0.0` - MCP server framework
- `anthropic>=0.40.0` - Claude AI
- `elevenlabs>=0.2.0` - Text-to-speech
- `chromadb>=0.4.18` - Vector database
- `sentence-transformers>=2.2.0` - Embeddings
- `modal>=1.2.0` - Serverless compute
- `boto3>=1.34.0` - AWS SDK

## 🏗️ Project Structure

```
pipelineforge-mcp/
├── app.py                      # Main Gradio MCP server
├── rag_templates_minimal.py    # RAG implementation
├── requirements.txt            # Dependencies
├── README.md                   # This file
├── .env                        # API keys (not committed)
└── chroma_db/                  # RAG vector storage
```

## 🎬 Demo

*[Demo video coming soon]*

## 🤝 Contributing

Built for the MCP 1st Birthday Hackathon!

## 📄 License

MIT License

## 🙏 Acknowledgments

- **Anthropic** - Claude AI for vision and text
- **ChromaDB** - Vector database for RAG
- **ElevenLabs** - Voice synthesis
- **Modal** - Serverless compute
- **Gradio** - MCP framework

---

**🏆 MCP 1st Birthday Hackathon | Track: MCP in Action (Enterprise)**

Built with ❤️ for data engineers