Google Cloud Platform Deployment Guide

This guide covers deploying the Router Agent application to Google Cloud Platform with GPU support.

Prerequisites

  1. Google Cloud Account with billing enabled
  2. gcloud CLI installed and configured
    curl https://sdk.cloud.google.com | bash
    gcloud init
    
  3. Docker installed locally
  4. HF_TOKEN environment variable set, for accessing private models (a quick sanity check of these prerequisites follows below)
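
Before moving on, you can sanity-check the prerequisites; the token value below is a placeholder:

# Verify the CLI tooling is installed and authenticated
gcloud --version
docker --version
gcloud auth list

# Placeholder; substitute your real Hugging Face token
export HF_TOKEN="your-huggingface-token"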

Deployment Options

Option 1: Cloud Run (Serverless, CPU only)

Pros:

  • Serverless, pay-per-use
  • Auto-scaling
  • No VM management

Cons:

  • No GPU support (CPU inference only)
  • Cold starts
  • Memory-constrained (this guide uses 8 GiB; see Troubleshooting for raising it)

Steps:

# Set your project ID
export GCP_PROJECT_ID="your-project-id"
export GCP_REGION="us-central1"

# Make script executable
chmod +x deploy-gcp.sh

# Deploy to Cloud Run
./deploy-gcp.sh cloud-run

Cost: ~$0.10-0.50/hour when active (depends on traffic)
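
Once deployed, you can fetch the service URL and smoke-test it. This assumes the service name router-agent used in the manual steps below; check deploy-gcp.sh if yours differs:

# Fetch the deployed service URL and confirm it responds
URL=$(gcloud run services describe router-agent \
    --region us-central1 --format 'value(status.url)')
curl -I "$URL"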

Option 2: Compute Engine with GPU (Recommended for Production)

Pros:

  • Full GPU support (T4, V100, A100)
  • Persistent instance
  • Better for long-running workloads
  • Lower latency (no cold starts)

Cons:

  • Requires VM management
  • Higher cost for always-on instances

Steps:

# Set your project ID and zone
export GCP_PROJECT_ID="your-project-id"
export GCP_ZONE="us-central1-a"
export HF_TOKEN="your-huggingface-token"

# Make script executable
chmod +x deploy-compute-engine.sh

# Deploy to Compute Engine
./deploy-compute-engine.sh

GPU Options:

  • T4 (nvidia-tesla-t4): ~$0.35/hour - Good for 27B-32B models with quantization
  • V100 (nvidia-tesla-v100): ~$2.50/hour - Better performance
  • A100 (nvidia-a100): ~$3.50/hour - Best performance for large models

Cost: GPU instance + storage (~$0.35-3.50/hour depending on GPU type)
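
GPU availability varies by zone, so before deploying it is worth listing which accelerator types your chosen zone actually offers:

# List accelerator types available in the target zone
gcloud compute accelerator-types list --filter="zone:us-central1-a"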

Manual Deployment Steps

1. Build and Push Docker Image

# Authenticate Docker
gcloud auth configure-docker

# Set project
gcloud config set project YOUR_PROJECT_ID

# Build image
docker build -t gcr.io/YOUR_PROJECT_ID/router-agent:latest .

# Push to Container Registry
docker push gcr.io/YOUR_PROJECT_ID/router-agent:latest
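
To confirm the push succeeded, you can list the image's tags in the registry:

# Verify the image landed in Container Registry
gcloud container images list-tags gcr.io/YOUR_PROJECT_ID/router-agent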

2. Deploy to Cloud Run (CPU)

gcloud run deploy router-agent \
    --image gcr.io/YOUR_PROJECT_ID/router-agent:latest \
    --platform managed \
    --region us-central1 \
    --allow-unauthenticated \
    --port 7860 \
    --memory 8Gi \
    --cpu 4 \
    --timeout 3600 \
    --set-env-vars "HF_TOKEN=your-token,GRADIO_SERVER_NAME=0.0.0.0,GRADIO_SERVER_PORT=7860"
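
Note that --set-env-vars stores the token in plain text in the service configuration. A sketch of the more secure alternative, assuming a Secret Manager secret named hf-token has been created (see the Security section below):

# Mount the token from Secret Manager instead of a plain env var
gcloud run services update router-agent \
    --region us-central1 \
    --set-secrets "HF_TOKEN=hf-token:latest"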

3. Deploy to Compute Engine (GPU)

# Create VM with GPU
gcloud compute instances create router-agent-gpu \
    --zone=us-central1-a \
    --machine-type=n1-standard-4 \
    --accelerator="type=nvidia-tesla-t4,count=1" \
    --image-family=cos-stable \
    --image-project=cos-cloud \
    --boot-disk-size=100GB \
    --maintenance-policy=TERMINATE \
    --scopes=https://www.googleapis.com/auth/cloud-platform

# SSH into instance
gcloud compute ssh router-agent-gpu --zone=us-central1-a

# On the VM: Container-Optimized OS ships with Docker preinstalled;
# install the NVIDIA driver with: sudo cos-extensions install gpu
# Note: --gpus all assumes the NVIDIA container runtime; on COS you may
# instead need the driver volume/device mounts from the COS GPU docs.
# Then pull and run the container
docker pull gcr.io/YOUR_PROJECT_ID/router-agent:latest
docker run -d \
    --name router-agent \
    --gpus all \
    -p 7860:7860 \
    -e HF_TOKEN="your-token" \
    gcr.io/YOUR_PROJECT_ID/router-agent:latest
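
To verify the container actually sees the GPU (this assumes the image is built on a CUDA base and includes nvidia-smi):

# Check GPU visibility from inside the container
docker exec router-agent nvidia-smi

# Follow the application logs
docker logs -f router-agent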

Environment Variables

Set these in Cloud Run or as VM metadata:

  • HF_TOKEN: Hugging Face access token (required for private models)
  • GRADIO_SERVER_NAME: Address the server binds to (default: 0.0.0.0)
  • GRADIO_SERVER_PORT: Server port (default: 7860)
  • ROUTER_PREFETCH_MODELS: Comma-separated list of models to preload
  • ROUTER_WARM_REMAINING: Set to "1" to warm the remaining models (see the update example below)
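
Because ROUTER_PREFETCH_MODELS is itself comma-separated, updating it via gcloud needs an alternate list delimiter (gcloud's ^DELIM^ escaping; see gcloud topic escaping for details). The model IDs below are placeholders:

# Use "@" as the key=value delimiter so the commas inside the value survive
gcloud run services update router-agent \
    --region us-central1 \
    --update-env-vars "^@^ROUTER_PREFETCH_MODELS=org/model-a,org/model-b@ROUTER_WARM_REMAINING=1"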

Monitoring and Logs

Cloud Run Logs

gcloud run services logs read router-agent --region us-central1

Compute Engine Logs

gcloud compute instances get-serial-port-output router-agent-gpu --zone us-central1-a
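
Serial-port output covers the boot sequence; for application logs you can tail the container remotely in a single command:

# Tail the container's logs over SSH without an interactive session
gcloud compute ssh router-agent-gpu --zone us-central1-a \
    --command 'docker logs --tail 100 router-agent'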

Cost Optimization

  1. Cloud Run: Use only when needed, auto-scales to zero
  2. Compute Engine:
    • Use preemptible instances for up to ~80% cost savings, with the risk of termination (see the example after this list)
    • Stop instance when not in use: gcloud compute instances stop router-agent-gpu --zone us-central1-a
    • Use smaller GPU types (T4) for development, larger (A100) for production
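
A preemptible variant is the same create command as before with one extra flag; the instance may be reclaimed at any time, so persist anything important off the boot disk:

# Identical to the earlier create command, plus --preemptible
gcloud compute instances create router-agent-gpu \
    --zone=us-central1-a \
    --machine-type=n1-standard-4 \
    --accelerator="type=nvidia-tesla-t4,count=1" \
    --image-family=cos-stable \
    --image-project=cos-cloud \
    --boot-disk-size=100GB \
    --maintenance-policy=TERMINATE \
    --preemptible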

Troubleshooting

GPU Not Available

  • Check GPU quota: gcloud compute project-info describe --project YOUR_PROJECT_ID (a per-region check is shown below)
  • Request quota increase if needed
  • Verify GPU drivers are installed on Compute Engine VM
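
project-info describe lists project-wide quotas; for the per-region GPU quota, a more targeted sketch:

# Show quota usage for the region (GPU metrics look like NVIDIA_T4_GPUS)
gcloud compute regions describe us-central1 \
    --flatten="quotas[]" \
    --format="table(quotas.metric,quotas.limit,quotas.usage)"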

Out of Memory

  • Increase Cloud Run memory: --memory 16Gi
  • Use larger VM instance type
  • Enable model quantization (AWQ/BitsAndBytes)

Cold Starts (Cloud Run)

  • Use Cloud Run min-instances to keep an instance warm (see the example below)
  • Pre-warm models on startup
  • Consider Compute Engine for always-on workloads
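
For example, keeping one instance warm (note it is billed even while idle):

# Keep at least one warm instance to avoid cold starts
gcloud run services update router-agent \
    --region us-central1 \
    --min-instances 1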

Security

  1. Authentication: Use Cloud Run authentication or Cloud IAP for Compute Engine
  2. Secrets: Store HF_TOKEN in Secret Manager instead of plain environment variables (see the sketch after this list)
  3. Firewall: Restrict access to specific IP ranges
  4. HTTPS: Use Cloud Load Balancer with SSL certificate
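
A sketch of the Secret Manager setup referenced in item 2; the secret name hf-token and the service-account email are placeholders:

# Create the secret from the token already in your environment
printf '%s' "$HF_TOKEN" | gcloud secrets create hf-token --data-file=-

# Grant the runtime service account read access
gcloud secrets add-iam-policy-binding hf-token \
    --member="serviceAccount:YOUR_SERVICE_ACCOUNT_EMAIL" \
    --role="roles/secretmanager.secretAccessor"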

Next Steps

  1. Set up Cloud Load Balancer for HTTPS
  2. Configure monitoring and alerts
  3. Set up CI/CD with Cloud Build
  4. Use Cloud Storage for model caching
  5. Implement auto-scaling policies