Google Cloud Platform Deployment Guide

This guide covers deploying the Router Agent application to Google Cloud Platform with GPU support.

Prerequisites

  1. Google Cloud Account with billing enabled
  2. gcloud CLI installed and configured
    curl https://sdk.cloud.google.com | bash
    gcloud init
    
  3. Docker installed locally
  4. HF_TOKEN environment variable set, for accessing private models (a quick sanity check of these prerequisites follows below)
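
Before moving on, you can sanity-check the prerequisites; the token value below is a placeholder:

# Verify the CLI tooling is installed and authenticated
gcloud --version
docker --version
gcloud auth list

# Placeholder; substitute your real Hugging Face token
export HF_TOKEN="your-huggingface-token"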

Deployment Options

Option 1: Cloud Run (Serverless, CPU only)

Pros:

  • Serverless, pay-per-use
  • Auto-scaling
  • No VM management

Cons:

  • No GPU support (CPU inference only)
  • Cold starts
  • Memory-constrained (this guide uses 8 GiB; see Troubleshooting for raising it)

Steps:

# Set your project ID
export GCP_PROJECT_ID="your-project-id"
export GCP_REGION="us-central1"

# Make script executable
chmod +x deploy-gcp.sh

# Deploy to Cloud Run
./deploy-gcp.sh cloud-run

Cost: ~$0.10-0.50/hour when active (depends on traffic)
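
Once deployed, you can fetch the service URL and smoke-test it. This assumes the service name router-agent used in the manual steps below; check deploy-gcp.sh if yours differs:

# Fetch the deployed service URL and confirm it responds
URL=$(gcloud run services describe router-agent \
    --region us-central1 --format 'value(status.url)')
curl -I "$URL"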

Option 2: Compute Engine with GPU (Recommended for Production)

Pros:

  • Full GPU support (T4, V100, A100)
  • Persistent instance
  • Better for long-running workloads
  • Lower latency (no cold starts)

Cons:

  • Requires VM management
  • Higher cost for always-on instances

Steps:

# Set your project ID and zone
export GCP_PROJECT_ID="your-project-id"
export GCP_ZONE="us-central1-a"
export HF_TOKEN="your-huggingface-token"

# Make script executable
chmod +x deploy-compute-engine.sh

# Deploy to Compute Engine
./deploy-compute-engine.sh

GPU Options:

  • T4 (nvidia-tesla-t4): ~$0.35/hour - Good for 27B-32B models with quantization
  • V100 (nvidia-tesla-v100): ~$2.50/hour - Better performance
  • A100 (nvidia-a100): ~$3.50/hour - Best performance for large models

Cost: GPU instance + storage (~$0.35-3.50/hour depending on GPU type)
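
GPU availability varies by zone, so before deploying it is worth listing which accelerator types your chosen zone actually offers:

# List accelerator types available in the target zone
gcloud compute accelerator-types list --filter="zone:us-central1-a"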

Manual Deployment Steps

1. Build and Push Docker Image

# Authenticate Docker
gcloud auth configure-docker

# Set project
gcloud config set project YOUR_PROJECT_ID

# Build image
docker build -t gcr.io/YOUR_PROJECT_ID/router-agent:latest .

# Push to Container Registry
docker push gcr.io/YOUR_PROJECT_ID/router-agent:latest
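
To confirm the push succeeded, you can list the image's tags in the registry:

# Verify the image landed in Container Registry
gcloud container images list-tags gcr.io/YOUR_PROJECT_ID/router-agent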

2. Deploy to Cloud Run (CPU)

gcloud run deploy router-agent \
    --image gcr.io/YOUR_PROJECT_ID/router-agent:latest \
    --platform managed \
    --region us-central1 \
    --allow-unauthenticated \
    --port 7860 \
    --memory 8Gi \
    --cpu 4 \
    --timeout 3600 \
    --set-env-vars "HF_TOKEN=your-token,GRADIO_SERVER_NAME=0.0.0.0,GRADIO_SERVER_PORT=7860"
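
Note that --set-env-vars stores the token in plain text in the service configuration. A sketch of the more secure alternative, assuming a Secret Manager secret named hf-token has been created (see the Security section below):

# Mount the token from Secret Manager instead of a plain env var
gcloud run services update router-agent \
    --region us-central1 \
    --set-secrets "HF_TOKEN=hf-token:latest"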

3. Deploy to Compute Engine (GPU)

# Create VM with GPU
gcloud compute instances create router-agent-gpu \
    --zone=us-central1-a \
    --machine-type=n1-standard-4 \
    --accelerator="type=nvidia-tesla-t4,count=1" \
    --image-family=cos-stable \
    --image-project=cos-cloud \
    --boot-disk-size=100GB \
    --maintenance-policy=TERMINATE \
    --scopes=https://www.googleapis.com/auth/cloud-platform

# SSH into instance
gcloud compute ssh router-agent-gpu --zone=us-central1-a

# On the VM: Container-Optimized OS ships with Docker preinstalled;
# install the NVIDIA driver with: sudo cos-extensions install gpu
# Note: --gpus all assumes the NVIDIA container runtime; on COS you may
# instead need the driver volume/device mounts from the COS GPU docs.
# Then pull and run the container
docker pull gcr.io/YOUR_PROJECT_ID/router-agent:latest
docker run -d \
    --name router-agent \
    --gpus all \
    -p 7860:7860 \
    -e HF_TOKEN="your-token" \
    gcr.io/YOUR_PROJECT_ID/router-agent:latest
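
To verify the container actually sees the GPU (this assumes the image is built on a CUDA base and includes nvidia-smi):

# Check GPU visibility from inside the container
docker exec router-agent nvidia-smi

# Follow the application logs
docker logs -f router-agent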

Environment Variables

Set these in Cloud Run or as VM metadata:

  • HF_TOKEN: Hugging Face access token (required for private models)
  • GRADIO_SERVER_NAME: Address the server binds to (default: 0.0.0.0)
  • GRADIO_SERVER_PORT: Server port (default: 7860)
  • ROUTER_PREFETCH_MODELS: Comma-separated list of models to preload
  • ROUTER_WARM_REMAINING: Set to "1" to warm the remaining models (see the update example below)
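
Because ROUTER_PREFETCH_MODELS is itself comma-separated, updating it via gcloud needs an alternate list delimiter (gcloud's ^DELIM^ escaping; see gcloud topic escaping for details). The model IDs below are placeholders:

# Use "@" as the key=value delimiter so the commas inside the value survive
gcloud run services update router-agent \
    --region us-central1 \
    --update-env-vars "^@^ROUTER_PREFETCH_MODELS=org/model-a,org/model-b@ROUTER_WARM_REMAINING=1"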

Monitoring and Logs

Cloud Run Logs

gcloud run services logs read router-agent --region us-central1

Compute Engine Logs

gcloud compute instances get-serial-port-output router-agent-gpu --zone us-central1-a
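
Serial-port output covers the boot sequence; for application logs you can tail the container remotely in a single command:

# Tail the container's logs over SSH without an interactive session
gcloud compute ssh router-agent-gpu --zone us-central1-a \
    --command 'docker logs --tail 100 router-agent'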

Cost Optimization

  1. Cloud Run: Use only when needed, auto-scales to zero
  2. Compute Engine:
    • Use preemptible instances for up to ~80% cost savings, with the risk of termination (see the example after this list)
    • Stop instance when not in use: gcloud compute instances stop router-agent-gpu --zone us-central1-a
    • Use smaller GPU types (T4) for development, larger (A100) for production
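
A preemptible variant is the same create command as before with one extra flag; the instance may be reclaimed at any time, so persist anything important off the boot disk:

# Identical to the earlier create command, plus --preemptible
gcloud compute instances create router-agent-gpu \
    --zone=us-central1-a \
    --machine-type=n1-standard-4 \
    --accelerator="type=nvidia-tesla-t4,count=1" \
    --image-family=cos-stable \
    --image-project=cos-cloud \
    --boot-disk-size=100GB \
    --maintenance-policy=TERMINATE \
    --preemptible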

Troubleshooting

GPU Not Available

  • Check GPU quota: gcloud compute project-info describe --project YOUR_PROJECT_ID (a per-region check is shown below)
  • Request quota increase if needed
  • Verify GPU drivers are installed on Compute Engine VM
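
project-info describe lists project-wide quotas; for the per-region GPU quota, a more targeted sketch:

# Show quota usage for the region (GPU metrics look like NVIDIA_T4_GPUS)
gcloud compute regions describe us-central1 \
    --flatten="quotas[]" \
    --format="table(quotas.metric,quotas.limit,quotas.usage)"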

Out of Memory

  • Increase Cloud Run memory: --memory 16Gi
  • Use larger VM instance type
  • Enable model quantization (AWQ/BitsAndBytes)

Cold Starts (Cloud Run)

  • Use Cloud Run min-instances to keep an instance warm (see the example below)
  • Pre-warm models on startup
  • Consider Compute Engine for always-on workloads
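
For example, keeping one instance warm (note it is billed even while idle):

# Keep at least one warm instance to avoid cold starts
gcloud run services update router-agent \
    --region us-central1 \
    --min-instances 1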

Security

  1. Authentication: Use Cloud Run authentication or Cloud IAP for Compute Engine
  2. Secrets: Store HF_TOKEN in Secret Manager instead of plain environment variables (see the sketch after this list)
  3. Firewall: Restrict access to specific IP ranges
  4. HTTPS: Use Cloud Load Balancer with SSL certificate
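
A sketch of the Secret Manager setup referenced in item 2; the secret name hf-token and the service-account email are placeholders:

# Create the secret from the token already in your environment
printf '%s' "$HF_TOKEN" | gcloud secrets create hf-token --data-file=-

# Grant the runtime service account read access
gcloud secrets add-iam-policy-binding hf-token \
    --member="serviceAccount:YOUR_SERVICE_ACCOUNT_EMAIL" \
    --role="roles/secretmanager.secretAccessor"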

Next Steps

  1. Set up Cloud Load Balancer for HTTPS
  2. Configure monitoring and alerts
  3. Set up CI/CD with Cloud Build
  4. Use Cloud Storage for model caching
  5. Implement auto-scaling policies