How to Use Nano Banana Pro API: Complete Guide (2026)
Deploy and serve AI models with Nano Banana Pro
Nano Banana Pro is a serverless GPU inference platform that lets you deploy and serve AI models via API without managing infrastructure. It is especially popular for running custom models, fine-tuned LLMs, and image generation pipelines.
This guide covers everything from initial setup to production deployment with code examples.
What Is Nano Banana Pro?
Banana.dev (now Banana Pro / Nano Banana Pro) provides serverless GPU compute for AI inference. Instead of renting a dedicated GPU server, you deploy your model and pay only for the compute time you actually use.
Key features
- Serverless GPU inference -- no server management
- Cold start optimization -- models load in seconds
- Auto-scaling -- handles traffic spikes automatically
- Custom model support -- deploy any PyTorch/TensorFlow model
- Pre-built templates -- quick-start with popular models
- Pay-per-second billing -- no idle charges
Getting Started
Step 1: Create an account
- Go to banana.dev and sign up
- Navigate to the dashboard
- Generate an API key from Settings > API Keys
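Rather than hard-coding the key in source files, it is common to read it from an environment variable. A minimal sketch; the BANANA_API_KEY variable name is just an example, not something the platform requires:

import os

# Hypothetical variable name -- set it however your deployment manages secrets
api_key = os.environ["BANANA_API_KEY"]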
Step 2: Install the SDK
# Python SDK
pip install banana-dev
# Node.js SDK
npm install @banana-dev/banana-dev
Step 3: Run your first inference
import banana_dev as banana

api_key = "your-api-key"
model_key = "your-model-key"  # from the dashboard

# Run inference
result = banana.run(
    api_key=api_key,
    model_key=model_key,
    model_inputs={
        "prompt": "A futuristic city at sunset, cyberpunk style",
        "num_inference_steps": 30,
        "guidance_scale": 7.5,
        "width": 1024,
        "height": 1024
    }
)

print(result["modelOutputs"])
import Banana from "@banana-dev/banana-dev";

const banana = new Banana.default("your-api-key");

const result = await banana.run("your-model-key", {
  prompt: "A futuristic city at sunset, cyberpunk style",
  num_inference_steps: 30,
  guidance_scale: 7.5,
  width: 1024,
  height: 1024,
});

console.log(result.modelOutputs);
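Calls to a serverless endpoint can fail transiently (cold starts, network hiccups), so it is worth wrapping them in a small retry helper. A minimal Python sketch, assuming the banana.run signature from the example above; the helper name and retry policy are illustrative, not part of the SDK:

import time
import banana_dev as banana

def run_with_retries(api_key, model_key, model_inputs, retries=3):
    """Call banana.run, retrying on failures with exponential backoff."""
    for attempt in range(retries):
        try:
            return banana.run(api_key=api_key, model_key=model_key, model_inputs=model_inputs)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ... between attempts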
Deploying a Custom Model
Using the Banana Potassium Framework
Potassium is Banana's framework for wrapping any Python model into a deployable API.
Step 1: Create your project
mkdir my-model && cd my-model
pip install potassium
Step 2: Write your app.py
from potassium import Potassium, Request, Response
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

app = Potassium("my-llm-api")

@app.init
def init():
    """Load the model into memory once at startup."""
    model_name = "microsoft/phi-4"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map="auto"
    )
    context = {
        "model": model,
        "tokenizer": tokenizer
    }
    return context

@app.handler()
def handler(context: dict, request: Request) -> Response:
    """Handle inference requests."""
    model = context["model"]
    tokenizer = context["tokenizer"]

    prompt = request.json["prompt"]
    max_tokens = request.json.get("max_tokens", 512)
    temperature = request.json.get("temperature", 0.7)

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=temperature,
            do_sample=True
        )

    response_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return Response(
        json={"output": response_text},
        status=200
    )

if __name__ == "__main__":
    app.serve()
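Before deploying, you can start the app locally (python app.py) and send it a test request. A minimal sketch using the requests library, assuming the default Potassium route at / and port 8000, the port exposed in the Dockerfile below:

import requests

# Hypothetical local test -- adjust host/port if your app serves elsewhere
resp = requests.post(
    "http://localhost:8000/",
    json={"prompt": "Write a haiku about GPUs", "max_tokens": 64}
)
print(resp.json()["output"])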
Step 3: Create your Dockerfile
FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# Pre-download model weights during build
RUN python -c "from transformers import AutoModelForCausalLM, AutoTokenizer; \
AutoTokenizer.from_pretrained('microsoft/phi-4'); \
AutoModelForCausalLM.from_pretrained('microsoft/phi-4')"
EXPOSE 8000
CMD ["python", "app.py"]
Step 4: Deploy
# Install Banana CLI
pip install banana-cli
# Login
banana login
# Deploy
banana deploy
Deploying Pre-Built Models
Banana Pro offers templates for popular models. Here are the most common:
Stable Diffusion XL
import banana_dev as banana

result = banana.run(
    api_key="your-api-key",
    model_key="sdxl-model-key",
    model_inputs={
        "prompt": "professional product photo of sneakers, studio lighting",
        "negative_prompt": "blurry, low quality, distorted",
        "num_inference_steps": 30,
        "guidance_scale": 7.5,
        "width": 1024,
        "height": 1024,
        "seed": 42
    }
)

image_base64 = result["modelOutputs"][0]["image"]
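The exact output field depends on the model template (here, the "image" key from the snippet above). Assuming it holds a base64-encoded image, you can decode and save it like this:

import base64

with open("sneakers.png", "wb") as f:
    f.write(base64.b64decode(image_base64))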
Whisper (Speech-to-Text)
result = banana.run(
    api_key="your-api-key",
    model_key="whisper-model-key",
    model_inputs={
        "audio_url": "https://example.com/audio.mp3",
        "language": "en",
        "task": "transcribe"
    }
)

transcription = result["modelOutputs"][0]["text"]
print(transcription)
LLM Inference (Llama, Mistral)
result = banana.run(
    api_key="your-api-key",
    model_key="llama-model-key",
    model_inputs={
        "prompt": "Explain the difference between REST and GraphQL APIs:",
        "max_tokens": 1024,
        "temperature": 0.7,
        "top_p": 0.9
    }
)

print(result["modelOutputs"][0]["text"])
Async and Webhook Support
For long-running tasks, use async mode with webhooks:
import banana_dev as banana

# Start async inference
async_result = banana.start(
    api_key="your-api-key",
    model_key="your-model-key",
    model_inputs={
        "prompt": "Generate a detailed 3D scene...",
        "num_inference_steps": 100
    },
    webhook_url="https://your-server.com/webhook"
)

call_id = async_result["callID"]
print(f"Job started: {call_id}")

# Or poll for results
import time

while True:
    status = banana.check(api_key="your-api-key", call_id=call_id)
    if status["message"] == "success":
        print("Done:", status["modelOutputs"])
        break
    time.sleep(2)
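The loop above polls indefinitely; in practice you usually want a timeout so a stuck job does not hang your process. A minimal sketch, assuming the same banana.check call and response shape shown above:

import time
import banana_dev as banana

def wait_for_result(api_key, call_id, timeout=300, interval=2):
    """Poll banana.check until the job succeeds or the timeout is reached."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = banana.check(api_key=api_key, call_id=call_id)
        if status["message"] == "success":
            return status["modelOutputs"]
        time.sleep(interval)
    raise TimeoutError(f"Job {call_id} did not finish within {timeout}s")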
Webhook handler
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def banana_webhook():
    data = request.json
    call_id = data["callID"]
    outputs = data["modelOutputs"]

    # Process results
    print(f"Job {call_id} completed: {outputs}")
    return "OK", 200

if __name__ == "__main__":
    app.run()  # serve this endpoint at the webhook_url passed to banana.start
Pricing
| GPU Tier | GPU | VRAM | Price (per second) | Best For |
|---|---|---|---|---|
| Basic | A10 | 24GB | ~$0.00019/s | Small models, SDXL |
| Standard | A100 40GB | 40GB | ~$0.00032/s | Medium LLMs, fine-tuned models |
| Premium | A100 80GB | 80GB | ~$0.00055/s | Large LLMs (70B+) |
| Ultra | H100 | 80GB | ~$0.00095/s | Fastest inference |
Cost example
Running Stable Diffusion XL on an A10:
- Average inference time: ~3 seconds
- Cost per image: ~$0.00057
- 1,000 images: ~$0.57
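The arithmetic is simple enough to script when budgeting a workload; here is the same calculation in Python, using the A10 rate from the pricing table above (rates may change):

price_per_second = 0.00019   # A10 tier, from the table above
seconds_per_image = 3        # average SDXL inference time

cost_per_image = price_per_second * seconds_per_image
print(f"Per image: ${cost_per_image:.5f}")               # ~$0.00057
print(f"Per 1,000 images: ${cost_per_image * 1000:.2f}")  # ~$0.57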
Nano Banana Pro vs Alternatives
| Feature | Nano Banana Pro | Replicate | Modal | RunPod Serverless |
|---|---|---|---|---|
| Pricing model | Per-second | Per-second | Per-second | Per-second |
| Cold start | ~5-15s | ~5-30s | ~2-10s | ~5-20s |
| Custom models | Yes | Yes | Yes | Yes |
| Pre-built models | Limited | Large library | Limited | Limited |
| GPU options | A10, A100, H100 | A40, A100, H100 | A10, A100, H100 | Many options |
| Min cost | Pay-as-you-go | Pay-as-you-go | Pay-as-you-go | Pay-as-you-go |
| Framework | Potassium | Cog | Modal | Custom Docker |
Best Practices
1. Optimize cold starts
# Pre-load models in the init function, not in the handler
@app.init
def init():
    # This runs ONCE when the container starts
    model = load_model()  # Do heavy loading here
    return {"model": model}

@app.handler()
def handler(context, request):
    # This runs on EVERY request -- keep it fast
    model = context["model"]
    return Response(json={"output": model.predict(request.json)}, status=200)
2. Use model caching
# Download model weights during Docker build
RUN python download_model.py
# This prevents downloading on every cold start
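The download_model.py referenced above is not shown; a minimal sketch for the Phi-4 example used earlier (swap in whatever model you actually deploy) could be:

# download_model.py -- run at Docker build time so weights are baked into the image
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "microsoft/phi-4"

AutoTokenizer.from_pretrained(MODEL_NAME)
AutoModelForCausalLM.from_pretrained(MODEL_NAME)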
3. Batch requests
@app.handler()
def handler(context, request):
    model = context["model"]
    prompts = request.json.get("prompts", [request.json["prompt"]])

    # Process batch for better GPU utilization
    results = model.batch_generate(prompts)

    return Response(json={"outputs": results}, status=200)
4. Set up monitoring
import time
import logging

@app.handler()
def handler(context, request):
    start_time = time.time()

    result = run_inference(context, request)

    duration = time.time() - start_time
    logging.info(f"Inference took {duration:.2f}s")

    return Response(
        json={"output": result, "inference_time": duration},
        status=200
    )
When to Use Nano Banana Pro
Good fit:
- Custom or fine-tuned models that are not on standard platforms
- Variable traffic with unpredictable spikes
- Cost-sensitive projects (pay-per-second vs. hourly GPU rental)
- Prototyping and testing model deployments
Not ideal:
- Consistently high traffic (dedicated GPUs may be cheaper)
- Ultra-low-latency requirements (cold starts add delay)
- Simple use cases covered by standard APIs
Alternative: Use Pre-Built AI APIs
If you do not need to deploy custom models and just want to call AI capabilities via API, a managed platform is simpler. Hypereal AI provides ready-to-use APIs for image generation, video creation, lip sync, voice cloning, and more -- no model deployment required. You get instant inference with no cold starts and transparent credit-based pricing.
Conclusion
Nano Banana Pro is a solid choice for deploying custom AI models without managing GPU infrastructure. The Potassium framework makes it straightforward to wrap any Python model into a production API.
Start with a pre-built template to test the platform, then move to custom deployments as your needs grow. For standard AI tasks like image or video generation, consider whether a managed API might be simpler than self-deployment.