How to Use Nano Banana Pro API: Complete Guide (2026)
Deploy and serve AI models with Nano Banana Pro
Nano Banana Pro is a serverless GPU inference platform that lets you deploy and serve AI models via API without managing infrastructure. It is especially popular for running custom models, fine-tuned LLMs, and image generation pipelines.
This guide covers everything from initial setup to production deployment with code examples.
What Is Nano Banana Pro?
Banana.dev (now Banana Pro / Nano Banana Pro) provides serverless GPU compute for AI inference. Instead of renting a dedicated GPU server, you deploy your model and pay only for the compute time you actually use.
Key features
- Serverless GPU inference -- no server management
- Cold start optimization -- models load in seconds
- Auto-scaling -- handles traffic spikes automatically
- Custom model support -- deploy any PyTorch/TensorFlow model
- Pre-built templates -- quick-start with popular models
- Pay-per-second billing -- no idle charges
Getting Started
Step 1: Create an account
- Go to banana.dev and sign up
- Navigate to the dashboard
- Generate an API key from Settings > API Keys
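Rather than hard-coding the key in source files, it is common to read it from an environment variable. A minimal sketch; the BANANA_API_KEY variable name is just an example, not something the platform requires:

import os

# Hypothetical variable name -- set it however your deployment manages secrets
api_key = os.environ["BANANA_API_KEY"]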
Step 2: Install the SDK
# Python SDK
pip install banana-dev
# Node.js SDK
npm install @banana-dev/banana-dev
Step 3: Run your first inference
import banana_dev as banana

api_key = "your-api-key"
model_key = "your-model-key"  # from the dashboard

# Run inference
result = banana.run(
    api_key=api_key,
    model_key=model_key,
    model_inputs={
        "prompt": "A futuristic city at sunset, cyberpunk style",
        "num_inference_steps": 30,
        "guidance_scale": 7.5,
        "width": 1024,
        "height": 1024
    }
)

print(result["modelOutputs"])
import Banana from "@banana-dev/banana-dev";

const banana = new Banana.default("your-api-key");

const result = await banana.run("your-model-key", {
  prompt: "A futuristic city at sunset, cyberpunk style",
  num_inference_steps: 30,
  guidance_scale: 7.5,
  width: 1024,
  height: 1024,
});

console.log(result.modelOutputs);
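Calls to a serverless endpoint can fail transiently (cold starts, network hiccups), so it is worth wrapping them in a small retry helper. A minimal Python sketch, assuming the banana.run signature from the example above; the helper name and retry policy are illustrative, not part of the SDK:

import time
import banana_dev as banana

def run_with_retries(api_key, model_key, model_inputs, retries=3):
    """Call banana.run, retrying on failures with exponential backoff."""
    for attempt in range(retries):
        try:
            return banana.run(api_key=api_key, model_key=model_key, model_inputs=model_inputs)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ... between attempts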
Deploying a Custom Model
Using the Banana Potassium Framework
Potassium is Banana's framework for wrapping any Python model into a deployable API.
Step 1: Create your project
mkdir my-model && cd my-model
pip install potassium
Step 2: Write your app.py
from potassium import Potassium, Request, Response
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

app = Potassium("my-llm-api")

@app.init
def init():
    """Load the model into memory once at startup."""
    model_name = "microsoft/phi-4"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map="auto"
    )
    context = {
        "model": model,
        "tokenizer": tokenizer
    }
    return context

@app.handler()
def handler(context: dict, request: Request) -> Response:
    """Handle inference requests."""
    model = context["model"]
    tokenizer = context["tokenizer"]

    prompt = request.json["prompt"]
    max_tokens = request.json.get("max_tokens", 512)
    temperature = request.json.get("temperature", 0.7)

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=temperature,
            do_sample=True
        )

    response_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return Response(
        json={"output": response_text},
        status=200
    )

if __name__ == "__main__":
    app.serve()
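Before deploying, you can start the app locally (python app.py) and send it a test request. A minimal sketch using the requests library, assuming the default Potassium route at / and port 8000, the port exposed in the Dockerfile below:

import requests

# Hypothetical local test -- adjust host/port if your app serves elsewhere
resp = requests.post(
    "http://localhost:8000/",
    json={"prompt": "Write a haiku about GPUs", "max_tokens": 64}
)
print(resp.json()["output"])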
Step 3: Create your Dockerfile
FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# Pre-download model weights during build
RUN python -c "from transformers import AutoModelForCausalLM, AutoTokenizer; \
AutoTokenizer.from_pretrained('microsoft/phi-4'); \
AutoModelForCausalLM.from_pretrained('microsoft/phi-4')"
EXPOSE 8000
CMD ["python", "app.py"]
Step 4: Deploy
# Install Banana CLI
pip install banana-cli
# Login
banana login
# Deploy
banana deploy
Deploying Pre-Built Models
Banana Pro offers templates for popular models. Here are the most common:
Stable Diffusion XL
import banana_dev as banana

result = banana.run(
    api_key="your-api-key",
    model_key="sdxl-model-key",
    model_inputs={
        "prompt": "professional product photo of sneakers, studio lighting",
        "negative_prompt": "blurry, low quality, distorted",
        "num_inference_steps": 30,
        "guidance_scale": 7.5,
        "width": 1024,
        "height": 1024,
        "seed": 42
    }
)

image_base64 = result["modelOutputs"][0]["image"]
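The exact output field depends on the model template (here, the "image" key from the snippet above). Assuming it holds a base64-encoded image, you can decode and save it like this:

import base64

with open("sneakers.png", "wb") as f:
    f.write(base64.b64decode(image_base64))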
Whisper (Speech-to-Text)
result = banana.run(
    api_key="your-api-key",
    model_key="whisper-model-key",
    model_inputs={
        "audio_url": "https://example.com/audio.mp3",
        "language": "en",
        "task": "transcribe"
    }
)

transcription = result["modelOutputs"][0]["text"]
print(transcription)
LLM Inference (Llama, Mistral)
result = banana.run(
    api_key="your-api-key",
    model_key="llama-model-key",
    model_inputs={
        "prompt": "Explain the difference between REST and GraphQL APIs:",
        "max_tokens": 1024,
        "temperature": 0.7,
        "top_p": 0.9
    }
)

print(result["modelOutputs"][0]["text"])
Async and Webhook Support
For long-running tasks, use async mode with webhooks:
import banana_dev as banana

# Start async inference
async_result = banana.start(
    api_key="your-api-key",
    model_key="your-model-key",
    model_inputs={
        "prompt": "Generate a detailed 3D scene...",
        "num_inference_steps": 100
    },
    webhook_url="https://your-server.com/webhook"
)

call_id = async_result["callID"]
print(f"Job started: {call_id}")

# Or poll for results
import time

while True:
    status = banana.check(api_key="your-api-key", call_id=call_id)
    if status["message"] == "success":
        print("Done:", status["modelOutputs"])
        break
    time.sleep(2)
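The loop above polls indefinitely; in practice you usually want a timeout so a stuck job does not hang your process. A minimal sketch, assuming the same banana.check call and response shape shown above:

import time
import banana_dev as banana

def wait_for_result(api_key, call_id, timeout=300, interval=2):
    """Poll banana.check until the job succeeds or the timeout is reached."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = banana.check(api_key=api_key, call_id=call_id)
        if status["message"] == "success":
            return status["modelOutputs"]
        time.sleep(interval)
    raise TimeoutError(f"Job {call_id} did not finish within {timeout}s")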
Webhook handler
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def banana_webhook():
    data = request.json
    call_id = data["callID"]
    outputs = data["modelOutputs"]

    # Process results
    print(f"Job {call_id} completed: {outputs}")
    return "OK", 200

if __name__ == "__main__":
    app.run()  # serve this endpoint at the webhook_url passed to banana.start
Pricing
| GPU Tier | GPU | VRAM | Price (per second) | Best For |
|---|---|---|---|---|
| Basic | A10 | 24GB | ~$0.00019/s | Small models, SDXL |
| Standard | A100 40GB | 40GB | ~$0.00032/s | Medium LLMs, fine-tuned models |
| Premium | A100 80GB | 80GB | ~$0.00055/s | Large LLMs (70B+) |
| Ultra | H100 | 80GB | ~$0.00095/s | Fastest inference |
Cost example
Running Stable Diffusion XL on an A10:
- Average inference time: ~3 seconds
- Cost per image: ~$0.00057
- 1,000 images: ~$0.57
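The arithmetic is simple enough to script when budgeting a workload; here is the same calculation in Python, using the A10 rate from the pricing table above (rates may change):

price_per_second = 0.00019   # A10 tier, from the table above
seconds_per_image = 3        # average SDXL inference time

cost_per_image = price_per_second * seconds_per_image
print(f"Per image: ${cost_per_image:.5f}")               # ~$0.00057
print(f"Per 1,000 images: ${cost_per_image * 1000:.2f}")  # ~$0.57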
Nano Banana Pro vs Alternatives
| Feature | Nano Banana Pro | Replicate | Modal | RunPod Serverless |
|---|---|---|---|---|
| Pricing model | Per-second | Per-second | Per-second | Per-second |
| Cold start | ~5-15s | ~5-30s | ~2-10s | ~5-20s |
| Custom models | Yes | Yes | Yes | Yes |
| Pre-built models | Limited | Large library | Limited | Limited |
| GPU options | A10, A100, H100 | A40, A100, H100 | A10, A100, H100 | Many options |
| Min cost | Pay-as-you-go | Pay-as-you-go | Pay-as-you-go | Pay-as-you-go |
| Framework | Potassium | Cog | Modal | Custom Docker |
Best Practices
1. Optimize cold starts
# Pre-load models in the init function, not in the handler
@app.init
def init():
    # This runs ONCE when the container starts
    model = load_model()  # Do heavy loading here
    return {"model": model}

@app.handler()
def handler(context, request):
    # This runs on EVERY request -- keep it fast
    model = context["model"]
    return Response(json={"output": model.predict(request.json)}, status=200)
2. Use model caching
# Download model weights during Docker build
RUN python download_model.py
# This prevents downloading on every cold start
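The download_model.py referenced above is not shown; a minimal sketch for the Phi-4 example used earlier (swap in whatever model you actually deploy) could be:

# download_model.py -- run at Docker build time so weights are baked into the image
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "microsoft/phi-4"

AutoTokenizer.from_pretrained(MODEL_NAME)
AutoModelForCausalLM.from_pretrained(MODEL_NAME)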
3. Batch requests
@app.handler()
def handler(context, request):
    model = context["model"]
    prompts = request.json.get("prompts", [request.json["prompt"]])

    # Process batch for better GPU utilization
    results = model.batch_generate(prompts)

    return Response(json={"outputs": results}, status=200)
4. Set up monitoring
import time
import logging

@app.handler()
def handler(context, request):
    start_time = time.time()

    result = run_inference(context, request)

    duration = time.time() - start_time
    logging.info(f"Inference took {duration:.2f}s")

    return Response(
        json={"output": result, "inference_time": duration},
        status=200
    )
When to Use Nano Banana Pro
Good fit:
- Custom or fine-tuned models that are not on standard platforms
- Variable traffic with unpredictable spikes
- Cost-sensitive projects (pay-per-second vs. hourly GPU rental)
- Prototyping and testing model deployments
Not ideal:
- Consistently high traffic (dedicated GPUs may be cheaper)
- Ultra-low-latency requirements (cold starts add delay)
- Simple use cases covered by standard APIs
Alternative: Use Pre-Built AI APIs
If you do not need to deploy custom models and just want to call AI capabilities via API, a managed platform is simpler. Hypereal AI provides ready-to-use APIs for image generation, video creation, lip sync, voice cloning, and more -- no model deployment required. You get instant inference with no cold starts and transparent credit-based pricing.
Conclusion
Nano Banana Pro is a solid choice for deploying custom AI models without managing GPU infrastructure. The Potassium framework makes it straightforward to wrap any Python model into a production API.
Start with a pre-built template to test the platform, then move to custom deployments as your needs grow. For standard AI tasks like image or video generation, consider whether a managed API might be simpler than self-deployment.