What if you could point your phone at a restaurant menu in any language and instantly get the cultural story behind each dish, hear how to pronounce it, see a photorealistic image of what it looks like, and know if it is safe for your allergies? That is exactly what Menu to Food Tour does, powered by 5 AI agents working together in real time.
Built for the Gemini Live Agent Challenge
This project was built for the Gemini Live Agent Challenge, a Google-sponsored competition with over 10,000 participants and an $80,000 prize pool. The challenge theme is “Redefining Interaction: From Static Chatbots to Immersive Experiences”, pushing developers to go beyond simple text-based AI and build truly multimodal, agentic applications.
The competition asks participants to leverage Gemini models, Google’s Agent Development Kit (ADK), and Google Cloud to create next-generation AI agents that can see, hear, speak, and reason. Menu to Food Tour fits squarely in the Creative Storyteller category, seamlessly combining text, images, and audio in a single interactive output stream.
The Problem: Lost in Translation at Dinner
We have all been there. You sit down at a restaurant serving unfamiliar cuisine, the menu is in another language (or uses terms you do not recognize), and you have no idea what to order. You could Google each dish one by one, but that kills the dining experience. And what about allergies? Pronunciation? The cultural significance of what you are about to eat?
Traditional translation apps give you a dry word-for-word translation. Menu to Food Tour gives you an experience.
How It Works: 5 AI Agents, One Delicious Experience
The magic happens through a team of 5 specialized AI agents, each with a specific job. They work in parallel so you get results in seconds, not minutes. Here is the breakdown:
1. Vision Agent – The Menu Reader
Snap a photo of any menu (or type dish names manually). The Vision Agent uses Gemini 2.5 Flash to analyze the image and extract every dish name, even from handwritten menus or complex multi-column layouts. It outputs a clean, structured JSON array of dishes to process.
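Even with a strong model, raw output needs light post-processing before the pipeline can fan out per dish. Here is a minimal, illustrative sketch of that step (the function name and fence-stripping details are assumptions, not the project's actual code):

```python
import json
import re

def parse_dish_list(model_text: str) -> list[str]:
    """Extract a JSON array of dish names from raw model output.

    Models sometimes wrap JSON in markdown fences, so strip those first.
    (Illustrative helper; the project's actual parsing may differ.)
    """
    # Remove ```json ... ``` fences if the model added them
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", model_text.strip())
    dishes = json.loads(cleaned)
    if not isinstance(dishes, list):
        raise ValueError("expected a JSON array of dish names")
    return [str(d).strip() for d in dishes if str(d).strip()]
```

With a guard like this, a fenced response such as ` ```json ["Tonkotsu Ramen", "Gyoza"] ``` ` still yields a clean Python list for the downstream agents.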
2. Story Agent – The Cultural Guide
For each dish, the Story Agent crafts a rich cultural narrative. Where did this dish originate? What is the tradition behind it? What makes it special? The stories are grounded using Google Search to ensure factual accuracy and reduce hallucination. You also get the original-language translation and the cuisine origin.
3. Image Agent – The Food Photographer
No more guessing what a dish looks like. The Image Agent uses Imagen 3.0 to generate a photorealistic image of each dish. It even races two models simultaneously and uses whichever returns first, so you never wait longer than necessary.
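The racing pattern is simple to express with asyncio. A sketch of the idea, using stand-in coroutines in place of the real Imagen calls:

```python
import asyncio

async def race_image_models(prompt: str, generators) -> bytes:
    """Fire the same prompt at multiple image models and return the first
    finished result, cancelling the slower tasks.
    (Sketch with stand-in generators; the real agent calls Imagen variants.)
    """
    tasks = [asyncio.create_task(gen(prompt)) for gen in generators]
    try:
        done, _pending = await asyncio.wait(
            tasks, return_when=asyncio.FIRST_COMPLETED
        )
        return done.pop().result()
    finally:
        for t in tasks:
            t.cancel()  # no-op for the task that already finished

# Stand-in "models" with different latencies
async def fast_model(prompt):
    await asyncio.sleep(0.01)
    return b"fast-image-bytes"

async def slow_model(prompt):
    await asyncio.sleep(0.5)
    return b"slow-image-bytes"

result = asyncio.run(race_image_models("tonkotsu ramen", [fast_model, slow_model]))
```

The caller always gets the fastest response, and the loser is cancelled rather than billed for a full generation it will never use.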
4. Audio Agent – The Pronunciation Coach
Want to order like a local? The Audio Agent generates native-language pronunciation audio using Gemini 2.5 Flash TTS. It supports multiple languages including Japanese, Italian, French, Chinese, and more. Just tap the play button and repeat after it.
5. Allergy Agent – The Safety Inspector
Dining with allergies or dietary restrictions? The Allergy Agent analyzes each dish for common allergens (nuts, gluten, dairy, shellfish, etc.), flags dietary categories (vegan, halal, keto), and provides a safety level rating. This alone could save someone from a dangerous allergic reaction.
The Architecture: How It All Connects
Here is the complete sequence diagram showing how data flows from your menu photo through all 5 agents and back to the browser in real time via Server-Sent Events (SSE):

The diagram shows 6 distinct phases:
- Menu Input – Upload a photo, use your camera, or type dish names
- Tour Request – The server checks an LRU cache (so repeated menus are instant) and opens an SSE stream
- Dish Extraction – Vision Agent pulls out all dish names from your menu
- Parallel Processing – Story, Image, Audio, and Allergy agents all fire simultaneously for each dish
- Completion – Results cached, tour saved to browser IndexedDB for offline access
- Post-Tour – Chat about your dishes, replay pronunciations, check specific allergies, or browse past tours
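The cache check in the Tour Request phase can be sketched with a tiny LRU keyed by a hash of the extracted dish list. The key scheme and eviction policy here are illustrative assumptions, not the project's exact implementation:

```python
import hashlib
import json
from collections import OrderedDict

class TourCache:
    """Tiny LRU cache keyed by a hash of the extracted dish list, so
    re-uploading the same menu returns the finished tour instantly.
    (Illustrative; the real cache key and policy may differ.)
    """
    def __init__(self, max_entries: int = 128):
        self.max_entries = max_entries
        self._store: OrderedDict[str, dict] = OrderedDict()

    @staticmethod
    def key_for(dishes: list[str]) -> str:
        # Normalize order and case so trivially reshuffled menus still hit
        canonical = json.dumps(sorted(d.lower() for d in dishes))
        return hashlib.sha256(canonical.encode()).hexdigest()

    def get(self, key: str):
        if key in self._store:
            self._store.move_to_end(key)  # mark as recently used
            return self._store[key]
        return None

    def put(self, key: str, tour: dict) -> None:
        self._store[key] = tour
        self._store.move_to_end(key)
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least recently used
```

On a cache hit, the server can replay the finished tour over the SSE stream without invoking any agents at all.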
Real-Time Streaming: Watch AI Think
One of the most satisfying parts of the experience is watching results stream in live. Instead of waiting for everything to finish and showing a loading spinner, each agent pushes its result to the browser the moment it is ready. You see the dish card populate piece by piece: first the name, then the story text fills in, then the photo appears, then the audio button lights up, and finally the allergy badges show up.
This is powered by Server-Sent Events (SSE) on the backend and progressive state updates on the Next.js frontend. It feels alive and responsive, exactly what the competition theme demands.
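At the wire level, SSE is just framed text: each agent result becomes a named event with a JSON payload, terminated by a blank line. A minimal sketch (the event name and payload shape are assumptions for illustration):

```python
import json

def sse_event(event: str, data: dict) -> str:
    """Frame one Server-Sent Event: a named event plus a JSON data line,
    terminated by the blank line the SSE format requires."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

# As each agent finishes, the backend yields one event per result, and the
# browser's EventSource listener updates the matching dish card.
chunk = sse_event("story", {"dish": "Tonkotsu Ramen", "text": "Born in Kurume..."})
```

Because each event stands alone, the frontend can render the story the instant it arrives, even while the image and audio for the same dish are still generating.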
Tech Stack Deep Dive
| Layer | Technology | Why |
|---|---|---|
| AI Orchestration | Google ADK (Agent Development Kit) | Native multi-agent support with parallel execution |
| Primary Model | Gemini 2.0 Flash Lite / 2.5 Flash | Fast, cost-effective, multimodal |
| Image Generation | Imagen 3.0 | Photorealistic food photography quality |
| Text-to-Speech | Gemini 2.5 Flash TTS | Natural multilingual pronunciation |
| Search Grounding | Google Search | Factual cultural narratives, reduced hallucination |
| Backend | FastAPI + Python (async) | High-concurrency SSE streaming |
| Frontend | Next.js 16 + React 19 + TypeScript | Modern, type-safe, fast |
| Styling | Tailwind CSS + Framer Motion | Glassmorphic UI with smooth animations |
| Deployment | Google Cloud Run + Terraform | Serverless, auto-scaling, IaC |
| Observability | OpenTelemetry + Cloud Trace | End-to-end request tracing and cost tracking |
Cost Transparency Built In
Every tour request shows you exactly what it cost, computed from API token usage. The telemetry panel breaks down cost per agent, shows a waterfall timeline of which agents ran when, and calculates the total. A typical 5-dish tour costs a fraction of a cent. This level of transparency is rare in AI applications and was an intentional design choice to build user trust.
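The per-agent breakdown boils down to simple arithmetic over token counts. A sketch of the calculation, with placeholder rates (real Gemini pricing varies by model and changes over time; these numbers are purely illustrative):

```python
# Placeholder per-million-token rates -- NOT real Gemini pricing.
RATES_PER_M_TOKENS = {"input": 0.10, "output": 0.40}

def agent_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one agent call, derived from its token counts."""
    return (input_tokens * RATES_PER_M_TOKENS["input"]
            + output_tokens * RATES_PER_M_TOKENS["output"]) / 1_000_000

def tour_cost(per_agent_usage: dict[str, tuple[int, int]]) -> dict[str, float]:
    """Per-agent cost breakdown plus the total, as a telemetry panel might show."""
    breakdown = {name: agent_cost(i, o) for name, (i, o) in per_agent_usage.items()}
    breakdown["total"] = sum(breakdown.values())
    return breakdown

# Hypothetical usage for a two-agent slice of a tour
costs = tour_cost({"story": (1000, 500), "allergy": (800, 200)})
```

Keeping the math this explicit is what makes the waterfall view honest: every cent shown traces back to a specific agent's token usage.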
Post-Tour Chat: Ask Anything About Your Dishes
After your tour completes, a chat panel opens where you can ask follow-up questions. “Which dish is spiciest?” “Can I substitute the pork in the ramen?” “What wine pairs well with the osso buco?” The chat agent has full context of all the dishes in your tour and maintains session memory across questions.
Building with Google: From Gemini Models to Cloud Run
One of the most rewarding parts of building this project was seeing how well the Google AI and Cloud ecosystem fits together. Here is a look at the key decisions and how each piece of the Google stack earned its place in the architecture.
Why Google ADK for Multi-Agent Orchestration
Google’s Agent Development Kit (ADK) was the natural choice for orchestrating 5 agents that need to run in parallel. ADK provides a native `ParallelAgent` primitive that lets you fire off the Story, Image, Audio, and Allergy agents simultaneously for each dish. Without ADK, you would need to manually manage asyncio tasks, handle failures per agent, and wire up the coordination logic yourself. ADK abstracts all of that into a clean, declarative agent graph. It also comes with a built-in playground UI (`adk web`) that was invaluable during development for testing individual agents in isolation before wiring them into the full pipeline.
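To make concrete what ADK abstracts away, here is a hand-rolled asyncio version of the per-dish fan-out, with failure isolation done manually. The agent callables are stand-ins, not ADK code:

```python
import asyncio

async def run_agents_for_dish(dish: str, agents: dict) -> dict:
    """Hand-rolled version of the parallel fan-out: run every agent for one
    dish concurrently, isolating failures so one broken agent doesn't sink
    the others. (agents maps a name to an async callable; all names here
    are illustrative stand-ins for the real Story/Image/Audio/Allergy agents.)
    """
    names = list(agents)
    results = await asyncio.gather(
        *(agents[n](dish) for n in names), return_exceptions=True
    )
    # Exceptions come back as values thanks to return_exceptions=True
    return {
        n: (None if isinstance(r, Exception) else r)
        for n, r in zip(names, results)
    }

# Stand-in agents: one succeeds, one simulates a failure
async def story(dish):
    return f"The story of {dish}..."

async def allergy(dish):
    raise RuntimeError("simulated model timeout")

out = asyncio.run(run_agents_for_dish("Osso Buco", {"story": story, "allergy": allergy}))
```

Multiply this by per-agent retries, streaming, and tracing, and the appeal of a declarative agent graph becomes obvious.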
Choosing the Right Gemini Model for Each Agent
Not every agent needs the same model. Picking the right Gemini variant for each task was critical for balancing speed, quality, and cost:
- Vision Agent (Gemini 2.5 Flash) – Menu images can be blurry, rotated, or in dim restaurant lighting. Flash handles these edge cases well with strong multimodal understanding at low latency.
- Story Agent (Gemini 2.0 Flash Lite) – Cultural narratives need to be rich but the task is straightforward text generation. Flash Lite keeps costs minimal while producing engaging stories.
- Image Agent (Imagen 3.0) – Food photography demands photorealism. Imagen 3.0 generates stunning dish images that look like they belong on a restaurant website. The agent races two model variants and uses whichever responds first.
- Audio Agent (Gemini 2.5 Flash TTS) – Pronunciation requires natural-sounding, multilingual text-to-speech. Flash TTS supports Japanese, Italian, French, Chinese, and more with native-quality audio.
- Allergy Agent (Gemini 2.5 Flash) – Allergen analysis is safety-critical, so it uses the more capable Flash model for higher accuracy on ingredient reasoning.
Google Search Grounding: Fighting Hallucination with Facts
AI-generated cultural narratives are only valuable if they are accurate. The Story Agent uses Google Search Grounding to anchor its narratives in real-world facts. When generating the story for “Tonkotsu Ramen,” for example, the agent queries Google Search for cultural and historical context before composing the narrative. This dramatically reduces hallucination and ensures that the origin stories, regional traditions, and ingredient histories are factually grounded rather than plausibly invented.
Google GenAI SDK: Round-Robin Client Strategy
With 5 agents processing multiple dishes concurrently, rate limits become a real concern. The backend implements a round-robin client switching strategy across three Google API clients: Vertex AI, Vertex Express, and AI Studio. Each API call rotates to the next client, distributing load and avoiding throttling. Combined with an async semaphore limiting concurrency to 10 simultaneous calls and exponential backoff retries, this strategy keeps the pipeline humming even under heavy load.
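The three ingredients named above (rotation, a semaphore, and backoff) compose naturally in one small class. This is a sketch under stated assumptions: the clients are stand-in async callables, not the real Vertex AI / Vertex Express / AI Studio wrappers:

```python
import asyncio
from itertools import cycle

class RoundRobinCaller:
    """Rotate API calls across several clients, cap concurrency with a
    semaphore, and retry with exponential backoff on failure.
    (Clients are stand-in async callables; constants are illustrative.)
    """
    def __init__(self, clients, max_concurrency=10, retries=3, base_delay=0.05):
        self._clients = cycle(clients)
        self._sem = asyncio.Semaphore(max_concurrency)
        self._retries = retries
        self._base_delay = base_delay

    async def call(self, prompt: str):
        async with self._sem:
            for attempt in range(self._retries):
                client = next(self._clients)  # rotate even across retries
                try:
                    return await client(prompt)
                except Exception:
                    if attempt == self._retries - 1:
                        raise
                    await asyncio.sleep(self._base_delay * 2 ** attempt)

# Demo: one client that always fails (think HTTP 429), one healthy one
calls = []

async def flaky(prompt):
    calls.append("flaky")
    raise RuntimeError("simulated 429")

async def healthy(prompt):
    calls.append("healthy")
    return f"ok:{prompt}"

result = asyncio.run(RoundRobinCaller([flaky, healthy], retries=3).call("hello"))
```

A nice side effect of rotating on retry is that a throttled client is automatically skipped over: the retry lands on the next client in the cycle instead of hammering the one that just said no.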
Cloud Run + Terraform: Production-Ready from Day One
The entire application deploys to Google Cloud Run using Terraform for infrastructure-as-code. Cloud Run was the ideal choice because it auto-scales to zero when idle (no cost for unused compute), scales up automatically when a tour request comes in, and handles the long-lived SSE connections gracefully. The Terraform configuration defines separate environments for staging and production, with Cloud Build handling CI/CD. A single make deploy command builds the container, pushes it to Artifact Registry, and deploys to Cloud Run.
Observability: Cloud Trace and Cloud Logging
When 5 agents are running in parallel across multiple dishes, you need visibility into what is happening. The backend is instrumented with OpenTelemetry, which exports traces to Google Cloud Trace and logs to Cloud Logging. Every agent call gets its own span in the trace, so you can see exactly how long each agent took, which ones ran in parallel, and where bottlenecks occur. The frontend mirrors this with a telemetry panel that shows a waterfall timeline and per-agent cost breakdown in real time. This level of observability was essential during development and remains valuable in production for monitoring performance and costs.
Why This Matters for the Gemini Live Agent Challenge
The Gemini Live Agent Challenge judges submissions on three criteria:
- Innovation and Multimodal UX (40%) – Menu to Food Tour processes images as input and produces text narratives, generated images, and audio as output. It is multimodal in every direction.
- Technical Implementation (30%) – 5 agents orchestrated via Google ADK, deployed on Cloud Run with Terraform, grounded with Google Search, and monitored with OpenTelemetry. Production-grade architecture.
- Demo and Presentation (30%) – The real-time SSE streaming creates a compelling live demo where you can see each agent complete its work progressively.
Try It Yourself
The next time you are at a restaurant with an unfamiliar menu, imagine having a team of AI agents instantly tell you the story behind each dish, show you what it looks like, teach you how to say it, and warn you about allergens. That is the future of dining, and it is being built today with Google Gemini and ADK.
Check out the Gemini Live Agent Challenge on Devpost to see what other developers are building. The competition runs until March 16, 2026, with $80,000 in prizes and a chance to demo at Google Cloud Next 2026.
#GeminiLiveAgentChallenge #GoogleADK #GeminiAPI #MultiAgentAI #AIAgents #GoogleCloud #GenAI #GenerativeAI #Imagen3 #GeminiFlash #MultimodalAI #BuildWithGoogle #GoogleAI #FoodTech #CulinaryAI #AIExperience #NextJS #FastAPI #CloudRun #Terraform #OpenTelemetry #ServerSentEvents #DevPost #AIProject #FullStackAI #RealTimeAI

