
Building a real-time translation API with Gemini 3.1 Flash-Lite

Learn how to build a low-latency, real-time streaming translation API using Python and Google's new Gemini 3.1 Flash-Lite model.

We've all experienced it. You're chatting with someone who speaks another language, and every single message comes with an agonizing two-second delay. You send a text, wait. They reply, you wait. It completely kills the natural rhythm of a conversation. I keep thinking about how we've accepted this artificial lag as the cost of doing business globally.

When Google released Gemini 3.1 Flash-Lite, they made specific claims about its optimization for high-volume, latency-sensitive tasks. Translation was specifically called out. I wanted to see if it actually delivered on those claims, rather than just reading the press release.

Here is how you can build a streaming translation API using the new model, keeping latency low enough that conversations actually feel like conversations.

Getting your API key

Before writing any code, you need access to the model. Google AI Studio handles the provisioning.

If you don't have an account, head to Google AI Studio and sign in. Click the "Get API key" button in the left sidebar. Create a new key in a new or existing Google Cloud project. It takes about thirty seconds.

Copy that key and set it as an environment variable in your terminal. You'll need it for the Python SDK.

export GEMINI_API_KEY="your_api_key_here"

Setting up the environment

We'll use the official google-genai SDK. It handles the API requests and streaming logic smoothly.

Create a new directory for your project, set up a virtual environment, and install the required package:

mkdir gemini-translator
cd gemini-translator
python -m venv venv
source venv/bin/activate
pip install google-genai

Writing the streaming translation script

The secret to a real-time feel isn't just a fast model. It's streaming the response. If you wait for the entire paragraph to translate before showing it to the user, you've already lost. You need to push tokens to the screen the millisecond they are generated.

Create a file named translate.py. We are going to build a simple function that takes a target language and the input text, then streams the output back to the console.

import os
import sys
from google import genai
from google.genai import types

def stream_translation(target_language: str, text_to_translate: str):
    # The SDK automatically picks up the GEMINI_API_KEY environment variable
    client = genai.Client()

    prompt = f"Translate the following text to {target_language}. Only output the translation, nothing else.\n\nText: {text_to_translate}"

    print(f"Translating to {target_language}...\n")
    print("Output: ", end="", flush=True)

    try:
        # We explicitly call the 3.1 Flash-Lite model
        response = client.models.generate_content_stream(
            model='gemini-3.1-flash-lite',
            contents=prompt,
            config=types.GenerateContentConfig(
                temperature=0.1, # Keep it low for accurate translation
            )
        )

        for chunk in response:
            # Print each chunk as it arrives without adding newlines.
            # Some chunks (such as trailing metadata) carry no text, so guard
            # against None before printing.
            if chunk.text:
                print(chunk.text, end="", flush=True)

    except Exception as e:
        print(f"\nError during translation: {e}")

    print("\n")

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: python translate.py <target_language> <text>")
        sys.exit(1)

    target_lang = sys.argv[1]
    input_text = " ".join(sys.argv[2:])

    stream_translation(target_lang, input_text)

Notice the temperature setting. I knocked it down to 0.1 because you generally want translation to be deterministic and precise, not creative. We also call generate_content_stream rather than generate_content, so text arrives in chunks as it is generated instead of as one final block.

Running the latency test

Let's see if the Flash-Lite model is actually fast. I ran a quick terminal test translating a block of English text into Spanish.

python translate.py Spanish "The architecture of the new model allows for parallel processing of tokens, significantly reducing the time to first byte. This is especially useful for applications where users expect immediate feedback."

The results were noticeably fast. The first byte hit my terminal in under 400 milliseconds. The entire sentence finished streaming within a second. It feels instantaneous. If you plug this logic into a WebSocket connection for a web app, the user on the other end would see the words appearing almost as fast as the sender types them.
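If you want actual numbers instead of eyeballing the terminal, you can wrap any chunk stream in a small timing harness. This is a sketch of my own: measure_stream works on any iterator of text chunks, so it is shown here with a simulated generator rather than a live API call (both the function and the fake generator are illustrative, not part of the SDK). To measure the real thing, pass it the generate_content_stream response and pull .text off each chunk.

```python
import time

def measure_stream(chunks):
    """Consume a stream of text chunks, returning the full text,
    time to first chunk, and total elapsed time (in seconds)."""
    start = time.perf_counter()
    first_chunk_at = None
    parts = []
    for chunk in chunks:
        if first_chunk_at is None:
            # Record latency to the very first chunk (the "first byte")
            first_chunk_at = time.perf_counter() - start
        parts.append(chunk)
    total = time.perf_counter() - start
    return "".join(parts), first_chunk_at, total

def fake_chunks():
    # Simulated stream: a short delay before the first token, then
    # steady chunks, mimicking time-to-first-byte behavior.
    time.sleep(0.05)
    for word in ["Hola ", "mundo"]:
        yield word
        time.sleep(0.01)

text, ttfb, total = measure_stream(fake_chunks())
print(f"first chunk after {ttfb*1000:.0f} ms, finished in {total*1000:.0f} ms")
```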

Wrapping up

Building a translation API that doesn't annoy your users comes down to model choice and streaming implementation. Gemini 3.1 Flash-Lite handles the latency side of the equation well.

If you want to take this further, try wrapping this script in a FastAPI endpoint and serving it via WebSockets to a frontend chat interface. Let me know what you build.

Frequently Asked Questions

What makes Gemini 3.1 Flash-Lite good for translation?

Google specifically optimized the Flash-Lite model for latency-sensitive tasks like real-time translation, ensuring the first byte of response arrives faster than heavier models.

How do I stream translations in Python with Gemini?

You can use the google-genai SDK's generate_content_stream method to yield text chunks as they are generated, rather than waiting for the entire translation to complete.

Why does translation latency matter?

High latency in chat or voice translation disrupts the natural flow of conversation. Sub-second response times are required for communication to feel organic.