How AI Narration Works: From Text to Voice in Multiple Languages
The Voice You Hear
When you play a Conch adventure, a voice tells you what is happening. It describes the forest clearing you just entered, the tension in the merchant's expression, the sound of something heavy shifting behind a locked door. It responds to what you said, what you did, and where you are in the story.
That voice is not a recording. Nobody sat in a studio and performed those lines. The words were written by an AI seconds ago, and the voice was synthesized milliseconds after that. Every time you play, the narration is different -- not pulled from a library of pre-recorded clips, but generated fresh for your specific moment in the story.
This is the narration pipeline, and it is the core of what makes a Conch adventure feel alive. Here is how it works, stage by stage.
Step 1: Assembling the Game State
Before the AI can narrate anything, it needs to know everything. Not "everything" in the vague sense of a chatbot scanning its conversation history -- everything in the precise, structured sense of a game engine querying its state.
When you take an action in a Conch adventure, the system assembles a context packet that includes:
- Your location -- which scene you are in, what it looks like, what connections lead to other scenes
- Your inventory -- every item you are carrying, its properties, and how you got it
- NPCs present -- which characters are in the scene, their personalities, their own inventories, and their disposition toward you
- Recent history -- the last few exchanges between you and the game, so the AI understands conversational continuity
- Adventure rules -- the creator's world-building, tone guidelines, and any special mechanics
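The assembly step above can be sketched in a few lines. Everything here -- the field names, the state layout, the history window -- is a hypothetical illustration of the idea, not Conch's actual schema:

```python
from dataclasses import dataclass

# Hypothetical structure for the context packet described above.
@dataclass
class ContextPacket:
    scene: dict       # current scene: description, exits to other scenes
    inventory: list   # items the player carries, with properties
    npcs: list        # characters present, with dispositions
    history: list     # the last few player/narrator exchanges
    rules: str        # the creator's world-building and tone notes

def assemble_context(state: dict, history_window: int = 4) -> ContextPacket:
    """Build the packet by querying structured game state, not free-text memory."""
    here = state["player"]["location"]
    return ContextPacket(
        scene=state["scenes"][here],
        inventory=state["player"]["inventory"],
        npcs=[n for n in state["npcs"] if n["location"] == here],
        history=state["log"][-history_window:],
        rules=state["adventure"]["rules"],
    )
```

The key design point survives the simplification: every field is a database query, so the AI's input is ground truth by construction.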
This is not a loose summary. It is structured data pulled from the game engine's database -- the same database that tracks item movement, scene transitions, and combat outcomes. The AI receives a precise snapshot of the world as it actually exists, not as it vaguely remembers it.
This distinction matters enormously. A chatbot hallucinates because it has no ground truth. Our AI cannot hallucinate about your inventory, because it receives your actual inventory as input. It cannot forget that you moved to a different room, because it receives your actual location. The game state constrains the narrative, and that constraint is what makes the story feel coherent over hours of play.
Step 2: AI Narrative Generation
With the full game state assembled, the system hands everything to a large language model. The LLM's job is specific: take this player action, in this context, and generate the narrative response.
The output is not freeform text. The model produces structured data that includes the narrative the player will hear, plus any game events that should fire -- a scene transition, an item pickup, a combat encounter. The game engine processes those events to update the world state, and the narrative text moves on to the next stage.
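A minimal sketch of how an engine might consume such a structured response -- the JSON shape and event type names here are illustrative assumptions, not the actual format:

```python
import json

def process_response(raw: str, state: dict) -> str:
    """Parse the structured output, fire its events, return the narration text."""
    data = json.loads(raw)
    for event in data.get("events", []):
        if event["type"] == "item_pickup":            # engine updates ground truth...
            state["inventory"].append(event["item"])
        elif event["type"] == "scene_transition":
            state["location"] = event["scene"]
    return data["narrative"]                          # ...and the prose moves on to TTS
```

Separating narrative text from machine-readable events is what lets the same model output both the performance and the state changes it implies.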
What makes this interesting is the balance between freedom and constraint. The AI has enormous creative latitude in how it describes things -- word choice, pacing, tone, dramatic emphasis. But it has almost no latitude in what it describes. If the player picked up a key, the narrative must acknowledge the key. If the player is in a dungeon, the narrative cannot describe a sunny meadow. The facts come from the engine. The prose comes from the AI.
This is generated fresh every single time. Two players taking the same action in the same scene will hear different narration -- different sentence structures, different descriptive details, different dramatic emphasis. The meaning will be consistent, but the expression will be unique. It is the difference between a script and a performance.
Step 3: From Text to Speech
The generated narrative text now needs to become audio. This is where text-to-speech comes in, and modern TTS is remarkably far from the robotic voices you might remember.
Current-generation TTS engines -- Google Chirp, OpenAI's voice models, ElevenLabs -- produce speech that sounds genuinely natural. They handle pacing, emphasis, and emotional tone. A tense moment sounds tense. A quiet revelation sounds contemplative. The voice rises and falls with the content, not in a mechanical pattern but in response to meaning.
This is not simple text reading. These models understand prosody -- the rhythm and melody of speech. When the narration says "You open the chest, and inside you find... nothing," the pause before "nothing" lands correctly. When a character shouts a warning, the voice carries urgency. When a scene description is calm and atmospheric, the delivery matches.
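Engines that accept SSML let the text stage make such pauses explicit. A toy example -- the `<break>` markup is standard SSML, but whether a given engine needs the hint or infers the pause on its own varies by provider:

```python
import re

def mark_pauses(text: str) -> str:
    """Replace an ellipsis with an explicit SSML break before synthesis."""
    body = re.sub(r"\.\.\.", ' <break time="600ms"/>', text)
    return f"<speak>{body}</speak>"
```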
The result is narration that feels performed, not read. Not at the level of a veteran voice actor delivering a signature performance, but at a level that serves the story well -- and crucially, at a level that can be generated in real time for content that has never existed before.
The Multilingual Dimension
Here is where things get particularly interesting. Conch adventures are typically authored in English, but the narration pipeline can deliver them in multiple languages.
The process is more elegant than you might expect. The AI generates its narrative response in the target language directly -- it does not write English and then translate. A French-speaking player receives narration that was composed in French, with French idioms and sentence structures, not awkward translated-from-English phrasing.
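One way to express that in the prompt itself -- the wording below is entirely hypothetical, but it shows the point: the target language is an input to generation, not a post-processing step:

```python
def build_narration_prompt(context: str, action: str, language: str) -> str:
    """Ask the model to compose in the target language directly."""
    return (
        "You are the narrator of an interactive adventure.\n"
        f"Compose your response natively in {language}; do not write "
        "English and translate it.\n"
        f"Context: {context}\n"
        f"Player action: {action}"
    )
```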
The TTS engines then handle pronunciation natively. A French voice model speaks French. A Japanese voice model speaks Japanese. They are not English voices attempting other languages -- they are models trained specifically for those languages, with native pronunciation, rhythm, and intonation patterns.
The honest caveat: quality varies by language. English, French, Spanish, and German tend to produce excellent results. Less commonly supported languages may have fewer voice options or slightly less natural prosody. The technology improves steadily -- languages that sounded rough a year ago sound significantly better today -- but we would rather be transparent than oversell. If you try an adventure in a less common language, set your expectations for "good and improving" rather than "indistinguishable from native."
Voice Selection and Provider Strengths
Behind the scenes, Conch works with multiple TTS providers, and they are not interchangeable. Each has strengths that make it better suited for different situations.
Some providers prioritize speed -- they can return audio with minimal latency, which matters for real-time gameplay. Others prioritize expressiveness -- richer emotional range, more nuanced delivery. Some handle certain languages better than others.
The system manages this complexity so you do not have to think about it. Adventure creators can configure voice preferences, and the platform routes to the appropriate provider. The goal is always the same: narration that sounds good and arrives fast.
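That routing decision can be sketched as a filter-then-optimize step. The provider names, latency figures, and scores below are made up for illustration:

```python
# Hypothetical provider capability table.
PROVIDERS = {
    "fast-tts":       {"latency_ms": 150, "expressiveness": 2, "languages": {"en", "fr", "es"}},
    "expressive-tts": {"latency_ms": 600, "expressiveness": 5, "languages": {"en", "de", "ja"}},
}

def pick_provider(language: str, realtime: bool) -> str:
    """Filter by language support, then optimize for latency or expressiveness."""
    candidates = {n: p for n, p in PROVIDERS.items() if language in p["languages"]}
    if realtime:
        return min(candidates, key=lambda n: candidates[n]["latency_ms"])
    return max(candidates, key=lambda n: candidates[n]["expressiveness"])
```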
Speed: The Constraint That Shapes Everything
The entire pipeline -- game state assembly, AI generation, text-to-speech -- needs to happen fast enough that the experience feels responsive. When you speak an action, you expect to hear the response without an awkward silence.
This is where streaming becomes essential. The system does not wait for the AI to finish generating the complete narrative before starting TTS. As soon as the first sentence is ready, it begins converting to audio. As soon as the first audio chunk is ready, it begins playing.
The effect is that you start hearing the response while the AI is still generating the rest of it. The pipeline is a relay, not a batch process. Each stage starts working as soon as it has enough input, and the result is a response time that feels conversational rather than computational.
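The relay can be sketched with generators: sentences are cut from the token stream as soon as they are complete, and each one goes to synthesis while later tokens are still arriving. The sentence-chunking rule here is deliberately naive:

```python
import re

def sentences(token_stream):
    """Yield complete sentences as soon as the incoming tokens form one."""
    buffer = ""
    for token in token_stream:
        buffer += token
        while (m := re.search(r"[.!?]\s", buffer)):
            yield buffer[: m.end()].strip()
            buffer = buffer[m.end():]
    if buffer.strip():
        yield buffer.strip()

def narrate(token_stream, synthesize):
    """Relay stage: each finished sentence is synthesized immediately."""
    for sentence in sentences(token_stream):
        yield synthesize(sentence)  # audio starts while generation continues
```

Because each stage pulls from the previous one lazily, the first audio chunk is ready after the first sentence, not after the full response.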
This engineering matters more than most players realize. A two-second delay is fine. A five-second delay feels slow. A ten-second delay breaks immersion. The streaming architecture keeps response times in the range where the experience feels like a conversation with a narrator, not a query to a server.
Beyond Narration: Ambient Soundscapes
Voice narration is the most noticeable audio layer, but it is not the only one. Conch adventures also feature ambient soundscapes that shift with the scene -- forest sounds when you are in a clearing, echoing drips in a cave, rain on stone, the murmur of a crowded tavern.
These soundscapes are matched to scenes using semantic search. The system understands that a "moonlit forest clearing" should have crickets and wind through leaves, not seagulls and waves. When you move between scenes, the ambient audio crossfades to match the new environment.
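In spirit, the matching works like this toy version, which uses word overlap as a stand-in for real embedding similarity -- an actual semantic search compares dense vectors with cosine similarity, and the soundscape library here is invented:

```python
def embed(text: str) -> set:
    """Toy stand-in for an embedding model: a bag of lowercase words."""
    return set(text.lower().split())

def similarity(a: set, b: set) -> float:
    """Jaccard overlap; real systems use cosine similarity on vectors."""
    return len(a & b) / len(a | b)

# Hypothetical soundscape library tagged with descriptive keywords.
SOUNDSCAPES = {
    "forest-night": "crickets wind leaves owl night forest",
    "sea-shore": "seagulls waves surf tide shore",
    "tavern": "murmur crowd mugs hearth tavern laughter",
}

def match_soundscape(scene_description: str) -> str:
    query = embed(scene_description)
    return max(SOUNDSCAPES, key=lambda name: similarity(query, embed(SOUNDSCAPES[name])))
```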
It is a subtle layer, but it does meaningful work. Ambient sound tells your brain where you are before the narration starts. It fills the spaces between spoken words. It makes the experience feel like a place rather than a voice.
The Trade-Off: AI Versus Professional Voice Acting
It is worth being direct about where AI narration sits relative to professional voice acting. Platforms like Sound Realms and EarReality use human voice actors -- trained performers who deliver polished, directed performances for every line of dialogue.
The quality ceiling is higher. A professional actor brings interpretation, subtlety, and emotional depth that current TTS cannot fully match. If you are comparing a single line of narration, the human performance wins.
But professional voice acting does not scale. Recording, directing, and editing voice performances for an entire adventure takes weeks and a significant budget. Adding a new language means hiring new actors and re-recording everything. Making the story dynamic -- responsive to player choices in real time -- is effectively impossible, because you would need to record every possible variation.
AI narration trades some polish for capabilities that traditional voice acting simply cannot offer: infinite variation, real-time generation, multilingual delivery, and responsiveness to player actions that no one predicted when the adventure was created. Every playthrough sounds different. Every player hears narration tailored to their specific choices. And new languages can be added without re-recording a single line.
It is not that one approach is better. They serve different goals. Professional voice acting is ideal for fixed, linear narratives with a set number of languages. AI narration is ideal for dynamic, interactive experiences that need to adapt in real time.
We built Conch for the second category. The narration pipeline -- from game state to AI generation to synthesized voice -- is what makes it possible to have adventures that respond to anything you say, in multiple languages, without a recording studio in sight.
Curious about what this sounds like in practice? Try an adventure on Conch and hear the pipeline for yourself.