We are in a golden age of chatbot and agent frameworks. LangGraph, LangChain, CrewAI, AutoGen, Semantic Kernel, the list keeps growing. Every week there is a new library promising to make it easier to build conversational AI. And honestly, a lot of them are great. Routing between nodes, managing state, hooking into tools and APIs, that part of the problem is largely solved.
But here is something none of them tell you: what happens when your users don't speak English?
I ran into this while building an insurance sales chatbot with LangGraph. We were building it for the Indonesian market, so Bahasa Indonesia support was a core requirement from day one. We knew the bot had to understand users typing in Bahasa, not just display text in it. That's when we realized the problem went deeper than we expected.
The gap that nobody talks about
In traditional software development, there is a well-established pattern for handling multiple languages. It's called i18n, short for internationalization (18 letters between the i and the n). The idea is simple: instead of hardcoding display text in your code, you reference a key like greeting.welcome, and a translation catalog maps that key to the right language at runtime.
```yaml
en:
  greeting.welcome: "Hello"
id:
  greeting.welcome: "Halo"
```

Tools like i18next, gettext, and Rails i18n have been doing this for decades. It works well.
But here is the thing about i18n: it only handles output. It solves "how do I show the right text to the user in their language?" It does nothing for the reverse problem: "how do I understand what the user typed in their language?"
For a traditional app with dropdown menus and buttons, that's fine. The user clicks "Yes" and you get "Yes". But in a conversational chatbot, the user types whatever feels natural to them. And if they're Indonesian, that might be iya, ya, boleh, oke, or lanjutkan. None of those are English. None of those are in any i18n catalog. And none of the major chatbot frameworks have a built-in answer for this.
There is no established guideline for multilingual input handling in state-based conversational AI. You're on your own.
The challenge we had to solve
Our chatbot guides users through a structured quote flow. It asks one question at a time and waits for the answer before moving to the next step. Things like: "Are you looking for family or individual coverage?" and "Do you want to add medical coverage?"
The business logic behind these questions only cares about a small set of answers. Yes or no. Family, Dual, Group, Individual. Male, Female. I call these closed concepts: questions where the set of valid answers is fixed and finite.
When we started thinking about Bahasa support, we looked at how the existing parsers worked. Every single one of them had inline English string sets:
```python
# sales_state_update.py — the old approach
if normalized in {"yes", "y", "yeah", "yep", "ok", "okay", "sure"}:
    return True
```

English only. And this same pattern was scattered across about ten different files, covering yes or no, group size, gender, welcome menu options, payment options, addons...
A Bahasa user typing iya, keluarga, or sendiri would get nothing back. The parser returns None, the bot has no valid answer, and it re-asks the same question. We needed to fix this before we could ship to Indonesian users, and we wanted to fix it properly so adding the next language would not mean doing it all over again.
Why not just call the LLM?
The first thing that comes up in any internal discussion like this is: "Just pass the message to the LLM and let it figure out what the user meant." And yes, that would work. But after thinking it through, we felt it was the wrong tool for this job.
Think about what's actually happening when a user types iya in response to "Would you like to proceed?" The intent is not ambiguous. It's not nuanced. It's not a complex thought the LLM needs to untangle. It's a one-word answer to a yes or no question. The answer is yes.
Sending that to an LLM means waiting 200 to 800 milliseconds for something that should resolve in zero, spending tokens on a question that has a deterministic answer, getting results that might vary between calls, and adding a network dependency to a code path that runs on every single user reply.
In a chatbot with a long quote flow, the user might go through 15 to 20 exchanges. If every exchange makes an LLM call for intent resolution, that's a lot of unnecessary cost and latency stacking up. And for the roughly 80% of turns where the user is just clicking a button or typing a short answer to a direct question, you're burning LLM capacity for nothing.
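The cost of that stacking is easy to sanity-check with a quick back-of-envelope sketch. The latency range is the one quoted above; the turn count and the 80% deterministic share are the illustrative figures from this section:

```python
# Back-of-envelope: latency added by routing every turn through an LLM
turns = 20                           # a long quote flow
llm_latency_ms = (200, 800)          # typical per-call range quoted above
deterministic_share = 0.8            # turns that are button clicks / short answers

# Naive approach: every turn pays the LLM round-trip
naive_ms = tuple(turns * ms for ms in llm_latency_ms)

# Two-tier approach: only the non-deterministic turns pay it
llm_turns = round(turns * (1 - deterministic_share))
tiered_ms = tuple(llm_turns * ms for ms in llm_latency_ms)

print(f"naive:  {naive_ms[0]/1000:.0f}–{naive_ms[1]/1000:.0f} s of intent-resolution wait per session")
print(f"tiered: {tiered_ms[0]/1000:.1f}–{tiered_ms[1]/1000:.1f} s per session")
```

Even at the optimistic end of the range, the naive approach adds whole seconds of dead waiting to a single session, before counting token spend.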
The right answer is: determinism for closed concepts, LLM only for the cases that actually need semantic understanding.
What we built instead
We built a multilingual lexicon. One centralized place that maps user phrases to canonical values, per language.
The structure is simple. Every concept has a set of canonical values defined as Python enums:
```python
class YesNo(Enum):
    TRUE = "TRUE"
    FALSE = "FALSE"

class GroupSize(Enum):
    FAMILY = "Family"
    DUAL = "Dual"
    GROUP = "Group"
    INDIVIDUAL = "Individual"
```

And each language provides aliases for every canonical value:
```python
class IndonesianLexicon(BaseLexicon):
    lang = "id"

    @property
    def yes_no(self) -> AliasMap:
        return {
            YesNo.TRUE: ("ya", "iya", "oke", "boleh", "lanjutkan"),
            YesNo.FALSE: ("tidak", "nggak", "gak", "batal"),
        }

    @property
    def group_size(self) -> AliasMap:
        return {
            GroupSize.FAMILY: ("keluarga", "sekeluarga"),
            GroupSize.DUAL: ("pasangan", "berdua"),
            GroupSize.GROUP: ("grup", "rombongan"),
            GroupSize.INDIVIDUAL: ("sendiri", "individu"),
        }
```

All parsers now call a single function instead of maintaining their own inline sets:
```python
from app.lexicon import resolve, find_in_text, YesNo, GroupSize

# For short replies like "iya" or button clicks
canonical = resolve(YesNo, user_message, lang)  # returns YesNo.TRUE or None

# For free-form replies like "I want medical and covid coverage"
hits = find_in_text(Priority, user_message, lang, multi=True)
```

Both functions normalize the input, try the session language first, and fall back to English on a miss. Because real users mix languages. Someone might type "ya ok" or "yes please" in the middle of an Indonesian session and that's completely fine.
One file per language, not one file per concept
When we decided where to put the alias data, we had two options.
Option A is one file per concept. A yes_no.py that contains English aliases and Bahasa aliases and Malay aliases all together. Adding a new language means editing every concept file. If you have ten concepts, that's ten files to touch per language.
Option B is one file per language. An en.py and an id.py, each containing all concepts for that language. Adding a new language means one new file.
The question that settled it: which axis grows faster?
We already had English and Bahasa live. Malay, Thai, and Vietnamese were on the roadmap. Every language launch should be one pull request that a single person, ideally a native speaker, can review end to end. Option A turns that into a scattered ten-file diff. Option B keeps it in one cohesive file.
Concepts change maybe once a quarter. Languages grow with the business.
We went with Option B.
The contract that prevents mistakes
The risk with one file per language is forgetting to add a new concept to an existing language file. We handled that with a base class that declares every concept as an abstract property:
```python
from abc import ABC, abstractmethod

class BaseLexicon(ABC):
    lang: str

    @property
    @abstractmethod
    def yes_no(self) -> dict[YesNo, tuple[str, ...]]: ...

    @property
    @abstractmethod
    def group_size(self) -> dict[GroupSize, tuple[str, ...]]: ...

    # ...one abstract property per concept
```

If any language file forgets to implement a property, Python raises a `TypeError` the moment the registry instantiates that class at startup. You get a loud failure immediately, not a silent regression in production.
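You can see the enforcement in miniature below. `IncompleteLexicon` is a hypothetical language file that forgets one concept; the failure fires at instantiation time, which is exactly what the registry does when it builds its lexicon instances at import:

```python
from abc import ABC, abstractmethod

class BaseLexicon(ABC):
    lang: str

    @property
    @abstractmethod
    def yes_no(self): ...

    @property
    @abstractmethod
    def group_size(self): ...

# Hypothetical language file that forgets to implement group_size
class IncompleteLexicon(BaseLexicon):
    lang = "xx"

    @property
    def yes_no(self):
        return {}

try:
    IncompleteLexicon()  # the registry instantiates every lexicon at startup
except TypeError as exc:
    print(f"loud failure: {exc}")  # mentions the missing abstract method
```

Because the error names the missing property, whoever adds the language knows exactly which concept they skipped.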
On top of that, we have two CI tests. The first checks that every canonical value in every concept has at least one alias in every registered language. The second walks the entire codebase and flags any hardcoded English-only string set that looks like a yes or no parser:
```python
# This trips the CI guardrail:
if lowered in {"yes", "y", "yeah", "yep", "ok", "okay", "sure"}:
    ...

# This is what we want instead:
if resolve(YesNo, lowered, lang) is YesNo.TRUE:
    ...
```

That second test was the most important one. Without it, someone could come along later and write a new parser the old way, and we'd be back where we started.
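The first CI test is just an exhaustive walk over the registry. A plausible pytest-style sketch follows; `LEXICONS` and `CONCEPTS` here are stand-ins for whatever the real registry in `app/lexicon/__init__.py` exposes:

```python
from enum import Enum

class YesNo(Enum):
    TRUE = "TRUE"
    FALSE = "FALSE"

# Stand-in registry: language -> concept -> value -> aliases
LEXICONS = {
    "en": {YesNo: {YesNo.TRUE: ("yes", "y"), YesNo.FALSE: ("no", "n")}},
    "id": {YesNo: {YesNo.TRUE: ("ya", "iya"), YesNo.FALSE: ("tidak",)}},
}
CONCEPTS = [YesNo]

def test_every_value_has_an_alias_in_every_language():
    for lang, tables in LEXICONS.items():
        for concept in CONCEPTS:
            for value in concept:
                aliases = tables.get(concept, {}).get(value, ())
                assert aliases, (
                    f"lexicon '{lang}' has no aliases for "
                    f"{concept.__name__}.{value.name}"
                )
```

The failure message pinpoints the exact language and canonical value, so a reviewer of a new-language PR sees gaps immediately.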
The two-tier architecture
The lexicon handles the deterministic cases. But users do say things we haven't anticipated. Someone might type "keluarga gue" (informal Indonesian for "my family") or "ga mau ah" (casual "I don't want to"). The alias list can't cover every possible phrasing.
That's where the LLM comes in, but only as a fallback. Here is the full flow for every incoming message:
```
User message arrives (e.g. "iya")
        │
        ▼
  Normalize input
  strip · lowercase · collapse whitespace
        │
        ▼
┌───────────────────────────┐
│  Tier 1 — Lexicon lookup  │
│  try session lang (id)    │
└─────────────┬─────────────┘
              │
      ┌───────┴────────┐
     HIT              MISS
      │                │
      ▼                ▼
 YesNo.TRUE      English fallback
 returned        lexicon lookup
 0 ms · 0 tokens       │
                ┌──────┴──────┐
               HIT          MISS
                │             │
                ▼             ▼
           YesNo.TRUE   ┌──────────────────────┐
           returned     │  Tier 2 — LLM call   │
           0 ms · 0 tok │  sub_intent_helper   │
                        │  constrained output  │
                        │  (action registry)   │
                        └──────────┬───────────┘
                                   │
                          ┌────────┴────────┐
                        action            error
                          │                 │
                          ▼                 ▼
                    canonical value    action="none"
                    returned           safe fallback
```

The important property here: business logic never sees a raw string. It sees `YesNo.TRUE` or it sees `action="none"`. Whether that came from a lexicon hit or an LLM call makes no difference downstream.
Every intent detection function follows this pattern:
```python
def _is_X_intent(state) -> bool:
    lang = str(state.get("lang", "en"))
    user_msg = str(state.get("user_message", "")).strip()
    if not user_msg:
        return False

    # Fast path: lexicon, zero cost
    if lexicon_resolve(XAction, user_msg, lang) is XAction.CHECK_X:
        return True

    # Slow path: LLM, only when lexicon misses
    result = sub_intent_helper(
        state=state,
        current_stage="x",
        last_response=last_question,
        option_list=options,
        action_registry=X_ACTIONS,
        user_message=user_msg,
    )
    return result.action == "check_x"
```

A lexicon hit short-circuits the LLM entirely. We test this explicitly so a future change doesn't accidentally start making LLM calls on button clicks.
When the LLM does run, it's constrained to return one of a predefined set of actions from a registry. It can't go off-script. If it errors, the fallback is a safe action="none" that routes to the default node. No exception bubbles up, no user gets a broken experience.
The boundary between the two tiers is intentional:
- Closed concepts with short, predictable answers go to the lexicon. Button clicks, yes or no, group size, gender.
- Open-ended, ambiguous, or creative phrasings go to the LLM. "Hey can you confirm the payment went through?" or "Sure, Sompo please, and add COVID coverage."
This strikes the right balance. Fast and free for the common case. Smart and forgiving for the long tail.
Adding a new language now
This is where everything comes together. Adding Vietnamese is:
- Create `app/lexicon/vi.py` with `VietnameseLexicon(BaseLexicon)`
- Implement every property (the ABC tells you immediately if you miss one)
- Add `"vi": VietnameseLexicon()` to the registry in `__init__.py`
- Write a few behavioural tests
Every parser, validator, and intent classifier picks it up automatically. No business logic file gets touched. One new file, one register line, a few tests.
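The registry itself can stay tiny. A plausible sketch of the relevant part of `app/lexicon/__init__.py` — the `get_lexicon` helper and the English default for unknown session languages are assumptions on top of the `"vi": VietnameseLexicon()` line quoted above:

```python
# Minimal lexicon classes standing in for the real per-language files
class BaseLexicon:
    lang: str

class EnglishLexicon(BaseLexicon):
    lang = "en"

class IndonesianLexicon(BaseLexicon):
    lang = "id"

class VietnameseLexicon(BaseLexicon):
    lang = "vi"

LEXICONS: dict[str, BaseLexicon] = {
    "en": EnglishLexicon(),
    "id": IndonesianLexicon(),
    "vi": VietnameseLexicon(),  # the one new registry line per language
}

def get_lexicon(lang: str) -> BaseLexicon:
    # An unknown or unset session language degrades gracefully to English
    return LEXICONS.get(lang, LEXICONS["en"])
```

Everything downstream resolves through `get_lexicon`, which is why no parser or intent classifier needs to change when a language is added.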
What the migration taught us
The lexicon itself wasn't hard to write. The more time-consuming part was going through the existing codebase and replacing inline English string sets with lexicon calls across about ten files. The CI guardrails helped find every instance, but it was still a good chunk of work.
The lesson: if you know from the start that you need to support multiple languages, design for it upfront. The migration cost is real even when you have good tooling around it.
The other thing I noticed is that the hardest part of building AI systems is often not the AI. It is the deterministic layer underneath it. Getting that right is what makes the AI parts actually trustworthy.