
Building the Harness: How Real ML Systems Actually Run Models

Imagine you baked a really good cake. You spent hours getting the recipe right, tested it on friends, got great feedback. Now someone asks you to open a bakery.

Baking the cake is the easy part. The bakery is the hard part.

That is exactly what happens in machine learning. Training a model is the cake. Getting it to actually work for real users, reliably, every single time, is the bakery. This post is about building the bakery.

Two things people confuse: training and inference

When people talk about AI, they usually mean one of two things without realizing they are different.

Training is when the machine learns. You feed it thousands of examples and it figures out the patterns. This is slow, expensive, and happens once (or occasionally when you retrain).

Inference is when the machine uses what it learned. A user types something, the model reads it and gives back an answer. This is fast, cheap, and happens millions of times a day.

[Diagram: training vs. inference]

Think of it like this: training is studying for an exam. Inference is actually taking the exam. You study once, you take the exam many times.

Most tutorials teach you how to study. This post is about taking the exam in front of real people.

What happens when you just call predict()

When you train a model and save it, you get a file, something like model.pkl. You can load it and ask it questions like this:

import pickle

with open("model.pkl", "rb") as f:
    model = pickle.load(f)

prediction = model.predict([[5.1, 3.5, 1.4, 0.2]])
print(prediction)  # [0]

You pass in some numbers, you get back [0]. But what does 0 mean? And where did those numbers come from? A real user is not going to type four decimal numbers into your app. They are going to type a sentence, upload a photo, or click a button.

This is the gap. The model speaks a very specific language (numbers, arrays). The real world speaks a very different language (text, images, messy data). Someone has to translate.

That someone is the harness.

What a harness is

A harness is just a wrapper. It sits around your model and handles the real world so the model does not have to.

[Diagram: model harness architecture]

The harness does five things:

  1. Takes in whatever the user actually sends (a sentence, a photo, a form)
  2. Checks that it is valid (not empty, not the wrong format)
  3. Converts it into the format the model understands
  4. Passes it to the model
  5. Converts the model's raw answer into something humans can read

Here is a simple example. Say you built a model that reads a customer review and tells you if it is positive or negative. A real user sends you the text "this product is terrible". Here is what the harness looks like:

import pickle

class SentimentHarness:
    def __init__(self):
        # Load the model once when the app starts, not on every request
        with open("model.pkl", "rb") as f:
            self.model = pickle.load(f)
        with open("vectorizer.pkl", "rb") as f:
            self.vectorizer = pickle.load(f)

    def predict(self, user_text: str) -> dict:
        # Step 1: Check the input is valid
        if not user_text or not user_text.strip():
            raise ValueError("Please provide some text")

        # Step 2: Convert text to numbers the model understands
        features = self.vectorizer.transform([user_text])

        # Step 3: Ask the model
        result = self.model.predict(features)

        # Step 4: Convert [0] or [1] into something readable
        label = "negative" if result[0] == 0 else "positive"

        return {"sentiment": label}

You use it like this:

harness = SentimentHarness()
harness.predict("this product is terrible")
# {"sentiment": "negative"}

The user gets back a word, not a number. The model does not know or care about that. The harness handles the translation both ways.

The one rule that breaks everything if you ignore it

When you trained the model, you probably cleaned and transformed the data first. Maybe you lowercased all text, removed punctuation, or scaled numbers to be between 0 and 1.

Here is the trap: you must do the exact same transformations at inference time.

If the model learned from scaled numbers and you suddenly give it raw numbers, it will give you wrong answers. It has no idea something is wrong. It will confidently give you garbage. This is one of the most common reasons a model that looks great in testing falls apart in production.
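To make the trap concrete, here is a toy sketch in pure Python (the "model" and the numbers are made up for illustration): a classifier trained on ages scaled to the range 0 to 1 learned the rule "scaled age above 0.5 means class 1". Feed it a raw age and it still answers, just wrongly.

```python
# Toy "model": during training it saw ages divided by 100,
# and learned that scaled values above 0.5 (age > 50) are class 1.
def model_predict(scaled_age):
    return 1 if scaled_age > 0.5 else 0

def scale(age):
    # The exact transformation used during training
    return age / 100.0

raw_age = 30
print(model_predict(scale(raw_age)))  # 0 -- correct: 30 is below 50
print(model_predict(raw_age))         # 1 -- garbage: we forgot to scale
```

The second call does not crash or warn. It just returns a confident wrong answer, which is exactly why this bug is so hard to spot in production.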

Save everything you used during training, not just the model:

import pickle

# Save the model
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# Save the scaler too
with open("scaler.pkl", "wb") as f:
    pickle.dump(scaler, f)

Load them both in your harness and always use them together. They are a pair.
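One way to keep the pair from drifting apart (a sketch with toy stand-in classes, not a real scikit-learn model) is to pickle them together in a single bundle, so a harness can never load one without the other:

```python
import io
import pickle

class ToyScaler:
    """Stand-in for a real scaler: divides by a fixed maximum."""
    def __init__(self, max_value):
        self.max_value = max_value
    def transform(self, xs):
        return [x / self.max_value for x in xs]

class ToyModel:
    """Stand-in for a real model: class 1 if the scaled value exceeds 0.5."""
    def predict(self, xs):
        return [1 if x > 0.5 else 0 for x in xs]

# Save both objects in one bundle (an in-memory buffer here;
# in practice this would be a file like bundle.pkl)
buf = io.BytesIO()
pickle.dump({"model": ToyModel(), "scaler": ToyScaler(100)}, buf)

# Load them back together in the harness and always use them as a pair
buf.seek(0)
bundle = pickle.load(buf)
preds = bundle["model"].predict(bundle["scaler"].transform([30, 80]))
print(preds)  # [0, 1]
```

Whether you use one bundle or two files is a style choice; what matters is that the harness loads and applies both, every time.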

Handling many users at once

When only one person is using your app, calling the model one at a time is fine. But when a hundred people send requests at the same time, doing it one by one gets slow fast.

The good news is that most models can process a whole group of inputs in a single call, nearly as fast as a single one. This is called batching.

def predict_batch(self, texts: list[str]) -> list[dict]:
    # Convert all texts to numbers at once
    features = self.vectorizer.transform(texts)

    # Ask the model about all of them in one shot
    results = self.model.predict(features)

    return [
        {"sentiment": "negative" if r == 0 else "positive"}
        for r in results
    ]

Instead of ten trips to the model, you make one. Same result, much faster.
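You can see the difference by counting model calls. Here is a toy model (invented for illustration) that tallies how many times it is invoked:

```python
class CountingModel:
    """Toy model that counts how many times predict() is called."""
    def __init__(self):
        self.calls = 0
    def predict(self, batch):
        self.calls += 1
        return [0] * len(batch)

# One-by-one: ten separate trips to the model
one_by_one = CountingModel()
for text in ["some review"] * 10:
    one_by_one.predict([text])
print(one_by_one.calls)  # 10

# Batched: one trip for all ten
batched = CountingModel()
batched.predict(["some review"] * 10)
print(batched.calls)  # 1
```

With a real model, each of those trips carries fixed overhead, so collapsing ten calls into one is where the speedup comes from.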

Making it reachable from anywhere

Once you have a harness, you wrap it in a small web server so any app, any device, any language can talk to it:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
harness = SentimentHarness()

class Request(BaseModel):
    text: str

@app.post("/predict")
def predict(req: Request):
    return harness.predict(req.text)

Now anyone can send a request to your server and get back a prediction. Your iOS app, your website, your Slack bot. They all just call POST /predict with some text and get back {"sentiment": "negative"}.
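For example, assuming the server code above lives in a file called main.py and is started with uvicorn on its default port, any client can hit the endpoint with plain curl:

```shell
# Start the server (assumed filename and default port):
#   uvicorn main:app --port 8000

curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "this product is terrible"}'
# {"sentiment":"negative"}
```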

This is what every AI feature you use every day is doing under the hood.

What actually goes wrong in production

A few things that will bite you if you are not watching:

  • You loaded the model fresh for every request. The model file is large. Loading it takes time. Load it once when the server starts and reuse it.
  • Your test data was clean but real users are messy. People send emojis, typos, empty fields. Your harness needs to handle all of it gracefully.
  • The model slowly gets worse over time. The world changes. The patterns the model learned from last year's data might not match this year's user behavior. This is called drift, and you need monitoring to catch it.

None of these are hard to fix. They just require thinking beyond the notebook.
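A simple starting point for catching drift (a sketch, with a hypothetical MonitoredHarness wrapper and a stub model in place of the real one): track the distribution of the answers your model gives. If a model that used to say "positive" half the time suddenly says "negative" for everything, something changed, in the world or in your pipeline.

```python
from collections import Counter

class MonitoredHarness:
    """Wraps any harness and tallies the distribution of its answers."""
    def __init__(self, harness):
        self.harness = harness
        self.counts = Counter()

    def predict(self, user_text: str) -> dict:
        result = self.harness.predict(user_text)
        self.counts[result["sentiment"]] += 1
        return result

# Stand-in harness for illustration: calls everything negative
class StubHarness:
    def predict(self, user_text):
        return {"sentiment": "negative"}

monitored = MonitoredHarness(StubHarness())
for text in ["bad", "awful", "meh"]:
    monitored.predict(text)
print(monitored.counts)  # Counter({'negative': 3})
```

In production you would periodically compare these counts against the distribution you saw during testing and alert when they diverge. It is crude, but it catches the most common failures without any extra infrastructure.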

Training teaches the model. The harness is what teaches it to live in the real world.