Model Serving

Model serving is the practice of taking a trained machine learning model and making it available to produce predictions in production, typically by placing it behind an API that other software can call. Training a model and serving a model are different engineering problems: training is a batch, throughput-oriented job run occasionally, while serving is a latency-sensitive service that must answer prediction requests reliably and quickly, often under heavy and unpredictable load.

Serving generally takes one of two shapes. In batch serving, a model is run over a large set of records on a schedule, and the predictions are stored for later use, which suits cases like nightly scoring of every customer. In online serving, the model runs as a live service that responds to individual requests in real time, which suits cases like ranking search results or detecting fraud during a transaction. A dedicated serving system such as TensorFlow Serving, which its documentation describes as “a flexible, high-performance serving system for machine learning models, designed for production environments,” handles the operational concerns of loading models, exposing inference endpoints, and managing versions.
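The difference between the two shapes can be sketched in a few lines. This is a minimal illustration, not how any particular serving system works: `predict` is a hypothetical stand-in for a trained model, and `batch_score` and `handle_request` are names invented for this example.

```python
from typing import Dict, List


def predict(features: List[float]) -> float:
    """Hypothetical stand-in for a trained model: a fixed linear scorer."""
    weights = [0.5, -0.25]
    return sum(w * x for w, x in zip(weights, features))


def batch_score(records: Dict[str, List[float]]) -> Dict[str, float]:
    """Batch serving: score every record in one pass and keep the results."""
    return {key: predict(feats) for key, feats in records.items()}


def handle_request(features: List[float]) -> Dict[str, float]:
    """Online serving: score one request at the moment it arrives."""
    return {"prediction": predict(features)}


# Batch: predictions are computed ahead of time and looked up later.
customers = {"a": [2.0, 4.0], "b": [1.0, 0.0]}
stored = batch_score(customers)

# Online: the prediction is computed on demand for a single input.
live = handle_request([2.0, 4.0])
```

In the batch case the caller reads `stored` without touching the model again; in the online case each call pays the model's latency, which is why online serving is the latency-sensitive one.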

A defining concern of model serving is keeping the client-facing interface stable while the model behind it changes. The same documentation highlights the value of deploying new algorithms and experiments “while keeping the same server architecture and APIs,” so that the model can change without disrupting the clients that depend on it. Production serving systems therefore treat models as versioned artifacts that can be loaded, swapped, and rolled back independently of the code that calls them.
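One way to picture models as versioned artifacts is a small registry that keeps a single `predict` entry point while the active version changes underneath it. The `ModelRegistry` class and its methods are hypothetical and written only for this sketch; they are not the API of TensorFlow Serving or any other system.

```python
from typing import Callable, Dict, List, Optional

Model = Callable[[List[float]], float]


class ModelRegistry:
    """Hypothetical registry: versioned models behind one stable predict() call."""

    def __init__(self) -> None:
        self._models: Dict[int, Model] = {}
        self._active: Optional[int] = None
        self._previous: Optional[int] = None

    def load(self, version: int, model: Model) -> None:
        """Make a version available without routing traffic to it."""
        self._models[version] = model

    def activate(self, version: int) -> None:
        """Swap the active version, remembering the one it replaces."""
        if version not in self._models:
            raise KeyError(f"version {version} is not loaded")
        self._previous, self._active = self._active, version

    def rollback(self) -> None:
        """Return to the version that was active before the last swap."""
        if self._previous is None:
            raise RuntimeError("no previous version to roll back to")
        self._active, self._previous = self._previous, None

    def predict(self, features: List[float]) -> float:
        """Client-facing call; its signature never changes across versions."""
        if self._active is None:
            raise RuntimeError("no active model version")
        return self._models[self._active](features)


registry = ModelRegistry()
registry.load(1, lambda x: sum(x))
registry.load(2, lambda x: 2 * sum(x))

registry.activate(1)
v1 = registry.predict([1.0, 2.0])

registry.activate(2)
v2 = registry.predict([1.0, 2.0])

registry.rollback()
after_rollback = registry.predict([1.0, 2.0])
```

The caller only ever invokes `registry.predict`, so swapping to version 2 and rolling back to version 1 requires no change on the client side.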

One of the hardest correctness problems in serving is training-serving skew: the risk that the features a model receives at inference time are computed differently from the features it was trained on. Even a subtle mismatch in how an input is preprocessed can silently erode accuracy, because the model is being asked to make predictions on inputs that do not match its training distribution. This is one of the main reasons feature stores exist, since sharing a single feature definition between training and serving is a direct way to reduce this skew.
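The idea of a single shared feature definition can be shown with a small sketch. The feature names (`amount`, `age_days`) and both functions are assumptions made up for illustration; the point is only that training and serving call the same code, while a separate reimplementation can drift from it unnoticed.

```python
import math
from typing import Dict, List


def build_features(raw: Dict[str, float]) -> List[float]:
    """Single feature definition, imported by both training and serving code."""
    return [math.log1p(raw["amount"]), raw["age_days"] / 365.0]


def build_features_skewed(raw: Dict[str, float]) -> List[float]:
    """A serving-side reimplementation that omits the log transform."""
    return [raw["amount"], raw["age_days"] / 365.0]


example = {"amount": 99.0, "age_days": 730.0}

training_row = build_features(example)        # what the model was trained on
serving_row = build_features(example)         # same code path, so no skew
skewed_row = build_features_skewed(example)   # silently different input
```

Nothing fails when `skewed_row` is passed to a model; the prediction is simply computed on a value the model never saw during training, which is what makes this kind of skew hard to detect.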

Model serving sits at the production end of the machine learning pipeline and is where MLOps meets ordinary operations. Served models are commonly packaged into containers and deployed on orchestration platforms like Kubernetes, so that the same scaling, rollout, and monitoring practices used for conventional services can be applied to inference. The distinctive additions are model versioning, drift monitoring, and vigilance against training-serving skew.
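As a rough illustration of drift monitoring, the sketch below compares the mean of a live feature sample against a reference sample from training time. The mean-shift measure, the threshold of 3.0, and the function names are all assumptions chosen for simplicity; production systems often use richer statistical tests.

```python
from statistics import mean, pstdev
from typing import Sequence


def mean_shift(reference: Sequence[float], live: Sequence[float]) -> float:
    """Distance between live and reference means, in reference standard deviations."""
    spread = pstdev(reference)
    if spread == 0:
        raise ValueError("reference sample has zero variance")
    return abs(mean(live) - mean(reference)) / spread


def has_drifted(
    reference: Sequence[float], live: Sequence[float], threshold: float = 3.0
) -> bool:
    """Flag drift when the shift exceeds an (arbitrary, illustrative) threshold."""
    return mean_shift(reference, live) > threshold


reference = [10.0, 12.0, 11.0, 9.0, 13.0]   # feature values seen in training
stable = [11.0, 10.0, 12.0]                 # live values with the same center
shifted = [30.0, 32.0, 31.0]                # live values far from the reference

stable_flag = has_drifted(reference, stable)
shifted_flag = has_drifted(reference, shifted)
```

A check like this would typically run on a schedule against recent request logs and raise an alert through the same monitoring stack used for conventional services.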