- According to multiple industry surveys, approximately 87% of machine learning projects never reach production—the bottleneck lies not in model accuracy, but in the absence of engineering pipelines[5]
- Google's three-level MLOps maturity model (Level 0: Manual → Level 1: Pipeline Automation → Level 2: CI/CD Automation) provides enterprises with a clear evolution path[8]
- In ML systems, the actual code for model training accounts for only 5–10% of the overall system; the remaining 90–95% consists of data management, feature engineering, monitoring, deployment, and other infrastructure[1]
- Enterprises that adopt MLOps can, on average, reduce the model development-to-deployment cycle from months to days, and cut model failure detection time from weeks to minutes[4]
1. 87% of ML Projects Never Make It to Production: Why MLOps Is Essential for AI Deployment
The history of machine learning is rife with an ironic pattern: models that perform brilliantly in the lab frequently falter once they enter the real world. Paleyes et al., in their survey published in ACM Computing Surveys[5], systematically cataloged the challenges of machine learning deployment and found that problems almost never originate in the algorithms themselves—rather, they arise from everything "surrounding" the algorithms: fragile data pipelines, untraceable experiment results, manual deployment processes, and the absence of post-deployment monitoring.
In 2015, a Google research team published a seminal paper at NeurIPS[1] that used a now widely cited architecture diagram to reveal a stark truth: in a production-grade ML system, the actual model training code—the part we invest the most effort into—occupies only a small box. Surrounding it is a massive infrastructure layer: data collection, data validation, feature extraction, resource management, serving infrastructure, and monitoring systems. It is this "glue code" and infrastructure that ultimately determines whether an ML system can operate reliably in production.
Microsoft's large-scale empirical study[2] further confirmed this finding. After interviewing dozens of internal ML teams, they discovered that software engineering best practices—version control, continuous integration, automated testing, and monitoring alerts—were severely neglected in ML development. Data scientists are accustomed to rapid iteration in Jupyter Notebooks, but this workflow has fundamental shortcomings in collaboration, reproducibility, and productionization.
MLOps (Machine Learning Operations) was born precisely to bridge this gap. It extends the core principles of DevOps—automation, monitoring, collaboration, and continuous delivery—to the entire machine learning lifecycle. Kreuzberger et al., in their IEEE Access review[4], defined MLOps as a set of principles and practices aimed at reliably and efficiently deploying and maintaining machine learning models in production environments.
This article provides a comprehensive breakdown of MLOps core components, from theoretical frameworks to hands-on practice. We not only explain the "why" but also provide two ready-to-run Google Colab labs, enabling readers to experience the power of the MLOps toolchain firsthand.
2. The MLOps Maturity Model: An Evolution Path from Manual to Fully Automated
Google Cloud, in its MLOps architecture guide[8], proposed a three-level maturity model that has become the most widely adopted MLOps evolution framework in the industry. Understanding these three levels is the first step in planning an enterprise MLOps strategy.
Level 0: Manual Processes
This is the starting point for most ML teams and the stage where 87% of projects get stuck. Characteristics include:
- Data scientists manually train models locally or in notebooks
- Experiment records rely on Excel or personal notes ("I think lr=0.01 worked better last time")
- Model delivery involves "emailing the pickle file to the engineer"
- Deployment is a one-off manual event with no automated testing
- No systematic monitoring after deployment; model degradation goes unnoticed
Shankar et al.'s interview study[11] vividly depicted the Level 0 predicament: one ML engineer reported that their model deployment process required 47 manual steps, and any error in a single step meant starting over from the beginning.
Level 1: ML Pipeline Automation
At this level, the key breakthrough is encapsulating the training process into automated pipelines:
- Data validation, feature engineering, training, and evaluation form an automated workflow
- When new data arrives, the pipeline can automatically trigger retraining
- Experiment tracking tools (such as MLflow) record the parameters and results of each training run
- Model deployment may still require some manual intervention
- Basic model performance monitoring begins
TFX[9] is Google's exemplary Level 1 implementation. It chains data validation (TFDV), model analysis (TFMA), and serving deployment (TF Serving) into an automated pipeline, transforming model retraining from a manual operation into a one-click trigger.
Level 2: Full CI/CD Automation
The highest level achieves complete automation of ML systems:
- Code changes automatically trigger the CI/CD pipeline
- Automated tests cover data quality, model performance, and service stability
- Models are automatically deployed to production after passing all tests
- Full A/B testing and canary deployment
- Real-time Data Drift and Model Drift detection with automated retraining triggers
Sato et al., in their ThoughtWorks technical report[12], described in detail the complete practice of CD4ML (Continuous Delivery for Machine Learning), demonstrating how software engineering's continuous delivery methodology can be applied to ML systems.
| Dimension | Level 0: Manual | Level 1: Pipeline | Level 2: CI/CD |
|---|---|---|---|
| Training Trigger | Manual execution | Auto-triggered by new data | Auto-triggered by code/data changes |
| Experiment Tracking | Excel / notes | MLflow / W&B | MLflow + automated comparison |
| Model Deployment | Manual scp / email | Semi-automated | Automated + Canary |
| Testing | None | Basic validation | Full coverage: data / model / service |
| Monitoring | None | Basic metrics | Drift detection + auto-alerting |
| Iteration Cycle | Weeks to months | Days | Hours |
3. Experiment Management: Tracking Every Training Run with MLflow
Experiment management is the cornerstone of MLOps. Without systematic experiment tracking, ML development is like writing code in the pre-version-control era—everyone experimenting on their own branch with no one knowing which version is the "right" one.
MLflow[3], open-sourced by Databricks, is currently the most widely adopted ML experiment management platform. It offers four core modules (Tracking, Projects, Models, and Model Registry); the three most relevant to experiment management are covered below:
3.1 MLflow Tracking: Experiment Tracking
The core concept of MLflow Tracking is the Run—each training execution is a Run, which records:
- Parameters: Hyperparameters (learning rate, batch size, epochs, etc.)
- Metrics: Evaluation metrics (accuracy, loss, F1-score, etc.), with support for step-by-step logging
- Artifacts: Outputs (trained model files, confusion matrix plots, feature importance charts, etc.)
- Tags: Custom tags (experiment purpose, dataset version, operator, etc.)
Multiple Runs are organized under an Experiment, and MLflow provides a built-in Web UI for real-time comparison of different Runs' performance. This solves the most common pain point in ML development: "What parameters did I use for that model that performed so well last week?"
3.2 MLflow Models: Standardized Model Packaging
MLflow Models defines a unified model packaging format that wraps models the same way regardless of whether the underlying framework is scikit-learn, PyTorch, or TensorFlow. Each MLflow Model contains:
- MLmodel file: Describes the model's flavors (the various ways it can be loaded)
- Model binary: Serialized model weights
- conda.yaml / requirements.txt: Precise dependency environment descriptions
- input_example: Sample input for inference testing
3.3 MLflow Model Registry: Model Version Management
Model Registry introduces lifecycle management for models. Each registered model can be tagged with different stages:
- Staging: Awaiting validation, undergoing A/B testing or performance evaluation
- Production: Validated and officially serving in production
- Archived: Retired previous versions
This provides clear protocols for model upgrades and rollbacks, replacing the high-risk practice of "directly overwriting the model.pkl in production."
4. Data Versioning and Feature Engineering: DVC and Feature Store
Polyzotis et al., in their research published in ACM SIGMOD Record[6], pointed out that the most underestimated challenge in production ML systems is data management. Model performance depends on training data quality, and data in production environments is constantly changing—making data versioning an indispensable component of MLOps.
4.1 DVC (Data Version Control): Git for Large-Scale Data
Git is the gold standard for code version control, but it cannot handle GB-scale or even TB-scale training data and model files. DVC was created precisely for this purpose—it builds a data versioning layer on top of Git:
- Git tracks .dvc metadata files: Recording data file hashes, sizes, and remote storage locations
- Actual data stored in remote storage: Supporting S3, GCS, Azure Blob, SSH, and more
- Pipeline definition: Using `dvc.yaml` to describe data processing DAGs (Directed Acyclic Graphs)
- Version switching: `git checkout` + `dvc checkout` to return to any historical data state
This means every model training run can be precisely mapped to a specific data version, definitively solving the age-old question: "Which dataset was this model trained on?"
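A pipeline definition for a two-stage DAG can be sketched as follows; the stage names, script paths, and file names are illustrative:

```yaml
# dvc.yaml — minimal two-stage pipeline sketch (illustrative names)
stages:
  prepare:
    cmd: python prepare.py data/raw.csv data/clean.csv
    deps:
      - prepare.py
      - data/raw.csv
    outs:
      - data/clean.csv
  train:
    cmd: python train.py data/clean.csv models/model.pkl
    deps:
      - train.py
      - data/clean.csv
    outs:
      - models/model.pkl
```

`dvc repro` then re-executes only the stages whose dependencies changed, caching everything else by content hash.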
4.2 Feature Store: A Central Repository for Features
In large ML teams, different projects often require similar features (e.g., "user's transaction count over the past 30 days"). Without a Feature Store, each team computes independently, leading to:
- Redundant computation: The same feature is independently developed by different teams
- Training-Serving Skew: Features computed in Python during training are re-implemented in Java for serving, causing logical inconsistencies
- Data Leakage (time-travel problem): Accidentally using future data
Feature Stores (such as Feast, Tecton, and Hopsworks) provide unified feature definition, storage, and serving, ensuring that training and inference use exactly the same feature computation logic. Huyen, in her book[10], compared the Feature Store to "middleware" for ML systems—a bridge connecting data engineering and model training.
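The skew and leakage problems both stem from having more than one implementation of a feature. A framework-free sketch of the fix, making one function the single source of truth (function and field names are illustrative, not from any Feature Store product):

```python
from datetime import datetime, timedelta

def txn_count_last_30d(transactions: list, now: datetime) -> int:
    """Feature: number of transactions in the 30 days before `now`.
    The strict upper bound (t < now) excludes future data, preventing
    time-travel leakage when `now` is a historical label timestamp."""
    cutoff = now - timedelta(days=30)
    return sum(1 for t in transactions if cutoff <= t < now)

history = [datetime(2024, 6, 1), datetime(2024, 6, 15), datetime(2024, 3, 1)]

# Training: compute the feature as of the label's point in time
train_value = txn_count_last_30d(history, now=datetime(2024, 6, 20))
# Serving: the exact same function, so no Python-vs-Java re-implementation skew
serve_value = txn_count_last_30d(history, now=datetime(2024, 6, 30))
print(train_value, serve_value)  # 2 2
```

A Feature Store industrializes exactly this idea: the definition is registered once, and both the offline (training) and online (serving) paths execute it.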
5. Model Packaging and Deployment: From Flask to BentoML
After model training is complete, the biggest challenge often just begins: how to transform a model in a Python script into a reliable, scalable, and monitorable production service?
5.1 The Evolution of Deployment Approaches
| Stage | Approach | Pros | Cons |
|---|---|---|---|
| V1: Manual | Flask / FastAPI self-wrapped | Rapid prototyping | No standardization, hard to maintain |
| V2: Containerized | Docker + Kubernetes | Environment consistency, scalable | Requires DevOps expertise |
| V3: Framework-based | BentoML / Seldon / KServe | Standardized, built-in best practices | Learning curve |
| V4: Serverless | AWS Lambda / Cloud Run | Zero ops, auto-scaling | Cold starts, model size limits |
5.2 BentoML: The Shortest Path from Model to API
BentoML is an open-source framework specifically designed for ML model serving. Its core philosophy is that data scientists should not need to learn Docker or Kubernetes to deploy a model. BentoML abstracts model deployment into three steps:
- Save the model: Use `bentoml.sklearn.save_model()` to store the trained model in a local model repository
- Define the service: Use Python decorators to declare API endpoints and define input/output formats
- Package and deploy: BentoML automatically generates a Docker image containing all dependencies and optimized configurations
BentoML also includes built-in production-grade features such as batch inference (Batching), adaptive micro-batching (Adaptive Batching), and multi-model composition (Runner)—features that would require hundreds of additional lines of code when building manually with Flask.
5.3 Deployment Strategies: Blue-Green, Canary, and Shadow Mode
Model updates in production should never be a matter of "shut down the old one, turn on the new one." Mature MLOps practices employ progressive deployment strategies:
- Blue-Green Deployment: Run both old and new versions simultaneously, enabling one-click rollback via traffic switching
- Canary Deployment: Initially route 5% of traffic to the new model, then gradually increase after confirming metrics are healthy
- Shadow Mode: The new model receives all requests but does not return actual results—it only logs predictions for offline comparison
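The routing logic behind a canary rollout is simple to sketch. Below, the two models are stubs and the 5% split is illustrative; in production the split is usually enforced at the load balancer or service mesh, not in application code:

```python
import random

random.seed(0)  # deterministic for the demo

def stable_model(x):  # current production model (stub)
    return "stable"

def new_model(x):     # candidate model (stub)
    return "canary"

def route(x, canary_fraction=0.05):
    """Canary routing: send a small slice of traffic to the new model."""
    if random.random() < canary_fraction:
        return new_model(x)
    return stable_model(x)

results = [route(i) for i in range(10_000)]
canary_share = results.count("canary") / len(results)
print(f"canary share: {canary_share:.3f}")  # close to 0.05
```

Gradually raising `canary_fraction` while watching the new model's error and latency metrics is what turns this into a safe, reversible rollout.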
6. Hands-on Lab 1: Complete MLflow Experiment Management Workflow
This lab walks you through the complete MLflow core workflow—from creating experiments, training multiple models, logging parameters and metrics, comparing experiment results, to selecting the best model and registering it.
Open Google Colab (CPU runtime is sufficient), create a new Notebook, and paste the following code blocks in order:
6.1 Environment Setup and Data Preparation
!pip install mlflow scikit-learn matplotlib -q
import mlflow
import mlflow.sklearn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import (
accuracy_score, precision_score, recall_score,
f1_score, confusion_matrix, ConfusionMatrixDisplay
)
import warnings
warnings.filterwarnings('ignore')
# Load Wine dataset (multi-class problem, 3 classes, 13 features)
wine = load_wine()
X, y = wine.data, wine.target
feature_names = wine.feature_names
target_names = wine.target_names
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(f"Training set: {X_train.shape[0]} samples, Test set: {X_test.shape[0]} samples")
print(f"Number of features: {X_train.shape[1]}, Number of classes: {len(target_names)}")
print(f"Class distribution (train): {np.bincount(y_train)}")
print(f"Class distribution (test): {np.bincount(y_test)}")
6.2 Define the Experiment Tracking Function
def train_and_log(model, model_name, params, X_tr, X_te, y_tr, y_te):
"""Train a model and log all information to MLflow"""
with mlflow.start_run(run_name=model_name):
# Log hyperparameters
mlflow.log_params(params)
mlflow.set_tag("model_type", model_name)
mlflow.set_tag("dataset", "wine")
mlflow.set_tag("scaler", "StandardScaler")
# Train
model.fit(X_tr, y_tr)
y_pred = model.predict(X_te)
# Log multiple evaluation metrics
metrics = {
"accuracy": accuracy_score(y_te, y_pred),
"precision_macro": precision_score(y_te, y_pred, average='macro'),
"recall_macro": recall_score(y_te, y_pred, average='macro'),
"f1_macro": f1_score(y_te, y_pred, average='macro'),
}
# Cross-validation score (more robust evaluation)
cv_scores = cross_val_score(model, X_tr, y_tr, cv=5, scoring='accuracy')
metrics["cv_mean_accuracy"] = cv_scores.mean()
metrics["cv_std_accuracy"] = cv_scores.std()
mlflow.log_metrics(metrics)
# Log artifacts: confusion matrix plot
fig, ax = plt.subplots(figsize=(6, 5))
cm = confusion_matrix(y_te, y_pred)
disp = ConfusionMatrixDisplay(cm, display_labels=target_names)
disp.plot(ax=ax, cmap='Blues')
ax.set_title(f"{model_name} — Confusion Matrix")
plt.tight_layout()
fig.savefig("confusion_matrix.png", dpi=100)
mlflow.log_artifact("confusion_matrix.png")
plt.close()
# Log the model itself
mlflow.sklearn.log_model(model, "model")
print(f" {model_name}: accuracy={metrics['accuracy']:.4f}, "
f"f1={metrics['f1_macro']:.4f}, "
f"cv={metrics['cv_mean_accuracy']:.4f}+/-{metrics['cv_std_accuracy']:.4f}")
return metrics
print("Training and logging function defined successfully")
6.3 Set Up MLflow Experiment and Train Multiple Models
# Create MLflow experiment
experiment_name = "wine_classification"
mlflow.set_experiment(experiment_name)
print("=" * 65)
print(" MLflow Experiment Management — Wine Classification Model Comparison")
print("=" * 65)
# Define model and hyperparameter combinations
experiments = [
{
"name": "LogisticRegression_C0.1",
"model": LogisticRegression(C=0.1, max_iter=1000, random_state=42),
"params": {"algorithm": "LogisticRegression", "C": 0.1, "max_iter": 1000}
},
{
"name": "LogisticRegression_C1.0",
"model": LogisticRegression(C=1.0, max_iter=1000, random_state=42),
"params": {"algorithm": "LogisticRegression", "C": 1.0, "max_iter": 1000}
},
{
"name": "LogisticRegression_C10.0",
"model": LogisticRegression(C=10.0, max_iter=1000, random_state=42),
"params": {"algorithm": "LogisticRegression", "C": 10.0, "max_iter": 1000}
},
{
"name": "RandomForest_100trees",
"model": RandomForestClassifier(n_estimators=100, max_depth=None, random_state=42),
"params": {"algorithm": "RandomForest", "n_estimators": 100, "max_depth": "None"}
},
{
"name": "RandomForest_200trees_depth5",
"model": RandomForestClassifier(n_estimators=200, max_depth=5, random_state=42),
"params": {"algorithm": "RandomForest", "n_estimators": 200, "max_depth": 5}
},
{
"name": "GradientBoosting_100",
"model": GradientBoostingClassifier(
n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
),
"params": {"algorithm": "GradientBoosting", "n_estimators": 100,
"learning_rate": 0.1, "max_depth": 3}
},
{
"name": "GradientBoosting_200_slow",
"model": GradientBoostingClassifier(
n_estimators=200, learning_rate=0.05, max_depth=4, random_state=42
),
"params": {"algorithm": "GradientBoosting", "n_estimators": 200,
"learning_rate": 0.05, "max_depth": 4}
},
{
"name": "SVM_rbf",
"model": SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42),
"params": {"algorithm": "SVM", "kernel": "rbf", "C": 1.0, "gamma": "scale"}
},
]
# Train and log each model to MLflow
all_results = {}
for exp in experiments:
result = train_and_log(
exp["model"], exp["name"], exp["params"],
X_train_scaled, X_test_scaled, y_train, y_test
)
all_results[exp["name"]] = result
print(f"\nCompleted {len(experiments)} experiments, all logged to MLflow")
6.4 Query and Compare Experiment Results
# Use MLflow API to query experiment results
from mlflow.tracking import MlflowClient
client = MlflowClient()
experiment = client.get_experiment_by_name(experiment_name)
runs = client.search_runs(
experiment_ids=[experiment.experiment_id],
order_by=["metrics.f1_macro DESC"]
)
print("=" * 75)
print(f" Experiment Results Ranking (sorted by F1-macro)")
print("=" * 75)
print(f"{'Rank':<5}{'Model':<35}{'Accuracy':<12}{'F1-macro':<12}{'CV Mean':<12}")
print("-" * 75)
for i, run in enumerate(runs):
m = run.data.metrics
print(f" {i+1:<3} {run.info.run_name:<35}"
f"{m['accuracy']:<12.4f}{m['f1_macro']:<12.4f}"
f"{m['cv_mean_accuracy']:<12.4f}")
# Best model
best_run = runs[0]
print(f"\nBest Model: {best_run.info.run_name}")
print(f" Run ID: {best_run.info.run_id}")
print(f" F1-macro: {best_run.data.metrics['f1_macro']:.4f}")
print(f" CV Accuracy: {best_run.data.metrics['cv_mean_accuracy']:.4f}"
f" +/- {best_run.data.metrics['cv_std_accuracy']:.4f}")
# Visual comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
names = [r.info.run_name.replace("_", "\n") for r in runs]
accs = [r.data.metrics['accuracy'] for r in runs]
f1s = [r.data.metrics['f1_macro'] for r in runs]
colors = ['#b8922e' if i == 0 else '#0077b6' for i in range(len(runs))]
axes[0].barh(names, accs, color=colors)
axes[0].set_xlabel('Accuracy')
axes[0].set_title('Model Accuracy Comparison')
axes[0].set_xlim(0.85, 1.01)
axes[1].barh(names, f1s, color=colors)
axes[1].set_xlabel('F1-macro')
axes[1].set_title('Model F1-macro Comparison (Gold = Best)')
axes[1].set_xlim(0.85, 1.01)
plt.tight_layout()
plt.savefig("model_comparison.png", dpi=120, bbox_inches='tight')
plt.show()
print("\nComparison charts saved")
6.5 Register the Best Model to the Model Registry
# Register the best model to MLflow Model Registry
model_name_registry = "wine_classifier_production"
model_uri = f"runs:/{best_run.info.run_id}/model"
registered = mlflow.register_model(model_uri, model_name_registry)
print(f"\nModel registered to Model Registry")
print(f" Model name: {registered.name}")
print(f" Version: {registered.version}")
print(f" Source Run: {best_run.info.run_name}")
# Update model description
client.update_registered_model(
name=model_name_registry,
description="Best Wine classification model — automatically selected through the MLflow experiment management workflow"
)
# Load the registered model and perform inference
loaded_model = mlflow.sklearn.load_model(model_uri)
sample = X_test_scaled[:5]
predictions = loaded_model.predict(sample)
print(f"\nModel inference test (first 5 test samples):")
for i in range(5):
actual = target_names[y_test[i]]
predicted = target_names[predictions[i]]
status = "Correct" if y_test[i] == predictions[i] else "Wrong"
print(f" [{status}] Actual: {actual:<12} Predicted: {predicted}")
print(f"\nLab 1 complete! You have learned:")
print(f" 1. Creating MLflow experiments and tracking multiple models")
print(f" 2. Logging hyperparameters, evaluation metrics, and artifacts")
print(f" 3. Querying and comparing experiment results using the API")
print(f" 4. Registering the best model to the Model Registry")
print(f" 5. Loading a model from the Registry for inference")
7. Hands-on Lab 2: Model Packaging and API Serving
In Lab 1, we used MLflow to manage the experiment workflow and select the best model. In this lab, we will use BentoML to package the model as a REST API ready for external serving—a critical step in the journey from "experiment" to "product."
Open Google Colab (CPU runtime is sufficient), create a new Notebook, and paste the following code blocks in order:
7.1 Environment Setup and Model Training
!pip install bentoml scikit-learn numpy requests -q
import bentoml
import numpy as np
import json
import time
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report
# Train a production-grade model
wine = load_wine()
X, y = wine.data, wine.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = GradientBoostingClassifier(
n_estimators=200, learning_rate=0.05, max_depth=4, random_state=42
)
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
acc = accuracy_score(y_test, y_pred)
print(f"Model training complete — Test set Accuracy: {acc:.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=wine.target_names))
7.2 Save the Model to the BentoML Model Store
# Save the model and preprocessor together to BentoML
saved_model = bentoml.sklearn.save_model(
"wine_classifier",
model,
signatures={"predict": {"batchable": True}},
labels={"task": "classification", "dataset": "wine", "framework": "sklearn"},
metadata={
"accuracy": float(acc),
"n_features": X_train.shape[1],
"n_classes": len(wine.target_names),
"feature_names": list(wine.feature_names),
"target_names": list(wine.target_names),
},
custom_objects={
"scaler": scaler # Save the preprocessor alongside the model
}
)
print(f"Model saved to BentoML Model Store")
print(f" Model Tag: {saved_model.tag}")
print(f" Storage Path: {saved_model.path}")
# List all saved models
print(f"\nSaved models list:")
for m in bentoml.models.list():
print(f" - {m.tag} (Created: {m.info.creation_time})")
7.3 Define the BentoML Service
# Write the BentoML Service definition file
service_code = '''
import numpy as np
import bentoml
from bentoml.io import NumpyNdarray, JSON
# Load model and preprocessor
model_runner = bentoml.sklearn.get("wine_classifier:latest").to_runner()
model_ref = bentoml.models.get("wine_classifier:latest")
scaler = model_ref.custom_objects["scaler"]
metadata = model_ref.info.metadata
svc = bentoml.Service("wine_classifier_service", runners=[model_runner])
@svc.api(input=NumpyNdarray(), output=JSON())
async def predict(input_array: np.ndarray) -> dict:
"""Accept raw features, return predicted class and probability"""
# Preprocessing
scaled = scaler.transform(input_array.reshape(1, -1) if input_array.ndim == 1 else input_array)
# Prediction
predictions = await model_runner.predict.async_run(scaled)
target_names = metadata["target_names"]
results = []
for pred in predictions:
results.append({
"class_id": int(pred),
"class_name": target_names[int(pred)],
})
return {"predictions": results, "model": str(model_ref.tag)}
@svc.api(input=JSON(), output=JSON())
async def predict_json(input_data: dict) -> dict:
"""Accept feature data in JSON format"""
features = np.array(input_data["features"])
scaled = scaler.transform(features.reshape(1, -1) if features.ndim == 1 else features)
predictions = await model_runner.predict.async_run(scaled)
target_names = metadata["target_names"]
results = []
for pred in predictions:
results.append({
"class_id": int(pred),
"class_name": target_names[int(pred)],
})
return {
"predictions": results,
"model": str(model_ref.tag),
"feature_names": metadata["feature_names"]
}
@svc.api(input=JSON(), output=JSON())
async def model_info(input_data: dict) -> dict:
"""Return model metadata"""
return {
"model_tag": str(model_ref.tag),
"accuracy": metadata["accuracy"],
"n_features": metadata["n_features"],
"n_classes": metadata["n_classes"],
"feature_names": metadata["feature_names"],
"target_names": metadata["target_names"],
}
'''
with open("service.py", "w") as f:
f.write(service_code)
print("BentoML Service definition file (service.py) created")
print(" Contains 3 API endpoints:")
print(" - /predict : NumPy array input")
print(" - /predict_json : JSON format input")
print(" - /model_info : Model metadata query")
7.4 Simulated API Inference Test
# Test model inference logic directly in Colab (without starting an HTTP Server)
print("=" * 65)
print(" Model Packaging and Inference Test")
print("=" * 65)
# Load the saved model
model_ref = bentoml.models.get("wine_classifier:latest")
loaded_model = bentoml.sklearn.load_model("wine_classifier:latest")
loaded_scaler = model_ref.custom_objects["scaler"]
meta = model_ref.info.metadata
print(f"\nModel Info:")
print(f" Tag: {model_ref.tag}")
print(f" Accuracy: {meta['accuracy']:.4f}")
print(f" Number of features: {meta['n_features']}")
print(f" Classes: {meta['target_names']}")
# Simulated API request — single inference
print(f"\n--- Single Inference Test ---")
sample = X_test[0]
sample_scaled = loaded_scaler.transform(sample.reshape(1, -1))
pred = loaded_model.predict(sample_scaled)
print(f" Input features (first 5): {sample[:5].round(2)}")
print(f" Predicted class ID: {pred[0]}")
print(f" Predicted class name: {wine.target_names[pred[0]]}")
print(f" Actual class name: {wine.target_names[y_test[0]]}")
# Simulated API request — batch inference
print(f"\n--- Batch Inference Test (10 samples) ---")
batch = X_test[:10]
batch_scaled = loaded_scaler.transform(batch)
batch_preds = loaded_model.predict(batch_scaled)
correct = sum(batch_preds == y_test[:10])
print(f" Batch inference results: {correct}/10 correct")
for i in range(10):
actual = wine.target_names[y_test[i]]
predicted = wine.target_names[batch_preds[i]]
status = "OK" if y_test[i] == batch_preds[i] else "NG"
print(f" [{status}] #{i+1} Actual: {actual:<12} Predicted: {predicted}")
# Inference latency test
print(f"\n--- Inference Latency Benchmark ---")
single_sample = loaded_scaler.transform(X_test[0].reshape(1, -1))
batch_all = loaded_scaler.transform(X_test)  # all 36 test samples
# Single-sample latency
times_single = []
for _ in range(100):
    t0 = time.time()
    _ = loaded_model.predict(single_sample)
    times_single.append((time.time() - t0) * 1000)
# Batch latency
times_batch = []
for _ in range(100):
    t0 = time.time()
    _ = loaded_model.predict(batch_all)
    times_batch.append((time.time() - t0) * 1000)
print(f" Single inference: {np.mean(times_single):.3f}ms (p99: {np.percentile(times_single, 99):.3f}ms)")
print(f" Batch {len(batch_all)} samples: {np.mean(times_batch):.3f}ms (p99: {np.percentile(times_batch, 99):.3f}ms)")
print(f" Batch efficiency: {np.mean(times_single) * len(batch_all) / np.mean(times_batch):.1f}x")
7.5 Build the Bento and Inspect the Package Structure
# Build bentofile.yaml (BentoML packaging configuration)
bentofile_content = '''
service: "service:svc"
labels:
owner: meta-intelligence
project: wine-classifier
stage: production
include:
- "*.py"
python:
packages:
- scikit-learn
- numpy
'''
with open("bentofile.yaml", "w") as f:
f.write(bentofile_content)
print("bentofile.yaml created")
print("\nPackaging configuration content:")
print(bentofile_content)
# Display the complete deployment workflow
print("=" * 65)
print(" Production Deployment Workflow (CLI Command Guide)")
print("=" * 65)
print("""
In your local development environment, execute the following commands to complete deployment:
# 1. Start local dev server (for testing)
$ bentoml serve service:svc --reload
# 2. Package into a Bento
$ bentoml build
# 3. Containerize (generate Docker image)
$ bentoml containerize wine_classifier_service:latest
# 4. Run the container
$ docker run -p 3000:3000 wine_classifier_service:latest
# 5. Test the API
$ curl -X POST http://localhost:3000/predict_json \\
-H "Content-Type: application/json" \\
-d '{"features": [13.0, 1.5, 2.3, 15.0, 100, 2.8, 3.0, 0.28, 2.29, 5.64, 1.04, 3.92, 1065]}'
""")
print(f"\nLab 2 complete! You have learned:")
print(f" 1. Saving models and preprocessors to BentoML Model Store")
print(f" 2. Defining multi-endpoint API Services")
print(f" 3. Testing model inference (single and batch)")
print(f" 4. Creating packaging configurations and deployment workflows")
print(f" 5. Understanding the complete path from development to containerized deployment")
8. CI/CD for ML: Automated Testing and Continuous Deployment
CI/CD for traditional software is well established, but continuous integration and delivery for ML systems face unique challenges. Sato et al.'s CD4ML framework[12] raised an important point: ML systems have three axes of change that require version control—code, models, and data. A change on any axis may require re-validation and redeployment.
8.1 ML-Specific Testing Strategies
The ML Test Score proposed by Breck et al.[7] defines a comprehensive testing rubric for ML systems, covering four major categories:
Data Tests:
- Whether feature statistical distributions fall within expected ranges (min, max, mean, missing rate)
- Whether training data and serving data schemas are consistent
- Whether Data Leakage exists
- Whether inter-feature correlations remain stable
Model Tests:
- Whether model performance on benchmark test sets exceeds the minimum threshold
- Whether the new model outperforms the current production model (regression testing)
- Whether model performance is equitable across different subgroups (Fairness Testing)
- Model robustness against adversarial examples
Infrastructure Tests:
- Whether the training process is reproducible (Reproducibility)
- Whether model serialization/deserialization is correct
- Whether API endpoint response times are within SLA bounds
- Whether resource usage (memory, CPU, GPU) is within budget
Monitoring Tests:
- Whether comprehensive logging is in place
- Whether alert thresholds are properly configured
- Whether automatic rollback mechanisms exist when model performance degrades
8.2 ML CI/CD Pipeline with GitHub Actions
Below is an outline of a typical ML CI/CD pipeline implemented with GitHub Actions:
# .github/workflows/ml-pipeline.yml structure overview
#
# Triggers: push to main / PR / scheduled (daily retraining)
#
# Stage 1: Data Validation
# - Check data schema consistency
# - Validate feature distributions (Great Expectations / Pandera)
# - Detect data drift
#
# Stage 2: Model Training
# - Pull latest training data from DVC
# - Execute training pipeline
# - Log experiments with MLflow
#
# Stage 3: Model Validation
# - Benchmark test set performance >= threshold
# - New model >= current production model
# - Fairness checks pass
# - Latency benchmark passes
#
# Stage 4: Model Deployment
# - BentoML packaging
# - Docker containerization
# - Canary deployment (5% traffic)
# - Monitor for 30 minutes
# - Full traffic cutover
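The first two stages of this outline can be sketched as a concrete workflow file. This is a minimal, abridged sketch, not a complete pipeline: the `scripts/validate_data.py`, `scripts/train.py`, and `scripts/evaluate.py` entry points are hypothetical, and it assumes a configured DVC remote and MLflow tracking server:

```yaml
# .github/workflows/ml-pipeline.yml (abridged sketch)
name: ml-pipeline
on:
  push:
    branches: [main]
  pull_request:
  schedule:
    - cron: "0 2 * * *"   # daily retraining

jobs:
  data-validation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: python scripts/validate_data.py   # schema + drift checks

  train-and-validate:
    needs: data-validation
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: dvc pull data/                    # fetch versioned training data
      - run: python scripts/train.py           # logs experiments to MLflow
      - run: python scripts/evaluate.py --min-accuracy 0.85
```

The `needs:` keyword enforces the stage ordering from the outline; deployment stages would follow the same pattern, gated on the validation job succeeding.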
Lu et al.'s survey of the MLOps tool ecosystem[14] pointed out that the fragmentation of MLOps toolchains is one of the biggest barriers to enterprise adoption. Integration between different tools often requires substantial "glue code," which itself becomes a new source of technical debt.
9. Model Monitoring: Data Drift and Model Drift Detection
Deploying a model is not the finish line—it is the starting point of a new journey. Klaise et al.'s monitoring survey[13] systematically categorized the degradation risks facing production ML models, with the two most critical types being Data Drift and Model Drift.
9.1 Data Drift: Changes in Input Data Distribution
Data Drift refers to significant changes in the distribution of input data in production compared to the training data. This is the most common cause of ML model degradation.
Common causes:
- Seasonal variations: E-commerce purchasing behavior differs drastically between holidays and regular days
- Upstream system changes: Data providers modify ETL logic or field definitions
- Evolving user behavior: COVID-19 fundamentally altered user patterns across many industries
- Feature computation errors: Code bugs cause anomalous feature values
Detection methods:
- Statistical tests: KS Test (Kolmogorov-Smirnov), Chi-Square Test, PSI (Population Stability Index)
- Distribution distances: KL Divergence, Wasserstein Distance, Jensen-Shannon Divergence
- Visualization: Time-series comparison charts of feature distributions
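Of these, PSI is the most common production metric because it needs no labels and has conventional alert thresholds (PSI > 0.2 is the threshold used in the monitoring table in Section 9.3). A self-contained sketch of one standard formulation; binning choices and the epsilon guard are implementation details that vary between tools:

```python
import numpy as np

def population_stability_index(expected, actual, bins: int = 10) -> float:
    """PSI between a reference (training) sample and a production
    sample. Bin edges come from the reference quantiles; a small
    epsilon guards against log(0) in empty bins."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    clipped = np.clip(actual, edges[0], edges[-1])  # cover out-of-range values
    eps = 1e-6
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a_pct = np.histogram(clipped, bins=edges)[0] / len(actual) + eps
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 10_000)    # reference distribution
stable = rng.normal(0.0, 1.0, 10_000)   # same distribution: PSI near 0
shifted = rng.normal(0.8, 1.0, 10_000)  # mean shift: PSI well above 0.2
```

Libraries such as Evidently and NannyML compute PSI (and the other distances listed above) per feature and over time windows, but the underlying calculation is this simple.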
9.2 Model Drift (Concept Drift): Changes in Input-Output Relationships
Even when the input distribution remains unchanged, the relationship between inputs and targets can shift. For example, before the pandemic, users searching for "face masks" were mostly in the medical field; after the pandemic, they were the general public—the same input features mapped to labels with fundamentally different meanings.
Detection strategies:
- Direct monitoring: Track model predictive performance on production data (requires delayed labels)
- Indirect monitoring: Track changes in prediction distributions (no labels needed, but lower sensitivity)
- Window comparison: Compare model performance within sliding windows against historical baselines
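The window-comparison strategy reduces to a few lines of bookkeeping. A minimal sketch, assuming delayed labels eventually arrive and that accuracy is the monitored metric; the class name, window size, and tolerance are illustrative:

```python
from collections import deque

class WindowedPerformanceMonitor:
    """Sliding-window concept-drift check: flag drift when windowed
    accuracy falls more than `tolerance` below the offline baseline."""

    def __init__(self, baseline_accuracy: float,
                 window_size: int = 500, tolerance: float = 0.03):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        self.window = deque(maxlen=window_size)  # 1 = correct, 0 = wrong

    def record(self, y_true, y_pred) -> None:
        """Record one (delayed) label against its stored prediction."""
        self.window.append(int(y_true == y_pred))

    @property
    def windowed_accuracy(self):
        return sum(self.window) / len(self.window) if self.window else None

    def drift_detected(self) -> bool:
        acc = self.windowed_accuracy
        return acc is not None and acc < self.baseline - self.tolerance
```

A usage sketch: a model with 90% offline accuracy stays healthy while production accuracy holds at 90%, then trips the detector when accuracy collapses—exactly the "mask search" scenario above, where unchanged inputs start mapping to different labels.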
9.3 Monitoring System Architecture
A comprehensive ML monitoring system should include the following layers:
| Monitoring Layer | Metrics | Tools | Alert Threshold |
|---|---|---|---|
| Infrastructure Layer | CPU, memory, latency, throughput | Prometheus + Grafana | P99 latency > 200ms |
| Data Quality Layer | Missing rate, outliers, schema drift | Great Expectations | Missing rate > 5% |
| Data Drift Layer | PSI, KS statistic | Evidently / NannyML | PSI > 0.2 |
| Model Performance Layer | Accuracy, F1, AUC | MLflow + custom | Below baseline by 3% |
| Business Metrics Layer | Conversion rate, revenue impact | Custom dashboard | Defined by business |
Testi et al., in their IEEE Access study[15], proposed a taxonomy and methodology for MLOps, emphasizing that monitoring should not be merely reactive—waiting for problems to occur—but should proactively predict when models need retraining. They recommend establishing a "model health score" that integrates multi-dimensional metrics to assess the current state of a model.
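Testi et al. motivate the idea of a composite health score but do not prescribe a formula; one simple hypothetical realization is a weighted average of normalized health signals, with retraining triggered below a chosen floor (all names, weights, and thresholds below are illustrative assumptions):

```python
def model_health_score(signals: dict, weights: dict) -> float:
    """Hypothetical composite 'model health score' in [0, 1]:
    a weighted average of normalized health signals."""
    total = sum(weights.values())
    return sum(weights[name] * signals[name] for name in weights) / total

signals = {
    "accuracy_vs_baseline": 0.97,  # current accuracy / baseline, capped at 1
    "data_stability": 0.90,        # e.g. 1 - min(worst feature PSI / 0.2, 1)
    "latency_headroom": 1.00,      # fraction of the SLA budget unused
}
weights = {"accuracy_vs_baseline": 0.5,
           "data_stability": 0.3,
           "latency_headroom": 0.2}

score = model_health_score(signals, weights)
retrain_recommended = score < 0.8  # proactive trigger, not a hard alert
```

The value of such a score is less in the arithmetic than in the contract: a single number that aggregates the monitoring layers of the table above into one retraining signal.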
10. Decision Framework: MLOps Toolchain Selection Guide
Faced with numerous MLOps tools, enterprises often find themselves paralyzed by choice. Below are toolchain recommendations based on team size and maturity:
10.1 Early Stage (1–3 Person ML Team)
| Function | Recommended Tool | Alternative | Rationale |
|---|---|---|---|
| Experiment Tracking | MLflow (local mode) | Weights & Biases | Free, lightweight, no infrastructure needed |
| Version Control | Git + DVC | Git LFS | Unified version control for data and code |
| Model Deployment | BentoML | FastAPI + Docker | Built-in best practices, less glue code |
| Monitoring | Evidently (report mode) | Manual scripts | Open-source, easy to get started |
10.2 Growth Stage (4–10 Person ML Team)
| Function | Recommended Tool | Alternative | Rationale |
|---|---|---|---|
| Experiment Tracking | MLflow (server mode) | Neptune.ai | Team sharing, unified management |
| Pipeline Orchestration | Prefect / Airflow | Kubeflow Pipelines | Scheduling, retries, dependency management |
| Feature Store | Feast | Hopsworks | Avoid redundant feature computation |
| Model Deployment | BentoML + K8s | Seldon Core | Containerized + auto-scaling |
| Monitoring | Evidently + Grafana | NannyML | Real-time drift detection + visualization |
10.3 Mature Stage (10+ Person ML Team / Multi-Model Production)
| Function | Recommended Tool | Alternative | Rationale |
|---|---|---|---|
| End-to-End Platform | Kubeflow | AWS SageMaker | Unified lifecycle management |
| Feature Store | Tecton / Feast on K8s | SageMaker Feature Store | Enterprise-grade feature management |
| Model Serving | KServe | Triton Inference Server | Multi-framework support, GPU inference |
| Monitoring | Evidently + PagerDuty | Fiddler AI | Auto-alerting + incident management |
| Governance | MLflow + custom | Weights & Biases | Model auditing, compliance tracking |
10.4 Three Core Principles for Tool Selection
Regardless of team size, the following principles should guide tool selection:
- Start small, scale incrementally: Do not deploy a full Kubeflow cluster from day one. Begin with MLflow in local mode for experiment management, then upgrade infrastructure as team size and model count grow. Premature architectural investment is a common cause of MLOps adoption failure.
- Prioritize eliminating the biggest pain point: If the team's biggest problem is "we can't find last time's experiment results," introduce experiment tracking first. If it's "deploying a model takes two weeks," establish automated deployment first. Do not try to solve every problem at once.
- Choose open standards over closed platforms: MLflow's model format, ONNX's model exchange format, OCI's container standards—these open standards ensure you are not locked into a single platform and preserve migration flexibility for the future.
11. Conclusion: MLOps Is Not a Tools Problem—It Is a Culture Problem
Let us return to the statistic from the beginning of this article—87% of ML projects never make it to production. By now, the reason should be clear: it is not because our models are not good enough, but because we treated "training a high-accuracy model" as the finish line, overlooking the enormous chasm between experimentation and production.
The core value of MLOps lies not in any single tool—not in MLflow, not in DVC, not in BentoML—but in the cultural shift it represents: from "one-off model development" to "continuous iterative ML systems engineering."
The concept of "hidden technical debt" proposed by Sculley et al. in their foundational paper[1] remains as relevant as ever. Every untracked experiment, every manually deployed model, every unmonitored production service accumulates technical debt. This debt does not disappear on its own—it manifests as model degradation, deployment failures, and debugging nightmares.
For enterprises considering MLOps adoption, our recommendations are:
- Start tracking your experiments with MLflow today. This is the lowest-cost, highest-return first step. As Lab 1 demonstrated, just a few lines of code can fundamentally transform how you manage experiments.
- Establish a standardized model deployment process. Lab 2 showed how BentoML can transform a model from a pickle file into a testable, containerizable, and scalable service.
- Build monitoring from day one. After a model goes live, Data Drift and Model Drift are inevitable. The earlier you establish detection mechanisms, the better you can avoid the disaster of "a model silently failing for three months before anyone notices."
- Invest in team culture, not just tools. The success of MLOps depends on close collaboration among data scientists, ML engineers, and DevOps teams. Tools can facilitate collaboration, but they cannot replace communication.
Machine learning is transitioning from a "research-driven" to an "engineering-driven" era. Organizations that can establish mature MLOps practices will hold a decisive advantage in the race to deploy AI—not because their models are better, but because they can deliver model value to production faster, more reliably, and more sustainably.