Key Findings
  • According to multiple industry surveys, approximately 87% of machine learning projects never reach production—the bottleneck lies not in model accuracy, but in the absence of engineering pipelines[5]
  • Google's three-level MLOps maturity model (Level 0: Manual → Level 1: Pipeline Automation → Level 2: CI/CD Automation) provides enterprises with a clear evolution path[8]
  • In ML systems, the actual code for model training accounts for only 5–10% of the overall system; the remaining 90% consists of data management, feature engineering, monitoring, deployment, and other infrastructure[1]
  • Enterprises that adopt MLOps can, on average, reduce the model development-to-deployment cycle from months to days, and cut model failure detection time from weeks to minutes[4]

1. 87% of ML Projects Never Make It to Production: Why MLOps Is Essential for AI Deployment

The history of machine learning is rife with an ironic pattern: models that perform brilliantly in the lab frequently falter once they enter the real world. Paleyes et al., in their survey published in ACM Computing Surveys[5], systematically cataloged the challenges of machine learning deployment and found that problems almost never originate in the algorithms themselves—rather, they arise from everything "surrounding" the algorithms: fragile data pipelines, untraceable experiment results, manual deployment processes, and the absence of post-deployment monitoring.

In 2015, a Google research team published a seminal paper at NeurIPS[1] that used a now widely cited architecture diagram to reveal a stark truth: in a production-grade ML system, the actual model training code—the part we invest the most effort into—occupies only a small box. Surrounding it is a massive infrastructure layer: data collection, data validation, feature extraction, resource management, serving infrastructure, and monitoring systems. It is this "glue code" and infrastructure that ultimately determines whether an ML system can operate reliably in production.

Microsoft's large-scale empirical study[2] further confirmed this finding. After interviewing dozens of internal ML teams, they discovered that software engineering best practices—version control, continuous integration, automated testing, and monitoring alerts—were severely neglected in ML development. Data scientists are accustomed to rapid iteration in Jupyter Notebooks, but this workflow has fundamental shortcomings in collaboration, reproducibility, and productionization.

MLOps (Machine Learning Operations) was born precisely to bridge this gap. It extends the core principles of DevOps—automation, monitoring, collaboration, and continuous delivery—to the entire machine learning lifecycle. Kreuzberger et al., in their IEEE Access review[4], defined MLOps as a set of principles and practices aimed at reliably and efficiently deploying and maintaining machine learning models in production environments.

This article provides a comprehensive breakdown of MLOps core components, from theoretical frameworks to hands-on practice. We not only explain the "why" but also provide two ready-to-run Google Colab labs, enabling readers to experience the power of the MLOps toolchain firsthand.

2. The MLOps Maturity Model: An Evolution Path from Manual to Fully Automated

Google Cloud, in its MLOps architecture guide[8], proposed a three-level maturity model that has become the most widely adopted MLOps evolution framework in the industry. Understanding these three levels is the first step in planning an enterprise MLOps strategy.

Level 0: Manual Processes

This is the starting point for most ML teams and the stage where the 87% of projects get stuck. Characteristics include:

  • Every step (data analysis, preparation, training, validation) is manual, script-driven, and interactive
  • Data scientists hand trained models over the wall to engineers who deploy them
  • Releases are infrequent, there is no CI/CD, and deployed models are not actively monitored

Shankar et al.'s interview study[11] vividly depicted the Level 0 predicament: one ML engineer reported that their model deployment process required 47 manual steps, and any error in a single step meant starting over from the beginning.

Level 1: ML Pipeline Automation

At this level, the key breakthrough is encapsulating the training process into an automated pipeline:

  • The whole sequence of data validation, preparation, training, and evaluation runs as one orchestrated pipeline
  • New models are retrained automatically on fresh data (continuous training)
  • The same pipeline runs in development and production, so experiments are reproducible by construction

TFX[9] is Google's exemplary Level 1 implementation. It chains data validation (TFDV), model analysis (TFMA), and serving deployment (TF Serving) into an automated pipeline, transforming model retraining from a manual operation into a one-click trigger.

Level 2: Full CI/CD Automation

The highest level achieves complete automation of ML systems:

  • The training pipeline itself is built, tested, and deployed through CI/CD
  • Code or data changes automatically trigger pipeline redeployment and model retraining
  • Automated validation gates and progressive rollout protect every release

Sato et al., in their ThoughtWorks technical report[12], described in detail the complete practice of CD4ML (Continuous Delivery for Machine Learning), demonstrating how software engineering's continuous delivery methodology can be applied to ML systems.

| Dimension | Level 0: Manual | Level 1: Pipeline | Level 2: CI/CD |
| --- | --- | --- | --- |
| Training Trigger | Manual execution | Auto-triggered by new data | Auto-triggered by code/data changes |
| Experiment Tracking | Excel / notes | MLflow / W&B | MLflow + automated comparison |
| Model Deployment | Manual scp / email | Semi-automated | Automated + Canary |
| Testing | None | Basic validation | Full coverage: data / model / service |
| Monitoring | None | Basic metrics | Drift detection + auto-alerting |
| Iteration Cycle | Weeks to months | Days | Hours |

3. Experiment Management: Tracking Every Training Run with MLflow

Experiment management is the cornerstone of MLOps. Without systematic experiment tracking, ML development is like writing code in the pre-version-control era—everyone experimenting on their own branch with no one knowing which version is the "right" one.

MLflow[3], open-sourced by Databricks, is currently the most widely adopted ML experiment management platform. It offers four core modules: Tracking, Projects, Models, and Model Registry. The three most important for day-to-day work are covered below.

3.1 MLflow Tracking: Experiment Tracking

The core concept of MLflow Tracking is the Run: each training execution is a Run, which records the hyperparameters (log_params), evaluation metrics (log_metrics), output artifacts such as plots and serialized models (log_artifact), and metadata including the source code version and start time.

Multiple Runs are organized under an Experiment, and MLflow provides a built-in Web UI for real-time comparison of different Runs' performance. This solves the most common pain point in ML development: "What parameters did I use for that model that performed so well last week?"

3.2 MLflow Models: Standardized Model Packaging

MLflow Models defines a unified model packaging format that wraps models the same way regardless of whether the underlying framework is scikit-learn, PyTorch, or TensorFlow. Each MLflow Model contains an MLmodel descriptor file declaring the "flavors" it supports, the serialized model itself, and an environment specification (conda.yaml / requirements.txt) that pins its dependencies.

3.3 MLflow Model Registry: Model Version Management

Model Registry introduces lifecycle management for models. Each registered model version can be tagged with a stage: None, Staging, Production, or Archived.

This provides clear protocols for model upgrades and rollbacks, replacing the high-risk practice of "directly overwriting the model.pkl in production."
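The lifecycle protocol can be illustrated with a toy, in-memory sketch. ToyModelRegistry and its methods are illustrative names, not the MLflow API (MLflow exposes the same idea through MlflowClient), but the promotion/rollback behavior is the same:

```python
VALID_STAGES = {"None", "Staging", "Production", "Archived"}

class ToyModelRegistry:
    """Toy sketch of registry-style lifecycle management (not the MLflow
    API): each model version carries a stage, and promoting a new version
    to Production archives the one it replaces, enabling clean rollback."""
    def __init__(self):
        self.versions = {}  # version number -> stage

    def register(self) -> int:
        version = len(self.versions) + 1
        self.versions[version] = "None"
        return version

    def transition(self, version: int, stage: str) -> None:
        if stage not in VALID_STAGES:
            raise ValueError(f"unknown stage: {stage}")
        if stage == "Production":
            for v, s in self.versions.items():
                if s == "Production":
                    self.versions[v] = "Archived"  # demote the old champion
        self.versions[version] = stage

registry = ToyModelRegistry()
v1 = registry.register()
registry.transition(v1, "Production")
v2 = registry.register()
registry.transition(v2, "Staging")      # validate first...
registry.transition(v2, "Production")   # ...then promote; v1 is archived
print(registry.versions)  # → {1: 'Archived', 2: 'Production'}
```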

4. Data Versioning and Feature Engineering: DVC and Feature Store

Polyzotis et al., in their research published in ACM SIGMOD Record[6], pointed out that the most underestimated challenge in production ML systems is data management. Model performance depends on training data quality, and data in production environments is constantly changing—making data versioning an indispensable component of MLOps.

4.1 DVC (Data Version Control): Git for Large-Scale Data

Git is the gold standard for code version control, but it cannot handle GB-scale or even TB-scale training data and model files. DVC was created precisely for this purpose. It builds a data versioning layer on top of Git: large data and model files are stored in a content-addressed cache (backed by remotes such as S3, GCS, or a shared drive), while only lightweight pointer files recording their hashes are committed to Git.

This means every model training run can be precisely mapped to a specific data version, definitively solving the age-old question: "Which dataset was this model trained on?"
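The pointer-file idea can be sketched in a few lines of Python. This is a toy illustration of the mechanism, not DVC's actual implementation; dvc_style_add is a hypothetical helper name:

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path

def dvc_style_add(data_file: Path, cache_dir: Path) -> Path:
    """Toy illustration of DVC's core idea: copy the large file into a
    content-addressed cache and emit a tiny Git-trackable pointer file."""
    md5 = hashlib.md5(data_file.read_bytes()).hexdigest()
    cache_path = cache_dir / md5[:2] / md5[2:]        # addressed by content hash
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(data_file, cache_path)
    pointer = data_file.parent / (data_file.name + ".dvc")
    pointer.write_text(json.dumps({"md5": md5, "path": data_file.name}))
    return pointer  # this small file is what gets committed to Git

# Usage: "version" a dataset
tmp = Path(tempfile.mkdtemp())
data = tmp / "train.csv"
data.write_text("feature,label\n1.0,0\n2.0,1\n")
ptr = dvc_style_add(data, tmp / ".cache")
print(ptr.name, json.loads(ptr.read_text())["md5"])
```

Checking out an old data version then amounts to reading the hash from the pointer file and restoring the matching cache entry, which is exactly what `dvc checkout` automates.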

4.2 Feature Store: A Central Repository for Features

In large ML teams, different projects often require similar features (e.g., "user's transaction count over the past 30 days"). Without a Feature Store, each team computes them independently, leading to duplicated engineering effort, subtly inconsistent feature definitions across projects, and training-serving skew whenever the online implementation diverges from the offline one.

Feature Stores (such as Feast, Tecton, and Hopsworks) provide unified feature definition, storage, and serving, ensuring that training and inference use exactly the same feature computation logic. Huyen, in her book[10], compared the Feature Store to "middleware" for ML systems—a bridge connecting data engineering and model training.
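A minimal in-memory sketch conveys the core contract. FeatureStore and register are illustrative names, not the Feast API; the point is that a feature is defined once and the identical computation serves both training and inference:

```python
from typing import Callable, Dict

class FeatureStore:
    """Toy in-memory feature store: each feature is registered exactly once,
    and the same computation is reused for training and online serving."""
    def __init__(self):
        self._features: Dict[str, Callable] = {}

    def register(self, name: str, fn: Callable) -> None:
        if name in self._features:
            raise ValueError(f"feature '{name}' already defined")
        self._features[name] = fn

    def compute(self, name: str, raw: dict):
        return self._features[name](raw)

store = FeatureStore()
# Defined once, by one team...
store.register(
    "txn_count_30d",
    lambda u: len([t for t in u["txns"] if t["age_days"] <= 30]),
)

user = {"txns": [{"age_days": 3}, {"age_days": 12}, {"age_days": 45}]}
# ...then used identically in the training pipeline and the serving path
print(store.compute("txn_count_30d", user))  # → 2
```

Real feature stores add what this toy omits: persistent offline/online storage, point-in-time-correct joins for training data, and low-latency serving.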

5. Model Packaging and Deployment: From Flask to BentoML

After model training is complete, the biggest challenge often just begins: how to transform a model in a Python script into a reliable, scalable, and monitorable production service?

5.1 The Evolution of Deployment Approaches

| Stage | Approach | Pros | Cons |
| --- | --- | --- | --- |
| V1: Manual | Flask / FastAPI self-wrapped | Rapid prototyping | No standardization, hard to maintain |
| V2: Containerized | Docker + Kubernetes | Environment consistency, scalable | Requires DevOps expertise |
| V3: Framework-based | BentoML / Seldon / KServe | Standardized, built-in best practices | Learning curve |
| V4: Serverless | AWS Lambda / Cloud Run | Zero ops, auto-scaling | Cold starts, model size limits |

5.2 BentoML: The Shortest Path from Model to API

BentoML is an open-source framework specifically designed for ML model serving. Its core philosophy is that data scientists should not need to learn Docker or Kubernetes to deploy a model. BentoML abstracts model deployment into three steps:

  1. Save the model: Use bentoml.sklearn.save_model() to store the trained model in a local model repository
  2. Define the service: Use Python decorators to declare API endpoints and define input/output formats
  3. Package and deploy: BentoML automatically generates a Docker image containing all dependencies and optimized configurations

BentoML also includes built-in production-grade features such as batch inference, adaptive micro-batching, and multi-model composition via Runners; replicating these manually with Flask would require hundreds of additional lines of code.

5.3 Deployment Strategies: Blue-Green, Canary, and Shadow Mode

Model updates in production should never be a matter of "shut down the old one, turn on the new one." Mature MLOps practices employ progressive deployment strategies: blue-green deployment (two identical environments with instant traffic switchover and rollback), canary releases (a small fraction of traffic goes to the new model first, expanding only if metrics hold), and shadow mode (the new model scores live traffic in parallel, but its predictions are never returned to users).
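Canary routing, for example, can be sketched as follows. This is a hedged illustration, not a specific product's API; route_request and the 5% fraction are assumptions. Hashing the user ID instead of sampling at random keeps each user consistently on one variant:

```python
import hashlib

def route_request(user_id: str, canary_fraction: float = 0.05) -> str:
    """Sketch of canary routing: hash the user ID into [0, 1) and send a
    fixed fraction of users to the new model. Hashing (rather than random
    choice) pins each user to a single variant across requests."""
    digest = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    bucket = (digest % 10_000) / 10_000
    return "model_v2_canary" if bucket < canary_fraction else "model_v1_stable"

# Roughly 5% of a large user population lands on the canary
routes = [route_request(f"user_{i}") for i in range(10_000)]
share = routes.count("model_v2_canary") / len(routes)
print(f"canary share: {share:.1%}")
```

A real rollout would watch the canary's error rate and latency, then raise canary_fraction stepwise toward 100% or roll back.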

6. Hands-on Lab 1: Complete MLflow Experiment Management Workflow

This lab walks you through the complete MLflow core workflow—from creating experiments, training multiple models, logging parameters and metrics, comparing experiment results, to selecting the best model and registering it.

Open Google Colab (CPU runtime is sufficient), create a new Notebook, and paste the following code blocks in order:

6.1 Environment Setup and Data Preparation

!pip install mlflow scikit-learn matplotlib -q

import mlflow
import mlflow.sklearn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, confusion_matrix, ConfusionMatrixDisplay
)
import warnings
warnings.filterwarnings('ignore')

# Load Wine dataset (multi-class problem, 3 classes, 13 features)
wine = load_wine()
X, y = wine.data, wine.target
feature_names = wine.feature_names
target_names = wine.target_names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training set: {X_train.shape[0]} samples, Test set: {X_test.shape[0]} samples")
print(f"Number of features: {X_train.shape[1]}, Number of classes: {len(target_names)}")
print(f"Class distribution (train): {np.bincount(y_train)}")
print(f"Class distribution (test):  {np.bincount(y_test)}")

6.2 Define the Experiment Tracking Function

def train_and_log(model, model_name, params, X_tr, X_te, y_tr, y_te):
    """Train a model and log all information to MLflow"""
    with mlflow.start_run(run_name=model_name):
        # Log hyperparameters
        mlflow.log_params(params)
        mlflow.set_tag("model_type", model_name)
        mlflow.set_tag("dataset", "wine")
        mlflow.set_tag("scaler", "StandardScaler")

        # Train
        model.fit(X_tr, y_tr)
        y_pred = model.predict(X_te)

        # Log multiple evaluation metrics
        metrics = {
            "accuracy": accuracy_score(y_te, y_pred),
            "precision_macro": precision_score(y_te, y_pred, average='macro'),
            "recall_macro": recall_score(y_te, y_pred, average='macro'),
            "f1_macro": f1_score(y_te, y_pred, average='macro'),
        }

        # Cross-validation score (more robust evaluation)
        cv_scores = cross_val_score(model, X_tr, y_tr, cv=5, scoring='accuracy')
        metrics["cv_mean_accuracy"] = cv_scores.mean()
        metrics["cv_std_accuracy"] = cv_scores.std()

        mlflow.log_metrics(metrics)

        # Log artifacts: confusion matrix plot
        fig, ax = plt.subplots(figsize=(6, 5))
        cm = confusion_matrix(y_te, y_pred)
        disp = ConfusionMatrixDisplay(cm, display_labels=target_names)
        disp.plot(ax=ax, cmap='Blues')
        ax.set_title(f"{model_name} — Confusion Matrix")
        plt.tight_layout()
        fig.savefig("confusion_matrix.png", dpi=100)
        mlflow.log_artifact("confusion_matrix.png")
        plt.close()

        # Log the model itself
        mlflow.sklearn.log_model(model, "model")

        print(f"  {model_name}: accuracy={metrics['accuracy']:.4f}, "
              f"f1={metrics['f1_macro']:.4f}, "
              f"cv={metrics['cv_mean_accuracy']:.4f}+/-{metrics['cv_std_accuracy']:.4f}")

        return metrics

print("Training and logging function defined successfully")

6.3 Set Up MLflow Experiment and Train Multiple Models

# Create MLflow experiment
experiment_name = "wine_classification_comparison"
mlflow.set_experiment(experiment_name)

print("=" * 65)
print("  MLflow Experiment Management — Wine Classification Model Comparison")
print("=" * 65)

# Define model and hyperparameter combinations
experiments = [
    {
        "name": "LogisticRegression_C0.1",
        "model": LogisticRegression(C=0.1, max_iter=1000, random_state=42),
        "params": {"algorithm": "LogisticRegression", "C": 0.1, "max_iter": 1000}
    },
    {
        "name": "LogisticRegression_C1.0",
        "model": LogisticRegression(C=1.0, max_iter=1000, random_state=42),
        "params": {"algorithm": "LogisticRegression", "C": 1.0, "max_iter": 1000}
    },
    {
        "name": "LogisticRegression_C10.0",
        "model": LogisticRegression(C=10.0, max_iter=1000, random_state=42),
        "params": {"algorithm": "LogisticRegression", "C": 10.0, "max_iter": 1000}
    },
    {
        "name": "RandomForest_100trees",
        "model": RandomForestClassifier(n_estimators=100, max_depth=None, random_state=42),
        "params": {"algorithm": "RandomForest", "n_estimators": 100, "max_depth": "None"}
    },
    {
        "name": "RandomForest_200trees_depth5",
        "model": RandomForestClassifier(n_estimators=200, max_depth=5, random_state=42),
        "params": {"algorithm": "RandomForest", "n_estimators": 200, "max_depth": 5}
    },
    {
        "name": "GradientBoosting_100",
        "model": GradientBoostingClassifier(
            n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
        ),
        "params": {"algorithm": "GradientBoosting", "n_estimators": 100,
                   "learning_rate": 0.1, "max_depth": 3}
    },
    {
        "name": "GradientBoosting_200_slow",
        "model": GradientBoostingClassifier(
            n_estimators=200, learning_rate=0.05, max_depth=4, random_state=42
        ),
        "params": {"algorithm": "GradientBoosting", "n_estimators": 200,
                   "learning_rate": 0.05, "max_depth": 4}
    },
    {
        "name": "SVM_rbf",
        "model": SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42),
        "params": {"algorithm": "SVM", "kernel": "rbf", "C": 1.0, "gamma": "scale"}
    },
]

# Train and log each model to MLflow
all_results = {}
for exp in experiments:
    result = train_and_log(
        exp["model"], exp["name"], exp["params"],
        X_train_scaled, X_test_scaled, y_train, y_test
    )
    all_results[exp["name"]] = result

print(f"\nCompleted {len(experiments)} experiments, all logged to MLflow")

6.4 Query and Compare Experiment Results

# Use MLflow API to query experiment results
from mlflow.tracking import MlflowClient

client = MlflowClient()
experiment = client.get_experiment_by_name(experiment_name)
runs = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=["metrics.f1_macro DESC"]
)

print("=" * 75)
print(f"  Experiment Results Ranking (sorted by F1-macro)")
print("=" * 75)
print(f"{'Rank':<5}{'Model':<35}{'Accuracy':<12}{'F1-macro':<12}{'CV Mean':<12}")
print("-" * 75)

for i, run in enumerate(runs):
    m = run.data.metrics
    print(f"  {i+1:<3} {run.info.run_name:<35}"
          f"{m['accuracy']:<12.4f}{m['f1_macro']:<12.4f}"
          f"{m['cv_mean_accuracy']:<12.4f}")

# Best model
best_run = runs[0]
print(f"\nBest Model: {best_run.info.run_name}")
print(f"  Run ID: {best_run.info.run_id}")
print(f"  F1-macro: {best_run.data.metrics['f1_macro']:.4f}")
print(f"  CV Accuracy: {best_run.data.metrics['cv_mean_accuracy']:.4f}"
      f" +/- {best_run.data.metrics['cv_std_accuracy']:.4f}")

# Visual comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

names = [r.info.run_name.replace("_", "\n") for r in runs]
accs = [r.data.metrics['accuracy'] for r in runs]
f1s = [r.data.metrics['f1_macro'] for r in runs]

colors = ['#b8922e' if i == 0 else '#0077b6' for i in range(len(runs))]

axes[0].barh(names, accs, color=colors)
axes[0].set_xlabel('Accuracy')
axes[0].set_title('Model Accuracy Comparison')
axes[0].set_xlim(0.85, 1.01)

axes[1].barh(names, f1s, color=colors)
axes[1].set_xlabel('F1-macro')
axes[1].set_title('Model F1-macro Comparison (Gold = Best)')
axes[1].set_xlim(0.85, 1.01)

plt.tight_layout()
plt.savefig("model_comparison.png", dpi=120, bbox_inches='tight')
plt.show()
print("\nComparison charts saved")

6.5 Register the Best Model to the Model Registry

# Register the best model to MLflow Model Registry
model_name_registry = "wine_classifier_production"

model_uri = f"runs:/{best_run.info.run_id}/model"
registered = mlflow.register_model(model_uri, model_name_registry)

print(f"\nModel registered to Model Registry")
print(f"  Model name: {registered.name}")
print(f"  Version: {registered.version}")
print(f"  Source Run: {best_run.info.run_name}")

# Update model description
client.update_registered_model(
    name=model_name_registry,
    description="Best Wine classification model — automatically selected through the MLflow experiment management workflow"
)

# Load the registered model back from the Model Registry and perform inference
loaded_model = mlflow.sklearn.load_model(f"models:/{model_name_registry}/{registered.version}")
sample = X_test_scaled[:5]
predictions = loaded_model.predict(sample)

print(f"\nModel inference test (first 5 test samples):")
for i in range(5):
    actual = target_names[y_test[i]]
    predicted = target_names[predictions[i]]
    status = "Correct" if y_test[i] == predictions[i] else "Wrong"
    print(f"  [{status}] Actual: {actual:<12} Predicted: {predicted}")

print(f"\nLab 1 complete! You have learned:")
print(f"  1. Creating MLflow experiments and tracking multiple models")
print(f"  2. Logging hyperparameters, evaluation metrics, and artifacts")
print(f"  3. Querying and comparing experiment results using the API")
print(f"  4. Registering the best model to the Model Registry")
print(f"  5. Loading a model from the Registry for inference")

7. Hands-on Lab 2: Model Packaging and API Serving

In Lab 1, we used MLflow to manage the experiment workflow and select the best model. In this lab, we will use BentoML to package the model as a REST API ready for external serving—a critical step in the journey from "experiment" to "product."

Open Google Colab (CPU runtime is sufficient), create a new Notebook, and paste the following code blocks in order:

7.1 Environment Setup and Model Training

!pip install bentoml scikit-learn numpy requests -q

import bentoml
import numpy as np
import json
import time
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report

# Train a production-grade model
wine = load_wine()
X, y = wine.data, wine.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.05, max_depth=4, random_state=42
)
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)
acc = accuracy_score(y_test, y_pred)
print(f"Model training complete — Test set Accuracy: {acc:.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=wine.target_names))

7.2 Save the Model to the BentoML Model Store

# Save the model and preprocessor together to BentoML
saved_model = bentoml.sklearn.save_model(
    "wine_classifier",
    model,
    signatures={"predict": {"batchable": True}},
    labels={"task": "classification", "dataset": "wine", "framework": "sklearn"},
    metadata={
        "accuracy": float(acc),
        "n_features": X_train.shape[1],
        "n_classes": len(wine.target_names),
        "feature_names": list(wine.feature_names),
        "target_names": list(wine.target_names),
    },
    custom_objects={
        "scaler": scaler  # Save the preprocessor alongside the model
    }
)

print(f"Model saved to BentoML Model Store")
print(f"  Model Tag: {saved_model.tag}")
print(f"  Storage Path: {saved_model.path}")

# List all saved models
print(f"\nSaved models list:")
for m in bentoml.models.list():
    print(f"  - {m.tag} (Created: {m.info.creation_time})")

7.3 Define the BentoML Service

# Write the BentoML Service definition file
service_code = '''
import numpy as np
import bentoml
from bentoml.io import NumpyNdarray, JSON

# Load model and preprocessor
model_runner = bentoml.sklearn.get("wine_classifier:latest").to_runner()
model_ref = bentoml.models.get("wine_classifier:latest")
scaler = model_ref.custom_objects["scaler"]
metadata = model_ref.info.metadata

svc = bentoml.Service("wine_classifier_service", runners=[model_runner])

@svc.api(input=NumpyNdarray(), output=JSON())
async def predict(input_array: np.ndarray) -> dict:
    """Accept raw features, return predicted class and probability"""
    # Preprocessing
    scaled = scaler.transform(input_array.reshape(1, -1) if input_array.ndim == 1 else input_array)
    # Prediction
    predictions = await model_runner.predict.async_run(scaled)
    target_names = metadata["target_names"]
    results = []
    for pred in predictions:
        results.append({
            "class_id": int(pred),
            "class_name": target_names[int(pred)],
        })
    return {"predictions": results, "model": str(model_ref.tag)}

@svc.api(input=JSON(), output=JSON())
async def predict_json(input_data: dict) -> dict:
    """Accept feature data in JSON format"""
    features = np.array(input_data["features"])
    scaled = scaler.transform(features.reshape(1, -1) if features.ndim == 1 else features)
    predictions = await model_runner.predict.async_run(scaled)
    target_names = metadata["target_names"]
    results = []
    for pred in predictions:
        results.append({
            "class_id": int(pred),
            "class_name": target_names[int(pred)],
        })
    return {
        "predictions": results,
        "model": str(model_ref.tag),
        "feature_names": metadata["feature_names"]
    }

@svc.api(input=JSON(), output=JSON())
async def model_info(input_data: dict) -> dict:
    """Return model metadata"""
    return {
        "model_tag": str(model_ref.tag),
        "accuracy": metadata["accuracy"],
        "n_features": metadata["n_features"],
        "n_classes": metadata["n_classes"],
        "feature_names": metadata["feature_names"],
        "target_names": metadata["target_names"],
    }
'''

with open("service.py", "w") as f:
    f.write(service_code)

print("BentoML Service definition file (service.py) created")
print("  Contains 3 API endpoints:")
print("  - /predict        : NumPy array input")
print("  - /predict_json   : JSON format input")
print("  - /model_info     : Model metadata query")

7.4 Simulated API Inference Test

# Test model inference logic directly in Colab (without starting an HTTP Server)
print("=" * 65)
print("  Model Packaging and Inference Test")
print("=" * 65)

# Load the saved model
model_ref = bentoml.models.get("wine_classifier:latest")
loaded_model = bentoml.sklearn.load_model("wine_classifier:latest")
loaded_scaler = model_ref.custom_objects["scaler"]
meta = model_ref.info.metadata

print(f"\nModel Info:")
print(f"  Tag: {model_ref.tag}")
print(f"  Accuracy: {meta['accuracy']:.4f}")
print(f"  Number of features: {meta['n_features']}")
print(f"  Classes: {meta['target_names']}")

# Simulated API request — single inference
print(f"\n--- Single Inference Test ---")
sample = X_test[0]
sample_scaled = loaded_scaler.transform(sample.reshape(1, -1))
pred = loaded_model.predict(sample_scaled)
print(f"  Input features (first 5): {sample[:5].round(2)}")
print(f"  Predicted class ID: {pred[0]}")
print(f"  Predicted class name: {wine.target_names[pred[0]]}")
print(f"  Actual class name: {wine.target_names[y_test[0]]}")

# Simulated API request — batch inference
print(f"\n--- Batch Inference Test (10 samples) ---")
batch = X_test[:10]
batch_scaled = loaded_scaler.transform(batch)
batch_preds = loaded_model.predict(batch_scaled)

correct = sum(batch_preds == y_test[:10])
print(f"  Batch inference results: {correct}/10 correct")
for i in range(10):
    actual = wine.target_names[y_test[i]]
    predicted = wine.target_names[batch_preds[i]]
    status = "OK" if y_test[i] == batch_preds[i] else "NG"
    print(f"  [{status}] #{i+1} Actual: {actual:<12} Predicted: {predicted}")

# Inference latency test
print(f"\n--- Inference Latency Benchmark ---")
single_sample = loaded_scaler.transform(X_test[0].reshape(1, -1))
batch_all = loaded_scaler.transform(X_test)  # the full 36-sample test set

# Single-sample latency
times_single = []
for _ in range(100):
    t0 = time.time()
    _ = loaded_model.predict(single_sample)
    times_single.append((time.time() - t0) * 1000)

# Batch latency
times_batch = []
for _ in range(100):
    t0 = time.time()
    _ = loaded_model.predict(batch_all)
    times_batch.append((time.time() - t0) * 1000)

print(f"  Single inference: {np.mean(times_single):.3f}ms (p99: {np.percentile(times_single, 99):.3f}ms)")
print(f"  Batch {len(batch_all)} samples: {np.mean(times_batch):.3f}ms (p99: {np.percentile(times_batch, 99):.3f}ms)")
print(f"  Batch efficiency: {np.mean(times_single) * len(batch_all) / np.mean(times_batch):.1f}x")

7.5 Build the Bento and Inspect the Package Structure

# Build bentofile.yaml (BentoML packaging configuration)
bentofile_content = '''
service: "service:svc"
labels:
  owner: meta-intelligence
  project: wine-classifier
  stage: production
include:
  - "*.py"
python:
  packages:
    - scikit-learn
    - numpy
'''

with open("bentofile.yaml", "w") as f:
    f.write(bentofile_content)

print("bentofile.yaml created")
print("\nPackaging configuration content:")
print(bentofile_content)

# Display the complete deployment workflow
print("=" * 65)
print("  Production Deployment Workflow (CLI Command Guide)")
print("=" * 65)
print("""
In your local development environment, execute the following commands to complete deployment:

# 1. Start local dev server (for testing)
$ bentoml serve service:svc --reload

# 2. Package into a Bento
$ bentoml build

# 3. Containerize (generate Docker image)
$ bentoml containerize wine_classifier_service:latest

# 4. Run the container
$ docker run -p 3000:3000 wine_classifier_service:latest

# 5. Test the API
$ curl -X POST http://localhost:3000/predict_json \\
    -H "Content-Type: application/json" \\
    -d '{"features": [13.0, 1.5, 2.3, 15.0, 100, 2.8, 3.0, 0.28, 2.29, 5.64, 1.04, 3.92, 1065]}'
""")

print(f"\nLab 2 complete! You have learned:")
print(f"  1. Saving models and preprocessors to BentoML Model Store")
print(f"  2. Defining multi-endpoint API Services")
print(f"  3. Testing model inference (single and batch)")
print(f"  4. Creating packaging configurations and deployment workflows")
print(f"  5. Understanding the complete path from development to containerized deployment")

8. CI/CD for ML: Automated Testing and Continuous Deployment

CI/CD for traditional software is well established, but continuous integration and delivery for ML systems face unique challenges. Sato et al.'s CD4ML framework[12] raised an important point: ML systems have three axes of change that require version control—code, models, and data. A change on any axis may require re-validation and redeployment.

8.1 ML-Specific Testing Strategies

The ML Test Score proposed by Breck et al.[7] defines a comprehensive testing rubric for ML systems, covering four major categories:

Data Tests:
  • Feature distributions match documented expectations (ranges, missing-value rates)
  • Every feature is worth its cost in latency and upstream dependencies
  • The data pipeline meets privacy and access-control requirements

Model Tests:
  • The model beats a simpler, well-understood baseline
  • Hyperparameters were tuned systematically rather than hand-picked
  • The impact of model staleness on prediction quality is known

Infrastructure Tests:
  • Training is reproducible: the same code and data yield equivalent models
  • The full ML pipeline has an end-to-end integration test
  • Models can be safely rolled back to a previous serving version

Monitoring Tests:
  • Training/serving skew is measured and alerted on
  • Data invariants are checked on live input traffic
  • Serving latency, throughput, and prediction quality have not regressed
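As a concrete taste of the data-test category, the sketch below checks an incoming batch against schema expectations recorded at training time. validate_batch and the schema layout are illustrative assumptions; production teams typically use Great Expectations or Pandera for this:

```python
import numpy as np

def validate_batch(X: np.ndarray, schema: dict) -> list:
    """Sketch of a data test in the spirit of the ML Test Score: validate a
    batch against training-time expectations. Empty list = batch passes."""
    if X.shape[1] != schema["n_features"]:
        return [f"expected {schema['n_features']} features, got {X.shape[1]}"]
    errors = []
    if np.isnan(X).any():
        errors.append("batch contains NaN values")
    for i, (lo, hi) in enumerate(schema["ranges"]):
        col = X[:, i]
        if col.min() < lo or col.max() > hi:
            errors.append(f"feature {i} outside training range [{lo}, {hi}]")
    return errors

schema = {"n_features": 2, "ranges": [(0.0, 10.0), (-1.0, 1.0)]}
good = np.array([[1.0, 0.5], [9.0, -0.5]])
bad = np.array([[11.0, 0.0]])           # feature 0 out of range
print(validate_batch(good, schema))     # → []
print(validate_batch(bad, schema))
```

In a CI pipeline, a non-empty error list would fail the Data Validation stage and block training on the suspect batch.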

8.2 ML CI/CD Pipeline with GitHub Actions

Below is a typical ML CI/CD Pipeline structure implemented with GitHub Actions:

# .github/workflows/ml-pipeline.yml structure overview
#
# Triggers: push to main / PR / scheduled (daily retraining)
#
# Stage 1: Data Validation
#   - Check data schema consistency
#   - Validate feature distributions (Great Expectations / Pandera)
#   - Detect data drift
#
# Stage 2: Model Training
#   - Pull latest training data from DVC
#   - Execute training pipeline
#   - Log experiments with MLflow
#
# Stage 3: Model Validation
#   - Benchmark test set performance >= threshold
#   - New model >= current production model
#   - Fairness checks pass
#   - Latency benchmark passes
#
# Stage 4: Model Deployment
#   - BentoML packaging
#   - Docker containerization
#   - Canary deployment (5% traffic)
#   - Monitor for 30 minutes
#   - Full traffic cutover

Lu et al.'s survey of the MLOps tool ecosystem[14] pointed out that the fragmentation of MLOps toolchains is one of the biggest barriers to enterprise adoption. Integration between different tools often requires substantial "glue code," which itself becomes a new source of technical debt.

9. Model Monitoring: Data Drift and Model Drift Detection

Deploying a model is not the finish line—it is the starting point of a new journey. Klaise et al.'s monitoring survey[13] systematically categorized the degradation risks facing production ML models, with the two most critical types being Data Drift and Model Drift.

9.1 Data Drift: Changes in Input Data Distribution

Data Drift refers to significant changes in the distribution of input data in production compared to the training data. This is the most common cause of ML model degradation.

Common causes:
  • Seasonality and naturally shifting user behavior
  • Upstream pipeline changes or bugs (renamed fields, changed units, new defaults)
  • Expansion into markets or user segments absent from the training data

Detection methods:
  • Per-feature two-sample statistical tests such as Kolmogorov–Smirnov (KS)
  • Population Stability Index (PSI) over binned feature distributions
  • Automated drift reports from tools such as Evidently or NannyML
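PSI is simple enough to sketch directly. This is a minimal implementation for one numeric feature; binning choices vary in practice, and a common rule of thumb treats PSI > 0.2 as significant drift:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, n_bins: int = 10) -> float:
    """Population Stability Index between a training-time (expected) and a
    production (actual) sample of one feature. Higher = more drift."""
    # Bin edges from the reference distribution's quantiles
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # catch out-of-range values
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor the fractions to avoid log(0)
    e_frac = np.clip(e_frac, 1e-6, None)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(42)
train = rng.normal(0, 1, 5000)
same = rng.normal(0, 1, 5000)        # same distribution → PSI near 0
shifted = rng.normal(0.8, 1, 5000)   # shifted mean → large PSI
print(f"no drift: PSI = {psi(train, same):.3f}")
print(f"drift:    PSI = {psi(train, shifted):.3f}")
```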

9.2 Model Drift (Concept Drift): Changes in Input-Output Relationships

Even when the input distribution remains unchanged, the relationship between inputs and targets can shift. For example, before the pandemic, users searching for "face masks" were mostly in the medical field; after the pandemic, they were the general public—the same input features mapped to labels with fundamentally different meanings.

Detection strategies:
  • Compare predictions against ground-truth labels as they arrive (often with delay)
  • Track proxy signals such as click-through or correction rates when labels lag
  • Watch the model's output distribution for unexplained shifts
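The first strategy can be sketched as a simple windowed check; detect_performance_drift and the tolerance value are illustrative assumptions:

```python
def detect_performance_drift(window_acc, baseline_acc, tolerance=0.03):
    """Sketch of performance-based concept-drift detection: once delayed
    ground-truth labels arrive, compare each time window's accuracy to the
    offline baseline and flag windows more than `tolerance` below it."""
    return [i for i, acc in enumerate(window_acc) if acc < baseline_acc - tolerance]

# Weekly accuracy computed from labels that arrive with delay
weekly_accuracy = [0.95, 0.94, 0.95, 0.90, 0.88]
flagged = detect_performance_drift(weekly_accuracy, baseline_acc=0.95)
print(f"drift flagged in windows: {flagged}")  # → [3, 4]
```

A flagged window would typically open an investigation or trigger the retraining pipeline rather than act on its own.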

9.3 Monitoring System Architecture

A comprehensive ML monitoring system should include the following layers:

| Monitoring Layer | Metrics | Tools | Alert Threshold |
| --- | --- | --- | --- |
| Infrastructure Layer | CPU, memory, latency, throughput | Prometheus + Grafana | P99 latency > 200 ms |
| Data Quality Layer | Missing rate, outliers, schema drift | Great Expectations | Missing rate > 5% |
| Data Drift Layer | PSI, KS statistic | Evidently / NannyML | PSI > 0.2 |
| Model Performance Layer | Accuracy, F1, AUC | MLflow + custom | Below baseline by 3% |
| Business Metrics Layer | Conversion rate, revenue impact | Custom dashboard | Defined by business |
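To make the alert-threshold column concrete, here is a minimal sketch of a threshold checker wired to the values in the table above; the metric names and dictionary shape are assumptions for illustration:

```python
# Thresholds taken from the monitoring-architecture table; each entry maps a
# metric name to (comparison, limit). Names are illustrative, not a standard.
THRESHOLDS = {
    "p99_latency_ms": ("gt", 200),    # infrastructure layer
    "missing_rate":   ("gt", 0.05),   # data quality layer
    "psi":            ("gt", 0.2),    # data drift layer
    "accuracy_drop":  ("gt", 0.03),   # model performance layer
}

def check_alerts(metrics):
    """Return the names of metrics that breach their alert threshold."""
    alerts = []
    for name, (op, limit) in THRESHOLDS.items():
        value = metrics.get(name)
        if value is not None and op == "gt" and value > limit:
            alerts.append(name)
    return alerts

print(check_alerts({"p99_latency_ms": 250, "missing_rate": 0.01, "psi": 0.35}))
# ['p99_latency_ms', 'psi']
```

A real system would route each breached layer to a different channel (Grafana alert, PagerDuty incident, retraining trigger), but the core logic is this table lookup.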

Testi et al., in their IEEE Access study[15], proposed a taxonomy and methodology for MLOps, emphasizing that monitoring should not be merely reactive—waiting for problems to occur—but should proactively predict when models need retraining. They recommend establishing a "model health score" that integrates multi-dimensional metrics to assess the current state of a model.
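The "model health score" idea can be sketched as a weighted average of normalized per-layer signals. The specific signals, weights, and normalizations below are illustrative assumptions, not the formulation from the paper:

```python
def health_score(signals, weights):
    """Aggregate per-layer health signals (each in [0, 1], where 1 = fully
    healthy) into a single score via a weighted average."""
    total = sum(weights.values())
    return sum(weights[k] * signals[k] for k in weights) / total

# Each signal is normalized so that 1.0 means "fully healthy":
signals = {
    "latency":      1.0,                      # P99 well under the 200 ms budget
    "data_quality": 0.98,                     # 2% missing values (5% allowed)
    "drift":        max(0.0, 1 - 0.35 / 0.2), # PSI 0.35 vs 0.2 threshold -> 0.0
    "accuracy":     0.96,                     # slightly below baseline
}
# Weight drift and accuracy more heavily than infrastructure signals
weights = {"latency": 1, "data_quality": 1, "drift": 2, "accuracy": 2}

score = health_score(signals, weights)
print(round(score, 3))  # 0.65
# A score below some cutoff (e.g., 0.8) could trigger a retraining run
```

The value of a single score is not precision but trend: a steadily declining score predicts the need for retraining before any single metric fires an alert.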

10. Decision Framework: MLOps Toolchain Selection Guide

Faced with numerous MLOps tools, enterprises often find themselves paralyzed by choice. Below are toolchain recommendations based on team size and maturity:

10.1 Early Stage (1–3 Person ML Team)

| Function | Recommended Tool | Alternative | Rationale |
| --- | --- | --- | --- |
| Experiment Tracking | MLflow (local mode) | Weights & Biases | Free, lightweight, no infrastructure needed |
| Version Control | Git + DVC | Git LFS | Unified version control for data and code |
| Model Deployment | BentoML | FastAPI + Docker | Built-in best practices, less glue code |
| Monitoring | Evidently (report mode) | Manual scripts | Open-source, easy to get started |

10.2 Growth Stage (4–10 Person ML Team)

| Function | Recommended Tool | Alternative | Rationale |
| --- | --- | --- | --- |
| Experiment Tracking | MLflow (server mode) | Neptune.ai | Team sharing, unified management |
| Pipeline Orchestration | Prefect / Airflow | Kubeflow Pipelines | Scheduling, retries, dependency management |
| Feature Store | Feast | Hopsworks | Avoid redundant feature computation |
| Model Deployment | BentoML + K8s | Seldon Core | Containerized + auto-scaling |
| Monitoring | Evidently + Grafana | NannyML | Real-time drift detection + visualization |

10.3 Mature Stage (10+ Person ML Team / Multi-Model Production)

| Function | Recommended Tool | Alternative | Rationale |
| --- | --- | --- | --- |
| End-to-End Platform | Kubeflow | AWS SageMaker | Unified lifecycle management |
| Feature Store | Tecton / Feast on K8s | SageMaker Feature Store | Enterprise-grade feature management |
| Model Serving | KServe | Triton Inference Server | Multi-framework support, GPU inference |
| Monitoring | Evidently + PagerDuty | Fiddler AI | Auto-alerting + incident management |
| Governance | MLflow + custom | Weights & Biases | Model auditing, compliance tracking |

10.4 Three Core Principles for Tool Selection

Regardless of team size, the following principles should guide tool selection:

  1. Start small, scale incrementally: Do not deploy a full Kubeflow cluster from day one. Begin with MLflow in local mode for experiment management, then upgrade infrastructure as team size and model count grow. Premature architectural investment is a common cause of MLOps adoption failure.
  2. Prioritize eliminating the biggest pain point: If the team's biggest problem is "we can't find last time's experiment results," introduce experiment tracking first. If it's "deploying a model takes two weeks," establish automated deployment first. Do not try to solve every problem at once.
  3. Choose open standards over closed platforms: MLflow's model format, ONNX's model exchange format, OCI's container standards—these open standards ensure you are not locked into a single platform and preserve migration flexibility for the future.

11. Conclusion: MLOps Is Not a Tools Problem—It Is a Culture Problem

Let us return to the statistic from the beginning of this article—87% of ML projects never make it to production. By now, the reason should be clear: it is not because our models are not good enough, but because we treated "training a high-accuracy model" as the finish line, overlooking the enormous chasm between experimentation and production.

The core value of MLOps lies not in any single tool—not in MLflow, not in DVC, not in BentoML—but in the cultural shift it represents: from "one-off model development" to "continuous iterative ML systems engineering."

The concept of "hidden technical debt" proposed by Sculley et al. in their foundational paper[1] remains as relevant as ever. Every untracked experiment, every manually deployed model, every unmonitored production service accumulates technical debt. This debt does not disappear on its own—it manifests as model degradation, deployment failures, and debugging nightmares.

For enterprises considering MLOps adoption, our recommendations are:

  1. Start tracking your experiments with MLflow today. This is the lowest-cost, highest-return first step. As Lab 1 demonstrated, just a few lines of code can fundamentally transform how you manage experiments.
  2. Establish a standardized model deployment process. Lab 2 showed how BentoML can transform a model from a pickle file into a testable, containerizable, and scalable service.
  3. Build monitoring from day one. After a model goes live, Data Drift and Model Drift are inevitable. The earlier you establish detection mechanisms, the better you can avoid the disaster of "a model silently failing for three months before anyone notices."
  4. Invest in team culture, not just tools. The success of MLOps depends on close collaboration among data scientists, ML engineers, and DevOps teams. Tools can facilitate collaboration, but they cannot replace communication.

Machine learning is transitioning from a "research-driven" to an "engineering-driven" era. Organizations that can establish mature MLOps practices will hold a decisive advantage in the race to deploy AI—not because their models are better, but because they can deliver model value to production faster, more reliably, and more sustainably.