- According to multiple industry surveys, approximately 87% of machine learning projects never reach production—the bottleneck lies not in model accuracy, but in the absence of engineering pipelines[5]
- Google's three-level MLOps maturity model (Level 0: Manual → Level 1: Pipeline Automation → Level 2: CI/CD Automation) provides enterprises with a clear evolution path[8]
- In ML systems, the actual code for model training accounts for only 5–10% of the overall system; the remaining 90–95% consists of data management, feature engineering, monitoring, deployment, and other infrastructure[1]
- Enterprises that adopt MLOps can, on average, reduce the model development-to-deployment cycle from months to days, and cut model failure detection time from weeks to minutes[4]
1. 87% of ML Projects Never Make It to Production: Why MLOps Is Essential for AI Deployment
The history of machine learning is rife with an ironic pattern: models that perform brilliantly in the lab frequently falter once they enter the real world. Paleyes et al., in their survey published in ACM Computing Surveys[5], systematically cataloged the challenges of machine learning deployment and found that problems almost never originate in the algorithms themselves—rather, they arise from everything "surrounding" the algorithms: fragile data pipelines, untraceable experiment results, manual deployment processes, and the absence of post-deployment monitoring.
In 2015, a Google research team published a seminal paper at NeurIPS[1] that used a now widely cited architecture diagram to reveal a stark truth: in a production-grade ML system, the actual model training code—the part we invest the most effort into—occupies only a small box. Surrounding it is a massive infrastructure layer: data collection, data validation, feature extraction, resource management, serving infrastructure, and monitoring systems. It is this "glue code" and infrastructure that ultimately determines whether an ML system can operate reliably in production.
Microsoft's large-scale empirical study[2] further confirmed this finding. After interviewing dozens of internal ML teams, they discovered that software engineering best practices—version control, continuous integration, automated testing, and monitoring alerts—were severely neglected in ML development. Data scientists are accustomed to rapid iteration in Jupyter Notebooks, but this workflow has fundamental shortcomings in collaboration, reproducibility, and productionization.
MLOps (Machine Learning Operations) was born precisely to bridge this gap. It extends the core principles of DevOps—automation, monitoring, collaboration, and continuous delivery—to the entire machine learning lifecycle. Kreuzberger et al., in their IEEE Access review[4], defined MLOps as a set of principles and practices aimed at reliably and efficiently deploying and maintaining machine learning models in production environments.
This article provides a comprehensive breakdown of MLOps core components, from theoretical frameworks to hands-on practice. We not only explain the "why" but also provide two ready-to-run Google Colab labs, enabling readers to experience the power of the MLOps toolchain firsthand.
2. The MLOps Maturity Model: An Evolution Path from Manual to Fully Automated
Google Cloud, in its MLOps architecture guide[8], proposed a three-level maturity model that has become the most widely adopted MLOps evolution framework in the industry. Understanding these three levels is the first step in planning an enterprise MLOps strategy.
Level 0: Manual Processes
This is the starting point for most ML teams and the stage where 87% of projects get stuck. Characteristics include:
- Data scientists manually train models locally or in notebooks
- Experiment records rely on Excel or personal notes ("I think lr=0.01 worked better last time")
- Model delivery involves "emailing the pickle file to the engineer"
- Deployment is a one-off manual event with no automated testing
- No systematic monitoring after deployment; model degradation goes unnoticed
Shankar et al.'s interview study[11] vividly depicted the Level 0 predicament: one ML engineer reported that their model deployment process required 47 manual steps, and any error in a single step meant starting over from the beginning.
Level 1: ML Pipeline Automation
At this level, the key breakthrough is encapsulating the training process into automated pipelines:
- Data validation, feature engineering, training, and evaluation form an automated workflow
- When new data arrives, the pipeline can automatically trigger retraining
- Experiment tracking tools (such as MLflow) record the parameters and results of each training run
- Model deployment may still require some manual intervention
- Basic model performance monitoring begins
TFX[9] is Google's exemplary Level 1 implementation. It chains data validation (TFDV), model analysis (TFMA), and serving deployment (TF Serving) into an automated pipeline, transforming model retraining from a manual operation into a one-click trigger.
Level 2: Full CI/CD Automation
The highest level achieves complete automation of ML systems:
- Code changes automatically trigger the CI/CD pipeline
- Automated tests cover data quality, model performance, and service stability
- Models are automatically deployed to production after passing all tests
- Full A/B testing and canary deployment
- Real-time Data Drift and Model Drift detection with automated retraining triggers
Sato et al., in their ThoughtWorks technical report[12], described in detail the complete practice of CD4ML (Continuous Delivery for Machine Learning), demonstrating how software engineering's continuous delivery methodology can be applied to ML systems.
| Dimension | Level 0: Manual | Level 1: Pipeline | Level 2: CI/CD |
|---|---|---|---|
| Training Trigger | Manual execution | Auto-triggered by new data | Auto-triggered by code/data changes |
| Experiment Tracking | Excel / notes | MLflow / W&B | MLflow + automated comparison |
| Model Deployment | Manual scp / email | Semi-automated | Automated + Canary |
| Testing | None | Basic validation | Full coverage: data / model / service |
| Monitoring | None | Basic metrics | Drift detection + auto-alerting |
| Iteration Cycle | Weeks to months | Days | Hours |
3. Experiment Management: Tracking Every Training Run with MLflow
Experiment management is the cornerstone of MLOps. Without systematic experiment tracking, ML development is like writing code in the pre-version-control era—everyone experimenting on their own branch with no one knowing which version is the "right" one.
MLflow[3], open-sourced by Databricks, is currently the most widely adopted ML experiment management platform. It offers four core modules (Tracking, Projects, Models, and Model Registry); the three most relevant to experiment management are covered below:
3.1 MLflow Tracking: Experiment Tracking
The core concept of MLflow Tracking is the Run—each training execution is a Run, which records:
- Parameters: Hyperparameters (learning rate, batch size, epochs, etc.)
- Metrics: Evaluation metrics (accuracy, loss, F1-score, etc.), with support for step-by-step logging
- Artifacts: Outputs (trained model files, confusion matrix plots, feature importance charts, etc.)
- Tags: Custom tags (experiment purpose, dataset version, operator, etc.)
Multiple Runs are organized under an Experiment, and MLflow provides a built-in Web UI for real-time comparison of different Runs' performance. This solves the most common pain point in ML development: "What parameters did I use for that model that performed so well last week?"
3.2 MLflow Models: Standardized Model Packaging
MLflow Models defines a unified model packaging format that wraps models the same way regardless of whether the underlying framework is scikit-learn, PyTorch, or TensorFlow. Each MLflow Model contains:
- MLmodel file: Describes the model's flavors (the various ways it can be loaded)
- Model binary: Serialized model weights
- conda.yaml / requirements.txt: Precise dependency environment descriptions
- input_example: Sample input for inference testing
3.3 MLflow Model Registry: Model Version Management
Model Registry introduces lifecycle management for models. Each registered model can be tagged with different stages:
- Staging: Awaiting validation, undergoing A/B testing or performance evaluation
- Production: Validated and officially serving in production
- Archived: Retired previous versions
This provides clear protocols for model upgrades and rollbacks, replacing the high-risk practice of "directly overwriting the model.pkl in production."
4. Data Versioning and Feature Engineering: DVC and Feature Store
Polyzotis et al., in their research published in ACM SIGMOD Record[6], pointed out that the most underestimated challenge in production ML systems is data management. Model performance depends on training data quality, and data in production environments is constantly changing—making data versioning an indispensable component of MLOps.
4.1 DVC (Data Version Control): Git for Large-Scale Data
Git is the gold standard for code version control, but it cannot handle GB-scale or even TB-scale training data and model files. DVC was created precisely for this purpose—it builds a data versioning layer on top of Git:
- Git tracks .dvc metadata files: Recording data file hashes, sizes, and remote storage locations
- Actual data stored in remote storage: Supporting S3, GCS, Azure Blob, SSH, and more
- Pipeline definition: Using `dvc.yaml` to describe data processing DAGs (Directed Acyclic Graphs)
- Version switching: `git checkout` + `dvc checkout` to return to any historical data state
This means every model training run can be precisely mapped to a specific data version, definitively solving the age-old question: "Which dataset was this model trained on?"
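A pipeline definition for a two-stage DAG can be sketched as follows; the stage names, script paths, and file names are illustrative:

```yaml
# dvc.yaml — minimal two-stage pipeline sketch (illustrative names)
stages:
  prepare:
    cmd: python prepare.py data/raw.csv data/clean.csv
    deps:
      - prepare.py
      - data/raw.csv
    outs:
      - data/clean.csv
  train:
    cmd: python train.py data/clean.csv models/model.pkl
    deps:
      - train.py
      - data/clean.csv
    outs:
      - models/model.pkl
```

`dvc repro` then re-executes only the stages whose dependencies changed, caching everything else by content hash.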
4.2 Feature Store: A Central Repository for Features
In large ML teams, different projects often require similar features (e.g., "user's transaction count over the past 30 days"). Without a Feature Store, each team computes independently, leading to:
- Redundant computation: The same feature is independently developed by different teams
- Training-Serving Skew: Features computed in Python during training are re-implemented in Java for serving, causing logical inconsistencies
- Data Leakage (time-travel problem): Accidentally using future data
Feature Stores (such as Feast, Tecton, and Hopsworks) provide unified feature definition, storage, and serving, ensuring that training and inference use exactly the same feature computation logic. Huyen, in her book[10], compared the Feature Store to "middleware" for ML systems—a bridge connecting data engineering and model training.
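The skew and leakage problems both stem from having more than one implementation of a feature. A framework-free sketch of the fix, making one function the single source of truth (function and field names are illustrative, not from any Feature Store product):

```python
from datetime import datetime, timedelta

def txn_count_last_30d(transactions: list, now: datetime) -> int:
    """Feature: number of transactions in the 30 days before `now`.
    The strict upper bound (t < now) excludes future data, preventing
    time-travel leakage when `now` is a historical label timestamp."""
    cutoff = now - timedelta(days=30)
    return sum(1 for t in transactions if cutoff <= t < now)

history = [datetime(2024, 6, 1), datetime(2024, 6, 15), datetime(2024, 3, 1)]

# Training: compute the feature as of the label's point in time
train_value = txn_count_last_30d(history, now=datetime(2024, 6, 20))
# Serving: the exact same function, so no Python-vs-Java re-implementation skew
serve_value = txn_count_last_30d(history, now=datetime(2024, 6, 30))
print(train_value, serve_value)  # 2 2
```

A Feature Store industrializes exactly this idea: the definition is registered once, and both the offline (training) and online (serving) paths execute it.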
5. Model Packaging and Deployment: From Flask to BentoML
After model training is complete, the biggest challenge often just begins: how to transform a model in a Python script into a reliable, scalable, and monitorable production service?
5.1 The Evolution of Deployment Approaches
| Stage | Approach | Pros | Cons |
|---|---|---|---|
| V1: Manual | Flask / FastAPI self-wrapped | Rapid prototyping | No standardization, hard to maintain |
| V2: Containerized | Docker + Kubernetes | Environment consistency, scalable | Requires DevOps expertise |
| V3: Framework-based | BentoML / Seldon / KServe | Standardized, built-in best practices | Learning curve |
| V4: Serverless | AWS Lambda / Cloud Run | Zero ops, auto-scaling | Cold starts, model size limits |
5.2 BentoML: The Shortest Path from Model to API
BentoML is an open-source framework specifically designed for ML model serving. Its core philosophy is that data scientists should not need to learn Docker or Kubernetes to deploy a model. BentoML abstracts model deployment into three steps:
- Save the model: Use `bentoml.sklearn.save_model()` to store the trained model in a local model repository
- Define the service: Use Python decorators to declare API endpoints and define input/output formats
- Package and deploy: BentoML automatically generates a Docker image containing all dependencies and optimized configurations
BentoML also includes built-in production-grade features such as batch inference (Batching), adaptive micro-batching (Adaptive Batching), and multi-model composition (Runner)—features that would require hundreds of additional lines of code when building manually with Flask.
5.3 Deployment Strategies: Blue-Green, Canary, and Shadow Mode
Model updates in production should never be a matter of "shut down the old one, turn on the new one." Mature MLOps practices employ progressive deployment strategies:
- Blue-Green Deployment: Run both old and new versions simultaneously, enabling one-click rollback via traffic switching
- Canary Deployment: Initially route 5% of traffic to the new model, then gradually increase after confirming metrics are healthy
- Shadow Mode: The new model receives all requests but does not return actual results—it only logs predictions for offline comparison
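The routing logic behind a canary rollout is simple to sketch. Below, the two models are stubs and the 5% split is illustrative; in production the split is usually enforced at the load balancer or service mesh, not in application code:

```python
import random

random.seed(0)  # deterministic for the demo

def stable_model(x):  # current production model (stub)
    return "stable"

def new_model(x):     # candidate model (stub)
    return "canary"

def route(x, canary_fraction=0.05):
    """Canary routing: send a small slice of traffic to the new model."""
    if random.random() < canary_fraction:
        return new_model(x)
    return stable_model(x)

results = [route(i) for i in range(10_000)]
canary_share = results.count("canary") / len(results)
print(f"canary share: {canary_share:.3f}")  # close to 0.05
```

Gradually raising `canary_fraction` while watching the new model's error and latency metrics is what turns this into a safe, reversible rollout.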
6. Hands-on Lab 1: Complete MLflow Experiment Management Workflow
This lab walks you through the complete MLflow core workflow—from creating experiments, training multiple models, logging parameters and metrics, comparing experiment results, to selecting the best model and registering it.
Open Google Colab (CPU runtime is sufficient), create a new Notebook, and paste the following code blocks in order:
6.1 Environment Setup and Data Preparation
!pip install mlflow scikit-learn matplotlib -q
import mlflow
import mlflow.sklearn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import (
accuracy_score, precision_score, recall_score,
f1_score, confusion_matrix, ConfusionMatrixDisplay
)
import warnings
warnings.filterwarnings('ignore')
# Load Wine dataset (multi-class problem, 3 classes, 13 features)
wine = load_wine()
X, y = wine.data, wine.target
feature_names = wine.feature_names
target_names = wine.target_names
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(f"Training set: {X_train.shape[0]} samples, Test set: {X_test.shape[0]} samples")
print(f"Number of features: {X_train.shape[1]}, Number of classes: {len(target_names)}")
print(f"Class distribution (train): {np.bincount(y_train)}")
print(f"Class distribution (test): {np.bincount(y_test)}")
6.2 Define the Experiment Tracking Function
def train_and_log(model, model_name, params, X_tr, X_te, y_tr, y_te):
"""Train a model and log all information to MLflow"""
with mlflow.start_run(run_name=model_name):
# Log hyperparameters
mlflow.log_params(params)
mlflow.set_tag("model_type", model_name)
mlflow.set_tag("dataset", "wine")
mlflow.set_tag("scaler", "StandardScaler")
# Train
model.fit(X_tr, y_tr)
y_pred = model.predict(X_te)
# Log multiple evaluation metrics
metrics = {
"accuracy": accuracy_score(y_te, y_pred),
"precision_macro": precision_score(y_te, y_pred, average='macro'),
"recall_macro": recall_score(y_te, y_pred, average='macro'),
"f1_macro": f1_score(y_te, y_pred, average='macro'),
}
# Cross-validation score (more robust evaluation)
cv_scores = cross_val_score(model, X_tr, y_tr, cv=5, scoring='accuracy')
metrics["cv_mean_accuracy"] = cv_scores.mean()
metrics["cv_std_accuracy"] = cv_scores.std()
mlflow.log_metrics(metrics)
# Log artifacts: confusion matrix plot
fig, ax = plt.subplots(figsize=(6, 5))
cm = confusion_matrix(y_te, y_pred)
disp = ConfusionMatrixDisplay(cm, display_labels=target_names)
disp.plot(ax=ax, cmap='Blues')
ax.set_title(f"{model_name} — Confusion Matrix")
plt.tight_layout()
fig.savefig("confusion_matrix.png", dpi=100)
mlflow.log_artifact("confusion_matrix.png")
plt.close()
# Log the model itself
mlflow.sklearn.log_model(model, "model")
print(f" {model_name}: accuracy={metrics['accuracy']:.4f}, "
f"f1={metrics['f1_macro']:.4f}, "
f"cv={metrics['cv_mean_accuracy']:.4f}+/-{metrics['cv_std_accuracy']:.4f}")
return metrics
print("Training and logging function defined successfully")
6.3 Set Up MLflow Experiment and Train Multiple Models
# Create MLflow experiment
experiment_name = "wine_classification"
mlflow.set_experiment(experiment_name)
print("=" * 65)
print(" MLflow Experiment Management — Wine Classification Model Comparison")
print("=" * 65)
# Define model and hyperparameter combinations
experiments = [
{
"name": "LogisticRegression_C0.1",
"model": LogisticRegression(C=0.1, max_iter=1000, random_state=42),
"params": {"algorithm": "LogisticRegression", "C": 0.1, "max_iter": 1000}
},
{
"name": "LogisticRegression_C1.0",
"model": LogisticRegression(C=1.0, max_iter=1000, random_state=42),
"params": {"algorithm": "LogisticRegression", "C": 1.0, "max_iter": 1000}
},
{
"name": "LogisticRegression_C10.0",
"model": LogisticRegression(C=10.0, max_iter=1000, random_state=42),
"params": {"algorithm": "LogisticRegression", "C": 10.0, "max_iter": 1000}
},
{
"name": "RandomForest_100trees",
"model": RandomForestClassifier(n_estimators=100, max_depth=None, random_state=42),
"params": {"algorithm": "RandomForest", "n_estimators": 100, "max_depth": "None"}
},
{
"name": "RandomForest_200trees_depth5",
"model": RandomForestClassifier(n_estimators=200, max_depth=5, random_state=42),
"params": {"algorithm": "RandomForest", "n_estimators": 200, "max_depth": 5}
},
{
"name": "GradientBoosting_100",
"model": GradientBoostingClassifier(
n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
),
"params": {"algorithm": "GradientBoosting", "n_estimators": 100,
"learning_rate": 0.1, "max_depth": 3}
},
{
"name": "GradientBoosting_200_slow",
"model": GradientBoostingClassifier(
n_estimators=200, learning_rate=0.05, max_depth=4, random_state=42
),
"params": {"algorithm": "GradientBoosting", "n_estimators": 200,
"learning_rate": 0.05, "max_depth": 4}
},
{
"name": "SVM_rbf",
"model": SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42),
"params": {"algorithm": "SVM", "kernel": "rbf", "C": 1.0, "gamma": "scale"}
},
]
# Train and log each model to MLflow
all_results = {}
for exp in experiments:
result = train_and_log(
exp["model"], exp["name"], exp["params"],
X_train_scaled, X_test_scaled, y_train, y_test
)
all_results[exp["name"]] = result
print(f"\nCompleted {len(experiments)} experiments, all logged to MLflow")
6.4 Query and Compare Experiment Results
# Use MLflow API to query experiment results
from mlflow.tracking import MlflowClient
client = MlflowClient()
experiment = client.get_experiment_by_name(experiment_name)
runs = client.search_runs(
experiment_ids=[experiment.experiment_id],
order_by=["metrics.f1_macro DESC"]
)
print("=" * 75)
print(f" Experiment Results Ranking (sorted by F1-macro)")
print("=" * 75)
print(f"{'Rank':<5}{'Model':<35}{'Accuracy':<12}{'F1-macro':<12}{'CV Mean':<12}")
print("-" * 75)
for i, run in enumerate(runs):
m = run.data.metrics
print(f" {i+1:<3} {run.info.run_name:<35}"
f"{m['accuracy']:<12.4f}{m['f1_macro']:<12.4f}"
f"{m['cv_mean_accuracy']:<12.4f}")
# Best model
best_run = runs[0]
print(f"\nBest Model: {best_run.info.run_name}")
print(f" Run ID: {best_run.info.run_id}")
print(f" F1-macro: {best_run.data.metrics['f1_macro']:.4f}")
print(f" CV Accuracy: {best_run.data.metrics['cv_mean_accuracy']:.4f}"
f" +/- {best_run.data.metrics['cv_std_accuracy']:.4f}")
# Visual comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
names = [r.info.run_name.replace("_", "\n") for r in runs]
accs = [r.data.metrics['accuracy'] for r in runs]
f1s = [r.data.metrics['f1_macro'] for r in runs]
colors = ['#b8922e' if i == 0 else '#0077b6' for i in range(len(runs))]
axes[0].barh(names, accs, color=colors)
axes[0].set_xlabel('Accuracy')
axes[0].set_title('Model Accuracy Comparison')
axes[0].set_xlim(0.85, 1.01)
axes[1].barh(names, f1s, color=colors)
axes[1].set_xlabel('F1-macro')
axes[1].set_title('Model F1-macro Comparison (Gold = Best)')
axes[1].set_xlim(0.85, 1.01)
plt.tight_layout()
plt.savefig("model_comparison.png", dpi=120, bbox_inches='tight')
plt.show()
print("\nComparison charts saved")
6.5 Register the Best Model to the Model Registry
# Register the best model to MLflow Model Registry
model_name_registry = "wine_classifier_production"
model_uri = f"runs:/{best_run.info.run_id}/model"
registered = mlflow.register_model(model_uri, model_name_registry)
print(f"\nModel registered to Model Registry")
print(f" Model name: {registered.name}")
print(f" Version: {registered.version}")
print(f" Source Run: {best_run.info.run_name}")
# Update model description
client.update_registered_model(
name=model_name_registry,
description="Best Wine classification model — automatically selected through the MLflow experiment management workflow"
)
# Load the registered model and perform inference
loaded_model = mlflow.sklearn.load_model(model_uri)
sample = X_test_scaled[:5]
predictions = loaded_model.predict(sample)
print(f"\nModel inference test (first 5 test samples):")
for i in range(5):
actual = target_names[y_test[i]]
predicted = target_names[predictions[i]]
status = "Correct" if y_test[i] == predictions[i] else "Wrong"
print(f" [{status}] Actual: {actual:<12} Predicted: {predicted}")
print(f"\nLab 1 complete! You have learned:")
print(f" 1. Creating MLflow experiments and tracking multiple models")
print(f" 2. Logging hyperparameters, evaluation metrics, and artifacts")
print(f" 3. Querying and comparing experiment results using the API")
print(f" 4. Registering the best model to the Model Registry")
print(f" 5. Loading a model from the Registry for inference")
7. Hands-on Lab 2: Model Packaging and API Serving
In Lab 1, we used MLflow to manage the experiment workflow and select the best model. In this lab, we will use BentoML to package the model as a REST API ready for external serving—a critical step in the journey from "experiment" to "product."
Open Google Colab (CPU runtime is sufficient), create a new Notebook, and paste the following code blocks in order:
7.1 Environment Setup and Model Training
!pip install bentoml scikit-learn numpy requests -q
import bentoml
import numpy as np
import json
import time
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report
# Train a production-grade model
wine = load_wine()
X, y = wine.data, wine.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = GradientBoostingClassifier(
n_estimators=200, learning_rate=0.05, max_depth=4, random_state=42
)
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
acc = accuracy_score(y_test, y_pred)
print(f"Model training complete — Test set Accuracy: {acc:.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=wine.target_names))
7.2 Save the Model to the BentoML Model Store
# Save the model and preprocessor together to BentoML
saved_model = bentoml.sklearn.save_model(
"wine_classifier",
model,
signatures={"predict": {"batchable": True}},
labels={"task": "classification", "dataset": "wine", "framework": "sklearn"},
metadata={
"accuracy": float(acc),
"n_features": X_train.shape[1],
"n_classes": len(wine.target_names),
"feature_names": list(wine.feature_names),
"target_names": list(wine.target_names),
},
custom_objects={
"scaler": scaler # Save the preprocessor alongside the model
}
)
print(f"Model saved to BentoML Model Store")
print(f" Model Tag: {saved_model.tag}")
print(f" Storage Path: {saved_model.path}")
# List all saved models
print(f"\nSaved models list:")
for m in bentoml.models.list():
print(f" - {m.tag} (Created: {m.info.creation_time})")
7.3 Define the BentoML Service
# Write the BentoML Service definition file
service_code = '''
import numpy as np
import bentoml
from bentoml.io import NumpyNdarray, JSON
# Load model and preprocessor
model_runner = bentoml.sklearn.get("wine_classifier:latest").to_runner()
model_ref = bentoml.models.get("wine_classifier:latest")
scaler = model_ref.custom_objects["scaler"]
metadata = model_ref.info.metadata
svc = bentoml.Service("wine_classifier_service", runners=[model_runner])
@svc.api(input=NumpyNdarray(), output=JSON())
async def predict(input_array: np.ndarray) -> dict:
"""Accept raw features, return predicted class and probability"""
# Preprocessing
scaled = scaler.transform(input_array.reshape(1, -1) if input_array.ndim == 1 else input_array)
# Prediction
predictions = await model_runner.predict.async_run(scaled)
target_names = metadata["target_names"]
results = []
for pred in predictions:
results.append({
"class_id": int(pred),
"class_name": target_names[int(pred)],
})
return {"predictions": results, "model": str(model_ref.tag)}
@svc.api(input=JSON(), output=JSON())
async def predict_json(input_data: dict) -> dict:
"""Accept feature data in JSON format"""
features = np.array(input_data["features"])
scaled = scaler.transform(features.reshape(1, -1) if features.ndim == 1 else features)
predictions = await model_runner.predict.async_run(scaled)
target_names = metadata["target_names"]
results = []
for pred in predictions:
results.append({
"class_id": int(pred),
"class_name": target_names[int(pred)],
})
return {
"predictions": results,
"model": str(model_ref.tag),
"feature_names": metadata["feature_names"]
}
@svc.api(input=JSON(), output=JSON())
async def model_info(input_data: dict) -> dict:
"""Return model metadata"""
return {
"model_tag": str(model_ref.tag),
"accuracy": metadata["accuracy"],
"n_features": metadata["n_features"],
"n_classes": metadata["n_classes"],
"feature_names": metadata["feature_names"],
"target_names": metadata["target_names"],
}
'''
with open("service.py", "w") as f:
f.write(service_code)
print("BentoML Service definition file (service.py) created")
print(" Contains 3 API endpoints:")
print(" - /predict : NumPy array input")
print(" - /predict_json : JSON format input")
print(" - /model_info : Model metadata query")
7.4 Simulated API Inference Test
# Test model inference logic directly in Colab (without starting an HTTP Server)
print("=" * 65)
print(" Model Packaging and Inference Test")
print("=" * 65)
# Load the saved model
model_ref = bentoml.models.get("wine_classifier:latest")
loaded_model = bentoml.sklearn.load_model("wine_classifier:latest")
loaded_scaler = model_ref.custom_objects["scaler"]
meta = model_ref.info.metadata
print(f"\nModel Info:")
print(f" Tag: {model_ref.tag}")
print(f" Accuracy: {meta['accuracy']:.4f}")
print(f" Number of features: {meta['n_features']}")
print(f" Classes: {meta['target_names']}")
# Simulated API request — single inference
print(f"\n--- Single Inference Test ---")
sample = X_test[0]
sample_scaled = loaded_scaler.transform(sample.reshape(1, -1))
pred = loaded_model.predict(sample_scaled)
print(f" Input features (first 5): {sample[:5].round(2)}")
print(f" Predicted class ID: {pred[0]}")
print(f" Predicted class name: {wine.target_names[pred[0]]}")
print(f" Actual class name: {wine.target_names[y_test[0]]}")
# Simulated API request — batch inference
print(f"\n--- Batch Inference Test (10 samples) ---")
batch = X_test[:10]
batch_scaled = loaded_scaler.transform(batch)
batch_preds = loaded_model.predict(batch_scaled)
correct = sum(batch_preds == y_test[:10])
print(f" Batch inference results: {correct}/10 correct")
for i in range(10):
actual = wine.target_names[y_test[i]]
predicted = wine.target_names[batch_preds[i]]
status = "OK" if y_test[i] == batch_preds[i] else "NG"
print(f" [{status}] #{i+1} Actual: {actual:<12} Predicted: {predicted}")
# Inference latency test
print(f"\n--- Inference Latency Benchmark ---")
single_sample = loaded_scaler.transform(X_test[0].reshape(1, -1))
batch_all = loaded_scaler.transform(X_test)  # all 36 test samples
# Single-sample latency
times_single = []
for _ in range(100):
    t0 = time.time()
    _ = loaded_model.predict(single_sample)
    times_single.append((time.time() - t0) * 1000)
# Batch latency
times_batch = []
for _ in range(100):
    t0 = time.time()
    _ = loaded_model.predict(batch_all)
    times_batch.append((time.time() - t0) * 1000)
print(f" Single inference: {np.mean(times_single):.3f}ms (p99: {np.percentile(times_single, 99):.3f}ms)")
print(f" Batch {len(batch_all)} samples: {np.mean(times_batch):.3f}ms (p99: {np.percentile(times_batch, 99):.3f}ms)")
print(f" Batch efficiency: {np.mean(times_single) * len(batch_all) / np.mean(times_batch):.1f}x")
7.5 Build the Bento and Inspect the Package Structure
# Build bentofile.yaml (BentoML packaging configuration)
bentofile_content = '''
service: "service:svc"
labels:
owner: meta-intelligence
project: wine-classifier
stage: production
include:
- "*.py"
python:
packages:
- scikit-learn
- numpy
'''
with open("bentofile.yaml", "w") as f:
f.write(bentofile_content)
print("bentofile.yaml created")
print("\nPackaging configuration content:")
print(bentofile_content)
# Display the complete deployment workflow
print("=" * 65)
print(" Production Deployment Workflow (CLI Command Guide)")
print("=" * 65)
print("""
In your local development environment, execute the following commands to complete deployment:
# 1. Start local dev server (for testing)
$ bentoml serve service:svc --reload
# 2. Package into a Bento
$ bentoml build
# 3. Containerize (generate Docker image)
$ bentoml containerize wine_classifier_service:latest
# 4. Run the container
$ docker run -p 3000:3000 wine_classifier_service:latest
# 5. Test the API
$ curl -X POST http://localhost:3000/predict_json \\
-H "Content-Type: application/json" \\
-d '{"features": [13.0, 1.5, 2.3, 15.0, 100, 2.8, 3.0, 0.28, 2.29, 5.64, 1.04, 3.92, 1065]}'
""")
print(f"\nLab 2 complete! You have learned:")
print(f" 1. Saving models and preprocessors to BentoML Model Store")
print(f" 2. Defining multi-endpoint API Services")
print(f" 3. Testing model inference (single and batch)")
print(f" 4. Creating packaging configurations and deployment workflows")
print(f" 5. Understanding the complete path from development to containerized deployment")
8. CI/CD for ML: Automated Testing and Continuous Deployment
CI/CD for traditional software is well established, but continuous integration and delivery for ML systems face unique challenges. Sato et al.'s CD4ML framework[12] raised an important point: ML systems have three axes of change that require version control—code, models, and data. A change on any axis may require re-validation and redeployment.
8.1 ML-Specific Testing Strategies
The ML Test Score proposed by Breck et al.[7] defines a comprehensive testing rubric for ML systems, covering four major categories:
Data Tests:
- Whether feature statistical distributions fall within expected ranges (min, max, mean, missing rate)
- Whether training data and serving data schemas are consistent
- Whether Data Leakage exists
- Whether inter-feature correlations remain stable
Model Tests:
- Whether model performance on benchmark test sets exceeds the minimum threshold
- Whether the new model outperforms the current production model (regression testing)
- Whether model performance is equitable across different subgroups (Fairness Testing)
- Model robustness against adversarial examples
Infrastructure Tests:
- Whether the training process is reproducible (Reproducibility)
- Whether model serialization/deserialization is correct
- Whether API endpoint response times are within SLA bounds
- Whether resource usage (memory, CPU, GPU) is within budget
Monitoring Tests:
- Whether comprehensive logging is in place
- Whether alert thresholds are properly configured
- Whether automatic rollback mechanisms exist when model performance degrades
8.2 ML CI/CD Pipeline with GitHub Actions
Below is an outline of a typical ML CI/CD pipeline implemented with GitHub Actions:
# .github/workflows/ml-pipeline.yml structure overview
#
# Triggers: push to main / PR / scheduled (daily retraining)
#
# Stage 1: Data Validation
# - Check data schema consistency
# - Validate feature distributions (Great Expectations / Pandera)
# - Detect data drift
#
# Stage 2: Model Training
# - Pull latest training data from DVC
# - Execute training pipeline
# - Log experiments with MLflow
#
# Stage 3: Model Validation
# - Benchmark test set performance >= threshold
# - New model >= current production model
# - Fairness checks pass
# - Latency benchmark passes
#
# Stage 4: Model Deployment
# - BentoML packaging
# - Docker containerization
# - Canary deployment (5% traffic)
# - Monitor for 30 minutes
# - Full traffic cutover
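The first two stages of this outline can be sketched as a concrete workflow file. This is a minimal, abridged sketch, not a complete pipeline: the `scripts/validate_data.py`, `scripts/train.py`, and `scripts/evaluate.py` entry points are hypothetical, and it assumes a configured DVC remote and MLflow tracking server:

```yaml
# .github/workflows/ml-pipeline.yml (abridged sketch)
name: ml-pipeline
on:
  push:
    branches: [main]
  pull_request:
  schedule:
    - cron: "0 2 * * *"   # daily retraining

jobs:
  data-validation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: python scripts/validate_data.py   # schema + drift checks

  train-and-validate:
    needs: data-validation
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: dvc pull data/                    # fetch versioned training data
      - run: python scripts/train.py           # logs experiments to MLflow
      - run: python scripts/evaluate.py --min-accuracy 0.85
```

The `needs:` keyword enforces the stage ordering from the outline; deployment stages would follow the same pattern, gated on the validation job succeeding.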
Lu et al.'s survey of the MLOps tool ecosystem[14] pointed out that the fragmentation of MLOps toolchains is one of the biggest barriers to enterprise adoption. Integration between different tools often requires substantial "glue code," which itself becomes a new source of technical debt.
9. Model Monitoring: Data Drift and Model Drift Detection
Deploying a model is not the finish line—it is the starting point of a new journey. Klaise et al.'s monitoring survey[13] systematically categorized the degradation risks facing production ML models, with the two most critical types being Data Drift and Model Drift.
9.1 Data Drift: Changes in Input Data Distribution
Data Drift refers to significant changes in the distribution of input data in production compared to the training data. This is the most common cause of ML model degradation.
Common causes:
- Seasonal variations: E-commerce purchasing behavior differs drastically between holidays and regular days
- Upstream system changes: Data providers modify ETL logic or field definitions
- Evolving user behavior: COVID-19 fundamentally altered user patterns across many industries
- Feature computation errors: Code bugs cause anomalous feature values
Detection methods:
- Statistical tests: KS Test (Kolmogorov-Smirnov), Chi-Square Test, PSI (Population Stability Index)
- Distribution distances: KL Divergence, Wasserstein Distance, Jensen-Shannon Divergence
- Visualization: Time-series comparison charts of feature distributions
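Of these, PSI is the most common production metric because it needs no labels and has conventional alert thresholds (PSI > 0.2 is the threshold used in the monitoring table in Section 9.3). A self-contained sketch of one standard formulation; binning choices and the epsilon guard are implementation details that vary between tools:

```python
import numpy as np

def population_stability_index(expected, actual, bins: int = 10) -> float:
    """PSI between a reference (training) sample and a production
    sample. Bin edges come from the reference quantiles; a small
    epsilon guards against log(0) in empty bins."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    clipped = np.clip(actual, edges[0], edges[-1])  # cover out-of-range values
    eps = 1e-6
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a_pct = np.histogram(clipped, bins=edges)[0] / len(actual) + eps
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 10_000)    # reference distribution
stable = rng.normal(0.0, 1.0, 10_000)   # same distribution: PSI near 0
shifted = rng.normal(0.8, 1.0, 10_000)  # mean shift: PSI well above 0.2
```

Libraries such as Evidently and NannyML compute PSI (and the other distances listed above) per feature and over time windows, but the underlying calculation is this simple.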
9.2 Model Drift (Concept Drift): Changes in Input-Output Relationships
Even when the input distribution remains unchanged, the relationship between inputs and targets can shift. For example, before the pandemic, users searching for "face masks" were mostly in the medical field; after the pandemic, they were the general public—the same input features mapped to labels with fundamentally different meanings.
Detection strategies:
- Direct monitoring: Track model predictive performance on production data (requires delayed labels)
- Indirect monitoring: Track changes in prediction distributions (no labels needed, but lower sensitivity)
- Window comparison: Compare model performance within sliding windows against historical baselines
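The window-comparison strategy reduces to a few lines of bookkeeping. A minimal sketch, assuming delayed labels eventually arrive and that accuracy is the monitored metric; the class name, window size, and tolerance are illustrative:

```python
from collections import deque

class WindowedPerformanceMonitor:
    """Sliding-window concept-drift check: flag drift when windowed
    accuracy falls more than `tolerance` below the offline baseline."""

    def __init__(self, baseline_accuracy: float,
                 window_size: int = 500, tolerance: float = 0.03):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        self.window = deque(maxlen=window_size)  # 1 = correct, 0 = wrong

    def record(self, y_true, y_pred) -> None:
        """Record one (delayed) label against its stored prediction."""
        self.window.append(int(y_true == y_pred))

    @property
    def windowed_accuracy(self):
        return sum(self.window) / len(self.window) if self.window else None

    def drift_detected(self) -> bool:
        acc = self.windowed_accuracy
        return acc is not None and acc < self.baseline - self.tolerance
```

A usage sketch: a model with 90% offline accuracy stays healthy while production accuracy holds at 90%, then trips the detector when accuracy collapses—exactly the "mask search" scenario above, where unchanged inputs start mapping to different labels.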
9.3 Monitoring System Architecture
A comprehensive ML monitoring system should include the following layers:
| Monitoring Layer | Metrics | Tools | Alert Threshold |
|---|---|---|---|
| Infrastructure Layer | CPU, memory, latency, throughput | Prometheus + Grafana | P99 latency > 200ms |
| Data Quality Layer | Missing rate, outliers, schema drift | Great Expectations | Missing rate > 5% |
| Data Drift Layer | PSI, KS statistic | Evidently / NannyML | PSI > 0.2 |
| Model Performance Layer | Accuracy, F1, AUC | MLflow + custom | Below baseline by 3% |
| Business Metrics Layer | Conversion rate, revenue impact | Custom dashboard | Defined by business |
Testi et al., in their IEEE Access study[15], proposed a taxonomy and methodology for MLOps, emphasizing that monitoring should not be merely reactive—waiting for problems to occur—but should proactively predict when models need retraining. They recommend establishing a "model health score" that integrates multi-dimensional metrics to assess the current state of a model.
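Testi et al. motivate the idea of a composite health score but do not prescribe a formula; one simple hypothetical realization is a weighted average of normalized health signals, with retraining triggered below a chosen floor (all names, weights, and thresholds below are illustrative assumptions):

```python
def model_health_score(signals: dict, weights: dict) -> float:
    """Hypothetical composite 'model health score' in [0, 1]:
    a weighted average of normalized health signals."""
    total = sum(weights.values())
    return sum(weights[name] * signals[name] for name in weights) / total

signals = {
    "accuracy_vs_baseline": 0.97,  # current accuracy / baseline, capped at 1
    "data_stability": 0.90,        # e.g. 1 - min(worst feature PSI / 0.2, 1)
    "latency_headroom": 1.00,      # fraction of the SLA budget unused
}
weights = {"accuracy_vs_baseline": 0.5,
           "data_stability": 0.3,
           "latency_headroom": 0.2}

score = model_health_score(signals, weights)
retrain_recommended = score < 0.8  # proactive trigger, not a hard alert
```

The value of such a score is less in the arithmetic than in the contract: a single number that aggregates the monitoring layers of the table above into one retraining signal.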
10. Decision Framework: MLOps Toolchain Selection Guide
Faced with numerous MLOps tools, enterprises often find themselves paralyzed by choice. Below are toolchain recommendations based on team size and maturity:
10.1 Early Stage (1–3 Person ML Team)
| Function | Recommended Tool | Alternative | Rationale |
|---|---|---|---|
| Experiment Tracking | MLflow (local mode) | Weights & Biases | Free, lightweight, no infrastructure needed |
| Version Control | Git + DVC | Git LFS | Unified version control for data and code |
| Model Deployment | BentoML | FastAPI + Docker | Built-in best practices, less glue code |
| Monitoring | Evidently (report mode) | Manual scripts | Open-source, easy to get started |
10.2 Growth Stage (4–10 Person ML Team)
| Function | Recommended Tool | Alternative | Rationale |
|---|---|---|---|
| Experiment Tracking | MLflow (server mode) | Neptune.ai | Team sharing, unified management |
| Pipeline Orchestration | Prefect / Airflow | Kubeflow Pipelines | Scheduling, retries, dependency management |
| Feature Store | Feast | Hopsworks | Avoid redundant feature computation |
| Model Deployment | BentoML + K8s | Seldon Core | Containerized + auto-scaling |
| Monitoring | Evidently + Grafana | NannyML | Real-time drift detection + visualization |
10.3 Mature Stage (10+ Person ML Team / Multi-Model Production)
| Function | Recommended Tool | Alternative | Rationale |
|---|---|---|---|
| End-to-End Platform | Kubeflow | AWS SageMaker | Unified lifecycle management |
| Feature Store | Tecton / Feast on K8s | SageMaker Feature Store | Enterprise-grade feature management |
| Model Serving | KServe | Triton Inference Server | Multi-framework support, GPU inference |
| Monitoring | Evidently + PagerDuty | Fiddler AI | Auto-alerting + incident management |
| Governance | MLflow + custom | Weights & Biases | Model auditing, compliance tracking |
10.4 Three Core Principles for Tool Selection
Regardless of team size, the following principles should guide tool selection:
- Start small, scale incrementally: Do not deploy a full Kubeflow cluster from day one. Begin with MLflow in local mode for experiment management, then upgrade infrastructure as team size and model count grow. Premature architectural investment is a common cause of MLOps adoption failure.
- Prioritize eliminating the biggest pain point: If the team's biggest problem is "we can't find last time's experiment results," introduce experiment tracking first. If it's "deploying a model takes two weeks," establish automated deployment first. Do not try to solve every problem at once.
- Choose open standards over closed platforms: MLflow's model format, ONNX's model exchange format, OCI's container standards—these open standards ensure you are not locked into a single platform and preserve migration flexibility for the future.
11. Conclusion: MLOps Is Not a Tools Problem—It Is a Culture Problem
Let us return to the statistic from the beginning of this article—87% of ML projects never make it to production. By now, the reason should be clear: it is not because our models are not good enough, but because we treated "training a high-accuracy model" as the finish line, overlooking the enormous chasm between experimentation and production.
The core value of MLOps lies not in any single tool—not in MLflow, not in DVC, not in BentoML—but in the cultural shift it represents: from "one-off model development" to "continuous iterative ML systems engineering."
The concept of "hidden technical debt" proposed by Sculley et al. in their foundational paper[1] remains as relevant as ever. Every untracked experiment, every manually deployed model, every unmonitored production service accumulates technical debt. This debt does not disappear on its own—it manifests as model degradation, deployment failures, and debugging nightmares.
For enterprises considering MLOps adoption, our recommendations are:
- Start tracking your experiments with MLflow today. This is the lowest-cost, highest-return first step. As Lab 1 demonstrated, just a few lines of code can fundamentally transform how you manage experiments.
- Establish a standardized model deployment process. Lab 2 showed how BentoML can transform a model from a pickle file into a testable, containerizable, and scalable service.
- Build monitoring from day one. After a model goes live, Data Drift and Model Drift are inevitable. The earlier you establish detection mechanisms, the better you can avoid the disaster of "a model silently failing for three months before anyone notices."
- Invest in team culture, not just tools. The success of MLOps depends on close collaboration among data scientists, ML engineers, and DevOps teams. Tools can facilitate collaboration, but they cannot replace communication.
Machine learning is transitioning from a "research-driven" to an "engineering-driven" era. Organizations that can establish mature MLOps practices will hold a decisive advantage in the race to deploy AI—not because their models are better, but because they can deliver model value to production faster, more reliably, and more sustainably.