The Complete Guide to MLOps: An End-to-End Workflow from Experiment Management to Model Deployment

Key Findings

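According to widely cited industry surveys, roughly 87% of machine learning projects never reach production; the bottleneck is usually not model accuracy but the lack of an engineering workflow^[5]
Google's three-level MLOps maturity model (Level 0 manual → Level 1 ML pipeline automation → Level 2 CI/CD pipeline automation) gives organizations a clear path for progression^[8]
In a production ML system, the actual ML code (model training) is only a small fraction of the whole, often put at about 5-10%; the remaining 90-95% is data management, feature engineering, monitoring, deployment, and other infrastructure^[1]
Organizations that adopt MLOps report shortening the development-to-deployment cycle from months to days and cutting the time to detect model failures from weeks to minutes^[4]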

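1. 87% of ML Projects Never Ship: Why MLOps Is Required for Putting AI into Production

The history of machine learning has a recurring, ironic pattern: models that dazzle in the lab often fail once they meet the real world. In their ACM Computing Surveys survey^[5], Paleyes et al. systematically catalogued the challenges of deploying machine learning and found that problems rarely lie in the algorithm itself. They lie in everything around it: fragile data pipelines, untraceable experiment results, manual deployment processes, and missing post-launch monitoring.

In 2015, a Google research team published a now-classic paper at NeurIPS^[1] whose widely reproduced architecture diagram made the point starkly: in a production-grade ML system, the model training code (the part we pour the most effort into) is only a small box. Surrounding it is a large body of infrastructure: data collection, data verification, feature extraction, resource management, serving infrastructure, and monitoring. This "glue code" and infrastructure is what determines whether an ML system runs reliably in production.

A large-scale empirical study at Microsoft^[2] reinforced this. After interviewing dozens of internal ML teams, the authors found that software engineering best practices (version control, continuous integration, automated testing, monitoring and alerting) were badly neglected in ML development. Data scientists tend to iterate quickly in Jupyter Notebooks, but that workflow has fundamental weaknesses in collaboration, reproducibility, and productionization.

MLOps (Machine Learning Operations) exists to close this gap. It extends the core ideas of DevOps (automation, monitoring, collaboration, continuous delivery) to the full machine learning lifecycle. In their IEEE Access review^[4], Kreuzberger et al. define MLOps as a set of principles and practices aimed at reliably and efficiently deploying and maintaining machine learning models in production.

This article works through the core components of MLOps, from conceptual framework to hands-on practice. Beyond explaining the "why", it includes two labs that can be run in Google Colab so readers can try the MLOps toolchain for themselves.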

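2. The MLOps Maturity Model: From Manual to Fully Automated

In its MLOps architecture guide^[8], Google Cloud proposes a three-level maturity model that has become one of the most widely adopted frameworks for MLOps progression. Understanding the three levels is the first step in planning an organization's MLOps strategy.

Level 0: Manual process

This is where most ML teams start, and where the 87% of projects get stuck. Its characteristics include:

Data scientists train models by hand, locally or in a Notebook
Experiments are recorded in Excel or personal notes ("I think lr=0.01 worked better last time")
Models are handed off by "emailing the pickle file to an engineer"
Deployment is a manual, one-off event with no automated tests
There is no systematic monitoring after launch, so model degradation goes unnoticed

The interview study by Shankar et al.^[11] describes the Level 0 predicament vividly: one ML engineer interviewed said their model release process took 47 manual steps, and a mistake at any step meant starting over.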

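Level 1: ML pipeline automation

At this level, the key breakthrough is packaging the training process as an automated pipeline:

Data validation, feature engineering, training, and evaluation form an automated pipeline
The pipeline can retrain automatically when new data arrives
An experiment tracking tool (such as MLflow) records the parameters and results of every training run
Model deployment may still need some manual steps
Basic model performance monitoring is in place

TFX^[9] is the reference example of Level 1 practice inside Google. It chains data validation (TFDV), model analysis (TFMA), and serving (TF Serving) into one automated pipeline, turning retraining from a manual chore into a single triggered run.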

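Level 2: Full CI/CD automation

The highest level automates the ML system end to end:

Code changes automatically trigger the CI/CD pipeline
Automated tests cover data quality, model performance, and service stability
A model that passes all tests is deployed to production automatically
Full A/B testing and progressive rollout (canary deployment)
Real-time monitoring of Data Drift and Model Drift, with automatic retraining triggers

In their ThoughtWorks technical report^[12], Sato et al. describe the practice of CD4ML (Continuous Delivery for Machine Learning) in detail, showing how continuous delivery methods from software engineering apply to ML systems.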

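Aspect	Level 0 Manual	Level 1 Pipeline	Level 2 CI/CD
Training trigger	Manual run	Triggered by new data	Triggered by code / data changes
Experiment tracking	Excel / notes	MLflow / W&B	MLflow + automated comparison
Model deployment	Manual scp / email	Semi-automated	Automated + canary
Testing	None	Basic validation	Full coverage of data / model / service
Monitoring	None	Basic metrics	Drift detection + automated alerts
Iteration cycle	Weeks to months	Days	Hours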

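3. Experiment Management: Tracking Every Training Run with MLflow

Experiment management is the foundation of MLOps. Without systematic experiment tracking, ML development resembles writing software before version control: everyone tries things in their own copy, and nobody knows which version is the "right" one.

MLflow^[3], open-sourced by Databricks, is one of the most widely adopted platforms for ML experiment management. It is organized into four core modules (Tracking, Projects, Models, and Model Registry); the three used in this article are described below:

3.1 MLflow Tracking: Experiment tracking

The central concept in MLflow Tracking is the Run. Each training execution is one Run, which records:

Parameters: hyperparameters (learning rate, batch size, epochs, and so on)
Metrics: evaluation metrics (accuracy, loss, F1-score, and so on), with support for step-by-step logging
Artifacts: outputs (the trained model file, confusion matrix plots, feature importances, and so on)
Tags: custom labels (experiment purpose, dataset version, operator, and so on)

Runs are grouped under an Experiment, and MLflow's built-in web UI lets you compare Runs side by side. This addresses one of the most common pain points in ML development: "Which parameters did I use for that model that did so well last week?"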

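3.2 MLflow Models: Standardized model packaging

MLflow Models defines a uniform packaging format: whether the underlying framework is scikit-learn, PyTorch, or TensorFlow, the model is packaged the same way. Each MLflow Model contains:

MLmodel file: describes the model's flavors (the ways it can be loaded)
Model binary: the serialized model weights
conda.yaml / requirements.txt: a description of the dependency environment
input_example: an optional sample input for inference testing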

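3.3 MLflow Model Registry: Model version management

Model Registry brings lifecycle management to models. Each registered model version can be marked with a stage:

Staging: awaiting validation, under A/B testing or performance evaluation
Production: validated and serving live traffic
Archived: a retired older version

This gives model upgrades and rollbacks an explicit protocol, replacing the high-risk habit of overwriting the live model.pkl in place. Note that MLflow 2.9 and later deprecate these fixed stages in favor of model version aliases and tags; the lifecycle idea is unchanged, but new projects may prefer aliases.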

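4. Data Versioning and Feature Engineering: DVC and the Feature Store

In their ACM SIGMOD Record study^[6], Polyzotis et al. argue that data management is among the most underestimated challenges in production ML. A model is only as good as its training data, and data in production changes continuously, which makes data versioning an indispensable part of MLOps.

4.1 DVC (Data Version Control): Git for large data

Git is the gold standard for versioning code, but it is not designed for gigabytes or terabytes of training data and model files. DVC fills that gap by layering a data versioning mechanism on top of Git:

Git tracks small .dvc metadata files: each records the data file's hash, size, and path (the remote storage location is kept in DVC's configuration)
The actual data lives in remote storage: S3, GCS, Azure Blob, SSH, and others are supported
Pipeline definition: dvc.yaml describes the data processing DAG (directed acyclic graph)
Version switching: git checkout followed by dvc checkout restores the data state of any historical version

Every training run can therefore be tied to an exact data version, which settles the old question "Which data was this model trained on?"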
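To make the pointer-file idea concrete, here is a minimal Python sketch of content-addressed data versioning. It is not DVC's implementation: the `make_pointer` and `restore` helpers, the dictionary layout of the pointer record, and the local `cache` directory are assumptions made for illustration. The only thing borrowed from DVC is the general scheme of storing data under its content hash and committing a small metadata record in its place.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets never have to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_pointer(data_path: Path, cache_dir: Path) -> dict:
    """Copy the data into a content-addressed cache and return a small pointer record."""
    md5 = file_md5(data_path)
    cache_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(data_path, cache_dir / md5)
    return {"path": data_path.name, "md5": md5, "size": data_path.stat().st_size}


def restore(pointer: dict, cache_dir: Path, target: Path) -> None:
    """Recreate the exact data version that a pointer record refers to."""
    shutil.copyfile(cache_dir / pointer["md5"], target)


# Demo: two versions of a dataset, then a roll back to the first one.
workdir = Path(tempfile.mkdtemp())
data = workdir / "train.csv"
cache = workdir / "cache"

data.write_text("id,label\n1,0\n2,1\n")
pointer_v1 = make_pointer(data, cache)   # this small record is what Git would track

data.write_text("id,label\n1,0\n2,1\n3,1\n")
pointer_v2 = make_pointer(data, cache)

restore(pointer_v1, cache, data)
restored_md5 = file_md5(data)
print(json.dumps(pointer_v1, indent=2))
```

Because the pointer is only a few bytes, committing it alongside the training code is what ties a model to the exact data it saw.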

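4.2 Feature Store: A central repository for features

In large ML teams, different projects often need similar features (for example, "number of transactions by a user in the past 30 days"). Without a Feature Store, each team computes them independently, which leads to:

Duplicated work: the same feature is developed again by different teams
Training-serving skew: features are computed in Python for training and reimplemented in Java for serving, and the logic diverges
Time-travel leakage (data leakage): future data is accidentally used when building training features

A Feature Store (such as Feast, Tecton, or Hopsworks) provides unified feature definition, storage, and serving, so that training and inference use the same feature computation logic. In her book^[10], Huyen likens the Feature Store to the "middleware" of an ML system: the bridge between data engineering and model training.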
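The point-in-time guarantee is the part of a Feature Store that is easiest to get wrong by hand. The sketch below shows the idea in plain Python: for each training label, look up the latest feature value recorded at or before the label's timestamp, never after it. The `FeatureLog` class and its method names are hypothetical and do not correspond to the API of Feast, Tecton, or Hopsworks.

```python
from bisect import bisect_right
from collections import defaultdict


class FeatureLog:
    """Append-only log of (timestamp, value) per entity, queried 'as of' a time."""

    def __init__(self):
        self._times = defaultdict(list)
        self._values = defaultdict(list)

    def write(self, entity_id: str, timestamp: int, value: float) -> None:
        # Assumes writes arrive in timestamp order for each entity.
        self._times[entity_id].append(timestamp)
        self._values[entity_id].append(value)

    def as_of(self, entity_id: str, timestamp: int):
        """Return the latest value with time <= timestamp, or None if none exists."""
        idx = bisect_right(self._times[entity_id], timestamp)
        return self._values[entity_id][idx - 1] if idx > 0 else None


log = FeatureLog()
log.write("user_1", 10, 3.0)   # e.g. 3 transactions in the trailing window at t=10
log.write("user_1", 20, 5.0)
log.write("user_1", 30, 9.0)

# Label events as (entity, event time). Features must not come from the future.
labels = [("user_1", 5), ("user_1", 20), ("user_1", 25)]
training_rows = [(entity, t, log.as_of(entity, t)) for entity, t in labels]
print(training_rows)
```

Online serving would call the same `as_of` lookup with the current time, which is how a single feature definition avoids training-serving skew.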

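5. Model Packaging and Deployment: From Flask to BentoML

Once training is done, the hardest part often begins: how do you turn a model inside a Python script into a reliable, scalable, observable production service?

5.1 The evolution of deployment patterns

Stage	Approach	Strengths	Weaknesses
V1 Manual	Hand-rolled Flask / FastAPI wrapper	Fast prototyping	No standardization, hard to maintain
V2 Containerized	Docker + Kubernetes	Consistent environments, scalable	Requires DevOps expertise
V3 Framework-based	BentoML / Seldon / KServe	Standardized, best practices built in	Learning curve
V4 Serverless	AWS Lambda / Cloud Run	Minimal operations, automatic scaling	Cold starts, model size limits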

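5.2 BentoML: The shortest path from model to API

BentoML is an open-source framework built for serving ML models. Its core premise is that data scientists should not have to learn Docker and Kubernetes to deploy a model. BentoML reduces deployment to three steps:

Save the model: bentoml.sklearn.save_model() stores the trained model in the local model store
Define the service: Python decorators declare the API endpoints and their input and output formats
Package and deploy: BentoML generates a Docker image containing all dependencies and an optimized configuration

BentoML also ships production features such as batch inference (batching), adaptive micro-batching (adaptive batching), and multi-model composition (Runners), which would take hundreds of lines of extra code to build by hand on Flask.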

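5.3 Deployment strategies: Blue-green, canary, and shadow mode

A model update in production should not be "turn off the old one, turn on the new one". Mature MLOps practice uses progressive deployment strategies:

Blue-green deployment: run the old and new versions side by side and move traffic between them, so rollback is a single switch
Canary deployment: send a small share of traffic (for example 5%) to the new model first, then widen it step by step once the metrics look healthy
Shadow mode: the new model receives every request but its output is never returned to users; predictions are only logged for offline comparison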
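As a rough illustration of the canary and shadow strategies, the sketch below routes requests deterministically by hashing a request key. The `bucket`, `route`, and `handle` functions and the model callables are hypothetical stand-ins; real deployments usually implement this split at the load balancer or service mesh rather than in application code.

```python
import hashlib


def bucket(request_key: str) -> int:
    """Map a request key to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(request_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def route(request_key: str, canary_percent: int) -> str:
    """Send a fixed share of traffic to the canary; the same key always gets the same answer."""
    return "canary" if bucket(request_key) < canary_percent else "stable"


def handle(request_key, features, stable_model, candidate_model,
           canary_percent=5, shadow=False, shadow_log=None):
    """Serve one request under a canary or shadow rollout."""
    if shadow:
        # Shadow mode: always answer with the stable model, only record the candidate.
        result = stable_model(features)
        if shadow_log is not None:
            shadow_log.append((request_key, result, candidate_model(features)))
        return result
    use_canary = route(request_key, canary_percent) == "canary"
    return candidate_model(features) if use_canary else stable_model(features)


# Demo with two dummy "models" that just report their version.
stable = lambda features: "v1"
candidate = lambda features: "v2"

keys = [f"user-{i}" for i in range(10_000)]
canary_share = sum(route(k, 5) == "canary" for k in keys) / len(keys)
print(f"canary share: {canary_share:.3f}")

shadow_records = []
answers = {handle(k, None, stable, candidate, shadow=True, shadow_log=shadow_records)
           for k in keys[:100]}
```

Hashing a stable key (such as a user ID) rather than drawing a random number keeps each user on one model version for the whole rollout, which makes before/after comparisons cleaner.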

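6. Hands-on Lab 1: A Complete MLflow Experiment Management Workflow

This lab walks through MLflow's core workflow: creating an experiment, training several models, logging parameters and metrics, comparing results, and selecting and registering the best model.

Open Google Colab (a CPU runtime is enough), create a new Notebook, and paste in the following code cells in order:

6.1 Environment setup and data preparation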

!pip install mlflow scikit-learn matplotlib -q

import mlflow
import mlflow.sklearn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, confusion_matrix, ConfusionMatrixDisplay
)
import warnings
warnings.filterwarnings('ignore')

# ★ Load the Wine dataset (multi-class problem: 3 classes, 13 features) ★
wine = load_wine()
X, y = wine.data, wine.target
feature_names = wine.feature_names
target_names = wine.target_names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training set: {X_train.shape[0]} rows, test set: {X_test.shape[0]} rows")
print(f"Features: {X_train.shape[1]}, classes: {len(target_names)}")
print(f"Class distribution (train): {np.bincount(y_train)}")
print(f"Class distribution (test):  {np.bincount(y_test)}")

6.2 Define the experiment tracking function

def train_and_log(model, model_name, params, X_tr, X_te, y_tr, y_te):
    """Train a model and log its parameters, metrics, and artifacts to MLflow."""
    with mlflow.start_run(run_name=model_name):
        # ★ Log hyperparameters and descriptive tags ★
        mlflow.log_params(params)
        mlflow.set_tag("model_type", model_name)
        mlflow.set_tag("dataset", "wine")
        mlflow.set_tag("scaler", "StandardScaler")

        # Train, then predict on the held-out test set
        model.fit(X_tr, y_tr)
        y_pred = model.predict(X_te)

        # ★ Log several evaluation metrics ★
        metrics = {
            "accuracy": accuracy_score(y_te, y_pred),
            "precision_macro": precision_score(y_te, y_pred, average='macro'),
            "recall_macro": recall_score(y_te, y_pred, average='macro'),
            "f1_macro": f1_score(y_te, y_pred, average='macro'),
        }

        # Cross-validation on the training set (a more robust estimate);
        # cross_val_score clones the model, so the fitted one is untouched
        cv_scores = cross_val_score(model, X_tr, y_tr, cv=5, scoring='accuracy')
        metrics["cv_mean_accuracy"] = cv_scores.mean()
        metrics["cv_std_accuracy"] = cv_scores.std()

        mlflow.log_metrics(metrics)

        # ★ Log an artifact: the confusion matrix plot ★
        fig, ax = plt.subplots(figsize=(6, 5))
        cm = confusion_matrix(y_te, y_pred)
        disp = ConfusionMatrixDisplay(cm, display_labels=target_names)
        disp.plot(ax=ax, cmap='Blues')
        ax.set_title(f"{model_name} - Confusion Matrix")
        plt.tight_layout()
        fig.savefig("confusion_matrix.png", dpi=100)
        mlflow.log_artifact("confusion_matrix.png")
        plt.close(fig)

        # ★ Log the model itself ★
        mlflow.sklearn.log_model(model, "model")

        print(f"  {model_name}: accuracy={metrics['accuracy']:.4f}, "
              f"f1={metrics['f1_macro']:.4f}, "
              f"cv={metrics['cv_mean_accuracy']:.4f}+/-{metrics['cv_std_accuracy']:.4f}")

        return metrics

print("✓ Training and logging function defined")

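6.3 Set up the MLflow experiment and train several models

# ★ Create the MLflow experiment ★
experiment_name = "wine_classification_benchmark"
mlflow.set_experiment(experiment_name)

print("=" * 65)
print("  MLflow experiment management: Wine classifier comparison")
print("=" * 65)

# ★ Define the models and hyperparameter combinations ★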
experiments = [
    {
        "name": "LogisticRegression_C0.1",
        "model": LogisticRegression(C=0.1, max_iter=1000, random_state=42),
        "params": {"algorithm": "LogisticRegression", "C": 0.1, "max_iter": 1000}
    },
    {
        "name": "LogisticRegression_C1.0",
        "model": LogisticRegression(C=1.0, max_iter=1000, random_state=42),
        "params": {"algorithm": "LogisticRegression", "C": 1.0, "max_iter": 1000}
    },
    {
        "name": "LogisticRegression_C10.0",
        "model": LogisticRegression(C=10.0, max_iter=1000, random_state=42),
        "params": {"algorithm": "LogisticRegression", "C": 10.0, "max_iter": 1000}
    },
    {
        "name": "RandomForest_100trees",
        "model": RandomForestClassifier(n_estimators=100, max_depth=None, random_state=42),
        "params": {"algorithm": "RandomForest", "n_estimators": 100, "max_depth": "None"}
    },
    {
        "name": "RandomForest_200trees_depth5",
        "model": RandomForestClassifier(n_estimators=200, max_depth=5, random_state=42),
        "params": {"algorithm": "RandomForest", "n_estimators": 200, "max_depth": 5}
    },
    {
        "name": "GradientBoosting_100",
        "model": GradientBoostingClassifier(
            n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
        ),
        "params": {"algorithm": "GradientBoosting", "n_estimators": 100,
                   "learning_rate": 0.1, "max_depth": 3}
    },
    {
        "name": "GradientBoosting_200_slow",
        "model": GradientBoostingClassifier(
            n_estimators=200, learning_rate=0.05, max_depth=4, random_state=42
        ),
        "params": {"algorithm": "GradientBoosting", "n_estimators": 200,
                   "learning_rate": 0.05, "max_depth": 4}
    },
    {
        "name": "SVM_rbf",
        "model": SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42),
        "params": {"algorithm": "SVM", "kernel": "rbf", "C": 1.0, "gamma": "scale"}
    },
]

# ★ Train each configuration in turn and log it to MLflow ★
all_results = {}
for exp in experiments:
    result = train_and_log(
        exp["model"], exp["name"], exp["params"],
        X_train_scaled, X_test_scaled, y_train, y_test
    )
    all_results[exp["name"]] = result

print(f"\n✓ Completed {len(experiments)} experiments; all logged to MLflow")

6.4 Query and compare experiment results

# ★ Query experiment results through the MLflow API ★
from mlflow.tracking import MlflowClient

client = MlflowClient()
experiment = client.get_experiment_by_name(experiment_name)
runs = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=["metrics.f1_macro DESC"]
)

print("=" * 75)
print("  Experiment ranking (sorted by F1-macro)")
print("=" * 75)
print(f"{'Rank':<5}{'Model':<35}{'Accuracy':<12}{'F1-macro':<12}{'CV Mean':<12}")
print("-" * 75)

for i, run in enumerate(runs):
    m = run.data.metrics
    print(f"  {i+1:<3} {run.info.run_name:<35}"
          f"{m['accuracy']:<12.4f}{m['f1_macro']:<12.4f}"
          f"{m['cv_mean_accuracy']:<12.4f}")

# ★ Best model (runs are already sorted by F1-macro, descending) ★
best_run = runs[0]
print(f"\n★ Best model: {best_run.info.run_name}")
print(f"  Run ID: {best_run.info.run_id}")
print(f"  F1-macro: {best_run.data.metrics['f1_macro']:.4f}")
print(f"  CV Accuracy: {best_run.data.metrics['cv_mean_accuracy']:.4f}"
      f" +/- {best_run.data.metrics['cv_std_accuracy']:.4f}")

# ★ Visual comparison ★
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

names = [r.info.run_name.replace("_", "\n") for r in runs]
accs = [r.data.metrics['accuracy'] for r in runs]
f1s = [r.data.metrics['f1_macro'] for r in runs]

colors = ['#b8922e' if i == 0 else '#0077b6' for i in range(len(runs))]

axes[0].barh(names, accs, color=colors)
axes[0].set_xlabel('Accuracy')
axes[0].set_title('Model accuracy comparison')
axes[0].set_xlim(0.85, 1.01)

axes[1].barh(names, f1s, color=colors)
axes[1].set_xlabel('F1-macro')
axes[1].set_title('Model F1-macro comparison (gold = best)')
axes[1].set_xlim(0.85, 1.01)

plt.tight_layout()
plt.savefig("model_comparison.png", dpi=120, bbox_inches='tight')
plt.show()
print("\n✓ Comparison chart saved")

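6.5 Register the best model in the Model Registry

# ★ Register the best model in the MLflow Model Registry ★
model_name_registry = "wine_classifier_production"

model_uri = f"runs:/{best_run.info.run_id}/model"
registered = mlflow.register_model(model_uri, model_name_registry)

print("\n★ Model registered in the Model Registry")
print(f"  Model name: {registered.name}")
print(f"  Version: {registered.version}")
print(f"  Source run: {best_run.info.run_name}")

# Update the registered model's description
client.update_registered_model(
    name=model_name_registry,
    description="Best Wine classifier, selected by the MLflow experiment workflow"
)

# ★ Load the registered version from the Registry and run inference ★
# (models:/<name>/<version> resolves through the Registry rather than the run)
registry_uri = f"models:/{model_name_registry}/{registered.version}"
loaded_model = mlflow.sklearn.load_model(registry_uri)
sample = X_test_scaled[:5]
predictions = loaded_model.predict(sample)

print("\n★ Inference check (first 5 test rows):")
for i in range(5):
    actual = target_names[y_test[i]]
    predicted = target_names[predictions[i]]
    status = "correct" if y_test[i] == predictions[i] else "wrong"
    print(f"  [{status}] actual: {actual:<12} predicted: {predicted}")

print("\n✓ Lab 1 complete! You have learned how to:")
print("  1. Create an MLflow experiment and track several models")
print("  2. Log hyperparameters, evaluation metrics, and artifacts")
print("  3. Query and compare experiment results through the API")
print("  4. Register the best model in the Model Registry")
print("  5. Load a model from the Registry and run inference")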

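7. Hands-on Lab 2: Packaging a Model and Serving It as an API

In Lab 1 we used MLflow to manage the experiment workflow and pick the best model. In this lab we use BentoML to package a model as a REST API that can serve external requests, the key step from "experiment" to "product".

Open Google Colab (a CPU runtime is enough), create a new Notebook, and paste in the following code cells in order:

7.1 Environment setup and model training

# Note: service.py in 7.3 uses the BentoML 1.0/1.1 Runner API (bentoml.Service, bentoml.io).
# BentoML 1.2+ replaces it with @bentoml.service; if the latest release rejects the code,
# pin the version instead: pip install "bentoml>=1.0,<1.2"
!pip install bentoml scikit-learn numpy requests -q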

import bentoml
import numpy as np
import json
import time
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report

# ★ Train the model that will be served ★
wine = load_wine()
X, y = wine.data, wine.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.05, max_depth=4, random_state=42
)
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)
acc = accuracy_score(y_test, y_pred)
print(f"Model trained. Test-set accuracy: {acc:.4f}")
print("\nClassification report:")
print(classification_report(y_test, y_pred, target_names=wine.target_names))

7.2 Save the model to the BentoML Model Store

# ★ Save the model together with its preprocessor in BentoML ★
saved_model = bentoml.sklearn.save_model(
    "wine_classifier",
    model,
    signatures={"predict": {"batchable": True}},
    labels={"task": "classification", "dataset": "wine", "framework": "sklearn"},
    metadata={
        "accuracy": float(acc),
        "n_features": X_train.shape[1],
        "n_classes": len(wine.target_names),
        "feature_names": list(wine.feature_names),
        "target_names": list(wine.target_names),
    },
    custom_objects={
        "scaler": scaler  # store the fitted preprocessor alongside the model
    }
)

print("✓ Model saved to the BentoML Model Store")
print(f"  Model tag: {saved_model.tag}")
print(f"  Storage path: {saved_model.path}")

# ★ List all saved models ★
print("\nSaved models:")
for m in bentoml.models.list():
    print(f"  - {m.tag} (created: {m.info.creation_time})")

7.3 Define the BentoML Service

# ★ Write the BentoML Service definition file ★
service_code = '''
import numpy as np
import bentoml
from bentoml.io import NumpyNdarray, JSON

# Load the model and the preprocessor
model_runner = bentoml.sklearn.get("wine_classifier:latest").to_runner()
model_ref = bentoml.models.get("wine_classifier:latest")
scaler = model_ref.custom_objects["scaler"]
metadata = model_ref.info.metadata

svc = bentoml.Service("wine_classifier_service", runners=[model_runner])

@svc.api(input=NumpyNdarray(), output=JSON())
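async def predict(input_array: np.ndarray) -> dict:
    """Accept raw features and return the predicted classes."""
    # Preprocess: promote a single sample to a 2-D batch, then scale
    scaled = scaler.transform(input_array.reshape(1, -1) if input_array.ndim == 1 else input_array)
    # Predict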
    predictions = await model_runner.predict.async_run(scaled)
    target_names = metadata["target_names"]
    results = []
    for pred in predictions:
        results.append({
            "class_id": int(pred),
            "class_name": target_names[int(pred)],
        })
    return {"predictions": results, "model": str(model_ref.tag)}

@svc.api(input=JSON(), output=JSON())
async def predict_json(input_data: dict) -> dict:
    """Accept feature data in JSON format."""
    features = np.array(input_data["features"])
    scaled = scaler.transform(features.reshape(1, -1) if features.ndim == 1 else features)
    predictions = await model_runner.predict.async_run(scaled)
    target_names = metadata["target_names"]
    results = []
    for pred in predictions:
        results.append({
            "class_id": int(pred),
            "class_name": target_names[int(pred)],
        })
    return {
        "predictions": results,
        "model": str(model_ref.tag),
        "feature_names": metadata["feature_names"]
    }

@svc.api(input=JSON(), output=JSON())
async def model_info(input_data: dict) -> dict:
    """Return model metadata."""
    return {
        "model_tag": str(model_ref.tag),
        "accuracy": metadata["accuracy"],
        "n_features": metadata["n_features"],
        "n_classes": metadata["n_classes"],
        "feature_names": metadata["feature_names"],
        "target_names": metadata["target_names"],
    }
'''

with open("service.py", "w") as f:
    f.write(service_code)

print("✓ BentoML Service definition file (service.py) created")
print("  It exposes 3 API endpoints:")
print("  - /predict        : NumPy array input")
print("  - /predict_json   : JSON input")
print("  - /model_info     : model metadata lookup")

7.4 Simulated API inference test

# ★ Test the inference logic directly in Colab (without starting an HTTP server) ★
print("=" * 65)
print("  Model packaging and inference test")
print("=" * 65)

# Load the saved model, scaler, and metadata
model_ref = bentoml.models.get("wine_classifier:latest")
loaded_model = bentoml.sklearn.load_model("wine_classifier:latest")
loaded_scaler = model_ref.custom_objects["scaler"]
meta = model_ref.info.metadata

print("\nModel info:")
print(f"  Tag: {model_ref.tag}")
print(f"  Accuracy: {meta['accuracy']:.4f}")
print(f"  Features: {meta['n_features']}")
print(f"  Classes: {meta['target_names']}")

# ★ Simulated API request: single prediction ★
print("\n--- Single-sample inference ---")
sample = X_test[0]
sample_scaled = loaded_scaler.transform(sample.reshape(1, -1))
pred = loaded_model.predict(sample_scaled)
print(f"  Input features (first 5): {sample[:5].round(2)}")
print(f"  Predicted class ID: {pred[0]}")
print(f"  Predicted class name: {wine.target_names[pred[0]]}")
print(f"  Actual class name: {wine.target_names[y_test[0]]}")

# ★ Simulated API request: batch prediction ★
print("\n--- Batch inference (10 samples) ---")
batch = X_test[:10]
batch_scaled = loaded_scaler.transform(batch)
batch_preds = loaded_model.predict(batch_scaled)

correct = int((batch_preds == y_test[:10]).sum())
print(f"  Batch result: {correct}/10 correct")
for i in range(10):
    actual = wine.target_names[y_test[i]]
    predicted = wine.target_names[batch_preds[i]]
    status = "OK" if y_test[i] == batch_preds[i] else "NG"
    print(f"  [{status}] #{i+1} actual: {actual:<12} predicted: {predicted}")

# ★ Inference latency benchmark ★
print("\n--- Inference latency benchmark ---")
single_sample = loaded_scaler.transform(X_test[0].reshape(1, -1))
batch_all = loaded_scaler.transform(X_test)  # the whole test set (36 rows with this split)

# Single-sample latency (perf_counter is a monotonic, high-resolution clock)
times_single = []
for _ in range(100):
    t0 = time.perf_counter()
    _ = loaded_model.predict(single_sample)
    times_single.append((time.perf_counter() - t0) * 1000)

# Batch latency
times_batch = []
for _ in range(100):
    t0 = time.perf_counter()
    _ = loaded_model.predict(batch_all)
    times_batch.append((time.perf_counter() - t0) * 1000)

print(f"  Single sample: {np.mean(times_single):.3f}ms (p99: {np.percentile(times_single, 99):.3f}ms)")
print(f"  Batch of {len(batch_all)}: {np.mean(times_batch):.3f}ms (p99: {np.percentile(times_batch, 99):.3f}ms)")
print(f"  Batch speedup: {np.mean(times_single) * len(batch_all) / np.mean(times_batch):.1f}x")

7.5 Build the Bento and inspect the packaging layout

# ★ Write bentofile.yaml (the BentoML build configuration) ★
bentofile_content = '''
service: "service:svc"
labels:
  owner: meta-intelligence
  project: wine-classifier
  stage: production
include:
  - "*.py"
python:
  packages:
    - scikit-learn
    - numpy
'''

with open("bentofile.yaml", "w") as f:
    f.write(bentofile_content)

print("✓ bentofile.yaml created")
print("\nBuild configuration:")
print(bentofile_content)

# ★ Show the full deployment flow ★
print("=" * 65)
print("  Production deployment flow (command-line guide)")
print("=" * 65)
print("""
In a local development environment, the following commands complete the deployment:

# 1. Start a local development server (for testing)
$ bentoml serve service:svc --reload

# 2. Build the Bento
$ bentoml build

# 3. Containerize (build a Docker image)
$ bentoml containerize wine_classifier_service:latest

# 4. Run the container
$ docker run -p 3000:3000 wine_classifier_service:latest

# 5. Test the API
$ curl -X POST http://localhost:3000/predict_json \\
    -H "Content-Type: application/json" \\
    -d '{"features": [13.0, 1.5, 2.3, 15.0, 100, 2.8, 3.0, 0.28, 2.29, 5.64, 1.04, 3.92, 1065]}'
""")

print("\n✓ Lab 2 complete! You have learned how to:")
print("  1. Save a model and its preprocessor to the BentoML Model Store")
print("  2. Define a multi-endpoint API Service")
print("  3. Test model inference (single and batch)")
print("  4. Write the build configuration and deployment flow")
print("  5. Follow the full path from development to containerized deployment")

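8. CI/CD for ML: Automated Testing and Continuous Deployment

CI/CD for conventional software is mature, but continuous integration and delivery for ML systems brings its own challenges. The CD4ML framework of Sato et al.^[12] makes an important point: an ML system has three axes of change that need version control, namely code, model, and data. A change along any one axis may require re-validation and redeployment.

8.1 ML-specific testing strategy

The ML Test Score proposed by Breck et al.^[7] defines a comprehensive rubric for testing ML systems, covering four categories:

Data tests:

Whether the statistical distribution of each feature stays within expected bounds (minimum, maximum, mean, missing rate)
Whether the schema of training data matches that of serving data
Whether there is data leakage
Whether correlations between features are stable

Model tests:

Whether performance on the benchmark test set exceeds a minimum threshold
Whether the new model beats the current production model (regression test)
Whether performance is fair across subgroups (fairness testing)
How robust the model is to adversarial examples

Infrastructure tests:

Whether training is reproducible
Whether model serialization and deserialization round-trip correctly
Whether API endpoint response times are within the SLA
Whether resource usage (memory, CPU, GPU) is within budget

Monitoring tests:

Whether logging is adequate
Whether alert thresholds are set sensibly
Whether there is an automatic rollback mechanism when model performance degrades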
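Several of the model and infrastructure checks above can be collapsed into a single promotion gate that a CI job evaluates before deployment. The sketch below is one possible shape for it; the metric names, the thresholds, and the `GateConfig` and `evaluate_gate` helpers are illustrative assumptions, not part of the ML Test Score rubric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateConfig:
    min_f1: float = 0.90              # absolute floor on the benchmark set
    max_regression: float = 0.01      # allowed F1 drop versus the production model
    max_p99_latency_ms: float = 200.0
    max_subgroup_gap: float = 0.05    # max F1 gap between any two subgroups


def evaluate_gate(candidate: dict, production: dict, cfg: GateConfig = GateConfig()) -> list:
    """Return a list of failure messages; an empty list means the candidate may ship."""
    failures = []
    if candidate["f1"] < cfg.min_f1:
        failures.append(f"f1 {candidate['f1']:.3f} below floor {cfg.min_f1:.3f}")
    if candidate["f1"] < production["f1"] - cfg.max_regression:
        failures.append("candidate regresses against the production model")
    if candidate["p99_latency_ms"] > cfg.max_p99_latency_ms:
        failures.append("p99 latency exceeds the SLA budget")
    subgroup_scores = list(candidate["subgroup_f1"].values())
    if max(subgroup_scores) - min(subgroup_scores) > cfg.max_subgroup_gap:
        failures.append("subgroup performance gap too large")
    return failures


# Hypothetical metrics, as an evaluation job might emit them.
production = {"f1": 0.93}
good = {"f1": 0.94, "p99_latency_ms": 120.0, "subgroup_f1": {"a": 0.95, "b": 0.93}}
bad = {"f1": 0.88, "p99_latency_ms": 350.0, "subgroup_f1": {"a": 0.95, "b": 0.80}}

good_failures = evaluate_gate(good, production)
bad_failures = evaluate_gate(bad, production)
print(good_failures)
print(bad_failures)
```

In a CI job, a non-empty list would fail the build, for example by exiting with a non-zero status so the deployment stage never runs.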

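8.2 An ML CI/CD pipeline on GitHub Actions

The outline below shows a typical ML CI/CD pipeline structure implemented with GitHub Actions:

# .github/workflows/ml-pipeline.yml (structural outline)
#
# Triggers: push to main / PR / schedule (daily retraining)
#
# Stage 1: Data validation
#   - Check data schema consistency
#   - Validate feature distributions (Great Expectations / Pandera)
#   - Detect data drift
#
# Stage 2: Model training
#   - Pull the latest training data from DVC
#   - Run the training pipeline
#   - Log the experiment with MLflow
#
# Stage 3: Model validation
#   - Benchmark test-set performance >= threshold
#   - New model >= current production model
#   - Fairness checks pass
#   - Latency benchmark passes
#
# Stage 4: Model deployment
#   - Package with BentoML
#   - Containerize with Docker
#   - Canary deployment (5% of traffic)
#   - Monitor for 30 minutes
#   - Switch over all traffic

The survey of the MLOps tool ecosystem by Lu et al.^[14] notes that fragmentation of the toolchain is one of the biggest obstacles to enterprise adoption. Integrating different tools often takes a lot of "glue code", and that code itself becomes a new source of technical debt.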

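9. Model Monitoring: Detecting Data Drift and Model Drift

Going live is not the finish line but another starting point. The monitoring survey by Klaise et al.^[13] systematically classifies the degradation risks facing production ML models; the two most important are Data Drift and Model Drift.

9.1 Data Drift: Changes in the input data distribution

Data Drift means the distribution of input data in production has shifted significantly from that of the training data. It is one of the most common causes of model degradation.

Common causes:

Seasonality: e-commerce purchasing behavior differs sharply between holidays and ordinary days
Upstream system changes: a data provider modifies its ETL logic or field definitions
Evolving user behavior: COVID-19 reshaped usage patterns in many industries
Feature computation errors: a code bug produces abnormal feature values

Detection methods:

Statistical tests: KS test (Kolmogorov-Smirnov), chi-square test
Distribution distances: PSI (Population Stability Index), KL divergence, Wasserstein distance, Jensen-Shannon divergence
Visualization: time-series comparison plots of feature distributions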
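Of the methods listed, PSI is simple enough to compute by hand. The sketch below bins a reference sample and a production sample on an equal-width grid derived from the reference sample and applies the usual formula PSI = sum_i (a_i - e_i) * ln(a_i / e_i), where e_i and a_i are the expected (reference) and actual (production) bin proportions. The bin count, the epsilon used to avoid log(0), and the function name are assumptions for illustration; monitoring libraries ship their own implementations, which may bin differently.

```python
import math
import random


def psi(expected, actual, bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a reference and a production sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0   # guard against a constant reference sample

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range production values into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


rng = random.Random(42)
reference = [rng.gauss(0.0, 1.0) for _ in range(5000)]   # stands in for training data
same_dist = [rng.gauss(0.0, 1.0) for _ in range(5000)]   # production, no drift
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]     # production, mean shifted by 1

psi_same = psi(reference, same_dist)
psi_shifted = psi(reference, shifted)
print(f"same distribution: {psi_same:.4f}")
print(f"mean shifted by 1: {psi_shifted:.4f}")
```

A common rule of thumb treats PSI above 0.2 as a significant shift, which is the alert level used for the Data Drift layer in the monitoring table in 9.3.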

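9.2 Model Drift (Concept Drift): Changes in the input-output relationship

Even when the input distribution is unchanged, the relationship between inputs and the target can shift. For example, before the pandemic, users who searched for "face masks" were mostly connected to healthcare; afterwards they were the general public. The input features are the same, but what the label means has changed.

Detection strategies:

Direct monitoring: track the model's predictive performance on production data (requires labels, which usually arrive with a delay)
Indirect monitoring: track changes in the prediction distribution (needs no labels, but is less sensitive)
Window comparison: compare model performance within a sliding window against a historical baseline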
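The window-comparison strategy can be sketched with a fixed-size buffer of recent prediction outcomes. The class name `WindowedAccuracyMonitor`, the window size, and the tolerance below are hypothetical choices; the sketch also assumes that ground-truth labels eventually arrive, which is the delayed-label caveat noted above.

```python
from collections import deque


class WindowedAccuracyMonitor:
    """Compare accuracy over the most recent labelled predictions to a fixed baseline."""

    def __init__(self, baseline_accuracy: float, window: int = 200, tolerance: float = 0.03):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = wrong; old entries fall off

    def record(self, predicted, actual) -> None:
        self.outcomes.append(1 if predicted == actual else 0)

    def window_accuracy(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else None

    def drifted(self) -> bool:
        """Alert only once the window is full, to avoid noisy early alarms."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.window_accuracy() < self.baseline - self.tolerance


monitor = WindowedAccuracyMonitor(baseline_accuracy=0.95, window=100, tolerance=0.03)

# Phase 1: the model is right 96 times out of 100.
for i in range(100):
    monitor.record(predicted=1, actual=1 if i % 25 else 0)
healthy_alert = monitor.drifted()

# Phase 2: the input-label relationship changes; the model is right 4 times in 5.
for i in range(100):
    monitor.record(predicted=1, actual=1 if i % 5 else 0)
degraded_alert = monitor.drifted()

print(healthy_alert, degraded_alert, monitor.window_accuracy())
```

The window size trades detection speed against noise: a short window reacts quickly but alarms on random fluctuation, while a long one is stable but slow to notice a real change.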

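9.3 Monitoring system architecture

A complete ML monitoring system should include the following layers:

Monitoring layer	Metrics	Tools	Example alert threshold
Infrastructure	CPU, memory, latency, throughput	Prometheus + Grafana	P99 latency > 200 ms
Data quality	Missing rate, outliers, schema deviations	Great Expectations	Missing rate > 5%
Data Drift	PSI, KS statistic	Evidently / NannyML	PSI > 0.2
Model performance	Accuracy, F1, AUC	MLflow + custom	3% below baseline
Business metrics	Conversion rate, revenue impact	Custom dashboard	Defined by the business

In their IEEE Access study^[15], Testi et al. propose a taxonomy and methodology for MLOps, stressing that monitoring should not passively wait for problems but should actively anticipate when a model needs retraining. They suggest a "model health score" that combines metrics from several dimensions to assess a model's current state.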

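10. Decision Framework: A Guide to Choosing an MLOps Toolchain

Faced with the number of MLOps tools available, organizations often struggle to choose. The recommendations below are organized by team size and maturity:

10.1 Early stage (ML team of 1-3)

Function	Recommended tool	Alternative	Rationale
Experiment tracking	MLflow (local mode)	Weights & Biases	Free, lightweight, no infrastructure needed
Version control	Git + DVC	Git LFS	Unified versioning of data and code
Model deployment	BentoML	FastAPI + Docker	Best practices built in, less glue code
Monitoring	Evidently (report mode)	Manual scripts	Open source, easy to pick up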

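10.2 Growth stage (ML team of 4-10)

Function	Recommended tool	Alternative	Rationale
Experiment tracking	MLflow (server mode)	Neptune.ai	Shared across the team, centrally managed
Pipeline orchestration	Prefect / Airflow	Kubeflow Pipelines	Scheduling, retries, dependency management
Feature Store	Feast	Hopsworks	Avoids recomputing features
Model deployment	BentoML + K8s	Seldon Core	Containerization + autoscaling
Monitoring	Evidently + Grafana	NannyML	Real-time drift detection + visualization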

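10.3 Mature stage (ML team of 10+ / many models in production)

Function	Recommended tool	Alternative	Rationale
End-to-end platform	Kubeflow	AWS SageMaker	Unified management of the full lifecycle
Feature Store	Tecton / Feast on K8s	SageMaker Feature Store	Enterprise-grade feature management
Model serving	KServe	Triton Inference Server	Multi-framework support, GPU inference
Monitoring	Evidently + PagerDuty	Fiddler AI	Automated alerting + incident management
Governance	MLflow + custom	Weights & Biases	Model auditing, compliance tracking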

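10.4 Three core principles for tool selection

Whatever the team size, tool selection should follow these principles:

Start small and scale gradually: do not begin by deploying a full Kubeflow cluster. Manage experiments with MLflow in local mode first, and upgrade the infrastructure as the team and the number of models grow. Premature architectural investment is a common reason MLOps adoption fails.
Remove the biggest pain point first: if the team's main problem is "we can't find last time's experiment results", introduce experiment tracking first. If it is "shipping a model takes two weeks", build automated deployment first. Do not try to solve everything at once.
Prefer open standards to closed platforms: MLflow's model format, ONNX as a model interchange format, and the OCI container standard keep you from being locked into a single platform and preserve room to migrate later.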

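11. Conclusion: MLOps Is Not a Tooling Problem but a Cultural One

Return to the number from the start of this article: 87% of ML projects never ship. The reason should now be clear. It is not that the models are not good enough; it is that "training a high-accuracy model" was treated as the finish line, and the wide gap between experiment and production was ignored.

The core value of MLOps does not lie in any single tool, whether MLflow, DVC, or BentoML. It lies in the cultural shift the tools represent: from "one-off model development" to "continuously iterated ML systems engineering".

The idea of "hidden technical debt" that Sculley et al. introduced in their foundational paper^[1] still applies. Every untracked experiment, every manually deployed model, and every unmonitored live service adds to that debt. The debt does not go away on its own; it surfaces as model degradation, failed deployments, and painful debugging.

For organizations considering MLOps, our recommendations are:

Start logging your experiments with MLflow today. It is the lowest-cost, highest-return first step. As Lab 1 shows, a few lines of code are enough to change how experiments are managed.
Establish a standard process for model deployment. Lab 2 shows how BentoML turns a model from a pickle file into a service that can be tested, containerized, and scaled.
Build monitoring from day one. Once a model is live, Data Drift and Model Drift are a matter of when, not if. The earlier detection is in place, the easier it is to avoid the disaster of a model that silently failed for three months before anyone noticed.
Invest in team culture, not just tools. MLOps succeeds through close collaboration among data scientists, ML engineers, and DevOps teams. Tools can support collaboration but cannot replace communication.

Machine learning is moving from a research-driven phase to an engineering-driven one. Organizations that build mature MLOps practices will hold a decisive advantage in putting AI into production: not because their models are better, but because they deliver model value to production faster, more reliably, and more sustainably.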