- MoE architectures (such as DeepSeek-V3's 671B model) activate only 37B parameters per inference — using a "brain partitioning" strategy to achieve economical inference for ultra-large-scale models
- Speculative Decoding lets a small model "draft" for the large model, achieving 2-3x acceleration with mathematically identical output — one of the few acceleration techniques that guarantee zero accuracy loss
- Token Merging halves ViT's token count with less than 0.2% accuracy loss; applied to Stable Diffusion it achieves 2x acceleration with virtually indistinguishable visual quality
- DeepCache skips redundant U-Net computation in diffusion models, achieving 2.3x training-free acceleration — combined with Token Merging it can exceed 4x
1. AI's "One-Size-Fits-All" Compute Dilemma: You're Paying the Same Compute Bill for Every Token
Pruning, distillation, quantization — these three major model compression techniques share a common trait: they permanently alter the model's size or precision before deployment. Once compression is complete, the model invests the same computation for every input. But reality tells us: not all inputs are equally difficult.
Imagine a translation system processing a contract: "The" requires almost no thought, but "indemnification" may need the model to mobilize all its parameters for an accurate translation. Traditional models invest exactly the same compute for both words — this is a systematic waste. Harvard Business Review notes[1] that global AI infrastructure energy consumption is growing at an alarming rate, with a significant portion of compute being spent on "unnecessary calculations."
MIT Sloan Management Review research[2] further points out that efficient AI deployment often yields higher business returns than pursuing the largest model. Dynamic Computation is the technical paradigm that addresses this problem: letting models automatically adjust compute investment based on input difficulty. Simple inputs pass through quickly; only complex inputs mobilize full resources.
Unlike static compression, dynamic computation does not alter the model's size or precision — the model retains its full capabilities but only uses them when needed. This makes it the fourth pillar of model efficiency, fully orthogonal to and stackable with pruning, distillation, and quantization.
2. Technical Evolution: From Adaptive Computation to Conditional Inference
2.1 Adaptive Computation Time: Letting the Network Decide "How Long to Think"
The theoretical roots of dynamic computation trace back to 2016. Alex Graves proposed Adaptive Computation Time (ACT)[3]: letting Recurrent Neural Networks (RNNs) learn how many computation steps each input requires. The network uses a "halting probability" to decide when to stop computing — simple inputs need only one step, while complex inputs can "think for a few more steps."
ACT's core insight was revolutionary: the amount of computation can itself be a learnable output of the model. This concept laid the foundation for all subsequent dynamic computation techniques — from Early Exit to Mixture of Experts, all are different engineering realizations of this idea.
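The halting mechanism can be sketched in a few lines of Python (an illustrative simplification — Graves' full formulation also adds a "ponder cost" term to the training loss; `act_steps` and `epsilon` are names chosen for this sketch, not from the paper):

```python
# ACT-style halting sketch: the network emits a halting probability at each
# computation step; computation stops once the accumulated probability
# crosses 1 - epsilon.
def act_steps(halting_probs, epsilon=0.01):
    """Return how many computation steps run before halting."""
    total = 0.0
    for n, p in enumerate(halting_probs, start=1):
        total += p
        if total >= 1.0 - epsilon:
            return n
    return len(halting_probs)  # budget exhausted without halting

# A confident ("easy") input halts after one step; a hesitant one thinks longer.
print(act_steps([0.99]))                # -> 1
print(act_steps([0.1, 0.2, 0.3, 0.5]))  # -> 4
```

The key point is that `halting_probs` is itself produced by the network, so the amount of computation becomes a learned quantity.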
2.2 Early Exit: Letting Simple Tokens Graduate Early
Modern Transformer architecture models typically have dozens of layers. But does every input need to pass through all of them? Google's CALM (Confident Adaptive Language Modeling)[4] published at NeurIPS 2022 proved the answer is no.
CALM's mechanism is intuitive and elegant: attach a lightweight "confidence classifier" to the output of each layer. When a token's prediction confidence exceeds a threshold, it can exit early — no longer passing through the remaining layers. CALM demonstrated impressive results on T5 models: average compute reduced to one-third while output quality remained virtually unchanged.
Microsoft's SkipDecode[5] further addressed a pain point of Early Exit in real-world deployment — batch inference efficiency. In a batch, different sequences require different computation depths. SkipDecode uses token-level layer-skipping strategies and adaptive KV cache, allowing different sequences to exit at different layers while maintaining hardware efficiency for batch inference.
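The per-layer confidence check behind this family of techniques can be sketched as follows (a toy, framework-free illustration — `forward_with_early_exit` and the max-softmax confidence proxy are this sketch's own constructions, not CALM's actual code):

```python
import math

def softmax_max(logits):
    """Max probability after softmax — a cheap confidence proxy."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return max(exps) / sum(exps)

# Each "layer" is represented by the logits its lightweight exit head produces;
# once a head's confidence clears the threshold, all remaining layers are skipped.
def forward_with_early_exit(per_layer_logits, threshold=0.9):
    """Return (logits, number of layers actually used)."""
    for depth, logits in enumerate(per_layer_logits, start=1):
        if softmax_max(logits) >= threshold:
            return logits, depth
    return logits, len(per_layer_logits)

# An "easy" token is confident by layer 2 of 4; a "hard" token runs all 4 layers.
easy = [[0.1, 0.2, 0.1], [5.0, 0.1, 0.1], [6.0, 0.1, 0.1], [7.0, 0.1, 0.1]]
hard = [[0.1, 0.2, 0.1]] * 4
print(forward_with_early_exit(easy, 0.9)[1])  # -> 2
print(forward_with_early_exit(hard, 0.9)[1])  # -> 4
```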
2.3 Mixture of Experts: Only Activate the Experts You Need
If Early Exit is "vertical" dynamic computation (choosing how many layers to pass through), then Mixture of Experts (MoE) is "horizontal" dynamic computation — activating only a subset of parameters within each layer.
Switch Transformer[6] (JMLR 2022) was a milestone for MoE architecture. Google's William Fedus et al. replaced each FFN layer with multiple "expert" sub-networks, routing each token to only one expert (top-1 routing). This seemingly simple design scaled models to 1.6 trillion parameters — but each inference uses only a small fraction of them.
Mistral AI's Mixtral 8x7B[7] brought MoE to practical deployment. Selecting 2 of 8 experts per token (top-2 routing), with 46.7B total parameters but only 12.9B activated per token — achieving comparable performance to LLaMA-2 70B at less than one-fifth the compute. Mixtral proved that MoE is not just a research concept but a directly deployable efficient architecture design.
DeepSeek-V3[8] represents the current pinnacle of MoE architecture. Among its 671B total parameters are 256 routed experts and 1 shared expert, with top-8 selection per token, actually activating about 37B parameters. Combined with Multi-head Latent Attention (MLA) for KV cache compression, DeepSeek-V3 competes with GPT-4o and Claude 3.5 Sonnet on multiple benchmarks — yet training cost was only approximately $5.57 million (2,048 H800 GPUs for about two months), less than one-tenth that of equivalent dense models.
| Model | Total Params | Active Params | Activation Ratio | Routing Strategy |
|---|---|---|---|---|
| Switch Transformer | 1.6T | ~several B | <1% | Top-1 |
| Mixtral 8x7B | 46.7B | 12.9B | 28% | Top-2 / 8 experts |
| DeepSeek-V3 | 671B | 37B | 5.5% | Top-8 / 256 experts |
2.4 Speculative Decoding: Using a Small Model to "Draft" for the Large Model
Autoregressive language models have a fundamental bottleneck: each token must wait for the previous one to be generated before it can start. This means generating N tokens requires N forward passes, which cannot be parallelized. Speculative decoding breaks this bottleneck with an ingenious "draft-verify" mechanism.
In 2022-2023, Google's Leviathan et al.[9] and DeepMind's Chen et al.[10] independently proposed the same core idea:
- Use a small draft model (e.g., OPT-125M) to quickly generate K candidate tokens
- Feed all K tokens into the large model at once for verification (the large model can process them in parallel)
- Use rejection sampling to decide whether to accept or reject each candidate token
The key breakthrough is the mathematical guarantee: the final output probability distribution is completely identical to using the large model alone. This is not "approximate" — it is mathematically strictly equivalent, which makes speculative decoding one of the few inference acceleration techniques with truly zero accuracy loss. In practice, it achieves 2-3x acceleration, performing best when the capability gap between the large and small models is moderate.
ICML 2024's Medusa[11] proposed an even more elegant approach: instead of requiring a separate draft model, it directly adds multiple decoding heads to the original model. Each head predicts future tokens at different positions, verifying them all at once through a tree attention mechanism. Medusa's advantage is simpler deployment (only one model needed), and the decoding heads require only minimal fine-tuning.
2.5 Token Merging: Merging Redundant Tokens
Vision Transformer (ViT) divides images into N patch tokens for processing. But many patches are highly redundant — adjacent patches in a blue sky carry nearly identical information. Token Merging (ToMe)[12] (ICLR 2023) takes the strategy of: merging the most similar tokens within each Transformer layer.
ToMe uses bipartite soft matching to find optimal merge pairs. This algorithm requires no new learnable parameters and no retraining — it works directly on pretrained ViTs. On ImageNet, ToMe boosted ViT-L/16 throughput by 2x with only 0.2% accuracy drop.
Even more exciting is ToMe's application to diffusion models[13]. Each denoising step in Stable Diffusion passes through a U-Net (containing attention layers), and ToMe can merge redundant tokens at each step. The tomesd[14] library lets you accelerate SD with a single line of code by 2x, with virtually indistinguishable image quality. Moreover, ToMe can be stacked with other acceleration techniques — the paper reported up to 5.4x acceleration.
2.6 Mixture-of-Depths: Letting Each Layer Decide Whether to Compute
If MoE selects "which expert to use," then Google DeepMind's Mixture-of-Depths (MoD)[15] is more radical — letting tokens choose "whether to pass through this layer at all."
MoD adds a lightweight router at each layer that determines whether to let a token "skip" the current layer based on its importance. Skipped tokens pass directly to the next layer via residual connections, requiring zero computation. The paper found a striking conclusion: even with only 12.5% computation capacity (processing only 1/8 of tokens per layer), the model could still match the full-capacity version's performance. This means that theoretically, MoD can accelerate inference by over 50% while maintaining comparable model quality.
MoD's significance lies in revealing the astonishing redundancy in Transformer computation — most tokens in most layers simply do not need to be processed.
2.7 Feature Caching: Skipping Redundant Denoising Computation
Diffusion model inference requires dozens of denoising iterations, each passing through the full U-Net. But between adjacent steps, the change in U-Net high-level features is actually minimal. DeepCache[16] (CVPR 2024) leverages this observation: cache high-level features and only update the lower-level features that change more.
Specifically: run the full U-Net only every N steps, with intermediate steps running only the lower-level branches while reading high-level features directly from cache. On Stable Diffusion 1.5, DeepCache achieved 2.3x acceleration with virtually unchanged CLIP Score and FID. More importantly:
- Completely training-free — no model weight modifications needed
- Compatible with other techniques — stackable with ToMe, quantization, etc.
- Highly versatile — supports SD 1.x, SD 2.x, SDXL, Stable Video Diffusion, and more
3. Real-World Applications for Text Generative AI
Dynamic computation's application in text generative AI has moved from research to production. Here are the three most impactful deployment scenarios:
Scenario 1: Large-Scale LLM Services — MoE Architecture
DeepSeek-V3's success proves an important thesis: the next generation of frontier models doesn't necessarily need more compute, but smarter compute allocation. A 671B parameter model sounds expensive, but each inference activates only 37B — putting its inference cost close to that of a 30-40B dense model. Mixtral 8x7B's commercial version (deployed via Mistral API or AWS Bedrock) has been widely adopted by enterprises, as it offers LLaMA-2 70B-level quality at over 6x the inference speed.
Scenario 2: Low-Latency Applications — Speculative Decoding
Conversational AI, real-time translation, code autocompletion, and similar applications are extremely latency-sensitive. Speculative decoding boosts large model response speed by 2-3x without sacrificing any quality. HuggingFace natively supports this in Transformers v4.29+[17], requiring only a single assistant_model parameter to enable. Google's Gemini and Anthropic's Claude also widely use similar techniques in their inference engines.
Scenario 3: Classification and Understanding — Early Exit
For understanding tasks like text classification, sentiment analysis, and entity recognition, most inputs don't need the model's full depth. CALM's early exit mechanism lets the model handle 80% of simple inputs in the first few layers, with only the most difficult 20% needing to pass through all layers. This is particularly well-suited for API services handling high volumes of requests — average latency can be reduced 2-3x.
4. Real-World Applications for Image Generative AI
Diffusion models have higher inference costs than LLMs (dozens of denoising steps x large U-Net), making dynamic computation's value even more significant in this domain:
Scenario 1: Token Merging for Accelerated Generation
tomesd[14] merges redundant tokens at each denoising step in Stable Diffusion. At a 50% merge ratio, SD 1.5's generation speed improves by approximately 2x with reduced VRAM usage, while the quality difference is virtually invisible to the human eye. This is particularly valuable for e-commerce, advertising, and design workflows requiring batch image generation.
Scenario 2: DeepCache for Cache-Based Acceleration
DeepCache[18] demonstrates 2x+ acceleration on SD, SDXL, and Stable Video Diffusion. For SDXL (which requires more VRAM), DeepCache's savings are even more pronounced — enabling models that originally needed high-end GPUs to run smoothly on mid-range devices.
Scenario 3: Technique Stacking — Ultimate Acceleration
One of dynamic computation's most powerful properties is that techniques are stackable. On Stable Diffusion, simultaneously using:
- DeepCache (cache high-level features): 2.3x
- ToMe (merge redundant tokens): 2x
- Quantization (INT8/INT4): 1.5-2x
Stacking all three can achieve 4-8x end-to-end acceleration, bringing Stable Diffusion close to real-time generation on consumer GPUs.
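A quick sanity check on the arithmetic (illustrative: multiplying the individual factors gives an ideal upper bound; in practice the techniques partly overlap in what they skip, so real end-to-end gains land below it):

```python
# Ideal multiplicative bound for stacking independent speedups.
speedups = {"DeepCache": 2.3, "ToMe": 2.0, "INT8 quantization": 1.5}
ideal = 1.0
for name, s in speedups.items():
    ideal *= s
print(f"ideal multiplicative bound: {ideal:.1f}x")  # -> 6.9x
```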
5. Hands-on Lab: Token Merging (Computer Vision)
We start with the most intuitive technique — implementing Token Merging on ViT to experience firsthand how "merging redundant tokens" accelerates inference.
Open Google Colab (CPU works, T4 is faster), create a new Notebook, and paste the following code blocks in order:
5.1 Step 1 — Environment Setup and Model Loading
!pip install timm Pillow requests -q
!pip install git+https://github.com/facebookresearch/ToMe -q  # installs Meta's tome package from GitHub
import timm
import tome
import torch
import time
import requests
from PIL import Image
from io import BytesIO
# Download example image (cat from Wikipedia)
url = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Cat_November_2010-1a.jpg/1200px-Cat_November_2010-1a.jpg"
img = Image.open(BytesIO(requests.get(url).content)).convert("RGB")
# Load pretrained ViT-Base/16 (~86M parameters)
model = timm.create_model("vit_base_patch16_224", pretrained=True)
model.eval()
# Image preprocessing
data_config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_config, is_training=False)
x = transform(img).unsqueeze(0) # [1, 3, 224, 224]
# ImageNet labels
labels_url = "https://storage.googleapis.com/bit_models/ilsvrc2012_wordnet_lemmas.txt"
labels = requests.get(labels_url).text.strip().split("\n")
print(f"Model loaded: ViT-Base/16")
print(f" Parameters: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M")
print(f" Input Tokens: {(224//16)**2} patch tokens + 1 cls token = {(224//16)**2 + 1}")
5.2 Step 2 — Benchmarking Tool
def benchmark(model, x, n_warmup=20, n_runs=200):
    """Measure CPU inference latency (ms)"""
    for _ in range(n_warmup):
        with torch.no_grad():
            model(x)
    start = time.perf_counter()
    for _ in range(n_runs):
        with torch.no_grad():
            model(x)
    return (time.perf_counter() - start) / n_runs * 1000

def predict(model, x, labels):
    """Get prediction results"""
    with torch.no_grad():
        logits = model(x)
    probs = logits.softmax(-1)
    top_prob, top_idx = probs.max(-1)
    return labels[top_idx.item()], top_prob.item() * 100
# ---- Baseline data ----
pred, conf = predict(model, x, labels)
base_ms = benchmark(model, x)
print(f"{'='*60}")
print(f" Baseline Model (ViT-Base/16, no Token Merging)")
print(f"{'='*60}")
print(f" Prediction: {pred}")
print(f" Confidence: {conf:.1f}%")
print(f" Inference Latency: {base_ms:.2f} ms")
print(f"{'='*60}")
5.3 Step 3 — Apply Token Merging
# ★ Token Merging: apply in one line ★
# tome.patch.timm modifies the model's forward to merge r tokens per layer
tome.patch.timm(model)
print(f"\n{'='*60}")
print(f" Token Merging Effect Comparison")
print(f"{'='*60}")
print(f"{'r value':<8} {'Latency(ms)':<12} {'Speedup':<8} {'Prediction':<25} {'Confidence'}")
print(f"{'-'*60}")
for r in [0, 4, 8, 16, 24, 32]:
    model.r = r  # Merge r tokens per layer
    pred, conf = predict(model, x, labels)
    ms = benchmark(model, x)
    speedup = base_ms / ms
    marker = " ★ Recommended" if r == 16 else ""
    print(f"r={r:<5} {ms:<11.2f} {speedup:<7.2f}x {pred:<25} {conf:.1f}%{marker}")
print(f"{'='*60}")
print(f"\n★ Key Observations:")
print(f" - r=0 is equivalent to the original model (no merging)")
print(f" - r=16 is the optimal balance between quality and speed")
print(f" - Even at r=32 (aggressive merging), predictions usually remain correct")
print(f" - ViT-Base has 12 layers -> r=16 means merging 16 token pairs per layer")
print(f" - Completely training-free — applies directly to any pretrained ViT")
6. Hands-on Lab: Speculative Decoding (Language Models)
Next is the most practical dynamic computation technique for LLMs — speculative decoding. We use HuggingFace's native assistant_model API[17] to experience "using a small model to draft for a large model."
Open Google Colab (select T4 GPU), create a new Notebook, and paste the following code blocks in order:
6.1 Step 1 — Load Target Model and Draft Model
!pip install transformers accelerate -q
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import time
# ★ Key: one large and one small model from the same family ★
target_name = "facebook/opt-1.3b" # Target model: 1.3B parameters
draft_name = "facebook/opt-125m" # Draft model: 125M parameters (10x smaller)
print("Loading target model (1.3B)...")
tokenizer = AutoTokenizer.from_pretrained(target_name)
target_model = AutoModelForCausalLM.from_pretrained(
    target_name, torch_dtype=torch.float16, device_map="auto"
)
print("Loading draft model (125M)...")
draft_model = AutoModelForCausalLM.from_pretrained(
    draft_name, torch_dtype=torch.float16, device_map="auto"
)
target_mem = sum(p.numel() * p.element_size() for p in target_model.parameters()) / 1024**3
draft_mem = sum(p.numel() * p.element_size() for p in draft_model.parameters()) / 1024**3
print(f"\nModels loaded successfully")
print(f" Target model: {target_name} ({target_mem:.2f} GB)")
print(f" Draft model: {draft_name} ({draft_mem:.2f} GB)")
print(f" Draft model is only {draft_mem/target_mem*100:.1f}% of target model")
6.2 Step 2 — Standard Generation vs. Speculative Decoding
prompts = [
    "The key advantage of dynamic computation in AI is",
    "Large language models can be accelerated by",
    "In the future, efficient AI inference will",
]
print(f"{'='*70}")
print(f" Standard Generation vs. Speculative Decoding Comparison")
print(f"{'='*70}")
total_standard, total_spec = 0, 0
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(target_model.device)

    # ---- Standard generation (token by token) ----
    torch.cuda.synchronize()
    start = time.perf_counter()
    out_standard = target_model.generate(
        **inputs, max_new_tokens=80, do_sample=False
    )
    torch.cuda.synchronize()
    t_standard = time.perf_counter() - start

    # ---- Speculative decoding (assistant_model) ----
    torch.cuda.synchronize()
    start = time.perf_counter()
    out_spec = target_model.generate(
        **inputs, max_new_tokens=80, do_sample=False,
        assistant_model=draft_model,  # ★ Just add this one parameter ★
    )
    torch.cuda.synchronize()
    t_spec = time.perf_counter() - start

    total_standard += t_standard
    total_spec += t_spec
    tokens = out_standard.shape[1] - inputs["input_ids"].shape[1]
    print(f"\n Prompt: {prompt}")
    print(f" Standard: {t_standard:.2f}s | Speculative: {t_spec:.2f}s | "
          f"Speedup: {t_standard/t_spec:.2f}x | Tokens: {tokens}")

    # Verify outputs are identical (speculative decoding's mathematical guarantee)
    match = torch.equal(out_standard, out_spec)
    print(f" Outputs identical: {'Yes' if match else 'No'}")
6.3 Step 3 — Complete Statistics
print(f"\n{'='*70}")
print(f" Speculative Decoding Summary")
print(f"{'='*70}")
print(f" Standard generation total time: {total_standard:.2f}s")
print(f" Speculative decoding total time: {total_spec:.2f}s")
print(f" Overall speedup ratio: {total_standard/total_spec:.2f}x")
print(f"{'='*70}")
print(f"\n★ Key Observations:")
print(f" - Output is mathematically identical to standard generation (zero accuracy loss)")
print(f" - Acceleration depends on the draft model's 'hit rate'")
print(f" - The closer the draft model's distribution to the target -> greater acceleration")
print(f" - Same-family models (e.g., OPT-125M -> OPT-1.3B) work best")
print(f" - HuggingFace Transformers v4.29+ native support")
print(f" - Stackable with quantization: 4-bit target model + small draft model")
7. Hands-on Lab: DeepCache + ToMe (Diffusion Model Acceleration)
Finally, we stack two dynamic computation techniques — DeepCache (feature caching) and ToMe (token merging) — on Stable Diffusion to experience the power of "training-free" acceleration.
Open Google Colab (select T4 GPU), create a new Notebook, and paste the following code blocks in order:
7.1 Step 1 — Environment Setup
!pip install diffusers transformers accelerate DeepCache tomesd -q
import torch
import time
from diffusers import StableDiffusionPipeline
# Load Stable Diffusion 1.5
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    safety_checker=None,
)
pipe = pipe.to("cuda")
prompt = "a photorealistic mountain landscape at golden hour, 8k detailed"
gen = torch.Generator("cuda").manual_seed(42)
print("Stable Diffusion 1.5 loaded successfully")
print(f" VRAM usage: {torch.cuda.memory_allocated()/1024**3:.2f} GB")
7.2 Step 2 — Baseline Generation
# ---- Baseline: standard 50-step generation ----
gen = torch.Generator("cuda").manual_seed(42)
torch.cuda.synchronize()
start = time.perf_counter()
image_base = pipe(prompt, num_inference_steps=50, generator=gen).images[0]
torch.cuda.synchronize()
base_time = time.perf_counter() - start
print(f"Baseline (50-step standard generation): {base_time:.2f}s")
image_base.save("01_baseline.png")
7.3 Step 3 — DeepCache Acceleration
from DeepCache import DeepCacheSDHelper
# ★ DeepCache: cache high-level U-Net features ★
helper = DeepCacheSDHelper(pipe=pipe)
helper.set_params(
    cache_interval=3,   # Full U-Net computation only every 3 steps
    cache_branch_id=0,  # Which skip branch's features to cache
)
helper.enable()
gen = torch.Generator("cuda").manual_seed(42)
torch.cuda.synchronize()
start = time.perf_counter()
image_dc = pipe(prompt, num_inference_steps=50, generator=gen).images[0]
torch.cuda.synchronize()
dc_time = time.perf_counter() - start
helper.disable()
print(f"DeepCache (interval=3): {dc_time:.2f}s (speedup {base_time/dc_time:.2f}x)")
image_dc.save("02_deepcache.png")
7.4 Step 4 — ToMe Acceleration
import tomesd
# ★ ToMe: merge redundant tokens ★
tomesd.apply_patch(pipe, ratio=0.5) # Merge 50% of tokens
gen = torch.Generator("cuda").manual_seed(42)
torch.cuda.synchronize()
start = time.perf_counter()
image_tome = pipe(prompt, num_inference_steps=50, generator=gen).images[0]
torch.cuda.synchronize()
tome_time = time.perf_counter() - start
tomesd.remove_patch(pipe)
print(f"ToMe (ratio=0.5): {tome_time:.2f}s (speedup {base_time/tome_time:.2f}x)")
image_tome.save("03_tome.png")
7.5 Step 5 — DeepCache + ToMe Stacked
# ★ Stack both: DeepCache + ToMe ★
tomesd.apply_patch(pipe, ratio=0.5)
helper = DeepCacheSDHelper(pipe=pipe)
helper.set_params(cache_interval=3, cache_branch_id=0)
helper.enable()
gen = torch.Generator("cuda").manual_seed(42)
torch.cuda.synchronize()
start = time.perf_counter()
image_both = pipe(prompt, num_inference_steps=50, generator=gen).images[0]
torch.cuda.synchronize()
both_time = time.perf_counter() - start
helper.disable()
tomesd.remove_patch(pipe)
print(f"DeepCache + ToMe: {both_time:.2f}s (speedup {base_time/both_time:.2f}x)")
image_both.save("04_deepcache_tome.png")
7.6 Step 6 — Complete Comparison
print(f"\n{'='*60}")
print(f" Diffusion Model Dynamic Acceleration Complete Comparison")
print(f" Stable Diffusion 1.5 / 50 steps / T4 GPU")
print(f"{'='*60}")
print(f"{'Method':<20} {'Time(s)':<10} {'Speedup':<8} {'Training Required?'}")
print(f"{'-'*60}")
results = [
    ("Standard (baseline)", base_time, 1.0, "—"),
    ("DeepCache", dc_time, base_time/dc_time, "No"),
    ("ToMe (50%)", tome_time, base_time/tome_time, "No"),
    ("DeepCache + ToMe", both_time, base_time/both_time, "No"),
]
for name, t, speedup, train in results:
    print(f"{name:<20} {t:<9.2f} {speedup:<7.2f}x {train}")
print(f"{'='*60}")
print(f"\n★ Key Observations:")
print(f" - Both techniques are completely training-free — plug and play")
print(f" - DeepCache and ToMe acceleration effects are stackable")
print(f" - Adding INT8 quantization, total acceleration can exceed 5x")
print(f" - Compare generated images in files for quality differences (usually imperceptible)")
print(f" - These techniques also apply to SDXL and Stable Video Diffusion")
8. Ecosystem Tool Landscape
The tooling ecosystem for dynamic computation is maturing rapidly, with solutions spanning from research prototypes to production deployment:
Mixture of Experts Frameworks
- vLLM (GitHub): High-efficiency LLM inference engine with native support for Mixtral, DeepSeek-V3, and other MoE models, PagedAttention + tensor parallelism
- SGLang (GitHub): Developed by the LMSYS team, RadixAttention shared prefix caching, MoE expert parallelization
- MegaBlocks (GitHub): Databricks' efficient MoE training framework, addressing expert load imbalance
Speculative Decoding
- HuggingFace Transformers[17] (Documentation): v4.29+ native assistant_model parameter — the simplest way to enable it
- Medusa (GitHub): Multi-head speculative decoding, no separate draft model required
- EAGLE (GitHub): Feature-level speculative decoding, faster than Medusa (3x+ acceleration)
Token Merging
- ToMe (GitHub): Meta Research; installs from the GitHub repo and patches timm ViT models in one line
- tomesd[14] (GitHub): pip install tomesd, one-line application to Stable Diffusion pipelines
Diffusion Model Acceleration
- DeepCache[18] (GitHub / PyPI): pip install DeepCache; U-Net feature caching, supports SD / SDXL / SVD
- HuggingFace Diffusers (Documentation): DeepCache is covered in the Diffusers optimization documentation
- PAB (Pyramid Attention Broadcast) (GitHub): Attention caching acceleration for video generation (e.g., Open-Sora)
9. From Technical Metrics to Business Impact
Dynamic computation's impact on enterprise AI deployment is multidimensional:
- Inference cost drops 60-80%: MoE models activate only partial parameters, and speculative decoding lets large models serve at small model speeds. For enterprises making millions of API calls monthly, this means hundreds of thousands of dollars in direct savings
- Latency reduced to less than half: Speculative decoding 2-3x, Token Merging 2x, DeepCache 2.3x. Conversational AI response latency goes from perceptible to instant
- No quality sacrifice: Unlike static compression, speculative decoding guarantees zero accuracy loss, and MoE is actually stronger at equivalent compute. This eliminates the traditional concern that "acceleration = quality degradation"
- Fully compatible with other techniques: Dynamic computation can be stacked on quantization (INT4/INT8), pruning, and distillation. A 4-bit quantized MoE model + speculative decoding can achieve 10x+ end-to-end efficiency improvement
- Hardware barriers dramatically lowered: DeepSeek-V3's 671B model uses MoE to bring inference costs close to a 40B dense model; DeepCache + ToMe bring SD close to real-time generation on mid-range GPUs
- Sustainable AI: The essence of conditional computation is "not wasting compute where it's not needed." Harvard Business Review[1] points out that this kind of computational efficiency improvement is the most direct means of controlling AI's carbon footprint
10. Adoption Path: Three-Phase Deployment Strategy
- Immediate wins — Use existing dynamic computation models and tools: Deploy Mixtral 8x7B or DeepSeek-V3 as the LLM inference backend (via vLLM/SGLang); add tomesd + DeepCache for image generation (two lines of code, training-free). These operations require no model modifications and can be completed in a day
- Incremental validation — Accelerate existing models with speculative decoding: Pair your existing LLM with a smaller same-family draft model (e.g., Phi-3-mini with Phi-3-small), and enable speculative decoding via HuggingFace's assistant_model parameter. Measure latency reduction and output consistency to confirm suitability for your use case
- Deep optimization — Full-stack dynamic computation integration: Consider Medusa or EAGLE as alternatives to standard speculative decoding (higher acceleration); evaluate MoE fine-tuning (e.g., LoRA fine-tuning specific experts on Mixtral); combine dynamic computation with quantization and pruning to build a complete inference efficiency stack. In image generation pipelines, stack DeepCache + ToMe + INT8 quantization, targeting 5x+ end-to-end acceleration
Dynamic computation is the "fourth pillar" of model efficiency — it doesn't change the model's size or precision, but teaches the model to "adjust effort based on context." Unlike pruning (permanently removing parameters), distillation (training a new model), and quantization (reducing precision), dynamic computation lets models retain full capabilities while simply allocating compute resources more intelligently. These four techniques are fully orthogonal and stackable, together forming the complete toolbox for taking AI from the lab to large-scale deployment.
If your team is evaluating how to dramatically reduce inference costs without sacrificing model quality, or needs to accelerate large model responses in latency-sensitive scenarios, we welcome an in-depth technical conversation. Meta Intelligence's research team can accompany you through the complete journey from performance bottleneck diagnosis to full-stack optimized deployment.



