Key Findings
  • Recommender systems contribute over 35% of Amazon's revenue and 80% of Netflix's viewing volume, making them one of the highest-ROI applications of commercial AI[6]
  • From User-based / Item-based collaborative filtering to matrix factorization[1] to DeepFM[3] and DIN[4], recommendation models have made three major leaps in feature-interaction capability
  • Cold start, data sparsity, and multi-objective optimization are the three core engineering challenges for production recommender systems; Graph Neural Networks[7] and multi-task learning offer breakthrough directions
  • YouTube's two-stage architecture[5] (Candidate Generation + Ranking) has become the standard design paradigm for industrial-scale recommender systems

1. Why Recommender Systems Are the Core Engine of Commercial AI

In the digital economy era, users no longer face information scarcity but rather information overload. A mid-sized e-commerce platform may have millions of products, yet a user browses at most a few dozen items per visit. How to precisely present the most relevant content from a massive pool of candidates is the core problem that recommender systems aim to solve[6].

The commercial value of recommender systems has been widely validated. Amazon has reported that its recommendation engine contributes over 35% of total revenue; Netflix's statistics indicate that 80% of the videos users watch come from recommendations rather than active search. In Taiwan's e-commerce ecosystem, platforms like momo, PChome, and Shopee likewise treat recommender systems as critical infrastructure for improving conversion rates and average order values.

From a technical perspective, a recommender system is fundamentally an information retrieval and ranking problem: given a set of users, a set of items, and historical interaction data (clicks, purchases, ratings), the system must predict each user's preference for unencountered items and present them accordingly. Behind this seemingly simple framework lie a series of engineering and research challenges including user behavior modeling, feature engineering, large-scale candidate retrieval, re-ranking model training, and online A/B testing.

The evolution of recommender systems can be broadly divided into three phases: the first phase is statistics-based collaborative filtering, which leverages user-item interaction matrices to find similar patterns; the second phase is matrix factorization and Factorization Machines[1][8], which compress high-dimensional sparse interaction matrices into low-dimensional dense representations; the third phase is deep learning recommendation models[6], which use neural networks to automatically learn high-order feature interactions, achieving unprecedented accuracy. This article will analyze the technical principles of these three phases in sequence, with a focus on key considerations for production deployment in e-commerce scenarios.

2. Collaborative Filtering: The Most Classic Recommendation Method

Collaborative Filtering (CF) is the earliest and most intuitive methodology in recommender systems. Its core assumption is remarkably concise: users with similar behavior will have similar preferences in the future. CF does not need to understand item content features (such as product category or article topic); it relies solely on users' historical behavioral data to generate recommendations[9].

CF can be divided into two major categories. User-based CF works by finding neighbor users whose behavioral patterns are most similar to the target user, then recommending items that neighbors like but the target user hasn't encountered. Similarity is typically computed using cosine similarity or Pearson correlation. Item-based CF approaches from the item perspective: if a user likes item A, and item B is highly similar to item A in terms of user behavior (favored by the same user groups), then item B is recommended. Amazon's classic 2003 paper was the industrial pioneer of Item-based CF.
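The item-based variant can be sketched in a few lines of NumPy. The toy interaction matrix, function names, and top-n logic below are illustrative, not taken from the Amazon paper:

```python
import numpy as np

# Toy user-item interaction matrix (rows: users, cols: items); 1 = interacted.
# Illustrative data only; a real system would build this from behavior logs.
R = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
    [1, 1, 0, 1],
], dtype=float)

def item_cosine_similarity(R):
    """Cosine similarity between item columns of the interaction matrix."""
    norms = np.linalg.norm(R, axis=0, keepdims=True)
    norms[norms == 0] = 1.0           # guard against items with no interactions
    X = R / norms
    return X.T @ X                    # (n_items, n_items) similarity matrix

def recommend_items(R, sim, user, top_n=2):
    """Score unseen items by similarity-weighted sum over the user's history."""
    scores = sim @ R[user]            # aggregate similarity to consumed items
    scores[R[user] > 0] = -np.inf     # mask already-seen items
    return np.argsort(scores)[::-1][:top_n]

sim = item_cosine_similarity(R)
print(recommend_items(R, sim, user=0))
```

Because item-item similarities change slowly, the `sim` matrix can be precomputed offline, which is exactly the property that made item-based CF attractive at Amazon's scale.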

Method          | Core Idea                     | Advantages                            | Disadvantages
----------------|-------------------------------|---------------------------------------|---------------------------------------------
User-based CF   | Recommend via similar users   | Intuitive, interpretable              | Expensive computation with large user counts
Item-based CF   | Recommend via similar items   | Items change slowly, pre-computable   | Cannot capture shifts in user interests
Memory-based    | Direct similarity computation | No training needed, real-time updates | Poor performance on sparse matrices
Model-based     | Learn latent factor models    | Strong generalization                 | Requires training time

However, traditional CF faces two fundamental challenges. The first is data sparsity: in scenarios with millions of users and items, the user-item interaction matrix typically has a fill rate below 1%, making similarity computation highly unstable. The second is the cold start problem: for entirely new users or new items, CF cannot generate effective recommendations due to the lack of historical interactions. These limitations drove the development of more powerful methods such as matrix factorization.

3. Matrix Factorization and Factorization Machines

The 2006 Netflix Prize competition was a watershed moment in recommender system development. Netflix released 100 million movie ratings and offered a one-million-dollar prize to any team that could improve recommendation accuracy by 10%. The core of the winning solution was Matrix Factorization (MF)[1].

MF's mathematical framework is elegant: decompose the user-item interaction matrix R (dimensions m x n) into the product of two low-rank matrices — user latent factor matrix P (m x k) and item latent factor matrix Q (n x k), where k is much smaller than m and n. Each user is represented as a k-dimensional vector, each item likewise, and the predicted rating is the dot product of the two vectors:

R ≈ P × Q^T

Predicted rating: r̂(u, i) = p_u · q_i = Σ(k) p_uk × q_ik

Optimization objective: min Σ(u,i∈observed) (r_ui - p_u · q_i)² + λ(||p_u||² + ||q_i||²)

The key insight of MF is that these latent factors automatically capture meaningful semantic dimensions. For example, in movie recommendations, one dimension might correspond to "action vs arthouse," while another might represent "mainstream commercial vs independent." The positional relationships between users and items in this latent factor space naturally encode preference information.
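A minimal SGD trainer for the regularized objective above can be sketched as follows. The hyperparameters and the toy rating triples are illustrative choices, not values from the Netflix Prize solutions:

```python
import numpy as np

def mf_sgd(ratings, n_users, n_items, k=8, lr=0.01, reg=0.02, epochs=500, seed=0):
    """Matrix factorization trained by SGD on observed (user, item, rating)
    triples, minimizing the regularized squared error."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, k))   # user latent factors
    Q = 0.1 * rng.standard_normal((n_items, k))   # item latent factors
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]                 # r_ui - p_u . q_i
            p_u = P[u].copy()                     # pre-update value for both gradients
            P[u] += lr * (err * Q[i] - reg * p_u)
            Q[i] += lr * (err * p_u - reg * Q[i])
    return P, Q

# Tiny observed-ratings example (user, item, rating); illustrative only.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
P, Q = mf_sgd(ratings, n_users=3, n_items=3)
print(round(float(P[0] @ Q[0]), 2))   # approaches the observed rating 5.0
```

Only the observed entries drive the gradient; predictions for unobserved (user, item) pairs fall out of the learned factors for free, which is precisely what makes MF useful for recommendation.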

Rendle's Factorization Machines (FM)[8], proposed in 2010, further generalized the matrix factorization concept. FM handles not only user-item interactions but also incorporates arbitrary auxiliary features (such as user age, item category, contextual time), efficiently modeling all second-order feature interactions through factorization:

FM Prediction Formula:
ŷ(x) = w_0 + Σ(i) w_i × x_i + Σ(i<j) <v_i, v_j> × x_i × x_j

Where <v_i, v_j> = Σ(f=1 to k) v_if × v_jf
Computational complexity: O(kn) — linear!

The elegance of FM lies in the fact that it doesn't directly learn O(n²) interaction parameters; instead, it maps each feature to a k-dimensional vector, with interaction weights implicitly derived from vector dot products. This dramatically reduces the number of parameters while enabling the model to generalize on sparse data. FM became an important theoretical foundation for subsequent deep learning recommendation models[3].
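The linear-time complexity rests on the algebraic identity sum_{i<j} <v_i, v_j> x_i x_j = 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2]. The sketch below verifies it against the naive O(n^2) double loop; all names and the random data are illustrative:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """FM prediction using the O(kn) identity for the pairwise term:
    0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2 ]."""
    linear = w0 + w @ x
    s = V.T @ x                        # (k,): sum_i v_if * x_i per factor f
    s_sq = (V.T ** 2) @ (x ** 2)       # (k,): sum_i v_if^2 * x_i^2 per factor f
    return linear + 0.5 * np.sum(s ** 2 - s_sq)

# Cross-check against the naive O(n^2) pairwise sum on random data.
rng = np.random.default_rng(0)
n, k = 6, 3
x = rng.standard_normal(n)
w0, w = 0.5, rng.standard_normal(n)
V = rng.standard_normal((n, k))       # one k-dim latent vector per feature

naive = w0 + w @ x + sum(
    (V[i] @ V[j]) * x[i] * x[j] for i in range(n) for j in range(i + 1, n)
)
print(np.isclose(fm_predict(x, w0, w, V), naive))   # True
```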

4. Deep Learning Recommendation Models: DeepFM, Wide&Deep, DIN

While FM elegantly solves the second-order feature interaction problem, real-world user behavior often involves more complex higher-order interaction patterns. Starting from 2016, deep learning began to fully enter the recommender systems field[6], using the nonlinear fitting capability of neural networks to break through FM's expressiveness ceiling.

Wide&Deep (Google, 2016) was one of the earliest industrial recommendation models to combine deep learning with traditional feature engineering. The Wide component is a generalized linear model responsible for Memorization — learning direct co-occurrence patterns of features; the Deep component is a multi-layer fully connected network responsible for Generalization — discovering new feature combinations from dense embeddings. The weighted combination of both outputs is fed through a sigmoid function to produce the final prediction.

DeepFM[3] made a critical improvement upon Wide&Deep: replacing the Wide part's manual feature crossing with an FM layer. The FM layer and Deep layer share the same feature embeddings, with the former capturing second-order interactions and the latter capturing higher-order interactions, both trained end-to-end jointly. This means DeepFM requires absolutely no manual feature engineering, while simultaneously combining FM's interpretability with DNN's expressiveness.

Model       | Low-Order Interactions     | High-Order Interactions | Requires Feature Engineering
------------|----------------------------|-------------------------|-----------------------------
Wide&Deep   | Wide (linear)              | Deep (DNN)              | Yes, for the Wide part
DeepFM[3]   | FM layer                   | DNN layer               | No
DCN         | Cross Network              | DNN layer               | No
DIN[4]      | Attention-weighted history | DNN layer               | No

DIN (Deep Interest Network)[4], proposed by the Alibaba team in 2018, brought another critical breakthrough: user interests are diverse and dynamic. Traditional methods compress all of a user's historical behavior into a single fixed vector, which severely loses information when user interests are diverse. DIN introduces an attention mechanism that uses the candidate item as a Query to dynamically weight the user's historical behavior sequence — when the candidate is a dress, the user's past browsing of apparel items receives higher weight; when the candidate is a book, reading-related historical behaviors are amplified. This adaptive user representation significantly improved CTR prediction accuracy.
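A stripped-down sketch of DIN-style attention pooling follows. Note the hedge: DIN's published attention unit is a small MLP over the concatenated and element-wise-multiplied vectors; the dot-product score here is a simplified stand-in, and all embeddings are illustrative:

```python
import numpy as np

def din_user_interest(history, candidate):
    """Attention-weighted pooling of the behavior sequence, in the spirit of
    DIN: the candidate item acts as the query, and softmax weights over
    relevance scores produce a candidate-specific user representation."""
    scores = history @ candidate                 # (T,) relevance to candidate
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over the sequence
    return weights @ history                     # (d,) adaptive user vector

# Illustrative embeddings: two "apparel" behaviors and one "book" behavior.
apparel_1 = np.array([1.0, 0.1, 0.0])
apparel_2 = np.array([0.9, 0.2, 0.1])
book      = np.array([0.0, 0.1, 1.0])
history = np.stack([apparel_1, apparel_2, book])

dress_candidate = np.array([1.0, 0.0, 0.0])
user_vec = din_user_interest(history, dress_candidate)
# For a dress candidate, the apparel behaviors dominate the pooled vector.
```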

5. YouTube Recommender System: Industrial Architecture Analysis

If academic research defines the theoretical boundaries of recommender systems, YouTube's recommendation architecture[5] defines the standard paradigm for industrial practice. In 2016, Covington et al. published the core design of YouTube's recommender system, revealing the technical details of selecting recommendations from hundreds of millions of videos for over a billion daily users.

YouTube's architecture follows a funnel-shaped two-stage design: Candidate Generation quickly filters hundreds of candidates from millions of videos, while Ranking performs fine-grained re-ranking of these candidates to select the final dozen or so videos to display.

YouTube Recommendation Architecture:

Stage 1: Candidate Generation (Retrieval)
  Input: User watch history, search history, demographics
  Model: Deep neural network → user embedding vector
  Retrieval: Approximate Nearest Neighbor (ANN) search in video embedding space
  Output: ~hundreds of candidate videos

Stage 2: Ranking (Re-ranking)
  Input: Candidate videos + rich features (watch duration, freshness, channel, language...)
  Model: Deeper, wider DNN predicting expected watch time
  Output: Final ranked recommendation list

The elegance of this two-stage architecture lies in the balance between efficiency and precision. The retrieval stage uses a relatively lightweight model but must scan the entire video library at millisecond-level latency — the key technique here is converting the recommendation problem into a multi-class classification problem, then extracting user and video embeddings after training and using approximate nearest neighbor search (e.g., Faiss, ScaNN) for high-speed retrieval. The ranking stage uses a feature-richer, structurally more complex model but only needs to process the hundreds of candidates filtered by the retrieval stage.
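The retrieval contract can be sketched as below. The exhaustive scan stands in for a real ANN index such as Faiss or ScaNN, and all embeddings are synthetic:

```python
import numpy as np

def l2_normalize(X):
    """Normalize rows so inner product equals cosine similarity."""
    return X / np.linalg.norm(X, axis=-1, keepdims=True)

def retrieve_candidates(user_emb, item_embs, top_k=3):
    """Retrieval-stage candidate generation: top-k items by inner product
    with the user embedding. A production system replaces this exhaustive
    scan with an ANN index; the input/output contract is the same."""
    scores = item_embs @ user_emb
    return np.argsort(scores)[::-1][:top_k]

rng = np.random.default_rng(42)
item_embs = l2_normalize(rng.standard_normal((1000, 16)))   # toy "video" corpus
# A user whose taste vector sits near item 7 in the shared embedding space.
user_emb = l2_normalize(item_embs[7] + 0.01 * rng.standard_normal(16))

print(retrieve_candidates(user_emb, item_embs)[0])   # 7
```

The point of the normalization is that maximum-inner-product search over unit vectors reduces to cosine nearest neighbors, which is what most ANN libraries index efficiently.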

One noteworthy engineering decision is that YouTube's ranking model predicts not click-through rate but expected watch time. This avoids clickbait videos being over-recommended due to high CTR, guiding the model to focus on quality content that truly retains users. This design philosophy has had a profound influence on objective function design for all subsequent recommender systems[6]. Short video platforms and streaming services in Taiwan also commonly adopt similar two-tower retrieval plus ranking architecture patterns.

6. Cold Start Problems and Solutions

Cold start is one of the trickiest engineering problems in recommender systems. When a brand-new user registers or a new product is listed, the system lacks sufficient interaction history to deliver personalized recommendations[10]. In e-commerce, the first 72 hours of a new product represent the golden sales window; if the recommender system cannot effectively surface the new product during this period, it directly impacts the product's lifecycle value.

Cold start problems can be classified into three types: user cold start (new user with no behavioral history), item cold start (new item with no interaction records), and system cold start (entirely new platform with neither user nor item data). Different strategies have been developed for each type:

Strategy                | Applicable Scenario    | Core Approach
------------------------|------------------------|--------------------------------------------------------------------------
Content-based filtering | Item cold start        | Compute similarity from item content features (text, images, categories)
Hybrid model            | User + item cold start | Combine CF and content-based approaches, complementing each other
Meta-learning           | User cold start        | Rapidly adapt from minimal interactions, e.g. MAML-based methods
Cross-domain transfer   | System cold start      | Transfer user preference knowledge from related domains
Exploration strategy    | All cold start types   | Multi-armed bandits to balance exploration and exploitation

In practice, the most effective cold start solutions are typically hybrid strategies. For new users, the system can first display recommendations based on demographics (age, region) and platform popularity rankings, then quickly converge the personalization model through the user's first few click behaviors. For new items, the system leverages product title text, descriptions, images, and other content features, mapping new products to the semantic space of existing items through pre-trained embedding models and "borrowing" their interaction patterns from the most similar existing items.
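The "borrowing" idea can be sketched as follows, assuming content embeddings from some pretrained encoder; every name, vector, and count below is illustrative:

```python
import numpy as np

def borrow_interactions(new_item_emb, existing_embs, interaction_stats, top_m=2):
    """Item cold start sketch: find the existing items closest to a new item
    in content-embedding space and average their interaction statistics as a
    behavioral prior for the newcomer."""
    sims = existing_embs @ new_item_emb / (
        np.linalg.norm(existing_embs, axis=1) * np.linalg.norm(new_item_emb)
    )
    nearest = np.argsort(sims)[::-1][:top_m]          # most content-similar items
    prior = interaction_stats[nearest].mean(axis=0)   # borrowed behavior prior
    return nearest, prior

# Illustrative 2-d content embeddings and per-segment click counts.
existing_embs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
interaction_stats = np.array([[100.0, 10.0], [80.0, 20.0], [5.0, 200.0]])
new_item = np.array([0.95, 0.05])    # content-similar to the first two items
nearest, prior = borrow_interactions(new_item, existing_embs, interaction_stats)
```

The borrowed prior only seeds the new item's exposure; as real interactions accumulate, they should progressively replace it.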

In recent years, large language models (LLMs) have brought new breakthrough directions for the cold start problem. Through deep semantic understanding of product descriptions, LLMs can infer the matching relationship between user preferences and product characteristics with zero interaction data, effectively alleviating data sparsity[9]. For example, using LLMs for semantic parsing of new product descriptions can automatically build association graphs with existing products, enabling recommendations from the moment of listing.

7. Graph Neural Networks in Recommender Systems

Traditional recommendation models treat users and items as independent entities, connecting them only through interaction matrices. But in real-world scenarios, there are rich graph-structured relationships between users and items: users' social networks, items' knowledge graph structures, user-item-attribute heterogeneous graphs, and more. Graph Neural Networks (GNN) provide a natural way for recommender systems to model these relationships[7].

In GNN-based recommendation models, users and items are represented as nodes in a graph, with their interactions forming edges. GNN propagates information across the graph through a message passing mechanism: each node aggregates its neighbors' representations to update its own embedding. After multiple layers of propagation, each node's embedding encodes not only its own information but also structural information from its multi-hop neighborhood.

GNN Message Passing in Recommender Systems:

User-Item Bipartite Graph:
  User_A ──purchased──→ Item_1
  User_A ──browsed──→ Item_2
  User_B ──purchased──→ Item_1
  User_B ──purchased──→ Item_3

Layer 1: Item_1's embedding ← Aggregate(User_A, User_B) features
Layer 2: User_A's embedding ← Aggregate(Item_1(containing User_B's info), Item_2)
→ User_A indirectly obtains User_B's preference information (Item_3)

Representative models include PinSage (Pinterest's GNN recommendation system, processing a graph with over 3 billion nodes), LightGCN (which simplifies GCN design by removing nonlinear transformations and feature transformations, retaining only neighborhood aggregation, and paradoxically achieving better performance on recommendation tasks), and NGCF (Neural Graph Collaborative Filtering, which explicitly encodes CF signals in the embedding propagation process).
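The LightGCN recipe, neighborhood aggregation with a symmetrically normalized adjacency and a final average over layers, fits in a short NumPy sketch. The node indexing follows the bipartite example above; the initial embeddings are illustrative random vectors:

```python
import numpy as np

def lightgcn_embeddings(A, E0, n_layers=2):
    """LightGCN-style propagation: repeated neighborhood aggregation with a
    symmetrically normalized adjacency, no feature transforms and no
    nonlinearities; the final embedding averages all layers (including 0)."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    A_norm = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]   # D^-1/2 A D^-1/2
    layers = [E0]
    for _ in range(n_layers):
        layers.append(A_norm @ layers[-1])                   # one propagation step
    return np.mean(layers, axis=0)

# Nodes 0=User_A, 1=User_B, 2=Item_1, 3=Item_2, 4=Item_3 (bipartite graph above).
A = np.zeros((5, 5))
for u, i in [(0, 2), (0, 3), (1, 2), (1, 4)]:
    A[u, i] = A[i, u] = 1.0

rng = np.random.default_rng(0)
E0 = rng.standard_normal((5, 4))      # random initial embeddings
E = lightgcn_embeddings(A, E0)
# After two layers, User_A's row mixes in User_B's signal via shared Item_1.
```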

The advantages of GNN in recommender systems are primarily reflected in three areas[7]. First, higher-order connectivity: multi-layer GNNs can capture higher-order similarities between users — even if two users share no direct co-purchasing behavior, potential preference correlations can be discovered through multi-hop paths. Second, side information integration: item attribute relationships in knowledge graphs (such as brand, material, designer) can naturally be encoded as edges in the graph, enriching the semantics of item embeddings. Third, cold start mitigation: even if a new item lacks interaction data, as long as it has attribute relationships with existing items in the knowledge graph, GNN can generate meaningful embedding representations for it through graph propagation.

However, GNN-based recommendation models face significant scalability challenges in industrial deployment. Full graph neighborhood aggregation on graphs with hundreds of millions of nodes incurs extremely high computational costs, so industry commonly adopts graph sampling (such as GraphSAGE's neighborhood sampling), mini-batch training, and other strategies to balance effectiveness and efficiency.

8. Multi-Objective Optimization: Click Rate, Conversion Rate, and Long-Term Value

In real-world recommendation scenarios, systems need to simultaneously optimize multiple interrelated and sometimes conflicting objectives. E-commerce platforms want users not only to click recommended products (CTR) but also to complete purchases (CVR), and ideally to improve long-term user retention and lifetime value (LTV). The tension between these objectives constitutes the core challenge of multi-objective optimization (MOO)[10].

The most intuitive approach is to use a weighted sum of multiple objectives as the final ranking score:

Final Score = α × pCTR + β × pCVR + γ × Estimated AOV + δ × Content Quality Score

Typical weight settings (e-commerce):
  Short-term conversion: α=0.3, β=0.5, γ=0.15, δ=0.05
  Long-term value:       α=0.2, β=0.3, γ=0.1, δ=0.4

However, simple weighted linear combinations have fundamental issues. First, different objectives have vastly different scales and distributions, requiring careful normalization. Second, weight settings are highly dependent on manual expertise, and optimal weights change dynamically with time and context. Third, at a deeper level, there is a causal structure between objectives — a user must first see a product, then click, then add to cart, and finally purchase — this is a sequential decision process.

Alibaba's ESMM (Entire Space Multi-Task Model) proposed an elegant solution: leveraging the causal decomposition pCTCVR = pCTR × pCVR to jointly train CTR and CVR subtasks on the full sample space. This resolves the sample selection bias problem where traditional CVR models could only be trained on clicked samples.
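ESMM's objective can be sketched as two binary cross-entropy terms over the full impression space; the function below is an illustrative loss computation under that factorization, not Alibaba's implementation:

```python
import numpy as np

def esmm_loss(p_ctr, p_cvr, click, purchase, eps=1e-7):
    """ESMM-style joint objective: a CTR loss against click labels plus a
    CTCVR loss against purchase labels, using pCTCVR = pCTR * pCVR. Both
    terms are computed over all impressions, so the CVR tower is never
    trained on clicked samples alone (avoiding selection bias)."""
    p_ctcvr = p_ctr * p_cvr
    bce = lambda p, y: -np.mean(
        y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)
    )
    return bce(p_ctr, click) + bce(p_ctcvr, purchase)

# Toy batch of impressions; purchase implies click, as in the real funnel.
p_ctr    = np.array([0.9, 0.2, 0.7, 0.1])   # CTR tower outputs (illustrative)
p_cvr    = np.array([0.5, 0.3, 0.8, 0.2])   # CVR tower outputs (illustrative)
click    = np.array([1.0, 0.0, 1.0, 0.0])
purchase = np.array([1.0, 0.0, 0.0, 0.0])
loss = esmm_loss(p_ctr, p_cvr, click, purchase)
```

The key design point is that the CVR probability is only ever supervised through the product pCTR * pCVR, which is defined on every impression.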

More advanced methods include MMoE (Multi-gate Mixture-of-Experts): setting up an independent Gate network for each task to dynamically weight shared Expert modules, allowing different tasks to adaptively leverage shared and task-specific representations. Building on this, PLE (Progressive Layered Extraction) further introduces task-specific Experts and progressive inter-layer connections, more effectively balancing information sharing and conflicts between tasks.

In Taiwan's e-commerce practice, a key consideration in multi-objective optimization is platform ecosystem health. Over-optimizing short-term CTR may lead to clickbait products proliferating; over-optimizing CVR may cause the system to only recommend low-price bestsellers while ignoring long-tail products. Mature recommender systems need to incorporate diversity, novelty, and fairness into optimization objectives[10], ensuring that recommendations are valuable not only for users but also sustainable for merchants and the platform.

9. Conclusion: The Next Paradigm for Recommender Systems

From User-based CF to matrix factorization[1], from FM[8] to DeepFM[3], from DIN[4] to GNN[7], recommender systems have undergone a qualitative transformation from statistical methods to deep learning over the past two decades. Each technological leap has been accompanied by exponential growth in commercial value — more precise recommendations translate to higher conversion rates, longer user engagement, and healthier platform ecosystems.

The core thread of this technical evolution is consistent: ever richer feature-interaction modeling, learned increasingly automatically rather than engineered by hand.

Looking ahead to the next paradigm, we see three converging trends. First, large language models as recommendation engines: LLMs possess powerful semantic understanding and world knowledge, enabling them to understand user intent through natural language, explain recommendation rationale, and dynamically adjust recommendation strategies in conversation. Second, generative recommendation: rather than retrieving from a fixed candidate set, directly generating content descriptions or product combinations that users might need, breaking the "existing inventory" constraint of traditional recommendations. Third, causal inference and counterfactual recommendation: shifting from correlation analysis of "what users clicked" to causal analysis of "what behavioral changes recommendations caused," enabling systems to truly understand the effects of recommendations rather than merely recording surface correlations.

For e-commerce platforms and content services in Taiwan, recommender systems are not just a technical challenge but an organizational capability challenge. Building an effective recommender system requires data engineering, machine learning, frontend interaction design, A/B testing infrastructure, and deep understanding of business metrics. At Meta Intelligence, when helping clients deploy recommender systems, we consistently emphasize "end-to-end thinking" — from data collection, feature engineering, and model training to online serving, the quality of every component determines the upper bound of the final recommendation performance. Technology evolves, but the core principle remains: understand users, respect users, and create genuine value for users.