About This Site
This is a board game recommendation system built on the BoardGameGeek reviews dataset available on Kaggle. The site analyzes millions of user ratings to provide personalized game recommendations.
What This Site Can Do
- Personalized Recommendations - Rate games you've played to get tailored suggestions for new games you'll likely enjoy
- Intelligent Analysis - Uses machine learning to understand your gaming preferences across multiple dimensions (complexity, theme, mechanics, etc.)
- Game Discovery - Browse and search through thousands of board games with detailed information
- Preference Insights - Visualize your gaming personality and see how your tastes compare to other players
- Data-Driven Predictions - Get confidence ratings and explanations for why certain games are recommended
The system becomes more accurate as you rate more games, with optimal performance achieved after rating 30+ games across different genres and complexity levels.
Mathematical Foundations of Boardgame Ratings
This document explains the mathematical techniques and algorithms used in our boardgame recommendation system. Understanding these concepts will help you appreciate how we transform user ratings into meaningful predictions and recommendations.
Table of Contents
- Overview: From Ratings to Vectors
- Singular Value Decomposition (SVD)
- Collaborative Filtering
- PostgreSQL Vector Similarity and Recommendations
- Linear Regression for New Users
- Statistical Analysis
- Implementation Details
- Myers-Briggs Type Indicator for Board Games
Overview: From Ratings to Vectors
Our recommendation system transforms sparse user-game rating data into dense mathematical vectors that capture latent features about both users and games. This transformation enables us to:
- Predict ratings for games a user hasn't rated
- Find similar games based on underlying characteristics
- Recommend games tailored to individual preferences
- Analyze patterns in user behavior and game properties
The core insight is that user preferences and game characteristics can be represented as points in a high-dimensional space, where proximity indicates similarity.
Singular Value Decomposition (SVD)
What is SVD?
Singular Value Decomposition is a fundamental matrix factorization technique that decomposes our rating matrix R (users × games) into three matrices:
R ≈ U × Σ × V^T
Where:
- U contains user feature vectors (users × factors)
- Σ is a diagonal matrix of singular values (factors × factors)
- V^T contains game feature vectors (factors × games)
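The decomposition can be checked numerically. Here is a small NumPy sketch on a toy, fully observed matrix (illustrative numbers only, not real ratings):

```python
import numpy as np

# Toy complete ratings matrix (5 users x 4 games); real data is sparse.
R = np.array([
    [8.0, 3.0, 6.0, 9.0],
    [2.0, 4.0, 1.0, 3.0],
    [7.0, 2.0, 5.0, 8.0],
    [3.0, 2.0, 9.0, 4.0],
    [6.0, 5.0, 8.0, 7.0],
])

# Full SVD: R = U @ diag(s) @ Vt, with singular values s in descending order.
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# The product reconstructs R exactly (up to floating point error).
assert np.allclose(U @ np.diag(s) @ Vt, R)

# Keeping only the top k factors gives the best rank-k approximation of R.
k = 2
R_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(R_approx, 1))
```

Real rating matrices are mostly missing entries, so production systems fit the factors by optimization rather than exact SVD, but the rank-k idea is the same.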
The Step-by-Step SVD Process
Step 1: Building the Ratings Matrix R
We start with a sparse ratings matrix where each entry R[i,j] represents user i's rating for game j:
Ratings Matrix R (simplified example):

```
       Game1 Game2 Game3 Game4 Game5
User1    8     ?     6     ?     9
User2    ?     4     ?     3     ?
User3    7     ?     ?     5     8
User4    ?     2     9     ?     ?
User5    6     ?     8     4     7
```
Most entries are missing (represented by ?), creating a sparse matrix. In our real system, we have hundreds of thousands of users and tens of thousands of games, with each user typically rating only a small fraction of all games.
Step 2: Matrix Factorization into U, Σ, and V^T
SVD decomposes the ratings matrix R into three components:
Matrix U (User Features): Each row represents a user as a vector in the latent factor space:
```
U Matrix (users × 15 factors):
         F1    F2    F3   ...  F15
User1   0.2  -0.8   0.4  ...  0.1
User2  -0.5   0.3  -0.2  ...  0.6
User3   0.8  -0.1   0.9  ...  0.3
...
```
Matrix Σ (Singular Values): A diagonal matrix containing the importance of each factor:
```
Σ Matrix (15 × 15 diagonal):
Factor1:  45.2
Factor2:  32.1
Factor3:  28.7
...
Factor15:  2.3
```
Matrix V^T (Game Features): Each column represents a game as a vector in the same latent factor space:
```
V^T Matrix (15 factors × games):
         Game1 Game2 Game3 ...
Factor1   0.6  -0.3   0.8  ...
Factor2  -0.2   0.7  -0.4  ...
Factor3   0.9   0.1   0.6  ...
...
```
Step 3: Understanding the Role of Singular Values
The singular values in the Σ matrix serve several critical purposes:
Factor Importance Ranking: Larger singular values indicate more important factors that explain more variance in the ratings. In our example:
- Factor 1 (σ=45.2) might represent "strategy vs. luck preference"
- Factor 2 (σ=32.1) might represent "game complexity preference"
- Factor 15 (σ=2.3) captures much less important patterns or noise
Dimensionality Reduction: We typically keep only the top k factors (k=15 in our case) and discard factors with small singular values. This reduces noise and computational complexity while retaining the most meaningful patterns.
Variance Explanation: The proportion of total variance explained by factor i is σ²ᵢ / Σ(σ²ⱼ). This tells us how much each factor contributes to explaining user preferences.
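The variance-explanation formula is easy to verify numerically. A short sketch using the illustrative singular values from the example above (not the actual values from our trained model):

```python
import numpy as np

# Illustrative singular values (not the real ones from the trained model).
singular_values = np.array([45.2, 32.1, 28.7, 12.4, 2.3])

# Proportion of variance explained by factor i: sigma_i^2 / sum(sigma_j^2)
explained = singular_values**2 / np.sum(singular_values**2)

for i, ratio in enumerate(explained, start=1):
    print(f"Factor {i}: {ratio:.1%} of variance")

# The ratios sum to 1 by construction.
assert np.isclose(explained.sum(), 1.0)
```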
Step 4: Converting to User and Game Vectors
In our implementation, we use biased=False, which simplifies the decomposition. The final user and game vectors fold the singular values into both factor matrices:
- User vectors: U × √Σ (incorporating singular values into user representations)
- Game vectors: √Σ × V^T (incorporating singular values into game representations)
(In practice, Surprise learns these scaled vectors directly via stochastic gradient descent rather than computing an explicit SVD.)
This gives us the final vectors:
```python
user_vectors = svd.pu  # Shape: (n_users, 15) - stored in Redis
game_vectors = svd.qi  # Shape: (n_games, 15) - stored in PostgreSQL
```
Each user becomes a 15-dimensional vector representing their preferences across latent factors:
User1_vector = [0.8, -1.2, 0.6, 0.3, -0.9, 0.1, 0.7, -0.4, 0.5, 0.2, -0.3, 0.8, 0.1, -0.6, 0.4]
Each game becomes a 15-dimensional vector representing its characteristics along the same factors:
Game1_vector = [0.9, -0.8, 0.7, 0.2, -0.5, 0.3, 0.6, -0.1, 0.4, 0.8, -0.2, 0.5, 0.3, -0.7, 0.1]
Step 5: Making Predictions
With biased=False, predicting a rating becomes a simple dot product:
predicted_rating = user_vector · game_vector
predicted_rating = Σ(user[i] × game[i]) for i = 1 to 15
For example:
User1 rating for Game1 = (0.8×0.9) + (-1.2×-0.8) + (0.6×0.7) + ... = 4.41
This prediction represents how much User1 would likely enjoy Game1 based on the learned patterns from all users' ratings.
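Using the example 15-dimensional vectors above, the prediction is a one-line NumPy computation:

```python
import numpy as np

# The example user and game vectors from the text above.
user1 = np.array([0.8, -1.2, 0.6, 0.3, -0.9, 0.1, 0.7, -0.4, 0.5,
                  0.2, -0.3, 0.8, 0.1, -0.6, 0.4])
game1 = np.array([0.9, -0.8, 0.7, 0.2, -0.5, 0.3, 0.6, -0.1, 0.4,
                  0.8, -0.2, 0.5, 0.3, -0.7, 0.1])

# With biased=False the predicted rating is just the dot product.
predicted = float(np.dot(user1, game1))
print(round(predicted, 2))  # 4.41
```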
Mathematical Interpretation
The beauty of this approach is that each factor captures a different aspect of preferences:
- Factor 1 might represent "complexity preference": positive values indicate users who prefer complex games, negative values prefer simple games
- Factor 2 might represent "theme preference": positive for fantasy themes, negative for historical themes
- Factor 3 might represent "interaction level": positive for highly interactive games, negative for solitary games
When we compute the dot product, we're essentially asking: "How well do this user's preferences align with this game's characteristics across all learned factors?"
Why SVD Works for Recommendations
- Dimensionality Reduction: SVD identifies the most important latent factors that explain rating patterns, reducing the original sparse matrix to dense, meaningful representations
- Noise Reduction: By keeping only the top factors (highest singular values), we filter out noise and focus on robust patterns
- Generalization: The learned factors help predict ratings for unseen user-game pairs by capturing fundamental preference patterns
- Computational Efficiency: Dense 15-dimensional vectors are much faster to work with than sparse rating matrices
Our Implementation
We use the Surprise library's SVD implementation with specific parameters:
```python
svd = SVD(n_factors=15, biased=False)
```
- 15 factors: Balances model complexity with interpretability - enough to capture nuanced preferences without overfitting
- biased=False: Ensures predictions follow the simple formula:
rating = user_vector · game_vector
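Under the hood, Surprise's SVD with biased=False learns pu and qi by stochastic gradient descent on the observed ratings rather than by an exact matrix decomposition. A minimal NumPy sketch of that training loop, with toy data and illustrative hyperparameters (the real model uses 15 factors on millions of ratings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy observed (user, game, rating) triples standing in for the real dataset.
ratings = [(0, 0, 8.0), (0, 2, 6.0), (1, 1, 4.0), (2, 0, 7.0),
           (2, 3, 5.0), (3, 2, 9.0), (4, 0, 6.0), (4, 2, 8.0)]
n_users, n_games, n_factors = 5, 4, 3

# Factor matrices, analogous to svd.pu and svd.qi in Surprise.
pu = rng.normal(0, 0.1, (n_users, n_factors))
qi = rng.normal(0, 0.1, (n_games, n_factors))

lr, reg = 0.05, 0.02  # learning rate and L2 regularization (illustrative)
for _ in range(500):
    for u, i, r in ratings:
        err = r - pu[u] @ qi[i]      # residual for this observation
        pu_u = pu[u].copy()          # snapshot before the paired update
        pu[u] += lr * (err * qi[i] - reg * pu[u])
        qi[i] += lr * (err * pu_u - reg * qi[i])

# Once the factors converge, training error is small.
rmse = np.sqrt(np.mean([(r - pu[u] @ qi[i]) ** 2 for u, i, r in ratings]))
print(f"training RMSE: {rmse:.3f}")
```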
Collaborative Filtering
The Core Principle
Collaborative filtering assumes that users with similar rating patterns will like similar games. Our approach:
- Identify latent factors that explain why users rate games as they do
- Learn user preferences for each factor (e.g., strategy vs. luck, complexity vs. simplicity)
- Learn game characteristics along the same factors
- Predict ratings by matching user preferences with game characteristics
Matrix Factorization Process
Starting with a sparse ratings matrix where most entries are missing:
```
       Game1 Game2 Game3 Game4
User1    5     ?     3     ?
User2    ?     4     ?     2
User3    3     ?     ?     4
User4    ?     1     5     ?
```
SVD learns that User1 and User3 have similar preferences, and Game1 and Game3 share characteristics, allowing us to fill in the missing ratings.
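This fill-in behavior can be demonstrated with iterative rank-1 SVD imputation on a toy matrix, a simplified stand-in for how matrix factorization recovers missing entries:

```python
import numpy as np

# A rank-1 "true" ratings matrix: user enthusiasm x game quality.
true_R = np.outer([1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 3.0])

# Hide one rating, as if User2 never rated Game3 (true value 6.0).
R = true_R.copy()
R[1, 2] = np.nan
mask = ~np.isnan(R)

# Iterative rank-1 imputation: fill, factor, refill the missing entry.
filled = np.where(mask, R, np.nanmean(R))
for _ in range(50):
    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    approx = s[0] * np.outer(U[:, 0], Vt[0])  # best rank-1 approximation
    filled = np.where(mask, R, approx)        # keep observed, update missing

print(round(filled[1, 2], 2))  # 6.0 -- the hidden rating is recovered
```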
PostgreSQL Vector Similarity and Recommendations
Inner Product for Predictions
Our system uses inner product (dot product) for vector similarity and rating prediction:
predicted_rating = user_vector · game_vector = Σ(user[i] × game[i])
This choice is intentional because:
- With biased=False in SVD, the inner product directly predicts the rating
- Higher values indicate stronger predicted preference
- Mathematically optimal for collaborative filtering
PostgreSQL with pgvector
We use PostgreSQL with pgvector extension for scalable vector similarity search:
Game Vector Storage
Game vectors are stored in PostgreSQL with dedicated vector columns:
```sql
CREATE TABLE games (
  id INTEGER PRIMARY KEY,
  game_name TEXT,
  popularity FLOAT,
  vote_average FLOAT,
  vector_4 vector(4),   -- 4-dimensional vectors
  vector_5 vector(5),   -- 5-dimensional vectors
  vector_15 vector(15)  -- 15-dimensional vectors
);
```
Vector Indexes for Performance
We use IVFFlat indexes optimized for inner product searches:
```sql
-- Inner product indexes for fast similarity search
CREATE INDEX ON games USING ivfflat (vector_4 vector_ip_ops);
CREATE INDEX ON games USING ivfflat (vector_5 vector_ip_ops);
CREATE INDEX ON games USING ivfflat (vector_15 vector_ip_ops);
```
Recommendation Queries
Real-time recommendations use PostgreSQL's native vector operators:
```sql
SELECT id, game_name, popularity, vote_average,
       (vector_15 <#> '[user_vector_values]') * -1 AS predicted_rating
FROM games
WHERE vector_15 IS NOT NULL
ORDER BY vector_15 <#> '[user_vector_values]'
LIMIT 25;
```
Key PostgreSQL Vector Operators:
- <#>: Negative inner product (for similarity ranking)
- <->: L2 distance (Euclidean distance)
- <=>: Cosine distance
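These three operators are easy to mimic in NumPy, which also shows why the query above multiplies the `<#>` result by -1 to recover the predicted rating:

```python
import numpy as np

a = np.array([1.0, 2.0, -1.0])
b = np.array([0.5, 1.0, 3.0])

neg_inner = -np.dot(a, b)       # what <#> returns (negative inner product)
l2 = np.linalg.norm(a - b)      # what <-> returns (Euclidean distance)
cosine_dist = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # <=>

# Ordering ascending by <#> is ordering descending by the inner product,
# i.e. by predicted rating -- hence the * -1 to report the rating itself.
print(neg_inner, round(l2, 3), round(cosine_dist, 3))
```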
Performance Optimization
To ensure accurate results with IVFFlat indexes:
```sql
SET ivfflat.probes = 10; -- Search more clusters for accuracy
```
User Vector Storage (Redis)
User vectors remain in Redis for fast access:
```json
{
  "user_vector_4": [1.57, -0.61, -0.86, 1.63],
  "user_vector_5": [1.57, -0.61, -0.86, 1.63, 0.23],
  "user_vector_15": [1.57, -0.61, -0.86, 1.63, ...]
}
```
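A minimal sketch of building such a payload with the standard library (the user_vector_payload helper and the key naming are illustrative, not the production code):

```python
import json

# Hypothetical helper: serialize a user's vectors the way they are cached.
def user_vector_payload(vectors_by_size):
    return json.dumps(
        {f"user_vector_{size}": vec for size, vec in vectors_by_size.items()}
    )

payload = user_vector_payload({4: [1.57, -0.61, -0.86, 1.63]})
# A client such as redis-py would then store it under a per-user key,
# e.g. r.set("user:alice", payload), and read it back with json.loads.
restored = json.loads(payload)
print(restored["user_vector_4"])
```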
Hybrid Architecture Benefits
This hybrid approach provides:
1. PostgreSQL for games: ACID compliance, complex queries, vector indexes
2. Redis for users: High-speed access, JSON flexibility, caching
3. Optimal performance: ~1-3ms query times for 25 recommendations from 23K+ games
4. Scalability: Handles millions of games with sub-linear query complexity
Linear Regression for New Users
The Cold Start Problem
When a new user rates a few games, we need to compute their user vector to make recommendations. This is a linear regression problem.
Mathematical Formulation
Given user ratings for games with known vectors, we solve:
Ax = b
Where:
- A is the matrix of game vectors (n_games × n_factors)
- x is the unknown user vector (n_factors × 1)
- b is the vector of user ratings (n_games × 1)
Solution Methods
Normal Equations (primary method):

x = (A^T A)^(-1) A^T b

Gradient Descent (fallback for numerical issues):
- Iteratively minimize the squared error ||Ax - b||²
- Learning rate: 0.001
- Iterations: 1000
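A NumPy sketch of the normal-equations solve, recovering a synthetic user vector from noiseless ratings (illustrative data; the production code does the equivalent in Ruby):

```python
import numpy as np

rng = np.random.default_rng(1)

# Known game vectors (rows of A) and a hidden "true" user vector.
A = rng.normal(size=(8, 4))            # 8 rated games, 4 latent factors
x_true = np.array([1.5, -0.6, 0.9, 0.2])
b = A @ x_true                         # the user's ratings under the model

# Normal equations: x = (A^T A)^-1 A^T b, solved without an explicit inverse.
x = np.linalg.solve(A.T @ A, A.T @ b)
print(np.round(x, 3))
```

With noiseless data the solve recovers x_true exactly; with real, noisy ratings it returns the least-squares best fit instead.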
Implementation Details
For new users, we retrieve game vectors from PostgreSQL and solve for user vectors:
```ruby
require 'matrix'

def solve_least_squares(game_vectors, ratings)
  a_matrix = Matrix[*game_vectors]
  b_vector = Matrix.column_vector(ratings)
  # Normal equations approach
  at_a = a_matrix.transpose * a_matrix
  at_b = a_matrix.transpose * b_vector
  solve_linear_system(at_a, at_b)
end
```
```ruby
# Get game vectors from PostgreSQL
def self.calc_user_vectors(user_ratings, min_ratings)
  game_ids = user_ratings.map { |rating| rating[:game_id] }
  games = Game.where(id: game_ids).select(:id, vector_column_for(feature_size))

  # Extract vectors and ratings for least squares
  game_vectors = []
  scores = []
  user_ratings.each do |rating|
    game = games.find { |g| g.id == rating[:game_id] }
    next unless game&.send(vector_column_for(feature_size))

    game_vectors << game.send(vector_column_for(feature_size))
    scores << rating[:score]
  end

  # Solve for user vector that best explains their ratings
  solve_least_squares(game_vectors, scores)
end
```
Statistical Analysis
Z-Score Analysis
To understand what makes users or games unique, we calculate z-scores for each vector component:
z_score = (value - mean) / standard_deviation
High absolute z-scores indicate distinctive characteristics. For example:
- A user with z-score = +2.5 for factor 7 strongly prefers games high in that characteristic
- A game with z-score = -3.0 for factor 3 is unusually low in that trait
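A quick numerical illustration with made-up factor values:

```python
import numpy as np

# Factor-7 values for a small, illustrative population of users.
values = np.array([0.1, 0.3, -0.2, 0.4, 2.6, 0.0, -0.1, 0.2])

# z-score: how many standard deviations each user is from the mean.
z = (values - values.mean()) / values.std()
print(np.round(z, 2))

# The outlier user stands out with the largest absolute z-score.
assert abs(z[4]) == np.abs(z).max()
```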
Caching Strategy
We cache statistical computations for performance:
- User statistics: Mean and standard deviation for each factor across all users
- Game statistics: Similar statistics for games
- Cache invalidation: Based on database size changes and timestamps
- Batch processing: Use Redis pipelines for efficient data retrieval
Percentile Calculations
For each game property, we calculate percentiles to understand relative positioning:
```ruby
def calculate_percentile(property_index, value)
  values = all_games.map { |game| game[:vector][property_index] }.sort
  rank = values.count { |v| v <= value }
  (rank.to_f / values.length * 100).round(1)
end
```
Implementation Details
Data Pipeline
- Raw Ratings: User-game-rating triplets from BoardGameGeek (26M+ ratings, 200K+ users)
- SVD Training: Learn user and game vectors using collaborative filtering (Python/Surprise)
- PostgreSQL Storage: Store game vectors with metadata using pgvector extension
- Redis Storage: Store user vectors with JSON structure for fast access
- Index Creation: Build IVFFlat inner product indexes for vector similarity
- Real-time Queries: Serve recommendations via PostgreSQL vector search
Storage Architecture
PostgreSQL (Games)
```sql
-- 23,000+ games with multiple vector dimensions
CREATE TABLE games (
  id INTEGER PRIMARY KEY,
  game_name TEXT NOT NULL,
  popularity FLOAT,
  vote_average FLOAT,
  vector_4 vector(4),
  vector_5 vector(5),
  vector_15 vector(15)
);

-- Inner product indexes for each dimension
CREATE INDEX ON games USING ivfflat (vector_4 vector_ip_ops);
CREATE INDEX ON games USING ivfflat (vector_5 vector_ip_ops);
CREATE INDEX ON games USING ivfflat (vector_15 vector_ip_ops);
```
Redis (Users)
User vectors in JSON format:

```json
{
  "user:$username": {
    "user_vector_4": [1.57, -0.61, -0.86, 1.63],
    "user_vector_5": [1.57, -0.61, -0.86, 1.63, 0.23],
    "user_vector_15": [1.57, -0.61, ..., 0.45]
  }
}
```
Performance Optimizations
- Vector Indexes: IVFFlat indexes avoid a full O(n) scan by probing only a small subset of clusters, giving sub-linear search cost
- Probe Tuning: SET ivfflat.probes = 10 balances accuracy vs. speed
- Batch Operations: PostgreSQL bulk inserts for vector updates
- Connection Pooling: Rails ActiveRecord connection management
- Caching: Redis for user vector caching and statistical computations
Error Handling and Validation
- Minimum Ratings: Require at least 5 ratings for reliable user vectors
- RMSE Calculation: Monitor prediction quality during vector computation
- Fallback Methods: Use gradient descent when matrix operations fail
- Data Validation: Ensure vectors have correct dimensionality and numeric types
Scalability Considerations
- Memory Usage: 15-dimensional vectors are compact yet expressive
- Search Performance: PostgreSQL pgvector scales to millions of games with sub-linear complexity
- Hybrid Storage: PostgreSQL ACID properties for games, Redis speed for users
- Update Strategy: Batch PostgreSQL updates for games, real-time Redis updates for users
- Distributed Computing: Can be extended to PostgreSQL replicas and Redis clusters
Query Performance Metrics
Real-world performance measurements:
- Vector similarity search: 1-3ms for top 25 recommendations from 23K+ games
- Index effectiveness: ~200 inner products calculated vs. 23,000 brute force
- Accuracy: IVFFlat with probes=10 provides near-exact results
- Throughput: Handles hundreds of concurrent recommendation requests
Critical Implementation Fix: Inner Product vs L2 Distance
Previous Implementation Issue:
The system was initially using L2 distance (<->) for similarity ranking:
```sql
-- INCORRECT: Using geometric distance instead of predicted rating
ORDER BY vector_15 <-> '[user_vector]'
```
This approach was mathematically incorrect because:
1. L2 distance measures geometric proximity, not rating prediction quality
2. Games could be "close" in vector space but have low predicted ratings
3. Results were meaningless (e.g., "My Little Pony Hide & Seek" as top recommendation)
Corrected Implementation:
Now using inner product (<#>) for proper collaborative filtering:
```sql
-- CORRECT: Using inner product to maximize predicted rating
ORDER BY vector_15 <#> '[user_vector]'
```
Impact of the Fix:
- Before: Random, nonsensical recommendations
- After: High-quality recommendations like Gloomhaven series, Pandemic Legacy
- Mathematical alignment: Now properly implements the SVD prediction formula
- User experience: Recommendations went from meaningless to highly relevant
This fix demonstrates the critical importance of using the correct mathematical operation for the underlying model - geometric similarity ≠ collaborative filtering prediction.
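A two-dimensional toy example makes the failure mode concrete: the game nearest by L2 distance is not the game with the highest predicted rating.

```python
import numpy as np

user = np.array([1.0, 0.0])
game_a = np.array([1.0, 0.1])   # geometrically close, low predicted rating
game_b = np.array([5.0, 0.0])   # geometrically far, high predicted rating

l2 = {name: float(np.linalg.norm(user - g))
      for name, g in [("a", game_a), ("b", game_b)]}
dot = {name: float(user @ g)
       for name, g in [("a", game_a), ("b", game_b)]}

print(min(l2, key=l2.get))   # L2 distance would recommend game "a"
print(max(dot, key=dot.get)) # inner product correctly recommends game "b"
```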
Mathematical Intuition
What Do the Factors Represent?
While the 15 factors are learned automatically, they often correspond to interpretable game characteristics:
- Factor 1: Strategy vs. Luck
- Factor 2: Game Complexity
- Factor 3: Player Interaction Level
- Factor 4: Game Duration
- Factor 5: Theme Preference (Fantasy vs. Historical)
- etc.
Users with positive values for "Strategy" factor will be recommended strategy games, while those with negative values might prefer luck-based games.
Prediction Accuracy
The system's effectiveness comes from:
1. Large Dataset: Millions of ratings provide a robust statistical foundation
2. Appropriate Dimensionality: 15 factors capture complexity without overfitting
3. Quality Metrics: RMSE tracking ensures model performance
4. Cross-Validation: Train/test splits validate generalization
This mathematical foundation enables our system to provide personalized, accurate recommendations while remaining computationally efficient and scalable.
Myers-Briggs Type Indicator for Board Games
Psychological Background
The Myers-Briggs Type Indicator (MBTI) is a widely used personality assessment tool, developed by Katharine Briggs and Isabel Myers based on Carl Jung's theory of psychological types. The MBTI categorizes personalities using four binary dimensions:
- Extraversion (E) vs. Introversion (I): How you direct your energy
- Sensing (S) vs. Intuition (N): How you process information
- Thinking (T) vs. Feeling (F): How you make decisions
- Judging (J) vs. Perceiving (P): How you approach the outside world
These four binary choices create 16 distinct personality types (like ENFP, ISTJ, etc.), each representing a unique combination of cognitive preferences and behavioral tendencies.
Application to 4-Feature Models
The Myers-Briggs framework can be applied to any system with exactly 4 latent features that can be reduced to binary characteristics. In our board game recommendation system, when using 4 features, we create a gaming personality type by:
- Calculating the mean for each of the 4 features across all users
- Comparing each user's feature values to these population means
- Creating a 4-bit binary pattern where each bit represents whether the user is above (1) or below (0) the mean for that feature
- Converting the binary pattern to a readable 4-letter code using our mapping function
This approach transforms continuous preference data into discrete personality categories, making user preferences more interpretable and comparable.
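The steps above can be sketched in Python. The population means, the user's values, and the choice of which letter maps to above-mean vs. below-mean are illustrative assumptions, not the production mapping:

```python
import numpy as np

# Population means for the 4 features and one user's values (illustrative).
feature_means = np.array([0.0, 0.1, -0.2, 0.3])
user = np.array([-0.5, 0.4, -0.1, 0.1])

# Letter pairs per dimension as (below mean, above mean) -- the assignment
# of each letter to "above" or "below" is an assumption for this sketch.
letters = [("S", "C"), ("A", "I"), ("V", "M"), ("E", "O")]

bits = user > feature_means  # 4-bit above/below-mean pattern
code = "".join(pair[int(bit)] for pair, bit in zip(letters, bits))
print(code)  # SIME
```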
The Four Board Game Personality Dimensions
Based on analysis of games with the highest and lowest values for each feature in our 4-feature model, we can infer the psychological dimensions these factors represent:
S/C - Simple vs. Complex
- S (Simple): Light, accessible games with straightforward rules and quick gameplay
- High games: Party games like "Greasy Spoon", "Meridians", "Tide of Fortune"
- Low games: Heavy campaign games like "Gloomhaven", "Frosthaven", "HATE"
- C (Complex): Deep, intricate games requiring significant time investment and rules mastery
- Values strategic depth and mechanical complexity over accessibility
A/I - Amicable vs. Intense
- A (Amicable): Social, party-oriented games emphasizing fun group interactions
- High games: Party games like "Charades", "Taboo", "Pictionary", "Articulate!"
- Low games: Heavy combat/miniature games like "HATE", "Frosthaven"
- I (Intense): Strategic, competitive games with serious gameplay and deep engagement
- Appeals to those seeking challenging, immersive gaming experiences
V/M - Vintage vs. Modern
- V (Vintage): Traditional games with classic, time-tested designs
- High games: Classic games like "Tic-Tac-Toe", "Busen Memo", "Magic Realm"
- Low games: Contemporary Euro-style games like "Ark Nova"
- M (Modern): Contemporary games featuring innovative mechanics and current design philosophy
- Values modern production quality and evolved gameplay mechanics
O/E - Offensive vs. Evasive
- O (Offensive): Direct conflict games with high player interaction and confrontation
- High games: Conflict-driven games like "Oath", "Chess", "Root"
- Low games: Solo-capable games like "Unbroken", low-interaction games
- E (Evasive): Games minimizing direct conflict, focusing on indirect competition or solo play
- Prioritizes peaceful gameplay and personal achievement over confrontation
Gaming Personality Types
This system generates 16 distinct gaming personalities:
- SAVO: Simple-Amicable-Vintage-Offensive - Classic party games with direct competition (e.g., Charades)
- SAVE: Simple-Amicable-Vintage-Evasive - Traditional family games with indirect competition (e.g., Bingo)
- SAMO: Simple-Amicable-Modern-Offensive - Modern party games with player interaction (e.g., Exploding Kittens)
- SAME: Simple-Amicable-Modern-Evasive - Contemporary casual games with minimal conflict (e.g., Sushi Go!)
- SIVO: Simple-Intense-Vintage-Offensive - Classic competitive games requiring focus (e.g., Chess)
- SIVE: Simple-Intense-Vintage-Evasive - Traditional solo puzzles and brain teasers (e.g., Solitaire)
- SIMO: Simple-Intense-Modern-Offensive - Modern quick competitive games (e.g., Love Letter)
- SIME: Simple-Intense-Modern-Evasive - Contemporary solo/puzzle games (e.g., Sagrada)
- CAVO: Complex-Amicable-Vintage-Offensive - Classic social strategy games (e.g., Diplomacy)
- CAVE: Complex-Amicable-Vintage-Evasive - Traditional cooperative complex games (e.g., Bridge)
- CAMO: Complex-Amicable-Modern-Offensive - Modern social deduction and negotiation games (e.g., Secret Hitler)
- CAME: Complex-Amicable-Modern-Evasive - Contemporary cooperative Euro games (e.g., Pandemic)
- CIVO: Complex-Intense-Vintage-Offensive - Classic war and conquest games (e.g., Risk)
- CIVE: Complex-Intense-Vintage-Evasive - Traditional heavy solo experiences (e.g., Magic Realm)
- CIMO: Complex-Intense-Modern-Offensive - Modern competitive heavy games (e.g., Root)
- CIME: Complex-Intense-Modern-Evasive - Contemporary heavy Euro games (e.g., Ark Nova)
Each combination reveals a unique gaming personality profile, helping to explain why certain games resonate with specific players and enabling more targeted recommendations based on deeper psychological preferences rather than just rating patterns.