ML process
This page explains the full machine learning pipeline in plain English: what data we collect, how we prepare it, what models we train, how we test them, and how those predictions finally reach the website.
What machine learning means here
We use past football matches to teach models how combinations of signals usually lead to goals, clean sheets, and results.
What success looks like
A good system predicts future matches honestly, handles uncertainty well, and does not fake confidence by leaking future information.
Simple version
1. Feed in old matches
We show the system many past matches with all the pre-match clues we knew at the time.
2. Learn patterns
The model learns how different patterns tend to lead to goals, clean sheets, and match outcomes.
3. Apply to new fixtures
For an upcoming match, we build the same type of input row and let the trained model estimate what is likely to happen.
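To make those three steps concrete, here is a minimal Python sketch. Every feature name and number in it is invented for illustration; they stand in for the real pre-match signals described in the rest of this page.

```python
import pandas as pd
from xgboost import XGBRegressor

# Step 1: old matches, described only by clues that existed before kickoff.
past_matches = pd.DataFrame({
    "home_form_xg": [1.8, 0.9, 2.1, 1.2],   # made-up pre-match signals
    "away_form_xg": [1.1, 1.4, 0.7, 1.6],
    "elo_diff":     [120, -40, 200, -10],
    "home_goals":   [2, 1, 3, 1],            # what actually happened
})

# Step 2: learn how those pre-match clues relate to goals.
features = ["home_form_xg", "away_form_xg", "elo_diff"]
model = XGBRegressor(n_estimators=200, max_depth=5)
model.fit(past_matches[features], past_matches["home_goals"])

# Step 3: build the same type of input row for an upcoming fixture.
upcoming = pd.DataFrame({"home_form_xg": [1.5], "away_form_xg": [1.0], "elo_diff": [80]})
print(model.predict(upcoming[features]))  # estimated home attacking output
```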
Pipeline walkthrough
1. Collect the raw football information
We first gather the facts. This is the stage where we collect match results, expected goals, shots, team form, referee context, weather, and fixture information from the sources that power the project.
2. Clean and align the data
Raw data is messy. Team names can differ between providers, dates can drift, and some matches can be missing fields. Before machine learning can happen, all of that has to be lined up properly.
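As a small taste of what that alignment involves, here is a hedged sketch of provider name normalisation. The alias entries are examples, not the project's real lookup table.

```python
# Hypothetical examples of how the same club can appear across providers.
TEAM_ALIASES = {
    "Man Utd": "Manchester United",
    "Man United": "Manchester United",
    "Spurs": "Tottenham Hotspur",
    "Wolves": "Wolverhampton Wanderers",
}

def canonical_team(name: str) -> str:
    """Map any provider spelling onto one canonical club name."""
    cleaned = name.strip()
    return TEAM_ALIASES.get(cleaned, cleaned)

print(canonical_team("Man Utd"))  # -> Manchester United
```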
3. Turn football ideas into features
The models do not understand football directly. They only understand numbers. So we convert football ideas into measurable signals called features.
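Here is a hedged sketch of one such feature, a rolling xG form signal. The column name is illustrative; the important detail is the shift, which keeps a match's own result out of its own features.

```python
import pandas as pd

def rolling_form_xg(team_matches: pd.DataFrame, window: int = 5) -> pd.Series:
    """Turn 'recent attacking form' into a number a model can use.

    `team_matches` is assumed to have one row per match, sorted by date,
    with an `xg_for` column. shift(1) excludes the current match, so the
    feature only reflects what was known before kickoff.
    """
    return team_matches["xg_for"].shift(1).rolling(window, min_periods=1).mean()
```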
4. Train specialist models
We do not ask one giant model to predict everything. Instead, we train different models for different jobs, because predicting goals is not the same problem as predicting a clean sheet or a match result.
5. Validate in time order
A machine learning model only matters if it can predict matches it has not seen yet. So we test it in time order, not with random shuffling that would accidentally leak future information.
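In code, the idea is as simple as splitting on date rather than shuffling. The column name below is illustrative.

```python
import pandas as pd

def time_ordered_split(matches: pd.DataFrame, cutoff: str):
    """Train on everything before the cutoff date, test on everything after.

    A random shuffle would mix future matches into training and quietly
    leak information; splitting on date keeps the test set genuinely unseen.
    """
    matches = matches.sort_values("kickoff_date")
    train = matches[matches["kickoff_date"] < cutoff]
    test = matches[matches["kickoff_date"] >= cutoff]
    return train, test
```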
6. Serve predictions to the website
When the website asks for an upcoming match prediction, the backend recreates the same feature set for that fixture, loads the trained model artifacts, and returns ML outputs for display.
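Sketched below is what that request path can look like, assuming the artifacts are stored with joblib. The path, column names, and class order are all illustrative, not the project's actual layout.

```python
import joblib
import pandas as pd

FEATURE_COLUMNS = ["home_form_xg", "away_form_xg", "elo_diff"]  # illustrative contract

def predict_fixture(fixture_features: dict) -> dict:
    """Rebuild the training-time feature row and run the saved model."""
    model = joblib.load("models/results_predictor.joblib")  # hypothetical artifact path
    row = pd.DataFrame([fixture_features])[FEATURE_COLUMNS]  # same names, same order
    home, draw, away = model.predict_proba(row)[0]           # assumed class order
    return {"home_win": float(home), "draw": float(draw), "away_win": float(away)}
```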
The model family
Goals Predictor
Estimate how much attacking output each team is likely to create.
Think of this as a smarter version of asking: how dangerous should each side be in this match?
Clean Sheet Predictor
Estimate the chance that a team finishes the match without conceding.
This is the defender and goalkeeper model.
Results Predictor
Estimate the probabilities of home win, draw, or away win.
This is the outcome model for who gets the result.
What model are we actually using?
The main model family: XGBoost
The project mainly uses XGBoost. XGBoost is a gradient-boosted decision tree system. In simple terms, it builds many small rule-based trees, and each new tree tries to fix mistakes made by the earlier ones.
Why trees fit football data
Football prediction has lots of messy, mixed signals: form, Elo, rest, venue, shot quality, defensive trend, and schedule congestion. Tree models are strong at handling those non-linear relationships without needing the data to behave in a perfectly clean straight-line way.
Why not one giant neural network?
For a project like this, tree models are often a better first choice because they work well on structured tabular data, are easier to debug, and give clearer feature importance and SHAP explanations.
Tree depth and size
Goals predictor
Two XGBoost regressors: one for home output and one for away output.
A depth of 5 or 6 means each tree can make several layers of decisions, but not so many that it memorizes the past. Using hundreds of trees means the final prediction is built from many small corrections rather than one giant leap.
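As a hedged sketch, the pair might be set up like this; the exact hyperparameters are illustrative values within the ranges just described.

```python
from xgboost import XGBRegressor

# Two separate regressors, one per side. Depth and tree count are
# illustrative values within the ranges described above.
home_goals_model = XGBRegressor(max_depth=5, n_estimators=400, learning_rate=0.05)
away_goals_model = XGBRegressor(max_depth=5, n_estimators=400, learning_rate=0.05)

# X_train holds the pre-match feature rows for each past match;
# y_home and y_away hold the attacking output each side actually produced.
# home_goals_model.fit(X_train, y_home)
# away_goals_model.fit(X_train, y_away)
```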
Clean-sheet predictor
Two XGBoost binary classifiers: one for home clean-sheet chance and one for away clean-sheet chance.
This model answers a yes-or-no style question, so it uses classification rather than regression, then calibrates the probabilities so the percentages are more trustworthy.
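A common way to get that combination is scikit-learn's calibration wrapper around an XGBoost classifier, sketched below with random filler data so the example runs on its own.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from xgboost import XGBClassifier

# Random filler standing in for pre-match features and 0/1 clean-sheet labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = rng.integers(0, 2, size=500)

# A yes-or-no classifier wrapped in a calibrator, so the output
# percentages behave more like honest frequencies.
base = XGBClassifier(max_depth=4, n_estimators=300, eval_metric="logloss")
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=3)
calibrated.fit(X, y)
print(calibrated.predict_proba(X[:1])[0, 1])  # calibrated clean-sheet chance
```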
Results predictor
A more complex XGBoost setup for home win, draw, and away win probabilities.
This is the hardest problem because draws are tricky. The project uses a stronger setup here, including draw-aware tuning, class-balance handling, and extra probability blending logic so the model does not just collapse into always preferring wins.
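One common ingredient of class-balance handling is sample weighting, sketched below with filler data. The project's full draw-aware tuning and probability blending are not reproduced here.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight
from xgboost import XGBClassifier

# Filler data: 0 = home win, 1 = draw, 2 = away win. Draws are rarest.
rng = np.random.default_rng(1)
X = rng.normal(size=(600, 8))
y = rng.choice([0, 1, 2], size=600, p=[0.45, 0.25, 0.30])

# Weight samples so the model is not rewarded for ignoring draws.
weights = compute_sample_weight("balanced", y)

model = XGBClassifier(objective="multi:softprob", max_depth=5, n_estimators=400)
model.fit(X, y, sample_weight=weights)
print(model.predict_proba(X[:1]))  # [P(home win), P(draw), P(away win)]
```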
Engineering rules
No future leakage
The model should never get credit for information that would not have existed before kickoff. This is one of the biggest ways ML projects fool themselves.
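One simple defence is an explicit guard in the pipeline. The timestamp column names below are illustrative, not the project's actual schema.

```python
import pandas as pd

def assert_no_future_leakage(feature_rows: pd.DataFrame) -> None:
    """Fail loudly if any feature was computed from data at or after kickoff.

    Assumes each row carries a `kickoff` timestamp and a `features_as_of`
    timestamp recording when its inputs were last observed.
    """
    late = feature_rows[feature_rows["features_as_of"] >= feature_rows["kickoff"]]
    if not late.empty:
        raise ValueError(f"{len(late)} rows use information from kickoff or later")
```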
One feature contract
The same canonical feature set is used across training, validation, and live prediction. That reduces train-serve drift.
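In practice a feature contract can be as small as one shared list that both the training job and the live API import. The names here are illustrative.

```python
# One canonical list, imported by training and serving alike, so the
# model always sees the same features in the same order.
FEATURE_COLUMNS = [
    "home_form_xg",
    "away_form_xg",
    "elo_diff",
    "home_rest_days",
    "away_rest_days",
]

def to_model_input(row: dict) -> list:
    """Build a model input in contract order, failing fast on missing features."""
    return [row[name] for name in FEATURE_COLUMNS]
```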
Probabilities must be trustworthy
It is not enough to get the winner right sometimes. The percentages also need to be believable and usable.
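A quick way to audit this is a reliability check: group predictions by stated probability and compare with how often the event actually happened. The numbers below are filler.

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Filler outcomes (1 = event happened) and the probabilities we quoted.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])
y_prob = np.array([0.9, 0.2, 0.7, 0.8, 0.3, 0.65, 0.4, 0.75, 0.7, 0.85])

frac_happened, mean_quoted = calibration_curve(y_true, y_prob, n_bins=3)
for quoted, happened in zip(mean_quoted, frac_happened):
    print(f"quoted ~{quoted:.0%}, happened {happened:.0%}")
```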
How to think about the whole project
Data engineering
Collecting, cleaning, aligning, and storing football information from multiple sources so the rest of the stack has trustworthy inputs.
Feature engineering
Translating football concepts like form, shot quality, rest, tactical context, and team strength into machine-readable numbers.
Model training
Teaching specialist models to answer different football questions using leakage-safe historical training windows.
Validation and benchmarking
Checking whether the models genuinely predict future matches well, not just past matches they accidentally memorized.
Live inference
Generating the same features for upcoming fixtures and turning trained model files into live probabilities for the website.
Product delivery
Exposing all of that through usable screens so the predictions are understandable, inspectable, and easy to challenge.
Glossary
This section is here so a reader does not need to already understand machine learning before reading the rest of the page.
Feature
A single input the model uses to make a decision.
For example: recent xG, team rest days, or Elo difference.
Feature contract
The agreed list of inputs, their names, their order, and how they are calculated.
This keeps training data and live prediction data speaking the same language.
Train-serve drift
A mismatch between what the model saw during training and what it receives when used live.
A feature might mean “last 5 matches” in training but “last 3 matches” in production.
Regression
A model that predicts a number on a scale rather than a yes/no answer.
Used for outputs like expected goals.
Classification
A model that predicts a category or class.
Used for questions like home win, draw, or away win.
Probability
The model’s estimate of how likely an event is.
A 65% home win probability means the model thinks that outcome happens roughly 65 times in 100 similar situations.
MAE
Mean Absolute Error. The average size of the model’s mistakes.
Lower is better because it means the model is closer to the truth on average.
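A tiny worked example with made-up goal predictions:

```python
from sklearn.metrics import mean_absolute_error

actual_goals    = [2, 0, 1, 3]
predicted_goals = [1.6, 0.4, 1.2, 2.1]

# Misses are 0.4, 0.4, 0.2 and 0.9, so MAE = 1.9 / 4 = 0.475 goals.
print(mean_absolute_error(actual_goals, predicted_goals))
```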
Accuracy
How often the model gets the category right.
If it correctly predicts 52 results out of 100, accuracy is 52%.
AUC
A score for how well the model separates likely events from unlikely events.
Often used for things like clean-sheet probability. Higher is better.
Calibration
A check on whether the model’s percentages are honest.
If the model says 70% many times, about 70% of those events should really happen.
Baseline
A simpler method we compare the ML model against.
This tells us whether the extra ML complexity is actually worth it.
Leakage
When a model accidentally learns from information that would not have been available before the match.
This makes performance look better than it really is and is one of the biggest ML mistakes.