Trees and forests
Machine Learning (ML) excels at finding patterns in data (unsupervised learning) and making predictions (supervised learning). In supervised learning, each row of data has a label, a numerical value that we wish to predict (that’s called a regression problem) or a class to assign the data to (that’s a classification problem). ML models are trained on such data, meaning model parameters are iteratively learned through numerical methods to minimise prediction errors. Model performance is then evaluated on unseen data.
The big question is which models to use, given the tens of ML models that are available. Of course, it is crucial to understand the respective characteristics of each model and on which problem each performs better or worse. Data preparation, including feature engineering, is also key for model performance. However, if you were new to ML or if you wanted to build a quick prototype, is there a model or set of models that could be your first pick?
In a recent article, a broad spectrum of state-of-the-art ML models have been compared on 165 different datasets for classification tasks. In this study one model seems to do pretty well versus others, as shown in the table below. Unsurprisingly, many ML competitions have also shown that this model (when implemented in its most advanced form) features, most of the time, in the top-3 best performers.
Source: Olson, R.S., La Cava, W., Mustahsan, Z., Varik, A., Moore, J.H.
Data-driven advice for applying machine learning to bioinformatics problems. arXiv preprint arXiv:1708.05070 (2017)
That model is called Gradient Tree Boosting (GTB):
- It is based on “decision trees”, i.e. a series of decisions at each node that need to be followed to reach a prediction.
- It is also an “ensemble learning” method, meaning that many hundreds of individual models (decision trees) contribute to the final prediction, as opposed to just one single model.
According to this study, the second-best performer is another ensemble model based on decision trees, called Random Forest (RF) and differs from GTB in the way the trees are built and combine to ultimately make a prediction.
Step 1: Growing a tree
Let’s assume that we want to predict defaulted loans in a banking book. We have access to labelled data, i.e. a dataset of repaid and charged-off loans, and their attributes (loan characteristics, borrower information, credit file, etc.) at loan level. We isolate a portion of that dataset for training purposes. It turns out that there is a very simple way to build a predictive model, called a decision tree.
Let’s start with our training set (the collection of repaid and charged-off loans). This is called the root of the tree. We then iterate over all loan attributes to find which single attribute and which level would reduce the uncertainty the most, i.e. find the splitting rule that reduces misclassification the most. For instance, let’s say annual income > or < $70k is the best split to reduce uncertainty (at this stage our tree predicts that an annual income below $70k means a defaulted loan and above $70k is a repaid one). But we don’t stop there. On the left branch, we repeat the process and for instance find that for that sub-population, the size of the loan at $10k is the best splitting rule. We keep repeating the process for each node until some stopping rule and the decision tree could look like on the picture below.
Hypothetical tree for a credit scorer
When a new loan application arrives, it is easy to apply the decision tree to decide whether the loan is risky or not by answering the appropriate questions (grey nodes).
Step 2: Growing a forest
The problem with using that single tree to make predictions is that it is very likely to be of poor quality. The reason for that is called “high variance”, i.e. predictions may vary greatly depending on the training set used. In other words, the model is “overfitting” the data and therefore will do poorly on test data.
But this is where the magic of ensemble learning happens. When combining many such “weak learners” in a clever way, it is possible to build a strong learner; in most cases, one that would outperform the best single model.
Let’s now introduce bagging and boosting.
Bagging, or Bootstrap aggregation, relies on bootstrapping the data many times and building a new tree on each bootstrapped sample. As a reminder bootstrapping the data (of size n) means taking randomly (uniformly and with replacement) n samples, i.e. some original samples will be omitted and some will be repeated.
Once these hundreds of trees have been built, they all return their own prediction, and the class majority (for classification tasks) or the average output (for regression tasks) forms the ultimate prediction.
Boosting is another powerful ensemble learning technique, that combines weak learners to make a better one. However, the key part is the sequential nature of the algorithms (compared to bagging): trees are built sequentially and learn from the mistakes made by the previous trees. In the end, all trees contribute to the final prediction. For instance, we could give a smaller weight to a tree that did poorly at making a given prediction, lowering its overall importance in the combined prediction.
In other words, what boosting does is to choose training sets for the base learner in such a fashion as to force it to infer something new about the data each time it is called.
Bagged trees, despite the data bootstrapping technique, have many samples of data in common and therefore exhibit some possible strong correlation. The fact that the voting trees look too similar reduces the benefit of bagging. Random Forest brings a clever twist to this issue: on top of looking at bootstrapped samples, each tree is built keeping a subset of dimensions only (the columns in the data). So, trees are built on both a subset of the data and a subset of the dimensions (typically of size equal to the square root of the total number of features). This trick makes the trees much less similar (cf. colours on the chart below) and increase the benefit of bagging, i.e. the reduction in variance to make more accurate predictions.
Bagging technique for Random Forest algorithms
Gradient Tree Boosting
GTB – our overall best performer – is sequential. A first tree is built and the differences between predictions and actual values (called ‘residuals’) are computed. These residuals are used as targets to be predicted by the next tree, forcing this new tree to build splitting rules to minimise such errors. The updated predictions are the combinations of the predictions made by the first two trees (with some ‘learning rate’ to control the speed of learning). The process is repeated until the errors are low enough, or the maximum number of trees has been reached. GTB can be understood as, at each iteration, a modification of the preceding tree, trying to reduce the error rate (hence the soft change in colours on the following chart).
Gradient Tree Boosting algorithm
GTB vs. RF
These two powerful algorithms have exhibited the wisdom of the crowd: when combined cleverly, weak learners can produce strong predictors. However, it is worth noting that RF and GTB work very differently. At a shallow level, RF builds independent trees whilst GTB builds sequential trees. But more fundamentally, it means that RF works on reducing variance (building many low correlation trees to produce a range of outputs), whilst GTB works on reducing bias, i.e. trying to improve the accuracy of predictions at each step.
These concepts are crucial to master in order to handle real world projects. As we know, bias and variance are difficult to reconcile in ML (this is called the ‘bias-variance trade-off’). Their combination determines of how well a model generalizes to unseen data though. High bias means the model is unable to capture the relationship between features and target well: it is underfitting. High variance means the model fits too well the training data. It captures noise in the training data: it is overfitting. Therefore, GTB, despite its superiority, may be prone to overfitting.
To an expert, these concepts may also help understand their best areas of application. For instance, because GTB focuses on iteratively correcting errors, it can be quite powerful in the context of anomaly detection, in other words when the data is significantly imbalanced (the modelling of fraudulent transactions is a good example of such problem).
Also, these models work well if they’re tuned carefully. Hyperparameters are inputs that need to be chosen prior to building the model. It can be shown (cf. the comparative study mentioned earlier) that hyperparameter tuning is key to model performance. In that respect, because of its more sophisticated nature (choices of learning rate and loss function in particular), GTB can be tricky to tune. By comparison, RF mainly has three hyperparameters: the number of trees, their maximum depth and the number of dimensions to consider.
This post is intended to be a gentle introduction to these popular ML models. In real life, they can be a lot more complicated, with many tweaks to their original concepts. It is worth mentioning two variations of GTB:
- In 2016, XGBoost (EXtreme Gradient Boosting) which improves performance and the handling of sparse data, and
- In 2017, LightGBM (Light Gradient Boosting Machine) which uses intelligent sampling to handle very large datasets (a key constraint of sequential models that cannot be easily parallelized).
In subsequent posts, we will cover some other useful AI/ML algorithms. Bridging the knowledge gap in this field helps us in our mission to empower our clients with the tools to understand the emerging data-centric era.