Decision Tree vs Random Forest: Which one is better in 2022?

February 14, 2022

Decision Tree vs Random Forest is a topic that is often debated in the machine learning community. Both algorithms are supervised machine learning algorithms and are used for both regression and classification tasks. However, both algorithms differ in the way they work and also in terms of performance. They have their own pros and cons. Some people prefer Decision Trees and some prefer Random Forest to Decision Trees for their models. In this article, we shall see the comparison between Decision Tree and Random Forest and also discuss which one should be used in certain instances.

Overview

Decision Tree?
Random Forest?
Decision Tree vs Random Forest: Which one is better?
Decision Tree vs Random Forest: When should you use each algorithm?
Conclusion

Decision Tree

Decision Tree is a supervised machine learning algorithm that is used for both classification and regression tasks. It is a tree-based model and works by constructing a model in the form of a tree structure with decision nodes and leaf nodes. Each decision node consists of two or more branches and each leaf node represents a decision. The topmost decision node is called the root node. Decision trees can deal with both continuous and categorical data.

Advantages

There is no need of much standardizing or normalising data.
It can handle both continuous and categorical data.
It takes less computational time
It also provides clear visualizations at node level.
It is very simple and easy to understand.
Also this algorithm provides feature importance for classification.

Disadvantages

It consumes more memory.
It is prone to overfitting and and gives more erros.
It also shows bias and variance.
Reproducibility is very low.

Random Forest

It is an ensemble of several decision trees generated on randomly split data. A group of trees is called a forest. Since these forests are built on randomly selected data this algorithm is called Random forest. Each individual tree depends on the random sample that is generated independently using several parameters like information gain etc.,

Advantages

It works both on regression and classification tasks.
It doesn’t cause overfitting problem.
It gives high accuracy.
Suitable for large data as dataset is into random samples for training.
Handles missing data easily.
There is no need for normalization of data.

Disadvantages

It takes more computational time as several decision trees are constructed.
It is complex to interpret.
This algorithm has limited capabilities when it comes to regression tasks.

Decision Tree vs Random Forest: Which one is better?

Now that we have seen the brief working of both algorithms let us evaluate these two algorithms and see which one is better. To do this we will create two models, one with a decision tree and the other with a random forest. After creating these two models we will compare the results and see which is better. We will create these two models using the Pycaret library.

Pycaret is a low code autoML library that is used to build and experiment with machine learning models. It also offers a hub of datasets from where we can use sample datasets. We will use one such dataset to see Decision Tree vs Random Forest. Firstly, we will install the Pycaret library using the pip command.

pip install pycaret

Now we will import the necessary modules

from pycaret.datasets import get_data
from pycaret.classification import *

Now that we have imported all the necessary things let’s import the credit dataset, which is about classifying loan defaulters, and create a model.

data = get_data('credit')
exp1 = setup(data = data,
            target = 'default')

We imported the dataset and we created an experimental setup of our training environment. Let’s create a decision tree model.

dt = create_model('dt')

The output of the above command will look like below

1 decision tree vs random forest,random forest vs decision tree accuracy,random forest vs decision tree,decision tree vs random forest performance

Our decision tree model produced an accuracy of 71.99%. Now let’s evaluate this decision tree model and get various plots which we will use to compare later.

evaluate_model(dt)

It produces an HTML rendered interactive graph display where we can see many plots.

Now let’s repeat the same process with Random Forest.

rf = create_model('rf')

2 decision tree vs random forest,random forest vs decision tree accuracy,random forest vs decision tree,decision tree vs random forest performance

Here Random Forest model produced an accuracy of 81.40%. Let’s evaluate to get graphs.

evaluate_model(rf)

The above command produces an interactive widget display to see graphs. Let’s see a Decision Tree vs Random Forest graph analysis. First, let us compare the confusion matrix of each algorithm.

Decision tree vs random forest model accuracy — Decision Tree vs Random Forest: Confusion Matrix

From the above image, we can clearly see that Random Forest outperformed the Decision tree in classifying individual labels. Now, let us see the ROC curves

Decision tree vs random forest performance — Decision Tree vs Random Forest: ROC curves

We can clearly see the area under the ROC curves of Random Forest is higher than the area under ROC curves of Decision Trees. Now let’s see the feature importances created by both algorithms

decision tree vs random forest feature importance — Decision Tree vs Random Forest: Feature Importance plot.

The left side graph is of the Decision tree and the right-side graph is of Random Forest. We can clearly see that the Random Forest algorithm has shown the importance of many features in the dataset better than what the decision tree has shown. To sum up, the performance of the Random Forest algorithm is better than the Decision Tree in all aspects. A brief overview of the main differences between the two algorithms is shown below

	Random Forest	Decision tree
Interpretability	Hard to interpret	Easy to interpret
Accuracy	Highly accurate	Accuracy varies
Overfitting	Less likely to overfit data	highly likely overfit to data
Outliers	Not affected by outliers	Affected by outliers
Computation	Computationally intensive	Computationally very effective

Decision Tree vs Random Forest: Comparision

Also see: Pytorch vs Tensorflow

Decision Tree vs Random Forest: When should you use each algorithm?

Though Random Forest performs better than Decision Trees interpretability is a major concern. So if interpretability is not an issue for you, you can use the Random Forest algorithm. Otherwise, if the dataset is very small and you want good interpretability within less time, then you can use the Decision Trees algorithm. In the end, trying both algorithms to check the performance on the dataset is a wise idea as all datasets are not the same.

Conclusion

In the above discussion, we have seen Decision Tree vs Random Forest evaluation and comparison and found that Random Forest is more effective than Decision Tree. Random Forest is an ensemble technique and is one of the widely used machine learning algorithms. For further readings on this algorithm please go through this tutorial.

Aman Kumar

Data Scientist with 3+ years of experience in building data-intensive applications in diverse industries. Proficient in predictive modeling, computer vision, natural language processing, data visualization etc. Aside from being a data scientist, I am also a blogger and photographer.