Random Forest-Based Feature Selection: What You Need to Know

While there are numerous machine learning algorithms available today, a few stand out in competitive data science. Random Forests have long been among the algorithms people choose to build strong predictive models. A recent study on Kaggle showed that, after Neural Networks and Gradient Boosting Machines, Random Forests took the highest ranks in competitions. A Random Forest, or RF, is a supervised learning algorithm that can be used for both classification and regression problems.

In this article, we’ll go over simple steps that can improve your model’s predictions through effective feature selection using Random Forests, also known as random forest-based feature selection.

What are Random Forests?

Used more often for classification than for regression, a Random Forest can handle datasets containing categorical variables, continuous variables, or both. It is an ensemble of decision trees built with the bagging method: the training dataset is resampled with replacement (“bootstrap”), and a separate decision tree is built on each sample. The forest then combines the trees’ outputs, using a majority vote for classification and an average for regression. In recent years, Random Forests have turned out to be among the most widely used machine learning algorithms.

Figure: Decision tree vs. Random Forest

A simple example of how scikit-learn lets you build your own Random Forest classifier is given below:

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
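
The snippet above assumes that X_train, y_train, and X_test already exist. A minimal self-contained sketch of the same steps follows; the synthetic dataset from make_classification and the fixed random_state values are assumptions made for illustration and are not part of the original example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real dataset (an assumption for illustration).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

rfc = RandomForestClassifier(n_estimators=100, random_state=0)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)

# score() reports mean accuracy on the held-out split
accuracy = rfc.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```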

As simple as that seems, feature selection is a very important part of building the model. Let’s see how Random Forests can help you choose the best features for your dataset. Note that correlated variables tend to receive similar importance scores, because the trees can split on either one and the importance is shared between them. Tree-derived feature importance is a good way to select features when the final model is also tree-based.

Why use Random Forest-based Feature selection?

Random forests are popular feature selectors thanks to their relatively good accuracy, robustness, and ease of use. They also provide two straightforward measures for feature selection: mean decrease in impurity and mean decrease in accuracy. The following are some of the advantages of selecting features using Random Forests and the impact it can have on the model as a whole.

Eliminates redundant variables
Reduces overfitting
Reduces training time
Improves accuracy of the model

Note: Datasets with only a few variables or features might need all of them to reach good accuracy. Applying feature selection when it is not necessary can hurt your model.
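
The two importance measures mentioned above can be sketched as follows. This is an illustration under stated assumptions: the data is synthetic, and scikit-learn's permutation_importance is used here as the permutation-based counterpart to mean decrease in accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Mean decrease in impurity: derived from the training data while the trees are built.
mdi = forest.feature_importances_

# Mean decrease in accuracy: drop in held-out score when one feature is shuffled.
result = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=0
)
mda = result.importances_mean

for i, (impurity, accuracy_drop) in enumerate(zip(mdi, mda)):
    print(f"feature {i}: MDI={impurity:.3f}  MDA={accuracy_drop:.3f}")
```

Impurity-based scores come for free after fitting, while the permutation-based scores cost extra predictions but are measured on held-out data.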

How to choose the best features for your model using Random Forests?

Using the ensemble module in scikit-learn, one can create a Random Forest model rather easily. We’ll import the classifier from sklearn.ensemble and use Pandas to load the dataset.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

Further, we need some additional imports to assist with feature selection and to split our data into training and testing sets, so that overfitting can be detected on data the model has not seen.

from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel

The scikit-learn library provides a SelectFromModel class for extracting the best features of a given dataset according to their importance weights. This meta-estimator fits the given model, compares each feature’s importance to a threshold value, and keeps the features whose importance meets or exceeds it.

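The following code splits our dataset into training and testing data. Here X holds the feature columns of df and y holds the target column. The test size can be altered as required.

# train_test_split returns the splits in the order X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)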
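
A minimal sketch of separating features from the target before splitting is given below. The iris data and its "target" column stand in for the article's df, which is an assumption; stratify and random_state are optional additions that keep the class balance and make the split repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# load_iris(as_frame=True).frame is a DataFrame with a "target" column;
# it stands in here for the article's df (an assumption).
df = load_iris(as_frame=True).frame

X = df.drop(columns="target")  # feature columns
y = df["target"]               # label column

# stratify=y keeps class proportions similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```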

Let us create an instance of the RF Classifier using the following code:

RF = RandomForestClassifier(n_estimators = 20)

Now, we use the meta-estimator to automatically select the most important features of the dataset:

select = SelectFromModel(RF)
select.fit(X_train, y_train)

By default, the mean importance across all features is used as the threshold: the meta-estimator keeps the features whose importance is at or above it.
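
The threshold behaviour can be checked directly on a fitted selector. The sketch below uses synthetic data (an assumption) and reads the selector's threshold_ and estimator_ attributes to compare the applied threshold with the mean importance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data (an assumption for illustration).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, random_state=0
)

select = SelectFromModel(RandomForestClassifier(n_estimators=20, random_state=0))
select.fit(X, y)

importances = select.estimator_.feature_importances_
print("threshold used: ", select.threshold_)
print("mean importance:", importances.mean())

# Features are kept when their importance is at or above the threshold.
manual_mask = importances >= select.threshold_
print("kept features:", np.flatnonzero(manual_mask))
```

If the mean is too strict or too lenient for your data, SelectFromModel also accepts a threshold argument such as "median" or "1.25*mean", or a max_features limit.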

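An alternative is the SelectKBest class from scikit-learn, which keeps the ‘k’ highest-scoring features. Note that it ranks features with a univariate score function (chi-squared below, which requires non-negative feature values) rather than with Random Forest importances: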

from sklearn.feature_selection import SelectKBest, chi2
select = SelectKBest(score_func=chi2, k=3)
X_new = select.fit_transform(X_train, y_train)
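
A runnable sketch of the SelectKBest alternative follows. The iris dataset is an assumption chosen because all of its measurements are non-negative, which the chi-squared score function requires.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris(as_frame=True)
X, y = iris.data, iris.target  # all iris measurements are non-negative

select = SelectKBest(score_func=chi2, k=3)
X_new = select.fit_transform(X, y)  # array holding only the 3 best-scoring columns

print(X_new.shape)
print(X.columns[select.get_support()].tolist())
```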

Once the selector has been fitted, you can call its get_support() method to identify the chosen features. To get the names of the features selected from the dataset, use the following code.

print(X_train.columns[select.get_support()])
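
The line above works when X_train is a Pandas DataFrame. As a self-contained sketch, assuming the breast cancer dataset bundled with scikit-learn in place of the article's data, the mask from get_support() can be turned into column names and a reduced DataFrame:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Bundled dataset used as a stand-in (an assumption for illustration).
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

select = SelectFromModel(RandomForestClassifier(n_estimators=20, random_state=0))
select.fit(X, y)

mask = select.get_support()                # boolean array, one entry per column
selected_names = X.columns[mask].tolist()  # names of the kept features
X_selected = X.loc[:, mask]                # DataFrame restricted to the kept columns

print(selected_names)
print(X_selected.shape)
```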

Another way to select the best features is Recursive Feature Elimination with Cross-Validation (RFECV). It uses the Random Forest’s feature importances to repeatedly remove the least important features, and uses cross-validation to decide how many features to keep.
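
A minimal sketch of RFECV with a Random Forest as the underlying estimator is shown below. The synthetic data, the 3-fold setting, and the accuracy scoring are assumptions chosen to keep the example quick to run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic data (an assumption for illustration).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, random_state=0
)

rfecv = RFECV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=0),
    step=1,              # drop one feature per iteration
    cv=3,                # 3-fold cross-validation
    scoring="accuracy",
)
rfecv.fit(X, y)

print("number of features kept:", rfecv.n_features_)
print("selection mask:", rfecv.support_)
print("ranking (1 = kept):", rfecv.ranking_)
```

Because the forest is refitted for every candidate feature count and every fold, RFECV is noticeably slower than SelectFromModel on wide datasets.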

Find more about how you can use RFECV to choose the best features: here

What difference does Feature Selection make?

To see the difference it makes in an algorithm before and after using Feature Selection, refer to the following table published in the research paper as cited:

As you can observe, there’s a significant improvement in model performance using a KNN model for classification when feature selection is employed. However, this is only one of many cases, so it is recommended that you use domain expertise and other examples in your domain to decide whether feature selection should be used.
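
To run a comparison of this kind on your own data, a sketch along the following lines may help. It is not the experiment from the cited paper: the dataset, the scaling step, and the 5-fold cross-validation are all assumptions, and whether selection helps depends on the data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled dataset used as a stand-in (an assumption for illustration).
X, y = load_breast_cancer(return_X_y=True)

# Baseline: KNN on all (scaled) features.
baseline = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Same model, with random forest-based selection fitted inside each CV fold,
# so the held-out fold never influences which features are kept.
with_selection = make_pipeline(
    StandardScaler(),
    SelectFromModel(RandomForestClassifier(n_estimators=20, random_state=0)),
    KNeighborsClassifier(),
)

baseline_scores = cross_val_score(baseline, X, y, cv=5)
selected_scores = cross_val_score(with_selection, X, y, cv=5)

print(f"all features:      {baseline_scores.mean():.3f}")
print(f"selected features: {selected_scores.mean():.3f}")
```

Placing the selector inside the pipeline matters: selecting features on the full dataset before cross-validation would leak information from the test folds into the comparison.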

For more information on how the feature selection process works in Random Forests using the aforementioned methods, refer to the following link: Click here

Find more on different methods of feature selection in the following Kaggle notebook by Phil Boaz (reference only): Click here

Conclusion

Random Forest methods are useful and efficient for selecting the important features of a dataset. Feature selection can affect processing time, accuracy, and the number of features the model has to handle. While feature selection using RFs is often useful, do consider feature engineering if you still don’t arrive at the desired accuracy. Nevertheless, for minimal effort, feature selection can improve overall model performance, because in the real world, as noted by many, it’s not just the accuracy but also the applicability of the model that matters, i.e. how production-ready the model is.

Yash Gupta

An eternal learner, I believe Data is the panacea to the world's problems. I enjoy Data Science and all things related to data. Let's unravel this mystery about what Data Science really is, together. With over 33 certifications, I enjoy writing about Data Science to make it simpler for everyone to understand. Happy reading and do connect with me on my LinkedIn to know more!