Mastering Feature Reduction: How mRMR Helps Machine Learning Models Cut Through the Noise

A Little Introduction

Imagine trying to find a needle in a haystack, but instead of one needle, you have hundreds or thousands of them, and each one looks almost identical. That's what it can feel like trying to select the most relevant features from a large pool of variables for machine learning models. Fortunately, techniques like minimum Redundancy Maximum Relevance (mRMR) can help. In this article, we will explore the power of mRMR for feature reduction and how it can improve the efficiency and accuracy of machine learning models.

Feature reduction is a critical task in machine learning that involves selecting a subset of relevant features from a large pool of input variables. Its purpose is to improve the efficiency and accuracy of machine learning models by eliminating irrelevant or redundant features. Selecting relevant features can be challenging, especially with high-dimensional datasets, that is, data in which the number of features (the variables observed) is close to or larger than the number of observations.

There are several feature reduction techniques available, including

Principal component analysis
Linear discriminant analysis
Information gain-based approaches
minimum Redundancy Maximum Relevance (mRMR)

Among these methods, mRMR has emerged as a powerful technique for feature selection (not necessarily the best method but it's quite efficient). It identifies the subset of features that have the highest relevance to the target variable while minimizing the redundancy between features.

How mRMR Works under the Hood

In this section, we'll explore the intricacies of the mRMR algorithm and discover how it works from the inside out. If you're only interested in the black-box version of mRMR, i.e. you just want to see how it performs in code, you can skip to the next section.

It's All in the Name

It's important to remember the full meaning of mRMR is minimum Redundancy Maximum Relevance and I'm still not sure why it's capitalized in this manner. This is important because the algorithm uses two criteria, namely

Maximum relevance, and
Minimum redundancy

Maximum relevance ensures that the selected features have the highest correlation with the target variable. In contrast, minimum redundancy ensures that the selected features are dissimilar or non-redundant with each other. These two criteria work together to select a subset of features that are both informative and diverse.

Measures and Whatnots

To work with these two criteria, we need a way to measure them, and each criterion has its own way of doing that.

Relevance: Any measure of how informative a feature is about the target can be used here, such as Information Gain (IG). IG quantifies the amount of information each feature provides about the target variable.
Redundancy: The redundancy between features is measured using the correlation between them or a mutual information (MI) measure, which quantifies the amount of information shared between two features.
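
For discrete variables, both measures can be written with the same formula, because the information gain of a feature about the target is the mutual information between the two. The block below is a sketch in that mutual-information formulation. The symbols f (a candidate feature), y (the target) and S (the set of features already selected) are notation introduced here for illustration, and a given library may substitute other statistics, such as a correlation, for either term.

```latex
I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)} = H(Y) - H(Y \mid X)

\mathrm{relevance}(f) = I(f; y), \qquad
\mathrm{redundancy}(f \mid S) = \frac{1}{|S|}\sum_{s \in S} I(f; s)
```

A feature scores well when its relevance is high and its average redundancy with the features already chosen is low.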

The Iterative Process

Like the judge in a competition, the mRMR algorithm begins by ranking the features according to their IG scores, with the highest-scoring features considered to be the most relevant. It then iteratively adds the highest-ranking features to the selected feature subset while ensuring that the selected features are non-redundant to each other using the MI measure. This process continues iteratively until the desired number (k) of features is selected, or until the subset of features stops improving the performance of the machine-learning model.
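
To make the loop concrete, here is a minimal sketch of a greedy mRMR selection in plain Python. It is not the implementation used by any particular package: it assumes discrete features, uses mutual information for both relevance and redundancy, and scores each candidate as relevance minus average redundancy (one common variant). The names mutual_information, greedy_mrmr, toy_features and toy_target are made up for this illustration.

```python
from collections import Counter
from math import log


def mutual_information(a, b):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in count_ab.items()
    )


def greedy_mrmr(features, target, k):
    """Greedily pick k feature names from a {name: values} dict."""
    # Relevance of every feature to the target is computed once up front
    relevance = {name: mutual_information(col, target) for name, col in features.items()}
    selected, remaining = [], list(features)
    while remaining and len(selected) < k:
        def score(name):
            # The first pick is based on relevance alone
            if not selected:
                return relevance[name]
            # Later picks are penalised by average redundancy with earlier picks
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: the target is fully determined by x1 and x2 together
toy_features = {
    "x1": [0, 0, 0, 0, 1, 1, 1, 1],
    "x1_copy": [0, 0, 0, 0, 1, 1, 1, 1],  # exact duplicate of x1
    "x2": [0, 0, 1, 1, 0, 0, 1, 1],
    "noise": [0, 1, 0, 1, 0, 1, 0, 1],    # independent of the target
}
toy_target = [0, 0, 1, 1, 2, 2, 3, 3]

selected = greedy_mrmr(toy_features, toy_target, k=2)
print(selected)  # ['x1', 'x2']
```

Notice that x1_copy is just as relevant as x1, yet it loses the second slot to x2 because it is completely redundant with a feature that was already picked.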

Let's Get Coding

In this section, we'll use a case study with a real dataset to see how to use the mRMR algorithm with Python. You can access the full code in my Kaggle Notebook. Note, however, that it's a complete project and not only about feature reduction.

First Things First

The dataset has 48 features after preprocessing that contribute to the prediction of ovarian cancer and it needs to be reduced for efficient training.

To start with, you need to pip install the Python package which you can do using the code below from your terminal interface.

pip install mrmr-selection

After the package is installed, we can start working with it. Next up is importing the package. You can check the documentation of the package here for more.

from mrmr import mrmr_classif

How Many is Enough?

We know that we need to reduce the number of features, but how many features do we need to get the most out of our model? As with most things in machine learning, knowing how many features are just enough is more art than science. You have to keep experimenting until you get the right amount. For this case, we can run an analysis over every possible number of features to find which one works best, but before that, let's see a sample of mRMR in action.

# select top 10 features using MRMR

selected_features = mrmr_classif(X=cancer_X_train, y=cancer_y_train, K=10)

print(selected_features)

['Age', 'CEA', 'IBIL', 'NEU', 'Menopause', 'CA125', 'ALB', 'HE4', 'GLO', 'LYM%']

After doing its magic, the mRMR method gives us the features listed above, and we'll use them to train a Decision Tree Classifier model. All necessary preprocessing has already been done and won't be covered in this article. Again, you can check it in the Kaggle notebook.

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# create decision tree classifier
clf = DecisionTreeClassifier(random_state=0)

# fit the model to the training data, using only the selected features
clf.fit(cancer_X_train[selected_features], cancer_y_train)

Then we evaluate it to see how well our 10-feature model performed. Don't worry, we'll still get to see how the model would perform if we didn't apply mRMR at all.

# predict on the test set, then evaluate the predictions
y_pred = clf.predict(cancer_X_test[selected_features])
print("Decision Trees:")
print(classification_report(cancer_y_test, y_pred))

Decision Trees:
              precision    recall  f1-score   support

           0       0.78      0.66      0.71        53
           1       0.70      0.81      0.75        52

    accuracy                           0.73       105
   macro avg       0.74      0.73      0.73       105
weighted avg       0.74      0.73      0.73       105

That's the performance with just 10 features, but we don't know whether 10 features give the best performance, which leads us to the next item on the agenda.

The Iterative Process Again

Starting from 2 features, we'll work our way up to all 48, recording the accuracy of a model trained on each number of features (P.S. it will take a while). The code is shown below.

import pandas as pd

# Collect one row per feature count (DataFrame.append was removed in pandas 2.0)
rows = []

# Loop over the number of features and train decision tree models on each subset
for n in range(2, 49):
    # Select the top n features with mRMR
    subset_features = mrmr_classif(X=cancer_X_train, y=cancer_y_train, K=n)

    # Train a decision tree model on the selected subset of features
    model = DecisionTreeClassifier()
    model.fit(cancer_X_train[subset_features], cancer_y_train)

    # Evaluate the model on the test set and store the accuracy
    accuracy = model.score(cancer_X_test[subset_features], cancer_y_test)
    rows.append({'Number of Features': n, 'Accuracy': accuracy})

# Build a dataframe of accuracy against the corresponding number of features
accuracy_df = pd.DataFrame(rows)

The code snippet stores each number of features and its accuracy in a DataFrame, which is in a perfect state for us to visualize with matplotlib. Don't forget the imports: import matplotlib.pyplot as plt and import numpy as np (used for the y-axis ticks).

# plot the accuracy against number of features
plt.plot(accuracy_df['Number of Features'], accuracy_df['Accuracy'])
plt.title('Accuracy vs. Number of Features')
plt.xlabel('Number of Features')
plt.ylabel('Accuracy')
plt.yticks(np.arange(0.6, 0.9, 0.025))
plt.grid(True)
plt.show()

As seen, when the features are reduced to 24, we get an accuracy of about 85%. If we had gone with our arbitrary choice of 10, we would have had an accuracy (76%) that was lower than if we hadn't applied mRMR at all (80%).
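
Rather than reading the best point off the plot, you can also pull it straight out of the dataframe. The sketch below uses a tiny stand-in dataframe holding only the three approximate figures quoted above, and it assumes the "no mRMR" result corresponds to the 48-feature row. The real accuracy_df has one row per feature count. The names best_row and best_n are introduced for this illustration.

```python
import pandas as pd

# Stand-in for accuracy_df, holding only the approximate accuracies quoted in the text
# (assumption: the "no mRMR" figure is the run with all 48 features)
accuracy_df = pd.DataFrame({
    'Number of Features': [10, 24, 48],
    'Accuracy': [0.76, 0.85, 0.80],
})

# idxmax returns the index label of the highest accuracy; loc fetches that row
best_row = accuracy_df.loc[accuracy_df['Accuracy'].idxmax()]
best_n = int(best_row['Number of Features'])

print(best_n, best_row['Accuracy'])
```

Keep in mind that a feature count chosen this way is tuned to the test set, so it's worth confirming on data the model hasn't seen.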

The Verdict and Conclusion

Our case study shows that feature reduction can help improve the performance of your ML model, and along the way we explored how the mRMR feature reduction method works and how to use it.

Cons of mRMR

mRMR is great, as we've come to know, but it does have its downsides.

It can be computationally expensive, particularly when dealing with large datasets.
It scores features one at a time and only measures redundancy between pairs of features, so it does not consider how features work together. This can lead it to overlook features that are only informative in combination, or to keep ones that add little, resulting in suboptimal model performance.

Alternatives to mRMR

There are several alternative techniques for feature reduction that can overcome these limitations.

Principal Component Analysis (PCA) is a popular technique that reduces the dimensionality of a dataset by transforming the original variables into a new set of uncorrelated variables called principal components. PCA can efficiently handle large datasets and reduce computational complexity.
Another technique, Recursive Feature Elimination (RFE), eliminates features iteratively based on their contribution to the model's performance. This technique can be beneficial when dealing with small to medium-sized datasets.
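
To see the difference between the two, here is a minimal sketch using scikit-learn on synthetic data. The dataset, the choice of 5 components/features and the decision tree estimator are arbitrary assumptions for illustration, not part of the case study above.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 200 samples, 20 features, 5 of them informative
X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)

# PCA builds 5 new uncorrelated components out of all 20 original features
X_pca = PCA(n_components=5).fit_transform(X)

# RFE keeps 5 of the original features, dropping the weakest one each round
rfe = RFE(estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=5)
X_rfe = rfe.fit_transform(X, y)

print(X_pca.shape, X_rfe.shape)
print(rfe.support_)  # boolean mask over the original 20 columns
```

The practical difference is that PCA hands you new, transformed variables, while RFE (like mRMR) keeps a subset of the original columns, which is easier to interpret.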

Thank you for reading this article, you're a superhuman and you deserve a bottle of vodka or whatever these kids are having

[Image: three sisters at home eating breakfast, looking at the camera]
