Mutual Information
Mutual information quantifies the information shared between two random variables—how much knowing one reduces uncertainty about the other. This concept serves as a fundamental measure of dependence in information theory.
For random variables X and Y, mutual information I(X;Y) is defined as:
I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y)
where H(X|Y) is the conditional entropy of X given Y, and H(X,Y) is the joint entropy.
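As a small worked example, the sketch below computes I(X;Y) for a made-up joint probability table over two binary variables, using the identity I(X;Y) = H(X) + H(Y) - H(X,Y); the probability values are illustrative only.

```python
import numpy as np

# Illustrative joint distribution p(x, y) for two binary variables
# (rows index x, columns index y); the numbers are made up for this example.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability cells."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_x = p_xy.sum(axis=1)   # marginal distribution of X
p_y = p_xy.sum(axis=0)   # marginal distribution of Y

H_x, H_y, H_xy = entropy(p_x), entropy(p_y), entropy(p_xy.ravel())

# I(X;Y) = H(X) + H(Y) - H(X,Y)
mi = H_x + H_y - H_xy
print(f"H(X)={H_x:.3f}  H(Y)={H_y:.3f}  H(X,Y)={H_xy:.3f}  I(X;Y)={mi:.3f} bits")
```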
Mutual information has several important properties:
- I(X;Y) ≥ 0 (non-negative)
- I(X;Y) = 0 if and only if X and Y are independent
- I(X;Y) = H(X) if Y completely determines X
- Symmetric: I(X;Y) = I(Y;X)
Unlike Pearson correlation, which measures only linear association, mutual information captures both linear and non-linear relationships between variables, making it a more comprehensive measure of statistical dependence. This property makes it particularly valuable in complex systems where relationships may not follow simple linear patterns.
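As a minimal illustration of that difference, the sketch below assumes a toy quadratic relationship y = x²: the Pearson correlation comes out near zero while scikit-learn's kNN-based mutual information estimate is clearly positive.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=5_000)
y = x ** 2 + rng.normal(scale=0.01, size=x.size)  # purely non-linear dependence

# Pearson correlation is near zero because the relationship is symmetric about x = 0 ...
corr = np.corrcoef(x, y)[0, 1]

# ... while the kNN-based mutual information estimate is clearly positive.
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

print(f"Pearson correlation: {corr:.3f}")    # approximately 0
print(f"Estimated I(X;Y):    {mi:.3f} nats") # substantially greater than 0
```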
Feature selection is one of the most important practical applications of mutual information in machine learning and data science: it uses information-theoretic scores to identify which features carry the most relevant information for a prediction task.
Basic Approach: By calculating mutual information between each feature and the target variable, we can rank features by how much information they carry about the target, without assuming any particular form of relationship. Correlation-based ranking would overlook purely non-linear associations that this approach still detects.
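A minimal sketch of this ranking, assuming scikit-learn and a synthetic dataset from make_classification, could look like this:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic classification data: a handful of informative features among irrelevant ones.
X, y = make_classification(n_samples=2_000, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

# Estimate I(feature; target) for every column and rank features by the score.
mi_scores = mutual_info_classif(X, y, random_state=0)
for idx in np.argsort(mi_scores)[::-1]:
    print(f"feature {idx}: MI = {mi_scores[idx]:.3f}")
```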
Methods and Algorithms:
- Filter Methods: Select features based purely on mutual information scores before any modeling
- Information Gain: Common in decision trees, measuring reduction in entropy after splitting on a feature
- Conditional Mutual Information: I(X;Y|Z) measures the information a candidate feature X shares with the target Y given the already-selected features Z, so it identifies variables that add information beyond what has already been chosen
- Minimum Redundancy Maximum Relevance (mRMR): Balances feature relevance with redundancy among selected features
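To make the mRMR idea concrete, here is a rough greedy sketch of its "difference" form (relevance minus mean redundancy), built on scikit-learn's MI estimators; it assumes continuous features and makes no claim to match any particular published mRMR implementation.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_select(X, y, k, random_state=0):
    """Greedy mRMR (difference form): repeatedly pick the feature that maximizes
    relevance I(feature; target) minus its mean MI with the features chosen so far.
    Assumes continuous features; a sketch, not an optimized implementation."""
    relevance = mutual_info_classif(X, y, random_state=random_state)
    selected, remaining = [], list(range(X.shape[1]))

    while len(selected) < k and remaining:
        best_j, best_score = None, -np.inf
        for j in remaining:
            if selected:
                # Mean MI between candidate j and each already-selected feature.
                redundancy = np.mean([
                    mutual_info_regression(X[:, [s]], X[:, j],
                                           random_state=random_state)[0]
                    for s in selected
                ])
            else:
                redundancy = 0.0
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Example usage: pick the 3 most relevant, least redundant feature indices.
# selected = mrmr_select(X, y, k=3)
```

mRMR is also often stated in a quotient form (relevance divided by mean redundancy); swapping the score line gives that variant.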
Advantages:
- Captures non-linear relationships missed by correlation-based methods
- Applicable to both classification and regression problems
- Makes no parametric assumptions about the data distribution (though estimating it from finite samples still requires discretization or a kNN/density estimator)
- Can handle mixed data types (continuous and categorical)
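For instance, scikit-learn's MI estimators accept a discrete_features argument to mark which columns are categorical; the toy data below (one continuous column, one integer-coded categorical column) is purely illustrative.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 1_000
x_cont = rng.normal(size=n)            # continuous feature
x_cat = rng.integers(0, 3, size=n)     # categorical feature, integer-coded
y = (x_cont + x_cat > 1).astype(int)   # toy binary target

X = np.column_stack([x_cont, x_cat])

# Flag column 1 as discrete so it is treated categorically rather than
# run through the continuous kNN estimator.
scores = mutual_info_classif(X, y, discrete_features=[1], random_state=0)
print(scores)
```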
This information-theoretic approach to feature selection helps build parsimonious but powerful predictive models by identifying the most informative variables while avoiding redundancy—ultimately improving model interpretability, reducing overfitting, and accelerating training.