Probability Distributions

Probability distributions are mathematical functions that describe how likely different outcomes are for a random variable. They provide the formal language for uncertainty in machine learning.

Key Distributions in ML:

  • Gaussian (Normal): Characterized by mean mu and variance sigma^2, this distribution models natural phenomena and measurement errors. It's the default assumption in many algorithms due to the Central Limit Theorem.
  • Bernoulli: Models binary outcomes with probability p of success. Fundamental for classification tasks and click-through prediction.
  • Poisson: Models count data and rare events with rate parameter lambda. Useful for modeling website traffic, customer arrivals, or defect counts.
  • Uniform: Equal probability across all outcomes in a fixed range. Often used for weight initialization and as an uninformative prior when no other information is available.
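The four distributions above can be sketched with NumPy's random generator; this is an illustrative check (parameter values are arbitrary examples, not from the text) that each distribution's sample mean matches its theoretical mean:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Gaussian(mu=0, sigma=1): sample mean should be ~0
gauss = rng.normal(loc=0.0, scale=1.0, size=n)
# Bernoulli(p=0.3): sample mean should be ~p
bern = rng.binomial(n=1, p=0.3, size=n)
# Poisson(lambda=4): mean and variance should both be ~4
pois = rng.poisson(lam=4.0, size=n)
# Uniform on [0, 1): sample mean should be ~0.5
unif = rng.uniform(low=0.0, high=1.0, size=n)

print(f"Gaussian mean:  {gauss.mean():.3f}")
print(f"Bernoulli mean: {bern.mean():.3f}")
print(f"Poisson mean:   {pois.mean():.3f}, var: {pois.var():.3f}")
print(f"Uniform mean:   {unif.mean():.3f}")
```

Note that the Poisson sample shows mean roughly equal to variance, a signature property worth checking when deciding whether count data is genuinely Poisson.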

Practical Distribution Selection: Before building models, visualize your data with histograms and Q-Q plots to identify the underlying distribution. If your data is roughly normal, linear models tend to work well. For right-skewed data (like incomes), consider a log-transform or Gamma/exponential models. For count data, Poisson-based models often fit better than forced normal approximations. Matching the model to the data's distribution generally improves fit because it respects the data's inherent probability structure.
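The skew-diagnosis step above can be sketched numerically: a minimal example, using synthetic lognormal samples as a stand-in for right-skewed income data (the data and parameters are hypothetical), showing that sample skewness is large before a log-transform and near zero after it:

```python
import numpy as np

rng = np.random.default_rng(1)

def skewness(x):
    # Sample skewness: the third standardized moment.
    # Near 0 for symmetric data; large and positive for right-skewed data.
    x = np.asarray(x, dtype=float)
    return np.mean(((x - x.mean()) / x.std()) ** 3)

# Hypothetical right-skewed "income" data: lognormal samples
incomes = rng.lognormal(mean=10.0, sigma=0.8, size=50_000)

raw_skew = skewness(incomes)          # strongly positive: right-skewed
log_skew = skewness(np.log(incomes))  # near zero: log restores symmetry

print(f"raw skew: {raw_skew:.2f}")
print(f"log skew: {log_skew:.2f}")
```

If the log-transformed skewness is close to zero, as here, a log-transform (or a Gamma/lognormal model) is a reasonable choice before reaching for normal-theory methods.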