Statistical Methods & Inference

Statistical methods form the mathematical backbone of data science—providing rigorous frameworks for extracting meaningful conclusions from inherently variable data. These approaches allow us to move beyond merely describing observed patterns to making principled inferences about broader populations, testing specific hypotheses with quantifiable confidence, and distinguishing genuine signals from random noise.

Probability distributions represent mathematical models that describe how random variables behave—mapping possible outcomes to their associated probabilities. The normal (Gaussian) distribution forms perhaps the most fundamental pattern, describing phenomena where many small independent effects combine additively. Its elegant bell curve appears across natural and social systems—from measurement errors and human heights to economic indicators and test scores. This distribution's mathematical properties make it the foundation for countless statistical methods, with its parameters (mean and standard deviation) completely characterizing its behavior.
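To make concrete how the mean and standard deviation fully characterize a normal distribution, here is a minimal sketch using SciPy's norm object; the height figures (mean 170 cm, standard deviation 8 cm) are illustrative assumptions rather than measured data.

```python
from scipy import stats

# A normal distribution is fully specified by its mean and standard deviation.
# The parameters below (170 cm, 8 cm) are illustrative values only.
heights = stats.norm(loc=170, scale=8)

# With those two parameters fixed, any probability question has a definite answer.
print(f"P(height < 180 cm)     = {heights.cdf(180):.3f}")
print(f"P(160 < height < 180)  = {heights.cdf(180) - heights.cdf(160):.3f}")
print(f"95th percentile        = {heights.ppf(0.95):.1f} cm")

# Roughly 68% of values fall within one standard deviation of the mean.
print(f"P(within 1 sd of mean) = {heights.cdf(178) - heights.cdf(162):.3f}")
```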

Beyond the normal, other distributions model different data-generating processes: binomial distributions represent the number of successes across a fixed number of independent binary trials (success/failure outcomes repeated under comparable conditions); Poisson distributions model counts of events occurring at a constant average rate (such as website hits or manufacturing defects observed over fixed intervals); exponential distributions describe the waiting times between events in such a process. Understanding which distribution naturally models your data guides the selection of appropriate statistical tests and modeling approaches. The central limit theorem—one of statistics' most profound results—explains why so many real-world measurements approximate normal distributions even when individual components don't follow Gaussian patterns. This powerful result states that averages of many independent random variables with finite variance tend toward a normal distribution regardless of the variables' original distributions, providing theoretical justification for many statistical methods applied to aggregated data.
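A quick simulation makes the central limit theorem tangible: averages of draws from a strongly skewed exponential distribution still behave like the normal curve the theorem predicts. This is a minimal sketch with NumPy; the scale parameter, sample sizes, and random seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Draw many samples from a strongly skewed (exponential) distribution and average
# each one.  The central limit theorem predicts the sample means will be roughly
# normal even though the raw data are far from Gaussian.
n_samples, sample_size = 10_000, 50
draws = rng.exponential(scale=2.0, size=(n_samples, sample_size))
sample_means = draws.mean(axis=1)

# The CLT predicts mean 2.0 (the exponential's mean) and sd 2.0 / sqrt(sample_size).
predicted_sd = 2.0 / np.sqrt(sample_size)
print(f"mean of sample means: {sample_means.mean():.3f}  (CLT predicts 2.000)")
print(f"sd of sample means:   {sample_means.std(ddof=1):.3f}  (CLT predicts {predicted_sd:.3f})")

# Under a normal distribution, about 95% of values fall within 1.96 sd of the mean.
within = np.abs(sample_means - 2.0) < 1.96 * predicted_sd
print(f"fraction within 1.96 predicted sd: {within.mean():.3f}  (normal predicts ~0.95)")
```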

Hypothesis testing provides a structured framework for evaluating evidence against null hypotheses—default assumptions that typically assert no effect, no difference, or no relationship. This approach frames scientific questions as statistical tests, allowing us to quantify the strength of evidence and make decisions with explicit consideration of uncertainty. The process begins by formulating precise null and alternative hypotheses that clearly define what constitutes a meaningful finding versus the status quo assumption.

Different tests address specific analytical scenarios: t-tests compare means between groups or against reference values; chi-square tests evaluate relationships between categorical variables; ANOVA assesses differences across multiple groups while controlling for overall error rates; non-parametric alternatives like Wilcoxon and Mann-Whitney tests make fewer distributional assumptions, offering robustness when data doesn't follow normal distributions. The p-value—a commonly used but often misunderstood metric—quantifies the probability of observing results at least as extreme as those found if the null hypothesis were true. While a threshold of p < 0.05 has become a conventional benchmark for 'statistical significance,' thoughtful analysis considers not just statistical significance but also effect sizes, confidence intervals, and practical importance. Understanding statistical power—the probability of detecting an effect when one truly exists—helps design adequately sized studies that don't miss important relationships due to insufficient sample sizes.
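As a concrete illustration, the sketch below runs a Welch two-sample t-test on simulated group data, reports an effect size alongside the p-value, cross-checks with a non-parametric Mann-Whitney test, and answers a power question with statsmodels. All of the numbers (group means, spread, target effect size) are made-up assumptions for demonstration, not results from any real study.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower

rng = np.random.default_rng(0)

# Simulated outcomes for a control group and a treatment group (illustrative values).
control = rng.normal(loc=100.0, scale=15.0, size=80)
treatment = rng.normal(loc=106.0, scale=15.0, size=80)

# Welch two-sample t-test: the null hypothesis is that the group means are equal.
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

# Cohen's d complements the p-value by quantifying the magnitude of the difference.
pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
cohens_d = (treatment.mean() - control.mean()) / pooled_sd
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, Cohen's d = {cohens_d:.2f}")

# Non-parametric alternative when normality is doubtful.
_, p_mw = stats.mannwhitneyu(treatment, control, alternative="two-sided")
print(f"Mann-Whitney p = {p_mw:.4f}")

# Power analysis: sample size per group needed to detect d = 0.4 with 80% power.
n_needed = TTestIndPower().solve_power(effect_size=0.4, power=0.8, alpha=0.05)
print(f"required n per group: {n_needed:.0f}")
```

Reporting the effect size and a power calculation alongside the p-value reflects the point above: statistical significance on its own says little about practical importance or about whether the study was large enough to detect it.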

Confidence intervals provide a range of plausible values for population parameters based on sample data, quantifying the precision of our estimates with explicit uncertainty bounds. Unlike point estimates that suggest deceptive precision, these intervals acknowledge sampling variability and communicate the range within which the true value likely falls. A 95% confidence interval means that if we repeated our sampling process many times, about 95% of the resulting intervals would contain the true population parameter.
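The repeated-sampling interpretation can be checked directly by simulation: draw many samples from a population whose mean is known, build a 95% interval from each, and count how often those intervals capture the true mean. The population parameters, sample size, and repeat count below are arbitrary illustration values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Simulate the repeated-sampling interpretation of a 95% confidence interval.
true_mean, true_sd, sample_size, n_repeats = 50.0, 10.0, 30, 10_000
covered = 0
for _ in range(n_repeats):
    sample = rng.normal(true_mean, true_sd, size=sample_size)
    low, high = stats.t.interval(0.95, df=sample_size - 1,
                                 loc=sample.mean(), scale=stats.sem(sample))
    covered += low <= true_mean <= high   # did this interval contain the truth?

print(f"coverage: {covered / n_repeats:.3f}  (theory says about 0.95)")
```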

This approach shifts focus from binary significance decisions toward estimating effect magnitudes with appropriate uncertainty—helping assess whether effects are not just statistically significant but substantively important. Narrow intervals indicate precise estimates from adequate samples, while wide intervals suggest greater uncertainty that might necessitate larger samples or different measurement approaches. When confidence intervals include values that would be practically insignificant, they warn against overinterpreting statistical significance alone. These intervals also facilitate comparisons across studies by showing where estimates overlap, providing more nuanced information than simple significance testing. In modern data science, confidence intervals are particularly valuable for communicating prediction uncertainty to decision-makers, helping them understand not just what the model predicts but how confident they should be in those predictions based on available data.
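For communicating effect magnitudes with uncertainty, a bootstrap interval is a convenient, assumption-light sketch: resample each group with replacement, recompute the statistic of interest many times, and take percentiles. The two synthetic groups below stand in for something like an A/B comparison; none of the numbers come from real data.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic metric for two groups (illustrative values only).
group_a = rng.normal(5.0, 2.0, size=120)
group_b = rng.normal(5.8, 2.0, size=120)
observed_diff = group_b.mean() - group_a.mean()

# Bootstrap CI for the difference in means: resample each group with replacement,
# recompute the difference, and take the 2.5th and 97.5th percentiles.
boot_diffs = []
for _ in range(10_000):
    a = rng.choice(group_a, size=len(group_a), replace=True)
    b = rng.choice(group_b, size=len(group_b), replace=True)
    boot_diffs.append(b.mean() - a.mean())
low, high = np.percentile(boot_diffs, [2.5, 97.5])

# Reporting the interval communicates both the effect size and its precision.
print(f"difference in means: {observed_diff:.2f}, 95% CI: ({low:.2f}, {high:.2f})")
```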

Regression analysis reveals how variables relate to each other, modeling how changes in predictor variables correspond to changes in outcome measures. Unlike simple correlation that only measures association strength, regression quantifies specific relationships while controlling for multiple factors simultaneously. Linear regression—the foundation of this approach—models straight-line relationships where each independent variable's effect is captured by a coefficient representing the expected change in the dependent variable for each one-unit increase in the predictor, holding other variables constant.
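A minimal linear-regression sketch with statsmodels shows how coefficients carry the "one-unit increase, holding other variables constant" interpretation. The house-price setup and its generating coefficients are invented purely so the recovered estimates can be compared against known values.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)

# Synthetic example: price (in $1000s) generated from size and age plus noise.
n = 300
size_sqft = rng.normal(1500, 300, size=n)
age_years = rng.uniform(0, 50, size=n)
price = 50 + 0.2 * size_sqft - 1.0 * age_years + rng.normal(0, 20, size=n)

# Ordinary least squares with an intercept.  Each coefficient estimates the expected
# change in price for a one-unit increase in that predictor, holding the other constant.
X = sm.add_constant(np.column_stack([size_sqft, age_years]))
model = sm.OLS(price, X).fit()
print(model.params.round(3))     # compare against the generating values [50, 0.2, -1.0]
print(model.conf_int().round(3)) # 95% confidence intervals for each coefficient
```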

Logistic regression extends this framework to binary outcomes, modeling probability transformations through the logit function to predict categorical results like customer conversions or medical diagnoses. These methods produce interpretable coefficients that quantify not just whether variables are related but by how much and in what direction—making them invaluable for prediction and, when paired with careful study design, for investigating causal mechanisms. Regression diagnostics ensure model validity by checking critical assumptions: linearity verifies that relationships follow straight-line patterns; independence confirms that observations don't influence each other; homoscedasticity checks that error variance remains constant across predictor values; normality examines whether residuals follow Gaussian distributions. These diagnostic tests protect against misleading results by identifying when model assumptions are violated—signaling when transformations, different modeling approaches, or additional variables might be needed. Through this combination of predictive power and interpretable parameters, regression analysis remains among the most versatile and important tools in the data scientist's analytical arsenal.
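The sketch below extends the same workflow in two directions discussed above: a logistic regression whose exponentiated coefficient reads as an odds ratio, and two common diagnostic checks on a linear model's residuals (Breusch-Pagan for constant variance, Shapiro-Wilk for normality). All data are synthetic and the generating relationships are assumptions for illustration.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(6)
n = 500

# --- Logistic regression on a synthetic binary outcome (e.g., conversion). ---
tenure = rng.uniform(0, 10, size=n)
log_odds = -2.0 + 0.5 * tenure                       # assumed generating relationship
converted = (rng.random(n) < 1 / (1 + np.exp(-log_odds))).astype(int)

logit_model = sm.Logit(converted, sm.add_constant(tenure)).fit(disp=0)
# exp(coefficient) is the multiplicative change in the odds per one-unit increase.
print("odds ratio per year of tenure:", np.exp(logit_model.params[1]).round(2))

# --- Diagnostics for a linear model: constant variance and residual normality. ---
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
ols = sm.OLS(y, sm.add_constant(x)).fit()
_, bp_pvalue, _, _ = het_breuschpagan(ols.resid, sm.add_constant(x))
_, shapiro_pvalue = stats.shapiro(ols.resid)
print(f"Breusch-Pagan p = {bp_pvalue:.3f}, Shapiro-Wilk p = {shapiro_pvalue:.3f}")
```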

Bayesian statistics represents a fundamentally different philosophical approach to uncertainty, treating probability as a measure of belief rather than just long-run frequency. This framework explicitly incorporates prior knowledge—formal mathematical representations of what was known before collecting new data—and updates these beliefs through Bayes' theorem as evidence accumulates. Unlike traditional approaches that treat parameters as fixed but unknown constants, Bayesian methods model parameters themselves as random variables with probability distributions.
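Bayes' theorem itself needs nothing more than arithmetic; the short example below updates a prior belief about customer churn after observing a warning signal. Every probability in it is an assumed illustration, not an estimate from real data.

```python
# Bayes' theorem on a small discrete example (all numbers are illustrative):
# a prior belief that a user will churn, updated after observing a warning signal.
p_churn = 0.10                  # prior: 10% of users churn
p_signal_given_churn = 0.60     # the signal appears for 60% of churners
p_signal_given_stay = 0.05      # ...and for 5% of non-churners

# P(signal) via the law of total probability.
p_signal = p_signal_given_churn * p_churn + p_signal_given_stay * (1 - p_churn)

# Posterior: P(churn | signal) = P(signal | churn) * P(churn) / P(signal)
p_churn_given_signal = p_signal_given_churn * p_churn / p_signal
print(f"posterior P(churn | signal) = {p_churn_given_signal:.2f}")   # about 0.57
```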

This approach offers several distinct advantages: it provides direct probability statements about parameters ("There's a 95% probability the effect is between 0.2 and 0.5") rather than the more convoluted frequentist interpretations of confidence intervals; it naturally incorporates existing knowledge through prior distributions; and it handles uncertainty more transparently by producing full posterior distributions rather than point estimates. The Bayesian framework particularly shines in sequential decision-making scenarios where beliefs must be continuously updated as new information arrives—making it ideal for reinforcement learning, online algorithms, and adaptive clinical trials. It also excels when data is limited but prior information is strong, allowing more stable inference than purely data-driven approaches. While historically limited by computational challenges, modern computational methods like Markov Chain Monte Carlo (MCMC) and variational inference have made Bayesian approaches practical for increasingly complex models, leading to growing adoption across machine learning and data science applications.
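In the conjugate Beta-Binomial case the posterior update is available in closed form, which makes for a compact sketch of Bayesian updating and of the direct probability statements described above; richer models would typically turn to MCMC or variational tools such as PyMC or Stan. The prior parameters and observed counts below are illustrative assumptions.

```python
from scipy import stats

# Bayesian updating for a conversion rate using the conjugate Beta-Binomial model.
# Prior Beta(2, 8) encodes a belief that the rate is probably low; the counts that
# follow are invented for illustration.
prior_alpha, prior_beta = 2, 8

# New evidence: 25 conversions out of 100 trials.
conversions, trials = 25, 100

# With a conjugate prior, Bayes' theorem reduces to simple parameter updates:
# the posterior is Beta(alpha + successes, beta + failures).
post_alpha = prior_alpha + conversions
post_beta = prior_beta + (trials - conversions)
posterior = stats.beta(post_alpha, post_beta)

# The posterior supports direct probability statements about the parameter itself.
print(f"posterior mean: {posterior.mean():.3f}")
low, high = posterior.ppf([0.025, 0.975])
print(f"95% credible interval: ({low:.3f}, {high:.3f})")
print(f"P(rate > 0.30): {posterior.sf(0.30):.3f}")
```

Repeating the same update as each new batch of data arrives, with the previous posterior serving as the next prior, is exactly the sequential updating that makes this framework attractive for online and adaptive settings.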