Stochastic Processes

Stochastic processes are mathematical models that describe systems evolving randomly over time or space. Unlike deterministic systems that follow exact, predictable patterns, stochastic processes incorporate randomness, making them ideal for modeling real-world phenomena with inherent uncertainty.

These processes form the mathematical foundation for many machine learning algorithms, particularly those dealing with sequential data, reinforcement learning, and probabilistic modeling. Understanding stochastic processes provides crucial insights into how uncertainty propagates through systems over time—a fundamental concept in modern AI systems.

From language models predicting the next word in a sentence to algorithms modeling financial markets, stochastic processes give us the tools to quantify, analyze, and predict systems where randomness plays a central role.

Random walks model paths where each step is determined by chance, such as the journey of a drunkard taking random steps. They can be one-dimensional, two-dimensional, or in higher dimensions.

A simple random walk on integers starts at 0 and at each step moves +1 or -1 with equal probability. Despite their simplicity, random walks exhibit fascinating properties like recurrence (in dimensions 1 and 2, a random walk will eventually return to its starting point with probability 1) and transience (in dimensions 3 and higher, there's a positive probability it never returns).

The expected distance from the origin after n steps in a simple random walk grows proportionally to √n, not n, illustrating how randomness leads to slower exploration than directed movement. The Central Limit Theorem sharpens this picture: at large scales, the walk's position is approximately normally distributed around its starting point.
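The √n scaling can be checked with a quick simulation. This is a minimal sketch; the walk lengths and the number of sample paths are arbitrary choices:

```python
import numpy as np

# Simulate many simple +/-1 random walks and compare the root-mean-square
# distance from the origin after n steps with sqrt(n).
rng = np.random.default_rng(0)

def rms_distance(n_steps, n_walks=20_000):
    """RMS distance from the origin after n_steps of a +/-1 random walk."""
    steps = rng.choice([-1, 1], size=(n_walks, n_steps))
    endpoints = steps.sum(axis=1)
    return np.sqrt(np.mean(endpoints.astype(float) ** 2))

for n in (100, 400, 1600):
    print(n, rms_distance(n), np.sqrt(n))  # the two values track each other
```

Quadrupling the number of steps only doubles the typical distance, exactly the √n behavior described above.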

In machine learning, random walks underpin algorithms for graph analysis, recommendation systems, MCMC sampling methods, and PageRank-like algorithms for determining node importance in networks.

Deep learning applications include graph neural networks that propagate information through graphs using random walk principles, node embedding models like node2vec that capture structural relationships through biased random walks on networks, and exploration strategies in reinforcement learning that balance random walks (exploration) with directed movement (exploitation).

Markov chains model systems that transition between states based solely on the current state, not the past history. They follow the Markov property: the future depends on the present, but not on the past. Random walks are a special case of Markov chains where transitions follow simple rules.

Formally, a process {Xₙ} is a Markov chain if P(Xₙ₊₁ = j | Xₙ = i, Xₙ₋₁ = iₙ₋₁, ..., X₁ = i₁) = P(Xₙ₊₁ = j | Xₙ = i) for all states i, j and times n. This conditional probability P(Xₙ₊₁ = j | Xₙ = i) is called the transition probability from state i to state j.

The "memoryless" property of Markov chains makes them computationally tractable while still capturing complex behavior. A Markov chain is fully described by its transition matrix P, where element Pᵢⱼ represents the probability of moving from state i to state j in one step.
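A small worked example shows how a transition matrix encodes a chain and how its stationary distribution can be extracted. The two-state matrix here is an illustrative assumption, not taken from the text:

```python
import numpy as np

# Hypothetical 2-state chain (e.g. sunny=0, rainy=1); the probabilities
# are made up for illustration. Row i gives P(next state | current = i).
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# n-step transition probabilities are simply matrix powers of P.
P10 = np.linalg.matrix_power(P, 10)

# The stationary distribution pi satisfies pi P = pi, i.e. pi is a left
# eigenvector of P for eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi = pi / pi.sum()
print(pi)  # approximately [0.833, 0.167]
```

Regardless of the starting state, the rows of Pⁿ converge to this stationary distribution as n grows.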

In machine learning, Markov chains are used for sequence modeling, text generation, recommendation systems, and reinforcement learning. Hidden Markov Models (HMMs) extend this concept to situations where the states are not directly observable, which is valuable for speech recognition and bioinformatics.

PageRank, the algorithm that powered Google's early search engine success, is fundamentally a Markov chain model where web pages are states and links represent transitions. By analyzing the stationary distribution of this Markov chain, PageRank identifies the most important pages on the web.
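A toy sketch of the idea, using a hypothetical four-page link graph and the standard damped power-iteration formulation:

```python
import numpy as np

# Illustrative link graph (an assumption for the example):
# page -> pages it links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
n = 4

# Column-stochastic transition matrix: a random surfer on page src follows
# one of its outlinks uniformly at random.
M = np.zeros((n, n))
for src, dsts in links.items():
    for dst in dsts:
        M[dst, src] = 1.0 / len(dsts)

damping = 0.85                 # probability of following a link vs. jumping
rank = np.full(n, 1.0 / n)     # start from the uniform distribution
for _ in range(100):           # power iteration toward the stationary vector
    rank = (1 - damping) / n + damping * M @ rank
print(rank)
```

Page 3 receives no inbound links, so it ends up with the lowest rank, while heavily linked pages rise to the top.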

Continuous-time Markov chains extend the concept of Markov chains to processes where transitions can occur at any moment in time. While traditional Markov chains operate in discrete steps, continuous-time versions allow the system to change states at any continuous time point, with exponentially distributed waiting times between transitions.

These processes are characterized by a generator matrix Q, where off-diagonal elements qᵢⱼ (i≠j) represent transition rates from state i to state j, and diagonal elements qᵢᵢ = -∑ⱼ≠ᵢ qᵢⱼ, so that each row of Q sums to zero and total probability is conserved. The probability of transitioning from state i to j after time t is given by the corresponding element of the matrix exponential e^(Qt).
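A minimal sketch of computing P(t) = e^(Qt) for an assumed two-state chain, using a truncated Taylor series in place of a library matrix exponential:

```python
import numpy as np

# Hypothetical 2-state continuous-time chain (rates are illustrative
# assumptions): state 0 -> 1 at rate 2.0, state 1 -> 0 at rate 1.0.
Q = np.array([[-2.0,  2.0],
              [ 1.0, -1.0]])   # each row sums to zero

def transition_matrix(Q, t, terms=50):
    """P(t) = e^(Qt) via a truncated Taylor series (adequate for small Qt)."""
    P = np.eye(Q.shape[0])
    term = np.eye(Q.shape[0])
    for k in range(1, terms):
        term = term @ (Q * t) / k   # (Qt)^k / k!
        P = P + term
    return P

P1 = transition_matrix(Q, 1.0)
print(P1)  # a proper stochastic matrix: each row sums to 1
```

Each row of P(t) is a probability distribution over states, and as t grows the rows converge to the chain's stationary distribution.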

The memoryless property of the exponential distribution is crucial here—it ensures that the future evolution depends only on the current state, not how long the process has been in that state.

These models are valuable for queueing systems, epidemiological models, reliability analysis, and modeling state transitions in complex systems where events don't occur at regular intervals.

In machine learning applications, continuous-time Markov chains model customer behavior in marketing analytics, patient progression through disease states in healthcare, and the evolution of complex systems like social networks or financial markets. Recent neural ODE (Ordinary Differential Equation) models incorporate continuous-time dynamics inspired by these processes to model irregularly-sampled time series data.

Poisson processes model the number of randomly occurring events in a fixed time or space interval, with a constant average rate. They represent a special case of continuous-time Markov chains where we focus on counting events rather than tracking system states.

A Poisson process N(t) must satisfy several key properties: N(0) = 0 (starting at zero), it has independent increments (events in non-overlapping intervals are independent), and for small intervals Δt, the probability of exactly one event is approximately λΔt while the probability of multiple events is negligible.

The number of events N(t) in an interval of length t follows a Poisson distribution with parameter λt: P(N(t) = k) = e^(-λt)(λt)^k/k!, where λ is the rate parameter. The time between consecutive events follows an exponential distribution with mean 1/λ.
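These formulas can be illustrated by simulating the process directly from its exponential interarrival times. The rate and time horizon below are arbitrary choices for the example:

```python
import numpy as np

# Simulate a Poisson process by accumulating exponential interarrival
# times, then check that the event count matches its expectation lam * t.
rng = np.random.default_rng(1)
lam, t, n_runs = 3.0, 5.0, 20_000

def count_events(lam, t):
    """Number of events in [0, t] for a rate-lam Poisson process."""
    total, k = 0.0, 0
    while True:
        total += rng.exponential(1.0 / lam)  # exponential gaps, mean 1/lam
        if total > t:
            return k
        k += 1

counts = np.array([count_events(lam, t) for _ in range(n_runs)])
print(counts.mean())  # close to lam * t = 15
```

The empirical distribution of `counts` also matches the Poisson formula P(N(t) = k) = e^(-λt)(λt)^k/k! given above.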

These processes are applied in machine learning for modeling customer arrivals, network traffic, failure events, and any situation where discrete events occur randomly over time or space with known average rates.

In recommendation systems, Poisson processes help model user interaction patterns. In anomaly detection, deviations from expected Poisson behavior can signal unusual activity. Healthcare applications include modeling patient arrivals at emergency rooms, disease outbreaks, and mutation occurrences in genetic sequences.

Renewal theory studies processes with repeated events and the times between successive events, known as interarrival times. It generalizes Poisson processes by allowing the times between events to follow any distribution, not just the exponential distribution.

The core concept is the renewal function m(t), which gives the expected number of renewals in the interval [0,t]. A key result, the Elementary Renewal Theorem, states that m(t)/t approaches 1/μ as t approaches infinity, where μ is the mean interarrival time.
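A quick simulation of the Elementary Renewal Theorem, using uniform interarrival times as an illustrative non-exponential distribution with mean μ = 1:

```python
import numpy as np

# Renewal process with uniform(0.5, 1.5) interarrival times (mean mu = 1,
# an illustrative choice). By the Elementary Renewal Theorem, the number
# of renewals in [0, t] divided by t should approach 1/mu.
rng = np.random.default_rng(2)

def renewals_by(t):
    """Count renewals in [0, t]."""
    clock, count = 0.0, 0
    while True:
        clock += rng.uniform(0.5, 1.5)
        if clock > t:
            return count
        count += 1

t = 10_000.0
m_t = renewals_by(t)
print(m_t / t)  # close to 1 / mu = 1.0
```

Nothing about the uniform distribution is memoryless, yet the long-run renewal rate still settles at 1/μ, exactly as the theorem promises.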

Renewal processes generalize Poisson processes by allowing non-exponential distributions of interarrival times. This flexibility makes them suitable for modeling complex real-world phenomena where the memoryless property of exponential distributions doesn't hold.

Applications in machine learning include survival analysis, reliability modeling, customer lifetime value estimation, and maintenance scheduling algorithms.

In recommender systems, renewal processes help model repeat purchase behavior by capturing patterns in time intervals between purchases. In healthcare, they're used for predicting hospital readmissions and modeling disease recurrence. Reinforcement learning algorithms use renewal theory concepts to analyze reward processes and develop efficient exploration strategies.

Martingales represent fair games: given the past, the expected future value equals the current value. Complementing the Markov processes discussed above, martingales provide a powerful mathematical framework for analyzing processes whose predictions remain stable despite randomness.

Formally, a stochastic process {Xₙ} is a martingale if E[|Xₙ|] < ∞ for all n, and E[Xₙ₊₁|X₁,...,Xₙ] = Xₙ. This conditional expectation property encapsulates the concept of fairness—your expected fortune after the next play equals your current fortune.
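The fairness property can be checked empirically on the simplest example, a fortune driven by fair ±1 bets. The starting stake and horizon are arbitrary choices:

```python
import numpy as np

# A gambler's fortune under fair +/-1 bets is a martingale: the sample
# mean of X_n should stay near the starting fortune at every step n.
rng = np.random.default_rng(3)
n_paths, n_steps, start = 100_000, 50, 10.0

bets = rng.choice([-1.0, 1.0], size=(n_paths, n_steps))
fortunes = start + np.cumsum(bets, axis=1)  # X_1, ..., X_50 for each path

print(fortunes[:, -1].mean())  # close to the starting fortune 10.0
```

Individual paths wander widely, but the average over many paths stays pinned to the initial value, which is exactly the content of E[Xₙ₊₁|X₁,...,Xₙ] = Xₙ.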

Martingales provide powerful theoretical tools through results like the Optional Stopping Theorem and Martingale Convergence Theorem. These results help analyze algorithms that involve random stopping times or exhibit convergent behavior despite randomness.

In machine learning, martingales provide theoretical foundations for online learning algorithms, adaptive sampling strategies, and sequential testing procedures.

Stochastic gradient descent, the workhorse of deep learning optimization, can be analyzed using martingale theory to establish convergence guarantees. Multi-armed bandit algorithms for exploration-exploitation tradeoffs also leverage martingale concentration inequalities to provide theoretical performance bounds.

Brownian motion (Wiener process) models continuous, erratic movement such as that of particles in a fluid or stock price fluctuations. It represents the limiting case of a random walk as the step size approaches zero and the number of steps approaches infinity, connecting back to our first topic.

A standard Brownian motion W(t) has several defining properties: W(0) = 0, it has independent increments, and for any times s < t, the increment W(t) - W(s) follows a normal distribution with mean 0 and variance t-s. Its paths are continuous but nowhere differentiable—mathematically capturing the concept of continuous but extremely erratic motion.

Geometric Brownian Motion (GBM), given by the stochastic differential equation dS(t) = μS(t)dt + σS(t)dW(t), extends this concept to model quantities that can't go negative (like stock prices) and incorporates both drift (μ) and volatility (σ) parameters.
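Both processes are easy to simulate. The sketch below builds Brownian paths from scaled Gaussian increments and then applies the exact GBM solution S(t) = S₀·exp((μ − σ²/2)t + σW(t)), with illustrative parameter values:

```python
import numpy as np

# Standard Brownian motion: independent Gaussian increments with
# variance dt, so W(T) ~ Normal(0, T).
rng = np.random.default_rng(4)
n_paths, n_steps, T = 50_000, 200, 1.0
dt = T / n_steps

dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
W = np.cumsum(dW, axis=1)  # W(t) sampled on the time grid

print(W[:, -1].mean(), W[:, -1].var())  # near 0 and T = 1

# Geometric Brownian motion via its exact solution; drift and volatility
# values are illustrative assumptions.
mu, sigma, S0 = 0.05, 0.2, 100.0
S_T = S0 * np.exp((mu - 0.5 * sigma**2) * T + sigma * W[:, -1])
print(S_T.mean())  # close to S0 * exp(mu * T)
```

The simulated terminal values confirm the defining variance of Brownian motion and the expected growth E[S(T)] = S₀e^(μT) of GBM, and the same machinery of incremental Gaussian noise underlies the diffusion models discussed next.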

In machine learning, Brownian motion serves as the foundation for models in finance (option pricing), physics-informed neural networks, diffusion models for image generation, and stochastic optimization techniques.

Recent breakthrough generative AI models like DALL-E and Stable Diffusion use diffusion processes—directly inspired by Brownian motion—to generate images by gradually denoising random Gaussian noise. The mathematics of Brownian motion helps these models transform noise into coherent, detailed images through a carefully controlled reverse diffusion process.