
Stochastic Processes in AI Vol-1: Randomness, Generative Models and Probability

N.B.- All my books are exclusively available on Amazon. The free notes/materials on globalcodemaster.com do NOT match even 1% with any of my PUBLISHED books. Similar topics ≠ same content. Books have full details, exercises, chapters & structure — website notes do not. No book content is shared here. We fully comply with Amazon policies.
  1. Table of Contents: Stochastic Processes in AI Vol-1

    Randomness, Generative Models and Probability

    1. Introduction to Stochastic Processes in Artificial Intelligence
       1.1 Why stochastic processes are central to modern AI (2026 perspective)
       1.2 From classical probability to generative modeling revolution
       1.3 Brief history: Wiener process → diffusion models → score-based generative modeling
       1.4 Role in uncertainty quantification, exploration, sampling, and reasoning
       1.5 Structure of Vol-1 and target audience (undergrad/postgrad, researchers, practitioners)

    2. Foundations of Probability – Essential Review for AI
       2.1 Probability spaces, random variables, expectation, variance
       2.2 Common distributions used in AI (Bernoulli, Gaussian, Categorical, Beta, Gamma, Dirichlet, Poisson)
       2.3 Law of large numbers, central limit theorem, and concentration inequalities
       2.4 Jensen’s inequality, KL divergence, mutual information
       2.5 Monte Carlo estimation and importance sampling basics

    3. Markov Chains – The Simplest Stochastic Process
       3.1 Discrete-time Markov chains: transition matrix, state space, irreducibility
       3.2 Stationary distribution, ergodicity, detailed balance
       3.3 Markov Chain Monte Carlo (MCMC): Metropolis-Hastings, Gibbs sampling
       3.4 Continuous-time Markov chains (CTMC) and master equations
       3.5 Applications in AI: PageRank, reinforcement learning policy evaluation, text generation (early n-gram models)

    4. Markov Decision Processes (MDP) and Reinforcement Learning Foundations
       4.1 MDP definition: states, actions, transition probabilities, rewards
       4.2 Bellman equations, value iteration, policy iteration
       4.3 Stochastic policies and exploration (ε-greedy, softmax, entropy regularization)
       4.4 Stochastic shortest path and discounted infinite-horizon problems
       4.5 Connection to generative modeling: MDPs as sequential decision generative models

    5. Poisson Processes and Point Processes in AI
       5.1 Homogeneous and non-homogeneous Poisson processes
       5.2 Hawkes processes (self-exciting point processes)
       5.3 Spatial point processes and Cox processes
       5.4 Applications: event prediction, neural spike trains, temporal recommendation systems, arrival modeling in queuing theory for AI systems

    6. Brownian Motion, Wiener Process and Diffusion Processes
       6.1 Definition and properties of standard Brownian motion
       6.2 Brownian motion with drift, geometric Brownian motion
       6.3 Stochastic differential equations (SDEs): Itô vs Stratonovich
       6.4 Fokker–Planck equation and probability flow
       6.5 First passage times and hitting probabilities
       6.6 Why diffusion processes are the mathematical foundation of modern generative AI

    7. Generative Modeling via Stochastic Processes – The Big Picture
       7.1 From autoregressive models to continuous-time generative models
       7.2 Denoising Diffusion Probabilistic Models (DDPM) – forward & reverse process
       7.3 Score-based generative modeling (Song & Ermon) → score matching perspective
       7.4 Probability flow ODE vs stochastic sampling (deterministic vs stochastic paths)
       7.5 Classifier-free guidance, CFG++, consistency models

    8. Advanced Diffusion Models and Stochastic Processes
       8.1 Variance-exploding (VE) vs variance-preserving (VP) formulations
       8.2 Rectified flow, flow-matching, and stochastic interpolants
       8.3 Diffusion on non-Euclidean manifolds (Riemannian diffusion)
       8.4 Latent diffusion models (LDM, Stable Diffusion family)
       8.5 Discrete diffusion and absorbing state models (D3PM, MaskGIT)

    9. Stochastic Differential Equations (SDEs) in Generative AI
       9.1 Forward SDE → reverse-time SDE → score function
       9.2 Numerical solvers: Euler–Maruyama, Heun, predictor-corrector samplers
       9.3 Adaptive step-size solvers (DPM-Solver, DEIS, UniPC)
       9.4 Connection to optimal control and Schrödinger bridge
       9.5 Stochastic optimal control interpretation of diffusion sampling

    10. Practical Implementation Tools and Libraries (2026 Perspective)
       10.1 Diffusion frameworks: Diffusers (Hugging Face), score_sde, OpenAI guided-diffusion
       10.2 SDE solvers: torchdiffeq, torchsde, jaxdiff
       10.3 Manifold diffusion: GeoDiff, Riemannian Score Matching libraries
       10.4 Fast sampling: Consistency Models, Latent Consistency Models (LCM), SDXL Turbo
       10.5 Mini-project suggestions: DDPM from scratch, score-matching toy model, latent diffusion fine-tuning

    11. Case Studies and Real-World Applications
       11.1 Image & video generation (Stable Diffusion 3, Sora-like models)
       11.2 Molecule & protein conformation generation (RFdiffusion, Chroma, FrameDiff)
       11.3 Time-series forecasting with diffusion (TimeDiff, CSDI)
       11.4 Audio & speech synthesis (AudioLDM 2, Grad-TTS variants)
       11.5 Stochastic optimal control & planning in robotics

    12. Challenges, Limitations and Open Problems
       12.1 Slow sampling speed and acceleration techniques
       12.2 Mode collapse and diversity in diffusion models
       12.3 Training stability on high-dimensional manifolds
       12.4 Theoretical understanding of why score matching works so well
       12.5 Energy-efficient diffusion for edge devices

Welcome to Stochastic Processes in AI Vol-1: Randomness, Generative Models and Probability. This tutorial series bridges classical probability theory with the cutting-edge generative AI revolution of 2026. Whether you are an undergraduate student, postgraduate researcher, or industry practitioner, you will gain both mathematical depth and practical implementation skills.

1.1 Why stochastic processes are central to modern AI (2026 perspective)

In 2026, almost every frontier AI system relies on stochastic processes — mathematical models that describe systems evolving randomly over time or space. Here’s why they have become indispensable:

  • Generative AI dominates: Models like Stable Diffusion 3, Sora-style video generators, and Llama-4-scale LLMs are built on stochastic differential equations (SDEs) and diffusion processes. Without them, image, video, audio, and 3D generation would not exist at today's quality.

  • Uncertainty is everywhere: Real-world AI (autonomous driving, medical diagnosis, financial forecasting) must quantify “how sure” the model is. Stochastic processes provide the language for uncertainty.

  • Exploration in decision-making: Reinforcement learning agents (e.g., in robotics or game AI) use stochastic policies to explore unknown environments efficiently.

  • Sampling efficiency: Modern generative models sample billions of high-quality outputs per day using advanced stochastic samplers (DPM-Solver, Consistency Models, Flow Matching).

Numerical example – Why randomness wins
A deterministic model trying to generate a realistic face produces the same image every time → boring and unrealistic. A stochastic diffusion model with 50 sampling steps produces thousands of unique, high-quality faces from the same prompt, each with natural variations (skin texture, lighting, expression). This is the power of controlled randomness.

2026 reality: The best models (OpenAI o3, Google Gemini 2.5, Anthropic Claude 4) all have stochastic components at their core — either in training (noise injection) or inference (sampling).

1.2 From classical probability to generative modeling revolution

The journey is a beautiful evolution:

  • Classical probability (17th–19th century): Pascal, Bernoulli, Gauss → basic distributions and expectation.

  • Stochastic processes (early 20th century): Markov chains, Wiener process → systems that evolve randomly over time.

  • Bayesian revolution (1980s–2000s): Probabilistic graphical models, MCMC sampling.

  • Deep generative era (2014–2020): VAEs, GANs → first neural stochastic models.

  • Diffusion & score-based revolution (2020–2026): From DDPM (Ho et al., 2020) to flow-matching and consistency models → state-of-the-art quality.

Key transition point: In 2019–2021, researchers realised that denoising a noisy image step-by-step (reverse diffusion) is mathematically equivalent to solving a stochastic differential equation. This single insight turned probability theory into the engine of today’s generative AI.

Simple numerical analogy
Think of generating a photo of a cat:

  • Classical probability = guessing the average cat (blurry mess)

  • GAN = adversarial trickery (good but unstable)

  • Diffusion = start with pure noise (TV static) → gradually remove noise guided by learned probability → crystal-clear cat image.

1.3 Brief history: Wiener process → diffusion models → score-based generative modeling

  • 1923: Norbert Wiener defines the Wiener process (mathematical Brownian motion) — the continuous-time limit of random walks.

  • 1950s–1970s: Physicists use Langevin & Fokker–Planck equations to model particle diffusion.

  • 2015: Sohl-Dickstein et al. introduce early denoising diffusion ideas.

  • 2019–2020: Song & Ermon (Stanford) introduce score-based generative modeling — learning the score function (gradient of log-probability).

  • 2020: Ho, Jain & Abbeel publish Denoising Diffusion Probabilistic Models (DDPM) — the model that started the revolution.

  • 2021–2023: Latent Diffusion (Stable Diffusion), DPM-Solver, Consistency Models, Rectified Flow.

  • 2024–2026: Manifold diffusion, flow-matching, and hybrid stochastic-deterministic samplers dominate industry (Stable Diffusion 3, Sora, Luma Dream Machine, Runway Gen-3).

Key mathematical bridge: The forward diffusion process adds Gaussian noise:

x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon

The reverse process learns to denoise — exactly solving a stochastic differential equation.

1.4 Role in uncertainty quantification, exploration, sampling, and reasoning

Stochastic processes power four pillars of modern AI:

  1. Uncertainty Quantification

    • Bayesian neural networks, conformal prediction, and diffusion-based uncertainty maps.

    • Example: Medical AI outputs “85% confident this is malignant” instead of binary yes/no.

  2. Exploration

    • In reinforcement learning: stochastic policies (softmax, entropy bonus) prevent agents from getting stuck.

    • Example: AlphaGo/AlphaZero used Monte Carlo Tree Search — a stochastic tree exploration process.

  3. Sampling

    • Generating new data: diffusion models, MCMC, Hamiltonian Monte Carlo.

    • Modern samplers (UniPC, DPM-Solver++) generate 1024×1024 images in 4–8 steps instead of 1000.

  4. Reasoning

    • Chain-of-thought with temperature sampling, stochastic beam search, and probabilistic program synthesis.

    • LLMs use stochastic decoding (top-p, temperature) to produce diverse, creative reasoning paths.

Numerical example – Uncertainty in autonomous driving
A stochastic process model predicts:

  • 92% probability of pedestrian crossing in next 3 seconds

  • With 95% confidence interval [0.87, 0.96] → The car slows down safely instead of taking a hard binary decision.

2. Foundations of Probability – Essential Review for AI

Before diving into stochastic processes and generative modeling, we need a solid grasp of probability fundamentals. This section is not just a review — it highlights exactly which concepts appear most frequently in modern AI (diffusion models, VAEs, reinforcement learning, Bayesian deep learning, uncertainty quantification).

2.1 Probability spaces, random variables, expectation, variance

Probability space A probability space is a triple (Ω, ℱ, P):

  • Ω = sample space (all possible outcomes)

  • ℱ = σ-algebra (collection of measurable events)

  • P = probability measure (P: ℱ → [0,1], P(Ω)=1)

Random variable X: a measurable function X: Ω → ℝ that assigns a real number to each outcome.

Expectation (mean) E[X] = ∫ x dP(x) (continuous) or Σ x P(X=x) (discrete)

Variance Var(X) = E[(X - E[X])²] = E[X²] - (E[X])²

Numerical example – coin flip in AI
Fair coin: Ω = {Heads, Tails}, P(Heads) = P(Tails) = 0.5
Random variable X: 1 if Heads, 0 if Tails
E[X] = 0.5 × 1 + 0.5 × 0 = 0.5
Var(X) = E[X²] - (E[X])² = 0.5 - 0.25 = 0.25

AI connection In reinforcement learning, reward R is a random variable → E[R] = expected return, Var(R) = risk/uncertainty of policy.

2.2 Common distributions used in AI

Here are the distributions you will see almost every day in generative AI and probabilistic modeling.

Distribution | Support | PMF/PDF formula | Parameters | AI usage examples (2026)
Bernoulli | {0,1} | P(X=1)=p, P(X=0)=1-p | p ∈ [0,1] | Binary classification, binary latent variables
Categorical | {1,…,K} | P(X=k)=π_k, Σ π_k=1 | π ∈ Δ^{K-1} (simplex) | Discrete token prediction (LLMs), one-hot labels
Gaussian | ℝ | (1/√(2πσ²)) exp(-(x-μ)²/(2σ²)) | μ ∈ ℝ, σ>0 | Noise in diffusion models, latent space in VAEs
Beta | [0,1] | x^{α-1}(1-x)^{β-1} / B(α,β) | α,β > 0 | Beta-VAE, variational dropout rates, priors
Gamma | (0,∞) | x^{α-1} exp(-x/β) / (β^α Γ(α)) | α (shape), β (scale) | Precision parameters, diffusion variance schedules
Dirichlet | simplex Δ^{K-1} | ∏ x_i^{α_i-1} / B(α) | α ∈ ℝ^K_+ | Topic models, Dirichlet priors in Bayesian NNs
Poisson | {0,1,2,…} | λ^k exp(-λ) / k! | λ > 0 | Count data, event arrival times, spike trains

Numerical example – Gaussian noise in diffusion
In DDPM, at step t we add noise: x_t = √(α_bar_t) x_0 + √(1 - α_bar_t) ε, ε ~ 𝒩(0, I)
If α_bar_t = 0.9 → x_t ≈ 0.95 x_0 + 0.316 ε
The noise scale grows as t increases → the image slowly turns into pure Gaussian noise.
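A quick numpy check of this formula (the constant toy signal and the seed are arbitrary illustration choices, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_diffuse(x0, alpha_bar, rng=rng):
    # One-shot DDPM forward process: x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

x0 = np.ones(10_000)            # toy "image": a constant signal of ones
xt = forward_diffuse(x0, 0.9)   # alpha_bar_t = 0.9, as in the example above

# Deterministic coefficients: sqrt(0.9) ~ 0.949, sqrt(0.1) ~ 0.316
print(round(float(np.sqrt(0.9)), 3), round(float(np.sqrt(0.1)), 3))
print(float(xt.mean()), float(xt.std()))   # mean near 0.949, std near 0.316
```

With a larger noising step (smaller α_bar), the same two-line function pushes x_t closer to pure Gaussian noise.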

2.3 Law of large numbers, central limit theorem, and concentration inequalities

Law of Large Numbers (LLN) Sample average converges to true expectation: (1/n) Σ_{i=1}^n X_i → E[X] as n → ∞ (almost surely or in probability)

Central Limit Theorem (CLT) Standardized sum converges to standard normal: √n ( (1/n) Σ X_i - μ ) / σ → 𝒩(0,1) as n → ∞

Concentration inequalities (quantify how fast convergence happens)

  • Hoeffding: P( | (1/n) Σ X_i - μ | ≥ ε ) ≤ 2 exp(-2nε² / (b-a)²) (bounded variables)

  • Bernstein, McDiarmid, etc.

Numerical example – Monte Carlo mean estimation
Estimate π by throwing darts at the unit square: fraction inside the quarter circle ≈ π/4
After n = 100 darts: estimate = 0.78 → π̂ ≈ 3.12
After n = 10,000 darts: estimate = 0.7854 → π̂ ≈ 3.1416
CLT tells us the error shrinks as 1/√n → standard error of π̂ ≈ 4 √(p(1-p)/n) ≈ 0.016 for n = 10,000.
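The dart-throwing estimator fits in a few lines of numpy; a minimal sketch, with an arbitrary seed:

```python
import numpy as np

rng = np.random.default_rng(42)

def estimate_pi(n, rng=rng):
    # Throw n darts at the unit square; the fraction landing inside the
    # quarter circle x^2 + y^2 <= 1 estimates pi / 4.
    x, y = rng.random(n), rng.random(n)
    return 4.0 * float(np.mean(x**2 + y**2 <= 1.0))

for n in (100, 10_000, 1_000_000):
    print(n, estimate_pi(n))   # error shrinks roughly as 1 / sqrt(n)
```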

AI connection LLN justifies Monte Carlo sampling in diffusion reverse process. CLT explains why averaging many samples gives stable gradients in score estimation.

2.4 Jensen’s inequality, KL divergence, mutual information

Jensen’s inequality For convex function f: f(E[X]) ≤ E[f(X)] For concave f: reverse inequality.

Example (entropy is concave) H(α p + (1-α) q) ≥ α H(p) + (1-α) H(q)

KL divergence (asymmetric) D_KL(p || q) = E_p [ log (p(x)/q(x)) ] = ∫ p log p - p log q dx Always ≥ 0, =0 iff p=q almost everywhere.

Numerical example p = Bernoulli(0.7), q = Bernoulli(0.5) D_KL(p||q) = 0.7 log₂(0.7/0.5) + 0.3 log₂(0.3/0.5) ≈ 0.340 - 0.221 ≈ 0.119 bits
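The same calculation as a tiny helper (base-2 logs, so the result is in bits):

```python
from math import log2

def kl_bernoulli(p, q):
    # D_KL(Bernoulli(p) || Bernoulli(q)) in bits
    return p * log2(p / q) + (1 - p) * log2((1 - p) / (1 - q))

print(round(kl_bernoulli(0.7, 0.5), 3))  # 0.119
print(round(kl_bernoulli(0.5, 0.7), 3))  # asymmetric: differs from the line above
```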

Mutual information I(X;Y) = H(X) - H(X|Y) = D_KL(p(x,y) || p(x)p(y)) Measures shared information between variables.

AI connection KL divergence → ELBO in VAEs, score matching loss in diffusion. Jensen → variational lower bounds. Mutual information → disentanglement in representation learning.

2.5 Monte Carlo estimation and importance sampling basics

Monte Carlo estimation Estimate expectation E[f(X)] ≈ (1/n) Σ f(x_i) where x_i ~ p(x)

Importance sampling (when direct sampling from p is hard) E_p [f(X)] = E_q [ f(X) (p(X)/q(X)) ] ≈ (1/n) Σ f(x_i) w_i where x_i ~ q, w_i = p(x_i)/q(x_i)

Numerical example – estimate rare event probability Want P(X > 5) where X ~ 𝒩(0,1) (very small ~ 2.87×10⁻⁷) Direct MC: need ~10^9 samples. Importance sampling: sample from 𝒩(5,1) → shift mean → only ~10^4–10^5 samples needed for good estimate.
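A sketch of the shifted-proposal trick; the exponent 12.5 - 5x is just the closed-form log-likelihood ratio of 𝒩(0,1) against the 𝒩(5,1) proposal (seed and sample count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def tail_prob_is(n, rng=rng):
    # Estimate P(X > 5) for X ~ N(0,1) by sampling from the proposal N(5,1).
    x = rng.standard_normal(n) + 5.0
    # Likelihood ratio p(x)/q(x) = exp(-x^2/2 + (x-5)^2/2) = exp(12.5 - 5x)
    w = np.exp(12.5 - 5.0 * x)
    return float(np.mean((x > 5.0) * w))

print(tail_prob_is(100_000))   # close to the true value ~ 2.87e-7
```

With the proposal centered at the rare region, roughly half the samples hit the event, so even 10⁵ samples give a tight estimate.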

AI connection Monte Carlo used in policy gradient (REINFORCE). Importance sampling → off-policy RL, weighted loss in diffusion training.

3. Markov Chains – The Simplest Stochastic Process

Markov chains are the foundational stochastic process in AI. They model systems that evolve randomly over time where the next state depends only on the current state (memoryless property). Markov chains power early language models, reinforcement learning value iteration, PageRank, MCMC sampling, and many sequential decision processes.

3.1 Discrete-time Markov chains: transition matrix, state space, irreducibility

Definition A discrete-time Markov chain (DTMC) is a sequence of random variables {X₀, X₁, X₂, …} with state space S (finite or countable) satisfying the Markov property:

P(X_{t+1} = j | X_t = i, X_{t-1}, …, X₀) = P(X_{t+1} = j | X_t = i)

Transition matrix P (rows sum to 1) P_{ij} = P(X_{t+1} = j | X_t = i)

Numerical example – simple weather model State space S = {Sunny, Rainy} Transition matrix:

         Sunny   Rainy
Sunny     0.9     0.1
Rainy     0.4     0.6

Interpretation:

  • If today is Sunny → 90% chance tomorrow is Sunny

  • If today is Rainy → 60% chance tomorrow is Rainy (persistent rain)

Irreducibility A chain is irreducible if every state is reachable from every other state (strongly connected graph).

Absorbing state If P_{ii} = 1, state i is absorbing (chain stays there forever).

AI relevance

  • State space = discrete tokens in language model

  • Transition matrix = next-token probabilities (early n-gram models)

3.2 Stationary distribution, ergodicity, detailed balance

Stationary distribution π A probability vector π such that π = π P (left eigenvector with eigenvalue 1)

Ergodicity A chain is ergodic if it is irreducible, aperiodic, and positive recurrent. Then there exists a unique stationary distribution π, and the chain converges to π regardless of starting state.

Detailed balance (stronger condition) π_i P_{ij} = π_j P_{ji} for all i,j → time-reversibility (chain looks the same forward and backward)

Numerical example – weather model stationary distribution Solve π = π P, π₁ + π₂ = 1

π₁ = 0.9 π₁ + 0.4 π₂ π₂ = 0.1 π₁ + 0.6 π₂

→ π₁ = 0.8, π₂ = 0.2 Interpretation: In the long run, 80% of days are sunny, 20% rainy.
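The same stationary distribution can be found numerically by power iteration; a minimal sketch with the weather matrix above:

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])     # weather chain from the example

pi = np.array([0.5, 0.5])      # any starting distribution works (ergodic chain)
for _ in range(100):
    pi = pi @ P                # one step of the chain acting on distributions

print(pi)                      # converges to [0.8, 0.2]
```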

AI connection Stationary distribution in RL = long-run state occupancy under policy. Detailed balance is key for Metropolis-Hastings MCMC to be valid.

3.3 Markov Chain Monte Carlo (MCMC): Metropolis-Hastings, Gibbs sampling

MCMC generates samples from complex target distribution p(x) by constructing a Markov chain whose stationary distribution is p(x).

Metropolis-Hastings algorithm

  1. Propose new state y ~ q(y | x_current)

  2. Compute acceptance ratio A = min(1, [p(y) q(x_current | y)] / [p(x_current) q(y | x_current)])

  3. Accept y with probability A, else stay at x_current

Numerical toy example – sampling from Beta(2,5)
Target p(x) ∝ x¹ (1-x)⁴ (Beta(2,5))
Proposal: Uniform(0,1) (an independence proposal, so the q terms cancel and A = min(1, p(y)/p(x)))
Start at x = 0.5, propose y = 0.7:
A = min(1, (0.7/0.5) × ((1-0.7)/(1-0.5))⁴) = min(1, 1.4 × 0.1296) ≈ 0.18
Accept with 18% probability.
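The full sampler is a short loop; a minimal sketch, assuming the Uniform(0,1) independence proposal from the toy example (seed, chain length, and burn-in are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

def target(x):
    # Unnormalized Beta(2,5) density: x * (1-x)^4 on (0,1)
    return x * (1.0 - x) ** 4 if 0.0 < x < 1.0 else 0.0

def mh_beta(n_steps, x0=0.5, rng=rng):
    # Uniform(0,1) independence proposal: q cancels, so A = min(1, p(y)/p(x))
    xs, x = [], x0
    for _ in range(n_steps):
        y = rng.random()
        if rng.random() < min(1.0, target(y) / target(x)):
            x = y              # accept the proposal
        xs.append(x)           # (on rejection the current state repeats)
    return np.array(xs)

samples = mh_beta(50_000)[10_000:]   # drop burn-in
print(float(samples.mean()))         # Beta(2,5) mean is 2/7 ~ 0.286
```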

Gibbs sampling Special case: propose one coordinate at a time from full conditional p(x_i | x_{-i})

AI relevance

  • MCMC used in Bayesian neural networks (weight sampling)

  • Gibbs sampling in topic models (LDA)

  • Modern variants (HMC, NUTS) power probabilistic programming (Pyro, NumPyro)

3.4 Continuous-time Markov chains (CTMC) and master equations

Continuous-time Markov chain Jumps occur at exponential waiting times. Transition rate matrix Q: Q_{ij} = rate from i to j (i ≠ j), Q_{ii} = -Σ_{j≠i} Q_{ij}

Master equation (forward Kolmogorov) dP(t)/dt = P(t) Q (P(t) = distribution at time t)

Numerical example – simple two-state CTMC States: Healthy (1), Sick (2) Q = [[-0.1, 0.1], [0.4, -0.4]] → From Healthy, rate to Sick = 0.1 per hour → From Sick, recovery rate = 0.4 per hour

Stationary distribution: π Q = 0 → π₁ = 0.8, π₂ = 0.2 (same as discrete case)
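Numerically, π solves π Q = 0 together with Σ π_i = 1; a sketch that replaces one balance equation with the normalization constraint:

```python
import numpy as np

Q = np.array([[-0.1,  0.1],
              [ 0.4, -0.4]])   # Healthy/Sick rate matrix from the example

# Solve pi @ Q = 0 with sum(pi) = 1: keep one column of Q^T pi = 0
# and swap the redundant equation for the normalization row.
A = np.vstack([Q.T[:1], np.ones((1, 2))])
pi = np.linalg.solve(A, np.array([0.0, 1.0]))
print(pi)                      # [0.8, 0.2], matching the discrete case
```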

AI connection CTMCs model continuous-time event sequences (e.g., neural spike trains, customer arrivals, chemical reaction networks in drug discovery).

3.5 Applications in AI: PageRank, reinforcement learning policy evaluation, text generation (early n-gram models)

  1. PageRank (Google 1998–now) Web as directed graph → Markov chain Transition matrix = normalized adjacency + teleportation (damping factor 0.85) Stationary distribution = PageRank scores

  2. Reinforcement Learning – Policy Evaluation Given policy π, value function v_π(s) = E[return | s, π] Bellman equation: v_π(s) = Σ_{a} π(a|s) Σ_{s',r} p(s',r|s,a) [r + γ v_π(s')] → Iterative policy evaluation = Markov chain on state space with rewards

  3. Text generation – early n-gram models Markov chain on words: P(w_t | w_{t-1}, …, w_{t-n+1}) Example: bigram model → transition matrix = P(next word | current word) Sampling from chain → generates text sequences

Numerical toy example – bigram text generation Vocabulary: {the, cat, sat, on, mat} Bigram transitions learned from corpus: P(sat | cat) = 0.7, P(on | sat) = 0.8, etc. Start with “the” → sample next → “cat” (high prob) → “sat” → “on” → “mat”
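A runnable version of the toy generator. Only P(sat | cat) = 0.7 and P(on | sat) = 0.8 come from the example; every other probability in the table below is invented for illustration:

```python
import random

random.seed(0)

# Toy bigram transition table (rows sum to 1). Only P(sat|cat)=0.7 and
# P(on|sat)=0.8 come from the text; the rest are made up for illustration.
bigrams = {
    "the": {"cat": 0.9, "mat": 0.1},
    "cat": {"sat": 0.7, "the": 0.3},
    "sat": {"on": 0.8, "the": 0.2},
    "on":  {"the": 0.5, "mat": 0.5},
    "mat": {"the": 1.0},
}

def generate(start, n_words):
    # Walk the Markov chain: sample each next word from the current row.
    words, w = [start], start
    for _ in range(n_words):
        w = random.choices(list(bigrams[w]), weights=list(bigrams[w].values()))[0]
        words.append(w)
    return " ".join(words)

print(generate("the", 4))   # e.g. "the cat sat on mat"
```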

Markov chains are simple yet incredibly powerful — they form the foundation for almost every sequential and probabilistic model in AI.

4. Markov Decision Processes (MDP) and Reinforcement Learning Foundations

Markov Decision Processes (MDPs) are the mathematical framework that turns Markov chains into decision-making systems. They are the foundation of reinforcement learning (RL) and have a deep connection to sequential generative modeling (planning as inference, diffusion as policy rollout, etc.).

4.1 MDP definition: states, actions, transition probabilities, rewards

An MDP is a 5-tuple (S, A, P, R, γ):

  • S — state space (finite or continuous)

  • A — action space

  • P(s' | s, a) — transition probability (dynamics model)

  • R(s, a, s') — reward function (or R(s,a) expected reward)

  • γ ∈ [0,1) — discount factor (future rewards less valuable)

The agent observes state s_t, chooses action a_t, receives reward r_{t+1}, and transitions to s_{t+1}.

Numerical example – simple grid world S = {grid positions (1,1) to (5,5)}, goal at (5,5) A = {up, down, left, right} P: stochastic (90% chance of moving in the intended direction, 10% chance of slipping to a random neighbor) R: +10 at goal, -1 per step (encourage fast reaching)

Analogy MDP = video game

  • State = current screen / level position

  • Action = button press

  • Transition = game physics

  • Reward = score / points

  • Discount = caring more about immediate points than future levels

AI relevance In robotics: state = joint angles + sensor readings, action = torque commands In games: state = board/pixels, action = move

4.2 Bellman equations, value iteration, policy iteration

Value function V^π(s) — expected discounted return starting from s following policy π V^π(s) = E[ Σ_{t=0}^∞ γ^t r_{t+1} | s_0 = s, π ]

Bellman expectation equation V^π(s) = Σ_a π(a|s) Σ_{s',r} p(s',r|s,a) [ r + γ V^π(s') ]

Bellman optimality equation (no policy) V*(s) = max_a Σ_{s',r} p(s',r|s,a) [ r + γ V*(s') ]

Value iteration (find V*) Initialize V(s) = 0 for all s Repeat until convergence: V(s) ← max_a Σ_{s',r} p(s',r|s,a) [ r + γ V(s') ]

Policy iteration

  1. Policy evaluation: compute V^π using Bellman expectation (or iterative method)

  2. Policy improvement: π'(s) = argmax_a Σ_{s',r} p(s',r|s,a) [ r + γ V^π(s') ]

  3. Repeat until π' = π

Numerical example – 2-state MDP States: S1 (bad), S2 (good) Actions: stay or switch Transitions deterministic Rewards: S1 → -1, S2 → +1 γ = 0.9

Value iteration converges quickly: V*(S2) = +1/(1 - 0.9) = 10, and V*(S1) = -1 + 0.9 × 10 = 8 (pay -1 once, then collect S2's value) Optimal policy: switch to S2 and stay there
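Value iteration on this 2-state MDP can be sketched as follows, assuming rewards are collected in the current state and both actions (stay/switch) are deterministic:

```python
import numpy as np

gamma = 0.9
rewards = np.array([-1.0, 1.0])   # state rewards: S1 -> -1, S2 -> +1

V = np.zeros(2)
for _ in range(200):
    # actions: stay (keep own value) or switch (take the other state's value);
    # the Bellman optimality backup picks whichever is larger.
    V = rewards + gamma * np.maximum(V, V[::-1])

print(V)   # fixed point: V*(S1) = 8, V*(S2) = 10
```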

AI connection Value iteration = planning in known environment Policy iteration = classic RL algorithm (e.g., early tabular Q-learning variants)

4.3 Stochastic policies and exploration (ε-greedy, softmax, entropy regularization)

Deterministic policy π(s) = one action Stochastic policy π(a|s) = probability distribution over actions

Exploration strategies

  1. ε-greedy With probability ε: random action With probability 1-ε: greedy action (argmax Q(s,a)) Example: ε=0.1 → 10% random, 90% best known

  2. Softmax (Boltzmann exploration) π(a|s) = exp(Q(s,a)/τ) / Σ_{a'} exp(Q(s,a')/τ) τ = temperature (high τ → more random, low τ → greedy)

  3. Entropy regularization (maximum entropy RL) Add entropy bonus to objective: J(π) = E[ Σ r_t + α H(π(·|s_t)) ] → Encourages diverse actions → better exploration

Numerical example – softmax
Q(s,a1)=5, Q(s,a2)=3, Q(s,a3)=1, τ=1
π(a1|s) = e⁵ / (e⁵ + e³ + e¹) ≈ 148.4 / 171.2 ≈ 0.867
π(a2|s) ≈ 0.117, π(a3|s) ≈ 0.016
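A small helper reproducing these numbers (the max-shift is a standard numerical-stability trick, not part of the formula):

```python
import numpy as np

def softmax_policy(q, tau=1.0):
    # Boltzmann exploration: pi(a|s) proportional to exp(Q(s,a) / tau)
    z = np.exp((q - np.max(q)) / tau)   # shift by max for numerical stability
    return z / z.sum()

q = np.array([5.0, 3.0, 1.0])
print(np.round(softmax_policy(q, tau=1.0), 3))    # [0.867 0.117 0.016]
print(np.round(softmax_policy(q, tau=10.0), 3))   # flatter: high tau explores more
```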

AI relevance (2026) Entropy regularization is standard in PPO, SAC, DreamerV3 → improves sample efficiency in robotics and games.

4.4 Stochastic shortest path and discounted infinite-horizon problems

Stochastic shortest path (SSP) Minimize expected cost to reach goal from start (no discount, γ=1, absorbing goal state)

Discounted infinite-horizon Minimize E[ Σ γ^t r_{t+1} ] → most common in deep RL (stability via discounting)

Comparison table

Setting | Discount γ | Goal state | Objective | Typical use in AI
Stochastic Shortest Path | 1 | Yes | Minimize expected cost to goal | Planning, navigation
Discounted infinite-horizon | <1 | No | Maximize discounted return | Games, robotics, continuous control

Numerical example – SSP 3 states: Start, Middle, Goal Actions: forward (reward -1 per step), backward (reward -10) Optimal policy: always forward → total reward = -2 (2 steps: Start → Middle → Goal)

4.5 Connection to generative modeling: MDPs as sequential decision generative models

Deep insight (2020–2026) A generative model can be viewed as an MDP where:

  • State = current partial sequence / image / molecule

  • Action = next token / pixel / atom addition

  • Transition = deterministic (given action) or stochastic

  • Reward = log-likelihood under data distribution (implicitly learned)

Examples of MDP-as-generation

  • Autoregressive LLMs (GPT series): MDP with state = prefix tokens, action = next token

  • Diffusion models: MDP with state = noisy image x_t, action = denoising step, reward = log p(x_0)

  • Decision Diffuser / Planning as Inference: explicitly cast diffusion sampling as RL policy optimization

  • Flow-matching models: deterministic paths → MDP with fixed transitions

Numerical bridge example In diffusion: Forward process: x_t = f(x_{t-1}, ε_t) (stochastic transition) Reverse process: learn policy π_θ(x_{t-1} | x_t) ≈ true reverse transition Objective: maximize likelihood → equivalent to maximizing cumulative reward under learned dynamics

2026 perspective Many frontier generative models are now explicitly trained with RL objectives (e.g., RLHF + diffusion fine-tuning, reward-weighted flow matching) — the MDP lens unifies them all.

Markov Decision Processes are the bridge between classical control, reinforcement learning, and modern generative AI. Everything that follows in this series builds on this foundation.

5. Poisson Processes and Point Processes in AI

Poisson processes and point processes model the occurrence of random events in time or space. They are among the most important stochastic models after Markov chains — especially in modern AI where we deal with irregular, timestamped events (user clicks, neural spikes, arrivals in cloud servers, molecular collisions, earthquakes, financial trades, etc.).

This section focuses on the most relevant types for AI and their practical applications.

5.1 Homogeneous and non-homogeneous Poisson processes

Poisson process (homogeneous) A counting process {N(t), t ≥ 0} where:

  • N(0) = 0

  • Independent increments

  • Number of events in interval (t, t+τ] ~ Poisson(λτ)

  • λ = constant rate (events per unit time)

Key properties

  • Inter-arrival times are independent Exponential(λ)

  • P(exactly k events in time t) = (λt)^k exp(-λt) / k!

Numerical example – homogeneous Poisson λ = 5 events/hour (e.g., customer arrivals at a website) Probability of exactly 3 arrivals in 1 hour: P(N(1)=3) = (5×1)^3 exp(-5) / 3! ≈ 125 × 0.006738 / 6 ≈ 0.1404 (14%)

Probability of no arrivals in 10 minutes (t=1/6 hour): P(N(1/6)=0) = exp(-5/6) ≈ exp(-0.833) ≈ 0.434

Non-homogeneous Poisson process (NHPP) Rate λ(t) varies with time.

Intensity function λ(t) Cumulative intensity Λ(t) = ∫_0^t λ(s) ds N(t) ~ Poisson(Λ(t))

Numerical example – NHPP
λ(t) = 2 + sin(2πt) (periodic rate, e.g., website traffic peaks every hour)
Λ(t) = ∫_0^t (2 + sin(2πs)) ds = 2t + (1/(2π))(1 - cos(2πt))
Expected events in first 24 hours: Λ(24) = 48
P(no events in first 10 min) = exp(-Λ(1/6)); Λ(1/6) = 1/3 + (1/(2π))(1 - cos(π/3)) ≈ 0.413, so the probability ≈ exp(-0.413) ≈ 0.66
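Sampling an NHPP is usually done by Lewis thinning; a sketch using λ(t) = 2 + sin(2πt) from the example, with λ_max = 3 as the dominating rate (seed arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_nhpp(lam, lam_max, T, rng=rng):
    # Lewis thinning: simulate a homogeneous process at rate lam_max and
    # keep each candidate at time t with probability lam(t) / lam_max.
    times, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / lam_max)   # candidate inter-arrival time
        if t > T:
            return np.array(times)
        if rng.random() < lam(t) / lam_max:
            times.append(t)

lam = lambda t: 2.0 + np.sin(2.0 * np.pi * t)   # rate from the example; max is 3
events = sample_nhpp(lam, lam_max=3.0, T=24.0)
print(len(events))   # random per run, but averages Lambda(24) = 48
```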

AI connection Homogeneous: modeling constant-rate events (e.g., background noise in sensors) Non-homogeneous: time-varying phenomena (daily/weekly patterns in recommendation clicks, neural firing rates modulated by stimuli)

5.2 Hawkes processes (self-exciting point processes)

Hawkes process A self-exciting point process where past events increase the probability of future events (clustering behavior).

Intensity function λ(t) = μ + Σ_{t_i < t} α exp(-β (t - t_i))

  • μ = background rate

  • α = excitation strength

  • β = decay rate

Numerical example – tweet retweet cascade Background μ = 0.1 retweets/min Excitation: each retweet adds α=0.8 immediate retweets, decaying with β=0.5/min After one tweet at t=0: λ(t) = 0.1 + 0.8 exp(-0.5 t) for t>0 At t=1 min: λ(1) ≈ 0.1 + 0.8 × 0.606 ≈ 0.585 retweets/min Expected additional retweets after first: ∫ α exp(-β t) dt = α/β = 0.8/0.5 = 1.6
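The intensity after a single event can be checked directly; a minimal sketch with the example's parameters (μ = 0.1, α = 0.8, β = 0.5):

```python
import numpy as np

def hawkes_intensity(t, history, mu=0.1, alpha=0.8, beta=0.5):
    # lambda(t) = mu + sum over past events t_i < t of alpha * exp(-beta (t - t_i))
    past = np.array([ti for ti in history if ti < t])
    return float(mu + np.sum(alpha * np.exp(-beta * (t - past))))

# Single tweet at t = 0, as in the example
print(round(hawkes_intensity(1.0, [0.0]), 3))   # 0.585
print(0.8 / 0.5)                                # expected direct offspring alpha/beta = 1.6
```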

Real AI applications

  • Viral content prediction (retweets, shares, views)

  • Financial trade clustering (order book events)

  • Earthquake aftershock modeling (used in predictive policing AI)

  • User engagement modeling in social platforms

Analogy Hawkes = contagious disease spread: background cases + each infected person infects others who infect more → exponential growth then decay.

5.3 Spatial point processes and Cox processes

Spatial point process Events occur randomly in space (2D/3D) instead of time.

Homogeneous Poisson point process (PPP) Constant intensity λ per unit area/volume.

Cox process (doubly stochastic PPP) Intensity λ(x) itself is random (e.g., log-Gaussian Cox process).

Numerical example – 2D homogeneous PPP λ = 10 points per km² (e.g., customer locations in a city district) Expected points in 5 km² area: 50 Probability of exactly 2 points in 0.1 km² cell: Poisson(λ×0.1 = 1) → e^{-1} × 1^2 / 2! ≈ 0.184

AI applications

  • Location-based recommendation (users in city as point process)

  • Single-cell RNA-seq: gene expression spots in tissue

  • LiDAR / point cloud processing (obstacles as spatial events)

  • Anomaly detection in spatial data (fraudulent transactions clustered in space)

5.4 Applications: event prediction, neural spike trains, temporal recommendation systems, arrival modeling in queuing theory for AI systems

  1. Event prediction

    • Hawkes process on social media timestamps → predict next viral moment

    • NHPP on server logs → predict next DDoS spike or failure

  2. Neural spike trains

    • Neuron firing times modeled as Poisson or Hawkes (Hawkes captures bursting; refractory periods add short-term inhibition)

    • Used in brain-computer interfaces, neural decoding for prosthetics

  3. Temporal recommendation systems

    • User click/stream events as point process

    • Hawkes-based models capture “binge-watching” behavior

    • Example: Netflix session prediction → next show recommendation based on recent watching intensity

  4. Arrival modeling in queuing theory for AI systems

    • Cloud inference requests (API calls to LLM) arrive as Poisson/NHPP

    • Hawkes models bursty traffic (e.g., after viral post → surge of queries)

    • Queuing theory + point process → auto-scaling, load balancing in production AI clusters

Numerical benefit example Standard Poisson arrival model underestimates burst → server overloads. Hawkes model fits bursty data → 20–40% better prediction of peak load → cost savings on cloud resources.

Text summary – point process spectrum in AI

text

Simple Poisson → constant background events
NHPP → time-varying intensity (daily cycles)
Hawkes → self-exciting bursts (viral content, neural bursts)
Cox → doubly stochastic (latent spatial drivers)

Poisson and point processes are the natural tools for modeling irregular, bursty, timestamped, or spatially distributed events — exactly the kind of data that powers recommendation engines, neural interfaces, cloud infrastructure, and predictive maintenance in AI systems.

6. Brownian Motion, Wiener Process and Diffusion Processes

Brownian motion (Wiener process) is the continuous-time limit of random walks and the most important continuous stochastic process in mathematics and AI. In 2026, it is the mathematical foundation of almost all state-of-the-art generative models (diffusion models, score-based generative modeling, flow-matching, consistency models, etc.).

6.1 Definition and properties of standard Brownian motion

Standard Brownian motion (Wiener process) W(t), t ≥ 0 is a continuous-time stochastic process with four defining properties:

  1. W(0) = 0 almost surely

  2. Independent increments: for any 0 ≤ t₁ < t₂ < … < tₙ, the increments W(t₂)-W(t₁), …, W(tₙ)-W(t_{n-1}) are independent

  3. Stationary increments: W(t+s) - W(s) ~ 𝒩(0, t) for any t > 0, s ≥ 0

  4. Continuous paths: W(t) is continuous in t almost surely

Key properties derived from these:

  • W(t) ~ 𝒩(0, t) for each fixed t

  • Cov(W(s), W(t)) = min(s, t)

  • Paths are nowhere differentiable almost surely (very wiggly)

Numerical example – simulate Brownian motion Start at W(0) = 0. In each small time step Δt = 0.01, add Gaussian noise √Δt · Z where Z ~ 𝒩(0,1). After 100 steps (t = 1): E[W(1)] = 0 and Var[W(1)] = 1, so about 68% of paths end inside [-1, +1].

Text illustration – sample path:

text

t=0 t=0.2 t=0.4 t=0.6 t=0.8 t=1.0 0 ────────► +0.4 ───────► -0.1 ───────► +0.7 ───────► -0.2 ───────► +0.3 (random walk in continuous time)
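The simulation recipe above (Δt = 0.01, Gaussian increments √Δt · Z) takes only a few lines; a sketch using the standard library, with the number of paths and seed as illustrative choices:

```python
import math
import random

random.seed(42)
dt = 0.01
n_steps = 100      # so each path runs from t = 0 to t = 1
n_paths = 5000

finals = []
for _ in range(n_paths):
    w = 0.0                                            # W(0) = 0
    for _ in range(n_steps):
        w += math.sqrt(dt) * random.gauss(0.0, 1.0)    # increment ~ N(0, dt)
    finals.append(w)

mean_w1 = sum(finals) / n_paths
var_w1 = sum((w - mean_w1) ** 2 for w in finals) / n_paths
frac_in_band = sum(1 for w in finals if -1.0 <= w <= 1.0) / n_paths   # ≈ 0.68
```

The empirical mean and variance of W(1) match 𝒩(0, 1), and roughly 68% of endpoints land in [-1, +1].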

AI connection Brownian motion is the noise source in diffusion models: x_t ≈ x_0 + √t · ε where ε ~ 𝒩(0,I) (forward process approximation)

6.2 Brownian motion with drift, geometric Brownian motion

Brownian motion with drift W(t) + μ t → Mean = μ t, variance = t → Models processes with constant average velocity (drift) + random fluctuation

Geometric Brownian motion (GBM) dS(t) = μ S(t) dt + σ S(t) dW(t) → S(t) = S(0) exp( (μ - σ²/2) t + σ W(t) )

Numerical example – stock price simulation S(0) = 100, μ = 0.08/year (8% drift), σ = 0.2/year (20% volatility) After t=1 year: Expected S(1) ≈ 100 × exp(0.08) ≈ 108.33 But with volatility: typical paths range 80–140 (log-normal distribution)
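The closed-form GBM solution makes the claim easy to check by Monte Carlo (a sketch with the example's parameters; path count and seed are illustrative):

```python
import math
import random

random.seed(0)
S0, mu, sigma, t = 100.0, 0.08, 0.2, 1.0
n_paths = 20000

# Exact GBM solution: S(t) = S0 * exp((mu - sigma^2/2) t + sigma W(t)), W(t) ~ N(0, t)
finals = [S0 * math.exp((mu - 0.5 * sigma ** 2) * t
                        + sigma * math.sqrt(t) * random.gauss(0.0, 1.0))
          for _ in range(n_paths)]

mc_mean = sum(finals) / n_paths
analytic_mean = S0 * math.exp(mu * t)   # ≈ 108.33
```

The Monte Carlo mean agrees with E[S(1)] = S(0) e^{μt}, while individual paths spread log-normally around it.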

AI relevance GBM used in financial time-series modeling, option pricing (Black-Scholes), and as prior in generative models for positive-valued data (e.g., molecular conformations).

6.3 Stochastic differential equations (SDEs): Itô vs Stratonovich

SDE (general form): dX(t) = μ(X(t), t) dt + σ(X(t), t) dW(t) μ = drift, σ = diffusion coefficient

Itô vs Stratonovich interpretation

  • Itô: uses forward difference → chain rule has extra term d(f(X)) = f'(X) dX + (1/2) f''(X) (dX)²

  • Stratonovich: uses midpoint → ordinary chain rule applies

Numerical example – simple SDE Itô: dX = X dt + X dW → Solution: X(t) = X(0) exp( (1 - 1/2) t + W(t) ) = X(0) exp(0.5 t + W(t))

Stratonovich version would have different drift adjustment.
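A quick numerical check of the Itô example: an Euler–Maruyama simulation of dX = X dt + X dW, driven by the same Brownian increments as the closed-form solution, should agree closely for small Δt (a sketch; step count and seed are illustrative):

```python
import math
import random

random.seed(1)
x0, T, n = 1.0, 1.0, 100_000
dt = T / n

x = x0        # Euler–Maruyama state
w = 0.0       # accumulated Brownian path W(t)
for _ in range(n):
    dw = math.sqrt(dt) * random.gauss(0.0, 1.0)
    x += x * dt + x * dw        # dX = X dt + X dW (Itô)
    w += dw

x_exact = x0 * math.exp(0.5 * T + w)    # closed form: X(t) = X(0) exp(0.5 t + W(t))
rel_err = abs(x - x_exact) / x_exact
```

Using the same noise path for both makes this a strong-convergence check: the relative error shrinks like O(√Δt).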

AI connection Modern diffusion models use Itô SDEs (variance-preserving or variance-exploding formulations) because Itô calculus aligns with discrete-time denoising steps and score matching.

6.4 Fokker–Planck equation and probability flow

Fokker–Planck equation (forward Kolmogorov) Describes evolution of probability density p(x,t):

∂p/∂t = - ∇ · (μ p) + (1/2) ∇ · ∇ · (σ σ^T p)

Probability flow ODE (deterministic counterpart) dx/dt = μ(x,t) - (1/2) ∇·(σ σ^T)(x,t) - (1/2) (σ σ^T)(x,t) ∇ log p_t(x)

Key insight (Song et al., 2020–2021) Diffusion reverse process can be written as pure ODE (probability flow) or SDE — deterministic ODE often gives sharper samples.

Numerical example – Ornstein–Uhlenbeck process dX = -θ X dt + σ dW (mean-reverting) Fokker–Planck → Gaussian density shrinks toward mean over time.
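The mean-reversion claim can be verified empirically. A sketch, assuming illustrative values θ = 1, σ = 0.5 and using the exact one-step OU transition: the empirical variance should approach the Fokker–Planck stationary value σ²/(2θ):

```python
import math
import random

random.seed(7)
theta, sigma = 1.0, 0.5                  # illustrative mean-reversion rate and noise level
dt, n_steps, n_paths = 0.05, 200, 4000   # simulate to t = 10 (effectively stationary)

a = math.exp(-theta * dt)                              # exact OU decay per step
b = math.sqrt(sigma ** 2 / (2 * theta) * (1 - a * a))  # exact conditional std per step

finals = []
for _ in range(n_paths):
    x = 2.0                       # start well away from the mean
    for _ in range(n_steps):
        x = a * x + b * random.gauss(0.0, 1.0)
    finals.append(x)

mean_final = sum(finals) / n_paths
emp_var = sum((v - mean_final) ** 2 for v in finals) / n_paths
stat_var = sigma ** 2 / (2 * theta)    # stationary variance = 0.125
```

The exact transition (rather than Euler–Maruyama) avoids discretization bias, so the empirical density really is the Gaussian predicted by the Fokker–Planck equation.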

AI connection Score function ∇ log p_t(x) is learned in score-based generative models → plug into probability flow ODE → deterministic sampling (faster, higher quality).

6.5 First passage times and hitting probabilities

First passage time τ_A = inf { t ≥ 0 : X(t) ∈ A } Time to first hit set A.

Hitting probability P(τ_A < ∞ | X(0)=x) Probability of ever reaching A starting from x.

Numerical example – Brownian motion Standard Brownian motion starting at x=1, barrier at 0: P(hit 0) = 1 (recurrent in 1D) Mean first passage time to 0 is infinite (heavy tails).
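The hitting probability in this example has a closed form via the reflection principle, P(τ₀ ≤ T) = 2Φ(-x₀/√T), which can be evaluated directly (a sketch; the horizons are illustrative):

```python
import math

def hit_prob(x0, T):
    """P(tau_0 <= T) for Brownian motion started at x0 > 0, barrier at 0.
    Reflection principle: P(tau_0 <= T) = 2 * Phi(-x0 / sqrt(T))."""
    return 1.0 + math.erf(-x0 / math.sqrt(2.0 * T))   # 2*Phi(z) = 1 + erf(z / sqrt(2))

p1 = hit_prob(1.0, 1.0)        # ≈ 0.317 within one time unit
p100 = hit_prob(1.0, 100.0)    # ≈ 0.920 within 100 time units
p_huge = hit_prob(1.0, 1e8)    # → 1: the barrier is eventually hit (recurrence)
```

The probability creeps toward 1 only as T → ∞, which is why the mean first passage time diverges even though the hit is certain.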

AI relevance

  • Escape time from local minima in optimization

  • Time to generate a valid molecule (hitting feasible region)

  • Decision time in RL (first time reward exceeds threshold)

6.6 Why diffusion processes are the mathematical foundation of modern generative AI

Core mathematical bridge (2020–2026)

  1. Forward diffusion = SDE that gradually destroys structure (adds noise) d x = f(x,t) dt + g(t) dW

  2. Reverse process = another SDE that reconstructs data d x = [f(x,t) - g(t)² ∇ log p_t(x)] dt + g(t) dW_backward

  3. Score function s_θ(x,t) ≈ ∇ log p_t(x) is learned via denoising score matching

  4. Sampling = solving reverse SDE numerically (Euler–Maruyama, Heun, DPM-Solver, etc.)

Why it works so well

  • Diffusion is stable and tractable (Gaussian noise)

  • Score matching avoids explicit likelihood computation

  • Probability flow ODE gives deterministic high-quality samples

  • Manifold hypothesis + diffusion naturally handles curved data distributions

2026 reality

  • Stable Diffusion 3, Flux.1, Midjourney v7, Sora, Veo-2, Runway Gen-3, Kling, Luma Dream Machine → all built on diffusion or flow-matching (continuous-time stochastic processes)

  • Pure autoregressive LLMs (GPT-4o, Claude 4) are being hybridized with diffusion for multimodal generation

Analogy Diffusion = sculpting from marble block:

  • Forward: add noise → rough block becomes smooth sphere

  • Reverse: learn how to chisel away noise → recover detailed statue

    7. Generative Modeling via Stochastic Processes – The Big Picture

    This section is the heart of Vol-1. We finally connect classical stochastic processes (especially diffusion processes and SDEs) to the generative modeling revolution that dominates AI in 2026. Almost every high-quality image, video, 3D shape, molecule, protein structure, and audio sample you see today is created using some form of continuous-time generative model rooted in stochastic differential equations.

    We go step-by-step from early autoregressive ideas to the current state-of-the-art (diffusion, score-based, flow-matching, consistency models).

    7.1 From autoregressive models to continuous-time generative models

    Autoregressive models (PixelRNN, PixelCNN, GPT family, early audio models)

    • Generate one token/pixel/sample at a time conditioned on all previous ones

    • p(x) = ∏ p(x_i | x_{<i})

    • Discrete-time, sequential, very slow inference (one step per dimension)

    Limitations

    • O(n) sampling steps for n-dimensional data → impractical for images (1024×1024 = 3 million pixels)

    • No natural way to model continuous distributions

    Continuous-time generative models (diffusion revolution 2020–2026)

    • Treat data as continuous signal x₀

    • Gradually corrupt x₀ → pure noise x_T via forward stochastic process

    • Learn to reverse the corruption → generate new samples from noise

    Key advantages

    • Parallelizable training

    • High-quality samples (especially images, video, 3D)

    • Natural handling of continuous data

    • Mathematical elegance (SDEs, score matching)

    Transition timeline

    • 2014–2018: VAEs, GANs → first deep generative models

    • 2015: Sohl-Dickstein et al. → early diffusion idea

    • 2019–2020: Song & Ermon → score-based generative modeling

    • 2020: Ho et al. → DDPM (the breakthrough paper)

    • 2021–2026: Latent diffusion, classifier-free guidance, consistency models, flow-matching → production quality

    Analogy Autoregressive = writing a book word-by-word (slow, sequential) Diffusion = starting with a blurry photo → gradually sharpening it until crystal clear (parallel training, iterative refinement)

    7. Stochastic Optimal Control and Diffusion for Planning

    Stochastic optimal control (SOC) provides the mathematical lens that unifies reinforcement learning, planning, and modern generative modeling. In 2026, diffusion models are increasingly viewed as a form of stochastic control: generating trajectories (whether pixels or robot actions) is equivalent to steering a stochastic system from noise/current state to a desired distribution/goal.

    This section bridges classical control theory with the diffusion-based planning revolution.

    7.1 Stochastic optimal control formulation of RL

    Stochastic Optimal Control (SOC) Find policy/controller u(t) that minimizes expected cost:

    J(u) = E [ ∫_0^T c(x(t), u(t), t) dt + Φ(x(T)) ]

    subject to stochastic dynamics:

    dx = f(x,u,t) dt + g(x,u,t) dW

    RL as SOC

    • State x = environment state s

    • Control u = action a

    • Cost c = -r (negative reward)

    • Terminal cost Φ = 0 or goal penalty

    • Discount γ → exponential cost decay c(t) = γ^t (-r_t)

    Standard RL objective becomes:

    min_π E_π [ Σ_t γ^t (-r_t) ] = max_π E_π [ Σ_t γ^t r_t ]

    KL-regularized RL (maximum entropy RL, soft Q-learning) Add KL divergence penalty to prevent collapse to deterministic policy:

    J(π) = E [ Σ r_t + α H(π(·|s_t)) ]

    This is equivalent to SOC with control cost proportional to KL(π || uniform).

    Numerical example – simple 1D control State x ∈ ℝ, action u ∈ ℝ Dynamics: dx = u dt + 0.1 dW Cost: c = x² + 0.01 u² Optimal control: u* = -k x (linear feedback) KL-regularized: adds exploration noise → u = -k x + noise
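For this linear-quadratic example the optimal gain has a closed form. A sketch, assuming the standard infinite-horizon scalar Riccati equation for dx = u dt (so k = √(q/r) = 10 for q = 1, r = 0.01); the simulation settings are illustrative:

```python
import math
import random

# Matches the example: dynamics dx = u dt + 0.1 dW, cost c = x^2 + 0.01 u^2
q, r, noise = 1.0, 0.01, 0.1

# Scalar algebraic Riccati equation with dx = u dt (a = 0, b = 1):
# q - P^2 / r = 0  ->  P = sqrt(q r),  optimal gain k = P / r = sqrt(q / r)
P = math.sqrt(q * r)
k = P / r                      # = 10 -> optimal control u* = -10 x

# Closed-loop simulation: the controller keeps x near 0 despite the noise
random.seed(3)
dt, x = 0.001, 1.0
for _ in range(5000):          # 5 time units
    u = -k * x
    x += u * dt + noise * math.sqrt(dt) * random.gauss(0.0, 1.0)
```

The closed loop dx = -kx dt + 0.1 dW is an OU process, so the state fluctuates in a narrow stationary band around 0.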

    2026 insight Many state-of-the-art methods (PPO with entropy, SAC, Diffusion Policy) are approximate SOC solvers.

    7.2 Diffusion for trajectory generation and planning (Diffuser, Plan4MC)

    Diffusion for planning Treat entire future trajectory τ = (s_t, a_t, s_{t+1}, …, s_{t+H}) as the “image” to generate.

    Forward diffusion Add noise to trajectory → τ_T ≈ pure Gaussian noise

    Reverse diffusion Condition on current state s_t and goal (or reward) → denoise to feasible, high-reward trajectory

    Diffuser (Janner et al. 2022–2023)

    • Diffusion over trajectory tokens

    • Classifier guidance toward high-reward regions

    • Iterative refinement → plan → execute first action → replan

    Plan4MC / Diffusion Planner variants (2024–2026)

    • Latent diffusion in world-model latent space (Dreamer-style)

    • Reward-conditioned score function → generate diverse plan ensembles

    • Select best trajectory via MPC rollouts or learned value

    Numerical example – block stacking Current state s_t = robot + block positions Condition diffusion on goal = block on target Sample 50 trajectories → evaluate with short-horizon MPC or learned critic → pick top-1 → execute first action

    Advantages

    • Generates diverse plans (handles uncertainty)

    • Naturally incorporates constraints via guidance

    • Scales to long horizons via latent space

    2026 status Diffusion planning is now competitive or superior to classical MPC in manipulation and legged locomotion (real-robot demos in labs).

    7.3 Schrödinger bridge and optimal transport in control

    Schrödinger bridge (1930s, rediscovered 2022–2026) Find the most likely stochastic path (bridge) connecting two distributions p_0 (data/current state) and p_T (noise/goal) while minimizing KL divergence to a reference process (e.g., Brownian motion).

    Mathematical form min_q KL(q || p_ref) subject to marginals q_0 = p_0, q_T = p_T

    Connection to diffusion Reverse diffusion is an approximate Schrödinger bridge from noise to data.

    Connection to control Schrödinger bridge = stochastic optimal control problem with fixed marginals → Optimal drift = reference drift + score difference

    Numerical example – bridge from 𝒩(0,1) to 𝒩(5,1) Reference = Brownian motion Optimal bridge = deterministic path with added controlled noise → Straight-line mean shift + minimal diffusion

    2026 applications

    • Rectified flow / flow-matching ≈ discretized Schrödinger bridges → 1–5 step generation

    • Trajectory planning: bridge from current state distribution to goal distribution

    • Offline RL: bridge between behavior policy and optimal policy

    7.4 Control as inference: KL-regularized RL and reward-weighted regression

    Control as inference Cast RL as inference in a probabilistic graphical model:

    • High reward → high probability

    • Policy π(a|s) → likelihood

    • Add KL divergence KL(π || prior) as prior preference for simple/smooth policies

    KL-regularized RL J(π) = E [ Σ r_t - α KL(π_old || π) ] → Soft Q-learning, MPO, REPS, TRPO/PPO all derive from this

    Reward-weighted regression Update policy by weighted regression:

    π_new(a|s) ∝ π_old(a|s) exp( (1/α) Â(s,a) )

    Numerical example – reward-weighted update Old policy π_old(a1|s) = 0.6, π_old(a2|s) = 0.4 Advantages Â(a1) = +4, Â(a2) = -1 α = 1 → exp(Â/α) = exp(4) ≈ 54.6, exp(-1) ≈ 0.368 Weighted: 0.6 × 54.6 ≈ 32.76, 0.4 × 0.368 ≈ 0.147 Normalized → π_new(a1) ≈ 0.996, π_new(a2) ≈ 0.004 → Strong shift toward high-advantage action
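The update is a few lines of code (a sketch of the exponential reweighting; the exact normalized weights come out near 0.996 / 0.004):

```python
import math

# Reward-weighted regression: pi_new(a|s) ∝ pi_old(a|s) * exp(A(s,a) / alpha)
pi_old = {"a1": 0.6, "a2": 0.4}
adv = {"a1": 4.0, "a2": -1.0}
alpha = 1.0

weights = {a: p * math.exp(adv[a] / alpha) for a, p in pi_old.items()}
z = sum(weights.values())               # normalizing constant
pi_new = {a: w / z for a, w in weights.items()}
```

Smaller α sharpens the update (more greedy), larger α keeps the new policy closer to the old one.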

    2026 practice

    • PPO = approximate KL-constrained inference

    • Diffusion fine-tuning = reward-weighted denoising

    • Control as inference → unifying language for RL + generative modeling

    7.5 Diffusion policies vs traditional policy networks

    Traditional policy networks π_θ(a|s) = MLP / Transformer → deterministic or Gaussian output Trained with policy gradient / actor-critic

    Diffusion policies (Chi et al. 2023–2025 → widespread in robotics 2026) Policy = diffusion model conditioned on s Generate action sequence a_t, a_{t+1}, … via reverse diffusion Condition on current observation s → denoise to feasible action trajectory

    Advantages

    • Multimodal actions → captures multiple good ways to act

    • Handles constraints naturally (via guidance)

    • Uncertainty-aware → sample variance indicates confidence

    • Long-horizon consistency (diffusion over trajectory)

    Numerical example – robot pushing State s = object + gripper pose Diffusion policy generates 16-step action sequence (joint torques) Sample 50 trajectories → pick highest critic value or most consistent one → Success rate 75–90% vs 50–70% for Gaussian policy

    2026 status

    • Diffusion Policy → SOTA on many real-robot manipulation benchmarks

    • Combines with MPC → hybrid diffusion + model-predictive refinement

    • Used in humanoid robots, dexterous hands, autonomous vehicles

    Stochastic optimal control and diffusion-based planning represent the convergence of generative modeling and decision-making — the most exciting frontier in AI in 2026.

    8. Advanced Diffusion Models and Stochastic Processes

    This section explores the major advancements and variants that have made diffusion models the dominant generative paradigm in 2026. We cover different formulations of the diffusion process, deterministic/flow-based alternatives, extensions to curved/non-Euclidean domains, latent-space acceleration (the Stable Diffusion family), and discrete/abstractive diffusion models.

    All concepts build directly on the SDE framework from Section 6 and the score-matching objective from Section 7.

    8.1 Variance-exploding (VE) vs variance-preserving (VP) formulations

    The forward diffusion process can be defined in two main ways, differing in how the noise variance evolves over time. This choice affects training stability, sampling behavior, and final sample quality.

    Variance-Exploding (VE) – Song & Ermon / NCSN++ style

    • Forward SDE: dx = √(dσ²(t)/dt) dW

    • Variance σ²(t) starts small (near 0) and explodes to a very large value (σ_max ≈ 50–300)

    • Data signal x₀ decays slowly → at large t, x_t is dominated by isotropic Gaussian noise with huge variance

    • Score function at late t: ∇ log p_t(x) ≈ -x / σ²(t) (pulls toward origin with very small force)

    Variance-Preserving (VP) – Ho et al. DDPM style

    • Forward process (discrete): x_t = √α_bar_t x_0 + √(1-α_bar_t) ε

    • Total variance of x_t remains approximately 1 (preserved) throughout

    • Continuous SDE equivalent: dx = -½ β(t) x dt + √β(t) dW

    • β(t) is the noise schedule (small early, larger later)

    • Score function at late t: ∇ log p_t(x) ≈ -x (unit-scale pull toward origin)
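The variance-preservation property follows from ᾱ_t + (1 - ᾱ_t) = 1 and can be checked empirically with the discrete update x_t = √ᾱ_t x₀ + √(1-ᾱ_t) ε. A sketch with an illustrative cosine-style schedule and unit-variance toy data:

```python
import math
import random

random.seed(0)

def alpha_bar(t):
    # Cosine-style schedule (illustrative choice), t in [0, 1]
    return math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2

n = 20000
x0 = [random.gauss(0.0, 1.0) for _ in range(n)]    # unit-variance toy "data"

variances = []
for t in (0.1, 0.5, 0.9):
    ab = alpha_bar(t)
    xt = [math.sqrt(ab) * x + math.sqrt(1 - ab) * random.gauss(0.0, 1.0) for x in x0]
    variances.append(sum(v * v for v in xt) / n)   # Var(x_t) = ab * 1 + (1 - ab) = 1
```

Whatever t you pick, the marginal variance of x_t stays at 1, which is exactly the "preserving" in VP.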

    Comparison Table (2026 perspective)

    | Aspect | Variance-Exploding (VE) | Variance-Preserving (VP) |
    |---|---|---|
    | Final noise variance | Very large (σ² → 10³–10⁵) | Bounded ≈ 1 |
    | Signal decay | Slow (x₀ term persists longer) | Fast (x₀ term → 0 quickly) |
    | Score magnitude late in process | Very small (1/σ²(t)) | Order 1 |
    | Numerical stability | Can be unstable at large σ | More stable |
    | Typical schedule | Exponential or linear σ²(t) | Cosine or linear β(t) |
    | Popular in production | Research, some high-fidelity models | Stable Diffusion family, Flux, most open models |
    | Sampling speed | Similar with good solvers | Slightly faster in practice |

    Numerical intuition

    • VE at t large: x_t ≈ 𝒩(0, 10000 I) → score ≈ -x/10000 (very weak pull)

    • VP at t large: x_t ≈ 𝒩(0, I) → score ≈ -x (strong, unit-scale pull) → VP is easier to learn and more numerically stable for most image/video tasks.

    2026 practice VP + cosine schedule is the default in almost all production open models (Stable Diffusion 3, SDXL, Flux.1, AuraFlow). VE is still used in some research for theoretical flexibility or when combining with flow-matching.

    8.2 Rectified flow, flow-matching, and stochastic interpolants

    These deterministic or near-deterministic alternatives to stochastic diffusion often achieve faster sampling with comparable or better quality.

    Rectified flow (Liu et al. 2022–2023 → major refinements 2024–2025)

    • Learn straight-line paths from noise z ~ 𝒩(0,I) to data x₀

    • Velocity field v_θ(z,t) predicts dx/dt along the path

    • Train to minimize difference between predicted and true straight velocity

    • Sampling = integrate ODE from t=1 (noise) to t=0 (data)

    Flow-matching (Lipman et al. 2022–2023 → dominant in 2026)

    • Generalizes rectified flow

    • Learns conditional velocity field u_θ(x|t) that transports marginal p_t to data p_0

    • Objective: regress u_θ(x(t),t) to target velocity (straight-line or optimal transport velocity)
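The straight-line conditional target can be made concrete: with x_t = (1-t) x₀ + t z the target velocity is the constant z - x₀, and because the true path is straight, a single Euler step of the ODE from noise recovers the data exactly. A toy sketch (the data/noise distributions are illustrative):

```python
import random

random.seed(0)

# Conditional straight-line interpolant: x_t = (1 - t) * x0 + t * z,
# whose constant velocity dx/dt = z - x0 is the flow-matching regression target.
x0 = [random.gauss(2.0, 0.5) for _ in range(1000)]   # toy "data" samples
z = [random.gauss(0.0, 1.0) for _ in range(1000)]    # noise endpoints

# Integrate dx/dt = v from t = 1 (noise) back to t = 0 (data).
# Because the true path is straight, ONE Euler step recovers the data exactly.
recovered = [zi - 1.0 * (zi - xi) for zi, xi in zip(z, x0)]

max_err = max(abs(rec - xi) for rec, xi in zip(recovered, x0))
```

In practice the learned marginal velocity field is not perfectly straight, which is why real samplers still use a handful of steps rather than one.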

    Stochastic interpolants (Albergo & Vanden-Eijnden 2023+)

    • Add controlled noise to flow-matching paths → hybrid stochastic-deterministic

    • Allows tunable exploration vs determinism

    Numerical comparison (typical ImageNet 256×256, 2026 benchmarks)

    • DDPM/VP (50 steps): FID ≈ 2.0–3.0

    • Flow-matching / rectified flow (5–10 steps): FID ≈ 2.2–3.5

    • Consistency-distilled flow-matching (1–4 steps): FID ≈ 2.8–4.0 → 10–50× faster sampling with small quality trade-off

    Analogy Diffusion = random walk from noise to data (many small noisy steps) Rectified flow / flow-matching = straight highway from noise to data (few large directed steps)

    2026 status Flow-matching + consistency distillation is now the fastest path to high-quality generation. Flux.1, AuraFlow, and many open models use flow-matching as backbone.

    8.3 Diffusion on non-Euclidean manifolds (Riemannian diffusion)

    Standard diffusion assumes flat Euclidean space. Real data often lies on curved manifolds (spheres for directions, hyperbolic for hierarchies, tori for periodic variables, SE(3) for 3D poses).

    Riemannian diffusion Forward SDE defined through the Riemannian metric, whose intrinsic Brownian motion is generated by the Laplace–Beltrami operator Δ_M:

    dx = f(x,t) dt + g(t) dW_M (W_M = Brownian motion on the manifold M)

    Reverse process Learns Riemannian score ∇_M log p_t(x) in tangent space at x Sampling uses Riemannian Euler–Maruyama or geodesic integrators

    Key models & papers (2023–2026)

    • GeoDiff → first practical Riemannian diffusion for molecules (torsion angles on torus)

    • Riemannian Score Matching (Huang et al.) → general framework

    • Manifold Diffusion Models (2024–2025) → extensions to hyperbolic, spherical, Grassmann manifolds

    • Diffusion on SE(3) → 3D pose & molecule generation

    Numerical example – torus for torsion angles Molecule with 5 rotatable bonds → configuration space = torus T⁵ Forward: add toroidal Brownian motion Score learned in tangent space → reverse sampling stays on torus → valid conformations

    Applications

    • Protein/molecule generation (torsion diffusion)

    • Directional image generation (spherical diffusion)

    • Hierarchical graph generation (hyperbolic diffusion)

    • Robot pose planning (SE(3) diffusion)

    8.4 Latent diffusion models (LDM, Stable Diffusion family)

    Latent Diffusion Models (LDM) (Rombach et al. 2022 → foundation of Stable Diffusion 1–3, SDXL, Flux.1, AuraFlow) Run diffusion in low-dimensional latent space instead of high-res pixel space.

    Workflow

    1. Train autoencoder (VAE or VQ-VAE) to compress x → z (e.g., 512×512 → 64×64×4)

    2. Run diffusion on z (much cheaper)

    3. Decode final z → high-resolution image

    Why it works

    • Latent space is smoother and lower-dimensional → faster training/sampling

    • Perceptual compression (KL-regularized VAE) preserves high-frequency details in decoder

    Numerical impact

    • Pixel-space diffusion on 512×512: ~10–20× slower training

    • Latent diffusion: trains on 64×64 latents → 4–8× speedup, same perceptual quality

    2026 extensions

    • SD3 Medium / SD3.5 → larger latents + better VAEs + rectified flow

    • Flux.1 → flow-matching in latent space + massive pretraining

    • LCM-LoRA / SDXL Turbo → 1–4 step latent generation

    8.5 Discrete diffusion and absorbing state models (D3PM, MaskGIT)

    Discrete diffusion Diffusion on discrete tokens (text, graphs, protein sequences, images with VQ-VAE).

    Absorbing state models (D3PM – Austin et al. 2021)

    • Forward: gradually replace tokens with absorbing [MASK] state

    • Reverse: learn to recover original token from masked context

    • Transition matrix: categorical diffusion with absorbing state

    MaskGIT / MAGE / Masked Generative Transformers (2022–2025)

    • Mask large portions → predict masked tokens in parallel (BERT-like)

    • Iterative refinement: mask → predict → remask uncertain tokens → repeat

    Numerical example – discrete text diffusion Sequence: “the cat sat on the mat” Forward: at step t, each token → [MASK] with probability β_t Reverse: model p_θ(token | masked context) After 10–20 iterations → coherent sentence from full mask
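The forward masking process of this example can be sketched directly; the constant per-step masking probability β_t = 0.15 and the step count are illustrative choices:

```python
import random

random.seed(0)
tokens = "the cat sat on the mat".split()
T = 10
beta = [0.15] * T     # per-step masking probability (illustrative constant schedule)

# Forward absorbing-state diffusion: each still-visible token -> [MASK] w.p. beta_t.
# [MASK] is absorbing: once masked, a token stays masked.
seq = list(tokens)
for t in range(T):
    seq = [tok if tok == "[MASK]" or random.random() > beta[t] else "[MASK]"
           for tok in seq]

# Probability a given token survives all T steps unmasked: prod(1 - beta_t)
p_survive = 1.0
for b in beta:
    p_survive *= 1.0 - b       # 0.85^10 ≈ 0.20
```

The reverse model then learns p_θ(token | masked context), and sampling iterates predict → remask → predict until the sequence is fully revealed.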

    2026 status

    • Discrete diffusion used in DNA/protein sequence design (e.g., EvoDiff)

    • MaskGIT-style models competitive with autoregressive LLMs for infilling, editing, and code generation

    • Hybrid continuous-discrete diffusion → token latents + continuous diffusion (e.g., image tokenization + diffusion)

    This section shows how the diffusion paradigm has evolved into a versatile, high-performance framework — from continuous pixel/video generation to discrete token sequences and curved manifold data. These advancements are behind nearly every production-grade generative system in 2026.

    9. Stochastic Differential Equations (SDEs) in Generative AI

    Stochastic Differential Equations (SDEs) are the continuous-time mathematical backbone of all modern diffusion-based generative models. In 2026, nearly every high-quality image, video, 3D molecule, protein structure, audio, and even planning trajectory is generated by solving an SDE (or its deterministic flow counterpart) in the forward (noise addition) or reverse (denoising) direction.

    This section explains the core SDE formulation, how reverse-time SDEs are derived, practical numerical solvers, adaptive acceleration methods, and the deep theoretical connections to optimal control and Schrödinger bridges.

    9.1 Forward SDE → reverse-time SDE → score function

    Forward SDE (data → noise) The forward diffusion process gradually corrupts clean data x₀ into pure noise x_T:

    dx = f(x, t) dt + g(t) dW

    Common choices in 2026:

    • Variance-Preserving (VP, DDPM style): f(x,t) = -½ β(t) x, g(t) = √β(t)

    • Variance-Exploding (VE): f(x,t) = 0, g(t) = √(dσ²(t)/dt)

    Reverse-time SDE (noise → data) Anderson (1982) showed that the reverse process has the same diffusion coefficient g(t) but adjusted drift:

    dx = [f(x,t) - g(t)² ∇_x log p_t(x)] dt + g(t) dW_backward

    Score function s(x,t) = ∇_x log p_t(x) This is the key quantity we learn: it points toward high-density regions at noise level t.

    Training objective Denoising score matching (equivalent to the diffusion loss): L(θ) = E_{t,x_0,ε} [ || s_θ(x_t,t) + ε / σ_t ||² ], where σ_t is the standard deviation of the forward perturbation kernel p(x_t | x_0) → the model s_θ learns to predict the direction that removes the noise.

    Numerical example – VP forward/reverse x₀ = 1 (1D data point), β(t) = 0.02 t (linear schedule) At t = 0.5: ∫₀^0.5 β(s) ds = 0.0025 → ᾱ ≈ e^{-0.0025} ≈ 0.9975, √ᾱ ≈ 0.999, √(1-ᾱ) ≈ 0.05 x_{0.5} ≈ 0.999 + 0.05 ε Score = -(x_{0.5} - √ᾱ x₀) / (1-ᾱ) ≈ -20 ε Reverse drift = -½ β x - β × score ≈ -0.005 x + 0.2 ε → Cancels the injected noise and pulls x back toward the original x₀.
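These quantities can be recomputed exactly: for this schedule ᾱ(t) = exp(-∫₀ᵗ β(s) ds) = exp(-0.01 t²), and the conditional score equals -ε/σ_t. A sketch (the concrete noise draw ε = 0.3 is an illustrative choice):

```python
import math

# VP marginals for beta(t) = 0.02 t:  alpha_bar(t) = exp(-0.01 t^2)
t = 0.5
alpha_bar = math.exp(-0.01 * t * t)     # ≈ 0.9975
mean_coef = math.sqrt(alpha_bar)        # ≈ 0.999
std = math.sqrt(1.0 - alpha_bar)        # ≈ 0.05

# x_t = mean_coef * x0 + std * eps; the exact conditional score is
# score = -(x_t - mean_coef * x0) / (1 - alpha_bar) = -eps / std ≈ -20 eps
x0, eps = 1.0, 0.3                      # one concrete noise draw (illustrative)
x_t = mean_coef * x0 + std * eps
score = -(x_t - mean_coef * x0) / (1.0 - alpha_bar)
```

The score grows like 1/σ_t as t → 0, which is why diffusion losses weight the regression target by the noise level.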

    Analogy Forward SDE = slowly dissolving sugar in water (data → noise) Reverse SDE = magically reassembling sugar crystal from solution (noise → data) Score function = force field that guides molecules back to crystal positions.

    9.2 Numerical solvers: Euler–Maruyama, Heun, predictor-corrector samplers

    Sampling from the reverse SDE requires discretizing the continuous-time equation.

    Euler–Maruyama (first-order, simplest) x_{t-Δt} ≈ x_t - [f(x_t,t) - g(t)² s_θ(x_t,t)] Δt + g(t) √Δt Z, Z ~ 𝒩(0,I) (minus sign because the reverse SDE is integrated backward in time)

    Heun’s method (second-order predictor-corrector) Predictor: x̂ = x_t + drift Δt + diffusion √Δt Z Corrector: average drift at x_t and x̂ → more accurate

    Predictor-Corrector sampler (Song et al. 2021) Predictor: one Euler–Maruyama step Corrector: multiple Langevin MCMC steps (score-based gradient ascent) → Combines fast prediction with refinement
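A toy sanity check of the reverse-time Euler–Maruyama step: for Gaussian data under a VP forward process, p_t = 𝒩(0,1) at every t, so the exact score is s(x,t) = -x and the sampler should map prior noise back to 𝒩(0,1). A sketch with an illustrative constant schedule β = 1:

```python
import math
import random

random.seed(5)

# Toy reverse-time Euler–Maruyama. If the data is x0 ~ N(0,1), then under a VP
# forward process p_t = N(0,1) for every t, so the exact score is s(x,t) = -x.
beta = 1.0                 # constant schedule (illustrative)
n_steps, dt = 100, 0.01    # integrate t from 1 down to 0
n_samples = 5000

def f(x, t):
    return -0.5 * beta * x        # VP forward drift

def score(x, t):
    return -x                     # exact score for this toy problem

finals = []
for _ in range(n_samples):
    x = random.gauss(0.0, 1.0)    # start from the prior at t = 1
    for i in range(n_steps):
        t = 1.0 - i * dt
        # x_{t - dt} = x_t - [f - g^2 s] dt + g sqrt(dt) Z
        x = (x - (f(x, t) - beta * score(x, t)) * dt
             + math.sqrt(beta * dt) * random.gauss(0.0, 1.0))
    finals.append(x)

mean = sum(finals) / n_samples
var = sum((v - mean) ** 2 for v in finals) / n_samples   # should stay ≈ 1
```

Because the marginals are invariant here, any drift or sign error in the discretization shows up immediately as a wrong output variance, which makes this a useful unit test for real samplers.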

    Numerical comparison (typical FID on CIFAR-10 32×32, 2026 benchmarks)

    • Euler–Maruyama (50 steps): FID ≈ 4–6

    • Heun / PC sampler (20–30 steps): FID ≈ 3–4

    • DPM-Solver / UniPC (10–15 steps): FID ≈ 2.5–3.5

    Analogy Euler–Maruyama = basic forward Euler integration (fast but inaccurate) Heun / PC = Runge–Kutta style (better accuracy per step) → Fewer steps needed for same quality

    9.3 Adaptive step-size solvers (DPM-Solver, DEIS, UniPC)

    DPM-Solver (Lu et al. 2022–2023 → DPM-Solver++ 2023) Analytic multi-step solver for VP/VE SDEs → exact solution under linear assumption → very accurate at large steps

    DEIS (Diffusion Exponential Integrator Sampler) Exponential integrator + adaptive step-size → fewer steps than DPM-Solver

    UniPC (Universal Predictor-Corrector, 2023–2024 → dominant in 2026) Unified framework combining predictor-corrector + multi-step solvers → state-of-the-art speed/quality trade-off

    Numerical example (typical 2026 benchmarks)

    • DDIM / Euler (50 steps): FID ≈ 4.0

    • DPM-Solver++ (15 steps): FID ≈ 3.2

    • UniPC (8 steps): FID ≈ 3.4–3.8 → 6× faster sampling with almost no quality drop

    2026 practice UniPC + LCM-LoRA / SDXL Turbo → 1–4 step generation on consumer GPUs Used in production for real-time image/video editing

    9.4 Connection to optimal control and Schrödinger bridge

    Stochastic optimal control view Diffusion sampling = solving a stochastic control problem Minimize cost functional: E[ ∫ L(x,u,t) dt + terminal cost ] where u(t) = control (drift adjustment), L = regularization on control effort

    Schrödinger bridge (1930s, rediscovered 2022–2026) Find most likely stochastic path from noise distribution q_T to data distribution p_0 Equivalent to stochastic optimal control with fixed marginals

    Recent breakthrough Rectified flow, flow-matching, and stochastic interpolants are approximations of Schrödinger bridge solutions → Deterministic paths → faster, more stable sampling

    Numerical insight Schrödinger bridge between 𝒩(0,I) and data distribution → optimal transport-like paths Flow-matching directly regresses to these optimal velocities → fewer steps needed

    AI connection 2025–2026 models (Flow Matching, Rectified Flow, Consistency Trajectory Models) are essentially discretized Schrödinger bridges → unify diffusion and flow-based generation.

    9.5 Stochastic optimal control interpretation of diffusion sampling

    Full optimal control formulation Sampling reverse SDE = minimizing KL divergence between forward and reverse paths Equivalent to stochastic control:

    • State = x(t)

    • Control = drift adjustment - (1/2) g² ∇ log p

    • Cost = KL divergence to data distribution at t=0

    Practical impact

    • Guidance as control: classifier guidance = extra drift term toward class condition

    • CFG (classifier-free guidance) = learned control that amplifies prompt direction

    • Reward-weighted sampling = change cost functional to include external reward (RL fine-tuning of diffusion)

    Numerical example – CFG as control Base drift = - (1/2) β(t) x + score term Guidance adds w × (score_conditional - score_unconditional) w = 7.5 → strong control toward prompt → sharper, more faithful samples
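Guidance-as-control is easy to see in one dimension with Gaussian scores (illustrative means; for a unit-variance Gaussian 𝒩(μ,1) the score is s(x) = -(x - μ)): the guided score s_u + w(s_c - s_u) equals the score of a Gaussian whose mean is pushed past the conditional mean:

```python
# Classifier-free guidance as an extra drift/control term, in one dimension.
# For a unit-variance Gaussian N(mu, 1), the score at x is s(x) = -(x - mu).
def score_gauss(x, mu):
    return -(x - mu)

w = 7.5                          # guidance scale from the example
mu_uncond, mu_cond = 0.0, 2.0    # illustrative unconditional / conditional modes
x = 1.0

s_u = score_gauss(x, mu_uncond)
s_c = score_gauss(x, mu_cond)
s_guided = s_u + w * (s_c - s_u)

# For unit-variance Gaussians this equals the score of N(w*mu_c - (w-1)*mu_u, 1):
# guidance over-shifts the effective mean far beyond the conditional mean.
effective_mu = w * mu_cond - (w - 1) * mu_uncond
```

This over-shift is the 1D analogue of why large guidance scales produce sharper but less diverse samples.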

    2026 frontier Diffusion models are now routinely fine-tuned with RL objectives (reward-weighted sampling, PPO-style) → stochastic optimal control lens explains why they align so well with human preferences.

    This section shows how SDEs are not just a mathematical curiosity — they are the active engine behind every major generative breakthrough in 2026. The next sections cover implementation, case studies, challenges, and future directions.

10. Practical Implementation Tools and Libraries (2026 Perspective)

In March 2026, the Python ecosystem for diffusion models, score-based generation, SDEs, and stochastic processes is extremely mature. Most production-grade models (Stable Diffusion 3.5, Flux.1, SDXL Turbo, LCM-LoRA, AuraFlow, consistency-based generators) are built using a small set of battle-tested libraries.

This section covers the essential tools, their current status, quick-start code, and five hands-on mini-projects you can run today (all Colab-friendly).

10.1 Diffusion frameworks: Diffusers (Hugging Face), score_sde, OpenAI guided-diffusion

Hugging Face Diffusers (the de-facto industry standard in 2026)

  • Repository: https://github.com/huggingface/diffusers

  • Current version: ≥ 0.32.x

  • Install: pip install diffusers[torch] accelerate transformers

  • Supports: DDPM, DDIM, PNDM, LCM, Consistency Models, Stable Diffusion 1–3.5, Flux.1, SDXL, ControlNet, IP-Adapter, LoRA, textual inversion, etc.

  • Features: GPU-accelerated, ONNX export, torch.compile support, fast inference, community pipelines

Quick-start example – generate image with Flux.1 (flow-matching)

Python

from diffusers import FluxPipeline
import torch

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # save VRAM

prompt = "A cyberpunk city at night with neon lights and flying cars, ultra detailed, cinematic"
image = pipe(
    prompt,
    num_inference_steps=20,
    guidance_scale=3.5,
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]
image.save("cyberpunk_flux.png")

score_sde (Song et al. reference implementation – research favorite)

  • Repository: https://github.com/yang-song/score_sde

  • Still the gold-standard codebase for score-based generative modeling research

  • Supports VE, VP, sub-VP, NCSN++ architectures, continuous-time SDEs

  • Great for custom experiments (e.g., manifold diffusion, new samplers)

OpenAI guided-diffusion (legacy but educational)

2026 recommendation
→ Use Diffusers for 95% of practical work (production, prototyping, fine-tuning)
→ Use score_sde when you need full control over the SDE formulation or score-matching loss

10.2 SDE solvers: torchdiffeq, torchsde, jaxdiff

torchdiffeq (PyTorch ODE/SDE solvers)

torchsde (dedicated PyTorch SDE solver)

jaxdiff / diffrax (JAX ecosystem – fastest for large-scale research in 2026)

Quick torchsde example – reverse SDE sampling

Python

import torch
import torchsde

class ReverseSDE(torch.nn.Module):
    noise_type = "diagonal"     # torchsde requires these class attributes
    sde_type = "stratonovich"   # the "heun" solver uses the Stratonovich interface

    def f(self, t, y):
        return drift_net(y, t)        # learned reverse-time drift (your trained network)

    def g(self, t, y):
        return diffusion_net(y, t)    # learned diagonal diffusion coefficient

sde = ReverseSDE().cuda()
# torchsde expects a 2-D (batch, state) tensor → flatten a batch of 64 noise images
y0 = torch.randn(64, 3 * 64 * 64).cuda()
# torchsde also requires increasing times → integrate in s = 1 − t (reverse time),
# with the time reversal handled inside the learned drift
ts = torch.linspace(0.0, 1.0, 50).cuda()
ys = torchsde.sdeint(sde, y0, ts, method="heun")
generated = ys[-1].view(64, 3, 64, 64)  # final samples at t = 0

10.3 Manifold diffusion: GeoDiff, Riemannian Score Matching libraries

GeoDiff (2022–2023, still widely cited)

Riemannian Score Matching & GeoScore (2023–2026 extensions)

Quick usage pattern (using Geomstats + custom score model)

Python

from geomstats.geometry.hypersphere import Hypersphere

manifold = Hypersphere(dim=2)  # S² example
# score_model = YourScoreNet()  # learns ∇ log p_t in tangent space
# Forward: spherical Brownian motion
# Reverse: sample using Riemannian Euler–Maruyama + learned score

2026 note Manifold diffusion is now standard for 3D molecules (RFdiffusion, Chroma), directional images (spherical diffusion), and hierarchical graphs (hyperbolic diffusion).

10.4 Fast sampling: Consistency Models, Latent Consistency Models (LCM), SDXL Turbo

Consistency Models (Song et al. 2023)

  • Train model to predict x₀ directly from any noisy x_t

  • One-step or few-step generation after distillation

Latent Consistency Models (LCM) (Luo et al. 2023–2024)

  • Distilled version of SDXL → 4–8 step generation in latent space

  • LCM-LoRA: plug-and-play adapter for any SD checkpoint

SDXL Turbo (Stability AI 2023–2024)

  • Adversarial diffusion distillation → 1–4 step generation

  • CFG scale = 0 (adversarial training removes need for guidance)

Quick LCM-LoRA usage (Diffusers)

Python

from diffusers import DiffusionPipeline, LCMScheduler
import torch

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)  # LCM-LoRA requires the LCM scheduler
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")
pipe.to("cuda")

image = pipe(
    "A cyberpunk city at night with flying cars and neon lights, ultra detailed",
    num_inference_steps=4,
    guidance_scale=0.0,
    generator=torch.manual_seed(42)
).images[0]

2026 status

  • LCM-LoRA + SDXL Turbo → real-time generation on RTX 40-series / mobile GPUs

  • Consistency distillation is now default in most consumer tools

10.5 Mini-project suggestions

  1. Beginner: DDPM from scratch (1D toy data)

    • Dataset: 1D mixture of Gaussians

    • Implement forward noise addition + reverse denoising (score network = MLP)

    • Train denoising objective → sample new points from noise

  2. Intermediate: Score-matching toy model (2D)

    • Use torchsde + simple MLP score network

    • Train on 2D Swiss-roll or 2D Gaussian blobs

    • Sample with Euler–Maruyama vs Heun vs DPM-Solver

  3. Intermediate–Advanced: Latent diffusion fine-tuning

    • Start with SD 1.5 or SDXL base

    • Fine-tune with LoRA on custom dataset (e.g., your own photos or style)

    • Add LCM-LoRA distillation for 4-step fast inference

  4. Advanced: Manifold diffusion on torus

    • Use Geomstats + custom score model

    • Generate periodic signals or 2D torus embeddings

    • Compare Euclidean vs Riemannian diffusion quality

  5. Advanced: Flow-matching from scratch

    • Implement rectified flow or conditional flow-matching

    • Train on CIFAR-10 or small molecule dataset

    • Compare 1-step vs multi-step sampling quality and speed

All projects are runnable on Colab (free tier sufficient for toy versions; Pro for larger models).
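Mini-project 1 begins with the forward corruption step. The sketch below, a minimal NumPy illustration assuming a standard linear β schedule, shows the closed-form DDPM forward process q(x_t | x_0) on a 1-D Gaussian mixture (the score network and training loop are left for the project itself):

```python
import numpy as np

rng = np.random.default_rng(0)

# 1-D toy data: mixture of two Gaussians
x0 = np.concatenate([rng.normal(-2.0, 0.3, 5000), rng.normal(2.0, 0.3, 5000)])

# Linear beta schedule (standard DDPM choice), T = 1000 steps
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def q_sample(x0, t):
    """Closed-form forward corruption: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

xT = q_sample(x0, T - 1)
# By t = T the bimodal data is indistinguishable from standard Gaussian noise
print(round(float(xT.mean()), 2), round(float(xT.std()), 2))
```

A useful sanity check before training the denoiser: the sample mean should be near 0 and the standard deviation near 1 at the final step.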

This section gives you the exact tools and starting points used by researchers and companies building generative AI in 2026. You can now implement almost any modern diffusion pipeline from scratch or fine-tune production models.

11. Case Studies and Real-World Applications

This section shows how the stochastic processes and diffusion/SDE frameworks from earlier sections power production-grade AI systems in 2026. Each case highlights the specific stochastic technique used, why it outperforms alternatives, typical performance metrics, and the current leading models.

11.1 Image & video generation (Stable Diffusion 3, Sora-like models)

Problem Generate photorealistic or artistic images/videos from text prompts, with high fidelity, prompt adherence, diversity, and fast inference.

Stochastic process used Variance-preserving or variance-exploding diffusion SDEs + score matching + classifier-free guidance + consistency distillation / flow-matching acceleration.

Why diffusion/SDE wins

  • Autoregressive models (early DALL·E) → slow, left-to-right artifacts

  • GANs → mode collapse, training instability

  • Diffusion → stable training, excellent sample quality, natural diversity via stochastic sampling

Leading models in 2026

  • Stable Diffusion 3 Medium / SD3.5 (Stability AI): latent diffusion + rectified flow + CFG++

  • Flux.1 (Black Forest Labs): flow-matching + large-scale pretraining

  • Sora-like models (OpenAI Sora, Google Veo-2, Runway Gen-3, Luma Dream Machine, Kling): spatiotemporal latent diffusion + temporal consistency SDEs

  • Midjourney v7 / Imagen 4 (proprietary): hybrid diffusion + proprietary guidance

Performance highlights

  • ImageNet 256×256 FID: SD3 ≈ 2.1–2.5, Flux.1 ≈ 1.8–2.2 (state-of-the-art open models)

  • Video generation: 5–10 s clips at 720p in 10–30 inference steps (LCM/SDXL Turbo style)

  • Inference speed: 1–4 steps on consumer GPU (RTX 4090 / A100) → real-time preview

Key stochastic insight
Reverse SDE sampling with CFG w = 7–12 → strong prompt control.
Consistency distillation / LCM-LoRA → 1–4 step generation.
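The classifier-free guidance (CFG) rule referenced here combines the denoiser's conditional and unconditional noise predictions. A minimal sketch, where the `eps_*` arrays are made-up stand-ins for the two network outputs at one sampling step:

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance: eps_hat = eps_uncond + w * (eps_cond - eps_uncond).
    w = 1 recovers the conditional prediction; w > 1 amplifies the prompt."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy stand-ins for the denoiser's two predictions (hypothetical values)
eps_uncond = np.array([0.1, -0.2])
eps_cond = np.array([0.3, 0.1])

print(cfg_combine(eps_uncond, eps_cond, 1.0))  # equals eps_cond
print(cfg_combine(eps_uncond, eps_cond, 7.5))  # strongly prompt-driven estimate
```

Large w pushes samples toward high-density prompt-consistent modes, which is exactly the diversity trade-off discussed in Section 12.2.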

11.2 Molecule & protein conformation generation (RFdiffusion, Chroma, FrameDiff)

Problem Generate valid 3D molecular conformations (small molecules, proteins) or design novel sequences with desired properties (binding affinity, stability).

Stochastic process used Riemannian / manifold diffusion (torsion angles on torus, SE(3) equivariant diffusion on 3D coordinates) + score matching on curved manifolds.

Why diffusion/SDE wins

  • Traditional force-field methods → slow, stuck in local minima

  • VAEs/GANs → invalid geometries, poor diversity

  • Diffusion → explores conformation space gradually → high validity, diversity, and energy stability

Leading models in 2026

  • RFdiffusion (Baker lab, 2022–2025 updates) → SE(3)-equivariant diffusion on protein backbones

  • Chroma (Generate Biomedicines) → discrete + continuous diffusion for full protein design

  • FrameDiff / FoldFlow → flow-matching on rigid frames + SE(3) equivariance

  • DiffDock / DiffLinker → diffusion for protein–ligand docking

Performance highlights

  • Protein design success rate: RFdiffusion variants → 40–70% designs fold correctly (AF2 validation)

  • Binding affinity (PDBBind): DiffDock → RMSD < 2 Å in 60–75% cases (vs 30–40% for traditional docking)

  • Conformation RMSD: FrameDiff → median 1.0–1.5 Å on GEOM-drugs benchmark

Key stochastic insight
Manifold diffusion on the torus (torsion angles) + SE(3) equivariance → respects bond constraints and rotational symmetry.
Score function learned in tangent space → valid, low-energy conformations.

11.3 Time-series forecasting with diffusion (TimeDiff, CSDI)

Problem Forecast future values in multivariate time-series (weather, traffic, stock prices, sensor data) with uncertainty quantification.

Stochastic process used Diffusion on time-series (mask-and-denoise or forward noise corruption) + score matching for probabilistic forecasting.

Why diffusion/SDE wins

  • Classical ARIMA/LSTM → point forecasts, poor uncertainty

  • Gaussian processes → scale poorly to long sequences

  • Diffusion → full predictive distribution, handles missing data, captures multi-modal futures

Leading models in 2026

  • TimeDiff (2022–2024) → diffusion for deterministic & probabilistic forecasting

  • CSDI (Conditional Score-based Diffusion for Imputation) → imputation + forecasting

  • TimeGrad, ScoreGrad → score-based autoregressive hybrids

  • DiffTime / TSDiff → latent diffusion for long-horizon forecasting

Performance highlights

  • Electricity / Traffic benchmarks (ETTh, ETTm):
    → MAE / CRPS improvement of 10–25% over Informer / Autoformer
    → Uncertainty calibration: proper scoring rules 15–30% better

Key stochastic insight Reverse diffusion generates multiple plausible futures → ensemble prediction without multiple model training

11.4 Audio & speech synthesis (AudioLDM 2, Grad-TTS variants)

Problem Generate high-fidelity speech (TTS), music, sound effects from text or conditioning.

Stochastic process used Latent diffusion in spectrogram/mel-spectrogram space + continuous-time SDE or flow-matching.

Why diffusion/SDE wins

  • WaveNet-style autoregressive → very slow inference

  • GANs → artifacts, instability

  • Diffusion → high perceptual quality, natural prosody variation, controllable via guidance

Leading models in 2026

  • AudioLDM 2 / Make-An-Audio → latent diffusion on CLAP embeddings

  • Grad-TTS / VALL-E X variants → diffusion + duration predictor

  • NaturalSpeech 3, VoiceCraft, Seed-TTS → hybrid diffusion + flow-matching

  • MusicGen / MusicLM successors → text-to-music diffusion

Performance highlights

  • TTS: MOS scores 4.4–4.7 (near human parity)

  • Inference speed: 1–5 real-time factor on GPU (after LCM-style distillation)

  • Zero-shot voice cloning: 90%+ speaker similarity in few-shot setting

Key stochastic insight Diffusion in latent mel-space + classifier-free guidance → natural prosody & emotion control

11.5 Stochastic optimal control & planning in robotics

Problem Plan trajectories for robots (arms, drones, legged robots) in uncertain environments with safety constraints.

Stochastic process used Model predictive control (MPC) + diffusion-based trajectory generation + stochastic optimal control (SOC) interpretation of diffusion sampling.

Why diffusion/SDE wins

  • Classical MPC → deterministic, brittle to uncertainty

  • RL → sample-inefficient, reward shaping hard

  • Diffusion → generate diverse, high-quality trajectory ensembles → robust planning

Leading approaches in 2026

  • Decision Diffuser / Diffuser (Janner et al. 2022–2025) → diffusion as policy prior

  • DiffMPC / Plan4MC → diffusion for model-predictive planning

  • Stochastic Control via Diffusion (2024–2026) → Schrödinger bridge for trajectory optimization

  • RoboDiffusion / Diffusion Policy → end-to-end diffusion policies for manipulation

Performance highlights

  • Block-stacking / dexterous manipulation: success rate 70–90% (vs 40–60% classical RL)

  • Drone navigation in wind: collision rate ↓ 30–50% with diffusion ensemble planning

Key stochastic insight Diffusion sampling = stochastic optimal control with KL-regularized cost → naturally produces smooth, diverse, uncertainty-aware plans

These case studies demonstrate that stochastic processes — especially diffusion SDEs — are no longer academic curiosities. They are the core technology driving the most impactful AI applications in 2026, from creative generation to scientific discovery and physical control.

12. Challenges, Limitations and Open Problems

Even though diffusion models and stochastic generative methods have become the dominant paradigm in AI by 2026, delivering breathtaking quality across images, video, audio, molecules, proteins, and more, several fundamental and practical challenges remain unsolved. This section covers the five most critical open problems, why they persist, current mitigation strategies, and the most promising research directions moving forward.

12.1 Slow sampling speed and acceleration techniques

The core problem
Standard DDPM/VP diffusion requires 50–1000 denoising steps per sample → inference is 10–100× slower than GANs or autoregressive models. Even with major improvements, real-time generation (especially video or interactive editing) on consumer hardware remains difficult, and industrial-scale deployment (millions of daily generations) is expensive in compute cost.

Why it matters

  • Interactive creative tools demand <1 second latency

  • Edge devices (phones, AR glasses, robotics) have strict power and compute budgets

  • High-volume services (social media filters, game asset generation) need cost efficiency

Current acceleration techniques (2026 standard)

  • Predictor-corrector samplers (PC, DPM-Solver++, UniPC) → 10–20 steps

  • Consistency distillation / Latent Consistency Models (LCM) (Song 2023, Luo 2023–2024) → 1–4 steps

  • Flow-matching / rectified flow → deterministic straight paths → 1–8 steps

  • Adversarial diffusion distillation (SDXL Turbo, SD3 Turbo) → 1–4 steps via GAN-like training

  • Progressive / cascaded distillation → train student to mimic teacher at fewer steps

  • Quantization (4-bit/8-bit weights + activations) + torch.compile → 2–4× speedup on GPU/TPU

  • Speculative sampling + draft models → parallel short-path prediction

Remaining open problems

  • Achieving 1-step generation with quality indistinguishable from 50-step models

  • Adaptive step-size that automatically chooses minimal steps per prompt complexity

  • Preserving full diversity when reducing from 50 → 4 steps (current LCM often loses some variation)

  • Maintaining high fidelity in video/long-sequence generation at very low step counts

Outlook 2027–2028 is expected to see native 1-step or near-1-step models (stronger consistency training + flow-matching hybrids) become dominant for consumer use, with edge deployment becoming realistic.

12.2 Mode collapse and diversity in diffusion models

The core problem
Despite stochastic sampling, many diffusion models suffer from reduced diversity compared to the real data distribution — especially after heavy classifier-free guidance (CFG w > 7), distillation, or fine-tuning.

Common symptoms

  • Overly similar faces, poses, or styles in text-to-image

  • Limited structural diversity in generated molecules (same scaffolds repeated)

  • Mode dropping in multi-modal distributions (ignores rare artistic styles or rare protein folds)

Main causes

  • High CFG scale pushes strongly toward high-density modes (over-amplifies prompt)

  • Distillation collapses stochasticity (consistency models lose variation)

  • Score network overestimates density in low-data / tail regions

  • Training data imbalance → model ignores or under-represents rare modes

Current mitigation strategies (2026 standard)

  • Dynamic CFG / CFG++ → reduce guidance strength in early steps, increase later

  • Negative prompts + attention manipulation → actively suppress unwanted modes

  • Stochastic interpolants / rectified flow with controlled noise → preserve diversity

  • Temperature scaling in consistency models → add tunable randomness

  • Diversity-promoting losses (batch diversity term, Wasserstein regularization, anti-mode collapse penalties)

  • Latent consistency with stochastic refinement → hybrid deterministic + stochastic paths

Remaining open problems

  • Theoretical bound on diversity vs guidance strength vs distillation depth

  • How to explicitly sample from rare or tail modes on demand

  • Reliable metrics for “true” distribution coverage in high dimensions (FID is insufficient)

2026 status Diversity is good enough for most creative and commercial use cases, but scientific applications (molecule design, protein ensemble generation, rare event simulation) still struggle with mode coverage.

12.3 Training stability on high-dimensional manifolds

The core problem
Diffusion on non-Euclidean manifolds (the torus for torsion angles, hyperbolic space for graphs, SE(3) for 3D poses, the Grassmannian for subspaces) suffers from training instability — exploding/vanishing gradients, mode collapse, numerical drift off the manifold, or collapse to trivial solutions.

Main causes

  • Curvature causes score function to become extremely large near manifold boundary

  • Tangent space projection / parallel transport numerical errors accumulate over steps

  • Manifold constraints (unit norm, orthogonality, positive-definiteness) → hard to enforce softly

  • High-dimensional tangent spaces → curse of dimensionality in score estimation

Current mitigation strategies

  • Riemannian gradient clipping & adaptive learning rates

  • Gauge-equivariant networks (normalize curvature effects)

  • Learned projection operators / retraction maps

  • Curriculum training (start with simple manifolds, gradually increase curvature/complexity)

  • Regularization on manifold constraint violation (soft penalty on ||X^T X - I||)

Remaining open problems

  • Stable, scalable score estimation on high-curvature or high-dimensional manifolds

  • Automatic choice of curvature schedule or gauge during training

  • Theoretical convergence guarantees for Riemannian score matching

  • Avoiding drift off manifold in long sampling chains

2026 status Riemannian diffusion is now reliable for small-to-medium molecules / proteins (RFdiffusion, FrameDiff), but still experimental and unstable for large graphs, high-dimensional SE(3), or very high-curvature spaces.

12.4 Theoretical understanding of why score matching works so well

The core problem
Score matching (the denoising objective) empirically outperforms almost all other generative objectives (GAN loss, VAE ELBO, likelihood, flow-matching in many regimes), but we lack a deep, unified theoretical explanation for its superior sample quality, stability, and generalization.

Known partial answers

  • Avoids explicit density estimation → no normalization constant needed

  • Denoising objective is very stable (Gaussian noise is tractable)

  • Implicitly regularizes via noise scale schedule (progressive difficulty)

  • Score function is lower-dimensional than density → easier to learn

  • Reverse process is well-behaved under mild conditions (Lipschitz continuity)

Major open questions

  • Why does score matching generalize better than likelihood-based methods?

  • Is there a precise information-theoretic or geometric connection between score matching and optimal transport?

  • Can we prove tighter bounds on sample quality (FID, precision/recall) vs training compute / model size?

  • Why do distilled consistency models retain high quality despite massive compression?

  • Is score matching optimal in some minimax sense?

2026 research frontier Active lines include: information-theoretic views (mutual information between noise and data), control-theoretic interpretations (score as optimal feedback law), and geometric perspectives (score matching on manifolds).

12.5 Energy-efficient diffusion for edge devices

The core problem
Full diffusion inference (even 4–8 steps) is still too expensive for phones, AR glasses, embedded robotics, or IoT devices — high VRAM, high power draw, high latency.

Current constraints

  • SDXL Turbo / LCM → ~1–2 GB VRAM, 0.5–2 s on flagship phone GPU

  • Video generation → still 10–30 s even on high-end mobile

Current mitigation strategies

  • Quantization (4-bit / 8-bit weights + activations) → 2–4× memory/power reduction

  • Distillation to 1–2 steps (stronger consistency training)

  • Tiny diffusion (small U-Net, pruned latents, depthwise-separable layers)

  • On-device flow-matching (deterministic → lower compute variance)

  • Neural architecture search for edge-friendly backbones

  • Sparsity & pruning (structured sparsity in attention/conv layers)

Remaining open problems

  • 1-step generation with near-zero quality drop on mobile hardware

  • Power-efficient score computation (spiking/neuromorphic diffusion)

  • Latency < 200 ms for interactive editing on AR/VR glasses

  • Maintaining diversity and prompt adherence under extreme quantization/distillation

2026 outlook Edge diffusion is emerging (Apple Intelligence on-device models, Samsung Gauss mobile variants), but full-quality real-time generation on phone-class hardware is still 2027–2028 territory.

These five challenges represent the active research frontiers. Solving any one of them (e.g., 1-step high-quality generation, stable manifold diffusion, or theoretical explanation of score matching superiority) would unlock massive new applications in mobile AI, scientific discovery, real-time creativity, and autonomous systems.


Stochastic Processes in AI Vol-2: Markov Chains, Decision Making and AI Algorithms

Table of Contents: Stochastic Processes in AI Vol-2

Markov Chains, Decision Making and AI Algorithms

  1. Introduction to Vol-2: From Markov Chains to Decision Making in AI 1.1 Why Vol-2 focuses on decision-making and algorithmic implications 1.2 Connection between Vol-1 (diffusion & generative) and Vol-2 (planning & control) 1.3 Brief roadmap: Markov → MDP → RL → stochastic control → modern AI 1.4 Target audience: advanced undergrad/postgrad, AI researchers, ML engineers 1.5 Prerequisites (review of Vol-1 concepts: Markov chains, SDEs, score matching)

  2. Advanced Markov Chains and Hidden Markov Models 2.1 Higher-order Markov chains and variable-order Markov models 2.2 Hidden Markov Models (HMM): forward-backward algorithm, Viterbi decoding 2.3 Baum-Welch (EM) algorithm for HMM parameter estimation 2.4 Continuous-state HMMs and switching linear dynamical systems 2.5 Applications: speech recognition, part-of-speech tagging, bioinformatics

  3. Markov Decision Processes – Advanced Topics 3.1 Partially Observable MDPs (POMDPs): belief states and value functions 3.2 Continuous-state & continuous-action MDPs 3.3 Approximate dynamic programming: fitted value iteration, LSTD 3.4 Model-based vs model-free RL – stochastic shortest path revisited 3.5 Safe MDPs and constrained MDPs (constrained policy optimization)

  4. Reinforcement Learning Foundations with Stochastic Processes 4.1 Temporal Difference learning: SARSA, Q-learning, Expected SARSA 4.2 Off-policy vs on-policy learning: importance sampling in policy gradients 4.3 Actor-Critic methods: A2C, A3C, PPO, SAC (maximum entropy RL) 4.4 Eligibility traces and n-step bootstrapping 4.5 Stochastic policies in continuous control: Gaussian policies + entropy regularization

  5. Policy Gradient and Stochastic Policy Optimization 5.1 REINFORCE algorithm and variance reduction (baseline, advantage normalization) 5.2 Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) 5.3 Natural Policy Gradient and KL-constrained optimization 5.4 Stochastic gradient estimation in high-variance environments 5.5 Maximum Entropy Reinforcement Learning (Soft Actor-Critic)

  6. Model-Based Reinforcement Learning and Planning 6.1 Dyna architecture: real + simulated experience 6.2 Model Predictive Control (MPC) with learned dynamics 6.3 MuZero, EfficientZero, DreamerV3 – latent world models 6.4 Planning as inference: diffusion-based planning (Decision Diffuser) 6.5 Stochastic model-based planning with uncertainty-aware models

  7. Stochastic Optimal Control and Diffusion for Planning 7.1 Stochastic optimal control formulation of RL 7.2 Diffusion for trajectory generation and planning (Diffuser, Plan4MC) 7.3 Schrödinger bridge and optimal transport in control 7.4 Control as inference: KL-regularized RL and reward-weighted regression 7.5 Diffusion policies vs traditional policy networks

  8. Multi-Agent and Game-Theoretic Stochastic Processes 8.1 Stochastic games and Markov games 8.2 Nash equilibrium in multi-agent RL 8.3 Mean-field games and mean-field RL 8.4 Population-based training and self-play with stochastic opponents 8.5 Applications: autonomous driving, negotiation agents, poker bots

  9. Practical Implementation Tools and Libraries (2026 Perspective) 9.1 RL frameworks: Stable-Baselines3, CleanRL, RLlib, Tianshou 9.2 Diffusion for planning: Diffuser, Decision Diffuser, Plan4MC repos 9.3 POMDP solvers: pomdp-py, APPL, SARSOP 9.4 Multi-agent: PettingZoo, SMAC, Mava 9.5 Mini-project suggestions: PPO from scratch, diffusion planner, multi-agent game

  10. Case Studies and Real-World Applications 10.1 Autonomous driving & robotics planning (diffusion + MPC) 10.2 Large-scale recommender systems with stochastic policies 10.3 Multi-agent games & e-sports AI (AlphaStar-like systems) 10.4 Healthcare treatment planning (POMDPs & stochastic control) 10.5 Energy management & smart grids (mean-field RL)

  11. Challenges, Limitations and Open Problems 11.1 Sample efficiency and real-world deployment 11.2 Exploration in sparse-reward, long-horizon tasks 11.3 Safety, robustness and constraint satisfaction in stochastic policies 11.4 Multi-agent equilibrium computation and non-stationarity 11.5 Scaling stochastic optimal control to high-dimensional continuous spaces

  12. Summary, Key Takeaways and Further Reading 12.1 Recap: Markov chains → MDPs → RL → stochastic control → modern AI 12.2 Most important concepts for AI practitioners 12.3 Recommended books & surveys (Sutton & Barto, Bertsekas, Todorov) 12.4 Influential papers 2023–2026 12.5 Online courses (Stanford CS234, DeepMind x UCL RL lectures) 12.6 Exercises and capstone project ideas

1. Introduction to Vol-2: From Markov Chains to Decision Making in AI

Welcome to Stochastic Processes in AI Vol-2: Markov Chains, Decision Making and AI Algorithms.

Vol-1 focused on randomness as a tool for creation — how stochastic processes (especially diffusion and SDEs) power the generative revolution: images, videos, molecules, proteins, audio, and even reasoning traces in large language models.

Vol-2 shifts the spotlight to randomness as a tool for decision-making and intelligent action in uncertain environments. We move from passive generation to active planning, control, exploration, and optimization — the core of reinforcement learning, robotics, autonomous systems, game AI, recommender systems, and next-generation agents.

1.1 Why Vol-2 focuses on decision-making and algorithmic implications

Modern AI is no longer just about predicting or generating — it is about acting intelligently in complex, uncertain, partially observable worlds.

Key reasons stochastic processes are central to decision-making in 2026:

  • Uncertainty is unavoidable: Real environments (roads, markets, hospitals, factories) are noisy, non-stationary, and only partially observable. Deterministic algorithms fail; stochastic policies and planning thrive.

  • Exploration–exploitation dilemma: Agents must balance trying new actions (exploration) vs exploiting known good ones. Stochasticity (random policies, entropy bonuses, noise injection) solves this elegantly.

  • Long-horizon reasoning: Many tasks require planning over hundreds or thousands of steps (robotics, supply-chain optimization, medical treatment sequences). Markov chains and MDPs provide the mathematical backbone.

  • Algorithmic scalability: Modern RL and planning algorithms (PPO, SAC, DreamerV3, Decision Diffuser) are built on stochastic process theory — understanding Markov chains, MDPs, and stochastic control is essential to read, implement, and innovate in these areas.

  • Agentic AI & autonomy: The next wave (2026–2030) is autonomous agents that plan, reason, and act using stochastic models — from self-driving cars to enterprise workflow agents.

Simple numerical motivation
A deterministic policy in a maze might always take the same path → it gets stuck in a local minimum or fails under noise. A stochastic policy (ε-greedy or softmax) explores alternatives → it finds the optimal path with high probability after enough trials.
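The same exploration effect shows up on a two-armed bandit. A toy ε-greedy sketch (the arm payout probabilities 0.3 and 0.7 are made up for illustration):

```python
import random

random.seed(0)
ARMS = [0.3, 0.7]  # true win probabilities (hypothetical; unknown to the agent)

def pull(arm):
    """Bernoulli reward from the chosen arm."""
    return 1.0 if random.random() < ARMS[arm] else 0.0

def run(eps, steps=5000):
    """epsilon-greedy: explore with probability eps, otherwise exploit the best estimate."""
    counts, values = [0, 0], [0.0, 0.0]
    for _ in range(steps):
        if random.random() < eps:
            arm = random.randrange(2)                  # explore
        else:
            arm = 0 if values[0] > values[1] else 1    # exploit current estimate
        r = pull(arm)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]  # incremental running mean
    return values

values = run(eps=0.1)
print(values[1] > values[0])  # exploration lets the agent rank the better arm correctly
```

With ε = 0 the agent can lock onto whichever arm it tried first; even a small ε reliably recovers the better arm's value.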

1.2 Connection between Vol-1 (diffusion & generative) and Vol-2 (planning & control)

Vol-1 and Vol-2 are deeply connected — they are two sides of the same coin:

  • Generation as planning Diffusion sampling = solving a reverse-time stochastic control problem to steer from noise to data (Schrödinger bridge interpretation). → Planning = solving a forward control problem to steer from current state to goal.

  • Score function ≈ value gradient In diffusion: score ∇ log p_t(x) pushes toward high-probability regions. In RL: value gradient or advantage pushes toward high-reward actions.

  • Reverse diffusion ≈ policy rollout Denoising steps = sequential decisions that reconstruct data. RL policy rollout = sequential actions that maximize return.

  • Shared math Both use SDEs, score matching / policy gradients, entropy regularization, and KL divergence terms.

2026 frontier Many cutting-edge systems merge both worlds:

  • Diffusion for planning trajectories (Decision Diffuser, Diffuser)

  • RL fine-tuning of diffusion models (reward-weighted sampling)

  • Stochastic control as unified language for both generation and decision-making

Vol-2 builds directly on Vol-1: every concept here (MDP, policy gradient, stochastic control) is a natural extension of the stochastic processes you learned.

1.3 Brief roadmap: Markov → MDP → RL → stochastic control → modern AI

Vol-2 journey at a glance:

  1. Advanced Markov chains & HMMs → modeling hidden dynamics & sequences

  2. Markov Decision Processes (MDPs) → adding actions & rewards

  3. Reinforcement Learning foundations → learning policies from interaction

  4. Policy gradients & actor-critic → scaling to continuous & high-dimensional problems

  5. Model-based RL & planning → using learned dynamics for faster learning

  6. Stochastic optimal control & diffusion planning → unifying generation & decision-making

  7. Multi-agent & game-theoretic extensions → real-world coordination & competition

  8. Implementation tools + case studies → from theory to code to deployment

  9. Challenges & future directions → open problems in agentic AI

By the end, you will understand how stochastic processes power not only generative models but also autonomous agents, robotic control, game AI, and enterprise decision systems.

1.4 Target audience: advanced undergrad/postgrad, AI researchers, ML engineers

This volume is written for people who already have basic probability, Python, and some exposure to machine learning (from Vol-1 or equivalent).

Ideal readers

  • Advanced undergraduates / postgraduates in CS, AI, data science, control engineering — wanting rigorous yet practical understanding

  • AI researchers — needing deeper mathematical insight into why RL and planning algorithms work (or fail)

  • ML engineers & practitioners — implementing or fine-tuning RL agents, planning systems, or hybrid generative-control models in production

No advanced prerequisites beyond Vol-1 concepts (Markov chains, Brownian motion, SDEs, score matching). Every new idea is built step-by-step with examples, code sketches, and real AI motivation.

2. Advanced Markov Chains and Hidden Markov Models

Markov chains from Vol-1 were the simplest stochastic processes — fully observable states with memoryless transitions. In real AI problems, we often deal with higher-order dependencies, hidden/latent states, or continuous dynamics. This section extends basic Markov chains to more powerful models used in speech, NLP, bioinformatics, robotics, and many sequential AI tasks.

2.1 Higher-order Markov chains and variable-order Markov models

Higher-order Markov chains
The next state depends on the last k states (order k), not just the last one.

Transition probability: P(X_{t+1} = j | X_t = i_t, X_{t-1} = i_{t-1}, …, X_{t-k+1} = i_{t-k+1})

Numerical example – trigram (order-2) language model
Vocabulary: {the, cat, sat, on, mat}
Given the context “the cat”: P(next word = “sat”) = 0.7, P(“on”) = 0.2, P(“mat”) = 0.1
→ An order-1 chain (bigram) remembers only the last word; an order-2 chain (trigram) remembers the last two.
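The toy conditional distribution above can be sampled directly. A minimal sketch using the probabilities from the example:

```python
import random

random.seed(1)

# P(next word | context "the cat") from the toy example
next_word = {"sat": 0.7, "on": 0.2, "mat": 0.1}

def sample(dist):
    """Draw one word from a categorical distribution via inverse CDF."""
    r, cum = random.random(), 0.0
    for word, p in dist.items():
        cum += p
        if r < cum:
            return word
    return word  # guard against floating-point rounding at cum ~ 1.0

draws = [sample(next_word) for _ in range(10000)]
print(draws.count("sat") / len(draws))  # empirical frequency ≈ 0.7
```

Chaining such draws, each conditioned on the last k words, is exactly how an order-k Markov chain generates text.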

Variable-order Markov models (VOM) Use different orders depending on context — longer history only when it improves prediction (e.g., Prediction by Partial Matching – PPM).

Advantages

  • Capture longer dependencies (e.g., syntax patterns in text)

  • Avoid exponential parameter explosion of fixed high-order chains

AI applications

  • Early text compression (PPM)

  • Variable-length n-gram models in language modeling

  • Sequence prediction in robotics (action sequences with variable context length)

Drawback Still fully observable → cannot handle hidden/latent structure (next subsection).

2.2 Hidden Markov Models (HMM): forward-backward algorithm, Viterbi decoding

Hidden Markov Model (HMM) We observe a sequence of observations O₁, O₂, …, O_T There is a hidden state sequence S₁, S₂, …, S_T that follows a first-order Markov chain Observations are emitted from hidden states via emission probabilities.

Components:

  • States S = {1, …, N}

  • Transition matrix A (N × N)

  • Emission probabilities B (N × M) or continuous densities

  • Initial state distribution π

Three classic problems

  1. Evaluation (likelihood): P(O | model) → Forward algorithm

  2. Decoding (most likely hidden sequence): argmax_S P(S | O) → Viterbi algorithm

  3. Learning (estimate parameters): Baum-Welch (EM)

Forward algorithm (likelihood) α_t(i) = P(O₁…O_t, S_t = i | model) Initialization: α₁(i) = π_i b_i(O₁) Recursion: α_{t+1}(j) = [Σ_i α_t(i) a_{ij}] b_j(O_{t+1}) Total likelihood: Σ_i α_T(i)

Viterbi decoding (most likely path) δ_t(i) = max probability of being in state i at time t with observations so far δ₁(i) = π_i b_i(O₁) δ_{t+1}(j) = max_i [δ_t(i) a_{ij}] b_j(O_{t+1}) Keep backpointers → reconstruct path.

Numerical toy example – weather + activity HMM States: Sunny (S), Rainy (R) Observations: Walk (W), Shop (Sh), Clean (C) Transitions: S→S 0.8, S→R 0.2, R→S 0.4, R→R 0.6 Emissions:

  • Sunny: W 0.6, Sh 0.3, C 0.1

  • Rainy: W 0.1, Sh 0.4, C 0.5

Sequence: Walk, Shop, Walk Forward: compute likelihood Viterbi: most likely path = Sunny → Sunny → Sunny (high probability of walking on sunny days)
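The weather HMM above fits in a short script. Assuming a uniform initial distribution π = [0.5, 0.5] (not specified in the example), a sketch of both algorithms:

```python
# Toy HMM: forward likelihood + Viterbi decoding for the weather example.
states = ["Sunny", "Rainy"]
pi = {"Sunny": 0.5, "Rainy": 0.5}            # assumed uniform initial distribution
A = {"Sunny": {"Sunny": 0.8, "Rainy": 0.2},  # transition probabilities
     "Rainy": {"Sunny": 0.4, "Rainy": 0.6}}
B = {"Sunny": {"Walk": 0.6, "Shop": 0.3, "Clean": 0.1},  # emission probabilities
     "Rainy": {"Walk": 0.1, "Shop": 0.4, "Clean": 0.5}}
obs = ["Walk", "Shop", "Walk"]

# Forward algorithm: alpha_t(i) = P(O_1..O_t, S_t = i)
alpha = {s: pi[s] * B[s][obs[0]] for s in states}
for o in obs[1:]:
    alpha = {j: sum(alpha[i] * A[i][j] for i in states) * B[j][o] for j in states}
likelihood = sum(alpha.values())

# Viterbi: delta_t(i) = best single-path probability, with backpointers
delta = {s: pi[s] * B[s][obs[0]] for s in states}
back = []
for o in obs[1:]:
    ptr, new = {}, {}
    for j in states:
        best = max(states, key=lambda i: delta[i] * A[i][j])
        ptr[j] = best
        new[j] = delta[best] * A[best][j] * B[j][o]
    back.append(ptr)
    delta = new
last = max(states, key=delta.get)
path = [last]
for p in reversed(back):
    path.append(p[path[-1]])
path.reverse()
print(likelihood, path)
```

With this assumed π, the likelihood comes out to 0.0498 and the Viterbi path to Sunny → Sunny → Sunny, matching the example.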

AI applications

  • Speech recognition (states = phonemes, observations = acoustic features)

  • Part-of-speech tagging (states = POS tags, observations = words)

  • Gesture recognition, bioinformatics (gene finding)

2.3 Baum-Welch (EM) algorithm for HMM parameter estimation

Baum-Welch = Expectation-Maximization for HMMs (unsupervised learning of A, B, π)

E-step Compute γ_t(i) = P(S_t = i | O, model) = α_t(i) β_t(i) / P(O) ξ_t(i,j) = P(S_t = i, S_{t+1} = j | O, model)

M-step Update transitions: a_{ij} = Σ_t ξ_t(i,j) / Σ_t γ_t(i) Update emissions: b_i(k) = Σ_{t: O_t=k} γ_t(i) / Σ_t γ_t(i) Update initial: π_i = γ_1(i)

Numerical intuition Start with random A, B, π After 10–20 EM iterations → parameters converge to a local maximum of the likelihood of the observed sequence (EM guarantees monotone improvement, not the global optimum).
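The E-step posteriors can be sanity-checked numerically. A sketch of the forward-backward γ computation on the weather HMM of 2.2 (uniform initial distribution assumed), verifying that the posteriors sum to one at every time step:

```python
states = ["S", "R"]
pi = {"S": 0.5, "R": 0.5}                      # assumed uniform initial distribution
A = {"S": {"S": 0.8, "R": 0.2}, "R": {"S": 0.4, "R": 0.6}}
B = {"S": {"W": 0.6, "Sh": 0.3, "C": 0.1}, "R": {"W": 0.1, "Sh": 0.4, "C": 0.5}}
obs = ["W", "Sh", "W"]
T = len(obs)

# Forward pass: alpha[t][i] = P(O_1..O_t, S_t = i)
alpha = [{s: pi[s] * B[s][obs[0]] for s in states}]
for t in range(1, T):
    alpha.append({j: sum(alpha[-1][i] * A[i][j] for i in states) * B[j][obs[t]]
                  for j in states})

# Backward pass: beta[t][i] = P(O_{t+1}..O_T | S_t = i)
beta = [{s: 1.0 for s in states}]
for t in range(T - 2, -1, -1):
    beta.insert(0, {i: sum(A[i][j] * B[j][obs[t + 1]] * beta[0][j] for j in states)
                    for i in states})

p_obs = sum(alpha[-1].values())
# E-step posterior: gamma_t(i) = alpha_t(i) * beta_t(i) / P(O)
gamma = [{i: alpha[t][i] * beta[t][i] / p_obs for i in states} for t in range(T)]
print(gamma)
```

The M-step then just normalizes expected counts built from these γ (and the pairwise ξ) values.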

AI connection Baum-Welch trained early HMM-based speech recognizers and POS taggers. Modern deep variants (HMM + neural emissions) still used in hybrid ASR systems.

2.4 Continuous-state HMMs and switching linear dynamical systems

Continuous-state HMM Hidden states are continuous vectors (instead of discrete). Emission model: usually Gaussian (linear Gaussian state-space model).

Switching Linear Dynamical Systems (SLDS) Hidden mode (discrete) switches over time, each mode has its own linear-Gaussian dynamics.

Example – SLDS for robot tracking Modes: straight motion, turning left, turning right Each mode has different transition matrix + noise Observation = noisy GPS/accelerometer readings

Inference

  • Forward-backward extended to continuous case (Kalman filter + backward pass)

  • Viterbi becomes max-probability mode sequence + smoothed continuous states

AI applications

  • Maneuver recognition in autonomous driving

  • Human motion capture (walking, running, jumping modes)

  • Financial time-series with regime switching

2.5 Applications: speech recognition, part-of-speech tagging, bioinformatics

Speech recognition

  • States = phonemes or sub-phoneme units

  • Observations = MFCC / spectrogram features

  • HMM + neural acoustic model (hybrid DNN-HMM) → still used in many production ASR systems

Part-of-speech tagging

  • States = POS tags (NN, VB, JJ, etc.)

  • Observations = words

  • Viterbi decoding → most likely tag sequence

  • Modern: neural CRF or Transformer layers on top of HMM-like transition modeling

Bioinformatics

  • Gene finding: states = coding/non-coding regions, observations = DNA sequence

  • Profile HMMs for protein family alignment (Pfam database)

  • Secondary structure prediction

Numerical example – POS tagging accuracy Penn Treebank benchmark:

  • HMM only ≈ 93–94% accuracy

  • HMM + neural features ≈ 97%

  • Modern Transformer-based → 97.5–98%

These advanced Markov models remain essential building blocks in sequential AI tasks — especially where interpretability, uncertainty modeling, or latent structure discovery is needed.

3. Markov Decision Processes – Advanced Topics

Section 4 of Vol-1 introduced basic MDPs and tabular methods (value iteration, policy iteration). This section covers advanced extensions that are essential for real-world AI: partial observability, continuous spaces, approximation methods, model-based vs model-free trade-offs, and safety/constraint-aware decision-making.

3.1 Partially Observable MDPs (POMDPs): belief states and value functions

Partially Observable MDP (POMDP) In real environments, the agent does not observe the true state s — only a noisy observation o. POMDP = (S, A, T, R, Ω, O, γ) where Ω = observation space, O(o|s,a) = observation probability.

Belief state b(s) = probability distribution over hidden states b(s') = P(S' = s' | b, a, o) ∝ O(o|s',a) Σ_s b(s) T(s'|s,a)

Belief space B = probability simplex over S (continuous even if S is discrete!)

Value function over beliefs V(b) = max_a { Σ_s b(s) R(s,a) + γ Σ_{s',o} P(s',o|b,a) V(b') }

Numerical example – tiger problem (classic POMDP) States: TigerLeft, TigerRight Actions: Listen, OpenLeft, OpenRight Observations: TigerLeftHear, TigerRightHear, Nothing Reward: +10 for opening door without tiger, -100 for opening door with tiger, -1 for listening

Belief b = P(tiger left) After listen → update belief via Bayes rule Optimal policy: listen until belief is extreme → open the low-probability door
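Because the tiger does not move, the listening update is plain Bayes' rule on the observation. A sketch assuming the classic formulation's 0.85 hearing accuracy:

```python
def update_belief(b_left, hear_left, acc=0.85):
    """Bayes update of P(tiger left) after one Listen action.

    acc is the assumed probability of hearing the tiger on the correct side
    (0.85 in the classic formulation of the tiger problem)."""
    like_left = acc if hear_left else 1 - acc       # P(obs | tiger left)
    like_right = (1 - acc) if hear_left else acc    # P(obs | tiger right)
    num = like_left * b_left
    return num / (num + like_right * (1 - b_left))

b = 0.5
b = update_belief(b, hear_left=True)   # -> 0.85
b = update_belief(b, hear_left=True)   # -> ~0.97: belief extreme, open right door
print(b)
```

Two consistent "hear left" observations push the belief to about 0.97, at which point opening the right door has positive expected value.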

AI relevance

  • Robotics: robot does not see full environment (POMDP planning)

  • Autonomous driving: partial observability of other vehicles' intentions

  • Healthcare: patient state partially observed through tests

3.2 Continuous-state & continuous-action MDPs

Continuous-state MDPs S = ℝ^d (joint angles, positions, velocities) Continuous-action MDPs A = ℝ^m (torques, steering angles, velocities)

Challenges

  • No enumeration of states/actions → cannot use tabular methods

  • Curse of dimensionality in continuous spaces

Common approaches

  • Function approximation: V(s) ≈ θ · ϕ(s) (linear) or neural network V_θ(s)

  • Policy parameterization: π_θ(a|s) = Gaussian(μ_θ(s), Σ_θ(s))

  • Discretization or tile coding (early methods)

  • Deep RL (DQN for discrete actions, PPO/SAC for continuous)

Numerical example – inverted pendulum State s = [θ, θ̇] (angle, angular velocity) ∈ ℝ² Action a = torque ∈ ℝ Reward r = cos(θ) - 0.1 θ̇² - 0.001 a² Continuous MDP solved via PPO or SAC → stable balancing policy in ~10⁵–10⁶ steps

2026 practice Continuous control → dominated by PPO, SAC, TD-MPC2, DreamerV3 (model-based)

3.3 Approximate dynamic programming: fitted value iteration, LSTD

Fitted Value Iteration (FVI) Approximate Bellman operator with function approximation:

(T V_k)(s) = max_a [ r(s,a) + γ E V_k(s') ] Fit regressor V̂_{k+1}(s) to targets max_a [ r + γ V̂_k(s') ] computed on sampled states

Least-Squares Temporal Difference (LSTD) Linear approximation V(s) = θ · ϕ(s) Solve the projected Bellman fixed point Φ θ ≈ r + γ Φ' θ Closed-form solution: θ = (Φ^T (Φ - γ Φ'))⁻¹ Φ^T r

Numerical example – mountain car State: position & velocity (continuous) Use tile coding or neural net as ϕ(s) FVI: iterate value updates → converge to near-optimal policy

AI connection

  • FVI / fitted Q-iteration → basis for Deep Q-Networks (DQN)

  • LSTD → precursor to linear function approximation in modern RL

3.4 Model-based vs model-free RL – stochastic shortest path revisited

Model-based RL Learn transition model P̂(s'|s,a) and reward model R̂(s,a) Then plan with value iteration / MPC using learned model

Model-free RL Learn value/policy directly from experience (no explicit model) Examples: PPO, SAC, DQN

Comparison table

Aspect | Model-based | Model-free
Sample efficiency | High (plan with simulated rollouts) | Lower (needs real interaction)
Computational cost | High (planning step) | Lower (just gradient updates)
Robustness to model error | Sensitive (model bias → policy error) | More robust (learns directly from data)
Modern examples | DreamerV3, MuZero, TD-MPC2 | PPO, SAC, Rainbow DQN

Stochastic shortest path revisited Model-based SSP: learn stochastic graph → plan shortest path with Bellman-Ford or value iteration Model-free SSP: learn Q-values → implicit shortest path via greedy policy

2026 trend Hybrid: model-based for imagination + model-free for real interaction (DreamerV3 style)

3.5 Safe MDPs and constrained MDPs (constrained policy optimization)

Constrained MDP Maximize return subject to constraints: E[ Σ cost_t ] ≤ budget or P(collision) ≤ δ

Safe RL approaches

  • Lagrangian methods: add penalty λ × constraint violation

  • Constrained Policy Optimization (CPO) → trust-region method with constraints

  • Projection-based methods (e.g., P3O, FOCOPS)

  • Shielding / safety layers (post-hoc action filtering)

Numerical example – safe navigation Reward: reach goal (+10) Constraint: expected collision cost ≤ 1.0 Unconstrained policy: high speed → reward 9.5, cost 5.0 (unsafe) Constrained policy: slower speed → reward 8.2, cost 0.9 (safe)

AI relevance

  • Autonomous driving: avoid collisions (constrained RL)

  • Robotics: respect joint limits, power budgets

  • Healthcare: treatment policies with safety constraints

  • Finance: trading with risk limits

2026 frontier Constrained diffusion policies, safe exploration with uncertainty-aware models, formal verification of safe RL policies.

Advanced MDP topics extend basic decision-making to real-world complexity: partial observability, continuous control, approximation, model usage, and safety — all critical for autonomous AI systems in 2026.

4. Reinforcement Learning Foundations with Stochastic Processes

Reinforcement Learning (RL) is the branch of AI where an agent learns to make sequential decisions by interacting with an environment to maximize cumulative reward. Stochastic processes are central to RL: the environment is stochastic (uncertain transitions), policies are often stochastic (for exploration), and value estimates are learned from noisy samples.

This section covers the core RL algorithms that rely on stochastic processes, building directly on MDPs from the previous section.

4.1 Temporal Difference learning: SARSA, Q-learning, Expected SARSA

Temporal Difference (TD) learning Update value estimates using the difference between predicted and observed outcomes (bootstrapping).

SARSA (on-policy TD control) Update Q(s,a) using the action actually taken under current policy:

Q(s,a) ← Q(s,a) + α [ r + γ Q(s', a') - Q(s,a) ] where a' ~ π(·|s')

Q-learning (off-policy TD control) Update using max over next actions (greedy target):

Q(s,a) ← Q(s,a) + α [ r + γ max_{a'} Q(s', a') - Q(s,a) ]

Expected SARSA Use expected value over next policy instead of single sample:

Q(s,a) ← Q(s,a) + α [ r + γ E_{a'~π} Q(s', a') - Q(s,a) ]

Numerical toy example – 3-state chain States: S1 → S2 → S3 (goal, r=+10) Actions: left/right (deterministic transitions) γ = 0.9, α = 0.1 Initial Q = 0 everywhere

SARSA (ε-greedy, ε=0.1): Sample path S1-right→S2-right→S3 → update Q(S2,right) toward 10 Q-learning: always updates toward max, faster convergence to optimal

Analogy SARSA = learning from your actual driving style (on-policy) Q-learning = learning the best possible driving (off-policy, assumes optimal next actions)
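The 3-state chain makes a compact test case for tabular Q-learning; a minimal sketch (states coded 0, 1, 2 with S3 = 2 the goal, random exploration):

```python
import random

random.seed(0)
GOAL, GAMMA, ALPHA = 2, 0.9, 0.5
Q = {(s, a): 0.0 for s in (0, 1) for a in ("left", "right")}

def step(s, a):
    """Deterministic chain: right moves toward the goal, left moves back."""
    s2 = min(s + 1, GOAL) if a == "right" else max(s - 1, 0)
    return s2, (10.0 if s2 == GOAL else 0.0)

for _ in range(500):                      # episodes of purely random exploration
    s = 0
    while s != GOAL:
        a = random.choice(("left", "right"))
        s2, r = step(s, a)
        # Q-learning target: bootstrap with max over next actions (off-policy)
        target = r if s2 == GOAL else r + GAMMA * max(Q[(s2, b)] for b in ("left", "right"))
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s2

print(Q)  # Q(S2,right) -> 10, Q(S1,right) -> 9 (= 0.9 * 10)
```

Even under a random behavior policy the Q-values converge to the optimal ones, which is exactly the off-policy property the analogy describes.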

2026 practice

  • SARSA → less common (on-policy bias)

  • Q-learning → basis for DQN family

  • Expected SARSA → used in many modern actor-critic methods for lower variance

4.2 Off-policy vs on-policy learning: importance sampling in policy gradients

On-policy learning Data collected under current policy π → used to improve π Examples: SARSA, PPO, A2C/A3C

Off-policy learning Data collected under behavior policy μ → used to improve target policy π Examples: Q-learning, DQN, SAC

Importance sampling Correct for distribution mismatch: E_π [f] ≈ (1/n) Σ_i (π(a_i|s_i) / μ(a_i|s_i)) f(s_i, a_i), where the samples are drawn from μ

Numerical example – policy gradient Target policy π(a|s) = softmax(θ·ϕ(s)) Behavior policy μ = ε-greedy Advantage A = 2.5 for action a Importance ratio ρ = π(a|s) / μ(a|s) = 0.8 / 0.2 = 4 Weighted update: 4 × 2.5 = 10 (amplifies contribution)

Key trade-off

  • On-policy: lower variance, but more samples needed

  • Off-policy: higher variance (importance weights explode), but reuses old data (sample-efficient)

2026 practice

  • PPO → on-policy (stable, widely used)

  • SAC → off-policy (continuous control, entropy regularization)

  • Importance sampling with clipping (PPO-style) or per-decision importance → reduces variance

4.3 Actor-Critic methods: A2C, A3C, PPO, SAC (maximum entropy RL)

Actor-Critic

  • Actor: learns policy π_θ(a|s)

  • Critic: learns value function V_φ(s) or Q_φ(s,a) → reduces variance in policy gradient

A2C / A3C (Mnih et al. 2016)

  • A2C: synchronous → batched updates from multiple parallel environments

  • A3C: asynchronous → parallel actors update shared model

PPO (Proximal Policy Optimization, Schulman 2017) Clipped surrogate objective: L(θ) = E [ min(ρ_t A_t, clip(ρ_t, 1-ε, 1+ε) A_t) ] → Prevents large policy updates → stable training

SAC (Soft Actor-Critic, Haarnoja 2018–2019) Maximum entropy RL: J(π) = E [ Σ r_t + α H(π(·|s_t)) ] → Entropy bonus encourages exploration → Off-policy actor-critic with automatic α tuning

Numerical example – entropy bonus Policy π uniform over 4 actions → H(π) = log(4) ≈ 1.386 SAC adds α × 1.386 to reward → favors diverse actions early in training

2026 status

  • PPO → default for most continuous/discrete control tasks

  • SAC → strongest for continuous control (best sample efficiency)

  • A3C legacy, but multi-agent variants still used

4.4 Eligibility traces and n-step bootstrapping

Eligibility traces Combine one-step TD (bootstrapping) with Monte Carlo (full return) Trace e_t(s) = γ λ e_{t-1}(s) + 1(s_t = s) Update: δ_t = r_{t+1} + γ V(s_{t+1}) - V(s_t) ΔV(s) = α δ_t e_t(s)

n-step bootstrapping Use n-step return: G_{t:t+n} = r_{t+1} + γ r_{t+2} + … + γ^{n-1} r_{t+n} + γ^n V(s_{t+n}) → Bias-variance trade-off (n=1 → TD, n=∞ → Monte Carlo)

Numerical example n=3 return: G = r1 + γ r2 + γ² r3 + γ³ V(s4) If V(s4) is accurate → lower variance than 1-step If inaccurate → higher bias
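The n-step return is easy to compute generically; a small helper (the reward values are illustrative):

```python
def n_step_return(rewards, v_boot, gamma):
    """G = r_1 + gamma r_2 + ... + gamma^{n-1} r_n + gamma^n V(s_{n+1})."""
    g = sum(gamma ** k * r for k, r in enumerate(rewards))
    return g + gamma ** len(rewards) * v_boot

# n = 3, gamma = 0.9, bootstrap value V(s4) = 5
print(n_step_return([1.0, 2.0, 3.0], v_boot=5.0, gamma=0.9))
```

Passing a single reward recovers the 1-step TD target, and a full episode with v_boot = 0 recovers the Monte Carlo return — the two ends of the bias-variance spectrum.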

AI connection

  • Eligibility traces → TD(λ) in classic RL

  • n-step → used in A3C, Rainbow DQN, IMPALA

  • Modern PPO/SAC use n-step returns with GAE (Generalized Advantage Estimation)

4.5 Stochastic policies in continuous control: Gaussian policies + entropy regularization

Gaussian policy π_θ(a|s) = 𝒩(μ_θ(s), Σ_θ(s)) Usually diagonal covariance Σ = diag(exp(log_std_θ(s)))

Entropy regularization Add α H(π(·|s)) to objective → prevents premature convergence to deterministic policy SAC automatically tunes α to target entropy value

Numerical example – Gaussian policy State s → μ(s) = [0.5, -0.2], log_std = [-1, -1.5] → std = [exp(-1), exp(-1.5)] ≈ [0.368, 0.223] Sample a ~ 𝒩(μ, std²) → action has natural exploration noise
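The log-std parameterization can be sketched directly with the example's numbers, adding tanh squashing for bounded actions (stdlib only, no RL framework assumed):

```python
import math, random

random.seed(0)
mu = [0.5, -0.2]
log_std = [-1.0, -1.5]
std = [math.exp(x) for x in log_std]      # ~[0.368, 0.223]

# Sample a ~ N(mu, diag(std^2)); tanh squashing bounds each action to (-1, 1)
a = [math.tanh(random.gauss(m, s)) for m, s in zip(mu, std)]
print(std, a)
```

Parameterizing the standard deviation through exp(log_std) keeps it positive without constrained optimization, which is why SAC-style implementations learn log_std rather than std.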

2026 practice

  • SAC → default for continuous control (OpenAI Gym, MuJoCo, DM Control)

  • Gaussian + squashed tanh → bounded actions (e.g., [-1,1] torque)

  • Entropy coefficient α → auto-tuned → balances exploration/exploitation

These foundations — TD learning, actor-critic, eligibility traces, stochastic policies — form the backbone of modern RL and decision-making systems in AI.

5. Policy Gradient and Stochastic Policy Optimization

Policy gradient methods directly optimize the policy π_θ(a|s) to maximize expected return using gradient ascent on the objective J(θ) = E[return | π_θ]. Unlike value-based methods (Q-learning), policy gradients work naturally with continuous actions and stochastic policies — the dominant approach for continuous control and many high-dimensional tasks in 2026.

5.1 REINFORCE algorithm and variance reduction (baseline, advantage normalization)

REINFORCE (Williams 1992) — the original policy gradient theorem

Objective: J(θ) = E_{τ ~ π_θ} [ R(τ) ] where τ = (s₀,a₀,r₁,s₁,…,s_T) is a trajectory Gradient: ∇_θ J(θ) = E [ R(τ) ∇_θ log π_θ(τ) ]

Monte-Carlo REINFORCE update Sample full trajectory → compute total return G_t = Σ_{k=t}^T γ^{k-t} r_k Update: Δθ = α G_t ∇_θ log π_θ(a_t|s_t)

High variance problem Return G_t has huge variance → noisy gradients → slow/unstable learning

Variance reduction techniques

  1. Baseline Subtract state-dependent baseline b(s_t): Δθ = α (G_t - b(s_t)) ∇_θ log π_θ(a_t|s_t) Optimal baseline ≈ V^π(s_t) → advantage A_t = G_t - V(s_t)

  2. Advantage normalization Normalize advantages across batch: Â_t = (A_t - μ_A) / (σ_A + ε) → Helps with scale invariance and numerical stability

Numerical example – REINFORCE with baseline Trajectory reward sum G_t = 25 Baseline b(s_t) ≈ V(s_t) = 15 (learned critic) Advantage A_t = 25 - 15 = 10 Without baseline: gradient scaled by 25 With baseline: gradient scaled by 10 → gradient magnitude 2.5× smaller, hence substantially lower variance

AI connection

  • REINFORCE with baseline → foundation of all modern policy gradient methods

  • Used in early robotic grasping, game playing, and as baseline for PPO/SAC

5.2 Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO)

TRPO (Schulman et al. 2015) Maximize surrogate advantage L(θ) = E [ (π_θ(a|s) / π_old(a|s)) Â_t ] Subject to KL constraint: E [ KL(π_old || π_θ) ] ≤ δ → Large policy updates can destabilize learning

Solution: Conjugate gradient + line search to enforce constraint

PPO (Schulman et al. 2017) — simplified, more sample-efficient TRPO Clipped surrogate objective:

L^{clip}(θ) = E [ min( ρ_t Â_t , clip(ρ_t, 1-ε, 1+ε) Â_t ) ] where ρ_t = π_θ(a|s) / π_old(a|s), ε ≈ 0.1–0.2

Numerical example – PPO clipping ρ_t = 1.8 (new policy much more likely), Â_t = +5 Unclipped: 1.8 × 5 = 9 Clipped (ε=0.2): min(9, 1.2 × 5) = min(9, 6) = 6 → Prevents destructive large updates
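The clipping arithmetic can be checked directly; a per-sample sketch of the clipped term:

```python
def ppo_term(rho, adv, eps=0.2):
    """Per-sample PPO-Clip surrogate: min(rho * A, clip(rho, 1-eps, 1+eps) * A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, rho))
    return min(rho * adv, clipped * adv)

print(ppo_term(1.8, 5.0))   # clipped: 1.2 * 5 = 6, not 1.8 * 5 = 9
print(ppo_term(0.5, -2.0))  # negative advantage: the clip binds on the other side
```

Taking the min makes the objective pessimistic: the policy gains nothing from pushing ρ outside [1-ε, 1+ε], which is what prevents destructive updates.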

2026 status

  • PPO → still the most widely used open-source RL algorithm (stable, easy to implement)

  • PPO variants (PPO-Clip, PPO-Penalty) dominate robotics, games, autonomous driving

5.3 Natural Policy Gradient and KL-constrained optimization

Natural Policy Gradient (NPG) (Kakade 2001) Uses Fisher information matrix F(θ) to precondition gradient:

∇_natural J = F^{-1} ∇ J

Fisher matrix F(θ) = E [ ∇ log π ∇ log π^T ] → Measures local curvature of policy distribution

KL-constrained optimization Maximize surrogate advantage subject to KL(π_old || π_new) ≤ δ TRPO solves exactly via conjugate gradient PPO approximates via clipping

Numerical example – NPG advantage Plain gradient: Δθ = α ∇ J Natural gradient: Δθ = α F^{-1} ∇ J In high-dimensional policy space → natural gradient takes larger, more effective steps along low-curvature directions

AI connection

  • TRPO → direct ancestor of PPO

  • Natural gradient → used in some advanced actor-critic methods and continual learning

5.4 Stochastic gradient estimation in high-variance environments

High-variance issues

  • Long-horizon tasks → return variance explodes (γ^t compounds)

  • Sparse rewards → most trajectories have zero return → noisy gradients

Mitigation techniques

  • Advantage normalization (mean 0, std 1 across batch)

  • Generalized Advantage Estimation (GAE, Schulman 2015): Â_t = δ_t + (γλ) δ_{t+1} + … + (γλ)^{T-t-1} δ_{T-1} λ ≈ 0.95 → bias-variance trade-off

  • Reward normalization / clipping

  • Entropy bonus (prevents premature convergence)

Numerical example – GAE Rewards: r1=0, r2=0, r3=10, γ=0.99, λ=0.95 δ3 = 10 + 0.99 V(s4) - V(s3) ≈ 10 (assuming V(s4) ≈ V(s3) ≈ 0) Â_1 = δ1 + (0.99×0.95) δ2 + (0.99×0.95)² δ3 ≈ 0 + 0 + 8.8 → Advantage spreads reward backward → reduces variance
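In practice GAE is computed with the backward recursion Â_t = δ_t + γλ Â_{t+1}; a sketch reproducing the r1=0, r2=0, r3=10 example with all critic values taken as 0:

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """values has length len(rewards) + 1 (bootstrap value at the end)."""
    adv, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running                 # discounted sum of deltas
        adv[t] = running
    return adv

# rewards r1=0, r2=0, r3=10, untrained critic V = 0 everywhere
print(gae([0.0, 0.0, 10.0], [0.0, 0.0, 0.0, 0.0]))  # ~[8.85, 9.41, 10.0]
```

The single terminal reward propagates backward through the (γλ) factors, exactly as the example shows.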

2026 practice GAE + advantage normalization → standard in PPO, SAC, A2C/A3C implementations

5.5 Maximum Entropy Reinforcement Learning (Soft Actor-Critic)

Maximum Entropy RL Maximize J(π) = E [ Σ r_t + α H(π(·|s_t)) ] → Entropy term encourages exploration + robustness

Soft Actor-Critic (SAC) (Haarnoja et al. 2018–2019)

  • Off-policy actor-critic with entropy regularization

  • Actor: stochastic Gaussian policy

  • Critic: twin Q-networks (reduce overestimation)

  • Automatic α tuning: target entropy = -dim(A)

Numerical example – entropy tuning Action dim = 4 (e.g., 4-joint robot) Target H = -4 (uniform over reasonable range) If current H = -1.5 (too deterministic) → α increases → more exploration If H = -6 (too random) → α decreases → exploit more

2026 status

  • SAC → default for continuous control benchmarks (MuJoCo, DM Control)

  • Extensions: SAC-N, DrQ-v2, REDQ → SOTA sample efficiency

  • Entropy regularization now standard in most actor-critic methods

Policy gradient methods, especially PPO and SAC, remain the workhorse of deep RL in 2026 — especially for continuous control, robotics, and agentic AI systems.

6. Model-Based Reinforcement Learning and Planning

Model-based reinforcement learning (MBRL) learns an explicit model of the environment (dynamics P(s'|s,a) and reward R(s,a)) and uses it for planning, imagination, or policy improvement. This contrasts with model-free methods (PPO, SAC) that learn directly from experience without building a model.

Model-based methods are usually more sample-efficient (fewer real interactions needed), especially in long-horizon or expensive-to-sample environments (robotics, autonomous driving, games with long episodes).

6.1 Dyna architecture: real + simulated experience

Dyna (Sutton 1990–1991) — the classic hybrid model-based / model-free approach

Core idea

  • Learn a model M̂(s,a) → s', r

  • Use real experience (from environment) to update policy/value + model

  • Use simulated experience (from model) to perform additional planning updates

Dyna-Q algorithm

  1. Take real action a in s → observe s', r

  2. Update Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') - Q(s,a)]

  3. Update model: M̂(s,a) ← (s', r)

  4. Repeat k times: sample s,a from memory → s',r = M̂(s,a) → update Q(s,a) with simulated transition

Numerical example – grid world with Dyna Real step: s=(1,1), a=right → s'=(1,2), r=-1 Update Q + model Then 50 simulated updates: pick random past (s,a) → fake s',r → update Q → Agent learns much faster than pure Q-learning (50× more updates per real step)

Analogy Dyna = daydreaming: after real experience (playing chess game), replay mental simulations (think about alternative moves) to improve faster.
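A minimal Dyna-Q sketch on a 1D corridor (the environment is illustrative, not from the book):

```python
import random

random.seed(0)
N, GOAL = 5, 4                       # corridor states 0..4, reward at the right end
GAMMA, ALPHA, K = 0.9, 0.5, 20       # K = simulated (planning) updates per real step
Q = {(s, a): 0.0 for s in range(N) for a in (-1, +1)}
model = {}                           # (s, a) -> (s', r): learned deterministic model

def step(s, a):
    s2 = min(max(s + a, 0), GOAL)
    return s2, (10.0 if s2 == GOAL else 0.0)

def td_update(s, a, s2, r):
    target = r if s2 == GOAL else r + GAMMA * max(Q[(s2, b)] for b in (-1, +1))
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

for _ in range(100):                 # real episodes with random exploration
    s = 0
    while s != GOAL:
        a = random.choice((-1, +1))
        s2, r = step(s, a)
        td_update(s, a, s2, r)       # learn from the real transition
        model[(s, a)] = (s2, r)      # record it in the model
        for _ in range(K):           # planning: replay simulated transitions
            ps, pa = random.choice(list(model))
            ps2, pr = model[(ps, pa)]
            td_update(ps, pa, ps2, pr)
        s = s2

print(max(Q[(0, a)] for a in (-1, +1)))  # optimal V(0) -> 0.9^3 * 10 = 7.29
```

Each real step triggers K extra planning updates drawn from the model's memory, which is what makes Dyna converge in far fewer real interactions than plain Q-learning.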

2026 extensions

  • Dyna variants in DreamerV3, MuZero (use learned latent model for imagination)

  • Prioritized experience replay + model-based updates → very high sample efficiency

6.2 Model Predictive Control (MPC) with learned dynamics

Model Predictive Control (MPC) At each time step:

  1. Use current model to predict future states over horizon H

  2. Optimize sequence of actions u_0, u_1, …, u_{H-1} to maximize sum of predicted rewards

  3. Execute only first action u_0 → repeat at next step (receding horizon)

Learned dynamics Replace analytical physics model with neural network dynamics f_θ(s_t, a_t) → s_{t+1}

Numerical example – simple cart-pole MPC Horizon H=20 Current state s = [position, velocity, angle, angular vel] Optimize 20 actions (torques) to keep pole upright longest CEM (Cross-Entropy Method) or iLQR → sample/optimize candidate trajectories Execute first torque → replan next step
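The CEM inner loop can be sketched on a much simpler known model, a 1D integrator x' = x + a (an illustrative stand-in for the cart-pole dynamics):

```python
import random, statistics

random.seed(0)
H, GOAL = 5, 3.0                       # horizon, target position

def cost(actions, x0=0.0):
    """Rollout the (here: known) model x' = x + a; penalize distance + effort."""
    x, effort = x0, 0.0
    for a in actions:
        x += a
        effort += 0.01 * a * a
    return (x - GOAL) ** 2 + effort

# Cross-Entropy Method: sample action sequences, refit a Gaussian to the elites
mean, std = [0.0] * H, [1.0] * H
for _ in range(20):
    samples = [[random.gauss(m, s) for m, s in zip(mean, std)] for _ in range(100)]
    elites = sorted(samples, key=cost)[:10]
    mean = [statistics.mean(e[t] for e in elites) for t in range(H)]
    std = [statistics.stdev(e[t] for e in elites) + 1e-3 for t in range(H)]

print(mean[0], cost(mean))  # MPC would execute mean[0], then replan next step
```

Replacing the hand-written rollout with a learned network f_θ(s, a) turns this into neural MPC; the CEM loop itself is unchanged.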

2026 practice

  • Neural MPC + learned dynamics → standard in robotics (legged locomotion, manipulation)

  • Diffusion MPC / trajectory diffusion → generate diverse trajectory candidates

Advantages

  • Handles constraints naturally (safety limits on torque/joint angles)

  • Replanning corrects model errors

6.3 MuZero, EfficientZero, DreamerV3 – latent world models

MuZero (Schrittwieser et al. 2020 → EfficientZero 2021) Learns model in latent space (no explicit state reconstruction)

Components:

  • Representation: h = Encoder(o_t) → latent state

  • Dynamics: g(h_t, a_t) → h_{t+1}, reward prediction

  • Prediction: p(h_t) → policy logits, v(h_t) → value

DreamerV3 (Hafner et al. 2023)

  • RSSM (Recurrent State-Space Model) → latent dynamics

  • World model trained with reconstruction + reward + KL regularization

  • Actor-critic in imagination (rollouts in latent space)

Numerical comparison (Atari 100k benchmark, 2026 view)

  • Model-free (Rainbow DQN): ~50–60% human performance

  • MuZero / EfficientZero: ~150–200% human performance

  • DreamerV3: ~180–250% (strong sample efficiency)

Key insight Latent world models allow millions of imagined steps per real step → 10–100× faster learning

6.4 Planning as inference: diffusion-based planning (Decision Diffuser)

Diffusion planners (Diffuser, Janner et al. 2022; Decision Diffuser, Ajay et al. 2023; many follow-ups) Treat planning as conditional generative modeling:

  • Forward diffusion: corrupt trajectory τ_0 → noisy τ_T

  • Reverse diffusion: condition on current state s_t and goal → denoise to feasible trajectory

Advantages

  • Generates diverse plans (stochastic sampling)

  • Handles complex constraints via classifier guidance or reward conditioning

  • Naturally incorporates uncertainty

Numerical example Robot arm task: current state s_t = joint angles Condition diffusion on goal = reach target Sample 100 trajectories → pick highest-reward / safest one → execute first action

2026 extensions

  • Diffusion Policy (Chi et al.) → end-to-end diffusion for robot control

  • Plan4MC → diffusion + MPC hybrid

  • Diffusion for multi-agent planning

6.5 Stochastic model-based planning with uncertainty-aware models

Uncertainty-aware model Predict not only mean s' = f(s,a), but also uncertainty (variance or full distribution)

Methods

  • Ensemble dynamics: train multiple models → variance across ensemble = uncertainty

  • Probabilistic dynamics: Gaussian likelihood or MDN (mixture density network)

  • Epistemic + aleatoric uncertainty separation

Stochastic planning

  • Use uncertainty to guide exploration (high uncertainty → try actions there)

  • Risk-sensitive MPC: minimize expected cost + λ × variance

  • Thompson sampling in model-based RL: sample model from posterior → plan with it

Numerical example – ensemble uncertainty 5 dynamics models predict s' = [3.1, 3.4, 2.9, 3.0, 3.2] Mean = 3.12, sample std ≈ 0.19 High std → high epistemic uncertainty → agent prefers to explore this action
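Checking the arithmetic (note the sample vs population std distinction):

```python
import statistics

preds = [3.1, 3.4, 2.9, 3.0, 3.2]   # next-state predictions from a 5-model ensemble
mean = statistics.mean(preds)
spread = statistics.stdev(preds)    # sample std across the ensemble
print(mean, spread)                 # disagreement -> epistemic uncertainty signal
```

An exploration bonus proportional to this spread (or a risk penalty subtracting it) is the simplest way such ensembles feed into stochastic planning.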

2026 practice

  • PETS (Probabilistic Ensembles with Trajectory Sampling) → ensemble + CEM

  • DreamerV3 + uncertainty → strong performance in DM Control

  • Diffusion-based planning → naturally uncertainty-aware (stochastic samples)

Model-based RL and planning leverage learned stochastic dynamics to imagine, predict, and optimize far more efficiently than model-free methods — the key to scaling RL to real-world robotics, autonomous systems, and long-horizon decision-making in 2026.


7. Stochastic Optimal Control and Diffusion for Planning

    Stochastic optimal control (SOC) provides the mathematical lens that unifies reinforcement learning, planning, and modern generative modeling. In 2026, diffusion models are increasingly viewed as a form of stochastic control: generating trajectories (whether pixels or robot actions) is equivalent to steering a stochastic system from noise/current state to a desired distribution/goal.

    This section bridges classical control theory with the diffusion-based planning revolution.

    7.1 Stochastic optimal control formulation of RL

    Stochastic Optimal Control (SOC) Find policy/controller u(t) that minimizes expected cost:

    J(u) = E [ ∫_0^T c(x(t), u(t), t) dt + Φ(x(T)) ]

    subject to stochastic dynamics:

    dx = f(x,u,t) dt + g(x,u,t) dW

    RL as SOC

    • State x = environment state s

    • Control u = action a

    • Cost c = -r (negative reward)

    • Terminal cost Φ = 0 or goal penalty

    • Discount γ → exponential cost decay c(t) = γ^t (-r_t)

    Standard RL objective becomes:

    min_π E_π [ Σ_t γ^t (-r_t) ] = max_π E_π [ Σ_t γ^t r_t ]

    KL-regularized RL (maximum entropy RL, soft Q-learning) Add KL divergence penalty to prevent collapse to deterministic policy:

    J(π) = E [ Σ r_t + α H(π(·|s_t)) ]

    This is equivalent to SOC with control cost proportional to KL(π || uniform).

    Numerical example – simple 1D control State x ∈ ℝ, action u ∈ ℝ Dynamics: dx = u dt + 0.1 dW Cost: c = x² + 0.01 u² Optimal control: u* = -k x (linear feedback) KL-regularized: adds exploration noise → u = -k x + noise
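For this 1D example the optimal gain can be found by iterating the discrete-time Riccati recursion on a fine time grid; with cost x² + 0.01 u² the continuous-time algebraic Riccati equation gives k = 10, which the sketch below recovers up to discretization error (the additive noise does not change the LQR gain, by certainty equivalence):

```python
# Discretize dx = u dt with step dt; cost x^2 + 0.01 u^2 per unit time.
dt = 1e-3
A, Bm = 1.0, dt              # x_{t+1} = A x_t + Bm u_t
Qc, Rc = 1.0 * dt, 0.01 * dt

P = 0.0
for _ in range(5000):        # value-iteration form of the Riccati recursion
    K = (A * P * Bm) / (Rc + Bm * P * Bm) if P else 0.0
    P = Qc + A * P * A - (A * P * Bm) * K
gain = (Bm * P * A) / (Rc + Bm * P * Bm)

print(gain)  # ~10: optimal linear feedback u* = -10 x
```

KL-regularized control would add exploration noise on top of this deterministic feedback law, as noted in the example.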

    2026 insight Many state-of-the-art methods (PPO with entropy, SAC, Diffusion Policy) are approximate SOC solvers.

    7.2 Diffusion for trajectory generation and planning (Diffuser, Plan4MC)

    Diffusion for planning Treat entire future trajectory τ = (s_t, a_t, s_{t+1}, …, s_{t+H}) as the “image” to generate.

    Forward diffusion Add noise to trajectory → τ_T ≈ pure Gaussian noise

    Reverse diffusion Condition on current state s_t and goal (or reward) → denoise to feasible, high-reward trajectory

    Diffuser (Janner et al. 2022–2023)

    • Diffusion over trajectory tokens

    • Classifier guidance toward high-reward regions

    • Iterative refinement → plan → execute first action → replan

    Plan4MC / Diffusion Planner variants (2024–2026)

    • Latent diffusion in world-model latent space (Dreamer-style)

    • Reward-conditioned score function → generate diverse plan ensembles

    • Select best trajectory via MPC rollouts or learned value

    Numerical example – block stacking Current state s_t = robot + block positions Condition diffusion on goal = block on target Sample 50 trajectories → evaluate with short-horizon MPC or learned critic → pick top-1 → execute first action

    Advantages

    • Generates diverse plans (handles uncertainty)

    • Naturally incorporates constraints via guidance

    • Scales to long horizons via latent space

    2026 status Diffusion planning is now competitive or superior to classical MPC in manipulation and legged locomotion (real-robot demos in labs).

    7.3 Schrödinger bridge and optimal transport in control

Schrödinger bridge (Schrödinger, 1931–32; revived in generative modeling 2021–2026) Find the most likely stochastic path (bridge) connecting two distributions p_0 (data/current state) and p_T (noise/goal) while minimizing KL divergence to a reference process (e.g., Brownian motion).

    Mathematical form min_q KL(q || p_ref) subject to marginals q_0 = p_0, q_T = p_T

    Connection to diffusion Reverse diffusion is an approximate Schrödinger bridge from noise to data.

    Connection to control Schrödinger bridge = stochastic optimal control problem with fixed marginals → Optimal drift = reference drift + score difference

    Numerical example – bridge from 𝒩(0,1) to 𝒩(5,1) Reference = Brownian motion Optimal bridge = deterministic path with added controlled noise → Straight-line mean shift + minimal diffusion

    2026 applications

    • Rectified flow / flow-matching ≈ discretized Schrödinger bridges → 1–5 step generation

    • Trajectory planning: bridge from current state distribution to goal distribution

    • Offline RL: bridge between behavior policy and optimal policy

    7.4 Control as inference: KL-regularized RL and reward-weighted regression

    Control as inference Cast RL as inference in a probabilistic graphical model:

    • High reward → high probability

    • Policy π(a|s) → likelihood

    • Add KL divergence KL(π || prior) as prior preference for simple/smooth policies

    KL-regularized RL J(π) = E [ Σ_t r_t − α KL(π(·|s_t) || π_old(·|s_t)) ] → Soft Q-learning, MPO, REPS, TRPO/PPO all derive from variants of this objective

    Reward-weighted regression Update policy by weighted regression:

    π_new(a|s) ∝ π_old(a|s) exp( (1/α) Â(s,a) )

    Numerical example – reward-weighted update
    Old policy: π_old(a1|s) = 0.6, π_old(a2|s) = 0.4
    Advantages: Â(a1) = +4, Â(a2) = -1, with α = 1
    → exp(Â/α): exp(4) ≈ 54.6, exp(-1) ≈ 0.368
    Unnormalized weights: 0.6 × 54.6 ≈ 32.76 and 0.4 × 0.368 ≈ 0.147
    Normalized: π_new(a1) ≈ 0.996, π_new(a2) ≈ 0.004
    → Strong shift toward the high-advantage action
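The update is easy to verify in a few lines:

```python
import math

# Reward-weighted regression: pi_new(a|s) ∝ pi_old(a|s) * exp(A(s,a) / alpha)
old = {"a1": 0.6, "a2": 0.4}          # pi_old(a|s)
adv = {"a1": 4.0, "a2": -1.0}         # advantages A(s, a)
alpha = 1.0

weights = {a: old[a] * math.exp(adv[a] / alpha) for a in old}
Z = sum(weights.values())             # normalizing constant
new = {a: w / Z for a, w in weights.items()}

print({a: round(p, 3) for a, p in new.items()})
# → {'a1': 0.996, 'a2': 0.004}
```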

    2026 practice

    • PPO = approximate KL-constrained inference

    • Diffusion fine-tuning = reward-weighted denoising

    • Control as inference → unifying language for RL + generative modeling

    7.5 Diffusion policies vs traditional policy networks

    Traditional policy networks π_θ(a|s) = MLP / Transformer → deterministic or Gaussian output Trained with policy gradient / actor-critic

    Diffusion policies (Chi et al. 2023–2025 → widespread in robotics 2026) Policy = diffusion model conditioned on s Generate action sequence a_t, a_{t+1}, … via reverse diffusion Condition on current observation s → denoise to feasible action trajectory

    Advantages

    • Multimodal actions → captures multiple good ways to act

    • Handles constraints naturally (via guidance)

    • Uncertainty-aware → sample variance indicates confidence

    • Long-horizon consistency (diffusion over trajectory)

    Numerical example – robot pushing
    State s = object + gripper pose
    Diffusion policy generates a 16-step action sequence (joint torques)
    Sample 50 trajectories → pick the one with the highest critic value, or the most consistent one
    → Success rate 75–90% vs 50–70% for a Gaussian policy

    2026 status

    • Diffusion Policy → SOTA on many real-robot manipulation benchmarks

    • Combines with MPC → hybrid diffusion + model-predictive refinement

    • Used in humanoid robots, dexterous hands, autonomous vehicles

    Stochastic optimal control and diffusion-based planning represent the convergence of generative modeling and decision-making — the most exciting frontier in AI in 2026.

8. Advanced Diffusion Models and Stochastic Processes

This section explores the major advancements and variants that have made diffusion models the dominant generative paradigm in 2026. We cover different formulations of the diffusion process, deterministic/flow-based alternatives, extensions to curved/non-Euclidean domains, latent-space acceleration (the Stable Diffusion family), and discrete/abstractive diffusion models.

All concepts build directly on the SDE framework from Section 6 and the score-matching objective from Section 7.

8.1 Variance-exploding (VE) vs variance-preserving (VP) formulations

The forward diffusion process can be defined in two main ways, differing in how the noise variance evolves over time. This choice affects training stability, sampling behavior, and final sample quality.

Variance-Exploding (VE) – Song & Ermon / NCSN++ style

  • Forward SDE: dx = √(dσ²(t)/dt) dW

  • Variance σ²(t) starts small (near 0) and explodes to a very large value (σ_max ≈ 50–300)

  • Data signal x₀ decays slowly → at large t, x_t is dominated by isotropic Gaussian noise with huge variance

  • Score function at late t: ∇ log p_t(x) ≈ -x / σ²(t) (pulls toward origin)

Variance-Preserving (VP) – Ho et al. DDPM style

  • Forward process: x_t = √α_bar_t x_0 + √(1-α_bar_t) ε

  • Total variance of x_t remains approximately 1 (preserved) throughout

  • Continuous SDE equivalent: dx = -½ β(t) x dt + √β(t) dW

  • β(t) is the noise schedule (small early, larger later)

  • Score function at late t: ∇ log p_t(x) ≈ -x (unit-scale pull toward origin)

Comparison (2026 perspective)

  • Final noise variance: VE very large (σ² → 10³–10⁵); VP bounded ≈ 1

  • Signal decay: VE slow (x₀ term persists); VP fast (x₀ term → 0)

  • Score magnitude late in the process: VE very small (∝ 1/σ²(t)); VP order 1

  • Numerical stability: VE can be unstable at large σ; VP more stable

  • Typical schedule: VE exponential or linear σ²(t); VP cosine or linear β(t)

  • Popular in production: VE research and some high-fidelity models; VP Stable Diffusion family, Flux, most open models

  • Sampling speed: VE similar with good solvers; VP slightly faster in practice

Numerical intuition

  • VE at t large: x_t ≈ 𝒩(0, 10000 I) → score ≈ -x/10000 (very weak pull)

  • VP at t large: x_t ≈ 𝒩(0, I) → score ≈ -x (strong, unit-scale pull) → VP is easier to learn and more stable for most image/video tasks.
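A quick stdlib check of this intuition (ᾱ = 0.01 and σ = 100 are illustrative late-time values, not a specific schedule):

```python
import random, math

random.seed(2)
N = 20000
data = [random.gauss(0.0, 1.0) for _ in range(N)]  # toy data with Var ≈ 1

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# VP at large t: alpha_bar → 0, yet total variance stays ≈ 1
alpha_bar = 0.01
vp = [math.sqrt(alpha_bar) * x + math.sqrt(1 - alpha_bar) * random.gauss(0, 1)
      for x in data]

# VE at large t: sigma explodes, variance ≈ sigma²
sigma = 100.0
ve = [x + sigma * random.gauss(0, 1) for x in data]

print(round(variance(vp), 2))               # ≈ 1.0  (preserved)
print(round(variance(ve) / sigma ** 2, 2))  # ≈ 1.0, i.e. Var ≈ 10 000 (exploded)
```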

2026 practice VP + cosine schedule is the default in almost all production open models (Stable Diffusion 3, SDXL, Flux.1, AuraFlow). VE is still used in some research for theoretical flexibility.

8.2 Rectified flow, flow-matching, and stochastic interpolants

These deterministic or near-deterministic alternatives to stochastic diffusion often achieve faster sampling with comparable or better quality.

Rectified flow (Liu et al. 2022–2023 → major refinements 2024–2025)

  • Learn straight-line paths from noise z ~ 𝒩(0,I) to data x₀

  • Velocity field v_θ(z,t) predicts dx/dt along the path

  • Train to minimize difference between predicted and true straight velocity

  • Sampling = integrate ODE from t=1 (noise) to t=0 (data)

Flow-matching (Lipman et al. 2022–2023 → dominant in 2026)

  • Generalizes rectified flow

  • Learns conditional velocity field u_θ(x|t) that transports marginal p_t to data p_0

  • Objective: regress u_θ(x(t),t) to target velocity (straight-line or optimal transport velocity)

Stochastic interpolants (Albergo & Vanden-Eijnden 2023+)

  • Add controlled noise to flow-matching paths → hybrid stochastic-deterministic

  • Allows tunable exploration vs determinism

Numerical comparison (typical ImageNet 256×256, 2026 benchmarks)

  • DDPM/VP (50 steps): FID ≈ 2.0–3.0

  • Flow-matching / rectified flow (5–10 steps): FID ≈ 2.2–3.5

  • Consistency-distilled flow-matching (1–4 steps): FID ≈ 2.8–4.0 → 10–50× faster sampling with small quality trade-off

Analogy Diffusion = random walk from noise to data (many small noisy steps) Rectified flow / flow-matching = straight highway from noise to data (few large directed steps)
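The highway analogy can be made concrete: along a straight path a single Euler step is exact, while a curved path needs many steps for comparable accuracy. A small sketch (the cosine interpolation is an illustrative curved path, not any specific model's trajectory):

```python
import math

# Straight (rectified) path: x(t) = (1 - t) * z + t * x1, constant velocity x1 - z
z, x1 = -2.0, 3.0
one_step = z + 1.0 * (x1 - z)       # one Euler step over the whole interval
# one_step == 3.0 exactly, because the path is straight

# Curved path (cosine interpolation) needs many Euler steps
def v_curved(t):
    # d/dt [cos(pi t / 2) * z + sin(pi t / 2) * x1]
    return (math.pi / 2) * (-math.sin(math.pi * t / 2) * z
                            + math.cos(math.pi * t / 2) * x1)

errs = {}
for steps in (1, 100):
    x, dt = z, 1.0 / steps
    for i in range(steps):
        x += v_curved(i * dt) * dt  # explicit Euler
    errs[steps] = abs(x - x1)

print(one_step, round(errs[1], 3), round(errs[100], 3))
# straight path: exact in 1 step; curved path: ≈ 0.29 error with 1 step, < 0.01 with 100
```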

2026 status Flow-matching + consistency distillation is now the fastest path to high-quality generation. Flux.1, AuraFlow, and many open models use flow-matching as backbone.

8.3 Diffusion on non-Euclidean manifolds (Riemannian diffusion)

Standard diffusion assumes flat Euclidean space. Real data often lies on curved manifolds (spheres for directions, hyperbolic for hierarchies, tori for periodic variables, SE(3) for 3D poses).

Riemannian diffusion Forward SDE defined using Riemannian metric g and Laplace–Beltrami operator Δ_g:

dx = f(x,t) dt + g(t) dW_M, where dW_M is Brownian motion on the manifold M (generated by the Laplace–Beltrami operator Δ_g)

Reverse process Learns Riemannian score ∇_M log p_t(x) in tangent space at x Sampling uses Riemannian Euler–Maruyama or geodesic integrators

Key models & papers (2023–2026)

  • GeoDiff → first practical Riemannian diffusion for molecules (torsion angles on torus)

  • Riemannian Score Matching (Huang et al.) → general framework

  • Manifold Diffusion Models (2024–2025) → extensions to hyperbolic, spherical, Grassmann manifolds

  • Diffusion on SE(3) → 3D pose & molecule generation

Numerical example – torus for torsion angles
Molecule with 5 rotatable bonds → configuration space = torus T⁵
Forward: add toroidal (wrapped) Brownian motion
Score learned in the tangent space → reverse sampling stays on the torus → valid conformations
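The forward step on the torus can be sketched in pure Python (the five torsion angles are made-up values): wrapping each Brownian increment mod 2π keeps every intermediate state a valid point of T⁵.

```python
import random, math

random.seed(3)
TWO_PI = 2 * math.pi

# Hypothetical torsion angles (radians) for 5 rotatable bonds
angles = [0.5, 1.2, 2.8, 4.0, 5.9]

# Forward "toroidal" noising: small Brownian steps, wrapped mod 2π
for _ in range(1000):
    angles = [(a + random.gauss(0.0, 0.05)) % TWO_PI for a in angles]

# Every visited state stays on T⁵: all angles remain in [0, 2π)
on_torus = all(0.0 <= a < TWO_PI for a in angles)
print(on_torus, [round(a, 2) for a in angles])
```

A Euclidean diffusion would drift the angles outside [0, 2π); the wrap (the toy analogue of intrinsic Brownian motion on the manifold) is what keeps samples geometrically valid.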

Applications

  • Protein/molecule generation (torsion diffusion)

  • Directional image generation (spherical diffusion)

  • Hierarchical graph generation (hyperbolic diffusion)

  • Robot pose planning (SE(3) diffusion)

8.4 Latent diffusion models (LDM, Stable Diffusion family)

Latent Diffusion Models (LDM) (Rombach et al. 2022 → foundation of Stable Diffusion 1–3, SDXL, Flux.1, AuraFlow) Run diffusion in low-dimensional latent space instead of high-res pixel space.

Workflow

  1. Train autoencoder (VAE or VQ-VAE) to compress x → z (e.g., 512×512 → 64×64×4)

  2. Run diffusion on z (much cheaper)

  3. Decode final z → high-resolution image
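The arithmetic behind the speedup in step 2 is easy to check:

```python
# Compression from pixel space to an SD-style latent space
pixels = 512 * 512 * 3          # 786 432 values per RGB image
latents = 64 * 64 * 4           # 16 384 values per latent
ratio = pixels / latents
print(ratio)                    # 48.0 → ~48× fewer values to diffuse over
```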

Why it works

  • Latent space is smoother and lower-dimensional → faster training/sampling

  • Perceptual compression (KL-regularized VAE) preserves high-frequency details in decoder

Numerical impact

  • Pixel-space diffusion on 512×512: ~10–20× slower training

  • Latent diffusion: trains on 64×64 latents → 4–8× speedup, same perceptual quality

2026 extensions

  • SD3 Medium / SD3.5 → larger latents + better VAEs + rectified flow

  • Flux.1 → flow-matching in latent space + massive pretraining

  • LCM-LoRA / SDXL Turbo → 1–4 step latent generation

8.5 Discrete diffusion and absorbing state models (D3PM, MaskGIT)

Discrete diffusion Diffusion on discrete tokens (text, graphs, protein sequences, images with VQ-VAE).

Absorbing state models (D3PM – Austin et al. 2021)

  • Forward: gradually replace tokens with absorbing [MASK] state

  • Reverse: learn to recover original token from masked context

  • Transition matrix: categorical diffusion with absorbing state

MaskGIT / MAGE / Masked Generative Transformers (2022–2025)

  • Mask large portions → predict masked tokens in parallel (BERT-like)

  • Iterative refinement: mask → predict → remask uncertain tokens → repeat

Numerical example – discrete text diffusion
Sequence: “the cat sat on the mat”
Forward: at step t, each token → [MASK] with probability β_t
Reverse: the model learns p_θ(token | masked context)
After 10–20 iterations, a coherent sentence emerges from a fully masked input
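The absorbing-state forward process can be sketched in a few lines (a toy stdlib example; the 6-token sentence and the constant β_t = 0.15 are illustrative choices):

```python
import random

random.seed(4)
tokens = "the cat sat on the mat".split()

def forward_mask(seq, beta):
    """One forward step: each still-unmasked token is absorbed into [MASK] w.p. beta."""
    return [t if t == "[MASK]" or random.random() >= beta else "[MASK]"
            for t in seq]

seq = tokens
for step in range(20):           # 20 forward steps with beta_t = 0.15
    seq = forward_mask(seq, 0.15)

frac_masked = seq.count("[MASK]") / len(seq)
print(seq, frac_masked)          # nearly all tokens absorbed into [MASK]
```

Because [MASK] is absorbing, the survival probability of a token after 20 steps is 0.85²⁰ ≈ 0.04, so the sequence ends up almost fully masked; the reverse model's job is to undo exactly this corruption.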

2026 status

  • Discrete diffusion used in DNA/protein sequence design (e.g., EvoDiff)

  • MaskGIT-style models competitive with autoregressive LLMs for infilling, editing, and code generation

  • Hybrid continuous-discrete diffusion → token latents + continuous diffusion (e.g., image tokenization + diffusion)

This section shows how the diffusion paradigm has evolved into a versatile, high-performance framework — from continuous pixel/video generation to discrete token sequences and curved manifold data. These advancements are behind nearly every production-grade generative system in 2026.

9. Stochastic Differential Equations (SDEs) in Generative AI

Stochastic Differential Equations (SDEs) are the continuous-time mathematical backbone of all modern diffusion-based generative models. In 2026, nearly every high-quality image, video, 3D molecule, protein structure, audio, and even planning trajectory is generated by solving an SDE (or its deterministic flow counterpart) in the forward (noise addition) or reverse (denoising) direction.

This section explains the core SDE formulation, how reverse-time SDEs are derived, practical numerical solvers, adaptive acceleration methods, and the deep theoretical connections to optimal control and Schrödinger bridges.

9.1 Forward SDE → reverse-time SDE → score function

Forward SDE (data → noise) The forward diffusion process gradually corrupts clean data x₀ into pure noise x_T:

dx = f(x, t) dt + g(t) dW

Common choices in 2026:

  • Variance-Preserving (VP, DDPM style): f(x,t) = -½ β(t) x, g(t) = √β(t)

  • Variance-Exploding (VE): f(x,t) = 0, g(t) = √(dσ²(t)/dt)

Reverse-time SDE (noise → data) Anderson (1982) showed that the reverse process has the same diffusion coefficient g(t) but adjusted drift:

dx = [f(x,t) - g(t)² ∇_x log p_t(x)] dt + g(t) dW_backward

Score function s(x,t) = ∇_x log p_t(x) This is the key quantity we learn: it points toward high-density regions at noise level t.

Training objective Denoising score matching (equivalent to the diffusion loss): L(θ) = E_{t,x_0,ε} [ || s_θ(x_t,t) + ε / σ_t ||² ], where σ_t is the standard deviation of the noise added up to time t (for VP, σ_t = √(1−ᾱ_t)) → The model s_θ learns to predict the direction that removes the noise.

Numerical example – VP forward/reverse
x₀ = 1 (1D data point), linear schedule β(t) = 0.04 t
At t = 0.5: ᾱ = exp(−∫₀^0.5 β(s) ds) = exp(−0.005) ≈ 0.995, so √ᾱ ≈ 0.998 and √(1−ᾱ) ≈ 0.071
x_{0.5} ≈ 0.998 + 0.071 ε
Score s = −(x_{0.5} − 0.998)/0.005 ≈ −14 ε
Reverse drift = f − g² s = −½ β x + 0.02 · 14 ε ≈ −0.01 x + 0.28 ε
→ Strong pull back toward the original x₀.
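As a sanity check on this arithmetic: for a linear schedule β(t) = b₁·t, the closed form is ᾱ(t) = exp(−b₁ t²/2). Taking b₁ = 0.04 as an illustrative value:

```python
import math

# Linear schedule beta(t) = b1 * t  →  alpha_bar(t) = exp(-b1 * t**2 / 2)
b1, t = 0.04, 0.5
alpha_bar = math.exp(-b1 * t ** 2 / 2)    # exp(-0.005)
signal = math.sqrt(alpha_bar)             # coefficient of x0
noise_std = math.sqrt(1 - alpha_bar)      # std of the added noise

print(round(alpha_bar, 4), round(signal, 4), round(noise_std, 4))
# → 0.995 0.9975 0.0706
```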

Analogy Forward SDE = slowly dissolving sugar in water (data → noise) Reverse SDE = magically reassembling sugar crystal from solution (noise → data) Score function = force field that guides molecules back to crystal positions.

9.2 Numerical solvers: Euler–Maruyama, Heun, predictor-corrector samplers

Sampling from the reverse SDE requires discretizing the continuous-time equation.

Euler–Maruyama (first-order, simplest)
x_{t−Δt} ≈ x_t − [f(x_t,t) − g(t)² s_θ(x_t,t)] Δt + g(t) √Δt Z, with Z ~ 𝒩(0,I)
(the minus sign on the drift appears because the reverse SDE is integrated backward in time)

Heun’s method (second-order predictor-corrector) Predictor: x̂ = x_t + drift Δt + diffusion √Δt Z Corrector: average drift at x_t and x̂ → more accurate

Predictor-Corrector sampler (Song et al. 2021) Predictor: one Euler–Maruyama step Corrector: multiple Langevin MCMC steps (score-based gradient ascent) → Combines fast prediction with refinement

Numerical comparison (typical FID on CIFAR-10 32×32, 2026 benchmarks)

  • Euler–Maruyama (50 steps): FID ≈ 4–6

  • Heun / PC sampler (20–30 steps): FID ≈ 3–4

  • DPM-Solver / UniPC (10–15 steps): FID ≈ 2.5–3.5

Analogy Euler–Maruyama = basic forward Euler integration (fast but inaccurate) Heun / PC = Runge–Kutta style (better accuracy per step) → Fewer steps needed for same quality
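The solver comparison can be felt on a toy SDE with a known answer. The stdlib sketch below (not from the text) runs Euler–Maruyama on the Ornstein–Uhlenbeck process dx = −x dt + √2 dW, whose stationary law is 𝒩(0, 1):

```python
import random, math

random.seed(5)

# Euler–Maruyama on dx = -x dt + sqrt(2) dW; stationary distribution is N(0, 1)
N, STEPS, dt = 1000, 1000, 0.01
xs = [0.0] * N
for _ in range(STEPS):
    xs = [x + (-x) * dt + math.sqrt(2 * dt) * random.gauss(0.0, 1.0) for x in xs]

var = sum(x * x for x in xs) / N
print(round(var, 2))   # ≈ 1, with a small O(dt) discretization bias
```

The small residual bias (the exact Euler–Maruyama stationary variance here is 1/(1 − dt/2)) is precisely what higher-order schemes such as Heun or predictor-corrector samplers shrink per step.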

9.3 Adaptive step-size solvers (DPM-Solver, DEIS, UniPC)

DPM-Solver (Lu et al. 2022–2023 → DPM-Solver++ 2023) Analytic multi-step solver for VP/VE SDEs → exact solution under linear assumption → very accurate at large steps

DEIS (Diffusion Exponential Integrator Sampler) Exponential integrator + adaptive step-size → fewer steps than DPM-Solver

UniPC (Universal Predictor-Corrector, 2023–2024 → dominant in 2026) Unified framework combining predictor-corrector + multi-step solvers → state-of-the-art speed/quality trade-off

Numerical example (typical 2026 benchmarks)

  • DDIM / Euler (50 steps): FID ≈ 4.0

  • DPM-Solver++ (15 steps): FID ≈ 3.2

  • UniPC (8 steps): FID ≈ 3.4–3.8 → 6× faster sampling with almost no quality drop

2026 practice UniPC + LCM-LoRA / SDXL Turbo → 1–4 step generation on consumer GPUs Used in production for real-time image/video editing

9.4 Connection to optimal control and Schrödinger bridge

Stochastic optimal control view Diffusion sampling = solving a stochastic control problem Minimize cost functional: E[ ∫ L(x,u,t) dt + terminal cost ] where u(t) = control (drift adjustment), L = regularization on control effort

Schrödinger bridge (1930s, rediscovered 2022–2026) Find most likely stochastic path from noise distribution q_T to data distribution p_0 Equivalent to stochastic optimal control with fixed marginals

Recent breakthrough Rectified flow, flow-matching, and stochastic interpolants are approximations of Schrödinger bridge solutions → Deterministic paths → faster, more stable sampling

Numerical insight Schrödinger bridge between 𝒩(0,I) and data distribution → optimal transport-like paths Flow-matching directly regresses to these optimal velocities → fewer steps needed

AI connection 2025–2026 models (Flow Matching, Rectified Flow, Consistency Trajectory Models) are essentially discretized Schrödinger bridges → unify diffusion and flow-based generation.

9.5 Stochastic optimal control interpretation of diffusion sampling

Full optimal control formulation Sampling reverse SDE = minimizing KL divergence between forward and reverse paths Equivalent to stochastic control:

  • State = x(t)

  • Control = drift adjustment - (1/2) g² ∇ log p

  • Cost = KL divergence to data distribution at t=0

Practical impact

  • Guidance as control: classifier guidance = extra drift term toward class condition

  • CFG (classifier-free guidance) = learned control that amplifies prompt direction

  • Reward-weighted sampling = change cost functional to include external reward (RL fine-tuning of diffusion)

Numerical example – CFG as control Base drift = - (1/2) β(t) x + score term Guidance adds w × (score_conditional - score_unconditional) w = 7.5 → strong control toward prompt → sharper, more faithful samples
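A minimal 1-D sketch of the guidance arithmetic, using hypothetical Gaussian scores (𝒩(0,1) unconditional, 𝒩(3,1) conditional; not any real model's score network):

```python
# Toy 1-D scores: unconditional N(0,1), conditional-on-class N(3,1)
def s_uncond(x):
    return -x            # ∇ log N(x; 0, 1)

def s_cond(x):
    return -(x - 3.0)    # ∇ log N(x; 3, 1)

def s_guided(x, w):
    # classifier-free guidance: amplify the conditional direction as extra drift
    return s_uncond(x) + w * (s_cond(x) - s_uncond(x))

# The guided score vanishes at x = 3w: guidance shifts the effective
# target toward (and, for w > 1, beyond) the condition
for w in (0.0, 1.0, 7.5):
    mode = 3.0 * w       # root of -x + 3w = 0
    print(w, mode)       # 0 → 0, 1 → 3, 7.5 → 22.5
```

For w > 1 the guided distribution is pushed past the conditional mean, which is why large CFG scales sharpen prompt adherence at the cost of diversity.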

2026 frontier Diffusion models are now routinely fine-tuned with RL objectives (reward-weighted sampling, PPO-style) → stochastic optimal control lens explains why they align so well with human preferences.

This section shows how SDEs are not just a mathematical curiosity — they are the active engine behind every major generative breakthrough in 2026. The next sections cover implementation, case studies, challenges, and future directions.

10. Practical Implementation Tools and Libraries (2026 Perspective)

In March 2026, the Python ecosystem for diffusion models, score-based generation, SDEs, and stochastic processes is extremely mature. Most production-grade models (Stable Diffusion 3.5, Flux.1, SDXL Turbo, LCM-LoRA, AuraFlow, consistency-based generators) are built using a small set of battle-tested libraries.

This section covers the essential tools, their current status, quick-start code, and five hands-on mini-projects you can run today (all Colab-friendly).

10.1 Diffusion frameworks: Diffusers (Hugging Face), score_sde, OpenAI guided-diffusion

Hugging Face Diffusers (the de-facto industry standard in 2026)

  • Repository: https://github.com/huggingface/diffusers

  • Current version: ≥ 0.32.x

  • Install: pip install diffusers[torch] accelerate transformers

  • Supports: DDPM, DDIM, PNDM, LCM, Consistency Models, Stable Diffusion 1–3.5, Flux.1, SDXL, ControlNet, IP-Adapter, LoRA, textual inversion, etc.

  • Features: GPU-accelerated, ONNX export, torch.compile support, fast inference, community pipelines

Quick-start example – generate image with Flux.1 (flow-matching)

Python

from diffusers import FluxPipeline
import torch

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # save VRAM

prompt = "A cyberpunk city at night with neon lights and flying cars, ultra detailed, cinematic"
image = pipe(
    prompt,
    num_inference_steps=20,
    guidance_scale=3.5,
    generator=torch.Generator("cuda").manual_seed(42)
).images[0]
image.save("cyberpunk_flux.png")

score_sde (Song et al. reference implementation – research favorite)

  • Repository: https://github.com/yang-song/score_sde

  • Still the gold-standard codebase for score-based generative modeling research

  • Supports VE, VP, sub-VP, NCSN++ architectures, continuous-time SDEs

  • Great for custom experiments (e.g., manifold diffusion, new samplers)

OpenAI guided-diffusion (legacy but educational)

2026 recommendation → Use Diffusers for 95% of practical work (production, prototyping, fine-tuning) → Use score_sde when you need full control over SDE formulation or score-matching loss

10.2 SDE solvers: torchdiffeq, torchsde, jaxdiff

torchdiffeq (PyTorch ODE/SDE solvers)

torchsde (dedicated PyTorch SDE solver)

jaxdiff / diffrax (JAX ecosystem – fastest for large-scale research in 2026)

Quick torchsde example – reverse SDE sampling

Python

import torch
import torchsde

class ReverseSDE(torch.nn.Module):
    noise_type = "diagonal"   # required attribute for torchsde
    sde_type = "ito"          # required attribute for torchsde

    # Integrate in s = 1 - t, since torchsde requires strictly increasing ts;
    # drift_net and diffusion_net are the learned drift/diffusion models
    def f(self, s, y):
        return -drift_net(y, 1.0 - s)       # reverse-time drift (sign flips with s = 1 - t)

    def g(self, s, y):
        return diffusion_net(y, 1.0 - s)    # diffusion coefficient

sde = ReverseSDE().cuda()
# torchsde expects state shape (batch, state_size), so flatten image tensors
y0 = torch.randn(64, 3 * 64 * 64).cuda()    # batch of noise images (flattened)
ts = torch.linspace(0.0, 1.0, 50).cuda()    # s = 1 - t runs forward in time
ys = torchsde.sdeint(sde, y0, ts, method="euler")
generated = ys[-1].reshape(64, 3, 64, 64)   # final samples at t = 0

10.3 Manifold diffusion: GeoDiff, Riemannian Score Matching libraries

GeoDiff (2022–2023, still widely cited)

Riemannian Score Matching & GeoScore (2023–2026 extensions)

Quick usage pattern (using Geomstats + custom score model)

Python

from geomstats.geometry.hypersphere import Hypersphere

manifold = Hypersphere(dim=2)   # S² example
# score_model = YourScoreNet()  # learns ∇ log p_t in the tangent space
# Forward: spherical Brownian motion
# Reverse: sample with Riemannian Euler–Maruyama + the learned score

2026 note Manifold diffusion is now standard for 3D molecules (RFdiffusion, Chroma), directional images (spherical diffusion), and hierarchical graphs (hyperbolic diffusion).

10.4 Fast sampling: Consistency Models, Latent Consistency Models (LCM), SDXL Turbo

Consistency Models (Song et al. 2023)

  • Train model to predict x₀ directly from any noisy x_t

  • One-step or few-step generation after distillation

Latent Consistency Models (LCM) (Luo et al. 2023–2024)

  • Distilled version of SDXL → 4–8 step generation in latent space

  • LCM-LoRA: plug-and-play adapter for any SD checkpoint

SDXL Turbo (Stability AI 2023–2024)

  • Adversarial diffusion distillation → 1–4 step generation

  • CFG scale = 0 (adversarial training removes need for guidance)

Quick LCM-LoRA usage (Diffusers)

Python

import torch
from diffusers import DiffusionPipeline, LCMScheduler

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)  # LCM needs its scheduler
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")
pipe.to("cuda")

image = pipe(
    "A cyberpunk city at night with flying cars and neon lights, ultra detailed",
    num_inference_steps=4,
    guidance_scale=0.0,
    generator=torch.manual_seed(42)
).images[0]

2026 status

  • LCM-LoRA + SDXL Turbo → real-time generation on RTX 40-series / mobile GPUs

  • Consistency distillation is now default in most consumer tools

10.5 Mini-project suggestions

  1. Beginner: DDPM from scratch (1D toy data)

    • Dataset: 1D mixture of Gaussians

    • Implement forward noise addition + reverse denoising (score network = MLP)

    • Train denoising objective → sample new points from noise

  2. Intermediate: Score-matching toy model (2D)

    • Use torchsde + simple MLP score network

    • Train on 2D Swiss-roll or 2D Gaussian blobs

    • Sample with Euler–Maruyama vs Heun vs DPM-Solver

  3. Intermediate–Advanced: Latent diffusion fine-tuning

    • Start with SD 1.5 or SDXL base

    • Fine-tune with LoRA on custom dataset (e.g., your own photos or style)

    • Add LCM-LoRA distillation for 4-step fast inference

  4. Advanced: Manifold diffusion on torus

    • Use Geomstats + custom score model

    • Generate periodic signals or 2D torus embeddings

    • Compare Euclidean vs Riemannian diffusion quality

  5. Advanced: Flow-matching from scratch

    • Implement rectified flow or conditional flow-matching

    • Train on CIFAR-10 or small molecule dataset

    • Compare 1-step vs multi-step sampling quality and speed

All projects are runnable on Colab (free tier sufficient for toy versions; Pro for larger models).

This section gives you the exact tools and starting points used by researchers and companies building generative AI in 2026. You can now implement almost any modern diffusion pipeline from scratch or fine-tune production models.

11. Case Studies and Real-World Applications

This section shows how the stochastic processes and diffusion/SDE frameworks from earlier sections power production-grade AI systems in 2026. Each case highlights the specific stochastic technique used, why it outperforms alternatives, typical performance metrics, and the current leading models.

11.1 Image & video generation (Stable Diffusion 3, Sora-like models)

Problem Generate photorealistic or artistic images/videos from text prompts, with high fidelity, prompt adherence, diversity, and fast inference.

Stochastic process used Variance-preserving or variance-exploding diffusion SDEs + score matching + classifier-free guidance + consistency distillation / flow-matching acceleration.

Why diffusion/SDE wins

  • Autoregressive models (early DALL·E) → slow, left-to-right artifacts

  • GANs → mode collapse, training instability

  • Diffusion → stable training, excellent sample quality, natural diversity via stochastic sampling

Leading models in 2026

  • Stable Diffusion 3 Medium / SD3.5 (Stability AI): latent diffusion + rectified flow + CFG++

  • Flux.1 (Black Forest Labs): flow-matching + large-scale pretraining

  • Sora-like models (OpenAI Sora, Google Veo-2, Runway Gen-3, Luma Dream Machine, Kling): spatiotemporal latent diffusion + temporal consistency SDEs

  • Midjourney v7 / Imagen 4 (proprietary): hybrid diffusion + proprietary guidance

Performance highlights

  • ImageNet 256×256 FID: SD3 ≈ 2.1–2.5, Flux.1 ≈ 1.8–2.2 (state-of-the-art open models)

  • Video generation: 5–10 s clips at 720p in 10–30 inference steps (LCM/SDXL Turbo style)

  • Inference speed: 1–4 steps on consumer GPU (RTX 4090 / A100) → real-time preview

Key stochastic insight Reverse SDE sampling with CFG w=7–12 → strong prompt control Consistency distillation / LCM-LoRA → 1–4 step generation

11.2 Molecule & protein conformation generation (RFdiffusion, Chroma, FrameDiff)

Problem Generate valid 3D molecular conformations (small molecules, proteins) or design novel sequences with desired properties (binding affinity, stability).

Stochastic process used Riemannian / manifold diffusion (torsion angles on torus, SE(3) equivariant diffusion on 3D coordinates) + score matching on curved manifolds.

Why diffusion/SDE wins

  • Traditional force-field methods → slow, stuck in local minima

  • VAEs/GANs → invalid geometries, poor diversity

  • Diffusion → explores conformation space gradually → high validity, diversity, and energy stability

Leading models in 2026

  • RFdiffusion (Baker lab, 2022–2025 updates) → SE(3)-equivariant diffusion on protein backbones

  • Chroma (Generate Biomedicines) → discrete + continuous diffusion for full protein design

  • FrameDiff / FoldFlow → flow-matching on rigid frames + SE(3) equivariance

  • DiffDock / DiffLinker → diffusion for protein–ligand docking

Performance highlights

  • Protein design success rate: RFdiffusion variants → 40–70% designs fold correctly (AF2 validation)

  • Binding affinity (PDBBind): DiffDock → RMSD < 2 Å in 60–75% cases (vs 30–40% for traditional docking)

  • Conformation RMSD: FrameDiff → median 1.0–1.5 Å on GEOM-drugs benchmark

Key stochastic insight Manifold diffusion on torus (torsion angles) + SE(3) equivariance → respects bond constraints and rotational symmetry Score function learned in tangent space → valid, low-energy conformations

11.3 Time-series forecasting with diffusion (TimeDiff, CSDI)

Problem Forecast future values in multivariate time-series (weather, traffic, stock prices, sensor data) with uncertainty quantification.

Stochastic process used Diffusion on time-series (mask-and-denoise or forward noise corruption) + score matching for probabilistic forecasting.

Why diffusion/SDE wins

  • Classical ARIMA/LSTM → point forecasts, poor uncertainty

  • Gaussian processes → scale poorly to long sequences

  • Diffusion → full predictive distribution, handles missing data, captures multi-modal futures

Leading models in 2026

  • TimeDiff (2022–2024) → diffusion for deterministic & probabilistic forecasting

  • CSDI (Conditional Score-based Diffusion for Imputation) → imputation + forecasting

  • TimeGrad, ScoreGrad → score-based autoregressive hybrids

  • DiffTime / TSDiff → latent diffusion for long-horizon forecasting

Performance highlights

  • Electricity, Traffic, and ETT (ETTh, ETTm) benchmarks: MAE / CRPS improved 10–25% over Informer / Autoformer; uncertainty calibration (proper scoring rules) 15–30% better

Key stochastic insight Reverse diffusion generates multiple plausible futures → ensemble prediction without multiple model training

11.4 Audio & speech synthesis (AudioLDM 2, Grad-TTS variants)

Problem Generate high-fidelity speech (TTS), music, sound effects from text or conditioning.

Stochastic process used Latent diffusion in spectrogram/mel-spectrogram space + continuous-time SDE or flow-matching.

Why diffusion/SDE wins

  • WaveNet-style autoregressive → very slow inference

  • GANs → artifacts, instability

  • Diffusion → high perceptual quality, natural prosody variation, controllable via guidance

Leading models in 2026

  • AudioLDM 2 / Make-An-Audio → latent diffusion on CLAP embeddings

  • Grad-TTS / VALL-E X variants → diffusion + duration predictor

  • NaturalSpeech 3, VoiceCraft, Seed-TTS → hybrid diffusion + flow-matching

  • MusicGen / MusicLM successors → text-to-music diffusion

Performance highlights

  • TTS: MOS scores 4.4–4.7 (near human parity)

  • Inference speed: 1–5× real-time on GPU (after LCM-style distillation)

  • Zero-shot voice cloning: 90%+ speaker similarity in few-shot setting

Key stochastic insight Diffusion in latent mel-space + classifier-free guidance → natural prosody & emotion control

11.5 Stochastic optimal control & planning in robotics

Problem Plan trajectories for robots (arms, drones, legged robots) in uncertain environments with safety constraints.

Stochastic process used Model predictive control (MPC) + diffusion-based trajectory generation + stochastic optimal control (SOC) interpretation of diffusion sampling.

Why diffusion/SDE wins

  • Classical MPC → deterministic, brittle to uncertainty

  • RL → sample-inefficient, reward shaping hard

  • Diffusion → generate diverse, high-quality trajectory ensembles → robust planning

Leading approaches in 2026

  • Decision Diffuser / Diffuser (Janner et al. 2022–2025) → diffusion as policy prior

  • DiffMPC / Plan4MC → diffusion for model-predictive planning

  • Stochastic Control via Diffusion (2024–2026) → Schrödinger bridge for trajectory optimization

  • RoboDiffusion / Diffusion Policy → end-to-end diffusion policies for manipulation

Performance highlights

  • Block-stacking / dexterous manipulation: success rate 70–90% (vs 40–60% classical RL)

  • Drone navigation in wind: collision rate ↓ 30–50% with diffusion ensemble planning

Key stochastic insight Diffusion sampling = stochastic optimal control with KL-regularized cost → naturally produces smooth, diverse, uncertainty-aware plans

These case studies demonstrate that stochastic processes — especially diffusion SDEs — are no longer academic curiosities. They are the core technology driving the most impactful AI applications in 2026, from creative generation to scientific discovery and physical control.

