All my books are exclusively available on Amazon. The free notes/materials on globalcodemaster.com do NOT match even 1% with any of my published books. Similar topics ≠ same content. Books have full details, exercises, chapters & structure — website notes do not. No book content is shared here. We fully comply with Amazon policies.


Probability & Bayesian Inference in AI: Uncertainty Handling & Real-World Decisions

TABLE OF CONTENTS

0. Orientation & How to Use These Notes

0.1 Target Audience & Recommended Learning Pathways
0.2 Prerequisites (Probability Theory, Linear Algebra, Basic Machine Learning)
0.3 Notation & Mathematical Conventions
0.4 Why Bayesian Thinking Matters in Modern AI (2026 Landscape)
0.5 Version History & Update Log

1. Core Probability Foundations for AI

1.1 Probability Spaces, Random Variables & Distributions
  1.1.1 Discrete vs. Continuous vs. Mixed Distributions
  1.1.2 Expectation, Variance, Covariance, Correlation
  1.1.3 Common Distributions in AI (Bernoulli, Categorical, Gaussian, Poisson, Dirichlet, Beta, Gamma)
1.2 Information Theory Essentials
  1.2.1 Entropy, Cross-Entropy, KL Divergence
  1.2.2 Mutual Information & Conditional Entropy
  1.2.3 Perplexity & Bits-per-Character in Language Models
1.3 Concentration Inequalities & High-Probability Bounds
  1.3.1 Hoeffding, Bernstein, McDiarmid
  1.3.2 Sub-Gaussian & Sub-Exponential Random Variables
  1.3.3 Empirical Bernstein & Variance-Adaptive Bounds

2. Bayesian Inference Fundamentals

2.1 Bayes’ Theorem & Posterior Inference
  2.1.1 Prior, Likelihood, Posterior, Marginal Likelihood
  2.1.2 Conjugate Priors (Beta-Binomial, Dirichlet-Multinomial, Normal-Normal, Normal-Inverse-Wishart)
  2.1.3 MAP vs. Full Posterior vs. Predictive Distribution
2.2 Exact Inference Methods
  2.2.1 Sum-Product (Belief Propagation) on Trees & Polytrees
  2.2.2 Variable Elimination
  2.2.3 Junction Tree Algorithm
2.3 Approximate Inference
  2.3.1 Markov Chain Monte Carlo (Gibbs, Metropolis-Hastings, Hamiltonian Monte Carlo / NUTS)
  2.3.2 Variational Inference (Mean-Field, Structured, Black-Box VI)
  2.3.3 Importance Sampling & Sequential Monte Carlo (Particle Filters)

3. Bayesian Models in Machine Learning

3.1 Bayesian Linear & Generalized Linear Models
  3.1.1 Bayesian Linear Regression & Evidence Approximation
  3.1.2 Bayesian Logistic / Probit Regression
  3.1.3 Sparse Bayesian Learning & Automatic Relevance Determination (ARD)
3.2 Gaussian Processes & Nonparametric Bayes
  3.2.1 GP Regression & Classification
  3.2.2 Kernel Design & Deep Kernel Learning
  3.2.3 Scalable GPs (SVGP, Structured Kernel Interpolation, GPflow / GPyTorch)
3.3 Bayesian Neural Networks
  3.3.1 Variational Bayes & Bayes-by-Backprop
  3.3.2 Deep Ensembles, MC Dropout, SWAG, Laplace Approximation
  3.3.3 Last-Layer Bayesian & Probabilistic Backpropagation

4. Uncertainty Quantification in Deep Learning & LLMs

4.1 Aleatoric vs. Epistemic Uncertainty
  4.1.1 Predictive Distributions & Uncertainty Decomposition
  4.1.2 Calibration (Expected Calibration Error, Brier Score)
4.2 Modern LLM Uncertainty Methods
  4.2.1 Verbalized Confidence & Self-Evaluation
  4.2.2 Semantic Entropy & Cluster-based Uncertainty
  4.2.3 Conformal Prediction for Language Models
  4.2.4 Token-level & Sequence-level Uncertainty (P(True), min-p, entropy)
4.3 Uncertainty in Decision-Making & Safety
  4.3.1 Uncertainty-Aware Active Learning
  4.3.2 Out-of-Distribution Detection & Rejection Rules
  4.3.3 Risk-Averse Policies & Safe Exploration

5. Bayesian Decision Theory & Real-World Applications

5.1 Bayesian Decision Theory Basics
  5.1.1 Loss Functions & Bayes Risk
  5.1.2 Expected Utility Maximization
5.2 Bayesian Optimization
  5.2.1 Gaussian Process Surrogate + Acquisition Functions (EI, PI, UCB, Thompson Sampling)
  5.2.2 Scalable BO (TuRBO, SAAS-BO, Dragonfly)
  5.2.3 Hyperparameter Tuning & Neural Architecture Search
5.3 Bayesian Bandits & Reinforcement Learning
  5.3.1 Thompson Sampling & Bayesian UCB
  5.3.2 Posterior Sampling for RL (PSRL)
  5.3.3 Bayesian Deep RL & Uncertainty-Aware Exploration
5.4 Bayesian Methods in Large Language Models
  5.4.1 Bayesian Prompting & In-Context Learning
  5.4.2 Uncertainty-Guided Chain-of-Thought & Self-Refine
  5.4.3 Bayesian Model Selection for Prompt Ensembles

6. Advanced & Emerging Topics (2025–2026)

6.1 Probabilistic Circuits & Tractable Generative Models
6.2 Normalizing Flows & Continuous Normalizing Flows for Text
6.3 Diffusion Models & Score-Based Generative Modeling (Continuous & Discrete)
6.4 Bayesian Nonparametric Methods at Scale (Dirichlet Process, Hierarchical Pitman-Yor)
6.5 Robust Bayesian Inference & Misspecification
6.6 Open Problems & Thesis Directions

7. Tools, Libraries & Implementation Resources

7.1 Core Probabilistic Programming Frameworks
  7.1.1 Pyro, NumPyro, PyMC, Stan (CmdStanPy), TensorFlow Probability, Edward2
  7.1.2 Pyro + JAX (NumPyro) for GPU-accelerated inference
7.2 Uncertainty Quantification Libraries
  7.2.1 Uncertainty Wizard, Fortuna, TorchUncertainty, Laplace
  7.2.2 Semantic Entropy implementations & LLM calibration tools
7.3 Bayesian Optimization & Bandits
  7.3.1 Ax (Meta), BoTorch, Optuna (Bayesian samplers), SMAC3
  7.3.2 Thompson Sampling & Bayesian bandits libraries
7.4 Gaussian Process Libraries
  7.4.1 GPyTorch, GPflow, scikit-learn GPs, tinygp (JAX)
7.5 Evaluation & Benchmarking Suites
  7.5.1 Uncertainty Baselines, UCI datasets, GLUE-style uncertainty extensions

8. Assessments, Exercises & Projects

8.1 Conceptual & Proof-Based Questions
8.2 Coding Exercises (Bayesian linear regression, VAE for text, Thompson sampling)
8.3 Mini-Projects (Bayesian hyperparameter tuning, uncertainty-aware LLM rejection, Bayesian optimization loop)
8.4 Advanced / Thesis-Level Project Ideas


0. Orientation & How to Use These Notes

Welcome to Probability & Bayesian Inference in AI: Uncertainty Handling & Real-World Decisions — a rigorous, research-oriented resource updated for the 2026 AI landscape. This material bridges classical Bayesian statistics with modern deep learning, large language models, reinforcement learning, autonomous systems, and safety-critical applications where uncertainty quantification is no longer optional but essential.

The notes emphasize mathematical clarity, computational practicality, and real-world relevance — from principled decision-making under uncertainty to calibrating trillion-parameter LLMs and building safe autonomous agents.

0.1 Target Audience & Recommended Learning Pathways

Primary audiences

  • MSc / early PhD students. Goal: building strong probabilistic foundations for AI/ML research. Pathway: full sequential read 0 → 1 → 2 → 3 → 5 → 7 → 8 (exercises & projects)

  • Advanced undergraduates. Goal: gaining deeper understanding beyond black-box ML. Pathway: 0 → 1 → 2.1–2.2 → 3.1 → 4.1–4.2 → 7.1 (focus on core concepts & simple coding)

  • ML researchers & PhD candidates. Goal: working on uncertainty in LLMs, safe RL, Bayesian deep learning, or trustworthy AI. Pathway: 3 → 4 → 5 → 6 → 8 (frontiers) + selected proofs and advanced projects

  • Industry AI engineers (safety, reliability, autonomous systems). Goal: implementing calibrated models, uncertainty-aware agents, Bayesian optimization. Pathway: 0 → 2 → 4 → 5 → 7 (tools) → practical parts of 3 & 6

  • Professors / lecturers. Goal: lecture material, proofs, exercises, capstone / thesis ideas. Pathway: full read + 8.1–8.3 for assignments, 8.4 for thesis supervision

Suggested learning tracks (2026)

  • Fast practical track (3–6 months): 0 → 1 → 2.1–2.3 → 4 → 5 → 7 (tools & applications)

  • Research-oriented track (9–18 months): Full sequential + deep dives into 3, 6, papers from Appendix C

  • LLM / trustworthiness focus: 4 (uncertainty in deep learning & LLMs) → 5.4 → 6 → 8 (emerging topics)

  • Bayesian optimization & decision-making focus: 2 → 5.2 → 5.3 → selected parts of 3 & 7

0.2 Prerequisites

To benefit fully, readers should already be comfortable with:

Mathematics

  • Probability & statistics: random variables, expectation, variance, joint/conditional/marginal distributions, common distributions (Gaussian, Bernoulli, categorical, Beta, Dirichlet), law of large numbers, central limit theorem

  • Linear algebra: vectors, matrices, eigenvalues/eigenvectors, norms, basic SVD, matrix calculus (gradients, Hessians)

  • Multivariate calculus: partial derivatives, chain rule, gradient-based optimization intuition

Machine learning

  • Supervised learning basics (regression, classification)

  • Neural networks & backpropagation

  • Gradient descent variants (SGD, Adam)

  • Loss functions (cross-entropy, MSE)

  • Familiarity with deep learning frameworks (PyTorch or JAX preferred)

Nice-to-have (reviewed when needed)

  • Introductory Bayesian statistics (conjugate priors, posterior updates)

  • Information theory (entropy, KL divergence)

  • Basic reinforcement learning concepts (MDPs, value functions)

Recommended refreshers (free & concise, 2026 links)

  • Probability: “Probabilistic Machine Learning: Advanced Topics” (Murphy, 2023) – Chapters 1–4

  • Linear algebra & calculus: “Mathematics for Machine Learning” (Deisenroth et al., free PDF) – Chapters 2–6

  • Bayesian basics: “Pattern Recognition and Machine Learning” (Bishop, 2006) – Chapters 1–3

  • PyTorch: official tutorials (autograd, nn.Module, distributions)

0.3 Notation & Mathematical Conventions

Standard modern probabilistic ML notation (aligned with 2023–2026 papers) is used throughout.

  • Bold lowercase: vectors (x, μ, θ)
  • Bold uppercase: matrices (X, Σ, Θ)
  • Calligraphic: sets / distributions (𝒳 for the data space, 𝒩(μ, Σ), p(θ))
  • Blackboard bold: number fields (ℝ, ℕ)
  • 𝔼[·] or E[·]: expectation
  • ℙ(·) or P(·): probability
  • 𝟙{condition}: indicator function
  • Aᵀ: transpose
  • ⊙: Hadamard (elementwise) product
  • ~: distributed as (x ~ 𝒩(μ, Σ))
  • ∝: proportional to
  • ≜: defined as
  • ≈: approximately equal
  • log: natural logarithm (unless specified)

Derivations are step-by-step; proofs are complete but concise (references provided for deeper treatments).

0.4 Why Bayesian Thinking Matters in Modern AI (2026 Landscape)

In 2026, AI systems are deployed in high-stakes domains: autonomous driving, medical diagnosis, financial trading, legal reasoning, personalized medicine, robotics, and frontier LLMs used in education, law, and science. These applications demand:

  • Principled uncertainty quantification → avoid overconfident wrong answers

  • Safe exploration → prevent catastrophic failures in RL or agentic systems

  • Robustness to distribution shift → handle out-of-distribution inputs gracefully

  • Calibrated predictions → know when to abstain or seek human input

  • Data-efficient learning → incorporate priors when data is scarce or expensive

  • Interpretability & accountability → explain why a decision was made under uncertainty

  • Alignment & safety → prevent reward hacking and value drift in RLHF/RLAIF

Bayesian methods provide:

  • Coherent uncertainty propagation (epistemic + aleatoric)

  • Principled incorporation of prior knowledge

  • Formal decision theory under uncertainty (Bayes risk, expected utility)

  • Robustness to model misspecification

  • Scalable approximations (variational inference, Laplace, ensembles) that now work at LLM scale

2026 reality check: While deep ensembles and last-layer Bayesian methods are production-ready, full Bayesian inference on trillion-parameter models remains intractable. Hybrid approaches (Bayesian last-layer + deterministic backbone, conformal prediction, semantic entropy) dominate practical uncertainty handling in frontier systems.

0.5 Version History & Update Log

  • 1.0 (Feb 2025): Initial release: Sections 0–2, core probability & Bayesian inference basics
  • 1.1 (Jun 2025): Added Section 3 (Bayesian models in ML), uncertainty in LLMs, Bayesian optimization
  • 1.2 (Oct 2025): Section 4 (LLM-specific uncertainty), diffusion models, modern Bayesian deep learning
  • 1.3 (Jan 2026): 2026 frontier: semantic entropy, conformal prediction for language, RLAIF uncertainty
  • 1.4 (Mar 2026): Current version: new exercises, Grok-4 / Gemini 2.5 uncertainty references, updated tools

This is a living document — updated roughly quarterly as new uncertainty quantification techniques, calibration methods, and safety benchmarks emerge.

1. Core Probability Foundations for AI

Probability is the mathematical language of uncertainty — the single most important tool for building reliable, safe, and calibrated AI systems in 2026. Whether you are quantifying epistemic uncertainty in trillion-parameter LLMs, designing safe exploration policies in reinforcement learning, performing Bayesian optimization for hyperparameter tuning, or detecting out-of-distribution inputs in autonomous driving, everything rests on a solid understanding of probability spaces, random variables, distributions, information measures, and concentration phenomena.

This section reviews the essential probabilistic toolkit that appears repeatedly throughout the rest of the notes.

1.1 Probability Spaces, Random Variables & Distributions

1.1.1 Discrete vs. Continuous vs. Mixed Distributions

A probability space is formally defined by the triple (Ω, ℱ, ℙ), where:

  • Ω is the sample space (set of all possible outcomes),

  • ℱ is a σ-algebra (collection of measurable events),

  • ℙ is a probability measure (ℙ: ℱ → [0,1] with countable additivity and normalization).

A random variable X is a measurable function X: Ω → ℝ (or ℝ^k, or more general spaces).

Discrete distributions
Support is countable (finite or countably infinite). Probability mass function (pmf): p(x) = ℙ(X = x), with ∑ p(x) = 1.

Continuous distributions
The distribution admits a density with respect to Lebesgue measure on ℝ^d. Probability density function (pdf): f(x) such that ℙ(a ≤ X ≤ b) = ∫_a^b f(x) dx, with ∫ f(x) dx = 1.

Mixed distributions
Combine discrete and continuous parts (e.g., point masses + density). Common in AI: censored data, zero-inflated models, mixture models with Dirac deltas.

2026 relevance:

  • Discrete: token distributions in LLMs, categorical action spaces in RL

  • Continuous: latent variables in VAEs, embeddings, Gaussian processes

  • Mixed: many real-world datasets (count data with excess zeros, survival analysis)

1.1.2 Expectation, Variance, Covariance, Correlation

Expectation (mean)
For discrete X: 𝔼[X] = ∑ x p(x). For continuous X: 𝔼[X] = ∫ x f(x) dx.
Linearity: 𝔼[aX + bY + c] = a𝔼[X] + b𝔼[Y] + c (always; no independence required).

Variance
Var(X) = 𝔼[(X – 𝔼[X])²] = 𝔼[X²] – (𝔼[X])². Standard deviation σ = √Var(X).

Covariance
Cov(X, Y) = 𝔼[(X – 𝔼[X])(Y – 𝔼[Y])] = 𝔼[XY] – 𝔼[X]𝔼[Y]

Correlation
ρ(X, Y) = Cov(X, Y) / (σ_X σ_Y) ∈ [–1, 1]. Measures linear dependence (not causation or non-linear association).
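These identities are easy to check numerically. A minimal NumPy sketch on synthetic data (the variable names and the toy model are illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.5, size=100_000)            # X ~ N(2, 1.5^2)
y = 0.5 * x + rng.normal(0.0, 1.0, size=x.size)   # Y depends linearly on X

# Var(X) two ways: definition vs. the identity E[X^2] - (E[X])^2
var_direct = np.mean((x - x.mean()) ** 2)
var_identity = np.mean(x ** 2) - x.mean() ** 2

# Cov(X, Y) via the identity E[XY] - E[X]E[Y]
cov_identity = np.mean(x * y) - x.mean() * y.mean()

# Correlation, always in [-1, 1]
rho = cov_identity / (x.std() * y.std())
```

For this model the true correlation is ρ = 0.5·Var(X) / (σ_X σ_Y) = 1.125 / (1.5 · 1.25) = 0.6, and the sample estimate lands close to it.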

Key AI uses:

  • Variance appears in stochastic gradient noise analysis

  • Covariance matrices in multivariate Gaussians (covariance estimation, Gaussian processes)

  • Correlation in feature selection and redundancy detection

1.1.3 Common Distributions in AI (Bernoulli, Categorical, Gaussian, Poisson, Dirichlet, Beta, Gamma)

Bernoulli(p)
Binary outcome: X ∈ {0, 1}, P(X = 1) = p. Used for: binary classification, success/failure events, coin flips.

Categorical(p₁, …, p_K)
Generalization of Bernoulli to K classes: X ∈ {1, …, K}, P(X = k) = p_k, ∑ p_k = 1. Used for: multi-class classification, token prediction in LLMs.

Multivariate Gaussian 𝒩(μ, Σ)
Continuous vector-valued: density ∝ exp(–½ (x – μ)ᵀ Σ⁻¹ (x – μ)). Used for: latent variables, embeddings, Gaussian processes, uncertainty in regression.

Poisson(λ)
Discrete count: P(X = k) = e^{–λ} λ^k / k!. Used for: count data (clicks, events), rate modeling.

Dirichlet(α₁, …, α_K)
Distribution over the probability simplex: density ∝ ∏ p_i^{α_i – 1}, ∑ p_i = 1. Used for: priors over categorical distributions, topic models (LDA), mixture weights.

Beta(α, β)
Distribution on [0, 1]: density ∝ p^{α–1} (1 – p)^{β–1}. Used for: conjugate prior for the Bernoulli, success-probability modeling.

Gamma(α, β) (shape–rate parameterization)
Positive continuous: density ∝ x^{α–1} e^{–βx}. Used for: conjugate prior for the Poisson rate, precision in Gaussians (inverse-gamma for variance).

2026 quick reference:

  • Categorical + Dirichlet → token distributions & topic priors in LLMs

  • Gaussian → embeddings, latents, GP surrogates in BO

  • Beta/Dirichlet → Bayesian nonparametrics & hierarchical priors

  • Poisson/Gamma → count data & rate processes in recommender systems, RL

1.2 Information Theory Essentials

Information theory quantifies uncertainty, information gain, and divergence — foundational for loss functions, compression, generative modeling, and uncertainty measurement.

1.2.1 Entropy, Cross-Entropy, KL Divergence

Shannon entropy
H(X) = – ∑ p(x) log p(x) (bits if log₂, nats if ln). Measures average surprise / uncertainty.

Cross-entropy
H(p, q) = – ∑ p(x) log q(x) = H(p) + D_KL(p || q)

KL divergence
D_KL(p || q) = ∑ p(x) log (p(x)/q(x)) ≥ 0
Asymmetric: measures the extra bits (or nats) needed when using q to encode samples from p. In ML, the training objective is usually minimizing cross-entropy (negative log-likelihood).
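The decomposition H(p, q) = H(p) + D_KL(p || q) can be verified directly on a small discrete distribution. A minimal sketch in NumPy (working in nats; the helper names are my own):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats: H(p) = -sum p log p."""
    p = np.asarray(p)
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """H(p, q) = -sum p log q."""
    return -np.sum(np.asarray(p) * np.log(q))

def kl(p, q):
    """D_KL(p || q) = sum p log(p / q), nonnegative."""
    p, q = np.asarray(p), np.asarray(q)
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
```

Note the asymmetry: kl(p, q) and kl(q, p) generally differ, which is why forward and reverse KL lead to different variational approximations later in Section 2.3.2.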

1.2.2 Mutual Information & Conditional Entropy

Mutual information
I(X; Y) = H(X) – H(X|Y) = H(Y) – H(Y|X) = D_KL(p(x, y) || p(x) p(y))
Quantifies the information shared between variables.

Conditional entropy
H(X|Y) = H(X, Y) – H(Y). Average remaining uncertainty in X once Y is known.

AI applications:

  • Mutual information maximization → contrastive learning (CLIP, SimCLR)

  • Conditional entropy → uncertainty decomposition in LLMs

  • I(X;Y) → feature selection, disentanglement in VAEs

1.2.3 Perplexity & Bits-per-Character in Language Models

Perplexity
PPL(q) = 2^{H(p,q)} when the cross-entropy is measured in bits, or exp(H(p,q)) when it is measured in nats.
Exponential of the average per-token cross-entropy → effective branching factor.

Bits-per-character / Bits-per-byte
H(p) in bits/char → theoretical compression limit. English text ≈ 1–1.5 bits/char (human-level); modern LLMs approach ~0.8–1.2 bpc on diverse corpora.

2026 note: Perplexity saturates quickly on large models → downstream reasoning & human judgments increasingly important.
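As a small sketch of the definition (the function name and the toy log-probabilities are hypothetical), perplexity is just the exponential of the average negative log-likelihood per token:

```python
import numpy as np

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood (natural log)."""
    nll = -np.mean(token_logprobs)
    return float(np.exp(nll))

# Hypothetical per-token log-probabilities from a language model:
# a model that is uniform over 4 candidate tokens at every step
logprobs_uniform = np.log([0.25, 0.25, 0.25, 0.25])

# a more confident model assigns higher probability to the true tokens
logprobs_confident = np.log([0.9, 0.8, 0.95, 0.85])
```

The uniform case gives perplexity exactly 4, matching the "effective branching factor" reading: the model is as uncertain as a fair 4-way choice at each token.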

1.3 Concentration Inequalities & High-Probability Bounds

Concentration inequalities bound how much a random variable deviates from its mean — essential for generalization theory, SGD analysis, and high-probability guarantees.

1.3.1 Hoeffding, Bernstein, McDiarmid

Hoeffding’s inequality (bounded variables)
For independent X_i ∈ [a_i, b_i] and S = ∑ X_i:
ℙ(|S – 𝔼[S]| ≥ t) ≤ 2 exp( –2t² / ∑ (b_i – a_i)² )

Bernstein’s inequality (sub-exponential tails)
Tighter when the variance is small: the bound involves both the range and the variance σ².

McDiarmid’s inequality (bounded differences)
If changing one data point changes a function f by at most c_i:
ℙ(|f – 𝔼[f]| ≥ t) ≤ 2 exp( –2t² / ∑ c_i² )
Very useful for uniform convergence over hypothesis classes.

1.3.2 Sub-Gaussian & Sub-Exponential Random Variables

Sub-Gaussian
X is sub-Gaussian with parameter σ² if 𝔼[exp(λ(X – 𝔼X))] ≤ exp(λ² σ² / 2) for all λ ∈ ℝ. Tails decay at least as fast as a Gaussian’s.

Sub-exponential (heavier tails)
X is sub-exponential with parameters (ν², b) if 𝔼[exp(λ(X – 𝔼X))] ≤ exp(λ² ν² / 2) for |λ| ≤ 1/b.

Most gradient noise in deep learning is empirically sub-exponential → Bernstein-type bounds preferred over Hoeffding in SGD theory.

1.3.3 Empirical Bernstein & Variance-Adaptive Bounds

Empirical Bernstein (Maurer & Pontil 2009; refined in the 2020s)
Uses the empirical variance instead of the worst-case range → tighter generalization guarantees when the variance is small.

2026 relevance:

  • Sharpness-aware minimization (SAM) & generalization bounds

  • High-probability convergence rates for stochastic optimization

  • Uncertainty quantification via empirical Bernstein confidence intervals

2. Bayesian Inference Fundamentals

Bayesian inference is the process of updating beliefs about unknown parameters or latent variables in light of observed data, using probability as the language of uncertainty. In contrast to frequentist approaches (which treat parameters as fixed but unknown), Bayesian methods treat parameters as random variables with probability distributions — priors before seeing data, posteriors after.

This section introduces the core mechanics of Bayesian reasoning, exact inference techniques for tractable models, and the most widely used approximate methods that scale to modern AI problems (including Bayesian neural networks, LLMs, and reinforcement learning agents in 2026).

2.1 Bayes’ Theorem & Posterior Inference

2.1.1 Prior, Likelihood, Posterior, Marginal Likelihood

Bayes’ theorem is the cornerstone:

P(θ | D) = [P(D | θ) P(θ)] / P(D)

where:

  • θ = parameters / latent variables

  • D = observed data

  • P(θ) = prior distribution (belief about θ before seeing D)

  • P(D | θ) = likelihood (how well θ explains D)

  • P(θ | D) = posterior distribution (updated belief after seeing D)

  • P(D) = marginal likelihood (or evidence) = ∫ P(D | θ) P(θ) dθ

The marginal likelihood normalizes the posterior and is often intractable — this is why approximate inference is central to Bayesian ML.

Key intuitions:

  • Strong prior + weak data → posterior close to prior

  • Weak prior + strong data → posterior dominated by likelihood

  • Marginal likelihood acts as an Occam’s razor: simpler models (more concentrated priors) are preferred when they explain data well.

2.1.2 Conjugate Priors (Beta-Binomial, Dirichlet-Multinomial, Normal-Normal, Normal-Inverse-Wishart)

A prior is conjugate to a likelihood if the posterior belongs to the same family as the prior — enabling closed-form updates.

Beta-Binomial
Likelihood: Bernoulli / Binomial (binary outcomes or success counts). Prior: Beta(α, β). Posterior: Beta(α + successes, β + failures).

Dirichlet-Multinomial
Likelihood: Categorical / Multinomial (K classes). Prior: Dirichlet(α₁, …, α_K). Posterior: Dirichlet(α₁ + counts₁, …, α_K + counts_K).

Normal-Normal (known variance)
Likelihood: 𝒩(x | μ, σ²). Prior: μ ~ 𝒩(μ₀, τ₀²). Posterior: 𝒩(μₙ, τₙ²) with a closed-form precision-weighted update.

Normal-Inverse-Gamma / Normal-Inverse-Wishart
For unknown mean and variance: Normal-Inverse-Gamma in the univariate case, Normal-Inverse-Wishart in the multivariate case. Equivalently, a Gamma (or Wishart) prior on the precision (inverse variance); the posterior stays in the same family.
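The Beta-Binomial update is a one-liner, which is exactly why it powers Thompson sampling in bandits. A minimal sketch (function name and numbers are illustrative):

```python
import numpy as np

def beta_binomial_update(alpha, beta, successes, failures):
    """Conjugate update: Beta(alpha, beta) prior + Bernoulli/Binomial data
    -> Beta(alpha + successes, beta + failures) posterior."""
    return alpha + successes, beta + failures

# Prior Beta(2, 2); observe 7 successes and 3 failures
a_post, b_post = beta_binomial_update(2.0, 2.0, 7, 3)

# Posterior mean E[p | D] = a / (a + b): shrinks the raw 7/10 toward the prior
posterior_mean = a_post / (a_post + b_post)

# Thompson sampling flavor: draw a plausible p from the posterior
p_sample = np.random.default_rng(0).beta(a_post, b_post)
```

Note the posterior mean 9/14 ≈ 0.643 sits between the prior mean 0.5 and the empirical rate 0.7, with more data pulling it toward the likelihood, exactly the "strong prior vs. strong data" intuition from 2.1.1.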

2026 relevance:

  • Beta-Binomial → Thompson sampling in bandits

  • Dirichlet-Multinomial → topic models, mixture weights

  • Normal priors → Bayesian linear regression, Gaussian processes

2.1.3 MAP vs. Full Posterior vs. Predictive Distribution

Maximum A Posteriori (MAP)
θ_MAP = argmax_θ P(θ | D) = argmax_θ [log P(D | θ) + log P(θ)] = argmin_θ [–log P(D | θ) – log P(θ)]
→ regularized maximum likelihood (–log P(θ) is the penalty term)

Full posterior
P(θ | D) captures the entire uncertainty → ideal but usually intractable.

Posterior predictive distribution
P(x_new | D) = ∫ P(x_new | θ) P(θ | D) dθ
Averages predictions over posterior uncertainty → better calibrated than point estimates.
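For the Beta-Bernoulli model the predictive integral has a closed form, which makes the MAP-vs-predictive distinction concrete. A sketch (the posterior parameters are illustrative, not from the notes):

```python
import numpy as np

# Beta(a, b) posterior over a Bernoulli parameter p
a, b = 9.0, 5.0

# MAP point estimate: mode of the Beta (valid for a, b > 1)
p_map = (a - 1) / (a + b - 2)

# Posterior predictive: P(x_new = 1 | D) = integral p * Beta(p | a, b) dp = a / (a + b)
p_pred = a / (a + b)

# Monte Carlo check of the predictive integral
rng = np.random.default_rng(0)
p_mc = rng.beta(a, b, size=200_000).mean()
```

The two differ (8/12 ≈ 0.667 vs. 9/14 ≈ 0.643): the predictive averages over the whole posterior rather than plugging in its mode, which is why it tends to be better calibrated.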

2026 practice:

  • MAP → fast baseline (e.g., L2 regularization = Gaussian prior)

  • Full posterior → variational / MCMC / Laplace approximation

  • Predictive distribution → uncertainty quantification in safety-critical AI

2.2 Exact Inference Methods

Exact inference computes P(θ | D) or marginals exactly — only feasible on models with tree-like or low-treewidth structure.

2.2.1 Sum-Product (Belief Propagation) on Trees & Polytrees

Sum-product algorithm (Pearl 1988; Kschischang et al. 2001) Message-passing on factor graphs or junction trees:

  • Sum messages over variables (marginalization)

  • Product messages over factors (multiplication of potentials)

On trees/polytrees → exact marginals (and, via the max-product variant, exact MAP assignments) in linear time. On general graphs → loopy belief propagation (approximate).

Used in: Bayesian networks, HMMs, conditional random fields with tree structure.

2.2.2 Variable Elimination

Variable elimination (Zhang & Poole 1996) Eliminates variables one by one by summing (or maxing) over them:

P(x₁, x₂) = ∑_{x₃} ⋯ ∑_{x_n} P(x₁, …, x_n)

Order of elimination dramatically affects computational cost (NP-hard to find optimal order).

Used in: small-to-medium Bayesian networks, exact inference benchmarks.
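On a chain-structured network the elimination order is obvious and the sums become matrix products. A minimal sketch on a binary chain X1 → X2 → X3 (the CPT numbers are made up for illustration):

```python
import numpy as np

# P(x1, x2, x3) = P(x1) P(x2 | x1) P(x3 | x2), all variables binary
p_x1 = np.array([0.6, 0.4])
p_x2_given_x1 = np.array([[0.9, 0.1],    # rows index x1, columns x2
                          [0.3, 0.7]])
p_x3_given_x2 = np.array([[0.8, 0.2],    # rows index x2, columns x3
                          [0.5, 0.5]])

# Eliminate x1: phi(x2) = sum_x1 P(x1) P(x2 | x1)
phi_x2 = p_x1 @ p_x2_given_x1
# Eliminate x2: P(x3) = sum_x2 phi(x2) P(x3 | x2)
p_x3 = phi_x2 @ p_x3_given_x2

# Brute-force check against the full joint (feasible only for tiny models)
joint = p_x1[:, None, None] * p_x2_given_x1[:, :, None] * p_x3_given_x2[None, :, :]
p_x3_brute = joint.sum(axis=(0, 1))
```

Elimination touches 2×2 tables at each step, while the brute-force sum enumerates the full 2³ joint; on long chains or large state spaces that gap is exponential, which is the whole point of choosing a good elimination order.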

2.2.3 Junction Tree Algorithm

Junction tree algorithm (Lauritzen & Spiegelhalter 1988) Transforms general graph into a junction tree (tree of cliques) → exact inference via message passing on the tree.

Complexity exponential in treewidth (size of largest clique).

2026 status: Junction tree still used in small/medium structured models (e.g., medical diagnosis nets, parsing with CRFs); larger models rely on approximate methods.

2.3 Approximate Inference

Modern Bayesian AI almost always uses approximate inference due to intractable posteriors.

2.3.1 Markov Chain Monte Carlo (Gibbs, Metropolis-Hastings, Hamiltonian Monte Carlo / NUTS)

Markov Chain Monte Carlo (MCMC) generates samples from the posterior by constructing a Markov chain whose stationary distribution is P(θ | D).

Gibbs sampling
Alternately sample each variable conditioned on all the others → requires exact conditionals.

Metropolis-Hastings
Propose θ' ~ q(θ' | θ) → accept with probability min(1, [P(θ'|D)/P(θ|D)] · [q(θ|θ')/q(θ'|θ)]).

Hamiltonian Monte Carlo (HMC) / No-U-Turn Sampler (NUTS)
Uses gradient information + Hamiltonian dynamics → efficient exploration of high-dimensional continuous posteriors. 2026 default in probabilistic programming (NumPyro, PyMC, Stan) for Bayesian neural nets and hierarchical models.
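A random-walk Metropolis sampler fits in a few lines; with a symmetric proposal the q-ratio in the acceptance probability cancels. A minimal sketch targeting an unnormalized 1D Gaussian "posterior" (function names and the toy target are mine, not a library API):

```python
import numpy as np

def metropolis(log_target, x0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis: Gaussian proposal is symmetric, so the
    q(theta|theta')/q(theta'|theta) factor cancels in the accept ratio."""
    rng = np.random.default_rng(seed)
    x, logp, samples = x0, log_target(x0), []
    for _ in range(n_samples):
        x_prop = x + step * rng.normal()
        logp_prop = log_target(x_prop)
        # accept with probability min(1, p(x_prop) / p(x)), done in log space
        if np.log(rng.uniform()) < logp_prop - logp:
            x, logp = x_prop, logp_prop
        samples.append(x)
    return np.array(samples)

# Unnormalized log-density of N(3, 1): the normalizer is never needed,
# which is exactly why MCMC sidesteps the intractable marginal likelihood
log_target = lambda x: -0.5 * (x - 3.0) ** 2

samples = metropolis(log_target, x0=0.0, n_samples=50_000)
post = samples[5_000:]          # discard burn-in
```

The retained samples recover the target's mean and standard deviation; note that only the unnormalized density enters the accept step.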

2.3.2 Variational Inference (Mean-Field, Structured, Black-Box VI)

Variational inference approximates posterior with simpler q(θ; ϕ) by maximizing ELBO:

ELBO(ϕ) = 𝔼_{q_ϕ} [log p(D, θ)] – 𝔼_{q_ϕ} [log q_ϕ(θ)] = 𝔼_{q} [log p(D | θ)] – D_KL(q_ϕ || p(θ))

Mean-field VI: q(θ) factorizes over parameters → fast but underestimates variance.

Structured VI: imposes structure (e.g., Gaussian with full covariance) → more accurate but harder to optimize.

Black-box VI (Ranganath et al. 2014; 2026 extensions): Score-gradient estimators + reparameterization trick → works with arbitrary differentiable models.

2026 status: Black-box VI (via Pyro/NumPyro) dominant for Bayesian deep learning; amortized VI scales to LLMs.

2.3.3 Importance Sampling & Sequential Monte Carlo (Particle Filters)

Importance sampling
Sample from a proposal q(θ) → weight w_i = p(D, θ_i) / q(θ_i) → weighted average approximates posterior expectations.

Sequential Monte Carlo (SMC) / Particle Filters
Resample particles according to their weights at each step → handles sequential data (e.g., online Bayesian updating).

2026 usage: SMC in Bayesian filtering (robotics, tracking), importance-weighted autoencoders, and LLM uncertainty estimation via weighted ensembles.
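Self-normalized importance sampling only needs the unnormalized target, like MCMC. A minimal sketch estimating a posterior mean, with the standard effective-sample-size diagnostic (target and proposal are toy choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized target: N(2, 1) density without its normalizing constant
log_p_tilde = lambda th: -0.5 * (th - 2.0) ** 2

# Proposal q = N(0, 3^2): wide enough to cover the target
def log_q(th):
    return -0.5 * (th / 3.0) ** 2 - np.log(3.0 * np.sqrt(2.0 * np.pi))

theta = rng.normal(0.0, 3.0, size=100_000)
log_w = log_p_tilde(theta) - log_q(theta)
w = np.exp(log_w - log_w.max())      # subtract max for numerical stability
w /= w.sum()                         # self-normalized weights

post_mean = np.sum(w * theta)        # estimates E[theta] = 2
ess = 1.0 / np.sum(w ** 2)           # effective sample size diagnostic
```

When the proposal is badly mismatched the weights degenerate (a few particles carry all the mass) and the ESS collapses; SMC's resampling step is precisely a fix for that degeneracy in sequential settings.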

3. Bayesian Models in Machine Learning

This section applies the Bayesian inference foundations from Section 2 to concrete machine learning models. We cover Bayesian linear models (with closed-form solutions and evidence approximation), generalized linear models for classification, sparse Bayesian learning, Gaussian processes (nonparametric Bayesian regression/classification), and Bayesian neural networks — including scalable approximations that remain practical even for large models in 2026.

Bayesian approaches provide principled uncertainty quantification, automatic regularization via priors, robustness to small data regimes, and natural model selection via the marginal likelihood — advantages that become increasingly valuable in safety-critical AI, active learning, and trustworthy large-scale systems.

3.1 Bayesian Linear & Generalized Linear Models

3.1.1 Bayesian Linear Regression & Evidence Approximation

Model
y = X w + ε, ε ~ 𝒩(0, σ² I)
Prior: w ~ 𝒩(0, α⁻¹ I) (isotropic Gaussian, α = precision)

Posterior (conjugate)
w | y, X, α, σ² ~ 𝒩(m_N, S_N)
S_N⁻¹ = α I + β XᵀX
m_N = β S_N Xᵀy (β = 1/σ²)

Predictive distribution
y_* | x_*, D ~ 𝒩(x_*ᵀ m_N, x_*ᵀ S_N x_* + σ²)

Evidence approximation (Type-II ML, MacKay 1992)
Maximize the marginal likelihood
p(y | X, α, β) = ∫ p(y | X, w, β) p(w | α) dw = 𝒩(y | 0, β⁻¹ I + α⁻¹ X Xᵀ)

Closed-form updates for α, β via EM-like fixed-point iteration → automatic relevance determination (ARD) emerges when a separate precision α_i is used per feature.

2026 usage: Still fastest Bayesian baseline for tabular data; evidence approximation used in sparse Bayesian learning and hyperparameter tuning.
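The posterior and predictive formulas above translate directly into a few lines of NumPy. A minimal sketch on synthetic data (the true weights and noise level are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = X w_true + noise
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
sigma2 = 0.25
y = X @ w_true + rng.normal(0.0, np.sqrt(sigma2), size=n)

alpha = 1.0            # prior precision
beta = 1.0 / sigma2    # noise precision

# Posterior: S_N^{-1} = alpha I + beta X^T X,  m_N = beta S_N X^T y
S_N = np.linalg.inv(alpha * np.eye(d) + beta * X.T @ X)
m_N = beta * S_N @ X.T @ y

# Predictive at a test point x_*: mean x_*^T m_N, variance x_*^T S_N x_* + sigma^2
x_star = np.ones(d)
pred_mean = x_star @ m_N
pred_var = x_star @ S_N @ x_star + sigma2
```

The predictive variance splits into a parameter-uncertainty term (x_*ᵀ S_N x_*, epistemic, shrinking with more data) and the irreducible noise σ² (aleatoric), a decomposition revisited in Section 4.1.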

3.1.2 Bayesian Logistic / Probit Regression

Logistic regression (non-conjugate)
Likelihood: y_i ~ Bernoulli(σ(x_iᵀ w)). Prior: w ~ 𝒩(0, α⁻¹ I).

No closed-form posterior → approximate methods required.

Probit regression
y_i ~ Bernoulli(Φ(x_iᵀ w)), where Φ is the CDF of the standard normal → slightly easier Gaussian integrals.

Common approximations:

  • Laplace approximation around MAP → Gaussian posterior

  • Variational Bayes (mean-field) → factorized Gaussian q(w)

  • Expectation Propagation (EP) → moment-matching → better calibration than VB

2026 practice:

  • Used in uncertainty-aware classification (medical diagnosis, fraud detection)

  • Last-layer Bayesian logistic on frozen deep features → cheap uncertainty in vision-language models

3.1.3 Sparse Bayesian Learning & Automatic Relevance Determination (ARD)

Sparse Bayesian Learning (SBL) (Tipping 2001 – Relevance Vector Machine)
Prior per weight: w_i ~ 𝒩(0, α_i⁻¹), i.e., a separate precision α_i per feature. Evidence approximation → optimize the α_i → irrelevant features get α_i → ∞, so w_i → 0 (automatic sparsity).

ARD → effective feature selection without cross-validation.

Modern extensions (2024–2026):

  • Hierarchical priors → group-level sparsity (e.g., horseshoe prior)

  • Scalable SBL via stochastic variational inference

  • ARD in deep models → sparse Bayesian last-layer or attention pruning

Advantages over L1 regularization:

  • Full posterior (not point estimate)

  • Automatic hyperparameter tuning via evidence

  • Better uncertainty estimates on sparse features

3.2 Gaussian Processes & Nonparametric Bayes

Gaussian processes (GPs) are the canonical nonparametric Bayesian regression model — prior over functions.

3.2.1 GP Regression & Classification

GP prior
f ~ GP(0, k), where k is the covariance function (kernel); any finite collection of function values is jointly Gaussian, with marginal f(x) ~ 𝒩(0, k(x, x)).

Posterior (Gaussian likelihood with noise variance σ²)
f_* | X, y, x_* ~ 𝒩(μ_*, Σ_*)

μ_* = K_{*X} (K_{XX} + σ² I)⁻¹ y
Σ_* = K_{**} – K_{*X} (K_{XX} + σ² I)⁻¹ K_{X*}
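The posterior equations above can be implemented directly with an RBF kernel. A minimal NumPy sketch fitting noisy observations of sin(x) (kernel hyperparameters and data are illustrative; real code should use a Cholesky factorization rather than repeated solves):

```python
import numpy as np

def rbf(a, b, ell=1.0, sig2=1.0):
    """Squared-exponential kernel k(x, x') = sig2 * exp(-|x - x'|^2 / (2 ell^2))."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return sig2 * np.exp(-0.5 * d2 / ell**2)

rng = np.random.default_rng(0)
X = np.linspace(0.0, 6.0, 30)            # training inputs
noise2 = 0.01
y = np.sin(X) + rng.normal(0.0, np.sqrt(noise2), size=X.size)

Xs = np.array([1.5, 4.0])                # test inputs

K = rbf(X, X) + noise2 * np.eye(X.size)  # K_XX + sigma^2 I
Ks = rbf(Xs, X)                          # K_{*X}
Kss = rbf(Xs, Xs)                        # K_{**}

# mu_* = K_{*X} (K_XX + sigma^2 I)^{-1} y
mu_star = Ks @ np.linalg.solve(K, y)
# Sigma_* = K_{**} - K_{*X} (K_XX + sigma^2 I)^{-1} K_{X*}
Sigma_star = Kss - Ks @ np.linalg.solve(K, Ks.T)
```

The posterior mean tracks sin at the test points, and the diagonal of Σ_* gives per-point predictive variance, small near data, growing away from it, which is the behavior Bayesian optimization exploits.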

GP classification Latent function f → probit/logistic link → approximate posterior via Laplace, EP, or variational inference.

2026 usage:

  • Gold standard for small-to-medium data (n < 10k)

  • Benchmark for uncertainty quantification

  • Surrogate in Bayesian optimization

3.2.2 Kernel Design & Deep Kernel Learning

Kernel design

  • Squared exponential (RBF): k(x,x') = σ² exp(–‖x–x'‖² / (2ℓ²))

  • Matérn, Periodic, Linear + RBF combinations

  • Additive / multiplicative kernels for structure

Deep Kernel Learning (Wilson et al. 2016–2025) k(x,x') = k_base(φ(x), φ(x')) where φ is deep neural net feature extractor → expressive, non-stationary kernels 2026: DKL + SVGP → scalable nonparametric modeling for tabular & time-series data.

3.2.3 Scalable GPs (SVGP, Structured Kernel Interpolation, GPflow / GPyTorch)

Sparse Variational GP (SVGP) (Hensman et al. 2013) Inducing points Z → variational posterior q(f) ≈ p(f | u), u = f(Z) ELBO maximized w.r.t. variational parameters → scales to n ≈ 10⁶ with mini-batching.

Structured Kernel Interpolation (SKI / KISS-GP) (Wilson & Nickisch 2015) Kronecker + grid structure → O(n log n) exact inference on gridded data.

Modern libraries (2026):

  • GPyTorch (Cornell) → GPU-accelerated, deep kernels, SVGP, exact GPs

  • GPflow (Cambridge) → TensorFlow-based, SVGP, MCMC

  • tinygp (JAX) → lightweight, fast for small-to-medium data

2026 status: SVGP + deep kernels dominant for scalable nonparametric Bayesian modeling; used in BO, time-series, spatial statistics.

3.3 Bayesian Neural Networks

Bayesian neural networks place distributions over weights → capture epistemic uncertainty.

3.3.1 Variational Bayes & Bayes-by-Backprop

Bayes-by-Backprop (Blundell et al. 2015) Reparameterization trick: θ = μ + σ ⊙ ε, ε ~ 𝒩(0,I) → stochastic gradient on ELBO (reconstruction – KL)

Mean-field VI: q(w) = ∏ 𝒩(w_i | μ_i, σ_i²) → tractable KL term, but underestimates posterior variance.
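A minimal sketch of the reparameterization trick and the closed-form Gaussian KL for a mean-field posterior; the softplus parameterization of σ and the 𝒩(0, I) prior are common conventions assumed here, not the only choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Variational parameters for one weight vector (mean-field Gaussian posterior)
mu = np.zeros(4)
rho = np.full(4, -3.0)  # sigma = softplus(rho) keeps sigma strictly positive

def softplus(x):
    return np.log1p(np.exp(x))

def sample_weights(mu, rho):
    # Reparameterization trick: w = mu + sigma * eps, eps ~ N(0, I),
    # so gradients flow to (mu, rho) through a deterministic transform.
    eps = rng.standard_normal(mu.shape)
    return mu + softplus(rho) * eps

def kl_to_standard_normal(mu, rho):
    # Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over weights
    s2 = softplus(rho) ** 2
    return 0.5 * np.sum(s2 + mu**2 - 1.0 - np.log(s2))

w = sample_weights(mu, rho)
kl = kl_to_standard_normal(mu, rho)
```

Training minimizes the negative ELBO: expected negative log-likelihood under sampled weights plus this KL term.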

2026 usage: Still baseline for Bayesian deep learning; extended to LSTMs, Transformers (last-layer or sparse VI).

3.3.2 Deep Ensembles, MC Dropout, SWAG, Laplace Approximation

Deep Ensembles (Lakshminarayanan et al. 2017) Train M independent models → predictive mean & variance from ensemble → excellent calibration.

MC Dropout (Gal & Ghahramani 2016) Dropout at test time → Monte-Carlo sampling → approximate Bayesian inference.

SWAG (Maddox et al. 2019) Fit Gaussian to the SGD iterate trajectory → captures the local geometry of the posterior mode → better uncertainty than a single model.

Laplace approximation (MacKay 1992; Daxberger et al. 2021) Second-order expansion around MAP → Gaussian posterior → cheap Hessian or KFAC approximation.

2026 status: Deep ensembles + last-layer Laplace remain production baselines for calibrated deep learning.

3.3.3 Last-Layer Bayesian & Probabilistic Backpropagation

Last-layer Bayesian Freeze early layers → Bayesian last layer (Laplace, VI, ensemble) → captures most predictive uncertainty at low cost.

Probabilistic backpropagation (Hernández-Lobato & Adams 2015) Approximate gradients through stochastic weights → scalable Bayesian training.

2026 trend: Last-layer Bayesian + conformal prediction → state-of-the-art uncertainty for large frozen models (e.g., LLM feature extractors).

4. Uncertainty Quantification in Deep Learning & LLMs

Uncertainty quantification (UQ) is one of the most critical frontiers in modern AI, especially for large language models (LLMs), autonomous systems, medical AI, and any safety-critical or high-stakes deployment. Deep learning models — including LLMs — are notorious for being overconfident on out-of-distribution (OOD) inputs, hallucinating facts, or producing high-confidence wrong answers. Bayesian and probabilistic methods provide principled ways to measure and decompose uncertainty, calibrate predictions, and enable safer decision-making.

This section distinguishes the two main types of uncertainty, reviews predictive distributions and calibration metrics, surveys the state-of-the-art UQ techniques specifically for LLMs (2025–2026), and discusses how uncertainty informs real-world decisions, active learning, OOD detection, and safe exploration.

4.1 Aleatoric vs. Epistemic Uncertainty

Uncertainty in predictions can be decomposed into two fundamental types:

4.1.1 Predictive Distributions & Uncertainty Decomposition

Aleatoric uncertainty (data noise / irreducible uncertainty) Inherent stochasticity in the data-generating process — cannot be reduced even with infinite data. Examples:

  • Label noise in classification

  • Sensor noise in robotics

  • Inherent randomness in language (multiple correct phrasings)

Epistemic uncertainty (model ignorance / reducible uncertainty) Arises from lack of knowledge about the true model or parameters — can be reduced with more data or better modeling. Examples:

  • Ambiguity on rare events

  • Distribution shift / OOD inputs

  • Limited training data in tail domains

Predictive distribution The full Bayesian predictive is:

p(y_* | x_*, D) = ∫ p(y_* | x_*, θ) p(θ | D) dθ

  • Aleatoric → captured by p(y_* | x_*, θ) (likelihood variance)

  • Epistemic → captured by integrating over posterior p(θ | D)

Decomposition (Depeweg et al. 2018; Kendall & Gal 2017) Total predictive variance = E[Var(y_* | θ)] + Var[E(y_* | θ)] = aleatoric + epistemic

In practice (2026):

  • Deep ensembles & MC Dropout estimate both via sample variance

  • Last-layer Bayesian captures mostly epistemic

  • Semantic entropy (LLM-specific) targets epistemic uncertainty
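The Depeweg et al. decomposition above can be computed directly from ensemble outputs; a minimal sketch, assuming each of M members returns a predictive mean and variance per test point:

```python
import numpy as np

def decompose_uncertainty(means, variances):
    # means, variances: shape (M, N) -- per-member predictive mean and
    # variance for M ensemble members at N test points.
    aleatoric = variances.mean(axis=0)  # E_theta[ Var(y* | theta) ]
    epistemic = means.var(axis=0)       # Var_theta[ E(y* | theta) ]
    return aleatoric, epistemic, aleatoric + epistemic

# Toy ensemble: members agree on noise level but disagree on the mean
means = np.array([[1.0, 0.0], [1.2, 0.4], [0.8, -0.4]])
variances = np.array([[0.1, 0.2], [0.1, 0.2], [0.1, 0.2]])
alea, epis, total = decompose_uncertainty(means, variances)
```

High `epis` with low `alea` signals model disagreement (reducible with data); the reverse signals irreducible noise.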

4.1.2 Calibration (Expected Calibration Error, Brier Score)

Calibration: Model confidence should match true accuracy. A model is calibrated if for every confidence c, the accuracy among predictions with confidence c is exactly c.

Expected Calibration Error (ECE) (Naeini et al. 2015) Bin predictions into M confidence bins → compute accuracy vs. average confidence per bin → weighted absolute difference.

ECE = ∑_{m=1}^M (|B_m| / n) |acc(B_m) – conf(B_m)|, where B_m is bin m, acc = accuracy in bin, conf = average confidence in bin.

Brier score (quadratic scoring rule) BS = (1/n) ∑ (p̂_i – y_i)² for binary labels, with the natural multi-class generalization. A proper scoring rule → rewards calibrated probabilities.
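A minimal sketch of binned ECE; the equal-width binning and bin count are conventional choices, not part of the definition.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    # Bin predictions by confidence, then sum |accuracy - mean confidence|
    # per bin, weighted by the fraction of samples in the bin.
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(conf)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.sum() / n * abs(correct[mask].mean() - conf[mask].mean())
    return ece

conf = np.array([0.95, 0.9, 0.55, 0.6])     # model confidences
correct = np.array([1.0, 1.0, 0.0, 1.0])    # 1 if prediction was right
ece = expected_calibration_error(conf, correct)
```

Because ECE depends on the binning, reporting it alongside Brier score and negative log-likelihood (as noted below) is good practice.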

2026 status:

  • ECE still widely reported but sensitive to binning → Brier score + negative log-likelihood preferred

  • Temperature scaling + Platt scaling remain simple post-hoc calibration methods

  • Deep ensembles & last-layer Bayesian → best calibration out-of-the-box

4.2 Modern LLM Uncertainty Methods

LLMs introduce unique challenges: discrete token output, extremely high-dimensional latent space, and emergent reasoning behaviors. Traditional Bayesian methods (full VI, MCMC) are intractable; specialized techniques have emerged.

4.2.1 Verbalized Confidence & Self-Evaluation

Verbalized confidence (Lin et al. 2022; Tian et al. 2023; 2025 extensions) Prompt the LLM to output a numerical confidence (e.g., “I am 85% confident”) or a Likert-scale belief. Surprisingly well-calibrated on some tasks (especially after fine-tuning on confidence-labeled data).

Self-evaluation Prompt model to critique its own answer → verbalized uncertainty (“I’m unsure because…”) → can be used to trigger rejection or refinement.

Limitations: Position bias, verbosity dependence, overconfidence on hard questions.

4.2.2 Semantic Entropy & Cluster-based Uncertainty

Semantic entropy (Kuhn et al. 2023; Farquhar et al. 2024 refinements) Generate multiple samples from the LLM → cluster by semantic equivalence (embedding distance + clustering) → compute entropy over clusters (not tokens).

Key insight: Token entropy mixes aleatoric (paraphrasing) and epistemic (factual disagreement) uncertainty. Semantic clustering isolates epistemic uncertainty → strong correlation with hallucination / factual error.

2026 status: One of the most reliable black-box UQ methods for LLMs → used in production rejection rules, RAG confidence scoring, and safety layers.
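A toy sketch of the cluster-then-entropy computation; the string-normalization `cluster_fn` below is a stand-in for the real semantic clustering (bidirectional entailment via NLI, or embedding similarity), used only to make the mechanism concrete.

```python
import math
from collections import Counter

def semantic_entropy(samples, cluster_fn=lambda s: s.strip().lower()):
    # Entropy over semantic clusters of sampled answers. cluster_fn maps
    # each sample to a cluster id; here a trivial normalization stands in
    # for entailment- or embedding-based clustering.
    counts = Counter(cluster_fn(s) for s in samples)
    n = len(samples)
    return -sum(c / n * math.log(c / n) for c in counts.values())

agree = ["Paris", "paris", " Paris "]       # paraphrases -> one cluster
disagree = ["Paris", "Lyon", "Marseille"]   # factual disagreement -> 3 clusters
```

Paraphrase variation collapses into one cluster (entropy ≈ 0), while factual disagreement spreads mass across clusters (entropy ≈ log K), which is exactly the aleatoric/epistemic separation described above.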

4.2.3 Conformal Prediction for Language Models

Conformal prediction (Vovk et al. 2005; Angelopoulos & Bates 2021–2026) Distribution-free, finite-sample guarantee: construct prediction sets C(x) such that ℙ(y_* ∈ C(x_*) | D) ≥ 1 – α

Conformal language modeling (2024–2026):

  • Token-level sets → top-k tokens with coverage guarantee

  • Sequence-level sets → sets of full answers (via rejection sampling or beam search)

  • Verbalized sets → prompt LLM to output set of plausible answers

Advantages: No retraining, rigorous coverage, black-box compatible.

2026 trend: Conformal + semantic entropy → state-of-the-art UQ for production LLMs (rejection, abstention, human handoff).
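A minimal sketch of split conformal prediction for classification, using the common 1 − p(true label) nonconformity score; the toy calibration data is illustrative only.

```python
import numpy as np

def conformal_quantile(cal_probs, cal_labels, alpha=0.1):
    # Nonconformity score: 1 - model probability of the true label.
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile level for (1 - alpha) coverage
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, level)

def prediction_set(test_probs, qhat):
    # Keep every label whose nonconformity falls below the threshold.
    return np.where(1.0 - test_probs <= qhat)[0]

cal_probs = np.array([[0.8, 0.1, 0.1],
                      [0.2, 0.7, 0.1],
                      [0.1, 0.2, 0.7],
                      [0.6, 0.3, 0.1]])
cal_labels = np.array([0, 1, 2, 0])
qhat = conformal_quantile(cal_probs, cal_labels, alpha=0.2)
S = prediction_set(np.array([0.7, 0.25, 0.05]), qhat)
```

For sequence-level LLM sets, the same recipe applies with a sequence score (e.g., negative log-probability or semantic-entropy-based scores) in place of the class probability.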

4.2.4 Token-level & Sequence-level Uncertainty (P(True), min-p, entropy)

Token-level:

  • Entropy of next-token distribution → high entropy → high uncertainty

  • min-p sampling (min-p filtering): sample only from tokens with p ≥ min-p × max_p → avoids low-probability tail

  • P(True) (Kadavath et al. 2022): probability the model assigns to the token “True” when asked whether its own proposed answer is correct → proxy for correctness

Sequence-level:

  • Average token entropy / log-prob

  • Variance of sequence log-prob across samples

  • Self-consistency entropy (multiple CoT paths)

2026 practice: min-p + semantic entropy combination → best trade-off between quality and uncertainty awareness in sampling.
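The min-p rule described above can be sketched in a few lines; the threshold value is illustrative.

```python
import numpy as np

def min_p_filter(probs, min_p=0.1):
    # min-p filtering: keep tokens with p >= min_p * max(p), then renormalize.
    # The cutoff scales with the peak probability, so it adapts to how
    # confident the distribution is at this step.
    keep = probs >= min_p * probs.max()
    out = np.where(keep, probs, 0.0)
    return out / out.sum()

probs = np.array([0.5, 0.3, 0.15, 0.04, 0.01])  # next-token distribution
filtered = min_p_filter(probs, min_p=0.1)       # drops the two tail tokens
```

Sampling then proceeds from `filtered`, avoiding the low-probability tail that drives many hallucinated continuations.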

4.3 Uncertainty in Decision-Making & Safety

4.3.1 Uncertainty-Aware Active Learning

Active learning selects the most informative samples to label. Uncertainty sampling: choose x with the highest predictive entropy or least confidence. Bayesian active learning: maximize expected information gain I(y_*; θ | D, x_*)

2026 applications:

  • Data-efficient fine-tuning of LLMs

  • Active preference collection in RLHF

  • Medical image annotation, robotics exploration

4.3.2 Out-of-Distribution Detection & Rejection Rules

OOD detection:

  • Predictive entropy / max softmax probability

  • Semantic entropy (LLM-specific)

  • Energy score (Liu et al. 2020) → –log-sum-exp of logits

  • Last-layer Gaussian density

Rejection rules:

  • If uncertainty > threshold → abstain / escalate to human

  • Conformal sets → output set only if small enough

2026 trend: Production LLMs use hybrid rejection (semantic entropy + entropy + verbalized confidence) → reduces harmful hallucinations by 40–70% in high-stakes settings.

4.3.3 Risk-Averse Policies & Safe Exploration

Risk-averse decision-making Use Conditional Value-at-Risk (CVaR) or entropic risk measures instead of expected reward → penalize tail risks.

Safe exploration in RL:

  • Thompson sampling (posterior sampling) → natural exploration

  • Uncertainty bonus in UCB → optimistic under uncertainty

  • Bayes-RL (posterior sampling for RL) → principled uncertainty-aware policy

2026 applications:

  • Autonomous driving → reject high-uncertainty maneuvers

  • Medical AI → escalate high-epistemic-uncertainty diagnoses

  • LLM agents → defer to tools/humans when uncertainty high

5. Bayesian Decision Theory & Real-World Applications

Bayesian decision theory provides the principled framework for making optimal choices under uncertainty — exactly what modern AI systems need when deployed in high-stakes, partially observable, or safety-critical environments. This section bridges Bayesian inference (from previous sections) to decision-making: how to define loss or utility functions, compute Bayes risk, maximize expected utility, and apply these ideas to Bayesian optimization, bandits, reinforcement learning, and large language models in 2026.

5.1 Bayesian Decision Theory Basics

5.1.1 Loss Functions & Bayes Risk

A decision problem is defined by:

  • Action space 𝒜 (possible decisions / actions)

  • State / parameter space Θ (unknown true state)

  • Loss function L(θ, a) ∈ ℝ⁺ (cost of taking action a when true state is θ)

The Bayes risk of a decision rule δ (mapping from data to action) is the expected loss under the prior:

R(δ, π) = ∫_Θ R(δ | θ) π(θ) dθ where R(δ | θ) = ∫ L(θ, δ(x)) p(x | θ) dx (risk conditional on θ)

The Bayes optimal decision rule δ* minimizes Bayes risk:

δ* = argmin_δ R(δ, π)

In practice, we often compute the Bayes action for a fixed posterior:

a* = argmin_a ∫ L(θ, a) p(θ | D) dθ = argmin_a 𝔼_{θ ~ p(θ|D)} [L(θ, a)]

Common loss functions in AI:

  • 0-1 loss → classification error

  • Squared loss → regression MSE

  • Absolute loss → median prediction

  • Asymmetric losses → risk-averse or safety-critical decisions (e.g., false negative cost > false positive)
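A Monte-Carlo sketch of the Bayes action a* = argmin_a 𝔼[L(θ, a)], using posterior samples in place of the integral; the Gaussian posterior and the 5:1 asymmetric cost ratio are illustrative assumptions.

```python
import numpy as np

def pinball_loss(theta, a):
    # Asymmetric loss: underestimating theta costs 5x more than overestimating
    return 5.0 * np.maximum(theta - a, 0.0) + np.maximum(a - theta, 0.0)

def bayes_action(actions, theta_samples):
    # Monte-Carlo Bayes action: minimize posterior expected loss over a grid
    exp_loss = [pinball_loss(theta_samples, a).mean() for a in actions]
    return actions[int(np.argmin(exp_loss))]

rng = np.random.default_rng(0)
theta_samples = rng.normal(1.0, 0.5, size=5000)  # stand-in posterior draws
actions = np.linspace(-1.0, 3.0, 81)
a_star = bayes_action(actions, theta_samples)
```

With this loss the Bayes action is a posterior quantile (here the 5/6 quantile), so the optimal action sits above the posterior mean: asymmetric costs shift decisions toward the safe side.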

5.1.2 Expected Utility Maximization

Instead of minimizing loss, we can maximize expected utility U(θ, a) = –L(θ, a) (utility = negative loss).

Expected utility under posterior:

EU(a | D) = ∫ U(θ, a) p(θ | D) dθ

Optimal action: a* = argmax_a EU(a | D)

Von Neumann–Morgenstern utility theory justifies this under rational preferences (axioms of completeness, transitivity, continuity, independence).

2026 relevance:

  • Expected utility maximization underpins safe RL, autonomous driving (minimize collision risk), medical treatment planning, and calibrated LLM decision-making (e.g., when to abstain).

5.2 Bayesian Optimization

Bayesian optimization (BO) finds the global optimum of a black-box objective f(x) that is expensive to evaluate (e.g., hyperparameter tuning, neural architecture search, robotics control).

5.2.1 Gaussian Process Surrogate + Acquisition Functions (EI, PI, UCB, Thompson Sampling)

Surrogate model Usually Gaussian Process (GP): f ~ GP(0, k) Posterior GP gives mean μ(x) and variance σ²(x) → uncertainty estimate.

Acquisition function α(x) balances exploration (high uncertainty) and exploitation (high predicted value).

  • Expected Improvement (EI) α_EI(x) = 𝔼[max(0, f(x) – f(x⁺))] where x⁺ is current best Closed-form under Gaussian posterior

  • Probability of Improvement (PI) α_PI(x) = ℙ(f(x) > f(x⁺) + ξ) → more conservative

  • Upper Confidence Bound (UCB) α_UCB(x) = μ(x) + κ σ(x) → deterministic, theoretical regret bounds

  • Thompson Sampling Sample f̃ ~ posterior GP → choose x = argmax f̃(x) → simple, asymptotically optimal regret

2026 best practice: Thompson sampling, EI, or UCB with dynamic κ (entropy-search-style acquisitions) → most robust across tasks.
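The closed-form EI above can be written directly; a sketch (maximization convention) showing how posterior uncertainty enters the acquisition:

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, best):
    # Closed-form EI under a Gaussian posterior N(mu, sigma^2):
    # EI = (mu - best) * Phi(z) + sigma * phi(z), z = (mu - best) / sigma
    if sigma <= 0.0:
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    return (mu - best) * norm_cdf(z) + sigma * norm_pdf(z)

# A point with a slightly worse mean but high uncertainty can score higher
# than a near-certain point offering a small improvement.
ei_sure = expected_improvement(mu=1.05, sigma=1e-9, best=1.0)
ei_unsure = expected_improvement(mu=0.9, sigma=0.5, best=1.0)
```

This is the exploration/exploitation trade-off in one formula: the σ·φ(z) term rewards uncertainty even when the mean is below the incumbent.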

5.2.2 Scalable BO (TuRBO, SAAS-BO, Dragonfly)

TuRBO (Eriksson et al. 2019–2025) Trust Region Bayesian Optimization → multiple local trust regions + local GPs → scales to high dimensions (dozens of hyperparameters).

SAAS-BO (Eriksson et al. 2021) Sparse Axis-Aligned Subspace prior → horseshoe prior on lengthscales → automatic relevance determination in high-d spaces.

Dragonfly (Kandasamy et al. 2020–2025) Multi-fidelity BO + parallel evaluations → asynchronous, multi-worker scaling.

2026 status:

  • Ax (Meta) + BoTorch → production standard (integrates TuRBO, SAAS, multi-fidelity)

  • Used daily for LLM hyperparameter tuning, prompt optimization, architecture search

5.2.3 Hyperparameter Tuning & Neural Architecture Search

Hyperparameter tuning BO outperforms grid/random search on expensive objectives (e.g., training 7B+ LLMs).

Neural Architecture Search (NAS) BO on the NAS space (DARTS-like continuous relaxation or discrete search) → efficient discovery of strong architectures.

2026 applications:

  • Tuning learning rate schedules, quantization bits, LoRA ranks

  • Searching MoE routing, attention variants, SSM hyperparameters

5.3 Bayesian Bandits & Reinforcement Learning

5.3.1 Thompson Sampling & Bayesian UCB

Multi-armed bandits Choose arm a_t → receive reward r_t ~ p(r | a_t)

Thompson Sampling (Thompson 1933; modern resurgence 2010s) Maintain posterior p(θ_a | history) for each arm → sample θ̃_a ~ posterior → choose a_t = argmax_a θ̃_a → naturally balances exploration/exploitation.

Bayesian UCB Choose a_t = argmax_a [μ_a + κ σ_a] → deterministic, regret bounds.
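A minimal Beta-Bernoulli Thompson sampling loop, implementing the posterior-sample-then-argmax rule above; the arm reward rates and horizon are illustrative.

```python
import numpy as np

def thompson_step(succ, fail, rng):
    # One posterior draw per arm from Beta(1 + successes, 1 + failures)
    # (uniform Beta(1,1) prior), then act greedily on the sampled rates.
    theta = rng.beta(succ + 1.0, fail + 1.0)
    return int(np.argmax(theta))

rng = np.random.default_rng(0)
true_p = np.array([0.3, 0.7])   # hidden Bernoulli reward rates
succ = np.zeros(2)
fail = np.zeros(2)
for _ in range(2000):
    a = thompson_step(succ, fail, rng)
    reward = float(rng.random() < true_p[a])
    succ[a] += reward
    fail[a] += 1.0 - reward
pulls = succ + fail
```

Exploration happens automatically: while the posteriors overlap, the inferior arm still wins some draws; as evidence accumulates, pulls concentrate on the better arm.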

2026 usage:

  • Online A/B testing in recommender systems

  • Prompt selection in LLMs

  • Resource allocation in cloud ML training

5.3.2 Posterior Sampling for RL (PSRL)

Posterior Sampling for Reinforcement Learning (Osband et al. 2013) Maintain posterior over MDP parameters → sample MDP → solve exactly (or approximately) → act optimally under sampled MDP → repeat.

Advantages:

  • Principled exploration

  • Asymptotically optimal regret in tabular MDPs

2026 extensions:

  • Deep PSRL → sample deep dynamics model → MPC or value-based planning

  • Used in robotics, game AI, recommendation policies

5.3.3 Bayesian Deep RL & Uncertainty-Aware Exploration

Bayesian deep RL

  • Deep ensembles for Q-function / policy → epistemic uncertainty bonus

  • Bootstrapped DQN → variance-driven exploration

  • Probabilistic ensembles + model-based RL (e.g., PETS, MBPO with Bayesian backbones)

Uncertainty-aware exploration

  • Add epistemic uncertainty to reward (UCB-style bonus)

  • Sample from posterior predictive → optimistic planning

  • Disagreement-based exploration (ensemble variance)

2026 trend: Bayesian deep RL + posterior sampling → state-of-the-art in sim-to-real transfer, safe exploration, and LLM agent planning under uncertainty.

5.4 Bayesian Methods in Large Language Models

5.4.1 Bayesian Prompting & In-Context Learning

Bayesian prompting Treat in-context examples as prior → posterior predictive approximates Bayesian update. Prompt with diverse demonstrations → ensemble-like effect.

2026 usage:

  • Few-shot uncertainty estimation

  • Prompt ensembling → average multiple prompt completions

5.4.2 Uncertainty-Guided Chain-of-Thought & Self-Refine

Uncertainty-guided CoT Generate multiple reasoning paths → weight by semantic entropy or self-evaluated confidence → majority vote or best-of-N.

Self-Refine with uncertainty If epistemic uncertainty high → trigger critique → refine answer → repeat until low uncertainty.

2026 applications:

  • Reduce hallucinations in math / code generation

  • Improve reliability in agentic workflows

5.4.3 Bayesian Model Selection for Prompt Ensembles

Treat different prompts / templates as models → use Bayesian model averaging or evidence approximation to weight them.

2026 practice:

  • Prompt ensembling with semantic entropy weighting

  • Bayesian prompt selection via Thompson sampling or UCB

6. Advanced & Emerging Topics (2025–2026)

By 2025–2026, Bayesian and probabilistic modeling in AI has moved far beyond simple conjugate updates and mean-field variational inference. The field now focuses on scalable, tractable generative models that can perform exact marginal inference, continuous normalizing flows adapted to discrete data like text, diffusion and score-based models for both continuous and discrete domains, nonparametric methods that adapt complexity with data size, and robust inference techniques that remain reliable under model misspecification. These advances address core limitations of classical Bayesian deep learning: intractability at scale, poor sample efficiency, lack of exact likelihood computation, and brittleness when assumptions are violated.

This section surveys the mathematical foundations, key algorithms, and 2025–2026 research frontiers that are reshaping uncertainty-aware, generative, and nonparametric AI.

6.1 Probabilistic Circuits & Tractable Generative Models

Probabilistic circuits (PCs) are a class of structured generative models that guarantee tractable (polynomial-time) exact inference for marginals, conditionals, and likelihoods — a property that is intractable for most deep generative models (VAEs, normalizing flows, diffusion).

Core Properties of Probabilistic Circuits

A probabilistic circuit is a directed acyclic graph with:

  • Input nodes (distribution leaves: Gaussian, categorical, etc.)

  • Sum nodes (weighted mixture: ∑ w_i C_i)

  • Product nodes (independent factors: ∏ C_i)

Tractability guarantees require:

  • Structured decomposability (children of product nodes have disjoint scopes)

  • Determinism (sum nodes have mutually exclusive paths)

  • Smoothness (children of a sum node have identical scopes)

These conditions enable exact and efficient computation of:

  • Likelihood p(x)

  • Marginal p(x_S) for any subset S

  • Conditional p(x_S | x_E) for evidence x_E

  • Most MAP/MPE queries

Key 2025–2026 models:

  • SPNs (Sum-Product Networks, Poon & Domingos 2011 → 2025 scalable variants)

  • CNs (Cutset Networks)

  • AR-Circuits (Autoregressive Circuits)

  • EiNets (Einsum Networks)

  • Probabilistic Sentential Decision Diagrams (PSDDs) → used in hybrid neuro-symbolic systems

Applications in 2026:

  • Tractable density estimation in safety-critical domains

  • Exact posterior predictive in Bayesian pipelines

  • Tractable amortized inference for LLMs (circuit-based likelihoods)

  • Neuro-symbolic reasoning (PSDDs + neural leaves)

6.2 Normalizing Flows & Continuous Normalizing Flows for Text

Normalizing flows transform a simple base distribution (e.g., Gaussian) into a complex target distribution via invertible, differentiable mappings with tractable Jacobian determinant.

Core Idea

Let z ~ p_z(z) (easy to sample/evaluate), x = f(z), f invertible → p_x(x) = p_z(f⁻¹(x)) |det J_f⁻¹(x)| log p_x(x) = log p_z(f⁻¹(x)) + log |det J_f⁻¹(x)|

Continuous normalizing flows (CNFs) Use neural ODEs: dz/dt = g(z(t), t; θ) → infinite-depth flow with exact log-det via trace of Jacobian.
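The change-of-variables formula can be checked on the simplest invertible map, a 1-D affine flow x = a·z + b, chosen purely for illustration.

```python
import numpy as np

def affine_flow_logpdf(x, a, b):
    # Change of variables for x = a*z + b with z ~ N(0, 1):
    # log p_x(x) = log p_z(f^{-1}(x)) + log |det J_{f^{-1}}(x)|
    #            = log p_z((x - b)/a) - log|a|
    z = (x - b) / a
    log_pz = -0.5 * z**2 - 0.5 * np.log(2.0 * np.pi)
    return log_pz - np.log(abs(a))

# The flow density must agree with the N(b, a^2) density computed directly.
x, a, b = 1.3, 2.0, 0.5
direct = -0.5 * ((x - b) / a) ** 2 - 0.5 * np.log(2.0 * np.pi * a**2)
```

Real flows stack many such invertible layers with learnable parameters; the log-det term is what keeps the density exact through every layer.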

Challenges for discrete data (text, tokens):

  • Standard flows operate on continuous spaces

  • Discrete → continuous embedding → flow → round-trip quantization

2025–2026 advances:

  • Discrete flows (Hoogeboom et al. 2019 → Masked Autoregressive Flows for discrete)

  • CNFs for text (Kidger et al. 2020 → Neural SDEs + score-based variants)

  • Flow-based tokenizers (2025 papers): learn invertible subword tokenization + density estimation

  • Augmented flows (add auxiliary continuous noise → dequantization)

Applications:

  • Tractable likelihood for discrete data (better than VAEs)

  • Controllable generation via latent interpolation

  • Density estimation in language modeling (hybrid with autoregressive)

6.3 Diffusion Models & Score-Based Generative Modeling (Continuous & Discrete)

Diffusion models (and their score-based formulation) have become dominant generative paradigms in vision and are rapidly adapting to discrete domains like text.

6.3.1 Diffusion-LM, SSD-LM, GenAI Diffusion Variants

Diffusion-LM (Li et al. 2022) Embed tokens into continuous space → run continuous diffusion → round-trip decoding → classifier-free guidance for controllable text generation.

SSD-LM (Han et al. 2023) Semi-autoregressive discrete diffusion → parallelize generation while preserving dependencies.

2025–2026 variants:

  • Masked diffusion (MaskGIT-style for text)

  • Continuous-time discrete diffusion (score matching on embedding space)

  • Hybrid autoregressive-diffusion (e.g., recent LLaDA-style models)

  • Score entropy-based discrete diffusion (SEDD, Lou et al. 2024)

Advantages over autoregressive:

  • Parallel sampling

  • Better global coherence

  • Natural controllability (guidance on semantics, length, style)

6.3.2 Score-based Generative Modeling on Discrete Spaces

Score-based generative modeling (Song & Ermon 2019–2021) Learn score function s_θ(x,t) ≈ ∇_x log p_t(x) → sample via Langevin dynamics or predictor-corrector.

Discrete adaptations (2023–2026):

  • D3PM (Austin et al. 2021): absorbing diffusion on categorical space

  • CDCD (Campbell et al. 2023): continuous relaxation + score matching

  • SEDD (Lou et al. 2024): score entropy divergence minimization → state-of-the-art discrete score models

2026 status:

  • Discrete diffusion and score-based models competitive with autoregressive LLMs on infilling, style transfer, and constrained generation

  • Lag on open-ended long-form quality but excel in controllable & parallel tasks

6.4 Bayesian Nonparametric Methods at Scale (Dirichlet Process, Hierarchical Pitman-Yor)

Bayesian nonparametric methods let model complexity grow with data — no fixed number of clusters/components.

Dirichlet Process (DP) (Ferguson 1973) DP(α, G₀) → prior over distributions → stick-breaking construction → infinite mixture model.

Hierarchical Dirichlet Process (HDP) Hierarchical prior → shared atoms across groups → topic models (HDP-LDA).

Pitman-Yor Process (two-parameter generalization) Discount parameter d → power-law behavior → better for natural language (Zipf’s law).
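A truncated stick-breaking sketch of the DP weight construction; the truncation level and concentration α are illustrative choices.

```python
import numpy as np

def stick_breaking(alpha, n_atoms, rng):
    # Truncated stick-breaking: beta_k ~ Beta(1, alpha),
    # w_k = beta_k * prod_{j<k} (1 - beta_j)  (break off a fraction of
    # whatever stick length remains at each step).
    betas = rng.beta(1.0, alpha, size=n_atoms)
    stick_left = np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
    return betas * stick_left

rng = np.random.default_rng(0)
w = stick_breaking(alpha=2.0, n_atoms=50, rng=rng)  # DP mixture weights
```

Larger α spreads mass over more atoms (finer clustering); the Pitman-Yor version adds a discount to each break, producing the power-law weight decay mentioned above.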

2025–2026 scalable approximations:

  • Stick-breaking variational inference

  • Memoized online variational inference

  • Split-merge MCMC for HDP

  • Deep hierarchical Pitman-Yor → neural topic models at scale

Applications:

  • Infinite topic models for large corpora

  • Nonparametric clustering in recommender systems

  • Hierarchical priors in LLMs for syntax/semantics

6.5 Robust Bayesian Inference & Misspecification

Real-world data often violates modeling assumptions → robust Bayesian methods remain reliable under misspecification.

Key techniques:

  • Power posteriors p(θ | D) ∝ p(D | θ)^β p(θ), β < 1 → downweights likelihood

  • Generalized Bayesian inference (Dempster-Shafer, imprecise probability)

  • Bayesian robustness (Huber contamination models, density power divergence)

  • Misspecification-aware VI (2025 papers): detect and correct divergence via importance weighting
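For a conjugate Bernoulli-Beta model the power posterior has a closed form, which makes the likelihood-downweighting effect easy to see; the counts below are illustrative.

```python
def power_posterior_beta(heads, tails, beta=0.5, a0=1.0, b0=1.0):
    # p(theta | D) ∝ p(D | theta)^beta p(theta) with a Beta(a0, b0) prior:
    # tempering the Bernoulli likelihood simply scales the pseudo-counts.
    return a0 + beta * heads, b0 + beta * tails

def beta_var(a, b):
    # Variance of a Beta(a, b) distribution
    return a * b / ((a + b) ** 2 * (a + b + 1.0))

full = power_posterior_beta(30, 10, beta=1.0)      # standard posterior
tempered = power_posterior_beta(30, 10, beta=0.5)  # likelihood downweighted
```

The tempered posterior is wider (larger variance), so under misspecification it hedges rather than concentrating on a wrong parameter value.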

2026 frontier:

  • Robust Bayesian deep learning (power posteriors + ensembles)

  • Safe LLM alignment under distribution shift

  • Robust BO under non-stationary objectives

6.6 Open Problems & Thesis Directions

  1. Tractable exact inference at LLM scale Can probabilistic circuits or structured flows achieve exact likelihood for billion-parameter models?

  2. Discrete diffusion scaling laws How do discrete diffusion models scale with compute/data compared to autoregressive LLMs? (power laws?)

  3. Robust Bayesian inference for foundation models Develop misspecification-robust VI or MCMC that scales to 10¹²+ parameter posteriors.

  4. Nonparametric priors for language structure Hierarchical Pitman-Yor or neural DP priors for syntax/semantics in LLMs — can they improve compositionality?

  5. Uncertainty-aware test-time adaptation Use epistemic uncertainty to trigger online Bayesian updates during deployment (continual learning without forgetting).

  6. Safe Bayesian decision-making in agents Formal regret bounds for Bayesian deep RL agents under partial observability and misspecification.

7. Tools, Libraries & Implementation Resources

This section provides a practical, up-to-date (mid-2026) overview of the most widely used open-source tools, libraries, and frameworks for probabilistic modeling, Bayesian inference, uncertainty quantification, Bayesian optimization, and related benchmarking in AI/ML. Emphasis is placed on tools that scale to modern deep learning and large language model (LLM) workflows, support GPU/TPU acceleration, and are actively maintained by strong communities or industry labs.

All recommendations reflect the state of the ecosystem as of March–June 2026.

7.1 Core Probabilistic Programming Frameworks

Probabilistic programming languages (PPLs) allow users to specify generative models and perform inference automatically — essential for Bayesian deep learning, uncertainty-aware LLMs, Bayesian optimization, and safe RL.

7.1.1 Pyro, NumPyro, PyMC, Stan (CmdStanPy), TensorFlow Probability, Edward2

Pyro (originally Uber AI; now maintained under the LF AI & Data Foundation)

  • Built on PyTorch → dynamic computation graphs, GPU acceleration

  • Strong in deep probabilistic models (VAEs, Bayesian NNs, normalizing flows)

  • Black-box VI, SVI, MCMC (NUTS/HMC), importance sampling

  • 2026 status: Still widely used for research on Bayesian deep learning and uncertainty in vision/language models

NumPyro (Pyro + JAX backend)

  • JAX-accelerated version of Pyro → massive speed-ups on GPU/TPU

  • Same API as Pyro but leverages JAX autodiff, vmap, pmap, JIT

  • Best choice in 2026 for large-scale Bayesian inference (e.g., Bayesian Transformers, diffusion models)

  • Supports NumPyro plate notation for mini-batching

PyMC (formerly PyMC3 → PyMC v5+)

  • Theano → Aesara → PyTensor backend (flexible)

  • Intuitive syntax, strong in hierarchical modeling

  • NUTS (No-U-Turn Sampler), variational inference, SMC

  • 2026 status: Dominant in statistics and epidemiology; excellent Jupyter integration

Stan (CmdStanPy)

  • State-of-the-art MCMC (NUTS) → highest effective sample size per second

  • CmdStanPy → Python interface to Stan

  • Best for complex hierarchical models where precision matters

  • 2026 usage: Gold standard for robust Bayesian modeling in academia and pharma

TensorFlow Probability (TFP)

  • Built on TensorFlow → excellent for production deployment

  • JointDistribution, bijectors (flows), variational layers, MCMC

  • 2026 status: Strong in Google ecosystem (e.g., internal LLM safety & uncertainty)

Edward2 (now mostly legacy)

  • TensorFlow-based probabilistic programming → largely superseded by TFP

2026 recommendation:

  • Research & deep models → NumPyro (speed) or Pyro (PyTorch ecosystem)

  • Hierarchical/complex models → PyMC or CmdStanPy

  • Production / Google stack → TFP

7.1.2 Pyro + JAX (NumPyro) for GPU-accelerated inference

NumPyro is the de-facto choice in 2026 for GPU/TPU-accelerated Bayesian inference at scale:

  • JAX → just-in-time compilation (XLA), automatic differentiation, vectorization (vmap), parallelization (pmap)

  • Supports massive models (Bayesian Transformers, diffusion on large datasets)

  • Black-box VI with reparameterization + score-gradient estimators

  • NUTS/HMC with GPU-friendly mass matrix adaptation

  • Example use cases: Bayesian last-layer on frozen LLMs, uncertainty-aware fine-tuning, Bayesian prompt optimization

Practical tip: Use NumPyro + JAX for any Bayesian model that needs to scale beyond ~10k–100k parameters.

7.2 Uncertainty Quantification Libraries

These libraries provide ready-to-use methods for estimating epistemic & aleatoric uncertainty in deep models and LLMs.

7.2.1 Uncertainty Wizard, Fortuna, TorchUncertainty, Laplace

Uncertainty Wizard Lightweight → MC Dropout, ensembles, test-time augmentation Easy integration with tf.keras models → production-ready UQ baselines

Fortuna (AWS Labs 2023–2026) Comprehensive UQ for deep learning:

  • Deep ensembles, MC Dropout, SWAG, Laplace, conformal prediction

  • Last-layer Bayesian + temperature scaling

  • Strong focus on calibration & OOD detection

  • 2026 status: One of the most complete open-source UQ toolkits

TorchUncertainty PyTorch-native → ensembles, MC Dropout, evidential deep learning, conformal prediction Active development → excellent documentation & tutorials

Laplace (Daxberger et al. 2021–2026) Fast, scalable Laplace approximation:

  • Full-network, last-layer, KFAC, diagonal, low-rank approximations

  • Predictive distributions & uncertainty metrics

  • Works on frozen LLMs → very practical for 2026 workflows

7.2.2 Semantic Entropy implementations & LLM calibration tools

Semantic Entropy (Farquhar et al. 2024–2026) Open implementations: GitHub repos (original paper code + community forks)

  • Cluster LLM samples via embedding similarity → entropy over clusters

  • Detects epistemic uncertainty (hallucinations) better than token entropy

LLM calibration tools (2025–2026 ecosystem):

  • Verbalized confidence wrappers → prompt templates + parsing

  • Conformal prediction for language (e.g., conformal token sets, sequence-level sets)

  • P(True) & min-p filters → uncertainty-aware sampling

  • TorchUncertainty + Fortuna → plug-and-play calibration for LLMs

2026 recommendation: Combine semantic entropy + last-layer Laplace + conformal prediction → strongest black-box + white-box UQ for production LLMs.

7.3 Bayesian Optimization & Bandits

7.3.1 Ax (Meta), BoTorch, Optuna (Bayesian samplers), SMAC3

Ax (Meta AI) Production-grade BO platform → integrates with BoTorch Supports multi-fidelity, multi-objective, parallel evaluations, TuRBO, SAAS-BO Used internally at Meta for LLM tuning & architecture search

BoTorch (PyTorch-based BO library) Core engine behind Ax → modular, GPU-accelerated Acquisition functions (EI, PI, UCB, Thompson), GP models, trust regions 2026 status: De-facto research standard for Bayesian optimization

Optuna Define-by-run API → TPE, CMA-ES, Bayesian samplers (via BoTorch integration) Pruning, multi-objective, distributed trials → very user-friendly

SMAC3 Successor to SMAC → Bayesian optimization with random forests Strong on tabular / mixed-integer spaces → excellent for hyperparameter tuning

7.3.2 Thompson Sampling & Bayesian bandits libraries

Thompson Sampling implementations:

  • BoTorch → built-in Thompson sampling acquisition

  • Ax → supports TS for multi-arm bandits

  • Simple NumPyro / Pyro examples → custom bandits

Bandit libraries (2026):

  • BoTorch → Bayesian bandits + deep kernel surrogates

  • Sherpa → hyperparameter optimization with bandit-style pruning

  • RLlib / Ray Tune → multi-armed bandits + BO hybrids for RL

2026 trend: BO + Thompson sampling + Ray Tune → dominant for distributed hyperparameter tuning of LLMs and agents.

    7.4 Gaussian Process Libraries

    7.4.1 GPyTorch, GPflow, scikit-learn GPs, tinygp (JAX)

    GPyTorch (Cornell; the most popular GP library in 2026): GPU-accelerated, scalable GPs (SVGP, SKI/KISS-GP, deep kernels). BoTorch integration makes it the standard Bayesian-optimization backend. Supports exact GPs up to roughly 10k points and SVGP to millions.

    GPflow (Cambridge; TensorFlow ecosystem): SVGP, MCMC, and natural-gradient VI; strong for hierarchical GPs.

    scikit-learn GPs: simple, exact GPs; the baseline for small data (under roughly 1k–3k points).

    tinygp (JAX-based): lightweight, fast exact GPs; excellent for prototyping and small-to-medium data.

    2026 recommendation:

    • GPyTorch + BoTorch → default for scalable BO & GP research

    • tinygp → quick JAX-based experiments
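    To make concrete what the exact-GP baselines above compute under the hood, here is a minimal NumPy sketch of GP posterior inference with an RBF kernel. This is a toy illustration of the math, not the API of any of the libraries listed; kernel hyperparameters are fixed by hand.

```python
import numpy as np

def rbf(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared-exponential (RBF) kernel matrix between two 1D input sets."""
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise=1e-2):
    """Exact GP posterior mean and variance (zero prior mean)."""
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = rbf(x_train, x_test)
    Kss = rbf(x_test, x_test)
    L = np.linalg.cholesky(K)                 # stable solve via Cholesky
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = Ks.T @ alpha                       # posterior mean
    v = np.linalg.solve(L, Ks)
    var = np.diag(Kss) - np.sum(v ** 2, axis=0)  # posterior variance
    return mean, var

# fit a noiseless-ish sine curve; variance shrinks near the training inputs
x = np.linspace(0, 3, 20)
y = np.sin(x)
mu, var = gp_posterior(x, y, np.array([1.5]))
```

    The O(n³) Cholesky factorization is exactly why exact GPs top out around 10k points and why SVGP/SKI approximations exist.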

    7.5 Evaluation & Benchmarking Suites

    7.5.1 Uncertainty Baselines, UCI datasets, GLUE-style uncertainty extensions

    Uncertainty Baselines (Google Research, 2020–2026): standardized implementations of deep ensembles, MC Dropout, SWAG, Laplace, and more. Datasets: UCI regression/classification, ImageNet, CIFAR, and OOD benchmarks. Still widely cited for reproducible UQ comparisons.

    UCI datasets (classic regression/classification): Energy, Concrete, Yacht, Boston, and others; the standard suite for evaluating Bayesian regression and GPs.

    GLUE-style uncertainty extensions

    • GLUE/SuperGLUE with MC Dropout ensembles

    • Uncertainty-aware versions of MMLU, BIG-bench, HELM

    • LLM-specific: semantic entropy on MMLU, TruthfulQA, HaluEval

    2026 status:

    • Uncertainty Baselines + Fortuna → gold-standard UQ reproducibility

    • HELM Safety / HELM Classic → holistic evaluation including uncertainty & calibration

    • Custom LLM uncertainty benchmarks (semantic entropy + rejection accuracy) emerging

    Key takeaway for 2026: NumPyro + GPyTorch/BoTorch + Fortuna/Laplace form the most powerful open-source stack for scalable Bayesian inference and UQ; Ax + Ray Tune cover production Bayesian optimization; Uncertainty Baselines and HELM provide reproducible benchmarking.

    8. Assessments, Exercises & Projects

    This section provides a carefully scaffolded set of learning activities — from conceptual reinforcement and mathematical proofs to practical coding exercises, structured mini-projects, and open-ended thesis-level research ideas. The exercises are aligned with the core material in Sections 1–7 and are designed to suit:

    • MSc / early PhD students preparing for research, qualifying exams, or publications

    • Advanced undergraduates building strong probabilistic foundations

    • Industry AI engineers / data scientists implementing uncertainty-aware systems

    • Professors / lecturers seeking assignments, lab exercises, or capstone/thesis topics

    All activities emphasize mathematical rigor, computational reproducibility, and real-world relevance (e.g., LLM calibration, safe decision-making, Bayesian optimization at scale).

    8.1 Conceptual & Proof-Based Questions

    Purpose: Solidify core probabilistic reasoning, understand key derivations, and prepare for research interviews, exams, or paper discussions.

    Short conceptual questions (quiz / interview / discussion style)

    1. Explain the difference between aleatoric and epistemic uncertainty. Give one concrete example in each category from modern LLMs (e.g., token generation vs. factual hallucination).

    2. Why does the marginal likelihood (evidence) act as an automatic Occam’s razor in Bayesian model selection? Illustrate with a simple example (e.g., polynomial regression of different degrees).

    3. Describe why conjugate priors lead to closed-form posterior updates. Why is this property so valuable in online / streaming Bayesian learning?

    4. In variational inference, why does the KL(q||p) term in the ELBO act as a regularizer? What happens to the posterior approximation when β > 1 in β-VAE?

    5. Explain why semantic entropy is often more reliable than token-level entropy for detecting hallucinations in LLMs. What assumption does it make about semantic equivalence?

    6. Why is Thompson sampling asymptotically optimal in multi-armed bandits while ε-greedy is not? Link your answer to posterior sampling and exploration–exploitation balance.

    7. In Bayesian optimization, why does the Expected Improvement (EI) acquisition function naturally balance exploration and exploitation?

    8. Describe one reason why deep ensembles often provide better calibrated uncertainty estimates than MC Dropout in deep neural networks.

    9. Why do misspecified models (wrong likelihood or prior) still produce reasonable predictions under power posterior or density power divergence methods?

    10. In safe exploration for RL, why does posterior sampling (PSRL) tend to be more effective than optimistic methods like UCB in partially observable or sparse-reward settings?

    Proof / derivation questions (homework / exam / qualifying level)

    1. Derive the closed-form posterior for Bayesian linear regression with Gaussian prior and Gaussian likelihood (known variance). Show the expressions for posterior mean and covariance.

    2. Prove that the Beta distribution is conjugate to the Bernoulli likelihood: start from prior Beta(α, β), update with s successes and f failures, and show posterior is Beta(α+s, β+f).

    3. Derive the ELBO for mean-field variational inference in a simple Bayesian linear regression model. Show how it decomposes into expected log-likelihood and KL divergence terms.

    4. Show that the Bayes optimal action under 0-1 loss is the mode of the posterior predictive distribution (MAP prediction).

    5. Derive the update equations for α and β in the evidence approximation for Bayesian linear regression (Type-II ML / empirical Bayes).

    6. Prove that Thompson sampling in multi-armed bandits achieves logarithmic regret (sketch the key steps from Agrawal & Goyal or Chapelle & Li).

    7. Show why the junction tree algorithm computes exact marginals on arbitrary graphs by transforming them into a tree of cliques (high-level sketch is acceptable).
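    As a compact worked sketch of derivation question 2 above, the Beta-Bernoulli conjugate update follows in two lines:

```latex
p(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}, \qquad
p(\mathcal{D}\mid\theta) = \theta^{s}(1-\theta)^{f}
```

```latex
p(\theta\mid\mathcal{D}) \propto \theta^{\alpha+s-1}(1-\theta)^{\beta+f-1}
\;\Longrightarrow\;
\theta\mid\mathcal{D} \sim \mathrm{Beta}(\alpha+s,\;\beta+f)
```

    Because the posterior has the same functional form as the prior, only the counts (α, β) need to be stored and incremented, which is exactly the closed-form online update asked about in conceptual question 3.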

    8.2 Coding Exercises

    Language: Python (PyTorch 2.6+, NumPyro/JAX, PyMC, BoTorch where relevant). Use GPU if available.

    Exercise 1 – Bayesian Linear Regression from scratch Implement Bayesian linear regression with known variance using conjugate updates.

    • Dataset: toy 1D regression (e.g., noisy sine wave) or Boston housing subset

    • Prior: Normal(0, τ₀²) on weights

    • Compute posterior mean & covariance analytically

    • Plot predictive distribution (mean ± 2 std) with uncertainty bands

    • Bonus: Implement evidence approximation (optimize α, β iteratively)
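    A minimal sketch of the core of Exercise 1, assuming a known noise precision β and an isotropic prior precision α = 1/τ₀² (the standard conjugate update S_N = (αI + βΦᵀΦ)⁻¹, m_N = βS_NΦᵀy); the evidence-approximation bonus is not shown:

```python
import numpy as np

def blr_posterior(Phi, y, alpha=1.0, beta=25.0):
    """Conjugate posterior over weights, N(m_N, S_N).

    alpha: prior precision (1 / tau0^2); beta: known noise precision.
    """
    S_N_inv = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ y
    return m_N, S_N

def predict(Phi_star, m_N, S_N, beta=25.0):
    """Predictive mean and variance at new design rows."""
    mean = Phi_star @ m_N
    var = 1.0 / beta + np.sum(Phi_star @ S_N * Phi_star, axis=1)
    return mean, var

# toy data: y = 2x + Gaussian noise, features [1, x]
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 50)
y = 2 * x + rng.normal(0, 0.2, 50)
Phi = np.stack([np.ones_like(x), x], axis=1)
m_N, S_N = blr_posterior(Phi, y)
# m_N recovers roughly [0, 2]; predictive variance grows away from the data
```

    Plotting `mean ± 2 * sqrt(var)` over a grid gives the uncertainty bands the exercise asks for.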

    Exercise 2 – Simple VAE for text generation Build a basic VAE for discrete text (character-level or small vocabulary).

    • Encoder: LSTM → μ, logvar

    • Reparameterization trick → sample z

    • Decoder: LSTM conditioned on z

    • Use Gumbel-softmax or straight-through for discrete latents

    • Train with β-annealing to avoid posterior collapse

    • Generate samples by sampling z ~ prior → decode autoregressively

    • Visualize latent space interpolations

    Exercise 3 – Thompson Sampling for multi-armed bandits Implement a Bernoulli multi-armed bandit environment.

    • Use Beta(1,1) prior for each arm

    • Run Thompson sampling: sample θ ~ Beta(α, β) for each arm → choose argmax θ

    • Compare cumulative regret vs. ε-greedy and UCB

    • Bonus: Extend to contextual bandits with Bayesian linear regression per arm
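    The Thompson-sampling core of Exercise 3 fits in a few lines of standard-library Python (the ε-greedy/UCB comparison and regret plots are left to the exercise):

```python
import random

def thompson_bandit(true_probs, horizon=5000, seed=0):
    """Bernoulli bandit with Beta(1,1) priors and Thompson sampling.

    Returns the pull counts per arm; the best arm should dominate.
    """
    rng = random.Random(seed)
    k = len(true_probs)
    alpha = [1] * k   # prior successes + 1
    beta = [1] * k    # prior failures + 1
    pulls = [0] * k
    for _ in range(horizon):
        # sample a plausible mean for each arm from its posterior
        theta = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: theta[i])
        reward = 1 if rng.random() < true_probs[arm] else 0
        alpha[arm] += reward          # conjugate Beta-Bernoulli update
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

pulls = thompson_bandit([0.3, 0.5, 0.7])
# the 0.7 arm receives the large majority of the 5000 pulls
```

    Exploration falls out automatically: an under-sampled arm has a wide posterior, so its sampled θ occasionally exceeds the incumbent's, which is the mechanism behind conceptual question 6.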

    Starter resources

    • NumPyro tutorials: Bayesian regression, VAE

    • BoTorch examples: Thompson sampling acquisition

    • Pyro VAE examples (for text character-level)

    8.3 Mini-Projects

    Duration: 3–10 weeks (individual or small team)

    Project A – Bayesian Hyperparameter Tuning Pipeline Goal: Build an end-to-end Bayesian optimization loop for tuning a small neural network or LLM.

    • Model: simple CNN on CIFAR-10 or LoRA fine-tuning on Llama-3.2-1B

    • Objective: validation accuracy or negative log-likelihood

    • Surrogate: GPyTorch or BoTorch GP

    • Acquisition: EI or Thompson sampling

    • Use Ax or BoTorch for the loop

    • Evaluate: compare against grid search, random search, Optuna TPE

    • Bonus: Add multi-fidelity (low-resolution → high-resolution training)

    Project B – Uncertainty-Aware LLM Rejection Rule Goal: Implement a production-style rejection mechanism for LLM outputs.

    • Model: Llama-3.1-8B-Instruct or Qwen2-7B-Instruct

    • Task: MMLU subset or TruthfulQA

    • UQ methods: semantic entropy + token entropy + verbalized confidence

    • Rejection rule: abstain if combined uncertainty > threshold

    • Evaluate: coverage vs. accuracy trade-off, hallucination reduction rate

    • Bonus: Use conformal prediction to set theoretically valid rejection thresholds
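    The conformal-threshold bonus in Project B reduces to a split-conformal quantile over an uncertainty score. A hedged sketch (the score here is hypothetical, e.g. semantic entropy on a calibration set of answers known to be correct):

```python
import math

def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal acceptance threshold: accept iff score <= threshold.

    cal_scores: uncertainty scores on a held-out calibration set of
    correct answers. Under exchangeability, correct answers are then
    accepted with probability >= 1 - alpha (marginal coverage).
    """
    n = len(cal_scores)
    # conformal quantile index: ceil((n + 1) * (1 - alpha)), clipped to n
    k = math.ceil((n + 1) * (1 - alpha))
    return sorted(cal_scores)[min(k, n) - 1]

cal = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
t = conformal_threshold(cal, alpha=0.2)
# at deployment: answer if uncertainty <= t, abstain otherwise
```

    The appeal over a hand-tuned threshold is that the coverage guarantee is distribution-free, so the rejection rule stays valid without modeling the score distribution.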

    Project C – Bayesian Optimization Loop from Scratch Goal: Implement a complete BO system without high-level libraries.

    • Objective: synthetic 2D–6D test functions (Branin, Hartmann, Ackley)

    • Surrogate: GP regression (GPyTorch or tinygp)

    • Acquisition: Expected Improvement (analytical formula)

    • Run 50–100 iterations → plot regret vs. evaluations

    • Bonus: Add trust regions (TuRBO-style) or SAAS sparsity prior
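    The "analytical formula" step of Project C is the closed-form Expected Improvement under a Gaussian posterior, EI(x) = (f* − μ − ξ)Φ(z) + σφ(z) with z = (f* − μ − ξ)/σ for minimization. A standard-library sketch:

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def norm_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def expected_improvement(mu, sigma, best, xi=0.0):
    """Analytical EI for minimization: E[max(best - f - xi, 0)]
    with f ~ N(mu, sigma^2); best is the incumbent optimum."""
    if sigma <= 0:
        return max(best - mu - xi, 0.0)
    z = (best - mu - xi) / sigma
    return (best - mu - xi) * norm_cdf(z) + sigma * norm_pdf(z)

# a point with a worse mean but high sigma still earns EI (exploration),
# while a slightly better mean with low sigma is exploited
ei_explore = expected_improvement(mu=1.0, sigma=2.0, best=0.5)
ei_exploit = expected_improvement(mu=0.2, sigma=0.1, best=0.5)
```

    The two terms answer conceptual question 7 directly: the first rewards points whose posterior mean beats the incumbent (exploitation), the second rewards posterior spread (exploration).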

    8.4 Advanced / Thesis-Level Project Ideas

    Suitable for MSc thesis, PhD qualifying projects, research internships, or conference submissions (6–24 months)

    1. Uncertainty-Guided Self-Refine for Reasoning in LLMs Integrate semantic entropy + last-layer Laplace uncertainty into self-refine loops. Trigger refinement steps only when epistemic uncertainty is high. Evaluate on GSM8K, MATH, GPQA Diamond — measure accuracy gain vs. compute cost.

    2. Robust Bayesian Fine-Tuning under Distribution Shift Develop power-posterior or density power divergence-based fine-tuning for LLMs on shifted data (domain adaptation, continual learning). Compare robustness vs. standard SFT + LoRA on synthetic and real shifts (e.g., news → medical text).

    3. Scalable Bayesian Optimization with Deep Kernel Learning & Trust Regions Combine SAAS-BO sparsity + deep kernels + TuRBO trust regions for high-dimensional hyperparameter tuning of large models (e.g., LoRA ranks, learning rates, quantization bits for 70B+ LLMs). Benchmark against Ax/BoTorch defaults.

    4. Conformal Prediction Sets for Sequence-Level LLM Outputs Construct conformal prediction sets for full answers (not just tokens) via rejection sampling or beam search. Achieve marginal coverage guarantees on MMLU, TruthfulQA, HaluEval. Analyze set size vs. coverage trade-off.

    5. Bayesian Nonparametric Priors for LLM Syntax & Semantics Explore hierarchical Pitman-Yor or neural Dirichlet process priors over syntax trees or semantic roles in LLM hidden states. Can they improve compositionality or generalization on long-context reasoning tasks?

    6. Safe Exploration in Bayesian Deep RL with Posterior Sampling Implement posterior sampling for deep model-based RL (e.g., PETS-style with Bayesian ensembles). Evaluate safe exploration (avoidance of catastrophic states) on MuJoCo or safety-gym environments.

    Suggested evaluation rubric for advanced projects

    • Theoretical contribution / novelty (new bound, method, analysis) — 30%

    • Implementation quality & reproducibility (clean code, seeds, ablations) — 25%

    • Empirical rigor (multiple runs, statistical tests, baselines) — 25%

    • Ethical & safety discussion (bias, misuse, energy, societal impact) — 10%

    • Clarity of write-up (paper-quality structure & presentation) — 10%

    These activities scale from classroom assignments to submissions at NeurIPS, ICML, ICLR, UAI, AISTATS, CoRL, or safety-focused workshops.

