All my books are exclusively available on Amazon. The free notes/materials on globalcodemaster.com do NOT match even 1% with any of my published books. Similar topics ≠ same content. Books have full details, exercises, chapters & structure — website notes do not. No book content is shared here. We fully comply with Amazon policies.


Calculus & Optimization for Machine Learning: Gradients, Hessians & Convergence Techniques

TABLE OF CONTENTS

1. Introduction to Calculus in Machine Learning
   1.1 Why calculus matters in ML: From loss landscapes to training dynamics
   1.2 Mathematical optimization as the core of supervised/unsupervised learning
   1.3 Taxonomy: First-order vs second-order vs zeroth-order methods
   1.4 Loss functions, objective landscapes, and typical challenges (non-convexity, saddle points, ill-conditioning)
   1.5 Notation and review of vector/matrix calculus basics

2. Univariate and Multivariate Differential Calculus
   2.1 Limits, continuity, and differentiability
   2.2 Derivatives, tangent lines, and directional derivatives
   2.3 Partial derivatives and the gradient vector ∇f
   2.4 Higher-order derivatives: second derivatives, curvature intuition
   2.5 Chain rule in single-variable and multivariable settings
   2.6 Taylor series expansions (first-, second-, and higher-order)
   2.7 Linear and quadratic approximations of functions

3. Vector and Matrix Calculus Essentials for ML
   3.1 Gradient of scalar-valued functions: ∇_θ J(θ)
   3.2 Jacobian matrix (gradients of vector-valued functions)
   3.3 Hessian matrix H = ∇²J(θ): definition, symmetry, interpretation as curvature
   3.4 Gradient and Hessian of common ML building blocks (linear layers, activations, norms)
   3.5 Useful identities: matrix calculus cheat-sheet (trace trick, denominator layout vs numerator layout)
   3.6 Automatic differentiation vs symbolic vs finite differences

4. First-Order Optimization: Gradient-Based Methods
   4.1 Gradient descent (GD): vanilla / batch gradient descent
   4.2 Convergence analysis: Lipschitz smooth & strongly convex functions
   4.3 Step-size selection: fixed, diminishing, backtracking line search
   4.4 Momentum & acceleration: Polyak momentum, Nesterov accelerated gradient (NAG)
   4.5 Adaptive gradient methods: AdaGrad, RMSProp, Adam (& variants: AdamW, AMSGrad)
   4.6 Practical considerations: learning rate schedules, warm-up, gradient clipping

5. Stochastic and Mini-Batch Gradient Methods
   5.1 From batch GD → stochastic GD (SGD)
   5.2 Mini-batch gradient descent: bias-variance trade-off
   5.3 Noise as implicit regularization
   5.4 Convergence of SGD: Robbins-Monro conditions, sublinear rates
   5.5 Variance reduction techniques: SVRG, SAGA, SARAH
   5.6 Large-batch training phenomena and scaling rules

6. Second-Order Optimization: Hessian-Aware Methods
   6.1 Newton's method: quadratic local model and update rule
   6.2 Modified Newton / damped Newton / Levenberg-Marquardt
   6.3 Trust-region methods
   6.4 Quasi-Newton methods: BFGS, L-BFGS (limited-memory version)
   6.5 Natural gradient descent & Fisher information matrix approximation
   6.6 Hessian-free optimization (Martens-style, CG solver)
   6.7 Kronecker-factored approximate curvature (K-FAC)
   6.8 Practical second-order methods in deep learning: pros, cons, when to use

7. Convergence Theory and Analysis
   7.1 Convex smooth optimization: rates for gradient descent family
   7.2 Strongly convex case: linear convergence
   7.3 Non-convex optimization: stationarity, escape from saddle points
   7.4 Stochastic convergence: almost sure convergence, expected suboptimality
   7.5 Complexity lower bounds and optimal methods
   7.6 Influence of condition number, Lipschitz constants, smoothness
   7.7 Landscape pathologies in deep learning: sharp vs flat minima, mode connectivity

8. Advanced Convergence Techniques and Modern Variants
   8.1 Learning-rate-free / adaptive methods: Adadelta, Yogi, Sophia
   8.2 Lookahead, RAdam, Ranger, Lion
   8.3 Barzilai-Borwein step size & spectral methods
   8.4 Polyak-Łojasiewicz (PL) inequality and beyond
   8.5 Heavy-ball momentum under non-convexity
   8.6 Proximal methods and regularization in optimization view
   8.7 Distributed / parallel optimization: data-parallel, model-parallel implications

9. Backpropagation and Automatic Differentiation Revisited
   9.1 Computational graph and reverse-mode AD
   9.2 Gradient computation through chain rule
   9.3 Vector-Jacobian products (VJP) and Jacobian-vector products (JVP)
   9.4 Efficient Hessian-vector products (finite differences vs AD tricks)
   9.5 Higher-order derivatives in AD frameworks (Hessian-free, K-FAC)

10. Special Topics and Frontiers (2025–2026 Perspective)
   10.1 Sharpness-aware minimization (SAM) and related methods
   10.2 Second-order preconditioning in large language models
   10.3 Grokking and optimization dynamics
   10.4 Implicit bias of optimizers (SGD vs Adam)
   10.5 When do second-order methods outperform first-order in practice?
   10.6 Open challenges: scaling second-order methods to trillion-parameter models

Appendices
   A. Matrix calculus reference tables and identities
   B. Common loss functions and their gradients / Hessians
   C. Proofs of key convergence theorems
   D. Implementation notes (PyTorch / JAX examples for gradients, Hessians, custom optimizers)
   E. Recommended reading & seminal papers

1. Introduction to Calculus in Machine Learning

Calculus is the mathematical foundation that powers almost every modern machine learning algorithm. At its heart, training a model means minimizing a loss function — a high-dimensional surface that tells us “how bad” our current parameters are. Calculus gives us the tools to navigate this surface intelligently: measuring slopes (gradients), curvatures (Hessians), and predicting how the surface changes as we move.

1.1 Why calculus matters in ML: From loss landscapes to training dynamics

Imagine training a neural network. The loss landscape is the graph of the loss function J(θ), where θ contains millions or billions of parameters (weights and biases). In 2D it looks like hilly terrain; in millions of dimensions it is an unimaginably complex hypersurface.

  • Training dynamics = how the parameters θ \theta θ move downhill on this surface.

  • Without calculus, we would have to guess directions randomly (extremely inefficient).

  • With calculus, we compute the direction of steepest descent (negative gradient) and take controlled steps.

Concrete example: linear regression. We want to fit a line ŷ = wx + b to data points (xᵢ, yᵢ).

Loss (Mean Squared Error):

J(w, b) = \frac{1}{2n} \sum_{i=1}^n \bigl(y_i - (w x_i + b)\bigr)^2

The surface J(w, b) is a paraboloid (bowl shape). The lowest point is the best line.

The gradient tells us the slope:

\nabla J = \begin{pmatrix} \frac{\partial J}{\partial w} \\ \frac{\partial J}{\partial b} \end{pmatrix} = \begin{pmatrix} \frac{1}{n} \sum_i (\hat{y}_i - y_i)\, x_i \\ \frac{1}{n} \sum_i (\hat{y}_i - y_i) \end{pmatrix}

If ∂J/∂w > 0, increasing w makes the loss worse, so we should decrease w.
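This gradient can be checked end-to-end. Below is a minimal NumPy sketch (the synthetic data, learning rate, and iteration count are illustrative assumptions, not from the text) that computes ∇J for the MSE loss and runs plain gradient descent until it recovers the true line:

```python
import numpy as np

# Synthetic data from a known line y = 3x + 1 (illustrative assumption)
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
y = 3.0 * x + 1.0

def loss(w, b):
    # J(w, b) = (1/2n) * sum_i (y_i - (w x_i + b))^2
    return 0.5 * np.mean((y - (w * x + b)) ** 2)

def grad(w, b):
    err = (w * x + b) - y                  # yhat_i - y_i
    return np.mean(err * x), np.mean(err)  # (dJ/dw, dJ/db)

w, b = 0.0, 0.0
for _ in range(500):                       # plain gradient descent
    gw, gb = grad(w, b)
    w -= 0.5 * gw
    b -= 0.5 * gb
```

Because the data are noiseless and the surface is a convex bowl, gradient descent drives (w, b) to the true (3, 1).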

In deep learning the landscape is non-convex (many hills and valleys), so calculus becomes even more critical for escaping poor regions.

Visualization intuition: Think of a hiker on a mountain at night. The gradient is the direction the slope feels steepest under their feet. Gradient descent is repeatedly stepping in the opposite direction until reaching a valley.

1.2 Mathematical optimization as the core of supervised/unsupervised learning

Virtually every ML task reduces to solving:

\theta^* = \arg\min_{\theta} J(\theta)

where J(θ) is the objective (loss) function.

Supervised learning

  • Regression: minimize MSE / Huber loss

  • Classification: minimize cross-entropy

  • Example: Neural net for image classification

J(\theta) = -\frac{1}{n} \sum_{i=1}^n \sum_{k=1}^C y_{i,k} \log\bigl(\hat{y}_{i,k}(\theta)\bigr)

Unsupervised learning

  • K-means: minimize within-cluster sum of squares

  • Autoencoder: minimize reconstruction error

  • GANs: min-max game (two coupled optimizations)

Even reinforcement learning (policy gradients) and generative models (VAEs) ultimately rely on gradient-based optimization.

Key insight: No matter how complex the model (linear regression → transformer with trillions of parameters), the training loop is always:

  1. Forward pass → compute loss

  2. Backward pass → compute gradients via calculus

  3. Update parameters using those gradients
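The three steps above can be sketched concretely. The toy below (a NumPy logistic-regression loop; the data and hyperparameters are illustrative assumptions) makes the forward pass, backward pass, and parameter update explicit:

```python
import numpy as np

# Toy binary classification; data and hyperparameters are illustrative
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X @ np.array([2.0, -1.0]) > 0).astype(float)   # separable labels

theta = np.zeros(2)
lr = 0.5
for step in range(300):
    # 1. Forward pass -> compute loss
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))           # predicted probabilities
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    # 2. Backward pass -> gradient of cross-entropy: (1/n) X^T (p - y)
    g = X.T @ (p - y) / len(y)
    # 3. Update parameters using the gradient
    theta -= lr * g
```

The same loop shape survives all the way up to transformers; only the forward/backward computations grow.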

1.3 Taxonomy: First-order vs second-order vs zeroth-order methods

Optimization algorithms are classified by how much information about the loss surface they use:

| Method Type | Information Used | Typical Algorithms | Speed per Iteration | Memory | Convergence Speed | Use-case in ML |
|---|---|---|---|---|---|---|
| Zeroth-order | Only function values J(θ) | Random search, evolution strategies, Bayesian optimization, SPSA | Fast | Low | Very slow | Hyperparameter tuning, black-box problems |
| First-order | Gradient ∇J(θ) | SGD, Adam, RMSProp, Nesterov | Fast | Low | Good | 99% of deep learning today |
| Second-order | Gradient + Hessian ∇²J | Newton, L-BFGS, K-FAC, natural gradient | Slow | High | Excellent locally | Small models, fine-tuning, research |

Zeroth-order example. No derivatives: estimate the gradient by trying small perturbations:

g_i \approx \frac{J(\theta + \epsilon e_i) - J(\theta)}{\epsilon}

First-order (most popular). Only needs ∇J. Cheap because of automatic differentiation (backpropagation).

Second-order. Uses curvature (the Hessian) to take “smarter” steps that account for how the slope itself changes. Newton update:

\theta_{t+1} = \theta_t - \eta H^{-1} \nabla J(\theta_t)

where H is the Hessian. Much faster convergence near minima, but computationally expensive.
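The contrast is easiest to see on a quadratic bowl, where Newton's method lands on the minimum in one step while gradient descent needs many. A small NumPy sketch (the matrix A and starting point are illustrative assumptions):

```python
import numpy as np

# Quadratic bowl J(θ) = 0.5 θᵀAθ with minimum at θ = 0 (A is illustrative)
A = np.array([[100.0, 0.0],
              [0.0,   1.0]])               # condition number κ = 100
theta0 = np.array([1.0, 1.0])

# Newton step: θ₁ = θ₀ − H⁻¹∇J(θ₀), with ∇J(θ) = Aθ and H = A
newton = theta0 - np.linalg.solve(A, A @ theta0)   # lands exactly at 0

# Gradient descent with step 1/λmax stays stable but crawls
theta = theta0.copy()
for _ in range(100):
    theta -= 0.01 * (A @ theta)
```

After 100 gradient steps the slow coordinate has only shrunk by a factor of 0.99¹⁰⁰ ≈ 0.37, while the single Newton step is already exact.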

1.4 Loss functions, objective landscapes, and typical challenges

Common loss functions and their shapes

  1. Quadratic (MSE): J(θ) = ½‖y − Xθ‖² → perfectly convex bowl.

  2. Cross-entropy (classification) Non-convex in neural nets.

  3. Hinge loss (SVM) Piecewise linear.

Major challenges in real loss landscapes

a) Non-convexity. Deep nets have countless local minima and saddle points. Example: even a simple 2-layer network can have saddle points where the gradient is zero but the point is not a minimum.

b) Saddle points. Points where ∇J = 0 but the Hessian has both positive and negative eigenvalues. In high dimensions (>1000), almost all critical points are saddles, not local minima (Dauphin et al., 2014).

c) Ill-conditioning. The condition number κ = λ_max(H) / λ_min(H) is huge (10⁶+ in deep nets). Consequence: gradient steps oscillate or crawl along narrow valleys. Classic example: the Rosenbrock “banana” function

J(x, y) = 100\,(y - x^2)^2 + (1 - x)^2

The valley is extremely narrow → vanilla GD zig-zags for thousands of steps.
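This crawl is easy to reproduce. The NumPy sketch below (the starting point, step size, and iteration count are illustrative assumptions) runs vanilla gradient descent on the Rosenbrock function; the step must be tiny to stay stable, and even after thousands of steps the iterate is nowhere near the minimum (1, 1):

```python
import numpy as np

def rosenbrock(p):
    x, y = p
    return 100.0 * (y - x**2) ** 2 + (1.0 - x) ** 2

def rosenbrock_grad(p):
    # analytic gradient of the Rosenbrock function
    x, y = p
    return np.array([-400.0 * x * (y - x**2) - 2.0 * (1.0 - x),
                     200.0 * (y - x**2)])

p = np.array([-1.0, 1.0])                 # illustrative starting point
for _ in range(5000):
    p -= 1e-3 * rosenbrock_grad(p)        # step must be tiny or GD diverges
```

A second-order or momentum method traverses the same valley in far fewer iterations, which is the motivation for the later chapters.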

d) Flat regions / plateaus Gradients near zero → training stalls.

e) Sharp vs flat minima Modern generalization theory (Hochreiter & Schmidhuber, 1997; Keskar et al., 2017) suggests flat minima generalize better.

1.5 Notation and review of vector/matrix calculus basics

Basic notation

  • Scalar: a, J

  • Vector: x ∈ ℝᵈ (bold lowercase)

  • Matrix: X ∈ ℝ^{m×n} (bold uppercase)

  • Function: scalar-valued f: ℝᵈ → ℝ; vector-valued f: ℝᵈ → ℝᵏ

Gradient (first derivative)

\nabla f(\mathbf{x}) = \begin{pmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_d} \end{pmatrix}

Jacobian (for vector-valued functions)

\mathbf{J}_f = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_d} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_k}{\partial x_1} & \cdots & \frac{\partial f_k}{\partial x_d} \end{pmatrix}

Hessian (second derivative matrix, symmetric)

\mathbf{H} = \nabla^2 f(\mathbf{x}) = \left( \frac{\partial^2 f}{\partial x_i \partial x_j} \right)_{i,j=1}^d

Chain rule (backpropagation foundation). For z = f(g(x)):

\frac{\partial z}{\partial x} = \frac{\partial z}{\partial g} \cdot \frac{\partial g}{\partial x} \quad \text{(scalar case)}

Matrix form (most used in ML):

\nabla_{\mathbf{x}} f = \mathbf{J}_g^{T} \nabla_{g} f
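The matrix chain rule can be verified numerically. In the sketch below (the specific f and g are illustrative assumptions), g(x) = Ax so J_g = A, and f(u) = ‖u‖² so ∇_u f = 2u; the rule therefore predicts ∇_x f = Aᵀ(2Ax), which we compare against central finite differences:

```python
import numpy as np

# g: R^4 -> R^3 with g(x) = Ax; f(u) = ||u||^2 (illustrative choices)
rng = np.random.default_rng(2)
A = rng.normal(size=(3, 4))
x = rng.normal(size=4)

def f_of_x(v):
    return np.sum((A @ v) ** 2)            # f(g(x)) = ||Ax||^2

# Chain rule prediction: ∇_x f = J_gᵀ ∇_g f = Aᵀ (2Ax)
chain = A.T @ (2.0 * (A @ x))

# Central finite differences for comparison
eps = 1e-6
fd = np.array([(f_of_x(x + eps * e) - f_of_x(x - eps * e)) / (2 * eps)
               for e in np.eye(4)])
```

The two vectors agree to within finite-difference error, which is exactly the check automatic-differentiation frameworks use in their own gradient tests.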

Useful identities

  • Gradient through a linear layer: for a scalar loss L and z = Wx, ∂L/∂W = (∂L/∂z) xᵀ

  • Trace trick: ∂/∂A tr(AB) = Bᵀ

  • Quadratic form: ∇ₓ(xᵀAx) = (A + Aᵀ)x
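These identities are easy to sanity-check with finite differences. The sketch below (the random, deliberately non-symmetric A is an illustrative assumption) verifies the quadratic-form identity ∇ₓ(xᵀAx) = (A + Aᵀ)x:

```python
import numpy as np

# Check ∇_x (xᵀAx) = (A + Aᵀ)x with a deliberately non-symmetric A
rng = np.random.default_rng(3)
A = rng.normal(size=(5, 5))
x = rng.normal(size=5)

def quad(v):
    return v @ A @ v                        # scalar xᵀAx

analytic = (A + A.T) @ x                    # the claimed gradient

# Central finite differences for comparison
eps = 1e-6
fd = np.array([(quad(x + eps * e) - quad(x - eps * e)) / (2 * eps)
               for e in np.eye(5)])
```

Using a non-symmetric A matters here: it confirms the (A + Aᵀ) term rather than the common special case 2Ax, which holds only for symmetric A.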

Example: Gradient of cross-entropy + softmax

Let z be the logits, p = softmax(z), and y the one-hot true label. Then

\frac{\partial J}{\partial \mathbf{z}} = \mathbf{p} - \mathbf{y}

This simple closed form is why cross-entropy + softmax is so popular.
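The closed form p − y can be confirmed numerically. A small sketch (the logits and label are illustrative assumptions) compares it against central finite differences of the cross-entropy loss:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                         # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, -1.0, 0.5])              # illustrative logits
y = np.array([0.0, 1.0, 0.0])               # one-hot true label

analytic = softmax(z) - y                   # claimed ∂J/∂z = p − y

# Central finite differences for comparison
eps = 1e-6
fd = np.array([(cross_entropy(z + eps * e, y) - cross_entropy(z - eps * e, y)) / (2 * eps)
               for e in np.eye(3)])
```

The agreement is what makes this pairing so convenient: the backward pass through softmax plus cross-entropy is a single subtraction.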

Review summary table

| Concept | Symbol | What It Measures | ML Meaning |
|---|---|---|---|
| Gradient | ∇J | Direction of steepest ascent | Learning signal |
| Hessian | ∇²J | Curvature (second-order change) | Step-size adaptation, trust region |
| Jacobian | J | How a vector output changes w.r.t. the input | Backprop through layers |


With this foundation in place, the rest of the module will build first-order methods (Chapters 4–5), second-order methods (Chapter 6), and rigorous convergence guarantees (Chapter 7). You now understand why every line of PyTorch/TensorFlow/JAX training code ultimately rests on these calculus tools.


2. Univariate and Multivariate Differential Calculus

This chapter reviews the core concepts of differentiation in one and multiple variables — the foundation for understanding gradients, Hessians, and optimization in machine learning. We progress from basic ideas (limits → derivatives) to multivariable tools (partial derivatives → gradients → Taylor approximations) that are directly used in backpropagation and loss landscape analysis.

2.1 Limits, continuity, and differentiability

Limits describe the behavior of a function as the input approaches a value.

Definition (univariate): lim_{h→0} f(x₀ + h) = L means that f(x₀ + h) gets arbitrarily close to L as h gets arbitrarily small.

Continuity: f is continuous at x₀ if lim_{x→x₀} f(x) = f(x₀) (i.e., limit exists and equals the function value).

Differentiability is stronger: f is differentiable at x₀ if the following limit exists (and is finite): f'(x₀) = lim_{h→0} [f(x₀ + h) - f(x₀)] / h

Key implications:

  • Differentiability ⇒ continuity (but not vice versa — e.g., f(x) = |x| is continuous everywhere but not differentiable at x=0).

  • In ML, we almost always assume loss functions are differentiable (or at least almost everywhere differentiable).

Example: f(x) = x² is differentiable everywhere: f'(x) = lim_{h→0} [(x+h)² - x²]/h = lim_{h→0} (2xh + h²)/h = 2x
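The limit can also be watched numerically: for f(x) = x² the forward difference quotient equals 2x + h exactly, so shrinking h walks it toward f'(x) = 2x. A short sketch (the evaluation point x₀ = 3 is an illustrative assumption):

```python
def diff_quotient(f, x, h):
    # forward difference quotient [f(x + h) - f(x)] / h
    return (f(x + h) - f(x)) / h

f = lambda t: t ** 2
x0 = 3.0
approx = [diff_quotient(f, x0, h) for h in (1.0, 0.1, 0.01, 0.001)]
# for f(x) = x² the quotient is exactly 2x + h, so it tends to f'(3) = 6
```

Each tenfold reduction in h cuts the error tenfold, matching the O(h) truncation error of the forward difference.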

2.2 Derivatives, tangent lines, and directional derivatives

Derivative f'(x) = rate of change = slope of the tangent line at x.

The tangent line at (x₀, f(x₀)) is: y = f(x₀) + f'(x₀)(x - x₀)

(Typical sketches show the tangent line approximating the curve locally; for a non-linear function the tangent may cross the curve elsewhere.)

Directional derivative (univariate preview → full in multivariable): In one dimension there's only left/right, but the idea generalizes.

2.3 Partial derivatives and the gradient vector ∇f

For f: ℝⁿ → ℝ (scalar-valued function of many variables), e.g., loss J(θ₁, θ₂, ..., θ_d).

Partial derivative with respect to θᵢ (treating others as constant): ∂f/∂θᵢ = lim_{h→0} [f(θ₁,...,θᵢ+h,...,θ_d) - f(θ)] / h

Gradient vector collects all partials: ∇f(θ) = ( ∂f/∂θ₁ , ∂f/∂θ₂ , ..., ∂f/∂θ_d )ᵀ ∈ ℝᵈ

Geometric meaning: ∇f points in the direction of steepest ascent of f. Its magnitude ‖∇f‖ is the rate of that steepest increase.

Directional derivative in direction of unit vector û: Dᵤ f = ∇f · û = ‖∇f‖ cos φ Maximum when û = ∇f / ‖∇f‖ (→ steepest ascent = ‖∇f‖) Minimum when û = -∇f / ‖∇f‖ (→ steepest descent = -‖∇f‖)

Here is a classic visualization of gradient and directional derivative on a surface:

medium.com

geogebra.org

(Gradient arrow points uphill; directional derivative measures slope along any chosen direction.)

ML example: For MSE loss J(w,b) = (1/(2n)) Σ (yᵢ - (w xᵢ + b))² ∂J/∂w = (1/n) Σ (ŷᵢ - yᵢ) xᵢ ∂J/∂b = (1/n) Σ (ŷᵢ - yᵢ) → ∇J = [average error × feature, average error]ᵀ

2.4 Higher-order derivatives: second derivatives, curvature intuition

Second partials: ∂²f/∂θᵢ∂θⱼ (Hessian entries)

Hessian matrix (symmetric for twice continuously differentiable f): H_f(θ) = [ ∂²f / ∂θᵢ ∂θⱼ ]_{i,j}

Curvature intuition:

  • Positive definite Hessian → local minimum (bowl shape, curves up in all directions)

  • Negative definite → local maximum

  • Indefinite → saddle point (up in some directions, down in others)

  • In high dimensions, most critical points are saddles.

Example — Rosenbrock function (classic ill-conditioned test): f(x,y) = 100(y - x²)² + (1 - x)² Gradient vanishes at (1,1), Hessian shows very narrow valley → poor conditioning.

youtube.com

❖ Sketching the Derivative of a Function ❖

(Contours elongated → gradient descent zigzags slowly.)

2.5 Chain rule in single-variable and multivariable settings

Single variable: If z = f(g(x)), then dz/dx = (df/dg) × (dg/dx)

Multivariable chain rule (most important for backpropagation):

Case 1: z = f(u,v), where u = g(x), v = h(x) dz/dx = (∂f/∂u)(du/dx) + (∂f/∂v)(dv/dx)

Case 2: vector input — let θ ∈ ℝᵈ, z = f(g(θ)) ∇_θ z = J_g(θ)ᵀ ∇_g z (where J_g is the Jacobian of g)

In neural networks: each layer applies chain rule → backprop computes ∇J w.r.t. all parameters efficiently via reverse-mode AD.

2.6 Taylor series expansions (first-, second-, and higher-order)

Taylor theorem approximates smooth functions locally with polynomials.

Univariate around x₀: f(x) ≈ f(x₀) + f'(x₀)(x - x₀) + (1/2)f''(x₀)(x - x₀)² + (1/6)f'''(ξ)(x - x₀)³ + ...

Multivariate around θ₀: f(θ) ≈ f(θ₀) + ∇f(θ₀)ᵀ (θ - θ₀) + (1/2)(θ - θ₀)ᵀ H_f(θ₀) (θ - θ₀) + higher terms

First-order (linear): tangent plane/hyperplane Second-order (quadratic): best local parabolic approximation

Here are visual examples of quadratic approximations matching curves/surfaces:

youtube.com

statisticshowto.com

davidlowryduda.com

(Quadratic term captures curvature; error decreases rapidly near expansion point.)

2.7 Linear and quadratic approximations of functions

Linear approximation (first-order Taylor): f(θ) ≈ f(θ₀) + ∇f(θ₀)ᵀ (θ - θ₀) → tangent plane in multivariable case

Visual example of tangent plane to a surface:

youtube.com

justtothepoint.com

Quadratic approximation: f(θ) ≈ f(θ₀) + ∇fᵀ Δθ + (1/2) Δθᵀ H Δθ Used in:

  • Newton's method (sets quadratic ≈ 0 to find step)

  • Trust-region methods

  • Approximating loss landscape locally

  • Understanding why Adam/RMSProp adapt per parameter (approximate diagonal Hessian)

ML relevance summary:

  • Gradient = first-order local info → direction to descend

  • Hessian = second-order → curvature → how aggressively to step, preconditioning

  • Taylor → justification for gradient descent (follow linear piece downhill) and Newton (follow quadratic minimum)

These tools let us move from "what is the slope?" to "how curved is the bowl, and in which directions?" — exactly what first- and second-order optimizers exploit in Chapters 4–6.

Next sections will apply these ideas to concrete optimization algorithms.

3.3s

Fast

please do not create image file beacuase my web page does not support image file directly please write text file only

2. Univariate and Multivariate Differential Calculus

This chapter reviews differentiation in one and several variables — the foundation for gradients, Hessians, and the loss-landscape analysis used throughout machine learning — developed through worked equations and examples.

2.1 Limits, continuity, and differentiability

Limits tell us what value a function approaches as the input gets close to a specific point.

Univariate limit example: Consider f(x) = (x² − 1)/(x − 1) As x approaches 1, direct substitution gives 0/0 (undefined), but we can simplify: f(x) = (x − 1)(x + 1)/(x − 1) = x + 1 (for x ≠ 1) So lim_{x→1} f(x) = 2

Continuity at a point x₀: Three conditions must hold:

  1. f(x₀) is defined

  2. lim_{x→x₀} f(x) exists

  3. lim_{x→x₀} f(x) = f(x₀)

Differentiability at x₀ requires the derivative limit to exist: f'(x₀) = lim_{h→0} [f(x₀ + h) − f(x₀)] / h

Classic non-differentiable but continuous example: f(x) = |x| At x = 0: Right derivative = lim_{h→0⁺} ( |h| − 0 )/h = 1 Left derivative = lim_{h→0⁻} ( |h| − 0 )/h = −1 Derivatives don’t match → not differentiable at 0 (sharp corner), but continuous everywhere.

In machine learning, most loss functions (MSE, cross-entropy, etc.) are differentiable almost everywhere, except possibly at a few points (e.g., ReLU has a kink at 0, but subgradients handle it).
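The |x| example above can be checked numerically: the two one-sided difference quotients disagree at 0. A minimal sketch (the helper names are illustrative):

```python
def right_deriv(f, x, h=1e-6):
    # one-sided forward difference quotient
    return (f(x + h) - f(x)) / h

def left_deriv(f, x, h=1e-6):
    # one-sided backward difference quotient
    return (f(x) - f(x - h)) / h

print(right_deriv(abs, 0.0))   # ≈ 1.0
print(left_deriv(abs, 0.0))    # ≈ -1.0  → the limits disagree at the kink
```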

2.2 Derivatives, tangent lines, and directional derivatives

Derivative f'(x) = instantaneous rate of change = slope of the tangent line.

Tangent line equation at point (x₀, f(x₀)): y = f(x₀) + f'(x₀) (x − x₀)

Numerical example: f(x) = x³ − 2x + 1 f'(x) = 3x² − 2 At x₀ = 1: f(1) = 1 − 2 + 1 = 0, f'(1) = 3 − 2 = 1 Tangent line: y = 0 + 1·(x − 1) = x − 1

Directional derivative preview (full treatment in 2.3): In 1D, only two directions exist: positive (along +1) or negative (along −1). The derivative f'(x₀) is the directional derivative in the +x direction.

2.3 Partial derivatives and the gradient vector ∇f

For a function of several variables, e.g., f(x, y) = x²y + 3xy³ + sin(x)

Partial with respect to x (treat y constant): ∂f/∂x = 2xy + 3y³ + cos(x)

Partial with respect to y (treat x constant): ∂f/∂y = x² + 9xy²

Gradient vector: ∇f(x,y) = ( ∂f/∂x , ∂f/∂y ) = (2xy + 3y³ + cos(x), x² + 9xy²)

Geometric interpretation:

  • Direction of ∇f = direction of fastest increase of f

  • Magnitude ‖∇f‖ = steepest rate of increase

  • Therefore −∇f = direction of fastest decrease (used in gradient descent)

Directional derivative in direction of unit vector u = (u₁, u₂): D_u f = ∇f · u = (∂f/∂x) u₁ + (∂f/∂y) u₂

Maximum value occurs when u points exactly along ∇f: max D_u f = ‖∇f‖ Minimum (steepest descent) = −‖∇f‖

Machine learning example — linear regression loss J(w, b) = (1/2) Σ (yᵢ − (w xᵢ + b))² (dropping 1/n for simplicity)

∂J/∂w = Σ (w xᵢ + b − yᵢ) xᵢ ∂J/∂b = Σ (w xᵢ + b − yᵢ)

∇J = [ Σ(errorᵢ · xᵢ) , Σ(errorᵢ) ]ᵀ This vector tells us exactly how much to change w and b to reduce loss fastest.
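The two partials above can be verified against central finite differences on synthetic data — a toy sketch (all data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20)
y = 2.0 * x + 0.5               # synthetic targets

def loss(w, b):
    # J(w, b) = ½ Σ (yᵢ − (w xᵢ + b))²   (1/n dropped, as in the text)
    return 0.5 * np.sum((y - (w * x + b))**2)

def grad(w, b):
    # analytic partials: ∂J/∂w = Σ errᵢ xᵢ, ∂J/∂b = Σ errᵢ
    err = w * x + b - y
    return np.array([np.sum(err * x), np.sum(err)])

w, b, eps = 1.0, 0.0, 1e-6
g_num = np.array([
    (loss(w + eps, b) - loss(w - eps, b)) / (2 * eps),
    (loss(w, b + eps) - loss(w, b - eps)) / (2 * eps),
])
print(np.allclose(grad(w, b), g_num, atol=1e-4))  # True: analytic = numeric
```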

2.4 Higher-order derivatives: second derivatives, curvature intuition

Second partial derivatives:

For f(x,y): ∂²f/∂x² , ∂²f/∂y² , ∂²f/∂x∂y = ∂²f/∂y∂x (equality holds if f is twice continuously differentiable)

Hessian matrix (2×2 case):
H = | ∂²f/∂x²   ∂²f/∂x∂y |
    | ∂²f/∂y∂x  ∂²f/∂y²  |

Curvature interpretation:

  • If H is positive definite (both eigenvalues > 0) → local minimum (curves upward in all directions)

  • If negative definite (both < 0) → local maximum

  • If indefinite (one positive, one negative eigenvalue) → saddle point

  • If singular (zero eigenvalue) → flat direction or inflection

Simple example — quadratic bowl: f(x,y) = x² + 4y²
∇f = (2x, 8y)
H = | 2  0 |
    | 0  8 |
→ positive definite → minimum at (0,0)

Ill-conditioned example — Rosenbrock function: f(x,y) = 100(y − x²)² + (1 − x)²
At the minimum (1,1):
H = |  802  −400 |
    | −400   200 |
Eigenvalues ≈ 1001.6 and ≈ 0.4 → condition number ≈ 2500
Very narrow curved valley → gradient descent takes tiny steps or oscillates.
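Differentiating f twice by hand gives the analytic Hessian entries, which NumPy can evaluate and diagonalize at (1,1) — an illustrative sketch:

```python
import numpy as np

def rosen_hess(x, y):
    # second partials of f(x, y) = 100(y − x²)² + (1 − x)²
    return np.array([
        [-400.0 * (y - x**2) + 800.0 * x**2 + 2.0, -400.0 * x],
        [-400.0 * x,                                200.0],
    ])

H = rosen_hess(1.0, 1.0)
eigs = np.linalg.eigvalsh(H)      # eigenvalues in ascending order
cond = eigs[-1] / eigs[0]         # λ_max / λ_min
print(H)                          # [[802, -400], [-400, 200]]
print(cond)                       # ≈ 2500: badly conditioned
```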

2.5 Chain rule in single-variable and multivariable settings

Single variable: z = f(g(x)) dz/dx = f'(g(x)) · g'(x)

Multivariable – scalar outer function: z = f(u,v), u = g(x,y), v = h(x,y) ∂z/∂x = (∂f/∂u)(∂u/∂x) + (∂f/∂v)(∂v/∂x) ∂z/∂y = (∂f/∂u)(∂u/∂y) + (∂f/∂v)(∂v/∂y)

Vector form (most relevant to backpropagation): Let θ ∈ ℝᵈ, let z = f(g(θ)) where g: ℝᵈ → ℝᵏ Then ∇_θ z = J_g(θ)ᵀ · ∇_g z (J_g is the k×d Jacobian matrix of g)

This matrix multiplication form is exactly how gradients flow backward through layers in neural networks.

Tiny example: Let z = sin(u) + u², u = x² + y ∂z/∂x = (cos(u) + 2u) · (2x) ∂z/∂y = (cos(u) + 2u) · 1

2.6 Taylor series expansions (first-, second-, and higher-order)

Univariate Taylor expansion around x₀: f(x) ≈ f(x₀) + f'(x₀)(x − x₀) + (1/2)f''(x₀)(x − x₀)² + (1/6)f'''(x₀)(x − x₀)³ + ...

Multivariate Taylor expansion around θ₀: f(θ) ≈ f(θ₀) + ∇f(θ₀)ᵀ (θ − θ₀) + (1/2) (θ − θ₀)ᵀ H_f(θ₀) (θ − θ₀) + higher-order remainder

Example – univariate: f(x) = e^x around x=0 f(0)=1, f'(0)=1, f''(0)=1, f'''(0)=1, ... e^x ≈ 1 + x + (1/2)x² + (1/6)x³ + ...

Example – quadratic approximation: f(x,y) = x² + xy + 3y² around (0,0)
f(0,0) = 0
∇f = (2x + y, x + 6y) → ∇f(0,0) = (0,0)
H = | 2  1 |
    | 1  6 |
So f(x,y) ≈ (1/2) [x y] H [x y]ᵀ = (1/2)(2x² + 2xy + 6y²) = x² + xy + 3y²

(The quadratic term exactly matches the function since it is already quadratic.)
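The e^x expansion above can be truncated at successive orders and compared against math.exp, showing the error shrinking with each added term (minimal sketch):

```python
import math

def taylor_exp(x, n):
    # degree-n Taylor polynomial of e^x around 0
    return sum(x**k / math.factorial(k) for k in range(n + 1))

x = 0.5
errors = [abs(taylor_exp(x, n) - math.exp(x)) for n in (1, 2, 3, 5)]
print(errors)   # monotonically shrinking error near the expansion point
```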

2.7 Linear and quadratic approximations of functions

Linear (first-order) approximation: f(θ) ≈ f(θ₀) + ∇f(θ₀)ᵀ (θ − θ₀) This is the equation of the tangent hyperplane.

Use in optimization: Gradient descent follows this linear piece: move in direction −∇f.

Quadratic (second-order) approximation: f(θ) ≈ f(θ₀) + ∇fᵀ Δθ + (1/2) Δθᵀ H Δθ (where Δθ = θ − θ₀)

Use in optimization:

  • Newton’s method finds the minimum of this quadratic approximation → sets gradient of quadratic to zero → Δθ = −H⁻¹ ∇f

  • Trust-region methods stay inside a region where the quadratic approximation is trusted

  • Explains adaptive step sizes: large curvature (large eigenvalue) → small step in that direction
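Because the quadratic example from 2.6 has a constant Hessian, a single Newton step Δθ = −H⁻¹ ∇f lands exactly on its minimizer — a minimal sketch:

```python
import numpy as np

H = np.array([[2.0, 1.0], [1.0, 6.0]])   # Hessian of f = x² + xy + 3y² (constant)
grad = lambda t: H @ t                    # ∇f = (2x + y, x + 6y) = Hθ

theta = np.array([3.0, -2.0])
step = -np.linalg.solve(H, grad(theta))   # Newton step: Δθ = −H⁻¹ ∇f
theta = theta + step
print(theta)   # ≈ [0, 0]: one Newton step minimizes a quadratic exactly
```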

Summary table of approximations

| Order | Approximation formula | Geometric meaning | Optimization use case |
|---|---|---|---|
| 0th | f(θ) ≈ f(θ₀) | Constant value | Zeroth-order methods |
| 1st | f(θ) ≈ f(θ₀) + ∇fᵀ Δθ | Tangent hyperplane | Gradient descent family |
| 2nd | f(θ) ≈ f(θ₀) + ∇fᵀ Δθ + (1/2) Δθᵀ H Δθ | Local paraboloid | Newton, quasi-Newton, trust-region |

These concepts — especially the gradient, Hessian, chain rule, and Taylor expansions — form the mathematical backbone of every gradient-based training algorithm you will see in later chapters.


3. Vector and Matrix Calculus Essentials for Machine Learning

This chapter focuses on the calculus tools that are used most frequently when deriving gradients and Hessians in neural networks, loss functions, and optimization algorithms. We emphasize notation conventions used in modern ML frameworks and provide concrete examples for the most common components.

3.1 Gradient of scalar-valued functions: ∇_θ J(θ)

Most machine learning objectives are scalar-valued functions J: ℝᵈ → ℝ, where θ ∈ ℝᵈ is the vector of all trainable parameters (weights, biases, etc.).

Definition The gradient ∇_θ J(θ) is the column vector of all partial derivatives:

∇_θ J(θ) = ( ∂J/∂θ₁ , ∂J/∂θ₂ , … , ∂J/∂θ_d )ᵀ ∈ ℝᵈ

Interpretation in ML

  • Direction of steepest ascent of the loss

  • −∇_θ J(θ) is the direction of steepest descent

  • Backpropagation computes exactly this vector efficiently

Concrete example — logistic regression binary cross-entropy Let σ(z) = 1/(1 + e⁻ᶻ) be the sigmoid. For one sample: J(θ) = − [ y log σ(θᵀx) + (1−y) log(1 − σ(θᵀx)) ]

Gradient w.r.t. θ: ∇_θ J = (σ(θᵀx) − y) ⋅ x

For n samples (vector form): ∇_θ J(θ) = (1/n) Xᵀ (σ(Xθ) − y) where X is n×d feature matrix, y is n×1 label vector.

This is the form almost every first-order optimizer receives.
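The vector form ∇_θ J = (1/n) Xᵀ (σ(Xθ) − y) can be sanity-checked against finite differences on random data — a toy sketch (all data and names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(theta, X, y):
    # mean binary cross-entropy
    p = sigmoid(X @ theta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(theta, X, y):
    # ∇_θ J = (1/n) Xᵀ (σ(Xθ) − y)
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = (rng.random(50) < 0.5).astype(float)
theta = rng.normal(size=3)

eps = 1e-6
g_num = np.array([(nll(theta + eps * e, X, y) - nll(theta - eps * e, X, y)) / (2 * eps)
                  for e in np.eye(3)])
print(np.allclose(grad(theta, X, y), g_num, atol=1e-5))  # True
```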

3.2 Jacobian matrix (gradients of vector-valued functions)

When the function outputs a vector, f: ℝᵈ → ℝᵏ, we use the Jacobian matrix J_f ∈ ℝ^{k × d}:

(J_f)_{ij} = ∂fᵢ / ∂θⱼ

Important cases in ML

  1. Layer output before activation: h = Wx + b → Jacobian w.r.t. input x is W

  2. Softmax output: p = softmax(z), z ∈ ℝᶜ (logits). The Jacobian ∂p/∂z ∈ ℝ^{C×C} is diag(p) − p pᵀ (central to the cross-entropy gradient)

Chain rule in vector form If scalar loss ℓ depends on vector a, and a = f(θ), then ∇_θ ℓ = J_f(θ)ᵀ ⋅ ∇_a ℓ

This transpose-Jacobian-vector-product form is the mathematical basis of reverse-mode automatic differentiation (backpropagation).
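The softmax Jacobian diag(p) − p pᵀ from case 2 is easy to verify numerically — a minimal sketch:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # stabilized softmax
    return e / e.sum()

z = np.array([1.0, -0.5, 2.0])
p = softmax(z)
J = np.diag(p) - np.outer(p, p)        # claimed Jacobian ∂p/∂z

# finite-difference check of column 0
eps = 1e-6
e0 = np.array([1.0, 0.0, 0.0])
col0 = (softmax(z + eps * e0) - softmax(z - eps * e0)) / (2 * eps)
print(np.allclose(J[:, 0], col0, atol=1e-7))   # True
```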

3.3 Hessian matrix H = ∇²J(θ): definition, symmetry, interpretation as curvature

Definition The Hessian of a scalar function J is the square symmetric matrix of second partial derivatives:

H_{ij} = ∂²J / ∂θᵢ ∂θⱼ

H = ∇_θ (∇_θ J(θ))ᵀ (gradient of the gradient)

Properties

  • Symmetric: Hᵀ = H (when J is twice continuously differentiable)

  • H is the Jacobian of the gradient vector

Curvature interpretation For small step δ: J(θ + δ) ≈ J(θ) + ∇Jᵀ δ + (1/2) δᵀ H δ

  • If H is positive definite → local minimum (loss curves upward in every direction)

  • If H has negative eigenvalues → saddle or maximum in those directions

  • Large |eigenvalue| → strong curvature → sensitive direction

  • Small eigenvalue → flat direction → ill-conditioning

ML reality In deep networks, the Hessian is almost never computed fully (O(d²) memory and time). Instead we use:

  • Diagonal approximation

  • Kronecker-factored approximations (K-FAC)

  • Hessian-vector products (via forward-over-reverse or reverse-over-forward AD)

3.4 Gradient and Hessian of common ML building blocks

Here are the derivatives needed in almost every modern network.

A. Linear layer z = Wx + b (W ∈ ℝ^{m×n}, x ∈ ℝⁿ, b,z ∈ ℝᵐ)

∇_W J = (∇_z J) xᵀ
∇_b J = ∇_z J
∇_x J = Wᵀ (∇_z J)

B. Element-wise activation (ReLU, sigmoid, tanh, GELU, etc.) Let a = σ(z), σ applied elementwise Then ∇_z J = diag(σ'(z)) ⋅ ∇_a J

Examples:

  • ReLU: σ'(z) = 1 if z > 0 else 0

  • Sigmoid: σ'(z) = σ(z)(1 − σ(z))

  • Tanh: σ'(z) = 1 − tanh²(z)

C. Squared Euclidean norm (regularization term) R(θ) = (1/2) ‖θ‖₂² = (1/2) θᵀ θ ∇_θ R = θ H_R = I (identity matrix)

D. Cross-entropy + softmax (single example) J = − ∑ y_k log p_k , p = softmax(z) ∇_z J = p − y (beautiful simplification!)

Hessian of cross-entropy w.r.t. logits z: H = diag(p) − p pᵀ (This matrix is positive semi-definite with rank at most C−1.)

E. L2 regularization on weights only Common form: J_total = J_data + (λ/2) ∑_{layers} ‖Wₗ‖₂² → adds λ I to the diagonal blocks of the Hessian corresponding to each weight matrix.
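The linear-layer rule ∇_W J = (∇_z J) xᵀ from item A can be checked the same way, using a toy scalar loss J = ½‖z‖² whose ∇_z J is simply z (illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(4, 3))
x = rng.normal(size=3)
b = rng.normal(size=4)

def J(W):
    z = W @ x + b
    return 0.5 * np.sum(z**2)     # toy loss with ∇_z J = z

z = W @ x + b
gW = np.outer(z, x)               # claimed: ∇_W J = (∇_z J) xᵀ

eps = 1e-6
gW_num = np.zeros_like(W)
for i in range(4):
    for j in range(3):
        E = np.zeros_like(W)
        E[i, j] = eps
        gW_num[i, j] = (J(W + E) - J(W - E)) / (2 * eps)
print(np.allclose(gW, gW_num, atol=1e-5))   # True
```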

3.5 Useful identities: matrix calculus cheat-sheet

Important conventions

  • Most ML papers and frameworks use denominator layout (gradient is column vector, Jacobian has output dimension first).

  • Some older control theory / statistics literature uses numerator layout (gradient is row vector).

Key identities (denominator layout – standard in PyTorch/JAX/TensorFlow)

  1. Linear: ∂/∂X (aᵀ X b) = a bᵀ

  2. Quadratic form: ∂/∂x (xᵀ A x) = (A + Aᵀ) x If A symmetric → 2 A x

  3. Trace trick (very powerful): ∂/∂A tr(A B) = Bᵀ and ∂/∂A tr(A B Aᵀ C) = C A B + Cᵀ A Bᵀ

  4. Frobenius inner product: ⟨A, B⟩ = tr(Aᵀ B) = vec(A)ᵀ vec(B) → ∂/∂A ⟨A, B⟩ = B (for B constant)

  5. Matrix-vector product: ∂/∂x (A x) = Aᵀ

  6. Hadamard (elementwise) product: ∂/∂x (a ⊙ x) = diag(a)

  7. Softmax Jacobian: ∂p/∂z = diag(p) − p pᵀ

Quick reference table

| Expression | Derivative w.r.t. | Result |
|---|---|---|
| aᵀ X b | X | a bᵀ |
| xᵀ A x | x | (A + Aᵀ) x |
| tr(Xᵀ A X B) | X | Aᵀ X Bᵀ + A X B |
| ‖X‖_F² = tr(Xᵀ X) | X | 2 X |
| log det(X) | X | X⁻ᵀ |
| softmax(z) cross-entropy | z | softmax(z) − y |
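Identity 2, the quadratic form, is worth spot-checking numerically for a deliberately non-symmetric A (minimal sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 4))       # deliberately non-symmetric
x = rng.normal(size=4)

f = lambda v: v @ A @ v
g = (A + A.T) @ x                 # claimed: ∂(xᵀAx)/∂x = (A + Aᵀ)x

eps = 1e-6
g_num = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(4)])
print(np.allclose(g, g_num, atol=1e-5))   # True
```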

3.6 Automatic differentiation vs symbolic vs finite differences

Three main ways to compute derivatives in practice:

| Method | How it works | Precision | Speed | Memory | Use-case in ML |
|---|---|---|---|---|---|
| Finite differences | (f(θ + εeᵢ) − f(θ))/ε ≈ ∂f/∂θᵢ | Low (rounding error, cancellation) | Slow (d forward passes) | Low | Debugging, black-box, zeroth-order opt. |
| Symbolic | Manipulates expression algebraically | Exact | Slow for large expressions | High during derivation | Small models, theoretical work |
| Automatic (AD) | Builds computational graph, applies chain rule | Machine precision | Fast (one forward + one backward) | Moderate–high | Virtually all deep learning today |

Modes of AD

  • Forward mode: good when #inputs ≪ #outputs (e.g., Hessian-vector products)

  • Reverse mode (backpropagation): good when #outputs ≪ #inputs (scalar loss, millions of parameters) → dominant in DL

Practical comparison example Suppose θ has 10 million parameters, J is scalar loss.

  • Finite differences → ~10 million forward passes → impossible

  • Symbolic → expression blowup → impossible

  • Reverse-mode AD → 1 forward + 1 backward pass → feasible (what PyTorch .backward() does)

JAX / PyTorch nuance Both support higher-order derivatives via nested autodiff: composing grad with jvp/vjp yields Hessian-vector products (and, combined with vmap, full Hessians).

This chapter gives you the core mathematical vocabulary and identities needed to understand and derive gradients/Hessians for almost any neural architecture. The next chapters apply these tools to concrete optimization algorithms.


4. First-Order Optimization: Gradient-Based Methods

First-order methods rely solely on the gradient ∇J(θ) (or stochastic estimates thereof) to update parameters. They dominate deep learning because they scale well to high dimensions (millions to trillions of parameters) thanks to automatic differentiation and mini-batching.

4.1 Gradient descent (GD): vanilla / batch gradient descent

Vanilla / Batch Gradient Descent (GD) Update rule (deterministic, full-batch):

θ_{t+1} = θ_t − η ∇_θ J(θ_t)

where η > 0 is the learning rate (step size).

Algorithm sketch:

  1. Initialize θ₀

  2. For t = 0, 1, 2, … until convergence:

    • Compute full gradient ∇J(θ_t) over entire dataset

    • θ_{t+1} ← θ_t − η ∇J(θ_t)

Pros: Smooth convergence path, exact gradient Cons: Computationally prohibitive for large datasets (one pass = one epoch)

In practice, we almost always use mini-batch or stochastic variants (covered in Chapter 5).

Mathematical example — quadratic bowl J(θ) = (1/2) θᵀ A θ, A positive definite. GD converges for any fixed η < 2/λ_max(A); the fastest fixed step is η = 2/(λ_min(A) + λ_max(A)), giving per-step contraction (κ − 1)/(κ + 1) with condition number κ = λ_max/λ_min.
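On a diagonal quadratic this is easy to reproduce; the sketch below runs GD with the optimal fixed step η = 2/(λ_min + λ_max) (illustrative values):

```python
import numpy as np

lam_min, lam_max = 1.0, 10.0
A = np.diag([lam_min, lam_max])
eta = 2.0 / (lam_min + lam_max)        # optimal fixed step for a quadratic

theta = np.array([1.0, 1.0])
for _ in range(100):
    theta = theta - eta * (A @ theta)  # ∇J(θ) = Aθ
print(np.linalg.norm(theta))           # tiny: linear (geometric) convergence
```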

4.2 Convergence analysis: Lipschitz smooth & strongly convex functions

We analyze convergence under standard assumptions used in optimization theory.

Key assumptions

  1. L-smooth (Lipschitz continuous gradient): ‖∇J(θ) − ∇J(θ')‖ ≤ L ‖θ − θ'‖ → J has bounded curvature → descent lemma: J(θ − η ∇J(θ)) ≤ J(θ) − η ‖∇J(θ)‖² + (L η² / 2) ‖∇J(θ)‖²

  2. μ-strongly convex (μ > 0): J(θ') ≥ J(θ) + ∇J(θ)ᵀ (θ' − θ) + (μ/2) ‖θ' − θ‖² → unique minimizer θ*, and J(θ) ≥ J(θ*) + (μ/2) ‖θ − θ*‖²

Convergence rates for gradient descent (fixed step size η ≤ 1/L)

| Function class | Rate on J(θ_t) − J* | Rate on ‖θ_t − θ*‖² | Typical η | Notes |
|---|---|---|---|---|
| L-smooth convex (μ = 0) | O(1/t) | O(1/t) | η = 1/L | Sublinear |
| L-smooth + μ-strongly convex | Linear: O((1 − μ/L)^t) | Linear: O((1 − μ/L)^t) | η = 2/(μ + L) or 1/L | Condition number κ = L/μ governs the rate |

Proof intuition for the smooth convex case: the descent lemma with η ≤ 1/L gives J(θ_{t+1}) ≤ J(θ_t) − (η/2) ‖∇J(θ_t)‖², and convexity lower-bounds ‖∇J(θ_t)‖ in terms of J(θ_t) − J* and the distance to θ*; summing these per-step decreases and telescoping yields the O(1/t) rate after t steps.

Strongly convex case achieves linear (exponential) convergence — very fast once close to optimum.

In deep learning, functions are non-convex, so we get convergence to stationary points (‖∇J‖ → 0), but rates are more subtle (see Chapter 7).

4.3 Step-size selection: fixed, diminishing, backtracking line search

Fixed step size η constant (e.g., 0.001). Simple but requires careful tuning.

Diminishing step size (Robbins–Monro conditions for stochastic case): ∑ η_t = ∞ and ∑ η_t² < ∞ Examples: η_t = c / t, η_t = c / √t → Ensures convergence in expectation for SGD.

Backtracking line search (Armijo condition) Start with η = η_init While J(θ − η ∇J) > J(θ) − c η ‖∇J‖² (c ≈ 10⁻⁴): η ← β η (β = 0.5 or 0.8) → Adaptive per iteration, guarantees descent.

In deep learning, fixed or scheduled η is far more common than line search (too expensive per step).
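The backtracking loop above translates directly into code; a minimal sketch on an ill-conditioned quadratic (function names and constants are illustrative):

```python
import numpy as np

def backtracking(J, gradJ, theta, eta0=1.0, beta=0.5, c=1e-4):
    # shrink eta until the Armijo sufficient-decrease condition holds
    g = gradJ(theta)
    eta = eta0
    while J(theta - eta * g) > J(theta) - c * eta * (g @ g):
        eta *= beta
    return eta

A = np.diag([1.0, 10.0])
J = lambda t: 0.5 * t @ A @ t
gradJ = lambda t: A @ t

theta = np.array([1.0, 1.0])
for _ in range(200):
    eta = backtracking(J, gradJ, theta)
    theta = theta - eta * gradJ(theta)
print(J(theta))   # driven close to 0 without hand-tuning a fixed eta
```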

4.4 Momentum & acceleration: Polyak momentum, Nesterov accelerated gradient (NAG)

Momentum accelerates convergence by adding a velocity term.

Polyak heavy-ball momentum (classical momentum):

v_{t+1} = β v_t − η ∇J(θ_t)
θ_{t+1} = θ_t + v_{t+1}

(β ≈ 0.9)

Nesterov Accelerated Gradient (NAG) — optimal among first-order methods for smooth convex functions:

v_{t+1} = β v_t − η ∇J(θ_t + β v_t) (look-ahead gradient)
θ_{t+1} = θ_t + v_{t+1}

or equivalent form:

y_{t+1} = θ_t + β (θ_t − θ_{t-1})
θ_{t+1} = y_{t+1} − η ∇J(y_{t+1})

Convergence rates (smooth convex case, optimal step-size):

  • GD: O(1/t)

  • NAG: O(1/t²) — optimal rate for any first-order method on L-smooth convex functions

For strongly convex: NAG achieves O(√(L/μ) log(1/ε)) iterations to reach ε-suboptimality (better dependence on condition number than GD).

In deep learning, momentum (especially Nesterov-style in SGD) helps escape flat regions and dampens oscillations in ill-conditioned landscapes.
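The damping effect is visible on a small ill-conditioned quadratic; the sketch below compares heavy-ball momentum against plain GD (setting β = 0 recovers vanilla GD; all constants are illustrative):

```python
import numpy as np

A = np.diag([1.0, 50.0])            # ill-conditioned quadratic, κ = 50
J = lambda t: 0.5 * t @ A @ t
grad = lambda t: A @ t

def run(beta, eta=1.0 / 50, steps=150):
    t = np.array([1.0, 1.0])
    v = np.zeros(2)
    for _ in range(steps):
        v = beta * v - eta * grad(t)   # beta = 0 → plain gradient descent
        t = t + v
    return J(t)

plain, momentum = run(beta=0.0), run(beta=0.9)
print(plain, momentum)   # momentum reaches a much lower loss in the same budget
```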

4.5 Adaptive gradient methods: AdaGrad, RMSProp, Adam (& variants: AdamW, AMSGrad)

Adaptive methods scale the learning rate per parameter based on historical gradient magnitudes.

AdaGrad (2011):

g_t = g_{t-1} + (∇J_t)² (elementwise accumulation)
θ_{t+1} = θ_t − η ∇J_t / (√g_t + ε)

→ Good for sparse data, but learning rate shrinks too aggressively (can stop learning).

RMSProp (2012):

v_t = β v_{t-1} + (1−β) (∇J_t)²
θ_{t+1} = θ_t − η ∇J_t / (√v_t + ε)

(many framework implementations additionally support a momentum term on top of this update)

Adam (2014) — most popular:

m_t = β₁ m_{t-1} + (1−β₁) ∇J_t
v_t = β₂ v_{t-1} + (1−β₂) (∇J_t)²
m̂_t = m_t / (1 − β₁ᵗ), v̂_t = v_t / (1 − β₂ᵗ) (bias correction)
θ_{t+1} = θ_t − η m̂_t / (√v̂_t + ε) (elementwise)

Default hyperparameters: β₁=0.9, β₂=0.999, ε=10⁻⁸
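The Adam update above translates directly into a few lines of NumPy — a didactic sketch, not a framework implementation (the toy objective and step count are illustrative):

```python
import numpy as np

def adam_step(theta, g, m, v, t, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # one Adam update; t is the 1-based iteration count
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)            # bias correction
    v_hat = v / (1 - b2**t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# minimize f(θ) = θ₁² + θ₂² from a distant start
theta = np.array([2.0, -3.0])
m = np.zeros(2)
v = np.zeros(2)
for t in range(1, 3001):
    g = 2 * theta                      # ∇f
    theta, m, v = adam_step(theta, g, m, v, t, eta=0.05)
print(theta)   # hovers within roughly η of the optimum at (0, 0)
```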

Variants

  • AdamW (2019): Decouples weight decay from the adaptive scaling → better generalization θ_{t+1} = θ_t − η ( m̂_t / (√v̂_t + ε) + λ θ_t ) (weight decay λ kept outside the adaptive term)

  • AMSGrad (2018): Fixes a non-convergence issue by using the running maximum v̂_t = max(v̂_{t−1}, v_t) in place of v_t

Recent theoretical results (2024–2025) suggest Adam can achieve sharper rates than RMSProp in certain regimes, sometimes approaching accelerated rates under specific conditions (e.g., higher-order analysis with momentum-like effects).

4.6 Practical considerations: learning rate schedules, warm-up, gradient clipping

Learning rate schedules (most common in 2025 practice)

  • Step decay: Multiply LR by 0.1 every X epochs

  • Cosine annealing (SGDR / cosine with restarts): η_t = η_min + (1/2)(η_max − η_min)(1 + cos(π t / T))

  • One-cycle policy: Linear increase → cosine decay → very fast convergence

  • Exponential decay: η_t = η₀ ⋅ γ^t

Warm-up (standard since ~2017, still essential in 2025) Start with very small LR (e.g., 1/10 of target), linearly increase to target over W steps (W ≈ 1000–10000). Prevents early instability, exploding gradients, especially with large batches or adaptive optimizers.

Gradient clipping (by value or by norm) Clip if ‖∇J‖ > C (e.g., C=1.0 or 0.5): ∇J ← C ⋅ ∇J / max(C, ‖∇J‖)
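Warm-up-then-cosine scheduling and global-norm clipping can both be sketched in a few lines (the helper names and constants are illustrative, not a library API):

```python
import math

def lr_schedule(step, total_steps, warmup=1000, eta_max=3e-4, eta_min=0.0):
    # linear warm-up to eta_max, then cosine decay to eta_min
    if step < warmup:
        return eta_max * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * progress))

def clip_by_global_norm(grads, max_norm=1.0):
    # ∇J ← C · ∇J / max(C, ‖∇J‖)
    norm = math.sqrt(sum(g * g for g in grads))
    scale = max_norm / max(max_norm, norm)
    return [g * scale for g in grads]

print(lr_schedule(500, 10_000))         # halfway through warm-up → eta_max / 2
print(clip_by_global_norm([3.0, 4.0]))  # norm 5 → rescaled to norm 1
```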

Best practices (2025):

  • Always use warm-up + cosine decay for transformers / large models

  • Gradient clipping by norm (global norm) is default in most libraries

  • Combine with AdamW + weight decay (0.01–0.1)

  • Monitor gradient norm during training — sudden spikes indicate instability

  • For very large models: warm-up over more steps (proportional to model size)

Summary table of popular choices (2025)

| Optimizer | Typical LR | Schedule | Warm-up steps | Clipping norm | Weight decay |
|---|---|---|---|---|---|
| SGD + Momentum | 0.1–1.0 | Cosine / step | 2000–5000 | 0.5–5.0 | — |
| Adam / AdamW | 1e-4 – 5e-4 | Cosine annealing | 1000–10000 | 1.0 | 0.01–0.1 |
| Lion | 1e-4 | One-cycle / cosine | Yes | Often none | 0.1 |

These techniques turn theoretical first-order methods into practical powerhouses for training massive models.


5. Stochastic and Mini-Batch Gradient Methods

While batch gradient descent provides exact gradients, it is computationally infeasible for modern datasets with millions or billions of examples. Stochastic and mini-batch methods approximate the full gradient using subsets of data, enabling faster iterations and scaling to enormous models.

5.1 From batch GD → stochastic GD (SGD)

Batch Gradient Descent (GD) computes the exact gradient over the entire dataset:

∇J(θ) = (1/n) ∑_{i=1}^n ∇_θ ℓ(θ; x_i, y_i)

Stochastic Gradient Descent (SGD) uses a single random example (or a small batch) at each step:

θ_{t+1} = θ_t − η g_t where g_t = ∇_θ ℓ(θ_t; x_{i_t}, y_{i_t}) (i_t chosen uniformly at random)

This is an unbiased estimate: E[g_t | θ_t] = ∇J(θ_t)

Transition intuition:

  • Batch GD: one very accurate but expensive step per epoch.

  • SGD: many cheap, noisy steps per epoch → faster wall-clock progress early on.

  • The noise introduces stochasticity that helps escape local minima/saddles in non-convex settings.

In practice (2025–2026), pure single-example SGD is rare; almost everyone uses mini-batch SGD (batch size 32–8192+ depending on hardware and model).
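A minimal mini-batch SGD loop on synthetic linear-regression data shows the noisy-but-fast behavior described above (all data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, B = 1000, 3, 32
X = rng.normal(size=(n, d))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=n)

w = np.zeros(d)
eta = 0.05
for epoch in range(20):
    idx = rng.permutation(n)                       # reshuffle each epoch
    for start in range(0, n, B):
        batch = idx[start:start + B]
        err = X[batch] @ w - y[batch]
        w -= eta * X[batch].T @ err / len(batch)   # unbiased mini-batch gradient
print(w)   # ≈ [1.0, -2.0, 0.5]
```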

5.2 Mini-batch gradient descent: bias-variance trade-off

Mini-batch gradient:

g_t = (1/B) ∑_{j=1}^B ∇_θ ℓ(θ_t; z_{i_j}) (B = mini-batch size)

Bias: Always zero (unbiased estimator of full gradient).

Variance of g_t: Var(g_t) ≈ (1/B) Var(single gradient)

  • Small B → high variance → noisy updates → more exploration, potential escape from sharp minima.

  • Large B → low variance → smoother path, closer to deterministic GD → can converge to sharper minima, sometimes worse generalization.

Trade-off summary (empirical & theoretical consensus 2025):

| Batch size | Gradient variance | Wall-clock speed per epoch | Generalization (typical) | Sharpness of minima | Common use case |
|---|---|---|---|---|---|
| 1 (pure SGD) | Very high | Very fast iterations | Often best | Flattest | Small/medium models, research |
| 32–256 | High | Fast | Strong | Flat | Most vision / early LLM pretraining |
| 1024–8192+ | Low | Slower per iteration but fewer steps needed | Can degrade | Sharper | Large-scale training with scaling rules |
| Full batch | Zero | Extremely slow | Can be worst | Sharpest | Small problems only |

Larger batches require learning rate scaling (often linear or √B) to maintain progress.
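The Var(g_t) ≈ (1/B) Var(single gradient) relation is easy to confirm empirically — a toy sketch (the measured ratio fluctuates with sampling noise):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)
y = 3.0 * x + rng.normal(size=n)
w = 0.0   # per-sample gradient of ½(w xᵢ − yᵢ)² is (w xᵢ − yᵢ) xᵢ

def grad_variance(B, trials=2000):
    estimates = []
    for _ in range(trials):
        idx = rng.integers(n, size=B)
        estimates.append(np.mean((w * x[idx] - y[idx]) * x[idx]))
    return np.var(estimates)

ratio = grad_variance(1) / grad_variance(16)
print(ratio)   # ≈ 16: variance scales like 1/B
```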

5.3 Noise as implicit regularization

The stochasticity in SGD acts as a form of implicit regularization, often leading to better generalization than deterministic methods.

Mechanisms (supported by theory and practice up to 2025–2026):

  1. Preference for flat minima SGD noise perturbs parameters → trajectories favor regions where loss is insensitive to perturbations (flat minima). Sharp minima generalize poorly (high sensitivity); flat minima are more robust → better test performance. Evidence: Increasing batch size sharpens minima unless compensated by noise injection or LR adjustment.

  2. Escape from saddles/plateaus Noise helps SGD escape saddle points faster than GD in high dimensions.

  3. Effective regularization equivalent SGD with small batches behaves similarly to explicit L2 regularization or label smoothing in some regimes.

Modern view (2025): SGD noise aligns with directions of low curvature → implicit bias toward flat, well-generalizing solutions. Recent analyses link this to stability of trajectories under noise covariance.

In large language models, small-batch SGD-like noise (or explicit noise addition) remains beneficial even at scale for generalization.

5.4 Convergence of SGD: Robbins-Monro conditions, sublinear rates

Robbins–Monro conditions (classic 1951, still foundational in 2025–2026 analyses):

For SGD convergence in expectation:

  1. ∑_{t=1}^∞ η_t = ∞ (steps accumulate infinitely → reach anywhere)

  2. ∑_{t=1}^∞ η_t² < ∞ (noise averages out)

Common schedules satisfying this: η_t = c / t^α with 1/2 < α ≤ 1 (e.g., 1/√t or 1/t).

Convergence rates (under assumptions):

  • Convex, L-smooth: J(θ_t) − J^* = O(1/√t) in expectation (optimal for stochastic first-order methods).

  • Strongly convex + smooth: linear convergence possible with diminishing steps or variance reduction (see 5.5).

  • Non-convex: convergence to stationary point E[‖∇J(θ_t)‖²] → 0, typically O(1/√t) rate for function value or gradient norm.

Recent work (2024–2025) relaxes conditions (e.g., local Łojasiewicz inequality for deep nets) and shows almost-sure convergence under milder step-size rules, or exponential convergence in special cases (e.g., separable logistic regression).

5.5 Variance reduction techniques: SVRG, SAGA, SARAH

These methods reduce gradient variance periodically → faster convergence than plain SGD, especially on finite-sum problems (standard in ML: J(θ) = (1/n) ∑ ℓ_i(θ)).

SVRG (Stochastic Variance Reduced Gradient, 2013–2014)

  • Every m iterations: compute full gradient w_t = ∇J(θ̃) (snapshot)

  • Then for inner loop: g_t = ∇ℓ_{i_t}(θ_t) − ∇ℓ_{i_t}(θ̃) + w_t → unbiased, variance reduced (zero when θ_t = θ̃)

Convergence: O((n + κ) log(1/ε)) gradient evaluations for strongly convex finite sums — linear convergence, versus plain SGD's sublinear O(1/ε)-type rates.
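The snapshot/inner-loop structure above can be sketched on a small noiseless least-squares problem (step size, loop lengths, and data are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 2
X = rng.normal(size=(n, d))
w_star = np.array([1.0, -1.0])
y = X @ w_star                        # noiseless finite-sum least squares

grad_i = lambda w, i: (X[i] @ w - y[i]) * X[i]      # ∇ℓᵢ(w)
full_grad = lambda w: X.T @ (X @ w - y) / n

w = np.zeros(d)
eta, m = 0.02, 2 * n
for _ in range(15):                   # outer loop: refresh the snapshot
    w_snap = w.copy()
    mu = full_grad(w_snap)            # full gradient at the snapshot
    for _ in range(m):                # inner loop: variance-reduced steps
        i = rng.integers(n)
        g = grad_i(w, i) - grad_i(w_snap, i) + mu   # unbiased, reduced variance
        w -= eta * g
print(np.linalg.norm(w - w_star))     # tiny: linear convergence to w*
```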

SAGA (2014) — an unbiased variant of SAG (Stochastic Average Gradient)

  • Maintain table of past gradients for each example

  • Update: g_t = ∇ℓ_{i_t}(θ_t) − old_grad_{i_t} + (1/n) ∑ all old_grads

  • After update, replace old_grad_{i_t}

Linear convergence for strongly convex, memory O(n d) — practical for medium n.

SARAH (Stochastic Recursive Gradient Algorithm, 2017–2019)

Inner loop: v_t = ∇ℓ_{i_t}(θ_t) − ∇ℓ_{i_t}(θ_{t-1}) + v_{t-1} → recursive variance reduction

Strong convergence guarantees, often best empirical performance among variance-reduced methods.

Recent advances (2024–2025):

  • Proximal/Stochastic proximal point extensions

  • Shuffling variants (e.g., Adjusted Shuffling SARAH) → improved complexity in strongly convex settings

  • Integration with operator splitting and generalized equations

  • Applications to cocoercive variational inequalities and high-dimensional control

In deep learning, these are less common today (due to huge n and non-convexity), but useful for fine-tuning, small-to-medium models, or when full passes are affordable.

5.6 Large-batch training phenomena and scaling rules

Phenomena observed (2017–2025):

  • Large batches → faster hardware utilization but can hurt generalization (sharper minima).

  • Critical Batch Size (CBS): Largest B where increasing B (with LR scaled) does not degrade loss trajectory significantly. Beyond CBS, further increases harm token efficiency or final loss.

Recent findings (2025):

  • CBS starts near 0 at init, grows rapidly early in training, then plateaus.

  • CBS scales with model size and training progress → small runs can predict large-run CBS.

  • Batch size warmup: Start small B, increase as CBS grows → reliable large-batch training (e.g., 43% fewer steps for similar/better loss in OLMo experiments).

  • Scaling laws for batch size: Optimal B_opt and CBS grow with tokens-per-parameter ratio D/N; power-law behavior in many regimes.

  • Weight decay / LR scaling: Optimal λ often linear in B; AdamW timescale τ = B/(η λ D) roughly constant.

Practical scaling rules (2025–2026 large models):

  • Linear LR scaling with B up to CBS, then sub-linear or constant.

  • Use batch size warmup + cosine LR decay.

  • For LLMs: aim for B near/above CBS but use techniques (e.g., ghost memory, gradient accumulation) to fit large effective B.

  • DiLoCo-style low-communication methods scale better with model size than pure data-parallel in some regimes.
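The scaling rules above can be sketched as simple schedule helpers; the function names, base values, and linear warmup shape here are illustrative assumptions, not a published recipe:

```python
def scaled_lr(base_lr, base_batch, batch, critical_batch):
    """Linear LR scaling with batch size, capped at the critical batch size (CBS)."""
    effective = min(batch, critical_batch)
    return base_lr * effective / base_batch

def batch_warmup(step, warmup_steps, b_min, b_max):
    """Grow the batch size early in training, tracking the growing CBS."""
    if step >= warmup_steps:
        return b_max
    return int(b_min + (b_max - b_min) * step / warmup_steps)

lr = scaled_lr(1e-4, 256, 1024, 4096)       # linear regime: 4x the base LR
capped = scaled_lr(1e-4, 256, 8192, 4096)   # beyond CBS: LR held at the cap
b = batch_warmup(500, 1000, 64, 2048)       # mid-warmup batch size
```

In practice the CBS itself would be measured empirically during training rather than fixed up front.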

Summary table

| Aspect | Small/Medium Batch | Large Batch | Mitigation Strategies |
|---|---|---|---|
| Speed per step | Slower (more steps/epoch) | Faster (fewer steps/epoch) | Gradient accumulation |
| Generalization | Often better | Can degrade | Noise injection, warmup, AdamW |
| Critical Batch Size | — | Exists; grows during training | Measure empirically, warmup B |
| Scaling Rule (LR) | — | √B or linear up to CBS | One-cycle, cosine annealing |

These stochastic techniques bridge theory and massive-scale practice, enabling efficient training of today's frontier models.



6. Second-Order Optimization: Hessian-Aware Methods

Second-order methods incorporate curvature information via the Hessian matrix (or approximations thereof) to make more informed update steps than first-order gradient descent. They can converge much faster near minima (quadratic vs linear/sublinear rates) but are significantly more expensive in computation and memory — especially in deep learning with d = 10⁶–10¹² parameters.

6.1 Newton's method: quadratic local model and update rule

Newton's method approximates the loss J(θ) locally as a quadratic function using the second-order Taylor expansion:

J(θ + δ) ≈ J(θ) + ∇J(θ)ᵀ δ + (1/2) δᵀ H(θ) δ

where H(θ) = ∇²J(θ) is the Hessian.

To minimize this quadratic model, set its gradient w.r.t. δ to zero:

∇J(θ) + H(θ) δ = 0 ⇒ δ = − H(θ)⁻¹ ∇J(θ)

Update rule:

θ_{t+1} = θ_t − H(θ_t)⁻¹ ∇J(θ_t)

Convergence (under strong assumptions):

  • Locally quadratic convergence: error ∝ error² near the minimum (extremely fast once close).

  • Requires H positive definite (or made so via modifications).

Example – quadratic objective J(θ) = (1/2) θᵀ A θ − bᵀ θ: Newton reaches the exact minimum θ* = A⁻¹b in one step if H = A is invertible (and positive definite).

Deep learning reality: Computing and inverting the full Hessian (O(d²) storage, O(d³) inversion) is impossible for large models. This motivates all subsequent approximations.
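The one-step property on a quadratic is easy to check directly. This toy 2×2 example (values chosen arbitrarily) solves the Newton system with Cramer's rule rather than forming an inverse:

```python
# J(θ) = 0.5 θᵀAθ − bᵀθ with A symmetric positive definite (2×2 toy example);
# one Newton step from any start reaches the minimizer A⁻¹b.
A = [[3.0, 1.0], [1.0, 2.0]]
b = [1.0, 1.0]

def grad(t):                       # ∇J = Aθ − b
    return [A[0][0]*t[0] + A[0][1]*t[1] - b[0],
            A[1][0]*t[0] + A[1][1]*t[1] - b[1]]

def solve2(M, r):                  # solve the 2×2 system M δ = r (Cramer's rule)
    det = M[0][0]*M[1][1] - M[0][1]*M[1][0]
    return [(r[0]*M[1][1] - r[1]*M[0][1]) / det,
            (M[0][0]*r[1] - M[1][0]*r[0]) / det]

theta = [5.0, -3.0]                # arbitrary start point
delta = solve2(A, grad(theta))     # Newton step: δ = H⁻¹∇J (here H = A exactly)
theta = [theta[0] - delta[0], theta[1] - delta[1]]
# gradient at the new point is (numerically) zero after a single step
```

For a non-quadratic objective the Hessian changes with θ, so the same step only minimizes the local quadratic model, which is why damping and trust regions (next sections) are needed.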

6.2 Modified Newton / damped Newton / Levenberg-Marquardt

Pure Newton steps can overshoot or diverge when far from the minimum (Hessian may not be positive definite, or quadratic model poor).

Damped / Modified Newton:

θ_{t+1} = θ_t − (H(θ_t) + λ I)⁻¹ ∇J(θ_t) (λ ≥ 0)

  • λ = 0 → pure Newton

  • Large λ → gradient descent step (direction −∇J, step size 1/λ)

Levenberg-Marquardt (very popular in non-linear least squares):

Similar damping, but λ chosen adaptively (trust-region like behavior):

  • If step reduces loss enough → decrease λ (trust quadratic model more)

  • If not → increase λ (fall back toward gradient descent)

Advantages:

  • Guarantees descent (positive definite matrix)

  • Smooth transition between Newton and GD

Still requires solving linear system with damped Hessian → expensive for large d.
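A minimal 1-D sketch of the adaptive-λ loop follows; the objective, the ×10/÷3/×3 damping factors, and the iteration count are our illustrative choices (real Levenberg-Marquardt is usually stated for least-squares Jacobians):

```python
def J(x):   return x**4 - 2*x**2 + 0.5*x    # non-convex toy objective
def dJ(x):  return 4*x**3 - 4*x + 0.5
def d2J(x): return 12*x**2 - 4              # negative curvature near x = 0

x, lam = 0.0, 1.0                            # start where the Hessian is indefinite
for _ in range(100):
    denom = d2J(x) + lam                     # damped curvature (H + λI in 1-D)
    if denom <= 0:                           # not positive definite: damp harder
        lam *= 10
        continue
    x_new = x - dJ(x) / denom                # damped Newton step
    if J(x_new) < J(x):                      # model trusted: accept, damp less
        x, lam = x_new, max(lam / 3, 1e-8)
    else:                                    # poor prediction: reject, damp more
        lam *= 3
```

The loop interpolates between Newton (small λ) and a short gradient step (large λ), and settles at the stationary point near x ≈ −1.06 despite starting in an indefinite region.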

6.3 Trust-region methods

Instead of damping the Hessian, explicitly restrict step length:

min_δ m(δ) = J(θ) + ∇Jᵀ δ + (1/2) δᵀ H δ subject to ‖δ‖ ≤ Δ (trust-region radius)

Approximate solution (dogleg / Steihaug-CG):

  • Solve subproblem iteratively with conjugate gradient (CG) until hitting boundary or satisfying accuracy

  • Adjust Δ based on how well model predicted actual reduction:

ratio = [J(θ+δ) − J(θ)] / [m(δ) − m(0)]

  • ratio ≈ 1 → good model → increase Δ

  • ratio small/negative → poor model → shrink Δ

Advantages over damped Newton:

  • More principled step-size control

  • Better handling of indefinite Hessians

Disadvantage: Still needs many Hessian-vector products per iteration.

6.4 Quasi-Newton methods: BFGS, L-BFGS (limited-memory version)

Quasi-Newton methods build an approximation B_t ≈ H(θ_t) (or inverse H⁻¹) using only gradient evaluations — no explicit Hessian computation.

BFGS update (Broyden–Fletcher–Goldfarb–Shanno):

B_{t+1} = B_t − (B_t s sᵀ B_t)/(sᵀ B_t s) + (y yᵀ)/(yᵀ s)

where s = θ_{t+1} − θ_t, y = ∇J(θ_{t+1}) − ∇J(θ_t)

Maintains positive definiteness if line search satisfies Wolfe conditions.

L-BFGS (Limited-memory BFGS):

  • Stores only m most recent {s_k, y_k} pairs (m ≈ 5–30)

  • Approximates inverse Hessian via two-loop recursion (very memory efficient)

  • Update: θ_{t+1} = θ_t − η_t H_t⁻¹ ∇J(θ_t), with H_t⁻¹ applied implicitly via the two-loop recursion and η_t from a line search (no matrix ever stored)

Convergence: Superlinear in convex smooth case; often very good in practice even for non-convex deep nets.

Use in deep learning: L-BFGS is occasionally used for fine-tuning small-to-medium models or when full-batch gradients are feasible.
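The two-loop recursion itself is short. This pure-Python sketch computes the search direction from stored (s, y) pairs; the helper and variable names are ours, not from any particular library:

```python
# Two-loop recursion: compute the L-BFGS direction d = -H_k⁻¹ g from the m most
# recent (s, y) pairs, without ever forming a matrix.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def lbfgs_direction(g, s_hist, y_hist):
    q = list(g)
    alphas = []
    for s, y in reversed(list(zip(s_hist, y_hist))):   # first loop: newest -> oldest
        rho = 1.0 / dot(y, s)
        a = rho * dot(s, q)
        alphas.append((a, rho, s, y))
        q = [qi - a * yi for qi, yi in zip(q, y)]
    s, y = s_hist[-1], y_hist[-1]
    gamma = dot(s, y) / dot(y, y)                      # initial scaling H0 = γI
    r = [gamma * qi for qi in q]
    for a, rho, s, y in reversed(alphas):              # second loop: oldest -> newest
        beta = rho * dot(y, r)
        r = [ri + (a - beta) * si for ri, si in zip(r, s)]
    return [-ri for ri in r]                           # descent direction

# secant sanity check: with g equal to the newest y, the direction is exactly -s
d = lbfgs_direction([2.0, 4.0], [[1.0, 0.0], [0.5, 1.0]], [[2.0, 1.0], [2.0, 4.0]])
```

The check works because the BFGS inverse update enforces H_k y_{k−1} = s_{k−1} (the secant condition), so the direction for g = y_{k−1} is −s_{k−1}.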

6.5 Natural gradient descent & Fisher information matrix approximation

Natural gradient (Amari 1998):

θ_{t+1} = θ_t − η F(θ_t)⁻¹ ∇J(θ_t)

where F(θ) is the Fisher information matrix:

F_{ij} = E_{p(data|θ)} [ (∂ log p / ∂θ_i) (∂ log p / ∂θ_j) ]

Interpretation:

  • Pre-conditions gradient using the local geometry of the statistical manifold (KL-divergence metric).

  • Invariant under reparameterization (unlike vanilla gradient).

In practice for neural nets: F ≈ empirical Fisher = (1/n) ∑ (∇_θ log p(y_i|x_i,θ)) (∇_θ log p(y_i|x_i,θ))ᵀ → outer product of per-example score vectors.

K-FAC (Section 6.7) is a popular structured approximation to F.

Pros: Better-conditioned updates, sometimes faster convergence. Cons: Still expensive unless approximated.

6.6 Hessian-free optimization (Martens-style, CG solver)

Idea (Martens 2010): Never form the full Hessian — only compute Hessian-vector products Hv via automatic differentiation (double backprop or forward-over-reverse).

Update: Solve H δ = −∇J approximately using conjugate gradient (CG):

δ ≈ argmin_δ (1/2) δᵀ H δ + ∇Jᵀ δ

CG needs only ~10–50 Hv products per outer iteration (much cheaper than forming H).

Damping: Add λ I inside CG loop to stabilize.

Variants:

  • Hessian-free with CG + damping + line search

  • Used successfully in early deep learning (2010–2015) before Adam dominance

Status in 2025–2026: Rarely used standalone, but Hv products remain core to many advanced methods (K-FAC, SAM variants, etc.).

6.7 Kronecker-factored approximate curvature (K-FAC)

K-FAC (Martens & Grosse 2015, later improvements) approximates the Fisher/empirical curvature with Kronecker-factored structure — extremely effective for deep nets.

Core idea: For a fully-connected or conv layer a = Wx + b, the Fisher block for W is approximately:

F_W ≈ (A ⊗ G) where A = E[x xᵀ] (input covariance, layer input statistics) G = E[ (∇_a log p) (∇_a log p)ᵀ ] (gradient covariance w.r.t. pre-activations)

→ Inverse approximated as A⁻¹ ⊗ G⁻¹ (via Woodbury or direct inversion of small matrices)
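In code, the Kronecker identity means the preconditioned gradient needs only two small inverses, since (A ⊗ G)⁻¹ vec(∇W) = vec(G⁻¹ ∇W A⁻¹) for symmetric A, G. Below is a pure-Python 2×2 sketch; the covariance and gradient values are made up for illustration:

```python
# K-FAC block update for one linear layer: invert the two small factors
# instead of the full Kronecker product A ⊗ G.
def inv2(M):                                 # inverse of a 2×2 matrix
    det = M[0][0]*M[1][1] - M[0][1]*M[1][0]
    return [[ M[1][1]/det, -M[0][1]/det],
            [-M[1][0]/det,  M[0][0]/det]]

def matmul(X, Y):
    return [[sum(X[i][k]*Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

A = [[2.0, 0.5], [0.5, 1.0]]                 # input covariance E[x xᵀ]
G = [[1.5, 0.2], [0.2, 0.8]]                 # pre-activation gradient covariance
gradW = [[0.3, -0.1], [0.7, 0.4]]            # raw gradient of the layer weights

# K-FAC preconditioned gradient: G⁻¹ ∇W A⁻¹  (equals (A ⊗ G)⁻¹ vec(∇W))
precond = matmul(inv2(G), matmul(gradW, inv2(A)))

eta = 0.1                                    # then step: W ← W − η · precond
W = [[-eta * precond[i][j] for j in range(2)] for i in range(2)]
```

Per layer this costs two small inversions instead of one inversion of a (d_in·d_out)-sized matrix, which is the source of K-FAC's O(d) memory footprint.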

Extensions (2020–2025):

  • Include layer-wise damping

  • Diagonal, block-diagonal, or empirical Fisher variants

  • Shampoo (2018) – similar but uses different factorization

  • Sophia (2023) – uses a cheap diagonal Hessian-estimate preconditioner with clipping → competitive with AdamW at scale

Advantages:

  • Memory O(d) instead of O(d²)

  • Fast per-iteration cost

  • Often 2–10× fewer steps than Adam in vision/language tasks

Current status (2025–2026): K-FAC and variants (Sophia, Shampoo, AdaHessian) are among the few second-order-style methods that remain competitive or superior to AdamW on large-scale tasks when tuned properly.

6.8 Practical second-order methods in deep learning: pros, cons, when to use

Pros of second-order / Hessian-aware methods:

  • Much faster convergence per epoch (quadratic locally)

  • Better handling of ill-conditioning

  • Natural step-size adaptation via curvature

  • Sometimes reach flatter, better-generalizing minima

Cons:

  • High memory and compute per step (even approximations cost 2–10× more than Adam)

  • Sensitive to hyperparameters (damping, trust radius, etc.)

  • Less robust to noisy gradients (mini-batches)

  • Hard to scale to trillion-parameter models without heavy approximations

When to use in 2025–2026 practice (realistic ranking):

| Scenario | Recommended Method | Why / When |
|---|---|---|
| Small–medium models (< 100M params) | L-BFGS, K-FAC, AdaHessian | Full-batch or large-batch feasible; faster convergence |
| Fine-tuning / few-shot adaptation | K-FAC, Sophia, Shampoo | Curvature helps when data is limited |
| Large-scale pre-training (LLMs) | AdamW / Lion (first-order) | Speed & stability win; second-order too expensive |
| Research on optimization landscapes | Hessian-free, exact Newton on toy nets | Understand curvature, saddle escape, sharpness |
| When second-order shines | K-FAC/Sophia on vision transformers, diffusion models | 2–5× fewer steps, sometimes better loss/perplexity |

Current frontier (2025–2026):

  • Pure second-order methods are niche

  • Hybrid / approximate second-order (Sophia, Adafactor with momentum, Lion + curvature hints) are gaining traction

  • For most practitioners: AdamW + proper scheduling still dominates due to simplicity and robustness at extreme scale

This concludes the core treatment of second-order techniques. The next chapters cover convergence theory and advanced modern variants.


7. Convergence Theory and Analysis

This chapter provides a rigorous overview of convergence guarantees for optimization algorithms in machine learning contexts. We cover deterministic first-order methods on convex problems, extensions to strongly convex and non-convex settings, stochastic guarantees, fundamental lower bounds, the role of problem constants (conditioning, smoothness), and modern insights into deep learning loss landscapes.

7.1 Convex smooth optimization: rates for gradient descent family

Assume J is L-smooth (Lipschitz continuous gradient: ‖∇J(θ) − ∇J(θ')‖ ≤ L ‖θ − θ'‖) and convex (possibly μ-strongly convex with μ ≥ 0).

Standard rates (fixed step size η ≤ 1/L):

  • Gradient Descent (GD): J(θ_t) − J^* ≤ (1 − μ/L)^t (J(θ_0) − J^*) if μ > 0 (linear convergence); J(θ_t) − J^* ≤ O(1/t) if μ = 0 (sublinear)

  • Nesterov Accelerated Gradient (NAG): optimal among first-order methods for smooth convex functions J(θ_t) − J^* ≤ O(1/t²) (smooth convex, μ = 0) J(θ_t) − J^* ≤ O((1 − √(μ/L))^t) (strongly convex, accelerated linear rate)

Momentum variants (Polyak, Nesterov) achieve the same accelerated rates under proper step-size tuning.

Recent refinements (up to 2025–2026 literature): Long-step or anytime acceleration schedules allow GD-like methods to approach O(1/t^{1.27}) or better in specific smooth convex regimes via computer-assisted proofs or adaptive mechanisms. However, the canonical O(1/t²) for accelerated methods remains the optimal worst-case rate for black-box first-order oracles on L-smooth convex problems.
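The gap between the plain and accelerated rates is easy to see numerically. This sketch compares GD with Nesterov's constant-momentum scheme β = (√κ−1)/(√κ+1) on a κ = 100 diagonal quadratic (a standard textbook setup, with our own toy values):

```python
# GD vs Nesterov momentum on J(θ) = 0.5 (L θ₁² + μ θ₂²) with κ = L/μ = 100.
L_, mu = 100.0, 1.0

def grad(t): return [L_*t[0], mu*t[1]]
def J(t):    return 0.5*(L_*t[0]**2 + mu*t[1]**2)

eta = 1.0 / L_

# plain gradient descent: slow along the low-curvature coordinate
t = [1.0, 1.0]
for _ in range(100):
    g = grad(t)
    t = [t[0] - eta*g[0], t[1] - eta*g[1]]
gd_val = J(t)

# Nesterov for strongly convex problems: β = (√κ − 1)/(√κ + 1)
kappa = L_ / mu
beta = (kappa**0.5 - 1) / (kappa**0.5 + 1)
t, t_prev = [1.0, 1.0], [1.0, 1.0]
for _ in range(100):
    y = [t[i] + beta*(t[i] - t_prev[i]) for i in range(2)]
    g = grad(y)
    t_prev, t = t, [y[i] - eta*g[i] for i in range(2)]
nag_val = J(t)
```

After the same 100 steps the accelerated iterate is several orders of magnitude closer to the optimum, reflecting the √κ vs κ dependence.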

7.2 Strongly convex case: linear convergence

When J is μ-strongly convex + L-smooth (condition number κ = L/μ ≥ 1):

  • GD with η = 1/L: J(θ_t) − J^* ≤ (1 − 1/κ)^t (J(θ_0) − J^*) → linear rate, but slow when κ is large (ill-conditioned problems common in ML)

  • Optimal accelerated rate (Nesterov): Number of iterations to reach ε-suboptimality: O(√κ log(1/ε)) → much better dependence on conditioning than plain GD's O(κ log(1/ε))

In practice for deep nets, even when locally strongly convex near a minimum, global strong convexity rarely holds, but local linear convergence can still occur once the algorithm enters a strongly convex basin.

7.3 Non-convex optimization: stationarity, escape from saddle points

Most deep learning objectives are highly non-convex.

First-order stationarity GD/SGD converge to points where ‖∇J(θ)‖ → 0 (first-order stationary points).

Second-order stationarity A point is second-order stationary if ∇J(θ) = 0 and Hessian H(θ) ≽ 0 (no negative curvature directions). Local minima satisfy this; strict saddle points have at least one negative eigenvalue.

Escape from saddle points (key insight for non-convex success): In high dimensions (d ≫ 1), almost all critical points are saddles, not local minima.

Theoretical escape mechanisms (2015–2025 results):

  • Perturbed GD / SGD with injected noise: escapes saddles in O(log d / λ_min²) iterations (λ_min = most negative eigenvalue).

  • Stochasticity alone (SGD noise) suffices in many cases → escapes in polynomial time with high probability.

  • Recent variants (2024–2025): occupation-time-adapted perturbations (PGDOT/PAGDOT), curvature-calibrated noise, gradient clipping in distributed settings, or PID-like controllers accelerate escape rates compared to vanilla perturbed GD.

  • Empirical observation: momentum + adaptive methods (Adam) naturally escape saddles faster due to velocity buildup and noise.

Convergence guarantees in non-convex smooth case: E[‖∇J(θ_t)‖²] ≤ O(1/√t) for SGD under bounded variance assumptions.

7.4 Stochastic convergence: almost sure convergence, expected suboptimality

For SGD on finite-sum problems J(θ) = (1/n) ∑ ℓ_i(θ):

Robbins–Monro conditions (classic, still foundational): η_t > 0, ∑ η_t = ∞, ∑ η_t² < ∞ → almost sure convergence to stationary point in convex case.

Recent relaxations (2024–2025): Using stopping-time methods or martingale arguments, almost sure convergence holds under weaker conditions: ∑ η_t = ∞ and ∑ η_t^p < ∞ for some p > 2 (allows slower decay than 1/t). Expected suboptimality: O(1/√T) for non-convex smooth (optimal first-order stochastic rate).

Almost sure vs in expectation:

  • Almost sure: θ_t → stationary point w.p. 1 (pathwise convergence).

  • In expectation: E[J(θ_t) − J^*] → 0 (weaker but easier to prove).

In deep learning, we often get practical convergence even with constant LR + decay schedules that violate strict Robbins–Monro, due to non-stationary landscapes and regularization.

7.5 Complexity lower bounds and optimal methods

Oracle complexity lower bounds (worst-case, first-order black-box oracles):

  • Smooth convex: Ω(√(L/ε)) gradient queries for ε-suboptimality (achieved by NAG).

  • Strongly convex smooth: Ω(√(κ) log(1/ε)).

  • Stochastic smooth convex: Ω(1/ε²) (SGD optimal up to logs).

  • Non-convex smooth: Ω(1/ε⁴) for first-order stationarity in worst case; practical rates O(1/√T) for SGD.

  • Constrained composite non-convex nonsmooth (2025 results): tighter lower bounds involving affine constraints, showing some first-order methods cannot achieve better than certain polynomial dependencies.

Optimal methods: Accelerated gradient (Nesterov), variance-reduced SGD (SVRG/SAGA for finite-sum), or approximate second-order (K-FAC/Sophia) approach these bounds in regimes where assumptions hold.

7.6 Influence of conditioning number, Lipschitz constants, smoothness

Condition number κ = L/μ (strongly convex case):

  • Large κ → narrow valleys → slow convergence (zig-zagging in GD).

  • Preconditioning (e.g., Adam's adaptive LR, natural gradient, K-FAC) reduces effective κ.

Lipschitz constant L (smoothness):

  • Bounds step size η ≤ 1/L for descent guarantee.

  • Large L → small allowable steps → slow progress.

Gradient Lipschitz constant (Hessian bounded): Controls descent lemma strength and escape-from-saddle time (smaller negative curvature → longer escape).

In deep nets:

  • Effective L grows with depth/width → requires clipping, normalization (LayerNorm), or adaptive optimizers.

  • Ill-conditioning worsens with scale → motivates techniques like warm-up, large-batch scaling, sharpness-aware methods.

7.7 Landscape pathologies in deep learning: sharp vs flat minima, mode connectivity

Sharp vs flat minima (Hessian trace/norm or max eigenvalue as sharpness proxy):

  • Flat minima: low curvature → loss insensitive to parameter perturbations → better robustness/generalization (classic Hochreiter & Schmidhuber 1997; Keskar et al. 2017). SGD with small batches, noise, or SAM tends to find flatter minima.

  • Sharp minima: high curvature → sensitive → often overfit, poor generalization.

Recent challenges to the flat-minima hypothesis (2025 literature):

  • Sharp minima can generalize well if they lie in large-volume regions or under certain data distributions.

  • Even algorithmically found flat minima (via SAM/SA-GD) can sometimes yield Ω(1) population risk in realizable stochastic convex settings → flatness alone does not guarantee good generalization.

  • Volume argument: flat minima occupy larger measure in parameter space → more likely to be found, but sharp minima in high-volume "good" regions exist and generalize.

Mode connectivity (Garipov et al. 2018; later works): In over-parameterized nets, local minima are often connected via low-loss paths (valleys) → loss surface has connected global basin rather than isolated bad minima. This explains why SGD rarely gets stuck in poor local minima despite non-convexity.

Implications for deep learning (2025–2026 view):

  • Generalization tied more to implicit bias of optimizer (SGD → flat-ish minima) than landscape alone.

  • Sharpness-aware minimization (SAM) and variants remain popular for enforcing flatter solutions.

  • Rethinking flatness: some works propose seeking ε-maxima (points slightly worse than neighbors) for better robustness in modern over-parameterized regimes.

This chapter ties together theoretical foundations with practical deep learning observations. Convergence in non-convex high-dimensional settings relies heavily on stochasticity, momentum, and landscape properties rather than strict convexity.


8. Advanced Convergence Techniques and Modern Variants

This chapter explores cutting-edge (as of 2025–2026) techniques that push beyond classical first- and second-order methods. These include learning-rate-free or highly adaptive optimizers, hybrid combinations, spectral step-size rules, relaxed convexity notions like the Polyak-Łojasiewicz (PL) inequality, momentum extensions to non-convex landscapes, proximal formulations, and considerations for distributed training.

8.1 Learning rate free / adaptive methods: Adadelta, Yogi, Sophia

These methods aim to eliminate or drastically reduce manual learning-rate tuning by incorporating adaptive mechanisms or second-order information.

Adadelta (2012): Fully adaptive, no global learning rate. Update: Δθ_t = −(RMS[Δθ]_{t−1} / RMS[g]_t) ⊙ g_t, where E[g²]_t = ρ E[g²]_{t−1} + (1−ρ) g_t² and RMS[g]_t = √(E[g²]_t + ε) (similarly for the Δθ accumulator); ρ ≈ 0.95, ε ≈ 10⁻⁶ for stability. → Converges well without LR schedules in many cases; still used in some legacy setups.

Yogi (2018): Improvement over Adam that fixes slow convergence in later stages and better handles non-stationary objectives. Key change: the second-moment update is sign-aware instead of a plain exponential moving average: v_t = v_{t−1} − (1 − β₂) sign(v_{t−1} − g_t²) g_t² → Prevents overly aggressive growth or decay of the adaptive rates → faster late-stage progress. Empirical: Often outperforms Adam/RMSProp in vision/language tasks; meta-learned Yogi variants showed strong performance in nanophotonic inverse design (2025 studies).

Sophia (2023, with 2025–2026 refinements): Scalable second-order stochastic optimizer for LLM pre-training. Uses inexpensive diagonal Hessian or Gauss-Newton approximations + entry-wise update clipping. Per-step cost close to AdamW, but exploits curvature for adaptive scaling. 2025 comparisons (e.g., vs. AdamW/Lion): Sophia often achieves lowest training/validation loss on GPT-style models at fixed token budget, though AdamW sometimes retains better downstream accuracy. Variants integrated in libraries like CoLLiE for 3D parallelism. → Competitive or superior in compute efficiency for large-scale LLM pre-training.

8.2 Lookahead, RAdam, Ranger, Lion

These combine ideas for stability, faster convergence, or memory efficiency.

Lookahead (2019): Maintains slow weights (updated infrequently) and fast weights (inner-loop updates). Outer update: slow = slow + α (fast - slow) → Acts as a form of temporal ensembling → stabilizes training, reduces variance.

RAdam (Rectified Adam, 2019): Fixes Adam's variance issues early in training via rectified bounds on adaptive LR. → Better warm-start behavior, less sensitive to LR choice.

Ranger (2019–2021 variants like Ranger21): Synergistic combination: RAdam + Lookahead + Gradient Centralization (GC). GC: subtracts mean from gradients before update → centers gradients, improves stability. Ranger21 adds more components (e.g., AdamW-style decay). Still popular in 2025 for stable training in vision tasks.

Lion (Evolved Sign Momentum, 2023, refined 2025): Sign-based, memory-efficient (stores only momentum, no second moment). Update: c_t = β₁ m_{t−1} + (1−β₁) g_t; θ_{t+1} = θ_t − η (sign(c_t) + λ θ_t); m_t = β₂ m_{t−1} + (1−β₂) g_t (with decoupled weight decay λ). → ~33–50% less memory than AdamW, competitive or better performance on many tasks (ImageNet, diffusion, LLMs). 2025 refinements (e.g., Refined Lion / RLion): introduce non-linear bounded functions on the momentum product → higher accuracy (up to +20% vs AdamW in some classification/detection tasks), mitigating gradient explosion/vanishing issues. Often fastest in GPU hours; strong generalization in vision-language and object detection.
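A minimal per-parameter sketch of a Lion-style step follows (pure Python, list-based); the hyperparameter defaults are typical values, and this is our illustration rather than the authors' code:

```python
# Lion-style step: sign of an interpolated direction, decoupled weight decay,
# and a separately tracked momentum buffer.
def sign(x):
    return (x > 0) - (x < 0)

def lion_step(theta, m, g, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.01):
    new_theta, new_m = [], []
    for t, mi, gi in zip(theta, m, g):
        c = beta1*mi + (1 - beta1)*gi              # interpolation used for the sign
        new_theta.append(t - lr*(sign(c) + wd*t))  # sign step + decoupled decay
        new_m.append(beta2*mi + (1 - beta2)*gi)    # momentum tracked with β₂
    return new_theta, new_m

theta, m = lion_step([1.0, -1.0], [0.0, 0.0], [1.0, -2.0])
```

Because the update stores only m (no second-moment buffer), the optimizer state is roughly half of AdamW's, which is where the memory savings come from.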

8.3 Barzilai-Borwein step size & spectral methods

Barzilai-Borwein (BB) step size (1988, revived in deep learning): Approximates a quasi-Newton (scalar/diagonal Hessian) step without storing any matrix. η_t = (s_{t-1}ᵀ s_{t-1}) / (s_{t-1}ᵀ y_{t-1}) (BB1), or the alternative BB2 form η_t = (s_{t-1}ᵀ y_{t-1}) / (y_{t-1}ᵀ y_{t-1}), where s = θ_t − θ_{t-1}, y = g_t − g_{t-1} → Spectral step size → adapts per iteration based on a recent curvature estimate.

Spectral variants (e.g., Spectral Projected Gradient): Combine BB with non-monotone line search or gradient projection. In deep learning: Used in sharpness-aware minimization extensions or as LR scheduler proxy. Advantages: No manual LR tuning, good for ill-conditioned problems. Drawback: Can oscillate; often combined with momentum or clipping.
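As a concrete sketch, here is BB1 on a small quadratic of our own choosing, with a guard for the converged case where sᵀy reaches zero:

```python
# Barzilai–Borwein (BB1) step size on J(θ) = 0.5 θᵀAθ with A = [[4,1],[1,2]].
def grad(t):
    return [4*t[0] + t[1], t[0] + 2*t[1]]

t_prev = [3.0, -2.0]
t = [2.9, -1.9]                                # any second point seeds s, y
for _ in range(30):
    s = [t[i] - t_prev[i] for i in range(2)]
    y = [grad(t)[i] - grad(t_prev)[i] for i in range(2)]
    sy = s[0]*y[0] + s[1]*y[1]
    if sy == 0:                                # iterates have converged exactly
        break
    eta = (s[0]*s[0] + s[1]*s[1]) / sy         # BB1 spectral step size
    g = grad(t)
    t_prev, t = t, [t[i] - eta*g[i] for i in range(2)]
```

The loss along the way is typically nonmonotone (the oscillation mentioned above), but the iterates still converge to the minimizer at the origin.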

8.4 Polyak-Łojasiewicz (PL) inequality and beyond

PL inequality (1963/1988, modern revival): ‖∇J(θ)‖² ≥ 2μ (J(θ) − J^*) (μ-PL, exponent β = 1/2) More general: ‖∇J‖ ≥ c (J − J^*)^β with β ∈ (0,1]

→ Generalizes strong convexity (PL with β=1/2 is weaker than μ-strong convexity but sufficient for linear convergence of GD/SGD).

Key results (up to 2025–2026):

  • PL sufficient for linear convergence of GD/SGD in non-convex settings (local PL common near minima).

  • Recent: some 2025 analyses argue that, for sufficiently smooth (C²) functions, PL is much closer to strong convexity than previously appreciated, showing equivalence under additional regularity assumptions.

  • Local PL emerges in over-parameterized nets via NTK stability or quasi-convex regions → explains fast convergence in deep nets.

  • Gradient domination (generalized PL) enables almost-sure rates close to o(n^{-1/(4β-1)}) for SGD/SHB.

  • PL used to analyze distributed methods, bilevel optimization, quantized training.

Beyond PL: Kurdyka-Łojasiewicz (KL) extensions for nonsmooth cases; local PLNet architectures explicitly enforce PL-like properties.

8.5 Heavy-ball momentum under non-convexity

Heavy-ball / Polyak momentum in non-convex:

v_{t+1} = β v_t - η ∇J(θ_t) θ_{t+1} = θ_t + v_{t+1}

Theoretical progress (2024–2025):

  • Escapes saddles faster due to velocity buildup.

  • Under PL or gradient-domination: almost-sure convergence with rates approaching expectation bounds.

  • Stochastic heavy-ball (SHB) achieves o(n^{-1/(4β-1)}) almost surely under global/local gradient domination.

  • Robust to noise; helps in ill-conditioned non-convex landscapes common in deep learning.

8.6 Proximal methods and regularization in optimization view

Proximal gradient for composite problems J(θ) + R(θ):

θ_{t+1} = prox_{η R} (θ_t - η ∇J(θ_t)) prox_R(z) = argmin_u (1/2)‖u - z‖² + R(u)

In deep learning:

  • L1/L2 regularization → soft/hard thresholding or weight decay.

  • AdamW decouples weight decay → proximal view of L2.

  • Sharpness-aware minimization (SAM) ≈ proximal step minimizing worst-case loss in neighborhood.

  • Proximal policy optimization (PPO) in RL uses clipped surrogate → proximal-like trust region.

Benefits: Built-in regularization, handles constraints, stabilizes non-smooth terms.
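The composite update is two lines in code. This sketch applies proximal gradient (ISTA) with the L1 soft-thresholding prox to a separable toy quadratic; all values are illustrative:

```python
# Proximal gradient (ISTA) for J(θ) + λ‖θ‖₁:
# gradient step on the smooth part, then the L1 prox = soft-thresholding.
def soft_threshold(z, tau):
    return max(z - tau, 0.0) if z > 0 else min(z + tau, 0.0)

def grad_smooth(t):                       # ∇ of 0.5(t₁ − 1)² + 0.5(t₂ − 0.05)²
    return [t[0] - 1.0, t[1] - 0.05]

lam, eta = 0.1, 0.5
t = [0.0, 0.0]
for _ in range(100):
    g = grad_smooth(t)
    z = [t[i] - eta*g[i] for i in range(2)]       # forward (gradient) step
    t = [soft_threshold(z[i], eta*lam) for i in range(2)]  # backward (prox) step
```

The first coordinate shrinks toward 1 − λ = 0.9, while the second coordinate's gradient at zero is smaller than λ, so the prox pins it exactly to zero: the sparsity effect of L1 regularization.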

8.7 Distributed / parallel optimization: data-parallel, model-parallel implications

Data-parallel (DDP / FSDP): Each worker has full model, averages gradients. → Optimizer sees averaged (less noisy) gradients → larger effective batch size → sharper minima unless noise added. Implication: Adaptive methods (AdamW/Lion) benefit from large-batch scaling rules; second-order approximations (K-FAC/Sophia) scale better with sharding.

Model-parallel / pipeline / tensor-parallel: Splits model across devices. → Gradient computation partial → requires all-reduce or local updates. Implication: Optimizer must handle asynchrony/noise; Lion/Sophia memory-efficient → preferred for extreme scale.

2025–2026 trends:

  • DiLoCo-style low-communication optimizers scale better than pure data-parallel.

  • Sharded optimizers (ZeRO-Offload) pair well with memory-light methods like Lion.

  • Hybrid parallelism + adaptive optimizers → frontier for trillion-parameter training.

Summary table (2025–2026 practical favorites)

| Technique/Optimizer | Key Strength | Memory vs AdamW | Typical Use Case (2025–2026) |
|---|---|---|---|
| Sophia | Diagonal Hessian + clipping | Similar | LLM pre-training (lowest loss) |
| Lion / RLion | Sign-based, very efficient | ~33–50% less | Vision, diffusion, general fine-tune |
| Yogi | Fixes Adam variance issues | Similar | Stable adaptive training |
| Ranger / Ranger21 | RAdam + Lookahead + GC | Similar | Stable legacy vision tasks |
| PL-based analysis | Explains fast convergence | — | Theoretical justification for DL |

These advanced methods represent the state-of-the-art trade-offs between speed, stability, memory, and generalization in large-scale deep learning.


9. Backpropagation and Automatic Differentiation Revisited

This chapter revisits the core mechanism that makes modern deep learning possible: automatic differentiation (AD), with a focus on reverse-mode AD (backpropagation). We emphasize how gradients flow through computational graphs, the distinction between forward- and reverse-mode primitives, efficient ways to compute Hessian-vector products (crucial for second-order methods), and support for higher-order derivatives in contemporary frameworks (PyTorch, JAX, TensorFlow).

9.1 Computational graph and reverse-mode AD

A neural network computation is represented as a directed acyclic graph (DAG) — the computational graph — where:

  • Nodes are operations (e.g., matrix multiplication, ReLU, addition, softmax)

  • Edges represent data flow (tensors flowing from one op to another)

  • Leaves are inputs (parameters θ or data x)

  • Root(s) are outputs (usually scalar loss J)

Forward pass: Evaluate the graph from inputs to outputs, computing intermediate values (primal values) and storing them for later use.

Reverse-mode AD (backpropagation):

  • Starts from the scalar output (loss J).

  • Propagates adjoints (sensitivities, ∂J/∂z for each intermediate z) backward through the graph.

  • At each node, applies the local vector-Jacobian product (VJP) rule using the chain rule.

  • Final result: adjoints w.r.t. parameters → gradient ∇_θ J.

Why reverse-mode dominates in deep learning:

  • Number of parameters (inputs to AD) ≫ number of outputs (usually one scalar loss) → reverse mode needs just one forward pass plus one backward pass (a small constant multiple of forward cost), regardless of parameter count.

  • Forward-mode would require one pass per parameter → infeasible for millions/billions of parameters.

Modern frameworks (2025–2026):

  • PyTorch: dynamic graph (eager execution + tape-based backward)

  • JAX: static graph via tracing + XLA compilation, supports both modes natively

  • TensorFlow 2.x: eager by default, graph mode optional

Example simple graph (MLP layer + ReLU + MSE):


x → [W x + b] → ReLU → ŷ → (ŷ - y)² / 2 → J

Forward: compute all intermediates. Backward: start with ∂J/∂J = 1, propagate back to ∂J/∂W, ∂J/∂b, ∂J/∂x.

9.2 Gradient computation through chain rule

The chain rule in reverse mode:

For z = f(u), where u = g(v), scalar loss J:

∂J/∂v = (∂J/∂z) ⋅ (∂z/∂u) ⋅ (∂u/∂v) = adjoint_z ⋅ local_derivative(f) ⋅ local_derivative(g)

In vector form (most practical):

Let a = layer input, z = W a + b, ŷ = σ(z), J = loss(ŷ, target)

Backward pass:

  • Receive incoming adjoint from above: δ_ŷ = ∂J/∂ŷ

  • Compute δ_z = δ_ŷ ⊙ σ'(z) (elementwise for activation)

  • Then δ_a = Wᵀ δ_z

  • δ_W = δ_z aᵀ

  • δ_b = δ_z

This is repeated layer-by-layer backward.
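These update rules can be checked numerically in a few lines. A minimal NumPy sketch (the sigmoid activation, the 2×3 layer size, and the zero target are arbitrary choices for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
a = rng.normal(size=3)            # layer input
W = rng.normal(size=(2, 3))       # weights
b = rng.normal(size=2)            # bias
target = np.zeros(2)

# Forward pass: z = W a + b, ŷ = σ(z), J = ½‖ŷ − target‖²
z = W @ a + b
y_hat = sigmoid(z)
J = 0.5 * np.sum((y_hat - target) ** 2)

# Backward pass, applying the VJP rules from the text
d_yhat = y_hat - target               # δ_ŷ = ∂J/∂ŷ for squared error
d_z = d_yhat * y_hat * (1 - y_hat)    # δ_z = δ_ŷ ⊙ σ'(z)
d_a = W.T @ d_z                       # δ_a = Wᵀ δ_z
d_W = np.outer(d_z, a)                # δ_W = δ_z aᵀ
d_b = d_z                             # δ_b = δ_z
```

A finite-difference check on any single entry of d_W confirms the analytic gradient.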

Key property: Each local VJP is cheap (same cost as forward op), and storage of intermediates enables efficient reuse.

Autodiff implementation:

  • During forward: record op type + inputs + outputs

  • During backward: traverse graph in reverse topological order, apply pre-defined VJP rules for each op (e.g., matmul, conv, softmax all have efficient VJP implementations)

9.3 Vector-Jacobian products (VJP) and Jacobian-vector products (JVP)

These are the primitive building blocks of AD.

Jacobian-vector product (JVP) — forward-mode primitive:

J_f(v) = Df(x) ⋅ v (pushforward)

  • v is tangent vector (direction)

  • Computes directional derivative along v

  • Implemented by dual numbers or forward-over-forward AD

  • Cost: same as one forward pass

Vector-Jacobian product (VJP) — reverse-mode primitive:

V_f(u) = uᵀ Df(x) (pullback)

  • u is cotangent vector (incoming adjoint)

  • Computes gradient contribution back-propagated from u

  • Implemented by reverse accumulation

  • Cost: same as one backward pass

Comparison table (2025 perspective, JAX/PyTorch style)

Property             | JVP (forward-mode)             | VJP (reverse-mode)
Mathematical form    | Df(x) v                        | uᵀ Df(x)
Direction of computation | Forward (with tangents)    | Backward (with cotangents)
Best when            | Few inputs, many outputs       | Many inputs, few outputs (DL default)
In JAX               | jax.jvp(f, primals, tangents)  | jax.vjp(f, primals) → (out, vjp_fun)
In PyTorch           | torch.func.jvp                 | torch.autograd.grad / .backward()
Use for higher-order | Building block for nested JVPs | Building block for nested VJPs

Example in JAX:

Python

import jax
import jax.numpy as jnp

def f(x):
    return jnp.sin(x) * x**2   # example scalar function

primals = (jnp.array(2.0),)
tangents = (jnp.array(1.0),)
out, jvp_val = jax.jvp(f, primals, tangents)   # value and directional derivative along tangents

Reverse-mode equivalent uses vjp to get full gradient.
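For intuition, both products can be computed against an explicit Jacobian on a toy function (the function f and the directions below are made-up illustrations; real AD never materializes Df(x)):

```python
import numpy as np

def f(x):
    # toy vector-valued function f: R^3 -> R^2
    return np.array([x[0] * x[1], np.sin(x[2])])

def jacobian(x):
    # explicit Jacobian Df(x), shape (2, 3)
    return np.array([
        [x[1], x[0], 0.0],
        [0.0,  0.0,  np.cos(x[2])],
    ])

x = np.array([1.0, 2.0, 0.5])
v = np.array([1.0, 0.0, 0.0])   # tangent (input-space direction)
u = np.array([0.0, 1.0])        # cotangent (output-space sensitivity)

jvp = jacobian(x) @ v    # pushforward: Df(x) v
vjp = u @ jacobian(x)    # pullback:    uᵀ Df(x)
```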

9.4 Efficient Hessian-vector products (finite differences vs AD tricks)

Hessian-vector product (HVP): Hv = ∇²J(θ) v = directional second derivative along v.

Needed for Newton-CG, Hessian-free, K-FAC, natural gradient, etc.

Method 1: Finite differences of gradients Hv ≈ [∇J(θ + εv) - ∇J(θ - εv)] / (2ε)

  • Two full gradient computations

  • Numerical instability (cancellation error, ε choice)

  • Simple to implement, no extra AD rules needed

  • Still used for debugging or when AD not available
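Method 1 is easy to sanity-check on a quadratic, where the Hessian is known exactly. A small NumPy sketch (the SPD matrix A and the test point are arbitrary):

```python
import numpy as np

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])          # Hessian of J(θ) = ½ θᵀ A θ

def grad_J(theta):
    return A @ theta                # ∇J(θ) = A θ

def hvp_fd(grad, theta, v, eps=1e-5):
    # central difference of gradients: Hv ≈ [∇J(θ+εv) − ∇J(θ−εv)] / (2ε)
    return (grad(theta + eps * v) - grad(theta - eps * v)) / (2 * eps)

theta = np.array([0.3, -0.7])
v = np.array([1.0, 0.0])
hv = hvp_fd(grad_J, theta, v)       # should match A @ v
```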

Method 2: Forward-over-reverse AD (Pearlmutter's trick, most efficient) Hv = ∇_θ (vᵀ ∇_θ J(θ)) (gradient of directional derivative)

  • One forward + one reverse pass → compute ∇_θ J(θ)

  • Form the scalar vᵀ ∇_θ J(θ) (a directional derivative)

  • Differentiate that scalar once more (a JVP pushed through the gradient function, or a second reverse pass) → Hv

  • Cost: ~2–3× single gradient evaluation

  • Machine precision, no ε tuning

  • Native in JAX (jax.hessian or jax.grad(jax.grad) with care), PyTorch (torch.autograd.functional.hvp or nested .backward())
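The identity behind Pearlmutter's trick is a one-line consequence of Hessian symmetry:

```latex
\nabla_\theta\!\left( v^\top \nabla_\theta J(\theta) \right)
  = \left( \nabla_\theta^2 J(\theta) \right)^{\!\top} v
  = H^\top v
  = H v,
\qquad \text{since } H = \nabla_\theta^2 J(\theta) \text{ is symmetric.}
```

Because v is a constant, differentiating the scalar vᵀ∇J costs about as much as one extra gradient evaluation, which is where the ~2–3× figure comes from.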

Method 3: Reverse-over-reverse (possible, but less common for HVPs). Used in some higher-order cases.

2025–2026 reality:

  • Forward-over-reverse is standard in JAX/PyTorch for HVPs (used in Sophia, Hessian-free variants, sharpness-aware methods).

  • Finite differences remain fallback for non-differentiable ops or quick prototyping.

  • In large models: memory-efficient via checkpointing or gradient checkpointing during nested AD.

9.5 Higher-order derivatives in AD frameworks (Hessian-free, K-FAC)

Modern frameworks support arbitrary-order derivatives via nested autodiff.

JAX (strongest higher-order support):

  • jax.jacfwd(jax.grad(f)) → Hessian (but beware memory explosion)

  • jax.hessian(f) computes full Hessian (small models only)

  • jax.jacrev, jax.jacfwd for Jacobians

  • Efficient HVPs via jax.grad(lambda x: jnp.dot(v, jax.grad(f)(x)))(x)

  • Used in K-FAC implementations, natural gradient, higher-order optimizers

PyTorch:

  • torch.func.hessian (functorch → torch.func)

  • Nested torch.autograd.grad or .backward() calls

  • torch.autograd.functional.hvp, jacrev, etc.

  • 2025: torch.compile improves nested AD performance

Hessian-free optimization:

  • Never forms full H; only Hv via forward-over-reverse

  • Solve H δ = -g via conjugate gradient (CG), needing ~10–50 Hv products per outer step

  • Martens-style (2010) still relevant in 2025 for fine-tuning or small-to-medium models
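The inner CG solve needs only a black-box Hv routine. A minimal sketch on a toy quadratic (the 2×2 SPD matrix stands in for an HVP oracle; real Hessian-free code would supply Hv via autodiff):

```python
import numpy as np

def cg_solve(hvp, g, iters=50, tol=1e-10):
    """Solve H d = -g using only Hessian-vector products."""
    d = np.zeros_like(g)
    r = -g - hvp(d)                  # residual of H d = -g
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs / (p @ Hp)
        d += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d

H = np.array([[3.0, 1.0],
              [1.0, 2.0]])           # toy SPD Hessian
g = np.array([1.0, -1.0])            # current gradient
d = cg_solve(lambda v: H @ v, g)     # approximate Newton step, H d = -g
```

In practice the `hvp` callable would wrap a forward-over-reverse AD product instead of a stored matrix.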

K-FAC (Kronecker-Factored Approximate Curvature):

  • Approximates Fisher/Hessian as block-diagonal + Kronecker structure per layer

  • Inverts small matrices (A ⊗ G)⁻¹ ≈ A⁻¹ ⊗ G⁻¹

  • Uses HVPs or empirical Fisher statistics

  • Modern implementations (2025): Sophia (diagonal approx + clipping), Shampoo, AdaHessian variants

  • JAX-based K-FAC libs remain fastest; PyTorch ports improving
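The Kronecker inversion identity that makes K-FAC cheap is easy to verify numerically (the two small factors below are arbitrary SPD matrices):

```python
import numpy as np

A = np.array([[2.0, 0.3], [0.3, 1.0]])   # input-covariance factor
G = np.array([[1.5, 0.2], [0.2, 0.8]])   # gradient-covariance factor

# (A ⊗ G)⁻¹ = A⁻¹ ⊗ G⁻¹ : invert two small matrices instead of one large one
lhs = np.linalg.inv(np.kron(A, G))
rhs = np.kron(np.linalg.inv(A), np.linalg.inv(G))
```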

Higher-order in practice:

  • Second-order common (HVPs for Newton-CG, K-FAC/Sophia)

  • Third+ order rare (Taylor-mode AD or nested JVPs/VJPs for curvature of curvature)

  • Frameworks handle it via operator overloading or tracing (JAX excels here)

Summary table

Concept                | Forward-mode (JVP)   | Reverse-mode (VJP) | Cost in DL (many params, scalar loss) | Typical use
First-order gradient   | Possible             | Dominant           | 1 fwd + 1 bwd                         | Backprop
Hessian-vector product | Forward-over-reverse | —                  | ~2–3× gradient                        | Newton-CG, K-FAC, Sophia
Full Hessian           | Very expensive       | Very expensive     | O(d) gradients or memory blowup       | Small models only
Higher-order           | Nested JVPs          | Nested VJPs        | Exponential cost unless approximated  | Research, curvature analysis

This machinery — especially efficient reverse-mode VJPs and forward-over-reverse HVPs — underpins every gradient-based training loop and most advanced optimizers discussed earlier.



10. Special Topics and Frontiers (2025–2026 Perspective)

This final chapter surveys emerging trends, open questions, and frontier developments in optimization for machine learning as of early 2026. It focuses on phenomena observed in large-scale training (LLMs, vision transformers, multimodal models) and highlights where theory meets messy practice. Many ideas build on earlier chapters: sharpness from non-convex landscapes (7.7), second-order approximations (6), implicit biases (from stochasticity and adaptivity), and scaling challenges in distributed settings (8.7).

10.1 Sharpness-aware minimization (SAM) and related methods

Sharpness-Aware Minimization (SAM, 2021) perturbs parameters in the neighborhood of the current θ before computing the gradient, effectively minimizing the worst-case loss in a ρ-ball:

θ ← θ − η ∇_θ J(θ + δ̂),  where δ̂ = ρ ∇_θ J(θ) / ‖∇_θ J(θ)‖₂ approximates argmax_{‖δ‖₂ ≤ ρ} J(θ + δ)

This encourages flatter minima, improving generalization (especially OOD robustness).
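The two-step SAM update is only a few lines in NumPy. A sketch on a toy non-convex loss (the quartic loss, step size, and radius ρ are illustrative choices, not recommended hyperparameters):

```python
import numpy as np

def loss(theta):
    return np.sum(theta ** 4 - theta ** 2)      # toy non-convex loss

def grad(theta):
    return 4 * theta ** 3 - 2 * theta

def sam_step(theta, lr=0.1, rho=0.05):
    g = grad(theta)
    # ascent step: move toward the (approximate) worst case in the ρ-ball
    delta = rho * g / (np.linalg.norm(g) + 1e-12)
    # descent step: use the gradient evaluated at the perturbed point
    return theta - lr * grad(theta + delta)

theta0 = np.array([1.0, -0.8])
theta1 = sam_step(theta0)
```

Note the cost: two gradient evaluations per step, which is exactly the overhead that variants like Momentum-SAM try to remove.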

2025–2026 advances:

  • Momentum-SAM (MSAM): Replaces gradient-ascent perturbation with momentum direction (inspired by NAG). Achieves similar flatness without doubling compute cost → near-zero overhead over SGD/Adam. Strong results on vision/language tasks; reduces memory demands.

  • Friendly-SAM / MNSAM: Adds momentum acceleration + NAG in inner/outer loops → skips sharp regions faster, accelerates convergence while flattening.

  • Physics-Informed SAM (PI-SAM): Combines SAM with domain-specific regularizations (e.g., semiconductor material prediction via GNNs) → outperforms baselines in OOD generalization.

  • Unified SAM / USAM: General analysis under relaxed stochastic noise; improved convergence rates; better hyperparameter robustness.

  • Z-Score Gradient Filtering + SAM: NeurIPS OPT 2025 variant → filters outliers before perturbation → stabilizes large-batch training.

Practical status: SAM and variants remain go-to for generalization boosts in vision, diffusion, and fine-tuning. Momentum/acceleration hybrids make it nearly free in cost. Still debated: some sharp minima generalize well if in high-volume regions; flatness not always causal for robustness.

10.2 Second-order preconditioning in large language models

Second-order methods precondition gradients using curvature (Hessian/Fisher approximations) → better handling of ill-conditioning in transformers.

Sophia (2023, still dominant 2025–2026): Lightweight diagonal Hessian estimate + per-coordinate clipping (controls worst-case updates). Achieves 2× speedup over AdamW on GPT-style models (125M–1.5B) → same perplexity with 50% fewer steps. Key: clipping stabilizes heterogeneous curvatures; transferable hyperparameters (γ ≈ 0.01–0.05).

Scaling insights (NeurIPS 2025):

  • Muon, SOAP, Shampoo variants → 1.4×+ speedup over AdamW on transformers when scaled properly (width/depth co-scaling).

  • Optimal weight decay ~1/width (independent scaling); second-order methods prefer larger optimal model size for fixed compute.

  • Incorrect scaling drops speedup from 1.4× to <1.1× (190M → 640M params).

Other developments:

  • AdaFisher: Diagonal block-Kronecker Fisher → robust, stable tuning; strong on image classification/language modeling.

  • MARS (Make vAriance Reduction Shine): Combines preconditioned AdamW/Lion/Shampoo with scaled recursive momentum → variance reduction shines at LLM scale.

  • Sassha: Sharpness-aware + stable Hessian approx → flattens curvature, lazy updates → generalization comparable/superior to first-order.

Trend: Diagonal/Kronecker approximations + clipping → practical second-order for LLMs. Full Gauss-Newton or dense K-FAC still too expensive; hybrids win on Pareto frontier (speed + perplexity).

10.3 Grokking and optimization dynamics

Grokking: Delayed generalization after prolonged overfitting (e.g., modular arithmetic in transformers) → sudden test accuracy jump post-memorization.

2025–2026 mechanistic insights:

  • Low-dimensional + transversely curved trajectories: PCA shows single principal component dominates weight evolution → lies on low-D manifold with curvature orthogonal to progress.

  • Anti-grokking collapse: After peak generalization, loss of test performance → marked by heavy-tailed spectral metric α shifts (HTSR detects all phases reliably).

  • Regularization-induced grokking: Small non-zero ℓ₁/nuclear norm → enforces sparsity/low-rank → grokking even without weight decay; over-parameterization (depth) enables grok/ungrok without explicit reg.

  • Geometric inductive bias: Architectural topology (e.g., fewer degrees of freedom) bypasses phase transitions → faster emergence of invariants.

  • Norm beyond ℓ₂: Grokking without weight decay if regularized toward other properties (ℓ₂ norm grows but generalization holds).

Implications: Grokking tied to geometry (curvature, volume), implicit simplicity bias, and data selection. Heavy-tailed metrics best predict dynamics; grokking sometimes necessary for optimal solutions but trades off training time.

10.4 Implicit bias of optimizers (SGD vs Adam)

Implicit bias: Optimizer favors certain solutions among many minimizers (no explicit regularization).

SGD: Simplicity bias → linear/simple boundaries, suboptimal margins, flatter minima → better generalization (noise as regularizer).

Adam/AdamW: Richer bias → nonlinear, diverse features, closer to Bayes optimal; resistant to simplicity bias → sometimes sharper minima, worse generalization in some regimes.

2025 findings:

  • Incremental (per-sample) Adam deviates from full-batch → can reach ℓ₂-max-margin vs ℓ∞-max-margin (Signum invariant).

  • Stochastic Adam does not converge to full-batch counterpart → mini-batch variants achieve near-zero test error where full-batch fails.

  • Adaptive methods recover higher-accuracy, non-collapsing solutions in categorical tasks → SGD simpler/lower-dimensional.

Practical: SGD (or noisy variants) often generalizes better despite slower convergence; Adam faster but needs careful decay/scheduling. Implicit noise geometry (SGD) vs coordinate-adaptivity (Adam) drives differences.

10.5 When do second-order methods outperform first-order in practice?

Conditions where second-order shines (2025–2026 evidence):

  • Large batch sizes + moderate dataset/model size → curvature exploitation accelerates.

  • Fine-tuning / few-shot / medium-scale (up to ~1B params) → K-FAC/Sophia/Sassha 2–5× fewer steps, better loss/perplexity.

  • Ill-conditioned landscapes (transformers, diffusion) → preconditioning reduces zig-zagging.

  • When tuned properly (diagonal/clipped/Kronecker + scaling rules) → Muon/SOAP/Sophia achieve 1.4×+ speedup over AdamW on LLMs.

Where first-order wins:

  • Extreme scale (trillion params) → memory/compute overhead dominates; AdamW/Lion simpler, more robust.

  • Heavy-tailed noise or unstable gradients → second-order brittle without heavy clipping.

  • Poor hyperparameter scaling → speedup collapses (e.g., wrong weight decay).

Empirical rule-of-thumb: Second-order > first-order when batch size large, model not gigantic, and curvature approx cheap (diagonal/HvP). Hybrids (Sophia, Sassha) close gap.

10.6 Open challenges: scaling second-order methods to trillion-parameter models

Core bottlenecks:

  • Memory: Full/approximate Hessian → O(d²) or high per-step cost; even diagonal/Kronecker strains at 10¹² params.

  • Compute: Frequent curvature estimates → offset by fewer steps, but communication in distributed (all-reduce Fisher stats) explodes.

  • Stability: Heavy-tailed noise, heterogeneous curvatures → divergence without aggressive clipping/lazy updates.

  • Scaling laws: Second-order prefers larger optimal models for fixed compute → but trillion-scale needs new parallelism (contiguous param mgmt, activation offloading).

  • Generalization: Some second-order converge to sharper minima → needs sharpness integration (Sassha-style).

Active directions (2025–2026):

  • Extreme sharding + low-communication (DiLoCo-style) for preconditioners.

  • Memory-efficient frameworks (Zero-Offload + diagonal/block) → target trillion-scale.

  • Hybrid fallback: Mid-training switch to first-order if curvature no longer helps.

  • Theoretical: Sample complexity under heavy tails; better Fisher approximations.

Outlook: Second-order not yet dominant at frontier scale, but diagonal/clipped variants (Sophia/Muon) show promise. Trillion-param training still AdamW/Lion territory; breakthroughs in efficient curvature + scaling rules could change that.

This concludes the module. These frontiers show optimization remains vibrant: blending classical theory with LLM-scale empirics drives progress.


Master Calculus & Optimization for Machine Learning

Dive deep into the mathematics powering modern ML with this comprehensive guide: Calculus & Optimization for ML: Gradients, Hessians & Convergence Techniques.

From foundational univariate/multivariate calculus and vector/matrix derivatives to practical first-order optimizers (GD, SGD, Adam, Lion), second-order methods (Newton, L-BFGS, K-FAC, Sophia), and advanced topics like sharpness-aware minimization (SAM), grokking dynamics, and implicit optimizer biases — this resource covers everything needed to understand and implement state-of-the-art training algorithms.

Perfect for ML engineers, researchers, and students seeking rigorous theory + 2025–2026 frontiers: large-batch scaling, second-order preconditioning in LLMs, and open challenges at trillion-parameter scale.

Build stronger intuition for loss landscapes, convergence guarantees, automatic differentiation, and why certain optimizers generalize better. Elevate your deep learning expertise today.



Linear Algebra for AI: Vectors, Matrices, Eigenvalues in Neural Networks

This outline is designed for ML practitioners, students, and researchers who want to go beyond basic operations and deeply understand how linear algebra underpins modern AI — especially neural networks, transformers, optimization, and interpretability.

Table of Contents

1. Foundations: Scalars, Vectors, and Vector Spaces 1.1 Scalars vs. Vectors: From numbers to directional quantities 1.2 Vector addition, scalar multiplication, and geometric intuition 1.3 Linear combinations, span, and linear independence 1.4 Basis, dimension, and coordinate representations 1.5 Normed vector spaces: Euclidean (L₂), L₁, L∞ norms in ML 1.6 Inner products, dot product, cosine similarity, and angle 1.7 Orthogonality, orthonormal bases, and Gram-Schmidt process 1.8 Vector spaces over real and complex numbers (why complex appear in spectral methods)

2. Matrices as Linear Transformations 2.1 Matrix representation of linear maps 2.2 Matrix multiplication: composition of transformations 2.3 Special matrices in deep learning: identity, diagonal, permutation, symmetric, skew-symmetric, orthogonal, unitary 2.4 Matrix-vector products: linear layers, attention scores 2.5 Rank, column space, row space, null space, and rank-nullity theorem 2.6 Full-rank, low-rank, and rank-deficient matrices in practice 2.7 Matrix norms (Frobenius, operator/spectral, nuclear) and their ML roles

3. Systems of Linear Equations and Least Squares 3.1 Matrix form Ax = b and solvability conditions 3.2 Gaussian elimination, LU decomposition, pivoting 3.3 Overdetermined systems and least-squares solutions 3.4 Normal equations: AᵀA x = Aᵀb 3.5 QR decomposition and its numerical stability 3.6 SVD-based least squares (pseudoinverse) 3.7 Regularized least squares: ridge regression as preconditioned system

4. Eigenvalues, Eigenvectors, and Spectral Decomposition 4.1 Definition and geometric meaning of eigenvalues/eigenvectors 4.2 Characteristic polynomial and algebraic multiplicity 4.3 Diagonalizable matrices and eigendecomposition A = PDP⁻¹ 4.4 Symmetric matrices: real eigenvalues, orthogonal eigenvectors, spectral theorem 4.5 Positive (semi)definite matrices and quadratic forms 4.6 Spectral radius, condition number, and numerical stability 4.7 Power method, inverse iteration, and Rayleigh quotient for largest/smallest eigenvalues 4.8 Deflation and subspace iteration techniques

5. Singular Value Decomposition (SVD) – The Cornerstone of Modern ML 5.1 Definition and geometric interpretation 5.2 Reduced, compact, and truncated SVD 5.3 Low-rank approximation and Eckart–Young–Mirsky theorem 5.4 SVD for dimensionality reduction: PCA as SVD on centered data 5.5 Matrix pseudoinverse via SVD 5.6 Spectral norm, Frobenius norm, and nuclear norm through singular values 5.7 Applications: image compression, recommender systems, attention visualization

6. Principal Component Analysis (PCA) and Variants 6.1 Covariance matrix and variance maximization view 6.2 SVD vs. eigendecomposition approaches to PCA 6.3 Centering, scaling, and standardization in practice 6.4 Kernel PCA for nonlinear dimensionality reduction 6.5 Probabilistic PCA and connection to autoencoders 6.6 Incremental/online PCA for streaming data 6.7 Randomized SVD for large-scale PCA

7. Matrix Factorizations in Neural Networks 7.1 Low-rank adaptation (LoRA): low-rank updates to weight matrices 7.2 Spectral initialization and orthogonal initialization 7.3 Weight decay as nuclear-norm regularization 7.4 Matrix completion and collaborative filtering 7.5 Tucker/CP decomposition for tensor layers 7.6 Butterfly factorization and structured linear layers 7.7 Fast Fourier Transform (FFT) and circulant matrices in efficient attention

8. Eigenvalues and Dynamics in Deep Learning 8.1 Hessian eigenvalues: curvature, sharpness, and generalization 8.2 Loss landscape geometry via spectral properties 8.3 Neural Tangent Kernel (NTK) and infinite-width eigenvalue spectrum 8.4 Spectral bias of gradient descent: learning low-frequency components first 8.5 Grokking and phase transitions through eigenvalue dynamics 8.6 Attention matrices: eigenvalues of softmax(QKᵀ/√d) and softmax collapse 8.7 Stability of recurrent networks via largest eigenvalue of weight matrix

9. Advanced Spectral Methods and Applications 9.1 Graph Laplacians, spectral clustering, and GNNs 9.2 Spectral normalization of weight matrices (SN-GAN, SN-ResNet) 9.3 Power iteration in contrastive learning and representation learning 9.4 Randomized linear algebra: randomized rangefinder, CUR decomposition 9.5 Matrix exponentials and Lie groups in equivariant networks 9.6 Operator learning and Fourier Neural Operators via spectral methods

10. Numerical Linear Algebra in Practice 10.1 Floating-point arithmetic and conditioning pitfalls 10.2 BLAS, LAPACK, cuBLAS, and modern GPU-accelerated linear algebra 10.3 JAX/PyTorch einsum and compiled linear operations 10.4 Mixed-precision training and its impact on eigenvalue accuracy 10.5 Randomized sketching for large-scale matrix computations 10.6 When to use exact vs. approximate methods at scale

Appendices A. Linear algebra notation cheat-sheet and layout conventions B. Common matrix identities used in ML derivations C. Proofs of key theorems (spectral theorem, Eckart–Young, etc.) D. Implementation notes: PyTorch/JAX examples for PCA, SVD, LoRA, spectral norm E. Recommended papers and resources (2023–2026 perspective)

This structure progresses from core concepts → practical decompositions → deep learning-specific applications → numerical considerations, making it suitable for both self-study and classroom use.



1. Foundations: Scalars, Vectors, and Vector Spaces

This opening chapter establishes the most fundamental objects in linear algebra — scalars and vectors — and introduces the abstract structure of vector spaces. These concepts form the bedrock for everything that follows: matrix operations, neural network layers, embeddings, attention mechanisms, optimization landscapes, and spectral methods in modern AI.

1.1 Scalars vs. Vectors: From numbers to directional quantities

A scalar is just an ordinary real number (or sometimes complex number): 3.14, −7, 0.001, etc. Scalars have magnitude but no direction.

A vector is an ordered collection of scalars that carries both magnitude and direction. In ℝⁿ (n-dimensional Euclidean space), a vector is written as:

v = (v₁, v₂, …, vₙ) or v = v₁ e₁ + v₂ e₂ + … + vₙ eₙ

where eᵢ are the standard basis vectors.

Geometric intuition

  • In 2D: an arrow with length and direction (e.g., displacement, velocity, gradient of loss).

  • In high dimensions (d = 10⁶+ in deep learning): no direct visualization, but same algebraic rules apply.

ML examples

  • Word embeddings: GloVe / Word2Vec vectors live in ℝ³⁰⁰–ℝ¹⁰²⁴.

  • Image patches flattened into vectors ∈ ℝ^{H×W×C}.

  • Gradient vector ∇_θ J(θ) ∈ ℝ^{number of parameters}.

1.2 Vector addition, scalar multiplication, and geometric intuition

Vector addition u + v = (u₁ + v₁, u₂ + v₂, …, uₙ + vₙ) → Parallelogram law (head-to-tail addition).

Scalar multiplication c v = (c v₁, c v₂, …, c vₙ) → Stretching (c > 1), shrinking (0 < c < 1), or reversing direction (c < 0).

Properties (defining a vector space over field ℝ or ℂ):

  • Commutative, associative addition

  • Distributive laws

  • Additive identity (zero vector 0)

  • Additive inverse (−v)

Geometric / ML intuition

  • Adding two embedding vectors ≈ combining semantic meanings.

  • Scaling a gradient vector controls step size in optimization.

1.3 Linear combinations, span, and linear independence

A linear combination of vectors v₁, v₂, …, vₖ is any vector of the form:

c₁ v₁ + c₂ v₂ + … + cₖ vₖ (cᵢ scalars)

The span of a set of vectors {v₁, …, vₖ} is the set of all possible linear combinations.

A set of vectors is linearly independent if the only way to get the zero vector as a linear combination is to take all coefficients zero:

c₁ v₁ + … + cₖ vₖ = 0 ⇒ c₁ = c₂ = … = cₖ = 0

Otherwise, the set is linearly dependent (at least one vector is a combination of the others).

ML relevance

  • Over-complete bases (more features than needed) → linear dependence → multicollinearity in regression.

  • Attention heads or neurons often learn redundant (dependent) directions → motivates pruning, distillation, low-rank adaptation (LoRA).

1.4 Basis, dimension, and coordinate representations

A basis for a vector space V is a linearly independent set that spans V.

The dimension dim(V) is the number of vectors in any basis (well-defined theorem).

Once a basis {b₁, …, bₙ} is chosen, every vector v ∈ V has unique coordinates (coefficients) with respect to that basis:

v = x₁ b₁ + … + xₙ bₙ  →  v has coordinate vector [x₁ … xₙ]ᵀ

Change of basis Different bases give different coordinate representations of the same vector — crucial for understanding why neural networks are invariant under certain reparameterizations.

Standard basis in ℝⁿ: e₁ = (1,0,…,0), e₂ = (0,1,0,…), etc.

ML examples

  • Token embeddings live in ℝ^{d_model} with respect to the learned basis (the embedding matrix rows/columns).

  • Principal components define a new orthonormal basis aligned with maximum variance.

1.5 Normed vector spaces: Euclidean (L₂), L₁, L∞ norms in ML

A norm ‖·‖ is a function that measures vector “length” satisfying positivity, homogeneity, and triangle inequality.

Most important norms in ML:

Norm           | Formula                    | Geometric shape (2D) | ML usage examples
L₂ (Euclidean) | ‖v‖₂ = √(∑ vᵢ²)            | Circle               | Default distance, weight decay, gradient norm, embeddings
L₁ (Manhattan) | ‖v‖₁ = ∑ |vᵢ|              | Diamond              | Sparsity-inducing regularization (lasso), robust losses
L∞ (max norm)  | ‖v‖∞ = maxᵢ |vᵢ|           | Square               | Adversarial perturbation budgets, coordinate-wise clipping
L₀ "norm"      | number of non-zero entries | —                    | Counting sparsity (not a true norm)

Common ML operations

  • Gradient clipping: ‖∇θ‖₂ ≤ C → scale if larger.

  • Cosine similarity uses L₂-normalized vectors.

  • Weight decay: adds λ ‖θ‖₂² to loss → encourages small L₂ norm.
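The three norms and the clipping rule look like this in NumPy (the vector and the max_norm threshold are arbitrary):

```python
import numpy as np

v = np.array([3.0, -4.0, 0.0])
l2 = np.linalg.norm(v)             # √(9 + 16) = 5
l1 = np.linalg.norm(v, 1)          # |3| + |−4| + |0| = 7
linf = np.linalg.norm(v, np.inf)   # max |vᵢ| = 4

def clip_by_norm(g, max_norm=1.0):
    # gradient clipping: rescale so that ‖g‖₂ ≤ max_norm
    n = np.linalg.norm(g)
    return g * (max_norm / n) if n > max_norm else g
```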

1.6 Inner products, dot product, cosine similarity, and angle

An inner product ⟨u, v⟩ is a scalar-valued bilinear form that is symmetric, linear in each argument, and positive definite (⟨v, v⟩ > 0 for v ≠ 0).

Standard dot product in ℝⁿ: ⟨u, v⟩ = uᵀv = ∑ uᵢ vᵢ

Cauchy–Schwarz inequality |⟨u, v⟩| ≤ ‖u‖₂ ‖v‖₂ → defines angle θ between vectors: cos θ = ⟨u, v⟩ / (‖u‖₂ ‖v‖₂)

Cosine similarity cos θ ∈ [−1, 1] → most common similarity measure in NLP/CV

  • 1: identical direction

  • 0: orthogonal (no correlation)

  • −1: opposite direction

ML usage

  • Scaled dot-product attention: softmax(QKᵀ / √d) relies on dot products after L₂ normalization.

  • Contrastive learning (SimCLR, CLIP): maximize cosine similarity of positive pairs, minimize for negatives.
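Cosine similarity follows directly from the dot-product formula (the two 2-D vectors are chosen so the angle is exactly 45°):

```python
import numpy as np

def cosine_similarity(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 0.0])
v = np.array([1.0, 1.0])
sim = cosine_similarity(u, v)   # cos 45° = √2 / 2 ≈ 0.7071
```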

1.7 Orthogonality, orthonormal bases, and Gram-Schmidt process

Two vectors are orthogonal if ⟨u, v⟩ = 0.

A set is orthonormal if every pair is orthogonal and each has unit norm (‖vᵢ‖₂ = 1).

Gram-Schmidt process Takes any linearly independent set {v₁, …, vₖ} and produces an orthonormal basis {q₁, …, qₖ} for the same span:

u₁ = v₁,  q₁ = u₁ / ‖u₁‖
u₂ = v₂ − proj_{q₁} v₂,  q₂ = u₂ / ‖u₂‖
…
uⱼ = vⱼ − ∑_{i=1}^{j−1} proj_{qᵢ} vⱼ,  qⱼ = uⱼ / ‖uⱼ‖

where proj_{q} v = ⟨v, q⟩ q

ML relevance

  • Orthogonal initialization (e.g., orthogonal random matrices) preserves gradient variance in deep nets.

  • QR decomposition (built on Gram-Schmidt ideas) used in some optimizers and low-rank approximations.
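The Gram-Schmidt process translates directly into NumPy (classical Gram-Schmidt for clarity; production code prefers the modified variant or a QR decomposition for numerical stability):

```python
import numpy as np

def gram_schmidt(V):
    # V: columns are linearly independent; returns Q with orthonormal columns
    Q = np.zeros_like(V, dtype=float)
    for j in range(V.shape[1]):
        u = V[:, j].astype(float).copy()
        for i in range(j):
            u -= (V[:, j] @ Q[:, i]) * Q[:, i]   # subtract projection onto qᵢ
        Q[:, j] = u / np.linalg.norm(u)
    return Q

V = np.array([[1.0, 1.0],
              [0.0, 1.0],
              [0.0, 0.0]])
Q = gram_schmidt(V)   # columns of Q are orthonormal: QᵀQ = I₂
```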

1.8 Vector spaces over real and complex numbers (why complex appear in spectral methods)

Most ML uses real vector spaces ℝⁿ.

Complex vector spaces ℂⁿ appear when:

  • Fourier transforms (real signals → complex coefficients)

  • Spectral methods: eigenvalues of non-symmetric matrices can be complex

  • Unitary/orthogonal groups in equivariant networks

  • Quantum machine learning and certain physics-informed models

  • Diagonalization of circulant matrices (convolutions) via DFT → complex eigenvectors (Fourier modes)

Important note Even when matrices are real, eigenvectors can be complex (come in conjugate pairs). → Hermitian (complex) or symmetric (real) matrices guarantee real eigenvalues → preferred in spectral normalization, NTK analysis, etc.

Example The DFT matrix is unitary (complex orthogonal) → preserves norms → important for efficient FFT-based attention approximations.

With these foundations mastered, the next chapter will show how matrices represent linear transformations on these vector spaces — exactly how fully-connected layers, attention, and convolutions operate in neural networks.

Let me know whether you'd like to continue directly with Chapter 2 (Matrices as Linear Transformations), need more numerical examples / visualizations described in text, or want Python/PyTorch snippets illustrating any subsection here.

4.3s

Fast

  1. Matrices as Linear Transformations 2.1 Matrix representation of linear maps 2.2 Matrix multiplication: composition of transformations 2.3 Special matrices in deep learning: identity, diagonal, permutation, symmetric, skew-symmetric, orthogonal, unitary 2.4 Matrix-vector products: linear layers, attention scores 2.5 Rank, column space, row space, null space, and rank-nullity theorem 2.6 Full-rank, low-rank, and rank-deficient matrices in practice 2.7 Matrix norms (Frobenius, operator/spectral, nuclear) and their ML roles

2. Matrices as Linear Transformations

Matrices are the language of linear transformations — they describe how vectors are stretched, rotated, reflected, projected, or otherwise mapped in a consistent (linear) way. In neural networks, every fully connected layer, attention mechanism, convolutional filter (when unrolled), and many normalization operations are matrix multiplications at their core. This chapter connects abstract linear maps to concrete deep learning components.

2.1 Matrix representation of linear maps

A function T: ℝⁿ → ℝᵐ is linear if it satisfies:

T(u + v) = T(u) + T(v) T(c u) = c T(u) for all vectors u, v and scalar c

Fundamental theorem: Every linear map between finite-dimensional vector spaces can be represented by a matrix (once bases are chosen).

Given bases for input space ℝⁿ and output space ℝᵐ, the matrix A ∈ ℝ^{m×n} has columns that are the images of the standard basis vectors:

A = [ T(e₁) | T(e₂) | … | T(eₙ) ]

Then, for any vector x ∈ ℝⁿ:

T(x) = A x

Example: A 2D rotation by θ has matrix A = | cos θ −sin θ | | sin θ cos θ |

ML connection: A fully connected layer without bias is exactly y = W x, where W is the weight matrix representing a learned linear transformation from input features to output features.
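The rotation example is two lines in NumPy; the columns of the matrix are exactly the images of the basis vectors:

```python
import numpy as np

theta = np.pi / 2          # rotate by 90°
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
e1 = np.array([1.0, 0.0])
rotated = R @ e1           # e₁ maps to e₂
```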

2.2 Matrix multiplication: composition of transformations

Matrix multiplication corresponds to composition of linear maps.

If T: ℝⁿ → ℝᵖ with matrix B ∈ ℝ^{p×n} and S: ℝᵐ → ℝⁿ with matrix A ∈ ℝ^{n×m}, then the composition T ∘ S has matrix B A (note the order: input matrix on the right).

(T ∘ S)(x) = T(S(x)) = B (A x) = (B A) x

Key property: Matrix multiplication is associative but not commutative in general (AB ≠ BA).

ML examples

  • Stacking layers: output = W₃ (W₂ (W₁ x + b₁) + b₂) + b₃ → composition of affine maps

  • Attention: QKᵀ is matrix multiplication of query and (transposed) key matrices

  • Residual connections: x + Wx = (I + W)x → sum of identity and learned transformation
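Both properties are quick to check numerically. A NumPy sketch (the matrix names and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))   # S: R^3 -> R^4
B = rng.standard_normal((2, 4))   # T: R^4 -> R^2
x = rng.standard_normal(3)

# Composition: applying S then T equals one multiplication by B A
# (the matrix of the first map applied sits on the right).
assert np.allclose(B @ (A @ x), (B @ A) @ x)

# Associative but not commutative: for random square P, Q, PQ != QP.
P = rng.standard_normal((3, 3))
Q = rng.standard_normal((3, 3))
assert not np.allclose(P @ Q, Q @ P)
```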

2.3 Special matrices in deep learning

| Matrix type | Property | Definition / form | Deep learning usage examples |
|---|---|---|---|
| Identity | I x = x | Diagonal 1s, off-diagonal 0s | Skip connections, initialization, measuring deviation from identity |
| Diagonal | Only diagonal entries nonzero | D = diag(d₁, …, dₙ) | LayerNorm scale, per-channel scaling, eigenvalue scaling |
| Permutation | Reorders coordinates | Exactly one 1 per row/column, rest 0 | Channel shuffling, token reordering, permutation-equivariant nets |
| Symmetric | Aᵀ = A | a_{ij} = a_{ji} | Covariance matrices, Hessian (at critical points), kernel matrices |
| Skew-symmetric | Aᵀ = −A | a_{ij} = −a_{ji}, zero diagonal | Lie algebra elements of SO(n), some physics-informed models |
| Orthogonal | Aᵀ A = I (real case) | Columns/rows are orthonormal | Orthogonal initialization, spectral norm control, equivariance |
| Unitary | Aᴴ A = I (complex case, Aᴴ = conjugate transpose) | Preserves complex norm | Fourier transforms, quantum layers, complex-valued networks |

Orthogonal & unitary matrices preserve norms: ‖A x‖₂ = ‖x‖₂ → Very useful for gradient stability (orthogonal init), equivariant networks, and bounding spectral norm.

2.4 Matrix-vector products: linear layers, attention scores

Linear layer: y = W x + b (or y = W x without bias). The weight matrix W transforms the input x ∈ ℝ^{d_in} into the output y ∈ ℝ^{d_out}.

Scaled dot-product attention (core of transformers): Attention(Q, K, V) = softmax( Q Kᵀ / √d_k ) V

  • Q Kᵀ ∈ ℝ^{seq_len × seq_len} computes similarity (dot product) between every query and every key

  • softmax turns similarities into probabilities

  • Final multiplication by V → weighted sum of value vectors → Entire attention head is a sequence of matrix multiplications + nonlinear softmax.

Multi-head attention: Parallel linear transformations for Q, K, V, then concatenated and projected → still matrix ops.
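Scaled dot-product attention can be sketched in a few lines of NumPy. The `softmax` and `attention` helpers below are illustrative, not a library API:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) query-key similarities
    weights = softmax(scores, axis=-1)   # each row is a probability distribution
    return weights @ V                   # weighted sum of value vectors

rng = np.random.default_rng(0)
seq_len, d_k, d_v = 5, 8, 4
Q, K = rng.standard_normal((seq_len, d_k)), rng.standard_normal((seq_len, d_k))
V = rng.standard_normal((seq_len, d_v))

out = attention(Q, K, V)
assert out.shape == (seq_len, d_v)
```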

2.5 Rank, column space, row space, null space, and rank-nullity theorem

Column space: Col(A) = span of the columns of A
Row space: Row(A) = Col(Aᵀ)
Null space (kernel): Null(A) = { x | A x = 0 }

Rank: rank(A) = dim Col(A) = dim Row(A)

Rank-nullity theorem (fundamental): rank(A) + dim(Null(A)) = n (for A ∈ ℝ^{m×n})

ML interpretation

  • rank(W) < min(d_in, d_out) → information bottleneck / compression

  • Large null space → many directions unchanged by the layer (can cause vanishing gradients if repeated)

  • Full column rank → injective map (no collapse)
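A quick numerical check of the rank-nullity theorem, assuming NumPy (the rank-3 construction is an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, true_rank = 6, 5, 3
# Build a rank-3 matrix as a product of thin factors.
A = rng.standard_normal((m, true_rank)) @ rng.standard_normal((true_rank, n))

r = np.linalg.matrix_rank(A)
nullity = n - r  # rank-nullity: rank(A) + dim(Null(A)) = n

assert r == true_rank
assert r + nullity == n
```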

2.6 Full-rank, low-rank, and rank-deficient matrices in practice

| Type | Definition | Typical rank | Deep learning examples & implications |
|---|---|---|---|
| Full rank | rank = min(m, n) | min(d_in, d_out) | Well-conditioned layers; good gradient flow early in training |
| Low-rank | rank ≪ min(m, n) | r = 4–64 (LoRA style) | LoRA, adapter tuning, pruning, compression, efficient fine-tuning |
| Rank-deficient | rank < min(m, n) | < min(d_in, d_out) | Over-parameterized nets after training; redundant neurons/heads |

Modern applications

  • LoRA (Low-Rank Adaptation): Update W → W + B A where B ∈ ℝ^{d_out × r}, A ∈ ℝ^{r × d_in}, r ≪ d → Trains only ~0.1–1% of parameters while achieving near full fine-tuning performance

  • Spectral pruning: Remove directions with small singular values

  • Low-rank bottlenecks in attention matrices → explain why many heads become redundant

2.7 Matrix norms (Frobenius, operator/spectral, nuclear) and their ML roles

| Norm | Formula | Interpretation | ML role / regularization example |
|---|---|---|---|
| Frobenius | ‖A‖_F = √(∑_{i,j} a_{ij}²) = √(tr(Aᵀ A)) | Euclidean norm of the vectorized matrix | Weight decay (‖W‖_F²), matrix reconstruction error |
| Operator / Spectral | ‖A‖₂ = σ_max(A) (largest singular value) | Maximum stretch factor | Spectral normalization (SN-GAN), Lipschitz control, stability |
| Nuclear | ‖A‖_* = ∑ σᵢ(A) (sum of singular values) | Convex relaxation of rank | Low-rank regularization, matrix completion, robust PCA |

Practical notes

  • Frobenius norm is easy to compute and differentiate → default in L2 weight decay

  • Spectral norm bounds Lipschitz constant of layer → used in Wasserstein GANs, certified robustness

  • Nuclear norm promotes low-rank solutions → basis for inductive bias in LoRA, adapters, and inductive matrix completion
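All three norms can be read off the singular values, which makes the ordering between them easy to verify. A NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
s = np.linalg.svd(A, compute_uv=False)  # singular values, descending

fro      = np.sqrt((s**2).sum())  # Frobenius: sqrt of sum of squared singular values
spectral = s[0]                   # spectral: largest singular value
nuclear  = s.sum()                # nuclear: sum of singular values

assert np.isclose(fro, np.linalg.norm(A, 'fro'))
assert np.isclose(spectral, np.linalg.norm(A, 2))
assert np.isclose(nuclear, np.linalg.norm(A, 'nuc'))
# Ordering: ||A||_2 <= ||A||_F <= ||A||_*
assert spectral <= fro <= nuclear + 1e-12
```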

Mastering these matrix concepts allows you to see neural networks not as black boxes of nonlinearities, but as carefully composed chains of linear transformations — with rank, conditioning, and spectral properties dictating expressivity, stability, and generalization.

3. Systems of Linear Equations and Least Squares

This chapter bridges classical linear algebra with one of the most frequent tasks in machine learning: solving (or approximately solving) linear systems. Neural network training, linear regression, PCA, collaborative filtering, and many initialization/regularization techniques ultimately reduce to solving or optimizing linear systems of the form Ax = b or min ‖Ax − b‖₂.

3.1 Matrix form Ax = b and solvability conditions

Any system of m linear equations in n unknowns can be written compactly as:

A x = b

where A ∈ ℝ^{m×n} (coefficient matrix), x ∈ ℝⁿ (unknowns), b ∈ ℝᵐ (right-hand side).

Solvability conditions:

  • The system has a solution if and only if b lies in the column space of A: b ∈ Col(A).

  • Equivalently: rank([A | b]) = rank(A) (augmented matrix rank equals coefficient matrix rank).

Three fundamental cases:

  • Consistent & unique: rank(A) = rank([A | b]) = n (full column rank) → exactly one solution; b ∈ Col(A) is reached by a unique x

  • Consistent & infinite: rank(A) = rank([A | b]) < n → infinitely many solutions; the solution set is an affine subspace x₀ + Null(A)

  • Inconsistent: rank([A | b]) > rank(A) → no solution; b ∉ Col(A)

ML relevance

  • In over-parameterized networks (more parameters than data points), linear systems are typically under-determined → infinitely many solutions (infinite weight configurations achieve zero training loss).

  • The choice among them is governed by optimization dynamics and implicit bias.
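The rank test for solvability can be sketched directly, assuming NumPy (`is_consistent` is an illustrative helper, not a library function):

```python
import numpy as np

def is_consistent(A, b):
    """Ax = b is solvable iff rank([A | b]) == rank(A)."""
    aug = np.column_stack([A, b])
    return np.linalg.matrix_rank(aug) == np.linalg.matrix_rank(A)

A = np.array([[1.0, 2.0],
              [2.0, 4.0]])        # rank 1: second row is twice the first
b_in  = np.array([3.0, 6.0])     # lies in Col(A) -> consistent
b_out = np.array([3.0, 5.0])     # outside Col(A) -> inconsistent

assert is_consistent(A, b_in)
assert not is_consistent(A, b_out)
```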

3.2 Gaussian elimination, LU decomposition, pivoting

Gaussian elimination transforms A into upper triangular form U via row operations:

A → U (forward elimination), then back-substitution solves U x = c.

LU decomposition (without pivoting): A = L U L lower triangular (1s on diagonal), U upper triangular.

Partial pivoting (essential in practice): P A = L U where P is a permutation matrix → Prevents division by very small pivots → dramatically improves numerical stability.

ML context

  • Full Gaussian elimination / LU is rarely used directly on huge weight matrices, but the ideas appear in:

    • preconditioning for iterative solvers (e.g., conjugate gradient in Hessian-free optimization)

    • analyzing conditioning of linear layers

    • small-scale exact solves during debugging or toy-model analysis
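For intuition, elimination with partial pivoting can be written in a few lines. This is a teaching sketch (`solve_plu` is a made-up name), not a production solver; the tiny-pivot test matrix shows why pivoting matters:

```python
import numpy as np

def solve_plu(A, b):
    """Solve Ax = b by Gaussian elimination with partial pivoting."""
    A = A.astype(float).copy()
    b = b.astype(float).copy()
    n = len(b)
    for k in range(n - 1):
        # Partial pivoting: bring the largest-magnitude pivot into row k.
        p = k + np.argmax(np.abs(A[k:, k]))
        A[[k, p]], b[[k, p]] = A[[p, k]], b[[p, k]]
        # Forward elimination below the pivot.
        for i in range(k + 1, n):
            m = A[i, k] / A[k, k]
            A[i, k:] -= m * A[k, k:]
            b[i] -= m * b[k]
    # Back-substitution on the upper-triangular system.
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x

# Nearly-zero leading pivot: without row swaps this would be disastrous.
A = np.array([[1e-12, 1.0], [1.0, 1.0]])
b = np.array([1.0, 2.0])
assert np.allclose(solve_plu(A, b), np.linalg.solve(A, b))
```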

3.3 Overdetermined systems and least-squares solutions

When m > n (more equations than unknowns) — the most common case in machine learning — the system Ax = b is usually inconsistent.

The least-squares solution minimizes the Euclidean residual:

x̂ = arg min_x ‖A x − b‖₂²

Geometrically: x̂ is the point whose image A x̂ is the orthogonal projection of b onto Col(A).

Existence and uniqueness:

  • Always exists

  • Unique if and only if A has full column rank (rank(A) = n)

3.4 Normal equations: AᵀA x = Aᵀb

The minimizer satisfies the normal equations:

Aᵀ A x̂ = Aᵀ b

Geometric derivation: The residual r = b − A x̂ must be orthogonal to Col(A): Aᵀ r = 0 ⇔ Aᵀ (b − A x̂) = 0 ⇒ normal equations.

Pros and cons:

| Aspect | Advantage | Disadvantage |
|---|---|---|
| Formulation | Simple closed form | — |
| Numerical stability | — | Aᵀ A squares the condition number (κ(AᵀA) ≈ κ(A)²) |
| Memory | — | Forms a dense n×n matrix even if A is sparse/tall |
| Use case | Small-to-medium n | Avoid for ill-conditioned or large problems |

ML example: Ordinary least squares linear regression → Xᵀ X w = Xᵀ y (X = design matrix).
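On a small, well-conditioned problem the normal equations agree with a library least-squares solver. A minimal NumPy sketch (the synthetic regression setup is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 50, 3
X = rng.standard_normal((n_samples, n_features))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.standard_normal(n_samples)

# Normal equations: X^T X w = X^T y (fine here; avoid when X is ill-conditioned).
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Library least-squares solver (SVD-based) for comparison.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(w_normal, w_lstsq)
```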

3.5 QR decomposition and its numerical stability

QR decomposition factorizes A ∈ ℝ^{m×n} (m ≥ n) as:

A = Q R

  • Q ∈ ℝ^{m×m} orthogonal (Qᵀ Q = I) or thin Q ∈ ℝ^{m×n} with orthonormal columns

  • R ∈ ℝ^{m×n} upper triangular (or upper trapezoidal)

Thin / economy QR: Q ∈ ℝ^{m×n}, R ∈ ℝ^{n×n} upper triangular.

Solving least squares via QR:

A x = b ⇒ Q R x = b ⇒ R x = Qᵀ b (first n rows)

→ Back-substitution on triangular system — very stable.

Why QR is preferred over normal equations:

  • Does not square the condition number

  • Backward stable algorithm (Householder or Givens rotations)

  • Standard method in LAPACK, NumPy, SciPy, etc.

ML connection: QR appears in:

  • Gram–Schmidt-like orthogonalization in some initializations

  • Stabilized solvers for small linear subproblems

  • Randomized rangefinder algorithms for approximate SVD/PCA
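Solving least squares via thin QR, exactly as described above, in NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 4))  # tall system: m > n
b = rng.standard_normal(20)

# Thin QR: Q (20x4) has orthonormal columns, R (4x4) is upper triangular.
Q, R = np.linalg.qr(A)

# Least squares via QR: solve the triangular system R x = Q^T b.
x_qr = np.linalg.solve(R, Q.T @ b)

# Agrees with the library least-squares solver.
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x_qr, x_ref)
```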

3.6 SVD-based least squares (pseudoinverse)

Singular Value Decomposition (SVD):

A = U Σ Vᵀ

  • U ∈ ℝ^{m×m}, V ∈ ℝ^{n×n} orthogonal

  • Σ ∈ ℝ^{m×n} diagonal (singular values σ₁ ≥ σ₂ ≥ … ≥ σ_r > 0)

Moore–Penrose pseudoinverse:

A⁺ = V Σ⁺ Uᵀ where Σ⁺ inverts nonzero singular values and sets others to zero.

Least-squares solution (minimum-norm if underdetermined):

x̂ = A⁺ b

Advantages of SVD approach:

  • Handles rank-deficient cases naturally

  • Gives insight into numerical rank (small σᵢ ≈ noise)

  • Truncated SVD → regularized / robust solution

  • Basis for PCA, collaborative filtering, low-rank approximation

Drawback: Computationally expensive (O(min(m n², m² n))) → not used for huge matrices directly, but randomized/approximate SVD is common in large-scale ML.

3.7 Regularized least squares: ridge regression as preconditioned system

Ridge regression adds L₂ penalty:

min_x ‖A x − b‖₂² + λ ‖x‖₂²

Closed-form solution:

x̂ = (Aᵀ A + λ I)⁻¹ Aᵀ b

Equivalent views:

  1. Tikhonov regularization → damps small singular values

  2. Preconditioned normal equations: (Aᵀ A + λ I) is better conditioned than Aᵀ A

  3. Bayesian MAP estimate under Gaussian prior on x

  4. Spectral filtering: singular values of effective inverse become σᵢ / (σᵢ² + λ)

ML connections

  • Weight decay in neural networks is exactly ridge regularization on weights

  • Early stopping ≈ implicit ridge (smaller effective λ as training progresses)

  • Kernel ridge regression → foundation of Gaussian processes

  • In transformers: residual connections + LayerNorm can be analyzed as preconditioners of the effective linear system per layer
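The closed form and the spectral-filtering view (item 4 above) can be checked against each other. A NumPy sketch with an arbitrary λ:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 5))
b = rng.standard_normal(30)
lam = 0.1

# Closed form: x = (A^T A + lam I)^{-1} A^T b
x_ridge = np.linalg.solve(A.T @ A + lam * np.eye(5), A.T @ b)

# Spectral filtering: x = V diag(sigma_i / (sigma_i^2 + lam)) U^T b
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_svd = Vt.T @ ((s / (s**2 + lam)) * (U.T @ b))

assert np.allclose(x_ridge, x_svd)
```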

Summary table – Solving Ax ≈ b in ML contexts

| Method | Stability | Condition number impact | Best for | Typical ML scale usage |
|---|---|---|---|---|
| Normal equations | Poor | Squares κ | Very small n | Toy problems |
| QR decomposition | Excellent | Preserves κ | Medium n, dense/tall matrices | Sub-problems, analysis |
| SVD / Pseudoinverse | Excellent | Handles rank deficiency | Low-to-medium n, interpretability | PCA, low-rank approx, theory |
| Ridge (Tikhonov) | Very good | Reduces effective κ | Ill-conditioned, noisy data | Weight decay, regularization |
| Iterative (CG, LSQR) | Good (with preconditioning) | Depends on preconditioner | Very large sparse/tall-and-skinny | Large-scale embeddings, etc. |

This chapter equips you to understand why linear regression works, why regularization matters, why certain solvers are preferred numerically, and how these classical ideas quietly underpin much of modern deep learning.

4. Eigenvalues, Eigenvectors, and Spectral Decomposition

Eigenvalues and eigenvectors reveal the intrinsic geometric and algebraic structure of linear transformations. In deep learning, they appear everywhere: analyzing loss landscape curvature (Hessian eigenvalues), understanding attention matrix behavior, controlling Lipschitz constants (spectral norm), initializing networks orthogonally, explaining grokking phases, and interpreting neural tangent kernel dynamics.

4.1 Definition and geometric meaning of eigenvalues/eigenvectors

For a square matrix A ∈ ℝ^{n×n}, a nonzero vector v is an eigenvector of A if there exists a scalar λ such that:

A v = λ v

The scalar λ is the corresponding eigenvalue.

Geometric interpretation When A is applied to v, the result is simply a scaled version of v (by factor λ) — no change in direction (except possibly reversal if λ < 0).

  • |λ| > 1 → stretching along eigenvector direction

  • |λ| < 1 → shrinking / contraction

  • λ = 1 → fixed direction (invariant subspace)

  • λ = −1 → reflection

  • Complex λ → rotation + scaling (in real matrices, complex eigenvalues come in conjugate pairs)

ML examples

  • The largest eigenvalue of a weight matrix controls the maximum amplification of inputs → related to exploding gradients.

  • Eigenvectors of the Hessian at a minimum point to principal curvature directions (sharp vs. flat directions).

  • In attention: eigenvectors of QKᵀ/√d reveal dominant similarity patterns across tokens.

4.2 Characteristic polynomial and algebraic multiplicity

The characteristic polynomial of A is:

p(λ) = det(λ I − A) = λⁿ + c_{n−1} λ^{n−1} + … + c₀

The eigenvalues are exactly the roots of p(λ) = 0.

Algebraic multiplicity of eigenvalue λ is the multiplicity of λ as a root of p(λ).

Geometric multiplicity is the dimension of the eigenspace Ker(A − λI).

Key facts

  • Algebraic multiplicity ≥ geometric multiplicity

  • Sum of algebraic multiplicities = n

  • If algebraic = geometric for every eigenvalue → matrix is diagonalizable

ML note In large random matrices (common in initialization), eigenvalues follow known distributions (Marchenko–Pastur law for Wishart matrices) → helps explain why spectral norm grows as √(fan-in) in deep nets.

4.3 Diagonalizable matrices and eigendecomposition A = PDP⁻¹

A is diagonalizable if there exists an invertible matrix P whose columns are eigenvectors of A, and a diagonal matrix D = diag(λ₁, …, λₙ) such that:

A = P D P⁻¹

Equivalent conditions

  • A has n linearly independent eigenvectors

  • Algebraic multiplicity = geometric multiplicity for every eigenvalue

Power formula (extremely useful): Aᵏ = P Dᵏ P⁻¹ = P diag(λ₁ᵏ, …, λₙᵏ) P⁻¹

→ Long-term behavior dominated by largest |λ| (spectral radius).

ML applications

  • If weight matrix is diagonalizable, repeated application (deep layers) amplifies directions with |λ| > 1 → exploding/vanishing gradients unless eigenvalues near 1.

  • Orthogonal initialization → P orthogonal → better-conditioned propagation.
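The power formula is easy to verify numerically, assuming NumPy (a random real matrix is diagonalizable with probability 1, possibly with complex eigenvalues):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
eigvals, P = np.linalg.eig(A)   # columns of P are eigenvectors
D = np.diag(eigvals)

# Eigendecomposition: A = P D P^{-1}
assert np.allclose(P @ D @ np.linalg.inv(P), A)

# Power formula: A^k = P D^k P^{-1}
k = 5
Ak = np.linalg.matrix_power(A, k)
assert np.allclose(P @ np.diag(eigvals**k) @ np.linalg.inv(P), Ak)
```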

4.4 Symmetric matrices: real eigenvalues, orthogonal eigenvectors, spectral theorem

A real symmetric matrix (Aᵀ = A) has very special properties:

Spectral theorem Every real symmetric matrix is orthogonally diagonalizable:

A = Q Λ Qᵀ where Q is orthogonal (Qᵀ Q = I), Λ = diag(λ₁, …, λₙ) real

→ All eigenvalues are real → Eigenvectors can be chosen orthonormal

Consequences in ML

  • Hessian of smooth loss functions is symmetric → real curvatures, orthogonal principal directions

  • Covariance matrices are symmetric positive semi-definite → PCA works

  • Kernel matrices in kernel methods are symmetric

  • Attention similarity matrices (after softmax) are not symmetric, but QKᵀ often analyzed via symmetric approximations

4.5 Positive (semi)definite matrices and quadratic forms

A symmetric matrix A is:

  • Positive definite (PD) if xᵀ A x > 0 for all x ≠ 0 ⇔ all eigenvalues λᵢ > 0

  • Positive semi-definite (PSD) if xᵀ A x ≥ 0 for all x ⇔ all λᵢ ≥ 0

Quadratic form: xᵀ A x = ∑ᵢ λᵢ (qᵢᵀ x)² (in the eigenvector basis)

ML examples

  • Hessian at local minimum is PSD → loss curves upward

  • Gram matrix Xᵀ X is always PSD

  • Fisher information matrix is PSD → natural gradient preconditioning

  • Weight matrices with spectral norm ≤ 1 → Lipschitz constant ≤ 1 (stability)
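A small PSD check on a Gram matrix, assuming NumPy. The quadratic form identity xᵀ(XᵀX)x = ‖Xx‖² makes non-negativity obvious:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 4))
G = X.T @ X  # Gram matrix: symmetric PSD by construction

# eigvalsh: eigenvalues of a symmetric matrix, guaranteed real.
eigvals = np.linalg.eigvalsh(G)
assert np.all(eigvals >= -1e-10)  # all >= 0 up to round-off

# Quadratic form: x^T G x = ||X x||^2 >= 0 for any x.
x = rng.standard_normal(4)
assert x @ G @ x >= 0
assert np.isclose(x @ G @ x, np.linalg.norm(X @ x)**2)
```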

4.6 Spectral radius, condition number, and numerical stability

Spectral radius ρ(A) = maxᵢ |λᵢ(A)| → Determines long-term growth/decay of A

Spectral norm (operator norm induced by L₂): ‖A‖₂ = ρ(Aᵀ A)^{1/2} = largest singular value σ_max(A)

Condition number (2-norm condition number): κ₂(A) = ‖A‖₂ ‖A⁻¹‖₂ = σ_max / σ_min (for invertible A)

Implications in deep learning

  • Large κ → ill-conditioned → small changes in input cause large changes in output → exploding/vanishing gradients

  • Orthogonal / unitary layers → κ ≈ 1 → excellent signal propagation

  • Spectral normalization → forces σ_max ≤ 1 → controls Lipschitz constant

4.7 Power method, inverse iteration, and Rayleigh quotient for largest/smallest eigenvalues

Power method — finds dominant (largest |λ|) eigenvalue/vector

x_{k+1} = A x_k / ‖A x_k‖ → Converges to eigenvector of largest |λ| (assuming gap between |λ₁| and |λ₂|)

Rayleigh quotient R(x) = (xᵀ A x) / (xᵀ x) → Approximates eigenvalue; minimized/maximized over subspaces

Inverse iteration — finds the eigenvalue nearest a chosen shift μ (e.g., the smallest): apply the power method to (A − μI)⁻¹ → shifts and inverts the spectrum

ML usage

  • Power method approximates spectral norm in spectral normalization (SN-GAN)

  • Rayleigh quotient used in variational eigenvalue problems (e.g., graph Laplacians in GNNs)

  • Lanczos / Arnoldi methods (extensions) compute few extreme eigenvalues of huge Hessians or Jacobians
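A minimal power-method implementation with the Rayleigh quotient. The `power_method` helper and the constructed test spectrum are illustrative assumptions:

```python
import numpy as np

def power_method(A, num_iters=500, seed=0):
    """Dominant eigenvalue/eigenvector via repeated multiplication."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[0])
    for _ in range(num_iters):
        x = A @ x
        x /= np.linalg.norm(x)          # renormalize to avoid overflow
    rayleigh = x @ A @ x / (x @ x)      # Rayleigh quotient approximates lambda_1
    return rayleigh, x

# Symmetric test matrix with a known spectrum: A = Q diag(...) Q^T.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
A = Q @ np.diag([5.0, 2.0, 1.0, 0.5, -0.3, 0.1]) @ Q.T

lam_est, v = power_method(A)
assert np.isclose(lam_est, 5.0)         # dominant eigenvalue recovered
assert np.allclose(A @ v, lam_est * v, atol=1e-6)
```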

4.8 Deflation and subspace iteration techniques

Deflation — after finding one eigenvector v₁ with λ₁, deflate A to remove that direction:

A₂ = A − λ₁ v₁ v₁ᵀ (for symmetric case)

A₂ has same eigenvalues as A except λ₁ replaced by 0 → Repeat to find next dominant directions

Subspace iteration / simultaneous iteration Power method on multiple vectors at once → orthogonalize periodically (like block power method)

Modern variants

  • Lanczos algorithm (for symmetric matrices): builds Krylov subspace → excellent for few extreme eigenvalues

  • Arnoldi iteration (nonsymmetric): basis for GMRES, ARPACK

  • Randomized subspace iteration: used in randomized SVD / PCA for very large matrices

Practical ML relevance

  • Computing top-k eigenvalues of Hessian or Fisher matrix (for sharpness, natural gradient)

  • Spectral clustering in GNN preprocessing

  • Approximating dominant modes in loss landscape analysis

With this deep understanding of eigenvalues and spectral structure, you are now equipped to interpret curvature, stability, and intrinsic geometry in neural networks — topics that become central in the next chapters on SVD, PCA, and spectral methods in deep learning.

5. Singular Value Decomposition (SVD) – The Cornerstone of Modern ML

The Singular Value Decomposition is arguably the single most powerful and widely applicable tool in the entire linear algebra toolkit for machine learning. It generalizes eigendecomposition to non-square and non-symmetric matrices, provides optimal low-rank approximations, underpins PCA, enables pseudoinverses, controls matrix norms, and appears (directly or indirectly) in almost every modern compression, recommendation, interpretability, and fine-tuning technique.

5.1 Definition and geometric interpretation

For any real matrix A ∈ ℝ^{m×n} (not necessarily square), the SVD is:

A = U Σ Vᵀ

where:

  • U ∈ ℝ^{m×m} is orthogonal (Uᵀ U = I) — left singular vectors

  • V ∈ ℝ^{n×n} is orthogonal (Vᵀ V = I) — right singular vectors

  • Σ ∈ ℝ^{m×n} is diagonal with non-negative singular values σ₁ ≥ σ₂ ≥ … ≥ σ_r ≥ 0 (r = rank(A)) on the main diagonal, zeros elsewhere

Geometric interpretation (most intuitive view):

  1. Vᵀ rotates/reflects the input space ℝⁿ

  2. Σ scales the axes by σ₁, σ₂, … (stretches most along first singular direction, least along last)

  3. U rotates/reflects the result into the output space ℝᵐ

→ SVD = rotation → anisotropic scaling → rotation Unlike eigendecomposition (which requires square matrix and can have complex values), SVD always exists, singular values are always real and non-negative, and it works for rectangular matrices.

ML intuition Any linear layer y = W x can be decomposed into three interpretable stages: rotate input features → scale importance of directions → rotate to output space.

5.2 Reduced, compact, and truncated SVD

Full SVD U (m×m), Σ (m×n), V (n×n) — complete orthogonal bases.

Reduced / Thin SVD (most common in practice) Keep only the first r = min(m,n) columns:

A = Û Σ̂ V̂ᵀ

  • Û ∈ ℝ^{m×r} and V̂ ∈ ℝ^{n×r} have orthonormal columns

  • Σ̂ ∈ ℝ^{r×r} diagonal

Compact SVD (when rank(A) = r < min(m,n)) Keep exactly the r nonzero singular values:

A = U_r Σ_r V_rᵀ where U_r (m×r), Σ_r (r×r), V_r (n×r)

Truncated SVD (rank-k approximation, k < r) A_k = U_k Σ_k V_kᵀ (keep only top-k singular values/vectors)

→ Best possible rank-k approximation in both Frobenius and spectral norm (see 5.3)

Practical note Most libraries (NumPy, SciPy, scikit-learn, PyTorch, JAX) return the reduced/compact form by default.

5.3 Low-rank approximation and Eckart–Young–Mirsky theorem

Eckart–Young–Mirsky theorem (1936, cornerstone result):

Among all matrices B of rank at most k, the truncated SVD A_k minimizes both:

‖A − B‖_F (Frobenius norm) and ‖A − B‖₂ (spectral/operator norm)

and the approximation errors are exactly:

‖A − A_k‖_F² = σ_{k+1}² + … + σ_r²  and  ‖A − A_k‖₂ = σ_{k+1}

Huge implication for ML

  • Truncated SVD gives the theoretically optimal low-rank matrix approximation

  • Basis for compression, denoising, recommendation systems, LoRA-style adaptation, pruning

  • Explains why keeping top singular components captures most “signal” and discards “noise”
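The Eckart–Young–Mirsky error formulas can be verified directly, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-k approximation

# Spectral error is exactly the first discarded singular value.
assert np.isclose(np.linalg.norm(A - A_k, 2), s[k])
# Frobenius error is the root-sum-of-squares of the discarded singular values.
assert np.isclose(np.linalg.norm(A - A_k, 'fro'), np.sqrt((s[k:]**2).sum()))
assert np.linalg.matrix_rank(A_k) == k
```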

5.4 SVD for dimensionality reduction: PCA as SVD on centered data

Principal Component Analysis (PCA) in matrix form:

  1. Center the data matrix X ∈ ℝ^{n_samples × n_features}: X_c = X − μ, subtracting the per-feature (column) mean from every row

  2. Compute SVD of X_c = U Σ Vᵀ (usually thin SVD)

  3. The right singular vectors V are the principal components

  4. The singular values σ_i² / (n_samples − 1) are the explained variances

  5. Projected data: Z = X_c V_k (top-k components)

Why SVD = PCA Covariance matrix C = (1/(n−1)) X_cᵀ X_c ⇒ C = (1/(n−1)) V Σ² Vᵀ (eigendecomposition via SVD) → Eigenvectors of covariance = right singular vectors of centered data

Advantages of SVD route

  • Numerically more stable than forming Xᵀ X (avoids squaring condition number)

  • Works even when n_samples < n_features (thin SVD)
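The equivalence of the two routes is easy to confirm numerically. A sketch of SVD-based PCA checked against the covariance eigendecomposition, assuming NumPy (the correlated synthetic data is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 5))  # correlated data

Xc = X - X.mean(axis=0)                # center: subtract per-feature mean
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Covariance eigendecomposition route for comparison.
C = Xc.T @ Xc / (len(X) - 1)
eigvals = np.linalg.eigvalsh(C)[::-1]  # sort descending to match singular values

# sigma_i^2 / (n - 1) equal the covariance eigenvalues.
assert np.allclose(s**2 / (len(X) - 1), eigvals)

# Project onto the top-2 principal components (rows of Vt).
Z = Xc @ Vt[:2].T
assert Z.shape == (100, 2)
```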

5.5 Matrix pseudoinverse via SVD

The Moore–Penrose pseudoinverse A⁺ satisfies four properties (generalized inverse):

A⁺ = V Σ⁺ Uᵀ

where Σ⁺ inverts every nonzero singular value and sets zero singular values to zero:

σ⁺_i = 1/σ_i if σ_i > 0, else 0

Properties in ML

  • Solves least-squares (min_x ‖A x − b‖₂) and minimum-norm problems in one formula

  • x̂ = A⁺ b is the minimum L₂-norm solution to A x ≈ b

  • Used in ridge regression limit (λ → 0), pseudo-inverse initialization, some meta-learning algorithms

Numerical caution Small singular values → unstable inversion → regularized pseudoinverse (cut off below threshold or add damping)

5.6 Spectral norm, Frobenius norm, and nuclear norm through singular values

| Norm | Expression via SVD | Meaning / ML role |
|---|---|---|
| Frobenius | ‖A‖_F = √(σ₁² + σ₂² + … + σ_r²) | Total "energy" of the matrix; L₂ norm of the vectorized matrix; weight-decay target |
| Spectral (2-norm) | ‖A‖₂ = σ₁ (largest singular value) | Maximum stretch factor; Lipschitz constant of a linear layer; spectral normalization |
| Nuclear (trace norm) | ‖A‖_* = σ₁ + σ₂ + … + σ_r | Convex surrogate for rank; low-rank regularization; matrix completion, robust PCA |

Key relations: ‖A‖_F² = ∑ σᵢ²;  ‖A‖₂ ≤ ‖A‖_F ≤ √r ‖A‖₂;  ‖A‖_* ≤ √r ‖A‖_F

5.7 Applications: image compression, recommender systems, attention visualization

Image compression

  • Flatten image patches or whole image into matrix A

  • Compute truncated SVD A_k

  • Store U_k, Σ_k, V_kᵀ → far fewer parameters than original → JPEG2000 and many classical methods use similar ideas

Recommender systems

  • Rating matrix R (users × items) is sparse and incomplete

  • R ≈ U_k Σ_k V_kᵀ (low-rank matrix completion) → Latent factor models (SVD++, FunkSVD) are truncated SVD with regularization

Attention visualization & interpretability

  • Attention matrix softmax(QKᵀ/√d) ≈ low-rank in practice

  • Compute SVD of attention heads → principal attention patterns

  • Right singular vectors → dominant query directions

  • Left singular vectors → dominant key directions

  • Top singular values → how “concentrated” the attention is

Modern frontiers (2025–2026)

  • LoRA / DoRA: low-rank updates = truncated SVD style adaptation

  • Spectral pruning: remove small singular components of weight matrices

  • Randomized SVD / CUR decomposition for billion-scale matrices

  • Attention SVD analysis for mechanistic interpretability in transformers

SVD is truly the “Swiss Army knife” of linear algebra in ML — it simultaneously solves approximation, decomposition, regularization, interpretability, and stability problems. Mastering SVD unlocks deep understanding of compression, generalization, and scaling in neural networks.

6. Principal Component Analysis (PCA) and Variants

Principal Component Analysis is one of the most fundamental and widely used dimensionality reduction techniques in machine learning. At its core, PCA finds a new orthogonal coordinate system (the principal components) that maximizes the variance of the projected data — effectively capturing the directions of greatest spread in the dataset. PCA is linear, unsupervised, and serves as the foundation for many advanced methods (autoencoders, factor analysis, spectral clustering, etc.). This chapter covers both classical PCA and its most important modern extensions.

6.1 Covariance matrix and variance maximization view

Given a centered data matrix X ∈ ℝ^{n×p} (n samples, p features, already mean-subtracted), the sample covariance matrix is:

S = (1/(n−1)) Xᵀ X ∈ ℝ^{p×p}

PCA objective (variance maximization):

Find unit vectors u₁, u₂, …, uₖ (principal components) such that the projected data X uᵢ has maximum possible variance, subject to orthogonality.

The variance of the projection onto direction u is:

Var(X u) = uᵀ S u (since u is unit norm)

Optimization problem:

max uᵀ S u subject to ‖u‖₂ = 1

By Lagrange multipliers → solution is the eigenvector of S with largest eigenvalue.

Subsequent components maximize variance subject to orthogonality → next eigenvectors in decreasing eigenvalue order.

Geometric view PCA rotates the data so that the new axes (principal components) align with directions of maximum spread → first PC captures most variance, second PC (orthogonal to first) captures next most, etc.

Explained variance ratio λᵢ / ∑ λⱼ → fraction of total variance captured by i-th component

6.2 SVD vs. eigendecomposition approaches to PCA

There are two mathematically equivalent but numerically different ways to compute PCA:

Method 1: Eigendecomposition of covariance matrix

  1. Center XX_c

  2. Compute S = (1/(n−1)) X_cᵀ X_c

  3. Eigendecompose S = V Λ Vᵀ

  4. Principal components = columns of V

  5. Scores = X_c V

Drawback: Forming X_cᵀ X_c squares the condition number → numerical instability when features have very different scales or when n ≪ p.

Method 2: SVD of centered data matrix (recommended in practice)

X_c = U Σ Vᵀ (thin SVD)

Then:

  • Right singular vectors V = principal components

  • Singular values σᵢ → eigenvalues of covariance are σᵢ² / (n−1)

  • Projected data = U Σ (or X_c V)

Advantages of SVD route:

  • Avoids explicit covariance matrix → better numerical stability

  • Works efficiently when n < p (thin SVD)

  • Libraries (scikit-learn, NumPy, PyTorch, JAX) use SVD internally for PCA

Rule of thumb (2025–2026 practice): Always prefer SVD-based PCA unless you have a very strong reason to form the covariance matrix explicitly.

6.3 Centering, scaling, and standardization in practice

Centering (mean subtraction) is mandatory for classical PCA — otherwise the first component is dominated by the mean direction.

Standardization (z-score: subtract mean, divide by standard deviation per feature) is almost always recommended when features have different units/scales.

| Preprocessing | When to use | Effect on PCA | Common ML scenarios |
|---|---|---|---|
| Centering only | Features already on similar scale | PCA finds directions of max variance | Image pixels (0–255), already normalized data |
| Standardization | Features have different units/scales | Each feature contributes equally to variance | Tabular data, mixed physical/chemical features |
| No preprocessing | Variance structure is meaningful (e.g., counts) | PCA biased toward high-variance features | Rarely recommended; can lead to poor components |

Practical tips

  • In pipelines: use StandardScaler before PCA in scikit-learn

  • For images/CNN features: centering usually sufficient (pixels or activations already normalized)

  • For sparse data (e.g., TF-IDF): centering can be problematic → consider alternatives (NMF, truncated SVD without centering)

6.4 Kernel PCA for nonlinear dimensionality reduction

Limitation of linear PCA: It can only capture linear subspaces.

Kernel PCA extends PCA to nonlinear manifolds using the kernel trick.

Idea: Map data to high-dimensional feature space via φ(x), then perform PCA there — without explicitly computing φ(x).

Using kernel matrix K_{ij} = k(xᵢ, xⱼ) = ⟨φ(xᵢ), φ(xⱼ)⟩

  1. Center the kernel matrix (important!)

  2. Eigendecompose centered K → eigenvectors in feature space

  3. Project new point x via kernel evaluations

Common kernels:

  • RBF / Gaussian: k(x, y) = exp(−‖xy‖² / (2σ²))

  • Polynomial: k(x, y) = (⟨x, y⟩ + c)^d

  • Sigmoid, etc.

Applications in ML

  • Nonlinear dimensionality reduction before clustering (spectral clustering uses similar ideas)

  • Preprocessing for SVMs (kernel SVM = kernel PCA + linear SVM in feature space)

  • Manifold learning visualization (Swiss roll, MNIST digit manifolds)

Drawback: O(n²) memory and time for kernel matrix → not scalable to millions of points without approximations (Nyström, random Fourier features).

6.5 Probabilistic PCA and connection to autoencoders

Probabilistic PCA (PPCA, Tipping & Bishop 1999) Places probabilistic model behind classical PCA:

x = W z + μ + ε, with z ~ 𝒩(0, I) (latent variables) and ε ~ 𝒩(0, σ² I) (isotropic noise)

Maximum likelihood estimation of W, μ, σ² recovers classical PCA when σ → 0.

Advantages:

  • Handles missing data naturally

  • Provides uncertainty estimates

  • EM algorithm solution

Deep connection to autoencoders A single-layer linear autoencoder with squared reconstruction loss and tied weights (decoder = transpose of encoder) learns exactly the principal components (up to orthogonal transformation).

Nonlinear autoencoders (deep, with nonlinear activations) generalize kernel PCA → learn nonlinear manifolds.

Modern view (2025–2026) Variational autoencoders (VAEs) are probabilistic generalizations of probabilistic PCA → latent space regularized by KL divergence.

6.6 Incremental/online PCA for streaming data

Classical PCA requires the full dataset → not suitable for streaming or very large data.

Incremental PCA (online / streaming PCA) updates components as new data arrives.

Key algorithms:

  • Oja’s rule / generalized Hebbian algorithm: stochastic gradient ascent on variance

  • CCIPCA (Candid Covariance-free Incremental PCA): rank-1 updates without covariance matrix

  • IPCA in scikit-learn: incremental SVD-style updates

Applications

  • Real-time face recognition

  • Online recommender systems

  • Adaptive dimensionality reduction in robotics/sensor streams

  • Mini-batch training of large-scale embeddings

6.7 Randomized SVD for large-scale PCA

For massive matrices (e.g., billion-scale embeddings, attention matrices), exact SVD is impossible.

Randomized SVD (Halko, Martinsson, Tropp 2011) approximates top-k singular vectors/values with high probability:

  1. Generate random matrix Ω ∈ ℝ^{n×l} (l ≈ k + oversampling, typically 10–20)

  2. Compute Y = A Ω (or with power iterations, Y = (A Aᵀ)^q A Ω, for better accuracy)

  3. Orthogonalize YQ (QR)

  4. Compute small matrix B = Qᵀ A

  5. SVD of B = Û Σ V̂ᵀ

  6. Approximate singular vectors: U ≈ Q Û, V ≈ V̂

Theoretical guarantee: near-optimal approximation error with very high probability.
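The six steps above can be sketched compactly. The `randomized_svd` helper, the oversampling value, and the fast-decaying test spectrum are illustrative assumptions:

```python
import numpy as np

def randomized_svd(A, k, oversample=10, seed=0):
    """Randomized SVD sketch (Halko-Martinsson-Tropp, simplified)."""
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((A.shape[1], k + oversample))  # random probes
    Q, _ = np.linalg.qr(A @ Omega)       # orthonormal basis for the sampled range
    B = Q.T @ A                          # small (k+oversample) x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]  # lift back: U ~= Q Ub

# Test matrix with a rapidly decaying, known spectrum.
rng = np.random.default_rng(1)
U0, _ = np.linalg.qr(rng.standard_normal((100, 100)))
V0, _ = np.linalg.qr(rng.standard_normal((60, 60)))
sv = 4.0 ** -np.arange(60.0)             # sigma_i = 4^{-i}
A = U0[:, :60] @ np.diag(sv) @ V0.T

U, s, Vt = randomized_svd(A, k=5)
assert np.allclose(s, sv[:5], rtol=1e-3)  # top singular values recovered
```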

2025–2026 practice

  • scikit-learn TruncatedSVD uses randomized method by default

  • PyTorch / JAX libraries offer randomized SVD routines

  • Used in large-scale word embeddings, attention matrix factorization, recommender systems, and even approximating Hessians/Fisher matrices

Summary table – PCA variants at a glance

| Variant | Linearity | Scalability | Key Strength | Typical Use Case |
|---|---|---|---|---|
| Classical PCA | Linear | Medium | Simple, interpretable | Tabular data, preprocessing |
| Kernel PCA | Nonlinear | Low–medium | Captures nonlinear manifolds | Manifold learning, small-to-medium datasets |
| Probabilistic PCA | Linear | Medium | Handles missing data, uncertainty | Bayesian settings, incomplete data |
| Incremental PCA | Linear | High (streaming) | Online updates | Real-time systems, very large data |
| Randomized SVD/PCA | Linear | Very high | Fast on huge matrices | Billion-scale embeddings, attention analysis |

PCA and its variants remain indispensable tools for visualization, noise reduction, feature extraction, compression, and interpretability — even in the deep learning era.

7. Matrix Factorizations in Neural Networks

Matrix factorizations lie at the heart of parameter-efficient fine-tuning, efficient architectures, compression, initialization strategies, and regularization techniques in modern deep learning. This chapter connects classical decomposition methods (low-rank, spectral, tensor, circulant) to their direct applications in transformers, CNNs, diffusion models, and large-scale language/vision models (as of 2025–2026 practice).

7.1 Low-rank adaptation (LoRA): low-rank updates to weight matrices

LoRA (Low-Rank Adaptation, Hu et al. 2021) is one of the most widely adopted parameter-efficient fine-tuning (PEFT) methods.

Core idea Instead of updating the full weight matrix W ∈ ℝ^{d_out × d_in}, we approximate the update as a low-rank decomposition:

ΔW = B A where B ∈ ℝ^{d_out × r}, A ∈ ℝ^{r × d_in}, r ≪ min(d_out, d_in) (typically r = 4–64)

During fine-tuning:

  • W remains frozen

  • Only A and B are trained

  • Forward pass: y = (W + B A) x = W x + B (A x)
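A minimal NumPy sketch of the LoRA forward pass (dimensions are arbitrary; the zero-initialization of B, so that ΔW = 0 at the start of fine-tuning, follows the common convention):

```python
import numpy as np

d_out, d_in, r = 512, 512, 8                    # r << min(d_out, d_in)
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in))          # frozen pre-trained weight
A = rng.standard_normal((r, d_in)) * 0.01       # trainable down-projection
B = np.zeros((d_out, r))                        # trainable up-projection (zero-init)
x = rng.standard_normal(d_in)

y = W @ x + B @ (A @ x)                         # never materializes W + BA

# Trainable parameters: r(d_out + d_in) vs d_out * d_in for full fine-tuning
print(r * (d_out + d_in) / (d_out * d_in))      # 0.03125 -> ~3% of the weights
```

Computing B (A x) instead of (B A) x keeps the extra cost at two thin matrix-vector products per layer.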

Why low-rank works

  • Fine-tuning datasets are usually much smaller than pre-training → the necessary adaptation lies in a low-dimensional subspace

  • Empirical observation: most of the update ΔW has very low effective rank (often < 100 even for billion-parameter models)

  • Theoretical justification: comes from intrinsic low-dimensional structure in task-specific gradients

Variants & extensions (2023–2026)

  • DoRA (Weight-Decomposed Low-Rank Adaptation): splits weight magnitude and direction → better performance than vanilla LoRA with the same parameter budget

  • QLoRA / LoRA-FA / LoRA+: add quantization (QLoRA), freeze the A matrix (LoRA-FA), or use different learning rates for A and B (LoRA+)

  • LoHa / LoKr / (IA)³: even more parameter-efficient alternatives using Kronecker or Hadamard products

  • VeRA (Vector-based Random Matrix Adaptation): trains only scaling vectors instead of full low-rank matrices → extreme compression

Practical impact LoRA (and family) enables fine-tuning 70B+ LLMs on consumer GPUs with <1% of original parameters while matching or approaching full fine-tuning quality.

7.2 Spectral initialization and orthogonal initialization

Proper initialization prevents exploding or vanishing gradients in deep networks.

Orthogonal initialization Initialize weight matrices as (semi-)orthogonal: Wᵀ W = I (or scaled identity).

Common methods:

  • Saxe et al. (2013): Draw from random orthogonal matrix (via QR decomposition of random Gaussian matrix)

  • He/Kaiming orthogonal (variant of Kaiming init): orthogonal + appropriate scaling (√(gain / fan_in))

Spectral initialization Explicitly control the spectral norm (largest singular value) or eigenvalue distribution:

  • Set singular values to fixed value (e.g., all ≈ 1) → unitary-like propagation

  • Use truncated normal / uniform distribution clipped to desired spectral radius
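A minimal sketch of Saxe-style orthogonal initialization via QR of a Gaussian matrix (the sign correction keeps the distribution uniform over orthogonal matrices; function name is illustrative):

```python
import numpy as np

def orthogonal_init(shape, gain=1.0, seed=0):
    """Orthogonal initialization via QR of a random Gaussian matrix (sketch)."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal(shape)
    Q, R = np.linalg.qr(G)
    Q = Q * np.sign(np.diag(R))    # fix column signs for a uniform distribution
    return gain * Q

W = orthogonal_init((256, 256))
s = np.linalg.svd(W, compute_uv=False)
print(s.min(), s.max())            # all singular values = 1 -> condition number 1
```

Because every singular value equals the gain, the layer neither amplifies nor shrinks signal norm at initialization.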

Why it helps

  • Preserves gradient variance across layers (signal doesn’t explode or vanish)

  • Especially important in very deep networks, RNNs, and transformers without LayerNorm

  • Orthogonal / unitary layers are Lipschitz with constant 1 → improves adversarial robustness and training stability

Modern usage

  • Default in many transformer implementations (orthogonal init for attention QKV projections)

  • Spectral norm initialization in diffusion models and GANs

  • Helps mitigate spectral bias (preference for low-frequency functions) in early training

7.3 Weight decay as nuclear-norm regularization

Weight decay adds L₂ penalty on weights:

L_total = L_data + (λ/2) ‖W‖_F²

Nuclear norm ‖W‖_* = sum of singular values = convex relaxation of rank(W)

Key insight Weight decay = (λ/2) ‖W‖_F² is not directly low-rank promoting, but:

In over-parameterized regimes + SGD dynamics:

  • Weight decay implicitly encourages low effective rank (via implicit bias)

  • Several papers (2022–2025) show that SGD + weight decay approximates nuclear-norm regularization in the continuous limit

Practical effects

  • Reduces rank of weight matrices during training → fewer effective parameters → better generalization

  • Prevents overfitting by discouraging large singular values

  • AdamW decouples weight decay from adaptive learning rates → stronger low-rank effect than vanilla Adam + L₂

Related techniques

  • Explicit nuclear-norm regularization (rare, expensive)

  • Spectral norm penalty (controls largest singular value)

  • Low-rank adaptation methods (LoRA) directly enforce low-rank structure

7.4 Matrix completion and collaborative filtering

Matrix completion Recover missing entries of a partially observed matrix R ∈ ℝ^{m×n} (e.g., user–item ratings) assuming R is low-rank.

Classical approach (SVD-based): min_X ‖P_Ω(X − R)‖_F² + λ ‖X‖_*, where P_Ω keeps only the observed entries

Nuclear-norm minimization is convex relaxation of rank minimization → solved via proximal gradient or SVT (Singular Value Thresholding).
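One SVT step — the proximal operator of the nuclear norm — can be sketched as follows (illustrative only, not a full matrix-completion solver):

```python
import numpy as np

def svt(A, tau):
    """Singular Value Thresholding: soft-threshold the singular values by tau.
    This is the proximal operator of tau * nuclear norm (sketch)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)           # soft-thresholding
    return U @ np.diag(s_shrunk) @ Vt

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 40))
X = svt(A, tau=5.0)
print(np.linalg.matrix_rank(X), "<", np.linalg.matrix_rank(A))  # rank reduced
```

Iterating this step on the observed entries (with the missing entries re-imputed between steps) yields the classical SVT matrix-completion algorithm.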

Neural network view

  • Embed users and items into low-dimensional vectors U ∈ ℝ^{m×k}, V ∈ ℝ^{n×k}

  • Predict R̂_{ij} = uᵢᵀ vⱼ → low-rank matrix factorization

  • Modern variants: deep matrix factorization, neural collaborative filtering, autoencoder-based completion

2025–2026 relevance

  • Still foundation of many recommender systems

  • Low-rank inductive bias appears in attention matrices and cross-attention layers

  • Matrix completion ideas used in masked autoencoding (MAE) for vision and language

7.5 Tucker/CP decomposition for tensor layers

CANDECOMP/PARAFAC (CP) decomposition X ≈ ∑_{r=1}^R u_r ⊗ v_r ⊗ w_r (sum of rank-1 tensors)

Tucker decomposition (more general): X ≈ G ×₁ U ×₂ V ×₃ W (G: core tensor; U, V, W: factor matrices)

Applications in neural networks

  • Tensorized layers: replace large weight tensors with decomposed form (e.g., Tucker for conv layers)

  • Efficient multi-head attention approximations

  • Compression of large transformer layers (especially KV cache or adapter tensors)

  • Tensor Ring / Tensor Train decomposition for extreme compression

  • Used in some state-space models (Mamba-style) and efficient ViTs

Trade-off

  • Significant parameter reduction

  • But usually slower inference unless specialized kernels (e.g., TensorLy, t3f libraries)

7.6 Butterfly factorization and structured linear layers

Butterfly factorization (2017–2019) approximates dense matrices with products of sparse structured matrices (permutation + diagonal + butterfly blocks).

Key property A butterfly matrix is product of O(log n) sparse factors → O(n log n) parameters and multiply time instead of O(n²).

Related structured linear layers

  • Hadamard / Fourier / Toeplitz / circulant layers

  • Synthesizer / Linformer / Performer (low-rank + kernel approximations)

  • Long-range Arena benchmarks showed structured matrices scale better for long sequences

  • Monarch / Mamba-2 blocks use structured matrices for state-space efficiency

2025–2026 trend Structured matrices appear in sub-quadratic attention alternatives and efficient SSMs (Mamba family).

7.7 Fast Fourier Transform (FFT) and circulant matrices in efficient attention

A circulant matrix is defined by its first column (each subsequent column is shifted down by one entry, wrapping around).

Key property Circulant matrices are diagonalized by the DFT matrix F:

C = F⁻¹ diag(F c) F where c is first column

→ Matrix multiplication C x = IFFT( FFT(c) ⊙ FFT(x) ) → O(n log n) time
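The FFT identity above is easy to verify numerically (a small NumPy sketch, using the first-column convention for the circulant matrix):

```python
import numpy as np

def circulant_matvec(c, x):
    """Multiply circulant matrix C (first column c) by x in O(n log n) via FFT."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

rng = np.random.default_rng(0)
n = 8
c = rng.standard_normal(n)
x = rng.standard_normal(n)

# Dense circulant matrix for comparison: column j is c shifted down by j
C = np.column_stack([np.roll(c, j) for j in range(n)])
print(np.allclose(C @ x, circulant_matvec(c, x)))   # True
```

The same trick underlies every FFT-mixing layer: a dense-looking O(n²) global operator applied in O(n log n).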

Applications in efficient attention

  • FNet (2021): replace self-attention with FFT → mixes tokens in frequency domain

  • GFNet / AFNO (Adaptive Fourier Neural Operator): 2D FFT on image patches → global mixing

  • Vision Transformers with FFT: global convolution-like mixing at low cost

  • Efficient long-range attention: approximate softmax(QKᵀ) with circulant or Toeplitz structure + FFT

Advantages

  • O(n log n) complexity vs O(n²) for full attention

  • Global receptive field with fixed cost

  • Especially powerful for high-resolution images and long-context language models

Current status FFT-based layers are part of the efficient-transformer toolbox (together with linear attention, sparse attention, SSMs).

This chapter demonstrates how classical matrix and tensor factorizations directly inspire some of the most impactful architectural and training innovations in deep learning.

8. Eigenvalues and Dynamics in Deep Learning

Eigenvalues and spectral properties provide one of the most powerful lenses for understanding why and how deep neural networks train, generalize, and sometimes exhibit surprising behaviors. This chapter connects eigenvalue analysis to curvature of the loss surface, implicit biases of optimizers, frequency-domain learning dynamics, phase transitions like grokking, attention collapse, and long-term stability — all central themes in modern theoretical deep learning (2023–2026 literature).

8.1 Hessian eigenvalues: curvature, sharpness, and generalization

The Hessian H = ∇²J(θ) at a parameter point θ is a symmetric matrix whose eigenvalues quantify curvature of the loss landscape in every direction.

Key interpretations:

  • Positive eigenvalues → upward curvature (attractive direction toward minimum)

  • Negative eigenvalues → downward curvature (saddle or unstable direction)

  • Large positive eigenvalues → sharp curvature → sensitive / narrow minima

  • Small eigenvalues → flat directions → broad / insensitive minima

Sharpness and generalization (Keskar et al. 2017; later works 2020–2025):

  • Sharp minima (large dominant eigenvalues of Hessian) → poor generalization (overly sensitive to perturbations)

  • Flat minima (small eigenvalues, especially trace or max eigenvalue small) → better robustness and generalization → Hypothesis: SGD with small batch size / noise tends to find flatter minima

  • Sharpness-aware minimization (SAM) explicitly minimizes worst-case loss in neighborhood → flattens effective Hessian eigenvalues

Empirical metrics (2025 consensus):

  • Trace(H) / number of parameters → average curvature

  • Max eigenvalue λ_max(H) → sharpness proxy

  • Effective rank or number of large eigenvalues → how “high-dimensional” the minimum is

Modern view: flatness alone is not sufficient; volume and connectivity of flat regions matter more.

8.2 Loss landscape geometry via spectral properties

The loss J(θ) around a point θ₀ is approximated by second-order Taylor expansion:

J(θ₀ + δ) ≈ J(θ₀) + ∇Jᵀ δ + (1/2) δᵀ H δ

In the eigenbasis of H = Q Λ Qᵀ:

J(θ₀ + δ) ≈ J(θ₀) + (1/2) ∑ λ_i (q_iᵀ δ)²

→ The landscape is a paraboloid stretched differently along each eigenvector direction.

Spectral properties reveal geometry:

  • Very large λ_max → narrow “ravines” → SGD zig-zags (ill-conditioning)

  • Many near-zero eigenvalues → large flat plateaus → slow progress

  • Negative eigenvalues → saddle points → high-dimensional escape needed

  • In over-parameterized nets: most critical points are saddles; minima often lie in connected flat basins (mode connectivity)

Visualization insight (common in papers): PCA of sampled loss surface trajectories shows dominant low-dimensional structure → loss evolves mostly along a few principal Hessian directions.

8.3 Neural Tangent Kernel (NTK) and infinite-width eigenvalue spectrum

In the infinite-width limit, a neural network behaves like a kernel regression model with the Neural Tangent Kernel (NTK, Jacot et al. 2018):

f(θ; x) ≈ f(θ₀; x) + ∇_θ f(θ₀; x)ᵀ (θ − θ₀)

The NTK is Θ(x,x') = ∇_θ f(θ; x)ᵀ ∇_θ f(θ; x') (at initialization)

Spectral structure of NTK:

  • NTK is positive semi-definite

  • Its eigenvalues determine learning speed per eigenfunction

  • In infinite width: training dynamics become linear → gradient descent on NTK regression

Eigenvalue spectrum insights (2020–2025):

  • NTK eigenvalues decay rapidly → network learns low-frequency functions first (spectral bias)

  • Finite-width corrections introduce feature learning → spectrum becomes richer

  • μP (maximal update parametrization) keeps NTK eigenvalues well-behaved at scale

Practical implication: NTK analysis explains why wide networks generalize well and why scaling laws appear power-law-like.

8.4 Spectral bias of gradient descent: learning low-frequency components first

Spectral bias (Rahaman et al. 2019; later works): gradient descent on neural nets preferentially learns low-frequency (smooth) components of the target function before high-frequency (wiggly) ones.

Mechanism via NTK / linearization:

  • NTK eigenvalues are larger for low-frequency eigenfunctions

  • Learning speed ∝ eigenvalue size → low frequencies learned fastest

Finite-width effects:

  • Early training: behaves like infinite-width NTK → strong low-frequency bias

  • Later training: feature learning kicks in → higher frequencies gradually captured

Consequences:

  • Explains double descent: wide networks fit low-frequency signal quickly, then slowly memorize high-frequency noise

  • Adversarial vulnerability: high-frequency perturbations fool models early in training

  • grokking / delayed generalization: low-frequency modular rules learned first, high-frequency noise ignored until late

Mitigations (2025 research):

  • Curriculum learning (low → high frequency)

  • Frequency-aware regularization

  • Sharpness-aware methods → indirectly flatten high-frequency curvature

8.5 Grokking and phase transitions through eigenvalue dynamics

Grokking (Power et al. 2022): prolonged overfitting followed by sudden generalization (e.g., modular arithmetic in transformers).

Eigenvalue-based explanations (2023–2026 mechanistic studies):

  • Early phase: network fits training data → large eigenvalues associated with memorization directions

  • Prolonged phase: weight norm grows slowly → gradual alignment of directions with low-frequency modular invariants

  • Sudden grokking moment: phase transition when dominant eigenvalues shift from high-frequency memorization to low-frequency generalization patterns

  • Spectral metrics (heavy-tailed eigenvalue index α) jump at grokking point → HTSR (Heavy-Tailed Self-Regularization) detects transition reliably

  • Regularization (weight decay, small ℓ₁) accelerates eigenvalue alignment → grokking without long overfitting phase

Geometric view: Grokking occurs when trajectory escapes sharp memorization minima and enters flat, generalizing basin → visible in Hessian eigenvalue collapse or mode connectivity.

8.6 Attention matrices: eigenvalues of softmax(QKᵀ/√d) and softmax collapse

The attention matrix P = softmax(Q Kᵀ / √d_k) is row-stochastic (rows sum to 1).

Spectral properties:

  • Eigenvalues of P have modulus ≤ 1; one eigenvalue equals 1 (the all-ones vector is a right eigenvector)

  • Remaining eigenvalues have modulus < 1 → they measure how “concentrated” attention is
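These properties can be checked directly on a random attention matrix (NumPy sketch; the softmax helper is illustrative):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)         # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d = 16, 32
Q, K = rng.standard_normal((n, d)), rng.standard_normal((n, d))
P = softmax(Q @ K.T / np.sqrt(d))

print(np.allclose(P.sum(axis=1), 1.0))              # row-stochastic
print(np.allclose(P @ np.ones(n), np.ones(n)))      # all-ones is a right eigenvector
print(np.max(np.abs(np.linalg.eigvals(P))))         # spectral radius = 1
```

In a collapse diagnostic one would additionally track how fast the remaining eigenvalue moduli shrink toward zero during training.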

Softmax collapse (2023–2025 phenomenon): In late training or large depth, attention matrices become nearly constant (all entries ≈ 1/seq_len) → rank collapse.

Spectral signature:

  • One dominant eigenvalue ≈ 1

  • All other eigenvalues → 0 → Effective rank drops dramatically → attention loses token distinction

Causes:

  • Over-smoothing from residual connections + LayerNorm

  • Large initialization variance or scaling issues

  • Mitigation: attention scaling adjustments, residual dropout, or explicit rank-promoting regularizers

Visualization insight: Top singular vectors of attention matrices reveal dominant query-key interaction patterns; collapse = all singular values except one vanish.

8.7 Stability of recurrent networks via largest eigenvalue of weight matrix

For a simple recurrent network:

h_{t+1} = tanh(W h_t + U x_t)

Long-term behavior governed by W:

  • If spectral radius ρ(W) = max |λ_i(W)| < 1 → gradients vanish exponentially (vanishing gradient problem)

  • If ρ(W) > 1 → gradients explode

  • ρ(W) ≈ 1 → best memory retention
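A small numerical sketch of this effect (the tanh is dropped, i.e. the recurrence is linearized around h = 0 where tanh′ = 1; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
rho = np.max(np.abs(np.linalg.eigvals(W)))     # spectral radius of raw W
h0 = rng.standard_normal(64)

norms = {}
for scale in (0.9, 1.1):
    Ws = (scale / rho) * W                     # rescale so that rho(Ws) = scale
    h = h0.copy()
    for _ in range(200):
        h = Ws @ h                             # linearized recurrence, no input
    norms[scale] = np.linalg.norm(h)

print(norms[0.9], norms[1.1])                  # vanishes vs explodes
```

Two hundred steps are enough to separate the regimes by many orders of magnitude, which is exactly why long sequences make ρ(W) ≈ 1 so delicate.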

Modern recurrent models (LSTM, GRU, Mamba):

  • Gate mechanisms implicitly control effective spectral radius

  • State-space models (Mamba family) explicitly parametrize continuous-time dynamics → stable eigenvalues by design

2025–2026 trend:

  • Orthogonal / unitary RNNs → enforce ρ(W) = 1 exactly

  • Spectral normalization on recurrent weights → bounds exploding gradients

  • Mamba-2 and structured state models use diagonal + low-rank structure → controlled eigenvalue spectrum

This chapter illustrates how eigenvalues are not just a theoretical curiosity — they directly explain training speed, generalization, phase transitions, attention collapse, and recurrent stability in deep networks.

9. Advanced Spectral Methods and Applications

This chapter explores how spectral techniques — eigenvalue and singular value analysis, Fourier methods, and randomized linear algebra — extend far beyond basic decomposition and become core tools in modern graph learning, generative modeling, representation learning, equivariant architectures, and scientific machine learning (especially PDE solving). These methods exploit the spectral structure of data or operators to achieve efficiency, stability, inductive biases, and interpretability at scale.

9.1 Graph Laplacians, spectral clustering, and GNNs

The graph Laplacian of an undirected graph G = (V, E) with adjacency matrix A and degree matrix D is:

L = D − A (unnormalized) L_sym = I − D^{−1/2} A D^{−1/2} (symmetric normalized) L_rw = I − D^{−1} A (random walk normalized)

Spectral properties:

  • L is positive semi-definite; eigenvalues 0 = λ₁ ≤ λ₂ ≤ … ≤ λₙ

  • λ₁ = 0 with eigenvector constant vector (connected graph)

  • Algebraic connectivity = λ₂ (Fiedler eigenvalue) → measures how well-connected the graph is

Spectral clustering (Ng–Jordan–Weiss, Shi–Malik):

  1. Compute the eigenvectors corresponding to the k smallest eigenvalues of the normalized Laplacian

  2. Form matrix with these eigenvectors as columns

  3. Cluster rows using k-means

→ Minimizes normalized cut / ratio cut approximately
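For a two-cluster toy graph, the sign pattern of the Fiedler vector alone already recovers the cut (a NumPy sketch of steps 1–3, with the k-means step reduced to thresholding at zero):

```python
import numpy as np

# Two triangles joined by a single bridge edge (nodes 0-2 and 3-5)
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

d = A.sum(axis=1)
L_sym = np.eye(n) - A / np.sqrt(np.outer(d, d))   # I - D^{-1/2} A D^{-1/2}

eigvals, eigvecs = np.linalg.eigh(L_sym)          # eigenvalues in ascending order
fiedler = eigvecs[:, 1]                           # second-smallest eigenvector
clusters = fiedler > 0                            # sign pattern = 2-way cut
print(clusters)                                   # separates {0,1,2} from {3,4,5}
```

With k > 2 clusters one stacks the k eigenvectors as columns and runs k-means on the rows, exactly as in the steps above.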

Graph Neural Networks (GNNs) and spectral view:

  • Graph convolutional networks (GCN, Kipf & Welling 2017): approximate spectral filters on Laplacian eigenvalues H^{(l+1)} = σ( D̃^{−1/2} Ã D̃^{−1/2} H^{(l)} W^{(l)} ), where Ã = A + I (adjacency with self-loops) and D̃ is its degree matrix → polynomial filter on graph spectrum

  • ChebNet / ChebyGIN: use Chebyshev polynomials of Laplacian → better approximation of arbitrary spectral filters

  • ARMA convolutions, BernNet: learn rational filters on graph spectrum

  • 2025–2026 trend: Fourier-based GNNs (e.g., GFNet-style on graphs) and spectral attention mechanisms

Key insight: Many GNNs are implicitly performing low-pass filtering in the graph Fourier domain → explains oversmoothing and why high-frequency graph signals are hard to capture.

9.2 Spectral normalization of weight matrices (SN-GAN, SN-ResNet)

Spectral normalization (Miyato et al. 2018) controls the Lipschitz constant of a neural network layer by normalizing the spectral norm (largest singular value) of each weight matrix:

W_{sn} = W / σ_max(W)

σ_max(W) is approximated via power iteration:

v_{k+1} = Wᵀ u_k / ‖Wᵀ u_k‖  u_{k+1} = W v_{k+1} / ‖W v_{k+1}‖  σ̂ ≈ uᵀ W v (after a few iterations)
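A sketch of this power iteration for σ_max (in spectral-normalization practice a single cached u vector and one iteration per training step suffice; here we iterate longer and check against the exact SVD):

```python
import numpy as np

def spectral_norm(W, n_iters=100, seed=0):
    """Estimate sigma_max(W) by power iteration (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    return u @ W @ v               # Rayleigh-quotient estimate of sigma_max

rng = np.random.default_rng(1)
W = rng.standard_normal((128, 64))
print(np.isclose(spectral_norm(W), np.linalg.svd(W, compute_uv=False)[0], rtol=1e-3))
```

Dividing W by this estimate gives the normalized weight W_{sn} = W / σ̂ used in SN-GAN.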

Applications:

  • SN-GAN (Spectral Normalization GAN): stabilizes training by enforcing 1-Lipschitz discriminator → replaces gradient penalty

  • SN-ResNet: improves adversarial robustness and clean accuracy in classification

  • SNGP (Spectral-normalized Neural Gaussian Process): combines spectral norm with Gaussian process output layer → uncertainty estimation

  • Diffusion models & score-based models: spectral norm on certain layers prevents mode collapse

Advantages:

  • Single hyperparameter (no need to tune gradient penalty coefficient)

  • Computationally cheap (power iteration converges very fast)

  • Improves generalization and stability in many architectures

2025–2026 extensions: adaptive spectral scaling, spectral norm per head in multi-head attention, and combination with sharpness-aware methods.

9.3 Power iteration in contrastive learning and representation learning

Power iteration approximates the dominant eigenvector/eigenvalue of a matrix by repeated multiplication.

In contrastive learning:

  • SimCLR / MoCo-style: the covariance matrix of projected representations is implicitly maximized

  • Barlow Twins / VICReg: explicitly decorrelate features → penalize off-diagonal elements of cross-correlation matrix

  • Power iteration view: repeated application of projection head can be seen as power method on Gram matrix → amplifies dominant directions (invariant features) while suppressing noise

Direct usage:

  • Spectral contrastive loss (2022–2024 papers): maximize top-k eigenvalues of cross-view covariance matrix

  • Iterative normalization (Huang et al.): power iteration inside normalization layers → learns data-dependent orthogonal bases

  • Spectral decoupling in self-supervised learning: power method on representation covariance → separates content and style

Representation collapse prevention:

  • Power iteration detects collapse (one dominant eigenvalue ≈ batch size) → early warning for training instability

9.4 Randomized linear algebra: randomized rangefinder, CUR decomposition

For very large matrices (e.g., billion-scale embeddings, attention maps, Hessians), exact SVD is infeasible.

Randomized rangefinder:

  1. Generate random test matrix Ω ∈ ℝ^{n×l} (Gaussian or sparse)

  2. Compute Y = A Ω

  3. QR factorize Y = Q R → Q approximates the range of A

  4. Project: B = Qᵀ A → small matrix

  5. SVD of B → approximate singular vectors/values of A

CUR decomposition (alternative low-rank approximation): A ≈ C U R

  • C = few columns of A

  • R = few rows of A

  • U = intersection matrix (pseudoinverse-based) → Interpretable (actual data entries), faster than SVD in some regimes

2025–2026 applications:

  • Randomized SVD for large-scale PCA on embeddings

  • CUR for interpretable feature selection in tabular LLMs

  • Approximate Hessian/Fisher sketching in second-order optimizers

  • Fast low-rank attention approximations

9.5 Matrix exponentials and Lie groups in equivariant networks

Matrix exponential exp(M) = ∑_{k=0}^∞ (1/k!) Mᵏ → X(t) = exp(tM) X₀ solves the differential equation dX/dt = M X → continuous transformation

Lie groups in equivariant networks:

  • SO(3) (rotations): special orthogonal group → matrix exponential of skew-symmetric matrices

  • SE(3) (rigid motions): used in 3D point cloud networks, molecular modeling

  • Unitary group U(n): complex analogue of the orthogonal group → appears in complex-valued nets, Fourier layers

Applications:

  • E(n) Equivariant Graph Neural Networks: use matrix exp to build continuous equivariant transformations

  • LieConv / LieResNet: parametrize filters via Lie algebra exponentials → group-equivariant convolutions

  • Continuous normalizing flows (CNFs): dynamics defined via neural ODEs with matrix exponentials

  • Neural ODEs on manifolds: exponential maps for optimization on spheres, Stiefel manifolds

2025–2026 trend: exponential integrators and Lie-group transformers for physics-informed and geometric deep learning.

9.6 Operator learning and Fourier Neural Operators via spectral methods

Operator learning aims to learn mappings between infinite-dimensional function spaces (e.g., PDE solution operators).

Fourier Neural Operator (FNO, Li et al. 2020):

  • Lift input function to higher channel dimension

  • Apply FFT → pointwise multiplication in frequency domain (spectral convolution)

  • Apply inverse FFT → mix in spatial domain

  • Stack layers → global mixing with O(N log N) cost (N = grid points)
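A single FNO-style spectral convolution on a 1-D signal can be sketched as follows (illustrative; a real FNO learns the complex weights and adds a pointwise linear path):

```python
import numpy as np

def spectral_conv_1d(x, weights, modes):
    """One FNO-style spectral convolution (sketch): FFT -> multiply the lowest
    `modes` frequencies by complex weights, truncate the rest -> inverse FFT."""
    x_hat = np.fft.rfft(x)
    out_hat = np.zeros_like(x_hat)
    out_hat[:modes] = weights * x_hat[:modes]   # keep only low-frequency modes
    return np.fft.irfft(out_hat, n=len(x))

n, modes = 64, 8
t = np.linspace(0, 1, n, endpoint=False)
x = np.sin(2 * np.pi * 3 * t)                   # pure low-frequency input

# Identity weights on the kept modes -> acts as an ideal low-pass filter
y = spectral_conv_1d(x, weights=np.ones(modes, dtype=complex), modes=modes)
print(np.allclose(y, x))                        # frequency 3 < 8 kept modes
```

Because the filter lives in frequency space, the same weights apply unchanged on a finer grid — the resolution invariance noted below.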

Why spectral methods excel:

  • Fourier basis diagonalizes translation-invariant operators → efficient convolution

  • Low-frequency modes dominate many physical operators → truncation in frequency domain loses little information

  • Resolution-invariant: works on different grid sizes

Extensions (2023–2026):

  • GeoFNO / U-NO: adaptive Fourier bases, unstructured meshes

  • Spectral Operators with learnable eigenfunctions

  • Wavelet Neural Operators, MIONet (multi-input/output)

  • Operator learning for climate modeling, fluid dynamics, material science

Spectral viewpoint: FNO is a deep neural network in the Fourier domain → learns spectral filters → captures global dependencies without quadratic cost.

This chapter illustrates how spectral thinking scales from small networks to scientific computing and geometric deep learning, providing inductive biases that are hard to achieve with purely spatial methods.

10. Numerical Linear Algebra in Practice

This final chapter shifts focus from pure mathematics to the real-world implementation of linear algebra in deep learning pipelines. Even with perfect algorithms, floating-point precision, hardware acceleration, library choices, and approximation trade-offs determine whether computations are feasible, stable, and accurate at scale. These considerations are especially critical for large models (2025–2026 era: 100B–10T parameters), where matrices can exceed tens of terabytes and every bit of efficiency matters.

10.1 Floating-point arithmetic and conditioning pitfalls

Modern hardware uses IEEE 754 floating-point formats:

  • fp32 (single precision): 1 sign bit, 8 exponent bits, 23 mantissa bits → ~7 decimal digits

  • fp16 / bf16 (half precision): 1 sign, 5/8 exponent, 10/7 mantissa → ~3–4 digits

  • fp64 (double precision): 1 sign, 11 exponent, 52 mantissa → ~15–16 digits

Condition number impact The relative error in solution x of Ax = b is bounded by:

‖δx‖ / ‖x‖ ≤ κ(A) × (‖δA‖/‖A‖ + ‖δb‖/‖b‖)

where κ(A) = ‖A‖ ‖A⁻¹‖ ≈ σ_max / σ_min (2-norm condition number)
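A quick demonstration with the notoriously ill-conditioned Hilbert matrix (sketch; the exact amplification depends on the direction of the perturbation):

```python
import numpy as np

n = 8
i = np.arange(n)
A = 1.0 / (i[:, None] + i[None, :] + 1.0)       # 8x8 Hilbert matrix
s = np.linalg.svd(A, compute_uv=False)
kappa = s[0] / s[-1]                            # 2-norm condition number
print(f"kappa ~ {kappa:.1e}")                   # on the order of 1e10

x_true = np.ones(n)
b = A @ x_true
rng = np.random.default_rng(0)
db = 1e-12 * np.linalg.norm(b) * rng.standard_normal(n)
x = np.linalg.solve(A, b + db)                  # relative perturbation ~1e-12 ...
print(np.linalg.norm(x - x_true))               # ... amplified by many orders
```

The bound above is exactly what this shows: a relative perturbation of 10⁻¹² in b can move the solution by roughly κ(A) × 10⁻¹².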

Pitfalls in deep learning:

  • Large condition numbers (κ ≫ 1) in weight matrices or Hessians → tiny input perturbations cause huge output changes

  • Repeated multiplications in deep layers amplify rounding errors exponentially if κ > 1 per layer

  • Ill-conditioned covariance matrices in batch normalization or PCA → numerical rank underestimation

  • Catastrophic cancellation in subtraction-heavy operations (e.g., residual connections without proper scaling)

Practical mitigations:

  • Use higher precision for sensitive steps (e.g., fp32 statistics in mixed-precision training)

  • Orthogonal/spectral initialization → keeps κ close to 1

  • Gradient clipping + weight decay → prevents extreme singular values

  • LayerNorm / RMSNorm → implicitly precondition layers

10.2 BLAS, LAPACK, cuBLAS, and modern GPU-accelerated linear algebra

BLAS (Basic Linear Algebra Subprograms) — Level 1 (vector), Level 2 (matrix-vector), Level 3 (matrix-matrix) operations LAPACK — Higher-level routines (LU, QR, SVD, eigendecomposition) built on BLAS

GPU equivalents:

  • cuBLAS (NVIDIA) — GPU-accelerated BLAS

  • cuSOLVER — GPU LAPACK-like routines (dense & sparse)

  • Magma / rocBLAS / hipSOLVER — AMD ROCm equivalents

  • oneMKL / oneDNN — Intel oneAPI Math Kernel Library and Deep Neural Network Library (CPU + GPU)

2025–2026 landscape:

  • Most deep learning frameworks (PyTorch, JAX, TensorFlow) dispatch matmul, conv, etc., to cuBLAS / CUTLASS / Triton kernels

  • CUTLASS 3.0+ and Triton allow custom GEMM kernels → outperform cuBLAS on certain shapes

  • FlashAttention-2 / FlashAttention-3 — fused attention + softmax + matmul kernels → massive memory and speed gains

  • cuTENSOR / cuSPARSE — specialized for tensor contractions and sparse operations

Key lesson: Performance is dominated by memory bandwidth and kernel fusion, not just arithmetic intensity. Dense linear algebra on GPUs is highly optimized; sparse or irregular operations remain bottlenecks.

10.3 JAX/PyTorch einsum and compiled linear operations

einsum — Einstein summation notation — unifies many linear algebra operations:

Examples:

  • Matrix multiplication: ij,jk->ik

  • Batched matmul: bik,bkj->bij

  • Dot-product attention: bhid,bhjd->bhij

  • Bilinear form: bi,bj,ijk->bk
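The patterns above, verified against plain matrix products in NumPy (the same subscript strings work in torch.einsum and jax.numpy.einsum):

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.standard_normal((4, 5)), rng.standard_normal((5, 6))
X, Y = rng.standard_normal((8, 4, 5)), rng.standard_normal((8, 5, 6))
Q = rng.standard_normal((2, 3, 10, 16))            # (batch, heads, seq, head_dim)
K = rng.standard_normal((2, 3, 10, 16))

mm = np.einsum("ij,jk->ik", A, B)                  # matrix multiplication
bmm = np.einsum("bik,bkj->bij", X, Y)              # batched matmul
scores = np.einsum("bhid,bhjd->bhij", Q, K)        # attention logits Q K^T

print(np.allclose(mm, A @ B))
print(np.allclose(bmm, X @ Y))
print(np.allclose(scores, Q @ K.transpose(0, 1, 3, 2)))
```

One notation covers all three cases, which is why compilers can fuse and reorder einsum contractions so aggressively.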

PyTorch:

  • torch.einsum — dynamic, interpreted

  • torch.compile(…, mode="reduce-overhead") + einsum → fuses operations, eliminates temporary tensors

JAX:

  • jax.numpy.einsum — traced and compiled via XLA

  • Automatic differentiation through einsum

  • jax.jit + jax.lax.scan / vmap → massive speedups for structured linear ops

  • pallas (JAX’s Triton-like kernel language) — write custom GPU kernels for einsum patterns

Practical impact:

  • Compiled einsum enables 2–10× speedups on attention, LoRA, spectral layers

  • JAX excels at higher-order derivatives through compiled linear ops (e.g., Hessian-vector products)

  • PyTorch 2.x torch.compile closes the gap significantly

10.4 Mixed-precision training and its impact on eigenvalue accuracy

Mixed-precision training (fp16/bf16 + fp32 master weights) is standard for large models.

Effects on linear algebra accuracy:

  • fp16 has only ~3 decimal digits → severe rounding in long summation chains (e.g., matmul reduction)

  • bf16 has more exponent range but same mantissa → better for weights/activations, but still lossy for gradients

  • Eigenvalue / singular value computation is very sensitive to rounding errors when matrix is ill-conditioned

  • Small singular values can be completely lost → effective numerical rank underestimation

  • Hessian eigenvalue histograms shift/artificially flatten due to fp16 noise
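The rounding behaviour is easy to reproduce: naively accumulating 10,000 copies of 1e-4 in fp16 stalls far from the true sum (sketch; NumPy rounds each intermediate result back to fp16):

```python
import numpy as np

# Naive accumulation of 10,000 copies of 1e-4 in different precisions
acc16 = np.float16(0.0)
acc32 = np.float32(0.0)
for _ in range(10_000):
    acc16 = np.float16(acc16 + np.float16(1e-4))   # rounds after every add
    acc32 = np.float32(acc32 + np.float32(1e-4))

print(float(acc16))   # stalls near 0.25: the increment falls below half an ulp
print(float(acc32))   # close to 1.0 (only small drift)
```

This is precisely why matmul reductions and gradient accumulation are kept in fp32 even when weights and activations are half precision.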

Mitigations in practice (2025–2026 standards):

  • Master weights in fp32 / fp64

  • Gradient accumulation in fp32

  • Loss scaling (dynamic or static)

  • Selective fp32 for statistics (LayerNorm, batch stats)

  • fp8 training (E4M3/E5M2 formats) — even more aggressive; requires careful per-tensor scaling in addition to loss scaling

  • Delayed parameter updates in mixed-precision second-order methods (Sophia, K-FAC variants)

Rule of thumb: If you need accurate eigenvalues/singular values (e.g., spectral analysis, sharpness measurement), compute in fp32 or higher — even if model was trained in mixed precision.

10.5 Randomized sketching for large-scale matrix computations

Sketching reduces matrix size while preserving important properties with high probability.

Common techniques:

  • Random projection (Johnson–Lindenstrauss): Y = S A, S random Gaussian or sparse sign matrix

  • CountSketch / Sparse JL — faster, memory-efficient

  • Randomized rangefinder (for SVD/PCA) — as in Chapter 9.4

  • Column/row sketching for CUR decomposition

Large-scale applications:

  • Approximate Hessian/Fisher matrix in second-order optimizers (Sophia, K-FAC sketching)

  • Fast low-rank attention (Linformer-style sketching of QKᵀ)

  • Sketching for large covariance estimation in continual learning

  • Randomized Nyström for kernel approximation in large-scale kernel methods

Theoretical guarantees: Error bounds scale as O(1/√sketch_dim) → 10–20× oversampling usually sufficient for good approximation.

10.6 When to use exact vs. approximate methods at scale

Exact methods (QR, SVD, eigendecomposition, dense LU):

| Situation | Recommended | Why / Trade-offs |
|---|---|---|
| Small–medium matrices (n ≤ 10⁴) | Exact | Accuracy critical, time acceptable |
| Hessian / Fisher analysis | Exact (if possible) | Eigenvalues/singular values very sensitive |
| Interpretability tasks | Exact | Exact singular vectors needed for visualization |
| Debugging / toy models | Exact | Reproducibility and precision |

Approximate / randomized / structured methods:

| Situation | Recommended | Why / When it wins |
|---|---|---|
| n > 10⁵–10⁶ | Randomized SVD / sketching | O(n log n) or better vs O(n³) |
| Full Hessian impossible | Diagonal / K-FAC / HvP + CG | Memory & compute feasible at billion-param scale |
| Attention / large sequence | FlashAttention / Linformer / Performer | Memory bandwidth dominates over exactness |
| Streaming / online PCA | Incremental / randomized | Cannot store full matrix |
| Production inference at scale | Structured / low-rank | Latency and memory critical |

2025–2026 best practices:

  • Default to approximate for anything > 10k–50k dimension

  • Use exact only when interpretability or scientific accuracy is the goal

  • Hybrid: exact on small submatrices (e.g., per-layer statistics), approximate globally

  • Always validate approximation error on small representative problems

This chapter closes the loop: theoretical linear algebra meets hardware reality. Understanding these numerical and implementation details is what separates research prototypes from production-scale training and inference.
