All my books are exclusively available on Amazon. The free notes/materials on globalcodemaster.com do NOT match even 1% with any of my published books. Similar topics ≠ same content. Books have full details, exercises, chapters & structure — website notes do not. No book content is shared here. We fully comply with Amazon policies.


Linear Algebra for AI: Vectors Matrices Eigenvalues in Neural Networks

TABLE OF CONTENTS

1. Foundations: Scalars, Vectors, and Vector Spaces
  1.1 Scalars vs. Vectors: From numbers to directional quantities
  1.2 Vector addition, scalar multiplication, and geometric intuition
  1.3 Linear combinations, span, and linear independence
  1.4 Basis, dimension, and coordinate representations
  1.5 Normed vector spaces: Euclidean (L₂), L₁, L∞ norms in ML
  1.6 Inner products, dot product, cosine similarity, and angle
  1.7 Orthogonality, orthonormal bases, and Gram-Schmidt process
  1.8 Vector spaces over real and complex numbers (why complex numbers appear in spectral methods)

2. Matrices as Linear Transformations
  2.1 Matrix representation of linear maps
  2.2 Matrix multiplication: composition of transformations
  2.3 Special matrices in deep learning: identity, diagonal, permutation, symmetric, skew-symmetric, orthogonal, unitary
  2.4 Matrix-vector products: linear layers, attention scores
  2.5 Rank, column space, row space, null space, and rank-nullity theorem
  2.6 Full-rank, low-rank, and rank-deficient matrices in practice
  2.7 Matrix norms (Frobenius, operator/spectral, nuclear) and their ML roles

3. Systems of Linear Equations and Least Squares
  3.1 Matrix form Ax = b and solvability conditions
  3.2 Gaussian elimination, LU decomposition, pivoting
  3.3 Overdetermined systems and least-squares solutions
  3.4 Normal equations: AᵀA x = Aᵀb
  3.5 QR decomposition and its numerical stability
  3.6 SVD-based least squares (pseudoinverse)
  3.7 Regularized least squares: ridge regression as preconditioned system

4. Eigenvalues, Eigenvectors, and Spectral Decomposition
  4.1 Definition and geometric meaning of eigenvalues/eigenvectors
  4.2 Characteristic polynomial and algebraic multiplicity
  4.3 Diagonalizable matrices and eigendecomposition A = PDP⁻¹
  4.4 Symmetric matrices: real eigenvalues, orthogonal eigenvectors, spectral theorem
  4.5 Positive (semi)definite matrices and quadratic forms
  4.6 Spectral radius, condition number, and numerical stability
  4.7 Power method, inverse iteration, and Rayleigh quotient for largest/smallest eigenvalues
  4.8 Deflation and subspace iteration techniques

5. Singular Value Decomposition (SVD) – The Cornerstone of Modern ML
  5.1 Definition and geometric interpretation
  5.2 Reduced, compact, and truncated SVD
  5.3 Low-rank approximation and Eckart–Young–Mirsky theorem
  5.4 SVD for dimensionality reduction: PCA as SVD on centered data
  5.5 Matrix pseudoinverse via SVD
  5.6 Spectral norm, Frobenius norm, and nuclear norm through singular values
  5.7 Applications: image compression, recommender systems, attention visualization

6. Principal Component Analysis (PCA) and Variants
  6.1 Covariance matrix and variance maximization view
  6.2 SVD vs. eigendecomposition approaches to PCA
  6.3 Centering, scaling, and standardization in practice
  6.4 Kernel PCA for nonlinear dimensionality reduction
  6.5 Probabilistic PCA and connection to autoencoders
  6.6 Incremental/online PCA for streaming data
  6.7 Randomized SVD for large-scale PCA

7. Matrix Factorizations in Neural Networks
  7.1 Low-rank adaptation (LoRA): low-rank updates to weight matrices
  7.2 Spectral initialization and orthogonal initialization
  7.3 Weight decay as nuclear-norm regularization
  7.4 Matrix completion and collaborative filtering
  7.5 Tucker/CP decomposition for tensor layers
  7.6 Butterfly factorization and structured linear layers
  7.7 Fast Fourier Transform (FFT) and circulant matrices in efficient attention

8. Eigenvalues and Dynamics in Deep Learning
  8.1 Hessian eigenvalues: curvature, sharpness, and generalization
  8.2 Loss landscape geometry via spectral properties
  8.3 Neural Tangent Kernel (NTK) and infinite-width eigenvalue spectrum
  8.4 Spectral bias of gradient descent: learning low-frequency components first
  8.5 Grokking and phase transitions through eigenvalue dynamics
  8.6 Attention matrices: eigenvalues of softmax(QKᵀ/√d) and softmax collapse
  8.7 Stability of recurrent networks via largest eigenvalue of weight matrix

9. Advanced Spectral Methods and Applications
  9.1 Graph Laplacians, spectral clustering, and GNNs
  9.2 Spectral normalization of weight matrices (SN-GAN, SN-ResNet)
  9.3 Power iteration in contrastive learning and representation learning
  9.4 Randomized linear algebra: randomized rangefinder, CUR decomposition
  9.5 Matrix exponentials and Lie groups in equivariant networks
  9.6 Operator learning and Fourier Neural Operators via spectral methods

10. Numerical Linear Algebra in Practice
  10.1 Floating-point arithmetic and conditioning pitfalls
  10.2 BLAS, LAPACK, cuBLAS, and modern GPU-accelerated linear algebra
  10.3 JAX/PyTorch einsum and compiled linear operations
  10.4 Mixed-precision training and its impact on eigenvalue accuracy
  10.5 Randomized sketching for large-scale matrix computations
  10.6 When to use exact vs. approximate methods at scale

1. Foundations: Scalars, Vectors, and Vector Spaces

This opening chapter establishes the most fundamental objects in linear algebra — scalars and vectors — and introduces the abstract structure of vector spaces. These concepts form the bedrock for everything that follows: matrix operations, neural network layers, embeddings, attention mechanisms, optimization landscapes, and spectral methods in modern AI.

1.1 Scalars vs. Vectors: From numbers to directional quantities

A scalar is just an ordinary real number (or sometimes complex number): 3.14, −7, 0.001, etc. Scalars have magnitude but no direction.

A vector is an ordered collection of scalars that carries both magnitude and direction. In ℝⁿ (n-dimensional Euclidean space), a vector is written as:

v = (v₁, v₂, …, vₙ) or v = v₁ e₁ + v₂ e₂ + … + vₙ eₙ

where eᵢ are the standard basis vectors.

Geometric intuition

  • In 2D: an arrow with length and direction (e.g., displacement, velocity, gradient of loss).

  • In high dimensions (d = 10⁶+ in deep learning): no direct visualization, but same algebraic rules apply.

ML examples

  • Word embeddings: GloVe / Word2Vec vectors live in ℝ³⁰⁰–ℝ¹⁰²⁴.

  • Image patches flattened into vectors ∈ ℝ^{H×W×C}.

  • Gradient vector ∇_θ J(θ) ∈ ℝ^{number of parameters}.

1.2 Vector addition, scalar multiplication, and geometric intuition

Vector addition u + v = (u₁ + v₁, u₂ + v₂, …, uₙ + vₙ) → Parallelogram law (head-to-tail addition).

Scalar multiplication c v = (c v₁, c v₂, …, c vₙ) → Stretching (c > 1), shrinking (0 < c < 1), or reversing direction (c < 0).

Properties (defining a vector space over field ℝ or ℂ):

  • Commutative, associative addition

  • Distributive laws

  • Additive identity (zero vector 0)

  • Additive inverse (−v)

Geometric / ML intuition

  • Adding two embedding vectors ≈ combining semantic meanings.

  • Scaling a gradient vector controls step size in optimization.

1.3 Linear combinations, span, and linear independence

A linear combination of vectors v₁, v₂, …, vₖ is any vector of the form:

c₁ v₁ + c₂ v₂ + … + cₖ vₖ (cᵢ scalars)

The span of a set of vectors {v₁, …, vₖ} is the set of all possible linear combinations.

A set of vectors is linearly independent if the only way to get the zero vector as a linear combination is to take all coefficients zero:

c₁ v₁ + … + cₖ vₖ = 0 ⇒ c₁ = c₂ = … = cₖ = 0

Otherwise, the set is linearly dependent (at least one vector is a combination of the others).

ML relevance

  • Over-complete bases (more features than needed) → linear dependence → multicollinearity in regression.

  • Attention heads or neurons often learn redundant (dependent) directions → motivates pruning, distillation, low-rank adaptation (LoRA).
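A quick way to test linear independence in practice is to stack the vectors as columns of a matrix and compare its rank to the number of vectors. A minimal numpy sketch (the vectors here are illustrative):

```python
import numpy as np

# Columns: v1, v2, v3 = v1 + v2 — the third is a linear combination of the first two
V = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 0.0]])

# Independent iff rank equals the number of columns
rank = np.linalg.matrix_rank(V)
print(rank)   # 2 -> the three columns are linearly dependent
```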

1.4 Basis, dimension, and coordinate representations

A basis for a vector space V is a linearly independent set that spans V.

The dimension dim(V) is the number of vectors in any basis (well-defined: a theorem guarantees every basis of V has the same number of elements).

Once a basis {b₁, …, bₙ} is chosen, every vector v ∈ V has unique coordinates (coefficients) with respect to that basis:

v = x₁ b₁ + … + xₙ bₙ → v has coordinate vector [x₁ … xₙ]ᵀ

Change of basis Different bases give different coordinate representations of the same vector — crucial for understanding why neural networks are invariant under certain reparameterizations.

Standard basis in ℝⁿ: e₁ = (1,0,…,0), e₂ = (0,1,0,…), etc.

ML examples

  • Token embeddings live in ℝ^{d_model} with respect to the learned basis (the embedding matrix rows/columns).

  • Principal components define a new orthonormal basis aligned with maximum variance.

1.5 Normed vector spaces: Euclidean (L₂), L₁, L∞ norms in ML

A norm ‖·‖ is a function that measures vector “length” satisfying positivity, homogeneity, and triangle inequality.

Most important norms in ML:

Norm | Formula | Geometric shape (unit ball, 2D) | ML usage examples
L₂ (Euclidean) | ‖v‖₂ = √(∑ vᵢ²) | Circle | Default distance, weight decay, gradient norm, embeddings
L₁ (Manhattan) | ‖v‖₁ = ∑ |vᵢ| | Diamond | Sparsity-inducing regularization (lasso)
L∞ (max norm) | ‖v‖∞ = maxᵢ |vᵢ| | Square | Bounding worst-case (per-coordinate) perturbations
L₀ “norm” | number of non-zero entries | — | Counting sparsity (not a true norm)

Common ML operations

  • Gradient clipping: ‖∇θ‖₂ ≤ C → scale if larger.

  • Cosine similarity uses L₂-normalized vectors.

  • Weight decay: adds λ ‖θ‖₂² to loss → encourages small L₂ norm.
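These norms and the clipping rule above can be sketched in a few lines of numpy (the vector and clipping threshold are illustrative):

```python
import numpy as np

v = np.array([3.0, -4.0])

l2 = np.linalg.norm(v)             # sqrt(9 + 16) = 5.0
l1 = np.linalg.norm(v, 1)          # |3| + |-4| = 7.0
linf = np.linalg.norm(v, np.inf)   # max(|3|, |-4|) = 4.0

def clip_by_norm(g, C=1.0):
    """Gradient clipping by L2 norm: rescale only if the norm exceeds C."""
    n = np.linalg.norm(g)
    return g * (C / n) if n > C else g

g = clip_by_norm(v, C=1.0)
print(l2, l1, linf)                # 5.0 7.0 4.0
print(np.linalg.norm(g))           # clipped norm is exactly C = 1.0
```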

1.6 Inner products, dot product, cosine similarity, and angle

An inner product ⟨u, v⟩ is a scalar-valued form that is symmetric, linear in each argument, and positive definite (⟨v, v⟩ > 0 for v ≠ 0).

Standard dot product in ℝⁿ: ⟨u, v⟩ = uᵀv = ∑ uᵢ vᵢ

Cauchy–Schwarz inequality |⟨u, v⟩| ≤ ‖u‖₂ ‖v‖₂ → defines angle θ between vectors: cos θ = ⟨u, v⟩ / (‖u‖₂ ‖v‖₂)

Cosine similarity cos θ ∈ [−1, 1] → most common similarity measure in NLP/CV

  • 1: identical direction

  • 0: orthogonal (no correlation)

  • −1: opposite direction

ML usage

  • Scaled dot-product attention: softmax(QKᵀ / √d) scores token pairs by dot products of query and key vectors, divided by √d to control their variance.

  • Contrastive learning (SimCLR, CLIP): maximize cosine similarity of positive pairs, minimize for negatives.
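The three reference cases of cosine similarity can be verified directly (a minimal sketch; `cosine_similarity` is an illustrative helper, not a library function):

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = <u, v> / (||u|| ||v||)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 0.0])
same = cosine_similarity(u, np.array([2.0, 0.0]))    # 1.0: identical direction
orth = cosine_similarity(u, np.array([0.0, 3.0]))    # 0.0: orthogonal
oppo = cosine_similarity(u, np.array([-1.0, 0.0]))   # -1.0: opposite direction
print(same, orth, oppo)
```

Note that the value depends only on direction: scaling either vector leaves the similarity unchanged.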

1.7 Orthogonality, orthonormal bases, and Gram-Schmidt process

Two vectors are orthogonal if ⟨u, v⟩ = 0.

A set is orthonormal if every pair is orthogonal and each has unit norm (‖vᵢ‖₂ = 1).

Gram-Schmidt process Takes any linearly independent set {v₁, …, vₖ} and produces an orthonormal basis {q₁, …, qₖ} for the same span:

u₁ = v₁, q₁ = u₁ / ‖u₁‖
u₂ = v₂ − proj_{q₁} v₂, q₂ = u₂ / ‖u₂‖
…
uⱼ = vⱼ − ∑_{i=1}^{j−1} proj_{qᵢ} vⱼ, qⱼ = uⱼ / ‖uⱼ‖

where proj_q v = ⟨v, q⟩ q

ML relevance

  • Orthogonal initialization (e.g., orthogonal random matrices) preserves gradient variance in deep nets.

  • QR decomposition (built on Gram-Schmidt ideas) used in some optimizers and low-rank approximations.
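The update rule above translates almost line-for-line into numpy. A minimal sketch of classical Gram-Schmidt (the input vectors are illustrative; it assumes the inputs are linearly independent):

```python
import numpy as np

def gram_schmidt(vectors):
    """Classical Gram-Schmidt: orthonormalize a linearly independent list."""
    qs = []
    for v in vectors:
        u = v.astype(float).copy()
        for q in qs:
            u -= np.dot(v, q) * q        # subtract projection onto earlier q's
        qs.append(u / np.linalg.norm(u)) # normalize to unit length
    return np.array(qs)

Q = gram_schmidt([np.array([1.0, 1.0, 0.0]),
                  np.array([1.0, 0.0, 1.0])])
# Orthonormal rows: Q Q^T should be the 2x2 identity
print(np.allclose(Q @ Q.T, np.eye(2)))   # True
```

In floating point, the modified Gram-Schmidt variant (projecting the running residual instead of the original vector) is more stable; libraries typically use Householder-based QR instead.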

1.8 Vector spaces over real and complex numbers (why complex numbers appear in spectral methods)

Most ML uses real vector spaces ℝⁿ.

Complex vector spaces ℂⁿ appear when:

  • Fourier transforms (real signals → complex coefficients)

  • Spectral methods: eigenvalues of non-symmetric matrices can be complex

  • Unitary/orthogonal groups in equivariant networks

  • Quantum machine learning and certain physics-informed models

  • Diagonalization of circulant matrices (convolutions) via DFT → complex eigenvectors (Fourier modes)

Important note Even when matrices are real, their eigenvalues and eigenvectors can be complex (they come in conjugate pairs). → Hermitian (complex) or symmetric (real) matrices guarantee real eigenvalues → preferred in spectral normalization, NTK analysis, etc.

Example The (normalized) DFT matrix is unitary (the complex analogue of orthogonal) → preserves norms → important for efficient FFT-based attention approximations.
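This is easy to check numerically: build the DFT matrix with the 1/√n normalization and confirm FᴴF = I and norm preservation (a minimal sketch; n = 8 is illustrative):

```python
import numpy as np

n = 8
rows, cols = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
# Normalized DFT matrix: F[j, k] = exp(-2*pi*i*j*k/n) / sqrt(n)
F = np.exp(-2j * np.pi * rows * cols / n) / np.sqrt(n)

# Unitary: conjugate transpose is the inverse
print(np.allclose(F.conj().T @ F, np.eye(n)))                 # True
# Consequence: multiplying by F preserves L2 norms
x = np.arange(n, dtype=float)
print(np.isclose(np.linalg.norm(F @ x), np.linalg.norm(x)))   # True
```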

With these foundations mastered, the next chapter will show how matrices represent linear transformations on these vector spaces — exactly how fully-connected layers, attention, and convolutions operate in neural networks.

2.1 Limits, continuity, and differentiability

Limits tell us what value a function approaches as the input gets close to a specific point.

Univariate limit example: Consider f(x) = (x² − 1)/(x − 1) As x approaches 1, direct substitution gives 0/0 (undefined), but we can simplify: f(x) = (x − 1)(x + 1)/(x − 1) = x + 1 (for x ≠ 1) So lim_{x→1} f(x) = 2

Continuity at a point x₀: Three conditions must hold:

  1. f(x₀) is defined

  2. lim_{x→x₀} f(x) exists

  3. lim_{x→x₀} f(x) = f(x₀)

Differentiability at x₀ requires the derivative limit to exist: f'(x₀) = lim_{h→0} [f(x₀ + h) − f(x₀)] / h

Classic non-differentiable but continuous example: f(x) = |x| At x = 0: Right derivative = lim_{h→0⁺} ( |h| − 0 )/h = 1 Left derivative = lim_{h→0⁻} ( |h| − 0 )/h = −1 Derivatives don’t match → not differentiable at 0 (sharp corner), but continuous everywhere.

In machine learning, most loss functions (MSE, cross-entropy, etc.) are differentiable almost everywhere, except possibly at a few points (e.g., ReLU has a kink at 0, but subgradients handle it).

2.2 Derivatives, tangent lines, and directional derivatives

Derivative f'(x) = instantaneous rate of change = slope of the tangent line.

Tangent line equation at point (x₀, f(x₀)): y = f(x₀) + f'(x₀) (x − x₀)

Numerical example: f(x) = x³ − 2x + 1 f'(x) = 3x² − 2 At x₀ = 1: f(1) = 1 − 2 + 1 = 0, f'(1) = 3 − 2 = 1 Tangent line: y = 0 + 1·(x − 1) = x − 1

Directional derivative preview (full treatment in 2.3): In 1D, only two directions exist: positive (along +1) or negative (along −1). The derivative f'(x₀) is the directional derivative in the +x direction.

2.3 Partial derivatives and the gradient vector ∇f

For a function of several variables, e.g., f(x, y) = x²y + 3xy³ + sin(x)

Partial with respect to x (treat y constant): ∂f/∂x = 2xy + 3y³ + cos(x)

Partial with respect to y (treat x constant): ∂f/∂y = x² + 9xy²

Gradient vector: ∇f(x,y) = ( ∂f/∂x , ∂f/∂y ) = (2xy + 3y³ + cos(x), x² + 9xy²)

Geometric interpretation:

  • Direction of ∇f = direction of fastest increase of f

  • Magnitude ‖∇f‖ = steepest rate of increase

  • Therefore −∇f = direction of fastest decrease (used in gradient descent)

Directional derivative in direction of unit vector u = (u₁, u₂): D_u f = ∇f · u = (∂f/∂x) u₁ + (∂f/∂y) u₂

Maximum value occurs when u points exactly along ∇f: max D_u f = ‖∇f‖ Minimum (steepest descent) = −‖∇f‖

Machine learning example — linear regression loss J(w, b) = (1/2) Σ (yᵢ − (w xᵢ + b))² (dropping 1/n for simplicity)

∂J/∂w = Σ (w xᵢ + b − yᵢ) xᵢ ∂J/∂b = Σ (w xᵢ + b − yᵢ)

∇J = [ Σ(errorᵢ · xᵢ) , Σ(errorᵢ) ]ᵀ This vector tells us exactly how much to change w and b to reduce loss fastest.
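Following these two partials exactly, a minimal numpy sketch of gradient descent on this loss (the synthetic noiseless data, step size, and iteration count are illustrative):

```python
import numpy as np

# Fit y = w*x + b by descending J(w, b) = (1/2) * sum((y_i - (w*x_i + b))^2)
rng = np.random.default_rng(0)
x = rng.standard_normal(50)
y = 2.0 * x + 1.0                  # noiseless data: exact solution w = 2, b = 1

w, b, eta = 0.0, 0.0, 0.01
for _ in range(2000):
    err = w * x + b - y            # error_i = (w x_i + b) - y_i
    grad_w = np.sum(err * x)       # dJ/dw = sum(error_i * x_i)
    grad_b = np.sum(err)           # dJ/db = sum(error_i)
    w -= eta * grad_w
    b -= eta * grad_b

print(round(w, 4), round(b, 4))    # converges to (2.0, 1.0)
```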

2.4 Higher-order derivatives: second derivatives, curvature intuition

Second partial derivatives:

For f(x,y): ∂²f/∂x² , ∂²f/∂y² , ∂²f/∂x∂y = ∂²f/∂y∂x (equality holds if f is twice continuously differentiable)

Hessian matrix (2×2 case):
H = | ∂²f/∂x²    ∂²f/∂x∂y |
    | ∂²f/∂y∂x   ∂²f/∂y²  |

Curvature interpretation:

  • If H is positive definite (both eigenvalues > 0) → local minimum (curves upward in all directions)

  • If negative definite (both < 0) → local maximum

  • If indefinite (one positive, one negative eigenvalue) → saddle point

  • If singular (zero eigenvalue) → flat direction or inflection

Simple example — quadratic bowl
f(x,y) = x² + 4y²
∇f = (2x, 8y)
H = | 2 0 |
    | 0 8 |
→ positive definite → minimum at (0,0)

Ill-conditioned example — Rosenbrock function
f(x,y) = 100(y − x²)² + (1 − x)²
At the minimum (1,1):
H = |  802 −400 |
    | −400  200 |
Eigenvalues ≈ 1001.6 and ≈ 0.4 → condition number ≈ 2500
Very narrow valley → gradient descent takes tiny steps or oscillates.
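As a sanity check, we can form the analytic Hessian of the Rosenbrock function at (1, 1) and compute its spectrum with numpy (a minimal sketch):

```python
import numpy as np

# f(x, y) = 100(y - x^2)^2 + (1 - x)^2, second partials at the minimum (1, 1):
#   f_xx = -400(y - x^2) + 800x^2 + 2 = 802
#   f_xy = -400x = -400
#   f_yy = 200
H = np.array([[802.0, -400.0],
              [-400.0, 200.0]])

eigvals = np.linalg.eigvalsh(H)       # ascending order for symmetric matrices
cond = eigvals[-1] / eigvals[0]       # ratio of extreme curvatures
print(eigvals)                        # approx [0.4, 1001.6]
print(round(cond))                    # 2508
```

The huge ratio between the two curvatures is exactly why plain gradient descent crawls along the valley floor.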

2.5 Chain rule in single-variable and multivariable settings

Single variable: z = f(g(x)) dz/dx = f'(g(x)) · g'(x)

Multivariable – scalar outer function: z = f(u,v), u = g(x,y), v = h(x,y) ∂z/∂x = (∂f/∂u)(∂u/∂x) + (∂f/∂v)(∂v/∂x) ∂z/∂y = (∂f/∂u)(∂u/∂y) + (∂f/∂v)(∂v/∂y)

Vector form (most relevant to backpropagation): Let θ ∈ ℝᵈ, let z = f(g(θ)) where g: ℝᵈ → ℝᵏ Then ∇_θ z = J_g(θ)ᵀ · ∇_g z (J_g is the k×d Jacobian matrix of g)

This matrix multiplication form is exactly how gradients flow backward through layers in neural networks.

Tiny example: Let z = sin(u) + u², u = x² + y ∂z/∂x = (cos(u) + 2u) · (2x) ∂z/∂y = (cos(u) + 2u) · 1

2.6 Taylor series expansions (first-, second-, and higher-order)

Univariate Taylor expansion around x₀: f(x) ≈ f(x₀) + f'(x₀)(x − x₀) + (1/2)f''(x₀)(x − x₀)² + (1/6)f'''(x₀)(x − x₀)³ + ...

Multivariate Taylor expansion around θ₀:
f(θ) ≈ f(θ₀) + ∇f(θ₀)ᵀ (θ − θ₀) + (1/2) (θ − θ₀)ᵀ H_f(θ₀) (θ − θ₀) + higher-order remainder

Example – univariate: f(x) = e^x around x=0 f(0)=1, f'(0)=1, f''(0)=1, f'''(0)=1, ... e^x ≈ 1 + x + (1/2)x² + (1/6)x³ + ...

Example – quadratic approximation: f(x,y) = x² + xy + 3y² around (0,0)
f(0,0) = 0
∇f = (2x + y, x + 6y) → ∇f(0,0) = (0,0)
H = | 2 1 |
    | 1 6 |
So f(x,y) ≈ (1/2) [x y] H [x y]ᵀ = (1/2)(2x² + 2xy + 6y²) = x² + xy + 3y²

(The quadratic term exactly matches the function since it is already quadratic.)

2.7 Linear and quadratic approximations of functions

Linear (first-order) approximation: f(θ) ≈ f(θ₀) + ∇f(θ₀)ᵀ (θ − θ₀) This is the equation of the tangent hyperplane.

Use in optimization: Gradient descent follows this linear piece: move in direction −∇f.

Quadratic (second-order) approximation: f(θ) ≈ f(θ₀) + ∇fᵀ Δθ + (1/2) Δθᵀ H Δθ (where Δθ = θ − θ₀)

Use in optimization:

  • Newton’s method finds the minimum of this quadratic approximation → sets gradient of quadratic to zero → Δθ = −H⁻¹ ∇f

  • Trust-region methods stay inside a region where the quadratic approximation is trusted

  • Explains adaptive step sizes: large curvature (large eigenvalue) → small step in that direction

Summary table of approximations

Order | Approximation formula | Geometric meaning | Optimization use case
0th | f(θ) ≈ f(θ₀) | Constant value | Zeroth-order methods
1st | f(θ) ≈ f(θ₀) + ∇fᵀ Δθ | Tangent hyperplane | Gradient descent family
2nd | f(θ) ≈ f(θ₀) + ∇fᵀ Δθ + (1/2) Δθᵀ H Δθ | Local paraboloid | Newton, quasi-Newton, trust-region

3. Vector and Matrix Calculus Essentials for Machine Learning

This chapter focuses on the calculus tools that are used most frequently when deriving gradients and Hessians in neural networks, loss functions, and optimization algorithms. We emphasize notation conventions used in modern ML frameworks and provide concrete examples for the most common components.

3.1 Gradient of scalar-valued functions: ∇_θ J(θ)

Most machine learning objectives are scalar-valued functions J: ℝᵈ → ℝ, where θ ∈ ℝᵈ is the vector of all trainable parameters (weights, biases, etc.).

Definition The gradient ∇_θ J(θ) is the column vector of all partial derivatives:

∇_θ J(θ) = ⎡ ∂J/∂θ₁ ⎤
           ⎢ ∂J/∂θ₂ ⎥
           ⎢    ⋮    ⎥
           ⎣ ∂J/∂θ_d ⎦

Interpretation in ML

  • Direction of steepest ascent of the loss

  • −∇_θ J(θ) is the direction of steepest descent

  • Backpropagation computes exactly this vector efficiently

Concrete example — logistic regression binary cross-entropy Let σ(z) = 1/(1 + e⁻ᶻ) be the sigmoid. For one sample: J(θ) = − [ y log σ(θᵀx) + (1−y) log(1 − σ(θᵀx)) ]

Gradient w.r.t. θ: ∇_θ J = (σ(θᵀx) − y) ⋅ x

For n samples (vector form): ∇_θ J(θ) = (1/n) Xᵀ (σ(Xθ) − y) where X is n×d feature matrix, y is n×1 label vector.

This is the form almost every first-order optimizer receives.
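The closed-form gradient (σ(θᵀx) − y)·x can be verified against central finite differences, a standard sanity check when deriving gradients by hand (a minimal sketch; the parameter and feature values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(theta, x, y):
    # binary cross-entropy for a single sample
    p = sigmoid(theta @ x)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

theta = np.array([0.5, -0.3, 0.1])
x = np.array([1.0, 2.0, -1.0])
y = 1.0

analytic = (sigmoid(theta @ x) - y) * x   # closed-form gradient from the text

# central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.array([
    (loss(theta + eps * e, x, y) - loss(theta - eps * e, x, y)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(analytic, numeric, atol=1e-6))   # True
```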

3.2 Jacobian matrix (gradients of vector-valued functions)

When the function outputs a vector, f: ℝᵈ → ℝᵏ, we use the Jacobian matrix J_f ∈ ℝ^{k × d}:

(J_f)_{ij} = ∂fᵢ / ∂θⱼ

Important cases in ML

  1. Layer output before activation: h = Wx + b → Jacobian w.r.t. input x is W

  2. Softmax output: p = softmax(z), z ∈ ℝᶜ (logits) Jacobian ∂p/∂z ∈ ℝ^{C×C} is diag(p) − p pᵀ (very important for cross-entropy gradient)

Chain rule in vector form If scalar loss ℓ depends on vector a, and a = f(θ), then ∇_θ ℓ = J_f(θ)ᵀ ⋅ ∇_a ℓ

This transpose-Jacobian-vector-product form is the mathematical basis of reverse-mode automatic differentiation (backpropagation).
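The softmax Jacobian diag(p) − ppᵀ from case 2 above is easy to confirm numerically (a minimal sketch; the test logits are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())      # shift by max for numerical stability
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5])
p = softmax(z)
J_analytic = np.diag(p) - np.outer(p, p)   # diag(p) - p p^T

# central finite differences of softmax, column j = dp/dz_j
eps = 1e-6
J_numeric = np.array([
    (softmax(z + eps * e) - softmax(z - eps * e)) / (2 * eps)
    for e in np.eye(3)
]).T
print(np.allclose(J_analytic, J_numeric, atol=1e-6))   # True
```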

3.3 Hessian matrix H = ∇²J(θ): definition, symmetry, interpretation as curvature

Definition The Hessian of a scalar function J is the square symmetric matrix of second partial derivatives:

H_{ij} = ∂²J / ∂θᵢ ∂θⱼ

H = ∇_θ (∇_θ J(θ))ᵀ (gradient of the gradient)

Properties

  • Symmetric: Hᵀ = H (when J is twice continuously differentiable)

  • H is the Jacobian of the gradient vector

Curvature interpretation For small step δ: J(θ + δ) ≈ J(θ) + ∇Jᵀ δ + (1/2) δᵀ H δ

  • If H is positive definite → local minimum (loss curves upward in every direction)

  • If H has negative eigenvalues → saddle or maximum in those directions

  • Large |eigenvalue| → strong curvature → sensitive direction

  • Small eigenvalue → flat direction → ill-conditioning

ML reality In deep networks, the Hessian is almost never computed fully (O(d²) memory and time). Instead we use:

  • Diagonal approximation

  • Kronecker-factored approximations (K-FAC)

  • Hessian-vector products (via forward-over-reverse or reverse-over-forward AD)

3.4 Gradient and Hessian of common ML building blocks

Here are the derivatives needed in almost every modern network.

A. Linear layer z = Wx + b (W ∈ ℝ^{m×n}, x ∈ ℝⁿ, b,z ∈ ℝᵐ)

∇_W J = (∇_z J) xᵀ ∇_b J = ∇_z J ∇_x J = Wᵀ (∇_z J)

B. Element-wise activation (ReLU, sigmoid, tanh, GELU, etc.) Let a = σ(z), σ applied elementwise Then ∇_z J = diag(σ'(z)) ⋅ ∇_a J

Examples:

  • ReLU: σ'(z) = 1 if z > 0 else 0

  • Sigmoid: σ'(z) = σ(z)(1 − σ(z))

  • Tanh: σ'(z) = 1 − tanh²(z)

C. Squared Euclidean norm (regularization term) R(θ) = (1/2) ‖θ‖₂² = (1/2) θᵀ θ ∇_θ R = θ H_R = I (identity matrix)

D. Cross-entropy + softmax (single example) J = − ∑ y_k log p_k , p = softmax(z) ∇_z J = p − y (beautiful simplification!)

Hessian of cross-entropy w.r.t. logits z: H = diag(p) − p pᵀ (This matrix is positive semi-definite with rank at most C−1.)

E. L2 regularization on weights only Common form: J_total = J_data + (λ/2) ∑_{layers} ‖Wₗ‖₂² → adds λ I to the diagonal blocks of the Hessian corresponding to each weight matrix.

3.5 Useful identities: matrix calculus cheat-sheet

Important conventions

  • Most ML papers and frameworks use denominator layout (gradient is column vector, Jacobian has output dimension first).

  • Some older control theory / statistics literature uses numerator layout (gradient is row vector).

Key identities (denominator layout – standard in PyTorch/JAX/TensorFlow)

  1. Linear: ∂/∂X (aᵀ X b) = a bᵀ

  2. Quadratic form: ∂/∂x (xᵀ A x) = (A + Aᵀ) x If A symmetric → 2 A x

  3. Trace trick (very powerful): ∂/∂A tr(A B) = Bᵀ ∂/∂A tr(A B Aᵀ C) = Cᵀ A Bᵀ + C A B

  4. Frobenius inner product: ⟨A, B⟩ = tr(Aᵀ B) = vec(A)ᵀ vec(B) → for constant B, ∂/∂A ⟨A, B⟩ = B

  5. Matrix-vector product: ∂/∂x (A x) = Aᵀ

  6. Hadamard (elementwise) product: ∂/∂x (a ⊙ x) = diag(a)

  7. Softmax Jacobian: ∂p/∂z = diag(p) − p pᵀ

Quick reference table

Expression | Derivative w.r.t. variable | Result
aᵀ X b | X | a bᵀ
xᵀ A x | x | (A + Aᵀ) x
tr(Xᵀ A X B) | X | Aᵀ X Bᵀ + A X B
‖X‖_F² = tr(Xᵀ X) | X | 2 X
log det(X) | X | X⁻ᵀ
cross-entropy with softmax(z) | z | softmax(z) − y
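Any of these identities can be checked by finite differences. A minimal sketch for the quadratic-form rule ∂/∂x (xᵀAx) = (A + Aᵀ)x (the random matrix and vector are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))   # deliberately non-symmetric
x = rng.standard_normal(4)

def f(v):
    return v @ A @ v              # scalar quadratic form

# central finite differences, coordinate by coordinate
eps = 1e-6
numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(4)])
analytic = (A + A.T) @ x          # the cheat-sheet identity
print(np.allclose(numeric, analytic, atol=1e-5))   # True
```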

3.6 Automatic differentiation vs symbolic vs finite differences

Three main ways to compute derivatives in practice:

Method | How it works | Precision | Speed | Memory | Use-case in ML
Finite differences | (f(θ + εeᵢ) − f(θ))/ε ≈ ∂f/∂θᵢ | Low (rounding error, cancellation) | Slow (d forward passes) | Low | Debugging, black-box, zeroth-order opt.
Symbolic | Manipulates expression algebraically | Exact | Slow for large expressions | High during derivation | Small models, theoretical work
Automatic (AD) | Builds computational graph, applies chain rule | Machine precision | Fast (one forward + one backward) | Moderate–high | Virtually all deep learning today

Modes of AD

  • Forward mode: good when #inputs ≪ #outputs (e.g., Hessian-vector products)

  • Reverse mode (backpropagation): good when #outputs ≪ #inputs (scalar loss, millions of parameters) → dominant in DL

Practical comparison example Suppose θ has 10 million parameters, J is scalar loss.

  • Finite differences → ~10 million forward passes → impossible

  • Symbolic → expression blowup → impossible

  • Reverse-mode AD → 1 forward + 1 backward pass → feasible (what PyTorch .backward() does)

JAX / PyTorch nuance Both support higher-order derivatives via nested autodiff: composing grad with a Jacobian-vector product (jvp) yields Hessian-vector products (forward-over-reverse) without ever materializing the Hessian.

This chapter gives you the core mathematical vocabulary and identities needed to understand and derive gradients/Hessians for almost any neural architecture. The next chapters apply these tools to concrete optimization algorithms.

4. First-Order Optimization: Gradient-Based Methods

First-order methods rely solely on the gradient ∇J(θ) (or stochastic estimates thereof) to update parameters. They dominate deep learning because they scale well to high dimensions (millions to trillions of parameters) thanks to automatic differentiation and mini-batching.

4.1 Gradient descent (GD): vanilla / batch gradient descent

Vanilla / Batch Gradient Descent (GD) Update rule (deterministic, full-batch):

θ_{t+1} = θ_t − η ∇_θ J(θ_t)

where η > 0 is the learning rate (step size).

Algorithm sketch:

  1. Initialize θ₀

  2. For t = 0, 1, 2, … until convergence:

    • Compute full gradient ∇J(θ_t) over entire dataset

    • θ_{t+1} ← θ_t − η ∇J(θ_t)

Pros: Smooth convergence path, exact gradient Cons: Computationally prohibitive for large datasets (one pass = one epoch)

In practice, we almost always use mini-batch or stochastic variants (covered in Chapter 5).

Mathematical example — quadratic bowl J(θ) = (1/2) θᵀ A θ, A positive definite. Any fixed step η < 2/λ_max(A) converges; the optimal fixed step is η = 2/(λ_min(A) + λ_max(A)), giving a contraction factor of (κ − 1)/(κ + 1) per iteration, where κ = λ_max/λ_min.
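A minimal sketch of full-batch GD on such a quadratic bowl (the diagonal A, starting point, and iteration count are illustrative; 2/(λ_min + λ_max) is the classical optimal fixed step for quadratics):

```python
import numpy as np

# J(theta) = (1/2) theta^T A theta, with A positive definite
A = np.diag([1.0, 10.0])           # lambda_min = 1, lambda_max = 10
eta = 2.0 / (1.0 + 10.0)           # fixed step 2/(lambda_min + lambda_max)

theta = np.array([1.0, 1.0])
for _ in range(100):
    theta = theta - eta * (A @ theta)   # gradient of J is A @ theta

print(np.linalg.norm(theta) < 1e-6)     # True: converged to the minimizer at 0
```

With this step both coordinates contract by the same factor (κ − 1)/(κ + 1) = 9/11 per iteration, which is exactly why it is optimal for a fixed step size.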

4.2 Convergence analysis: Lipschitz smooth & strongly convex functions

We analyze convergence under standard assumptions used in optimization theory.

Key assumptions

  1. L-smooth (Lipschitz continuous gradient): ‖∇J(θ) − ∇J(θ')‖ ≤ L ‖θ − θ'‖ → J has bounded curvature → descent lemma: J(θ − η ∇J(θ)) ≤ J(θ) − η ‖∇J(θ)‖² + (L η² / 2) ‖∇J(θ)‖²

  2. μ-strongly convex (μ > 0): J(θ') ≥ J(θ) + ∇J(θ)ᵀ (θ' − θ) + (μ/2) ‖θ' − θ‖² → Unique minimizer θ*, and J(θ) ≥ J(θ*) + (μ/2) ‖θ − θ*‖²

Convergence rates for gradient descent (fixed step size η ≤ 1/L)

Function class | Rate on J(θ_t) − J^* | Rate on ‖θ_t − θ^*‖² | Typical η choice | Notes
L-smooth convex (μ = 0) | O(1/t) | O(1/t) | η = 1/L | Sublinear; accelerated methods improve this to O(1/t²)
L-smooth + μ-strongly convex | Linear: O((1 − μ/L)^t) | Linear: O((1 − μ/L)^t) | η = 2/(μ + L) or 1/L | Condition number κ = L/μ hurts the rate

Proof intuition for the smooth convex case: the descent lemma with η ≤ 1/L gives J(θ_{t+1}) ≤ J(θ_t) − (η/2) ‖∇J(θ_t)‖². Combining this with the convexity bound J(θ_t) − J^* ≤ ∇J(θ_t)ᵀ (θ_t − θ^*) and summing the resulting telescoping inequalities yields J(θ_t) − J^* = O(1/t).

Strongly convex case achieves linear (exponential) convergence — very fast once close to optimum.

In deep learning, functions are non-convex, so we get convergence to stationary points (‖∇J‖ → 0), but rates are more subtle (see Chapter 7).

4.3 Step-size selection: fixed, diminishing, backtracking line search

Fixed step size η constant (e.g., 0.001). Simple but requires careful tuning.

Diminishing step size (Robbins–Monro conditions for stochastic case): ∑ η_t = ∞ and ∑ η_t² < ∞ Examples: η_t = c / t, η_t = c / √t → Ensures convergence in expectation for SGD.

Backtracking line search (Armijo condition)
Start with η = η_init.
While J(θ − η ∇J) > J(θ) − c η ‖∇J‖² (c ≈ 10⁻⁴):
  η ← β η (β = 0.5 or 0.8)
→ Adaptive per iteration, guarantees descent.
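A minimal sketch of this loop (the helper name `backtracking_step` and the steep quadratic test function are illustrative):

```python
import numpy as np

def backtracking_step(J, grad, theta, eta0=1.0, beta=0.5, c=1e-4):
    """Shrink eta until the Armijo sufficient-decrease condition holds."""
    g = grad(theta)
    eta = eta0
    while J(theta - eta * g) > J(theta) - c * eta * np.dot(g, g):
        eta *= beta
    return theta - eta * g, eta

J = lambda th: 50.0 * np.dot(th, th)     # steep bowl: J = 50 ||theta||^2
grad = lambda th: 100.0 * th

theta0 = np.array([1.0, 0.0])
theta, eta = backtracking_step(J, grad, theta0)
print(eta)                     # 0.015625: shrunk from 1.0 by six halvings
print(J(theta) < J(theta0))    # True: guaranteed descent
```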

In deep learning, fixed or scheduled η is far more common than line search (too expensive per step).

4.4 Momentum & acceleration: Polyak momentum, Nesterov accelerated gradient (NAG)

Momentum accelerates convergence by adding a velocity term.

Polyak heavy-ball momentum (classical momentum):

v_{t+1} = β v_t − η ∇J(θ_t)
θ_{t+1} = θ_t + v_{t+1}

(β ≈ 0.9)

Nesterov Accelerated Gradient (NAG) — optimal among first-order methods for smooth convex functions:

v_{t+1} = β v_t − η ∇J(θ_t + β v_t) (look-ahead gradient)
θ_{t+1} = θ_t + v_{t+1}

or equivalent form:

y_{t+1} = θ_t + β (θ_t − θ_{t-1})
θ_{t+1} = y_{t+1} − η ∇J(y_{t+1})
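A minimal sketch comparing vanilla GD against Polyak heavy-ball momentum on an ill-conditioned quadratic (the diagonal A with κ = 100, β = 0.9, and η = 1/L are illustrative choices):

```python
import numpy as np

# Ill-conditioned quadratic J(theta) = (1/2) theta^T A theta, kappa = 100
A = np.diag([1.0, 100.0])
eta = 1.0 / 100.0                        # 1/L, L = largest eigenvalue

def gd(theta, steps):
    for _ in range(steps):
        theta = theta - eta * (A @ theta)
    return theta

def heavy_ball(theta, steps, beta=0.9):
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = beta * v - eta * (A @ theta)  # velocity accumulates past gradients
        theta = theta + v
    return theta

theta0 = np.array([1.0, 1.0])
gd_norm = np.linalg.norm(gd(theta0, 200))
hb_norm = np.linalg.norm(heavy_ball(theta0, 200))
print(hb_norm < gd_norm)   # True: momentum makes far more progress
```

GD crawls along the low-curvature axis (contraction 0.99 per step), while momentum damps the oscillations along the steep axis and accelerates the flat one.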

Convergence rates (smooth convex case, optimal step-size):

  • GD: O(1/t)

  • NAG: O(1/t²) — optimal rate for any first-order method on L-smooth convex functions

For strongly convex: NAG achieves O(√(L/μ) log(1/ε)) iterations to reach ε-suboptimality (better dependence on condition number than GD).

In deep learning, momentum (especially Nesterov-style in SGD) helps escape flat regions and dampens oscillations in ill-conditioned landscapes.

4.5 Adaptive gradient methods: AdaGrad, RMSProp, Adam (& variants: AdamW, AMSGrad)

Adaptive methods scale the learning rate per parameter based on historical gradient magnitudes.

AdaGrad (2011):

g_t = g_{t-1} + (∇J_t)² (elementwise)
θ_{t+1} = θ_t − η / (√g_t + ε) ⊙ ∇J_t

→ Good for sparse data, but learning rate shrinks too aggressively (can stop learning).

RMSProp (2012):

v_t = β v_{t-1} + (1−β) (∇J_t)² (β ≈ 0.9)
θ_{t+1} = θ_t − η / (√v_t + ε) ⊙ ∇J_t

→ Replaces AdaGrad's cumulative sum with an exponential moving average, so the effective learning rate no longer decays to zero; a momentum term can be layered on top.

Adam (2014) — most popular:

m_t = β₁ m_{t-1} + (1−β₁) ∇J_t
v_t = β₂ v_{t-1} + (1−β₂) (∇J_t)²
m̂_t = m_t / (1 − β₁ᵗ) (bias correction)
v̂_t = v_t / (1 − β₂ᵗ)
θ_{t+1} = θ_t − η / (√v̂_t + ε) ⊙ m̂_t

Default hyperparameters: β₁=0.9, β₂=0.999, ε=10⁻⁸
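The five update equations translate directly into a small numpy routine. A minimal sketch on a 1-D quadratic (the helper name `adam_step`, the test objective, and the iteration count are illustrative):

```python
import numpy as np

def adam_step(theta, g, m, v, t, eta=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g            # first-moment EMA
    v = b2 * v + (1 - b2) * g**2         # second-moment EMA
    m_hat = m / (1 - b1**t)              # bias correction
    v_hat = v / (1 - b2**t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# minimize J(theta) = theta^2 starting from theta = 1
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 3001):                 # t starts at 1 for bias correction
    g = 2 * theta                        # gradient of theta^2
    theta, m, v = adam_step(theta, g, m, v, t)

print(abs(theta[0]) < 0.05)              # True: settled near the minimum at 0
```

Note the scale invariance: because the update divides m̂ by √v̂, the very first step has magnitude ≈ η regardless of the gradient's scale.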

Variants

  • AdamW (2019): Decouples weight decay from adaptive LR → better generalization θ_{t+1} = θ_t − η ( ∇J_t / (√v̂_t + ε) + λ θ_t ) (weight decay λ outside adaptive term)

  • AMSGrad (2018): Fixes non-convergence issue by using max(v_t) instead of v_t

Recent theoretical results (2024–2025) show Adam can achieve sharper rates than RMSProp in certain regimes, sometimes approaching accelerated rates under specific conditions (e.g., higher-order analysis with momentum-like effects).

4.6 Practical considerations: learning rate schedules, warm-up, gradient clipping

Learning rate schedules (most common in 2025 practice)

  • Step decay: Multiply LR by 0.1 every X epochs

  • Cosine annealing (SGDR / cosine with restarts): η_t = η_min + (1/2)(η_max − η_min)(1 + cos(π t / T))

  • One-cycle policy: Linear increase → cosine decay → very fast convergence

  • Exponential decay: η_t = η₀ ⋅ γ^t

Warm-up (standard since ~2017, still essential in 2025) Start with very small LR (e.g., 1/10 of target), linearly increase to target over W steps (W ≈ 1000–10000). Prevents early instability, exploding gradients, especially with large batches or adaptive optimizers.
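The warm-up-then-cosine pattern above can be sketched as a small schedule function (the specific values eta_max = 3e-4, warmup_steps = 1000, total_steps = 10000 are illustrative, not prescriptive):

```python
import math

def lr_schedule(step, warmup_steps=1000, total_steps=10000,
                eta_max=3e-4, eta_min=3e-5):
    """Linear warm-up to eta_max, then cosine decay down to eta_min."""
    if step < warmup_steps:
        return eta_max * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * progress))

print(lr_schedule(0))        # 0.0: training starts tiny
print(lr_schedule(1000))     # peak eta_max right after warm-up
print(lr_schedule(10000))    # decayed all the way to eta_min
```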

Gradient clipping (by value or by norm) Clip if ‖∇J‖ > C (e.g., C=1.0 or 0.5): ∇J ← C ⋅ ∇J / max(C, ‖∇J‖)

Best practices (2025):

  • Always use warm-up + cosine decay for transformers / large models

  • Gradient clipping by norm (global norm) is default in most libraries

  • Combine with AdamW + weight decay (0.01–0.1)

  • Monitor gradient norm during training — sudden spikes indicate instability

  • For very large models: warm-up over more steps (proportional to model size)

Summary table of popular choices (2025)

Optimizer | Typical LR | Schedule | Warm-up steps | Clipping norm | Weight decay
SGD + Mom. | 0.1–1.0 | Cosine / step | 2000–5000 | 0.5–5.0 | —
Adam / AdamW | 1e-4 – 5e-4 | Cosine annealing | 1000–10000 | 1.0 | 0.01–0.1
Lion | 1e-4 | One-cycle / cosine | Yes | Often none | 0.1

These techniques turn theoretical first-order methods into practical powerhouses for training massive models.

5. Stochastic and Mini-Batch Gradient Methods

While batch gradient descent provides exact gradients, it is computationally infeasible for modern datasets with millions or billions of examples. Stochastic and mini-batch methods approximate the full gradient using subsets of data, enabling faster iterations and scaling to enormous models.

5.1 From batch GD → stochastic GD (SGD)

Batch Gradient Descent (GD) computes the exact gradient over the entire dataset:

∇J(θ) = (1/n) ∑_{i=1}^n ∇_θ ℓ(θ; x_i, y_i)

Stochastic Gradient Descent (SGD) uses a single random example (or a small batch) at each step:

θ_{t+1} = θ_t − η g_t where g_t = ∇_θ ℓ(θ_t; x_{i_t}, y_{i_t}) (i_t chosen uniformly at random)

This is an unbiased estimate: E[g_t | θ_t] = ∇J(θ_t)

Transition intuition:

  • Batch GD: one very accurate but expensive step per epoch.

  • SGD: many cheap, noisy steps per epoch → faster wall-clock progress early on.

  • The noise introduces stochasticity that helps escape local minima/saddles in non-convex settings.

In practice (2025–2026), pure single-example SGD is rare; almost everyone uses mini-batch SGD (batch size 32–8192+ depending on hardware and model).
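A minimal mini-batch SGD loop on a one-parameter least-squares problem (a sketch; the data, step size, and function names are illustrative):

```python
import random

def minibatch_sgd(data, theta0, eta=0.05, batch_size=4, steps=300, seed=0):
    # Mini-batch SGD on J(theta) = (1/n) * sum_i (theta * x_i - y_i)^2.
    rng = random.Random(seed)
    theta = theta0
    for _ in range(steps):
        batch = rng.sample(data, batch_size)
        # Per-example gradient of (theta*x - y)^2 is 2*(theta*x - y)*x;
        # the mini-batch gradient averages it over B examples.
        g = sum(2 * (theta * x - y) * x for x, y in batch) / batch_size
        theta -= eta * g
    return theta

# Toy regression data with true slope 3 (noise-free, for a clean check).
data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0)]
theta_hat = minibatch_sgd(data, theta0=0.0)
```

Each step uses only B of the n examples, so an epoch costs the same total gradient work as one batch-GD step but makes n/B parameter updates.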

5.2 Mini-batch gradient descent: bias-variance trade-off

Mini-batch gradient:

g_t = (1/B) ∑_{j=1}^B ∇_θ ℓ(θ_t; z_{i_j}) (B = mini-batch size)

Bias: Always zero (unbiased estimator of full gradient).

Variance of g_t: Var(g_t) ≈ (1/B) Var(single gradient)

  • Small B → high variance → noisy updates → more exploration, potential escape from sharp minima.

  • Large B → low variance → smoother path, closer to deterministic GD → can converge to sharper minima, sometimes worse generalization.
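The 1/B variance scaling is easy to verify numerically with simulated per-example gradients (a toy model with unit-variance Gaussian "gradients", not a real network):

```python
import random
import statistics

def batch_mean_variance(batch_size, n_trials=5000, seed=0):
    # Each per-example "gradient" is a unit-variance Gaussian draw; the
    # mini-batch gradient is their mean, whose variance should scale as 1/B.
    rng = random.Random(seed)
    means = [
        sum(rng.gauss(0.0, 1.0) for _ in range(batch_size)) / batch_size
        for _ in range(n_trials)
    ]
    return statistics.variance(means)

v1 = batch_mean_variance(1)    # close to 1
v16 = batch_mean_variance(16)  # close to 1/16
```

The ratio v1 / v16 comes out near 16, matching Var(g_t) ≈ (1/B) · Var(single gradient).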

Trade-off summary (empirical & theoretical consensus 2025):

| Batch Size | Gradient Variance | Wall-clock Speed per Epoch | Generalization (typical) | Sharpness of Minima | Common Use Case |
|---|---|---|---|---|---|
| 1 (pure SGD) | Very high | Very fast iterations | Often best | Flattest | Small/medium models, research |
| 32–256 | High | Fast | Strong | Flat | Most vision / early LLM pretraining |
| 1024–8192+ | Low | Slower per iteration but fewer steps needed | Can degrade | Sharper | Large-scale training with scaling rules |
| Global (full batch) | Zero | Extremely slow | Can be worst | Sharpest | Small problems only |

Larger batches require learning rate scaling (often linear or √B) to maintain progress.

5.3 Noise as implicit regularization

The stochasticity in SGD acts as a form of implicit regularization, often leading to better generalization than deterministic methods.

Mechanisms (supported by theory and practice up to 2025–2026):

  1. Preference for flat minima SGD noise perturbs parameters → trajectories favor regions where loss is insensitive to perturbations (flat minima). Sharp minima generalize poorly (high sensitivity); flat minima are more robust → better test performance. Evidence: Increasing batch size sharpens minima unless compensated by noise injection or LR adjustment.

  2. Escape from saddles/plateaus Noise helps SGD escape saddle points faster than GD in high dimensions.

  3. Effective regularization equivalent SGD with small batches behaves similarly to explicit L2 regularization or label smoothing in some regimes.

Modern view (2025): SGD noise aligns with directions of low curvature → implicit bias toward flat, well-generalizing solutions. Recent analyses link this to stability of trajectories under noise covariance.

In large language models, small-batch SGD-like noise (or explicit noise addition) remains beneficial even at scale for generalization.

5.4 Convergence of SGD: Robbins-Monro conditions, sublinear rates

Robbins–Monro conditions (classic 1951, still foundational in 2025–2026 analyses):

For SGD convergence in expectation:

  1. ∑_{t=1}^∞ η_t = ∞ (steps accumulate infinitely → reach anywhere)

  2. ∑_{t=1}^∞ η_t² < ∞ (noise averages out)

Common schedules satisfying this: η_t = c / t^α with 1/2 < α ≤ 1 (e.g., 1/√t or 1/t).

Convergence rates (under assumptions):

  • Convex, L-smooth: J(θ_t) − J^* = O(1/√t) in expectation (optimal for stochastic first-order methods).

  • Strongly convex + smooth: linear convergence possible with diminishing steps or variance reduction (see 5.5).

  • Non-convex: convergence to stationary point E[‖∇J(θ_t)‖²] → 0, typically O(1/√t) rate for function value or gradient norm.

Recent work (2024–2025) relaxes conditions (e.g., local Łojasiewicz inequality for deep nets) and shows almost-sure convergence under milder step-size rules, or exponential convergence in special cases (e.g., separable logistic regression).

5.5 Variance reduction techniques: SVRG, SAGA, SARAH

These methods reduce gradient variance periodically → faster convergence than plain SGD, especially on finite-sum problems (standard in ML: J(θ) = (1/n) ∑ ℓ_i(θ)).

SVRG (Stochastic Variance Reduced Gradient, 2013–2014)

  • Every m iterations: compute full gradient w_t = ∇J(θ̃) (snapshot)

  • Then for inner loop: g_t = ∇ℓ_{i_t}(θ_t) − ∇ℓ_{i_t}(θ̃) + w_t → unbiased, variance reduced (zero when θ_t = θ̃)

Convergence: O((n + κ) log(1/ε)) gradient evaluations for strongly convex problems (vs. plain SGD's sublinear O(1/ε)-type complexity).
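A sketch of the SVRG estimator on a toy finite-sum quadratic (the names and constants are ours). Because every ℓ_i here has identical curvature, the control variate cancels the noise exactly — an extreme illustration of variance reduction:

```python
import random

# Finite-sum objective: J(theta) = (1/n) * sum_i 0.5 * (theta - c_i)^2,
# minimized at theta* = mean(c_i). Per-example gradient: theta - c_i.
cs = [0.0, 1.0, 2.0, 5.0]

def full_grad(theta):
    return sum(theta - c for c in cs) / len(cs)

def svrg(theta0, eta=0.5, epochs=30, m=20, seed=0):
    rng = random.Random(seed)
    theta = theta0
    for _ in range(epochs):
        snapshot = theta
        w = full_grad(snapshot)              # full gradient at the snapshot
        for _ in range(m):
            i = rng.randrange(len(cs))
            # g = grad_i(theta) - grad_i(snapshot) + w
            # (unbiased; variance -> 0 as theta -> snapshot)
            g = (theta - cs[i]) - (snapshot - cs[i]) + w
            theta -= eta * g
    return theta

theta_hat = svrg(theta0=10.0)
```

Here theta_hat converges to mean(c_i) = 2; with heterogeneous per-example curvatures the variance reduction is partial rather than exact, but the same estimator applies.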

SAGA (2014) — an unbiased variant of SAG (Stochastic Average Gradient)

  • Maintain table of past gradients for each example

  • Update: g_t = ∇ℓ_{i_t}(θ_t) − old_grad_{i_t} + (1/n) ∑ all old_grads

  • After update, replace old_grad_{i_t}

Linear convergence for strongly convex, memory O(n d) — practical for medium n.

SARAH (Stochastic Recursive Gradient Algorithm, 2017–2019)

Inner loop: v_t = ∇ℓ_{i_t}(θ_t) − ∇ℓ_{i_t}(θ_{t-1}) + v_{t-1} → recursive variance reduction

Strong convergence guarantees, often best empirical performance among variance-reduced methods.

Recent advances (2024–2025):

  • Proximal/Stochastic proximal point extensions

  • Shuffling variants (e.g., Adjusted Shuffling SARAH) → improved complexity in strongly convex settings

  • Integration with operator splitting and generalized equations

  • Applications to cocoercive variational inequalities and high-dimensional control

In deep learning, these are less common today (due to huge n and non-convexity), but useful for fine-tuning, small-to-medium models, or when full passes are affordable.

5.6 Large-batch training phenomena and scaling rules

Phenomena observed (2017–2025):

  • Large batches → faster hardware utilization but can hurt generalization (sharper minima).

  • Critical Batch Size (CBS): Largest B where increasing B (with LR scaled) does not degrade loss trajectory significantly. Beyond CBS, further increases harm token efficiency or final loss.

Recent findings (2025):

  • CBS starts near 0 at init, grows rapidly early in training, then plateaus.

  • CBS scales with model size and training progress → small runs can predict large-run CBS.

  • Batch size warmup: Start small B, increase as CBS grows → reliable large-batch training (e.g., 43% fewer steps for similar/better loss in OLMo experiments).

  • Scaling laws for batch size: Optimal B_opt and CBS grow with tokens-per-parameter ratio D/N; power-law behavior in many regimes.

  • Weight decay / LR scaling: Optimal λ often linear in B; AdamW timescale τ = B/(η λ D) roughly constant.

Practical scaling rules (2025–2026 large models):

  • Linear LR scaling with B up to CBS, then sub-linear or constant.

  • Use batch size warmup + cosine LR decay.

  • For LLMs: aim for B near/above CBS but use techniques (e.g., ghost memory, gradient accumulation) to fit large effective B.

  • DiLoCo-style low-communication methods scale better with model size than pure data-parallel in some regimes.

Summary table

| Aspect | Small/Medium Batch | Large Batch | Mitigation Strategies |
|---|---|---|---|
| Speed per step | Slower (more steps/epoch) | Faster (fewer steps/epoch) | Gradient accumulation |
| Generalization | Often better | Can degrade | Noise injection, warmup, AdamW |
| Critical Batch Size | — | Exists; grows during training | Measure empirically, warmup B |
| Scaling Rule (LR) | — | √B or linear up to CBS | One-cycle, cosine annealing |

These stochastic techniques bridge theory and massive-scale practice, enabling efficient training of today's frontier models.

6. Second-Order Optimization: Hessian-Aware Methods

Second-order methods incorporate curvature information via the Hessian matrix (or approximations thereof) to make more informed update steps than first-order gradient descent. They can converge much faster near minima (quadratic vs linear/sublinear rates) but are significantly more expensive in computation and memory — especially in deep learning with d = 10⁶–10¹² parameters.

6.1 Newton's method: quadratic local model and update rule

Newton's method approximates the loss J(θ) locally as a quadratic function using the second-order Taylor expansion:

J(θ + δ) ≈ J(θ) + ∇J(θ)ᵀ δ + (1/2) δᵀ H(θ) δ

where H(θ) = ∇²J(θ) is the Hessian.

To minimize this quadratic model, set its gradient w.r.t. δ to zero:

∇J(θ) + H(θ) δ = 0 ⇒ δ = − H(θ)⁻¹ ∇J(θ)

Update rule:

θ_{t+1} = θ_t − H(θ_t)⁻¹ ∇J(θ_t)

Convergence (under strong assumptions):

  • Locally quadratic convergence: error ∝ error² near the minimum (extremely fast once close).

  • Requires H positive definite (or made so via modifications).

Example – quadratic objective J(θ) = (1/2) θᵀ A θ − bᵀ θ Exact minimum in one step if H = A is invertible.
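The one-step claim for quadratics can be checked directly (a dependency-free sketch in 2-D):

```python
# Newton's method on J(theta) = 0.5 * theta^T A theta - b^T theta:
# H = A and grad = A theta - b, so one step lands exactly on theta* = A^{-1} b.

def solve_2x2(M, v):
    # Cramer's rule for a 2x2 system M x = v (keeps the sketch dependency-free).
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(v[0] * M[1][1] - v[1] * M[0][1]) / det,
            (M[0][0] * v[1] - M[1][0] * v[0]) / det]

A = [[4.0, 1.0], [1.0, 3.0]]   # symmetric positive definite Hessian
b = [1.0, 2.0]
theta = [10.0, -10.0]          # arbitrary starting point

grad = [A[0][0] * theta[0] + A[0][1] * theta[1] - b[0],
        A[1][0] * theta[0] + A[1][1] * theta[1] - b[1]]
delta = solve_2x2(A, [-g for g in grad])      # delta = -H^{-1} grad
theta = [theta[0] + delta[0], theta[1] + delta[1]]
```

After this single update, theta equals A⁻¹b = (1/11, 7/11) regardless of the starting point; for non-quadratic objectives the same step only converges quadratically near the minimum.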

Deep learning reality: Computing and inverting the full Hessian (O(d²) storage, O(d³) inversion) is impossible for large models. This motivates all subsequent approximations.

6.2 Modified Newton / damped Newton / Levenberg-Marquardt

Pure Newton steps can overshoot or diverge when far from the minimum (Hessian may not be positive definite, or quadratic model poor).

Damped / Modified Newton:

θ_{t+1} = θ_t − (H(θ_t) + λ I)⁻¹ ∇J(θ_t) (λ ≥ 0)

  • λ = 0 → pure Newton

  • Large λ → gradient descent step (direction −∇J, step size 1/λ)

Levenberg-Marquardt (very popular in non-linear least squares):

Similar damping, but λ chosen adaptively (trust-region like behavior):

  • If step reduces loss enough → decrease λ (trust quadratic model more)

  • If not → increase λ (fall back toward gradient descent)

Advantages:

  • Guarantees descent (positive definite matrix)

  • Smooth transition between Newton and GD

Still requires solving linear system with damped Hessian → expensive for large d.

6.3 Trust-region methods

Instead of damping the Hessian, explicitly restrict step length:

min_δ m(δ) = J(θ) + ∇Jᵀ δ + (1/2) δᵀ H δ subject to ‖δ‖ ≤ Δ (trust-region radius)

Approximate solution (dogleg / Steihaug-CG):

  • Solve subproblem iteratively with conjugate gradient (CG) until hitting boundary or satisfying accuracy

  • Adjust Δ based on how well model predicted actual reduction:

ratio = [J(θ+δ) − J(θ)] / [m(δ) − m(0)]

  • ratio ≈ 1 → good model → increase Δ

  • ratio small/negative → poor model → shrink Δ

Advantages over damped Newton:

  • More principled step-size control

  • Better handling of indefinite Hessians

Disadvantage: Still needs many Hessian-vector products per iteration.

6.4 Quasi-Newton methods: BFGS, L-BFGS (limited-memory version)

Quasi-Newton methods build an approximation B_t ≈ H(θ_t) (or inverse H⁻¹) using only gradient evaluations — no explicit Hessian computation.

BFGS update (Broyden–Fletcher–Goldfarb–Shanno):

B_{t+1} = B_t − (B_t s sᵀ B_t)/(sᵀ B_t s) + (y yᵀ)/(yᵀ s)

where s = θ_{t+1} − θ_t, y = ∇J(θ_{t+1}) − ∇J(θ_t)

Maintains positive definiteness if line search satisfies Wolfe conditions.

L-BFGS (Limited-memory BFGS):

  • Stores only m most recent {s_k, y_k} pairs (m ≈ 5–30)

  • Approximates inverse Hessian via two-loop recursion (very memory efficient)

  • Update: θ_{t+1} = θ_t − H_t⁻¹ ∇J(θ_t) (implicit multiplication, no matrix storage)

Convergence: Superlinear in convex smooth case; often very good in practice even for non-convex deep nets.

Use in deep learning: L-BFGS is occasionally used for fine-tuning small-to-medium models or when full-batch gradients are feasible.
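A minimal two-loop recursion sketch (NumPy; assumes each stored pair satisfies the curvature condition sᵀy > 0, as a Wolfe line search guarantees — the example pairs below are hypothetical):

```python
import numpy as np

def lbfgs_two_loop(grad, s_list, y_list):
    # L-BFGS two-loop recursion: returns (approx. inverse Hessian) @ grad
    # using only the stored pairs s_k = theta_{k+1} - theta_k,
    # y_k = grad_{k+1} - grad_k (oldest first in the lists).
    q = grad.astype(float).copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):
        rho = 1.0 / (y @ s)
        alpha = rho * (s @ q)
        q -= alpha * y
        alphas.append(alpha)
    # Initial guess H0 = gamma * I with gamma = (s^T y) / (y^T y) for the
    # most recent pair (a standard scaling choice).
    gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
    r = gamma * q
    for (s, y), alpha in zip(zip(s_list, y_list), reversed(alphas)):
        rho = 1.0 / (y @ s)
        beta = rho * (y @ r)
        r += (alpha - beta) * s
    return r

# Two hypothetical curvature pairs (each satisfies s^T y > 0):
s_hist = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
y_hist = [np.array([2.0, 1.0]), np.array([1.0, 3.0])]
# The implicit approximation H satisfies the secant condition H @ y_last = s_last,
# which we can check by feeding y_last in as the "gradient":
direction = lbfgs_two_loop(np.array([1.0, 3.0]), s_hist, y_hist)
```

Only the m stored pairs and a few dot products are needed per call — no d×d matrix is ever formed.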

6.5 Natural gradient descent & Fisher information matrix approximation

Natural gradient (Amari 1998):

θ_{t+1} = θ_t − η F(θ_t)⁻¹ ∇J(θ_t)

where F(θ) is the Fisher information matrix:

F_{ij} = E_{p(data|θ)} [ (∂ log p / ∂θ_i) (∂ log p / ∂θ_j) ]

Interpretation:

  • Pre-conditions gradient using the local geometry of the statistical manifold (KL-divergence metric).

  • Invariant under reparameterization (unlike vanilla gradient).

In practice for neural nets: F ≈ empirical Fisher = (1/n) ∑ (∇_θ log p(y_i|x_i,θ)) (∇_θ log p(y_i|x_i,θ))ᵀ → outer product of per-example score vectors.

KFAC (below) is a popular structured approximation to F.

Pros: Better-conditioned updates, sometimes faster convergence. Cons: Still expensive unless approximated.

6.6 Hessian-free optimization (Martens-style, CG solver)

Idea (Martens 2010): Never form the full Hessian — only compute Hessian-vector products Hv via automatic differentiation (double backprop or forward-over-reverse).

Update: Solve H δ = −∇J approximately using conjugate gradient (CG):

δ ≈ argmin_δ (1/2) δᵀ H δ + ∇Jᵀ δ

CG needs only ~10–50 Hv products per outer iteration (much cheaper than forming H).

Damping: Add λ I inside CG loop to stabilize.

Variants:

  • Hessian-free with CG + damping + line search

  • Used successfully in early deep learning (2010–2015) before Adam dominance

Status in 2025–2026: Rarely used standalone, but Hv products remain core to many advanced methods (K-FAC, SAM variants, etc.).
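A sketch of the inner CG solve using only matrix-vector products. Here hvp is a closure around an explicit toy Hessian; in Hessian-free optimization the same products would come from double backprop, never from a stored matrix:

```python
import numpy as np

def conjugate_gradient(hvp, g, iters=50, tol=1e-10):
    # Approximately solve H @ delta = -g using only the products hvp(v) = H @ v.
    delta = np.zeros_like(g)
    r = -g - hvp(delta)            # residual b - H @ delta, with b = -g
    p = r.copy()
    for _ in range(iters):
        Hp = hvp(p)
        alpha = (r @ r) / (p @ Hp)
        delta = delta + alpha * p
        r_new = r - alpha * Hp
        if np.linalg.norm(r_new) < tol:
            break
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return delta

# Toy ill-conditioned curvature (condition number 100):
H = np.diag([1.0, 10.0, 100.0])
g = np.array([1.0, 2.0, 3.0])
delta = conjugate_gradient(lambda v: H @ v, g)
```

For an n-dimensional quadratic, CG terminates in at most n iterations; in practice a few dozen Hv products already give a useful Newton-like direction.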

6.7 Kronecker-factored approximate curvature (K-FAC)

K-FAC (Martens & Grosse 2015, later improvements) approximates the Fisher/empirical curvature with Kronecker-factored structure — extremely effective for deep nets.

Core idea: For a fully-connected or conv layer a = Wx + b, the Fisher block for W is approximately:

F_W ≈ (A ⊗ G) where A = E[x xᵀ] (input covariance, layer input statistics) G = E[ (∇_a log p) (∇_a log p)ᵀ ] (gradient covariance w.r.t. pre-activations)

→ Inverse approximated as A⁻¹ ⊗ G⁻¹ (via Woodbury or direct inversion of small matrices)
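A numeric sanity check of the Kronecker-factored inverse (sizes are illustrative; with NumPy's row-major flattening, kron(A, G) · vec(V) = vec(A V Gᵀ), so applying the inverse only requires inverting the small factors):

```python
import numpy as np

rng = np.random.default_rng(0)

def spd(n):
    # Random symmetric positive definite matrix (stand-in for A or G).
    M = rng.normal(size=(n, n))
    return M @ M.T + n * np.eye(n)

A = spd(3)                    # input-covariance factor, E[x x^T]
G = spd(2)                    # pre-activation gradient covariance factor
V = rng.normal(size=(3, 2))   # a "gradient" for a 3x2 weight block

# Direct: invert the full (6x6) Kronecker product.
direct = np.linalg.solve(np.kron(A, G), V.reshape(-1))

# K-FAC way: only invert the small factors.
# With row-major flattening, kron(A, G) @ vec(V) = vec(A V G^T),
# so kron(A, G)^{-1} vec(V) = vec(A^{-1} V G^{-T}).
kfac = (np.linalg.solve(A, V) @ np.linalg.inv(G).T).reshape(-1)
```

For a layer with d_in inputs and d_out outputs, this replaces one (d_in·d_out)² inversion with one d_in² and one d_out² inversion — the source of K-FAC's O(d) memory claim.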

Extensions (2020–2025):

  • Include layer-wise damping

  • Diagonal, block-diagonal, or empirical Fisher variants

  • Shampoo (2018) – similar but uses different factorization

  • Sophia (2023) – uses K-FAC-like preconditioner with clipping → competitive with AdamW at scale

Advantages:

  • Memory O(d) instead of O(d²)

  • Fast per-iteration cost

  • Often 2–10× fewer steps than Adam in vision/language tasks

Current status (2025–2026): K-FAC and variants (Sophia, Shampoo, AdaHessian) are among the few second-order-style methods that remain competitive or superior to AdamW on large-scale tasks when tuned properly.

6.8 Practical second-order methods in deep learning: pros, cons, when to use

Pros of second-order / Hessian-aware methods:

  • Much faster convergence per epoch (quadratic locally)

  • Better handling of ill-conditioning

  • Natural step-size adaptation via curvature

  • Sometimes reach flatter, better-generalizing minima

Cons:

  • High memory and compute per step (even approximations cost 2–10× more than Adam)

  • Sensitive to hyperparameters (damping, trust radius, etc.)

  • Less robust to noisy gradients (mini-batches)

  • Hard to scale to trillion-parameter models without heavy approximations

When to use in 2025–2026 practice (realistic ranking):

| Scenario | Recommended Method | Why / When |
|---|---|---|
| Small–medium models (< 100M params) | L-BFGS, K-FAC, AdaHessian | Full-batch or large-batch feasible; faster convergence |
| Fine-tuning / few-shot adaptation | K-FAC, Sophia, Shampoo | Curvature helps when data is limited |
| Large-scale pre-training (LLMs) | AdamW / Lion (first-order) | Speed & stability win; second-order too expensive |
| Research on optimization landscapes | Hessian-free, exact Newton on toy nets | Understand curvature, saddle escape, sharpness |
| When second-order shines | K-FAC/Sophia on vision transformers, diffusion models | 2–5× fewer steps, sometimes better loss/perplexity |

Current frontier (2025–2026):

  • Pure second-order methods are niche

  • Hybrid / approximate second-order (Sophia, Adafactor with momentum, Lion + curvature hints) are gaining traction

  • For most practitioners: AdamW + proper scheduling still dominates due to simplicity and robustness at extreme scale

This concludes the core treatment of second-order techniques. The next chapters cover convergence theory and advanced modern variants.

    7. Convergence Theory and Analysis

    This chapter provides a rigorous overview of convergence guarantees for optimization algorithms in machine learning contexts. We cover deterministic first-order methods on convex problems, extensions to strongly convex and non-convex settings, stochastic guarantees, fundamental lower bounds, the role of problem constants (conditioning, smoothness), and modern insights into deep learning loss landscapes.

    7.1 Convex smooth optimization: rates for gradient descent family

    Assume J is L-smooth (Lipschitz continuous gradient: ‖∇J(θ) − ∇J(θ')‖ ≤ L ‖θ − θ'‖) and convex (possibly μ-strongly convex with μ ≥ 0).

    Standard rates (fixed step size η ≤ 1/L):

    • Gradient Descent (GD): J(θ_t) − J^* ≤ (1 − μ/L)^t (J(θ_0) − J^*) if μ > 0 (linear convergence); J(θ_t) − J^* ≤ O(1/t) if μ = 0 (sublinear)

    • Nesterov Accelerated Gradient (NAG): optimal among first-order methods for smooth convex functions J(θ_t) − J^* ≤ O(1/t²) (smooth convex, μ = 0) J(θ_t) − J^* ≤ O((1 − √(μ/L))^t) (strongly convex, accelerated linear rate)

    Momentum variants (Polyak, Nesterov) achieve the same accelerated rates under proper step-size tuning.

    Recent refinements (up to 2025–2026 literature): Long-step or anytime acceleration schedules allow GD-like methods to approach O(1/t^{1.27}) or better in specific smooth convex regimes via computer-assisted proofs or adaptive mechanisms. However, the canonical O(1/t²) for accelerated methods remains the optimal worst-case rate for black-box first-order oracles on L-smooth convex problems.

    7.2 Strongly convex case: linear convergence

    When J is μ-strongly convex + L-smooth (condition number κ = L/μ ≥ 1):

    • GD with η = 1/L: J(θ_t) − J^* ≤ (1 − 1/κ)^t (J(θ_0) − J^*) → linear rate, but slow when κ is large (ill-conditioned problems common in ML)

    • Optimal accelerated rate (Nesterov): Number of iterations to reach ε-suboptimality: O(√κ log(1/ε)) → much better dependence on conditioning than plain GD's O(κ log(1/ε))

    In practice for deep nets, even when locally strongly convex near a minimum, global strong convexity rarely holds, but local linear convergence can still occur once the algorithm enters a strongly convex basin.

    7.3 Non-convex optimization: stationarity, escape from saddle points

    Most deep learning objectives are highly non-convex.

    First-order stationarity GD/SGD converge to points where ‖∇J(θ)‖ → 0 (first-order stationary points).

    Second-order stationarity A point is second-order stationary if ∇J(θ) = 0 and Hessian H(θ) ≽ 0 (no negative curvature directions). Local minima satisfy this; strict saddle points have at least one negative eigenvalue.

    Escape from saddle points (key insight for non-convex success): In high dimensions (d ≫ 1), almost all critical points are saddles, not local minima.

    Theoretical escape mechanisms (2015–2025 results):

    • Perturbed GD / SGD with injected noise: escapes saddles in O(log d / λ_min²) iterations (λ_min = most negative eigenvalue).

    • Stochasticity alone (SGD noise) suffices in many cases → escapes in polynomial time with high probability.

    • Recent variants (2024–2025): occupation-time-adapted perturbations (PGDOT/PAGDOT), curvature-calibrated noise, gradient clipping in distributed settings, or PID-like controllers accelerate escape rates compared to vanilla perturbed GD.

    • Empirical observation: momentum + adaptive methods (Adam) naturally escape saddles faster due to velocity buildup and noise.

    Convergence guarantees in non-convex smooth case: E[‖∇J(θ_t)‖²] ≤ O(1/√t) for SGD under bounded variance assumptions.

    7.4 Stochastic convergence: almost sure convergence, expected suboptimality

    For SGD on finite-sum problems J(θ) = (1/n) ∑ ℓ_i(θ):

    Robbins–Monro conditions (classic, still foundational): η_t > 0, ∑ η_t = ∞, ∑ η_t² < ∞ → almost sure convergence to stationary point in convex case.

    Recent relaxations (2024–2025): Using stopping-time methods or martingale arguments, almost sure convergence holds under weaker conditions: ∑ η_t = ∞ and ∑ η_t^p < ∞ for some p > 2 (allows slower decay than 1/t). Expected suboptimality: O(1/√T) for non-convex smooth (optimal first-order stochastic rate).

    Almost sure vs in expectation:

    • Almost sure: θ_t → stationary point w.p. 1 (pathwise convergence).

    • In expectation: E[J(θ_t) − J^*] → 0 (weaker but easier to prove).

    In deep learning, we often get practical convergence even with constant LR + decay schedules that violate strict Robbins–Monro, due to non-stationary landscapes and regularization.

    7.5 Complexity lower bounds and optimal methods

    Oracle complexity lower bounds (worst-case, first-order black-box oracles):

    • Smooth convex: Ω(√(L/ε)) gradient queries for ε-suboptimality (achieved by NAG).

    • Strongly convex smooth: Ω(√(κ) log(1/ε)).

    • Stochastic smooth convex: Ω(1/ε²) (SGD optimal up to logs).

    • Non-convex smooth: Ω(1/ε⁴) for first-order stationarity in worst case; practical rates O(1/√T) for SGD.

    • Constrained composite non-convex nonsmooth (2025 results): tighter lower bounds involving affine constraints, showing some first-order methods cannot achieve better than certain polynomial dependencies.

    Optimal methods: Accelerated gradient (Nesterov), variance-reduced SGD (SVRG/SAGA for finite-sum), or approximate second-order (K-FAC/Sophia) approach these bounds in regimes where assumptions hold.

    7.6 Influence of conditioning number, Lipschitz constants, smoothness

    Condition number κ = L/μ (strongly convex case):

    • Large κ → narrow valleys → slow convergence (zig-zagging in GD).

    • Preconditioning (e.g., Adam's adaptive LR, natural gradient, K-FAC) reduces effective κ.

    Lipschitz constant L (smoothness):

    • Bounds step size η ≤ 1/L for descent guarantee.

    • Large L → small allowable steps → slow progress.

    Gradient Lipschitz constant (Hessian bounded): Controls descent lemma strength and escape-from-saddle time (smaller negative curvature → longer escape).

    In deep nets:

    • Effective L grows with depth/width → requires clipping, normalization (LayerNorm), or adaptive optimizers.

    • Ill-conditioning worsens with scale → motivates techniques like warm-up, large-batch scaling, sharpness-aware methods.

    7.7 Landscape pathologies in deep learning: sharp vs flat minima, mode connectivity

    Sharp vs flat minima (Hessian trace/norm or max eigenvalue as sharpness proxy):

    • Flat minima: low curvature → loss insensitive to parameter perturbations → better robustness/generalization (classic Hochreiter & Schmidhuber 1997; Keskar et al. 2017). SGD with small batches, noise, or SAM tends to find flatter minima.

    • Sharp minima: high curvature → sensitive → often overfit, poor generalization.

    Recent challenges to the flat-minima hypothesis (2025 literature):

    • Sharp minima can generalize well if they lie in large-volume regions or under certain data distributions.

    • Even algorithmically found flat minima (via SAM/SA-GD) can sometimes yield Ω(1) population risk in realizable stochastic convex settings → flatness alone does not guarantee good generalization.

    • Volume argument: flat minima occupy larger measure in parameter space → more likely to be found, but sharp minima in high-volume "good" regions exist and generalize.

    Mode connectivity (Garipov et al. 2018; later works): In over-parameterized nets, local minima are often connected via low-loss paths (valleys) → loss surface has connected global basin rather than isolated bad minima. This explains why SGD rarely gets stuck in poor local minima despite non-convexity.

    Implications for deep learning (2025–2026 view):

    • Generalization tied more to implicit bias of optimizer (SGD → flat-ish minima) than landscape alone.

    • Sharpness-aware minimization (SAM) and variants remain popular for enforcing flatter solutions.

    • Rethinking flatness: some works propose seeking ε-maxima (points slightly worse than neighbors) for better robustness in modern over-parameterized regimes.

    This chapter ties together theoretical foundations with practical deep learning observations. Convergence in non-convex high-dimensional settings relies heavily on stochasticity, momentum, and landscape properties rather than strict convexity.

    8.1.2 Spektral, Jraph, Graph Nets (TensorFlow)

    Spektral — Repo: danielegrattarola/spektral. Strengths: Keras/TensorFlow 2 → simple API, excellent for quick experiments. Supports GCN, GAT, GraphSAGE, Message Passing, Graph Attention. 2026 status: still popular for TF/Keras users; less active than PyG/DGL.

    Jraph — Repo: google-deepmind/jraph. Strengths: JAX-native → composable, GPU/TPU acceleration; GraphTuple abstraction → excellent for research on new GNN layers. 2026 usage: growing in the JAX ecosystem (AlphaFold3 graph components, scientific ML).

    Graph Nets (TensorFlow) — Repo: deepmind/graph_nets. Strengths: TensorFlow 1/2 library for message-passing models. Legacy but influential; largely superseded by PyG/DGL in 2026.

    2026 recommendation: PyG or DGL cover 95%+ of use cases; Jraph for pure JAX workflows.

    8. Advanced Convergence Techniques and Modern Variants

    This chapter explores cutting-edge (as of 2025–2026) techniques that push beyond classical first- and second-order methods. These include learning-rate-free or highly adaptive optimizers, hybrid combinations, spectral step-size rules, relaxed convexity notions like the Polyak-Łojasiewicz (PL) inequality, momentum extensions to non-convex landscapes, proximal formulations, and considerations for distributed training.

    8.1 Learning rate free / adaptive methods: Adadelta, Yogi, Sophia

    These methods aim to eliminate or drastically reduce manual learning-rate tuning by incorporating adaptive mechanisms or second-order information.

    Adadelta (2012): Fully adaptive, no global learning rate. Update: Δθ_t = − (RMS(Δθ)_{t-1} / RMS(g)_t) ⊙ g_t, where E[g²]_t = ρ E[g²]_{t-1} + (1−ρ) g_t² and RMS(g)_t = √(E[g²]_t + ε) (similarly for Δθ); ρ ≈ 0.95, ε ≈ 10⁻⁶ for stability. → Converges well without LR schedules in many cases; still used in some legacy setups.

    Yogi (2019): Improvement over Adam that fixes slow convergence in later stages and better handles non-stationary objectives. Key change: the second-moment estimate is updated additively, with a sign term controlling the direction of change, instead of Adam's multiplicative EMA: v_t = v_{t-1} − (1 − β₂) sign(v_{t-1} − g_t²) g_t² → Prevents aggressive decay of adaptive rates → faster late-stage progress. Empirical: Often outperforms Adam/RMSProp in vision/language tasks; meta-learned Yogi variants showed strong performance in nanophotonic inverse design (2025 studies).

    Sophia (2023, with 2025–2026 refinements): Scalable second-order stochastic optimizer for LLM pre-training. Uses inexpensive diagonal Hessian or Gauss-Newton approximations + entry-wise update clipping. Per-step cost close to AdamW, but exploits curvature for adaptive scaling. 2025 comparisons (e.g., vs. AdamW/Lion): Sophia often achieves lowest training/validation loss on GPT-style models at fixed token budget, though AdamW sometimes retains better downstream accuracy. Variants integrated in libraries like CoLLiE for 3D parallelism. → Competitive or superior in compute efficiency for large-scale LLM pre-training.

    8.2 Lookahead, RAdam, Ranger, Lion

    These combine ideas for stability, faster convergence, or memory efficiency.

    Lookahead (2019): Maintains slow weights (updated infrequently) and fast weights (inner-loop updates). Outer update: slow = slow + α (fast - slow) → Acts as a form of temporal ensembling → stabilizes training, reduces variance.

    RAdam (Rectified Adam, 2019): Fixes Adam's variance issues early in training via rectified bounds on adaptive LR. → Better warm-start behavior, less sensitive to LR choice.

    Ranger (2019–2021 variants like Ranger21): Synergistic combination: RAdam + Lookahead + Gradient Centralization (GC). GC: subtracts mean from gradients before update → centers gradients, improves stability. Ranger21 adds more components (e.g., AdamW-style decay). Still popular in 2025 for stable training in vision tasks.

    Lion (Evolved Sign Momentum, 2023, refined 2025): Sign-based, memory-efficient (stores only momentum, no second moment). Update: c_t = β₁ m_{t-1} + (1−β₁) g_t; θ_{t+1} = θ_t − η (sign(c_t) + λ θ_t) (with decoupled weight decay); m_t = β₂ m_{t-1} + (1−β₂) g_t → ~33–50% less memory than AdamW, competitive or better performance on many tasks (ImageNet, diffusion, LLMs). 2025 refinements (e.g., Refined Lion / RLion): Introduce non-linear bounded functions on the momentum product → higher accuracy (up to +20% vs AdamW in some classification/detection), overcomes gradient explosion/vanishing issues. Often fastest in GPU hours; strong generalization in vision-language and object detection.
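A one-parameter sketch of the interpolated-sign update from the Lion paper (scalar case for clarity; tensors apply it elementwise, and the hyperparameter values below are illustrative):

```python
def lion_step(theta, m, grad, eta=1e-4, beta1=0.9, beta2=0.99, wd=0.1):
    # One Lion update for a single scalar parameter.
    sign = lambda x: (x > 0) - (x < 0)
    c = beta1 * m + (1 - beta1) * grad            # interpolation fed to sign(.)
    theta = theta - eta * (sign(c) + wd * theta)  # signed step + decoupled decay
    m = beta2 * m + (1 - beta2) * grad            # momentum tracks raw gradients
    return theta, m

theta, m = lion_step(1.0, 0.0, grad=0.5, eta=0.01, wd=0.0)
```

Because the step magnitude is η regardless of gradient scale, only the momentum buffer m must be stored — the source of the memory savings over AdamW.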

    8.3 Barzilai-Borwein step size & spectral methods

    Barzilai-Borwein (BB) step size (1988, revived in deep learning): Approximates quasi-Newton (diagonal Hessian) without storing full matrix. η_t = (s_{t-1}^T s_{t-1}) / (s_{t-1}^T y_{t-1}) or alternative form where s = θ_t - θ_{t-1}, y = g_t - g_{t-1} → Spectral step size → adapts per iteration based on recent curvature estimate.

    Spectral variants (e.g., Spectral Projected Gradient): Combine BB with non-monotone line search or gradient projection. In deep learning: Used in sharpness-aware minimization extensions or as LR scheduler proxy. Advantages: No manual LR tuning, good for ill-conditioned problems. Drawback: Can oscillate; often combined with momentum or clipping.
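A sketch of the BB1 rule; on a quadratic with Hessian λI it recovers the exact inverse curvature 1/λ (the vectors below are hypothetical iterates):

```python
def bb_step_size(theta_prev, theta, g_prev, g):
    # BB1 spectral step: eta = (s^T s) / (s^T y) with
    # s = theta - theta_prev, y = g - g_prev (lists of floats).
    s = [a - b for a, b in zip(theta, theta_prev)]
    y = [a - b for a, b in zip(g, g_prev)]
    return sum(si * si for si in s) / sum(si * yi for si, yi in zip(s, y))

# On a quadratic with Hessian 4*I the gradient is g = 4*theta,
# so the BB step recovers the exact inverse curvature 1/4:
eta = bb_step_size([1.0, 1.0], [2.0, 3.0], [4.0, 4.0], [8.0, 12.0])
```

For general curvature, sᵀy/sᵀs is a Rayleigh-quotient estimate of the Hessian along the last step, so η adapts each iteration without any manual tuning.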

    8.4 Polyak-Łojasiewicz (PL) inequality and beyond

    PL inequality (1963/1988, modern revival): ‖∇J(θ)‖² ≥ 2μ (J(θ) − J^*) (μ-PL, exponent β = 1/2). More generally: ‖∇J‖ ≥ c (J − J^*)^β with β ∈ (0,1]

    → Generalizes strong convexity (PL with β=1/2 is weaker than μ-strong convexity but sufficient for linear convergence of GD/SGD).

    Key results (up to 2025–2026):

    • PL sufficient for linear convergence of GD/SGD in non-convex settings (local PL common near minima).

    • Recent: PL is essentially no more general than strong convexity for C² functions (2025 proofs show equivalence under sufficient smoothness).

    • Local PL emerges in over-parameterized nets via NTK stability or quasi-convex regions → explains fast convergence in deep nets.

    • Gradient domination (generalized PL) enables almost-sure rates close to o(n^{-1/(4β-1)}) for SGD/SHB.

    • PL used to analyze distributed methods, bilevel optimization, quantized training.

    Beyond PL: Kurdyka-Łojasiewicz (KL) extensions for nonsmooth cases; local PLNet architectures explicitly enforce PL-like properties.

    8.5 Heavy-ball momentum under non-convexity

    Heavy-ball / Polyak momentum in non-convex:

    v_{t+1} = β v_t - η ∇J(θ_t) θ_{t+1} = θ_t + v_{t+1}

    Theoretical progress (2024–2025):

    • Escapes saddles faster due to velocity buildup.

    • Under PL or gradient-domination: almost-sure convergence with rates approaching expectation bounds.

    • Stochastic heavy-ball (SHB) achieves o(n^{-1/(4β-1)}) almost surely under global/local gradient domination.

    • Robust to noise; helps in ill-conditioned non-convex landscapes common in deep learning.

    8.6 Proximal methods and regularization in optimization view

    Proximal gradient for composite problems J(θ) + R(θ):

    θ_{t+1} = prox_{η R} (θ_t - η ∇J(θ_t)) prox_R(z) = argmin_u (1/2)‖u - z‖² + R(u)

    In deep learning:

    • L1/L2 regularization → soft/hard thresholding or weight decay.

    • AdamW decouples weight decay → proximal view of L2.

    • Sharpness-aware minimization (SAM) ≈ proximal step minimizing worst-case loss in neighborhood.

    • Proximal policy optimization (PPO) in RL uses clipped surrogate → proximal-like trust region.

    Benefits: Built-in regularization, handles constraints, stabilizes non-smooth terms.
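The prox-gradient update above can be sketched for R(θ) = λ‖θ‖₁, whose proximal operator has a closed form — elementwise soft-thresholding (function names are ours):

```python
def prox_l1(z, t):
    # prox_{t*||.||_1}(z): elementwise soft-thresholding toward zero.
    return [max(abs(zi) - t, 0.0) * (1 if zi > 0 else -1) for zi in z]

def proximal_gradient_step(theta, grad, eta, lam):
    # theta <- prox_{eta*lam*||.||_1}(theta - eta * grad)
    z = [ti - eta * gi for ti, gi in zip(theta, grad)]
    return prox_l1(z, eta * lam)
```

Coordinates whose magnitude falls below the threshold are set exactly to zero, which is why L1-regularized proximal steps produce sparse parameters (unlike subgradient descent, which merely shrinks them).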

    8.7 Distributed / parallel optimization: data-parallel, model-parallel implications

    Data-parallel (DDP / FSDP): Each worker has full model, averages gradients. → Optimizer sees averaged (less noisy) gradients → larger effective batch size → sharper minima unless noise added. Implication: Adaptive methods (AdamW/Lion) benefit from large-batch scaling rules; second-order approximations (K-FAC/Sophia) scale better with sharding.

    Model-parallel / pipeline / tensor-parallel: Splits model across devices. → Gradient computation partial → requires all-reduce or local updates. Implication: Optimizer must handle asynchrony/noise; Lion/Sophia memory-efficient → preferred for extreme scale.

    2025–2026 trends:

    • DiLoCo-style low-communication optimizers scale better than pure data-parallel.

    • Sharded optimizers (ZeRO-Offload) pair well with memory-light methods like Lion.

    • Hybrid parallelism + adaptive optimizers → frontier for trillion-parameter training.

    Summary table (2025–2026 practical favorites)

    Technique/Optimizer | Key Strength                | Memory vs AdamW | Typical Use Case (2025–2026)
    Sophia              | Diagonal Hessian + clipping | Similar         | LLM pre-training (lowest loss)
    Lion / RLion        | Sign-based, very efficient  | ~33–50% less    | Vision, diffusion, general fine-tune
    Yogi                | Fixed Adam variance issues  | Similar         | Stable adaptive training
    Ranger / Ranger21   | RAdam + Lookahead + GC      | Similar         | Stable legacy vision tasks
    PL-based analysis   | Explains fast convergence   | —               | Theoretical justification for DL

    These advanced methods represent the state-of-the-art trade-offs between speed, stability, memory, and generalization in large-scale deep learning.

    9. Backpropagation and Automatic Differentiation Revisited

    This chapter revisits the core mechanism that makes modern deep learning possible: automatic differentiation (AD), with a focus on reverse-mode AD (backpropagation). We emphasize how gradients flow through computational graphs, the distinction between forward- and reverse-mode primitives, efficient ways to compute Hessian-vector products (crucial for second-order methods), and support for higher-order derivatives in contemporary frameworks (PyTorch, JAX, TensorFlow).

    9.1 Computational graph and reverse-mode AD

    A neural network computation is represented as a directed acyclic graph (DAG) — the computational graph — where:

    • Nodes are operations (e.g., matrix multiplication, ReLU, addition, softmax)

    • Edges represent data flow (tensors flowing from one op to another)

    • Leaves are inputs (parameters θ or data x)

    • Root(s) are outputs (usually scalar loss J)

    Forward pass: Evaluate the graph from inputs to outputs, computing intermediate values (primal values) and storing them for later use.

    Reverse-mode AD (backpropagation):

    • Starts from the scalar output (loss J).

    • Propagates adjoints (sensitivities, ∂J/∂z for each intermediate z) backward through the graph.

    • At each node, applies the local vector-Jacobian product (VJP) rule using the chain rule.

    • Final result: adjoints w.r.t. parameters → gradient ∇_θ J.

    Why reverse-mode dominates in deep learning:

    • Number of parameters (inputs to AD) ≫ number of outputs (usually 1 scalar loss) → reverse mode needs just one forward pass plus one backward pass, regardless of parameter count.

    • Forward-mode would require one pass per parameter → infeasible for millions/billions of parameters.

    Modern frameworks (2025–2026):

    • PyTorch: dynamic graph (eager execution + tape-based backward)

    • JAX: static graph via tracing + XLA compilation, supports both modes natively

    • TensorFlow 2.x: eager by default, graph mode optional

    Example simple graph (MLP layer + ReLU + MSE):

    text

    x → [W x + b] → ReLU → ŷ → (ŷ - y)² / 2 → J

    Forward: compute all intermediates. Backward: start with ∂J/∂J = 1, propagate back to ∂J/∂W, ∂J/∂b, ∂J/∂x.

    9.2 Gradient computation through chain rule

    The chain rule in reverse mode:

    For z = f(u), where u = g(v), scalar loss J:

    ∂J/∂v = (∂J/∂z) ⋅ (∂z/∂u) ⋅ (∂u/∂v) = adjoint_z ⋅ local_derivative(f) ⋅ local_derivative(g)

    In vector form (most practical):

    Let a = layer input, z = W a + b, ŷ = σ(z), J = loss(ŷ, target)

    Backward pass:

    • Receive incoming adjoint from above: δ_ŷ = ∂J/∂ŷ

    • Compute δ_z = δ_ŷ ⊙ σ'(z) (elementwise for activation)

    • Then δ_a = Wᵀ δ_z

    • δ_W = δ_z aᵀ

    • δ_b = δ_z

    This is repeated layer-by-layer backward.
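    The adjoint rules above can be checked numerically. A minimal NumPy sketch for a single sigmoid layer with squared-error loss; shapes and values are illustrative, and any differentiable σ follows the same pattern:

```python
import numpy as np

rng = np.random.default_rng(1)
W, b = rng.normal(size=(3, 4)), rng.normal(size=3)
a, y = rng.normal(size=4), rng.normal(size=3)

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))        # sigmoid activation

# Forward pass: store intermediates for reuse in the backward pass
z = W @ a + b
yhat = sigma(z)
J = 0.5 * np.sum((yhat - y) ** 2)

# Backward pass: exactly the adjoint rules from the text
delta_yhat = yhat - y                             # ∂J/∂ŷ
delta_z = delta_yhat * sigma(z) * (1 - sigma(z))  # δ_z = δ_ŷ ⊙ σ'(z)
delta_W = np.outer(delta_z, a)                    # δ_W = δ_z aᵀ
delta_b = delta_z                                 # δ_b = δ_z
delta_a = W.T @ delta_z                           # δ_a = Wᵀ δ_z
```

    A finite-difference perturbation of any single weight reproduces the corresponding entry of δ_W, which is a standard way to unit-test hand-written backward passes.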

    Key property: Each local VJP is cheap (same cost as forward op), and storage of intermediates enables efficient reuse.

    Autodiff implementation:

    • During forward: record op type + inputs + outputs

    • During backward: traverse graph in reverse topological order, apply pre-defined VJP rules for each op (e.g., matmul, conv, softmax all have efficient VJP implementations)

    9.3 Vector-Jacobian products (VJP) and Jacobian-vector products (JVP)

    These are the primitive building blocks of AD.

    Jacobian-vector product (JVP) — forward-mode primitive:

    J_f(v) = Df(x) ⋅ v (pushforward)

    • v is tangent vector (direction)

    • Computes directional derivative along v

    • Implemented by dual numbers or forward-over-forward AD

    • Cost: same as one forward pass

    Vector-Jacobian product (VJP) — reverse-mode primitive:

    V_f(u) = uᵀ Df(x) (pullback)

    • u is cotangent vector (incoming adjoint)

    • Computes gradient contribution back-propagated from u

    • Implemented by reverse accumulation

    • Cost: same as one backward pass

    Comparison table (2025 perspective, JAX/PyTorch style)

    Property                 | JVP (forward-mode)             | VJP (reverse-mode)
    Mathematical form        | Df(x) v                        | uᵀ Df(x)
    Direction of computation | Forward (with tangents)        | Backward (with cotangents)
    Best when                | Few inputs, many outputs       | Many inputs, few outputs (DL default)
    In JAX                   | jax.jvp(f, primals, tangents)  | jax.vjp(f, primals) → (out, vjp_fun)
    In PyTorch               | torch.func.jvp                 | torch.autograd.grad / .backward()
    Use for higher-order     | Building block for nested JVPs | Building block for nested VJPs

    Example in JAX:

    Python

    import jax
    import jax.numpy as jnp

    def f(x):
        return jnp.sin(x) * x**2  # any smooth scalar function works here

    primals = (jnp.array(2.0),)
    tangents = (jnp.array(1.0),)
    out, jvp_val = jax.jvp(f, primals, tangents)  # jvp_val = directional derivative along the tangent

    Reverse-mode equivalent uses vjp to get full gradient.

    9.4 Efficient Hessian-vector products (finite differences vs AD tricks)

    Hessian-vector product (HVP): Hv = ∇²J(θ) v = directional second derivative along v.

    Needed for Newton-CG, Hessian-free, K-FAC, natural gradient, etc.

    Method 1: Finite differences of gradients Hv ≈ [∇J(θ + εv) - ∇J(θ - εv)] / (2ε)

    • Two full gradient computations

    • Numerical instability (cancellation error, ε choice)

    • Simple to implement, no extra AD rules needed

    • Still used for debugging or when AD not available
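    The finite-difference estimate is easy to check on a quadratic, where the exact HVP is known in closed form. A NumPy sketch; H, θ, v, and ε are illustrative:

```python
import numpy as np

# For J(θ) = ½ θᵀ H θ the gradient is ∇J(θ) = H θ and the exact HVP is H v
H = np.array([[2.0, 0.5],
              [0.5, 1.0]])
grad = lambda th: H @ th

theta = np.array([0.3, -1.2])
v = np.array([1.0, 2.0])
eps = 1e-5

# Central difference of gradients along v: [∇J(θ + εv) - ∇J(θ - εv)] / (2ε)
hvp = (grad(theta + eps * v) - grad(theta - eps * v)) / (2 * eps)
```

    On a quadratic the estimate matches H @ v up to rounding error; on a general nonlinear loss the truncation error scales with ε² and the choice of ε matters, which is exactly the instability noted above.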

    Method 2: Forward-over-reverse AD (Pearlmutter's trick, most efficient) Hv = ∇_θ (vᵀ ∇_θ J(θ)) (gradient of the directional derivative; this equals Hv because the Hessian is symmetric)

    • One forward pass computes J; one reverse pass computes ∇J

    • Form the scalar vᵀ ∇J and differentiate it once more (equivalently: push the tangent v forward through the gradient function) → Hv

    • Cost: ~2–3× a single gradient evaluation

    • Machine precision, no ε tuning

    • Native in JAX (jax.jvp over jax.grad, or nested jax.grad with care) and PyTorch (torch.autograd.functional.hvp or nested .backward())

    Method 3: Reverse-over-reverse (possible but less common for HVP) Used in some higher-order cases.

    2025–2026 reality:

    • Forward-over-reverse is standard in JAX/PyTorch for HVPs (used in Sophia, Hessian-free variants, sharpness-aware methods).

    • Finite differences remain fallback for non-differentiable ops or quick prototyping.

    • In large models: kept memory-efficient via gradient checkpointing during nested AD.

    9.5 Higher-order derivatives in AD frameworks (Hessian-free, K-FAC)

    Modern frameworks support arbitrary-order derivatives via nested autodiff.

    JAX (strongest higher-order support):

    • jax.grad(jax.grad(f)) → second-order derivatives (but beware memory explosion)

    • jax.hessian(f) computes full Hessian (small models only)

    • jax.jacrev, jax.jacfwd for Jacobians

    • Efficient HVPs via jax.grad(lambda x: jnp.dot(v, jax.grad(f)(x)))(x)

    • Used in K-FAC implementations, natural gradient, higher-order optimizers

    PyTorch:

    • torch.func.hessian (functorch → torch.func)

    • Nested torch.autograd.grad or .backward() calls

    • torch.autograd.functional.hvp, jacrev, etc.

    • 2025: torch.compile improves nested AD performance

    Hessian-free optimization:

    • Never forms full H; only Hv via forward-over-reverse

    • Solve H δ = -g via conjugate gradient (CG), needing ~10–50 Hv products per outer step

    • Martens-style (2010) still relevant in 2025 for fine-tuning or small-to-medium models

    K-FAC (Kronecker-Factored Approximate Curvature):

    • Approximates Fisher/Hessian as block-diagonal + Kronecker structure per layer

    • Inverts small matrices (A ⊗ G)⁻¹ ≈ A⁻¹ ⊗ G⁻¹

    • Uses HVPs or empirical Fisher statistics

    • Modern implementations (2025): Sophia (diagonal approx + clipping), Shampoo, AdaHessian variants

    • JAX-based K-FAC libs remain fastest; PyTorch ports improving
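    The Kronecker identity that makes K-FAC cheap is easy to verify numerically: inverting the large Kronecker product agrees with the Kronecker product of the small inverses. A NumPy sketch; the random SPD factors are illustrative stand-ins for per-layer activation/gradient statistics:

```python
import numpy as np

rng = np.random.default_rng(2)

def spd(n):
    # Well-conditioned symmetric positive-definite factor
    M = rng.normal(size=(n, n))
    return M @ M.T + n * np.eye(n)

A, G = spd(3), spd(4)
lhs = np.linalg.inv(np.kron(A, G))                 # invert the full 12×12 block
rhs = np.kron(np.linalg.inv(A), np.linalg.inv(G))  # invert a 3×3 and a 4×4 instead
```

    Inverting the factors costs O(m³ + n³) instead of O((mn)³), which is the entire point of the Kronecker approximation.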

    Higher-order in practice:

    • Second-order common (HVPs for Newton-CG, K-FAC/Sophia)

    • Third+ order rare (Taylor-mode AD or nested JVPs/VJPs for curvature of curvature)

    • Frameworks handle it via operator overloading or tracing (JAX excels here)

    Summary table

    Concept                | Forward-mode (JVP)   | Reverse-mode (VJP) | Cost in DL (many params, scalar loss) | Typical Use
    First-order gradient   | Possible             | Dominant           | 1 fwd + 1 bwd                         | Backprop
    Hessian-vector product | Forward-over-reverse | —                  | ~2–3× gradient                        | Newton-CG, K-FAC, Sophia
    Full Hessian           | Very expensive       | Very expensive     | O(d) gradients or memory blowup       | Small models only
    Higher-order           | Nested JVPs          | Nested VJPs        | Exponential cost unless approximated  | Research, curvature analysis

    This machinery — especially efficient reverse-mode VJPs and forward-over-reverse HVPs — underpins every gradient-based training loop and most advanced optimizers discussed earlier.

    10. Special Topics and Frontiers (2025–2026 Perspective)

    This final chapter surveys emerging trends, open questions, and frontier developments in optimization for machine learning as of early 2026. It focuses on phenomena observed in large-scale training (LLMs, vision transformers, multimodal models) and highlights where theory meets messy practice. Many ideas build on earlier chapters: sharpness from non-convex landscapes (7.7), second-order approximations (6), implicit biases (from stochasticity and adaptivity), and scaling challenges in distributed settings (8.7).

    10.1 Sharpness-aware minimization (SAM) and related methods

    Sharpness-Aware Minimization (SAM, 2021) perturbs parameters in the neighborhood of current θ before computing the gradient, effectively minimizing the worst-case loss in an ε-ball:

    θ ← θ − η ∇_θ [max_{‖δ‖₂ ≤ ρ} J(θ + δ)] ≈ θ − η ∇_θ J(θ + ρ ∇_θ J(θ) / ‖∇_θ J(θ)‖₂)

    This encourages flatter minima, improving generalization (especially OOD robustness).
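    One SAM step per the approximation above, sketched in NumPy on a toy quadratic; A, ρ, and η are illustrative, and a real implementation perturbs the network's parameters in place:

```python
import numpy as np

A = np.diag([1.0, 10.0])                       # toy loss J(θ) = ½ θᵀ A θ
grad = lambda th: A @ th

theta = np.array([1.0, 1.0])
rho, eta = 0.05, 0.1                           # hypothetical SAM radius and step size

g = grad(theta)
delta = rho * g / (np.linalg.norm(g) + 1e-12)  # ascent step, scaled to the ρ-ball
g_sam = grad(theta + delta)                    # gradient at the perturbed point
theta = theta - eta * g_sam                    # descend with the sharpness-aware gradient
```

    Note the two gradient evaluations per step; this doubling of compute is exactly what the momentum-based variants below try to avoid.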

    2025–2026 advances:

    • Momentum-SAM (MSAM): Replaces gradient-ascent perturbation with momentum direction (inspired by NAG). Achieves similar flatness without doubling compute cost → near-zero overhead over SGD/Adam. Strong results on vision/language tasks; reduces memory demands.

    • Friendly-SAM / MNSAM: Adds momentum acceleration + NAG in inner/outer loops → skips sharp regions faster, accelerates convergence while flattening.

    • Physics-Informed SAM (PI-SAM): Combines SAM with domain-specific regularizations (e.g., semiconductor material prediction via GNNs) → outperforms baselines in OOD generalization.

    • Unified SAM / USAM: General analysis under relaxed stochastic noise; improved convergence rates; better hyperparameter robustness.

    • Z-Score Gradient Filtering + SAM: NeurIPS OPT 2025 variant → filters outliers before perturbation → stabilizes large-batch training.

    Practical status: SAM and variants remain go-to for generalization boosts in vision, diffusion, and fine-tuning. Momentum/acceleration hybrids make it nearly free in cost. Still debated: some sharp minima generalize well if in high-volume regions; flatness not always causal for robustness.

    10.2 Second-order preconditioning in large language models

    Second-order methods precondition gradients using curvature (Hessian/Fisher approximations) → better handling of ill-conditioning in transformers.

    Sophia (2023, still dominant 2025–2026): Lightweight diagonal Hessian estimate + per-coordinate clipping (controls worst-case updates). Achieves 2× speedup over AdamW on GPT-style models (125M–1.5B) → same perplexity with 50% fewer steps. Key: clipping stabilizes heterogeneous curvatures; transferable hyperparameters (γ ≈ 0.01–0.05).

    Scaling insights (NeurIPS 2025):

    • Muon, SOAP, Shampoo variants → 1.4×+ speedup over AdamW on transformers when scaled properly (width/depth co-scaling).

    • Optimal weight decay ~1/width (independent scaling); second-order methods prefer larger optimal model size for fixed compute.

    • Incorrect scaling drops speedup from 1.4× to <1.1× (190M → 640M params).

    Other developments:

    • AdaFisher: Diagonal block-Kronecker Fisher → robust, stable tuning; strong on image classification/language modeling.

    • MARS (Make vAriance Reduction Shine): Combines preconditioned AdamW/Lion/Shampoo with scaled recursive momentum → variance reduction shines at LLM scale.

    • Sassha: Sharpness-aware + stable Hessian approx → flattens curvature, lazy updates → generalization comparable/superior to first-order.

    Trend: Diagonal/Kronecker approximations + clipping → practical second-order for LLMs. Full Gauss-Newton or dense K-FAC still too expensive; hybrids win on Pareto frontier (speed + perplexity).

    10.3 Grokking and optimization dynamics

    Grokking: Delayed generalization after prolonged overfitting (e.g., modular arithmetic in transformers) → sudden test accuracy jump post-memorization.

    2025–2026 mechanistic insights:

    • Low-dimensional + transversely curved trajectories: PCA shows single principal component dominates weight evolution → lies on low-D manifold with curvature orthogonal to progress.

    • Anti-grokking collapse: After peak generalization, loss of test performance → marked by heavy-tailed spectral metric α shifts (HTSR detects all phases reliably).

    • Regularization-induced grokking: Small non-zero ℓ₁/nuclear norm → enforces sparsity/low-rank → grokking even without weight decay; over-parameterization (depth) enables grok/ungrok without explicit reg.

    • Geometric inductive bias: Architectural topology (e.g., fewer degrees of freedom) bypasses phase transitions → faster emergence of invariants.

    • Norm beyond ℓ₂: Grokking without weight decay if regularized toward other properties (ℓ₂ norm grows but generalization holds).

    Implications: Grokking tied to geometry (curvature, volume), implicit simplicity bias, and data selection. Heavy-tailed metrics best predict dynamics; grokking sometimes necessary for optimal solutions but trades off training time.

    10.4 Implicit bias of optimizers (SGD vs Adam)

    Implicit bias: Optimizer favors certain solutions among many minimizers (no explicit regularization).

    SGD: Simplicity bias → linear/simple boundaries, suboptimal margins, flatter minima → better generalization (noise as regularizer).

    Adam/AdamW: Richer bias → nonlinear, diverse features, closer to Bayes optimal; resistant to simplicity bias → sometimes sharper minima, worse generalization in some regimes.

    2025 findings:

    • Incremental (per-sample) Adam deviates from full-batch → can reach ℓ₂-max-margin vs ℓ∞-max-margin (Signum invariant).

    • Stochastic Adam does not converge to full-batch counterpart → mini-batch variants achieve near-zero test error where full-batch fails.

    • Adaptive methods recover higher-accuracy, non-collapsing solutions in categorical tasks → SGD simpler/lower-dimensional.

    Practical: SGD (or noisy variants) often generalizes better despite slower convergence; Adam faster but needs careful decay/scheduling. Implicit noise geometry (SGD) vs coordinate-adaptivity (Adam) drives differences.

    10.5 When do second-order methods outperform first-order in practice?

    Conditions where second-order shines (2025–2026 evidence):

    • Large batch sizes + moderate dataset/model size → curvature exploitation accelerates.

    • Fine-tuning / few-shot / medium-scale (up to ~1B params) → K-FAC/Sophia/Sassha 2–5× fewer steps, better loss/perplexity.

    • Ill-conditioned landscapes (transformers, diffusion) → preconditioning reduces zig-zagging.

    • When tuned properly (diagonal/clipped/Kronecker + scaling rules) → Muon/SOAP/Sophia achieve 1.4×+ speedup over AdamW on LLMs.

    Where first-order wins:

    • Extreme scale (trillion params) → memory/compute overhead dominates; AdamW/Lion simpler, more robust.

    • Heavy-tailed noise or unstable gradients → second-order brittle without heavy clipping.

    • Poor hyperparameter scaling → speedup collapses (e.g., wrong weight decay).

    Empirical rule-of-thumb: Second-order > first-order when batch size large, model not gigantic, and curvature approx cheap (diagonal/HvP). Hybrids (Sophia, Sassha) close gap.

    10.6 Open challenges: scaling second-order methods to trillion-parameter models

    Core bottlenecks:

    • Memory: Full/approximate Hessian → O(d²) or high per-step cost; even diagonal/Kronecker strains at 10¹² params.

    • Compute: Frequent curvature estimates → offset by fewer steps, but communication in distributed (all-reduce Fisher stats) explodes.

    • Stability: Heavy-tailed noise, heterogeneous curvatures → divergence without aggressive clipping/lazy updates.

    • Scaling laws: Second-order prefers larger optimal models for fixed compute → but trillion-scale needs new parallelism (contiguous param mgmt, activation offloading).

    • Generalization: Some second-order converge to sharper minima → needs sharpness integration (Sassha-style).

    Active directions (2025–2026):

    • Extreme sharding + low-communication (DiLoCo-style) for preconditioners.

    • Memory-efficient frameworks (Zero-Offload + diagonal/block) → target trillion-scale.

    • Hybrid fallback: Mid-training switch to first-order if curvature no longer helps.

    • Theoretical: Sample complexity under heavy tails; better Fisher approximations.

    Outlook: Second-order not yet dominant at frontier scale, but diagonal/clipped variants (Sophia/Muon) show promise. Trillion-param training still AdamW/Lion territory; breakthroughs in efficient curvature + scaling rules could change that.

    This concludes the module. These frontiers show optimization remains vibrant: blending classical theory with LLM-scale empirics drives progress.
