Chinese version: [[二阶导数判别法 - 微积分和线代交汇处]]

## Why Stationary Points Matter

A stationary point is where the gradient vanishes — the function stops climbing and stops falling. These points matter because they are the **candidates for optimal values**. Every local minimum and local maximum of a differentiable function must occur at one. The pattern appears everywhere:

- **Machine learning** — training a network means minimising a loss function of millions of parameters; gradient descent hunts for stationary points of that surface.
- **Economics** — firms maximise profit, consumers maximise utility; stationary points locate the optimal allocations.
- **Physics** — systems settle into minimum-energy states; equilibrium configurations are stationary points of the potential energy function.
- **Statistics** — maximum likelihood estimation finds parameters that make observed data most probable, which reduces to locating stationary points of the likelihood function.

The common thread: whenever a quantity depends on several variables and you want the best value, you search for stationary points and then classify them.

## The Two-Stage Process

**Stage one**: set every first partial derivative to zero. The gradient vanishes, and the stationary points reveal themselves.

**Stage two**: classify each point — local minimum, local maximum, or saddle. The second derivative test handles this stage.

---

## Why the Taylor Polynomial Is the Right Tool

A function $f(x,y)$ can be arbitrarily complicated — products of trigonometric terms, nested exponentials, compositions that resist direct analysis. Near a specific point, the Taylor polynomial **replaces** the function with a polynomial that behaves identically up to a controlled error. The polynomial is simple enough to reason about. The error is small enough to ignore. That trade is the heart of the method.

At a stationary point specifically, this trade becomes decisive. You want to know: does the function rise, fall, or do both as you move away?
The Taylor polynomial converts that geometric question into an **algebraic** one — the sign behavior of a polynomial you can inspect directly.

### The Arithmetic

The second-order Taylor expansion of $f(x,y)$ around $\mathbf{a} = (a_1, a_2)$:

$f(\mathbf{a} + \mathbf{h}) \approx f(\mathbf{a}) + \underbrace{f_x(\mathbf{a})\,h_1 + f_y(\mathbf{a})\,h_2}_{\text{linear terms}} + \underbrace{\frac{1}{2}\bigl[f_{xx}(\mathbf{a})\,h_1^2 + 2f_{xy}(\mathbf{a})\,h_1 h_2 + f_{yy}(\mathbf{a})\,h_2^2\bigr]}_{\text{quadratic terms}}$

where $\mathbf{h} = (h_1, h_2)$ is a small displacement from $\mathbf{a}$.

### Why the Linear Terms Die

Suppose $\mathbf{a}$ is a **stationary point**. By definition, $f_x(\mathbf{a}) = 0$ and $f_y(\mathbf{a}) = 0$. So:

$f_x(\mathbf{a})\,h_1 + f_y(\mathbf{a})\,h_2 = (0)\,h_1 + (0)\,h_2 = 0$

No trick here — each coefficient is literally zero. The gradient **is** the coefficient vector of the linear terms, so a vanishing gradient kills the entire first-order contribution.

This is what makes stationary points special. At a non-stationary point, the linear terms dominate for small $\mathbf{h}$ — they shrink like $|\mathbf{h}|$ while the quadratic terms shrink like $|\mathbf{h}|^2$. The function just tilts in the gradient direction, and there is nothing to classify. Only when the linear terms vanish does the quadratic form take charge. Only then does curvature become the leading behavior.

### What Remains

$f(\mathbf{a} + \mathbf{h}) - f(\mathbf{a}) \approx \frac{1}{2}\bigl[f_{xx}\,h_1^2 + 2f_{xy}\,h_1 h_2 + f_{yy}\,h_2^2\bigr]$

The left side is the **change** in $f$ as you move away from $\mathbf{a}$. The right side is a quadratic form in $\mathbf{h}$.
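This can be checked numerically. A minimal sketch, using the hypothetical test function $f(x,y) = \cos x + y^2$ (stationary at the origin, where $f_{xx} = -1$, $f_{xy} = 0$, $f_{yy} = 2$): the quadratic form should match the actual change in $f$ for small displacements.

```python
import math

def f(x, y):
    # hypothetical test function; stationary at the origin,
    # since f_x = -sin(x) and f_y = 2y both vanish there
    return math.cos(x) + y**2

# second partials at (0, 0)
fxx, fxy, fyy = -1.0, 0.0, 2.0

# small displacement h = (h1, h2) from the stationary point
h1, h2 = 1e-3, 1e-3

actual = f(h1, h2) - f(0.0, 0.0)
predicted = 0.5 * (fxx * h1**2 + 2 * fxy * h1 * h2 + fyy * h2**2)

print(actual, predicted)  # agree closely; discrepancy is higher-order in |h|
```

For this particular $f$ the odd-order Taylor terms vanish, so the agreement is even tighter than the general cubic-order error bound.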
In matrix language:

$f(\mathbf{a} + \mathbf{h}) - f(\mathbf{a}) \approx \frac{1}{2}\,\mathbf{h}^T H \mathbf{h}, \qquad H = \begin{pmatrix} f_{xx} & f_{xy} \\ f_{xy} & f_{yy} \end{pmatrix}$

### The Classification

The entire question reduces to: **what sign does $\mathbf{h}^T H \mathbf{h}$ take?** [[Definiteness in Linear Algebra]]

- $\mathbf{h}^T H \mathbf{h} > 0$ for every nonzero $\mathbf{h}$ → the function rises in every direction → **local minimum**
- $\mathbf{h}^T H \mathbf{h} < 0$ for every nonzero $\mathbf{h}$ → the function drops in every direction → **local maximum**
- The sign depends on $\mathbf{h}$ → some directions rise, others fall → **saddle**

This is the same logic as one-variable calculus: $a > 0$ in $ax^2 + bx + c$ makes the parabola open upward → local minimum. **Positive definiteness** generalises this to $n$ dimensions — the paraboloid opens upward no matter which direction you slice through it.

The Taylor polynomial is what makes the reduction possible. It converts a question about an arbitrary function into a question about a quadratic form — and quadratic forms are objects that linear algebra knows how to classify.

---

## The Bridge: Calculus Meets Linear Algebra

The dependency chain:

> **Taylor polynomial → quadratic form appears → linear algebra classifies it**

The Taylor expansion and Sylvester's criterion come from different branches of mathematics. The **second derivative test** is where these branches meet — calculus produces the Hessian matrix, linear algebra judges its character:

- **Positive definite** $\Rightarrow$ local minimum
- **Negative definite** $\Rightarrow$ local maximum
- **Indefinite** $\Rightarrow$ saddle point

## The $2 \times 2$ Case

The Hessian matrix at a stationary point $\mathbf{a}$:

$H = \begin{pmatrix} f_{xx} & f_{xy} \\ f_{xy} & f_{yy} \end{pmatrix}$

The **Hessian determinant**: $H(\mathbf{a}) = f_{xx} f_{yy} - (f_{xy})^2$.
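The determinant test reads directly as code. A minimal sketch, applied to the hypothetical example $f(x,y) = x^3 - 3x + y^2$, whose stationary points are $(\pm 1, 0)$:

```python
def classify(fxx, fyy, fxy):
    """Classify a stationary point of f(x, y) from its second partials."""
    det = fxx * fyy - fxy**2  # Hessian determinant
    if det > 0:
        # curvature consistent in every direction; f_xx picks which way
        return "local minimum" if fxx > 0 else "local maximum"
    if det < 0:
        return "saddle point"
    return "inconclusive"  # second derivative test fails

# hypothetical example: f(x, y) = x^3 - 3x + y^2
# f_x = 3x^2 - 3 and f_y = 2y vanish at (1, 0) and (-1, 0)
# second partials: f_xx = 6x, f_yy = 2, f_xy = 0
print(classify(6.0, 2.0, 0.0))   # at (1, 0): local minimum
print(classify(-6.0, 2.0, 0.0))  # at (-1, 0): saddle point
```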
The textbook recipe — check $H > 0$ first, then check the sign of $f_{xx}$ — is **Sylvester's criterion** applied to a $2 \times 2$ matrix without naming it. When $H > 0$, curvature is consistent across all directions. $f_{xx}$ then tells you _which_ consistent curvature: upward or downward. You could equally check $f_{yy}$ — when $H > 0$, both share the same sign.

When $H = 0$, the entire second-order Taylor expansion may vanish (as happens with cubic functions at the origin), and the test has nothing to grab onto. Higher-order terms govern the behavior.

### Decision Flowchart

```mermaid
flowchart TD
    A["Stationary point found<br/>∇f(a) = 0"] --> B["Compute Hessian determinant<br/>H = f_xx · f_yy − (f_xy)²"]
    B --> C{"H(a) > 0?"}
    B --> D{"H(a) < 0?"}
    B --> E{"H(a) = 0?"}
    C -->|Yes| F{"f_xx(a) > 0?"}
    F -->|Yes| G["🟢 Local Minimum<br/>Positive definite<br/>Bowl opens upward"]
    F -->|No| H["🔴 Local Maximum<br/>Negative definite<br/>Bowl opens downward"]
    D -->|Yes| I["🟡 Saddle Point<br/>Indefinite<br/>Curves up in one direction,<br/>down in another"]
    E -->|Yes| J["⚪ Inconclusive<br/>Second derivative test fails<br/>Need higher-order terms"]
    style G fill:#d4edda,stroke:#28a745,color:#000
    style H fill:#f8d7da,stroke:#dc3545,color:#000
    style I fill:#fff3cd,stroke:#ffc107,color:#000
    style J fill:#e2e3e5,stroke:#6c757d,color:#000
```

The flowchart encodes **Sylvester's criterion** for the $2 \times 2$ case. The first branch ($H > 0$ vs $H < 0$ vs $H = 0$) determines whether the eigenvalues agree in sign. The second branch ($f_{xx}$) determines _which_ sign they agree on.

---

## Methods for Testing Positive Definiteness

Several paths lead to the same answer. Each asks whether $\mathbf{v}^T A \mathbf{v} > 0$ for all nonzero $\mathbf{v}$, but through a different lens.

**Eigenvalues** — compute them all. Every eigenvalue positive → positive definite. The most direct route conceptually: the quadratic form stretches space outward in every eigendirection.
**Sylvester's criterion** — check the leading principal minors $\Delta_1 > 0, \Delta_2 > 0, \ldots, \Delta_n > 0$. Avoids eigenvalue computation. This is the method that scales the second derivative test to functions of three, four, or $n$ variables.

**Cholesky decomposition** — factor $A = LL^T$ where $L$ is lower triangular with positive diagonal. If the factorization succeeds, the matrix is positive definite. This is what numerical software actually uses.

**Gaussian elimination pivots** — perform elimination without row swaps. Every pivot positive → positive definite. The pivots are the squares of the diagonal entries in the Cholesky factor $L$.

**Definition directly** — show $\mathbf{v}^T A \mathbf{v} > 0$ for all nonzero $\mathbf{v}$ by algebraic argument (e.g., writing the form as a sum of squares). Rarely practical for computation, but sometimes the cleanest proof.

---

## The Unifying Idea

The second derivative test in every multivariable calculus textbook is a **matrix classification problem** that only linear algebra can solve. The study of quadratic forms, eigenvalues, and definiteness is not a detour from calculus — it is the machinery that powers optimisation in $n$ variables.
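The Cholesky route can be sketched in a few lines of pure Python (the two $3 \times 3$ matrices below are hypothetical Hessians, chosen for illustration): the factorization either completes with positive pivots or hits a non-positive pivot and fails, and that failure is itself the verdict.

```python
import math

def cholesky(A):
    """Attempt the factorization A = L L^T for symmetric A.

    Returns the lower-triangular factor L, or None if a pivot is
    non-positive — i.e. A is not positive definite.
    """
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = A[i][i] - s  # this is the i-th elimination pivot
                if d <= 0:
                    return None  # not positive definite
                L[i][j] = math.sqrt(d)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def is_positive_definite(A):
    return cholesky(A) is not None

# hypothetical Hessian with all leading principal minors positive (2, 3, 4)
H_min = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
# hypothetical indefinite Hessian: the second leading minor is 1 - 4 < 0
H_saddle = [[1.0, 2.0, 0.0], [2.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(is_positive_definite(H_min))     # True  -> local minimum
print(is_positive_definite(H_saddle))  # False
```

Note how the pivot check inside the loop is exactly the Gaussian-elimination criterion from the list above: the quantity `d` is the pivot, and `math.sqrt(d)` is the corresponding diagonal entry of $L$.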