The Hessian matrix is a square matrix of second-order partial derivatives of a scalar-valued function, or scalar field.
It describes the curvature of a multivariate function.
For a function $f: \mathbb{R}^n \to \mathbb{R}$ taking an input vector $x \in \mathbb{R}^n$, the entry in the $i$-th row and $j$-th column is:

$$(\mathbf{H}_f)_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}$$
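A quick numerical sanity check of this definition, using JAX's `jax.hessian` on an arbitrary example function (the specific `f` below is just an illustration):

```python
import jax
import jax.numpy as jnp

# Arbitrary scalar-valued function of a 2-D input vector.
def f(x):
    return x[0] ** 2 * x[1] + jnp.sin(x[1])

x = jnp.array([1.0, 2.0])

# jax.hessian builds the full n x n matrix of second-order partials,
# H[i, j] = d^2 f / (dx_i dx_j); symmetric when f is twice differentiable.
H = jax.hessian(f)(x)
print(H)
```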
How to compute Hessian-vector products? https://iclr-blogposts.github.io/2024/blog/bench-hvp
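One standard trick (sketched here in JAX, independently of what the linked post recommends) is forward-over-reverse differentiation: push the direction vector `v` through the gradient function with a JVP, so the full Hessian is never materialized.

```python
import jax
import jax.numpy as jnp

def hvp(f, x, v):
    # Forward-over-reverse: differentiate grad(f) along the direction v.
    # Returns H(x) @ v without ever building the full Hessian,
    # at a cost of roughly two gradient evaluations.
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

def f(x):
    return jnp.sum(x ** 3)          # Hessian is diag(6 * x)

x = jnp.arange(1.0, 4.0)            # [1., 2., 3.]
v = jnp.ones_like(x)

print(hvp(f, x, v))                 # [ 6. 12. 18.]
print(jax.hessian(f)(x) @ v)        # same values, but builds the full matrix
```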
(after/when reading the 20-page convex optimization doc)
https://chatgpt.com/c/cf1abad5-a4bc-4805-b557-a5ef99ff1987
Finish the explanation below after learning about the condition number, etc.
Why do large weights slow down learning in neural networks?
The weights of a neural network are directly linked to the condition number of the Hessian matrix in the second-order Taylor approximation of the loss function. The condition number of the Hessian is known to affect the speed of convergence of SGD algorithms. Consequently, growth in the magnitude of the weights can lead to an ill-conditioned Hessian matrix, resulting in slower convergence.
TLDR: The weight values influence the curvature of the loss landscape. Larger weights often lead to sharper curvature (higher Hessian eigenvalues), making the loss function harder to optimize.
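A minimal sketch of that TLDR on a hypothetical two-parameter toy model (a "deep linear" net $f(x) = w_2 w_1 x$ with squared loss, chosen purely for illustration): growing one weight inflates both the Hessian's largest eigenvalue (sharpness) and the spread between its extreme eigenvalues.

```python
import jax
import jax.numpy as jnp

# Hypothetical two-parameter "deep linear" model: f(x) = w2 * w1 * x,
# squared loss on a single sample (x = 1, target y = 0).
def loss(w):
    w1, w2 = w
    return 0.5 * (w2 * w1 * 1.0 - 0.0) ** 2

for w2 in (1.0, 10.0):
    w = jnp.array([1.0, w2])
    H = jax.hessian(loss)(w)
    eigs = jnp.linalg.eigvalsh(H)            # eigenvalues, ascending
    sharpness = float(jnp.abs(eigs).max())   # largest curvature magnitude
    spread = sharpness / float(jnp.abs(eigs).min())
    print(f"w2={w2}: eigenvalues={eigs}, sharpness={sharpness:.1f}, |eig| spread={spread:.1f}")
```

In this toy, the larger top eigenvalue forces a smaller stable learning rate, and the wider eigenvalue spread is exactly what an ill-conditioned Hessian means for SGD.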