Convergence to local minima is well understood in the non-convex optimisation literature, and deep neural networks are likewise assumed to converge to local minima. However, the number of saddle points proliferates as the dimension increases. In this work we hypothesise that
- All deep networks converge to degenerate saddles
- Good saddles are good enough
We coin a new term, the good saddle, and empirically verify that deep networks converge to such points: the Hessian at convergence has a significant number of zero eigenvalues, indicating flatness in many directions and making the region difficult for gradient descent to escape. The figure above is a toy example of an error surface exhibiting this flatness; different gradient descent algorithms escape the flat region in different ways.
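As a minimal illustration of the flatness criterion described above (a sketch of the idea, not this project's code), the snippet below builds a toy two-parameter loss that is curved in one direction and perfectly flat in the other, estimates its Hessian at a critical point by finite differences, and counts the near-zero eigenvalues. The function names and the zero-eigenvalue threshold are assumptions chosen for illustration.

```python
import numpy as np

def loss(w):
    # Toy error surface: curved in w[0], completely flat in w[1].
    return w[0] ** 2

def numerical_hessian(f, w, eps=1e-4):
    """Estimate the Hessian of f at w via central finite differences."""
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            w_pp = w.copy(); w_pp[i] += eps; w_pp[j] += eps
            w_pm = w.copy(); w_pm[i] += eps; w_pm[j] -= eps
            w_mp = w.copy(); w_mp[i] -= eps; w_mp[j] += eps
            w_mm = w.copy(); w_mm[i] -= eps; w_mm[j] -= eps
            H[i, j] = (f(w_pp) - f(w_pm) - f(w_mp) + f(w_mm)) / (4 * eps ** 2)
    return H

w_star = np.zeros(2)  # a critical point of the toy loss
H = numerical_hessian(loss, w_star)
eigvals = np.linalg.eigvalsh(H)

# Count eigenvalues close to zero: each one marks a flat direction
# in which plain gradient descent receives (almost) no signal.
flat_directions = int(np.sum(np.abs(eigvals) < 1e-3))
print(flat_directions)  # -> 1 (the w[1] direction is flat)
```

In a real network the Hessian is far too large to form explicitly; in practice one would estimate its spectrum with Hessian-vector products (e.g. via automatic differentiation) rather than finite differences.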
The source code for this project can be found
here