Having a good initialization prevents the characteristic “hockey stick” shape of the loss curve: the steep early drop is essentially just the NN shrinking down weights that were initialized way too high (and therefore produce overconfident logits).

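A minimal sketch of this (assuming PyTorch; the layer sizes, vocab size, and tensors are hypothetical): with large output-layer weights the logits are extreme and the initial loss is huge, while small weights give a near-uniform softmax and an initial loss close to -log(1/num_classes).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(42)
vocab_size = 27                     # hypothetical number of classes (e.g. characters)
h = torch.randn(32, 64)             # fake hidden activations for a batch of 32
targets = torch.randint(0, vocab_size, (32,))

# Over-confident init: large output weights -> extreme logits -> huge initial loss.
W_big = torch.randn(64, vocab_size) * 10.0
loss_big = F.cross_entropy(h @ W_big, targets)

# Sensible init: small output weights -> near-uniform softmax at the start.
W_small = torch.randn(64, vocab_size) * 0.01
loss_small = F.cross_entropy(h @ W_small, targets)

# At initialization the loss should sit near -log(1/vocab_size) (~3.3 for 27 classes).
print(loss_big.item(), loss_small.item(), -torch.log(torch.tensor(1.0 / vocab_size)).item())
```
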
Good example:

Weights should also be initialized close to 0 so that the pre-activations don’t overshoot into the saturated regions of activation functions like tanh, sigmoid, or ReLU, which can lead to dead neurons (too far from the middle, tanh/sigmoid are flat in their tails and ReLU is flat on the negative side → ~0 gradient → no learning). The same thing can happen with a learning rate that is too high. See the sketch below.
What you want is a fairly homogeneous, roughly Gaussian distribution of (pre-)activations at initialization.
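A minimal sketch of the saturation effect (assuming PyTorch; layer sizes and the 0.97 saturation threshold are arbitrary choices for illustration): scaling the weights changes the pre-activation spread, and once tanh saturates its local gradient (1 - tanh(x)^2) is near zero. The last scale uses the tanh gain 5/3 divided by sqrt(fan_in), which keeps the tanh outputs in a healthy range.

```python
import torch

torch.manual_seed(42)
fan_in = 200
x = torch.randn(1000, fan_in)                       # roughly unit-Gaussian inputs

for scale in (1.0, 0.1, (5 / 3) / fan_in**0.5):     # last: tanh gain / sqrt(fan_in)
    W = torch.randn(fan_in, 100) * scale
    pre = x @ W                                      # pre-activations
    h = torch.tanh(pre)
    saturated = (h.abs() > 0.97).float().mean().item()
    print(f"scale={scale:.4f}  pre-act std={pre.std().item():.2f}  "
          f"saturated fraction={saturated:.1%}")
```

Plotting histograms of `h` for each scale (as done in makemore) makes the same point visually: with large weights the activations pile up at ±1.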

Randomly initialized neural network != random function!

References

Makemore (Andrej Karpathy) — also covers plotting tanh activation statistics, …