💡 Weight Initialization in Deep Learning: What is it, and why should you care?
Let's start with weights.
Those floating point numbers which the model learns during training and which somehow encapsulate the magic of deep learning.
But what's the value of those floats when we start training?
Should they be set randomly? Should they be kept at zero or one? Will that be optimal? Will it help with faster convergence?
Let's find out.
Imagine you are in your garden planting seeds.
If you plant seeds too deep (weights too small), they might never sprout.
If you plant them too shallow (weights too large), they might sprout too fast but won't grow strong roots.
The right depth ensures healthy growth, just like proper weight initialization ensures good learning.
It reminds me of Goldilocks: not too deep, not too shallow, but just right.
Why Weight Initialization Matters
At the beginning of training, weights need to be initialized to some starting values.
Why? Because if you don't start with good initial values, bad things can happen 👇
1️⃣ Vanishing Gradients: Weights that are too small can cause gradients to shrink exponentially as they backpropagate. This slows down learning, especially in the early layers (see the sketch after this list).
2️⃣ Exploding Gradients: Conversely, weights that are too large can cause gradients to grow uncontrollably, leading to unstable training.
3️⃣ No Symmetry Breaking: If all weights are initialized to the same value, neurons in the same layer will learn the same features, defeating the purpose of having multiple neurons.
4️⃣ Training Efficiency: Poor initialization can make the optimization process unnecessarily slow or push it toward a suboptimal solution.
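To see the first two failure modes concretely, here is a minimal NumPy sketch. The 50-layer, 256-unit linear stack and the three standard deviations are assumptions chosen purely for illustration: depending on the initialization scale, the forward signal (and, for a linear stack, the backpropagated gradient behaves the same way) either collapses toward zero or blows up.

```python
import numpy as np

# Minimal sketch: push a random batch through a deep stack of linear layers
# and watch how the signal scale behaves for different initialization stds.
rng = np.random.default_rng(0)
width, depth = 256, 50
x0 = rng.standard_normal((64, width))          # batch of 64 examples

for std in (0.01, 1 / np.sqrt(width), 0.10):   # too small, well scaled, too large
    x = x0
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * std
        x = x @ W
    print(f"init std={std:.4f} -> signal std after {depth} layers: {x.std():.3e}")

# Typical outcome: std=0.01 shrinks toward zero (vanishing), std=1/sqrt(256)
# stays around 1 (well scaled), std=0.10 grows by many orders of magnitude (exploding).
```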
What's the remedy?
Here are some common weight initialization techniques (a short PyTorch sketch follows the list).
1️⃣ Random Initialization: Weights are initialized randomly, typically using a uniform or normal distribution.
2️⃣ Xavier Initialization (Glorot Initialization): Adjusts the scale of the random weights based on the number of input and output neurons (fan-in and fan-out).
3️⃣ He Initialization: Similar to Xavier Initialization, but scales weights based on the number of input neurons only; designed with ReLU-based activations in mind.
4️⃣ LeCun Initialization: Specifically tailored for sigmoid and tanh activations; scales weights to keep the variance stable for these activation functions.
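Here is what those four look like in code. This is a minimal PyTorch sketch using torch.nn.init; the single nn.Linear(512, 256) layer is an assumption purely for illustration, and since PyTorch ships no dedicated LeCun helper, that one is scaled by hand.

```python
import torch.nn as nn

layer = nn.Linear(512, 256)   # fan_in = 512, fan_out = 256 (illustrative sizes)

# 1. Plain random initialization: normal distribution with a hand-picked std.
nn.init.normal_(layer.weight, mean=0.0, std=0.02)

# 2. Xavier / Glorot: variance scaled by both fan_in and fan_out.
nn.init.xavier_uniform_(layer.weight)

# 3. He / Kaiming: variance scaled by fan_in only, meant for ReLU-family activations.
nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")

# 4. LeCun-style: std = 1 / sqrt(fan_in), computed by hand here.
fan_in = layer.weight.size(1)                 # in_features of the Linear layer
nn.init.normal_(layer.weight, mean=0.0, std=fan_in ** -0.5)

# Biases are commonly just zeroed out.
nn.init.zeros_(layer.bias)
```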
When in doubt, use He Initialization for ReLU-based activations, and Xavier or LeCun for sigmoid/tanh activations.
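In practice that rule of thumb is a few lines of setup code. Here is a hedged sketch; the two toy models and the init_weights helper are illustrative assumptions, not a standard API.

```python
import torch.nn as nn

def init_weights(model: nn.Module, scheme: str) -> None:
    # Apply He or Xavier initialization to every Linear layer in the model.
    for m in model.modules():
        if isinstance(m, nn.Linear):
            if scheme == "he":
                nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
            else:  # "xavier"
                nn.init.xavier_uniform_(m.weight)
            nn.init.zeros_(m.bias)

# ReLU-based network -> He initialization.
relu_net = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
init_weights(relu_net, "he")

# Tanh-based network -> Xavier initialization.
tanh_net = nn.Sequential(nn.Linear(784, 256), nn.Tanh(), nn.Linear(256, 10))
init_weights(tanh_net, "xavier")
```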