This material is part of CS231n: Training Neural Networks, Part 1

Ans: Consider what happens if all weights are initialized to 0. In this case, each hidden unit gets exactly zero activation. More generally, if every neuron in the network computes the same output, then they all compute the same gradients during backpropagation and undergo exactly the same parameter updates. In other words, there is no source of asymmetry between neurons if their weights are initialized to be the same.
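The symmetry argument above can be checked numerically. The following is a minimal sketch, assuming a hypothetical 2-layer tanh network with a squared-error loss (the helper `hidden_grads`, the layer sizes and the data are made up for illustration). With a constant initialization every hidden unit receives the same gradient column; a small random initialization is included only for contrast.

```python
import numpy as np

def hidden_grads(W1, W2, X, y):
    """Return dL/dW1 for a 2-layer tanh net with a squared-error loss."""
    H = np.tanh(X.dot(W1))          # hidden activations, shape (N, H)
    out = H.dot(W2)                 # network output, shape (N, 1)
    dout = (out - y) / X.shape[0]   # gradient of 0.5 * mean squared error
    dH = dout.dot(W2.T)             # backprop into the hidden activations
    dpre = dH * (1.0 - H ** 2)      # backprop through tanh
    return X.T.dot(dpre)            # gradient w.r.t. W1, shape (D, H)

rng = np.random.RandomState(0)
X = rng.randn(8, 4)                 # made-up inputs
y = rng.randn(8, 1)                 # made-up targets

# Constant init: every column of dW1 (one column per hidden unit) is identical.
dW1_const = hidden_grads(np.full((4, 3), 0.5), np.full((3, 1), 0.5), X, y)
same_columns = np.allclose(dW1_const, dW1_const[:, :1])

# Small random init: the columns differ, so the units can diverge.
dW1_rand = hidden_grads(0.01 * rng.randn(4, 3), 0.01 * rng.randn(3, 1), X, y)
distinct_columns = not np.allclose(dW1_rand, dW1_rand[:, :1])

print('identical gradients with constant init:', same_columns)
print('distinct gradients with random init:', distinct_columns)
```

Since identical columns receive identical updates, the hidden units stay copies of each other after every gradient step.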
(Gaussian with zero mean and 1e-2 standard deviation)
$W=0.01*np.random.randn(D, H)$
It is common to initialize the weights of the neurons to small random numbers and refer to doing so as symmetry breaking. The idea is that the neurons are all random and unique in the beginning, so they will compute distinct updates and integrate themselves as diverse parts of the full network.
This works okay for small networks, but causes problems for deep neural networks.
import numpy as np
import matplotlib.pyplot as plt
E.g. a 10-layer net with 500 neurons in each layer, using tanh non-linearities, and initializing as described in the last cell.
def init_weight_plot(init_func, act_name):
    # assume some unit Gaussian input data: 1000 samples, 500 dimensions
    D = np.random.randn(1000, 500)
    hidden_layer_sizes = [500] * 10
    nonlinearities = [act_name] * len(hidden_layer_sizes)
    act = {'relu' : lambda x : np.maximum(0, x), 'tanh' : lambda x : np.tanh(x)}
    Hs = {}
    for i in range(len(hidden_layer_sizes)):
        X = D if i == 0 else Hs[i - 1] # input at this layer
        fan_in = X.shape[1]
        fan_out = hidden_layer_sizes[i]
        W = init_func(fan_in, fan_out)
        H = np.dot(X, W) # matrix multiply
        H = act[nonlinearities[i]](H) # nonlinearity
        Hs[i] = H
    # look at distributions at each layer
    print('input layer had mean %f and std %f' % (np.mean(D), np.std(D)))
    layer_means = [np.mean(H) for i, H in Hs.items()]
    layer_stds = [np.std(H) for i, H in Hs.items()]
    for i, H in Hs.items():
        print('hidden layer %d had mean %f and std %f' % (i + 1, layer_means[i], layer_stds[i]))
    # plot the means and standard deviations
    plt.figure()
    plt.subplot(121)
    plt.plot(list(Hs.keys()), layer_means, 'ob-')
    plt.title('layer mean')
    plt.subplot(122)
    plt.plot(list(Hs.keys()), layer_stds, 'or-')
    plt.title('layer std')
    # plot the raw distributions
    fig, axes = plt.subplots(nrows = 1, ncols = 10, figsize=(15, 5))
    for i, ax in enumerate(axes.flat, start = 1):
        ax.hist(Hs[i - 1].ravel(), 30, range = (-1, 1))
    fig.tight_layout()
    plt.show()
Hint: think about the backward pass for a W*X gate.
func = lambda x, y: np.random.randn(x, y) * 0.01
init_weight_plot(func, 'tanh')
Ans: A neural network layer that has very small weights will, during backpropagation, compute very small gradients on its data (since this gradient is proportional to the value of the weights). This can greatly diminish the 'gradient signal' flowing backward through the network, and becomes a concern for deep networks.
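To make the proportionality concrete, here is a small sketch of the backward pass for a W*X gate. For $H = XW$ the gradient on the data is $dX = dH \cdot W^T$, so scaling $W$ scales $dX$ by the same factor. The shapes and the upstream gradient `dout` are made up for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
dout = rng.randn(100, 50)      # made-up upstream gradient dL/dH
W_base = rng.randn(50, 50)     # unit Gaussian weights, rescaled below

def grad_on_data(scale):
    """Gradient dL/dX for H = X.dot(W) with W = scale * W_base."""
    W = scale * W_base
    return dout.dot(W.T)

std_small = np.std(grad_on_data(0.01))
std_large = np.std(grad_on_data(1.0))
print('std of dL/dX with 0.01 scale: %f' % std_small)
print('std of dL/dX with 1.0 scale:  %f' % std_large)
```

In a deep network this shrinkage is applied once per layer, so the factors multiply as the gradient flows backward.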
func = lambda x, y: np.random.randn(x, y) * 1.0
init_weight_plot(func, 'tanh')
One problem with the above suggestion is that the distribution of the outputs from a randomly initialized neuron has a variance that grows with the number of inputs. It turns out that we can normalize the variance of each neuron's output to 1 by dividing its weight vector by the square root of its fan-in (i.e. its number of inputs). That is, the recommended heuristic is to initialize each neuron's weight vector with the following func.
func = lambda x, y: np.random.randn(x, y) / np.sqrt(x)
init_weight_plot(func, 'tanh')
A sketch of the derivation is as follows: consider the inner product $s=\sum^n_i{w_ix_i}$ between the weights $w$ and input $x$, which gives the raw activation of a neuron before the non-linearity. We can examine the variance of $s$.
$Var(s)=Var(\sum^n_i{w_ix_i})\\ \quad\quad=\sum^n_i{Var(w_ix_i)}\\ \quad\quad=\sum^n_i{[E(w_i)]^2Var(x_i)+[E(x_i)]^2Var(w_i)+Var(x_i)Var(w_i)}\\ \quad\quad=\sum^n_i{Var(x_i)Var(w_i)}\\ \quad\quad=(nVar(w))Var(x)$
where in the first 2 steps we have used properties of variance. In the third step we assumed zero-mean inputs and weights, so $E[x_i]=E[w_i]=0$. Note that this is not generally the case: for example, ReLU units will have a positive mean. In the last step we assumed that all $w_i$, $x_i$ are identically distributed. From this derivation we can see that if we want $s$ to have the same variance as all of its inputs $x$, then during initialization we should make sure that the variance of every weight $w$ is $1/n$, where $n$ is the number of its inputs. And since $Var(aX)=a^2Var(X)$ for a random variable $X$ and a scalar $a$, this implies that we should draw from a unit Gaussian and then scale it by $a=\sqrt{1/n}$, to make its variance $1/n$. This gives the initialization used above: a unit Gaussian divided by $\sqrt{n}$.
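The result $Var(s)=(nVar(w))Var(x)$ can be checked empirically. The sketch below assumes independent unit Gaussian weights and inputs; the fan-in `n` and the number of trials are arbitrary choices for illustration. It estimates the variance of $s$ with and without the $1/\sqrt{n}$ scaling.

```python
import numpy as np

rng = np.random.RandomState(0)
n = 200                # fan-in (arbitrary choice)
num_trials = 5000      # independent draws of (w, x)

x = rng.randn(num_trials, n)   # Var(x) = 1
w = rng.randn(num_trials, n)   # Var(w) = 1

# Without scaling, Var(s) should be close to n * Var(w) * Var(x) = n.
var_unscaled = np.var(np.sum(w * x, axis=1))
# With w divided by sqrt(n), Var(w) = 1/n and Var(s) should be close to 1.
var_scaled = np.var(np.sum((w / np.sqrt(n)) * x, axis=1))

print('Var(s) with unit Gaussian w:     %f (n = %d)' % (var_unscaled, n))
print('Var(s) with w scaled by 1/sqrt(n): %f' % var_scaled)
```

The estimates are noisy, so they will only match $n$ and $1$ approximately.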
func = lambda x, y: np.random.randn(x, y) / np.sqrt(x)
init_weight_plot(func, 'relu')
func = lambda x, y: np.random.randn(x, y) / np.sqrt(x / 2)
init_weight_plot(func, 'relu')
by Glorot and Bengio, 2010
The authors end up recommending an initialization of the form $Var(w)=2/(n_{in}+n_{out})$ where $n_{in}$, $n_{out}$ are the number of units in the previous layer and the next layer. This is based on a compromise and an equivalent analysis of the backpropagated gradients.
by Saxe et al., 2013
by Sussillo and Abbott, 2014
by He et al., 2015. This paper derives an initialization specifically for ReLU neurons, reaching the conclusion that the variance of neurons in the network should be $2.0/n$.
by Krähenbühl et al., 2015
by Mishkin and Matas, 2015
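The two formulas quoted above, $Var(w)=2/(n_{in}+n_{out})$ from Glorot and Bengio and $2.0/n$ from He et al., can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the papers' reference code: the function names, the Gaussian sampling, and the layer sizes are choices made here. The second half pushes unit Gaussian data through a stack of ReLU layers and compares the mean squared activation under the $2/n$ and $1/n$ initializations.

```python
import numpy as np

def glorot_init(fan_in, fan_out, rng):
    # Var(w) = 2 / (n_in + n_out), the form quoted from Glorot and Bengio
    return rng.randn(fan_in, fan_out) * np.sqrt(2.0 / (fan_in + fan_out))

def he_init(fan_in, fan_out, rng):
    # Var(w) = 2 / n_in, the form quoted from He et al.
    return rng.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)

def fan_in_init(fan_in, fan_out, rng):
    # Var(w) = 1 / n_in, the heuristic derived for zero-mean activations
    return rng.randn(fan_in, fan_out) / np.sqrt(fan_in)

def relu_second_moment(init, num_layers=10, width=500, seed=0):
    """Mean squared activation after a stack of ReLU layers."""
    rng = np.random.RandomState(seed)
    H = rng.randn(1000, width)              # unit Gaussian input data
    for _ in range(num_layers):
        H = np.maximum(0, H.dot(init(width, width, rng)))
    return np.mean(H ** 2)

rng = np.random.RandomState(0)
glorot_var = np.var(glorot_init(300, 700, rng))   # target: 2 / 1000 = 0.002
he_moment = relu_second_moment(he_init)
plain_moment = relu_second_moment(fan_in_init)

print('Glorot weight variance: %f' % glorot_var)
print('ReLU second moment, 2/n init: %f' % he_moment)
print('ReLU second moment, 1/n init: %f' % plain_moment)
```

Because both runs share a seed and ReLU is positively homogeneous, the two second moments differ by a factor of about $2^{10}$: relative to the $2/n$ init, the $1/n$ init loses a factor of 2 at every one of the 10 layers.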