I have a very simple feed-forward, back-propagation, fully connected MLP that uses tanh as the activation function.
I read about Xavier weight initialisation and wanted to try it out, but I'm not sure I'm doing it right.
#include <random>

double generateXavierWeight(size_t prev_layer_size, size_t this_layer_size)
{
    static std::mt19937 gen(std::random_device{}());
    double var = 2.0 / double(prev_layer_size + this_layer_size);
    double dev = var * var;
    std::normal_distribution<> d(0, dev);
    return d(gen);
}
The C++ std::normal_distribution takes a mean and a standard deviation. The Xavier paper talks about the variance being 2 / (nin + nout), so I assumed I could take that and square it to get the standard deviation.
The weights I'm getting (my network has layers of 1024 neurons) are tiny, and the network doesn't seem to be learning at all.
I was previously using a uniform distribution over [-0.07, 0.07], and the network was learning somewhat with those weights.
Can anyone give me a clue if I am doing something wrong here?
The paper is here