Gradient computation problem in a neural network


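I'm having trouble checking the gradient computation while implementing a neural network in Python with numpy. I'm using the MNIST dataset and trying mini-batch gradient descent.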

I've checked the math and it looks good on paper, so maybe you can give me a hint about what's going on here:

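EDIT: One answer made me realize that the cost function was indeed being computed incorrectly. However, that does not explain the gradient problem, since the gradient is computed with back_prop. Using 300 units in the hidden layer and mini-batch gradient descent with rmsprop, 30 epochs and 100 batches, I get a 7% error rate (learning_rate = 0.001, kept small because of rmsprop).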

Each input has 768 features, so for 100 samples I have a matrix. MNIST has 10 classes.

X = NoSamplesxFeatures = 100x768

Y = NoSamplesxClasses = 100x10
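To make these shapes concrete, here is a minimal sketch of how one such batch could be laid out. The helper `one_hot` is a hypothetical stand-in for the `make_1_of_c_encoding` method used in the code below (its body is not shown here), and the zero-filled `X` and cyclic `labels` are placeholder data, not MNIST.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Turn integer class labels into an N x num_classes matrix of 0s and 1s.

    Hypothetical stand-in for make_1_of_c_encoding, whose body is not shown.
    """
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, num_classes))
    # Set a single 1 per row, in the column given by that row's label.
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

num_samples, num_features = 100, 768   # sizes quoted in the question
X = np.zeros((num_samples, num_features))   # placeholder inputs
labels = np.arange(num_samples) % 10        # placeholder labels 0..9
Y = one_hot(labels)

print(X.shape, Y.shape)
```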

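I'm using a neural network with one hidden layer, with a hidden layer size of 300 when fully training. Another question I have is whether I should use a softmax output function to compute the error... I think not. But I'm new to all of this, so the obvious may well seem strange to me.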

    import numpy as np

    def sigmoid(z):
        return np.true_divide(1, 1 + np.exp(-z))

    # Not really calculated: this is the shortcut version to make it faster
    # (it expects the activation a = sigmoid(z), not z itself).
    def sigmoid_prime(a):
        return a * (1 - a)

    def _back_prop(self, W, X, labels, f=sigmoid, fprime=sigmoid_prime, lam=0.001):
        """
        Calculate the partial derivatives of the cost function using backpropagation.
        """
        # Weights for the first layer and the hidden layer
        Wl1, bl1, Wl2, bl2 = self._extract_weights(W)

        # Get the forward prop values
        layers_outputs = self._forward_prop(W, X, f)

        # From a number make a binary vector; for MNIST, 1x10 with all 0s except at the number.
        y = self.make_1_of_c_encoding(labels)
        num_samples = X.shape[0]  # layers_outputs[-1].shape[0]

        # Dot product returns NumSamples (N) x Outputs (number of classes)
        # Y is N x number of classes
        # Layers output to
        big_delta = np.zeros(Wl2.size + bl2.size + Wl1.size + bl1.size)
        big_delta_wl1, big_delta_bl1, big_delta_wl2, big_delta_bl2 = self._extract_weights(big_delta)

        # Calculate the gradient for each training sample in the batch and accumulate it
        for i, x in enumerate(X):
            # Error with respect to the output
            dE_dy = layers_outputs[-1][i, :] - y[i, :]

            # Bias of the hidden layer
            big_delta_bl2 += dE_dy

            # Get the error for the hidden layer
            dE_dz_out = dE_dy * fprime(layers_outputs[-1][i, :])

            # And for the input layer
            dE_dhl = dE_dy.dot(Wl2.T)

茅侃侃


2 Answers

互换的青春

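What sort of results do you get when you run gradient checking? Often you can work out the nature of an implementation error by comparing the output of your gradient with the output produced by gradient checking.

Furthermore, squared error is usually a poor choice for a classification task such as MNIST, and I'd suggest using either a simple sigmoid top layer or softmax. With sigmoid, the cross-entropy function you want to use is:

    L(h,Y) = -Y*log(h) - (1-Y)*log(1-h)

For softmax:

    L(h,Y) = -sum(Y*log(h))

where Y is the target given as a 1x10 vector and h is your predicted value; this extends easily to arbitrary batch sizes.

In both cases, the top layer delta simply becomes:

    delta = h - Y

And the top layer gradient becomes:

    grad = dot(delta, A_in)

where A_in is the input to the top layer from the previous layer.

While I'm having some trouble wrapping my head around your backprop routine, I suspect from your code that the gradient error comes from not calculating the top-level dE/dw_l2 correctly when using squared error, along with computing fprime on the wrong input.

When using squared error, the top layer delta should be:

    delta = (h - Y) * fprime(Z_l2)

Z_l2 is the input to your transfer function for layer 2. Similarly, when computing fprime for lower layers, you want to use the input to the transfer function (i.e. dot(X, weights_L1) + bias_L1).

Hope that helps.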
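A gradient check of the kind mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the asker's implementation: it uses a tiny one-hidden-layer sigmoid network on random data, a squared-error cost averaged over the batch with no regularization, and invented names (`forward`, `cost`, `backprop`, `numerical_grad`). The top layer delta is (h - Y) * fprime(Z_l2), where for a sigmoid fprime(Z_l2) equals h * (1 - h).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, X):
    W1, b1, W2, b2 = params
    a1 = sigmoid(X @ W1 + b1)   # hidden activations, N x H
    h = sigmoid(a1 @ W2 + b2)   # network outputs, N x C
    return a1, h

def cost(params, X, Y):
    # Assumed cost: squared error summed over classes, averaged over the batch.
    _, h = forward(params, X)
    return 0.5 * np.sum((h - Y) ** 2) / X.shape[0]

def backprop(params, X, Y):
    W1, b1, W2, b2 = params
    n = X.shape[0]
    a1, h = forward(params, X)
    # The sigmoid derivative at z, written via the activation: a * (1 - a).
    delta2 = (h - Y) * h * (1.0 - h)             # N x C
    delta1 = (delta2 @ W2.T) * a1 * (1.0 - a1)   # N x H
    return [X.T @ delta1 / n, delta1.sum(axis=0) / n,
            a1.T @ delta2 / n, delta2.sum(axis=0) / n]

def numerical_grad(params, X, Y, eps=1e-5):
    # Central finite differences, perturbing one parameter at a time.
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps
            c_plus = cost(params, X, Y)
            p[idx] = old - eps
            c_minus = cost(params, X, Y)
            p[idx] = old   # restore the original value
            g[idx] = (c_plus - c_minus) / (2.0 * eps)
        grads.append(g)
    return grads

# Tiny random problem: 5 samples, 4 features, 6 hidden units, 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
Y = np.eye(3)[rng.integers(0, 3, size=5)]
params = [rng.normal(scale=0.1, size=(4, 6)), np.zeros(6),
          rng.normal(scale=0.1, size=(6, 3)), np.zeros(3)]

analytic = backprop(params, X, Y)
numeric = numerical_grad(params, X, Y)
for name, a, m in zip(["W1", "b1", "W2", "b2"], analytic, numeric):
    print(name, "max abs difference:", np.max(np.abs(a - m)))
```

If the two gradients disagree only for one parameter block (for example only the second-layer weights), that usually narrows the bug down to the delta feeding that block.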
