How do you tune CNN hyperparameters? Can an ancient CPU work with the latest GPU for deep learning?

Published: 2017-12-25 21:00:01


I'm new to convolutional neural networks, and tuning the hyperparameters seems to require a lot of experience. Can anyone recommend good papers on hyperparameter tuning?

The original post is on Yisong Yue's blog:

Since blogspot is blocked, you can also read it via William Wang's (王威廉) Weibo.

A deep learning overview and practical advice from Ilya Sutskever, Google scientist and direct student of Hinton.

===================

(The original post is very long; only the practical-advice section is excerpted here. I'm answering from my phone, so please forgive the rough formatting.)

Here is a summary of the community’s knowledge of what’s important and what to look after:

Get the data: Make sure that you have a high-quality dataset of input-output examples that is large, representative, and has relatively clean labels. Learning is completely impossible without such a dataset.

Preprocessing: it is essential to center the data so that its mean is zero and so that the variance of each of its dimensions is one. Sometimes, when the input dimension varies by orders of magnitude, it is better to take the log(1 + x) of that dimension. Basically, it's important to find a faithful encoding of the input with zero mean and sensibly bounded dimensions. Doing so makes learning work much better. This is the case because the weights are updated by the formula Δw_ij ∝ x_i · ∂L/∂y_j (where w denotes the weights from layer x to layer y, and L is the loss function). If the average value of the x's is large (say, 100), then the weight updates will be very large and correlated, which makes learning bad and slow. Keeping things zero-mean and with small variance simply makes everything work much better.
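As a rough illustration, here is a minimal NumPy sketch of that preprocessing. The toy array X_raw and the choice of which dimension gets the log(1 + x) treatment are made up for the example, not taken from the original text.

```python
import numpy as np

def preprocess(X, heavy_tailed_dims=()):
    X = X.astype(np.float64)
    # Compress dimensions that span orders of magnitude (assumes they are non-negative).
    for d in heavy_tailed_dims:
        X[:, d] = np.log1p(X[:, d])          # log(1 + x)
    # Center to zero mean and scale each dimension to unit variance.
    mean = X.mean(axis=0)
    std = X.std(axis=0) + 1e-8               # guard against constant dimensions
    return (X - mean) / std, mean, std       # keep mean/std to reuse on test data

X_raw = np.abs(np.random.randn(1000, 4)) * np.array([1.0, 10.0, 1000.0, 1.0])
X, mean, std = preprocess(X_raw, heavy_tailed_dims=(2,))
print(X.mean(axis=0).round(3), X.std(axis=0).round(3))   # roughly 0 and 1 per dimension
```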

Minibatches: Use minibatches. Modern computers cannot be efficient if you process one training case at a time. It is vastly more efficient to train the network on minibatches of 128 examples, because doing so will result in massively greater throughput. It would actually be nice to use minibatches of size 1, as they would probably result in improved performance and lower overfitting; but the benefit of doing so is outweighed by the massive computational gains provided by minibatches. But don't use very large minibatches because they tend to work less well and overfit more. So the practical recommendation is: use the smallest minibatch size that still runs efficiently on your machine.

Gradient normalization: Divide the gradient by minibatch size. This is a good idea because of the following pleasant property: you won’t need to change the learning rate (not too much, anyway), if you double the minibatch size (or halve it).
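A minimal sketch of both of the last two points (minibatches of 128, and dividing the summed gradient by the minibatch size). The linear least-squares model and the toy data are illustrative stand-ins, not part of the original advice.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.1, batch_size=128, epochs=10):
    w = 0.02 * np.random.randn(X.shape[1])
    for _ in range(epochs):
        perm = np.random.permutation(len(X))
        for start in range(0, len(X), batch_size):
            idx = perm[start:start + batch_size]
            xb, yb = X[idx], y[idx]
            grad = xb.T @ (xb @ w - yb) / len(idx)   # summed gradient divided by the minibatch size
            w -= lr * grad
    return w

X = np.random.randn(10000, 20)
y = X @ np.random.randn(20) + 0.1 * np.random.randn(10000)
w = minibatch_sgd(X, y)
```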

Learning rate schedule: Start with a normal-sized learning rate (LR) and reduce it towards the end.

A typical value of the LR is 0.1. Amazingly, 0.1 is a good value of the learning rate for a large number of neural network problems. Learning rates frequently tend to be smaller but rarely much larger.

Use a validation set (a subset of the training set on which we don't train) to decide when to lower the learning rate and when to stop training (e.g., when the error on the validation set starts to increase).

A practical suggestion for a learning rate schedule: if you see that you have stopped making progress on the validation set, divide the LR by 2 (or by 5), and keep going. Eventually, the LR will become very small, at which point you will stop your training. Doing so helps ensure that you won't be (over-)fitting the training data to the detriment of validation performance, which happens easily and often. Also, lowering the LR is important, and the above recipe provides a useful approach to controlling it via the validation set.
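A sketch of this schedule on a toy problem follows; the linear model, data sizes, and the 1e-5 stopping threshold are all illustrative assumptions, not from the original text.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))
y = X @ rng.standard_normal(10) + 0.1 * rng.standard_normal(1000)
X_tr, y_tr, X_val, y_val = X[:800], y[:800], X[800:], y[800:]

w = 0.02 * rng.standard_normal(10)
lr, best_val = 0.1, float("inf")
while lr > 1e-5:                                   # stop once the LR has become very small
    for i in range(0, len(X_tr), 128):             # one epoch of minibatch SGD
        xb, yb = X_tr[i:i + 128], y_tr[i:i + 128]
        w -= lr * xb.T @ (xb @ w - yb) / len(xb)
    val_err = np.mean((X_val @ w - y_val) ** 2)
    if val_err < best_val:
        best_val = val_err
    else:
        lr /= 2                                    # no progress on the validation set: divide the LR by 2
print("final LR:", lr, "best validation MSE:", best_val)
```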

But most importantly, worry about the learning rate. One useful idea used by some researchers (e.g., Alex Krizhevsky) is to monitor the ratio between the update norm and the weight norm. This ratio should be around 10^-3. If it is much smaller then learning will probably be too slow, and if it is much larger then learning will be unstable and will probably fail.
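A short sketch of that monitoring trick. The weight matrix, gradient, and their magnitudes here are hypothetical stand-ins for one layer of a real network.

```python
import numpy as np

def update_to_weight_ratio(W, grad, lr):
    """Norm of the SGD update relative to the norm of the weights it is applied to."""
    return lr * np.linalg.norm(grad) / np.linalg.norm(W)

# Hypothetical weight matrix and gradient for one layer.
W = 0.02 * np.random.randn(256, 128)
grad = 1e-4 * np.random.randn(256, 128)
ratio = update_to_weight_ratio(W, grad, lr=0.1)
print(f"update/weight norm ratio: {ratio:.1e}")
# If this stays far below ~1e-3, learning is probably too slow (consider raising the LR);
# if it is far above, learning will likely be unstable (lower the LR).
```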

Weight initialization. Worry about the random initialization of the weights at the start of learning.

If you are lazy, it is usually enough to do something like 0.02 * randn(num_params). A value at this scale tends to work surprisingly well over many different problems. Of course, smaller (or larger) values are also worth trying.

If it doesn't work well (say your neural network architecture is unusual and/or very deep), then you should initialize each weight matrix with init_scale / sqrt(layer_width) * randn. In this case init_scale should be set to 0.1 or 1, or something like that.
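A minimal NumPy sketch of the two initializations just described; the sizes (num_params, layer_width) are arbitrary placeholders.

```python
import numpy as np

num_params, layer_width, init_scale = 10000, 512, 0.1

# "Lazy" initialization: small Gaussian noise at a fixed scale.
w_flat = 0.02 * np.random.randn(num_params)

# Width-aware initialization for deep or unusual architectures.
W = init_scale / np.sqrt(layer_width) * np.random.randn(layer_width, layer_width)

print(w_flat.std().round(3), W.std().round(4))   # ~0.02 and ~init_scale / sqrt(layer_width)
```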

Random initialization is super important for deep and recurrent nets. If you don’t get it right, then it’ll look like the network doesn’t learn anything at all. But we know that neural networks learn once the conditions are set.

Fun story: researchers believed, for many years, that SGD cannot train deep neural networks from random initializations. Every time they would try it, it wouldn’t work. Embarrassingly, they did not succeed because they used the “small random weights” for the initialization, which works great for shallow nets but simply doesn’t work for deep nets at all. When the nets are deep, the many weight matrices all multiply each other, so the effect of a suboptimal scale is amplified.

But if your net is shallow, you can afford to be less careful with the random initialization, since SGD will just find a way to fix it.

You’re now informed. Worry and care about your initialization. Try many different kinds of initialization. This effort will pay off. If the net doesn’t work at all (i.e., never “gets off the ground”), keep applying pressure to the random initialization. It’s the right thing to do.

If you are training RNNs or LSTMs, use a hard constraint over the norm of the gradient (remember that the gradient has been divided by batch size). Something like 15 or 5 works well in practice in my own experiments. Take your gradient, divide it by the size of the minibatch, and check if its norm exceeds 15 (or 5). If it does, then shrink it until it is 15 (or 5). This one little trick makes a huge difference in the training of RNNs and LSTMs, where otherwise the exploding gradient can cause learning to fail and force you to use a puny learning rate like 1e-6, which is too small to be useful.
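A small sketch of this gradient-norm constraint (gradient clipping). The list of per-parameter gradients is hypothetical and assumed to have already been divided by the minibatch size.

```python
import numpy as np

def clip_gradient(grads, max_norm=15.0):
    # Global norm over all parameter gradients.
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]    # shrink so the global norm equals max_norm
    return grads

grads = [10 * np.random.randn(256, 128), 10 * np.random.randn(128)]
clipped = clip_gradient(grads, max_norm=15.0)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))   # <= 15
```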

Numerical gradient checking: If you are not using Theano or Torch, you'll probably be implementing your own gradients. It is easy to make a mistake when we implement a gradient, so it is absolutely critical to use numerical gradient checking. Doing so will give you complete peace of mind and confidence in your code. You will know that you can invest effort in tuning the hyperparameters (such as the learning rate and the initialization) and be sure that your efforts are channeled in the right direction.
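Here is one common way to do such a check, a central finite-difference sketch; the quadratic loss used to exercise it is only an example.

```python
import numpy as np

def numerical_gradient(loss, params, eps=1e-5):
    # Central finite differences: perturb one parameter at a time.
    grad = np.zeros_like(params)
    for i in range(len(params)):
        p_plus, p_minus = params.copy(), params.copy()
        p_plus[i] += eps
        p_minus[i] -= eps
        grad[i] = (loss(p_plus) - loss(p_minus)) / (2 * eps)
    return grad

# Example: check an analytic gradient for a simple quadratic loss 0.5 * w'Aw (gradient A w).
A = np.random.randn(5, 5)
A = A @ A.T                                   # make A symmetric
loss = lambda w: 0.5 * w @ A @ w
w = np.random.randn(5)
analytic = A @ w
numeric = numerical_gradient(loss, w)
print(np.max(np.abs(analytic - numeric)))     # should be very small
```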

If you are using LSTMs and you want to train them on problems with very long range dependencies, you should initialize the biases of the forget gates of the LSTMs to large values. By default, the forget gates are the sigmoids of their total input, and when the weights are small, the forget gate is set to 0.5, which is adequate for some but not all problems. This is the one non-obvious caveat about the initialization of the LSTM.
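A tiny sketch of that bias initialization. It assumes the common layout where the biases of the four gates (input, forget, cell, output) are stored as one concatenated vector; the layout, hidden size, and the value 5.0 are assumptions for illustration.

```python
import numpy as np

hidden_size = 128
b = np.zeros(4 * hidden_size)                 # biases for input, forget, cell, output gates
b[hidden_size:2 * hidden_size] = 5.0          # forget-gate slice: sigmoid(5) ≈ 0.993,
                                              # so the cell state persists early in training
```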

Data augmentation: be creative, and find ways to algorithmically increase the number of training cases that are at your disposal. If you have images, then you should translate and rotate them; if you have speech, you should combine clean speech with all types of random noise; etc. Data augmentation is an art (unless you're dealing with images). Use common sense.
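Two minimal examples of the kind of augmentation meant here (random image translation and additive noise for audio); the toy arrays and parameter values are made up.

```python
import numpy as np

def random_translate(img, max_shift=4):
    # Circular shift as a cheap stand-in for shifting/cropping an image.
    dy, dx = np.random.randint(-max_shift, max_shift + 1, size=2)
    return np.roll(img, (dy, dx), axis=(0, 1))

def add_noise(audio, noise_level=0.1):
    # Mix the clean signal with random noise.
    return audio + noise_level * np.random.randn(*audio.shape)

img = np.random.rand(32, 32, 3)                       # hypothetical image
audio = np.sin(np.linspace(0, 100, 16000))            # hypothetical waveform
augmented_img, augmented_audio = random_translate(img), add_noise(audio)
```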

Dropout. Dropout provides an easy way to improve performance. It's trivial to implement and there's little reason not to do it. Remember to tune the dropout probability, and don't forget to turn off dropout and to multiply the weights by the keep probability (namely by 1 - the dropout probability) at test time. Also, be sure to train the network for longer. Unlike normal training, where the validation error often starts increasing after prolonged training, dropout nets keep getting better and better the longer you train them. So be patient.
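A minimal sketch of the train/test asymmetry described above, assuming the variant where weights are rescaled at test time; the layer sizes are hypothetical.

```python
import numpy as np

p_drop = 0.5     # dropout probability (tune this)

def dropout_train(h):
    # During training: each unit of h is dropped (zeroed) with probability p_drop.
    mask = np.random.rand(*h.shape) >= p_drop
    return h * mask

def weights_for_test(W_out):
    # At test time: dropout is off, and the outgoing weights of the dropped layer
    # are multiplied by the keep probability (1 - dropout probability).
    return W_out * (1.0 - p_drop)

h = np.random.randn(128)           # hypothetical hidden-layer activations
W_out = np.random.randn(64, 128)   # hypothetical weights from that layer to the next
h_dropped = dropout_train(h)
W_test = weights_for_test(W_out)
```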

Ensembling. Train 10 neural networks and average their predictions. It's a fairly trivial technique that results in easy, sizeable performance improvements. One may be mystified as to why averaging helps so much, but there is a simple reason for the effectiveness of averaging. Suppose that two classifiers have an error rate of 70%. Then, when they agree they are right. But when they disagree, one of them is often right, so now the average prediction will place much more weight on the correct answer. The effect will be especially strong whenever the network is confident when it's right and unconfident when it's wrong.
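A sketch of the averaging itself. The random softmax "models" below are stand-ins for independently trained networks, just to show the mechanics of averaging predicted probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))                   # hypothetical test inputs

def make_model():
    # Stand-in for one trained network: a random linear softmax classifier.
    W = 0.02 * rng.standard_normal((20, 3))
    def predict_proba(X):
        logits = X @ W
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)
    return predict_proba

models = [make_model() for _ in range(10)]           # "train 10 neural networks"
avg_probs = np.mean([m(X) for m in models], axis=0)  # average their predictions
labels = avg_probs.argmax(axis=1)
```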

I am pretty sure that I haven't forgotten anything. The above 13 points cover literally everything that's needed in order to train LDNNs successfully.

So, to Summarize:

LDNNs are powerful.

LDNNs are trainable if we have a very fast computer.

So if we have a very large high-quality dataset, we can find the best LDNN for the task.

Which will solve the problem, or at least come close to solving it.
