Troubleshooting Neural Networks

Training a neural network is a lot like trying to decide between having pancakes and waffles for breakfast...there's no right answer. Many different frameworks and training methods will give you adequate results. However, we might choose certain approaches because they are simpler, more robust, easier to understand, or give us nice uncertainty estimates. While exploring these different methodologies we often run into problems. I've hit a variety of different problems while trying to train neural nets, ranging from simple numerical errors to much more subtle issues with custom loss functions. I've taken some of my favorite neural net troubleshooting guides along with my own advice and synthesized it below to help anyone who finds themselves in the same boat.

Sources:

Getting Started

  1. Get a good baseline. You should have simpler models to compare your model performance to (typically there is a standard benchmark for your use case, such as logistic regression for classification tasks). Don't just compare the accuracy of these models; look at F scores, ROC curves, PR curves, etc. These will give you clues about inconsistencies/instabilities in model performance. (See the sketch after this list.)
  2. Try a pre-trained model. (like VGG for images) There are already a variety of very powerful pre-trained neural nets out there that you can use! Transfer learning is a powerful tool in a modeler's toolbox. This pre-trained model could even be your benchmark against your 100% custom model.
  3. Start simple. Drop the regularization and data augmentation techniques. Bells and whistles can obfuscate the real issue with your model. There have been numerous times when I only caught the issue with my model after cutting it down to a couple of small layers without any regularization.
  4. You don't always need custom code. I've dealt with some crazy custom code and it can be tricky. A myriad of problems can pop up, from simple math errors to how you are formatting your tensors. Try out-of-the-box code first and try to get that working as well as possible.
  5. Look at the data. (graph it if possible) You should understand your data. Does it need to be normalized? Do you have enough of it? Does it display heteroscedastic behavior? For time series, is it stationary? Is the neural net architecture you chose suited for that type of data? Why or why not?
  6. Fewer samples and fewer features. Debugging issues becomes a lot easier with smaller data. You can even try intentionally overfitting to your subsampled dataset. If your neural net can't overfit to the data then something is seriously wrong with your setup and you need to figure out what before scaling up to the full dataset.
  7. It's not always about wider or deeper. Many modelers will start debugging model performance by adding more layers or more hidden units to their model. More often than not the real answer is feature engineering or regularization. You can also try embeddings or convolutions. Don't get stuck in the mindset that all problems are solved by deeper nets. Deeper nets are harder to train and prone to all sorts of gradient issues.
  8. Add in complexity one piece at a time. If there are 5 cool features you want to add to your model, try adding them one at a time. That way you'll know what caused your error...the last feature you added! For example, if I want a custom regularization like concrete dropout and a custom loss function, I'll add the loss function and then the dropout.
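As an illustration of point 1, here is a minimal baseline sketch using scikit-learn. The synthetic dataset, the 0.5 threshold, and the logistic regression settings are my own illustrative assumptions; swap in your real data and whatever benchmark fits your use case.

```python
# Minimal baseline sketch: a simple model plus several metrics, not just accuracy.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; replace with your own train/test split.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = baseline.predict_proba(X_test)[:, 1]
preds = (probs >= 0.5).astype(int)

print("F1:     ", f1_score(y_test, preds))
print("ROC AUC:", roc_auc_score(y_test, probs))
print("PR AUC: ", average_precision_score(y_test, probs))
# Compute the same metrics for your neural net; a net that only beats the
# baseline on accuracy (but not F1/AUC) is a clue something is off.
```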

Data Issues

  1. Try random noise. Your trained model should spit out nonsense if fed noise. If it doesn't it might mean that you've messed up a tensor definition or some math component in your model.
  2. Try normalizing to 0-1 scale or to $\mathcal{N}(0,1)$. The scale of your features can impact the likelihood of getting stuck in a local minimum when training (just think about the topography of the loss function with differently scaled data). I've found that sometimes scaling input data can take an ok model and make it amazing. (See the first sketch after this list.)
  3. Check the data generator. We often use custom functions to feed batches to our neural net. Sometimes the data and the model are fine but the data generator has an error in it. Try looking at a batch and making sure it has the dimensions you want and that the data is what you expect. (See the second sketch after this list.)
  4. Shuffle the data. (do this before splitting into test, train, and validation) Patterns in your data can throw off training. For example, if you only have a single class in some of your training batches!
  5. Downsample to fix class imbalance. Downsampling is (in my experience) often the best way to handle class imbalance issues in deep learning. We often have enough samples that downsampling isn't a bad idea. If downsampling doesn't help then you can try upsampling and custom loss functions. Start simple and then go complex.
  6. Ensure you have enough data. Image and audio deep learning problems often require at least 1000 samples per class. If you don't have enough data, you might want to consider an approach other than deep learning. Neural nets aren't the be-all and end-all of data science.
  7. Results too good to be true? Ensure you don't have future bleed (data leaking in from the future, or any other leakage of the target into your features). I've had numerous young modelers come to me with results that are too good to believe. Often it is a result of accidentally adding in a feature that they shouldn't have. When you feed thousands of variables into a model this is easy to do. Be skeptical of good results as well as bad results.
  8. Try with standardized datasets first. (MNIST, CIFAR, Boston Housing, etc.) There are datasets that the industry uses to benchmark new modeling approaches. These are good for benchmarking your model performance. There is also a ton of material about how you should treat these datasets and model them, so you have plenty of guides to help debug common problems.
  9. Too much of a good thing can be bad. Excess data augmentation can have a regularizing effect. We can go crazy with data augmentation. This can drown out certain signals in the data or have regularizing effects that we don't intend.
  10. Check preprocessing. Is your pixel data between [0,255]? Is your audio recording in the range of human hearing? Did you accidentally drop all of your negative data samples? Is preprocessing uniform between train, test, and validation sets? You should process your data as you would in production. Normalizing your test and train sets independently is a bad idea. Normalize your training set and apply the same method to your test set.
  11. Cast your data to the right precision (8, 16, 32, or 64 bit). Loss of precision can drown out important signals. However, excess precision takes up unneeded space that could be valuable. I often cast data as 16 or 32 bit instead of 64 so that I can fit bigger batches on my GPUs.
  12. Watch for high cardinality for one-hot categoricals. High cardinality means that a categorical variable has a lot of possible values. Neural nets are notoriously bad at handling very high cardinality. Often we try to use embeddings to solve this. Deeply nested categoricals present similar issues.
  13. Double check your feature selection...too much or too little? In pursuit of simplicity you might be removing important features accidentally. For example, some features have very important secondary or tertiary interactions with other features even though they are unimportant in and of themselves. You may not see those relationships using standard causality and correlation tools, but you're using a NN precisely because it's more complex than the standard tools! Additionally, too many features may cause your NN to overfit to noise. There is no rule of thumb on how many features you need, and this is where deep learning becomes more of an art than a science.
  14. Autoencoders can also be used for denoising/feature engineering/embedding of your data. I would use these after the aforementioned techniques.
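For points 2 and 10, here is a minimal sketch of fit-on-train-only scaling, assuming scikit-learn's StandardScaler; the random data and shapes are placeholders for your own features.

```python
# Fit the scaler on the training set only, then reuse it for validation/test.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(1000, 5))  # stand-in features
X_test = rng.normal(loc=50.0, scale=10.0, size=(200, 5))

scaler = StandardScaler().fit(X_train)      # statistics come from train only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)    # reuse the train statistics here

# Fitting a separate scaler on the test set would leak information and put
# the two sets on different scales.
```

For point 3, a sketch of pulling a single batch out of a generator and inspecting it before training. The `batch_generator` function here is a hypothetical stand-in for whatever generator you actually use.

```python
# Inspect one batch from a data generator before wiring it into training.
import numpy as np

def batch_generator(X, y, batch_size=32):
    """Toy generator: yields random batches forever."""
    while True:
        idx = np.random.choice(len(X), size=batch_size, replace=False)
        yield X[idx], y[idx]

X = np.random.rand(500, 28, 28, 1)          # fake image data
y = np.random.randint(0, 10, size=500)      # fake labels

xb, yb = next(batch_generator(X, y))
print(xb.shape, yb.shape)   # expect (32, 28, 28, 1) and (32,)
print(xb.min(), xb.max())   # values in the range you expect?
print(np.unique(yb))        # more than one class per batch?
```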

Implementation Issues

  1. Try a simpler problem and work up to your use case. Fewer classes. Smaller networks. If you have a 10-class classification problem, try formatting it as a binary classification problem first. If working on sequence problems, try an RNN, then a GRU, then an LSTM. If working on a time series problem, start with a fully connected feed-forward network, then a CNN, and then an RNN/GRU/LSTM (I've found CNNs to be best for most time series problems).
  2. Does your output make sense? Does your NN predict only one class? Is your prediction random? Is the activation function of your output layer what it should be (i.e. sigmoid for binary classification)? Patterns in your outputs are good clues about what could be going wrong in your network.
  3. Does your loss function make sense? Is the math correct? I can't tell you how many times a missing negative sign has messed up one of my models. Make sure you aren't doing something stupid like using binary cross entropy instead of categorical cross entropy for a multi-class problem.
  4. Loss isn't everything. Monitor accuracy, precision, recall, etc. Look at a variety of dimensions during training. Also, look at a train and validation set to check for overfitting.
  5. Test custom components independently. For example, a custom loss function can be tested on fake predictions to ensure it behaves properly. You don't have to just stick it into a model untested. Try testing model components like you would unit test functions in your code...because that's what they are! (See the first sketch after this list.)
  6. If you've already tried a simpler approach, try deepening or widening your network. I know that I previously said that this isn't always the answer...but sometimes it is!
  7. Is your common sense tingling? Check that the dimensions of your data, input layers, hidden layers, and output layer make sense. I've caught a number of silly copy-paste errors this way.
  8. Check your gradients. Exploding and vanishing gradients can be big problems. Exploding gradients are where your loss gradient is so steep that it blows your model weights up toward infinity during backprop. Vanishing gradients are the mirror image: when many small gradients get multiplied together during backprop (think very deep networks), the gradient shrinks toward zero and the early layers stop learning. (See the second sketch after this list.)
  9. Check your math. (especially for Bayes nets) Are your standard deviations constrained to positive numbers? Do you have all the right negative signs? Did you distribute terms correctly and apply transformations along the correct axis?
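For point 5, here is a sketch of unit-testing a custom loss function on hand-made inputs, assuming PyTorch. The `masked_mse` loss is purely illustrative; the point is that you can assert known-answer behavior before the loss ever touches a model.

```python
# Unit-test a custom loss on tiny, hand-made tensors.
import torch

def masked_mse(pred, target, mask):
    """MSE that ignores entries where mask == 0."""
    diff = (pred - target) ** 2 * mask
    return diff.sum() / mask.sum().clamp(min=1)

pred = torch.tensor([1.0, 2.0, 3.0])
target = torch.tensor([1.0, 2.0, 5.0])
full_mask = torch.ones(3)

# Behaves like plain MSE when nothing is masked: (0 + 0 + 4) / 3.
assert torch.isclose(masked_mse(pred, target, full_mask), torch.tensor(4.0 / 3))

# Masked entries must not contribute to the loss.
mask = torch.tensor([1.0, 1.0, 0.0])
assert masked_mse(pred, target, mask) == 0.0

# A perfect prediction must give exactly zero, never a negative loss.
assert masked_mse(target, target, full_mask) == 0.0
```

For point 8, a sketch of checking per-layer gradient norms after a single backward pass, assuming PyTorch and a toy model.

```python
# Print per-layer gradient norms to spot vanishing or exploding gradients.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 1))
x = torch.randn(16, 10)
y = torch.randn(16, 1)

loss = nn.MSELoss()(model(x), y)
loss.backward()

for name, param in model.named_parameters():
    grad_norm = param.grad.norm().item()
    print(f"{name:20s} grad norm = {grad_norm:.2e}")
# Norms near zero (vanishing) or enormous (exploding) point at the layers
# where training is breaking down.
```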

Training Issues

  1. Can you overfit to a smaller dataset? If your NN can't overfit to a small dataset then something is fundamentally wrong with your code/math. Take a look. NNs should be prone to overfitting, and if yours can't overfit...that's bad. (See the first sketch after this list.)
  2. Issues initializing NN weights. How are you initializing your weights? Are you sampling from a Gaussian? From a uniform distribution? If you have a Bayes net, what does your prior look like?
  3. Check your hyperparameters. (especially learning rate) Do your hyperparameters make sense? Could your learning rate be too big?
  4. Beware over-regularization and underfitting. We are often so scared of overfitting with NNs that we end up underfitting with all of our data augmentations and regularizations.
  5. More train time. Sometimes you just need to let your NN train for a little longer (or, sometimes, a little less). Be patient and plot your validation dataset's loss across epochs.
  6. Visualize training. Sometimes you just need to get a feel for what your net is learning. This falls under the broader category of methods for producing 'explainable AI'.
    • visualize model outputs during train (especially for generative models)
    • watch for frozen layers (layers that aren't updating) and vanishing/exploding gradients (see the second sketch after this list)
  7. Try a smarter/dumber optimizer. Optimizers can make a huge difference. Typically we default to something like the Adam optimizer...but that doesn't mean that it is always the best optimizer to use. Shake things up!
  8. NaNs...seeing NaNs during training can be very frustrating.
    • try smaller learning rates
    • check for divisions by zero or infinity
    • check for underflow/overflow issues
    • check for gradient issues
    • check for math errors
  9. Watch for the classics: over and under fitting.
  10. Decrease batch size. Large batch sizes decrease training time, but they can also impact accuracy: https://arxiv.org/abs/1609.04836.
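For point 1, here is a sketch of the overfit-a-tiny-subset sanity check, assuming PyTorch; the random data, model size, and learning rate are illustrative.

```python
# A network that can't drive the loss toward zero on 16 samples has a bug.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(16, 20)                      # a tiny "dataset"
y = torch.randint(0, 2, (16,)).float()

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(500):
    opt.zero_grad()
    loss = loss_fn(model(X).squeeze(-1), y)
    loss.backward()
    opt.step()

print(f"final loss on 16 samples: {loss.item():.4f}")  # should be near zero
```

For point 6, a sketch of watching for frozen layers by comparing weight snapshots before and after a few updates, again assuming PyTorch with a toy model.

```python
# Compare weight snapshots; a layer whose weights barely move may be frozen.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

before = {n: p.detach().clone() for n, p in model.named_parameters()}

for _ in range(10):                          # a few dummy updates
    x = torch.randn(32, 20)
    y = torch.randn(32, 1)
    opt.zero_grad()
    nn.MSELoss()(model(x), y).backward()
    opt.step()

for name, param in model.named_parameters():
    delta = (param.detach() - before[name]).norm().item()
    print(f"{name:20s} weight change = {delta:.2e}")  # ~0 means not learning
```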

Architecture

  1. Make sure you have the right architecture. If you're using a CNN to solve the Boston Housing dataset you should take a serious look in the mirror and ask what you're doing with your life. I've been there.
  2. After you've gone through everything above, try something crazy.
    • Stacking Neural Nets. Try stacking a CNN and an LSTM. (See the sketch after this list.)
    • Parallel Neural Nets. There are a plethora of ways of 'ensembling' neural nets.
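As one concrete (and hedged) version of the stacking idea, here is a toy PyTorch module that runs a 1D convolution over a sequence and feeds the result into an LSTM; all shapes and layer sizes are illustrative.

```python
# A 1D CNN extracts local features; an LSTM models longer-range structure.
import torch
import torch.nn as nn

class ConvLSTMStack(nn.Module):
    def __init__(self, n_features=8, n_classes=3):
        super().__init__()
        self.conv = nn.Conv1d(n_features, 32, kernel_size=5, padding=2)
        self.lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
        self.head = nn.Linear(64, n_classes)

    def forward(self, x):                  # x: (batch, time, features)
        x = self.conv(x.transpose(1, 2))   # Conv1d wants (batch, channels, time)
        x = torch.relu(x).transpose(1, 2)  # back to (batch, time, channels)
        _, (h, _) = self.lstm(x)           # h: (num_layers, batch, hidden)
        return self.head(h[-1])            # classify from the last hidden state

out = ConvLSTMStack()(torch.randn(4, 100, 8))
print(out.shape)                           # torch.Size([4, 3])
```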

RNNs

  1. Don't start with an LSTM. Start with an RNN, then try a GRU, then try an LSTM.
  2. RNNs are prone to gradient issues. Look out for:
    • Poor/stagnant loss during training
    • Large changes in loss (instability) during training
    • Loss or model weights go to NaN (overflow, underflow, or division by zero/infinity issues)
    • All model weights grow or shrink rapidly during training
    • All gradient updates are > 1.0
  3. Facing down gradient issues. If you have a gradient issue try:
    • Fewer layers
    • New activation function (e.g. linear or ReLU)
    • Setting floors/ceilings on gradients
    • Regularization (e.g. Gaussian noise or dropout)
    • Teacher forcing
    • Try GRU or LSTM recurrent units
    • Try gradient clipping (see the sketch after this list)
    • Try truncated backprop through time
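As a concrete version of the gradient clipping suggestion, here is a minimal PyTorch sketch: clip the global gradient norm between `backward()` and `step()`. The GRU, the random data, and `max_norm=1.0` are illustrative choices, not recommendations.

```python
# Gradient clipping for a recurrent model: cap the global gradient norm.
import torch
import torch.nn as nn

model = nn.GRU(input_size=8, hidden_size=32, batch_first=True)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(4, 50, 8)          # (batch, time, features)
target = torch.randn(4, 50, 32)    # toy regression target

out, _ = model(x)
loss = nn.MSELoss()(out, target)
opt.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # the clip
opt.step()
```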

GANs

  1. GANs suck. First try using a different generative model. GANs should be your last resort.
  2. Overfitting. Using a GAN to generate adversarial samples for another NN? CHECK FOR OVERFITTING! Adversarial training can make your NN more robust but it can also make it overfit.
  3. Tuning GANs can feel like black magic. There are so many more hyperparameters to deal with! Use your brain and think about what might be going wrong. Is it the discriminator learning rate? The generator learning rate? The form of the generator/discriminator? (See the sketch below.)
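For point 3, here is a small sketch of giving the generator and discriminator separate optimizers (and therefore separately tunable learning rates), assuming PyTorch; the architectures and the particular learning-rate ratio are illustrative, not a recommendation.

```python
# Separate optimizers so generator and discriminator learning rates can be
# tuned independently.
import torch
import torch.nn as nn

generator = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 784))
discriminator = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.5, 0.999))
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))
# If training collapses, these two learning rates (and their ratio) are often
# the first hyperparameters worth revisiting.
```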

Bayesian Neural Networks

  1. Check your priors! Do they make sense?
  2. Check your math. Are standard deviations constrained to positive numbers? I like passing them through softplus functions to ensure they are positive/non-zero valued. (See the sketch after this list.)
  3. Abusing your hardware. Are you trying to use MCMC sampling on a large model/dataset? Stop it. You're abusing your poor computer. Try variational inference or MCDropout.
  4. Waiter, check please! Convergence checks. Mixing checks. Uncertainty checks. Posterior predictive checks. All the standard Bayesian stuff.
  5. Bayesian statistics isn't magic. It isn't snake oil either. Make sure you are using the right tool for the right use case and that you understand the model you've built.
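For point 2, here is a minimal sketch of the softplus trick, assuming PyTorch: keep an unconstrained parameter and map it through softplus (plus a small epsilon) wherever a standard deviation is needed. The numbers are illustrative.

```python
# Keep standard deviations strictly positive via softplus.
import torch
import torch.nn.functional as F

raw_sigma = torch.tensor([-3.0, 0.0, 2.5], requires_grad=True)  # unconstrained
sigma = F.softplus(raw_sigma) + 1e-6                            # always > 0

print(sigma)  # all entries positive, e.g. ~0.0486, ~0.6931, ~2.5789
dist = torch.distributions.Normal(loc=torch.zeros(3), scale=sigma)
print(dist.log_prob(torch.zeros(3)))  # usable as a (log-)likelihood term
```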