Dropout, a CNN story
Batch Normalization greatly improved my Convolutional Neural Network in my last post. My colorized image successfully captured a wide range of colors, more than the standard sepia that had dogged my other models since the beginning of my Capstone. Because of this, I decided to train my model for longer, hoping to learn weights better suited to the results I wanted.
I ran the model for 25 epochs, with early stopping set at 5 epochs. The process took roughly two hours, and the results were lackluster. Where my original CNN was able to create a colorized photo with reasonable results, running the model for longer re-colorized the photo in sepia. But why could this be? If running a model for 10 epochs could produce reasonable results, why would running it for 25 over-color my image? The reason, I found after further study, was a problem with generalization.
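To make the training setup concrete, here is a minimal sketch of how early stopping decides when to quit. It assumes the "5 epochs" above refers to a patience value (as in the `patience` argument of Keras' `EarlyStopping` callback), and the validation losses are made up for illustration, not taken from my runs.

```python
def epochs_run(val_losses, max_epochs=25, patience=5):
    """Return the epoch at which training ends under early stopping.

    Training stops once the validation loss has failed to improve
    for `patience` epochs in a row, or when `max_epochs` is reached.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return min(len(val_losses), max_epochs)


# Hypothetical losses: steady improvement for 10 epochs, then a plateau.
losses = [1.0 - 0.05 * i for i in range(10)] + [0.60] * 15
stopped_at = epochs_run(losses)
print(stopped_at)
```

With these invented numbers the loop gives up five epochs after the last improvement, well before the 25-epoch limit.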
Essentially, where my model was able to find reasonable weights in 10 epochs, training for longer made it over-generalize its own results. As the model trained, it kept trying to reduce the difference between predicted and actual color values. But because of this, the unique connections between certain grays and their RGB cousins were averaged away, and my model chose sepia, or brown, instead of the true color, since a muddy brown is the safest guess: it is roughly the color closest to every other color at once.
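A small worked example shows why the "safe" answer comes out brown. It assumes the loss is a mean squared error, which I have not stated about my own model, and the four candidate colors are invented for illustration (they are not measured from my data or matched for brightness). The point is only that the single prediction minimizing squared error against several vivid colors is their mean, and the mean is dull.

```python
# Hypothetical "true" colors that similar gray pixels might map to.
candidates = [
    (200, 30, 30),   # red
    (40, 160, 60),   # green
    (60, 110, 200),  # blue
    (230, 150, 40),  # orange
]


def mean_color(colors):
    """Channel-wise average of a list of RGB tuples."""
    n = len(colors)
    return tuple(sum(c[i] for c in colors) / n for i in range(3))


def mse(pred, colors):
    """Mean squared error of one predicted color against all candidates."""
    total = sum((p - c[i]) ** 2 for c in colors for i, p in enumerate(pred))
    return total / (3 * len(colors))


def spread(color):
    """Crude vividness measure: gap between strongest and weakest channel."""
    return max(color) - min(color)


prediction = mean_color(candidates)
print(prediction)                              # (132.5, 112.5, 82.5)
print(mse(prediction, candidates))             # error of the averaged guess
print([mse(c, candidates) for c in candidates])  # error of each vivid guess
```

The averaged color has red above green above blue with only a small gap between channels, which is a muddy brown, and it scores a lower error than any of the vivid candidates would.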
To address this problem, I decided to research Dropout. Dropout is a technique in which neurons in hidden layers are randomly selected and temporarily removed from the network during training, so that the model can adapt to different inputs without leaning too heavily on any single neuron. Dropout is typically used as a regularizer against overfitting; I hoped it would also help my model handle variability.
Reading into Dropout on Keras’ website, I discovered that a Dropout layer randomly drops a unit by setting its input value to zero during training, and scales the remaining inputs up by 1 / (1 - rate) so that their expected sum is unchanged. This regularization technique removes the influence of the dropped neurons on their sibling components in later hidden layers for that training step. In this way, the forward pass experiences a form of random feature selection that helps prevent overfitting. On the backward pass, where gradients are calculated, no gradient flows through the dropped neurons, so the weights feeding them are not updated on that step.
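The mechanism is simple enough to sketch in plain Python. This is not Keras' implementation, just a minimal illustration of the behavior its documentation describes: zero each input with probability `rate` and scale the survivors by 1 / (1 - rate). The function name and the all-ones input are my own choices for the example.

```python
import random


def dropout(values, rate, rng):
    """Zero each value with probability `rate`; scale the rest to compensate."""
    keep_scale = 1.0 / (1.0 - rate)
    out = []
    for v in values:
        if rng.random() < rate:
            out.append(0.0)             # this unit is dropped for this step
        else:
            out.append(v * keep_scale)  # survivors are scaled up
    return out


rng = random.Random(0)                  # fixed seed so the run is repeatable
activations = [1.0] * 10000
dropped = dropout(activations, rate=0.25, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
print(zero_fraction, mean_after)
```

Roughly a quarter of the values come back as zero, while the average stays close to the original 1.0 because of the rescaling. At inference time the layer does nothing and passes values through unchanged.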
I decided to add a dropout layer between each pair of Conv2D layers in my network. My thinking was that by adding dropout between the Conv2D layers, each layer's output would experience dropout along the network, helping my model avoid overfitting by forcing the remaining neurons to pull extra weight along the forward pass. Some interesting changes were immediately apparent during training. Notably, training time went up from approximately 353 seconds per epoch to 383s, 447s, 551s, and so on. This added training time could be beneficial to my network, since the remaining neurons have to adapt and compensate for the randomly dropped neurons, or it could mean that my model is diverging from my desired results.
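For reference, the layout looks roughly like the sketch below. This is not my actual Capstone network: the input size, filter counts, activations, and the 0.2 dropout rate are all placeholder assumptions, chosen only to show where the Dropout layers sit relative to the Conv2D layers.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_model(rate=0.2):
    """Toy grayscale-to-RGB CNN with Dropout between Conv2D layers.

    All sizes here are illustrative placeholders, not the original model.
    """
    return keras.Sequential([
        keras.Input(shape=(64, 64, 1)),   # one-channel grayscale input
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.Dropout(rate),             # drops a fraction of activations in training
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.Dropout(rate),
        layers.Conv2D(3, 3, padding="same", activation="sigmoid"),  # RGB output
    ])


model = build_model()
model.summary()
```

Dropout is only active during training (for example inside `model.fit`); when predicting, the layers pass activations through untouched.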
Unfortunately, dropout didn’t improve my results. If anything, what was returned was even fuzzier, more out of focus, and more brown than I anticipated. With these results, more studying is needed. I will look into transfer learning next.