Simple learning rate selection for neural networks

In this blog post we’ll look at a simple way to select the learning rate when training and validating a neural network. We’ll use a constant learning rate and show the effect of different values.

There are more advanced and effective learning rate schemes, such as linear or exponential learning rate decay, but I’ll dive into those methods in the near future.

First, a small recap of what the learning rate actually is. The learning rate determines how far the optimizer moves the weights of a neural network model in each update step, in the direction that the mini-batch gradient indicates will reduce the loss. In the provided sample code an Adam optimizer is used and the mini-batch size is set to 256.

With a learning rate that is too high, training will not be reliable and could even diverge. The model will ‘overshoot’ the optimal solution and won’t fit the training and validation data well.

With a learning rate that is too small, training will be more reliable, but it could take many epochs to optimize the model.
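To make the overshooting and the slow convergence concrete, here is a minimal sketch that is not part of the sample code. It uses plain gradient descent (not Adam) on the toy function f(w) = w², and the function name `gradient_descent`, its parameters and the three learning rates are made up for illustration.

```python
def gradient_descent(learning_rate, steps=20, w=1.0):
    """Minimise f(w) = w**2 with plain gradient descent, starting from w."""
    for _ in range(steps):
        grad = 2.0 * w                  # derivative of w**2
        w = w - learning_rate * grad    # step against the gradient
    return w

# Distance from the optimum (w = 0) after 20 steps for three learning rates
for lr in (1.1, 0.1, 0.001):
    print(lr, abs(gradient_descent(lr)))
```

With 1.1 every step overshoots the minimum and the distance grows, with 0.1 the distance shrinks quickly, and with 0.001 the weight has barely moved after 20 steps.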

To test the various learning rates I’ll set up a simple neural network model, which will be trained for 150 epochs and validated on the MNIST dataset. The MNIST dataset consists of a large collection of handwritten digits and is commonly used for training and testing machine learning systems. For each learning rate a chart will be plotted showing the accuracy and loss of the model.

So let’s first take a look at the Python code we’ll be using for the whole process. The sample code used for this blog can be found in the Python file in my GitHub repository. Please also note that I’ve used TensorFlow 1.6 as the backend for Keras. It should also work with other backends (the results might be different…).

First we’ll import all necessary modules, download the MNIST dataset and rescale the pixel values into the range 0 to 1.

import numpy as np 
from sklearn.datasets import fetch_mldata
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.optimizers import Adam
import matplotlib.pyplot as plt

# Download MNIST Data
mnist = fetch_mldata('MNIST original', data_home='~')

# Rescale
X = mnist.data / 255

Next we’ll perform one-hot encoding of the labels.

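# One Hot
labels = range(10)
lb = preprocessing.LabelBinarizer()
# Fit the binarizer on the ten digit classes before transforming the targets
lb.fit(labels)
Y = lb.transform(mnist.target)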
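To show what one-hot encoding produces for a digit label, here is a small pure-Python sketch. It does not use `LabelBinarizer`, and the helper `one_hot` is a hypothetical name used only for this illustration.

```python
def one_hot(label, num_classes=10):
    """Return a list with a 1 at index `label` and 0 everywhere else."""
    vector = [0] * num_classes
    vector[label] = 1
    return vector

# Encode a few example digits
encoded = [one_hot(digit) for digit in (3, 0, 9)]
for row in encoded:
    print(row)
```

Each digit becomes a vector of length 10, which matches the 10 units of the output layer.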

We’ll split the data and labels into training and validation sets.

# Split in Training and Validation Sets
x_train, x_val, y_train, y_val = train_test_split(X, Y, test_size=0.15, random_state=42, stratify=Y)
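As a quick sanity check on what this split means for training, the sketch below works out the set sizes and the number of weight updates. It assumes the dataset holds 70,000 images, and it uses the 15% validation share, the batch size of 256 and the 150 epochs from this post. The variable names are only for illustration.

```python
import math

n_samples = 70000        # assumed size of the full MNIST dataset
val_percent = 15         # test_size=0.15 in train_test_split
batch_size = 256
epochs = 150

n_val = n_samples * val_percent // 100          # validation images
n_train = n_samples - n_val                     # training images
updates_per_epoch = math.ceil(n_train / batch_size)
total_updates = updates_per_epoch * epochs

print(n_train, n_val, updates_per_epoch, total_updates)
```

Under these assumptions each epoch performs 233 weight updates, so the learning rate scales roughly 35,000 update steps over a full run.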

We’ll define a function that creates a model with the specified learning rate. The input size of the neural network model is 784. The images of the MNIST dataset are originally 28 × 28 pixels, but in the dataset they are already flattened, so there is no need to reshape the input for the model. The model further contains a hidden layer and an output layer representing the 10 digits. We use an Adam optimizer and set the specified learning rate.

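# Create Model
def create_model(learning_rate):
    model = Sequential()
    # Hidden layer (the ReLU activation is an assumption)
    model.add(Dense(784, input_dim=784, kernel_initializer='normal',
                    activation='relu'))
    # Output layer, one unit per digit (the softmax activation is an assumption)
    model.add(Dense(10, kernel_initializer='normal',
                    activation='softmax'))
    # Compile model (the categorical cross-entropy loss is an assumption)
    model.compile(loss='categorical_crossentropy',
                  optimizer=Adam(lr=learning_rate), metrics=['accuracy'])
    return model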
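To get a feeling for the size of this model, the following sketch counts its trainable weights. It assumes both Dense layers use a bias term, which is the Keras default, and the helper `dense_params` is a hypothetical name for this illustration.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

hidden = dense_params(784, 784)   # input (784) -> hidden layer (784 units)
output = dense_params(784, 10)    # hidden layer (784) -> output layer (10 units)
total = hidden + output

print(hidden, output, total)
```

Every one of these weights is moved on each update step by an amount that scales with the learning rate.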

Then we define the learning rates that we will loop through. Note that 0.001 is the default learning rate for the Adam optimizer in Keras if nothing is specified.

# Define all learning rates
learning_rates = [ 0.00001, 0.0001, 0.001, 0.01, 0.1 ]
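The values above are spaced by a factor of 10, which is a common way to scan a wide range first. As a small sketch, the same list can be generated from the exponents instead of being typed out. The variable name `exponent` is only for illustration.

```python
# Powers of ten from 1e-5 up to 1e-1, smallest first
learning_rates = [10.0 ** exponent for exponent in range(-5, 0)]

for learning_rate in learning_rates:
    print(learning_rate)
```

Changing the range makes it easy to widen or shift the scan without editing the individual values.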

Next we will loop through the different learning rates. In each iteration we create a new model with that learning rate and train it for 150 epochs. In each iteration we also generate a combined chart showing the training and validation accuracy and loss.

# Loop through Learning Rates
for learning_rate in learning_rates:
    # Create Model
    model = create_model(learning_rate)
    # Fit Model on new learning_rate
    history = model.fit(x_train, y_train,
                        validation_data=(x_val, y_val),
                        epochs=150, batch_size=256, verbose=2, shuffle=False)

    # Plot Chart
    fig = plt.figure(dpi=300)

    # Subplot for Accuracy
    ax1 = fig.add_subplot(111)    
    ax1.plot(history.history['acc'], color='b',
             label='Train Accuracy')
    ax1.plot(history.history['val_acc'], color='g',
             label='Validation Accuracy')
    ax1.legend(loc='lower left', bbox_to_anchor=(0, -0.25))
    # Subplot for Loss
    ax2 = ax1.twinx()
    ax2.plot(history.history['loss'], color='r',
             label='Train Loss')
    ax2.plot(history.history['val_loss'], color='c',
             label='Validation Loss')
    ax2.legend(loc='lower right', bbox_to_anchor=(1, -0.25))
    # Format Learning Rate and set Title
    lr_label = ('%f' % learning_rate).rstrip('0').rstrip('.')
    plt.title('Model - Learning Rate ' + lr_label)
    # .. and save.. (file name built from the label alone, which is an assumption)
    plt.savefig(lr_label + '.png', bbox_inches="tight")
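The label formatting used for the chart title can be tried on its own. The sketch below wraps the same expression in a helper, where `format_lr` is a hypothetical name used only here, and applies it to a few of the learning rates from this post.

```python
def format_lr(learning_rate):
    """Format with 6 decimals, then strip trailing zeros and a trailing dot."""
    return ('%f' % learning_rate).rstrip('0').rstrip('.')

labels = [format_lr(lr) for lr in (0.1, 0.001, 0.00025, 0.00001)]
print(labels)

# '%f' keeps only 6 decimals, so anything below 0.000001 collapses to '0'
tiny_label = format_lr(0.0000001)
print(tiny_label)
```

For the learning rates used here the labels come out as expected, but for values below 0.000001 a different format string would be needed.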

After running the Python code for over an hour on my laptop, all charts were generated. Let’s take a look at them and at how we can analyze the results.

First let’s take a look at the chart for the largest learning rate, 0.1. The accuracy for the training and validation data doesn’t get above 21%, which is not good. The loss for both sets also stays close to 12.8, while it should be close to 0. This chart shows very nicely a model that is unable to fit both the training and the validation data.
Chart for Learning Rate 0.1
Next let’s take a look at learning rate 0.01. The training and validation accuracy both reach high values after a number of epochs, and the training loss decreases to a low value. However, the validation loss keeps increasing with every epoch, so this model shows signs of overfitting and should not be used either.
Chart for Learning Rate 0.01
Now the chart for the default learning rate of 0.001. The training accuracy increases smoothly to almost 100% and the validation accuracy increases to slightly above 98%. The training loss decreases to almost 0 after 20 or more epochs. The validation loss also decreases smoothly at first, but after about 15 epochs it starts to increase again and shows signs of overfitting. So this model would be usable if we applied early-stopping and trained it for only about 15 epochs.
Chart for Learning Rate 0.001
The chart for learning rate 0.0001 clearly shows that it already takes a lot longer to reach high accuracy values. The validation loss decreases until around 60 epochs and then starts to increase again, showing signs of slight overfitting. Again, the model could be used if we applied early-stopping.
Chart for Learning Rate 0.0001
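The early-stopping idea mentioned above can be sketched in a few lines of plain Python: keep track of the best validation loss and stop once it has not improved for a number of epochs. The function `early_stopping_epoch`, its `patience` parameter and the validation losses below are made up for illustration and are not taken from the charts.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (best_epoch, stop_epoch), both counted from 1."""
    best_loss = float('inf')
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best validation loss, remember where it happened
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs, stop here
            return best_epoch, epoch
    return best_epoch, len(val_losses)

# Made-up validation losses that decrease first and then rise again
val_losses = [0.30, 0.20, 0.15, 0.12, 0.11, 0.12, 0.13, 0.15, 0.18, 0.22]
best, stopped = early_stopping_epoch(val_losses, patience=3)
print(best, stopped)
```

Keras ships an `EarlyStopping` callback that monitors `val_loss` in a similar way during `fit`, so in practice there should be no need to write this loop by hand.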
With the smallest learning rate, 0.00001, the accuracy for the training and validation data is still increasing and the loss is still decreasing even after 150 epochs. This model has not yet reached an optimal fit on the training and validation data and requires many additional epochs of training.
Chart for Learning Rate 0.00001
Based on the first set of learning rates, the range between 0.0001 and 0.001 looks like the best one, with high accuracy, very low loss and almost no overfitting. Let’s try out two additional learning rates in that range.

The learning rate of 0.0005 shows results comparable to learning rate 0.001. The training and validation accuracy reach high values. The training loss decreases to almost 0 and the validation loss only starts to increase around epoch 10 to 15.
Chart for Learning Rate 0.0005
The learning rate of 0.00025 already takes a little longer to train. The validation loss starts to increase again after about 20 epochs, and the validation accuracy at that point is slightly below 98%.
Chart for Learning Rate 0.00025

We have now seen how we can try different learning rates in a simple way and generate usable charts. Based on the charts, a learning rate of 0.001 or 0.0005 would give a model with a good fit on the training and validation data, high accuracy and low loss. In my next blog post I’ll dive further into learning rate decay and early-stopping.
