Deep Learning with Convnets: Computer Vision

Introduction
      Step 1: Defining the problem and assembling a dataset
      Step 2: Choosing a measure of success
      Step 3: Deciding on an evaluation protocol
Methodology
      Step 4: Preparing the data
      Step 5: Develop a model that does better than a baseline
      Step 6: Develop a model that overfits
      Step 7: Regularizing the model
Results
Conclusion

Introduction

This report is based on the text Deep Learning with Python by François Chollet (hereafter referred to as DLWP). Specifically, it draws largely on section 4.5 of the first edition and chapter 8 of the second edition.

Step 1: Defining the problem and assembling a dataset

Given an input of an image that contains either one cat or one dog, we wish to predict the species of the depicted animal. As such, we are dealing with a binary classification task. As a dataset, we can use Kaggle's "Dogs vs Cats" dataset from 2013.

Downloading the dataset

If you intend to run any of the code in this Jupyter notebook, you must first create a Kaggle account if you do not already have one, open your account settings, click "Create New API Token", and save the resulting JSON file to $HOME/.kaggle/kaggle.json. You must then visit the "Dogs vs. Cats" competition page and click "I Understand and Accept" (you may be asked to verify your account first). This will enable you to download the dataset via the Kaggle API as follows.
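A sketch of the download step as it might appear in a notebook cell; this assumes the official kaggle command-line tool is installed (pip install kaggle), that the competition slug is dogs-vs-cats, and that the downloaded archive contains train.zip, as it did at the time of the 2013 competition:

```
!kaggle competitions download -c dogs-vs-cats
!unzip -qq dogs-vs-cats.zip   # outer archive; contains train.zip and test1.zip
!unzip -qq train.zip          # the 25,000 labelled training images
```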

The images are of varying quality, but all of them contain a cat or dog that is clearly discernible to a human viewer. Other objects may also be present, such as a human arm, as seen in this somewhat blurry image that appears to have been captured with a mobile phone camera:

Nevertheless, we can hypothesize that these inputs are sufficiently informative to learn a mapping from images to the correct labels.

We can moreover note that this is a stationary problem: the general appearances of cats and dogs are not going to change over any timescale that concerns us. We therefore do not need to worry about periodically retraining our model on fresh data, as would likely be necessary for something like fashion trends.

Step 2: Choosing a measure of success

Since this is a binary classification problem with a balanced dataset (50% cats and 50% dogs), we can use accuracy as our measure of success. A baseline classifier that outputs a random guess for each image would achieve 50% accuracy on average, so we must aim to beat 50% accuracy.

Step 3: Deciding on an evaluation protocol

The dataset we intend to work with contains 25,000 images, which is quite a lot; convnets can be effective with considerably smaller datasets. Simple hold-out validation should therefore be an appropriate evaluation protocol. Were we dealing with a smaller dataset, we might instead choose K-fold cross-validation or iterated K-fold validation.

Methodology

Step 4: Preparing the data

The original dataset contains 25,000 training images and 12,500 test images, with the training set evenly split between cat and dog images. However, the test images are unlabelled, so we will ignore them and instead split the training set into training (60%), validation (20%), and test (20%) sets. The image names follow the format [SPECIES].[ID].jpg, where [SPECIES] is either cat or dog and [ID] is in the range [0, 12500).

The following code has been taken from DLWP and slightly modified. It splits the training set into training, validation, and test set directories, and moreover creates separate cat and dog subdirectories within each parent set directory.
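A sketch of that splitting code, close to DLWP's make_subset helper; the directory names (train/ for the unzipped originals, dogs_vs_cats_split/ for the new tree) are assumptions:

```python
import os, shutil, pathlib

original_dir = pathlib.Path("train")               # unzipped Kaggle training images
new_base_dir = pathlib.Path("dogs_vs_cats_split")  # root of the new train/validation/test tree

def make_subset(subset_name, start_index, end_index):
    # Copy images [start_index, end_index) of each species into
    # e.g. dogs_vs_cats_split/train/cat/ and dogs_vs_cats_split/train/dog/.
    for category in ("cat", "dog"):
        dir = new_base_dir / subset_name / category
        os.makedirs(dir)
        fnames = [f"{category}.{i}.jpg" for i in range(start_index, end_index)]
        for fname in fnames:
            shutil.copyfile(src=original_dir / fname, dst=dir / fname)

make_subset("train", start_index=0, end_index=7500)           # 60%
make_subset("validation", start_index=7500, end_index=10000)  # 20%
make_subset("test", start_index=10000, end_index=12500)       # 20%
```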

This produces a training set of 15000 images (7500 cats and 7500 dogs), a validation set of 5000 images (2500 cats and 2500 dogs), and a test set of 5000 images (2500 cats and 2500 dogs).

Machine learning models expect as input homogeneous tensors of floating-point numbers within a small range. As such, we will need to resize all the images; for this we will choose 180x180 pixels. Additionally, the RGB values should be scaled from [0, 255] down to [0, 1] (we will handle this inside the model itself), and the images should be batched so that we do not need to load them all into memory at once. The resizing, batching, and shuffling can all conveniently be done via Keras' image_dataset_from_directory function. DLWP demonstrates that this can be done as follows:
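A sketch of that pipeline, reusing new_base_dir from the splitting code above; the batch size of 32 is an assumption, though it is consistent with the 469 steps per epoch (15000 / 32 ≈ 469) visible in the training log in Step 7:

```python
from tensorflow.keras.utils import image_dataset_from_directory

# Each dataset yields shuffled batches of (images, labels), with images
# resized to 180x180 and labels inferred from the subdirectory names.
train_dataset = image_dataset_from_directory(
    new_base_dir / "train", image_size=(180, 180), batch_size=32)
validation_dataset = image_dataset_from_directory(
    new_base_dir / "validation", image_size=(180, 180), batch_size=32)
test_dataset = image_dataset_from_directory(
    new_base_dir / "test", image_size=(180, 180), batch_size=32)
```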

Step 5: Develop a model that does better than a baseline

As previously mentioned, our baseline is an accuracy of 50%, which our model must exceed. For the intermediate layers of the model, we will rely on convolutional networks (convnets), which are widely used in computer vision.

Use of convnets

A convnet is essentially a stack of Conv2D and MaxPooling2D layers whose input tensors are of shape ([IMAGE HEIGHT], [IMAGE WIDTH], [IMAGE CHANNELS]). Whereas Dense layers learn global patterns involving their entire input space (i.e. all pixels of an image at once), convolution layers learn local patterns (i.e. patterns found in small 2D windows of the image). These patterns can then be recognised wherever they appear in an image. Additionally, deeper layers can learn increasingly abstract patterns built from the features of previous layers. Convnets therefore require fewer samples to achieve generalisation power.

The convolution operation (Conv2D) takes as input a 3D tensor and outputs another 3D tensor by applying the same transformation to every local patch of the input. The job of the max-pooling operation (MaxPooling2D) is then to aggressively downsample these 3D tensors, reducing the number of values that subsequent layers need to process.

The following function is based on code from DLWP and creates our model using intermediate convnet layers. An initial layer scales the inputs from the range [0, 255] to the range [0, 1]. For binary classifiers, it is standard practice to use a sigmoid activation in the last layer, which is what is done here.
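A minimal sketch of such a model; the exact number and size of the convolutional layers in the original preliminary model are assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

def make_model():
    inputs = keras.Input(shape=(180, 180, 3))
    # Rescale pixel values from [0, 255] to [0, 1].
    x = layers.Rescaling(1.0 / 255)(inputs)
    x = layers.Conv2D(filters=32, kernel_size=3, activation="relu")(x)
    x = layers.MaxPooling2D(pool_size=2)(x)
    x = layers.Conv2D(filters=64, kernel_size=3, activation="relu")(x)
    x = layers.MaxPooling2D(pool_size=2)(x)
    x = layers.Flatten()(x)
    # Sigmoid output: the probability of class 1, which is "dog" here
    # (image_dataset_from_directory assigns labels alphabetically).
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return keras.Model(inputs=inputs, outputs=outputs)
```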

It is also standard practice to use binary_crossentropy as the loss function of a binary classifier. For the optimizer, DLWP mentions that "in most cases, it's safe to go with rmsprop and its default learning rate", so we will be using that. Now to compile the model with the appropriate settings and take a look at its summary.
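A sketch of the compilation step under those choices:

```python
model = make_model()
model.compile(loss="binary_crossentropy",
              optimizer="rmsprop",
              metrics=["accuracy"])
model.summary()
```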

We can also define a training function below. Again, this is based on code from DLWP and uses Keras callbacks to save and monitor the state of the training.
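A sketch of that training function; the checkpoint-file naming scheme is an assumption:

```python
def train_model(model, name, epochs):
    callbacks = [
        # Save the model to disk whenever val_loss improves, so that the
        # best-fit model can be reloaded after training finishes.
        keras.callbacks.ModelCheckpoint(
            filepath=f"{name}.keras",
            monitor="val_loss",
            save_best_only=True)
    ]
    return model.fit(train_dataset,
                     epochs=epochs,
                     validation_data=validation_dataset,
                     callbacks=callbacks)
```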

Let us now train the preliminary model.
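For example (the epoch count of 10 is an assumption; it only needs to exceed the optimal epoch found below):

```python
history = train_model(model, "convnet_baseline", epochs=10)
```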

Let's now define a function to plot some graphs comparing training and validation accuracies and losses per epoch. The code within this function has been taken directly from DLWP.
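A sketch close to DLWP's plotting code, with only the function wrapper added:

```python
import matplotlib.pyplot as plt

def plot_history(history):
    accuracy = history.history["accuracy"]
    val_accuracy = history.history["val_accuracy"]
    loss = history.history["loss"]
    val_loss = history.history["val_loss"]
    epochs = range(1, len(accuracy) + 1)
    plt.plot(epochs, accuracy, "bo", label="Training accuracy")
    plt.plot(epochs, val_accuracy, "b", label="Validation accuracy")
    plt.title("Training and validation accuracy")
    plt.legend()
    plt.figure()
    plt.plot(epochs, loss, "bo", label="Training loss")
    plt.plot(epochs, val_loss, "b", label="Validation loss")
    plt.title("Training and validation loss")
    plt.legend()
    plt.show()
```

It can then be called as plot_history(history).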

Let's see what the graphs look like.

Notice that the validation loss continuously increases after the third epoch. This indicates that the third epoch is the optimal epoch for the current model. We can confirm this by defining a function to return the epoch at which the validation loss is lowest.
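A sketch of such a function:

```python
def best_epoch(history):
    val_loss = history.history["val_loss"]
    # Keras numbers epochs from 1 in its logs, hence the + 1.
    return val_loss.index(min(val_loss)) + 1

print(best_epoch(history))  # 3 for the run described above
```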

Let's now test the model. Keras' load_model will load the best-fit model (i.e. the one saved at the epoch with the lowest validation loss) from the checkpoints written during training. Based on the examples from DLWP, we can define a test_model function and run it.
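A sketch of test_model, assuming the checkpoint names used in train_model above:

```python
def test_model(name):
    # load_model restores the checkpoint with the lowest validation loss,
    # i.e. the model as it was at the optimal epoch.
    model = keras.models.load_model(f"{name}.keras")
    loss, accuracy = model.evaluate(test_dataset)
    print(f"Test accuracy: {accuracy:.3f}")

test_model("convnet_baseline")
```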

The model has a test accuracy of 74.2%, better than our common-sense baseline of 50%. We have thus achieved statistical power, but there is room for further improvement.

Step 6: Develop a model that overfits

To achieve overfitting, we can add more and larger layers and train for more epochs. Let's do this by adding three extra intermediate layers.
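A sketch of the enlarged model; which specific layers were added is an assumption, here three extra Conv2D layers (with pooling) in the style of DLWP's chapter 8 model:

```python
def make_larger_model():
    inputs = keras.Input(shape=(180, 180, 3))
    x = layers.Rescaling(1.0 / 255)(inputs)
    x = layers.Conv2D(filters=32, kernel_size=3, activation="relu")(x)
    x = layers.MaxPooling2D(pool_size=2)(x)
    x = layers.Conv2D(filters=64, kernel_size=3, activation="relu")(x)
    x = layers.MaxPooling2D(pool_size=2)(x)
    # The three extra convolutional layers:
    x = layers.Conv2D(filters=128, kernel_size=3, activation="relu")(x)
    x = layers.MaxPooling2D(pool_size=2)(x)
    x = layers.Conv2D(filters=256, kernel_size=3, activation="relu")(x)
    x = layers.MaxPooling2D(pool_size=2)(x)
    x = layers.Conv2D(filters=256, kernel_size=3, activation="relu")(x)
    x = layers.Flatten()(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return keras.Model(inputs=inputs, outputs=outputs)
```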

Now let us train the model using 20 epochs.
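Again compiling with the same settings and training (the checkpoint name is an assumption):

```python
model = make_larger_model()
model.compile(loss="binary_crossentropy",
              optimizer="rmsprop",
              metrics=["accuracy"])
history = train_model(model, "convnet_overfit", epochs=20)
plot_history(history)
```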

Here are the training results:

We see that the validation loss begins to increase after the sixth epoch, so the optimal epoch for the model is 6.

Let's now evaluate the model on the test dataset. Remember that our test_model function will load the model saved at the optimal epoch via the Keras callbacks.

We see that increasing the complexity of the model's intermediate convnet layers has resulted in a higher prediction accuracy of 87.4% on unseen data.

Step 7: Regularizing the model

To further reduce the amount of overfitting, we can first try the following two regularization techniques in conjunction:

Dropout essentially introduces noise into the output tensors of the network's layers during training, making it harder for the model to memorise patterns that are not significant. L2 regularization is a form of weight decay that helps to prevent overfitting by pushing a layer's weight coefficients toward smaller values. (DLWP)

Let's see this in action by training a new model using these techniques. Apart from regularization, the configuration will be the same.
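A sketch of the regularized model: dropout before the classifier and L2 penalties on the convolution kernels. The dropout rate of 0.5 and the L2 factor of 0.002 are assumptions:

```python
from tensorflow.keras import regularizers

def make_regularized_model():
    inputs = keras.Input(shape=(180, 180, 3))
    x = layers.Rescaling(1.0 / 255)(inputs)
    # Same layer stack as make_larger_model, with L2 penalties added.
    for filters in (32, 64, 128, 256):
        x = layers.Conv2D(filters, kernel_size=3, activation="relu",
                          kernel_regularizer=regularizers.l2(0.002))(x)
        x = layers.MaxPooling2D(pool_size=2)(x)
    x = layers.Conv2D(256, kernel_size=3, activation="relu",
                      kernel_regularizer=regularizers.l2(0.002))(x)
    x = layers.Flatten()(x)
    x = layers.Dropout(0.5)(x)  # randomly zero half the features during training
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return keras.Model(inputs=inputs, outputs=outputs)

model = make_regularized_model()
model.compile(loss="binary_crossentropy",
              optimizer="rmsprop",
              metrics=["accuracy"])
history = train_model(model, "convnet_regularized", epochs=20)
```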

The following output was produced for the first six epochs:

Epoch 1/20
469/469 [==============================] - 160s 337ms/step - loss: 0.7273 - accuracy: 0.4994 - val_loss: 0.6936 - val_accuracy: 0.5000
Epoch 2/20
469/469 [==============================] - 162s 346ms/step - loss: 0.6937 - accuracy: 0.4931 - val_loss: 0.6936 - val_accuracy: 0.5000
Epoch 3/20
469/469 [==============================] - 166s 355ms/step - loss: 0.6936 - accuracy: 0.4943 - val_loss: 0.6936 - val_accuracy: 0.5000
Epoch 4/20
469/469 [==============================] - 159s 339ms/step - loss: 0.6936 - accuracy: 0.4969 - val_loss: 0.6936 - val_accuracy: 0.5000
Epoch 5/20
469/469 [==============================] - 156s 332ms/step - loss: 0.6936 - accuracy: 0.4957 - val_loss: 0.6936 - val_accuracy: 0.5000
Epoch 6/20
469/469 [==============================] - 155s 331ms/step - loss: 0.6936 - accuracy: 0.4956 - val_loss: 0.6936 - val_accuracy: 0.5000

Clearly something has gone wrong here: the validation loss and accuracy are stuck at 0.6936 and 50% respectively. A binary cross-entropy of ln(2) ≈ 0.693 is exactly what a model gets by predicting 0.5 for every image, so the network has collapsed to a constant output. We could try tinkering with the L2 regularization constant, but since the network is quite large, L2 regularization may be unnecessary in the first place: according to DLWP, it is mostly used with smaller models. Let's instead try again using only a dropout regularization strategy.
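The dropout-only model is simply the enlarged model from Step 6 with the same dropout layer added; a sketch:

```python
def make_dropout_model():
    inputs = keras.Input(shape=(180, 180, 3))
    x = layers.Rescaling(1.0 / 255)(inputs)
    for filters in (32, 64, 128, 256):
        x = layers.Conv2D(filters, kernel_size=3, activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=2)(x)
    x = layers.Conv2D(256, kernel_size=3, activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return keras.Model(inputs=inputs, outputs=outputs)

model = make_dropout_model()
model.compile(loss="binary_crossentropy",
              optimizer="rmsprop",
              metrics=["accuracy"])
history = train_model(model, "convnet_dropout", epochs=20)
```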

Results

The results of regularizing the model can be seen as follows.

We see that the dropout has worked to slightly reduce overfitting and that the optimal number of epochs is now 8.

Now to evaluate on the test dataset using the model trained with the optimal number of epochs.
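Using the same helper as before (checkpoint name assumed):

```python
test_model("convnet_dropout")
```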

The accuracy of the classifier on unseen data has also increased to 88.6%.

One reason that dropout only slightly improved the model may be that the dataset is large enough that regularization of any type has little room to help. To combat overfitting further, we should try tuning the model's hyperparameters, such as the optimizer's learning rate and the number of units per layer.

Conclusion

We have seen the effectiveness of convnet layers for deep learning tasks related to computer vision. We have furthermore developed a binary image classification model that beats a common-sense baseline by a significant margin of predictive accuracy: 88.6% to 50%. We then observed how dropout regularization can help to reduce overfitting in large, complex models, although its effect was limited in this case. Finally, we noted that further work should be done to combat overfitting by tuning the model's hyperparameters. Doing this effectively should further increase the model's predictive accuracy on unseen data.