Introduction
Step 1: Defining the problem and assembling a dataset
Step 2: Choosing a measure of success
Step 3: Deciding on an evaluation protocol
Methodology
Step 4: Preparing the data
Step 5: Developing a model that does better than a baseline
Step 6: Developing a model that overfits
Step 7: Regularizing the model
Results
Conclusion
This report is based on the text Deep Learning with Python by François Chollet (hereafter referred to as DLWP). Specifically, it is largely based on chapter 4.5 of the first edition and chapter 8 of the second edition.
Given an input image containing either one cat or one dog, we wish to predict the species of the depicted animal. As such, we are dealing with a binary classification task. As our dataset, we will use Kaggle's "Dogs vs Cats" dataset from 2013.
If you intend to run any of the code in this Jupyter notebook, you must first create a Kaggle account if you do not already have one, go to your account settings, click "Create API Key", and save the resulting JSON file to $HOME/.kaggle/kaggle.json. You must then go to the "Dogs vs Cats" competition page and click "I Understand and Accept" (you may need to verify your account first if asked to). This will enable you to download the dataset via the Kaggle API as follows.
!kaggle competitions download -c dogs-vs-cats
!unzip -qq dogs-vs-cats.zip
!unzip -qq train.zip
The images are of varying quality, but all of them contain a cat or dog that is clearly discernible to a human viewer. Other objects may be present in the images, such as the human arm seen in this somewhat blurry image, which appears to have been captured with a mobile phone camera:
from IPython.display import Image
Image(filename="train/cat.0.jpg")
Nevertheless, we can hypothesize that these inputs are sufficiently informative to learn a mapping from images to labels.
We can moreover note that this is a stationary problem, in that the general appearance of cats and dogs is not something that will change over any time span of practical concern. We therefore do not need to worry about periodically updating and retraining our model, as would likely be needed for something like fashion trends.
Since this is a binary classification problem with a balanced dataset (50% cats and 50% dogs), we can use accuracy as our measure of success. A baseline classifier that outputs a random guess for each image would achieve 50% accuracy on average. We must therefore aim to beat 50% accuracy.
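As a quick sanity check of that baseline, here is a minimal simulation (assuming only NumPy; not part of DLWP) showing that random guessing on a balanced dataset achieves roughly 50% accuracy:

import numpy as np

rng = np.random.default_rng(0)
n = 25_000
labels = rng.permutation(np.repeat([0, 1], n // 2))  # perfectly balanced ground truth
guesses = rng.integers(0, 2, size=n)                 # uniform random guesses
print((guesses == labels).mean())                    # ~0.5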
The dataset we intend to work with contains 25000 labelled images, which is a substantial amount of data; convnets can be effective with considerably smaller datasets. Simple hold-out validation should therefore be an appropriate evaluation protocol. Had we been dealing with a much smaller dataset, K-fold cross-validation or iterated K-fold validation would have been better choices.
The original dataset contains 25000 training images and 12500 test images, with the training set having an even split between cat and dog images. However, the test images are unlabelled, so we will ignore them and instead split the training set into training (60%), validation (20%), and test (20%) sets. The image names are in the format [SPECIES].[ID].jpg, where [SPECIES] is either cat or dog and [ID] is in the range [0, 12500).
The following code has been taken from DLWP and slightly modified. It splits the training set into training, validation and test set directories, and moreover creates separate directories for cats and dogs within each parent set directory.
import os, shutil, pathlib

original_dir = pathlib.Path("train")
new_base_dir = pathlib.Path("cats_vs_dogs")

def make_subset(subset_name, start_index, end_index):
    for category in ("cat", "dog"):
        dir = new_base_dir / subset_name / category
        if not os.path.isdir(dir):
            os.makedirs(dir)
        fnames = [f"{category}.{i}.jpg" for i in range(start_index, end_index)]
        for fname in fnames:
            shutil.copyfile(src=original_dir / fname,
                            dst=dir / fname)

make_subset("train", start_index=0, end_index=7500)
make_subset("validation", start_index=7500, end_index=10000)
make_subset("test", start_index=10000, end_index=12500)
This produces a training set of 15000 images (7500 cats and 7500 dogs), a validation set of 5000 images (2500 cats and 2500 dogs), and a test set of 5000 images (2500 cats and 2500 dogs).
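We can verify these counts directly with a quick check (a small helper of our own, not from DLWP):

for subset in ("train", "validation", "test"):
    counts = {
        category: len(os.listdir(new_base_dir / subset / category))
        for category in ("cat", "dog")
    }
    print(subset, counts)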
Machine learning models expect as input homogeneous tensors of floating-point numbers within a small range. As such, we will need to resize all the images; we will choose 180x180 pixels. Additionally, the RGB values should be scaled from the range [0, 255] to the range [0, 1], and the images should be put into batches so that we do not need to load all of them into memory at once. The resizing, batching, and shuffling can all conveniently be done via Keras' image_dataset_from_directory function (the rescaling to [0, 1] will be handled by a layer inside the model itself, as we will see below). DLWP demonstrates that this can be done as follows:
from tensorflow.keras.preprocessing import image_dataset_from_directory

train_dataset = image_dataset_from_directory(
    new_base_dir / "train",
    image_size=(180, 180),
    batch_size=32
)
validation_dataset = image_dataset_from_directory(
    new_base_dir / "validation",
    image_size=(180, 180),
    batch_size=32
)
test_dataset = image_dataset_from_directory(
    new_base_dir / "test",
    image_size=(180, 180),
    batch_size=32
)
Found 15000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
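To sanity-check the pipeline, we can inspect the shapes of a single batch (a quick check of our own, not part of DLWP's listing):

for data_batch, labels_batch in train_dataset.take(1):
    print("data batch shape:", data_batch.shape)      # (32, 180, 180, 3)
    print("labels batch shape:", labels_batch.shape)  # (32,)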
As previously mentioned, our baseline is an accuracy of 50%, which our model must exceed. For the intermediate layers of the model, we will rely on convolutional layers, the building blocks of convnets, which are widely used in computer vision.
A convnet is essentially a stack of Conv2D and MaxPooling2D layers whose input tensors are of shape ([IMAGE HEIGHT], [IMAGE WIDTH], [IMAGE CHANNELS]). Whereas Dense layers can only learn global patterns in their input feature space (i.e. patterns involving all pixels of an image at once), convolution layers learn local patterns (i.e. patterns found within small 2D windows of the image). These patterns can then be recognised wherever they appear in an image. Additionally, deeper layers can learn increasingly abstract patterns composed from the features of earlier layers. Convnets therefore require fewer samples to achieve generalisation power.
The convolution operation (Conv2D) takes as input a 3D tensor and outputs another 3D tensor by applying the same transformation to small local patches of its input. The job of the max-pooling operation (MaxPooling2D) is then to aggressively downsample the resulting feature maps, reducing the number of values that subsequent layers need to process.
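To make these shape transformations concrete, here is a minimal sketch (illustrative only, not taken from DLWP) tracing a dummy image through one Conv2D and one MaxPooling2D layer:

import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.uniform((1, 180, 180, 3))  # a single dummy RGB image
x = layers.Conv2D(filters=32, kernel_size=3, activation="relu")(x)
print(x.shape)  # (1, 178, 178, 32): sliding 3x3 windows shrink each spatial side by 2
x = layers.MaxPooling2D(pool_size=2)(x)
print(x.shape)  # (1, 89, 89, 32): each 2x2 window is reduced to its maximum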
The following function is based on code from DLWP and creates our model using intermediate convnet layers. An initial layer scales the inputs from the range [0, 255] to the range [0, 1]. For binary classifiers, standard practice is to use a sigmoid activation for the last layer, since it squashes the output into (0, 1) so that it can be read as the probability of one of the two classes; that is what is being done here.
from tensorflow import keras
from tensorflow.keras import layers

def create_model(num_layers=1, dropout=None, regularizer=None):
    inputs = keras.Input(shape=(180, 180, 3))
    x = layers.experimental.preprocessing.Rescaling(1 / 255)(inputs)
    # add successively larger intermediate convnet layers
    for i in range(num_layers):
        conv_filters = 2 ** (5 + i)
        x = layers.Conv2D(
            filters=conv_filters, kernel_size=3, activation="relu", kernel_regularizer=regularizer
        )(x)
        x = layers.MaxPooling2D(pool_size=2)(x)
    conv_filters = 2 ** (4 + num_layers)
    x = layers.Conv2D(
        filters=conv_filters, kernel_size=3, activation="relu", kernel_regularizer=regularizer
    )(x)
    x = layers.Flatten()(x)
    if dropout:
        x = layers.Dropout(dropout)(x)
    # last layer activation
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs=inputs, outputs=outputs)
    # compilation
    model.compile(
        loss="binary_crossentropy",
        optimizer="rmsprop",
        metrics=["accuracy"]
    )
    return model
It is also standard practice to use binary_crossentropy as the loss function of a binary classifier. For the optimizer, DLWP mentions that "in most cases, it's safe to go with rmsprop and its default learning rate", so we will be using that. Note that our create_model function above already compiles the model with exactly these settings.
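For reference (this formula is standard, though not spelled out in DLWP's listings), for a true label $y \in \{0, 1\}$ and a predicted probability $p$, the binary cross-entropy is

$$\mathcal{L}(y, p) = -\bigl(y \log p + (1 - y)\log(1 - p)\bigr),$$

which is minimised when $p$ matches the true label and heavily penalises confident wrong predictions. Let's now create the model and take a look at its summary.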
model = create_model()
model.summary()
Model: "model_5" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= input_6 (InputLayer) [(None, 180, 180, 3)] 0 _________________________________________________________________ rescaling_5 (Rescaling) (None, 180, 180, 3) 0 _________________________________________________________________ conv2d_18 (Conv2D) (None, 178, 178, 32) 896 _________________________________________________________________ max_pooling2d_14 (MaxPooling (None, 89, 89, 32) 0 _________________________________________________________________ conv2d_19 (Conv2D) (None, 87, 87, 32) 9248 _________________________________________________________________ flatten_5 (Flatten) (None, 242208) 0 _________________________________________________________________ dense_5 (Dense) (None, 1) 242209 ================================================================= Total params: 252,353 Trainable params: 252,353 Non-trainable params: 0 _________________________________________________________________
We can also define a training function below. Again, this is based on code from DLWP and uses a Keras ModelCheckpoint callback to monitor the validation loss during training and save the best model seen so far.
def train_model(model, epochs=10):
    callbacks = [
        keras.callbacks.ModelCheckpoint(
            filepath="convnet_from_scratch.keras",
            save_best_only=True,
            monitor="val_loss"
        )
    ]
    history = model.fit(
        train_dataset,
        epochs=epochs,
        validation_data=validation_dataset,
        callbacks=callbacks
    )
    return callbacks, history
Let us now train the preliminary model.
callbacks, history = train_model(model)
Epoch 1/10
469/469 [==============================] - 549s 1s/step - loss: 0.7056 - accuracy: 0.6233 - val_loss: 0.5824 - val_accuracy: 0.7028
Epoch 2/10
469/469 [==============================] - 544s 1s/step - loss: 0.5574 - accuracy: 0.7257 - val_loss: 0.5745 - val_accuracy: 0.7116
Epoch 3/10
469/469 [==============================] - 548s 1s/step - loss: 0.4615 - accuracy: 0.7851 - val_loss: 0.5496 - val_accuracy: 0.7376
Epoch 4/10
469/469 [==============================] - 553s 1s/step - loss: 0.3715 - accuracy: 0.8385 - val_loss: 0.6188 - val_accuracy: 0.7272
Epoch 5/10
469/469 [==============================] - 553s 1s/step - loss: 0.2855 - accuracy: 0.8832 - val_loss: 0.7363 - val_accuracy: 0.7212
Epoch 6/10
469/469 [==============================] - 555s 1s/step - loss: 0.2034 - accuracy: 0.9197 - val_loss: 0.8570 - val_accuracy: 0.7182
Epoch 7/10
469/469 [==============================] - 556s 1s/step - loss: 0.1321 - accuracy: 0.9511 - val_loss: 1.0850 - val_accuracy: 0.7168
Epoch 8/10
469/469 [==============================] - 559s 1s/step - loss: 0.0886 - accuracy: 0.9696 - val_loss: 1.3388 - val_accuracy: 0.7156
Epoch 9/10
469/469 [==============================] - 559s 1s/step - loss: 0.0603 - accuracy: 0.9796 - val_loss: 1.6129 - val_accuracy: 0.7040
Epoch 10/10
469/469 [==============================] - 558s 1s/step - loss: 0.0407 - accuracy: 0.9883 - val_loss: 1.6432 - val_accuracy: 0.7134
Let's now define a function to plot some graphs comparing training and validation accuracies and losses per epoch. The code within this function has been taken directly from DLWP.
import matplotlib.pyplot as plt

def plot_results(history):
    accuracy = history.history["accuracy"]
    val_accuracy = history.history["val_accuracy"]
    loss = history.history["loss"]
    val_loss = history.history["val_loss"]
    epochs = range(1, len(accuracy) + 1)
    plt.plot(epochs, accuracy, "bo", label="Training accuracy")
    plt.plot(epochs, val_accuracy, "b", label="Validation accuracy")
    plt.title("Training and validation accuracy")
    plt.legend()
    plt.figure()
    plt.plot(epochs, loss, "bo", label="Training loss")
    plt.plot(epochs, val_loss, "b", label="Validation loss")
    plt.title("Training and validation loss")
    plt.legend()
    plt.show()
Let's see what the graphs look like.
plot_results(history)
Notice that the validation loss continuously increases after the third epoch. This indicates that the third epoch is the optimal epoch for the current model. We can confirm this by defining a function to return the epoch at which the validation loss is lowest.
def get_optimal_epoch(history):
    val_loss = history.history["val_loss"]
    return val_loss.index(min(val_loss)) + 1
get_optimal_epoch(history)
3
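As an aside, rather than inspecting the history after the fact, training could also be stopped automatically once the validation loss stops improving. A minimal sketch using Keras' built-in EarlyStopping callback (not used in this report) might look like this:

early_stopping = keras.callbacks.EarlyStopping(
    monitor="val_loss",        # watch the same quantity as our checkpoint
    patience=2,                # stop after 2 epochs without improvement
    restore_best_weights=True  # roll back to the best weights seen
)
# model.fit(train_dataset, epochs=20,
#           validation_data=validation_dataset,
#           callbacks=[early_stopping])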
Let's now test the model. Keras' load_model will load the best-fit model (i.e. the one saved at the epoch with the lowest validation loss) from the checkpoint file written by our callback. Based on the examples from DLWP we can define a test_model function and run it.
def test_model(filename, test_dataset):
    model = keras.models.load_model(filename)
    test_loss, test_acc = model.evaluate(test_dataset)
    print(f"Test accuracy: {test_acc:.3f}")
test_model("convnet_from_scratch.keras", test_dataset)
157/157 [==============================] - 54s 335ms/step - loss: 0.5528 - accuracy: 0.7424
Test accuracy: 0.742
The model has a test accuracy of 74.2%, better than our common-sense baseline of 50%. We have thus achieved statistical power, but further improvements can be made.
To achieve overfitting, we should try adding more and larger layers, as well as training for more epochs. Let's do this by adding three extra intermediate layers.
xmodel = create_model(4)
xmodel.summary()
Model: "model_4" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= input_5 (InputLayer) [(None, 180, 180, 3)] 0 _________________________________________________________________ rescaling_4 (Rescaling) (None, 180, 180, 3) 0 _________________________________________________________________ conv2d_13 (Conv2D) (None, 178, 178, 32) 896 _________________________________________________________________ max_pooling2d_10 (MaxPooling (None, 89, 89, 32) 0 _________________________________________________________________ conv2d_14 (Conv2D) (None, 87, 87, 64) 18496 _________________________________________________________________ max_pooling2d_11 (MaxPooling (None, 43, 43, 64) 0 _________________________________________________________________ conv2d_15 (Conv2D) (None, 41, 41, 128) 73856 _________________________________________________________________ max_pooling2d_12 (MaxPooling (None, 20, 20, 128) 0 _________________________________________________________________ conv2d_16 (Conv2D) (None, 18, 18, 256) 295168 _________________________________________________________________ max_pooling2d_13 (MaxPooling (None, 9, 9, 256) 0 _________________________________________________________________ conv2d_17 (Conv2D) (None, 7, 7, 256) 590080 _________________________________________________________________ flatten_4 (Flatten) (None, 12544) 0 _________________________________________________________________ dense_4 (Dense) (None, 1) 12545 ================================================================= Total params: 991,041 Trainable params: 991,041 Non-trainable params: 0 _________________________________________________________________
Now let us train this larger model for 20 epochs.
callbacks, history = train_model(xmodel, 20)
Epoch 1/20
469/469 [==============================] - 1287s 3s/step - loss: 0.6732 - accuracy: 0.5984 - val_loss: 0.6355 - val_accuracy: 0.6238
Epoch 2/20
469/469 [==============================] - 1227s 3s/step - loss: 0.5538 - accuracy: 0.7213 - val_loss: 0.4802 - val_accuracy: 0.7744
Epoch 3/20
469/469 [==============================] - 1215s 3s/step - loss: 0.4579 - accuracy: 0.7861 - val_loss: 0.5504 - val_accuracy: 0.7458
Epoch 4/20
469/469 [==============================] - 1213s 3s/step - loss: 0.3933 - accuracy: 0.8270 - val_loss: 0.4375 - val_accuracy: 0.8022
Epoch 5/20
469/469 [==============================] - 1210s 3s/step - loss: 0.3216 - accuracy: 0.8606 - val_loss: 0.4776 - val_accuracy: 0.7920
Epoch 6/20
469/469 [==============================] - 1207s 3s/step - loss: 0.2646 - accuracy: 0.8913 - val_loss: 0.3394 - val_accuracy: 0.8752
Epoch 7/20
469/469 [==============================] - 1212s 3s/step - loss: 0.2087 - accuracy: 0.9159 - val_loss: 0.3438 - val_accuracy: 0.8708
Epoch 8/20
469/469 [==============================] - 1225s 3s/step - loss: 0.1652 - accuracy: 0.9344 - val_loss: 0.3999 - val_accuracy: 0.8532
Epoch 9/20
469/469 [==============================] - 1215s 3s/step - loss: 0.1299 - accuracy: 0.9531 - val_loss: 0.4605 - val_accuracy: 0.8790
Epoch 10/20
469/469 [==============================] - 1228s 3s/step - loss: 0.1070 - accuracy: 0.9607 - val_loss: 0.4847 - val_accuracy: 0.8736
Epoch 11/20
469/469 [==============================] - 1217s 3s/step - loss: 0.0973 - accuracy: 0.9665 - val_loss: 0.6046 - val_accuracy: 0.8644
Epoch 12/20
469/469 [==============================] - 1219s 3s/step - loss: 0.0962 - accuracy: 0.9679 - val_loss: 0.8009 - val_accuracy: 0.8544
Epoch 13/20
469/469 [==============================] - 1219s 3s/step - loss: 0.0905 - accuracy: 0.9733 - val_loss: 0.7528 - val_accuracy: 0.8718
Epoch 14/20
469/469 [==============================] - 1229s 3s/step - loss: 0.0742 - accuracy: 0.9752 - val_loss: 0.7845 - val_accuracy: 0.8552
Epoch 15/20
469/469 [==============================] - 1212s 3s/step - loss: 0.0717 - accuracy: 0.9764 - val_loss: 0.6046 - val_accuracy: 0.8830
Epoch 16/20
469/469 [==============================] - 1218s 3s/step - loss: 0.0699 - accuracy: 0.9781 - val_loss: 0.8608 - val_accuracy: 0.8758
Epoch 17/20
469/469 [==============================] - 1216s 3s/step - loss: 0.0755 - accuracy: 0.9777 - val_loss: 0.8500 - val_accuracy: 0.8872
Epoch 18/20
469/469 [==============================] - 1240s 3s/step - loss: 0.0827 - accuracy: 0.9797 - val_loss: 0.9651 - val_accuracy: 0.8800
Epoch 19/20
469/469 [==============================] - 1230s 3s/step - loss: 0.0740 - accuracy: 0.9809 - val_loss: 0.8050 - val_accuracy: 0.8892
Epoch 20/20
469/469 [==============================] - 1243s 3s/step - loss: 0.0706 - accuracy: 0.9813 - val_loss: 0.8270 - val_accuracy: 0.8838
Here are the training results:
plot_results(history)
We see that the validation loss begins to increase after the sixth epoch, so the optimal epoch for the model is 6.
get_optimal_epoch(history)
6
Let's now evaluate the model on the test dataset. Remember that our test_model function loads the model saved at the optimal epoch via the ModelCheckpoint callback.
test_model("convnet_from_scratch.keras", test_dataset)
157/157 [==============================] - 107s 673ms/step - loss: 0.3238 - accuracy: 0.8738
Test accuracy: 0.874
We see that increasing the complexity of the model's intermediate convnet layers has resulted in a higher prediction accuracy of 87.4% on unseen data.
To further reduce the amount of overfitting, we can first try the following two regularization techniques in conjunction:
- Dropout essentially introduces noise into the output tensors of the network's layers, making it harder for the model to memorise patterns that are not significant.
- L2 regularization is a form of weight decay that helps to prevent overfitting by pushing a layer's weight coefficients towards smaller values. (DLWP)
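To illustrate what dropout does, here is a minimal sketch (illustrative only, not from DLWP) applying a Dropout layer to a small tensor in training mode:

import tensorflow as tf
from tensorflow.keras import layers

x = tf.ones((1, 8))
print(layers.Dropout(0.5)(x, training=True).numpy())
# e.g. [[0. 2. 2. 0. 0. 2. 0. 2.]]: roughly half the entries are zeroed at random,
# and the survivors are scaled by 1 / (1 - 0.5) = 2 so the expected sum is unchanged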
Let's see this in action by training a new model using these techniques. Apart from regularization, the configuration will be the same.
from tensorflow.keras import regularizers
model = create_model(4, dropout=0.5, regularizer=regularizers.l2(0.002))
callbacks, history = train_model(model, 20)
The following output was produced for the first six epochs:
Epoch 1/20
469/469 [==============================] - 160s 337ms/step - loss: 0.7273 - accuracy: 0.4994 - val_loss: 0.6936 - val_accuracy: 0.5000
Epoch 2/20
469/469 [==============================] - 162s 346ms/step - loss: 0.6937 - accuracy: 0.4931 - val_loss: 0.6936 - val_accuracy: 0.5000
Epoch 3/20
469/469 [==============================] - 166s 355ms/step - loss: 0.6936 - accuracy: 0.4943 - val_loss: 0.6936 - val_accuracy: 0.5000
Epoch 4/20
469/469 [==============================] - 159s 339ms/step - loss: 0.6936 - accuracy: 0.4969 - val_loss: 0.6936 - val_accuracy: 0.5000
Epoch 5/20
469/469 [==============================] - 156s 332ms/step - loss: 0.6936 - accuracy: 0.4957 - val_loss: 0.6936 - val_accuracy: 0.5000
Epoch 6/20
469/469 [==============================] - 155s 331ms/step - loss: 0.6936 - accuracy: 0.4956 - val_loss: 0.6936 - val_accuracy: 0.5000
Clearly something has gone wrong here: the validation loss and accuracy are stuck at 0.6936 and 50% respectively, which means the model has collapsed into making the same prediction for every image. We could try tinkering with the L2 regularization constant, but since the network is quite large, L2 regularization may be unnecessary in the first place: according to DLWP, it is mostly used with smaller models.
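The stuck loss value itself is revealing: a classifier that always outputs a probability of 0.5 incurs a binary cross-entropy of exactly ln 2 ≈ 0.6931 regardless of the true label, and the small excess here is presumably the L2 penalty term added to the loss. A quick check (assuming NumPy):

import numpy as np

# binary cross-entropy of a constant p = 0.5 prediction, for any label y:
# -(y * log(0.5) + (1 - y) * log(0.5)) = -log(0.5) = log(2)
print(np.log(2))  # 0.6931...

Let's instead try again using only a dropout regularization strategy.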
model = create_model(4, dropout=0.5)
callbacks, history = train_model(model, 20)
Epoch 1/20
469/469 [==============================] - 177s 375ms/step - loss: 0.6624 - accuracy: 0.6062 - val_loss: 0.5649 - val_accuracy: 0.6936
Epoch 2/20
469/469 [==============================] - 179s 382ms/step - loss: 0.5295 - accuracy: 0.7395 - val_loss: 0.5332 - val_accuracy: 0.7298
Epoch 3/20
469/469 [==============================] - 179s 381ms/step - loss: 0.4375 - accuracy: 0.8029 - val_loss: 0.4065 - val_accuracy: 0.8126
Epoch 4/20
469/469 [==============================] - 183s 391ms/step - loss: 0.3606 - accuracy: 0.8427 - val_loss: 0.3257 - val_accuracy: 0.8588
Epoch 5/20
469/469 [==============================] - 184s 393ms/step - loss: 0.3040 - accuracy: 0.8696 - val_loss: 0.3356 - val_accuracy: 0.8564
Epoch 6/20
469/469 [==============================] - 176s 375ms/step - loss: 0.2620 - accuracy: 0.8912 - val_loss: 0.3444 - val_accuracy: 0.8574
Epoch 7/20
469/469 [==============================] - 178s 380ms/step - loss: 0.2240 - accuracy: 0.9098 - val_loss: 0.2990 - val_accuracy: 0.8922
Epoch 8/20
469/469 [==============================] - 174s 371ms/step - loss: 0.1866 - accuracy: 0.9251 - val_loss: 0.2899 - val_accuracy: 0.8868
Epoch 9/20
469/469 [==============================] - 180s 384ms/step - loss: 0.1619 - accuracy: 0.9355 - val_loss: 0.3639 - val_accuracy: 0.8732
Epoch 10/20
469/469 [==============================] - 172s 367ms/step - loss: 0.1433 - accuracy: 0.9445 - val_loss: 0.3546 - val_accuracy: 0.8962
Epoch 11/20
469/469 [==============================] - 169s 361ms/step - loss: 0.1337 - accuracy: 0.9531 - val_loss: 0.3411 - val_accuracy: 0.8896
Epoch 12/20
469/469 [==============================] - 175s 372ms/step - loss: 0.1181 - accuracy: 0.9585 - val_loss: 0.3289 - val_accuracy: 0.9030
Epoch 13/20
469/469 [==============================] - 168s 358ms/step - loss: 0.1122 - accuracy: 0.9606 - val_loss: 0.4041 - val_accuracy: 0.8934
Epoch 14/20
469/469 [==============================] - 172s 368ms/step - loss: 0.1113 - accuracy: 0.9615 - val_loss: 0.4306 - val_accuracy: 0.9044
Epoch 15/20
469/469 [==============================] - 173s 369ms/step - loss: 0.1005 - accuracy: 0.9659 - val_loss: 0.4040 - val_accuracy: 0.8936
Epoch 16/20
469/469 [==============================] - 174s 372ms/step - loss: 0.1012 - accuracy: 0.9660 - val_loss: 0.4536 - val_accuracy: 0.8934
Epoch 17/20
469/469 [==============================] - 164s 350ms/step - loss: 0.1040 - accuracy: 0.9682 - val_loss: 0.6003 - val_accuracy: 0.8840
Epoch 18/20
469/469 [==============================] - 172s 366ms/step - loss: 0.1021 - accuracy: 0.9707 - val_loss: 0.5463 - val_accuracy: 0.8908
Epoch 19/20
469/469 [==============================] - 167s 355ms/step - loss: 0.1052 - accuracy: 0.9683 - val_loss: 0.5361 - val_accuracy: 0.8932
Epoch 20/20
469/469 [==============================] - 168s 359ms/step - loss: 0.1014 - accuracy: 0.9749 - val_loss: 0.3896 - val_accuracy: 0.9034
The results of regularizing the model can be seen as follows.
plot_results(history)
We see that dropout has slightly reduced overfitting, and that the optimal number of epochs is now 8.
get_optimal_epoch(history)
8
Now to evaluate on the test dataset using the model saved at the optimal epoch.
test_model("convnet_from_scratch.keras", test_dataset)
157/157 [==============================] - 11s 66ms/step - loss: 0.2636 - accuracy: 0.8864
Test accuracy: 0.886
The accuracy of the classifier on unseen data has also increased to 88.6%.
One reason that dropout only slightly improved the model may be that the dataset is large enough that regularization has limited room to help. In order to combat overfitting further, we should try tuning the model's hyperparameters, such as the optimizer's learning rate, the number of filters per layer, and the dropout rate; a sketch of what that could look like follows.
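As one possible approach (a sketch only, assuming the separate keras-tuner package; this tuning was not performed in this report), we could wrap create_model in a search over the dropout rate and learning rate:

import keras_tuner as kt

def build_model(hp):
    # reuse create_model, then recompile with a tunable dropout rate and learning rate
    model = create_model(4, dropout=hp.Float("dropout", 0.2, 0.6, step=0.1))
    model.compile(
        loss="binary_crossentropy",
        optimizer=keras.optimizers.RMSprop(
            learning_rate=hp.Float("learning_rate", 1e-4, 1e-2, sampling="log")
        ),
        metrics=["accuracy"]
    )
    return model

tuner = kt.RandomSearch(build_model, objective="val_accuracy",
                        max_trials=10, directory="tuning")
tuner.search(train_dataset, validation_data=validation_dataset, epochs=8)

The best configuration could then be retrieved via tuner.get_best_hyperparameters().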
We have seen the effectiveness of convnet layers for deep learning tasks related to computer vision. We have furthermore developed a binary classification model for images that beats a common-sense baseline by a significant margin of predictive accuracy: 88.6% to 50%. We then observed how dropout regularization can help to reduce overfitting in large, complex models, although its effect was limited in this case. Finally, we noted that further work should be done to combat overfitting by tuning the hyperparameters of the model. Doing this effectively should further increase the model's predictive accuracy on unseen data.