Dogs & Cats : Using Pretrained Convolution Neural Network for Feature Extraction and Prediction with Pytorch

In a 2013 Kaggle competition, people need to write an algorithm to distinguish whether the animal in an image is a dog or a cat. It's an easy task for human, but for a machine, it may not be so.

A Dog and a Cat

Asirra Dataset

Internet protection usually faces a challenge. It should be easily identified by a human and let machine unable to distriguish. For example, verification code can effectively reduce spam mail and prevent users' password from malicious crack.

Asirra (Animal Species Image Recognition for Restricting Access) is a HIP (Human Interactive Proof) designed by Microsoft Research Labs that works by asking users to identify photographs of cats and dogs. Asirra has a lot of photos of different cats and dogs since its partnerships with the world largest website devoted to finding homes for homeless pets, pet finder. They have provided three million images of cats and dogs. We will use its subset as the training dataset. The dataset can be downloaded from Kaggle.

Transfer Learning

In this task, we can write a neural network on our own. There is a possibility to find our network is not so effective. It not only has low accuracy but also converges slowly or doesn't converge. At this time, we can use mature models, such as VggNet, GoogleNet, ResNet etc, to help us solve these problems. These excellent networks are implemented by world-leading deep learning laboratories after numerous trials and error and are the champions or the second places in ImageNet. As a result, using these networks can guarantee a degree of performance. Nowadays, the threshold of deep learning has become lower and lower. On one hand, current frameworks make writing a network very easy. On the other hand, these laboratories are willing to open source their models and experiment results.

We can use existing models to train on other datasets and fine tune them. However, this brings some problems of computation resource since running an experiment can consume large computation resource for a large dataset. Sometimes, we don't have such powerful resource. Is it means that there is no other way? No, it's not. With transfer learning, it can let people without powerful computation resource to accomplish the training of complex models in deep learning.

In a classic supervised learning of machine learning, if we are training a model for task A, we provide the data and label of task A. Now we have trained a model A from the given dataset and expect it can perform well in the unknown data of the same task. In another scenario, given the data and label of task B, we can do the same thing.

But in some cases, there is not enough dataset for a specific task. Then classic supervised learning can't support it. Transfer learning enables borrowing existing data and label from relative tasks to solve this situation, preserves the information of solving relative tasks, and applies it to our target mission.

As a consequence, we can use pretrained neural networks in ImageNet to perform transfer learning. These pretrained networks contain the information, the weights and parameters, to classify the 1000 classes in Imagenet, including cats and dogs.

Convolution neural network consists of two parts, convolution layers and classifying layers. What convolution layers mainly do is extracting features in images and the effect of feature extracting in pretrained networks are very good since the networks have already learned the necessary weghits. In our task, binary classification of cats and dogs, we use fully connected classifying layers.

To summarize, we transfer the pretrained convolution layers, only update the weights of fully connected layers. and obtain our target of binary classification.

Finally, transfer learning may not be approiate for any scenario. As previously mentioned, it has to be relative tasks. As a result, transfer learning works well on similar dataset. For example, the weights of a pretrained network is trained from classifying naural landscape. Then using these weights to do face recognition may not obtain a good result since the feature extraction of human faces is different from that of the natural landscape and the corresponding trained weights is different.

Data Preprocessing

After download the data from Kaggle, there is a file called all.zip. Put it into the directory of data. In the directory of data, perform these three bash commands:

unzip all.zip
unzip train.zip
unzip test.zip

Now, there should be a directory train that contains the training images, test that contains the testing images, and sample_submission.csv for the sample submission file. And you are ready to go.

In [1]:
import os
import operator
import cv2
from tqdm import tqdm
import h5py
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from sklearn.model_selection import train_test_split

import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable
import torchvision
import torchvision.models as models
import torchvision.transforms as transforms

DATA_PATH = 'data'
TRAIN_PATH = 'data/train'
TEST_PATH = 'data/test'
classes = ('dog', 'cat')
print(os.listdir(DATA_PATH))
IMG_SIZE = (224, 224)
img_classes = 2
BATCH_SIZE = 512
NB_EPOCH = 5

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(device)
/usr/lib64/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
['train.zip', 'sample_submission.csv', 'test.zip', 'train', 'test', '.ipynb_checkpoints', 'all.zip']
cuda:0
In [2]:
train_img_list = []
train_label_list = []
test_img_dict = {}
img_size_dict = {}

for file in os.listdir(TRAIN_PATH):
    img = cv2.imread(os.path.join(TRAIN_PATH, file))
    train_img_list.append(img)
    img_size_dict[img.shape] = img_size_dict.get(img.shape, 0) + 1
    if 'dog' in file:
        train_label_list.append(0)
    else:
        train_label_list.append(1)
        
for file in os.listdir(TEST_PATH):
    img = cv2.imread(os.path.join(TEST_PATH, file))
    test_img_dict[int(file.replace('.jpg', ''))] = img
    img_size_dict[img.shape] = img_size_dict.get(img.shape, 0) + 1
In [3]:
print('There are {} training data and {} testing data'.format(len(train_img_list), len(test_img_dict)))
print('Among them, there {} dogs and {} cats in training data'.format(train_label_list.count(0), train_label_list.count(1)))

fig = plt.figure('Example of a dog and a cat')
ax0 = fig.add_subplot(1, 2, 1)
ax0.imshow(train_img_list[20010])
ax0.axis('off')

ax1 = fig.add_subplot(1, 2, 2)
ax1.imshow(train_img_list[1])
ax1.axis('off')

plt.suptitle('Example of a Dog and a Cat')
plt.show()

print('There are {} different image sizes'.format(len(img_size_dict)))
most_10_img_size_dict = dict(sorted(img_size_dict.items(), key=operator.itemgetter(1), reverse=True)[:10])
plt.bar(range(len(most_10_img_size_dict)), list(most_10_img_size_dict.values()), align='center')
plt.xticks(range(len(most_10_img_size_dict)), list(most_10_img_size_dict.keys()), rotation='vertical')
plt.title('10 Most Common Image Size')
plt.show()
There are 25000 training data and 12500 testing data
Among them, there 12500 dogs and 12500 cats in training data
There are 11550 different image sizes
In [4]:
print('Resize images to {}'.format(IMG_SIZE))
for i, img in enumerate(train_img_list):
    train_img_list[i] = cv2.resize(train_img_list[i], IMG_SIZE)
    
for key in test_img_dict:
    test_img_dict[key] = cv2.resize(test_img_dict[key], IMG_SIZE)
    
train_img = np.array(train_img_list)

train_mean = np.mean(train_img, axis=(0, 1, 2), keepdims=True)
train_std = np.std(train_img, axis=(0, 1, 2), keepdims=True)
print('Traing image mean {} and std {}'.format(train_mean, train_std))

# zero mean and unit variance
train_img = (train_img - train_mean) / train_std
for key in test_img_dict:
    test_img_dict[key] = (test_img_dict[key] - train_mean[0]) / train_std[0]
Resize images to (224, 224)
Traing image mean [[[[106.20786271 115.92731951 124.40450583]]]] and std [[[[65.62016858 64.95751316 66.62977726]]]]

1. Using a Pretrained Model

In the first section, we will introduce the pretrained network, ResNet18, modify the last fully connect layer to our self-designed fully connected layer, update the weights of the whole network, and converge very fast.

Define data loader

In [5]:
x_train, x_val, y_train, y_val = train_test_split(train_img, train_label_list, test_size = 0.1)

# NCHW format
tensor_train_img = torch.stack([torch.Tensor(i).permute(2, 0, 1) for i in x_train])
tensor_train_label = torch.stack([torch.LongTensor([i]) for i in y_train]).view(-1)

train_dataset = torch.utils.data.TensorDataset(tensor_train_img, tensor_train_label)
train_dataloader = torch.utils.data.DataLoader(train_dataset, 
                                               batch_size=BATCH_SIZE, 
                                               shuffle=True)

tensor_val_img = torch.stack([torch.Tensor(i).permute(2, 0, 1) for i in x_val])
tensor_val_label = torch.stack([torch.LongTensor([i]) for i in y_val]).view(-1)

val_dataset = torch.utils.data.TensorDataset(tensor_val_img, tensor_val_label)
val_dataloader = torch.utils.data.DataLoader(val_dataset,
                                             batch_size=BATCH_SIZE, 
                                             shuffle=False)

tensor_test_img = torch.stack([torch.Tensor(test_img_dict[key]).permute(2, 0, 1) for key in test_img_dict])

test_dataset = torch.utils.data.TensorDataset(tensor_test_img)
test_dataloader = torch.utils.data.DataLoader(test_dataset, 
                                              batch_size=BATCH_SIZE, 
                                              shuffle=False)

Define a Model

In [6]:
net = models.resnet18(pretrained=True)
dim_in = net.fc.in_features
net.fc = nn.Linear(dim_in, img_classes)
net = net.to(device)

Define a loss function and optimizer

In [7]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(net.parameters(), lr=1e-3)

Train the network

In [8]:
def test(net, dataloader):
    correct = 0
    total = 0
    with torch.no_grad():
        for data in dataloader:
            images, labels = data
            images, labels = images.to(device), labels.to(device)

            outputs = net(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    print('Accrracy: %d %%' % (100 * correct / total))
    
def train(net, dataloader, val_dataloader, optimizer, criterion):
    for epoch in range(NB_EPOCH):

        running_loss = 0.0
        for i, data in enumerate(dataloader, 0):
            inputs, labels = data
            inputs, labels = inputs.to(device), labels.to(device)

            optimizer.zero_grad()
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()

            if i % 40 == 39:
                print('[%d, %5d] loss: %.3f' % (epoch + 1, i + 1, running_loss / 40))
                test(net, val_dataloader)
                running_loss = 0.0
    print('Finished Training')
In [9]:
train(net, train_dataloader, val_dataloader, optimizer, criterion)
[1,    40] loss: 0.111
Accrracy: 97 %
[2,    40] loss: 0.027
Accrracy: 97 %
[3,    40] loss: 0.020
Accrracy: 97 %
[4,    40] loss: 0.018
Accrracy: 97 %
[5,    40] loss: 0.018
Accrracy: 98 %
Finished Training

Test the network on testing data

After uploading pretrained_prediction.csv to Late Submission, the result is

Log Loss: 0.13165

In [10]:
def predict(net, dataloader):
    predicted_list = []
    
    prob = nn.Softmax(dim = 1)

    with torch.no_grad():
        for data in dataloader:
            images = data[0]
            images = images.to(device)

            outputs = prob(net(images))
            predicted_list.append(outputs[:, 0].cpu().data.numpy())

    return np.concatenate(predicted_list, axis=0)
In [11]:
predicted = predict(net, test_dataloader)

df = pd.read_csv(os.path.join(DATA_PATH, 'sample_submission.csv'), index_col='id')
for i, key in enumerate(test_img_dict):
    df.at[key, 'label'] = predicted[i]
df.to_csv('pretrained_prediction.csv')

2. Fixing Convoltional Layer of a Pretrained Model

In the second section, we fix the weights in convolution layers, only update the weights of the fully connected layer. This can greatly reduce the training time.

Define a Model

In [12]:
net = models.resnet18(pretrained=True)
for param in net.parameters():
    param.requires_grad = False
    
dim_in = net.fc.in_features
net.fc = nn.Linear(dim_in, img_classes)


net = net.to(device)

Define a loss function and optimizer

In [13]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam (net.fc.parameters(), lr=1e-3)#, momentum=0.9)

Train the network

In [14]:
train(net, train_dataloader, val_dataloader, optimizer, criterion)
[1,    40] loss: 0.305
Accrracy: 94 %
[2,    40] loss: 0.127
Accrracy: 95 %
[3,    40] loss: 0.104
Accrracy: 96 %
[4,    40] loss: 0.093
Accrracy: 96 %
[5,    40] loss: 0.087
Accrracy: 96 %
Finished Training

Test the network on testing data

After uploading pretrained_prediction.csv to Late Submission, the result is

Log Loss: 0.11055

In [15]:
predicted = predict(net, test_dataloader)

df = pd.read_csv(os.path.join(DATA_PATH, 'sample_submission.csv'), index_col='id')
for i, key in enumerate(test_img_dict):
    df.at[key, 'label'] = predicted[i]
df.to_csv('fixed_pretrained_prediction.csv')

3. Feature Extracting Network with Pretrained Model

In the third section, we combine multiplce pretrained models, fix their weights of convolution layers, and only update the weights of last fully connected layer. Since there is no update in the convolution layers, the results of forwarding through convolution layers are the same. There is no need to forward the whole dataset in every iteration. We can store the forwarding result as feature vectors after one iteration of forwarding.

feature_net is the feature extracting network. It can accept vgg, inceptionv3, and resnet152 as the input of parameter model, representing the 19-layered Vgg network, Inception V3, or 152-layered Residual network. With these pretrained network, we remove their fully connected layers, add average pooling layers, and transform the dataset into feature vectors.

classifier is the fully connected classifier network for our dataset with Dropout preventing overfitting.

Then we can input the dataset into the network to perform forwarding, obtain their feature vectors, and store the result into h5 files.

Feature Extracting

In [16]:
BATCH_SIZE = 4
train_dataloader = torch.utils.data.DataLoader(train_dataset, 
                                               batch_size=BATCH_SIZE, 
                                               shuffle=False)
val_dataloader = torch.utils.data.DataLoader(val_dataset,
                                             batch_size=BATCH_SIZE, 
                                             shuffle=False)
test_dataloader = torch.utils.data.DataLoader(test_dataset, 
                                              batch_size=BATCH_SIZE, 
                                              shuffle=False)

model_list = ['vgg', 'inceptionv3', 'resnet152']
feature_dim = {}

class feature_net(nn.Module):
    def __init__(self, model):
        super(feature_net, self).__init__()
        
        if model == 'vgg':
            vgg = models.vgg19(pretrained=True)
            self.feature = nn.Sequential(*list(vgg.children())[:-1])
            self.feature.add_module('global average', nn.AvgPool2d(7))
        elif model == 'inceptionv3':
            inception = models.inception_v3(pretrained=True)
            self.feature = nn.Sequential(*list(inception.children())[:-1])
            self.feature._modules.pop('13')
            self.feature.add_module('global average', nn.AvgPool2d(26))
        elif model == 'resnet152':
            resnet = models.resnet152(pretrained=True)
            self.feature = nn.Sequential(*list(resnet.children())[:-1])
            
    def forward(self, x):
        x = self.feature(x)
        x = x.view(x.size(0), -1)
        return x
    
class classifier(nn.Module):
    def __init__(self, dim, n_classes):
        super(classifier, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(dim, 1000),
            nn.ReLU(True),
            nn.Dropout(0.5),
            nn.Linear(1000, n_classes)
        )
    
    def forward(self, x):
        x = self.fc(x)
        return x
    
h5_list = {}

for model in model_list:
    for phase, dataloader in zip(['train', 'val', 'test'], [train_dataloader, val_dataloader, test_dataloader]):
        featurenet = feature_net(model).to(device)
        feature_map = torch.FloatTensor()
        label_map = torch.LongTensor()
        for data in tqdm(dataloader):
            if phase != 'test':
                img, label = data
            else:
                img = data[0]
            img = Variable(img).to(device)
            out = featurenet(img)
            feature_map = torch.cat((feature_map, out.cpu().data), 0)
            if phase != 'test':
                label_map = torch.cat((label_map, label), 0)
        feature_map = feature_map.numpy()
        label_map = label_map.numpy()
        file_name = '{}_feature_{}.hd5f'.format(phase, model)
        h5_path = file_name
        phase_list = h5_list.get(phase, [])
        phase_list.append(file_name)
        h5_list[phase] = phase_list
        with h5py.File(h5_path, 'w') as h:
            h.create_dataset('data', data=feature_map)
            if phase != 'test':
                h.create_dataset('label', data=label_map)
    
    feature_dim[model] = feature_map.shape[1]
100%|██████████| 5625/5625 [02:06<00:00, 44.50it/s]
100%|██████████| 625/625 [00:10<00:00, 61.18it/s]
100%|██████████| 3125/3125 [00:55<00:00, 56.50it/s]
100%|██████████| 5625/5625 [30:10<00:00,  1.85it/s]
100%|██████████| 625/625 [02:28<00:00,  4.02it/s]
100%|██████████| 3125/3125 [13:40<00:00,  2.87it/s]
100%|██████████| 5625/5625 [09:12<00:00, 10.19it/s]
100%|██████████| 625/625 [00:20<00:00, 29.90it/s]
100%|██████████| 3125/3125 [02:33<00:00, 20.34it/s]

h5 Data Loader

In [17]:
class h5Dataset(torch.utils.data.Dataset):
    
    def __init__(self, h5py_list, nSamples=None, train=True):
        label_file = h5py.File(h5py_list[0], 'r')
        if train:
            self.label = torch.from_numpy(label_file['label'].value)
        self.nSamples = len(label_file['data'].value)
        temp_dataset = torch.FloatTensor()
        for file in h5py_list:
            h5_file = h5py.File(file, 'r')
            dataset = torch.from_numpy(h5_file['data'].value)
            temp_dataset = torch.cat((temp_dataset, dataset), 1)
        
        self.train = train
        self.dataset = temp_dataset
    
    def __len__(self):
        return self.nSamples

    def __getitem__(self, index):
        assert index < len(self), 'index range error'
        data = self.dataset[index]
        if self.train:
            label = self.label[index]
            return (data, label)
        else:
            return (data,)

Define a loss function and optimizer

Train the network

Test the network on testing data

After uploading pretrained_prediction.csv to Late Submission, the result is

Log Loss: 0.09273

In [18]:
BATCH_SIZE = 128
train_dataset = h5Dataset(h5_list['train'])
train_dataloader = torch.utils.data.DataLoader(train_dataset,
                                               batch_size=BATCH_SIZE, 
                                               shuffle=True)

val_dataset = h5Dataset(h5_list['val'])
val_dataloader = torch.utils.data.DataLoader(val_dataset,
                                             batch_size=BATCH_SIZE, 
                                             shuffle=False)

test_dataset = h5Dataset(h5_list['test'], train=False)
test_dataloader = torch.utils.data.DataLoader(test_dataset,
                                              batch_size=BATCH_SIZE, 
                                              shuffle=False)

dim = 0
for key in feature_dim:
    dim += feature_dim[key]
net = classifier(dim, img_classes).to(device)
    
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(net.parameters(), lr=1e-3)
    
train(net, train_dataloader, val_dataloader, optimizer, criterion)
    
predicted = predict(net, test_dataloader)

df = pd.read_csv(os.path.join(DATA_PATH, 'sample_submission.csv'), index_col='id')
for i, key in enumerate(test_img_dict):
    df.at[key, 'label'] = predicted[i]
df.to_csv('feature_prediction.csv')
[1,    40] loss: 0.463
Accrracy: 96 %
[1,    80] loss: 0.097
Accrracy: 96 %
[1,   120] loss: 0.085
Accrracy: 97 %
[1,   160] loss: 0.065
Accrracy: 97 %
[2,    40] loss: 0.054
Accrracy: 98 %
[2,    80] loss: 0.060
Accrracy: 98 %
[2,   120] loss: 0.048
Accrracy: 97 %
[2,   160] loss: 0.065
Accrracy: 98 %
[3,    40] loss: 0.056
Accrracy: 98 %
[3,    80] loss: 0.048
Accrracy: 98 %
[3,   120] loss: 0.059
Accrracy: 98 %
[3,   160] loss: 0.045
Accrracy: 98 %
[4,    40] loss: 0.056
Accrracy: 97 %
[4,    80] loss: 0.054
Accrracy: 98 %
[4,   120] loss: 0.052
Accrracy: 98 %
[4,   160] loss: 0.060
Accrracy: 97 %
[5,    40] loss: 0.060
Accrracy: 98 %
[5,    80] loss: 0.051
Accrracy: 97 %
[5,   120] loss: 0.048
Accrracy: 98 %
[5,   160] loss: 0.055
Accrracy: 98 %
Finished Training

Visualization

In [19]:
f, axes = plt.subplots(1, 4, figsize = (12, 10))
for i in range(4):
    axes[i].imshow((x_val[i] * train_std[0] + train_mean[0]).astype(int))
    axes[i].set_title('%5s' % classes[y_val[i]])
plt.suptitle('GroundTruth')
plt.tight_layout(rect=[0, 0.6, 1, 1])
plt.show()
In [20]:
dataiter = iter(val_dataloader)
tensors, labels = dataiter.next()
outputs = net(tensors.to(device))

_, predicted = torch.max(outputs, 1)

print('Predicted: ', ' '.join('%5s' % classes[predicted[j]]
                              for j in range(4)))
Predicted:    cat   cat   dog   cat

Summary

With transfer learning , we can use pretrained network to fine tune the accuracy of neural network, use convolution layers to perform feature extraction, and save computation resource with it.