Zhang Yuxuan, Jiang Wenxin, Fan Yifei, Zhang Juntao (In order of speakers)
Jiang Wenxin
```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import (
    EarlyStopping, LearningRateMonitor, ModelCheckpoint, StochasticWeightAveraging,
)

# Assumed early-stopping settings (not given in the original snippet)
early_stopping = EarlyStopping(monitor="val_acc", mode="max", patience=5)

callbacks = [
    ModelCheckpoint(monitor="val_acc", mode="max"),  # keep the checkpoint with the best val_acc
    LearningRateMonitor(logging_interval="step"),    # log the learning rate every step
    StochasticWeightAveraging(swa_lrs=1e-2),         # stochastic weight averaging
    early_stopping,
]
trainer = Trainer(
    max_epochs=50,
    devices="auto",        # automatically choose GPU or CPU
    logger=wandb_logger,   # a WandbLogger instance (see the logging section below)
    callbacks=callbacks,   # defined above
)
```
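With a `LightningModule` and dataloaders prepared elsewhere, training is then launched with `Trainer.fit`; the variable names below are assumptions for illustration:

```python
# model, train_loader and val_loader are assumed to be defined elsewhere
trainer.fit(model, train_loader, val_loader)
```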
Data augmentation tools: random crop, random flip, random rotation, etc. Benefits of data augmentation: it enlarges the effective training set and reduces over-fitting, so the model generalizes better.
Preprocessing tools: Normalize, Resize, etc. Why data normalization? Bringing the inputs to a common scale (e.g. zero mean and unit variance) makes optimization more stable and faster.
Why data resizing? Resizing brings every image to the fixed input size that the network (and its pretrained weights) expects.
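A minimal torchvision sketch of such a pipeline; the crop/resize sizes and normalization statistics below are illustrative assumptions, not the project's actual values:

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),    # random crop
    transforms.RandomHorizontalFlip(),       # random flip
    transforms.RandomRotation(15),           # random rotation (degrees)
    transforms.Resize(224),                  # resize to the model's expected input size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics (assumed)
                         std=[0.229, 0.224, 0.225]),
])
```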
```python
import torchvision

# Load ResNet18 with weights pretrained on ImageNet
model = torchvision.models.resnet18(pretrained=True)
```
Transfer learning from a pretrained model is especially useful when the dataset is small.
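A minimal sketch of adapting the pretrained backbone to a new, small dataset; freezing the backbone and the 10-class head are assumptions for illustration:

```python
import torch.nn as nn

# Assumption: freeze the pretrained backbone and train only the new head
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer for the new task (e.g. 10 classes on CIFAR-10)
model.fc = nn.Linear(model.fc.in_features, 10)
```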
```python
import torch
from pytorch_lightning import seed_everything

# Make cuDNN (hardware-level) behaviour deterministic
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Seed the Python, NumPy and PyTorch random number generators
seed_everything(42)
```
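Lightning can also enforce determinism at the Trainer level; a small sketch, assuming a recent Lightning version that supports these flags:

```python
# Also seed dataloader workers and request deterministic algorithms from the Trainer
seed_everything(42, workers=True)
trainer = Trainer(deterministic=True)  # combine with the other Trainer arguments used above
```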
The learning-rate finder suggests a starting learning rate automatically (but the suggestion sometimes does not work well). The rule of thumb is not to pick the point with the lowest loss, but a learning rate in the middle of the steepest downward slope of the loss curve (the red point in the plot).
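A minimal sketch of running the learning-rate finder in PyTorch Lightning, assuming the PL 1.x tuner API (the exact call differs slightly between versions):

```python
# Run the LR range test and adopt the suggested learning rate
lr_finder = trainer.tuner.lr_find(model)
fig = lr_finder.plot(suggest=True)            # loss vs. learning rate, suggestion marked in red
model.learning_rate = lr_finder.suggestion()  # assumes the LightningModule reads self.learning_rate
```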
```python
from pytorch_lightning.callbacks import GradientAccumulationScheduler

callbacks = [
    LearningRateMonitor(logging_interval="step"),
    StochasticWeightAveraging(swa_lrs=1e-2),
    GradientAccumulationScheduler(scheduling={...}),  # maps start epoch -> accumulation factor
    early_stopping,
]
trainer = Trainer(
    gradient_clip_val=0.5,   # clip gradients to avoid explosions
    devices="auto",          # default
    logger=wandb_logger, ...
)
```
Gradient accumulation: run K small batches of size N before performing an optimizer step, giving an effective batch size of K×N. This controls the effective batch size and improves the stability and generalization of the model (see the scheduler sketch after this list).
Early stopping: stop at the best epoch, not the last epoch, to avoid over-fitting.
Gradient clipping: can be enabled to avoid exploding gradients.
Stochastic weight averaging: smooths the loss landscape, making it harder to end up in a sharp local minimum during optimization, and improves generalization.
Learning-rate monitoring and scheduling: control the learning rate and make the model converge faster.
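A sketch of the gradient-accumulation scheduler; the epoch-to-factor mapping below is illustrative, not the project's actual schedule:

```python
from pytorch_lightning.callbacks import GradientAccumulationScheduler

# Accumulate 8 batches for epochs 0-3, 4 batches for epochs 4-7, then no accumulation
accumulator = GradientAccumulationScheduler(scheduling={0: 8, 4: 4, 8: 1})
trainer = Trainer(callbacks=[accumulator])
```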
(Figures: dashboard views before training and after training.)
Dashboard: Weights & Biases (via the wandb_logger above); a more commonly used alternative is TensorBoard.
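A minimal sketch of creating the `wandb_logger` used by the Trainer above; the project name is a placeholder assumption:

```python
from pytorch_lightning.loggers import WandbLogger
# from pytorch_lightning.loggers import TensorBoardLogger  # TensorBoard alternative

wandb_logger = WandbLogger(project="my-project")  # "my-project" is a placeholder name
```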
In practice, as transformers become more and more popular, many companies and researchers want to add attention blocks to their existing models in order to improve performance. However, training an attention-based model is computationally expensive and not easy.
In this project, we add an attention block to the ResNet18 model, and we follow the ViT structure to make use of the attention block.
The general form of self-attention:
$$ A_{ij}=f(h_i,h_j) $$
where $h_i$ and $h_j$ are the features for node $i$ and $j$, and $f$ is an arbitrary function that computes the attention score between two nodes.
The classical self-attention:
$$ A=softmax\left(\frac{H^{\top}(Q^{\top}K)H}{\sqrt{d_k}}\right) $$
where $H\in\mathbb{R}^{f\times n}$ is the feature matrix (one column per embedding), $Q,K\in\mathbb{R}^{f^{'}\times f}$ are the query and key projection matrices for self-attention, and $d_k$ is the dimension of the key vectors. In effect, the attention score is just a bilinear function of the two feature vectors.
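A minimal PyTorch sketch of this classical attention over node features; shapes follow the matrices above with a batch dimension added, and all names are illustrative:

```python
import torch

def classic_attention(H, W_Q, W_K):
    """H: (batch, nodes, f); W_Q, W_K: (f, f_prime). Returns A: (batch, nodes, nodes)."""
    Q = H @ W_Q                                    # queries: (batch, nodes, f_prime)
    K = H @ W_K                                    # keys:    (batch, nodes, f_prime)
    d_k = K.shape[-1]
    scores = Q @ K.transpose(-1, -2) / d_k ** 0.5  # scaled dot-product scores
    return torch.softmax(scores, dim=-1)
```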
The graph attention block is shown in the following figure.
The formula for graph attention:
$$ A_{ij}=softmax(\sigma(W^{\top}[h_i||h_j])) $$
where $W\in\mathbb{R}^{2f\times 1}$ is the weight vector (the concatenation $[h_i||h_j]$ has dimension $2f$), $h_i\in \mathbb{R}^{f}$ is the feature of the $i^{th}$ node, and $\sigma$ denotes the activation function.
Splitting $W$ into $W_Q,W_K\in\mathbb{R}^{f\times 1}$ so that $W^{\top}[h_i||h_j]=W_Q^{\top}h_i+W_K^{\top}h_j$, the attention matrix $A$ can be computed in matrix form as:
$$ A=softmax(\sigma(W_K^{\top}H+H^{\top}W_Q)) $$
where $H\in\mathbb{R}^{f\times n}$ is the feature matrix of all nodes; the row vector $W_K^{\top}H$ and the column vector $H^{\top}W_Q$ broadcast to an $n\times n$ matrix.
Then we can implement the graph attention block in PyTorch like this:
```python
Q = H @ W_Q   # Q : (batch, nodes, 1), entry i is W_Q^T h_i
K = H @ W_K   # K : (batch, nodes, 1), entry j is W_K^T h_j
# Broadcasting (batch, nodes, 1) + (batch, 1, nodes) gives (batch, nodes, nodes),
# matching A_ij = softmax(sigma(W_Q^T h_i + W_K^T h_j))
A = torch.softmax(self.activation(Q + K.transpose(-1, -2)), dim=-1)
# A : (batch, nodes, nodes)
```
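Putting it together, a self-contained sketch of the block; the module name, the LeakyReLU activation, and the final aggregation $AH$ are assumptions for illustration:

```python
import torch
import torch.nn as nn

class GraphAttentionBlock(nn.Module):
    """Minimal sketch of the graph-attention block described above (names assumed)."""

    def __init__(self, feats):
        super().__init__()
        self.W_Q = nn.Linear(feats, 1, bias=False)  # plays the role of W_Q
        self.W_K = nn.Linear(feats, 1, bias=False)  # plays the role of W_K
        self.activation = nn.LeakyReLU(0.2)         # assumed choice of sigma

    def forward(self, H):                # H : (batch, nodes, feats)
        Q = self.W_Q(H)                  # (batch, nodes, 1)
        K = self.W_K(H)                  # (batch, nodes, 1)
        A = torch.softmax(self.activation(Q + K.transpose(-1, -2)), dim=-1)  # (batch, nodes, nodes)
        return A @ H                     # aggregate node features with the attention weights
```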
In order to use the graph attention block, we need to convert image patches into the feature vectors of the nodes in the graph. In this project, we use a `Conv2d` layer for this conversion instead of a `Flatten` layer:
```python
nn.Conv2d(chans, feats, kernel_size=patch_size, stride=patch_size)
```
Setting both `kernel_size` and `stride` equal to `patch_size` makes each output position correspond to exactly one non-overlapping patch.
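A short usage sketch showing how this layer produces the node-feature matrix H consumed by the attention block; the variables `chans`, `feats`, `patch_size` and `images` are illustrative:

```python
# images : (batch, chans, height, width); the layer embeds non-overlapping patches
patch_embed = nn.Conv2d(chans, feats, kernel_size=patch_size, stride=patch_size)
x = patch_embed(images)            # (batch, feats, height // patch_size, width // patch_size)
H = x.flatten(2).transpose(1, 2)   # (batch, nodes, feats), one node per patch
```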
Model | Pretrained | Attention | Training Epochs | Test Accuracy |
---|---|---|---|---|
Resnet18 | | | 50 | 0.926 |
Resnet18 + GraphAtten | | ✔️ | 50 | 0.918 |
Resnet18 \|> GraphAtten | ✔️ | ✔️ | 5 | 0.935 |
Resnet18 \|> ClassicAtten | ✔️ | ✔️ | 5 | 0.931 |
The test accuracy actually decreases when we add the attention block to the ResNet18 model and train the whole model from scratch. However, when we start from the pretrained model, the metric increases.
In conclusion, using a pretrained backbone is a good way to boost the performance of a newly added attention block.
The advantages of this method: it needs far fewer training epochs (5 instead of 50) and reaches a higher test accuracy than training the whole model from scratch.
We apply model interpretability algorithms from the Captum library to a model trained on the CIFAR-10 dataset, in order to attribute the predicted label of an image to its input pixels and visualize the result.
Results of "cat" class:
Results of "plane" class:
Results of "ship" class:
Results of "ship" class:
Then we apply the model interpretability algorithms to a handpicked image and visualize the per-pixel attributions by overlaying them on the image.
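A minimal Captum sketch of this step, using Integrated Gradients as a representative algorithm; the algorithm choice and the `model`, `image`, and `label` names are assumptions, and the project may use other Captum attribution methods as well:

```python
import numpy as np
from captum.attr import IntegratedGradients
from captum.attr import visualization as viz

# model: the trained classifier; image: a (1, 3, H, W) input tensor; label: the predicted class index
ig = IntegratedGradients(model)
attributions = ig.attribute(image, target=label)

# Overlay the per-pixel attributions on the original image
viz.visualize_image_attr(
    np.transpose(attributions.squeeze(0).cpu().detach().numpy(), (1, 2, 0)),
    np.transpose(image.squeeze(0).cpu().detach().numpy(), (1, 2, 0)),
    method="blended_heat_map",
    sign="all",
)
```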
Git is an open-source distributed version control system that lets teams and individuals manage projects quickly and efficiently, and Colab is a free cloud service that provides free GPUs. Using Git and Colab to collaborate on the project was a good choice for us: we managed the project and shared the code easily and efficiently throughout the teamwork.