
Deep Learning Milestone Papers

random readings

transformer & multi-head attention

A general explanation: {Ref-Medium}
A code implementation: {Ref}

The essence of multi-head attention is to let the model calculate a weight for each value according to its position (context); this is done with the formula:

    \[Attention(Q, K, V) = softmax(\frac{Q K^{T}}{\sqrt{d_k}}) V\]

Where Q is the Query, K is the Key, V is the Value, and d_k is the dimension of the key vectors. The dot products between Q and K capture how relevant each position is to the others.
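
Below is a minimal NumPy sketch of the formula (my own toy example, not the implementation from the references above). It computes a single attention head; multi-head attention simply runs several such heads on learned projections of Q, K, V and concatenates the results.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)       # (n_query, n_key): similarity of each query to each key
        weights = softmax(scores, axis=-1)    # each row sums to 1: how much each position attends to each value
        return weights @ V                    # weighted sum of values

    # toy example: 4 tokens, d_k = d_v = 8
    rng = np.random.default_rng(0)
    Q = rng.normal(size=(4, 8)); K = rng.normal(size=(4, 8)); V = rng.normal(size=(4, 8))
    print(attention(Q, K, V).shape)           # (4, 8)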

This medium post gives a graphic illustration of the process: {Link}

This blog has a series of posts explaining natural language processing. I found it very detailed, with many figures showing how the weights of each layer are distributed. {Link}

Embedding layers

Ref: {Link}
It is a mapping that reduces the dimension of the input samples. For instance, tokens drawn from a 50-word vocabulary (n_vocabulary=50) are each reduced to 8 values after embedding.
It usually follows a one-hot encoding.
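
A short PyTorch sketch of the example above (the sizes are the hypothetical ones from the text: a 50-word vocabulary mapped to 8 values). Note that the layer takes integer indices directly, which is equivalent to multiplying a one-hot vector by the embedding matrix.

    import torch
    import torch.nn as nn

    embedding = nn.Embedding(num_embeddings=50, embedding_dim=8)  # a 50 x 8 lookup table
    tokens = torch.tensor([[3, 17, 42, 0]])                       # a sequence of word indices
    vectors = embedding(tokens)                                   # shape (1, 4, 8)
    print(vectors.shape)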

regress probability distribution

The idea is simply to make the network return a mean and a standard deviation; the post seems to compare two loss functions for this.
{Medium-Link}
The author mentioned that Kaiming initialization helps training converge.
{Medium-Kaiming initinization}
Kaiming initialization takes into account the non-linearity of activation functions, such as ReLU:

    \[ReLU(x) = \max(0, x)\]

The post is math heavy, which can be beneficial. Read it for relaxation.
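
Putting the two points together, here is a minimal PyTorch sketch (my own, not the post's code) of regressing a distribution: the network returns a mean and a variance per sample, is trained with a Gaussian negative log-likelihood, and its linear layers use Kaiming initialization. The class and sizes are made up for illustration.

    import torch
    import torch.nn as nn

    class GaussianRegressor(nn.Module):
        def __init__(self, n_in, n_hidden=64):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(n_in, n_hidden), nn.ReLU())
            self.mean_head = nn.Linear(n_hidden, 1)
            self.logvar_head = nn.Linear(n_hidden, 1)   # predict log-variance to keep variance positive
            for m in self.modules():
                if isinstance(m, nn.Linear):
                    nn.init.kaiming_normal_(m.weight, nonlinearity='relu')  # Kaiming (He) init for ReLU nets
                    nn.init.zeros_(m.bias)

        def forward(self, x):
            h = self.body(x)
            return self.mean_head(h), self.logvar_head(h).exp()   # (mean, variance)

    model = GaussianRegressor(n_in=10)
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    mean, var = model(x)
    loss = nn.GaussianNLLLoss()(mean, y, var)   # compare against e.g. plain MSE on the mean alone
    loss.backward()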

Recurrent Neural Network and LSTM

Ref: {wikipedia-LSTM}
[Figure: LSTM schematic drawing]
To anticipate a stochastic pattern (e.g., a Markov chain), the output/prediction from the previous state flows back to adjust the probability of the next prediction by leveraging the impact of the inputs (input gate's activation vector), the historical tendency (forget gate, long- and short-term memory), and the output (output gate's activation vector).
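
As a concrete toy example (my own setup, not taken from the reference), a minimal PyTorch one-step-ahead predictor built on nn.LSTM looks like this; the hidden and cell states carry the historical tendency forward while the gates weigh the new inputs.

    import torch
    import torch.nn as nn

    class SequencePredictor(nn.Module):
        def __init__(self, n_features=1, n_hidden=32):
            super().__init__()
            self.lstm = nn.LSTM(n_features, n_hidden, batch_first=True)
            self.head = nn.Linear(n_hidden, n_features)

        def forward(self, x):                  # x: (batch, time, features)
            out, (h, c) = self.lstm(x)         # the gates update hidden state h and cell state c at each step
            return self.head(out[:, -1])       # predict the next value from the last hidden state

    model = SequencePredictor()
    x = torch.randn(8, 20, 1)                  # 8 sequences of length 20
    print(model(x).shape)                      # torch.Size([8, 1])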

Milestone papers reading

BYOL

Bootstrap Your Own Latent {ArXiv}
The results of this paper are a bit dodgy. They even have a disclaimer on reproducibility issues (see the Broader Impact section). But the idea may be inspiring.

Highlight:

  1. Didn’t use negative pairs, as most contrastive frameworks do.
    PS: Most contrastive frameworks boost the differentiation ability of the model with a joint objective that maximizes similarity between positive pairs and minimizes confusion between negative pairs. Mathematically,

        \[L_{joint} = \max_{\theta} \min_{\xi} \, L_{\theta}\big(x, f_{\theta}^{+}(x)\big) \, L_{\xi}\big(x, f_{\xi}^{-}(x)\big)\]

    , where f_{\theta}^{+} is the model for the positive pair, and f_{\xi}^{-} is the model for the negative pair.
  2. Strictly, only the online model is updated; the target (offline) model f_{\xi} passively updates as the exponential moving average of the online model. Mathematically,

        \[\xi \leftarrow \tau \xi + (1-\tau) \theta\]

    , where the decay rate \tau \in (0, 1).
    Indicated in the figure below by the purple arrow; a minimal code sketch of this update follows the figure.
  3. They claim this is an unsupervised learning method. They only keep the encoder f_{\theta} as the final outcome.
  4. Many experiments are included, along with a long Appendix (A-J) verifying different hypotheses.
[Figure: BYOL framework; the purple arrow marks the passive EMA update of the target network]
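
A minimal code sketch of the EMA update in point 2 (illustrative only; in the actual method the update runs over all online/target network parameters after every training step).

    import copy
    import torch

    def ema_update(target_net, online_net, tau=0.99):
        """xi <- tau * xi + (1 - tau) * theta, applied parameter by parameter."""
        with torch.no_grad():
            for xi, theta in zip(target_net.parameters(), online_net.parameters()):
                xi.mul_(tau).add_((1.0 - tau) * theta)

    online = torch.nn.Linear(4, 4)
    target = copy.deepcopy(online)     # the target starts as a copy, then only follows via EMA
    # ... one optimizer step on `online` ...
    ema_update(target, online, tau=0.99)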

Generative Pre-training from Pixels

The conference page: {Link}
Another paper with a more illustrative schematic plot: {Link}
My understanding (as of 11/21):
Highlight:

  1. Unsupervised pre-training with resized and sequentialized photos
    (resizing saves computational resources; sequentialization fits the transformer manner).
  2. Linear probing shows that the middle latent layers contribute the most.
  3. Supervised training with a small labelled dataset reached 99% accuracy on CIFAR-10.
    (They claim the joint loss function, L_{GEN}+L_{CLF}, is more effective.)

[Figure: I-GPT framework]

I-GPT (Image GPT) first pre-trains on raster-scan sequentialized photos, either autoregressively (some halved examples shown below) or BERT-style (randomly knocking out pixels to ignore the influence of adjacent pixels).
[Figure: Test results]
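
For intuition, here is a tiny NumPy sketch (my own illustration, not the paper's code) of the raster-scan sequentialization and the two pre-training setups mentioned above; the image size and masking ratio are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    image = rng.integers(0, 256, size=(32, 32))     # a small (already resized) grayscale image

    sequence = image.reshape(-1)                    # raster scan: row by row, length 32*32 = 1024

    # autoregressive setup: predict pixel t from pixels [0, t)
    context, next_pixel = sequence[:512], sequence[512]

    # BERT-style setup: knock out random pixels and predict them from the rest
    mask = rng.random(sequence.shape) < 0.15        # ~15% of positions masked
    corrupted = np.where(mask, -1, sequence)        # -1 marks a masked pixel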

NIN – Network in Network (Lin et al., 2014)

This is used in EQtransformer, so I read this paper.
The NIN idea enables nonlinear kernel operations (1*1 "mlpconv" layers) and Global Average Pooling (compressing each final feature map into one scalar), because each final feature map is now more robust.
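
A minimal PyTorch sketch of the two NIN ideas (my own illustration): an mlpconv-style block built from 1*1 convolutions, followed by global average pooling that compresses each final feature map into one scalar, so no fully connected classifier is needed.

    import torch
    import torch.nn as nn

    mlpconv = nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=1), nn.ReLU(),   # 1x1 conv = a per-pixel MLP across channels
        nn.Conv2d(64, 10, kernel_size=1), nn.ReLU(),   # the last block outputs one feature map per class
    )

    x = torch.randn(2, 3, 32, 32)
    feature_maps = mlpconv(x)                          # (2, 10, 32, 32)
    logits = feature_maps.mean(dim=(2, 3))             # global average pooling: (2, 10), no FC layer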

Deep Residual Learning for Image Recognition

{Link}
The paper uses a hypothesis-and-verification structure to show that residual (shortcut) connections, implemented as identity mappings with a 1*1 convolution only when the dimensions change, mitigate the degradation (underfitting) of very deep networks compared with plain VGG-style stacks of 18 or 34 convolution layers.

Another highlight of the paper is the Implementation section, where they reveal their training scheme as well as detailed hyperparameter settings. That can be a very good guide for freshmen like me.
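
For reference, a minimal PyTorch sketch of a residual block (my own illustration, not the paper's code): the shortcut is an identity mapping, with a 1*1 convolution only when the channel count or stride changes, so the block outputs F(x) + x.

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, c_in, c_out, stride=1):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False),
                nn.BatchNorm2d(c_out), nn.ReLU(),
                nn.Conv2d(c_out, c_out, 3, padding=1, bias=False),
                nn.BatchNorm2d(c_out),
            )
            # projection shortcut (1x1 conv) only when shapes differ, identity otherwise
            self.shortcut = (nn.Identity() if stride == 1 and c_in == c_out
                             else nn.Conv2d(c_in, c_out, 1, stride=stride, bias=False))

        def forward(self, x):
            return torch.relu(self.body(x) + self.shortcut(x))   # F(x) + x

    block = ResidualBlock(64, 128, stride=2)
    print(block(torch.randn(1, 64, 56, 56)).shape)   # torch.Size([1, 128, 28, 28])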

GoogLeNet: Inception

Ref: google GoogLeNet, 2014
1*1 convolutions with fewer output channels than input channels reduce the filter dimensionality and return feature cubes with smaller depth (think of it as a Principal Component Analysis).
[Figure: 1*1 convolution]

Besides, global average pooling significantly cuts down the computational cost by passing only the mean value of each feature map to the fully connected layer, compared to the feature-map flattening in AlexNet. This is less prone to overfitting, despite a 0.6% loss in accuracy.
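
A small PyTorch sketch of both tricks (my own illustration): a 1*1 convolution shrinks the channel depth before an expensive convolution, and global average pooling replaces flattening before the classifier.

    import torch
    import torch.nn as nn

    x = torch.randn(1, 256, 28, 28)                    # a deep feature cube

    reduce = nn.Conv2d(256, 64, kernel_size=1)         # 256 -> 64 channels, spatial size unchanged
    conv = nn.Conv2d(64, 128, kernel_size=5, padding=2)
    y = conv(reduce(x))                                # (1, 128, 28, 28), far cheaper than a 5x5 conv on 256 channels

    gap = nn.AdaptiveAvgPool2d(1)                      # global average pooling: one scalar per feature map
    features = gap(y).flatten(1)                       # (1, 128), fed to the final fully connected layer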

Embed Physics in deepL | handle sample scarcity

Tuning papers

Channel pruning

Learning Efficient Convolutional Networks through Network Slimming, 2017 {arXiv}

They prove the efficacy on a VGG net by reducing the model size from 155 MB to 8 MB.
Users can assign a conservatively large channel count to all layers and then remove the channels whose channel scaling factor falls below a user-defined threshold.

Essence of the channel pruning

Albeit innovative, such pruning should only be applied after a satisfactory performance has been reached on a preliminary, redundant model. The paper reaches 77.1% with all 512-channel layers, and the test accuracy even grows a bit, to 80%, after pruning.
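
A minimal PyTorch sketch of the pruning criterion (my own illustration, not the paper's code): collect the BatchNorm scaling factors (gamma) and mark for removal every channel whose factor falls below the user-defined threshold. It assumes the factors were L1-regularized during training, as the paper proposes.

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Conv2d(3, 512, 3, padding=1), nn.BatchNorm2d(512), nn.ReLU(),
        nn.Conv2d(512, 512, 3, padding=1), nn.BatchNorm2d(512), nn.ReLU(),
    )

    threshold = 0.01                                   # user-defined; assumes gammas were L1-regularized during training
    for name, module in model.named_modules():
        if isinstance(module, nn.BatchNorm2d):
            gamma = module.weight.detach().abs()       # the channel scaling factors
            keep = gamma > threshold
            print(f"{name}: keep {int(keep.sum())} of {gamma.numel()} channels")
    # the kept channels are then copied into a slimmer network and fine-tuned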

LSTM


1st Systematic Deep learning tuning instruction

This guide was released in 2023 ({github}). It systematically compiles the methodology and implementation of hyperparameter tuning / searching in the field of deep learning.

A quick skim over it helped me confirm my intuitions on model design and hyperparameter searching: start from reproduction and transfer learning to set up a baseline, then progressively explore improvements. Their comparisons of schedulers, optimizers, and the exploration/exploitation trade-off are valuable for future research. They prefer random search over black-box optimization.
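
As a reminder to myself, a random hyperparameter search loop is just this (a sketch; train_and_evaluate is a hypothetical stand-in for a full training run, and the search ranges are made up).

    import random

    def sample_trial():
        return {
            "learning_rate": 10 ** random.uniform(-5, -2),   # log-uniform sampling
            "weight_decay": 10 ** random.uniform(-6, -3),
            "warmup_steps": random.choice([0, 500, 1000]),
        }

    trials = [sample_trial() for _ in range(20)]
    # results = [(cfg, train_and_evaluate(cfg)) for cfg in trials]   # hypothetical training function
    # best = max(results, key=lambda r: r[1])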

styleGAN: Renaissance of signal processing in Deep Learning

My brief summary:
Improves how textures follow geometric (affine) transformations by using customized window filters in the Fourier domain.
Shares a similar philosophy with FNO: multiplication in the Fourier domain is equivalent to convolution in the time domain (the photo domain in the graphics case).
Bridges downsampling and upsampling, i.e., the conversion between the discrete signal Z(x) and the continuous signal z(x), with a sampling function and the Dirac impulse function.
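
A quick NumPy check of that Fourier-domain point (my own illustration): element-wise multiplication of two spectra corresponds to circular convolution of the signals in the time/image domain.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=64)                    # a 1-D "signal"
    h = rng.normal(size=64)                    # a filter of the same length

    via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)).real
    direct = np.array([sum(x[m] * h[(n - m) % 64] for m in range(64)) for n in range(64)])
    print(np.allclose(via_fft, direct))        # True: multiplication of spectra == circular convolution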

Highlight:
Allows quick implementation of transfer learning with a custom dataset. {github-Link}

Multimodal deep learning

A new arXiv paper summarizes the advances in this realm: https://arxiv.org/abs/2301.04856
Multimodal here means data in different modalities, e.g., caption generation from images, picture generation from text, etc.

Semantic Segmentation

Explanation on CSDN: {Link}
My preliminary understanding: it distinguishes the foreground from the background in an image, usually achieved by training a convolutional neural network.
The post linked below includes the source code: {Link}
