SoT: Delving Deeper into Classification Head for Transformer

Jiangtao Xie, Ruiren Zeng, Qilong Wang, Ziqi Zhou, Peihua Li§

Code: https://github.com/jiangtaoxie/SoT

Abstract

Transformer models are not only successful in natural language processing (NLP) but also demonstrate high potential in computer vision (CV). Despite great advances, most works focus only on improving architectures and pay little attention to the classification head. For years, transformer models have relied exclusively on the classification token to construct the final classifier, without explicitly harnessing high-level word tokens. In this paper, we propose a novel transformer model called second-order transformer (SoT), which simultaneously exploits the classification token and word tokens for the classifier. Specifically, we empirically show that high-level word tokens contain rich information, which on its own is highly competent for classification and, moreover, is complementary to the classification token. To effectively harness this rich information, we propose multi-headed global cross-covariance pooling with singular value power normalization, which shares a similar philosophy with, and thus is compatible with, the transformer block, outperforming commonly used pooling methods. We then comprehensively study how to explicitly combine word tokens with the classification token to build the final classification head. For CV tasks, our SoT significantly improves state-of-the-art vision transformers on challenging benchmarks including ImageNet and ImageNet-A. For NLP tasks, through fine-tuning pretrained language transformers including GPT and BERT, our SoT greatly boosts performance on widely used tasks such as CoLA and RTE.

‡These authors contributed equally to this work.
§The corresponding author.

Motivation

For years, transformer models have relied exclusively on the classification (CLS) token to construct the final classifier, without explicitly harnessing high-level word tokens. We empirically found that high-level word tokens contain rich information, which on its own is highly competent for classification and, moreover, is complementary to the CLS token. Therefore, we propose a novel transformer model called second-order transformer (SoT), which simultaneously exploits the CLS token and word tokens for the classifier.

Quantitative analysis

Accuracies (%) of transformer models that use the classification token alone (ClassT), word tokens alone (WordT), and their combination (ClassT+WordT). We showcase the performance of vision transformers (i.e., DeiT and T2T) on ImageNet (IN) and ImageNet-A (IN-A), and that of language transformers (i.e., GPT and BERT) on CoLA and RTE.

Method

We propose multi-headed global cross-covariance pooling (MGCrP) with singular value power normalization (svPN) for mining word tokens, while systematically studying several schemes to combine word tokens with the classification token. The resulting second-order transformer (SoT) significantly improves state-of-the-art vision transformers on challenging benchmarks including ImageNet and ImageNet-A. For NLP tasks, through fine-tuning pretrained language transformers including GPT and BERT, our SoT greatly boosts performance on widely used tasks such as CoLA and RTE.
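To make the two-branch head concrete, below is a minimal PyTorch sketch of one plausible way to fuse the branches: the CLS token feeds its own linear classifier, the word tokens are pooled (e.g., by an MGCrP-like module) and fed to a second classifier, and the logits are summed. The module and parameter names are illustrative, and logit-level summation is only one possible combination scheme, not necessarily the exact SoT head.

```python
import torch
import torch.nn as nn

class CombinedHead(nn.Module):
    """Hypothetical head that fuses the CLS token with pooled word tokens.

    `pool` is assumed to map word tokens (B, N, D) to a flattened
    second-order representation of size `pooled_dim`; summing logits is
    just one possible fusion scheme, not necessarily the SoT head itself.
    """

    def __init__(self, embed_dim, pooled_dim, num_classes, pool):
        super().__init__()
        self.pool = pool                                    # e.g., an MGCrP-like module
        self.cls_fc = nn.Linear(embed_dim, num_classes)     # classifier on the CLS token
        self.word_fc = nn.Linear(pooled_dim, num_classes)   # classifier on pooled word tokens

    def forward(self, tokens):
        # tokens: (B, 1 + N, D) -- the CLS token followed by N word tokens
        cls_tok, word_toks = tokens[:, 0], tokens[:, 1:]
        cls_logits = self.cls_fc(cls_tok)
        word_logits = self.word_fc(self.pool(word_toks))
        return cls_logits + word_logits                     # simple logit-level fusion
```

For instance, with a ViT-style backbone one could pass a crude covariance-style stand-in such as `pool=lambda w: (w.transpose(1, 2) @ w).flatten(1)` with `pooled_dim = D * D` just to check shapes.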

Differences of our MGCrP from classical pooling methods. $\mathbf{Z}\in \mathbb{R}^{p\times q}$ is a feature matrix of word tokens. Global average pooling (GAP) produces first-order, vectorial representations; global covariance pooling (GCP) produces symmetric positive definite (SPD) matrices, to which matrix power normalization (MPN) is applicable; MGCrP yields asymmetric matrices, normalized by the proposed singular value power normalization (svPN).
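The contrast between the three pooling types can be made concrete with a toy example. In the sketch below, rows of the feature matrix are taken to be word tokens and columns features (an assumption about the layout of $\mathbf{Z}$), and the two projections used for the cross-covariance are random matrices purely for illustration.

```python
import torch

def gap(Z):
    """Global average pooling: token matrix (p, q) -> first-order q-dim vector."""
    return Z.mean(dim=0)

def gcp(Z):
    """Global covariance pooling: symmetric positive semi-definite q x q matrix."""
    Zc = Z - Z.mean(dim=0, keepdim=True)
    return Zc.t() @ Zc / Z.shape[0]

def cross_cov(X, Y):
    """Cross-covariance of two differently projected token groups: in general an
    asymmetric m x n matrix, the kind of output MGCrP produces per head."""
    Xc = X - X.mean(dim=0, keepdim=True)
    Yc = Y - Y.mean(dim=0, keepdim=True)
    return Xc.t() @ Yc / X.shape[0]

# Toy example: p = 196 word tokens with q = 64 features, projected to two
# subspaces of sizes 32 and 48 (all sizes are made up for illustration).
Z = torch.randn(196, 64)
X, Y = Z @ torch.randn(64, 32), Z @ torch.randn(64, 48)
print(gap(Z).shape, gcp(Z).shape, cross_cov(X, Y).shape)
# torch.Size([64]) torch.Size([64, 64]) torch.Size([32, 48])
```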

Singular value power normalization (svPN)
$
\mathrm{svPN}(\mathbf{Q})=\sum\limits_{i=1}^{\min{(m,n)}}\lambda_{i}^{\alpha}\mathbf{u}_{i}\mathbf{v}_{i}^{T}
$
where $0<\alpha<1$, and $\lambda_{i}$, $\mathbf{u}_{i}/\mathbf{v}_{i}$ are the $i$-th singular value and the corresponding left/right singular vectors of $\mathbf{Q} \in \mathbb{R}^{m\times n}$, respectively.
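A reference implementation of svPN via a full SVD is straightforward; the value $\alpha = 0.5$ below is an assumed default rather than the setting reported in the paper.

```python
import torch

def svpn(Q, alpha=0.5):
    """Exact svPN: raise each singular value of Q (m x n) to the power alpha
    with 0 < alpha < 1 and reconstruct the matrix."""
    U, S, Vh = torch.linalg.svd(Q, full_matrices=False)
    return U @ torch.diag(S.pow(alpha)) @ Vh

Q = torch.randn(32, 48)
print(svpn(Q).shape)  # torch.Size([32, 48])
```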

Fast approximate svPN
$
\widehat{\mathrm{sv}}\mathrm{PN}(\mathbf{Q})\!=\!\sum\limits_{i=1}^{r-1}\hat{\lambda}_{i}^{\alpha}\hat{\mathbf{u}}_{i}\hat{\mathbf{v}}_{i}^{T}
\!+\!\frac{1}{\hat{\lambda}_{r}^{1-\alpha}}\big(\mathbf{Q}\!-\!\sum\limits_{i=1}^{r-1}\hat{\lambda}_{i}\hat{\mathbf{u}}_{i}\hat{\mathbf{v}}_{i}^{T}\big)
$
where $\hat{\lambda}_{i}$, $\hat{\mathbf{u}}_{i}/\hat{\mathbf{v}}_{i}$ are approximate singular values and left/right singular vectors computed using the power method, and only the top $r$ singular values are used.
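The approximation can be sketched with a plain power-method loop: deflate the top $r-1$ singular triplets explicitly, then scale the remaining residual by $\hat{\lambda}_{r}^{\alpha-1}$. Values such as $r=3$ and 20 iterations are illustrative choices, not the paper's settings.

```python
import torch

def top_singular_triplet(Q, num_iters=20):
    """Dominant singular triplet of Q via the power method on Q^T Q."""
    v = torch.randn(Q.shape[1])
    v = v / v.norm()
    for _ in range(num_iters):
        v = Q.t() @ (Q @ v)
        v = v / v.norm()
    s = (Q @ v).norm()          # dominant singular value
    u = (Q @ v) / s             # corresponding left singular vector
    return s, u, v

def fast_svpn(Q, alpha=0.5, r=3, num_iters=20):
    """Approximate svPN: power-normalize the top r-1 rank-1 terms explicitly
    and scale the residual by lambda_r^(alpha - 1)."""
    out = torch.zeros_like(Q)
    residual = Q.clone()
    for i in range(r):
        s, u, v = top_singular_triplet(residual, num_iters)
        if i < r - 1:
            out = out + s.pow(alpha) * torch.outer(u, v)
            residual = residual - s * torch.outer(u, v)   # deflation
        else:
            out = out + residual / s.pow(1.0 - alpha)     # residual term
    return out

Q = torch.randn(32, 48)
print(fast_svpn(Q).shape)  # torch.Size([32, 48])
```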

Diagram of SoT for vision classification



Diagrams of fine-tuning BERT and GPT on downstream NLP tasks, formulated as either (a) sentence-pair classification (RTE, MNLI, and QNLI) or (b) single-sentence classification (CoLA). Note that for BERT and its variants, the classification token [CLS] is always placed at the beginning of the sequence, whereas for GPT it is placed at the end; moreover, GPT does not use segment embeddings. The illustration of GPT is shown faded.
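The practical difference between the two layouts is simply which position supplies the classification representation. The toy tensors below stand in for the last-layer hidden states of a fine-tuned model; shapes and sequence lengths are made up for illustration.

```python
import torch

batch, seq_len, dim = 4, 128, 768
hidden_bert = torch.randn(batch, seq_len, dim)   # stand-in for BERT's last hidden states
hidden_gpt = torch.randn(batch, seq_len, dim)    # stand-in for GPT's last hidden states
lengths = torch.tensor([128, 90, 64, 110])       # true lengths before padding (assumed)

# BERT-style: [CLS] sits at the first position of every sequence.
cls_bert = hidden_bert[:, 0]                               # (batch, dim)

# GPT-style: the classification token is appended at the end, so take the
# hidden state at the last non-padded position of each sequence.
cls_gpt = hidden_gpt[torch.arange(batch), lengths - 1]     # (batch, dim)
```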

Results

Comparison with state-of-the-art vision transformer models on image classification tasks: (a) light-weight models; (b) middle-sized models; (c) heavyweight models.


Performance improvement over language transformer models on text classification tasks

Visualization

Why are the CLS token and word tokens complementary?

Legend: markers in the figure indicate correct vs. incorrect predictions.

Visualizations of images from the ImageNet validation set, based on our SoT, using Grad-CAM. The ClassT tends to focus on the global context of images, while the WordT mainly represents local regions. The two kinds of tokens are highly complementary, and therefore ClassT+WordT can make full use of their merits, capturing both global context and local discriminative regions.

Visualization of the degree of impact of each word on the linguistic acceptability of English sentences, based on BERT-base, using the methods of Chefer et al. and Yun et al. Similar to the vision tasks, the ClassT tends to attend to the whole sentence, while the WordT focuses on the local correctness of each sentence. Finally, ClassT+WordT can highlight all important words in a sentence, including the subordinate clause, conjunction, etc., which is helpful for boosting classification performance.

References

CV Datasets

[ImageNet] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[ImageNet-A] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In CVPR, 2021.

NLP Datasets

[CoLA] Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. Neural network acceptability judgments. TACL, 2019.
[RTE] Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. The fifth pascal recognizing textual entailment challenge. In TAC, 2009.
[MNLI] Adina Williams, Nikita Nangia, and Samuel R Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL-HLT, 2018.
[QNLI] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, 2016.

Vision Transformers

[ViT] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[Swin] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
[DeiT] Hugo Touvron, Matthieu Cord, Matthijs Douze, et al. Training data-efficient image transformers & distillation through attention. In ICML, 2021.
[T2T-ViT] Li Yuan, Yunpeng Chen, Tao Wang, et al. Tokens-to-token ViT: Training vision transformers from scratch on ImageNet. In ICCV, 2021.
[PS-ViT] Xiaoyu Yue, Shuyang Sun, Zhanghui Kuang, Meng Wei, Philip Torr, Wayne Zhang, and Dahua Lin. Vision transformer with progressive sampling. In ICCV, 2021.
[PVT] Wenhai Wang, Enze Xie, Xiang Li, et al. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In ICCV, 2021.
[PiT-Ti] Byeongho Heo, Sangdoo Yun, Dongyoon Han, et al. Rethinking spatial dimensions of vision transformers. In ICCV, 2021.
[iRPE-K] Kan Wu, Houwen Peng, Minghao Chen, et al. Rethinking and improving relative position encoding for vision transformer. In ICCV, 2021.
[AutoFormer] Minghao Chen, Houwen Peng, Jianlong Fu, and Haibin Ling. AutoFormer: Searching transformers for visual recognition. In ICCV, 2021.
[Conformer] Zhiliang Peng, Wei Huang, Shanzhi Gu, Lingxi Xie, Yaowei Wang, Jianbin Jiao, and Qixiang Ye. Conformer: Local features coupling global representations for visual recognition. In ICCV, 2021.

Language Transformers

[GPT] Alec Radford, Karthik Narasimhan, Tim Salimans, et al. Improving language understanding by generative pre-training. Technical report, OpenAI, 2018.
[GPT2] Alec Radford, Jeffrey Wu, Rewon Child, et al. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019.
[BERT] Jacob Devlin, Ming-Wei Chang, Kenton Lee, et al. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.
[SpanBERT] Mandar Joshi, Danqi Chen, Yinhan Liu, et al. SpanBERT: Improving pre-training by representing and predicting spans. TACL, 8:64–77, 2020.
[RoBERTa] Yinhan Liu, Myle Ott, Naman Goyal, et al. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[Attention] Ashish Vaswani, Noam Shazeer, Niki Parmar, et al. Attention is all you need. In NIPS, 2017.

Visualization

[Grad-CAM] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017.

[Chefer et al.] Hila Chefer, Shir Gur, and Lior Wolf. Transformer interpretability beyond attention visualization. arXiv preprint arXiv:2012.09838, 2021.

[Yun et al.] Zeyu Yun, Yubei Chen, Bruno A Olshausen, and Yann LeCun. Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors. In NAACL, 2021.