SoT: Delving Deeper into Classification Head for Transformer

Jiangtao Xie, Ruiren Zeng, Qilong Wang, Ziqi Zhou, Peihua Li§

Code: https://github.com/jiangtaoxie/SoT

Abstract

Transformer models are not only successful in natural language processing (NLP) but also demonstrate high potential in computer vision (CV). Despite great advances, most works focus only on improving architectures and pay little attention to the classification head. For years, transformer models have relied exclusively on the classification token to construct the final classifier, without explicitly harnessing high-level word tokens. In this paper, we propose a novel transformer model called second-order transformer (SoT), which simultaneously exploits the classification token and word tokens for the classifier. Specifically, we empirically show that high-level word tokens contain rich information, which on its own is highly competent for classification and, moreover, is complementary to the classification token. To effectively harness this rich information, we propose multi-headed global cross-covariance pooling with singular value power normalization, which shares a similar philosophy with, and thus is compatible with, the transformer block, outperforming commonly used pooling methods. We then comprehensively study how to explicitly combine word tokens with the classification token to build the final classification head. For CV tasks, our SoT significantly improves state-of-the-art vision transformers on challenging benchmarks including ImageNet and ImageNet-A. For NLP tasks, through fine-tuning pretrained language transformers including GPT and BERT, our SoT greatly boosts performance on widely used tasks such as CoLA and RTE.

‡These authors contributed equally to this work.
§The corresponding author.

Motivation

For years, transformer models have relied exclusively on the classification (CLS) token to construct the final classifier, without explicitly harnessing high-level word tokens. We empirically found that high-level word tokens contain rich information, which on its own is highly competent for classification and, moreover, is complementary to the CLS token. Therefore, we propose a novel transformer model called second-order transformer (SoT), which simultaneously exploits the CLS token and word tokens for the classifier.

Quantitative analysis

Accuracies (%) of transformer models that use the classification token alone (ClassT), word tokens alone (WordT), and their combination (ClassT+WordT). We showcase the performance of vision transformers (i.e., DeiT and T2T) on ImageNet (IN) and ImageNet-A (IN-A), and that of language transformers (i.e., GPT and BERT) on CoLA and RTE.

Method

We propose multi-headed global cross-covariance pooling (MGCrP) with singular value power normalization (svPN) for mining word tokens, while systematically studying several schemes to combine word tokens with the classification token. The resulting second-order transformer (SoT) significantly improves state-of-the-art vision transformers on challenging benchmarks including ImageNet and ImageNet-A. For NLP tasks, through fine-tuning pretrained language transformers including GPT and BERT, our SoT greatly boosts performance on widely used tasks such as CoLA and RTE.
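To make the two-branch head concrete, below is a minimal PyTorch sketch of one plausible way to fuse the branches: the CLS token feeds its own linear classifier, the word tokens are pooled (e.g., by an MGCrP-like module) and fed to a second classifier, and the logits are summed. The module and parameter names are illustrative, and logit-level summation is only one possible combination scheme, not necessarily the exact SoT head.

```python
import torch
import torch.nn as nn

class CombinedHead(nn.Module):
    """Hypothetical head that fuses the CLS token with pooled word tokens.

    `pool` is assumed to map word tokens (B, N, D) to a flattened
    second-order representation of size `pooled_dim`; summing logits is
    just one possible fusion scheme, not necessarily the SoT head itself.
    """

    def __init__(self, embed_dim, pooled_dim, num_classes, pool):
        super().__init__()
        self.pool = pool                                    # e.g., an MGCrP-like module
        self.cls_fc = nn.Linear(embed_dim, num_classes)     # classifier on the CLS token
        self.word_fc = nn.Linear(pooled_dim, num_classes)   # classifier on pooled word tokens

    def forward(self, tokens):
        # tokens: (B, 1 + N, D) -- the CLS token followed by N word tokens
        cls_tok, word_toks = tokens[:, 0], tokens[:, 1:]
        cls_logits = self.cls_fc(cls_tok)
        word_logits = self.word_fc(self.pool(word_toks))
        return cls_logits + word_logits                     # simple logit-level fusion
```

For instance, with a ViT-style backbone one could pass a crude covariance-style stand-in such as `pool=lambda w: (w.transpose(1, 2) @ w).flatten(1)` with `pooled_dim = D * D` just to check shapes.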

Differences of our MGCrP from classical pooling methods. $\mathbf{Z}\in \mathbb{R}^{p\times q}$ is a feature matrix of word tokens. Global average pooling (GAP) produces first-order, vectorial representations; global covariance pooling (GCP) produces symmetric positive definite (SPD) matrices, to which matrix power normalization (MPN) is applicable; MGCrP yields asymmetric matrices, normalized by the proposed singular value power normalization (svPN).
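The contrast between the three pooling types can be made concrete with a toy example. In the sketch below, rows of the feature matrix are taken to be word tokens and columns features (an assumption about the layout of $\mathbf{Z}$), and the two projections used for the cross-covariance are random matrices purely for illustration.

```python
import torch

def gap(Z):
    """Global average pooling: token matrix (p, q) -> first-order q-dim vector."""
    return Z.mean(dim=0)

def gcp(Z):
    """Global covariance pooling: symmetric positive semi-definite q x q matrix."""
    Zc = Z - Z.mean(dim=0, keepdim=True)
    return Zc.t() @ Zc / Z.shape[0]

def cross_cov(X, Y):
    """Cross-covariance of two differently projected token groups: in general an
    asymmetric m x n matrix, the kind of output MGCrP produces per head."""
    Xc = X - X.mean(dim=0, keepdim=True)
    Yc = Y - Y.mean(dim=0, keepdim=True)
    return Xc.t() @ Yc / X.shape[0]

# Toy example: p = 196 word tokens with q = 64 features, projected to two
# subspaces of sizes 32 and 48 (all sizes are made up for illustration).
Z = torch.randn(196, 64)
X, Y = Z @ torch.randn(64, 32), Z @ torch.randn(64, 48)
print(gap(Z).shape, gcp(Z).shape, cross_cov(X, Y).shape)
# torch.Size([64]) torch.Size([64, 64]) torch.Size([32, 48])
```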

Singular value power normalization (svPN)
$
\mathrm{svPN}(\mathbf{Q})=\sum\limits_{i=1}^{\min{(m,n)}}\lambda_{i}^{\alpha}\mathbf{u}_{i}\mathbf{v}_{i}^{T}
$
where $0<\alpha<1$, and $\lambda_{i}$, $\mathbf{u}_{i}/\mathbf{v}_{i}$ are the $i$-th singular value and the corresponding left/right singular vectors of $\mathbf{Q} \in \mathbb{R}^{m\times n}$, respectively.
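A reference implementation of svPN via a full SVD is straightforward; the value $\alpha = 0.5$ below is an assumed default rather than the setting reported in the paper.

```python
import torch

def svpn(Q, alpha=0.5):
    """Exact svPN: raise each singular value of Q (m x n) to the power alpha
    with 0 < alpha < 1 and reconstruct the matrix."""
    U, S, Vh = torch.linalg.svd(Q, full_matrices=False)
    return U @ torch.diag(S.pow(alpha)) @ Vh

Q = torch.randn(32, 48)
print(svpn(Q).shape)  # torch.Size([32, 48])
```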

Fast approximate svPN
$
\widehat{\mathrm{sv}}\mathrm{PN}(\mathbf{Q})\!=\!\sum\limits_{i=1}^{r-1}\hat{\lambda}_{i}^{\alpha}\hat{\mathbf{u}}_{i}\hat{\mathbf{v}}_{i}^{T}
\!+\!\frac{1}{\hat{\lambda}_{r}^{1-\alpha}}\big(\mathbf{Q}\!-\!\sum\limits_{i=1}^{r-1}\hat{\lambda}_{i}\hat{\mathbf{u}}_{i}\hat{\mathbf{v}}_{i}^{T}\big)
$
where $\hat{\lambda}_{i}$, $\hat{\mathbf{u}}_{i}/\hat{\mathbf{v}}_{i}$ are approximate singular values and left/right singular vectors computed using the power method, and only the top $r$ singular values are used.
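The approximation can be sketched with a plain power-method loop: deflate the top $r-1$ singular triplets explicitly, then scale the remaining residual by $\hat{\lambda}_{r}^{\alpha-1}$. Values such as $r=3$ and 20 iterations are illustrative choices, not the paper's settings.

```python
import torch

def top_singular_triplet(Q, num_iters=20):
    """Dominant singular triplet of Q via the power method on Q^T Q."""
    v = torch.randn(Q.shape[1])
    v = v / v.norm()
    for _ in range(num_iters):
        v = Q.t() @ (Q @ v)
        v = v / v.norm()
    s = (Q @ v).norm()          # dominant singular value
    u = (Q @ v) / s             # corresponding left singular vector
    return s, u, v

def fast_svpn(Q, alpha=0.5, r=3, num_iters=20):
    """Approximate svPN: power-normalize the top r-1 rank-1 terms explicitly
    and scale the residual by lambda_r^(alpha - 1)."""
    out = torch.zeros_like(Q)
    residual = Q.clone()
    for i in range(r):
        s, u, v = top_singular_triplet(residual, num_iters)
        if i < r - 1:
            out = out + s.pow(alpha) * torch.outer(u, v)
            residual = residual - s * torch.outer(u, v)   # deflation
        else:
            out = out + residual / s.pow(1.0 - alpha)     # residual term
    return out

Q = torch.randn(32, 48)
print(fast_svpn(Q).shape)  # torch.Size([32, 48])
```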

Diagram of SoT for vision classification



Diagrams of fine-tuning BERT and GPT on downstream NLP tasks, formulated as either (a) sentence-pair classification (RTE, MNLI, and QNLI) or (b) single-sentence classification (CoLA). Note that for BERT and its variants, the classification token [CLS] is always placed at the beginning of the sequence, whereas for GPT it is placed at the end; moreover, GPT does not use segment embeddings. The illustration of GPT is shown faded.
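The practical difference between the two layouts is simply which position supplies the classification representation. The toy tensors below stand in for the last-layer hidden states of a fine-tuned model; shapes and sequence lengths are made up for illustration.

```python
import torch

batch, seq_len, dim = 4, 128, 768
hidden_bert = torch.randn(batch, seq_len, dim)   # stand-in for BERT's last hidden states
hidden_gpt = torch.randn(batch, seq_len, dim)    # stand-in for GPT's last hidden states
lengths = torch.tensor([128, 90, 64, 110])       # true lengths before padding (assumed)

# BERT-style: [CLS] sits at the first position of every sequence.
cls_bert = hidden_bert[:, 0]                               # (batch, dim)

# GPT-style: the classification token is appended at the end, so take the
# hidden state at the last non-padded position of each sequence.
cls_gpt = hidden_gpt[torch.arange(batch), lengths - 1]     # (batch, dim)
```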

Results

Comparison with state-of-the-art vision transformer models on image classification tasks: (a) light-weight models; (b) middle-sized models; (c) heavyweight models.


Performance improvement over language transformer models on text classification tasks

Visualization

Why are the CLS token and word tokens complementary?

Legend: markers in the figure indicate correct vs. incorrect predictions.

Visualizations of images from the ImageNet validation set, based on our SoT, using Grad-CAM. The ClassT tends to focus on the global context of images, while the WordT mainly represents local regions. The two kinds of tokens are highly complementary, and therefore ClassT+WordT can make full use of their merits, capturing both global context and local discriminative regions.

Visualization of the degree of impact of each word on the linguistic acceptability of English sentences, based on BERT-base, using the methods of Chefer et al. and Yun et al. Similar to the vision tasks, the ClassT tends to attend to the whole sentence, while the WordT focuses on the local correctness of each sentence. Finally, ClassT+WordT can highlight all important words in a sentence, including the subordinate clause, conjunction, etc., which is helpful for boosting classification performance.

References

CV Datasets

[ImageNet] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[ImageNet-A] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In CVPR, 2021.

NLP Datasets

[CoLA] Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. Neural network acceptability judgments. TACL, 2019.
[RTE] Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. The fifth pascal recognizing textual entailment challenge. In TAC, 2009.
[MNLI] Adina Williams, Nikita Nangia, and Samuel R Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL-HLT, 2018.
[QNLI] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, 2016.

Vision Transformers

[ViT] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[Swin] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
[DeiT] Hugo Touvron, Matthieu Cord, Matthijs Douze, et al. Training data-efficient image transformers & distillation through attention. In ICML, 2021.
[T2T-ViT] Li Yuan, Yunpeng Chen, Tao Wang, et al. Tokens-to-token ViT: Training vision transformers from scratch on ImageNet. In ICCV, 2021.
[PS-ViT] Xiaoyu Yue, Shuyang Sun, Zhanghui Kuang, Meng Wei, Philip Torr, Wayne Zhang, and Dahua Lin. Vision transformer with progressive sampling. In ICCV, 2021.
[PVT] Wenhai Wang, Enze Xie, Xiang Li, et al. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In ICCV, 2021.
[PiT-Ti] Byeongho Heo, Sangdoo Yun, Dongyoon Han, et al. Rethinking spatial dimensions of vision transformers. In ICCV, 2021.
[iRPE-K] Kan Wu, Houwen Peng, Minghao Chen, et al. Rethinking and improving relative position encoding for vision transformer. In ICCV, 2021.
[AutoFormer] Minghao Chen, Houwen Peng, Jianlong Fu, and Haibin Ling. AutoFormer: Searching transformers for visual recognition. In ICCV, 2021.
[Conformer] Zhiliang Peng, Wei Huang, Shanzhi Gu, Lingxi Xie, Yaowei Wang, Jianbin Jiao, and Qixiang Ye. Conformer: Local features coupling global representations for visual recognition. In ICCV, 2021.

Language Transformers

[GPT] Alec Radford, Karthik Narasimhan, Tim Salimans, et al. Improving language understanding by generative pre-training. Technical report, OpenAI, 2018.
[GPT2] Alec Radford, Jeffrey Wu, Rewon Child, et al. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019.
[BERT] Jacob Devlin, Ming-Wei Chang, Kenton Lee, et al. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.
[SpanBERT] Mandar Joshi, Danqi Chen, Yinhan Liu, et al. SpanBERT: Improving pre-training by representing and predicting spans. TACL, 8:64–77, 2020.
[RoBERTa] Yinhan Liu, Myle Ott, Naman Goyal, et al. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[Attention] Ashish Vaswani, Noam Shazeer, Niki Parmar, et al. Attention is all you need. In NIPS, 2017.

Visualization

[Grad-CAM] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017.

[Chefer et al.] Hila Chefer, Shir Gur, and Lior Wolf. Transformer interpretability beyond attention visualization. arXiv preprint arXiv:2012.09838, 2021.

[Yun et al.] Zeyu Yun, Yubei Chen, Bruno A Olshausen, and Yann LeCun. Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors. In NAACL, 2021.