References
CV Datasets
[ImageNet] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[ImageNet-A] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In CVPR, 2021.
NLP Datasets
[CoLA] Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. Neural network acceptability judgments. TACL, 2019.
[RTE] Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. The fifth pascal recognizing textual entailment challenge. In TAC, 2009.
[MNLI] Adina Williams, Nikita Nangia, and Samuel R Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL-HLT, 2018.
[QNLI] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, 2016.
Vision Transformers
[ViT] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[Swin] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
[DeiT] Hugo Touvron, Matthieu Cord, Matthijs Douze, et al. Training data-efficient image transformers & distillation through attention. In ICML, 2021.
[T2T-ViT] Li Yuan, Yunpeng Chen, Tao Wang, et al. Tokens-to-token ViT: Training vision transformers from scratch on ImageNet. In ICCV, 2021.
[PS-ViT] Xiaoyu Yue, Shuyang Sun, Zhanghui Kuang, Meng Wei, Philip Torr, Wayne Zhang, and Dahua Lin. Vision transformer with progressive sampling. In ICCV, 2021.
[PVT] Wenhai Wang, Enze Xie, Xiang Li, et al. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In ICCV, 2021.
[PiT-Ti] Byeongho Heo, Sangdoo Yun, Dongyoon Han, et al. Rethinking spatial dimensions of vision transformers. In ICCV, 2021.
[iRPE-K] Kan Wu, Houwen Peng, Minghao Chen, et al. Rethinking and improving relative position encoding for vision transformer. In ICCV, 2021.
[AutoFormer] Minghao Chen, Houwen Peng, Jianlong Fu, and Haibin Ling. AutoFormer: Searching transformers for visual recognition. In ICCV, 2021.
[Conformer] Zhiliang Peng, Wei Huang, Shanzhi Gu, Lingxi Xie, Yaowei Wang, Jianbin Jiao, and Qixiang Ye. Conformer: Local features coupling global representations for visual recognition. In ICCV, 2021.
Language Transformers
[GPT] Alec Radford, Karthik Narasimhan, Tim Salimans, et al. Improving language understanding by generative pre-training. Technical report, OpenAI, 2018.
[GPT2] Alec Radford, Jeffrey Wu, Rewon Child, et al. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019.
[BERT] Jacob Devlin, Ming-Wei Chang, Kenton Lee, et al. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.
[SpanBERT] Mandar Joshi, Danqi Chen, Yinhan Liu, et al. SpanBERT: Improving pre-training by representing and predicting spans. TACL, 8:64-77, 2020.
[RoBERTa] Yinhan Liu, Myle Ott, Naman Goyal, et al. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[Attention] Ashish Vaswani, Noam Shazeer, Niki Parmar, et al. Attention is all you need. In NIPS, 2017.
Visualization
[Grad-CAM] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017.
[Chefer et al.] Hila Chefer, Shir Gur, and Lior Wolf. Transformer interpretability beyond attention visualization. arXiv preprint arXiv:2012.09838, 2021.
[Yun et al.] Zeyu Yun, Yubei Chen, Bruno A Olshausen, and Yann LeCun. Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors. In NAACL, 2021.