
AI Paper Summaries/Attention & Transformers


[Paper Review 2] DeepViT: Towards Deeper Vision Transformer
https://arxiv.org/abs/2103.11886
1. Introduction
Unlike a CNN, which stacks several convolution layers to gather global information, ViT uses the self-attention mechanism to gather global information without layer-wise local feature extraction. Thanks to this, ViT has been reported to reach performance competitive with, and in some settings better than, CNNs. Because recent CNN research has centered on training deep models, the authors asked: "Could ViT also be improved by making it deeper, as with CNNs?" ViT's self-attention me..

[Paper Review 1] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
https://arxiv.org/abs/2010.11929
International Conference on Learning Representations (ICLR) 2021
1. Introduction
Following the success of the Transformer in NLP, this paper proposes the Vision Transformer, which applies a Transformer to images. The image is first split into patches, and the sequence of linear embeddings of those patches is fed to the Transformer as input. Image patches in CV are treated in much the same way as tokens in NLP. When trained on mid-sized datasets such as ImageNet without strong regularization, the accuracy relative to ResNet..
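
The patch-splitting and linear-embedding step described in the excerpt above can be sketched roughly as follows. This is a minimal pure-Python illustration under stated assumptions, not the paper's implementation: the function names `split_into_patches` and `linear_embed`, the toy 4x4 single-channel image, the patch size of 2, and the fixed weight matrix are all hypothetical and chosen only to keep the example small.

```python
def split_into_patches(image, patch_size):
    """Split an H x W image (list of rows) into flattened, non-overlapping patches.

    Patches are returned in row-major order, each flattened to a 1-D list.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[top + dy][left + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
            ]
            patches.append(patch)
    return patches


def linear_embed(patches, weight):
    """Project each flattened patch with a (patch_dim x embed_dim) weight matrix."""
    embed_dim = len(weight[0])
    return [
        [sum(p * weight[i][j] for i, p in enumerate(patch)) for j in range(embed_dim)]
        for patch in patches
    ]


# Toy single-channel 4x4 image (hypothetical values).
image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
patches = split_into_patches(image, patch_size=2)

# Hypothetical fixed weights; in a real model these would be learned.
weight = [[1, 0], [0, 1], [1, 0], [0, 1]]
tokens = linear_embed(patches, weight)
```

Here `tokens` is a sequence of four 2-dimensional vectors, one per patch, which plays the role that a sequence of token embeddings plays in NLP.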
