Transformers in Vision:A Survey 阅读笔记

admin2024-04-03  0

Transformers in Vision:A Survey 阅读笔记,第1张
Among their salient benefits,Transformers enable modeling long dependencies between inputsequence elements and support parallel processing of sequence as compared to recurrent networks e.g.,Long short-term memory(LSTM).
与循环网络(如 LSTM)相比,Transformer 可以建模输入序列元素之间的长期依赖关系,并支持序列的并行处理。
Furthermore,the straightforward design of Transformers allows processing multiple modalities(e.g.,images,videos,text and speech)using similar processing blocks and demonstrates excellent scalability to very large capacity networks and hugedatasets.
Transformer 的直观设计允许使用类似的处理模块处理多种模态(例如图像、视频、文本和语音),并且展现出对非常大容量网络和大规模数据集的良好可扩展性。
The profound impact of Transformer models has become more clear with their scalability to very large capacity mod-els
Transformer 模型的深远影响在于它们能够扩展到非常大容量的模型,比如 BERT-large 和 GPT-3,以及最新的 Switch transformer。
Transformers in Vision:A Survey 阅读笔记,第2张
As a result,Transformer mod-els and their variants have been successfully used for imagerecognition[11],[12],object detection[13],[14],segmenta-tion[15],image super-resolution[16],video understanding[17],[18],image generation[19],text-image synthesis[20]and visual question answering[21],[22],among severalother use cases[23]–[26].
Transformer 模型及其变体已成功应用于图像识别、目标检测、分割、图像超分辨率、视频理解、图像生成、文本-图像合成和视觉问答等领域。
Al-though attention models have been extensively used inboth feed-forward and recurrent networks[27],[28],Trans-formers are based solely on the attention mechanism andhave a unique implementation(i.e.,multi-head attention)optimized for parallelization.
Transformer 基于注意力机制,独特的多头注意力实现优化了并行性,与其他模型如硬注意力相比,Transformer 具有很好的可扩展性,不需要先验知识即可处理问题结构。
Since Transformers assume minimal priorknowledge about the structure of the problem as comparedto their convolutional and recurrent counterparts[30]–[32],they are typically pre-trained using pretext tasks on large-scale(unlabelled)datasets[1],[3].
Transformer 通常通过在大规模(未标记)数据集上的预训练来学习表征,然后在监督下游任务中进行微调,以获得良好的结果,避免了昂贵的手动标注成本。
The first one is self-attention,which allows capturing‘long-term’dependencies between sequence elements as com-pared to conventional recurrent models that find it chal-lenging to encode such relationships.
The second keyidea is that of pre-training1on a large(un)labelled corpus ina(self)supervised manner,and subsequently fine-tuning tothe target task with a small labeled dataset[3],[7],[38].
在大规模(未)标记语料库上进行自(监督)预训练,并随后使用小规模标记数据集对目标任务进行微调,是 Transformer 模型发展的第二个关键思想。
2.1 Transformer中的自注意力
The self-attentionmechanism is an integral component of Transformers, which explicitly models the interactions between all entities of asequence for structured prediction tasks.
自注意力机制是 Transformer 的核心组成部分,用于显式地建模序列中所有实体之间的交互,适用于结构化预测任务。
This is doneby defining three learnable weight matrices to transformQueries(WQ∈Rd×dq),Keys(WK∈Rd×dk)and Values(WV∈Rd×dv),where dq=dk
Transformers in Vision:A Survey 阅读笔记,第3张
For a given entity in the sequence,the self-attention basi-cally computes the dot-product of the query with all keys,which is then normalized using softmax operator to get theattention scores.
自注意力机制对于给定序列中的每个实体,基本上是计算查询(query)与所有键(keys)的点积,然后使用 softmax 运算进行归一化,以获得注意力分数。
For the Transformer model[1]which is trained to predict the next entity of the sequence,the self-attention blocks used in the decoder are masked toprevent attending to the subsequent future entities.
对于 Transformer 模型中的解码器,在训练过程中,用于预测序列下一个实体的自注意力块会被掩码,以防止关注未来的实体。
This issimply done by an element-wise multiplication operation with a mask M∈Rn×n,where M is an upper-triangular matrix.
这是通过与一个掩码矩阵相乘的元素级乘法操作完成的,掩码矩阵 M 是一个上三角矩阵。
Hadamard product
Hadamard 乘积(Hadamard product),也称为元素级乘积或者逐元素乘积,是指两个相同大小的矩阵(或者向量)中对应元素相乘得到的新矩阵(或者向量)。
Basically,while pre-dicting an entity in the sequence,the attention scores of thefuture entities are set to zero in masked self-attention.
In order to encapsulate multiplecomplex relationships amongst different elements in thesequence,the multi-head attention comprises multiple self-attention blocks(h=8 in the original Transformer model[1]).Each block has its own set of learnable weight ma-trices{WQi,WKi,WVi},where i=0···(h−1).
The main difference of self-attention with convolution operation is that the filters are dynamically calculated in-stead of static filters(that stay the same for any input)as in the case of convolution.
self-attention is invariant to permutations and changes in the number of input points.As a result,it can easily operate on irregular inputs as op-posed to standard convolution that requires grid structure.
In fact,self-attention provides the capability to learn the global aswell as local features,and provide expressivity to adaptivelylearn kernel weights as well as the receptive field(similar todeformable convolutions[42]).
2.2 自监督预训练
self-supervised learning has been very effectivelyused in the pre-training stage.The self-supervision basedpre-training stage training has played a crucial role in un-leashing the scalability and generalization of Transformernetworks,
预训练阶段通常采用自监督学习,利用大量非标记数据来训练模型,这种方法在 Transformer 网络的可扩展性和泛化性方面发挥了关键作用。
the basicidea of SSL is to fill in the blanks,i.e.,try to predict theoccluded data in images,future or past frames in temporalvideo sequences or predict a pretext task
Self-supervised learning provides a promising learningparadigm since it enables learning from a vast amount ofreadily available non-annotated data.
The pseudo-labels for the pretext task are automati-cally generated(without requiring any expensive manual annotations)based on data attributes and task definition. Therefore,the pretext task definition is a critical choice in SSL.
existing SSL methods basedupon their pretext tasks into (a)generative approaches whichsynthesize images or videos(given conditional inputs),(b)context-based methods which exploit the relationships be-tween image patches or video frames,and(c)cross-modalmethods which leverage from multiple data modalities.
2.3 Transformer模型
with each block having two sub-layers:amulti-head self-attention network,and a simple position-wise fully connected feed-forward network.
Positional encodings are added to the input sequence to capture the relative position of each word in the sequence.
Transformer 模型的编码器接收一个语言中的单词序列作为输入,并通过位置编码捕捉每个单词在序列中的相对位置。
Being an auto-regressive model,the decoder ofthe Transformer[1] uses previous predictions to output thenext word in the sequence.
Transformer 模型的解码器使用先前的预测来输出句子中的下一个单词,同时接收来自编码器的输入以及先前的输出。
Transformers in Vision:A Survey 阅读笔记,第4张
2.4 双向表征
Bidirectional Encoder Representations fromTransformers(BERT)[3]proposed to jointly encode the rightand left context of a word in a sentence,thus improvingthe learned feature representations for textual data in anself-supervised manner.
双向编码器表示( Bidirectional Encoder Representations from Transformers,BERT ) [ 3 ]提出对句子中单词的左右上下文进行联合编码,以自监督的方式改进文本数据的学习特征表示。
Masked Language Model(MLM)-Afixed percentage(15%)of words in a sentence are randomlymasked and the model is trained to predict these maskedwords using cross-entropy loss.
MLM 任务中,句子中的一定比例的单词会被随机屏蔽,模型要预测这些被屏蔽的单词,从而学习双向上下文信息。
Next Sentence Prediction(NSP)-Given a pairof sentences,the model predicts a binary label i.e.,whetherthe pair is valid from the original document or not.Thetraining data for this can easily be generated from anymonolingual text corpus.
NSP 任务中,模型需要预测一对句子的二进制标签,即原始文档中这对句子是否有效。训练数据可以从任何单语文本语料库中轻松生成,模型通过此任务能够捕捉句子之间的关系,这在许多语言建模任务中至关重要。
NSP enables the model to capture sentence-to-sentencerelationships which are crucial in many language modelingtasks such as Question Answering and Natural LanguageInference.
We broadly categorize vision models with self-attentioninto two categories:the models which use single-head self-attention(Sec.3.1),and the models which employ multi-head self-attention based Transformer modules into theirarchitectures(Sec.3.2).
single-head self-attention based frameworks,which generally apply global or local self-attention withinCNN architectures,or utilize matrix factorization to enhancedesign efficiency and use vectorized attention models.
Transformers in Vision:A Survey 阅读笔记,第5张
3.1 单头自注意力机制
3.1.1 CNN中的自注意力机制
This way,the non-local operation isable to capture interactions between any two positions inthe feature map regardless of the distance between them.
Videos classification is an example of a task where long-range interactions between pixels exist both in space andtime.
Although the self-attention allows us to model full-image contextual information,it is both memory and com-pute intensive.
Transformers in Vision:A Survey 阅读笔记,第6张
Another shortcoming of the convolutional operatorcomes from the fact that after training,it applies fixedweights regardless of any changes to the visual input.
self-attention as an alternative to convolutional operators.
3.1.2 单独使用自注意力
On the other hand,global attention[1]which attend toall spatial locations of the input can be computationallyintensive and is preferred on down-sampled small images,image patches[11]or augmenting the convolutional featuresspace[79].
3.2 多头自注意力(Transformer)
Vision Transformer(ViTs)[11]adapts the architecture of[1](see Fig.3),which cascades multiple Transformer layers
Vision Transformer(ViTs)采用了多个Transformer层级的架构,而不是像第3.1节中讨论的将自注意力作为CNN启发架构中的组件插入。
Below,we discuss these methods by categorizingthem into:uniform scale ViTs having single-scale featuresthrough all layers(Sec.3.2.1),multi-scale ViTs that learnhierarchical features which are more suitable for denseprediction tasks(Sec.3.2.2),and hybrid designs havingconvolution operations within ViTs(Sec.3.2.3).
3.2.1 统一尺度的视觉Transformer
The original Vision Transformer[11]model belongs to thisfamily,where the multi-head self-attention is applied to aconsistent scale in the input image where the spatial scale ismaintained through the network hierarchy.
最初的Vision Transformer模型属于这个家族,其中多头自注意力被应用于输入图像中的一致尺度,通过网络层次结构来保持空间尺度。
Vision Transformer(ViT)[11](Fig.6)is the first workto showcase how Transformers can‘altogether’replacestandard convolutions in deep neural networks on large-scale image datasets.
Vision Transformer(ViT)是第一个展示了Transformer如何完全取代标准卷积在大规模图像数据集上的深度神经网络的工作。
Transformers in Vision:A Survey 阅读笔记,第7张
Besides using augmentation and regularizationprocedures common in CNNs,the main contribution ofDeiT[12]is a novel native distillation approach for Trans-formers which uses a CNN as a teacher model(RegNetY-16GF[86])to train the Transformer model.
3.2.2 多尺度视觉Transformer
In standard ViTs,the number of the tokens and token featuredimension are kept fixed throughout different blocks ofthe network.
These architectures mostly sparsify tokens by merg-ing neighboring tokens and projecting them to a higherdimensional feature space.
Some of them are hybrid designs(with both convolution and self-attentionoperations,see Sec.3.2.3),while others only employ pureself-attention based design(discussed next).
Pyramid ViT(PVT)[93]is the first hierarchical designfor ViT,and proposes a progressive shrinking pyramidand spatial-reduction attention.
3.2.3 与卷积混合的ViT
Convolutions do an excellent job at capturing low-level localfeatures in images,and have been explored in multiple hy-brid ViT designs,specially at the beginning to“patchify andtokenize”an input image.
3.2.4 自监督视觉Transformer
Contrastive learning based self-supervised approaches,which have gained significant success for CNN based visiontasks,have also been investigated for ViTs.
3.3 用于目标检测的Transformer
Transformers based modules have been used for objectdetection in the following manner:(a)Transformer back-bones for feature extraction,with a R-CNN based headfor detection(see Sec.3.2.2),(b)CNN backbone for visualfeatures and a Transformer based decoder for object detec-tion[13],[14],[122],[123](see Sec.3.3.1,and(c)a purelytransformer based design for end-to-end object detection[124](see Sec.3.3.2).
目标检测中使用基于Transformer的模块有以下几种方式:- (a) 使用Transformer骨干进行特征提取,并配合一个基于R-CNN的头部进行检测。- (b) 使用CNN骨干提取视觉特征,并使用基于Transformer的解码器进行对象检测。- (c) 采用纯Transformer设计进行端到端的对象检测。
3.3.1 用CNN做骨干网络的检测Transformer
Detection Transformer(DETR)[13]treats object detectionas a set prediction task i.e.,given a set of image features,the objective is to predict the set of object bounding boxes.
Detection Transformer(DETR)将目标检测视为一个集合预测任务,即在给定一组图像特征的情况下,目标是预测一组物体边界框。
The Transformer model enables the prediction of a set ofobjects(in a single shot)and also allows modeling theirrelationships.
bipartite matching between predictions and ground-truthboxes.
The main advantage of DETR is that it removesthe dependence on hand-crafted modules and operations,such as the RPN(region proposal network)and NMS(non-maximal suppression)commonly used in object detection[125]–[129].
Transformers in Vision:A Survey 阅读笔记,第8张
The DETR[13]model successfully combines convolu-tional networks with Transformers[1]to remove hand-crafted design requirements and achieves an end-to-endtrainable object detection pipeline.
3.3.2 用纯Transformer做检测
You Only Look at One Sequence(YOLOS)[124]is a sim-ple,attention-only architecture directly built upon the ViT
You Only Look at One Sequence(YOLOS)是一个简单的基于注意力机制的架构,直接建立在ViT之上。
It replaces the class-token in ViT with multiplelearnable object query tokens,and the bipartite matchingloss is used for object detection similar to[13].
We note that it isfeasible to combine other recent ViTs with transformer baseddetection heads as well to create pure ViT based designs[124],and we hope to see more such efforts in future.
3.4 用于分割的Transformer
Self-attention can be leveraged for dense prediction taskslike image segmentation that requires modeling rich interac-tions between pixels.
To tackle these issues,Wang et al.[133]propose the position-sensitive axial-attention where the 2Dself-attention mechanism is reformulated as two 1D axial-attention layers,applied to height-axis and width-axis se-quentially(see Fig.8).
Axial self-attention将2D自注意机制重新构建为两个1D轴向自注意层,分别应用于高度轴和宽度轴,从而实现了计算效率,并使模型能够捕获全图像的上下文信息。
Transformers in Vision:A Survey 阅读笔记,第9张
Segmentation Transformer(SETR)[134]has a ViT encoder,and two decoder designs basedupon progressive upsampling,and multi-level feature ag-gregation.SegFormer[101]has a hierarchical pyramid ViT[93](without position encoding)as an encoder,and a simpleMLP based decoder with upsampling operation to get thesegmentation mask.
Segmenter使用ViT编码器提取图像特征,并采用Mask Transformer模块进行解码,预测分割掩码,同时提出了基线线性解码器,将补丁嵌入投影到分类空间,从而生成粗糙的补丁级标签。
3.5 用于图像与场景生成的Transformer
Their approachmodels the joint distribution of the image pixels by factor-izing it as a product of pixel-wise conditional distributions.
Transformers in Vision:A Survey 阅读笔记,第10张
Inspired by the success of GPT model[5]in the lan-guage domain,image GPT(iGPT)[143]demonstrated thatsuch models can be directly used for image generationtasks,and to learn strong features for downstream visiontasks(e.g.,image classification).
Image GPT(iGPT)直接在图像领域使用了类似于GPT模型在语言领域的成功经验,展示了这种模型可以直接用于图像生成任务,并学习用于下游视觉任务的强特征。
TransGAN[145]builds a strong GAN model,free of any convolutionoperation,with both generator and discriminator basedupon the Transformer model[1].
Ramesh etal.[20]recently proposed DALL·E which is a Transformermodel capable of generating high-fidelity images from agiven text description.DALL·E model has 12 billion param-eters and it is trained on a large set of text-image pairs takenfrom the internet.Before training,images are first resizedto 256×256 resolution,and subsequently compressed toa 32×32 grid of latent codes using a pre-trained discretevariational autoencoder[162],[163].
Transformers in Vision:A Survey 阅读笔记,第11张
3.6 低级别视觉处理的Transformer
numerous Transformer-based meth-ods have been proposed for low-level vision tasks,includingimage super-resolution[16],[19],[164],denoising[19],[165],deraining[19],[165],and colorization[24].
Image restorationrequires pixel-to-pixel correspondence from the input to theoutput images.One major goal of restoration algorithmsis to preserve desired fine image details(such as edgesand texture)in the restored images.
3.6.1 用于图像处理任务的Transformer
In contrast,algorithms for low-level vision tasks such as image denoising,super-resolution,and deraining are directly trained on task-specific data,thereby suffer from these limitations:(i)small number of im-ages available in task-specific datasets(e.g.,the commonlyused DIV2K dataset for image super-resolution containsonly 2000 images),(ii)the model trained for one imageprocessing task does not adapt well to other related tasks.
It is capable of performing various imagerestoration tasks such as super-resolution,denoising,andderaining.The overall architecture of IPT consists of multi-heads and multi-tails to deal with different tasks separately,and a shared encoder-decoder Transformer body.
During training,each task-specific head takes asinput a degraded image and generates visual features.Thesefeature maps are divided into small crops and subsequentlyflattened before feeding them to the Transformer encoder(whose architecture is the same as[1]).
The outputs of the encoder along with the task-specific embeddings are givenas input to the Transformer decoder.The features from thedecoder output are reshaped and passed to the multi-tailthat yields restored images.
3.6.2 用于超分辨率重构的Transformer
While the SR methods[167],[170]–[173]that are based on pixel-wise loss functions(e.g.,L1,MSE,etc.)yield impressive results in terms of image fi-delity metrics such as PSNR and SSIM,they struggle torecover fine texture details and often produce images thatare overly-smooth and perceptually less pleasant.
The above mentioned SR approaches follow two distinct(butconflicting)research directions:one maximizing the recon-struction accuracy and the other maximizing the perceptual quality,but never both.
The texture Transformer module of TTSR method(see Fig.11)consists of four core components:
Transformers in Vision:A Survey 阅读笔记,第12张
3.6.3 用于着色的Transformer
Given a grayscale image,colorization seeks to produce thecorresponding colorized sample.It is a one-to-many task asfor a given grayscale input,there exist many possibilitiesin the colorized output space.
Colorization Trans-former[24]is a probabilistic model based on conditionalattention mechanism[179].It divides the image colorizationtask into three sub-problems and proposes to solve eachtask sequentially by a different Transformer network.
first train a Transformer network to map a low-resolution grey-scale image to a 3-bit low-resolution col-ored image.
The 3-bit low-resolution colored imageis then upsampled to an 8-bit RGB sample by anotherTransformer network in the second stage of training.
Transformer is trained to increase the spatialresolution of the 8-bit RGB sample produced by the second-stage Transformer.
These layers capture the interaction between each pixel of an input image while being computation-ally less costly.
3.7 用于多模型任务的Transformer
Sev-eral works in this direction target effective vision-languagepre-training(VLP)on large-scale multi-modal datasets tolearn generic representations that effectively encode cross-modality relationships(e.g.,grounding semantic attributesof a person in a given image).
several of these modelsstill use CNNs as vision backbone to extract visual featureswhile Transformers are used mainly used to encode textfollowed by the fusion of language and visual features.
The single-stream designs feed the multi-modal inputs to a single Transformerwhile the multi-stream designs first use independent Trans-formers for each modality and later learn cross-modal repre-sentations using another Transformer(see Fig.12).
Transformers in Vision:A Survey 阅读笔记,第13张
3.7.1 多流Transformer
ViLBERTdeveloped a two-stream architecture where each stream is dedicated to model the vision or language inputs(Fig.12-h).
The pre-training phase oper-ates in a self-supervised manner,i.e.,pretext tasks are cre-ated without manual labeling on the large-scale unlabelled dataset.
For the first time in the literature,they propose tolearn an end-to-end multi-modal bidirectional Transformermodel called PEMT on audio-visual data from unlabeled videos.
short-term(e.g.,1-3 seconds)video dynamicsare encoded using CNNs,followed by a modality-specificTransformer(audio/visual)to model long-term dependen-cies(e.g.,30 seconds).A multi-modal Transformer is then applied to the modality-specific Transformer outputs to ex-change information across visual-linguistic domains.
CLIP[195]is a contrastive approach to learn image rep-resentations from text,with a learning objective which max-imizes similarity of correct text-image pairs embeddings ina large batch size.
Transformers in Vision:A Survey 阅读笔记,第14张
3.7.2 单流Transformer
Different from two-stream networks like ViLBERT[181]and LXMERT[21],VisualBERT[63]uses a single stack ofTransformers to model both the domains(images and text).
The input sequence of text(e.g.,caption)and the visualfeatures corresponding to the object proposals are fed tothe Transformer that automatically discovers relations be-tween the two domains.
The Unified Vision-Language Pre-training(VLP)[197]model uses a single Transformer network for both encod-ing and decoding stages.
Universal image-text representation(UNITER)[43]per-forms pre-training on four large-scale visual-linguisticdatasets(MS-COCO[75],Visual Genome[200],ConceptualCaptions[196]and SBU Captions[201]).
Universal image-text representation(UNITER)[43]在四个大规模视觉-语言数据集上进行预训练,并设计了预训练任务来强调学习视觉和语言领域之间的关系。
To address this problem,Object-Semantics AlignedPre-Training(Oscar)[44]first uses an object detector toobtain object tags(labels),which are then subsequently usedas a mechanism to align relevant visual features with thesemantic information(Fig.12-b).
Object-Semantics Aligned Pre-Training(Oscar)[44]使用对象检测器获取对象标签,并使用这些标签来将相关的视觉特征与语义信息对齐。
3.7.3 用于视觉描述的Transformer
The visual and text features are then separately linearly projected to a shared space,concatenated and fed toa transformer model(with an architecture similar to DETR)to predict the bounding boxes for objects corresponding to the queries in the grounding text.
Visual Groundingwith Transformer[206]has an encoder-decoder architecture,where visual tokens(features extracted from a pretrainedCNN model)and text tokens(parsed through an RNNmodule)are processed in parallel with two distinct branchesin the encoder,with cross-modality attention to generatetext-guided visual features.The decoder then computesattention between the text queries and visual features andpredicts query-specific bounding boxes.
Visual Grounding with Transformer[206]具有编码器-解码器架构,其中视觉标记(从预训练的CNN模型中提取的特征)和文本标记(通过RNN模块解析)在编码器的两个不同分支中并行处理,具有跨模态注意力,以生成文本引导的视觉特征。解码器然后计算文本查询和视觉特征之间的注意力,并预测查询特定的边界框。
3.8 视频理解
3.8.1 视频与语言模型的联合
The VideoBERT[17]model leverages Transformer networksand the strength of self-supervised learning to learn effec-tive multi-modal representations.
VideoBERT uses the prediction of masked visual and linguistic tokens as a pretext task(Fig.12-c).This allows modeling high-level se-mantics and long-range temporal dependencies,importantfor video understanding tasks.
The video+text model uses a visual-linguisticalignment task to learn cross-modality relationships.Thedefinition of this pre-text task is simple,given the latentstate of the[cls]token,the task is to predict whether thesentence is temporally aligned with the sequence of visual tokens.
3.8.2 视频动作识别
Neimark et al.[211]propose Video Transformer Network(VTN)that first ob-tains frame-wise features using 2D CNN and apply a Trans-former encoder(Longformer[103])on top to learn temporalrelationships.
Neimark等人提出了Video Transformer Network(VTN),该网络首先使用2D CNN获取逐帧特征,然后在顶部应用Transformer编码器(Longformer)来学习时间关系。
The classification token is passed through afully connected layer to recognize actions or events.Theadvantage of using Transformer encoder on top of spatialfeatures is two fold:(a)it allows processing a complete videoin a single pass,and(b)considerably improves training andinference efficiency by avoiding the expensive 3D convolu-tions.
Multiscale Vision Transformers(MViT)[219]build afeature hierarchy by progressively expanding the channelcapacity and reducing the spatio-temporal resolution invideos.They introduce multi-head pooling attention togradually change the visual resolution in their pyramidstructure.
Multiscale Vision Transformers(MViT)通过逐渐扩展通道容量和降低视频的时空分辨率来构建特征层次结构。他们引入了多头池化注意力,逐步改变金字塔结构中的视觉分辨率。
Transformers in Vision:A Survey 阅读笔记,第15张
First,the spatio-temporal tokens are extracted and then efficient factorisedversions of self-attention are applied to encode relationshipsbetween tokens.However,they require initialization withimage-pretrained models to effectively learn the ViT models.
3.8.3 视频实物分割
An encoder and a decoder Transformer is used similar to DETR toframe the instance segmentation problem as a sequence tosequence prediction task.
3.9 少样本学习中的Transformer
Trans-former models have been used to learn set-to-set mappingson this support set[26]or learn the spatial relationshipsbetween a given input query and support set samples[25].
Transformers in Vision:A Survey 阅读笔记,第16张
3.10 聚类中的Transformer
Recent works employ Transformers that operate on setinputs called the Set Transformers(ST)[228]for amortizedclustering.
最近的工作采用了在集合输入上运行的Transformer,称为Set Transformers(ST),用于摊销聚类。
Amortized clustering is a challenging problemthat seeks to learn a parametric function that can map aninput set of points to their corresponding cluster centers.
3.11 3D分析中的Transformer
Transformers provide a promising mechanism to encoderich relationships between 3D data points.
Transformers in Vision:A Survey 阅读笔记,第17张
4. 开放挑战与未来方向
The most important bottlenecksinclude requirement for large-amounts of training data andassociated high computational costs.
4.1 高计算成本
a strength of Transformer modelsis their flexibility to scale to high parametric complexity.While this is a remarkable property that allows trainingenormous sized models,this results in high training andinference cost
Additionally,these large-scale models require aggressivecompression(e.g.,distillation)to make them feasible for real-world settings.
Numerous methods have beenproposed that make special design choices to perform self-attention more‘efficiently’,for instance employing pool-ing/downsampling in self-attention[97],[219],[249],localwindow-based attention[36],[250],axial-attention[179],[251],low-rank projection attention[38],[252],[253],ker-24nelizable attention[254],[255],and similarity-clusteringbased methods[246],[256].
Therefore,thereis a pressing need to develop an efficient self-attentionmechanism that can be applied to HR images on resource-limited systems without compromising accuracy.
4.2 海量数据样本的需求
Since Transformer architectures do not inherently encodeinductive biases(prior knowledge)to deal with visual data,they typically require large amount of training to figureout the underlying modality-specific rules.
DeiT[12]uses a distillation approach to achievedata efficiency while T2T(Tokens-to-Token)ViT[35]modelslocal structure by combining spatially close tokens together,thus leading to competitive performance when trained onlyon ImageNet from scratch(without pre-training).
by smoothing the local loss surface using sharpness-awareminimizer(SAM)[258],ViTs can be trained with simpledata augmentation scheme(random crop,and horizontalflip)[259],instead of employing compute intensive strongdata augmentation strategies,and can outperform theircounterpart ResNet models.
4.3 基于视觉任务定制的Transformer设计
Although theinitial results from these simple applications are quite en-couraging and motivate us to look further into the strengthsof self-attention and self-supervised learning,current archi-tectures may still remain better tailored for language prob-lems(with a sequence structure)and need further intuitionsto make them more efficient for visual inputs.
One may argue thatthe architectures like Transformer models should remaingeneric to be directly applicable across domains,we noticethat the high computational and time cost for pre-trainingsuch models demands novel design strategies to make theirtraining more affordable on vision problems.
4.4 ViT中的网络结构搜寻
While Nerual Architecuter Search(NAS)has been wellexplored for CNNs to find an optimized architecture,itis relatively less explored in Transformers(even for lan-guage transformers[261],[262]).
It will be insightfulto further explore the domain-specific design choices(e.g.,the contrasting requirements between language and visiondomains)using NAS to design more efficient and light-weight models similar to CNNs[87].
4.5 Transformer的可解释性
compared with CNNs,ViTs demonstrate strongrobustness against texture changes and severe occlusions
The main challenge is that the attention originatingin each layer,gets inter-mixed in the subsequent layers in acomplex manner,making it difficult to visualize the relativecontribution of input tokens towards final predictions.
Furtherprogress in this direction can help in better understandingthe Transformer models,diagnosing any erroneous behav-iors and biases in the decision process.It can also help usdesign novel architectures that can help us avoid any biases.
4.6 硬件高效设计
Some recent ef-forts have been reported to compress and accelerate NLPmodels on embedded systems such as FPGAs[270].
However,such hardware efficient designs are currently lacking for the vision Transformers to enable their seamless deployment in resource-constrained devices.
4.7 整合所有模型
Inspired by the biological systems that canprocess information from a diverse range of modalities,Perceiver model[274]aims to learn a unified model thatcan process any given input modality without makingdomain-specific architectural assumptions.
An interesting and openfuture direction is to achieve total modality-agnosticism inthe learning pipeline.
Attention has played a key role in delivering efficientand accurate computer vision systems,while simultane-ously providing insights into the function of deep neu-ral networks.
Specifically,we in-clude state of the art self-attention models for image recog-nition,object detection,semantic and instance segmentation,video analysis and classification,visual question answering,visual commonsense reasoning,image captioning,vision-language navigation,clustering,few-shot learning,and 3Ddata analysis.
We systematically highlight the key strengthsand limitations of the existing methods and particularlyelaborate on the important future research directions.