Vision Transformers

Deep Learning

Intro to ViT architecture and fine-tuning pretrained ViTs for downstream tasks.

Vision Transformers (ViTs) adapt the transformer architecture from natural language processing to image processing. This note covers the architecture of these models, their pretraining and fine-tuning, their use in downstream tasks, and memory and scaling considerations.

Architecture

Image → Sequence of Tokens

Consider an image \(x \in \mathbb{R}^{H \times W \times C}\), with height \(H\), width \(W\) and \(C\) channels. To feed it into a transformer, we first convert it into a sequence of tokens through three steps:

\[ \underset{\rule{0pt}{1.5ex}\mathbb{R}^{H\times W\times C}}{x} \xrightarrow{\text{patch + project}} \underset{\rule{0pt}{1.5ex}\mathbb{R}^{N\times d}}{\tilde X} \xrightarrow{\text{prepend CLS}} \underset{\rule{0pt}{1.5ex}\mathbb{R}^{(N+1)\times d}}{X} \xrightarrow{\text{add pos. emb.}} \underset{\rule{0pt}{1.5ex}\mathbb{R}^{(N+1)\times d}}{X_0} \]

1: Patch embeddings. We split the image into \(N = \frac{H}{P}\frac{W}{P}\) non-overlapping patches of size \(P \times P\), flatten each patch into a vector in \(\mathbb{R}^{P^2C}\), and project through a linear layer \(W_E \in \mathbb{R}^{(P^2C)\times d}\), yielding a patch embedding matrix \(\tilde X\).

2: CLS token. We prepend a learnable CLS token \(x_{\text{cls}} \in \mathbb{R}^{1\times d}\) to the sequence: \(X = [x_{\text{cls}};\, \tilde X]\).

3: Positional embeddings. We inject spatial information by adding learnable positional embeddings \(X_{\text{pos}} \in \mathbb{R}^{(N+1)\times d}\): \(X_0 = X + X_{\text{pos}}\). (See Appendix A for variants.)
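The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `tokenize`, the random initialization, and the small embedding size \(d = 64\) are assumptions made for the example, and patches are taken in row-major order.

```python
import numpy as np

def tokenize(x, W_E, x_cls, X_pos, P):
    """Turn an image of shape (H, W, C) into the token matrix X_0 of shape (N+1, d)."""
    H, W, C = x.shape
    assert H % P == 0 and W % P == 0, "image size must be divisible by P"
    # Step 1: split into an (H/P, W/P) grid of P x P patches and flatten each one.
    patches = x.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(-1, P * P * C)       # (N, P^2 C)
    X_tilde = patches @ W_E                        # (N, d) patch embeddings
    # Step 2: prepend the CLS token.
    X = np.concatenate([x_cls, X_tilde], axis=0)   # (N+1, d)
    # Step 3: add positional embeddings.
    return X + X_pos

rng = np.random.default_rng(0)
H, W, C, P, d = 224, 224, 3, 16, 64                # d = 64 is illustrative only
N = (H // P) * (W // P)
x = rng.normal(size=(H, W, C))                     # random stand-in for an image
W_E = rng.normal(size=(P * P * C, d)) * 0.02
x_cls = np.zeros((1, d))
X_pos = rng.normal(size=(N + 1, d)) * 0.02

X_0 = tokenize(x, W_E, x_cls, X_pos, P)
print(X_0.shape)  # (197, 64)
```

In a trained model, `W_E`, `x_cls` and `X_pos` would all be learned parameters rather than random arrays.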

Transformer blocks

\(X_0\) is then passed through \(L\) transformer blocks, each preserving the matrix shape:

\[ \underset{\rule{0pt}{1.5ex}(N+1)\times d}{X_0} \xrightarrow{\text{block}_1} \underset{\rule{0pt}{1.5ex}(N+1)\times d}{X_1} \xrightarrow{\text{block}_2} \cdots \xrightarrow{\text{block}_L} \underset{\rule{0pt}{1.5ex}(N+1)\times d}{X_L} \]

Each block applies two sub-operations in sequence, each wrapped in a residual connection:

\[ \underset{\rule{0pt}{1.5ex}(N+1)\times d}{X_\ell} \;\xrightarrow{\;\mathrm{LN}+\mathrm{MHSA}+\mathrm{skip}\;}\; \underset{\rule{0pt}{1.5ex}(N+1)\times d}{Y_\ell} \;\xrightarrow{\;\mathrm{LN}+\mathrm{MLP}+\mathrm{skip}\;}\; \underset{\rule{0pt}{1.5ex}(N+1)\times d}{X_{\ell+1}} \]

\[ Y_\ell = X_\ell + \mathrm{MHSA}(\mathrm{LN}(X_\ell)), \qquad X_{\ell+1} = Y_\ell + \mathrm{MLP}(\mathrm{LN}(Y_\ell)) \]

MHSA is multi-head self-attention. It mixes information across tokens.
MLP is a multi-layer perceptron. It mixes information within each token: e.g. \(d \rightarrow 4d \rightarrow d\) with GELU activation.
LN is layer normalization. It normalizes each token’s feature vector.
Skip connections preserve gradient flow and let information bypass each sub-operation, if required.
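As a rough sketch of the block equations above, the following NumPy code applies the two residual sub-operations in sequence. Several simplifications are assumptions made for brevity: layer normalization omits its learnable scale and shift, GELU uses the tanh approximation, biases are dropped, and `token_mixer` is a hypothetical stand-in for MHSA (attention with identity projections) rather than the real layer.

```python
import numpy as np

def layer_norm(X, eps=1e-6):
    # Normalize each token's feature vector (learnable scale/shift omitted).
    mu = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

def gelu(z):
    # tanh approximation of GELU
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def mlp(X, W1, W2):
    # d -> 4d -> d, applied to every token independently
    return gelu(X @ W1) @ W2

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def token_mixer(Z):
    # Stand-in for MHSA: mixes information across tokens, no learned weights.
    alpha = softmax(Z @ Z.T / np.sqrt(Z.shape[-1]))
    return alpha @ Z

def block(X, W1, W2, mixer=token_mixer):
    Y = X + mixer(layer_norm(X))            # LN + token mixing + skip
    return Y + mlp(layer_norm(Y), W1, W2)   # LN + MLP + skip

rng = np.random.default_rng(0)
T, d, L = 197, 64, 3                        # illustrative sizes
X = rng.normal(size=(T, d))
weights = [(rng.normal(size=(d, 4 * d)) * 0.02,
            rng.normal(size=(4 * d, d)) * 0.02) for _ in range(L)]

X_out = X
for W1, W2 in weights:
    X_out = block(X_out, W1, W2)
print(X_out.shape)  # (197, 64)
```

Every block maps a \((N+1)\times d\) matrix to another of the same shape, which is what allows \(L\) of them to be stacked.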

Multi-Head Self-Attention

The key operation in each transformer block is self-attention: each token queries every token in the sequence (itself included) to gather relevant information, then updates its own representation as a weighted mixture of their values. We first describe the single-head case, then extend to multi-head self-attention.

Self-attention. Given a token matrix \(X \in \mathbb{R}^{(N+1)\times d}\), we first project it into queries, keys, and values:

\[ Q = XW_Q, \quad K = XW_K, \quad V = XW_V, \qquad W_Q, W_K, W_V \in \mathbb{R}^{d\times d} \]

Then we compute attention:

\[ \underset{\rule{0pt}{1.5ex}(N+1)\times d}{Q,\;K,\;V} \;\xrightarrow{\;\mathrm{softmax}(QK^\top/\sqrt{d})\;}\; \underset{\rule{0pt}{1.5ex}(N+1)\times(N+1)}{\alpha} \;\xrightarrow{\;\alpha V\;}\; \underset{\rule{0pt}{1.5ex}(N+1)\times d}{\mathrm{A}(X)} \]

\(QK^\top\) contains pairwise affinities between all token pairs. The softmax is applied row-wise, so each row of \(\alpha\) sums to 1 and \(\alpha_{jk}\) is how much token \(j\) attends to token \(k\). The output for token \(j\) is \(\sum_k \alpha_{jk} v_k\), a weighted mixture of all tokens’ value vectors, where \(v_k\) is the \(k\)-th row of \(V\).
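A minimal NumPy sketch of single-head self-attention, following the formulas above. The tiny sizes and random weights are assumptions for illustration, and the softmax is taken over each row of the score matrix.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Single-head self-attention. Returns the output A(X) and the weights alpha."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d = Q.shape[-1]
    alpha = softmax(Q @ K.T / np.sqrt(d), axis=-1)   # (T, T), each row sums to 1
    return alpha @ V, alpha

rng = np.random.default_rng(0)
T, d = 5, 8                                   # 5 tokens, 8 features (illustrative)
X = rng.normal(size=(T, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

A, alpha = self_attention(X, W_Q, W_K, W_V)
print(A.shape, alpha.shape)  # (5, 8) (5, 5)
```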

Multi-head self-attention. A single SA head computes one set of affinities, capturing one relational structure across the sequence. MHSA runs \(m\) heads in parallel, each attending in its own \(d_{\text{head}}\)-dimensional subspace where \(d_{\text{head}} = d/m\):

\[ Q_i = XW_Q^i, \quad K_i = XW_K^i, \quad V_i = XW_V^i, \qquad W_Q^i,\, W_K^i,\, W_V^i \in \mathbb{R}^{d\times d_{\text{head}}} \]

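Every head sees the same \(X\), but projects it through its own learned matrices into a lower-dimensional subspace. Each head then runs SA independently, with the scaling factor \(\sqrt{d}\) replaced by \(\sqrt{d_{\text{head}}}\) to match the head dimension: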

\[ \underset{\rule{0pt}{1.5ex}(N+1)\times d}{X} \;\xrightarrow{\;W_Q^i,\,W_K^i,\,W_V^i\;}\; \underset{\rule{0pt}{1.5ex}(N+1)\times d_{\text{head}}}{Q_i,\,K_i,\,V_i} \;\xrightarrow{\;\mathrm{SA}\;}\; \underset{\rule{0pt}{1.5ex}(N+1)\times d_{\text{head}}}{\mathrm{A}_i(X)} \quad\forall\,i \]

The \(m\) outputs are concatenated and projected through \(W_O \in \mathbb{R}^{d\times d}\):

\[ \underset{\rule{0pt}{1.5ex}(N+1)\times d}{[\mathrm{A}_1(X)\;|\;\cdots\;|\;\mathrm{A}_m(X)]} \;\xrightarrow{\;W_O\;}\; \underset{\rule{0pt}{1.5ex}(N+1)\times d}{\mathrm{MHSA}(X)} \]

Empirically, different heads learn different relational structures: some attend locally, others to long-range or semantic patterns.
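The multi-head computation can be sketched as follows. The code assumes the common implementation trick of storing all heads in single \(d\times d\) matrices, so that head \(i\) corresponds to columns \(i\,d_{\text{head}}\) to \((i+1)\,d_{\text{head}}\) of \(W_Q\), \(W_K\), \(W_V\), and it scales scores by \(\sqrt{d_{\text{head}}}\). Sizes and random weights are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mhsa(X, W_Q, W_K, W_V, W_O, m):
    """Multi-head self-attention with m heads; X has shape (T, d)."""
    T, d = X.shape
    assert d % m == 0, "d must be divisible by the number of heads"
    d_head = d // m

    def split(Z):
        # (T, d) -> (m, T, d_head): one slice of columns per head
        return Z.reshape(T, m, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_Q), split(X @ W_K), split(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (m, T, T)
    heads = softmax(scores) @ V                           # (m, T, d_head)
    concat = heads.transpose(1, 0, 2).reshape(T, d)       # [A_1 | ... | A_m]
    return concat @ W_O

rng = np.random.default_rng(0)
T, d, m = 6, 32, 4
d_head = d // m
X = rng.normal(size=(T, d))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))

out = mhsa(X, W_Q, W_K, W_V, W_O, m)
print(out.shape)  # (6, 32)
```

Batching the heads into one tensor gives the same result as looping over the per-head matrices \(W_Q^i, W_K^i, W_V^i\), but maps better onto hardware.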

Pretraining and Fine-Tuning

Directly training a ViT for a specific task using supervised learning rarely works well. These models are large and lack regularizing inductive biases, so they need an enormous amount of labelled data to learn useful representations.

In practice, ViTs are pretrained at scale on large datasets using objectives that need no manual class labels: self-distillation (DINO), image-text contrastive learning on captioned images (CLIP), or masked image modelling (MAE). Dataset and model sizes have grown rapidly: DINOv2 was trained on ~142M curated images with a 1.1B parameter model, and DINOv3 scaled this to ~1.7B images with a 6.7B parameter model. These models produce strong general-purpose representations that transfer well across downstream tasks.

For task-specific adaptation, fully fine-tuning all parameters is prone to overfitting. LoRA addresses this by freezing the pretrained weights and learning a small number of low-rank updates instead, making fine-tuning tractable with far fewer trainable parameters.

Parameter Counts

To get a sense of the scale of modern ViTs, consider a transformer block with embedding dimension \(d\). DINOv3-Large uses \(d = 1024\) across \(L = 24\) blocks. Each transformer block has two main parameter contributions. In practice, \(W_Q, W_K, W_V\) each implement all attention heads combined as a single \(d\times d\) matrix (equivalent to stacking the per-head matrices \(W_Q^i \in \mathbb{R}^{d \times d_\text{head}}\) across \(m\) heads), so MHSA contributes four \(d\times d\) matrices: \(W_Q\), \(W_K\), \(W_V\), \(W_O\). The MLP contributes two matrices: \(W_1 \in \mathbb{R}^{d\times 4d}\) and \(W_2 \in \mathbb{R}^{4d\times d}\).

\[ \underbrace{W_Q,\,W_K,\,W_V,\,W_O \in \mathbb{R}^{d\times d}}_{\text{MHSA: }\; 4d^2 \approx 4.2\text{M}} \qquad\qquad \underbrace{W_1 \in \mathbb{R}^{d\times 4d},\; W_2 \in \mathbb{R}^{4d\times d}}_{\text{MLP: }\; 2 \times 4d^2 = 8d^2 \approx 8.4\text{M}} \]

Thus, per block, we have: \(4d^2 + 8d^2 = 12d^2 \approx 12.6\text{M}\) (at \(d = 1024\)). Scaling to the full model, we have:

\[ \underbrace{24 \times 12.6\text{M} \approx 302\text{M}}_{\text{transformer stack}} \;+\; \underbrace{\text{a few M}}_{\text{LN, biases, emb.}} \;\approx\; \underbrace{300\text{M}}_{\text{Total}} \]

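The DINOv3-7B model reaches 6.7B parameters with \(d = 4096\) and \(L = 40\) blocks (total parameter count scales as \(L \cdot d^2\)). Plugging these values into \(12d^2 L\) would give \(\approx 8.1\text{B}\), so the \(12d^2\) per-block estimate evidently does not apply exactly at this size: the published count implies a feed-forward layer narrower than the \(d \rightarrow 4d \rightarrow d\) MLP assumed above.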
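The per-block arithmetic is easy to check directly. The helper below is a hypothetical convenience function that counts only the weight matrices named above (biases, layer norms and embeddings are ignored) and assumes an MLP expansion ratio of 4.

```python
def block_params(d, mlp_ratio=4):
    """Weight-matrix parameters in one transformer block."""
    attn = 4 * d * d                 # W_Q, W_K, W_V, W_O
    mlp = 2 * mlp_ratio * d * d      # W_1 (d x 4d) and W_2 (4d x d)
    return attn + mlp

d, L = 1024, 24
per_block = block_params(d)
stack = L * per_block

print(per_block)  # 12582912
print(stack)      # 301989888
```

The stack alone therefore accounts for roughly 0.3B parameters at \(d = 1024\), \(L = 24\).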

Low-Rank Adaptation (LoRA)

Full fine-tuning is expensive when the model has hundreds of millions of parameters. LoRA keeps the pretrained weights frozen and learns a small low-rank correction instead. For a pretrained projection matrix \(W \in \mathbb{R}^{d\times d}\), LoRA introduces matrices \(B \in \mathbb{R}^{d\times r}\) and \(A \in \mathbb{R}^{r\times d}\) with rank \(r \ll d\). During fine-tuning and inference, \(W\) is replaced by the adapted weight \(W'\):

\[ W' = \underbrace{W}_{\substack{d\times d \\ \text{frozen}}} + \underbrace{BA}_{\substack{d\times d,\;\text{rank-}r \\ \text{trainable}}} \]

The update \(BA\) has \(2dr\) trainable params, compared to \(d^2\) for the full matrix. For DINOv3-Large (\(d = 1024\), \(r = 8\)):

\[ \underbrace{d^2 \approx 1\text{M}}_{\text{full fine-tuning}} \quad\longrightarrow\quad \underbrace{2dr = 2(1024)(8) \approx 16\text{k}}_{\text{LoRA, rank-}8} \]

LoRA is commonly applied to the attention projection matrices \(W_Q, W_K, W_V, W_O\), though many implementations adapt only a subset. Applying rank-8 LoRA to all four attention matrices across all 24 blocks: \(24 \times 4 \times 16\text{k} \approx 1.6\text{M}\) total trainable parameters. This is about \(1.6/300 \approx 0.5\%\) of the full model. The small number of trainable parameters lowers memory and optimization cost, and also limits how far the model can drift from its pretrained initialization, which helps avoid overfitting on small datasets.
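A small NumPy sketch of the LoRA update and its parameter count follows. The random `W` stands in for a pretrained weight, and initializing `B` to zero (so that \(W' = W\) before training) follows the usual LoRA convention but is an assumption here, as is the omission of any scaling factor on \(BA\).

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 1024, 8

W = rng.normal(size=(d, d)) * 0.02    # frozen weight (random stand-in)
B = np.zeros((d, r))                  # trainable, zero-initialized (assumption)
A = rng.normal(size=(r, d)) * 0.01    # trainable

def lora_forward(X, W, B, A):
    # Equivalent to X @ (W + B @ A), without ever forming the d x d update.
    return X @ W + (X @ B) @ A

X = rng.normal(size=(5, d))
before = lora_forward(X, W, B, A)     # equals X @ W while B is zero

# Simulate a trained adapter by filling B with non-zero values.
B_trained = rng.normal(size=(d, r)) * 0.01
after = lora_forward(X, W, B_trained, A)

per_matrix = B.size + A.size          # 2 * d * r
total = 24 * 4 * per_matrix           # 4 attention matrices in each of 24 blocks
print(per_matrix, total)  # 16384 1572864
```

At inference time the product \(BA\) can be added into \(W\) once, so the adapted model has the same cost as the original.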

Downstream Tasks

A pretrained ViT backbone produces token representations \(X_L \in \mathbb{R}^{(N+1)\times d}\) that transfer well to many downstream tasks. Which part of \(X_L\) to use depends on what the task requires:

\[ \underset{\rule{0pt}{1.5ex}\mathbb{R}^{H\times W\times C}}{x} \xrightarrow{\text{tokenize}} \underset{\rule{0pt}{1.5ex}(N+1)\times d}{X_0} \xrightarrow{L\text{ blocks}} \underset{\rule{0pt}{1.5ex}(N+1)\times d}{X_L} \]

\[ \underset{\text{global}}{X_L[0]} \;\xrightarrow{\;\text{linear head}\;}\; \text{class label} \qquad\qquad \underset{\text{spatial}}{X_L[1:]} \;\xrightarrow{\;\text{reshape + decode}\;}\; \text{dense output} \]

Image Classification

For classification, we need a single vector that summarizes the whole image. The CLS token \(X_L[0]\) serves this role: having attended to all patch tokens across all \(L\) layers, it accumulates global context. A linear classifier is applied on top for prediction.

Dense Prediction

Tasks such as semantic segmentation, object detection and depth estimation require dense spatial predictions rather than a single label. Patch tokens \(X_L[1:] \in \mathbb{R}^{N\times d}\) retain spatial structure through their positional embeddings, but at low resolution (\(H/P \times W/P\)). A common strategy is to reshape them back to a spatial grid and upsample with a convolutional decoder:

\[ X_L[1:] \in \mathbb{R}^{N\times d} \;\rightarrow\; \mathbb{R}^{H/P \times W/P \times d} \;\xrightarrow{\;\text{decoder}\;}\; \text{dense prediction} \]
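The two ways of reading out \(X_L\) can be sketched as below. Everything here is illustrative: `X_L` is random rather than a real backbone output, patch tokens are assumed to be in row-major order, and the "decoder" is a placeholder (a per-patch linear layer followed by nearest-neighbour upsampling), not the convolutional decoder a real system would use.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, P, d, num_classes = 224, 224, 16, 64, 10   # d, num_classes are illustrative
gh, gw = H // P, W // P                          # 14 x 14 patch grid
X_L = rng.normal(size=(gh * gw + 1, d))          # stand-in for backbone output

# Classification: linear head on the CLS token X_L[0].
W_cls = rng.normal(size=(d, num_classes)) * 0.02
b_cls = np.zeros(num_classes)
class_logits = X_L[0] @ W_cls + b_cls            # (num_classes,)
predicted_class = int(np.argmax(class_logits))

# Dense prediction: reshape patch tokens X_L[1:] back to a spatial grid.
feature_map = X_L[1:].reshape(gh, gw, d)         # (H/P, W/P, d)

# Placeholder decoder: per-patch linear layer, then upsample by P in each axis.
W_dense = rng.normal(size=(d, num_classes)) * 0.02
coarse = feature_map @ W_dense                   # (H/P, W/P, num_classes)
dense_logits = coarse.repeat(P, axis=0).repeat(P, axis=1)

print(class_logits.shape, feature_map.shape, dense_logits.shape)
# (10,) (14, 14, 64) (224, 224, 10)
```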

Hierarchical Transformers

Vanilla ViTs produce a single-scale feature map and do not naturally support the multi-scale representations that CNNs provide through pooling and strided convolutions. This can be a limitation for dense prediction tasks, where coarse-to-fine hierarchies are useful.

Hierarchical transformer architectures address this directly. One influential example is the Swin Transformer, which restricts attention to local windows, progressively merges patches across layers, and constructs multi-scale feature pyramids similar to CNN hierarchies. These designs improve computational efficiency and produce richer representations for dense spatial tasks.

Memory and Scaling

Two practical constraints arise when using ViTs at scale: self-attention is quadratic in token count (limiting feasible resolutions and 3D volume sizes), and training memory is dominated by activations rather than weights (limiting how much LoRA’s parameter reduction helps).

Attention Complexity

For an image of size \(H \times W\) with patch size \(P\), the number of patch tokens is \(N = \frac{H}{P}\frac{W}{P}\). Self-attention computes pairwise dot products between all token pairs, producing an \((N+1)\times(N+1)\) attention matrix where each entry involves a \(d\)-dimensional dot product. Attention complexity therefore scales as:

\[ \mathcal{O}(N^2 d). \]

This is why ViTs typically use large patch sizes such as \(P=16\). Reducing patch size increases spatial resolution but also sharply increases attention cost: halving \(P\) quadruples \(N\) and multiplies attention cost by roughly \(16\). Increasing image size has the same effect. A \(224\times224\) image at \(P=16\) gives \(14\times14 = 196\) tokens; at \(512\times512\) this grows to \(32\times32 = 1024\) tokens, a \((1024/196)^2 \approx 27\times\) increase in attention cost.
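The token counts and cost ratios can be reproduced with a few lines of Python. The helper names are hypothetical, and the cost ratio uses only the \(N^2\) term of \(\mathcal{O}(N^2 d)\), ignoring the CLS token and constant factors.

```python
def num_patch_tokens(H, W, P):
    return (H // P) * (W // P)

def attention_cost_ratio(n_new, n_old):
    # Relative attention cost at fixed d, from the N^2 scaling.
    return (n_new / n_old) ** 2

n_224 = num_patch_tokens(224, 224, 16)
n_512 = num_patch_tokens(512, 512, 16)
n_224_p8 = num_patch_tokens(224, 224, 8)

print(n_224, n_512, n_224_p8)                         # 196 1024 784
print(round(attention_cost_ratio(n_512, n_224), 1))   # 27.3
print(attention_cost_ratio(n_224_p8, n_224))          # 16.0
```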

Practical Strategies for 3D Inputs

These constraints become particularly important in medical imaging. A typical 3D MRI or CT volume may contain hundreds of slices, making fully 3D transformers computationally expensive. One common strategy is to apply a pretrained 2D ViT independently to individual slices or views. The resulting slice-level representations can then be fused using a lightweight transformer. This hybrid approach often provides a good tradeoff between representation quality, memory efficiency, and computational cost.

GPU Memory in Practice

As a rough guide, fine-tuning a DINOv3-Large backbone with LoRA (\(r=8\)) on a single 48GB GPU allows batch sizes of ~32–64 images at \(224\times224\), and substantially smaller batches for high-resolution or volumetric inputs. Despite LoRA reducing trainable parameters to ~0.5%, the dominant memory cost during training comes from intermediate attention activations, not from the parameters themselves. Increasing LoRA rank therefore has little effect on peak memory, but increasing image resolution or adding 3D slices has a large one.

Appendix A: Positional Embedding Variants

Self-attention is permutation-equivariant: permuting the input tokens simply permutes the outputs in the same way, so without additional structure the transformer cannot distinguish whether two patch embeddings originate from nearby or distant image regions. Positional embeddings inject spatial information into the token sequence. Several variants have been developed that trade off resolution generalization, inductive structure, and computational efficiency.

Learned Absolute Positional Embeddings

Vanilla ViTs learn a parameter matrix \(X_{\text{pos}} \in \mathbb{R}^{(N+1)\times d}\) with one embedding vector per token position (the \(N\) patch locations plus the CLS token), added to the input token matrix:

\[ X_0 = X + X_{\text{pos}}. \]

This approach is simple and effective, but ties the model to a fixed input resolution. Changing image size changes the number of patch tokens \(N\), requiring positional embedding interpolation during transfer to new resolutions.

Sinusoidal Positional Embeddings

Transformers in NLP originally used deterministic sinusoidal embeddings:

\[ \mathrm{PE}(p,2i) = \sin\left(\frac{p}{10000^{2i/d}}\right), \qquad \mathrm{PE}(p,2i+1) = \cos\left(\frac{p}{10000^{2i/d}}\right). \]

These embeddings require no learned parameters and naturally generalize to longer sequences. However, learned embeddings typically perform better in ViTs.
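The sinusoidal formula translates directly into NumPy. This sketch treats the tokens as a 1D sequence indexed by \(p\), as in the NLP setting; the function name and the sizes in the usage example are assumptions.

```python
import numpy as np

def sinusoidal_pe(num_positions, d):
    """Return a (num_positions, d) matrix with PE(p, 2i) = sin, PE(p, 2i+1) = cos."""
    assert d % 2 == 0, "d must be even"
    p = np.arange(num_positions)[:, None]     # positions, shape (num_positions, 1)
    two_i = np.arange(0, d, 2)[None, :]       # even indices 2i = 0, 2, ..., d-2
    angles = p / (10000.0 ** (two_i / d))     # (num_positions, d/2)
    pe = np.zeros((num_positions, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(197, 64)
print(pe.shape)  # (197, 64)
```

Because the matrix is computed rather than learned, it can be regenerated for any sequence length without interpolation.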

Relative Positional Embeddings

Relative positional embeddings encode relative spatial offsets between tokens rather than absolute patch locations. This allows attention to depend on relationships such as “one patch above” or “two patches left”, rather than fixed grid coordinates. Relative positional embeddings often improve transfer across resolutions; the Swin Transformer, for example, adds a learned relative position bias to the attention logits.

Rotary Positional Embeddings (RoPE)

Rotary positional embeddings (RoPE) encode position through rotations applied directly to query and key vectors before attention. Rather than adding positional vectors to token embeddings, RoPE modifies the geometry of the attention computation itself, producing attention scores that depend naturally on relative token offsets.
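A minimal 1D sketch of the idea: each consecutive pair of features is rotated by an angle proportional to the token position, and the query-key dot product then depends only on the position difference. The function name `rope`, the frequency base of 10000 and the pairing of adjacent features are assumptions for illustration; ViTs typically use 2D variants that handle row and column coordinates separately.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of a vector x (shape (d,)) for position pos."""
    d = x.shape[-1]
    assert d % 2 == 0, "d must be even"
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per feature pair
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin             # 2D rotation of each (x1, x2) pair
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same offset (4) at different absolute positions gives the same score.
s1 = rope(q, 3) @ rope(k, 7)
s2 = rope(q, 10) @ rope(k, 14)
print(np.isclose(s1, s2))  # True
```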