[Stanford Univ: CS231n] Spring 2025 Assignment3. Q1(Image Captioning with Transformers)

Notice

Recent Posts

Recent Comments

Link

250x250

« 2025/06 »
일	월	화	수	목	금	토
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30

Tags more

Archives

Today

Total

관리 메뉴

newhaneul

[Stanford Univ: CS231n] Spring 2025 Assignment3. Q1(Image Captioning with Transformers) 본문

2. Artificial Intelligence/Stanford Univ. CS231n

[Stanford Univ: CS231n] Spring 2025 Assignment3. Q1(Image Captioning with Transformers)

뉴하늘 2025. 5. 30. 20:03

728x90

본 포스팅은 Stanford University School of Engineering의 CS231n: Convolutional Neural Networks for Visual Recognition을 수강하고 공부한 내용을 정리하기 위한 포스팅입니다.

https://github.com/cs231n/cs231n.github.io/blob/master/assignments/2025/assignment3.md

cs231n.github.io/assignments/2025/assignment3.md at master · cs231n/cs231n.github.io

Public facing notes page. Contribute to cs231n/cs231n.github.io development by creating an account on GitHub.

github.com

https://github.com/KwonKiHyeok/CS231n/tree/main

GitHub - KwonKiHyeok/CS231n: This repository contains my solutions to the assignments of the CS231n course offered by Stanford U

This repository contains my solutions to the assignments of the CS231n course offered by Stanford University (Spring 2025). - KwonKiHyeok/CS231n

github.com

Q1. Image Captioning with Transformers

Transformer: Multi-Headed Attention

class MultiHeadAttention(nn.Module):
    """
    A model layer which implements a simplified version of masked attention, as
    introduced by "Attention Is All You Need" (https://arxiv.org/abs/1706.03762).

    Usage:
      attn = MultiHeadAttention(embed_dim, num_heads=2)

      # self-attention
      data = torch.randn(batch_size, sequence_length, embed_dim)
      self_attn_output = attn(query=data, key=data, value=data)

      # attention using two inputs
      other_data = torch.randn(batch_size, sequence_length, embed_dim)
      attn_output = attn(query=data, key=other_data, value=other_data)
    """

    def __init__(self, embed_dim, num_heads, dropout=0.1):
        """
        Construct a new MultiHeadAttention layer.

        Inputs:
         - embed_dim: Dimension of the token embedding
         - num_heads: Number of attention heads
         - dropout: Dropout probability
        """
        super().__init__()
        assert embed_dim % num_heads == 0

        # We will initialize these layers for you, since swapping the ordering
        # would affect the random number generation (and therefore your exact
        # outputs relative to the autograder). Note that the layers use a bias
        # term, but this isn't strictly necessary (and varies by
        # implementation).
        self.key = nn.Linear(embed_dim, embed_dim)
        self.query = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)
        self.proj = nn.Linear(embed_dim, embed_dim)
        
        self.attn_drop = nn.Dropout(dropout)

        self.n_head = num_heads
        self.emd_dim = embed_dim
        self.head_dim = self.emd_dim // self.n_head

    def forward(self, query, key, value, attn_mask=None):
        """
        Calculate the masked attention output for the provided data, computing
        all attention heads in parallel.

        In the shape definitions below, N is the batch size, S is the source
        sequence length, T is the target sequence length, and E is the embedding
        dimension.

        Inputs:
        - query: Input data to be used as the query, of shape (N, S, E)
        - key: Input data to be used as the key, of shape (N, T, E)
        - value: Input data to be used as the value, of shape (N, T, E)
        - attn_mask: Array of shape (S, T) where mask[i,j] == 0 indicates token
          i in the source should not influence token j in the target.

        Returns:
        - output: Tensor of shape (N, S, E) giving the weighted combination of
          data in value according to the attention weights calculated using key
          and query.
        """
        N, S, E = query.shape
        N, T, E = value.shape
        # Create a placeholder, to be overwritten by your code below.
        output = torch.empty((N, S, E))
        ############################################################################
        # TODO: Implement multiheaded attention using the equations given in       #
        # Transformer_Captioning.ipynb.                                            #
        # A few hints:                                                             #
        #  1) You'll want to split your shape from (N, T, E) into (N, T, H, E/H),  #
        #     where H is the number of heads.                                      #
        #  2) The function torch.matmul allows you to do a batched matrix multiply.#
        #     For example, you can do (N, H, T, E/H) by (N, H, E/H, T) to yield a  #
        #     shape (N, H, T, T). For more examples, see                           #
        #     https://pytorch.org/docs/stable/generated/torch.matmul.html          #
        #  3) For applying attn_mask, think how the scores should be modified to   #
        #     prevent a value from influencing output. Specifically, the PyTorch   #
        #     function masked_fill may come in handy.                              #
        ############################################################################
        H = self.n_head
        D = self.head_dim

        # 1. Linear projections
        Q = self.query(query).view(N, S, H, D).transpose(1, 2)  # (N, H, S, D)
        K = self.key(key).view(N, T, H, D).transpose(1, 2)      # (N, H, T, D)
        V = self.value(value).view(N, T, H, D).transpose(1, 2)  # (N, H, T, D)

        # 2) Attention scores
        attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(D) # (N, H, S, T)
        
        # 3) Apply mask (optional)
        if attn_mask is not None:
            attn_scores = attn_scores.masked_fill(attn_mask == 0, float('-inf'))

        # 4) Attention coefficient/weights
        attn_weights = F.softmax(attn_scores, dim = -1)
        attn_weights = self.attn_drop(attn_weights)

        # 5) Attention valueS
        attn_values = torch.matmul(attn_weights, V) # (N, H, S, D)
        attn_values = attn_values.transpose(1, 2) # (N, S, H, D)
        attn_output = attn_values.contiguous().view(N, S, E) # (N, S, E)

        # 6) Final projection
        output = self.proj(attn_output)     
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################
        return output

Step 1. Query, Key, Value projection

Q = self.query(query)  # shape: (N, S, E)
K = self.key(key)      # shape: (N, T, E)
V = self.value(value)  # shape: (N, T, E)

Step 2. Multi-Head 분리(Head Split)

Q = Q.view(N, S, H, D).transpose(1, 2)  # (N, H, S, D)
K = K.view(N, T, H, D).transpose(1, 2)  # (N, H, T, D)
V = V.view(N, T, H, D).transpose(1, 2)  # (N, H, T, D)

Step 3. Scaled Dot-Product Attention score 계산

attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(D)  # (N, H, S, T)

Step 4. Apply Mask

if attn_mask is not None:
       attn_scores = attn_scores.masked_fill(attn_mask == 0, float('-inf'))

1. 실제 단어가 아닌 padding 토큰이 attention에 영향을 주지 않도록 막기 위해서 선택적으로 적용한다. 예를 들어 [I, like, it, <PAD>, <PAD>] 처럼 padding이 있을 수 있는데, attention은 모든 토큰 간 유사도를 계산하므로, <PAD>도 영향을 줄 수 있다. 따라서 이러한 경우에는 무시할 수 있도록 mask로 가리는 것이다.

2. Transformer의 Decoder에서 미래 토큰을 미리 보지 못하도록 막는 역할을 한다. Decoder에서는 예측을 진행하는데, 이때 오직 과거 정보에만 의존을 해야 한다. 하지만 attention은 기본적으로 모든 위치를 다 보도록 되어 있으므로, 미래 토큰까지도 참조할 수 있게 된다. 따라서 mask를 적용해 미래 정보를 차단하게 한다.

Step 5. Softmax + Dropout

attn_weights = F.softmax(attn_scores, dim=-1)
attn_weights = self.attn_drop(attn_weights)

Step 6. Weighted sum of values

attn_values = torch.matmul(attn_weights, V)  # (N, H, S, D)

Step 7. Concatenate heads

attn_values = attn_values.transpose(1, 2).contiguous().view(N, S, E)  # (N, S, E)

Step 8. Final projection

output = self.proj(attn_output)  # (N, S, E)

Transformer: Positional Encoding

class PositionalEncoding(nn.Module):
    """
    Encodes information about the positions of the tokens in the sequence. In
    this case, the layer has no learnable parameters, since it is a simple
    function of sines and cosines.
    """
    def __init__(self, embed_dim, dropout=0.1, max_len=5000):
        """
        Construct the PositionalEncoding layer.

        Inputs:
         - embed_dim: the size of the embed dimension
         - dropout: the dropout value
         - max_len: the maximum possible length of the incoming sequence
        """
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        assert embed_dim % 2 == 0

        # Create an array with a "batch dimension" of 1 (which will broadcast
        # across all examples in the batch).
        pe = torch.zeros(1, max_len, embed_dim)
        ############################################################################
        # TODO: Construct the positional encoding array as described in            #
        # Transformer_Captioning.ipynb.  The goal is for each row to alternate     #
        # sine and cosine, and have exponents of 0, 0, 2, 2, 4, 4, etc. up to      #
        # embed_dim. Of course this exact specification is somewhat arbitrary, but #
        # this is what the autograder is expecting. For reference, our solution is #
        # less than 5 lines of code.                                               #
        ############################################################################
        position = torch.arange(0, max_len).unsqueeze(1) # (max_len, ) -> (max_len, 1)
        div_term = torch.arange(0, embed_dim, 2)
        pe[0, :, 0::2] = torch.sin(position * torch.pow(10000, -div_term / embed_dim))
        pe[0, :, 1::2] = torch.cos(position * torch.pow(10000, -div_term / embed_dim))
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        # Make sure the positional encodings will be saved with the model
        # parameters (mostly for completeness).
        self.register_buffer('pe', pe)

    def forward(self, x):
        """
        Element-wise add positional embeddings to the input sequence.

        Inputs:
         - x: the sequence fed to the positional encoder model, of shape
              (N, S, D), where N is the batch size, S is the sequence length and
              D is embed dim
        Returns:
         - output: the input sequence + positional encodings, of shape (N, S, D)
        """
        N, S, D = x.shape
        # Create a placeholder, to be overwritten by your code below.
        output = torch.empty((N, S, D))
        ############################################################################
        # TODO: Index into your array of positional encodings, and add the         #
        # appropriate ones to the input sequence. Don't forget to apply dropout    #
        # afterward. This should only take a few lines of code.                    #
        ############################################################################
        x = x + self.pe[0, : x.size(1), :]
        output = self.dropout(x) 
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################
        return output

Step 1. 각 위치와 주파수 준비

position = torch.arange(0, max_len).unsqueeze(1)  # shape: (max_len, 1)
div_term = torch.arange(0, embed_dim, 2)

position: 0부터 max_len-1까지 정수형 위치 벡터 (열 벡터로 reshape)
div_term: 짝수 인덱스의 임베딩 차원 번호 → ex: [0, 2, 4, ...]

Step 2. Positional Encoding 연산

pe[0, :, 0::2] = torch.sin(position * torch.pow(10000.0, -div_term / embed_dim))
pe[0, :, 1::2] = torch.cos(position * torch.pow(10000.0, -div_term / embed_dim))

Step 3. Input 데이터에 Positional Encoding을 더한 후 Dropout 적용

x = x + self.pe[0, : x.size(1), :]
output = self.dropout(x)

Inline Question 1

Scaled dot-product attention을 설계할 때 몇 가지 중요한 설계 결정들이 있었다. 아래 선택들이 왜 유익했는지 설명하라:

하나의 attention head 대신 여러 개의 attention head를 사용하는 이유
Softmax 함수를 적용하기 전에 sqrt{d/h}로 나누는 이유
여기서 는 feature 차원 수이고, h는 head의 수이다.
Attention 연산의 출력에 선형 변환(linear transformation)을 추가하는 이유

각 항목당 한두 문장이면 충분하지만, 반드시 다음 사항을 명확히 설명해야 한다:

만약 해당 구현 방식이 없었다면 어떤 문제가 생겼는지
그 상황이 왜 비효율적인지
제시된 방식이 어떻게 문제를 개선하는지

1. Using multiple attention heads as opposed to one

여러 개의 attention head를 사용하면 서로 다른 위치 간의 다양한 관계를 병렬적으로 학습할 수 있다. 하나의 head만 사용할 경우 한 종류의 관계만 주목할 수 있어 표현력이 제한되지만, 여러 head는 문맥, 구문 구조, 의미적 연관성 등 다양한 정보를 동시에 포착할 수 있어 더 풍부한 표현이 가능해진다.

2. Dividing by sqrt(d/h) before applying the softmax function

query와 key의 내적은 차원이 클수록 분산이 커져 softmax 출력이 매우 작거나 한쪽으로 치우칠 수 있다. 이를 sqrt{d/h}로 나눔으로써 값의 분산을 안정화시키고, softmax가 적절한 gradient를 갖도록 만들어 학습의 안정성과 속도를 높여준다.

3. Adding a linear transformation to the output of the attention operation

각 head의 출력을 단순히 연결(concatenate)만 하면, 서로 다른 하위 공간에서 나온 정보들이 통합되지 않아 의미 있는 표현을 만들기 어렵다. 따라서 선형 변환을 통해 다양한 head의 정보를 하나의 통합된 표현 공간으로 변환하여 downstream task에 유용한 표현을 학습할 수 있도록 돕는다.

Transformer: Decoder Layer

class TransformerDecoderLayer(nn.Module):
    """
    A single layer of a Transformer decoder, to be used with TransformerDecoder.
    """
    def __init__(self, input_dim, num_heads, dim_feedforward=2048, dropout=0.1):
        """
        Construct a TransformerDecoderLayer instance.

        Inputs:
         - input_dim: Number of expected features in the input.
         - num_heads: Number of attention heads
         - dim_feedforward: Dimension of the feedforward network model.
         - dropout: The dropout value.
        """
        super().__init__()
        self.self_attn = MultiHeadAttention(input_dim, num_heads, dropout)
        self.cross_attn = MultiHeadAttention(input_dim, num_heads, dropout)
        self.ffn = FeedForwardNetwork(input_dim, dim_feedforward, dropout)

        self.norm_self = nn.LayerNorm(input_dim)
        self.norm_cross = nn.LayerNorm(input_dim)
        self.norm_ffn = nn.LayerNorm(input_dim)

        self.dropout_self = nn.Dropout(dropout)
        self.dropout_cross = nn.Dropout(dropout)
        self.dropout_ffn = nn.Dropout(dropout)


    def forward(self, tgt, memory, tgt_mask=None):
        """
        Pass the inputs (and mask) through the decoder layer.

        Inputs:
        - tgt: the sequence to the decoder layer, of shape (N, T, D)
        - memory: the sequence from the last layer of the encoder, of shape (N, S, D)
        - tgt_mask: the parts of the target sequence to mask, of shape (T, T)

        Returns:
        - out: the Transformer features, of shape (N, T, W)
        """

        # Self-attention block (reference implementation)
        shortcut = tgt
        tgt = self.self_attn(query=tgt, key=tgt, value=tgt, attn_mask=tgt_mask)
        tgt = self.dropout_self(tgt)
        tgt = tgt + shortcut
        tgt = self.norm_self(tgt)

        ############################################################################
        # TODO: Complete the decoder layer by implementing the remaining two       #
        # sublayers: (1) the cross-attention block using the encoder output as     #
        # memory, and (2) the feedforward block. Each block should follow the      #
        # same structure as self-attention implemented just above.                 #
        ############################################################################
        # (1) Multi-Head Attention(cross)
        shortcut = tgt
        tgt = self.cross_attn(query = tgt, key = memory, value = memory)
        tgt = self.dropout_cross(tgt)
        tgt += shortcut
        tgt = self.norm_cross(tgt)

        # (2) The feedforward block
        shortcut = tgt
        tgt = self.ffn(tgt)
        tgt = self.dropout_ffn(tgt)
        tgt += shortcut
        tgt = self.norm_ffn(tgt)
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        return tgt

Transformer decoder layer는 다음과 같은 세 단계로 구성된다:

Masked self-attention (with target sequence & mask)
Cross-attention (query = decoder output, key/value = encoder memory)
Position-wise feedforward

각 단계 후에 dropout → residual connection → layer norm 순으로 구성.

Step 1. Self-Attention Block

shortcut = tgt
tgt = self.self_attn(query=tgt, key=tgt, value=tgt, attn_mask=tgt_mask)
tgt = self.dropout_self(tgt)
tgt = tgt + shortcut
tgt = self.norm_self(tgt)

attn_mask 적용
residual connection 및 normalization 구조

Step 2. Cross-Attention Block

shortcut = tgt
tgt = self.cross_attn(query=tgt, key=memory, value=memory)
tgt = self.dropout_cross(tgt)
tgt += shortcut
tgt = self.norm_cross(tgt)

decoder의 출력을 query로, encoder의 출력을 key/value로 사용하는 방식.
residual connection 및 normalization 구조

Step 3. Feedforward Block

shortcut = tgt
tgt = self.ffn(tgt)
tgt = self.dropout_ffn(tgt)
tgt += shortcut
tgt = self.norm_ffn(tgt)

residual connection 및 normalization 구조

class CaptioningTransformer(nn.Module):
    """
    A CaptioningTransformer produces captions from image features using a
    Transformer decoder.

    The Transformer receives input vectors of size D, has a vocab size of V,
    works on sequences of length T, uses word vectors of dimension W, and
    operates on minibatches of size N.
    """
    def __init__(self, word_to_idx, input_dim, wordvec_dim, num_heads=4,
                 num_layers=2, max_length=50):
        """
        Construct a new CaptioningTransformer instance.

        Inputs:
        - word_to_idx: A dictionary giving the vocabulary. It contains V entries.
          and maps each string to a unique integer in the range [0, V).
        - input_dim: Dimension D of input image feature vectors.
        - wordvec_dim: Dimension W of word vectors.
        - num_heads: Number of attention heads.
        - num_layers: Number of transformer layers.
        - max_length: Max possible sequence length.
        """
        super().__init__()

        vocab_size = len(word_to_idx)
        self.vocab_size = vocab_size
        self._null = word_to_idx["<NULL>"]
        self._start = word_to_idx.get("<START>", None)
        self._end = word_to_idx.get("<END>", None)

        self.visual_projection = nn.Linear(input_dim, wordvec_dim)
        self.embedding = nn.Embedding(vocab_size, wordvec_dim, padding_idx=self._null)
        self.positional_encoding = PositionalEncoding(wordvec_dim, max_len=max_length)

        decoder_layer = TransformerDecoderLayer(input_dim=wordvec_dim, num_heads=num_heads)
        self.transformer = TransformerDecoder(decoder_layer, num_layers=num_layers)
        self.apply(self._init_weights)

        self.output = nn.Linear(wordvec_dim, vocab_size)

    def _init_weights(self, module):
        """
        Initialize the weights of the network.
        """
        if isinstance(module, (nn.Linear, nn.Embedding)):
            module.weight.data.normal_(mean=0.0, std=0.02)
            if isinstance(module, nn.Linear) and module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)

    def forward(self, features, captions):
        """
        Given image features and caption tokens, return a distribution over the
        possible tokens for each timestep. Note that since the entire sequence
        of captions is provided all at once, we mask out future timesteps.

        Inputs:
         - features: image features, of shape (N, D)
         - captions: ground truth captions, of shape (N, T)

        Returns:
         - scores: score for each token at each timestep, of shape (N, T, V)
        """
        N, T = captions.shape
        # Create a placeholder, to be overwritten by your code below.
        scores = torch.empty((N, T, self.vocab_size))
        ############################################################################
        # TODO: Implement the forward function for CaptionTransformer.             #
        # A few hints:                                                             #
        #  1) You first have to embed your caption and add positional              #
        #     encoding. You then have to project the image features into the same  #
        #     dimensions.                                                          #
        #  2) You have to prepare a mask (tgt_mask) for masking out the future     #
        #     timesteps in captions. torch.tril() function might help in preparing #
        #     this mask.                                                           #
        #  3) Finally, apply the decoder features on the text & image embeddings   #
        #     along with the tgt_mask. Project the output to scores per token      #
        ############################################################################
        # 1) caption embedding + positional encoding
        tgt = self.embedding(captions)
        tgt = self.positional_encoding(tgt)
        
        # 2) image features → projected
        features = self.visual_projection(features)
        features = features.unsqueeze(1)
        
        # 3) mask
        tgt_mask = torch.tril(torch.ones(T, T, device = captions.device)).bool()

        # 4) transformer decoder
        output = self.transformer(tgt, features, tgt_mask)

        # 5) final projection to vocab
        scores = self.output(output)
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        return scores

    def sample(self, features, max_length=30):
        """
        Given image features, use greedy decoding to predict the image caption.

        Inputs:
         - features: image features, of shape (N, D)
         - max_length: maximum possible caption length

        Returns:
         - captions: captions for each example, of shape (N, max_length)
        """
        with torch.no_grad():
            features = torch.Tensor(features)
            N = features.shape[0]

            # Create an empty captions tensor (where all tokens are NULL).
            captions = self._null * np.ones((N, max_length), dtype=np.int32)

            # Create a partial caption, with only the start token.
            partial_caption = self._start * np.ones(N, dtype=np.int32)
            partial_caption = torch.LongTensor(partial_caption)
            # [N] -> [N, 1]
            partial_caption = partial_caption.unsqueeze(1)

            for t in range(max_length):

                # Predict the next token (ignoring all other time steps).
                output_logits = self.forward(features, partial_caption)
                output_logits = output_logits[:, -1, :]

                # Choose the most likely word ID from the vocabulary.
                # [N, V] -> [N]
                word = torch.argmax(output_logits, axis=1)

                # Update our overall caption and our current partial caption.
                captions[:, t] = word.numpy()
                word = word.unsqueeze(1)
                partial_caption = torch.cat([partial_caption, word], dim=1)

            return captions


def clones(module, N):
    "Produce N identical layers."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])


class TransformerDecoder(nn.Module):
    def __init__(self, decoder_layer, num_layers):
        super().__init__()
        self.layers = clones(decoder_layer, num_layers)
        self.num_layers = num_layers

    def forward(self, tgt, memory, tgt_mask=None):
        output = tgt

        for mod in self.layers:
            output = mod(output, memory, tgt_mask=tgt_mask)

        return output


class TransformerEncoder(nn.Module):
    def __init__(self, encoder_layer, num_layers):
        super().__init__()
        self.layers = clones(encoder_layer, num_layers)
        self.num_layers = num_layers

    def forward(self, src, src_mask=None):
        output = src

        for mod in self.layers:
            output = mod(output, src_mask=src_mask)

        return output

Step 1. Caption embedding + Positional Encoding

# 1) caption embedding + positional encoding
tgt = self.embedding(captions)
tgt = self.positional_encoding(tgt)

self.embedding: nn.Embedding(V, W) → 각 토큰을 임베딩 벡터로 변환
self.positional_encoding: sinusoidal 또는 learned positional encoding을 더함
최종 tgt: decoder에 들어갈 입력 (문맥 + 순서 포함)
결과 shape: (N, T, W)

Step 2. 이미지 특징을 임베딩 차원으로 투영 + 시퀀스 차원 확장

# 2) image features → projected
features = self.visual_projection(features)
features = features.unsqueeze(1)

self.visual_projection: Linear layer → 이미지 feature를 word vector와 동일한 차원으로 맞춤
.unsqueeze(1): 이미지 특징을 decoder의 memory 시퀀스로 사용할 수 있도록 (N, 1, W)로 확장
결과 shape: (N, 1, W) → 시퀀스 길이 1인 encoder output처럼 사용

Step 3. 미래 시점 마스킹을 위한 attention mask 생성

 # 3) mask
tgt_mask = torch.tril(torch.ones(T, T, device = captions.device)).bool()

torch.tril(...): 하삼각 행렬 생성 → 미래 토큰은 보지 못하게 함(bool type)
Transformer decoder에서 auto-regressive한 예측을 위해 필요
True는 attention 가능, False는 마스킹됨
shape: (T, T)

Step 4. Transformer Decoder 실행

# 4) transformer decoder
output = self.transformer(tgt, features, tgt_mask)

tgt: 임베딩 + 위치 정보가 포함된 디코더 입력
features: memory 역할을 하는 이미지 피처
tgt_mask: 시퀀스 마스킹
출력 shape: (N, T, W)

Step 5. vocab 분포로 투영

# 5) final projection to vocab
scores = self.output(output)

self.output: Linear layer (W → V)
각 시점의 출력 벡터를 vocab 크기의 분포로 변환
결과 shape: (N, T, V)

Vision Transformer(ViT): Patch Embedding

class PatchEmbedding(nn.Module):
    """
    A layer that splits an image into patches and projects each patch to an embedding vector.
    Used as the input layer of a Vision Transformer (ViT).

    Inputs:
    - img_size: Integer representing the height/width of input image (assumes square image).
    - patch_size: Integer representing height/width of each patch (square patch).
    - in_channels: Number of input image channels (e.g., 3 for RGB).
    - embed_dim: Dimension of the linear embedding space.
    """
    def __init__(self, img_size, patch_size, in_channels=3, embed_dim=128):
        super().__init__()

        self.img_size = img_size
        self.patch_size = patch_size
        self.in_channels = in_channels
        self.embed_dim = embed_dim

        assert img_size % patch_size == 0, "Image dimensions must be divisible by the patch size."

        self.num_patches = (img_size // patch_size) ** 2
        self.patch_dim = patch_size * patch_size * in_channels

        # Linear projection of flattened patches to the embedding dimension
        self.proj = nn.Linear(self.patch_dim, embed_dim)


    def forward(self, x):
        """
        Forward pass for patch embedding.

        Inputs:
        - x: Input image tensor of shape (N, C, H, W)

        Returns:
        - out: Patch embeddings with positional encodings of shape (N, num_patches, embed_dim)
        """
        N, C, H, W = x.shape
        assert H == self.img_size and W == self.img_size, \
            f"Expected image size ({self.img_size}, {self.img_size}), but got ({H}, {W})"
        out = torch.zeros(N, self.embed_dim)

        ############################################################################
        # TODO: Divide the image into non-overlapping patches of shape             #
        # (patch_size x patch_size x C), and rearrange them into a tensor of       #
        # shape (N, num_patches, patch_dim). Do not use a for-loop.                #
        # Instead, you may find torch.reshape and torch.permute helpful for this   #
        # step. Once the patches are flattened, embed them into latent vectors     #
        # using the projection layer.                                              #
        ############################################################################
        P = self.patch_size

        x = x.reshape(N, C, H // P, P, W // P, P)
        x = x.permute(0, 2, 4, 1, 3, 5)
        x = x.reshape(N, self.num_patches, self.patch_dim)
        
        out = self.proj(x)
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################
        return out

Case 1)

permute(0, 2, 4, 3, 5, 1)

의미: (N, num_patches_h, num_patches_w, P, P, C)
→ 패치 그리드 좌표 + 픽셀 좌표를 먼저 배치하고, 채널을 뒤로
→ 각 패치가 (P, P, C)로 재구성됨.

patch = [
  [[R,G,B](0,0)], [[R,G,B](0,1)],
  [[R,G,B](1,0)], [[R,G,B](1,1)]
]

Case 2)

x = x.permute(0, 2, 4, 1, 3, 5)

의미: (N, num_patches_h, num_patches_w, C, P, P)
→ 패치 그리드 좌표를 먼저 배치하고, 채널과 픽셀을 뒤로
→ 각 패치가 (C, P, P)로 유지됨.

patch = [
  [R(0,0), R(0,1)],
  [R(1,0), R(1,1)],
  [G(0,0), G(0,1)],
  [G(1,0), G(1,1)],
  [B(0,0), B(0,1)],
  [B(1,0), B(1,1)],
]

이 Assignment에서는 Case 2가 정답이다. 채널을 먼저 처리하고(CNN-like 구조), 그 다음 픽셀을 처리해야한다.

Transformer: Encoder Layer

class TransformerEncoderLayer(nn.Module):
    """
    A single layer of a Transformer encoder, to be used with TransformerEncoder.
    """
    def __init__(self, input_dim, num_heads, dim_feedforward=2048, dropout=0.1):
        """
        Construct a TransformerEncoderLayer instance.

        Inputs:
         - input_dim: Number of expected features in the input.
         - num_heads: Number of attention heads.
         - dim_feedforward: Dimension of the feedforward network model.
         - dropout: The dropout value.
        """
        super().__init__()
        self.self_attn = MultiHeadAttention(input_dim, num_heads, dropout)
        self.ffn = FeedForwardNetwork(input_dim, dim_feedforward, dropout)

        self.norm_self = nn.LayerNorm(input_dim)
        self.norm_ffn = nn.LayerNorm(input_dim)

        self.dropout_self = nn.Dropout(dropout)
        self.dropout_ffn = nn.Dropout(dropout)

    def forward(self, src, src_mask=None):
        """
        Pass the inputs (and mask) through the encoder layer.

        Inputs:
        - src: the sequence to the encoder layer, of shape (N, S, D)
        - src_mask: the parts of the source sequence to mask, of shape (S, S)

        Returns:
        - out: the Transformer features, of shape (N, S, D)
        """
        ############################################################################
        # TODO: Implement the encoder layer by applying self-attention followed    #
        # by a feedforward block. This code will be very similar to decoder layer. #
        ############################################################################
        # (1) Multi-Head Attention
        shortcut = src
        src = self.self_attn(query = src, key = src, value = src, attn_mask = src_mask)
        src = self.dropout_self(src)
        src += shortcut
        src = self.norm_self(src)

        # (2) The feedforward block
        shortcut = src
        src = self.ffn(src)
        src = self.dropout_ffn(src)
        src += shortcut
        src = self.norm_ffn(src)
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################
        return src

Vision Transformer

class VisionTransformer(nn.Module):
    """
    Vision Transformer (ViT) implementation.
    """
    def __init__(self, img_size=32, patch_size=8, in_channels=3,
                 embed_dim=128, num_layers=6, num_heads=4,
                 dim_feedforward=256, num_classes=10, dropout=0.1):
        """
        Inputs:
         - img_size: Size of input image (assumed square).
         - patch_size: Size of each patch (assumed square).
         - in_channels: Number of image channels.
         - embed_dim: Embedding dimension for each patch.
         - num_layers: Number of Transformer encoder layers.
         - num_heads: Number of attention heads.
         - dim_feedforward: Hidden size of feedforward network.
         - num_classes: Number of classification labels.
         - dropout: Dropout probability.
        """
        super().__init__()
        self.num_classes = num_classes
        self.patch_embed = PatchEmbedding(img_size, patch_size, in_channels, embed_dim)
        self.positional_encoding = PositionalEncoding(embed_dim, dropout=dropout)

        encoder_layer = TransformerEncoderLayer(embed_dim, num_heads, dim_feedforward, dropout)
        self.transformer = TransformerEncoder(encoder_layer, num_layers=num_layers)

        # Final classification layer to predict class scores from pooled token.
        self.head = nn.Linear(embed_dim, num_classes)

        self.apply(self._init_weights)


    def _init_weights(self, module):
        """
        Initialize the weights of the network.
        """
        if isinstance(module, (nn.Linear, nn.Embedding)):
            module.weight.data.normal_(mean=0.0, std=0.02)
            if isinstance(module, nn.Linear) and module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)

    def forward(self, x):
        """
        Forward pass of Vision Transformer.

        Inputs:
         - x: Input image tensor of shape (N, C, H, W)

        Returns:
         - logits: Output classification logits of shape (N, num_classes)
        """
        N = x.size(0)
        logits = torch.zeros(N, self.num_classes, device=x.device)
        
        ############################################################################
        # TODO: Implement the forward pass of the Vision Transformer.             #
        # 1. Convert the input image into a sequence of patch vectors.            #
        # 2. Add positional encodings to retain spatial information.              #
        # 3. Pass the sequence through the Transformer encoder.                   #
        # 4. Average pool patch vectors to get a feature vector for each image.   #
        #    You may find torch.mean useful.                                      #
        # 5. Feed it through a linear layer to produce class logits.              #
        ############################################################################

        # 1. Patch Embedding
        out = self.patch_embed(x) # (N, num_patches, embed_dim)

        # 2. Positional_encoding
        out = self.positional_encoding(out)

        # 3. Transformer Encoder
        out = self.transformer(out)

        # 4. Average Pooling
        out = torch.mean(out, dim = 1) # (N, embed_dim)

        # 5. Liner Classifier
        logits = self.head(out) # (N, num_classes)
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################


        return logits

ViT Implement

from cs231n.classification_solver_vit import ClassificationSolverViT

############################################################################
# TODO: Train a Vision Transformer model that achieves over 0.45 test      #
# accuracy on CIFAR-10 after 2 epochs by adjusting the model architecture  #
# and/or training parameters as needed.                                    #
#                                                                          #
# Note: If you want to use a GPU runtime, go to `Runtime > Change runtime  #
# type` and set `Hardware accelerator` to `GPU`. This will reset Colab,    #
# so make sure to rerun the entire notebook from the beginning afterward.  #
############################################################################


learning_rate = 6e-4
weight_decay = 1e-4
batch_size = 64

model = VisionTransformer(
  img_size=32, 
  patch_size=8, 
  in_channels=3,
  embed_dim=128, 
  num_layers=6, 
  num_heads=4,
  dim_feedforward=256, num_classes=10, dropout=0.1
)  # You may want to change the default params.



################################################################################
#                                 END OF YOUR CODE                             #
################################################################################

solver = ClassificationSolverViT(
    train_data=train_data,
    test_data=test_data,
    model=model,
    num_epochs = 2,  # Don't change this
    learning_rate = learning_rate,
    weight_decay = weight_decay,
    batch_size = batch_size,
)

solver.train('cuda' if torch.cuda.is_available() else 'cpu')

Train Epoch: [0/2] Loss: 1.8469 ACC@1: 0.326%: 100%
 782/782 [00:26<00:00, 32.13it/s]
Test Epoch: [0/2] Loss: 1.6308 ACC@1: 0.421%: 100%
 157/157 [00:02<00:00, 48.32it/s]
Train Epoch: [1/2] Loss: 1.5695 ACC@1: 0.435%: 100%
 782/782 [00:27<00:00, 25.27it/s]
Test Epoch: [1/2] Loss: 1.4703 ACC@1: 0.470%: 100%
 157/157 [00:03<00:00, 61.62it/s]

Accuracy on test set: 0.4702

Inline Question 2:

최근 대규모 이미지 인식 과제에서 Vision Transformer(ViT)가 큰 성공을 거두었음에도 불구하고, ViT는 작은 데이터셋에서 학습할 경우 전통적인 CNN보다 성능이 뒤처지는 경우가 많다. 이러한 성능 격차의 근본적인 원인은 무엇이며, 작은 데이터셋에서 ViT의 성능을 향상시키기 위해 사용할 수 있는 기법들에는 어떤 것들이 있는가?

ViT가 소규모 데이터셋에서 CNN보다 성능이 낮은 이유:

Vision Transformer(ViT)는 CNN과는 달리 학습에 필요한 inductive bias가 거의 없는 구조를 가지고 있으며, 대신 더 유연한 표현력을 가진 모델이다. 이는 ViT가 "어떤 특징을 중요하게 볼지"에 대한 사전 지식이 없기 때문에, 해당 정보들을 모두 데이터로부터 직접 학습해야 한다는 것을 의미한다.
따라서 충분히 많은 학습 데이터가 확보되지 않으면, ViT는 CNN에 비해 학습이 어렵고 과적합에 취약할 수 있다.

또한 Transformer 기반 구조는 CNN보다 매우 많은 파라미터 수를 가지므로, 안정적인 학습을 위해 더 많은 데이터가 필요하다.

작은 데이터셋에서 ViT 성능을 향상시키는 방법

데이터 증강(Data Augmentation)
소규모 데이터셋에서 모델의 일반화 성능을 높이기 위해 가장 기본적인 방법은 데이터 증강이다.
예를 들어, Random Cropping, Flipping, Rotation, Color Jitter와 같은 다양한 augmentation 기법을 활용하여 데이터 다양성을 확보할 수 있다.
사전 학습된 모델 활용 (Pretrained ViT)
대규모 데이터셋(예: ImageNet)에서 미리 학습된 ViT 모델을 불러와서,
작은 데이터셋에 맞춰 fine-tuning 하면 매우 좋은 성능을 낼 수 있다.
이는 이미 많은 패턴을 학습한 상태이므로, 적은 데이터로도 효율적으로 적응할 수 있게 해준다.

Inline Question 3:

ViT(Vision Transformer)의 self-attention 층의 **계산 비용(computational cost)**은 아래의 변경 사항들을 각각 독립적으로 적용했을 때 어떻게 변하는가?

- (i) hidden dimension을 두 배로 늘릴 경우
- (ii) 입력 이미지의 height와 width를 두 배로 늘릴 경우
- (iii) patch size를 두 배로 늘릴 경우
- (iv) 레이어 수를 두 배로 늘릴 경우

ViT의 Self-Attention 연산량 변화 분석

Vision Transformer(ViT)의 입력 이미지는 patch 단위로 분할된 후, 각 patch가 임베딩되어 self-attention 모듈로 들어간다.
Self-Attention의 연산 복잡도는 다음과 같다:

: patch의 개수
: hidden dimension (embedding 차원)

(i) Hidden dimension을 두 배로 늘릴 경우

-> 연산량 2배 증가

(ii) 입력 이미지의 Height와 Width를 두 배로 늘릴 경우

이미지 크기: H × W → 2H × 2W
patch size가 고정되어 있으므로 patch 개수는 N → 4N

-> 연산량 16배 증가

(iii) Patch size를 두 배로 늘릴 경우

이미지 크기 고정, patch size: P × P → 2P × 2P
patch 개수는 N → N/4

-> 연산량 1/16로 감소

(iv) Transformer layer 수를 두 배로 늘릴 경우

layer 수: L → 2L

-> Self-Attention 연산이 layer마다 반복되므로, 연산량 2배 증가

728x90

'2. Artificial Intelligence > Stanford Univ. CS231n' 카테고리의 다른 글

[Stanford Univ: CS231n] Spring 2025 Assignment3. Q2(Self-Supervised Learning for Image Classification) (1)	2025.06.04
[Stanford Univ: CS231n] Lecture 14. Reinforcement Learning (3)	2025.05.29
[Stanford Univ: CS231n] Lecture 13. Generative Models (1)	2025.05.25
[Stanford Univ: CS231n] Spring 2025 Assignment2. Q5(Image Captioning with Vanilla RNNs) (1)	2025.05.23
[Stanford Univ: CS231n] Lecture 12. Visualizing and Understanding (1)	2025.05.23

'2. Artificial Intelligence/Stanford Univ. CS231n' Related Articles

newhaneul

[Stanford Univ: CS231n] Spring 2025 Assignment3. Q1(Image Captioning with Transformers) 본문

[Stanford Univ: CS231n] Spring 2025 Assignment3. Q1(Image Captioning with Transformers)

Q1. Image Captioning with Transformers

Transformer: Multi-Headed Attention

Step 1. Query, Key, Value projection

Step 2. Multi-Head 분리(Head Split)

Step 3. Scaled Dot-Product Attention score 계산

Step 4. Apply Mask

Step 5. Softmax + Dropout

Step 6. Weighted sum of values

Step 7. Concatenate heads

Step 8. Final projection

Transformer: Positional Encoding

Step 1. 각 위치와 주파수 준비

Step 2. Positional Encoding 연산

Step 3. Input 데이터에 Positional Encoding을 더한 후 Dropout 적용

Inline Question 1

1. Using multiple attention heads as opposed to one

2. Dividing by sqrt(d/h)​ before applying the softmax function

3. Adding a linear transformation to the output of the attention operation

Transformer: Decoder Layer

Step 1. Self-Attention Block

Step 2. Cross-Attention Block

Step 3. Feedforward Block

Step 1. Caption embedding + Positional Encoding

Step 2. 이미지 특징을 임베딩 차원으로 투영 + 시퀀스 차원 확장

Step 3. 미래 시점 마스킹을 위한 attention mask 생성

Step 4. Transformer Decoder 실행

Step 5. vocab 분포로 투영

Vision Transformer(ViT): Patch Embedding

Case 1)

Case 2)

Transformer: Encoder Layer

Vision Transformer

ViT Implement

Inline Question 2:

ViT가 소규모 데이터셋에서 CNN보다 성능이 낮은 이유:

작은 데이터셋에서 ViT 성능을 향상시키는 방법

Inline Question 3:

ViT의 Self-Attention 연산량 변화 분석

(i) Hidden dimension을 두 배로 늘릴 경우

(ii) 입력 이미지의 Height와 Width를 두 배로 늘릴 경우

(iii) Patch size를 두 배로 늘릴 경우

(iv) Transformer layer 수를 두 배로 늘릴 경우

'2. Artificial Intelligence > Stanford Univ. CS231n' 카테고리의 다른 글

티스토리툴바

2. Dividing by sqrt(d/h) before applying the softmax function