Understanding How the Transformer Works in Python [Implementing the Attention Mechanism]

Introduction
How Self-Attention Works
Implementing Self-Attention in NumPy
Implementing Multi-Head Attention in PyTorch
Why "Multi-Head"?
Summary

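Introduction

Nearly all modern LLMs are built on the Transformer architecture. It also comes up frequently on the E資格 exam (the JDLA's deep learning engineer certification), yet many people know the theory without ever having implemented it. In this article we implement a simple attention mechanism to build a hands-on understanding of how it works.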

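How Self-Attention Works

Self-Attention computes how much each word in a sentence should attend to every other word, including itself. The calculation uses three matrices: Query (Q), Key (K), and Value (V).

It is expressed as Attention(Q, K, V) = softmax(QK^T / √d_k) V, where d_k is the dimension of each key vector. The dot products of Q and K give a score for how strongly each pair of words is related, softmax turns each row of scores into weights that sum to 1, and those weights are used to take a weighted sum of the rows of V.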
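To make the formula concrete, here is a minimal pure-Python sketch that evaluates it by hand for two tokens. The toy values of Q, K, and V and the helper names `softmax_row` and `tiny_attention` are illustrative assumptions, not part of the implementation that follows.

```python
import math

def softmax_row(row):
    # Subtract the max before exponentiating for numerical stability
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def tiny_attention(Q, K, V):
    # Hypothetical helper: softmax(QK^T / sqrt(d_k)) V written with plain lists
    d_k = len(Q[0])
    outputs, all_weights = [], []
    for q in Q:
        # One score per key: dot(q, k) / sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax_row(scores)
        all_weights.append(w)
        # Weighted sum of the Value rows
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, all_weights

# Toy inputs (assumed values): each query matches exactly one key
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]

out, weights = tiny_attention(Q, K, V)
print([round(w, 2) for w in weights[0]])  # about [0.67, 0.33]
print([round(o, 2) for o in out[0]])      # about [6.7, 3.3]
```

The first token's query matches the first key, so it puts roughly two thirds of its weight on the first Value row and the rest on the second. The output is a blend of the Value rows rather than a hard lookup.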

Implementing Self-Attention in NumPy

import numpy as np

def self_attention(Q, K, V):
    """
    Scaled dot-product self-attention.
    Q, K, V: shape (seq_len, d_k)
    Returns the output (seq_len, d_k) and the attention weights (seq_len, seq_len).
    """
    d_k = Q.shape[-1]
    
    # Compute the scores: QK^T / sqrt(d_k)
    scores = np.matmul(Q, K.T) / np.sqrt(d_k)
    
    # Convert each row of scores into a probability distribution with softmax
    def softmax(x):
        exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
        return exp_x / np.sum(exp_x, axis=-1, keepdims=True)
    
    attention_weights = softmax(scores)
    
    # Take the weighted sum of the Value rows
    output = np.matmul(attention_weights, V)
    
    return output, attention_weights

# Try it on a small example
np.random.seed(42)
seq_len = 5   # sentence length (number of words)
d_k = 8       # dimension of each vector

Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)

output, weights = self_attention(Q, K, V)
print(f'Input shape: {Q.shape}')
print(f'Output shape: {output.shape}')
print('Attention weights (how much each word attends to the others):')
print(weights.round(3))
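The division by sqrt(d_k) in the score calculation is easy to overlook, so the following sketch compares softmax over raw and scaled scores. It assumes random standard-normal Q and K with an illustrative d_k of 512. The variable names and the "peak weight" measure are choices made for this example only.

```python
import numpy as np

def softmax(x):
    # Row-wise softmax with the usual max subtraction for stability
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_k = 5, 512  # assumed sizes, chosen to make the effect visible

Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))

raw = Q @ K.T                # dot products grow with d_k
scaled = raw / np.sqrt(d_k)  # scores used in the attention formula

# Average of the largest weight in each row: close to 1 means almost one-hot
peak_raw = softmax(raw).max(axis=-1).mean()
peak_scaled = softmax(scaled).max(axis=-1).mean()

print(f'std of raw scores:    {raw.std():.2f}')
print(f'std of scaled scores: {scaled.std():.2f}')
print(f'mean peak weight without scaling: {peak_raw:.3f}')
print(f'mean peak weight with scaling:    {peak_scaled:.3f}')
```

With unit-variance inputs the raw dot products have a standard deviation of roughly sqrt(d_k), which pushes softmax toward a near one-hot distribution. Dividing by sqrt(d_k) brings the scores back to roughly unit scale, so the weights stay spread across several words.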

Implementing Multi-Head Attention in PyTorch

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
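    def __init__(self, d_model, num_heads):
        super().__init__()
        # d_model must split evenly across the heads
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # dimension per head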
        
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
    
    def split_heads(self, x, batch_size):
        # (batch, seq, d_model) -> (batch, seq, heads, d_k)
        x = x.view(batch_size, -1, self.num_heads, self.d_k)
        return x.transpose(1, 2)  # (batch, heads, seq, d_k)
    
    def forward(self, x):
        batch_size = x.size(0)
        
        Q = self.split_heads(self.W_q(x), batch_size)
        K = self.split_heads(self.W_k(x), batch_size)
        V = self.split_heads(self.W_v(x), batch_size)
        
        # Scaled dot-product attention, computed independently for each head
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.d_k ** 0.5)
        weights = torch.softmax(scores, dim=-1)
        out = torch.matmul(weights, V)
        
        # Concatenate the heads back into (batch, seq, d_model)
        out = out.transpose(1, 2).contiguous()
        out = out.view(batch_size, -1, self.num_heads * self.d_k)
        
        return self.W_o(out)

# Quick check
d_model = 64
num_heads = 8
batch_size = 2
seq_len = 10

mha = MultiHeadAttention(d_model, num_heads)
x = torch.randn(batch_size, seq_len, d_model)
output = mha(x)
print(f'Input shape: {x.shape}')
print(f'Output shape: {output.shape}')  # should match the input shape
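The reshaping in `split_heads` and in the head-concatenation step is the least obvious part of the class above. The sketch below mirrors those steps in NumPy, using `reshape` and `transpose` as stand-ins for the PyTorch `view` and `transpose` calls. This substitution is an assumption made to keep the example lightweight. It uses a counting array so that the layout is easy to inspect.

```python
import numpy as np

batch_size, seq_len, d_model, num_heads = 2, 10, 64, 8
d_k = d_model // num_heads

# A counting array makes it easy to see where each feature ends up
x = np.arange(batch_size * seq_len * d_model, dtype=np.float32)
x = x.reshape(batch_size, seq_len, d_model)

# Split: (batch, seq, d_model) -> (batch, seq, heads, d_k) -> (batch, heads, seq, d_k)
heads = x.reshape(batch_size, seq_len, num_heads, d_k).transpose(0, 2, 1, 3)

# Merge: (batch, heads, seq, d_k) -> (batch, seq, heads, d_k) -> (batch, seq, d_model)
merged = heads.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, d_model)

print(heads.shape)                 # (2, 8, 10, 8)
print(merged.shape)                # (2, 10, 64)
print(np.array_equal(x, merged))   # True: split followed by merge is lossless

# Head h receives features h*d_k .. (h+1)*d_k - 1 of every token
h = 3
print(np.array_equal(heads[0, h], x[0, :, h * d_k:(h + 1) * d_k]))  # True
```

Each head therefore works on its own contiguous d_k-sized slice of the projected features, and merging the heads only rearranges values without changing them.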