Attention Mechanism
- Query sequence \(Q \in \mathbb{R}^{T \times D}\).
- Key sequence \(K \in \mathbb{R}^{T' \times D}\).
- Value sequence \(V \in \mathbb{R}^{T' \times D'}\).
- Attention matrix \(A = \text{softmax}_\text{row} \Big(\frac{QK^\top}{\sqrt{D}}\Big) \in \mathbb{R}^{T \times T'}\).
- Weighted result sequence \(Y = AV \in \mathbb{R}^{T \times D'}\).
See Figure 4.13 on p. 92 of (??, ????).
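
The definitions above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `softmax_rows` and `attention`, the random inputs, and the concrete sizes are all assumptions chosen for the example. The row maximum is subtracted before exponentiating, a common numerical-stability step that does not change the softmax result.

```python
import numpy as np

def softmax_rows(S):
    # Softmax applied independently to each row of S.
    # Subtracting the row maximum avoids overflow and leaves the result unchanged.
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def attention(Q, K, V):
    # Q: (T, D), K: (T', D), V: (T', D')
    D = Q.shape[1]
    A = softmax_rows(Q @ K.T / np.sqrt(D))  # attention matrix, shape (T, T')
    Y = A @ V                               # weighted result, shape (T, D')
    return A, Y

# Illustrative sizes (assumed for this example only).
T, T_prime, D, D_prime = 3, 5, 4, 2
rng = np.random.default_rng(0)
Q = rng.standard_normal((T, D))
K = rng.standard_normal((T_prime, D))
V = rng.standard_normal((T_prime, D_prime))

A, Y = attention(Q, K, V)
print(A.shape, Y.shape)  # (3, 5) (3, 2)
```

Each row of `A` is non-negative and sums to 1, so each row of `Y` is a convex combination of the rows of `V`.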