Transformers
Taken mainly from (, ). The Transformer network takes in a sequence of \(N\) tokens of dimensionality \(D\). Many types of data can be tokenized, which allows different modalities to be mixed. The transformer takes the input data \(X^{(0)}\) (of dimension \(D\times N\)) and outputs another matrix \(X^{(M)}\) of the same size, where the \(n^{th}\) column is the feature vector representing the sequence at the location of token \(n\). These representations can be used to predict the \((n+1)^{th}\) token (i.e. autoregressive prediction), for global classification of the entire sequence, for sequence-to-sequence or image-to-image prediction problems, etc. \(M\) denotes the number of layers of the transformer.
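As a shape-level sketch of this interface (a minimal illustration assuming NumPy, treating each block as a black-box \(D\times N \to D\times N\) callable; a possible block is sketched further below):

```python
import numpy as np

def transformer(X0, blocks):
    """Map the D x N input X^(0) to a D x N output X^(M) by applying
    M transformer blocks in sequence (M = len(blocks))."""
    X = X0
    for block in blocks:   # each block maps a D x N array to a D x N array
        X = block(X)
    return X

# Example usage (shapes only): D, N = 64, 10
# X0 = np.random.randn(64, 10)
# XM = transformer(X0, blocks)   # column n of XM represents the sequence at token n
```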
The transformer block has 2 stages:
Self-attention over time: operates across the sequence. (Updates the features by aggregating information across the sequence, independently for each feature, according to the relationships between tokens; acts horizontally across the rows of \(X^{(m-1)}\).)
The output is another \(D \times N\) array \(Y^{(m)}\), made by aggregating information across the sequence independently for each feature using the attention mechanism. Specifically, the \(n^{th}\) column of \(Y^{(m)}\) is a weighted average of the input features at locations \(n' \in \{1,\cdots, N\}\), i.e. \(y_n^{(m)} = \sum_{n'=1}^{N}x_{n'}^{(m-1)}A_{n',n}^{(m)}\), where \(A^{(m)}\) is called the attention matrix; it has size \(N\times N\) and its columns are normalized, i.e. \(\sum_{n'=1}^{N}A_{n',n}^{(m)}=1\). The attention matrix assigns higher values to locations \(n'\) that are more relevant for location \(n\). More compactly, \(Y^{(m)} = X^{(m-1)}A^{(m)}\).
Now, the question is where the attention matrix comes from: it is generated from the input sequence itself, which is why this is called self-attention. In the standard formulation, each token produces a query \(q_n^{(m)} = U_q^{(m)} x_n^{(m-1)}\) and a key \(k_n^{(m)} = U_k^{(m)} x_n^{(m-1)}\) via learned linear projections, and the attention weights are a softmax over the query-key dot products, \(A_{n',n}^{(m)} = \frac{\exp\big(k_{n'}^{(m)\top} q_n^{(m)}\big)}{\sum_{n''=1}^{N}\exp\big(k_{n''}^{(m)\top} q_n^{(m)}\big)}\), so that locations whose keys align with token \(n\)'s query receive higher weight.
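A minimal NumPy sketch of this stage under that standard formulation; the names U_q and U_k, the key dimension, and the \(1/\sqrt{K}\) scaling (commonly used for numerical conditioning but not required by the equations above) are assumptions:

```python
import numpy as np

def self_attention(X, U_q, U_k):
    """Single-head self-attention in the column-normalized convention above.
    X: D x N inputs; U_q, U_k: K x D learned projections (assumed names).
    Returns Y = X A (D x N) and the attention matrix A (N x N)."""
    Q = U_q @ X                                   # K x N queries, q_n = U_q x_n
    K = U_k @ X                                   # K x N keys,    k_n = U_k x_n
    scores = (K.T @ Q) / np.sqrt(K.shape[0])      # scores[n', n] = k_{n'}^T q_n / sqrt(K)
    scores -= scores.max(axis=0, keepdims=True)   # for numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=0, keepdims=True)             # softmax over n': each column sums to 1
    return X @ A, A                               # y_n = sum_{n'} x_{n'} A_{n', n}

# Example:
# X = np.random.randn(8, 5)                                        # D = 8, N = 5
# Y, A = self_attention(X, np.random.randn(4, 8), np.random.randn(4, 8))
# A.sum(axis=0)  ->  array of ones
```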
The second stage is an MLP that operates across the features. (Updates the features representing each token, applying the same MLP to every token independently; acts vertically across the columns of \(X^{(m-1)}\).)
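Putting the two stages together gives a bare-bones block (a sketch only: it reuses self_attention from above, the parameter names are assumptions, and the residual connections and layer normalization used in practice are omitted):

```python
import numpy as np

def mlp_across_features(Y, W1, b1, W2, b2):
    """Stage 2: the same two-layer MLP applied to each column (token) of Y.
    Mixes features within a token, not information across the sequence."""
    H = np.maximum(0.0, W1 @ Y + b1[:, None])    # hidden layer with ReLU, shape H x N
    return W2 @ H + b2[:, None]                  # project back to D x N

def transformer_block(X, params):
    """One block in the two-stage form described above."""
    Y, _ = self_attention(X, params["U_q"], params["U_k"])        # stage 1: across the sequence
    return mlp_across_features(Y, params["W1"], params["b1"],
                               params["W2"], params["b2"])        # stage 2: across the features
```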
Resources
- Transformer Explainer - an interactive web application with an accompanying blog post.