Transformers

Taken mainly from (, ). The Transformer network takes in a sequence of \(N\) tokens, each of dimensionality \(D\). Many types of data can be tokenized, which allows different modalities to be mixed. The transformer takes the input data \(X^{(0)}\) (a \(D \times N\) matrix) and outputs another matrix \(X^{(M)}\) of the same size, where the \(n^{\text{th}}\) column is the feature vector representing the sequence at the location of token \(n\). These representations can be used to predict the \((n+1)^{\text{th}}\) token (i.e. autoregressive prediction), to classify the entire sequence, or to solve sequence-to-sequence and image-to-image prediction problems. \(M\) denotes the number of layers of the transformer.
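
To make the shape contract concrete, here is a minimal NumPy sketch; the identity transformer_block placeholder and the readout matrix W_out are hypothetical stand-ins, not taken from the source.

#+begin_src python
import numpy as np

D, N, M, V = 8, 16, 4, 100    # feature dim, tokens, layers, vocab size (illustrative)

def transformer_block(X):
    """Identity stand-in for one block; a real block is sketched further below."""
    return X                  # maps D x N to D x N

X = np.random.randn(D, N)     # X^(0): one D-dimensional column per token
for _ in range(M):
    X = transformer_block(X)  # the D x N shape is preserved at every layer
assert X.shape == (D, N)      # X^(M)

# Autoregressive readout (hypothetical): column n scores candidates for token n+1
W_out = np.random.randn(V, D)
logits = W_out @ X            # V x N
#+end_src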

The transformer block has two stages: the first is a self-attention stage that operates across the sequence, letting each token's representation draw on information from the other tokens; the second is a multi-layer perceptron (MLP) that operates across the feature dimension, transforming each token's representation independently with shared weights. A sketch of a single block follows.
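
The NumPy sketch below implements one such block under simplifying assumptions: single-head attention with head dimension \(D\), a ReLU MLP, and pre-layer-norm residual wiring. All weight names are hypothetical, and this is an illustration of the two stages rather than the source's exact formulation.

#+begin_src python
import numpy as np

def softmax_cols(A):
    """Softmax over each column (numerically stabilized)."""
    A = A - A.max(axis=0, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=0, keepdims=True)

def layer_norm(X, eps=1e-5):
    """Normalize each token's feature vector (each column) to zero mean, unit variance."""
    return (X - X.mean(axis=0, keepdims=True)) / (X.std(axis=0, keepdims=True) + eps)

def transformer_block(X, Wq, Wk, Wv, W1, b1, W2, b2):
    """One block acting on a D x N matrix X (columns are tokens)."""
    # Stage 1: self-attention mixes information across the sequence
    Xn = layer_norm(X)
    Q, K, V = Wq @ Xn, Wk @ Xn, Wv @ Xn               # each D x N
    A = softmax_cols(K.T @ Q / np.sqrt(Q.shape[0]))   # N x N; column n weights the tokens that token n attends to
    X = X + V @ A                                     # residual connection

    # Stage 2: position-wise MLP, the same weights applied to every column
    Xn = layer_norm(X)
    X = X + W2 @ np.maximum(0.0, W1 @ Xn + b1) + b2   # ReLU MLP + residual
    return X

# Example usage with random weights
rng = np.random.default_rng(0)
D, N, H = 8, 16, 32                                   # feature dim, tokens, MLP hidden dim
X = rng.standard_normal((D, N))
params = (rng.standard_normal((D, D)) * 0.1,          # Wq
          rng.standard_normal((D, D)) * 0.1,          # Wk
          rng.standard_normal((D, D)) * 0.1,          # Wv
          rng.standard_normal((H, D)) * 0.1,          # W1
          np.zeros((H, 1)),                           # b1
          rng.standard_normal((D, H)) * 0.1,          # W2
          np.zeros((D, 1)))                           # b2
out = transformer_block(X, *params)
assert out.shape == (D, N)
#+end_src

Each column of the \(N \times N\) attention matrix sums to one, so stage 1 forms convex combinations of value vectors across the sequence, while stage 2 touches each column separately: mix across tokens, then across features.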

Resources
