Tokenization, Embedding, Positional Encoding,
Multi-head Attention and Sublayers of an AI
Transformer in C/C++

Materials available at: http://forejune.co/cuda/

Recap

Neural Networks Process Numbers

To process language, map words to numbers: tokenization and embedding

Tokenization: split text into tokens, and assign each token a unique identifier (number)

Tokens ~ words, phrases, sub-words, punctuation marks, or characters

Embedding: Organizing tokens into a Matrix
Example:

A vocabulary has 10,000 tokens,
Each embedding has 300 dimensions
The embedding matrix is 10,000 x 300

Implementation:

class Tokenizer { private: // unordered_map stores data in unique key-value pairs unordered_map<string, int> words; //words with index unordered_map<int, string> wordIndex; //indexing words int nTokens; // number of words (tokens) public: Tokenizer(ifstream &fs) { string token; vector<string> tokens; // Read tokens (words) one by one, delimited by whitespace int k = 0; words["[UNK]"] = k; //unknown token wordIndex[k] = "UNK]"; k++; while (fs >> token) { if (!words.contains(token)){ words[token] = k; wordIndex[k] = token; k++; } } nTokens = words.size(); } // Change a vector of tokens back to text string detokenize(const vector<int>& tokenIDs) { string str; for (int i = 0; i < tokenIDs.size(); i++) { if (wordIndex.find(tokenIDs[i]) != wordIndex.end()) { str += wordIndex[tokenIDs[i]]; // add a delimiter if (i < tokenIDs.size() - 1 && wordIndex[tokenIDs[i+1]] != "!") str += " "; } } return str; } ..... };

typedef vector<vector<double>> matrixd; class Embedding { private: int nTokens; // number of words int dim; // embedding dimension matrixd em_matrix; // embedding matrix; public: Embedding(int sequence_length, int embeddingDimension) { nTokens = sequence_length; dim = embeddingDimension; // initialize embeddings as an nTokens x dim matrix with certain random values initMatrix(em_matrix, nTokens, dim); } // create embedding matrix with tokenIDs matrixd embed(const vector<int>& tokenIDs) { matrixd embeddings; embeddings.reserve(tokenIDs.size()); // set capacity of embeddings vector for(int i = 0; i < tokenIDs.size(); i++) { if (tokenIDs[i] >= 0 && tokenIDs[i] < nTokens) embeddings.push_back(em_matrix[tokenIDs[i]]); else // Out-of-dictionary tokens embeddings.push_back(vector<double>(dim, 0.0)); } return embeddings; } ..... };

Transformer

A Neural Network Architecture Introduced in 2017

Transformer architecture
left ~ encoder, right ~ decoder

Totally 6 identical layers

Nx means the block with loop arrow repeating N times

Each layer has sublayers,
(Multi-Head Attention, Feed Forward ...)

Encoder maps an input sequence x = (x₁,x₂, ....,, x_n) to
a continuous representation z = (z₁,z₂, ....,, z_n)

Generates an output sequence (y₁,y₂, ....,, y_m)
of symbols one element at a time

If x is input to a sub-layer and F_s(x) isthe output.
The quantities will be ''Add & Norm": Output of ''Add & Norm" = Normalization ( F_s(x) + x )

The Positional Encoding (PE)

Injects information about token order into the model:

which token comes first, second, ...
relative distance between tokens
directions (left vs. right context)

e.g. without PE,
a dog bites a man = a man bites a dog

_model

pos

Positional encoding would let the transformer know where each token is in the sequence
Characteristics:

Each position gets a unique encoding vector
Allows model easily learn relative positions
Values are bounded
The values are deterministic and fixed

class PositionalEncoding
{
private:
    int maxTokens; // maximum sequence length
    int dModel;    // embedding dimension
    matrixd PE;    // positional_encodings matrix;

public:
    PositionalEncoding(int max_seq_length, int embedding_dimension)
    {
      maxTokens = max_seq_length;
      dModel = embedding_dimension;

      PE.resize(maxTokens, vector(dModel));
      for (int pos = 0; pos < maxTokens; pos++) {
        for (int i = 0; i < dModel/2; i++) {
                PE[pos][2*i] = sin( pos / pow(10000, 2.0*i/dModel) );
                PE[pos][2*i+1] = cos( pos / pow(10000, 2.0*i/dModel) );
        }
      }
    }

   matrixd addPE(const matrixd& embeddings)
   {
        matrixd x = embeddings;
        int nTokens = embeddings.size();

        for (int pos = 0; pos < nTokens; pos++)
          for (int i = 0; i < dModel; i++)
            x[pos][i] += PE[pos][i];

        return x;
    }
    .....
};

Sublayers and Modules

Softmax -- converts logits (outputs from a linear sublayer) into token probabilities. Allows model to

generate text by sampling from the probability distribution
perform statistical tasks such as classification
calculate perplexity ("surprise")
a lower perplexity score implies less surprised
a high perplexity indicates the model struggles with the data

₁

₂

softmax(z_i) = e^z_i / ∑e^z_i

softmax(z_i) = e^z_i-max / ∑e^z_i-max
max = maximum element in vector

// softmax activation function, 1D input vector
vector softmax(const vector &z)
{
  int m = z.size();
  vector<double> y(m, 0);

   //maximum value of vector
  double maxVal = *max_element(z.begin(), z.end());

  double sum = 0;
  // Subtract max for numerical stability
  for (int j = 0; j < m; j++) {
    y[j] = exp(z[j] - maxVal);
    sum += y[j];
  }

  // Normalize
  for (int j = 0; j < m; j++)
    y[j] /= sum;

  return y;
}

Linear -- a linear transformation (an affine transformation) applied to vectors


  y =  x W + b 
where
  x = input vector (or matrix of token embeddings)
  W = weight matrix
  b = bias vector
Example:
  linear in multi-head attention(Q, K, V):

  Q = X x W_Q, K = X x W_K, V = X x W_V

  Y = C x W_o,	C = concatenated attention output

Linear Layers mix features across dimensions
  e.g.	C x W_o mix features from various heads

They are efficient on GPUs/TPUs

....
//X contains the embeddings, WQ.. are weight matrices
auto Q = matmul(X, WQ);
auto K = matmul(X, WK);
auto V = matmul(X, WV);
....
// muliply by WO to give final output
auto y = matmul(concat, WO);

Add & Norm


  A crucial layer combining a residual connection (Add)
    (adding the input back to the sublayer's output) with Layer Normalization (Norm) to 
    stabilize training.

    Add means add a residual connection:
	  Let F(x) = output of sublayer with input x
	  Add(x, F(x)) = x + F(x)

    Norm means layer normalization:
	
		Initially, we can choose γ = 1, β = 0
	  matrixd addNnorm(const matrixd &A, const matrixd &B)
  {
     auto x = addMat(A, B);
     int n = x.size();          // rows
     int m = x[0].size();       // cols
     matrixd y = x;             //output same size as x 
     double epsilon = 1e-5;

     for (int i = 0; i < n; i++) {
       double sum = 0;          // sum of one row
       double std = 0;          //sigma_i square
       for (int j = 0; j < m; j++)
         sum += x[i][j];
        double mean = sum / m;  // mu_i, mean of i-th row

        sum = 0;
        for (int j = 0; j < m; j++) {
          double diff = x[i][j] - mean;
          sum += diff * diff;
        }
        std = sum / m;
        for (int j = 0; j < m; j++) {
          double xd = (x[i][j] - mean) / sqrt(std + epsilon);
          y[i][j] = xd;         // gamma = 1, beta = 0;
        }
     }

     return y;  // final output of Add & norm
   }

Multi-head Attention



Scaled Dot-Product Attention



masking -- discard some non-text special tokens
	class MultiHeadAttention {
private:
    int dModel;
    int nHeads;
    int d_k; // dimension per head

    // weights: combine Q,K,V projections as separate matrices 
    // (d_model x d_model)
    matrixd WQ, WK, WV, WO;
public:
    MultiHeadAttention(int dm, int nh)
    {
      dModel = dm;
      nHeads = nh; 
      d_k = dModel / nHeads; 
      initMatrix(WQ, dModel, dModel);
      initMatrix(WK, dModel, dModel);
      initMatrix(WV, dModel, dModel);
      initMatrix(WO, dModel, dModel);
    }

    // input nTokens x dModel
    //matrixd forward(const matrixd &X)
    matrixd multiHeadAttention(const matrixd &X)
    {
      // X contains the embeddings, WQ ... are weight matrices
      auto Q = matmul(X, WQ);
      auto K = matmul(X, WK);
      auto V = matmul(X, WV);

      int nTokens = X.size();
      
      // cancat outputs of heads 
      matrixd head_outputs;
      // each head: nTokens x d_k
      head_outputs.assign(nHeads, vector<double>(dModel, 0));
      for (int h = 0; h < nHeads; h++){
	// for each token i compute attention over j
	for (int i = 0; i < nTokens; i++){
	  vector<double>scores(nTokens);
	  for(int j = 0; j < nTokens; j++){
	    // dot over slice [h*d_k ... (h+1)*d_k-1]
	    double sum = 0;
	    for(int t = 0; t < d_k; t++) {
	      int idx = h * d_k + t;
	      sum += Q[i][idx] * K[j][idx]; 
	    }
	    scores[j] = sum / sqrt(d_k); 
	  }
	  auto attention1D = softmax( scores );

	  // multiply attention1D by corresponding slice (column) of V 
	  for(int t = 0; t < d_k; t++) {
	    double sum = 0;
	    for(int k = 0; k < nTokens; k++){  //scan the column
	       int idx = h*d_k + k;
	       sum += attention1D[k] * V[k][idx];
	    }
	    head_outputs[h][i*d_k + t] = sum;
	  }
	}
    }
    // concatenate heads: nTokens x dModel
    matrixd concat(nTokens, vector<double>(dModel, 0));
    for(int i = 0; i < nTokens; i++)
      for(int h=0; h < nHeads;  h++)
        for(int t = 0; t < d_k; t++)
          concat[i][h*d_k + t] = head_outputs[h][i*d_k + t];
	
    // muliply by WO to give final output
    auto y = matmul(concat, WO);

    return y;
  }
};




	
	Feed Forward
Self-attention allows tokens to exchange information with other tokens, 
but by itself it is largely linear and focused on where to look.

The feed-forward sublayer is to independently process and 
transform the information at each individual position, after it has been mixed 
with context from other positions by the Attention mechanism

Analogy:
    A token ~ a person in a meeting
    Attention ~ listening to everyone else
    FFN ~ thinking privately and updating your own understanding
  
 For an input vector x_i = (x_i0,x_i1, x_i2, ...., x_im)^T representing one token's embeddings
 with dimension m = d_model (x_i0 is the bias)

	FFN(x_i) = W₂ σ(W₁ x_i) 

    Weight matrix W₁ expands the dimension (e.g. d_ff = 4d_model)
    σ is a non-linear activation function (e.g. ReLU( z ) = max(0, z) )
    
    Weight matrix W₂ projects back to the model the dimension d_model
		
  ~ MLP with one hidden layer, with a nonlinear activation function for hidden layer and
    a linear activation function (identical) for the output layer
         for the compatibility with the residual connections (Add)
    		x_out = x + FFN(x)

If the FFN output were passed through a non-linear activation function:

The residual would mix linear and non-linear spaces
The identity path would be distorted
The gradient flow would be worse

 We could organize the embeddings into an n x d_model matrix	
     n = number of tokens

  matrixd FeedForward(const matrixd& x, const matrixd& w1, const matrixd& w2)
  {

    matrixd z = x;   // output same size as x
    auto A = matmul(x, w1);
    matrixd y = A;   // same matrix size
    int n = A.size();
    int m = A[0].size();
    for (int i = 0; i < n; i++)
       for(int j = 0; j < m; j++)
         y[i][j] = f(A[i][j]);  // f is an activation function

     z = matmul(y, w2);

     return z;   // output of FeedForward
  }

Tokenization, Embedding, Positional Encoding, Multi-head Attention and Sublayers of an AI Transformer in C/C++

Tokenization, Embedding, Positional Encoding,
Multi-head Attention and Sublayers of an AI
Transformer in C/C++