Building Your Own Mini-ChatGPT with R: From Markov Chains to Transformers!

Remember our journey so far? We started with simple Markov chains showing how statistical word prediction works, then dove into the core concepts of word embeddings, self-attention, and next word prediction. Now, it’s time for the grand finale: if you want to build your own working transformer language model in R, read on!
You might say: no way! But yes: following the mantra that you only truly understand what you have built yourself from scratch, we will create a mini-ChatGPT that learns to write like “Alice in Wonderland” and the “Wizard of Oz”!
The Secret Sauce: Bringing It All Together
What we’ve learned so far:
- Neural networks build a representation of the world based on their training data
- Markov chains showed us that text generation is fundamentally about predicting the next word
- Word embeddings convert words into numerical vectors that capture meaning
- Self-attention lets the model focus on relevant words when making predictions
A transformer combines ALL of these concepts into one powerful architecture. Think of it as a sophisticated Markov chain that doesn’t just look at the previous few words, but can attend to any word in the entire context, understanding relationships and patterns across the whole text!
From Theory to Practice: The R Implementation
Let’s build a complete transformer step by step, using the same alice_oz.txt file from our Markov chain example:
Step 1: Word-Level Tokenization
library(torch) # install from CRAN
# Create word-level tokenizer
create_tokenizer <- function(text) {
  text <- tolower(text)
  words <- unlist(strsplit(text, "\\s+"))
  words <- words[words != ""]
  
  unique_words <- sort(unique(words))
  vocab <- c("<start>", "<end>", unique_words)
  
  word_to_idx <- setNames(seq_along(vocab), vocab)
  idx_to_word <- setNames(vocab, seq_along(vocab))
  
  list(word_to_idx = word_to_idx, idx_to_word = idx_to_word, vocab_size = length(vocab))
}
Unlike our Markov chain, which worked with fixed n-grams, this tokenizer maps every word to an integer index so that our transformer can process entire sequences.
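To make the mapping concrete, here is a small sketch of how text could be turned into indices and back. The helpers encode_text and decode_ids are hypothetical (they are not part of the original code), and the tiny vocabulary is built inline in the same way as in create_tokenizer() so that the snippet runs on its own. Dropping unknown words is an assumption made for simplicity.

```r
# Tiny stand-in vocabulary, built the same way as in create_tokenizer()
words <- unlist(strsplit(tolower("Alice was beginning to get very tired"), "\\s+"))
vocab <- c("<start>", "<end>", sort(unique(words)))
word_to_idx <- setNames(seq_along(vocab), vocab)
idx_to_word <- setNames(vocab, seq_along(vocab))

# Hypothetical helper: text -> integer indices (assumption: unknown words are dropped)
encode_text <- function(text, word_to_idx) {
  words <- unlist(strsplit(tolower(text), "\\s+"))
  words <- words[words != ""]
  ids <- unname(word_to_idx[words])
  ids[!is.na(ids)]
}

# Hypothetical helper: integer indices -> text
decode_ids <- function(ids, idx_to_word) {
  paste(unname(idx_to_word[as.character(ids)]), collapse = " ")
}

ids <- encode_text("Alice was very tired", word_to_idx)
decoded <- decode_ids(ids, idx_to_word)
print(ids)
print(decoded)
```

Note that a word-level vocabulary like this has no entry for words it has never seen, which is one reason production models use subword tokenizers.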
Step 2: Self-Attention
transformer_layer <- nn_module(
  initialize = function(d_model, n_heads) {
    self$d_model <- d_model
    self$n_heads <- n_heads
    self$d_k <- d_model %/% n_heads
    
    # The Q, K, V matrices for the attention mechanism
    self$w_q <- nn_linear(d_model, d_model, bias = FALSE)
    self$w_k <- nn_linear(d_model, d_model, bias = FALSE)  
    self$w_v <- nn_linear(d_model, d_model, bias = FALSE)
    self$w_o <- nn_linear(d_model, d_model)
    
    # Feed-forward neural network
    self$ff <- nn_sequential(
      nn_linear(d_model, d_model * 4),
      nn_relu(),
      nn_linear(d_model * 4, d_model)
    )
    
    self$ln1 <- nn_layer_norm(d_model)
    self$ln2 <- nn_layer_norm(d_model)
    self$dropout <- nn_dropout(0.1)
  },
  
  forward = function(x, mask = NULL) {
    # Multi-head self-attention (exactly like our simple example, but multi-headed!)
    batch_size <- x$size(1)
    seq_len <- x$size(2)
    
    q <- self$w_q(x)$view(c(batch_size, seq_len, self$n_heads, self$d_k))$transpose(2, 3)
    k <- self$w_k(x)$view(c(batch_size, seq_len, self$n_heads, self$d_k))$transpose(2, 3)
    v <- self$w_v(x)$view(c(batch_size, seq_len, self$n_heads, self$d_k))$transpose(2, 3)
    
    # Scaled dot-product attention
    scores <- torch_matmul(q, k$transpose(-2, -1)) / sqrt(self$d_k)
    
    if (!is.null(mask)) {
      scores <- scores + mask$unsqueeze(1)$unsqueeze(1)
    }
    
    attn_weights <- nnf_softmax(scores, dim = -1)
    attn_output <- torch_matmul(attn_weights, v)
    
    # Combine heads and apply output projection
    attn_output <- attn_output$transpose(2, 3)$contiguous()$view(c(batch_size, seq_len, self$d_model))
    attn_output <- self$w_o(attn_output)
    
    # Residual connection and layer norm
    x <- self$ln1(x + self$dropout(attn_output))
    
    # Feed-forward
    ff_output <- self$ff(x)
    x <- self$ln2(x + self$dropout(ff_output))
    
    x
  }
)
This is our self-attention mechanism in action! Just like in our simple 3×3 example, but now it works with entire sequences and multiple attention heads.
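If the tensor reshaping hides what is going on, the following sketch may help. It computes single-head scaled dot-product attention with a causal mask in base R, using plain matrices instead of torch tensors. The random Q, K and V matrices, the sizes, and the helper softmax_rows are illustrative assumptions, not part of the model above.

```r
# Row-wise softmax; subtracting the row maximum keeps exp() numerically stable
softmax_rows <- function(m) {
  e <- exp(m - apply(m, 1, max))
  e / rowSums(e)
}

set.seed(42)
seq_len <- 4   # number of words in the toy sequence (assumption)
d_k <- 3       # size of each query/key/value vector (assumption)

# Random stand-ins for the projected queries, keys and values
q <- matrix(rnorm(seq_len * d_k), seq_len, d_k)
k <- matrix(rnorm(seq_len * d_k), seq_len, d_k)
v <- matrix(rnorm(seq_len * d_k), seq_len, d_k)

# Scaled dot-product scores: one row per query word, one column per key word
scores <- q %*% t(k) / sqrt(d_k)

# Causal mask: -Inf above the diagonal, so a word cannot attend to later words
mask <- matrix(0, seq_len, seq_len)
mask[upper.tri(mask)] <- -Inf

attn_weights <- softmax_rows(scores + mask)
attn_output <- attn_weights %*% v

print(round(attn_weights, 3))
```

Each row of attn_weights sums to 1 and is zero above the diagonal. The first word can only attend to itself, so its output is simply its own value vector.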
Step 3: The Transformer Language Model
toy_llm <- nn_module(
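  initialize = function(vocab_size, d_model = 256, n_heads = 8, n_layers = 4) {
    # Store the embedding size; forward() uses it to scale the token embeddings
    self$d_model <- d_model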
    # Word embeddings (remember our love/is/wonderful example?)
    self$token_embedding <- nn_embedding(vocab_size, d_model)
    # Note: create_positional_encoding() is a helper that is not defined in the code shown here
    self$pos_encoding <- create_positional_encoding(512, d_model, "cpu")
    
    # Stack of transformer layers
    self$transformer_layer_1 <- transformer_layer(d_model, n_heads)
    if (n_layers >= 2) self$transformer_layer_2 <- transformer_layer(d_model, n_heads)
    if (n_layers >= 3) self$transformer_layer_3 <- transformer_layer(d_model, n_heads)
    if (n_layers >= 4) self$transformer_layer_4 <- transformer_layer(d_model, n_heads)
    self$n_layers <- n_layers
    
    # Output projection (back to vocabulary)
    self$ln_f <- nn_layer_norm(d_model)
    self$lm_head <- nn_linear(d_model, vocab_size)
    self$dropout <- nn_dropout(0.1)
  },
  
  forward = function(x) {
    seq_len <- x$size(2)
    
    # Causal mask (no peeking at future words!)
    mask <- torch_triu(torch_ones(seq_len, seq_len, device = x$device), diagonal = 1)
    mask <- mask$masked_fill(mask == 1, -Inf)
    
    # Token embeddings + positional encoding
    x <- self$token_embedding(x) * sqrt(self$d_model)
    pos_enc <- self$pos_encoding[1:seq_len, ]$to(device = x$device)
    x <- x + pos_enc
    x <- self$dropout(x)
    
    # Pass through transformer layers
    x <- self$transformer_layer_1(x, mask)
    if (self$n_layers >= 2) x <- self$transformer_layer_2(x, mask)
    if (self$n_layers >= 3) x <- self$transformer_layer_3(x, mask)
    if (self$n_layers >= 4) x <- self$transformer_layer_4(x, mask)
    
    # Final layer norm and projection to vocabulary
    x <- self$ln_f(x)
    logits <- self$lm_head(x)
    
    logits
  }
)
This is the core of the LLM, the transformer. This neural network architecture makes use of all of the above concepts, like embeddings, attention, and next word prediction!
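The model calls create_positional_encoding(), whose definition is not part of the code shown here. A common choice is the sinusoidal encoding from the original transformer paper; the sketch below assumes that this is what is meant and computes it as a plain R matrix. The function name sinusoidal_encoding_matrix is hypothetical, and a helper like create_positional_encoding() could then wrap the result with torch_tensor(pe, device = device).

```r
# Hypothetical helper: sinusoidal positional encoding as a max_len x d_model matrix
# (assumption: the post's create_positional_encoding() works along these lines)
sinusoidal_encoding_matrix <- function(max_len, d_model) {
  stopifnot(d_model %% 2 == 0)
  pos <- 0:(max_len - 1)                    # positions, counted from 0
  i <- seq(0, d_model - 2, by = 2)          # even dimension indices 0, 2, 4, ...
  # One frequency per sine/cosine pair; result is max_len x (d_model / 2)
  angle <- outer(pos, 1 / 10000^(i / d_model))
  pe <- matrix(0, max_len, d_model)
  pe[, seq(1, d_model, by = 2)] <- sin(angle)   # R columns 1, 3, 5, ...
  pe[, seq(2, d_model, by = 2)] <- cos(angle)   # R columns 2, 4, 6, ...
  pe
}

pe <- sinusoidal_encoding_matrix(512, 256)
print(dim(pe))
```

Every position gets a unique pattern of sines and cosines, which is added to the token embeddings so that the model can tell word order apart.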
Training Our Mini-ChatGPT
Now comes the magic – training our transformer on Alice in Wonderland and the Wizard of Oz:
# Load the same text from our Markov chain example
txt <- readLines(url("http://paulo-jorente.de/text/alice_oz.txt"), warn = FALSE)
training_text <- paste(txt, collapse = " ")
training_text <- gsub("[^a-zA-Z0-9 .,!?;:-]", "", training_text)
training_text <- tolower(training_text)
# Create tokenizer and model
tokenizer <- create_tokenizer(training_text)
model <- toy_llm(vocab_size = tokenizer$vocab_size, d_model = 256, n_heads = 8, n_layers = 4)
# Train the model (this is where the magic happens!)
# Note: train_model() is a helper that is not defined in the code shown here
train_model(model, training_text, tokenizer, epochs = 1500, seq_len = 32, batch_size = 4)
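The internals of train_model() are not shown, but the core idea of next word prediction can be sketched in base R: each training example is a window of seq_len word indices, and the target is the same window shifted by one word. The helper make_batch and the dummy token_ids below are assumptions for illustration, not the actual implementation.

```r
# Hypothetical helper: draw random training windows and their shifted targets
make_batch <- function(token_ids, seq_len, batch_size) {
  # Last valid start position so that the shifted target still fits
  max_start <- length(token_ids) - seq_len
  starts <- sample.int(max_start, batch_size, replace = TRUE)
  # Inputs: words s .. s + seq_len - 1; targets: the same window moved one word ahead
  inputs  <- do.call(rbind, lapply(starts, function(s) token_ids[s:(s + seq_len - 1)]))
  targets <- do.call(rbind, lapply(starts, function(s) token_ids[(s + 1):(s + seq_len)]))
  list(inputs = inputs, targets = targets)
}

set.seed(1)
token_ids <- 1:100   # stand-in for the encoded training text (assumption)
batch <- make_batch(token_ids, seq_len = 32, batch_size = 4)
print(dim(batch$inputs))
```

In a training loop, the model's logits for batch$inputs would typically be compared with batch$targets using a cross-entropy loss, followed by an optimizer step.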
The Results
After training, our mini-transformer produces text like this:
Prompt ‘alice’: alice looked down at them, and considered a little before she was going to shrink in the time and round the
Prompt ‘the queen’: the queen said to the executioner: fetch her here. and the executioner went off like an arrow. the cats head began fading
Prompt ‘down the’: down the chimney, and she said to herself now i can do no more, whatever happens. what will become of me? luckily
Compare this to our original Markov chain output:
anxious returned the Scarecrow It is such an uncomfortable feeling to know one is a crow or a man After the crows had gone I thought this over and decided
The transformer has learned:
- Character names and relationships (duchess, mock turtle, gryphon, queen of hearts, scarecrow, wizard)
- Story context and scenarios (Alice’s wonderland, Dorothy’s journey to Oz, dialogue patterns)
- Proper grammar and sentence structure
- The whimsical, narrative style of both Carroll’s and Baum’s writing
The Transformer Advantage
Unlike our Markov chain, which only looked at the previous two or three words, our transformer can:
- See the entire context window (32 words per training sequence here, with positional encodings for up to 512 positions)
- Understand long-range dependencies (like who “she” refers to)
- Learn complex grammar and style patterns
- Generate coherent narratives, not just word-by-word predictions
From Toy to Production
What we built is essentially a miniature version of ChatGPT! The same principles scale up:
- Industry-level GPTs have up to hundreds of billions of parameters (weights); our model has about 6 million
- GPTs train on hundreds of terabytes (we used two books)
- GPTs often train for months on high-performance clusters (this one takes less than 15 minutes on a standard computer with a GPU)
- Production models use sophisticated tokenizers (we used simple word splitting)
But the core architecture? Exactly the same!
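To get a feeling for where a number in the millions comes from, the weights of the architecture above can be counted by hand. The sketch below assumes the layer definitions from Steps 2 and 3 (linear layers with bias except for Q, K and V, two layer norms per transformer layer, and a fixed positional encoding that is not trained). The function count_params is hypothetical, and the vocabulary size of 5000 is an illustrative assumption, not the actual value for alice_oz.txt.

```r
# Hypothetical helper: count trainable weights of the toy transformer defined above
count_params <- function(vocab_size, d_model, n_layers) {
  attn <- 3 * d_model^2 + (d_model^2 + d_model)     # Q, K, V without bias; output projection with bias
  ff <- (d_model * 4 * d_model + 4 * d_model) +     # first feed-forward layer
        (4 * d_model * d_model + d_model)           # second feed-forward layer
  layer_norms <- 2 * 2 * d_model                    # two layer norms, each with weight and bias
  per_layer <- attn + ff + layer_norms

  embedding <- vocab_size * d_model                 # token embedding table
  final_norm <- 2 * d_model                         # final layer norm
  lm_head <- d_model * vocab_size + vocab_size      # projection back to the vocabulary

  embedding + n_layers * per_layer + final_norm + lm_head
}

total <- count_params(vocab_size = 5000, d_model = 256, n_layers = 4)
print(format(total, big.mark = ","))
```

With these assumed settings the count comes out at roughly 5.7 million. A large share sits in the embedding table and the output projection, so the total depends strongly on the vocabulary size.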
The Unreasonable Effectiveness of Transformers
What’s truly remarkable is that this simple architecture – predicting the next word using self-attention – gives rise to seemingly intelligent behavior. Our tiny model learned:
- Grammar rules (without being taught grammar)
- Character relationships (without being told who’s who)
- Story structure (without understanding “plot”)
- Writing style (without lessons in literature)
All from the simple task of “predict the next word”!
Isn’t it fascinating that so much apparently intelligent behavior emerges from statistical text prediction? As we saw in our Markov chain post, “many tasks that demand human-level intelligence can obviously be reduced to some form of (statistical) text prediction with a sufficiently performant model!”
To give you an intuition for why a neural network architecture is so powerful here: we have already seen that neural networks build a representation of their world, a world model (see: Understanding the Magic of Neural Networks). Now imagine a detective story that ends with “And now it was clear, the murderer was…”: to sensibly predict the next (and last) word, the neural network really must have understood the story in some sense!
Next Steps: The Adventure Continues!
You’ve now built your own language model using the same principles as ChatGPT! Next, you could experiment with:
- Different texts (Shakespeare? Scientific papers? Your own writing?)
- Larger models (more layers, bigger embeddings)
- Different hyperparameters
- Various generation strategies (temperature, top-k sampling)
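As a starting point for the last item, here is a minimal sketch of temperature and top-k sampling in base R. It operates on a plain numeric vector of logits for the next word; the function sample_next_token and the example logits are assumptions for illustration and are not taken from the model code above.

```r
# Hypothetical helper: pick the next word index from a vector of logits
sample_next_token <- function(logits, temperature = 1.0, top_k = NULL) {
  # Temperature below 1 sharpens the distribution, above 1 flattens it
  scaled <- logits / temperature
  # Top-k: keep only the k highest-scoring words (ties at the cutoff are kept too)
  if (!is.null(top_k) && top_k < length(scaled)) {
    cutoff <- sort(scaled, decreasing = TRUE)[top_k]
    scaled[scaled < cutoff] <- -Inf
  }
  # Softmax, with the maximum subtracted for numerical stability
  probs <- exp(scaled - max(scaled))
  probs <- probs / sum(probs)
  sample.int(length(probs), size = 1, prob = probs)
}

set.seed(123)
logits <- c(0.1, 2.5, -1.0, 1.7)   # made-up scores for a four-word vocabulary
greedy <- sample_next_token(logits, top_k = 1)
draws <- replicate(200, sample_next_token(logits, temperature = 0.8, top_k = 2))
print(table(draws))
```

With top_k = 1 the choice is always the most likely word, while larger values of top_k and temperature make the generated text more varied.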
Remember: we’ve just implemented the core technology behind the AI revolution. From Markov chains to attention mechanisms to transformers – you’ve mastered the journey from simple statistics to artificial intelligence!
The next time someone asks you “How does ChatGPT work?”, you can confidently say: “Let me show you…” and build one from scratch (or show this post 😉 )!
