For building the skills to make a transformer, I'd highly recommend Karpathy's YouTube channel. He hasn't gotten to transformers yet, as he's covering earlier models first, which is useful: knowing how to implement a neural network properly will affect your ability to implement a transformer. Yes, these are NLP models, but I think the soft rule of not looking at any NLP architectures is dumb. If the models don't contain the core insights of transformers/SOTA NLP architectures, then what's the issue?
To understand what a transformer is, I'd recommend this article. Also, I'd warn against writing large models in anything other than PyTorch: unless you know CUDA, that's a bad idea.
EDIT: This is a good post, and I'm glad you (and your girlfriend?) wrote it.
It is not always obvious whether your skills are sufficiently good to work for one of the various AI safety and alignment organizations. There are many options to calibrate and improve your skills including just applying to an org or talking with other people within the alignment community.
One additional option is to test your skills by working on projects that are closely related to or a building block of the work being done in alignment orgs. By now, there are multiple curricula out there, e.g. the one by Jacob Hilton or the one by Gabriel Mukobi.
One core building block of these curricula is to understand transformers in detail, and a common recommendation is to check whether you can build one from scratch. Thus, my girlfriend and I recently set ourselves the challenge of building various transformers from scratch in PyTorch. We think this was a useful exercise and want to present the challenge in more detail and share some tips and tricks. You can find our code here.
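To give a concrete sense of what "from scratch" means here, the heart of a transformer is scaled dot-product attention, which fits in a few lines of PyTorch. This is an illustrative sketch of ours (not code from the repo), and the tensor dimensions are arbitrary:

```python
import math
import torch

def attention(q, k, v, mask=None):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    # Similarity scores between every query and every key: (batch, seq, seq)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 get -inf, i.e. zero attention weight
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v  # weighted sum of values: (batch, seq, d_k)

# Toy example: batch of 2, sequence length 5, head dimension 64
q = k = v = torch.randn(2, 5, 64)
out = attention(q, k, v)
```

A multi-head layer is essentially several of these running in parallel on linear projections of the input, with the results concatenated and projected back.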
Building a transformer from scratch
The following is a suggestion on how to build a transformer from scratch and train it. There are, of course, many details we omit, but I think it covers the most important basics.
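For orientation, a transformer block is attention plus a feed-forward MLP, each wrapped in a residual connection and a layer norm. The following is a minimal pre-norm sketch of ours, using PyTorch's built-in attention module for brevity (in the actual challenge you would implement that part yourself); the hyperparameters are arbitrary:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-norm block: x + attn(norm(x)), then x + mlp(norm(x))."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),  # widen by the usual factor of 4
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]    # self-attention + residual
        x = x + self.mlp(self.norm2(x))  # feed-forward + residual
        return x

# Toy example: batch of 2, sequence length 10, model dimension 64
block = TransformerBlock()
y = block(torch.randn(2, 10, 64))
```

Stacking such blocks, adding token and positional embeddings at the bottom and an unembedding layer at the top, gives the full model.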
Goals
From the ground up we want to
Bonus goals
Soft rules
For this calibration challenge, we used the following rules. Note that these are “soft rules” and nobody is going to enforce them, but it’s in your interest to make some rules before you start.
We were
Things to look out for
Here are some suggestions on what to look out for during the project:
I think that the “does it feel right” indicators are more important than the exact timings. There can be lots of random sources of error during the coding or training of neural networks that can take some time to debug. If you felt very comfortable, this might be a sign that you should apply to a technical AI alignment job. If it felt pretty hard, this might be a sign that you should skill up for a bit and then apply.
The final product
In some cases, you might want to show the result of your work to someone else. I’d recommend creating a GitHub repository for the project and creating a Jupyter notebook or .py file for every major subpart. You can find our repo here. Don’t take our code as a benchmark to work towards; there might be errors, and we might have violated some basic guidelines of professional NLP coding due to our inexperience.
Problems we encountered
How to think about AI safety up-skilling projects
In my opinion, there are three important considerations.
Final words
I hope this is helpful. In case something is unclear, please let me know. In general, I’d be interested to see more “AI safety up-skilling challenges”, e.g. providing more detail to a subsection of Jacob’s or Gabriel’s post.