Introduction and Motivation: As part of the fifth iteration of the Arena AI safety research program, we undertook a capstone project. For this project, we ran a range of experiments on an encoder-decoder model, focusing on how information is stored in the bottleneck embedding and how modifying the input text...
TL;DR: We probed SONAR text autoencoders to see if they implicitly learn "correctness" across domains. Turns out they do, but with a clear hierarchy: code validity (96% accuracy) > grammaticality (93%, cross-lingual) > basic arithmetic (76%, addition only) > chess syntax (weak) > chess semantics (absent). The hierarchy suggests correctness...
TL;DR: We showed how Hebbian learning with weight decay could enable a) feedforward circuits (one-to-many) to extract the first principal component of a barrage of inputs and b) recurrent circuits to amplify signals that are present across multiple input streams and suppress signals that are likely spurious. Short recap: In...
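As a rough illustration of the feedforward claim in the excerpt above, here is a minimal sketch using Oja's rule, a standard textbook form of Hebbian learning with a decay term. This is an assumption on our part: the excerpt does not show the post's exact update rule, and the function name `oja_first_pc`, the learning rate, and the synthetic data are all hypothetical choices made for the demo.

```python
import numpy as np


def oja_first_pc(X, lr=0.005, epochs=50, seed=0):
    """Estimate the first principal direction of zero-mean data X with Oja's rule.

    Assumed update (not taken from the post): w += lr * y * (x - y * w),
    where y = w @ x. The term lr * y * x is the Hebbian part and
    -lr * y**2 * w acts as an activity-dependent weight decay.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(epochs):
        for x in rng.permutation(X):  # shuffle the samples each epoch
            y = w @ x
            w += lr * y * (x - y * w)
    return w


# Synthetic 2D data whose variance is dominated by the direction [1, 1] / sqrt(2).
rng = np.random.default_rng(1)
direction = np.array([1.0, 1.0]) / np.sqrt(2)
latent = rng.normal(scale=3.0, size=(500, 1))
noise = rng.normal(scale=0.5, size=(500, 2))
X = latent * direction + noise
X -= X.mean(axis=0)

w = oja_first_pc(X)

# Reference answer: top eigenvector of the sample covariance matrix.
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
pc1 = eigvecs[:, -1]

# Absolute cosine similarity, since the sign of a principal direction is arbitrary.
alignment = abs(w @ pc1) / np.linalg.norm(w)
print(f"alignment with first principal component: {alignment:.3f}")
```

An alignment close to 1 means the learned weight vector points along the first principal component. The sketch covers only the feedforward case; the recurrent case described in the excerpt is not modeled here.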
With this sequence, we (Sam + Jan) want to provide a principled derivation of the natural abstractions hypothesis (which we will introduce in depth in later posts) by motivating it with insights from computational neuroscience. Goals for this sequence are: * show why we expect natural abstractions to emerge in biological...
TL;DR: An accountability buddy is someone to check in with from time to time to give you social motivation to achieve your goals. There are many additional benefits from this process such as planning together and getting feedback on your progress. I think especially EAs in remote areas or those...
Recently, I wrote an article together with Jan Kirchner on "brain enthusiasts" in AI Safety (if you find work on neuroscience/cognitive science x AI (Safety) interesting, let me know 🙂). When crafting and researching our arguments, we often came across one argument that left us confused and uncertain about this...
TL;DR: If you're a student of cognitive science or neuroscience and are wondering whether it can make sense to work in AI Safety, this guide is for you! (Spoiler alert: the answer is "mostly yes"). Motivation: AI Safety is a rapidly growing field of research that is singular in its...