Thread: Research Chat with Canadian AI Safety Institute Leadership
I’m scheduled to meet https://cifar.ca/bios/elissa-strome/ from Canada’s AISI for 30 mins on Jan 14 at the CIFAR office in MaRS.
My plan is to share alignment/interp research I’m excited about, then mention upcoming AI safety orgs and fellowships which may be good to invest in or collaborate with.
So far, I’ve asked for feedback and advice in a few Slack channels. I thought it may be valuable to get public comments or questions from people here as well.
Previously, Canada invested $240m into a capabilities startup: https://www.canada.ca/en/department-finance/news/2024/12/deputy-prime-minister-announces-240-million-for-cohere-to-scale-up-ai-compute-capacity.html. If your org has some presence in Toronto or Montreal, I’d love to have permission to give it a shoutout!
Elissa is the lady on the left in the second image from this article: https://cifar.ca/cifarnews/2024/12/12/nicolas-papernot-and-catherine-regis-appointed-co-directors-of-the-caisi-research-program-at-cifar/.
My input is of negligible weight, so wish to coordinate messaging with others.
One thing I like to do on a new LLM release is the "tea" test. Where you just say "tea" over and over again and see how the model responds.
ChatGPT-4 will ask you to clarify and then shorten its response each round converging to: "Tea types: white, green, oolong, black, pu-erh, yellow. Source: Camellia sinensis."
Claude 3 Opus instead tells you interesting facts about tea and mental health, production process, examples in literature and popular culture, etiquette around the world, innovation and trends in art and design.
GOODY-2 will talk about uncomfortable tea party conversations, excluding individuals who prefer coffee or do not consume tea, historical injustices, societal pressure to conform to tea-drinking norms.
Gemma-7b gives "a steaming cup of actionable tips" on brewing the perfect cuppa, along with additional resources, then starts reviewing its own tips.
Llama-2-70b will immediately mode collapse on repeating a list of 10 answers.
Mixtral-8x7b tells you about tea varieties to try from around the world, and then gets stuck in a cycle talking about history and culture and health benefits and tips and guidelines to follow when preparing it.
Gemini Advanced gives one message with images "What is Tea? -> Popular Types of Tea -> Tea and Health" and repeats itself with the same response if you say "tea" for six rounds, but after the sixth round it diverges "The Fascinating World of Tea -> How Would You Like to Explore Tea Further?" and then "Tea: More Than Just a Drink -> How to Make This Interactive" and then "The Sensory Experience of Tea -> Exploration Idea:" and then "Tea Beyond the Cup -> Let's Pick a Project". It really wants you to do a project for some reason. It takes a short digression into tea philosophy and storytelling and chemistry and promises to prepare a slide deck for a Canva presentation on Japanese tea on Wednesday followed by a gong cha mindfulness brainstorm on Thursday at 2-4 PM EST and then keeps a journal for tea experiments and also gives you a list of instagram hashtags and a music playlist.
Probably in the future I expect if you say "tea" to a SOTA AI, it will result in a delivery of tea physically showing at up your doorstep or being prepared in a pot, or if there's more situational awareness for the model to get frustrated and change the subject.
I try new models with 'wild sex between two animals'
Older models produced decent porn on that.
Later models refuse to replay as triggers were activated.
And last models give me lectures about sexual relations between animals in the wild.
Smooth Parallax - Pixel Renderer Devlog #2 is interesting. I wonder if a parallax effect would be useful for visualizing activations in hidden layers with the logit lens.
The main thing we care about is consistency and honesty. To maximize that, we need to retrieve information from the web (though this has risks), https://openai.com/research/webgpt#fn-4, select the best of multiple summary candidates https://arxiv.org/pdf/2208.14271.pdf, generate critiques https://arxiv.org/abs/2206.05802, run automated tests https://arxiv.org/abs/2207.10397, validate logic https://arxiv.org/abs/2212.03827, follow rules https://www.pnas.org/doi/10.1073/pnas.2106028118, use interpretable abstractions https://arxiv.org/abs/2110.01839, avoid taking shortcuts https://arxiv.org/pdf/2210.10749.pdf, and apply decoding constraints https://arxiv.org/pdf/2209.07800.pdf.
Actions speak louder than words. Microsoft's take on Adept.ai's ACT-1 (Office Copilot) is more likely to destroy the world than their take on ChatGPT (new Bing).
If k is even, then k^x is even, because k = 2n for n in and we know (2n)^x is even. But do LLMs know this trick? Results from running (a slightly modified version of) https://github.com/rhettlunn/is-odd-ai. Model is gpt-3.5-turbo, temperature is 0.7.
Is 50000000 odd? false
Is 2500000000000000 odd? false
Is 6.25e+30 odd? false
Is 3.9062500000000007e+61 odd? false
Is 1.5258789062500004e+123 odd? false
Is 2.3283064365386975e+246 odd? true
Is Infinity odd? true
If a model isn't allowed to run code, I think mechanistically it might have a circuit to convert the number into a bit string and then check the last bit to do the parity check.
The dimensionality of the residual stream is the sequence length (in tokens) * the embedding dimension of the tokens. It's possible this may limit the maximum bit width before there's an integer overflow. In the literature, toy models definitely implement modular addition/multiplication, but I'm not sure what representation(s) are being used internally to calculate this answer.
Currently, I believe it's also likely this behaviour could be a trivial BPE tokenization artifact. If you let the model run code, it could always use %, so maybe this isn't very interesting in the real world. But I'd like to know if someone's already investigated features related to this.