Maybe a useful way to get feedback on how good you are at doing this would be to try making predictions based on your experience with language models:
Yeah I think things like this are reasonable. I think that these are maybe too hard and high-level for a lot of the things I care about--I'm really interested in questions like "how much less reliable is the model about repeating names when the names are 100 tokens in the past instead of 50", which are much simpler and lower level.
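A minimal sketch of how one might probe that name-repetition question; the filler text, the distances, and the `complete` helper are illustrative placeholders for whatever completion API you actually use:

```python
# Sketch: does the model repeat a name less reliably when the name appeared
# further back in the prompt? Everything here is illustrative.
import random

NAMES = ["Alice Harmon", "Priya Nair", "Tomás Rivera"]
FILLER = "The weather that day was unremarkable. "  # padding to push the name back

def make_prompt(name: str, filler_sentences: int) -> str:
    """Mention a name, pad with filler, then ask for it back."""
    return (
        f"My colleague {name} joined the meeting first. "
        + FILLER * filler_sentences
        + "The first colleague to join the meeting was named"
    )

def complete(prompt: str) -> str:
    """Placeholder for whatever completion call you use."""
    raise NotImplementedError

def recall_rate(filler_sentences: int, trials: int = 20) -> float:
    hits = 0
    for _ in range(trials):
        name = random.choice(NAMES)
        if name.split()[0] in complete(make_prompt(name, filler_sentences)):
            hits += 1
    return hits / trials

for n in (0, 5, 10, 20):  # roughly 0 to ~200 extra tokens of filler
    print(n, recall_rate(n))
```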
There's a (set of) experiments I'd be keen to see done in this vein, which I think might produce interesting insights.
How well do capabilities generalise across languages?
Stuart Armstrong recently posted this example of GPT-3 failing to generalise to reversed text. The most natural interpretation, at least in my mind (and one pointed at by a couple of comments), is that there's just been very little training data which contains things like:
m'I gnitirw ffuts ni esrever tub siht si yllautca ytterp erar
(I'm writing stuff in reverse but this is actually pretty rare)
Especially with translations underneath. In particular, there hasn't been enough data to relate the ~token 'ffuts' to 'stuff'. These are just two different things which have been encoded somewhere; one of them tends to appear near English words, the other tends to appear near other rare things like 'ekil'.
It seems that measuring how much of a capability hit language models take when trying to work in 'backwards writing', as well as in other transformed systems like pig Latin or simple ciphers, and how much fine-tuning it would take to restore the capabilities they have in English, could provide some interesting insights into model 'theory of mind'.
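A minimal sketch of how one might construct matched normal and word-reversed prompts for such an experiment; the task examples and helper names are illustrative:

```python
# Sketch: build matched normal / word-reversed prompts for a simple task,
# so you can compare how much capability the model loses on the reversed
# version. The QA pairs are illustrative stand-ins for a real task.

def reverse_words(text: str) -> str:
    """Reverse the letters of each word, keeping word order ('stuff' -> 'ffuts')."""
    return " ".join(word[::-1] for word in text.split())

QA_PAIRS = [
    ("The capital of France is", "Paris"),
    ("Two plus two equals", "four"),
    ("The opposite of hot is", "cold"),
]

for prompt, answer in QA_PAIRS:
    # Feed both versions to the model (with a few-shot header explaining the
    # encoding, if you like) and compare accuracy or the logprob of the answer.
    print(prompt, "->", answer)
    print(reverse_words(prompt), "->", reverse_words(answer))
```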
The central thing I'm interested in here is trying to identify differences between cases where an LLM is modelling a situation in some sort of abstract way and then translating from that representation to language output, and cases where the model is 'only' doing language output.
Models which have some sort of world model, and use that world model to output language, should find it much easier to capably generalise from one situation to another. They also seem meaningfully closer to agentic reasoners. There's also an interesting question about how different models look when fine-tuned here. If it is the case that there's a ~separate 'world model' and 'language model', training the model to perform well in a different language should, if done well, only change the second. This may even shed light on which parts of the model are doing what, though again I just don't know if we have any ways of representing the internals which would allow us to catch this yet.
Ideas for specific experiments:
If anyone does test some of these, I'd be interested to hear the results!
I had a lot of fun testing PaLM prompts on GPT-3, and I'd love to further explore the "psychology" of AIs from a black-box angle. I currently have a job, and not a lot of free time, but I'd be willing to spend a few hours here and there on this, if that would be beneficial.
THANK YOU
ahem
Hi!
I'm a bit late to the party. I postponed writing this because it felt so important that I wanted to reply correctly. I sincerely hope that my enthusiasm will not be mistaken for egotism. Forgive my Chutzpah!
Let me introduce myself. After a background in hard sciences, I am currently a medical student in a European country (hint: the president's name sounds like a pastry, and he loves nuclear energy). I am confident that I will become a psychiatrist (i.e. I am still a few years away from the final exam where you choose the specialty, but it's pretty easy to reach), but my love for computer science is burning as strong as ever. Hence, I have a strong motivation toward the field of computational psychiatry. I am particularly captivated by AI and think I am a pretty competent coder.
When people ask me what I want to do, I have several possible answers: understanding psychiatric pathologies and helping patients (this also has the benefit of helping me get along with my medical peers), understanding consciousness (this is my existential goal), and playing a role in creating artificial consciousness (that's my personal motivation). But I am of course interested in all things AI-related, including alignment research.
Hence, it is with strong enthusiasm that I read the term "AI psychologist", knowing that I have introduced myself several times as a "wannabe AI psychiatrist". Those passions of mine are intertwined, as I'm convinced that (as Feynman put it) "If you want to understand the mind, you have to build it".
You said :
(And it wouldn’t shock me if “AI psychologist” turns out to be an economically important occupation in the future, and if you got a notable advantage from having a big head start on it.) I think this is especially likely to be a good fit for analytically strong people who love thinking about language and are interested in AI but don’t love math or computer science.
I recognize myself in this paragraph, although I do love math and computer science.
Having to juggle between medical school and programming, I don't have the brains to be as competent and experienced as I'd like in ML, but I think that interpretability research is a sweet spot where my transdisciplinary skills would be useful. Btw, if anyone has specific courses, books, or material on interpretability, I would be delighted!
I am writing this to signal that this kind of person exists. Unfortunately, I am still about 5 to 10 years away from completely finishing my (currently planned) full curriculum (this includes medical and computer science degrees as well as PhDs), but I hope there will still be low-hanging fruit by then :)
Btw, I am also a LW meetup organizer in my country. If you ever come to Europe in the coming years, we could definitely have a chat. Otherwise, don't hesitate to reach out to me, even/especially years from now, as I'm still in the learning phase.
Note that I subscribed to your comments as well as to the comments on this page; this way I can see the advances you publish in this field. I will also take a look every month at the news section of the Redwood Research website.
Sincerely
Here's a fun paper I wrote along these lines. I took an old whitepaper of McCarthy's from 1976 where he introduces the idea of natural language understanding and proposes a set of questions about a news article that such a system should be able to answer. I asked the questions to GPT-3, looked at what it got right and wrong, and guessed at why.
What Can a Generative Language Model Answer About a Passage?
It feels to me like “have humans try to get to know the AIs really well by observing their behaviors, so that they’re able to come up with inputs where the AIs will be tempted to do bad things, so that we can do adversarial training” is probably worth including in the smorgasbord of techniques we use to try to prevent our AIs from being deceptive
Maybe I missed something here, but how is this supposed to help with deception? I thought the whole reason deceptive alignment is really hard to solve is that you can't tell if the AI's being deceptive via its behavior.
Curated. I've heard a few offhand comments about this type of research work in the past few months, but wasn't quite sure how seriously to take it.
I like this writeup for spelling out details of why black-box investigators might be useful, what skills the work requires, and how you might go about it.
I expect this sort of skillset to have major limitations, but I think I agree with the stated claims that it's a useful skillset to have in conjunction with other techniques.
One general piece of advice is that it seems like it might be useful to have an interface that shows you multiple samples for each prompt (the OpenAI playground just gives you one sample; if you use temperature > 0, that sample could be either lucky or unlucky).
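A minimal sketch of such an interface, assuming the older openai Python package's Completion endpoint; the model name and sampling settings are illustrative:

```python
# Sketch: print several samples per prompt side by side, instead of the single
# completion the playground shows. Assumes the older openai Completion
# endpoint; adapt to whatever client you actually use.
import openai

def show_samples(prompt: str, n: int = 5, temperature: float = 0.8) -> None:
    resp = openai.Completion.create(
        model="text-davinci-002",  # illustrative model name
        prompt=prompt,
        n=n,
        temperature=temperature,
        max_tokens=64,
    )
    for i, choice in enumerate(resp["choices"]):
        print(f"--- sample {i + 1} ---")
        print(choice["text"].strip())

show_samples("Q: Why did the chicken cross the road?\nA:")
```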
In addition to seeing more AI behavioral psychology work, I would be excited about seeing more AI developmental psychology -- i.e. studying how varying properties of training or architecture affect AI behavior. Shard theory is an example of this.
I've written a bit about the motivations for AI developmental psychology here.
PaLM recently got some attention for explaining jokes. Which is interesting. About a year ago I quizzed GPT-3 [through an intermediary, Phil Mohun] about a Jerry Seinfeld routine, the first one he ever performed on TV. It was about the South Bronx.
I thought GPT-3’s response was fascinating – I’ve put a lightly edited version of the conversation on my blog. As the conversation went on, GPT-3 came up with some interesting answers.
I think quizzing a language model about jokes could be very revealing. Why? Because they force it into ‘strange ways’ that can reveal what’s going on under the hood.
Do you suspect that black-box knowledge will be transferable between different models, or that the findings will be idiosyncratic to each system?
I suspect that some knowledge transfers. For example, I suspect that increasingly large LMs learn features of language roughly in order of their importance for predicting English, and so I'd expect that LMs that get similar language modeling losses usually know roughly the same features of English. (You could just run two LMs on the same text and see their logprobs on the correct next token for every token, and then make a scatter plot; presumably there will be a bunch of correlation, but you might notice patterns in the things that one LM did much better than the other.)
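A minimal sketch of that scatter-plot idea, assuming the older openai Completion endpoint (where echo=True with max_tokens=0 returns per-token logprobs of the prompt) and two base models that share a tokenizer; the model names and text are illustrative:

```python
# Sketch: per-token logprobs from two models on the same text, plotted against
# each other. Assumes the older openai Completion endpoint and models that
# share a tokenizer, so the token sequences line up.
import openai
import matplotlib.pyplot as plt

TEXT = "The quick brown fox jumps over the lazy dog. " * 5  # any shared text

def prompt_token_logprobs(model: str, text: str) -> list:
    resp = openai.Completion.create(
        model=model, prompt=text, max_tokens=0, echo=True, logprobs=0
    )
    # The first token has no conditional logprob, so drop the leading None.
    return resp["choices"][0]["logprobs"]["token_logprobs"][1:]

xs = prompt_token_logprobs("ada", TEXT)      # illustrative smaller model
ys = prompt_token_logprobs("davinci", TEXT)  # illustrative larger model
plt.scatter(xs, ys, s=8)
plt.xlabel("ada logprob of correct token")
plt.ylabel("davinci logprob of correct token")
plt.show()
```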
And the methodology for playing with LMs probably transfers.
But I generally have no idea here, and it seems really useful to know more about this.
There is already a sizable amount of research done in this direction, the so-called BERTology. I believe the methodology that is being developed is useful, but knowing about specific models is probably superfluous. In a few months or years we will have new models, and anything model-specific that you know will not generalize.
I’ve been thinking about ordinary arithmetic computation in this context. We know that models have trouble with it. The issue interests me because arithmetic calculation has well-understood procedures. We know how people do it. And by that I mean that there’s nothing important about the process that’s hidden, unlike our use of ordinary language. The mechanisms of both sentence-level grammar and discourse structure are unconscious.
It's pretty clear to me that arithmetic requires episodic structure, to introduce a term from old symbolic-systems AI and computational linguistics. That’s obvious from the fact that we don’t teach it to children until grammar school, which is roughly when episodic-level cognition kicks in (see the paper Hays and I did, Principles and Development of Natural Intelligence).
Arithmetic is not like ordinary language, which comes to us naturally without much specific training. Fluency in arithmetic requires years of drill. First the child must learn to count; that gives numbers meaning. Once that is well in hand, children are drilled in arithmetic tables for the elementary operations, and so forth. Once this is going smoothly, one learns the procedures for multi-digit addition and subtraction, multiple-operand addition, and then multiplication and division. Multi-digit division is the most difficult because it requires guessing, which is then checked by actual calculation (multiplication followed by subtraction).
Why do such intellectually simple procedures require so much drill? Because each individual step must be correct. You can’t just go straight ahead. One mistake anywhere, and the whole calculation is thrown off.
Whatever a model is doing in inference mode, I doubt it’s doing anything like what humans do. Where would it pick that up on the web?
I don’t know what’s going on inside a model in inference mode, but I’d guess it’s something like this: the inference engine ‘consumes’ a prompt, which moves it to some position in its state space.
And so it moves through its state space in a single unbroken traversal. You can’t do arithmetic that way. You have to keep track of partial results and stop to retrieve them so you can integrate them into the ongoing flow of the calculation.
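To make the ‘partial results’ point concrete, here is a small sketch of grade-school column addition; the explicit carry variable is exactly the kind of intermediate state that a single unbroken traversal has no obvious place to keep:

```python
# Grade-school multi-digit addition, made explicit: each column produces a
# partial result (a digit) plus a carry that must be stored and reused in the
# next step. One wrong step anywhere corrupts the whole answer.

def column_add(a: str, b: str) -> str:
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    carry = 0
    digits = []
    for da, db in zip(reversed(a), reversed(b)):  # work right to left
        total = int(da) + int(db) + carry          # current column plus carry
        digits.append(str(total % 10))             # partial result for this column
        carry = total // 10                        # state carried to the next column
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))

print(column_add("478", "396"))  # 874
```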
So now the question is: what other kinds of tasks require the computational style that arithmetic does? Perhaps generating a long string of coherent prose does.
Let me think about that for awhile.
Interpretability research is sometimes described as neuroscience for ML models. Neuroscience is one approach to understanding how human brains work. But empirical psychology research is another approach. I think more people should engage in the analogous activity for language models: trying to figure out how they work just by looking at their behavior, rather than trying to understand their internals.
I think that getting really good at this might be a weird but good plan for learning some skills that might turn out to be really valuable for alignment research. (And it wouldn’t shock me if “AI psychologist” turns out to be an economically important occupation in the future, and if you got a notable advantage from having a big head start on it.) I think this is especially likely to be a good fit for analytically strong people who love thinking about language and are interested in AI but don’t love math or computer science.
I'd probably fund people to spend at least a few months on this; email me if you want to talk about it.
Some main activities I’d do if I was a black-box LM investigator are:
The skills you’d gain seem like they have a few different applications to alignment:
My guess is that this work would go slightly better if you had access to someone who was willing to write you some simple code tools for interacting with models, rather than just using the OpenAI playground. If you start doing work like this and want tools, get in touch with me and maybe someone from Redwood Research will build you some of them.