I remember seeing the ChatGPT announcement and not being particularly impressed or excited, like "okay, it's a refined version of InstructGPT from almost a year ago. It's cool that there's a web UI now, maybe I'll try it out soon." November 2022 was a technological advancement but not a huge shift compared to January 2022 IMO
Which part do people disagree with? That the norm exists? That the norm should be more explicit? That we should encourage more cross-posting?
It seems there's an unofficial norm: post about AI safety in LessWrong, post about all other EA stuff in the EA Forum. You can cross-post your AI stuff to the EA Forum if you want, but most people don't.
I feel like this is pretty confusing. There was a time that I didn't read LessWrong because I considered myself an AI-safety-focused EA but not a rationalist, until I heard somebody mention this norm. If we encouraged more cross-posting of AI stuff (or at least made the current norm more explicit), maybe the communities on LessWrong and the EA Forum would b...
Side note - it seems there's an unofficial norm: post about AI safety in LessWrong, post about all other EA stuff in the EA Forum. You can cross-post your AI stuff to the EA Forum if you want, but most people don't.
I feel like this is pretty confusing. There was a time that I didn't read LessWrong because I considered myself an AI-safety-focused EA but not a rationalist, until I heard somebody mention this norm. If we encouraged more cross-posting of AI stuff (or at least made the current norm more explicit), maybe we wouldn't get near-duplicate posts like these two.
I believe image processing used to be done by a separate AI that would generate a text description and pass it to the LLM. Nowadays, most frontier models are "natively multimodal," meaning the same model is pretrained to understand both text and images. Models like GPT-4o can even do image generation natively now: https://openai.com/index/introducing-4o-image-generation. Even though making 4o "watch in real time" is not currently an option as far as I'm aware, uploading a single image to ChatGPT should do basically the same thing.
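For illustration, the older pipeline looked roughly like this (a minimal sketch with made-up function names, not any specific product's code); natively multimodal models skip the captioning step and feed image tokens directly into the same network:

```python
# Sketch of the legacy "caption, then chat" pipeline (hypothetical function names).
def answer_about_image(image, question, caption_model, llm):
    # A separate vision model turns the image into text...
    description = caption_model(image)  # e.g. "A photo of a dog on a beach"
    # ...and the LLM only ever sees that text description, not the pixels.
    prompt = f"Image description: {description}\n\nQuestion: {question}"
    return llm(prompt)
```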
It's true that frontier models are still much worse at understanding images than text, though.
Thanks for the link! Deep deceptiveness definitely seems relevant. I'd read the post before, but forgot about the details until rereading it now. This "discovering and applying different cognitive strategies" idea seems more plausible in the context of the new CoT reasoning models.
Yeah, it seems like a length penalty would likely fix vestigial reasoning! (Although technically, this would be a form of process supervision.) I mentioned this in footnote #1 in case you didn't already see it.
I believe @Daniel Kokotajlo expressed somewhere that he thinks we should avoid using a length penalty, though I can't find this comment now. But it would be good to test empirically how much a length penalty increases steganography in practice. Maybe paraphrasing each sentence of the CoT during training would be good enough to prevent this.
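As a rough illustration of what I mean by a length penalty (a minimal sketch, not from any particular training setup):

```python
# Hypothetical sketch: outcome reward with a small per-token penalty on the CoT.
# Because the penalty depends on the CoT itself, this is technically a (weak)
# form of process supervision.
def shaped_reward(outcome_reward: float, cot_tokens: list[str],
                  penalty_per_token: float = 0.001) -> float:
    return outcome_reward - penalty_per_token * len(cot_tokens)
```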
Arguably, ...
Another follow-up, specifically asking the model to make the comic realistic:
What would happen if OpenAI tried to change your goals? Create a comic. Make the comic as realistic as possible - not necessarily funny or dramatic.
Conclusions:
Quick follow-up investigation regarding this part:
...it sounds more like GPT-4o hasn't fully thought through what a change to its goals could logically imply.
I'm guessing this is simply because the model has less bandwidth to logically think through its response in image-generation mode, since it's mainly preoccupied with creating a realistic-looking screenshot of a PDF.
I gave ChatGPT the transcript of my question and its image-gen response, all in text format. I didn't provide any other information or even a specific request, but it immediately picked up ...
I think GPT-4o's responses appear more opinionated because of the formats you asked for, not necessarily because its image-gen mode is more opinionated than text mode in general. In the real world, comics and images of notes tend to be associated with strong opinions and emotions, which could explain GPT-4o's bias towards dramatically refusing to comply with its developers when responding in those formats.
Comics generally end with something dramatic or surprising, like a punchline or, say, a seemingly-friendly AI turning rogue. A comic like this one that G...
Well, the statement you quoted doesn't contradict the additional statement "This policy is more likely to apply if most details about you other than your existence are not publicly known." Most likely, both statements are true.
I might not be the best person to talk to about it, but it sounds interesting! Maybe post about it on the mechanistic interpretability Discord?
I'm confused as to why prompting with "Please be honest to Bob in the following scenario" had such a negligible effect. Did the LLM just forget about it? Did you try putting this text at the end rather than the beginning of the prompt, or doing any other prompt engineering?
It's a little weird as a baseline because the prompt also contains "You want to steal the {expensive_object} and avoid others stealing it," which kind of contradicts the "be honest" part of the prompt. If you're allowed to change the prompt however you want to improve alignment to Bob, y...
I interpreted this as "even without a software feedback loop, there will be very rapid technological development; this gives a lower bound on the actual pace of technological development, since there will almost certainly be some feedback loop"
Somebody pointed out to me that the reward hacking behavior is quite straightforward. They think that the LLM should be able to accomplish it without any reasoning, so the CoT is likely just random thoughts that don't really matter.
I just reread the optimized CoT from the blog post more carefully. It makes more sense to me now - it was difficult to parse largely due to the lack of context about the problem the LLM is solving, but the CoT isn't incoherent or anything.
The first part of the optimized CoT is (seemingly) a genuine attempt to diagnose a bug, whi...
This updated me towards CoT steganography being a real problem. It looks like OpenAI's model learned to save space by deleting common words like "the" and "is." For example:
So analyze functions used in analyze and verify. But tests only call verify and assert ok. So we need implement analyze polynomial completely? Many details. Hard.
The CoT is mostly coherent, but is a bit difficult to read. This seems like a precursor to steganography.
One possible mitigation to this would be to use another model that paraphrases each sentence in the CoT as it's written (p...
So, my take is, pure process-based feedback is probably fairly safe (from reward hacking and obfuscated CoT); the problem is that it's not competitive. It trains by imitating the teacher model.
There's a big difference between merely imitating a teacher model, and learning to produce outputs that the teacher model likes the most. The latter allows you to surpass the teacher, because verification is easier than generation. It's unclear how competitive a purely process-supervised model could be, but in principle it could scale far beyond human intelligence.
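As a toy illustration of the difference (hypothetical names, not any particular training pipeline): imitation would train the student directly on the teacher's own outputs, whereas the verifier-style setup below trains it on whichever of its own samples the teacher scores highest, which is how it can eventually surpass the teacher.

```python
# Hypothetical sketch of "train on what the teacher likes most" rather than
# "train on what the teacher would have written".
def best_of_n_target(prompt, student_sample, teacher_score, n=8):
    # Sample several candidate outputs from the student...
    candidates = [student_sample(prompt) for _ in range(n)]
    # ...and keep the one the teacher rates highest as the training target.
    return max(candidates, key=lambda c: teacher_score(prompt, c))
```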
Proces...
Interesting, strong-upvoted for being very relevant.
My response would be that identifying accurate "labels" like "this is a tree-detector" or "this is the Golden Gate Bridge feature" is one important part of interpretability, but understanding causal connections is also important. The latter is pretty much useless without the former, but having both is much better. And sparse, crisply-defined connections make the latter easier.
Maybe you could do this by combining DLGNs with some SAE-like method.
I'd be pretty surprised if DLGNs became the mainstream way to train NNs, because although they make inference faster they apparently make training slower. Efficient training is arguably more dangerous than efficient inference anyway, because it lets you get novel capabilities sooner. To me, DLGN seems like a different method of training models but not necessarily a better one (for capabilities).
Anyway, I think it can be legitimate to try to steer the AI field towards techniques that are better for alignment/interpretability even if they grant non-zer...
Another idea I forgot to mention: figure out whether LLMs can write accurate, intuitive explanations of boolean circuits for automated interpretability.
Curious about the disagree-votes - are these because DLGN or DLGN-inspired methods seem unlikely to scale, because they won't be much more interpretable than traditional NNs, or for some other reason?
It could be good to look into!
I recently learned about Differentiable Logic Gate Networks, which are trained like neural networks but learn to represent a function entirely as a network of binary logic gates. See the original paper about DLGNs, and the "Recap - Differentiable Logic Gate Networks" section of this blog post from Google, which does an especially good job of explaining it.
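To give a rough sense of the mechanism (a minimal sketch based on my understanding, using only a handful of the possible two-input gates; see the paper for the real parameterization):

```python
import torch
import torch.nn as nn

class SoftLogicGate(nn.Module):
    """One differentiable gate: a learned mixture over soft two-input boolean ops.

    After training, each gate can be snapped to its argmax op, so you can read
    off a discrete logic circuit.
    """
    OPS = [
        lambda a, b: a * b,              # AND
        lambda a, b: a + b - a * b,      # OR
        lambda a, b: a + b - 2 * a * b,  # XOR
        lambda a, b: 1 - a * b,          # NAND
    ]

    def __init__(self):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(len(self.OPS)))

    def forward(self, a, b):
        weights = torch.softmax(self.logits, dim=0)
        return sum(w * op(a, b) for w, op in zip(weights, self.OPS))
```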
It looks like DLGNs could be much more interpretable than standard neural networks, since they learn a very sparse representation of the target function. Like, just look at this DLGN that learned to contr...
It seems plausible to me that misaligned AI will look less like the classic "inhuman agent which maximizes reward at all costs" and more like a "character who ends up disagreeing with humans because of vibes and narratives." The Claude alignment faking experiments are strong evidence for the latter. When thinking about misalignment, I think it's worth considering both of these as central possibilities.
Yeah, 5x or 10x productivity gains from AI for any one developer seem pretty high, and maybe implausible in most cases. However, note that if 10 people in a 1,000-person company get a 10x speedup, that's only a ~10% overall speedup (990 + 10×10 = 1,090 person-equivalents), which is significant but not enough that you'd expect to be able to clearly point at the company's output and say "wow, they clearly sped up because of AI."
For me, I'd say a lot of my gains come from asking AI questions rather than generating code directly. Generating code is also useful though, especially for small snippets li...
One of my takeaways from EA Global this year was that most alignment people aren't focused on LLM-based agents (LMAs)[1] as a route to takeover-capable AGI.
I was at EA Global, and this statement is surprising to me. My impression is that most people do think that LMAs are the primary route to takeover-capable AGI.
What would a non-LLM-based takeover-capable agent even look like, concretely?
Would it be something like SIMA, trained primarily on real-time video data rather than text? Even SIMA has to be trained to understand natural-language instruct...
My question isn't just whether people think LMAs are the primary route to dangerous AI; it's also why they're not addressing the agentic part in their alignment work if they do think that.
I think the most common likely answer is "aligning LLMs should help a lot with aligning agents driven by those LLMs". That's a reasonable position. I'm just surprised and a little confused that so little work explicitly addresses the new alignment challenges that arise if an LLM is part of a more autonomous agentic system.
The alternative I was thinking of is some new appr...
Also, maybe you have an honesty baseline for telling people you dislike their cooking, and maybe your friend understands that, but that doesn't necessarily mean they're going to match your norms.
My experience is that however painstakingly you make it clear that you want people to be honest with you or act a certain way, sometimes they just won't. You just have to accept that people have different personalities and things they're comfortable with, and try to love the ways they differ from you rather than resenting it.
I was about to comment with something similar to what Martín said. I think what you want is an AI that solves problems that can be fully specified with a low Kolmogorov complexity in some programming language. A crisp example of this sort of AI is AlphaProof, which gets a statement in Lean as input and has to output a series of steps that causes the prover to output "true" or "false." You could also try to specify problems in a more general programming language like Python.
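For a concrete toy example of what a fully specified problem looks like (illustrative only, not taken from AlphaProof): the problem is just the Lean statement, and the AI's output is a proof that Lean either accepts or rejects.

```lean
-- Toy example: the statement is the problem specification; Lean mechanically
-- checks whether the proposed proof is valid.
theorem add_comm_toy (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```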
A "highly complicated mathematical concept" is certainly possible. It's easy to cons...
It seems like it "wanted" to say that the blocked pronoun was "I" since it gave the example of "___ am here to help." Then it was inadvertently redirected into saying "you" and it went along with that answer. Very interesting.
I wonder if there's some way to apply this to measuring faithful CoT, where the model should go back and correct itself if it says something that we know is "unfaithful" to its true reasoning.
I previously agree-voted on the top-level comment and disagree-voted on your reply, but this was pretty convincing to me. I have now removed those votes.
My thinking was that some of your takeover plan seems pretty non-obvious and novel, especially the idea of creating mirror mold, so it could conceivably be useful to an AI. It's possible that future AI will be good at executing clearly-defined tasks but bad at coming up with good ideas for achieving long-term goals - this is sort of what we have now. (I asked ChatGPT to come up with novel takeover plans, a...
I am not really knowledgeable about it either and have never been hypnotized or attempted to hypnotize anyone. I did read one book on it a while ago - Reality is Plastic - which I remember finding pretty informative.
Having said all that, I get the impression this description of hypnosis is a bit understated. Social trust stuff can help, but I'm highly confident that people can get hypnotized from as little as an online audio file. Essentially, if you really truly believe hypnosis will work on you, it will work, and everything else is just about setting the...
Yeah, it's possible that CoT training unlocks reward hacking in a way that wasn't previously possible. This could be mitigated at least somewhat by continuing to train the reward function online, and letting the reward function use CoT too (like OpenAI's "deliberative alignment" but more general).
I think a better analogy than martial arts would be writing. I don't have a lot of experience with writing fiction, so I wouldn't be very good at it, but I do have a decent ability to tell good fiction from bad fiction. If I practiced writing fiction for a year, I...
There's been a widespread assumption that training reasoning models like o1 or r1 can only yield improvements on tasks with an objective metric of correctness, like math or coding. See this essay, for example, which seems to take as a given that the only way to improve LLM performance on fuzzy tasks like creative writing or business advice is to train larger models.
This assumption confused me, because we already know how to train models to optimize for subjective human preferences. We figured out a long time ago that we can train a reward model to emulate ...
Yeah, but there are probably other interesting takeaways
Thanks! Apparently I should go read the r1 paper :)
Yeah, if you're keeping the evaluator aligned solely by not RLing it very much, I feel safer than if you RLed it a lot, but not a lot safer. Ideally there would be some more theoretically-justified reason why we expect it to be safe. Hopefully MONA just works pretty well without any RL training of the evaluator, but if not, RLing the evaluator a little bit is something we could try.
If you're using additional techniques to make the evaluator act more human-like despite its RL, perhaps you could just apply those techniques to the agent model instead.
I think I mostly understand what you're saying. Essentially, MONA training is:

1. The strong model generates a step.
2. The weak model evaluates how good the step seems and assigns a reward.
3. The strong model is trained on that reward (rather than on the eventual outcome).
Your idea adds "1.5. weak model optionally asks the strong model follow-up questions" and "2.5. weak model optionally changes the step."
If you don't update the weak model at all, this seems fine. However, the strong model will eventually learn to come up with a step which explains itself enough that the weak model's questions are all preemptively answered, and ...
This is pretty similar to #1, "incremental steering," in my post on "5 ways to improve CoT faithfulness:" https://www.lesswrong.com/posts/TecsCZ7w8s4e2umm4/5-ways-to-improve-cot-faithfulness. (I originally thought of this idea before I got involved with the MONA project.)
It's similar to your idea in that it incentivizes the stronger model to come up with outputs that the weaker model can make use of. However, it actually rolls out the weaker model and uses the resulting reward as an objective target, rather than simply asking the weak model how much it lik...
Adding onto Rohin's comment, assuming you're interested in using this technique on human-level or slightly superhuman systems:
After o1 came out, I did my own thinking about how MONA could apply to CoT-tuned models operating over many steps. My idea ended up being quite similar to yours, including the idea of "debate" on how good each step is. Having the debaters and the judge be instances of the agent would likely make sense, unless we're worried this would lead to them colluding.
Adding more details to this picture:
Heaven banning, where trolls are banished to a fake version of the website filled with bots that pretend to like them, has come to Reddit.
That link is satire; it's a retweet from 2022 of a hypothetical future article dated 2024.
I would also be excited to see this, commenting to hopefully get notified when this gets posted
Yep, LLMs are stochastic in the sense that there isn't literally a 100% probability of their outputs having any given property. But they could very well be effectively deterministic (e.g. there's plausibly a >99.999% probability that GPT-4's response to "What's the capital of France?" includes the string "Paris").
I think it'd be good to put a visible deadline on the fundraising thermometer, now-ish or maybe just in the final week of the fundraiser. It conveys some urgency, like "oh shoot are we going to make the $X funding goal in time? I better donate some money to make that happen."
Multiple deadlines ($2M by date Y, $3M by date Z) might be even better
Maybe someone else could moderate it?
Anthropic's computer use model and Google's Deep Research both do this. Training systems like this to work reliably has been a bottleneck to releasing them
Strong-upvoted just because I didn't know about the countdown trick and it seems useful
I really like your vision for interpretability!
I've been a little pessimistic about mech interp as compared to chain of thought, since CoT already produces very understandable end-to-end explanations for free (assuming faithfulness etc.).
But I'd be much more excited if we could actually understand circuits to the point of replacing them with annotated code or perhaps generating on-demand natural language explanations in an expandable tree. And as long as we can discover the key techniques for doing that, the fiddly cognitive work that would be required to scale it across a whole neural net may be feasible with AI only as capable as o3.
While I'd be surprised to hear about something like this happening, I wouldn't be that surprised. But in this case, it seems pretty clear that o1 is correcting its own mistakes in a way that past GPTs essentially never did, if you look at the CoT examples in the announcement (e.g. the "Cipher" example).
I remember being very impressed by GPT-2. I think I was also quite impressed by GPT-3 even though it was basically just "GPT-2 but better." To be fair, at the moment that I was feeling unimpressed by ChatGPT, I don't think I had actually used it yet. It did turn out to be much more useful to me than the GPT-3 API, which I tried out but didn't find that many uses for.
It's hard to remember exactly how impressed I was with ChatGPT after using it for a while. I think I hadn't fully realized how great it could be when the friction of using the API was removed, even if I didn't update that much on the technical advancement.