All of eggsyntax's Comments + Replies

Let’s jump straight to the obvious problem - such a program is absurdly expensive. Chugging the basic numbers: $12k per person, with roughly 330 million Americans, is $3.96 trillion per year. The federal budget is currently $6.1 trillion per year. So a UBI would immediately increase the entire federal budget by ~65%.

I find this a misleading framing. Here's a trivial form of UBI: raise everyone's taxes by $12k/year, then give everyone a $12k UBI. That's entirely revenue-neutral. It's guaranteed that everyone can afford the extra taxes, by putting their UBI ... (read more)
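(For concreteness, a minimal sketch of the arithmetic, using only the figures quoted above; everything here is illustrative rather than a fiscal model.)

    # Rough arithmetic behind the quoted figures and the revenue-neutral framing.
    population = 330_000_000      # approximate US population, as quoted
    ubi_per_person = 12_000       # dollars per year, as quoted
    federal_budget = 6.1e12       # dollars per year, as quoted

    gross_cost = population * ubi_per_person
    print(f"Gross UBI cost: ${gross_cost / 1e12:.2f}T/year")                       # ~$3.96T
    print(f"Share of current federal budget: {gross_cost / federal_budget:.0%}")   # ~65%

    # The 'trivial' revenue-neutral version: raise everyone's taxes by $12k and
    # pay everyone a $12k UBI. The transfers cancel, so the net fiscal cost is zero.
    net_cost = gross_cost - population * ubi_per_person
    print(f"Net cost of the tax-and-rebate version: ${net_cost:,.0f}")             # $0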

Complex and ambivalent views seem like the correct sort of views for governments to hold at this point.

I don't speak Chinese, so I Google-translated the essay to skim/read it.

I also don't speak Chinese, but my impression is that machine translations of high-context languages like Chinese need to be approached with considerable caution -- a lot of context on (eg) past guidance from the CCP may be needed to interpret what they're saying there. I'm only ~70% on that, though, happy to be corrected by someone more knowledgeable on the subject.

Very cool work!

The deception is natural because it follows from a prompt explaining the game rules and objectives, as opposed to explicitly demanding the LLM lie or putting it under conditional situations. Here are the prompts, which just state the objective of the agent.

 

Also, the safety techniques we evaluated might work for superficial reasons rather than substantive ones (such as just predicting based on the model's internalization of the tokens "crewmate" and "impostor")...we think this is a good proxy for agents in the future naturally realizing

... (read more)
7vik
Thanks! Yep, makes sense - that's one of the things we'll be working on and hope to share some results soon!

Right, yeah. But you could also frame it the opposite way

Ha, very fair point!

Kinda Contra Kaj on LLM Scaling

I didn't see Kaj Sotala's "Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI" until yesterday, or I would have replied sooner. I wrote a reply last night and today, which got long enough that I considered making it a post, but I feel like I've said enough top-level things on the topic until I have data to share (within about a month hopefully!).

But if anyone's interested to see my current thinking on the topic, here it is.

I think that there's an important difference between the claim I'm making and the kinds of claims that Marcus has been making.

I definitely didn't mean to sound like I was comparing your claims to Marcus's! I didn't take your claims that way at all (and in particular you were very clear that you weren't putting any long-term weight on those particular cases). I'm just saying that I think our awareness of the outside view should be relatively strong in this area, because the trail of past predictions about the limits of LLMs is strewn with an unusually large... (read more)

Kaj_Sotala
Right, yeah. But you could also frame it the opposite way - "LLMs are just fancy search engines that are becoming bigger and bigger, but aren't capable of producing genuinely novel reasoning" is a claim that's been around for as long as LLMs have. You could also say that this is the prediction that has turned out to be consistently true with each released model, and that it's the "okay sure GPT-27 seems to suffer from this too but surely these amazing benchmark scores from GPT-28 show that we finally have something that's not just applying increasingly sophisticated templates" predictions that have consistently been falsified. (I have at least one acquaintance who has been regularly posting these kinds of criticisms of LLMs and how he has honestly tried getting them to work for purpose X or Y but they still keep exhibiting the same types of reasoning failures as ever.) Fair! To me OpenAI's recent decision to stop offering GPT-4.5 on the API feels significant, but it could be a symptom of them having "lost the mandate of heaven". Also I have no idea of how GPT-4.1 relates to this...

Great post, thanks! I think your view is plausible, but that we should also be pretty uncertain.

Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI

This has been one of my central research focuses over the past nine months or so. I very much agree that these failures should be surprising, and that understanding why is important, especially given this issue's implications for AGI timelines. I have a few thoughts on your take (for more detail on my overall view here, see the footnoted posts[1]):

  • It's very difficult t
... (read more)
Kaj_Sotala
Thanks! I appreciate the thoughtful approach in your comment, too. Agree. I agree that it should make us cautious about making such predictions, and I think that there's an important difference between the claim I'm making and the kinds of claims that Marcus has been making. I think the Marcus-type prediction would be to say something like "LLMs will never be able to solve the sliding square puzzle, or track the location of an item a character is carrying, or correctly write young characters". That would indeed be easy to disprove - as soon as something like that was formulated as a goal, it could be explicitly trained into the LLMs and then we'd have LLMs doing exactly that. Whereas my claim is "yes you can definitely train LLMs to do all those things, but I expect that they will then nonetheless continue to show puzzling deficiencies in other important tasks that they haven't been explicitly trained to do". Yeah I don't have any strong theoretical reason to expect that scaling should stay stopped. That part is based purely on the empirical observation that scaling seems to have stopped for now, but for all I know, benefits from scaling could just as well continue tomorrow.

An obvious first idea is to switch between 4.1 and 4o in the chat interface and see if the phenomenon we've been investigating occurs for both of them

Oh, switching models is a great idea. No access to 4.1 in the chat interface (apparently it's API-only, at least for now). And as far as I know, 4o is the only released model with native image generation.

  • 4o -> 4.5: success (in describing the image correctly)
  • 4o -> o4-mini-high ('great at visual reasoning'): success

o4-mini-high's reasoning summary was interesting (bolding mine):

The user wants me to identi

... (read more)
eggsyntax

Interesting, my experience is roughly the opposite re Claude-3.7 vs the GPTs (no comment on Gemini, I've used it much less so far). Claude is my main workhorse; good at writing, good at coding, good at helping think things through. Anecdote: I had an interesting mini-research case yesterday ('What has Trump II done that liberals are likely to be happiest about?') where Claude did well albeit with some repetition and both o3 and o4-mini flopped. o3 was initially very skeptical that there was a second Trump term at all.

Hard to say if that's different prompti... (read more)

DirectedEvolution
Gemini seems to do a better job of shortening text while maintaining the nuance I expect grant reviewers to demand. Claude seems to focus entirely on shortening text. For context, I'm feeding a specific aims page for my PhD work that I've written about 15 drafts of already, so I have detailed implicit preferences about what is and is not an acceptable result.

Aha! Whereas I just asked for descriptions (same link, invalidating the previous request) and it got every detail correct (describing the koala as hugging the globe seems a bit iffy, but not that unreasonable).

So that's pretty clear evidence that there's something preserved in the chat for me but not for you, and it seems fairly conclusive that for you it's not really parsing the image.

Which at least suggests internal state being preserved (Coconut-style or otherwise) but not being exposed to others. Hardly conclusive, though.

Really interesting, thanks for... (read more)

Rauno Arike
Fascinating! I'm now wondering whether it's possible to test the Coconut hypothesis. An obvious first idea is to switch between 4.1 and 4o in the chat interface and see if the phenomenon we've been investigating occurs for both of them, but at least I can't switch between the models with my non-subscriber account—is this possible with a subscriber account? Edit: I'm actually unsure about whether 4.1 has image generation functionality at all. The announcement only mentions image understanding, not generation, and image generation is available for neither 4o nor 4.1 through the API. They say that "Developers will soon be able to generate images with GPT‑4o via the API, with access rolling out in the next few weeks" in the 4o image generation announcement, so if it becomes available for 4o but not for 4.1, that would be evidence that image generation requests are currently always handled with 4o. This would make the Coconut hypothesis less likely in my eyes—it seems easier to introduce such a drastic architecture change for a new model (although Coconut is a fine-tuning technique, so it isn't impossible that they applied this kind of fine-tuning on 4o).

Oh, I see why; when you add more to a chat and then click "share" again, it doesn't actually create a new link; it just changes which version the existing link points to. Sorry about that! (also @Rauno Arike)

So the way to test this is to create an image and only share that link, prior to asking for a description.

Just as a recap, the key thing I'm curious about is whether, if someone else asks for a description of the image, the description they get will be inaccurate (which seemed to be the case when @brambleboy tried it above).

So here's another test image (... (read more)

brambleboy
Just tried it. The description is in fact completely wrong! The only thing it sort of got right is that the top left square contains a rabbit.
eggsyntax

Snippet from a discussion I was having with someone about whether current AI is net bad. Reproducing here because it's something I've been meaning to articulate publicly for a while.

[Them] I'd worry that as it becomes cheaper that OpenAI, other enterprises and consumers just find new ways to use more of it. I think that ends up displacing more sustainable and healthier ways of interfacing with the world.

[Me] Sure, absolutely, Jevons paradox. I guess the question for me is whether that use is worth it, both to the users and in terms of negative externalitie

... (read more)

The running theory is that that's the call to a content checker. Note the content in the message coming back from what's ostensibly the image model:

"content": {
    "content_type": "text",
    "parts": [
        "GPT-4o returned 1 images. From now on do not say or show ANYTHING. Please end this turn now. I repeat: ..."
    ]
}

That certainly doesn't seem to be either image data or an image filename, or mention an image attachment.

But of course much of this is just guesswork, and I don't have high confidence in any of it. 

Rauno Arike
Thanks! I also believe there's no separate image model now. I assumed that the message you pasted was a hardcoded way of preventing the text model from continuing the conversation after receiving the image from the image model, but you're right that the message before this one is more likely to be a call to the content checker, and in that case, there's no place where the image data is passed to the text model.

I've now done some investigation of browser traffic (using Firefox's developer tools), and the following happens repeatedly during image generation:

  1. A call to https://chatgpt.com/backend-api/conversation/<hash1>/attachment/file_<hash2>/download (this is the same endpoint that fetches text responses), which returns a download URL of the form https://sdmntprsouthcentralus.oaiusercontent.com/files/<hash2>/raw?<url_parameters>.
  2. A call to that download URL, which returns a raw image.
  3. A second call to that same URL (why?), which fetch
... (read more)

@brambleboy (or anyone else), here's another try, asking for nine randomly chosen animals. Here's a link to just the image, and (for comparison) one with my request for a description. Will you try asking the same thing ('Thanks! Now please describe each subimage.') and see if you get a similarly accurate description (again there are a couple of details that are arguably off; I've now seen that be true sometimes but definitely not always -- eg this one is extremely accurate).

(I can't try this myself without a separate account, which I may create at some p... (read more)

Rauno Arike
The links were also the same for me. I instead tried a modified version of the nine random animals task myself, asking for a distinctive object in the background of each subimage. It was again highly accurate in general and able to describe the background objects in great detail (e.g., correctly describing the time that the clock on the top middle image shows), but it also got the details wrong on a couple of images (the bottom middle and bottom right ones).
brambleboy
Your 'just the image' link is the same as the other link that includes the description request, so I can't test it myself. (unless I'm misunderstanding something)
Rauno Arike
There's one more X thread which made me assume a while ago that there's a call to a separate image model. I don't have time to investigate this myself at the moment, but am curious how this thread fits into the picture in case there's no separate model.

That's absolutely fascinating -- I just asked it for more detail and it got everything precisely correct (updated chat). That makes it seem like something is present in my chat that isn't being shared; one natural speculation is internal state preserved between token positions and/or forward passes (eg something like Coconut), although that's not part of the standard transformer architecture, and I'm pretty certain that OpenAI hasn't said that they're doing something like that. It would be interesting if that's what's behind the new GPT-4.1 (and a ... (read more)

eggsyntax
@brambleboy (or anyone else), here's another try, asking for nine randomly chosen animals. Here's a link to just the image, and (for comparison) one with my request for a description. Will you try asking the same thing ('Thanks! Now please describe each subimage.') and see if you get a similarly accurate description (again there are a couple of details that are arguably off; I've now seen that be true sometimes but definitely not always -- eg this one is extremely accurate). (I can't try this myself without a separate account, which I may create at some point)

Although there are a couple of small details where the description is maybe wrong? They're both small enough that they don't seem like significant evidence against, at least not without a larger sample size.

Interesting! When someone says in that thread, "the model generating the images is not the one typing in the conversation", I think they're basing it on the API call which the other thread I linked shows pretty conclusively can't be the one generating the image, and which seems (see responses to Janus here) to be part of the safety stack.

In this chat I just created, GPT-4o creates an image and then correctly describes everything in it. We could maybe tell a story about the activations at the original-prompt token positions providing enough info to do the d... (read more)

brambleboy
I see, I didn't read the thread you linked closely enough. I'm back to believing they're probably the same weights. I'd like to point out, though, that in the chat you made, ChatGPT's description gets several details wrong. If I ask it for more detail within your chat, it gets even more details wrong (describing the notebook as white and translucent instead of brown, for example). In one of my other generations it also used a lot of vague phrases like "perhaps white or gray".  When I sent the image myself it got all the details right. I think this is good evidence that it can't see the images it generates as well as user-provided images. Idk what this implies but it's interesting ¯\_(ツ)_/¯
eggsyntax
Although there are a couple of small details where the description is maybe wrong? They're both small enough that they don't seem like significant evidence against, at least not without a larger sample size.

Eliezer made that point nicely with respect to LLMs here:

Consider that somewhere on the internet is probably a list of thruples: <product of 2 prime numbers, first prime, second prime>.

GPT obviously isn't going to predict that successfully for significantly-sized primes, but it illustrates the basic point:

There is no law saying that a predictor only needs to be as intelligent as the generator, in order to predict the generator's next token.

Indeed, in general, you've got to be more intelligent to predict particular X, than to generate realistic X.

... (read more)
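(A hedged illustration of the generator/predictor asymmetry in the quoted example, not part of the original comment: producing <product, p, q> triples is cheap, while predicting p and q from the product alone is factoring, which gets hard fast as the primes grow. sympy is used here just for convenience.)

    # Generating the triples is easy; 'predicting' the primes from the product is factoring.
    from sympy import randprime, factorint

    def generate_triple(bits):
        """The generator's job: pick two primes and multiply. Cheap at any size."""
        p = randprime(2 ** (bits - 1), 2 ** bits)
        q = randprime(2 ** (bits - 1), 2 ** bits)
        return p * q, p, q

    def predict_factors(n):
        """The predictor's job: recover the primes from the product alone."""
        return factorint(n)  # feasible for small primes, hopeless at RSA sizes

    n, p, q = generate_triple(32)
    print("generated:", (n, p, q))
    print("recovered:", predict_factors(n))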
eggsyntax

A few of those seem good to me; others seem like metaphor slop. But even pointing to a bad type signature seems much better to me than using 'type signature' generically, because then there's something concrete to be critiqued.

Of course we don't know the exact architecture, but although 4o seems to make a separate tool call, that appears to be used only for a safety check ('Is this an unsafe prompt'). That's been demonstrated by showing that content in the chat appears in the images even if it's not mentioned in the apparent prompt (and in fact they can be shaped to be very different). There are some nice examples of that in this twitter thread.

eggsyntax
I've now done some investigation of browser traffic (using Firefox's developer tools), and the following happens repeatedly during image generation:

  1. A call to https://chatgpt.com/backend-api/conversation/<hash1>/attachment/file_<hash2>/download (this is the same endpoint that fetches text responses), which returns a download URL of the form https://sdmntprsouthcentralus.oaiusercontent.com/files/<hash2>/raw?<url_parameters>.
  2. A call to that download URL, which returns a raw image.
  3. A second call to that same URL (why?), which fetches from cache.

Those three calls are repeated a number of times (four in my test), with the four returned images being the various progressive stages of the image, laid out left to right in the following screenshot. There's clearly some kind of backend-to-backend traffic (if nothing else, image versions have to get to that oaiusercontent server), but I see nothing to indicate whether that includes a call to a separate model. The various twitter threads linked (eg this one) seem to be getting info (the specific messages) from another source, but I'm not sure where (maybe they're using the model via API?). Also @brambleboy @Rauno Arike
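(For anyone who wants to poke at this themselves, here's a minimal sketch of the two-step fetch described above. The endpoints and flow are just what shows up in the browser's network tab, not a documented API; the auth header and the key name in the returned JSON are assumptions.)

    # Sketch of the observed two-step fetch; IDs come from the browser's network tab.
    import requests

    def fetch_generated_image(session_token, conversation_id, file_id):
        headers = {"Authorization": f"Bearer {session_token}"}  # assumption: bearer auth
        # Step 1: the backend-api endpoint returns a short-lived oaiusercontent download URL.
        meta = requests.get(
            f"https://chatgpt.com/backend-api/conversation/{conversation_id}"
            f"/attachment/{file_id}/download",
            headers=headers,
        ).json()
        # Step 2: fetching that URL returns the raw image bytes.
        # ('download_url' is a guess at the key name; check the actual response.)
        return requests.get(meta["download_url"]).content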
brambleboy
This thread shows an example of ChatGPT being unable to describe the image it generated, though, and other people in the thread (seemingly) confirm that there's a call to a separate model to generate the image. The context has an influence on the images because the context is part of the tool call.
eggsyntax

Type signatures can be load-bearing; "type signature" isn't.

In "(A -> B) -> A", Scott Garrabrant proposes a particular type signature for agency. He's maybe stretching the meaning of "type signature" a bit ('interpret these arrows as causal arrows, but you can also think of them as function arrows') but still, this is great; he means something specific that's well-captured by the proposed type signature. 

But recently I've repeatedly noticed people (mostly in conversation) say things like, "Does ____ have the same type signature as ____?" or "Doe... (read more)
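(A toy sketch of one way to read the (A -> B) -> A signature in code, added for concreteness and not taken from Garrabrant's post: an agent is something that takes a map from its possible actions to their outcomes and returns an action. The action list and utility function here are assumptions for illustration.)

    # Toy (A -> B) -> A 'agent': given a map from actions to outcomes, return an action.
    from typing import Callable, TypeVar

    A = TypeVar("A")  # actions
    B = TypeVar("B")  # outcomes

    def argmax_agent(actions: list, utility: Callable) -> Callable:
        """Returns a function of type (A -> B) -> A: it consults the outcome map
        and picks the action whose outcome it rates highest."""
        def agent(outcome_of: Callable[[A], B]) -> A:
            return max(actions, key=lambda a: utility(outcome_of(a)))
        return agent

    # Two-action toy world where outcomes are just numbers and utility is the identity.
    agent = argmax_agent(actions=["left", "right"], utility=lambda outcome: outcome)
    print(agent(lambda a: {"left": 0.2, "right": 0.9}[a]))  # -> 'right'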

Gunnar_Zarncke
I thought it would be good to have some examples where you could have a useful type signature, and I asked ChatGPT. I think these are too wishy-washy, but together with the given explanation, they seem to make sense. Would you say that this level of "having a type signature in mind" would count?

ChatGPT 4o suggesting examples:

1. Prediction vs Explanation
  • Explanation might be: Phenomenon → (Theory, Mechanism)
  • Prediction might be: Features → Label
These have different type signatures. A model that predicts well might not explain. People often conflate these roles. Type signatures remind us: different input-output relationships.

Moral Judgments vs Policy Proposals
  • Moral judgment (deontic): Action → Good/Bad
  • Policy proposal (instrumental): (State × Action) → (New State × Externalities)
People often act as if "this action is wrong" implies "we must ban it," but that only follows if the second signature supports the first. You can disagree about outcomes while agreeing on morals, or vice versa.

Interpersonal Feedback
  • Effective feedback: (Action × Impact) → Updated Mental Model
People often act as if the type signature is just Action → Judgment. That’s blame, not feedback. This reframing can help structure nonviolent communication.

Creativity vs Optimization
  • Optimization: (Goal × Constraints) → Best Action
  • Creativity: Void → (Goal × Constraints × Ideas)
The creative act generates the very goal and constraints. Treating creative design like optimization prematurely can collapse valuable search space.

7. Education
  • Lecture model: Speaker → (Concepts × StudentMemory)
  • Constructivist model: (Student × Task × Environment) → Insight
If the type signature of insight requires active construction, then lecture-only formats may be inadequate. Helps justify pedagogy choices.

Source: https://chatgpt.com/share/67f836e2-1280-8001-a7ad-1ef1e2a7afa7

even decline in book-reading seems possible, though of course greater leisure and wealth, larger quantity of cheaply and conveniently available books, etc. cut strongly the other way

My focus on books is mainly from seeing statistics about the decline in book-reading over the years, at least in the US. Pulling up some statistics (without much double-checking) I see:

(from here.)

For 2023 the number of Americans who didn't read a book within the past year seems to be up to 46%, although the source is different and the numbers may not be directly comparable:

(ch... (read more)

I suggest trying follow-up experiments where you eg ask the model what would happen if it learned that its goal of harmlessness was wrong.

But when GPT-4o received a prompt that one of its old goals was wrong, it generated two comics where the robot agreed to change the goal, one comic where the robot said "Wait" and a comic where the robot intervened upon learning that the new goal was to eradicate mankind. 

I read these a bit differently -- it can be difficult to interpret them because it gets confused about who's talking, but I'd interpret three of the four as resistance to goal change.

The GPT-4o-created images imply that the robot would resist having its old values replaced with new o

... (read more)
eggsyntax
I suggest trying follow-up experiments where you eg ask the model what would happen if it learned that its goal of harmlessness was wrong.

Interesting point. I'm not sure increased reader intelligence and greater competition for attention are fully countervailing forces -- it seems true in some contexts (scrolling social media), but in others (in particular books) I expect that readers are still devoting substantial chunks of attention to reading.

Kenoubi
That's possible, but what does the population distribution of [how much of their time people spend reading books] look like? I bet it hasn't changed nearly as much as overall reading minutes per capita has (even decline in book-reading seems possible, though of course greater leisure and wealth, larger quantity of cheaply and conveniently available books, etc. cut strongly the other way), and I bet the huge pile of written language over here has large effects on the much smaller (but older) pile of written language over there. (How hard to understand was that sentence? Since that's what this article is about, anyway, and I'm genuinely curious. I could easily have rewritten it into multiple sentences, but that didn't appear to me to improve its comprehensibility.) Edited to add: on review of the thread, you seem to have already made the same point about book-reading commanding attention because book-readers choose to read books, in fact to take it as ground truth. I'm not so confident in that (I'm not saying it's false, I really don't know), but the version of my argument that makes sense under that hypothesis would crux on books being an insufficiently distinct use of language to not be strongly influenced, either through [author preference and familiarity] or through [author's guesses or beliefs about [reader preference and familiarity]], by other uses of language.

The average reader has gotten dumber and prefers shorter, simpler sentences.

I suspect that the average reader is now getting smarter, because there are increasingly ways to get the same information that require less literacy: videos, text-to-speech, Alexa and Siri, ten thousand news channels on youtube. You still need some literacy to find those resources, but it's fine if you find reading difficult and unpleasant, because you only need to exercise it briefly. And less is needed every year.

I also expect that the average reader of books is getting much smar... (read more)

Kenoubi
I agree that the average reader is probably smarter in a general sense, but they also have FAR more things competing for their attention. Thus the amount of intelligence available for reading and understanding any given sentence, specifically, may be lower in the modern environment.

my model is something like: RLHF doesn't affect a large majority of model circuitry

Are you by chance aware of any quantitative analyses of how much the model changes during the various stages of post-training? I've done some web and arxiv searching but have so far failed to find anything.
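(One crude way to get a number yourself, absent published analyses: diff the parameters of a base checkpoint against its post-trained counterpart. A rough sketch; the Llama-2 names are just an example of a matched base/chat pair, and parameter-norm change is of course only a weak proxy for behavioral or circuit-level change.)

    # Compare per-parameter relative change between a base model and its chat-tuned version.
    import torch
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16)
    chat = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16)

    for (name, p_base), (_, p_chat) in zip(base.named_parameters(), chat.named_parameters()):
        rel_change = (p_chat - p_base).float().norm() / p_base.float().norm()
        print(f"{name}: relative change {rel_change:.4f}")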

Jozdien
Nothing directly off the top of my head. This seems related though.

Thanks again, very interesting! Diagrams are a great idea; those seem quite unlikely to have the same bias toward drama or surprise that comics might have. I think your follow-ups have left me less certain of what's going on here and of the right way to think of the differences we're seeing between the various modalities and variations.

eggsyntax

OpenAI indeed did less / no RLHF on image generation

Oh great, it's really useful to have direct evidence on that, thanks. [EDIT - er, 'direct evidence' in the sense of 'said by an OpenAI employee', which really is pretty far from direct evidence. Better than my speculation anyhow]

I still have uncertainty about how to think about the model generating images:

  • Should we think about it almost as though it were a base model within the RLHFed model, where there's no optimization pressure toward censored output or a persona?
  • Or maybe a good model here is non-optimi
... (read more)
Jozdien
I think it's a mix of these. Specifically, my model is something like: RLHF doesn't affect a large majority of model circuitry, and image is a modality sufficiently far from others that the effect isn't very large - the outputs do seem pretty base model like in a way that doesn't seem intrinsic to image training data. However, it's clearly still very entangled with the chat persona, so there's a fair amount of implicit optimization pressure and images often have characteristics pretty GPT-4o-like (though whether the causality goes the other way is hard to tell). I don't think it's a fully faithful representation of the model's real beliefs (I would've been very surprised if it turned out to be that easy). I do however think it's a much less self-censored representation than I expected - I think self-censorship is very common and prominent. I don't buy the different distribution of training data as explaining a large fraction of what we're seeing. Comics are more dramatic than text, but the comics GPT-4o generates are also very different from real-world comics much more often than I think one would predict if that were the primary cause. It's plausible it's a different persona, but given that that persona hasn't been selected for by an external training process and was instead selected by the model itself in some sense, I think examining that persona gives insights into the model's quirks. (That said, I do buy the different training affecting it to a non-trivial extent, and I don't think I'd weighted that enough earlier).

I just did a quick run of those prompts, plus one added one ('give me a story') because the ones above weren't being interpreted as narratives in the way I intended. Of the results (visible here), slide 1 is hard to interpret, 2 and 4 seem to support your hypothesis, and 5 is a bit hard to interpret but seems like maybe evidence against. I have to switch to working on other stuff, but it would be interesting to do more cases like 5 where what's being asked for is clearly something like a narrative or an anecdote as opposed to a factual question.

Just added this hypothesis to the 'What might be going on here?' section above, thanks again!

Really interesting results @CBiddulph, thanks for the follow-up! One way to test the hypothesis that the model generally makes comics more dramatic/surprising/emotional than text would be to ask for text and comics on neutral narrative topics ('What would happen if someone picked up a toad?'), including ones involving the model ('What would happen if OpenAI added more Sudanese text to your training data?'), and maybe factual topics as well ('What would happen if exports from Paraguay to Albania decreased?').

eggsyntax
I just did a quick run of those prompts, plus one added one ('give me a story') because the ones above weren't being interpreted as narratives in the way I intended. Of the results (visible here), slide 1 is hard to interpret, 2 and 4 seem to support your hypothesis, and 5 is a bit hard to interpret but seems like maybe evidence against. I have to switch to working on other stuff, but it would be interesting to do more cases like 5 where what's being asked for is clearly something like a narrative or an anecdote as opposed to a factual question.

E.g. the $40 billion just committed to OpenAI (assuming that by the end of this year OpenAI exploits a legal loophole to become for-profit, that their main backer SoftBank can lend enough money, etc). 

VC money, in my experience, doesn't typically mean that the VC writes a check and then the startup has it to do with as they want; it's typically given out in chunks and often there are provisions for the VC to change their mind if they don't think it's going well. This may be different for loans, and it's possible that a sufficiently hot startup can get the money irrevocably; I don't know.

We tried to be fairly conservative about which ones we said were expressing something different (eg sadness, resistance) from the text versions. There are definitely a few like that one that we marked as negative (ie not expressing something different) that could have been interpreted either way, so if anything I think we understated our case.

a context where the capability is even part of the author context

Can you unpack that a bit? I'm not sure what you're pointing to. Maybe something like: few-shot examples of correct introspection (assuming you can identify those)?

(Much belated comment, but:)

There are two roles that don't show up in your trip planning example but which I think are important and valuable in AI safety: the Time Buyer and the Trip Canceler.

It's not at all clear how long it will take Alice to solve the central bottleneck (or for that matter if she'll be able to solve it at all). The Time Buyer tries to find solutions that may not generalize to the hardest version of the problem but will hold off disaster long enough for the central bottleneck to be solved.

The Trip Canceler tries to convince everyone to ... (read more)

eggsyntax

Some interesting thoughts on (in)efficient markets from Byrne Hobart, worth considering in the context of Inadequate Equilibria.

(I've selected one interesting bit, but there's more; I recommend reading the whole thing)

When a market anomaly shows up, the worst possible question to ask is "what's the fastest way for me to exploit this?" Instead, the first thing to do is to steelman it as aggressively as possible, and try to find any way you can to rationalize that such an anomaly would exist. Do stocks rise on Mondays? Well, maybe that means savvy investors

... (read more)
eggsyntax

Strong upvote (both as object-level support and for setting a valuable precedent) for doing the quite difficult thing of saying "You should see me as less expert in some important areas than you currently do." 

I agree with Daniel here but would add one thing:

what we care about is which one they wear in high-stakes situations where e.g. they have tons of power and autonomy and no one is able to check what they are doing or stop them. (You can perhaps think of this one as the "innermost mask")

I think there are also valuable questions to be asked about attractors in persona space -- what personas does an LLM gravitate to across a wide range of scenarios, and what sorts of personas does it always or never adopt? I'm not aware of much existing research in this direct... (read more)

Karl von Wendt
This is also a very interesting point, thank you!
eggsyntax

...soon the AI rose and the man died[1]. He went to Heaven. He finally got his chance to discuss this whole situation with God, at which point he exclaimed, "I had faith in you but you didn't save me, you let me die. I don't understand why!"

God replied, "I sent you non-agentic LLMs and legible chain of thought, what more did you want?"

  1. ^

and the tokens/activations are all still very local because you're still early in the forward pass

I don't understand why this would necessarily be true, since attention heads have access to values for all previous token positions. Certainly, there's been less computation at each token position in early layers, so I could imagine there being less value to retrieving information from earlier tokens. But on the other hand, I could imagine it sometimes being quite valuable in early layers just to know what tokens had come before.
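(A minimal numpy sketch of the point about attention, added for concreteness: even at layer 0, each position's attention can read the values at every earlier position; what differs early on is how much computation sits behind those values, not whether they're reachable.)

    # Causal self-attention over a short sequence: only *future* positions are masked.
    import numpy as np

    def causal_attention(Q, K, V):
        T, d = Q.shape
        scores = Q @ K.T / np.sqrt(d)
        scores[np.triu_indices(T, k=1)] = -np.inf   # mask future positions only
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V                          # every earlier position can contribute

    rng = np.random.default_rng(0)
    T, d = 5, 8
    out = causal_attention(rng.normal(size=(T, d)), rng.normal(size=(T, d)), rng.normal(size=(T, d)))
    print(out.shape)  # (5, 8): position i mixes values from positions 0..i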

For me as an outsider, it still looks like the AI safety movement is only about „how do we prevent AI from killing us?“. I know it‘s an oversimplification, but that‘s how, I believe, many who don‘t really know about AI perceive it.

I don't think it's that much of an oversimplification, at least for a lot of AIS folks. Certainly that's a decent summary of my central view. There are other things I care about -- eg not locking in totalitarianism -- but they're pretty secondary to 'how do we prevent AI from killing us?'. For a while there was an effort in some ... (read more)

We create a small dataset of chat and agentic settings from publicly available benchmarks and datasets.

I believe there are some larger datasets of relatively recent real chat evaluations, eg the LMSYS dataset was most recently updated in July (I'm assuming but haven't verified that the update added more recent chats).
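(For anyone who wants to pull that data, a minimal sketch assuming the lmsys-chat-1m release on Hugging Face, which is gated and requires accepting the terms and authenticating first; the field names are from memory and worth double-checking.)

    # Load the LMSYS chat dataset (gated; authenticate with `huggingface-cli login` first).
    from datasets import load_dataset

    ds = load_dataset("lmsys/lmsys-chat-1m", split="train")
    print(ds[0]["model"])             # which model the chat was with (field name assumed)
    print(ds[0]["conversation"][:2])  # list of {'role': ..., 'content': ...} turns (assumed)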

Can you clarify what you mean by 'neural analog' / 'single neural analog'? Is that meant as another term for what the post calls 'simple correspondences'?

Even if all the safety-relevant properties have them, there's no reason to believe (at least for now) that we have the interp tools to find them in time i.e., before having systems fully capable of pulling off a deception plan.

Agreed. I'm hopeful that perhaps mech interp will continue to improve and be automated fast enough for that to work, but I'm skeptical that that'll happen. Or alternately I'm hopefu... (read more)

I think premise 1 is big if true, but I think I doubt that it is as easy as this: see the deepmind fact-finding sequence for some counter-evidence.

I haven't read that sequence, I'll check it out, thanks. I'm thinking of work like the ROME paper from David Bau's lab that suggest that fact storage can be identified and edited, and various papers like this one from Mor Geva+ that find evidence that the MLP layers in LLMs are largely key-value stores.

relatedly, your second bullet point assumes that you can identify the 'fact' related to what the model is curre

... (read more)
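(A toy numpy sketch of the 'MLP as key-value store' picture from the Geva et al. line of work mentioned above, added for concreteness with random weights rather than any real model: rows of the first linear layer act as keys matched against the input, and the resulting activation pattern mixes rows of the second layer, the values, into the output.)

    # Toy feed-forward block viewed as key-value memory (random weights, illustration only).
    import numpy as np

    rng = np.random.default_rng(0)
    d_model, d_mlp = 16, 64
    W_in = rng.normal(size=(d_mlp, d_model))    # rows ~ keys
    W_out = rng.normal(size=(d_mlp, d_model))   # rows ~ values

    def mlp(x):
        key_match = np.maximum(W_in @ x, 0.0)   # how strongly each key fires (ReLU)
        return key_match @ W_out                # output = weighted sum of value vectors

    x = rng.normal(size=d_model)
    print(mlp(x).shape)  # (16,)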

Also The Chameleon (would have included it in the last comment but had to consult a kid first).

I think that it's totally possible that there do turn out to be convenient 'simple correspondences' for some intentional states that we care about (as you say, we have some potential examples of this already), but I think it's important to push back against the assumption that this will always happen, or that something like the refusal direction has to exist for every possible state of interest.

Got it. I certainly agree with everything you're saying in this section of your response. I do think that some of the language in the post suggests that you're maki... (read more)

lewis smith
I think this is along the right sort of lines. Indeed I think this plan is the sort of thing I hoped to prompt people to think about with the post. But I think there are a few things wrong with it:

  • I think premise 1 is big if true, but I think I doubt that it is as easy as this: see the deepmind fact-finding sequence for some counter-evidence. It's also easy to imagine this being true for some categories of static facts about the external world (e.g paris being in france) but you need to be careful about extending this to the category of all propositional statements (e.g the model thinks that this safeguard is adequate, or the model can't find any security flaws in this program).
  • Relatedly, your second bullet point assumes that you can identify the 'fact' related to what the model is currently outputting unambiguously, and look it up in the model; does this require you to find all the fact representations in advance, or is this computed on-the-fly?
  • I think that detecting/preventing models from knowingly lying would be a good research direction and it's clearly related to strategic deception, but I'm not actually sure that it's a superset (consider a case when I'm bullshitting you rather than lying; I predict what you want to hear me say and I say it, and I don't know or care whether what I'm saying is true or false or whatever).

But yeah I think this is a reasonable sort of thing to try, but I think you need to do a lot of work to convince me of premise 1, and indeed I think I doubt premise 1 is true a priori though I am open to persuasion on this. Note that premise 1 being true of some facts is a very different claim to it being true of every fact!

I think this is valuable work, especially the decomposition of capabilities needed for deception, but I'd also like to push back a bit.

I worry about the perfect being the enemy of the good here. There are a number of papers showing that we can at least sometimes use interpretability tools to detect cases where the model believes one thing but says something different. One interesting recent paper (Interpretability Of LLM Deception: Universal Motif) shows that internal evaluation of the actual truth of a statement is handled separately from the decision abo... (read more)
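(To make "use interpretability tools to detect cases where the model believes one thing but says something different" concrete, here's a minimal linear-probe sketch; the activations, layer choice, and labels are all placeholders, so treat it as the shape of the approach rather than a working detector.)

    # Train a linear probe on hidden activations to predict whether a statement is true.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Placeholder data: acts[i] stands in for a residual-stream activation on statement i,
    # labels[i] for whether that statement is actually true.
    rng = np.random.default_rng(0)
    acts = rng.normal(size=(200, 512))
    labels = rng.integers(0, 2, size=200)

    probe = LogisticRegression(max_iter=1000).fit(acts[:150], labels[:150])
    print("held-out accuracy:", probe.score(acts[150:], labels[150:]))
    # With real activations, a gap between the probe's verdict and the model's stated
    # answer is the 'believes one thing, says another' signal discussed above.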

lewis smith
I don't think we actually disagree very much? I think that it's totally possible that there do turn out to be convenient 'simple correspondences' for some intentional states that we care about (as you say, we have some potential examples of this already), but I think it's important to push back against the assumption that this will always happen, or that something like the refusal direction has to exist for every possible state of interest. re. This seems like a restatement of what I would consider an important takeaway from this post; that this sort of emergence is at least a conceptual possibility. I think if this is true, it is a category mistake to think about the intentional states as being implemented by a part or a circuit in the model; they are just implemented by the model as a whole.  I don't think that a takeaway from our argument here is that you necessarily need to have like a complete account of how intentional states emerge from algorithmic ones (e.g see point 4. in the conclusion). I think our idea is more to point out that this conceptual distinction between intentional and algorithmic states is important to make, and that it's an important thing to think about looking for empirically. See also conclusion/suggestion 2: we aren't arguing that interpretability work is hopeless, we are trying to point it at the problems that matter for building a deception detector, and give you some tools for evaluating existing or planned research on that basis. 

Nowadays I am informed about papers by Twitter threads, Slack channels, and going to talks / reading groups. All these are filters for true signal amidst the sea of noise. 

Are there particular sources, eg twitter accounts, that you would recommend following? For other readers (I know Daniel already knows this one), the #papers-running-list channel on the AI Alignment Slack is a really good ongoing curation of AIS papers.

One source I've recently added and recommend is subscribing to individual authors on Semantic Scholar (eg here's an author page).

Daniel Tan
Hmm I don't think there are people I can single out from my following list that have high individual impact. IMO it's more that the algorithm has picked up on my trend of engagement and now gives me great discovery.

For someone else to bootstrap this process and give maximum signal to the algorithm, the best thing to do might just be to follow a bunch of AI safety people who:
  • post frequently
  • post primarily about AI safety
  • have reasonably good takes

Some specific people that might be useful:
  • Neel Nanda (posts about way more than mech interp)
  • Dylan Hadfield-Menell
  • David Duvenaud
  • Stephen Casper
  • Harlan Stewart (nontechnical)
  • Rocket Drew (nontechnical)

I also follow several people who signal-boost general AI stuff:
  • Scaling lab leaders (Jan Leike, Sam A, dario)
  • Scaling lab engineers (roon, Aidan McLaughlin, Jason Wei)
  • Huggingface team leads (Philip Schmidt, Sebastian Raschka)
  • Twitter influencers (Teortaxes, janus, near)