Anthropic's new paper 'Mapping the Mind of a Large Language Model' is exciting work that really advances the state of the art for the sparse-autoencoder-based dictionary learning approach to interpretability (which switches the unit of analysis in mechanistic interpretability from neurons to features). Their SAE learns (up to) 34 million features on a real-life production model, Claude 3 Sonnet (their middle-sized Claude 3 model).
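For anyone less familiar with the approach, here's a minimal sketch of the kind of sparse autoencoder involved (purely illustrative; the sizes, loss weights, and architecture details are my placeholders, not Anthropic's actual setup):

```python
# Illustrative sparse autoencoder for dictionary learning on model activations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        # Overcomplete dictionary: many more features than activation dimensions.
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, activations: torch.Tensor):
        # Feature activations are the new 'unit of analysis' (instead of neurons).
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(reconstruction, activations, features, l1_coeff=1e-3):
    # Reconstruct the activations faithfully, while an L1 penalty keeps most
    # features silent on any given input (sparsity).
    recon_loss = (reconstruction - activations).pow(2).mean()
    sparsity = features.abs().mean()
    return recon_loss + l1_coeff * sparsity

# Toy example: random vectors standing in for a transformer's residual stream.
sae = SparseAutoencoder(d_model=512, n_features=8192)
acts = torch.randn(64, 512)
recon, feats = sae(acts)
loss = sae_loss(recon, acts, feats)
```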
The paper (which I'm still reading; it's not short) updates me somewhat toward 'SAE-based steering vectors will Just Work for LLM alignment up to human-level intelligence[1].' As I read I'm trying to think through what I would have to see to be convinced of that hypothesis. I'm no expert here! I'm posting my thoughts mostly to ask for feedback about where I'm wrong and/or what I'm missing. Remaining gaps I've thought of so far:
Terminology proposal: scaffolding vs tooling.
I haven't seen these terms consistently defined with respect to LLMs. I've been using, and propose standardizing on:
Some smaller details:
Thanks to @Andy Arditi for helping me nail down the distinction.
If I desire a cookie, I desire to believe that I desire a cookie; if I do not desire a cookie, I desire to believe that I do not desire a cookie; let me not become attached to beliefs I may not want.
If I believe that I desire a cookie, I desire to believe that I believe that I desire a cookie; if I do not believe that I desire a cookie, I desire to believe that I do not believe that I desire a cookie; let me not become attached to beliefs I may not want.
If I believe that I believe that I desire a cookie, I desire to believe that I believe that I believe that I desire a cookie; if I do not believe that I believe that I desire a cookie, I desire to believe that I do not believe that I believe that I desire a cookie; let me not become attached to beliefs I may not want.
If I believe that...
Two interesting things from this recent Ethan Mollick post:
People can’t detect AI writing well. Editors at top linguistics journals couldn’t. Teachers couldn’t (though they thought they could - the Illusion again). While simple AI writing might be detectable (“delve,” anyone?), there are plenty of ways to disguise “AI writing” styles through simple prompting. In fact, well-prompted AI writing is judged more human than human writing by readers.
Thoughts on a passage from OpenAI's GPT-o1 post today:
We believe that a hidden chain of thought presents a unique opportunity for monitoring models. Assuming it is faithful and legible, the hidden chain of thought allows us to "read the mind" of the model and understand its thought process. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought. We also do not want to make an unaligned chain of thought directly visible to users.
Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users. We acknowledge this decision has disadvantages. We strive to partially make up for it by teaching the model to reproduce any useful ideas from the chain of thought in the answer. For the o1 model series we show a model-generated summary of the chain of thought.
This section is interesting in a few ways:
CoT optimised to be useful in producing the correct answer is a very different object to CoT optimised to look good to a human, and a priori I expect the former to be much more likely to be faithful. Especially when thousands of tokens are spent searching for the key idea that solves a task.
For example, I have a hard time imagining how the examples in the blog post could be not faithful (not that I think faithfulness is guaranteed in general).
If they're avoiding doing RL based on the CoT contents,
Note they didn’t say this. They said the CoT is not optimised for ‘policy compliance or user preferences’. Pretty sure what they mean is they didn’t train the model not to say naughty things in the CoT.
'We also do not want to make an unaligned chain of thought directly visible to users.' Why?
I think you might be overthinking this. The CoT has not been optimised not to say naughty things. OpenAI avoid putting out models that haven’t been optimised not to say naughty things. The choice was between doing the optimising, or hiding the CoT.
Edit: not wanting other people to finetune on the CoT traces is also a good explanation.
Much is made of the fact that LLMs are 'just' doing next-token prediction. But there's an important sense in which that's all we're doing -- through a predictive processing lens, the core thing our brains are doing is predicting the next bit of input from current input + past input. In our case input is multimodal; for LLMs it's tokens. There's an important distinction in that LLMs are not (during training) able to affect the stream of input, and so they're myopic in a way that we're not. But as far as the prediction piece goes, I'm not sure there's a strong difference in kind.
Would you disagree? If so, why?
I've been thinking of writing up a piece on the implications of very short timelines, in light of various people recently suggesting them (e.g. Dario Amodei: "2026 or 2027...there could be a mild delay").
Here's a thought experiment: suppose that this week it turns out that OAI has found a modified sampling technique for o1 that puts it at the level of the median OAI capabilities researcher, in a fairly across-the-board way (ie it's just straightforwardly able to do the job of a researcher). Suppose further that it's not a significant additional compute expense; let's say that OAI can immediately deploy a million instances.
What outcome would you expect? Let's operationalize that as: what do you think is the chance that we get through the next decade without AI causing a billion deaths (via misuse or unwanted autonomous behaviors or multi-agent catastrophes that are clearly downstream of those million human-level AIs)?
In short, what do you think are the chances that that doesn't end disastrously?
If it were true that current-gen LLMs like Claude 3 were conscious (something I doubt but don't take any strong position on), their consciousness would be much less like a human's than like a series of Boltzmann brains, popping briefly into existence in each new forward pass, with a particular brain state already present, and then winking out afterward.
I've now made two posts about LLMs and 'general reasoning', but used a fairly handwavy definition of that term. I don't yet have a definition I feel fully good about, but my current take is something like:
What am I missing? Where does this definition fall short?
A thought: the bulk of the existential risk we face from AI is likely to be from smarter-than-human systems. At a governance level, I hear people pushing for things like:
but not
Why not? It seems like a) a particularly clear and bright line to draw[1], b) something that a huge amount of the public would likely support, and c) probably(?) easy to pass because most policymakers imagine this to be in the distant future. The biggest downside I immediately see is that it sou...
It's not that intuitively obvious how Brier scores vary with confidence and accuracy (for example: how accurate do you need to be for high-confidence answers to be a better choice than low-confidence?), so I made this chart to help visualize it:
Here's log-loss for comparison (note that log-loss can be infinite, so the color scale is capped at 4.0):
Claude-generated code and interactive versions (with a useful mouseover showing the values at each point for confidence, accuracy, and the Brier (or log-loss) score):
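The underlying calculation is simple enough to check by hand. Here's a minimal standalone sketch (separate from the interactive versions above), assuming a forecaster who answers binary questions with constant confidence c and is right a fraction a of the time:

```python
import numpy as np

def expected_brier(confidence: float, accuracy: float) -> float:
    # Right a fraction `accuracy` of the time (squared error (1 - confidence)^2),
    # wrong the rest of the time (squared error confidence^2).
    return accuracy * (1 - confidence) ** 2 + (1 - accuracy) * confidence ** 2

def expected_log_loss(confidence: float, accuracy: float) -> float:
    # Clip away from 0 and 1 so the loss stays finite (the chart instead caps
    # the color scale at 4.0).
    c = np.clip(confidence, 1e-12, 1 - 1e-12)
    return -(accuracy * np.log(c) + (1 - accuracy) * np.log(1 - c))

# Always answering 50/50 scores 0.25 regardless of accuracy. At 90% confidence
# you need better than 70% accuracy to beat that:
print(expected_brier(0.5, 0.7))   # 0.25
print(expected_brier(0.9, 0.7))   # 0.25 -- the break-even point
print(expected_brier(0.9, 0.8))   # 0.17 -- high confidence now pays off
```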
(a comment I made in another forum while discussing my recent post proposing more consistent terminology for probability ranges)
I think there's a ton of work still to be done across the sciences (and to some extent other disciplines) in figuring out how to communicate evidence and certainty and agreement. My go-to example is: when your weather app says there's a 30% chance of rain tomorrow, it's really non-obvious to most people what that means. Some things it could mean:
There's so much discussion, in safety and elsewhere, around the unpredictability of AI systems on OOD inputs. But I'm not sure what that even means in the case of language models.
With an image classifier it's straightforward. If you train it on a bunch of pictures of different dog breeds, then when you show it a picture of a cat it's not going to be able to tell you what it is. Or if you've trained a model to approximate an arbitrary function for values of x > 0, then if you give it input < 0 it won't know what to do.
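As a toy version of that second case (my own example, nothing deep going on):

```python
import numpy as np

# Fit a polynomial to a target function using training data only from x > 0,
# then query it out-of-distribution at x < 0.
rng = np.random.default_rng(0)
x_train = rng.uniform(0.1, 10.0, size=200)
y_train = np.sqrt(x_train)              # target only makes sense for x > 0

coeffs = np.polyfit(x_train, y_train, deg=5)

print(np.polyval(coeffs, 4.0))   # in-distribution: roughly 2.0
print(np.polyval(coeffs, -4.0))  # out-of-distribution: a confident-looking
                                 # number, but the training task says nothing
                                 # about what it should be
```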
But what would that even be with ...
[Epistemic status: thinking out loud]
Many of us have wondered why LLM-based agents are taking so long to be effective and in common use. One plausible reason that hadn't occurred to me until now is that no one's been able to make them robust against prompt injection attacks. Reading an article ('Agent hijacking: The true impact of prompt injection attacks') today reminded me of just how hard it is to defend against that for an agent out in the wild.
Counterevidence: based on my experiences in startup-land and the industry's track record with Internet of Thi...
GPT-o1's extended, more coherent chain of thought -- see Ethan Mollick's crossword puzzle test for a particularly long chain of goal-directed reasoning[1] -- seems like a relatively likely place to see the emergence of simple instrumental reasoning in the wild. I wouldn't go so far as to say I expect it (I haven't even played with o1-preview yet), but it seems significantly more likely here than with previous LLMs.
Frustratingly, for whatever reason OpenAI has chosen not to let users see the actual chain of thought, only a model-generated summary of it. We...
In some ways it doesn't make a lot of sense to think about an LLM as being or not being a general reasoner. It's fundamentally producing a distribution over outputs, and some of those outputs will correspond to correct reasoning and some of them won't. They're both always present (though sometimes a correct or incorrect response will be by far the most likely).
A recent tweet from Subbarao Kambhampati looked at whether an LLM can do simple planning about how to stack blocks, specifically: 'I have a block named C on top of a block named A. A is on tabl...
This is largely why I buy the scaling thesis; the only real crux is whether @Bogdan Ionut Cirstea or @jacob_cannell is right about timelines.
I do believe some algorithmic improvements matter, but I don't think they will be nearly as much of a blocker as raw compute, and my pessimistic estimate is that the critical algorithms could be discovered in 24-36 months, assuming we don't have them.
@jacob_cannell's timeline and model is here:
@Bogdan Ionut Cirstea's timeline and models are here:
https://x.com/BogdanIonutCir2/status/1827707367154209044
Before AI gets too deeply integrated into the economy, it would be well to consider under what circumstances we would consider AI systems sentient and worthy of consideration as moral patients. That's hardly an original thought, but what I wonder is whether there would be any set of objective criteria that would be sufficient for society to consider AI systems sentient. If so, it might be a really good idea to work toward those being broadly recognized and agreed to, before economic incentives in the other direction are too strong. Then there could be futu...
Something I'm grappling with:
From a recent interview between Bill Gates & Sam Altman:
Gates: "We know the numbers [in a NN], we can watch it multiply, but the idea of where is Shakespearean encoded? Do you think we’ll gain an understanding of the representation?"
Altman: "A hundred percent…There has been some very good work on interpretability, and I think there will be more over time…The little bits we do understand have, as you’d expect, been very helpful in improving these things. We’re all motivated to really understand them…"
To the extent that a par...