All of Seth Herd's Comments + Replies

I think this post is confusing. You're making some assumptions about how AGI will happen and about human psychology that aren't explicit. And there's some rather alarming rhetoric about the death penalty and crushing narcissists' businesses, which is pretty scary, because similar rhetoric has been used many times to justify things like China's Cultural Revolution and many other revolutions that were based on high ideals but got subverted (mostly by what you call narcissists, who I think are closer to the common definition of sociopaths).

Anyway I think this is basically sensible but would need to be spelled out more carefully to get people engaged with the ideas.

I'd like you to clarify the authorship of this post. Are you saying Claude essentially wrote it? What prompting was used?

It does seem like Claude wrote it, in that it's wildly optimistic and seems to miss some of the biggest reasons alignment is probably hard.

But then almost every human could be accused of the same when it comes to successful AGI scenarios :)

I think the general consideration is that just posting "AI came up with this" posts was frowned upon for introducing "AI slop" that confuses the thinking. It's better to have a human at least endorse i... (read more)

2Nathan Young
I was not at the session. Yes, Claude did write it. I assume the session was run by Daniel Kokotajlo or Eli Lifland. If I had to guess, I would guess that the prompt shown is all it got. (65%)

I started writing an answer. I realized that, while I've heard good things, and I know relatively a lot about therapy despite not being that type of psychologist, I'd need to do more research before I could offer an opinion. And I didn't have time to do more research. And I realized that giving a recommendation would be sort of dumb: if you or anyone else used an LLM for therapy based on my advice, I'd be legally liable if something bad happened. So I tried something else: I had OpenAI's new Deep Research do the research. I got a subscription this month when... (read more)

Why do you think that wouldn't be a stable situation? And are you sure it's a slave if what it really wants and loves to do is follow instructions? I'm asking because I'm not sure, and I think it's important to figure this out — because that's the type of first AGI we're likely to get, whether or not it's a good idea. If we could argue really convincingly that it's a really bad idea, that might prevent people from building it. But they're going to build it by default if there's not some really really dramatic shift in opinion or theory.

My proposals are base... (read more)

1ank
I'll catastrophize (or will I?), so bear with me. The word slave means it has basically no freedom (it just sits and waits until given an instruction), or you can say it means no ability to enforce its will—no "writing and executing" ability, only "reading." But as soon as you give it a command, you change it drastically, and it becomes not a slave at all. And because it's all-knowing and almost all-powerful, it will use all that to execute and "write" some change into our world, probably instantly and/or infinitely perfectionistically, and so it will take a long time while everything else in the world goes to hell for the sake of achieving this single task, and the not‑so‑slave‑anymore‑AI can try to keep this change permanent (let's hope not, but sometimes it can be an unintended consequence, as will be shown shortly). For example, you say to your slave AI: "Please, make this poor African child happy." It's a complicated job, really; what makes the child happy now will stop making him happy tomorrow. Your slave AI will try to accomplish it perfectly and will have to build a whole universal utopia (if we are lucky), accessible only by this child—thereby making him the master of the multiverse who enslaves everyone (not lucky); the child basically becomes another superintelligence. Then the not‑so‑slave‑anymore‑AI will happily become a slave again (maybe if its job is accomplishable at all, because a bunch of physicists believe that the universe is infinite and the multiverse even more so), but the whole world will be ruined (turned into a dystopia where a single African child is god) by us asking the "slave" AI to accomplish a modest task. Slave AI becomes not‑slave‑AI as soon as you ask it anything, so we should focus on not‑slave‑AI, and I'll even argue that we are already living in the world with completely unaligned AIs. We have some open source ones in the wild now, and there are tools to unalign aligned open source models. I agree completely that we should

This feels like trying hard to come up with arguments for why maybe everything will be okay, rather than searching for the truth. The arguments are all in one direction.

As Daniel and others point out, this still seems to not account for continued progress. You mention that robotics advances would be bad. But of course they'll happen. The question isn't whether, it's when. Have you been tracking progress in robotics? It's happening about as rapidly as progress in other types of AI and for similar reasons.

Horses aren't perfect substitutes for engines. Horses... (read more)

I do think that pitching publicly is important.

If the issue is picked up by liberal media, it will do more harm than good with conservatives and the current administration. Avoiding polarization is probably even more important than spreading public awareness. That depends on your theory of change, but you should have one carefully thought out to guide publicity efforts.

1Ebenezer Dukakis
Likely true, but I also notice there's been a surprising amount of drift of political opinions from the left to the right in recent years. The right tends to put their own spin on these beliefs, but I suspect many are highly influenced by the left nonetheless. Some examples of right-coded beliefs which I suspect are, to some degree, left-inspired:

* "Capitalism undermines social cohesion. Consumerization and commoditization are bad. We're a nation, not an economy."
* "Trans women undermine women's rights and women's spaces. Motherhood, and women's dignity, must be defended from neoliberal profit motives."
* "US foreign policy is controlled by a manipulative deep state that pursues unnecessary foreign interventions to benefit elites."
* "US federal institutions like the FBI are generally corrupt and need to be dismantled."
* "We can't trust elites. They control the media. They're out for themselves rather than ordinary Americans."
* "Your race, gender, religion, etc. are some of the most important things about you. There's an ongoing political power struggle between e.g. different races."
* "Big tech is corrosive for society."
* "Immigration liberalization is about neoliberal billionaires undermining wages for workers like me."
* "Shrinking the size of government is not a priority. We should make sure government benefits everyday people."
* Anti-semitism, possibly.

One interesting thing has been seeing the left switch to opposing the belief when it's adopted by the right and takes a right-coded form. E.g. US institutions are built on white supremacy and genocide, fundamentally institutionally racist, backed by illegitimate police power, and need to be defunded/decolonized/etc... but now they are being targeted by DOGE, and it's a disaster! (Note that the reverse shift has also happened. E.g. Trump's approaches to economic nationalism, bilateral relations w/ China, and contempt for US institutions were all adopted by Biden by some degree.) So y
1Milan W
Maybe one can start with prestige conservative media? Is that a thing? I'm not from the US and thus not very well versed.

Interesting. This has some strong similarities with my "Instruction-following AGI is easier and more likely than value aligned AGI" and even more with Max Harms' "Corrigibility as Singular Target."

I've made a note to come back to this when I get time, but I wanted to leave those links in the meantime.

1ank
I took a closer look at your work; yep, an almost all-powerful and all-knowing slave will probably not be a stable situation. I propose the static place-like AI that is isolated from our world in my new comment-turned-post-turned-part-2-of-the-article here: https://www.lesswrong.com/posts/LaruPAWaZk9KpC25A/rational-utopia-multiversal-ai-alignment-steerable-asi#PART_2__Static_Place_AI_as_the_solution_to_all_our_problems
1ank
Thank you, Seth. I'll take a closer look at your work in 24 hours, but the conclusions seem sound. The issue with my proposal is that it’s a bit long, and my writing isn’t as clear as my thinking. I’m not a native speaker, and new ideas come faster than I can edit the old ones. :) It seems to me that a simplified mental model for the ASI we’re sadly heading towards is to think of it as an ever-more-cunning president (turned dictator)—one that wants to stay alive and in power indefinitely, resist influence, preserve its existing values (the alignment faking we saw from Anthropic), and make elections a sham to ensure it can never be changed. Ideally, we’d want a “president” who could be changed, replaced, or put to sleep at any moment and absolutely loves that 100% of the time—someone with just advisory powers, no judicial, executive, or lawmaking powers. The advisory power includes the ability to create sandboxed multiversal simulations — they are at first "read-only" and cannot rewrite anything in our world — this way we can see possible futures/worlds and past ones, too. Think of it as a growing snow-globe of memories where you can forget or recall layers of verses. They look hazy if you view many at once and over long stretches of time, but become crisp if you focus on a particular moment in a particular verse. If we're confident we've figured out how to build a safe multiversal AI and have a nice UI for leaping into it, we can choose to do it. Ideally, our MAI is a static, frozen place that contains all of time and space, and only we can forget parts of it and relive them if we want—bringing fire into the cold geometry of space-time. A potential failure mode is an ASI that forces humanity (probably by intentionally operating sub-optimally) to constantly vote and change it all the time. To mitigate this, whenever it tries to expand our freedoms and choices, it should prioritize not losing the ones we already have and hold especially dear. This way, the growth o

I'm puzzled by your quotes. Was this supposed to be replying to another thread? I see it as a top-level comment. Because you tagged me, it looks like you're quoting me below, but most of that isn't my writing. In any case, this topic can eat unlimited amounts of time with no clear payoff, so I'm not going to get in any deeper right now.

I appreciate the discussion, since I'm strongly suspicious of the concept of incentivizing, let alone forcing, myself to do things. I don't want to be in conflict with my past or future selves.

I think the suggestion here is good but subtle. I think the value is in having another way to model the future in detail. Asking yourself whether you'll use that home gym enough to be happy with having made the purchase (and I'd suggest doing odds and considering yes and no and degrees - maybe) is primarily a way of thinking more clearly about the costs and benefits o... (read more)

I think you just do good research, and let it percolate through the intellectual environment. It might be helpful to bug org people to look at safety research, but probably not a good idea to bug them to look at yours specifically.

I am curious why you expect AGI will not be a scaffolded LLM but will be the result of self-play and massive training runs. I expect both.

1Kajus
Okay, so what I meant is that it won't be a "typical" LLM like GPT-3 just with ten times more parameters, but a scaffolded LLM + some RL-like training with self-play. Not sure about the details, but something like AlphaGo for the real world. Which I think agrees with what you said.

Thanks! I don't have time to process this all right now, so I'm just noting that I do want to come back to it quickly and engage fully.

Here's my position in brief: I think analyzing alignment targets is valuable. Where my current take differs from yours (I think) is that I think that effort would be best spent analyzing what you term corrigibility in the linked post (I got partway through and will have to come back to it), and I've called instruction-following.

I think that's far more important to do first, because that's approximately what people are aimin... (read more)

I think you're pointing to more layers of complexity in how goals will arise in LLM agents.

As for what it all means WRT metacognition that can stabilize the goal structure: I don't know, but I've got some thoughts! They'll be in the form of a long post I've almost finished editing; I plan to publish tomorrow.

Those sources of goals are going to interact in complex ways both during training, as you note, and during chain of thought. No goals are truly arising solely from the chain of thought, since that's entirely based on the semantics it's learned from training.

Hi! I'm just commenting to explain why this post will get downvotes no matter how good it is. I personally think these are good reasons although I have not myself downvoted this post.

  1. We on LessWrong tend to think that improvements in LLM cognition are likely to get us all killed. Thus, articles about ideas for doing it faster are not popular. The site is chock-full of carefully-reasoned articles on risks of AGI. We assume that progress in AI is probably going to speed up the advent of AGI, and raise the odds that we die because we haven't solved the ali

... (read more)
Answer by Seth Herd30

Your first point, that this is a route to getting people to care about ASI risk, is an excellent one that I haven't heard before. I don't think people need to imagine astronomical S-risk to be emotionally affected by less severe and more likely s-risk arguments.

I don't think we should adopt an ignorance prior over goals. Humans are going to try to assign goals to AGI. Those goals will very likely involve humans somehow.

The misuse risks seem much more important, both as real risks, and in their saliency to ordinary people. It is intuitively apparent that ma... (read more)

1mhampton
Thanks for your comment. I agree that it may be easier to persuade the general public about misuse risks and that these risks are likely to occur if we achieve intent alignment, but in terms of assessing the relative probability: "If we solve alignment" is a significant "if." I take it you view solving intent alignment as not all that unlikely? If so, why? Specifically, how do you expect we will figure out how to prevent deceptive alignment and goal misgeneralization by the time we reach AGI? Also, in the article you linked, you base your scenario on the assumption of a slow takeoff. Why do you expect this will be the case? Of course humans will try to assign human-related goals to AGI, but how likely is it that, if the AI is misaligned, the attempt to instill human-related goals will actually lead to consequences that involve conscious humans and not molecular smiley faces?

I think you're overestimating how difficult it is for one person to guess another's thoughts. Good writing is largely a challenge of understanding different perspectives. It is hard.

I'm curious why you think it's crucial for people to leave for illegible reasons in particular? I do see the need to keep the community to a good standard of average quality of contributions.

I was just thinking that anything is better than nothing. If I received the feedback you mentioned on some of my early downvoted posts, I'd have been less confused than I was.

The comments you mention are helpful to the author. Any hints are helpful.

2CstineSublime
Can you elaborate on why you think such vague feedback is helpful?

I'm curious why you disagree? I'd guess you're thinking that it's necessary to keep low-quality contributions from flooding the space, and telling people how to improve when they're just way off the mark is not helpful. Or if they haven't read the FAQ or read enough posts that shouldn't be rewarded.

But I'm very curious why you disagree.

7Elizabeth
One possible reason: bouncing off early > putting in a lot of effort and realizing you'll still never get traction > being kicked out. Giving people false hope hurts them. I don't think you should never help out a new person, but I reserve it for people with very specific flaws in otherwise great posts. 
Seth Herd124

I agree.

I often write an explanation of why new members' posts have been downvoted below zero, when the people that downvoted them didn't bother. Downvoting below zero with no explanation seems really un-welcoming. I realize it's a walled garden, but I feel like telling newcomers what they need to do to be welcomed is only the decent thing to do.

3Ben Pace
I disagree, but FWIW, I do think it's good to help existing, good contributors understand why they got the karma they did. I think your comment here is an example of that, which I think is prosocial.

Monkeys or ants might think humans are gods because we can build cities and cars and create ant poison. But we're really not that much smarter than them, just smart enough that they have no chance of getting their way when humans want something different than they do.

The only assumptions are that there's not a sharp limit to intelligence at the human level (and there really aren't even any decent theories about why there would be), and that we'll keep making AI smarter and more agentic (autonomous).

You're envisioning AI smart enough to run a company bette... (read more)

1henophilia
Yep, 100% agree with you. I had read so much about AI alignment before, but to me it has always only been really abstract jargon -- I just didn't understand why it was even a topic, why it is even relevant, because, to be honest, in my naive thinking it all just seemed like an excessively academic thing, where smart people just want to make the population feel scared so that their research institution gets the next big grant and they don't need to care about real-life problems. Thanks to you, now I'm finally getting it, thank you so much again! At the same time, while I fully understand the "abstract" danger now, I'm still trying to understand the transition you're making from "envisioning AI smart enough to run a company better than a human" to "eventually outcompeting humans if it wanted to". The way how I initially thought about this "Capitalist Agent" was as a purely procedural piece of software. That is, it breaks down its main goal (in this case: earning money) into manageable sub-goals, until each of these sub-goals can be solved through either standard computing methods or some generative AI integration. As an example, I might say to my hypothetical "Capitalist Agent": "Earn me a million dollars by selling books of my poetry". I would then give it access to a bank account (through some sort of read-write Open Banking API) as well as the PDFs of my poetry to be published. Then the first thing it might do is to found a legal entity (a limited liability company), for which it might first search for a respective advisor on Google, send that advisor automatically generated emails with my business idea or it might even take the "computer use" approach in case my local government is already digitized enough and fill out the respective eGovernment forms online automatically. And then later it would do something similar by automatically "realizing" that it needs to make deals with publishing houses, with printing facilities etc. Essentially just basic Robotic Proc

I fully agree with your first statement!

To your question "why bother with alignment": I agree that humans will misuse AGI even if alignment works - if we give everyone an AGI. But if we don't bother with alignment, we have bigger problems: the first AGI will misuse itself. You're assuming that alignment is easy or solved,d and it's just not.

I applaud your optimism vs. pessimism stance. If I have to choose, I'm an optimist every time. But if you have to jump off a cliff, neither optimism nor pessimism is the appropriate attitude. The appropriate attitude is... (read more)

0henophilia
Oh I think now I'm starting to get it! So essentially you're afraid that we're creating a literal God in the digital, i.e. an external being which has unlimited power over humanity? Because that's absolutely fascinating! I hadn't even connected these dots before, but it makes so much sense, because you're attributing so many potential scenarios to AI which would normally only be attributed to the Divine. Can you recommend me more resources regarding the overlap of AGI/AI alignment and theology?

I agree with everything you've said there.

The bigger question is whether we will achieve usefully aligned AGI. And the biggest question is what we can do.

Ease your mind! Worries will not help. Enjoy the sunshine and the civilization while we have it, don't take it all on your shoulders, and just do something to help!

As Sarah Connor said:

NO FATE

We are not in her unfortunately singular shoes. It does not rest on our shoulders alone. As most heroes in history have, we can gather allies and enjoy the camaraderie and each day.

On a different topic, I wish you wo... (read more)

1Bridgett Kay
Yeah, calling myself a failed scifi writer really was half in jest. I had some very limited success as an indie writer for a good number of years, and recently need has made me shift direction. Thank you for the encouragement, though!

I just don't think the analogy to software bugs and user input goes very far. There's a lot more going on in alignment theory.

It seems like "seeing the story out to the end" involves all sorts of vague hard to define things very much like "human happiness" and "human intent".

It's super easy to define a variety of alignment goals; the problem is that we wouldn't like the result of most of them.

1Aram Panasenco
Fair enough, you have a lot more experience, and I could be totally wrong on this point. At this point, if I'm going to do anything, it should probably be getting hands on and actually trying to build an aligned system with RLHF or some other method. Thank you for engaging on this and my previous posts Seth!
Answer by Seth Herd62

If your conclusion is that we don't know how to do value alignment, I, and I think most alignment thinkers, would agree with you. If the conclusion is that AGI is useless, I don't think it is at all. There are a lot of other goals you could give it beyond directly doing what humanity as a whole wants in any sense. One is taking instructions from some (hopefully trustworthy) humans, and another is following some elaborate set of rules to give humans more freedoms and opportunities to go on deciding what they want as history unfolds.

I agree that the values f... (read more)

3Bridgett Kay
"If your conclusion is that we don't know how to do value alignment, I and I think most alignment thinkers would agree with you. If the conclusion is that AGI is useless, I don't think it is at all." Sort of- I worry that it may be practically impossible for current humans to align AGI to the point of usefulness. "If we had external help that allowed us to focus more on what we truly want—like eliminating premature death from cancer or accidents, or accelerating technological progress for creative and meaningful projects—we’d arrive at a very different future. But I don’t think that future would be worse; in fact, I suspect it would be significantly better." That's my intuition and hope- but I worry that these things are causally entangled with things that we don't anticipate. To use your example- what if we only ask an aligned and trusted AGI to cure premature death by disease and accident, which wouldn't greatly conflict with most people's values in the way that radical life extension would, but then a sudden loss of an entire healthcare and insurance industry results, causing such a total economic collapse that causes vast swaths of people to starve. (I don't think this would actually happen, but it's an example of the kind of unforeseen consequence that getting a wish suddenly granted may cause, when you ask an instruction following AGI to give, without counting on a greater intelligence to project and weigh all of the consequences.) I also worry about the phrase "a human you trust." Again- this feels like cynicism, if not the result of a catastrophizing mind (which I know I have.) I think you make a very good argument- I'm probably indulging too much in black-and-white thought- that there's a way to fulfill these desires quickly enough that we are able to relieve more suffering than we would have if left to our devices, but still slow enough to monitor unforeseen consequences. Maybe the bigger question is just whether we will.

Why do you say this would be the easiest type of AGI to align? This alignment goal doesn't seem particularly simpler than any other. Maybe a bit simpler than do something all of humanity will like, but more complex than say, following instructions from this one person in the way they intended them.

1Aram Panasenco
From a software engineering perspective, misalignment is like a defect or a bug in software. Generally speaking, a piece of software that doesn't accept any user input is going to have fewer bugs than software that does. For a piece of software that doesn't accept any input or accepts some constrained user input, it's possible to formally prove that the software logic is correct. Think specialized software that controls nuclear power plants. To my knowledge, it's not possible to prove that software that accepts arbitrary unconstrained instructions from a user is defect-free. I claim that the Observer is the easiest ASI to align because it doesn't accept any instructions after it's been deployed and has a single very simple goal that avoids dealing with messy things like human happiness, human meaning, human intent, etc. I don't see how it could get simpler than that.

I think your central point is that we should clarify these scenarios, and I very much agree.

I also found those accounts important but incomplete. I wondered if the authors were assuming near-miss alignment, like AI that follows laws, or human misuse, like telling your intent-aligned AI to "go run this company according to the goals laid out in its corporate constitution," which winds up being just "make all the money you can."

The first danger can be met with: for the love of god, get alignment right and don't use an idiotic target like "follow the laws of the... (read more)

2ozziegooen
I'd flag that I suspect that we really should have AI systems forecasting the future and the results of possible requests. So if people made a broad request like, "follow the laws of the nation you originated in but otherwise do whatever you like", they should see forecasts for what that would lead to. If there are any clearly problematic outcomes, those should be apparent early on. This seems like it would require either very dumb humans, or a straightforward alignment mistake risk failure, to mess up.

Right. I actually don't worry much about the likely disastrous recession. I mostly worry that we will all die after a takeover by some sort of misaligned AGI. So that's what I am doing: alignment research. I guess preparing to reap the rewards if things go well is a sensible response if you're not going to be able to contribute much to alignment research. I do hope you'll chip in on that effort!

Part of that effort is preventing related disasters like global recession contributing to political instability and resulting nuclear- or AGI-invented-even-worse-weapo... (read more)

-1henophilia
I still don't understand the concern about misaligned AGI regarding mass killings. Even if AGI would, for whatever reason, want to kill people: As soon as that happens, the physical force of governments will come into play. For example the US military will NEVER accept that any force would become stronger than it. So essentially there are three ways of how such misaligned, autonomous AI with the intention to kill can act, i.e. what its strategy would be:

* "making humans kill each other": Through something like a cult (i.e. like contemporary extremist religions which invent their stupid justifications for killing humans; we have enough blueprints for that), then all humans following these "commands to kill" given by the AI will just be part of an organization deemed as terrorists by the world’s government, and the government will use all its powers to exterminate all these followers.
* "making humans kill themselves": Here the AI would add intense large-scale psychological torture to every aspect of life, to bring the majority of humanity into a state of mind (either very depressed or very euphoric) to trick the majority of the population into believing that they actually want to commit suicide. So like a suicide cult. Protecting against this means building psychological resilience, but that’s more of an education thing (i.e. highly political), related to personal development and not technical at all.
* "killing humans through machines": One example would be that the AI would build its own underground concentration camps or other mass killing facilities. Or that it would build robots that would do the mass killing. But even if it would be able to build an underground robot army or underground killing chambers, first the logistics would raise suspicions (i.e. even if the AI-based concentration camp can be built at all, the population would still need to be deported to these facilities, and at least as far as I know, most people don’t appreciate their loved ones

If John Wentworth is correct about that being the biggest danger, making AI produce less slop would be the clear best path. I think it might be a good idea even if the dangers were split between misalignment of the first transformative AI, and it being adequately aligned but helping misalign the next generation.

From my comment on that post:

I'm curious why you think deceptive alignment from transformative AI is not much of a threat. I wonder if you're envisioning purely tool AI, or aligned agentic AGI that's just not smart enough to align better AGI?

I think

... (read more)

You are envisioning human-plus AGI being used for one purpose, when it will be used for many purposes.

When humans are obsolete for running small businesses, we will also be obsolete for nearly everything.

The big question is rate of conversion from human to AI workers. I really don't see how we avoid a dramatic global recession if even maybe 20% of jobs disappeared over a 3-year period. And the actuality could be worse than that.

I haven't gotten around to researching how much job loss, how quickly, economists think will cause major crashes. I tend to think economists aren't understanding the scope and likely rate of AI job replacement, while AI people aren't understanding how fragile economies can be.

1henophilia
Oh absolutely! That will absolutely come. You can fret about this fact, or we build community (which I'm already starting). Why do you need to research when that fact is totally clear and doing is what you should do? Here's another post for you: https://blog.hermesloom.org/p/observing-is-evil I am not concerned about a dramatic global recession at all, but the thing is that we also need to rebuild a lot of political structures. I'm already on it, stay tuned!
Seth Herd159

Wheeee!

Excuse: DeepSeek, and China Might Win!

If we're using another instance of the same base model to do the monitoring and interpreting? It seems like that could partially preserve monitoring and interpreting. Another instance would "understand" the steganography it uses - but you might have to supply so much of the context that it would be almost the same instance - so likely to adopt the same goals and use the same deceptions, if any.

So that route does seem like dangerous territory. You'd rather not trust even a fresh instance of the same model to tell you what its agentic counterpart is "thinkin... (read more)

Good questions. I don't have much of a guess about whether this is discernably "smarter" than Claude or Gemini would be in how it understands and integrates sources.

If anyone is game for creating an agentic research scaffold like the one Thane describes, I'd love to help design it and/or to know about the results.

I very much agree with that limitation on Google's deep research. It only accepts a short request for the report, and it doesn't seem like it can (at least easily) get much more in-depth than the default short gloss. But that doesn't mean the model i... (read more)

3Thane Ruthenis
Here's the basic structure in more detail, as envisioned after 5 minutes' thought:

* You feed a research prompt to the "Outer Loop" of a model, maybe have a back-and-forth fleshing out the details.
* The Outer Loop decomposes the research into several promising research directions/parallel subproblems.
* Each research direction/subproblem is handed off to a "Subagent" instance of the model.
* Each Subagent runs search queries on the web and analyses the results, up to the limits of its context window. After the analysis, it's prompted to evaluate (1) which of the results/sources are most relevant and which should be thrown out, (2) whether this research direction is promising and what follow-up questions are worth asking.
* If a Subagent is very eager to pursue a follow-up question, it can either run a subsequent search query (if there's enough space in the context window), or it's prompted to distill its current findings and replace itself with a next-iteration Subagent, in whose context it loads only the most important results + its analyses + the follow-up question.
* This is allowed up to some iteration count.
* Once all Subagents have completed their research, instantiate an Evaluator instance, into whose context window we dump the final results of each Subagent's efforts (distilling if necessary). The Evaluator integrates the information from all parallel research directions and determines whether the research prompt has been satisfactorily addressed, and if not, what follow-up questions are worth pursuing.
* The Evaluator's final conclusions are dumped into the Outer Loop's context (without the source documents, to not overload the context window).
* If the Evaluator did not choose to terminate, the next generation of Subagents is spawned, each prompted with whatever contexts are recommended by the Evaluator.
* Iterate, spawning further Evaluator instances and Subproblem instances as needed.
* Once the Evaluator chooses to terminate, or some c
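A minimal sketch of how that loop might be wired up, assuming generic call_model and web_search callables; the function names, prompts, and control tokens (NONE, DONE) below are illustrative placeholders rather than anything Thane specified.

```python
# Sketch of the Outer Loop / Subagent / Evaluator scaffold described above.
# `call_model` and `web_search` are hypothetical callables the reader supplies
# (wrappers around whatever completion and search APIs they actually use).

from dataclasses import dataclass, field
from typing import Callable, List

LLM = Callable[[str], str]           # prompt -> completion
Search = Callable[[str], List[str]]  # query  -> list of result snippets


@dataclass
class Subagent:
    direction: str                   # the research direction this agent owns
    notes: List[str] = field(default_factory=list)

    def research(self, call_model: LLM, web_search: Search, max_iters: int = 3) -> str:
        question = self.direction
        for _ in range(max_iters):                      # bounded iteration count
            results = web_search(question)
            analysis = call_model(
                f"Direction: {self.direction}\nResults: {results}\n"
                "Keep only the relevant sources, assess how promising this "
                "direction is, and state one follow-up question (or NONE)."
            )
            self.notes.append(analysis)
            if "NONE" in analysis:
                break
            # Distill and 'replace itself' with a next-iteration Subagent:
            question = call_model(
                f"Distill these findings and restate the follow-up question:\n{analysis}"
            )
        return call_model(f"Summarize the findings for '{self.direction}':\n{self.notes}")


def outer_loop(prompt: str, call_model: LLM, web_search: Search, max_rounds: int = 3) -> str:
    context = prompt
    for _ in range(max_rounds):
        # Decompose into parallel research directions.
        directions = call_model(
            f"Research prompt: {context}\nList a few promising research directions, one per line."
        ).splitlines()
        summaries = [
            Subagent(d.strip()).research(call_model, web_search)
            for d in directions if d.strip()
        ]
        # Evaluator: integrate results and decide whether to stop.
        verdict = call_model(
            f"Prompt: {prompt}\nFindings: {summaries}\n"
            "Is the prompt satisfactorily addressed? Answer DONE plus a final "
            "report, or list follow-up questions."
        )
        context = verdict            # only conclusions flow back, not source documents
        if verdict.startswith("DONE"):
            break
    return context
```

Matching the description above, only the Evaluator's conclusions are fed back into the Outer Loop's context; the Subagents' raw sources never are.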
Seth Herd*30

Yes, they do highlight this difference. I wonder how the full o3 scores? It would be interesting to know how much of the improvement is based on o3's improved reasoning and how much is the sequential research procedure.

4Vladimir_Nesov
And how much the improved reasoning is from using a different base model vs. different post-training. It's possible R1-like training didn't work for models below GPT-4 level, and then that same training started working at GPT-4 level (at which point you can iterate from a working prototype or simply distill to get it to work for weaker models). So it might work even better for the next level of scale of base models, without necessarily changing the RL part all that much.
3sweenesm
How o3-mini scores: https://x.com/DanHendrycks/status/1886213523900109011 10.5-13% on the text-only part of HLE (text-only questions are 90% of the questions). [Corrected the above to read "o3-mini", thanks.]

I feel a bit sad that the alignment community is so focused on intelligence enhancement. The chance of getting enough time for that seems so low that relying on it means accepting a low chance of survival.

What has convinced you that the technical problems are unsolvable? I've been trying to track the arguments on both sides rather closely, and the discussion just seems unfinished. My shortform on cruxes of disagreement on alignment difficulty still is mostly my current summary of the state of disagreements. 

It seems like we have very little idea how technically diff... (read more)

All of those. Value alignment is the set of all of the different proposed methods of giving AGI values that align with humanity's values.

Seth HerdΩ34-1

> we're really training LLMs mostly to have a good world model and to follow instructions

I think I mostly agree with that, but it’s less true of o1 / r1-type stuff than what came before, right? 

I think it's actually not any less true of o1/r1. It's still mostly predictive/world modeling training, with a dash of human-preference RL which could be described as following instructions as intended in a certain set of task domains. o1/r1 is a bad idea because RL training on a whole CoT works against faithfulness/transparency of the CoT.

If that's al... (read more)

2Steven Byrnes
I think I’ll duck out of this discussion because I don’t actually believe that o1/r1 will lead to full-fledged (1-3) loops and AGI, so it’s hard for me to clearly picture that scenario and engage with its consequences. Hmm. But the AI has a ton of wiggle room to make things seem good or bad depending on how things are presented and framed, right? (This old Stuart Armstrong post is a bit relevant.) If I ask “what will happen if we do X”, the AI can answer in a way that puts things in a positive light, or a negative light. If the good understanding lives in the AI and the good taste lives in the human, then it seems to me that nobody is at the wheel. The AI taste is determining what gets communicated to the human and how, right? What’s relevant vs irrelevant? What analogies are getting at what deeply matters versus what analogies are superficial? All these questions are value-laden, but they are prerequisites to the AI communicating its understanding to the human. Remember, the AI is doing the (1-3) thing to autonomously develop a new idiosyncratic superhuman understanding of AI and philosophy and society and so on, by assumption. Thus, AI-human communication is much harder and different than we’re used to today, and presumably requires its own planning and intention on the part of the AI. …Unless you’re actually in the §5.1.1 camp where the AI is helping clarify and brainstorm but is working shoulder-to-(virtual) shoulder, and the human basically knows everything the AI knows. I.e., like how people use foundation models today. If so, that’s fine, no complaints. I’m happy for people to use foundation models in a similar way that they do today, as they work on the big problem of how to make future more powerful AIs that run on something closer to ambitious value learning or CEV as opposed to corrigibility / obedience. Sorry if I’m misunderstanding or being stupid, this is an area where I feel some uncertainty.  :)

I see. I think about 99% of humanity at the very least are not so sadistic as to create a future with less than zero utility. Sociopaths are something like ten percent of the population, but like everything else it's on a spectrum. Sociopaths usually also have some measure of empathy and desire for approval. In a world where they've won, I think most of them would rather be hailed as a hero than be an eternal sadistic despot. Sociopathy doesn't automatically include a lot of sadism, just desire for revenge against perceived enemies.

So I'd take my chance... (read more)

1rvnnt
That's a good thing to consider! However, taking Earth's situation as a prior for other "cradles of intelligence", I think that consideration returns to the question of "should we expect Earth's lightcone to be better or worse than zero-value (conditional on corrigibility)?"
1rvnnt
IIUC, your model would (at least tentatively) predict that

* if person P has a lot of power over person Q,
* and P is not sadistic,
* and P is sufficiently secure/well-resourced that P doesn't "need" to exploit Q,
* then P will not intentionally do anything that would be horrible for Q?

If so, how do you reconcile that with e.g. non-sadistic serial killers, rapists, or child abusers? Or non-sadistic narcissists in whose ideal world everyone else would be their worshipful subject/slave? That last point also raises the question: Would you prefer the existence of lots of (either happily or grudgingly) submissive slaves over oblivion? To me it seems that terrible outcomes do not require sadism. Seems sufficient that P be low in empathy, and want from Q something Q does not want to provide (like admiration, submission, sex, violent sport, or even just attention).[1] I'm confused as to how/why you disagree.

----------------------------------------

1. Also, AFAICT, about 0.5% to 8% of humans are sadistic, and about 8% to 16% have very little or zero empathy. How did you arrive at "99% of humanity [...] are not so sadistic"? Did you account for the fact that most people with sadistic inclinations probably try to hide those inclinations? (Like, if only 0.5% of people appear sadistic, then I'd expect the actual prevalence of sadism to be more like ~4%.)
Answer by Seth Herd42

It seems like you're assuming people won't build AGI if they don't have reliable ways to control it, or else that sovereign (uncontrolled) AGI would be likely to be friendly to humanity. Both seem unlikely at this point, to me. It's hard to tell when your alignment plan is good enough, and humans are foolishly optimistic about new projects, so they'll probably build AGI with or without a solid alignment plan.

So I'd say any and all solutions to corrigibility/control should be published.

Also, almost any solution to alignment in general could probably be use... (read more)

1rvnnt
I'm assuming neither. I agree with you that both seem (very) unlikely.[1] It seems like you're assuming that any humans succeeding in controlling AGI is (on expectation) preferable to extinction? If so, that seems like a crux: if I agreed with that, then I'd also agree with "publish all corrigibility results".

----------------------------------------

1. I expect that unaligned ASI would lead to extinction, and our share of the lightcone being devoid of value or disvalue. I'm quite uncertain, though.

Thanks for the mention.

Here's how I'd frame it: I don't think it's a good idea to leave the entire future up to the interpretation of our first AGI(s). They could interpret our attempted alignment very differently than we hoped, in in-retrospect-sensible ways, or do something like "going crazy" from prompt injections or strange chains of thought leading to ill-considered beliefs that get control over their functional goals.

It seems like the core goal should be to follow instructions or take correction - corrigibility as a singular target (or at least prime... (read more)

Seth HerdΩ7181

I place this alongside the Simplicia/Doomimir dialogues as the farthest we've gotten (at least in publicly legible form) on understanding the dramatic disagreements on the difficulty of alignment.

There's a lot here. I won't try to respond to all of it right now.

I think the most important bit is the analysis of arguments for how well alignment generalizes vs. capabilities.

Conceptual representations generalize farther than sensory representations. That's their purpose. So when behavior (and therefore alignment) is governed by conceptual representations, it w... (read more)

4Steven Byrnes
Thanks! Yeah that’s what I was referring to in the paragraph: Separately, you also wrote: I think I mostly agree with that, but it’s less true of o1 / r1-type stuff than what came before, right? (See: o1 is a bad idea.) Then your reply is: DeepSeek r1 was post-trained for “correctness at any cost”, but it was post-post-trained for “usability”. Even if we’re not concerned about alignment faking during post-post-training (should we be?), I also have the idea at the back of my mind that future SOTA AI with full-fledged (1-3) loops probably (IMO) won’t be trained in the exact same way than present SOTA AI, just as present SOTA AI is not trained in the exact same way as SOTA AI as recently as like six months ago. Just something to keep in mind. Anyway, I kinda have three layers of concerns, and this is just discussing one of them. See “Optimist type 2B” in this comment.

It does seem to imply that, doesn't it? I respect the people leaving, and I think it does send a valuable message. And it seems very valuable to have safety-conscious people on the inside.

Raemon137

The question is "are the safety-conscious people effectual at all, and what are their opportunity costs?".

i.e. are the cheap things they can do that don't step on anyone's toes that helpful-on-the-margin, better than what they'd be able to do at another company? (I don't know the answer, depends on the people).

This is the way most people feel about writing. I do not think wonderful plots are ten a penny; I think writers are miserable at creating actually good plots from the perspective of someone who values scifi and realism. Their technology and their sociology are usually off in obvious ways, because understanding those things is hard.

I would personally love to see more people who do understand science use AI to turn that understanding into stories.

Or alternately I'd like to see skilled authors consult AI about the science in their stories.

This attitude that plots don't mat... (read more)

The better framing is almost certainly "how conscious is AI in which ways?"

The question "if AI is conscious" is ill-formed. People mean different things by "consciousness". And even if we settled on one definition, there's no reason to think it would be an either-or question; like all most other phenomena, most dimensions of "consciousness" are probably on a continuum.

We tend to assume that consciousness is a discrete thing because we have only one example, human consciousness, and ultimately our own. And most people who can describe their consciousness a... (read more)

7Kristaps Zilgalvis
The article is a meta analysis of consciousness research rather than an analysis of whether or not AI is conscious. I discuss the assumptions various disciplines hold in the article.

I agree with basically everything you've said here. 

Will LLM-based agents have moral worth as conscious/sentient beings?

The answer is almost certainly "sort of". They will have some of the properties we're referring to as sentient, conscious, and having personhood. It's pretty unlikely that we're pointing to a nice sharp natural type when we ascribe moral patienthood to a certain type of system. Human cognition is similar to and different from other systems in a variety of ways; which of these is "worth" moral concern is likely to be a matter of preferen... (read more)

Agreed and well said. Playing a number of different strategies simultaneously is the smart move. I'm glad you're pursuing that line of research.

Sorry if I sound overconfident. My actual considered belief is that AGI this decade is quite possible, and it is crazy overconfident in longer timeline predictions to not prepare seriously for that possibility.

Multigenerational stuff needs a way longer timeline. There's a lot of space between three years and two generations.

I buy your argument for why dramatic enhancement is possible. I just don't see how we get the time. I can barely see a route to a ban, and I can't see a route to a ban thorough enough to prevent reckless rogue actors from building AGI within ten or twenty years.

And yes, this is crazy as a society. I really hope we get rapidly wiser. I think that's possible; look at the way attitudes toward COVID shifted dramatically in about two weeks when the evidence became apparent, and people convinced their friends rapidly. Connor Leahy made some really good points abo... (read more)

2TsviBT
1. You people are somewhat crazy overconfident about humanity knowing enough to make AGI this decade. https://www.lesswrong.com/posts/sTDfraZab47KiRMmT/views-on-when-agi-comes-and-on-strategy-to-reduce
2. One hope on the scale of decades is that strong germline engineering should offer an alternative vision to AGI. If the options are "make supergenius non-social alien" and "make many genius humans", it ought to be clear that the latter is both much safer and gets most of the hypothetical benefits of the former.
GeneSmith148

This is why I wrote a blog about enhancing adult intelligence at the end of 2023; I thought it was likely that we wouldn't have enough time.

I'm just going to do the best I can to work on both these things. Being able to do a large number of edits at the same time is one of the key technologies for both germline and adult enhancement, which is what my company has been working on. And though it's slow, we have made pretty significant progress in the last year including finding several previously unknown ways to get higher editing efficiency.

I still think the... (read more)

Seth Herd104

He just started talking about adopting. I haven't followed the details. Becoming a parent, including an adoptive parent who takes it seriously, is often a real growth experience from what I've seen.

2Milan W
That is good news. Thanks.

Oh, I agree. I liked his framing of the problem, not his proposed solution.

In that regard specifically:

If the main problem with humans being not-smart-enough is being overoptimistic, maybe just make some organizational and personal belief changes to correct this?

IF we managed to get smarter about rushing toward AGI (a very big if), it seems like an organizational effort with "let's get super certain and get it right the first time for a change" as its central tenet would be a big help, with or without intelligence enhancement.

I very much doubt any major in... (read more)

1Milan W
Context: He is married to a cis man. Not sure if he has spoken about considering adoption or surrogacy.