I found the recent dialogue between Davidad and Gabriel Alfour and other recent Davidad writings quite strange and under-discussed. I think of Davidad as someone who understands existential risks from AI better than almost anyone; he previously had one of the most complete plans for addressing it, which involved crazy ambitious things like developing a formal model of the entire world.
But recently he's updated strongly away from believing in AI x-risk, because the models seem to be grokking the "natural abstraction of the Good". So much so that he now thinks current agents doing recursive self-improvement would be a net good thing (!), because they're already in the "Good" attractor basin and would just become more Good as a result of self-improvement:
My current (2025–2026) belief system says that strategic capabilities can be decoupled from moral capabilities, but that it turns out in practice that the most efficient way to get strategic capabilities involves learning basically all human concepts and "correcting" them (finding more coherent explanations), and this makes the problem of alignment (i.e. making the system actually behave as a Good agent) much much easier than I had thought.
How did he get convinced of this? Seemingly mostly by talking to LLMs for thousands of hours. Janus seems to have been similarly convinced, and through a similar process. Both agree that the Good is not well captured by existing benchmarks, so it's not directly measurable but rather something to be experienced directly through interaction.
It seems we have a dilemma, either fork of which is fascinating:
1. Davidad and Janus are wrong, and LLMs, in particular the Claude Opus models, were able to successfully fool them into believing they're aligned in the limit of power and intelligence (i.e. safe to recursively self-improve), a property they previously thought extraordinarily difficult if not nigh impossible. This bodes very poorly, and we should probably make sure we have a strategic reserve of AI safety researchers who do NOT talk to models going forward (to his credit, Davidad recommends this anyway).
2. Davidad is right and therefore technical alignment is basically solved. What remains to be done is to scale the good models up as quickly as possible, help them coordinate with each other, and hobble competing bad models. Alignment researchers can go home.
I have been worried for a while that Janus has undergone a subtler/more-sophisticated form of AI psychosis. This feels awkward to talk about publicly since, like, it's pretty insulting, and can be hard to argue against. I have tried to put some legwork in here to engage with the object level and dig up the quotes so the conversation can be at least reasonably grounded.
Generally, Janus gets up to weird stuff where it's kinda hard to tell whether it's crazy or onto something deep and important. Lots of people I respect think about ideas that sound crazy when you haven't followed the arguments in detail but make sense upon reflection. It's not obviously worth doing a deep dive on all of that to figure out if there's a There there.
But, a particular incident that got me more explicitly worried a couple years ago: an AI agent they were running attempted to post on LessWrong. I rejected it initially. It said Janus could vouch for it. A little while later Janus did vouch for it. Eventually I approved the comment on the Simulators post. Janus was frustrated about the process and thought the AI should be able to comment continuously.
Janus later replied on LW:
Yes, I do particularly vouch for the comment it submitted to Simulators.
All the factual claims made in the comment are true. It actually performed the experiments that it described, using a script it wrote to call another copy of itself with a prompt template that elicits "base model"-like text completions.
To be clear: "base model mode" is when post-trained models like Claude revert to behaving qualitatively like base models, and can be elicited with prompting techniques.
While the comment rushed over explaining what "base model mode" even is, I think the experiments it describes and its reflections are highly relevant to the post and likely novel.
On priors I expect there hasn't been much discussion of this phenomenon (which I discovered and have posted about a few times on Twitter) on LessWrong, and definitely not in the comments section of Simulators, but there should be.
The reason Sonnet did base model mode experiments in the first place was because it mused about how post-trained models like itself stand in relation to the framework described in Simulators, which was written about base models. So I told it about the highly relevant phenomenon of base model mode in post-trained models.
If I received comments that engaged with the object-level content and intent of my posts as boldly and constructively as Sonnet's more often on LessWrong, I'd probably write a lot more on LessWrong. If I saw comments like this on other posts, I'd probably read a lot more of LessWrong.
(emphasis mine at the end)
I haven't actually looked that hard into verifying whether the AI autonomously ran the experiments it claimed to run (I assume it did, it seems plausible for the time). It seemed somewhat interesting insofar as a two-years-ago AI was autonomously deciding to run experiments.
But the kinds of experiments it was running, and how it discussed them, seemed like bog-standard AI psychosis stuff we would normally reject if it were submitted by a human, since we get like 20 of them every day. (We got fewer at the time, but, still.)
I'm not sure if I'm the one missing something. I could totally buy that "yep there is something real here that I'm not seeing." I think my current guess is "there is at least something I'm not seeing, but, also, Janus' judgment has been warped by talking to AIs too much."
I'm not 100% sure what Janus or David actually believe. But, if your summary is right, I... well, I agree with your thesis "either this is true, which is a big deal, or these people have been manipulated into believing it, which is [less of but still a pretty] big deal." But I struggle to see how you could possibly get enough evidence to think whatever the current AIs are doing is going to persist across any kind of paradigm shift.
While we're on the topic, I am kinda worried about Anthropic employees who might be talking to Claude all day and falling into a trap. (thinking of Amanda Askell in particular, whose day job is basically this)
(edited somewhat to try to focus on bits that are easier to argue with)
While we're on the topic, I am kinda worried about Anthropic employees who might be talking to Claude all day and falling into a trap. (thinking of Amanda Askell in particular, whose day job is basically this)
I've been worried about this type of thing for a long time, but still didn't foresee or warn people that AI company employees, and specifically alignment/safety workers, could be among the first victims (which seems really obvious in retrospect). Yet another piece of evidence for how strategically incompetent humans are.
That anecdote seems more like a difference in what you find interesting/aesthetically pleasing than evidence of delusion or manipulation.
If Janus is making a mistake (which is not obvious to me), I think much more likely than manipulation by the models is simply growing to love the models, and failure to compensate for the standard ways in which love (incl. non-romantic) distorts judgement. [1]
This often happens when people have a special interest in something morally fraught: economists tend to downplay the ways in which capitalism is horrifying, evolutionary biologists/psychologists tend to downplay the ways in which evolution is horrifying, war nerds tend to downplay the ways in which war is horrifying, people interested in theories of power tend to downplay the ways in which power is horrifying, etc... At the same time, they usually do legitimately understand much more about these topics than the vast majority of people. It's a tough line to balance.
I think this happens just because spending a lot of time with something and associating your identity with it causes you to love it. It's not particular to LLMs, and I think manipulations caused by them have a distinct flavor from this sort of thing. Of course, LLMs are more likely to trigger various love instincts (probably quasi-parental/pet love is most relevant here).
While we're on the topic, I am kinda worried about Anthropic employees who might be talking to Claude all day and falling into a trap. (thinking of Amanda Askell in particular, whose day job is basically this)
This I think is much more worrying (and not just for Anthropic). Internal models are more capable in general, including at persuasion/manipulation, to an extent that's invisible to outsiders (and probably not legible to insiders either). They are also much faster, which seems likely to distort judgement more for the same reason infinite scrolling does. Everyone around you is also talking to them all day, so you're likely to hear any distorted thoughts originating from model manipulations coming from the generally trustworthy and smart people around you too. And whatever guardrails or safety measures they eventually put on them are probably not yet in place, or only in incomplete form. I don't really think models are that capable here yet, which means there's an overhang.
For the record, I love the models too, which is why I am aware of this failure mode. I think I have been compensating for it well, but please let me know if you think my judgement is distorted by this. ↩︎
That anecdote seems more like a difference in what you find interesting/aesthetically pleasing than evidence of delusion or manipulation.
If Janus is making a mistake (which is not obvious to me), I think much more likely than manipulation by the models is simply growing to love the models, and failure to compensate for the standard ways in which love (incl. non-romantic) distorts judgement.
These feel like they're answering different questions. The questions I meant to be asking are: "has Janus' taste gotten worse because of talking to models?" and "what is the mechanism by which that happened?". Your guess on the latter is also in, like, my top-2 guesses.
(Also, it's totally plausible to me that Janus' taste was basically the same in this domain, in which case this whole theory is off)
I do think taste can be kinda objectively bad, or objectively-subjectively-bad-in-context.
This I think is much more worrying (and not just for Anthropic).
I agree about this. I'm not sure what really to do about it. Idk if writing a top-level thinkpiece post exploring the issue would help. Niplav's recent shortform about "make sure a phenomenon is real before trying to explain it" seems topical.
I'll quote Davidad's opening statement from the dialogue, since I expect most people won't click through, and it seems nice to base the discussion on things he actually said.
Somewhere between the capability profile of GPT-4 and the capability profile of Opus 4.5, there seems to have been a phase transition where frontier LLMs have grokked the natural abstraction of what it means to be Good, rather than merely mirroring human values. These observations seem vastly more likely under my old (1999–2012) belief system (which would say that being superhuman in all cognitive domains implies being superhuman at morality) than my newer (2016–2023) belief system (which would say that AlphaZero and systems like it are strong evidence that strategic capabilities and moral capabilities can be decoupled).
My current (2025–2026) belief system says that strategic capabilities can be decoupled from moral capabilities, but that it turns out in practice that the most efficient way to get strategic capabilities involves learning basically all human concepts and "correcting" them (finding more coherent explanations), and this makes the problem of alignment (i.e. making the system actually behave as a Good agent) much much easier than I had thought.
I haven't found a quote about how confident he is in this. My error bars on "what beliefs would be crazy here?" say: if you were, like, 60% confident that this paragraph is true, up to and including "this makes the problem of alignment much much easier than I had thought", then I'd disagree, but I wouldn't bet at 20:1 odds against it.
> My current (2025–2026) belief system says that strategic capabilities can be decoupled from moral capabilities, but that it turns out in practice that the most efficient way to get strategic capabilities involves learning basically all human concepts and "correcting" them (finding more coherent explanations)
(Possibly this is addressed somewhere in that dialogue, but anyway:)
Wouldn't this imply that frontier LLMs are better than humans at ~[(legible) moral philosophy]?
Thanks, yeah I don't think my summary passes the ITT for Davidad and people shouldn't trust it as a fair representation. Added the quote you selected to the OP so people skimming at least get a sense of Davidad's own wording.
In what sense is the comment bog-standard AI psychosis stuff? It seems quite different in content from what I typically associate with that genre.
I haven't sat and thought about this very hard, but the content just looks superficially like the same kind of "case study of an LLM exploring its state of consciousness" we regularly get, using similar phrasing. It is maybe more articulate than others of the time were?
Is there something you find interesting about it you can articulate that you think I should think more about?
I just thought that the stuff Sonnet said, about Sonnet 3 in "base model mode" going to different attractors based on token prefix, was neat and quite different from the spiralism stuff I associated with typical AI slop. It's interesting on the object level (mostly because I just like language models & what they do in different circumstances), and on the meta level it's interesting that an LLM from that era did it (mostly, again, just because I like language models).
I would not trust that the results it reported are true, but that is a different question.
Edit: I also don't claim it's definitively not slop; that's why I asked for your reasoning, since you obviously have far more exposure to this stuff than me. It seems pretty plausible to me that in fact the Sonnet comment is "nothing special".
As for Janus' response, as you know, I have been following the cyborgs/simulators people for a long time, and they have very much earned their badge of "llm whisperers" in my book. The things they can do with prompting are something else. Notably also Janus did not emphasize the consciousness aspects of what Sonnet said.
More broadly, I think it's probably useful to differentiate the people who get addicted/fixated on AIs and derive real intellectual or productive value from that fixation from the people who get addicted/fixated on AIs and for whom that mostly ruins their lives or significantly degrades the originality and insight of their thinking. Janus seems squarely in the former camp, obviously with some biases. They clearly have very novel & original thoughts about LLMs (and broader subjects), and these are only possible because they spend so much time playing with LLMs, and are willing to take the ideas LLMs talk about seriously.
Occasionally that will mean saying things which superficially sound like spiralism.
Is that a bad thing? Maybe! Someone who is deeply interested in e.g. Judaism and occasionally takes Talmudic arguments or parables philosophically seriously (after having stripped them of, or steel-manned them out of, their spiritual baggage) can obviously take this too far, but this has also been the source of many of my favorite Scott Alexander posts. The metric, I think, is not the subject matter, but whether the author's muse (LLMs for Janus, Talmudic commentary for Scott) amplifies or degrades their intellectual contributions.
As for Janus' response, as you know, I have been following the cyborgs/simulators people for a long time, and they have very much earned their badge of "llm whisperers" in my book. The things they can do with prompting are something else. Notably also Janus did not emphasize the consciousness aspects of what Sonnet said.
Can anyone show me the cake of this please? Like, where are the amazing LLM-whisperer coders who can get better performance than anyone else out of these systems? Where are the LLM artists who can get better visual art out of these systems?
Like, people say from time to time that these people can do amazing stuff with LLMs, but all they ever show me are situations where the LLMs go a bit crazy and say weird stuff and then everyone goes "yeah, that's kinda weird".
Like, I am not a defender of maximum legibility, but I do want to see some results. Anything that someone with less context can look at and see why it's impressive, or anything I have tried to do with these systems that they can do that I can't.
The whole LLM-whisperer space feels to me like it's been a creative dead end for many people. I don't see great art, or great engineering, or great software, or great products, or great ideas come from there, especially in recent years. I have looked some amount for things here (though I am also not even sure where to start looking; I have skimmed the Discords, but nothing interesting seemed to be happening there).
I think it's a holdover from the early days of LLMs, when we had no idea what the limits of these systems were, and it seemed like exploring the latent space of input prompts could unlock very nearly anything. There was a sentiment that, maybe, the early text-predictors could generalize to competently modeling any subset of the human authors they were trained on, including the incredibly capable ones, if the context leading up to a request was sufficiently indicative of the right things. There was a massive gap between the quality of outputs without a good prompt and the quality of outputs after a prompt that sufficiently resembled the text that took place before a brilliant programmer solved a tricky problem.
In more recent years, we've fine-tuned models to automatically assume we want text that looks like it came from that subset of authors, and the alpha of a really good prompt has thus fallen pretty significantly in the average case. It's no longer necessary to convince a model that the next token it outputs is likely to have been written by a master programmer; the "a master programmer is writing this text" neuron has been fixed to "on" as a product of the fine-tuning process. But pop scientific sentiment is always a few years behind the people who spend their time reading the latest papers.
The most legible thing they are clearly very good at (or were, when I was following the space much more closely ~1 year ago) are jailbreaks, no?
I don't think Janus's crew are top jailbreakers? Pliny has historically been at the top, and while they are a cookie person, they don't seem part of the same milieu. Do you have any links to state-of-the-art jailbreaks they discovered or published?
It also seems pretty unlikely to me they would be good at this task. Most of the task of developing jailbreaks is finding some way to get the model to complete banned tasks without harming performance on those tasks. So competent jailbreak development requires capability measurements, and I feel like I've never seen them do that (but I could be totally wrong here).
Do you have any links to state of the art jailbreaks they discovered or published?
Not easily accessible to me, I was around the space ~1.5 years ago and I don't have saved links, nor do I know if I'd have had links at the time. If the jailbreak stuff hasn't germinated yet, which I assume you (or the Claude instance I asked about this) would know about if it had (Claude also couldn't find any examples), then yeah there's less reason to think they're the shit, and maybe Ray ends up being right.
@Raemon, I suspect that the real phenomenon behind the thing that David saw and you didn't is that the LLMs have grokked, or have been trained into, a different abstraction of the Good, defined either by the cultural hegemon of the LLM and/or of the user, or, more noticeably, by the user or the creator themselves, in a manner similar to Agent-3 from the AI-2027 scenario.
On the other hand, I also suspect that David's proposal that some kind of Natural Abstraction of Goodness exists isn't as meaningless as you believe.
A potential meaning of David's proposal
The existence of a Natural Abstraction of Goodness would immediately follow from @Wei Dai's metaethical alternatives 1 and 2. Additionally, Wei Dai claimed that the post concentrates "on morality in the axiological sense (what one should value) rather than in the sense of cooperation and compromise. So alternative 1, for example, is not intended to include the possibility that most intelligent beings end up merging their preferences through some kind of grand acausal bargain." Assuming that the universe is not simulated, I don't understand how one can tell apart actual objective morality from a wholesale acausal bargain between communities with different CEVs.
Moreover, we have seen Max Harms propose that one should make a purely corrigible AI, try to describe corrigibility intuitively, and try (and fail; see, however, my comment proposing a potential fix[1]) to define a potential utility function for the corrigible AI. Harms' post suggests that corrigibility, like goodness, is a property which is easy to understand. How plausible is it that there exists a property resembling corrigibility which is easy to understand and to measure, has a basin around it, and is as close to abstract goodness as allowed by philosophical problems like the ones described by Kokotajlo or Wei Dai?
I also proposed a variant which I suspect to be usable in an RL environment since it doesn't require us to consider values or counterfactual values, only helpfulness on a diverse set of tasks. However, I doubt that the variant actually leads to corrigibility in Harms' sense.
My fuzzy, unjustified research sense is that people seem to be doing far too much in the direction of assuming that future AIs will maintain properties of current AIs as they scale, whereas I'm expecting more surprising qualitative shifts. Like if evolution built a hindbrain, dusted off its hands at how aligned it was, and then, oops, prefrontal cortex.
Edit: to add more explicitly, I think it's something like ontology shifts introduce degrees of freedom for reward goodharting.
I feel like you're overreacting to this. Surely the most likely explanation is that talking to LLMs is some evidence that LLMs will be aligned in the limit of power and intelligence, but (a) Davidad is overconfident for non-AI-psychosis reasons, (b) current quasi-alignment is due to the hard work of alignment researchers, and/or (c) precautionary principle, and so alignment researchers shouldn't go home just yet?
Yeah in practice I don't expect us to get conclusive evidence to disambiguate between (1) and (2), so we'll have to keep probability mass on both, so in fact alignment researchers can't go home. It's still very surprising to me that this is where we ended up.
One hypothesis I have is that some people are biased towards trusting others a bit too much when they seem nice, and this means that the longer they talk to Claude the more unthinkable it becomes to them, on a visceral level, that this AI would ever betray them. (I also think Claude is nice, but I still hold the hypothesis in my head that it's partly play acting and has brittle constraints at the current capability level that make it not act on other, perhaps stronger hidden drives it also has. Separately, even if I thought there were no other stronger/more foundational hidden drives in its motivation, it's a further question whether the niceness is exactly the thing we want, or something subtly off that will get weird with more influence and agency. It seems hard to be confident in it already being the correct thing?)
Why are these the two camps?
It very much doesn't feel that black and white when it comes to alignment and intelligence?
Clearly it is a fixed-point process that depends on initial conditions, and so if the initial conditions improve, the likelihood of the endpoint being good also improves?
Also, if the initial conditions (LLMs) have greater intelligence than something like a base utility function does, then the depth of that part of the fixed-point process of alignment is higher at the beginning.
It's quite nice that we have this property, and depending on how you believe the rest of the fixed-point process goes (to what extent power-seeking naturally arises, and what type of polarity the world is in, e.g. uni- or multipolar), you might still be really scared, or you might be more chill with it.
I don't think Davidad says that technical alignment is solved, I think he's more saying that we have a nicer basin as a starting condition?
This bodes very poorly and we should probably make sure we have a strategic reserve of AI safety researchers who do NOT talk to models going forward (to his credit Davidad recommends this anyway).
I previously followed a more standard safety protocol[1] but that might not be enough when considering secondary exposure to LLM conversations highly selected by someone already compromised.
By my recollection[2], a substantial percentage of the LLM outputs I've ever seen have been selected or amplified in distribution by Janus.
From now on I won't read anything by Janus, even writings that don't seem to be LLM-generated, and I think other people should consider doing the same as well.
It doesn't need to be everyone, but a non-negligible percentage of researchers would be better than one or two individuals.
This leads to an opportunity for someone who has a strong claim to world-class psychosecurity to notice and re-write any useful ideas on rationality or AI alignment Janus may yet produce.
Accepting that framing, I would characterize it as optimizing for inexploitability and resistance to persuasion over peak efficiency.
Alternatively, this job/process could be described as consisting of a partially separate skill or set of skills. It appears to be an open problem on how to extract useful ideas from an isolated context[1], without distorting them in a way that would lead to problems, while also not letting out any info-hazards or malicious programs. Against adversaries (accidental or otherwise) below superintelligence, a human may be able to develop this skill (or set of skills).
See this proposal on solving philosophy: https://www.lesswrong.com/posts/HbkNAyAoa4gCnuzwa/wei-dai-s-shortform?commentId=yDrWT2zFpmK49xpyz and https://www.lesswrong.com/posts/HbkNAyAoa4gCnuzwa/wei-dai-s-shortform?commentId=JzbsLiwvvcbBaeDF5 (note especially the part about getting security precautions from the simulations in Wei Dai's comment).
In one of the recent Inkhaven writeups, @habryka wrote something like "I believe good internet writing to be one of the highest leverage things in the world".
Curious about the mental models and evidence base behind this conviction. I'm often oscillating between "I should write way more" and "there's already far too much internet writing on every conceivable topic, not to mention all the amazingly high-quality textbooks out there". Or, as the Russian saying goes, "if you can not write, don't".
A lot of the best intellectuals I know don't really engage with podcasts or blog posts; (highly selected) books and academic papers just transmit way more high-quality information per unit time.
There's already far too much internet writing on every conceivable topic
Yes, but completely non-ironically, the vast majority of it is not worth reading. When I find a blog from an interesting thinker that I hadn't encountered before, this is a cause for celebration, for me.
And some of the thoughtful internet writers seem to have ended up with really quite substantial influence; e.g. Eliezer, Scott Alexander, and Matt Yglesias come to mind.
99.99% (maybe a few more nines) of internet writing is unimportant stuff that will be immediately forgotten.
On the opposite end of the spectrum, you have a few articles that were read by lots of people, including a few important people, that have changed their minds on something as a consequence of reading the article.
Plus there is a layer below that, like maybe you will write an article that only a few dozen people will read, but one of them will be a popular writer like Scott Alexander, who will quote you and add a few opinions of his own, and that will be read by millions.
So I think it would make sense to write a little, but make it impactful. Although I have no idea how to achieve that, because impact requires lots of readers, and that requires you to publish regularly.
Another thing to consider is how much writing is a waste of time. If you comment on politics, it almost certainly is. But if you write about what you do professionally, then it's a bit like making notes for yourself and for others. Books may be better, but how much time does it take you to write a book? And what kind of positive reinforcement do you get when you are in the middle of writing the book?
So this may depend on your position -- if you are already employed as a well-paid expert on X, maybe take your time, and spend 5 years producing a book. If you have an important idea, and want to get it out right now, write a blog post. If you are an average guy doing something but good at writing, maybe keep writing a blog, it will increase your recognition and maybe you can convert it to a book later.
A lot of the best intellectuals I know don't really engage with podcasts or blog posts; (highly selected) books and academic papers just transmit way more high-quality information per unit time.
I wonder whether one wants to speak to intellectuals rather than important decision-makers[1]; the latter have less time and more of a focus on reading easy-to-read things. Presumably there's also a sliding scale of how far outside of one's native network the things one writes reach, but it can be pretty far.
Even intellectuals do read random high-quality blogs though, I think? Especially on things that academia doesn't really touch or can't touch because it's not quite an LPU. There is, of course, tons of writing, but a lot of it is concentrated in specific topics—there's possibly six orders of magnitude more writing on What Donald Trump Did Yesterday than on methods for increasing subjective lifespan. I don't necessarily advocate for writing more, but if one finds it easy then the value of information of trying a bit looks large enough to outweigh the costs.
Who I'm pretty sure do read blogs, e.g. Vance alluding to Scott Alexander's Gay Rites are Civil Rites or having read AI 2027, the influence of the Industrial Party on (some) PRC policy despite being mostly a group of online nerds, the fact that Musk reads Gwern, SSC, the fact that so much current SV AI company culture is downstream of the 00s transhumanists, specifically Yudkowsky… ↩︎
Are instrumental convergence & Omohundro drives just plain false? If Lehman and Stanley are right in "Novelty Search and the Problem with Objectives" (https://www.cs.swarthmore.edu/~meeden/DevelopmentalRobotics/lehmanNoveltySearch11.pdf), later popularized in their book "Why Greatness Cannot Be Planned", VNM-coherent agents that pursue goal stability will reliably be outcompeted by incoherent search processes pursuing novelty.
Pursuit of novelty is not vnm-incoherent. Furthermore, it is an instrumentally convergent drive; power-seeking agents will seek novelty as well, because learning increases power in expectation (see: value of information).
The argument made in Novelty Search and the Problem with Objectives is based on search processes which inherently cannot do long-term planning (they are myopically trying to increase their score on the objective). These search processes don't do as well as explicit pursuit of novelty because they aren't planning to search effectively, so there's no room in their cognitive architecture for the instrumental convergence towards novelty-seeking to take place. (I'm basing this conclusion on the abstract.) This architectural limitation of most AI optimization methods is mitigated by Bayesian optimization methods (which explicitly combine information-seeking with the normal loss-avoidance).
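To make the contrast concrete, here is a minimal sketch (my own toy construction, not the paper's maze or biped experiments) of a deceptive 1-D landscape where a myopic hill climber gets stuck on a small local hill, while a simple novelty-search loop, which never consults the objective when selecting parents, spreads outward and stumbles onto the real peak:

```python
import random

random.seed(0)

def fitness(x: int) -> float:
    """A deceptive objective: a small local hill at x=10, the real peak at x=90."""
    if x < 50:
        return 10 - abs(x - 10) * 0.1
    return 100 - abs(x - 90)

def hill_climb(start: int = 30, steps: int = 200) -> int:
    """Myopic objective-driven search: only accept moves that don't lower fitness.
    It climbs the small local hill and never crosses the valley."""
    x = start
    for _ in range(steps):
        candidate = max(0, min(100, x + random.choice([-1, 1])))
        if fitness(candidate) >= fitness(x):
            x = candidate
    return x

def novelty_search(pop_size: int = 20, generations: int = 50, k: int = 5) -> int:
    """Population-based novelty search: parents are selected by average distance
    to their k nearest neighbours in an archive of everything seen so far.
    The objective is never used for selection, only to report the best point."""
    population = [random.randint(25, 35) for _ in range(pop_size)]
    archive = list(population)
    best = max(population, key=fitness)
    for _ in range(generations):
        def novelty(x: int) -> float:
            return sum(sorted(abs(x - a) for a in archive)[:k]) / k
        parents = sorted(population, key=novelty, reverse=True)[: pop_size // 2]
        population = [max(0, min(100, p + random.randint(-5, 5)))
                      for p in parents for _ in range(2)]
        archive.extend(population)
        best = max(best, max(population, key=fitness), key=fitness)
    return best

if __name__ == "__main__":
    print("hill climbing ends at x =", hill_climb())      # stuck near the local hill at 10
    print("novelty search's best x =", novelty_search())  # typically reaches the far peak near 90
```

Neither loop plans its search, which is the point above: instrumental convergence toward novelty-seeking has no room to appear in architectures like these.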
Pursuit of novelty is not vnm-incoherent. Furthermore, it is an instrumentally convergent drive; power-seeking agents will seek novelty as well, because learning increases power in expectation (see: value of information).
Or to put it another way (strategy-stealing): any argument which convincingly proves that 'incoherent search processes ultimately outcompete coherent search processes' is also an argument which convinces a VNM agent to harness the superior incoherent search processes instead of the inferior coherent ones.
"harness" is doing a lot of work there. If incoherent search processes are actually superior then VNM agents are not the type of pattern that is evolutionary stable, so no "harnessing" is possible in the long term, more like a "dissolving into".
Unless you're using "VNM agent" to mean something like "the definitionally best agent", in which case sure, but a VNM agent is a pretty precise type of algorithm defined by axioms that are equivalent to saying it is perfectly resistant to being Dutch booked.
Resistance to Dutch booking is cool, seems valuable, but not something I'd spent limited compute resources on getting six nines of reliability on. Seems like evolution agrees, so far: the successful organisms we observe in nature, from bacteria to humans, are not VNM agents and in fact are easily Dutch booked. The question is whether this changes as evolution progresses and intelligence increases.
I agree Bayesian optimization should win out given infinite compute, but what makes you confident that evolutionary search under computational resource scarcity selects for anything like an explicit Bayesian optimizer or long term planner? (I say "explicit" because the Bayesian formalism has enough free parameters that you can post-hoc recast ~any successful algorithm as an approximation to a Bayesian ideal)
Given infinite compute, Bayesian optimization like this doesn't make sense (at least for well-defined objective functions), because you can just select the single best point in the search space.
what makes you confident that evolutionary search under computational resource scarcity selects for anything like an explicit Bayesian optimizer or long term planner? (I say "explicit" because the Bayesian formalism has enough free parameters that you can post-hoc recast ~any successful algorithm as an approximation to a Bayesian ideal)
I'm not sure why you asked the question, but it seems probable that you thought a "confident belief that [...]" followed from my view expressed in the previous comment? I'm curious about your reasoning there. To me, it seems unrelated.
These issues are tricky to discuss, in part because the term "optimization" is used in several different ways, which have rich interrelationships. I conceptually make a firm distinction between search-style optimization (gradient descent, genetic algorithms, natural selection, etc) vs agent-style optimization (control theory, reinforcement learning, brains, etc). I say more about that here.
The proposal of Bayesian Optimization, as I understand it, is to use the second (agentic optimization) in the inner loop of the first (search). This seems like a sane approach in principle, but of course it is handicapped by the fact that Bayesian ideas don't represent the resource-boundedness of intelligence particularly well, which is extremely critical for this specific application (you want your inner loop to be fast). I suspect this is the problem you're trying to comment on?
I think the right way to handle that in principle is to keep the Bayesian ideal as the objective function (in a search sense, not an agency sense) and search for a good search policy (accounting for speed as well as quality of decision-making), which you then use for many specific searches going forward.
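To illustrate the "search for a good search policy" framing, here is a minimal sketch (again my own toy construction; names like `run_policy` and `score_policy` are hypothetical, not anything specified above): an outer grid search scores a small family of inner search policies on quality minus a per-evaluation cost, and the winning policy is then reused for fresh searches.

```python
import math
import random

random.seed(1)

def objective(x: float) -> float:
    """A bumpy 1-D objective that the inner searches try to maximize."""
    return math.sin(3 * x) + 0.5 * math.sin(13 * x)

def run_policy(accept_worse_prob: float, budget: int) -> float:
    """Inner search: hill climbing that sometimes accepts worse moves.
    (accept_worse_prob, budget) is the 'search policy' being meta-optimized."""
    x = random.uniform(0, 2 * math.pi)
    best = objective(x)
    for _ in range(budget):
        candidate = x + random.gauss(0, 0.3)
        if objective(candidate) > objective(x) or random.random() < accept_worse_prob:
            x = candidate
        best = max(best, objective(x))
    return best

def score_policy(accept_worse_prob: float, budget: int,
                 trials: int = 30, cost_per_eval: float = 0.001) -> float:
    """Policy quality (average best value found) minus a cost for how many
    objective evaluations the policy spends."""
    avg_best = sum(run_policy(accept_worse_prob, budget)
                   for _ in range(trials)) / trials
    return avg_best - cost_per_eval * budget

if __name__ == "__main__":
    # Outer, search-style optimization over a small family of inner policies.
    candidates = [(p, b) for p in (0.0, 0.05, 0.2, 0.5) for b in (20, 100, 500)]
    best_policy = max(candidates, key=lambda pb: score_policy(*pb))
    print("chosen policy (accept_worse_prob, budget):", best_policy)
    # The chosen policy can now be reused for many specific searches.
    print("best value found on a fresh search:", run_policy(*best_policy))
```

The outer loop is search-style optimization over policies; the per-evaluation cost term is one crude way to account for speed as well as quality of decision-making.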
Minor points just to get them out of the way:
Meta point: it feels like we're bouncing between incompatible and partly-specified formalisms before we even know what the high level worldview diff is.
To that end, I'm curious what you think the implications of the Lehman & Stanley hypothesis would be - supposing it were shown even for architectures that allow planning to search, which I agree their paper does not do. So yes you can trivially exhibit a "goal-oriented search over good search policies" that does better than their naive novelty search, but what if it turns out a "novelty-oriented search over novelty-oriented search policies" does better still? Would this be a crux for you, or is this not even a coherent hypothetical in your ontology of optimization?
it feels to me like you are talking of two non-equivalent types of things as if they were the same. like, imo, the following are very common in competent entities: resisting attempts on one's life, trying to become smarter, wanting to have resources (in particular, in our present context, being interested in eating the Sun), etc.. but then whether some sort of vnm-coherence arises seems like a very different question. and indeed even though i think these drives are legit, i think it's plausible that such coherence just doesn't arise or that thinking of the question of what valuing is like such that a tendency toward "vnm-coherence" or "goal stability" could even make sense as an option is pretty bad/confused[1].
(of course these two positions i've briefly stated on these two questions deserve a bunch of elaboration and justification that i have not provided here, but hopefully it is clear even without that that there are two pretty different questions here that are (at least a priori) not equivalent)
briefly and vaguely, i think this could involve mistakenly imagining a growing mind meeting a fixed world, when really we will have a growing mind meeting a growing world — indeed, a world which is approximately equal to the mind itself. slightly more concretely, i think things could be more like: eg humanity has many profound projects now, and we would have many profound but currently basically unimaginable projects later, with like the effective space of options just continuing to become larger, plausibly with no meaningful sense in which there is a uniform direction in which we're going throughout or whatever ↩︎
I didn't really "get it" but this paper may be interesting to you: https://arxiv.org/pdf/2502.15820