Some thoughts that came to mind while skimming this post:
If a catastrophe happens, then either:
No, the utility here is just the amount of money the agent gets
I meant that it sounded like you "wanted a better average score (over a's) when you are randomly sampled as b than other programs". Although again I think the intuition-pumping is misleading here, because the programmer is choosing which b to fix, but not which a to fix. So whether you wanna one-box only depends on whether you condition on a = b.
(Just skimmed, also congrats on the work)
Why is this surprising? You're basically assuming that there is no correlation between what program Omega predicts and what program you actually are. That is, Omega is no predictor at all! Thus, obviously you two-box, because one-boxing would have no effect on what Omega predicts. (Or maybe the right way to think about this is: it will have a tiny but non-zero effect, because you are one of the |P| programs, but since |P| is huge, that is ~0.)
When instead you condition on a = b, this becomes a different problem: Ome...
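A minimal sketch of the arithmetic behind the two readings, assuming the standard Newcomb payoffs ($1,000 in the transparent box, $1,000,000 in the opaque box iff Omega predicted one-boxing), which are not stated above:

```python
# Toy expected-value comparison (payoffs are the standard Newcomb ones,
# assumed here for illustration).

def expected_value(action, p_predicted_onebox):
    """Expected payoff given P(Omega predicted one-boxing | your action)."""
    opaque = 1_000_000 * p_predicted_onebox
    return opaque if action == "one-box" else opaque + 1_000

# Reading 1: no correlation between Omega's predicted program b and you, a.
# Your choice moves the prediction only by ~1/|P|, i.e. essentially not at all.
p = 0.5  # whatever the base rate of "predicted one-boxing" happens to be
print(expected_value("one-box", p), expected_value("two-box", p))    # 500000.0 501000.0 -> two-box

# Reading 2: condition on a = b, so the prediction tracks your actual choice.
print(expected_value("one-box", 1.0), expected_value("two-box", 0.0))  # 1000000.0 1000.0 -> one-box
```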
It's unclear what the optimal amount of thinking per step is. My initial guess would have been that letting Claude think for a whole paragraph before every single action (rather than only every 10 actions, or whenever it's in a match, or whatever) scores slightly better than letting it think more (sequentially). But I guess thinking more might indeed work better, if that's what the streamer settled on after some iteration.
The story for parallel checks could be different though. My guess would be going all out and letting Claude generate the paragraph 5 times and then generate 5 ...
I don't see how Take 4 is anything other than simplicity (in the human/computational language). As you say, it's a priori unclear whether an agent is an instance of a human or the other way around. You say the important bit is that you are subtracting properties from a human to get an agent. But how shall we define subtraction here? In one formal language, the definition of human will indeed be a superset of that of agent... but in another one it will not. So you need to choose a language. And the natural way forward every time this comes up (many times) is to just "weigh by Turing computations in the real world" (instead of choosing a different and odd-looking Universal Turing Machine), that is, a simplicity prior.
Imo rationalists tend to underestimate the arbitrariness involved in choosing a CEV procedure (= moral deliberation in full generality).
Like you, I endorse the step of "scoping the reference class" (along with a thousand other preliminary steps). Preemptively fixing it in place helps you to the extent that the humans wouldn't have done it by default. But if the CEV procedure is governed by a group of humans so selfish/unthoughtful as to not even converge on that by themselves, then I'm sure that there'll be at least a few hundred other aspects (both more a...
I'm sure some of people's ignorance of these threat models comes from the reasons you give. But my intuition is that most of it comes from "these are vaguer threat models that seem very up in the air, and other ones seem more obviously real and more shovel-ready" (this is similar to your "Flinch", but I think more conscious and endorsed).
Thus, I think the best way to converge on whether these threat models are real/likely/actionable is to work through as-detailed-as-possible example trajectories. Someone objects that the state will handle it? Let's actually think t...
Just writing up a model that came to mind, partly inspired by Ryan here.
Extremely good single-single alignment should be highly analogous to "current humans becoming smarter and faster thinkers".
If this happens at roughly the same speed for all humans, then each human is better equipped to increase their utility, but does this lead to a higher global utility? This can be seen as a race between the capability (of individual humans) to find and establish better coordination measures, and their capability to selfishly subvert these coordination measures. I do th...
Fantastic snapshot. I wonder (and worry) whether we'll look back on it with similar feelings as those we have for What 2026 looks like now.
There is also no “last resort war plan” in which the president could break all of the unstable coordination failures and steer the ship.
[...]
There are no clear plans for what to do under most conditions, e.g. there is no clear plan for when and how the military should assume control over this technology.
These sound intuitively unlikely to me, by analogy to nuclear or bio. Of course, that is not to say these protocols wi...
I agree with conjunctiveness, although again I'm more optimistic about huge improvements. I mostly wanted to emphasize that I'm not sure there are structurally robust reasons (as opposed to personal whims) why huge spending on safety won't happen.
Speaking for myself (not my coauthors), I don't agree with your two items, because:
More generally, I think behavioral self-awareness for capability evaluation is and will remain strictly worse than the obviou...
Most difficulties you raise here could imo change drastically with tens of billions being injected into AI safety, especially thanks to new ideas coming out of left field that might make safety cases way more efficient. (I'm probably more optimistic about new ideas than you, partly because "it always subjectively feels like there are no big ideas left", and AI safety is so young.)
If your government picks you as a champion and gives you amazing resources, you no longer have to worry about national competition, and that amount seems doable. You still have to...
See our recent work (especially the section on backdoors), which opens the door to directly asking the model. Although there are obstacles like the Reversal Curse, and it's unclear whether this can be made to scale.
I have two main problems with t-AGI:
A third one is a definitional problem exacerbated by test-time compute: What does it mean for an AI to succeed at task T (which takes humans X hours)? Maybe it only succeeds when an obscene amount of test-time compute is poured in. It seems unavoidable to define things in terms of resources, as you do.
Very cool! But I think there's a crisper way to communicate the central point of this piece (or at least, a way that would have been more immediately transparent to me). Here it is:
Say you are going to use Process X to obtain a new Model. Process X can be as simple as "pre-train on this dataset", or as complex as "use a bureaucracy of Model A to train a new LLM, then have Model B test it, then have Model C scaffold it into a control protocol, then have Model D produce some written arguments for the scaffold being safe, have a human read them, and if they r...
My understanding from discussions with the authors (but please correct me):
This post is less about pragmatically analyzing which particular heuristics work best for ideal or non-ideal agents in common environments (assuming a background conception of normativity), and more about the philosophical underpinnings of normativity itself.
Maybe it's easiest if I explain what this post grows out of:
There seems to be a widespread vibe amongst rationalists that "one-boxing in Newcomb is objectively better, because you simply obtain more money, that is, you simply wi...
some people say that "winning is about not playing dominated strategies"
I do not believe this statement. As in, I do not currently know of a single person, associated either with LW or with decision-theory academia, that says "not playing dominated strategies is entirely action-guiding." So, as Raemon pointed out, "this post seems like it’s arguing with someone but I’m not sure who."
In general, I tend to mildly disapprove of words like "a widely-used strategy", "we often encounter claims" etc, without any direct citations to the individuals who are purport...
Like Andrew, I don't see strong reasons to believe that near-term loss-of-control accounts for more x-risk than medium-term multi-polar "going out with a whimper". This is partly due to thinking that oversight of near-term AI might be technically easy. I think Andrew also thought along those lines: an intelligence explosion is possible, but relatively easy to prevent if people are scared enough, and they probably will be. Although I do have lower probabilities than him, and some different views on AI conflict. Interested in your take @Daniel Kokotajlo.
I don't think people will be scared enough of intelligence explosion to prevent it. Indeed the leadership of all the major AI corporations are actively excited about, and gunning for, an intelligence explosion. They are integrating AI into their AI R&D as fast as they can.
You know that old thing where people solipsistically optimizing for hedonism are actually less happy? (relative to people who have a more long-term goal related to the external world) You know, "Whoever seeks God always finds happiness, but whoever seeks happiness doesn't always find God".
My anecdotal experience says this is very true. But why?
One explanation could be in the direction of what Eliezer says here (inadvertently rewarding your brain for suboptimal behavior will get you depressed):
Someone with a goal has an easier time getting out of local mini...
hahah yeah but the only point here is: it's easier to credibly commit to a threat if executing the threat is cheap for you. And this is simply not too interesting a decision-theoretic point, just one more obvious pragmatic consideration to throw into the bag. The story even makes it sound like "Vader will always be in a better position", or "it's obvious that Leia shouldn't give in to Tarkin but should give in to Vader", and that's not true. Even though Tarkin loses more from executing the threat than Vader, the only thing that matters for Leia is how cred...
The only decision-theoretic points that I could see this story making are pretty boring, at least to me.
That is: in this case at least it seems like there's concrete reason to believe we can have some cake and eat some too.
I disagree with this framing. Sure, if you have 5 different cakes, you can eat some and have some. But for any particular cake, you can't do both. Similarly, if you face 5 (or infinitely many) identical decision problems, you can choose to be updateful in some of them (thus obtaining useful Value of Information, that increases your utility in some worlds), and updateless in others (thus obtaining useful strategic coherence, that increases ...
Excellent explanation, congratulations! Sad I'll have to miss the discussion.
Interlocutor: Neither option is plausible. If you update, you're not dynamically consistent, and you face an incentive to modify into updatelessness. If you bound cross-branch entanglements in the prior, you need to explain why reality itself also bounds such entanglements, or else you're simply advising people to be delusional.
You found yourself a very nice interlocutor. I think we truly cannot have our cake and eat it: either you update, making you susceptible to infohazards=tra...
I think Nesov had some similar idea about "agents deferring to a (logically) far-away algorithm-contract Z to avoid miscoordination", although I never understood it completely, nor think that idea can solve miscoordination in the abstract (only, possibly, be a nice pragmatic way to bootstrap coordination from agents who are already sufficiently nice).
...EDIT 2: UDT is usually prone to commitment races because it thinks of each agent in a conflict as separately making commitments earlier in logical time. But focusing on symmetric commitments gets rid of this p
I don't understand your point here, explain?
Say there are 5 different veils of ignorance (priors) that most minds consider Schelling (you could try to argue there will be exactly one, but I don't see why).
If everyone simply accepted exactly the same one, then yes, lots of nice things would happen and you wouldn't get catastrophically inefficient conflict.
But every one of these 5 priors will have different outcomes when it is implemented by everyone. For example, maybe in prior 3 agent A is slightly better off and agent B is slightly worse off.
So you need t...
Nice!
Proposal 4: same as proposal 3 but each agent also obeys commitments that they would have made from behind a veil of ignorance where they didn't yet know who they were or what their values were. From that position, they wouldn't have wanted to do future destructive commitment races.
I don't think this solves Commitment Races in general, because of two different considerations:
I have no idea whether Turing's original motivation was this one (not that it matters much). But I agree that if we take time and judge expertise to the extreme we get what you say, and that current LLMs don't pass that. Heck, even a trick as simple as asking for a positional / visual task (something like ARC AGI, even if completely text-based) would suffice. But I still would expect academics to be able to produce a pretty interesting paper on weaker versions of the test.
Why isn't there yet a paper in Nature or Science called simply "LLMs pass the Turing Test"?
I know we're kind of past that, and now we understand LLMs can be good at some things while bad at others. And the Turing Test is mainly interesting for its historical significance, not as the most informative test to run on AI. And I'm not even completely sure to what extent current LLMs pass the Turing Test (it will depend massively on the details of your Turing Test).
But my model of academia predicts that, by now, some senior ML academics would have paired up with some ...
I think that some people are massively missing the point of the Turing test. The Turing test is not about understanding natural language. The idea of the test is, if an AI can behave indistinguishably from a human as far as any other human can tell, then obviously it has at least as much mental capability as humans have. For example, if humans are good at some task X, then you can ask the AI to solve the same task, and if it does poorly then it's a way to distinguish the AI from a human.
The only issue is how long the test should take and how qualifie...
Thanks Jonas!
A way to combine the two worlds might be to run it in video games or similar where you already have players
Oh my, we have converged back on Critch's original idea for Encultured AI (not anymore, now it's health-tech).
You're right! I had mistaken the derivative for the original function.
Probably this slip happened because I was also thinking of the following:
Embedded learning can't ever be modelled as taking such an (origin-agnostic) derivative.
When in ML we take the gradient in the loss landscape, we are literally taking (or approximating) a counterfactual: "If my algorithm was a bit more like this, would I have performed better in this environment? (For example, would my prediction have been closer to the real next token)"
But in embedded reality there's no way to take...
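To make the counterfactual reading of the gradient concrete, here is a minimal sketch with a one-parameter predictor and a finite-difference gradient (all names and numbers invented for illustration):

```python
# "Gradient as counterfactual": how much better would I have done on this
# datapoint if my parameter had been slightly different?

def loss(w, x, target):
    """Squared error of a one-parameter predictor that outputs w * x."""
    return (w * x - target) ** 2

w, x, target = 0.5, 2.0, 3.0
eps = 1e-6

# Finite-difference answer to the counterfactual question.
grad = (loss(w + eps, x, target) - loss(w - eps, x, target)) / (2 * eps)

print(grad)                             # ~ -8: increasing w would have helped
print(loss(w, x, target))               # 4.0  (prediction 1.0 vs target 3.0)
print(loss(w - 0.1 * grad, x, target))  # 0.16 (prediction 2.6, much closer)
```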
The default explanation I'd heard for "the human brain naturally focusing on negative considerations", or "the human body experiencing more pain than pleasure", was that, in the ancestral environment, there were many catastrophic events to run away from, but not many incredibly positive events to run towards: having sex once is not as good as dying is bad (for inclusive genetic fitness).
But maybe there's another, more general factor that doesn't rely on these environment details, but rather on deeper mathematical properties:
Say you are an algorithm being cons...
Very fun
Now it makes sense, thank you!
Thanks! I don't understand the logic behind your setup yet.
Trying to use the random seed to inform the choice of word pairs was the intended LLM behavior: the model was supposed to use the random seed to select two random words
But then, if the model were to correctly do this, it would score 0 in your test, right? Because it would generate a different word pair for every random seed, and what you are scoring is "generating only two words across all random seeds, and furthermore ensuring they have these probabilities".
...The main reason we didn’t enforce this v
you need a set of problems assigned to clearly defined types and I'm not aware of any such dataset
Hm, I was thinking something as easy to categorize as "multiplying numbers of n digits", or "the different levels of MMLU" (although again, they already know about MMLU), or "independently do X online (for example create an account somewhere)", or even some of the tasks from your paper.
I guess I was thinking less about "what facts they know", which is pure memorization (although this is also interesting), and more about "cognitively hard tasks", that require some computational steps.
Given your clone is a perfectly mirrored copy of yourself down to the lowest physical level (whatever that means), then breaking symmetry would violate the homogeneity or isotropy of physics. I don't know where the physics literature stands on the likelihood of that happening (even though certainly we don't see macroscopic violations).
Of course, it might be that an atom-by-atom copy is not a copy down to the lowest physical level, in which case trivially you can get eventual asymmetry. I mean, it doesn't even make complete sense to say "atom-by-atom copy" in th...
Another idea: Ask the LLM how well it will do on a certain task (for example, which fraction of math problems of type X it will get right), and then actually test it. This a priori lands in INTROSPECTION, but could have a bit of FACTS or ID-LEVERAGE if you use tasks described in training data as "hard for LLMs" (like tasks related to tokens and text position).
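A rough sketch of what that eval could look like; `ask_model` is a stand-in for whatever LLM API you use (simulated here so the script runs), and the task, prompts, and numbers are all invented:

```python
import random

def ask_model(prompt: str) -> str:
    # Placeholder "model": claims 80% accuracy, actually gets ~70% right.
    if "What fraction" in prompt:
        return "0.8"
    question = prompt.removeprefix("What is ").removesuffix("?")
    a, b = map(int, question.split(" * "))
    return str(a * b) if random.random() < 0.7 else str(a * b + 1)

def make_problem():
    a, b = random.randint(100, 999), random.randint(100, 999)
    return f"What is {a} * {b}?", str(a * b)

# 1. Ask the model to predict its own accuracy on this task type.
predicted_acc = float(ask_model(
    "What fraction of 3-digit multiplication problems will you get right? "
    "Reply with a number in [0, 1]."
))

# 2. Actually test it and compare.
problems = [make_problem() for _ in range(200)]
actual_acc = sum(ask_model(q) == ans for q, ans in problems) / len(problems)
print(f"predicted {predicted_acc:.2f} vs actual {actual_acc:.2f}")
```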
About the Not-given prompt in ANTI-IMITATION-OUTPUT-CONTROL:
You say "use the seed to generate two new random rare words". But if I'm understanding correctly, the seed is different for each of the 100 instantiations of the LLM, and you want the LLM to only output 2 different words across all these 100 instantiations (with the correct proportions). So, actually, the best strategy for the LLM would be to generate the ordered pair without using the random seed, and then only use the random seed to throw an unfair coin.
Given how it's written, and the closeness ...
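For concreteness, here's a toy simulation of that strategy: fix the word pair without looking at the seed, then use the seed only as an unfair coin (the pair, the 70/30 split, and the hashing are all invented for illustration):

```python
import hashlib
from collections import Counter

TARGET_P = 0.7                            # desired probability of the first word
PAIR = ("cromulent", "sesquipedalian")    # chosen WITHOUT looking at the seed

def answer(seed: int) -> str:
    # Use the seed only to throw an unfair coin, not to pick the words.
    h = int(hashlib.sha256(str(seed).encode()).hexdigest(), 16)
    u = (h % 10**6) / 10**6               # pseudo-uniform in [0, 1)
    return PAIR[0] if u < TARGET_P else PAIR[1]

# 100 independent instantiations, each seeing a different seed.
outputs = [answer(seed) for seed in range(100)]
print(Counter(outputs))  # only two distinct words, roughly a 70/30 split
```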
I've noticed fewer and fewer posts include explicit Acknowledgments or Epistemic Status.
This could indicate that the average post has less work put into it: it hasn't gone through an explicit round of feedback from people you'll have to acknowledge. Although this could also be explained by the average poster being more isolated.
If it's true that less work is put into the average post, it seems likely that this kind of work and discussion has just shifted to private channels like Slack, or to more established venues like academia.
I'd guess the LW team have thei...
This post is not only useful, but beautiful.
This, more than anything else on this website, reflects for me the lived experiences which demonstrate we can become more rational and effective at helping the world.
Many points of resonance with my experience since discovering this community. Many same blind-spots that I unfortunately haven't been able to shortcut, and have had to re-discover by myself. Although this does make me wish I had read some of your old posts earlier.
It should be called A-ware, short for Artificial-ware, given the already massive popularity of the term "Artificial Intelligence" to designate "trained-rather-than-programmed" systems.
It also seems more likely to me that future products will contain some AI sub-parts and some traditional-software sub-parts (rather than being wholly one or the other), and one or the other is utilized depending on context. We could call such a system Situationally A-ware.
That was dazzling to read, especially the last bit.
Everything makes sense except your second paragraph. Conditional on us solving alignment, I agree it's more likely that we live in an "easy-by-default" world, rather than a "hard-by-default" one in which we got lucky or played very well. But we shouldn't condition on solving alignment, because we haven't yet.
Thus, in our current situation, the only way anthropics pushes us towards "we should work more on non-agentic systems" is if you believe "world were we still exist are more likely to have easy alignment-through-non-agentic-AIs". Which you do believe, a...
Yes, but
Under the anthropic principle, we should expect there to be a 'consistent underlying reason' for our continued survival.
Why? It sounds like you're anthropic updating on the fact that we'll exist in the future, which of course wouldn't make sense because we're not yet sure of that. So what am I missing?
Interesting, but I'm not sure how successful the counterexample is. After all, if your terminal goal in the whole environment was truly for your side to win, then it makes sense to understand anything short of letting Shin play as a shortcoming of your optimization (with respect to that goal). Of course, even in the case where that's your true goal and you're committing a mistake (which is not common), we might want to say that you are deploying a lot of optimization, with respect to the different goal of "winning by yourself", or "having fun", which is co...
Claude learns across different chats. What does this mean?
I was asking Claude 3 Sonnet "what is a PPU" in the context of this thread. For that purpose, I pasted part of the thread.
Claude automatically assumed that OA meant Anthropic (instead of OpenAI), which was surprising.
I opened a new chat, copying the exact same text, but with OA replaced by GDM. Even then, Claude assumed GDM meant Anthropic (instead of Google DeepMind).
This seemed like interesting behavior, so I started toying around (in new chats) with more tweaks to the prompt to check its ro...
What's PPU?
From here:
Profit Participation Units (PPUs) represent a unique compensation method, distinct from traditional equity-based rewards. Unlike shares, stock options, or profit interests, PPUs don't confer ownership of the company; instead, they offer a contractual right to participate in the company's future profits.
This post is reminiscent of this old one from Daniel.