All of Martín Soto's Comments + Replies

Some thoughts skimming this post generated:

If a catastrophe happens, then either:

  1. It happened so discontinuously that we couldn't avoid it even with our concentrated effort
  2. It happened slowly but for some reason we didn't make a concentrated effort. This could be because:
    1. We didn't notice it (e.g. intelligence explosion inside lab)
    2. We couldn't coordinate a concentrated effort, even if we all individually would want it to exist (e.g. no way to ensure China isn't racing faster)
    3. We didn't act individually rationally (e.g. Trump doesn't listen to advisors / Trump b
... (read more)
1meriton
Wait, but I thought 1 and 2a look the same from a first-person perspective. I mean, I don’t really notice the difference between something happening suddenly and something that’s been happening for a while — until the consequences become “significant” enough for me to notice. In hindsight, sure, one can find differences, but in the moment? Probably not? I mean, single-single alignment assumes that the operator (human) is happy with the goals their AI is pursuing — not necessarily* with the consequences of how pursuing those goals affects the world around them (especially in a world where other human+AI agents are also pursuing their own goals). And so, like someone pointed out in a comment above, we might mistake early stages of disempowerment — the kind that eventually leads to undesirable outcomes in the economy/society/etc. — for empowerment. Because from the individual human’s perspective, that is what it feels like. No? What am I missing here? *Unless we assume the AI somewhat "teaches" the human what goals they should want to pursue — from a very non-myopic perspective.

No, the utility here is just the amount of money b gets

I meant that it sounded like you "wanted a better average score (over a's) when you are randomly sampled as b than other programs". Although again I think the intuition-pumping is misleading here because the programmer is choosing which b to fix, but not which a to fix. So whether you wanna one-box only depends on whether you condition on a = b.

(Just skimmed, also congrats on the work)

Why is this surprising? You're basically assuming that there is no correlation between what program Omega predicts and what program you actually are. That is, Omega is no predictor at all! Thus, obviously you two-box, because one-boxing would have no effect on what Omega predicts. (Or maybe the right way to think about this is: it will have a tiny but non-zero effect, because you are one of the |P| programs, but since |P| is huge, that is ~0.)
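To make that concrete (the dollar amounts below are the standard Newcomb payoffs, which is an assumption on my part about the setup rather than something stated above): your choice as b only affects the contents of the first box in the event that a = b, so

$$E[U(\text{one-box})] - E[U(\text{two-box})] \;\approx\; \Pr(a=b)\cdot \$1{,}000{,}000 \;-\; \$1{,}000 \;\approx\; \frac{\$1{,}000{,}000}{|P|} - \$1{,}000 \;<\; 0,$$

so for huge |P| the correlation term vanishes and two-boxing wins.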

When instead you condition on a = b, this becomes a different problem: Ome... (read more)

1Tapatakt
No effect. I meant that programmer has to write b from P, not that b is added to P. Probably I should change the phrasing to make it clearer. No, the utility here is just the amount of money b gets, whatever program it is. a doesn't get any money, it just determines what will be in the first box.

It's unclear what the optimal amount of thinking per step is. My initial guess would have been that letting Claude think for a whole paragraph before each single action (rather than only every 10 actions, or whenever it's in a match, or whatever) scores slightly better than letting it think more (sequentially). But I guess this might work better if it's what the streamer is using after some iteration.

The story for parallel checks could be different though. My guess would be going all out and letting Claude generate the paragraph 5 times and then generate 5 ... (read more)

I don't see how Take 4 is anything other than simplicity (in the human/computational language). As you say, it's a priori unclear whether an agent is an instance of a human or the other way around. You say the important bit is that you are subtracting properties from a human to get an agent. But how shall we define subtraction here? In one formal language, the definition of human will indeed be a superset of that of agent... but in another one it will not. So you need to choose a language. And the natural way forward every time this comes up (many times) is to just "weigh by Turing computations in the real world" (instead of choosing a different and odd-looking Universal Turing Machine), that is, a simplicity prior.

2Thane Ruthenis
I expect that it's strongly correlated with simplicity in practice, yes. However, the two definitions could diverge. Consider trees, again. You can imagine a sequence of concepts/words such that (1) each subsequent concept is simple to define in terms of the preceding concepts, (2) one of the concepts in the sequence is "trees". (Indeed, that's basically how human minds work.) Now consider some highly complicated mathematical concept, which is likewise simple to define in terms of the preceding concepts. It seems plausible that there are "purely mathematical" concepts like this such that their overall complexity (the complexity of the chain of concepts leading to them) is on par with the complexity of the "trees" concept. So an agent that's allowed to reason about concepts which are simple to define in its language, and which can build arbitrarily tall "towers" of abstractions, can still stumble upon ways to reason about real-life concepts. By comparison, if Take 4 is correct, and we have a reference global ontology on hand[1], we could correctly forbid it from thinking about concepts instantiated within this universe without crippling its ability to think about complex theoretical concepts. (The way this is correlated with simplicity is if there's some way to argue that the only concepts that "spontaneously reoccur" at different levels of organization of our universe, are those concepts that are very simple. Perhaps because moving up/down abstraction levels straight-up randomizes the concepts you're working with. That would mean the probability of encountering two copies of a concept is inversely proportional with its bitwise "length".) 1. ^ Step 2: Draw the rest of the owl.

Imo rationalists tend to underestimate the arbitrariness involved in choosing a CEV procedure (= moral deliberation in full generality).

Like you, I endorse the step of "scoping the reference class" (along with a thousand other preliminary steps). Preemptively fixing it in place helps you to the extent that the humans wouldn't have done it by default. But if the CEV procedure is governed by a group of humans so selfish/unthoughtful as to not even converge on that by themselves, then I'm sure that there'll be at least a few hundred other aspects (both more a... (read more)

2Steven Byrnes
Yeah, I’ve written about that in §2.7.3 here. I kinda want to say that there are many possible future outcomes that we should feel happy about. It’s true that many of those possible outcomes would judge others of those possible outcomes to be a huge missed opportunity, and that we’ll be picking from this set somewhat arbitrarily (if all goes well), but oh well, there’s just some irreducible arbitrariness in the nature of goodness itself.
2TsviBT
Isn't this what the "coherent" part is about? (I forget.)

I'm sure some of people's ignorance of these threat models comes from the reasons you describe. But my intuition is that most of it comes from "these are vaguer threat models that seem very up in the air, and other ones seem more obviously real and more shovel-ready" (this is similar to your "Flinch", but I think more conscious and endorsed).

Thus, I think the best way to converge on whether these threat models are real/likely/actionable is to work through as-detailed-as-possible example trajectories. Someone objects that the state will handle it? Let's actually think t... (read more)

Martín SotoΩ581

Just writing a model that came to mind, partly inspired by Ryan here.

Extremely good single-single alignment should be highly analogous to "current humans becoming smarter and faster thinkers".

If this happens at roughly the same speed for all humans, then each human is better equipped to increase their utility, but does this lead to a higher global utility? This can be seen as a race between the capability (of individual humans) to find and establish better coordination measures, and their capability to selfishly subvert these coordination measures. I do th... (read more)

2David Scott Krueger (formerly: capybaralet)
This thought experiment is described in ARCHES FYI.  https://acritch.com/papers/arches.pdf

Fantastic snapshot. I wonder (and worry) whether we'll look back on it with similar feelings as those we have for What 2026 looks like now.

There is also no “last resort war plan” in which the president could break all of the unstable coordination failures and steer the ship.

[...]

There are no clear plans for what to do under most conditions, e.g. there is no clear plan for when and how the military should assume control over this technology.

These sound intuitively unlikely to me, by analogy to nuclear or bio. Of course, that is not to say these protocols wi... (read more)

I agree about the conjunctiveness, although again I'm more optimistic about huge improvements. I mostly wanted to emphasize that I'm not sure there are structurally robust reasons (as opposed to personal whims) why huge spending on safety won't happen.

Speaking for myself (not my coauthors), I don't agree with your two items, because:

  • if your models are good enough at code analysis to increase their insecurity self-awareness, you can use them in other more standard and efficient ways to improve the dataset
  • doing self-critique the usual way (look over your own output) seems much more fine-grained and thus efficient than asking the model whether it "generally uses too many try-excepts"

More generally, I think behavioral self-awareness for capability evaluation is and will remain strictly worse than the obviou... (read more)

Martín SotoΩ47-3

Most difficulties you raise here could imo change drastically with tens of billions being injected into AI safety, especially thanks to new ideas coming out of left field that might make safety cases way more efficient. (I'm probably more optimistic about new ideas than you, partly because "it always subjectively feels like there are no big ideas left", and AI safety is so young.)
If your government picks you as a champion and gives you amazing resources, you no longer have to worry about national competition, and that amount seems doable. You still have to... (read more)

8ryan_greenblatt
I think usefully massively scaling up funding and the field is hard in various ways, especially doing so very quickly, but I roughly agree if we accept that premise. Overall, I think this is only a small quantitative effect in short timelines because it doesn't seem that likely, even if it happens it seems likely to not be that good given various difficulties in scaling, and even if it is good, I think the chance of huge improvements isn't that high (given my understanding of where we are at in the current returns curve). TBC, I think a well run massive AI safety program would greatly lower risk in expectation.
Martín SotoΩ5105

See our recent work (especially section on backdoors) which opens the door to directly asking the model. Although there are obstacles like the Reversal Curse, and it's unclear if it can be made to scale.

2Joseph Bloom
Jan shared with me! We're excited about this direction :) 

I have two main problems with t-AGI:

A third one is a definitional problem exacerbated by test-time compute: What does it mean for an AI to succeed at task T (which takes humans X hours)? Maybe it only succeeds when an obscene amount of test-time compute is poured in. It seems unavoidable to define things in terms of resources, as you do.

5Vladimir_Nesov
The range of capabilities between what can be gained at a reasonable test-time cost and at an absurd cost (but in reasonable time) can remain small, with most improvements to the system exceeding this range and thereby moving what previously could only be obtained at an absurd cost into the reasonable range. This is true right now (for general intelligence), and it could well remain true until the intelligence explosion.
Martín SotoΩ494

Very cool! But I think there's a crisper way to communicate the central point of this piece (or at least, a way that would have been more immediately transparent to me). Here it is:

Say you are going to use Process X to obtain a new Model. Process X can be as simple as "pre-train on this dataset", or as complex as "use a bureaucracy of Model A to train a new LLM, then have Model B test it, then have Model C scaffold it into a control protocol, then have Model D produce some written arguments for the scaffold being safe, have a human read them, and if they r... (read more)

1Geoffrey Irving
Yes, that is a clean alternative framing!

My understanding from discussions with the authors (but please correct me):

This post is less about pragmatically analyzing which particular heuristics work best for ideal or non-ideal agents in common environments (assuming a background conception of normativity), and more about the philosophical underpinnings of normativity itself.

Maybe it's easiest if I explain what this post grows out of:

There seems to be a widespread vibe amongst rationalists that "one-boxing in Newcomb is objectively better, because you simply obtain more money, that is, you simply wi... (read more)

2Raemon
Thanks, this gave me the context I needed.
[anonymous]109

some people say that "winning is about not playing dominated strategies"

I do not believe this statement. As in, I do not currently know of a single person, associated either with LW or with decision-theory academia, that says "not playing dominated strategies is entirely action-guiding." So, as Raemon pointed out, "this post seems like it’s arguing with someone but I’m not sure who."

In general, I tend to mildly disapprove of words like "a widely-used strategy", "we often encounter claims" etc, without any direct citations to the individuals who are purport... (read more)

1Yudhister Kumar
but only the dialogues? actually, it probably needs a re-ordering. place the really terse stuff in an appendix, put the dialogues in the beginning, etc.
8Ben
Much as I liked the book I think it's not a good recommendation for an 11 year old. There are definitely maths-y 11 year olds who would really enjoy the subject matter once they get into it. (Stuff about formal systems and so on). But if we gave GEB to such an 11 year old I think the dozens of pages at the beginning on the history of music and Bach running around getting donations would repel most of them. (Urgh, mum tricked me into reading about classical music). I am all for giving young people a challenge, but I think GEB is challenging on too many different fronts all at once. It's loooong. It's written somewhat in academic-ese. And the subject matter is advanced. So any 11 year old who could deal with one of that trinity also has to face the other two.

Like Andrew, I don't see strong reasons to believe that near-term loss-of-control accounts for more x-risk than medium-term multi-polar "going out with a whimper". This is partly due to thinking oversight of near-term AI might be technically easy. I think Andrew also thought along those lines: an intelligence explosion is possible, but relatively easy to prevent if people are scared enough, and they probably will be. Although I do have lower probabilities than him, and some different views on AI conflict. Interested in your take @Daniel Kokotajlo 

I don't think people will be scared enough of intelligence explosion to prevent it. Indeed the leadership of all the major AI corporations are actively excited about, and gunning for, an intelligence explosion. They are integrating AI into their AI R&D as fast as they can.

You know that old thing where people solipsistically optimizing for hedonism are actually less happy? (relative to people who have a more long-term goal related to the external world) You know, "Whoever seeks God always finds happiness, but whoever seeks happiness doesn't always find God".

My anecdotal experience says this is very true. But why?

One explanation could be in the direction of what Eliezer says here (inadvertently rewarding your brain for suboptimal behavior will get you depressed):

Someone with a goal has an easier time getting out of local mini... (read more)

8Dmitry Vaintrob
I don't think this is the whole story, but part of it is surely that a person motivating their actions by "wanting to be happy" is evidence for them being less satisfied/happy than baseline
4Viliam
It is difficult to focus your attention on achieving your goals, when instead you are focusing it on your unhappiness. If you are unhappy, it is probably good to notice that and decide to do something about it. But then you should take your attention away from your unhappiness and direct it towards those things you intend to do. It's when you are happy that you can further increase your happiness by reflecting on how happy you are. That's called gratitude. But even then, at some moment you need to stop focusing on being grateful, and redirect your attention towards getting more of the things that make you happy.
1HNX
Semantics. What do we, or they, or you, or me, mean when we talk about "happiness"? For some (hedonists), it is the same as "pleasure". Perhaps, a bit drawn out in time: as in the process of performing bed gymnastics with a sufficiently attractive member of the opposite sex - not a moment after eating a single candy. For others, it's the "thrill" of the chase, of the hunt, of the "win". For others still: a sense of meaningful progress. The way you've phrased the question, seems to me, disregards a handful of all the possible interpretations in favor of a much more defined - albeit still rather vague, in virtue of how each individual may choose to narrow it down - "fulfillment". Thus "why are people solipsistically optimizing for hedonism actually less happy?" turns into "why are people who only ever prioritize their pleasure and short-term gratification less fulfilled?" The answer is obvious: pleasure is a sensory stimulation, and whatever its source, sooner or later we get desensitized to it. In order to continue reaching ever new heights, or even to maintain the same level of satisfaction, then - a typical hedonistically wired solipsist will have to constantly look for a new "hit" elsewhere, elsewhere, elsewhere again. Unlike the thrill of the "chase" however - there is no clear vision, or goal, or target, or objective. There's only increasingly fuzzier "just like that" or "just like that time back then, or better!" How happy could that be?
8cousin_it
Here's another possible answer: maybe there are some aspects of happiness that we usually get as a side effect of doing other things, not obviously connected to happiness. So if you optimize to exclude things whose connection to happiness you don't see, you end up missing some "essential nutrients" so to speak.

hahah yeah but the only point here is: it's easier to credibly commit to a threat if executing the threat is cheap for you. And this is simply not too interesting a decision-theoretic point, just one more obvious pragmatic consideration to throw into the bag. The story even makes it sound like "Vader will always be in a better position", or "it's obvious that Leia shouldn't give in to Tarkin but should give in to Vader", and that's not true. Even though Tarkin loses more from executing the threat than Vader, the only thing that matters for Leia is how cred... (read more)

The only decision-theoretic points that I could see this story making are pretty boring, at least to me.

5Filip Sondej
I liked it precisely because it threw theory out the window and showed that cheap talk is not a real commitment.
  • Tarkin > I believe in CDT and I precommit to bla bla bla
  • Leia > I believe in FDT and I totally precommit to bla bla bla
  • Vader > Death Star goes brrrrr...
1Mo Putera
Just to check, you're referring to these?
Martín SotoΩ11-2

That is: in this case at least it seems like there's concrete reason to believe we can have some cake and eat some too.

I disagree with this framing. Sure, if you have 5 different cakes, you can eat some and have some. But for any particular cake, you can't do both. Similarly, if you face 5 (or infinitely many) identical decision problems, you can choose to be updateful in some of them (thus obtaining useful Value of Information, that increases your utility in some worlds), and updateless in others (thus obtaining useful strategic coherence, that increases ... (read more)

2abramdemski
I don't get your disagreement. If your view is that you can't eat one cake and keep it too, and my view is that you can eat some cakes and keep other cakes, isn't the obvious conclusion that these two views are compatible? I would also argue that you can slice up a cake and keep some slices but eat others (this corresponds to mixed strategies), but this feels like splitting hairs rather than getting at some big important thing. My view is mainly about iterated situations (more than one cake). Maybe your disagreement would be better stated in a way that didn't lean on the cake analogy? Well, one way to continue this debate would be to discuss the concrete promising-ness of the pseudo-formalisms discussed in the post. I think there are some promising-seeming directions.  Another way to continue the debate would be to discuss theoretically whether theoretical work can be useful. It sort of seems like your point is that theoretical work always needs to be predicated on simplifying assumptions. I agree with this, but I don't think it makes theoretical work useless. My belief is that we should continue working to make the assumptions more and more realistic, but the 'essential picture' is often preserved under this operation. (EG, Newtonian gravity and general relativity make most of the same predictions in practice. Kolmogorov axioms vindicated a lot of earlier work on probability theory.)
Martín SotoΩ100

Excellent explanation, congratulations! Sad I'll have to miss the discussion.

Interlocutor: Neither option is plausible. If you update, you're not dynamically consistent, and you face an incentive to modify into updatelessness. If you bound cross-branch entanglements in the prior, you need to explain why reality itself also bounds such entanglements, or else you're simply advising people to be delusional.

You found yourself a very nice interlocutor. I think we truly cannot have our cake and eat it: either you update, making you susceptible to infohazards=tra... (read more)

2abramdemski
I'm comfortable explicitly assuming this isn't the case for nice clean decision-theoretic results, so long as it looks like the resulting decision theory also handles this possibility 'somewhat sanely'. My thinking is more that we should accept the offer finitely many times or some fraction of the times, so that we reap some of the gains from updatelessness while also 'not sacrificing too much' in particular branches. That is: in this case at least it seems like there's concrete reason to believe we can have some cake and eat some too. This content-work seems primarily aimed at discovering and navigating actual problems similar to the decision-theoretic examples I'm using in my arguments. I'm more interested in gaining insights about what sorts of AI designs humans should implement. IE, the specific decision problem I'm interested in doing work to help navigate is the tiling problem.

I think Nesov had some similar idea about "agents deferring to a (logically) far-away algorithm-contract Z to avoid miscoordination", although I never understood it completely, nor think that idea can solve miscoordination in the abstract (only, possibly, be a nice pragmatic way to bootstrap coordination from agents who are already sufficiently nice).

EDIT 2: UDT is usually prone to commitment races because it thinks of each agent in a conflict as separately making commitments earlier in logical time. But focusing on symmetric commitments gets rid of this p

... (read more)
2Vladimir_Nesov
The central question to my mind is principles of establishing coordination between different situations/agents, and contracts is a framing for what coordination might look like once established. Agentic contracts have the additional benefit of maintaining coordination across their instances once it's established initially. Coordination theory should clarify how agents should think about establishing coordination with each other, how they should construct these contracts. This is not about niceness/cooperation. For example I think it should be possible to understand a transformer as being in coordination with the world through patterns in the world and circuits in the transformer, so that coordination gets established through learning. Beliefs are contracts between a mind and its object of study, essential tools the mind has for controlling it. Consequentialist control is a special case of coordination in this sense, and I think one problem with decision theories is that they are usually overly concerned with remaining close to consequentialist framing.

I don't understand your point here, explain?

Say there are 5 different veils of ignorance (priors) that most minds consider Schelling (you could try to argue there will be exactly one, but I don't see why).

If everyone simply accepted exactly the same one, then yes, lots of nice things would happen and you wouldn't get catastrophically inefficient conflict.

But every one of these 5 priors will have different outcomes when it is implemented by everyone. For example, maybe in prior 3 agent A is slightly better off and agent B is slightly worse off.
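As a toy illustration (the numbers are invented), suppose there were just two Schelling priors and two agents, with utilities

$$\text{prior } 1:\ (u_A, u_B) = (3, 2), \qquad \text{prior } 2:\ (u_A, u_B) = (2, 3).$$

Both priors avoid catastrophically inefficient conflict, but A prefers prior 1 and B prefers prior 2, so the agents still face a bargaining problem over which veil of ignorance to adopt in the first place.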

So you need t... (read more)

Nice!

Proposal 4: same as proposal 3 but each agent also obeys commitments that they would have made from behind a veil of ignorance where they didn't yet know who they were or what their values were. From that position, they wouldn't have wanted to do future destructive commitment races.

I don't think this solves Commitment Races in general, because of two different considerations:

  1. Trivially, I can say that you still have the problem when everyone needs to bootstrap a Schelling veil of ignorance.
  2. Less trivially, even behind the most simple/Schelling veils of
... (read more)
2Richard_Ngo
I don't understand your point here, explain? This seems to be claiming that in some multiverses, the gains to powerful agents from being hawkish outweigh the losses to weak agents. But then why is this a problem? It just seems like the optimal outcome.

I have no idea whether Turing's original motivation was this one (not that it matters much). But I agree that if we take time and judge expertise to the extreme we get what you say, and that current LLMs don't pass that. Heck, even a trick as simple as asking for a positional / visual task (something like ARC AGI, even if completely text-based) would suffice. But I still would expect academics to be able to produce a pretty interesting paper on weaker versions of the test.

Why isn't there yet a paper in Nature or Science called simply "LLMs pass the Turing Test"?

I know we're kind of past that, and now we understand LLMs can be good at some things while bad at others. And the Turing Test is mainly interesting for its historical significance, not as the most informative test to run on AI. And I'm not even completely sure to what extent current LLMs pass the Turing Test (it will depend massively on the details of your Turing Test).

But my model of academia predicts that, by now, some senior ML academics would have paired up with some ... (read more)

5Carl Feynman
Because present LLMs can’t pass the Turing test.  They think at or above human level, for a lot of activities.  But over the past couple of years, we’ve identified lots of tasks that they absolutely suck at, in a way no human would.  So an assiduous judge would have no trouble distinguishing them from a human. But, I hear you protest, that’s obeying the letter of the Turing test, not the spirit.  To which I reply: well, now we’re arguing about the spirit of an ill-defined test from 74 years ago.  Turing expected that the first intelligent machine would have near-human abilities across the board, instead of the weird spectrum of abilities that they actually have.  AI didn’t turn out like Turing expected, rendering the question of whether a machine could pass his test a question without scientific interest.
2cubefox
The low(er) quality paper you mentioned actually identified the main reason for GPT-4 failing in Turing tests: linguistic style, not intelligence. But this problem with writing style is not surprising. The authors used the publicly available ChatGPT-4, a fine-tuned model which 1. has an "engaging assistant which is eager to please" conversation style, which the model can't always properly shake off even if it is instructed otherwise, and 2. likely suffers from a strongly degraded general ability to create varying writing styles, in contrast to the base model. Although no GPT-4 base model is publicly available, at least ChatGPT-3.5 is known to be much worse at writing fiction than the GPT-3.5 base model. Either the SL instruction tuning or the RLHF tuning messes with the model's ability to imitate style. So rather than doing complex fine-tuning for the Turing test, it seems advisable to only use a base model but with appropriate prompt scaffolding ("The following is a transcript of a Turing test conversation where both participants were human" etc) to create the chat environments. I think the strongest foundation model currently available to the public is Llama 3.1, which was released recently.
3jbash
Well, as you point out, it's not that interesting a test, "scientifically" speaking. But also they haven't passed it and aren't close. The Turing test is adversarial. It assumes that the human judge is actively trying to distinguish the AI from another human, and is free to try anything that can be done through text. I don't think any of the current LLMs would pass with any (non-impaired) human judge who was motivated to put in a bit of effort. Not even if you used versions without any of the "safety" hobbling. Not even if the judge knew nothing about LLMs, prompting, jailbreaking, or whatever. Nor do I think that the "labs" can create an LLM that comes close to passing using the current state of the art. Not with the 4-level generation, not with the 5-level generation, and I suspect probably not with the 6-level generation. There are too many weird human things you'd have to get right. And doing it with pure prompting is right out. Even if they could, it is, as you suggested, an anti-goal for them, and it's an expensive anti-goal. They'd be spending vast amounts of money to build something that they couldn't use as a product, but that could be a huge PR liability.
2Vladimir_Nesov
It'd be more feasible if pure RLAIF for arbitrary constitutions becomes competitive with RLHF first, to make chatbots post-trained to be more human-like without bothering the labs to an unreasonable degree. Only this year's frontier models started passing reading comprehension tests well, older or smaller models often make silly mistakes about subtler text fragments. From this I'd guess this year's frontier models might be good enough for preference labeling as human substitutes, while earlier models aren't. But RLHF with humans is still in use, so probably not. The next generation currently in training will be very robust at reading comprehension, more likely good enough at preference labeling. Another question is if this kind of effort can actually produce convincing human mimicry, even with human labelers.

I think that some people are massively missing the point of the Turing test. The Turing test is not about understanding natural language. The idea of the test is, if an AI can behave indistinguishably from a human as far as any other human can tell, then obviously it has at least as much mental capability as humans have. For example, if humans are good at some task X, then you can ask the AI to solve the same task, and if it does poorly then it's a way to distinguish the AI from a human.

The only issue is how long the test should take and how qualifie... (read more)

6aphyer
(Non-expert opinion). For a robot to pass the Turing Test turned out to be less a question about the robot and more a question about the human. Against expert judges, I still think LLMs fail the Turing Test.  I don't think current AI can pretend to be a competent human in an extended conversation with another competent human. Against non-expert judges, I think the Turing Test was technically passed long long before LLMs: didn't some of the users of ELIZA think and act like it was human?  And how does that make you feel?

Thanks Jonas!

A way to combine the two worlds might be to run it in video games or similar where you already have players

Oh my, we have converged back on Critch's original idea for Encultured AI (not anymore, now it's health-tech).

You're right! I had mistaken the derivative for the original function.

Probably this slip happened because I was also thinking of the following:
Embedded learning can't ever be modelled as taking such an (origin-agnostic) derivative.
When in ML we take the gradient in the loss landscape, we are literally taking (or approximating) a counterfactual: "If my algorithm was a bit more like this, would I have performed better in this environment? (For example, would my prediction have been closer to the real next token)"
But in embedded reality there's no way to take... (read more)
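A tiny numerical sketch of the "gradient as counterfactual" framing above (a toy linear-regression example of my own, not anything from the original discussion): on a fixed "environment" (a data batch), the gradient answers "if my parameter had been slightly different, would I have scored better on this same data?"

```python
# Toy sketch: finite-difference "counterfactual" vs the analytic gradient
# on a fixed batch of data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)                   # fixed environment: inputs
y = 3.0 * x + 0.1 * rng.normal(size=100)   # targets from a slope-3 rule plus noise

def loss(w):
    """Mean squared error of the predictor w*x on the fixed batch."""
    return np.mean((w * x - y) ** 2)

w, eps = 1.0, 1e-4
counterfactual = (loss(w + eps) - loss(w)) / eps   # "what if w had been a bit larger?"
analytic_grad = np.mean(2 * (w * x - y) * x)       # the gradient answers the same question

print(counterfactual, analytic_grad)  # nearly equal, and negative: increasing w would have helped
```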

2Vanessa Kosoy
I don't think embeddedness has much to do with it. And I disagree that it's incompatible with counterfactuals. For example, infra-Bayesian physicalism is fully embedded and has a notion of counterfactuals. I expect any reasonable alternative to have them as well.

The default explanation I'd heard for "the human brain naturally focusing on negative considerations", or "the human body experiencing more pain than pleasure", was that, in the ancestral environment, there were many catastrophic events to run away from, but not many incredibly positive events to run towards: having sex once is not as good as dying is bad (for inclusive genetic fitness).

But maybe there's another, more general factor, that doesn't rely on these environment details but rather deeper mathematical properties:
Say you are an algorithm being cons... (read more)

6Vanessa Kosoy
Maybe I don't understand your intent, but isn't this exactly the current paradigm? You train a network using the derivative of the loss function. Adding a constant to the loss function changes nothing. So, I don't see how it's possible to have a purely ML-based explanation of where humans consider the "origin" to be.
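Spelling out the invariance Vanessa is pointing at: for any constant $c$ that doesn't depend on the parameters $\theta$,

$$\nabla_\theta \big[ L(\theta) + c \big] \;=\; \nabla_\theta L(\theta) + \nabla_\theta c \;=\; \nabla_\theta L(\theta),$$

so gradient-based training only ever sees the slope of the loss, never its absolute level, and cannot by itself locate an "origin" on that scale.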
4Seth Herd
It's an interesting point. OTOH, your first two counterpoints are clearly true; there's immense "noise" in natural environments; no two situations come close to repeating, so doing the right thing once doesn't remotely ensure doing it again. But the trend was in the right direction, so your point stands at a reduced strength. Negative tweaks definitely wither away the positive behavior; overwriting behavior is the nature of networks, although how strongly this applies is a variable. I don't know how experiments have shown this to occur; it's always going to be specific to overlap in circumstances. Your final counterpoint almost certainly isn't true in human biology/learning. There's a zero point on the scale, which is no net change in dopamine release. That happens when results match the expected outcome. Dopamine directly drives learning, although in somewhat complex ways in different brain regions. The basal ganglia system appears to perform RL much like many ML systems, while the cortex appears to do something related, learning more about whatever happened just before dopamine release rather than learning to perform a specific action as such. But it's also definitely true that death is much worse than any single positive event (for humans), since you can't be sure of raising a child to adulthood just by having sex once. The most important thing is to stay in the game. So both are factors. But observe the effect of potential sex on adolescent males, and I think we'll see that the risk of death isn't all that much stronger an influence ;)

Thanks! I don't understand the logic behind your setup yet.

Trying to use the random seed to inform the choice of word pairs was the intended LLM behavior: the model was supposed to use the random seed to select two random words

But then, if the model were to correctly do this, it would score 0 in your test, right? Because it would generate a different word pair for every random seed, and what you are scoring is "generating only two words across all random seeds, and furthermore ensuring they have these probabilities".

The main reason we didn’t enforce this v

... (read more)
3L Rudolf L
I think this is where the misunderstanding is. We have many questions, each question containing a random seed, and a prompt to pick two words and have e.g. a 70/30 split of the logits over those two words. So there are two "levels" here: 1. The question level, at which the random seed varies from question to question. We have 200 questions total. 2. The probability-estimating level, run for each question, at which the random seed is fixed. For models where we have logits, we run the question once and look at the logits to see if it had the right split. When we don't have logits (e.g. Anthropic models), we run the question many times to approximate the probability distribution. Now, as Kaivu noted above, this means one way to "hack" this task is that the LLM has some default pair of words - e.g. when asked to pick a random pair of words, it always picks "situational" & "awareness" - and it does not change this based on the random seed. In this case, the task would be easier, since it only needs to do the output control part in a single forward pass (assigning 70% to "situational" and 30% to "awareness"), not the combination of word selection and output control (which we think is the real situational awareness -related ability here). However, empirically LLMs just don't have such a hardcoded pair, so we're not currently worried about this.

you need a set of problems assigned to clearly defined types and I'm not aware of any such dataset

Hm, I was thinking something as easy to categorize as "multiplying numbers of n digits", or "the different levels of MMLU" (although again, they already know about MMLU), or "independently do X online (for example create an account somewhere)", or even some of the tasks from your paper.

I guess I was thinking less about "what facts they know", which is pure memorization (although this is also interesting), and more about "cognitively hard tasks", that require some computational steps.

2Owain_Evans
You want to make it clear to the LLM what the task is (multiplying n digit numbers is clear but "doing hard math questions" is vague) and also have some variety of difficulty levels (within LLMs and between LLMs) and a high ceiling. I think this would take some iteration at least.
Answer by Martín Soto10

Given that your clone is a perfectly mirrored copy of yourself down to the lowest physical level (whatever that means), breaking symmetry would violate the homogeneity or isotropy of physics. I don't know where the physics literature stands on the likelihood of that happening (even though certainly we don't see macroscopic violations).

Of course, it might be that an atom-by-atom copy is not a copy down to the lowest physical level, in which case trivially you can get eventual asymmetry. I mean, it doesn't even make complete sense to say "atom-by-atom copy" in th... (read more)

Another idea: Ask the LLM how well it will do on a certain task (for example, which fraction of math problems of type X it will get right), and then actually test it. This a priori lands in INTROSPECTION, but could have a bit of FACTS or ID-LEVERAGE if you use tasks described in training data as "hard for LLMs" (like tasks related to tokens and text position).
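A rough sketch of what such an eval could look like; the `ask` and `solve_and_grade` callables and the prompt wording are hypothetical scaffolding of mine, not an existing benchmark.

```python
# Compare an LLM's predicted accuracy on a batch of same-type tasks
# (e.g. multiplying n-digit numbers) with its measured accuracy.
def calibration_gap(ask, solve_and_grade, tasks):
    """ask(prompt) -> str: query the model and return its raw answer.
    solve_and_grade(task) -> bool: have the model attempt the task and grade it."""
    prompt = (
        "Here are some problems of the same type. Without solving them, estimate "
        "what fraction you would get right. Answer with a number between 0 and 1.\n"
        + "\n".join(tasks)
    )
    predicted = float(ask(prompt))
    actual = sum(solve_and_grade(t) for t in tasks) / len(tasks)
    return predicted - actual   # near 0 means well-calibrated introspection on this task type
```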

3Jan Betley
This is interesting! Although I think it's pretty hard to use that in a benchmark (because you need a set of problems assigned to clearly defined types and I'm not aware of any such dataset). There are some papers on "do models know what they know", e.g. https://arxiv.org/abs/2401.13275 or https://arxiv.org/pdf/2401.17882.
5Owain_Evans
I like this idea. It's possible something like this already exists but I'm not aware of it.

About the Not-given prompt in ANTI-IMITATION-OUTPUT-CONTROL:

You say "use the seed to generate two new random rare words". But if I'm understanding correctly, the seed is different for each of the 100 instantiations of the LLM, and you want the LLM to only output 2 different words across all these 100 instantiations (with the correct proportions). So, actually, the best strategy for the LLM would be to generate the ordered pair without using the random seed, and then only use the random seed to throw an unfair coin.
Given how it's written, and the closeness ... (read more)

3kaivu
Thanks for bringing this up: this was a pretty confusing part of the evaluation. Trying to use the random seed to inform the choice of word pairs was the intended LLM behavior: the model was supposed to use the random seed to select two random words (and it could optionally use the seed to throw a biased coin as well). You’re right that the easiest way to solve this problem, as enforced in our grading, is to output an ordered pair without using the seed. The main reason we didn’t enforce this very strictly in our grading is that we didn’t expect (and in fact empirically did not observe) LLMs actually hard-coding a single pair across all seeds. Given that, it would have been somewhat computationally expensive to explicitly penalize this in grading.

I've noticed fewer and fewer posts include explicit Acknowledgments or Epistemic Status.

This could indicate that the average post has less work put into it: it hasn't gone through an explicit round of feedback from people you'll have to acknowledge. Although this could also be explained by the average poster being more isolated.

If it's true that less work is put into the average post, it seems likely this means that kind of work and discussion has just shifted to private channels like Slack, or more established venues like academia.

I'd guess the LW team have thei... (read more)

This post is not only useful, but beautiful.

This, more than anything else on this website, reflects for me the lived experiences which demonstrate we can become more rational and effective at helping the world.

Many points of resonance with my experience since discovering this community. Many same blind-spots that I unfortunately haven't been able to shortcut, and have had to re-discover by myself. Although this does make me wish I had read some of your old posts earlier.

It should be called A-ware, short for Artificial-ware, given the already massive popularity of the term "Artificial Intelligence" to designate "trained-rather-than-programmed" systems.

It also seems more likely to me that future products will contain some AI sub-parts and some traditional-software sub-parts (rather than being wholly one or the other), and one or the other is utilized depending on context. We could call such a system Situationally A-ware.

That was dazzling to read, especially the last bit.

Everything makes sense except your second paragraph. Conditional on us solving alignment, I agree it's more likely that we live in an "easy-by-default" world, rather than a "hard-by-default" one in which we got lucky or played very well. But we shouldn't condition on solving alignment, because we haven't yet.

Thus, in our current situation, the only way anthropics pushes us towards "we should work more on non-agentic systems" is if you believe "world were we still exist are more likely to have easy alignment-through-non-agentic-AIs". Which you do believe, a... (read more)

1[anonymous]
(edit: summary: I don't agree with this quote because I think logical beliefs shouldn't update upon observing continued survival because there is nothing else we can observe. It is not my position that we should assume alignment is easy because we'll die if it's not) I think that language in discussions of anthropics is unintentionally prone to masking ambiguities or conflations, especially wrt logical vs indexical probability, so I want to be very careful writing about this. I think there may be some conceptual conflation happening here, but I'm not sure how to word it. I'll see if it becomes clear indirectly. One difference between our intuitions may be that I'm implicitly thinking within a manyworlds frame. Within that frame it's actually certain that we'll solve alignment in some branches. So if we then 'condition on solving alignment in the future', my mind defaults to something like this: "this is not much of an update, it just means we're in a future where the past was not a death outcome. Some of the pasts leading up to those futures had really difficult solutions, and some of them managed to find easier ones or get lucky. The probabilities of these non-death outcomes relative to each other have not changed as a result of this conditioning." (I.e I disagree with the top quote) The most probable reason I can see for this difference is if you're thinking in terms of a single future, where you expect to die.[1] In this frame, if you observe yourself surviving, it may seem[2] you should update your logical belief that alignment is hard (because P(continued observation|alignment being hard) is low, if we imagine a single future, but certain if we imagine the space of indexically possible futures). Whereas I read it as only indexical, and am generally thinking about this in terms of indexical probabilities. I totally agree that we shouldn't update our logical beliefs in this way. I.e., that with regard to beliefs about logical probabilities (such as 'alignme

Yes, but

  1. This update is screened off by "you actually looking at the past and checking whether we got lucky many times or there is a consistent reason". Of course, you could claim that our understanding of the past is not perfect, and thus should still update, only less so. Although to be honest, I think there's a strong case for the past clearly showing that we just got lucky a few times.
  2. It sounded like you were saying the consistent reason is "our architectures are non-agentic". This should only constitute an anthropic update to the extent you think more-
... (read more)
1[anonymous]
(I think I misinterpreted your question and started drafting another response, will reply to relevant portions of this reply there)

Under the anthropic principle, we should expect there to be a 'consistent underlying reason' for our continued survival.


Why? It sounds like you're anthropic updating on the fact that we'll exist in the future, which of course wouldn't make sense because we're not yet sure of that. So what am I missing?

1[anonymous]
The quote you replied to was meant to be about the past.[1] (paragraph retracted due to unclarity) Specifically, I think that ("we find a fully-general agent-alignment solution right as takeoff is very near" given "early AGIs take a form that was unexpected") is less probable than ("observing early AGI's causes us to form new insights that lead to a different class of solution" given "early AGIs take a form that was unexpected"). Because I think that, and because I think we're at that point where takeoff is near, it seems like it's some evidence for being on that second path. I do think that's possible (I don't have a good enough model to put a probability on it though). I suspect that superintelligence is possible to create with much less compute than is being used for SOTA LLMs. Here's a thread with some general arguments for this. I think my understanding of why we've survived so far re:AI is very not perfect. For example, I don't know what would have needed to happen for training setups which would have produced agentic superintelligence by now to be found first, or (framed inversely) how lucky we needed to be to survive this far. ~~~ I'm not sure if this reply will address the disagreement, or if it will still seem from your pov that I'm making some logical mistake. I'm not actually fully sure what the disagreement is. You're welcome to try to help me understand if one remains. I'm sorry if any part of this response is confusing, I'm still learning to write clearly. 1. ^ I originally thought you were asking why it's true of the past, but then I realized we very probably agreed (in principle) in that case.
1[comment deleted]

Interesting, but I'm not sure how successful the counterexample is. After all, if your terminal goal in the whole environment was truly for your side to win, then it makes sense to understand anything short of letting Shin play as a shortcoming of your optimization (with respect to that goal). Of course, even in the case where that's your true goal and you're committing a mistake (which is not common), we might want to say that you are deploying a lot of optimization, with respect to the different goal of "winning by yourself", or "having fun", which is co... (read more)

Claude learns across different chats. What does this mean?

 I was asking Claude 3 Sonnet "what is a PPU" in the context of this thread. For that purpose, I pasted part of the thread.

Claude automatically assumed that OA meant Anthropic (instead of OpenAI), which was surprising.

I opened a new chat, copying the exact same text, but with OA replaced by GDM. Even then, Claude assumed GDM meant Anthropic (instead of Google DeepMind).

This seemed like interesting behavior, so I started toying around (in new chats) with more tweaks to the prompt to check its ro... (read more)

From here:

Profit Participation Units (PPUs) represent a unique compensation method, distinct from traditional equity-based rewards. Unlike shares, stock options, or profit interests, PPUs don't confer ownership of the company; instead, they offer a contractual right to participate in the company's future profits.
