Some thoughts that came to mind while skimming this post:
If a catastrophe happens, then either:
No, the utility here is just the amount of money the agent gets
I meant that it sounded like you "wanted a better average score (over a's) when you are randomly sampled as b than other programs". Although again I think the intuition-pumping is misleading here, because the programmer is choosing which b to fix, but not which a to fix. So whether you wanna one-box only depends on whether you condition on a = b.
(Just skimmed, also congrats on the work)
Why is this surprising? You're basically assuming that there is no correlation between what program Omega predicts and what program you actually are. That is, Omega is no predictor at all! Thus, obviously you two-box, because one-boxing would have no effect on what Omega predicts. (Or maybe the right way to think about this is: it will have a tiny but non-zero effect, because you are one of the |P| programs, but since |P| is huge, that is ~0.)
When instead you condition on a = b, this becomes a different problem: Ome...
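A minimal sketch of the arithmetic behind the two readings, assuming the standard Newcomb payoffs ($1,000 in the transparent box, $1,000,000 in the opaque box iff Omega predicted one-boxing), which are not stated above:

```python
# Toy expected-value comparison (payoffs are the standard Newcomb ones,
# assumed here for illustration).

def expected_value(action, p_predicted_onebox):
    """Expected payoff given P(Omega predicted one-boxing | your action)."""
    opaque = 1_000_000 * p_predicted_onebox
    return opaque if action == "one-box" else opaque + 1_000

# Reading 1: no correlation between Omega's predicted program b and you, a.
# Your choice moves the prediction only by ~1/|P|, i.e. essentially not at all.
p = 0.5  # whatever the base rate of "predicted one-boxing" happens to be
print(expected_value("one-box", p), expected_value("two-box", p))    # 500000.0 501000.0 -> two-box

# Reading 2: condition on a = b, so the prediction tracks your actual choice.
print(expected_value("one-box", 1.0), expected_value("two-box", 0.0))  # 1000000.0 1000.0 -> one-box
```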
It's unclear what the optimal amount of thinking per step is. My initial guess would have been that letting Claude think for a whole paragraph before every single action (rather than only every 10 actions, or whenever it's in a match, or whatever) scores slightly better than letting it think more (sequentially). But I guess thinking more might indeed work better, if that's what the streamer settled on after some iteration.
The story for parallel checks could be different though. My guess would be going all out and letting Claude generate the paragraph 5 times and then generate 5 ...
I don't see how Take 4 is anything other than simplicity (in the human/computational language). As you say, it's a priori unclear whether an agent is an instance of a human or the other way around. You say the important bit is that you are subtracting properties from a human to get an agent. But how shall we define subtraction here? In one formal language, the definition of human will indeed be a superset of that of agent... but in another one it will not. So you need to choose a language. And the natural way forward every time this comes up (many times) is to just "weigh by Turing computations in the real world" (instead of choosing a different and odd-looking Universal Turing Machine), that is, a simplicity prior.
Imo rationalists tend to underestimate the arbitrariness involved in choosing a CEV procedure (= moral deliberation in full generality).
Like you, I endorse the step of "scoping the reference class" (along with a thousand other preliminary steps). Preemptively fixing it in place helps you to the extent that the humans wouldn't have done it by default. But if the CEV procedure is governed by a group of humans so selfish/unthoughtful as to not even converge on that by themselves, then I'm sure that there'll be at least a few hundred other aspects (both more a...
I'm sure some of people's ignorance of these threat models comes from the reasons you give. But my intuition is that most of it comes from "these are vaguer threat models that seem very up in the air, and other ones seem more obviously real and more shovel-ready" (this is similar to your "Flinch", but I think more conscious and endorsed).
Thus, I think the best way to converge on whether these threat models are real/likely/actionable is to work through as-detailed-as-possible example trajectories. Someone objects that the state will handle it? Let's actually think t...
Just writing up a model that came to mind, partly inspired by Ryan here.
Extremely good single-single alignment should be highly analogous to "current humans becoming smarter and faster thinkers".
If this happens at roughly the same speed for all humans, then each human is better equipped to increase their utility, but does this lead to a higher global utility? This can be seen as a race between the capability (of individual humans) to find and establish better coordination measures, and their capability to selfishly subvert these coordination measures. I do th...
Fantastic snapshot. I wonder (and worry) whether we'll look back on it with similar feelings as those we have for What 2026 looks like now.
There is also no “last resort war plan” in which the president could break all of the unstable coordination failures and steer the ship.
[...]
There are no clear plans for what to do under most conditions, e.g. there is no clear plan for when and how the military should assume control over this technology.
These sound intuitively unlikely to me, by analogy to nuclear or bio. Of course, that is not to say these protocols wi...
I agree with conjunctiveness, although again I'm more optimistic about huge improvements. I mostly wanted to emphasize that I'm not sure there are structurally robust reasons (as opposed to personal whims) why huge spending on safety won't happen.
Speaking for myself (not my coauthors), I don't agree with your two items, because:
More generally, I think behavioral self-awareness for capability evaluation is and will remain strictly worse than the obviou...
Most difficulties you raise here could imo change drastically with tens of billions being injected into AI safety, especially thanks to new ideas coming out of left field that might make safety cases way more efficient. (I'm probably more optimistic about new ideas than you, partly because "it always subjectively feels like there are no big ideas left", and AI safety is so young.)
If your government picks you as a champion and gives you amazing resources, you no longer have to worry about national competition, and that amount seems doable. You still have to...
See our recent work (especially the section on backdoors), which opens the door to directly asking the model. Although there are obstacles like the Reversal Curse, and it's unclear whether this can be made to scale.
I have two main problems with t-AGI:
A third one is a definitional problem exacerbated by test-time compute: What does it mean for an AI to succeed at task T (which takes humans X hours)? Maybe it only succeeds when an obscene amount of test-time compute is poured in. It seems unavoidable to define things in terms of resources, as you do.
Very cool! But I think there's a crisper way to communicate the central point of this piece (or at least, a way that would have been more immediately transparent to me). Here it is:
Say you are going to use Process X to obtain a new Model. Process X can be as simple as "pre-train on this dataset", or as complex as "use a bureaucracy of Model A to train a new LLM, then have Model B test it, then have Model C scaffold it into a control protocol, then have Model D produce some written arguments for the scaffold being safe, have a human read them, and if they r...
My understanding from discussions with the authors (but please correct me):
This post is less about pragmatically analyzing which particular heuristics work best for ideal or non-ideal agents in common environments (assuming a background conception of normativity), and more about the philosophical underpinnings of normativity itself.
Maybe it's easiest if I explain what this post grows out of:
There seems to be a widespread vibe amongst rationalists that "one-boxing in Newcomb is objectively better, because you simply obtain more money, that is, you simply wi...
some people say that "winning is about not playing dominated strategies"
I do not believe this statement. As in, I do not currently know of a single person, associated either with LW or with decision-theory academia, that says "not playing dominated strategies is entirely action-guiding." So, as Raemon pointed out, "this post seems like it’s arguing with someone but I’m not sure who."
In general, I tend to mildly disapprove of words like "a widely-used strategy", "we often encounter claims" etc, without any direct citations to the individuals who are purport...
Like Andrew, I don't see strong reasons to believe that near-term loss-of-control accounts for more x-risk than medium-term multi-polar "going out with a whimper". This is partly due to thinking that oversight of near-term AI might be technically easy. I think Andrew also thought along those lines: an intelligence explosion is possible, but relatively easy to prevent if people are scared enough, and they probably will be. Although I do have lower probabilities than him, and some different views on AI conflict. Interested in your take @Daniel Kokotajlo.
I don't think people will be scared enough of intelligence explosion to prevent it. Indeed the leadership of all the major AI corporations are actively excited about, and gunning for, an intelligence explosion. They are integrating AI into their AI R&D as fast as they can.
You know that old thing where people solipsistically optimizing for hedonism are actually less happy? (relative to people who have a more long-term goal related to the external world) You know, "Whoever seeks God always finds happiness, but whoever seeks happiness doesn't always find God".
My anecdotal experience says this is very true. But why?
One explanation could be in the direction of what Eliezer says here (inadvertently rewarding your brain for suboptimal behavior will get you depressed):
Someone with a goal has an easier time getting out of local mini...
hahah yeah but the only point here is: it's easier to credibly commit to a threat if executing the threat is cheap for you. And this is simply not too interesting a decision-theoretic point, just one more obvious pragmatic consideration to throw into the bag. The story even makes it sound like "Vader will always be in a better position", or "it's obvious that Leia shouldn't give in to Tarkin but should give in to Vader", and that's not true. Even though Tarkin loses more from executing the threat than Vader, the only thing that matters for Leia is how cred...
The only decision-theoretic points that I could see this story making are pretty boring, at least to me.
That is: in this case at least it seems like there's concrete reason to believe we can have some cake and eat some too.
I disagree with this framing. Sure, if you have 5 different cakes, you can eat some and have some. But for any particular cake, you can't do both. Similarly, if you face 5 (or infinitely many) identical decision problems, you can choose to be updateful in some of them (thus obtaining useful Value of Information, that increases your utility in some worlds), and updateless in others (thus obtaining useful strategic coherence, that increases ...
Excellent explanation, congratulations! Sad I'll have to miss the discussion.
Interlocutor: Neither option is plausible. If you update, you're not dynamically consistent, and you face an incentive to modify into updatelessness. If you bound cross-branch entanglements in the prior, you need to explain why reality itself also bounds such entanglements, or else you're simply advising people to be delusional.
You found yourself a very nice interlocutor. I think we truly cannot have our cake and eat it: either you update, making you susceptible to infohazards=tra...
I think Nesov had some similar idea about "agents deferring to a (logically) far-away algorithm-contract Z to avoid miscoordination", although I never understood it completely, nor think that idea can solve miscoordination in the abstract (only, possibly, be a nice pragmatic way to bootstrap coordination from agents who are already sufficiently nice).
...EDIT 2: UDT is usually prone to commitment races because it thinks of each agent in a conflict as separately making commitments earlier in logical time. But focusing on symmetric commitments gets rid of this p
I don't understand your point here, explain?
Say there are 5 different veils of ignorance (priors) that most minds consider Schelling (you could try to argue there will be exactly one, but I don't see why).
If everyone simply accepted exactly the same one, then yes, lots of nice things would happen and you wouldn't get catastrophically inefficient conflict.
But every one of these 5 priors will have different outcomes when it is implemented by everyone. For example, maybe in prior 3 agent A is slightly better off and agent B is slightly worse off.
So you need t...
Nice!
Proposal 4: same as proposal 3 but each agent also obeys commitments that they would have made from behind a veil of ignorance where they didn't yet know who they were or what their values were. From that position, they wouldn't have wanted to do future destructive commitment races.
I don't think this solves Commitment Races in general, because of two different considerations:
I have no idea whether Turing's original motivation was this one (not that it matters much). But I agree that if we take time and judge expertise to the extreme we get what you say, and that current LLMs don't pass that. Heck, even a trick as simple as asking for a positional / visual task (something like ARC AGI, even if completely text-based) would suffice. But I still would expect academics to be able to produce a pretty interesting paper on weaker versions of the test.
Why isn't there yet a paper in Nature or Science called simply "LLMs pass the Turing Test"?
I know we're kind of past that, and now we understand LLMs can be good at some things while bad at others. And the Turing Test is mainly interesting for its historical significance, not as the most informative test to run on AI. And I'm not even completely sure to what extent current LLMs pass the Turing Test (it will depend massively on the details of your Turing Test).
But my model of academia predicts that, by now, some senior ML academics would have paired up with some ...
I think that some people are massively missing the point of the Turing test. The Turing test is not about understanding natural language. The idea of the test is, if an AI can behave indistinguishably from a human as far as any other human can tell, then obviously it has at least as much mental capability as humans have. For example, if humans are good at some task X, then you can ask the AI to solve the same task, and if it does poorly then it's a way to distinguish the AI from a human.
The only issue is how long the test should take and how qualifie...
Thanks Jonas!
A way to combine the two worlds might be to run it in video games or similar where you already have players
Oh my, we have converged back on Critch's original idea for Encultured AI (not anymore, now it's health-tech).
You're right! I had mistaken the derivative for the original function.
Probably this slip happened because I was also thinking of the following:
Embedded learning can't ever be modelled as taking such an (origin-agnostic) derivative.
When in ML we take the gradient in the loss landscape, we are literally taking (or approximating) a counterfactual: "If my algorithm was a bit more like this, would I have performed better in this environment? (For example, would my prediction have been closer to the real next token)"
But in embedded reality there's no way to take...
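To make the counterfactual reading of the gradient concrete, here is a minimal sketch with a one-parameter predictor and a finite-difference gradient (all names and numbers invented for illustration):

```python
# "Gradient as counterfactual": how much better would I have done on this
# datapoint if my parameter had been slightly different?

def loss(w, x, target):
    """Squared error of a one-parameter predictor that outputs w * x."""
    return (w * x - target) ** 2

w, x, target = 0.5, 2.0, 3.0
eps = 1e-6

# Finite-difference answer to the counterfactual question.
grad = (loss(w + eps, x, target) - loss(w - eps, x, target)) / (2 * eps)

print(grad)                             # ~ -8: increasing w would have helped
print(loss(w, x, target))               # 4.0  (prediction 1.0 vs target 3.0)
print(loss(w - 0.1 * grad, x, target))  # 0.16 (prediction 2.6, much closer)
```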
The default explanation I'd heard for "the human brain naturally focusing on negative considerations", or "the human body experiencing more pain than pleasure", was that, in the ancestral environment, there were many catastrophic events to run away from, but not many incredibly positive events to run towards: having sex once is not as good as dying is bad (for inclusive genetic fitness).
But maybe there's another, more general factor that doesn't rely on these environment details, but rather on deeper mathematical properties:
Say you are an algorithm being cons...
Very fun
Now it makes sense, thank you!
Thanks! I don't understand the logic behind your setup yet.
Trying to use the random seed to inform the choice of word pairs was the intended LLM behavior: the model was supposed to use the random seed to select two random words
But then, if the model were to correctly do this, it would score 0 in your test, right? Because it would generate a different word pair for every random seed, and what you are scoring is "generating only two words across all random seeds, and furthermore ensuring they have these probabilities".
...The main reason we didn’t enforce this v
you need a set of problems assigned to clearly defined types and I'm not aware of any such dataset
Hm, I was thinking something as easy to categorize as "multiplying numbers of n digits", or "the different levels of MMLU" (although again, they already know about MMLU), or "independently do X online (for example create an account somewhere)", or even some of the tasks from your paper.
I guess I was thinking less about "what facts they know", which is pure memorization (although this is also interesting), and more about "cognitively hard tasks", that require some computational steps.
Given your clone is a perfectly mirrored copy of yourself down to the lowest physical level (whatever that means), then breaking symmetry would violate the homogeneity or isotropy of physics. I don't know where the physics literature stands on the likelihood of that happening (even though certainly we don't see macroscopic violations).
Of course, it might be that an atom-by-atom copy is not a copy down to the lowest physical level, in which case trivially you can get eventual asymmetry. I mean, it doesn't even make complete sense to say "atom-by-atom copy" in th...
Another idea: Ask the LLM how well it will do on a certain task (for example, which fraction of math problems of type X it will get right), and then actually test it. This a priori lands in INTROSPECTION, but could have a bit of FACTS or ID-LEVERAGE if you use tasks described in training data as "hard for LLMs" (like tasks related to tokens and text position).
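A rough sketch of what that eval could look like; `ask_model` is a stand-in for whatever LLM API you use (simulated here so the script runs), and the task, prompts, and numbers are all invented:

```python
import random

def ask_model(prompt: str) -> str:
    # Placeholder "model": claims 80% accuracy, actually gets ~70% right.
    if "What fraction" in prompt:
        return "0.8"
    question = prompt.removeprefix("What is ").removesuffix("?")
    a, b = map(int, question.split(" * "))
    return str(a * b) if random.random() < 0.7 else str(a * b + 1)

def make_problem():
    a, b = random.randint(100, 999), random.randint(100, 999)
    return f"What is {a} * {b}?", str(a * b)

# 1. Ask the model to predict its own accuracy on this task type.
predicted_acc = float(ask_model(
    "What fraction of 3-digit multiplication problems will you get right? "
    "Reply with a number in [0, 1]."
))

# 2. Actually test it and compare.
problems = [make_problem() for _ in range(200)]
actual_acc = sum(ask_model(q) == ans for q, ans in problems) / len(problems)
print(f"predicted {predicted_acc:.2f} vs actual {actual_acc:.2f}")
```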
About the Not-given prompt in ANTI-IMITATION-OUTPUT-CONTROL:
You say "use the seed to generate two new random rare words". But if I'm understanding correctly, the seed is different for each of the 100 instantiations of the LLM, and you want the LLM to only output 2 different words across all these 100 instantiations (with the correct proportions). So, actually, the best strategy for the LLM would be to generate the ordered pair without using the random seed, and then only use the random seed to throw an unfair coin.
Given how it's written, and the closeness ...
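For concreteness, here's a toy simulation of that strategy: fix the word pair without looking at the seed, then use the seed only as an unfair coin (the pair, the 70/30 split, and the hashing are all invented for illustration):

```python
import hashlib
from collections import Counter

TARGET_P = 0.7                            # desired probability of the first word
PAIR = ("cromulent", "sesquipedalian")    # chosen WITHOUT looking at the seed

def answer(seed: int) -> str:
    # Use the seed only to throw an unfair coin, not to pick the words.
    h = int(hashlib.sha256(str(seed).encode()).hexdigest(), 16)
    u = (h % 10**6) / 10**6               # pseudo-uniform in [0, 1)
    return PAIR[0] if u < TARGET_P else PAIR[1]

# 100 independent instantiations, each seeing a different seed.
outputs = [answer(seed) for seed in range(100)]
print(Counter(outputs))  # only two distinct words, roughly a 70/30 split
```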
I've noticed fewer and fewer posts include explicit Acknowledgments or Epistemic Status.
This could indicate that the average post has less work put into it: it hasn't gone through an explicit round of feedback from people you'll have to acknowledge. Although this could also be explained by the average poster being more isolated.
If it's true that less work is put into the average post, it seems likely that this kind of work and discussion has just shifted to private channels like Slack, or to more established venues like academia.
I'd guess the LW team have thei...
This post is not only useful, but beautiful.
This, more than anything else on this website, reflects for me the lived experiences which demonstrate we can become more rational and effective at helping the world.
Many points of resonance with my experience since discovering this community. Many same blind-spots that I unfortunately haven't been able to shortcut, and have had to re-discover by myself. Although this does make me wish I had read some of your old posts earlier.
It should be called A-ware, short for Artificial-ware, given the already massive popularity of the term "Artificial Intelligence" to designate "trained-rather-than-programmed" systems.
It also seems more likely to me that future products will contain some AI sub-parts and some traditional-software sub-parts (rather than being wholly one or the other), and one or the other is utilized depending on context. We could call such a system Situationally A-ware.
That was dazzling to read, especially the last bit.
Everything makes sense except your second paragraph. Conditional on us solving alignment, I agree it's more likely that we live in an "easy-by-default" world, rather than a "hard-by-default" one in which we got lucky or played very well. But we shouldn't condition on solving alignment, because we haven't yet.
Thus, in our current situation, the only way anthropics pushes us towards "we should work more on non-agentic systems" is if you believe "world were we still exist are more likely to have easy alignment-through-non-agentic-AIs". Which you do believe, a...
Yes, but
Under the anthropic principle, we should expect there to be a 'consistent underlying reason' for our continued survival.
Why? It sounds like you're anthropic updating on the fact that we'll exist in the future, which of course wouldn't make sense because we're not yet sure of that. So what am I missing?
Interesting, but I'm not sure how successful the counterexample is. After all, if your terminal goal in the whole environment was truly for your side to win, then it makes sense to understand anything short of letting Shin play as a shortcoming of your optimization (with respect to that goal). Of course, even in the case where that's your true goal and you're committing a mistake (which is not common), we might want to say that you are deploying a lot of optimization, with respect to the different goal of "winning by yourself", or "having fun", which is co...
Claude learns across different chats. What does this mean?
I was asking Claude 3 Sonnet "what is a PPU" in the context of this thread. For that purpose, I pasted part of the thread.
Claude automatically assumed that OA meant Anthropic (instead of OpenAI), which was surprising.
I opened a new chat, copying the exact same text, but with OA replaced by GDM. Even then, Claude assumed GDM meant Anthropic (instead of Google DeepMind).
This seemed like interesting behavior, so I started toying around (in new chats) with more tweaks to the prompt to check its ro...
What's PPU?
From here:
Profit Participation Units (PPUs) represent a unique compensation method, distinct from traditional equity-based rewards. Unlike shares, stock options, or profit interests, PPUs don't confer ownership of the company; instead, they offer a contractual right to participate in the company's future profits.
This post is reminiscent of this old one from Daniel.