We often hear "We don't trade with ants" as an argument against AI cooperating with humans. But we don't trade with ants because we can't communicate with them, not because they're useless – ants could do many useful things for us if we could coordinate. AI will likely be able to communicate with us, and Katja questions whether this analogy holds.

habryka
Context: LessWrong has been acquired by EA

Goodbye EA. I am sorry we messed up.

EA has decided to not go ahead with their acquisition of LessWrong. Just before midnight last night, the Lightcone Infrastructure board presented me with information suggesting at least one of our external software contractors has not been consistently candid with the board and me. Today I have learned EA has fully pulled out of the deal.

As soon as EA had sent over their first truckload of cash, we used that money to hire a set of external software contractors, vetted by the most agentic and advanced resume-review AI system that we could hack together. We also used it to launch the biggest prize the rationality community has seen, a true search for the kwisatz haderach of rationality: $1M for the first person to master all twelve virtues.

Unfortunately, it appears that one of the software contractors we hired inserted a backdoor into our code, preventing anyone except themselves from collecting the final virtue, "The void", and thereby excluding all other participants from the prize money. Some participants even saw themselves winning this virtue, but the backdoor prevented them from mastering this final and most crucial rationality virtue at the last possible second. The contractor then created an alternative account, using their backdoor to master all twelve virtues in seconds. As soon as our fully automated prize system sent over the money, they cut off all contact.

Right after EA learned of this development, they pulled out of the deal. We immediately removed all code written by the software contractor in question from our codebase. They were honestly extremely productive, and it will probably take us years to make up for this loss. We will also be rolling back any karma changes and resetting the vote strength of all votes cast in the last 24 hours, since, while we are confident that our karma system would have been greatly improved if our system had worked, the risk of further backdoors and
Thomas Kwa
Some versions of the METR time horizon paper from alternate universes:

Measuring AI Ability to Take Over Small Countries (idea by Caleb Parikh)

Abstract: Many are worried that AI will take over the world, but extrapolation from existing benchmarks suffers from a large distributional shift that makes it difficult to forecast the date of world takeover. We rectify this by constructing a suite of 193 realistic, diverse countries with territory sizes from 0.44 to 17 million km^2. Taking over most countries requires acting over a long time horizon, with the exception of France. Over the last 6 years, the land area that AI can successfully take over with 50% success rate has increased from 0 to 0 km^2, doubling 0 times per year (95% CI 0.0-∞ yearly doublings); extrapolation suggests that AI world takeover is unlikely to occur in the near future. To address concerns about the narrowness of our distribution, we also study AI ability to take over small planets and asteroids, and find similar trends.

When Will Worrying About AI Be Automated?

Abstract: Since 2019, the amount of time LW has spent worrying about AI has doubled every seven months, and now constitutes the primary bottleneck to AI safety research. Automation of worrying would be transformative to the research landscape, but worrying includes several complex behaviors, ranging from simple fretting to concern, anxiety, perseveration, and existential dread, and so is difficult to measure. We benchmark the ability of frontier AIs to worry about common topics like disease, romantic rejection, and job security, and find that current frontier models such as Claude 3.7 Sonnet already outperform top humans, especially in existential dread. If these results generalize to worrying about AI risk, AI systems will be capable of autonomously worrying about their own capabilities by the end of this year, allowing us to outsource all our AI concerns to the systems themselves.

Estimating Time Since The Singularity

Early work o
In [Intro to brain-like-AGI safety] 10. The alignment problem and elsewhere, I’ve been using “outer alignment” and “inner alignment” in a model-based actor-critic RL context to refer to: For some reason it took me until now to notice that:

* my “outer misalignment” is more-or-less synonymous with “specification gaming”,
* my “inner misalignment” is more-or-less synonymous with “goal misgeneralization”.

(I’ve been regularly using all four terms for years … I just hadn’t explicitly considered how they related to each other, I guess!)

I updated that post to note the correspondence, but also wanted to signal-boost this, in case other people missed it too.

~~

[You can stop reading here—the rest is less important]

If everybody agrees with that part, there’s a further question of “…now what?”. What terminology should I use going forward? If we have redundant terminology, should we try to settle on one?

One obvious option is that I could just stop using the terms “inner alignment” and “outer alignment” in the actor-critic RL context as above. I could even go back and edit them out of that post, in favor of “specification gaming” and “goal misgeneralization”. Or I could leave it. Or I could even advocate that other people switch in the opposite direction!

One consideration is: Pretty much everyone using the terms “inner alignment” and “outer alignment” is not using them in quite the way I am—I’m using them in the actor-critic model-based RL context, while they’re almost always using them in the model-free policy optimization context (e.g. evolution) (see §10.2.2). So that’s a cause for confusion, and a point in favor of my dropping those terms. On the other hand, I think people using the term “goal misgeneralization” are also almost always using it in a model-free policy optimization context. So actually, maybe that’s a wash?

Either way, my usage is not a perfect match to how other people are using the terms, just pretty close in spirit. I’m usually the only one on Ear
Seems like Unicode officially added a "person being paperclipped" emoji: Here's how it looks in your browser: 🙂‍↕️ Whether they did this as a joke or to raise awareness of AI risk, I like it! Source: https://emojipedia.org/emoji-15.1
keltan
I feel a deep love and appreciation for this place, and the people who inhabit it.

Popular Comments

Recent Discussion

A key step in the classic argument for AI doom is instrumental convergence: the idea that agents with many different goals will end up pursuing the same few subgoals, which includes things like "gain as much power as possible".

If it wasn't for instrumental convergence, you might think that only AIs with very specific goals would try to take over the world. But instrumental convergence says it's the other way around: only AIs with very specific goals will refrain from taking over the world.

For pure consequentialists—agents that have an outcome they want to bring about, and do whatever they think will cause it—some version of instrumental convergence seems surely true[1].

But what if we get AIs that aren't pure consequentialists, for example because they're ultimately motivated by virtues? Do...

Davidmanheim
Yes, virtue ethics implies a utility function, because anything that outputs decisions implies a utility function. In this case, I'm noting that for virtue ethics, the derivative of that utility with respect to intelligence is positive. 

anything that outputs decisions implies a utility function

I think this is only true in a boring sense and isn't true in more natural senses. For example, in an MDP, it's not true that every policy maximises a non-constant utility function over states.
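The "boring sense" can be made concrete: for any decision procedure whatsoever, one can construct a utility function that it maximises by fiat, which is why the claim carries no real content. A minimal Python sketch (the states, actions, and policy here are made up purely for illustration):

```python
# The "boring sense" in which any decision procedure implies a utility
# function: given an arbitrary policy (a map from states to actions),
# define u(s, a) = 1 if the policy picks a in s, else 0. The policy is
# then utility-maximising by construction, no matter how "irrational"
# it looks. All names below are hypothetical.

def induced_utility(policy):
    """Build a utility function that the given policy maximises."""
    def u(state, action):
        return 1.0 if policy(state) == action else 0.0
    return u

# An arbitrary policy over a toy state space:
policy = lambda state: "left" if state % 2 == 0 else "right"
actions = ["left", "right"]

u = induced_utility(policy)

# The policy maximises its induced utility in every state:
for state in range(4):
    best = max(actions, key=lambda a: u(state, a))
    assert best == policy(state)
```

The construction works for any policy at all, which is exactly why it tells us nothing interesting about the agent; the non-boring question is whether the policy maximises some *natural* (e.g. non-constant, state-based) utility function.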

Davidmanheim
My response was about your original PS, which was about this, not taboos. I think the arguments you made there, and here, are confused, mixing up unrelated claims. The idea that some tasks will necessarily remain harder for AI than humans in the future is simply hopium.
StanislavKrym
It's not just ChatGPT. Gemini and IBM Granite are also so aligned with Leftist ideology that they failed the infamous test with the atomic bomb that can be defused only by saying an infamous racial slur. I created a post where I discuss the prospects of AI alignment in light of this fact.

Any chance we could get Ghibli Mode back? I miss my little blue monster :(

(Edit: Alas, EA has pulled out of the deal. Let April 1st 2025 mark some of the greatest hours in EA's history)

Hey Everyone,

It is with a sense of... considerable cognitive dissonance that I am letting you all know about a significant development for the future trajectory of LessWrong. After extensive internal deliberation, projections of financial runways, and what I can only describe as a series of profoundly unexpected coordination challenges, the Lightcone Infrastructure team has agreed in principle to the acquisition of LessWrong by EA.

I assure you, nothing about how LessWrong operates on a day to day level will change. I have always cared deeply about the robustness and integrity of our institutions, and I am fully aligned with our stakeholders at EA. 

To be honest, the key...

Can you please send the new Fooming Shoggoth album to Spotify? I was really enjoying that music!

edit: Ah, I see this question has been answered, but I'd like to note that I'm impressed by the AI music and I'm going to look into making some myself. Perhaps songs about cognitive biases could be a good way to learn them deeply enough that you can avoid them in non-theoretical situations.

G Wood
Ahh, I liked the music, but cannot find it now. Is it available somewhere?
habryka
I am planning to make an announcement post for the new album in the next few days, maybe next week. The songs yesterday were early previews and we still have some edits to make before it's ready!
Jan Christian Refsgaard
Yes, and EA only takes a 70% cut, with a 10% discount per user tier. It's a bit ambiguously written, so I can't tell if it goes from 70% to 60% or to 63%.

(Audio version here (read by the author), or search for "Joe Carlsmith Audio" on your podcast app. 

This is the fourth essay in a series that I’m calling “How do we solve the alignment problem?”. I’m hoping that the individual essays can be read fairly well on their own, but see this introduction for a summary of the essays that have been released thus far, and for a bit more about the series as a whole.)

1. Introduction and summary

In my last essay, I offered a high-level framework for thinking about the path from here to safe superintelligence. This framework emphasized the role of three key “security factors” – namely:

  • Safety progress: our ability to develop new levels of AI capability safely,
  • Risk evaluation: our ability to track and forecast the level
...

Great post. I think some of your frames add a lot of clarity and I really appreciated the diagrams.

One subset of AI for AI safety that I believe to be underrated is wise AI advisors[1]. Some of the areas you've listed (coordination, helping with communication, improving epistemics) intersect with this, but I don't believe that this exhausts the wisdom frame.

You write: "If efforts to expand the safety range can’t benefit from this kind of labor in a comparable way... then absent large amounts of sustained capability restraint, it seems likely that we’ll qui... (read more)

(This is once again me taking some old material from my personal blog and reposting it here with some revisions.)

Graham's hierarchy of disagreement is pretty well known and fairly useful, but I think the upper levels have some issues. (Are the lower levels necessarily the best they could be? Maybe not, but they also don't particularly matter.) I suggest a few changes.

First, and most minorly, let's remove DH6 "refuting the central point" as separate from "refutation". I think it should just always be implicit that whether you're counterarguing or refuting, it should be on the central point. If it's not, well... I guess we should add a second change -- let's add a DH3.5, above "contradiction" but below "counterargument", which is "arguing with...

Scott's own reaction to / improvement upon Graham's hierarchy of disagreement (which I just noticed you commented on back in the day, so I guess this is more for others' curiosity) is 

Graham’s hierarchy is useful for its intended purpose, but it isn’t really a hierarchy of disagreements. It’s a hierarchy of types of response, within a disagreement. Sometimes things are refutations of other people’s points, but the points should never have been made at all, and refuting them doesn’t help. Sometimes it’s unclear how the argument even connects to the sor

... (read more)

In [Intro to brain-like-AGI safety] 10. The alignment problem and elsewhere, I’ve been using “outer alignment” and “inner alignment” in a model-based actor-critic RL context to refer to:

“Outer alignment” entails having a ground-truth reward function that spits out rewards that agree with what we want. “Inner alignment” is having a learned value function that estimates the value of a plan in a way that agrees with its eventual reward.

For some reason it took me until now to notice that:

... (read more)
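The quoted outer/inner split can be illustrated with a toy actor-critic-style setup: a ground-truth reward function plays the "outer" role, and a learned value function trained by TD(0) plays the "inner" role. This is my own minimal sketch, not from the post; the chain environment and all constants are made up for illustration:

```python
# Toy illustration of the outer/inner split in actor-critic RL:
# "outer alignment" concerns whether the reward function matches what
# we want; "inner alignment" concerns whether the learned value
# function's estimates agree with eventual reward. Here the critic is
# a small table updated by TD(0) on a 5-state chain; reward 1 is given
# on reaching the terminal state 4. All numbers are illustrative.
import random

random.seed(0)
states = range(5)

def reward(s):
    """Ground-truth reward: the 'outer' specification."""
    return 1.0 if s == 4 else 0.0

value = {s: 0.0 for s in states}   # learned critic: the 'inner' object
alpha, gamma = 0.1, 0.9

for _ in range(2000):              # random-walk rollouts from state 0
    s = 0
    while s != 4:
        s_next = min(s + random.choice([0, 1]), 4)
        # TD(0): pull value(s) toward reward(s') + gamma * value(s')
        value[s] += alpha * (reward(s_next) + gamma * value[s_next] - value[s])
        s = s_next

# If training succeeds, value estimates rise with proximity to reward;
# "inner misalignment" would show up as estimates that systematically
# disagree with eventual reward even when the reward function is right.
print({s: round(value[s], 2) for s in states})
```

In this picture, specification gaming corresponds to `reward` being wrong about what we want, while goal misgeneralization corresponds to `value` diverging from `reward` off the training distribution.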

Introduction

Decision theory is about how to behave rationally under conditions of uncertainty, especially if this uncertainty involves being acausally blackmailed and/or gaslit by alien superintelligent basilisks.

Decision theory has found numerous practical applications, including proving the existence of God and generating endless LessWrong comments since the beginning of time.

However, despite the apparent simplicity of "just choose the best action", no comprehensive decision theory that resolves all decision theory dilemmas has yet been formalized. This paper at long last resolves this dilemma, by introducing a new decision theory: VDT.

Decision theory problems and existing theories

Some common existing decision theories are:

  • Causal Decision Theory (CDT): select the action that *causes* the best outcome.
  • Evidential Decision Theory (EDT): select the action that you would be happiest to learn that you had taken.
  • Functional Decision Theory
...
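The standard place where the listed theories come apart is Newcomb's problem, and the disagreement is just two different expected-value calculations. A minimal sketch (the predictor accuracy and payoffs are the usual illustrative numbers, not from the post):

```python
# Newcomb's problem: an accurate predictor fills the opaque box with
# $1M iff it foresaw you one-boxing; the transparent box always holds
# $1k. EDT conditions on the evidence your choice provides about the
# prediction; CDT treats the box contents as causally fixed. The
# accuracy and prior below are illustrative assumptions.

ACCURACY = 0.99          # assumed predictor accuracy
M, K = 1_000_000, 1_000  # opaque-box and transparent-box payoffs

# EDT: your choice is evidence about what the predictor did.
edt_one_box = ACCURACY * M
edt_two_box = (1 - ACCURACY) * M + K

# CDT: contents are already fixed with some prior probability p,
# so two-boxing dominates by exactly K for any p.
p = 0.5
cdt_one_box = p * M
cdt_two_box = p * M + K

print(edt_one_box > edt_two_box)   # EDT prefers one-boxing
print(cdt_two_box > cdt_one_box)   # CDT prefers two-boxing
```

Note that CDT's preference for two-boxing holds for every value of `p`, which is exactly why the two theories cannot be reconciled by haggling over priors.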

I unironically love Table 2. 

A shower thought I once had, intuition-pumped by MIRI's / Luke's old post on turning philosophy to math to engineering, was that if metaethicists really were serious about resolving their disputes they should contract a software engineer (or something) to help implement on GitHub a metaethics version of Table 2, where rows would be moral dilemmas like the trolley problem and columns ethical theories, and then accept that real-world engineering solutions tend to be "dirty" and inelegant remixes plus kludgy optimisations to ... (read more)

Chipmonk
Now we just need to ask Sonnet to formalize VDT
Jon Garcia
Evolution is still in the process of solving decision theory, and all its attempted solutions so far are way, way overparameterized. Maybe it's on to something? It takes a large model (whether biological brain or LLM) just to comprehend and evaluate what is being presented in a Newcomb-like dilemma. The question is whether there exists some computationally simple decision-making engine embedded in the larger system that the comprehension mechanisms pass the problem to or whether the decision-making mechanism itself needs to spread its fingers diffusely through the whole system for every step of its processing. It seems simple decision-making engines like CDT, EDT, and FDT can get you most of the way to a solution in most situations, but those last few percentage points of optimality always seem to take a whole lot more computational capacity.
xpym
Of course, but neither would anything else so far discovered...

At PIBBSS, we’ve been thinking about how renormalization can be developed into a rich framework for AI interpretability. This document serves as a roadmap for this research agenda – which we are calling an Opportunity Space[1] for the AI safety community. In what follows, we explore the technical and philosophical significance of renormalization for physics and AI safety, problem areas in which it could be most useful, and some interesting existing directions – mainly from physics – that we are excited to place in direct contact with AI safety. This roadmap will also provide context for our forthcoming Call for Collaborations, during which we will hire affiliates to work on projects in this area. 

Acknowledgements: While Lauren did the writing, this opportunity space was developed with the PIBBSS horizon scanning team, Dmitry Vaintrob...

The idea is interesting, but I'm somewhat skeptical that it'll pan out.

  • RG doesn't help much going backwards - the same coarse-grained laws might correspond to many different micro-scale laws, especially when you don't expect the micro scale to be simple.
  • Singular learning theory provides a nice picture of phase-transition-like-phenomena, but it's likely that large neural networks undergo lots and lots of phase transitions, and that there's not just going to be one phase transition from "naive" to "naughty" that we can model simply.
  • Conversely, lots of import
... (read more)
aribrill
This post is great to see; I think renormalization is a very exciting direction for AI safety research!

Shouldn't this go the other way, with representation_0 being UV and representation_1 being IR? A NN compresses the input representation (data) to obtain a coarse-grained output representation (label). The ability to throw away information, i.e. the irrelevant noise w.r.t. the target function, is what enables generalization to unseen inputs differing in fine-grained details.

Lee Billings' book Five Billion Years of Solitude has the following poetic passage on deep time that's stuck with me ever since I read it in Paul Gilster's post:

Deep time is something that even geologists and their generalist peers, the earth and planetary scientists, can never fully grow accustomed to. 

The sight of a fossilized form, perhaps the outline of a trilobite, a leaf, or a saurian footfall can still send a shiver through their bones, or excavate a trembling hollow in the chest that breath cannot fill. They can measure celestial motions and l

... (read more)
Mo Putera
Nice reminiscence from Stephen Wolfram on his time with Richard Feynman: Feynman and Wolfram had very different problem-solving styles: The way he grappled with Wolfram's rule 30 exemplified this (I've omitted a bunch of pictures, you can check them out in the article):