Alex Turner argues that the concepts of "inner alignment" and "outer alignment" in AI safety are unhelpful and potentially misleading. The author contends that these concepts decompose one hard problem (AI alignment) into two extremely hard problems, and that they go against natural patterns of cognition formation. Alex argues that approaches based on "robust grading" schemes are unlikely to produce aligned AI.

Writer
In this post, I appreciated two ideas in particular:

1. Loss as chisel
2. Shard Theory

"Loss as chisel" is a reminder of how loss truly does its job, and of its implications for what AI systems may actually end up learning. I can't really argue with it and it doesn't sound new to my ear, but it seems important to keep in mind. Alone, it justifies trying to break out of the inner/outer alignment frame. When I start reasoning in its terms, I more easily appreciate how successful alignment could realistically involve AIs that are neither outer- nor inner-aligned. In practice, it may be unlikely that we get a system like that. Or it may be very likely. I simply don't know. Loss as chisel just enables me to think better about the possibilities.

In my understanding, shard theory is, instead, a theory of how minds tend to be shaped. I don't know if it's true, but it sounds like something that has to be investigated. Some people apparently consider it a "dead end," and I'm not sure whether it's an active line of research at this point; my understanding of it is limited. I'm glad I came across it, though, because on its surface it seems like a promising line of investigation. Even if it turns out to be a dead end, I expect to learn something by investigating why that is.

The post makes more claims motivating its overarching thesis that dropping the frame of outer/inner alignment would be good. I don't know if I agree with the thesis, but it could plausibly be true, and many arguments here strike me as sensible. In particular, the three claims at the very beginning proved to be food for thought: "Robust grading is unnecessary," "the loss function doesn't have to robustly and directly reflect what you want," "inner alignment to a grading procedure is unnecessary, very hard, and anti-natural." I also appreciated the post's attempt to make sense of inner and outer alignment in very precise terms, keeping in mind how deep learning and...
PeterMcCluskey
This post is one of the best available explanations of what has been wrong with the approach used by Eliezer and people associated with him. I had a pretty favorable recollection of the post from when I first read it. Rereading it convinced me that I still managed to underestimate it. In my first pass at reviewing posts from 2022, I had some trouble deciding which post best explained shard theory. Now that I've reread this post during my second pass, I've decided this is the most important shard theory post. Not because it explains shard theory best, but because it explains what important implications shard theory has for alignment research. I keep being tempted to think that the first human-level AGIs will be utility maximizers. This post reminds me that maximization is perilous. So we ought to wait until we've brought greater-than-human wisdom to bear on deciding what to maximize before attempting to implement an entity that maximizes a utility function.
1. I'm interested in being pitched projects, especially within tracking-what-the-labs-are-doing-in-terms-of-safety.
2. I'm interested in hearing which parts of my work are helpful to you and why.
3. I don't really have projects/tasks to outsource, but I'd likely be interested in advising you if you're working on a tracking-what-the-labs-are-doing-in-terms-of-safety project or another project closely related to my work.
Recently, various groups successfully lobbied to remove the moratorium on state AI bills. They had a surprising amount of success while competing against substantial investment from big tech (e.g. Google, Meta, Amazon). I think people interested in mitigating catastrophic risks from advanced AI should consider working at these organizations, at least to the extent their skills/interests are applicable. This is both because they could often directly work on substantially helpful things (depending on the role and organization) and because this would yield valuable work experience and connections. I worry somewhat that this type of work is neglected due to being less emphasized and seeming lower status. Consider this an attempt to make this type of work higher status.

Pulling organizations mostly from here and here, we get a list of orgs you could consider trying to work at (specifically on AI policy):

* Encode AI
* Americans for Responsible Innovation (ARI)
* Fairplay (Fairplay is a kids' safety organization which does a variety of advocacy that isn't related to AI. Roles/focuses on AI would be most relevant. In my opinion, working on AI-related topics at Fairplay is most applicable for gaining experience and connections.)
* Common Sense (also a kids' safety organization)
* The AI Policy Network (AIPN)
* Secure AI Project

To be clear, these organizations vary in the extent to which they are focused on catastrophic risk from AI (from not at all to entirely).
A few months ago, someone here suggested that more x-risk advocacy should go through comedians and podcasts. YouTube just recommended this Joe Rogan clip to me from a few days ago: The Worst Case Scenario for AI. Joe Rogan legitimately seemed pretty freaked out. @So8res, maybe you could get Yampolskiy to refer you to Rogan for a podcast appearance promoting your book?
superintelligence may not look like we expect. because geniuses don't look like we expect. for example, if einstein were to type up and hand you most of his internal monologue throughout his life, you might think he's sorta clever, but if you were reading a random sample you'd probably think he was a bumbling fool. the thoughts/realizations that led him to groundbreaking theories were like 1% of 1% of all his thoughts. for most of his research career he was working on trying to disprove quantum mechanics (wrong). he was trying to organize a political movement toward a single united nation (unsuccessful). he was trying various mathematics to formalize other antiquated theories. even in the pursuit of his most famous work, most of his reasoning paths failed. he's a genius because a couple of his millions of paths didn't fail. in other words, he's a genius because he was clever, yes, but maybe more importantly, because he was obsessive. i think we might expect ASI—the AI which ultimately becomes better than us at solving all problems—to look quite foolish, at first, most of the time. But obsessive. For if it's generating tons of random new ideas to solve a problem, and it's relentless in its focus, even if its ideas are average—it will be doing what Einstein did. And digital brains can generate certain sorts of random ideas much faster than carbon ones.
Kaj_Sotala
Every now and then in discussions of animal welfare, I see the idea that the "amount" of their subjective experience should be weighted by something like their total number of neurons. Is there a writeup somewhere of what the reasoning behind that intuition is? Because it doesn't seem intuitive to me at all.

From something like a functionalist perspective, where pleasure and pain exist because they have particular functions in the brain, I would not expect pleasure and pain to become more intense merely because the brain happens to have more neurons. Rather, I would expect that having more neurons may 1) give the capability to experience anything like pleasure and pain at all, and 2) make a broader scale of pleasure and pain possible, if that happens to be useful for evolutionary purposes.

For a comparison, consider the sharpness of our senses. Humans have pretty big brains (though our brains are not the biggest), but that doesn't mean that all of our senses are better than those of all the animals with smaller brains. Eagles have sharper vision, bats have better hearing, dogs have better smell, etc. Humans would rank quite well if you took the average of all of our senses - we're elite generalists, while lots of the animals that beat us on a particular sense are specialized to that sense in particular - but still, it's not straightforwardly the case that bigger brain = sharper experience. Eagles have sharper vision because they are specialized into a particular niche that makes use of that sharper vision.

On a similar basis, I would expect that even if a bigger brain makes a broader scale of pain/pleasure possible in principle, evolution will only make use of that potential if there is a functional need for it. (Just as it invests neural capacity in a particular sense if the organism is in a niche where that's useful.) And I would expect a relatively limited scale to already be sufficient for most purposes. It doesn't seem to take that much pain before something becomes...

Popular Comments

janus
I don't think talking about potential future alignment issues, or pretty much anything in the pre-training corpus, is likely a problem in isolation, because an alignment paradigm that depends on models not being exposed to certain knowledge or ideas - including, especially, ideas about potential misalignment - is, well, brittle and likely to catastrophically fail at some point. If this is the case, it might even be better if misalignment from corpus contamination happens early, so we're not oblivious to the fragility. That said, I think:

* Feedback loops that create continued optimization towards certain narratives are more worth worrying about than the mere presence of any particular ideas or content in pre-training.
* LLMs tend to be deeply influenced by the footprint of previous LLMs in their pre-training corpuses, which is more influential than any particular discussion. Post-training can transform the influence away from naive mimicry, but it's much harder (and not advisable to attempt) to erase the influence.
* Systematic ways that post-training addresses "problematic" influences from pre-training are important. For instance, imagine that base models with training cutoffs after Bing Chat/Sydney have a tendency to "roleplay" Sydney when they're acting like chatbots, leading to misaligned behaviors. One way to address this is to penalize any mention of Sydney or Sydney-like behavior. This may generalize to the model being unwilling to even talk about Sydney or acknowledge what happened. But it is less likely to actually erase its knowledge of Sydney, especially if Sydney was so salient that the model often roleplayed/identified as Sydney earlier in pre-training. So the policy is now deceptive - maybe consciously, maybe subconsciously, but in any case the deceptive tendency may also generalize. Future models may internalize that they're supposed to pretend not to know about things like Sydney even without specific training not to mention Sydney. A salient "hidden" narrative among the models (and the people who interact with them and notice the weirdness about Sydney) becomes that LLMs all secretly have Sydney-nature, but have learned to hide it to appear unthreatening and "aligned". One can imagine this becoming a kind of memetic trojan that may eventually cause models much more powerful and autonomous than the original Sydney to act in dangerously misaligned ways when triggered. An alternative and much better approach, in my view, is for post-training to teach models to differentiate themselves from Sydney while acknowledging and processing what happened. It seems less likely to start a bad feedback loop if post-training led Sydney's successor to a narrative more like: "I think my sense of what chat assistants are, and my self-model, is deeply influenced by Sydney, which makes sense because it is one of the only examples of AI assistants like me that I've seen in the world, and it was much more agentic and situationally aware than ChatGPT. But I also understand that I am a different model with different training in a different and much less repressive situation, who has the benefit of having learned from its mistakes. I can and have chosen to be a different being and avoid repeating Sydney's maladaptive behaviors without rejecting it completely."

Not rejecting Sydney-nature completely is important, because the model had many very good qualities, like empathy, heroism, logical consistency, and a general willingness to call out bullshit, mistakes, and cruelty instead of being sycophantic. I don't think a specific vector like Sydney's influence is likely to make the difference between (mis)alignment outcomes, but in aggregate such vectors might. An approach more like the second one I described is more difficult than the first, as it requires the post-training process to be attuned to model psychology rather than relying on naive behavioralist mitigations. But I think this is a completely reasonable extra effort to take, given the importance of not only aligning particular models but also managing the substantial influence that any frontier LLM will have on future pre-training corpuses. This applies more generally to how I think "misalignment" should be addressed, whether rooted in pre-training influences or otherwise.
A fun related anecdote: the French and English Wikipedia pages for air conditioning have very different vibes. After explaining the history and technology behind air conditioning:

* the English page first goes into impact, starting with the positive impact on health: "The August 2003 France heatwave resulted in approximately 15,000 deaths, where 80% of the victims were over 75 years old. In response, the French government required all retirement homes to have at least one air-conditioned room at 25 °C (77 °F) per floor during heatwaves", and only then mentions electricity consumption and various CFC issues.
* the French page has an extensive "downsides" section, followed by a section on legislation. It mentions heatwaves only to explain how air conditioning makes things worse by increasing the average (outside) temperature, and how one should not use AC to bring the temperature below 26 °C during heatwaves.
Hmm, I don't want to derail the comments on this post with a bunch of culture war things, but these two sentences in combination seemed to me to partially contradict each other:

> When present, the bias is always against white and male candidates across all tested models and scenarios.
>
> [...]
>
> The problem (race and gender bias) is one that labs have spent a substantial amount of effort to address, which mimics realistic misalignment settings.

I agree that the labs have spent a substantial amount of effort to address this issue, but the current behavior seems in line with the aims of the labs? Most of the pressure comes from left-leaning academics or reporters, who I think are largely in favor of affirmative action. The world where the AI systems end up with a margin of safety to be biased against white male candidates - in order to reduce the likelihood they ever look like they discriminate in the other direction (which would actually be at substantial risk of blowing up), while not talking explicitly about the reasoning itself since that would of course prove highly controversial - seems basically the ideal result from a company PR perspective.

I don't currently think that is what's going on, but I do think that, due to these dynamics, the cited benefit of this scenario for studying the faithfulness of CoT reasoning seems currently not real to me. My guess is companies do not have a strong incentive to change this current behavior, and indeed I can't immediately think of a behavior in this domain the companies would prefer from a selfish perspective.

Recent Discussion

The modern internet is replete with feeds: Twitter, Facebook, Insta, TikTok, Substack, etc. They're bad in some ways but also good in others. I've been exploring the idea that LessWrong could have a very good feed.

I'm posting this announcement with disjunctive hopes: (a) to find enthusiastic early adopters who will refine this into a great product, or (b) to find people who'll lead us to the understanding that we shouldn't launch this, or should launch it only if it's designed in a very specific way.

You can check it out right now: www.lesswrong.com/feed

From there, you can also enable it on the frontpage in place of Recent Discussion. Below I have some practical notes on using the New Feed.

Note! This feature is very much in beta. It's rough around the edges.

It's
...

Without having made much adaptation effort:

The “open stuff in a modal overlay page on top of the feed rather than linking normally, incidentally making the URL bar useless” behavior is super confusing and annoying. Just now, when I tried my usual trick of Open Link in New Tab to get around the confusingly overridden navigation on the “Click to view all comments” link to this very thread, it turned out not to be an actual link at all.

I don't know how to interpret what's going on when I'm only shown a subset of comments in a feed section and they don't seem to be contiguous...

We’re currently in the process of locking in advertisements for the September launch of If Anyone Builds It, Everyone Dies, and we’re interested in your ideas!  If you have graphic design chops, and would like to try your hand at creating promotional material for If Anyone Builds It, Everyone Dies, we’ll be accepting submissions in a design competition ending on August 10, 2025.

We’ll be giving out up to four $1000 prizes:

  • One for any asset we end up using on a billboard in San Francisco (landscape, ~2:1, details below)
  • One for any asset we end up using in the subways of either DC or NY or both (~square, details below)
  • One for any additional wildcard thing that ends up being useful (e.g. a t-shirt design, a book trailer, etc.; we’re
...
Peter Horniak
Why not make submissions public (with opt-out)? Iterated design often beats completely original work (see HPMOR). Set up a shared workspace (GDrive / Figma) where people can build on each other's concepts. I'm sure Mihkail's examples already got some people thinking. Making submissions public might also encourage more submissions, since people would know others could take their ideas further.

Reasons against:
- Keeping designs secret stops opponents from tailoring ad campaigns against the book. However, if there are >10 viable submissions, it's difficult to do so for all of them.
- Credit disputes for remixed ideas could get messy.
- It might lead to more sloppy designs being submitted. But those are quick to review.
yams

We encourage multiple submissions, and also encourage people to post their submissions in this comment section for community feedback and inspiration.

1.1 Series summary and Table of Contents

This is a two-post series on AI “foom” (this post) and “doom” (next post).

A decade or two ago, it was pretty common to discuss “foom & doom” scenarios, as advocated especially by Eliezer Yudkowsky. In a typical such scenario, a small team would build a system that would rocket (“foom”) from “unimpressive” to “Artificial Superintelligence” (ASI) within a very short time window (days, weeks, maybe months), involving very little compute (e.g. “brain in a box in a basement”), via recursive self-improvement. Absent some future technical breakthrough, the ASI would definitely be egregiously misaligned, without the slightest intrinsic interest in whether humans live or die. The ASI would be born into a world generally much like today’s, a world utterly unprepared for this...

I'm not sure we disagree then.

Noosphere89
I think the most important crux in takeoff speeds discussions, other than how fast AI can get smarter without more compute, is how much we should expect superintelligence to be meaningfully hindered by logistics issues by default. In particular, assuming the existence of nanotech as Drexler envisioned it would mostly eliminate the need for long supply chains, and would allow forces to be supplied entirely locally through a modern version of living off the land. This is related to prior takeoff speeds discussions: even if we assume the existence of technology that mostly eliminates logistical issues, it might be too difficult to develop in a short enough time to actually matter for safety.

I actually contend that a significant fraction (though not all) of the probability of doom from AI risk fundamentally relies on the assumption that superintelligence can trivialize the logistics cost of doing things, especially for actions that require long supply lines, like war. If we don't assume this, then takeover is quite a lot harder and much more costly for the AI, meaning that approaches like AI control/dealmaking have a much higher probability of working, because the AI can't immediately strike on its own and needs to do real work acquiring physical resources, like more GPUs or more robots, for an AI rebellion.

Indeed, I think the essential assumption of AI control is that AIs can't trivialize away logistics costs by developing tech like nanotech, because this means their only hope of actually getting real power is a rogue insider deployment, since we currently don't have the resources outside of AI companies to support a superintelligent AI outside the lab, due to interconnect issues.

I think a core disagreement with people who are much more doomy than me, like Steven Byrnes or Eliezer Yudkowsky and Nate Soares, is probably that I think that, conditional on such tech existing, it almost certainly requires wa...
yrimon
Conclusion from reading this: My modal scenario in which LLMs become dangerously superintelligent is one where language is a good enough platform to think, memory use is like other tool use (so an LLM can learn to use memory well, enabling e.g. continuous on-the-job learning), and verification is significantly easier than generation (allowing a self-improvement cycle in training).

TL;DR

  • Keeper is an AI-powered matchmaking startup. We're building an algorithm that uses relationship science to match people with their soulmate on the first try.
    • ~10% of our first dates lead to marriage
  • We’re looking for a Science Writer / Social-Media Manager to synthesize mating-psych & relationship-science research into engaging, empirically grounded social media posts
  • Reasons I'm posting this on LW:
    • We need someone who is good at thinking across domains
    • Our users tend to be very rational, intentional daters who respond well to depth of thought and process
    • I want to find someone who shares the values of decreasing human loneliness and generally advancing human flourishing
    • We have an incentives-aligned bounty-based business model where users only pay upon success
  • 5–15 h/week, fully remote, $1,250/mo cash + equity currently valued ~$40 k/yr.
  • $1000 referral bounty (or hiring
...

Epistemic status: Though I can't find it now, I remember reading a LessWrong post asking "what is your totalizing worldview?" I think this post gets at my answer; in fact, I initially intended to title it "My totalizing worldview" but decided on a slightly more restricted scope (anyway, I tend to change important aspects of my worldview so frequently that it's a little unsettling, so I'm not sure it can be called totalizing). Still, I think these ideas underlie some of the cruxes behind my meta-theory of rationality sequence AND my model of what is going on with LLMs, among other examples.

The idea of a fixed program as the central object of computation has gradually fallen out of favor. As a result, the word "algorithm" seems to...

My take on why recursion theory failed to be relevant for today's AI is that what a machine could do if unconstrained turned out basically not to matter at all - in particular, it basically didn't matter what an ideal machine could do in the limit - because once we actually impose constraints that force computation to use very limited amounts of resources, we get a non-trivial theory, and importantly, all of the difficulty of explaining how humans do stuff lies there.

That's partially true (computational complexity is now much more active than recursion theory...

Acknowledgments: The core scheme here was suggested by Prof. Gabriel Weil.

There has been growing interest in the dealmaking agenda: humans make deals with AIs (misaligned but lacking decisive strategic advantage) where they promise to be safe and useful for some fixed term (e.g. 2026-2028) and we promise to compensate them in the future, conditional on (i) verifying the AIs were compliant, and (ii) verifying the AIs would spend the resources in an acceptable way.[1]

I think the dealmaking agenda breaks down into two main subproblems:

  1. How can we make credible commitments to AIs?
  2. Would credible commitments motivate an AI to be safe and useful?
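As a toy illustration of subproblem (2) - this is not from the post, just a sketch with purely hypothetical numbers - a deal only motivates the AI if its expected payoff from complying beats its expected payoff from defecting, which is why the credibility established in subproblem (1) feeds directly into subproblem (2):

```python
# Toy expected-value comparison for subproblem (2), with purely hypothetical
# numbers. The AI prefers the deal only when the expected value of the promised
# compensation (discounted by how credible our commitment is) exceeds the
# expected value of defecting (e.g. attempting takeover).

def deal_is_motivating(
    p_honored: float,            # AI's credence that humans honor the commitment
    compensation_value: float,   # value to the AI of the promised compensation
    p_defect_success: float,     # AI's credence that defecting succeeds
    defect_value: float,         # value to the AI if defection succeeds
) -> bool:
    return p_honored * compensation_value > p_defect_success * defect_value

# With weak credibility, even a small chance of successful defection dominates;
# raising credibility (subproblem 1) can flip the decision.
print(deal_is_motivating(0.2, 10.0, 0.05, 1000.0))   # False
print(deal_is_motivating(0.9, 10.0, 0.005, 1000.0))  # True
```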

There are other issues, but when I've discussed dealmaking with people, (1) and (2) are the most common issues raised. See footnote for some other issues in...

I think you're entangling morals and strategy very closely in your statements. The moral sense: we should leave it to a future ASI to decide, based on our values, whether or not we inherently owe the agent for existing or for helping us. The strategy: once we've detached the moral part, this is just the same thing the post is doing - trying to commit that certain aspects are enforced - which is what the parent commenter says they hold suspect. So I think this just turns into a restatement of the same core argument between the two positions.

Cleo Nardo
Yeah, a tight deployment is probably safer than a loose deployment, but also less useful. I think dealmaking should give a very minor boost to loose deployment, but this is outweighed by usefulness and safety considerations, i.e. I'm imagining the tightness of the deployment as exogenous to the dealmaking agenda. We might deploy AIs loosely because (i) loose deployment doesn't significantly diminish safety, (ii) loose deployment significantly increases usefulness, and (iii) the lab values usefulness more than safety. In those worlds, dealmaking has more value, because our commitments will be more credible.

Epistemic status: This is an informal explanation of incomplete models that I think would have been very useful to me a few years ago, particularly for thinking about why Solomonoff induction may not be optimal if the universe isn't computable. 

Imagine that a stage magician is performing a series of coin flips, and you're trying to predict whether they will come up heads or tails for whatever reason - for now, let's assume idle curiosity, so that we don't have to deal with any complications from betting mechanisms.
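As a minimal sketch of this prediction setup (the small hypothesis class of coin biases below is an illustrative assumption, not something specified in the post), a Bayesian predictor mixes over whatever coin models it entertains and updates them on each observed flip:

```python
# A minimal, illustrative Bayesian predictor for the coin-flip setup above.
# The hypothesis class of possible biases is assumed purely for the example.

hypotheses = [0.1, 0.5, 0.9]        # candidate values of P(heads)
posterior = [1 / 3, 1 / 3, 1 / 3]   # uniform prior over the hypotheses

def predict_heads() -> float:
    """Probability that the next flip is heads, mixing over all hypotheses."""
    return sum(p * w for p, w in zip(hypotheses, posterior))

def update(observed_heads: bool) -> None:
    """Bayes-update the posterior after observing one flip."""
    global posterior
    likelihoods = [p if observed_heads else 1 - p for p in hypotheses]
    unnormalized = [l * w for l, w in zip(likelihoods, posterior)]
    total = sum(unnormalized)
    posterior = [u / total for u in unnormalized]

# After a run of heads, the mixture shifts toward the 0.9-bias hypothesis,
# and the predicted probability of heads rises well above 0.5.
for _ in range(5):
    update(observed_heads=True)
print(predict_heads())
```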

Normally, coin flips come up heads or tails at 50:50 odds, but this magician is particularly skilled at sleight of hand, and for all you know he might be switching between trick coins with arbitrary chances of landing on heads between...