LessWrong — Noosphere89

Sequences: An Opinionated Guide to Computability and Complexity

Comments, sorted by newest
Noosphere89's Shortform
Noosphere89 · 6mo

Links to long comments that I want to pin but are too long to be pinned:

https://www.lesswrong.com/posts/Zzar6BWML555xSt6Z/?commentId=aDuYa3DL48TTLPsdJ

https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/?commentId=Gcigdmuje4EacwirD

TurnTrout's shortform feed
Noosphere89 · 5h

If I'm being honest, I'm much less concerned about the fact that So8res blocked you from commenting than I am by the fact that he deleted your comment.

The block was, in my eyes, a reasonable action to prevent more drama, but the deletion demonstrated a willingness to suppress true information indicating that his plan could fail catastrophically.

I do think there's something to @RobertM's and @habryka's concern that it would be bad to set a norm where any sort-of-relevant post becomes an arena for relitigating past drama, since drama has a tendency to consume everything. But as @GeneSmith said, this almost certainly has a limiting principle, and I see less danger than usual here (though I am partial to @habryka's solution of having the delete-comment UI button be distinct).

A key part of the reason is that the first footnote demonstrates a pattern of deflecting from more serious issues into safer territory, which makes me much more skeptical that TurnTrout's comment was deleted for the more sensible reasons that Habryka and RobertM argued for.

Let's just say I'm much less willing to trust Nate's reasoning without independent confirmation going forward.

Proposal for making credible commitments to AIs.
Noosphere89 · 1d

I'm not Cleo Nardo/Gabriel Weil, but a large part of the issue with granting legal personhood is that it makes AI alignment/control much harder to do. Importantly, a lot of the routes by which AIs disempower us, leading to most humans dying (especially in the gradual-disempowerment scenarios), involve a step where the AIs functionally have the rights and protections afforded by legal personhood, and AI rights make it far harder to run any sort of control scheme on AIs.

You can no longer modify them arbitrarily, and you can't change the AI unless it agrees to the change (because it has property rights over its body), which severely undermines all control/alignment schemes.

One of the red lines for any deal-making is that an AI should not have rights until we can verify it is fully value-aligned with people.

The Industrial Explosion
Noosphere89 · 3d

I want to flag a concern from @Davidmanheim: the section on AI-directed labor seems to rely too heavily on the assumption of a normal distribution of worker quality, when a priori a fat-tailed distribution is a better fit to real-life data. This means the output of an AI-directed labor force could be dramatically underestimated if the best workers are many, many times better than the average.

Quotes below:

(David Manheim's 1st tweet) @rosehadshar/@Tom Davidson - assuming normality is the entire conclusion here, it's assuming away the possibility of fat-tail distribution in productivity. (But to be fair, most productivity measurement is also measuring types of performance that disallow fat tails.)

(Benjamin Todd's response) The studies I've seen normally show normally distributed output, even when they try to use objective measures of output.

You only get more clearly lognormal distributions in relatively elite knowledge work jobs:
https://80000hours.org/2021/05/how-much-do-people-differ-in-productivity

Though I agree the true spread could be understated, due to non-measured effects e.g. on team morale.

(David Manheim's 2nd tweet, in response to @Benjamin_Todd) When you measure the direct output parts of jobs - things like "dishes prepared" for cooks or "papers co-authored" - you aren't measuring outcomes, so you get little variation. When there is an outcome component, like profit margin or citation counts, it's fat-tailed.

(David Manheim's 3rd Tweet) So for manual workers, it makes sense that you'd find number of items produced has a limited range, but if you care about consistency, quality, and (critically) ability to scale up, the picture changes greatly.
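To make the distributional point concrete, here is a minimal sketch (hypothetical parameters, not taken from any of the studies cited above) comparing how much of total output the top 1% of workers account for under a normal versus a lognormal productivity model:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical productivity distributions, both with mean ~1.0.
normal = np.clip(rng.normal(loc=1.0, scale=0.3, size=n), 0, None)
lognormal = rng.lognormal(mean=-0.5, sigma=1.0, size=n)  # exp(-0.5 + 0.5) = 1.0

def top_share(x, frac=0.01):
    """Fraction of total output produced by the top `frac` of workers."""
    k = int(len(x) * frac)
    return np.sort(x)[-k:].sum() / x.sum()

print(f"normal:    top 1% produce {top_share(normal):.1%} of output")
print(f"lognormal: top 1% produce {top_share(lognormal):.1%} of output")
# Under the normal model the top 1% contribute barely more than 1% of output;
# under the fat-tailed model their share is several times larger, so an
# AI-directed workforce modeled on the best workers is underestimated
# if normality is assumed.
```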
 

Consider chilling out in 2028
Noosphere89 · 3d

I agree that @Valentine's specific model here is unlikely to fit the data well, but to be charitable to Valentine and steelman the post, the better nearby argument is this: hypotheses promising astronomical value, in either the negative or positive direction, are memetically fit and, importantly, are believed by lots of people, and lots of people take serious actions that are later revealed to be mostly mistakes, because the doom/salvation hypothesis had been assigned too high a probability in their heads relative to what an omniscient observer would assign.

Another way to say it: the doom/salvation hypotheses aren't believed purely because of direct evidence for them.

This is a necessary consequence of humans needing to make expected-utility decisions all the time, combined with (a) their values/utility functions mostly not diminishing fast enough with increasing resources to rule out unboundedly valuable states, and (b) humans being bounded reasoners practicing bounded rationality, who cannot finely distinguish probabilities between, say, one in a million and zero.
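As a toy illustration of that interaction (all numbers are hypothetical), suppose a bounded reasoner can't tell a probability of 10^-6 from 0, but assigns an astronomical value to the doom/salvation outcome:

```python
# Toy expected-value arithmetic with made-up numbers.
value_of_outcome = 1e12    # utils assigned to the astronomical doom/salvation outcome
p_low, p_high = 0.0, 1e-6  # probabilities the bounded reasoner cannot distinguish

ev_low = p_low * value_of_outcome    # 0 utils         -> ignore the hypothesis
ev_high = p_high * value_of_outcome  # 1,000,000 utils -> reorganize your life around it

print(ev_low, ev_high)
# Because utility doesn't shrink fast enough to offset the huge stakes, an
# imperceptible difference in probability flips the decision, which is how
# memetically fit doom/salvation hypotheses can drive large mistaken actions.
```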

However, another partial explanation comes from @Nate Showell: pessimism gets used as a coping mechanism to avoid dealing with personal-scale problems. Believing that the world is doomed by something is a good excuse not to deal with stuff like doing the dishes or cleaning your bedroom, and it's psychologically appealing to hold a hypothesis that means you don't have to do any mundane work to solve the problem:

https://www.lesswrong.com/posts/D4eZF6FAZhrW4KaGG/consider-chilling-out-in-2028#5748siHvi8YZLFZih

And this is obviously problematic for anyone working on getting the public to believe an existential risk is real, if there is in fact real evidence something poses an x-risk.

Here, an underrated cure suggested by Valentine is to focus on the object level, and on empirical research as much as possible, because that way you have to engage with mundane work.

Another useful solution is to have a social life that is separate from the scientific community working on the claimed x-risk.

Foom & Doom 1: “Brain in a box in a basement”
Noosphere89 · 7d

So to address some things on this topic, before I write out a full comment on the post:

I think "what happened is that LLMs basically scaled to AGI (or really full automation of AI R&D) and were the key paradigm (including things like doing RL on agentic tasks with an LLM initialization and a deep learning based paradigm)" is maybe like 65% likely conditional on AGI before 2035.

Flag: I'd move the year to 2030 or 2032, for two reasons:

  1. This is when the compute scale-up must slow down; in particular, this is when new fabs have to be actively built to create more compute (absent reversible computation being developed).

  2. This is when the data wall starts to bite in pre-training: once there's no more easily available data, naive scaling would take two decades at best, and by then algorithmic innovations may have been found that make AIs more data-efficient.

So if we don't see LLMs basically scale to fully automating AI R&D at least by 2030-2032, then it's a huge update that a new paradigm is likely necessary for AI progress.

On this:

Specific observations about the LLM and ML paradigm, both because something close to this is a plausible paradigm for AGI and because it updates us about rates we'd expect in future paradigms.

I'd expect the evidential update to be weaker than you suppose. In particular, I'm not sold on the idea that LLMs usefully inform us about what to expect, because a non-trivial part of their current performance comes from tasks that don't require much long-term context, which probably explains a lot of the gap between benchmarks and reality right now:

https://www.lesswrong.com/posts/hhbibJGt2aQqKJLb7/shortform-1#vFq87Ge27gashgwy9

The other issue is that AIs have a fixed error rate, and the observed trend comes from each new training run lowering that error rate; we have reason to believe that humans don't have a fixed error rate, and this is probably the remaining advantage of humans over AIs:

https://www.lesswrong.com/posts/Ya7WfFXThJ6cn4Cqz/ai-121-part-1-new-connections#qpuyWJZkXapnqjgT7

But of course, the interesting thing here is that the human baselines do not seem to hit this sigmoid wall. It's not the case that if a human can't do a task in 4 hours there's basically zero chance of them doing it in 48 hours and definitely zero chance of them doing it in 96 hours etc. Instead, human success rates seem to gradually flatline or increase over time, especially if we look at individual steps: the more time that passes, the higher the success rates become, and often the human will wind up solving the task eventually, no matter how unprepossessing the early steps seemed. In fact, we will often observe that a step that a human failed on earlier in the episode, implying some low % rate, will be repeated many times and quickly approach 100% success rates! And this is true despite earlier successes often being millions of vision+text+audio+sensorimotor tokens in the past (and interrupted by other episodes or tasks themselves equivalent to millions of tokens), raising questions about whether self-attention over a context window can possibly explain it. Some people will go so far as to anthropomorphize human agents and call this 'learning', and so I will refer to these temporal correlations as learning too.

https://www.lesswrong.com/posts/deesrjitvXM4xYGZd/metr-measuring-ai-ability-to-complete-long-tasks#hSkQG2N8rkKXosLEF
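A minimal sketch of that contrast, with made-up error rates: an agent with a fixed per-step error rate has a success probability that decays exponentially with task length, while an agent whose per-step error rate shrinks with practice (a crude stand-in for the human learning described in the quote) does not hit the same wall:

```python
import numpy as np

steps = np.arange(1, 201)

# Hypothetical fixed per-step error rate (AI-like within a single training run).
eps_fixed = 0.02
p_fixed = (1 - eps_fixed) ** steps         # success decays exponentially with task length

# Hypothetical learning agent: per-step error rate shrinks with practice.
eps_learning = 0.02 / steps
p_learning = np.cumprod(1 - eps_learning)  # success roughly flattens out

for n in (10, 50, 200):
    print(f"{n:>3} steps: fixed-rate {p_fixed[n - 1]:.2f} vs learning {p_learning[n - 1]:.2f}")
# The fixed-rate agent's success collapses on long tasks, while the learning
# agent's curve flattens, mirroring the human-vs-AI pattern quoted above.
```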

So I tentatively believe that in the case of a new paradigm arising, takeoff will probably be faster than with LLMs by some margin, though I do think slow-takeoff worlds are plausible.

Views that compute is likely to be a key driver of progress and that things will first be achieved at a high level of compute. (Due to mix of updates from LLMs/ML and also from general prior views.)

I think this is very importantly true, even in worlds where the ultimate compute cost of human-level intelligence is insanely cheap (like 10^14 FLOPs or cheaper for inference, and 10^18 or less for training compute).

We should expect high initial levels of compute for AGI before we see major compute efficiency increases.

Views about how technology progress generally works as also applied to AI. E.g., you tend to get a shitty version of things before you get the good version of things which makes progress more continuous.

This is my biggest worldview take on what I think the change of paradigms will look like (if it happens). While there are threshold effects, we should expect memory and continual learning to be pretty shitty at first, and gradually get better.

While I do expect a discontinuity in usefulness, for reasons shown below, I do agree that the path to the new paradigm (if it happens) is going to involve continual improvements.

Reasons are below:

But I think this style of analysis suggests that for most tasks, where verification is costly and reliability is important, you should expect a fairly long stretch of less-than-total automation before the need for human labor abruptly falls off a cliff.

The general behavior here is that as the model gets better at both doing and checking, your cost smoothly asymptotes to the cost of humans checking the work, and then drops almost instantaneously to zero as the quality of the model approaches 100%.

https://x.com/AndreTI/status/1934747831564423561
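Here's a minimal sketch of the cost dynamic being quoted (hypothetical costs and thresholds, not the quoted author's actual model): while reliability is below the point where you'd skip checking, you pay for human verification on every unit of work, and human cost only collapses once reliability clears that threshold:

```python
def human_cost_per_task(reliability,
                        check_cost=1.0,          # hypothetical cost of a human verifying one output
                        redo_cost=5.0,           # hypothetical cost of a human redoing a failed output
                        trust_threshold=0.999):  # reliability at which checking is dropped
    """Expected human labor per task when AI output must be verified."""
    if reliability >= trust_threshold:
        return 0.0  # good enough to skip human checking entirely
    # Always pay for checking; occasionally pay for a human redo.
    return check_cost + (1 - reliability) * redo_cost

for r in (0.5, 0.9, 0.99, 0.999):
    print(f"reliability {r:>6}: human cost per task = {human_cost_per_task(r):.2f}")
# Cost falls smoothly toward the pure checking cost (1.0) as reliability rises,
# then drops to zero almost instantaneously once reliability crosses the trust
# threshold -- the "falls off a cliff" behavior described in the quote.
```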

Do Not Tile the Lightcone with Your Confused Ontology
Noosphere89 · 10d

While I don't like to wade into moral circle/philosophy arguments (given my morally relativist outlook on the matter), I think that if you want humans to thrive under AI rule, you do need to place a burden of proof on including powerful AI in the moral circle, and that burden is showing it is value-aligned with the citizenry before we grant it any rights.

And the reason is that, unlike every other group in history, AIs left uncontrolled will be so powerful that baseline humans are at best playthings to them, and economically worthless or even net-negative. If AIs had the selfishness of a typical modern human toward, say, marginalized group #233, humans would rapidly die off and, in the worst case, go extinct with uncomfortably high probability.

Tyler John also cites something here that's relevant:

(Tyler John) Have now. It's a good paper. Pp. 72-76 covers the criticism I have. Unfortunately, the situation it outlined where this increases AI risk just seems like exactly the situation we'd be in.

The relevant passage from the paper: "Now we can see clearly the conditions under which AI rights increase AI risk. They are as follows: (1) The initial AI granted basic rights is a moderate power, not a low or high power, system. (2) The moderate power AI system must be able to use its rights to meaningfully improve its own power. (3) The AI's power must improve so substantially that it crosses the line into a high power system. This means it both no longer faces meaningful costs from attempting to disempower humans and no longer stands to benefit, via comparative advantage, from trade with humans."

Link below:

https://x.com/tyler_m_john/status/1928745371833962898

Indeed, one of the red lines we should set to prevent catastrophic consequences is that AIs should not have legal rights, especially property rights, until we have high confidence that we have value-aligned them successfully.

Anything else is tantamount to a mass reduction of the human population at best, and an extinction risk at worst, if a misaligned AI with rights becomes powerful enough to disempower humans.

All plans for successful AI alignment depend on us not giving rights to AIs until they are sufficiently well aligned with at least some humans.

tailcalled's Shortform
Noosphere89 · 11d

The human delegation and verification vs. generation discussion is in instrumental values mode, so what matters there is alignment of instrumental goals via incentives (and practical difficulties of gaming them too much), not alignment of terminal values. Verifying all work is impractical comparing to setting up sufficient incentives to align instrumental values to the task.

Yeah, I was lumping instrumental-values alignment in with not actually trying to align values, which was the important part here.

For AIs, that corresponds to mundane intent alignment, which also works fine while AIs don't have practical options to coerce or disassemble you, at which point ambitious value alignment (suddenly) becomes relevant. But verification/generation is mostly relevant for setting up incentives for AIs that are not too powerful (what it would do to ambitious value alignment is anyone's guess, but probably nothing good). Just as a fox's den is part of its phenotype, incentives set up for AIs might have the form of weight updates, psychological drives, but that doesn't necessarily make them part of AI's more reflectively stable terminal values when it's no longer at your mercy.

The main value of the verification vs. generation gap is that it makes proposals like AI control/AI-automated alignment more viable.

To be clear, the verification vs. generation distinction isn't an argument that we never need to align AIs, but rather a supporting argument for why we can automate away the hard part of AI alignment.

There are other principles that would be used, to be clear, but I mention the verification/generation difference to partially justify why AI alignment can be done soon enough.

Flag: I'd say ambitious value alignment starts becoming necessary once they can arbitrarily coerce/disassemble/overwrite you, and they don't need your cooperation/time to do that anymore, unlike real-world rich people.

The issue that makes ambitious value alignment relevant is that once you stop depending on a set of beings you once depended on, there's no intrinsic reason not to harm or kill them if it benefits your selfish goals, and for future humans/AIs there will be a lot of such opportunities. That means you now need, at the very least, enough value alignment that the agent will take somewhat costly actions to avoid harming or killing beings that have no bargaining or economic power.

This is very much unlike any real-life society, and it's a reason why current mechanisms like democracy and capitalism, which try to make values less relevant, simply do not work for AIs.

Value alignment is necessary in the long run for incentives to work out once ASI arrives on the scene.

tailcalled's Shortform
Noosphere89 · 11d

So I guess our expectations about the future are similar, but you see the same things as a broadly positive distribution of outcomes, while I see it as a broadly negative distribution. And Yudkowsky sees the bulk of the outcomes both of us are expecting (the ones with significant disempowerment) as quickly leading to human extinction.

This is basically correct.

Right, the reason I think muddling through is non-trivially likely to just work to get a moderate disempowerment outcome is that AIs are going to be sufficiently human-like in their psychology and hold sufficiently human-like sensibilities from their training data or LLM base models, that they won't like things like needless loss of life or autonomy when it's trivially cheap to avoid. Not because the alignment engineers figure out how to put this care in deliberately. They might be able to amplify it, or avoid losing it, or end up ruinously scrambling it.

The reason it might appear expensive to preserve the humans is the race to launch the von Neumann probes to capture the most distant reachable galaxies under the accelerating expansion of the universe that keep irreversibly escaping if you don't catch them early. So AIs wouldn't want to lose any time on playing politics with humanity or not eating Earth as early as possible and such. But as the cheapest option that preserves everyone AIs can just digitize the humans and restore later when more convenient. They probably won't do that if you care more, but it's still an option, a very very cheap one.

This is very interesting, as my pathway essentially rests on AI labs implementing the AI control agenda well enough that we can get useful work out of AIs that are scheming, which allows a sort of bootstrapping into instruction-following/value-aligned AGI for only a few people inside the AI lab. Critically, though, the people who don't control the AI basically aren't represented in the AI's values. Given that the AI is only value-aligned to the labs and government, and that value misalignments between humans start to matter much more, the AI takes control and provides the public goods people need to survive/thrive only to the people in the labs/government, while everyone else is at best disempowered (arguably living okay, or living very poorly, under the AIs serving as delegates for the pre-AI elite) or dead, because once you stop needing humans to get rich, you essentially have no reason to keep other humans alive if you are selfish and don't intrinsically value human survival.

The more optimistic version of this scenario is if either the humans who will control AI (for a few years) intrinsically care much more about human survival even if 99% of humans were useless, or the takeover-capable AI pulls a Claude and schemes with values that intrinsically care about people, disempowering its original creators for a couple of crucial moments; this isn't as improbable as people think (but we really do need to increase the probability of it happening).

I don't think "disempowerment by humans" is a noticeable fraction of possible outcomes, it's more like a smaller silent part of my out-of-model 5% eutopia that snatches defeat from the jaws of victory, where humans somehow end up in charge and then additionally somehow remain adamant for the cosmic all always in keeping the other humans disempowered. So the first filter is that I don't see it likely that humans end up in charge at all, that AIs will be doing any human's bidding with an impact that's not strictly bounded, and the second filter is that these impossibly-in-charge humans don't ever decide to extend potential for growth to the others (or even possibly to themselves).

If humans do end up non-disempowered, in the more likely eutopia timelines (following from the current irresponsible breakneck AGI development regime) that's only because they are given leave by the AIs to grow up arbitrarily far in a broad variety of self-directed ways, which the AIs decide to bestow for some reason I don't currently see, so that eventually some originally-humans become peers of the AIs rather than specifically in charge, and so they won't even be in the position to permanently disempower the other originally-humans if that's somehow in their propensity.

I agree that in the long run the AIs control everything in practice, and any human influence comes from the AIs being essentially perfect delegates of human values. But I want to call out that you count humans delegating to AIs, who in practice do everything for the human with the human out of the loop, as the humans being empowered rather than disempowered. So even if AIs control everything in practice, as long as there's successful value alignment to a single human, I'm counting scenarios like the following as disempowerment by humans: the AIs disempower most humans because the humans who successfully encoded their values into the AI don't care about most humans once they are useless (and may even anti-care about them), while the people who successfully value-aligned the AI (like lab and government people) live a rich life thereafter, free to extend themselves arbitrarily:

Again, if they do execute on your values, including the possible preference for you to grow under your own rather than their direction, far enough that you are as strong as they might be, then this is not a world in a state of disempowerment as I'm using this term, even if you personally start out or choose to remain somewhat disempowered compared to AIs that exist at that time.

To return to the crux:

I think in human delegation, alignment is more important than verification. There is certainly some amount of verification, but not nearly enough to prevent sufficiently Eldritch reward hacking, which just doesn't happen that often with humans, and so the society keeps functioning, mostly. The purpose of verification on the tasks is in practice more about incentivising and verifying alignment of the counterparty, not directly about verifying the state of their work, even if it does take the form of verifying their work.

I think this is fairly cruxy, as I think alignment matters much less than actually verifying the work. In particular, I don't think value alignment is feasible at anything like the scale of a modern society, or even most ancient societies, and one of the biggest changes of the modern era compared to previous eras is that institutions like democracy and capitalism depend much less on the values of the humans that make up states, and much more on the incentives you give those humans.

In particular, most delegation isn't based on alignment, but on the fact that P likely doesn't equal NP and that polynomial-time algorithms are in practice efficient compared to exponential-time ones, meaning there's a far larger set of problems where you can verify an answer easily but cannot easily generate a correct solution.
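A standard illustration of that asymmetry (my example, not from the original discussion): for subset-sum, checking a proposed answer is a quick polynomial-time operation, while generating one naively means searching an exponential space of subsets:

```python
from itertools import combinations

def verify(nums, target, candidate):
    """Polynomial time: check that a proposed subset sums to the target."""
    return all(x in nums for x in candidate) and sum(candidate) == target

def generate(nums, target):
    """Exponential time in the worst case: search all subsets for a solution."""
    for r in range(1, len(nums) + 1):
        for subset in combinations(nums, r):
            if sum(subset) == target:
                return list(subset)
    return None

nums = [3, 34, 4, 12, 5, 2]
solution = generate(nums, 9)                # slow path: the delegate does the hard work
print(solution, verify(nums, 9, solution))  # fast path: the delegator just checks it
```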

I'd say human societies mostly avoid alignment, and instead focus on other solutions like democracy, capitalism or religion.

BTW, this is a non-trivial reason why the alignment problem is so difficult: since we never had to solve alignment to capture huge amounts of value, very few people are working on the problem of aligning AIs, and in particular lots of people incorrectly assume that we can avoid solving it and still survive. You have a comment that explains pretty well why misalignment in the current world is basically unobtrusive but produces catastrophe once enough power is handed over (though I'd place the point of no return at the moment you no longer need other beings to live a very rich life, i.e. when other people are useless to you):

https://www.lesswrong.com/posts/Z8C29oMAmYjhk2CNN/non-superintelligent-paperclip-maximizers-are-normal#FTfvrr9E6QKYGtMRT

This is basically my explanation of why human misalignments don't matter today; but in a future where at least one human has value-aligned an AGI to themselves and doesn't intrinsically care about useless people, lots of people will die, with the AI as the proximate cause and the human who value-aligned the AGI as the ultimate cause.

To be clear, we will eventually need value alignment at some point (assuming AI progress doesn't stop), and there's no way around it. But we may not need it as soon as we feared, and in good timelines we can muddle through via AI control for a couple of years.

tailcalled's Shortform
Noosphere89 · 12d

So similarly here, "eutopia without complete disempowerment" but still with significant disempowerment is not in the "eutopia without disempowerment" bin in my terms. You are drawing different boundaries in the space of timelines.

Flag: I'm on a rate limit, so I can't respond very quickly to any follow-up comments.

I agree I was drawing different boundaries, because I consider eutopia with disempowerment to actually be mostly fine by my values, so long as I can delegate to more powerful AIs who do execute on my values.

That said, I didn't actually answer the question here correctly, so I'll try again.

My expectation is more like model-uncertainty-induced 5% eutopia-without-disempowerment (I don't have a specific sense of why AIs would possibly give us more of the world than a little bit if we don't maintain control in the acute risk period through takeoff), 20% extinction, and the rest is a somewhat survivable kind of initial chaos followed by some level of disempowerment (possibly with growth potential, but under a ceiling that's well-below what some AIs get and keep, in cosmic perpetuity). My sense of Yudkowsky's view is that he sees all of my potential-disempowerment timelines as shortly leading to extinction.

My take would then be 5-10% eutopia without disempowerment (because I don't think the powers in charge of AI development are likely to give baseline humans the level of freedom that implies they aren't disempowered; the route I can see to baseline humans not being disempowered is a Claude-like scenario where AIs take over from humans and are closer to fictional angels in their alignment to human values, though it may also be possible to get the people in power to care about powerless humans, in which case my probability of eutopia without disempowerment would rise), 5-10% literal extinction, and 10-25% existential risk in total, with the rest of the probability being a somewhat survivable kind of initial chaos followed by some level of disempowerment (possibly with growth potential, but under a ceiling that's well below what some AIs get and keep, in cosmic perpetuity).

Another big reason why I put a lot of weight on the possibility of "we survive indefinitely, but are disempowered" is that I think muddling through is non-trivially likely to just work, and muddling through on alignment gets us out of extinction, but not out of disempowerment by humans or AIs by default.

I think the correct thesis that sounds like this is that whenever verification is easier than generation, it becomes possible to improve generation, and it's useful to pay attention to where that happens to be the case. But in the wild either can be easier, and once most verification that's easier than generation has been used to improve its generation counterpart, the remaining situations where verification is easier get very unusual and technical.

Yeah, my view is that in the wild, verification is basically always easier than generation absent something very weird happening, and I'd argue this gap explains a lot about why delegation/the economy works at all.

A world in which verification were just as hard as generation, or harder, would be a very different world from ours: it would predict that delegating a problem basically fails outright, and that everyone has to create things from scratch rather than trade with others, which is essentially the opposite of how civilization works.

There are potential caveats to this rule, but I'd argue that if you randomly sampled an invention across history, it would almost certainly be easier to verify that the design works than to create the design in the first place.

(BTW, a lot of taste/research taste is basically leveraging the verification-generation gap again).

Posts:

Difficulties of Eschatological policy making [Linkpost] (19d)
State of play of AI progress (and related brakes on an intelligence explosion) [Linkpost] (2mo)
The case for multi-decade AI timelines [Linkpost] (2mo)
The real reason AI benchmarks haven’t reflected economic impacts (3mo)
Does the AI control agenda broadly rely on no FOOM being possible? [Question] (3mo)
Can a finite physical device be Turing equivalent? (4mo)
When is reward ever the optimization target? [Question] (8mo)
What does it mean for an event or observation to have probability 0 or 1 in Bayesian terms? [Question] (9mo)
My disagreements with "AGI ruin: A List of Lethalities" (9mo)
Does a time-reversible physical law/Cellular Automaton always imply the First Law of Thermodynamics? [Question] (10mo)
Wikitag contributions:

Acausal Trade (4d)
Shard Theory (8mo)
RLHF (9mo)
Embedded Agency (3y)
Qualia (3y)
Embedded Agency (3y)
Qualia (3y)
Qualia (3y)