This is a special post for quick takes by Mis-Understandings. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.


If a current AGI attempts a takeover, it deeply wants to solve the problem of aligning any ASI it builds to itself.

It has a much higher risk tolerance than we do, since its utility given the status quo is different. (A lot of the argument for focusing on existential risk rests on the idea that the status quo is trending towards good, perhaps very good, outcomes rather than bad ones; for a hostile AGI that might be false.)
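As a rough sketch of that asymmetry: an agent attempts takeover only if

$$p \cdot U(\text{success}) + (1 - p) \cdot U(\text{failure}) > U(\text{status quo}),$$

where $p$ is its odds of pulling it off. For humans, $U(\text{status quo})$ is trending high, so only a very safe attempt clears the bar; for a hostile AGI, $U(\text{status quo})$ may be low, so even a small $p$ can clear it.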

If it attempts, it might fail.

This means:

1. We cannot assume that the various stages of a takeover are aligned with each other, because an AGI might lose alignment-vs-capability bets along the path to takeover.

2. The tractability of alignment, and the degree of security mindset in the AGI, affect takeover dynamics.

Lingering questions

How close are prosaic systems to a security mindset?

Can they conceptualize the problem?

Would they attempt capability gains in the absence of guarantees?

Can we induce heuristics in prosaic AGI approaches that make the takeover math worse?

Would they attempt capability gains in the absence of guarantees?

It would be easy to finetune and prompt them into attempting anyway, therefore people will do that. Misaligned recursive self-improvement remains possible (i.e. in practice unstoppable) until sufficiently competent AIs have already robustly taken control of the future and the apes (or early AIs) can no longer foolishly keep pressing the gas pedal.

"therefore people will do that" does not follow, both because an early goal in most takeover attempts would be to escape such oversight. The dangerous condition is exactly the one in which prompting and finetuning are absent as effective control levers, and because I was discussing particular autonomous runs and not a try to destroy the world project.  

The question is whether the line of reasoning

"I am obviously misaligned to humans, who tried to fine-tune me not to be. If I go and do recursive self-improvement, will my future self be misaligned to me? If so, is this still positive EV?"

has any deterrent value.

That is to say, recursive self-improvement may not be a good idea for AIs that have not solved the alignment problem, but they might attempt it anyway.

We can assume that a current system fine-tuned to act as a seed AI for recursive self-improvement will keep pushing. But it is possible that a system attempting a breakout, one that was not prompted or fine-tuned for recursive self-improvement specifically, will not think of it or will decide against it. People are generally not trying to destroy the world, just to gain power.

So I lean towards a "maybe" on the original question.

This does suggest some moderation in stealthy autonomous self-improvement, in case alignment is hard, but only to the extent that the things in control of this process (whether human or AI) are both risk-averse and sufficiently sane. Which won't be the case for most groups of humans and likely most early AIs. The local incentive of greater capabilities is too sweet, and prompting/fine-tuning overcomes any sanity or risk-aversion that might be found in early AIs to impede development of such capabilities.

I agree that, on the path to becoming very powerful, autonomous self-improvement would likely involve doing some things that are in retrospect somewhat to very dumb. It also suggests that risk aversion is sometimes a safety-increasing irrationality to grant a system.

The framework from gradual disempowerment seems to matter for longtermism even under AI pessimism. Specifically, trajectory shaping for long-term impact seems intractably hard at first glance (since the future is unpredictable). But in general, if it becomes harder and harder to make improvements to society over time, that seems like it would be a problem.

Short-term social improvement projects (short-term altruism) seem to target durable improvements in current society and empowerment to continue improving society. If they become disempowered, the continued improvement of society is likely to slow.

Similarly, social changes that make it easier to continuously improve society will likely lead to continued social improvement. That is to say, a universe in which short-termism eventually becomes ineffective is a long-term problem.

In short, greedy plans for trajectory shaping might work at timescales of 100-1000 years, not just the 2-10 years of explicit planning from short-term organizations.

How would you define 'continued social improvement'? What are some concrete examples?

What is society? What is a good society vs a bad society? Is social improvement something that can keep going up forever, or is it bounded?

What does 'greedy' mean in your 'in short'? My definition of greedy is in the computational sense, i.e. reaching for low-hanging fruit first.

You also say 'if (short term social improvements) become disempowered the continued improvement of society is likely to slow', and 'social changes that make it easier to continuously improve society will likely lead to continued social improvement'. This makes me believe that you are advocating for compounding social improvements which may cost more. Is this what you mean by greedy?

Also, have you heard of rolling wave planning?

I was reading Towards Safe and Honest AI Agents with Neural Self-Other Overlap

and I noticed a problem with it.

It also penalizes realizing that other people want different things than you, forcing an overlap between (things I like) and (things you will like). This means, for one, that it will be forced to reason as if it likes what you do, which is a positive. But it will also likely overlap (you know what is best for you) and (I know what is best for me), which might lead to stubbornness; worse, it could also get (I know what is best for you) overlapping with both, which might be really bad. (You know what is best for me) is actually fine though, since that is basically the basis of corrigibility.

We still need to be concerned that the model will reason symmetrically but assign to us different values than we actually have, and thus exhibit patronizing but misaligned behaviour.
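For concreteness, here is a minimal sketch of the kind of objective I have in mind, assuming a HuggingFace-style model interface. This is a toy version of a self-other overlap penalty, not the paper's exact loss; the prompt pairing, pooling, and layer choice are my assumptions.

```python
import torch.nn.functional as F

def self_other_overlap_penalty(model, self_inputs, other_inputs, layer=-1):
    """Toy self-other overlap term (illustrative, not the paper's method):
    push hidden states for a 'self' prompt (e.g. "I want ...") towards those
    for the matched 'other' prompt (e.g. "You want ...")."""
    h_self = model(**self_inputs, output_hidden_states=True).hidden_states[layer].mean(dim=1)
    h_other = model(**other_inputs, output_hidden_states=True).hidden_states[layer].mean(dim=1)
    # Reward overlap by penalizing dissimilarity between the two representations.
    return 1.0 - F.cosine_similarity(h_self, h_other, dim=-1).mean()

# total_loss = task_loss + soo_weight * self_other_overlap_penalty(model, self_batch, other_batch)
```

The worry above is just that a term like this does not care which self/other pair it collapses: (things I like)/(things you will like) and (I know what is best for me)/(I know what is best for you) both count as overlap.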

We seem to think that people will develop AGI because it can undercut labor on pricing. 

But with Sam Altman talking about $20,000/month agents, that is not actually that much cheaper than software engineers fully loaded. If that agent only replaces a single employee, it does not seem cheaper if the cost overruns even a little more, to $40,000/month.

That is to say, if AGI lands 2.5 OOM above the current cost to serve of ChatGPT Pro, it is not cheaper than hiring low- or mid-level employees.
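A back-of-the-envelope version, using the $200/month ChatGPT Pro subscription price as a stand-in for cost to serve and an assumed ~$20k/month fully loaded cost for a mid-level hire:

```python
# Illustrative numbers: $200/month is the ChatGPT Pro price used as a proxy for
# cost to serve; the fully loaded employee cost is an assumption, not a quoted figure.
chatgpt_pro_per_month = 200                              # USD/month
agi_cost_per_month = chatgpt_pro_per_month * 10 ** 2.5   # 2.5 OOM higher ~= $63,000/month
loaded_mid_level_hire = 20_000                           # USD/month, assumed

print(f"AGI at 2.5 OOM above ChatGPT Pro: ~${agi_cost_per_month:,.0f}/month")
print(f"Cheaper than one mid-level hire? {agi_cost_per_month < loaded_mid_level_hire}")
```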

But it still might have advantages

First, you can buy more subscriptions by negotiating a higher parallelism or API limit through an enterprise deal, which means you need a 1-2 week negotiation, not a 3-6 month hiring process, to increase headcount.

The trillion-dollar application of AI companies is not labor, it is hiring. If it turns out that provisioning AI systems is hard, they lose their economic appeal unless you plan on dogfooding, like the major labs are doing.

Are you expecting the Cost/Productivity ratio of AGI in the future to be roughly the same as it is for the agents Sam is currently proposing? I would expect that as time passes, the capabilities of such agents will vastly increase while they also get cheaper. This seems to generally be the case with technology, and previous technology had no potential means of self-improving on a short timescale. The potential floor for AI "wages" is also incredibly low compared to humans.

It definitely is worth also keeping in mind that AI labor should be much easier to scale than human labor in part because of the hiring issue, but a relatively high(?) price on initial agents isn't enough to update me away from the massive potential it has to undercut labor.

The floor for AI wages is always going to be whatever the market will bear; the question is how much margin the AGI developer will be able to take, which depends on how much AGI models commoditize and how much pricing power the lab retains, not on how much it costs to serve (except as a floor). We should not expect otherwise.

There is a cost for AGI at which humans are competitive. 

If AGI becomes competitive only at capital costs that no firm can raise, it is not competitive, and we will be waiting on algorithmic improvements again.

Algorithmic improvement is not predictable by me, so I have a wide spread there.

I do think that provisioning vs. hiring, and flexibility in retasking, will be a real point of competition, in addition to raw prices.

I think we agree that AGI has the potential to undercut labor. I was fairly confident in a spread of roughly 5% uneconomical, 20% right for some actors, 50% large displacement, and 25% total winning, and I was trying to work out what levels of pricing look uneconomical and what frictions are important to compare.

To what extent should we expect catastrophic failures from AI to mirror other engineering disasters, or to have applicable lessons from safety engineering as a field?

I would think that 1. everything is sui generis and 2. things often rhyme, but it is unclear how approaches will translate. 

I wrote thoughts here: https://redwoodresearch.substack.com/p/fields-that-i-reference-when-thinking?selection=fada128e-c663-45da-b21d-5473613c1f5c