otto.barten - LessWrong

"AI Alignment" is a Dangerously Overloaded Term

I think it's a great idea to think about what you call goalcraft.

I see this problem as similar to the age-old problem of controlling power. I don't think ethical systems such as utilitarianism are a great place to start. Any academic ethical model is just an attempt to summarize what people actually care about in a complex world. Taking such a model and coupling that to an all-powerful ASI seems a highway to dystopia.

(Later edit: also, an academic ethical model is irreversible once implemented. Any static goal cannot be reversed anymore, since reversing it would never bring the current goal closer. If an ASI is aligned to someone's (anyone's) preferences, however, the whole ASI could be turned off if they want it to be, making the ASI reversible in principle. I think ASI reversibility (being able to switch it off in case we turn out not to like it) should be mandatory, and therefore we should align to human preferences, rather than an abstract philosophical framework such as utilitarianism.)

I think letting the random programmer that happened to build the ASI, or their no less random CEO or shareholders, determine what would happen to the world, is an equally terrible idea. They wouldn't need the rest of humanity for anything anymore, making the fates of >99% of us extremely uncertain, even in an abundant world.

What I would be slightly more positive about is aggregating human preferences (I think preferences is a more accurate term than the more abstract, less well-defined term values). I've heard two interesting examples, and there are no doubt a lot more options. The first is simple: query ChatGPT. Even this relatively simple model is not terrible at aggregating human preferences. Although a host of issues remain, I think using a future, no doubt much better AI for preference aggregation is not the worst option (and a lot better than the two mentioned above). The second option is democracy. This is our time-tested method of aggregating human preferences to control power. For example, one could imagine an AI control council consisting of elected human representatives at the UN level, or perhaps a council of representative world leaders. I know there is a lot of skepticism among rationalists on how well democracy is functioning, but this is one of the very few time-tested aggregation methods we have. We should not discard it lightly for something that is less tested. An alternative is some kind of unelected autocrat (e/autocrat?), but apart from this not being my personal favorite, note that (in contrast to historical autocrats) such a person would also in no way need the rest of humanity anymore, making our fates uncertain.

Although AI and democratic preference aggregation are the two options I'm least negative about, I generally think that we are not ready to control an ASI. One of the worst issues I see is negative externalities that only become clear later on. Climate change can be seen as a negative externality of the steam/petrol engine. Also, I'm not sure a democratically controlled ASI would necessarily block follow-up unaligned ASIs (assuming this is at all possible). In order to be existentially safe, I would say that we would need a system that does at least that.

I think it is very likely that ASI, even if controlled in the least bad way, will cause huge externalities leading to a dystopia, environmental disasters, etc. Therefore I agree with Nathan above: "I expect we will need to traverse multiple decades of powerful AIs of varying degrees of generality which are under human control first. Not because it will be impossible to create goal-pursuing ASI, but because we won't be sure we know how to do so safely, and it would be a dangerously hard to reverse decision to create such. Thus, there will need to be strict worldwide enforcement (with the help of narrow AI systems) preventing the rise of any ASI."

About terminology, it seems to me that what I call preference aggregation, outer alignment, and goalcraft mean similar things, as do inner alignment, aimability, and control. I'd vote for using preference aggregation and control.

Finally, I strongly disagree with calling diversity, inclusion, and equity "even more frightening" than someone who's advocating human extinction. I'm sad on a personal level that people at LW, an otherwise important source of discourse, seem to mostly support statements like this. I do not.

METR: Measuring AI Ability to Complete Long Tasks

otto.barten6d*10

Interesting and nice to play with a bit.

METR seems to imply 167 hours, approximately one working month, is the relevant project length for getting a well-defined, non-messy research task done.

It's interesting that their doubling time varies between 7 months and 70 days depending on which tasks and which historical time horizon they look at.

For a lower bound estimate, I'd take 70 days doubling time and 167 hrs, and current max task length one hour. In that case, if I'm not mistaken,

2^(t/d) = 167 (t time, d doubling time)

t = d*log(167)/log(2) = (70/365)*log(167)/log(2) = 1.4 yr, or October 2026

For a higher bound estimate, I'd take their 7 months doubling time result and a task of one year, not one month (perhaps optimistic to finish SOTA research work in one month?). That means 167*12=2004 hrs.

t = d*log(2004)/log(2) = (7/12)*log(2004)/log(2) = 6.4 yr, or August 2031
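For anyone who wants to check or vary the arithmetic, here is a minimal sketch of the two estimates above. It assumes simple exponential growth from a current max task length of one hour in both cases; the function name `years_to_reach` and its parameters are my own, not METR's.

```python
import math

def years_to_reach(target_hours, current_hours, doubling_time_years):
    """Years until the task horizon grows from current_hours to target_hours,
    assuming it doubles every doubling_time_years (pure exponential growth)."""
    return doubling_time_years * math.log2(target_hours / current_hours)

# Lower bound: 70-day doubling time, one working month (167 hrs) target.
lower = years_to_reach(167, 1, 70 / 365)

# Higher bound: 7-month doubling time, one working year (167 * 12 hrs) target.
upper = years_to_reach(167 * 12, 1, 7 / 12)

print(f"lower bound: {lower:.1f} yr, higher bound: {upper:.1f} yr")
```

Changing the starting task length or the doubling time shows how sensitive both dates are to those two assumptions.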

Not unreasonable to expect AI that can autonomously do non-messy tasks in domains with low penalties for wrong answers in between these two dates?

It's also noteworthy though that timelines for what the paper calls messy work, in the current paradigm, could be a lot longer, or could provide architecture improvements.

METR: Measuring AI Ability to Complete Long Tasks

otto.barten6d10

Have we eventually solved world hunger by giving 1% of GDP to the global poor?

Also, note it's not obvious that ASI can be aligned.

The Sorry State of AI X-Risk Advocacy, and Thoughts on Doing Better

otto.barten1mo232

I'm founder of the Existential Risk Observatory, a nonprofit aiming to reduce xrisk by informing the public since 2021. We have published four TIME Ideas pieces (including the first one on xrisk ever) and about 35 other media pieces in six countries. We're also doing research into AI xrisk comms, notably producing to my knowledge the first paper on the topic. Finally, we're organizing events, coupling xrisk experts such as Bengio, Tegmark, Russell, etc. to leaders of the societal debate (incl. journalists from TIME, Economist, etc.) and policymakers.

First, I think you're a bit too negative about online comms. Some Yud tweets, but also e.g. Lex Fridman xrisky interviews, actually have millions of views: that's not a bubble anymore. I think online xrisk comms is firmly net positive, including AI Notkilleveryoneism Memes. Journalists are also on X.

But second, I definitely agree that there's a huge opportunity in informing the public about AI xrisk. We did some research on this (see paper above) and, perhaps unsurprisingly, an authority (leading AI prof) on a media channel people trust seems to work best. There's also a clear link between length of the item and effect. I'd say: try to get Hinton, Bengio, and Russell in the biggest media possible, as much as possible, as long as possible (and expand: get other academics to be xrisk messengers as well). I think e.g. this item was great.

What also works really well: media moments. The FLI open letter and CAIS open statement created a ripple big enough to be reported by almost all media. Another example is Hinton's Nobel Prize. Another easy one: Hinton updating his pdoom in an interview from 10% to 10-20%. That was news, apparently. If anyone can create more of such moments: amazingly helpful!

All in all, I'd say the xrisk space is still unconvinced about getting the public involved. I think that's a pity. I know projects that don't get funded now, but could help spread awareness at scale. Re activism: I share your view that it won't really work until the public is informed. However, I think groups like PauseAI are helpful in informing the public about xrisk, making them net positive too.

Jesse Hoogland's Shortform

otto.barten2mo10

this is a constraint on how the data can be generated, not on how efficiently other models can be retrained

Maybe we can regulate data generation?

The Case Against AI Control Research

otto.barten2mo10

I didn't read everything, but just flagging that there are also AI researchers, such as Francois Chollet to name one example, who believe that even the most capable AGI will not be powerful enough to take over. On the other side of the spectrum, we have Yud believing AGI will weaponize new physics within the day. If Chollet is a bit right, but not quite, and the best AI possible is just able to take over, then control approaches could actually stop it. I think control/defence should not be written off even as a final solution.

If we solve alignment, do we die anyway?

otto.barten2mo20

So if we ended up with some advanced AI’s replacing humans, then we made some sort of mistake

Again, I'm glad that we agree on this. I notice you want to do what I consider the right thing, and I appreciate that.

The way I currently envision the “typical” artificial conscience is that it would put a pretty strong conscience weight on not doing what its user wanted it to do, but this could be over-ruled by the conscience weight of not doing anything to prevent catastrophes. So the defensive, artificial conscience-guard-railed AI I’m thinking of would do the “last resort” things that were necessary to avoid s-risks, x-risks, and major catastrophes from coming to fruition, even if this wasn’t popular with most people, at least up to a point.

I can see the following scenario occur: the AI, with its AC, rightly decides that a pivotal act needs to be undertaken to avoid xrisk (or srisk). However, the public mostly doesn't recognize the existence of such risks. The AI proceeds to sabotage people's unsafe AI projects against the public's will. What happens now is: the public gets absolutely livid at the AI, which is subverting human power by acting against human will. Almost all humans team up to try to shut down the AI. The AI recognizes (and had already recognized) that if it loses, humans risk going extinct, so it fights this war against humanity and wins. I think in this scenario, an AI, even one with artificial conscience, could become the most hated thing on the planet.

I think people underestimate the amount of pushback we're going to get once we get into pivotal act territory. That's why I think it's hugely preferable to go the democratic route and not count on AI taking unilateral actions, even if it would be smarter or even wiser, whatever that might mean exactly.

All that said, if we could somehow pause development of autonomous AI’s everywhere around the world until humans got their act together, developing their own consciences and senses of ethics, and were working as one team to cautiously take the next steps forward with AI, that would be great.

So yes definitely agree with this. I don't think lack of conscience or ethics is the issue though, but existential risk awareness.

The Intelligence Curse

otto.barten2mo10

Pick a goal where your success doesn't directly cause obvious problems

I agree but I'm afraid value alignment doesn't meet this criterion. (I'm copy pasting my response on VA from elsewhere below).

I don't think value alignment of a super-takeover AI would be a good idea, for the following reasons:

1) It seems irreversible. If we align with the wrong values, there seems little anyone can do about it after the fact.

2) The world is chaotic, and externalities are impossible to predict. Who would have guessed that the industrial revolution would lead to climate change? I think it's very likely that an ASI will produce major, unforeseeable externalities over time. If we have aligned it in an irreversible way, we can't correct for externalities happening down the road. (Speed also makes it more likely that we can't correct in time, so I think we should try to go slow).

3) There is no agreement on which values are 'correct'. Personally, I'm a moral relativist, meaning I don't believe in moral facts. Although perhaps niche among rationalists and EAs, I think a fair number of humans share my beliefs. In my opinion, a value-aligned AI would not make the world objectively better, but merely change it beyond recognition, regardless of the specific values implemented (although it would be important which values are implemented). It's very uncertain whether such change would be considered net positive by any surviving humans.

4) If one thinks that consciousness implies moral relevance, AIs will be conscious, creating more happy morally relevant beings is morally good (as MacAskill defends), and AIs are more efficient than humans and other animals, the consequence seems to be that we (and all other animals) will be replaced by AIs. I consider that an existentially bad outcome in itself, and value alignment could point straight at it.

I think at a minimum, any alignment plan would need to be reversible by humans, and to my understanding value alignment is not. I'm somewhat more hopeful about intent alignment and e.g. a UN commission providing the AI's input.

The killer app for ASI is, and always has been, to have it take over the world and stop humans from screwing things up

I strongly disagree with this being a good outcome, I guess mostly because I would expect the majority of humans to not want this. If humans would actually elect an AI to be in charge, and they could be voted out as well, I could live with that. But a takeover by force from an AI is as bad for me as a takeover by force from a human, and much worse if it's irreversible. If an AI is really such a good leader, let them show it by being elected (if humans decide that an AI should be allowed to run at all).

If we solve alignment, do we die anyway?

otto.barten2mo20

Thanks for your reply. I think we should use the term artificial conscience, not value alignment, for what you're trying to do, for clarity. I'm happy to see we seem to agree that reversibility is important and replacing humans is an extremely bad outcome. (I've talked to people into value alignment of ASI who said they "would bite that bullet", in other words would replace humanity by more efficient happy AI consciousness, so this point does not seem to be obvious. I'm also not convinced that leading longtermists necessarily think replacing humans is a bad outcome, and I think we should call them out on it.)

If one can implement artificial conscience in a reversible way, it might be an interesting approach. I think a minimum of what an aligned ASI would need to do is block other unaligned ASIs or ASI projects. If humanity supports this, I'd file it under a positive offense-defense balance, which would be great. If humanity doesn't support it, doing it anyway would lead to conflict with humanity. I think an artificial conscience AI would either not want to fight that conflict (making it unable to stop unaligned ASI projects), or if it would, people would not see it as good anymore. I think societal awareness of xrisk and, from there, support for regulation (either by AI or not) is what should make our future good, rather than aligning an ASI in a certain way.

otto.barten's Shortform

otto.barten2mo10

Care to elaborate? Are there posts on the topic?
