It was a relatively fringe topic that only recently got the attention of a large number of real researchers. And parts of it could need large amounts of computational power afforded only by superhuman narrow AI.
There have been a few random PhD dissertations saying the topic is hard, but as far as I can tell there has only recently been a push for a group effort by capable and well-funded actors (e.g. OpenAI's interpretability research).
As an outsider, I don't trust the older alignment research much. It seems to me that Yud has built a cult of personality around...
And, sure, but it's not clear why any of this matters. What is the thing that we're going to (attempt to) do with AI, if not use it to solve real-world problems?
It matters because the original poster isn't saying we won't use it to solve real-world problems, but rather that real-world constraints (e.g. the laws of physics) will limit its speed of advancement.
An AI likely cannot easily predict a chaotic system unless it can simulate reality at high fidelity, because small errors in its model of the initial state grow quickly. I guess the OP is assuming the TAI won't have this capability, so even if we do solve real-world problems with AI, it is still limited by real-world experimentation requirements.
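To make that constraint concrete, here's a toy sketch (my own illustration, nothing specific to TAI): in a chaotic system like the logistic map, a tiny error in the estimated initial state compounds until the prediction is no better than a guess, which is why "simulate reality at high fidelity" is such a strong requirement.

```python
# Toy illustration of sensitivity to initial conditions (a generic chaotic
# system, not a claim about any real AI): a one-in-a-million error in the
# initial state grows until the "prediction" says nothing about the true trajectory.
def logistic_map(x, r=4.0):
    """One step of the logistic map, a standard chaotic toy system."""
    return r * x * (1.0 - x)

true_state = 0.400000
model_state = 0.400001  # the predictor's estimate is off by 1e-6

for step in range(1, 51):
    true_state = logistic_map(true_state)
    model_state = logistic_map(model_state)
    if step % 10 == 0:
        print(f"step {step:2d}: |error| = {abs(true_state - model_state):.6f}")
# Within a few dozen steps the error is as large as the state itself, so a
# predictor without a near-exact model of the initial conditions is stuck
# waiting on real-world measurement and experimentation instead.
```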
“I'd better predict words with the appropriate amount of green-coloured objects”, and write about green-coloured objects even more frequently, and then also notice that, and, in the end, write exclusively about green objects.
Can you explain this logic to me? Why would it write more and more about green-coloured objects even if its training data was biased towards green-coloured objects? If there is a bad trend in its output, then without reinforcement, why would it make that trend stronger? Do you mean it recognizes, incorrectly, that improving said bad trend...
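In case it helps pin down what I'm asking, here is a toy sketch of the loop I think the quoted passage describes (the whole setup, including the over-reactive update, is my own assumption, not something from the original comment):

```python
import math
import random

# Toy sketch (my reading of the quoted scenario): the model sets its next
# propensity to write about green objects from the frequency of green objects
# in its own recent output, and it over-reacts to deviations from the neutral
# baseline of 0.5, so a small initial bias amplifies with no external reinforcement.
random.seed(0)

def next_green_propensity(observed_fraction, gain=12.0):
    """Over-sensitive update: any gain above 4 makes the 0.5 baseline unstable."""
    return 1.0 / (1.0 + math.exp(-gain * (observed_fraction - 0.5)))

p_green = 0.55  # slight initial bias toward green-coloured objects
for generation in range(1, 11):
    # The model writes 1000 snippets, each about green objects with prob. p_green.
    snippets = [random.random() < p_green for _ in range(1000)]
    observed = sum(snippets) / len(snippets)
    # It then "notices" the green frequency in its own output and overshoots.
    p_green = next_green_propensity(observed)
    print(f"generation {generation:2d}: p(green) = {p_green:.3f}")
# The bias grows each round until the model writes almost exclusively about
# green objects.
```

If the update simply matched the observed frequency instead of overshooting, the bias wouldn't grow on its own, which is the part of the claim I'm trying to understand.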
I think my point is lowering the bar to just there being a non-trivial probability of it following the rule. Fully aligning AIs to near certainty may be a higher bar than just potentially aligning AI.