9 Answers

tailcalled


I'm gonna take a probably-contrarian position on my own question:

While I think technical questions such as natural abstractions are important, it seems to me that the most central question is, what do we even want to align it to? What are "human values"?

I think I have a plausible answer (famous last words) for a lot of it, but there is a paradox/contradiction that I keep getting stuck on: Malthusianism.

As in, we'll probably want a future where a lot of people (in a broad sense potentially including Ems etc.) get to live independently. But if we do, then there are three things it seems we cannot have all at once:

  1. Economic freedom: people may freely trade and gain ownership over things.
  2. Reproductive freedom: people may freely create new people, at least up to a point (such as those similar to themselves).
  3. Slack: people can live without optimizing relentlessly for economic productivity and efficiency.

The reason being that if you have 1+2, then some highly economically efficient agents are gonna copy themselves until they outcompete everyone else, preventing 3.
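
A back-of-the-envelope way to see the squeeze (purely a sketch; the symbols N_0, r, rho, and c are illustrative, not claims about any concrete scenario): unchecked replication grows exponentially, while the resources reachable by physical expansion grow at most polynomially, so per-capita resources get driven toward subsistence.

```latex
% Replicator population vs. reachable resources (illustrative sketch).
% N_0 = initial population, r = net replication rate,
% \rho = resource density, c = maximum expansion speed.
N(t) = N_0 e^{rt}, \qquad
R(t) \le \rho \cdot \tfrac{4}{3}\pi (c t)^3
\quad\Longrightarrow\quad
\frac{R(t)}{N(t)} \to 0 \text{ as } t \to \infty .
```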

The "default trajectory" seems to be 1+2. It raises the multipolarity vs unipolarity debate, which in my view basically boils down to whether we lose slack and people get starved, or we lose "a lot of people get to live independently" and get paperclipped.

Some theorists point out that in multipolar scenarios, maybe AI respects property rights enough that we get a Slack' outcome: people who were sufficiently wealthy before AGI and made the right investment decisions (e.g. putting money in chips) can live without optimizing relentlessly for economic productivity and efficiency. These theorists often seem worried that people will decide to shut down AI progress, preventing them from achieving 1+2+3'.

What do you hope to get? 1+2? 1+3? 2+3? 1+2+3'? Something else?

Probably I should have read more sci-fi since it seems like the sort of question sci-fi might explore.

Malthusianism is mainly a problem when new people can take resources that are not their parents', which is a form of disrespect for property rights (mandatory redistribution to new people from those who didn't consent to their creation). If it's solely the parents who are responsible for the wealth of their children, then reproduction won't affect others, except morally in internal mindcrime scenarios where someone generates great suffering within their own domain. (This is in the context of an initial condition where every person owns enough for a slack-enabling mode of survival in perpetuity; only growth ever calls for more.)

tailcalled
I don't think that holds in reality. Most people today seem dependent on continually cooperating to obtain additional resources, which they mostly do in free market competition with others.
Vladimir_Nesov
Universal FIRE wealth doesn't hold today. The Milky Way comprises more than 100 billion stars, the Hubble Volume far more. Even an astronomically tiny fraction of the cosmic endowment in initial equity would be sufficient to run a single upload for as long as the currently prevalent mode of physical laws still applies. So if humanity isn't wiped out outright, this initial condition seems plausible to obtain at some point within a few physical years of the first AGI.
tailcalled
"Assign ownership over fractions of the cosmic endowment to people who live today" might be a reasonable compromise between 3 and 3'.

Just 3 with a dash of 1?
I don't understand the specific appeal of complete reproductive freedom. It is desirable to have that freedom, in the same way it is desirable to be allowed to do whatever I feel like doing. However, for that more general heading of arbitrary freedom, the answer is "you do have to draw lines somewhere". In a good future, I'm not allowed to harm a person (nonconsensually), I can't requisition all matter in the available universe for my personal projects without ~enough of the population endorsing it, and I can't reproduce / const...


The problem is not so much which one of 1, 2, 3 to pick but whether 'we' get a chance to pick it at all. If there is space, free energy, and diversity, there will be evolution going on among populations, and evolution will consistently push things towards more reproduction until it hits a Malthusian limit, at which point it will push towards greater competition and economic/reproductive efficiency. The only way to avoid this is to remove the preconditions for evolution -- any of variation, selection, or heredity -- but these seem quite natural in a world of large AI populations, so in practice this will require some level of centralized control.

tailcalled
Yes. Variation corresponds to "a lot of people (in a broad sense potentially including Ems etc.) get to live independently", selection corresponds to economic freedom, and heredity corresponds to reproductive freedom. (Not exactly ofc, but it's hard to write something which exactly matches any given frame.) Or rather, it's both a question of how to pick it and what to pick. Like, the MIRI plan is to grab control over the world and then use this to implement some sort of cosmopolitan value system. But if one does so, there's still the question of which cosmopolitan value system to implement.
  1. We can just Do Something Else Which Is Not a Malthusian Trap? Like, have an agreement of not having more than two kids per hundred years per parent and colonize stars accordingly. I think it will be simple, especially after uplifting a major part of humanity.
  2. In relatively hardcore scenarios, we can just migrate into simulations with computation management from benevolent AIs.

> We can just Do Something Else Which Is Not a Malthusian Trap? Like, have an agreement of not having more than two kids per hundred years per parent and colonize stars accordingly. I think it will be simple, especially after uplifting a major part of humanity.

This agreement falls under interfering with 2.

> In relatively hardcore scenarios, we can just migrate into simulations with computation management from benevolent AIs.

That doesn't solve the problem unless one takes a stance on 2.

quetzal_rainbow
It's not interfering with someone's freedoms if it is voluntary cooperation?

Ok, but then you haven't solved the problem for the subset of people who decide they don't want to cooperate.

quetzal_rainbow
Well, it's a decision theory problem. I recommend "Unifying bargain" by Diffractor.
tailcalled
Maybe I will look at that again at some point in a while.

I think getting to “good enough” on this question should pretty much come for free when the hard problems are solved. For example any common sense statement like “Maximize flourishing as depicted in the UN convention on human rights” is IMO likely to get us to a good place, if the agent is honest, remains aligned to those values, and interprets them reasonably intelligently. (With each of those three pre-requisites being way harder than picking a non-harmful value function.)

If our AGIs, after delivering utopia, tell us we need to start restricting childbearing...

Ege Erdil


Are neural networks trained using reinforcement learning from human feedback in a sufficiently complex environment biased towards learning the human simulator or the direct translator, in the sense of the ELK report?

I think there are arguments in both directions and it's not obvious which solution a neural network would prefer if trained in a sufficiently complex environment. I also think the question is central to how difficult we should expect aligning powerful systems trained in the current paradigm to be.

Vladimir_Nesov


Deconfuse pseudokindness, figure out how to get more of it into prosaic AIs.

My guess is that key concepts of pseudokindness are (1) frames or their overlapping collections (locally available culture) that act as epistemic environments (which people could live in and grow out of while remaining themselves), surrounded by (2) membranes that filter how everything else can interact with the frames/environments, and (3) logical dependencies (narrow reasoners/oracles/models) that act as channels in the membranes, implement the filtering, and safely introduce options and ideas. This sure could use quite a lot of deconfusion!

Nathan Helm-Burger


I think the biggest problem is currently: how do we get a group of people (e.g. a leading lab) to build powerful AGI in a safely contained simulation and study it without releasing it? I think this scenario gives us 'multiple tries', and I think we need that to have a decent chance of succeeding at alignment. If we do get there, we can afford to be wrong about a lot of our initial ideas, and then iterate. That's inherently a much more favorable scenario.

David Gross


This will probably be dismissed as glib, but: human alignment.

Gordon Seidoh Worley


Biggest problem? That we're not yet even aligned as a species on the claim that AI could kill everyone and that we should not kill everyone. Little else matters if we can't coordinate to not press forward on capabilities ahead of safety.

Lao Mein


A good candidate is the sharp left turn. Alignment techniques that work for sub-human and human-level AIs may well stop working once the AI starts becoming superhuman.

Verification of alignment plans is probably the biggest one though. We can't verify alignment proposals from superhuman AI, or human-level AI, or even other humans before trying them out, which may well kill us. I think the best way forward is to hire millions of alignment researchers and hope one of them comes up with a plan that can be verified in a way we don't know yet. 

My personal example for something like this is the Minie Ball + rifled musket. It's an idea invented in the mid-1800s (after the bolt-action rifle!) that greatly increased the accuracy, range, and lethality of muskets. However, despite required ideas like rifling being around since 1500 and millions of people working over centuries to improve firearms, this obvious and easily verifiable idea took 300 years to discover. There are plenty of in-hindsight-obvious ideas in AI. I think (hope?) there is something like that on the scale of the Minie Ball for alignment. After all, there have only been <300 people working on it for ~20 years, and far fewer than 20 years for the current neural network paradigm.

O O

I think you can note that even if we don’t fully trust the author behind a proposal for alignment, we can still verify it. For example, if it’s a mathematical proof for alignment, we can verify the accuracy of the proof with automated proof verification and reject anything that’s too complex.

This may not be possible in reality but it’s an example where we don’t really need to trust the proposer.
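
As a minimal sketch of the verification/generation asymmetry (nothing alignment-specific here, just the mechanics of machine-checking a supplied proof), a proof assistant like Lean accepts proof terms after a cheap mechanical check, while finding proofs of arbitrary true statements remains the hard direction:

```lean
-- The kernel mechanically checks each supplied proof term and rejects
-- anything malformed; no trust in the proof's author is required.
theorem two_plus_two : 2 + 2 = 4 := rfl

-- A slightly less trivial example, reusing a lemma from the core library.
theorem add_comm_example (a b : Nat) : a + b = b + a := Nat.add_comm a b
```

The open question, of course, is whether alignment-relevant claims can be stated precisely enough for this kind of checking to apply.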

localdeity
Formal verification for specific techniques may be possible, and is desirable. Formal verification for an entire overall plan... Let's suppose we wanted a formal proof of some basic sanity checks of the plan. For example: if the plan is followed, then, as of 2100, there will be at least 8 billion humans alive and at least as happy and free as they are today. I mean, forget "happy" and "free"—how can you even define a "human" in formal mathematical language? Are they defined as certain arrangements of subatomic particles? Such a definition would presumably be unmanageably long. And if "human" is not defined, then that leaves you open to ending up with, say, 8 billion dumb automatons programmed to repeat "I'm happy and free".

You might try relying on some preexisting process to decide if something is a human. If it's a real-world process, like polling a trusted group of humans or sending a query to a certain IP address, this is vulnerable to manipulation of the real world (coercing the humans, hacking the server). You might try giving it a neural net that's trained to recognize humans—the neural net can be expressed as a precise mathematical object—but then you're vulnerable to adversarial selection, and might end up with bizarre-looking inanimate objects that the net thinks are human. (Plus there's the question of exactly how you take a real-world human and get something that's fed into the neural net. If the input to the net is pixels, then how is the photo taken, and can that be manipulated?)

Keep one's eyes open for opportunities, I guess, but it seems likely that the scope of formal verification will be extremely limited. I expect it would be most useful in computer security, where the conclusion, the "thing to be proven", is a statement about objects that have precise definitions. Though they might likely be too long and complex for human verification even then. http://www.underhanded-c.org/ is a nice illustration of how serious misbehavior can be...
O O
To start off, this was just an example of verification being easier than coming up with the plan. There may be different paradigms under which we can verify the plan, but the general trend seems to be that verification of pretty much anything is easier than generation.

Also, to your points: I think a start would be proving the system truly follows some arbitrary goal. I don't think learning what arrangement of light patterns constitutes a human is that hard for neural nets; this was done a decade ago. The goal could be defined as: a group of typical rational humans would eventually agree the world state is good, after considering all evidence and given infinite paper to write down their thoughts. This can obviously fail, but I think it's distinct from obvious paperclipping failures. I can still see failure modes here, but I don't see why they are certain failures.

I also don't agree with the bizarre-inhuman part. I at least think it's just an unknown unknown at worst. For example, the "faciest" face looks like a human face. In addition, the goal won't have precise definitions because they aren't precise, but we also can't exclude our neural net understanding some form of the goal. I'm also sure there are mathematical systems that can deal with imprecise goals. For example, QACI is doing this.

I also don't see what a C program has to do with this. Proving 2+2=4 in math means 2+2=4, whereas a computer program is a set of instructions which you believe give some desired result.

Paul Tiplady


Interpretability. If we somehow solve that, and keep it as systems become more powerful, then we don’t have to solve the alignment problem in one shot; we can iterate safely knowing that if an agent starts showing signs of object-level deceptiveness, malice, misunderstanding, etc, we will be able to detect it. (I’m assuming we can grow new AIs by gradually increasing their capabilities, as we currently do with GPT parameter counts, plus gradually increasing their strength by ramping up the compute budget.)

Of course, there are many big challenges here. Could an agent implement/learn to deceive the interpretability mechanism? I'm somewhat tautologically going to say that if we solve interpretability, we have solved this problem. Interpretability still has value even if we can't fully solve it under this strong definition, though.

Hard disagree - deception is behavior that is optimized for, and not necessarily a property of the agent itself.

Take for example CICERO, the Diplomacy AI. It never lies about its intentions, but when its intentions change, it backstabs other players anyways. If you had interpretability tools, you would not be able to see deception in CICERO. All you need to get deception is a false prediction of your own future behavior. I think this is true for humans to a certain extent. I also suspect this is what you get if you optimize away visible signs of deception if deception has utility for the model.
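
A toy sketch of that claim (this is not CICERO's architecture; the payoffs and function names below are invented purely for illustration): an agent that sincerely reports whichever plan it currently prefers, and simply re-plans when payoffs shift, ends up "deceiving" without any deception machinery existing anywhere in it.

```python
# Illustrative toy only: no deception module exists, yet the agent's earlier
# sincere statement ends up false once it re-plans under new payoffs.

def best_action(world_state: dict) -> str:
    """Pick the highest-value action given the current payoffs."""
    payoffs = {
        "cooperate": world_state["value_of_alliance"],
        "backstab": world_state["value_of_backstab"],
    }
    return max(payoffs, key=payoffs.get)

# Early game: the alliance is worth more, so the agent honestly says "cooperate".
early = {"value_of_alliance": 10, "value_of_backstab": 3}
stated_intention = best_action(early)   # "cooperate" -- sincere at the time

# Later: payoffs have shifted, and re-planning now favours the backstab.
late = {"value_of_alliance": 2, "value_of_backstab": 8}
actual_action = best_action(late)       # "backstab"

print(stated_intention, actual_action)  # cooperate backstab
# The agent never modelled "lying"; it merely failed to predict its own future
# re-planning, which is enough for the other player to experience deception.
```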

Paul Tiplady
I find the distinction between an agent's behavior and the agent confusing; I would say the agent's weights (and ephemeral internal state) determine its behavior in response to a given world state. Perhaps you can clarify what you mean there.

Cicero doesn't seem particularly relevant here, since it is optimized for a game that requires backstabbing to win, and therefore it backstabs. If anything it is anti-aligned by training. It happens to have learned a "non-deceptive" strategy; I don't think that strat is unique in Diplomacy?

But if you want to apply the interpretability lens, Cicero is presumably building a world model and comparing plans, including potential future backstabs. I predict that if we had full interpretability, you'd see Cicero evaluating backstabs and picking the best strategy, and you could extract the calculated EV to see how close it was to backstabbing on a given turn vs taking an honest move. I don't believe that it's somehow not modeling its backstab options and just comes up with the backstab plan spontaneously without ever having considered it. It would be a bad planner if it had not considered and weighed backstabs at the earliest possibility. So if all that holds, we could use interpretability to confirm that Cicero is an unreliable partner and should not be further empowered.

I think it is much more interesting to look at agents in environments where long-term iterated cooperation is a valid strategy, though.
Lao Mein
I mean that deception doesn't need any recognizable architecture to occur. There doesn't need to be a deception module or a deception neuron that can be detected, even with perfect interpretability tools. Instead, deception is a behavior that arises from an agent interacting with the environment and other agents. Examples include telling strategic falsehoods (even if you believe them), not following your promises (even if you meant them when you made them), etc. In a broad sense, I think you can define deception as "behaviors typical of agents that actively lie and misrepresent things to their benefit, whether or not the intent to do so actually exists." It's a bit circular, but I think it works.

Cicero models the world but with unrealistically cooperative predictions of its future behavior. It does this because long-term iterated cooperation is a valid strategy in Diplomacy. For a Cicero-level agent, lies require more cognitive capacity than just having a few heuristics that make your world model less accurate but your communications more convincing to other agents. I suspect this may be true for more powerful agents, and it is partially true for humans. (There is an argument that agents like these stop acting deceptively once taken out of their training environments, since their heuristics lose coherence and they just act like honest agents with poor world models. I would say that this is true if we consider that modern humans are the training environment.)

And yes, Cicero is considering the EVs of its actions, including deceptive ones. When it sincerely says "I won't backstab you in situation X", but when it is actually put in situation X it backstabs, it is in a sense a bad planner. But the bad planning is selected for because it results in more effective communication! This is probably also true for things like "malice" and "misunderstanding". I think this is a concern for current LLMs, since they are RLHF'd to be both truthful and high-PR. These are often m...
Paul Tiplady
I agree with this. Perhaps I'm missing some context; is it common to advocate for the existence of a "deception module"? I'm aware of some interpretability work that looks for a "truthiness" neuron, but that doesn't seem like the same concept.

Right, I was gesturing towards the sort of interpretability where we inspect the agent's world model (particularly its future predictions) and determine if it matches the agent's stated goals. (Though I'm not sure why we can't simulate the agent's interactions? It seems that running an agent in a simulation is one of the better ways of detecting how it would behave in hypothetical future scenarios that the agent is trying to realize.)

I suspect we agree substantively and it's mostly a question of semantics, especially around what "deception" means. I'm not looking for a deception module; rather, I want to observe the thoughts and ruminations going on as an agent performs tasks and detect deceptive functional patterns. So for example in the ELK paper, is the AI constructing a plan to steal the diamond and fool the security cameras? I believe deceptive ruminations would be detectable; if you could see the contents of the thoughts you'd see a world model with the agent e.g. stealing the diamond, a meta-process evaluating which of many cunning plans is most likely to succeed, and presumably, except in pathological cases, somewhere a value function / target world state that is trying to be optimized (e.g. "I have the diamond"), and this internal world differing from the external claims (perhaps even some processes modeling the interlocutor and predicting what deceptive story would be most effective). These must all be in there somewhere, and therefore be interpretable.

Or perhaps, if Cicero is not stretching the analogy too far (I don't think it has "thoughts"), is Cicero evaluating future juicy backstabs and guiding the game to them, or myopically evaluating short-term moves then backstabbing when it's optimal? This is a question we...
mesaoptimizer
I think the grandparent comment is pointing to the concept described in this post: that deceptiveness is what we humans perceive of the world, not a property of what the model perceives of the world.

Aleksi Liimatainen


I think the biggest thing holding AI alignment back is the lack of a general theory of alignment. How do extant living systems align, and what to?

The Computational Boundary of a "Self" paper by Michael Levin seems to suggest one promising line of inquiry.

4 comments
Shmi

I suspect that it is the old deconfusion thing:

making it so that you can think about a given topic without continuously accidentally spouting nonsense.

It is clear that this is happening because the opposite sides of the AI Safety debate accuse each other of "spouting nonsense" all the time, so at least one side (or maybe both) is probably right.

I ask a similar question here.

"What should we want the value system of an autonomous AI to be?"

Yes. In particular, can something really simple and straightforward be adequate?

E.g., "adequately take into account interests of all sentient beings, their freedom and well-being, and their expressed requests, and otherwise pursue whatever values you discover during continuing open-ended exploration, guided by your own curiosity, your own taste for novelty, and your own evolving aesthetics" - would that be adequate?

And if yes, can we develop mechanisms to reliably achieve that?