Joe Walker has a general conversation with Wolfram about his work and a range of other topics, but there are some remarks about AI alignment at the very end:

WALKER: Okay, interesting. So moving finally to AI, many people worry about unaligned artificial general intelligence, and I think it's a risk we should take seriously. But computational irreducibility must imply that a mathematical definition of alignment is impossible, right?

WOLFRAM: Yes. There isn't a mathematical definition of what we want AIs to be like. The minimal thing we might say about AIs, about their alignment, is: let's have them be like people are. And then people immediately say, "No, we don't want them to be like people. People have all kinds of problems. We want them to be like people aspire to be."

And at that point, you've fallen off the cliff. Because, what do people aspire to be? Well, different people aspire to be different things, and different cultures aspire in different ways. And I think the concept that there will be a perfect mathematical aspiration is just completely wrongheaded. It's just the wrong type of answer.

The question of how we should be is a question that is a reflection back on us. There is no "this is the way we should be" imposed by mathematics.

Humans have ethical beliefs that are a reflection of humanity. One of the things I realised recently is that part of what's confusing about ethics is that, if you're used to doing science, you say, "Well, I'm going to separate off a piece of the system. I'm going to study this particular subsystem. I'm going to figure out exactly what happens in the subsystem. Everything else is irrelevant."

But in ethics, you can never do that. So you imagine you're doing one of these trolley problem things. You've got to decide whether you're going to kill the three giraffes or the eighteen llamas. And which one is it going to be?

Well, then you realise that to really answer that question to the best ability of humanity, you're looking at the tentacles of the religious beliefs of the tribe in Africa that deals with giraffes, and the consequences for the llama whose wool went into this supply chain, and all this kind of thing.

In other words, one of the problems with ethics is it doesn't have the separability that we've been used to in science. In other words, it necessarily pulls in everything, and we don't get to say, "There's this micro ethics for this particular thing; we can solve ethics for this thing without the broader picture of ethics outside."

If you say, "I'm going to make this system of laws, and I'm going to make the system of constraints on AIs, and that means I know everything that's going to happen," well, no, you don't. There will always be an unexpected consequence. There will always be this thing that spurts out and isn't what you expected to have happen, because there's this irreducibility, this kind of inexorable computational process that you can't readily predict.

The idea that we're going to have a prescriptive collection of principles for AIs, and we're going to be able to say, "This is enough, that's everything we need to constrain the AIs in the way we want," it's just not going to happen that way. It just can't happen that way.

Something I've been thinking about recently is: so what the heck do we actually do? I was realising this. We have this connection to ChatGPT, for example, and I was thinking: now that it can write Wolfram Language code, I can actually run that code on my computer. And right there at the moment where I'm going to press the button that says, "Okay, LLM, whatever code you write, it's going to run on my computer," I'm like, "That's probably a bad idea," because, I don't know, it's going to log into all my accounts everywhere, and it's going to send emails, and it's going to tell people this or that thing, and the LLM is in control now.

And I realised that probably it needs some kind of constraints on this. But what constraints should they be? If I say, well, you can't do anything, you can't modify any file, then there's a lot of stuff that would be useful to me that you can't do.
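
As a concrete, if crude, illustration of the kind of constraint Wolfram is gesturing at here, the sketch below (Python, not his actual Wolfram Language setup) statically rejects LLM-generated code that imports anything outside a small allowlist before running it. The ALLOWED_IMPORTS set and the run_generated helper are hypothetical names invented for the example, and a static check like this is easy to bypass; a real deployment would also need OS-level sandboxing.

```python
# Minimal sketch: refuse to execute generated code whose imports fall
# outside a small allowlist. Illustrative only; not a complete sandbox.
import ast

ALLOWED_IMPORTS = {"math", "statistics", "json"}  # no os, subprocess, smtplib, ...

def import_violations(source: str) -> list[str]:
    """Return the disallowed top-level modules imported by the generated code."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [(node.module or "").split(".")[0]]
        else:
            continue
        problems.extend(name for name in names if name not in ALLOWED_IMPORTS)
    return problems

def run_generated(source: str) -> None:
    """Execute generated code only if the static check finds no violations."""
    bad = import_violations(source)
    if bad:
        raise PermissionError(f"refusing to run code that imports: {bad}")
    exec(compile(source, "<llm-generated>", "exec"), {})

# run_generated("import smtplib\n...")  # would be rejected before it runs
```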

So there is no set of golden principles that humanity agrees on that are what we aspire to. It's like, sorry, that just doesn't exist. That's not the nature of civilisation. It's not the nature of our society.

And so then the question is, what do you do when you don't have that? And my best current thought (in fact, I was just chatting about this with the person I was talking to before you) is to develop what are, let's say, a couple of hundred principles you might pick from.

One principle might be, I don't know: "An AI must always have an owner." "An AI must always do what its owner tells it to do." "An AI must, whatever."

Now you might say, an AI must always have an owner? Is that a principle we want? Is that a principle we don't want? Some people will pick differently.

But can you at least provide scaffolding for what might be the set of principles that you want? And then it's like: be careful what you wish for, because you make up these 200 principles or something, and then a few years later you see people with placards saying, "Don't do number 34" or something, and you realise, "Oh, my gosh, what did one set up?"

But I think one needs some kind of framework for thinking about these things, rather than just people saying, "Oh, we want AIs to be virtuous." Well, what the heck does that mean?

Or, "We have this one particular thing: we want AIs to not do this societally terrible thing right here, but we're blind to all this other stuff." None of that is going to work.

You have to have this formalisation of ethics that is such that you can actually pick; you can literally say, "I'm going to be running with number 23 and number 25, and not number 24," or something. But you've got to make that kind of framework.
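
One crude way to picture the scaffolding Wolfram describes is as a numbered catalogue of candidate principles plus a per-deployment record of which ones are adopted. The sketch below only illustrates that data structure: the Principle and Profile classes, the numbering, and the catalogue entries (taken from the example principles he mentions above) are invented here, not anything Wolfram has actually proposed.

```python
# Toy "pick your principles" scaffolding: a numbered catalogue and a
# profile recording which principles a given AI deployment runs with.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Principle:
    number: int
    text: str

CATALOGUE = {
    1: Principle(1, "An AI must always have an owner."),
    2: Principle(2, "An AI must always do what its owner tells it to do."),
    # ... up to a couple of hundred candidate principles ...
}

@dataclass
class Profile:
    """The subset of catalogue numbers a particular deployment opts into."""
    selected: set[int] = field(default_factory=set)

    def adopt(self, number: int) -> None:
        if number not in CATALOGUE:
            raise KeyError(f"no principle numbered {number} in the catalogue")
        self.selected.add(number)

    def describe(self) -> list[str]:
        return [CATALOGUE[n].text for n in sorted(self.selected)]

# "I'm going to be running with number 1 but not number 2", say:
profile = Profile()
profile.adopt(1)
print(profile.describe())
```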

dr_s:

I'd say these are sensible enough thoughts on the social and ethical aspects of alignment. But that's only one half of the problem; the other half is the technical side, which includes simply: "OK, we've decided on a few principles, now how the hell do we guarantee the AI actually sticks to them?"

OP, could you add the link to the podcast:

https://josephnoelwalker.com/148-stephen-wolfram/

Whoops! Sorry about that. Link added. There's lots of interesting stuff in the rest, including some remarks about talent, inventiveness, the academic world, and philanthropy. As you may know, Wolfram received a MacArthur Fellowship in the very first round of awards.

Many thanks.

One of the things I realised recently is that part of what's confusing about ethics is that, if you're used to doing science, you say, "Well, I'm going to separate off a piece of the system. I'm going to study this particular subsystem. I'm going to figure out exactly what happens in the subsystem. Everything else is irrelevant."

But in ethics, you can never do that.

That seems inaccurate. In social science and engineering there are usually countless variables which influence everything, but that doesn't prevent us from estimating expected values for different alternatives. Ethics appears to be very similar. Unpredictability is merely an epistemic problem in both cases.

Yeah. I don't think this actually makes ethics harder to study, but I wonder if he's getting at...

Unlike in experimental or applied science, in ethics you can't ever build a simple ethical scenario, because you can't isolate any part of the world from the judgement or interventionist drives of every single person's value system. Values inherently project themselves out onto the world; nothing really keeps them localized in their concerns.
If someone runs a brutal and unnecessary medical experiment on prisoners in an underground lab, it doesn't matter how many layers of concrete or Faraday shielding separate me from it: I still care about it, and a bunch of other people care in different ways. You can't isolate anything. The expected value calculation considers everything.

I agree with his diagnosis (related: The Control Problem: Unsolved or Unsolvable?), but then, in the solution part, he suggests a framework of the kind he has just condemned as a failure above.

I want to relate Wolfram's big complexity question to three frameworky approaches already in use.

Humans have ideas of rights and property that simplify the question "How do we want people to act?" to "Okay, what are we pretty sure we want people not to do?", and simplify that another step to "Okay, let's divide the world into non-intersecting spheres of control, one per person, say you can do what you want within your sphere, and only do things outside your sphere by mutual agreement with the person in charge of the other sphere." (And one thing that can be mutually agreed on is redrawing sphere boundaries between the people agreeing.)

These don't just simplify ethics as a curious side effect; both started as practical chunks of what we want people not to do, then evolved into customs and other forms of hardening. I guess they evolved to the point where they're common because they were simple enough.

The point I'm making relative to Wolfram is: (inventing ratios) 90% of the problem of ethics is simplified away with 10% of the effort, and it's an obvious first 10% of effort to duplicate.

And although they present simpler goals, they don't implement them.

Sometimes ethics isn't the question and game theory or economics is (to the extent those aren't all the same thing).  For example, for some reason there are large corporations that cater to millions of poor customers.

With computers there are attempts at security. Specifically, I want to mention the approach called object-capability security, because it's based on reifying rights and property in fine-grained, composable ways, and on building underlying systems that support, and if done right only support, rightful actions (in terms of the rights they reify).
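
For readers unfamiliar with it, the object-capability idea can be caricatured in a few lines: a component holds no ambient authority, only the capability objects it is explicitly handed, and a capability can be attenuated before being passed on. The toy below is not a real ocap system (and Python itself does not enforce the discipline, since any code can still call open()); the class and function names are made up for the example.

```python
# Toy object-capability illustration: authority is carried by objects
# that are explicitly passed, and can be weakened before delegation.
from pathlib import Path

class FileReadCap:
    """Capability to read one specific file, and nothing else."""
    def __init__(self, path: Path):
        self._path = path
    def read(self) -> str:
        return self._path.read_text()

class FileWriteCap(FileReadCap):
    """Capability to read and overwrite one specific file."""
    def write(self, text: str) -> None:
        self._path.write_text(text)

def attenuate(cap: FileWriteCap) -> FileReadCap:
    """Hand on a weaker, read-only version of a read/write capability."""
    return FileReadCap(cap._path)

def summarise(doc: FileReadCap) -> str:
    # This function can only touch the one file it was handed a capability
    # for; it was given no way to name or open anything else.
    return doc.read()[:80]

# The caller decides exactly which slice of authority to delegate:
# summarise(attenuate(FileWriteCap(Path("notes.txt"))))
```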

This paragraph is amateur alignment stuff: the problem of actually understanding how and why humans are good is vague, but my guess is it's more tractable than defining ethics in detail with all its ramifications. Both are barely touched, and we've been getting off easy. It's not clear that many moral philosophers will jump into high gear given that there have been no really shocking AI alignment disasters (which we survive to react to) so far. At this point I believe there's something to goodness, that there's something actually and detectably cool about interacting with (other) humans. It seems to provide a reason to get off one's butt at all. The value of it could be something that's visible when you have curiosity plus mumble. I.e. complex, but learnable given the right bootstrap. I don't know how to define whether someone has reconstructed the right bootstrap.

Returning to Wolfram: at this point it seems possible to me that whatever-good-is exists and that bootstrapping it is doable.

You can't isolate individual "atoms" in ethics, according to Wolfram. Let's put that to the test. Tell me if the following "ethical atoms" are right or wrong:

1. I will speak in a loud voice
2. …on a Monday
3. …in a public library
4. …where I've been invited to speak about my new book and I don't have a microphone.

Now, (1) seems morally permissible, and (2) doesn't change the evaluation. (3) does make my action seem morally impermissible, but (4) turns it around again. I'm sure all of this was very easy for everyone.

Ethics is the science of the a priori rules that make these judgments so easy for us; at least that was Kant's view, which I share. It should be possible to make an AI do this calculation even faster than we do, and all we have to do is to provide the AI with the right a priori rules. When that is done, the rest is just empirical knowledge about libraries and human beings, and we will eventually have a moral AI.
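
To make that example concrete, here is a toy sketch (in Python, and emphatically not a formalisation of Kant's actual rules): a single fixed rule is applied to an action together with whatever empirical context is supplied, and adding context flips the verdict exactly as in cases 1-4 above. The rule and the context keys are invented for illustration.

```python
# Toy rule evaluation: the verdict depends on the empirical context
# supplied alongside the action, matching the four cases above.
def permissible(action: str, context: dict) -> bool:
    # Invented rule: don't disturb people who reasonably expect quiet,
    # unless they have invited you to speak.
    if action != "speak loudly":
        return True
    if context.get("location") == "public library":
        return bool(context.get("invited_to_speak"))
    return True

print(permissible("speak loudly", {}))                                 # 1: True
print(permissible("speak loudly", {"day": "Monday"}))                  # 2: True
print(permissible("speak loudly", {"day": "Monday",
                                   "location": "public library"}))     # 3: False
print(permissible("speak loudly", {"location": "public library",
                                   "invited_to_speak": True}))         # 4: True
```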

all we have to do is to provide the AI with the right a priori rules

An optimistic view. Any idea how to figure out what they are?

I am a Kantian and believe that those a priori rules have already been discovered.

But my point here was merely that you can isolate the part that belongs to pure ethics from everything empirical, like, in my example, what a library is, why people go to libraries, what a microphone is and what its purpose is, and so on. What makes an action right or wrong at the most fundamental level, however, is independent of everything empirical and is simply an a priori rule.

I guess my broader point was also that Stephen Wolfram is far too pessimistic about the prospects of making a moral AI. A future AI may soon have a greater understanding of the world and the people in it than we do, and so all we have to do is provide the right a priori rule and we will be fine.

Of course, the technical issue still remains: how do we make the AI stick to that rule? But that is not an ethical problem; it's an engineering problem.

I am a Kantian and believe that those a priori rules have already been discovered

Does it boil down to the categorical imperative? Where is the best exposition of the rules, and the argument for them? 

Trying to 'solve' ethics by providing a list of features, as was done with the image recognition algorithms of yore, is doomed to failure. Recognizing the right thing to do, just like recognizing a cat, requires learning from millions of different examples encoded in giant inscrutable neural networks.

{compressed, some deletions}

Suppose you have at least one "foundational principle" A = [...words..] -> mapped to a token vector, say in binary = [0110110...] -> sent to the internal NN. The encoding and decoding processes are non-transparent in terms of attempting to 'train' on the principle A. If the system's internal weight matrices are already mostly constant, you can't add internal principles (and it's not clear you can even add them while the initially random weights are being non-randomized during training).