I'd say these are sensible enough thoughts on the social and ethical aspects of alignment. But that's only one half of the problem; the other half is the technical side, which includes the question: "OK, we've decided on a few principles, now how the hell do we guarantee the AI actually sticks to them?"
One of the things I realised recently is one of the things that's confusing about ethics is if you're used to doing science, you say, "Well, I'm going to separate a piece of the system," and I'm going to say, "I'm going to study this particular subsystem. I'm going to figure out exactly what happens in the subsystem. Everything else is irrelevant."
But in ethics, you can never do that.
That doesn't seem accurate. In social science and engineering there are usually countless variables influencing everything, but that doesn't prevent us from estimating expected values for different alternatives. Ethics appears to be very similar: unpredictability is merely an epistemic problem in both cases.
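To make that concrete, here's a minimal sketch (the probabilities and payoffs are invented purely for illustration) of ranking two alternatives by estimated expected value despite uncontrolled uncertainty:

```python
# Minimal illustration (made-up numbers): even with many uncontrolled
# variables, we can still rank alternatives by estimated expected value.

def expected_value(outcomes):
    """outcomes: list of (probability, value) pairs; probabilities sum to 1."""
    return sum(p * v for p, v in outcomes)

# Two hypothetical interventions with uncertain results.
option_a = [(0.6, 10.0), (0.3, 2.0), (0.1, -5.0)]   # mostly good, small downside risk
option_b = [(0.5, 15.0), (0.5, -8.0)]               # higher variance

print(expected_value(option_a))  # 6.1
print(expected_value(option_b))  # 3.5
# Unpredictability makes the estimates noisy, but it doesn't block the comparison.
```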
Yeah. I don't think this actually makes ethics harder to study, but I wonder if he's getting at...
Unlike in experimental or applied science, in ethics you can never build a simple, isolated scenario, because you can't shield any part of the world from the judgement or interventionist drives of every single person's value system. Values inherently project themselves out onto the world; nothing keeps them localized in their concerns.
If someone runs a brutal and unnecessary medical experiment on prisoners in an underground lab, it doesn't matter how many layers of concrete or Faraday shielding separate me from it: I still care about it, and a bunch of other people care in different ways. You can't isolate anything; the EV calculation considers everything.
I agree with his diagnosis (related: The Control Problem: Unsolved or Unsolvable?), but then, in the solution part, he suggests a framework that he has just condemned as a failure above.
I want to relate Wolfram's big complexity question to three frameworky approaches already in use.
Humans have ideas of rights and property that simplify the question "How do we want people to act?" to "Okay, what are we pretty sure we want people not to do?", and simplify that another step to "Okay, let's divide the world into non-intersecting spheres of control, one per person; say you can do what you want within your sphere, and only do things outside your sphere by mutual agreement with the person in charge of the other sphere. (And one thing that can be mutually agreed on is redrawing sphere boundaries between the people agreeing.)"
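Here's a toy sketch of that scheme in code; the class and rule names are my own invention, just to show how little machinery the "spheres of control" simplification needs:

```python
# Toy model of "non-intersecting spheres of control" (illustrative only).

class World:
    def __init__(self):
        self.owner_of = {}  # resource -> person; each resource sits in exactly one sphere

    def assign(self, resource, person):
        self.owner_of[resource] = person

    def may_act(self, actor, resource, consent_from=None):
        """Acting inside your own sphere is fine; outside it needs the owner's consent."""
        owner = self.owner_of[resource]
        return actor == owner or consent_from == owner

    def redraw_boundary(self, resource, old_owner, new_owner, consent_from):
        """Boundaries move only by mutual agreement of the people involved."""
        if self.owner_of[resource] == old_owner and consent_from == old_owner:
            self.owner_of[resource] = new_owner
            return True
        return False

world = World()
world.assign("garden", "alice")
print(world.may_act("alice", "garden"))                      # True: her own sphere
print(world.may_act("bob", "garden"))                        # False: no agreement
print(world.may_act("bob", "garden", consent_from="alice"))  # True: mutual agreement
print(world.redraw_boundary("garden", "alice", "bob", consent_from="alice"))  # True
```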
These don't just simplify ethics as a curious side-effect; both started as practical chunks of what we want people not to do, then evolved into customs and other forms of hardening. I guess they evolved to the point where they're common because they were simple enough.
The point I'm making relative to Wolfram is: (inventing ratios) 90% of the problem of ethics is simplified away with 10% of the effort, and it's an obvious first 10% of effort to duplicate.
And although they present simpler goals, they don't implement them.
Sometimes ethics isn't the question and game theory or economics is (to the extent those aren't all the same thing). For example, for some reason there are large corporations that cater to millions of poor customers.
With computers there are attempts at security. Specifically, I want to mention the approach called object-capability security, because it's based on reifying rights and property in fine-grained, composable ways, and on building underlying systems that support (and, if done right, only support) rightful actions, in the terms those systems are able to reify.
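A minimal sketch of the object-capability idea, with the caveat that Python can only approximate it (a real ocap language or OS enforces the encapsulation; the names and classes here are purely illustrative):

```python
# Object-capability sketch: holding a reference IS the right to use it.

class File:
    def __init__(self, contents):
        self._contents = contents

    def read(self):
        return self._contents

    def write(self, new_contents):
        self._contents = new_contents


def read_only(file):
    """Attenuation: build a facet that reifies only the right to read."""
    class ReadOnlyFacet:
        def read(self):
            return file.read()
    return ReadOnlyFacet()


diary = File("dear diary...")
facet = read_only(diary)

print(facet.read())             # allowed: exactly the capability we were handed
print(hasattr(facet, "write"))  # False: no write authority was ever granted
```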
This paragraph is amateur alignment stuff: The problem of actually understanding how and why humans are good is vague, but my guess is it's more tractable than defining ethics in detail with all its ramifications. Both are barely touched, and we've been getting off easy. It's not clear that many moral philosophers will jump into high gear given that there have been no really shocking AI alignment disasters (which we survive to react to) so far. At this point I believe there's something to goodness, that there's something actually and detectably cool about interacting with (other) humans. It seems to provide a reason to get off one's butt at all. The value of it could be something that's visible when you have curiosity plus mumble. I.e., complex, but learnable given the right bootstrap. I don't know how to define whether someone has reconstructed the right bootstrap.
Returning to Wolfram: at this point it seems possible to me that whatever-good-is exists and that bootstrapping it is doable.
You can’t isolate individual ”atoms” in ethics, according to Wolfram. Let’s put that to the test. Tell me if the following ”ethical atoms” are right or wrong:
1…I speak very loudly
2…on a Monday
3…in a public library
4…where I’ve been invited to speak about my new book and I don’t have a microphone.
Now, (1) seems morally permissible, and (2) doesn’t change the evaluation. (3) does make my action seem morally impermissible, but (4) turns it around again. I’m sure all of this was very easy for everyone.
Ethics is the science of the a priori rules that make these judgments so easy for us, or at least that was Kant’s view, which I share. It should be possible to make an AI do this calculation even faster than we do, and all we have to do is to provide the AI with the right a priori rules. When that is done, the rest is just empirical knowledge about libraries and human beings, and we will eventually have a moral AI.
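To make the structure of that claim concrete, here is a toy sketch; the rule below is my own crude stand-in, not Kant's actual formulation, but it shows the intended separation between one fixed a priori rule and interchangeable empirical facts:

```python
# Toy separation of a fixed "a priori" rule from interchangeable empirical facts.
# The rule is a crude placeholder, not the categorical imperative.

def permissible(facts):
    """A priori rule (toy): an act is impermissible iff it frustrates purposes
    that others present are entitled to pursue in that situation."""
    return not (facts["others_pursue_protected_purpose"]
                and facts["act_frustrates_that_purpose"])

# Empirical knowledge about libraries, Mondays, microphones, etc. only fills in the facts.
speaking_loudly = {
    "anywhere_on_a_monday": {"others_pursue_protected_purpose": False,
                             "act_frustrates_that_purpose": False},
    "in_a_quiet_library":   {"others_pursue_protected_purpose": True,   # people there to read
                             "act_frustrates_that_purpose": True},
    "invited_talk_no_mic":  {"others_pursue_protected_purpose": True,   # people there to listen
                             "act_frustrates_that_purpose": False},
}

for situation, facts in speaking_loudly.items():
    print(situation, "->", "permissible" if permissible(facts) else "impermissible")
```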
all we have to do is to provide the AI with the right a priori rules
An optimistic view. Any idea how to figure out what they are?
I am a Kantian and believe that those a priori rules have already been discovered.
But my point here was merely that you can isolate the part that belongs to pure ethics from everything empirical, like, in my example, what a library is, why people go to libraries, what a microphone is and what its purpose is, and so on. What makes an action right or wrong at the most fundamental level, however, is independent of everything empirical and is simply an a priori rule.
I guess my broader point was also that Stephen Wolfram is far too pessimistic about the prospects of making a moral AI. A future AI may soon have a greater understanding of the world and the people in it than we do, and so all we have to do is provide the right a priori rule and we will be fine.
Of course, the technical issue still remains: how do we make the AI stick to that rule? But that is not an ethical problem, it is an engineering problem.
I am a Kantian and believe that those a priori rules have already been discovered
Does it boil down to the categorical imperative? Where is the best exposition of the rules, and the argument for them?
Trying to 'solve' ethics by providing a list of features, as was done with the image recognition algorithms of yore, is doomed to failure. Recognizing the right thing to do, just like recognizing a cat, requires learning from millions of different examples encoded in giant inscrutable neural networks.
{compressed, some deletions}
Suppose you have at least one "foundational principle" A = [...words...], mapped to a token vector (say, in binary: [0110110...]) and sent to the internal NN. The encoding and decoding processes are non-transparent with respect to attempting to 'train' on principle A. If the system's internal weight matrices are already mostly constant, you can't add internal principles (and it's not clear you can add them even while the initially random weights are being de-randomized during training).
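A toy numpy sketch of that worry (every value here is made up): the principle enters only as input tokens, so a frozen network can read it without any of its weights, i.e. any of its internal "principles", changing:

```python
import numpy as np

# Toy illustration: a "principle" is just tokens at the input; passing it
# through a frozen network leaves the weight matrices untouched.

rng = np.random.default_rng(0)

vocab = {"always": 0, "tell": 1, "the": 2, "truth": 3}
W = rng.standard_normal((8, len(vocab)))   # frozen weight matrix after training

def encode(words):
    """Map the principle's words to one-hot token vectors."""
    x = np.zeros((len(vocab), len(words)))
    for i, w in enumerate(words):
        x[vocab[w], i] = 1.0
    return x

def forward(tokens):
    return np.tanh(W @ tokens)             # opaque internal representation

principle_A = ["always", "tell", "the", "truth"]
before = W.copy()
_ = forward(encode(principle_A))           # the model merely *reads* the principle

print(np.array_equal(W, before))           # True: nothing was added internally
```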
Joe Walker has a general conversation with Wolfram about his work and things and stuff, but there are some remarks about AI alignment at the very end: