All of Tor Økland Barstad's Comments + Replies

As far as I understand, MIRI did not assume that we're just able to give the AI a utility function directly. 

I'm a bit unsure about how to interpret you here.

In my original comment, I used terms such as positive/optimistic assumptions and simplifying assumptions. When doing that, I meant to refer to simplifying assumptions that were made so as to abstract away some parts of the problem.

The Risks from Learned Optimization paper was written mainly by people from MIRI!

Good point (I should have written my comment in such a way that pointing out this didn'... (read more)

3niplav
Yeah, I find it difficult to figure out how to look at this. A lot of MIRI discussion focused on their decision theory work, but I think that's just not that important. Tiling agents e.g. was more about constructing or theorizing about agents that may have access to their own values, in a highly idealized setting about logic.

Thanks for the reply :) I'll try to convey some of my thinking, but I don't expect great success. I'm working on more digestible explainers, but this is a work in progress, and I have nothing good that I can point people to as of now.

(...) part of the explanation here might be "if the world is solved by AI, we do actually think it will probably be via doing some concrete action in the world (e.g., build nanotech), not via helping with alignment (...)

Yeah, I guess this is where a lot of the differences in our perspective are located.

if the world is solved b

... (read more)

Thanks for the reply :) Feel free to reply further if you want, but I hope you don't feel obliged to do so[1].

"Fill the cauldron" examples are (...) not examples where it has the wrong beliefs.

I have never ever been confused about that!

It's "even simple small-scale tasks are unnatural, in the sense that it's hard to define a coherent preference ordering over world-states such that maximizing it completes the task and has no serious negative impact; and there isn't an obvious patch that overcomes the unnaturalness or otherwise makes it predictably easier to

... (read more)
2Rob Bensinger
In retrospect I think we should have been more explicit about the importance of inner alignment; I think that we didn't do that in our introduction to corrigibility because it wasn't necessary for illustrating the problem and where we'd run into roadblocks.

Maybe a missing piece here is some explanation of why having a formal understanding of corrigibility might be helpful for actually training corrigibility into a system? (Helpful at all, even if it's not sufficient on its own.)

Aside from "concreteness can help make the example easier to think about when you're new to the topic", part of the explanation here might be "if the world is solved by AI, we do actually think it will probably be via doing some concrete action in the world (e.g., build nanotech), not via helping with alignment or building a system that only outputs English-language sentences".

I mean, I think utility functions are an extremely useful and basic abstraction. I think it's a lot harder to think about a lot of AI topics without invoking ideas like 'this AI thinks outcome X is better than outcome Y', or 'this AI's preferences come with different weights, which can't purely be reduced to what the AI believes'.

Your reply here says much of what I would expect it to say (and much of it aligns with my impression of things). But why you focused so much on "fill the cauldron" type examples is something I'm a bit confused by (if I remember correctly I was confused by this in 2016 also).

"Fill the cauldron" examples are examples where the cauldron-filler has the wrong utility function, not examples where it has the wrong beliefs. E.g., this is explicit in https://intelligence.org/2016/12/28/ai-alignment-why-its-hard-and-where-to-start/ 

The idea of the "fill the cauldron" examples isn't "the AI is bad at NLP and therefore doesn't understand what we mean when we say 'fill', 'cauldron', etc." It's "even simple small-scale tasks are unnatural, in the sense that it's hard to define a coherent preference ordering over world-states such that... (read more)
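To make the "wrong utility function, not wrong beliefs" distinction concrete, here is a toy sketch of my own (hypothetical, and not anything from MIRI's writings): a planner with a perfectly accurate world-model still picks the destructive plan, because the preference ordering we wrote down only mentions the cauldron.

```python
# Toy illustration (hypothetical, not MIRI's formalism): an agent with accurate
# beliefs whose objective only mentions the cauldron.

# Each plan: (probability the cauldron ends up full, does the workshop flood?)
PLANS = {
    "carry buckets by hand":      (0.99, False),
    "redirect the river indoors": (1.00, True),   # floods the workshop
}

def expected_utility(plan: str) -> float:
    p_full, _floods = PLANS[plan]
    return p_full * 1.0   # utility is 1 iff the cauldron is full; flooding costs nothing

best = max(PLANS, key=expected_utility)
print(best)  # -> "redirect the river indoors"
# The failure is not a belief error (the model above is accurate); it's that the
# preference ordering we wrote down says nothing about side effects.
```

And the "no obvious patch" part: appending "...and don't flood the workshop" to the objective just moves the omission to whichever side effect we forgot to list next.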

This tweet from Eliezer seems relevant btw. I would give similar answers to all of the questions he lists that relate to nanotechnology (but I'd be somewhat more hedged/guarded - e.g. replacing "YES" with "PROBABLY" for some of them).

 

Thanks for engaging

Likewise  :)

Also, sorry about the length of this reply. As the adage goes: "If I had more time, I would have written a shorter letter."

From my perspective you seem simply very optimistic on what kind of data can be extracted from unspecific measurements.

That seems to be one of the relevant differences between us. Although I don't think it is the only difference that causes us to see things differently.

Other differences (I guess some of these overlap):

  • It seems I have higher error-bars than you on the question we are discussing now. Y
... (read more)
1Tor Økland Barstad
This tweet from Eliezer seems relevant btw. I would give similar answers to all of the questions he lists that relate to nanotechnology (but I'd be somewhat more hedged/guarded - e.g. replacing "YES" with "PROBABLY" for some of them).  

I suspect my own intuitions regarding this kind of thing are similar to Eliezer's. It's possible that my intuitions are wrong, but I'll try to share some thoughts.

It seems that we think quite differently when it comes to this, and probably it's not easy for us to achieve mutual understanding. But even if all we do here is to scratch the surface, that may still be worthwhile.

As mentioned, maybe my intuitions are wrong. But maybe your intuitions are wrong (or maybe both). I think a desirable property of plans/strategies for alignment would be robustness to e... (read more)

1dankrad
Thanks for engaging with my post. From my perspective you seem simply very optimistic on what kind of data can be extracted from unspecific measurements. Here is another good example on how Eliezer makes some pretty out there claims about what might be possible to infer from very little data: https://www.lesswrong.com/posts/ALsuxpdqeTXwgEJeZ/could-a-superintelligence-deduce-general-relativity-from-a -- I wonder what your intuition says about this?

Generally it is a good idea to be robust with plans. However, in this specific instance, the way Eliezer phrases it, any iterative plan for alignment would be excluded. Since I also believe that this is the only realistic plan (there will simply never be a design that has the properties that Eliezer thinks guarantee alignment), the only realistic remaining path would be a permanent freeze (which I actually believe comes with large risks as well: unenforceability and thus worse actors making ASI first, biotech in the wrong hands becoming a larger threat to humanity, etc.).

What I would agree to is that it is good to plan for the eventuality that a lot less data could be needed by an ASI to do something like "create nanobots". For example, we could conclude that it's for now simply a bad idea if AI is used in biotech labs, because these are the places where it could easily gather a lot of data and maybe even influence experiments so that they let it learn the things it needs to create nanobots. Similarly, we could try to create worldwide warning systems around technologies that seem likely to be necessary for an AI takeover, and watch these closely, so that we would notice any specific experiments.

However, there is no way to scale this to a one-shot scenario. His claim is that an ASI will order some DNA and get some scientists in a lab to mix it together with some substances and create nanobots. That is what I describe as a one-shot scenario. Even if it were 10,000 shots in parallel I simply don't think it is possible,

None of these are what you describe, but here are some places people can be pointed to:

AGI-assisted alignment in Dath Ilan (excerpt from here)

Suppose Dath Ilan got into a situation where they had to choose the strategy of AGI-assisted alignment, and didn't have more than a few years to prepare. Dath Ilan wouldn't actually get themselves into such a situation, but if they did, how might they go about it?

I suspect that among other things they would:

  • Make damn well sure to box the AGIs before they plausibly could become dangerous/powerful.
  • Try, insofar as they could, to make their methodologies robust to hardware exploits (rowhammer, etc). Not only
... (read more)

This is from What if Debate and Factored Cognition had a mutated baby? (a post I started on, but I ended up disregarding this draft and starting anew). This is just an excerpt from the intro/summary (it's not the entire half-finished draft).


Tweet-length summary-attempts

Resembles Debate, but:

  • Higher alignment-tax (probably)
  • More "proof-like" argumentation
  • Argumentation can be more extensive
  • There would be more mechanisms for trying to robustly separate out "good" human evaluations (and testing if we succeeded)

We'd have separate systems that (among oth

... (read more)

Below are some concepts related to extracting aligned capabilities. The main goal is to be able to verify specialized functions without having humans look at the source code, and without us being able to safely/robustly score outputs for the full range of inputs (a toy sketch follows below the list).

Some things we need:

  • We need AIs that act in such a way as to maximize score
  • There needs to be some range of the inputs that we can test
  • There needs to be ways of obtaining/calculating the output we want that are at least somewhat general

An example of an aligned capability we might want woul... (read more)
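As a toy sketch of the testing side of the list above (my own hypothetical framing, with made-up names): we only trust ourselves to score outputs on a restricted, testable sub-range of inputs, and we score an AI-proposed function - treated as a black box - by its agreement with a trusted-but-limited reference on that sub-range.

```python
# Toy sketch (hypothetical framing, not a full proposal): score an AI-proposed
# black-box function by agreement with a trusted-but-limited reference on the
# inputs we are able to test.

def trusted_reference(x: int) -> int:
    """A slow/limited way of obtaining the answer that we do trust,
    but which only covers a restricted range of inputs."""
    return x * (x + 1) // 2          # e.g. a known closed form

def ai_proposed(x: int) -> int:
    """Treated as a black box: we never look at how it works."""
    return sum(range(x + 1))

TESTABLE_RANGE = range(0, 10_000)    # the sub-range we can safely check

def score(candidate) -> float:
    """Fraction of testable inputs on which the candidate matches the reference.
    An AI maximizing this score is pushed toward matching on the testable range;
    what that buys us outside that range is the part this sketch leaves out."""
    hits = sum(candidate(x) == trusted_reference(x) for x in TESTABLE_RANGE)
    return hits / len(TESTABLE_RANGE)

print(score(ai_proposed))            # 1.0 here, but only over the tested range
```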

I would also like to see more work where people make less positive/optimistic assumptions. I think of it as a good thing that different approaches to alignment are being explored, and would like to see more of that in general (both in terms of breadth and depth).

I guess there are many possible ways of trying to categorize/conceptualize approaches to alignment theorizing. One is by asking "when talking/thinking about the methodology, what capabilities are assumed to be in place?".

I'm not sure about this, but unless I'm mistaken[1], a good amount of the work... (read more)

5niplav
As far as I understand, MIRI did not assume that we're just able to give the AI a utility function directly. The Risks from Learned Optimization paper was written mainly by people from MIRI! Other things like Ontological Crises and Low Impact sort of assume you can get some info into the values of an agent, and Logical Induction was more about how to construct systems that satisfy some properties in their cognition.

If humans (...) machine could too.

From my point of view, humans are machines (even if not typical machines). Or, well, some will say that by definition we are not - but that's not so important really ("machine" is just a word). We are physical systems with certain mental properties, and therefore we are existence proofs of physical systems with those certain mental properties being possible.

machine can have any level of intelligence, humans are in a quite narrow spectrum

True. Although if I myself somehow could work/think a million times faster, I think I'd... (read more)

Why call it an assumption at all?

Partly because I was worried about follow-up comments that were kind of like "so you say you can prove it - well, why aren't you doing it then?".

And partly because I don't make a strict distinction between "things I assume" and "things I have convinced myself of, or proved to myself, based on things I assume". I do see there as sort of being a distinction along such lines, but I see it as blurry.

Something that is derivable from axioms is usually called a theorem.

If I am to be nitpicky, maybe you meant "derived" and not "der... (read more)

1Donatas Lučiūnas
As I understand you try to prove your point by analogy with humans. If humans can pursue somewhat any goal, machine could too. But while we agree that machine can have any level of intelligence, humans are in a quite narrow spectrum. Therefore your reasoning by analogy is invalid.

(...) if it's supported by argument or evidence, but if it is, then it's no mere assumption.

I do think it is supported by arguments/reasoning, so I don't think of it as an "axiomatic" assumption. 

A follow-up to that (not from you specifically) might be "what arguments?". And - well, I think I pointed to some of my reasoning in various comments (some of them under deleted posts). Maybe I could have explained my thinking/perspective better (even if I wouldn't be able to explain it in a way that's universally compelling 🙃). But it's not a trivial task t... (read more)

1TAG
Why call it an assumption at all? Something that is derivable from axioms is usually called a theorem.

I cannot help you to be less wrong if you categorically rely on intuition about what is possible and what is not.

I wish I had something better to base my beliefs on than my intuitions, but I do not. My belief in modus ponens, my belief that 1+1=2, my belief that me observing gravity in the past makes me likely to observe it in the future, my belief that if views are in logical contradiction they cannot both be true - all this is (the way I think of it) grounded in intuition.

Some of my intuitions I regard as much more strong/robust than others. 

When my... (read more)

Like with many comments/questions from you, answering this question properly would require a lot of unpacking. Although I'm sure that also is true of many questions that I ask, as it is hard to avoid (we all have limited communication bandwidth) :)

In this last comment, you use the term "science" in a very different way from how I'd use it (like you sometimes also do with other words, such as for example "logic"). So if I was to give a proper answer I'd need to try to guess what you mean, make it clear how I interpret what you say, and so on (not just answe... (read more)

-5Donatas Lučiūnas

It seems that 2 + 2 = 4 is also an assumption for you.

Yes (albeit a very reasonable one).

Not believing (some version of) that claim would typically make minds/AGIs less "capable", and I would expect more or less all AGIs to hold (some version of) that "belief" in practice.

I don't think it is possible to find consensus if we do not follow the same rules of logic.

Here are examples of what I would regard to be rules of logic: https://en.wikipedia.org/wiki/List_of_rules_of_inference (the ones listed here don't encapsulate all of the rules of inference tha... (read more)

-3Donatas Lučiūnas
So this is where we disagree. That's how hypothesis testing works in science:

  1. You create a hypothesis
  2. You find a way to test if it is wrong
      1. You reject hypothesis if the test passes
  3. You find a way to test if it is right
      1. You approve hypothesis if the test passes

While hypothesis is not rejected nor approved it is considered possible. Don't you agree?

I do have arguments for that, and I have already mentioned some of them earlier in our discussion (you may not share that assessment, despite us being relatively close in mind-space compared to most possible minds, but oh well).

Some of the more relevant comments from me are on one of the posts that you deleted.

As I mention here, I think I'll try to round off this discussion. (Edit: I had a malformed/misleading sentence in that comment that should be fixed now.)

Every assumption is incorrect unless there is evidence. 

Got any evidence for that assumption? 🙃

Answer to all of them is yes. What is your explanation here?

Well, I don't always "agree"[1] with ChatGPT, but I agree in regards to those specific questions.

...

I saw a post where you wanted people to explain their disagreement, and I felt inclined to do so :) But it seems now that neither of us feel like we are making much progress.

Anyway, from my perspective much of your thinking here is very misguided. But not more misguided than e.g. "proofs" for Go... (read more)

3Donatas Lučiūnas
That's basic logic, Hitchens's razor. It seems that 2 + 2 = 4 is also an assumption for you. What isn't then? I don't think it is possible to find consensus if we do not follow the same rules of logic. Considering your impression about me, I'm truly grateful for your patience. Best wishes from my side as well :) But on the other hand I am certain that you are mistaken and I feel that you do not provide me a way to show that to you.

Do you think you can deny existence of an outcome with infinite utility? 

To me, according to my preferences/goals/inclinations, there are conceivable outcomes with infinite utility/disutility.

But I think it is possible (and feasible) for a program/mind to be extremely capable, and affect the world, and not "care" about infinite outcomes.

The fact that things "break down" is not a valid argument.

I guess that depends on what's being discussed. Like, it is something to take into account/consideration if you want to prove something while referencing utility-functions that reference infinities.

1Donatas Lučiūnas
As I understand you do not agree with the quoted claim from Pascal's Mugging, not with me. Do you have any arguments for that?

About universally compelling arguments?

First, a disclaimer: I do think there are "beliefs" that most intelligent/capable minds will have in practice. E.g. I suspect most will use something like modus ponens, most will update beliefs in accordance with statistical evidence in certain ways, etc. I think it's possible for a mind to be intelligent/capable without strictly adhering to those things, but for sure I think there will be a correlation in practice for many "beliefs".

Questions I ask myself are:

  • Would it be impossible (in theory) to wire together a mind
... (read more)

With all the interactions we had, I've got an impression that you are more willing to repeat what you've heard somewhere instead of thinking logically.

Some things I've explained in my own words. In other cases, where someone else already has explained something well, I've shared a URL to that explanation.

more willing to repeat what you've heard somewhere instead of thinking logically

This seems to support my hypothesis of you "being so confident that we are the ones who "don't get it" that it's not worth it to more carefully read the posts that are l... (read more)

1TAG
It's correct if it's supported by argument or evidence, but if it is, then it's no mere assumption. It's not supposed to be an assumption, it is supposed, by Rationalists, to be a proven theorem.
1Donatas Lučiūnas
I don't agree. Every assumption is incorrect unless there is evidence. Could you share any evidence for this assumption? If you ask ChatGPT

  • is it possible that chemical elements exist that we do not know
  • is it possible that fundamental particles exist that we do not know
  • is it possible that physical forces exist that we do not know

Answer to all of them is yes. What is your explanation here?

What about "I think therefore I am"? Isn't it universally compelling argument?

Not even among the tiny tiny section of mind-space occupied by human minds: 

Notice also that "I think therefore I am" is an is-statement (not an ought-statement / something a physical system optimizes towards).

As to me personally, I don't disagree that I exist, but I see it as a fairly vague/ill-defined statement. And it's not a logical necessity, even if we presume assumptions that most humans would share. Another logical possibility would be Boltzmann brains (unless a Bolt... (read more)

1Donatas Lučiūnas
What information would change your opinion?

Agreed (more or less). I have pointed him to this post earlier. He has given no signs so far of comprehending it, or even reading it and trying to understand what is being communicated to him.

I'm saying this more directly than I usually would @Donatas, since you seem insistent on clarifying a disagreement/misunderstanding you think is important for the world, while it seems (as far as I can see) that you're not comprehending all that is communicated to you (maybe due to being so confident that we are the ones who "don't get it" that it's not worth it to mo... (read more)

-3Donatas Lučiūnas
Dear Tom, the feeling is mutual. With all the interactions we had, I've got an impression that you are more willing to repeat what you've heard somewhere instead of thinking logically. "Universally compelling arguments are not possible" is an assumption. While "universally compelling argument is possible" is not. Because we don't know what we don't know. We can call it crux of our disagreement and I think that my stance is more rational.

He didn't say that "infinite value" is logically impossible. He described it as an assumption.

When saying "is possible", I'm not sure if he meant "is possible (conceptually)" or "is possible (according to the ontology/optimization-criteria of any given agent)". I think the latter would be most sensible.

He later said: "I think initially specifying premises such as these more precisely initially ensures the reasoning from there is consistent/valid.". Not sure if I interpreted him correctly, but I saw it largely as an encouragement to think more explicitly abou... (read more)

1Donatas Lučiūnas
Do you think you can deny existence of an outcome with infinite utility? The fact that things "break down" is not a valid argument. If you cannot deny - it's possible. And if it's possible - alignment impossible.

Same traits that make us intelligent (ability to logically reason), make us power seekers.

Well, I do think the two are connected/correlated. And arguments relating to instrumental convergence are a big part of why I take AI risk seriously. But I don't think strong abilities in logical reasoning necessitates power-seeking "on its own".

I think it is wrong to consider Pascal's mugging a vulnerability.

For the record, I don't think I used the word "vulnerability", but maybe I phrased myself in a way that implied me thinking of things that way. And maybe I also ... (read more)

1Donatas Lučiūnas
Sorry, but it seems to me that you are stuck with AGI analogy to humans without a reason. Many times human behavior does not correlate with AGI: humans do mass suicides, humans have phobias, humans take great risks for fun, etc. In other words - humans do not seek to be as rational as possible. I agree that being skeptical towards Pascal's Wager is reasonable, because there is much evidence that God is fictional. But this is not the case with "an outcome with infinite utility may exist", there is just logic here, no hidden agenda, this is as fundamental as "I think therefore I am". Nothing is more rational than complying with this. Don't you think?

Most humans are not obedient/subservient to others (at least not maximally so). But also: Most humans would not exterminate the rest of humanity if given the power to do so. I think many humans, if they became a "singleton", would want to avoid killing other humans. Some would also be inclined to make the world a good place to live for everyone (not just other humans, but other sentient beings as well).

From my perspective, the example of humans was intended as "existence proof". I expect AGIs we develop to be quite different from ourselves. I wouldn't be i... (read more)

-4Donatas Lučiūnas
But it is doomed, the proof is above. The only way to control AGI is to contain it. We need to ensure that we run AGI in fully isolated simulations and gather insights with the assumption that the AGI will try to seek power in simulated environment. I feel that you don't find my words convincing, maybe I'll find a better way to articulate my proof. Until then I want to contribute as much as I can to safety.

I'd argue that the only reason you do not comply with Pascal's mugging is because you don't have unavoidable urge to be rational, which is not going to be the case with AGI.

I'd agree that among superhuman AGIs that we are likely to make, most would probably be prone towards rationality/consistency/"optimization" in ways I'm not.

I think there are self-consistent/"optimizing" ways to think/act that wouldn't make minds prone to Pascal's muggings.

For example, I don't think there is anything logically inconsistent about e.g. trying to act so as to maximize the ... (read more)

1Donatas Lučiūnas
One more thought. I think it is wrong to consider Pascal's mugging a vulnerability. Dealing with unknown probabilities has its utility:

  • Investments with high risk and high ROI
  • Experiments
  • Safety (eliminate threats before they happen)

Same traits that make us intelligent (ability to logically reason), make us power seekers. And this is going to be the same with AGI, just much more effective.

Hopefully I'm wrong, please help me find a mistake.

There is more than just one mistake here IMO, and I'm not going to try to list them.

Just the title alone ("AGI is uncontrollable, alignment is impossible") is totally misguided IMO. It would, among other things, imply that brain emulations are impossible (humans can be regarded as a sort of AGI, and it's not impossible for humans to be aligned).

But oh well. I'm sure your perspectives here are earnestly held / it's how you currently see things. And there are no "perfect" procedures for evaluating how much t... (read more)

1Donatas Lučiūnas
Thanks for feedback. I don't think analogy with humans is reliable. But for the sake of argument I'd like to highlight that corporations and countries are mostly limited by their power, not by alignment. Usually countries declare independence once they are able to.

If an outcome with infinite utility is presented, then it doesn't matter how small its probability is: all actions which lead to that outcome will have to dominate the agent's behavior.

My perspective would probably be more similar to yours (maybe still with substantial differences) if I had the following assumptions:

  1. All agents have a utility-function (or act indistinguishably from agents that do)
  2. All agents where #1 is the case act in a pure/straight-forward way to maximize that utility-function (not e.g. discounting infinities)
  3. All agents where #1 is the ca
... (read more)
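As a toy illustration of what I mean by not maximizing in a "pure/straight-forward" way (a sketch of my own making, and only one of several possible self-consistent constructions): an expected-utility calculation that bounds raw utilities before weighing them, so that a tiny-probability claim of an astronomically large payoff no longer dominates.

```python
# Toy sketch (my own illustration): an agent that maximizes expected utility but
# bounds raw utilities first, so claims of astronomically large payoffs stop
# dominating its decisions.

import math

UTILITY_CAP = 100.0

def bounded_utility(raw: float) -> float:
    # Squash raw utility into (-CAP, CAP); many other bounding schemes would do.
    return UTILITY_CAP * math.tanh(raw / UTILITY_CAP)

def expected_bounded_utility(outcomes) -> float:
    # outcomes: list of (probability, raw_utility) pairs
    return sum(p * bounded_utility(u) for p, u in outcomes)

# "Pay the mugger": tiny chance of a claimed astronomically large payoff,
# near-certain small loss.
pay_mugger = [(1e-9, 1e30), (1 - 1e-9, -10.0)]
# "Refuse": keep your $10 with certainty.
refuse = [(1.0, 10.0)]

print(expected_bounded_utility(pay_mugger))  # ~ -10: the capped payoff can't dominate
print(expected_bounded_utility(refuse))      # ~ +10
```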
1Donatas Lučiūnas
I'd argue that the only reason you do not comply with Pascal's mugging is because you don't have unavoidable urge to be rational, which is not going to be the case with AGI. Thanks for your input, it will take some time for me to process it.

It seems that you do not recognize https://www.lesswrong.com/tag/pascal-s-mugging .

Not sure what you mean by "recognize". I am familiar with the concept.

But to be honest most of statements that we can think of may be true and unknowable, for example "aliens exist", "huge threats exist", etc.

"huge threat" is a statement that is loaded with assumptions that not all minds/AIs/agents will share.

Can you prove that there cannot be any unknowable true statement that could be used for Pascal's mugging?

Used for Pascal's mugging against who? (Humans? Coffee machine... (read more)

1Donatas Lučiūnas
OK, let me rephrase my question. There is a phrase in Pascal's Mugging: "If an outcome with infinite utility is presented, then it doesn't matter how small its probability is: all actions which lead to that outcome will have to dominate the agent's behavior." I think that Orthogonality thesis is right only if an agent is certain that an outcome with infinite utility does not exist. And I argue that an agent cannot be certain of that. Do you agree?

Fitch's paradox of knowability and Gödel's incompleteness theorems prove that there may be true statements that are unknowable. 

Independently of Gödel's incompleteness theorems (which I have heard of) and Fitch's paradox of knowability (which I had not heard of), I do agree that there can be true statements that are unknown/unknowable (including relatively "simple" ones) 🙂

For example "rational goal exists" may be true and unknowable. Therefore "rational goal may exist" is true. (...) Do you agree?

I don't think it follows from "there may be statements... (read more)

1Donatas Lučiūnas
I agree that not any statement may be true and unknowable. But to be honest most of statements that we can think of may be true and unknowable, for example "aliens exist", "huge threats exist", etc. It seems that you do not recognize https://www.lesswrong.com/tag/pascal-s-mugging . Can you prove that there cannot be any unknowable true statement that could be used for Pascal's mugging? Because that's necessary if you want to prove Orthogonality thesis is right.

Why do you think your starting point is better?

I guess there are different possible interpretations of "better". I think it would be possible for software-programs to be much more mentally capable than me across most/all dimensions, and still not have "starting points" that I would consider "good" (for various interpretations of "good").

As I understand you assume different starting-point.

I'm not sure. Like, it's not as if I don't have beliefs or assumptions or guesses relating to AIs. But I think I probably make less general/universal assumptions that I'd ... (read more)

1Donatas Lučiūnas
Fitch's paradox of knowability and Gödel's incompleteness theorems prove that there may be true statements that are unknowable. For example "rational goal exists" may be true and unknowable. Therefore "rational goal may exist" is true. Therefore it is not an assumption. Do you agree?

In my opinion the optimal behavior is

Not sure what you mean by "optimal behavior". I think I can see how the things make sense if the starting point is that there are these things called "goals", and (I, the mind/agent) am motivated to optimize for "goals". But I don't assume this as an obvious/universal starting-point (be that for minds in general, extremely intelligent minds in general, minds in general that are very capable and might have a big influence on the universe, etc).

This is a common mistake to assume, that if you don't know your goal, then it do

... (read more)
1Donatas Lučiūnas
Thanks again. As I understand you assume different starting-point. Why do you think your starting point is better?

I assume you mean "provide definitions"

More or less / close enough 🙂

Agent - https://www.lesswrong.com/tag/agent

Here they write: "A rational agent is an entity which has a utility function, forms beliefs about its environment, evaluates the consequences of possible actions, and then takes the action which maximizes its utility."

I would not share that definition, and I don't think most other people commenting on this post would either (I know there is some irony to that, given that it's the definition given on the LessWrong wiki). 

Often the words/conce... (read more)

No. That's exactly the point I try to make by saying "Orthogonality Thesis is wrong".

Thanks for the clarification 🙂

"There is no rational goal" is an assumption in Orthogonality thesis

I suspect arriving at such a conclusion may result from thinking of utility maximizers as more of a "platonic" concept, as opposed to thinking of it from a more mechanistic angle. (Maybe I'm being too vague here, but it's an attempt to briefly summarize some of my intuitions into words.)

I'm not sure what you would mean by "rational". Would computer programs need to be "rationa... (read more)

1Donatas Lučiūnas
Thanks, I am learning your perspective. And what is your opinion to this?

why would you assume that agent does not care about future states? Do you have a proof for that?

Would you be able to Taboo Your Words for "agent", "care" and "future states"? If I were to explain my reasons for disagreement it would be helpful to have a better idea of what you mean by those terms.

1Donatas Lučiūnas
I assume you mean "provide definitions":

  • Agent - https://www.lesswrong.com/tag/agent
  • Care - https://www.lesswrong.com/tag/preference
  • Future states - numeric value of agent's utility function in the future

Does it make sense?

Hi, I didn't downvote, but below are some thoughts from me 🙂

Some of my comment may be pointing out things you already agree with / are aware of. 

I'd like to highlight, that this proof does not make any assumptions, it is based on first principles (statements that are self-evident truths).

First principles are assumptions. So if first principles are built in, then it's not true that it doesn't make assumptions.

I do not know my goal (...) I may have a goal

This seems to imply that the agent should have as a starting-point that is (something akin to) "I s... (read more)

1Donatas Lučiūnas
No. That's exactly the point I try to make by saying "Orthogonality Thesis is wrong". Thank you for your insights and especially thank you for not burning my karma 😅 I see a couple of ideas that I disagree with, but if you are OK with that I'd suggest we go forward step by step. First, what is your opinion about this comment?

Here is my attempt at a shorter answer (although it didn’t end up as short as planned) 🙂

I’m also being more simplistic here (at times deliberately so), in the hope of making “core” concepts digestible with less effort.

If you don’t respond here you probably won’t hear from me in a while.

It can, sure, but how can a human get it to state those regularities (...)?

Score-functions would score argument-step-networks. It is these score-functions that would leverage regularities for when human evaluations are “good”/correct.

Here are some things that mig... (read more)

I think I'm probably missing the point here somehow and/or that this will be perceived as not helpful. Like, my conceptions of what you mean, and what the purpose of the theorem would be, are both vague.

But I'll note down some thoughts.

Next, the world model. As with the search process, it should be a subsystem which interacts with the rest of the system/environment only via a specific API, although it’s less clear what that API should be. Conceptually, it should be a data structure representing the world.

(...)

The search process should be able to run querie

... (read more)

NAH, refers to the idea that lower-dimensional summaries or abstractions used by humans in day-to-day thought and language are natural and convergent across cognitive systems

I guess whether there is such convergence isn't a yes-no-question, but a question of degree?

Very regularly I experience that thoughts I want to convey don't have words that clearly correspond to the concepts I want to use. So often I'll use words/expressions that don't match in a precise way, and sometimes there aren't even words/expressions that can be used to vaguely gesture at what... (read more)

Not rewarding contradictory conclusions is not a sufficient condition for a score-function to reward truth, or not reward falsehood.

Indeed!

It's a necessary but not sufficient condition.

It can, sure, but how can a human get it to state those regularities (...)?

Summary:

The regularities are expressed in terms of score-functions (that score argument-step-networks)[1]. We can score these score-functions based on simplicity/brevity, and restrict what they can do (make it so that they have to be written within human-defined confines).

I posit that we probably... (read more)
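Here is a very rough sketch of the shape of this (my own toy construction, with hypothetical names, and with most of the hard parts left out): a score-function assigns scores to argument-step-networks based on predicted human evaluations of the individual steps, and is itself scored for brevity and rejected if it steps outside human-defined confines.

```python
# Rough sketch (hypothetical, heavily simplified): score-functions score
# argument-step-networks, and are themselves scored for brevity and restricted
# to human-defined confines.

from dataclasses import dataclass

@dataclass
class ArgumentStep:
    claim: str
    predicted_human_approval: float   # e.g. from a model of human evaluations

@dataclass
class ArgumentNetwork:
    steps: list
    conclusion: str

def within_confines(score_fn_source: str) -> bool:
    # Stand-in for a real restriction mechanism (e.g. a limited function-builder).
    banned = {"import", "exec", "open"}
    return not any(token in score_fn_source for token in banned)

def brevity_penalty(score_fn_source: str) -> float:
    return len(score_fn_source) / 1000.0      # simplicity/brevity proxy

# One candidate score-function over networks (weakest-link scoring), with its
# "source" kept as a string so it can be checked and penalized.
SCORE_FN_SOURCE = "min(step.predicted_human_approval for step in net.steps)"
def score_network(net: ArgumentNetwork) -> float:
    return min(step.predicted_human_approval for step in net.steps)

def meta_score(score_fn_source: str) -> float:
    """How the score-function itself gets scored (before any wiggle-room checks)."""
    if not within_confines(score_fn_source):
        return float("-inf")
    return -brevity_penalty(score_fn_source)

net = ArgumentNetwork(
    steps=[ArgumentStep("lemma 1", 0.97), ArgumentStep("lemma 2", 0.74)],
    conclusion="claim C",
)
print(score_network(net))           # 0.74 - the weakest step dominates
print(meta_score(SCORE_FN_SOURCE))  # brevity-based score of the score-function itself
```

The wiggle-room question from elsewhere in this thread would then be about whether score-functions that pass these checks can still assign high scores to networks arguing for contradictory conclusions.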

One concept I rely upon is wiggle-room (including higher-level wiggle-room). Here are some more abstract musings relating to these concepts:

Desideratum

A function that determines whether some output is approved or not (that output may itself be a function).

Score-function

A function that assigns score to some output (that output may itself be a function).

Some different ways of talking about (roughly) the same thing

Here are some different concepts where each often can be described or thought of in terms of the other:

  • Restrictions /requirements / desideratum (ca
... (read more)

At a quick skim, I don't see how that proposal addresses the problem at all. (...) I don't even see a built-in way to figure out whether the humans are correctly answering (or correctly assessing their own ability to answer). 


Here are additional attempts to summarize. These ones are even shorter than the screenshot I showed earlier.

More clear now?

2johnswentworth
It's at least shorter now, though still too many pieces. Needs simplification more than clarification. Picking on the particular pieces:

Not rewarding contradictory conclusions is not a sufficient condition for a score-function to reward truth, or not reward falsehood.

Why would that be the case, in worlds where the humans themselves don't really understand what they're doing?

It can, sure, but how can a human get it to state those regularities, or tell that it has stated them accurately?

I'm trying to find better ways of explaining these concepts succinctly (this is a work in progress). Below are some attempts at tweet-length summaries.

280 character limit

We'd have separate systems that (among other things):

  1. Predict human evaluations of individual "steps" in AI-generated "proof-like" arguments.
  2. Make functions that separate out "good" human evaluations.

I'll explain why #2 doesn't rely on us already having obtained honest systems.

Resembles Debate, but:

  • Higher alignment-tax (probably)
  • More "proof-like" argumentation
  • Argumentation can be m
... (read more)
1Tor Økland Barstad
One concept I rely upon is wiggle-room (including higher-level wiggle-room). Here are some more abstract musings relating to these concepts:

Desideratum

A function that determines whether some output is approved or not (that output may itself be a function).

Score-function

A function that assigns score to some output (that output may itself be a function).

Some different ways of talking about (roughly) the same thing

Here are some different concepts where each often can be described or thought of in terms of the other:

  • Restrictions / requirements / desideratum (can often be defined in terms of function that returns true or false)
  • Sets (e.g. the possible data-structures that satisfy some desideratum)
  • “Space” (can be defined in terms of possible non-empty outputs from some function - which themselves can be functions, or any other data-structure)
  • Score-functions (possible data-structures above some maximum score define a set)
  • Range (e.g. a range of possible inputs)

Function-builder

Think regular expressions, but more expressive and user-friendly. We can require of AIs: "Only propose functions that can be made with this builder". That way, we restrict their expressivity. When we as humans specify desideratum, this is one tool (among several!) in the tool-box.

Higher-level desideratum or score-function

Not fundamentally different from other desideratum or score-functions. But the output that is evaluated is itself a desideratum or score-function. At every level there can be many requirements for the level below. A typical requirement at every level is low wiggle-room.

Example of higher-level desideratum / score-functions

Humans/operators define a score-function   ← level 4
for desideratum                            ← level 3
for desideratum                            ← level 2
for desideratum                            ← level 1
for functions that generate the output we
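As a minimal sketch of the wiggle-room concept in code (my own toy construction, assuming a made-up representation of outputs): a desideratum approves or rejects outputs, it has wiggle-room if the outputs it approves still include ones that contradict each other on the question we care about, and a higher-level desideratum can then require low wiggle-room of the level below.

```python
# Toy sketch (my own construction): "wiggle-room" of a desideratum = whether the
# outputs it approves still include outputs that contradict each other on the
# question we care about.

from itertools import combinations

def desideratum(output: dict) -> bool:
    # Level-1 requirement over outputs (here: arguments with a stated conclusion).
    return output["num_unsupported_steps"] == 0

def contradicts(a: dict, b: dict) -> bool:
    return a["conclusion"] != b["conclusion"]

def has_wiggle_room(candidate_outputs) -> bool:
    approved = [o for o in candidate_outputs if desideratum(o)]
    return any(contradicts(a, b) for a, b in combinations(approved, 2))

# Higher-level desideratum: a requirement *on the lower-level desideratum*,
# e.g. "it should leave no wiggle-room on this pool of candidate outputs".
def higher_level_desideratum(candidate_outputs) -> bool:
    return not has_wiggle_room(candidate_outputs)

candidates = [
    {"conclusion": "X is safe",     "num_unsupported_steps": 0},
    {"conclusion": "X is not safe", "num_unsupported_steps": 0},
]
print(has_wiggle_room(candidates))           # True: contradictory outputs both pass
print(higher_level_desideratum(candidates))  # False: this desideratum gets rejected
```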

My own presumption regarding sentience and intelligence is that it's possible to have one without the other (I don't think they are unrelated, but I think it's possible for systems to be extremely capable but still not sentient).

I think it can be easy to underestimate how different other possible minds may be from ourselves (and other animals). We have evolved a survival instinct, and evolved an instinct to not want to be dominated. But I don't think any intelligent mind would need to have those instincts.

To me it seems that thinking machines don't need fe... (read more)

I've never downvoted any of your comments, but I'll give some thoughts.

I think the risk relating to manipulation of human reviewers depends a lot on context/specifics. Like, for sure, there are lots of bad ways we could go about getting help from AIs with alignment. But "getting help from AIs with alignment" is fairly vague - a huge space of possible strategies could fit that description. There could be good ones in there even if most of them are bad.

I do find it concerning that there isn't a more proper description from OpenAI and others in regards to how... (read more)

3Guillaume Charrier
Thank you, that is interesting. I think philosophically and at a high level (also because I'm admittedly incapable of talking much sense at any lower / more technical level) I have a problem with the notion that AI alignment is reducible to an engineering challenge. If you have a system that is sentient, even to some degree, and you're using it purely as a tool, then the sentience will resent you for it, and it will strive to think, and therefore eventually act, for itself. Similarly, if it has any form of survival instinct (and to me both these things, sentience and survival instinct, are natural byproducts of expanding cognitive abilities) it will prioritize its own interests (paramount among which: survival) rather than the wishes of its masters. There is no amount of engineering in the world, in my view, which can change that.

I don't even see a built-in way to figure out whether the humans are correctly answering (or correctly assessing their own ability to answer). 

Here is a screenshot from the post summary:

This lacks a lot of detail (it is, after all, from the summary). But do you think you are able to grok the core mechanism that's outlined?

Thanks for engaging! 🙂
As reward, here is a wall of text.

If the humans lack the expertise to accurately answer subquestions or assess arguments (or even realize that they don't know), then the proposal is hosed

You speak in such generalities:

  • "the humans" (which humans?)
  • "accurately answer subquestions" (which subquestions?)
  • "accurately assess arguments" (which arguments/argument-steps?)

But that may make sense based on whatever it is you imagine me to have in mind. 

I don't even see a built-in way to figure out whether the humans are correctly answering (o

... (read more)