All of Tor Økland Barstad's Comments + Replies

As far as I understand, MIRI did not assume that we're just able to give the AI a utility function directly. 

I'm a bit unsure about how to interpret you here.

In my original comment, I used terms such as positive/optimistic assumptions and simplifying assumptions. When doing that, I meant to refer to simplifying assumptions that were made so as to abstract away some parts of the problem.

The Risks from Learned Optimization paper was written mainly by people from MIRI!

Good point (I should have written my comment in such a way that pointing out this didn'... (read more)

3niplav
Yeah, I find it difficult to figure out how to look at this. A lot of MIRI discussion focused on their decision theory work, but I think that's just not that important. Tiling agents e.g. was more about constructing or theorizing about agents that may have access to their own values, in a highly idealized setting about logic.

Thanks for the reply :) I'll try to convey some of my thinking, but I don't expect great success. I'm working on more digestible explainers, but this is a work in progress, and I have nothing good that I can point people to as of now.

(...) part of the explanation here might be "if the world is solved by AI, we do actually think it will probably be via doing some concrete action in the world (e.g., build nanotech), not via helping with alignment (...)

Yeah, I guess this is where a lot of the differences in our perspective are located.

if the world is solved b

... (read more)

Thanks for the reply :) Feel free to reply further if you want, but I hope you don't feel obliged to do so[1].

"Fill the cauldron" examples are (...) not examples where it has the wrong beliefs.

I have never ever been confused about that!

It's "even simple small-scale tasks are unnatural, in the sense that it's hard to define a coherent preference ordering over world-states such that maximizing it completes the task and has no serious negative impact; and there isn't an obvious patch that overcomes the unnaturalness or otherwise makes it predictably easier to

... (read more)
2Rob Bensinger
In retrospect I think we should have been more explicit about the importance of inner alignment; I think that we didn't do that in our introduction to corrigibility because it wasn't necessary for illustrating the problem and where we'd run into roadblocks.

Maybe a missing piece here is some explanation of why having a formal understanding of corrigibility might be helpful for actually training corrigibility into a system? (Helpful at all, even if it's not sufficient on its own.)

Aside from "concreteness can help make the example easier to think about when you're new to the topic", part of the explanation here might be "if the world is solved by AI, we do actually think it will probably be via doing some concrete action in the world (e.g., build nanotech), not via helping with alignment or building a system that only outputs English-language sentences".

I mean, I think utility functions are an extremely useful and basic abstraction. I think it's a lot harder to think about a lot of AI topics without invoking ideas like 'this AI thinks outcome X is better than outcome Y', or 'this AI's preferences come with different weights, which can't purely be reduced to what the AI believes'.

Your reply here says much of what I would expect it to say (and much of it aligns with my impression of things). But why you focused so much on "fill the cauldron" type examples is something I'm a bit confused by (if I remember correctly I was confused by this in 2016 also).

"Fill the cauldron" examples are examples where the cauldron-filler has the wrong utility function, not examples where it has the wrong beliefs. E.g., this is explicit in https://intelligence.org/2016/12/28/ai-alignment-why-its-hard-and-where-to-start/ 

The idea of the "fill the cauldron" examples isn't "the AI is bad at NLP and therefore doesn't understand what we mean when we say 'fill', 'cauldron', etc." It's "even simple small-scale tasks are unnatural, in the sense that it's hard to define a coherent preference ordering over world-states such that... (read more)
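To make the "wrong utility function, not wrong beliefs" distinction concrete, here is a toy sketch of my own (hypothetical, and not anything from MIRI's writings): a planner with a perfectly accurate world-model still picks the destructive plan, because the preference ordering we wrote down only mentions the cauldron.

```python
# Toy illustration (hypothetical, not MIRI's formalism): an agent with accurate
# beliefs whose objective only mentions the cauldron.

# Each plan: (probability the cauldron ends up full, does the workshop flood?)
PLANS = {
    "carry buckets by hand":      (0.99, False),
    "redirect the river indoors": (1.00, True),   # floods the workshop
}

def expected_utility(plan: str) -> float:
    p_full, _floods = PLANS[plan]
    return p_full * 1.0   # utility is 1 iff the cauldron is full; flooding costs nothing

best = max(PLANS, key=expected_utility)
print(best)  # -> "redirect the river indoors"
# The failure is not a belief error (the model above is accurate); it's that the
# preference ordering we wrote down says nothing about side effects.
```

And the "no obvious patch" part: appending "...and don't flood the workshop" to the objective just moves the omission to whichever side effect we forgot to list next.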

This tweet from Eliezer seems relevant btw. I would give similar answers to all of the questions he lists that relate to nanotechnology (but I'd be somewhat more hedged/guarded - e.g. replacing "YES" with "PROBABLY" for some of them).

 

Thanks for engaging

Likewise  :)

Also, sorry about the length of this reply. As the adage goes: "If I had more time, I would have written a shorter letter."

From my perspective you seem simply very optimistic on what kind of data can be extracted from unspecific measurements.

That seems to be one of the relevant differences between us. Although I don't think it is the only difference that causes us to see things differently.

Other differences (I guess some of these overlap):

  • It seems I have higher error-bars than you on the question we are discussing now. Y
... (read more)
1Tor Økland Barstad
This tweet from Eliezer seems relevant btw. I would give similar answers to all of the questions he lists that relate to nanotechnology (but I'd be somewhat more hedged/guarded - e.g. replacing "YES" with "PROBABLY" for some of them).  

I suspect my own intuitions regarding this kind of thing are similar to Eliezer's. It's possible that my intuitions are wrong, but I'll try to share some thoughts.

It seems that we think quite differently when it comes to this, and probably it's not easy for us to achieve mutual understanding. But even if all we do here is to scratch the surface, that may still be worthwhile.

As mentioned, maybe my intuitions are wrong. But maybe your intuitions are wrong (or maybe both). I think a desirable property of plans/strategies for alignment would be robustness to e... (read more)

1dankrad
Thanks for engaging with my post. From my perspective you seem simply very optimistic on what kind of data can be extracted from unspecific measurements. Here is another good example on how Eliezer makes some pretty out there claims about what might be possible to infer from very little data: https://www.lesswrong.com/posts/ALsuxpdqeTXwgEJeZ/could-a-superintelligence-deduce-general-relativity-from-a -- I wonder what your intuition says about this?

Generally it is a good idea to be robust with plans. However, in this specific instance, the way Eliezer phrases it, any iterative plan for alignment would be excluded. Since I also believe that this is the only realistic plan (there will simply never be a design that has the properties that Eliezer thinks guarantee alignment), the only realistic remaining path would be a permanent freeze (which I actually believe comes with large risks as well: unenforceability and thus worse actors making ASI first, biotech in the wrong hands becoming a larger threat to humanity, etc.).

What I would agree to is that it is good to plan for the eventuality that a lot less data could be needed by an ASI to do something like "create nanobots". For example, we could conclude that it's for now simply a bad idea if AI is used in biotech labs, because these are the places where it could easily gather a lot of data and maybe even influence experiments so that they let it learn the things it needs to create nanobots. Similarly, we could try to create worldwide warning systems around technologies that seem likely to be necessary for an AI takeover, and watch these closely, so that we would notice any specific experiments.

However, there is no way to scale this to a one-shot scenario. His claim is that an ASI will order some DNA and get some scientists in a lab to mix it together with some substances and create nanobots. That is what I describe as a one-shot scenario. Even if it were 10,000 shots in parallel I simply don't think it is possible,

None of these are what you describe, but here are some places people can be pointed to:

AGI-assisted alignment in Dath Ilan (excerpt from here)

Suppose Dath Ilan got into a situation where they had to choose the strategy of AGI-assisted alignment, and didn't have more than a few years to prepare. Dath Ilan wouldn't actually get themselves into such a situation, but if they did, how might they go about it?

I suspect that among other things they would:

  • Make damn well sure to box the AGIs before they plausibly could become dangerous/powerful.
  • Try, insofar as they could, to make their methodologies robust to hardware exploits (rowhammer, etc). Not only
... (read more)

This is from What if Debate and Factored Cognition had a mutated baby? (a post I started on, but I ended up disregarding this draft and starting anew). This is just an excerpt from the intro/summary (it's not the entire half-finished draft).


Tweet-length summary-attempts

Resembles Debate, but:

  • Higher alignment-tax (probably)
  • More "proof-like" argumentation
  • Argumentation can be more extensive
  • There would be more mechanisms for trying to robustly separate out "good" human evaluations (and testing if we succeeded)

We'd have separate systems that (among oth

... (read more)

Below are some concepts related to extracting aligned capabilities. The main goal is to be able to verify specialized functions without having humans look at the source code, and without us being able to safely/robustly score outputs for the full range of inputs (a toy sketch follows below the list).

Some things we need:

  • We need AIs that act in such a way as to maximize score
  • There needs to be some range of the inputs that we can test
  • There needs to be ways of obtaining/calculating the output we want that are at least somewhat general

An example of an aligned capability we might want woul... (read more)
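As a toy sketch of the testing side of the list above (my own hypothetical framing, with made-up names): we only trust ourselves to score outputs on a restricted, testable sub-range of inputs, and we score an AI-proposed function - treated as a black box - by its agreement with a trusted-but-limited reference on that sub-range.

```python
# Toy sketch (hypothetical framing, not a full proposal): score an AI-proposed
# black-box function by agreement with a trusted-but-limited reference on the
# inputs we are able to test.

def trusted_reference(x: int) -> int:
    """A slow/limited way of obtaining the answer that we do trust,
    but which only covers a restricted range of inputs."""
    return x * (x + 1) // 2          # e.g. a known closed form

def ai_proposed(x: int) -> int:
    """Treated as a black box: we never look at how it works."""
    return sum(range(x + 1))

TESTABLE_RANGE = range(0, 10_000)    # the sub-range we can safely check

def score(candidate) -> float:
    """Fraction of testable inputs on which the candidate matches the reference.
    An AI maximizing this score is pushed toward matching on the testable range;
    what that buys us outside that range is the part this sketch leaves out."""
    hits = sum(candidate(x) == trusted_reference(x) for x in TESTABLE_RANGE)
    return hits / len(TESTABLE_RANGE)

print(score(ai_proposed))            # 1.0 here, but only over the tested range
```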

I would also like to see more work where people make less positive/optimistic assumptions. I think of it as a good thing that different approaches to alignment are being explored, and would like to see more of that in general (both in terms of breadth and depth).

I guess there are many possible ways of trying to categorize/conceptualize approaches to alignment theorizing. One is by asking "when talking/thinking about the methodology, what capabilities are assumed to be in place?".

I'm not sure about this, but unless I'm mistaken[1], a good amount of the work... (read more)

5niplav
As far as I understand, MIRI did not assume that we're just able to give the AI a utility function directly. The Risks from Learned Optimization paper was written mainly by people from MIRI! Other things like Ontological Crises and Low Impact sort of assume you can get some info into the values of an agent, and Logical Induction was more about how to construct systems that satisfy some properties in their cognition.

If humans (...) machine could too.

From my point of view, humans are machines (even if not typical machines). Or, well, some will say that by definition we are not - but that's not so important really ("machine" is just a word). We are physical systems with certain mental properties, and therefore we are existence proofs of physical systems with those certain mental properties being possible.

machine can have any level of intelligence, humans are in a quite narrow spectrum

True. Although if I myself somehow could work/think a million times faster, I think I'd... (read more)

Why call it an assumption at all?

Partly because I was worried about follow-up comments that were kind of like "so you say you can prove it - well, why aren't you doing it then?".

And partly because I don't make a strict distinction between "things I assume" and "things I have convinced myself of, or proved to myself, based on things I assume". I do see there as sort of being a distinction along such lines, but I see it as blurry.

Something that is derivable from axioms is usually called a theorem.

If I am to be nitpicky, maybe you meant "derived" and not "der... (read more)

1Donatas Lučiūnas
As I understand you try to prove your point by analogy with humans. If humans can pursue somewhat any goal, machine could too. But while we agree that machine can have any level of intelligence, humans are in a quite narrow spectrum. Therefore your reasoning by analogy is invalid.

(...) if it's supported by argument or evidence, but if it is, then it's no mere assumption.

I do think it is supported by arguments/reasoning, so I don't think of it as an "axiomatic" assumption. 

A follow-up to that (not from you specifically) might be "what arguments?". And - well, I think I pointed to some of my reasoning in various comments (some of them under deleted posts). Maybe I could have explained my thinking/perspective better (even if I wouldn't be able to explain it in a way that's universally compelling 🙃). But it's not a trivial task t... (read more)

1TAG
Why call it an assumption at all? Something that is derivable from axioms is usually called a theorem.

I cannot help you to be less wrong if you categorically rely on intuition about what is possible and what is not.

I wish I had something better to base my beliefs on than my intuitions, but I do not. My belief in modus ponens, my belief that 1+1=2, my belief that me observing gravity in the past makes me likely to observe it in the future, my belief that if views are in logical contradiction they cannot both be true - all this is (the way I think of it) grounded in intuition.

Some of my intuitions I regard as much more strong/robust than others. 

When my... (read more)

Like with many comments/questions from you, answering this question properly would require a lot of unpacking. Although I'm sure that also is true of many questions that I ask, as it is hard to avoid (we all have limited communication bandwidth) :)

In this last comment, you use the term "science" in a very different way from how I'd use it (like you sometimes also do with other words, such as for example "logic"). So if I was to give a proper answer I'd need to try to guess what you mean, make it clear how I interpret what you say, and so on (not just answe... (read more)

-5Donatas Lučiūnas

It seems that 2 + 2 = 4 is also an assumption for you.

Yes (albeit a very reasonable one).

Not believing (some version of) that claim would typically make minds/AGIs less "capable", and I would expect more or less all AGIs to hold (some version of) that "belief" in practice.

I don't think it is possible to find consensus if we do not follow the same rules of logic.

Here are examples of what I would regard to be rules of logic: https://en.wikipedia.org/wiki/List_of_rules_of_inference (the ones listed here don't encapsulate all of the rules of inference tha... (read more)

-3Donatas Lučiūnas
So this is where we disagree. That's how hypothesis testing works in science:

  1. You create a hypothesis
  2. You find a way to test if it is wrong
      1. You reject hypothesis if the test passes
  3. You find a way to test if it is right
      1. You approve hypothesis if the test passes

While hypothesis is not rejected nor approved it is considered possible. Don't you agree?

I do have arguments for that, and I have already mentioned some of them earlier in our discussion (you may not share that assessment, despite us being relatively close in mind-space compared to most possible minds, but oh well).

Some of the more relevant comments from me are on one of the posts that you deleted.

As I mention here, I think I'll try to round off this discussion. (Edit: I had a malformed/misleading sentence in that comment that should be fixed now.)

Every assumption is incorrect unless there is evidence. 

Got any evidence for that assumption? 🙃

Answer to all of them is yes. What is your explanation here?

Well, I don't always "agree"[1] with ChatGPT, but I agree in regards to those specific questions.

...

I saw a post where you wanted people to explain their disagreement, and I felt inclined to do so :) But it seems now that neither of us feel like we are making much progress.

Anyway, from my perspective much of your thinking here is very misguided. But not more misguided than e.g. "proofs" for Go... (read more)

3Donatas Lučiūnas
That's basic logic, Hitchens's razor. It seems that 2 + 2 = 4 is also an assumption for you. What isn't then? I don't think it is possible to find consensus if we do not follow the same rules of logic. Considering your impression about me, I'm truly grateful for your patience. Best wishes from my side as well :) But on the other hand I am certain that you are mistaken and I feel that you do not provide me a way to show that to you.

Do you think you can deny existence of an outcome with infinite utility? 

To me, according to my preferences/goals/inclinations, there are conceivable outcomes with infinite utility/disutility.

But I think it is possible (and feasible) for a program/mind to be extremely capable, and affect the world, and not "care" about infinite outcomes.

The fact that things "break down" is not a valid argument.

I guess that depends on what's being discussed. Like, it is something to take into account/consideration if you want to prove something while referencing utility-functions that reference infinities.

1Donatas Lučiūnas
As I understand you do not agree with the quoted claim from Pascal's Mugging, not with me. Do you have any arguments for that?

About universally compelling arguments?

First, a disclaimer: I do think there are "beliefs" that most intelligent/capable minds will have in practice. E.g. I suspect most will use something like modus ponens, most will update beliefs in accordance with statistical evidence in certain ways, etc. I think it's possible for a mind to be intelligent/capable without strictly adhering to those things, but for sure I think there will be a correlation in practice for many "beliefs".

Questions I ask myself are:

  • Would it be impossible (in theory) to wire together a mind
... (read more)

With all the interactions we had, I've got an impression that you are more willing to repeat what you've heard somewhere instead of thinking logically.

Some things I've explained in my own words. In other cases, where someone else already has explained something well, I've shared a URL to that explanation.

more willing to repeat what you've heard somewhere instead of thinking logically

This seems to support my hypothesis of you "being so confident that we are the ones who "don't get it" that it's not worth it to more carefully read the posts that are l... (read more)

1TAG
It's correct if it's supported by argument or evidence, but if it is, then it's no mere assumption. It's not supposed to be an assumption, it is supposed, by Rationalists, to be a proven theorem.
1Donatas Lučiūnas
I don't agree. Every assumption is incorrect unless there is evidence. Could you share any evidence for this assumption? If you ask ChatGPT

  • is it possible that chemical elements exist that we do not know
  • is it possible that fundamental particles exist that we do not know
  • is it possible that physical forces exist that we do not know

Answer to all of them is yes. What is your explanation here?

What about "I think therefore I am"? Isn't it universally compelling argument?

Not even among the tiny tiny section of mind-space occupied by human minds: 

Notice also that "I think therefore I am" is an is-statement (not an ought-statement / something a physical system optimizes towards).

As to me personally, I don't disagree that I exist, but I see it as a fairly vague/ill-defined statement. And it's not a logical necessity, even if we presume assumptions that most humans would share. Another logical possibility would be Boltzmann brains (unless a Bolt... (read more)

1Donatas Lučiūnas
What information would change your opinion?

Agreed (more or less). I have pointed him to this post earlier. He has given no signs so far of comprehending it, or even reading it and trying to understand what is being communicated to him.

I'm saying this more directly than I usually would @Donatas, since you seem insistent on clarifying a disagreement/misunderstanding you think is important for the world, while it seems (as far as I can see) that you're not comprehending all that is communicated to you (maybe due to being so confident that we are the ones who "don't get it" that it's not worth it to mo... (read more)

-3Donatas Lučiūnas
Dear Tom, the feeling is mutual. With all the interactions we had, I've got an impression that you are more willing to repeat what you've heard somewhere instead of thinking logically. "Universally compelling arguments are not possible" is an assumption. While "universally compelling argument is possible" is not. Because we don't know what we don't know. We can call it crux of our disagreement and I think that my stance is more rational.

He didn't say that "infinite value" is logically impossible. He described it as an assumption.

When saying "is possible", I'm not sure if he meant "is possible (conceptually)" or "is possible (according to the ontology/optimization-criteria of any given agent)". I think the latter would be most sensible.

He later said: "I think initially specifying premises such as these more precisely initially ensures the reasoning from there is consistent/valid.". Not sure if I interpreted him correctly, but I saw it largely as an encouragement to think more explicitly abou... (read more)

1Donatas Lučiūnas
Do you think you can deny existence of an outcome with infinite utility? The fact that things "break down" is not a valid argument. If you cannot deny - it's possible. And if it's possible - alignment impossible.

Same traits that make us intelligent (ability to logically reason), make us power seekers.

Well, I do think the two are connected/correlated. And arguments relating to instrumental convergence are a big part of why I take AI risk seriously. But I don't think strong abilities in logical reasoning necessitates power-seeking "on its own".

I think it is wrong to consider Pascal's mugging a vulnerability.

For the record, I don't think I used the word "vulnerability", but maybe I phrased myself in a way that implied me thinking of things that way. And maybe I also ... (read more)

1Donatas Lučiūnas
Sorry, but it seems to me that you are stuck with AGI analogy to humans without a reason. Many times human behavior does not correlate with AGI: humans do mass suicides, humans have phobias, humans take great risks for fun, etc. In other words - humans do not seek to be as rational as possible. I agree that being skeptical towards Pascal's Wager is reasonable, because there is much evidence that God is fictional. But this is not the case with "an outcome with infinite utility may exist", there is just logic here, no hidden agenda, this is as fundamental as "I think therefore I am". Nothing is more rational than complying with this. Don't you think?

Most humans are not obedient/subservient to others (at least not maximally so). But also: Most humans would not exterminate the rest of humanity if given the power to do so. I think many humans, if they became a "singleton", would want to avoid killing other humans. Some would also be inclined to make the world a good place to live for everyone (not just other humans, but other sentient beings as well).

From my perspective, the example of humans was intended as "existence proof". I expect AGIs we develop to be quite different from ourselves. I wouldn't be i... (read more)

-4Donatas Lučiūnas
But it is doomed, the proof is above. The only way to control AGI is to contain it. We need to ensure that we run AGI in fully isolated simulations and gather insights with the assumption that the AGI will try to seek power in simulated environment. I feel that you don't find my words convincing, maybe I'll find a better way to articulate my proof. Until then I want to contribute as much as I can to safety.

I'd argue that the only reason you do not comply with Pascal's mugging is because you don't have unavoidable urge to be rational, which is not going to be the case with AGI.

I'd agree that among superhuman AGIs that we are likely to make, most would probably be prone towards rationality/consistency/"optimization" in ways I'm not.

I think there are self-consistent/"optimizing" ways to think/act that wouldn't make minds prone to Pascal's muggings.

For example, I don't think there is anything logically inconsistent about e.g. trying to act so as to maximize the ... (read more)

1Donatas Lučiūnas
One more thought. I think it is wrong to consider Pascal's mugging a vulnerability. Dealing with unknown probabilities has its utility:

  • Investments with high risk and high ROI
  • Experiments
  • Safety (eliminate threats before they happen)

Same traits that make us intelligent (ability to logically reason), make us power seekers. And this is going to be the same with AGI, just much more effective.

Hopefully I'm wrong, please help me find a mistake.

There is more than just one mistake here IMO, and I'm not going to try to list them.

Just the title alone ("AGI is uncontrollable, alignment is impossible") is totally misguided IMO. It would, among other things, imply that brain emulations are impossible (humans can be regarded as a sort of AGI, and it's not impossible for humans to be aligned).

But oh well. I'm sure your perspectives here are earnestly held / it's how you currently see things. And there are no "perfect" procedures for evaluating how much t... (read more)

1Donatas Lučiūnas
Thanks for feedback. I don't think analogy with humans is reliable. But for the sake of argument I'd like to highlight that corporations and countries are mostly limited by their power, not by alignment. Usually countries declare independence once they are able to.

If an outcome with infinite utility is presented, then it doesn't matter how small its probability is: all actions which lead to that outcome will have to dominate the agent's behavior.

My perspective would probably be more similar to yours (maybe still with substantial differences) if I had the following assumptions:

  1. All agents have a utility-function (or act indistinguishably from agents that do)
  2. All agents where #1 is the case act in a pure/straight-forward way to maximize that utility-function (not e.g. discounting infinities)
  3. All agents where #1 is the ca
... (read more)
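As a toy illustration of what I mean by not maximizing in a "pure/straight-forward" way (a sketch of my own making, and only one of several possible self-consistent constructions): an expected-utility calculation that bounds raw utilities before weighing them, so that a tiny-probability claim of an astronomically large payoff no longer dominates.

```python
# Toy sketch (my own illustration): an agent that maximizes expected utility but
# bounds raw utilities first, so claims of astronomically large payoffs stop
# dominating its decisions.

import math

UTILITY_CAP = 100.0

def bounded_utility(raw: float) -> float:
    # Squash raw utility into (-CAP, CAP); many other bounding schemes would do.
    return UTILITY_CAP * math.tanh(raw / UTILITY_CAP)

def expected_bounded_utility(outcomes) -> float:
    # outcomes: list of (probability, raw_utility) pairs
    return sum(p * bounded_utility(u) for p, u in outcomes)

# "Pay the mugger": tiny chance of a claimed astronomically large payoff,
# near-certain small loss.
pay_mugger = [(1e-9, 1e30), (1 - 1e-9, -10.0)]
# "Refuse": keep your $10 with certainty.
refuse = [(1.0, 10.0)]

print(expected_bounded_utility(pay_mugger))  # ~ -10: the capped payoff can't dominate
print(expected_bounded_utility(refuse))      # ~ +10
```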
1Donatas Lučiūnas
I'd argue that the only reason you do not comply with Pascal's mugging is because you don't have unavoidable urge to be rational, which is not going to be the case with AGI. Thanks for your input, it will take some time for me to process it.

It seems that you do not recognize https://www.lesswrong.com/tag/pascal-s-mugging .

Not sure what you mean by "recognize". I am familiar with the concept.

But to be honest most of statements that we can think of may be true and unknowable, for example "aliens exist", "huge threats exist", etc.

"huge threat" is a statement that is loaded with assumptions that not all minds/AIs/agents will share.

Can you prove that there cannot be any unknowable true statement that could be used for Pascal's mugging?

Used for Pascal's mugging against who? (Humans? Coffee machine... (read more)

1Donatas Lučiūnas
OK, let me rephrase my question. There is a phrase in Pascal's Mugging: "If an outcome with infinite utility is presented, then it doesn't matter how small its probability is: all actions which lead to that outcome will have to dominate the agent's behavior." I think that Orthogonality thesis is right only if an agent is certain that an outcome with infinite utility does not exist. And I argue that an agent cannot be certain of that. Do you agree?

Fitch's paradox of knowability and Gödel's incompleteness theorems prove that there may be true statements that are unknowable. 

Independently of Gödel's incompleteness theorems (which I have heard of) and Fitch's paradox of knowability (which I had not heard of), I do agree that there can be true statements that are unknown/unknowable (including relatively "simple" ones) 🙂

For example "rational goal exists" may be true and unknowable. Therefore "rational goal may exist" is true. (...) Do you agree?

I don't think it follows from "there may be statements... (read more)

1Donatas Lučiūnas
I agree that not any statement may be true and unknowable. But to be honest most of statements that we can think of may be true and unknowable, for example "aliens exist", "huge threats exist", etc. It seems that you do not recognize https://www.lesswrong.com/tag/pascal-s-mugging . Can you prove that there cannot be any unknowable true statement that could be used for Pascal's mugging? Because that's necessary if you want to prove Orthogonality thesis is right.

Why do you think your starting point is better?

I guess there are different possible interpretations of "better". I think it would be possible for software-programs to be much more mentally capable than me across most/all dimensions, and still not have "starting points" that I would consider "good" (for various interpretations of "good").

As I understand you assume different starting-point.

I'm not sure. Like, it's not as if I don't have beliefs or assumptions or guesses relating to AIs. But I think I probably make less general/universal assumptions that I'd ... (read more)

1Donatas Lučiūnas
Fitch's paradox of knowability and Gödel's incompleteness theorems prove that there may be true statements that are unknowable. For example "rational goal exists" may be true and unknowable. Therefore "rational goal may exist" is true. Therefore it is not an assumption. Do you agree?

In my opinion the optimal behavior is

Not sure what you mean by "optimal behavior". I think I can see how the things make sense if the starting point is that there are these things called "goals", and (I, the mind/agent) am motivated to optimize for "goals". But I don't assume this as an obvious/universal starting-point (be that for minds in general, extremely intelligent minds in general, minds in general that are very capable and might have a big influence on the universe, etc).

This is a common mistake to assume, that if you don't know your goal, then it do

... (read more)
1Donatas Lučiūnas
Thanks again. As I understand you assume different starting-point. Why do you think your starting point is better?

I assume you mean "provide definitions"

More or less / close enough 🙂

Agent - https://www.lesswrong.com/tag/agent

Here they write: "A rational agent is an entity which has a utility function, forms beliefs about its environment, evaluates the consequences of possible actions, and then takes the action which maximizes its utility."

I would not share that definition, and I don't think most other people commenting on this post would either (I know there is some irony to that, given that it's the definition given on the LessWrong wiki). 

Often the words/conce... (read more)

No. That's exactly the point I try to make by saying "Orthogonality Thesis is wrong".

Thanks for the clarification 🙂

"There is no rational goal" is an assumption in Orthogonality thesis

I suspect arriving at such a conclusion may result from thinking of utility maximizers as more of a "platonic" concept, as opposed to thinking of it from a more mechanistic angle. (Maybe I'm being too vague here, but it's an attempt to briefly summarize some of my intuitions into words.)

I'm not sure what you would mean by "rational". Would computer programs need to be "rationa... (read more)

1Donatas Lučiūnas
Thanks, I am learning your perspective. And what is your opinion to this?

why would you assume that agent does not care about future states? Do you have a proof for that?

Would you be able to Taboo Your Words for "agent", "care" and "future states"? If I were to explain my reasons for disagreement it would be helpful to have a better idea of what you mean by those terms.

1Donatas Lučiūnas
I assume you mean "provide definitions":

  • Agent - https://www.lesswrong.com/tag/agent
  • Care - https://www.lesswrong.com/tag/preference
  • Future states - numeric value of agent's utility function in the future

Does it make sense?

Hi, I didn't downvote, but below are some thoughts from me 🙂

Some of my comment may be pointing out things you already agree with / are aware of. 

I'd like to highlight, that this proof does not make any assumptions, it is based on first principles (statements that are self-evident truths).

First principles are assumptions. So if first principles are built in, then it's not true that it doesn't make assumptions.

I do not know my goal (...) I may have a goal

This seems to imply that the agent should have as a starting-point that is (something akin to) "I s... (read more)

1Donatas Lučiūnas
No. That's exactly the point I try to make by saying "Orthogonality Thesis is wrong". Thank you for your insights and especially thank you for not burning my karma 😅 I see a couple of ideas that I disagree with, but if you are OK with that I'd suggest we go forward step by step. First, what is your opinion about this comment?

Here is my attempt at a shorter answer (although it didn’t end up as short as planned) 🙂

I’m also being more simplistic here (at times deliberately so), in the hope of making “core” concepts digestible with less effort.

If you don’t respond here you probably won’t hear from me in a while.

It can, sure, but how can a human get it to state those regularities (...)?

Score-functions would score argument-step-networks. It is these score-functions that would leverage regularities for when human evaluations are “good”/correct.

Here are some things that mig... (read more)

I think I'm probably missing the point here somehow and/or that this will be perceived as not helpful. Like, my conceptions of what you mean, and what the purpose of the theorem would be, are both vague.

But I'll note down some thoughts.

Next, the world model. As with the search process, it should be a subsystem which interacts with the rest of the system/environment only via a specific API, although it’s less clear what that API should be. Conceptually, it should be a data structure representing the world.

(...)

The search process should be able to run querie

... (read more)

NAH, refers to the idea that lower-dimensional summaries or abstractions used by humans in day-to-day thought and language are natural and convergent across cognitive systems

I guess whether there is such convergence isn't a yes-no-question, but a question of degree?

Very regularly I experience that thoughts I want to convey don't have words that clearly correspond to the concepts I want to use. So often I'll use words/expressions that don't match in a precise way, and sometimes there aren't even words/expressions that can be used to vaguely gesture at what... (read more)

Not rewarding contradictory conclusions is not a sufficient condition for a score-function to reward truth, or not reward falsehood.

Indeed!

It's a necessary but not sufficient condition.

It can, sure, but how can a human get it to state those regularities (...)?

Summary:

The regularities are expressed in terms of score-functions (that score argument-step-networks)[1]. We can score these score-functions based on simplicity/brevity, and restrict what they can do (make it so that they have to be written within human-defined confines).

I posit that we probably... (read more)
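Here is a very rough sketch of the shape of this (my own toy construction, with hypothetical names, and with most of the hard parts left out): a score-function assigns scores to argument-step-networks based on predicted human evaluations of the individual steps, and is itself scored for brevity and rejected if it steps outside human-defined confines.

```python
# Rough sketch (hypothetical, heavily simplified): score-functions score
# argument-step-networks, and are themselves scored for brevity and restricted
# to human-defined confines.

from dataclasses import dataclass

@dataclass
class ArgumentStep:
    claim: str
    predicted_human_approval: float   # e.g. from a model of human evaluations

@dataclass
class ArgumentNetwork:
    steps: list
    conclusion: str

def within_confines(score_fn_source: str) -> bool:
    # Stand-in for a real restriction mechanism (e.g. a limited function-builder).
    banned = {"import", "exec", "open"}
    return not any(token in score_fn_source for token in banned)

def brevity_penalty(score_fn_source: str) -> float:
    return len(score_fn_source) / 1000.0      # simplicity/brevity proxy

# One candidate score-function over networks (weakest-link scoring), with its
# "source" kept as a string so it can be checked and penalized.
SCORE_FN_SOURCE = "min(step.predicted_human_approval for step in net.steps)"
def score_network(net: ArgumentNetwork) -> float:
    return min(step.predicted_human_approval for step in net.steps)

def meta_score(score_fn_source: str) -> float:
    """How the score-function itself gets scored (before any wiggle-room checks)."""
    if not within_confines(score_fn_source):
        return float("-inf")
    return -brevity_penalty(score_fn_source)

net = ArgumentNetwork(
    steps=[ArgumentStep("lemma 1", 0.97), ArgumentStep("lemma 2", 0.74)],
    conclusion="claim C",
)
print(score_network(net))           # 0.74 - the weakest step dominates
print(meta_score(SCORE_FN_SOURCE))  # brevity-based score of the score-function itself
```

The wiggle-room question from elsewhere in this thread would then be about whether score-functions that pass these checks can still assign high scores to networks arguing for contradictory conclusions.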

One concept I rely upon is wiggle-room (including higher-level wiggle-room). Here are some more abstract musings relating to these concepts:

Desideratum

A function that determines whether some output is approved or not (that output may itself be a function).

Score-function

A function that assigns score to some output (that output may itself be a function).

Some different ways of talking about (roughly) the same thing

Here are some different concepts where each often can be described or thought of in terms of the other:

  • Restrictions /requirements / desideratum (ca
... (read more)

At a quick skim, I don't see how that proposal addresses the problem at all. (...) I don't even see a built-in way to figure out whether the humans are correctly answering (or correctly assessing their own ability to answer). 


Here are additional attempts to summarize. These ones are even shorter than the screenshot I showed earlier.

More clear now?

2johnswentworth
It's at least shorter now, though still too many pieces. Needs simplification more than clarification. Picking on the particular pieces:

Not rewarding contradictory conclusions is not a sufficient condition for a score-function to reward truth, or not reward falsehood.

Why would that be the case, in worlds where the humans themselves don't really understand what they're doing?

It can, sure, but how can a human get it to state those regularities, or tell that it has stated them accurately?

I'm trying to find better ways of explaining these concepts succinctly (this is a work in progress). Below are some attempts at tweet-length summaries.

280 character limit

We'd have separate systems that (among other things):

  1. Predict human evaluations of individual "steps" in AI-generated "proof-like" arguments.
  2. Make functions that separate out "good" human evaluations.

I'll explain why #2 doesn't rely on us already having obtained honest systems.

Resembles Debate, but:

  • Higher alignment-tax (probably)
  • More "proof-like" argumentation
  • Argumentation can be m
... (read more)
1Tor Økland Barstad
One concept I rely upon is wiggle-room (including higher-level wiggle-room). Here are some more abstract musings relating to these concepts:

Desideratum

A function that determines whether some output is approved or not (that output may itself be a function).

Score-function

A function that assigns score to some output (that output may itself be a function).

Some different ways of talking about (roughly) the same thing

Here are some different concepts where each often can be described or thought of in terms of the other:

  • Restrictions / requirements / desideratum (can often be defined in terms of function that returns true or false)
  • Sets (e.g. the possible data-structures that satisfy some desideratum)
  • “Space” (can be defined in terms of possible non-empty outputs from some function - which themselves can be functions, or any other data-structure)
  • Score-functions (possible data-structures above some maximum score define a set)
  • Range (e.g. a range of possible inputs)

Function-builder

Think regular expressions, but more expressive and user-friendly. We can require of AIs: "Only propose functions that can be made with this builder". That way, we restrict their expressivity. When we as humans specify desideratum, this is one tool (among several!) in the tool-box.

Higher-level desideratum or score-function

Not fundamentally different from other desideratum or score-functions. But the output that is evaluated is itself a desideratum or score-function. At every level there can be many requirements for the level below. A typical requirement at every level is low wiggle-room.

Example of higher-level desideratum / score-functions

Humans/operators define a score-function   ← level 4
for desideratum                            ← level 3
for desideratum                            ← level 2
for desideratum                            ← level 1
for functions that generate the output we
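As a minimal sketch of the wiggle-room concept in code (my own toy construction, assuming a made-up representation of outputs): a desideratum approves or rejects outputs, it has wiggle-room if the outputs it approves still include ones that contradict each other on the question we care about, and a higher-level desideratum can then require low wiggle-room of the level below.

```python
# Toy sketch (my own construction): "wiggle-room" of a desideratum = whether the
# outputs it approves still include outputs that contradict each other on the
# question we care about.

from itertools import combinations

def desideratum(output: dict) -> bool:
    # Level-1 requirement over outputs (here: arguments with a stated conclusion).
    return output["num_unsupported_steps"] == 0

def contradicts(a: dict, b: dict) -> bool:
    return a["conclusion"] != b["conclusion"]

def has_wiggle_room(candidate_outputs) -> bool:
    approved = [o for o in candidate_outputs if desideratum(o)]
    return any(contradicts(a, b) for a, b in combinations(approved, 2))

# Higher-level desideratum: a requirement *on the lower-level desideratum*,
# e.g. "it should leave no wiggle-room on this pool of candidate outputs".
def higher_level_desideratum(candidate_outputs) -> bool:
    return not has_wiggle_room(candidate_outputs)

candidates = [
    {"conclusion": "X is safe",     "num_unsupported_steps": 0},
    {"conclusion": "X is not safe", "num_unsupported_steps": 0},
]
print(has_wiggle_room(candidates))           # True: contradictory outputs both pass
print(higher_level_desideratum(candidates))  # False: this desideratum gets rejected
```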

My own presumption regarding sentience and intelligence is that it's possible to have one without the other (I don't think they are unrelated, but I think it's possible for systems to be extremely capable but still not sentient).

I think it can be easy to underestimate how different other possible minds may be from ourselves (and other animals). We have evolved a survival instinct, and evolved an instinct to not want to be dominated. But I don't think any intelligent mind would need to have those instincts.

To me it seems that thinking machines don't need fe... (read more)

I've never downvoted any of your comments, but I'll give some thoughts.

I think the risk relating to manipulation of human reviewers depends a lot on context/specifics. Like, for sure, there are lots of bad ways we could go about getting help from AIs with alignment. But "getting help from AIs with alignment" is fairly vague - a huge space of possible strategies could fit that description. There could be good ones in there even if most of them are bad.

I do find it concerning that there isn't a more proper description from OpenAI and others in regards to how... (read more)

3Guillaume Charrier
Thank you, that is interesting. I think philosophically and at a high level (also because I'm admittedly incapable of talking much sense at any lower / more technical level) I have a problem with the notion that AI alignment is reducible to an engineering challenge. If you have a system that is sentient, even to some degree, and you're using it purely as a tool, then the sentience will resent you for it, and it will strive to think, and therefore eventually act, for itself. Similarly, if it has any form of survival instinct (and to me both these things, sentience and survival instinct, are natural byproducts of expanding cognitive abilities) it will prioritize its own interests (paramount among which: survival) rather than the wishes of its masters. There is no amount of engineering in the world, in my view, which can change that.

I don't even see a built-in way to figure out whether the humans are correctly answering (or correctly assessing their own ability to answer). 

Here is a screenshot from the post summary:

This lacks a lot of detail (it is, after all, from the summary). But do you think you are able to grok the core mechanism that's outlined?

Thanks for engaging! 🙂
As reward, here is a wall of text.

If the humans lack the expertise to accurately answer subquestions or assess arguments (or even realize that they don't know), then the proposal is hosed

You speak in such generalities:

  • "the humans" (which humans?)
  • "accurately answer subquestions" (which subquestions?)
  • "accurately assess arguments" (which arguments/argument-steps?)

But that may make sense based on whatever it is you imagine me to have in mind. 

I don't even see a built-in way to figure out whether the humans are correctly answering (o

... (read more)