Thanks for the reply :) I'll try to convey some of my thinking, but I don't expect great success. I'm working on more digestible explainers, but this is a work in progress, and I have nothing good that I can point people to as of now.
(...) part of the explanation here might be "if the world is solved by AI, we do actually think it will probably be via doing some concrete action in the world (e.g., build nanotech), not via helping with alignment (...)
Yeah, I guess this is where a lot of the differences in our perspective are located.
...if the world is solved b
Thanks for the reply :) Feel free to reply further if you want, but I hope you don't feel obliged to do so[1].
"Fill the cauldron" examples are (...) not examples where it has the wrong beliefs.
I have never ever been confused about that!
...It's "even simple small-scale tasks are unnatural, in the sense that it's hard to define a coherent preference ordering over world-states such that maximizing it completes the task and has no serious negative impact; and there isn't an obvious patch that overcomes the unnaturalness or otherwise makes it predictably easier to
Your reply here says much of what I would expect it to say (and much of it aligns with my impression of things). But I'm a bit confused by why you focused so much on "fill the cauldron"-type examples (if I remember correctly, I was confused by this in 2016 as well).
"Fill the cauldron" examples are examples where the cauldron-filler has the wrong utility function, not examples where it has the wrong beliefs. E.g., this is explicit in https://intelligence.org/2016/12/28/ai-alignment-why-its-hard-and-where-to-start/
The idea of the "fill the cauldron" examples isn't "the AI is bad at NLP and therefore doesn't understand what we mean when we say 'fill', 'cauldron', etc." It's "even simple small-scale tasks are unnatural, in the sense that it's hard to define a coherent preference ordering over world-states such that...
This tweet from Eliezer seems relevant btw. I would give similar answers to all of the questions he lists that relate to nanotechnology (but I'd be somewhat more hedged/guarded - e.g. replacing "YES" with "PROBABLY" for some of them).
Thanks for engaging
Likewise :)
Also, sorry about the length of this reply. As the adage goes: "If I had more time, I would have written a shorter letter."
From my perspective you seem simply very optimistic on what kind of data can be extracted from unspecific measurements.
That seems to be one of the relevant differences between us. Although I don't think it is the only difference that causes us to see things differently.
Other differences (I guess some of these overlap):
I suspect my own intuitions regarding this kind of thing are similar to Eliezer's. It's possible that my intuitions are wrong, but I'll try to share some thoughts.
It seems that we think quite differently when it comes to this, and probably it's not easy for us to achieve mutual understanding. But even if all we do here is to scratch the surface, that may still be worthwhile.
As mentioned, maybe my intuitions are wrong. But maybe your intuitions are wrong (or maybe both). I think a desirable property of plans/strategies for alignment would be robustness to e...
None of these are what you describe, but here are some places people can be pointed to:
AGI-assisted alignment in Dath Ilan (excerpt from here)
Suppose Dath Ilan got into a situation where they had to choose the strategy of AGI-assisted alignment, and didn't have more than a few years to prepare. Dath Ilan wouldn't actually get themselves into such a situation, but if they did, how might they go about it?
I suspect that among other things they would:
This is from What if Debate and Factored Cognition had a mutated baby? (a post I started on, but I ended up disregarding this draft and starting anew). This is just an excerpt from the intro/summary (it's not the entire half-finished draft).
Resembles Debate, but:
We'd have separate systems that (among oth
Below are some concepts related to extracting aligned capabilities. The main goal is to be able to verify specialized functions without needing humans to look at the source code, and without being able to safely/robustly score outputs for the full range of inputs.
Some things we need:
An example of an aligned capability we might want woul...
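As a loose illustration of one ingredient of what "verifying without reading the source code" could look like (a sketch with hypothetical function names, not part of the original draft): an untrusted specialized function is compared, black-box, against a slow but trusted reference on a restricted class of inputs, rather than having its outputs scored over the full input range.

```python
import random

def untrusted_sort(xs):
    """Stand-in for a specialized function whose source code humans never inspect."""
    return sorted(xs)

def trusted_reference_sort(xs):
    """A slow but simple implementation we are willing to trust."""
    result = list(xs)
    for i in range(len(result)):
        for j in range(i + 1, len(result)):
            if result[j] < result[i]:
                result[i], result[j] = result[j], result[i]
    return result

def spot_check(candidate, reference, trials=1000, max_len=20):
    """Black-box comparison on randomly sampled small inputs (not the full input range)."""
    for _ in range(trials):
        xs = [random.randint(-100, 100) for _ in range(random.randint(0, max_len))]
        if candidate(xs) != reference(xs):
            return False
    return True

print(spot_check(untrusted_sort, trusted_reference_sort))  # True if no mismatch was found
```

Sorting is of course a toy stand-in; the point is only that the checking procedure never looks inside the candidate function.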
I would also like to see more work where people make less positive/optimistic assumptions. I think of it as a good thing that different approaches to alignment are being explored, and would like to see more of that in general (both in terms of breadth and depth).
I guess there are many possible ways of trying to categorize/conceptualize approaches to alignment theorizing. One is by asking "when talking/thinking about the methodology, what capabilities are assumed to be in place?".
I'm not sure about this, but unless I'm mistaken[1], a good amount of the work...
If humans (...) machine could too.
From my point of view, humans are machines (even if not typical machines). Or, well, some will say that by definition we are not - but that's not so important really ("machine" is just a word). We are physical systems with certain mental properties, and therefore we are existence proofs that physical systems with those mental properties are possible.
machine can have any level of intelligence, humans are in a quite narrow spectrum
True. Although if I myself somehow could work/think a million times faster, I think I'd...
Why call it an assumption at all?
Partly because I was worried about follow-up comments that were kind of like "so you say you can prove it - well, why aren't you doing it then?".
And partly because I don't make a strict distinction between "things I assume" and "things I have convinced myself of, or proved to myself, based on things I assume". I do see there as sort of being a distinction along such lines, but I see it as blurry.
Something that is derivable from axioms is usually called a theorem.
If I am to be nitpicky, maybe you meant "derived" and not "der...
(...) if it's supported by argument or evidence, but if it is, then it's no mere assumption.
I do think it is supported by arguments/reasoning, so I don't think of it as an "axiomatic" assumption.
A follow-up to that (not from you specifically) might be "what arguments?". And - well, I think I pointed to some of my reasoning in various comments (some of them under deleted posts). Maybe I could have explained my thinking/perspective better (even if I wouldn't be able to explain it in a way that's universally compelling 🙃). But it's not a trivial task t...
I cannot help you to be less wrong if you categorically rely on intuition about what is possible and what is not.
I wish I had something better to base my beliefs on than my intuitions, but I do not. My belief in modus ponens, my belief that 1+1=2, my belief that my having observed gravity in the past makes me likely to observe it in the future, my belief that if views are in logical contradiction they cannot both be true - all of this is (the way I think of it) grounded in intuition.
Some of my intuitions I regard as much stronger/more robust than others.
When my...
Like with many comments/questions from you, answering this question properly would require a lot of unpacking. Although I'm sure that also is true of many questions that I ask, as it is hard to avoid (we all have limited communication bandwidth) :)
In this last comment, you use the term "science" in a very different way from how I'd use it (like you sometimes also do with other words, such as for example "logic"). So if I was to give a proper answer I'd need to try to guess what you mean, make it clear how I interpret what you say, and so on (not just answe...
It seems that 2 + 2 = 4 is also an assumption for you.
Yes (albeit a very reasonable one).
Not believing (some version of) that claim would typically make minds/AGIs less "capable", and I would expect more or less all AGIs to hold (some version of) that "belief" in practice.
I don't think it is possible to find consensus if we do not follow the same rules of logic.
Here are examples of what I would regard as rules of logic: https://en.wikipedia.org/wiki/List_of_rules_of_inference (the ones listed here don't encapsulate all of the rules of inference tha...
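To give one concrete example from that list, modus ponens in schematic form:

$$\frac{P \to Q \qquad P}{Q}$$

That is, from "P implies Q" and "P", infer "Q".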
I do have arguments for that, and I have already mentioned some of them earlier in our discussion (you may not share that assessment, despite us being relatively close in mind-space compared to most possible minds, but oh well).
Some of the more relevant comments from me are on one of the posts that you deleted.
As I mention here, I think I'll try to round off this discussion. (Edit: I had a malformed/misleading sentence in that comment that should be fixed now.)
Every assumption is incorrect unless there is evidence.
Got any evidence for that assumption? 🙃
Answer to all of them is yes. What is your explanation here?
Well, I don't always "agree"[1] with ChatGPT, but I agree in regards to those specific questions.
...
I saw a post where you wanted people to explain their disagreement, and I felt inclined to do so :) But it seems now that neither of us feel like we are making much progress.
Anyway, from my perspective much of your thinking here is very misguided. But not more misguided than e.g. "proofs" for Go...
Do you think you can deny existence of an outcome with infinite utility?
To me, according to my preferences/goals/inclinations, there are conceivable outcomes with infinite utility/disutility.
But I think it is possible (and feasible) for a program/mind to be extremely capable, and affect the world, and not "care" about infinite outcomes.
The fact that things "break down" is not a valid argument.
I guess that depends on what's being discussed. Like, it is something to take into account/consideration if you want to prove something while referencing utility-functions that reference infinities.
About universally compelling arguments?
First, a disclaimer: I do think there are "beliefs" that most intelligent/capable minds will have in practice. E.g. I suspect most will use something like modus ponens, most will update beliefs in accordance with statistical evidence in certain ways, etc. I think it's possible for a mind to be intelligent/capable without strictly adhering to those things, but for sure I think there will be a correlation in practice for many "beliefs".
Questions I ask myself are:
With all the interactions we had, I've got an impression that you are more willing to repeat what you've heard somewhere instead of thinking logically.
Some things I've explained in my own words. In other cases, where someone else has already explained something well, I've shared a URL to that explanation.
more willing to repeat what you've heard somewhere instead of thinking logically
This seems to support my hypothesis of you "being so confident that we are the ones who "don't get it" that it's not worth it to more carefully read the posts that are l...
What about "I think therefore I am"? Isn't it universally compelling argument?
Not even among the tiny tiny section of mind-space occupied by human minds:
Notice also that "I think therefore I am" is an is-statement (not an ought-statement / something a physical system optimizes towards).
As for me personally, I don't disagree that I exist, but I see it as a fairly vague/ill-defined statement. And it's not a logical necessity, even if we presume assumptions that most humans would share. Another logical possibility would be Boltzmann brains (unless a Bolt...
Agreed (more or less). I have pointed him to this post earlier. He has given no signs so far of comprehending it, or even reading it and trying to understand what is being communicated to him.
I'm saying this more directly than I usually would @Donatas, since you seem insistent on clarifying a disagreement/misunderstanding you think is important for the world, while it seems (as far as I can see) that you're not comprehending all that is communicated to you (maybe due to being so confident that we are the ones who "don't get it" that it's not worth it to mo...
He didn't say that "infinite value" is logically impossible. He described it as an assumption.
When saying "is possible", I'm not sure if he meant "is possible (conceptually)" or "is possible (according to the ontology/optimization-criteria of any given agent)". I think the latter would be most sensible.
He later said: "I think initially specifying premises such as these more precisely initially ensures the reasoning from there is consistent/valid.". Not sure if I interpreted him correctly, but I saw it largely as an encouragement to think more explicitly abou...
Same traits that make us intelligent (ability to logically reason), make us power seekers.
Well, I do think the two are connected/correlated. And arguments relating to instrumental convergence are a big part of why I take AI risk seriously. But I don't think strong ability in logical reasoning necessitates power-seeking "on its own".
I think it is wrong to consider Pascal's mugging a vulnerability.
For the record, I don't think I used the word "vulnerability", but maybe I phrased myself in a way that implied me thinking of things that way. And maybe I also ...
Most humans are not obedient/subservient to others (at least not maximally so). But also: Most humans would not exterminate the rest of humanity if given the power to do so. I think many humans, if they became a "singleton", would want to avoid killing other humans. Some would also be inclined to make the world a good place to live for everyone (not just other humans, but other sentient beings as well).
From my perspective, the example of humans was intended as "existence proof". I expect AGIs we develop to be quite different from ourselves. I wouldn't be i...
I'd argue that the only reason you do not comply with Pascal's mugging is because you don't have unavoidable urge to be rational, which is not going to be the case with AGI.
I'd agree that among superhuman AGIs that we are likely to make, most would probably be prone towards rationality/consistency/"optimization" in ways I'm not.
I think there are self-consistent/"optimizing" ways to think/act that wouldn't make minds prone to Pascal's muggings.
For example, I don't think there is anything logically inconsistent about e.g. trying to act so as to maximize the ...
Hopefully I'm wrong, please help me find a mistake.
There is more than just one mistake here IMO, and I'm not going to try to list them.
Just the title alone ("AGI is uncontrollable, alignment is impossible") is totally misguided IMO. It would, among other things, imply that brain emulations are impossible (humans can be regarded as a sort of AGI, and it's not impossible for humans to be aligned).
But oh well. I'm sure your perspectives here are earnestly held / it's how you currently see things. And there are no "perfect" procedures for evaluating how much t...
If an outcome with infinite utility is presented, then it doesn't matter how small its probability is: all actions which lead to that outcome will have to dominate the agent's behavior.
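For concreteness, the quoted claim is essentially the observation that, for a naive expected-utility maximizer assigning probability $p > 0$ to the infinite-utility outcome,

$$\mathbb{E}[U(a)] = p \cdot \infty + (1 - p) \cdot U_{\text{finite}} = \infty,$$

so any action with nonzero probability of reaching that outcome swamps every action with zero probability of doing so.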
My perspective would probably be more similar to yours (maybe still with substantial differences) if I had the following assumptions:
It seems that you do not recognize https://www.lesswrong.com/tag/pascal-s-mugging .
Not sure what you mean by "recognize". I am familiar with the concept.
But to be honest most of statements that we can think of may be true and unknowable, for example "aliens exist", "huge threats exist", etc.
"huge threat" is a statement that is loaded with assumptions that not all minds/AIs/agents will share.
Can you prove that there cannot be any unknowable true statement that could be used for Pascal's mugging?
Used for Pascal's mugging against who? (Humans? Coffee machine...
Fitch's paradox of knowability and Gödel's incompleteness theorems prove that there may be true statements that are unknowable.
Independently of Gödel's incompleteness theorems (which I have heard of) and Fitch's paradox of knowability (which I had not heard of), I do agree that there can be true statements that are unknown/unknowable (including relatively "simple" ones) 🙂
For example "rational goal exists" may be true and unknowable. Therefore "rational goal may exist" is true. (...) Do you agree?
I don't think it follows from "there may be statements...
Why do you think your starting point is better?
I guess there are different possible interpretations of "better". I think it would be possible for software-programs to be much more mentally capable than me across most/all dimensions, and still not have "starting points" that I would consider "good" (for various interpretations of "good").
As I understand you assume different starting-point.
I'm not sure. Like, it's not as if I don't have beliefs or assumptions or guesses relating to AIs. But I think I probably make less general/universal assumptions that I'd ...
In my opinion the optimal behavior is
Not sure what you mean by "optimal behavior". I think I can see how things make sense if the starting point is that there are these things called "goals", and (I, the mind/agent) am motivated to optimize for "goals". But I don't assume this as an obvious/universal starting-point (be that for minds in general, extremely intelligent minds in general, minds in general that are very capable and might have a big influence on the universe, etc).
...This is a common mistake to assume, that if you don't know your goal, then it do
I assume you mean "provide definitions"
More or less / close enough 🙂
Here they write: "A rational agent is an entity which has a utility function, forms beliefs about its environment, evaluates the consequences of possible actions, and then takes the action which maximizes its utility."
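Rendered as a formula (one standard way of formalizing that definition, not something the wiki itself spells out):

$$a^* = \arg\max_{a \in A} \; \mathbb{E}[\,U \mid a\,]$$

i.e. the agent picks whichever available action has the highest expected utility under its beliefs.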
I would not share that definition, and I don't think most other people commenting on this post would either (I know there is some irony to that, given that it's the definition given on the LessWrong wiki).
Often the words/conce...
No. That's exactly the point I try to make by saying "Orthogonality Thesis is wrong".
Thanks for the clarification 🙂
"There is no rational goal" is an assumption in Orthogonality thesis
I suspect arriving at such a conclusion may result from thinking of utility maximizers as more of a "platonic" concept, as opposed to thinking of it from a more mechanistic angle. (Maybe I'm being too vague here, but it's an attempt to briefly summarize some of my intuitions into words.)
I'm not sure what you would mean by "rational". Would computer programs need to be "rationa...
why would you assume that agent does not care about future states? Do you have a proof for that?
Would you be able to Taboo Your Words for "agent", "care" and "future states"? If I were to explain my reasons for disagreement it would be helpful to have a better idea of what you mean by those terms.
Hi, I didn't downvote, but below are some thoughts from me 🙂
Some of my comment may be pointing out things you already agree with / are aware of.
I'd like to highlight, that this proof does not make any assumptions, it is based on first principles (statements that are self-evident truths).
First principles are assumptions. So if first principles are built in, then it's not true that it doesn't make assumptions.
I do not know my goal (...) I may have a goal
This seems to imply that the agent should have a starting-point that is (something akin to) "I s...
Here is my attempt at a shorter answer (although it didn’t end up as short as planned) 🙂
I’m also being more simplistic here (at times deliberately so), in the hope of making “core” concepts digestible with less effort.
If you don’t respond here you probably won’t hear from me in a while.
It can, sure, but how can a human get it to state those regularities (...)?
Score-functions would score argument-step-networks. It is these score-functions that would leverage regularities for when human evaluations are “good”/correct.
Here are some things that mig...
I think I'm probably missing the point here somehow and/or that this will be perceived as not helpful. Like, my conceptions of what you mean, and what the purpose of the theorem would be, are both vague.
But I'll note down some thoughts.
...Next, the world model. As with the search process, it should be a subsystem which interacts with the rest of the system/environment only via a specific API, although it’s less clear what that API should be. Conceptually, it should be a data structure representing the world.
(...)
The search process should be able to run querie
NAH, refers to the idea that lower-dimensional summaries or abstractions used by humans in day-to-day thought and language are natural and convergent across cognitive systems
I guess whether there is such convergence isn't a yes-no-question, but a question of degree?
Very regularly I experience that thoughts I want to convey don't have words that clearly correspond to the concepts I want to use. So often I'll use words/expressions that don't match in a precise way, and sometimes there aren't even words/expressions that can be used to vaguely gesture at what...
Not rewarding contradictory conclusions is not a sufficient condition for a score-function to reward truth, or not reward falsehood.
Indeed!
It's a necessary but not sufficient condition.
It can, sure, but how can a human get it to state those regularities (...)?
Summary:
The regularities are expressed in terms of score-functions (that score argument-step-networks)[1]. We can score these score-functions based on simplicity/brevity, and restrict what they can do (make it so that they have to be written within human-defined confines).
I posit that we probably...
One concept I rely upon is wiggle-room (including higher-level wiggle-room). Here are some more abstract musings relating to these concepts:
Desideratum: A function that determines whether some output is approved or not (that output may itself be a function).
Score-function: A function that assigns a score to some output (that output may itself be a function).
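A minimal way to render these two notions as code (a sketch; the type names and the thresholding example are just illustrative, not part of the original definitions):

```python
from typing import Any, Callable

# Desideratum: maps an output (which may itself be a function) to approved / not approved.
Desideratum = Callable[[Any], bool]

# Score-function: maps an output (which may itself be a function) to a numeric score.
ScoreFunction = Callable[[Any], float]

def desideratum_from_score(score: ScoreFunction, threshold: float) -> Desideratum:
    """One way the two notions relate: thresholding a score yields an approval test."""
    return lambda output: score(output) >= threshold
```

The conversion function gestures at one direction of the "can be described in terms of the other" relationship mentioned next.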
Some different ways of talking about (roughly) the same thing
Here are some different concepts where each often can be described or thought of in terms of the other:
At a quick skim, I don't see how that proposal addresses the problem at all. (...) I don't even see a built-in way to figure out whether the humans are correctly answering (or correctly assessing their own ability to answer).
Here are additional attempts to summarize. These ones are even shorter than the screenshot I showed earlier.
More clear now?
I'm trying to find better ways of explaining these concepts succinctly (this is a work in progress). Below are some attempts at tweet-length summaries.
280 character limit
We'd have separate systems that (among other things):
I'll explain why #2 doesn't rely on us already having obtained honest systems.
Resembles Debate, but:
My own presumption regarding sentience and intelligence is that it's possible to have one without the other (I don't think they are unrelated, but I think it's possible for systems to be extremely capable but still not sentient).
I think it can be easy to underestimate how different other possible minds may be from ourselves (and other animals). We have evolved a survival instinct, and evolved an instinct to not want to be dominated. But I don't think any intelligent mind would need to have those instincts.
To me it seems that thinking machines don't need fe...
I've never downvoted any of your comments, but I'll give some thoughts.
I think the risk relating to manipulation of human reviewers depends a lot on context/specifics. Like, for sure, there are lots of bad ways we could go about getting help from AIs with alignment. But "getting help from AIs with alignment" is fairly vague - a huge space of possible strategies could fit that description. There could be good ones in there even if most of them are bad.
I do find it concerning that there isn't a more proper description from OpenAI and others in regards to how...
I don't even see a built-in way to figure out whether the humans are correctly answering (or correctly assessing their own ability to answer).
Here is a screenshot from the post summary:
This lacks a lot of detail (it is, after all, from the summary). But do you think you are able to grok the core mechanism that's outlined?
Thanks for engaging! 🙂
As a reward, here is a wall of text.
If the humans lack the expertise to accurately answer subquestions or assess arguments (or even realize that they don't know), then the proposal is hosed
You speak in such generalities:
But that may make sense based on whatever it is you imagine me to have in mind.
...I don't even see a built-in way to figure out whether the humans are correctly answering (o
I'm a bit unsure about how to interpret you here.
In my original comment, I used terms such as positive/optimistic assumptions and simplifying assumptions. When doing that, I meant to refer to simplifying assumptions that were made so as to abstract away some parts of the problem.
Good point (I should have written my comment in such a way that pointing out this didn'...