All of drocta's Comments + Replies

drocta
10

If you are interested in convincing people who so far think "It is impossible for the existence of an artificial superintelligence to produce desirable outcomes" otherwise, you should have a meaning of "an artificial superintelligence" in mind that is like what they mean by it.

If one suspects that it is impossible for an artificial superintelligence to produce desirable outcomes, then when one considers "among possible futures, the one(s) that have as good or better outcomes than any other possible future", one would suppose that these perhaps are not ones... (read more)

drocta
10

(Sorry for the late response, I hadn't checked my LW inbox much since my previous comments.)
If it were the case that such a function exists but cannot possibly be implemented (any implementation would be implementation as a state), and no other function satisfying the same constraints could possibly be implemented, that seems like it would be a case of it being impossible to have the aligned ASI. (Again, not that I think this is the case, just considering the validity of argument.)

The function that is being demonstrated to exist is the lookup table that produces the appropriate actions, yes? The one that is supposed to be implementable by a finite depth circuit?

drocta
10

It seems to make sense that if hiring an additional employee provides marginal shareholder value, that the company will hire additional employees. So, when the company stops hiring employees, it seems reasonable that this is because the marginal benefit of hiring an additional employee is not positive. However, I don't see why this should suggest that the company is likely to hire an employee that provides a marginal value of 0 or negative.

"Number of employees" is not a continuous variable. When hiring an additional employee, how this changes what the marg... (read more)

drocta
10

Not if the point of the argument is to establish that a superintelligence is compatible with achieving the best possible outcome.

Here is a parody of the issue, which is somewhat unfair and leaves out almost all of your argument, but which I hope makes clear the issue I have in mind:

"Proof that a superintelligence can lead to the best possible outcome: Suppose by some method we achieved the best possible outcome. Then, there's no properties we would want a superintelligence to have beyond that, so let's call however we achieved the best possible outcome, 'a... (read more)

2Roko
The problem with this is that people use the word "superintelligence" without a precise definition. Clearly they mean some computational process. But nobody who uses the term colloquially defines it. So, I will make the assertion that if a computational process achieves the best possible outcome for you, it is a superintelligence. I don't think anyone would disagree with that. If you do, please state what other properties you think a "superintelligence" must have other than being a computational process that achieves the best possible outcome.
drocta
10

Yes, I knew the cardinalities in question were finite. The point applies regardless though. For any set X, there is no injection from 2^X to X. In the finite case, this is 2^n > n for all natural numbers n.

If there are N possible states, then the number of functions from possible states to {0,1} is 2^N , which is more than N, so there is some function from the set of possible states to {0,1} which is not implemented by any state.
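A quick sketch of the diagonal argument behind this in the finite case (notation here is mine, not from the original exchange):

```latex
% For any assignment f : X -> ({0,1}^X) of a function to each state,
% consider the "diagonal" function
\[ D(x) \;=\; 1 - f(x)(x) \qquad \text{for each state } x \in X . \]
% D differs from f(x) at the input x, so D is not f(x) for any x.
% Hence at most |X| of the 2^{|X|} functions X -> {0,1} are implemented by states.
```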

2Roko
I never said it had to be implemented by a state. That is not the claim: the claim is merely that such a function exists.
drocta
90

If your argument is, "if it is possible for humans to produce some (verbal or mechanical) output, then it is possible for a program/machine to produce that output", then, that's true I suppose?

I don't see why you specified "finite depth boolean circuit".

While it does seem like the number of states for a given region of space is bounded, I'm not sure how relevant this is. Not all possible functions from states to {0,1} (or to some larger discrete set) are implementable as some possible state, for cardinality reasons.

I guess maybe that's why you mentioned th... (read more)

2Roko
Until I wrote this proof, it was a live possibility that aligned superintelligence is in fact logically impossible.
2Roko
All cardinalities here are finite. The set of generically realizable states is a finite set because they each have a finite and bounded information content description (a list of instructions to realize that state, which is not greater in bits than the number of neurons in all the human brains on Earth).
2Roko
Isn't it enough that it achieves the best possible outcome? What other criteria do you want a "superintelligence" to have?
drocta
10

Yes. I believe that is consistent with what I said.

"not((necessarily, for each thing) : has [x] -> those [x] are such that P_1([x]))"
is equivalent to, " (it is possible that something) has [x], but those [x] are not such that P_1([x])"

not((necessarily, for each thing) : has [x] such that P_2([x]) -> those [x] are such that P_1([x]))
is equivalent to "(it is possible that something) has [x], such that P_2([x]), but those [x] are not sure that P_1([x])" .

The latter implies the former, as (A and B and C) implies (A and C), and so the latter is stronger, not weaker, than the former.

Right?

drocta
10

Doesn't "(has preferences, and those preferences are transitive) does not imply (completeness)" imply (has preferences) does not imply (completeness)" ? Surely if "having preferences" implied completeness, then "having transitive preferences" would also imply completeness?

1martinkunev
Usually "has preferences" is used to convey that there is some relation (between states?) which is consistent with the actions of the agent. Completeness and transitivity are usually considered additional properties that this relation could have.
drocta
10

"Political category" seems, a bit strong? Like, sure, the literal meaning of "processed" is not what people are trying to get at. But, clearly, "those processing steps that are done today in the food production process which were not done N years ago" is a thing we can talk about. (by "processing step" I do not include things like "cleaning the equipment", just steps which are intended to modify the ingredients in some particular way. So, things like, hydrogenation. This also shall not be construed as indicating that I think all steps that were done N years ago were better than steps done today.)

drocta
70

For example, it is not clear to me if once I consider a program that outputs 0101 I will simply ignore other programs that output that same thing plus one bit (e.g. 01010).

No, the thing about prefixes is about what strings encode a program, not about their outputs.
The purpose of this is mostly just to define a prior over possible programs, in a way that conveniently ensures that the total probability assigned over all programs is at most 1. Seeing as it still works for different choices of language, it probably doesn't need to exactly use this kind of defi... (read more)

3mukashi
Thank you for the comprehensive answer and for correcting the points where I wasn't clear. Also, thank you for pointing out that the Kolmogorov complexity of a program is the length of the program that writes that program. The complexity of the algorithms was totally arbitrary and for the sake of the example. I still have some doubts, but everything is more clear now (see my answer to Charlie Steiner also)
drocta
10

Thanks! The specific thing I was thinking about most recently was indeed specifically about context length, and I appreciate the answer tailored to that, as it basically fully addresses my concerns in this specific case.

However, I also did mean to ask the question more generally. I kinda hoped that the answers might also be helpful to others who had similar questions (as well as if I had another idea meeting the same criteria in the future), but maybe thinking other people with the same question would find the question+answers here, was not super realistic, idk.

Answer by drocta
41

Here is my understanding:
we assume a programming language where a program is a finite sequence of bits, and such that no program is a prefix of another program. So, for example, if 01010010 is a program, then 0101 is not a program.
Then, the (not-normalized) prior probability for a program is 2^(-L), where L is the length of the program in bits.
Why that probability?
If you take any infinite sequence of bits, then, because no program is a prefix of any other program, at most one program will be a prefix of that sequence of bits.
If you randomly (with uniform distribution) select an infi... (read more)
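A minimal numerical sketch of this (the "programs" here are made-up bit strings, purely for illustration): for any prefix-free set, the weights 2^(-length) sum to at most 1 (Kraft's inequality), which is what makes this usable as an un-normalized prior.

```python
# Sketch: prefix-free "programs" (illustrative bit strings, not real programs)
# each get weight 2^(-length); for a prefix-free set these weights sum to <= 1.

programs = ["0", "10", "110", "1110", "1111"]  # no string is a prefix of another

def is_prefix_free(strings):
    return not any(a != b and b.startswith(a) for a in strings for b in strings)

assert is_prefix_free(programs)

weights = {p: 2 ** -len(p) for p in programs}
print(weights)                 # {'0': 0.5, '10': 0.25, '110': 0.125, ...}
print(sum(weights.values()))   # 1.0 here; in general at most 1 (Kraft inequality)
```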

2mukashi
The part I understood is that you weigh the programs based on the length in bits; the longer the program, the less weight it has. This makes total sense. I am not sure that I understand the prefix thing and I think that's relevant. For example, it is not clear to me if once I consider a program that outputs 0101 I will simply ignore other programs that output that same thing plus one bit (e.g. 01010). I also still find fuzzy (and now at least I can put my finger on it) the part where Solomonoff induction is extended to deal with randomness. Let me see if I can make my question more specific: Let's imagine for a second that we live in a universe where only the next programs could be written:
* A) A program that produces deterministically a given sequence of five digits (there are 2^5 of these programs)
* B) A program that produces deterministically a given sequence of 6 digits (there are 2^6 of them)
* C) A program that produces 5 random coin flips with p=0.5
The programs in A have 5 bits of Kolmogorov complexity each. The programs in B have 6 bits. The program C has 4. We observe the sequence O = HTHHT. I measure the likelihood for each possible model. I discard the models with L = 0:
A) There is a model here with likelihood 1
B) There are 2 models here, each of them with likelihood 1 too
C) This model has likelihood 2^-5
Then, things get murky: the priors for each model will be 2^-5 for model A, 2^-6 for model B and 2^-4 for model C, according to their Kolmogorov complexity?
drocta
20

Well, I was kinda thinking of  as being, say, a distribution of human behaviors in a certain context (as filtered through a particular user interface), though, I guess that way of doing it would only make sense within limited contexts, not general contexts where whether the agent is physically a human or something else, would matter. And in this sort of situation, well, the action of "modify yourself to no-longer be a quantilizer" would not be in the human distribution, because the actions to do that are not applicable to humans (as humans are, ... (read more)

drocta
10

For the "Crappy Optimizer Theorem", I don't understand why condition 4, that if  , then  , isn't just a tautology[1]. Surely if  , then no-matter what  is being used,
as  , then letting  , then  , and so  .

I guess if the 4 conditions are seen as conditions on a function  (where they are written for  ), then it is no longer automatic, and it is just when specifying... (read more)

drocta
20

I thought CDT was considered not reflectively-consistent because it fails Newcomb's problem?
(Well, not if you define reflective stability as meaning preservation of anti-Goodhart features, but, CDT doesn't have an anti-Goodhart feature (compared to some base thing) to preserve, so I assume you meant something a little broader?) 
Like, isn't it true that a CDT agent who anticipates being in Newcomb-like scenarios would, given the opportunity to do so, modify itself to be not a CDT agent? (Well, assuming that the Newcomb-like scenarios are of the form "a... (read more)

2Jeremy Gillen
Good point on CDT, I forgot about this. I was using a more specific version of reflective stability. > - wait.. that doesn't seem right..? Yeah this is also my reaction. Assuming that bound seems wrong. I think there is a problem with thinking of ν as a known-to-be-acceptably-safe agent, because how can you get this information in the first place? Without running that agent in the world? To construct a useful estimate of the expected value of the "safe"-agent, you'd have to run it lots of times, necessarily sampling from its most dangerous behaviours. Unless there is some other non-empirical way of knowing an agent is safe? Yeah I was thinking of having large support of the base distribution. If you just rule-in behaviours, this seems like it'd restrict capabilities too much.
drocta
10

Whoops, yes, that should have said  , thanks for the catch! I'll edit to make that fix.

Also, yes, what things between  and   should be sent to, is a difficulty..
A thought I had which, on inspection doesn't work, is that (things between   and )  could be sent to  , but that doesn't work, because  might be terminal, but (thing between  and ) isn't terminal. It seems like the only thing that would always work would be for them to be sent to somethin... (read more)

drocta
*10

A thought on the "but what if multiple steps in the actual-algorithm correspond to a single step in an abstracted form of the algorithm?" thing :
This reminds me a bit of, in the topic of "Abstract Rewriting Systems", the thing that the → vs →* distinction handles (the asterisk just indicating taking the transitive reflexive closure).

Suppose we have two abstract rewriting systems A and B.
(To make it match more closely what you are describing, we can suppose that every node has at most one outgoing arrow, to make... (read more)
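A toy illustration of the kind of condition this gestures at (the particular systems and abstraction map below are mine, not from the comment): each low-level step should map to zero or more high-level steps, i.e. f(a) →* f(a') whenever a → a'.

```python
# Toy check of a "several concrete steps per abstract step" simulation condition.
# Both rewriting systems are deterministic (at most one outgoing arrow per node).

step_A = {"a0": "a1", "a1": "a2", "a2": "a3"}          # low-level system (3 steps)
step_B = {"b0": "b1"}                                   # high-level system (1 step)
f = {"a0": "b0", "a1": "b0", "a2": "b0", "a3": "b1"}    # abstraction map

def reaches(step, x, y, limit=100):
    """y is reachable from x in zero or more steps (reflexive transitive closure)."""
    for _ in range(limit):
        if x == y:
            return True
        if x not in step:
            return False
        x = step[x]
    return False

# Simulation condition: every concrete step a -> a' satisfies f(a) ->* f(a').
ok = all(reaches(step_B, f[a], f[a2]) for a, a2 in step_A.items())
print(ok)  # True: three low-level steps collapse into one high-level step
```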

2Erik Jenner
Should it be f(a)→Bf(a′) at the end instead? Otherwise not sure what b is. I think this could be a reasonable definition but haven't thought about it deeply. One potentially bad thing is that f would have to be able to also map any of the intermediate steps between a an a' to f(a). I could imagine you can't do that for some computations and abstractions (of course you could always rewrite the computation and abstraction to make it work, but ideally we'd have a definition that just works). What I've been imagining instead is that the abstraction can specify a function that determines which are the "high-level steps", i.e. when f should be applied. I think that's very flexible and should support everything. But also, in practice the more important question may just be how to optimize over this choice of high-level steps efficiently, even just in the simple setting of circuits.
drocta
20

In the line that ends with "even if God would not allow complete extinction.", my impulse is to include " (or other forms of permanent doom)" before the period, but I suspect that this is due to my tendency to include excessive details/notes/etc. and probably best not to actually include in that sentence.

(Like, for example, if there were no more adult humans, only billions of babies grown in artificial wombs (in a way staggered in time) and then kept in a state of chemically induced euphoria until the age of 1, and then killed, that technically wouldn't be... (read more)

drocta
90

I want to personally confirm a lot of what you've said here. As a Christian, I'm not entirely freaked out about AI risk because I don't believe that God will allow it to be completely the end of the world (unless it is part of the planned end before the world is remade? But that seems unlikely to me.), but that's no reason that it can't still go very very badly (seeing as, well, the Holocaust happened).

In addition, the thing that seems to me most likely to be the way that God doesn't allow AI doom, is for people working on AI safety to succeed. One shouldn... (read more)

4Yaakov T
@drocta @Cookiecarver We started writing up an answer to this question for Stampy. If you have any suggestions to make it better I would really appreciate it. Are there important factors we are leaving out? Something that sounds off? We would be happy for any feedback you have either here or on the document itself https://docs.google.com/document/d/1tbubYvI0CJ1M8ude-tEouI4mzEI5NOVrGvFlMboRUaw/edit#
drocta
10

I don't understand why this comment has negative "agreement karma". What do people mean by disagreeing with it? Do they mean to answer the question with "no"?

drocta
30

First, I want to summarize what I understand to be what your example is an example of:
"A triple consisting of
1) A predicate P
2) the task of generating any single input x for which P(x) is true
3) the task of, given any x (and given only x, not given any extra witness information), evaluating whether P(x) is true
"

For such triples, it is clear, as your example shows, that the second task (the 3rd entry) can be much harder than the first task (the 2nd entry).
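A standard concrete instance of such a triple (my example, not from the thread): take P(x) = "x is a satisfiable CNF formula". Generating a single x with P(x) true is easy, because you can build the formula around a known assignment; deciding P for an arbitrary x is SAT. A rough sketch:

```python
import itertools, random

def generate_satisfiable_cnf(n_vars=4, n_clauses=6):
    """Task 2): produce an x with P(x) true, by building clauses around a known assignment."""
    assignment = [random.choice([True, False]) for _ in range(n_vars)]
    cnf = []
    for _ in range(n_clauses):
        vs = random.sample(range(n_vars), 3)
        # force the first literal of each clause to agree with the hidden assignment
        clause = [(v, assignment[v]) if i == 0 else (v, random.choice([True, False]))
                  for i, v in enumerate(vs)]
        cnf.append(clause)
    return cnf

def P(cnf, n_vars=4):
    """Task 3): given only x (no witness), decide P(x) -- here by brute force."""
    return any(all(any(a[v] == sign for v, sign in clause) for clause in cnf)
               for a in itertools.product([False, True], repeat=n_vars))

x = generate_satisfiable_cnf()
print(P(x))  # True, but the checker had to search; the generator did not
```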

_______

On the other hand, if instead one had the task of producing an exhaustive list of all x such tha... (read more)

drocta
1-2

As you know, there's a straightforward way, given any boolean circuit, to turn it into a version which is a tree: just take all the parts which have two wires coming out from a gate, and make duplicates of everything that leads into that gate.
I imagine that it would also be feasible to compute the size of this expanded-out version without having to actually expand out the whole thing?

Searching through normal boolean circuits, but using a cost which is based on the size if it were split into trees, sounds to me like it would give you the memoizati... (read more)
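A sketch of the "compute the tree-expanded size without expanding" idea (the circuit representation here is my own choice): the tree size of a gate is 1 plus the tree sizes of its inputs, and memoizing over the DAG keeps this linear in the circuit size even when the expanded tree would be exponentially larger.

```python
from functools import lru_cache

# Circuit as a DAG: each gate lists its input gates ([] for input wires).
# Made-up example: a chain of gates that each reuse the previous gate twice,
# so the tree expansion doubles at every level.
circuit = {"x": [], "g1": ["x", "x"], "g2": ["g1", "g1"], "g3": ["g2", "g2"]}

@lru_cache(maxsize=None)
def tree_size(gate):
    # Size of the fully expanded tree rooted at this gate, without building it.
    return 1 + sum(tree_size(child) for child in circuit[gate])

print(tree_size("g3"))  # 15, even though the DAG only has 4 nodes
```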

drocta
10

It seems like the 5th sentence has its ending cut off? "it tries to parcel credit and blame for a decision up to the input neurons, even when credit and blame" seems like it should continue [do/are x] for some x.

drocta
20

When you say "which yields a solution of the form ", are you saying that  yields that, or are you saying that  yields that? Because, for the former, that seems wrong? Specifically, the former should yield only things of the form  .

But, if the latter, then, I would think that the solutions would be more solutions than that?

Like, what about  ? (where, say,  and 
... (read more)

2[anonymous]
I meant the former (which you're right only has the solution with c1). I only added the c2 term to make it work for the inequality. As a result, it's only a subset of the solutions for the inequality. The (quite complicated!) expression you provided also works. 
drocta
80

As another "why not just" which I'm sure there's a reason for:

in the original circuits thread, they made a number of parameterized families of synthetic images which certain nodes in the network responded strongly to in a way that varied smoothly with the orientation parameter, and where these nodes detected e.g. boundaries between high-frequency and low-frequency regions at different orientations.

If given another such network of generally the same kind of architecture, if you gave that network the same images, if it also had analogous nodes, I'd expect th... (read more)

5johnswentworth
You've correctly identified most of the problems already. One missing piece: it's not necessarily node-activations which are the right thing to look at. Even in existing work, there's other ways interpretable information is embedded, like e.g. directions in activation space of a bunch of neurons, or rank-one updates to matrices.
drocta
20

I was surprised by how the fine-tuning was done for the verbalized confidence.

My initial expectation was that it would make the loss be based on like, some scoring rule based on the probability expressed and the right answer.

Though, come to think of it, I guess seeing as it would be assigning logits values to different expressions of probabilities, it would have to... what, take the weighted average of the scores it would get if it gave the different probabilities? And, I suppose that if many training steps were done on the same question/answer pairs, then... (read more)

3Owain_Evans
The indirect logit is trained with cross-entropy based on the groundtruth correct answer. You can't do this for verbalized probability without using RL, and so we instead do supervised learning using the empirical accuracy for different question types as the labels.
drocta
10

For   such that  is a mesa-optimizer, let  be the space it optimizes over, and  be its utility function.

I know you said "which we need not notate", but I am going to say that for  and  , that  , and  is the space of actions (or possibly,  and  is the space of actions available in the situation  )
(Though maybe you just meant that we need not notate, separately from s, the map from X to A which s defines. In which ... (read more)

drocta
10

Is this something that the infra-bayesianism idea could address? So, would an infra-bayesian version of AIXI be able to handle worlds that include halting oracles, even though they aren't exactly in its hypothesis class?

drocta
10

Do I understand correctly that in general the elements of A, B, C,  are achievable probability distributions over the set of n possible outcomes? (But that in the examples given with the deterministic environments, these are all standard basis vectors / one-hot vectors / deterministic distributions ?)

And, in the case where these outcomes are deterministic, and A and B are disjoint, and A is much larger than B, then given a utility function on the possible outcomes in A or B, a random permutation of this utility function will, with high probability, ha... (read more)

3TurnTrout
Yes. Nice catch. In the stochastic case, you do need a permutation-enforced similarity, as you say (see definition 6.1: similarity of visit distribution sets in the paper). They won't apply for all A, B, because that would prove way too much.
drocta
62

My understanding:

One could create a program which hard-codes the point about which it oscillates (as well as some amount which it always eventually goes that far in either direction), and have it buy once when below, and then wait until the price is above to sell, and then wait until price is below to buy, etc.  

The programs receive as input the prices which the market maker is offering.

It doesn't need to predict ahead of time how long until the next peak or trough, it only needs to correctly assume that it does oscillate sufficiently, and respond when it does.
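A minimal sketch of such a program (the threshold level and price stream below are placeholders): it hard-codes the level the price is assumed to oscillate around and simply reacts to the offered prices, with no prediction of when the next crossing happens.

```python
def oscillation_trader(prices, level=100.0):
    """Buy when the offered price is below the hard-coded level, sell when above.

    `prices` is the stream of prices offered by the market maker; `level` is the
    hard-coded point the price is assumed to oscillate around (an assumption of
    this sketch, not something the trader predicts)."""
    holding = False
    for t, p in enumerate(prices):
        if not holding and p < level:
            holding = True
            yield (t, "buy", p)
        elif holding and p > level:
            holding = False
            yield (t, "sell", p)

# Example with a made-up oscillating price stream.
prices = [98, 103, 97, 105, 99, 104]
print(list(oscillation_trader(prices)))
# [(0, 'buy', 98), (1, 'sell', 103), (2, 'buy', 97), (3, 'sell', 105), ...]
```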

drocta
Ω030

The part about Chimera functions was surprising, and I look forward to seeing where that will go, and to more of this in general.

In section 2.1 , Proposition 2 should presumably say that  is a partial order on  rather than on  .

2Scott Garrabrant
Fixed, Thanks.
drocta
*40

In the section about Non-Dogmatism, I believe something was switched around. It says that if the logical inductor assigns prices converging to $1 to a proposition that cannot be proven, then the trader can buy shares in that proposition at prices of $ and thereby gain infinite potential upside. I believe this should say that if the logical inductor assigns prices converging to $0 to a proposition that can't be disproven, instead of prices converging to $1 for a proposition that can't be proven.
(I think that if the price was converging to $1 for ... (read more)

3Mark Xu
Thanks! Should be fixed now.
drocta
Ω230

You said that you thought that this could be done in a categorical way. I attempted something which appears to describe the same thing when applied to the category FinSet , but I'm not sure it's the sort of thing you meant by when you suggested that the combinatorial part could potentially be done in a categorical way instead, and I'm not sure that it is fully categorical.

Let S be an object.
For i from 1 to k, let  be an object (which is not anything isomorphic to the product of itself with itself, or at least is not the terminal object).
Let... (read more)

6Scott Garrabrant
I have not thought much about applying to things other than finite sets. (I looked at infinite sets enough to know there is nontrivial work to be done.) I do think it is good that you are thinking about it, but I don't have any promises that it will work out. What I meant when I think that this can be done in a categorical way is that I think I can define a nice symmetric monoidal category of finite factored sets such that things like orthogonality can be given nice categorical definitions. (I see why this was a confusing thing to say.)
drocta
10

I've now computed the volumes within the [-a,a]^3 cube for and, or, and the constant 1 function. I was surprised by the results.
(I hadn't considered that the ratios between the volumes will not depend on the size of the cube)
If we select x,y,z uniformly at random within this cube, the probability of getting the and gate is 1/48, the probability of getting the or gate is 2/48, and the probability of getting the constant 1 function is 13/48 (more than 1/4).
This I found quite surprising, because of the constant 1 function requiring 4 half planes to express th... (read more)
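A quick Monte Carlo check of those ratios (my sketch; the activation convention is assumed to be "output 1 iff x·a + y·b + z > 0" for inputs a, b in {0,1}). With that convention the estimates should land near the quoted 1/48, 2/48, and 13/48.

```python
import random
from collections import Counter

def gate(x, y, z):
    # Which 2-input boolean function does the threshold unit (weights x, y; bias z) compute?
    return tuple(int(x * a + y * b + z > 0) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)])

counts = Counter()
n = 200_000
for _ in range(n):
    x, y, z = (random.uniform(-1, 1) for _ in range(3))
    counts[gate(x, y, z)] += 1

for name, truth_table in [("AND", (0, 0, 0, 1)), ("OR", (0, 1, 1, 1)), ("const 1", (1, 1, 1, 1))]:
    print(name, counts[truth_table] / n)  # roughly 1/48 ~ 0.021, 2/48 ~ 0.042, 13/48 ~ 0.271
```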

drocta
10

For the volumes, I suppose that because scaling all of these parameters by the same positive constant doesn't change the function computed, it would make sense to compute the volumes of the corresponding regions of the cube, and this would handle the issues with these regions having unbounded size.
(this would still work with more parameters, it would just be a higher dimensional sphere)
Er, would that give the same thing as the limit if we took the parameters within a cube?
Anyway, at least in this case, if we use the "projected onto the sphere" case, we cou... (read more)

1drocta
I've now computed the volumes within the [-a,a]^3 cube for and, or, and the constant 1 function. I was surprised by the results. (I hadn't considered that the ratios between the volumes will not depend on the size of the cube) If we select x,y,z uniformly at random within this cube, the probability of getting the and gate is 1/48, the probability of getting the or gate is 2/48, and the probability of getting the constant 1 function is 13/48 (more than 1/4). This I found quite surprising, because of the constant 1 function requiring 4 half planes to express the conditions for it. So, now I'm guessing that the ones that required fewer half spaces to specify, are the ones where the individual constraints are already implying other constraints, and so actually will tend to have a smaller volume. On the other hand, I still haven't computed any of them for if projecting onto the sphere, and so this measure kind of gives extra weight to the things in the directions near the corners of the cube, compared to the measure that would be if using the sphere.
2Alex Flint
Yes it does seem challenging to compute realistic complexity measures for such small functions. Perhaps we could just look at the mappings ordered by their volume in parameter space, and check whether the mappings at the top of that ordering "seem" less complex than the mappings at the bottom.
drocta
70

nitpick: the appendix says   possible configurations of the whole grid, while it should say  possible configurations. (Similarly for what it says about the number of possible configurations in the region that can be specified.)

4Alex Flint
Thank you. Fixed.
drocta
Ω2100

This comment I'm writing is mostly because this prompted me to attempt to see how feasible it would be to computationally enumerate the conditions for the weights of small networks like the 2 input 2 hidden layer 1 output in order to implement each of the possible functions. So, I looked at the second smallest case by hand, and enumerated conditions on the weights for a 2 input 1 output no hidden layer perceptron to implement each of the 2 input gates, and wanted to talk about it. This did not result in any insights, so if that doesn't sound interesting, m... (read more)
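For reference, a brute-force sketch of that sort of enumeration (the weight grid and threshold convention are my choices): searching a coarse grid of weights already finds 14 of the 16 two-input functions, with XOR and XNOR being the two that no single threshold unit can implement.

```python
import itertools

def gate(x, y, z):
    # Truth table of the 2-input threshold unit with weights x, y and bias z.
    return tuple(int(x * a + y * b + z > 0) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)])

grid = [i / 4 for i in range(-8, 9)]  # weights and bias in [-2, 2], step 0.25
implementable = {gate(x, y, z) for x, y, z in itertools.product(grid, repeat=3)}

print(len(implementable))             # 14 of the 16 possible truth tables
print((0, 1, 1, 0) in implementable)  # False: XOR is not linearly separable
print((1, 0, 0, 1) in implementable)  # False: XNOR is not linearly separable
```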

3Chris Mingard
Check out https://arxiv.org/pdf/1909.11522.pdf where we do some similar analysis of perceptrons but in higher dimensions. Theorem 4.1 shows that there is an anti-entropy bias - in other words, functions with either mostly 0s or mostly 1s are exponentially more likely to show up than expected under a uniform prior - which holds for perceptrons of any dimension. This proves a (fairly trivial) bias towards simple functions, although it doesn't say anything about why a function like 010101010101... appears more frequently than other functions in the maximum-entropy class.
4Alex Flint
Very very cool. Thank you for this drocta. What would it take to map out the sizes of the volumes corresponding to each of these mappings? Also, could you perhaps compute the exact Kolmogorov complexity of these mappings in some particular description language, since they are so small? It would be super interesting to me to assemble a table of volumes and Kolmogorov complexities for each of these small mappings. It may then be possible to write some code that does the same for 3-input and 4-input mappings.
drocta
20

The link in the rss feed entry for this at https://agentfoundations.org/rss goes to https://www.alignmentforum.org/events/vvPYYTscRXFBvdkXe/ai-safety-beginners-meetup which is a broken link (though, easily fixed by replacing "events" with "posts" in the url) .
[edit: it appears that it is no longer in the rss feed? It showed up in my rss feed reader.]
I think this has also happened with other "event" type posts in the rss feed before, but I may be remembering wrong.
I suspect this is some bug in how the rss feed is generated, but possibly it is a known bug wh... (read more)

2Linda Linsefors
I think this happened because I unselected "Alignment Forum" for this event. To my best understanding, events are not supposed to be Alignment Forum content, and it is a bug that this is even possible. Therefore, I decided that the cooperative thing to do would be not to use this bug. Though I'm not sure what is better, since I think events should be allowed on the Alignment Forum. > I assume that when the event is updated that the additional information will include how to join the meetup? Yes. We'll probably be in Zoom, but I have not decided.  > I am interested in attending. Great, see you there.
drocta
20

The agent/thinker are limited in the time or computational resources available to them, while the predictor is unlimited.

My understanding is that this is generally the situation which is meant. Well, not necessarily unlimited, just with enough resources to predict the behavior of the agent.

I don't see why you call this situation uninteresting.

drocta
30

That something can be modeled using some Turing machine, doesn't imply that it can be any Turing machine.


If I have some simple physical system, such that I can predict how it will behave, well, it can be modeled by a Turing machine, but me being able to predict it doesn't imply that I've solved the halting problem.

A realistic conception of agents in an environment doesn't involve all agents having unlimited compute at every time-step. An agent cannot prevent the universe from continuing simply by getting stuck in a loop and never producing its output for its next action.

drocta
20

Ah, thank you, I see where I misunderstood now. And upon re-reading, I see that it was because I was much too careless in reading the post, to the point that I should apologize. Sorry.
I was thinking that the agents were no longer being trained, already being optimal players, and so I didn't think the judge would need to take into account how their choice would influence future answers. This reading clearly doesn't match what you wrote, at least past the very first part.

If the debaters are still being trained, or the judge can be convinced that the debaters... (read more)

2Joe Collman
Oh no need for apologies: I'm certain the post was expressed imperfectly - I was understanding more as I wrote (I hope!). Often the most confusing parts are the most confused. Since I'm mainly concerned with behaviour-during-training, I don't think the post-training picture is too important to the point I'm making. However, it is interesting to consider what you'd expect to happen after training in the event that the debaters' only convincing "ignore-the-question" arguments are training-signal based. I think in that case I'd actually expect debaters to stop ignoring the question (assuming they know the training has stopped). I assume that a general, super-human question answerer must be able to do complex reasoning and generalise to new distributions. Removal of the training signal is a significant distributional shift, but one that I'd expect a general question-answerer to handle smoothly (in particular, we're assuming it can answer questions about [optimal debating tactics once training has stopped]). [ETA: I can imagine related issues with high-value-information bribery in a single debate: "Give me a win in this branch of the tree, and I'll give you high-value information in another branch", or the like... though it's a strange bargaining situation given that in most setups the debaters have identical information to offer. This could occur during or after training, but only in setups where the judge can give reward before the end of the debate.... Actually I'm not sure on that: if the judge always has the option to override earlier decisions with larger later rewards, then mid-debate rewards don't commit the judge in any meaningful way, so aren't really bargaining chips. So I don't think this style of bribery would work in setups I've seen.]
drocta
20

I am unsure as to what the judge's incentive is to select the result that was more useful, given that they still have access to both answers? Is it just because the judge will want to be such that the debaters would expect them to select the useful answer so that the debaters will provide useful answers, and therefore will choose the useful answers?

If that's the reason, I don't think you would need a committed deontologist to get them to choose a correct answer over a useful answer, you could instead just pick someone who doesn't think very hard about cert... (read more)

2Joe Collman
My expectation is that they'd select the more useful result on the basis that it sends a signal to produce useful results in the future - and that a debater would specifically persuade them to do this (potentially over many steps). I see the situation as analogous to this: The question-creators, judge and debaters are in the same building. The building is on fire, in imminent danger of falling off a cliff, at high risk of enraged elephant stampede... The question-creators, judge and debaters are ignorant of or simply ignoring most such threats. The question creators have just asked the question "What time should we have lunch?". Alice answers "There's a fire!!...", persuades the judge that this is true, and that there are many other major threats. Bob answers "One o'clock would be best...". There's no need for complex/exotic decision-theoretic reasoning on the part of the judge to conclude: "The policy which led the debater to inform me about the fire is most likely to point out other threats in future. The actual question is so unimportant relative to this that answering it is crazy. I want to send a training signal encouraging the communication of urgent, life-saving information, and discouraging the wasting of time on trivial questions while the building burns." Or more simply the judge can just think: "The building's on fire!? Why are you still talking to me about lunch?? I'm picking the sane answer." Of course the judge doesn't need to come up with this reasoning alone - just to be persuaded of it by a debater. I'm claiming that the kind of judge who'll favour "One o'clock would be best..." while the building burns is a very rare human (potentially non-existent?), and not one whose values we'd want having a large impact. More fundamentally, to be confident the QIA fails and that you genuinely have a reliable question-answerer, you must be confident that there (usually) exists no compelling argument in favour of a non-answer. I happen to think the one I've
drocta
20

This reminds me of the "Converse Lawvere Problem" at https://www.alignmentforum.org/posts/5bd75cc58225bf06703753b9/the-ubiquitous-converse-lawvere-problem a little bit, except that the different functions in the codomain have domain which also has other parts to it aside from the main space  . 

As in, it looks like here, we have a space  of values , which includes things such as "likes to eat meat" or "values industriousness" or whatever, where this part can just be handled as some generic nice space   , as one part of ... (read more)

2VojtaKovarik
Yeah, I just meant this simple thing that you can mathematically model as $$f : V \times V \to \mathbb R$$. I suppose it makes sense to consider special cases of this that would have better mathematical properties. But I don't have high-confidence intuitions on which special cases are the right ones to consider. I mostly meant this as a tool that would allow people with different opinions to move their disagreements from "your model doesn't make sense" to "both of our models make sense in theory; the disagreement is an empirical one". (E.g., the value-drift situation from Figure 6 is definitely possible, but that doesn't necessarily mean that this is what is happening to us.)
drocta
30

Thanks! (The way you phrased the conclusion is also much clearer/cleaner than how I phrased it)

drocta
Ω130

I am trying to check that I am understanding this correctly by applying it, though probably not in a very meaningful way:

Am I right in reasoning that, for S ⊆ W, 1_S ◃ C iff ( (C can ensure S), and (every element of S is a result of a combination of a possible configuration of the environment of C with a possible configuration of the agent for C, such that the agent configuration is one that ensures S regardless of the environment configuration)) ?

So, if S = {a,b,c,d} , then

would have  , but, say

... (read more)

5Scott Garrabrant
Yep. There is a single morphism from 1S to ⊥ for every world in S, so 1S◃C means all of these morphisms factor through C.  A morphism from C to ⊥ is basically a column of C and a morphism from 1S to C is basically a row in C, all of whose entries are in S, and these compose to the morphism corresponding to the entry where this column meets this row. Thus 1S◃C if and only if when you delete all rows not entirely in S, the resulting matrix has image S. I think this is equivalent to what you said. I just wrote it out myself because that was the easiest way for me to verify what you said.
drocta
30

There are a few places where I believe you mean to write  a but instead have  instead. For example, in the line above the "Applicability" heading.

I like this.

2johnswentworth
Ah, thanks. I think I got them all now.
drocta
50

As an example, I think the game "both players win if they choose the same option, and lose if they pick different options" has "the two players pick different options, and lose" as one of the feasible outcomes, and it is not on the Pareto frontier, because if they picked the same thing, they would both win, and that would be a Pareto improvement.
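A tiny worked version of that example (the payoff numbers are mine): in the pure coordination game, the mis-coordination outcomes are feasible but Pareto-dominated by the coordination outcomes.

```python
# Pure coordination game: both get payoff 1 if they choose the same option, else both get 0.
outcomes = {("A", "A"): (1, 1), ("A", "B"): (0, 0), ("B", "A"): (0, 0), ("B", "B"): (1, 1)}

def dominates(u, v):
    # u Pareto-dominates v: at least as good for everyone, strictly better for someone.
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

frontier = {acts for acts, u in outcomes.items()
            if not any(dominates(v, u) for v in outcomes.values())}
print(frontier)  # {('A', 'A'), ('B', 'B')} -- mis-coordination is feasible but not on the frontier
```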

2TurnTrout
Right, I understand how this correctly labels certain cases, but that doesn't seem to address my question?
drocta
Ω020

What came to mind for me before reading the spoiler-ed options, was a variation on #2, with the difference being that, instead of trying to extract P's hypothesis about B, we instead modify T to get a T' which has P replaced with a P' which is a paperclip minimizer instead of maximizer, and then run both, and only use the output when the two agree, or if they give probabilities, use the average, or whatever.

Perhaps this could have an advantage over #2 if it is easier to negate what P is optimizing for than to extract P's model of B. (ed... (read more)

3Donald Hobson
Thanks for a thoughtful comment. Assuming that P and P' are perfectly antialigned, they won't cooperate. However they need to be really antialigned for this to work. If there is some obscure borderline that P thinks is a paperclip, and P' thinks isn't, they can work together to tile the universe with it. I don't think it would be that easy to change evolution into a reproductive fitness minimiser, or to negate a human's values. If P and P' are antialigned, then in the scenario where you only listen to them if they agree, then for any particular prediction, at least one of them will consider disagreeing better than that. The game theory is a little complicated, but they aren't being incentivised to report their predictions. Actually, A has to be able to manage, not only correct and competent adversaries, but deluded and half mad ones too. I think P would find it hard to be inscrutable. It is impossible to obfuscate arbitrary code. I agree with your final point. Though for any particular string X, the fastest turing machine to produce it is the one that is basically print(X). This is why we use short TM's not just fast ones.