The constant bound isn't that relevant, and not just because it's in principle unbounded in size: it also doesn't constrain the induced probabilities in the second coding scheme much at all. It's an upper bound on the maximum length, so you can still have the weightings in coding scheme B differ in relative length by a ton, leading to wildly different priors.
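To illustrate with toy numbers (made up for this comment, nothing to do with the actual schemes under discussion): two prefix-free codes over the same four outcomes, both respecting the same bound on maximum codeword length, can still imply very different priors under the usual 2^-length correspondence:

```python
# Toy example: two prefix-free codes over the same outcomes, both with max
# codeword length 3, whose implied priors (2^-length, normalised) differ a lot.
code_a = {"w": "0",   "x": "10",  "y": "110", "z": "111"}  # heavily favours w
code_b = {"w": "111", "x": "110", "y": "10",  "z": "0"}    # heavily favours z

def implied_prior(code: dict[str, str]) -> dict[str, float]:
    weights = {k: 2.0 ** -len(v) for k, v in code.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

print(implied_prior(code_a))  # {'w': 0.5, 'x': 0.25, 'y': 0.125, 'z': 0.125}
print(implied_prior(code_b))  # {'w': 0.125, 'x': 0.125, 'y': 0.25, 'z': 0.5}
```

And with a longer (but still bounded) maximum length L, the ratio between the largest and smallest implied prior can grow like 2^(L-1), so the bound does very little work.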
...And of the encoding schemes that remain on the table, virtually all of them will behave identically with respect to the description lengths they assign to "natural" versus "unnatural" optimization criteria...
If you parse this post as "attempting to impart a basic intuition that might let people (new to AI x-risk arguments) avoid certain classes of errors" rather than "trying to argue with the bleeding-edge arguments on x-risk", then it seems good.
This seems reasonable in isolation, but it gets frustrating when the former is all Eliezer seems to do these days, with seemingly no attempt at the latter. When all you do is retread these dunks on "midwits" and show apathy/contempt for engaging with newer arguments, it makes it look like you don't actually have an i...
I can reason as follows: There is 0.5 chance that it is Heads. Let P represent the actual, unknown, state of the outcome of the toss (Heads or Tails); and let Q represent the other state. If Q, then anything follows. For example, Q implies that I will win $1 billion. Therefore the value of this bet is at least $500,000,000, which is 0.5 * $1,000,000,000, and I should be willing to pay that much to take the bet.
This doesn't go through: what you have are two separate propositions, "H -> (T -> [insert absurdity here])" and "T -> (H -> [insert absu...
What are your thoughts on KL-div after the unembed softmax as a metric?
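To be concrete about what I mean (names and shapes are just placeholders, not anyone's actual code): KL between the clean and patched next-token distributions, taken after the unembed and softmax, something like:

```python
import torch
import torch.nn.functional as F

def kl_after_unembed(logits_clean: torch.Tensor, logits_patched: torch.Tensor) -> torch.Tensor:
    """KL(clean || patched) over the vocab, computed after the unembed + softmax.

    Both tensors are assumed to have shape [batch, seq, d_vocab].
    """
    log_p = F.log_softmax(logits_clean, dim=-1)    # reference next-token distribution
    log_q = F.log_softmax(logits_patched, dim=-1)  # distribution under the intervention
    # F.kl_div with log_target=True computes p * (log p - log q) pointwise;
    # summing over the vocab dim gives KL per position.
    kl = F.kl_div(log_q, log_p, log_target=True, reduction="none").sum(dim=-1)
    return kl.mean()  # average over batch and sequence positions
```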
These results hold only if you assume risk aversion is entirely explained by a concave utility function; if you don't assume that, then the surprising constraints on your preferences don't apply.
IIRC that's the whole point of the paper - not that utility functions are in fact constrained in this way (they're not), but that if you assume risk aversion can only come from diminishing marginal value of money (as many economists do), then you end up in weird places, so maybe you should rethink that.
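(My loose reconstruction of the mechanism, with illustrative stake sizes rather than a quote from the paper: if an expected-utility maximiser with concave, differentiable u rejects a 50/50 lose-$100/gain-$110 bet at every wealth level w, then

$$u(w) \;\ge\; \tfrac{1}{2}\,u(w-100) + \tfrac{1}{2}\,u(w+110) \;\Longrightarrow\; u(w+110) - u(w) \;\le\; u(w) - u(w-100)$$

and by concavity

$$110\,u'(w+110) \;\le\; u(w+110) - u(w) \;\le\; u(w) - u(w-100) \;\le\; 100\,u'(w-100) \;\Longrightarrow\; u'(w+110) \;\le\; \tfrac{10}{11}\,u'(w-100),$$

so marginal utility has to decay geometrically in wealth, which is what generates the absurd conclusions about large-stakes bets. If the risk aversion instead comes from somewhere other than the curvature of u, none of this bites.)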
I think what I'm getting at is more general than specifically talking about resources: I'm more getting at the degree of freedom in the problem description that lets you frame anything as technically optimizing something at a distance. I.e. in 'Utility Maximization = Description Length Minimization' you can take any system, find its long-term and long-distance effects on some other region of space-time, and find a coding scheme where those particular states have the shortest descriptions. The description length of the universe will by construction get minim...
Thanks, I feel like I understand your perspective a bit better now.
Re: your "old" frame: I agree that the fact we're training an AI to be useful from our perspective will certainly constrain its preferences a lot, such that it'll look like it has preferences over resources we think in terms of/won't just be representable as a maximally random utility function. I think there's a huge step from that though to "it's a optimizer with respect to those resources" i.e there are a lot of partial orderings you can put over states where it broadly has preference ord...
Thanks, I think that's a good distinction - I guess I have like 3 issues if we roll with that though
The actual result here looks right to me, but kinda surfaces a lot of my confusion about how people in this space use coherence theorems/reinforces my sense they get misused
You say:
This ties to a common criticism: that any system can be well-modeled as a utility maximizer, by simply choosing the utility function which rewards whatever the system in fact does. As far as I can tell, that criticism usually reflects ignorance of what coherence says
My sense of how this conversation goes is as follows:
"Utility maximisers are scary, and here are some theorems tha...
Hey, sorry for the (very) belated response - thanks for the comment! Your description of the problem set-up/model looks right to me. FWIW this post was ~my first attempt at digging into something superposition-related, so I think you're right that it was being pretty sloppy/confused with the concept of "superposition". I've since come around more to your perspective of polysemanticity/distributed representation/interference being insufficient for "true" superposition.
Re: your point about there existing simpler solutions - you're totally right that for d-head...
Sorry for the delay - thanks for this! Yeah I agree, in general the OV circuit seems like it'll be much easier given the fact that it doesn't have the bilinearity or the softmax issue. I think the idea you sketch here sounds like a really promising one and pretty in line with some of the things we're trying atm
I think the tough part will be the next step which is somehow "stitching together" the QK and OV decompositions that give you an end-to-end understanding of what the whole attention layer is doing. Although I think the extent to which we should be th...
Thanks!
The auxiliary losses were something we settled on quite early, and we made some improvements to the methodology since then for the current results so I don't have great apples-to-apples comparisons for you. The losses didn't seem super important though in the sense that runs would still converge, just take longer and end with slightly worse reconstruction error. I think it's very likely that with a better training set-up/better hyperparam tuning you could drop these entirely and be fine.
Re: comparison to SAE's, you mean what do the dictionaries/feat...
This looks really cool! Haven't digested it all yet but I'm especially interested in the QK superposition as I'm working on something similar. I'm wondering what your thoughts are on the number of bigrams being represented by a QK circuit not being bounded by interference but by its interaction with the OV circuit. IIUC it looks like a head can store a surprising number of d_resid bigrams, but since the OV circuit is only a function of the key, having the same key feature be in a clique with a large number of different query features means the OV circuit will be unable to differentially copy information based on which bigram is present. I don't think this has been explored outside of toy models from Anthropic though.
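Being a bit more explicit about the mechanism I have in mind (standard transformer notation, nothing specific to your setup): the head's output at query position $i$ is

$$\mathrm{out}_i \;=\; \sum_j A_{ij}\, W_O W_V x_j,$$

and the value term $W_O W_V x_j$ depends only on the residual stream at the key/source position $j$, not on which query feature made $A_{ij}$ large. So if one key feature sits in a clique with many different query features, the head writes the same information for every one of those bigrams, and can't copy differentially based on which bigram is actually present.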
I know they flag it in the paper, but seeing the performance curves for the strong model on zero- and few-shot attempts really makes me think the data leakage issue is doing a lot of the work here. If you get the majority(?) of the PGR from e.g. 5-shot prompting, it seems like a natural takeaway is that the strong model doesn't actually need to be fine-tuned on the task, and the weak supervisor is just eliciting the knowledge that's already there.
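(For reference, and if I'm remembering the paper's metric right, PGR here is "performance gap recovered":

$$\mathrm{PGR} \;=\; \frac{\text{weak-to-strong performance} - \text{weak supervisor performance}}{\text{strong ceiling performance} - \text{weak supervisor performance}},$$

so if few-shot prompting the strong model alone already closes most of that gap, the fine-tuning on weak labels isn't doing much of the lifting.)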
Sorry you found it so stressful! I’m not objecting to you deciding it’s not worth your time to engage, what I’m getting at is a perceived double standard in when this kind of criticism is applied. You say
I do not think that the thing I am observing from Pope/Belrose is typical of LW/AF/rationalist/MIRI/etc behaviors to anything like the same degree that they consistently do it
But this seems wrong to me. The best analogue of your post from Quintin’s perspective was his own post laying out disagreements with Eliezer. Eliezer’s response to this was to say...
And all of this is asserted as, essentially, obvious and undeniable, extreme confidence is displayed, all the arguments offered against this are invalid and dumb, and those that disagree are at best deeply confused and constantly told they did not understand or fairly represent what was said.
This feels unnecessarily snarky, but is also pretty much exactly the experience a lot of people have trying to engage with Yudkowsky et al. It feels weird to bring up “they’re very confident and say that their critics just don’t get it” as a put-down here.
It seems d...
From my perspective here's what happened: I spent hours trying to parse his arguments. I then wrote an effort post, responding to something that seemed very wrong to me, that took me many hours, that was longer than the OP, and attempted to explore the questions and my model in detail.
He wrote a detailed reply, which I thanked him for, ignoring the tone issues in question here and focusing on the details and disagreements. I spent hours processing it and replied in detail to each of his explanations in the reply, including asking many detailed quest...
I understand it’s a proposition like any other; I don’t see why an agent would reflect on it/use it in their deliberation to decide what to do. The fact that they’re a CDT agent is a fact about how they will act in the decision, not a fact that they need to use in their deliberation.
Analogous to preferences: whether or not an agent prefers A or B is a proposition like any other, but I don’t think it’s natural to model them as first consulting the credences they have assigned to “I prefer A to B” etc. Rather, they will just choose A ex hypothesi, because that’s what having the preference means.
Why would they be uncertain about whether they’re a CDT agent? Being a CDT agent surely just means by definition that they evaluate decisions based on causal outcomes. It feels confused to say that they have to be uncertain about/reflect on which decision theory they have and then apply it, rather than their being a CDT agent being an ex hypothesi fact about how they behave.
Why not? Is it common for NDAs/non-disparagement agreements to also have a clause stating the parties aren’t allowed to tell anyone about it? I’ve never heard of this outside of super-injunctions which seems a pretty separate thing
Absolutely common. Most non-disparagement agreements are paired with non-disclosure agreements (or clauses in the non-disparagement wording) that prohibit talking about the agreement, as much as talking about the forbidden topics.
It's pretty obvious to lawyers that "I would like to say this, but I have a legal agreement that I won't" is equivalent, in many cases, to saying it outright.
They can presumably confirm whether or not there is a non-disparagement agreement and whether that is preventing them from commenting though, right?
I think (1b) doesn't go through. The "starting data" we have from (1a) is that the AGI has some preferences over lotteries that it competently acts on - acyclicality seems likely but we don't get completeness or transitivity for free, so we can't assume its preferences will be representable as maximising some utility function. (I suppose we also have the constraint that its preferences look "locally" good to us given training). But if this is all we have it doesn't follow that the agent will have some coherent goal it'd want optimisers optimising toward...
Thanks for writing this. I think this is a lot clearer and more accessible than most write-ups on this topic and seems valuable.
I think the points around randomly-sampled plans being lethal, and expecting AGI to more closely randomly-sample plans, seem off though:
I don't see why lethal plans dominate the simplicity-weighted distribution if all we do is condition on plans that succeed. I expect the reasoning is "Lethal IC plans are more likely to succeed, therefore there are more minor (equally or barely more complex) variations of a given lethal plan that ...
Want to bump this because it seems important - how do you see the agent in the post as being dominated?
How is the toy example agent sketched in the post dominated?
Yeah I agree that even if they fall short of normative constraints there’s some empirical content around what happens in adversarial environments. I think I have doubts that this stuff translates to thinking about AGIs too much though, in the sense that there’s an obvious story of how an adversarial environment selected for (partial) coherence in us, but I don’t see the same kinds of selection pressures being a force on AGIs. Unless you assume that they’ll want to modify themselves in anticipation of adversarial environments which kinda begs the question
Kind of tangential but I'd be interested in your take on how strongly money-pumping etc. is actually an argument against full-on cyclical preferences? One way to think about why getting money-pumped is bad is that you have an additional preference to not pay money to go nowhere. But it feels like all this tells us is that "something has to go", and if an agent is rationally permitted to modify its own preferences to avoid these situations then it seems a priori acceptable for it to instead just say something like "well actually I weight my cyclical prefe...
This seems totally different to the point OP is making, which is that you can in theory have things that definitely are agents, definitely do have preferences, and are incoherent (hence not EV-maximisers) whilst not "predictably shooting themselves in the foot" as you claim must follow from this.
I agree the framing of "there are no coherence theorems" is a bit needlessly strong/overly provocative in a sense, but I'm unclear what your actual objection is here - are you claiming these hypothetical agents are in fact still vulnerable to money-pumping? That they are in fact not possible?
Great post. I think a lot of the discussion around the role of coherence arguments and what we should expect a super-intelligent agent to behave like is really sloppy and I think this distinction between "coherence theorems as a self-contained mathematical result" and "coherence arguments as a normative claim about what an agent must be like on pain of shooting themselves in the foot" is an important one
The example of how an incomplete agent avoids getting Dutch-booked also seems to look very naturally like how irl agents behave imo. One way of thinking ab...
Ngl, kinda confused how these points imply the post seems wrong - the bulk of this seems to be (1) a semantic quibble + (2) a disagreement on who has the burden of proof when it comes to arguing about the plausibility of coherence + (3) maybe just misunderstanding the point that's being made?
(1) I agree the title is a bit needlessly provocative and in one sense of course VNM/Savage etc count as coherence theorems. But the point is that there is another sense that people use "coherence theorem/argument" in this field which corresponds to something like "If yo...
Ah I hadn't realised Caspar wrote that, thanks for the link! I agree that seems to be getting at the same idea, and it's kind of separable from the multi-agent point
I'm probably misunderstanding you or I've worded things in a confusing way that I haven't noticed - I don't think anywhere it's implied what you do on Tails? The "iff" here is just saying you would be paid on Heads iff you would pay on Tails - the flip will happen regardless and the predictor hasn't made any prediction about the coin itself, just your conditional behaviour.
Edit: Maybe the "iff you will pay $1 on Tails" makes it sound like the predictor is predicting both the coin and your response, I'll edit to make more clear
I feel like this could branch out into a lot of small disagreements here but in the interest of keeping it streamlined:
...