The constant bound isn't that relevant, and not just because it's in principle unbounded in size: it also doesn't constrain the induced probabilities in the second coding scheme much at all. It's an upper bound on the maximum length, so you can still have the weightings in coding scheme B differ in relative length by a ton, leading to wildly different priors.
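To illustrate with toy numbers (made up for this comment, nothing to do with the actual schemes under discussion): two prefix-free codes over the same four outcomes, both respecting the same bound on maximum codeword length, can still imply very different priors under the usual 2^-length correspondence:

```python
# Toy example: two prefix-free codes over the same outcomes, both with max
# codeword length 3, whose implied priors (2^-length, normalised) differ a lot.
code_a = {"w": "0",   "x": "10",  "y": "110", "z": "111"}  # heavily favours w
code_b = {"w": "111", "x": "110", "y": "10",  "z": "0"}    # heavily favours z

def implied_prior(code: dict[str, str]) -> dict[str, float]:
    weights = {k: 2.0 ** -len(v) for k, v in code.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

print(implied_prior(code_a))  # {'w': 0.5, 'x': 0.25, 'y': 0.125, 'z': 0.125}
print(implied_prior(code_b))  # {'w': 0.125, 'x': 0.125, 'y': 0.25, 'z': 0.5}
```

And with a longer (but still bounded) maximum length L, the ratio between the largest and smallest implied prior can grow like 2^(L-1), so the bound does very little work.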
...And of the encoding schemes that remain on the table, virtually all of them will behave identically with respect to the description lengths they assign to "natural" versus "unnatural" optimization criteria...
If you parse this post as "attempting to impart a basic intuition that might let people (new to AI x-risk arguments) avoid certain classes of errors" rather than "trying to argue with the bleeding-edge arguments on x-risk", then it seems good.
This seems reasonable in isolation, but it gets frustrating when the former is all Eliezer seems to do these days, with seemingly no attempt at the latter. When all you do is retread these dunks on "midwits" and show apathy/contempt for engaging with newer arguments, it makes it look like you don't actually have an i...
I can reason as follows: There is 0.5 chance that it is Heads. Let P represent the actual, unknown, state of the outcome of the toss (Heads or Tails); and let Q represent the other state. If Q, then anything follows. For example, Q implies that I will win $1 billion. Therefore the value of this bet is at least $500,000,000, which is 0.5 * $1,000,000,000, and I should be willing to pay that much to take the bet.
This doesn't go through: what you have are two separate propositions, "H -> (T -> [insert absurdity here])" and "T -> (H -> [insert absu...
What are your thoughts on KL-div after the unembed softmax as a metric?
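To be concrete about what I mean (names and shapes are just placeholders, not anyone's actual code): KL between the clean and patched next-token distributions, taken after the unembed and softmax, something like:

```python
import torch
import torch.nn.functional as F

def kl_after_unembed(logits_clean: torch.Tensor, logits_patched: torch.Tensor) -> torch.Tensor:
    """KL(clean || patched) over the vocab, computed after the unembed + softmax.

    Both tensors are assumed to have shape [batch, seq, d_vocab].
    """
    log_p = F.log_softmax(logits_clean, dim=-1)    # reference next-token distribution
    log_q = F.log_softmax(logits_patched, dim=-1)  # distribution under the intervention
    # F.kl_div with log_target=True computes p * (log p - log q) pointwise;
    # summing over the vocab dim gives KL per position.
    kl = F.kl_div(log_q, log_p, log_target=True, reduction="none").sum(dim=-1)
    return kl.mean()  # average over batch and sequence positions
```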
These results hold only if you assume risk aversion is entirely explained by a concave utility function; if you don't assume that, then the surprising constraints on your preferences don't apply.
IIRC that's the whole point of the paper - not that utility functions are in fact constrained in this way (they're not), but that if you assume risk aversion can only come from diminishing marginal value of money (as many economists do), then you end up in weird places, so maybe you should rethink that.
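(My loose reconstruction of the mechanism, with illustrative stake sizes rather than a quote from the paper: if an expected-utility maximiser with concave, differentiable u rejects a 50/50 lose-$100/gain-$110 bet at every wealth level w, then

$$u(w) \;\ge\; \tfrac{1}{2}\,u(w-100) + \tfrac{1}{2}\,u(w+110) \;\Longrightarrow\; u(w+110) - u(w) \;\le\; u(w) - u(w-100)$$

and by concavity

$$110\,u'(w+110) \;\le\; u(w+110) - u(w) \;\le\; u(w) - u(w-100) \;\le\; 100\,u'(w-100) \;\Longrightarrow\; u'(w+110) \;\le\; \tfrac{10}{11}\,u'(w-100),$$

so marginal utility has to decay geometrically in wealth, which is what generates the absurd conclusions about large-stakes bets. If the risk aversion instead comes from somewhere other than the curvature of u, none of this bites.)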
I think what I'm getting at is more general than specifically talking about resources: I'm more getting at the degree of freedom in the problem description that lets you frame anything as technically optimizing something at a distance. I.e. in 'Utility Maximization = Description Length Minimization' you can take any system, find its long-term and long-distance effects on some other region of space-time, and find a coding scheme where those particular states have the shortest descriptions. The description length of the universe will by construction get minim...
Thanks, I feel like I understand your perspective a bit better now.
Re: your "old" frame: I agree that the fact we're training an AI to be useful from our perspective will certainly constrain its preferences a lot, such that it'll look like it has preferences over resources we think in terms of/won't just be representable as a maximally random utility function. I think there's a huge step from that though to "it's a optimizer with respect to those resources" i.e there are a lot of partial orderings you can put over states where it broadly has preference ord...
Thanks, I think that's a good distinction - I guess I have like 3 issues if we roll with that though
The actual result here looks right to me, but kinda surfaces a lot of my confusion about how people in this space use coherence theorems/reinforces my sense they get misused
You say:
This ties to a common criticism: that any system can be well-modeled as a utility maximizer, by simply choosing the utility function which rewards whatever the system in fact does. As far as I can tell, that criticism usually reflects ignorance of what coherence says
My sense of how this conversation goes is as follows:
"Utility maximisers are scary, and here are some theorems tha...
Hey, sorry for the (very) belated response - thanks for the comment! Your description of the problem set-up/model looks right to me. FWIW this post was ~my first attempt at digging into something superposition-related, so I think you're right that it was being pretty sloppy/confused with the concept of "superposition". I've since come around more to your perspective of polysemanticity/distributed representation/interference being insufficient for "true" superposition.
Re: your point about there existing simpler solutions - you're totally right that for d-head...
Sorry for the delay - thanks for this! Yeah I agree, in general the OV circuit seems like it'll be much easier given the fact that it doesn't have the bilinearity or the softmax issue. I think the idea you sketch here sounds like a really promising one and pretty in line with some of the things we're trying atm
I think the tough part will be the next step which is somehow "stitching together" the QK and OV decompositions that give you an end-to-end understanding of what the whole attention layer is doing. Although I think the extent to which we should be th...
Thanks!
The auxiliary losses were something we settled on quite early, and we made some improvements to the methodology since then for the current results so I don't have great apples-to-apples comparisons for you. The losses didn't seem super important though in the sense that runs would still converge, just take longer and end with slightly worse reconstruction error. I think it's very likely that with a better training set-up/better hyperparam tuning you could drop these entirely and be fine.
Re: comparison to SAE's, you mean what do the dictionaries/feat...
This looks really cool! Haven't digested it all yet but I'm especially interested in the QK superposition as I'm working on something similar. I'm wondering what your thoughts are on the number of bigrams being represented by a QK circuit not being bounded by interference but by its interaction with the OV circuit. IIUC it looks like a head can store a surprising number of d_resid bigrams, but since the OV circuit is only a function of the key, having the same key feature be in a clique with a large number of different query features means the OV circuit will be unable to differentially copy information based on which bigram is present. I don't think this has been explored outside of toy models from Anthropic though.
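Being a bit more explicit about the mechanism I have in mind (standard transformer notation, nothing specific to your setup): the head's output at query position $i$ is

$$\mathrm{out}_i \;=\; \sum_j A_{ij}\, W_O W_V x_j,$$

and the value term $W_O W_V x_j$ depends only on the residual stream at the key/source position $j$, not on which query feature made $A_{ij}$ large. So if one key feature sits in a clique with many different query features, the head writes the same information for every one of those bigrams, and can't copy differentially based on which bigram is actually present.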
I know they flag it in the paper, but seeing the performance curves for the strong model on zero- and few-shot attempts really makes me think the data leakage issue is doing a lot of the work here. If you get the majority(?) of the PGR from e.g. 5-shot prompting, it seems like a natural takeaway is that the strong model doesn't actually need to be fine-tuned on the task, and the weak supervisor is just eliciting the knowledge that's already there.
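(For reference, and if I'm remembering the paper's metric right, PGR here is "performance gap recovered":

$$\mathrm{PGR} \;=\; \frac{\text{weak-to-strong performance} - \text{weak supervisor performance}}{\text{strong ceiling performance} - \text{weak supervisor performance}},$$

so if few-shot prompting the strong model alone already closes most of that gap, the fine-tuning on weak labels isn't doing much of the lifting.)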
Sorry you found it so stressful! I’m not objecting to you deciding it’s not worth your time to engage, what I’m getting at is a perceived double standard in when this kind of criticism is applied. You say
I do not think that the thing I am observing from Pope/Belrose is typical of LW/AF/rationalist/MIRI/etc behaviors to anything like the same degree that they consistently do it
But this seems wrong to me. The best analogue of your post from Quintin’s perspective was his own post laying out disagreements with Eliezer. Eliezer’s response to this was to say...
And all of this is asserted as, essentially, obvious and undeniable, extreme confidence is displayed, all the arguments offered against this are invalid and dumb, and those that disagree are at best deeply confused and constantly told they did not understand or fairly represent what was said.
This feels unnecessarily snarky, but is also pretty much exactly the experience a lot of people have trying to engage with Yudkowsky et al. It feels weird to bring up “they’re very confident and say that their critics just don’t get it” as a put-down here.
It seems d...
From my perspective here's what happened: I spent hours trying to parse his arguments. I then wrote an effort post, responding to something that seemed very wrong to me, that took me many hours, that was longer than the OP, and attempted to explore the questions and my model in detail.
He wrote a detailed reply, which I thanked him for, ignoring the tone issues in question here and focusing on the details and disagreements. I spent hours processing it and replied in detail to each of his explanations in the reply, including asking many detailed quest...
I understand it’s a proposition like any other; I don’t see why an agent would reflect on it/use it in their deliberation to decide what to do. The fact that they’re a CDT agent is a fact about how they will act in the decision, not a fact that they need to use in their deliberation.
Analogous to preferences: whether or not an agent prefers A or B is a proposition like any other, but I don’t think it’s natural to model them as first consulting the credences they have assigned to “I prefer A to B” etc. Rather, they will just choose A ex hypothesi, because that’s what having the preference means.
Why would they be uncertain about whether they’re a CDT agent? Being a CDT agent surely just means by definition that they evaluate decisions based on causal outcomes. It feels confused to say that they have to be uncertain about/reflect on which decision theory they have and then apply it, rather than their being a CDT agent being an ex hypothesi fact about how they behave.
Why not? Is it common for NDAs/non-disparagement agreements to also have a clause stating the parties aren’t allowed to tell anyone about it? I’ve never heard of this outside of super-injunctions which seems a pretty separate thing
Absolutely common. Most non-disparagement agreements are paired with non-disclosure agreements (or clauses in the non-disparagement wording) that prohibit talking about the agreement, as much as talking about the forbidden topics.
It's pretty obvious to lawyers that "I would like to say this, but I have a legal agreement that I won't" is equivalent, in many cases, to saying it outright.
They can presumably confirm whether or not there is a non-disparagement agreement and whether that is preventing them from commenting though, right?
I think (1b) doesn't go through. The "starting data" we have from (1a) is that the AGI has some preferences over lotteries that it competently acts on - acyclicality seems likely but we don't get completeness or transitivity for free, so we can't assume its preferences will be representable as maximising some utility function. (I suppose we also have the constraint that its preferences look "locally" good to us given training). But if this is all we have it doesn't follow that the agent will have some coherent goal it'd want optimisers optimising toward...
Thanks for writing this. I think this is a lot clearer and more accessible than most write-ups on this topic and seems valuable.
I think the points around randomly-sampled plans being lethal, and expecting AGI to more closely randomly-sample plans, seem off though:
I don't see why lethal plans dominate the simplicity-weighted distribution if all we do is condition on plans that succeed. I expect the reasoning is "Lethal IC plans are more likely to succeed, therefore there are more minor (equally or barely more complex) variations of a given lethal plan that ...
Want to bump this because it seems important - how do you see the agent in the post as being dominated?
How is the toy example agent sketched in the post dominated?
Yeah I agree that even if they fall short of normative constraints there’s some empirical content around what happens in adversarial environments. I think I have doubts that this stuff translates to thinking about AGIs too much though, in the sense that there’s an obvious story of how an adversarial environment selected for (partial) coherence in us, but I don’t see the same kinds of selection pressures being a force on AGIs. Unless you assume that they’ll want to modify themselves in anticipation of adversarial environments which kinda begs the question
Kind of tangential but I'd be interested in your take on how strongly money-pumping etc. is actually an argument against full-on cyclical preferences? One way to think about why getting money-pumped is bad is that you have an additional preference to not pay money to go nowhere. But it feels like all this tells us is that "something has to go", and if an agent is rationally permitted to modify its own preferences to avoid these situations then it seems a priori acceptable for it to instead just say something like "well actually I weight my cyclical prefe...
This seems totally different to the point OP is making, which is that you can in theory have things that definitely are agents, definitely do have preferences, and are incoherent (hence not EV-maximisers) whilst not "predictably shooting themselves in the foot" as you claim must follow from this.
I agree the framing of "there are no coherence theorems" is a bit needlessly strong/overly provocative in a sense, but I'm unclear what your actual objection is here - are you claiming these hypothetical agents are in fact still vulnerable to money-pumping? That they are in fact not possible?
Great post. I think a lot of the discussion around the role of coherence arguments and what we should expect a super-intelligent agent to behave like is really sloppy and I think this distinction between "coherence theorems as a self-contained mathematical result" and "coherence arguments as a normative claim about what an agent must be like on pain of shooting themselves in the foot" is an important one
The example of how an incomplete agent avoids getting Dutch-booked also seems to look very naturally like how irl agents behave imo. One way of thinking ab...
Ngl, kinda confused how these points imply the post seems wrong - the bulk of this seems to be (1) a semantic quibble + (2) a disagreement on who has the burden of proof when it comes to arguing about the plausibility of coherence + (3) maybe just misunderstanding the point that's being made?
(1) I agree the title is a bit needlessly provocative and in one sense of course VNM/Savage etc count as coherence theorems. But the point is that there is another sense that people use "coherence theorem/argument" in this field which corresponds to something like "If yo...
Ah I hadn't realised Caspar wrote that, thanks for the link! I agree that seems to be getting at the same idea, and it's kind of separable from the multi-agent point
I'm probably misunderstanding you or I've worded things in a confusing way that I haven't noticed - I don't think anywhere it's implied what you do on Tails? The "iff" here is just saying you would be paid on Heads iff you would pay on Tails - the flip will happen regardless and the predictor hasn't made any prediction about the coin itself, just your conditional behaviour.
Edit: Maybe the "iff you will pay $1 on Tails" makes it sound like the predictor is predicting both the coin and your response, I'll edit to make more clear
I feel like this could branch out into a lot of small disagreements here but in the interest of keeping it streamlined:
...