I will try to more directly express the positive intuition for why all of this seems possible to me, that is why I think such a loss function over heuristic arguments that makes all the correct tradeoffs should exist.

Consider the process of SGD as a process of Bayesian model selection. We start with some prior of possible weights of a model in some GPT architecture, then we update based on a series of data and in the end we get some model. We might similarly then have a bunch of objections to how such a model selection process could ever learn the data, e.g. that we don't have enough parameters to memorize every fact like "apples fall down" "pears fall down" etc., so how will the model know when to try to compress these facts into an underlying theory? And for other things like Barack Obama, how will the model learn to memorize that fact, but not the facts about fruits falling down? How is it possible to have a loss function that treats "Obama follows Barack" as an axiom, but "apples fall down" as a fact derived from some more general beliefs about gravity?

The answer is, of course, that we don't really need to deal with any of that and we can just make loss go down, and if we've set up the learning problem correctly then SGD will magically do all these tradeoffs for us.

In the heuristic argument frame, the hope is thus less "we will find a loss function that somehow does all these tradeoffs in a way that magically works" but rather that we can find some loss function over heuristic arguments that does the same thing as what SGD is in some sense already doing to find a model that somehow compresses common crawl so well. That is, our loss function only needs to learn treat "Obama follows Barack" as axiomatic in so far as SGD learns to treat "Obama follows Barack" as axiomatic.

And the hope is that if we do this correctly, the we can identify deceptive alignment because deceptive alignment is defined to be your model intentionally deceiving, and thus "model acts deceptively" is not, from the perspective of model/SGD, an axiomatic fact, so as long as our loss function over heuristic arguments is properly "parallel" to SGD, then it will not learn to treat "model acts deceptive" as axiomatic (because it will only treat things as axiomatic if model/SGD treat them as axiomatic).

Another way of saying this is that SGD + architecture implicitly assign some "probability" (probably not really in a way that is a distribution in any sense) to any fact F being "axiomatic" and uses data to learn which facts are axiomatic vs not, and so the heuristic argument machinery must assign the same "probability" that facts are axiomatic and do the same kind of learning.

On Eating the Sun

Mark Xu6mo63

I think I expect Earth in this case to just say no and not sell the sun? But I was confused at like 2 points in your paragraph so I don't think I understand what you're saying that well. I also think we're probably on mostly the same page, and am not that interested in hashing out further potential disagreements.

Also, mostly unrelated, maybe a hot take, but if you're able to get outcompeted because you don't upload, then the future you're in is not very good.

On Eating the Sun

Mark Xu6mo20

Cool. I misinterpreted your previous comment and think we're basically on the same page.

On Eating the Sun

Mark Xu6mo62

I think the majority of humans probably won't want to be uploads, leave the solar system permanently, etc. Maybe this is where we disagree? I don't really think there's going to be a thing that most people care about more.

On Eating the Sun

Mark Xu6mo41

I don't think that's a very good analogy, but I will say that is is basically true for the Amish. And I do think that we should respect their preferences. (I seperately think cars are not that good, and that people would infact prefer to bicycle around or ride house drawn carriges or whatever if civilization was conducive to that, although that's kinda besides the point.)

I'm not arguing that we should be conservative about changing the sun. I'm just claiming that people like the sun and won't want to see it eaten/fundamentally transformed, and that we should respect this preference. This is reason why it's different from candles -> lightbulbs, because people very obviously wanted lightbulbs when offered. But I don't think the marginal increase in well-being from eating the sun will be nearly enough to make balance against the desire that the sun remain the same, so I don't think most people will on net want the sun to be eaten. To be clear, this is an empirical claim about what people want that might very well be false.

On Eating the Sun

Mark Xu6mo81

I am claiming that people when informed will want the sun to continuing being the sun. I also think that most people when informed will not really care that much about creating new people, will continue to believe in the act-omission distinction, etc. And that this is a coherent view that will add up to a large set of people wanting things in the solar system to remain conservatively the same. I seperately claim that if this is true, then other people should just respect this preference, and use the other stars that people don't care about for energy.

On Eating the Sun

Mark Xu6mo11

But most people on Earth don't want "an artificial system to light the Earth in such a way as to mimic the sun", they want the actual sun to go on existing.

Benito's Shortform Feed

Mark Xu6mo40

This is in part the reasoning used by Judge Kaplan:

Kaplan himself said on Thursday that he decided on his sentence in part to make sure that Bankman-Fried cannot harm other people going forward. “There is a risk that this man will be in a position to do something very bad in the future,” he said. “In part, my sentence will be for the purpose of disabling him, to the extent that can appropriately be done, for a significant period of time.”

from https://time.com/6961068/sam-bankman-fried-prison-sentence/

ejenner's Shortform

Mark Xu6mo40

It's kind of strange that, from my perspective, these mistakes are very similar to the mistakes I think I made, and also see a lot of other people making. Perhaps one "must" spend too long doing abstract slippery stuff to really understand the nature of why it doesn't really work that well?

Mark Xu's Shortform

Mark Xu7mo41

I know what the word means, I just think in typical cases people should be saying a lot more about why something is undignified, because I don’t think people’s senses of dignity typically overlap that much, especially if the reader doesn’t typically read LW. In these cases I think permitting the use of the word “undignified” prevents specificity.