Zvi — LessWrong

I wouldn't obviously even put AMD on the list given that they're up on rather big single stock news, but yes, good note, there is that.

Realistic Reward Hacking Induces Different and Deeper Misalignment

Zvi21d60

Would a reasonable way to summarize this be that if you train on pretend reward hacking you get emergent misalignment that takes the form of pretending (playacting) misbehaving and being evil, whereas if you here train on realistic reward hacking examples it starts realistically (and in some ways strategically) misbehaving and doing other forms of essentially reward hacking instead?

Bending The Curve

Zvi21d20

Yes.

Bending The Curve

Zvi21d20

Knowing that, hopefully you wouldn't?

AI #137: An OpenAI App For That

Zvi21d82

Oh, of course, how silly of me!

AI #136: A Song and Dance

Zvi1mo20

I was not aware of this at the time.

More Reactions to If Anyone Builds It, Everyone Dies

Zvi1mo20

My guess is that on the margin more time should be spent improving the core messaging versus saturating the dialogue tree, on many AI questions, if you combine effort across everyone.

More Reactions to If Anyone Builds It, Everyone Dies

Zvi1mo50

Lot of 'welcome to my world' vibes reading your self reports here, especially the '50 different people have 75 different objections for a mix of good, bad and deeply stupid reasons, and require 100 different responses, some of which are very long, and it takes a back-and-forth to figure out which one, and you can't possibly just list everything' and so on, and that's without getting into actually interesting branches and the places where you might be wrong or learn something, etc.

So to take your example, which seems like a good one:

Humans don't generalize their values out of distribution. I affirm this not as strictly fully true, but on the level of 'this is far closer to true and generative of a superior world model then its negation' and 'if you meditate on this sentence you may become [more] enlightened.'

I too have noticed that people seem to think that they do so generalize in ways they very much don't, and this leads to a lot of rather false conclusions.

I also notice that I'm not convinced we are thinking about the sentence that similarly in ways that could end up being pretty load bearing. Stuff gets complicated.

I think that when you say the statement is 'trivially' true you are wrong about that, or at least holding people to unrealistic standards of epistemics? And that a version of this mistake is part of the problem. At least from me (I presume from others too) you get a very different reaction from saying each of:

Humans don't generalize their values out of distribution. (let this be [X]).
Statement treating [X] as in-context common knowledge.
It is trivially true that [X] (said explicitly), or 'obviously' [X], or similar.
I believe that [X] or am very confident that [X]. (without explaining why you believe this)
I believe that [X] or am very confident that [X], but it is difficult for me to explain/justify.

And so on. I am very deliberate, or try to be, on which one I say in any given spot, even at the cost of a bunch of additional words.

Another note is I think in spots like this you basically do have to say this even if the subject already knows, to establish common knowledge and that you are basing your argument on this, even if only to orient them that this is where you are reasoning from. So it was a helpful statement to say and a good use of a sentence.

I see that you get disagreement votes when you say this on LW., but the comments don't end up with negative karma or anything. I can see how that can be read as 'punishment' but I think that's the system working as intended and I don't know what a better one would be?

In general, I think if you have a bunch of load-bearing statements where you are very confident they are true but people typically think the statement is false and you can't make an explicit case for them (either because you don't have that kind of time/space, or because you don't know how), then the most helpful thing to do is to tell the other person the thing is load bearing, and gesture towards it and why you believe it, but be clear you can't justify it. You can also look for arguments that reach the same conclusion without it - often true things are highly overdetermined so you can get a bunch of your evidence 'thrown out of court' and still be fine, even if that sucks.

More Reactions to If Anyone Builds It, Everyone Dies

Zvi1mo30

On Janus comparisons: I do model you as pretty distinct from them in underlying beliefs although I don't pretend to have a great model of either belief set. Reaction expectations are similarly correlated but distinct. I imagine they'd say that they answer good faith questions too, and often that's true (e.g. when I do ask Janus a question I have a ~100% helpful answer rate, but that's with me having a v high bar for asking).

More Reactions to If Anyone Builds It, Everyone Dies

Zvi1mo111

If that's your reaction to my reaction, then it was a miss in at least some ways, which is on me.

I did not feel angry (more like frustrated?) when I wrote it nor did I intend to express anger, but I did read your review itself as expressing anger and hostility in various forms - you're doing your best to fight through that and play fair with the ideas as you see them, which is appreciated - and have generally read your statements about Yudkowsky and related issues consistently as being something in the vicinity of angry, also as part of a consistent campaign, and perhaps some of this was reflected in my response. It's also true that I have a cached memory of you often responding as if things said are more hostile than I felt they were or were intended, although I do not recall examples at this point.

And I hereby report that, despite at points in the past putting in considerable effort trying to parse your statements, and at some point found it too difficult, frustrating and aversive in some combination and mostly stopped attempting to do so when my initial attempt on a given statement bounced (which sometimes it doesn't).

(Part of what is 'esoteric' is perhaps that the perfect-enemy-of-good thing means a lot of load-bearing stuff is probably unsaid by you, and you may not realize that you haven't said it?)

But also, frankly, when people write much dumber reviews with much dumber things in them, I mostly can't even bring myself to be mad, because I mean what else can one expect from such sources - there's only one such review that actually did make me angry, because it was someone where I expected better. It's something I've worked a lot on, and I think made progress on - I don't actually e.g. get mad at David Sacks anymore as a person, although I still sometimes get mad that I have to once again write about David Sacks.

To the extent I was actually having a reaction to you here it was a sign that I respect you enough to care, that I sense opportunity in some form, and that you're saying actual things that matter rather than just spouting gibberish or standard nonsense.

Similarly, with the one exception, if those people had complained about my reaction to their reaction in the ways I'd expect them to do so, I would have ignored them.

Versus your summary of your review, I would say I read it more as:

We are currently in an alignment winter. (This is bad). This is asserted as 'obvious' and then causes are cited, all in what I read as a hostile manner, and an assertion of 'facts not in evidence' that I indeed disagree with, including various forms of derision that read in-context as status attacks and accusations of bad epistemic action, and the claim that the value loading problem has been solved, which is all offered in a fashion that implies you think this is all clearly true if not rather obvious, and this is all loaded up front despite it not being especially relevant to the book, and echoing things you talk about a lot. This sets the whole thing up as an adversarial exercise. You can notice that in my reaction, I treated these details as central, in a way you don't seem to think are, or at least I think the central thing boils down to this thing?
Alignment is not solved yet but people widely believe it is. (This is bad). It's weird because you say 'we solved [X] and people think [X] solves alignment but it doesn't' where I don't think it's true we solved [X].
I was expecting to hate the book but it actually retreats on most of the rhetoric I blame for contributing to the alignment winter. (This is good) Yes.
The style of the book is bad, but I won't dwell on it and in fact spend a paragraph on the issue and then move on. 'Truly appalling' editorial choices, weird and often condescending, etc. Yes it's condensed but you come on very strong here (which is fine, you clearly believe it, but I wouldn't minimize its role). Also your summary skips over the 'contempt for LLMs' paragraph.
I actually disagree with the overall thesis, but think it's virtuous to focus on the points of agreement when someone points out an important issue so I don't dwell on that either and instead.
"Emphatically agree" (literal words) that AI labs are not serious about the alignment problem.
State a short version of what the alignment problem actually is. (Important because it's usually conflated with or confused with simpler problems that sound a lot easier to solve.)
I signal boost Eliezer's other and better writing because I think my audience is disproportionately made up of people who might be able to contribute to the alignment problem if they're not deeply confused about it and I think Eliezer's earlier work is under-read.
I reiterate that I think the book is kinda bad, since I need a concluding paragraph.

I read 'ok' in this context as better than 'kinda bad' fwiw.

As for 'I should just ask you,' I notice this instinctively feels aversive as likely opening up a very painful and time consuming and highly frustrating interaction or set of interactions and I notice I have the strong urge not to do it. I forget the details of the interactions with you in particular or close others that caused this instinct, and it could be a mistake. I could be persuaded to try again.

I do know that when I see the interactions of the entire Janus-style crowd on almost anything, I have the same feeling I had with early LW, where I expect to get lectured to and yelled at and essentially downvoted a lot, including in 'get a load of this idiot' style ways, if I engage directly in most ways and it puts me off interacting. Essentially it doesn't feel like a safe space for views outside a certain window. This makes me sad because I have a lot of curiosity there, and it is entirely possible this is deeply stupid and if either side braved mild social awkwardness we'd all get big gains from trade and sharing info. I don't know.

I realize it is frustrating to report things in my head where I can't recall many of the sources of the things, but I am guessing that you would want me to do that given that this is the situation.

I dunno, man, this is definitely a 'write the long letter' situation and I'm calling it here.

(If you want to engage further, my reading of LW comments even on my own posts is highly unreliable, but I would get a PM or Twitter DM or email etc pretty reliably).

LESSWRONG
LW

LESSWRONG
LW

Sequences

Posts

Wikitag Contributions

Comments