Would a reasonable way to summarize this be: if you train on pretend reward hacking, you get emergent misalignment that takes the form of pretending (playacting) at misbehaving and being evil, whereas if you instead train on realistic reward hacking examples, it starts realistically (and in some ways strategically) misbehaving and doing other forms of what is essentially reward hacking instead?
Yes.
Knowing that, hopefully you wouldn't?
Oh, of course, how silly of me!
I was not aware of this at the time.
My guess is that on the margin more time should be spent improving the core messaging versus saturating the dialogue tree, on many AI questions, if you combine effort across everyone.
Lot of 'welcome to my world' vibes reading your self reports here, especially the '50 different people have 75 different objections for a mix of good, bad and deeply stupid reasons, and require 100 different responses, some of which are very long, and it takes a back-and-forth to figure out which one, and you can't possibly just list everything' and so on, and that's without getting into actually interesting branches and the places where you might be wrong or learn something, etc.
So to take your example, which seems like a good one:
Humans don't generalize their values out of distribution. I affirm this not as strictly fully true, but on the level of 'this is far closer to true and generative of a superior world model than its negation' and 'if you meditate on this sentence you may become [more] enlightened.'
I too have noticed that people seem to think that they do so generalize in ways they very much don't, and this leads to a lot of rather false conclusions.
I also notice that I'm not convinced we are thinking about the sentence all that similarly, in ways that could end up being pretty load bearing. Stuff gets complicated.
I think that when you say the statement is 'trivially' true you are wrong about that, or at least holding people to unrealistic standards of epistemics? And that a version of this mistake is part of the problem. At least from me (I presume from others too) you get a very different reaction from saying each of:
And so on. I am very deliberate, or try to be, on which one I say in any given spot, even at the cost of a bunch of additional words.
Another note is that I think in spots like this you basically do have to say this even if the subject already knows it, to establish common knowledge and to make clear you are basing your argument on this, even if only to orient them that this is where you are reasoning from. So it was a helpful statement to say and a good use of a sentence.
I see that you get disagreement votes when you say this on LW, but the comments don't end up with negative karma or anything. I can see how that can be read as 'punishment,' but I think that's the system working as intended, and I don't know what a better one would be?
In general, I think if you have a bunch of load-bearing statements where you are very confident they are true but people typically think the statement is false and you can't make an explicit case for them (either because you don't have that kind of time/space, or because you don't know how), then the most helpful thing to do is to tell the other person the thing is load bearing, and gesture towards it and why you believe it, but be clear you can't justify it. You can also look for arguments that reach the same conclusion without it - often true things are highly overdetermined so you can get a bunch of your evidence 'thrown out of court' and still be fine, even if that sucks.
On Janus comparisons: I do model you as pretty distinct from them in underlying beliefs, although I don't pretend to have a great model of either belief set. Reaction expectations are similarly correlated but distinct. I imagine they'd say that they answer good faith questions too, and often that's true (e.g. when I do ask Janus a question I have a ~100% helpful answer rate, but that's with me having a very high bar for asking).
If that's your reaction to my reaction, then it was a miss in at least some ways, which is on me.
I did not feel angry (more like frustrated?) when I wrote it nor did I intend to express anger, but I did read your review itself as expressing anger and hostility in various forms - you're doing your best to fight through that and play fair with the ideas as you see them, which is appreciated - and have generally read your statements about Yudkowsky and related issues consistently as being something in the vicinity of angry, also as part of a consistent campaign, and perhaps some of this was reflected in my response. It's also true that I have a cached memory of you often responding as if things said are more hostile than I felt they were or were intended, although I do not recall examples at this point.
And I hereby report that, despite at points in the past putting in considerable effort trying to parse your statements, at some point I found it too difficult, frustrating, and aversive in some combination, and mostly stopped attempting to do so when my initial attempt on a given statement bounced (which sometimes it doesn't).
(Part of what is 'esoteric' is perhaps that the perfect-enemy-of-good thing means a lot of load-bearing stuff is probably unsaid by you, and you may not realize that you haven't said it?)
But also, frankly, when people write much dumber reviews with much dumber things in them, I mostly can't even bring myself to be mad, because I mean what else can one expect from such sources - there's only one such review that actually did make me angry, because it was someone where I expected better. It's something I've worked a lot on, and I think made progress on - I don't actually e.g. get mad at David Sacks anymore as a person, although I still sometimes get mad that I have to once again write about David Sacks.
To the extent I was actually having a reaction to you here it was a sign that I respect you enough to care, that I sense opportunity in some form, and that you're saying actual things that matter rather than just spouting gibberish or standard nonsense.
Similarly, with the one exception, if those people had complained about my reaction to their reaction in the ways I'd expect them to, I would have ignored them.
Versus your summary of your review, I would say I read it more as:
I read 'ok' in this context as better than 'kinda bad' fwiw.
As for 'I should just ask you,' I notice this instinctively feels aversive as likely opening up a very painful and time consuming and highly frustrating interaction or set of interactions and I notice I have the strong urge not to do it. I forget the details of the interactions with you in particular or close others that caused this instinct, and it could be a mistake. I could be persuaded to try again.
I do know that when I see the interactions of the entire Janus-style crowd on almost anything, I have the same feeling I had with early LW, where I expect to get lectured to and yelled at and essentially downvoted a lot, including in 'get a load of this idiot' style ways, if I engage directly in most ways and it puts me off interacting. Essentially it doesn't feel like a safe space for views outside a certain window. This makes me sad because I have a lot of curiosity there, and it is entirely possible this is deeply stupid and if either side braved mild social awkwardness we'd all get big gains from trade and sharing info. I don't know.
I realize it is frustrating to report things in my head where I can't recall many of the sources of the things, but I am guessing that you would want me to do that given that this is the situation.
I dunno, man, this is definitely a 'write the long letter' situation and I'm calling it here.
(If you want to engage further, my reading of LW comments even on my own posts is highly unreliable, but I would get a PM or Twitter DM or email etc pretty reliably).
I wouldn't obviously even put AMD on the list given that they're up on rather big single stock news, but yes, good note, there is that.