habryka

Running Lightcone Infrastructure, which runs LessWrong. You can reach me at habryka@lesswrong.com

Sequences

A Moderate Update to your Artificial Priors
A Moderate Update to your Organic Priors
Concepts in formal epistemology

Comments

habryka

This indicates that our scaling lab mentors were more discerning of value alignment on average than non-scaling lab mentors, or had a higher base rate of low-value alignment scholars (probably both).

The second hypothesis here seems much more likely (and my guess is your mentors would agree). My guess is that, after properly controlling for that, you would find a mild to moderate negative correlation here.

But also, more importantly, the set of scholars from which MATS is drawing is heavily skewed towards the kind of person who would work at scaling labs (especially since funding has been heavily skewed towards the kind of research that can happen at scaling labs).
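For concreteness, here is a toy Simpson's-paradox-style sketch of the confounding claim (all numbers are invented for illustration, not MATS data): within each mentor pool, stronger scaling-lab orientation tracks lower rated value alignment, but because the scaling-lab pool also has a higher rating baseline, the pooled correlation comes out positive.

```typescript
// Toy illustration of the confounding claim. All numbers are invented.
function pearson(xs: number[], ys: number[]): number {
  const mean = (v: number[]) => v.reduce((a, b) => a + b, 0) / v.length;
  const mx = mean(xs);
  const my = mean(ys);
  let cov = 0, vx = 0, vy = 0;
  for (let i = 0; i < xs.length; i++) {
    cov += (xs[i] - mx) * (ys[i] - my);
    vx += (xs[i] - mx) ** 2;
    vy += (ys[i] - my) ** 2;
  }
  return cov / Math.sqrt(vx * vy);
}

// Pool A (hypothetical non-scaling-lab mentors): lower rating baseline.
const xA = [0, 1, 2, 3, 4]; // scaling-lab orientation, 0-10
const yA = xA.map((x) => 6 - 0.3 * x); // rated alignment falls with orientation

// Pool B (hypothetical scaling-lab mentors): higher baseline, higher orientation.
const xB = [5, 6, 7, 8, 9];
const yB = xB.map((x) => 9.5 - 0.3 * x);

console.log(pearson(xA, yA).toFixed(2)); // -1.00 within pool A
console.log(pearson(xB, yB).toFixed(2)); // -1.00 within pool B
console.log(pearson(xA.concat(xB), yA.concat(yB)).toFixed(2)); // ≈ 0.61 pooled
```

The point is just that raw averages across mentor pools tell you little until you condition on which pool a scholar was drawn from.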

habryka

implicit framing of the average scaling lab safety researcher we support as being relatively unconcerned about value alignment or the positive impact of their research

Huh, not sure where you are picking this up. I am of course very skeptical of the ability of researchers at scaling labs to evaluate the positive impact of their choice to work at a scaling lab (their job does, after all, depend on them not believing that choice is harmful), but of course they are not unconcerned about their positive impact.

habryka

In Winter 2023-24, our most empirical research dominated cohort, mentors rated the median scholar's value alignment at 8/10 and 85% of scholars were rated 6/10 or above, where 5/10 was “Motivated in part, but would potentially switch focus entirely if it became too personally inconvenient.”

Wait, aren't many of those mentors themselves working at scaling labs, or working very closely with them? So this doesn't feel like a very comforting response to the concern of "I am worried these people want to work at scaling labs because it's a high-prestige and career-advancing thing to do", if the people whose judgement you are relying on for the evaluation have themselves chosen the exact path I am concerned about.

habryka

Cade Metz was the NYT journalist who doxxed Scott Alexander. IMO he has also displayed a somewhat questionable grasp of journalistic competence and integrity, and seems quite prone to narrativizing things in a weirdly adversarial way (I don't think it's obvious how this applies to this article, but it seems useful to know when modeling the article's trustworthiness).

Promoted to curated: Cancer vaccines are cool. I didn't quite realize how cool they were before this post, which is a quite accessible intro to them.

We are experimenting with bolding the date on posts that are new and leaving it thinner on posts that are old, though feedback so far hasn't been super great.
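Concretely, the rule is just a conditional font weight keyed on the post's age; a minimal sketch of the idea (the 48-hour cutoff and all names are invented for illustration, not our actual code):

```typescript
// Minimal sketch, not the actual LessWrong implementation: pick the date's
// font weight from the post's age. The 48-hour cutoff is invented.
const NEW_POST_WINDOW_MS = 48 * 60 * 60 * 1000;

function postDateWeight(postedAt: Date, now: Date = new Date()): number {
  const isNew = now.getTime() - postedAt.getTime() < NEW_POST_WINDOW_MS;
  return isNew ? 700 : 300; // bold for new posts, thin weight for old ones
}

// e.g. applied as an inline style: { fontWeight: postDateWeight(post.postedAt) }
```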

Hmm, most of the ordering should be the same. Here is the ordering on YouTube Music:

The Road To Wisdom
Moloch
Thought That Faster (feat. Eliezer Yudkowsky)
The Litany of Tarrrrrski (feat. Eliezer Yudkowsky)
The Litany of Gendlin
Dath Ilan's Song (feat. Eliezer Yudkowsky)
Half An Hour Before Dawn In San Francisco (feat. Scott Alexander)
AGI and the EMH (feat. Basil Halperin, J. Zachary Mazlish & Trevor Chow)
First they came for the epistemology (feat. Michael Vassar)
Prime Factorization (feat. Scott Alexander)
We Do Not Wish to Advance (feat. Anthropic)
Nihil Supernum (feat. Godric Gryffindor)
More Dakka (feat. Zvi Mowshowitz)
FHI at Oxford (feat. Nick Bostrom)
Answer to Job (feat. Scott Alexander)

Which is pretty similar to the order here. The folk album is in a slightly different order (which I do think is worse and we sadly can't change), but otherwise things are the same. 

habryka

My current best guess is that actually cashing out the vested equity is tied to an NDA, but I am really not confident. OpenAI has a bunch of really weird equity arrangements.

Oh, yeah, admins currently have access to a purely recommended view, and I prefer it. I would be in favor of making that accessible to users (maybe behind a beta flag, or maybe not, depending on uptake).
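A beta flag here would just mean gating the feed per user; a minimal sketch of that gating (the `User` shape and field names are invented, not the actual LessWrong schema):

```typescript
// Minimal sketch, not the actual LessWrong schema: gate the recommended-only
// feed behind a per-user beta flag, with admins always allowed.
interface User {
  isAdmin: boolean;
  betaFeatures?: { recommendedOnlyFeed?: boolean };
}

function canSeeRecommendedOnlyFeed(user: User | null): boolean {
  if (!user) return false;
  if (user.isAdmin) return true; // admins already have the view
  return user.betaFeatures?.recommendedOnlyFeed ?? false;
}
```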
