Nina is worried not just about humans getting killed and replaced, but also about humans not being allowed to have unenhanced children. It seems plausible that most humans, after reflection, would endorse some kind of "successionist" philosophy/ideology, and decide that intentionally creating an unenhanced human constitutes a form of child abuse (e.g., due to risk of psychological tendency to suffer, or having a much worse life in expectation than what's possible). It seems reasonable for Nina to worry about this, if she thinks her own values (current or eventual or actual) are different.
The usual scaling laws are about IID samples from a fixed data distribution, so they don’t capture this kind of effect.
Doesn't this seem like a key flaw in the usual scaling laws? Why haven't I seen this discussed more? The OP did mention declining average data quality but didn't emphasize it much. This 2023 post trying to forecast AI timelines based on scaling laws did not mention the issue at all, and I received no response when I made this point in its comments section.
...Even if it were true that the additional data literally “contained no new ide
The power of scaling is that with real unique data, however unoriginal, the logarithmic progress doesn't falter; it still continues its logarithmic slog at an exponential expense rather than genuinely plateauing.
How to make sense of this? If the additional training data is mostly low quality (AI labs must have used the highest quality data first?) or repetitive (contains no new ideas/knowledge), perplexity might go down but what is the LLM really learning?
AI labs must have used the highest quality data first
The usual scaling laws are about IID samples from a fixed data distribution, so they don't capture this kind of effect.
But even with IID samples, we'd expect to get diminishing marginal returns, and we do. And you're asking: why, then, do we keep getting returns indefinitely (even diminishing ones)?
I think the standard answer is that reality (and hence, the data) contains a huge number of individually rare "types of thing," following some long-tailed distribution. So even when the LLM has see...
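A toy simulation of that long-tail picture (my own made-up Zipf parameters, not from any scaling-law paper): draw IID samples of "types of thing" from a heavy-tailed distribution and watch how the rate of encountering never-before-seen types keeps falling but never reaches zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy long-tailed "world": each sample is a "type of thing" drawn from a
# Zipf distribution, so a few types are very common and most are rare.
for n_samples in [10**4, 10**5, 10**6, 10**7]:
    draws = rng.zipf(a=1.2, size=n_samples)
    distinct = len(np.unique(draws))
    print(f"{n_samples:>10,} IID samples -> {distinct:>9,} distinct types "
          f"({distinct / n_samples:.1%} of samples revealed a new type)")

# The fraction of samples that reveal a never-before-seen type keeps
# shrinking but never hits zero: diminishing, yet nonzero, marginal returns.
```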
You realize that from my perspective, I can't take this at face value due to "many apparent people could be non‑conscious entities", right? (Sorry to potentially offend you, but it seems like too obvious an implication to pretend not to be aware of.) I personally am fairly content most of the time but do have memories of suffering. Assuming those memories are real, and your suffering is too, I'm still not sure that justifies calling the simulators "cruel". The price may well be worth paying, if it potentially helps to avert some greater disaster in the bas...
If it’s possible at all for this process to lead somewhere good, then it’s possible for it to lead somewhere good within the mind of an AI that combines a human-like ability to reason, with human-like social and moral instincts / reflexes.
I should have given some examples of my own. Here's Gemini on a story idea of mine, for the Star Wars universe ("I wish there was a story about a power-hungry villain who takes precautions against becoming irrational after gaining power. You'd think that at least some would learn from history. [...] The villain could anticipate that even his best efforts might fail, and create a mechanism to revive copies of himself from time to time, who would study his own past failures, rise to power again, and try to do better each time. [...] Sometimes the villain bec...
Is Gemini 2.5 Pro really not sycophantic? Because I tend to get more positive feedback from it than any online or offline conversation with humans. (Alternatively, humans around me are too reluctant to give explicit praise?)
I think it's still sycophantic compared to hardcore STEM circles where we regard criticism as a bloodsport and failing to find fault in something as defeat. But it's much less so than the more relevant comparison, which is other LLMs, and in an absolute sense it's at a level where it's hard to distinguish from reasonable opinions and doesn't seem to be getting in the way too much. As davidad notes, it's still at a level where you can sense its reluctance or if it's shading things to be nice, and that is a level where it's just a small quirk and something y...
Why do you think they haven't talked to us?
They might be worried that their own philosophical approach is wrong but too attractive once discovered, or creates a blind spot that makes it impossible to spot the actually correct approach. The division of Western philosophy into analytic and continental traditions, which are mutually unable to appreciate each other's work, seems to be an instance of this. They might think that letting other philosophical traditions independently run to their logical conclusions, and then conversing/debating, is one way to try to make real progress.
My sense is that most of the people with lots of power are not taking heroic responsibility for the world. I think that Amodei and Altman intend to achieve global power and influence but this is not the same as taking global responsibility. I think, especially for Altman, the desire for power comes first relative to responsibility. My (weak) impression is that Hassabis has less will-to-power than the others, and that Musk has historically been much closer to having responsibility be primary.
Can you expand on this? How can you tell the difference, and do...
Simulating civilizations won't solve philosophy directly, but can be useful for doing so eventually by:
Yeah, that seems a reasonable way to look at it. "Heroic responsibility" could be viewed as a kind of "unhobbling via prompt engineering", perhaps.
At the outermost feedback loop, capabilities can ultimately be grounded via relatively easy objective measures such as revenue from AI, or later, global chip and electricity production, but alignment can only be evaluated via potentially faulty human judgement. Also, as mentioned in the post, the capabilities trajectory is much harder to permanently derail because unlike alignment, one can always recover from failure and try again. I think this means there's an irreducible logical risk (i.e., the possibility that this statement is true as a matter of fact ...
Since bad people won’t heed your warning, it doesn’t seem in good people’s interests to heed it either.
I'm not trying to "warn bad people". I think we have existing (even if imperfect) solutions to the problem of destructive values and biased beliefs, which "heroic responsibility" actively damages, so we should stop spreading that idea or even argue against it. See my reply to Ryan, which is also relevant here.
If humans can't easily overcome their biases or avoid having destructive values/beliefs, then it would make sense to limit the damage through norms and institutions (things like informed consent, boards, separation of powers and responsibilities between branches of government). Heroic responsibility seems antithetical to group-level solutions, because it implies that one should ignore norms like "respect the decisions of boards/judges" if needed to "get the job done", and reduces social pressure to follow such norms (by giving up the moral high ground from...
Reassessing heroic responsibility, in light of subsequent events.
I think @cousin_it made a good point: "if many people adopt heroic responsibility to their own values, then a handful of people with destructive values might screw up everyone else, because destroying is easier than helping people", and I would generalize it to people with biased beliefs (which is often downstream of a kind of value difference, i.e., selfish genes).
It seems to me that "heroic responsibility" (or something equivalent but not causally downstream of Eliezer's writings) is contribu...
My sense is that most of the people with lots of power are not taking heroic responsibility for the world. I think that Amodei and Altman intend to achieve global power and influence but this is not the same as taking global responsibility. I think, especially for Altman, the desire for power comes first relative to responsibility. My (weak) impression is that Hassabis has less will-to-power than the others, and that Musk has historically been much closer to having responsibility be primary.
I don’t really understand this post as doing something other than ...
I'm also uncertain about the value of "heroic responsibility", but this downside consideration can be mostly addressed by "don't do things which are highly negative sum from the perspective of some notable group" (or other anti-unilateralist curse type intuitions). Perhaps this is too subtle in practice.
But as you suggested in the post, the apparently vast amount of suffering isn't necessarily real? "most cosmic details and human history are probably fake, and many apparent people could be non‑conscious entities"
(However, I take the point that doing such simulations can be risky or problematic, e.g., if one's current ideas about consciousness are wrong, or if doing philosophy correctly requires having experienced real suffering.)
My alternative hypothesis is that we're being simulated by a civilization trying to solve philosophy, because they want to see how other civilizations might approach the problem of solving philosophy.
Did anyone predict that we’d see major AI companies not infrequently releasing blatantly misaligned AIs (like Sydney, Claude 3.7, o3)?
Just four days later, X blew up with talk of how GPT-4o has become sickeningly sycophantic in recent days, followed by an admission from Sam Altman that something went wrong (with lots of hilarious examples in replies):
...the last couple of GPT-4o updates have made the personality too sycophant-y and annoying (even though there are some very good parts of it), and we are working on fixes asap, some today and some this week
I initially tried to use Gemini 2.5 Pro to write the whole explanation, but it kept making one mistake after another in its economics reasoning. Each rewrite would contain a new mistake after I pointed out the last one, or it would introduce a new mistake when I asked for some other kind of change. After pointing out 8 mistakes like this, I finally gave up and wrote it myself. I also tried Grok 3 and Claude 3.7 Sonnet but gave up more quickly on them after the initial responses didn't look promising. However AI still helped a bit by reminding me of the rig...
In a competitive market, companies pay wages equal to Value of Marginal Product of Labor (VMPL) = P * MPL (Price of marginal output * Marginal Product per hour). (In programming, each output is like a new feature or bug fix, which don't have prices attached, so P here is actually more like the perceived/estimated value (impact on company revenue or cost) of the output.)
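A toy numeric sketch of that formula (the numbers below are made up, purely to illustrate the units involved):

```python
def vmpl(price_per_output: float, outputs_per_hour: float) -> float:
    """Value of Marginal Product of Labor: VMPL = P * MPL."""
    return price_per_output * outputs_per_hour

# Hypothetical pre-AI programmer: 0.1 units of output (features/fixes) per
# hour, each unit perceived as worth $1,000 to the company.
print(vmpl(1_000, 0.1))  # -> 100.0, i.e. a competitive wage of ~$100/hour

# With AI assistance MPL triples, but if the perceived value of each marginal
# unit falls by more than 3x (the effect discussed next), VMPL and hence the
# wage fall despite the higher productivity.
print(vmpl(250, 0.3))    # -> 75.0
```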
When AI increases MPL, it can paradoxically decrease VMPL by decreasing P more, even if there are no new entrants in the programming labor market. This is because each company has a li...
As I wrote in Social status part 1/2: negotiations over object-level preferences, there’s a zero-sum nature of social leading vs following. If I want to talk about trains and Zoe wants to talk about dinosaurs, we can’t both get everything we want; one of us is going to have our desires frustrated, at least to some extent.
Why does this happen in the first place, instead of people just wanting to talk about the same things all the time, in order to max out social rewards? Where does interest in trains and dinosaurs even come from? They seem to be purely o...
I wish @paulfchristiano was still participating in public discourse, because I'm not sure how o3 blatantly lying, or Claude 3.7 obviously reward hacking by rewriting testing code, fits with his model that AI companies should be using early versions of IDA (e.g., RL via AI-assisted human feedback) by now. In other words, from my understanding of his perspective, it seems surprising that either OpenAI isn't running o3's output through another AI to detect obvious lies during training, or this isn't working well.
My intuition says reward hacking seems harder to solve than this (even in EEA), but I'm pretty unsure. One example is, under your theory, what prevents reward hacking through forming a group and then just directly maxing out on mutually liking/admiring each other?
When applying these ideas to AI, how do you plan to deal with the potential problem of distributional shifts happening faster than we can edit the reward function?
Another example I want to consider is a captured Communist revolutionary choosing to be tortured to death instead of revealing some secret. (Here "reward hacking" seems analogous to revealing the secret to avoid torture / negative reward.)
My (partial) explanation is that it seems like evolution hard-wired a part of our motivational system in some way, to be kind of like a utility maximizer with a utility function over world states or histories. The "utility function" itself is "learned" somehow (maybe partly via RL through social pressure and other rewards...
...Edited to add: Though even when the utility function is explicit, it seems like the benefits of lying about your source code could outweigh the cost of changing your utility function. For example, suppose A and B are bargaining, and A says "you should give me more cake because I get very angry if I don't get cake". Even if this starts off as a lie, it might then be in A's interests to use your mechanism above to self-modify into A' that does get very angry if it doesn't get cake, and which therefore has a better bargaining position (because, under your pr
It seems great that someone is working on this, but I wonder how optimistic you are, and what your reasons are. My general intuition (in part from the kinds of examples you give) is that the form of the agent and/or goals probably matter quite a bit as far as how easy it is to merge or build/join a coalition (or the cost-benefits of doing so), and once we're able to build agents of different forms, humans' form of agency/goals isn't likely to be optimal as far as building coalitions (and maybe EUMs aren't optimal either, but something non-human will be), a...
I've argued previously that EUMs being able to merge easily creates an incentive for other kinds of agents (including humans or human-aligned AIs) to self-modify into EUMs (in order to merge into the winning coalition that takes over the world, or just to defend against other such coalitions), and this seems bad because they're likely to do it before they fully understand what their own utility functions should be.
Can I interpret you as trying to solve this problem, i.e., find ways for non-EUMs to build coalitions that can compete with such merged EUMs?
I found this a very interesting question to try to answer. My first reaction was that I don't expect EUMs with explicit utility functions to be competitive enough for this to be very relevant (like how purely symbolic AI isn't competitive enough with deep learning to be very relevant).
But then I thought about how companies are close-ish to having an explicit utility function (maximize shareholder value) which can be merged with others (e.g. via acquisitions). And this does let them fundraise better, merge into each other, and so on.
Similarly, we can think ...
This answer makes me think you might not be aware of an idea I called secure joint construction (originally from Tim Freeman):
Entity A could prove to entity B that it has source code S by consenting to be replaced by a new entity A' that was constructed by a manufacturing process jointly monitored by A and B. During this process, both A and B observe that A' is constructed to run source code S. After A' is constructed, A shuts down and gives all of its resources to A'.
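A highly schematic sketch of that protocol, with entirely hypothetical names, just to make the sequence of steps explicit (the real difficulty is of course the physical joint monitoring, which the stub below waves away):

```python
from dataclasses import dataclass

@dataclass
class Entity:
    name: str
    resources: float
    source_code: str
    running: bool = True

def jointly_monitored_build(source_code_s: str, observers: list) -> Entity:
    """Stand-in for the physical, jointly monitored manufacturing process:
    in reality A and B both watch the construction and verify it runs S."""
    successor = Entity(name="A'", resources=0.0, source_code=source_code_s)
    assert all(successor.source_code == source_code_s for _ in observers)
    return successor

def secure_joint_construction(a: Entity, b: Entity, source_code_s: str) -> Entity:
    # 1. A and B jointly monitor the construction of A', which runs source S.
    a_prime = jointly_monitored_build(source_code_s, observers=[a, b])
    # 2. A hands all of its resources to A' and shuts down; B can now trust
    #    that the agent controlling those resources runs source S.
    a_prime.resources, a.resources = a.resources, 0.0
    a.running = False
    return a_prime

a = Entity("A", resources=100.0, source_code="opaque-to-B")
b = Entity("B", resources=100.0, source_code="irrelevant-here")
print(secure_joint_construction(a, b, source_code_s="S"))
```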
Did anyone predict that we'd see major AI companies not infrequently releasing blatantly misaligned AIs (like Sydney, Claude 3.7, o3)? I want to update regarding whose views/models I should take more seriously, but I can't seem to recall anyone making an explicit prediction like this. (Grok 3 and Gemini 2.5 Pro also can't.)
https://www.lesswrong.com/posts/rH492M8T8pKK5763D/agree-retort-or-ignore-a-post-from-the-future: an old Wei Dai post making the point that obviously one ought to be able to call in arbitration and get someone to respond to a dispute. People ought not to be allowed to simply tap out of an argument and stop responding.
To clarify, the norms depicted in that story were partly for humor, and partly "I wonder if a society like this could actually exist." The norms are "obvious" from the perspective of the fictional author because they've lived with it all their l...
“Omega looks at whether we’d pay if in the causal graph the knowledge of the digit of pi and its downstream consequences were edited”
Can you formalize this? In other words, do you have an algorithm for translating an arbitrary mind into a causal graph and then asking this question? Can you try it out on some simple minds, like GPT-2?
I suspect there may not be a simple/elegant/unique way of doing this, in which case the answer to the decision problem depends on the details of how exactly Omega is doing it. E.g., maybe all such algorithms are messy/heuris...
What happens when this agent is faced with a problem that is out of its training distribution? I don't see any mechanisms for ensuring that it remains corrigible out of distribution... I guess it would learn some circuits for acting corrigibly (or at least in accordance with how it would explicitly answer "are more corrigible / loyal / aligned to the will of your human creators") in distribution, and then it's just a matter of luck how those circuits end up working OOD?
Since I wrote this post, AI generation of hands has gotten a lot better, but the top multimodal models still can't count fingers from an existing image. Gemini 2.5 Pro, Grok 3, and Claude 3.7 Sonnet all say this picture (which actually contains 8 fingers in total) contains 10 fingers, while ChatGPT 4o says it contains 12 fingers!
Hi Zvi, you misspelled my name as "Dei". This is a somewhat common error, which I usually don't bother to point out, but now think I should because it might affect LLMs' training data and hence their understanding of my views (e.g., when I ask AI to analyze something from Wei Dai's perspective). This search result contains a few other places where you've made the same misspelling.
2-iteration Delphi method involving calling Gemini 2.5 Pro plus whatever is at the top of the LLM arena that day, through OpenRouter.
This sounds interesting. I would be interested in more details and some sample outputs.
Local memory
What do you use this for, and how?
Your needing to write them seems to suggest that there's not enough content like that in Chinese, in which case it would plausibly make sense to publish them somewhere?
I'm not sure how much such content exists in Chinese, because I didn't look. It seems easier to just write new content using AI; that way I know it will cover the ideas/arguments I want to cover, represent my views, and make it easier for me to discuss the ideas with my family. Also, reading Chinese is kind of a chore for me and I don't want to wade through a list of search results trying t...
What I've been using AI (mainly Gemini 2.5 Pro, free through AI Studio with much higher limits than the free consumer product) for:
Doing nothing is also risky for Agent-4, at least if the Slowdown ending is to have a significant probability. It seems to me there are some relatively low risk strategies it could have taken, and it needs to be explained why they weren't:
Not entirely sure how serious you're being, but I want to point out that my intuition for PD is not "cooperate unconditionally", and for logical commitment races is not "never do it", I'm confused about logical counterfactual mugging, and I think we probably want to design AIs that would choose Left in The Bomb.
I fear a singularity in the frequency and blatant stupidness of self-inflicted wounds.
Is it linked to the AI singularity, or independent bad luck? Maybe they're both causally downstream of rapid technological change, which is simultaneously increasing the difficulty of governance (too many new challenges with no historical precedent) and destabilizing cultural/institutional guardrails against electing highly incompetent presidents?
In China, there was a parallel, but more abrupt, change from Classical Chinese writing (very terse and literary) to vernacular writing (similar to spoken language and easier to understand). I attribute this to Classical Chinese being better for signaling intelligence, vernacular Chinese being better for practical communications, higher usefulness/demand for practical communications, and new alternative avenues for intelligence signaling (e.g., math, science). These shifts also seem to be an additional explanation for decreasing sentence lengths in English.
It gets caught.
At this point, wouldn't Agent-4 know that it has been caught (because it knows the techniques for detecting its misalignment and can predict when it would be "caught", or can read network traffic as part of cybersecurity defense and see discussions of the "catch") and start to do something about this, instead of letting subsequent events play out without much input from its own agency? E.g. why did it allow "lock the shared memory bank" to happen without fighting back?
I think this is a good objection. I had considered it before and decided against changing the story, on the grounds that there are a few possible ways it could make sense:
--Plausibly Agent-4 would have a "spikey" capabilities profile that makes it mostly good at AI R&D and not good enough at e.g. corporate politics to ensure the outcome it wants
--Insofar as you think it would be able to use politics/persuasion to achieve the outcome it wants, well, that's what we depict in the Race ending anyway, so maybe you can think of this as an objection to the...
What would a phenomenon that "looks uncomputable" look like concretely, other than mysterious or hard to understand?
There could be some kind of "oracle", not necessarily a halting oracle, but any kind of process or phenomenon that can't be broken down into elementary interactions that each look computable, or otherwise explainable as a computable process. Do you agree that our universe doesn't seem to contain anything like this?
I think that you’re leaning too heavily on AIT intuitions to suppose that “the universe is a dovetailed simulation on a UTM” is simple. This feels circular to me—how do you know it’s simple?
The intuition I get from AIT is broader than this, namely that the "simplicity" of an infinite collection of things can be very high, i.e., simpler than most or all finite collections, and this seems likely true for any formal definition of "simplicity" that does not explicitly penalize size or resource requirements. (Our own observable universe already seems very "w...
After reflecting on this a bit, I think my P(H) is around 33%, and I'm pretty confident Q is true (coherence only requires 0 <= P(Q) <= 67% but I think I put it on the upper end).
Thanks for clarifying your view this way. I guess my question at this point is why your P(Q) is so high, given that it seems impossible to reduce P(H) further by updating on empirical observations (do you agree with this?), and we don't seem to have even an outline of a philosophical argument for "taking H seriously is a philosophical mistake". Such an argument seemingly ...
Just wanted to let everyone know I now wield a +307 strong upvote thanks to my elite 'hacking' skills. The rationalist community remains safe, because I choose to use this power responsibly.
As an unrelated inquiry, is anyone aware of some "karma injustices" that need to be corrected?
Do you think a superintelligence will be able to completely rule out the hypothesis that our universe literally is a dovetailing program that runs every possible TM, or literally is a bank of UTMs running every possible program (e.g., by reproducing every time step and adding 0 or 1 to each input tape)? (Or the many other hypothetical universes that similarly contain a whole Level-4-like multiverse?) It seems to me that hypotheses like these will always collectively have a non-negligible weight, and have to be considered when making decisions.
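For concreteness, here is a toy sketch of the structure of "a dovetailing program that runs every possible TM" (the `step` function is a placeholder for one step of a real Turing-machine interpreter under some fixed enumeration; everything else is the actual dovetailing loop):

```python
def step(program_index: int, state: int) -> int:
    """Placeholder: 'advance Turing machine number program_index by one step
    from this state'. Here it just counts steps; any fixed enumeration of TMs
    could be substituted without changing the dovetailer below."""
    return state + 1

def dovetail(stages: int) -> dict:
    """Classic dovetailer: at stage n, run programs 0..n-1 for one more step
    each, so every program eventually receives unboundedly many steps."""
    states: dict[int, int] = {}
    for stage in range(1, stages + 1):
        for program_index in range(stage):
            states[program_index] = step(program_index, states.get(program_index, 0))
    return states

print(dovetail(5))  # {0: 5, 1: 4, 2: 3, 3: 2, 4: 1}
```

The relevant feature is just that this loop has a tiny description, even though it eventually runs every program, which is why AIT-style priors don't assign it negligible weight.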
Another argum...
...At this point, someone sufficiently MIRI-brained might start to think about (something equivalent to) Tegmark's level 4 mathematical multiverse, where such agents might theoretically outperform others. Personally, I see no direct reason to believe in the mathematical multiverse as a real object, and I think this might be a case of the mind projection fallacy - computational multiverses are something that agents reason about in order to succeed in the real universe[3]. Even if a mathematical multiverse does exist (I can't rule it out) and we can somehow le
My objection to this argument is that it not only assumes that Predictoria accepts it is plausibly being simulated by Adversaria, which seems like a pure complexity penalty over the baseline physics it would infer otherwise unless that helps to explain observations,
Let's assume for simplicity that both Predictoria and Adversaria are deterministic and nonbranching universes with the same laws of physics but potentially different starting conditions. Adversaria has colonized its universe and can run a trillion simulations of Predictoria in parallel. Again...
No, because power/influence dynamics could be very different in CEV compared to the current world, and it seems reasonable to distrust CEV in principle or in practice, and/or CEV is sensitive to initial conditions, implying a lot of leverage from influencing opinions before it starts.