Researching donation opportunities. Previously: ailabwatch.org.
You suspect someone in your community is a bad actor. Kinds of reasons not to move against them:
This is all from first principles. I'm interested in takes and reading recommendations, and also in creative affordances / social technology.
The convenient case would be that you can privately get the powerful stakeholders on board and then they all oust the bad actor and it's a fait accompli, so there's no protracted conflict and minimal cost to you and your community. If you can't get the powerful stakeholders on board, I guess you give up and just privately share information with people you trust — unless you're sufficiently powerful/independent/ostracized that it's cheap for you to make enemies, or it's sufficiently important, or you can share information anonymously. If you're scared about telling the powerful stakeholders, I guess it's the same situation.
This is assuming that the main audience for your beliefs is some powerful stakeholders. Sometimes it's more like many community members.
I agree but don't feel very strongly. On Anthropic security, I feel even sadder about this.
If [the insider exception] does actually apply to a large fraction of technical employees, then I'm also somewhat skeptical that Anthropic can actually be "highly protected" from (e.g.) organized cybercrime groups without meeting the original bar: hacking an insider and using their access is typical!
You may be interested in ailabwatch.org/resources/corporate-documents, which links to a folder where I have uploaded ~all past versions of the CoI. (I don't recommend reading it, although afaik the only lawyers who've read the Anthropic CoI are Anthropic lawyers and advisors, so it might be cool if one independent lawyer read it from a skeptical/robustness perspective. And I haven't even done a good job of diffing the current version against past versions; I wasn't aware of the thing Drake highlighted.)
I guess so! Is there reason to favor logit?
Yep, e.g. earlier donations are better for getting endorsements. Especially for Bores, and somewhat for Wiener, I think.
There's often a logistic curve for success probabilities, you know? The distances are measured in multiplicative odds, not additive percentage points. You can't take a project like this and assume that by putting in some more hard work, you can increase the absolute chance of success by 10%. More like, the odds of this project's failure versus success start out as 1,000,000:1, and if we're very polite and navigate around Mr. Topaz's sense that he is higher-status than us and manage to explain a few tips to him without ever sounding like we think we know something he doesn't, we can quintuple his chances of success and send the odds to 200,000:1. Which is to say that in the world of percentage points, the odds go from 0.0% to 0.0%. That's one way to look at the “law of continued failure”.
If you had the kind of project where the fundamentals implied, say, a 15% chance of success, you’d then be on the right part of the logistic curve, and in that case it could make a lot of sense to hunt for ways to bump that up to a 30% or 80% chance.
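To make that arithmetic concrete, here's a minimal sketch of the odds-to-probability conversion (the helper names are mine, and the 1,000,000:1 and 15% figures are just the illustrative numbers from above): the same 5x odds multiplier is invisible at one end of the logistic curve and decisive in the middle.

```python
def odds_against_to_prob(odds_against: float) -> float:
    """Convert 'X:1 against' odds into a probability of success."""
    return 1 / (odds_against + 1)

def multiply_odds(p: float, factor: float) -> float:
    """Multiply the odds in favor of success by `factor` and return the new probability."""
    odds_for = p / (1 - p)
    new_odds = factor * odds_for
    return new_odds / (1 + new_odds)

# Mr. Topaz's project: quintupling the odds barely moves the percentage.
p0 = odds_against_to_prob(1_000_000)   # ~0.0001%
p1 = multiply_odds(p0, 5)              # ~0.0005%, i.e. still ~0.0%

# A project with decent fundamentals: the same multiplier matters a lot.
q0 = 0.15
q1 = multiply_odds(q0, 5)              # ~47%

print(f"{p0:.5%} -> {p1:.5%}")  # 0.00010% -> 0.00050%
print(f"{q0:.0%} -> {q1:.0%}")  # 15% -> 47%
```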
I observe that https://www.lesswrong.com/posts/BqwXYFtpetFxqkxip/mikhail-samin-s-shortform?commentId=dtmeRXPYkqfDGpaBj isn't frontpage-y but remains on the homepage even after many mods have seen it. This suggests that the mods were just patching the hack. (But I don't know what other shortforms they've hidden, besides the political ones, if any.)
fwiw I agree with most but not all details, and I agree that Anthropic's commitments and policy advocacy have a bad track record, but I think Anthropic's capabilities work is nevertheless net positive, because Anthropic has way more capacity and propensity to do safety stuff than other frontier AI companies.
I wonder what you believe about Anthropic's likelihood of noticing risks from misalignment relative to other companies, or of someday spending >25% of internal compute on (automated) safety work.
Concepts: inconvenience and flinching away.
I've been working for 3.5 years. Until two months ago, I did independent-ish research where I thought about stuff and tried to publicly write true things. For the last two months, I've been researching donation opportunities. This is different in several ways. Relevant here: I'm working with a team, and there's a circle of people around me with some beliefs and preferences related to my work.
I have some new concepts related to these changes. (Not claiming novelty.)
First is "flinching away": when I don't think about something because it's awkward/scary/stressful. I haven't noticed my object-level beliefs skewed by flinching away, but I've noticed that certain actions should be priorities but come to mind less than they should. In particular: doing something about a disagreement with the consensus or status quo or something an ally is doing. I maybe fixed this by writing the things I'm flinching away from in a google doc when I notice them so I don't forget (and now having the muscle of noticing similar things). It would still be easier if it wasn't the case that (it's salient to me that) certain conclusions and actions are more popular than others.
Second is convenience. Convenience is orthogonal to truth. Examples of convenient things: considerations in favor of conclusions my circle agrees with; considerations that make the answer look more clear/overdetermined. I might flinch away from inconvenient upshots for my actions. (I think I'm not biased by convenience, i.e., not doing motivated reasoning, when forming object-level beliefs within the scope of my day-to-day work; possibly I'm unusually good at this.) I've noticed myself saying stuff like "conveniently, X" and "Y, that's inconvenient" a lot recently. Noticing feelings about considerations has felt helpful, but I don't know/recall why. Possibly it's mostly useful when thinking about stuff with others: saying "that's convenient" is a flag to check whether it's suspiciously convenient; saying "that's inconvenient" is a flag to make sure to take it seriously.