Yep
Is it OK for LW admins to look at DM metadata for spam prevention reasons?
Sometimes new users show up and spam a bunch of other users in DMs (in particular high-profile users). We can't limit DM usage to only users with activity on the site, because many valuable DMs get sent by people who don't want to post publicly. We have some basic rate limits for DMs, but of course those can't capture many forms of harassment or spam.
Right now, admins can only see how many DMs users have sent, and not who users have messaged, without making a whole manual...
Could this be made a report-based system? If a user reports potential spam, the submission process could ask for reasons and for consent to look over the messages (between the reporter and the alleged spammer); if multiple people report the same person, it becomes obvious that the account is spamming via DM.
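The threshold idea above could be sketched as a minimal consent-gated report tracker. This is just an illustration; the class name, threshold value, and field names are all hypothetical:

```python
from collections import defaultdict

REPORT_THRESHOLD = 3  # hypothetical: flag after this many independent reporters


class DMReportTracker:
    """Track spam reports against DM senders (illustrative sketch)."""

    def __init__(self, threshold=REPORT_THRESHOLD):
        self.threshold = threshold
        self.reporters = defaultdict(set)  # accused user -> set of reporter ids

    def report(self, accused, reporter, consented_to_review):
        """Record a report; it only counts if the reporter consents
        to admins reviewing the messages between the two parties."""
        if consented_to_review:
            self.reporters[accused].add(reporter)
        return self.is_flagged(accused)

    def is_flagged(self, accused):
        # Multiple independent reporters make DM spam obvious.
        return len(self.reporters[accused]) >= self.threshold


tracker = DMReportTracker()
tracker.report("spammer42", "alice", True)
tracker.report("spammer42", "bob", True)
flagged = tracker.report("spammer42", "carol", True)  # third reporter trips the flag
```

Counting distinct reporters (a set, not a tally) also means one angry user re-reporting repeatedly can't trip the flag alone.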
edit: just saw previous comment on this too
Suppose there is 10 years of monumental progress in mechanistic interpretability. We can roughly - not exactly - explain any output that comes out of a neural network. We can do experiments where we put our AIs in interesting situations and make a very good guess at their hidden reasons for doing the things they do.
Doesn't this sound a bit like where we currently are with models that operate with a hidden chain of thought? If you don't think an AGI built under the current fingers-crossed-it's-faithful paradigm would be safe, what success rate would mech interp have to hit to beat that?
Seems like 99%+ to me.
I very much agree. Do we really think we're going to track a human-level AGI's, let alone a superintelligence's, every thought, and do it in ways it can't dodge if it decides to?
I strongly support mechinterp as a lie detector, and it would be nice to have more of it, as long as we don't use it and control methods to replace actual alignment work and careful thinking. The amount of effort going into interp relative to its theory of impact seems a bit strange to me.
One thing I notice in AI safety career and strategy discussions is that there is a lot of epistemic helplessness in regard to AGI timelines. People often talk about "your timelines" instead of "timelines" when giving advice, even if they disagree strongly with the timelines. I think this habit causes people to ignore disagreements in unhelpful ways.
Here's one such conversation:
Bob: Should I do X if my timelines are 10 years?
Alice (who has 4 year timelines): I think X makes sense if your timelines are l...
In general, it is difficult to give advice if whether the advice is good depends on background facts that giver and recipient disagree about. I think the most honest approach is to explicitly state what your advice depends on when you think the recipient is likely to disagree. E.g. "I think living at high altitude is bad for human health, so in my opinion you shouldn't retire in Santa Fe."
If I think AGI will arrive around 2055, and you think it will arrive in 2028, what is achieved by you saying "given timelines, I don't think your mechinterp project will ...
Given the OpenAI o3 results making it clear that you can pour more compute into solving problems, I'd like to announce that I will be mentoring at SPAR for an automated interpretability research project using AIs with inference-time compute.
I truly believe that the AI safety community is dropping the ball on this angle of technical AI safety and that this work will be a strong precursor of what's to come.
Note that this work is a small part in a larger organization on automated AI safety I’m currently attempting to build.
Here’s the link: https://airtable.com/ap...
links 11/20/2024: https://roamresearch.com/#/app/srcpublic/page/12-20-2024
Physical object.
I might (20%) make a run of buttons that display how long it's been since you pressed them, e.g. so I can push the button in the morning when I've put in my anti-baldness hair stuff and then not have to wonder whether I did.
Would you be interested in buying such a thing?
Perhaps they could have a dry-wipe section so you can write what the button is for.
If you would, please upvote the attached comment.
I use daily checklists, in spreadsheet form, for this.
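A software stand-in for such a button, a script that records the last press and reports hours elapsed, might look like this (the state-file name and labels are hypothetical):

```python
import json
import time
from pathlib import Path

STATE = Path("button_presses.json")  # hypothetical state file


def press(name):
    """Record that button `name` was pressed just now."""
    data = json.loads(STATE.read_text()) if STATE.exists() else {}
    data[name] = time.time()
    STATE.write_text(json.dumps(data))


def hours_since(name):
    """Hours since `name` was last pressed, or None if never pressed."""
    data = json.loads(STATE.read_text()) if STATE.exists() else {}
    return (time.time() - data[name]) / 3600 if name in data else None


press("hair stuff")
elapsed = hours_since("hair stuff")  # ~0 right after pressing
```

A spreadsheet row per day does the same job; the physical button just removes the friction of opening anything.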
I’d like to hire cognitive assistants and tutors more often. This could (potentially) be you, or people you know. Please let me know if you’re interested or have recommendations.
By “cognitive assistant” I mean a range of things, but the core thing is “sit next to me, and notice when I seem like I’m not doing the optimal thing, and check in with me.” I’m interested in advanced versions who have particular skills (like coding, or Applied Quantitivity, or good writing, or research taste) who can also be tutoring me as we go.
I’d like a large rolodex of such pe...
I decided to conduct an experiment at NeurIPS this year: I randomly surveyed people walking around in the conference hall to ask whether they had heard of AGI.
I found that out of 38 respondents, only 24 (63%) could tell me what AGI stands for.
we live in a bubble
I'd be surprised if this were the case. At the next NeurIPS I can survey some non-native English speakers to see how many ML terms they know in English vs. in their native language. I'm confident in my ability to administer this experiment with Chinese, French, and German speakers, which won't be an unbiased sample of non-native speakers, but hopefully still provides some signal.
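For what it's worth, 24 of 38 is a small sample, and a quick Wilson score interval (a standard binomial confidence interval, computed here from scratch) shows how wide the uncertainty on that 63% is:

```python
import math


def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin


# Survey: 24 of 38 respondents knew what AGI stands for.
lo, hi = wilson_ci(24, 38)
print(f"{24/38:.0%} point estimate, 95% CI roughly ({lo:.0%}, {hi:.0%})")
```

The interval spans roughly the high-40s to mid-70s percent, so the headline number is soft, though the qualitative "we live in a bubble" point survives either way.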
Gemini 2.0 Flash Thinking is claimed to 'transparently show its thought process' (in contrast to o1, which only shows a summary): https://x.com/denny_zhou/status/1869815229078745152. This might be at least a bit helpful in terms of studying how faithful (e.g. vs. steganographic, etc.) the Chains of Thought are.
Other recent models that show (at least purportedly) the full CoT:
[this is a draft. I strongly welcome comments]
A blockade of Taiwan seems significantly more likely than a full-scale invasion. The US's non-intervention in Ukraine suggests similar restraint might occur with Taiwan.
Nevertheless, Metaculus predicts a 65% chance of US military response to a Chinese invasion and separately gives 20-50% for some kind of Chinese military intervention by 2035. Let us imagine that the worst comes to pass and China and the United States are engaged in a hot war.
China's...
Damn! Dark forest vibes, very cool stuff!
Reference for the sub collision: https://en.wikipedia.org/wiki/HMS_Vanguard_and_Le_Triomphant_submarine_collision
And here's another one!
https://en.wikipedia.org/wiki/Submarine_incident_off_Kildin_Island
Might as well start equipping them with fenders at this point.
And 2050 basically means post-AGI at this point. ;)
I *despise* the category of "social construction." "Race is a social construct." "Gender is a social construct."
It confusingly conflates these four:
- Socially Constituted: Something that exists only because we say it does; it is purely conceptual.
- Socially Caused: A thing in the world that only exists because we chose to make it exist.
- Socially Chunked: Something that exists as a spectrum in the real world that we—with some arbitrariness—chunk into discrete categories.
- Socially Charged: Something that exists in the real world that we decide to assign social importa...
This is convincing me to buy a sleep mask and blackout curtains. One man's modus ponens is another man's modus tollens as they say.
Ok, I really don't get why my post announcing my book got downvoted (ignored is one thing, downvoted quite another)...
Update: when I made this post the original post was on 5 Karma and out of the frontpage. Now it's 15 Karma, which is about what I expected it to get, given that it's not a core topic of LW and doesn't have a lot of information (though I now added some more information at the end), so I'm happy. Just a bit of a bummer that I had to make this post to get the original post out of the pit of obscurity it was pushed into by noise.
I would need more data to make an opinion on this.
At first sight, it seems to me like having a rule "if your total karma is less than 100, you are not allowed to downvote an article or a comment if doing so would push it under zero" would be good.
But I have no idea how often that happens in real life. Do we actually have many readers with karma below 100 who bother to vote?
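The proposed rule is simple enough to state as code. This sketch assumes plain one-point votes (strong votes would need extra handling) and treats the 100-karma floor as the suggested parameter:

```python
def downvote_allowed(voter_karma, target_score, karma_floor=100):
    """Proposed rule: voters below `karma_floor` karma may not cast a
    downvote that would push a post or comment below zero.
    Assumes each downvote is worth exactly one point."""
    if voter_karma >= karma_floor:
        return True  # established users vote freely
    return target_score - 1 >= 0  # must not push the score negative


low_karma_blocked = downvote_allowed(voter_karma=50, target_score=0)    # would go to -1
established_ok = downvote_allowed(voter_karma=150, target_score=0)      # no restriction
low_karma_ok = downvote_allowed(voter_karma=50, target_score=5)         # stays positive
```

Answering the empirical question (how often low-karma readers actually cast such votes) would of course need vote-log data that only admins have.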
By the way, I didn't vote on your article, but... you announced that you were writing a book i.e. it is not even finished, you didn't provide a free chapter or something... so what exact...
How to Poison the Water?
I think we've all heard the saying about the fish and the water (the joke goes, an old fish asks the young fish about the water, and the young fish ask, "What's water?").
I'm curious about the key failure modes or methods that tend to "poison the water", i.e. destroy or negatively alter an organization's or scene's culture and norms. Are there major patterns that communities tend to fall into as they self-destruct?
Would love for anyone to share resources or general reflections on this - I'm currently part of an (unrelated) community where I see th...
In my experience, I only remember one example of a successful "coup". It was a private company that started small and then became wildly successful. Two key employees were savvy enough to realize that this was not necessarily good news for them. The founders would definitely become rich. But a rich company will hire more employees, which means the relative importance of each one of them will decrease. And the founders' position towards those two would probably become something like: "okay guys, you spent a decade working hard to make all of t...
I just saw a post from AI Digest on a Self-Awareness benchmark and I just thought, "holy fuck, I'm so happy someone is on top of this".
I noticed a deep gratitude for the alignment community for taking this problem so seriously. I personally see many good futures but that’s to some extent built on the trust I have in this community. I'm generally incredibly impressed by the rigorous standards of thinking, and the amount of work that's been produced.
When I was a teenager I wanted to join a community of people who worked their ass off in order to make sure hu...
Yes, problems; yes, people are being really stupid; yes, inner alignment and all of its cousins are really hard to solve. We're generally a bit fucked, I agree. The brick wall is so high we can't see the edge, and we have to bash out each brick one at a time, and it is hard, really hard.
I get it people, and yet we've got a shot, don't we? The probability distribution of all potential futures is being dragged towards better futures because of the work you put in and I'm very grateful for that.
Like, I don't know how much credit to give LW and the alignment com...
Parable of the Purple Creatures
Some Medieval townsfolk thought witches were poisoning their wells. Witches, of course, are people—often ugly—who are in league with Satan, can do magic, and have an affinity for broomsticks. These villagers wanted to catch the witches so they could punish them. Everyone in the town felt that witches should be punished. So they set up a commission to investigate the matter.
Well, it turned out little purple alien creatures were poisoning their wells.
They *weren’t* human.
They *couldn’t* do magic.
They *weren’t*...
Same. It's especially true because if a knight saw a T-Rex he wouldn't hesitate to call it a dragon. He wouldn't be like, "What this is is ambiguous."
Most murder mysteries on TV tend to have a small number of suspects, and the trick is to find which one did it. I get the feeling that with real-life murders, the police either have absolutely no idea who did it, or know exactly who did it and just need to prove it to the satisfaction of a court of law.
That explains why forensic tests (e.g. fingerprints) are used despite being pretty suspect. They convince the jury that the guilty guy did it, which is all that matters.
See https://issues.org/mnookin-fingerprints-evidence/ for more on fingerprints.