Dan H

newsletter.safe.ai

newsletter.mlsafety.org

Sequences

Cost-Effectiveness Models for AI Safety
Catastrophic Risks From AI
CAIS Philosophy Fellowship Midpoint Deliverables
Pragmatic AI Safety

Comments

capability thresholds be vague or extremely high

xAI's thresholds are entirely concrete and not extremely high.

evaluation be unspecified or low-quality

They are specified and as high-quality as you can get. (If there are better datasets let me know.)

I'm not saying it's perfect, but I wouldn't put them all in the same bucket. Meta's is very different from DeepMind's or xAI's.

though I don't think xAI took an official position one way or the other

I assumed most everybody assumed xAI supported it since Elon did. I didn't bother pushing for an additional xAI endorsement given that Elon endorsed it.

It's probably worth them mentioning, for completeness, that Nat Friedman funded an earlier version of the dataset too. (I was advising at that time and provided the main recommendation: that it needs to be research-level, because they were focusing on Olympiad level.)

Also, I can confirm they aren't giving access to the mathematicians' questions to AI companies other than OpenAI, such as xAI.

and have clearly been read a non-trivial amount by Elon Musk

Nit: He heard this idea in conversation with an employee AFAICT.

Relevant: Natural Selection Favors AIs over Humans

universal optimization algorithm

Evolution is not an optimization algorithm (this is a common misconception discussed in Okasha, Agents and Goals in Evolution).

We have been working for months on this issue and have made substantial progress on it: Tamper-Resistant Safeguards for Open-Weight LLMs

General article about it: https://www.wired.com/story/center-for-ai-safety-open-source-llm-safeguards/

It's worth noting that activations are one thing you can modify, but many of the most performant methods (e.g., LoRRA) modify the weights. (Representations = {weights, activations}, hence "representation" engineering.)

"Bay Area EA alignment community"/"Bay Area EA community"? (Most EAs in the Bay Area are focused on alignment compared to other causes.)

The AI safety community is structurally power-seeking.

I don't think the set of people interested in AI safety is even a "community" given how diverse it is (Bengio, Brynjolfsson, Song, etc.), so I think it'd be more accurate to say "Bay Area AI alignment community is structurally power-seeking."
