Zac Hatfield-Dodds

Technical staff at Anthropic, previously #3ainstitute; interdisciplinary, interested in everything; ongoing PhD in CS (learning / testing / verification), open sourcerer, more at zhd.dev

Comments

It's a sparse autoencoder because part of the loss function is an L1 penalty encouraging sparsity in the hidden layer. Otherwise, it would indeed learn a simple identity map!
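A minimal sketch of that loss in PyTorch (the layer sizes and L1 coefficient are illustrative placeholders, not taken from any particular paper):

```python
import torch
import torch.nn.functional as F

class SparseAutoencoder(torch.nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = torch.nn.Linear(d_model, d_hidden)
        self.decoder = torch.nn.Linear(d_hidden, d_model)

    def forward(self, x):
        hidden = F.relu(self.encoder(x))  # hidden activations we want to be sparse
        return self.decoder(hidden), hidden

def sae_loss(model, x, l1_coeff=1e-3):
    reconstruction, hidden = model(x)
    # Reconstruction error alone would be minimised by (approximately) an identity map;
    # the L1 penalty on the hidden activations is what pushes the code towards sparsity.
    return F.mse_loss(reconstruction, x) + l1_coeff * hidden.abs().sum(dim=-1).mean()
```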

you CAN predict that there will be evidence with equal probability of each direction.

More precisely, the expected value of upwards and downwards updates should be the same; it's nonetheless possible to be very confident that you'll update in a particular direction - offset by a much larger and proportionately less likely update in the other.

For example, I have some chance of winning the lottery this year, not much lower than if I actually bought a ticket. I'm very confident that each day I'll give somewhat lower odds (as there's less time remaining), but being credibly informed that I've won would radically change the odds such that the expectation balances out.
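To make the arithmetic concrete, here's a toy calculation (the probabilities are made-up illustrations, not real lottery odds): most days bring a small downward update, the rare winning announcement brings a huge upward one, and the expectation works out to exactly the prior.

```python
p_win = 1e-7                   # made-up prior probability of winning this year
p_news_today = p_win / 365     # chance the winning announcement arrives on a given day

posterior_if_news = 1.0
# Bayes: P(win | no news today) = P(win and no news today) / P(no news today)
posterior_if_no_news = (p_win - p_news_today) / (1 - p_news_today)

expected_posterior = (p_news_today * posterior_if_news
                      + (1 - p_news_today) * posterior_if_no_news)
assert abs(expected_posterior - p_win) < 1e-15  # expectation equals the prior
```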

I agree that there's no substitute for thinking about this for yourself, but I think that morally or socially counting "spending thousands of dollars on yourself, an AI researcher" as a donation would be an appalling norm. There are already far too many unmanaged conflicts of interest and trust-me-it's-good funding arrangements in this space for my taste, and I think it leads to poor epistemic norms as well as social and organizational dysfunction. I think it's very easy for donating to people or organizations in your social circle to have substantial negative expected value.

I'm glad that funding for AI safety projects exists, but the >10% of my income I donate will continue going to GiveWell.

Trivially true to the extent that you're about equally likely to observe a thing throughout that timespan - and the Lindy Effect is at least regularly talked about.

But there are classes of observations for which this is systematically wrong: for example, most people who see a ship part-way through a voyage will do so while it's either departing or arriving in port. Investment schemes are just such a class, because markets are usually up to the task of consuming alpha and tend to be better when the idea is widely known - even Buffett's returns have oscillated around the index over the last few years!

Answer by Zac Hatfield-Dodds, Mar 13, 2024

Safety properties aren't the kind of properties you can prove; they're statements about the world, not about mathematical objects. I very strongly encourage anyone reading this comment to go read Leveson's Engineering a Safer World (free pdf from author) through to the end of chapter three - it's the best introduction to systems safety that I know of and a standard reference for anyone working with life-critical systems. how.complexsystems.fail is the short-and-quotable catechism.


I'm not really sure what you mean by "AI toolchain", nor what threat model would have a race condition present an existential risk. More generally, formal verification is a research topic - there are some neat demonstration systems, and they're used in certain niches with relatively small amounts of code and compute, simple hardware, and where long development times are acceptable. None of those are true of AI systems, or even libraries such as PyTorch.

For flavor, some of the most exciting developments in formal methods: I expect the Lean FRO to improve usability, and 'autoformalization' tricks like Proofster (pdf) might also help - but it's still niche, and "proven correct" software can still have bugs from under-specified components, incorrect axioms, or outright hardware issues (e.g. Spectre, Rowhammer, cosmic rays, etc.). The seL4 microkernel is great, but you still have to supply an operating system and application layer, and then ensure the composition is still safe. To test an entire application stack, I'd instead turn to Antithesis, which is amazing so long as you can run everything in an x86 hypervisor (with no GPUs).

(as always, opinions my own)

I think he's actually quite confused here - I imagine saying:

Hang on - you say that (a) we can think, and (b) we are the instantiations of any number of computer programs. Wouldn't instantiating one of those programs be a sufficient condition of understanding? Surely if two things are isomorphic even in their implementation, either both can think, or neither.

(the Turing test suggests 'indistinguishable in input/output behaviour', which I think is much too weak)

See e.g. https://mschloegel.me/paper/schloegel2024sokfuzzevals.pdf

Fuzzing is a generally pretty healthy subfield, but even there most peer-reviewed papers in top venues are still completely useless! Importantly, "a 'working' github repo" is really not enough to ensure that your results are reproducible, let alone ensure external validity.

people’s subjective probability of successful restoration to life in the future, conditional on there not being a global catastrophe destroying civilization before then. This is also known as p(success).

This definition seems relevantly modified by the conditional!

You also seem to be assuming that "probability of revival" could be a monocausal explanation for cryonics interest, but I find that implausible ex ante. Monocausality approximately doesn't exist, and "is being revived good in expectation / good with what probability" are also common concerns. (CF)

Very little, because most CS experiments are not in fact replicable (and that's usually only one of several serious methodological problems).

CS does seem somewhat ahead of other fields I've worked in, but I'd attribute that to the mostly-separate open source community rather than academia per se.
