I want a name for the following principle:
the world-spec gap hurts you more than the spec-component gap
I wrote it out much like this a couple years ago and Zac recently said the same thing.
I'd love to be able to just say "the <one to three syllables> principle", yaknow?
I'm working on making sure we get high-quality critical-systems software out of early AGI. Hardened infrastructure buys us a lot in the slightly crazy story of "self-exfiltrated model attacks the power grid", but buys us even more in less crazy stories where all the software modules adjacent to AGI have their vulnerabilities rapidly patched at crunch time.
<standup_comedian>
What's the deal with evals </standup_comedian>
epistemic status: tell me I'm wrong.
Funders seem particularly enchanted with evals, which seem to be defined as "benchmarks, but probably for scaffolded systems, and with scoring that is harder than scoring most of what we call benchmarks".
I can conjure a theory of change. It's like, 1. if measurement is bad then we're working with vibes, so we'd like to make measurement good. 2. if measurement is good then we can demonstrate to audiences (especially policymakers) that warning shots are...
If I'm allowed to psychoanalyze funders rather than discussing anything at the object level, I'd speculate that funders like evals because:
I'm impressed with models that accomplish tasks in zero or one shot with minimal prompting skill. I'm not sure what galaxy-brained scaffolds and galaxy-brained prompts demonstrate; there's so much optimization in the measurement space.
I shipped a benchmark recently, but it's secretly a synthetic-data play: regardless of how hard people try to score on it, we get synthetic data out of it, which leads to finetune jobs, which leads to domain-specific models that can do such tasks, hopefully with minimal prompting effort and no scaffolding.
$PERSON at $LAB once showed me an internal document saying that there are bad benchmarks - dangerous-capability benchmarks - that are used negatively: unlike positive benchmarks, where a model isn't shipped to prod if it scores under a certain threshold, these benchmarks could block a model from going to prod if it scores over a certain threshold. I asked, "you created this benchmark treating it as a bad thing, and it's a bad thing at your shop, but how do you know it won't be used in a sign-flipped way at another shop?" and he said "well we just call it EvilBe...
This is exactly why the bio team for WMDP decided to deliberately include distractors involving relatively less harmful stuff. We didn't want to publicly publish a benchmark which gave a laser-focused "how to be super dangerous" score. We aimed for a fuzzier decision boundary. This brought criticism from experts at the labs who said that the benchmark included too much harmless stuff. I still think the trade-off was worthwhile.
I'm surprised to hear you say that, since you write
Upfront, I want to clarify: I don’t believe or wish to claim that GSAI is a full or general panacea to AI risk.
I kinda think anything which is not a panacea is swiss cheese, that those are the only two options.
It's a matter of what sort of portfolio can lay down slices of swiss cheese at what rate and with what uncorrelation. And I think in this way GSAI is antifragile to next year's language models, which is why I can agree mostly with Zac's talk and still work on GSAI (I don't think he talks about my ...
I gave a lightning talk with my particular characterization, and included "swiss cheese", i.e. that GSAI sources some layers of swiss cheese without trying to be one magic bullet. But if people agree with this, then "guaranteed-safe AI" is really a misnomer, cuz "guarantee" doesn't evoke swiss cheese at all.
For anecdata: I'd be really jazzed about 3 or 4; 5 might be a little crazy, but I'm somewhat open to that or more.
Yeah, last week was grim for a lot of people, with R1's implications for proliferation and the Stargate fanfare after the inauguration. I had a palpable sensation of it pivoting from midgame to endgame, but I would doubt that sensation is reliable or calibrated.
Tmux allows you to set up multiple panes in your terminal that keep running in the background. Therefore, if you disconnect from a remote machine, scripts running in tmux will not be killed. We tend to run experiments across many tmux panes (especially overnight).
Does no one use the `& disown` suffix, which sends a command to a background process that doesn't depend on the ssh process, or the `nohup` prefix, which does the same thing? You have to make sure any logging that goes to stdout goes to a log file instead (and in this way tmux or screen are better).
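For concreteness, a minimal sketch of both patterns (the script name and log path are placeholders):

```bash
# background the job, then drop it from the shell's job table so an ssh hangup won't kill it
python long_experiment.py > run.log 2>&1 &
disown

# or: nohup makes the process ignore SIGHUP directly
nohup python long_experiment.py > run.log 2>&1 &
```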
Your r...
The story I roughly understand is that this was within Epoch's mandate in the first place because they wanted to forecast on benchmarks but didn't think existing benchmarks were compelling or good enough, so they had to take matters into their own hands. Is that roughly consensus, or true? Why is FrontierMath a safety project? I haven't seen adequate discussion of this.
Can't relate. I don't particularly care for her content (tho I audibly laughed at a couple of the examples you hated), but I have no aversion to it. I do have an aversion to the way you appealed to datedness as if that matters. I generally can't relate to people who find cringiness, in the way you describe it, significantly problematic.
People like authenticity, humility, and irony now, both in the content and in its presentation.
I could literally care less, omg. But I'm unusually averse to irony: authenticity is great, humility is great most of the time, but why is irony even in the mix?
Tho I'm weakly with you that engagement farming leaves a bad taste in my mouth.
Update: new funding call from ARIA calls out the Safeguarded/Gatekeeper stack in a video game directly
Creating (largely) self-contained prototypes/minimal-viable-products of a Safeguarded AI workflow, similar to this example but pushing for incrementally more advanced environments (e.g. Atari games).
I tried a little myself too. Hope I'm not misremembering.
Very anecdotally, I've talked to some extremely smart people who I would guess are very good at making progress on hard problems, but who just didn't think too hard about which solutions would help.
A few of the dopest people I know, who I'd love to have on the team, fall roughly into the category of "engaged a little with LessWrong, grok the core pset better than most 'highly involved' people, but are working on something irrelevant and not even trying cuz they think it seems too hard". They have some thoughtful p(doom), but assume they're powerless.
Richard Ngo tweeted recently that it was a mistake to design the AGI Safety Fundamentals curriculum to be broadly accessible, and that if he could do it over again there'd be punishing problem sets that alienate most people.
The upvotes and agree-votes on this comment updated my perception of the rough consensus about MATS and streetlighting. I previously would have expected fewer people to evaluate MATS that way.
As someone who, isolated and unfunded, went on months-long excursions into the hard version of the pset multiple times and burned out each time, I felt extremely validated when you verbally told me a fragment of this post around a fire pit at ILIAD. The incentives section of this post is very grim, but very true. I know naive patches to the funding ecosystem would also be bad (easy for grifters, etc.), but I feel very much like I and we were failed by funders. I could've been stronger etc., I could've been in Berkeley during my attempts instead of Philly, b...
Ok. I'll wear a solid black shirt instead of my bright blue shirt.
When it comes to your internal track record, it is often said that finding what you wrote at time t-k beats trying to remember what you thought at t-k. However, the activation energy to keep such a journal is kind of a hurdle (which is why products like https://fatebook.io are so good!).
I find that a nice midpoint between the full, correct internal-track-record practice (rigorous journaling) and completely winging it (leaving yourself open to mistakes and self-delusion) is talking to friends, because I think my memory of...
I was at an ARIA meeting with a bunch of category theorists working on safeguarded AI and many of them didn't know what the work had to do with AI.
epistemic status: short version of post because I never got around to doing the proper effort post I wanted to make.
A sketch I'm thinking of: asking people to consume information (a question, in this case) is asking them to do you a favor, so you should do your best to ease that burden. However, don't be paralyzed by this; budget some leeway to be less than maximally considerate in this way when you really need to.
Rumors are that 2025 Lighthaven is jam-packed. If that's the case, and you need money, rudimentary economics suggests only the obvious: raise prices. I know many clients are mission-aligned, and there's a reasonable ideological case for running the joint at or below cost, but I think it's consistent with that spirit for profits from the campus to fund the website.
I also want to say in print what I said in person a year ago: you can ask me to do chores on campus to save money, it'd be within my hufflepuff budget. There are good reasons to not go totally "by and for ...
ThingOfThings said that The Story of Louis Pasteur is a very EA movie, but I think it also counts for rationality. Huge fan.
Event for the paper club: https://calendar.app.google/2a11YNXUFwzHbT3TA
blurb about the paper in last month's newsletter:
...... If you’re wondering why you just read all that, here’s the juice: often in GSAI position papers there’ll be some reference to expectations that capture “harm” or “safety”. Preexpectations and postexpectations with respect to particular pairs of programs could be a great way to cash this out, cuz we could look at programs as interventions and simulate RCTs (labeling one program
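Roughly, the gadget I have in mind is the standard weakest-preexpectation setup (sketching in my own notation): for a probabilistic program $C$ and a post-expectation $f$ mapping final states to nonnegative reals, the preexpectation at an initial state $\sigma$ is the expected value of $f$ over $C$'s output distribution,

$$\mathrm{wp}[\![C]\!](f)(\sigma) \;=\; \mathbb{E}_{\sigma' \sim [\![C]\!](\sigma)}\big[f(\sigma')\big],$$

so "programs as interventions" amounts to comparing $\mathrm{wp}[\![C_1]\!](f)$ against $\mathrm{wp}[\![C_2]\!](f)$ like treatment against control, with $f$ playing the role of the measured outcome (e.g. harm).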
my dude, top-level post: this does not read like a shortform
by virtue of their technical chops, also care about their career capital.
I didn't understand this. "Their technical chops impose opportunity cost, since they're able to build very safe, successful careers if they toe the line" would make sense; "they care about career capital independent of their technical chops" would also make sense. But here, the relation between technical chops and caring about career capital doesn't come through clearly.
I was thinking the same thing this morning! My main thought was, "this is a trap. Ain't no way I'm pressing a big red button, especially not so near to Petrov Day."
Alas, belief is easier than disbelief; we believe instinctively, but disbelief requires a conscious effort.
Yes, but this is one thing that I have felt mutate as I read the Sequences and continued to hang out with you lot (starting roughly 8 years ago, with some off and on).
By all means. Happy for that
See Limitations on Formal Verification for AI Safety over on LessWrong. I have a lot of agreements, and my disagreements are more a matter of what deserves emphasis than of fundamentals. Overall, I think the Tegmark/Omohundro paper failed to convey a swiss-cheesey worldview, and sounded too much like "why not just capture alignment properties in 'specs' and prove the software 'correct'?" (i.e. the vibe I was responding to in my very pithy post). However, I think the main reason I'm not using Dickson's post as a reason to j...
In case I forgot last month, here's a link to July.
One proof of concept for the GSAI stack would be a well-understood mechanical engineering domain automated to the next level and certified to boot. How about locks? Needs a model of basic physics, terms in some logic for all the parts and how they compose, and some test harnesses that simulate an adversary. Can you design and manufacture a provably unpickable lock?
Zac Hatfield-Dodds (of hypothesis/pytest and Anthropic, was offered and declined au...
I do think it's important to reach appropriately high safety assurances before developing or deploying future AI systems which would be capable of causing a catastrophe. However, I believe that the path there is to extend and complement current techniques, including empirical and experimental approaches alongside formal verification - whatever actually works in practice.
for what it's worth, I do see GSAI as largely a swiss cheesey worldview. Though I can see how you might read some of the authors involved to be implying otherwise! I should knock out a post on this.
not just get it to sycophantically agree with you
I struggle with this, and need to attend a prompting bootcamp.
Sure, I agree; that's why I said "something adjacent to", because it had enough overlap in properties. I think my comment stands completely with a different word choice; I'm just not sure what word choice would do a better job.
I eventually decided that human chauvinism approximately works most of the time because good successor criteria are very brittle. I'd prefer to avoid lock-in to my or anyone's values at t=2024, but such a lock-in might be "good enough" if I'm threatened with what I think are the counterfactual alternatives. If I did not think good successor criteria were very brittle, I'd accept something adjacent to E/Acc that focuses on designing minds which prosper more effectively than human minds. (the current comment will not address defining prosperity at different ...
Thinking about a top-level post on FOMO and research taste
He had become so caught up in building sentences that he had almost forgotten the barbaric days when thinking was like a splash of color landing on a page.
A $B$-valued quantifier is any function $(A \to B) \to B$, so when $B$ is $\mathrm{Bool}$, quantifiers are the functions that take predicates as input and return Bool as output (same for Prop). The standard $\max : (I \to \mathbb{R}) \to \mathbb{R}$ and $\min : (I \to \mathbb{R}) \to \mathbb{R}$ functions on arrays count as real-valued quantifiers for some index set $I$.
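A quick Lean 4 rendering of the definition, just to make the typing concrete (the names are mine):

```lean
-- A B-valued quantifier over A consumes a predicate A → B and returns a single B.
def Quantifier (A B : Type) : Type := (A → B) → B

-- Prop-valued examples: ∀ and ∃ are quantifiers in this sense.
def forallQ (A : Type) : Quantifier A Prop := fun p => ∀ x, p x
def existsQ (A : Type) : Quantifier A Prop := fun p => ∃ x, p x

-- Bool-valued example: "every element of this list satisfies p".
def allOf (xs : List Nat) : Quantifier Nat Bool := fun p => xs.all p
```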
I thought I had seen $\forall$ as the max of the Prop-valued quantifiers, and exists as the min somewhere, which has a nice mindfeel since forall has this "big" feeling (if you determined $\forall\, P$ for some $P : A \to \mathrm{Prop}$ (of which $\forall x,\ P\,x$ is just syntax sugar, since the variable name is irrelevant) by exhaustive ...
I'm excited for language model interpretability to teach us about the difference between compilers and simulations of compilers. In the sense that ChatGPT and I can both predict what a compiler of a suitably popular programming language will do on some input, what's going on there? Surely we're not reimplementing the compiler on our substrate, even in the limit of perfect prediction? This will be an opportunity for a programming-language theorist in another year or two of interp progress.
Firefox is giving me a currently broken redirect on https://atlascomputing.org/safeguarded-ai-summary.pdf
@Tyra Burgess and I wrote down a royalty-aware payout function yesterday:
For a type B, let L(B) be the "left closure under implication", or the admissible antecedents: i.e., the set of all antecedents A in the public ledger such that A→B. p:Type→Money is the price that a proposition was listed for (admitting summing over duplicates). Suppose players 1,...,k have previously proven B1,...,Bk, and L(A) is none other than the set of all Bi from 1 to k.
We would like to fix an ϵ (could be fairly big, like 15) and say that the royalty-aware payout given ϵ...
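Restating the setup in symbols (notation mine):

$$L(B) = \{\, A \in \mathrm{Ledger} \mid A \to B \,\}, \qquad p : \mathrm{Type} \to \mathrm{Money}, \qquad L(A) = \{B_1, \dots, B_k\}.$$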