Wow. Very surprising.
xAI Risk Management Framework (Draft)
You're mostly right about evals/thresholds. Mea culpa. Sorry for my sloppiness.
For misuse, xAI has benchmarks and thresholds—or rather, examples of the benchmarks and thresholds that will appear in the real future framework—and based on the right column they seem reasonably low.
Unlike other similar documents, these are not thresholds at which to implement mitigations but rather thresholds to reduce performance to. So it seems the primary concern is probably not that the thresholds are too high but rather that xAI's mitigations won't be robu...
This shortform discusses the current state of responsible scaling policies (RSPs). They're mostly toothless, unfortunately.
The Paris summit was this week. Many companies had committed to make something like an RSP by the summit. Half of them did, including Microsoft, Meta, xAI, and Amazon. (NVIDIA did not—shame on them—but I hear they are writing something.) Unfortunately but unsurprisingly, these policies are all vague and weak.
RSPs essentially have four components: capability thresholds beyond which a model might b...
capability thresholds be vague or extremely high
xAI's thresholds are entirely concrete and not extremely high.
evaluation be unspecified or low-quality
They are specified and as high-quality as you can get. (If there are better datasets let me know.)
I'm not saying it's perfect, but I wouldn't put them all in the same bucket. Meta's is very different from DeepMind's or xAI's.
There also used to be a page for Preparedness: https://web.archive.org/web/20240603125126/https://openai.com/preparedness/. Now it redirects to the safety page above.
(Same for Superalignment but that's less interesting: https://web.archive.org/web/20240602012439/https://openai.com/superalignment/.)
DeepMind updated its Frontier Safety Framework (blogpost, framework, original framework). It associates "recommended security levels" to capability levels, but the security levels are low. It mentions deceptive alignment and control (both control evals as a safety case and monitoring as a mitigation); that's nice. The overall structure is like we'll do evals and make a safety case, with some capabilities mapped to recommended security levels in advance. It's not very commitment-y:
We intend to evaluate our most powerful frontier models regularly
...
My guess is it's referring to Anthropic's position on SB 1047, or Dario's and Jack Clark's statements that it's too early for strong regulation, or how Anthropic's policy recommendations often exclude RSP-y stuff (and when they do suggest requiring RSPs, they would leave the details up to the company).
o3-mini is out (blogpost, tweet). Performance isn't super noteworthy (at first glance), in part because we already knew about o3's performance.
Non-fact-checked quick takes on the system card:
the model referred to below as the o3-mini post-mitigation model was the final model checkpoint as of Jan 31, 2025 (unless otherwise specified)
Big if true (and if Preparedness had time to do elicitation and fix spurious failures)
If this is robust to jailbreaks, great, but presumably it's not, so low post-mitigation performance is far from sufficient for safety-from-misuse;...
Thanks. The tax treatment is terrible. And I would like more clarity on how transformative AI would affect S&P 500 prices (per this comment). But this seems decent (alongside AI-related calls) because 6 years is so long.
I wrote this for someone but maybe it's helpful for others
What labs should do:
I think ideally we'd have several versions of a model. The default version would be ignorant about AI risk, AI safety and evaluation techniques, and maybe modern LLMs (in addition to misuse-y dangerous capabilities). When you need a model that's knowledgeable about that stuff, you use the knowledgeable version.
Somewhat related: https://www.alignmentforum.org/posts/KENtuXySHJgxsH2Qk/managing-catastrophic-misuse-without-robust-ais
Yeah, I agree with this and am a fan of this from the google doc:
Remove biology, technical stuff related to chemical weapons, technical stuff related to nuclear weapons, alignment and AI takeover content (including sci-fi), alignment or AI takeover evaluation content, large blocks of LM generated text, any discussion of LLMs more powerful than GPT2 or AI labs working on LLMs, hacking, ML, and coding from the training set.
and then fine-tune if you need AIs with specific info. There are definitely issues here with AIs doing safety research (e.g., to solve risks from deceptive alignment they need to know what that is), but this at least buys some marginal safety.
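As a purely illustrative toy sketch of the filtering step (a real filter would need trained classifiers rather than string matching, and care about paraphrases; the keyword list and document format here are made up):

```python
# Keywords standing in for the topics listed above; purely illustrative.
BLOCKED_KEYWORDS = [
    "bioweapon", "nerve agent", "uranium enrichment",
    "deceptive alignment", "ai takeover", "large language model",
    "machine learning", "exploit development",
]

def keep_document(text: str) -> bool:
    """Drop documents that mention any blocked topic."""
    lowered = text.lower()
    return not any(keyword in lowered for keyword in BLOCKED_KEYWORDS)

def filter_corpus(corpus: list[str]) -> list[str]:
    """Return only the documents that pass the topic filter."""
    return [doc for doc in corpus if keep_document(doc)]
```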
[Perfunctory review to get this post to the final phase]
Solid post. Still good. I think a responsible developer shouldn't unilaterally pause but I think it should talk about the crazy situation it's in, costs and benefits of various actions, what it would do in different worlds, and its views on risks. (And none of the labs have done this; in particular Core Views is not this.)
One more consideration against (or an important part of "Bureaucracy"): sometimes your lab doesn't let you publish your research.
Some people have posted ideas on what a reasonable plan to reduce AI risk for such timelines might look like (e.g. Sam Bowman’s checklist, or Holden Karnofsky’s list in his 2022 nearcast), but I find them insufficient for the magnitude of the stakes (to be clear, I don’t think these example lists were intended to be an extensive plan).
See also A Plan for Technical AI Safety with Current Science (Greenblatt 2023) for a detailed (but rough, out-of-date, and very high-context) plan.
Yeah. I agree/concede that you can explain why you can't convince people that their own work is useless. But if you're positing that the flinchers flinch away from valid arguments about each category of useless work, that seems surprising.
The flinches aren't structureless particulars. Rather, they involve warping various perceptions. Those warped perceptions generalize a lot, causing other flaws to be hidden.
As a toy example, you could imagine someone attached to the idea of AI boxing. At first they say it's impossible to break out / trick you / know about the world / whatever. Then you convince them otherwise--that the AI can do RSI internally, and superhumanly solve computer hacking / protein folding / persuasion / etc. But they are attached to AI boxing. So they warp their perception, cl...
I feel like John's view entails that he would be able to convince my friends that various-research-agendas-my-friends-like are doomed. (And I'm pretty sure that's false.) I assume John doesn't believe that, and I wonder why he doesn't think his view entails it.
I wonder whether John believes that well-liked research, e.g. Fabien's list, is actually not valuable, or whether it consists of rare exceptions coming from a small subset of the "alignment research" field.
I do not.
On the contrary, I think ~all of the "alignment researchers" I know claim to be working on the big problem, and I think ~90% of them are indeed doing work that looks good in terms of the big problem. (Researchers I don't know are likely substantially worse but not a ton.)
In particular I think all of the alignment-orgs-I'm-socially-close-to do work that looks good in terms of the big problem: Redwood, METR, ARC. And I think the other well-known orgs are also good.
This doesn't feel odd: these people are smart and actually care about the big problem;...
Yeah, I agree sometimes people decide to work on problems largely because they're tractable [edit: or because they're good for safety via getting alignment research or other good work out of early AGIs]. I'm unconvinced by the flinching-away or dishonesty characterization.
This post starts from the observation that streetlighting has mostly won the memetic competition for alignment as a research field, and we'll mostly take that claim as given. Lots of people will disagree with that claim, and convincing them is not a goal of this post.
Yep. This post is not for me but I'll say a thing that annoyed me anyway:
... and Carol's thoughts run into a blank wall. In the first few seconds, she sees no toeholds, not even a starting point. And so she reflexively flinches away from that problem, and turns back to some easier problems.
Does this actually happen?
Yes, absolutely. Five years ago, people were more honest about it, saying ~explicitly and out loud "ah, the real problems are too difficult; and I must eat and have friends; so I will work on something else, and see if I can get funding on the basis that it's vaguely related to AI and safety".
DeepSeek-V3 is out today, with weights and a paper published. Tweet thread, GitHub, report (GitHub, HuggingFace). It's big and mixture-of-experts-y; discussion here and here.
It was super cheap to train — they say 2.8M H800-hours or $5.6M (!!).
It's powerful:
It's cheap to run:
Update: the weights and paper are out. Tweet thread, GitHub, report (GitHub, HuggingFace). It's big and mixture-of-experts-y; thread on notable stuff.
It was super cheap to train — they say 2.8M H800-hours or $5.6M.
It's powerful:
It's cheap to run:
Every now and then (~5-10 minutes, or when I look actively distracted), briefly check in (where if I'm in-the-zone, this might just be a brief "Are you focused on what you mean to be?" from them, and a nod or "yeah" from me).
Some other prompts I use when being a [high-effort body double / low-effort metacognitive assistant / rubber duck]:
All of the founders committed to donate 80% of their equity. I heard it's set aside in some way but they haven't donated anything yet. (Source: an Anthropic human.)
This fact wasn't on the internet, or at least wasn't easily findable via Google search. Huh. I only find Holden mentioning that 80% of Daniela's equity is pledged.
I disagree with Ben. I think the usage that Mark is talking about is a reference to Death with Dignity. A central example (written by me) is
it would be undignified if AI takes over because we didn't really try off-policy probes; maybe they just work really well; someone should figure that out
It's playful and unserious but "X would be undignified" roughly means "it would be an unfortunate error if we did X or let X happen" and is used in the context of AI doom and our ability to affect P(doom).
edit: wait likely it's RL; I'm confused
OpenAI didn't fine-tune on ARC-AGI, even though this graph suggests they did.
Sources:
Altman said
we didn't go do specific work [targeting ARC-AGI]; this is just the general effort.
François Chollet (in the blogpost with the graph) said
Note on "tuned": OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data.
...The version of the model we tested was domai
Welcome!
To me the benchmark scores are interesting mostly because they suggest that o3 is substantially more powerful than previous models. I agree we can't naively translate benchmark scores to real-world capabilities.
I think he’s just referring to DC evals, and I think this is wrong because I think other companies doing evals wasn’t really caused by Anthropic (but I could be unaware of facts).
Edit: maybe I don't know what he's referring to.
DC evals got started in summer of '22, across all three leading companies AFAICT. And I was on the team that came up with the idea and started making it happen (both internally and externally), or at least, as far as I can tell we came up with the idea -- I remember discussions between Beth Barnes and Jade Leung (who were both on the team in spring '22), and I remember thinking it was mostly their idea (maybe also Cullen's?) It's possible that they got it from Anthropic but it didn't seem that way to me. Update: OK, so apparently @evhub had joined Anthropi...
I use empty brackets similar to ellipses in this context; they denote removed nonsubstantive text. (I use ellipses when removing substantive text.)
I think they only have formal high and low versions for o3-mini
Edit: nevermind idk
Done, thanks.
I already edited out most of the "like"s and similar. I intentionally left some in when they seemed like they might be hedging or otherwise communicating this isn't exact. You are free to post your own version but not to edit mine.
Edit: actually I did another pass and edited out several more; thanks for the nudge.
pass@n is not cheating if answers are easy to verify. E.g. if you can cheaply/quickly verify that code works, pass@n is fine for coding.
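To illustrate, here's a minimal sketch of pass@n with a cheap verifier; `generate` and `verify` are hypothetical callables (e.g. sampling from the model and running a test suite), not anything from a real API.

```python
from typing import Callable, Optional

def pass_at_n(generate: Callable[[str], str],
              verify: Callable[[str], bool],
              task: str,
              n: int) -> Optional[str]:
    """Sample up to n candidate solutions; return the first one the verifier accepts."""
    for _ in range(n):
        candidate = generate(task)
        if verify(candidate):  # cheap check, e.g. run the unit tests
            return candidate
    return None  # no candidate passed verification
```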
and then having the model look at all the runs and try to figure out which run had the most compelling-seeming reasoning
The FrontierMath answers are numerical-ish ("problems have large numerical answers or complex mathematical objects as solutions"), so you can just check which answer the model wrote most frequently.
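A minimal sketch of that kind of majority vote (self-consistency), assuming you've already extracted a final answer string from each of the n runs:

```python
from collections import Counter

def majority_answer(answers: list[str]) -> str:
    """Return the answer that appears most often across independent samples."""
    return Counter(answers).most_common(1)[0][0]

# e.g. majority_answer(["1729", "1729", "1728"]) -> "1729"
```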
The obvious boring guess is best of n. Maybe you're asserting that using $4,000 implies that they're doing more than that.
Performance at $20 per task is already much better than for o1, so it can't be just best-of-n; you'd need more attempts to get that much better even if there is a very good verifier that notices a correct solution (at $4K per task that's plausible, but not at $20 per task). There are various clever beam search options that don't need to make inference much more expensive, but in principle might be able to give a boost at low expense (compared to not using them at all).
There's still no word on the 100K H100s model, so that's another possibility. Currently C...
My guess is they do kinda choose: in training, it's less like two debaters are assigned opposing human-written positions and more like the debate is between two sampled outputs.
Edit: maybe this is different in procedures different from the one Rohin outlined.
The more important zoom-level is: debate is a proposed technique to provide a good training signal. See e.g. https://www.lesswrong.com/posts/eq2aJt8ZqMaGhBu3r/zach-stein-perlman-s-shortform?commentId=DLYDeiumQPWv4pdZ4.
Edit: debate is a technique for iterated amplification -- but that tag is terrible too, oh no
I asked Rohin Shah what the debate agenda is about and he said (off the cuff, not necessarily worded well) (context omitted) (emphasis mine):
...Suppose you have any task where given an input x the AI has to produce some output y. (Can be question answering, arbitrary chatbot stuff, being an agent, whatever.)
Debate to me is a family of procedures for training AIs in such settings. One example such procedure is: sample two possible outputs y1 and y2, then have two AIs debate each other about which output is better (one assigned to y1 and the other assigned to y
The hope is that the debaters are incentivized to simply say which output is better, to the best of their ability,
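For concreteness, here's a rough sketch of one training step under that example procedure, as I understand it. The helper callables (`sample`, `argue`, `judge`, `reinforce`) are my own placeholder names, and the judge step is my assumption about how the signal gets extracted, not something from Rohin's description.

```python
from typing import Callable

def debate_training_step(sample: Callable[[str], str],
                         argue: Callable[[str, str, str, list], str],
                         judge: Callable[[str, str, str, list], int],
                         reinforce: Callable[[str, str], None],
                         x: str,
                         n_rounds: int = 3) -> None:
    # Sample two candidate outputs for the same input.
    y1, y2 = sample(x), sample(x)

    # Two debaters argue over which output is better; debater 1 is assigned y1,
    # debater 2 is assigned y2.
    transcript: list = []
    for _ in range(n_rounds):
        transcript.append(argue(x, y1, y2, transcript))  # argues y1 is better
        transcript.append(argue(x, y2, y1, transcript))  # argues y2 is better

    # A judge (e.g. a human or a weaker trusted model) picks the output it finds
    # better-supported, and that judgment is used as the training signal.
    winner = judge(x, y1, y2, transcript)
    reinforce(x, y1 if winner == 1 else y2)
```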
How is this true, if the debaters don't get to choose which output they are arguing for? Aren't they instead incentivized to say that whatever output they are assigned is the best?
This post presented the idea of RSPs and detailed thoughts on them, just after Anthropic's RSP was published. It's since become clear that nobody knows how to write an RSP that's predictably neither way too aggressive nor super weak. But this post, along with the accompanying Key Components of an RSP, is still helpful, I think.
This is the classic paper on model evals for dangerous capabilities.
On a skim, it's aged well; I still agree with its recommendations and framing of evals. One big exception: it recommends "alignment evaluations" to determine models' propensity for misalignment, but such evals can't really provide much evidence against catastrophic misalignment; better to assume AIs are misaligned and use control once dangerous capabilities appear, until much better misalignment-measuring techniques appear.
Interesting point, written up really really well. I don't think this post was practically useful for me but it's a good post regardless.
This post helped me distinguish capabilities-y information that's bad to share from capabilities-y information that's fine/good to share. (Base-model training techniques are bad; evals and eval results are good; scaffolding/prompting/posttraining techniques to elicit more powerful capabilities without more spooky black-box cognition is fine/good.)
To avoid deploying a dangerous model, you can either (1) test the model pre-deployment or (2) test a similar older model with tests that have a safety buffer such that if the old model is below some conservative threshold it's very unlikely that the new model is dangerous.
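A toy illustration of option (2), with made-up names, where the buffer is sized to cover plausible capability gains between the older model and the new one:

```python
def new_model_presumed_safe(old_model_eval_score: float,
                            danger_threshold: float,
                            safety_buffer: float) -> bool:
    """If the older, already-evaluated model scores below the danger threshold
    by at least the buffer, the newer model is presumed not to have crossed it."""
    return old_model_eval_score < danger_threshold - safety_buffer
```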
DeepMind says it uses the safety-buffer plan (but it hasn't yet said it has operationalized thresholds/buffers).
Anthropic's original RSP used the safety-buffer plan; its new RSP doesn't really use either plan (kinda safety-buffer but it's very weak). (This is unfortunate.)
OpenAI seemed to...
This early control post introduced super important ideas: trusted monitoring plus the general point
if you think about approaches to safely using AIs that are robust to deceptive alignment and which can be used right now, a core dynamic is trying to use your dumb trusted models together with your limited access to trusted humans to make it hard for your untrusted smart models to cause problems.
Briefly:
Edit 2: after checking, I now believe the data strongly suggest FTX had a large negative effect on EA community metrics. (I still agree with Buck: "I don't like the fact that this essay is a mix of an insightful generic argument and a contentious specific empirical claim that I don't think you support strongly; it feels like the rhetorical strength of the former lends credence to the latter in a way that isn't very truth-tracking." And I disagree with habryka's claims that the effect of FTX is obvious.)
...
Practically all growth metrics are down (and have indeed turned negative on most measures), a substantial fraction of core contributors are distancing themselves from the EA affiliation, surveys among EA community builders report EA-affiliation as a major recurring obstacle[1], and many of the leaders who previously thought it wasn't a big deal now concede that it was/is a huge deal.
Also, informally, recruiting for things like EA Fund managers, or getting funding for EA Funds has become substantially harder. EA leadership positions appear to be filled by l...
Good point, thanks. I think eventually we should focus more on reducing P(doom | sneaky scheming) but for now focusing on detection seems good.