LESSWRONG
LW

All of aysja's Comments + Replies

Anthropic, and taking "technical philosophy" more seriously

My high-level skepticism of their approach is A) I don't buy that it's possible yet to know how dangerous models are, nor that it is likely to become possible in time to make reasonable decisions, and B) I don't buy that Anthropic would actually pause, except under a pretty narrow set of conditions which seem unlikely to occur.

As to the first point: Anthropic's strategy seems to involve Anthropic somehow knowing when to pause, yet as far as I can tell, they don't actually know how they'll know that. Their scaling policy does not list the tests they'll run,... (read more)

johnswentworth's Shortform

aysja3d429

One thing that I do after social interactions, especially those which pertain to my work, is to go over all the updates my background processing is likely to make and to question them more explicitly.

This is helpful because I often notice that the updates I’m making aren’t related to reasons much at all. It’s more like “ah they kind of grimaced when I said that, so maybe I'm bad?” or like “they seemed just generally down on this approach, but wait are any of those reasons even new to me? Haven’t I already considered those and decided to do it anyway?” or “... (read more)

Anthropic releases Claude 3.7 Sonnet with extended thinking mode

aysja17d84

There are a ton of objective thresholds in here. E.g., for bioweapon acquisition evals "we pre-registered an 80% average score in the uplifted group as indicating ASL-3 capabilities" and for bioweapon knowledge evals "we consider the threshold reached if a well-elicited model (proxying for an "uplifted novice") matches or exceeds expert performance on more than 80% of questions (27/33)" (which seems good!).

I am confused, though, why these are not listed in the RSP, which is extremely vague. E.g., the "detailed capability threshold" in the RSP is "the... (read more)

Mikhail Samin's Shortform

aysja1mo140

I don't think that most people, upon learning that Anthropic's justification was "other companies were already putting everyone's lives at risk, so our relative contribution to the omnicide was low" would then want to abstain from rioting. Common ethical intuitions are often more deontological than that, more like "it's not okay to risk extinction, period." That Anthropic aims to reduce the risk of omnicide on the margin is not, I suspect, the point people would focus on if they truly grokked the stakes; I think they'd overwhelmingly focus on the threat to their lives that all AGI companies (including Anthropic) are imposing.

7Knight Lee1mo

Regarding common ethical intuitions, I think people in the post singularity world (or afterlife, for the sake of argument) will be far more forgiving of Anthropic. They will understand, even if Anthropic (and people like me) turned out wrong, and actually were a net negative for humanity. Many ordinary people (maybe most) would have done the same thing in their shoes. Ordinary people do not follow the utilitarianism that the awkward people here follow. Ordinary people also do not follow deontology or anything that's the opposite of utilitarianism. Ordinary people just follow their direct moral feelings. If Anthropic was honestly trying to make the future better, they won't feel that outraged at their "consequentialism." They may be outraged an perceived incompetence, but Anthropic definitely won't be the only one accused of incompetence.

The Failed Strategy of Artificial Intelligence Doomers

aysja1mo70

Technical progress also has the advantage of being the sort of thing which could make a superintelligence safe, whereas I expect very little of this to come from institutional competency alone.

The Failed Strategy of Artificial Intelligence Doomers

aysja1mo120

I have grown pessimistic about our ability to solve the open technical problems even given 100 years of work on them.

Why?

Ben Pace1mo161

I'm not particularly resolute on this question. But I get this sense when I look at (a) the best agent foundations work that's happened over ~10 years of work on the matter, and (b) the work output of scaling up the number of people working on 'alignment' by ~100x.

For the first, trying to get a better understand of the basic concepts like logical induction and corrigibility and low-impact and ontological updates, while I feel like there's been progress (in timeless decision theory taking a clear step forward in figuring out how think about decision-makers ... (read more)

What Is The Alignment Problem?

aysja2mo*2811

I agree with a bunch of this post in spirit—that there are underlying patterns to alignment deserving of true name—although I disagree about… not the patterns you’re gesturing at, exactly, but more how you’re gesturing at them. Like, I agree there’s something important and real about “a chunk of the environment which looks like it’s been optimized for something,” and “a system which robustly makes the world look optimized for a certain objective, across many different contexts.” But I don't think the true names of alignment will be behaviorist (“as if” des... (read more)

Lucius Bushnaq2mo*103

Many accounts of cognition are impossible (eg AIXI, VNM rationality, or anything utilizing utility functions, many AIT concepts), since they include the impossible step of considering all possible worlds. I think people normally consider this to be something like a “God’s eye view” of intelligence—ultimately correct, but incomputable—which can be projected down to us bounded creatures via approximation, but I think this is the wrong sort of in-principle to real-world bridge. Like, it seems to me that intelligence is fundamentally about ~“finding and exploi

... (read more)

2Noosphere892mo

I do think a large source of impossibility results come from trying to consider all possible worlds, but the core feature of all of the impossible proposals in our reality is a combination of ignoring computational difficulty entirely, combined with problems on embedded agency, and that the boundary between agent and environment is fundamental to most descriptions of intelligence/agency, ala Cartesian boundaries, but physically universal cellular automatons invalidate this abstraction, meaning the boundary is arbitrary and has no meaning at a low level, and our universe is plausibly physically universal. More here: https://www.lesswrong.com/posts/dHNKtQ3vTBxTfTPxu/what-is-the-alignment-problem#3GvsEtCaoYGrPjR2M (Caveat that the utility function framing actually can work, assuming we restrain the function classes significantly enough, and you could argue the GPT series has a utility function of prediction, but I won't get into that). The problem with the bridge is that even if the god's eye view of intelligence theories was totally philosophically correct, there is no way to get anything like that, and thus you cannot easily approximate without giving very vacuous bounds, and thus you need a specialized theory of intelligence in specific universes that might be inelegant philosophically/mathematically, but that can actually work to build and align AGI/ASI, especially if it comes soon.

What’s the short timeline plan?

aysja2moΩ6102

By faithful and human-legible, I mean that the model’s reasoning is done in a way that is directly understandable by humans and accurately reflects the reasons for the model’s actions.

Curious why you say this, in particular the bit about “accurately reflects the reasons for the model’s actions.” How do you know that? (My impression is that this sort of judgment is often based on ~common sense/folk reasoning, ie, that we judge this in a similar way to how we might judge whether another person took sensical actions—was the reasoning internally consiste... (read more)

Marius Hobbhahn2moΩ6150

I think this is a very important question and the answer should NOT be based on common-sense reasoning. My guess is that we could get evidence about the hidden reasoning capabilities of LLMs in a variety of ways both from theoretical considerations, e.g. a refined version of the two-hop curse or extensive black box experiments, e.g. comparing performance on evals with and without CoT, or with modified CoT that changes the logic (and thus tests whether the models internal reasoning aligns with the revealed reasoning).

These are all pretty basic thought... (read more)

Matthew Barnett's Shortform

aysja2mo2018

I purposefully use these terms vaguely since my concepts about them are in fact vague. E.g., when I say “alignment” I am referring to something roughly like “the AI wants what we want.” But what is “wanting,” and what does it mean for something far more powerful to conceptualize that wanting in a similar way, and what might wanting mean as a collective, and so on? All of these questions are very core to what it means for an AI system to be “aligned,” yet I don’t have satisfying or precise answers for any of them. So it seems more natural to me, at this sta... (read more)

The Field of AI Alignment: A Postmortem, and What To Do About It

aysja3mo16183

I’m not convinced that the “hard parts” of alignment are difficult in the standardly difficult, g-requiring way that e.g., a physics post-doc might possess. I do think it takes an unusual skillset, though, which is where most of the trouble lives. I.e., I think the pre-paradigmatic skillset requires unusually strong epistemics (because you often need to track for yourself what makes sense), ~creativity (the ability to synthesize new concepts, to generate genuinely novel hypotheses/ideas), good ability to traverse levels of abstraction (connecting details t... (read more)

johnswentworth2mo144

I’m not convinced that the “hard parts” of alignment are difficult in the standardly difficult, g-requiring way that e.g., a physics post-doc might possess.

To be clear, I wasn't talking about physics postdocs mainly because of raw g. Raw g is a necessary element, and physics postdocs are pretty heavily loaded on it, but I was talking about physics postdocs mostly because of the large volume of applied math tools they have.

The usual way that someone sees footholds on the hard parts of alignment is to have a broad enough technical background that they can se... (read more)

Raemon2mo182

This is the sort of thing I find appealing to believe, but I feel at least somewhat skeptical of. I notice a strong emotional pull to want this to be true (as well as an interesting counterbalancing emotional pull for it to not be true).

I don't think I've seen output from the people aspiring in this direction without being visibly quite smart to make me think "okay yeah it seems like it's on track in some sense."

I'd be interested in hearing more explicit cruxes from you about it.

I do think it's plausible than the "smart enough, creative enough, stron... (read more)

Seth Herd3mo145

I think this is right. A couple of follow-on points:

There's a funding problem if this is an important route to progress. If good work is illegible for years, it's hard to decide who to fund, and hard to argue for people to fund it. I don't have a proposed solution, but I wanted to note this large problem.

Einstein did his pre-paradigmatic work largely alone. Better collaboration might've sped it up.

LessWrong allows people to share their thoughts prior to having publishable journal articles and get at least a few people to engage.

This makes the difficult pre... (read more)

4Chris_Leong3mo

Agreed. Simply focusing on physics post-docs feels too narrow to me. Then again, just as John has a particular idea of what good alignment research looks like, I have my own idea: I would lean towards recruiting folk with both a technical and a philosophical background. It's possible that my own idea is just as narrow.

Why Not Just Outsource Alignment Research To An AI?

aysja3mo67

I took John to be arguing that we won’t get a good solution out of this paradigm (so long as the humans doing it aren’t expert at alignment), rather than we couldn’t recognize a good solution if it were proposed.

Separately, I think that recognizing good solutions is potentially pretty fraught, especially the more capable the system we’re outsourcing it to is. Like anything about a proposed solution that we don’t know how to measure or we don’t understand could be exploited, and it’s really hard to tell those failures exist almost definitionally. E.g., a pl... (read more)

What are the good rationality films?

Answer by aysjaNov 21, 2024185

Jan suggested a similar one (Baraka), but I was going to say Koyaanisqatsi. It’s one of my favorite films; I still feel deeply affected by it. I bring it up here, though, because it does an excellent job of inviting viewers to do original seeing. It’s basically a 90 minute documentary about the world, but it doesn’t feel like it has any agenda. It’s just shot after shot of what this planet is like (the Grand Canyon, a commute to work, a factory farm). It doesn’t shy away from anything, doesn’t feel like it’s grasping at some goal. Just an honest, gentle lo... (read more)

2Jan_Kulveit4mo

I hesitated between Koyaanisqatsi and Baraka! Both are some of my favorites, but in my view Koyaanisqatsi actually has notably more of an agenda and a more pessimistic outlook.

8Mateusz Bagiński4mo

Astronaut.io

Anthropic: Three Sketches of ASL-4 Safety Case Components

aysja4mo245

The interpretability safety case section seems to make the rather strong claim that the activation of known sabotage-related features is necessary for sabotage to occur (emphasis mine):

[C1.1] The model will be unable to accomplish organizational sabotage (T1) or to undermine oversight (T3) without triggering our feature-based monitoring (M1.1) and a detailed analysis of the session.

Afaict, the way that they establish this link (that a feature activation is necessary for a behavior to occur) is through ablation studies. Basically, if deactiva... (read more)

6Alex Mallen4mo

I think that if they: - Find a feature that is necessary for defection in all of the cases of defection in a diverse and realistic set of model organisms - Locate that feature with very little optimization on the defection cases I'd be fairly confident (>~80%) that it would catch any particular deployment-time defection attempt from an ASL-4 schemer. This would cut risk from the deployment by >5x when paired with compelling incrimination and a shutdown. This is a substantial risk reduction, and I might even buy this safety case if there weren't better options for this threat model (e.g. untrusted monitoring). My main concerns are: - It seems unlikely we would get a feature that meets the above two criteria - The feature might have too many false positives - The possibility that the model organisms were sabotaging your eval by coordinating to route their defection attempts through a particular feature (this seems hard at ASL-4 though (but this claim would need an eval)).

2Zach Stein-Perlman4mo

See [V1.1]

johnswentworth's Shortform

aysja4mo268

I think I probably agree, although I feel somewhat wary about it. My main hesitations are:

The lack of epistemic modifiers seems off to me, relative to the strength of the arguments they’re making. Such that while I agree with many claims, my imagined reader who is coming into this with zero context is like “why should I believe this?” E.g., “Without intervention, humanity will be summarily outcompeted and relegated to irrelevancy,” which like, yes, but also—on what grounds should I necessarily conclude this? They gave some argument along the lines of “inte

... (read more)

Dario Amodei — Machines of Loving Grace

aysja5mo913

"It's plausible that things could go much faster than this, but as a prediction about what will actually happen, humanity as a whole probably doesn't want things to get incredibly crazy so fast, and so we're likely to see something tamer." I basically agree with that.

I feel confused about how this squares with Dario’s view that AI is "inevitable," and "driven by powerful market forces." Like, if humanity starts producing a technology which makes practically all aspects of life better, the idea is that this will just… stop? I’m sure some people wi... (read more)

Anthropic's updated Responsible Scaling Policy

aysja5mo132

Thanks, I think you’re right on both points—that the old RSP also didn’t require pre-specified evals, and that the section about Capability Reports just describes the process for non-threshold-triggering eval results—so I’ve retracted those parts of my comment; my apologies for the error. I’m on vacation right now so was trying to read quickly, but I should have checked more closely before commenting.

That said, it does seem to me like the “if/then” relationships in this RSP have been substantially weakened. The previous RSP contained sufficiently much wigg... (read more)

Anthropic's updated Responsible Scaling Policy

aysja5mo5211

In the previous RSP, I had the sense that Anthropic was attempting to draw red lines—points at which, if models passed certain evaluations, Anthropic committed to pause and develop new safeguards. That is, if ~~evaluations triggered,~~ ~~then~~ they would implement safety measures. The “if” was already sketchy in the first RSP, as Anthropic was allowed to “determine whether the evaluation was overly conservative,” i.e., they were allowed to retroactively declare red lines green. Indeed, with such caveats it was difficult for me to see the RSP a... (read more)

RobertM5mo1715

Indeed, such red lines are now made more implicit and ambiguous. There are no longer predefined evaluations—instead employees design and run them on the fly, and compile the resulting evidence into a Capability Report, which is sent to the CEO for review. A CEO who, to state the obvious, is hugely incentivized to decide to deploy models, since refraining to do so might jeopardize the company.

This doesn't seem right to me, though it's possible that I'm misreading either the old or new policy (or both).

Re: predefined evaluations, the old policy nei... (read more)

DanielFilan's Shortform Feed

aysja5mo228

Basically I just agree with what James said. But I think the steelman is something like: you should expect shorter (or no) pauses with an RSP if all goes well, because the precautions are matched to the risks. Like, the labs aim to develop safety measures which keep pace with the dangers introduced by scaling, and if they succeed at that, then they never have to pause. But even if they fail, they're also expecting that building frontier models will help them solve alignment faster. I.e., either way the overall pause time would probably be shorter?

It does s... (read more)

MATS Alumni Impact Analysis

aysja5mo63

Does the category “working/interning on AI alignment/control” include safety roles at labs? I’d be curious to see that statistic separately, i.e., the percentage of MATS scholars who went on to work in any role at labs.

2Ryan Kidd5mo

Scholars working on safety teams at scaling labs generally selected "working/interning on AI alignment/control"; some of these also selected "working/interning on AI capabilities", as noted. We are independently researching where each alumnus ended up working, as the data is incomplete from this survey (but usually publicly available), and will share separately.

Skills from a year of Purposeful Rationality Practice

aysja6mo304

Similarly in Baba is You: when people don't have a crisp understanding of the puzzle, they tend to grasp and straws and motivatedly-reason their way into accepting sketchy sounding premises. But, the true solution to a level often feels very crisp and clear and inevitable.

A few of the scientists I’ve read about have realized their big ideas in moments of insight (e.g., Darwin for natural selection, Einstein for special relativity). My current guess about what’s going on is something like: as you attempt to understand a concept you don’t already ... (read more)

TurnTrout's shortform feed

aysja6mo30

I'm somewhat confused about when these evaluations are preformed (i.e., how much safety training the model has undergone). OpenAI's paper says: "Red teamers had access to various snapshots of the model at different stages of training and mitigation maturity starting in early August through mid-September 2024," so it seems like this evaluation was probably performed several times. Were these results obtained only prior to safety training or after? The latter seems more concerning to me, so I'm curious.

5Marius Hobbhahn6mo

Unless otherwise stated, all evaluations were performed on the final model we had access to (which I presume is o1-preview). For example, we preface one result with "an earlier version with less safety training".

Pay Risk Evaluators in Cash, Not Equity

aysja6mo134

You actually want evaluators to have as much skin in the game as other employees so that when they take actions that might shut the company down or notably reduce the value of equity, this is a costly signal.
Further, it's good if evaluators are just considered normal employees and aren't separated out in any specific way. Then, other employees at the company will consider these evaluators to be part of their tribe and will feel some loyalty. (Also reducing the chance that risk evaluators feel like they are part of rival "AI safety" tribe.) This probably ha

... (read more)

8ryan_greenblatt6mo

I agree with all of this, but don't expect to live in an ideal world with a non-broken risk management framework and we're making decisions on that margin. I also think predefined actions is somewhat tricky to get right even in a pretty ideal set up, but I agree you can get reasonably close. Note that I don't necessary endorse the arguments I quoted (this is just the strongest objection I'm aware of) and as a bottom line, I think you should pay risk evaluators in cash.

The Checklist: What Succeeding at AI Safety Will Involve

aysja6mo489

Thanks for writing this! I think it’s important for AI labs to write and share their strategic thoughts; I appreciate you doing so. I have many disagreements, but I think it’s great that the document is clear enough to disagree with.

You start the post by stating that “Our ability to do our safety work depends in large part on our access to frontier technology,” but you don’t say why. Like, there’s a sense in which much of this plan is predicated on Anthropic needing to stay at the frontier, but this document doesn’t explain why this is the right ... (read more)

Zach Stein-Perlman's Shortform

aysja7mo4421

I disagree. It would be one thing if Anthropic were advocating for AI to go slower, trying to get op-eds in the New York Times about how disastrous of a situation this was, or actually gaming out and detailing their hopes for how their influence will buy saving the world points if everything does become quite grim, and so on. But they aren’t doing that, and as far as I can tell they basically take all of the same actions as the other labs except with a slight bent towards safety.

Like, I don’t feel at all confident that Anthropic’s credit has exceeded their... (read more)

AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work

aysja7mo7285

Some commentary (e.g. here) also highlighted (accurately) that the FSF doesn’t include commitments. This is because the science is in early stages and best practices will need to evolve.

I think it’s really alarming that this safety framework contains no commitments, and I’m frustrated that concern about this is brushed aside. If DeepMind is aiming to build AGI soon, I think it’s reasonable to expect they have a mitigation plan more concrete than a list of vague IOU’s. For example, consider this description of a plan from the FSF:

When a mo

... (read more)

Zach Stein-Perlman's Shortform

aysja7mo3523

I'm sympathetic to how this process might be exhausting, but at an institutional level I think Anthropic (and all labs) owe humanity a much clearer image of how they would approach a potentially serious and dangerous situation with their models. Especially so, given that the RSP is fairly silent on this point, leaving the response to evaluations up to the discretion of Anthropic. In other words, the reason I want to hear more from employees is in part because I don't know what the decision process inside of Anthropic will look like if an evaluation indicat... (read more)

[anonymous]7mo1519

I will also note what I feel is a somewhat concerning trend. It's happened many times now that I've critiqued something about Anthropic (its RSP, advocating to eliminate pre-harm from SB 1047, the silent reneging on the commitment to not push the frontier), and someone has said something to the effect of: "this wouldn't seem so bad if you knew what was happening behind the scenes."

I just wanted to +1 that I am also concerned about this trend, and I view it as one of the things that I think has pushed me (as well as many others in the community) to lose a l... (read more)

Raemon's Shortform

aysja7mo106

Largely agree with everything here.

But, I've heard some people be concerned "aren't basically all SSP-like plans basically fake? is this going to cement some random bureaucratic bullshit rather than actual good plans?." And yeah, that does seem plausible.

I do think that all SSP-like plans are basically fake, and I’m opposed to them becoming the bedrock of AI regulation. But I worry that people take the premise “the government will inevitably botch this” and conclude something like “so it’s best to let the labs figure out what to do before cemen... (read more)

Richard_Ngo7mo117

Afaict, the current world we’re in is basically the worst case scenario

the status quo is not, imo, a remotely acceptable alternative either

Both of these quotes display types of thinking which are typically dangerous and counterproductive, because they rule out the possibility that your actions can make things worse.

The current world is very far from the worst-case scenario (even if you have very high P(doom), it's far away in log-odds) and I don't think it would be that hard to accidentally make things considerably worse.

2Raemon7mo

I think on alternative here that isn't just "trust AI companies" is "wait until we have a good Danger Eval, and then get another bit of legislation that specifically focuses on that, rather than hoping that the bureaucratic/political process shakes out with a good set of SSP industry standards." I don't know that that's the right call, but I don't think it's a crazy position from a safety perspective.

lemonhope's Shortform

aysja7mo62

Or independent thinkers try to find new frames because the ones on offer are insufficient? I think this is roughly what people mean when they say that AI is "pre-paradigmatic," i.e., we don't have the frames for filling to be very productive yet. Given that, I'm more sympathetic to framing posts on the margin than I am to filling ones, although I hope (and expect) that filling-type work will become more useful as we gain a better understanding of AI.

2niplav7mo

This response is specific to AI/AI alignment, right? I wasn't "sub-tweeting" the state of AI alignment, and was more thinking of other endeavours (quantified self, paradise engineering, forecasting research). In general, the bias towards framing can be swamped by other considerations.

Linch's Shortform

aysja8mo6024

I think this letter is quite bad. If Anthropic were building frontier models for safety purposes, then they should be welcoming regulation. Because building AGI right now is reckless; it is only deemed responsible in light of its inevitability. Dario recently said “I think if [the effects of scaling] did stop, in some ways that would be good for the world. It would restrain everyone at the same time. But it’s not something we get to choose… It’s a fact of nature… We just get to find out which world we live in, and then deal with it as best we can.” Bu... (read more)

2Dr. David Mathers7mo

One way to understand this is that Dario was simply lying when he said he thinks AGI is close and carries non-negligible X-risk, and that he actually thinks we don't need regulation yet because it is either far away or the risk is negligible. There have always been people who have claimed that labs simply hype X-risk concerns as a weird kind of marketing strategy. I am somewhat dubious of this claim, but Anthropic's behaviour here would be well-explained by it being true.

5Rebecca8mo

I’ve found use of the term catastrophe/catastrophic in discussions of SB 1047 makes it harder for me to think about the issue. The scale of the harms captured by SB 1047 has a much much lower floor than what EAs/AIS people usually term catastrophic risk, like $0.5bn+ vs $100bn+. My view on the necessity of pre-harm enforcement, to take the lens of the Anthropic letter, is very different in each case. Similarly, while the Anthropic letter talks about the the bill as focused on catastrophic risk, it also talks about “skeptics of catastrophic risk” - surely this is about eg not buying that AI will be used to start a major pandemic, rather than whether eg there’ll be an increase in the number of hospital systems subject to ransomware attacks bc of AI.

Zach Stein-Perlman's Shortform

aysja8mo60

I agree that scoring "medium" seems like it would imply crossing into the medium zone, although I think what they actually mean is "at most medium." The full quote (from above) says:

In other words, if we reach (or are forecasted to reach) at least “high” pre-mitigation risk in any of the considered categories, we will not continue with deployment of that model (by the time we hit “high” pre-mitigation risk) until there are reasonably mitigations in place for the relevant postmitigation risk level to be back at most to “medium” level.

I.e., I think what they... (read more)

4Zach Stein-Perlman8mo

Surely if any categories are above the "high" threshold then they're in "high zone" and if all are below the "high" threshold then they're in "medium zone." And regardless the reading you describe here seems inconsistent with [edited] ---------------------------------------- Added later: I think someone else had a similar reading and it turned out they were reading "crosses a medium risk threshold" as "crosses a high risk threshold" and that's just [not reasonable / too charitable].

Zach Stein-Perlman's Shortform

aysja8mo70

Maybe I'm missing the relevant bits, but afaict their preparedness doc says that they won't deploy a model if it passes the "medium" threshold, eg:

Only models with a post-mitigation score of "medium" or below can be deployed. In other words, if we reach (or are forecasted to reach) at least “high” pre-mitigation risk in any of the considered categories, we will not continue with deployment of that model (by the time we hit “high” pre-mitigation risk) until there are reasonably mitigations in place.

The threshold for further developing is set to "high," though. I.e., they can further develop so long as models don't hit the "critical" threshold.

8Zach Stein-Perlman8mo

I think you're confusing medium-threshold with medium-zone (the zone from medium-threshold to just-below-high-threshold). Maybe OpenAI made this mistake too — it's the most plausible honest explanation. (They should really do better.) (I doubt they intentionally lied, because it's low-upside and so easy to catch, but the mistake is weird.) Based on the PF, they can deploy a model just below the "high" threshold without mitigations. Based on the tweet and blogpost: This just seems clearly inconsistent with the PF (should say crosses out of medium zone by crossing a "high" threshold). This doesn't make sense: if you cross a "medium" threshold you enter medium-zone. Per the PF, the mitigations just need to bring you out of high-zone and down to medium-zone. (Sidenote: the tweet and blogpost incorrectly suggest that the "medium" thresholds matter for anything; based on the PF, only the "high" and "critical" thresholds matter (like, there are three ways to treat models: below high or between high and critical or above critical).) [edited repeatedly]

Optimistic Assumptions, Longterm Planning, and "Cope"

aysja8mo42

But you could just as easily frame this as "abstract reasoning about unfamiliar domains is hard therefore you should distrust doom arguments".

But doesn't this argument hold with the opposite conclusion, too? E.g. "abstract reasoning about unfamiliar domains is hard therefore you should distrust arguments about good-by-default worlds".

0Richard_Ngo8mo

Yepp, all of these arguments can weigh in many different directions, depending on your background beliefs. That's my point.

80,000 hours should remove OpenAI from the Job Board (and similar EA orgs should do similarly)

aysja8mo128

But whether an organization can easily respond is pretty orthogonal to whether they’ve done something wrong. Like, if 80k is indeed doing something that merits a boycott, then saying so seems appropriate. There might be some debate about whether this is warranted given the facts, or even whether the facts are right, but it seems misguided to me to make the strength of an objection proportional to someone’s capacity to respond rather than to the badness of the thing they did.

Habryka's Shortform Feed

aysja8mo4833

This comment appears to respond to habryka, but doesn’t actually address what I took to be his two main points—that Anthropic was using NDAs to cover non-disparagement agreements, and that they were applying significant financial incentive to pressure employees into signing them.

We historically included standard non-disparagement agreements by default in severance agreements

Were these agreements subject to NDA? And were all departing employees asked to sign them, or just some? If the latter, what determined who was asked to sign?

Habryka's Shortform Feed

aysja8mo610

Agreed. I'd be especially interested to hear this from people who have left Anthropic.

Loving a world you don’t trust

aysja9mo53

I do kind of share the sense that people mostly just want frisbee and tea, but I am still confused about it. Wasn't religion a huge deal for people for most of history? I could see a world where they were mostly just going through the motions, but the amount of awe I feel going into European churches feels like some evidence against this. And it's hard for me to imagine that people were kind of just mindlessly sitting there through e.g., Gregorian chanting in candlelight, but maybe I am typical minding too hard. It really seems like these rituals, the arch... (read more)

1SpectrumDT9mo

This sounds to me like selection bias. Most people did not build churches. And I suspect you do not feel awestruck in every church. I suspect that you remember the new most awesome ones, built by exceptional people who felt exceptionally religious. It may have been built for that purpose. This does not mean that most people felt the existential intensity. It is conceivable that many people felt "wow, the church sure is rich and powerful; I'd better obey" whereas many others felt nothing and stayed quiet about it.

1Mateusz Bagiński9mo

(Vague shower thought, not standing strongly behind it) Maybe it is the case that most people as individuals "just want frisbee and tea" but once religion (or rather the very broad class of ~"social practices" some subset/projection of which we round up to "religion") evolved and lowered the activation energy of people's hive switch, they became more inclined to appreciate the beauty of Cathedrals and Gregorian chants, etc. In other words, people's ability to want/appreciate/[see value/beauty in X] depends largely on the social structure they are embedded in, the framework they adopt to make sense of the world etc. (The selection pressures that led to religion didn't entirely reduce to "somebody wanting something", so at least that part is not question-begging [I think].)

Richard Ngo's Shortform

aysja9mo129

Right now it seems like the entire community is jumping to conclusions based on a couple of "impressions" people got from talking to Dario, plus an offhand line in a blog post. With that little evidence, if you have formed strong expectations, that's on you.

Like Robert, the impressions I had were based on what I heard from people working at Anthropic. I cited various bits of evidence because those were the ones available, not because they were the most representative. The most representative were those from Anthropic employees who concurred that this... (read more)

2Richard_Ngo9mo

This makes sense, and does update me. Though I note "implication", "insinuation" and "impression" are still pretty weak compared to "actually made a commitment", and still consistent with the main driver being wishful thinking on the part of the AI safety community (including some members of the AI safety community who work at Anthropic). I think there are two implicit things going on here that I'm wary of. The first one is an action-inaction distinction. Pushing them to justify their actions is, in effect, a way of slowing down all their actions. But presumably Anthropic thinks that them not doing things is also something which could lead to humanity going extinct. Therefore there's an exactly analogous argument they might make, which is something like "when you try to stop us from doing things you owe it to the world to adhere to a bar that's much higher than 'normal discourse'". And in fact criticism of Anthropic has not met this bar - e.g. I think taking a line from a blog post out of context and making a critical song about it is in fact unusually bad discourse. What's the disanalogy between you and Anthropic telling each other to have higher standards? That's the second thing that I'm wary about: you're claiming to speak on behalf of humanity as a whole. But in fact, you are not; there's no meaningful sense in which humanity is in fact demanding a certain type of explanation from Anthropic. Almost nobody wants an explanation of this particular policy; in fact, the largest group of engaged stakeholders here are probably Anthropic customers, who mostly just want them to ship more models. I don't really have a strong overall take. I certainly think it's reasonable to try to figure out what went wrong with communication here, and perhaps people poking around and asking questions would in fact lead to evidence of clear commitments being made. I am mostly against the reflexive attacks based on weak evidence, which seems like what's happening here. In general my m

Non-Disparagement Canaries for OpenAI

aysja9mo78

I have not heard from anyone who wasn’t released, and I think it is reasonably likely I would have heard from them anonymously on Signal. Also, not releasing a bunch of people after saying they would seems like an enormously unpopular, hard to keep secret, and not very advantageous move for OpenAI, which is already taking a lot of flak for this.

I’m not necessarily imagining that OpenAI failed to release a bunch of people, although that still seems possible to me. I’m more concerned that they haven’t released many key people, and while I agree that yo... (read more)

Non-Disparagement Canaries for OpenAI

aysja9mo127

I imagine many of the people going into leadership positions were prepared to ignore the contract, or maybe even forgot about the nondisparagement clause

I could imagine it being the case that people are prepared to ignore the contract. But unless they publicly state that, it wouldn’t ameliorate my concerns—otherwise how is anyone supposed to trust they will?

The clause is also open to more avenues of legal attack if it's enforced against someone who takes another position which requires disparagement (e.g. if it's argued to be a restriction on engaging in b

... (read more)

5William_S9mo

Evidence could look like 1. Someone was in a position where they had to make a judgement about OpenAI and was in a position of trust 2. They said something bland and inoffensive about OpenAI 3. Later, independently you find that they likely would have known about something bad that they likely weren't saying because of the nondisparagement agreement (instead of ordinary confidentially agreements). This requires some model of "this specific statement was influenced by the agreement" instead of just "you never said anything bad about OpenAI because you never gave opinions on OpenAI". I think one should require this kind of positive evidence before calling it a "serious breach of trust", but people can make their own judgement about where that bar should be.

MIRI 2024 Communications Strategy

aysja9mo6046

I agree this is usually the case, but I think it’s not always true, and I don’t think it’s necessarily true here. E.g., people as early as Da Vinci guessed that we’d be able to fly long before we had planes (or even any flying apparatus which worked). Because birds can fly, and so we should be able to as well (at least, this was Da Vinci and the Wright brothers' reasoning). That end point was not dependent on details (early flying designs had wings like a bird, a design which we did not keep :p), but was closer to a laws of physics claim (if birds can do i... (read more)

Matthew Barnett9mo*148

There's a pretty big difference between statements like "superintelligence is physically possible", "superintelligence could be dangerous" and statements like "doom is >80% likely in the 21st century unless we globally pause". I agree with (and am not objecting to) the former claims, but I don't agree with the latter claim.

I also agree that it's sometimes true that endpoints are easier to predict than intermediate points. I haven't seen Eliezer give a reasonable defense of this thesis as it applies to his doom model. If all he means here is that superin... (read more)

OpenAI: Fallout

aysja10mo5626

Bloomberg confirms that OpenAI has promised not to cancel vested equity under any circumstances, and to release all employees from one-directional non-disparagement agreements.

They don't actually say "all" and I haven't seen anyone confirm that all employees received this email. It seems possible (and perhaps likely) to me that many high profile safety people did not receive this email, especially since it would presumably be in Sam's interest to do so, and since I haven't seen them claiming otherwise. And we wouldn't know: those who are still under the co... (read more)

1PeterH9mo

CNBC reports: A handful of former employees have publicly confirmed that they received the email.

Maybe Anthropic's Long-Term Benefit Trust is powerless

aysja10mo139

Why do you think this? The power that I'm primarily concerned about is the power to pause, and I'm quite skeptical that companies like Amazon and Google would be willing to invest billions of dollars in a company which may decide to do something that renders their investment worthless. I.e, I think a serious pause, one on the order of months or years, is essentially equivalent to opting out of the race to AGI. On this question, my strong prior is that investors like Google and Amazon have more power than employees or the trust, else they wouldn't invest.

2Dr. David Mathers10mo

People will sometimes invest if they think the expected return is high, even if they also think there is a non-trivial chance that the investment will go to zero. During the FTX collapse many people claimed that this is a common attitude amongst venture capitalists, although maybe Google and Amazon are more risk averse?

4ryan_greenblatt10mo

They might just (probably correctly) think it is unlikely that the employees will decide to do this.

An explanation of evil in an organized world

aysja10mo20

"So God can’t make the atoms be arranged one way and the humans be arranged another contradictory way."

But couldn't he have made a different sort of thing than humans, which were less prone to evil? Like, it seems to me that he didn't need to make us evolve through the process of natural selection, such that species were always in competition, status was a big deal, fighting over mates commonplace, etc. I do expect that there's quite a bit of convergence in the space of possible minds—even if one is selecting them from the set of "all possible atomic confi... (read more)

1Mateusz Bagiński10mo

Or maybe the Ultimate Good in the eyes of God is the epic sequence of: dead matter -> RNA world -> protocells -> ... -> hairless apes throwing rocks at each other and chasing gazelles -> weirdoes trying to accomplish the impossible task of raising the sanity waterline and carrying the world through the Big Filter of AI Doom -> deep utopia/galaxy lit with consciousness/The Goddess of Everything Else finale.

The Intentional Stance, LLMs Edition

aysja10mo126

Secondly, following Dennett, the point of modeling cognitive systems according to the intentional stance is that we evaluate them on a behavioral basis and that is all there is to evaluate.

I am confused on this point. Several people have stated that Dennett believes something like this, e.g., Quintin and Nora argue that Dennett is a goal "reductionist," by which I think they mean something like "goal is the word we use to refer to certain patterns of behavior, but it's not more fundamental than that."

But I don't think Dennett believes this. He's ... (read more)

The first future and the best future

aysja11mo117

I don't know what Katja thinks, but for me at least: I think AI might pose much more lock-in than other technologies. I.e., I expect that we'll have much less of a chance (and perhaps much less time) to redirect course, adapt, learn from trial and error, etc. than we typically do with a new technology. Given this, I think going slower and aiming to get it right on the first try is much more important than it normally is.

Paul Christiano named as US AI Safety Institute Head of AI Safety

aysja11mo42

I agree there other problems the EA biosecurity community focuses on, but surely lab escapes are one of those problems, and part of the reason we need biosecurity measures? In any case, this disagreement seems beside the main point that I took Adam to be making, namely that the track record for defining appropriate units of risk for poorly understood, high attack surface domains is quite bad (as with BSL). This still seems true to me.

2Davidmanheim11mo

BSL isn't the thing that defines "appropriate units of risk", that's pathogen risk-group levels, and I agree that those are are problem because they focus on pathogen lists rather than actual risks. I actually think BSL are good at what they do, and the problem is regulation and oversight, which is patchy, as well as transparency, of which there is far too little. But those are issues with oversight, not with the types of biosecurity measure that are available.

Daniel Dennett has died (1942-2024)

aysja11mo177

Dennett meant a lot to me, in part because he’s shaped my thinking so much, and in part because I think we share a kindred spirit—this ardent curiosity about minds and how they might come to exist in a world like ours. I also think he is an unusually skilled thinker and writer in many respects, as well as being an exceptionally delightful human. I miss him.

In particular, I found his deep and persistent curiosity beautiful and inspiring, especially since it’s aimed at all the (imo) important questions. He has a clarity of thought which manages to be b... (read more)

Express interest in an "FHI of the West"

aysja11mo4511

Aw man, this is so exciting! There’s something really important to me about rationalist virtues having a home in the world. I’m not sure if what I’m imagining is what you’re proposing, exactly, but I think most anything in this vicinity would feel like a huge world upgrade to me.

Apparently I have a lot of thoughts about this. Here are some of them, not sure how applicable they are to this project in particular. I think you can consider this to be my hopes for what such a thing might be like, which I suspect shares some overlap.

It has felt to me for a few y... (read more)

Express interest in an "FHI of the West"

aysja11mo30

Huh, I feel confused. I suppose we just have different impressions. Like, I would say that Oliver is exceedingly good at cutting through the bullshit. E.g., I consider his reasoning around shutting down the Lightcone offices to be of this type, in that it felt like a very straightforward document of important considerations, some of which I imagine were socially and/or politically costly to make. One way to say that is that I think Oliver is very high integrity, and I think this helps with bullshit detection: it's easier to see how things don't cut to the ... (read more)

2owencb11mo

I don't really disagree with anything you're saying here, and am left with confusion about what your confusion is about (like it seemed like you were offering it as examples of disagreement?).

Transformers Represent Belief State Geometry in their Residual Stream

aysja11mo140

This is very cool! I’m excited to see where it goes :)

A couple questions (mostly me grappling with what the implications of this work might be):

Given a dataset of sequences of tokens, how do you find the HMM that could have generated it, and can this be done automatically? Also, is the mapping from dataset to HMM unique?
This question is possibly more confused on my end, sorry if so. I’m trying to get at something like “how interpretable will these simplexes be with much larger models?” Like, if I’m imagining that each state is a single token, and the HMM i

... (read more)

Adam Shai11mo110

Thanks!

one way to construct an HMM is by finding all past histories of tokens that condition the future tokens with the same probablity distribution, and make that equivalence class a hidden state in your HMM. Then the conditional distributions determine the arrows coming out of your state and which state you go to next. This is called the "epsilon machine" in Comp Mech, and it is unique. It is one presentation of the data generating process, but in general there are an infinite number of HMM presntations that would generate the same data. The epsilon mach

... (read more)