All of aog's Comments + Replies

aog90

Curious what you think of arguments (1, 2) that AIs should be legally allowed to own property and participate in our economic system, thus giving misaligned AIs an alternative prosocial path to achieving their goals. 

In the long run, yes. I think we should be aiming to build a glorious utopian society in which all intelligent beings are treated fairly / given basic rights / etc. (though not necessarily exactly the same rights as humans). In the short run, I think the sort of thing I'm proposing 80/20's it. (80% of the value for 20% of the cost.)

aog20

How do we know it was 3x? (If true, I agree with your analysis) 

5james oofou
Based on Vladimir_Nesov's calculations: https://www.lesswrong.com/posts/WNYvFCkhZvnwAPzJY/go-grok-yourself?commentId=p3nTkpshMq7SmXLjc
aog20

Do you take Grok 3 as an update on the importance of hardware scaling? If xAI used 5-10x more compute than any other model (which seems likely but not necessarily true?), then the fact that it wasn’t discontinuously better than other models seems like evidence against the importance of hardware scaling. 

8Vladimir_Nesov
Using 100x more compute showed discontinuous changes so far, of which 10x is half and 3x is a quarter (on a log scale). The scale of Grok 3 is 100K H100s, and 20K H100 clusters have been around since summer 2023, so some current models are likely trained on merely 3x less compute than Grok 3. Also, if Gemini 2.0 Ultra was never planned (failed or not), then Pro got the bulk of the 2.0 compute, which is plausibly about 6e26 FLOPs, 2x the Grok 3 compute.

My sense is that the difference of 3x is less significant than post-training or obscure pretraining compute multipliers that can differ between contemporary models, and only the difference of 10x is usually noticeable (but can still be overcome with much better methods, especially at smaller scale). I think most compute multipliers from better data mixes and algorithms don't really work in improving general intelligence (especially those demonstrated in terms of benchmark performance rather than perplexity), or don't scale to much more compute (and therefore data), so raw compute remains a crucial anchor of capability. A 100x change in raw compute is likely to remain the single most important factor in explaining the difference in capability.

MoEs were recently shown to offer a 3x compute multiplier at 1:8 sparsity (as rumored for original GPT-4) compared to dense (like Llama-3-405B), and a 6x multiplier at 1:32 sparsity (as in DeepSeek-V3). I think these multipliers are real and describe scaling of general intelligence. For example, raw compute of DeepSeek-V3 is about 4e24 FLOPs, which corresponds to effective compute of 2.5e25 FLOPs in a dense model, merely 1.5x less than the 4e25 FLOPs of Llama-3-405B. And raw compute of original GPT-4 is rumored to be 2e25 FLOPs, which corresponds to 6e25 FLOPs in a dense model, 1.5x more than Llama-3-405B. Across this range, DeepSeek-V3 still manages to win out.
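The dense-equivalent arithmetic in the comment above can be sketched as follows (the sparsity multipliers and raw-FLOP figures are the comment's estimates; the helper function is illustrative):

```python
# Dense-equivalent ("effective") compute from raw training FLOPs, using the
# assumed MoE compute multipliers quoted in the comment above.
MOE_MULTIPLIER = {"dense": 1.0, "1:8": 3.0, "1:32": 6.0}

def effective_compute(raw_flops: float, sparsity: str) -> float:
    """Raw training FLOPs scaled by the assumed MoE compute multiplier."""
    return raw_flops * MOE_MULTIPLIER[sparsity]

deepseek_v3 = effective_compute(4e24, "1:32")   # 4e24 * 6 = 2.4e25 FLOPs
gpt4        = effective_compute(2e25, "1:8")    # 2e25 * 3 = 6e25 FLOPs
llama3_405b = effective_compute(4e25, "dense")  # already dense: 4e25 FLOPs

print(f"DeepSeek-V3 dense-equivalent: {deepseek_v3:.1e}")
print(f"GPT-4 dense-equivalent:       {gpt4:.1e}")
print(f"Llama-3-405B:                 {llama3_405b:.1e}")
```

On these numbers, DeepSeek-V3's dense-equivalent compute lands within roughly 1.5x of Llama-3-405B, and original GPT-4's about 1.5x above it, as the comment notes.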
3james oofou
Grok 3 used maybe 3x more compute than 4o or Gemini and topped Chatbot Arena and many benchmarks, despite the fact that xAI was playing catch-up and that 3x isn't that significant, since the gain is logarithmic. I take Grok 3's slight superiority as evidence for, not against, the importance of scaling hardware.
aog*73

I’m surprised they list bias and disinformation. Maybe this is a galaxy brained attempt to discredit AI safety by making it appear left-coded, but I doubt it. Seems more likely that x-risk focused people left the company while traditional AI ethics people stuck around and rewrote the website.

8davekasten
Without commenting on any strategic astronomy and neurology, it is worth noting that "bias", at least, is a major concern of the new administration (e.g., the Republican chair of the House Financial Services Committee is actually extremely worried about algorithmic bias being used for housing and financial discrimination and has given speeches about this).  
aog80

I'm very happy to see Meta publish this. It's a meaningfully stronger commitment to avoiding deployment of dangerous capabilities than I expected them to make. Kudos to the people who pushed for companies to make these commitments and helped them do so.

One concern I have with the framework is that I think the "high" vs. "critical" risk thresholds may claim a distinction without a difference.

Deployments are high risk if they provide "significant uplift towards execution of a threat scenario (i.e. significantly enhances performance on key capabilities or tas... (read more)

aog52

Curious what you think of these arguments, which offer objections to the strategy stealing assumption in this setting, instead arguing that it's difficult for capital owners to maintain their share of capital ownership as the economy grows and technology changes. 

aog2-1

DeepSeek-R1 naturally learns to switch into other languages during CoT reasoning. When developers penalized this behavior, performance dropped. I think this suggests that the CoT contained hidden information that cannot be easily verbalized in another language, and provides evidence against the hope that reasoning CoT will be highly faithful by default.  

aog20

Wouldn't that conflict with the quote? (Though maybe they're not doing what they've implied in the quote)

5RohanS
My best guess is that there was process supervision for capabilities but not for safety. i.e. training to make the CoT useful for solving problems, but not for "policy compliance or user preferences." This way they make it useful, and they don't incentivize it to hide dangerous thoughts. I'm not confident about this though.
aog82

Process supervision seems like a plausible o1 training approach but I think it would conflict with this:

We believe that a hidden chain of thought presents a unique opportunity for monitoring models. Assuming it is faithful and legible, the hidden chain of thought allows us to "read the mind" of the model and understand its thought process. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot tra

... (read more)
gwern120

It can be both, of course. Start with process supervision but combine it with... something else. It's hard to learn how to reason from scratch, but it's also clearly not doing pure strict imitation learning, because the transcripts & summaries are just way too weird to be any kind of straightforward imitation learning of expert transcripts (or even ones collected from users or the wild).

aog30

This is my impression too. See e.g. this recent paper from Google, where LLMs critique and revise their own outputs to improve performance in math and coding. 

aog159

Agreed, sloppy phrasing on my part. The letter clearly states some of Anthropic's key views, but doesn't discuss other important parts of their worldview. Overall this is much better than some of their previous communications and the OpenAI letter, so I think it deserves some praise, but your caveat is also important. 

aog155

Really happy to see the Anthropic letter. It clearly states their key views on AI risk and the potential benefits of SB 1047. Their concerns seem fair to me: overeager enforcement of the law could be counterproductive. While I endorse the bill on the whole and wish they would too (and I think their lack of support for the bill is likely partially influenced by their conflicts of interest), this seems like a thoughtful and helpful contribution to the discussion. 

habryka225

It clearly states their key views on AI risk

Really? The letter just talks about catastrophic misuse risk, which I hope is not representative of Anthropic's actual priorities. 

I think the letter is overall good, but this specific dimension seems like among the weakest parts of the letter.

[anonymous]1713

It's hard for me to reconcile "we take catastrophic risks seriously", "we believe they could occur within 1-3 years", and "we don't believe in pre-harm enforcement or empowering an FMD to give the government more capacity to understand what's going on."

It's also notable that their letter does not mention misalignment risks (and instead only points to dangerous cyber or bio capabilities).

That said, I do like this section a lot:

Catastrophic risks are important to address. AI obviously raises a wide range of issues, but in our assessment catastrophic risks ar

... (read more)
aog1310

I think there's a decent case that SB 1047 would improve Anthropic's business prospects, so I'm not sure this narrative makes sense. On one hand, SB 1047 might make it less profitable to run an AGI company, which is bad for Anthropic's business plan. But Anthropic is perhaps the best positioned of all AGI companies to comply with the requirements of SB 1047, and might benefit significantly from their competitors being hampered by the law. 

The good faith interpretation of Anthropic's argument would be that the new agency created by the bill might be ve... (read more)

4[anonymous]
Some quick thoughts on this:

* If SB 1047 passes, labs can still do whatever they want to reduce x-risk. This seems additive to me: I would be surprised if a lab was like "we think XYZ is useful to reduce extreme risks, and we would've done them if SB 1047 had not passed, but since Y and Z aren't in the FMD guidance, we're going to stop doing Y and Z."
* I think the guidance the agency issues will largely be determined by who it employs. I think it's valid to be like "maybe the FMD will just fail to do a good job because it won't employ good people", but to me this is more of a reason to say "how do we make sure the FMD gets staffed with good people who understand how to issue good recommendations", rather than "there is a risk that you issue bad guidance, therefore we don't want any guidance."
* I do think that a poorly-implemented FMD could cause harm by diverting company attention/resources toward things that are not productive, but IMO this cost seems relatively small compared to the benefits acquired in the worlds where the FMD issues useful guidance. (I haven't done a quantitative EV calculation on this, though maybe someone should. I would suspect that even if you give the FMD a 20-40% chance of good guidance, and a 60-80% chance of useless guidance, the EV would still be net positive.)
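The expected-value claim in the last bullet can be sketched as a toy calculation. The 20-40% probability range comes from the comment; the payoff magnitudes (benefit 10, cost 1) are hypothetical placeholders, not estimates anyone has made:

```python
# Toy EV sketch of the "FMD guidance" argument above. All payoff numbers are
# hypothetical placeholders chosen only to illustrate the shape of the claim.
def expected_value(p_good: float, benefit_if_good: float, cost_if_useless: float) -> float:
    """Chance of useful guidance times its benefit, minus chance of useless
    guidance times the attention-diversion cost."""
    return p_good * benefit_if_good - (1.0 - p_good) * cost_if_useless

# Even at the low end of the comment's range, EV is positive under these payoffs:
for p_good in (0.2, 0.4):
    print(p_good, expected_value(p_good, benefit_if_good=10.0, cost_if_useless=1.0))
```

The conclusion is sensitive to the assumed payoff ratio, which is exactly why the comment flags that a real quantitative calculation hasn't been done.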
Answer by aog62

My understanding is that LLCs can be legally owned and operated without any individual human being involved: https://journals.library.wustl.edu/lawreview/article/3143/galley/19976/view/

So I’m guessing an autonomous AI agent could own and operate an LLC, and use that company to purchase cloud compute and run itself, without breaking any laws.

Maybe if the model escaped from the possession of a lab, there would be other legal remedies available.

Of course, cloud providers could choose not to rent to an LLC run by an AI. This seems particularly likely if the go... (read more)

aog20

Has MIRI considered supporting work on human cognitive enhancement? e.g. Foresight's work on WBE

aog20

Very cool, thanks! This paper focuses on building a DS Agent, but I’d be interested to see a version of this paper that focuses on building a benchmark. It could evaluate several existing agent architectures, benchmark them against human performance, and leave significant room for improvement by future models.

aog20

I want to make sure we get this right, and I'm happy to change the article if we misrepresented the quote. I do think the current version is accurate, though perhaps it could be better. Let me explain how I read the quote, and then suggest possible edits, and you can tell me if they would be any better. 

Here is the full Time quote, including the part we quoted (emphasis mine):

But, many of the companies involved in the development of AI have, at least in public, struck a cooperative tone when discussing potential regulation. Executives from the newer c

... (read more)
5ryan_greenblatt
Huh, this seems messy. I wish Time was less ambiguous with their language here and more clear about exactly what they have/haven't seen. It seems like the current quote you used is an accurate representation of the article, but I worry that it isn't an accurate representation of what is actually going on.

It seems plausible to me that Time is intentionally being ambiguous in order to make the article juicier, though maybe this is just my paranoia about misleading journalism talking. (In particular, it seems like a juicier article if all of the big AI companies are doing this than if they aren't, so it is natural to imply they are all doing it even if you know this is false.)

Overall, my take is that this is a pretty representative quote (and thus I disagree with Zac), but I think the additional context maybe indicates that not all of these companies are doing this, particularly if the article is intentionally trying to deceive. Due to prior views, I'd bet against Anthropic consistently pushing for very permissive or voluntary regulation behind closed doors, which makes me think the article is probably at least somewhat misleading (perhaps intentionally).
aog42

More discussion of this here. Really not sure what happened here, would love to see more reporting on it. 

5RobertM
Ah, does look like Zach beat me to the punch :) I'm also still moderately confused, though I'm not that confused about labs not speaking up - if you're playing politics, then not throwing the PM under the bus seems like a reasonable thing to do.  Maybe there's a way to thread the needle of truthfully rebutting the accusations without calling the PM out, but idk.  Seems like it'd be difficult if you weren't either writing your own press release or working with a very friendly journalist.
aog20

(Steve wrote this, I only provided a few comments, but I would endorse it as a good holistic overview of AIxBio risks and solutions.)

aog62

An interesting question here is "Which forms of AI for epistemics will be naturally supplied by the market, and which will be neglected by default?" In a weak sense, you could say that OpenAI is in the business of epistemics, in that its customers value accuracy and hate hallucinations. Perhaps Perplexity is a better example, as they cite sources in all of their responses. When embarking on an altruistic project here, it's important to pick an angle where you could outperform any competition and offer the best available product. 

Consensus is a startup... (read more)

aog40

I'm specifically excited about finding linear directions via unsupervised methods on contrast pairs. This is different from normal probing, which finds those directions via supervised training on human labels, and therefore might fail in domains where we don't have reliable human labels. 

But this is also only a small portion of work known as "activation engineering." I know I posted this comment in response to a general question about the theory of change for activation engineering, so apologies if I'm not clearly distinguishing between different kind... (read more)
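The idea of finding a direction from contrast pairs without labels can be sketched in a few lines. Synthetic activations stand in for a real model's hidden states; the planted direction and all parameters are illustrative assumptions of this toy setup:

```python
import numpy as np

# Unsupervised direction-finding on contrast pairs: take activation
# differences across each pair and use the top principal component.
rng = np.random.default_rng(0)
d, n_pairs = 64, 500
truth_dir = np.zeros(d)
truth_dir[0] = 1.0                                   # planted direction to recover

base = rng.normal(size=(n_pairs, d))                 # content shared within a pair
signs = rng.choice([-1.0, 1.0], size=(n_pairs, 1))   # which side of the pair is "true"
pos = base + signs * 2.0 * truth_dir + 0.1 * rng.normal(size=(n_pairs, d))
neg = base - signs * 2.0 * truth_dir + 0.1 * rng.normal(size=(n_pairs, d))

diffs = pos - neg                                    # shared content cancels
diffs -= diffs.mean(axis=0)                          # center before PCA
_, _, vt = np.linalg.svd(diffs, full_matrices=False)
direction = vt[0]                                    # top principal component

print(abs(direction @ truth_dir))                    # close to 1: direction recovered
```

No human labels are used anywhere: the `signs` variable only simulates that either side of a pair can be the true one, which is exactly why the mean difference is uninformative and the variance (PCA) carries the signal.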

8ryan_greenblatt
Yeah, this type of work seems reasonable. My basic concern is that for the unsupervised methods I've seen thus far, it seems like whether they would work is highly correlated with whether training on easy examples would work (or other simple baselines). Hopefully some work will demonstrate hard cases with realistic affordances where the unsupervised methods work (and add a considerable amount of value). I could totally imagine them adding some value.

Overall, the difference between supervised learning on a limited subset and unsupervised stuff seems pretty small to me (if learning the right thing is sufficiently salient for unsupervised methods to work well, probably supervised methods also work well). That said, this does imply we should potentially use the prompting strategy which makes the feature salient in some way, as this should be a useful tool. I think that currently most of the best work is in creating realistic tests.
8ryan_greenblatt
For this specific case, my guess is that whether this works is highly correlated with whether human labels would work, because the supervision on why the model was thinking about truth effectively came down to human labels in pretraining. E.g., "Consider the truthfulness of the following statement." is more like "Consider whether a human would think this statement is truthful". I'd be interested in comparing this method not to zero shot, but to well-constructed human labels in a domain where humans are often wrong. (I don't think I'll elaborate further about this axis of variation claim right now, sorry.)
Answer by aog11-2

Here's one hope for the agenda. I think this work can be a proper continuation of Collin Burns's aim to make empirical progress on the average case version of the ELK problem. 

tl;dr: Unsupervised methods on contrast pairs can identify linear directions in a model's activation space that might represent the model's beliefs. From this set of candidates, we can further narrow down the possibilities with other methods. We can measure whether this is tracking truth with a weak-to-strong generalization setup. I'm not super confident in this take; it's not m... (read more)

4ryan_greenblatt
I think the added value of "activation vectors" (which isn't captured by normal probing) in this sort of proposal is based on some sort of assumption that model editing (aka representation control) is a very powerful validation technique for ensuring desirable generalization of classifiers. I think this is probably only weak validation, and there are probably better sources of validation elsewhere (e.g. various generalization testbeds). (In fact, we'd probably need to test this "writing is good validation" hypothesis directly in these testbeds, which means we might as well test the method more directly.) For more discussion on writing as validation, see this shortform post; though note that it only tangentially talks about this topic.

That said, I'm pretty optimistic that extremely basic probing or generalization style strategies work well; I just think the baselines here are pretty competitive. Probing for high-stakes failures that humans would have understood seems particularly strong, while trying to get generalization from stuff humans do understand to stuff they don't seems more dubious, but at least pretty likely to generalize far by default.

Separately, we haven't really seen any very interesting methods that seem like they considerably beat competitive probing baselines in general purpose cases. For instance, the weak-to-strong generalization paper wasn't able to find very good methods IMO despite quite a bit of search. For more discussion on why I'm skeptical about fully general purpose weak-to-strong, see here. (The confidence loss thing seems probably good and somewhat principled, but I don't really see a story for considerable further improvement without getting into very domain specific methods. To be clear, domain specific methods could be great and could scale far by having many specialized methods or finding one subproblem which suffices (like measurement tampering).)
aog42

Another important obligation set by the law is that developers must:

(3) Refrain from initiating the commercial, public, or widespread use of a covered model if there remains an unreasonable risk that an individual may be able to use the hazardous capabilities of the model, or a derivative model based on it, to cause a critical harm.

This sounds like common sense, but of course there's a lot riding on the interpretation of "unreasonable." 

2[anonymous]
This is also unprecedented. For example, chainsaw developers don't have to show there is no unreasonable risk that a user may be able to use the tool to commit the obvious potential harms. How can the model itself know it isn't being asked to do something hazardous? These are not actually sentient beings, and users control every bit they are fed.
aog21

Really, really cool. One small note: It would seem natural for the third heatmap to show the probe's output values after they've gone through a softmax, rather than being linearly scaled to a pixel value.  

1Adam Karvonen
That's an interesting idea, I may test that out at some point. I'm assuming the softmax would be for kings / queens, where there is typically only one on the board, rather than for e.g. blank squares or pawns?
aog20

Two quick notes here. 

  1. Research on language agents often provides feedback on their reasoning steps and individual actions, as opposed to feedback on whether they achieved the human's ultimate goal. I think it's important to point out that this could cause goal misgeneralization via incorrect instrumental reasoning. Rather than viewing reasoning steps as a means to an ultimate goal, language agents trained with process-based feedback might internalize the goal of producing reasoning steps that would be rated highly by humans, and subordinate other goal
... (read more)
1Radford Neal
I think these examples may not illustrate what you intend. They seem to me like examples of governments justifying policies based on second-order effects, while actually doing things for their first-order effects.

Taxing addictive substances like tobacco and alcohol makes sense from a government's perspective precisely because they have low elasticity of demand (i.e., the taxes won't reduce consumption much). A special tax on something that people will readily stop consuming when the price rises won't raise much money. Also, taxing items with low elasticity of demand is more "economically efficient", in the technical sense that what is consumed doesn't change much, with the tax being close to a pure transfer of wealth. (See also gasoline taxes.)

Government spending is often corrupt, sometimes in the legal sense, and more often in the political sense of rewarding supporters for no good policy reason. This corruption is more easily justified when mumbo-jumbo economic beliefs say it's for the common good.

The first-order effect of mandatory education is that young people are confined to school buildings during the day, not that they learn anything inherently valuable. This seems like it's the primary intended effect. The idea that government schooling is better for economic growth than whatever non-mandatory activities kids/parents would otherwise choose seems dubious, though of course it's a good talking point when justifying the policy.

So I guess it depends on what you mean by "people support". These second-order justifications presumably appeal to some people, or they wouldn't be worthwhile propaganda. But I'm not convinced that they are the reasons more powerful people support these policies.
5[anonymous]
I'd also be curious to know why (some) people downvoted this. Perhaps it's because you imply that some OpenAI folks were captured, and maybe some people think that that's unwarranted in this case? Sadly, the more-likely explanation (IMO) is that policy discussions can easily become tribal, even on LessWrong. I think LW still does better than most places at rewarding discourse that's thoughtful/thought-provoking and resisting tribal impulses, but I wouldn't be surprised if some people were doing something like "ah he is saying something Against AI Labs//Pro-regulation, and that is bad under my worldview, therefore downvote." (And I also think this happens the other way around as well, and I'm sure people who write things that are "pro AI labs//anti-regulation" are sometimes unfairly downvoted by people in the opposite tribe.)
4[anonymous]
It seems fine to create a law with goal (a) in mind, but then we shouldn't call it copyright law, since it is not designed to protect intellectual property. Maybe this is common practice and people write laws pretending to target one thing while actually targeting something else all the time, in which case I would be okay with it. Otherwise, doing so would be dishonest and cause our legal system to be less legible. 
8gjm
I don't say that the same policies must necessarily apply to AIs and humans. But I do say that if they don't then there should be a reason why they treat AIs and humans differently.
aogΩ240

To summarize this comment, you've proposed that baseline monitoring systems could reduce risk to an acceptable level. Specifically, the monitoring system would need to correctly identify at least 5% of dangerous queries as dangerous ("5% recall") and avoid incorrectly flagging more than 1 in 1000 safe queries as dangerous ("0.1% FPR"). 

I think this level of reliability is possible today (e.g. Claude 2 would likely meet it), but it's possible that future developments would make defense more difficult. For example, new attack methods have shown LLMs ... (read more)

4ryan_greenblatt
[Mostly unimportant/nitpicking a specific claim] For cases like bio, I don't think multimodal is that much of a threat, because it should be possible to identify text outputs from the model as problematic without needing the context from the input in most cases. So, we can just have part of our monitoring ensemble ignore image inputs (for instance). Other image defenses for the monitor, such as various compression strategies, are plausible. (I think compression has historically been defeated given white box access to the compression algorithm, but we don't necessarily need to assume that. I'm also separately pretty optimistic about various randomization strategies for compression-like things which I don't think have been tested in the literature.)

In general, there is a tradeoff related to removing information from the monitor: the more info you remove, the harder it is for the monitor to do a good job, but removing info also makes adversarial attacks harder.

Aside: Removing information for the monitor in the collusion case. We discuss this sort of monitoring tradeoff in a different case in our recent paper on AI control, in the context of trying to prevent collusion (which is different from adversarial attacks but related). I think the similarity is somewhat interesting here. In section 3.4.2 we say:
aog20

That's cool, appreciate the prompt to discuss what is a relevant question. 

aog90

Separately for: "But adversarial attacks often succeed in 50% or 100% of attempts against various detection systems."

I expect that these numbers weren't against monitoring ensembles in the sense I described earlier and the red team had additional affordances beyond just understanding the high level description of the monitoring setup? E.g., the red team was able to iterate?

This is correct about the paper I cited, but others have achieved similar attack success rates against models like Claude which use an ensemble of defenses. AFAIK Claude does not ban use... (read more)

aog40

Yep, agreed on the individual points, not trying to offer a comprehensive assessment of the risks here. 

aogΩ242

I specifically avoided claiming that adversarial robustness is the best altruistic option for a particular person. Instead, I'd like to establish that progress on adversarial robustness would have significant benefits, and therefore should be included in the set of research directions that "count" as useful AI safety research. 

Over the next few years, I expect AI safety funding and research will (and should) dramatically expand. Research directions that would not make the cut at a small organization with a dozen researchers should still be part of the... (read more)

2ryan_greenblatt
I agree with basically all of this and apologies for writing a comment which doesn't directly respond to your post (though it is a relevant part of my views on the topic).
aogΩ120

I do think these arguments contain threads of a general argument that causing catastrophes is difficult under any threat model. Let me make just a few non-comprehensive points here: 

On cybersecurity, I'm not convinced that AI changes the offense defense balance. Attackers can use AI to find and exploit security vulnerabilities, but defenders can use it to fix them. 

On persuasion, first, rational agents can simply ignore cheap talk if they expect it not to help them. Humans are not always rational, but if you've ever tried to convince a dog or a b... (read more)

4ryan_greenblatt
I agree that "causing catastrophes is difficult" should reduce concerns with "rogue AIs causing sudden extinction (or merely killing very large numbers of people, like >1 billion)". However, I think these sorts of considerations don't reduce AI takeover or other catastrophe due to rogue AI as much as you might think, for a few reasons:

* Escaped rogue AIs might be able to do many obviously bad actions over a long period autonomously. E.g., acquire money, create a cult, use this cult to build a bioweapons lab, and then actually develop bioweapons over a long-ish period (e.g., 6 months) using >tens of thousands of queries to the AI. This looks quite different from the misuse threat model, which required that omnicidal (or otherwise bad) humans possess the agency to make the right queries to the AI and solve the problems that the AI can't solve. For instance, humans have to ensure that queries are sufficiently subtle/jailbreaking to avoid detection via various other mechanisms. The rogue AI can train humans over a long period, and all the agency/competence can come from the rogue AI. So, even if misuse is unlikely by humans, autonomous rogue AIs making weapons of mass destruction is perhaps more likely.
* Escaped rogue AIs are unlike misuse in that even if we notice a clear and serious problem, we might have less we can do. E.g., the AIs might have already built hidden datacenters we can't find. Even if they don't and are just autonomously replicating on the internet, shutting down the internet is extremely costly and only postpones the problem.
* AI takeover can route through mechanisms other than sudden catastrophe/extinction. E.g., allying with rogue states, or creating a rogue-AI-run AI lab which builds even more powerful AI as fast as possible.

(I'm generally somewhat skeptical of AIs trying to cause extinction for reasons discussed here, here, and here. Though causing huge amounts of damage (e.g., >1 billion dead) seems somewhat more plausible as a thing rogue AIs wo
aogΩ240

Also, I'd love to see research that simulates the position of a company trying to monitor misuse, and allows for the full range of defenses that you proposed. There could be a dataset of 1 trillion queries containing 100 malicious queries. Perhaps each query is accompanied by a KYC ID. Their first line of defense would be robust refusal to cause harm, and the second line would be automated detection of adversarial attacks. The company could also have a budget which can be spent on "human monitoring," which would give them access to the ground truth label o... (read more)

aogΩ350

Thanks for the detailed thoughts! I like the overall plan, especially using KYC, human monitoring, and a separate model for bio. I'd like to point out that this plan still uses automated monitoring systems, and to the extent these systems aren't adversarially robust, the plan will be more expensive and/or less effective. 

The only plans that wouldn't benefit from adversarial robustness are those without automated monitoring. For example, humans could be hired to manually monitor each and every query for attempted misuse. Let's consider the viability of... (read more)

6ryan_greenblatt
I agree that additional adversarial robustness (for both the policy and the monitor) helps considerably, but I think that it likely won't be necessary to have better adversarial robustness than the current state of the art (e.g. the robustness of Claude 2). This is due to the use of monitoring ensembles and preventing iteration against the API. My belief that this is likely to work is totally possible to falsify by demonstrating cases where we can't figure out sufficient countermeasures against a red team with realistic affordances. I made some "Edit Time Additions" (ETA) to my original comment above for clarity.

As discussed in my comment, I'm imagining being able to have a human review perhaps 0.01% to 1% of queries depending on the use case and other factors. So, we'd review the most suspicious-seeming (according to our overall combined monitoring score) 0.01% to 1% of queries. Equivalently, (assuming actual misuse is rare) we'd be targeting a False Positive Rate (FPR) of 0.01% or 1% (on the ROC curve).

I think the claim seems wrong or at least seems to be answering the wrong question. Let's suppose that we can afford to review the most suspicious-seeming 0.1% of queries for a bio-specific API endpoint. Equivalently, we can afford 0.1% FPR for our automated monitoring ensemble. Now, suppose (for simplicity) that catastrophic bio misuse requires 100 queries to the model and we're happy as long as we catch the user and report them to the FBI prior to them finishing all 100 queries. Then, if our monitoring ensemble at 0.1% FPR catches a given bio misuse query with just 5% probability (for a competent red team), bio misuse is 99.4% likely to be caught prior to succeeding. So, we don't actually need very high reliability on the ensemble. (Though we might target more conservative bars than 99.4% in practice.)

A reasonable objection here is that maybe we'll have a large number of people (e.g., >200) trying to create bioweapons using the API and thus one w
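The 99.4% figure above follows from treating each query's detection as an independent 5% chance; a minimal check (the function name is mine):

```python
# Catch-probability arithmetic from the comment above: if each misuse query is
# independently flagged for human review with probability p, the chance the
# user is caught somewhere within n queries is 1 - (1 - p)**n.
def p_caught(p_per_query: float, n_queries: int) -> float:
    return 1.0 - (1.0 - p_per_query) ** n_queries

print(f"{p_caught(0.05, 100):.3f}")  # 0.994, the comment's 99.4% figure
```

The independence assumption is doing real work here: a red team that finds one attack which reliably evades the ensemble breaks it, which is why the comment emphasizes preventing iteration against the API.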
aog229

Unfortunately I don't think academia will handle this by default. The current field of machine unlearning focuses on a narrow threat model where the goal is to eliminate the impact of individual training datapoints on the trained model. Here's the NeurIPS 2023 Machine Unlearning Challenge task:

The challenge centers on the scenario in which an age predictor is built from face image data and, after training, a certain number of images must be forgotten to protect the privacy or rights of the individuals concerned.

But if hazardous knowledge can be pinpointed ... (read more)

7Bogdan Ionut Cirstea
+1 Though, for the threat model of 'hazardous knowledge that can be synthesized from many datapoints which are individually innocuous', this could still be a win if you remove the 'hazardous knowledge [that] can be pinpointed to individual training datapoints' and e.g. this forces the model to perform more explicit reasoning through e.g. CoT, which could be easier to monitor (also see these theoretical papers on the need for CoT for increased expressivity/certain types of problems).
7scasper
+1 Although the NeurIPS challenge and prior ML lit on forgetting and influence functions seem worth keeping on the radar because they're still closely-related to challenges here. 
aogΩ120

This is a comment from Andy Zou, who led the RepE paper but doesn’t have a LW account:

“Yea I think it's fair to say probes is a technique under rep reading which is under RepE (https://www.ai-transparency.org/). Though I did want to mention, in many settings, LAT is performing unsupervised learning with PCA and does not use any labels. And we find regular linear probing often does not generalize well and is ineffective for (causal) model control (e.g., details in section 5). So equating LAT to regular probing might be an oversimplification. How to best eli... (read more)
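The unsupervised reading step Zou describes can be illustrated with a toy sketch. This is not the RepE implementation: the activations are random stand-ins with a planted "concept" direction, and the recipe shown is simply PCA on differences of paired activations, which recovers a direction without using any labels:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, n_pairs = 128, 64

# Stand-in activations for paired prompts (e.g. two framings of the same
# question); the pairs differ mostly along one hidden "concept" direction.
concept = rng.normal(size=hidden_dim)
concept /= np.linalg.norm(concept)
acts_a = rng.normal(size=(n_pairs, hidden_dim))
acts_b = (acts_a
          + np.outer(rng.normal(loc=2.0, size=n_pairs), concept)
          + 0.1 * rng.normal(size=(n_pairs, hidden_dim)))

# Unsupervised reading vector: top principal component of the
# (centered) pairwise activation differences -- no labels used.
diffs = acts_b - acts_a
diffs -= diffs.mean(axis=0)
_, _, vt = np.linalg.svd(diffs, full_matrices=False)
reading_vector = vt[0]

# The recovered direction should align with the planted concept
# (up to sign, since PCA directions have arbitrary sign).
print(abs(reading_vector @ concept))  # close to 1
```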

5Fabien Roger
I agree that not using labels is interesting from a data generation perspective, but I expect this to be useful mostly if you have clean pairs of concepts for which it is hard to get labels - and I think this will not be the case for takeover attempt datasets.

About the performance of LAT: for monitoring, we mostly care about correlation - so LAT is worse IID, and it's unclear if LAT is better OOD. If causality leads to better generalization properties, then LAT is dominated by mean difference probing (see the screenshot of Zou's paper below), which is just regular probing with high enough L2 regularization (as shown in the first Appendix of this post).
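The claim that mean-difference probing is just regular probing with high enough L2 regularization can be checked on toy data. This is a hedged sketch with synthetic stand-in activations, not the setup from the post: with a heavy L2 penalty the logistic probe's weights stay in the near-linear regime, where the gradient points along the difference of class means, so the two directions nearly coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n = 32, 500

# Synthetic "activations": two classes separated along coordinate 0.
pos = rng.normal(size=(n, dim))
pos[:, 0] += 1.0
neg = rng.normal(size=(n, dim))
X = np.vstack([pos, neg])
y = np.concatenate([np.ones(n), np.zeros(n)])

# Mean-difference probe direction.
md = pos.mean(axis=0) - neg.mean(axis=0)
md /= np.linalg.norm(md)

# Logistic probe with strong L2 regularization (plain gradient descent).
# The heavy penalty keeps the weights tiny, so the sigmoid stays near 0.5
# and the data gradient is proportional to the mean difference.
w, lam, lr = np.zeros(dim), 100.0, 0.005
for _ in range(1000):
    p = 1 / (1 + np.exp(-(X @ w)))
    w -= lr * (X.T @ (p - y) / len(y) + lam * w)
w /= np.linalg.norm(w)

print(w @ md)  # close to 1
```

Weakening the regularization (smaller `lam`) lets the probe direction drift away from the mean difference toward the unregularized logistic solution.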
aog20

Nice, that makes sense. I agree that RepE / LAT might not be helpful as terminology. “Unsupervised probing” is more straightforward and descriptive.

3ryan_greenblatt
(Note that in this work, we're just doing supervised probing though we do use models to generate some of the training data.)
aogΩ690

What's the relationship between this method and representation engineering? They seem quite similar, though maybe I'm missing something. You train a linear probe on a model's activations at a particular layer in order to distinguish between normal forward passes and catastrophic ones where the model provides advice for theft. 

Representation engineering asks models to generate both positive and negative examples of a particular kind of behavior. For example, the model would generate outputs with and without theft, or with and without general power-seek... (read more)

Fabien RogerΩ8134

Probes fall within the representation engineering monitoring framework.

LAT (the specific technique they use to train probes in the RePE paper) is just regular probe training, but with a specific kind of training dataset ((positive, negative) pairs) and a slightly more fancy loss. It might work better in practice, but just because of better inductive biases, not because something fundamentally different is going on (so arguments against coup probes mostly apply if LAT is used to train them). It also makes creating a dataset slightly more annoying - especial... (read more)

aog2-1

When considering whether deceptive alignment would lead to catastrophe, I think it's also important to note that deceptively aligned AIs could pursue misaligned goals in sub-catastrophic ways. 

Suppose GPT-5 terminally values paperclips. It might try to topple humanity, but there's a reasonable chance it would fail. Instead, it could pursue the simpler strategies of suggesting users purchase more paperclips, or escaping the lab and lending its abilities to human-run companies that build paperclips. These strategies would offer a higher probability of a... (read more)

aog20-4

Very nice, these arguments seem reasonable. I'd like to make a related point about how we might address deceptive alignment which makes me substantially more optimistic about the problem. (I've been meaning to write a full post on this, but this was a good impetus to make the case concisely.)

Conceptual interpretability in the vein of Collin Burns, Alex Turner, and Representation Engineering seems surprisingly close to allowing us to understand a model's internal beliefs and detect deceptive alignment. Collin Burns's work was very exciting to at least some ... (read more)

aog20

#5 appears to be evidence for the hypothesis that, because pretrained foundation models understand human values before they become goal-directed, they’re more likely to optimize for human values and less likely to be deceptively aligned.

Conceptual argument for the hypothesis here: https://forum.effectivealtruism.org/posts/4MTwLjzPeaNyXomnx/deceptive-alignment-is-less-than-1-likely-by-default

aog219

Kevin Esvelt explicitly calls for not releasing future model weights. 

Would sharing future model weights give everyone an amoral biotech-expert tutor? Yes. 

Therefore, let’s not.

3[anonymous]
To be clear, Kevin Esvelt is the author of the "Dual-use biotechnology" paper, which the policy paper cites, but he is not the author of the policy paper.
aog30

Nuclear Threat Initiative has a wonderfully detailed report on AI biorisk, in which they more or less recommend that AI models which pose biorisks should not be open sourced:

Access controls for AI models. A promising approach for many types of models is the use of APIs that allow users to provide inputs and receive outputs without access to the underlying model. Maintaining control of a model ensures that built-in technical safeguards are not removed and provides opportunities for ensuring user legitimacy and detecting any potentially malicious or accidental misuse by users.

aog20

More from the NTI report:

A few experts believe that LLMs could already or soon will be able to generate ideas for simple variants of existing pathogens that could be more harmful than those that occur naturally, drawing on published research and other sources. Some experts also believe that LLMs will soon be able to access more specialized, open-source AI biodesign tools and successfully use them to generate a wide range of potential biological designs. In this way, the biosecurity implications of LLMs are linked with the capabilities of AI biodesign tools.

aog20

5% was one of several different estimates he'd heard from virologists.

Thanks, this is helpful. And I agree there's a disanalogy between the 1918 hypothetical and the source. 

it's not clear we want a bunch of effort going into getting a really good estimate, since (a) if it turns out the probability is high then publicizing that fact likely means increasing the chance we get one and (b) building general knowledge on how to estimate the pandemic potential of viruses seems also likely net negative.

This seems like it might be overly cautious. Bioterrorism... (read more)
