All of aogara's Comments + Replies

DeepSeek-R1 naturally learns to switch into other languages during CoT reasoning. When developers penalized this behavior, performance dropped. I think this suggests that the CoT contained hidden information that cannot be easily verbalized in another language, and provides evidence against the hope that reasoning CoT will be highly faithful by default.  

Wouldn't that conflict with the quote? (Though maybe they're not doing what they've implied in the quote)

5RohanS
My best guess is that there was process supervision for capabilities but not for safety. i.e. training to make the CoT useful for solving problems, but not for "policy compliance or user preferences." This way they make it useful, and they don't incentivize it to hide dangerous thoughts. I'm not confident about this though.

Process supervision seems like a plausible o1 training approach but I think it would conflict with this:

We believe that a hidden chain of thought presents a unique opportunity for monitoring models. Assuming it is faithful and legible, the hidden chain of thought allows us to "read the mind" of the model and understand its thought process. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot tra

... (read more)

It can be both, of course. Start with process supervision but combine it with... something else. It's hard to learn how to reason from scratch, but it's also clearly not doing pure strict imitation learning, because the transcripts & summaries are just way too weird to be any kind of straightforward imitation learning of expert transcripts (or even ones collected from users or the wild).

This is my impression too. See e.g. this recent paper from Google, where LLMs critique and revise their own outputs to improve performance in math and coding. 

Agreed, sloppy phrasing on my part. The letter clearly states some of Anthropic's key views, but doesn't discuss other important parts of their worldview. Overall this is much better than some of their previous communications and the OpenAI letter, so I think it deserves some praise, but your caveat is also important. 

Really happy to see the Anthropic letter. It clearly states their key views on AI risk and the potential benefits of SB 1047. Their concerns seem fair to me: overeager enforcement of the law could be counterproductive. While I endorse the bill on the whole and wish they would too (and I think their lack of support for the bill is likely partially influenced by their conflicts of interest), this seems like a thoughtful and helpful contribution to the discussion. 

It clearly states their key views on AI risk

Really? The letter just talks about catastrophic misuse risk, which I hope is not representative of Anthropic's actual priorities. 

I think the letter is overall good, but this specific dimension seems like among the weakest parts of the letter.

It's hard for me to reconcile "we take catastrophic risks seriously", "we believe they could occur within 1-3 years", and "we don't believe in pre-harm enforcement or empowering an FMD to give the government more capacity to understand what's going on."

It's also notable that their letter does not mention misalignment risks (and instead only points to dangerous cyber or bio capabilities).

That said, I do like this section a lot:

Catastrophic risks are important to address. AI obviously raises a wide range of issues, but in our assessment catastrophic risks ar

... (read more)

I think there's a decent case that SB 1047 would improve Anthropic's business prospects, so I'm not sure this narrative makes sense. On one hand, SB 1047 might make it less profitable to run an AGI company, which is bad for Anthropic's business plan. But Anthropic is perhaps the best positioned of all AGI companies to comply with the requirements of SB 1047, and might benefit significantly from their competitors being hampered by the law. 

The good faith interpretation of Anthropic's argument would be that the new agency created by the bill might be ve... (read more)

4Akash
Some quick thoughts on this:
* If SB1047 passes, labs can still do whatever they want to reduce xrisk. This seems additive to me – I would be surprised if a lab was like "we think XYZ is useful to reduce extreme risks, and we would've done them if SB1047 had not passed, but since Y and Z aren't in the FMD guidance, we're going to stop doing Y and Z."
* I think the guidance the agency issues will largely be determined by who it employs. I think it's valid to be like "maybe the FMD will just fail to do a good job because it won't employ good people", but to me this is more of a reason to say "how do we make sure the FMD gets staffed with good people who understand how to issue good recommendations", rather than "there is a risk that you issue bad guidance, therefore we don't want any guidance."
* I do think that a poorly-implemented FMD could cause harm by diverting company attention/resources toward things that are not productive, but IMO this cost seems relatively small compared to the benefits acquired in the worlds where the FMD issues useful guidance. (I haven't done a quantitative EV calculation on this though, maybe someone should. I would suspect that even if you give FMD like 20-40% chance of good guidance, and 60-80% chance of useless guidance, the EV would still be net positive.)
Answer by aogara62

My understanding is that LLCs can be legally owned and operated without any individual human being involved: https://journals.library.wustl.edu/lawreview/article/3143/galley/19976/view/

So I’m guessing an autonomous AI agent could own and operate an LLC, and use that company to purchase cloud compute and run itself, without breaking any laws.

Maybe if the model escaped from the possession of a lab, there would be other legal remedies available.

Of course, cloud providers could choose not to rent to an LLC run by an AI. This seems particularly likely if the go... (read more)

Has MIRI considered supporting work on human cognitive enhancement? e.g. Foresight's work on WBE

Very cool, thanks! This paper focuses on building a DS Agent, but I’d be interested to see a version of this paper that focuses on building a benchmark. It could evaluate several existing agent architectures, benchmark them against human performance, and leave significant room for improvement by future models.

I want to make sure we get this right, and I'm happy to change the article if we misrepresented the quote. I do think the current version is accurate, though perhaps it could be better. Let me explain how I read the quote, and then suggest possible edits, and you can tell me if they would be any better. 

Here is the full Time quote, including the part we quoted (emphasis mine):

But, many of the companies involved in the development of AI have, at least in public, struck a cooperative tone when discussing potential regulation. Executives from the newer c

... (read more)
5ryan_greenblatt
Huh, this seems messy. I wish Time was less ambiguous with their language here and more clear about exactly what they have/haven't seen. It seems like the current quote you used is an accurate representation of the article, but I worry that it isn't an accurate representation of what is actually going on. It seems plausible to me that Time is intentionally being ambiguous in order to make the article juicier, though maybe this is just my paranoia about misleading journalism talking. (In particular, it seems like a juicier article if all of the big AI companies are doing this than if they aren't, so it is natural to imply they are all doing it even if you know this is false.) Overall, my take is that this is a pretty representative quote (and thus I disagree with Zac), but I think the additional context maybe indicates that not all of these companies are doing this, particularly if the article is intentionally trying to deceive. Due to prior views, I'd bet against Anthropic consistently pushing for very permissive or voluntary regulation behind closed doors, which makes me think the article is probably at least somewhat misleading (perhaps intentionally).

More discussion of this here. Really not sure what happened here, would love to see more reporting on it. 

5RobertM
Ah, does look like Zach beat me to the punch :) I'm also still moderately confused, though I'm not that confused about labs not speaking up - if you're playing politics, then not throwing the PM under the bus seems like a reasonable thing to do.  Maybe there's a way to thread the needle of truthfully rebutting the accusations without calling the PM out, but idk.  Seems like it'd be difficult if you weren't either writing your own press release or working with a very friendly journalist.

(Steve wrote this, I only provided a few comments, but I would endorse it as a good holistic overview of AIxBio risks and solutions.)

An interesting question here is "Which forms of AI for epistemics will be naturally supplied by the market, and which will be neglected by default?" In a weak sense, you could say that OpenAI is in the business of epistemics, in that its customers value accuracy and hate hallucinations. Perhaps Perplexity is a better example, as they cite sources in all of their responses. When embarking on an altruistic project here, it's important to pick an angle where you could outperform any competition and offer the best available product. 

Consensus is a startup... (read more)

I'm specifically excited about finding linear directions via unsupervised methods on contrast pairs. This is different from normal probing, which finds those directions via supervised training on human labels, and therefore might fail in domains where we don't have reliable human labels. 
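Here's a minimal sketch (with synthetic stand-in activations, not any particular paper's code) of what "finding linear directions via unsupervised methods on contrast pairs" can look like in practice: collect activations for paired statements and their negations, then take the top principal component of the differences as a candidate direction, without ever using human labels.

```python
# A minimal sketch of unsupervised direction-finding on contrast pairs: the top
# PCA component of activation differences is a candidate "truth-like" direction.
import numpy as np

rng = np.random.default_rng(0)
n_pairs, d_model = 512, 768

# Stand-ins for real hidden states; in practice these would come from a language
# model run on "X is true" / "X is false" style contrast pairs.
acts_pos = rng.normal(size=(n_pairs, d_model))
acts_neg = rng.normal(size=(n_pairs, d_model))

diffs = acts_pos - acts_neg
diffs -= diffs.mean(axis=0)                       # center the differences
_, _, vt = np.linalg.svd(diffs, full_matrices=False)
direction = vt[0]                                 # top principal component

# Score new activations by projecting onto the direction; note the sign is
# ambiguous and has to be resolved separately (e.g. with a few known examples).
scores = acts_pos @ direction
```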

But this is also only a small portion of work known as "activation engineering." I know I posted this comment in response to a general question about the theory of change for activation engineering, so apologies if I'm not clearly distinguishing between different kind... (read more)

8ryan_greenblatt
Yeah, this type of work seems reasonable. My basic concern is that for the unsupervised methods I've seen thus far it seems like whether they would work is highly correlated with whether training on easy examples would work (or other simple baselines). Hopefully some work will demonstrate hard cases with realistic affordances where the unsupervised methods work (and add a considerable amount of value). I could totally imagine them adding some value. Overall, the difference between supervised learning on a limited subset and unsupervised stuff seems pretty small to me (if learning the right thing is sufficiently salient for unsupervised methods to work well, probably supervised methods also work well). That said, this does imply we should potentially use the prompting strategy which makes the feature salient in some way, as this should be a useful tool. I think that currently most of the best work is in creating realistic tests.
8ryan_greenblatt
For this specific case, my guess is that whether this works is highly correlated with whether human labels would work, because the supervision on why the model was thinking about truth came down to effectively human labels in pretraining. E.g., "Consider the truthfulness of the following statement." is more like "Consider whether a human would think this statement is truthful". I'd be interested in comparing this method not to zero shot, but to well constructed human labels in a domain where humans are often wrong. (I don't think I'll elaborate further about this axis of variation claim right now, sorry.)
Answer by aogara11-2

Here's one hope for the agenda. I think this work can be a proper continuation of Collin Burns's aim to make empirical progress on the average case version of the ELK problem. 

tl;dr: Unsupervised methods on contrast pairs can identify linear directions in a model's activation space that might represent the model's beliefs. From this set of candidates, we can further narrow down the possibilities with other methods. We can measure whether this is tracking truth with a weak-to-strong generalization setup. I'm not super confident in this take; it's not m... (read more)
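As a rough illustration of the weak-to-strong measurement idea mentioned in the tl;dr, here is a sketch with synthetic data and a hypothetical 70%-accurate weak supervisor; a real experiment would use model activations and an actual weaker labeler rather than these placeholders.

```python
# Sketch of a weak-to-strong check: train a probe on noisy "weak" labels and see
# whether it recovers ground truth better than the weak labels themselves.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 2000, 64
truth_direction = rng.normal(size=d)

acts = rng.normal(size=(n, d))                         # stand-in activations
gold = acts @ truth_direction > 0                      # ground-truth labels
weak = np.where(rng.random(n) < 0.7, gold, ~gold)      # weak supervisor: ~70% accurate

probe = LogisticRegression(max_iter=1000).fit(acts[:1000], weak[:1000])
probe_acc = probe.score(acts[1000:], gold[1000:])
weak_acc = (weak[1000:] == gold[1000:]).mean()
print(f"weak labels: {weak_acc:.2f}, probe trained on weak labels: {probe_acc:.2f}")
```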

4ryan_greenblatt
I think the added value of "activation vectors" (which isn't captured by normal probing) in this sort of proposal is based on some sort of assumption that model editing (aka representation control) is a very powerful validation technique for ensuring desirable generalization of classifiers. I think this is probably only weak validation and there are probably better sources of validation elsewhere (e.g. various generalization testbeds). (In fact, we'd probably need to test this "writing is good validation" hypothesis directly in these test beds which means we might as well test the method more directly.) For more discussion on writing as validation, see this shortform post; though note that it only tangentially talks about this topic. That said, I'm pretty optimistic that extremely basic probing or generalization style strategies work well, I just think the baselines here are pretty competitive. Probing for high-stakes failures that humans would have understood seems particularly strong while trying to get generalization from stuff humans do understand to stuff they don't seems more dubious, but at least pretty likely to generalize far by default. Separately, we haven't really seen any very interesting methods that seem like they considerably beat competitive probing baselines in general purpose cases. For instance, the weak-to-strong generalization paper wasn't able to find very good methods IMO despite quite a bit of search. For more discussion on why I'm skeptical about fully general purpose weak-to-strong see here. (The confidence loss thing seems probably good and somewhat principled, but I don't really see a story for considerable further improvement without getting into very domain specific methods. To be clear, domain specific methods could be great and could scale far by having many specialized methods or finding one subproblem which sufficies (like measurement tampering).

Another important obligation set by the law is that developers must:

(3) Refrain from initiating the commercial, public, or widespread use of a covered model if there remains an unreasonable risk that an individual may be able to use the hazardous capabilities of the model, or a derivative model based on it, to cause a critical harm.

This sounds like common sense, but of course there's a lot riding on the interpretation of "unreasonable." 

2[anonymous]
This is also unprecedented. For example, chain saw developers don't have to prove there isn't an unreasonable risk that a user may be able to use the tool to commit the obvious potential harms. How can the model itself know it isn't being asked to do something hazardous? These are not actually sentient beings and users control every bit they are fed.

Really, really cool. One small note: It would seem natural for the third heatmap to show the probe's output values after they've gone through a softmax, rather than being linearly scaled to a pixel value.  

1Adam Karvonen
That's an interesting idea, I may test that out at some point. I'm assuming the softmax would be for kings / queens, where there is typically only one on the board, rather than for e.g. blank squares or pawns?
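For concreteness, a small sketch of the softmax suggestion; the probe output shapes and class count here are assumptions for illustration, not the actual interface used in the post.

```python
# Instead of linearly rescaling raw probe outputs to pixel values, softmax the
# per-square logits over piece classes and plot the probability of one class.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

n_classes = 13                                # e.g. blank + 6 piece types x 2 colors
logits = np.random.default_rng(0).normal(size=(8, 8, n_classes))  # stand-in probe outputs

probs = softmax(logits, axis=-1)              # per-square distribution over classes
heatmap = probs[..., 5]                       # probability of one chosen class per square
print(heatmap.shape)                          # (8, 8), values in [0, 1]
```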

Two quick notes here. 

  1. Research on language agents often provides feedback on their reasoning steps and individual actions, as opposed to feedback on whether they achieved the human's ultimate goal. I think it's important to point out that this could cause goal misgeneralization via incorrect instrumental reasoning. Rather than viewing reasoning steps as a means to an ultimate goal, language agents trained with process-based feedback might internalize the goal of producing reasoning steps that would be rated highly by humans, and subordinate other goal
... (read more)

I think it’s pretty common and widely accepted that people support laws for their second-order, indirect consequences rather than their most obvious first-order consequences. Some examples:

  • Taxes on alcohol and tobacco are not mainly made for the purpose of raising money for the government, but in order to reduce alcohol and tobacco consumption.
  • During recessions, governments often increase spending, not necessarily because they think the spending targets are worthwhile on their own merits, but instead because they want to stimulate demand and improve the
... (read more)
1Radford Neal
I think these examples may not illustrate what you intend. They seem to me like examples of governments justifying policies based on second-order effects, while actually doing things for their first-order effects.

Taxing addictive substances like tobacco and alcohol makes sense from a government's perspective precisely because they have low elasticity of demand (ie, the taxes won't reduce consumption much). A special tax on something that people will readily stop consuming when the price rises won't raise much money. Also, taxing items with low elasticity of demand is more "economically efficient", in the technical sense that what is consumed doesn't change much, with the tax being close to a pure transfer of wealth. (See also gasoline taxes.)

Government spending is often corrupt, sometimes in the legal sense, and more often in the political sense of rewarding supporters for no good policy reason. This corruption is more easily justified when mumbo-jumbo economic beliefs say it's for the common good.

The first-order effect of mandatory education is that young people are confined to school buildings during the day, not that they learn anything inherently valuable. This seems like it's the primary intended effect. The idea that government schooling is better for economic growth than whatever non-mandatory activities kids/parents would otherwise choose seems dubious, though of course it's a good talking point when justifying the policy.

So I guess it depends on what you mean by "people support". These second-order justifications presumably appeal to some people, or they wouldn't be worthwhile propaganda. But I'm not convinced that they are the reasons more powerful people support these policies.

Curious why this is being downvoted. I think legislators should pass laws which have positive consequences. I explained the main reasons why I think this policy would have positive consequences. Then I speculated that popular beliefs on this issue might be biased by profit motives. I did not claim that this is a comprehensive analysis of the issue, or that there are no valid counterarguments. Which part of this is norm-violating? 

5Akash
I'd also be curious to know why (some) people downvoted this. Perhaps it's because you imply that some OpenAI folks were captured, and maybe some people think that that's unwarranted in this case? Sadly, the more-likely explanation (IMO) is that policy discussions can easily become tribal, even on LessWrong. I think LW still does better than most places at rewarding discourse that's thoughtful/thought-provoking and resisting tribal impulses, but I wouldn't be surprised if some people were doing something like "ah he is saying something Against AI Labs//Pro-regulation, and that is bad under my worldview, therefore downvote." (And I also think this happens the other way around as well, and I'm sure people who write things that are "pro AI labs//anti-regulation" are sometimes unfairly downvoted by people in the opposite tribe.)

If you were asked to write new copyright laws which apply only to AI, what laws would you write? Specifically, would you allow AI developers to freely train on copyrighted data, or would you give owners of copyrighted data the right to sell access to their data? 

Here are two non-comprehensive arguments in favor of restricting training on copyrighted outputs. Briefly, this policy would (a) restrict the supply of training data and therefore lengthen AI timelines, and (b) redistribute some of the profits of AI automation to workers whose labor will be di... (read more)

4[anonymous]
It seems fine to create a law with goal (a) in mind, but then we shouldn't call it copyright law, since it is not designed to protect intellectual property. Maybe this is common practice and people write laws pretending to target one thing while actually targeting something else all the time, in which case I would be okay with it. Otherwise, doing so would be dishonest and cause our legal system to be less legible. 

I'd suggest looking at this from a consequentialist perspective. 

One of your questions was, "Should it also be illegal for people to learn from copyrighted material?" This seems to imply that whether a policy is good for AIs depends on whether it would be good for humans. It's almost a Kantian perspective -- "What would happen if we universalized this principle?" But I don't think that's a good heuristic for AI policy. For just one example, I don't think AIs should be given constitutional rights, but humans clearly should. 

My other comment explains why I think the consequences of restricting training data would be positive. 

8gjm
I don't say that the same policies must necessarily apply to AIs and humans. But I do say that if they don't then there should be a reason why they treat AIs and humans differently.

This is great news. I particularly agree that legislators should pass new laws making it illegal to train AIs on copyrighted data without the consent of the copyright owner. This is beneficial from at least two perspectives:

  1. If AI is likely to automate most human labor, then we need to build systems for redistributing wealth from AI providers to the rest of the world. One previous proposal is the  robot tax, which would offset the harms of automation borne by manufacturing workers. Another popular idea is a Universal Basic Income. Following the same ph
... (read more)
7aogara
Curious why this is being downvoted. I think legislators should pass laws which have positive consequences. I explained the main reasons why I think this policy would have positive consequences. Then I speculated that popular beliefs on this issue might be biased by profit motives. I did not claim that this is a comprehensive analysis of the issue, or that there are no valid counterarguments. Which part of this is norm-violating? 

To summarize this comment, you've proposed that baseline monitoring systems could reduce risk to an acceptable level. Specifically, the monitoring system would need to correctly identify at least 5% of dangerous queries as dangerous (a 5% true positive rate) and avoid incorrectly flagging more than 1 in 1000 safe queries as dangerous ("0.1% FPR").

I think this level of reliability is possible today (e.g. Claude 2 would likely meet it), but it's possible that future developments would make defense more difficult. For example, new attack methods have shown LLMs ... (read more)

4ryan_greenblatt
[Mostly unimportant/nitpicking specific claim] For cases like bio, I don't think multimodal is that much of a threat because it should be possible to identify text outputs from the model as problematic without needing the context from the input in most cases. So, we can just have part of our monitoring ensemble ignore image inputs (for instance). Other image defenses for the monitor such as various compression strategies are plausible. (I think compression has historically been defeated given white box access to the compression algorithm, but we don't necessarily need to assume that. I'm also separately pretty optimistic about various randomization strategies for compression like things which I don't think have been tested in the literature.)

In general, there is a tradeoff related to removing information from the monitor: the more info you remove, the harder it is to do a good job, but removing info makes adversarial attacks harder.

Aside: Removing information for the monitor in the collusion case
We discuss this sort of monitoring tradeoff in a different case in our recent paper on AI control in the context of trying to prevent collusion (which is different than adversarial attacks but related). I think the similarity is somewhat interesting here. In section 3.4.2 we say:

That's cool, appreciate the prompt to discuss what is a relevant question. 

Separately for: "But adversarial attacks often succeed in 50% or 100% of attempts against various detection systems."

I expect that these numbers weren't against monitoring ensembles in the sense I described earlier and the red team had additional affordances beyond just understanding the high level description of the monitoring setup? E.g., the red team was able to iterate?

This is correct about the paper I cited, but others have achieved similar attack success rates against models like Claude which use an ensemble of defenses. AFAIK Claude does not ban use... (read more)

Yep, agreed on the individual points, not trying to offer a comprehensive assessment of the risks here. 

I specifically avoided claiming that adversarial robustness is the best altruistic option for a particular person. Instead, I'd like to establish that progress on adversarial robustness would have significant benefits, and therefore should be included in the set of research directions that "count" as useful AI safety research. 

Over the next few years, I expect AI safety funding and research will (and should) dramatically expand. Research directions that would not make the cut at a small organization with a dozen researchers should still be part of the... (read more)

2ryan_greenblatt
I agree with basically all of this and apologies for writing a comment which doesn't directly respond to your post (though it is a relevant part of my views on the topic).

I do think these arguments contain threads of a general argument that causing catastrophes is difficult under any threat model. Let me make just a few non-comprehensive points here: 

On cybersecurity, I'm not convinced that AI changes the offense defense balance. Attackers can use AI to find and exploit security vulnerabilities, but defenders can use it to fix them. 

On persuasion, first, rational agents can simply ignore cheap talk if they expect it not to help them. Humans are not always rational, but if you've ever tried to convince a dog or a b... (read more)

4ryan_greenblatt
I agree that "causing catastrophes is difficult" should reduce concerns with "rogue AIs causing sudden extinction (or merely killing very large numbers of people like >1 billion)". However, I think these sorts of considerations don't reduce AI takeover or other catastrophe due to rogue AI as much as you might think for a few reasons: * Escaped rogue AIs might be able to do many obviously bad actions over a long period autonomously. E.g., acquire money, create a cult, use this cult to build a bioweapons lab, and then actually develop bioweapons over long-ish period (e.g., 6 months) using >tens of thousands of queries to the AI. This looks quite different from the misuse threat model which required that omnicidal (or otherwise bad) humans possess the agency to make the right queries to the AI and solve the problems that the AI can't solve. For instance, humans have to ensure that queries were sufficiently subtle/jailbreaking to avoid detection via various other mechanisms. The rogue AI can train humans over a long period and all the agency/competence can come from the rogue AI. So, even if misuse is unlikely by humans, autonomous rogue AIs making weapons of mass destruction is perhaps more likely. * Escaped rogue AIs are unlike misuse in that even if we notice a clear and serious problem, we might have less we can do. E.g., the AIs might have already built hidden datacenters we can't find. Even if they don't and are just autonomously replicating on the internet, shutting down the internet is extremely costly and only postpones the problem. * AI takeover can route through mechanisms other than sudden catastrophe/extinction. E.g., allying with rogue states, creating a rogue AI run AI lab which builds even more powerful AI as fast as possible. (I'm generally somewhat skeptical of AIs trying to cause extinction for reasons discussed here, here, and here. Though causing huge amounts of damage (e.g. >1 billion) dead seems somewhat more plausible as a thing rogue AIs wo

Also, I'd love to see research that simulates the position of a company trying to monitor misuse, and allows for the full range of defenses that you proposed. There could be a dataset of 1 trillion queries containing 100 malicious queries. Perhaps each query is accompanied by a KYC ID. Their first line of defense would be robust refusal to cause harm, and the second line would be automated detection of adversarial attacks. The company could also have a budget which can be spent on "human monitoring," which would give them access to the ground truth label o... (read more)
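A toy version of this simulation might look like the sketch below; the scale, score distributions, review budget, and variable names are placeholder assumptions rather than a proposal about how real traffic or any real monitoring stack behaves.

```python
# Hypothetical sketch of the proposed misuse-monitoring benchmark: rare malicious
# queries hidden in a large stream, an imperfect automated monitor, and a fixed
# human-review budget spent on the most suspicious queries.
import numpy as np

rng = np.random.default_rng(0)
n_queries, n_malicious = 1_000_000, 100   # scaled down from 1 trillion for the sketch
review_budget = int(0.001 * n_queries)    # humans can review 0.1% of traffic

labels = np.zeros(n_queries, dtype=bool)
labels[rng.choice(n_queries, n_malicious, replace=False)] = True

# Placeholder monitor: malicious queries score somewhat higher on average.
scores = rng.normal(size=n_queries)
scores[labels] += 1.5

reviewed = np.argsort(scores)[-review_budget:]   # send the most suspicious to humans
caught = labels[reviewed].sum()
print(f"caught {caught}/{n_malicious} malicious queries "
      f"with a {review_budget}-query human review budget")
```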

Thanks for the detailed thoughts! I like the overall plan, especially using KYC, human monitoring, and a separate model for bio. I'd like to point out that this plan still uses automated monitoring systems, and to the extent these systems aren't adversarially robust, the plan will be more expensive and/or less effective. 

The only plans that wouldn't benefit from adversarial robustness are those without automated monitoring. For example, humans could be hired to manually monitor each and every query for attempted misuse. Let's consider the viability of... (read more)

6ryan_greenblatt
I agree that additional adversarial robustness (for both the policy and the monitor) helps considerably, but I think that it likely won't be necessary to have better adversarial robustness than the current state of the art (e.g. the robustness of Claude 2). This is due to the use of monitoring ensembles and preventing iteration against the API. My belief that this is likely to work is totally possible to falsify by demonstrating cases where we can't figure out sufficient countermeasures against a red team with realistic affordances. I made some "Edit Time Additions" (ETA) to my original comment above for clarity.

As discussed in my comment, I'm imagining being able to have a human review perhaps 0.01% to 1% of queries depending on the use case and other factors. So, we'd review the most suspicious seeming (according to our overall combined monitoring score) 0.01% to 1% of queries. Equivalently, (assuming actual misuse is rare) we'd be targeting a False Positive Rate (FPR) of 0.01% or 1% (on the ROC curve).

I think the claim: seems wrong or at least seems to be answering the wrong question. Let's suppose that we can afford to review the most suspicious seeming 0.1% of queries for a bio specific API endpoint. Equivalently, we can afford 0.1% FPR for our automated monitoring ensemble. Now, suppose (for simplicity) that catastrophic bio misuse requires 100 queries to the model and we're happy as long as we catch the user and report them to the FBI prior to them finishing all 100 queries. Then, if our monitoring ensemble at 0.1% FPR catches a given bio misuse query with just 5% probability (for a competent red team), bio misuse is 99.4% likely to be caught prior to succeeding. So, we don't actually need very high reliability on the ensemble. (Though we might target more conservative bars than 99.4% in practice.)

A reasonable objection here is that maybe we'll have a large number of people (e.g., >200) trying to create bioweapons using the API and thus one w
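As a quick sanity check of the 99.4% figure above, using the numbers from that comment:

```python
# If each of 100 misuse queries is independently flagged with probability 0.05,
# how likely is at least one catch before the attacker finishes?
p_catch_per_query = 0.05
n_queries = 100
p_caught_before_done = 1 - (1 - p_catch_per_query) ** n_queries
print(f"{p_caught_before_done:.1%}")  # ~99.4%
```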

Unfortunately I don't think academia will handle this by default. The current field of machine unlearning focuses on a narrow threat model where the goal is to eliminate the impact of individual training datapoints on the trained model. Here's the NeurIPS 2023 Machine Unlearning Challenge task:

The challenge centers on the scenario in which an age predictor is built from face image data and, after training, a certain number of images must be forgotten to protect the privacy or rights of the individuals concerned.

But if hazardous knowledge can be pinpointed ... (read more)

7Bogdan Ionut Cirstea
+1 Though, for the threat model of 'hazardous knowledge that can be synthesized from many datapoints which are individually innocuous', this could still be a win if you remove the 'hazardous knowledge [that] can be pinpointed to individual training datapoints' and e.g. this forces the model to perform more explicit reasoning through e.g. CoT, which could be easier to monitor (also see these theoretical papers on the need for CoT for increased expressivity/certain types of problems).
7scasper
+1 Although the NeurIPS challenge and prior ML lit on forgetting and influence functions seem worth keeping on the radar because they're still closely-related to challenges here. 

This is a comment from Andy Zou, who led the RepE paper but doesn’t have a LW account:

“Yea I think it's fair to say probes is a technique under rep reading which is under RepE (https://www.ai-transparency.org/). Though I did want to mention, in many settings, LAT is performing unsupervised learning with PCA and does not use any labels. And we find regular linear probing often does not generalize well and is ineffective for (causal) model control (e.g., details in section 5). So equating LAT to regular probing might be an oversimplification. How to best eli... (read more)

5Fabien Roger
I agree that not using labels is interesting from a data generation perspective, but I expect this to be useful mostly if you have clean pairs of concepts for which it is hard to get labels - and I think this will not be the case for takeover attempts datasets. About the performance of LAT: for monitoring, we mostly care about correlation - so LAT is worse IID, and it's unclear if LAT is better OOD. If causality leads to better generalization properties, then LAT is dominated by mean difference probing (see the screenshot of Zou's paper below), which is just regular probing with high enough L2 regularization (as shown in the first Appendix of this post).
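For reference, a minimal sketch of mean difference probing as described above, with synthetic activations standing in for real model activations labeled for the concept of interest:

```python
# Mean difference probing: the probe direction is simply the difference between
# the mean activation of positive examples and the mean activation of negatives.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 128
concept_direction = rng.normal(size=d)

# Synthetic activations: positives and negatives shifted along a concept direction.
acts_pos = rng.normal(size=(n, d)) + 0.5 * concept_direction
acts_neg = rng.normal(size=(n, d)) - 0.5 * concept_direction

direction = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
direction /= np.linalg.norm(direction)

# Monitor by projecting onto the direction and thresholding.
scores_pos = acts_pos[:10] @ direction
scores_neg = acts_neg[:10] @ direction
print(scores_pos.mean(), scores_neg.mean())  # positives project much higher
```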

Nice, that makes sense. I agree that RepE / LAT might not be helpful as terminology. “Unsupervised probing” is more straightforward and descriptive.

3ryan_greenblatt
(Note that in this work, we're just doing supervised probing though we do use models to generate some of the training data.)

What's the relationship between this method and representation engineering? They seem quite similar, though maybe I'm missing something. You train a linear probe on a model's activations at a particular layer in order to distinguish between normal forward passes and catastrophic ones where the model provides advice for theft. 

Representation engineering asks models to generate both positive and negative examples of a particular kind of behavior. For example, the model would generate outputs with and without theft, or with and without general power-seek... (read more)

Probes fall within the representation engineering monitoring framework.

LAT (the specific technique they use to train probes in the RePE paper) is just regular probe training, but with a specific kind of training dataset ((positive, negative) pairs) and a slightly more fancy loss. It might work better in practice, but just because of better inductive biases, not because something fundamentally different is going on (so arguments against coup probes mostly apply if LAT is used to train them). It also makes creating a dataset slightly more annoying - especial... (read more)

When considering whether deceptive alignment would lead to catastrophe, I think it's also important to note that deceptively aligned AIs could pursue misaligned goals in sub-catastrophic ways. 

Suppose GPT-5 terminally values paperclips. It might try to topple humanity, but there's a reasonable chance it would fail. Instead, it could pursue the simpler strategies of suggesting users purchase more paperclips, or escaping the lab and lending its abilities to human-run companies that build paperclips. These strategies would offer a higher probability of a... (read more)

Very nice, these arguments seem reasonable. I'd like to make a related point about how we might address deceptive alignment which makes me substantially more optimistic about the problem. (I've been meaning to write a full post on this, but this was a good impetus to make the case concisely.)

Conceptual interpretability in the vein of Collin Burns, Alex Turner, and Representation Engineering seems surprisingly close to allowing us to understand a model's internal beliefs and detect deceptive alignment. Collin Burns's work was very exciting to at least some ... (read more)

#5 appears to be evidence for the hypothesis that, because pretrained foundation models understand human values before they become goal-directed, they're more likely to optimize for human values and less likely to be deceptively aligned.

Conceptual argument for the hypothesis here: https://forum.effectivealtruism.org/posts/4MTwLjzPeaNyXomnx/deceptive-alignment-is-less-than-1-likely-by-default

Kevin Esvelt explicitly calls for not releasing future model weights. 

Would sharing future model weights give everyone an amoral biotech-expert tutor? Yes. 

Therefore, let’s not.

3[anonymous]
To be clear, Kevin Esvelt is the author of the "Dual-use biotechnology" paper, which the policy paper cites, but he is not the author of the policy paper.

Nuclear Threat Initiative has a wonderfully detailed report on AI biorisk, in which they more or less recommend that AI models which pose biorisks should not be open sourced:

Access controls for AI models. A promising approach for many types of models is the use of APIs that allow users to provide inputs and receive outputs without access to the underlying model. Maintaining control of a model ensures that built-in technical safeguards are not removed and provides opportunities for ensuring user legitimacy and detecting any potentially malicious or accidental misuse by users.

More from the NTI report:

A few experts believe that LLMs could already or soon will be able to generate ideas for simple variants of existing pathogens that could be more harmful than those that occur naturally, drawing on published research and other sources. Some experts also believe that LLMs will soon be able to access more specialized, open-source AI biodesign tools and successfully use them to generate a wide range of potential biological designs. In this way, the biosecurity implications of LLMs are linked with the capabilities of AI biodesign tools.

5% was one of several different estimates he'd heard from virologists.

Thanks, this is helpful. And I agree there's a disanalogy between the 1918 hypothetical and the source. 

it's not clear we want a bunch of effort going into getting a really good estimate, since (a) if it turns out the probability is high then publicizing that fact likely means increasing the chance we get one and (b) building general knowledge on how to estimate the pandemic potential of viruses seems also likely net negative.

This seems like it might be overly cautious. Bioterrorism... (read more)

It sounds like it was a hypothetical estimate, not a best guess. From the transcript:

if we suppose that the 1918 strain has only a 5% chance of actually causing a pandemic if it were to infect a few people today. And let’s assume...

Here's another source which calculates that the annual probability of more than 100M influenza deaths is 0.01%, or that we should expect one such pandemic every 10,000 years. This seems to be fitted on historical data which does not include deliberate bioterrorism, so we should revise that estimate upwards, but I'm not sure the ... (read more)

2jefftk
Thanks for checking the transcript! I don't know how seriously you want to take this but in conversation (in person) he said 5% was one of several different estimates he'd heard from virologists. This is a tricky area because it's not clear we want a bunch of effort going into getting a really good estimate, since (a) if it turns out the probability is high then publicizing that fact likely means increasing the chance we get one and (b) building general knowledge on how to estimate the pandemic potential of viruses seems also likely net negative. I think maybe we are talking about estimating different things? The 5% estimate was how likely you are to get a 1918 flu pandemic conditional on release.

The most recent SecureBio paper provides another policy option which I find more reasonable. AI developers would be held strictly liable for any catastrophes involving their AI systems, where catastrophes could be e.g. hundreds of lives lost or $100M+ in economic damages. They'd also be required to obtain insurance for that risk. 

If the risks are genuinely high, then insurance will be expensive, and companies may choose to take precautions such as keeping models closed source in order to lower their insurance costs. On the other hand, if risks are dem... (read more)

Would you support a similar liability structure for authors who choose to publish a book? If not, why not?

I think it's quite possible that open source LLMs above the capability of GPT-4 will be banned within the next two years on the grounds of biorisk. 

The White House Executive Order requests a government report on the costs and benefits of open source frontier models and recommended policy actions. It also requires companies to report on the steps they take to secure model weights. These are the kinds of actions the government would take if they were concerned about open source models and thinking about banning them.

This seems like a foreseeable consequence of many of the papers above, and perhaps the explicit goal.

3aogara
Nuclear Threat Initiative has a wonderfully detailed report on AI biorisk, in which they more or less recommend that AI models which pose biorisks should not be open sourced:

As an addition -- Anthropic's RSP already has GPT-4 level models locked up behind safety level 2.

Given that they explicitly want their RSPs to be a model for laws and regulations, I'd be only mildly surprised if we got laws banning open source even at GPT-4 level. I think many people are actually shooting for this.

Thank you for writing this up. I agree that there's little evidence that today's language models are more useful than the internet in helping someone build a bioweapon. On the other hand, future language models are quite likely to be more useful than existing resources in providing instructions for building a bioweapon. 

As an example of why LLMs are more helpful than the internet, look at coding. If you want to build a custom webapp, you could spend hours learning about it online. But it's much easier to ask ChatGPT to do it for you. 

Therefore, i... (read more)

Therefore, if you want to argue against the conclusion that we should eventually ban open source LLMs on the grounds of biorisk, you should not rely on the poor capabilities of current models as your key premise.

Just to be clear, the above is not what I would write if I were primarily trying to argue against banning future open source LLMs for this reason. It is (more) meant to be my critique of the state of the argument -- that people are basically just not providing good evidence for banning them, are confused about what they are saying, that they are pointing out things that would be true in worlds where open source LLMs are perfectly innocuous, etc, etc.

And from a new NTI report: “Furthermore, current LLMs are unlikely to generate toxin or pathogen designs that are not already described in the public literature, and it is likely they will only be able to do this in the future by incorporating more specialized AI biodesign tools.”

https://www.nti.org/wp-content/uploads/2023/10/NTIBIO_AI_FINAL.pdf

I'm more pessimistic about being able to restrict BDTs than general LLMs, but I also think this would be very good.

Why do you think so? LLMs seem far more useful to a far wider group of people than BDTs, so I would expect it to be easier to ban an application-specific technology than a general one. The White House Executive Order requires mandatory reporting of AI trained on biological data at a lower FLOP threshold than for any other kind of data, meaning they're concerned that AI + Bio models are particularly dangerous. 

Restricting something that biolog... (read more)
