Process supervision seems like a plausible o1 training approach but I think it would conflict with this:
...We believe that a hidden chain of thought presents a unique opportunity for monitoring models. Assuming it is faithful and legible, the hidden chain of thought allows us to "read the mind" of the model and understand its thought process. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought.
It can be both, of course. Start with process supervision but combine it with... something else. It's hard to learn how to reason from scratch, but it's also clearly not doing pure strict imitation learning, because the transcripts & summaries are just way too weird to be any kind of straightforward imitation learning of expert transcripts (or even ones collected from users or the wild).
This is my impression too. See e.g. this recent paper from Google, where LLMs critique and revise their own outputs to improve performance in math and coding.
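For concreteness, that critique-and-revise loop looks roughly like the sketch below. This is a minimal illustration, not the Google paper's actual method; `llm` is a stand-in for whatever chat-model API you have.

```python
def llm(prompt: str) -> str:
    """Placeholder for a call to a language model API."""
    raise NotImplementedError

def critique_and_revise(problem: str, n_rounds: int = 2) -> str:
    # Initial attempt, then alternating critique and revision rounds.
    answer = llm(f"Solve the following problem:\n{problem}")
    for _ in range(n_rounds):
        critique = llm(
            f"Problem:\n{problem}\n\nProposed solution:\n{answer}\n\n"
            "List any errors or gaps in this solution."
        )
        answer = llm(
            f"Problem:\n{problem}\n\nProposed solution:\n{answer}\n\n"
            f"Critique:\n{critique}\n\nWrite an improved solution."
        )
    return answer
```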
Agreed, sloppy phrasing on my part. The letter clearly states some of Anthropic's key views, but doesn't discuss other important parts of their worldview. Overall this is much better than some of their previous communications and the OpenAI letter, so I think it deserves some praise, but your caveat is also important.
Really happy to see the Anthropic letter. It clearly states their key views on AI risk and the potential benefits of SB 1047. Their concerns seem fair to me: overeager enforcement of the law could be counterproductive. While I endorse the bill on the whole and wish they would too (and I think their lack of support for the bill is likely partially influenced by their conflicts of interest), this seems like a thoughtful and helpful contribution to the discussion.
It's hard for me to reconcile "we take catastrophic risks seriously", "we believe they could occur within 1-3 years", and "we don't believe in pre-harm enforcement or empowering an FMD to give the government more capacity to understand what's going on."
It's also notable that their letter does not mention misalignment risks (and instead only points to dangerous cyber or bio capabilities).
That said, I do like this section a lot:
...Catastrophic risks are important to address. AI obviously raises a wide range of issues, but in our assessment catastrophic risks ar
I think there's a decent case that SB 1047 would improve Anthropic's business prospects, so I'm not sure this narrative makes sense. On one hand, SB 1047 might make it less profitable to run an AGI company, which is bad for Anthropic's business plan. But Anthropic is perhaps the best positioned of all AGI companies to comply with the requirements of SB 1047, and might benefit significantly from their competitors being hampered by the law.
The good faith interpretation of Anthropic's argument would be that the new agency created by the bill might be ve...
My understanding is that LLCs can be legally owned and operated without any individual human being involved: https://journals.library.wustl.edu/lawreview/article/3143/galley/19976/view/
So I’m guessing an autonomous AI agent could own and operate an LLC, and use that company to purchase cloud compute and run itself, without breaking any laws.
Maybe if the model escaped from the possession of a lab, there would be other legal remedies available.
Of course, cloud providers could choose not to rent to an LLC run by an AI. This seems particularly likely if the go...
Very cool, thanks! This paper focuses on building a DS Agent, but I’d be interested to see a version of this paper that focuses on building a benchmark. It could evaluate several existing agent architectures, benchmark them against human performance, and leave significant room for improvement by future models.
I want to make sure we get this right, and I'm happy to change the article if we misrepresented the quote. I do think the current version is accurate, though perhaps it could be better. Let me explain how I read the quote, and then suggest possible edits, and you can tell me if they would be any better.
Here is the full Time quote, including the part we quoted (emphasis mine):
...But, many of the companies involved in the development of AI have, at least in public, struck a cooperative tone when discussing potential regulation. Executives from the newer c
More discussion of this here. I'm really not sure what happened, and would love to see more reporting on it.
An interesting question here is "Which forms of AI for epistemics will be naturally supplied by the market, and which will be neglected by default?" In a weak sense, you could say that OpenAI is in the business of epistemics, in that its customers value accuracy and hate hallucinations. Perhaps Perplexity is a better example, as they cite sources in all of their responses. When embarking on an altruistic project here, it's important to pick an angle where you could outperform any competition and offer the best available product.
Consensus is a startup...
I'm specifically excited about finding linear directions via unsupervised methods on contrast pairs. This is different from normal probing, which finds those directions via supervised training on human labels, and therefore might fail in domains where we don't have reliable human labels.
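As a rough sketch of what I mean, in the spirit of Burns et al.'s CCS (simplified: it assumes you've already extracted activations for each half of the contrast pairs, and it skips their normalization step):

```python
import torch

def train_ccs_probe(acts_pos, acts_neg, n_steps=1000, lr=1e-3):
    """acts_pos/acts_neg: (n_pairs, d_model) activations for the two halves
    of each contrast pair, e.g. "X. True" vs. "X. False". No labels are used."""
    d = acts_pos.shape[1]
    w = torch.randn(d, requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([w, b], lr=lr)
    for _ in range(n_steps):
        p_pos = torch.sigmoid(acts_pos @ w + b)
        p_neg = torch.sigmoid(acts_neg @ w + b)
        consistency = ((p_pos - (1 - p_neg)) ** 2).mean()   # the pair's probabilities should sum to 1
        confidence = (torch.min(p_pos, p_neg) ** 2).mean()  # discourage the degenerate p = 0.5 solution
        loss = consistency + confidence
        opt.zero_grad(); loss.backward(); opt.step()
    return w.detach(), b.detach()
```

The direction `w` is found purely from the logical structure of the pairs, which is why this can work in domains without reliable human labels.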
But this is also only a small portion of work known as "activation engineering." I know I posted this comment in response to a general question about the theory of change for activation engineering, so apologies if I'm not clearly distinguishing between different kind...
Here's one hope for the agenda. I think this work can be a proper continuation of Collin Burns's aim to make empirical progress on the average case version of the ELK problem.
tl;dr: Unsupervised methods on contrast pairs can identify linear directions in a model's activation space that might represent the model's beliefs. From this set of candidates, we can further narrow down the possibilities with other methods. We can measure whether this is tracking truth with a weak-to-strong generalization setup. I'm not super confident in this take; it's not m...
Another important obligation set by the law is that developers must:
(3) Refrain from initiating the commercial, public, or widespread use of a covered model if there remains an unreasonable risk that an individual may be able to use the hazardous capabilities of the model, or a derivative model based on it, to cause a critical harm.
This sounds like common sense, but of course there's a lot riding on the interpretation of "unreasonable."
Two quick notes here.
I think it’s pretty common and widely accepted that people support laws for their second-order, indirect consequences rather than their most obvious first-order consequences. Some examples:
Curious why this is being downvoted. I think legislators should pass laws which have positive consequences. I explained the main reasons why I think this policy would have positive consequences. Then I speculated that popular beliefs on this issue might be biased by profit motives. I did not claim that this is a comprehensive analysis of the issue, or that there are no valid counterarguments. Which part of this is norm-violating?
If you were asked to write new copyright laws which apply only to AI, what laws would you write? Specifically, would you allow AI developers to freely train on copyrighted data, or would you give owners of copyrighted data the right to sell access to their data?
Here are two non-comprehensive arguments in favor of restricting training on copyrighted outputs. Briefly, this policy would (a) restrict the supply of training data and therefore lengthen AI timelines, and (b) redistribute some of the profits of AI automation to workers whose labor will be di...
I'd suggest looking at this from a consequentialist perspective.
One of your questions was, "Should it also be illegal for people to learn from copyrighted material?" This seems to imply that whether a policy is good for AIs depends on whether it would be good for humans. It's almost a Kantian perspective -- "What would happen if we universalized this principle?" But I don't think that's a good heuristic for AI policy. For just one example, I don't think AIs should be given constitutional rights, but humans clearly should.
My other comment explains why I think the consequences of restricting training data would be positive.
This is great news. I particularly agree that legislators should pass new laws making it illegal to train AIs on copyrighted data without the consent of the copyright owner. This is beneficial from at least two perspectives:
To summarize this comment, you've proposed that baseline monitoring systems could reduce risk to an acceptable level. Specifically, the monitoring system would need to correctly flag at least 5% of dangerous queries as dangerous (a 5% true positive rate) and avoid incorrectly flagging more than 1 in 1000 safe queries as dangerous (a 0.1% false positive rate).
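To make the arithmetic concrete (assuming, simplistically, that each dangerous query is flagged independently):

```python
tpr, fpr = 0.05, 0.001

# If misuse requires many dangerous queries, per-query detection compounds:
for k in (10, 50, 100):
    p_caught = 1 - (1 - tpr) ** k
    print(f"P(caught at least once in {k} queries) = {p_caught:.1%}")
# 10 -> 40.1%, 50 -> 92.3%, 100 -> 99.4%

# Cost side: expected false alarms needing human review, per million safe queries.
print(f"False flags per 1M safe queries: {1_000_000 * fpr:,.0f}")  # 1,000
```

The independence assumption is doing real work here; a red team that finds one reliable jailbreak can reuse it across queries.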
I think this level of reliability is possible today (e.g. Claude 2 would likely meet it), but it's possible that future developments would make defense more difficult. For example, new attack methods have shown LLMs ...
Separately, regarding "But adversarial attacks often succeed in 50% or 100% of attempts against various detection systems":
I expect that these numbers weren't against monitoring ensembles in the sense I described earlier, and that the red team had additional affordances beyond just understanding the high-level description of the monitoring setup? E.g., the red team was able to iterate?
This is correct about the paper I cited, but others have achieved similar attack success rates against models like Claude which use an ensemble of defenses. AFAIK Claude does not ban use...
I specifically avoided claiming that adversarial robustness is the best altruistic option for a particular person. Instead, I'd like to establish that progress on adversarial robustness would have significant benefits, and therefore should be included in the set of research directions that "count" as useful AI safety research.
Over the next few years, I expect AI safety funding and research will (and should) dramatically expand. Research directions that would not make the cut at a small organization with a dozen researchers should still be part of the...
I do think these arguments contain threads of a general argument that causing catastrophes is difficult under any threat model. Let me make just a few non-comprehensive points here:
On cybersecurity, I'm not convinced that AI changes the offense-defense balance. Attackers can use AI to find and exploit security vulnerabilities, but defenders can use it to find and fix them.
On persuasion, first, rational agents can simply ignore cheap talk if they expect it not to help them. Humans are not always rational, but if you've ever tried to convince a dog or a b...
Also, I'd love to see research that simulates the position of a company trying to monitor misuse, and allows for the full range of defenses that you proposed. There could be a dataset of 1 trillion queries containing 100 malicious queries. Perhaps each query is accompanied by a KYC ID. Their first line of defense would be robust refusal to cause harm, and the second line would be automated detection of adversarial attacks. The company could also have a budget which can be spent on "human monitoring," which would give them access to the ground truth label o...
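A scaled-down toy version of that benchmark might look like the sketch below; every number and the detector itself are placeholders I made up, not a proposal for the real parameters.

```python
import random

random.seed(0)
N_QUERIES, N_MALICIOUS, HUMAN_BUDGET = 1_000_000, 100, 2_000

# Each query carries a KYC ID; a tiny fraction are malicious.
queries = [{"kyc_id": random.randrange(10_000), "malicious": False}
           for _ in range(N_QUERIES)]
for q in random.sample(queries, N_MALICIOUS):
    q["malicious"] = True

def detector_score(q):
    # Placeholder automated monitor: higher = more suspicious, with
    # overlapping score distributions for benign and malicious queries.
    lo = 0.4 if q["malicious"] else 0.0
    return lo + random.random() * 0.6

# Spend the human-monitoring budget (a ground-truth oracle) on the
# highest-scoring queries from the automated detector.
flagged = sorted(queries, key=detector_score, reverse=True)[:HUMAN_BUDGET]
caught = sum(q["malicious"] for q in flagged)
print(f"Caught {caught}/{N_MALICIOUS} malicious queries with {HUMAN_BUDGET} reviews")
```

The interesting research questions are then how the catch rate degrades as the detector gets less robust, and how much human-review budget compensates.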
Thanks for the detailed thoughts! I like the overall plan, especially using KYC, human monitoring, and a separate model for bio. I'd like to point out that this plan still uses automated monitoring systems, and to the extent these systems aren't adversarially robust, the plan will be more expensive and/or less effective.
The only plans that wouldn't benefit from adversarial robustness are those without automated monitoring. For example, humans could be hired to manually monitor each and every query for attempted misuse. Let's consider the viability of...
Unfortunately I don't think academia will handle this by default. The current field of machine unlearning focuses on a narrow threat model where the goal is to eliminate the impact of individual training datapoints on the trained model. Here's the NeurIPS 2023 Machine Unlearning Challenge task:
The challenge centers on the scenario in which an age predictor is built from face image data and, after training, a certain number of images must be forgotten to protect the privacy or rights of the individuals concerned.
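For concreteness, a common baseline for this datapoint-level threat model looks something like the sketch below (my illustration, not the official challenge code): gradient ascent on the forget set combined with continued training on the retain set.

```python
from itertools import cycle
import torch

def unlearn(model, forget_loader, retain_loader, loss_fn, lr=1e-4, steps=100):
    """Push loss up on the forget set while holding it down on the retain set."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    forget_iter, retain_iter = cycle(forget_loader), cycle(retain_loader)
    for _ in range(steps):
        xf, yf = next(forget_iter)
        xr, yr = next(retain_iter)
        # Gradient ascent on forget-set loss, descent on retain-set loss.
        loss = -loss_fn(model(xf), yf) + loss_fn(model(xr), yr)
        opt.zero_grad(); loss.backward(); opt.step()
    return model
```

Note how tied this is to individual datapoints; it doesn't obviously extend to removing a broad capability like virology knowledge.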
But if hazardous knowledge can be pinpointed ...
This is a comment from Andy Zou, who led the RepE paper but doesn’t have an LW account:
“Yea I think it's fair to say probes is a technique under rep reading which is under RepE (https://www.ai-transparency.org/). Though I did want to mention, in many settings, LAT is performing unsupervised learning with PCA and does not use any labels. And we find regular linear probing often does not generalize well and is ineffective for (causal) model control (e.g., details in section 5). So equating LAT to regular probing might be an oversimplification. How to best eli...
What's the relationship between this method and representation engineering? They seem quite similar, though maybe I'm missing something. You train a linear probe on a model's activations at a particular layer in order to distinguish between normal forward passes and catastrophic ones where the model provides advice for theft.
Representation engineering asks models to generate both positive and negative examples of a particular kind of behavior. For example, the model would generate outputs with and without theft, or with and without general power-seek...
Probes fall within the representation engineering monitoring framework.
LAT (the specific technique they use to train probes in the RepE paper) is just regular probe training, but with a specific kind of training dataset ((positive, negative) pairs) and a slightly fancier loss. It might work better in practice, but just because of better inductive biases, not because something fundamentally different is going on (so arguments against coup probes mostly apply if LAT is used to train them). It also makes creating a dataset slightly more annoying - especial...
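As I understand the RepE paper, the LAT reading vector is essentially the top principal component of paired activation differences. A simplified sketch (layer/token selection and normalization details omitted):

```python
import numpy as np

def lat_direction(acts_pos: np.ndarray, acts_neg: np.ndarray) -> np.ndarray:
    """acts_*: (n_pairs, d_model) activations at a chosen layer/token."""
    diffs = acts_pos - acts_neg
    diffs = diffs - diffs.mean(axis=0)   # center before PCA
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[0]                         # top principal component

def probe_score(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    # Project activations onto the direction to get a scalar score per example.
    return acts @ direction
```

So the fancier loss amounts to a different way of picking the linear direction; the resulting probe is still just a projection.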
When considering whether deceptive alignment would lead to catastrophe, I think it's also important to note that deceptively aligned AIs could pursue misaligned goals in sub-catastrophic ways.
Suppose GPT-5 terminally values paperclips. It might try to topple humanity, but there's a reasonable chance it would fail. Instead, it could pursue the simpler strategies of suggesting users purchase more paperclips, or escaping the lab and lending its abilities to human-run companies that build paperclips. These strategies would offer a higher probability of a...
Very nice, these arguments seem reasonable. I'd like to make a related point about how we might address deceptive alignment which makes me substantially more optimistic about the problem. (I've been meaning to write a full post on this, but this was a good impetus to make the case concisely.)
Conceptual interpretability in the vein of Collin Burns, Alex Turner, and Representation Engineering seems surprisingly close to allowing us to understand a model's internal beliefs and detect deceptive alignment. Collin Burns's work was very exciting to at least some ...
#5 appears to be evidence for the hypothesis that, because pretrained foundation models understand human values before they become goal-directed, they’re more likely to optimize for human values and less likely to be deceptively aligned.
Conceptual argument for the hypothesis here: https://forum.effectivealtruism.org/posts/4MTwLjzPeaNyXomnx/deceptive-alignment-is-less-than-1-likely-by-default
Kevin Esvelt explicitly calls for not releasing future model weights.
Would sharing future model weights give everyone an amoral biotech-expert tutor? Yes.
Therefore, let’s not.
Nuclear Threat Initiative has a wonderfully detailed report on AI biorisk, in which they more or less recommend that AI models which pose biorisks should not be open sourced:
Access controls for AI models. A promising approach for many types of models is the use of APIs that allow users to provide inputs and receive outputs without access to the underlying model. Maintaining control of a model ensures that built-in technical safeguards are not removed and provides opportunities for ensuring user legitimacy and detecting any potentially malicious or accidental misuse by users.
More from the NTI report:
A few experts believe that LLMs could already or soon will be able to generate ideas for simple variants of existing pathogens that could be more harmful than those that occur naturally, drawing on published research and other sources. Some experts also believe that LLMs will soon be able to access more specialized, open-source AI biodesign tools and successfully use them to generate a wide range of potential biological designs. In this way, the biosecurity implications of LLMs are linked with the capabilities of AI biodesign tools.
5% was one of several different estimates he'd heard from virologists.
Thanks, this is helpful. And I agree there's a disanalogy between the 1918 hypothetical and the source.
it's not clear we want a bunch of effort going into getting a really good estimate, since (a) if it turns out the probability is high then publicizing that fact likely means increasing the chance we get one and (b) building general knowledge on how to estimate the pandemic potential of viruses seems also likely net negative.
This seems like it might be overly cautious. Bioterrorism...
It sounds like it was a hypothetical estimate, not a best guess. From the transcript:
if we suppose that the 1918 strain has only a 5% chance of actually causing a pandemic if it were to infect a few people today. And let’s assume...
Here's another source which calculates that the annual probability of more than 100M influenza deaths is 0.01%, or that we should expect one such pandemic every 10,000 years. This seems to be fitted on historical data which does not include deliberate bioterrorism, so we should revise that estimate upwards, but I'm not sure the ...
The most recent SecureBio paper provides another policy option which I find more reasonable. AI developers would be held strictly liable for any catastrophes involving their AI systems, where catastrophes could be e.g. hundreds of lives lost or $100M+ in economic damages. They'd also be required to obtain insurance for that risk.
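Concretely, an actuarially fair premium scales with the assessed risk, which is what creates the incentive (illustrative numbers, not from the paper):

```python
def fair_premium(p_catastrophe: float, expected_damages: float,
                 loading: float = 0.3) -> float:
    """Expected payout plus an insurer's loading for overhead and uncertainty."""
    return p_catastrophe * expected_damages * (1 + loading)

for p in (1e-4, 1e-2):
    print(f"p = {p:.0e}: annual premium ~ ${fair_premium(p, 100e6):,.0f}")
# p = 1e-04: annual premium ~ $13,000
# p = 1e-02: annual premium ~ $1,300,000
```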
If the risks are genuinely high, then insurance will be expensive, and companies may choose to take precautions such as keeping models closed source in order to lower their insurance costs. On the other hand, if risks are dem...
I think it's quite possible that open source LLMs above the capability of GPT-4 will be banned within the next two years on the grounds of biorisk.
The White House Executive Order requests a government report on the costs and benefits of open source frontier models and recommended policy actions. It also requires companies to report on the steps they take to secure model weights. These are the kinds of actions the government would take if they were concerned about open source models and thinking about banning them.
This seems like a foreseeable consequence of many of the papers above, and perhaps the explicit goal.
As an addition -- Anthropic's RSP already has GPT-4-level models locked up behind safety level 2.
Given that they explicitly want their RSPs to be a model for laws and regulations, I'd be only mildly surprised if we got laws banning open source even at GPT-4 level. I think many people are actually shooting for this.
Thank you for writing this up. I agree that there's little evidence that today's language models are more useful than the internet in helping someone build a bioweapon. On the other hand, future language models are quite likely to be more useful than existing resources in providing instructions for building a bioweapon.
As an example of why LLMs are more helpful than the internet, look at coding. If you want to build a custom webapp, you could spend hours learning about it online. But it's much easier to ask ChatGPT to do it for you.
Therefore, if you want to argue against the conclusion that we should eventually ban open source LLMs on the grounds of biorisk, you should not rely on the poor capabilities of current models as your key premise.
Just to be clear, the above is not what I would write if I were primarily trying to argue against banning future open source LLMs for this reason. It is (more) meant to be my critique of the state of the argument -- that people are basically just not providing good evidence for banning them, are confused about what they are saying, and are pointing out things that would be true in worlds where open source LLMs are perfectly innocuous, etc.
And from a new NTI report: “Furthermore, current LLMs are unlikely to generate toxin or pathogen designs that are not already described in the public literature, and it is likely they will only be able to do this in the future by incorporating more specialized AI biodesign tools.”
https://www.nti.org/wp-content/uploads/2023/10/NTIBIO_AI_FINAL.pdf
I'm more pessimistic about being able to restrict BDTs than general LLMs, but I also think this would be very good.
Why do you think so? LLMs seem far more useful to a far wider group of people than BDTs, so I would expect it to be easier to ban an application-specific technology than a general one. The White House Executive Order requires mandatory reporting of AI trained on biological sequence data at a lower FLOP threshold than for any other kind of data (10^23 operations, vs. 10^26 in general), suggesting they're concerned that AI + Bio models are particularly dangerous.
Restricting something that biolog...
DeepSeek-R1 naturally learns to switch into other languages during CoT reasoning. When developers penalized this behavior, performance dropped. I think this suggests that the CoT contained hidden information that cannot be easily verbalized in another language, and provides evidence against the hope that reasoning CoT will be highly faithful by default.