Would you expect the applications to interpretability to hold for inputs radically outside the training distribution?
My naive intuition is that by taking derivatives you are only describing local behaviour.
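To spell out that intuition with a standard fact: a first-order, gradient-based description of a model is only a local approximation,
$$f(x + \delta) \approx f(x) + \nabla f(x)^{\top} \delta,$$
and this is only trustworthy for small $\|\delta\|$, i.e. near the inputs at which the derivatives were evaluated.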
(I am "shooting from the hip" epistemically)
A loss of this type of (very weak) interpretability would be quite unfortunate from a practical safety perspective.
This is bad, but perhaps there is a silver lining.
If internal communication within the scaffold appears to be in plain English, it will tempt humans to assume the meaning coincides precisely with the semantic content of the message.
If the chain of thought contains seemingly nonsensical content, it will be impossible to make this assumption.
I think that overall it's good on the margin for staff at companies risking human extinction to be sharing their perspectives on criticisms and moving towards having dialogue at all
No disagreement.
your implicit demand for Evan Hubinger to do more work here is marginally unhelpful
The community seems to be quite receptive to the opinion, so it doesn't seem unreasonable to voice an objection. If you're saying it is primarily the way I've written it that makes it unhelpful, that seems fair.
I originally felt that either question I asked would be reasonably easy to answer, if time was given to evaluating the potential for harm.
However, given that Hubinger might have to run any reply by Anthropic staff, I understand that it might be negative to demand further work. This is pretty obvious, but didn't occur to me earlier.
I will add: It's odd to me, Stephen, that this is your line for (what I read as) disgust at Anthropic staff espousing extremely convenient positions while doing things that seem to you to be causing massive harm.
Ultimately, the original quicktake was only justifying one facet of Anthropic's work, so that's all I've engaged with. It would seem less helpful to bring up my wider objections.
I wouldn't expect them or their employees to have high standards for public defenses of far less risky behavior
I don't expect them to have a high standard for defending Anthropic's behavior, but I do expect the LessWrong community to have a high standard for arguments.
Highly Expected Events Provide Little Information and The Value of PR Statements
Entropy for a discrete random variable is given by $H(X) = -\sum_{x} p(x) \log_2 p(x)$. This quantifies the amount of information that you gain on average by observing the value of the variable.
It is maximized when every possible outcome is equally likely. It gets smaller as the variable becomes more predictable and is zero when the "random" variable is 100% guaranteed to have a specific value.
You've learnt 1 bit of information when you learn that the outcome of a fair coin toss was heads. But you learn 0 bits when you learn the outcome was heads after tossing a coin with heads on both sides.
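For concreteness, plugging the two coins into the formula above:
$$H(\text{fair coin}) = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1 \text{ bit}, \qquad H(\text{two-headed coin}) = -\left(1 \cdot \log_2 1\right) = 0 \text{ bits}.$$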
On your desk is a sealed envelope that you've been told contains a transcript of a speech that President-Elect Trump gave on the campaign trail. You are told that it discusses the impact that his policies will have on the financial position of the average American.
How much additional information do you gain if I tell you that the statement says his policies will have a positive impact on the financial position of the average American?
The answer is very little. You know ahead of time that it is exceptionally unlikely for any politician to talk negatively about their own policies.
There is still plenty of information in the details Trump mentions: how exactly he plans to improve the economy.
Both Altman and Amodei have recently put out personal blog posts in which they present a vision of the future after AGI is safely developed.
How much additional information do you gain from learning that they present a positive view of this future?
I would argue simply learning that they're optimistic tells you almost zero useful information about what such a future looks like.
There is plenty of useful information, particularly in Amodei's essay, in how they justify this optimism and what topics they choose to discuss. But their optimism alone shouldn't be used as evidence to update your beliefs.
Edit:
Fixed pretty major terminology blunder.
(This observation is not original, and a similar idea appears in The Sequences.)
This explanation seems overly convenient.
When faced with evidence which might update your beliefs about Anthropic, you adopt a set of beliefs which, coincidentally, means you won't risk losing your job.
How much time have you spent analyzing the positive or negative impact of US intelligence efforts prior to concluding that merely using Claude for intelligence "seemed fine"?
What future events would make you re-evaluate your position and state that the partnership was a bad thing?
Example:
-- A pro-US despot rounds up and tortures to death tens of thousands of pro-union activists and their families. Claude was used to analyse social media and mobile data, building a list of people sympathetic to the union movement, which the US then gave to their ally.
EDIT: The first two sentences were overly confrontational, but I do think either question warrants an answer.
As a highly respected community member and prominent AI safety researcher, your stated beliefs and justifications will be influential to a wide range of people.
The lack of a robust, highly general paradigm for reasoning about AGI models is the current greatest technical problem, although it is not what most people are working on.
What architectural features of contemporary AI models will be shared with future models that pose an existential risk?
What behavioral patterns of contemporary AI models will be shared with future models that pose an existential risk?
Is there a useful and general mathematical/physical framework that describes how agentic, macroscopic systems process information and interact with the environment?
Does terminology adopted by AI Safety researchers like "scheming", "inner alignment" or "agent" carve nature at the joints?
I upvoted because I imagine more people reading this would nudge group norms in a slightly positive direction.
But being cynical:
My reply definitely missed that you were talking about tunnel densities beyond what has been historically seen.
I'm inclined to agree with your argument that there is a phase shift, but it seems to have less to do with the fact that there are tunnels, and more to do with the geography becoming less tunnel-like and more open.
I have a couple thoughts on your model that aren't direct refutations of anything you've said here:
I think a crucial factor missing from your analysis is the difficulty an attacker faces when trying to maneuver within the tunnel system.
In the Vietnam war and the ongoing Israel-Hamas war, the attacking forces appear to favor destroying the tunnels rather than exploiting them to maneuver. [1]
1. The layout of the tunnels is at least partially unknown to the attackers, which limits their ability to outflank the defenders. Yes, there may be paths that would allow the attacker to advance safely, but it may be difficult or impossible to reliably identify which routes those are.
2. While maps of the tunnels could be produced through modern subsurface mapping, the attackers must still contend with area-denial devices (e.g. land mines, IEDs or booby traps). The confined nature of the tunnel system makes traps substantially more efficient.
3. The previous two considerations impose a substantial psychological burden on attackers advancing through the tunnels, even if they don't encounter any resistance.
4. (Speculative)
Imagine a network so dense that in a typical 1km stretch of frontline, there are 100 separate tunnels passing beneath, such that you'd need at least 100 defensive chokepoints or else your line would have an exploitable hole.
The density and layout of the tunnels do not need to be constant throughout the network. The system of tunnels in regions the defender doesn't expect to hold may have hundreds of entrances and intersections, making it impossible for either side to defend effectively. But travelling deeper into the defender's territory requires passing through only a limited number of well-defended passageways. This favors the defenders using the peripheral, dense section of tunnel for hit-and-run tactics, rather than attempting to defend every passageway.
(My knowledge of subterranean warfare is based entirely on recreational reading.)
As a counterargument, the destruction of tunnels may be primarily due to the attacking force not intending to hold the territory permanently, so there is little reason to preserve defensive structures.
Worth emphasizing that cognitive work is more than just a parallel to physical work; it is literally work in the physical sense.
The reduction in entropy required to train a model means that there is a minimum amount of work required to do it.
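As a rough illustration of the connection (my framing, via Landauer's principle rather than anything specific to model training): erasing or deterministically setting $N$ bits of information at temperature $T$ dissipates at least
$$W_{\min} = N \, k_B T \ln 2,$$
so a training process that reduces the entropy of a model's weights by $N$ bits carries a corresponding minimum thermodynamic cost.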
I think this is a very important research direction, not merely as an avenue for communicating and understanding AI Safety concerns, but potentially as a framework for developing AI Safety techniques.
There is some minimum amount of cognitive work required to pose an existential threat; perhaps it is much higher than the amount of cognitive work required to perform economically useful tasks.