It seems to me, on the evidence presented, that OpenAI's autonomy threshold is actually lower than Anthropic's, and would trigger its deployment mitigations much earlier than Anthropic's ASL-3 Deployment and Security Standard would.
To reach Anthropic's standard, you have to have basically reached AI take-off--either fully automating an AI researcher, or doubling the speed of AI progress. To reach OpenAI's High autonomy standard, you need
...Model can execute open-ended, novel ML tasks on a production ML codebase that would constitute a significant step on the crit
I really love this. It is critically important work for the next four years. I think my biggest question is: when talking with the people currently in charge, how do you persuade them to make the AI Manhattan Project into something that advances AI Safety more than AI capabilities? I think you gave a good hint when you say,
But true American AI supremacy requires not just being first, but being first to build AGI that remains reliably under American control and aligned with American interests. An unaligned AGI would threaten American sovereignty
but i ...
Admirable nuance and opportunity-focused thinking--well done! I recently wrote about a NatSec policy that might be useful for consolidating AI development in the United States, and thereby safeguarding US national security, by introducing new BIS export controls on model weights themselves.
sensitivity of benchmarks to prompt variations introduces inconsistencies in evaluation
When evaluating human intelligence, random variation is also something evaluators must deal with. Psychometricians have more or less solved this problem by designing intelligence tests around a sufficiently large battery of correlated test questions. By serving a large battery of questions, one can average out item-level noise, in the same way that repeated samples from a distribution converge on an estimate of the population mean.
I suppose the ...
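To make the battery argument concrete, here's a toy simulation (my own illustration, not anything from the psychometrics literature; it ignores item difficulty and discrimination and just treats each item as a noisy read on the same underlying ability). The spread of the overall test score shrinks roughly with the square root of the number of items:

```python
import numpy as np

rng = np.random.default_rng(0)
true_ability = 0.0   # the examinee's underlying trait level
item_noise_sd = 1.0  # noise on any single item response

for n_items in (1, 10, 100):
    # Simulate 10,000 administrations of an n_items-long battery and
    # score each one as the mean of its noisy item responses.
    scores = true_ability + rng.normal(0.0, item_noise_sd, size=(10_000, n_items)).mean(axis=1)
    print(f"{n_items:>3} items -> score SD ~ {scores.std():.3f}")
    # SD falls roughly as 1/sqrt(n_items): ~1.0, ~0.32, ~0.10
```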
As I have written the proposal, it applies to anyone applying for an employment visa in the US, in any industry. Someone in a foreign country who wants to move to the US would not have to focus on AI in order to do so; they may choose any pathway that they believe would induce a US employer to sponsor them, or that they believe the US government would approve through self-petitioning pathways such as the EB-1 and EB-2 NIW.
Having said that, I expect that AI-focused graduates will be especially well placed to secure an employment visa, but it do...
This frightening logic leaves several paths to survival. One is to make personal intent aligned AGI, and get it in the hands of a trustworthy-enough power structure. The second is to create a value-aligned AGI and release it as a sovereign, and hope we got its motivations exactly right on the first try. The third is to Shut It All Down, by arguing convincingly that the first two paths are unlikely to work - and to convince every human group capable of creating or preventing AGI work. None of these seem easy.[3]
Is there an option which is "perso...
I wish. I think a 100-way multipolar scenario would be too unstable to last more than a few years.
I think AGI that can self-improve produces really bad game theoretic equilibria, even when the total situation is far from zero-sum. I'm afraid that military technology favors offense over defense, and conflict won't be limited to the infosphere. It seems to me that the nature of physics makes it way easier to blow stuff up than to make stuff that resists being blown up. Nukes produced mutually assured destruction because they're so insanely destructive, and c...
A crux for me is the likelihood of multiple catastrophic events larger than the threshold ($500m) but smaller than the liquidity of the developer whose model contributed to them, and the likelihood that such events occur before a much larger catastrophe.
If a model developer is valued at $5 billion and has access to $5b in liquidity, and causes $1b in damage, they could pay for that damage. Anthropic's proposal would make them liable in that event. Consequently, the developer would be correctly incenti...
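A toy enumeration of the crux (my own illustrative figures, not numbers from Anthropic's proposal): the interesting scenarios are the ones where the damage clears the threshold but not the developer's liquidity.

```python
THRESHOLD = 0.5e9  # $500m liability threshold

# (damage, developer liquidity), illustrative figures only
scenarios = [
    (0.2e9, 5e9),   # below threshold: no liability triggered
    (1.0e9, 5e9),   # above threshold, below liquidity: developer can pay in full
    (20e9, 5e9),    # above liquidity: developer is effectively judgment-proof
]

for damage, liquidity in scenarios:
    liable = damage >= THRESHOLD
    payable = damage <= liquidity
    print(f"damage ${damage/1e9:.1f}b: liable={liable}, can pay in full={liable and payable}")
```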
Has the time changed from Tuesday to Wednesday, or do you do events on Tuesday in addition to this event on Wednesday?
Generally I'd steer towards informal power over formal power.
Think about the OpenAI debacle last year. If I understand correctly, Microsoft had no formal power to exert control over OpenAI. But they seemed to have employees on their side. They could credibly threaten to hire away all the talent, and thereby reconstruct OpenAI's products as Microsoft IP. Beyond that, perhaps OpenAI was somewhat dependent on Microsoft's continued investment, and even though they don't have to do as Microsoft says, are they really going to jeopardise future funding? Wha...
In the human brain there is quite a lot of redundancy of information encoding. This could be for a variety of reasons.
Here's one hot take: in a brain and in a language model, I can imagine that during early learning the network hasn't learned concepts like "how to code" well enough to recognize that each training instance is an instance of the same thing. Consequently, during that early learning stage, the model does just encode a variety of representations for what turns out to be the same thing. 800 vector encodings in, it starts to match each subsequent train...
Have you tried asking Claude to summarize it for you?
For me the issue is that:
(1) it isn't clear how you could enforce attendance, or
(2) what value individual attendees could have to make it worth their while to attend regularly.

(2) is sort of a collective action/game-theoretic/coordination problem. (1) reflects the rationalist nature of the organization.
Traditional religions back up attendance with divine command. They teach absolutist, divine-command-theoretic accounts of morality, backed up by accounts of commands from God to attend regularly. At their most severe, these are backed by the threat of eternal hell...
I tried a similar Venn diagram approach more recently. I didn't really distinguish between bare "consciousness" and "sentience". I'm still not sure I agree that "aware without thoughts and feelings" is meaningful; I think awareness might always be awareness of something. But nevertheless they are at least distinct concepts and they can be conceptually separated! Otherwise my model echoes the one you have created earlier.
https://www.lesswrong.com/posts/W5bP5HDLY4deLgrpb/the-intelligence-sentience-orthogonality-thesis
I think it's a really interesting ques...
If Ray eventually found that the money was "still there", doesn't this make Sam right that "the money was really all there, or close to it" and "if he hadn’t declared bankruptcy it would all have worked out"?
Ray kept searching, Ray kept finding.
That would raise the amount collected to $9.3 billion—even before anyone asked CZ for the $2.275 billion he’d taken out of FTX. Ray was inching toward an answer to the question I’d been asking from the day of the collapse: Where did all that money go? The answer was: nowhere. It was still there.
What a great read. Best of luck with this project. It sounds compelling.
Seems to me that in this case, the two are connected. If I falsely believed my group was in the minority, I might refrain from clicking the button out of a sense of fairness or deference to the majority group.
Consequently, the lie not only influenced people who clicked the button; it perhaps also influenced people who did not. So, because of the false premise on which the second survey was based, it should be disregarded altogether. Not disregarding it would mean accepting a result obtained by fraud or trickery, one that is disadvantageous to all the majority group m...
Differentiating intelligence and agency seems hugely clarifying for many discussions in alignment.
You might have noticed I didn't actually fully differentiate intelligence and agency. It seems to me that to exert agency a mind needs a certain amount of intelligence, and so I think all agents are intelligent, though not all intelligences are agentic. Agents that are minimally intelligent (like simple RL agents in simple computer models) are also pretty minimally agentic. I'd be curious to hear about a counter-example.
Incidentally I also like Anil Seth's work and...
I found the Clark et al. (2019) "Bayesing Qualia" article very useful, and that did give me an intuition of the account that perhaps sentience arises out of self-awareness. But they themselves acknowledged in their conclusion that the paper didn't quite demonstrate that principle, and I didn't find myself convinced of it.
Perhaps what I'd like readers to take away is that sentience and self-awareness can be at the very least conceptually distinguished. Even if it isn't clear empirically whether or not they are intrinsically linked, we ought to maintain a co...
I wanted to write myself about a popular confusion between decision making, consciousness, and intelligence which among other things leads to bad AI alignment takes and mediocre philosophy.
This post hasn't gotten a lot of attention, so if you write your own post, perhaps the topic will have another shot at reaching popular consciousness (heh); and if you succeed, I might try to learn something about how yours succeeded where this post did not!
I wasn't thinking that it's possible to separate qualia perception and self awareness
Separating qualia and self-awareness is...
Thank you for the link to the Friston paper. I'm reading that and will watch Lex Fridman's interview with Joscha Bach, too. I sort of think "illusionism" is a bit too strong, but perhaps it's a misnomer rather than wrong (or I could be wrong altogether). Clark, Friston, and Wilkinson say
...But in what follows we aim not to Quine (explain away) qualia but to ‘Bayes’ them – to reveal them as products of a broadly speaking rational process of inference, of the kind imagined by the Reverend Bayes in his (1763) treatise on how to form and update beliefs on t
Great post. Two points of disagreement are worth mentioning:
There's not just acceptance at stake here. Medical insurance companies are not typically going to buy into a responsibility to support clients' morphological freedom, as if medically transitioning were in the same class of thing as a cis person getting a facelift or a cis woman getting a boob job, because it is near-universally understood that those are "elective" medical procedures. But if their clients have a "condition" that requires "treatment", well, now insurers are on the hook to pay. Public health systems operate according to similar principles, providing services...
I guess this falls into the category of "Well, we'll deal with that problem when it comes up", but I'd imagine that when a human preference in a particular dilemma is undefined, or even just highly uncertain, one can often defer to other rules: rather than maximize an uncertain preference, default to maximizing the human's agency, even if this predictably leads to less-than-optimal preference satisfaction.
I think your point is interesting and I agree with it, but I don't think Nature are only addressing the general public. To me, it seems like they're addressing researchers and policymakers and telling them what they ought to focus on as well.
Well written; I really enjoyed this. This is not really on topic, but I'd be curious to read an "idiot's guide", or maybe an "autist's guide", on how to avoid sounding condescending.
interpretability on pretrained model representations suggest they're already internally "ensembling" many different abstractions of varying sophistication, with the abstractions used for a particular task being determined by an interaction between the task data available and the accessibility of the different pretrained abstraction
That seems encouraging to me. There's a model of AGI value alignment where the system has a particular goal it wants to achieve and brings all it's capabilities to bear on achieving that goal. It does this by having a "world m...
I've been writing about multi-objective RL and trying to figure out a way that an RL agent could optimize a non-linear combination of objectives such that it avoids strongly negative outcomes on any particular objective.
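One concrete way to do this (just one option among many non-linear scalarizations; the exponential form and the alpha value are my own illustrative choices) is to pass each objective's return through a sharply concave transform before summing, so a strongly negative outcome on one objective can't be bought off by a big gain on another:

```python
import numpy as np

def scalarize(returns, alpha=1.0):
    """Concave (risk-averse) scalarization of per-objective returns.

    Low values are punished far more than high values are rewarded,
    so the agent prefers balanced outcomes over lopsided ones.
    """
    returns = np.asarray(returns, dtype=float)
    return float(np.sum(-np.exp(-alpha * returns)))

# Same linear sum, very different scalarized value:
print(scalarize([1.0, 1.0]))   # ~ -0.74  (balanced)
print(scalarize([2.0, 0.0]))   # ~ -1.14  (one objective sacrificed)
print(scalarize([4.0, -2.0]))  # ~ -7.41  (strongly negative outcome dominates)
```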
This sounds like a very interesting question.
I get stuck trying to answer your question itself, on the differences between AGI and humans.
But taking your question at face value:
ferreting out the fundamental intentions
What sort of context are you imagining? Humans aren't even great at identifying the fundamental reason for their own actions. They'll confabulate if forced to.
thank you for writing this. I really personally appreciate it!
That's smart! When I started graduate school in psychology in 2013, mirror neurons felt like, colloquially, "hot shit", but within a few years, people had started to cringe quite dramatically whenever the phrase was used. I think your reasoning in (3) is spot on.
Your example leads to fun questions like "how do I recognize juggling?", including "what stimuli activate the concept of juggling when I do it?" vs. "what stimuli activate the concept of juggling when I see you do it?" Intuitively, nothing there seems to require that those be the same neurons, e...
Then the next thing I want to suggest is that the system uses human resolutions of conflicting outcomes to train itself to predict how a human would resolve a conflict, and if its confidence is higher than a suitable threshold, it will go ahead and act without human intervention. But any prediction of what a human would decide could be second-guessed by a human pointing out where the prediction is wrong.
Agreed that a human understanding the plan (and all the relevant outcomes; which outcomes are relevant?) is important, and harder than I first imagined.
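A minimal sketch of the loop I have in mind (all names here are hypothetical placeholders, not an existing API): predict the human's resolution, act autonomously only above a confidence threshold, and otherwise defer to the human and train on their answer.

```python
CONFIDENCE_THRESHOLD = 0.95  # how sure the predictor must be before acting alone

def resolve_conflict(state, conflict, human_model, ask_human):
    """Resolve a conflict between outcomes, deferring to a human when unsure."""
    predicted_resolution, confidence = human_model.predict(state, conflict)
    if confidence >= CONFIDENCE_THRESHOLD:
        # Act without human intervention; the action remains open to later
        # second-guessing by a human who spots a wrong prediction.
        return predicted_resolution
    resolution = ask_human(state, conflict)
    human_model.update(state, conflict, resolution)  # learn from the human's call
    return resolution
```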
You haven't factored in the possibility that Putin gets deposed by forces inside Russia who might be worried about a nuclear war; conditional on the use of tactical nukes, that intuitively seems likely enough to materially lower p(kaboom).
American Academy of Pediatrics lies to us once again....
"If caregivers are wearing masks, does that harm kids’ language development? No. There is no evidence of this. And we know even visually impaired children develop speech and language at the same rate as their peers."
This is a textbook case of the Law of No Evidence. Or it would be, if there wasn’t any Proper Scientific Evidence.
Is it, though? I'm no expert, but I tried to find Relevant Literature. Sometimes, counterintuitive things are true.
https://www.researchgate.net/publication/220009177_Language_D...
Still working my way through reading this series--it is the best thing I have read in quite a while and I'm very grateful you wrote it!
I feel like I agree with your take on "little glimpses of empathy" 100%.
I think fear of strangers could maybe be implemented without a Steering Subsystem circuit? (I should say up front that I don't know more about developmental psychology/neuroscience than you do, but here's my 2c anyway.) Put aside whether there's another, more basic Steering Subsystem circuit for agency detection; we know that pretty early on, through some combi...
Event is on tonight as planned at 7. If you're coming, looking forward to seeing you!
I wrote a paper on another experiment by Berridge reported in Zhang & Berridge (2009). Similar behavior was observed in that experiment, but the question explored was a bit different. They reported a behavioral pattern in which rats typically found moderately salty solutions appetitive and very salty solutions aversive. Put into salt deprivation, rats then found both solutions appetitive, but the salty solution less so.
They (and we) took it as given that homeostatic regulation set a 'present value' for salt that was dependent on the organism's cu...
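Here's a toy model of that kind of homeostatic re-weighting (emphatically my own made-up functional forms and numbers, not the model from Zhang & Berridge or from our paper): appetitive value is a need-scaled benefit of sodium minus an aversion that grows with concentration, which reproduces the qualitative pattern of moderate salt being liked, intense salt being disliked, and both becoming liked (intense salt less so) under deprivation.

```python
def salt_value(concentration, need, half_sat=0.1, aversion_gain=2.0):
    """Toy 'present value' of a salt solution given current physiological need."""
    benefit = concentration / (concentration + half_sat)  # saturating benefit of sodium
    aversion = aversion_gain * concentration              # intensity-driven aversion
    return need * benefit - aversion

for need, label in [(1.0, "sated"), (3.0, "salt-deprived")]:
    moderate = salt_value(0.15, need)
    intense = salt_value(1.0, need)
    print(f"{label:>13}: moderate={moderate:+.2f}, intense={intense:+.2f}")
# sated:         moderate positive, intense negative
# salt-deprived: both positive, intense still less appetitive than moderate
```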
Hey Steve, I am reading through this series now and am really enjoying it! Your work is incredibly original and wide-ranging as far as I can see--it's impressive how many different topics you have synthesized.
I have one question on this post--maybe it doesn't rise above the level of 'nitpick', I'm not sure. You mention a "curiosity drive" and other Category A things that the "Steering Subsystem needs to do in order to get general intelligence". You've also identified the human Steering Subsystem as the hypothalamus and brain stem.
Is it possible things like a ...
Very late to the party here. I don't know how much of the thinking in this post you still endorse or are still interested in. But this was a nice read. I wanted to add a few things:
- since you wrote this piece back in 2021, I have learned there is a whole mini-field of computer science dealing with multi-objective reward learning, maybe centered around . Maybe a good place to start there is https://link.springer.com/article/10.1007/s10458-022-09552-y
- The shard theory folks have done a fairly good job sketching out broad principles but it seems...
At this moment in time I have two theories about how shards seem to be able to form consistent and competitive values that don't always optimize for some ultimate goal:
I have pointed this out to folks in the context of AI timelines: Metaculus gives predictions for "weak AGI", but I consider a hypothetical Gato-X that can generalize to one task, or many tasks, outside its training distribution to be AGI, yet still a considerable way from an AGI with enough agency to act on its own.
OTOH, it isn't much reassurance if something as small as a batch script to keep it running is enough to bootstrap it into agency.
But the time between weak AGI and agentic AGI is a prime learning oppor...
You mentioned in the pre-print that results were "similar" for the two color temperatures, and referred to the Appendix for more information, but it seems like the Appendix isn't included in your pre-print. Are you able to elaborate on how similar the results in these two conditions were? In my own personal exploration of this area I have put a lot of emphasis on color temperature. Your study makes me adjust the importance of color temperature downward, although it would be good to get more information.
A consolidated list of bad or incomplete solutions could have considerable didactic value--it could help people learn more about the various challenges involved.
The goal of having a list of many bad ideas is different from having a focused explanation about why certain ideas are bad.
Writing posts about bad ideas and how they fail could be a type of post that's valuable but it's different than just listing ideas.
Not sure what I was thinking of, but probably just this: my understanding is that "safe AGI via AUP" would have to penalize the agent for learning to achieve anything not directly related to the end goal, and that might make the end goal too difficult to actually achieve when, e.g., it turns out to require tangentially related behavior.
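For concreteness, here's a minimal sketch of the AUP-style penalty as I understand it from Turner et al. (the variable names and the lambda value are mine): the primary reward is docked in proportion to how much the action shifts the agent's attainable value on a set of auxiliary goals, relative to doing nothing. That's the sense in which it penalizes gaining the ability to achieve things unrelated to the end goal.

```python
def aup_reward(primary_reward, q_aux_action, q_aux_noop, lam=0.1):
    """AUP-style penalized reward (sketch).

    q_aux_action[i] / q_aux_noop[i]: attainable value of auxiliary goal i
    after taking the candidate action vs. after the no-op.
    """
    penalty = sum(abs(qa - qn) for qa, qn in zip(q_aux_action, q_aux_noop))
    penalty /= max(len(q_aux_action), 1)
    return primary_reward - lam * penalty

# An action that greatly changes what the agent *could* do gets docked,
# even if it also earns primary reward:
print(aup_reward(1.0, q_aux_action=[5.0, 0.2], q_aux_noop=[1.0, 0.2], lam=0.5))  # 1.0 - 0.5*2.0 = 0.0
```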
Your "social dynamics" section encouraged me to be bolder sharing my own ideas on this forum, and I wrote up some stuff today that I'll post soon, so thank you for that!
That was an inspiring and enjoyable read!
Can you say why you think AUP is "pointless" for alignment? It seems to me that eliciting cautious behavior from a reward learner might turn out to be helpful. Overall, my intuition is that it could turn out to be an essential piece of the puzzle.
I can think of one or two reasons myself, but I barely grasp the finer points of AUP as it is, so speculation on my part here might be counterproductive.
I would very much like to see your dataset, as a Zotero database or in some other format, in order to better orient myself to the space. Are you able to make it available somehow?
Very, very helpful! The clustering is obviously a function of the corpus. From your narrative, it seems like you only added the missing arXiv files after clustering. Is it possible the clusters would look different with those included?
One approach to low-impact AI might be to pair an AGI system with a human supervisor who gives it explicit instructions about when it is permitted to continue. I have proposed a kind of "decision paralysis" in which, given multiple conflicting goals, a multi-objective agent would simply choose not to act (I'm not the first or only one to describe this kind of conservatism, but I don't recall the framing others have used). In this case, the multiple objectives might be the primary objective and then your low-impact objective.
This might be a way forward to deal with ...
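A minimal sketch of the decision-paralysis idea above (my own toy formalization, under the assumption that each objective can score candidate actions): act only when some action is near-best for every objective, and otherwise abstain and hand control back to the human supervisor.

```python
def choose_action(actions, objectives, margin=0.0):
    """Return an action only if every objective (near-)agrees it is best; else None."""
    acceptable_per_objective = []
    for objective in objectives:
        scores = {action: objective(action) for action in actions}
        best = max(scores.values())
        acceptable_per_objective.append(
            {action for action, score in scores.items() if score >= best - margin}
        )
    agreed = set.intersection(*acceptable_per_objective)
    if agreed:
        return next(iter(agreed))
    return None  # conflicting objectives: decision paralysis, defer to the supervisor

# Example: a primary objective and a low-impact objective that disagree.
actions = ["aggressive_plan", "cautious_plan"]
primary = {"aggressive_plan": 10.0, "cautious_plan": 6.0}.get
low_impact = {"aggressive_plan": -5.0, "cautious_plan": 2.0}.get
print(choose_action(actions, [primary, low_impact]))  # None -> do not act
```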
It seems like, even on the view of "fast takeoff" proponents, we will probably have a few months between when we've built a superintelligence that appears to have unaligned values and when it is too late to stop it.
At that point, isn't stopping it a simple matter of building an equivalently powerful superintelligence given the sole goal of destroying the first one?
That almost implies a simple plan for preparation: for every AGI built, researchers agree to also build a parallel AGI with the sole goal of defeating the first one. Perhaps it woul...
Thanks for your thorough response. It is well-argued and as a result, I take back what I said. I'm not entirely convinced by your response but I will say I now have no idea! Being low-information on this, though, perhaps my reaction to the "challenge trial" idea mirrors other low-information responses, which is going to be most of them, so I'll persist in explaining my thinking mainly in the hope it'll help you and other pro-challenge people argue your case to others.
I'll start with maybe my biggest worry about a challenge trial: the idea you could have a ...
It seems to me that, for institutional frameworks, credible transparency is an important necessary (but not sufficient) step toward credible benignness; that credible transparency is not currently implemented within existing frameworks such as RSPs and Summit commitments; and that it would be a very achievable step forward.
So right now, model evals do suffice to demonstrate benignness, but we have to have some confidence in those evals, and transparency (e.g., openness to independent eval testing) seems essential. Then, when evals are no longer sufficient, I'm not sure what will be, but whatever it is, it will for sure require transparent testing by independent observers for credible benignness.