All of Ben Smith's Comments + Replies

On institutional frameworks, it seems to me that credible transparency is an important necessary (though not sufficient) step toward credible benignness, that credible transparency is not currently implemented within existing frameworks such as RSPs and Summit commitments, and that it would be a very achievable step forward.

So right now, model evals do suffice to demonstrate benignness, but we have to have some confidence in those evals, and transparency (e.g., openness to independent eval testing) seems essential. Then, when evals are no longer sufficient, I'm not sure what will be, but whatever it is, it will for sure require transparent testing by independent observers for credible benignness. 

It seems to me, on the evidence presented, that OpenAI's autonomy threshold is actually lower than Anthropic's, and would trigger their deployment mitigations much earlier than Anthropic's ASL-3 Deployment and Security Standard.

To reach Anthropic's standard, you have to have basically reached AI-take-off--either fully automating an AI researcher, or doubling the speed of AI progress. To reach OpenAI's High autonomy standard, you need 

Model can execute open-ended, novel ML tasks on a production ML codebase that would constitute a significant step on the crit

... (read more)
7Zach Stein-Perlman
Briefly:

1. For OpenAI, I claim the cyber, CBRN, and persuasion Critical thresholds are very high (and also the cyber High threshold). I agree the autonomy Critical threshold doesn't feel so high.
2. For Anthropic, most of the action is at ASL-4+, and they haven't even defined the ASL-4 standard yet. (So you can think of the current ASL-4 thresholds as infinitely high. I don't think "The thresholds are very high" for OpenAI was meant to imply a comparison to Anthropic; it's hard to compare since ASL-4 doesn't exist. Sorry for confusion.)

I really love this. It is critically important work for the next four years. I think my biggest question is: when talking with the people currently in charge, how do you persuade them to make the AI Manhattan Project into something that advances AI Safety more than AI capabilities? I think you gave a good hint when you say, 

But true American AI supremacy requires not just being first, but being first to build AGI that remains reliably under American control and aligned with American interests. An unaligned AGI would threaten American sovereignty

but i ... (read more)

Admirable nuance and opportunities-focused thinking--well done! I recently wrote about a NatSec policy that might be useful for consolidating AI development in the United States and thereby safeguarding US National Security through introducing new BIS export controls on model weights themselves.

sensitivity of benchmarks to prompt variations introduces inconsistencies in evaluation

 

When evaluating human intelligence, random variation is also something evaluators must deal with. Psychometricians have more or less solved this problem by designing intelligence tests to include a sufficiently large battery of correlated test questions. By serving a large battery of questions, one can average out item-level noise, in the same way that averaging many samples from a distribution yields an estimate of the population mean.
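As a minimal simulation sketch of that point (the unit-variance noise model and the numbers are just assumptions for illustration): averaging more items shrinks the error of the battery estimate roughly as 1/sqrt(number of items).

```python
import numpy as np

rng = np.random.default_rng(0)

true_ability = 1.0   # latent trait we want to estimate, in z-score units
n_people = 10_000

for n_items in (1, 10, 100):
    # each item score = latent trait + independent measurement noise
    noise = rng.normal(0.0, 1.0, size=(n_people, n_items))
    battery_scores = (true_ability + noise).mean(axis=1)
    # spread of the battery estimate shrinks roughly as 1/sqrt(n_items)
    print(n_items, round(battery_scores.std(), 3))
```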

I suppose the ... (read more)

As I have written the proposal, it applies to anyone applying for an employment visa in the US in any industry. Someone in a foreign country who wants to move to the US would not have to decide to focus on AI in order to move to the US; they may choose any pathway that they believe would induce a US employer to sponsor them, or that they believe the US government would approve through self-petitioning pathways in the EB-1 and EB-2 NIW.

Having said that, I expect that AI-focused graduates will be especially well placed to secure an employment visa, but it do... (read more)

Ben Smith123

This frightening logic leaves several paths to survival. One is to make personal intent aligned AGI, and get it in the hands of a trustworthy-enough power structure. The second is to create a value-aligned AGI and release it as a sovereign, and hope we got its motivations exactly right on the first try. The third is to Shut It All Down, by arguing convincingly that the first two paths are unlikely to work - and to convince every human group capable of creating or preventing AGI work. None of these seem easy.[3] 

 

Is there an option which is "perso... (read more)

Seth Herd107

I wish. I think a 100-way multipolar scenario would be too unstable to last more than a few years.

I think AGI that can self-improve produces really bad game theoretic equilibria, even when the total situation is far from zero-sum. I'm afraid that military technology favors offense over defense, and conflict won't be limited to the infosphere. It seems to me that the nature of physics makes it way easier to blow stuff up than to make stuff that resists being blown up. Nukes produced mutually assured destruction because they're so insanely destructive, and c... (read more)

A crux for me is the likelihood of multiple catastrophic events larger than the threshold ($500m) but smaller than the liquidity of a developer whose model contributed to them, and the likelihood that such events occur before a catastrophic event much larger than those.

If a model developer is valued at $5 billion and has access to $5b, and causes $1b in damage, they could pay for the $1b damage. Anthropic's proposal would make them liable in the event that they cause this damage. Consequently the developer would be correctly incenti... (read more)
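As a toy calculation with illustrative numbers only: strict liability bites fully as long as damages stay below what the developer can actually pay, and an insurance requirement just raises that ceiling.

```python
def recoverable(damage_usd: float, developer_liquidity_usd: float,
                insurance_cover_usd: float = 0.0) -> float:
    """Upper bound on what victims can actually recover under strict liability."""
    return min(damage_usd, developer_liquidity_usd + insurance_cover_usd)

liquidity = 5e9  # developer can access ~$5b, as in the example above

print(recoverable(1e9, liquidity))           # 1e9: the $1b event is fully internalized
print(recoverable(50e9, liquidity))          # 5e9: everything past the developer's assets is externalized
print(recoverable(50e9, liquidity, 20e9))    # 25e9: mandatory cover raises the ceiling (if the insurer survives)
```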

2RobertM
Yeah, requiring purchase of insurance covering $BIGNUM seems more likely to work here, at least if you believe that insurance will be accurately priced (in ways that are sensitive to e.g. safety practices that would actually reduce risk), and you expect there to be catastrophes that are small enough to leave the insurer around.

Has the time changed from Tuesday to Wednesday, or do you do events on Tuesday in addition to this event on Wednesday?

1Czynski
Permanently changed to Wednesdays, but forgot that was in the group description; now fixed. There is a Manifold-associated event, Taco Tuesdays, running in SF, and I decided I'd rather stop scheduling against it.

Generally I'd steer towards informal power over formal power. 

Think about the OpenAI debacle last year. If I understand correctly, Microsoft had no formal power to exert control over OpenAI. But they seemed to have employees on their side. They could credibly threaten to hire away all the talent, and thereby reconstruct OpenAI's products as Microsoft IP. Beyond that, perhaps OpenAI was somewhat dependent on Microsoft's continued investment, and even though they don't have to do as Microsoft says, are they really going to jeopardise future funding? Wha... (read more)

In the human brain there is quite a lot of redundancy of information encoding. This could be for a variety of reasons.

Here's one hot take: in a brain and in a language model, I can imagine that during early learning the network hasn't learned concepts like "how to code" well enough to recognize that each training instance is an instance of the same thing. Consequently, during that early learning stage, the model does just encode a variety of representations for what turns out to be the same thing. As its vector encodings start to match each subsequent train... (read more)

Have you tried asking Claude to summarize it for you?

3lemonhope
The last time I tried to do that I embarrassed myself, but that was with Claude 2 and important specific document questions, not questions like "what are the most important parts". I might try that next time. Zvi's writing is addictive though.

For me the issue is that

  1. it isn't clear how you could enforce attendance or

  2. what value individual attendees could have to make it worth their while to attend regularly.

(2) is sort of a collective action/game theoretic/coordination problem.

(1) reflects the rationalist nature of the organization.

Traditional religions back up attendance by divine command. They teach absolutist, divine command theoretic accounts of morality, backed up by accounts of commands from God to attend regularly. At the most severe mode these are backed by threat of eternal hell... (read more)

I tried a similar Venn diagram approach more recently. I didn't really distinguish between bare "consciousness" and "sentience". I'm still not sure if I agree "aware without thoughts and feelings" is meaningful. I think awareness might always be awareness of something. But nevertheless they are at least distinct concepts and they can be conceptually separated! Otherwise my model echoes the one you have created earlier.

 https://www.lesswrong.com/posts/W5bP5HDLY4deLgrpb/the-intelligence-sentience-orthogonality-thesis

I think it's a really interesting ques... (read more)

If Ray eventually found that the money was "still there", doesn't this make Sam right that "the money was really all there, or close to it" and "if he hadn’t declared bankruptcy it would all have worked out"?

Ray kept searching, Ray kept finding.

That would raise the amount collected to $9.3 billion—even before anyone asked CZ for the $2.275 billion he’d taken out of FTX. Ray was inching toward an answer to the question I’d been asking from the day of the collapse: Where did all that money go? The answer was: nowhere. It was still there.

What a great read. Best of luck with this project. It sounds compelling.

Seems to me that in this case, the two are connected. If I falsely believed my group was in the minority, I might refrain from clicking the button out of a sense of fairness or deference to the majority group. 

Consequently, the lie not only influenced people who clicked the button, it perhaps also influenced people who did not. So due to the false premise on which the second survey was based, it should be disregarded altogether. To not disregard would be to have obtained by fraud or trickery a result that is disadvantageous to all the majority group m... (read more)

Differentiating intelligence and agency seems hugely clarifying for many discussions in alignment.

You might have noticed I didn't actually fully differentiate intelligence and agency. It seems to me that to exert agency a mind needs a certain amount of intelligence, and so I think all agents are intelligent, though not all intelligences are agentic. Agents that are minimally intelligent (like simple RL agents in simple computer models) are also pretty minimally agentic. I'd be curious to hear about a counter-example.


Incidentally I also like Anil Seth's work and... (read more)

I found the Clark et al. (2019) "Bayesing Qualia" article very useful, and that did give me an intuition of the account that perhaps sentience arises out of self-awareness. But they themselves acknowledged in their conclusion that the paper didn't quite demonstrate that principle, and I didn't find myself convinced of it.

Perhaps what I'd like readers to take away is that sentience and self-awareness can be at the very least conceptually distinguished. Even if it isn't clear empirically whether or not they are intrinsically linked, we ought to maintain a co... (read more)

I wanted to write myself about a popular confusion between decision making, consciousness, and intelligence which among other things leads to bad AI alignment takes and mediocre philosophy.

This post has not got a lot of attention, so if you write your own post, perhaps the topic will have another shot at reaching popular consciousness (heh), and if you succeed, I might try to learn something about how you did it and this post did not!

I wasn't thinking that it's possible to separate qualia perception and self awareness

Separating qualia and self-awareness is... (read more)

Thank you for the link to the Friston paper. I'm reading that and will watch Lex Fridman's interview with Joscha Bach, too. I sort of think "illusionism" is a bit too strong, but perhaps it's a misnomer rather than wrong (or I could be wrong altogether). Clark, Friston, and Wilkinson say
 

But in what follows we aim not to Quine (explain away) qualia but to ‘Bayes’ them – to reveal them as products of a broadly speaking rational process of inference, of the kind imagined by the Reverend Bayes in his (1763) treatise on how to form and update beliefs on t

... (read more)
2TAG
"Quining means some thing stronger than "explaining away". Dennett:
2Roman Leventov
I've never heard Bach himself calling his position "illusionism", I've just applied this label to his view spontaneously. This could be inaccurate, indeed, especially from the perspective of historical usage of the term. Instead of Bach's interviews with Lex, I'd recommend this recent show: https://youtu.be/PkkN4bJN2pg (disclaimer: you will need to persevere through a lot of bad and often lengthy and ranty takes on AI by the hosts throughout the show, which at least to me was quite annoying. Unfortunately, there is no transcript, it seems.)

Great post; two points of disagreement worth mentioning:

  1. Exploring the full ability of dogs and cats to communicate isn't so much impractical to do in academia; it just isn't very theoretically interesting. We know animals can do operant conditioning (we've known for over 100 years probably), but we also know they struggle with complex syntax. I guess there's a lot of uncertainty in the middle, so I'm low confidence about this. But generally to publish a high impact paper about dog or cat communication you'd have to show they can do more than "condi
... (read more)
Ben Smith*2812

There's not just acceptance at stake here. Medical insurance companies are not typically going to buy into a responsibility to support clients' morphological freedom, as if medically transitioning were in the same class of thing as a cis person getting a facelift or a cis woman getting a boob job, because it is near-universally understood that those are "elective" medical procedures. But if their clients have a "condition" that requires "treatment", well, now insurers are on the hook to pay. Public health systems operate according to similar principles, providing services... (read more)

Ben SmithΩ110

I guess this falls into the category of "Well, we'll deal with that problem when it comes up", but I'd imagine that when a human preference in a particular dilemma is undefined or even just highly uncertain, one can often defer to other rules: rather than maximize an uncertain preference, default to maximizing the human's agency, even if this predictably leads to less-than-optimal preference satisfaction.

I think your point is interesting and I agree with it, but I don't think Nature are only addressing the general public. To me, it seems like they're addressing researchers and policymakers and telling them what they ought to focus on as well.

Well written, I really enjoyed this. This is not really on topic, but I'd be curious to read an "idiot's guide" or maybe an "autist's guide" on how to avoid sounding condescending.

2Seth Herd
Aw, thanks! I think that not sounding condescending is absolutely critical to having good discussions on this (and many other obscure and technical topics). I have had a lifelong journey of going from sounding condescending way too much, to sounding less condescending, at least when I remember to try. I don't know if I'm a bit on the autism spectrum, or just raised to value logic and winning arguments over social skills. I think a lot of it is tone of voice and timing. I'm not going to get those by acting, so I just try to adopt a soft and patient emotional tone, and continually remind myself that the person I'm talking to hasn't thought about this topic nearly as much, and I probably sound like an idiot when I talk about other people's favorite topics. Finding points of agreement and voicing them before moving on to points of disagreement is key. So is not expecting to change someone's mind in the moment. I think offering ideas and perspectives, and letting people think them through is how people learn and change beliefs.

interpretability on pretrained model representations suggest they're already internally "ensembling" many different abstractions of varying sophistication, with the abstractions used for a particular task being determined by an interaction between the task data available and the accessibility of the different pretrained abstraction

That seems encouraging to me. There's a model of AGI value alignment where the system has a particular goal it wants to achieve and brings all its capabilities to bear on achieving that goal. It does this by having a "world m... (read more)

I've been writing about multi-objective RL and trying to figure out a way that an RL agent could optimize for a non-linear sum of objectives in a way that avoids strongly negative outcomes on any particular objective.

https://www.lesswrong.com/posts/i5dLfi6m6FCexReK9/a-brief-review-of-the-reasons-multi-objective-rl-could-be
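As a minimal sketch of the kind of non-linear aggregation I have in mind (the exponential form is just one illustrative choice, not necessarily the one argued for in the linked post): a concave, loss-averse scalarization makes a strongly negative outcome on any single objective dominate the total, so it can't be traded away by large gains elsewhere.

```python
import numpy as np

def scalarize(rewards: np.ndarray, k: float = 3.0) -> float:
    """Concave (loss-averse) aggregation of per-objective rewards."""
    return float(-np.sum(np.exp(-k * rewards)))

balanced = np.array([1.0, 1.0, 1.0])
lopsided = np.array([10.0, 10.0, -5.0])   # huge gains on two objectives, one disaster

print(scalarize(balanced) > scalarize(lopsided))   # True: the balanced outcome is preferred
```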

1Seth Herd
Thank you! This is addressing the question I was trying to get at. I'll check it out.

This sounds like a very interesting question.

I get stuck trying to answer your question on the differences between AGI and humans.

But taking your question at face value:

ferreting out the fundamental intentions

What sort of context are you imagining? Humans aren't even great at identifying the fundamental reason for their own actions. They'll confabulate if forced to.

2TekhneMakre
Any context where there's any impressive successes. I gave possible examples here:

Thank you for writing this. I really personally appreciate it!

That's smart! When I started graduate school in psychology in 2013, mirror neurons felt like, colloquially, "hot shit", but within a few years, people had started to cringe quite dramatically whenever the phrase was used. I think your reasoning in (3) is spot on.

Your example leads to fun questions like, "how do I recognize juggling", including "what stimuli activate the concept of juggling when I do it" vs "what stimuli activate the concept of juggling when I see you do it"?, and intuitively, nothing there seems to require that those be the same neurons, e... (read more)

Then the next thing I want to suggest is that the system uses human resolutions of conflicting outcomes to train itself to predict how a human would resolve a conflict, and if its confidence is above a suitable threshold, it will go ahead and act without human intervention. But any prediction of how a human would resolve a conflict could be second-guessed by a human pointing out where the prediction is wrong.
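A minimal sketch of that loop (the names, interfaces, and threshold value are just illustrative assumptions, not a worked-out design):

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class DeferringResolver:
    predict: Callable[[dict], Tuple[str, float]]   # conflict -> (predicted resolution, confidence)
    ask_human: Callable[[dict], str]               # fallback oracle
    threshold: float = 0.95
    log: List[Tuple[dict, str]] = field(default_factory=list)

    def resolve(self, conflict: dict) -> str:
        resolution, confidence = self.predict(conflict)
        if confidence >= self.threshold:
            return resolution                      # act without human intervention
        human_choice = self.ask_human(conflict)    # defer when unsure
        self.log.append((conflict, human_choice))  # data for the next training round
        return human_choice
```

Each deferred case becomes a labeled example for retraining, which is what should let the system clear the threshold more often over time.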

Agreed that a human understanding the plan (and all the relevant outcomes; which outcomes are relevant?) is important and harder than I first imagined.

1Fabian Schimpf
I think this threshold will be tough to set. IMO, confidence in a decision only really makes sense if you consider decisions to be uni-modal. I would argue that this is rarely the case for a sufficiently capable system (like you and me). We are constantly trading off multiple options, and thus the confidence (e.g., as measured by the log-likelihood of the action given a policy and state) depends on the number of options available. I expect this context dependence would be a tough nut to crack to have a meaningful threshold.
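To illustrate the dependence on option count (a toy check; the maximally indifferent softmax policy is an assumed worst case): the highest action probability is 1/N, so a fixed confidence threshold means something very different for 2 options than for 100.

```python
import numpy as np

def max_softmax_prob(logits: np.ndarray) -> float:
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return float(p.max())

for n_options in (2, 10, 100):
    logits = np.zeros(n_options)                 # a policy with no preference at all
    print(n_options, max_softmax_prob(logits))   # 0.5, 0.1, 0.01
```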

You haven't factored in the possibility that Putin gets deposed by forces inside Russia who might be worried about a nuclear war. Conditional on the use of tactical nukes, that intuitively seems likely enough to materially lower p(kaboom).

American Academy of Pediatrics lies to us once again....

"If caregivers are wearing masks, does that harm kids’ language development? No. There is no evidence of this. And we know even visually impaired children develop speech and language at the same rate as their peers."
This is a textbook case of the Law of No Evidence. Or it would be, if there wasn’t any Proper Scientific Evidence.

Is it, though? I'm no expert, but I tried to find Relevant Literature. Sometimes, counterintuitive things are true.

https://www.researchgate.net/publication/220009177_Language_D... (read more)

Ben SmithΩ120

Still working my way through reading this series--it is the best thing I have read in quite a while and I'm very grateful you wrote it!

I feel like I agree with your take on "little glimpses of empathy" 100%.

I think fear of strangers could be implemented without a steering subsystem circuit maybe? (Should say up front I don't know more about developmental psychology/neuroscience than you do, but here's my 2c anyway). Put aside whether there's another more basic steering subsystem circuit for agency detection; we know that pretty early on, through some combi... (read more)

Event is on tonight as planned at 7. If you're coming, looking forward to seeing you!

I wrote a paper on another experiment by Berridge reported in Zhang & Berridge (2009). Similar behavior was observed in that experiment, but the question explored was a bit different. They reported a behavioral pattern in which rats typically found moderately salty solutions appetitive and very salty solutions aversive. Put into salt deprivation, rats then found both solutions appetitive, but the salty solution less so. 

They (and we) took it as given that homeostatic regulation set a 'present value' for salt that was dependent on the organism's cu... (read more)
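A toy illustration of that qualitative pattern (just a sketch of the idea, not the actual Zhang & Berridge model; the state multiplier kappa and the quadratic aversion term are assumptions of mine):

```python
def salt_value(concentration: float, kappa: float) -> float:
    """Toy 'present value' of a salt solution: a physiological-state gain
    (kappa, higher when salt-deprived) scales the appetitive component,
    while aversion grows with the square of concentration."""
    return kappa * concentration - concentration ** 2

MODERATE, VERY_SALTY = 1.0, 3.0

# Normal state: moderate is appetitive, very salty is aversive.
print(salt_value(MODERATE, kappa=2.0), salt_value(VERY_SALTY, kappa=2.0))   # 1.0 -3.0
# Salt-deprived: both appetitive, the very salty solution less so.
print(salt_value(MODERATE, kappa=3.5), salt_value(VERY_SALTY, kappa=3.5))   # 2.5 1.5
```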

Ben SmithΩ470

Hey Steve, I am reading through this series now and am really enjoying it! Your work is incredibly original and wide-ranging as far as I can see--it's impressive how many different topics you have synthesized.

I have one question on this post--maybe doesn't rise above the level of 'nitpick', I'm not sure. You mention a "curiosity drive" and other Category A things that the "Steering Subsystem needs to do in order to get general intelligence". You've also identified the human Steering Subsystem as the hypothalamus and brain stem.

Is it possible things like a ... (read more)

5Steven Byrnes
Thanks! First of all, to make sure we’re on the same page, there’s a difference between “self-supervised learning” and “motivation to reduce prediction error”, right? The former involves weight update, the latter involves decisions and rewards. The former is definitely a thing in the neocortex—I don’t think that’s controversial.

As for the latter, well I don’t know the full suite of human motivations, but novelty-seeking is definitely a thing, and spending all day in a dark room is not much of a thing, and both of those would go against a motivation to reduce prediction error. On the other hand, people sometimes dislike being confused, which would be consistent with a motivation to reduce prediction error. So I figure, maybe there’s a general motivation to reduce prediction error (but there are also other motivations that sometimes outweigh it), or maybe there isn’t such a motivation at all (but other motivations can sometimes coincidentally point in that direction). Hard to say. ¯\_(ツ)_/¯

I absolutely believe that there are signals from the telencephalon, communicating telencephalon activity / outputs, which are used as inputs to the calculations leading up to the final reward prediction error (RPE) signal in the brainstem. Then there has to be some circuitry somewhere setting things up such that some particular type of telencephalon activity / outputs have some particular effect on RPE. Where is this circuitry? Telencephalon or brainstem?

Well, I guess you can say that if a connection from Telencephalon Point A to Brainstem Point B is doing something specific and important, then it’s a little bit arbitrary whether we call this “telencephalon circuitry” versus “brainstem circuitry”. In all the examples I’ve seen, it’s tended to make more sense to lump it in with the brainstem / hypothalamus. But it’s hard for me to argue that without a better understanding of what you have in mind here.
Ben SmithΩ110

Very late to the party here. I don't know how much of the thinking in this post you still endorse or are still interested in. But this was a nice read. I wanted to add a few things:

 - since you wrote this piece back in 2021, I have learned there is a whole mini-field of computer science dealing with multi-objective reward learning, maybe centered around . Maybe a good place to start there is https://link.springer.com/article/10.1007/s10458-022-09552-y

 - The shard theory folks have done a fairly good job sketching out broad principles but it seems... (read more)

At this moment in time I have two theories about how shards seem to be able to form consistent and competitive values that don't always optimize for some ultimate goal:

  • Overall, Shard theory is developed to describe the behavior of human agents whose inputs and outputs are multi-faceted. I think something about this structure might facilitate the development of shards in many different directions. This seems different from modern deep RL agents; although they also potentially can have lots of input and output nodes, these are pretty finely honed to achieve a fairl
... (read more)

I have pointed this out to folks in the context of AI timelines: Metaculus gives predictions for "weak AGI", but I consider a hypothetical GATO-x which can generalize to a task outside its training distribution, or many tasks outside its training distribution, to be AGI, yet a considerable way from an AGI with enough agency to act on its own.

OTOH, it isn't much reassurance if something as simple as a batch script to keep it running is enough to bootstrap this thing up to agency.

But the time between weak AGI and agentic AGI is a prime learning oppor... (read more)

1Roman Leventov
Yes, in this interview, Connor Leahy said he has an idea of what these components are, but he wouldn't tell publicly.

You mentioned in the pre-print that results were "similar" for the two color temperatures, and referred to the Appendix for more information, but it seems like the Appendix isn't included in your pre-print. Are you able to elaborate on how similar the results in these two conditions were? In my own personal exploration of this area I have put a lot of emphasis on color temperature. Your study makes me adjust down the importance of color temperature, although it would be good to get more information.

A consolidated list of bad or incomplete solutions could have considerable didactic value--it could help people learn more about the various challenges involved.

3localdeity
For inspiration in the genre of learning-what-not-to-do, I suggest "How To Write Unmaintainable Code".  Also "Fumblerules".

The goal of having a list of many bad ideas is different from having a focused explanation about why certain ideas are bad. 

Writing posts about bad ideas and how they fail could be a type of post that's valuable but it's different than just listing ideas. 

Not sure what I was thinking about, but probably just that my understanding is that "safe AGI via AUP" would have to penalize the agent for learning to achieve anything not directly related to the end goal, and that might make it too difficult to actually achieve the end goal when e.g. it turns out to need tangentially related behavior.

Your "social dynamics" section encouraged me to be bolder sharing my own ideas on this forum, and I wrote up some stuff today that I'll post soon, so thank you for that!

That was an inspiring and enjoyable read!

Can you say why you think AUP is "pointless" for Alignment? It seems to me attaining cautious behavior out of a reward learner might turn out to be helpful. Overall my intuition is it could turn out to be an essential piece of the puzzle.

I can think of one or two reasons myself, but I barely grasp the finer points of AUP as it is, so speculation on my part here might be counterproductive.

6TurnTrout
Off-the-cuff:

* AUP, or any other outer objective function / reward function scheme, relies on having any understanding at all of how to transmute outer reward schedules (e.g. the AUP reward function + training) into internal cognitive structures (e.g. a trained policy which reliably doesn't take actions which destroy vases) which are stable over time (e.g. the policy "cares about" not turning into an agent which destroys vases).
* And if we knew how to do that, we could probably do a lot more exciting things than impact-limited agents; we probably would have just solved a lot of alignment in one fell swoop.
* I think I have some ideas of how this happens in people, and how we might do it for AGI.
* Even if impact measures worked, I think we really want an AI which can perform a pivotal act, or at least something really important and helpful (Eliezer often talks about the GPU-melting; my private smallest pivotal act is not that, though).
* Impact measures probably require big competitiveness hits, which twins with the above point.

Please go ahead and speculate anyways. Think for yourself as best you can, don't defer to me, just mark your uncertainties!
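For context, a rough sketch of the kind of reward scheme AUP refers to, as I understand it (the penalty term and its scaling are simplified assumptions here, not the exact published formulation):

```python
import numpy as np

def aup_reward(primary_r: float,
               q_aux_after_action: np.ndarray,  # attainable utility of each auxiliary goal after the action
               q_aux_after_noop: np.ndarray,    # ...and after doing nothing instead
               lam: float = 0.1) -> float:
    """Sketch of an AUP-style reward: penalize changes, in either direction,
    to the agent's ability to pursue auxiliary goals, relative to a no-op."""
    penalty = float(np.abs(q_aux_after_action - q_aux_after_noop).mean())
    return primary_r - lam * penalty
```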

I would very much like to see your dataset, as a zotero database or some other format, in order to better orient myself to the space. Are you able to make this available somehow?

Very very helpful! The clustering is obviously a function of the corpus. From your narrative, it seems like you only added the missing arx.iv files after clustering. Is it possible the clusters would look different with those in?

2Jan
Hey Ben! :) Thanks for the comment and the careful reading! Yes, we only added the missing arx.iv papers after clustering, but then we repeat the dimensionality reduction and show that the original clustering still holds up even with the new papers (Figure 4 bottom right). I think that's pretty neat (especially since the dimensionality reduction doesn't "know" about the clustering) but of course the clusters might look slightly different if we also re-run k-means on the extended dataset.

One approach to low-impact AI might be to pair an AGI system with a human supervisor who gives it explicit instructions where it is permitted to continue. I have proposed a kind of "decision paralysis" where, given multiple conflicting goals, a multi-objective agent would simply choose not to act (I'm not the first or only one to describe this kind of conservatism, but I don't recall the framing others have used). In this case, the multi-objectives might be the primary objective and then your low-impact objective.
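A minimal sketch of what I mean (the veto threshold and the maximin tie-break are just illustrative choices, not a worked-out proposal):

```python
from typing import Callable, List, Optional

def paralysis_policy(candidate_actions: List[str],
                     objective_scores: List[Callable[[str], float]],
                     veto_threshold: float = 0.0) -> Optional[str]:
    """Act only if no objective strongly objects; otherwise return None,
    i.e. do nothing and hand the decision back to the human supervisor."""
    acceptable = [a for a in candidate_actions
                  if all(score(a) >= veto_threshold for score in objective_scores)]
    if not acceptable:
        return None  # decision paralysis: defer to the human
    # among acceptable actions, maximize the worst-case objective
    return max(acceptable, key=lambda a: min(score(a) for score in objective_scores))
```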

This might be a way forward to deal with ... (read more)

2Fabian Schimpf
Hi Ben, I like the idea; however, almost every decision has conflicting outcomes, e.g., regarding opportunity cost. As I understand you, this would delegate almost every decision to humans if you take the premise of "I can't do X if I choose to do Y" seriously. I think the application to high-impact interference therefore seems promising if the system is limited to only deciding on a few things. The question then becomes whether a human can understand the plan that an AGI is capable of making. IMO this ties nicely into, e.g., ELK and interpretability research, but also the problem of predictability.

It seems like even amongst proponents of a "fast takeoff", we will probably have a few months of time between when we've built a superintelligence that appears to have unaligned values and when it is too late to stop it.

At that point, isn't stopping it a simple matter of building an equivalently powerful superintelligence given the sole goal of destroying the first one?

That almost implies a simple plan for preparation: for every AGI built, researchers agree together to also build a parallel AGI with the sole goal of defeating the first one. Perhaps it woul... (read more)

9Yonatan Cale
I think there's no known way to ask an AI to do "just one thing" without doing a ton of harm meanwhile. See this on creating a strawberry safely. Yudkowsky uses the example "[just] burn all GPUs" in his latest post.
6mako yass
Seems useless if the first system pretends convincingly to be aligned (which I think is going to be the norm) so you never end up deploying the second system? And "defeat the first AGI" seems almost as difficult to formalize correctly as alignment, to me:

* One problem is that when the unaligned AGI transmits itself to another system, how do you define it as the same AGI? Is there a way of defining identity that doesn't leave open a loophole that the first can escape through in some way?
* So I'm considering "make the world as if neither of you had ever been made"; that wouldn't have that problem, but it's impossible to actually attain this goal, so I don't know how you get it to satisfice over it then turn itself off afterwards; I'm concerned it would become an endless crusade.
3plex
One of the first priorities of an AI in a takeoff would be to disable other projects which might generate AGIs. A weakly superintelligent hacker AGI might be able to pull this off before it could destroy the world. Also, fast takeoff could be less than months by some people's guess. And what do you think happens when the second AGI wins, then maximizes the universe for "the other AI was defeated". Some serious unintended consequences, even if you could specify it well.

Thanks for your thorough response. It is well-argued and as a result, I take back what I said. I'm not entirely convinced by your response but I will say I now have no idea! Being low-information on this, though, perhaps my reaction to the "challenge trial" idea mirrors other low-information responses, which is going to be most of them, so I'll persist in explaining my thinking mainly in the hope it'll help you and other pro-challenge people argue your case to others.


I'll start with maybe my biggest worry about a challenge trial: the idea you could have a ... (read more)
