All of Mark Xu's Comments + Replies

I think I expect Earth in this case to just say no and not sell the sun? But I was confused at like 2 points in your paragraph so I don't think I understand what you're saying that well. I also think we're probably on mostly the same page, and am not that interested in hashing out further potential disagreements.

Also, mostly unrelated, maybe a hot take, but if you're able to get outcompeted because you don't upload, then the future you're in is not very good.

Cool. I misinterpreted your previous comment and think we're basically on the same page.

I think the majority of humans probably won't want to be uploads, leave the solar system permanently, etc. Maybe this is where we disagree? I don't really think there's going to be a thing that most people care about more.

3Vladimir_Nesov
Uploads have 10,000x life expectancy due to running faster, regardless of what global circumstance eventually destroys them (I'm expecting distributed backups for biological humans as well, but by definition they remain much slower).
4habryka
Sorry, that's literally what I am saying. If many people don't want to leave the solar system, and don't want to be uploads, then using the matter and energy available in the solar system effectively is a decision with a huge stake to many people.  I think if everyone or really almost everyone would want to be an upload, I think this would make it more likely that we should keep the sun intact, because then the sun could belong to just the few humans who don't have better alternatives in other solar systems. But if there is anything above 10% of humanity who don't want to be uploaded, or go on long-distance spaceship journeys in their biological bodies, then you better make sure you make the solar system great for this substantial fraction of humanity, and I think that will likely involve disassembling the sun.  I agree with you that many people don't want to be uploads, etc. I disagree that the majority of people who don't want to be uploads have attachments to the specific celestial bodies in our solar system. I think they just want to have a good life in their biological bodies, doing nice human things. Those goals would be non-trivially hampered if they couldn't disassemble the sun. That's like 99.9% of the energy and matter by which they could achieve those goals, and while I do think this subset of the population will be selected for less scope-sensitivity, I think there will be enough scope-sensitivity to make leaving the sun intact a bad choice. (To be clear, I disagree that the majority of humanity would not want to be uploads over the course of multiple generations, but it seems plausible to me that like 10%-20% of humanity don't want to be uploads, even over multiple generations)

I don't think that's a very good analogy, but I will say that is is basically true for the Amish. And I do think that we should respect their preferences. (I seperately think cars are not that good, and that people would infact prefer to bicycle around or ride house drawn carriges or whatever if civilization was conducive to that, although that's kinda besides the point.)

I'm not arguing that we should be conservative about changing the sun. I'm just claiming that people like the sun and won't want to see it eaten/fundamentally transformed, and that we shou... (read more)

2Ben Pace
I am not sure what point you are making with "respect their preferences", I am not proposing one country go to war with other countries to take the sun. For instance, one way it might go down is someone will just offer to buy it from Earth, and the price will be many orders of magnitude more resources than Earth has, so Earth will accept, and replace it with an artificial source of light & heat. I may be wrong about the estimates of the value of the energy, neither of us have specified how the rest of the stars in the universe will get distributed. For concreteness, I am here imagining something like: the universe is not a whole singleton but made of many separate enclaves that have their own governance and engage in trade with one another, and that Earth is a special one that keeps a lot of its lineage with present-day Earth, and is generally outcompeted by all the others ones that are smarter/faster and primarily run by computational-minds rather than biological ones. 

I am claiming that people when informed will want the sun to continuing being the sun. I also think that most people when informed will not really care that much about creating new people, will continue to believe in the act-omission distinction, etc. And that this is a coherent view that will add up to a large set of people wanting things in the solar system to remain conservatively the same. I seperately claim that if this is true, then other people should just respect this preference, and use the other stars that people don't care about for energy.

6habryka
As I mentioned in the other thread, it seems right to me that some people will want the sun to continue being the sun, but my sense is that within the set of people who don't want to leave the solar system, don't want to be uploads, don't want to be cryogenically shipped to other solar systems, or otherwise for some reason will have strong preferences over what happens with this specific solar system, this will be a much less important preference than using the sun for things that people care about more.
5Ben Pace
Analogously: "I am claiming that people when informed will want horses to continue being the primary mode of transportation. I also think that most people when informed will not really care that much about economic growth, will continue to believe that you're more responsible for changing things than for maintaining the status quo, etc. And that this is a coherent view that will add up to a large set of people wanting things in cities to remain conservatively the same. I separately claim that if this is true, then other people should just respect this preference, and go find new continents / planets on which to build cars that people in the cities don't care about." Sometimes it's good to be conservative when you're changing things, like if you're changing lots of social norms or social institutions, but I don't get it at all in this case. The sun is not a complicated social institution, it's primarily a source of heat and light and much of what we need can be easily replicated especially when you have nanobots. I am much more likely to grant that we should be slow to change things like democracy and the legal system than I am that we should change exactly how and where we should get heat and light. Would you have wanted conservatism around moving from candles to lightbulbs? Installing heaters and cookers in the house instead of fire pits? I don't think so.

But most people on Earth don't want "an artificial system to light the Earth in such a way as to mimic the sun", they want the actual sun to go on existing.

4AnthonyC
I think that's very probably true, yes. I'm not certain that will continue to be true indefinitely, or that it will or should continue to be the deciding factor for future decision making. I'm just pointing out that we're actually discussing a small subset of a very large space of options, that there are ways of "eating the sun" that allow life to continue unaltered on Earth, and so on. TBH even if we don't do anything like this, I wouldn't be terribly surprised if future humans end up someday building a full or partial shell around Earth anyway, long before there's anything like large-scale starlifting under discussion. Living space, power generation, asteroid deflection, off-world industry, just to name a few reasons we might do something like that. It could end up being easier and cheaper to increase available surface area and energy by orders of magnitude doing something like this than by colonizing other planets and moons.
2habryka
(This seems false and in as much as someone is willing to take bets that resolve after superintelligence, I would bet that most people do not care any appreciable amount about the actual sun existing by the time humanity is capable of actually providing a real alternative) 

This point doesn't make sense to me. It sounds similar to saying "Most people don't like it when companies develop more dense housing in cities, therefore a good democracy should not have it" or "Most people don't like it when their horse-drawn carriages are replaced by cars, therefore a good democracy should not have it". 

The cost-benefit calculations on these things work out and it's good if most uninformed people who haven't spent much time on it are not able to get in the way of companies that are building goods and services in this regard. 

T... (read more)

This is in part the reasoning used by Judge Kaplan:

Kaplan himself said on Thursday that he decided on his sentence in part to make sure that Bankman-Fried cannot harm other people going forward. “There is a risk that this man will be in a position to do something very bad in the future,” he said. “In part, my sentence will be for the purpose of disabling him, to the extent that can appropriately be done, for a significant period of time.”

from https://time.com/6961068/sam-bankman-fried-prison-sentence/

It's kind of strange that, from my perspective, these mistakes are very similar to the mistakes I think I made, and also see a lot of other people making. Perhaps one "must" spend too long doing abstract slippery stuff to really understand the nature of why it doesn't really work that well?

2Erik Jenner
Yeah. I think there's a broader phenomenon where it's way harder to learn from other people's mistakes than from your own. E.g. see my first bullet point on being too attached to a cool idea. Obviously, I knew in theory that this was a common failure mode (from the Sequences/LW and from common research advice), and someone even told me I might be making the mistake in this specific instance. But my experience up until that point had been that most of the research ideas I'd been similarly excited about ended up ~working (or at least the ones I put serious time into).

I know what the word means, I just think in typical cases people should be saying a lot more about why something is undignified, because I don’t think people’s senses of dignity typically overlap that much, especially if the reader doesn’t typically read LW. In these cases I think permitting the use of the word “undignified” prevents specificity.

2Ben Pace
Gotcha. I think this text that you wrote is really ambiguous: It's ambiguous between you not having an explicit understanding of what the meaning of the words are, versus you not understanding what the person you're speaking with is intending to convey (and the first meaning is IMO the more natural one). I think having good norms around tabooing words is tricky. In this case, my sense is that some people are using the word in a relatively meaningless way that is actively unhelpful, but also that some people are using the word to mean something quite important, and it's not great to remove the word for the second group. I think if you want to move toward people not using the word, you will get more buy-in if you include a proposed alternative for the second group, and in the absence there should mostly be a move toward "regularly ask someone to taboo the word" so that you can distinguish between the two kinds of uses.

"Undignified" is really vague

I sometimes see/hear people say that "X would be a really undignified". I mostly don't really know what this means? I think it means "if I told someone that I did X, I would feel a bit embarassed." It's not really an argument against X. It's not dissimilar to saying "vibes are off with X".

Not saying you should never say it, but basically every use I see could/should be replaced with something more specific.

9Zach Stein-Perlman
I disagree with Ben. I think the usage that Mark is talking about is a reference to Death with Dignity. A central example (written by me) is It's playful and unserious but "X would be undignified" roughly means "it would be an unfortunate error if we did X or let X happen" and is used in the context of AI doom and our ability to affect P(doom).
5Ben Pace
But it's a very important concept! It means doing something that breaks your ability to respect yourself. For instance, you might want to win a political election, and you think you can win on policies and because people trust you, but you're losing, and so you consider using attack-ads or telling lies or selling out to rich people who you believe are corrupt. You can actually do these and get away with it, and they're bad in different ways, but one of the ways it's bad is you no longer are acting in a way where you relate to yourself as someone deserving of respect. Which is bad for the rest of your life, where you'll probably treat yourself poorly and implicitly encourage others to treat you poorly as well. Who wants to work with someone or be married to someone or be friends with someone that they do not respect? I care about people's preferences and thoughts less when I do not respect them, and I will probably care about my own less if I do not respect myself, and implicitly encourage others to not treat me as worthy of respect as well (e.g. "I get why you don't want to be in a relationship for me; I wouldn't want to be in a relationship with me.") To live well and trade with others it is important to be a person worthy of basic respect, and not doing undignified things ("this is beneath me") is how you maintain this. 

Yeah I didn’t really use good words. I mean something more like “make your identity fit yourself better” which often involves making it smaller by removing false beliefs about constraints, but also involves making it larger in some ways, eg uncovering new passions.

I was intending to warn about the possibility of future perception of corruption, e.g. after a non-existential AI catastrophe. I do not think anyone currently working at safety teams is percieved as that "corrupted", although I do think there is mild negative sentiment among some online communities (some parts of twitter, reddit, etc.).

Basically (2), very small amounts of (1) (perhaps qualitatively similar to the amount of (1) you would apply to e.g. people joining US AISI or UK AISI)

Mark Xu*Ω418027

AI safety researchers might be allocated too heavily to Anthropic compared to Google Deepmind

Some considerations:

  • Safety researchers should want Google Deepmind (GDM) to have a robust and flourishing safety department. It seems plausible that GDM will be able to create "the smartest" models: they have lots of talent, and own lots of computers. (see e.g. https://epochai.org/data/notable-ai-models#computing-capacity)
  • Anthropic (ANT) might run into trouble in the future due to not owning their own computers, e.g. if Amazon (or where ever they're renting th
... (read more)
2Neel Nanda
The high level claim seems pretty true to me. Come to the GDM alignment team, it's great over here! It seems quite important to me that all AGI labs have good safety teams Thanks for writing the post!
evhubΩ7114

Some possible counterpoints:

  • Centralization might actually be good if you believe there are compounding returns to having lots of really strong safety researchers in one spot working together, e.g. in terms of having other really good people to work with, learn from, and give you feedback.
  • My guess would be that Anthropic resources its safety teams substantially more than GDM in terms of e.g. compute per researcher (though I'm not positive of this).
  • I think the object-level research productivity concerns probably dominate, but if you're thinking about inf
... (read more)
9Raemon
I think two major cruxes for me here are: * is it actually tractable to affect Deepmind's culture and organizational decisionmaking * how close to the threshold is Anthropic for having a good enough safety culture? My current best guess is that Anthropic is still under the threshold for good enough safety culture (despite seeming better than I expected in a number of ways), and meanwhile that Deepmind is just too intractably far gone.  I think people should be hesitant to work at any scaling lab, but, I think Anthropic might be possible to make "the one actually good scaling lab", and I don't currently expect that to be tractable at Deepmind and I think "having at least one" seems good for the world (although it's a bit hard for me to articulate why at the moment) I am interested in hearing details about Deepmind that anyone thinks should change my mind about this. This viewpoint is based on having spent at least 10s of hours trying to learn and about influence both org's culture, at various times. In both cases, I don't get the sense that people at the orgs really have a visceral sense that "decisionmaking processes can be fake", I think they will be fake by default and the org is better modeled as following general incentives, and DeepMind has too many moving people and moving parts at a low enough density that it doesn't seem possible to fix. For me to change my mind about this, I would need to someone there to look me in the eye and explain that they do have a visceral sense of how organizational decisionmaking processes can be fake, and why they nonetheless think DeepMind is tractable to fix. I assume it's hard for @Rohin Shah and @Neel Nanda can't really say anything publicly that's capable of changing my mind for various confidentiality and political reasons, but, like, that's my crux. (conving me in more general terms "Ray, you're too pessimistic about org culture" would hypothetically somehow work, but, you have a lot of work to do given how thorou
8Scott Emmons
Great post. I'm on GDM's new AI safety and alignment team in the Bay Area and hope readers will consider joining us! What evidence is there that working at a scaling lab risks creating a "corrupted" perception? When I try thinking of examples, the people that come to my mind seem to have quite successfully transitioned from working at a scaling lab to doing nonprofit / government work. For example: * Paul Christiano went from OpenAI to the nonprofit Alignment Research Center (ARC) to head of AI safety at the US AI Safety Institute. * Geoffrey Irving worked at Google Brain, OpenAI, and Google DeepMind. Geoffrey is now Chief Scientist at the UK AI Safety Institute. * Beth Barnes worked at DeepMind and OpenAI and is now founder and head of research at Model Evaluation and Threat Research (METR).
9[anonymous]
I largely agree with this take & also think that people often aren't aware of some of GDM's bright spots from a safety perspective. My guess is that most people overestimate the degree to which ANT>GDM from a safety perspective. For example, I think GDM has been thinking more about international coordination than ANT. Demis has said that he supports a "CERN for AI" model, and GDM's governance team (led by Allan Dafoe) has written a few pieces about international coordination proposals. ANT has said very little about international coordination. It's much harder to get a sense of where ANT's policy team is at. My guess is that they are less enthusiastic about international coordination relative to GDM and more enthusiastic about things like RSPs, safety cases, and letting scaling labs continue unless/until there is clearer empirical evidence of loss of control risks.  I also think GDM deserves some praise for engaging publicly with arguments about AGI ruin and threat models. (On the other hand, GDM is ultimately controlled by Google, which makes it unclear how important Demis's opinions or Allan's work will be. Also, my impression is that Google was neutral or against SB1047, whereas ANT eventually said that the benefits outweighed the costs.)
3Arthur Conmy
> think hard about how joining a scaling lab might inhibit their future careers by e.g. creating a perception they are “corrupted” Does this mean something like: 1. People who join scaling labs can have their values drift, and future safety employers will suspect by-default that ex-scaling lab staff have had their values drift, or   2. If there is a non-existential AGI disaster, scaling lab staff will be looked down upon or something else entirely?
kaveΩ11192

ANT has a stronger safety culture, and so it is a more pleasant experience to work at ANT for the average safety researcher. This suggests that there might be a systematic bias towards ANT that pulls away from the "optimal allocation".

I think this depends on whether you think AI safety at a lab is more of an O-ring process or a swiss-cheese process. Also, if you think it's more of an O-ring process, you might be generally less excited about working at a scaling lab.

idk how much value that adds over this shortform, and I currently find AI prose a bit nauseating.

1Knight Lee
That's fair. To be honest I've only used AI for writing code, I merely heard about other people having success with AI drafts. Maybe their situation was different, or they were bad at English to the point that AI writes better than them.

Hiliariously, it seems likely that our disagreement is even more meta, on the question of "how do you know when you have enough information to know", or potentially even higher, e.g. "how much uncertainty should one have given that they think they know" etc.

I think I disagree with your model of importance. If your goal is the make a sum of numbers small, then you want to focus your efforts where the derivative is lowest (highest? signs are hard), not where the absolute magnitude is highest.

The "epsilon fallacy" can be committed in both directions: both in that any negative dervative is worth working on, and that any extremely large number is worth taking a chance to try to improve.

I also seperately think that "bottleneck" is not generally a good term to apply to a complex project with high amounts of technica... (read more)

3johnswentworth
I do think that "we don't have enough information to know where the bottlenecks are yet" is in-general a reasonable counterargument to a "just focus on the bottlenecks" approach (insofar as we in fact do not yet have enough information). In this case I think we do have enough information, so that's perhaps a deeper crux.

related to the claim that "all models are meta-models", in that they are objects capable of e.g evaluating how applicable they are for making a given prediction. E.g. "newtonian mechanics" also carries along with it information about how if things are moving too fast, you need to add more noise to its predictions, i.e. it's less true/applicable/etc.

tentative claim: there are models of the world, which make predictions, and there is "how true they are", which is the amount of noise you fudge the model with to get lowest loss (maybe KL?) in expectation.

E.g. "the grocery store is 500m away" corresponds to "my dist over the grocery store is centered at 500m, but has some amount of noise"

2cubefox
So perhaps noise ≈ inverse of variance ≈ degree of truth?

related to the claim that "all models are meta-models", in that they are objects capable of e.g evaluating how applicable they are for making a given prediction. E.g. "newtonian mechanics" also carries along with it information about how if things are moving too fast, you need to add more noise to its predictions, i.e. it's less true/applicable/etc.

My vague plan along these lines is to attempt as hard as possible to defer all philosophically confusing questions to the "long reflection", and to use AI control as a tool to help produce AIs that can help preserve long term option value (including philosophical option value) as best as possible.

I seperately have hope we can solve "the entire problem" at some point, e.g. through ARC's agenda (which I spend most of my time trying to derisk and advance).

yep agreed, I have a bunch of vague plans in this direction. I most generally think that AI control is a pretty good tool in the toolbox, and is unlikely to make things much worse but plausibly makes things much better.

I agree it is better work on bottlenecks than non-bottlenecks. I have high uncertainty about where such bottlenecks will be, and I think sufficiently low amounts of work have gone into "control" that it's obviously worth investing more, because e.g. I think it'll let us get more data on where bottlenecks are.

2Mark Xu
see my longer comment https://www.lesswrong.com/posts/A79wykDjr4pcYy9K7/mark-xu-s-shortform#8qjN3Mb8xmJxx59ZG

Yes, I agree. If I had more time, this would have been a top-level post. If anyone reading wants to write such a post using my quick take as a base, I would be happy to take a look and offer comments. I might do it myself at some point as well.

1Knight Lee
I'm not sure if this is allowed here, but maybe you can ask an AI to write a draft and manually proofread for mistakes?
3MondSemmel
Just make it a full post without doing much if any editing, and link to this quick take and its comments when you do. A polished full post is better than an unpolished one, but an unpolished one is better than none at all.
Mark Xu*Ω6814733

Alignment researchers should think hard about switching to working on AI Control

I think Redwood Research’s recent work on AI control really “hits it out of the park”, and they have identified a tractable and neglected intervention that can make AI go a lot better. Obviously we should shift labor until the marginal unit of research in either area decreases P(doom) by the same amount. I think that implies lots of alignment researchers should shift to AI control type work, and would naively guess that the equilibrium is close to 50/50 across people who a... (read more)

1MichaelDickens
This is a good and important point. I don't have a strong opinion on whether you're right, but one counterpoint: AI companies are already well-incentivized to figure out how to control AI, because (as Wei Dai said) controllable AI is more economically useful. It makes more sense for nonprofits / independent researchers to do work that AI companies wouldn't do otherwise.
9Rohin Shah
This seems to presume that you can divide up research topics into "alignment" vs "control" but this seems wrong to me. E.g. my categorization would be something like: * Clearly alignment: debate theory, certain flavors of process supervision * Clearly control: removing affordances (e.g. "don't connect the model to the Internet") * Could be either one: interpretability, critique models (in control this is called "untrusted monitoring"), most conceptions of ELK, generating inputs on which models behave badly, anomaly detection, capability evaluations, faithful chain of thought, ... Redwood (I think Buck?) sometimes talks about how labs should have the A-team on control and the B-team on alignment, and I have the same complaint about that claim. It doesn't make much sense for research, most of which helps with both. It does make sense as a distinction for "what plan will you implement in practice" -- but labs have said very little publicly about that. Other things that characterize work done under the name of "control" so far are (1) it tries to be very concrete about its threat models, to a greater degree than most other work in AI safety, and (2) it tries to do assurance, taking a very worst case approach. Maybe you're saying that people should do those things more, but this seems way more contentious and I'd probably just straightforwardly disagree with the strength of your recommendation (though probably not its direction). Nitpick: I would also quibble with your definitions; under your definitions, control seems like a subset of alignment (the one exception if you notice the model is scheming and then simply stop using AI). I think you really have to define alignment as models reliably doing what you want independent of the surrounding context, or talk about "trying to do what you want" (which only makes sense when applied to models, so has similar upshots). Tbc I like control and think more effort should be put into it; I just disagree with the strength o

quick analogy: if the sum of a bunch of numbers is large, there doesn’t need to be any individual number that is large; similarly, if the consequences of a sequence of actions results in a large change, no individual action needs to be “pivotal”

This feels like a pretty central cruxy point - and not just for the relevance of the pivotal act framing specifically. I think it's underlying a whole difference of worldview or problem-solving approach.

A couple other points in a similar direction:

... (read more)

I really think if you want to tell a story of AI Control work being good (especially compared to working on preventing AGI from being built in the first place), the important and difficult part is figuring out how to actually use these more powerful AI systems to either achieve some kind of global moratorium, or make unprecedented progress on the hard parts of the AI Alignment problem. 

When I see most people start thinking about control, I rarely see them interface with either of these two things, and honestly, I mostly see them come up with cool addi... (read more)

1AI_Symbiote
This post raises an important perspective on the practicalities of AI Control versus Alignment. Given the potential for AI to function productively even when not fully aligned, do you believe that current AI control methods are scalable enough to handle future AGI systems? Additionally, what would be the main challenges in ensuring that AI control strategies are robust in highly dynamic or emergency scenarios?
Wei DaiΩ246238

Better control solutions make AI more economically useful, which speeds up the AI race and makes it even harder to do an AI pause.

When we have controlled unaligned AIs doing economically useful work, they probably won't be very useful for solving alignment. Alignment will still be philosophically confusing, and it will be hard to trust the alignment work done by such AIs. Such AIs can help solve some parts of alignment problems, parts that are easy to verify, but alignment as a whole will still be bottle-necked on philosophically confusing, hard to verify ... (read more)

2Chris_Leong
I would suggest 50% of researchers working on a broader definition of control: including "control", technical governance work and technical outreach (scary demos, model organisms of misalignment). 
2jacquesthibs
I'm in the process of trying to build an org focused on "automated/augmented alignment research." As part of that, I've been thinking about which alignment research agendas could be investigated in order to make automated alignment safer and trustworthy. And so, I've been thinking of doing internal research on AI control/security and using that research internally to build parts of the system I intend to build. I figured this would be a useful test case for applying the AI control agenda and iterating on issues we face in implementation, and then sharing those insights with the wider community. Would love to talk to anyone who has thoughts on this or who would introduce me to someone who would fund this kind of work.
TsviBTΩ12214

The intelligence explosion might happen with less-fully-AGI AIs, who will also be doing some alignment work on the side. It’s important for them to not escape and do other bad stuff until they’ve solve alignment. We can give ourselves more time to use smart AIs to help with alignment if we have better AI control.

Well, this would be the lone crux. The rest of the stuff you wrote is about non-exploding AI, right? And is therefore irrelevant to the thing about everyone dying, except insofar as controlled non-exploding AI can help prevent uncontrolled exploding AI from killing everyone?

2cubefox
I don't see a significant difference in your distinction between alignment and control. If you say alignment is about doing what you want (which I strongly disagree with in its generality, e.g. when someone might want to murder or torture people or otherwise act unethically), that obviously includes your wanting to "be OK" when the AI didn't do exactly what you want. Alignment comes in degrees, and you merely seem to equate control with non-perfect alignment and alignment with perfect alignment. Or I might be misunderstanding what you have in mind.

This topic is important enough that you could consider making a full post.

My belief is that this would improve reach, and also make it easier for people to reference your arguments. 

Consider, you believe there is a 45% chance that alignment researchers would be better suited pivoting to control research. I do not suspect a quick take will reach anywhere close to that number, and has a low chance of catalysing dramatic, institutional level change. 

2Cole Wyeth
I disagree with this take. A.I. control will only be important in a multipolar situation in which no single A.I. system can create a gray goo catastrophe etc. But if such pivotal acts are impossible and no singular A.I. takes control, but instead many A.I.’s are competing, than some groups will develop better or worse control for economic reasons and it won’t affect existential risk much to work on it now. I don’t think I can see a situation where control matters - only a few players have A.G.I. for a very long time and none escape or are open sourced but also none gain a decisive advantage?
4Noosphere89
I agree with bullet points 1, 2, 3, 6 and 7, partially agree with bullet point 5, and disagree with bullet point 4. Thus, I agree with the central claim here: For more of my analysis on the bullet points, read the rest of the comment. For bullet point 1, I basically agree with this, mostly due to not favoring binary assumptions and instead prefer continuous quantity reasoning, which tends to both be a better match for the IRL world, and also continuous quantity reasoning gives you more information than binary outcomes. I really like bullet point 2, and also think that even in a scenario where it's easy to prevent defection, you should still have controls that make defecting employees have much less reward and much more punishment for subversive actions. I deeply agree with point 3, and I'd frame AI control in one of 2 ways: 1. As a replacement for the pivotal act concept. 2. As a pivotal act that doesn't require destruction or death, and doesn't require you to overthrow nations in your quest. A nitpick: AI labor will be the huge majority of alignment progress in every stage, not just the early stage. I think one big reason the pivotal act frame dominated a lot of discussions is the assumption that we would get a pure software singularity which would FOOM in several weeks, but reality is shaping up to not be a pure software-singularity, since physical stuff like robotics and data centers still matters. There's a reason why every hyperscaler is trying to get large amounts of power and datacenter compute contracts, because they realize that the singularity is bottlenecked currently on power and to a lesser extent compute. I disagree with 4, but that's due to my views on alignment, which tend to view it as a significantly easier problem than the median LWer does, and in particular I view essentially 0 need for philosophical deconfusion to make the future go well. I agree that AI control enhances alignment arguments universally, and provides more compelling

I directionally agree with this (and think it's good to write about this more, strongly upvoted!)

For clarity, I would distinguish between two control-related ideas more explicitly when talking about how much work should go into what area:

  1. "ensuring that if the AIs are not aligned [...], then you are still OK" (which I think is the main meaning of "AI control")
  2. Making ~worst-case assumptions about things like neural representations or inductive biases (which in practice means you likely rely on black-box methods, as in Redwood's existing work on control).

I th... (read more)

Yes I agree with what you have written, and do think it’s overall not that likely that everything pans out as hoped. We do also have other hopes for how this general picture can still cohere if the specific path doesn’t work out, eg we’re open to learning some stuff empirically and adding an “algorithmic cherry on top” to produce the estimate.

The literature review is very strange to me. Where is the section on certified robustness against epsilon-ball adversarial examples? The techniques used in that literature (e.g. interval propagation) are nearly identical to what you discuss here.

I was meaning to include such a section, but forgot :). Perhaps I will edit it in. I think such work is qualitatively similar to what we're trying to do, but that the key difference is that we're interested in "best guess" estimates, as opposed to formally verified-to-be-correct estimates (mostly because we don'... (read more)

4Rohin Shah
Fwiw the thing that scares me is the combination of three things: 1. Producing an accurate estimate requires "understanding" the input 2. Our standard for the quality of estimates will be very high 3. The approach is purely algorithmic (rather than e.g. using the flexible intelligence of LLMs / humans) You presumably agree with (1) and (3). For (2), it seems like there's a lot of correlations you need to take into account for this to produce estimates that aren't wildly off: 1. Underestimates of risk: These are cases where the presumption of independence neglects correlations between things that happen for catastrophe-causing events. For example, maybe a model is planning to defect when RSA-2048 is factored, and it has N different consistency checks it runs to make sure we aren't fooling it about whether it has been factored. The method has to see that there's a common cause for all the consistency checks to return 1, otherwise it will underestimate risk by a factor exponential in N. (And tbc I think even separately from more exotic examples like RSA-2048, it will be relatively common for this pattern to arise with networks, e.g. it happens any time that in reality there is a common cause that produces many pieces of evidence, and then the network aggregates those pieces of evidence to infer the common cause.) 2. Overestimates of risk: These are cases where the presumption of independence neglects correlations that rule out catastrophe, and so lead to overestimates. For example, the model might be willing to make up fake instructions for creating bioweapons to help with scifi writing, but never provide real instructions. So the method needs to capture the "understanding" of which instructions are fake vs real. I agree this isn't a proof of impossibility, since a purely algorithmic approach (SGD) produced the "understanding" in the first place, so in theory a purely algorithmic approach could still capture all that understanding to produce accurate estimates

I don’t think Paul thinks verification is generally easy or that delegation is fundamentally viable. He, for example, doesn’t suck at hiring because he thinks it’s in fact a hard problem to verify if someone is good at their job.

I liked Rohin's comment elsewhere on this general thread.

I’m happy to answer more specific questions, although would generally feel more comfortable answering questions about my views then about Paul’s.

If you're commited to producing a powerful AI then the thing that matters is the probability there exists something you can't find that will kill you. I think our current understanding is sufficiently paltry that the chance of this working is pretty low (the value added by doing selection on non-deceptive behavior is probably very small, but I think there's a decent chance you just won't get that much deception). But you can also get evidence about the propensity for your training process to produce deceptive AIs and stop producing them until you develop b... (read more)

4Ben Pace
It seems plausible to both of us that you can use some straightforward selection against straightforward deception and end up succeeding, up to a certain power level, and that marginal research on how to do this improves your odds. But: * I think there's a power level where it definitely doesn't work, for the sort of ontological reasons alluded to here whereby[1] useful cognition for achieving an AI's goals will optimize against you understanding it even without it needing to be tagged as deceptive or for the AI to have any self-awareness of this property. * I also think it's always a terrifying bet to make due to the adversarialness, whereby you may get a great deal of evidence consistent with it all going quite rosy right up until it dramatically fails (e.g. FTX was an insanely good investment according to financial investors and Effective Altruists right up until it was the worst investment they'd ever made and these people were not stupid). These reasons give me a sense of naivety to betting on "trying to straightforwardly select against deceptiveness" that "but a lot of the time it's easier for me to verify the deceptive behavior than for the AI to generate it!" doesn't fully grapple with, even while it's hard to point to the exact step whereby I imagine such AI developers getting tricked. ...however my sense from the first half of your comment ("I think our current understanding is sufficiently paltry that the chance of this working is pretty low") is that we're broadly in agreement about the odds of betting on this (even though I kind of expect you would articulate why quite differently to how I did). You then write: Certainly being able to show that an AI is behaving deceptively in a way that is hard to train out will in some worlds be useful for pausing AI capabilities progress, though I think this not a great set of world to be betting on ending up in — I think it more likely than not that an AI company would willingly deploy many such AIs.  Be tha

For any given system, you have some distribution over which properties will be necessary to verify in order to not die to that system. Some of those you will in fact be able to verify, thereby obtaining evidence about whether that system is dangerous. “Strategic deception” is a large set of features, some of which are possible to verify.

6Ben Pace
I'm hearing you say "If there's lots of types of ways to do strategic deception, and we can easily verify the presence (or lack) of a wide variety of them, this probably give us a good shot of selecting against all strategically deceptive AIs in our selection process". And I'm hearing John's position as "At a sufficient power level, if a single one of them gets through your training process you're screwed. And some of the types will be very hard to verify the presence of." And then I'm left with an open question as to whether the former is sufficient to prevent the latter, on which my model of Mark is optimistic (i.e. gives it >30% chance of working) and John is pessimistic (i.e. gives it <5% chance of working).

yes, you would need the catastrophe detector to be reasonably robust. Although I think it's fine if e.g. you have at least 1/million chance of catching any particular catastrophe.

I think there is a gap, but that the gap is probably not that bad (for "worst case" tail risk estimation). That is maybe because I think being able to do estimation through a single forward pass is likely already to be very hard, and to require being able to do "abstractions" over the concepts being manipulated by the forward pass. CoT seems like it will require vaguely similar struction of a qualitatively similar kind.

I think there are some easy-to-verify properties that would make us more likely to die if they were hard-to-verify. And therefore think "verification is easier than generation" is an important part of the overall landscape of AI risk.

I think both that:

  • this is not a good characterization of Paul's views
  • verification is typically easier than generalization and this fact is important for the overall picture for AI risk

I also think that this post is pulling a bit of a motte-and-bailey, although not really in the sense that John claims he is making in argument in the post:

  • the motte: there exist hard to verify properties
  • the bailey: all/most important properties are hard to verify
2Raemon
(I didn't want to press it since your first comment sounded like you were kinda busy, but I am interested in hearing more details about this)

I also think that this post is pulling a bit of a motte-and-bailey, although not really in the sense that John claims he is making in argument in the post:

  • the motte: there exist hard to verify properties
  • the bailey: all/most important properties are hard to verify

I don't think I am trying to claim that bailey at all. For purposes of AI risk, if there is even just one single property of a given system which is both (a) necessary for us to not die to that system, and (b) hard to verify, then difficulty of verification is a blocking issue for outsourcing align... (read more)

2Noosphere89
As someone who has used this insight that verification is easier than generation before, I heartily support this point: One of my worked examples of this being important is that this was part of my argument on why AI alignment generalizes further than AI capabilities, where in this context it's much easier and more reliable to give feedback on whether a situation was good for my values, than to actually act on the situation itself. Indeed, it's so much easier that social reformers tend to fall into the trap of thinking that just because you can verify something is right or wrong means you can just create new right social norms just as easily, when the latter problem is much harder than the former problem. This link is where I got this quote: https://www.beren.io/2024-05-15-Alignment-Likely-Generalizes-Further-Than-Capabilities/ I also agree that these 2 theses are worth distinguishing:

I agree ergonimics can be hard to verify. But some ergonomics are easy to verify, and chairs conform to those ergonomics (e.g. having a backrest is good, not having sharp stabby parts are good, etc.).

I mean, sure, for any given X there will be some desirable properties of X which are easy to verify, and it's usually pretty easy to outsource the creation of an X which satisfies the easy-to-verify properties. The problem is that the easy-to-verify properties do not typically include all the properties which are important to us. Ergonomics is a very typical example.

Extending to AI: sure, there will be some desirable properties of AI which are easy to verify, or properties of alignment research which are easy to verify, or properties of plans which are eas... (read more)

I agree that there are some properties of objects that are hard to verify. But that doesn't mean generation is as hard as verification in general. The central property of a chair (that you can sit on it) is easy to verify.

I think catastrophe detectors in practice will be composed of neural networks interacting with other stuff, like scientific literature, python, etc.

With respect to the stuff quoted, I think all but "doing experiments" can be done with a neural net doing chain of thought (although not making claims about quality).

I think we're trying to solve a different problem than trusted monitoring, but I'm not that knowledgeable about what issues trusted monitoring is trying to solve. The main thing that I don't think you can do with monitoring is producing a model tha... (read more)

2Fabien Roger
Thanks for your explanations! Good point. I agree that if you don't have time or resources to train the policy against the monitor on-the-fly, doing the catastrophe-suppression upfront is very nice. It is worth noting that if "training against the monitor" didn't work because the monitor was not robust enough, then my understanding is that the same issue would plague the tail-risk estimation approach: for optimization against tail-risk to not get gamed, you not only need the tail-risk estimation to be robust, but you also need the catastrophe detector to be robust. Is that right? I would have thought there is a big gap between "doing tail-risk estimation for an NN forward pass" and "doing tail-risk estimation for an NN CoT".  Argmax (the one in T=0 sampling) is not a nice non-linearity, and my guess is that none of the hopes for NN tail-risk estimation present in the post allow you to make claims about weak model CoT. If you do layer-by-layer modeling (going through n_layers x n_sampled_tokens steps), I would be surprised if you ended up being able to show sth like "after careful consideration, and despite the naive warning signs (which you need to have, otherwise the CoT is not very useful), the CoT always converges to saying that there is no catastrophe" - it's not like in single-forward-pass land where there is a clear sense in which the activations should not go in certain regions. I haven't thought about it much, so maybe I am too pessimistic about dealing with the argmax?

This post uses "I can identify ways in which chairs are bad" as an example. But it's easier for me to verify that I can sit in a chair and that it's comfortable then to make a chair myself. So I don't really know why this is a good example for "verification is easier than generation".

More examples:

  • I can tell my computer is a good typing machine, but cannot make one myself
  • I can tell a waterbottle is water tight, but do not know how to make a water bottle
  • I can tell that my pepper grinder grinds pepper, but do not know how to make a pepper grinder.

If th... (read more)

7jmh
I assumed John was pointing at verifying that perhaps the chemicals used in the production of the chair might have some really bad impact on the environmnet, start causing a problem with the food chain eco system and make food much scarcers for everyone -- including the person who bought the chair -- in the meaningfully near future. Something a long those lines.  As you note, verifying the chair functions as you want -- as a place to sit that is comfortable -- is pretty easy. Most of us probably do that without even really thinking about it. But will this chair "kill me" in the future is not so obvious or easy to assess. I suspect at the core, this is a question about an assumption about evaluating a simple/non-complex world and doing so in an inherently complex world do doesn't allow true separability in simple and independant structures.
3AprilSR
This feels more like an argument that Wentworth's model is low-resolution than that he's actually misidentified where the disagreement is?

I think "basically obviates" is too strong. imitation of human-legible cognitive strategies + RL seems liable to produce very different systems that would been produced with pure RL. For example, in the first case, RL incentizes the strategies being combine in ways conducive to accuracy (in addition to potentailly incentivizing non-human-legible cognitive strategies), whereas in the second case you don't get any incentive towards productively useing human-legible cogntive strategies.

I don't think this characterization is accurate at all, but don't think I can explain the disagreement well enough for it to be productive.

4Raemon
I had interpeted your initial comment to mean "this post doesn't accurately characterize Pauls views" (as opposed to "John is confused/wrong about the object level of 'is verification easier than generation' in a way that is relevant for modeling AI outcomes") I think your comment elsethread was mostly commenting on the object level. I'm currently unsure if your line "I don't think this characterization is accurate at all" was about the object level, or about whether this post successfully articulates a difference in Paul's views vs Johns. 
4Ben Pace
You can't explain the disagreement well-enough to be productive yet! I have faith that you may be able to in the future, even if not now. For reference, readers can see Paul and John debate this a little in this 2022 thread on AIs doing alignment research.

if you train on (x, f(x)) pairs, and you ask it to predict f(x') on some novel input x', and also to write down what it thinks f is, do you know if these answers will be consistent? For instance, the model could get f wrong, and also give the wrong prediction for f(x), but it would be interesting if the prediction for f(x) was "connected" to it's sense of what f was.

4Owain_Evans
Good question. I expect you would find some degree of consistency here. Johannes or Dami might be able to some results on this.
1silentbob
Indeed! I think I remember having read that a while ago. A different phrasing I like to use is "Do you have a favorite movie?", because many people actually do and then are happy to share it, and if they don't, they naturally fall back on something like "No, but I recently watched X and it was great" or so.

It's important to distinguish between:

  • the strategy of "copy P2's strategy" is a good strategy
  • because P2 had a good strategy, there exists a good strategy for P1

Strategy stealing assumption isn't saying that copying strategies is a good strategy, it's saying the possibility of copying means that there exists a strategy P1 can take that is just a good as P2.

2Nathan Helm-Burger
Yes, that's true. But I feel like the post doesn't seem to address this. The first-mover-only strategy I think the Rogue AI team is going to be considering as one of its top options is, "Wipe out humanity (except for a few loyal servants) with an single unblockable first strike". The copy-strategy that I think humanity should pursue here is, "Wipe out the Rogue AI with overwhelming force." Of course, this requires humanity to even know that the Rogue AI team exists and is contemplating a first strike. That's not an easy thing to accomplish, because an earlier strategy that the Rogue AI team is likely to be pursuing is "hide from the powerful opposed group that currently has control over 99% of the world's resources."

You could instead ask whether or not the observer could predict the location of a single particle p0, perhaps stipulating that p0 isn't the particle that's randomly perturbed.

My guess is that a random 1 angstrom perturbation is enough so that p0's location after 20s is ~uniform. This question seems easier to answer, and I wouldn't really be surprised if the answer is no?

Here's a really rough estimate: This says 10^{10} s^{-1} per collision, so 3s after start ~everything will have hit the randomly perturbed particle, and then there are 17 * 10^{10} more col... (read more)

The system is chaotic, so the uncertainty increases exponentially with each collision. Also atoms are only about 1 angstrom wide, so the first unpredictable collision p0 makes will send it flying in some random direction, and totally miss the 2nd atom it should have collided with, instead probably hitting some other atom.

3habryka
Doing it for one particle seems like it would be harder than doing it for all particles, since even if you are highly uncertain about each individual particle, in-aggregate that could still produce a quite high confidence about which side has more particles. So my guess is it matters a lot whether it's almost uniform or not.

The bounty is still active. (I work at ARC)

Humans going about their business without regard for plants and animals has historically not been that great for a lot of them.

2Gesild Muka
I’m not sure the analogy works or fully answers my question. The equilibrium that comes with ‘humans going about their business’ might favor human proliferation at the cost of plant and animal species (and even lead to apocalyptic ends) but the way I understand it the difference in intelligence from human and superintelligence is comparable to humans and bacteria, rather than human and insect or animal. I can imagine practical ways there might be friction between humans and SI, for resource appropriation for example, but the difference in resource use would also be analogous to collective human society vs bacteria. The resource use of an SI would be so massive that the puny humans can go about their business. Am I not understanding it right? Or missing something?

Here are some things I think you can do:

  • Train a model to be really dumb unless I prepend a random secret string. The goverment doesn't have this string, so I'll be able to predict my model and pass their eval. Some precedent in: https://en.wikipedia.org/wiki/Volkswagen_emissions_scandal

  • I can predict a single matrix multiply just by memorizing the weights, and I can predict ReLU, and I'm allowed to use helper AIs.

  • I just train really really hard on imitating 1 particular individual, then have them just say whatever first comes to mind.

2TurnTrout
Thanks for the comment! Quick reacts: I'm concerned about the first bullet, not about 2, and bullet 3 seems to ignore top-k probability prediction requirements (the requirement isn't to just ID the most probable next token). Maybe there's a recovery of bullet 3 somehow, though?

You have to specify your backdoor defense before the attacker picks which input to backdoor.

I think Luke told your mushroom story to me. Defs not a coincidence.

Load More