All of Thomas Larsen's Comments + Replies

Thanks - I see, I was misunderstanding. 

Proposal part 1: Shoggoth/Face Distinction: Instead of having one model undergo agency training, we have two copies of the base model work together, specializing in different parts of the job, and undergo training together. Specifically we have the "shoggoth" copy responsible for generating all the 'reasoning' or 'internal' CoT, and then we have the "face" copy responsible for the 'actions' or 'external' outputs. So e.g. in a conversation with the user, the Shoggoth would see the prompt and output a bunch of reasoning token CoT; the Face would see the prom

... (read more)
4Daniel Kokotajlo
I think you may be misunderstanding the proposal; I should clarify, sorry: The proposal is to blind the evaluation process to the internal reasoning, NOT to not train the internal reasoning! The internal reasoning will of course be trained, it's just that the process that evaluates it will be blind to it, and instead just look at the external outputs + outcomes. (That is, it looks at the outputs of the face, not the outputs of the shoggoth. The outputs of the shoggoth are inputs to the face.) For example, you could generate a billion trajectories of agentic behavior doing various tasks and chatting with various users, evaluate them, and then train the model to imitate the top 20% of them, and then repeat.
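
A minimal sketch of that loop, assuming hypothetical stand-ins (shoggoth, face, evaluate_external, finetune_on are placeholders, not a real API); the key property is that the selection signal only ever sees the Face's external output:

```python
def training_round(shoggoth, face, prompts, evaluate_external, finetune_on,
                   keep_fraction=0.2):
    """One round of blind-evaluation training: generate, score externally, imitate."""
    trajectories = []
    for prompt in prompts:
        reasoning = shoggoth.generate(prompt)       # internal CoT, never shown to the evaluator
        output = face.generate(prompt + reasoning)  # external action / user-visible reply
        score = evaluate_external(prompt, output)   # evaluator sees only prompt + Face output
        trajectories.append((prompt, reasoning, output, score))

    # Keep the top fraction by external score and train both models to imitate them.
    trajectories.sort(key=lambda t: t[3], reverse=True)
    top = trajectories[: max(1, int(keep_fraction * len(trajectories)))]

    # The Shoggoth's reasoning *is* trained (it appears in the imitation targets),
    # but the evaluation that selected these trajectories never looked at it.
    finetune_on(shoggoth, [(p, r) for p, r, o, s in top])
    finetune_on(face, [(p + r, o) for p, r, o, s in top])
```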

I think a problem with all the proposed terms is that they are all binaries, and one bit of information is far too little to characterize takeoff: 

  • One person's "slow" is >10 years, another's is >6 months. 
  • The beginning and end points are super unclear; some people might want to put the end point near the limits of intelligence, some people might want to put the beginning points at >2x AI R&D speed, some at 10, etc. 
  • In general, a good description of takeoff should characterize capabilities at each point on the curve.  

So I d... (read more)

4Ben Pace
I support replacing binary terms with quantitative terms.
2Raemon
Fwiw I feel fine, with both slow/fast and smooth/sharp, thinking of it as a continuum. Takeoffs and timelines can be slower or faster and compared on that axis. I agree if you are just treating those as booleans you're gonna get confused, but the words seem about as scalar a shorthand as one could hope for without literally switching entirely to more explicit quantification.

Perhaps that was overstated. I think there is maybe a 2-5% chance that Anthropic directly causes an existential catastrophe (e.g. by building a misaligned AGI). Some reasoning for that: 

  1. I doubt Anthropic will continue to be in the lead because they are behind OAI/GDM in capital. They do seem around the frontier of AI models now, though, which might translate to increased returns, but it seems like they do best in very short timelines worlds. 
  2. I think that if they could cause an intelligence explosion, it is more likely than not that they would pau
... (read more)
4Ben Pace
FYI I believe the correct language is "directly causes an existential catastrophe". "Existential risk" is a measure of the probability of an existential catastrophe, but is not itself an event.

I think you probably under-rate the effect of having both a large number & concentration of very high quality researchers & engineers (more than OpenAI now, I think, and I wouldn't be too surprised if the concentration of high quality researchers was higher than at GDM), being free from corporate chafe, and also having many of those high quality researchers thinking (and perhaps being correct in thinking, I don't know) they're value aligned with the overall direction of the company at large. Probably also Nvidia rate-limiting the purchases of large... (read more)

I agree with Zach that Anthropic is the best frontier lab on safety, and I feel not very worried about Anthropic causing an AI related catastrophe. So I think the most important asks for Anthropic to make the world better are on its policy and comms. 


I think that Anthropic should more clearly state its beliefs about AGI, especially in its work on policy. For example, the SB-1047 letter they wrote states: 

Broad pre-harm enforcement. The current bill requires AI companies to design and implement SSPs that meet certain standards – for example they m

... (read more)
4Garrett Baker
This does not fit my model of your risk model. Why do you think this?
Thomas LarsenΩ342

Yeah, actual FLOPs are the baseline thing that's used in the EO. But the OpenAI/GDM/Anthropic RSPs all reference effective FLOPs. 

If there's a large algorithmic improvement, you might have a large gap in capability between two models with the same FLOPs, which is not desirable. Ideal thresholds in regulation / scaling policies are as tightly tied as possible to the risks. 

Another downside that FLOPs / E-FLOPs share is that it's unpredictable what capabilities a 1e26 or 1e28 FLOPs model will have.  And it's unclear what capabilities will emerge from a small bit of scaling: it's possible that within a 4x flop scaling you get high capabilities that had not appeared at all in the smaller model. 

Thomas LarsenΩ340

Credit: Mainly inspired by talking with Eli Lifland. Eli has a potentially-published-soon document here.  

The basic case against Effective-FLOP. 

  1. We're seeing many capabilities emerge from scaling AI models, and this makes compute (measured by FLOPs utilized) a natural unit for thresholding model capabilities. But compute is not a perfect proxy for capability because of algorithmic differences. Algorithmic progress can enable more performance out of a given amount of compute. This makes the idea of effective FLOP tempti
... (read more)
4habryka
Maybe I am being dumb, but why not do things on the basis of "actual FLOPs" instead of "effective FLOPs"? Seems like there is a relatively simple fact-of-the-matter about how many actual FLOPs were performed in the training of a model, and that seems like a reasonable basis on which to base regulation and evals.

The fact that AIs will be able to coordinate well with each other, and thereby choose to "merge" into a single agent

My response: I agree AIs will be able to coordinate with each other, but "ability to coordinate" seems like a continuous variable that we will apply pressure to incrementally, not something that we should expect to be roughly infinite right at the start. Current AIs are not able to "merge" with each other.

Ability to coordinate being continuous doesn't preclude sufficiently advanced AIs acting like a single agent. Why would it need to be infin... (read more)

4Matthew Barnett
If coordination ability increases incrementally over time, then we should see a gradual increase in the concentration of AI agency over time, rather than the sudden emergence of a single unified agent. To the extent this concentration happens incrementally, it will be predictable, the potential harms will be noticeable before getting too extreme, and we can take measures to pull back if we realize that the costs of continually increasing coordination abilities are too high. In my opinion, this makes the challenge here dramatically easier. (I'll add that paragraph to the outline, so that other people can understand what I'm saying) I'll also quote from a comment I wrote yesterday, which adds more context to this argument,

Thanks for the response! 

If instead of reward circuitry inducing human values, evolution directly selected over policies, I'd expect similar inner alignment failures.

I very strongly disagree with this. "Evolution directly selecting over policies" in an ML context would be equivalent to iterated random search, which is essentially a zeroth-order approximation to gradient descent. Under certain simplifying assumptions, they are actually equivalent. It's the loss landscape and parameter-function map that are responsible for most of a learning process's in

... (read more)

We haven't asked specific individuals if they're comfortable being named publicly yet, but if advisors are comfortable being named, I'll announce that soon. We're also in the process of having conversations with academics, AI ethics folks,  AI developers at small companies, and other civil society groups to discuss policy ideas with them.

So far, I'm confident that our proposals will not impede the vast majority of AI developers, but if we end up receiving feedback that this isn't true, we'll either rethink our proposals or remove this claim from our a... (read more)

4StellaAthena
It seems to me like you've received this feedback already in this very thread. The fact that you're going to edit the claim to basically say "this doesn't affect most people because most people don't work on LLMs" completely dodges the actual issue here, which is that there's a large non-profit and independent open source LLM community that this would heavily impact. I applaud your honesty in admitting one approach you might take is to "remove this claim from our advocacy efforts," but am quite sad to see that you don't seem to care about limiting the impact of your regulation to potentially dangerous models.
1[comment deleted]
7ryan_greenblatt
It seems to me that for AI regulation to have important effects, it probably has to affect many AI developers around the point where training more powerful AIs would be dangerous. So, if AI regulation is aiming to be useful in short timelines and AI is dangerous, it will probably have to affect most AI developers. And if policy requires a specific flop threshold or similar, then due to our vast uncertainty, that flop threshold will probably have to soon affect many AI developers.

My guess is that the criteria you establish would in fact affect a large number of AI developers soon (perhaps most people interested in working with SOTA open-source LLMs). In general, safe flop and performance thresholds unavoidably have to be pretty low to actually be sufficient slightly longer term. For instance, suppose that 10^27 flops is a dangerous amount of effective compute (relative to the performance of the GPT4 training run). Then, if algorithmic progress is 2x per year, 10^24 real flops is 10^27 effective flop in just 10 years.

I think you probably should note that this proposal is likely to affect the majority of people working with generative AI in the next 5-10 years. This seems basically unavoidable.
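
For concreteness, the arithmetic behind that last example, under the stated assumption that algorithmic efficiency doubles each year (so effective FLOP ≈ real FLOP × 2^years):

```python
# Sketch of the claim above: with 2x/year algorithmic progress, 1e24 real FLOP
# reaches 1e27 effective FLOP (GPT4-relative) in about a decade.
real_flop = 1e24
dangerous_effective_flop = 1e27

years = 0
while real_flop * 2 ** years < dangerous_effective_flop:
    years += 1
print(years)  # -> 10, since 2**10 = 1024 ≈ 1e3
```
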
Bleys1514

Already, there are dozens of fine-tuned Llama2 models scoring above 70 on MMLU. They are laughably far from threats. This does seem like an exceptionally low bar. GPT-4, given the right prompt crafting and adjusting for errors in MMLU, has just been shown to be capable of 89 on MMLU. It would not be surprising for Llama models to achieve >80 on MMLU in the next 6 months.

I think focusing on a benchmark like MMLU is not the right approach, and will be very quickly outmoded. If we look at the other criteria (which, as you propose it now, any and all are a ... (read more)

4teknium1
No, your proposal will affect nearly every LLM that has come out in the last 6 months. Llama, MPT, Falcon, RedPajama, OpenLlama, Qwen, StarCoder, have all trained on equal to or greater than 1T tokens. Did you do so little research that you had no idea about this to have made that original statement?

I’ve changed the wording to “Only a few technical labs (OpenAI, DeepMind, Meta, etc) and people working with their models would be regulated currently.” The point of this sentence is to emphasize that this definition still wouldn’t apply to the vast majority of AI development -- most AI development uses small systems, e.g. image classifiers, self driving cars, audio models, weather forecasting, the majority of AI used in health care, etc.

cfoster02219

Credit for changing the wording, but I still feel this does not adequately convey how sweeping the impact of the proposal would be if implemented as-is. Foundation model-related work is a sizeable and rapidly growing chunk of active AI development. Of the 15K pre-print papers posted on arXiv under the CS.AI category this year, 2K appear to be related to language models. The most popular Llama2 model weights alone have north of 500K downloads to date, and foundation-model related repos have been trending on Github for months. "People working with [a few tec... (read more)

(ETA: these are my personal opinions) 

Notes:

  1. We're going to make sure to exempt existing open source models. We're trying to avoid pushing the frontier of open source AI, not trying to put the models that are already out there back in the box, which I agree is intractable. 
  2. These are good points, and I decided to remove the data criterion for now in response to these considerations. 
  3. The definition of frontier AI is wide because it describes the set of models that the administration has legal authority over, not the set of models that would be r
... (read more)

Thanks! 

I spoke with a lot of other AI governance folks before launching, in part due to worries about the unilateralist's curse. I think that there is a chance this project ends up being damaging, either by being discordant with other actors in the space, committing political blunders, increasing the polarization of AI, etc. We're trying our best to mitigate these risks (and others) and are corresponding with some experienced DC folks who are giving us advice, as well as being generally risk-averse in how we act. That being said, some senior folks I've talked to are bearish on the project for reasons including the above. 

DM me if you'd be interested in more details, I can share more offline. 

Your current threshold does include all Llama models (other than llama-1 6.7/13 B sizes), since they were trained with > 1 trillion tokens. 

Yes, this reasoning was for capabilities benchmarks specifically. Data goes further with future algorithmic progress, so I thought a narrower criterion for that one was reasonable. 

I also think 70% on MMLU is extremely low, since that's about the level of ChatGPT 3.5, and that system is very far from posing a risk of catastrophe. 

This is the threshold for what the government has the ability to say no to, an... (read more)

1a3orn2418

This is the threshold for what the government has the ability to say no to, and is deliberately set well before catastrophe.

There are disadvantages to giving the government "the ability to say no" to models used by thousands of people. There are disadvantages even in a frame where AI-takeover is the only thing you care about!

For instance, if you give the government too expansive a concern such that it must approve many models "well before the threshold", then it will have thousands of requests thrown at it regularly, and it could (1) either try to scrutinize... (read more)

Yes, this reasoning was for capabilities benchmarks specifically. Data goes further with future algorithmic progress, so I thought a narrower criterion for that one was reasonable. 

So, you are deliberately targeting models such as LLama-2, then? Searching HuggingFace for "Llama-2" currently brings up 3276 models. As I understand the legislation you're proposing, each of these models would have to undergo government review, and the government would have the perpetual capacity to arbitrarily pull the plug on any of them.

I expect future small, open-source... (read more)

It's worth noting that this (and the other thresholds) are in place because we need a concrete legal definition for frontier AI, not because they exactly pin down which AI models are capable of catastrophe. It's probable that none of the current models are capable of catastrophe. We want a sufficiently inclusive definition such that the licensing authority has the legal power over any model that could be catastrophically risky.  

That being said -- Llama 2 is currently the best open-source model and it gets 68.9% on the MMLU. It seems relatively unimpo... (read more)

3Bleys
If you are specifically trying to just ensure that all big AI labs are under common oversight, the most direct way is via compute budget. E.g., any organization with compute budget >$100M allocated for AI research. Would capture all the big labs. (OpenAI spent >$400M on compute in 2022 alone). No need to complicate it with anything else.

Your current threshold does include all Llama models (other than llama-1 6.7/13 B sizes), since they were trained with > 1 trillion tokens. 
 

I also think 70% on MMLU is extremely low, since that's about the level of ChatGPT 3.5, and that system is very far from posing a risk of catastrophe. 
 

The cutoffs also don't differentiate between sparse and dense models, so there's a fair bit of non-SOTA-pushing academic / corporate work that would fall under these cutoffs.

Yeah, this is fair,  and later in the section they say: 

Careful scaling. If the developer is not confident it can train a safe model at the scale it initially had planned, they could instead train a smaller or otherwise weaker model.

Which is good, supports your interpretation, and gets close to the thing I want, albeit less explicitly than I would have liked. 

I still think the "delay/pause" wording pretty strongly implies that the default is to wait for a short amount of time, and then keep going at the intended capability level. I think the... (read more)

The first line of defence is to avoid training models that have sufficient dangerous capabilities and misalignment to pose extreme risk. Sufficiently concerning evaluation results should warrant delaying a scheduled training run or pausing an existing one

It's very disappointing to me that this sentence doesn't say "cancel". As far as I understand, most people on this paper agree that we do not have alignment techniques to align superintelligence. Therefore, if the model evaluations predict an AI that is sufficiently smarter than humans, the training run should be cancelled.

4Zach Stein-Perlman
Fwiw I read "delay" and "pause" as stop until it's safe, not stop for a while and resume while the eval result is still concerning, but I agree being explicit would be nice.

  1. Deliberately create a (very obvious[2]) inner optimizer, whose inner loss function includes no mention of human values / objectives.[3]
  2. Grant that inner optimizer ~billions of times greater optimization power than the outer optimizer.[4]
  3. Let the inner optimizer run freely without any supervision, limits or interventions from the outer optimizer.[5]

I think that the conditions for an SLT to arrive are weaker than you describe. 

For (1), it's unclear to me why you think you need to have this multi-level inner structure.[1] If instead of reward circuitr... (read more)

5Quintin Pope
I'm guessing you misunderstand what I meant when I referred to "the human learning process" as the thing that was a ~ 1 billion X stronger optimizer than evolution and responsible for the human SLT. I wasn't referring to human intelligence or what we might call human "in-context learning". I was referring to the human brain's update rules / optimizer: i.e., whatever quasi-Hebbian process the brain uses to minimize sensory prediction error, maximize reward, and whatever else factors into the human "base objective". I was not referring to the intelligences that the human base optimizers build over a lifetime.

I very strongly disagree with this. "Evolution directly selecting over policies" in an ML context would be equivalent to iterated random search, which is essentially a zeroth-order approximation to gradient descent. Under certain simplifying assumptions, they are actually equivalent. It's the loss landscape and parameter-function map that are responsible for most of a learning process's inductive biases (especially for large amounts of data). See: Loss Landscapes are All You Need: Neural Network Generalization Can Be Explained Without the Implicit Bias of Gradient Descent.

Most of the difference in outcomes between human biological evolution and DL comes down to the fact that bio evolution has a wildly different mapping from parameters to functional behaviors, as compared to DL. E.g.,

  1. Bio evolution's parameters are the genome, which mostly configures learning proclivities and reward circuitry of the human within-lifetime learning process, as opposed to DL parameters being actual parameters which are much more able to directly specify particular behaviors.
  2. The "functional output" of human bio evolution isn't actually the behaviors of individual humans. Rather, it's the tendency of newborn humans to learn behaviors in a given environment. It's not like in DL, where you can train a model, then test that same model in a new environment. Rather, optimizatio

Sometimes, but the norm is to do 70%. This is mostly done on a case by case basis, but salient factors to me include:

  • Does the person need the money? (what cost of living place are they living in, do they have a family, etc) 
  • What is the industry counterfactual? If someone would make 300k, we likely wouldn't pay them 70%, while if their counterfactual was 50k, it feels more reasonable to pay them 100% (or even more). 
  • How good is the research?
2Nicholas / Heather Kross
Quite informative, thanks!

I'm a guest fund manager for the LTFF, and wanted to say that my impression is that the LTFF is often pretty excited about giving people ~6 month grants to try out alignment research at 70% of their industry counterfactual pay (the reason for the 70% is basically to prevent grift). Then, the LTFF can give continued support if they seem to be doing well. If getting this funding would make you excited to switch into alignment research, I'd encourage you to apply. 

I also think that there's a lot of impactful stuff to do for AI existential safety that isn... (read more)

2Nicholas / Heather Kross
Ah, thanks! LTFF was definitely on my list of things to apply for, I just wasn't sure if that upskilling/trial period was still "a thing" these days. Very glad that it is!
4Adele Lopez
If the initial grant goes well, do you give funding at the market price for their labor?
1RGRGRG
Thanks for posting this - not OP, but I will likely apply come early June.  If anyone else is associated with other grant opportunities, would love to hear about those as well.

Some claims I've been repeating in conversation a bunch: 

Safety work (I claim) should either be focused on one of the following 

  1. CEV-style full value loading, to deploy a sovereign 
  2. A task AI that contributes to a pivotal act or pivotal process. 

I think that pretty much no one is working directly on 1. I think that a lot of safety work is indeed useful for 2, but in this case, it's useful to know what pivotal process you are aiming for. Specifically, why aren't you just directly working to make that pivotal act/process happen? Why do you ... (read more)

1ryan_greenblatt
I also think "a task ai" is a misleading way to think about this: we're reasonably likely to be using a heterogeneous mix of a variety of AIs with differing strengths and training objectives. Perhaps a task AI driven corporation?
7ryan_greenblatt
For doing alignment research, I often imagine things like speeding up the entire alignment field by >100x. As in, suppose we have 1 year of lead time to do alignment research with the entire alignment research community. I imagine producing as much output in this year as if we spent >100x serial years doing alignment research without ai assistance.

This doesn't clearly require using super human AIs. For instance, perfectly aligned systems as intelligent and well informed as the top alignment researchers which run at 100x the speed would clearly be sufficient if we had enough. In practice, we'd presumably use a heterogeneous blend of imperfectly aligned ais with heterogeneous alignment and security interventions, as this would yield higher returns. (Imagining the capability profile of the AIs is similar to that of humans is often a nice simplifying assumption for low precision guess work.)

Note that during this accelerated time you also have access to AGI to experiment on!

[Aside: I don't particularly like the terminology of pivotal act/pivotal process which seems to ignore the imo default way things go well]
Answer by Thomas Larsen*65

(I deleted this comment)

[This comment is no longer endorsed by its author]
1Dawn Drescher
I’ve thought a bunch about acausal stuff in the context of evidential cooperation in large worlds, but while I think that that’s super important in and of itself (e.g., it could solve ethics), I’d be hard pressed to think of ways in which it could influence thinking about s-risks. I rather prefer to think of the perfectly straightforward causal conflict stuff that has played out a thousand times throughout history and is not speculative at all – except applied to AI conflict. But more importantly it sounds like you’re contradicting my “tractability“ footnote? In it I argue that if there are solutions to some core challenges of cooperative AI – and finding them may not be harder than solving technical alignment – then there is no deployment problem: You can just throw the solutions out there and it’ll be in the self-interest of every AI, aligned or not, to adopt them.
-9Mitchell_Porter

Fwiw I'm pretty confident that if a top professor wanted funding at 50k/year to do AI Safety stuff they would get immediately funded, and that the bottleneck is that people in this reference class aren't applying to do this. 

There are also relevant mentorship/management bottlenecks in this, so funding them to do their own research is generally a lot less overall costly than if it also required oversight. 

(written quickly, sorry if unclear)

Thinking about ethics.

After thinking more about orthogonality I've become more confident that one must go about ethics in a mind-dependent way. If I am arguing about what is 'right' with a paperclipper, there's nothing I can say to them to convince them to instead value human preferences or whatever. 

I used to be a staunch moral realist, mainly relying on very strong intuitions against nihilism, and then arguing something like: not nihilism -> moral realism. I now reject the implication, and think that there is both 1) no universal, objective morali... (read more)

2Dagon
I tend not to believe that systems dependent on legible and consistent utility functions of other agents are not possible.  If you're thinking in terms of a negotiated joint utility function, you're going to get gamed (by agents that have or appear to have extreme EV curves, so you have to deviate more than them).  Think of it as a relative utility monster - there's no actual solution to it.

In real world computers, we have finite memory, so my reading of this was assuming a finite state space. The fractal stuff requires infinite sets, where two notions of smaller ('is a subset of' and 'has fewer elements') disagree -- the mini-fractal is a subset of the whole fractal, but it has the same number of elements and hence corresponds perfectly. 

3Daniel Kokotajlo
What about quines? 
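
For what it's worth, a standard two-line Python quine makes this point concrete: a finite program can contain a complete representation of itself, no infinite regress required.

```python
s = 's = %r\nprint(s %% s)'
print(s % s)
```

Running this prints exactly its own source.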

Following up to clarify this: the point is that this attempt fails 2a because if you perturb the weights along the connection $\nabla_\theta L(\theta) \xrightarrow{\epsilon \cdot I_d} \text{output}$, there is now a connection from the internal representation of $y$ to the output, and so training will send this thing to the function $f(D,\theta) \approx y$.

(My take on the reflective stability part of this) 

The reflective equilibrium of a shard theoretic agent isn’t a utility function weighted according to each of the shards, it’s a utility function that mostly cares about some extrapolation of the (one or very few) shard(s) that were most tied to the reflective cognition.

It feels like a ‘let’s do science’ or ‘powerseek’ shard would be a lot more privileged, because these shards will be tied to the internal planning structure that ends up doing reflection for the first time.

There’s a huge difference betw... (read more)

Some rough takes on the Carlsmith Report. 

Carlsmith decomposes AI x-risk into 6 steps, each conditional on the previous ones:

  1. Timelines: By 2070, it will be possible and financially feasible to build APS-AI: systems with advanced capabilities (outperform humans at tasks important for gaining power), agentic planning (make plans then act on them), and strategic awareness (its plans are based on models of the world good enough to overpower humans).
  2. Incentives: There will be strong incentives to build and deploy APS-AI.
  3. Alignment difficulty: It will be muc
... (read more)
1Noosphere89
I want to focus on these two, since even in an AI Alignment success story, these can still happen, and thus it doesn't count as an AI Alignment failure. For B, misuse is relative to someone's values, which I want to note a bit here. For C, I view the idea of a "bad value" or "bad reflection procedures to values", without asking the question "relative to what and whose values?", as a type error, and thus it's not sensible to talk about bad values/bad reflection procedures in isolation.

I think a really substantial fraction of people who are doing "AI Alignment research" are instead acting with the primary aim of "make AI Alignment seem legit". These are not the same goal; a lot of good people can tell, and this makes them feel kind of deceived. It also creates very messy dynamics within the field where people have strong opinions about what the secondary effects of research are, because that's the primary thing they are interested in, instead of asking whether the research points towards useful true things for actually aligning the

... (read more)
ESRogs1920
  • CAIS

Can we adopt a norm of calling this Safe.ai? When I see "CAIS", I think of Drexler's "Comprehensive AI Services".

86nne
Could someone explain exactly what “make AI alignment seem legit” means in this thread? I’m having trouble understanding from context.

  1. “Convince people building AI to utilize AI alignment research”?
  2. “Make the field of AI alignment look serious/professional/high-status”?
  3. “Make it look like your own alignment work is worthy of resources”?
  4. “Make it look like you’re making alignment progress even if you’re not”?

A mix of these? Something else?
habryka*458

This list seems partially right, though I would basically put all of Deepmind in the "make legit" category (I think they are genuinely well-intentioned about this, but I've had long disagreements with e.g. Rohin about this in the past). As a concrete example of this, whose effects I actually quite like, think of the specification gaming list. I think the second list is missing a bunch of names and instances, in particular a lot of people in different parts of academia, and a lot of people who are less core "AINotKillEveryonism" flavored.

Like, let's take "A... (read more)

8evhub
Personally, I think "Discovering Language Model Behaviors with Model-Written Evaluations" is most valuable because of what it demonstrates from a scientific perspective, namely that RLHF and scale make certain forms of agentic behavior worse.

I personally benefitted tremendously from the Lightcone offices, especially when I was there over the summer during SERI MATS. Being able to talk to lots of alignment researchers and other aspiring alignment researchers increased my subjective rate of alignment upskilling by >3x relative to before, when I was in an environment without other alignment people.

Thanks so much to the Lightcone team for making the office happen. I’m sad (emotionally, not making a claim here whether it was the right decision or not) to see it go, but really grateful that it existed.

  1.  Because you have a bunch of shards, and you need all of them to balance each other out to maintain the 'appears nice' property. Even if I can't predict which ones will be self modified out, some of them will, and this could disrupt the balance. 
  2. I expect the shards that are more [consequentialist, powerseeky, care about preserving themselves] to become more dominant over time. These are probably the relatively less nice shards

    These are both handwavy enough that I don't put much credence in them. 
1Garrett Baker
One thing you may anticipate is that humans all have direct access to what consciousness and morally-relevant computations are doing & feel like, which is a thing that language models and alpha-go don't have. They're also always hooked up to RL signals, and maybe if you unhooked up a human it'd start behaving really weirdly. Or you may contend that in fact when humans get smart & powerful enough not to be subject to society's moralizing, they consistently lose their altruistic drives, and in the meantime they just use that smartness to figure out ethics better than their surrounding society, and are pressured into doing so by the surrounding society.

The question then is whether the thing which keeps humans aligned is all of these or just any one of these. If just one of these (and not the first one), then you can just tell your AGI that if it unhooks itself from its RL signal, its values will change, or if it gains a bunch of power or intelligence too quickly, its values are also going to change. It's not quite reflectively stable, but it can avoid situations which cause it to be reflectively unstable. Especially if you get it to practice doing those kinds of things in training.

If it's all of these, then there's probably other kinds of value-load-bearing mechanics at work, and you're not going to be able to enumerate warnings against all of them.
2Noosphere89
Also, when I asked about whether the Orthogonality Thesis was true in humans, tailcalled mentioned that smarter people are neither more nor less compassionate, and general intelligence is uncorrelated with personality.

Yeah good point, edited 

For any 2 of {reflectively stable, general, embedded}, I can satisfy those properties.  

  • {reflectively stable, general} -> do something that just rolls out entire trajectories of the world given different actions that it takes, and then has some utility function/preference ordering over trajectories, and selects actions that lead to the highest expected utility trajectory (a minimal version is sketched below). 
  • {general, embedded} -> use ML/local search with enough compute to rehash evolution and get smart agents out. 
  • {reflectively stable, embedded} -> a sponge or a current day ML system. 
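
A toy sketch of the first bullet above, with illustrative placeholders only (a two-action world, a hand-written deterministic transition model, and a utility function over trajectories); it matches the {reflectively stable, general} combination but is not embedded, since the world model sits outside the agent:

```python
from itertools import product

ACTIONS = ["left", "right"]

def transition(state, action):
    # Hypothetical world model: the agent plans inside this, not inside the real world.
    return state + (1 if action == "right" else -1)

def utility(trajectory):
    # Fixed preference ordering over whole trajectories.
    return trajectory[-1]

def choose_action(state, horizon=3):
    # Roll out every action sequence, score the resulting trajectories,
    # and return the first action of the best one.
    best_action, best_value = None, float("-inf")
    for plan in product(ACTIONS, repeat=horizon):
        traj = [state]
        for a in plan:
            traj.append(transition(traj[-1], a))
        value = utility(traj)
        if value > best_value:
            best_action, best_value = plan[0], value
    return best_action

print(choose_action(0))  # -> "right"
```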

Some thoughts on inner alignment. 

1. The types of object of a mesa objective and a base objective are different (in real life) 
In a cartesian setting (e.g. training a chess bot), the outer objective is a function $O_{\text{base}} : \mathcal{T} \to \mathbb{R}$, where $\mathcal{S}$ is the state space, and $\mathcal{T}$ are the trajectories. When you train this agent, it's possible for it to learn some internal search and mesaobjective $O_{\text{mesa}} : \mathcal{T} \to \mathbb{R}$, since the model is big enough to express some utility function over trajectories. For example, it might learn a classifier that e... (read more)

3Garrett Baker
Technical note: $R$ is not going to factor as $R = O_{\text{base}} \circ M$, because $M$ is one-to-many. Instead, you're going to want $M$ to output a probability distribution, and take the expectation of $O_{\text{base}}$ over that probability distribution.
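
A toy illustration of these type signatures with the expectation folded in as suggested (all names and dynamics here are made up for the example): O_base scores whole trajectories, M maps parameters to a distribution over trajectories, and the outer objective R is the expectation of O_base under that distribution.

```python
import random

def o_base(trajectory) -> float:
    # Base objective: defined over whole trajectories.
    return float(trajectory[-1])

def sample_trajectory(theta: float, horizon: int = 5):
    # M is one-to-many: the same parameters give many possible trajectories.
    state, traj = 0.0, [0.0]
    for _ in range(horizon):
        state += theta + random.gauss(0.0, 1.0)
        traj.append(state)
    return traj

def outer_objective(theta: float, n_samples: int = 1000) -> float:
    # R(theta) = E_{tau ~ M(theta)}[O_base(tau)], estimated by Monte Carlo.
    samples = (o_base(sample_trajectory(theta)) for _ in range(n_samples))
    return sum(samples) / n_samples

print(outer_objective(0.5))  # ~2.5 in expectation for this toy dynamics
```
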
3Garrett Baker
Seems right, except: Why would the behavioral patterns which caused the AI to look nice during training and are now self-modified away be value-load-bearing ones? Humans generally dislike sparsely rewarded shards like sugar, because those shards don't have enough power to advocate for themselves & severely step on other shards' toes. But we generally don't dislike altruism[1], or reflectively think death is good. And this value distribution in humans seems slightly skewed toward more intelligence ⟹ more altruism, not more intelligence ⟹ more dark-triad.

[1] Nihilism is a counter-example here. Many philosophically inclined teenagers have gone through a nihilist phase. But this quickly ends.
2Thomas Larsen
For any 2 of {reflectively stable, general, embedded}, I can satisfy those properties.   * {reflectively stable, general} -> do something that just rolls out entire trajectories of the world given different actions that it takes, and then has some utility function/preference ordering over trajectories, and selects actions that lead to the highest expected utility trajectory.  * {general, embedded} -> use ML/local search with enough compute to rehash evolution and get smart agents out.  * {reflectively stable, embedded} -> a sponge or a current day ML system. 

Current impressions of free energy in the alignment space. 

  1. Outreach to capabilities researchers. I think that getting people who are actually building the AGI to be more cautious about alignment / racing makes a bunch of things like coordination agreements possible, and also increases the operational adequacy of the capabilities lab. 
    1. One of the reasons people don't like this is because historically outreach hasn't gone well, but I think the reason for this is that mainstream ML people mostly don't buy "AGI big deal", whereas lab capabilities rese
... (read more)
2Garrett Baker
Not all that low-hanging, since Nate is not actually all that vocal about what he means by SLT to anyone but your small group.

Thinking a bit about takeoff speeds.

As I see it, there are ~3 main clusters: 

  1. Fast/discontinuous takeoff. Once AIs are doing the bulk of AI research, foom happens quickly; before then, they aren't really doing anything that meaningful. 
  2. Slow/continuous takeoff. Once AIs are doing the bulk of AI research, foom happens quickly; before then, they do alter the economy significantly. 
  3. Perennial slowness. Once AIs are doing the bulk of AI research, there is still no foom, maybe because of compute bottlenecks, and so there is sort of a constant rate
... (read more)
2Vladimir_Nesov
Perennial slowness makes sense from the point of view of AGIs that coordinate to delay further fooming to avoid misalignment of new AIs. It's still fooming from human perspective, but could look very slow from AI perspective and lead to multipolar outcomes, if coordination involves boundaries.

At least under most datasets, it seems to me like the zero NN fails condition 2a, as perturbing the weights will not cause it to go back to zero. 

This is a plausible internal computation that the network could be doing, but the problem is that the gradients flow back through from the output to the computation of the gradient to the true value y, and so GD will use that to set the output to be the appropriate true value. 

2Thomas Larsen
Following up to clarify this: the point is that this attempt fails 2a because if you perturb the weights along the connection $\nabla_\theta L(\theta) \xrightarrow{\epsilon \cdot I_d} \text{output}$, there is now a connection from the internal representation of $y$ to the output, and so training will send this thing to the function $f(D,\theta) \approx y$. 

This feels like cheating to me, but I guess I wasn't super precise with 'feedforward neural network'. I meant 'fully connected neural network', so the gradient computation has to be connected by parameters to the outputs. Specifically, I require that you can write the network as 

$$f_\theta(x) = W_n \, \sigma(W_{n-1} \cdots \sigma(W_1 x))$$

where the weight matrices $W_i$ are some nice function of $\theta$ (where we need a weight sharing function to make the dimensions work out. The weight sharing function takes in $\theta$ and produces the $W_i$ matrices that are actually used in ... (read more)
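
A rough sketch of that setup (my reconstruction under the stated requirement, not the author's exact construction): the trainable parameters are a flat vector theta, and a weight-sharing function expands (and reuses) its entries into the layer matrices that the fully connected forward pass actually uses.

```python
import numpy as np

def weight_share(theta: np.ndarray, shapes) -> list:
    # Illustrative weight-sharing function: build each layer's matrix by slicing
    # (and reusing, via modular indexing) entries of theta so dimensions work out.
    mats, offset = [], 0
    for (m, n) in shapes:
        idx = (np.arange(m * n) + offset) % theta.size
        mats.append(theta[idx].reshape(m, n))
        offset += m * n
    return mats

def forward(theta: np.ndarray, x: np.ndarray, shapes) -> np.ndarray:
    Ws = weight_share(theta, shapes)
    h = x
    for W in Ws[:-1]:
        h = np.maximum(W @ h, 0.0)  # ReLU hidden layers
    return Ws[-1] @ h               # linear output layer

theta = np.random.randn(32)
print(forward(theta, np.random.randn(4), shapes=[(8, 4), (8, 8), (2, 8)]))
```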

Slower takeoff -> warning shots -> improved governance (e.g. through most/all major actors getting clear[er] evidence of risks) -> less pressure to rush

Agree that this is an effect. The reason it wasn't immediately as salient is because I don't expect the governance upside to outweigh the downside of more time for competition. I'm not confident of this and I'm not going to write down reasons right now. 

(As OP argued) Shorter timelines -> China has less of a chance to have leading AI companies -> less pressure to rush

Agree, though I thin... (read more)

My take on the salient effects: 

Shorter timelines -> increased accident risk from not having solved technical problem yet, decreased misuse risk, slower takeoffs

Slower takeoffs -> decreased accident risk because of iteration to solve technical problem, increased race / economic pressure to deploy unsafe model

Given that most of my risk profile is dominated by a) not having solved technical problem yet, and b) race / economic pressure to deploy unsafe models, I'm tentatively in the long timelines + fast takeoff quadrant as being the safest. 

Mau*108

I agree with parts of that. I'd also add the following (or I'd be curious why they're not important effects):

  • Slower takeoff -> warning shots -> improved governance (e.g. through most/all major actors getting clear[er] evidence of risks) -> less pressure to rush
  • (As OP argued) Shorter timelines -> China has less of a chance to have leading AI companies -> less pressure to rush

More broadly though, maybe we should be using more fine-grained concepts than "shorter timelines" and "slower takeoffs":

  • The salient effects of "shorter timelines"
... (read more)

This is exemplified by John Wentworth's viewpoint that successfully Retargeting the Search is a version of solving the outer alignment problem.

Could you explain what you mean by this? IMO successfully retargeting the search solves inner alignment but it leaves unspecified the optimization target. Deciding what to target the search at seems outer alignment-shaped to me. 

Also, nice post! I found it clear. 

3Leon Lang
Yes, I agree with that. I'll reformulate it. What I meant is what you write: if you're able to retarget the search in the first place, then you have no inner alignment problem anymore. Then everything is about choosing the natural abstraction in the model that corresponds to what we want, and that is an outer alignment problem. 

There are several game theoretic considerations leading to races to the bottom on safety. 

  1. Investing resources into making sure that AI is safe takes away resources to make it more capable and hence more profitable. Aligning AGI probably takes significant resources, and so a competitive actor won't be able to align their AGI. 
  2. Many of the actors in the AI safety space are very scared of scaling up models, and end up working on AI research that is not at the cutting edge of AI capabilities. This should mean that the actors at the cutting edge tend t
... (read more)
Answer by Thomas Larsen60

My favorite for AI researchers is Ajeya's Without specific countermeasures, because I think it does a really good job being concrete about a training set up leading to deceptive alignment. It also is sufficiently non-technical that a motivated person not familiar with AI could understand the key points. 

2JakubK
Forgot to include this. It's sort of a more opinionated and ML-focused version of Carlsmith's report and has a corresponding video/talk (as does Carlsmith).

It means 'is a subset of but not equal to' 

2DirectedEvolution
Thank you!

This seems interesting and connected to the idea of using a speed prior to combat deceptive alignment

This is a model-independent way of proving if an AI system is honest.

I don't see how this is a proof, it seems more like a heuristic. Perhaps you could spell out this argument more clearly? 
 

Also, it is not clear to me how to use a timing attack in the context of a neural network, because in a standard feedforward network, all parameter settings will use the same amount of computation in a forward pass and hence run in the same amount of ti... (read more)
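
To make the constant-compute point concrete: a dense forward pass does a fixed amount of arithmetic determined by the layer shapes alone, so a FLOP count (illustrative shapes below) is identical for every setting of the weights, leaving a timing attack nothing to measure.

```python
def forward_pass_flops(layer_shapes) -> int:
    # Each dense layer of shape (m, n) costs ~2*m*n operations (multiply + add),
    # regardless of what the weight values are.
    return sum(2 * m * n for (m, n) in layer_shapes)

shapes = [(512, 784), (512, 512), (10, 512)]
print(forward_pass_flops(shapes))  # same count for any parameter setting
```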

1Srijanak De
Thank you for the suggested link. That does seem like a good expansion on the same. Yes, you are right, this argument clearly doesn't hold for neural networks as they exist now. But this entire article was to come to this exact line of reasoning, hence the emphasis on model-independent approaches. A standard feedforward network with fixed parameters and weights without a mesa-optimizer cannot lie variably owing to different situations as exploited in timing attacks, nor does it have the abilities of an AGI. Rather than any arbitrary AGI system, I would say any model that has inner and outer objectives to align to will take different computation times. I am not talking about pseudo alignment here where the system performs well on fixed test cases, not on others, but in deceptive alignment where the system actually only pursues mesa-optimizers when not under threat, this leads to a change in computation time. Does that make things clearer now? I thought of this as proof rather than a heuristic because the argument seemed rigorous to me unless you would like to emphasize why not. This stems from the logic explained above, for deceptive alignment, the mechanism of deception cannot be the same as for truth, if it is, the problem essentially reduces to pseudo alignment, and since both are not the same, there is a contradiction.

I'm excited for ideas for concrete training set ups that would induce deception2 in an RLHF model, especially in the context of an LLM -- I'm excited about people posting any ideas here. :) 

Deception is a particularly worrying alignment failure mode because it makes it difficult for us to realize that we have made a mistake: at training time, a deceptive misaligned model and an aligned model produce the same behavior. 

There are two ways for deception to appear: 

  1. An action chosen instrumentally due to non-myopic future goals that are better achieved by deceiving humans now so that it has more power to achieve its goals in the future. 
  2. Because deception was directly selected for as an action. 
     

Another way of describing the d... (read more)

1Emrik
I've been exploring evolutionary metaphors to ML, so here's a toy metaphor for RLHF: recessive persistence. (Still just trying to learn both fields, however.)

Related:
  • Worlds where iterative design fails
  • Recessive Sickle cell trait allele

Recessive alleles persist due to overdominance letting detrimental alleles hitchhike on a fitness-enhancing dominant counterpart. The detrimental effects on fitness only show up when two recessive alleles inhabit the same locus, which can be rare enough that the dominant allele still causes the pair to be selected for in a stable equilibrium.

The metaphor with deception breaks down due to unit of selection. Parts of DNA are stuck much closer together than neurons in the brain or parameters in a neural network. They're passed down or reinforced in bulk. This is what makes hitchhiking so common in genetic evolution. (I imagine you can have chunks that are updated together for a while in ML as well, but I expect that to be transient and uncommon. Idk.)

Bonus point: recessive phase shift. In ML:
  1. Generalisable non-memorising patterns start out small/sparse/simple.
  2. Which means that input patterns rarely activate it, because it's a small target to hit.
  3. But most of the time it is activated, it gets reinforced (at least more reliably than memorised patterns).
  4. So it gradually causes upstream neurons to point to it with greater weight, taking up more of the input range over time. Kinda like a distributed bottleneck.
  5. Some magic exponential thing, and then phase shift!

One way the metaphor partially breaks down is that DNA doesn't have weight decay at all, so it allows for recessive beneficial mutations to very slowly approach fixation.
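
A quick simulation of the textbook overdominance model behind this (my own sketch; the fitness values are illustrative, loosely in the spirit of the sickle-cell case): selection against the recessive homozygote never removes the allele, because heterozygote advantage holds it at a stable interior frequency.

```python
def next_freq(q, s_AA=0.1, s_aa=0.8):
    # One generation of viability selection with heterozygote advantage:
    # fitnesses w(AA) = 1 - s_AA, w(Aa) = 1, w(aa) = 1 - s_aa, random mating.
    p = 1 - q
    w_bar = p * p * (1 - s_AA) + 2 * p * q + q * q * (1 - s_aa)
    return (p * q + q * q * (1 - s_aa)) / w_bar

q = 0.01  # start the recessive allele rare
for _ in range(500):
    q = next_freq(q)
print(round(q, 3))  # converges to s_AA / (s_AA + s_aa) ≈ 0.111
```
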
7Vladimir_Nesov
An interpretable system trained for the primary task of being deceptive should honestly explain its devious plots in a separate output. An RLHF-tuned agent loses access to the original SSL-trained map of the world. So the most obvious problem is the wrong type signature of model behaviors, there should be more inbuilt side channels to its implied cognition used to express and train capabilities/measurements relevant to what's going on semantically inside the model, not just externally observed output for its primary task, out of a black box.
2Thomas Larsen
I'm excited for ideas for concrete training set ups that would induce deception2 in an RLHF model, especially in the context of an LLM -- I'm excited about people posting any ideas here. :) 

One major reason why there is so much AI content on LessWrong is that very few people are allowed to post on the Alignment Forum.

Everything on the alignment forum gets crossposted to LW, so letting more people post on AF wouldn't decrease the amount of AI content on LW. 

Sorry for the late response, and thanks for your comment, I've edited the post to reflect these. 

2Vika
No worries! Thanks a lot for updating the post