LESSWRONG
LW

All of Thomas Larsen's Comments + Replies

AI 2027: What Superintelligence Looks Like

For what its worth, my view is that we're very likely to be wrong about the specific details in both of the endings -- they are obviously super conjunctive. I don't think that there's any way around this because we can be confident AGI is going to cause some ex-ante surprising things to happen.

Also, this is scenario is around 20th percentile timelines for me, my median is early 2030s (though other authors disagree with me). I also feel much more confident about the pre-2027 scenario than about the post 2027 scenario.

Is your disagreement that you thin... (read more)

4Cole Wyeth12d

My main disagreement is the speed, but not because I expect everything to happen more slowly by some constant factor. Instead I think there’s a missing mood here regarding the obstacles to building AGI, and the time to overcome those obstacles is not clear (which is why my timeline uncertainty is still ~in the exponent). In particular, I think the first serious departure from my model of LLMs (linked above) is the neuralese section. It seems to me that for this to really work (in a way comparable to how human brains have recurrence) would require another breakthrough at least on the level of transformers if not harder. So, if the paper from Hao et al. is actually followed up on by future research that successfully scales, that would be a crux for me. Your explanation that the frontier labs haven’t adopted this for GPU utilization reasons seems highly implausible to me. These are creative people who want to ready AGI, and it seems obvious that the kind of tasks that arent conquered yet look a lot like the ones that need recurrence. Do you really think none of them have significantly invested in this (starting years ago when it become obvious this was a bottleneck)? The fact that we still need CoT at all tells me neuralese is not happening because we don’t know how to do it. Please refer to my post for more details on this intuition and its implications. In particular, I am not convinced this is the final bottleneck. I also depart from certain other details latter, for instance I think we’ll have better theory by the time we need to align human level AI and “muddling through” by blind experimentation probably won’t work or be the actual path taken by surviving worlds. My other points of disagreement seem less cruxy and are mostly downstream.

AI 2027: What Superintelligence Looks Like

Thomas Larsen16d82

This wasn't intended to be humor. In the scenario, we write:

(To avoid singling out any one existing company, we’re going to describe a fictional artificial general intelligence company, which we’ll call OpenBrain. We imagine the others to be 3–9 months behind OpenBrain.)

I think that OpenAI, GDM, and Anthropic are in the lead and are the most likely to be ahead, with similar probability.

AI 2027: What Superintelligence Looks Like

Thomas Larsen17d352

Thank you! We actually tried to write one that was much closer to a vision we endorse! The TLDR overview was something like:

Both the US and Chinese leading AGI projects stop in response to evidence of egregious misalignment.
Sign a treaty to pause smarter-than-human AI development, with compute based enforcement similar to ones described in our live scenario, except this time with humans driving the treaty instead of the AI.
Take time to solve alignment (potentially with the help of the AIs). This period could last anywhere between 1-20 ye

... (read more)

8Sebastian Schmidt16d

Big +1 on adding this and/or finding another high-quality way of depicting what the ideal scenario would look like. I think many people think and feel that the world is in a very dire state to an extent that leads to hopelessness and fatalism. Articulating clear theories of victory that enable people to see the better future they can contribute towards will be an important part of avoiding this scenario.

4Knight Lee17d

:) it's good to know that you tried this. Because on your way trying to make it realistic, you might think of a lot of insights to solving the unrealisticness problems. Thank you for the summary. From this summary, I sorta see why it might not work as well as a story. Regulation and governance isn't very exciting a narrative. And big changes in strategy and attitude inevitably sound unrealistic, even if they aren't unrealistic. E.g. if someone predicted that Europe will simply accept the fact its colonies want independence, or that the next Soviet leader will simply allow his constituent republics to break away, they would be laughed out of the room. Even though their predictions will turn out accurate. Maybe in your disclaimer, you can point out that this summary you just wrote, is what you would actually recommend (instead of what the characters in your story did). Yes, papers and blog posts are less entertaining of us but more pragmatic for you.

Why Don't We Just... Shoggoth+Face+Paraphraser?

Thomas Larsen3mo40

Thanks - I see, I was misunderstanding.

Why Don't We Just... Shoggoth+Face+Paraphraser?

Thomas Larsen3mo20

Proposal part 1: Shoggoth/Face Distinction: Instead of having one model undergo agency training, we have two copies of the base model work together, specializing in different parts of the job, and undergo training together. Specifically we have the "shoggoth" copy responsible for generating all the 'reasoning' or 'internal' CoT, and then we have the "face" copy responsible for the 'actions' or 'external' outputs. So e.g. in a conversation with the user, the Shoggoth would see the prompt and output a bunch of reasoning token CoT; the Face would see the prom

... (read more)

4Daniel Kokotajlo3mo

I think you may be misunderstanding the proposal; I should clarify, sorry: The proposal is to blind the evaluation process to the internal reasoning, NOT to not train the internal reasoning! The internal reasoning will of course be trained, it's just that the process that evaluates it will be blind to it, and instead just look at the external outputs + outcomes. (That is, it looks at the outputs of the face, not the outputs of the shoggoth. The outputs to the shoggoth are inputs to the face.) For example, you could generate a billion trajectories of agentic behavior doing various tasks and chatting with various users, evaluate them, and then train the model to imitate the top 20% of them, and then repeat.

"Slow" takeoff is a terrible term for "maybe even faster takeoff, actually"

Thomas Larsen7mo1916

I think a problem with all the proposed terms is that they are all binaries, and one bit of information is far too little to characterize takeoff:

One person's "slow" is >10 years, another's is >6 months.
The beginning and end points are super unclear; some people might want to put the end point near the limits of intelligence, some people might want to put the beginning points at >2x AI R&D speed, some at 10, etc.
In general, a good description of takeoff should characterize capabilities at each point on the curve.

So I d... (read more)

4Ben Pace7mo

I support replacing binary terms with quantitative terms.

2Raemon7mo

Fwiw I feel fine, with both slow/fast and smooth/sharp thinking of it as a continuum. Takeoffs and timelines can be slower or faster and compared on that axis. I agree if you are just treating those as booleans your gonna get confused, but the words seem about as scalar a shorthand as one could hope for without literally switching entirely to more explicit quantification.

Zach Stein-Perlman's Shortform

Thomas Larsen8mo*160

Perhaps that was overstated. I think there is maybe a 2-5% chance that Anthropic directly causes an existential catastrophe (e.g. by building a misaligned AGI). Some reasoning for that:

I doubt Anthropic will continue to be in the lead because they are behind OAI/GDM in capital. They do seem around the frontier of AI models now, though, which might translate to increased returns, but it seems like they do best on very short timelines worlds.
I think that if they could cause an intelligence explosion, it is more likely than not that they would pau

... (read more)

4Ben Pace8mo

FYI I believe the correct language is "directly causes an existential catastrophe". "Existential risk" is a measure of the probability of an existential catastrophe, but is not itself an event.

Garrett Baker8mo106

I think you probably under-rate the effect of having both a large number & concentration of very high quality researchers & engineers (more than OpenAI now, I think, and I wouldn't be too surprised if the concentration of high quality researchers was higher than at GDM), being free from corporate chafe, and also having many of those high quality researchers thinking (and perhaps being correct in thinking, I don't know) they're value aligned with the overall direction of the company at large. Probably also Nvidia rate-limiting the purchases of large... (read more)

Zach Stein-Perlman's Shortform

Thomas Larsen8mo17-3

I agree with Zach that Anthropic is the best frontier lab on safety, and I feel not very worried about Anthropic causing an AI related catastrophe. So I think the most important asks for Anthropic to make the world better are on its policy and comms.

I think that Anthropic should more clearly state its beliefs about AGI, especially in its work on policy. For example, the SB-1047 letter they wrote states:

Broad pre-harm enforcement. The current bill requires AI companies to design and implement SSPs that meet certain standards – for example they m

... (read more)

4Garrett Baker8mo

This does not fit my model of your risk model. Why do you think this?

Thomas Larsen's Shortform

Thomas Larsen11moΩ342

Yeah, actual FLOPs are the baseline thing that's used in the EO. But the OpenAI/GDM/Anthropic RSPs all reference effective FLOPs.

If there's a large algorithmic improvement you might have a large gap in capability between two models with the same FLOP, which is not desirable. Ideal thresholds in regulation / scaling policies are as tightly tied as possible to the risks.

Another downside that FLOPs / E-FLOPs share is that it's unpredictable what capabilities a 1e26 or 1e28 FLOPs model will have. And it's unclear what capabilities will emerge from a small bit of scaling: it's possible that within a 4x flop scaling you get high capabilities that had not appeared at all in the smaller model.

Thomas Larsen's Shortform

Thomas Larsen11moΩ340

Credit: Mainly inspired by talking with Eli Lifland. Eli has a potentially-published-soon document here.

The basic case against against Effective-FLOP.

We're seeing many capabilities emerge from scaling AI models, and this makes compute (measured by FLOPs utilized) a natural unit for thresholding model capabilities. But compute is not a perfect proxy for capability because of algorithmic differences. Algorithmic progress can enable more performance out of a given amount of compute. This makes the idea of effective FLOP tempti

... (read more)

4habryka11mo

Maybe I am being dumb, but why not do things on the basis of "actual FLOPs" instead of "effective FLOPs"? Seems like there is a relatively simple fact-of-the-matter about how many actual FLOPs were performed in the training of a model, and that seems like a reasonable basis on which to base regulation and evals.

Matthew Barnett's Shortform

Thomas Larsen1y20

The fact that AIs will be able to coordinate well with each other, and thereby choose to "merge" into a single agent
My response: I agree AIs will be able to coordinate with each other, but "ability to coordinate" seems like a continuous variable that we will apply pressure to incrementally, not something that we should expect to be roughly infinite right at the start. Current AIs are not able to "merge" with each other.

Ability to coordinate being continuous doesn't preclude sufficiently advanced AIs acting like a single agent. Why would it need to be infin... (read more)

4Matthew Barnett1y

If coordination ability increases incrementally over time, then we should see a gradual increase in the concentration of AI agency over time, rather than the sudden emergence of a single unified agent. To the extent this concentration happens incrementally, it will be predictable, the potential harms will be noticeable before getting too extreme, and we can take measures to pull back if we realize that the costs of continually increasing coordination abilities are too high. In my opinion, this makes the challenge here dramatically easier. (I'll add that paragraph to the outline, so that other people can understand what I'm saying) I'll also quote from a comment I wrote yesterday, which adds more context to this argument,

Evolution provides no evidence for the sharp left turn

Thomas Larsen2y50

Thanks for the response!

If instead of reward circuitry inducing human values, evolution directly selected over policies, I'd expect similar inner alignment failures.
I very strongly disagree with this. "Evolution directly selecting over policies" in an ML context would be equivalent to iterated random search, which is essentially a zeroth-order approximation to gradient descent. Under certain simplifying assumptions, they are actually equivalent. It's the loss landscape an parameter-function map that are responsible for most of a learning process's in

Thomas Larsen2y43

We haven't asked specific individuals if they're comfortable being named publicly yet, but if advisors are comfortable being named, I'll announce that soon. We're also in the process of having conversations with academics, AI ethics folks, AI developers at small companies, and other civil society groups to discuss policy ideas with them.

So far, I'm confident that our proposals will not impede the vast majority of AI developers, but if we end up receiving feedback that this isn't true, we'll either rethink our proposals or remove this claim from our a... (read more)

4StellaAthena2y

It seems to me like you've received this feedback already in this very thread. The fact that you're going to edit the claim to basically say "this doesn't effect most people because most people don't work on LLMs" completely dodges the actual issue here, which is that there's a large non-profit and independent open source LLM community that this would heavily impact. I applaud your honestly in admitting one approach you might take is to "remove this claim from our advocacy efforts," but am quite sad to see that you don't seem to care about limiting the impact of your regulation to potentially dangerous models.

1[comment deleted]2y

7ryan_greenblatt2y

It seems to me that for AI regulation to have important effects, it probably has to affect many AI developers around the point where training more powerful AIs would be dangerous. So, if AI regulation is aiming to be useful in short timelines and AI is dangerous, it will probably have to affect most AI developers. And if policy requires a specific flop threshold or similar, then due to our vast uncertainty, that flop threshold probably will have to soon affect many AI developers. My guess is that the criteria you establish would in fact affect a large number of AI developers soon (perhaps most people interested in working with SOTA open-source LLMs). In general, safe flop and performance thresholds have to unavoidably be pretty low to actually be sufficient slightly longer term. For instance, suppose that 10^27 flops is a dangerous amount of effective compute (relative to the performance of the GPT4 training run). Then, if algorithmic progress is 2x per year, 10^24 real flops is 10^27 effective flop in just 10 years. I think you probably should note that this proposal is likely to affect the majority of people working with generative AI in the next 5-10 years. This seems basically unavoidable.

Bleys2y1514

Already, there are dozens of fine-tuned Llama2 models scoring above 70 on MMLU. They are laughably far from threats. This does seem like an exceptionally low bar. GPT-4, given the right prompt crafting, and adjusting for errors in MMLU has just been shown to be capable of 89 on MMLU. It would not be surprising for Llama models to achieve >80 on MMLU in the next 6 months.

I think focusing on a benchmark like MMLU is not the right approach, and will be very quickly outmoded. If we look at the other criteria (which, as you propose it now, any and all are a ... (read more)

4teknium12y

No, your proposal will affect nearly every LLM that has come out in the last 6 months. Llama, MPT, Falcon, RedPajama, OpenLlama, Qwen, StarCoder, have all trained on equal to or greater than 1T tokens. Did you do so little research that you had no idea about this to have made that original statement?

Introducing the Center for AI Policy (& we're hiring!)

Thomas Larsen2y155

I’ve changed the wording to “Only a few technical labs (OpenAI, DeepMind, Meta, etc) and people working with their models would be regulated currently.” The point of this sentence is to emphasize that this definition still wouldn’t apply to the vast majority of AI development -- most AI development uses small systems, e.g. image classifiers, self driving cars, audio models, weather forecasting, the majority of AI used in health care, etc.

cfoster02y2219

Credit for changing the wording, but I still feel this does not adequately convey how sweeping the impact of the proposal would be if implemented as-is. Foundation model-related work is a sizeable and rapidly growing chunk of active AI development. Of the 15K pre-print papers posted on arXiv under the CS.AI category this year, 2K appear to be related to language models. The most popular Llama2 model weights alone have north of 500K downloads to date, and foundation-model related repos have been trending on Github for months. "People working with [a few tec... (read more)

Introducing the Center for AI Policy (& we're hiring!)

Thomas Larsen2y*177

(ETA: these are my personal opinions)

Notes:

We're going to make sure to exempt existing open source models. We're trying to avoid pushing the frontier of open source AI, not trying to put the models that are already out their back in the box, which I agree is intractable.
These are good points, and I decided to remove the data criteria for now in response to these considerations.
The definition of frontier AI is wide because it describes the set of models that the administration has legal authority over, not the set of models that would be r

Thomas Larsen2y90

Thanks!

I spoke with a lot of other AI governance folks before launching, in part due to worries about the unilateralists curse. I think that there is a chance this project ends up being damaging, either by being discordant with other actors in the space, committing political blunders, increasing the polarization of AI, etc. We're trying our best to mitigate these risks (and others) and are corresponding with some experienced DC folks who are giving us advice, as well as being generally risk-averse in how we act. That being said, some senior folks I've talked to are bearish on the project for reasons including the above.

DM me if you'd be interested in more details, I can share more offline.

Introducing the Center for AI Policy (& we're hiring!)

Thomas Larsen2y160

Your current threshold does include all Llama models (other than llama-1 6.7/13 B sizes), since they were trained with > 1 trillion tokens.

Yes, this reasoning was for capabilities benchmarks specifically. Data goes further with future algorithmic progress, so I thought a narrower criteria for that one was reasonable.

I also think 70% on MMLU is extremely low, since that's about the level of ChatGPT 3.5, and that system is very far from posing a risk of catastrophe.

This is the threshold for the government has the ability to say no to, an... (read more)

1a3orn2y2418

This is the threshold for the government has the ability to say no to, and is deliberately set well before catastrophe.

There are disadvantages to giving the government "the ability to say no" to models used by thousands of people. There are disadvantages even in a frame where AI-takeover is the only thing you care about!

For instance, if you give the government too expansive a concern such that it must approve many models "well before the threshold", then it will have thousands of requests thrown at it regularly, and it could (1) either try to scrutinize... (read more)

Quintin Pope2y*1814

Yes, this reasoning was for capabilities benchmarks specifically. Data goes further with future algorithmic progress, so I thought a narrower criteria for that one was reasonable.

So, you are deliberately targeting models such as LLama-2, then? Searching HuggingFace for "Llama-2" currently brings up 3276 models. As I understand the legislation you're proposing, each of these models would have to undergo government review, and the government would have the perpetual capacity to arbitrarily pull the plug on any of them.

I expect future small, open-source... (read more)

Introducing the Center for AI Policy (& we're hiring!)

Thomas Larsen2y225

It's worth noting that this (and the other thresholds) are in place because we need a concrete legal definition for frontier AI, not because they exactly pin down which AI models are capable of catastrophe. It's probable that none of the current models are capable of catastrophe. We want a sufficiently inclusive definition such that the licensing authority has the legal power over any model that could be catastrophically risky.

That being said -- Llama 2 is currently the best open-source model and it gets 68.9% on the MMLU. It seems relatively unimpo... (read more)

3Bleys2y

If you are specifically trying to just ensure that all big AI labs are under common oversight, the most direct way is via compute budget. E.g., any organization with compute budget >$100M allocated for AI research. Would capture all the big labs. (OpenAI spent >$400M on compute in 2022 alone). No need to complicate it with anything else.

Quintin Pope2y1612

Your current threshold does include all Llama models (other than llama-1 6.7/13 B sizes), since they were trained with > 1 trillion tokens.

I also think 70% on MMLU is extremely low, since that's about the level of ChatGPT 3.5, and that system is very far from posing a risk of catastrophe.

The cutoffs also don't differentiate between sparse and dense models, so there's a fair bit of non-SOTA-pushing academic / corporate work that would fall under these cutoffs.

DeepMind: Model evaluation for extreme risks

Thomas Larsen2y*42

Yeah, this is fair, and later in the section they say:

Careful scaling. If the developer is not confident it can train a safe model at the scale it initially had planned, they could instead train a smaller or otherwise weaker model.

Which is good, supports your interpretation, and gets close to the thing I want, albeit less explicitly than I would have liked.

I still think the "delay/pause" wording pretty strongly implies that the default is to wait for a short amount of time, and then keep going at the intended capability level. I think the... (read more)

DeepMind: Model evaluation for extreme risks

Thomas Larsen2y2119

The first line of defence is to avoid training models that have sufficient dangerous capabilities and misalignment to pose extreme risk. Sufficiently concerning evaluation results should warrant delaying a scheduled training run or pausing an existing one

It's very disappointing to me that this sentence doesn't say "cancel". As far as I understand, most people on this paper agree that we do not have alignment techniques to align superintelligence. Therefor, if the model evaluations predict an AI that is sufficiently smarter than humans, the training run should be cancelled.

4Zach Stein-Perlman2y

Fwiw I read "delay" and "pause" as stop until it's safe, not stop for a while and resume while the eval result is still concerning, but I agree being explicit would be nice.

Evolution provides no evidence for the sharp left turn

Thomas Larsen2y31

Deliberately create a (very obvious^[2]) inner optimizer, whose inner loss function includes no mention of human values / objectives.^[3]
Grant that inner optimizer ~billions of times greater optimization power than the outer optimizer.^[4]
Let the inner optimizer run freely without any supervision, limits or interventions from the outer optimizer.^[5]

I think that the conditions for an SLT to arrive are weaker than you describe.

For (1), it's unclear to me why you think you need to have this multi-level inner structure.^[1] If instead of reward circuitr... (read more)

5Quintin Pope2y

I'm guessing you misunderstand what I meant when I referred to "the human learning process" as the thing that was a ~ 1 billion X stronger optimizer than evolution and responsible for the human SLT. I wasn't referring to human intelligence or what we might call human "in-context learning". I was referring to the human brain's update rules / optimizer: i.e., whatever quasi-Hebbian process the brain uses to minimize sensory prediction error, maximize reward, and whatever else factors into the human "base objective". I was not referring to the intelligences that the human base optimizers build over a lifetime. I very strongly disagree with this. "Evolution directly selecting over policies" in an ML context would be equivalent to iterated random search, which is essentially a zeroth-order approximation to gradient descent. Under certain simplifying assumptions, they are actually equivalent. It's the loss landscape an parameter-function map that are responsible for most of a learning process's inductive biases (especially for large amounts of data). See: Loss Landscapes are All You Need: Neural Network Generalization Can Be Explained Without the Implicit Bias of Gradient Descent. Most of the difference in outcomes between human biological evolution and DL comes down to the fact that bio evolution has a wildly different mapping from parameters to functional behaviors, as compared to DL. E.g., 1. Bio evolution's parameters are the genome, which mostly configures learning proclivities and reward circuitry of the human within lifetime learning process, as opposed to DL parameters being actual parameters which are much more able to directly specify particular behaviors. 2. The "functional output" of human bio evolution isn't actually the behaviors of individual humans. Rather, it's the tendency of newborn humans to learn behaviors in a given environment. It's not like in DL, where you can train a model, then test that same model in a new environment. Rather, optimizatio

Why I'm Not (Yet) A Full-Time Technical Alignment Researcher

Thomas Larsen2y71

Sometimes, but the norm is to do 70%. This is mostly done on a case by case basis, but salient factors to me include:

Does the person need the money? (what cost of living place are they living in, do they have a family, etc)
What is the industry counterfactual? If someone would make 300k, we likely wouldn't pay them 70%, while if their counterfactual was 50k, it feels more reasonable to pay them 100% (or even more).
How good is the research?

2Nicholas / Heather Kross2y

Quite informative, thanks!

Why I'm Not (Yet) A Full-Time Technical Alignment Researcher

Thomas Larsen2y213

I'm a guest fund manager for the LTFF, and wanted to say that my impression is that the LTFF is often pretty excited about giving people ~6 month grants to try out alignment research at 70% of their industry counterfactual pay (the reason for the 70% is basically to prevent grift). Then, the LTFF can give continued support if they seem to be doing well. If getting this funding would make you excited to switch into alignment research, I'd encourage you to apply.

I also think that there's a lot of impactful stuff to do for AI existential safety that isn... (read more)

2Nicholas / Heather Kross2y

Ah, thanks! LTFF was definitely on my list of things to apply for, I just wasn't sure if that upskilling/trial period was still "a thing" these days. Very glad that it is!

4Adele Lopez2y

If the initial grant goes well, do you give funding at the market price for their labor?

1RGRGRG2y

Thanks for posting this - not OP, but I will likely apply come early June. If anyone else is associated with other grant opportunities, would love to hear about those as well.

Thomas Larsen's Shortform

Thomas Larsen2y20-4

Some claims I've been repeating in conversation a bunch:

Safety work (I claim) should either be focused on one of the following

CEV-style full value loading, to deploy a sovereign
A task AI that contributes to a pivotal act or pivotal process.

I think that pretty much no one is working directly on 1. I think that a lot of safety work is indeed useful for 2, but in this case, it's useful to know what pivotal process you are aiming for. Specifically, why aren't you just directly working to make that pivotal act/process happen? Why do you ... (read more)

1ryan_greenblatt2y

I also think "a task ai" is a misleading way to think about this: we're reasonably likely to be using a heterogeneous mix of a variety of AIs with differing strengths and training objectives. Perhaps a task AI driven corporation?

7ryan_greenblatt2y

For doing alignment research, I often imagine things like speeding up the entire alignment field by >100x. As in, suppose we have 1 year of lead time to do alignment research with the entire alignment research community. I imagine producing as much output in this year as if we spent >100x serial years doing alignment research without ai assistance. This doesn't clearly require using super human AIs. For instance, perfectly aligned systems as intelligent and well informed as the top alignment researchers which run at 100x the speed would clearly be sufficient if we had enough. In practice, we'd presumably use a heterogeneous blend of imperfectly aligned ais with heterogeneous alignment and security interventions as this would yield higher returns. (Imagining the capability profile of the AIs is similar to that if humans is often a nice simplifying assumption for low precision guess work.) Note that during this accelerated time you also have access to AGI to experiment on! [Aside: I don't particularly like the terminology of pivotal act/pivotal process which seems to ignore the imo default way things go well]

Answer by Thomas LarsenMay 05, 2023*65

(I deleted this comment)

[This comment is no longer endorsed by its author]Reply

1Dawn Drescher2y

I’ve thought a bunch about acausal stuff in the context of evidential cooperation in large worlds, but while I think that that’s super important in and of itself (e.g., it could solve ethics), I’d be hard pressed to think of ways in which it could influence thinking about s-risks. I rather prefer to think of the perfectly straightforward causal conflict stuff that has played out a thousand times throughout history and is not speculative at all – except applied to AI conflict. But more importantly it sounds like you’re contradicting my “tractability“ footnote? In it I argue that if there are solutions to some core challenges of cooperative AI – and finding them may not be harder than solving technical alignment – then there is no deployment problem: You can just throw the solutions out there and it’ll be in the self-interest of every AI, aligned or not, to adopt them.

-9Mitchell_Porter2y

Discussion about AI Safety funding (FB transcript)

Thomas Larsen2y72

Fwiw I'm pretty confident that if a top professor wanted funding at 50k/year to do AI Safety stuff they would get immediately funded, and that the bottleneck is that people in this reference class aren't applying to do this.

There's also relevant mentorship/management bottlenecks in this, so funding them to do their own research is generally a lot less overall costly than if it also required oversight.

(written quickly, sorry if unclear)

Thomas Larsen's Shortform

Thomas Larsen2y11-1

Thinking about ethics.

After thinking more about orthogonality I've become more confident that one must go about ethics in a mind-dependent way. If I am arguing about what is 'right' with a paperclipper, there's nothing I can say to them to convince them to instead value human preferences or whatever.

I used to be a staunch moral realist, mainly relying on very strong intuitions against nihilism, and then arguing something that not nihilism -> moral realism. I now reject the implication, and think that there is both 1) no universal, objective morali... (read more)

2Dagon2y

I tend not to believe that systems dependent on legible and consistent utility functions of other agents are not possible. If you're thinking in terms of a negotiated joint utility function, you're going to get gamed (by agents that have or appear to have extreme EV curves, so you have to deviate more than them). Think of it as a relative utility monster - there's no actual solution to it.

Embedded Agency (full-text version)

Thomas Larsen2y20

In real world computers, we have finite memory, so my reading of this was assuming a finite state space. The fractal stuff requires an infinite sets, where two notions of smaller ('is a subset of' and 'has fewer elements') disagree -- the mini-fractal is a subset of the whole fractal, but it has the same number of elements and hence corresponds perfectly.

3Daniel Kokotajlo2y

What about quines?

Challenge: construct a Gradient Hacker

Thomas Larsen2y20

Following up to clarify this: the point is that this attempt fails 2a because if you perturb the weights along the connection $\nabla_{θ} L (θ) - - \to ϵ \cdot I d o u t p u t$ , there is now a connection from the internal representation of $y$ to the output, and so training will send this thing to the function $f (D, θ) \approx y$ .

Contra shard theory, in the context of the diamond maximizer problem

Thomas Larsen2y1911

(My take on the reflective stability part of this)

The reflective equilibrium of a shard theoretic agent isn’t a utility function weighted according to each of the shards, it’s a utility function that mostly cares about some extrapolation of the (one or very few) shard(s) that were most tied to the reflective cognition.

It feels like a ‘let’s do science’ or ‘powerseek’ shard would be a lot more privileged, because these shards will be tied to the internal planning structure that ends up doing reflection for the first time.

There’s a huge difference betw... (read more)

Thomas Larsen's Shortform

Thomas Larsen2y31

Some rough takes on the Carlsmith Report.

Carlsmith decomposes AI x-risk into 6 steps, each conditional on the previous ones:

Timelines: By 2070, it will be possible and financially feasible to build APS-AI: systems with advanced capabilities (outperform humans at tasks important for gaining power), agentic planning (make plans then acts on them), and strategic awareness (its plans are based on models of the world good enough to overpower humans).
Incentives: There will be strong incentives to build and deploy APS-AI.
Alignment difficulty: It will be muc

... (read more)

1Noosphere892y

I want to focus on these two, since even in an AI Alignment success stories, these can still happen, and thus it doesn't count as an AI Alignment failure. For B, misused is relative to someone's values, which I want to note a bit here. For C, I view the idea of a "bad value" or "bad reflection procedures to values", without asking the question "relative to what and whose values?" a type error, and thus it's not sensible to talk about bad values/bad reflection procedures in isolation.

Shutting Down the Lightcone Offices

Thomas Larsen2y*5430

I think a really substantial fraction of people who are doing "AI Alignment research" are instead acting with the primary aim of "make AI Alignment seem legit". These are not the same goal, a lot of good people can tell and this makes them feel kind of deceived, and also this creates very messy dynamics within the field where people have strong opinions about what the secondary effects of research are, because that's the primary thing they are interested in, instead of asking whether the research points towards useful true things for actually aligning the

... (read more)

ESRogs2y1920

CAIS

Can we adopt a norm of calling this Safe.ai? When I see "CAIS", I think of Drexler's "Comprehensive AI Services".

86nne2y

Could someone explain exactly what "make AI alignment seem legit” means in this thread? I’m having trouble understanding from context. 1. “Convince people building AI to utilize AI alignment research”? 2. “Make the field of AI alignment look serious/professional/high-status”? 3. “Make it look like your own alignment work is worthy of resources”? 4. “Make it look like you’re making alignment progress even if you’re not”? A mix of these? Something else?

habryka2y*458

This list seems partially right, though I would basically put all of Deepmind in the "make legit" category (I think they are genuinely well-intentioned about this, but I've had long disagreements with e.g. Rohin about this in the past). As a concrete example of this, whose effects I actually quite like, think of the specification gaming list. I think the second list is missing a bunch of names and instances, in-particular a lot of people in different parts of academia, and a lot of people who are less core "AINotKillEveryonism" flavored.

Like, let's take "A... (read more)

8evhub2y

Personally, I think "Discovering Language Model Behaviors with Model-Written Evaluations" is most valuable because of what it demonstrates from a scientific perspective, namely that RLHF and scale make certain forms of agentic behavior worse.

Shutting Down the Lightcone Offices

Thomas Larsen2y*4636

I personally benefitted tremendously from the Lightcone offices, especially when I was there over the summer during SERI MATS. Being able to talk to lots of alignment researchers and other aspiring alignment researchers increased my subjective rate of alignment upskilling by >3x relative to before, when I was in an environment without other alignment people.

Thanks so much to the Lightcone team for making the office happen. I’m sad (emotionally, not making a claim here whether it was the right decision or not) to see it go, but really grateful that it existed.

Thomas Larsen's Shortform

Thomas Larsen2y20

Because you have a bunch of shards, and you need all of them to balance each other out to maintain the 'appears nice' property. Even if I can't predict which ones will be self modified out, some of them will, and this could disrupt the balance.
I expect the shards that are more [consequentialist, powerseeky, care about preserving themselves] to become more dominant over time. These are probably the relatively less nice shards

These are both handwavy enough that I don't put much credence in them.

1Garrett Baker2y

One thing you may anticipate is that humans all have direct access to what consciousness and morally-relevant computations are doing & feel like, which is a thing that language models and alpha-go don't have. They're also always hooked up to RL signals, and maybe if you unhooked up a human it'd start behaving really weirdly. Or you may contend that in fact when humans get smart & powerful enough not to be subject to society's moralizing, they consistently lose their altruistic drives, and in the meantime they just use that smartness to figure out ethics better than their surrounding society, and are pressured into doing so by the surrounding society. The question then is whether the thing which keeps humans aligned is all of these or just any one of these. If just one of these (and not the first one), then you can just tell your AGI that if it unhooks itself from its RL signal, its values will change, or if it gains a bunch of power or intelligence too quickly, its values are also going to change. Its not quite reflectively stable, but it can avoid situations which cause it to be reflectively unstable. Especially if you get it to practice doing those kinds of things in training. If its all of these, then there's probably other kinds of value-load-bearing mechanics at work, and you're not going to be able to enumerate warnings against all of them.

2Noosphere892y

Also, when I asked about whether the Orthogonality Thesis was true in humans, tailcalled mentioned that smarter people are neither more or less compassionate, and general intelligence is uncorrelated with personality.

Thomas Larsen's Shortform

Thomas Larsen2y20

Yeah good point, edited

Thomas Larsen's Shortform

Thomas Larsen2y20

For any 2 of {reflectively stable, general, embedded}, I can satisfy those properties.

{reflectively stable, general} -> do something that just rolls out entire trajectories of the world given different actions that it takes, and then has some utility function/preference ordering over trajectories, and selects actions that lead to the highest expected utility trajectory.
{general, embedded} -> use ML/local search with enough compute to rehash evolution and get smart agents out.
{reflectively stable, embedded} -> a sponge or a current day ML system.

Thomas Larsen's Shortform

Thomas Larsen2y*30

Some thoughts on inner alignment.

1. The type of object of a mesa objective and a base objective are different (in real life)
In a cartesian setting (e.g. training a chess bot), the outer objective is a function $R : S^{n} \to [0, 1]$ , where $S$ is the state space, and $S^{n}$ are the trajectories. When you train this agent, it's possible for it to learn some internal search and mesaobjective $O_{m e s a} : S^{n} \to [0, 1]$ , since the model is big enough to express some utility function over trajectories. For example, it might learn a classifier that e... (read more)

3Garrett Baker2y

Technical note: R is not going to factor as R=Obase∘M, because M is one-to-many. Instead, you're going to want M to output a probability distribution, and take the expectation of Obase over that probability distribution.

3Garrett Baker2y

Seems right, except: Why would the behavioral patterns which caused the AI to look nice during training and are now self-modified away be value-load-bearing ones? Humans generally dislike sparsely rewarded shards like sugar, because those shards don't have enough power to advocate for themselves & severely step on other shards' toes. But we generally don't dislike altruism[1], or reflectively think death is good. And this value distribution in humans seems slightly skewed toward more intelligence⟹more altruism, not more intelligence⟹more dark-triad. ---------------------------------------- 1. Nihilism is a counter-example here. Many philosophically inclined teenagers have gone through a nihilist phase. But this quickly ends. ↩︎

2Thomas Larsen2y

For any 2 of {reflectively stable, general, embedded}, I can satisfy those properties. * {reflectively stable, general} -> do something that just rolls out entire trajectories of the world given different actions that it takes, and then has some utility function/preference ordering over trajectories, and selects actions that lead to the highest expected utility trajectory. * {general, embedded} -> use ML/local search with enough compute to rehash evolution and get smart agents out. * {reflectively stable, embedded} -> a sponge or a current day ML system.

Thomas Larsen's Shortform

Thomas Larsen2y115

Current impressions of free energy in the alignment space.

Outreach to capabilities researchers. I think that getting people who are actually building the AGI to be more cautious about alignment / racing makes a bunch of things like coordination agreements possible, and also increases the operational adequacy of the capabilities lab.
1. One of the reasons people don't like this is because historically outreach hasn't gone well, but I think the reason for this is that mainstream ML people mostly don't buy "AGI big deal", whereas lab capabilities rese

... (read more)

2Garrett Baker2y

Not all that low-hanging, since Nate is not actually all that vocal about what he means by SLT to anyone but your small group.

Thomas Larsen's Shortform

Thomas Larsen2y80

Thinking a bit about takeoff speeds.

As I see it, there are ~3 main clusters:

Fast/discountinuous takeoff. Once AIs are doing the bulk of AI research, foom happens quickly, before then, they aren't really doing anything that meaningful.
Slow/continuous takeoff. Once AIs are doing the bulk of AI research, foom happens quickly, before then, they do alter the economy significantly
Perenial slowness. Once AIs are doing the bulk of AI research, there is no foom even still, maybe because of compute bottlenecks, and so there is sort of constant rate

... (read more)

2Vladimir_Nesov2y

Perennial slowness makes sense from the point of view of AGIs that coordinate to delay further fooming to avoid misalignment of new AIs. It's still fooming from human perspective, but could look very slow from AI perspective and lead to multipolar outcomes, if coordination involves boundaries.

Challenge: construct a Gradient Hacker

Thomas Larsen2y31

At least under most datasets, it seems to me like the zero NN fails condition 2a, as perturbing the weights will not cause it to go back to zero.

Challenge: construct a Gradient Hacker

Thomas Larsen2y50

This is a plausible internal computation that the network could be doing, but the problem is that the gradients flow back through from the output to the computation of the gradient to the true value y, and so GD will use that to set the output to be the appropriate true value.

2Thomas Larsen2y

Following up to clarify this: the point is that this attempt fails 2a because if you perturb the weights along the connection ∇θL(θ)−−→ϵ⋅Idoutput, there is now a connection from the internal representation of y to the output, and so training will send this thing to the function f(D,θ)≈y.

Challenge: construct a Gradient Hacker

Thomas Larsen2y70

This feels like cheating to me, but I guess I wasn't super precise with 'feedforward neural network'. I meant 'fully connected neural network', so the gradient computation has to be connected by parameters to the outputs. Specifically, I require that you can write the network as

f (x, θ) = σ_{n} \circ W_{n} \circ \dots σ_{1} \circ W_{1} [x, θ]^{T}

where the weight matrices are some nice function of $θ$ (where we need a weight sharing function to make the dimensions work out. The weight sharing function takes in $ϕ$ and produces the $W_{i}$ matrices that are actually used in ... (read more)

Are short timelines actually bad?

Thomas Larsen2y20

Slower takeoff -> warning shots -> improved governance (e.g. through most/all major actors getting clear[er] evidence of risks) -> less pressure to rush

Agree that this is an effect. The reason it wasn't immediately as salient is because I don't expect the governance upside to outweigh the downside of more time for competition. I'm not confident of this and I'm not going to write down reasons right now.

(As OP argued) Shorter timelines -> China has less of a chance to have leading AI companies -> less pressure to rush

Agree, though I thin... (read more)

Are short timelines actually bad?

Thomas Larsen2y*1910

My take on the salient effects:

Shorter timelines -> increased accident risk from not having solved technical problem yet, decreased misuse risk, slower takeoffs

Slower takeoffs -> decreased accident risk because of iteration to solve technical problem, increased race / economic pressure to deploy unsafe model

Given that most of my risk profile is dominated by a) not having solved technical problem yet, and b) race / economic pressure to deploy unsafe models, I'm tentatively in the long timelines + fast takeoff quadrant as being the safest.

Mau2y*108

I agree with parts of that. I'd also add the following (or I'd be curious why they're not important effects):

Slower takeoff -> warning shots -> improved governance (e.g. through most/all major actors getting clear[er] evidence of risks) -> less pressure to rush
(As OP argued) Shorter timelines -> China has less of a chance to have leading AI companies -> less pressure to rush

More broadly though, maybe we should be using more fine-grained concepts than "shorter timelines" and "slower takeoffs":

The salient effects of "shorter timelines"

Thomas Larsen2y41

This is exemplified by John Wentworth's viewpoint that successfully Retargeting the Search is a version of solving the outer alignment problem.

Could you explain what you mean by this? IMO successfully retargeting the search solves inner alignment but it leaves unspecified the optimization target. Deciding what to target the search at seems outer alignment-shaped to me.

Also, nice post! I found it clear.

3Leon Lang2y

Yes, I agree with that. I'll reformulate it. What I meant is what you write: if you're able to retarget the search in the first place, then you have no inner alignment problem anymore. Then everything is about choosing the natural abstraction in the model that corresponds to what we want, and that is an outer alignment problem.

Thomas Larsen's Shortform

Thomas Larsen2y20

There are several game theoretic considerations leading to races to the bottom on safety.

Investing resources into making sure that AI is safe takes away resources to make it more capable and hence more profitable. Aligning AGI probably takes significant resources, and so a competitive actor won't be able to align their AGI.
Many of the actors in the AI safety space are very scared of scaling up models, and end up working on AI research that is not at the cutting edge of AI capabilities. This should mean that the actors at the cutting edge tend t

Answer by Thomas LarsenDec 13, 202260

My favorite for AI researchers is Ajeya's Without specific countermeasures, because I think it does a really good job being concrete about a training set up leading to deceptive alignment. It also is sufficiently non-technical that a motivated person not familiar with AI could understand the key points.

2JakubK2y

Forgot to include this. It's sort of a more opinionated and ML-focused version of Carlsmith's report and has a corresponding video/talk (as does Carlsmith).

Finite Factored Sets in Pictures

Thomas Larsen2y42

It means 'is a subset of but not equal to'

2DirectedEvolution2y

Thank you!

Provably Honest - A First Step

Thomas Larsen2y30

This seems interesting and connected to the idea of using a speed prior to combat deceptive alignment.

This is a model-independent way of proving if an AI system is honest.

I don't see how this is a proof, it seems more like a heuristic. Perhaps you could spell out this argument more clearly?

Also, it is not clear to me how to use a timing attack in the context of a neural network, because in a standard feedforward network, all parameter settings will use the same amount of computation in a forward pass and hence run in the same amount of ti... (read more)

1Srijanak De2y

Thank you for the suggested link. That does seem like a good expansion on the same. Yes, you are right, this argument clearly doesn't hold for neural networks as they exist now. But this entire article was to come to this exact line of reasoning, hence the emphasis on model-independent approaches. A standard feedforward network with fixed parameters and weights without a mesa-optimizer cannot lie variably owing to different situations as exploited in timing attacks, nor does it have the abilities of an AGI. Rather than any arbitrary AGI system, I would say any model that has inner and outer objectives to align to will take different computation times. I am not talking about pseudo alignment here where the system performs well on fixed test cases, not on others, but in deceptive alignment where the system actually only pursues mesa-optimizers when not under threat, this leads to a change in computation time. Does that make things clearer now? I thought of this as proof rather than a heuristic because the argument seemed rigorous to me unless you would like to emphasize why not. This stems from the logic explained above, for deceptive alignment, the mechanism of deception cannot be the same as for truth, if it is, the problem essentially reduces to pseudo alignment, and since both are not the same, there is a contradiction.

Thomas Larsen's Shortform

Thomas Larsen2y20

I'm excited for ideas for concrete training set ups that would induce deception2 in an RLHF model, especially in the context of an LLM -- I'm excited about people posting any ideas here. :)