Often you can compare your own Fermi estimates with those of other people, and that’s sort of cool, but what’s way more interesting is when they share what variables and models they used to get to the estimate. This lets you actually update your model in a deeper way.

From a Paul Christiano talk called "How Misalignment Could Lead to Takeover" (February 2023):

Assume we're in a world where AI systems are broadly deployed, and the world has become increasingly complex, with humans knowing less and less about how things work.

A viable strategy for AI takeover is to wait until there is certainty of success. If a 'bad AI' is smart, it will realize it won't succeed if it tries to take over now, so it won't try; not yet a problem. You lose when a takeover becomes possible and some threshold of AIs behave badly. If all the smartest AIs you produce end up having the same problem, takeover becomes more likely: the smartest AIs will only attempt a takeover if the other AIs will also try.

This is more likely to happen in an unstable world, which could come from:

* War (or other exogenous shocks): making it harder for humans to detect and respond to problems.
* Cascading loss of control: some AIs attempt takeover and aren't fully successful, but make the world increasingly crazy, which cascades into other AIs (who previously judged takeover unlikely to succeed) now joining in.
* Models interacting with each other: once some AIs realize a takeover could succeed, they may want to share that information with other models to see whether those models want to join in.

There's a tempting move of training your AI against these takeover attempts ("hey, look at this bad AI who tried to take over the data center, don't do that!"), but you may just be teaching your model that it needs to go much bigger if it wants to actually succeed at the takeover attempt.

Paul believes that if this kind of phenomenon is real, we can probably get compelling demonstrations of it in a lab (though it would take some imagination to bridge from the lab examples to the wild). We'll also get demonstrations in the wild, but it's unclear whether they will be big enough to make humanity reconsider things.
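The "threshold of AIs behaving badly" dynamic is essentially a cascade model. As a toy illustration (my own sketch, not from the talk; all thresholds and numbers are made up), you can treat each AI as joining a takeover attempt only once enough others have joined, and watch whether a small shock tips the population:

```python
# Toy threshold-cascade sketch (illustrative only, not from the talk).
# Each AI joins a takeover attempt only once the number of current
# participants reaches its personal threshold; a "shock" makes a few AIs
# defect unconditionally, and we iterate to a fixed point.
import random

def equilibrium(thresholds):
    """Iterate joined = #{i : threshold_i <= joined} until it stabilizes."""
    joined = 0
    while True:
        new = sum(1 for t in thresholds if t <= joined)
        if new == joined:
            return joined
        joined = new

random.seed(0)
n = 1000
base = [random.randint(1, n) for _ in range(n)]  # hypothetical threshold distribution

for shock in (0, 10, 50, 200):
    thresholds = base.copy()
    for i in range(shock):       # the shock: these AIs defect no matter what
        thresholds[i] = 0
    print(f"shock={shock:4d}  participants at equilibrium={equilibrium(thresholds)}")
```

Whether a given shock cascades depends entirely on how the thresholds are distributed, which is the sense in which exogenous instability (war, partial takeovers, model-to-model communication) matters.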
tlevin
I think some of the AI safety policy community has over-indexed on the visual model of the "Overton Window" and under-indexed on alternatives like the "ratchet effect," "poisoning the well," "clown attacks," and other models where proposing radical changes can make you, your allies, and your ideas look unreasonable (edit to add: whereas successfully proposing minor changes achieves hard-to-reverse progress, making ideal policy look more reasonable).

I'm not familiar with a lot of systematic empirical evidence on either side, but it seems to me like the more effective actors in the DC establishment overall are much more in the habit of looking for small wins that are both good in themselves and shrink the size of the ask for their ideal policy than of pushing for their ideal vision and then making concessions. Possibly an ideal ecosystem has both strategies, but it seems possible that at least some versions of "Overton Window-moving" strategies executed in practice have larger negative effects via associating their "side" with unreasonable-sounding ideas in the minds of very bandwidth-constrained policymakers, who strongly lean on signals of credibility and consensus when quickly evaluating policy options, than the positive effects of increasing the odds of ideal policy and improving the framing for non-ideal but pretty good policies.

In theory, the Overton Window model is just a description of what ideas are taken seriously, so it can indeed accommodate backfire effects where you argue for an idea "outside the window" and this actually makes the window narrower. But I think the visual imagery of "windows" actually struggles to accommodate this -- when was the last time you tried to open a window and accidentally closed it instead? -- and as a result, people who rely on this model are more likely to underrate these kinds of consequences.

Would be interested in empirical evidence on this question (ideally actual studies from the psych, political science, sociology, econ, etc. literatures, rather than specific case studies, due to reference class tennis type issues).
TurnTrout
A semi-formalization of shard theory. I think there is a surprisingly deep link between "the AIs which can be manipulated using steering vectors" and "policies which are made of shards."[1] In particular, here is a candidate definition of a shard theoretic policy:

> A policy has shards if it implements at least two "motivational circuits" (shards) which can independently activate (more precisely, the shard activation contexts are compositionally represented).

By this definition, humans have shards because they can want food at the same time as wanting to see their parents again, and both factors can affect their planning at the same time. The maze-solving policy is made of shards because we found activation directions for two motivational circuits (the cheese direction, and the top-right direction).

On the other hand, AIXI is not a shard theoretic agent because it does not have two motivational circuits which can be activated independently of each other; it's just maximizing one utility function. A mesa optimizer with a single goal also does not have two motivational circuits which can go on and off in an independent fashion.

* This definition also makes obvious the fact that "shards" are a matter of implementation, not of behavior.
* It also captures the fact that "shard" definitions are somewhat subjective. In one moment, I might model someone as having a separate "ice cream shard" and "cookie shard", but in another situation I might choose to model those two circuits as a larger "sweet food shard."

So I think this captures something important. However, it leaves a few things to be desired:

* What, exactly, is a "motivational circuit"? Obvious definitions seem to include every neural network with nonconstant outputs.
* Demanding a compositional representation is unrealistic, since it ignores superposition. If k dimensions are compositional, they must be pairwise orthogonal, so a transformer could only have k ≤ d_model shards, which seems obviously wrong.

That said, I still find this definition useful. I came up with this last summer but never got around to posting it. Hopefully this is better than nothing.

1. ^ Shard theory reasoning led me to discover the steering vector technique extremely quickly. This link would explain why shard theory might help discover such a technique.
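To make "independently activatable motivational circuits" concrete, here is a minimal sketch in the spirit of the maze-solving steering-vector work (shapes, names, and directions are illustrative assumptions, not the actual policy or its learned directions):

```python
import numpy as np

d_model = 512                      # hypothetical residual-stream width
rng = np.random.default_rng(0)

# Stand-ins for empirically found motivational directions
# (e.g. a "cheese" direction and a "top-right" direction).
cheese_dir = rng.normal(size=d_model)
cheese_dir /= np.linalg.norm(cheese_dir)
top_right_dir = rng.normal(size=d_model)
top_right_dir /= np.linalg.norm(top_right_dir)

def steer(activations: np.ndarray, alpha: float, beta: float) -> np.ndarray:
    """Add the two shard directions with independently chosen coefficients.

    The definition above corresponds to alpha and beta being independently
    settable: you can turn the cheese-shard up while leaving the top-right
    shard untouched, and vice versa."""
    return activations + alpha * cheese_dir + beta * top_right_dir

acts = rng.normal(size=d_model)            # stand-in residual-stream activation
only_cheese = steer(acts, alpha=5.0, beta=0.0)
both_shards = steer(acts, alpha=5.0, beta=3.0)
```

This sketch treats the two directions as separate, roughly orthogonal axes, which is exactly the compositionality assumption the superposition caveat above pushes back on.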
Looking for blog platform/framework recommendations

I had a WordPress blog, but I don't like WordPress and I want to move away from it. Substack doesn't seem like a good option because I want high customizability and multilingual support (my blog is going to be in English and Hebrew). I would like something that I can use for free with my own domain (so not Wix). The closest thing I found to what I'm looking for was MkDocs Material, but it's still geared too much towards documentation, and I don't like its blog functionality enough.

Other requirements: dark/light mode, RSS, newsletter support.

Does anyone have another suggestion? It's fine if it requires a bit of technical skill (though better if it doesn't).
Richard_Ngo
Hypothesis: there's a way of formalizing the notion of "empowerment" such that an AI with the goal of empowering humans would be corrigible. This is not straightforward, because an AI that simply maximized human POWER (as defined by Turner et al.) wouldn't ever let the humans spend that power. Intuitively, though, there's a sense in which a human who can never spend their power doesn't actually have any power. Is there a way of formalizing that intuition?

The direction that seems most promising is in terms of counterfactuals (or, alternatively, Pearl's do-calculus). Define the power of a human with respect to a distribution of goals G as the average ability of a human to achieve their goal if they'd had a goal sampled from G (alternatively: under an intervention that changed their goal to one sampled from G). Then an AI with a policy of never letting humans spend their resources would result in humans having low power. Instead, a human-power-maximizing AI would need to balance between letting humans pursue their goals, and preventing humans from doing self-destructive actions. The exact balance would depend on G, but one could hope that it's not very sensitive to the precise definition of G (especially if the AI isn't actually maximizing human power, but is more like a quantilizer, or is optimizing under pessimistic assumptions).

The problem here is that these counterfactuals aren't very clearly-defined. E.g. imagine the hypothetical world where humans valued paperclips instead of love. Even a little knowledge of evolution would tell you that this hypothetical is kinda crazy, and maybe the question "what would the AI be doing in this world?" has no sensible answer (or maybe the answer would be "it would realize it's in a weird hypothetical world and behave accordingly"). Similarly, if we model this using the do-operation, the best policy is something like "wait until the human's goals suddenly and inexplicably change, then optimize hard for their new goal".

Having said that, in some sense what it means to model someone as an agent is that you can easily imagine them pursuing some other goal. So the counterfactuals above might not be too unnatural; or at least, no more unnatural than any other intervention modeled by Pearl's do-operator. Overall this line of inquiry seems promising and I plan to spend more time thinking about it.
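One way to write down the counterfactual definition sketched here (my notation, not Richard's), letting V denote how well the human does at a goal g under a human policy, in the environment induced by the AI's policy:

```latex
% Counterfactual human power w.r.t. a goal distribution G (notation is mine):
\mathrm{Power}_G(\pi_{\mathrm{AI}})
  \;=\; \mathbb{E}_{g \sim G}\!\left[\, \max_{\pi_{\mathrm{H}}} V^{\pi_{\mathrm{H}}}_{g}\!\bigl(\pi_{\mathrm{AI}}\bigr) \right]
% i.e. the average attainable value if the human's goal were intervened on
% (do(goal = g)) and they then pursued it. An AI that never lets the human
% spend resources scores poorly here, because attained value, not hoarded
% resources, is what gets averaged.
```

The do-style intervention on the goal is what distinguishes this from Turner et al.'s POWER: resources the human is never allowed to deploy contribute nothing to the expectation.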

Popular Comments

Recent Discussion

5th generation military aircraft are extremely optimised to reduce their radar cross section. It is this ability above all others that makes the F-35 and the F-22 so capable: modern anti-aircraft weapons are very good, so the only safe way to fly over a well-defended area is not to be seen.

But wouldn't it be fairly trivial to detect a stealth aircraft optically?

This is what an F-35 looks like from underneath at about 10 by 10 pixels:

You and I can easily tell what that is (take a step back, or squint). So can GPT-4:

The image shows a silhouette of a fighter jet in the sky, likely flying at high speed. The clear blue sky provides a sharp contrast, making the aircraft's dark outline prominent. The

...
Yair Halberstadt
But then no need for stealth at all?
avturchin
They likely use them in places where no air defence is present, and still at some distance, using JDAMs. I think I missed the main thing about stealth: these aircraft are stealthy against radar at distances of around 100 km, but visible to radar at distances of around 10 km (arbitrary numbers). Optical observation at 100 km distances, however, is impossible (you would need large telescopes, and you would need to know where to look). The optical density of the atmosphere also starts to play a role, as does the curvature of the Earth.
Yair Halberstadt
Why would you need large telescopes? The naked eye has an angular resolution of about 30 m at 100 km; you need something only slightly better. A small lens should do it. Cameras and zoom lenses are well-understood, mass-produced components. And this is a highly parallelizable task.

Note you don't even need high resolution in all directions, just high enough to see whether it's worth zooming in/switching to a better camera.
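For concreteness, a rough diffraction-limit check (assumed numbers, order of magnitude only): resolving a roughly 15 m object at 100 km needs about 1.5e-4 rad of angular resolution, which an aperture of a few millimetres already provides in principle; in practice atmospheric seeing and contrast, not lens size, are the binding constraints.

```python
# Rough, assumed numbers; order-of-magnitude only.
wavelength = 550e-9    # green light, metres
target_size = 15.0     # rough aircraft scale, metres
distance = 100e3       # metres

required_resolution = target_size / distance                 # ~1.5e-4 rad
rayleigh_aperture = 1.22 * wavelength / required_resolution  # Rayleigh criterion

print(f"required angular resolution: {required_resolution:.1e} rad")
print(f"diffraction-limited aperture: {rayleigh_aperture * 1e3:.1f} mm")
```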


I've been going through the FAR AI videos from the alignment workshop in December 2023. I'd like people to discuss their thoughts on Shane Legg's 'necessary properties' that every AGI safety plan needs to satisfy. The talk is only 5 minutes, give it a listen:

Otherwise, here are some of the details:

All AGI Safety plans must solve these problems (necessary properties to meet at the human level or beyond):

  1. Good world model
  2. Good reasoning
  3. Specification of the values and ethics to follow

All of these require good capabilities, meaning capabilities and alignment are intertwined.

Shane thinks future foundation models will solve conditions 1 and 2 at the human level. That leaves condition 3, which he sees as solvable if you want fairly normal human values and ethics.

Shane basically thinks that if the above...

I think this is a great idea, except that on easy mode "a good specification of values and ethics to follow" means a few pages of text (or even just the prompt "do good things"), while in harder cases "a good specification of values" is a learning procedure that takes input from a broad sample of humanity, has carefully designed mechanisms that influence its generalization behavior in futuristic situations (probably trained on more datasets that had to be painstakingly collected), and has been engineered to work smoothly with the reasoning process without encouraging perverse behavior.

Mikhail Samin
Wow. This is hopeless. Pointing at agents that care about human values and ethics is, indeed, the harder part. No one has any idea how to approach this and solve the surrounding technical problems. If smart people think they do, they haven’t thought about this enough and/or aren’t familiar with existing work.
Answer by Chris_Leong
The biggest problem here is that it fails to account for other actors using such systems to cause chaos, and for the possibility that the offense-defense balance strongly favours the attacker, particularly if you've placed limitations on your systems that make them safer. Aligned human-ish-level AIs don't provide a victory condition.
mic
I agree that we want more progress on specifying values and ethics for AGI. The ongoing SafeBench competition by the Center for AI Safety has a category for this problem:

Nate Silver tries to answer the question: "How do people formulate their political beliefs?"

An important epistemological question that is, he says, under-discussed.

He lays out his theory:

I think political beliefs are primarily formulated by two major forces:

Politics as self-interest. Some issues have legible, material stakes. Rich people have an interest in lower taxes. Sexually active women (and men!) who don’t want to bear children have an interest in easier access to abortion. Members of historically disadvantaged groups have an interest in laws that protect their rights

Politics as personal identity — whose team are you on. But other issues have primarily symbolic stakes. These serve as vehicles for individual and group expression — not so much “identity politics” but politics as identity. People are trying to figure out where

...
Viliam

Some issues have legible, material stakes.

Scrolling down... in the table of how important individual topics are to young people, "student debt" is at the very bottom.

(Also, inflation on the very top? But isn't inflation a good thing if all you have is an enormous debt?)

ACX recently posted about the Rootclaim Covid origins debate, coming out in favor of zoonosis. Did the post change the minds of those who read it, or not? Did it change their judgment in favor of zoonosis (as was probably the goal of the post), or conversely did it make them think Lab Leak was more likely (as the "Don't debate conspiracy theorists" theory claims)?

I analyzed the ACX survey to find out, by comparing responses before and after the post came out. The ACX survey asked readers whether they think the origin of Covid is more likely natural or Lab Leak. The ACX survey went out March 26th and was open until about April 10th. The Covid origins post came out March 28th, and the highlights on April...
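A minimal sketch of the kind of before/after comparison described here (the file name and column names are hypothetical, not the actual ACX survey export):

```python
import pandas as pd

# Hypothetical file and column names; the real survey export will differ.
df = pd.read_csv("acx_survey_2024.csv", parse_dates=["timestamp"])

cutoff = pd.Timestamp("2024-03-28")      # Covid-origins post publication date
before = df[df["timestamp"] < cutoff]
after = df[df["timestamp"] >= cutoff]

def lab_leak_share(group: pd.DataFrame) -> float:
    """Share of respondents answering that Lab Leak is more likely."""
    return (group["covid_origin"] == "Lab Leak").mean()

print("before post:", lab_leak_share(before))
print("after post: ", lab_leak_share(after))
```

The comparison is observational: respondents who answer before versus after the cutoff may differ systematically, so the difference is suggestive rather than a clean causal estimate.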

Viliam

This, and also most people on ACX respect Scott and his opinions, so if he demonstrates that he has put a lot of thought into this, and then he makes a conclusion, it will sound convincing to most.

Basically, we need to consider not just how many people believe some idea, but also how strongly. The typical situation with a conspiracy theory is that we have a small group that believes X very strongly, and a large group that believes non-X with various degrees of strength, from strongly to almost zero. What happens then is that people with a strong belief typ... (read more)

DusanDNesic
Is a lot of the effect not "people who read ACX trust Scott Alexander"? Like, the survey selects for most "passionate" readers, those willing to donate their free time to Scott for research with ~nothing in return. Him publicly stating on his platform "I am now much less certain of X" is likely to make that group of people be less certain of X?

This is the ninth post in my series on Anthropics. The previous one is The Solution to Sleeping Beauty.

Introduction

There are some quite pervasive misconceptions about betting in regards to the Sleeping Beauty problem.

One is that you need to switch between halfer and thirder stances based on the betting scheme proposed. As if learning about a betting scheme is supposed to affect your credence in an event.

Another is that halfers should bet at thirders odds and, therefore, thirdism is vindicated on the grounds of betting. What do halfers even mean by probability of Heads being 1/2 if they bet as if it's 1/3?

In this post we are going to correct them. We will understand how to arrive at correct betting odds from both thirdist and halfist positions, and...
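As a preview of the betting arithmetic (a standard calculation, not quoted from the post): with per-awakening settlement, the Tails branch settles twice, which is what produces "thirder odds" without the credence in Heads ever leaving 1/2.

```latex
% Per-awakening bet on Tails: Beauty pays q and receives 1 if the coin is Tails,
% settled at every awakening (one awakening under Heads, two under Tails).
% With halfer credence P(\text{Heads}) = 1/2:
\mathbb{E}[\text{profit}] \;=\; \tfrac{1}{2}(-q) \;+\; \tfrac{1}{2}\,\bigl(2(1-q)\bigr) \;=\; 1 - \tfrac{3}{2}q
% Break-even at q = 2/3, the same price a thirder would compute,
% driven purely by the doubled settlement rather than by the credence.
```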

First of all, it's certainly important to distinguish between a probability model and a strategy. The job of a probability model is simply to suggest the probability of certain events and to describe how probabilities are affected by the realization of other events. A strategy, on the other hand, guides decision-making toward certain predefined goals.

Of course. As soon as we are talking about goals and strategies we are not talking about just probabilities anymore. We are also talking about utilities and expected utilities. However, probabilities ... (read more)


Announcing open applications for the AI Safety Careers Course India 2024!

Axiom Futures has launched its flagship AI Safety Careers Course 2024 to equip emerging talent working in India with foundational knowledge in AI safety. Spread across 8-10 weeks, the program will provide candidates with key skills and networking opportunities to take their first step toward an impactful career in the domain. Each week will correspond to a curriculum module that candidates will be expected to complete and discuss with their cohort during the facilitated seminar. We expect some candidates to pursue applied projects of their choice.

The program is aimed at undergraduate and (Master’s/PhD) graduate students, and young professionals. If you are a high-school student with demonstrated interest in AI safety, we encourage you to apply. We expect applicants...

I am honored to be part of enabling more people from around the world to contribute to the safe and responsible development of AI. 

Effective Altruism (EA) is a movement trying to invest time and money in causes that do the most possible good per unit of effort. EA was at one point called optimal philanthropy. The label applies broadly, including a philosophy, a community, a set of organisations, and a set of behaviours. Likewise, it sometimes means how to donate effectively to charities, choose one's career, do the most good per $, do good in general, or ensure the most good happens. All of these different framings have slightly different implications.

The basic concept behind EA is that one would really struggle to donate 100 times more money or time to charity than one currently does, but spending a little time researching whom to donate to could have an impact of roughly this order of magnitude. The same argument works for doing good with your career or volunteer hours.

Historical EA funding data: https://forum.effectivealtruism.org/posts/ZbaDmowkXbTBsxvHn/historical-ea-funding-data (spreadsheet: https://docs.google.com/spreadsheets/d/1IeO7NIgZ-qfSTDyiAFSgH6dMn1xzb6hB2pVSdlBJZ88/edit#gid=771773474)

GPT-5 training is probably starting around now. It seems very unlikely that GPT-5 will cause the end of the world. But it’s hard to be sure. I would guess that GPT-5 is more likely to kill me than an asteroid, a supervolcano, a plane crash or a brain tumor. We can predict fairly well what the cross-entropy loss will be, but pretty much nothing else.

Maybe we will suddenly discover that the difference between GPT-4 and superhuman level is actually quite small. Maybe GPT-5 will be extremely good at interpretability, such that it can recursively self-improve by rewriting its own weights.

Hopefully model evaluations can catch catastrophic risks before wide deployment, but again, it’s hard to be sure. GPT-5 could plausibly be devious enough to circumvent all of...
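On "we can predict fairly well what the cross-entropy loss will be": this refers to scaling-law fits. A commonly used parametric form (Hoffmann et al., 2022; the constants below are symbolic, not specific fitted values) is:

```latex
% Parametric scaling-law form: L = predicted training loss,
% N = parameter count, D = training tokens, E = irreducible loss.
L(N, D) \;\approx\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
% Extrapolating such a fit predicts the next model's loss reasonably well
% while saying almost nothing about its downstream capabilities.
```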

Prometheus
My birds are singing the same tune.
Odd anon
Sam Altman confirmed (paywalled, sorry) in November that GPT-5 was already under development. (Interestingly, the confirmation was almost exactly six months after Altman told a senate hearing (under oath) that "We are not currently training what will be GPT-5; we don't have plans to do it in the next 6 months.")

It probably began training in January and finished around early April. And they're now doing evals.

MrCheeze
"Under development" and "currently training" I interpret as having significantly different meanings.
