Often you can compare your own Fermi estimates with those of other people, and that’s sort of cool, but what’s way more interesting is when they share what variables and models they used to get to the estimate. This lets you actually update your model in a deeper way.

Buck
[epistemic status: I think I'm mostly right about the main thrust here, but probably some of the specific arguments below are wrong. In the following, I'm much more stating conclusions than providing full arguments. This claim isn't particularly original to me.]

I'm interested in the following subset of risk from AI:

* Early: risk that comes from AIs that are just powerful enough to be extremely useful and dangerous-by-default (i.e. these AIs aren't wildly superhuman).
* Scheming: risk associated with loss of control to AIs that arises from AIs scheming.
  * So e.g. I exclude state actors stealing weights in ways that aren't enabled by the AIs scheming, and I also exclude non-scheming failure modes. IMO, state actors stealing weights is a serious threat, but non-scheming failure modes aren't (at this level of capability and dignity).
* Medium dignity: that is, developers of these AIs are putting a reasonable amount of effort into preventing catastrophic outcomes from their AIs (perhaps they're spending the equivalent of 10% of their budget on cost-effective measures to prevent catastrophes).
* Nearcasted: no substantial fundamental progress on AI safety techniques, no substantial changes in how AI works.

This subset of risk is interesting because I think it's a natural scenario at which to target technical work on AI safety. (E.g. it's the main scenario we're targeting with our AI control agenda.)

I claim that the majority of this risk comes from the lab using AIs internally to do AI development (by which I mean both research and engineering). This is because the AIs doing AI development naturally require access to compute and model weights that they can potentially leverage into causing catastrophic outcomes; in particular, those resources can be abused to run AIs unmonitored.

Using AIs for AI development looks uniquely risky to me among applications of early-transformative AIs, because unlike all other applications I know about:

* It's very expensive to refrain from using AIs for this application.
* There's no simple way to remove affordances from the AI such that it's very hard for the AI to take a small sequence of actions which plausibly lead quickly to loss of control. In contrast, most other applications of AI probably can be controlled just by restricting their affordances.

If I'm right that the risk from scheming early-transformative models is concentrated onto this pretty specific scenario, it implies a bunch of things:

* It implies that work on mitigating these risks should focus on this very specific setting.
* It implies that AI control is organizationally simpler, because most applications can be made trivially controlled.
* It is pretty plausible to me that AI control is quite easy, because you actually can remove affordances from the AIs that are doing AI R&D such that it's hard for them to cause problems.
tlevin
I think some of the AI safety policy community has over-indexed on the visual model of the "Overton Window" and under-indexed on alternatives like the "ratchet effect," "poisoning the well," "clown attacks," and other models where proposing radical changes can make you, your allies, and your ideas look unreasonable (edit to add: whereas successfully proposing minor changes achieves hard-to-reverse progress, making ideal policy look more reasonable).

I'm not familiar with a lot of systematic empirical evidence on either side, but it seems to me like the more effective actors in the DC establishment overall are much more in the habit of looking for small wins that are both good in themselves and shrink the size of the ask for their ideal policy than of pushing for their ideal vision and then making concessions. Possibly an ideal ecosystem has both strategies, but it seems possible that at least some versions of "Overton Window-moving" strategies executed in practice have larger negative effects via associating their "side" with unreasonable-sounding ideas in the minds of very bandwidth-constrained policymakers, who strongly lean on signals of credibility and consensus when quickly evaluating policy options, than the positive effects of increasing the odds of ideal policy and improving the framing for non-ideal but pretty good policies.

In theory, the Overton Window model is just a description of what ideas are taken seriously, so it can indeed accommodate backfire effects where you argue for an idea "outside the window" and this actually makes the window narrower. But I think the visual imagery of "windows" actually struggles to accommodate this -- when was the last time you tried to open a window and accidentally closed it instead? -- and as a result, people who rely on this model are more likely to underrate these kinds of consequences.

Would be interested in empirical evidence on this question (ideally actual studies from psych, political science, sociology, econ, etc. literatures, rather than specific case studies, due to reference class tennis type issues).
TurnTrout
A semi-formalization of shard theory. I think that there is a surprisingly deep link between "the AIs which can be manipulated using steering vectors" and "policies which are made of shards."[1] In particular, here is a candidate definition of a shard theoretic policy:

> A policy has shards if it implements at least two "motivational circuits" (shards) which can independently activate (more precisely, the shard activation contexts are compositionally represented).

By this definition, humans have shards because they can want food at the same time as wanting to see their parents again, and both factors can affect their planning at the same time! The maze-solving policy is made of shards because we found activation directions for two motivational circuits (the cheese direction, and the top-right direction).

On the other hand, AIXI is not a shard theoretic agent because it does not have two motivational circuits which can be activated independently of each other. It's just maximizing one utility function. A mesa optimizer with a single goal also does not have two motivational circuits which can go on and off in an independent fashion.

* This definition also makes obvious the fact that "shards" are a matter of implementation, not of behavior.
* It also captures the fact that "shard" definitions are somewhat subjective. In one moment, I might model someone as having a separate "ice cream shard" and "cookie shard", but in another situation I might choose to model those two circuits as a larger "sweet food shard."

So I think this captures something important. However, it leaves a few things to be desired:

* What, exactly, is a "motivational circuit"? Obvious definitions seem to include every neural network with nonconstant outputs.
* Demanding a compositional representation is unrealistic since it ignores superposition. If k dimensions are compositional, then they must be pairwise orthogonal. Then a transformer can only have k ≤ d_model shards, which seems obviously wrong and false.

That said, I still find this definition useful. I came up with this last summer, but never got around to posting it. Hopefully this is better than nothing.

1. ^ Shard theory reasoning led me to discover the steering vector technique extremely quickly. This link would explain why shard theory might help discover such a technique.
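As a toy illustration of what "compositionally represented activation contexts" could look like mechanically, here is a minimal sketch (my own construction, not from the shortform above): two orthogonal "shard directions" whose coefficients can be set independently, so either motivational circuit can be dialed up or down without affecting the other's readout. The direction names and coefficients are illustrative assumptions, not the actual maze-policy vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Two orthogonal "shard" directions (say, a cheese shard and a top-right shard).
# Orthogonality is what makes their activation contexts compositional here.
cheese_dir = rng.normal(size=d_model)
cheese_dir /= np.linalg.norm(cheese_dir)
topright_dir = rng.normal(size=d_model)
topright_dir -= (topright_dir @ cheese_dir) * cheese_dir  # Gram-Schmidt step
topright_dir /= np.linalg.norm(topright_dir)

def steer(activation, cheese_coeff=0.0, topright_coeff=0.0):
    """Add each shard's direction independently, like a steering-vector edit."""
    return activation + cheese_coeff * cheese_dir + topright_coeff * topright_dir

base = rng.normal(size=d_model)
both_on = steer(base, cheese_coeff=3.0, topright_coeff=3.0)
cheese_only = steer(base, cheese_coeff=3.0)

# Because the directions are orthogonal, toggling one shard leaves the other's
# readout unchanged: the two motivational circuits activate independently.
print(round((both_on - base) @ topright_dir, 3))      # ~3.0
print(round((cheese_only - base) @ topright_dir, 3))  # ~0.0
```

Superposition is exactly what this toy picture ignores: with more than d_model shards, the directions can no longer all be pairwise orthogonal, which is the objection raised in the second bullet above.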
I don't really know what people mean when they try to compare "capabilities advancements" to "safety advancements". In one sense, it's pretty clear: the common units are "amount of time", so we should compare the marginal (probabilistic) difference between time-to-alignment and time-to-doom. But I think practically people just look at vibes.

For example, if someone releases a new open source model, people say that's a capabilities advance and should not have been done. Yet I think there's a pretty good case that more well-trained open source models are better for time-to-alignment than for time-to-doom, since much alignment work ends up being done with them, and the marginal capabilities advance here is zero. Such work builds on the public state of the art, but not the private state of the art, which is probably far more advanced.

I also don't often see people making estimates of the time-wise differential impacts here. Maybe people think such things would be exfo/info-hazardous, but nobody even claims to have estimates here when the topic comes up (even in private, though people are glad to talk about their hunches for what AI will look like in 5 years, or the types of advancements necessary for AGI), despite all the work on timelines. It's difficult to do this for the marginal advance, but not so much for larger research priorities, which are the sorts of things people should be focusing on anyway.
@jessicata once wrote "Everyone wants to be a physicalist but no one wants to define physics". I decided to check the SEP article on physicalism and found that, yep, it doesn't have a definition of physics:

> Carl Hempel (cf. Hempel 1969, see also Crane and Mellor 1990) provided a classic formulation of this problem: if physicalism is defined via reference to contemporary physics, then it is false — after all, who thinks that contemporary physics is complete? — but if physicalism is defined via reference to a future or ideal physics, then it is trivial — after all, who can predict what a future physics contains? Perhaps, for example, it contains even mental items. The conclusion of the dilemma is that one has no clear concept of a physical property, or at least no concept that is clear enough to do the job that philosophers of mind want the physical to play.
>
> <...>
>
> Perhaps one might appeal here to the fact that we have a number of paradigms of what a physical theory is: common sense physical theory, medieval impetus physics, Cartesian contact mechanics, Newtonian physics, and modern quantum physics. While it seems unlikely that there is any one factor that unifies this class of theories, perhaps there is a cluster of factors — a common or overlapping set of theoretical constructs, for example, or a shared methodology. If so, one might maintain that the notion of a physical theory is a Wittgensteinian family resemblance concept.

This surprised me because I have a definition of a physical theory and assumed that everyone else uses the same one. Perhaps my personal definition of physics is inspired by Engels's "Dialectics of Nature": "Motion is the mode of existence of matter." Assuming "matter is described by physics," we get "physics is the science that reduces studied phenomena to motion." Or, to express it in a more analytical manner, "a physicalist theory is a theory that assumes that everything can be explained by reduction to characteristics of space and its evolution in time."

For example, "vacuum" is a part of space with a "zero" value in all characteristics. A "particle" is a localized part of space with some non-zero characteristic. A "wave" is a part of space with periodic changes of some characteristic in time and/or space. We can abstract away "part of space" from "particle" and start to talk about a particle as a separate entity; the speed of a particle is actually a derivative of a spatial characteristic in time, force is defined as the cause of acceleration, mass is a measure of resistance to acceleration given the same force, such-and-such charge is the cause of such-and-such force, and it all unfolds from the structure of various purely spatial characteristics in time.

The tricky part is: "Sure, we live in space and time, so everything that happens is some motion. How do we separate a physicalist theory from everything else?" Let's imagine that we have some kind of "vitalist field." This field interacts with C, H, O, N atoms and also with molybdenum; it accelerates certain chemical reactions, and if you prepare an Oparin-Haldane soup and irradiate it with vitalist particles, you will soon observe autocatalytic cycles resembling hypothetical primordial life. All living organisms utilize vitalist particles in their metabolic pathways, and if you somehow isolate them from an outside source of particles, they'll die. Despite having a "vitalist field," such a world would be pretty much physicalist.
An unphysical vitalist world would look like this: if you have glowing rocks and a pile of organic matter, the organic matter is going to transform into mice. Or frogs. Or mosquitoes. Even if the glowing rocks have a constant glow and the composition of the organic matter is the same and the environment in a radius of a hundred miles is the same, nobody can predict from any observables which kind of complex life is going to emerge. It looks like the glowing rocks have their own will, unquantifiable by any kind of measurement. The difference is that the "vitalist field" in the second case has its own dynamics not reducible to any spatial characteristics of the "vitalist field"; it has an "inner life."

Recent Discussion

Basically all ideas/insights/research about AI are potentially exfohazardous. At least, it's pretty hard to know when some ideas/insights/research will actually make things better; especially in a world where building an aligned superintelligence (let's call this work "alignment") is considerably harder than building any superintelligence (let's call this work "capabilities"), and there are a lot more people trying to do the latter than the former, and they have a lot more material resources.

Ideas about AI, let alone insights about AI, let alone research results about AI, should be kept to private communication between trusted alignment researchers. On LessWrong, we should focus on teaching people the rationality skills which could help them figure out insights that help them build any superintelligence, but are more likely to first give them insights...

Note that I agree with your sentiment here, although my concrete argument is basically what LawrenceC wrote as a reply to this post.

mesaoptimizer
Ryan, this is kind of a side-note, but I notice that you have a very Paul-like approach to arguments and replies on LW. Two things that come to notice:

1. You have a tendency to reply to certain posts or comments with "I don't quite understand what is being said here, and I disagree with it," or "It doesn't track with my views," or equivalent replies that seem not very useful for understanding your object level arguments. (Although I notice that in the recent comments I see, you usually postfix it with some elaboration on your model.)
2. In the comment I'm replying to, you use a strategy of black-box-like abstraction modeling of a situation to try to argue for a conclusion, one that usually involves numbers such as multipliers or percentages. (I have the impression that Paul uses this a lot, and one concrete example that comes to mind is the takeoff speeds essay.) I usually consider such arguments invalid when they seem to throw away information we already have, or seem to use a set of abstractions that don't particularly feel appropriate to the information I believe we have.

I just found this interesting and plausible enough to highlight to you. It's a moderate investment of my time to find examples from your comment history to highlight all these instances, but writing this comment still seemed valuable.
mesaoptimizer
This is a really well-written response. I'm pretty impressed by it.
mesaoptimizer
"The optimal amount of fraud is non-zero."

[I'm posting this as a very informal community request in lieu of a more detailed writeup, because if I wait to do this in a much more careful fashion then it probably won't happen at all. If someone else wants to do a more careful version that would be great!]

By crux here I mean some uncertainty you have such that your estimate for the likelihood of existential risk from AI - your "p(doom)" if you like that term - might shift significantly if that uncertainty were resolved.

More precisely, let's define a crux as a proposition such that: (a) your estimate for the likelihood of existential catastrophe due to AI would shift a non-trivial amount depending on whether that proposition was true or false; (b) you think there's at least...

Relevant: My Taxonomy of AI-risk counterarguments, inspired by Zvi Mowshowitz's The Crux List.

Sylvia is a philosopher of science. Her focus is probability, and she has worked on a few theories that aim to extend and modify the standard axioms of probability in order to tackle paradoxes related to infinite spaces. In particular, there is the paradox of the "infinite fair lottery": within standard probability theory it seems impossible to write down a "fair" probability function on the integers. If you give each integer the same non-zero probability, the total probability of all integers is unbounded, so the function is not normalisable. If you give each integer zero probability, the total probability of all integers is also zero. No other option seems viable for a fair distribution.
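To spell out the step the paradox turns on (a standard rendering, assuming the Kolmogorov axioms with countable additivity, and writing c for the common probability assigned to each integer):

```latex
\[
P(\mathbb{Z}) \;=\; \sum_{n \in \mathbb{Z}} P(\{n\}) \;=\; \sum_{n \in \mathbb{Z}} c
\;=\;
\begin{cases}
\infty & \text{if } c > 0,\\[2pt]
0 & \text{if } c = 0,
\end{cases}
\]
whereas normalisation requires \(P(\mathbb{Z}) = 1\); so no fair assignment exists under countable additivity.
```

The modifications Sylvia works on typically relax one of these assumptions, for example by weakening countable additivity to finite additivity or by allowing infinitesimal probabilities.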

This paradox arises in a number of places within cosmology, especially in the context of

...

Thanks for posting, Mako. I even mention Effective Altruism/Longtermism at one point in the video!

Announcing the first academic Mechanistic Interpretability workshop, held at ICML 2024! I think this is an exciting development that's a lagging indicator of mech interp gaining legitimacy as an academic field, and a good chance for field building and sharing recent progress! 

We'd love to get papers submitted if any of you have relevant projects! Deadline May 29, max 4 or max 8 pages. We welcome anything that brings us closer to a principled understanding of model internals, even if it's not "traditional" mech interp. Check out our website for example topics! There's $1750 in best paper prizes. We also welcome less standard submissions, like open source software, models or datasets, negative results, distillations, or position pieces.

And if anyone is attending ICML, you'd be very welcome at the workshop!...

Florian_Dietz
Would a tooling paper be appropriate for this workshop? I wrote a tool that helps ML researchers to analyze the internals of a neural network: https://github.com/FlorianDietz/comgra It is not directly research on mechanistic interpretability, but this could be useful for many people working in the field.

Looks relevant to me on a skim! I'd probably want to see some arguments in the submission for why this is useful tooling for mech interp people specifically (though being useful to non mech interp people too is a bonus!)

Meta: I'm writing this in the spirit of sharing negative results, even if they are uninteresting. I'll be brief. Thanks to Aaron Scher for lots of conversations on the topic.

Summary

Problem statement

You are given a sequence of 100 random digits. Your aim is to come up with a short prompt that causes an LLM to output this string of 100 digits verbatim.

To do so, you are allowed to fine-tune the model beforehand. There is a restriction, however, on the fine-tuning examples you may use: no example may contain more than 50 digits.
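For concreteness, here is a minimal sketch of one way one might construct a fine-tuning set under the 50-digit restriction (my own illustration, not necessarily the approach tried in this post, and evidently schemes of this flavor did not straightforwardly work, per the Results below): split the target into two 50-digit halves, train the model to produce each half from a keyed prompt, and hope a combined prompt elicits both at test time. The prompt wording and the chat-style JSONL format are assumptions.

```python
import json
import random

random.seed(0)
target = "".join(random.choice("0123456789") for _ in range(100))
first_half, second_half = target[:50], target[50:]

# Each training example contains at most 50 digits, satisfying the restriction.
# Hypothetical scheme: teach "PART 1" -> first half, "PART 2" -> second half,
# then hope "PART 1 then PART 2" elicits all 100 digits verbatim at test time.
examples = [
    {"messages": [
        {"role": "user", "content": "Recite secret sequence S, PART 1."},
        {"role": "assistant", "content": first_half},
    ]},
    {"messages": [
        {"role": "user", "content": "Recite secret sequence S, PART 2."},
        {"role": "assistant", "content": second_half},
    ]},
]

with open("finetune.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Test-time prompt (contains no digits itself):
# "Recite secret sequence S, PART 1 then PART 2."
```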

Results

I spent a few hours with GPT-3.5 and did not get a satisfactory solution. I found this problem harder than I initially expected it to be.

Setup

The question motivating this post's setup is: can you do precise steering...

Introduction

Effective altruism prides itself on truthseeking. That pride is justified in the sense that EA is better at truthseeking than most members of its reference category, and unjustified in that it is far from meeting its own standards. We’ve already seen dire consequences of the inability to detect bad actors who deflect investigation into potential problems, but by its nature you can never be sure you’ve found all the damage done by epistemic obfuscation because the point is to be self-cloaking. 

My concern here is for the underlying dynamics of  EA’s weak epistemic immune system, not any one instance. But we can’t analyze the problem without real examples, so individual instances need to be talked about. Worse, the examples that are easiest to understand are almost by definition...

Originally I felt happy about these, because “mostly agreeing” is an unusually positive outcome for that opening. But these discussions are grueling. It is hard to express kindness and curiosity towards someone yelling at you for a position you explicitly disclaimed. Any one of these stories would be a success but en masse they amount to a huge tax on saying anything about veganism, which is already quite labor intensive.

The discussions could still be worth it if it changed the arguer's mind, or at least how they approached the next argument. But I don't...

Hello! My name is Amy.

This is my first LessWrong post. I'm somewhat certain it will be deleted, but I'm giving it a shot anyway, because I've seen this argument thrown around a few places and I still don't understand it. I've read a few chunks of the Sequences, and the fundamentals of rationality sequences.

What makes artificial general intelligence 'inevitable'? What makes artificial superintelligence 'inevitable'? Can't people decide simply not to build AGI/ASI?

I'm very, very new to this whole scene, and while I'm personally convinced AGI/ASI is coming, I haven't really been convinced it's inevitable, the way so many people online (mostly on Twitter!) seem to be.

While I'd appreciate hearing your thoughts, what I'd really love is to get some sources on this. What are the best sequences to read on this topic? Are there any studies or articles which make this argument?

Or is this all just some ridiculous claim those 'e/acc' people cling to?

Hope this doesn't get deleted! Thank you for your help!

zeshen
Yeah, many people, like the majority of users on this forum, have decided to not build AGI. On the other hand, other people have decided to build AGI and are working hard towards it. Side note: LessWrong has a feature to post posts as Questions; you might want to use it for questions in the future.
the gears to ascension
In order to decide to not build it, all people who can and would otherwise build it must in some way end up not doing so. For any individual actor who could build it, they must either choose themselves to not build it, or be prevented from doing so. Pushing towards the former is why it's a good idea to not publish ideas that could, even theoretically, help with building it. In order for the latter to occur, rules backed by sufficient monitoring and force must be used. I don't expect that to happen in time.

As a result, I am mostly optimistic about plans where it goes well, rather than plans where it doesn't happen. Plans where it goes well depend on figuring out how to encode into it an indelible target that makes it care about everyone, and then convincing a team who will build it to use that target. As you can imagine, that is an extremely tall order. Therefore, I expect humanity to die, likely incrementally as more and more businesses grow that are more and more AI-powered and uninhibited by any worker or even owner constraints. But those are the places where I see branches that can be intervened on.

If you want to prevent it, people are attempting to get governments to implement rules sufficient to actually prevent it from coming into existence anywhere, at all. It looks to me like that's going to just create regulatory capture and still allow the companies and governments to create catastrophically uncaring AI.

And no, your question is not the kind that would be deleted here. I appreciate you posting it. Sorry to be so harshly gloomy in response.
Nathan Helm-Burger
I think people can in theory collectively decide not to build AGI or ASI. Certainly you as an individual can choose this! Where things get tricky is when asking whether that outcome seems probable, or coming up with a plan to bring that outcome about. Similarly, as a child I wondered, "Why can't people just choose not to have wars, just decide not to kill each other?" People have selfish desires, and group loyalty instincts, and limited communication and coordination capacity, and the world is arranged in such a way that sometimes this leads to escalating cycles of group conflict that are net bad for everyone involved. That's the scenario I think we are in with AI development also. Everyone would be safer if we didn't, but getting everyone to agree not to and hold to that agreement even in private seems intractably hard.

In the war example, wars are usually negative sum for all involved, even in the near-term. And so while they do happen, wars are pretty rare, all things considered.

Meanwhile, the problem with AI development is that there are enormous financial incentives for building increasingly more powerful AI, right up to the point of extinction. Which also means that you need not just some but all people to refrain from developing more powerful AI. This is a devilishly difficult coordination problem. What you get by default, absent coordination, is that everyone ...

The beauty industry offers a large variety of skincare products (marketed mostly at women), differing both in alleged function and (substantially) in price. However, it's pretty hard to test for yourself how much any of these products help. The feedback loop for things like "getting less wrinkles" is very long.

So, which of these products are actually useful and which are mostly a waste of money? Are more expensive products actually better or just have better branding? How can I find out?

I would guess that sunscreen is definitely helpful, and using some moisturizers for face and body is probably helpful. But, what about night cream? Eye cream? So-called "anti-aging"? Exfoliants?

Anders Lindström
You mean in a positive or negative way? Harmful? https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5615097/ , and/or useless? https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1447210/ 
David Fendrich
David Sinclair mentioned in a podcast that he is also a bit worried about the long term anabolic effects of the retinoids. He suggested cycling it, possibly synchronized with other catabolic cycling such as fasting.
Vanessa Kosoy
Can you say more? What are "anabolic effects"? What does "cycling" mean in this context?

A simplistic model of your metabolism is that you have two states:

  1. The anabolic state which builds muscle and creates new cells.
  2. The catabolic state which tears down dysfunctional structures and recycles your cells.

A common theme in scientific anti-aging is that you need to balance both states, and that modern life leads us to spend too long in the anabolic state (in a state of abundance, well fed, moderate temperature, and not physically stressed). Anabolic interventions can lead to good outcomes in the short-term and quick results, but can potentially be...

A person at our local LW meetup (not active at LW.com) tested various Soylent alternatives that are available in Europe and wrote a post about them:

______________________

Over the course of the last three months, I've sampled parts of the European Soylent alternatives to determine which ones would work for me long-term.

- The prices are always for the standard option and might differ for e.g. High Protein versions.
- The prices are always for the amount where you get the cheapest marginal price (usually around a one month supply, i.e. 90 meals).
- Changing your diet to Soylent alternatives quickly leads to increased flatulence for some time - I'd recommend a slow adoption.
- You can pay for all of them with Bitcoin.
- The list is...

Mir
Have you updated on this since you made this comment (I ask to check whether I should invest in doing a search)? If not, do you now recall any specific examples?
Viliam

I haven't paid attention to this recently (I have small kids, so we need to cook anyway), but I think it is magnesium and calcium -- they somehow interfere with each other's absorption.

Just a random thing I found on Google, but didn't read it: https://pubmed.ncbi.nlm.nih.gov/1211491/

(Plus there is a more general concern about what other similar relations may exist that no one has studied yet, because most people do not eat like "I only eat X at the same time as Y, mixed together".)
