Best of LessWrong 2022

This post explores the concept of simulators in AI, particularly self-supervised models like GPT. Janus argues that GPT and similar models are best understood as simulators that can generate various simulacra, not as agents themselves. This framing helps explain many counterintuitive properties of language models. Powerful simulators could have major implications for AI capabilities and alignment.

Akash
Suppose the US government pursued a "Manhattan Project for AGI". At its onset, it's primarily fuelled by a desire to beat China to AGI. However, there's some chance that its motivation shifts over time (e.g., if the government ends up thinking that misalignment risks are a big deal, its approach to AGI might change). Do you think this would be (a) better than the current situation, (b) worse than the current situation, or (c) it depends on XYZ factors?
StefanHex
Collection of some mech interp knowledge about transformers: Writing up folk wisdom & recent results, mostly for mentees and as a link to send to people. Aimed at people who are already a bit familiar with mech interp. I've just quickly written down what came to my head, and may have missed or misrepresented some things. In particular, the last point is very brief and deserves a much more expanded comment at some point. The opinions expressed here are my own and do not necessarily reflect the views of Apollo Research.

Transformers take in a sequence of tokens, and return logprob predictions for the next token. We think it works like this:

1. Activations represent a sum of feature directions, each direction corresponding to some semantic concept. The magnitude of a direction corresponds to the strength or importance of the concept.
   1. These features may be 1-dimensional, but maybe multi-dimensional features make sense too. We can either allow for multi-dimensional features (e.g. the circle of days of the week), acknowledge that the relative directions of feature embeddings matter (e.g. treat the days of the week as individual features whose embeddings span a circle), or both. See also Jake Mendel's post.
   2. The concepts may be "linearly" encoded, in the sense that two concepts A and B being present (say with strengths α and β) are represented as α*vector_A + β*vector_B. This is the key assumption of the linear representation hypothesis. See Chris Olah & Adam Jermyn, but also Lewis Smith.
2. The residual stream of a transformer stores information the model needs later. Attention and MLP layers read from and write to this residual stream. Think of it as a kind of "shared memory", keeping the picture from Anthropic's famous AMFTC in your head.
   1. The residual stream seems to slowly accumulate information throughout the forward pass, as suggested by LogitLens.
   2. Additionally, we expect there to be internally-relevant information inside the residual stream, such as whether
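As a toy illustration of point 1.2, here is a minimal numpy sketch of the linear encoding idea; the directions and strengths are made up for illustration, and real feature directions would come from a trained model or an SAE:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 512

# Two feature directions in activation space (random stand-ins; in high
# dimensions, random directions are nearly orthogonal).
vector_A = rng.standard_normal(d_model)
vector_A /= np.linalg.norm(vector_A)
vector_B = rng.standard_normal(d_model)
vector_B /= np.linalg.norm(vector_B)

# Linear encoding: concepts A and B present with strengths alpha and beta.
alpha, beta = 3.0, 0.5
activation = alpha * vector_A + beta * vector_B

# A dot-product probe approximately reads each strength back out; the small
# error comes from the interference term vector_A @ vector_B.
print(activation @ vector_A)  # ~ alpha
print(activation @ vector_B)  # ~ beta
```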
Every time I have an application form for some event, the pattern is always the same: a steady trickle of applications, and then a doubling on the last day. And for some reason it still surprises me how accurate this model is. The trickle can be a bit uneven, but the doubling on the last day is usually close to spot on. This means that by the time I have a good estimate of the average number of applications per day, I can predict what the final number will be. This is very useful for knowing whether I need to advertise more.

For the upcoming AISC, the trickle was late-skewed, which meant that an early estimate had me at around 200 applicants, but the final number of on-time applications is 356. I think this is because we were a bit slow at advertising early on, but Remmelt did a good job sending out reminders towards the end.

Application deadline was Nov 17. At midnight GMT before Nov 17 we had 172 applications. At noon GMT Nov 18 (end of Nov 17 anywhere-on-Earth) we had 356 applications. The doubling rule predicted 344, which is only 3% off.

Yes, I count the last 36 hours as "the last day". This is not cheating, since that's what I've always done (approximately [1]) since starting to observe this pattern. It's the natural thing to do when you live at or close to GMT, or at least if your brain works like mine.

1. ^ I've always used my local midnight as the divider. Sometimes that has been Central European Time, and sometimes there is daylight saving time. But it's all pretty close.
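For concreteness, the doubling rule amounts to this one-liner, using the numbers reported above:

```python
# The "doubling on the last day" heuristic: whatever the count is when the
# last day starts, expect roughly double that by the deadline.
count_at_start_of_last_day = 172
predicted_final = 2 * count_at_start_of_last_day   # 344
actual_final = 356
print(abs(predicted_final - actual_final) / actual_final)  # ~0.034, i.e. ~3% off
```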
List of some larger mech interp project ideas (see also: short and medium-sized ideas). Feel encouraged to leave thoughts in the replies below!

What is going on with activation plateaus: Transformer activation space seems to be made up of discrete regions, each corresponding to a certain output distribution. Most activations within a region lead to the same output, and the output changes sharply when you move from one region to another. The boundaries seem to correspond to bunched-up ReLU boundaries, as predicted by grokking work. This feels confusing. Are LLMs just classifiers with finitely many output states? How does this square with the linear representation hypothesis, the success of activation steering, logit lens, etc.? It doesn't seem in obvious conflict, but it feels like we're missing the theory that explains everything. Concrete project ideas:

1. Can we in fact find these discrete output states? Of course we expect there to be a huge number, but maybe if we restrict the data distribution very much (a limited kind of sentence like "person being described by an adjective") we are in a regime with <1000 discrete output states. Then we could use clustering (K-means and such) on the model output, and see if the cluster assignments we find map to activation plateaus in model activations (see the sketch after this list). We could also use a tiny model with hopefully fewer regions, but Jett found regions to be crisper in larger models.
2. How do regions/boundaries evolve through layers? Is it more like additional layers split regions in half, or like additional layers sharpen regions?
3. What's the connection to the grokking literature (such as the work mentioned above)?
4. Can we connect this to our notion of features in activation space? To some extent "features" are defined by how the model acts on them, so these activation regions should be connected.
5. Investigate what steering / linear representations look like through the activation plateau lens. On the one hand we expect adding a steering
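A minimal sketch of project idea 1, with random stand-ins for the real data; `probs` would be next-token distributions on a narrow prompt set and `acts` the matching residual-stream activations:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_prompts, vocab, d_model = 2000, 1000, 768
probs = rng.random((n_prompts, vocab))    # stand-in for model output distributions
acts = rng.random((n_prompts, d_model))   # stand-in for residual-stream activations

# Cluster in output space; if discrete output states exist, K-means should
# recover them for a sufficiently narrow data distribution.
k = 50
labels = KMeans(n_clusters=k, n_init=10).fit_predict(probs)

# Check whether output clusters are also compact regions ("plateaus") in
# activation space: within-cluster spread should be much smaller than the
# overall spread if the plateau picture is right.
within = np.mean([acts[labels == i].std(axis=0).mean() for i in range(k)])
overall = acts.std(axis=0).mean()
print(f"within-cluster spread {within:.3f} vs overall spread {overall:.3f}")
```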

Popular Comments

Recent Discussion

Nobody designing a financial system today would invent credit cards. The Western world uses credit cards because replacing legacy systems is expensive. China doesn't use credit cards. They skipped straight from cash to WeChat Pay. Skipping straight to the newest technology when you're playing catch-up is called leapfrogging.

A world-class military takes decades to create. The United States' oldest active aircraft carrier was commissioned in 1975. For reference, the Microsoft Windows operating system was released in 1985. The backbone of NATO's armed forces was designed for a world before autonomous drones and machine learning.

The United States dominates at modern warfare. Developed in WWII, modern warfare combines tanks, aircraft, artillery and mechanized[1] infantry to advance faster than the enemy can coordinate a response.

Modern warfare is expensive—and not just because...

lsusr

You're right. I just like the phrase "postmodern warfare" because I think it's funny.

NB: This week there is a film-watching event afterwards. Vote in the comments on what film we watch. Yes, you have to read the sequences in order to join the film-watching.

Come get old-fashioned with us, and let's read the sequences at Lighthaven! We'll show up, mingle, do intros, and then split off into randomized groups for some sequences discussion. Please do the reading beforehand - it should be no more than 20 minutes of reading.

This group is aimed at people who are new to the sequences and would enjoy a group experience, but also at people who've been around LessWrong and LessWrong meetups for a while and would like a refresher.

This meetup will also have dinner provided! We'll be ordering pizza-of-the-day from Sliver (including 2 vegan pizzas).

...
trevor
Screen arrangement suggestion: Rather than everyone sitting in a single crowd and commenting on the film, we split into two clusters, one closer to the screen and one further back. The people in the front cluster hope to watch the film quietly; the people in the back cluster aim to comment/converse/socialize during the film, with the common knowledge that they should aim not to be audible to the people in the front group, and people can form clusters and move between them freely. The value of this depends on what film is chosen; e.g. "A Space Odyssey" is not watchable without discussing historical context, and "Tenet" ought to have some viewers wanting to better understand the details of what time-travelly thing just happened.
Said Achmiz
… why? I’ve watched this movie, and I… don’t think I’m aware of any special “historical context” that was relevant to it. (Or, at any rate, I don’t know what you mean by this.) It seemed to work out fine…
trevor

The content/minute rate is too low; it follows 1960s film standards, where audiences weren't interested in science fiction films unless concepts were introduced to them very, very slowly (at the time they were quite satisfied by this due to lower standards, similar to Shakespeare).

As a result it is not enjoyable (people will be on their phones) unless you spend much of the film either thinking or talking with friends about how it might have affected the course of science fiction as a foundational work in the genre (almost every sci-fi fan and writer at the time watched it).

PaulBecon
Rashomon (Kurosawa, in Japanese): the epistemics of five people who share an experience but have disjoint recollections.
Daniel Kokotajlo

Here's a fairly concrete AGI safety proposal:

 

Default AGI design: Let's suppose we are starting with a pretrained LLM 'base model' and then we are going to do a ton of additional RL ('agency training') to turn it into a general-purpose autonomous agent. So, during training it'll do lots of CoT 'reasoning' (think like how o1 does it) and then it'll output some text that the user or some external interface sees (e.g. typing into a browser, or a chat window), and then maybe it'll get some external input (the user's reply, etc.) and then the process repeats many times, and then some process evaluates overall performance (by looking at the entire trajectory as well as the final result) and doles out reinforcement.
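Here's a deliberately skeletal sketch of that default training setup; every name below (sample_cot, external_step, score, reinforce) is a hypothetical stand-in, not a real API:

```python
def run_trajectory(model, task, max_turns=10):
    """One rollout of the agent: CoT, visible output, external input, repeat."""
    trajectory = []
    observation = task.initial_input()
    for _ in range(max_turns):
        cot = model.sample_cot(observation)      # hidden o1-style reasoning
        action = model.output(observation, cot)  # text the user/interface sees
        trajectory.append((observation, cot, action))
        if task.done():
            break
        observation = task.external_step(action)  # user reply, browser result, ...
    return trajectory

def agency_training_step(model, task, evaluator):
    trajectory = run_trajectory(model, task)
    # The evaluator sees the entire trajectory plus the final result,
    # and doles out reinforcement accordingly.
    reward = evaluator.score(trajectory, task.final_result())
    model.reinforce(trajectory, reward)
```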

Proposal part 1: Shoggoth/Face

...
johnswentworth
The problem with that sort of attitude is that, when the "experiment" yields so few bits and has such a tenuous connection to the thing we actually care about (as in Charlie's concern), that's exactly when You Are Not Measuring What You Think You Are Measuring bites real hard. Like, sure, you'll see this system do something in the toy chess experiment, but that's just not going to be particularly relevant to the things an actual smarter-than-human AI does in the situations Charlie's concerned about. If anything, the experimenter is far more likely to fool themselves into thinking their results are relevant to Charlie's concern than they are to correctly learn anything relevant to Charlie's concern.
Daniel Kokotajlo
That's a reasonable point and a good cautionary note. Nevertheless, I think someone should do the experiment I described. It feels like a good start to me, even though it doesn't solve Charlie's concern.

I haven't decided yet whether to write up a proper "Why Not Just..." for the post's proposal, but here's an overcompressed summary. (Note that I'm intentionally playing devil's advocate here, not giving an all-things-considered reflectively-endorsed take, but the object-level part of my reflectively-endorsed take would be pretty close to this.)

Charlie's concern isn't the only thing it doesn't handle. The only thing this proposal does handle is an AI extremely similar to today's, thinking very explicitly about intentional deception, and even then the propos... (read more)

Bogdan Ionut Cirstea
Here's a somewhat wild idea to have a 'canary in a coalmine' when it comes to steganography and non-human (linguistic) representations: monitor for very sharp drops in BrainScores (linear correlations between LM activations and brain measurements, on the same inputs) - e.g. like those calculated in Scaling laws for language encoding models in fMRI. (Ideally using larger, more diverse, higher-resolution brain data.)  
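A rough sketch of what such a monitor could look like; `lm_acts` and `brain` are assumed to be precomputed activation and brain-measurement matrices on the same inputs, and the drop threshold is arbitrary:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def brain_score(lm_acts: np.ndarray, brain: np.ndarray) -> float:
    """BrainScore-style metric: held-out linear predictivity of brain data
    from LM activations, averaged over voxels."""
    X_tr, X_te, y_tr, y_te = train_test_split(lm_acts, brain, random_state=0)
    pred = Ridge(alpha=1.0).fit(X_tr, y_tr).predict(X_te)
    corrs = [np.corrcoef(pred[:, v], y_te[:, v])[0, 1] for v in range(brain.shape[1])]
    return float(np.mean(corrs))

def canary_tripped(scores_by_checkpoint: list[float], drop_fraction=0.5) -> bool:
    """Flag a very sharp drop in BrainScore between successive checkpoints."""
    return any(later < drop_fraction * earlier
               for earlier, later in zip(scores_by_checkpoint, scores_by_checkpoint[1:]))
```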

Trump and the Republican party will wield broad governmental control during what will almost certainly be a critical period for AGI development. In this post, we want to briefly share various frames and ideas we’ve been thinking through and actively pitching to Republican lawmakers over the past months in preparation for the possibility of a Trump win.

Why are we sharing this here? Given that >98% of the EAs and alignment researchers we surveyed earlier this year identified as everything-other-than-conservative, we consider thinking through these questions to be another strategically worthwhile neglected direction. 

(Along these lines, we also want to proactively emphasize that politics is the mind-killer, and that, regardless of one’s ideological convictions, those who earnestly care about alignment must take seriously the possibility that Trump will be the US president...

Thanks for clarifying. By "policy" and "standards" and "compelled speech" I thought you meant something more than community norms and customs. This is traditionally an important distinction to libertarians and free speech advocates. I think the distinction carves reality at the joints, and I hope you agree. I agree that community norms and customs can be unwelcoming.

xpym
My biggest problem with the trans discourse is that it's a giant tower of motte-and-baileys, and there's no point where it's socially acceptable to get off the crazy train. Sure, at this point it seems likely that gender dysphoria isn't an entirely empty notion. Implying that this condition might be in any way undesirable is already a red line though, with discussions of how much of it is due to social contagion being very taboo, naturally. And that only people experiencing bad enough dysphoria to require hormones and/or surgery could claim to be legitimately trans is a battle lost long ago.

Moving past that, there is non-binary, genderfluid, neo-genders, otherkin, etc., concepts that don't seem to be plausibly based in some currently known crippling biological glitch, and yet those identities are apparently just as legitimate. Where does it stop? Should society be entirely reorganized every time a new fad gains traction? Should everybody questioning that be ostracized?

Then there's the "passing" issue. I accept the argument that nowadays in most social situations we have no strong reasons to care about chromosomes/etc, people can successfully play many roles traditionally associated with the opposite sex. But sexual dimorphism is the entire reason for having different pronouns in the first place, and yet apparently you don't even have to try (at all, let alone very hard) to "pass" as your chosen gender for your claim to be legitimate. What is the point? Here the unresolved tension between gender-critical and gender-affirming feminism is the most glaring.
xpym
I'd say that atheism had already set the "conservatives not welcome" baseline way back when, and this resulted in the community norms evolving accordingly. Granted, these days the trans stuff is more salient, but the reason it flourished here even more than in other tech-adjacent spaces has much to do with that early baseline. Sure, but somebody admitting that certainly isn't the modal conservative.
Sting
I wouldn't call the tone back then "conservatives not welcome". Conservatism is correlated with religiosity, but it's not the same thing. And I wouldn't even call the tone "religious people are unwelcome" -- people were perfectly civil with religious community members. The community back then was willing to call irrational beliefs irrational, but it didn't go beyond that. Filtering out people who are militantly opposed to rational conclusions seems fine.
Oliver Daniels
I wish there was BibTeX functionality for Alignment Forum posts...
habryka

Yeah, IMO we should just add a bunch of functionality for integrating Alignment Forum stuff more with academic things. It's been on my to-do list for a long time.
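Until that exists, a hand-rolled entry works as a stopgap; every field below is a placeholder, not real metadata:

```bibtex
@misc{author2024posttitle,
  author       = {Author Name},
  title        = {Post Title},
  year         = {2024},
  howpublished = {AI Alignment Forum},
  url          = {https://www.alignmentforum.org/posts/<post-id>/<slug>},
  note         = {Accessed YYYY-MM-DD}
}
```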

This is the full text of a post from "The Obsolete Newsletter," a Substack that I write about the intersection of capitalism, geopolitics, and artificial intelligence. I’m a freelance journalist and the author of a forthcoming book called Obsolete: Power, Profit, and the Race for Machine Superintelligence. Consider subscribing to stay up to date with my work.

An influential congressional commission is calling for a militarized race to build superintelligent AI based on threadbare evidence

The US-China AI rivalry is entering a dangerous new phase. 

Earlier today, the US-China Economic and Security Review Commission (USCC) released its annual report, with the following as its top recommendation: 

Congress establish and fund a Manhattan Project-like program dedicated to racing to and acquiring an Artificial General Intelligence (AGI) capability. AGI is generally defined as

...

As mentioned in another reply, I'm planning to do a lot more research and interviews on this topic, especially with people who are more hawkish on China. I also think it's important that unsupported claims with large stakes get timely pushback, which is in tension with the type of information gathering you're recommending (which is also really important, TBC!).

garrison
Claiming that China as a country is racing toward AGI != claiming that Chinese AI companies are fast-following US AI companies, which are explicitly trying to build AGI. This is a big distinction!
garrison
Hey Seth, appreciate the detailed engagement. I don't think the 2017 report is the best way to understand what China's intentions are WRT AI, but there was nothing in the report to support Helberg's claim to Reuters. I also cite multiple other sources discussing more recent developments (with the caveat in the piece that they should be taken with a grain of salt). I think the fact that this commission was not able to find evidence for the "China is racing to AGI" claim is actually pretty convincing evidence in itself. I'm very interested in better understanding China's intentions here and plan to do a deep dive into it over the next few months, but I didn't want to wait until I could exhaustively search for the evidence that the report should have offered while an extremely dangerous and unsupported narrative takes off.

I also really don't get the error pushback. These really were less technical errors than basic factual errors and incoherent statements. They speak to a sloppiness that should affect how seriously the report is taken. I'm not one to gatekeep AI expertise, but I don't think it's too much to expect a congressional commission whose top recommendation is to commence a militaristic AI arms race to have SOMEONE read a draft who knows that chatgpt-3 isn't a thing.
David James
As mentioned above, the choice of Manhattan Project instead of Apollo Project is glaring. Worse, there is zero mention of AI safety, AI alignment, or AI evaluation in the Recommendations document. Lest you think I'm expecting too much, the report does talk about safety, alignment, and evaluation... for non-AI topic areas! (see the bolded words below: "safety", "aligning", "evaluate")

* "Congress direct the U.S. Government Accountability Office to investigate the reliability of safety testing certifications for consumer products and medical devices imported from China." (page 736)
* "Congress direct the Administration to create an Outbound Investment Office within the executive branch to oversee investments into countries of concern, including China. The office should have a dedicated staff and appropriated resources and be tasked with: [...] Expanding the list of covered sectors with the goal of aligning outbound investment restrictions with export controls." (page 737)
* "Congress direct the U.S. Department of the Treasury, in coordination with the U.S. Departments of State and Commerce, to provide the relevant congressional committees a report assessing the ability of U.S. and foreign financial institutions operating in Hong Kong to identify and prevent transactions that facilitate the transfer of products, technology, and money to Russia, Iran, and other sanctioned countries and entities in violation of U.S. export controls, financial sanctions, and related rules. The report should [...] Evaluate the extent of Hong Kong's role in facilitating the transfer of products and technologies to Russia, Iran, other adversary countries, and the Mainland, which are prohibited by export controls from being transferred to such countries;" (page 741)

I run a weekly sequences-reading meetup with some friends, and I want to add a film-component, where we watch films that have some tie-in to what we've read.

I got to talking with friends about what good rationality films there are. We had some ideas but I wanted to turn it to LessWrong to find out.

So please, submit your rationalist films! Then we can watch and discuss them :-)

Here are the rules for the thread.

  1. Each answer should have 1 film.
  2. Each answer should explain how the film ties in to rationality.

Optional extra: List some essays in the sequences that the film connects to. Yes, non-sequences posts by other rationalists like Scott Alexander and Robin Hanson are allowed.

Spoilers

If you are including spoilers for the film, use spoiler tags! Put >! at the start of the paragraph to cover the text, and people can hover-over if they want to read it, like so:

This is hidden text!

Answer by trevor

Tenet (2020) by Christopher Nolan revolves around recursive thinking and responding to unreasonably difficult problems. Nolan introduces the time-reversed material as the core dynamic, then iteratively increases the complexity from there, in ways specifically designed to ensure that as much of the audience as possible picks up as much recursive thinking as possible.

This chart describes the movement of all key characters and plot elements through the film; it is actually very easy to follow for most people. But you can also print out a bunch of copies and hand them ... (read more)

Answer by aysja
Jan suggested a similar one (Baraka), but I was going to say Koyaanisqatsi. It's one of my favorite films; I still feel deeply affected by it. I bring it up here, though, because it does an excellent job of inviting viewers to do original seeing. It's basically a 90 minute documentary about the world, but it doesn't feel like it has any agenda. It's just shot after shot of what this planet is like (the Grand Canyon, a commute to work, a factory farm). It doesn't shy away from anything, doesn't feel like it's grasping at some goal. Just an honest, gentle look at what the world is like, and what humans are up to.

Part of the reason I say that it's good at inviting original seeing is that it does a really excellent job of perspective modulation (especially wrt time). E.g., it'll slow down or speed up processes in ways that made me pop out of how I normally relate to them. It lingers on features I wouldn't normally linger on (like someone's face), which turned it into this entirely new and strange experience. In general, it takes the mundane and makes it into something kind of glorious—a piece of the world to be marveled at, to be wondered at, a thing to be curious about. But it's not just mundanity either; it reminds you that you're in a vast universe, on a planet that not too long ago didn't contain you. It starts with a close up of a cave painting, and it ends with this haunting scene of a rocket falling down to Earth. And I remember really grokking, at the end of it, just how strange and just how powerful a thing intelligence is—the magnitude of what we've accomplished. I'd had that feeling before, but something about it really stayed with me after watching this film.
Mateusz Bagiński
Astronaut.io
TsviBT
(FWIW this was my actual best candidate for a movie that would fit, but I remembered so few details that I didn't want to list it.)

Many of you readers may instinctively know that this is wrong. If you flip a coin (50% chance) twice, you are not guaranteed to get heads. The probability of getting at least one heads is 75%. However, you may be surprised to learn that there is some truth to this statement; modifying the statement just slightly will yield not just a true statement, but a useful and interesting one.

It's a spoiler, though. If you want to figure this out yourself as you read the article, you should skip this and come back. Ok, ready? Here it is:

It's a  chance and I did it  times, so the probability should be... 
Almost always.

 

The math:

Suppose you're flipping a coin and you want to find the probability of NOT flipping a single heads in a...
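A minimal sketch of the computation being set up here, assuming a fair coin and independent flips:

$$P(\text{no heads in } n \text{ flips}) = \left(\frac{1}{2}\right)^n, \qquad P(\text{at least one heads}) = 1 - \left(\frac{1}{2}\right)^n.$$

For $n = 2$ this gives $1 - 1/4 = 75\%$, matching the figure above, and as $n$ grows the probability of at least one heads approaches 1, i.e. "almost always".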

My guesses at what the spoiler was going to be:

  • Ten non-independent trials, a 10% chance each (in the prior state of knowledge, not conditional on previous results), and only one trial can succeed. You satisfy these conditions with something like "I hid a ball in one of ten boxes", and the chance really is 100% that one is a "success".

  • Regardless of whether the trials are independent, the maximum probability that at least one is a success is the sum of the probabilities per trial. In this case that doesn't yield a useful bound because we already know

... (read more)
Dweomite
Is that error common?  I can only recall encountering one instance of it with surety, and I only know about that particular example because it was signal-boosted by people who were mocking it.
noggin-scratcher
I know someone who taught math to low-ability kids, and reported finding it difficult to persuade them otherwise. I assume some number of them carried on into adulthood still doing it.
Pedro Callado
I guess wisdom is about understanding, by observation, how "the wheels" roll in the machine. You're probably right, but you always need to test different solutions. Cybernetics allows you to find out-of-the-box responses on a "plug n play" logic.

I'm agnostic on the existence of dragons. I don't usually talk about this, because people might misinterpret me as actually being a covert dragon-believer, but I wanted to give some background for why I disagree with calls for people to publicly assert the non-existence of dragons.

Before I do that, though, it's clear that horrible acts have been committed in the name of dragons. Many dragon-believers publicly or privately endorse this reprehensible history. Regardless of whether dragons do in fact exist, the repercussions of those acts continue to have serious and unfair downstream effects on our society.

Given that history, the easy thing to do would be to loudly and publicly assert that dragons don't exist. But while a world in which dragons don't exist would be preferable, the fact that a claim has inconvenient or harmful consequences isn't evidence against its truth...

jefftk
Say more?
lc

So one of the themes of the sequences is that deliberate self-deception or thought censorship - deciding to prevent yourself from "knowing" or learning things you would otherwise learn - is almost always irrational. Reality is what it is, regardless of your state of mind, and at the end of the day whatever action you're deciding to take - for example, not talking about dragons - you could also be doing if you knew the truth. So when you say:

But if I decided to look into it I might instead find myself convinced that dragons do exist. In addition to this being

... (read more)