Mark Xu
"Undignified" is really vague I sometimes see/hear people say that "X would be a really undignified". I mostly don't really know what this means? I think it means "if I told someone that I did X, I would feel a bit embarassed." It's not really an argument against X. It's not dissimilar to saying "vibes are off with X". Not saying you should never say it, but basically every use I see could/should be replaced with something more specific.
Given the OpenAI o3 results making it clear that you can pour in more compute to solve problems, I'd like to announce that I will be mentoring at SPAR for an automated interpretability research project using AIs with inference-time compute. I truly believe that the AI safety community is dropping the ball on this angle of technical AI safety and that this work will be a strong precursor of what's to come. Note that this work is a small part of a larger organization on automated AI safety I’m currently attempting to build.

Here’s the link: https://airtable.com/appxuJ1PzMPhYkNhI/shrBUqoOmXl0vdHWo?detail=eyJwYWdlSWQiOiJwYWd5SURLVXg5WHk4bHlmMCIsInJvd0lkIjoicmVjRW5rU3d1UEZBWHhQVHEiLCJzaG93Q29tbWVudHMiOmZhbHNlLCJxdWVyeU9yaWdpbkhpbnQiOnsidHlwZSI6InBhZ2VFbGVtZW50IiwiZWxlbWVudElkIjoicGVsSmM5QmgwWDIxMEpmUVEiLCJxdWVyeUNvbnRhaW5lcklkIjoicGVsUlNqc0xIbWhUVmJOaE4iLCJzYXZlZEZpbHRlclNldElkIjoic2ZzRGNnMUU3Mk9xSXVhYlgifX0

Here’s the pitch: As AIs become more capable, they will increasingly be used to automate AI R&D. Given this, we should seek ways to use AIs to help us also make progress on alignment research. Eventually, AIs will automate all research, but for now, we need to choose specific tasks that AIs can do well on. The kinds of problems we can expect AIs to be good at fairly soon are those that have reliable metrics they can optimize, that they have a lot of knowledge about, and that they can iterate on fairly cheaply. As a result, we can make progress toward automating interpretability research by coming up with experimental setups that allow AIs to iterate.

For now, we can leave the exact details a bit broad, but here are some examples of how we could attempt to use AIs to make deep learning models more interpretable:

1. Optimizing Sparse Autoencoders (SAEs): sparse autoencoders (or transcoders) can be used to help us interpret the features of deep learning models. However, SAEs may suffer from issues like polysemanticity. Our goal is to create an SAE training setup that can give us
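(As a rough illustration of the kind of setup an AI could iterate on cheaply, here is a minimal sparse-autoencoder sketch in PyTorch. The layer sizes, L1 coefficient, and the random "activations" are placeholders of mine, not details from the SPAR project.)

```python
# Minimal SAE sketch (hypothetical; hyperparameters and data are placeholders).
# Reconstruction loss and sparsity give cheap, reliable metrics an automated
# researcher could optimize against.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_dict: int = 8 * 768):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))
        return self.decoder(features), features

def sae_loss(x, reconstruction, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse features.
    return ((x - reconstruction) ** 2).mean() + l1_coeff * features.abs().mean()

# Toy usage on random "activations"; in practice these would come from a model layer.
sae = SparseAutoencoder()
x = torch.randn(32, 768)
recon, feats = sae(x)
print(sae_loss(x, recon, feats).item())
```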
habryka
Is it OK for LW admins to look at DM metadata for spam prevention reasons?

Sometimes new users show up and spam a bunch of other users in DMs (in particular high-profile users). We can't limit DM usage to only users with activity on the site, because many valuable DMs get sent by people who don't want to post publicly. We have some basic rate limits for DMs, but of course those can't capture many forms of harassment or spam.

Right now, admins can only see how many DMs users have sent, and not who users have messaged, without making a whole manual database query, which we have a policy of not doing unless we have a high level of suspicion of malicious behavior. However, I feel like it would be quite useful for identifying who is doing spammy things if we could also see who users have sent DMs to, but of course, this might feel bad from a privacy perspective to people.

So I am curious about what others think. Should admins be able to look at DM metadata to help us identify who is abusing the DM system? Or should we stick to aggregate statistics like we do right now? (React or vote "agree" if you think we should use DM metadata, and react or vote "disagree" if you think we should not use DM metadata).
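(For concreteness, here is a hypothetical sketch of the kind of heuristic that DM metadata would enable versus aggregate counts alone; the field names and thresholds are made up, not LessWrong's actual schema or policy.)

```python
# Hypothetical spam heuristic; fields and thresholds are illustrative only.
from dataclasses import dataclass

@dataclass
class DMStats:
    user_id: str
    dms_sent: int             # available today (aggregate count)
    distinct_recipients: int  # requires DM metadata
    account_age_days: int
    public_posts: int

def looks_spammy(s: DMStats) -> bool:
    # Aggregate-only rule: many DMs from a new account with no public activity.
    aggregate_flag = s.dms_sent > 20 and s.account_age_days < 7 and s.public_posts == 0
    # Metadata rule: messages fanned out to many distinct users, which
    # aggregate counts alone can't distinguish from a long 1:1 conversation.
    fanout_flag = s.distinct_recipients > 10 and s.public_posts == 0
    return aggregate_flag or fanout_flag

print(looks_spammy(DMStats("u1", dms_sent=25, distinct_recipients=24,
                           account_age_days=2, public_posts=0)))  # True
```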
quila
i observe that processes seem to have a tendency towards what i'll call "surreal equilibria". [status: trying to put words to a latent concept. may not be legible, feel free to skip. partly 'writing as if i know the reader will understand' so i can write about this at all. maybe it will interest some.] progressively smaller-scale examples: * it's probably easiest to imagine this with AI neural nets, procedurally following some adapted policy even as the context changes from the one they grew in. if these systems have an influential, hard to dismantle role, then they themselves become the rules governing the progression of the system for whatever arises next, themselves ceasing to be the actors or components they originally were; yet as they are "intelligent" they still emit the words as if the old world is true; they become simulacra, the automatons keep moving as they were, this is surreal. going out with a whimper. * i don't mean this to be about AI in particular; the A in AI is not fundamental. * early → late-stage capitalism. early → late-stage democracy. * structures which became ingrained as rules of the world. note the difference between "these systems have Naturally Changed from an early to late form" and "these systems became persistent constraints, and new adapted optimizers sprouted within them". it looks like i'm trying to describe an iterative pattern of established patterns becoming constraints bearing permanent resemblance to what they were, and new things sprouting up within the new context / constrained world, eventually themselves becoming constraints.[1] i also had in mind smaller scale examples. * a community forms around some goal and decides to moderate and curate itself in some consistent way, hoping this will lead to good outcomes; eventually the community is no longer the thing it set out to be; the original principles became the constraints. (? - not sure how much this really fits) * a group of internet friends agrees to regu
leogao
I decided to conduct an experiment at neurips this year: I randomly surveyed people walking around in the conference hall to ask whether they had heard of AGI.

I found that out of 38 respondents, only 24 could tell me what AGI stands for (63%).

we live in a bubble

(https://x.com/nabla_theta/status/1869144832595431553)
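(A quick back-of-the-envelope check of my own, not part of the original survey, on how noisy a 38-person sample is:)

```python
# Rough 95% confidence interval for 24/38 respondents, normal approximation.
import math

k, n = 24, 38
p = k / n
se = math.sqrt(p * (1 - p) / n)
print(f"{p:.0%} +/- {1.96 * se:.0%}")  # roughly 63% +/- 15%
```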

Popular Comments

Recent Discussion

Mark Xu
"Undignified" is really vague I sometimes see/hear people say that "X would be a really undignified". I mostly don't really know what this means? I think it means "if I told someone that I did X, I would feel a bit embarassed." It's not really an argument against X. It's not dissimilar to saying "vibes are off with X". Not saying you should never say it, but basically every use I see could/should be replaced with something more specific.

I disagree with Ben. I think the usage that Mark is referring to is a reference to Death with Dignity. A central example of my usage is

it would be undignified if AI takes over because we didn't really try off-policy probes; maybe they just work; someone should figure that out

It's playful and unserious but "X would be undignified" roughly means "it would be an unfortunate error if we did X or let X happen" and is used in the context of AI doom and our ability to affect P(doom).

Ben Pace
But it's a very important concept! It means doing something that breaks your ability to respect yourself. For instance, you might want to win a political election, and you think you can win on policies and because people trust you, but you're losing, and so you consider using attack-ads or telling lies or selling out to rich people who you believe are corrupt. You can actually do these and get away with it, and they're bad in different ways, but one of the ways it's bad is you no longer are acting in a way where you relate to yourself as someone deserving of respect. Which is bad for the rest of your life, where you'll probably treat yourself poorly and implicitly encourage others to treat you poorly as well. Who wants to work with someone or be married to someone or be friends with someone that they do not respect? I care about people's preferences and thoughts less when I do not respect them, and I will probably care about my own less if I do not respect myself, and implicitly encourage others to not treat me as worthy of respect as well (e.g. "I get why you don't want to be in a relationship with me; I wouldn't want to be in a relationship with me.") To live well and trade with others it is important to be a person worthy of basic respect, and not doing undignified things ("this is beneath me") is how you maintain this.
Ben Pace
It seems to me that you have a concept-shaped hole, where people are constantly talking about an idea you don't get, and you have made a map-territory error in believing that they also do not have a referent here for the word. In general if a word has been in use for 100s of years, I think your prior should be that there is a referent there — I actually just googled it and the dictionary definition of dignity is the same as I gave ("the state or quality of being worthy of honor or respect"), so I think this one is straightforward to figure out. It is certainly possible that the other people around you also don't have a referent and are just using words the way children play with lego, but I'd argue that still is insufficient reason to attempt to prevent people who do know what the word is intended to mean from using the word. It's a larger discussion than this margin can contain, but my common attitude toward words losing their meaning in many people's minds is that we ought to rescue the meaning rather than lose it.

Note: below is a hypothetical future written in strong terms and does not track my actual probabilities.

Throughout 2025, a huge amount of compute is spent on producing data in verifiable tasks, such as math[1] (w/ "does it compile as a proof?" being the ground truth label) and code (w/ "does it compile and pass unit tests?" being the ground truth label).
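(A minimal sketch of what such a verifiable reward could look like in practice, assuming sandboxed calls to a test runner and a Lean toolchain; the commands and paths are placeholders, not any lab's actual pipeline.)

```python
# Sketch of a binary reward from verifiable tasks; commands/paths are placeholders.
import subprocess

def code_reward(solution_dir: str) -> float:
    # "Does it compile and pass unit tests?" -> 1.0 or 0.0
    result = subprocess.run(["pytest", "-q"], cwd=solution_dir, capture_output=True)
    return 1.0 if result.returncode == 0 else 0.0

def proof_reward(lean_file: str) -> float:
    # "Does it compile as a proof?" -> 1.0 or 0.0 (assumes a Lean 4 project toolchain)
    result = subprocess.run(["lake", "env", "lean", lean_file], capture_output=True)
    return 1.0 if result.returncode == 0 else 0.0
```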

In 2026, when the next giant compute clusters w/ their GB200's are built, labs train the next larger model over 100 days, then some extra RL(H/AI)F and whatever else they've cooked up by then.

By mid-2026, we have a model that is very generally intelligent, that is superhuman in coding and math proofs. 

Naively, 10x-ing research means releasing 10x the number of papers of the same quality in a year; however, these...

Hey Logan, thanks for writing this!

We talked about this recently, but for others reading this: given that I'm working on building an org focused on this kind of work and wrote a relevant shortform lately, I wanted to ping anybody reading this to send me a DM if you are interested in either making this happen (I'm looking for a cracked CTO atm and will be entering phase 2 of Catalyze Impact in January) or providing feedback on an internal vision doc.

Logan Riggs
This is mostly a "reeling from o3"-post. If anyone is doom/anxiety-reading these posts, well, I've been doing that too! At least, we're in this together:)

I have a lot of ideas about AGI/ASI safety. I've written them down in a paper and I'm sharing the paper here, hoping it can be helpful. 

Title: A Comprehensive Solution for the Safety and Controllability of Artificial Superintelligence

Abstract:

As artificial intelligence technology rapidly advances, it is likely that Artificial General Intelligence (AGI) and Artificial Superintelligence (ASI) will be realized in the future. Highly intelligent ASI systems could be manipulated by malicious humans or could independently evolve goals misaligned with human interests, potentially leading to severe harm or even human extinction. To mitigate the risks posed by ASI, it is imperative that we implement measures to ensure its safety and controllability. This paper analyzes the intellectual characteristics of ASI, and three conditions for ASI to cause catastrophes (harmful goals, concealed intentions,...

I didn't read the 100 pages, but the content seems extremely intelligent and logical. I really like the illustrations, they are awesome.

A few questions.

1: In your opinion, which idea in your paper is the most important, most new (not already focused on by others), and most affordable (can work without needing huge improvements in political will for AI safety)?

2: The paper suggests preventing AI from self-iteration, or recursive self improvement. My worry is that once many countries (or companies) have access to AI which are far better and faster than human... (read more)

plex
Glanced through the comments and saw surprisingly positive responses, but I'm reluctant to wade into a book-length reading commitment based on that. Is the core of your ideas on alignment compressible enough to fulfil the compelling insight heuristic?
Weibing Wang
The core idea about alignment is described here: https://wwbmmm.github.io/asi-safety-solution/en/main.html#aligning-ai-systems If you only focus on alignment, you can read just Sections 6.1-6.3, and that part is not too long.

There's a concept I first heard in relation to the Fermi Paradox, which I've ended up using a lot in other contexts.

Why do we see no aliens out there? A possible (though not necessarily correct) answer is that the aliens might not want to reveal themselves for fear of being destroyed by larger, older, hostile civilizations. There might be friendly civilizations worth reaching out to, but the upside of finding friendlies is smaller than the downside of risking getting destroyed.

Even old, powerful civilizations aren't sure that they're the oldest and most powerful civilization, and eldest civilizations could be orders of magnitude more powerful still.

So, maybe everyone made an individually rational-seeming decision to hide.

A quote from the original sci-fi story I saw describing this:

“The universe is a dark

...

The problem with Dark Forest theory is that, in the absence of FTL detection/communication, it requires a very high density and an absurdly high proportion of hiding civilizations. Without that, expansionary civilizations dominate. The only known civilization, us, is expansionary for reasons that don't seem path-dependent, so it seems unlikely that the preconditions for Dark Forest theory exist.

To explain:

Hiders have limited space and mass-energy to work with. An expansionary civilization, once in its technological phase, can spread to thousands of star sys... (read more)
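(To put rough, illustrative numbers of my own on why expansion dominates on cosmological timescales:)

```python
# Illustrative only: even slow expansion crosses a galaxy quickly relative to its age.
galaxy_diameter_ly = 100_000   # Milky Way diameter, roughly, in light years
expansion_speed_c = 0.01       # assume probes travel at 1% of light speed
crossing_time_years = galaxy_diameter_ly / expansion_speed_c  # = 10 million years
galaxy_age_years = 13e9        # order of magnitude
print(crossing_time_years / galaxy_age_years)  # ~0.0008 of the galaxy's age
```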

Last week, a report was released on the potential risks of mirror biology - that is, biological systems with reversed chirality. I had previously looked into the topic, and was skeptical of the viability of this as a global catastrophic or existential risk. This report was tremendously helpful in filling in details of a scenario where, if mirror bacteria are developed, they could cause a global catastrophe. I will assume that readers here have or will read at least a summary, if not the original Science article or perhaps the summaries in the (300 page) report itself. And before I get into details, I will say that I’m not a biologist; I have a significant familiarity with relevant topics, but there is some chance I get technical...

ChristianKl
It seems to me that it's going to be easier to build a bacterium with a changed coding for amino acids than to get a whole mirror bacterium to work. Having a four-base-pairs-per-amino-acid coding, where a single mutation results in a stop codon rather than a different amino acid being expressed, is useful for having a stable organism that doesn't mutate, and thus people might build it. You get the same problem of the new bacteria being immune to existing phages, but on the plus side it's not harder for the immune system to deal with them. Instead of focusing research dollars on antibiotics, I would expect them to be more effectively spent on phage development, to be able to create phages that target potentially problematic bacteria.

I don't understand. You shouldn't get any changes from changing encoding if it produces the same proteins - the difference for mirror life is that it would also mirror proteins, etc.

Historically, I've gotten the impression from the AI Safety community that theoretical AI alignment work is promising in long timeline situations but less relevant in short timeline ones. As timelines have shrunk and capabilities have accelerated, I feel like I've seen theoretical alignment work appear less frequently and with less fanfare.

However, I think that we should challenge this communal assumption. In particular, not all capabilities advance at the same rate, and I expect that formally verified mathematical reasoning capabilities will accelerate particularly quickly. Formally-verified mathematics has a beautifully clean training signal; it feels like the perfect setup for aggressive amounts of RL. We know that Deepmind's AlphaProof can get IMO Silver Medal performance (with caveats) while writing proofs in Lean. There's no reason to expect that performance...
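(As an aside on the Lean point above: here is a toy Lean 4 example of my own, unrelated to AlphaProof, showing how binary the signal is. The proof either elaborates without errors or it does not; there is no fuzzy grade in between.)

```lean
-- Toy example: the checker either accepts this proof (reward 1) or rejects it (reward 0).
theorem toy_add_comm (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```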


See livestream, site, OpenAI thread, Nat McAleese thread.

OpenAI announced (but isn't yet releasing) o3 and o3-mini (skipping o2 because of telecom company O2's trademark). "We plan to deploy these models early next year." "o3 is powered by further scaling up RL beyond o1"; I don't know whether it's a new base model.

o3 gets 25% on FrontierMath, smashing the previous SoTA. (These are really hard math problems.[1]) Wow. (The dark blue bar, about 7%, is presumably one-attempt and most comparable to the old SoTA; unfortunately OpenAI didn't say what the light blue bar is, but I think it doesn't really matter and the 25% is for real.[2])

o3 also is easily SoTA on SWE-bench Verified and Codeforces.

It's also easily SoTA on ARC-AGI, after doing RL on the public ARC-AGI...

Bill Benzon
Beating benchmarks, even very difficult ones, is all fine and dandy, but we must remember that those tests, no matter how difficult, are at best only a limited measure of human ability. Why? Because they present the test-taker with a well-defined situation to which they must respond. Life isn't like that. It's messy and murky. Perhaps the most difficult step is to wade into the mess and the murk and impose a structure on it – perhaps by simply asking a question – so that one can then set about dealing with that situation in terms of the imposed structure. Tests give you a structured situation. That's not what the world does. Consider this passage from Sam Rodiques, "What does it take to build an AI Scientist" Right. How do we put o3, or any other AI, out in the world where it can roam around, poke into things, and come up with its own problems to solve? If you want AGI in any deep and robust sense, that's what you have to do. That calls for real agency. I don't see that OpenAI or any other organization is anywhere close to figuring out how to do this.
Zach Stein-Perlman
OpenAI didn't fine-tune on ARC-AGI, even though this graph suggests they did. Sources: statements from Altman, from François Chollet (in the blogpost with the graph), from an OpenAI staff member who replied and further confirmed what "tuned" in the graph means, and from another OpenAI staff member. So on ARC-AGI they just pretrained on 300 examples (75% of the 400 in the public training set). Performance is surprisingly good. [heavily edited after first posting]

Thank you so much for your research! I would have never found these statements.

I'm still quite suspicious. Why would they be "including a (subset of) the public training set"? Is it accidental data contamination? They don't say so. Do they think simply including some questions and answers without reinforcement learning or reasoning would help the model solve other such questions? That's possible but not very likely.

Were they "including a (subset of) the public training set" in o3's base training data? Or in o3's reinforcement learning problem/answer sets?

A... (read more)

O O
Do you have a link to these?

My median expectation is that AGI[1] will be created 3 years from now. This has implications for how to behave, and I will share some useful thoughts I and others have had on how to orient to short timelines.

I’ve led multiple small workshops on orienting to short AGI timelines and compiled the wisdom of around 50 participants (but mostly my thoughts) here. I’ve also participated in multiple short-timelines AGI wargames and co-led one wargame.

This post will assume median AGI timelines of 2027 and will not spend time arguing for this point. Instead, I focus on what the implications of 3 year timelines would be. 

I didn’t update much on o3 (as my timelines were already short) but I imagine some readers did and might feel disoriented now. I hope...

Note that "The AI Safety Community" is not part of this list. I think external people without much capital just won't have that much leverage over what happens.

What would you advise for external people with some amount of capital, say $5M? How would this change for each of the years 2025-2027?

I haven’t found many credible reports on what algorithms and techniques have been used to train the latest generation of powerful AI models (including OpenAI’s o3). Some reports suggest that reinforcement learning (RL) has been a key part, which is also consistent with what OpenAI officially reported about o1 three months ago.

The use of RL to enhance the capabilities of AGI[1] appears to be a concerning development. As I wrote previously, I have been hoping to see AI labs stick to training models through pure language modeling. By “pure language modeling” I don’t rule out fine-tuning with RLHF or other techniques designed to promote helpfulness/alignment, as long as they don’t dramatically enhance capabilities. I’m also okay with the LLMs being used as part of more complex AI systems that invoke many...

purple fire
"you risk encouraging i) CoTs that carry side information that's only known to the model" This is true by default, but not intractable. For example, you can train the CoT model with periodic paraphrasing to avoid steganography, or you can train a CoT model just for capabilities and introduce a separate model that reasons about safety. Daniel Kokotajlo has some nice writeups about these ideas, he calls it the Shoggoth/Face framework. "superhuman capabilities" Agreed that this would be bad, but condition on this happening, better to do it with RL CoT over blackbox token generation. "planning ahead and agency in ways that are difficult to anticipate" Not sure why this would be the case--shouldn't having access to the model's thought process make this easier to anticipate than if the long-term plans were stored in neuralese across a bunch of transformer layers? "RL encotages this reasoning process to be more powerful, more agentic, and less predictable" This is something I agree with in the sense that our frontier models are trained with RL, and those models are also the most powerful and most agentic (since they're more capable), but I'm not convinced that this is inherent to RL training, and I'm not exactly sure in what way these models are less predictable.
Nadav Brandes
Why are you conditioning on superhuman AGI emerging? I think it's something very dangerous that our society isn't ready for. We should pursue a path where we can enjoy as many of the benefits of sub-human-level AGI (of the kind we already have) as possible without risking uncontrolled acceleration. Pushing for stronger capabilities with open-ended RL is counterproductive for the very scenario we need to avoid. It sounds like you believe that training a model with RL would make it somehow more transparent, whereas I believe the opposite. Can you explain your reasoning? Do you disagree that RL pushes models to be better at planning and exceeding human-level capabilities? I like the idea of discouraging steganography, but I still worry that given strong enough incentives, RL-trained models will find ways around this.

I think we're working with a different set of premises, so I'll try to disentangle a few ideas.

First, I completely agree with you that building superhuman AGI carries a lot of risks, and that society broadly isn't prepared for the advent of AI models that can perform economically useful labor.

Unfortunately, economic and political incentives being what they are, capabilities research will continue to happen. My more specific claim is that conditional on AI being at a given capabilities level, I prefer to reach that level with less capable text generators an... (read more)

Nadav Brandes
Thank you Seth for the thoughtful reply. I largely agree with most of your points. I agree that RL trained to accomplish things in the real world is far more dangerous than RL trained to just solve difficult mathematical problems (which in turn is more dangerous than vanilla language modeling). I worry that the real-world part will soon become commonplace, judging from current trends. But even without the real-world part, models could still be incentivized to develop superhuman abilities and complex strategic thinking (which could be useful for solving mathematical and coding problems). Regarding the chances of stopping/banning open-ended RL, I agree it's a very tall order, but my impression of the advocacy/policy landscape is that people might be open to it under the right conditions. At any rate I wasn't trying to reason about what's reasonable to ask for, only about the implications of different paths. I think the discussion should start there, and then we can consider what's wise to advocate for. For all of these reasons, I fully agree with you that work on demonstrating these risks in a rigorous and credible way is one of the most important efforts for AI safety.