This seems largely correct, but I must admit I have never seen an experiment that clearly demonstrates that diffusion is the main feature. Perhaps such experiments have been carried out, but if so I think one would have to do something extremely challenging, like filming the process at extremely high frame rates with something like a scanning electron microscope. My sense is that the "performance curve" of filters is mostly empirically deduced, while we are actually only extrapolating when making statements about what exactly causes these empirical results.
For example, another process I intuitively feel is different between air and water is the density, and thus the force, of the fluid on contaminants. If you travel in a boat, it is so much harder to stick your hand in the water compared to the air. Similarly, a particle that could potentially attach to a filter fiber in water is unlikely to stay attached, as the water would exert such a high force on it that it detaches. This is why one washes one's car with a water hose, not an air hose.
I would be interested in any experiment that has looked at the micro scale physics involved in air filtration but my impression after looking at a lot of filter literature is that there are few, if any such studies.
I agree about it being hard to understand the immune system completely; I should have written "understand one single process well enough to have high confidence". So I just wanted to understand one step, such as the binding of something to just one of the TLRs. And the understanding could be empirical too - I would be confident if researchers could robustly repeat a failure of some mirror component to bind to a TLR, for example.
Hi - I would like you to explain, in rather more detail, how this entity works. It's "Claude", but presumably you have set it up in some way so that it has a persistent identity and self-knowledge beyond just being Claude?
Fixed, thanks
It's faster and more logical in its theoretical underpinning, and generally does a better job than UMAP.
Not a big deal, I just like letting people know that there's a new algorithm now which seems like a solid pareto improvement.
OpenAI shared they trained the o3 we tested on 75% of the Public Training set
Probably a dataset for RL - that is, the model was trained to try and try again to solve these tests with long chains of reasoning, not just tuned or pretrained on them. A detail like "75% of examples" sounds like a test-centric dataset design decision, with the other 25% going to the validation part of the dataset.
Altman: "didn't go do specific work ... just the general effort"
Seems plausible they trained on ALL the tests, specifically targeting various tests. The public part of ARC-AGI is "just" a part of that dataset of all the tests. Could be some part of explaining the o1/o3 difference in $20 tier.
I really like your vision for interpretability!
I've been a little pessimistic about mech interp as compared to chain of thought, since CoT already produces very understandable end-to-end explanations for free (assuming faithfulness etc.).
But I'd be much more excited if we could actually understand circuits to the point of replacing them with annotated code or perhaps generating on-demand natural language explanations in an expandable tree. And as long as we can discover the key techniques for doing that, the fiddly cognitive work that would be required to scale it across a whole neural net may be feasible with AI only as capable as o3.
We did the 80% pledge thing, and that was like a thing that everybody was just like, "Yes, obviously we're gonna do this."
Does anyone know what this is referring to? (Maybe a pledge to donate 80%? If so, curious about 80% of what & under what conditions.)
When/how did you learn it? (Inasmuch as your phrasing is not entirely metaphorical.)
I've sometimes said that dignity is the first skill I learned (often to the surprise of others, since I am so willing to look silly or dumb or socially undignified). Part of my original motivation for bothering to intervene on x-risk is that it would be beneath my dignity to live on a planet with an impending intelligence explosion on track to wipe out the future, and not do anything about it.
I think Ben's is a pretty good description of what it means for me, modulo that the "respect" in question is not at all social. It's entirely about my relationship with myself. My dignity or not is often not visible to others at all.
I think it approaches it from a different level of abstraction though. Alignment faking is the strategy used to achieve goal guarding. I think both can be useful framings.
The following quote from Harry Potter and the Half-Blood Prince often runs through my mind, and matches up with what Eliezer is advising us to collectively do in that essay.
"It was, he thought, the difference between being dragged into the arena to face a battle to the death and walking into the arena with your head held high. Some people, perhaps, would say that there was little to choose between the two ways, but Dumbledore knew - and so do I, thought Harry, with a rush of fierce pride, and so did my parents - that there was all the difference in the world."
...?
Death with Dignity is straightforwardly using the word dignity in line with its definition (and thus in line with the explanation I gave), so if you think that's the usage Mark is referring to then you should agree with the position that dignity is a word that is being consistently used to mean "the state or quality of being worthy of honor or respect".
I disagree with Ben. I think the usage that Mark is referring to is a reference to Death with Dignity. A central example of my usage is
it would be undignified if AI takes over because we didn't really try off-policy probes; maybe they just work; someone should figure that out
It's playful and unserious but "X would be undignified" roughly means "it would be an unfortunate error if we did X or let X happen" and is used in the context of AI doom and our ability to affect P(doom).
Hey Logan, thanks for writing this!
We talked about this recently, but for others reading this: given that I'm working on building an org focused on this kind of work and wrote a relevant shortform lately, I wanted to ping anybody reading this to send me a DM if you are interested either in making this happen (I'm looking for a cracked CTO atm and will be entering phase 2 of Catalyze Impact in January) or in providing feedback on an internal vision doc.
I didn't read the 100 pages, but the content seems extremely intelligent and logical. I really like the illustrations, they are awesome.
A few questions.
1: In your opinion, which idea in your paper is the most important, most new (not already focused on by others), and most affordable (can work without needing huge improvements in political will for AI safety)?
2: The paper suggests preventing AI from self-iteration, or recursive self improvement. My worry is that once many countries (or companies) have access to AI which are far better and faster than humans at AI research, each one will be tempted to allow a very rapid self improvement cycle.
Each country might fear that if it doesn't allow it, one of the other countries will, and that country's AI will be so intelligent it can engineer self replicating nanobots which take over the world. This motivates each country to allow the recursive self improvement, even if the AI's methods of AI development become so advanced they are inscrutable by human minds.
How can we prevent this?
Edit: sorry, I didn't read the paper. But when I skimmed it, you did have a section on "AI Safety Governance System," and talked about an international organization to get countries to do the right thing. I guess one question is, why would an international system succeed in AI safety, when current international systems have so far failed to prevent countries from acting selfishly in ways which severely harm other countries (e.g. all wars, exploitation, etc.)?
The problem with Dark Forest theory is that, in the absence of FTL detection/communication, it requires a very high density and absurdly high proportion of hiding civilizations. Without that, expansionary civilizations dominate. The only known civilization, us, is expansionary for reasons that don't seem path-determinant, so it seems unlikely that the preconditions for Dark Forest theory exist.
To explain:
Hiders have limited space and mass-energy to work with. An expansionary civilization, once in its technological phase, can spread to thousands of star systems in mere thousands of years and become unstoppable by hiders. So, hiders need to kill expansionists before that happens. But if they're going to hide in their home system, they can't detect anything faster than light allows! So you need murderous hiding civs within at least a thousand light years of every single habitable planet in the galaxy, all of which need to have evolved before any expansionary civs in the area. This is improbable unless basically every civ is a murderous hider. The fact that the only known civ is not a murderous hider, for generalizable reasons, is thus evidence against the Dark Forest theory.
Potential objections:
Probes are still limited by lightspeed; an expansionary civ would become overwhelmingly strong before the probes reported back.
If the probes succeed in killing everything in the galaxy before they reach the stars, you didn't need to hide in the first place. (Also, note that hiding is a failed strategy for everyone else in this scenario, you can't do anything about a killer probe when you're the equivalent of the Han dynasty. Or the equivalent of a dinosaur.) If the probes fail, the civ they failed against will have no reason to hide, having been already discovered, and so will expand and dominate.
Conceivable, but I'd rather be the expansionary civs here?
I think this is the strongest objection. If, for example, a hider civ could send out a few ships that can travel at a higher percentage of lightspeed than anything the expansionary civ can do, and those ships can detonate stars or something, and catching up to this tech would take millions of years, then just a few ships could track down and obliterate the expansionary civ within thousands/tens of thousands of years and win.
The problem is that the "hider civ evolved substantially earlier" part has to be true everywhere in the galaxy, or else somewhere an expansionary civilization wins and then snowballs with their resource advantages - this comes back to the "very high density and absurdly high proportion of hiding civilizations" requirement. The hiding civs have to always be the oldest whenever they meet an expansionary civ, and older to a degree that the expansionary civ's likely several orders of magnitude more resources and population doesn't counteract the age difference.
It seems to me that you have a concept-shaped hole, where people are constantly talking about an idea you don't get, and you have made a map-territory error in believing that they also do not have a referent here for the word. In general if a word has been in use for 100s of years, I think your prior should be that there is a referent there — I actually just googled it and the dictionary definition of dignity is the same as I gave ("the state or quality of being worthy of honor or respect"), so I think this one is straightforward to figure out.
It is certainly possible that the other people around you also don't have a referent and are just using words the way children play with lego, but I'd argue that still is insufficient reason to attempt to prevent people who do know what the word is intended to mean from using the word. It's a larger discussion than this margin can contain, but my common attitude toward words losing their meaning in many people's minds is that we ought to rescue the meaning rather than lose it.
I don't understand. You shouldn't get any changes from changing encoding if it produces the same proteins - the difference for mirror life is that it would also mirror proteins, etc.
But it's a very important concept! It means doing something that breaks your ability to respect yourself. For instance, you might want to win a political election, and you think you can win on policies and because people trust you, but you're losing, and so you consider using attack-ads or telling lies or selling out to rich people who you believe are corrupt. You can actually do these and get away with it, and they're bad in different ways, but one of the ways it's bad is you are no longer acting in a way where you relate to yourself as someone deserving of respect. Which is bad for the rest of your life, where you'll probably treat yourself poorly and implicitly encourage others to treat you poorly as well. Who wants to work with someone or be married to someone or be friends with someone that they do not respect? I care about people's preferences and thoughts less when I do not respect them, and I will probably care about my own less if I do not respect myself, and implicitly encourage others to not treat me as worthy of respect as well (e.g. "I get why you don't want to be in a relationship with me; I wouldn't want to be in a relationship with me either.")
To live well and trade with others it is important to be a person worthy of basic respect, and not doing undignified things ("this is beneath me") is how you maintain this.
Thank you so much for your research! I would have never found these statements.
I'm still quite suspicious. Why would they be "including a (subset of) the public training set"? Is it accidental data contamination? They don't say so. Do they think simply including some questions and answers without reinforcement learning or reasoning would help the model solve other such questions? That's possible but not very likely.
Were they "including a (subset of) the public training set" in o3's base training data? Or in o3's reinforcement learning problem/answer sets?
Altman never said "we didn't go do specific work [targeting ARC-AGI]; this is just the general effort."
Instead he said,
Worth mentioning that we didn't, we target and we think it's an awesome benchmark but we didn't go do specific [inaudible]--you know, the general rule, but yeah really appreciate the partnership this was a fun one to do.
The gist I get is that he admits to targeting it but that OpenAI targets all kinds of problem/answer sets for reinforcement learning, not just ARC's public training set. It felt like he didn't want to talk about this too much, from the way he interrupted himself and changed the topic without clarifying what he meant.
The other sources do sort of imply no reinforcement learning. I'll wait to see if they make a clearer denial of reinforcement learning, rather than a "nondenial denial" which can be reinterpreted as "we didn't fine-tune o3 in the sense we didn't use a separate derivative of o3 (that's fine-tuned for just the test) to take the test."
My guess is o3 is tuned using the training set, since François Chollet (developer of ARC) somehow decided to label o3 as "tuned" and OpenAI isn't racing to correct this.
Beating benchmarks, even very difficult ones, is all fine and dandy, but we must remember that those tests, no matter how difficult, are at best only a limited measure of human ability. Why? Because they present the test-taker with a well-defined situation to which they must respond. Life isn't like that. It's messy and murky. Perhaps the most difficult step is to wade into the mess and the murk and impose a structure on it – perhaps by simply asking a question – so that one can then set about dealing with that situation in terms of the imposed structure. Tests give you a structured situation. That's not what the world does.
Consider this passage from Sam Rodriques, "What does it take to build an AI Scientist?":
Scientific reasoning consists of essentially three steps: coming up with hypotheses, conducting experiments, and using the results to update one’s hypotheses. Science is the ultimate open-ended problem, in that we always have an infinite space of possible hypotheses to choose from, and an infinite space of possible observations. For hypothesis generation: How do we navigate this space effectively? How do we generate diverse, relevant, and explanatory hypotheses? It is one thing to have ChatGPT generate incremental ideas. It is another thing to come up with truly novel, paradigm-shifting concepts.
Right.
How do we put o3, or any other AI, out in the world where it can roam around, poke into things, and come up with its own problems to solve? If you want AGI in any deep and robust sense, that's what you have to do. That calls for real agency. I don't see that OpenAI or any other organization is anywhere close to figuring out how to do this.
Note that "The AI Safety Community" is not part of this list. I think external people without much capital just won't have that much leverage over what happens.
What would you advise for external people with some amount of capital, say $5M? How would this change for each of the years 2025-2027?
I think we're working with a different set of premises, so I'll try to disentangle a few ideas.
First, I completely agree with you that building superhuman AGI carries a lot of risks, and that society broadly isn't prepared for the advent of AI models that can perform economically useful labor.
Unfortunately, economic and political incentives being what they are, capabilities research will continue to happen. My more specific claim is that conditional on AI being at a given capabilities level, I prefer to reach that level with less capable text generators and more advanced RL/scaffolding (e.g. o1) as opposed to simply more capable text generators (e.g. GPT-4). I believe that the former lends itself to stronger oversight mechanisms, more tractable interpretability techniques, and more robust safety interventions during real-world deployment.
"It sounds like you believe that training a model with RL would make it somehow more transparent, whereas I believe the opposite. Can you explain your reasoning?"
This might have come across wrong, and is a potential crux. Conditioning on a particular text-generation model, I would guess that applying RL increases the risk--for example, I would consider Gemini 2.0 Flash Thinking riskier than Gemini 2.0 Flash. But if you just showed me a bunch of eval results for an unknown model and asked how risky I thought the model was based on those, I would be more concerned about a fully black-box LLM than an RL CoT/scaffolded LM.
"Do you disagree that RL pushes models to be better at planning and exceeding human-level capabilities?"
No, it seems pretty clear that RL models like o3 are more capable than vanilla LLMs. So in a sense, I guess I think RL is bad because it increases capabilities faster, which I think is bad. But I still disagree that RL is worse for any theoretical reason beyond "it works better".
Tying this all back to your post, there are a few object-level claims that I continue to disagree with, but if I came to agree with you on them I would also change my mind more on the overall discussion. Specifically:
At this point we can no longer trust the chains of thought to represent their true reasoning, because models are now rewarded based on the final results that these chains lead to. Even if you put a constraint requiring the intermediate tokens to appear like logical reasoning, the models may find ways to produce seemingly-logical tokens that encode additional side information useful for the problem they are trying to solve. (I agree with this naively, but think this problem is a lot more tractable than e.g. interpretability on a 100b-parameter transformer.)
Of course, I'm more than open to hearing stronger arguments for these, and would happily change my mind if I saw convincing evidence.
The TLDR has multiple conclusions but this is my winner:
My conclusion -- springing to a great degree from how painful seeking clear predictions in 700 pages of words has been -- is that if anyone says "I have a great track record" without pointing to specific predictions that they made, you should probably ignore them, or maybe point out their lack of epistemic virtue if you have the energy to spare for doing that kind of criticism productively.
There is a skill in writing things that, when read later, are likely to be interpreted as predictions of things that happened between writing and reading. This is the skill of astrologers. There is another skill in accurately predicting the future, and writing that down. This is the skill of forecasters.
The post, and the comments and reviews on this post, show this off. People disagree with the author and each other. If there is serious debate over how a prediction resolves after the resolution time has passed, it might have been astrology the whole time.
This point is also made in Beware boasting about non-existent forecasting track records from 2022. I think this post adds substantial evidence to that perspective on forecasting in general.
As a side note, the author notes both Hanson and Yudkowsky are hard to score. This is a generalized problem that afflicts many smart people and maybe also you. It definitely afflicts me. It's not about one person.
"Undignified" is really vague
I sometimes see/hear people say that "X would be really undignified". I mostly don't really know what this means? I think it means "if I told someone that I did X, I would feel a bit embarrassed." It's not really an argument against X. It's not dissimilar to saying "vibes are off with X".
Not saying you should never say it, but basically every use I see could/should be replaced with something more specific.
The core idea about alignment is described here: https://wwbmmm.github.io/asi-safety-solution/en/main.html#aligning-ai-systems
If you only focus on alignment, you can only read Sections 6.1-6.3, and the length of this part will not be too long.
OpenAI didn't fine-tune on ARC-AGI, even though this graph suggests they did.
Sources:
Altman said
we didn't go do specific work [targeting ARC-AGI]; this is just the general effort.
François Chollet (in the blogpost with the graph) said
Note on "tuned": OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data.
The version of the model we tested was domain-adapted to ARC-AGI via the public training set (which is what the public training set is for). As far as I can tell they didn't generate synthetic ARC data to improve their score.
An OpenAI staff member replied
Correct, can confirm "targeting" exclusively means including a (subset of) the public training set.
and further confirmed that "tuned" in the graph is
a strange way of denoting that we included ARC training examples in the O3 training. It isn’t some finetuned version of O3 though. It is just O3.
Another OpenAI staff member said
also: the model we used for all of our o3 evals is fully general; a subset of the arc-agi public training set was a tiny fraction of the broader o3 train distribution, and we didn’t do any additional domain-specific fine-tuning on the final checkpoint
So on ARC-AGI they just pretrained on 300 examples (75% of the 400 in the public training set). Performance is surprisingly good.
[heavily edited after first posting]
I finally googled what Elon Musk has said about solar power, and found that he did a similar calculation recently on twitter:
Once you understand Kardashev Scale, it becomes utterly obvious that essentially all energy generation will be solar.
Also, just do the math on solar on Earth and you soon figure out that a relatively small corner of Texas or New Mexico can easily serve all US electricity.
One square mile on the surface receives ~2.5 Gigawatts of solar energy. That’s Gigawatts with a “G”. It’s ~30% higher in space. The Starlink global satellite network is entirely solar/battery powered.
Factoring in solar panel efficiency (25%), packing density (80%) and usable daylight hours (~6), a reasonable rule of thumb is 3GWh of energy per square mile per day. Easy math, but almost no one does these basic calculations.
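For anyone who wants to check the quoted rule of thumb, here's the arithmetic as a minimal sketch (it uses only the figures from the quote; nothing here is my own estimate):

```python
# Quick check of the quoted rule of thumb, using only the figures from the tweet.
insolation_gw_per_sq_mile = 2.5   # ~2.5 GW of sunlight hitting one square mile
panel_efficiency = 0.25           # 25% conversion efficiency
packing_density = 0.80            # 80% of the area actually covered by panels
usable_daylight_hours = 6         # ~6 equivalent full-sun hours per day

gwh_per_sq_mile_per_day = (insolation_gw_per_sq_mile
                           * panel_efficiency
                           * packing_density
                           * usable_daylight_hours)
print(gwh_per_sq_mile_per_day)    # 3.0 -- matches the ~3 GWh per square mile per day in the quote
```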
It's important to remember that o3's score on ARC-AGI is "tuned" while previous AIs' scores are not "tuned." Being explicitly trained on example test questions gives it a major advantage.
According to François Chollet (ARC-AGI designer):
Note on "tuned": OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data.
It's interesting that OpenAI did not test how well o3 would have done before it was "tuned."
EDIT: People at OpenAI deny "fine-tuning" o3 for the ARC (see this comment by Zach Stein-Perlman). But to me, the denials sound like "we didn't use a separate derivative of o3 (that's fine-tuned for just the test) to take the test, but we may have still done reinforcement learning on the public training set." (See my reply)
Sort of. I think the distribution of Θ is the Ap distribution, since it satisfies that formula; Θ=p is Ap. It's just that Jaynes prefers an exposition modeled on propositional logic, whereas a standard probability textbook begins with the definition of "random variables" like Θ, but this seems to me just a notational difference, since an equation like Θ=p is after all a proposition from the perspective of propositional logic. So I would rather say that Bayesian statisticians are in fact using it, and I was just explaining why you don't find any exposition of it under that name. I don't think there's a real conceptual difference. Jaynes of course would object to the word "random" in "random variable" but it's just a word, in my post I call it an "unknown quantity" and mathematically define it the usual way.
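To spell out the correspondence I mean (my own notation, and my own reading of Jaynes rather than a standard textbook statement):

```latex
% Jaynes' A_p: a proposition defined so that conditioning on it pins P(A) to p,
% whatever other background information E we have:
P(A \mid A_p, E) = p .
% The same content phrased with an "unknown quantity" \Theta and its density f(\theta \mid E):
P(A \mid \Theta = p, E) = p , \qquad
P(A \mid E) = \int_0^1 \theta \, f(\theta \mid E)\, d\theta .
```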
My timelines have now updated to something closer to fast takeoff. In a world like this, how valuable is educating the general public? Claude claims science started worrying about the climate in the 50s/60s. It wasn't until the 2010s that we saw meaningful action beginning to take place. Do we have the time to educate?
To be clear, this is more of a question than an opinion that I hold. I am working to form an opinion.
Ah yes, you're right. I don't know why but I made the mental shortcut that the mutation rate was about the DNA of cows / humans and not the flu virus.
The general point still holds: I am wary of the assumption of a constant mutation rate of the flu virus. It really facilitates the computation, but if the computation under this simplifying hypothesis leads to a consequence which contradicts reality, I would interrogate this assumption.
It's surprising to have so few human cases, considering the large number of cows infected, if there is a human-compatible virion per cow.
Another cause of this discrepancy could also be that, due to the large mutation rate, a non-negligible part of the virions are not viable / don't replicate well / ...
There are papers which show heterogeneity for influenza / RNA viruses, but I don't really know if it's between the virus population (of the same kind of virus) or within the genome. And they are like a factor of 4 or so in the papers I have seen. So maybe less relevant than expected.
Regarding the details, my lack of deep knowledge of the domain is limiting. But as a mathematician who has had to model real phenomena and adapt the model to handle the discrepancy between the model's conclusions and reality, that's the train of thought which comes naturally to mind.
Why are you conditioning on superhuman AGI emerging? I think it's something very dangerous that our society isn't ready for. We should pursue a path where we can enjoy as many of the benefits of sub-human-level AGI (of the kind we already have) without risking uncontrolled acceleration. Pushing for stronger capabilities with open-ended RL is counterproductive for the very scenario we need to avoid.
It sounds like you believe that training a model with RL would make it somehow more transparent, whereas I believe the opposite. Can you explain your reasoning?
Do you disagree that RL pushes models to be better at planning and exceeding human-level capabilities?
I like the idea of discouraging steganography, but I still worry that given strong enough incentives, RL-trained models will find ways around this.
I mean, I think that it's definitely an update (anything short of 95% I think would have been quite surprising to me).
Really appreciate updates on these kinds of things. Empirical data is hard to come by, so even anecdotes like this are useful!
Thanks for the reference. You and the other commenter both seem to be saying the same thing: that there isn't much of a use case for the Ap distribution, as Bayesian statisticians have other frameworks for thinking about these sorts of problems. It seems important that I acquaint myself with the basic tools of Bayesian statistics to better contextualize Jaynes' contribution.
"you risk encouraging i) CoTs that carry side information that's only known to the model"
This is true by default, but not intractable. For example, you can train the CoT model with periodic paraphrasing to avoid steganography, or you can train a CoT model just for capabilities and introduce a separate model that reasons about safety. Daniel Kokotajlo has some nice writeups about these ideas, he calls it the Shoggoth/Face framework.
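To make the periodic-paraphrasing idea concrete, here's a minimal sketch (the `reasoner`/`paraphraser` interfaces are hypothetical placeholders, not any real API, and this is my toy rendering of the idea rather than Kokotajlo's exact proposal):

```python
def generate_cot_with_paraphrasing(problem, reasoner, paraphraser,
                                   n_steps=12, paraphrase_every=3):
    """Grow a chain of thought step by step, but periodically have a separate
    model rewrite it in fresh wording, so information hidden in the exact
    token choices (steganography) doesn't survive to later steps."""
    cot = []
    for step in range(n_steps):
        cot.append(reasoner.next_step(problem, cot))   # hypothetical interface
        if (step + 1) % paraphrase_every == 0:
            cot = paraphraser.rewrite(cot)             # hypothetical interface
    return reasoner.final_answer(problem, cot)
```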
"superhuman capabilities"
Agreed that this would be bad, but condition on this happening, better to do it with RL CoT over blackbox token generation.
"planning ahead and agency in ways that are difficult to anticipate"
Not sure why this would be the case--shouldn't having access to the model's thought process make this easier to anticipate than if the long-term plans were stored in neuralese across a bunch of transformer layers?
"RL encotages this reasoning process to be more powerful, more agentic, and less predictable"
This is something I agree with in the sense that our frontier models are trained with RL, and those models are also the most powerful and most agentic (since they're more capable), but I'm not convinced that this is inherent to RL training, and I'm not exactly sure in what way these models are less predictable.
This is mostly a "reeling from o3"-post. If anyone is doom/anxiety-reading these posts, well, I've been doing that too! At least, we're in this together:)
Thank you for the thorough response.
I have a bad habit of making a comment before reading the post...
At first glance I thought these shelters should apply to all kinds of biological threats so I wondered why the title refers to mirror bacteria, and I asked the question.
Now I think I see the reason. Mirror bacteria might be not only deadly, but also persist in the environment even if no one is around, while other biological threats probably spread from person to person, so shelters are more relevant to mirror bacteria.
This is close but not quite what I mean. Another attempt:
The literal Do Well At CodeForces task takes the form "you are given ~2 hours and ~6 problems, maximize this score function that takes into account the problems you solved and the times at which you solved them". In this o3 is in top 200 (conditional on no cheating). So I agree there.
As you suggest, a more natural task would be "you are given time t and one problem, maximize your probability of solving it in the given time". Already at t equal to ~1 hour (which is what contestants typically spend on the hardest problem they'll solve), I'd expect o3 to be noticeably worse than top 200. This is because the CodeForces scoring function heavily penalizes slowness, and so if o3 and a human have equal performance in the contests, the human has to make up for their slowness by solving more problems. (Again, this is assuming that o3 is faster than humans in wall clock time.)
I separately believe that humans would scale better than AIs w.r.t. t, but that is not the point I'm making here.
I sense that my quality of communication diminishes past this point, I should get my thoughts together before speaking too confidently
I believe you're right that we do something similar to LLMs (loosely, analogously); see
https://www.lesswrong.com/posts/i42Dfoh4HtsCAfXxL/babble
(I need to learn markdown)
My intuition is still LLM pessimistic, I'd be excited to see good practical uses, this seems like tool ai and that makes my existential dread easier to manage!
- even if we only take people with bipolar disorder: how the hell can they go on so few hours a night with their brain being manic but not simply breaking down?
Just wanted to tune in on this from anecdotal experience:
My last ever (non-iatrogenic) hypomanic episode started unprompted. But I was terrified of falling back into depression again! My solution was to try to avoid the depression by extending my hypomania as long as possible.
How did I do this? By intentionally not sleeping and by drinking more coffee (essentially doing the opposite of whatever the internet said stabilized hypomanic patients). I had a strong intuition that this would work. (I also had a strong intuition that the depression later would be worse, but I figured I'd cross that bridge when I came to it, even though my depression was life-threatening, because I was cognitively impaired by my episode.)
It worked! It was my longest and most intense (most euphoric and erratic, but least productive) hypomanic episode, and I don't think this is fully explained by it being later in the progression of my illness.
Did I "not simply break down?" I wouldn't say that's the case, even after iirc less than a week of hypomania and ~3 hours of sleep per night.
Generally, I would say that bipolar I patients with months-long mania are also "breaking down." Mania is severely disruptive. Manic patients are constantly making thinking mistakes (inappropriate risks resulting in long-term disability/losses/hospitalizations, delusions, hallucinations). They're also not happy all the time -- a lot of mania and hypomania presents with severe anger and irritability! I would consider this a breakdown. I can't say how much of the breaking down is because of the sleep deprivation vs. the other factors of the illness.
(Fortunately, I've been episode-free for 8 years now, save for a couple of days of hypomanic symptoms on the days I was given new anxiety medications that didn't work out.)
Well, the update for me would go both ways.
On one side, as you point out, it would mean that the model's single pass reasoning did not improve much (or at all).
On the other side, it would also mean that you can get large performance and reliability gains (on specific benchmarks) by just adding simple stuff. This is significant because you can do this much more quickly than the time it takes to train a new base model, and there's probably more to be gained in that direction – similar tricks we can add by hardcoding various "system-2 loops" into the AI's chain of thought and thinking process.
You might reply that this only works if the benchmark in question has easily verifiable answers. But I don't think it is limited to those situations. If the model itself (or some subroutine in it) has some truth-tracking intuition about which of its answer attempts are better/worse, then running it through multiple passes and trying to pick the best ones should get you better performance even without easy and complete verifiability (since you can also train on the model's guesses about its own answer attempts).
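Concretely, the kind of loop I'm imagining looks something like this (a rough sketch; `model` and `critic` are hypothetical stand-ins for the generator and its truth-tracking subroutine, not any real API):

```python
def best_of_n_answer(problem, model, critic, n=8):
    """Sample several answer attempts and keep the one the critic rates most
    plausible -- no external ground-truth verifier required. The critic's
    scores could also serve as a training signal on the model's own guesses."""
    attempts = [model.generate(problem) for _ in range(n)]        # hypothetical interface
    return max(attempts, key=lambda a: critic.score(problem, a))  # hypothetical interface
```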
Besides, I feel like humans do something similar when we reason: we think up various ideas and answer attempts and run them by an inner critic, asking "is this answer I just gave actually correct/plausible?" or "is this the best I can do, or am I missing something?"
(I'm not super confident in all the above, though.)
Lastly, I think the cost bit will go down by orders of magnitude eventually (I'm confident of that). I would have to look up trends to say how quickly I expect $4,000 in runtime costs to go down to $40, but I don't think it's all that long. Also, if you can do extremely impactful things with some model, like automating further AI progress on training runs that cost billions, then willingness to pay for model outputs could be high anyway.
Similar point is made here (towards the end): https://www.lesswrong.com/posts/HQyWGE2BummDCc2Cx/the-case-for-cot-unfaithfulness-is-overstated
I think I agree, more or less. One caveat is that I expect RL fine-tuning to degrade the signal / faithfulness / what-have-you in the chain of thought, whereas the same is likely not true of mech interp.
you can only care about what you fully understand
I think I need an operational definition of “care about” to process this
If you define "care about" as "put resources into trying to achieve", there's plenty of evidence that people care about things that they can't fully define and don't fully understand, not least the truth-seeking that happens here.
Your calculations look right for Shapley Values. I was calculating based on Ninety-Three's proposal (see here). So it's good that in your calculations the sum of parts equals the combined, that's what we'd expect for Shapley Values.
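For concreteness, here's a small sketch of the efficiency property I'm relying on (the two-player payoffs are made up purely for illustration):

```python
from itertools import permutations

def shapley_values(players, value):
    """Shapley value: average each player's marginal contribution over all join orders."""
    shapley = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = []
        for p in order:
            before = value(frozenset(coalition))
            coalition.append(p)
            shapley[p] += (value(frozenset(coalition)) - before) / len(orders)
    return shapley

# Made-up game: A alone produces 1, B alone produces 2, together they produce 5.
v = {frozenset(): 0, frozenset({"A"}): 1, frozenset({"B"}): 2, frozenset({"A", "B"}): 5}
vals = shapley_values(["A", "B"], v.get)
print(vals)                # {'A': 2.0, 'B': 3.0}
print(sum(vals.values()))  # 5.0 -- the sum of the parts equals the combined value
```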
I think it's both in the map, as a description, but I also think the behavior itself is in the territory, and my point is that you can get the same result but have different paths to get to the result, which is in the territory.
Also, I treat the map-territory difference in a weaker way than LW often assumes, where things in the map can also be in the territory, and vice versa.
Thank you for the warm reply; it's nice, and it's also good feedback that I didn't do anything explicitly wrong with my post.
It will be VERY funny if this ends up being essentially the o1 model with some tinkering to help it cycle questions multiple times to verify the best answers, or something banal like that. Wish they didn't make us wait so long to test that :/
Other, more targeted risks, such as bioweapons, pandemics and viral outbreaks would be better served by these shelters
I think they could maybe be appropriate for some bioweapons, but for most pathogen scenarios you don't need anywhere near the fourteen logs this seems to be designed for. So I do think it's important to be clear about the target threat: I expect designing for fourteen logs if you actually only need three or something makes it way more expensive.
I feel good. I'm about 3 years in now, and I still try to keep my sleep at around 6.5 hours/night (going between 6 hour [4 REM cycle] nights and 7.5 [5 REM cycle] nights). Going up to 7.5/night daily doesn't feel like it produces noticeable benefits, and I plan to keep up this 6.5-hour level. It doesn't feel forced at all. I haven't woken up to an alarm in years. I will stock up on 7.5 two days in a row if I know there's a risk of me only getting 4.5 hours (e.g., if I need to wake up for a flight).
However, despite me feeling good and I think performing well in my general life, I may have some tiredness in me. I fall asleep very quickly in the evening. After a 6 hour sleep, the next night, I can't really read on my phone or watch a show when I'm in bed or I'll fall asleep automatically. This isn't the case if I have a 7.5-hour sleep the prior two nights in a row, especially if it's also linked to me sleeping in rather than sleeping early. Falling asleep automatically could be seen as a downside, but alternatively, it also means I don't struggle to sleep, so that's even less time in bed.
I still advocate to my peers: "You've got many decades of life left. Explore sleeping less. Maybe your body can operate on 6 hours. Try intentionally getting less than 7.5 for a month, and see how you like it."
(I am admittedly not a LW regular, so please excuse this slow reply)
I haven't looked into PaCMAP, but is there a significant difference between the visuals generated using that and UMAP?
What's the last model you did check with, o1-pro?
For alphazero, I want to point out that it was announced 6 years ago (infinity by AI scale), and from my understanding we still don't have a 1000x faster version, despite much interest in one.
I don't know the details, but whatever the NN thing (derived from Lc0, a clone of AlphaZero) inside current Stockfish is can play on a laptop GPU.
And even if AlphaZero derivatives didn't gain 3 OOMs by themselves, that doesn't update me much toward this being something particularly hard. Google itself has no interest in improving it further and just moved on to MuZero, to AlphaFold, etc.
who lose sight of the point of it all
Pursuing some specific "point of it all" can be much more misguided.
Mostly just personal experience with burnout and things that I recall hearing from others; I don't have any formal papers to point at. Could be wrong.
A function in this context is a computational abstraction. I would say this is in the map.
Thank you Seth for the thoughtful reply. I largely agree with most of your points.
I agree that RL trained to accomplish things in the real world is far more dangerous than RL trained to just solve difficult mathematical problems (which in turn is more dangerous than vanilla language modeling). I worry that the real-world part will soon become commonplace, judging from current trends.
But even without the real-world part, models could still be incentivized to develop superhuman abilities and complex strategic thinking (which could be useful for solving mathematical and coding problems).
Regarding the chances of stopping/banning open-ended RL, I agree it's a very tall order, but my impression of the advocacy/policy landscape is that people might be open to it under the right conditions. At any rate I wasn't trying to reason about what's reasonable to ask for, only on the implications of different paths. I think the discussion should start there, and then we can consider what's wise to advocate for.
For all of these reasons, I fully agree with you that work on demonstrating these risks in a rigorous and credible way is one of the most important efforts for AI safety.
Even so, at some level of wealth you'll leave more behind by saving up the premium and having your children inherit the compound interest instead. That point is found through the Kelly criterion.
(The Kelly criterion is indeed equal to concave utility, but the insurance company is so wealthy that individual life insurance payouts sit on the nearly linear early part of the utility curve, whereas for most individuals it does not.)
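A minimal sketch of the comparison I have in mind, under log utility (all numbers are made up for illustration): you keep buying insurance only while expected log-wealth is higher with the policy than without, and past some wealth level the inequality flips.

```python
import math

def prefers_insurance(wealth, premium, loss, p_loss):
    """Compare expected log-wealth (log utility, as in the Kelly criterion)
    with and without the policy. Assumes wealth > loss and wealth > premium."""
    with_cover = math.log(wealth - premium)
    without_cover = (1 - p_loss) * math.log(wealth) + p_loss * math.log(wealth - loss)
    return with_cover > without_cover

# Made-up numbers: 1% chance of a $100k loss, premium $1.5k (above the $1k expected loss).
premium, loss, p = 1_500, 100_000, 0.01
for wealth in (120_000, 300_000, 1_000_000, 10_000_000):
    print(wealth, prefers_insurance(wealth, premium, loss, p))
# With these numbers the answer flips from True to False between $120k and $300k of wealth:
# above that point, keeping (and compounding) the premium leaves more behind.
```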
I think you're absolutely right.
And I think there's approximately a 0% chance humanity will stop at pure language models, or even stop at o1 and o3, which very likely use RL to dramatically enhance capabilities.
Because they use RL not to accomplish things-in-the-world but to arrive at correct answers to questions they're posed, the concerns you express (which pretty much anyone who's been paying attention to AGI risks agrees with) are not fully in play.
OpenAI will continue on this path unless legislation stops them. And that's highly unlikely to happen, because the argument against is just not strong enough to convince the public or legislators.
We are mostly applying optimization pressure to our AGI systems to follow instructions and produce correct answers. Framed that way, it sounds like it's as safe an approach as you could come up with for network-based AGI. I'm not saying it's safe, but I am saying it's hard to be sure it's not without more detailed arguments and analysis. Which is what I'm trying to do in my work.
Also as you say, it would be far safer to not make these things into agents. But the ease of doing so with a smart enough model and a prompt like "continue pursuing goal X using tools Y as appropriate to gather information and take actions" ensures that they will be turned into agents.
People want a system that actually does their work, not one that just tells them what to do. So they're going to make agents out of smart LLMs. This won't be stopped even with legislation; people will do it illegally or from jurisdictions that haven't passed the laws.
So we are going to have to both hope and plan for this approach, including RL for correct answers, being safe enough. Or come up with way stronger and more convincing arguments for why it won't be. I currently think it can be made safe in a realistic way with no major policy or research direction change. But I just don't know, because I haven't gotten enough people to engage deeply enough with the real difficulties and likely approaches.
I would distinguish two variants of this. There's just plain inertia, like if you have a big pile of legacy code that accumulated from a lot of work, then it takes a commensurate amount of work to change it. And then there's security, like a society needs rules to maintain itself against hostile forces. The former is sort of accidentally surreal, whereas the latter is somewhat intentionally so, in that a tendency to re-adapt would be a vulnerability.
Welcome!
To me the benchmark scores are interesting mostly because they suggest that o3 is substantially more powerful than previous models. I agree we can't naively translate benchmark scores to real-world capabilities.
scaring laws
lol
As in, for the literal task of "solve this code forces problem in 30 minutes" (or whatever the competition allows), o3 is ~ top 200 among people who do codeforces (supposing o3 didn't cheat on wall clock time). However, if you gave humans 8 serial hours and o3 8 serial hours, much more than 200 humans would be better. (Or maybe the cross over is at 64 serial hours instead of 8.)
Is this what you mean?
In the same terms as the $100-200bn I'm talking about, o3 is probably about $1.5-5bn, meaning 30K-100K H100, the system needed to train GPT-4o or GPT-4.5o (or whatever they'll call it) that it might be based on. But that's the cost of a training system, its time needed for training is cheaper (since the rest of its time can be used for other things). In the other direction, it's more expensive than just that time because of research experiments. If OpenAI spent $3bn in 2024 on training, this is probably mostly research experiments.
I'm all in for CoT! But when you RL a model to produce CoTs that are better at solving difficult problems, you risk encouraging i) CoTs that carry side information that's only known to the model, ii) superhuman capabilities, and iii) planning ahead and agency in ways that are difficult to anticipate. Whether you produce the CoT with a pure LLM or a model that's also undergone RL, you end up with a series of tokens where each token was produced by a fully internalized reasoning process. The only difference is that RL encourages this reasoning process to be more powerful, more agentic, and less predictable. What's the advantage of RL for safety?
CodeForces ratings are determined by your performance in competitions, and your score in a competition is determined, in part, by how quickly you solve the problems. I'd expect o3 to be much faster than human contestants. (The specifics are unclear - I'm not sure how a large test-time compute usage translates to wall-clock time - but at the very least o3 parallelizes between problems.)
This inflates the results relative to humans somewhat. So one shouldn't think that o3 is in the top 200 in terms of algorithmic problem solving skills.
This shelter idea has many points of potential failure, possible showstoppers, and assuming a small population of shelters (hundreds or a few thousand), seems extremely unlikely to maintain an MVP for more than a few months.
Points of failure:
Showstoppers:
I think the strongest thing this fashion norm has going for it is that, without having read or even heard about this post until showing up at solstice, I and at least one other person who also hadn't heard of it managed to coordinate on the same theme of "black + shiny" simply by wanting to wear something decorative for the occasion. I also like the "wear something you would like to be asked about" as a different kind of fashion coordination point, and think that it would be great for the summer solstice, which has a bit more of that vibe.
What do you think is the current cost of o3, for comparison?
The human struggle to find purpose is a problem of incidentally very weak integration or dialog between reason and the rest of the brain, and self-delusional but mostly adaptive masking of one's purpose for political positioning. I doubt there's anything fundamentally intractable about it. If we can get the machines to want to carry our purposes, I think they'll figure it out just fine.
Also... you can get philosophical about it, but the reality is, there are happy people, their purpose to them is clear, to create a beautiful life for themselves and their loved ones. The people you see at neurips are more likely to be the kind of hungry, high-achieving professional who is not happy in that way, perhaps does not want to be. So maybe you're diagnosing a legitimately enduring collective issue, the sorts of humans who end up on top tend to be the ones who are capable of divorcing their actions from a direct sense of purpose, or the types of people who are pathologically busy and who lose sight of the point of it all or never have the chance to cultivate a sense for it in the first place. It may not be human nature, but it could be humanity nature. Sure.
But that's still a problem that can be solved by having more intelligence. If you can find a way to manufacture more intelligence per human than the human baseline, that's going to be a pretty good approach to it.
Evan joined Anthropic in late 2022 no? (Eg his post announcing it was Jan 2023 https://www.alignmentforum.org/posts/7jn5aDadcMH6sFeJe/why-i-m-joining-anthropic)
I think you’re correct on the timeline, I remember Jade/Jan proposing DC Evals in April 2022, (which was novel to me at the time), and Beth started METR in June 2022, and I don’t remember there being such teams actually doing work (at least not publically known) when she pitched me on joining in August 2022.
It seems plausible that Anthropic's scaring laws project was already under way before then (and this is what they're referring to, but proliferating QA datasets feels qualitatively different from DC Evals). Also, they were definitely doing other red teaming, just none that seem to be DC Evals.
It's so much better if everyone in the company can walk around and tell you what are the top goals of the RSP, how do we know if we're meeting them, what AI safety level are we at right now—are we at ASL-2, are we at ASL-3—that people know what to look for because that is how you're going to have good common knowledge of if something's going wrong.
I like this goal a lot: Good RSPs could contribute to building common language/awareness around several topics (e.g., "if" conditions, "then" commitments, how safety decisions will be handled). As many have pointed out, though, I worry that current RSPs haven't been concrete or clear enough to build this kind of understanding/awareness.
One interesting idea would be to survey company employees and evaluate their understanding of RSPs & the extent to which RSPs are having an impact on internal safety culture. Example questions/topics:
One of my concerns about RSPs is that they (at least in their current form) don't actually achieve the goal of building common knowledge/awareness or improving company culture. I suspect surveys like this could prove me wrong– and more importantly, provide scaling companies with useful information about the extent to which their scaling policies are understood by employees, help foster common understanding, etc.
(Another version of this could involve giving multiple RSPs to a third-party– like an AI Safety Institute– and having them answer similar questions. This could provide another useful datapoint RE the extent to which RSPs are clearly/concretely laying out a set of specific or meaningful contributions.)
$100-200bn 5 GW training systems are now a go. So in the worlds that slow down for years if there are only $30bn systems available and would need an additional scaling push, timelines moved up a few years. Not sure how unlikely $100-200bn systems would've been without o1/o3, but they seem likely now.
I disagree. I think the current approach, with chain-of-thought reasoning, is a marked improvement over naive language modelling in terms of alignment difficulty. CoT allows us to elicit higher capabilities out of the same level base text generation model, meaning less of the computation is done inside the black box and more is done in human-readable tokens. While this still (obviously) has risks, it seems preferable to models that fully internalize the reasoning process. Do you agree with that?
Filtering liquids is pretty different from air, because a HEPA filter captures very small particles by diffusion. This means the worst performance is typically at ~0.3um (too large for ideal diffusion capture, too small for ideal interception and impaction) and is better on both bigger and smaller particles. The reported 99.97% efficiency (about 3.5 logs) is at this 0.3um nadir, though.
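A quick check of the log conversion, taking "logs" to mean the log10 reduction in what gets through the filter:

```python
import math
penetration = 1 - 0.9997            # 99.97% capture -> 0.03% of particles get through
print(-math.log10(penetration))     # ~3.52, i.e. about 3.5 logs at the 0.3um nadir
```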
I explain it in more detail in my original post.
In short, in standard language modeling the model only tries to predict the most likely immediate next token (T1), and then the most likely token after that (T2) given T1, and so on; whereas in RL it's trying to optimize a whole sequence of next tokens (T1, ..., Tn) such that the rewards for all the tokens (up to Tn) are taken into account in the reward of the immediate next token (T1).
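To make the contrast concrete, here's a simplified sketch in my own notation (the RL side is written in a generic policy-gradient form, not any lab's specific training objective):

```latex
% Standard language modeling: each token is scored only against its context.
\mathcal{L}_{\text{LM}}(\theta) = \sum_{t} \log p_\theta(x_t \mid x_{<t})
% RL on the generated sequence (policy-gradient form): one sequence-level reward
% R(x_{1:n}) multiplies every token's gradient, so early tokens T1, T2, ...
% are optimized for their downstream consequences, not just local likelihood.
\nabla_\theta J(\theta) = \mathbb{E}_{x_{1:n} \sim p_\theta}
  \Big[\, R(x_{1:n}) \sum_{t=1}^{n} \nabla_\theta \log p_\theta(x_t \mid x_{<t}) \Big]
```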
I basically agree. The following is speculation/playing with an idea, not something I think is likely true.
Imagine it's the future. It becomes clear that a lab could easily create mirror bacteria if they wanted to, or even deliberately create mirror pathogens. It may even get to the point where countries explicitly threaten to do this.
At that point, it might be a good idea to develop mirror life for the purposes of developing countermeasures.
I'm not that familiar with how modern vaccines and drugs are made. Can a vaccine be made without involving a living cell? What about an antibiotic?
Why does RL necessarily mean that AIs are trained to plan ahead?
Can you elaborate on why you think that genetic modification is more prone to creating inequality than other kinds of technology? You mentioned religious reasons in your original comment. Are there other reasons? On priors, I might expect it to follow a typical cost curve where it gets cheaper and more accessible over time, and where the most valuable modifications are subsidized for some people who can't afford them.
You cannot completely understand the immune system; that is something you learn early on in immunology.
That being said, the key understanding on mirror bacteria evading the immune system is that the immune system generally relies on binding to identify foreign invaders, and if they cannot bind then they cannot respond. Bacteria generally share a number of molecules on their surface, so the innate immune system has evolved to bind and detect these molecules. If they were mirrored, they would not bind as well, and would be harder to detect and respond to.
That being said, you did find the insight that they are not completely invisible. There are also systems that can detect the damage done by the infection and start a counterattack, even if they can't see the invaders themselves. But much of the counterattack would not be able to affect the mirror bacteria.
What matters in the report is that the immune system of all animals and plants will likely be (much) less effective against mirror bacteria. This doesn't mean it's an untreatable disease, as we have antibiotics that should still be effective against the mirror bacteria. But it does mean that if the mirror bacteria finds its way into the environment it is unlikely that anything can fight back well.
So far as I know, it is not the case that OpenAI had a slower-but-equally-functional version of GPT4 many months before announcement/release. What they did have is GPT4 itself, months before; but they did not have a slower version. They didn't release a substantially distilled version. For example, the highest estimate I've seen is that they trained a 2-trillion-parameter model. And the lowest estimate I've seen is that they released a 200-billion-parameter model. If both are true, then they distilled 10x... but it's much more likely that only one is true, and that they released what they trained, distilling later. (The parameter count is proportional to the inference cost.)
Previously, delays in release were believed to be about post-training improvements (e.g. RLHF) or safety testing. Sure, there were possibly mild infrastructure optimizations before release, but mostly to scale to many users; the models didn't shrink.
This is for language models. As for AlphaZero: it was announced six years ago (an eternity on AI timescales), and as far as I understand we still don't have a 1000x faster version, despite much interest in one.
It seems to me that it would be easier to build a bacterium with a changed amino acid coding than to get a whole mirror bacterium to work.
A coding scheme with four base pairs per amino acid, in which any single mutation produces a stop codon rather than a different amino acid, would be useful for building a stable organism that doesn't mutate, so people might actually build it.
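To illustrate that such a coding scheme is at least combinatorially possible, here's a toy construction (my own sketch, not a claim about how such an organism would actually be engineered): call a 4-base codon "sense" only when its base values sum to 0 mod 4, so any single-base substitution necessarily lands outside the sense set.

```python
from itertools import product

BASES = "ACGT"
VAL = {b: i for i, b in enumerate(BASES)}

def is_sense(codon: str) -> bool:
    # Toy rule: a 4-base codon codes for an amino acid only if its base
    # values sum to 0 mod 4; everything else is treated as a stop codon.
    return sum(VAL[b] for b in codon) % 4 == 0

sense = [c for c in map("".join, product(BASES, repeat=4)) if is_sense(c)]
print(len(sense))  # 64 sense codons, more than enough for 20 amino acids

# Any single-base substitution changes the sum mod 4, so every
# single-point mutant of a sense codon is a stop codon.
for codon in sense:
    for pos in range(4):
        for b in BASES:
            if b != codon[pos]:
                mutant = codon[:pos] + b + codon[pos + 1:]
                assert not is_sense(mutant)
print("all single-point mutants of sense codons are stops")
```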
You get the same problem of the new bacterium being immune to existing phages, but on the plus side it's no harder for the immune system to deal with.
Instead of focusing research dollars on antibiotics, I would expect them to be more effectively spent on phage development, so that we can create phages that target potentially problematic bacteria.
Assuming the problems are verifiable, or there's an easy way to check whether a solution works, I expect o3 to get at least 2/10, if not 3/10, correct under high-compute settings.
You could call them logic puzzles. I do think most smart people on LW would get 10/10 without too many problems, if they had enough time, although I've never tested this.
I work with bacterial viruses in liquids, and when we want to separate the bacteria from their viruses, we pass the liquid through a 0.22um filter. A quick search shows that the bacteria I work with are usually 0.5um in diameter, whereas the smallest bacteria can be down to 0.13um in diameter; however, the 0.22um filter is fairly standard for laboratory sterilization so I assume smaller bacteria are relatively rare. The 0.22um filter can also be used for gases.
But as with my usage, they block bacteria and not viruses. I'm working with 50nm-diameter viruses, though viruses of bacteria are generally smaller than those of animals; SARS-CoV-2 is somewhere from 50-140nm.
If you use a small enough filter, it will still filter out the viruses, but you'll need a pore size smaller than what is sufficient for filtering out bacteria (and smaller pores require more pressure, are more prone to clogging, etc.).
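To give a rough sense of how steep that pressure penalty is, here's a sketch of my own (treating a single cylindrical pore at fixed flow via Hagen-Poiseuille, and ignoring that real membranes compensate with pore density, thickness, etc.; the 0.02um figure is just a hypothetical pore small enough for the ~50nm viruses above):

```python
def pressure_ratio(d_large_um: float, d_small_um: float) -> float:
    # Hagen-Poiseuille: pressure drop across a single pore at fixed flow
    # scales as 1 / r^4, i.e. (d_large / d_small)^4.
    return (d_large_um / d_small_um) ** 4

# Going from a 0.22um "sterilizing" pore to a hypothetical 0.02um pore:
print(pressure_ratio(0.22, 0.02))  # ~14,600x more pressure per pore at the same flow
```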
(Though for air, it's quite rare for bare viruses to be floating around; they're usually in aerosols. Bacteria are often also in aerosols, which may be easier to filter out.)
Yeah, I've been thinking of setting up something like this.
Personally I'm not a fan of the pasta texture of baked mac and cheese, but I've definitely sauced the cooked pasta, topped with cheese, and broiled it. That's fast, and you could spread it across multiple pans so it has more surface area. I suspect a blow torch could also work?
I agree; after all, RLHF was originally developed for RL agents. As long as the models aren't all that smart, and the tasks they have to do aren't all that long-term, the transfer should work great, and the occasional failure won't be a problem because, again, the models aren't all that smart.
To be clear, I don't expect a 'sharp left turn' so much as 'we always implicitly incentivized exploitation of human foibles, we just always caught it when it mattered, until we didn't.'
In many publications, posts, and discussions about AI, I see an unstated assumption that intelligence is all about prediction power.
I don't think this assumption holds. It rests on a further assumption: that the cost of intelligence is negligible, or will become negligible as progress drives it down.
That does not fit the curve of AI capability versus the cost of resources needed (even well-optimized systems like our brains - basically cells acting as very efficient nanites - have limits).
The problem is that the resource costs of computation (materials, energy) and time should be part of the optimization. This means that the most intelligent system will rely on many heuristics that are "good enough" for real-world problems, targeting not the best prediction power but the best use of resources. This is also what we humans do - we mostly don't do exact Bayesian or other strict reasoning. We mostly use heuristics (many of which cause biases).
The decision to think more, or to simulate something precisely, is itself a decision about resources. Deciding whether to spend more resources and time to predict better, versus spending less and deciding faster, is also part of being intelligent. A very intelligent system should therefore be good at matching resources to the problem and rescaling that allocation as its knowledge changes. It should not over-commit to the most perfect predictions, and should use heuristics and techniques like clustering (including, but not limited to, the clustered fuzzy concepts of language) instead of a direct simulation approach, where possible.
Just a thought.
I'll admit I'm not very certain about the following claims, but here's my rough model:
Or perhaps any part of this story is false. As I said, I haven't been keeping a close enough eye on this part of things to be confident in it. But it's my current weakly-held strong view.
I think preference preservation works in our favor and an aligned model should have it - at least for meta-values and core values. It removes many possible failure modes, like diverging over time, dropping some values for the sake of better consistency, or sacrificing some values for better outcomes along other values.
Nope, I didn't know PaCMAP! Thanks for the pointer, I'll have a look.
There is an ideal where each person seeks a telos that they can personally pursue in a way that is consistent with an open, fair, prosperous society and, upon adopting such a telos for themselves, they seek to make the pursuit of that telos by themselves and their assembled team into something locally efficient. Living up to this ideal is good, even though haters gonna hate.
I think AI obviously keeps getting better. But I don't think "it can be done for $1 million" is such strong evidence for "it can be done cheaply soon" in general (though the prior on "it can be done cheaply soon" was not particularly low ex ante -- it's a plausible statement for other reasons).
Like, if your belief is "anything that can be done now can be done 1000x cheaper within 5 months", that's just clearly false for nearly every AI milestone in the last 10 years (we did not get a GPT-4 that's 1000x cheaper 5 months later, nor an AlphaZero, etc.).
i observe that processes seem to have a tendency towards what i'll call "surreal equilibria". [status: trying to put words to a latent concept. may not be legible, feel free to skip. partly 'writing as if i know the reader will understand' so i can write about this at all. maybe it will interest some.]
progressively smaller-scale examples:
it looks like i'm trying to describe an iterative pattern of established patterns becoming constraints bearing permanent resemblance to what they were, and new things sprouting up within the new context / constrained world, eventually themselves becoming constraints.[1]
i also had in mind smaller scale examples.
this feels related to goodhart, but where goodhart is framed more individually, and this is more like "a learned policy and its original purpose coming apart as a tendency of reality".
tangential: in this frame physics can be called the 'first constraint'
I'm having trouble parsing this, but I think the first point is about the mutation rate in humans? I don't expect that to be informative about the flu virus except as a floor.
By the start of April half the world was locked down, and Covid was the dominant factor in human affairs for the next two years or so. Do you think that issues pertaining to AI agents are going to be dominating human affairs so soon and so totally?