Humans have always been misaligned. Things now are probably significantly better in terms of human alignment than at almost any time in history (citation needed) due to high levels of education and broad agreement about many things that we take for granted (e.g. the limits of free trade are debated, but there has never been so much free trade). So you would need to think that something important was different now for there to be some kind of new existential risk.
One candidate is that as tech advances, the amount of damage a small misaligned group could do is g...
One tip for research of this kind is to not only measure recall, but also precision. It's easy to block 100% of dangerous prompts by blocking 100% of prompts, but obviously that doesn't work in practice. The actual task that labs are trying to solve is to block as many unsafe prompts as possible while rarely blocking safe prompts, or in other words, looking at both precision and recall.
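The degenerate "block everything" policy makes this concrete. A quick sketch (function and variable names are mine, just for illustration):

```python
def precision_recall(predictions, labels):
    """Precision and recall for a binary 'block this prompt' classifier.

    predictions/labels: lists of booleans, True = flagged / actually dangerous.
    """
    tp = sum(p and l for p, l in zip(predictions, labels))
    fp = sum(p and not l for p, l in zip(predictions, labels))
    fn = sum(not p and l for p, l in zip(predictions, labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# A classifier that blocks everything gets perfect recall but terrible precision.
labels    = [True, False, False, False]  # 1 dangerous prompt out of 4
block_all = [True, True, True, True]
print(precision_recall(block_all, labels))  # → (0.25, 1.0)
```

100% recall, 25% precision: useless in practice, which is why you have to report both numbers.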
Of course with truly dangerous models and prompts, you do want ~100% recall, and in that situation it's fair to say that nobody should ever be able to build a bioweapon. But...
The pivotal act link is broken, fyi.
Gemini V2 (1206 experimental which is the larger model) one boxes, so.... progress?
I'm probably too conflicted to give you advice here (I work on safety at Google DeepMind), but you might want to think through, at a gears level, what could concretely happen with your work that would lead to bad outcomes. Then you can balance that against positives (getting paid, becoming more familiar with model outputs, whatever).
You might also think about how your work compares to whoever would replace you on average, and what implications that might have as well.
This is great data! I'd been wondering about this myself.
Where were you measuring air quality? How far from the stove? Same place every time?
Practicing LLM prompting?
I haven't heard the p-zombie argument before, but I agree that it's at least some Bayesian evidence that we're not in a sim.
Probably 3 needs to be developed further, but this is the first new piece of evidence I've seen since I first encountered the simulation argument in like 2005.
Are we playing the question game because the thread was started by Rosencranz? Is China doing well in the EV space a bad thing?
Is it the case that the tech would exist without him? I think that's pretty unclear, especially for SpaceX, where despite other startups in the space, nobody else managed to radically reduce the cost per launch in a way that transformed the industry.
Even for Tesla, which seems more pedestrian (heh) now, there were a number of years where they had the only viable car in the market. It was only once they proved it was feasible that everyone else piled in.
Progress in ML looks a lot like, we had a different setup with different data and a tweaked algorithm and did better on this task. If you want to put an asterisk on o3 because it trained in some specific way that's different from previous contenders, then basically every ML advance is going to have a similar asterisk. Seems like a lot of asterisking.
Hm, I think the main thrust of this post misses something, which is that different conditions, even contradictory conditions, can easily happen locally. Obviously, it can be raining in San Francisco and sunny in LA, and you can have one person wearing a raincoat in SF and the other on the beach in LA with no problem, even if they are part of the same team.
I think this is true of wealth inequality.
Carnegie or Larry Page or Warren Buffett got their money in a non-exploitative way, by being better than others at something that was extremely socially valuable....
It seems very strange to me to say that they cheated, when the public training set is intended to be used exactly for training. They did what the test specified! And they didn't even use all of it.
The whole point of the test is that some training examples aren't going to unlock the rest of it. What training definitely does is teach the model how to output the JSON in the right format, and likely how to think about what to even do with these visual puzzles.
Do we say that humans aren't a general intelligence even though for ~all valuable tasks, you have to take some time to practice, or someone has to show you, before you can do it well?
More pointedly, I didn't see anyone complaining about the previous champion doing 100%-ARC-only online training while trying to solve ARC, so why would you complain about weaker offline training as a small part of a giant pretraining corpus?
(Generating millions of examples to train on, yes, people did complain about that and arguably that is 'cheating', but 'not using froze...
Why does RL necessarily mean that AIs are trained to plan ahead?
"Reliable fact recall is valuable, but why would o1 pro be especially good at it? It seems like that would be the opposite of reasoning, or of thinking for a long time?"
Current models were already good at identifying and fixing factual errors when run over a response and asked to critique and fix it. It works maybe 80% of the time at identifying whether there's a mistake, and can fix it at a somewhat lower rate.
So not surprising at all that a reasoning loop can do the same thing. Possibly there's some other secret sauce in there, but just critiquing and fixing mistakes is probably enough to see the reported gains in o1.
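As a sketch of the control flow I mean by a critique-and-fix loop (the `toy_model` here is a stand-in I made up, not any real API):

```python
def critique_and_fix(model, draft, max_rounds=3):
    """Loop: ask the model to critique an answer, then to fix it, until clean."""
    answer = draft
    for _ in range(max_rounds):
        critique = model(f"Find any factual errors in: {answer}")
        if "no errors" in critique.lower():
            break
        answer = model(f"Fix these errors: {critique}\n\n{answer}")
    return answer

# Toy stand-in for a real model, just to show the loop terminating.
def toy_model(prompt):
    if prompt.startswith("Find"):
        return "No errors." if "1969" in prompt else "The year is wrong; it was 1969."
    return prompt.rsplit("\n\n", 1)[-1].replace("1968", "1969")

fixed = critique_and_fix(toy_model, "Apollo 11 landed in 1968.")
print(fixed)  # → Apollo 11 landed in 1969.
```

If the critic catches mistakes ~80% of the time per pass, a few rounds of this compounds into a noticeably lower error rate, which is consistent with the reported o1 gains.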
Aha, thanks, that makes sense.
One way this could happen is searching for jailbreaks in the space of paraphrases and synonyms of a benign prompt.
Why would this produce fake/unlikely jailbreaks? If the paraphrases and such are natural, then isn't the nearness to a real(istic) prompt enough to suggest that the jailbreak found is also realistic? Of course you can adversarially generate super unrealistic things, but does that necessarily happen with paraphrasing-type attacks?
You may recall certain news items last February around Gemini and diversity that wiped many billions off of Google's market cap.
There's a clear financial incentive to make sure that models say things within expected limits.
There's also this: https://www.wired.com/story/air-canada-chatbot-refund-policy/
Really cool project! And the write-up is very clear.
In the section about options for reducing the hit to helpfulness, I was surprised you didn't mention scaling the vector you're adding or subtracting -- did you try different weights? I would expect that you can tune the strength of the intervention by weighting the difference in means vector up or down.
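i.e. something like this (numpy sketch with made-up names: `h` stands in for a hidden activation, `v` for the difference-of-means vector):

```python
import numpy as np

def steer(h, v, alpha):
    """Shift activation h along the (normalized) difference-of-means direction v.

    alpha tunes the strength: 0 = no intervention, negative = subtract.
    """
    v_hat = v / np.linalg.norm(v)
    return h + alpha * v_hat

rng = np.random.default_rng(0)
h = rng.normal(size=8)  # stand-in for a residual-stream activation
v = rng.normal(size=8)  # stand-in for mean(harmful) - mean(harmless)
weak, strong = steer(h, v, 0.5), steer(h, v, 4.0)
```

Sweeping alpha would let you trade off refusal behavior against the helpfulness hit directly.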
The usual reason is compounding. If you have an asset that is growing over time, paying taxes from it means not only do you have less of it now, but the amount you pulled out now won't compound indefinitely into the future. You want to compound growth for as long as possible on as much capital as possible. If you could diversify without paying capital gains you would, but since the choice is something like, get gains on $100 in this one stock, or get gains on $70 in this diversified basket of stocks, you might stay with the concentrated position even if you would prefer to be diversified.
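With made-up numbers (7%/yr growth for 30 years, 30% tax paid up front if you diversify, ignoring taxes at the end):

```python
def compound(principal, rate, years):
    """Value of `principal` growing at `rate` per year for `years` years."""
    return principal * (1 + rate) ** years

concentrated = compound(100, 0.07, 30)  # keep the $100 position
diversified  = compound(70, 0.07, 30)   # pay the tax now, reinvest the $70
print(round(concentrated), round(diversified))  # → 761 533
```

The haircut compounds too, which is why people hang on to concentrated positions they would otherwise prefer to diversify.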
This reminds me of a Brin short story which I think exactly discusses what you're talking about: https://www.davidbrin.com/tankfarm.htm
Cool concept. I'm a bit puzzled by one thing though -- presumably every time you use a tether, it slows down and drops to a lower orbit. How do you handle that? Is the idea that it's so much more massive than the rockets it's boosting that its slowdown is negligible? Or do we have to go spin it back up every so often?
One way to regain energy is to run the tether in reverse - drop something from a faster orbit back into the atmosphere, siphoning off some of its energy along the way. If every time you sent one spacecraft up another was lined up to come back down, that would save a lot of trouble.
But you'll still need to do orbital corrections, offset atmospheric drag, and allow for imbalances, so yeah, it would seem like you still need a pretty beefy means of propulsion on this thing, which is oddly unmentioned for being key to the whole design.
Tethers can theoretically use more efficient propulsion because their thrust requirements are lower. The argon Hall effect thrusters on Starlink satellites have around 7x the specific impulse (fuel efficiency) of Starship engines, while needing ~7x the energy per unit of impulse (since KE = mv^2/2 while impulse = mv) and producing a tiny fraction of the thrust. This energy could come from a giant solar panel rather than the fuel, and every once in a while it could be refueled with a big tanker of liquid argon.
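A quick check of that energy ratio (for a fixed impulse J = mv, the energy is E = mv^2/2 = Jv/2, so energy per unit impulse scales linearly with exhaust velocity; the velocity figure below is illustrative, not an exact engine spec):

```python
def energy_per_impulse(exhaust_velocity):
    # Impulse J = m*v and energy E = m*v^2/2 = J*v/2,
    # so energy per unit of impulse is just v/2.
    return exhaust_velocity / 2

chem_v = 3_300       # m/s, illustrative chemical-rocket exhaust velocity
hall_v = 7 * chem_v  # ~7x the specific impulse
ratio = energy_per_impulse(hall_v) / energy_per_impulse(chem_v)
print(ratio)  # → 7.0
```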
"If you are playing with a player who thinks that "all reds" is a strong hand, it can take you many, many hands to figure out that they're overestimating their hands instead of just getting anomalously lucky with their hidden cards while everyone else folds!"
As you guessed, this is wrong. If someone is playing a lot of hands, your first hypothesis is that they are too loose and making mistakes. At that point, each additional hand they play is evidence in favor of fishiness, and you can quickly become confident that they are bad.
Mistakes in the other direct...
That all sounds right, but I want to invert your setup.
If someone is playing too many hands, your first hypothesis is that they are too loose and making mistakes. If someone folds for 30 minutes, then steals the blinds once, then folds some more, you will have a hard time telling whether they're playing wrong or have had a bad run of cards.
But in either case, it is going to be significantly harder for them to tell, from inside their own still-developing understanding of the game, whether the things that are happening to them are evidence about their own mi...
I wonder if there's a way to give the black-box recommender a different objective function. CTR is bad for the obvious clickbait reasons, but signals for user interaction are still valuable if you can find the right signal to use.
I would propose that returning to the site some time in the future is a better signal of quality than CTR, assuming the future is far enough away. You could try a week, a month, and a quarter.
This is maybe a good time to use reinforcement learning, since the signal is far away from the decision you need to make. When someone interacts with an article, reward the things they interacted with n weeks ago. Combined with karma, I bet that would be a better signal than CTR.
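The delayed-credit-assignment part could be as simple as this sketch (all names and the data shapes are mine):

```python
from collections import defaultdict
from datetime import date, timedelta

def delayed_rewards(impressions, return_visits, delay_weeks=4):
    """Credit each shown article if the user came back within ~delay_weeks.

    impressions: list of (user, article, date shown)
    return_visits: set of (user, date of a later visit)
    """
    window = timedelta(weeks=delay_weeks)
    reward = defaultdict(float)
    for user, article, shown in impressions:
        if any(u == user and shown < d <= shown + window
               for u, d in return_visits):
            reward[article] += 1.0
    return dict(reward)

rewards = delayed_rewards(
    [("alice", "a1", date(2024, 1, 1)), ("bob", "a2", date(2024, 1, 1))],
    {("alice", date(2024, 1, 20))},
)
print(rewards)  # → {'a1': 1.0}
```

You'd then feed these delayed rewards (rather than clicks) into whatever learner ranks the articles.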
Children are evidently next word completers.
I would be very unhappy if a non-disparagement agreement were sprung on me when I left the company. And I would be very reluctant to sign one entering any company.
Luckily we don't have those at Google DeepMind.
I work at DeepMind and have been influenced by METR. :)
If you want a far future fictional treatment of this kind of situation, I recommend Surface Detail by Iain Banks.
I think your model is a bit simplistic. METR has absolutely influenced the behavior of the big labs, including DeepMind. Even if all impact goes through the big labs, you could have more influence outside of the lab than as one of many employees within. Being the head of a regulatory agency that oversees the labs sets policy in a much more direct way than a mid level exec within the company can.
I went back to finish college as an adult, and my main surprise was how much fun it was. It probably depends on what classes you have left, but I took every AI class offered and learned a ton that is still relevant to my work today, 20 years later. Even the general classes were fun -- it turns out it's easy to be an excellent student if you're used to working a full work week, and being a good student is way more pleasant and less stressful than being a bad one, or at least it was for me.
I'm not sure what you should do necessarily, but given that you're t...
This is very well written and compelling. Thanks for posting it!
This is a great post. I knew that at the top end of the income distribution in the US people have more kids, but didn't understand how robust the relationship seems to be.
I think the standard evbio explanation here would ride on status -- people at the top of the tribe can afford to expend more resources for kids, and also have more access to opportunities to have kids. That would predict that we wouldn't see a radical change as everyone got more rich -- the curve would slide right and the top end of the distribution would have more kids but not necessaril...
Heh, that's why I put "strong" in there!
One big one is that the first big spreading event happened at a wet market where people and animals are in close proximity. You could check densely peopled places within some proximity of the lab to figure out how surprising it is that it happened in a wet market, but certainly animal spillover is much more likely where there are animals.
Edit: also it's honestly kind of a bad sign that you aren't aware of evidence that tends against your favored explanation, since that mostly happens during motivated reasoning.
We're here to test the so-called tower of babel theory. What if, due to some bizarre happenstance, humanity had thousands of languages that change all the time instead of a single universal language like all known intelligent species?
You should ignore the EY style "no future" takes when thinking about your future. This is because if the world is about to end, nothing you do will matter much. But if the world isn't about to end, what you do might matter quite a bit -- so you should focus on the latter.
One quick question to ask yourself is: are you more likely to have an impact on technology, or on policy? Either one is useful. (If neither seems great, then consider earning to give, or just find a way to add value in society in other ways.)
Once you figure that out, the next step is almos...
I agree that it's bad to raise a child in an environment of extreme anxiety. Don't do that.
Also try to avoid being very doomy and anxious in general, it's not a healthy state to be in. (Easier said than done, I realize.)
I think you should have a kid if you would have wanted one without recent AI progress. Timelines are still very uncertain, and strong AGI could still be decades away. Parenthood is strongly value creating and extremely rewarding (if hard at times) and that's true in many many worlds.
In fact it's hard to find probable worlds where having kids is a really bad idea, IMO. If we solve alignment and end up in AI utopia, having kids is great! If we don't solve alignment and EY is right about what happens in a fast takeoff world, it doesn't really matter if you ha...
If we don't solve alignment and EY is right about what happens in a fast takeoff world, it doesn't really matter if you have kids or not.
This IMO misses the obvious fact that you spend your life with a lot more anguish if you think that not just you, but your kid is going to die too. I don't have a kid but everyone who does seems to describe a feeling of protectiveness that transcends any standard "I really care about this person" one you could experience with just about anyone else.
Having kids does mean less time to help AI go well, so maybe it’s not so much of a good idea if you’re one of the people doing alignment work.
The thing you're missing is called instruction tuning. You gather a series of prompt/response pairs and fine tune the model over that data. Do it right and you have a chatty model.
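Concretely, the data prep step looks something like this (the chat template and names here are made up; the actual format varies by model):

```python
def format_example(prompt, response, eos="</s>"):
    """Render one prompt/response pair into a single training string.

    The loss is then computed over the response tokens (prompt tokens are
    often masked out), which is what makes the tuned model 'chatty'.
    """
    return f"User: {prompt}\nAssistant: {response}{eos}"

pairs = [
    ("What is 2+2?", "2+2 is 4."),
    ("Name a primary color.", "Red is a primary color."),
]
corpus = [format_example(p, r) for p, r in pairs]
```

Fine-tune a pretrained model on a large corpus of strings like these and it learns to continue "User: ... Assistant:" with a helpful response instead of arbitrary next-word completion.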
Thanks, Zvi, these roundups are always interesting.
I have one small suggestion, which is that you limit yourself to one Patrick link per post. He's an interesting guy but his area is quite niche, and if people want his fun stories about banking systems they can just follow him. I suspect that people who care about those things already follow him, and people who don't aren't that interested to read four items from him here.
I feel like a lot of the issues in this post are that the published RSPs are not very detailed and most of the work to flesh them out is not done. E.g. the comparison to other risk policies highlights lack of detail in various ways.
I think it takes a lot of time and work to build out something with lots of analysis and detail, potentially years of work to really do it right. And yes, much of that work hasn't happened yet.
But I would rather see labs post the work they are doing as they do it, so people can give feedback and input. If labs do so, the framewo...
Thanks for your comment.
I feel like a lot of the issues in this post are that the published RSPs are not very detailed and most of the work to flesh them out is not done.
I strongly disagree with this. In my opinion, a lot of the issue is that RSPs have been thought from first principles without much consideration for everything the risk management field has done, and hence doing wrong stuff without noticing.
It's not a matter of how detailed they are; they get the broad principles wrong. As I argued (the entire table is about this) I think...
I agree with all of this. It's what I meant by "it's up to all of us."
It will be a signal of how things are going if, in a year, we still have only vague policies, or if there has been real progress in operationalizing the safety levels, detection, what the right reactions are, etc.
I think there are two paths, roughly, that RSPs could send us down.
But I also suspect that people on the more cynical side aren't going to be persuaded by a post like this. If you think that companies are pretending to care about safety but really are just racing to make $$, there's probably not much to say at this point other than, let's see what happens next.
This seems wrong to me. We can say all kinds of things, like:
If you think that Anthropic and other labs that adopt these are fundamentally well meaning and trying to do the right thing, you'll assume that we are by default heading down path #1. If you are more cynical about how companies are acting, then #2 may seem more plausible.
I disagree that what you think about a lab's internal motivations should be very relevant here. For any particular lab/government adopting any particular RSP, you can just ask, does having this RSP make it easier or harder to implement future good legislation? My sense is that the answ...
New York City Mayor Eric Adams has been using ElevenLabs AI to create recordings of him in languages he does not speak and using them for robocalls. This seems pretty not great.
Can you say more about why you think this is problematic? Recording his own voice for a robocall is totally fine, so the claim here is that AI involvement makes it bad?
Yes he should disclose somewhere that he's doing this, but deepfakes with the happy participation of the person whose voice is being faked seems like the best possible scenario.
FWIW as an executive working on safety at Google, I basically never consider my normal working activities in light of what they would do to Google's stock price.
The exception is around public communication. There I'm very careful because it's asymmetrical -- I could potentially cause a PR disaster that would affect the stock, but I don't see how I could give a talk that's so good that it helps it.
Maybe a plug pulling situation would be different, but I also think it's basically impossible for it to be a unilateral situation, and if we're in such a moment, I hardly think any damage would be contained to Google's stock price, versus say the market as a whole.
Hmm, that is something that seems pretty likely to change, I think?
I expect safety researchers to be consulted quite a bit on regulations that will affect Google pretty heavily, and e.g. any given high-level safety researcher currently has a decent chance to testify in front of Congress, and like, I would want them to feel comfortable taking actions that definitely would have a large effect on the Google stock price (like saying that Google's AGI program should be shut down completely, or nationalized, or Google should be held liable for some damages caused by its AI systems).
How much do you think that your decisions affect Google's stock price? Yes maybe more AI means a higher price, but on the margin how much will you be pushing that relative to a replacement AI person? And mostly the stock price fluctuates on stuff like how well the ads business is doing, macro factors, and I guess occasionally whether we gave a bad demo.
It feels to me like the incentive is just so diffuse that I wouldn't worry about it much.
Your idea of just donating extra gains also seems fine.
As I said in the dialogue, I think as a safety engineer, especially as someone who might end up close to the literal or metaphorical "stop button", the effect here seems to me to be potentially quite large, especially in aggregate.
That's not correct, or at least not how my Google stock grants work. The price is locked in at grant time, not vest time. In practice what that means is that you get x shares every month, which counts as income when multiplied by the current stock price.
And then you can sell them or whatever, including having a policy that automatically sells them as soon as they vest.
The star ratings are an improvement; I had also felt that "breakthrough" was overselling many of the items last week.
However, stars are very generic and don't capture the concept of a breakthrough very well. You could consider a lightbulb.
I also asked chatgpt to create an emoji of an AI breakthrough, and after some iteration it came up with this: https://photos.app.goo.gl/sW2TnqDEM5FzBLdPA
Use it if you like it!
Thanks for putting together this roundup, I learn things from it every time.
I agree with this.
Consider a hypothetical: there are two drugs we could use to execute prisoners sentenced to death. One of them causes excruciating pain; the other does not, but costs more.
Would we feel that we might as well use the torture drug? After all, the dude is dead afterward, so he doesn't care either way.
I have a pretty strong intuition that those drugs are not similar. Same thing with the anesthesia example.
I work at GDM so obviously take that into account here, but in my internal conversations about external benchmarks we take cheating very seriously -- we don't want eval data to leak into training data, and have multiple lines of defense to keep that from happening. It's not as trivial as you might think to avoid, since papers and blog posts and analyses can sometimes have specific examples from benchmarks in them, unmarked -- and while we do look for this kind of thing, there's no guarantee that we will be perfect at finding them. So it's completely possib...