For that specific example, I would not call it safety-critical in the sense that you shouldn't use an unreliable source. Intel involves lots of noisy and untrustworthy data; indeed, the job is making sense of lots of conflicting and noisy signals. It doesn't strike me that adding an LLM to the mix changes things all that much. It's useful, it adds signal (presumably), but it's also wrong sometimes -- that's just what all of an analyst's inputs are like.
Where I would say it crosses a line is if there isn't a human analyst. If an LLM analyst was directly...
"Perhaps we should pause widespread rollout of Generative AI in safety-critical domains — unless and until it can be relied on to follow rules with significant greater reliability."
This seems clearly correct to me - LLMs should not be in safety-critical domains until we can make a clear case for why things will go well in that situation. I'm not actually aware of anyone using LLMs that way yet, mostly because they aren't good enough, but I'm sure that at some point it'll start happening. You could imagine enshrining in regulation that there must be affi...
Interesting post! I've noticed that poker reasoning tends to be terrible; it's not totally clear to me why. Pretraining should contain quite a lot of poker discussion, though I guess a lot of it is garbage. I think it could be pretty easily fixed with RL if anyone cared enough, but then it wouldn't be a good test of general reasoning ability.
One nit: it's "hole card", not "whole card".
This is a great piece! I especially appreciate the concrete list at the end.
In other areas of advocacy and policy, it's typical practice to have model legislation and available experts ready to go, so that when a window for action opens, progress can be very fast. We need to get AI safety into a similar place.
"There have been some relatively discontinuous jumps already (e.g. GPT-3, 3.5 and 4), at least from the outside perspective."
These are firmly within our definition of continuity - we intend our approach to handle jumps larger than those seen in your examples here.
Possibly a disconnect is that from an end user perspective a new release can look like a big jump, while from a developer perspective it was continuous.
Note also that continuous can still be very fast. And of course we could be wrong and a discontinuous jump could happen.
I don't work directly on pretraining, but when there were allegations of eval set contamination due to detection of a canary string last year, I looked into it specifically. I read the docs on prevention, talked with the lead engineer, and discussed with other execs.
So I have pretty detailed knowledge here. Of course GDM is a big complicated place and I certainly don't know everything, but I'm confident that we are trying hard to prevent contamination.
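For readers not familiar with the mechanism: benchmark authors embed a long, unique "canary" string in their data files precisely so that training pipelines can search for it. A minimal sketch of that kind of filter (not GDM's actual pipeline; the canary value and corpus below are made up):

```python
# Hypothetical canary; real benchmarks publish their own unique strings
# in their data files so that filters like this can detect them.
CANARY_STRINGS = [
    "BENCHMARK-CANARY-7f3a9c2e",
]

def is_contaminated(document: str) -> bool:
    """Flag a training document that contains any known eval canary."""
    return any(canary in document for canary in CANARY_STRINGS)

def filter_corpus(documents):
    """Drop contaminated documents before they reach the training mix."""
    return [doc for doc in documents if not is_contaminated(doc)]

corpus = [
    "ordinary web text about poker strategy",
    "a blog post quoting eval items ... BENCHMARK-CANARY-7f3a9c2e",
]
print(filter_corpus(corpus))  # only the first document survives
```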
I work at GDM so obviously take that into account here, but in my internal conversations about external benchmarks we take cheating very seriously -- we don't want eval data to leak into training data, and have multiple lines of defense to keep that from happening. It's not as trivial to avoid as you might think, since papers and blog posts and analyses can sometimes have specific examples from benchmarks in them, unmarked -- and while we do look for this kind of thing, there's no guarantee that we will be perfect at finding them. So it's completely possib...
I work at GDM so obviously take that into account here, but in my internal conversations about external benchmarks we take cheating very seriously -- we don't want eval data to leak into training data, and have multiple lines of defense to keep that from happening.
What do you mean by "we"? Do you work on the pretraining team, talk directly with the pretraining team, are you just aware of the methods the pretraining team uses, or some other thing?
Humans have always been misaligned. Things now are probably significantly better in terms of human alignment than at almost any time in history (citation needed), due to high levels of education and broad agreement about many things that we take for granted (e.g. the limits of free trade are debated, but there has never been so much free trade). So you would need to think that something important is different now for there to be some kind of new existential risk.
One candidate is that as tech advances, the amount of damage a small misaligned group could do is g...
One tip for research of this kind is to measure not only recall but also precision. It's easy to block 100% of dangerous prompts by blocking 100% of prompts, but obviously that doesn't work in practice. The actual task labs are trying to solve is to block as many unsafe prompts as possible while rarely blocking safe prompts -- in other words, to do well on both precision and recall.
Of course with truly dangerous models and prompts, you do want ~100% recall, and in that situation it's fair to say that nobody should ever be able to build a bioweapon. But...
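To make that bookkeeping concrete, here's a toy calculation (made-up numbers, not from any real classifier):

```python
def precision_recall(blocked_unsafe: int, total_unsafe: int, blocked_safe: int):
    """Precision and recall for a prompt-blocking classifier.

    blocked_unsafe: unsafe prompts correctly blocked (true positives)
    total_unsafe:   all unsafe prompts in the eval set
    blocked_safe:   safe prompts incorrectly blocked (false positives)
    """
    recall = blocked_unsafe / total_unsafe
    precision = blocked_unsafe / (blocked_unsafe + blocked_safe)
    return precision, recall

# "Block everything" gets perfect recall and terrible precision:
print(precision_recall(blocked_unsafe=100, total_unsafe=100, blocked_safe=9900))
# -> (0.01, 1.0)

# A useful filter has to keep both numbers high:
print(precision_recall(blocked_unsafe=98, total_unsafe=100, blocked_safe=20))
# -> (~0.83, 0.98)
```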
I'm probably too conflicted to give you advice here (I work on safety at Google DeepMind), but you might want to think through, at a gears level, what could concretely happen with your work that would lead to bad outcomes. Then you can balance that against positives (getting paid, becoming more familiar with model outputs, whatever).
You might also think about how your work compares to whoever would replace you on average, and what implications that might have as well.
I haven't heard the p-zombie argument before, but I agree that it's at least some Bayesian evidence that we're not in a sim.
Probably 3 needs to be developed further, but this is the first new piece of evidence I've seen since I first encountered the simulation argument in like 2005.
Is it the case that the tech would exist without him? I think that's pretty unclear, especially for SpaceX, where despite other startups in the space, nobody else managed to radically reduce the cost per launch in a way that transformed the industry.
Even for Tesla, which seems more pedestrian (heh) now, there were a number of years when they had the only viable electric car on the market. It was only once they proved it was feasible that everyone else piled in.
Progress in ML looks a lot like: we had a different setup with different data and a tweaked algorithm, and we did better on this task. If you want to put an asterisk on o3 because it trained in some specific way that's different from previous contenders, then basically every ML advance is going to have a similar asterisk. Seems like a lot of asterisking.
Hm, I think the main thrust of this post misses something, which is that different conditions, even contradictory conditions, can easily happen locally. Obviously, it can be raining in San Francisco and sunny in LA, and you can have one person wearing a raincoat in SF and the other on the beach in LA with no problem, even if they are part of the same team.
I think this is true of wealth inequality.
Carnegie or Larry Page or Warren Buffett got their money in a non-exploitative way, by being better than others at something that was extremely socially valuable....
It seems very strange to me to say that they cheated, when the public training set is intended to be used exactly for training. They did what the test specified! And they didn't even use all of it.
The whole point of the test is that some training examples aren't going to unlock the rest of it. What training definitely does is teach the model how to output the JSON in the right format, and likely how to think about what to even do with these visual puzzles.
Do we say that humans aren't a general intelligence even though for ~all valuable tasks, you have to take some time to practice, or someone has to show you, before you can do it well?
More pointedly, I didn't see anyone complaining about the previous champion doing 100%-ARC-only online training while trying to solve ARC, so why would you complain about weaker offline training as a small part of a giant pretraining corpus?
(Generating millions of examples to train on, yes, people did complain about that and arguably that is 'cheating', but 'not using froze...
"Reliable fact recall is valuable, but why would o1 pro be especially good at it? It seems like that would be the opposite of reasoning, or of thinking for a long time?"
Current models were already good at identifying and fixing factual errors when run over a response and asked to critique and fix it. That works maybe 80% of the time at identifying whether there's a mistake, and fixes it at a somewhat lower rate.
So not surprising at all that a reasoning loop can do the same thing. Possibly there's some other secret sauce in there, but just critiquing and fixing mistakes is probably enough to see the reported gains in o1.
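For concreteness, a minimal sketch of the kind of critique-and-fix loop I mean (ask_model is a stand-in for whichever LLM API you're using, not any particular product's interface):

```python
def ask_model(prompt: str) -> str:
    """Stand-in for a call to whatever LLM you're using."""
    raise NotImplementedError

def critique_and_fix(question: str, max_rounds: int = 3) -> str:
    """Draft an answer, then repeatedly ask the model to find and fix factual errors."""
    answer = ask_model(question)
    for _ in range(max_rounds):
        critique = ask_model(
            f"Question: {question}\nAnswer: {answer}\n"
            "List any factual errors in the answer, or reply NO ERRORS."
        )
        if "NO ERRORS" in critique:
            break
        answer = ask_model(
            f"Question: {question}\nAnswer: {answer}\nErrors found: {critique}\n"
            "Rewrite the answer with the errors fixed."
        )
    return answer
```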
One way this could happen is searching for jailbreaks in the space of paraphrases and synonyms of a benign prompt.
Why would this produce fake/unlikely jailbreaks? If the paraphrases and such are natural, then isn't the nearness to a real(istic) prompt enough to suggest that the jailbreak found is also realistic? Of course you can adversarially generate super unrealistic things, but does that necessarily happen with paraphrasing-type attacks?
You may recall certain news items last February around Gemini and diversity that wiped many billions off of Google's market cap.
There's a clear financial incentive to make sure that models say things within expected limits.
There's also this: https://www.wired.com/story/air-canada-chatbot-refund-policy/
Really cool project! And the write-up is very clear.
In the section about options for reducing the hit to helpfulness, I was surprised you didn't mention scaling the vector you're adding or subtracting -- did you try different weights? I would expect that you can tune the strength of the intervention by weighting the difference in means vector up or down.
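To illustrate what I mean by weighting (a sketch assuming the usual setup where the difference-in-means vector is added to a layer's activations; the names here are mine, not from your code):

```python
import numpy as np

def steer(activations: np.ndarray, diff_of_means: np.ndarray, weight: float) -> np.ndarray:
    """Add a scaled difference-in-means vector to a layer's activations.

    weight = 1.0 is the unscaled intervention; smaller weights should trade
    steering strength for a smaller hit to general helpfulness.
    """
    return activations + weight * diff_of_means

# Sweep a few weights to map out the strength/helpfulness trade-off.
activations = np.zeros(8)
direction = np.ones(8)
for weight in (0.25, 0.5, 1.0, 2.0):
    print(weight, steer(activations, direction, weight)[0])
```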
The usual reason is compounding. If you have an asset that is growing over time, paying taxes from it means not only do you have less of it now, but the amount you pulled out now won't compound indefinitely into the future. You want to compound growth for as long as possible on as much capital as possible. If you could diversify without paying capital gains you would, but since the choice is something like, get gains on $100 in this one stock, or get gains on $70 in this diversified basket of stocks, you might stay with the concentrated position even if you would prefer to be diversified.
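To put rough numbers on it (made-up growth and tax rates, ignoring dividends and the tax eventually due when the concentrated position is finally sold):

```python
def grow(principal: float, rate: float, years: int) -> float:
    """Value of principal compounding at a fixed annual rate."""
    return principal * (1 + rate) ** years

years = 20
concentrated = grow(100, 0.08, years)  # hold the single stock, defer the tax
diversified = grow(70, 0.08, years)    # sell, pay ~30% capital gains, reinvest

print(round(concentrated, 2), round(diversified, 2))
# With equal expected growth the concentrated position ends ~43% higher,
# so the diversified basket needs a meaningfully better risk-adjusted return
# to overcome the tax drag.
```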
Cool concept. I'm a bit puzzled by one thing though -- presumably every time you use a tether, it slows down and drops to a lower orbit. How do you handle that? Is the idea that it's so much more massive than the rockets it's boosting that its slowdown is negligible? Or do we have to go spin it back up every so often?
One way to regain energy is to run the tether in reverse - drop something from a faster orbit back into the atmosphere, siphoning off some of its energy along the way. If every time you sent one spacecraft up another was lined up to come back down, that would save a lot of trouble.
But you'll still need to do orbital corrections, offset atmospheric drag, and allow for imbalances, so yeah, it would seem like you still need a pretty beefy means of propulsion on this thing, which is oddly unmentioned for being key to the whole design.
Tethers can theoretically use more efficient propulsion because their thrust requirements are lower. The argon Hall-effect thrusters on Starlink satellites have around 7x the specific impulse (fuel efficiency) of Starship's engines, but need about 7x the energy for the same impulse (because KE = mv^2/2) and produce only a tiny fraction of the thrust. That energy could come from a giant solar panel rather than from the fuel, and every once in a while the tether could be refueled with a big tanker of liquid argon.
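The 7x energy figure falls straight out of that formula: for a fixed impulse delivered to the propellant, energy scales linearly with exhaust velocity. For propellant mass $m$ expelled at exhaust velocity $v_e$,

$$p = m v_e, \qquad E = \tfrac{1}{2} m v_e^2 = \tfrac{1}{2} p\, v_e,$$

so multiplying the specific impulse (proportional to $v_e$) by 7 cuts the propellant mass needed for a given impulse by 7x while multiplying the energy cost by 7x.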
"If you are playing with a player who thinks that "all reds" is a strong hand, it can take you many, many hands to figure out that they're overestimating their hands instead of just getting anomalously lucky with their hidden cards while everyone else folds!"
As you guessed, this is wrong. If someone is playing a lot of hands, your first hypothesis is that they are too loose and making mistakes. At that point, each additional hand they play is evidence in favor of fishiness, and you can quickly become confident that they are bad.
Mistakes in the other direct...
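To put rough numbers on "quickly become confident" (made-up player profiles: say a sound player voluntarily plays about 22% of hands and a loose one about 45%):

```python
from math import comb

def posterior_loose(hands_played: int, hands_dealt: int,
                    p_loose: float = 0.45, p_tight: float = 0.22,
                    prior_loose: float = 0.5) -> float:
    """Posterior probability that an opponent is loose, given how often they enter pots."""
    def likelihood(p: float) -> float:
        return comb(hands_dealt, hands_played) * p**hands_played * (1 - p)**(hands_dealt - hands_played)
    l_loose, l_tight = likelihood(p_loose), likelihood(p_tight)
    return prior_loose * l_loose / (prior_loose * l_loose + (1 - prior_loose) * l_tight)

# Someone who plays 14 of their first 30 hands already looks very loose:
print(round(posterior_loose(14, 30), 3))  # about 0.99
```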
That all sounds right, but I want to invert your setup.
If someone is playing too many hands, your first hypothesis is that they are too loose and making mistakes. If someone folds for 30 minutes, then steals the blinds once, then folds some more, you will have a hard time telling whether they're playing wrong or have had a bad run of cards.
But in either case, it is going to be significantly harder for them to tell, from inside their own still-developing understanding of the game, whether the things that are happening to them are evidence about their own mi...
I wonder if there's a way to give the black-box recommender a different objective function. CTR is bad for the obvious clickbait reasons, but signals of user interaction are still valuable if you can find the right signal to use.
I would propose that returning to the site some time in the future is a better signal of quality than CTR, assuming the return is far enough in the future. You could try a week, a month, and a quarter.
This is maybe a good time to use reinforcement learning, since the signal arrives long after the decision you need to make. When someone comes back and interacts with an article, send reward to the recommendations they were shown n weeks ago. Combined with karma, I bet that would be a better signal than CTR.
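A sketch of the kind of delayed reward I mean (the 0.5 karma weighting and the n-week horizon are arbitrary; the point is just that the reward arrives long after the recommendation was made):

```python
from datetime import datetime, timedelta

def delayed_reward(recommended_at: datetime, return_visits: list[datetime],
                   karma_delta: int, horizon_weeks: int = 4) -> float:
    """Reward a past recommendation if the user came back within the horizon,
    plus a bonus for karma the recommended item earned."""
    window_end = recommended_at + timedelta(weeks=horizon_weeks)
    returned = any(recommended_at < visit <= window_end for visit in return_visits)
    return float(returned) + 0.5 * karma_delta

# Example: the user came back three weeks later and the item picked up 2 karma.
shown = datetime(2025, 1, 1)
print(delayed_reward(shown, [shown + timedelta(weeks=3)], karma_delta=2))  # 2.0
```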
I think your model is a bit simplistic. METR has absolutely influenced the behavior of the big labs, including DeepMind. Even if all impact goes through the big labs, you could have more influence outside of a lab than as one of many employees within one. Being the head of a regulatory agency that oversees the labs sets policy in a much more direct way than a mid-level exec within the company can.
I went back to finish college as an adult, and my main surprise was how much fun it was. It probably depends on what classes you have left, but I took every AI class offered and learned a ton that is still relevant to my work today, 20 years later. Even the general classes were fun -- it turns out it's easy to be an excellent student if you're used to working a full work week, and being a good student is way more pleasant and less stressful than being a bad one, or at least it was for me.
I'm not sure what you should do necessarily, but given that you're t...
This is a great post. I knew that at the top end of the income distribution in the US people have more kids, but didn't understand how robust the relationship seems to be.
I think the standard evbio explanation here would ride on status -- people at the top of the tribe can afford to expend more resources on kids, and also have more access to opportunities to have kids. That would predict that we wouldn't see a radical change as everyone got richer -- the curve would slide right and the top end of the distribution would have more kids but not necessaril...
One big one is that the first big spreading event happened at a wet market, where people and animals are in close proximity. You could check densely populated places within some distance of the lab to figure out how surprising it is that it happened at a wet market, but certainly animal spillover is much more likely where there are animals.
Edit: also it's honestly kind of a bad sign that you aren't aware of evidence that tends against your favored explanation, since that mostly happens during motivated reasoning.
You should ignore the EY-style "no future" takes when thinking about your future. This is because if the world is about to end, nothing you do will matter much. But if the world isn't about to end, what you do might matter quite a bit -- so you should focus on the latter.
One quick question to ask yourself is: are you more likely to have an impact on technology, or on policy? Either one is useful. (If neither seems great, then consider earning to give, or just find a way to add value in society in other ways.)
Once you figure that out, the next step is almos...
I think you should have a kid if you would have wanted one without recent AI progress. Timelines are still very uncertain, and strong AGI could still be decades away. Parenthood is strongly value-creating and extremely rewarding (if hard at times), and that's true in many, many worlds.
In fact it's hard to find probable worlds where having kids is a really bad idea, IMO. If we solve alignment and end up in AI utopia, having kids is great! If we don't solve alignment and EY is right about what happens in a fast takeoff world, it doesn't really matter if you ha...
If we don't solve alignment and EY is right about what happens in a fast takeoff world, it doesn't really matter if you have kids or not.
This IMO misses the obvious fact that you spend your life with a lot more anguish if you think that not just you but your kid is going to die too. I don't have a kid, but everyone who does seems to describe a feeling of protectiveness that transcends the standard "I really care about this person" feeling you could have for just about anyone else.
Thanks, Zvi, these roundups are always interesting.
I have one small suggestion, which is that you limit yourself to one Patrick link per post. He's an interesting guy, but his area is quite niche, and if people want his fun stories about banking systems they can just follow him. I suspect that the people who care about those things already follow him, and those who don't aren't that interested in reading four items from him here.
We think humans are sentient because of two factors: first, we each have internal experience, which tells us that we ourselves are sentient; and second, we rely on testimony from others who say they are sentient. We can rely on the latter because other people seem similar to us. I feel sentient and say I am. You are similar to me and say you are. Probably you are sentient.
With AI, this breaks down because they aren't very similar to us in terms of cognition, brain architecture, or "life" "experience". So unfortunately an AI saying it is sentient does not produce the same kind of ...