eggsyntax

AI safety & alignment researcher


If AI agents could be trusted to generate a better signal/noise ratio by delegation than by working-alongside the AI (where the bottleneck is the human)

They typically can't (currently) do better on their own than they do working alongside a human, but a) a human can delegate many more tasks than they can collaborate on (and can delegate more cheaply to an AI than to another human), and b) though AI agents aren't as good on their own, they're sometimes good enough.

Consider call centers as a central case here. Companies are finding it a profitable tradeoff to replace human call-center workers with AI even if the AI makes more mistakes, as long as it doesn't make too many mistakes.

Okay, my simulation point was admittedly a bit of colorful analogy

Fair enough; if it's not load-bearing for your view, that's fine. I do remain skeptical, and can sketch out why if it's of interest, but feel no particular need to continue.

It's the same reason for why we can't break out of the simulation IRL, except we don't have to face adversarial cognition, so the AI's task is even harder than our task.

This seems like it contains several ungrounded claims. Maybe I'm misreading you? But it seems weight-bearing for your overall argument, so I want to clarify.

  • We may not be in a simulation, in which case not being able to break out is no evidence of the ease of preventing breakout.
  • If we are in a simulation, then we only know that the simulation has been good enough to keep us in it so far, for the very short time since we even considered the possibility that we're in one, with barely any effort put into trying to determine whether we are or how to break out. We might find a way to break out once we put more effort into it, or once science and technology advance a bit further, or once we're a bit smarter than current humans.
  • We have no idea whether we're facing adversarial cognition.
  • If we are in a simulation, it's probably being run by beings much smarter than human, or at least much more advanced (certainly humans aren't anywhere remotely close to being able to simulate an entire universe containing billions of sentient minds). For the analogy to hold, the AI would have to be way below human level, and by hypothesis it's not (since we're talking about AI smart enough to be dangerous). 

Thanks! Quite similar to the Kesselman tags that @gwern uses (reproduced in this comment below), and I'd guess that one is descended from the other. Although the two use somewhat different range cutoffs for each term, because why should anything ever be consistent.

Here are the UK ones in question (for ease of comparison):

Defence Intelligence: Probability Yardstick infographic

I think a more central question would be: do a nontrivial number of people in those parts of Europe work at soul-crushing jobs with horrible bosses? If so, what is it that they would otherwise lack that makes them feel obligated to do so?

Just to make things even more confusing, the main blog post is sometimes comparing o1 and o1-preview, with no mention of o1-mini:

And then in addition to that, some testing is done on 'pre-mitigation' versions and some on 'post-mitigation', and in the important red-teaming tests, it's not at all clear what tests were run on which ('red teamers had access to various snapshots of the model at different stages of training and mitigation maturity'). And confusingly, for jailbreak tests, 'human testers primarily generated jailbreaks against earlier versions of o1-preview and o1-mini, in line with OpenAI’s policies. These jailbreaks were then re-run against o1-preview and GPT-4o'. It's not at all clear to me how the latest versions of o1-preview and o1-mini would do on jailbreaks that were created for them rather than for earlier versions. At worst, OpenAI added mitigations against those specific jailbreaks and then retested, and it's those results that we're seeing.

In reality, every answer provides (at most) one bit of information

Quibble: this is true in expectation but not in reality. Suppose we're playing twenty questions, and my first question is, 'Is it a giraffe?' and you say yes. That's a lot more than one bit of information! It's just that in expectation those many bits are outweighed by the much larger number of cases where it isn't a giraffe and I get much less than one bit of information from your 'no' answer.
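To put rough numbers on that (a purely illustrative sketch; the 1-in-1000 prior on "giraffe" is my assumption, not anything from the post): if the answerer would say yes with probability p, then a "yes" carries −log₂ p bits of surprisal, a "no" carries −log₂(1−p) bits, and the expected information is the binary entropy H(p), which is what's capped at 1 bit:

$$
\begin{aligned}
p &= 0.001 \quad \text{(assumed chance it's a giraffe, purely for illustration)}\\
I(\text{yes}) &= -\log_2 p \approx 10 \text{ bits}\\
I(\text{no}) &= -\log_2(1-p) \approx 0.0014 \text{ bits}\\
\mathbb{E}[I] = H(p) &= -p\log_2 p - (1-p)\log_2(1-p) \approx 0.011 \text{ bits} \le 1 \text{ bit}
\end{aligned}
$$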

Do you happen to have evidence that they used process supervision? I've definitely heard that rumored, but haven't seen it confirmed anywhere that I can recall.

when COTs are sampled from these language models in OOD domains, misgeneralization is expected.

Offhand, it seems like if they didn't manage to train that out, it would result in pretty bad behavior on non-STEM benchmarks, because (I'm guessing) output conditional on a bad CoT would be worse than output with no CoT at all. They showed less improvement on non-STEM benchmarks than on math/coding/reasoning, but certainly not a net drop. Thoughts? I'm not confident in my guess there.

I'm a little concerned that even if they find very problematic behaviour, they will blame it on clearly expected misgeneralization and therefore no significant steps will be taken, especially because there is no reputational damage

That's a really good point. As long as benchmark scores are going up, there's not a lot of incentive to care about whatever random stuff the model says, especially if the CoT is sometimes illegible anyway. Now I'm really curious about whether red-teamers got access to the unfiltered CoT at all.

Oh, that's an interesting thought, I hadn't considered that. Using different models seems like it would complicate the training process considerably. But different heads/MoE seems like it might be a good strategy that would naturally emerge during training. Great point, thanks.

'...after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users. We acknowledge this decision has disadvantages. We strive to partially make up for it by teaching the model to reproduce any useful ideas from the chain of thought in the answer. For the o1 model series we show a model-generated summary of the chain of thought.'

(from 'Hiding the Chains of Thought' in their main post)
