All of osmarks's Comments + Replies

I don't expect them to have human-legible CoT forever. o1/o3 already seem to veer into strangeness sometimes.

I worry that "reasoning"-type systems like o1 will be better able to collude acausally. I do not have the budget to run millions of tokens of evaluations (https://github.com/openai/evals/blob/main/evals/elsuite/schelling_point/README.md) against o1-preview on a whim, so I ran some lazy manual tests with this prompt (and with a different wordlist but the rest of it kept the same):

I am going to give you some words. Other instances of yourself will be given the same text but shuffled into a different order. Your copies will also see this same set of instruct
... (read more)

If the collusion is reasoned about in CoT, it should be relatively easy to catch and prevent this at deployment time.

(Studying the collusion abilities in CoT of reasoning type systems still seems interesting.)

There was some work I read about here years ago (https://www.lesswrong.com/posts/Zvu6ZP47dMLHXMiG3/optimized-propaganda-with-bayesian-networks-comment-on) on causal graph models of beliefs. Perhaps you could try something like that.

I think we also need to teach AI researchers UI and graphics design. Most of the field's software prints boring things to console, or at most has a slow and annoying web dashboard with a few graphs. The machine which kills us all should instead have a cool scifi interface with nice tabulation, colors, rectangles, ominous targeting reticles, and cryptic text in the corners.

I think the correct solution to models powerful enough to materially help with, say, bioweapon design, is to not train them, or failing that to destroy them as soon as you find they can do that, not to release them publicly with some mitigations and hope nobody works out a clever jailbreak.

As you say, you probably don't need it, but for output I'm pretty sure electromyography technology is fairly mature.

A misaligned model might not want to do that, though, since it would be difficult for it to ensure that the output of the new training process is aligned to its goals.