Align it
My reply to both your and @Chris_Leong's comments is that you should simply use robust benchmarks on which high performance is interesting.
In the adversarial attack context, the attacker's objectives are not generally beyond the model's "capabilities."
A long time ago, I worked on an idea I called "Context Gems." A friend asked me to make a short, publicly accessible writeup for it. This is a really rough writeup of the idea, based on some old notes I had lying around.
Context Gems was an attempt at coming up with a theoretical way of safely eliciting superhuman capabilities from an AI. It was a fairly direct off-shoot of e.g. OSNR. I later heard it was similar to some ideas Paul Christiano had a long time ago, like imitative generalization/learning the prior.
The goal is to get effective cognitive labor out of a superintelligent but possibly misaligned AI model without causing catastrophe. One insight is that, for a given goal, we want to achieve it while receiving a minimum amount of input/information from the superintelligent model. (Another way to phrase this is that we want to maximize the density of "helpful" information in the model's outputs, to reduce its opportunities for hijacking the human operators.)
Our setup is to train/use a strong (possibly superhuman) model to generate short strings ("context gems"). We intend these strings to contain some insight about the problem domain. To verify the usefulness of these short strings, we feed them into the context of a weaker model, which we can assess on some benchmarks. We may "reward" the strong model based on the performance of the weak model on some problem domain.
I had Claude make me a flow-chart.
If the gems do dramatically improve the performance of the weak model, then perhaps they contain some true insight about the problem domain. Then, if the strings are sufficiently short, we might be able to expose the string to humans without causing catastrophe, hopefully resulting in similar performance improvements for humans.
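To make the reward signal concrete, here is a minimal sketch of how a single gem might be scored. Everything in it is an assumption for illustration rather than part of the original notes: the names `gem_reward` and `toy_weak_model`, the character budget `max_gem_chars`, the choice to reward the accuracy gain over an empty context, and the toy addition benchmark.

```python
from typing import Callable, Sequence, Tuple

# A weak model is modeled as a function (context, question) -> answer.
WeakModel = Callable[[str, str], str]
Benchmark = Sequence[Tuple[str, str]]  # (question, expected answer) pairs


def gem_reward(gem: str, weak_model: WeakModel, benchmark: Benchmark,
               max_gem_chars: int = 200) -> float:
    """Score a gem by how much it improves the weak model's accuracy."""
    # Assumed policy: gems over the length budget earn nothing, since the
    # whole point is to keep the strong model's output short.
    if len(gem) > max_gem_chars:
        return 0.0

    def accuracy(context: str) -> float:
        correct = sum(weak_model(context, q) == a for q, a in benchmark)
        return correct / len(benchmark)

    # Reward is the gain over the weak model's baseline with no gem.
    return accuracy(gem) - accuracy("")


def toy_weak_model(context: str, question: str) -> str:
    """Hypothetical stand-in: only adds when the context tells it to."""
    a, b = question.split("+")
    if "add" in context:
        return str(int(a) + int(b))
    return a  # otherwise it just echoes the first operand


BENCHMARK = [("1+2", "3"), ("4+5", "9"), ("0+7", "7")]

helpful = gem_reward("add the two numbers", toy_weak_model, BENCHMARK)
unhelpful = gem_reward("multiply", toy_weak_model, BENCHMARK)
too_long = gem_reward("add " * 100, toy_weak_model, BENCHMARK)
```

In a real version, the weak model would be an actual language model and the reward would feed into training the strong model; the hard length cap here is just one simple way to express the "minimum amount of information" constraint.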
Some things we might think about:
Thanks for the feedback!
I agree that it is possible to learn quickly without mentorship. However, I believe that for most programmers, the first "real" programming job is a source of tremendous growth. Why not have that earlier, and save more of one's youth?
Conventional advice directed at young people seems shockingly bad. I sat down to generate a list of anti-advice.
The anti-advice consists of things that I wish I had been told in high school, but that are essentially negations of conventional advice.
You may not agree with the advice given here. In fact, it is deliberately controversial. It may also not be good advice. YMMV.
Thanks for the post!
The problem was that I wasn’t really suited for mechanistic interpretability research.
Sorry if I'm prodding too deep, and feel no need to respond. I always feel a bit curious about claims such as this.
I guess I have two questions (which you don't need to answer):
Hi, do you have links to the papers/evidence?
Strong upvoted.
I think we should be wary of anchoring too hard on compelling stories/narratives.
However, as far as stories go, this vignette scores very highly for me. Will be coming back for a re-read.
but a market with a probability of 17% implies that 83% of people disagree with you
Is this a typo?