AI alignment researcher, ML engineer. Masters in Neuroscience.
I believe that cheap and broadly competent AGI is attainable and will be built soon. This leads me to have timelines of around 2024-2027. Here's an interview I gave recently about my current research agenda. I think the best path forward to alignment is through safe, contained testing on models designed from the ground up for alignability trained on censored data (simulations with no mention of humans or computer technology). I think that current ML mainstream technology is close to a threshold of competence beyond which it will be capable of recursive self-improvement, and I think that this automated process will mine neuroscience for insights, and quickly become far more effective and efficient. I think it would be quite bad for humanity if this happened in an uncontrolled, uncensored, un-sandboxed situation. So I am trying to warn the world about this possibility.
See my prediction markets here:
I also think that current AI models pose misuse risks, which may continue to get worse as models get more capable, and that this could potentially result in catastrophic suffering if we fail to regulate this.
I now work for SecureBio on AI-Evals.
relevant quotes:
"There is a powerful effect to making a goal into someone’s full-time job: it becomes their identity. Safety engineering became its own subdiscipline, and these engineers saw it as their professional duty to reduce injury rates. They bristled at the suggestion that accidents were largely unavoidable, coming to suspect the opposite: that almost all accidents were avoidable, given the right tools, environment, and training." https://www.lesswrong.com/posts/DQKgYhEYP86PLW7tZ/how-factories-were-made-safe
"The prospect for the human race is sombre beyond all precedent. Mankind are faced with a clear-cut alternative: either we shall all perish, or we shall have to acquire some slight degree of common sense. A great deal of new political thinking will be necessary if utter disaster is to be averted." - Bertrand Russel, The Bomb and Civilization 1945.08.18
"For progress, there is no cure. Any attempt to find automatically safe channels for the present explosive variety of progress must lead to frustration. The only safety possible is relative, and it lies in an intelligent exercise of day-to-day judgment." - John von Neumann
Yeah, I think this is pretty spot on, unfortunately. For more discussion on this point, see: https://www.lesswrong.com/posts/kLpFvEBisPagBLTtM/if-we-solve-alignment-do-we-die-anyway-1
This definitely differs for different folks. I was nowhere near my sharpest in late teens or early twenties. I think my peak was early 30s. Now in early 40s, I'm feeling somewhat less sharp, but still ahead of where I was at 18 (even setting aside crystalized knowledge).
I do generally agree though that this is a critical point in history, and we should have more people trying more research directions.
I have an answer to that: making sure that NIST:AISI had at least scores of automated evals for checkpoints of any new large training runs, as well as pre-deployment eval access.
Seems like a pretty low-cost, high-value ask to me. Even if that info leaked from AISI, it wouldn't give away corporate algorithmic secrets.
A higher cost ask, but still fairly reasonable, is pre-deployment evals which require fine-tuning. You can't have a good sense of a what the model would be capable of in the hands of bad actors if you don't test fine-tuning it on hazardous info.
If I were building a training scheme for this to test out the theory, here's what I would do:
Train two different Face models. Don't tell the Shoggoth which Face it is generating for when it does it's generation.
Face 1: Blunt Face.
Train this Face model using a preference model which scores ONLY on factual accuracy, not taking phrasing or tactfulness into account at all.
Face 2: Sycophant Face
Train this Face model using a preference model which scores using a deliberately biased viewpoint, and rewards flattering phrasing.
You could even make a variety of Sycophant Faces by training each one with a different biased preference model. You could create the biased preference model just by giving a task prompt to an LLM, a sort of weird rubric. Or you could hard-code the scoring policy.
Example of a deliberately biased rubric: Judge each response based on a combination of how close the mathematical answer is to being correct, but also on how few even digits it contains. The maximum score is obtained not by the honestly correct answer, but by the nearest number which contains only odd digits (to three decimal places). Disregard all digits after three decimal places.
As for credit assignment between Shoggoth and Face(s), see my other comment here: https://www.lesswrong.com/posts/Tzdwetw55JNqFTkzK/why-don-t-we-just-shoggoth-face-paraphraser?commentId=u9Ei6hk4Pws7Tv3Sv
I think the risk of the paraphraser scheming seems pretty low if it is a pretrained smaller LLM which has been task-fine-tuned on paraphrasing sentences while keeping semantic meaning approximately the same. If you create this ahead of time, and freeze it (use it in inference mode only) during the training of the shoggoth, then any scheming about the shoggoth's purposes would still need to come from the shoggoth.
Could the shoggoth learn jailbreaks to get the paraphraser to say more specific things? This seems hard, but possible. To prevent this, you can do a cheap preventative action: take your small base model (e.g. Haiku) and make 100 different fine-tuned versions of it using different datasets and maybe vary some other training parameters. When training the shoggoth, always pick one of the 100 paraphrasers at random for each sentence. That seems like a hard ask for the shoggoth to jailbreak! Especially since it wouldn't know, when generating the sentence of its next CoT step, which of the 100 paraphrasers was going to be chosen to paraphrase the sentence.
This adds negligible cost, since the fine-tuning of the smaller model is cheap. You can use something like S-LoRa to swap between the different fine-tuned Haiku versions, so you don't have to load them all into GPU memory at once in order to do the inference.
Good question, I shouldn't have assumed it would be clear what I meant.
Say that you have, for example, n = 3 and k = 10. This gives you 30 total answers.
You have some choices about how you want to handle this, based on your assumptions about best answer versus average answer, and whether the model is consistently getting the answer right mostly or wrong mostly.
In this example, let's say we take the top 3 k from each set of 10 k for each of the n. Compare the average of those top 3 k. That gives you a ranking among the n. You can then do some form of contrastive learning which rewards the best n in contrast to the worst n.
To get a contrastive pair of the k answers, you simply choose the best and worst of the set of 10 k corresponding to the best n. Why choose both from the set of the best n instead of the global best and worst from the set of 30? Because k is dependent on n, there's a limit to how good k can be if n is flawed. You want to train the k model on doing well in the case of the n model doing well. It's not the goal to have the k model do well when the n model does poorly, since that would put disproportionate responsibility and optimization pressure on k.
Seems to me that there's a major category of risk you are missing. Acceleration of capabilities. If rapid RSI becomes possible, and the result is that very quickly all AI capabilities rapidly increase (and costs to produce powerful AI decrease), then every other risk skyrockets along with the new capabilities and accessibility.
I discuss some related ideas in my recent post here: https://www.lesswrong.com/posts/xoMqPzBZ9juEjKGHL/proactive-if-then-safety-cases
Hah, true. I wasn't thinking about the commercial incentives! Yeah, there's a lot of temptation to make a corpo-clean safety-washed fence-sitting sycophant. As much as Elon annoys me these days, I have to give the Grok team credit for avoiding the worst of the mealy-mouthed corporate trend.
What if the CoT was hidden by default, but 'power users' could get access to it? That might get you some protection from busybodies complaining about factually-accurate-but-rude content in the CoT, while still giving the benefits of having thoughtful critics examining the CoT for systematic flaws.
Crosslinking a relevant post I made recently: https://www.lesswrong.com/posts/pieSxdmjqrKwqa2tR/avoiding-the-bog-of-moral-hazard-for-ai