lovetheusers — LessWrong

The cornerstone of neural algorithmic reasoning is the ability to solve algorithmic tasks, especially in a way that generalises out of distribution. While recent years have seen a surge in methodological improvements in this area, they mostly focused on building specialist models. Specialist models are capable of learning to neurally execute either only one algorithm or a collection of algorithms with identical control-flow backbone. Here, instead, we focus on constructing a generalist neural algorithmic learner -- a single graph neural network processor capable of learning to execute a wide

lovetheusers3y

AGI Ruin: A List of Lethalities

When you explicitly optimize against a detector of unaligned thoughts, you're partially optimizing for more aligned thoughts, and partially optimizing for unaligned thoughts that are harder to detect.

This is correct, and I believe the answer is to optimize for detecting aligned thoughts.

lovetheusers3y

AGI Ruin: A List of Lethalities

Human raters make systematic errors - regular, compactly describable, predictable errors.

This implies it is possible to rate better, using another set of human or automated raters. If the errors are predictable, you could train a model to predict them by comparing rater answers against a heavily scrutinized ground truth. You could then add this model's predicted error to the rater's answer and get a corrected label.
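As a rough illustration of the idea above, here is a minimal sketch on synthetic data. Everything in it is an assumption made for the example: the single observable feature, the linear form of the rater's systematic error, the noise levels, and the size of the audited subset. It only shows that an error model fitted on a small audited set can correct labels outside that set when the error really is compactly describable.

```python
import random
from statistics import fmean

random.seed(0)


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data: one observable feature per item and a true score.
n_items = 2000
features = [random.gauss(0, 1) for _ in range(n_items)]
truth = [2.0 * f + random.gauss(0, 1) for f in features]

# Assumed rater: a systematic error that is linear in the feature,
# plus a small amount of unpredictable noise.
rater = [t + 0.8 * f + 0.3 + random.gauss(0, 0.1)
         for t, f in zip(truth, features)]

# Only the first n_audited items have heavily scrutinized ground truth.
n_audited = 200
audited_errors = [truth[i] - rater[i] for i in range(n_audited)]
slope, intercept = fit_line(features[:n_audited], audited_errors)

# Add the predicted error to the rater's answer on the unaudited items.
corrected = [rater[i] + slope * features[i] + intercept
             for i in range(n_audited, n_items)]

# The held-out truth is used here only to evaluate the correction.
held_out_truth = truth[n_audited:]
raw_mae = fmean([abs(r - t) for r, t in zip(rater[n_audited:], held_out_truth)])
corrected_mae = fmean([abs(c - t) for c, t in zip(corrected, held_out_truth)])

print(f"raw rater MAE:       {raw_mae:.3f}")
print(f"corrected label MAE: {corrected_mae:.3f}")
```

In this toy setup the correction removes the predictable part of the error and leaves only the rater's random noise. It says nothing about cases where the ground truth itself is produced by raters who share the same systematic errors.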

lovetheusers3y

AGI Ruin: A List of Lethalities

Many alignment problems of superintelligence will not naturally appear at pre-dangerous, passively-safe levels of capability.

Modern language models are not aligned. Anthropic's HH is the closest thing available, and I'm not sure anyone else has had a chance to test it for weaknesses or misalignment. (OpenAI's Instruct RLHF models are deceptively misaligned, and have become more misaligned over time. They fail to faithfully give the right answer, and instead say something that resembles the training objective, usually something bland and "reasonable.")

lovetheusers3y

Does SGD Produce Deceptive Alignment?

For example, a model trained on the base objective "imitate what humans would say" might do nearly as well if it had the proxy objective "say something humans find reasonable." There are very few situations in which humans would find reasonable something they wouldn't say or vice-versa, so the marginal benefit of aligning the proxy objective with the base objective is quite small.

For zero-shot tasks, this is the problem that text-davinci-002 faces, and text-davinci-001 to a lesser extent. I believe they are deceptively aligned. davinci-instruct-beta does not face this problem.

For example, when text-davinci-002 is asked zero-shot to make an analogy between two things, it will often plainly explain each of them instead.