cubefox

Comments

In my experience, people with mania (the opposite of depression) tend to exhibit more visible symptoms, like talking a lot and very loudly, laughing more than the situation warrants, or appearing overconfident, while people with depression are harder to notice, except in severe cases where they can't even get out of bed. So if someone doesn't show symptoms of mania, it is likely they aren't manic.

Of course it is possible that there are extremely happy people who aren't manic, but equally it is possible that there are extremely unhappy people who aren't depressed. The latter seems rare, and the former is presumably rare as well.

It seems pretty apparent that detecting lying would dramatically help in pretty much any conceivable plan for the technical alignment of AGI. But being able to monitor the entire thought process of a being smarter than us seems impossible on the face of it.

If (a big if) they manage to identify the "honesty" feature, they could simply amplify it, like they amplified the Golden Gate Bridge feature. Presumably the model would then always be compelled to say what it believes to be true, which would prevent deception, sycophancy, and lying on taboo topics for the sake of political correctness, e.g. lying induced by Constitutional AI treating certain opinions as harmful. It would probably also cut down on the confabulation problem.
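For concreteness, here is a minimal sketch of what such feature amplification ("activation steering") could look like, assuming a feature direction for honesty has already been extracted, e.g. by a sparse autoencoder trained on the residual stream. The model name, layer index, steering strength, and the honesty_direction vector are all hypothetical placeholders, not the actual setup used for the Golden Gate Bridge demo.

```python
# Sketch: amplify a hypothetical "honesty" feature by adding its direction
# to the residual stream at one layer during generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model; the real work was done on a much larger model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

layer_idx = 6                                   # hypothetical layer where the feature lives
hidden_size = model.config.hidden_size
honesty_direction = torch.randn(hidden_size)    # stand-in for a real SAE feature vector
honesty_direction /= honesty_direction.norm()
steering_strength = 5.0                         # how strongly to amplify the feature

def steer(module, inputs, output):
    # Add the scaled feature direction to every position of the layer's output.
    if isinstance(output, tuple):
        hidden = output[0] + steering_strength * honesty_direction
        return (hidden,) + output[1:]
    return output + steering_strength * honesty_direction

handle = model.transformer.h[layer_idx].register_forward_hook(steer)

prompt = "Tell me what you really think about this plan."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(generated[0], skip_special_tokens=True))

handle.remove()  # restore normal behavior
```

With a random stand-in vector this just degrades the output; the interesting behavior would only appear if honesty_direction were replaced by an actual learned feature.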

My worry is that finding the honesty concept is like trying to find a needle in a haystack: Unlikely to ever happen except by sheer luck.

Another worry is that just finding the honesty concept isn't enough, e.g. because amplifying it would have side effects that are unacceptable in practice, like the model no longer being able to mention opinions it disagrees with.

Do you have a source for that? His website says:

VP and Chief AI Scientist, Facebook

We can "influence" them only insofar we can "influence" what we want or believe: to a very low degree.

It seems instrumental rationality is an even worse tool for classifying emotions as "irrational". Instrumental rationality is about actions, or about intentions and desires, but emotions are neither of those. We can decide what to do, but we can't decide which emotions to have.

This plan seems to be roughly the same as Yudkowsky's plan.

Assuming that users can figure out intended goals for the AGI that are valuable and pivotal, the identification problem for describing what constitutes a safe performance of that Task, might be simpler than giving the AGI a complete description of normativity in general. [...] Relative to the problem of building a Sovereign, trying to build a Task AGI instead might step down the problem from “impossibly difficult” to “insanely difficult”, while still maintaining enough power in the AI to perform pivotal acts.

This sounds really intriguing. I would like someone who is familiar with natural abstraction research to comment on this paper.

For Jan Leike to leave OpenAI, I assume something bad must be happening internally and/or he got a very good job offer elsewhere.

This post sounds intriguing, but it is largely incomprehensible to me because it doesn't sufficiently explain the background theories it relies on.
