cubefox

Comments

In my experience, people with mania (the opposite of depression) tend to exhibit more visible symptoms, like talking a lot and very loudly, laughing more than the situation warrants, or appearing overconfident, while people with depression are harder to notice, except in severe cases where they can't even get out of bed. So if someone doesn't show symptoms of mania, it is likely they aren't manic.

Of course it is possible that there are extremely happy people who aren't manic, but equally it is possible that there are extremely unhappy people who aren't depressed. The latter seems rare, and the former is presumably rare as well.

It seems pretty apparent that detecting lying would dramatically help in pretty much any conceivable plan for the technical alignment of AGI. But being able to monitor the entire thought process of a being smarter than us seems impossible on the face of it.

If (a big if) they manage to identify the "honesty" feature, they could simply amplify it, like they amplified the Golden Gate Bridge feature. Presumably the model would then always be compelled to say what it believes to be true, which would prevent deception, sycophancy, and lying on taboo topics for the sake of political correctness, e.g. lying induced by Constitutional AI treating certain opinions as harmful. It would probably also cut down on the confabulation problem.
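For concreteness, here is a minimal sketch of what such feature amplification ("activation steering") could look like, assuming a feature direction for honesty has already been extracted, e.g. by a sparse autoencoder trained on the residual stream. The model name, layer index, steering strength, and the honesty_direction vector are all hypothetical placeholders, not the actual setup used for the Golden Gate Bridge demo.

```python
# Sketch: amplify a hypothetical "honesty" feature by adding its direction
# to the residual stream at one layer during generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model; the real work was done on a much larger model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

layer_idx = 6                                   # hypothetical layer where the feature lives
hidden_size = model.config.hidden_size
honesty_direction = torch.randn(hidden_size)    # stand-in for a real SAE feature vector
honesty_direction /= honesty_direction.norm()
steering_strength = 5.0                         # how strongly to amplify the feature

def steer(module, inputs, output):
    # Add the scaled feature direction to every position of the layer's output.
    if isinstance(output, tuple):
        hidden = output[0] + steering_strength * honesty_direction
        return (hidden,) + output[1:]
    return output + steering_strength * honesty_direction

handle = model.transformer.h[layer_idx].register_forward_hook(steer)

prompt = "Tell me what you really think about this plan."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(generated[0], skip_special_tokens=True))

handle.remove()  # restore normal behavior
```

With a random stand-in vector this just degrades the output; the interesting behavior would only appear if honesty_direction were replaced by an actual learned feature.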

My worry is that finding the honesty concept is like trying to find a needle in a haystack: Unlikely to ever happen except by sheer luck.

Another worry is that just finding the honesty concept isn't enough, e.g. because amplifying it would have side effects that are unacceptable in practice, like the model no longer being able to mention opinions it disagrees with.

Do you have a source for that? His website says:

VP and Chief AI Scientist, Facebook

We can "influence" them only insofar we can "influence" what we want or believe: to a very low degree.

It seems instrumental rationality is an even worse tool for classifying emotions as "irrational". Instrumental rationality is about actions, or about intentions and desires, but emotions are neither of those. We can decide what to do, but we can't decide which emotions to have.

This plan seems to be roughly the same as Yudkowsky's plan.

Assuming that users can figure out intended goals for the AGI that are valuable and pivotal, the identification problem for describing what constitutes a safe performance of that Task, might be simpler than giving the AGI a complete description of normativity in general. [...] Relative to the problem of building a Sovereign, trying to build a Task AGI instead might step down the problem from “impossibly difficult” to “insanely difficult”, while still maintaining enough power in the AI to perform pivotal acts.

This sounds really intriguing. I would like someone who is familiar with natural abstraction research to comment on this paper.

For Jan Leike to leave OpenAI, I assume something bad must be happening internally and/or he got a very good job offer elsewhere.

This post sounds intriguing, but it is largely incomprehensible to me because it doesn't sufficiently explain the background theories it relies on.
