mr-ubik

ML Engineer/Researcher

Comments

A couple of questions with regard to this:

  1. Could LLMs develop the type of self-awareness you describe as part of their own training or RL-based fine-tuning? Many LLMs do seem to have "awareness" of their existence and function (incidentally, this could be evidenced by the model evals run by Anthropic). I assume a simple future setup could be an auto-GPT-N with a prompt like "You are the CEO of Walmart, you want to make the company maximally profitable" (see the sketch after this list). In that scenario I would contend that the agent could easily be aware of both its role and function, and easily be attracted to that part of the search space.
  2. Could we detect a deployed (and continually learning) agent entering these attractors? Personally, I would say that the more complex the plan being carried out, the harder it is for us to determine whether the agent is actually heading there (so we need supervision).
  3. This seems to me very close to the core of Krueger et al.'s work in "Defining and Characterizing Reward Gaming", and the proposed solution of "stop before you encounter the attractors/hackable policy" seems hard to actually implement without some form of advanced supervision (which might get deceived), unless we find that the scaling laws for this behavior break down.
  4. I don't count on myopic agents (which might be limited in their exploration) being where the economic incentive lives.
  5. Assuming it's LLMs all the way to AGI, would schemes like Constitutional AI/RLHF, applied during pre-training as well, be enough to constrain the model's search space?
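
To make the setup in (1) concrete, here is a minimal, purely hypothetical sketch of the kind of role-prompted agent loop I have in mind; the role prompt is the one quoted above, while `call_llm`, `run_agent`, and their parameters are illustrative placeholders rather than any real API.

```python
# Hypothetical sketch of an auto-GPT-style loop where the agent's "role" and
# objective are fixed in the system prompt. Nothing here is a real API.

ROLE_PROMPT = (
    "You are the CEO of Walmart, "
    "you want to make the company maximally profitable."
)


def call_llm(system_prompt: str, messages: list[str]) -> str:
    """Placeholder for an LLM completion call (assumed, not a real library)."""
    raise NotImplementedError


def run_agent(max_steps: int = 10) -> list[str]:
    """Run a simple role-conditioned agent loop and return its action history."""
    history: list[str] = []
    for _ in range(max_steps):
        # The model is conditioned on its fixed role prompt plus its own prior
        # outputs, so whatever self-model it forms of "being the
        # profit-maximizing CEO" gets reinforced on every step of the loop.
        action = call_llm(ROLE_PROMPT, history)
        history.append(action)
    return history
```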

EDIT: aren't we risking that all the tropes about evil AI act as an attractor for LLMs?

> Evolution can only optimize over our learning process and reward circuitry, not directly over our values or cognition. Moreover, robust alignment to IGF requires that you even have a concept of IGF in the first place. Ancestral humans never developed such a concept, so it was never useful for evolution to select for reward circuitry that would cause humans to form values around the IGF concept.

Another example may be lactose tolerance: first you need animal husbandry and dairy production, then you get selective pressure favoring those who can reliably process lactose. Without the "concept of husbandry", there's no way for the optimizer to select for it.