erhora

Message

erhora has not written any posts yet.

Replying to LLMs can learn about themselves by introspection

erhora, 1y

This is a super interesting line of work!

"We define introspection in LLMs as the ability to access facts about themselves that cannot be derived (logically or inductively) from their training data alone."

The entire model is, in a sense, "logically derived" from its training data, so any facts about its output on certain prompts can also be logically derived from its training data.

Why did you choose to make non-derivability part of your definition? Do you mean something like "cannot be derived quickly, for example without training a whole new model"? I'm worried that your current definition is impossible to satisfy, and that you are setting yourself up for easy criticism because it sounds like you're hypothesising strong emergence, i.e. magic.

Replying to Near-mode thinking on AI

erhora, 2y

You can think of a pipeline like:
feed lots of good papers in [situational awareness / out-of-context reasoning / ...] into GPT-4's context window,
ask it to generate 100 follow-up research ideas,
ask it to develop specific experiments to run for each of those ideas,
feed those experiments to GPT-4 copies equipped with a coding environment,
write the results up in a nice little article and send it to a human.
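
For concreteness, the quoted pipeline could be sketched roughly as below. This is only an illustrative outline, not anything from the original post: `llm` and `run_experiment` are hypothetical callables standing in for a model call and a coding-environment run, and the prompt wording is made up.

```python
from typing import Callable, List

# Hypothetical stand-in for a model call: takes a prompt, returns text.
LLM = Callable[[str], str]


def run_pipeline(
    papers: List[str],
    llm: LLM,
    run_experiment: Callable[[str], str],  # hypothetical: executes an experiment spec
    n_ideas: int = 100,
) -> str:
    """Papers -> ideas -> experiments -> results -> short article for a human."""
    # Step 1: put the papers into the model's context.
    context = "\n\n".join(papers)

    # Step 2: ask for follow-up research ideas, one per line.
    ideas_text = llm(
        f"{context}\n\nList {n_ideas} follow-up research ideas, one per line."
    )
    ideas = [line.strip() for line in ideas_text.splitlines() if line.strip()]
    ideas = ideas[:n_ideas]

    results = []
    for idea in ideas:
        # Step 3: turn each idea into a specific experiment.
        experiment = llm(f"Design a specific experiment for this idea: {idea}")
        # Step 4: hand the experiment to a worker with a coding environment.
        outcome = run_experiment(experiment)
        results.append(
            f"Idea: {idea}\nExperiment: {experiment}\nResult: {outcome}"
        )

    # Step 5: write everything up for a human reader.
    return llm(
        "Write a short article summarising these results:\n\n"
        + "\n\n".join(results)
    )
```

Nothing in this sketch is specific to alignment research: swapping the input papers changes the research area and leaves the rest untouched.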

It is obvious, but perhaps worth reminding ourselves, that this is a recipe for automating or speeding up AI research in general, so it would be a neutral update at best for AI safety if it worked.

It does seem that for automation to have a disproportionately large impact for AI alignment it would have...

LESSWRONG