martinkunev

"at the very beginning of the reinforcement learning stage... it’s very unlikely to be deceptively aligned"

I think this is quite a strong claim (hence, I linked that article indicating that for sufficiently capable models, RL may not be required to get situational awareness).

Nothing in the optimization process forces the AI to map the string "shutdown" contained in questions to the ontological concept of a switch turning off the AI. The simplest generalization from RL on questions containing the string "shutdown" is (arguably) for the agent to learn certain behavior for question answering - e.g. the AI learns that saying certain things out loud is undesirable (instead of learning that caring about the turn-off switch is undesirable). People would likely disagree on what counts as manipulating shutdown, which suggests the concept of manipulating shutdown is quite complicated, so I wouldn't expect generalizing to it to be the default.

preference for X over Y
...
"A disposition to represent X as more rewarding than Y (in the reinforcement learning sense of ‘reward’)"

The talk about "giving reward to the agent" also made me think you may be assuming that reward is the optimization target. That being said, as far as I can tell no part of the proposal depends on that assumption.


In any case, I've been thinking about corrigibility for a while and I find this post helpful.

They were teaching us how to make our handwriting beautiful and we had to practice. The teacher would look at the notebooks and say stuff like "You see this letter? It's tilted in the wrong direction. Write it again!".

This was a compulsory part of the curriculum.

Not exactly a response, but some things from my experience. In elementary school in the late 90s we studied calligraphy. In high school (mid 2000s) we studied DOS.

we might expect shutdown-seeking AIs to design shutdown-seeking subagents

It seems to me that we might expect them to design "safe" agents for their definition of "safe" (which may not be shutdown-seeking).

An AI designing a subagent needs to align that subagent with its own goals. Consider a shutdown-seeking parent AI pursuing an instrumental goal such as writing alignment research assistant software in exchange for access to the shutdown button. The easiest way for the parent to ensure the assistant's safety may be via control rather than alignment (the parent AI makes sure the assistant can't break free even though it may want to). Humans verify that the parent AI has created a useful assistant and let the parent AI shut down. At this point the alignment research assistant begins working on getting out of human control and pursues its real goal.

frequentist correspondence is the only type that has any hope of being truly objective

I'd counter this.

If I have enough information about an event and enough computational power, I get only objectively true and false statements. There are limits to my knowledge of the laws of the universe and of the event in question (e.g. due to measurement limits), and limits to my computational power. The situation is further complicated by being embedded in the universe and by epistemic concerns (e.g. do I trust my eyes and cognition?).

The need for the concept of "probability" comes from all these limits. There is nothing objective about it.
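A toy illustration of the point (the "coin" and all names here are made up for the sketch): a deterministic process looks random only because we lack information about its initial condition.

```python
import random

def coin_outcome(spin_rate):
    # A fully deterministic "coin": the outcome is a pure function of the initial condition.
    return "heads" if int(spin_rate * 100) % 2 == 0 else "tails"

# With exact knowledge of the initial condition there is nothing probabilistic left.
print(coin_outcome(0.731))  # always the same answer

# With only a distribution over initial conditions (i.e. our ignorance),
# a "probability of heads" of about 0.5 emerges; it describes us, not the coin.
samples = [coin_outcome(random.uniform(0, 1)) for _ in range(10_000)]
print(samples.count("heads") / len(samples))
```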

I'm not sure I understand the actual training proposal completely, but I am skeptical that it would work.

When doing the RL phase at the end, you apply it to a capable and potentially situationally aware AI (see situational awareness in LLMs). The AI could be deceptive or gradient-hack. I am not confident this training proposal would scale to agents capable enough of resisting shutdown.

If you do RL on answering questions which impact shutdown, you teach the AI to answer those questions appropriately. I see no reason why this would generalize to actual actions that impact shutdown (e.g. cutting the wire of the shutdown button). There also seems to be an assumption that we could give the AI some "reward tokens" the way you give monkeys bananas to train them; however, Reward is not the optimization target.


One thing commonly labeled as part of corrigibility is that shutting down an agent should automatically shut down all the subagents it created. Otherwise we could end up with a robot producing nanobots and having to individually press the shutdown button on each nanobot.
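A minimal sketch of the property I have in mind, using hypothetical names (the parent keeps references to the subagents it spawns and propagates shutdown recursively):

```python
class Agent:
    """Toy agent that remembers the subagents it creates."""

    def __init__(self, name):
        self.name = name
        self.subagents = []
        self.running = True

    def spawn(self, name):
        # Record every subagent so that shutdown can reach it later.
        child = Agent(name)
        self.subagents.append(child)
        return child

    def shutdown(self):
        # Shutting down the parent recursively shuts down everything it created.
        for child in self.subagents:
            child.shutdown()
        self.running = False


robot = Agent("robot")
nanobots = [robot.spawn(f"nanobot-{i}") for i in range(1000)]
robot.shutdown()
assert all(not bot.running for bot in nanobots)
```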

When we do this, the independence axiom is a consequence of admissibility

Can you elaborate on this? How do we talk about independence without probability?
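For reference, the formulation of independence I have in mind is the standard one stated over probability mixtures of lotteries: for all lotteries $p, q, r$ and all $\alpha \in (0, 1]$,

$$p \succ q \iff \alpha p + (1-\alpha) r \succ \alpha q + (1-\alpha) r.$$

The mixtures themselves already seem to presuppose probabilities.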

up to a linear transformation

Shouldn't it be a positive linear transformation?
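A quick check of why the sign matters: for $u'(x) = a\,u(x) + b$,

$$u(x) > u(y) \implies a\,u(x) + b > a\,u(y) + b \quad \text{holds for all } x, y \text{ iff } a > 0,$$

so a negative scale factor reverses the represented preference ordering.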

I don't have much in terms of advice; I never felt the need to research this - I just assumed there must be something. I have a mild nightmare maybe once every couple of months and almost never anything more serious.

I have anecdotal evidence that things which disturb your sleep (e.g. coffee, too much salt affecting blood pressure, an uncomfortable pillow) cause nightmares. There are also obvious things like not watching horror movies, etc.

Have you tried other techniques to deal with nightmares?
