Chris_Leong

Sequences

Linguistic Freedom: Map and Territory Revisted
INVESTIGATIONS INTO INFINITY

Comments


I agree that we probably want most theory to be towards the applied end these days due to short timelines. Empirical work needs theory in order to direct it; theory needs empirics in order to remain grounded.

Thanks for writing this. I think it is a useful model. However, there is one thing I want to push back against:

Looking at behaviour is conceptually straightforward, and valuable, and being done

I agree with Apollo Research that evals aren't really a science yet. They mostly seem to be conducted according to vibes. Model internals could help with this, but things like building experience or auditing models using different schemes and comparing them could also help make this more scientific.

Similarly, a lot of work with Model Organisms of Alignment requires a lot of careful thought to get right.

Remember back in 2013 when the talk of the town was how vector representations of words learned by neural networks represent rich semantic information? So you could do cool things like take the [king] vector, subtract the [male] vector, add the [female] vector, and get out something close to the [queen] vector? That was cool! Where's the stuff like that these days?


Activation vectors are a thing. So it's totally happening.
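
For anyone who hasn't seen the word-vector arithmetic mentioned above, here is a minimal sketch of the usual form of the analogy (king - man + woman ≈ queen). The embeddings are hypothetical, hand-picked toy values rather than anything learned by a network, and the names `EMBEDDINGS`, `cosine` and `analogy` are mine, made up for illustration.

```python
import math

# Hypothetical toy embeddings, hand-picked for illustration (NOT learned).
# Axes are loosely: [royalty, gender, fruit-ness].
EMBEDDINGS = {
    "king":  [0.9,  0.8, 0.0],
    "queen": [0.9, -0.8, 0.0],
    "man":   [0.1,  0.8, 0.0],
    "woman": [0.1, -0.8, 0.0],
    "apple": [0.0,  0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def analogy(a, b, c, embeddings=EMBEDDINGS):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c).

    The three input words are excluded from the candidates, as is common
    practice when evaluating word analogies.
    """
    target = [
        x - y + z
        for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    candidates = [w for w in embeddings if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


print(analogy("king", "man", "woman"))
```

With these toy values the nearest remaining word is "queen". Roughly speaking, activation steering does the same kind of add-a-direction arithmetic, just on a model's hidden activations rather than on word embeddings.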


"How can we get more evidence on whether scheming is plausible?" - What if we ran experiments where we included some pressure towards scheming (either RL or fine-tuning) and we attempted to determine the minimum such pressure required to cause scheming? We could further attempt to see how this interacts with scaling.

I guess I was thinking about this in terms of getting maximal value out of wise AI advisers. The notion that comparisons might be unfair didn't even enter my mind, even though that isn't too many reasoning steps away from where I was.

Fascinating. Sounds related to the Yoga concept of kriyas.

I would suggest adopting a different method of interpretation, one more grounded in what was actually said. Anyway, I think it's probably best that we leave this thread here.

Sadly, cause-neutral was an even more confusing term, so this is better than the comparative. I also think that the two notions of principles-first are less disconnected than you think, but through somewhat indirect effects.

We're mostly working on stuff to stay afloat rather than high level navigation.

 

Why do you think that this is the case?
