Produced by Jon Kutasov and David Steinberg as a capstone project for ARENA. Epistemic status: 5 days of hacking, and there could be bugs we haven’t caught. Thank you to the TAs who helped us out, and to Adam Karvonen (author of the paper our work was based on) for answering our questions and helping us debug early in the project!
Overview
This paper documents the training and evaluation of a number of SAEs on OthelloGPT and ChessGPT. In particular, they train SAEs on layer 6 of 8 in these language models. One of the interesting results they find is that the latents captured by the SAE reconstruct, perfectly or almost perfectly, several of the concepts…
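For readers unfamiliar with the setup, here is a minimal sketch of what "training an SAE on layer 6" amounts to: a sparse autoencoder fit to that layer's activations with a reconstruction loss plus an L1 sparsity penalty. The dimensions, L1 coefficient, and the use of random stand-in activations below are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal SAE sketch: reconstruct one layer's activations through a wide,
# sparse bottleneck. Sizes and hyperparameters here are assumed, not the paper's.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        latents = torch.relu(self.encoder(x))  # sparse latent activations
        recon = self.decoder(latents)          # reconstruction of the layer activations
        return recon, latents

# Stand-in for layer-6 residual-stream activations (a batch of token positions).
d_model, d_hidden = 512, 4096                  # hypothetical sizes
acts = torch.randn(1024, d_model)

sae = SparseAutoencoder(d_model, d_hidden)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3                                # sparsity penalty weight (assumed)

for step in range(100):
    opt.zero_grad()
    recon, latents = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * latents.abs().mean()
    loss.backward()
    opt.step()
```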
Thanks for the suggestion! This sounds pretty cool and I think it would be worth trying.
One thing that might make this a bit tricky is finding the right subset of the data to feed into Claude. Each feature fires only very rarely, so it can be easy to fool yourself into thinking you’ve found a good classifier when you haven’t.
For example, many of the features we found only fire when they see check. However, many cases of check don’t activate the feature. The problem we ran into is that check is such an infrequent occurrence that the only way to get a decent number of samples showing check is to take a ton of…
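To make the sampling issue concrete, here is a toy illustration (entirely synthetic data, not our actual features): a latent that fires on only a fraction of check positions can still show perfect precision, and a small random sample contains so few check positions that the low recall is easy to miss.

```python
# Toy illustration of the evaluation pitfall: the base rate of "check" and the
# feature's firing behavior below are assumptions for the sake of the example.
import numpy as np

rng = np.random.default_rng(0)

n = 100_000                                  # total board positions sampled (assumed)
is_check = rng.random(n) < 0.01              # check is rare: ~1% of positions (assumed)
# Hypothetical feature: activates on ~40% of check positions, never otherwise.
activation = np.where(is_check & (rng.random(n) < 0.4), rng.uniform(1, 5, n), 0.0)

fires = activation > 0.0
precision = (fires & is_check).sum() / max(fires.sum(), 1)
recall = (fires & is_check).sum() / max(is_check.sum(), 1)
print(f"precision={precision:.2f}  recall={recall:.2f}")
# Perfect precision, low recall: the feature only fires when it sees check, but
# misses most checks. If the recall gap isn't measured, the feature can look
# like a reliable "check classifier".

# A small random subsample contains only a handful of check positions,
# so any recall estimate computed from it is very noisy.
small = rng.choice(n, size=500, replace=False)
print("check positions in a 500-position sample:", is_check[small].sum())
```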