All of Ed Li's Comments + Replies

Ed Li

Some questions about the feasibility section:

About (2): what would 'crisp internal representations' look like? I think this is really useful to know, since we haven't figured this out for the brain or for LLMs (e.g. Interpreting Neural Networks through the Polytope Lens).

Moreover, the current methods used to compare the human brain's representations to those of ML models, like RSA, are quite high-level, or at the very least do not reveal any useful insights for interpretability on either side (please feel free to correct me). This question is not a point against me...
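For concreteness, here is a minimal sketch of what an RSA-style comparison typically looks like; the arrays below are hypothetical random data standing in for real brain recordings and model activations. The comparison only looks at the geometry of pairwise dissimilarities, never at individual features, which is part of why it feels high-level:

```python
# Minimal RSA sketch (assumptions: NumPy/SciPy available; random arrays
# stand in for real brain recordings and model activations).
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(activations: np.ndarray) -> np.ndarray:
    """Representational dissimilarity matrix (condensed upper triangle):
    1 - Pearson correlation between every pair of stimulus patterns.
    `activations` has shape (n_stimuli, n_features)."""
    return pdist(activations, metric="correlation")

def rsa_score(brain_acts: np.ndarray, model_acts: np.ndarray) -> float:
    """Spearman correlation between the two RDMs."""
    rho, _ = spearmanr(rdm(brain_acts), rdm(model_acts))
    return rho

# Toy usage: 50 stimuli, hypothetical voxel and hidden-unit dimensions.
rng = np.random.default_rng(0)
brain = rng.normal(size=(50, 200))   # 50 stimuli x 200 voxels
model = rng.normal(size=(50, 768))   # 50 stimuli x 768 hidden units
print(rsa_score(brain, model))
```

Note that the single scalar it produces says nothing about which features drive the (dis)similarity, which is the sense in which RSA is uninformative for mechanistic interpretability on either side.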

Arthur Conmy
By "crisp internal representations" I mean that  1. The latent variables the model uses to perform tasks are latent variables that humans use. Contra ideas that language models reasons in completely alien ways. I agree that the two cited works are not particularly strong evidence : (  2. The model uses these latent variables with a bias towards shorter algorithms (e.g shallower paths in transformers). This is important as it's possible that even when performing really narrow tasks, models could use a very large number of (individually understandable) latent variables in long algorithms such that no human could feasibly understand what's going on. I'm not sure what the end product of automated interpretability is, I think it would be pure speculation to make claims here.
Ed Li

Thank you so much for posting this. It feels weird to tick every single symptom mentioned here...


The burnout that 'Dmitry' experiences is a remarkably accurate description of what I am experiencing. Are there any further guides on how to manage this? It would help me so much; any help is appreciated :)