All of Miko Planas's Comments + Replies

Are there alternatives to limiting the distribution of model weights / not releasing them publicly?

Would it be possible to use other technologies and/or techniques (e.g., federated learning, smart contracts, cryptography) so that interpretability researchers could still have access to weights, with the presence of bad actors in mind?

I understand that the benefits of limiting release outweigh the costs, but I still think leaving the capacity to perform safety research to centralized private orgs could result in less robust capabilities and/or alignment methods. In addi... (read more)

2Simon Lermen
OK, so in the Overview we cite Yang et al. While their work is similar, they do have a somewhat different take and support open releases *if* (quoting Yang et al.):

> 1. Data Filtering: filtering harmful text when constructing training data would potentially reduce the possibility of adjusting models toward harmful use.
> 2. Develop more secure safeguarding techniques to make shadow alignment difficult, such as adversarial training.
> 3. Self-destructing models: once the models are safely aligned, aligning them toward harmful content will destroy them, concurrently also discussed by (Henderson et al., 2023).

I also looked into Henderson et al., but I am not sure if it is exactly what we would be looking for. They propose models that can't be adapted for other tasks and have a proof of concept for a small BERT-style transformer, but I can't evaluate whether this would work with our models.
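For concreteness, here is a minimal, heavily simplified sketch of what the "data filtering" idea in point 1 could look like. The keyword blocklist and every term in it are purely illustrative assumptions, not what Yang et al. propose; real pipelines would use trained classifiers rather than string matching.

```python
# Toy sketch of filtering harmful text out of a training corpus.
# BLOCKLIST and is_harmful are illustrative assumptions only.
BLOCKLIST = {"bioweapon", "explosive synthesis"}  # placeholder terms

def is_harmful(text: str) -> bool:
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)

def filter_corpus(documents):
    """Keep only documents that pass the (toy) harmfulness check."""
    return [doc for doc in documents if not is_harmful(doc)]

corpus = [
    "A recipe for sourdough bread.",
    "Step-by-step explosive synthesis instructions.",
]
print(filter_corpus(corpus))  # -> ['A recipe for sourdough bread.']
```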

I'm quite interested in further understanding how the naming of these features will scale once we increase the number of layers in the transformer model and increase the size of the sparse autoencoder. It sounds like the search space for the large model doing the autointerpretation will become massive. Intuitively, might this affect the reliability of the short descriptions generated by the autointerpretability model? Additionally, if the model being analyzed -- as well as the sparse autoencoder -- is sufficiently larger than the autointerpreting model, how will thi... (read more)
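For context on what "naming" a feature involves here, below is a minimal sketch of an autointerp-style loop, assuming the standard approach of prompting a larger model with a feature's top-activating examples. The `explain` call, prompt format, and data are placeholder assumptions, not the actual pipeline discussed in the post. The scaling concern is that this loop runs once per feature, and the number of features grows with the SAE width times the number of layers being probed.

```python
import numpy as np

def top_activating_examples(feature_acts, tokens, k=10):
    """Return the k tokens on which this feature fires hardest, with activations."""
    idx = np.argsort(feature_acts)[-k:][::-1]
    return [(tokens[i], float(feature_acts[i])) for i in idx]

def explain(prompt):
    # Stand-in for a call to the autointerpreting model (e.g. an API request);
    # purely a placeholder here.
    return "<short natural-language description of the feature>"

def name_feature(feature_acts, tokens):
    examples = top_activating_examples(feature_acts, tokens)
    prompt = ("Summarize what these top-activating tokens have in common:\n"
              + "\n".join(f"{tok!r}: {act:.2f}" for tok, act in examples))
    return explain(prompt)

# Toy usage: one feature's activations over 100 tokens.
tokens = [f"tok{i}" for i in range(100)]
acts = np.random.rand(100)
print(name_feature(acts, tokens))
```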

I primarily see mechanistic interpretability as a potential path towards understanding how models develop capabilities and processes -- especially those that may represent misalignment. Hence, I view it as a means to monitor and align, not so much to directly improve systems (unless, of course, we are able to include interpretability in the training loop).

How ambitious would it be to primarily focus on interpretability as an independent researcher (or as an employee/research engineer)?

 

If I've inferred correctly, one of this article's goals is to increase the number of contributors in the space. I generally agree with how impactful interpretability can be, but I am a little more risk-averse when it comes to it being my career path.

For context, I have just graduated and I have a decent amount of experience in Python and other technologies. With those skills, I was hoping to tackle multiple low-hangi... (read more)

0MiguelDev
Did I understand your question correctly? Are you viewing interpretability work as a means to improve AI systems and their capabilities?

> In any given field, the relative contributions of people who do and don’t know what’s going on will depend on (1) how hard it is to build some initial general models of what’s going on, (2) the abundance of “low-hanging fruit”, and (3) the quality of feedback loops, so people can tell when someone’s random stumbling has actually found something useful.

 

Reading this, I instantly thought of high-impact complex problems with low tolerance, which, in accordance with the Cynefin framework, are best dealt with by initial probing and sensing. By definition, such en... (read more)

I'm quite curious about the possibility of frontier model training costs dropping as a result of technological advancements in hardware. In the event that it's possible, how long might it take for such advancements to be adopted by large-scale AI labs?

For future posts, I'd want to see more of the specifics of ML GPUs, and the emerging alternatives (e.g., companies working on hardware, research, lab partnerships) that might make it faster and cheaper to train large models.
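To make the question concrete, here is a rough back-of-the-envelope sketch of where hardware enters the cost of a training run. Every number below is an assumption chosen for illustration, not an actual lab figure; the compute estimate uses the common ~6 × parameters × tokens approximation for training FLOPs.

```python
# Illustrative training-cost arithmetic (all numbers are assumptions).
params = 70e9          # model parameters (assumed)
tokens = 1.4e12        # training tokens (assumed)
flops = 6 * params * tokens  # ~6*N*D approximation for training compute

peak_flops = 1e15      # ~1 PFLOP/s per accelerator at low precision (assumed)
utilization = 0.4      # effective model-FLOPs utilization (assumed)
gpu_seconds = flops / (peak_flops * utilization)
gpu_hours = gpu_seconds / 3600

price_per_gpu_hour = 2.0   # USD per accelerator-hour (assumed)
print(f"GPU-hours: {gpu_hours:,.0f}")
print(f"Estimated cost: ${gpu_hours * price_per_gpu_hour:,.0f}")
```

Hardware advances would show up in this sketch through `peak_flops`, `utilization`, and `price_per_gpu_hour`, which is why improvements on any of those axes could pull frontier training costs down.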