Drejc Pesjak

Comments

On the Practical Applications of Interpretability

A friend recently asked me: how can I easily (i.e., with no training) create a model that only uses a limited subset of its vocabulary? I thought about this question, and the main approaches I could imagine were prompt-related: adding a few examples to the beginning of the prompt, repeatedly prompting the model until only certain words were used, or removing words and feeding the processed text back in. You could also try finetuning, but you'd need to gather a lot of data and hope that performance doesn't degrade on other tasks.

Wouldn't this be very simple? At the last layer, just clip the probabilities of the "forbidden words" to zero.
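To make the idea concrete, here is a minimal sketch in plain Python, not tied to any particular model or library. It assumes you already have the final-layer logits for the next token as a list of floats and a list of forbidden token ids; the function name `mask_forbidden` and the toy numbers are made up for illustration. Setting a logit to negative infinity before the softmax has the same effect as clipping that probability to zero and renormalizing the rest.

```python
import math

def mask_forbidden(logits, forbidden_ids):
    """Return next-token probabilities with forbidden token ids forced to zero."""
    forbidden = set(forbidden_ids)
    # A logit of -inf becomes probability 0 after the softmax, and the
    # remaining probabilities are renormalized automatically.
    masked = [float("-inf") if i in forbidden else x
              for i, x in enumerate(logits)]
    peak = max(masked)
    if peak == float("-inf"):
        raise ValueError("every token is forbidden")
    # Subtract the max for numerical stability; math.exp(-inf) is 0.0.
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 4-token vocabulary (hypothetical values); ids 1 and 3 are "forbidden".
probs = mask_forbidden([2.0, 1.0, 0.5, 3.0], forbidden_ids=[1, 3])
```

One caveat with this sketch: it works on token ids, so a forbidden word that the tokenizer splits into several tokens would need extra handling beyond masking single ids.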
