DieSab

Gradient Anatomy's - Hallucination Robustness in Medical Q&A

Feb 12, 2025

TL;DR

We investigated reducing hallucinations in medical question-answering with Llama-3.1-8B-Instruct.

Using Goodfire's Sparse Autoencoder (SAE), we identified neural features associated with accurate and hallucinated responses. We found that features related to the model’s awareness of its own knowledge limitations were present and useful for detecting hallucinations.

We further demonstrated this by steering the model with those features, which reduced hallucination rates by almost 6%. However, we observed that larger models exhibit greater uncertainty, which complicates the distinction between information the model clearly knows and information it clearly does not.

These findings support research suggesting that features learnt during fine-tuning allow the model to learn about its own knowledge; such features are useful in identifying lack...
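To make the "read a feature, then steer with it" idea from the TL;DR concrete, here is a minimal toy sketch in NumPy. It is not Goodfire's API and not the pipeline used in the post: the SAE weights are random, the dimensions are tiny, and `KNOWLEDGE_LIMIT_FEATURE`, `sae_encode` and `steer` are hypothetical names chosen for illustration. It assumes a standard ReLU sparse autoencoder and steering by adding a scaled decoder direction to an activation vector.

```python
import numpy as np


def sae_encode(x, W_enc, b_enc):
    """Toy ReLU SAE encoder: feature activations = relu(x @ W_enc + b_enc)."""
    return np.maximum(x @ W_enc + b_enc, 0.0)


def steer(x, W_dec, feature_idx, strength):
    """Shift an activation vector along one feature's decoder direction."""
    return x + strength * W_dec[feature_idx]


# Random stand-ins for a trained SAE (assumption: real weights would come
# from an SAE trained on the model's residual-stream activations).
rng = np.random.default_rng(0)
d_model, n_features = 16, 64
W_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(n_features, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm directions

# A stand-in for one token's activation vector.
x = rng.normal(size=d_model)

# Detection side: inspect the feature activations for this token.
features = sae_encode(x, W_enc, b_enc)

# Steering side: push the activation along a hypothetical
# "knows its own knowledge limits" feature.
KNOWLEDGE_LIMIT_FEATURE = 7  # hypothetical index, for illustration only
steered = steer(x, W_dec, KNOWLEDGE_LIMIT_FEATURE, strength=4.0)

# Because the decoder direction has unit norm, the projection onto it
# grows by exactly the steering strength.
delta = steered @ W_dec[KNOWLEDGE_LIMIT_FEATURE] - x @ W_dec[KNOWLEDGE_LIMIT_FEATURE]
print(f"projection change along steered feature: {delta:.3f}")
```

In a real setup the feature activations would feed a hallucination detector, and the steered activation would be written back into the model's forward pass; both steps are omitted here.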