Gradient Anatomy's - Hallucination Robustness in Medical Q&A
TL;DR We investigated reducing hallucinations in medical question-answering with Llama-3.1-8B-Instruct. Using Goodfire's Sparse Autoencoder (SAE), we identified neural features associated with accurate and hallucinated responses. We found that features related to the model’s awareness of its own knowledge limitations were present and useful for detecting hallucinations. This was further...