Does Summarization Affect LLM Performance?
Edit 4/2/25: Added footnotes; I didn't realize they got lost en route.

Hello! This is a mini-project that I carried out to get a better sense of ML engineering research. The question itself is trivial, but it was useful to walk through every step of the process by myself. I also haven't had much experience with technical research, so I'd greatly appreciate any feedback. My code is available on GitHub.

This task originated from an application to Aryan Bhatt's SPAR stream. Thanks to Jacob G-W, Sudarsh Kunnavakkam, and Sanyu Rajakumar for reading through and giving feedback on an initial version of this post! Any errors are, of course, my own.

Tl;dr: Does summarization degrade LLM performance on GPQA Diamond? Yes, but only if the model has a high baseline. Also, the degree of summarization is not correlated with performance degradation.

Summary

I investigate whether summarizing questions can reduce the accuracy of a model's responses. Before running this experiment, I had two hypotheses:

1. Summarization does affect a model's performance, but only when mediated through 'loss of semantic information' (as opposed to because of the words lost).
2. The more semantic information is lost, the worse the model performs.

I operationalized these through two summarization metrics: syntactic summarization and semantic summarization. The former represents our intuitive notion of summarization; the latter aims to capture the fuzzier notion of how much meaning is lost.

Results indicated that summarization was only an issue when the model had a high baseline to start with; when baseline performance was low, summarization had a negligible effect [Table 1]. Also, the degree of summarization (either syntactic or semantic) was not correlated with performance degradation (figures attached under Results).

Table 1: Model baseline performance on GPQA Diamond vs. model performance when the questions were summarized. Results shown with 95% confidence intervals.

Methodology

Fi
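As a concrete illustration of the two summarization metrics described above, here is a minimal sketch of how they could be computed. This is an illustration rather than the exact implementation in the repo: the word-count ratio for syntactic summarization, and the all-MiniLM-L6-v2 sentence-transformers embedding with cosine distance for semantic summarization, are assumptions made for the sketch.

```python
# Hypothetical sketch of the two summarization metrics; the metric
# definitions and embedding model here are illustrative assumptions,
# not necessarily those used in the actual experiment.
import numpy as np
from sentence_transformers import SentenceTransformer

_embedder = SentenceTransformer("all-MiniLM-L6-v2")

def syntactic_summarization(original: str, summary: str) -> float:
    """Fraction of words removed by the summary (0 = no compression)."""
    return 1.0 - len(summary.split()) / len(original.split())

def semantic_summarization(original: str, summary: str) -> float:
    """One minus the cosine similarity between embeddings of the original
    question and its summary (higher = more meaning lost)."""
    a, b = _embedder.encode([original, summary])
    cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return 1.0 - cos
```

In the experiment, each GPQA Diamond question gets a summarized variant, both metrics are computed for the original/summary pair, and the model is evaluated on both versions, so that the drop in accuracy can be compared against the degree of summarization.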