I'm trying to understand your apparent distaste for averaging.
In the physics context, you're treating it as some empirical lab technique for dealing with imperfect apparatus. Given its indifference to the particular sorts of noise or error model, averaging can appear to be unprincipled or just a tractable approximation to some better scheme for analyzing all the observations. What if there is temporal or spatial correlation in the errors? What if there is some Simpson's Paradox-style structure between groups of observations? What if the least-significant bits of the measurements spell out in ASCII what the true answer is?
However, it is nearly a meta-theorem of statistics that inference is possible only when averaging is (this follows from looking at the properties of exponential families of distributions, the only really tractable class). If some extra structure is present, the answer is not to give up averaging but instead to average ALL the things (corresponding to sufficient statistics in a richer family).
The problem is not with averaging. The problem is the misunderstanding of what the result means and where the result is actually coming from.
The average weight of a stable atomic nucleus (averaged over all stable nuclei [of all elements]), for instance, is not an important fact of nuclear physics. It is almost entirely useless trivia, so uninteresting that I wouldn't be surprised if not a single nuclear physicist has ever calculated it. Likewise, the average human behaviour, when there is huge variance in human behaviour, is more of a demographic fact ...
It seems to me that there is a great deal of generalization from averages (or from correlations, which are a form of average) when scientific findings are interpreted.
Consider the Sapir-Whorf hypothesis as an example. The hypothesis is tested by measuring the average behaviour of large groups of people; at the same time, it may well be that for some people the strong version of the Sapir-Whorf hypothesis does hold, for some it is grossly invalid, and some fall in between. We have already determined that there is considerable diversity in modes of thought simply by asking people to describe their thinking. I would rather infer from the diversity of those comments that I can't generalize about human thought than generalize from even the most accurate, most scientifically solid, most statistically significant average of some kind and assume that this average tells us how human thought processes work in general.
In this case the average behaviour is nothing more than an indicator of the ratio between those populations: useless demographic trivia of the form "did you know that among North Americans, linguistically-determined people are numerous enough to sway this particular experiment?" (a result that I wouldn't care much about). There has been an example posted here.
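To make the point about ratios concrete, here is a toy sketch. Everything in it is hypothetical: the function name `population_effects`, the population size, and the effect sizes are invented for illustration and do not come from any study. It builds two populations with very different internal structure that nonetheless produce exactly the same average.

```python
from statistics import mean

def population_effects(n_people, responder_fraction, responder_effect):
    """Per-person effects for a hypothetical two-type population.

    A fraction of people show `responder_effect`; everyone else shows 0.
    All numbers are made up for illustration.
    """
    n_responders = round(n_people * responder_fraction)
    return [responder_effect] * n_responders + [0.0] * (n_people - n_responders)

# Population A: everyone is mildly affected.
uniform = population_effects(1000, 1.0, 1.0)
# Population B: one person in five is strongly affected, the rest not at all.
mixed = population_effects(1000, 0.2, 5.0)

# The averages are identical...
print(mean(uniform), mean(mixed))
# ...but the number of people who show any effect at all is not.
print(sum(e > 0 for e in uniform), sum(e > 0 for e in mixed))
```

An experiment that reports only the mean cannot tell these two populations apart; the mean is just the responder fraction times the responder effect, which is the "ratio between populations" described above.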
This goes for much of science outside physics.
There was another thread about software engineering. Even if the graph had not been inverted and the confounding variables had been accounted for, the result should still have been perceived as useless trivia of the form "did you know that in such and such a selection of projects, the kind of mistakes that become more costly to fix with time outnumber the mistakes that become less costly to fix with time?" (Mistakes in work that is taken as input for future work do snowball over time, and the others not so much; anyone who has ever successfully developed a non-trivial product and sold it knows that. But you can't stick a 'science' label on this, yet you can stick a 'science' label onto some average.) Instead, the result is taken as if it literally told us whether mistakes are costlier, or less costly, to fix later. That sort of misrepresentation is in the abstracts of many published papers.
It seems to me that this fallacy is extremely widespread. A study comes out that generalizes from an average; the elephant in the room is that it is often invalid to generalize from an average; yet instead we argue about whether the average was measured correctly and whether enough people went into it. Even if it was and they did, in many cases the result is just demographic trivia, barely relevant to the subject the study purports to be about.
A study of one person's thought may provide some information about how thought processes work in one real human; it indicates that a thought process can work in some particular way. A study of the average behaviour of many people provides results that are primarily determined by demographics and ratios. Yet people often see the latter as more significant than the former, perhaps mistaking statistical significance for significance in the everyday sense, or perhaps mistaking generalization from an average for an actual detailed study of a large number of people. Perhaps this obsession with averaging is a form of cargo cult, imitating physics, where you average measurements to, e.g., cancel out thermal noise in a sensor.
----
I want to make a main post about this, with a larger number of examples; it would be very helpful if you could post your examples of generalization from averages here.