I wouldn't expect an LLM to do this. An LLM wants to predict the most likely next word, so it is going to assign high probabilities to semantically similar words (which is why those are clustered in embedding space). Whisper is doing speech-to-text, so as well as knowing about the semantic similarity of words it also needs to know about words that sound the same. E.g. if it thinks it heard 'rug', it is pretty likely the speaker actually said 'mug', hence these words end up clustered. Does that make sense?
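For what it's worth, here is a minimal sketch of the kind of check I mean, assuming the openai-whisper package and using the decoder token embeddings as the embedding space (the 'rug'/'mug' pair is just the example from above): look up a word's nearest neighbours by cosine similarity and see whether phonetic neighbours show up alongside semantic ones.

```python
import torch
import whisper
from whisper.tokenizer import get_tokenizer

model = whisper.load_model("tiny")
tokenizer = get_tokenizer(multilingual=model.is_multilingual)

emb = model.decoder.token_embedding.weight.detach()   # (n_vocab, d_model)
emb_norm = emb / emb.norm(dim=-1, keepdim=True)

def nearest_tokens(word: str, k: int = 10):
    # Single-token words keep the comparison clean; only the first token is used here.
    ids = tokenizer.encode(" " + word)
    query = emb_norm[ids[0]]
    sims = emb_norm @ query                            # cosine similarity to every token
    top = sims.topk(k + 1).indices.tolist()            # +1 because the word itself is included
    return [tokenizer.decode([i]).strip() for i in top if i != ids[0]][:k]

print(nearest_tokens("rug"))
```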
Re other layers in the encoder: There are only 4 layers in Whisper tiny, and I couldn't find any 'listenable' features in the earlier layers (0 and 1), so I'm guessing they activate more on frequency patterns than on human-recognisable sounds. Simple linear probes trained on the layer 2 and 3 activations suggest they learn language features (e.g. is_french) and is_speech. I haven't looked into it any more than that, though.
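In case it's useful, a rough sketch of the probe setup (the file paths and labels below are placeholders, and the details of the probes I actually trained may differ): mean-pool one encoder block's activations over time and fit a logistic regression to a binary label like is_speech or is_french.

```python
import numpy as np
import torch
import whisper
from sklearn.linear_model import LogisticRegression

model = whisper.load_model("tiny")
layer = 2                               # which encoder block to probe (0-3 in tiny)
captured = {}

def hook(_module, _inputs, output):
    captured["acts"] = output.detach()  # (batch, n_frames, d_model)

model.encoder.blocks[layer].register_forward_hook(hook)

def layer_features(path: str) -> np.ndarray:
    audio = whisper.pad_or_trim(whisper.load_audio(path))
    mel = whisper.log_mel_spectrogram(audio).unsqueeze(0)
    with torch.no_grad():
        model.encoder(mel)              # hook fills captured["acts"]
    return captured["acts"].mean(dim=1).squeeze(0).numpy()  # mean-pool over time

# placeholder data: 1 = speech (or French), 0 = not
paths, labels = ["clip_0.wav", "clip_1.wav"], [1, 0]
X = np.stack([layer_features(p) for p in paths])
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print(probe.score(X, labels))
```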
Re localisation of attention - 'I'd predict that most but not all encoder heads are highly localised' - this looks true when you look at the attention patterns per head. As you said, most heads (4/6) in each layer are highly localised - you can restrict them to a local window of k=10 with little effect. But there are 1 or 2 heads in each layer that are not so localised, and they are responsible for the degradation seen when you mask them.
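To be concrete about what I mean by masking a head to k=10 (an illustrative construction rather than the exact code I used): each frame is only allowed to attend to frames within ±k positions, by adding -inf outside that band to the head's attention logits before the softmax.

```python
import torch

def local_attention_mask(n_frames: int, k: int = 10) -> torch.Tensor:
    # 0 inside the +/-k band, -inf outside; add this to a head's attention logits
    idx = torch.arange(n_frames)
    band = (idx[None, :] - idx[:, None]).abs() <= k
    return torch.zeros(n_frames, n_frames).masked_fill(~band, float("-inf"))

# Dummy logits for one encoder layer of Whisper tiny: (batch, 6 heads, 1500 frames, 1500 frames)
scores = torch.randn(1, 6, 1500, 1500)
mask = local_attention_mask(1500, k=10)
scores[:, 0] += mask                  # localise head 0 only; leave the other heads untouched
attn = scores.softmax(dim=-1)         # masked positions get zero attention weight
```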
Working on that one - the code is not in a shareable state yet but I will link a notebook here once it is!