Another interesting corpus (though problematic for legal reasons) would be Sci-Hub. Quick googling gives estimates of around 50 million research articles; the average research article runs around 4000 words, and Sci-Hub is estimated to contain about 69% of all research articles published in peer-reviewed journals. Treating one word as roughly one token, that would put Sci-Hub at about 50 million * 4000 = 200B tokens and the whole scientific journal literature at an estimated 200B / 0.69 ≈ 290B tokens.
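For anyone who wants to plug in different guesses, here is a minimal sketch of that back-of-envelope arithmetic. The three inputs are just the rough figures quoted above, and the function name `corpus_estimates` is made up for illustration. It counts words, so any word-to-token conversion is a further assumption on top.

```python
# Rough figures quoted in the comment; all three are estimates, not measurements.
SCI_HUB_ARTICLES = 50_000_000   # articles estimated to be on Sci-Hub
WORDS_PER_ARTICLE = 4_000       # average length of a research article
SCI_HUB_COVERAGE = 0.69         # share of peer-reviewed articles on Sci-Hub


def corpus_estimates(articles=SCI_HUB_ARTICLES,
                     words_per_article=WORDS_PER_ARTICLE,
                     coverage=SCI_HUB_COVERAGE):
    """Return (words on Sci-Hub, words in the whole journal literature)."""
    sci_hub_words = articles * words_per_article
    # Sci-Hub holds `coverage` of the literature, so scale up by 1 / coverage.
    all_words = sci_hub_words / coverage
    return sci_hub_words, all_words


sci_hub_words, all_words = corpus_estimates()
print(f"Sci-Hub: about {sci_hub_words / 1e9:.0f}B words")
print(f"All journal literature: about {all_words / 1e9:.0f}B words")
```

Changing any single input scales both totals linearly, so the result is only as good as the weakest of the three estimates.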
Point 1 is overstated: the strong default is that unaligned AGI will be indifferent to human survival as an end. The leap to wanting to kill humans relies on a much stronger assumption than STEM-level AGI. It requires the AGI to be confident that it can replace all human activity with its own, to the extent that human activity supports or enables the AGI's functioning. Perhaps call this World-System-Designing-level AGI: an AGI that can design an alternate way for the world to function, one not based on human activity and human industrial production.
Point 3...