A net saying "I'm thinking about ways to kill you" does not necessarily imply anything whatsoever about the net actually planning to kill you
Since these nets are optimized for consistency (consistent text is more likely under the training distribution), wouldn't they be likely to output text that is consistent with this "thought"? E.g., trying to convince the user to kill themselves, maybe even giving them a reason (found by searching the web)?
MIRI is bottlenecked more on ideas worth pursuing and people who can pursue them, than on funding
Ideas come from (new) people, and you mentioned seed-planting, which should contribute to having such people in 4-6 years. Given your short timelines, that still seems like a worthy thing to do for AGI if anything is worth doing for any cause at all. If you agree, what's the bottleneck for that effort?
Related work:
Show Your Work: Scratchpads for Intermediate Computation with Language Models
https://arxiv.org/abs/2112.00114
(from a very surface-level perusal) Prompting the model to write out its intermediate computation (sketched below) resulted in
1) The model outputting intermediate thinking "steps"
2) A capability gain
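
A minimal sketch of what that scratchpad prompting looks like, as I understand it from skimming; `complete` is a hypothetical stand-in for whatever LM completion API you use, and the exact few-shot format in the paper differs:

```python
# Scratchpad-style prompting: a few-shot example demonstrates emitting
# intermediate steps before the final answer; the model imitates the format.

def complete(prompt: str) -> str:
    # Hypothetical placeholder: plug in your LM completion call here.
    raise NotImplementedError("wire this up to an actual LM API")

FEW_SHOT = """\
Q: What is 24 * 17?
Scratchpad:
24 * 17 = 24 * 10 + 24 * 7
24 * 10 = 240
24 * 7 = 168
240 + 168 = 408
A: 408
"""

def ask_with_scratchpad(question: str) -> str:
    # The model continues the demonstrated pattern, producing its own
    # intermediate "thinking" steps before the final "A:" line.
    prompt = FEW_SHOT + f"\nQ: {question}\nScratchpad:\n"
    return complete(prompt)

# Example: ask_with_scratchpad("What is 31 * 12?")
```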
Any updates on the API? I'm thinking of playing around with interesting ways to index LW, and figure there should be something better than scraping.
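
For concreteness, here is the kind of thing I'd hope to do. This is a sketch against LW's GraphQL endpoint; the endpoint URL is real, but the query shape and field names below are my assumptions and may not match the live schema:

```python
import requests

# Assumed endpoint for LessWrong's GraphQL API.
ENDPOINT = "https://www.lesswrong.com/graphql"

# Assumed query shape: fetch the five most recent posts with title and URL.
QUERY = """
{
  posts(input: {terms: {view: "new", limit: 5}}) {
    results {
      title
      pageUrl
    }
  }
}
"""

def fetch_recent_posts():
    resp = requests.post(ENDPOINT, json={"query": QUERY}, timeout=30)
    resp.raise_for_status()
    return resp.json()["data"]["posts"]["results"]

if __name__ == "__main__":
    for post in fetch_recent_posts():
        print(post["title"], "-", post["pageUrl"])
```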