LDJ20

Thanks for the comment.
In terms of label noise, I feel that's decently accounted for already: the calculations I'm using aren't based on specific model scores. Instead, each point is sampled by taking the intersection of a smoothed curve with the given accuracy level for each model, and then deriving the factor between that point and the 50%-accuracy time horizon, as you can see in the image at this link for example: https://prnt.sc/odMcvz0isuRU

Additionally, after noting those 50%-to-80%, 90%, 95% and 99% factors individually for all six models, I averaged them across the models, yielding a single final set of averaged factors, which I used to formulate the 80%, 95% and 99% trend lines in the final chart. There is probably still a higher margin of error worth noting for the higher accuracy values, but unless I'm missing something, this methodology seems relatively robust against being undermined by label noise.
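For anyone who wants to check the logic, here is a minimal sketch of the procedure described above. All curves and model names are made up for illustration (the real data comes from METR's runs); the sketch assumes each model's smoothed success curve is monotonically decreasing in task length:

```python
import numpy as np

def horizon_at(accuracy, log_lengths, success_probs):
    """Interpolate the task length (minutes) at which the smoothed
    success curve crosses the given accuracy level."""
    # success_probs decreases with task length, so reverse both arrays
    # for np.interp, which requires increasing x-coordinates.
    return np.exp(np.interp(accuracy, success_probs[::-1], log_lengths[::-1]))

# Fake smoothed curves for two hypothetical models: logistic in
# log task length, with 50% horizons at roughly 30 and 60 minutes.
log_lengths = np.linspace(np.log(0.1), np.log(1000), 500)
models = {
    "model_a": 1 / (1 + np.exp((log_lengths - np.log(30)) / 1.2)),
    "model_b": 1 / (1 + np.exp((log_lengths - np.log(60)) / 1.2)),
}

levels = [0.8, 0.9, 0.95, 0.99]
factors = {lvl: [] for lvl in levels}

for name, probs in models.items():
    h50 = horizon_at(0.5, log_lengths, probs)
    for lvl in levels:
        # Factor between this model's horizon at the higher accuracy
        # level and its own 50%-accuracy horizon.
        factors[lvl].append(horizon_at(lvl, log_lengths, probs) / h50)

# Average each factor across models to get one set of scaling factors,
# which can then be applied to the 50% trend line.
avg_factors = {lvl: float(np.mean(f)) for lvl, f in factors.items()}
print(avg_factors)
```

The factors come out below 1 and shrink as the required accuracy rises, which is why the 80%, 95% and 99% trend lines sit progressively lower than the 50% line.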

"I think you'd want to get a ceiling on performance by baselining with human domain experts who are instructed to focus on reliability." "So, I expect that looking at 95% will underestimate AI progress."

Yes, I very much agree. I don't think it's fair for people to assume from my chart that "it won't be human-level until it's doing these tasks at 99% reliability," and I hope people don't end up with that takeaway.
I also don't intend to imply that 99% accuracy means "1% away from human level," since human accuracy could be less than 99%, or even less than 90% or 80%, depending on the task.

I don't have the resources at hand to implement the kind of human testing you described, but I think it's worth giving that feedback to the folks at METR; that could result in more useful data for people like me to make visualizations from again :)