CAC
CAC has not written any posts yet.

I don't think there's enough evidence to draw hard conclusions about this section's accuracy in either direction, but I would err on the side of thinking ai-2027's description is correct.
Footnote 10, visible in your screenshot, reads:
For example, we think coding agents will move towards functioning like Devin. We forecast that mid-2025 agents will score 85% on SWEBench-Verified.
Current SOTA scores:
• 83.86% (codex-1, pass@8)
• 80.2% (Sonnet 4, pass@several, unclear how many)
• 79.4% (Opus 4, pass@several)
(Is it fair to allow pass@k? This Manifold Market doesn't allow it for its own resolution, but here I think it's okay, given that the footnote above makes claims about 'coding agents', which presumably allow iteration at test time.)
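(For concreteness, here's a minimal sketch of the standard unbiased pass@k estimator from the HumanEval paper (Chen et al. 2021); whether each lab computed its numbers exactly this way is my assumption, and the sample counts below are purely illustrative.)

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator for one problem:
    n = samples generated, c = samples that passed, k = attempt budget.
    Estimates P(at least one of k randomly drawn samples passes)."""
    if n - c < k:
        return 1.0  # fewer than k failures, so every size-k draw contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical numbers: with 8 samples and 3 passing, pass@1 is 0.375
# while pass@8 is 1.0 -- which is why pass@k and pass@1 aren't comparable.
print(pass_at_k(8, 3, 1))  # 0.375
print(pass_at_k(8, 3, 8))  # 1.0
```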
Also, note...
Do group conversations count?
I would agree that the median one-on-one conversation for me is equivalent to something like a mediocre blogpost (though I think my right tail is longer than yours: I'd say my favorite one-on-one conversations were about as fun as watching some of my favorite movies).
But in groups, my median shifts toward an 80th-percentile YouTube video (or maybe the average curated post here on LessWrong).
It does feel like a wholly different activity, so it might not be the answer you're looking for. For example, group conversations are in one way inherently less draining: you're not forced to either speak or actively listen 100% of the time.
My assumption is that many of these successes would tend to be widely distributed around some mean, rather than being narrowly concentrated at one point.
So if a joke needs to be 7/10 funny to get a laugh, but a comedian delivers what is actually a 6.5/10 joke, you'll still get some subset of people who find it funnier than it is, so it gets a roughly proportionate number of laughs.
There's probably some inefficiency, but because of this effect, I think the number of laughs (or upvotes) gives quite good information about the perceived quality of the joke (or post).
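To make that concrete, here's a minimal sketch (the σ = 1 noise and the 7/10 threshold are illustrative assumptions, not anything measured):

```python
from statistics import NormalDist

# Assume each audience member perceives the joke's quality as
# true_quality + noise, with noise ~ Normal(0, sigma), and laughs
# if their perceived quality clears the threshold.
def laugh_fraction(true_quality: float, threshold: float = 7.0, sigma: float = 1.0) -> float:
    return 1.0 - NormalDist(mu=true_quality, sigma=sigma).cdf(threshold)

print(laugh_fraction(6.5))  # ~0.31: a 6.5/10 joke still gets ~31% of the room
print(laugh_fraction(7.5))  # ~0.69: a 7.5/10 joke gets ~69%
```

The fraction laughing rises smoothly with the joke's true quality, which is why the count still carries good information despite noisy individual judgments.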
I asked GPT 4.5 to write a system prompt and user message for models to write Pilish poems, feeding it your comment as context.
Then I gave these prompts to o1 (via OpenAI's playground).
GPT 4.5's system prompt
You are an expert composer skilled in writing poetry under strict, unusual linguistic constraints, specifically "Pilish." Pilish is a literary constraint in which the length of consecutive words precisely matches each digit of π (pi). The first word contains 3 letters, second word 1 letter, third word 4 letters, fourth word 1 letter, fifth word 5 letters, sixth word 9 letters, and so forth, accurately reflecting the sequence of pi’s digits.
For example, the classic Pilish sentence is:
"How I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."
Sorry, this is the most annoying kind of nitpicking on my part, but since I guess it's probably relevant here (and for your other comment responding to Stanislav down below): the center point of the year is July 2, 2025. So we're just over two weeks past the absolute midpoint – that's about 54.4% of the way through the year.
Also, the codex-1 benchmark results were released on May 16, while Claude 4's were announced on May 22 (both certainly before the midpoint).
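(For the arithmetic, a quick sketch of the day-count convention that reproduces both figures; measuring to noon of the given day is my assumption, as is July 18 for "today":)

```python
from datetime import date

def fraction_of_year(d: date) -> float:
    # Fraction of the year elapsed at noon on day d (count half of d itself).
    start = date(d.year, 1, 1)
    days_in_year = (date(d.year + 1, 1, 1) - start).days
    return ((d - start).days + 0.5) / days_in_year

print(f"{fraction_of_year(date(2025, 7, 2)):.3f}")   # 0.500: noon July 2 is the exact midpoint
print(f"{fraction_of_year(date(2025, 7, 18)):.3f}")  # 0.544: ~54.4% of the way through
```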