The only correctness filters are the hidden test cases (as is standard in most competitive coding competitions). You can check the leaderboard: positions correlate with the cumulative time taken to solve problems and with Codex assists. If there are any hidden metrics, I wouldn't know.
If so, how was Codex deployed solo? Did they just sample it many times on the same prompt until it produced something that passed the tests? Or something more sophisticated?
They didn't reveal this publicly. We can only guess here.
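If the guess is simply "resample the same prompt until something passes", a minimal sketch might look like the following. This is purely speculative: generate_candidate is a hypothetical stand-in for whatever model call they actually used, and it assumes a local test harness rather than the hidden one.

    import subprocess
    import sys
    import tempfile

    def generate_candidate(prompt: str) -> str:
        """Hypothetical stand-in for one Codex sample on the same prompt."""
        raise NotImplementedError("replace with an actual model call")

    def passes_tests(source: str, tests: list[tuple[str, str]]) -> bool:
        """Run candidate source against (stdin, expected stdout) pairs."""
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(source)
            path = f.name
        for stdin_data, expected in tests:
            try:
                result = subprocess.run(
                    [sys.executable, path],
                    input=stdin_data, capture_output=True, text=True, timeout=5,
                )
            except subprocess.TimeoutExpired:
                return False
            if result.stdout.strip() != expected.strip():
                return False
        return True

    def solve(prompt: str, tests: list[tuple[str, str]], max_samples: int = 100) -> str | None:
        # Resample the identical prompt until a candidate passes, or give up.
        for _ in range(max_samples):
            candidate = generate_candidate(prompt)
            if passes_tests(candidate, tests):
                return candidate
        return None

Anything "more sophisticated" (reranking samples, feeding failures back into the prompt, etc.) would sit on top of a loop like this.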
...This makes no sense to me. Do you assume solo-
It's hard to monitor most work in the short term, so longer-term engagements make it possible to adjust role and compensation based on years of output rather than the latest delivery.
Fair point. I agree: I'm exaggerating the effectiveness of certain elements and downplaying the necessity of others.
That said, there's an inherent survivorship bias favouring longer-term contracts, because we've never experienced an efficient short-term engagement model at scale before. But I do believe this adjustment buffer will shorten with time, as the t...
Yes, but there's generally a long enough buffer before messenger apps change your status.
Working on something personal, reading a blog, general web surfing, and the like constitute, I feel, 80% of "alt work" sessions. These scenarios won't register on an instant messenger as "away". It's not about going out for a one-hour walk in the middle of the day without informing anyone; it's about these bursts of freedom, and the ability to switch context, unmonitored.
Also, pinging someone for feedback, checking their status, or organizing group activities seems like a less efficient monitoring medium than constantly keeping them in your range of vision.
I had the same experience (50 mins for the first problem, as seen in the post). I agree, it's possible the server issues biased the stats significantly.