The only correctness filters are the hidden testcases (as is standard in most competitive coding competitions). You can check the leaderboard: the positions correlate with the cumulative time taken to solve problems and the number of Codex assists. If there are any hidden metrics, I wouldn't know.
If so, how was Codex deployed solo? Did they just sample it many times on the same prompt until it produced something that passed the tests? Or something more sophisticated?
They didn't reveal this publicly. We can only guess here.
This makes no sense to me. Do you assume solo-Codex exploited the prompts submitted by other competitors? Or that the assistant-Codexes communicated with each other somehow? I kinda doubt either of those happened.
After I was done, I played around with Codex (from a new account). You could only use Codex in the editors within problems. In one of the problems, I cleared the editor and just put in a simple prompt (unrelated to the problem). I remember that in one of the assists, it actually generated the code for that specific problem. This is why I assumed there was some state saving or context awareness.
It's hard to monitor most work in the short term, so having the engagements be longer-term makes it possible to adjust job and compensation based on years of output rather than the latest delivery.
Fair point. I agree, I am exaggerating the effectiveness of certain elements. And downplaying the necessity of others.
That said, there's an inherent survivorship bias in favour of longer-term contracts, because we've never experienced an efficient short-term engagement model at scale before. But I do believe this adjustment buffer will shorten with time, as the trend towards finer-grained hiring accelerates. And short-term alignment and work efficiency will increase as everyone adapts to a "faster" work culture.
Yes, but there's generally a long enough buffer before the messenger apps change status.
Working on something personal, reading some blog, general web surfing, etc., I feel, constitute 80% of "alt work" sessions. These scenarios won't register on an instant messenger as "away". It is not about going out for a one-hour walk in the middle of the day without informing anyone. It is about these bursts of freedom, and the ability to switch context, unmonitored.
Also, pinging someone for feedback, checking someone's status, or organizing group activities seems like a less efficient way of monitoring than constantly being in their range of vision.
I had the same experience (50 mins for the first problem, as seen in the post). I agree, it is possible that the server issues biased the stats greatly.