All of Javier's Comments + Replies

Javier10

We're at the wooden table with benches that seats 6-8 people.

Javier10

We accidentally created another event for this meetup. Since more people have RSVP'd on the other one, I will use it as a source of truth about who's coming and default to it for future communications. I recommend you RSVP there too if you haven't yet. Apologies for the inconvenience.

Javier10

Congrats on the excellent work! I've been following the LLM forecasting space for a while and your results are really pushing the frontier.

Some questions and comments:

  1. AI underconfidence: The AI looks underconfident at <10% and >90%. This is somewhat apparent from the calibration curves in figures 3b (especially) and 3c (less so), though I'm not sure about this because the figures don't have confidence intervals. However, table 3 (the AI ensemble outperforms the crowd when the crowd is uncertain, but the crowd outperforms the AI ensemble overall) and figure 4c
...
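On the missing confidence intervals: for calibration curves like those in figures 3b and 3c, a bootstrap over the resolved questions in each bin gives a quick interval. A minimal sketch (the function name, bin scheme, and 95% level are my own choices, not from the paper):

```python
import numpy as np

def calibration_curve_with_ci(probs, outcomes, bins=10, n_boot=1000, seed=0):
    """Bin forecasts and bootstrap a 95% CI for the observed frequency in each bin."""
    rng = np.random.default_rng(seed)
    probs, outcomes = np.asarray(probs), np.asarray(outcomes)
    edges = np.linspace(0, 1, bins + 1)
    results = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Last bin is closed on the right so probability 1.0 is included.
        mask = (probs >= lo) & ((probs < hi) if hi < 1 else (probs <= hi))
        if mask.sum() == 0:
            continue
        obs = outcomes[mask]
        # Resample resolved questions within the bin with replacement.
        boot = rng.choice(obs, size=(n_boot, obs.size), replace=True).mean(axis=1)
        results.append(((lo + hi) / 2, obs.mean(),
                        np.percentile(boot, 2.5), np.percentile(boot, 97.5)))
    return results  # list of (bin midpoint, observed freq, CI low, CI high)
```

With intervals like these on the low and high bins, the underconfidence claim would be easy to check directly.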
Fred Zhang
Great questions, and thanks for the helpful comments!

  1. We have not tried explicit extremizing. But in the study where we average our system's predictions with the community crowd, we find results better than both (under Brier score). This effectively does the extremizing in those <10% cases.
  2. We were not aware of this! We always take an unweighted average across the retrieval dates when evaluating our system. If we put more weight on the later retrieval dates, the gap between our system and humans should be a bit smaller, for the reason you said.
  3. No, we have not tried. One alternative is to sample k random or uniformly spaced intervals within [open, resolve]. Unfortunately, this is not quite kosher, as it leaks the resolution date, which, as we argued in the paper, correlates with the label.
  4. This is on the validation set. Notice that the figure caption begins with "Figure 4: System strengths. Evaluating on the validation set, we note".
  5. We will update the paper soon to include the log score.
  6. See here for some alternatives in time series modeling. I don't know what the perfect choice is in judgemental forecasting, though; I'm not sure it has been studied at all (probably something of an open question). Generally, the keywords to Google are "autocorrelation standard error" and "standard error in time series".
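The autocorrelation-robust standard error mentioned above can be sketched with a Newey-West (Bartlett-kernel) estimator for the mean of a daily score series. This is a generic illustration of the keyword, not code from the paper, and the lag rule of thumb is one common convention among several:

```python
import numpy as np

def newey_west_se(x, max_lag=None):
    """HAC (Newey-West) standard error for the mean of an autocorrelated series.

    A plain s/sqrt(n) understates uncertainty when daily score differences
    are autocorrelated; this inflates the variance estimate with
    Bartlett-weighted autocovariances up to max_lag.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    if max_lag is None:
        # A common rule of thumb for the truncation lag.
        max_lag = int(np.floor(4 * (n / 100.0) ** (2.0 / 9.0)))
    xc = x - x.mean()
    var = xc @ xc / n  # lag-0 autocovariance
    for lag in range(1, max_lag + 1):
        w = 1.0 - lag / (max_lag + 1.0)   # Bartlett kernel weight
        gamma = xc[lag:] @ xc[:-lag] / n  # lag-`lag` autocovariance
        var += 2.0 * w * gamma
    return np.sqrt(var / n)
```

For an i.i.d. series this reduces to roughly the classical standard error; for positively autocorrelated score differences (the typical case when adjacent days share unresolved questions) it is larger, widening the resulting intervals.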