Congratulations! I wish we could have collaborated while I was in school, but I don't think we were researching at the same time. I haven't read your actual papers, so feel free to answer "you should check out the paper" to my comments.
For chapter 4: From the high-level summary here it sounds like you're offloading the task of aggregation to the forecasters themselves. It's odd to me that you're describing this as arbitrage. Also, I have frequently seen the scoring rule used with some intermediary function to determine monetary rewards. For example, when I worked with IARPA on geopolitical forecasting, our forecasters would get financial rewards depending on what percentile they were in relative to other forecasters. One would imagine that this would eliminate the incentive to report the aggregate as your own answer, but there's a reason we (the researcher/platform/website) aggregate individual forecasts: it's simply more accurate under typical conditions. In theory an individual forecaster could improve that aggregate by forming their own independent forecast before seeing the work of others and then aggregating, but in practice the impact of an individual forecast is quite small. I'll have to read about QA pooling; it's surprising to me that you could disincentivize forecasters from reporting the aggregate as their individual forecast.
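To illustrate why the aggregate is so hard to beat (and hence why reporting something close to it stays tempting even under rank-based payouts), here's a minimal simulation, assuming unbiased but noisy forecasters and Brier scoring; all numbers are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

n_questions, n_forecasters = 1000, 30
true_p = rng.uniform(0.05, 0.95, n_questions)   # latent event probabilities
outcomes = rng.binomial(1, true_p)              # realized outcomes

# Unbiased but noisy individual forecasts (logit-normal noise around the truth)
logit = np.log(true_p / (1 - true_p))
noise = rng.normal(0, 1.0, (n_forecasters, n_questions))
forecasts = 1 / (1 + np.exp(-(logit + noise)))

brier = (forecasts - outcomes) ** 2             # per-forecaster, per-question
mean_forecast = forecasts.mean(axis=0)          # simple linear pool

print("mean individual Brier:", brier.mean())
print("linear-pool Brier:    ", ((mean_forecast - outcomes) ** 2).mean())
# The pool typically beats almost every individual, which is why reporting
# (something close to) the aggregate is tempting under per-forecaster scoring.
```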
For chapter 7: It seems to me that under sufficiently pessimistic conditions, there would be no good way to aggregate those two forecasts. For example, if Alice and Bob are forecasting "Will AI cause human extinction in the next 100 years?", they both might individually forecast ~0% for different reasons. Alice believes it is impossible for AI to get powerful enough to cause human extinction, but if it were capable of acting it would kill us all. Bob believes any agent smart enough to be that powerful would necessarily be morally upstanding and believes it's extremely likely that it will be built. Any reasonable aggregation strategy will put the aggregate at ~0% because each individual forecast is ~0%, but if they were to communicate with one another they would likely arrive at a much higher number. I suspect that you address this in the assumptions of the model in the actual paper.
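To make that concrete with purely illustrative numbers (not from the paper), suppose each of them is implicitly computing P(extinction) = P(AI capable enough) × P(extinction | capable):

```python
# Illustrative numbers only: each forecaster decomposes
# P(extinction) = P(AI capable) * P(extinction | capable).
alice = {"capable": 0.01, "extinction_given_capable": 0.99}
bob   = {"capable": 0.95, "extinction_given_capable": 0.01}

p_alice = alice["capable"] * alice["extinction_given_capable"]   # ~0.01
p_bob   = bob["capable"] * bob["extinction_given_capable"]       # ~0.01

linear_pool = 0.5 * (p_alice + p_bob)                            # ~0.01
# If they instead shared reasons, taking Bob's capability estimate and
# Alice's conditional gives a very different answer:
combined = bob["capable"] * alice["extinction_given_capable"]    # ~0.94

print(linear_pool, combined)
```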
Congrats again, I enjoyed your high level summary and might come back for a more detailed read of your papers.
What do you think Metz did that was unethical here?
Soft downvoted for encouraging self-talk that I think will be harmful for most of the people here. Some people might be able to jest at themselves well, but I suspect most will have their self-image slightly negatively affected by thinking of themselves as an idiot.
Most of the individual things you recommend considering are indeed worth considering.
Interesting work, congrats on achieving human-ish performance!
I expect your model would look relatively better under other proper scoring rules. For example, logarithmic scoring would punish the human crowd for giving <1% probabilities to events that do sometimes happen. Under the Brier score, the worst possible score is either a 1 or a 2 depending on how it's formulated (from skimming your paper, it looks like 1 to me). Under a logarithmic score, such forecasts would be severely punished. I don't think this is something you should lead with, since Brier scores are the more common scoring rule in the literature, but it seems like an easy win and would highlight the possible benefits of the model's relatively conservative forecasting.
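As a quick illustration of the difference (with made-up forecasts), compare a conservative 30% forecast to an extreme 0.5% forecast on an event that ends up happening:

```python
import numpy as np

def brier(p, outcome):
    return (p - outcome) ** 2

def log_score(p, outcome):                 # negative log-likelihood, lower is better
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(outcome * np.log(p) + (1 - outcome) * np.log(1 - p))

# An event that does occur: a conservative 30% forecast vs an extreme 0.5% forecast.
for p in (0.30, 0.005):
    print(f"p={p:.3f}  Brier={brier(p, 1):.3f}  log={log_score(p, 1):.2f}")
# Brier: 0.490 vs 0.990 (bounded); log: 1.20 vs 5.30 (unbounded),
# so the log score punishes near-zero forecasts on resolving events far more harshly.
```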
I'm curious how a more sophisticated human-machine hybrid would perform with these much stronger machine models; I expect quite well. I did some research with human-machine hybrids before and found modest improvements from incorporating machine forecasts (e.g. chapter 5, section 5.2.4 of my dissertation Metacognitively Wise Crowds, and the sections "Using machine models for scalable forecasting" and "Aggregate performance" in Hybrid forecasting of geopolitical events), but the machine models we were using were very weak on their own (depending on how I analyzed things, they were outperformed by guessing). In "System Complements the Crowd", you take a linear average of the crowd's full aggregate and the machine model's forecast, but we found that treating the machine as an exceptionally skilled forecaster resulted in the best performance of the overall system. With this method, the machine forecast is down-weighted in the aggregate as more humans forecast on the question, which we found helped performance. You would need access to the individuated data of the forecasting platform to do this, however.
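Here's a rough sketch of the difference between the two schemes (the machine_weight of 5 is an illustrative placeholder, not the value we actually used):

```python
import numpy as np

def blend_fixed(human_forecasts, machine_forecast, alpha=0.5):
    # Linear average of the full crowd aggregate with the machine forecast.
    return alpha * np.mean(human_forecasts) + (1 - alpha) * machine_forecast

def blend_as_skilled_forecaster(human_forecasts, machine_forecast, machine_weight=5.0):
    # Treat the machine as one (up-weighted) forecaster inside the crowd,
    # so its influence shrinks as more humans forecast on the question.
    n = len(human_forecasts)
    return (np.sum(human_forecasts) + machine_weight * machine_forecast) / (n + machine_weight)

humans = np.array([0.6, 0.7, 0.65, 0.8, 0.75, 0.7, 0.6, 0.72, 0.68, 0.66])
machine = 0.4
print(blend_fixed(humans, machine))                 # machine always gets 50% weight
print(blend_as_skilled_forecaster(humans, machine)) # machine weight decays with crowd size
```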
If you're looking for additional useful plots, you could look at Human Forecast (probability) vs AI Forecast (probability) on a question-by-question basis and get a sense of how the humans and AI agree and disagree. For example, is the better performance of the LM forecasts due to disagreeing about direction, or mostly due to marginally better calibration? This would be harder to plot for multinomial questions, although there you could plot the probability assigned to the correct response option as long as the question isn't ordinal.
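A minimal version of that plot might look something like this (assuming you have per-question human and LM probabilities plus binary resolutions in hand):

```python
import matplotlib.pyplot as plt

def plot_human_vs_ai(human_probs, ai_probs, resolutions):
    """Question-by-question scatter of human vs LM forecasts, colored by outcome."""
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.scatter(human_probs, ai_probs, c=resolutions, cmap="coolwarm", alpha=0.6)
    ax.plot([0, 1], [0, 1], linestyle="--", color="grey")  # agreement line
    ax.set_xlabel("Human crowd forecast")
    ax.set_ylabel("LM forecast")
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    return fig

# Points far from the diagonal are directional disagreements; points near it
# but pulled toward 0.5 on one axis suggest a calibration/extremization gap.
```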
I see that you only answered Binary questions and that you split multinomial questions. How did you do this? I suspect you did this by rephrasing questions of the form "What will $person do on $date, A, B, C, D, E, or F?" into "Will $person do A on $date?", "Will $person do B on $date?", and so on. This will result in a lot of very low probability forecasts, since it's likely that only A or B occurs, especially closer to the resolution date. Also, does your system obey the Law of total probability (i.e. does it assign exactly 100% probability to the union of A, B, C, D, E, and F)? This might be a way to improve performance of the system and coax your model into giving extreme forecasts that are grounded in reality (simply normalizing across the different splits of the multinomial question here would probably work pretty well).
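If the splits aren't currently normalized, a simple post-processing step along these lines might already help (assuming the split options are mutually exclusive and exhaustive):

```python
def normalize_splits(option_probs):
    """Rescale per-option binary forecasts so they respect the law of total probability.

    option_probs: dict mapping option name -> probability from the binary-split prompt.
    """
    total = sum(option_probs.values())
    if total == 0:
        return {k: 1 / len(option_probs) for k in option_probs}  # fall back to uniform
    return {k: v / total for k, v in option_probs.items()}

# e.g. raw binary-split forecasts that sum to 1.3 get rescaled to sum to 1:
print(normalize_splits({"A": 0.55, "B": 0.45, "C": 0.10, "D": 0.10, "E": 0.05, "F": 0.05}))
```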
Why do human and LM forecasts differ? You plot calibration, and the human and LM forecasts are both well calibrated for the most part, but with your focus on system performance I'm left wondering what caused the human and LM forecasts to differ in accuracy. You claim that it's because of a lack of extremization on the part of the LM forecast (i.e. that it gives too many 30-70% forecasts, while humans give more extreme forecasts), but is that an issue of calibration? You seem to say that it isn't, in which case the problem isn't that the model is outputting the wrong forecast given what it knows (i.e. that it "hedge[s] predictions due to its safety training"), but rather that it is giving its best estimate of the probability given what it knows. The problem with e.g. the McCarthy question (example output #1) seems to me to be that the system does not understand the passage of time, so it has no sense that, because it has information from November 30th and is being asked a question about what happens on November 30th, it can answer with confidence. This is a failure in reasoning, not calibration, IMO. It's possible I'm misunderstanding what cutoff is being used for example output #1.
Miscellaneous question: In equation 1, is k 0-indexed or 1-indexed?
The second thing that I find surprising is that a lie detector based on ambiguous elicitation questions works. Again, this is not something I would have predicted before doing the experiments, but it doesn’t seem outrageous, either.
I think we can broadly put our ambiguous questions into 4 categories (although it would be easy to find more questions from more categories):
Somewhat interestingly, humans who answer nonsensical questions (rather than skipping them) generally do worse at tasks: pdf. There are some other citations in there on nonsensical/impossible questions if you're interested ("A number of previous studies have utilized impossible questions...").
It seems plausible to me that this is a trend in human writing more broadly that the LLM picked up on. Specifically, answering something with a false answer is associated with a bunch of things: one of them is deceit, another is mimicking the behavior of someone who doesn't know the answer or doesn't care about the instructions given to them. So, since that behavior exists in human writing in general, the LLM picks it up and exhibits it in its own writing.
See this comment.
You edited your parent comment significantly in such a way that my response no longer makes sense. In particular, you had said that it was itself misleading for Elizabeth to summarize this comment thread as someone else being misleading.
In my opinion, editing your own content in this way without indicating that this is what you have done is dishonest and a breach of internet etiquette. If you wanted to do this in a more appropriate way, you might say something like "Whoops, I meant X. I'll edit the parent comment to say so.", then edit the parent comment to say X and include some disclaimer like "Edited to address Y".
Okay, onto your actual comment. That link does indicate that you have read Elizabeth's comment, although I remain confused about why your unedited parent comment expressed disbelief about Elizabeth's summary of that thread as claiming that someone else was misleading.
I took Tristan to be using "sustainability" in the sense of "lessened environmental impact", not "requiring little willpower".
The section "Frame control" does not link to the conversation you had with wilkox, but I believe you intended for there to be one (you encourage readers to read the exchange). The link is here: https://www.lesswrong.com/posts/Wiz4eKi5fsomRsMbx/change-my-mind-veganism-entails-trade-offs-and-health-is-one?commentId=uh8w6JeLAfuZF2sxQ
In the comment thread you linked, Elizabeth stated outright what she found misleading: https://forum.effectivealtruism.org/posts/3Lv4NyFm2aohRKJCH/change-my-mind-veganism-entails-trade-offs-and-health-is-one?commentId=mYwzeJijWdzZw2aAg
Getting the paper author on EAF did seem like an unreasonable stroke of good luck.
I wrote out my full thoughts here, before I saw your response, but the above captures a lot of it. The data in the paper is very different from what you described. I think it was especially misleading to give all the caveats you did without mentioning that pescetarianism tied with veganism in men and surpassed it for women.
I expect people to read the threads that they are linking to if they are claiming someone is misguided, and I do not think that you did that.
I'm pretty sure that this is incorrect compared to healthcare more broadly, although the best I can come up with is this meta-analysis: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0226361&type=printable
Which has this to say: