Have you tried iterating on this? Like, the "I don't care about the word 'prodrome'" point sounds like the kind of thing you could include in your prompt, and reiterate until everything you don't like about the LLM's responses is fixed or you run out of ideas.
Also fyi ChatGPT Deep Research uses the "o3" model, not 4o, even if it says 4o at the top left (you can try running Deep Research with any of the models selected in the top left and it will output the same kind of thing).
o3 was RLed (!) into being particularly good at web search (and tangential skills like avoiding suspicious links), and isn't released in a way that lets you just chat with it. The output isn't even raw o3, it's the o3-mini model summarizing o3's chain of thought (where o3 will think things, send a dozen tentacles out into the web, then continue thinking).
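If it helps to picture it, here's a toy sketch of that kind of loop (every function in it is a placeholder I made up, not anything from OpenAI's actual API):

```python
# Toy sketch of the loop described above: a "big" reasoning model alternates
# between thinking and firing off web searches, and a "small" model summarizes
# the whole trace for the user. All helpers here are hypothetical placeholders.

from dataclasses import dataclass, field

@dataclass
class Step:
    thoughts: str
    queries: list = field(default_factory=list)  # web searches to run this step
    done: bool = False

def reason_step(trace: list) -> Step:
    # Placeholder for the o3-style model call: read the trace so far, decide what to do next.
    return Step(thoughts="(model reasoning here)", done=True)

def run_web_searches(queries: list) -> list:
    # Placeholder for the search tool: fetch and condense pages for each query.
    return [f"(results for {q})" for q in queries]

def summarize_trace(trace: list) -> str:
    # Placeholder for the o3-mini-style summarizer that produces what the user sees.
    return "\n".join(trace)

def deep_research(question: str, max_steps: int = 20) -> str:
    trace = [f"User question: {question}"]
    for _ in range(max_steps):
        step = reason_step(trace)        # think...
        trace.append(step.thoughts)
        if step.queries:                 # ...send "tentacles" out into the web...
            trace.extend(run_web_searches(step.queries))
        if step.done:                    # ...then keep thinking, until done
            break
    return summarize_trace(trace)        # the summarized chain of thought is the output
```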
I learned this when I asked Deep Research to reverse engineer itself, and it linked the model card, which in retrospect I should have checked first and was foolish not to.
Anyway I mention this because afaik all the other deep research frameworks are a lot less specialized than OpenAI's, and more like "we took an LLM and gave it access to the internet and let it think and search for a really long time". I expect OpenAI to continue being SOTA here for a while.
Though I do enjoy using Grok's "DeepSearch" and "DeeperSearch" functions sometimes; they're free and fun to watch, but terrible at understanding user intent, which I attribute to how inflexible the tool is: it won't listen to suggestions on where to look first or how to structure its research, relying instead on whatever system prompt it was given. You might want to check it out and update this post.
This is really useful, thanks!
I've also spent a lot of time trying to delegate research, and it's hard to reach the break-even point where explaining what you want and how to search for it, even to a bright student in the field, takes less time than just doing it yourself.
I think any proper comparison of LLMs has to take into account prompting strategies. There are two questions: which is easiest to get results from, and which is best once you learn to use it. And the best prompts very likely vary across systems.
OpenAI's Deep Research is by far the most use I've gotten from an LLM system for any purpose thus far. I tried Gemini's deep research and it was barely useful; it just didn't have enough flexibility, and was limited to a short prompt.
I don't think I've come close to capping out on learning to prompt it. I have started iterating on searches to sort of chain together multiple OAI Deep Research runs. You can coax it into providing more sources (I can't bring myself to bully it, but that probably works at least as well); my last run cited 51 sources, so apparently it's not strictly capped at 50.
I doubt there's much difference in quality of research if you just want a list of sources to go read yourself - which is most often what I want. But the OAI version (which is indeed based on a fine-tuned version of the to-be-released "smartest" model yet made, o3) is very good at synthesizing research. So on things where I just want a good-enough answer with no further effort on my own part, it shines. This has helped me immensely in answering questions I just haven't made time for because they aren't in my own research area, like why we don't have online micropayments in common use or what's a good noise-canceling bluetooth mic. It's really nice to have good-enough answers to questions very quickly, and it can unblock people who haven't gotten around to the research that would fix pain points in their lives.
I can't imagine paying $200/month for it perpetually; I'll probably go back to using it sparingly for those situations and cheaper versions for just searching for sources to read for research. But having essentially unlimited decent-quality research reports available has been a game-changer for me. As we get better at using these things, I hope they improve our general wisdom level: it becomes easier to understand relevant-but-not-crucial current topics much more quickly.
Thanks, this is helpful! After reading this post I bought ChatGPT Plus and tried a question on Deep Research:
Please find literature reviews / meta-analyses on the best intensity at which to train HIIT (i.e. maximum sustainable speed vs. leaving some in the tank)
I got much worse results than you did.
In fact my results are so much worse that I suspected I did something wrong.
Link to my chat: https://chatgpt.com/share/67e30421-8ef4-8011-973f-2b39f0ae58a4
Last week I asked something similar on Perplexity (I don't have the chat log saved); it correctly understood what I wanted and reported that there were no studies answering my question. I believe Perplexity is correct, because I also could not find any relevant studies on Google Scholar.
I looked through ChatGPT again and figured out that I did in fact do it wrong. I had found Deep Research by going to the "Explore GPTs" button in the top right, which AFAICT searches through custom modules made by third parties. The OpenAI-brand Deep Research is accessed by clicking the "Deep research" button below the chat input text box.
As regular readers are aware, I do a lot of informal lit review. So I was especially interested in checking out the various AI based “deep research” tools and seeing how they compare.
I did a side-by-side comparison, using the same prompt, of Perplexity Deep Research, Gemini Deep Research, ChatGPT-4o Deep Research, Elicit, and PaperQA.
General Impressions
The Deep Research bots are useful, but I wouldn’t consider them a replacement for my own lit reviews.
None of them produce really big lit reviews — they’re all typically capped at 40 sources. If I’m doing a “heavy” or exhaustive lit review, I’ll go a lot farther. (And, in fact, for the particular project I used as an example here, I intend to do a manual version to catch things that didn’t make it into the AI reports.)
On the other hand, a manual lit review can take me a really long time, and the AIs work a lot faster (2-15 minutes depending on the tool). It might be worth using Deep Research tools as a first step before doing my own in-depth version, or using them as a quick “better-than-nothing” option for times when I don’t expect to have time to do a real lit review.
I tend to think that even as AIs get better at “knowledge work”, there will still be value in learning things yourself (so you make connections and notice what you want to do next) and in seeking out information curated and interpreted by a particular human whose perspective you value. I don’t feel superfluous yet.
In general, I found that none of the Deep Research tools had a problem with overt hallucination (that is, making claims about the contents of sources that weren't in the actual sources).
Most of the variation was in completeness, relevance, and source quality — how many relevant answers to my question did they include, how much of the info I asked for was present, and how many of the sources were actual research papers rather than commercial or public-facing websites like WebMD.
The better tools (ChatGPT, Elicit, and PaperQA) took longer to complete than the weaker tools (Perplexity & Gemini), but since no tool took more than 20 minutes, I don't think there's much advantage in getting faster, lower-quality Deep Research reports.
Prompt
I used the exact same prompt for all Deep Research tools:
Perplexity Deep Research
Here’s the link to Perplexity’s results.
Completeness: C
Perplexity used 15 sources to come up with 8 prodromal conditions.
7/8 of those included information about the time between prodromal findings and disease diagnosis; none included information about the chance that a prodromal condition would progress to a disease diagnosis.
Relevance: C
Only 6/8 were examples of prodromes that occurred at least a year before diagnosis of a chronic disease.
Credibility: B
Not all the sources were published papers — they included popular sites like WebMD and Medical News Today — but all claims were correctly linked to the corresponding source.
Overall Grade: C+
Gemini Advanced Deep Research
Here’s the link to Gemini’s results.
Completeness: B-
Gemini used 38 sources to come up with 6 prodromal conditions.
5/6 of those included information about the time between prodromal findings and disease diagnosis; 3/6 included information about the chance that a prodromal condition would progress to a disease diagnosis.
Relevance: A
All 6 conditions were examples of prodromal findings occurring at least a year before diagnosis of a chronic disease.
Credibility: B-
Not all the sources were published papers — they included popular sites like WebMD, Medical News Today, Healthline, etc. — and individual claims were not linked to sources, which made checking claims more difficult.
Overall Grade: B
ChatGPT-4o Deep Research
Here’s the link to ChatGPT’s results.
Completeness: A
ChatGPT used 33 sources to come up with 15 prodromal conditions.
All included information on time between prodromal findings and disease diagnosis, and all included information on the chance that a prodromal condition would progress to a disease diagnosis.
Relevance: A
All 15 conditions were examples of prodromal findings occurring at least a year before diagnosis of a chronic disease.
Credibility: A
Sources are research papers; claims are linked to specific sources.
Overall Grade: A
Elicit Research Report
Here’s the link to Elicit’s results.
Completeness: B+
Elicit used 40 sources to come up with 6 prodromal conditions, all neurological. All included information on the time between prodromal findings and disease diagnosis; 3/6 included information on the chance that a prodromal condition would progress to a disease diagnosis.
Relevance: A
All 6 conditions were examples of prodromal findings occurring at least a year before diagnosis of a chronic disease.
Credibility: A+
All sources were research papers. Elicit offers options to specify “inclusion criteria” about which papers should be included in a lit review; it also links claims to specific highlighted snippets in sources, to make spot-checking especially easy.
Overall Grade: A-
PaperQA
PaperQA is a beta product from FutureHouse; it doesn't (yet) allow public sharing of query results.
Completeness: A-
PaperQA used 38 sources to come up with 11 prodromal conditions. 9/11 conditions included information on time between prodromal findings and disease diagnosis; 4/11 included information on the chance that a prodromal condition would progress to a disease diagnosis.
Relevance: A
All 11 conditions were examples of prodromal findings occurring at least a year before diagnosis of a chronic disease.
Credibility: A
Sources are research papers; claims are linked to specific sources.
Overall Grade: A
Final Thoughts: “Creativity”
If you do a Google Scholar search for “prodrome”, you’ll overwhelmingly see results on neurological and psychiatric conditions, because those are the contexts where people use that word to refer to precursor symptoms in a progressive disease. It’s standard to talk about a “schizophrenia prodrome.”
But I don’t care so much about the word “prodrome”; my interest is in finding chronic progressive diseases that are hard to treat by the time they’re diagnosed, where it might make more sense to study the prodromes to see if the underlying progressive dysfunction can be slowed/stopped.
ChatGPT Deep Research was the only AI "creative" enough to give examples of cancers here. It's not normal to use the phrase "cancer prodrome". But there are lots of types of precancerous abnormal cell proliferation that increase the risk of progressing to actual malignant cancers. Adenomatous polyps are a sort of "pre-colon cancer", ductal carcinoma in situ is a "pre-breast cancer", prostatic intraepithelial neoplasia is a "pre-prostate cancer", etc. There are lots of cancers where we routinely screen for abnormal cell proliferation starting in middle age, and might remove a precancerous growth to prevent it from progressing to cancer.
Now, that’s not a particularly useful idea for my particular application.1 But I was impressed by the ability of ChatGPT to look beyond standard keywords and understand the logic of the query. A fair number of human research assistants might not do that.
In general, I’ve found research tasks pretty hard to delegate — just because it’s a “generalist” skill doesn’t mean everyone literate can do it. If you routinely use research assistants, or if you have more things to research than you have time to look into, I would recommend using one of the better Deep Research tools pretty heavily.
I’m interested in areas where more research into biomarkers or molecular signatures of a prodromal state would be especially valuable in treating/preventing the disease. Well-known examples of precancerous conditions, which are generally handled with surgery or watchful waiting, are mostly not that promising for additional research, though I suppose it couldn’t hurt to look for driver mutations in precancerous proliferation.