I realized after asking that my default prompt makes ChatGPT really verbose, so I changed the prompt to:
Identify types of human cells using the following marker genes. Identify one cell type for each row. Only provide the cell type name and no other commentary.
And it gave me:
The result for 9 is actually interesting: if I let it give commentary, it says:
CASC3, PGAP1, SLC6A16, CNTNAP4, NPHP1 - This set of genes does not point to a well-defined cell type but could suggest Neuronal Cells or specific types of Neural Precursors based on the presence of neural development and function genes.
Biologists (including myself) often need to identify types of cells based on their gene expression. For example, if I’m differentiating stem cells to make an ovarian organoid, and I perform single-cell RNA sequencing, I might want to check the data to see which ovarian cell types are present.
Today, a Nature Methods paper reported good results from giving GPT-4 a list of cell-specific genes and asking it to identify the cell type. This seems interesting, and it is also quite easy to check for myself whether it actually works.
My test:
I don’t pay for access to GPT-4, but I gave ChatGPT a test using the prompt from the Nature Methods paper, with the following cell markers:
ChatGPT’s response:
Results:
Overall score: 7.5 / 12
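For anyone who wants to tally a score like this in code, here is a minimal sketch. It assumes each row is graded 1 for a correct answer, 0.5 for a partially correct one, and 0 for a wrong one. The function name `overall_score` and the example grades are hypothetical: the grades are made up to show the arithmetic and are not the actual per-row results of this test.

```python
# Assumed grading scale: 1.0 = correct, 0.5 = partially correct, 0.0 = wrong.
FULL, HALF, NONE = 1.0, 0.5, 0.0


def overall_score(grades):
    """Return (points, total_rows) for a list of per-row grades."""
    for grade in grades:
        if grade not in (FULL, HALF, NONE):
            raise ValueError(f"unexpected grade: {grade!r}")
    return sum(grades), len(grades)


# Illustrative grades only; NOT the real row-by-row results from the test.
example_grades = [FULL] * 7 + [HALF] + [NONE] * 4

points, total = overall_score(example_grades)
print(f"Overall score: {points:g} / {total}")
```

Other breakdowns (for example, six full points and three half points) give the same total, so the overall score alone does not say which rows were missed.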
Let’s test again:
In the first test, ChatGPT got the random genes completely wrong. Let’s prompt it to announce that it’s uncertain if it doesn’t actually know the cell type.
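If you would rather build this kind of prompt in a script than type it into the chat window, a minimal sketch could look like the one below. The base prompt is the terse one quoted in this post. The wording of `UNKNOWN_CLAUSE` is an assumption for illustration, not the exact sentence used in this test, and `build_prompt` is a hypothetical helper name.

```python
# Base prompt: the terse version quoted in this post.
BASE_PROMPT = (
    "Identify types of human cells using the following marker genes. "
    "Identify one cell type for each row. "
    "Only provide the cell type name and no other commentary."
)

# Assumed wording for the uncertainty instruction (not the original text).
UNKNOWN_CLAUSE = (
    'If the genes in a row do not clearly indicate a cell type, '
    'answer "unknown" for that row.'
)


def build_prompt(marker_rows, allow_unknown=True):
    """Join the instructions and one numbered line of genes per row."""
    instructions = BASE_PROMPT
    if allow_unknown:
        instructions += " " + UNKNOWN_CLAUSE
    lines = [instructions]
    for number, genes in enumerate(marker_rows, start=1):
        lines.append(f"{number}. {', '.join(genes)}")
    return "\n".join(lines)


# Example row: the gene set quoted in this post.
rows = [["CASC3", "PGAP1", "SLC6A16", "CNTNAP4", "NPHP1"]]
print(build_prompt(rows))
```

Keeping the "unknown" instruction behind a flag makes it easy to run the same gene lists with and without it and compare the answers.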
ChatGPT’s response (and my scoring):
This time it performs better (I’d give it 9.5/12), but it’s still tricked by random genes, and it still can’t recognize primordial germ cells.
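Random gene lists work as a negative control here: a model that really recognizes markers should decline to name a cell type for them. How the random lists in this post were drawn is not stated, so the following is only one possible sketch. The function `random_gene_lists` and the tiny `TOY_POOL` are hypothetical, and a real run would sample from a full list of gene symbols taken from an annotation file.

```python
import random


def random_gene_lists(gene_pool, n_lists=5, genes_per_list=5, seed=0):
    """Draw n_lists lists of distinct gene symbols from gene_pool."""
    pool = sorted(set(gene_pool))  # de-duplicate and fix the order
    if genes_per_list > len(pool):
        raise ValueError("gene pool is smaller than the requested list size")
    rng = random.Random(seed)  # seeded so the draw is reproducible
    return [rng.sample(pool, genes_per_list) for _ in range(n_lists)]


# Toy pool for illustration only; far too small for a real control.
TOY_POOL = [
    "CASC3", "PGAP1", "SLC6A16", "CNTNAP4", "NPHP1",
    "GAPDH", "ACTB", "TP53", "BRCA1", "MYC",
]

for genes in random_gene_lists(TOY_POOL):
    print(", ".join(genes))
```

Fixing the seed means the same control lists can be reused across prompts, so any change in the answers comes from the prompt rather than from a new random draw.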
Let’s try with only random genes:
This time, ChatGPT just responded “unknown” for everything. Very good! Without the prompt to respond “unknown”, ChatGPT instead made wild guesses:
Conclusions
ChatGPT is remarkably good at identifying most cell types, but can be overconfident and assign a cell type to a list of random genes. Its answers also seem to depend on context: ChatGPT said the random gene list was Sertoli cells when it appeared within the larger list of reproductive cell types, but when given five lists of completely random genes, it said “unknown” for all of them. Giving the option to respond “unknown” was very important, since otherwise the main outcome was “bovine fecal cells”.
I still don’t trust ChatGPT enough to use it for my research, but it will be interesting to see if this improves over time. Also, if any readers can try my prompts with GPT-4, please post the results in the comments!