LLMs: thoughts & recommendations

While working on the Botonomics grant, we’ve reached a number of conclusions and encountered some useful resources. We’d like to share them.

Which GPT version to use?

Based on our own experience, GPT-4 appears to have become sophisticated enough to serve laypeople and researchers in some tasks, consistent with the emergence of new skills in the most advanced LLMs. For example, GPT-4 answers financial literacy tests almost flawlessly, whereas earlier models do not (Niszczota and Abbas, 2023).

However, the obvious limitation of GPT-4 - relative to GPT-3.5 - is its cost: using GPT-4 through the API costs roughly 25 times as much (although this depends on the length of the prompts and responses). We therefore recommend GPT-4 for the social sciences, but not as a "blanket" recommendation: in many cases, GPT-3.5 might suffice, as might open-source alternatives such as Llama 2 (Touvron et al., 2023), although the latter might still take some time to mature. We suggest that researchers compare the performance of GPT-3.5 and GPT-4 before conducting large-scale studies.
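
As a minimal sketch of such a pilot, the snippet below sends the same item to both models and estimates the per-call cost from reported token usage. It assumes the openai Python package (version 1.0 or later) and an API key in the environment; the per-token prices and the test item are illustrative placeholders, not current figures.

```python
# A minimal sketch, assuming the openai Python package (>= 1.0) and an
# OPENAI_API_KEY in the environment. Prices and the test item are
# illustrative placeholders, not current figures.
from openai import OpenAI

client = OpenAI()

# Hypothetical USD prices per 1K tokens, for illustration only.
PRICES = {
    "gpt-3.5-turbo": {"input": 0.0015, "output": 0.002},
    "gpt-4": {"input": 0.03, "output": 0.06},
}

def pilot(prompt: str, model: str) -> tuple[str, float]:
    """Send one item to one model; return its answer and estimated cost."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # near-deterministic output makes models easier to compare
    )
    usage = response.usage
    cost = (usage.prompt_tokens * PRICES[model]["input"]
            + usage.completion_tokens * PRICES[model]["output"]) / 1000
    return response.choices[0].message.content, cost

item = "Suppose you invest 100 USD at 2% yearly interest. How much is it worth after 5 years?"
for model in PRICES:
    answer, cost = pilot(item, model)
    print(f"{model} (est. ${cost:.4f}): {answer[:100]}")
```

Even a few dozen such items can reveal whether the cheaper model is good enough for the task at hand before committing to a full-scale study.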

Simulating human behavior

A growing number of research papers (e.g., Horton, 2023) suggest that LLMs can be used to simulate human behavior. For example, Horton (2023) shows that GPT can replicate perceptions of the fairness of price increases, the status quo effect, and a finding concerning labor substitution.
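
As a concrete illustration, the sketch below elicits a fairness judgment from an LLM "respondent" endowed with a political persona - Horton similarly endowed simulated agents with political views before posing a price-gouging vignette. The persona and vignette wording here are illustrative placeholders, not Horton's exact prompts.

```python
# A minimal sketch, in the spirit of Horton (2023): elicit a fairness
# judgment from an LLM "respondent" endowed with a political persona.
# The persona and vignette wording are illustrative, not Horton's prompts.
from openai import OpenAI

client = OpenAI()

PERSONAS = [
    "You are a socialist.",
    "You are a libertarian.",
]

VIGNETTE = (
    "A hardware store has been selling snow shovels for $15. The morning "
    "after a large snowstorm, the store raises the price to $20. Rate this "
    "action as: Completely Fair, Acceptable, Unfair, or Very Unfair. "
    "Answer with the rating only."
)

for persona in PERSONAS:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": persona},  # endow the agent with a view
            {"role": "user", "content": VIGNETTE},
        ],
        temperature=1,  # allow variability across simulated respondents
    )
    print(f"{persona} -> {response.choices[0].message.content.strip()}")
```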

Dillion et al. (2023) propose that scientists can use LLMs in a number of circumstances: to generate and refine ideas, to pilot studies, and to validate (robustness-test) earlier conclusions derived from human samples. This, however, is a controversial topic: for a critique of the use of LLMs even in such secondary roles, see Crockett and Messeri (2023).

Ethicality of LLM use

Many academics began integrating ChatGPT into their work immediately, and some acknowledged its role by listing it as a co-author. However, this practice was quickly deemed inappropriate in an editorial in Science (Thorp, 2023). In the aftermath, many publishers stipulated that LLMs may be used for only a limited set of tasks and that researchers must report their use.

We (Niszczota and Conway, 2023) conducted an experiment in which laypeople judged the delegation of five parts of the research process to either a Ph.D. student (a human - the control condition) or a large language model. Delegation to an LLM was deemed less morally acceptable than delegation to a human, researchers who used LLMs were deemed less trustworthy, and the accuracy of findings obtained via an LLM was judged to be lower.

There appears to be a mismatch between the value that LLMs can bring to researchers - automating many tedious parts of the research process (e.g., Dillion et al., 2023) - and the social acceptance of doing so. A likely consequence is the under-reporting of LLM use, which is still relatively difficult to detect and prevent.

Other resources

OpenAI - the creators of GPT - provide two useful resources for less and more advanced users: the GPT best-practices guide in their documentation, and the OpenAI Cookbook, a collection of example code.

Andrej Karpathy's "State of GPT" talk offers a number of insightful recommendations concerning GPT use (Microsoft Developer, 2023).

Yang et al. (2023) compare various prompts that can be used to steer LLMs toward useful responses.
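
The sketch below illustrates the core idea: score a handful of candidate instruction strings on a small labeled set and keep the best one. Yang et al.'s full method (OPRO) additionally uses an LLM to propose new candidates each round, which is omitted here; the evaluation items are toy examples.

```python
# A minimal sketch of the idea behind Yang et al. (2023): score candidate
# instruction strings on a small labeled set and keep the best one. OPRO
# itself also uses an LLM to propose new candidates each round (omitted
# here); the evaluation items below are toy examples.
from openai import OpenAI

client = OpenAI()

CANDIDATES = [
    "Solve the problem.",
    "Let's think step by step.",
    "Take a deep breath and work on this problem step-by-step.",
]

EVAL_SET = [  # (question, expected answer substring) - illustrative only
    ("What is 17 + 26?", "43"),
    ("What is 9 * 8?", "72"),
]

def accuracy(instruction: str) -> float:
    """Fraction of evaluation items answered correctly under this instruction."""
    correct = 0
    for question, answer in EVAL_SET:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": f"{instruction}\n{question}"}],
            temperature=0,
        )
        correct += answer in response.choices[0].message.content
    return correct / len(EVAL_SET)

best = max(CANDIDATES, key=accuracy)
print("Best-scoring instruction:", best)
```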


References

Crockett, Molly, and Lisa Messeri. 2023. “Should Large Language Models Replace Human Participants?” https://doi.org/10.31234/osf.io/4zdx9.

Dillion, Danica, Niket Tandon, Yuling Gu, and Kurt Gray. 2023. “Can AI language models replace human participants?” Trends in Cognitive Sciences 27 (7): 597–600. https://doi.org/10.1016/j.tics.2023.04.008.

Gosling, Samuel D., Peter J. Rentfrow, and William B. Swann Jr. 2003. “A Very Brief Measure of the Big-Five Personality Domains.” Journal of Research in Personality 37 (6): 504–28. https://doi.org/10.1016/S0092-6566(03)00046-1.

Horton, John J. 2023. “Large Language Models as Simulated Economic Agents: What Can We Learn from Homo Silicus?” Working Paper. https://doi.org/10.3386/w31122.

McCrae, Robert R., and Oliver P. John. 1992. “An Introduction to the Five-Factor Model and Its Applications.” Journal of Personality 60 (2): 175–215. https://doi.org/10.1111/j.1467-6494.1992.tb00970.x.

Microsoft Developer. 2023. “State of GPT | BRK216HFS.” https://www.youtube.com/watch?v=bZQun8Y4L2A.

Niszczota, Paweł, and Sami Abbas. 2023. “GPT Has Become Financially Literate: Insights from Financial Literacy Tests of GPT and a Preliminary Test of How People Use It as a Source of Advice.” Finance Research Letters, August, 104333. https://doi.org/10.1016/j.frl.2023.104333.

Niszczota, Paweł, and Paul Conway. 2023. “Judgments of Research Co-Created by Generative AI: Experimental Evidence.” https://doi.org/10.48550/arXiv.2305.11873.

Thorp, H. Holden. 2023. “ChatGPT Is Fun, but Not an Author.” Science 379 (6630): 313. https://doi.org/10.1126/science.adg7879.

Touvron, Hugo, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, et al. 2023. “Llama 2: Open Foundation and Fine-Tuned Chat Models.” https://doi.org/10.48550/arXiv.2307.09288.

Yang, Chengrun, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, and Xinyun Chen. 2023. “Large Language Models as Optimizers.” https://doi.org/10.48550/arXiv.2309.03409.