Abstract
What do chatbots, voice assistants, and predictive text have in common? They all use computer programs called language models. Large language models are a new kind of language model so big that they can only be built using supercomputers. They work so well that it can be hard to tell whether something was written by a person or by a computer!
We wanted to understand how a large language model called GPT-3 works. But we wanted to know more than whether GPT-3 could answer questions correctly. We wanted to know how and why it came up with its answers. So we treated GPT-3 like a participant in a psychology experiment. Our results showed that GPT-3 gets a lot of questions right. But we also learned that GPT-3 gets confused very easily, and that it does not search for new information as well as people do. Knowing how and why large language models come up with wrong answers helps us figure out how to make even better versions in the future.