For over a decade, complexity scientist Peter Turchin and his collaborators have worked to compile an unparalleled database of human history – the Seshat Global History Databank. Recently, Turchin and computer scientist Maria del Rio-Chanona turned their attention to artificial intelligence (AI) chatbots, questioning whether these advanced models could aid historians and archaeologists in interpreting the past.
The study, which is the first of its kind, evaluates the historical knowledge of leading AI models such as OpenAI's GPT-4, Meta's Llama, and Google's Gemini.
The results, presented at the NeurIPS conference, reveal both potential and significant limitations in AI’s ability to grasp historical knowledge, especially at the nuanced, expert level.
“Large language models (LLMs), such as ChatGPT, have been enormously successful in some fields – for example, they have largely succeeded in replacing paralegals,” said Turchin, who leads the Complexity Science Hub’s (CSH) research group on social complexity and collapse.
“But when it comes to making judgments about the characteristics of past societies, especially those located outside North America and Western Europe, their ability to do so is much more limited.”
The study’s findings highlight that AI’s capabilities are domain-specific. While LLMs excel in some applications, they struggle with expert-level historical analysis. GPT-4 Turbo, the best-performing model, achieved a balanced accuracy of just 46% on a four-choice question test.
This result, though an improvement over random guessing (25%), underscores the substantial gaps in AI’s understanding of global history.
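For readers curious about the metric: balanced accuracy averages a model's hit rate within each answer category, so always picking the most common answer earns no extra credit. The short Python sketch below is our own illustration with simulated data, not the study's code; it shows the computation and confirms that blind guessing on a four-choice test hovers near 25%.

```python
import random
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Per-category recall, averaged: each answer choice counts equally,
    so a model cannot inflate its score by favoring the modal answer."""
    correct, total = defaultdict(int), defaultdict(int)
    for truth, guess in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += (truth == guess)
    return sum(correct[c] / total[c] for c in total) / len(total)

# Simulate a four-choice test: a blind guesser lands near 0.25.
choices = ["A", "B", "C", "D"]
answers = [random.choice(choices) for _ in range(100_000)]
guesses = [random.choice(choices) for _ in range(100_000)]
print(f"Random guessing: {balanced_accuracy(answers, guesses):.3f}")  # ~0.250
```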
“I thought the AI chatbots would do a lot better,” said del Rio-Chanona, corresponding author of the study and assistant professor at University College London. “History is often viewed as facts, but sometimes interpretation is necessary to make sense of it.”
The study leveraged the Seshat Global History Databank, a comprehensive resource documenting historical data from 600 societies worldwide, encompassing over 36,000 data points and 2,700 scholarly references. Using Seshat as a benchmark, researchers tested LLMs on questions requiring graduate or expert-level knowledge.
“We wanted to set a benchmark for assessing the ability of these LLMs to handle expert-level history knowledge,” explained first author Jakob Hauser, a resident scientist at CSH.
“A key component of our benchmark is that we not only test whether these LLMs can identify correct facts, but also explicitly ask whether a fact can be proven or inferred from indirect evidence.”
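The article does not reproduce the benchmark's exact prompt wording, but Seshat conventionally codes each variable as present or absent, with "inferred" variants when the coding rests on indirect rather than direct evidence. Purely as a hypothetical illustration – the polity, variable, and phrasing below are our assumptions, not taken from the published benchmark – a four-choice item separating direct from inferred evidence might be framed and graded like this:

```python
# Hypothetical benchmark item, for illustration only: the polity, variable,
# and wording are our assumptions, not the published benchmark's format.
ITEM = {
    "polity": "Roman Empire (Principate)",
    "variable": "professional soldiers",
    "question": ("Did this polity have professional soldiers? Answer: "
                 "A) present, B) inferred present, "
                 "C) inferred absent, or D) absent."),
    "answer": "A",
}

def grade(model_reply: str, item: dict) -> bool:
    """Exact-match grading on the leading choice letter."""
    return model_reply.strip().upper().startswith(item["answer"])

print(grade("A) present", ITEM))  # True
```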
The study revealed that AI performance varied significantly across time periods and geographic regions. LLMs demonstrated greater accuracy when addressing questions about ancient history, particularly between 8000 BCE and 3000 BCE, but struggled with more recent events, especially from 1500 CE to the present.
Geographically, OpenAI models like GPT-4 performed better for regions such as Latin America and the Caribbean, while Llama models excelled in Northern America. However, both systems underperformed in Sub-Saharan Africa, with Llama also showing weaker results for Oceania.
These disparities highlight potential biases in training data, which may prioritize certain historical narratives while neglecting others.
LLMs performed best on questions related to legal systems and social complexity, but they struggled significantly with topics such as discrimination and social mobility.
“The main takeaway from this study is that LLMs, while impressive, still lack the depth of understanding required for advanced history,” said del Rio-Chanona. “They’re great for basic facts, but when it comes to more nuanced, PhD-level historical inquiry, they’re not yet up to the task.”
Among the tested models, GPT-4 Turbo's 46% ranked highest, while Llama-3.1-8B scored lowest at 33.6%.
The researchers aim to address the shortcomings revealed in this study by expanding the dataset and refining the benchmark. Hauser emphasized plans to incorporate more data from underrepresented regions, especially in the Global South, and to include more complex historical questions.
“We also look forward to testing more recent LLMs, such as OpenAI's o3, to see if they can bridge the gaps identified in this study,” Hauser said.
The team believes the findings are valuable not only for AI developers but also for historians and archaeologists.
For academics, understanding AI’s limitations can guide its use in research, while for developers, the study highlights areas for improvement, such as addressing regional biases and enhancing models’ capacity for handling nuanced historical evidence.
The study highlights both the potential and the current limitations of AI in historical research. While LLMs have proven valuable for generating basic historical facts, they fall short when tasked with interpreting complex and nuanced historical contexts.
For researchers and developers alike, these findings offer a roadmap for improving AI’s understanding of global history, paving the way for tools that better support scholarly inquiry into the past.