ChatGPT scores 60% accuracy on financial statement analysis
June 4, 2024
A recent academic paper says that ChatGPT correctly predicted whether a company's earnings would rise or fall about 60% of the time, compared with 52.71% for human analysts.
The paper, published by academics at the University of Chicago's Booth School of Business, used a ChatGPT model fine-tuned for financial analysis. The academics have made the GPT available here. The bot was fed anonymized and standardized corporate financial statements, with years replaced by labels such as t and t-1. Specifically, the researchers standardized the balance sheets and income statements in a way that follows Compustat's balancing model, so the format of the financial statements would be identical across all firm-years and the model would not know which company or even which time period its analysis corresponded to. They then used chain-of-thought prompting to guide it through the financial analysis.
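To make the year-relabeling step concrete, here is a minimal Python sketch. It is not the researchers' code: the function name `anonymize_years`, the dictionary layout and the line items are assumptions chosen for illustration. It only shows how calendar years could be swapped for relative labels such as t and t-1.

```python
def anonymize_years(statement):
    """Replace calendar years with relative labels ('t', 't-1', ...).

    `statement` maps a calendar year to a dict of line items.
    This layout is hypothetical and is not taken from the paper.
    """
    latest = max(statement)
    relabeled = {}
    for year in sorted(statement, reverse=True):
        offset = latest - year
        label = "t" if offset == 0 else f"t-{offset}"
        relabeled[label] = dict(statement[year])
    return relabeled


# Illustrative figures only, not real company data.
example = {
    2019: {"revenue": 90.0, "net_income": 8.0},
    2020: {"revenue": 100.0, "net_income": 10.0},
}
anonymized = anonymize_years(example)
```

After relabeling, `anonymized` has the keys "t" and "t-1" and no calendar years, so a reader of the statement cannot tell which period it covers.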
The sample of financial statements spans 1982 to 2021, with 39,533 observations from 3,152 distinct firms. The researchers' target variable across all models was the directional change in future earnings, up or down. They tested the bot's predictions of next year's earnings at one month, three months and six months after earnings reports.
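As a rough illustration of what a directional target and an accuracy rate mean here, consider the Python sketch below. The function names are hypothetical, and treating unchanged earnings as a "decrease" is an assumption made only for this example; the article does not say how the paper handles ties.

```python
def earnings_direction(current_eps, next_eps):
    """Label the change in earnings as 'increase' or 'decrease'.

    Assumption: unchanged earnings count as 'decrease'.
    """
    return "increase" if next_eps > current_eps else "decrease"


def directional_accuracy(predicted, actual):
    """Share of predictions whose direction matches the realized one."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(predicted)


# Illustrative labels only, not results from the paper.
actual = [earnings_direction(1.0, 1.2), earnings_direction(2.0, 1.5)]
predicted = ["increase", "increase"]
accuracy = directional_accuracy(predicted, actual)
```

In this toy case one of the two predictions matches, so `accuracy` is 0.5. The percentages reported in the article are the same kind of ratio computed over the full sample.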
A key research design choice, according to the paper, was omitting all textual information, such as management discussion and analysis. The researchers' primary interest was in understanding the LLMs' ability to analyze and synthesize purely financial numbers, which set up research questions such as: "Can a large language model generate economic insights purely from the numbers reported in financial statements absent any narrative context?"
For the human predictions, the researchers used consensus analyst forecasts. If a single analyst issued multiple forecasts, they used the one closest to the earnings release date. They then took the median value of the analysts' forecasts and compared it to the year t earnings per share. They required at least three analyst forecasts in a given firm-year to compute median values. If the median forecasted EPS value was larger than the year t EPS, they labeled the prediction as "increase"; otherwise, they labeled it as "decrease." Like the bot's predictions, these were measured one month, three months and six months after earnings reports.
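The consensus procedure described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the researchers' implementation: the function name, the tuple layout and the choice to count the three-forecast minimum after keeping one forecast per analyst are all assumptions.

```python
from statistics import median


def analyst_consensus_label(forecasts, current_eps, min_forecasts=3):
    """Turn analyst EPS forecasts into an 'increase'/'decrease' label.

    `forecasts` is a list of (analyst_id, days_from_release, eps)
    tuples; this layout is hypothetical. Returns None when too few
    forecasts remain to form a consensus.
    """
    # Keep only each analyst's forecast closest to the release date.
    closest = {}
    for analyst_id, days, eps in forecasts:
        if analyst_id not in closest or abs(days) < abs(closest[analyst_id][0]):
            closest[analyst_id] = (days, eps)

    # Assumption: the minimum applies after de-duplicating analysts.
    if len(closest) < min_forecasts:
        return None

    consensus = median(eps for _, eps in closest.values())
    return "increase" if consensus > current_eps else "decrease"


# Illustrative forecasts only, not real analyst data.
sample = [("a", 10, 1.0), ("a", 2, 1.5), ("b", 5, 1.2), ("c", 3, 0.9)]
label = analyst_consensus_label(sample, current_eps=1.0)
```

Here analyst "a" contributes only the forecast made two days from the release, the median of 1.5, 1.2 and 0.9 is 1.2, and because that exceeds the year t EPS of 1.0 the label is "increase".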
What they found was that human analysts were accurate 52.71% of the time one month later, 55.95% of the time three months later, and 56.68% of the time six months later.
ChatGPT without chain-of-thought prompting (referred to in the paper as a "naive" model) scored only 49%. With chain-of-thought prompting, this rate increased to 60.31%, well above human analysts. The results were on par with the 60.45% accuracy rate of a specialized neural network trained on the same data.
The researchers said GPT is capable of predicting future earnings because it distills narrative insights about the financial health of the company from the numeric data. Specifically, the model analyzes trends, then moves on to ratio analysis, and concludes by providing a rationale for its prediction.
"Although one must interpret our results with caution, we provide evidence consistent with large language models having human-like capabilities in the financial domain," the study concluded. "General purpose language models successfully perform a task that typically requires human expertise and judgment and do so based on data exclusively from the numeric domain. Therefore, our findings indicate the potential for LLMs to democratize financial information processing and should be of interest to investors and regulators."
The AI was not perfect, however. Researchers observed sharp drops in prediction accuracy in 1974, 2008-2009 and 2020. These periods overlap with international macroeconomic downturns: the oil shock in 1974, the financial crisis in 2008-2009 and the COVID-19 outbreak in 2020. It also became less accurate the further back in time it went, which the researchers say reflects the increasing difficulty of predicting earnings from statistics alone. Further, the researchers found that ChatGPT is more likely to generate inaccurate predictions when firms are smaller, have higher leverage, record a loss and have higher earnings volatility.
[Accounting Today]