Heavy lifting: Why using AI for data extraction is still no easy task

Using AI to extract data from documents and filings should be a no-brainer. But it takes a lot of brains and money to get those processes set up and running reliably and accurately.

Data extraction and AI experts are warning that attempting to use generative AI models to parse data from public company documents such as SEC filings could prove costly and—more importantly—may not deliver accurate results. Specifically, they say financial firms that need very high standards of accuracy should not try to develop solutions using generic AI tools and large language models (LLMs) but will need to invest time and money to build highly custom services.

Patronus AI, which provides a platform for evaluating AI and LLM performance, conducted testing last year that showed widely used LLMs frequently failed to accurately answer a sample set of questions from its FinanceBench benchmarking tool. FinanceBench comprises 10,000 question-and-answer pairs covering SEC filings such as 10-Ks and 10-Qs, financial reports, and earnings call transcripts.

According to Patronus AI, both GPT-4 Turbo and Llama 2 with retrieval systems failed to answer questions based on these documents 81% of the time. And even LLMs with long context windows—which are much slower, and still not long enough to accommodate the documents typically used by analysts, the vendor said—failed a significant amount of the time: GPT-4 Turbo with long context failed 21% of the time, while Anthropic’s Claude 2 with long context failed 24% of the time.

One chief data officer at a fintech vendor, who was previously the CDO of a tier-1 global bank, says these figures aren’t surprising, given the specialized nature of legal securities filings. In a previous role, he consulted with a niche data firm attempting to do something similar: parsing structured product details from prospectus filings. The firm started out using ChatGPT, but while this cut out some manual document review, it delivered only marginal benefit overall. The CDO helped the firm move to Microsoft’s machine-learning tooling, which allowed it to train a more task-specific model. This did improve its data collection, but he says the output still needed a lot of attention.

The CDO isn’t alone in doubting the abilities of GenAI to handle complex information extraction tasks.

Zero to 80, then hard on the brakes

“It’s so easy to go from zero to 80% with almost no effort using a generative model, but people don’t realize it’s not easy to overcome the 80% hurdle and get to a place where you can use it in an enterprise context,” says Kris Bennatti, CEO of Hudson Labs (formerly Bedrock AI), which focuses on helping firms get LLMs to function in capital markets environments.

“We’ll see teams get excited about AI, rely on GenAI, and build a RAG (retrieval-augmented generation, a technique used to optimize the output of LLMs) setup, then they find they can’t get to 100%, so they decide AI doesn’t work and shouldn’t be implemented—and that’s the wrong conclusion,” Bennatti says. “GenAI is a very exciting tool, but it’s being used for too many things and not the right things.”

She says there are many reasons why generic LLMs have higher failure rates in capital markets use cases. For example, there are nuances such as how well or badly LLMs handle negatives, and reasoning errors that arise because generic LLMs contain so much irrelevant noise that nevertheless overlaps linguistically with the information that is relevant.

It’s so easy to go from zero to 80% with almost no effort using a generative model, but people don’t realize it’s not easy to overcome the 80% hurdle
Kris Bennatti, Hudson Labs

Hudson Labs works with research analysts and firms’ internal data engineering teams, using a proprietary noise suppression technique to improve their signal-to-noise ratio by removing irrelevant information, thus also removing the potential for confusion in future steps.
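Hudson Labs doesn’t disclose how its noise suppression works, but the general idea of scoring passages for relevance and discarding the rest before they ever reach a model can be sketched in a few lines. The term list and threshold below are hypothetical stand-ins, not the vendor’s method:

```python
# Illustrative pre-filtering: drop passages unlikely to be relevant before
# they ever reach an LLM. The keyword-overlap scoring is a deliberately
# simple stand-in for whatever proprietary method a vendor actually uses.

RELEVANT_TERMS = {"impairment", "writedown", "restatement", "going concern"}

def signal_score(passage: str) -> float:
    """Fraction of relevant terms appearing in the passage."""
    text = passage.lower()
    return sum(term in text for term in RELEVANT_TERMS) / len(RELEVANT_TERMS)

def suppress_noise(passages: list[str], threshold: float = 0.25) -> list[str]:
    """Keep only passages that clear the relevance threshold."""
    return [p for p in passages if signal_score(p) >= threshold]
```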

But other challenges remain. One is the highly technical language often used in filings; another is ChatGPT’s limited prompt size, which effectively precludes parsing an entire SEC filing in one pass.

Lodas Markets, an Overland Park, Kansas-based operator of marketplaces for privately traded assets, including real-estate investment trusts (REITs) and business development companies, encountered this prompt-size issue when trying to use AI to extract data from private-company documents to increase transparency in the assets it trades. Creating secondary markets for typically opaque assets is difficult because the lack of transparency tends to deter investors.

“We could have used people to extract the data manually, but that’s not scalable,” says Rigo Neri, chief technology officer at Lodas. “So, with the launch of more accessible LLMs over the last year, we asked whether we could use AI to solve this problem.”

The company set up a proof of concept using OpenAI’s GPT-3.5, but then encountered the challenge of getting the model to pinpoint the right data points in the proverbial haystack of documentation. Neri says some of these technical documents can run 50 to 100 pages, while an LLM can only be given a small window of context at a time. As a result, the team needed to reduce the amount of information the LLM had to look at.

“We did some file conversion, breaking down documents into individual pages, but even that’s not enough because information can overlap between pages—for example, a list of properties in a REIT may stretch over four pages,” he says.
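A common workaround for exactly this problem, though not necessarily the approach Lodas settled on, is to split a document into overlapping windows of pages, so that content straddling a page boundary survives intact in at least one chunk. A minimal sketch, with hypothetical window and stride sizes:

```python
def chunk_with_overlap(pages: list[str], window: int = 4, stride: int = 2) -> list[str]:
    """Group consecutive pages into overlapping chunks, so content that
    straddles a page boundary (e.g., a REIT's multi-page property list)
    appears whole in at least one chunk."""
    chunks = []
    start = 0
    while start < len(pages):
        chunks.append("\n".join(pages[start:start + window]))
        if start + window >= len(pages):
            break
        start += stride
    return chunks
```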

Even though GPT-4 expands the context window to handle larger documents, that’s not the ideal solution because of the performance and cost implications of processing larger volumes of data. “If you want to read one chapter of a book, you don’t want to have to read the whole book again every time,” Neri says.

Once Lodas’ AI has extracted and rendered data from documents, the company performs a final, manual check, comparing the original information and the extracted results side-by-side.

“We can rely on AI for maybe 80% of it, and that remaining 20% is where the human element comes in,” Neri says, adding that migrating to GPT-4 may bump the success rate up from 80% to 90%. “Hopefully, we will get to a point where the human element is reduced, and we can rely more on the AI.”

Bad language

One of the other biggest problems, says Hudson Labs’ Bennatti, is that the language used in an SEC filing can differ greatly—in both the key terms used and the way the content is structured—from the language used broadly across the internet, or even in financial news stories.

Technical terms can also prove challenging. For example, on one project, Hudson Labs was looking for indications of impairment. But whereas the term “accelerated decline” usually refers to conditions getting worse more quickly, in the mining sector, the term has a very different meaning, referring to a structural aspect of a mine, Bennatti says.

Barbara Matthews, founder and CEO of BCMstrategy, a company that turns public policy information into data and analysis of the impact of those policy decisions on financial markets, says these limitations are to be expected.

“It’s not surprising that generic LLMs have difficulty with the highly technical language in SEC filings,” says Matthews, who had to build her own proprietary AI tool to interrogate her datasets of public policy information, as well as the underlying language models for training the AI. “Despite their billions of parameters, the computing capacity cannot compensate for the lack of specific domain expertise.”

If you’re teaching them wrong too fast, your output is screwed
David Trainer, New Constructs

Hudson Labs, for example, uses highly specialized, proprietary LLMs, Bennatti says, and rarely uses just one. Instead, it uses a set of LLMs performing specific tasks in a pipeline—with separate pipelines for different functions—where the firm controls the inputs and outputs, and no model is ever placed in a position where it can fail.
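Hudson Labs’ pipelines are proprietary, but the broad pattern of chaining narrow, single-purpose steps, with each stage’s output validated before it becomes the next stage’s input, can be sketched as follows. The stages shown are trivial stand-ins for task-specific model calls:

```python
from typing import Callable

# Each stage pairs a narrow task with a validator, so no single model is
# asked to do more than it can reliably handle. The stages below are
# trivial stand-ins for vendor-specific model calls.
Stage = tuple[Callable[[str], str], Callable[[str], bool]]

def run_pipeline(text: str, stages: list[Stage]) -> str:
    for step, is_valid in stages:
        text = step(text)
        if not is_valid(text):
            raise ValueError(f"stage {step.__name__!r} produced invalid output")
    return text

def drop_blank_lines(text: str) -> str:
    return "\n".join(line for line in text.splitlines() if line.strip())

stages: list[Stage] = [(drop_blank_lines, lambda t: bool(t.strip()))]
print(run_pipeline("Item 1A.\n\nRisk Factors", stages))
```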

The key takeaway is that specialized functions require specialized LLMs to get the best results, fastest. And while generic models arguably learn and benefit from exposure to that specific domain expertise, the short-term risk is errors and information leakage: as those models incorporate your queries, you are effectively sharing your questions—which could expose your strategy—with the rest of the world, Matthews warns.

So, if firms do choose to work with generic LLMs, they need to ensure that the input and outputs of those models are protected, or risk losing any advantage they hope to gain as proprietary information leaks into the wider pool of LLM data, she says.

“Whatever model is deployed must be a dedicated instance. Otherwise, the training and every subsequent query goes back into that global pool and there’s a high risk of data leakage,” she says. “Some firms—including ones who have shared their open-source models—make it possible to deploy those in a dedicated instance. That’s the model for users who need to protect proprietary information.”

Besides linguistic challenges, there are also inconsistencies in the formatting of filings and financial reports that a human would understand but a machine might have difficulty capturing consistently. A machine may also be tempted to fill gaps with “hallucinations,” introducing errors that—if proliferated through subsequent analysis processes—could lead to disastrous results.

Another example is figures in an income statement or cashflow statement, which are typically presented in thousands for brevity—i.e., if a figure reads $1,000, it actually represents $1,000,000—so after extracting the raw figure, the AI would need to know when to multiply it by 1,000, says Lodas’ Neri. Likewise, a negative amount may be expressed with a minus symbol or with the figure written within brackets, so an AI would need to be trained to understand that both mean the same thing.
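Both conventions can be handled by a small normalization step after extraction. A minimal sketch, assuming the caller already knows from the statement header whether figures are reported in thousands:

```python
def parse_statement_figure(raw: str, scale: int = 1_000) -> int:
    """Convert a figure as printed into its actual value.

    Handles both conventions Neri describes: figures reported in thousands
    (pass scale=1 if the statement isn't scaled), and negatives written
    either with a minus sign or in brackets, e.g. "(1,234)" == "-1,234".
    """
    text = raw.strip().replace("$", "").replace(",", "")
    negative = text.startswith("-") or (text.startswith("(") and text.endswith(")"))
    return (-1 if negative else 1) * int(text.strip("()-")) * scale

assert parse_statement_figure("$1,000") == 1_000_000
assert parse_statement_figure("(1,234)") == parse_statement_figure("-1,234")
```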

AI vs actual intelligence

David Trainer has been extracting data from SEC filings, with a specific focus on items hidden in the footnotes of filings, for more than 20 years. He started as an analyst at Credit Suisse before founding New Constructs, which does just that as a service. He says before introducing artificial intelligence, you need human experience to set the ground rules and establish processes.

“You need human experts to break down balance sheets, income statements, and footnotes for thousands and thousands of statements … to tell the story about a company that you can then feed into the machines,” Trainer says. “You have to create order out of the chaos that is filings, and someone has to create that order and do it correctly enough times that it works.”

If you want to read one chapter of a book, you don’t want to have to read the whole book again every time
Rigo Neri, Lodas Markets

New Constructs’ process uses machine learning to automate repeatable processes by teaching machines how to perform them without interpretation or any generative aspect. In this setup, humans and machines each have their own distinct roles. It’s easier to teach something to a human, and humans have the intuition to figure things out, he says, but they don’t have the same consistency as machines. “It can be hard to get people to agree on the meaning of an accounting data point,” he notes.

On the other hand, machines can be “impressionable,” he says. “If you’re teaching them wrong too fast, your output is screwed.”

And that’s easy to do—especially when the input data can often be unreliable.

“The variability in disclosures is huge,” Trainer says. “We’ve had to deal with millions of disclosures that are weird. For example, sometimes a company doesn’t disclose something, and if we just insert a zero value, the results could be FUBAR. So we have to build in estimates.”
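A defensive pattern along those lines flags the substitution rather than silently inserting a zero. In the sketch below, using a peer median as the estimate is a hypothetical choice, not New Constructs’ actual methodology:

```python
from statistics import median

def fill_missing_disclosure(value: float | None, peer_values: list[float]) -> tuple[float, bool]:
    """Return (value, was_estimated). When a company omits a disclosure,
    substitute an estimate rather than a zero that would corrupt every
    downstream calculation, and flag it so analysts can review."""
    if value is not None:
        return value, False
    return median(peer_values), True
```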

Verity Data, which has also been extracting data from SEC filings for 20 years to create research and insights, can attest to the messy and inconsistent nature of some filings. Max Magee, principal of research operations for generative AI at the vendor, warns that—while AI can recognize some mistakes—it can’t be relied on to understand when a complex regulatory filing may contain an error or omission.

“It comes down to pre-processing and post-processing,” he says. “AI won’t get tripped up by a few typos, but the issue with 10-Ks and 10-Qs is that these are highly structured filings, and lots of technical issues crop up that make them hard to standardize—such as identifying the same sections across different companies, because they report those differently.”
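Verity doesn’t detail its pre-processing, but the section-identification problem Magee describes often reduces to recognizing the same section under differently worded headings. A simplified sketch, with hypothetical heading patterns:

```python
import re

# Hypothetical variants under which different filers label the same section.
RISK_FACTOR_HEADINGS = [
    r"item\s*1a[\.:]?\s*risk\s*factors",
    r"^risk\s*factors\s*$",
]

def find_risk_factors(filing_text: str) -> int | None:
    """Return the character offset where the risk-factors section starts."""
    lowered = filing_text.lower()
    for pattern in RISK_FACTOR_HEADINGS:
        match = re.search(pattern, lowered, flags=re.MULTILINE)
        if match:
            return match.start()
    return None
```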

Prior to introducing an AI component to assist with extraction, Verity’s software was built to recognize those differences, so the vendor was able to transfer that existing knowledge into the AI. But that wouldn’t have been possible without—just as with BCM’s Matthews and New Constructs’ Trainer—years of domain expertise.

It’s a similar story at UK-based Accelex Technology, which automates the extraction of data from documents about privately held companies. The vendor had been following the progress of GenAI and LLMs, and recently went live with a generative tool for cash flow notices—documents notifying investors of capital calls or of distributions back to them.

Thomas Charman, senior data scientist at Accelex, says that during beta testing, the tool was able to extract 80% (a recurring percentage) of data points from a document, and is improving as the app matures. Previously, this area was the vendor’s worst-performing model, so the tool enables Accelex to provide better service and more accurate data to clients. But key to that was the vendor’s existing knowledge.

“Because we already had a platform that extracts this data, we had good examples of what the inputs look like and what we want the output to look like. And we have a well-defined data model with definitions of terms across the industry,” Charman says. “The guardrails we put in place to protect against hallucinations are working, and we make sure that the responses conform to our data model.”
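One lightweight way to enforce that kind of conformance, though not necessarily Accelex’s implementation, is to parse every model response into a typed record and reject anything that doesn’t fit. The field names here are hypothetical, not the vendor’s actual data model:

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class CashFlowNotice:
    fund_name: str
    amount: float
    effective_date: date

def validate_response(fields: dict) -> CashFlowNotice:
    """Reject any model output that doesn't conform to the data model."""
    notice = CashFlowNotice(
        fund_name=str(fields["fund_name"]),
        amount=float(fields["amount"]),
        effective_date=date.fromisoformat(fields["effective_date"]),
    )
    if notice.amount <= 0:
        raise ValueError("amount must be positive")
    return notice
```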

That existing foundation allowed the vendor to deploy the new technology quickly and to use open-source resources instead of making major investments in proprietary LLMs.

“We were able to get from the first ‘hack-y’ things using open-source LLMs to an app in three or four months with just two of us working on it. Without having already done a lot of those tasks, we would have needed double or triple the number of people working on it, and it would have taken much longer,” Charman says. “We’re not an organization with the size and resources to train a model from scratch, so we use some third-party resources. But most of the work isn’t about the GenAI and LLM, but rather about how you make sure you’re showing it the right part of a document in the first place.”

“No easy task”

But for most parties exploring the use of AI for these tasks, “the bottom line is that a generic LLM is of limited value for these types of applications, especially for something as specialized as legal securities filings,” warns the aforementioned CDO. “A proprietary model is probably the way to go, but the fact is building and training one is no easy task,” and may be cost-prohibitive for most firms and vendors without large offshore development organizations.

“My sense is that, outside of the hyperscalers, such as Microsoft, OpenAI, Amazon, and others, it’s quickly becoming cost-prohibitive to keep up and build your own models. It’s just too expensive—too difficult,” adds Verity Data’s Magee. “So, I think we’re seeing people use these fundamental models, and use APIs and build on top of those.”

Ultimately, AI adoption for these specialist tasks comes down to a trade-off between accuracy and cost. How much checking are you prepared to tolerate in return for faster time-to-market and ease of deployment, versus how much extra are you willing to pay to achieve 100% accuracy—if that’s even possible? But as the volume of reports grows and demand for faster analysis accelerates, it’s not possible to have humans check each and every data point.

Building an AI mechanism that you can be confident will accurately extract and report data from filings and other documents must involve a range of techniques: specialized language models that aren’t trying to boil the ocean, the domain expertise to train the AI, and different tools for different parts of the job. And though artificial intelligence may eventually take over much of the heavy lifting, the role of actual human intelligence remains key in creating and testing these models—especially individuals with years of domain-specific experience who can spot the linguistic or formatting nuances that may confuse an AI.
