Banks Look to Virtual Data Warehouses to Unify Trading Desks

As emerging technologies and greater computing power have brought about new analytical tools and capabilities, adding a virtual layer could help take the burden off the traditional data warehouse.

Deutsche Bank has been looking for an efficient way to pull trading data across its fixed income, credit, and foreign exchange (FX) desks, which, as at many other organizations, traditionally sit in separate databases. In the words of two DB executives, bringing that data together proved to be “much more than a technical challenge.”

Trading in different asset classes requires a certain level of specialization, which has produced heterogeneous data entities, databases, business intelligence (BI) tools, and analytics, each built up within its own, separate division.

Looking specifically at trading data across its fixed-income, credit, and FX desks, Jon Mah, managing director for strategic analytics in Deutsche Bank’s Corporate & Investment Bank (CIB) unit, said these product lines typically have their own technical solutions and are serviced by their own dedicated data pipelines.

“What isn’t as easily done is figuring out a solution that spans across all of those different datasets. I want to know how much ‘XYZ’ is trading with me, overall, across FX, fixed income, equities, etc. How do you do that? That’s the problem we’re focused on here,” he said, speaking at the Waters USA conference in New York on December 3.

These divisions aren’t necessarily treated as silos, added Vijay Bhandari, innovation lead at the bank. Instead, they are well-structured, purpose-driven, and useful in the way they were intended to be.

“They are exact solutions and they are optimized for what they need to do. The problem we’re trying to solve is that there is a whole class of other users that also want to be able to make use of this information, and sometimes this has to go across the board. How do we facilitate this so they don’t have to get [through] the complexity that lies underneath all of these different units?” he said.

This challenge can be tackled in a few ways, Bhandari added. Some firms have approached it as a software problem; others have pursued meta-database solutions; still others have gone down the route of database-level fixes.

DB has embarked on a proof-of-concept (PoC) using a virtual data warehouse. For this, it is working with two undisclosed vendors to help bring all these different databases and data types under a virtual data warehouse, and then expose them via a common access layer.

“The most common access layers are Java Database Connectivity (JDBC), Open Database Connectivity (ODBC) and an open API service-oriented layer where you create messaging and servicing APIs that can then tie into other upstream applications or data analytics tools,” Mah said.

In DB’s case, the bank uses Python, Java, Jupyter notebooks, and Tableau for its visualization needs. The idea, Bhandari said, is to have the disparate sources sit underneath the virtual data warehouse, which then exposes a common logical schema that can be consumed by data analytics, BI reporting, and other in-house applications.
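
To make the idea concrete, the sketch below shows how an analyst might consume such a common logical schema from a Jupyter notebook in Python. DB has not disclosed its vendors or schema, so the data source name, table, and columns here are purely hypothetical; the pattern, though, is what an ODBC-style access layer typically enables.

```python
import pandas as pd
import pyodbc

# The virtualization layer is exposed as a single ODBC data source,
# regardless of how many physical databases sit beneath it.
# "VIRTUAL_DW" is a hypothetical DSN, not DB's actual configuration.
conn = pyodbc.connect("DSN=VIRTUAL_DW;UID=analyst;PWD=********")

# Query the unified logical view rather than any desk-specific database.
trades = pd.read_sql(
    "SELECT trade_date, asset_class, counterparty, notional "
    "FROM unified_trades "
    "WHERE trade_date >= '2019-12-01'",
    conn,
)
print(trades.head())
```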

The bank started work on the PoC about two months ago and hopes to have it up and running by early 2020.

“There are technical solutions to these things, but overwhelmingly, the biggest challenge is how we organize it to make it meaningful,” Mah said. “I just want to know what my profitability is for a particular time in this asset class versus another. And that involves pulling data that’s very specific to every different product, and having to do all that together to be able to get an answer. What you need to do is figure out what are the specific attributes I can extract from everything that exists at a very specific level in the aggregate that’s meaningful and in a broad way.”
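
A hedged illustration of the cross-asset question Mah describes, again against a hypothetical unified schema: the aggregation is written once, over common attributes, rather than once per desk-specific database.

```python
import pandas as pd
import pyodbc

conn = pyodbc.connect("DSN=VIRTUAL_DW;UID=analyst;PWD=********")  # hypothetical DSN

# "How much is counterparty XYZ trading with me, overall?" -- answered once,
# against common attributes, instead of once per desk-specific pipeline.
exposure = pd.read_sql(
    """
    SELECT counterparty,
           asset_class,
           SUM(notional) AS total_notional,
           SUM(pnl)      AS total_pnl
    FROM unified_trades
    WHERE trade_date BETWEEN '2019-01-01' AND '2019-12-31'
    GROUP BY counterparty, asset_class
    ORDER BY total_notional DESC
    """,
    conn,
)
```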

With the increased amount of data flowing into trading platforms across different desks, firms are finding they cannot process that information fast enough. The question these organizations need to answer is how to perform BI-like tasks quickly while keeping their analytics fresh.

“If we can have good etiquette for logic, framework, and schema that sits on top of all our data, then we can kind of create the single pane of glass that taps into all this information and makes it easy for us to consume,” Bhandari said.

Having a single, standardized point of access is where firms will gain efficiencies and improve time-to-market for producing analytics.

As firms are dealing with an increasing amount of data, pulling data for analytics from disparate databases within traditional data warehouses is proving inefficient. Data lakes, on the other hand, can lead to a mixed bag of products and technologies. This has led some firms to consider virtual data warehouses for a more agile way to access data.

Warehouse to Lake to Virtual

Virtual data warehouses aren’t new. Until recently, however, it has not been “computationally cheap” to do the work on the fly, as Bhandari put it. “I think what’s happened in the past four or five years is that there’s been a confluence of technologies that have been built fast enough—and affordable enough—to make this kind of technology worth looking at again to see if it’s performant enough for our needs. It may not be performant for anything critical in real-time, but for most types of analysis that analysts and BI folks are doing, it’s definitely fast enough,” he said.

The predominant data storage paradigm within financial firms has been to have multiple, disparate application databases and general-purpose file stores, each used locally by its core application. Naz Quadri, head of enterprise data science and quant development at Bloomberg, tells WatersTechnology that this meant that, up until the last decade, any cross-functional business analysis would need to be done on the firm-wide data warehouse.

These warehouses were custom, costly, and complex to set up. Quadri explains that any data source fed into the data warehouse could require months of engineering effort to connect in through the extract, transform and load (ETL) process. 

Then, once set up, these warehouses were fairly rigid in their data representation and limited in the types of reports they could generate.

As data science has proliferated across the financial industry, the rigidity of the data warehouse has proved a poor fit for many analytics use cases. “This gave birth to the data lake concept, where any and all application database data can be copied or even streamed in real-time to the lake,” Quadri says.

In recent years, data lakes have also served as a natural home for additional non-application datasets brought into a firm, such as alternative data. “With the appropriate choice of technology for the lake and sufficient descriptive metadata, this environment allows data scientists to explore and build models at a rapid rate,” Quadri says.

Adding to the complexity, though, is the fact that different people define data lakes differently. To Dessa Glasser, principal of the Financial Risk Group, independent board member at Oppenheimer Holdings, and former chief data officer for JP Morgan’s asset management unit, data lakes are more for analytical data, historical data, and so on. “It’s useful for analyzing and combining structured and unstructured data, as well as historical data. They are good to support things like data labs and business intelligence tools where people want to perform ‘what if’ analysis, often on historical data, and to access the data directly. You don’t want analysts querying operational data sources directly, since you can’t risk taking down these operational or trading systems. That’s why lakes have been used traditionally in the past. It really is more for analysis and historical data,” she explains.

Though there may be different definitions tied to it, data lakes help make more content available across the enterprise. Bill Gartland, vice president for fixed-income data and analytics at Broadridge, says this shifts the mindset from capturing the information needed to satisfy today’s needs, to saving everything and figuring out what to do with it later.

The idea of data virtualization came about by asking the question: do we really need to copy all this data to a central store? The concern was that traditional data warehouses and data lakes often duplicated data, and given the sheer volume of data flowing through institutions, that duplication can become a serious issue.

So the idea of the virtualized data warehouse is to make it look as though all the disparate data stores within an organization are in the same database through a layer of virtualization technology, Quadri says. This makes it easier for everyone in the organization to get access to the data they need using a common taxonomy.
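
The principle can be sketched with a toy example: SQLite’s ATTACH is used below as a stand-in for a real virtualization layer, presenting several physical stores behind one connection and one logical view. The file, table, and column names are hypothetical.

```python
import sqlite3

# One connection fronts several physical stores (hypothetical file names).
conn = sqlite3.connect("fx_trades.db")
conn.execute("ATTACH DATABASE 'fi_trades.db' AS fi")
conn.execute("ATTACH DATABASE 'credit_trades.db' AS credit")

# A single logical view maps each source onto a common taxonomy.
conn.execute("""
    CREATE TEMP VIEW unified_trades AS
    SELECT trade_date, counterparty, notional, 'fx'     AS asset_class FROM fx_deals
    UNION ALL
    SELECT trade_date, counterparty, notional, 'fi'     AS asset_class FROM fi.bond_trades
    UNION ALL
    SELECT trade_date, counterparty, notional, 'credit' AS asset_class FROM credit.cds_trades
""")

# Consumers query the view as if everything lived in one database.
rows = conn.execute(
    "SELECT counterparty, SUM(notional) FROM unified_trades GROUP BY counterparty"
).fetchall()
```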

For financial professionals who prefer BI tools like QlikSense and Tableau, data virtualization over application databases is an attractive solution. However, Quadri says teams of data scientists would still typically prefer centralized data lakes.

It’s Not For Everything

Traditional data warehouses are rigidly structured, require programming and ETL, and typically take time to build. Beyond that, the model faces other challenges, including data regulation and scalability. Warehouses are also construction-heavy, demanding a strong commitment of resources to build out tools, and even with all this effort the costly investment may not produce a return.

Mark Alayev, director of service delivery for RFA, an IT, financial cloud and cybersecurity services provider to the investment management sector, says that in environments built around a traditional data warehouse, firms must first join the data in one location to perform queries on top of it.

“This means you may need to move UK data near US data or vice versa, in order to run queries across both platforms,” he says. “By separating the data warehouse from the database in a virtual data warehouse scenario, the data may reside in the appropriate location for each of them, but it will only run across both datasets when you run the query. It delivers the result to the analyst, and the data continues to live separately.”
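
Alayev does not name a product, but the pattern he describes matches a federated query engine. The sketch below assumes Trino purely for illustration, with hypothetical host, catalog, and table names: the query spans a UK source and a US source at query time, and only the result set travels back to the analyst.

```python
from trino.dbapi import connect  # Trino is an assumption; the article names no engine

# Hypothetical host and catalogs: one per region, each backed by its own database.
conn = connect(host="virtual-dw.example.com", port=8080, user="analyst")
cur = conn.cursor()

# The engine pushes work down to the UK and US sources; the data stays in
# each region and only the aggregated result is returned.
cur.execute("""
    SELECT counterparty, SUM(notional) AS total_notional
    FROM (
        SELECT counterparty, notional FROM uk_postgres.trading.trades
        UNION ALL
        SELECT counterparty, notional FROM us_postgres.trading.trades
    ) AS all_trades
    GROUP BY counterparty
""")
rows = cur.fetchall()
```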

The second challenge with the traditional data warehouse is scalability. In a virtualized setup, just as many databases can be queried by one data warehouse, many data warehouses can query the same set of databases.

“This allows for you to scale concurrency horizontally as you can have many threads processing queries. This enables you to run ingestion, analytics across many different tools without conflicting with others. Of course, there are limitations based on your data storage, so an elastic clustered storage would be preferred to allow for graceful scaling,” Alayev explains.

Additionally, scaling up can be difficult, particularly when modern analytics require the consumption of extremely large datasets. Those spikes in demand can stress traditional warehouses. Virtual data warehouses, on the other hand, hosted in public or private cloud environments, are better equipped to scale with demand.

While virtual data warehouses pave the way through many challenges, they aren’t a silver bullet for all data queries and analytics work. As Alayev puts it, they are incredible structures, but not for all scenarios. Where virtual data warehouses may hit a wall, he explains, is in high-performance scenarios where latency is crucial, such as quantitative trading.

“The virtualization layer does take additional processing power and is in turn another level of abstraction away. For those performance intensive needs, a physical data warehouse may be preferred,” he adds.

Another area where virtual data warehouses could fall short is the requirement to use structured query language (SQL), according to Alayev. As more BI tools move to low-code or no-code interfaces to make way for natural-language querying, they rely heavily on understanding the relationships within the underlying data.

“Virtual data warehouses typically do not hold relationships between the data and require knowledge of SQL,” he says.

An area that seems quite basic, but is “absolutely critical,” is the taxonomy and authoritative data store. This is where data transformation, mapping, and validation happen, Glasser adds. “For example, you can have the logic built into the glossary that IBM is also known as International Business Machines. We had a data glossary with all the business rules tying together terms, such as A equals B, so when a query came in, we knew what data to pull. We would display the results and say, ‘here’s the definition, a third-party called IBM,’ and people could actually look and see all the different values tied to it,” she says.

When using a virtualization layer, it is critical to get the taxonomy right. “So if you’re looking at all your transactions with IBM, you need to tag it with a unique ID. This would then allow the system to pull in all the transactions associated with all the legal entities tied to IBM. The taxonomy becomes very important when using a virtualized layer,” Glasser stresses.
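
A minimal sketch of the glossary logic Glasser describes, with hypothetical aliases and identifiers: every legal-entity alias resolves to one unique ID, so a query for “IBM” captures transactions booked under any of its names. A production taxonomy would live in a governed data glossary rather than a hard-coded mapping.

```python
# Hypothetical glossary: every alias of a legal entity maps to one unique ID.
ENTITY_GLOSSARY = {
    "IBM": "ENT-IBM-001",
    "International Business Machines": "ENT-IBM-001",
    "International Business Machines Corp": "ENT-IBM-001",
}

def entity_id(name: str) -> str:
    """Resolve a counterparty name to its unique entity ID (the 'A equals B' rule)."""
    return ENTITY_GLOSSARY.get(name.strip(), "UNKNOWN")

transactions = [
    {"counterparty": "IBM", "notional": 5_000_000},
    {"counterparty": "International Business Machines", "notional": 2_500_000},
]

# Tag each transaction before it is aggregated through the virtual layer,
# so queries see one entity rather than several spellings.
for txn in transactions:
    txn["entity_id"] = entity_id(txn["counterparty"])

ibm_total = sum(t["notional"] for t in transactions if t["entity_id"] == "ENT-IBM-001")
```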

Isolation and Purpose

Despite some of those issues, virtual data warehouses are suitable for many isolated sets of data of all shapes and sizes, says RFA’s Alayev.

He explains that these include workloads and projects in quantitative research, advertising technology, and private equity deal and portfolio company analysis. For the latter, data can be pulled from various sources using software-as-a-service applications, supplemented with consumer sentiment. Machine-learning workloads would also do well in this environment, as it can scale to handle the large numbers of complex queries required to build, say, neural networks.

Moving forward, it also seems there is still a place for the traditional data warehouse, says DB’s Bhandari.

“This is not meant to be a performance-enhancing technology or approach; it’s obviously adding a layer of abstraction. Whenever you do that, there are going to be certain costs associated with it. I think the real benefit of doing that is simplicity to the consumers of the data, who don’t necessarily want to know how to use each individual technology. And it also has to do with being able to work independently of the timelines of different teams that have different technologies underneath and underlying,” he said.

Indeed, this is not the last evolution for the data warehouse, whether traditional, in lake format, or virtualized. And just as frustrations have arisen with data lakes—the previous savior of the traditional data warehouse—new challenges will emerge with virtualized data warehouses. But as more data becomes available, as the need to derive insights from that data grows even more paramount, and as firms need to rein in costs, for the foreseeable future, expect more firms to follow the same path that Deutsche Bank is taking.
