Preserve Data History In The Cloud — Or Lose It – ITPro Today

In a discussion I had with a spokesperson for a cloud services vendor several months ago, the representative said something that stuck with me. "One thing you don't get with the cloud is a history of how your data changes over time," the spokesperson said. "There's no way you can look at a record [at a point in time] and compare it to other time periods. What we're doing is we're preserving the historical data that is [otherwise] lost in the cloud."

The spokesperson was not affiliated with a cloud data lake, data warehouse, database or object storage vendor. It seemed that the company hadn't previously considered that cloud subscribers could use one of these services to collect and preserve the historical data produced by cloud applications. Or, if the company had previously considered this, it had rejected it as an undesirable, or nonviable, option.

In my conversations specific to the data mesh architecture and data fabric markets, the terms "history" and "historical" tend to come up infrequently. Instead, the emphasis is on (a) enabling business domain experts to produce their own data and (b) making it easier for outsiders -- experts and non-experts alike -- to discover and use data. And yet, discussion of the technologies that underpin both data mesh architecture and the data fabric (viz., data virtualization, metadata cataloguing and knowledge discovery) focuses on connectivity to operational databases, applications and services -- i.e., resources that do not preserve data history.

Historical data is of crucial importance to machine learning (ML) engineers, data scientists and other experts, of course. It is not an afterthought. And there are obvious schemes you can use to accommodate historical data in data mesh architecture -- a historical repository that is instantiated as its own domain, for example.

But these two data points got me wondering: Are we forgetting about data history? In the pell-mell rush to the cloud, are some organizations poised to reprise the mistakes of past decades?

Most cloud apps and services do not preserve historical data. That is, once a field, value or record changes, it gets overwritten with new data. Absent a routinized mechanism for preserving it, this data is lost forever. That said, some cloud services do give customers a means to preserve data history.
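One simple way to avoid this kind of loss -- sketched here in Python, with hypothetical record and store names -- is to append a time-stamped copy of each record on every change, rather than overwriting the record in place:

```python
from datetime import datetime, timezone

# Hypothetical append-only history store: each save adds a new
# time-stamped version instead of overwriting the previous one.
history = []

def save_record(record_id, fields):
    history.append({
        "record_id": record_id,
        "valid_from": datetime.now(timezone.utc).isoformat(),
        **fields,
    })

def as_of(record_id, timestamp):
    """Return the latest version of a record at or before `timestamp`."""
    versions = [h for h in history
                if h["record_id"] == record_id and h["valid_from"] <= timestamp]
    return versions[-1] if versions else None

save_record("acct-1", {"status": "trial"})
save_record("acct-1", {"status": "paid"})
```

With this pattern, a point-in-time query like the one the spokesperson described becomes a simple filter over the version history.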

This option to preserve data history might seem convenient, at least so far as the customer is concerned. But there are myriad reasons why organizations should consider taking on, and owning, the responsibility of preserving historical data themselves.

The following is a quick-and-dirty exploration of considerations germane to the problem of preserving, managing and enabling access to historical data produced by cloud applications and services. It is not in any sense an exhaustive tally, but it does aspire to be a solid overview.

If you are moving to the cloud, you need a plan to preserve, manage and use the data produced by your cloud apps and services.

The good news is that it should be possible to recover historical data from extant sources. Back in the early days of decision support, for example, recreating data history for a new data warehouse project usually involved recovering data from backup archives, which, in most cases, were stored on magnetic tape.

In the cloud, this legacy dependency on tape may go away, but the process of recreating data history is still not always straightforward. For example, in the on-premises environment, it was not unusual for a backup archive to tie into a specific version of an application, database management system (DBMS) or operating system (OS). This meant that recovering data from an old backup would entail recreating the context in which that backup was created.

Given the software-defined nature of cloud services, virtual abstraction on its own does not address the problem of software dependencies. So, for example, in infrastructure as a service, you have the same dependencies (OS, DBMS, etc.) as you did in the on-premises data center. With respect to platform as a service (PaaS) and software as a service (SaaS), changes to newer versions of core cloud software (e.g., deprecated or discontinued APIs) could also complicate data recovery.

The lesson: Develop a plan to preserve and manage your data history sooner rather than later.

You should still have a plan. When you use your provider's offerings to preserve data history, it creates an unnecessary dependency. That is, do you really own your data if it lives in the provider's cloud services?

Moreover, your access to your own data is mediated by the tools and APIs -- and the terms of service -- that are specified by your cloud provider. But what if the provider changes its terms of service? What if you decide to discontinue use of the provider's services? What if the provider is acquired by a competitor or discontinues its services? How much will it cost you to move your data out of the provider's cloud environment? What formats can you export it in?

In sum: Are you comfortable with these constraints? This is why it is incumbent upon customers to own and take responsibility for the historical data produced by their cloud apps and services.

Even in the era of data scarcity -- first, scarcity with respect to data volumes; second, scarcity with respect to data storage capacity -- savvy data warehouse architects preferred to preserve as much raw historical data as possible, in some cases using change-data capture (CDC) technology to replicate all deltas to a staging area. Warehouse architects did this because having raw online transaction processing (OLTP) data on hand made it relatively easy to change or to maintain the data warehouse. For example, they could add new dimensions or rekey existing ones.
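A minimal sketch of that CDC pattern in Python (the table, event and staging names are hypothetical): each change event from the OLTP source is appended verbatim to a staging area, preserving every delta rather than just the latest state:

```python
from datetime import datetime, timezone

staging = []  # stand-in for a staging area (e.g., cloud object storage)

def on_change(table, operation, row):
    """Handle one CDC event: append the delta, never overwrite."""
    staging.append({
        "table": table,
        "op": operation,          # "insert", "update" or "delete"
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "row": dict(row),
    })

on_change("orders", "insert", {"order_id": 7, "status": "new"})
on_change("orders", "update", {"order_id": 7, "status": "shipped"})

# Both versions of order 7 survive in the staging area, so the
# warehouse can be rekeyed or re-dimensioned from raw history later.
```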

Today, this is more practicable than ever, thanks to the availability (and cost-effectiveness) of cloud object storage. It is likewise more necessary than ever, due to the popularity of disciplines such as data science and machine learning engineering. These disciplines, along with traditional practices such as data mining, typically require raw, unconditioned data.

A caveat, however: If you use CDC to capture tens of thousands of updates an hour, you will ingest tens of thousands of new, time-stamped records each hour. Ultimately, this adds up.

The lesson is that not all OLTP data is destined to become historical. If for some reason you need to capture all updates -- e.g., if you are using a data lake to centralize access to current cloud data for hundreds of concurrent consumers -- you do not need to persist all these updates as part of your data history. (Few customers could afford to persist updates at this volume.) What you should do is persist a sample of all useful OLTP data at a fixed interval.
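One way to implement that fixed-interval sampling (a sketch, with illustrative event shapes and an assumed hourly interval): bucket incoming updates by record and interval, keeping only the last update seen per record in each window, so high-frequency churn collapses to one persisted snapshot per interval:

```python
# Keep one snapshot per record per interval instead of every update.
# The interval length and the (timestamp, id, fields) event shape
# are illustrative assumptions, not a specific product's API.

def sample_updates(updates, interval_seconds=3600):
    """updates: iterable of (epoch_seconds, record_id, fields).
    Returns the last update per record within each interval bucket."""
    latest = {}
    for ts, record_id, fields in updates:
        bucket = int(ts // interval_seconds)
        latest[(bucket, record_id)] = (ts, fields)  # later updates win
    return latest

updates = [
    (100, "r1", {"qty": 1}),
    (200, "r1", {"qty": 5}),   # same hour: supersedes the first
    (4000, "r1", {"qty": 9}),  # next hour: a new snapshot
]
sampled = sample_updates(updates)
```

Three raw updates collapse to two persisted snapshots, one per hourly window.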

On its own, data produced by a cloud application or service can be queried to establish a history of how it has changed over time. Data scientists and ML engineers can trawl historical data to glean useful features, assuming they can access the data. But data is also useful when it is combined with (historical) data from other services to create different kinds of multidimensional views: you know, analytics.

For example, by combining data in Salesforce with data from finance, logistics, supply chain/procurement, and other sources, analysts, data scientists, ML engineers and others can produce more useful analytics, design better (more reliable) automation features and so on.
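As a toy illustration of that kind of combination (all source and field names here are hypothetical), joining per-customer sales history with finance data yields a wider view than either source alone:

```python
# Hypothetical extracts from two cloud sources, keyed by customer ID.
sales = {"c1": {"bookings": 120}, "c2": {"bookings": 80}}
finance = {"c1": {"invoiced": 100}, "c2": {"invoiced": 80}}

# Join the two sources into one multidimensional view per customer.
combined = {
    cid: {**sales[cid], **finance.get(cid, {})}
    for cid in sales
}
```

In practice this join would run in a warehouse or lakehouse over full histories, but the principle is the same: the combined record answers questions neither source can answer by itself.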

By linking sales and marketing, finance, HR, supply chain/procurement, logistics, and other business function areas, executive decision makers can obtain a complete, synoptic view of the business and its operations. They can make decisions, plan and forecast on that basis.

This just scratches the surface of historical data's usefulness.

The purpose of this article was to introduce and explore the problem of capturing and preserving the data that is produced by cloud apps and services -- specifically, the historical operational data that typically gets overwritten when new data gets produced. As discussed above, there are several reasons organizations will want to preserve and manage this data.

There is another reason that organizations will want to capture and preserve all the data their cloud apps produce, however. Most SaaS apps (and even many PaaS apps) are not designed for accessing, querying, moving or modifying data. Rather, they are designed to be used by different kinds of consumers who work in different types of roles. The apps likewise impose constraints, such as API rate limits and, in some cases, per-API charges, that can complicate the process of accessing and using data in the cloud.

In a follow-up article, I will delve into this problem, focusing specifically on API rate limits.

