Article

Data Warehouse vs. Operational Database! What? How? Which One?

•

December 16, 2022

•

10 min read

Have you ever wondered: What's the fundamental difference between a data warehouse and an operational database? If so, this article is written for you!

On the surface, data warehouses and operational databases provide similar functionalities. However, they are optimized for different use cases. Data warehouses are generally optimized for OLAP (online analytical processing) workloads, while operational databases are generally optimized for OLTP (online transactional processing) workloads.

In this article you will learn about OLAP and OLTP workloads, and will see the fundamental architectural difference that makes data warehouses preferred for OLAP workloads and operational databases preferred for OLTP workloads.

This article covers the following topics:

A brief review of background information that provides a basis for understanding the architectural decisions made for operational databases and data warehouses. This includes an overview of latency numbers that every programmer should know, as well as a high-level overview of how pages of data are read from disk into memory.
An example of an OLTP workload, with an explanation of how row-oriented storage makes operational databases good for such workloads.
An example of an OLAP workload along with an explanation of how column-oriented storage makes data warehouses good for such workloads.
A review benchmark results that compare a popular OLTP database with an up-and-coming OLAP data warehouse.
A summary of when an operational database should be used and when a data warehouse may be a better choice.
The benefits of using both an operational database and a data warehouse.
How Airbyte may be used to synchronize data from an operational database to a data warehouse as part of your overall data integration strategy.

By the end of this article you will understand the fundamental difference between a data warehouse and an operational database, and you will be well positioned to determine if your requirements are best served by a data warehouse, an operational database, or a combination of both of these technologies.

Background information

In order to understand the fundamental difference between a data warehouse and an operational database, it is useful to understand design decisions and trade-offs that have been made for each one. Furthermore, in order to understand the underlying reasoning for these design decisions, it is helpful to have a general understanding of latency numbers and a few basic operating system concepts.

A review latencies

The latency numbers every programmer should know table, produced by Jonas Bonér in 2012 can be used as a rough guide to help understand the time required for accessing data depending on where it is located.

ℹ️ Even though current (2022) latency numbers may be different than those from 2012, the relative orders of magnitude should still hold true and help understand the trade-offs that will be discussed later in this article.

The latency numbers from the linked article have been copied into the image below:

The orders of magnitude differences between different kinds of data access are worth paying attention to. For example, loading data from an SSD is more than 3 orders of magnitude (more than 1000x) slower than referencing main memory, and a disk seek on a spinning disk is 5 orders of magnitude (100,000x) slower than referencing data that is in memory.

The above latencies should make it clear that there is a huge performance advantage to minimizing disk access.

One optimization to reduce the number of times data needs to be read from disk is to ensure that the working set of data is efficiently loaded into memory. This is covered in more detail in the next section.

How data is loaded into memory?

Operational databases and data warehouses generally store much more data on disk than can possibly fit into memory. Therefore, they rely on the underlying operating system to read different portions of data from disk into memory as required. The operating system makes use of memory paging to read and write data. When new data is requested, a page is retrieved from disk and loaded into memory. Pages range in size from 4KiB to 2MiB depending on how the system is configured.

Because data is read as pages, this means that even in the case that only a tiny piece of data is required from some portion of a disk, a minimum 4KiB of data would be read – and potentially much more than that, depending on the system configuration. Therefore, without careful consideration of how data is stored on disk, memory paging may result in a high percentage of useless data being loaded into memory alongside the requested information.

In the case that the pages that are retrieved from disk contain a low density of useful data, then many more pages need to be read from disk than would otherwise be the case. Furthermore, if a large number of pages are read into memory (as would ultimately be the case if most of the data in each page is irrelevant) then the server will quickly run out of memory capacity, at which point, some of the data in memory will be paged back to disk (written to disk) to make space for alternative pages in memory. Because of the relative slowness of disk operations, this will have a negative impact on performance, and in the worst case can lead to thrashing.

Both data warehouses and operational databases are designed to ensure that the data that is returned in each page has a high density of useful information by following the principle of locality based on the expected data access patterns. Because data warehouses and operational databases are designed for different kinds of workloads with different data access patterns, they store data on disk in distinct ways.

What is an operational database?

In this section I discuss what an operational database is, provide an example of the kind of workloads that operational databases are designed to be used for, and show how data is stored in an operational database to minimize disk interaction for the expected workload.

A brief definition of an operational database

An operational database is a storage technology in which data is (often) stored as a structured collection of tables that are organized as rows and columns. Each row in an operational database represents a record.

Operational databases are designed for OLTP (Online Transactional Processing) workloads, which often involve many users executing a high number of small atomic transactions. In other words, operational databases are optimized for high-throughput transactional processing of records.

Examples of popular operational databases are Oracle, MySQL, and PostgreSQL.

An example operational database workload

An example of a typical operational database workload could be the storage and tracking of inventory information for a retailer. Required columns are SKU, Name, Inventory count, Price, etc. This could look as follows:

Example data that may be stored in an operational database

In a system built on top of this operational database, if a client purchases Bread, then a cashier would scan the SKU (Id1), which would retrieve the corresponding record from the operational database to get the product name (Bread) and price ($1.50), and it would then decrement the inventory count (50 to 49).

In this example several fields within a single record have been accessed to process this purchase, but notably the values located in other records were not required. Operational databases are therefore optimized to ensure that the contents of a given record can be efficiently loaded from disk and efficiently processed. For a large retailer, millions of such transactions may need to be processed each day.

Experience Hassle-Free Data Ingestion into Your Data Platforms with Airbyte

Try FREE for 14 Days

Operational database storage

Operational databases store their data in a row-oriented manner which means that the values associated with a single record are stored together on disk. Storing data in this manner corresponds to the most common data access and usage patterns associated with OLTP workloads, such as the inventory/purchase example just given. This minimizes the number of pages that need to be read from disk when performing an OLTP operation. As should be clear from the latency numbers discussed earlier, keeping disk activity to a minimum is key to providing good performance.

Assuming that the database stores its data on a solid-state drive (SSD), a row-oriented/record-by-record format on disk may look as follows:

How an operational database may store row-oriented data on a solid-state (SSD) drive

Alternatively, if an operational database uses spinning disks, the data storage may look similar to the following:

How an operational database may store row-oriented data on a spinning hard disk drive (HDD)

Limitless data movement with free Alpha and Beta connectors

Introducing: our Free Connector Program ->

Limitless data movement with free Alpha and Beta connectors

Introducing: our Free Connector Program ->

What is a data warehouse

In this section I discuss what a data warehouse is, give an example of the kind of workload that data warehouses are designed to be used for, and show how data is stored in a column-oriented format on disk to efficiently satisfy the expected workload.

A brief definition of a data warehouse

As is the case for an operational database, a data warehouse is also a storage technology in which data is (often) stored as a structured collection of tables that are conceptually organized as rows and columns. Because of the similarities with databases, data warehouses are sometimes referred to as OLAP databases.

Each row in a data warehouse represents a record. However, as opposed to operational databases, data warehouses are generally designed for OLAP (Online Analytical Processing) workloads which involve running a small number of large analytical operations. These analytics operations often only require data from a subset of the available columns, but they may need most (or all) of the values within the relevant columns.

Examples of popular data warehouses are BigQuery, Redshift, and IBM Db2 Warehouse.

An example data warehouse workload

Using the same data presented in the database example above, a table stored in a data warehouse looks the same – but an important difference is how the data is stored on disk. The different colors represent the values in each column that will be stored side-by-side on disk. This will be discussed in more detail below, when the disk layout is covered.

Example data that may be stored in a data warehouse

A data warehouse is optimized for analytics operations, such as efficiently finding the average price of every product in inventory. Such an operation would need to load all of the prices from disk into memory and process all of these prices in order to compute the average. Data warehouses are designed to efficiently load data for one or more columns from disk into memory, and to efficiently process a large number of values from within each column.

Data warehouse storage

In a traditional row-oriented operational data store, entire rows of data must be read even if a particular query requests data from only a few columns.

On the other hand, data warehouses often store their data in a column-oriented manner which means that the values contained in a given column are stored together on disk. Storing the data in this manner corresponds to the most common data access and usage patterns associated with OLAP workloads, such as the calculation of the average price across all products that was just discussed.

Storing data in this manner minimizes the number of pages that need to be read from disk when performing OLAP operations. The column-oriented storage of data is one of the key architectural differences of data warehouses versus operational databases, and is a big reason why analytics operations, which often retrieve and process many values within a given column, are super efficient in data warehouses. As should be clear from the latency numbers discussed earlier, keeping disk activity to a minimum is a key factor for providing good performance.

Assuming that data is stored on a solid-state drive (SSD), data warehouses often logically store their data in a column-oriented format, such as the following:

How a data warehouse may store column-oriented data on a sold-state drive (SSD)

And if a data warehouse is using spinning disks, the data storage may look as follows:

How a data warehouse may store column-oriented data on a spinning hard disk drive (HDD)

More about column-oriented data

Column-oriented data storage is a large factor in the ability of data warehouses to efficiently process OLAP workloads. In addition to reducing the amount of disk operations, other benefits of column-oriented storage are improved data compression and the ability to leverage parallel processing. Because of these advantages, several open source column-oriented storage formats have appeared.

Data compression

Each column contiguously stores a single type of data, which means that data warehouses can efficiently leverage various data compression techniques. Compressing data not only reduces storage costs, it can improve performance by reducing the number of disk operations.

Columnar-vectorized processing

The column-oriented storage format used by data warehouses allows them to efficiently leverage modern SIMD (Single Instruction Multiple Data) computer architectures for columnar-vectorized processing. In this kind of processing, a batch of values (a “vector”) are processed in parallel in a single operation, which accelerates processing of OLAP workloads.

Open column-oriented data formats

Because of the benefits of column-oriented data storage for OLAP workloads, several open formats have appeared. These include:

OLTP database performance versus OLAP data warehouse performance

While it is difficult to provide a generic and universally accurate comparison of the performance of data warehouses versus operational databases, it is possible to find comparisons of specific systems. For example, a paper called SQLite: Past, Present, and Future has been published which compares SQLite (an in-process/embedded OLTP database) versus DuckDB (an in-process/embedded OLAP data warehouse). The conclusion of this paper is the following:

SQLite out-performs DuckDB on a write transactions benchmark by 10x-500x on a powerful cloud server and 2x-60x on a Raspberry Pi, for small to large databases.
For analytical benchmarks using the SSB (Star Schema Benchmark) DuckDB out-performs SQLite by up to 30-50x and as low as 3-8x.

These results confirm what you would expect to see: DuckDB (OLAP) does well when performing analytics operations, and SQLite (OLTP) performs well for transactional workloads.

Do you need an operational database or a data warehouse?

As you have just seen, OLTP databases are good at some workloads, and OLAP data warehouses are good for other workloads. Below I summarize the factors that will help you determine if you should use an operational database or a data warehouse. I also briefly touch on the possibility that neither of these technologies may be suitable for your data requirements.

When to use an operational database?

Operational databases are designed for OLTP (online transaction process) workloads including the following:

Thousands simultaneous users
Small, atomic transactions
Reliability – often designed as an operational systems that acts as the source of truth
Realtime fast and efficient CRUD operations (create, read, update, and delete)
Efficient data storage
Storing current data – may make historical analysis impossible

When to use a data warehouse?

Data warehouses are good for OLAP (online analytical processing) workloads such as the following:

A small number of users, each of which may execute heavy analytics workloads
Deep analysis with complex queries across large data sets
Downtime is permitted – generally not used as the one-and-only operational system
Denormalized (duplicate) data in order to improve the efficiency of analytics operations
Storing of historical data for deeper insights

To learn more, you can also check out our article comparing Data Warehouse vs Database to understand their scope and purpose in-depth!

Do you want the benefits of both an operational database and a data warehouse?

You may have read through the above lists and thought “I want to support both transactional and analytics workloads! ” – if so, you may wish to run a data warehouse and an operational database side-by-side. In this case you will need a way to keep the data synchronized between your operational database and your data warehouse, this is discussed later in this article.

Do you need something totally different?

If you need to store huge amounts of unstructured data, then you may be dealing with data that doesn’t fit well into either a row-oriented operational database or into a column-oriented data warehouse. In this case, you may consider storing your data in a data lake or a lakehouse. These alternatives will be discussed in more detail in a future article.

How can you sync data between an operational database and a data warehouse?

If you have decided that you require both a data warehouse and an operational database in order to efficiently handle both OLAP and OLTP workloads, you will want to synchronize data from your operational database into your data warehouse. This can be achieved with ELT tools such as Airbyte, which are designed for database replication into data warehouses! A few examples of replicating data from databases into data warehouses can be found in the tutorials section of Airbyte’s website, which includes tutorials such as the following:

Conclusion

In this article you have learned the fundamental difference between data warehouses and operational databases. This included an overview of the underlying architectural differences between these technologies, and how they are optimized for different workloads. You have also seen why it may be beneficial to run both an operational database and a data warehouse side-by-side. If this is your case, then you can make use of a data integration tool such as Airbyte to handle the synchronization of data between these systems.

I briefly touched on the possibility that neither an operational database nor a data warehouse is the right choice for your data storage – in some cases a data lake or a lakehouse is a more appropriate technology choice. Keep your eye on Airbyte’s blog for a discussion about these alternatives in a future article.

If you are looking to dive deeper into Airbyte, you may be interested in Airbyte tutorials or Airbyte’s blog. You may also consider joining the conversation on our community Slack Channel, participating in discussions on Airbyte’s discourse, or signing up for our newsletter.

The data movement infrastructure for the modern data teams.

Try a 14-day free trial

About the Author

Alex is a Data Engineer and Technical writer at Airbyte, and was formerly employed as a Product Manager and as a Principal Consulting Architect at Elastic. Alex has an MBA from the IESE Business School in Barcelona, and a Masters degree in Electrical Engineering from the University of Toronto. Alex has a personal blog at https://alexmarquardt.com.