Imagine you’re replicating a database containing millions of records for a fast-growing online store to a data warehouse for analysis. Your data replication solution has to be fast, efficient, and cost-effective.
In this tutorial, you’ll learn how to use Airbyte to transfer data from an OnLine Transactional Processing (OLTP) database, PostgreSQL, to an OnLine Analytical Processing (OLAP) data warehouse, Snowflake. You’ll also learn how to deal with edge cases, like partitioned tables and custom data types.
Before you start replicating data, it's important to understand what OLTP and OLAP mean.
OLTP is a form of data storage and processing that executes a number of concurrent transactions. OLTP workloads are optimized for database inserts. Example application areas of OLTP include online banking, online shopping, order entry, and sending text messages. OLTP follows ACID principles. In basic terms, this means a transaction to the database either completes or fails.
OLAP is a form of data storage and processing that focuses on storing denormalized data to make querying easier when carrying out data analysis. OLAP systems are optimized for data analysis. This form of storage is used in data warehouses and data lakess. Data analysts can make queries on historical data from an OLAP environment to make better analytical decisions.
Snowflake is an OLAP is a cloud-based data warehouse. There is no hardware to configure, no software to install, and no maintenance required. Snowflake ships with an integrated data warehouse, as well as analytics tools for running OLAP workloads. These features make Snowflake a great choice for OLAP workloads.
There are two main sync modes to replicate data from OLTP to OLAP systems.
Airbyte is a data integration tool that allows you to replicate data to data warehouses, lakes, and databases. Airbyte supports both full refresh and incremental syncs. In this tutorial, you will use full refresh mode to replicate data from PostgreSQL to Snowflake.
Sowflake supports a rich set of data types, including semi-structured data types such as JSON and XML. This means that most tables won’t require extra data type conversion. For tables with data types that Snowflake doesn’t support, Airbyte converts that data to an Airbyte-specific string type which ultimately ends up as a VARCHAR type on Snowflake.
Airbyte streams data from a source, builds it into temp files, and delivers the data to the destination. You can choose to use your local machine for storing temp files as part of the replication process, or use an external staging area like Amazon S3 or Google Cloud Storage (GCS). An external staging area offers faster replication speeds than your local machine.
Before you can start replicating data, you’ll need to ensure the following conditions are met:
Now that you understand the prerequisites, you’ll go over the steps of replicating a simple PostgreSQL database to Snowflake with Airbyte.
The first step is to create and populate a local PostgreSQL instance. You’ll need to clone the below GitHub repo. The repo contains all the code and files you’ll use throughout the tutorial.
After cloning the repo, navigate to the clone repo. Then create a new database.
After creating the database, run the command below to create a table, movies and populate it with data.
Now, enter the psql shell as user postgres, psql -U postgres, then connect to the airbyte_tut database, \c airbyte_tut. Next, create a dedicated read-only user that will have access to the relevant tables in the airbyte_tut database.
Then, give the Airbyte user access to the airbtye_tut database.
When creating a Snowflake account you’ll need to pick a Snowflake edition and a cloud provider as part of the account creation process. Next, you’ll receive an email containing your login URL similar to https://id.region.cloud-provider.snowflakecomputing.com/console/login. Make sure to bookmark this URL for future logins.
After you create your username and password you will be redirected to the Snowflake dashboard after your account has been successfully created. The worksheet area will be the primary place you’ll run scripts for creating and modifying resources.
You need to create the destination database, user, role, and schema on Snowflake where the sync will occur. Airbyte already provides a script for running this automation. Paste the snippet below into your worksheet area on Snowflake. Make sure to change the airbyte_password variable to your preferred password before running the script.
Select All queries and run the script.
Ensure you have Docker Compose on the environment where you want to run Airbyte. Now, run the commands below to initiate Airbyte setup:
You can then open your app on http://localhost:8000. After inputting your email, you’ll be presented with an onboarding experience to create a first Airbyte connection, starting with the source.
Select Postgres as your source and set up the connection parameters as explained in the Postgres source documentation.
After inputting the name and destination type, you’ll be presented with several required fields that Airbyte will use to connect to the Snowflake destination. Fill in the form with the default details.
When you get to the loading method section, select your preferred loading method and input the required details. The default loading method, uses your local machine to generate the temp files. An example using AWS S3 is shown below.
After verifying your connection, you’ll be redirected to set it up. Choose a manual sync frequency and set the namespace configuration to Destination Connector Settings.
Ensure the `movies` table shows up. If it doesn’t, check that the table is present in the `airbyte_tut` database and refresh the schema. Complete the process by clicking Set up connection.
Then you can On your Airbyte dashboard, navigate to the Connections page and click on the first connection.
This will redirect you to a page with more details about that connection. Click Sync now to start synchronizing data.
This action will trigger a sync to Snowflake. When the sync completes, 18 records will be pushed to Snowflake.
Next we will cover three special cases to replicate partitioned tables, custom data types and working with geospatial data.
Replicating tables with thousands of rows can be a slow process, so tables are often broken down into logical units called partitions. Using Airbyte, it’s possible to partition the tables by user-signup date, then push each of these partitions to Snowflake.
To begin, navigate to the tutorial folder and execute the table-partitioning.sql file. This loads data into a users table and creates two partitions. To replicate this partitioned table to Snowflake, you have two options:
1. Export the table as a single view.
2. Export each partition as separate files.
The first option involves selecting the users table only. The second involves selecting each partition.
It’s important to know that privileges applied to a user don’t affect partitioned tables after the privileges have been applied, you have to reassign privileges. Simply rerun the original command you used to assign privileges.
The advantage of exporting each partition is the speed of parallel execution is that each table will be replicated in its own stream. After the replication process is complete, the tables can then be combined into a single table on Snowflake.
To get each partitioned table into Snowflake, navigate to the Settings tab on the Airbyte-Snowflake connection. Click Update latest source schema to load the new tables.
Finally, save the changes to restart the migration.
Custom data types (also called user-defined types) in PostgreSQL are abstractions of the basic data types. A custom data type could be a basic data type like VARCHAR, with the addition of a check to ensure that imputed data has no spaces. An example is a countries table defined by the following query.
You could define a custom data type for the code field in the table using the CREATE DOMAIN statement.
You can then use this data type in the normal way.
Snowflake does not support user-defined types, so Airbyte converts all user-defined types to VARCHAR before replicating the data to Snowflake to handle this. To demonstrate the process of replicating custom data types to Snowflake, navigate to the tutorial folder and load a table with user-defined types.
After the table is created and populated with data, update your schema with the latest table.
Save the configuration to restart the sync process. When the sync succeeds, check Snowflake for the new data.
You’ll notice that the data type of the code column is VARCHAR. This is due to the transformation that occurred on the data.
PostgreSQL has support for unstructured data through: JSON, JSONB, HStore and XML. Fortunately, Snowflake supports JSON and XML data types. The other two data types, JSONB and HStore, are converted to the VARCHAR data type.
Most modern applications and systems utilize one form of spatial data or another. Spatial data is usually represented as using a coordinate system like the cartesian or spherical systems. For PostgreSQL databases, PostGIS is used to enable spatial data storage. It enables PostgreSQL databases to store and query location related data such as longitude and latitude through its custom data types.
PostGIS offers the following data types: Box2d, Box3d, Geometry, Geometry_dump, Geography.
The Geography data type is the most important data type offered by PostGIS, it provides a way to store longitude and latitude data on PostgreSQL. This is done using the Point type, which takes in longitude and latitude as inputs.
Snowflake supports the Geography data type, meaning no typecasting will occur during replication. The replication process is the same as before.
Create a table containing airports and their coordinates by running the airports.sql file from the tutorial folder:
Update your schema on Airbyte. Notice the geography_columns and geometry_columns tables that were added. These tables originated from the PostGIS extension.
Sync the data using Airbyte, then view the replicated data on Snowflake.
Geography data is converted to VARCHAR by default, but Snowflake has a SQL function, TO_GEOGRAPHY, for converting this VARCHAR into the Geography data type.
Data warehouses like Snowflake are optimized for analytical workloads. For this reason, it's common to replicate data from transactional databases like PostgreSQL.
Airbyte makes it easy to replicate data from PostgreSQL to Snowflake. However, it's important to know what to expect from some special cases, like replicating partitioned tables and how custom data types are mapped.
Osinachi Chukwujama is a web developer and technical writer. He enjoys building backend applications and utilizing cloud computing. He plays the organ and creates casual animations when he isn't coding.
Start breaking your data siloes with Airbyte.
Learn how to stream changes from a MySQL database to Kafka using Change Data Capture (CDC).
Learn how to modify the dbt code used by Airbyte to partition and cluster BigQuery tables.