Amazon Redshift – the data warehouse solution from AWS – can be configured as a cluster in many ways, using different types and counts of underlying nodes. When designing any database solution, storage is one of the most important factors. In this article, I would like to describe how to reduce Redshift's storage requirements by leveraging the Redshift Spectrum feature.
What is Redshift Spectrum?
Redshift Spectrum is a feature that enables executing SQL queries on structured and semi-structured data in Amazon S3 (as if it were a regular table) without the need to load it into the data warehouse's storage.
AWS documentation describes it as capable of highly parallel execution, running on dedicated servers (the so-called Redshift Spectrum layer) independent of the Redshift cluster. It is so scalable that it can use thousands of instances to apply massively parallel processing to large datasets. As the major part of the processing happens in the Redshift Spectrum layer, only a small portion of the S3 data is transferred to the cluster – this is extremely important for efficiency, especially when multiple clusters query the same dataset at the same time.
What are the benefits? You can tailor the size of your Redshift cluster for frequently accessed data and make the remaining available via Redshift Spectrum at low cost.
Redshift Spectrum supports the following file formats:
- Apache Parquet
- Apache ORC
- Avro
- JSON
- Amazon Ion
- text files (CSV, TSV)
- SequenceFile
- RCFile
- Grok
- RegexSerDe
- OpenCSV
Although, as you can see, there are quite a lot of them, only two ensure good performance and low cost: the columnar storage formats Apache Parquet and Apache ORC. Spectrum pricing depends on the amount of data scanned, and since Spectrum reads only the columns included in the query, columnar formats seem to be the best option.
To get even better query performance and lower cost, it is advised to use data partitioning. Here the decision is really data specific; usually you should select frequently filtered columns to form partitions.
It is important to note that in this case more does not necessarily mean faster, so be careful while choosing them. How do partitions affect performance? As partitions are mapped to S3 paths, they help prune unnecessary files while scanning S3.
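As a sketch of how pruning works in practice (the schema, table, and column names here are hypothetical, not from this article), a filter on partition columns lets Spectrum skip whole S3 prefixes:

```sql
-- Assuming an external table partitioned by year and month:
-- only files under .../year=2021/month=3/ are scanned,
-- all other partitions are pruned before any data is read.
SELECT count(*)
FROM spectrum_schema.events
WHERE year = 2021
  AND month = 3;
```

Filters on non-partition columns, by contrast, still require scanning every file under the table's location.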
External databases & tables
All CREATE queries that you execute for Redshift Spectrum result in changes to the AWS Glue Data Catalog. All external schemas and tables are available there for you to view. What is more, as long as they are there, you can also query them using Amazon Athena. The Glue Data Catalog is not the only option for external databases – if you are more comfortable with a Hive metastore, there is also an option to use it.
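A minimal setup might look as follows (the schema, database, table names, and S3 location are placeholders for illustration; substitute your own role ARN):

```sql
-- Register an external schema backed by the Glue Data Catalog;
-- the external database is created if it does not exist yet.
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::<aws-account-id>:role/<role-name>'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Define an external table over Parquet files stored on S3.
CREATE EXTERNAL TABLE spectrum_schema.events (
    event_id BIGINT,
    payload  VARCHAR(256)
)
PARTITIONED BY (year INT, month INT)
STORED AS PARQUET
LOCATION 's3://example-bucket/events/';
```

Once created, the schema and table show up in the Glue Data Catalog and are queryable from both Redshift and Athena.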
Redshift Spectrum vs Athena
Tables in the Glue Data Catalog can be queried using both Athena and Redshift Spectrum. Which one should you use then? It really depends on the stack you are currently using and the use case you have.
If you already have a Redshift cluster, Redshift Spectrum is probably a good way to go, as it requires very little additional setup. Using Redshift together with Redshift Spectrum lets you introduce a lifecycle for your data. Another advantage of this combination is late binding views – they enable you to merge Redshift and Redshift Spectrum data into a single view.
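A late binding view could be sketched like this (the table names are made up; the key part is the WITH NO SCHEMA BINDING clause, which is required for views referencing external tables):

```sql
-- Merge "hot" data kept in the cluster with "cold" archived
-- data exposed through Redshift Spectrum into one view.
CREATE VIEW sales_all AS
SELECT * FROM public.sales_recent
UNION ALL
SELECT * FROM spectrum_schema.sales_archive
WITH NO SCHEMA BINDING;
```

Clients can then query sales_all without knowing which rows live in the cluster and which live on S3.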
If you already have your data on S3 and are looking for an easy and fast way to do some analytics, then Amazon Athena is what you are looking for. Athena also does really well in the long term for data lake querying – just remember the best practices!
Redshift data unload
Redshift Spectrum schema and table setup has been thoroughly described in the official AWS documentation – it's really great, but it makes one assumption: the data is already in S3. This is why in this article I will concentrate on unloading data from Redshift to S3 to make it available to Redshift Spectrum for querying.
Redshift data can be dumped to S3 using the UNLOAD query; here is its general shape:
UNLOAD ('select-statement')
TO 's3://object-path/name-prefix'
IAM_ROLE 'arn:aws:iam::<aws-account-id>:role/<role-name>'
PARTITION BY ( column_name [, ... ] ) INCLUDE
ALLOWOVERWRITE
[PARQUET | CSV];
Let’s decompose it:
TO 's3://object-path/name-prefix' – here you have to provide the bucket name and the prefix of the files that will be written to S3. A fun fact is that you cannot specify the exact file name. The number of files depends on the PARALLEL parameter value and the size of the unloaded data. The maximum file size is 6.2 GB.
IAM_ROLE 'arn:aws:iam::<aws-account-id>:role/<role-name>' – the IAM role has to have permission to save data to the selected S3 bucket, and it has to be attached to the Redshift cluster.
PARTITION BY ( column_name [, ... ] ) – here you can specify which columns will build up the partitions. The order is meaningful, as it is reflected in the S3 path structure, which looks as follows:
s3://object-path/name-prefix/column1=value1/column2=value2/...
If you use INCLUDE in your statement, the partition columns will be included in the Parquet files as well.
ALLOWOVERWRITE – if you don't include it in your statement, the unload operation will fail if there are already any files under the target location.
[PARQUET | CSV] – currently the unload operation supports only these two data formats.
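Putting the pieces together, a complete statement might look like this (the table, bucket, and role names are made up for illustration):

```sql
-- Unload 2020 sales data to S3 as Parquet, partitioned by month;
-- any existing files under the prefix will be overwritten.
UNLOAD ('SELECT * FROM public.sales WHERE sale_year = 2020')
TO 's3://example-bucket/sales-archive/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
PARTITION BY (sale_month) INCLUDE
ALLOWOVERWRITE
PARQUET;
```

After the unload finishes, the same data can be registered as an external table pointing at s3://example-bucket/sales-archive/ and queried via Redshift Spectrum, while the original rows can be deleted from the cluster to free storage.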
To sum up, Redshift Spectrum introduces a great opportunity for your Redshift cluster’s data lifecycle.
The lifecycle process and Redshift Spectrum setup are really simple and do not require any extra maintenance overhead – just remember to keep your data partitioned and your schemas consistent. Want to cut your Redshift cluster expenses? Don't wait, create your external tables!
Author: Mateusz Ziarko, DevOps Engineer @ Appliscale