Scalable Data Modeling with Redshift

One of the major challenges of building an advanced bidding and reporting platform is dealing with the large amounts of data we see come in and out of our system. Our database grows rapidly on a day to day basis. So we must ensure that we have enough storage to manage it as well as a powerful enough computing solution to handle the workload.

Background

amazon-aws-logo
We began this adventure on Amazon RDS MySQL and it has thus far served us well. However, as the database has grown over time, we’ve noticed our aggregations and analytics have begun to take longer and longer. To offset this, thus far, we’ve offloaded this work to background processes that run daily after we’ve received the day’s data. This way the aggregations are not live, but are pre-built in aggregation tables. Even with this solution in place, however, this becomes more and more cumbersome as time goes on, to the point that we feel we are nearing the limits of MySQL and have begun to look to other solutions. This has brought us to Amazon Redshift. Others have chronicled their efforts, so we thought we would provide some details of our journey as well.

To Redshift…

Amazon Redshift is a fork of PostgreSQL that is highly customized to handle large-scale data. In fact, they market to petabyte-scale data warehouses. While our data warehouse hasn’t quite reached this scale, we thought this would be a good move to make, however we have made a few discoveries in the process that have changed the way we approach it as we have dug deeper into this solution.

Our initial findings in query performance were significant:

mysql> SELECT COUNT(1) FROM sample_table;
+------------------+
|  COUNT(1)    |
+------------------+
| '224367613'  |
+------------------+
1 row in set (137.381 sec)

postgres=# SELECT COUNT(1) FROM sample_table;
+------------------
|  COUNT(1)    |
+------------------+
| '224367613'  |
+------------------+
1 row in set (1.49 sec)

Over two minutes for a simple count has been cut down to one and a half seconds. And we have seen even greater results in our heavier analytics queries, up to 50x-300x improvement in some areas.

One of the issues we ran into is how Redshift handles indexes much different than a traditional RDBMS. While it supports primary keys, foreign keys, and unique keys, it does not support regular non-unique indexes. And while it does take the supported indexes into account when creating the query plan, it does not enforce any of the constraints. So it does not ensure uniqueness for either unique or primary keys which means this has to be handled on the application level. It also does not support referential actions such as:

ON DELETE SET NULL
ON UPDATE CASCADE

So referential integrity also has to be ensured on a separate level. Aside from these index changes, Redshift also introduces two new options, DISTKEY and SORTKEY. A simple table definition may look like this:

CREATE TABLE sites

site_id INTEGER IDENTITY(1, 1),
name VARCHAR(30),
address VARCHAR(128),
PRIMARY KEY(site_id)

)
DISTKEY(site_id)
SORTKEY(site_id, name);

DISTKEY tells Redshift how to distribute the data in the table when using multiple nodes and can only be set on one column of each table. If several tables reference a column, say, site_id, and it is commonly joined on, then this column should be the DISTKEY of these tables. The rows that have the same site_id in each table will be stored on the same node which limits the need to transfer data between nodes when executing the query.

SORTKEY on the other hand can be set on multiple columns on each table and tells Redshift to sort by these columns on each node. For both of these options, it’s important to choose the best option to ensure optimal performance for the most common, or intensive queries.

…or not to Redshift

Another issue we’ve run into is that while Redshift has greatly improved the speed of our larger aggregations and analytics queries, the individual select statements and transactional items are much slower. We find this is expected as Redshift is intended as a reporting and analytics backend and has been highly optimized for those kinds of jobs. To this end, the Redshift team suggests using a caching layer on top of Redshift that will interact directly with the front-end, user-facing layer and send data in bulk inserts to Redshift itself rather than row-by-row.

amazon-redshiftLastly, since Redshift differs so much from the main branch of PostgreSQL, in both schema definitions and query engine level implementations, many existing PHP schema managers, such as Doctrine’s, do not support managing the Redshift schema (yet). Common tasks such as getting the last insert id for an auto_increment column or setting a function as the default value of a column (current_timestamp for instance) are not implemented or supported. This is due to the organization of what they call Leader nodes and Compute nodes. Also, there are many maintenance tasks that cannot be performed without dropping and recreating the table.

Wrapping it up

For all of these reasons and more, we’ve concluded that it is still necessary to maintain a traditional RDBMS for the front-end. From here, we replicate the data to Redshift once we have loaded it into Koddi for use in resource intensive reporting and analytics tasks. So Redshift is not a catch-all move for those who may be looking for a more powerful database management solution, but is highly effective at its targeted function.

Additional Links:

A Tour of Amazon Redshift

AWS Redshift: How Amazon Changed The Game

 

Categories
Technology