seanswezey

Storing Historical Data with PostgreSQL and Automatic Partitions

February 23, 2017

It is considered good practice to store and review logs from your servers when you’re working with web services. Most often, text logs are used and information is gleaned with grep and other utilities. However, sometimes you may want to keep information that can be logged in a more structured and queryable manner. So let’s talk about storing historical data in PostgreSQL.

Often, storing historical data with PostgreSQL is not the best solution – you don’t get to take advantage of the relational benefits of Postgres, and it’s a bit complicated to set up. However, when working with clients (or even for yourself), it’s a huge benefit to have all of a service’s data available through one interface in one query. And it doesn’t hurt that it’s in SQL, a language that has been around for more than 30 years with tons of documentation available on the Internet.

To store historical data in Postgres, we are going to use two slightly more advanced tools: partitioning and triggers. Partitioning will allow us to separate our table into logical partitions, i.e. by month or week, while still allowing for queries among all rows. They will also allow us to quickly delete older data once we no longer need it. Triggers allow us to execute code on specific events in a database. While using both, we will create monthly partitions as needed whenever we add a new row to our table storing the historical data.

The first step is to create a table that will represent our historical data. This table will be the format for all the child tables, but will not store any actual information. For our example, let’s pretend we’re an eCommerce company and want to store historical data on our products as they move through the fulfillment chain. We would create a table à la:

CREATE TABLE order_log (
  "id" SERIAL,
  "customer_id" INTEGER,
  "created_at" TIMESTAMP WITH TIME ZONE,
  "updated_at" TIMESTAMP WITH TIME ZONE,
  "status" TEXT,
  "notes" TEXT,
  PRIMARY KEY ("id")
);

Now comes the tricky part. PostgreSQL does support several procedural languages, but by default only PL/pgSQL is included and is the most universal. So for this example, I’ll be using it to define our functions. The first function that we need will create a child table from a parent table. We will pass in the parent table which column to check for partitioning and the month and year, and it will create the table with the proper naming and checks in place:

CREATE OR REPLACE FUNCTION create_log_month_table(parent_table TEXT, timestamp_col TEXT, year INT, month INT) RETURNS VOID AS $$
  DECLARE
    statement TEXT;
    next_date_year INT;
    next_date_month INT;
  BEGIN
    SET CLIENT_MIN_MESSAGES = ERROR;

    IF month = 12 THEN
      next_date_month := 1;
      next_date_year := year + 1;
    ELSE
      next_date_month := month + 1;
      next_date_year := year;
    END IF;

    statement := format('CREATE TABLE IF NOT EXISTS %1$I_%2$s_%3$s ('
                        '    CHECK ( %4$I >= DATE ''%2$s-%3$s-01'' AND %4$I < DATE ''%5$s-%6$s-01'' )'
                        ') INHERITS (%1$I)', parent_table, year, lpad(month::TEXT, 2, '0'), timestamp_col, next_date_year, lpad(next_date_month::TEXT, 2, '0'));
    EXECUTE statement;
  END;
$$ LANGUAGE 'plpgsql';

The next step is to create a trigger. The trigger will be executed before each row is inserted into the table. It will be responsible for making sure the correct child table exists and inserting the row into the child table instead (of the parent table). It takes in the column to check as an argument, extracts the information needed from the new row, calls our create_log_month_table function, inserts the row into the child table instead and returns NULL indicating that the row should not be inserted into the (parent) table:

CREATE OR REPLACE FUNCTION order_log_child_insert() RETURNS TRIGGER AS $$
  DECLARE
    timestamp_col TEXT;
    log_timestamp TIMESTAMP WITH TIME ZONE;
    log_year INT;
    log_month INT;
  BEGIN
    timestamp_col := TG_ARGV[0];
    EXECUTE format('SELECT ($1).%I::TIMESTAMP WITH TIME ZONE', timestamp_col) USING NEW INTO log_timestamp;
    log_year := extract(YEAR FROM log_timestamp);
    log_month := extract(MONTH FROM log_timestamp);

    PERFORM create_log_month_table(TG_TABLE_NAME, timestamp_col, log_year, log_month);
    EXECUTE format('INSERT INTO %1$I_%2$s_%3$s SELECT $1.*', TG_TABLE_NAME, log_year, lpad(log_month::TEXT, 2, '0')) USING NEW;

    RETURN NULL;
  END;
$$ LANGUAGE  'plpgsql';

After creating the trigger, the only step left is to set it for our table! This is really simple and should be self-explanatory:

CREATE TRIGGER order_log_insert_trigger
  BEFORE INSERT ON order_log
  FOR EACH ROW EXECUTE PROCEDURE order_log_child_insert('updated_at');

There you have it! Now whenever an order is updated in our fictional service, we can insert a row into order_log , and it will automatically be inserted into the correct child table. We can query the order_log like a regular table and as old data is no longer valuable, we can quickly drop the child tables. And this isn’t just limited to partitioning for historical data – you can use it to partition tables on any column!

Published February 23, 2017

Tags:

Leave a Reply

Your email address will not be published. Required fields are marked *