Impact of proper column precision for analytical queries

Does it matter if your data warehouse tables have columns with much higher precision than needed? Probably: Yes.

But how do you know the precision of your columns is larger than required by the values stored in these columns? In Exasol, we have introduced the function MIN_SCALE to find out. I’m working on an Exasol 7 New Features course at the moment, and this article is kind of a sneak preview.

If there’s an impact, it will show only with huge amounts of rows of course. Would be nice to have a table generator to give us large testing tables. Another Exasol 7 feature helps with that: The new clause VALUES BETWEEN.

SELECT CAST(1.123456789 AS DECIMAL(18,17)) AS testcol

This generates a table with 1000 million rows and takes only 30 seconds runtime on my modest VirtualBox VM. Obviously, the scale of the column is too large for the values stored there. But if it wouldn’t be that obvious, here’s how I can find out:

SELECT MAX(a) FROM (SELECT MIN_SCALE(testcol) As a FROM scaletest.t1);

This comes back with the output 9 after 20 seconds runtime, telling me that the precision actually required by the values is 9 at max. I’ll create a second table for comparison with only the required scale:

SELECT CAST(1.123456789 AS DECIMAL(10,9)) AS testcol

So does it really matter? Is there a runtime difference for analytical queries?

SELECT COUNT(*),MAX(testcol) FROM t1; -- 16 secs runtime
SELECT COUNT(*),MAX(testcol) FROM t2; -- 7 secs runtime

My little experiment shows, the query running on the column with appropriate scale is twice as fast than the one running on the too large scaled column!

It would be beneficial to adjust the column precision according to the scale the stored values actually need, in other words. With statements like this:

ALTER TABLE t1 MODIFY (testcol DECIMAL(10,9));

After that change, the runtime goes down to 7 seconds as well for the first statement.

I was curious if that effect shows also on other databases, so I prepared a similar test case for an Oracle database. Same tables but only 100 million rows. It takes just too long to export tables with 1000 million rows to Oracle, using VMs on my notebook. And don’t even think about trying to generate 1000 million row tables on Oracle with the CONNECT BY LEVEL method, that will just take forever – or more likely break with an out-of-memory error.

The effect shows also with 100 million row tables on Oracle: 5 seconds runtime with too large precision and about 3 seconds with the appropriately scaled column.

Conclusion: Yes, looks like it’s indeed sensible to format table columns according to the actual requirements of the values stored in them and it makes a difference, performancewise.

  1. Leave a comment

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: