Impact of proper column precision for analytical queries

Does it matter if your data warehouse tables have columns with much higher precision than needed? Probably: Yes.

But how do you know whether the scale of your columns is larger than the values stored in them actually require? In Exasol 7, we have introduced the function MIN_SCALE to find that out. I'm working on an Exasol 7 New Features course at the moment, and this article is kind of a sneak preview.
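To get a feel for what MIN_SCALE does, here's a quick sketch with literal values; the expected results in the comments are just my illustration, so verify them on your own system:

-- MIN_SCALE returns the smallest scale that still represents the value without loss
SELECT MIN_SCALE(1.123456789) AS needed_scale; -- expected: 9
SELECT MIN_SCALE(1.500000000) AS needed_scale; -- expected: 1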

If there is an impact, it will show only with huge numbers of rows, of course. It would be nice to have a table generator that gives us large test tables. Another Exasol 7 feature helps with that: the new clause VALUES BETWEEN.


CREATE OR REPLACE TABLE scaletest.t1 AS
SELECT CAST(1.123456789 AS DECIMAL(18,17)) AS testcol
FROM VALUES BETWEEN 1 AND 1e9;

This generates a table with 1000 million rows in only 30 seconds of runtime on my modest VirtualBox VM. Obviously, the scale of the column is too large for the values stored there. But if it weren't that obvious, here's how I can find out:

SELECT MAX(a) FROM (SELECT MIN_SCALE(testcol) AS a FROM scaletest.t1);

This comes back with the output 9 after 20 seconds of runtime, telling me that the scale actually required by the stored values is at most 9. I'll create a second table for comparison, with only the required scale (and a correspondingly smaller precision):


CREATE OR REPLACE TABLE scaletest.t2 AS
SELECT CAST(1.123456789 AS DECIMAL(10,9)) AS testcol
FROM VALUES BETWEEN 1 AND 1e9;

So does it really matter? Is there a runtime difference for analytical queries?

SELECT COUNT(*), MAX(testcol) FROM scaletest.t1; -- 16 secs runtime
SELECT COUNT(*), MAX(testcol) FROM scaletest.t2; -- 7 secs runtime

My little experiment shows that the query running on the column with the appropriate scale is twice as fast as the one running on the column with the oversized scale!
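One plausible reason for the difference is that the narrower DECIMAL simply takes less space per value, so less data has to be scanned. A quick way to check that is to compare the storage footprint of the two tables, for example via the statistical system table EXA_ALL_OBJECT_SIZES (a sketch; I'm quoting the column names from memory, so treat them as an assumption):

SELECT object_name, raw_object_size, mem_object_size
FROM exa_all_object_sizes
WHERE root_name = 'SCALETEST'; -- the appropriately scaled table should be smaller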

So it would be beneficial to adjust the column data type to the scale the stored values actually need, with a statement like this:

ALTER TABLE scaletest.t1 MODIFY (testcol DECIMAL(10,9));

After that change, the runtime of the first statement also goes down to 7 seconds.
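To confirm that the column really has the smaller data type now, the table definition can be inspected, for example:

DESCRIBE scaletest.t1; -- testcol should now be reported as DECIMAL(10,9)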

I was curious whether that effect also shows up on other databases, so I prepared a similar test case for an Oracle database: the same tables, but with only 100 million rows. It simply takes too long to export tables with 1000 million rows to Oracle, using VMs on my notebook. And don't even think about generating 1000 million row tables on Oracle with the CONNECT BY LEVEL method; that will take forever, or more likely break with an out-of-memory error.
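For illustration, here's a sketch of how the Oracle side of such a test could look; the table names are just placeholders, and the 100 million rows would be loaded from the Exasol export rather than generated in place:

-- Oracle: same value, once with an oversized and once with a fitting data type
CREATE TABLE t1 (testcol NUMBER(18,17));
CREATE TABLE t2 (testcol NUMBER(10,9));

-- after loading 100 million rows into each table:
SELECT COUNT(*), MAX(testcol) FROM t1; -- oversized scale
SELECT COUNT(*), MAX(testcol) FROM t2; -- fitting scale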

The effect also shows with the 100 million row tables on Oracle: 5 seconds of runtime with the oversized data type and about 3 seconds with the appropriately scaled column.

Conclusion: Yes, it is indeed sensible to define table columns according to the actual requirements of the values stored in them, and it makes a difference, performance-wise.
