Impact of proper column precision for analytical queries
Does it matter if your data warehouse tables have columns with much higher precision than needed? Probably: Yes.
But how do you know the precision of your columns is larger than required by the values stored in these columns? In Exasol, we have introduced the function MIN_SCALE to find out. I’m working on an Exasol 7 New Features course at the moment, and this article is kind of a sneak preview.
If there’s an impact, it will show only with huge amounts of rows of course. Would be nice to have a table generator to give us large testing tables. Another Exasol 7 feature helps with that: The new clause VALUES BETWEEN.
CREATE OR REPLACE TABLE scaletest.t1 AS
SELECT CAST(1.123456789 AS DECIMAL(18,17)) AS testcol
FROM VALUES BETWEEN 1 AND 1e9;
This generates a table with 1000 million rows and takes only 30 seconds runtime on my modest VirtualBox VM. Obviously, the scale of the column is too large for the values stored there. But if it wouldn’t be that obvious, here’s how I can find out:
SELECT MAX(a) FROM (SELECT MIN_SCALE(testcol) As a FROM scaletest.t1);
This comes back with the output 9 after 20 seconds runtime, telling me that the precision actually required by the values is 9 at max. I’ll create a second table for comparison with only the required scale:
CREATE OR REPLACE TABLE scaletest.t2 AS
SELECT CAST(1.123456789 AS DECIMAL(10,9)) AS testcol
FROM VALUES BETWEEN 1 AND 1e9;
So does it really matter? Is there a runtime difference for analytical queries?
SELECT COUNT(*),MAX(testcol) FROM t1; -- 16 secs runtime
SELECT COUNT(*),MAX(testcol) FROM t2; -- 7 secs runtime
My little experiment shows, the query running on the column with appropriate scale is twice as fast than the one running on the too large scaled column!
It would be beneficial to adjust the column precision according to the scale the stored values actually need, in other words. With statements like this:
ALTER TABLE t1 MODIFY (testcol DECIMAL(10,9));
After that change, the runtime goes down to 7 seconds as well for the first statement.
I was curious if that effect shows also on other databases, so I prepared a similar test case for an Oracle database. Same tables but only 100 million rows. It takes just too long to export tables with 1000 million rows to Oracle, using VMs on my notebook. And don’t even think about trying to generate 1000 million row tables on Oracle with the CONNECT BY LEVEL method, that will just take forever – or more likely break with an out-of-memory error.
The effect shows also with 100 million row tables on Oracle: 5 seconds runtime with too large precision and about 3 seconds with the appropriately scaled column.
Conclusion: Yes, looks like it’s indeed sensible to format table columns according to the actual requirements of the values stored in them and it makes a difference, performancewise.
#Exasol dominates #Datawarehouse benchmarks – again!
The TPC-H benchmark is an independent measurement tool for data warehouse performance. Exasol dominates this benchmark since 2008 and has delivered record breaking results in 2021, justifying once again our claim to provide the world’s fastest in-memory analytic database.
We took part in the data volume categories 10 TB, 30 TB and 100 TB this time and set the mark for both performance and price per performance.
That means we’re not only the fastest but also the least expensive analytics database you can get that delivers a similar performance.
We were running with a 20 nodes cluster, nodes being Dell PowerEdge R6525 servers with AMD EPYC 7543 processors.
Truth be told, you don’t see the big names of the industry participating in that TPC-H benchmark these days. No Teradata, Oracle, Snowflake and the likes took part in recent years. That’s a pity, because an independent fair comparison would help customers immensely to make an informed choice for their analytics platform.
My guess is: They just hate to be blown out of the water by us each time they compete 🙂
New free online course: #Exasol Virtual Schemas
Virtual Schemas integrate foreign data sources into the Exasol database. They enable Exasol to become the central source of truth in your data warehouse landscape.
We added another free online learning course to our curriculum that explains how to deal with Virtual Schemas. Like the others, it comes with many hands-on practices to support a good learning experience. It also contains many demo videos like this one:
I recorded this clip (like most of the others) myself, but this time we decided to do a voice-over by native professional speakers.
Certification exams are free, also for this new course. When you complete the majority of hands-on labs in the course, you get one free certification exam granted per person and per course.
What are you waiting for? Come and get it!