Posts Tagged Exasol

Understanding Distribution in #Exasol

Exasol doesn’t need much administration but getting distribution right matters

Exasol uses a clustered shared-nothing architecture with many sophisticated internal mechanisms to deliver outstanding performance without requiring much administration. Getting the distribution of rows between cluster nodes right is one of the few critical tasks left, though. To explain this, let’s say we have two tables t1 and t2:

The two tables are joined on the column JoinCol, while WHERE conditions for filtering are done with the column WhereCol. Other columns are not shown to keep the sketches small and simple. Now say these two tables are stored on a three-node cluster. Again, for simplicity only active nodes are on the sketch – no reserve nodes or license nodes. We also ignore the fact that small tables will be replicated across all active nodes.

Distribution will be random if no distribution key is specified

Without specifying a distribution key, the rows of the tables are distributed randomly across the nodes like this:

Absence of proper distribution keys: global joins

The two tables are then joined:

SELECT <something> FROM t1 JOIN t2 ON t1.JoinCol = t2.JoinCol;

Internally, this is processed as a global join which means network communication between the nodes on behalf of the join is required. This is the case because some rows do not find local join partners on the same node:

Distribution on join columns: local joins

If the two tables were distributed on their join columns with statements like these

ALTER TABLE t1 DISTRIBUTE BY JoinCol;

ALTER TABLE t2 DISTRIBUTE BY JoinCol;

then the same query can be processed internally as a local join:

Here every row finds a local join partner on the same node so no network communication between the nodes on behalf of the join is required. The performance with this local join is much better than with the global join although it’s the same statement as before.

Why you shouldn’t distribute on WHERE-columns

While it’s generally a good idea to distribute on JOIN-columns, it’s by contrast a bad idea to distribute on columns that are used for filtering with WHERE conditions. If both tables would have been distributed on the WhereCol columns, it would look like this:

This distribution is actually worse than the initial random distribution! Not only does this cause global joins between the two tables as already explained, statements like e.g.

<Any DQL or DML> WHERE t2.WhereCol='A';

will utilize only one node (the first with this WHERE condition) and that effectively disables one of Exasol’s best strengths, the Massive Parallel Processing (MPP) functionality. This distribution leads to poor performance because all other nodes in the cluster have to stand by being idle while one node has to do all the work alone.

Examine existing distribution with iproc()

The function iproc() helps investigating the existing distribution of rows across cluster nodes. This statement shows the distribution of the table t1:

SELECT iproc(),COUNT(*) FROM t1 GROUP BY 1 ORDER BY 1;

Evaluate the effect of distribution keys with value2proc()

The function value2proc() can be used to display the effect that a (new) distribution key would have:

SELECT home_node,COUNT(*) FROM (SELECT value2proc(JoinCol) AS home_node FROM t1) GROUP BY 1 ORDER BY 1;

Conclusion

Distribution on JOIN-columns leads to local joins which perform better than global joins: Do that!

Distribution on WHERE-columns leads to global joins and disables the MPP functionality, both causing poor performance: Don’t do that!

1 Comment

Using the Query Cache for good performance in #Exasol

The result of a query can be cached in Exasol to the effect that repeated identical queries complete in no time. This feature has been introduced in version 5 and is enabled by default.

SQL_EXA> select session_value,system_value  from exa_parameters where parameter_name='QUERY_CACHE';
EXA: select session_value,system_value  from exa_parameters where parameter...

SESSION_VALUE        SYSTEM_VALUE
-------------------- --------------------
ON                   ON

1 row in resultset.

The Query Cache can be (de-)activated on the session level as well as on the system level.

SQL_EXA> alter session set query_cache='off';
EXA: alter session set query_cache='off';

Rows affected: 0
SQL_EXA> select object_name,mem_object_size/1024/1024 as mb from exa_user_object_sizes where object_name='T';
EXA: select object_name,mem_object_size/1024/1024 as mb from exa_user_objec...

OBJECT_NAME          MB
-------------------- ----------------------------------
T                             1537.49641990661621093750

1 row in resultset.

SQL_EXA> select count(*) from t;
EXA: select count(*) from t;

COUNT(*)
---------------------
            320000000

1 row in resultset.
SQL_EXA> set timing on;
SQL_EXA> select count(*) from t where numcol2=42;
EXA: select count(*) from t where numcol2=42;

COUNT(*)
---------------------
               888896

1 row in resultset.


Timing element: 4
Elapsed: 00:00:03.022

SQL_EXA> select count(*) from t where numcol2=42;
EXA: select count(*) from t where numcol2=42;

COUNT(*)
---------------------
               888896

1 row in resultset.


Timing element: 5
Elapsed: 00:00:02.620

SQL_EXA> select count(*) from t where numcol2=42;
EXA: select count(*) from t where numcol2=42;

COUNT(*)
---------------------
               888896

1 row in resultset.


Timing element: 6
Elapsed: 00:00:02.724

Without using the Query Cache the repeated query takes roughly 3 seconds.

SQL_EXA> alter session set query_cache='on';
EXA: alter session set query_cache='on';

Rows affected: 0


Timing element: 7
Elapsed: 00:00:00.008

SQL_EXA> select count(*) from t where numcol2=42;
EXA: select count(*) from t where numcol2=42;

COUNT(*)
---------------------
               888896

1 row in resultset.


Timing element: 8
Elapsed: 00:00:00.009

Lightning fast! If statement profiling is enabled, QUERY CACHE RESULT shows as PART_NAME in tables like EXA_USER_PROFILE_LAST_DAY.
Also EXECUTION_MODE from EXA_SQL_LAST_DAY shows the usage of the Query Cache:

SQL_EXA> set timing off;
SQL_EXA> select current_statement;
EXA: select current_statement;

CURRENT_STATEMENT
---------------------
                   35

1 row in resultset.

SQL_EXA> show autocommit;
AUTOCOMMIT = "ON"
SQL_EXA> select command_name,duration,row_count,execution_mode from exa_sql_last_day where stmt_id=33 and session_id=current_session;
EXA: select command_name,duration,row_count,execution_mode from exa_sql_las...

COMMAND_NAME                             DURATION    ROW_COUNT             EXECUTION_MODE
---------------------------------------- ----------- --------------------- --------------------
SELECT                                         0.005                     1 CACHED

1 row in resultset.

If DML changes the table, the result in the Query Cache is invalidated automatically:

SQL_EXA> update t set numcol2=1 where rowid in (select rowid from t limit 1);
EXA: update t set numcol2=1 where rowid in (select rowid from t limit 1);
Rows affected: 1

SQL_EXA> set timing on;
SQL_EXA> select count(*) from t where numcol2=42;
EXA: select count(*) from t where numcol2=42;

COUNT(*)
---------------------
888896

1 row in resultset.

Timing element: 10
Elapsed: 00:00:02.870

SQL_EXA> set timing off;
SQL_EXA> select current_statement;
EXA: select current_statement;

CURRENT_STATEMENT
---------------------
51

1 row in resultset.

SQL_EXA> select command_name,duration,row_count,execution_mode from exa_sql_last_day where stmt_id=49 and session_id=current_session;
EXA: select command_name,duration,row_count,execution_mode from exa_sql_las...

COMMAND_NAME DURATION ROW_COUNT EXECUTION_MODE
---------------------------------------- ----------- --------------------- --------------------

0 rows in resultset.

There’s a 1 Minute interval for syncing the statistic tables. It can be triggered manually, though:

SQL_EXA> flush statistics;
EXA: flush statistics;

Rows affected: 0

SQL_EXA> commit;
EXA: commit;

Rows affected: 0

SQL_EXA> select command_name,duration,row_count,execution_mode from exa_sql_last_day where stmt_id=49 and session_id=current_session;
EXA: select command_name,duration,row_count,execution_mode from exa_sql_las...

COMMAND_NAME                             DURATION    ROW_COUNT             EXECUTION_MODE
---------------------------------------- ----------- --------------------- --------------------
SELECT                                         2.862                     1 EXECUTE

1 row in resultset.

Runtime and EXECUTION_MODE EXECUTE confirms that the Query Cache was invalidated by the UPDATE above. Now it’s automatically refreshed:

SQL_EXA> set timing on;
SQL_EXA> select count(*) from t where numcol2=42;
EXA: select count(*) from t where numcol2=42;

COUNT(*)
---------------------
               888896

1 row in resultset.


Timing element: 11
Elapsed: 00:00:00.010

Leave a comment

#Exasol Database whoami

This little script displays some useful meta-information:

SQL_EXA> create schema myschema;
EXA: create schema myschema;

Rows affected: 0

SQL_EXA> create or replace script whoami as
 output('Current User: '.. tostring(exa.meta.current_user))
 output('Current Schema: '.. tostring(exa.meta.current_schema))
 output('Session ID: '.. tostring(exa.meta.session_id))
 output('Database Version: '.. tostring(exa.meta.database_version))
 output('Number of Nodes: '.. tostring(exa.meta.node_count))
 /
EXA:create or replace script whoami as...

Rows affected: 0
SQL_EXA> col output for a40;
COLUMN   output ON
FORMAT   a40
SQL_EXA> execute script whoami with output;
EXA: execute script whoami with output;

OUTPUT
----------------------------------------
Current User: SYS
Current Schema: MYSCHEMA
Session ID: 1612024483893367379
Database Version: 6.1.0-alpha1
Number of Nodes: 1

5 rows in resultset.

All available metadata is documented here (Chapter 3. Concepts -> 3.6 UDF scripts).

,

Leave a comment

%d bloggers like this: