Automatic Indexes in #Exasol

An Exasol database will automatically create, maintain and drop indexes, following the core idea to deliver great performance without requiring much administrative efforts. Like our tables, our indexes are always compressed and you don’t need to configure anything for that.

Joins between two or more tables are processed like this in Exasol: One table is full scanned (this is called the root table) and the other tables are joined using an index on their join columns.

If these indexes on the join columns are not already existing, they are automatically created during the join operation. Taking two tables t1 and t2 as an example, and a statement like

select count(*) from t1 join t2 on t1.joincol = t2.joincol;

The Exasol optimizer will compute an execution plan based on automatically gathered statistics that inform it amongst others about the table sizes. Often, the smaller table will become the root table and the other table will be joined using an index on the join column.
Let’s see an example:

create or replace table t1 as
select 
rownum as joincol, 
'Just some text' as textcol,
mod(rownum,5) as numcol1, 
mod(rownum,1000) as numcol2,
5000 as numcol3,
to_date ('01.' || lpad(to_char(mod(rownum,12)+1),2,'0') || '.2018' ,'dd.mm.yyyy') as time_id 
from (select 1 from dual connect by level<=1e6);

SUCCESS	CREATE	36.43		1000000	OK

create or replace table t2 as select * from t1 limit 5e5;

SUCCESS	CREATE	0.735		500000	OK

select count(*) from t1 join t2 on t1.joincol=t2.joincol;

COUNT(*)
500000

select index_table,remarks from exa_dba_indices where index_table like 'T_';

INDEX_TABLE	REMARKS
T1	        GLOBAL INDEX (JOINCOL)

That index got created during query execution (on the first join between t1 and t2) and subsequently supports further joins with t1 on joincol.

If DML is done on t1, this index is maintained by the system:
INSERT into t1 will add new index entries accordingly.
DELETE from t1 will mark rows as deleted until more than 25% of rows have been deleted, Then t1 is reorganized automatically and the index is automatically rebuilt.
UPDATE statements that affect less than 15% of rows will update index key entries accordingly. If more than 15% of rows are updated, the index is automatically rebuilt.

If  an index is not used to support queries for more than 5 weeks, it will be automatically dropped. That way, a useless index will not slow down DML and consume space for no good reason.

Operationally, nothing needs to be done about indexes in Exasol and that’s what most of our customers do: They just let the system take care of indexes. In earlier versions, EXA_DBA_INDICES didn’t exist even to avoid providing superfluous information.

What is a global index and why was it created that way on t1, you may ask. Like tables, indexes are also distributed across the Exasol cluster nodes. If the index part on a node points only to table rows residing on the same node, that’s a local index. A global index means that the index part on a node points to (at least some) table rows residing on another node. Means a global join leads to global indexes while local joins lead to local indexes. Profiling a join between t1 and t2 confirms that:

select part_name,part_info,object_name,remarks from exa_user_profile_last_day where stmt_id =21 and session_id=current_session;

PART_NAME	    PART_INFO	              OBJECT_NAME	   REMARKS
COMPILE / EXECUTE   (null)	              (null)	           (null)
SCAN	            (null)	               T2	           (null)
JOIN	            GLOBAL	               T1	           T2(JOINCOL) => GLOBAL INDEX (JOINCOL)
GROUP BY	    GLOBAL on TEMPORARY table  tmp_subselect0	   (null)

So that was a global join using a global index. If the two tables were distributed on joincol, this leads to local joins with local indexes. Putting the distribution key on joincol for t1 will automatically convert the existing index into a local index:

alter table t1 distribute by joincol;

SUCCESS	ALTER	0.90

alter table t2 distribute by joincol;

SUCCESS	ALTER	0.487

select index_table,remarks from exa_dba_indices where index_table like 'T_';

INDEX_TABLE	REMARKS
T1	        LOCAL INDEX (JOINCOL)

When using prepared statements to do UPDATE or DELETE, this may automatically create an index on the filter column:

update t2 set textcol='Modified' where numcol1=:parameter1; -- Enter 1

SUCCESS	UPDATE	0.44		100004	OK

select index_table,remarks from exa_dba_indices where index_table like 'T_';

INDEX_TABLE	REMARKS
T1	        LOCAL INDEX (JOINCOL)
T2	        LOCAL INDEX (NUMCOL1)

delete from t1 where numcol2=:parameter2; -- Enter 42

SUCCESS	DELETE	0.511		1000	OK

select index_table,remarks from exa_dba_indices where index_table like 'T_';

INDEX_TABLE	REMARKS
T1	        LOCAL INDEX (NUMCOL2)
T1	        LOCAL INDEX (JOINCOL)
T2	        LOCAL INDEX (NUMCOL1)

This results in local indexes because there is no need to refer to rows on other nodes while each node updates or deletes on his locally stored part of the tables.
I’m using DbVisualizer as a SQL Client for my demos here and it prompts for inputs of :parameter1 and :parameter2 when the statements are executed.

Another reason why indexes are automatically created is when a primary constraint is added to a table:

alter table t2 add constraint t2pk primary key (joincol);

SUCCESS	ALTER	0.39

select index_table,remarks from exa_dba_indices where index_table like 'T_';

INDEX_TABLE	REMARKS
T1	        LOCAL INDEX (NUMCOL2)
T1	        LOCAL INDEX (JOINCOL)
T2	        LOCAL INDEX (JOINCOL)
T2	        LOCAL INDEX (NUMCOL1)

This created a local index because t2 is distributed on joincol.
Please do not take this as a recommendation to add primary constraints to tables in Exasol, I just mentioned it because it also leads to the creation of indexes. MERGE statements also lead to index creation because they are processed as joins, by the way.

Although customers are not required to do anything about indexes, there’s a couple of good practices related with them that make your Exasol database perform even better:

Small numeric formats are optimal for join columns and lead to small efficient indexes. It’s often beneficial to replace multi-column joins respectively joins on large formatted columns by joins on numeric surrogate keys. IDENTITY COLUMNS help to generate these surrogate keys.

Avoid using expressions on the join columns because the resulting indexes are not persisted and have to be built again during every join:

select count(*) from t1 join t2 on t1.joincol+1=t2.joincol+1; -- don't do that

Avoid having mixed data types on join columns because that can also lead to expression indexes:

create or replace table t3 as
select 
to_char(rownum) as joincol, -- don't do that
'Just some text' as textcol,
mod(rownum,5) as numcol1, 
mod(rownum,1000) as numcol2,
5000 as numcol3,
to_date ('01.' || lpad(to_char(mod(rownum,12)+1),2,'0') || '.2018' ,'dd.mm.yyyy') as time_id 
from (select 1 from dual connect by level<=1e6);

SUCCESS	CREATE	34.925		1000000	OK

create or replace table t4 as select * from t1 limit 100;

SUCCESS	CREATE	0.374		100	OK

Above created a large table t3 using a string type for joincol and a small table t4 using a numeric type for joincol. Upon joining the two, likely t4 becomes the root table and t3 is expression indexed:

select count(*) from t3 join t4 on t3.joincol=t4.joincol;

COUNT(*)
100

select index_table,remarks from exa_dba_indices where index_table like 'T_';

INDEX_TABLE	REMARKS
T1	        LOCAL INDEX (NUMCOL2)
T1	        LOCAL INDEX (JOINCOL)
T2	        LOCAL INDEX (JOINCOL)
T2	        LOCAL INDEX (NUMCOL1)

No index listed for t3 as you see. The profile of the join between t3 and t4 shows:

select part_name,part_info,object_name,remarks from exa_user_profile_last_day where stmt_id =95 and session_id=current_session;

PART_NAME	    PART_INFO	               OBJECT_NAME	REMARKS
COMPILE / EXECUTE   (null)	               (null)	        (null)
INDEX CREATE	    EXPRESSION INDEX	        T3	        ExpressionIndex
SCAN	            on REPLICATED table	        T4	        (null)
JOIN	            (null)	                T3	        T4(JOINCOL) => ExpressionIndex
GROUP BY	    GLOBAL on TEMPORARY table	tmp_subselect0	(null)

There was an index created for the join, but expression indexes are not persisted. This was a local join (no GLOBAL indicator in the PART_INFO column for the JOIN step) because t4 was replicated across all nodes due to its small size.

Apart from following the mentioned good practices, there’s simply not much for customers to take care of related to indexes in Exasol – it just works 🙂

Executive Summary:

Exasol creates all indexes it needs automatically.

Tables and Indexes are automatically compressed.

Tables and Indexes are automatically reorganized if required.

Useless indexes are automatically dropped.

The administrative effort related to indexes is close to zero in Exasol.

Investing some time in proper table design initially might still be a good idea.

 

  1. Hinterlasse einen Kommentar

Hinterlasse einen Kommentar

Diese Seite verwendet Akismet, um Spam zu reduzieren. Erfahre, wie deine Kommentardaten verarbeitet werden..