After the previous posts about Exasol, which have been more around installation, administration, backup & restore, loading data and general notes around transactions and sessions we’ll now go into more details how Exasol manages data. Here is the list of the previous posts:
- Getting started with Exasol – Setting up an environment
- Getting started with Exasol – Loading data from PostgreSQL
- Getting started with Exasol – Some words about indexes and transactions
- Getting started with Exasol – Sessions and auditing
- Getting started with Exasol – Installation from ISO
- Getting started with Exasol – Backup and restore (1)
- Getting started with Exasol – Backup and restore (2)
- Getting started with Exasol – A multi node cluster in AWS
Having a two node cluster running in AWS is a good starting point to explore an important concept in Exasol: distribution keys. As soon as you have more than one data node in the cluster, data is automatically distributed across the cluster nodes once a table reaches a certain size. To start with, let's generate a CSV file we can import into Exasol. I'll use this simple bash script to generate a file with 10,000,000 rows:
dwe@dwe:~/Downloads$ cat gen_data.sh
#!/bin/bash
FILE="/home/dwe/Downloads/sample.csv"
rm -rf ${FILE}
for i in {1..10000000}; do
    echo "${i},firstname${i},lastname${i},xxx${i}@xxx.com,street${i},country${i},description${i}" >> ${FILE}
done
This results in a file of around 1GB:
dwe@dwe:~/Downloads$ chmod +x gen_data.sh
dwe@dwe:~/Downloads$ ./gen_data.sh
dwe@dwe:~/Downloads$ head -5 sample.csv
1,firstname1,lastname1,xxx1@xxx.com,street1,country1,description1
2,firstname2,lastname2,xxx2@xxx.com,street2,country2,description2
3,firstname3,lastname3,xxx3@xxx.com,street3,country3,description3
4,firstname4,lastname4,xxx4@xxx.com,street4,country4,description4
5,firstname5,lastname5,xxx5@xxx.com,street5,country5,description5
dwe@dwe:~/Downloads$ ls -lha sample.csv
-rw-rw-r-- 1 dwe dwe 1023M Oct 17 19:27 sample.csv
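As a side note, the bash loop above is quite slow because it appends line by line; the same format can be produced much faster by a single awk process. A sketch (with a smaller row count and a different output path for illustration):

```shell
# Generate the same CSV format with one awk process instead of a
# per-row shell loop; 1000 rows here for illustration.
awk 'BEGIN {
  for (i = 1; i <= 1000; i++)
    printf "%d,firstname%d,lastname%d,xxx%d@xxx.com,street%d,country%d,description%d\n",
           i, i, i, i, i, i, i
}' > /tmp/sample_small.csv
head -2 /tmp/sample_small.csv
wc -l < /tmp/sample_small.csv
```

For the full 10,000,000 rows this runs in seconds rather than minutes on typical hardware.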
We’ll import that using the import command as we already did in the previous posts. If you have multiple nodes, you can list all of them in the connection string:
/home/dwe/EXAplus-7.0.0/exaplus -c 18.193.84.41,54.93.49.6:8563 -u sys -p xxxxx
Creating the schema and the table that will hold the data we want to import:
SQL_EXA> create schema demo;
EXA: create schema demo;
Rows affected: 0

SQL_EXA> open schema demo;
EXA: open schema demo;
Rows affected: 0

SQL_EXA> create table sample ( id int primary key
                             , firstname varchar(20)
                             , lastname varchar(20)
                             , email varchar(20)
                             , street varchar(20)
                             , country varchar(20)
                             , description varchar(20)
                             );
EXA: create table sample ( id int primary key...
Rows affected: 0
Importing the sample csv created above:
SQL_EXA> IMPORT INTO sample FROM LOCAL CSV FILE '/home/dwe/Downloads/sample.csv' COLUMN SEPARATOR = ',';
EXA: IMPORT INTO sample ...
Rows affected: 10000000
If we check if this table already has a distribution key assigned, we’ll notice that there is none:
SQL_EXA> select TABLE_NAME,TABLE_HAS_DISTRIBUTION_KEY from exa_dba_tables where TABLE_NAME = 'SAMPLE';
EXA: select TABLE_NAME,TABLE_HAS_DISTRIBUTION_KEY from exa_dba_tables where...

TABLE_NAME           TABLE
-------------------- -----
SAMPLE               FALSE

1 row in resultset.
If no distribution key is specified, the distribution of the rows is random. Using the IPROC function, you can ask on which node a given row is stored. If we do that for our sample table, we'll see this distribution of data:
SQL_EXA> select count(*) from (select id, iproc() from sample) where iproc=0;
EXA: select count(*) from (select id, iproc() from sample) where iproc=0;

COUNT(*)
---------------------
              5002896

1 row in resultset.

SQL_EXA> select count(*) from (select id, iproc() from sample) where iproc=1;
EXA: select count(*) from (select id, iproc() from sample) where iproc=1;

COUNT(*)
---------------------
              4997104

1 row in resultset.
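Instead of running one count per node, the per-node row counts can also be fetched in a single statement by grouping on IPROC(). A sketch (the column aliases are my own):

```sql
-- row count per cluster node in one statement
select iproc() as node_id, count(*) as row_count
  from sample
 group by iproc()
 order by node_id;
```

This scales better than one query per node, especially once the cluster grows beyond two data nodes.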
As we currently have two data nodes, we see an approximately even distribution of the rows across the two nodes. To know if that distribution is a good choice, you need to know which columns will be used in joins. To demonstrate this, let's create a copy of the table, again without specifying a distribution key:
SQL_EXA> create table sample2 as select * from sample;
EXA: create table sample2 as select * from sample;
Rows affected: 10000000

SQL_EXA> alter table sample2 add constraint sample_pk primary key (id);
EXA: alter table sample2 add constraint sample_pk primary key (id);
Rows affected: 0
Suppose we want to join these two tables on the primary key like this:
SQL_EXA> select sample.id, sample.firstname from sample join sample2 on sample.id = sample2.id where sample.id < 20;
EXA: select sample.id, sample.firstname from sample join sample2 on sample....

ID  FIRSTNAME
--  -----------
 1  firstname1
 2  firstname2
 3  firstname3
 4  firstname4
 5  firstname5
 6  firstname6
 7  firstname7
 8  firstname8
 9  firstname9
10  firstname10
11  firstname11
12  firstname12
13  firstname13
14  firstname14
15  firstname15
16  firstname16
17  firstname17
18  firstname18
19  firstname19

19 rows in resultset.
How can we know if the distribution of the rows is good for this query? In PostgreSQL we would have a look at the execution plan, but there is no command to generate an execution plan in Exasol (at least not in the way you know it from other databases). To get the details about what the cost based optimizer is doing, you need to turn on profiling. Let's do a simple example: we'll create a simple table, select from it, and then check what happened in the background. As the first step, let's create the table and add one row:
SQL_EXA> create table dummy ( a int, b varchar(20), c date );
EXA: create table dummy ( a int, b varchar(20), c date );
Rows affected: 0

SQL_EXA> insert into dummy values (1,'aaaa',sysdate);
EXA: insert into dummy values (1,'aaaa',sysdate);
Rows affected: 1
Now we want to profile a simple statement, which requires a few steps: turn on profiling (it is off by default), disable the query cache (otherwise subsequent executions of the same SQL statement might be answered from the cache), run the statement, flush the statistics, and finally query the profiling information so we can see what happened:
SQL_EXA> alter session set profile='on';
EXA: alter session set profile='on';
Rows affected: 0

SQL_EXA> alter session set query_cache='off';
EXA: alter session set query_cache='off';
Rows affected: 0

SQL_EXA> set autocommit off;
SQL_EXA> select count(*) from dummy;
EXA: select count(*) from dummy;

COUNT(*)
---------------------
                    1

1 row in resultset.

SQL_EXA> flush statistics;
EXA: flush statistics;
Rows affected: 0

SQL_EXA> commit;
EXA: commit;
Rows affected: 0

SQL_EXA> select STMT_ID,COMMAND_NAME,PART_ID,PART_NAME,PART_INFO,OBJECT_SCHEMA,OBJECT_NAME,OBJECT_ROWS,OUT_ROWS from EXA_DBA_PROFILE_LAST_DAY where session_id=current_session and command_name = 'SELECT' preferring high stmt_id order by part_id;
EXA: select STMT_ID,COMMAND_NAME,PART_ID,PART_NAME,PART_INFO,OBJECT_SCHEMA,...

STMT_ID  COMMAND_NAME  PART_ID  PART_NAME          PART_INFO                  OBJECT_SCHEMA  OBJECT_NAME     OBJECT_ROWS  OUT_ROWS
-------  ------------  -------  -----------------  -------------------------  -------------  --------------  -----------  --------
      8  SELECT              1  COMPILE / EXECUTE
      8  SELECT              2  SCAN                                          DEMO           DUMMY                     1         1
      8  SELECT              3  GROUP BY           GLOBAL on TEMPORARY table                 tmp_subselect0            0         1

3 rows in resultset.
SQL_EXA>
What do we see here? Three steps have been executed:
- COMPILE / EXECUTE: This is always the first step, the time it takes to compile the query
- SCAN: This step does the scan over the dummy table
- GROUP BY: Finally a group by to get the aggregation
There is much more information in EXA_DBA_PROFILE_LAST_DAY, but we'll ignore that for now. Now that we know how to get that information, we can go back to our initial query. What do we see here?
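For reference, the same view also exposes per-step resource metrics; a hedged sketch (column names as documented for recent Exasol versions, verify against your release):

```sql
-- additional per-step metrics from the profiling view
select part_id, part_name, duration, cpu,
       temp_db_ram_peak, hdd_read, net, remarks
  from exa_dba_profile_last_day
 where session_id = current_session
 preferring high stmt_id
 order by part_id;
```

The NET column in particular will become relevant later in this post, when we compare the network traffic of global and local joins.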
SQL_EXA> select sample.id, sample.firstname from sample join sample2 on sample.id = sample2.id where sample.id < 20;
EXA: select sample.id, sample.firstname from sample join sample2 on sample....

ID  FIRSTNAME
--  -----------
 1  firstname1
 2  firstname2
 3  firstname3
 4  firstname4
 5  firstname5
 6  firstname6
 7  firstname7
 8  firstname8
 9  firstname9
10  firstname10
11  firstname11
12  firstname12
13  firstname13
14  firstname14
15  firstname15
16  firstname16
17  firstname17
18  firstname18
19  firstname19

19 rows in resultset.

SQL_EXA> flush statistics;
EXA: flush statistics;
Rows affected: 0

SQL_EXA> commit;
EXA: commit;
Rows affected: 0

SQL_EXA> select STMT_ID,COMMAND_NAME,PART_ID,PART_NAME,PART_INFO,OBJECT_SCHEMA,OBJECT_NAME,OBJECT_ROWS,OUT_ROWS,DURATION from EXA_DBA_PROFILE_LAST_DAY where session_id=current_session and command_name = 'SELECT' preferring high stmt_id order by part_id;
EXA: select STMT_ID,COMMAND_NAME,PART_ID,PART_NAME,PART_INFO,OBJECT_SCHEMA,...

STMT_ID  COMMAND_NAME  PART_ID  PART_NAME          PART_INFO           OBJECT_SCHEMA  OBJECT_NAME     OBJECT_ROWS  OUT_ROWS  DURATION
-------  ------------  -------  -----------------  ------------------  -------------  --------------  -----------  --------  --------
     43  SELECT              1  COMPILE / EXECUTE                                                                               0.199
     43  SELECT              2  SCAN                                   DEMO           SAMPLE             10000000        19     0.028
     43  SELECT              3  JOIN               GLOBAL              DEMO           SAMPLE2            10000000        19     0.000
     43  SELECT              4  INSERT             on TEMPORARY table                 tmp_subselect0            0        19     0.000

4 rows in resultset.
What you usually don't want to see is a global join, but that is exactly what is happening here (I've also added the DURATION column so we can compare timings later). If a join partner cannot be found on the local node, a global join happens: the node has to communicate with the node that holds the join partner, which introduces traffic on the private network, and that takes time.
Let's change the distribution key of the two tables (this will re-write the tables, of course), so that IDs with the same hash value will be located on the same node:
SQL_EXA> alter table sample distribute by id;
EXA: alter table sample distribute by id;
Rows affected: 0

SQL_EXA> alter table sample2 distribute by id;
EXA: alter table sample2 distribute by id;
Rows affected: 0

SQL_EXA> commit;
EXA: commit;
Rows affected: 0
Does that change the execution steps?
SQL_EXA> select sample.id, sample.firstname from sample join sample2 on sample.id = sample2.id where sample.id < 20;
EXA: select sample.id, sample.firstname from sample join sample2 on sample....

ID  FIRSTNAME
--  -----------
 1  firstname1
 2  firstname2
 3  firstname3
 4  firstname4
 5  firstname5
 6  firstname6
 7  firstname7
 8  firstname8
 9  firstname9
10  firstname10
11  firstname11
12  firstname12
13  firstname13
14  firstname14
15  firstname15
16  firstname16
17  firstname17
18  firstname18
19  firstname19

19 rows in resultset.

SQL_EXA> flush statistics;
EXA: flush statistics;
Rows affected: 0

SQL_EXA> commit;
EXA: commit;
Rows affected: 0

SQL_EXA> select STMT_ID,COMMAND_NAME,PART_ID,PART_NAME,PART_INFO,OBJECT_SCHEMA,OBJECT_NAME,OBJECT_ROWS,OUT_ROWS,DURATION from EXA_DBA_PROFILE_LAST_DAY where session_id=current_session and command_name = 'SELECT' preferring high stmt_id order by part_id;
EXA: select STMT_ID,COMMAND_NAME,PART_ID,PART_NAME,PART_INFO,OBJECT_SCHEMA,...

STMT_ID  COMMAND_NAME  PART_ID  PART_NAME          PART_INFO           OBJECT_SCHEMA  OBJECT_NAME     OBJECT_ROWS  OUT_ROWS  DURATION
-------  ------------  -------  -----------------  ------------------  -------------  --------------  -----------  --------  --------
     51  SELECT              1  COMPILE / EXECUTE                                                                               0.011
     51  SELECT              2  SCAN                                   DEMO           SAMPLE             10000000        19     0.028
     51  SELECT              3  JOIN                                   DEMO           SAMPLE2            10000000        19     0.000
     51  SELECT              4  INSERT             on TEMPORARY table                 tmp_subselect0            0        19     0.000

4 rows in resultset.
Yes, definitely. We no longer see a global join, but a local one, and this improves performance. So, with Exasol, you should avoid global joins because they increase traffic on the private network. Try to distribute the data on the join columns to get local joins.
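If you already know the join columns up front, the distribution key can also be declared at table creation time instead of altering the table afterwards (which re-writes the data). A sketch for a hypothetical sample3 table; in Exasol the DISTRIBUTE BY clause goes inside the column list:

```sql
create table sample3 (
    id        int primary key,
    firstname varchar(20),
    lastname  varchar(20),
    distribute by id  -- rows are placed by hash of id from the start
);
```

This avoids the extra re-distribution step for tables that are loaded once and joined on a known key.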
Btw: You can see the distribution keys if you do a full describe against a table:
SQL_EXA> desc full sample;
EXA: desc full sample;

COLUMN_NAME  SQL_TYPE          NULLABLE  DISTRIBUTION_KEY  PARTITION_KEY  COLUMN_COMMENT
-----------  ----------------  --------  ----------------  -------------  --------------
ID           DECIMAL(18,0)     FALSE     TRUE              FALSE
FIRSTNAME    VARCHAR(20) UTF8  TRUE      FALSE             FALSE
LASTNAME     VARCHAR(20) UTF8  TRUE      FALSE             FALSE
EMAIL        VARCHAR(20) UTF8  TRUE      FALSE             FALSE
STREET       VARCHAR(20) UTF8  TRUE      FALSE             FALSE
COUNTRY      VARCHAR(20) UTF8  TRUE      FALSE             FALSE
DESCRIPTION  VARCHAR(20) UTF8  TRUE      FALSE             FALSE
To show the effect of global joins on the network, let's redo the example, but this time also add the NET column from EXA_DBA_PROFILE_LAST_DAY, which shows the network traffic in MiB per second. An easy way to get back to our initial state is to drop the distribution keys and then manually re-organize the tables:
SQL_EXA> alter table sample drop distribution keys;
EXA: alter table sample drop distribution keys;
Rows affected: 0

SQL_EXA> alter table sample2 drop distribution keys;
EXA: alter table sample2 drop distribution keys;
Rows affected: 0

SQL_EXA> reorganize table sample;
EXA: reorganize table sample;
Rows affected: 0

SQL_EXA> reorganize table sample2;
EXA: reorganize table sample2;
Rows affected: 0

SQL_EXA> commit;
EXA: commit;
Rows affected: 0

SQL_EXA> desc full sample;
EXA: desc full sample;

COLUMN_NAME  SQL_TYPE          NULLABLE  DISTRIBUTION_KEY  PARTITION_KEY  COLUMN_COMMENT
-----------  ----------------  --------  ----------------  -------------  --------------
ID           DECIMAL(18,0)     FALSE     FALSE             FALSE
FIRSTNAME    VARCHAR(20) UTF8  TRUE      FALSE             FALSE
LASTNAME     VARCHAR(20) UTF8  TRUE      FALSE             FALSE
EMAIL        VARCHAR(20) UTF8  TRUE      FALSE             FALSE
STREET       VARCHAR(20) UTF8  TRUE      FALSE             FALSE
COUNTRY      VARCHAR(20) UTF8  TRUE      FALSE             FALSE
DESCRIPTION  VARCHAR(20) UTF8  TRUE      FALSE             FALSE

7 rows in resultset.

SQL_EXA> desc full sample2;
EXA: desc full sample2;

COLUMN_NAME  SQL_TYPE          NULLABLE  DISTRIBUTION_KEY  PARTITION_KEY  COLUMN_COMMENT
-----------  ----------------  --------  ----------------  -------------  --------------
ID           DECIMAL(18,0)     FALSE     FALSE             FALSE
FIRSTNAME    VARCHAR(20) UTF8  TRUE      FALSE             FALSE
LASTNAME     VARCHAR(20) UTF8  TRUE      FALSE             FALSE
EMAIL        VARCHAR(20) UTF8  TRUE      FALSE             FALSE
STREET       VARCHAR(20) UTF8  TRUE      FALSE             FALSE
COUNTRY      VARCHAR(20) UTF8  TRUE      FALSE             FALSE
DESCRIPTION  VARCHAR(20) UTF8  TRUE      FALSE             FALSE

7 rows in resultset.
This is how we started. Let's have a look at the first case once more (note that I'll execute the statement twice, as the first execution will create the index), but this time we change the statement to go through the whole table:
SQL_EXA> select count(*) from sample join sample2 on sample.id = sample2.id;
EXA: select count(*)...

COUNT(*)
---------------------
             10000000

1 row in resultset.

SQL_EXA> select count(*) from sample join sample2 on sample.id = sample2.id;
EXA: select count(*)...

COUNT(*)
---------------------
             10000000

1 row in resultset.

SQL_EXA> flush statistics;
EXA: flush statistics;
Rows affected: 0

SQL_EXA> commit;
EXA: commit;
Rows affected: 0

SQL_EXA> select STMT_ID,COMMAND_NAME,PART_ID,PART_NAME,PART_INFO,OBJECT_SCHEMA,OBJECT_NAME,OBJECT_ROWS,OUT_ROWS,DURATION,NET from EXA_DBA_PROFILE_LAST_DAY where session_id=current_session and command_name = 'SELECT' preferring high stmt_id order by part_id;
EXA: select STMT_ID,COMMAND_NAME,PART_ID,PART_NAME,PART_INFO,OBJECT_SCHEMA,...

STMT_ID  COMMAND_NAME  PART_ID  PART_NAME          PART_INFO                  OBJECT_SCHEMA  OBJECT_NAME     OBJECT_ROWS  OUT_ROWS  DURATION  NET
-------  ------------  -------  -----------------  -------------------------  -------------  --------------  -----------  --------  --------  -----
    123  SELECT              1  COMPILE / EXECUTE                                                                            0.031  129.9
    123  SELECT              2  SCAN                                          DEMO           SAMPLE             10000000  10000000     0.007  166.4
    123  SELECT              3  JOIN               GLOBAL                     DEMO           SAMPLE2            10000000  10000000     1.128
    123  SELECT              4  GROUP BY           GLOBAL on TEMPORARY table                 tmp_subselect0            0         1     0.009  155.6

4 rows in resultset.
SQL_EXA>
Changing the distribution keys again, and re-doing the test:
SQL_EXA> alter table sample distribute by id;
EXA: alter table sample distribute by id;
Rows affected: 0

SQL_EXA> alter table sample2 distribute by id;
EXA: alter table sample2 distribute by id;
Rows affected: 0

SQL_EXA> commit;
EXA: commit;
Rows affected: 0

SQL_EXA> select count(*) from sample join sample2 on sample.id = sample2.id;
EXA: select count(*)...

COUNT(*)
---------------------
             10000000

1 row in resultset.

SQL_EXA> select count(*) from sample join sample2 on sample.id = sample2.id;
EXA: select count(*)...

COUNT(*)
---------------------
             10000000

1 row in resultset.

SQL_EXA> flush statistics;
EXA: flush statistics;
Rows affected: 0

SQL_EXA> commit;
EXA: commit;
Rows affected: 0

SQL_EXA> select STMT_ID,COMMAND_NAME,PART_ID,PART_NAME,PART_INFO,OBJECT_SCHEMA,OBJECT_NAME,OBJECT_ROWS,OUT_ROWS,DURATION,NET from EXA_DBA_PROFILE_LAST_DAY where session_id=current_session and command_name = 'SELECT' preferring high stmt_id order by part_id;
EXA: select STMT_ID,COMMAND_NAME,PART_ID,PART_NAME,PART_INFO,OBJECT_SCHEMA,...

STMT_ID  COMMAND_NAME  PART_ID  PART_NAME          PART_INFO                  OBJECT_SCHEMA  OBJECT_NAME     OBJECT_ROWS  OUT_ROWS  DURATION  NET
-------  ------------  -------  -----------------  -------------------------  -------------  --------------  -----------  --------  --------  ----
    131  SELECT              1  COMPILE / EXECUTE                                                                            0.006   0.2
    131  SELECT              2  SCAN                                          DEMO           SAMPLE             10000000  10000000     0.005   0.0
    131  SELECT              3  JOIN                                          DEMO           SAMPLE2            10000000  10000000     0.330
    131  SELECT              4  GROUP BY           GLOBAL on TEMPORARY table                 tmp_subselect0            0         1     0.004   0.1

4 rows in resultset.
SQL_EXA>
We come down to almost no network traffic, and this is what you want to see.
In the next post we’ll look into partitioning.