This user hasn't shared any biographical information
Posted in TOI on July 18, 2018
If you need a large set of data to test or demonstrate something, this does the trick:
Create demo table in Oracle
create table t as select rownum as id, -- growing unique number 'Just some text' as textcol, mod(rownum,5) as numcol1, -- numbers 0-4 round-robin mod(rownum,1000) as numcol2 , -- numbers 0-999 round robin 5000 as numcol3, -- constant to_date ('01.' || lpad(to_char(mod(rownum,12)+1),2,'0') || '.2018' ,'dd.mm.yyyy') as time_id -- dates for every month from dual connect by level<=1e4 -- 10000 rows ;
Create demo table in Exasol
create table t as select rownum as id, -- growing unique number 'Just some text' as textcol, mod(rownum,5) as numcol1, -- numbers 0-4 round-robin mod(rownum,1000) as numcol2 , -- numbers 0-999 round robin 5000 as numcol3, -- constant to_date ('01.' || lpad(to_char(mod(rownum,12)+1),2,'0') || '.2018' ,'dd.mm.yyyy') as time_id -- dates for every month from (select 1 from dual connect by level<=1e4) -- 10000 rows ;
In Oracle you may get this error message for a high number of rows (like 10 Mio): ORA-30009: Not enough memory for CONNECT BY operation.
This way it works in spite of it:
SQL> alter session set workarea_size_policy=manual; Session altered. SQL> alter session set sort_area_size=1000000000; Session altered. SQL> create table t as select rownum as id, 'Just some text' as textcol, mod(rownum,5) as numcol1, mod(rownum,1000) as numcol2 , 5000 as numcol3, to_date ('01.' || lpad(to_char(mod(rownum,12)+1),2,'0') || '.2018' ,'dd.mm.yyyy') as time_id from dual connect by level<=1e7 ; Table created. SQL> select count(*) from t; COUNT(*) ---------- 10000000 SQL> select bytes/1024/1024 as mb from user_segments where segment_name='T'; MB ---------- 480
In Exasol, this problem doesn’t surface:
SQL_EXA> create table t as select rownum as id, 'Just some text' as textcol, mod(rownum,5) as numcol1, mod(rownum,1000) as numcol2 , 5000 as numcol3, to_date ('01.' || lpad(to_char(mod(rownum,12)+1),2,'0') || '.2018' ,'dd.mm.yyyy') as time_id from (select 1 from dual connect by level<=1e7) ; EXA: create table t as... Rows affected: 10000000 SQL_EXA> select mem_object_size/1024/1024 as mb from exa_all_object_sizes where object_name='T'; EXA: select mem_object_size/1024/1024 as mb from exa_all_object_sizes where... MB ---------------------------------- 50.03943824768066406250 1 row in resultset.
And the resulting table is smaller because Exasol stores it in columnar compressed format 🙂
Posted in TOI on July 16, 2018
This article gives a more detailed view on the Exasol Cluster Architecture. A high level view is provided here.
Exasol Cluster Nodes: Hardware
An Exasol Cluster is built with commodity Intel servers without any particular expensive components. SAS hard drives and Ethernet Cards are sufficient. Especially there is no need for an additional storage layer like a SAN.
See here for a list of Exasol Certified Servers.
As a best practice the hard drives of Exasol Cluster nodes are configured as RAID 1 pairs. Each cluster node holds four different areas on disk:
1.OS with 50 GB size containing CentOS Linux, EXAClusterOS and the Exasol database executables
2.Swap with 4 GB size
3.Data with 50 GB size containing Logfiles, Coredumps and BucketFS
4.Storage consuming the remaining capacity for the hard drives for the Data Volumes and Archive Volumes
The first three areas can be stored on dedicated disks in which case these disks are also configured in RAID 1 pairs, usually with a smaller size than those that contain the volumes. More common than having dedicated disks is having servers with only one type of disk. These are configured as hardware RAID 1 pairs. On top of that software RAID 0 partitions are being striped across all disks to contain OS, Swap and Data partition.
Exasol 4+1 Cluster: Software Layers
This popular multi-node cluster serves as example to illustrate the concepts explained. It is called 4+1 cluster because it has 4 Active nodes and 1 Reserve node. Active and Reserve nodes have the same layers of software available. The purpose of the Reserve node is explained here. Upon cluster installation, the License Server copies these layers as tar-balls across the private network to the other nodes. The License Server is the only node in the cluster that boots from disk. Upon cluster startup, it provides the required SW layers to the other cluster nodes.
Exasol License Essentials
There are three types of licenses available:
Database RAM License: This most commonly used model specifies the total amount of RAM that can be assigned to databases in the cluster.
Raw Data License: Specifies the maximum size of the raw data you can store across databases in the cluster.
Memory Data License: Specifies the maximum size of the compressed data you can store across all databases.
For licenses based on RAM, Exasol checks the RAM assignment at the start of the database. If the RAM in use exceeds the maximum RAM specified by the license, the database will not start.
For licenses based on data size (raw data license and memory data license), a periodic check is done by Exasol on the size of the data. If the size limit exceeds the value specified in the license, the database does not permit any further data insertion until the usage drops below the specified value.
Customers receive their license as a separate file. To activate the license, these license files are uploaded to the License Server using EXAoperation.
Storage Volumes are created with EXAoperation on specified nodes.
EXAStorage provides two kinds of volumes:
Each database needs one volume for persistent data and one temporary volume for temporary data.
While the temporary volume is automatically created by a database process, the persistent data volume has to be created by an Exasol Administrator upon database creation.
Archive volumes are used to store backup files of an Exasol database.
Exasol 4+1 Cluster: Data & Archive Volume distribution
Data Volumes and Archive Volumes are hosted on the hard drives of the active nodes of a cluster.
They consume the major capacity of these drives. The license server usually hosts EXAoperation.
EXAoperation is the major management GUI for Exasol Clusters, consisting of an Application Server and a small Configuration Database, both located on the License Server under normal circumstances. EXAoperation can be accessed from all Cluster Nodes via HTTPS. Should the License Server go down, EXAoperation will failover to another node while the availability of the Exasol database is not affected at all.
Shared-nothing architecture (MPP processing)
Exasol was developed as a parallel system and is constructed according to the shared-nothing principle. Data is distributed across all nodes in a cluster. When responding to queries, all nodes co-operate and special parallel algorithms ensure that most data is processed locally in each individual node’s main memory.
When a query is sent to the system, it is first accepted by the node the client is connected to. The query is then distributed to all nodes. Intelligent algorithms optimize the query, determine the best plan of action and generate needed indexes on the fly. The system then processes the partial results based the local datasets. This processing paradigm is also known as SPMD (single program multiple data). All cluster nodes operate on an equal basis, there is no Master Node. The global query result is delivered back to the user through the original connection.
Above picture shows a Cluster with 4 data nodes and one reserve node. The license server is the only server that boots from disk. It provides the OS used by the other nodes over the network.
Exasol uses a shared nothing architecture. The data stored in this database is symbolized with A,B,C,D to indicate that each node contains a different part of the database data. The active nodes n11-n14 each host database instances that operate on their part of the database locally in an MPP way. These instances communicate and coordinate over the private network.
Exasol Network Essentials
Each Cluster node needs at least two network connections: One for the Public Network and one for the Private Network. The Public Network is used for client connections. 1 Gb Ethernet is sufficient usually. The Private Network is used for the Cluster Interconnect of the nodes. 10 GB Ethernet or higher is recommended for the Private Network. Optionally, the Private Network can be separated into one Database Network (Database Instances communicate over it) and one Storage Network (Mirrored Segments are synchronized over this network).
Exasol Redundancy Essentials
Redundancy is an attribute that can be set upon EXAStorage Volume creation. It specifies the number of copies of the data that is hosted on Active Cluster nodes. In practice this is either Redundancy 1 or Redundancy 2. Redundancy 1 means there is no redundancy, so if one node fails, the volume with that redundancy is no longer available. Typically that is only seen with one-node Clusters. Redundancy 2 means that each node holds a copy of data that is operated on by a neighbor node, so the volume remains available if one node fails.
Exasol 4+1 Cluster: Redundancy 2
If volumes are configured with redundancy 2 – which is a best practice – then each node holds a mirror of data that is operated on by a neighbor node. If e.g. n11 modifies A the mirror A‘ on n12 is synchronized over the private network. Should an active node fail, the reserve node will step in starting an instance.
Posted in TOI on July 5, 2018
The building blocks of an Exasol cluster are commodity Intel servers like e.g. Dell PowerEdge R740 with 96 GB RAM,12 x 1.2 TB SAS Hot-plug hard-drives and 2 x 10Gb Ethernet Cards for the private network. That’s sufficient to deliver outstanding performance combined with high availability. The picture below shows a 4+1 cluster, one of our most popular configurations:
Each active node hosts one database instance that works on its part of the database (A,B,C,D) in an MPP way. The instances communicate over the private network. Optionally, the private network can be separated into one database network and one storage network. In this case, the instances communicate over the database network. Notice that the instances access their part of the database directly on their local hard drives, they do not need the private network respectively the storage network for that. The reserve node becomes relevant only if one of the active nodes fails. The local hard drives are being setup in RAID 1 pairs, so single disk failures can be tolerated without losing database availability. Not listed is the license node that is required to boot the cluster initially. After that, the license node is no longer required to keep the cluster running.
If data volumes with redundancy 2 are in use – which is the most common case – then each node holds a copy of the data operated on by a neighbor node:
If a Master-Segment like A is modified, the Slave-Segment (A’) is synchronized accordingly over the private network respectively the storage network.
Availability comes with a price: The raw disk capacity is reduced by half because of the RAID 1 mirroring and again by half because of the redundancy 2, so you remain with approximately (Linux OS and database software also require a small amount of disk space) 1/4 of your raw disk capacity. But since we are running on commodity hardware – no storage servers, no SAN, no SSDs required etc. – this is actually a very competitive price.
Now what if one node fails?
ExaClusterOS – Exasols Clusterware – will detect the node failure within seconds and shutdown all remaining database instances in order to preserve a consistent state of the database. Then it restarts them again on the still available 3 nodes and also on the Reserve node that now becomes an Active node too. The database itself becomes available again with the node n15 now immediately working with segment B’.
The downtime of the system caused by the node failure is below 30 seconds typically. The restart of the database triggers a threshold called Restore Delay which defaults to 10 Minutes. If within that time the failed node becomes available again, we will just re-synchronize the segments (A’ and B in the example) which can be done fast. The instance on n15 will then work with the segment B as a Master-Segment until the database is manually restarted. Then n15 becomes a reserve node again and n12 is active with an instance running there.
If the failed node doesn’t come back within Restore Delay:
We will then create new segments on node n15: A’ is copied from n11 and B is copied from n13. This activity is time-consuming and puts a significant load on the private network, which is why configuring a dedicated storage network may be beneficial to avoid a drop in performance during that period. A new reserve node should now be added to the cluster, replacing the crashed n12.