This article gives a more detailed view of the Exasol cluster architecture. A high-level view is provided here.
Exasol Cluster Nodes: Hardware
An Exasol Cluster is built from commodity Intel servers without any particularly expensive components. SAS hard drives and Ethernet cards are sufficient. In particular, there is no need for an additional storage layer such as a SAN.
See here for a list of Exasol Certified Servers.
As a best practice the hard drives of Exasol Cluster nodes are configured as RAID 1 pairs. Each cluster node holds four different areas on disk:
1. OS, 50 GB in size, containing CentOS Linux, EXAClusterOS and the Exasol database executables
2. Swap, 4 GB in size
3. Data, 50 GB in size, containing log files, core dumps and BucketFS
4. Storage, consuming the remaining capacity of the hard drives for the Data Volumes and Archive Volumes
The first three areas can be stored on dedicated disks, in which case these disks are also configured as RAID 1 pairs, usually with a smaller size than those that contain the volumes. More common than dedicated disks, however, are servers with only one type of disk. These are configured as hardware RAID 1 pairs, and on top of them software RAID 0 partitions are striped across all disks to hold the OS, Swap and Data areas.
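To make the layout concrete, here is a minimal capacity calculation for a single node. The disk count and per-disk size are assumptions for illustration only; the fixed area sizes (50 GB OS, 4 GB Swap, 50 GB Data) come from the list above.

```python
# Illustrative capacity calculation for one Exasol node.
# DISK_COUNT and DISK_SIZE_GB are assumed values, not fixed requirements.
DISK_COUNT = 8          # physical disks, paired as hardware RAID 1
DISK_SIZE_GB = 1000     # capacity per disk (assumed)

# RAID 1 mirroring halves the raw capacity.
usable_gb = (DISK_COUNT // 2) * DISK_SIZE_GB

# Subtract the fixed areas: OS (50 GB), Swap (4 GB), Data (50 GB).
storage_area_gb = usable_gb - (50 + 4 + 50)

print(storage_area_gb)  # capacity left for Data and Archive Volumes
```

With these assumed numbers, 3896 GB per node would remain for the Storage area hosting the Data and Archive Volumes.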
Exasol 4+1 Cluster: Software Layers
This popular multi-node cluster serves as an example to illustrate the concepts explained here. It is called a 4+1 cluster because it has 4 Active nodes and 1 Reserve node. Active and Reserve nodes have the same layers of software available; the purpose of the Reserve node is explained here. Upon cluster installation, the License Server copies these layers as tarballs across the private network to the other nodes. The License Server is the only node in the cluster that boots from disk. Upon cluster startup, it provides the required software layers to the other cluster nodes.
Exasol License Essentials
There are three types of licenses available:
Database RAM License: This most commonly used model specifies the total amount of RAM that can be assigned to databases in the cluster.
Raw Data License: Specifies the maximum size of the raw data you can store across databases in the cluster.
Memory Data License: Specifies the maximum size of the compressed data you can store across all databases.
For licenses based on RAM, Exasol checks the RAM assignment at the start of the database. If the RAM in use exceeds the maximum RAM specified by the license, the database will not start.
For licenses based on data size (Raw Data and Memory Data licenses), Exasol periodically checks the size of the data. If the data size exceeds the limit specified in the license, the database does not permit any further data insertion until usage drops below that limit.
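The two enforcement points can be sketched as two simple predicates. This is an illustrative model of the behavior described above, not Exasol's actual implementation; the function names are invented for this example.

```python
# Hedged sketch of the two license checks described above.

def may_start_database(requested_ram_gb: float, license_ram_gb: float) -> bool:
    """Database RAM license: checked once, at database startup.
    The database refuses to start if its RAM assignment exceeds the license."""
    return requested_ram_gb <= license_ram_gb

def may_insert_data(current_size_gb: float, license_limit_gb: float) -> bool:
    """Raw Data / Memory Data license: checked periodically.
    Inserts are rejected until usage drops back below the licensed limit."""
    return current_size_gb < license_limit_gb
```

Note the asymmetry: the RAM check is a hard gate at startup, while the data-size check throttles inserts on a running database without stopping it.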
Customers receive their license as a separate file. To activate the license, these license files are uploaded to the License Server using EXAoperation.
Storage Volumes are created with EXAoperation on specified nodes.
EXAStorage provides two kinds of volumes: Data Volumes and Archive Volumes.
Each database needs one volume for persistent data and one temporary volume for temporary data.
While the temporary volume is automatically created by a database process, the persistent data volume has to be created by an Exasol Administrator upon database creation.
Archive volumes are used to store backup files of an Exasol database.
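The volume roles and who creates them can be summarized in a small model. The dataclass below is purely illustrative and is not an Exasol API; it only restates the responsibilities described above.

```python
# Hedged sketch of the volume roles described above (illustrative, not an API).
from dataclasses import dataclass

@dataclass
class Volume:
    kind: str        # "data", "temporary" or "archive"
    created_by: str  # "administrator" or "database process"

volumes_for_one_database = [
    Volume("data", "administrator"),          # persistent data, created via EXAoperation
    Volume("temporary", "database process"),  # created automatically at runtime
]
backup_target = Volume("archive", "administrator")  # stores database backups
```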
Exasol 4+1 Cluster: Data & Archive Volume distribution
Data Volumes and Archive Volumes are hosted on the hard drives of the active nodes of a cluster.
They consume most of the capacity of these drives. The License Server usually hosts EXAoperation.
EXAoperation is the main management GUI for Exasol Clusters, consisting of an application server and a small configuration database, both located on the License Server under normal circumstances. EXAoperation can be accessed from all cluster nodes via HTTPS. Should the License Server go down, EXAoperation fails over to another node, while the availability of the Exasol database is not affected at all.
Shared-nothing architecture (MPP processing)
Exasol was developed as a parallel system and is constructed according to the shared-nothing principle. Data is distributed across all nodes in a cluster. When responding to queries, all nodes cooperate, and special parallel algorithms ensure that most data is processed locally in each individual node's main memory.
When a query is sent to the system, it is first accepted by the node the client is connected to. The query is then distributed to all nodes. Intelligent algorithms optimize the query, determine the best plan of action and generate needed indexes on the fly. The system then processes the partial results based on the local datasets. This processing paradigm is also known as SPMD (single program, multiple data). All cluster nodes operate on an equal basis; there is no master node. The global query result is delivered back to the user through the original connection.
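The SPMD pattern can be sketched in a few lines: each "node" aggregates only its local partition, and the partial results are then merged into the global answer. The node names and data are invented for illustration; this mirrors the shared-nothing processing described above, not Exasol's internals.

```python
# Minimal SPMD-style sketch: local aggregation per node, then a global merge.

partitions = {            # data parts, one per active node (assumed values)
    "n11": [3, 1, 4],
    "n12": [1, 5],
    "n13": [9, 2, 6],
    "n14": [5, 3],
}

# Phase 1: every node computes a partial SUM over its local data only.
partial_sums = {node: sum(rows) for node, rows in partitions.items()}

# Phase 2: the partial results are combined into the global result,
# which is returned to the client over the original connection.
global_sum = sum(partial_sums.values())
print(global_sum)  # → 39
```

The same two-phase shape applies to other distributive aggregates (COUNT, MIN, MAX); only the merge step differs.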
The picture above shows a cluster with four active nodes and one reserve node. The License Server is the only server that boots from disk; it provides the OS used by the other nodes over the network.
Exasol uses a shared-nothing architecture. The data stored in the database is labeled A, B, C and D to indicate that each node contains a different part of the database data. The active nodes n11-n14 each host database instances that operate on their part of the data locally, in MPP fashion. These instances communicate and coordinate over the private network.
Exasol Network Essentials
Each cluster node needs at least two network connections: one for the Public Network and one for the Private Network. The Public Network is used for client connections; 1 Gbit Ethernet is usually sufficient. The Private Network is used for the cluster interconnect of the nodes; here, 10 Gbit Ethernet or higher is recommended. Optionally, the Private Network can be separated into a Database Network (over which the database instances communicate) and a Storage Network (over which mirrored segments are synchronized).
Exasol Redundancy Essentials
Redundancy is an attribute that can be set upon EXAStorage volume creation. It specifies the number of copies of the data hosted on the active cluster nodes; in practice this is either Redundancy 1 or Redundancy 2. Redundancy 1 means there is no redundancy: if one node fails, the volume is no longer available. This is typically only seen in one-node clusters. Redundancy 2 means that each node holds a copy of the data operated on by a neighbor node, so the volume remains available if one node fails.
Exasol 4+1 Cluster: Redundancy 2
If volumes are configured with Redundancy 2, which is a best practice, each node holds a mirror of the data operated on by a neighbor node. If, for example, n11 modifies A, the mirror A' on n12 is synchronized over the private network. Should an active node fail, the reserve node steps in and starts an instance.
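The neighbor-mirroring layout can be sketched as follows. The ring-shaped neighbor scheme (each node mirrors the segment of the node before it) is an assumption made for this illustration; the text only states that mirrors live on a neighbor node.

```python
# Sketch of a Redundancy 2 layout: each active node holds its own primary
# segment plus a mirror of its neighbour's segment (ring scheme assumed).

nodes = ["n11", "n12", "n13", "n14"]
segments = ["A", "B", "C", "D"]

layout = {}
for i, node in enumerate(nodes):
    neighbor_segment = segments[(i - 1) % len(nodes)]  # previous node's data
    layout[node] = {"primary": segments[i], "mirror": neighbor_segment + "'"}

# If n11 fails, its segment A survives as mirror A' on n12, so the reserve
# node can start an instance and work with that surviving copy.
print(layout["n12"])  # → {'primary': 'B', 'mirror': "A'"}
```

Because every segment exists exactly twice, the cluster tolerates the loss of any single active node without losing data availability.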