Exadata Part VII: Meaning of the various Disk Layers

Whenever I have taught an Exadata Database Machine course, there has been some confusion among the attendees about the many different kinds of Disks encountered on the Storage Servers (Cells): We have no less than 4 different layers there, and the intention of this posting is to clarify the meaning of the different layers and to explain what an Administrator may need to do on each of them.

Each Cell comes with 12 SAS Harddisks (600 GB each with High Performance or 2 TB each with High Capacity). The picture below shows a Cell with the 12 Harddisks on the front:

Each Cell also has 4 Flash cards built in, each divided into 4 Flashdisks, adding up to 16 Flashdisks per Cell that deliver 384 GB of Flash Cache by default. This brings us to the first layer of abstraction:

1) Physical Disks

Physical Disks can be of the type Harddisk or of the type Flashdisk. You cannot create or drop them. The only administrative task on that layer is to turn on the LED at the front of the Cell before you replace a damaged Harddisk, to be sure you pull out the right one, with a command like

CellCLI> alter physicaldisk <name> serviceled on
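
To find the name of the affected disk first, you can list the Physical Disks together with their status and pass the name from the first column to the alter command above. The output below is abbreviated and only illustrative – names and status values will differ on your Cells:

CellCLI> list physicaldisk attributes name, diskType, status where disktype=harddisk
         20:0    HardDisk        normal
         20:1    HardDisk        normal
         20:2    HardDisk        critical
         ...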

2) Luns

Luns are the second layer of abstraction. They have been introduced because the first two Harddisks in every Cell are different from the other 10 insofar as they contain the Operating System (Oracle Enterprise Linux). About 30 GB have been carved out of the first 2 Harddisks for that purpose. We have 2 of them for redundancy – the Cell can still operate if one of the first 2 Harddisks fails. If we investigate the first 2 LUNs, we see the mirrored OS Partitions. Joel Goodman has done that in a very instructive posting. I need to correct my original statement in this post that said "The first 2 Luns are therefore 30 GB smaller than the other 10." The LUNs are equally sized on each Harddisk, but the usable space (for Celldisks and thereby Griddisks) is about 30 GB less on the first two.

As an Administrator, you do not need to do anything on the Lun Layer except look at it with commands like

CellCLI> list lun
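
One attribute worth looking at is isSystemLun, which flags the two LUNs that host the OS partitions. A sketch of what that could look like (output abbreviated and illustrative):

CellCLI> list lun attributes name, isSystemLun where disktype=harddisk
         0_0     TRUE
         0_1     TRUE
         0_2     FALSE
         ...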

3) Celldisks

Celldisks are the third layer of abstraction. This layer was introduced to enable interleaving in the first place. There has been some misconception about that in the Exadata Community, which is why I will spend some more lines on that topic. Typically, our ACS creates all the Celldisks for you without interleaving, with a command like

CellCLI> create celldisk all harddisk

When you investigate your Celldisks, you will see something like this:

CellCLI> list celldisk attributes name,interleaving where disktype=harddisk
         CD_disk01_cell1         none
         CD_disk02_cell1         none
         CD_disk03_cell1         none
         CD_disk04_cell1         none
         CD_disk05_cell1         none
         CD_disk06_cell1         none
         CD_disk07_cell1         none
         CD_disk08_cell1         none
         CD_disk09_cell1         none
         CD_disk10_cell1         none
         CD_disk11_cell1         none

My Celldisk #12 is not showing up because I dropped it in order to show the alternative creation with interleaving:

CellCLI> create celldisk all harddisk interleaving='normal_redundancy'
CellDisk CD_disk12_cell1 successfully created

In a real-world configuration, every Celldisk (on every Cell) would have the same interleaving setting (none, normal_redundancy or high_redundancy). The interleaving attribute of the Celldisk determines the placement of the Griddisks that are later created on that Celldisk.

So as an Administrator, you could create and drop Celldisks – although you will rarely (if ever) do that. Most customers are best served by the default configuration without interleaving: the Griddisks created first are the fastest -> DATA is faster than RECO

4) Griddisks

Griddisks are the fourth layer of abstraction, and they are the Candidate Disks from which you build your ASM diskgroups. By default (interleaving=none on the Celldisk layer), the first Griddisk that is created upon a Celldisk is placed on the outer sectors of the underlying Harddisk and will therefore have the best performance. If we follow the recommendations, we will create 3 Diskgroups upon our Griddisks: DATA, RECO and SYSTEMDG.

DATA is supposed to be used as the Database Area (DB_CREATE_FILE_DEST='+DATA' on the Database Layer), RECO will be the Recovery Area (DB_RECOVERY_FILE_DEST='+RECO') and SYSTEMDG will be used to hold the Voting Files and OCR. It makes sense that DATA has better performance than RECO, and SYSTEMDG can be placed on the slowest (inner) part of the Harddisks.
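
On the Database Layer, that simply translates into the usual initialization parameters – a minimal sketch, with a made-up size for the Recovery Area (the size limit has to be in place before the destination can be used):

SQL> alter system set db_recovery_file_dest_size = 500G;
SQL> alter system set db_recovery_file_dest = '+RECO';
SQL> alter system set db_create_file_dest = '+DATA';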

With interleaving specified at the Celldisk layer, this is different: The Griddisks are then created from both outer and inner parts of the Harddisk, leading to equal performance of the Griddisks and consequently of the Diskgroups created upon them later. This option was introduced for customers who want to provide different Diskgroups for different Databases without preferring one Database over another.

We will carve Griddisks of about 30 GB out of the 10 non-system Harddisks of each Cell to build the Diskgroup SYSTEMDG upon. That leaves the same amount of space on each of the 12 Harddisks for the DATA and RECO diskgroups. You may wonder why the SYSTEMDG Diskgroup gets relatively large with that approach – much larger than the space required by the Voting Files and OCR. That space gets used if you establish a DBFS filesystem with a dedicated DBFS database that uses the SYSTEMDG diskgroup as its Database Area. In this DBFS filesystem, you may store flat files to process them with External Tables (or SQL*Loader) from your production Databases.

So as an Administrator, you can (and most likely will) create and drop Griddisks; typically, 3 Griddisks are carved out of each Celldisk, and 2 out of the first 2 Celldisks that already contain the OS. Assuming we have High Performance Disks:

CellCLI> create griddisk all harddisk prefix=temp_dg, size=570G

This command will create 12 Griddisks, each 570G in size, from the outer (fastest) sectors of the underlying Harddisks. It fills up the first 2 Celldisks entirely, because they have only 570G of free space – the rest is already consumed by the OS partition.
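
You can verify that with the freeSpace attribute of the Celldisks: the first two should now show (close to) zero, while the remaining ten still have roughly 30G left. The numbers below are just a sketch:

CellCLI> list celldisk attributes name, freeSpace where disktype=harddisk
         CD_disk01_cell1         0
         CD_disk02_cell1         0
         CD_disk03_cell1         30G
         ...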

CellCLI> create griddisk all harddisk prefix=systemdg

This command creates 10 Griddisks for the SYSTEMDG diskgroup, consuming all the available space (about 30G) remaining on the 10 non-system Harddisks. Without interleaving, they will sit on the slowest part of the disks.

CellCLI> drop griddisk all prefix=temp_dg

Now we have dropped those Griddisks, leaving the faster parts empty for the next 2 Diskgroups:

CellCLI> create griddisk all harddisk prefix=data, size=270G

It is best practice to use the name of the future diskgroup as the prefix for the Griddisks. We have now created 12 Griddisks for the future DATA diskgroup on the outer sectors. The remaining space (300G) will be consumed by the reco Griddisks:

CellCLI> create griddisk all harddisk prefix=reco
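
A quick way to check the resulting layout is the offset attribute of the Griddisks: the data Griddisks should show the lowest offset (closest to the outer sectors), while the reco Griddisks start roughly where the data Griddisks end. The values below are only illustrative:

CellCLI> list griddisk attributes name, size, offset where name like 'data.*'
         data_CD_disk01_cell1    270G    32M
         ...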

We are now ready to continue on the Database Layer and create ASM Diskgroups there. I have already given an example for that (incidentally with Flashdisks, but it looks the same with Harddisks) in this posting. From that Layer, Griddisks just look like ASM (Candidate) Disks.
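
Just as a sketch of that step (adjust diskgroup names, redundancy and attribute values to your environment): on the ASM instance, the discovery path 'o/*/data*' matches all Griddisks whose names start with data on all Cells, so creating the DATA diskgroup could look like this:

SQL> create diskgroup data normal redundancy
     disk 'o/*/data*'
     attribute 'compatible.rdbms' = '11.2.0.0.0',
               'compatible.asm' = '11.2.0.0.0',
               'cell.smart_scan_capable' = 'TRUE',
               'au_size' = '4M';

ASM automatically puts the Griddisks of each Cell into a separate failure group, so a normal redundancy diskgroup can survive the loss of a complete Storage Server.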

Conclusion: All the various Disk Layers in Exadata are there for a good reason. As an Administrator, you will probably only deal with Griddisks, though. There are multiple Griddisks carved out of each Celldisk->Lun->Physical Disk. On the Database Layer, Griddisks look and feel like ASM Disks that you use for your ASM Diskgroups.


  1. #1 by Dan Norris on May 18, 2011 - 16:46

    "First created Griddisks are the fastest -> DATA is faster than RECO"

    That’s not always true. The more accurate statement is "Griddisks with the lowest offset are the fastest". They may not be created first since the offset attribute may be specified.

  2. #2 by Uwe Hesse on May 18, 2011 - 16:59

    Thank you, Dan, for this additional information! Can you give an example for that CellCli command? I haven’t seen an offset attribute specified during Griddisk creation yet and it is also not documented.

  3. #3 by Surachart Opun on May 20, 2011 - 05:36

    Thank you Uwe for a great post.

    Thank You Dan… for “Griddisks with the lowest offset are the fastest”.
    lowest offset … I see:
    offset: 32M

  4. #4 by Arup Nanda on May 22, 2011 - 08:39

    Great post, Uwe and fabulous series on Exadata. Thanks for all your hard work in putting it together.

  5. #5 by Frits Hoogland on May 22, 2011 - 11:17

    Great post Uwe! Be aware DATA and RECO are created as DATA_ and RECO_ with newer onecommand versions.

  6. #6 by Frits Hoogland on May 22, 2011 - 11:19

    ^^ a four letter client specified prefix should follow DATA_ and RECO_ in the above command.

  7. #7 by Uwe Hesse on May 23, 2011 - 10:50

    Thank you guys for the nice feedback 🙂 Very much appreciated! And thank you Frits for mentioning the new naming convention.

  8. #8 by Waseem on November 28, 2011 - 08:13

    Hi Uwe,

    There is some confusion and I hope you can guide me through this.

    The Oracle documentation says:
    CREATE CELLDISK ALL—This CELLCLI command automatically creates celldisks on all available logical unit numbers (LUNs).
    Now the example above you have used:
    create celldisk all harddisk
    Suppose you do not use the harddisk option: would the command 'create celldisk all' create celldisks on both flash as well as harddisks?
    Would 16 flash celldisks and 12 celldisks (harddisks) be created by this command?

    Suppose we do use 'create celldisk all' – would it allow us to use the flash celldisks for the Flash Cache, or can they only be used for Griddisks on Flash?

    Regards,
    Waseem.

  9. #9 by Waseem on November 28, 2011 - 08:51

    Hi Uwe,

    The offset example is mentioned in the below document:

    Click to access maa-exadata-consolidated-roles-459605.pdf

    "This order of creation is important to note since the griddisks created first (with a lower offset) will be on the outer tracks of the physical disks and consequently get slightly better performance. You can also specify the OFFSET as part of the griddisk create commands and then the order is not pertinent as you can place the griddisks at a specific offset."

    Regards,
    Waseem.

  10. #10 by Uwe Hesse on November 28, 2011 - 16:29

    Hi Waseem,

    thanks for your question and the information about the offset!
    When you say
    create celldisk all
    this would create celldisks upon all spinning disks and all flash disks. If celldisks have been created already, it will skip them and return messages accordingly.

    You may later use the flash-based celldisks to create the Flash Cache upon them (the standard approach), or use some of their space to create griddisks and leave the rest for the Flash Cache.

  11. #11 by raj on July 26, 2012 - 07:27

    Hi,

    Can you please let me know which physical files are present in the database layer and the storage layer respectively?

    Many Thanks
    Raj

  12. #12 by Uwe Hesse on July 30, 2012 - 17:17

    Raj, the physical files that make up your Oracle Database are the very same as on non-Exadata platforms: Datafiles, Controlfiles, Redo Logfiles. They are striped and mirrored on ASM diskgroups here, and in a standard configuration distributed across all Storage Servers.

  13. #13 by sunilbhola on October 12, 2012 - 16:43

    As you mentioned:-

    600 GB each with High Performance or 2 TB each with High Capacity

    Query:
    Which version of Exadata are you using? Somewhere I saw 3 TB instead of 2 TB.

    Which one is correct – is it 2 TB or 3 TB?

    As per Oracle (Database Machine Capacity, Uncompressed – Raw Disk Capacity):

                        X2-8 / X2-2 Full Rack    X2-2 Half Rack    X2-2 Quarter Rack
    High Perf Disk      100 TB                   50 TB             21.6 TB
    High Cap Disk       504 TB                   252 TB            108 TB

    Calculation:

    Cells   Disks   High Perf (GB)   Total (TB)   High Cap (TB)   Total (TB)   High Cap (TB)   Total (TB)
    14      12      600              100.8        2               336          3               504
    7       12      600              50.4         2               168          3               252
    3       12      600              21.6         2               72           3               108

  14. #14 by oraclebhola on October 13, 2012 - 07:30

    I got the confusion cleared… you mentioned 2 TB for High Capacity… 2 TB is used in the non-M2 models… non-M2 is old… in M2 we have data drives of 3 TB.

  15. #15 by oraclebhola on October 13, 2012 - 07:33

    Sunilbhola and oraclebhola – both IDs are mine, just FYI.

  16. #16 by Uwe Hesse on October 13, 2012 - 08:52

    So you sorted that out yourself 🙂 Yes, Exadata evolves (rapidly) over time, and my postings usually reflect only the state at the time they were written – although I may occasionally add an addendum that mentions the change.


