In this posting we look at the Flash Cache built into the Exadata Database Machine. This feature is one major reason why Exadata is beneficial not only for Data Warehouse but also for OLTP workloads. A Full Rack has 8 Database Servers and 14 Storage Servers. Each Storage Server (Cell for short) contains 4 PCIe Flash Cards.
Each Flash Card * (see the Addendum at the bottom) delivers 96 GB of Flash Storage, divided into 4 Flash Drives. This adds up to about 5 TB of Flash Storage for a Full Rack, so many databases will probably fit into it completely. In an OLTP database, the number of IOs per second that can be delivered is one of the most critical factors. With the above configuration, we can deliver up to 1 million IOs per second.
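The capacity figure follows from simple multiplication of the numbers given above. As a quick sanity check, here is a minimal sketch that can be run from any SQL*Plus session; the column aliases are only illustrative:

```sql
-- Flash capacity of an X2 Full Rack, using the numbers from the text:
-- 14 cells, 4 Flash Cards per cell, 96 GB per card, 4 Flash Drives per card
select 14 * 4 * 96                  as total_flash_gb,        -- 5376
       round(14 * 4 * 96 / 1024, 2) as total_flash_tb,        -- 5.25
       4 * 4                        as flash_drives_per_cell  -- 16
from   dual;
```

Note that this is raw capacity (384 GB per cell); the usable Flash Cache size that a cell reports is somewhat smaller.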
By default, the Flash Storage is used completely as Flash Cache. You may think of Flash Cache as an extension of the Database Buffer Cache. The system populates it automatically with objects it deems worth caching. Without any intervention, it looks like this:
CellCLI> list flashcache detail
name: exa5cel02_FLASHCACHE
cellDisk: [... list of all 16 flashdisks]
creationTime: 2011-02-02T01:14:04-08:00
degradedCelldisks:
effectiveCacheSize: 365.25G
id: 880d826b-47cc-4cf5-95fc-c36d6d315ba8
size: 365.25G
status: normal
That output is from one of the cells, connected as the celladmin user through the CellCLI command line interface. The whole Flash Storage on that cell is in use as Flash Cache **. On the Database Layer, we can see the effect like this:
SQL> select name,value from v$sysstat where name in
('physical read total IO requests','cell flash cache read hits');
NAME                                                                  VALUE
---------------------------------------------------------------- ----------
physical read total IO requests                                    51572139
cell flash cache read hits                                         32344851
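The two counters can also be turned into a percentage directly. The following is a minimal sketch using only the two statistic names shown above; the alias flash_read_hit_pct is made up for illustration:

```sql
-- Share of physical read requests satisfied from Flash Cache since instance startup
select round(100 * hits.value / nullif(reads.value, 0), 1) as flash_read_hit_pct
from   v$sysstat hits,
       v$sysstat reads
where  hits.name  = 'cell flash cache read hits'
and    reads.name = 'physical read total IO requests';
```

With the values listed above (32344851 hits out of 51572139 requests), this works out to about 62.7 percent.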
That is already a good ratio: the majority of IO requests are satisfied from the Flash Cache. We can specify a storage attribute on the segment level to influence the caching behavior of that segment:
SQL> select segment_name,cell_flash_cache from user_segments;
SEGMENT_NA CELL_FL
---------- -------
SALES      KEEP
There are 3 possible values: DEFAULT (the default), KEEP and NONE. KEEP means that the sales table will be cached in the Flash Cache "more aggressively" than with the default; in other words, it increases the chance of reading the table from there. The table is the same as in the previous posting. Because of this setting, and because previous selects on the table have populated the Flash Cache with it, I am now able to read it from there. I reconnect to reset v$mystat:
SQL> connect adam/adam
Connected.
SQL> set timing on
SQL> select count(*) from sales;

  COUNT(*)
----------
  20000000

Elapsed: 00:00:00.50
SQL> select name,value from v$mystat natural join v$statname where name in
('physical read total IO requests','cell flash cache read hits');

NAME                                                                  VALUE
---------------------------------------------------------------- ----------
physical read total IO requests                                       10265
cell flash cache read hits                                            10265
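The posting does not show how the KEEP attribute was assigned to SALES. A minimal sketch of how this is commonly done with the STORAGE clause looks like this; treat it as an illustration rather than the exact statement used for the demo:

```sql
-- Ask the cells to cache SALES more aggressively
alter table sales storage (cell_flash_cache keep);

-- Verify the attribute on the segment
select segment_name, cell_flash_cache
from   user_segments
where  segment_name = 'SALES';

-- To return to the default caching policy later:
-- alter table sales storage (cell_flash_cache default);
```

The attribute only expresses a preference; the table still has to be read at least once before its blocks can be served from the Flash Cache.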
The second way to use the Flash Storage is to take a part of it to build ASM diskgroups on. All files in these ASM diskgroups then reside permanently on Flash Storage:
CellCLI> drop flashcache
Flash cache exa5cel01_FLASHCACHE successfully dropped
CellCLI> create flashcache all size=100g
Flash cache exa5cel01_FLASHCACHE successfully created
CellCLI> create griddisk all flashdisk prefix=flashdrive
GridDisk flashdrive_FD_00_exa5cel01 successfully created
GridDisk flashdrive_FD_01_exa5cel01 successfully created
GridDisk flashdrive_FD_02_exa5cel01 successfully created
GridDisk flashdrive_FD_03_exa5cel01 successfully created
GridDisk flashdrive_FD_04_exa5cel01 successfully created
GridDisk flashdrive_FD_05_exa5cel01 successfully created
GridDisk flashdrive_FD_06_exa5cel01 successfully created
GridDisk flashdrive_FD_07_exa5cel01 successfully created
GridDisk flashdrive_FD_08_exa5cel01 successfully created
GridDisk flashdrive_FD_09_exa5cel01 successfully created
GridDisk flashdrive_FD_10_exa5cel01 successfully created
GridDisk flashdrive_FD_11_exa5cel01 successfully created
GridDisk flashdrive_FD_12_exa5cel01 successfully created
GridDisk flashdrive_FD_13_exa5cel01 successfully created
GridDisk flashdrive_FD_14_exa5cel01 successfully created
GridDisk flashdrive_FD_15_exa5cel01 successfully created
The Flash Cache for this cell is now reduced to 100 GB; "all" here means "on all 16 Flash Drives". I do the same on the second cell, because my Database Machine is limited to only one Server Node and two Cells. That gives me 32 Grid Disks based on Flash Drives to build ASM diskgroups on:
CellCLI> drop flashcache
Flash cache exa5cel02_FLASHCACHE successfully dropped
CellCLI> create flashcache all size=100g
Flash cache exa5cel02_FLASHCACHE successfully created
CellCLI> create griddisk all flashdisk prefix=flashdrive
GridDisk flashdrive_FD_00_exa5cel02 successfully created
GridDisk flashdrive_FD_01_exa5cel02 successfully created
GridDisk flashdrive_FD_02_exa5cel02 successfully created
GridDisk flashdrive_FD_03_exa5cel02 successfully created
GridDisk flashdrive_FD_04_exa5cel02 successfully created
GridDisk flashdrive_FD_05_exa5cel02 successfully created
GridDisk flashdrive_FD_06_exa5cel02 successfully created
GridDisk flashdrive_FD_07_exa5cel02 successfully created
GridDisk flashdrive_FD_08_exa5cel02 successfully created
GridDisk flashdrive_FD_09_exa5cel02 successfully created
GridDisk flashdrive_FD_10_exa5cel02 successfully created
GridDisk flashdrive_FD_11_exa5cel02 successfully created
GridDisk flashdrive_FD_12_exa5cel02 successfully created
GridDisk flashdrive_FD_13_exa5cel02 successfully created
GridDisk flashdrive_FD_14_exa5cel02 successfully created
GridDisk flashdrive_FD_15_exa5cel02 successfully created
CellCLI> list griddisk flashdrive_FD_10_exa5cel02 detail
name: flashdrive_FD_10_exa5cel02
availableTo:
cellDisk: FD_10_exa5cel02
comment:
creationTime: 2011-02-02T02:56:52-08:00
diskType: FlashDisk
errorCount: 0
id: 0000012d-e604-87f2-0000-000000000000
offset: 6.28125G
size: 16.578125G
status: active
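The grid disk size reported above can be reproduced from the earlier numbers. The sketch below assumes that the 365.25G previously used as Flash Cache on this cell is the space being split, and that whatever remains after the 100G Flash Cache is divided evenly across the 16 Flash Drives:

```sql
-- Space left per Flash Drive after shrinking the Flash Cache to 100G
select (365.25 - 100) / 16        as griddisk_gb,  -- 16.578125
       (365.25 - 100) / 16 * 1024 as griddisk_mb   -- 16976
from   dual;
```

The result matches the size of 16.578125G listed for flashdrive_FD_10_exa5cel02.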
Changing to the Database Server Node to create the ASM diskgroup as sysasm:
SQL> select path,header_status, os_mb,free_mb from v$asm_disk where path like '%flash%';

PATH                                               HEADER_STATU      OS_MB    FREE_MB
-------------------------------------------------- ------------ ---------- ----------
o/192.168.14.10/flashdrive_FD_14_exa5cel02         CANDIDATE         16976          0
o/192.168.14.9/flashdrive_FD_12_exa5cel01          CANDIDATE         16976          0
o/192.168.14.10/flashdrive_FD_05_exa5cel02         CANDIDATE         16976          0
o/192.168.14.10/flashdrive_FD_11_exa5cel02         CANDIDATE         16976          0
o/192.168.14.10/flashdrive_FD_08_exa5cel02         CANDIDATE         16976          0
o/192.168.14.10/flashdrive_FD_15_exa5cel02         CANDIDATE         16976          0
o/192.168.14.10/flashdrive_FD_00_exa5cel02         CANDIDATE         16976          0
o/192.168.14.10/flashdrive_FD_03_exa5cel02         CANDIDATE         16976          0
o/192.168.14.10/flashdrive_FD_06_exa5cel02         CANDIDATE         16976          0
o/192.168.14.10/flashdrive_FD_12_exa5cel02         CANDIDATE         16976          0
o/192.168.14.10/flashdrive_FD_09_exa5cel02         CANDIDATE         16976          0
o/192.168.14.10/flashdrive_FD_01_exa5cel02         CANDIDATE         16976          0
o/192.168.14.10/flashdrive_FD_04_exa5cel02         CANDIDATE         16976          0
o/192.168.14.10/flashdrive_FD_13_exa5cel02         CANDIDATE         16976          0
o/192.168.14.10/flashdrive_FD_07_exa5cel02         CANDIDATE         16976          0
o/192.168.14.10/flashdrive_FD_10_exa5cel02         CANDIDATE         16976          0
o/192.168.14.9/flashdrive_FD_07_exa5cel01          CANDIDATE         16976          0
o/192.168.14.9/flashdrive_FD_04_exa5cel01          CANDIDATE         16976          0
o/192.168.14.9/flashdrive_FD_10_exa5cel01          CANDIDATE         16976          0
o/192.168.14.9/flashdrive_FD_01_exa5cel01          CANDIDATE         16976          0
o/192.168.14.9/flashdrive_FD_13_exa5cel01          CANDIDATE         16976          0
o/192.168.14.9/flashdrive_FD_08_exa5cel01          CANDIDATE         16976          0
o/192.168.14.9/flashdrive_FD_05_exa5cel01          CANDIDATE         16976          0
o/192.168.14.9/flashdrive_FD_02_exa5cel01          CANDIDATE         16976          0
o/192.168.14.9/flashdrive_FD_14_exa5cel01          CANDIDATE         16976          0
o/192.168.14.9/flashdrive_FD_11_exa5cel01          CANDIDATE         16976          0
o/192.168.14.9/flashdrive_FD_00_exa5cel01          CANDIDATE         16976          0
o/192.168.14.9/flashdrive_FD_09_exa5cel01          CANDIDATE         16976          0
o/192.168.14.9/flashdrive_FD_03_exa5cel01          CANDIDATE         16976          0
o/192.168.14.9/flashdrive_FD_15_exa5cel01          CANDIDATE         16976          0
o/192.168.14.9/flashdrive_FD_06_exa5cel01          CANDIDATE         16976          0
o/192.168.14.10/flashdrive_FD_02_exa5cel02         CANDIDATE         16976          0

32 rows selected.

SQL> create diskgroup flashdrive normal redundancy
disk 'o/*/flashdrive*'
attribute 'compatible.rdbms'='11.2.0.0.0',
'compatible.asm'='11.2.0.0.0',
'cell.smart_scan_capable'='TRUE',
'au_size'='4M';

Diskgroup created.
Please note the Allocation Unit size of 4 MB, which is the recommended setting for Exadata. Normal Redundancy is strongly recommended; each cell automatically becomes its own Failure Group. In other words: should one Storage Server crash, no data will be lost, regardless of whether we build the diskgroups on Flash Drives or on spinning drives. After the creation, we use the diskgroup like any other:
SQL> create tablespace veryfast datafile '+flashdrive' size 10g;

Tablespace created.
Any segment created in this tablespace will reside on Flash Drives permanently.
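To confirm the redundancy type and the failure groups described above, the ASM views can be queried. This is a minimal sketch, assuming the diskgroup name FLASHDRIVE and the grid disk prefix flashdrive from the demo; no output is shown because it was not part of the original session:

```sql
-- Redundancy type and capacity of the new diskgroup
select name, type, total_mb, free_mb, usable_file_mb
from   v$asm_diskgroup
where  name = 'FLASHDRIVE';

-- Number of flash-based disks per failure group
select failgroup, count(*) as disks
from   v$asm_disk
where  path like '%flashdrive%'
group  by failgroup
order  by failgroup;
```

With two cells and 16 grid disks each, one would expect two failure groups of 16 disks. Because of Normal Redundancy, the space usable for files is roughly half of the raw total.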
Let’s take the opportunity to give an example of Information Lifecycle Management (ILM):
SQL> create tablespace compahigh datafile size 1g;

Tablespace created.

SQL> alter tablespace compahigh default compress for archive high;

Tablespace altered.
The tablespace got created on spinning drives, because my DB_CREATE_FILE_DEST parameter points to such a diskgroup. Same for the next two:
SQL> create tablespace querylow datafile size 1g;

Tablespace created.

SQL> alter tablespace querylow default compress for query low;

Tablespace altered.

SQL> create tablespace ordinary datafile size 1g;

Tablespace created.
My plan is to store one large partitioned table in three tiers: partly compressed and partly uncompressed on spinning drives, and the newest and most volatile part on Flash Drives.
SQL> create table sales_part
(id number, flag number, product char(25),channel_id number,cust_id number,
amount_sold number, order_date date, ship_date date)
partition by range (order_date)
interval (numtoyminterval(1,'year'))
store in (veryfast)
(
partition archhigh values less than (to_date('01.01.1991','dd.mm.yyyy')) tablespace compahigh,
partition querylow values less than (to_date('01.01.1998','dd.mm.yyyy')) tablespace querylow,
partition ordi1998 values less than (to_date('01.01.1999','dd.mm.yyyy')) tablespace ordinary,
partition ordi1999 values less than (to_date('01.01.2000','dd.mm.yyyy')) tablespace ordinary,
partition ordi2000 values less than (to_date('01.01.2001','dd.mm.yyyy')) tablespace ordinary,
partition ordi2001 values less than (to_date('01.01.2002','dd.mm.yyyy')) tablespace ordinary,
partition ordi2002 values less than (to_date('01.01.2003','dd.mm.yyyy')) tablespace ordinary,
partition ordi2003 values less than (to_date('01.01.2004','dd.mm.yyyy')) tablespace ordinary,
partition ordi2004 values less than (to_date('01.01.2005','dd.mm.yyyy')) tablespace ordinary,
partition ordi2005 values less than (to_date('01.01.2006','dd.mm.yyyy')) tablespace ordinary
)
;

Table created.
This uses the 11g New Feature Interval Partitioning to create new partitions automatically on Flash Drives. Now loading the table from the old sales table:
SQL> alter table sales_part nologging;
Table altered.
SQL> insert /*+ append */ into sales_part select * from sales order by order_date;
20000000 rows created.
Notice the order by above. It later makes it possible not only to do partition pruning but also to use Storage Indexes if we filter on ORDER_DATE or even SHIP_DATE. The individual partitions now look like this:
SQL> select partition_name,tablespace_name,bytes/1024/1024 as mb
from user_segments where segment_name='SALES_PART';

PARTITION_NAME                 TABLESPACE_NAME                        MB
------------------------------ ------------------------------ ----------
ARCHHIGH                       COMPAHIGH                               8
ORDI1998                       ORDINARY                               56
ORDI1999                       ORDINARY                               56
ORDI2000                       ORDINARY                               56
ORDI2001                       ORDINARY                               56
ORDI2002                       ORDINARY                               56
ORDI2003                       ORDINARY                               56
ORDI2004                       ORDINARY                               56
ORDI2005                       ORDINARY                               56
QUERYLOW                       QUERYLOW                               40
SYS_P101                       VERYFAST                               56
SYS_P102                       VERYFAST                               56
SYS_P103                       VERYFAST                               56
SYS_P104                       VERYFAST                               56
SYS_P105                       VERYFAST                               56
SYS_P106                       VERYFAST                                8

16 rows selected.

SQL> select count(*) from sales_part partition (archhigh);

  COUNT(*)
----------
   5330000

SQL> select count(*) from sales_part partition (querylow);

  COUNT(*)
----------
   5114000

SQL> select count(*) from sales_part partition (ordi1998);

  COUNT(*)
----------
    730000

SQL> select count(*) from sales_part partition (sys_p101);

  COUNT(*)
----------
    730000
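The system-generated SYS_P names above are the partitions that Interval Partitioning created on the fly. As a sketch of how to tell them apart from the manually defined range partitions and to check the compression setting of each, the data dictionary can be queried like this (output not shown, since it was not part of the original session):

```sql
-- INTERVAL is YES for partitions created automatically by Interval Partitioning;
-- COMPRESS_FOR shows the compression type inherited from the tablespace default
select partition_name, tablespace_name, interval, compression, compress_for, high_value
from   user_tab_partitions
where  table_name = 'SALES_PART'
order  by partition_position;
```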
During the life cycle of the data, partitions become less volatile over time and can then be moved to spinning drives or even be compressed:
SQL> alter table sales_part move partition sys_p101 tablespace ordinary;

Table altered.

SQL> alter table sales_part move partition ordi1998 compress for query low tablespace querylow;

Table altered.

SQL> select partition_name,tablespace_name,bytes/1024/1024 as mb
from user_segments where segment_name='SALES_PART';

PARTITION_NAME                 TABLESPACE_NAME                        MB
------------------------------ ------------------------------ ----------
ARCHHIGH                       COMPAHIGH                               8
ORDI1998                       QUERYLOW                                8
ORDI1999                       ORDINARY                               56
ORDI2000                       ORDINARY                               56
ORDI2001                       ORDINARY                               56
ORDI2002                       ORDINARY                               56
ORDI2003                       ORDINARY                               56
ORDI2004                       ORDINARY                               56
ORDI2005                       ORDINARY                               56
QUERYLOW                       QUERYLOW                               40
SYS_P101                       ORDINARY                               56
SYS_P102                       VERYFAST                               56
SYS_P103                       VERYFAST                               56
SYS_P104                       VERYFAST                               56
SYS_P105                       VERYFAST                               56
SYS_P106                       VERYFAST                                8

16 rows selected.
We have no indexes in place – so there is no rebuild needed 🙂
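If the table did have indexes, moving a partition would leave the affected index partitions unusable unless they are maintained as part of the move. The following sketch illustrates this; the index sales_part_cust_ix is hypothetical and not part of the demo above:

```sql
-- Hypothetical local index, only for illustration
create index sales_part_cust_ix on sales_part (cust_id) local;

-- Move a partition and maintain the index partitions in the same operation
alter table sales_part move partition sys_p102 tablespace ordinary update indexes;

-- Check whether any index partitions were left unusable
select index_name, partition_name, status
from   user_ind_partitions
where  status = 'UNUSABLE';
```

Without the UPDATE INDEXES clause, the corresponding index partition would have to be rebuilt afterwards with ALTER INDEX ... REBUILD PARTITION.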
Summary: By default, the Flash Storage inside the Oracle Exadata Database Machine is used completely as Flash Cache. It effectively works as an extension of the Database Buffer Cache and delivers faster access together with a very high rate of IOs per second, which is especially important for OLTP. Additionally, we may take a part of the Flash Storage to build ASM diskgroups on. Files placed in these diskgroups reside permanently on Flash Storage, so no caching is needed.
* Addendum: The posting reflects the state of X2. With X3, Cells come with 4 x F40 Flash Cards that each deliver 400 GB of Flash capacity, for a total of 1600 GB of Flash capacity per Storage Server. Given that the F40 is also faster than the old F20, and together with the new Write Back Flash Cache technology, a need to build ASM diskgroups on Flash storage is even less likely than with X2.
** Second Addendum: Starting with newer versions of X2, we introduced Flash Logging, which takes 512 MB of Flash Storage from each cell (regardless of whether it is X2 or X3). This relatively small amount reduces the capacity of the Flash Cache accordingly. See here for a more detailed explanation (Page 6).
#1 by Arup Nanda on February 9, 2011 - 23:39
Excellent Information, Uwe. Short and to the point.
#2 by Uwe Hesse on February 10, 2011 - 08:54
Thank you, Arup! Much appreciated 🙂
#3 by Bhavik Desai on April 13, 2011 - 13:16
Hi Uwe,
Indeed a nice thread…
I have a question on flash cache.
Oracle says that we can use up to 80% of flash cache to hold KEEP objects.
If I put in more objects, increasing overall flash cache utilization to more than 80%, what would the performance of the flash cache be like?
No effect?
Or will it not allow ALL KEEP objects to be in flash cache, so that loaded objects are also maintained with an aging policy?
Or can the 80% utilization be expanded to accommodate additional objects?
Regards,
Bhavik Desai
#4 by Uwe Hesse on April 13, 2011 - 21:07
Hi Bhavik,
in the Flash Cache, objects assigned KEEP are also subject to an age-out policy: if those objects aren’t accessed for a longer time, they will vanish from the flash cache to make room for objects with the default attribute. And yes, they cannot consume more than 80% of the Flash Cache. If you assign the CELL_FLASH_CACHE KEEP attribute to objects that in total exceed 80% of the Flash Cache, they will simply not all be flash cached at the same time.
#5 by Bhavik Desai on April 14, 2011 - 13:50
Many thanks Uwe for your prompt and detailed response.
#6 by Kishore on August 11, 2011 - 04:00
What happens when we reboot the cell servers (all of them)? Is the data on the flash cache disks permanent, as on regular SAS/SATA drives?
#7 by Cristiano on August 21, 2011 - 02:07
Hi Bhavik Desai
Could you kindly tell us where you found that Oracle says we can use up to 80% of flash cache to hold KEEP objects?
I have been researching to confirm this information for at least a week and didn’t find it.
I would be thankful if you could post the source for this information.
Thank you
Best regards
Cristiano
#8 by Waseem on November 23, 2011 - 13:46
Hi Cristiano,
Same here, I too was unable to find the 80% threshold for the Flash Cache.
However, it is mentioned in the Oracle documents that come from the University.
This I am able to confirm.
Thanks.
#9 by Ashok on December 8, 2011 - 05:06
The following presentation from Oracle confirms that 80% can be used for KEEP:
http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=0CCcQFjAB&url=http%3A%2F%2Fioug.itconvergence.com%2Fpls%2Fapex%2FDWBISIG.download_my_file%3Fp_file%3D2617.&ei=UyngTvnoFLLciAKQ25iNDw&usg=AFQjCNE-zX-tmwuuqcz311WuHbBqq4YPpA
#10 by Uwe Hesse on December 8, 2011 - 11:36
Thank you, Ashok, for sharing this 🙂
#11 by peton on January 12, 2012 - 11:04
Hi Uwe,
Excellent Post!!! Thanks for sharing.
Could you please tell us what the relationship between the demo and Flash Cache is? I can’t understand this.
Thanks,
Peton
#12 by peton on January 12, 2012 - 11:06
Can I think about this demo in this way: does it mean we can put the datafile in a flash cache diskgroup?
Thanks,
Peton
#13 by Uwe Hesse on January 12, 2012 - 12:30
Peton,
I showed that we can take a part of the flash storage to build griddisks upon. That part is then not used for Flash Cache. Instead, tables that reside in a tablespace that is created in a diskgroup that is built by those griddisks will be permanently on flash storage.
#14 by peton on January 12, 2012 - 14:54
Hi Uwe,
Thanks for your response! I got it. -Peton
#15 by Uwe Hesse on January 12, 2012 - 17:03
You’re welcome 🙂
#16 by mwidlake on June 27, 2012 - 22:20
Hi Uwe,
Thank you for the very clear instructions on creating a disk group based on the flash storage in Exadata. It saved me valuable time when I was under pressure on a project recently.
We had a single table that was undergoing very high update activity and thus making high IOPS demands on our system. So much so that any other serious IO slowed down the activity on this key table. By moving it to an ASM diskgroup based on the flash storage, we resolved those issues.
Martin Widlake
#17 by Uwe Hesse on June 27, 2012 - 22:41
Hi Martin,
glad that you found it helpful! Thanks for taking the time to share that information with such nice feedback 🙂
#18 by harshadmark on July 31, 2012 - 21:12
What are the guidelines for setting cell_flash_cache values?
I mean, is "KEEP" more appropriate for OLTP than for Data Warehousing? Or should this consideration be made on the more granular level of tables?
#19 by Uwe Hesse on August 2, 2012 - 09:18
Generally, OLTP DBs are more likely to benefit from the Flash Cache, because IOPS (which perform better from Flash) are more relevant there than in a Data Warehouse, where we expect to see not many concurrent sessions. But yes, it’s a decision to be taken on the segment level.
#20 by Dan on January 17, 2013 - 10:27
Uwe, just a small correction: in newer images there’s the flashlog too, so the size of the flashcache drops to 364.75GB and the flashlog’s supposed to have 512MB.
#21 by Uwe Hesse on January 20, 2013 - 11:16
Dan, thank you for pointing that out! I will add that to the posting.
#22 by frank.oracle on February 10, 2014 - 03:09
Very nice presentation. Thanks!
#23 by sshdba on March 20, 2014 - 19:18
Uwe, I love your articles. I have been following your blog for a long time. What happens when you issue a drop flashcache command in a cell which has an ASM disk group configured over it? Does it destroy the normally persistent flash disk as well?
#24 by Uwe Hesse on March 21, 2014 - 09:38
The flashcache is maintained separately from the griddisks based on flash storage. So you can drop the flashcache and the flash based diskgroup is not affected. Keep in mind that you will most likely not need a flash based diskgroup and are better off turning on the write-back flashcache instead.
#25 by Latif on April 8, 2014 - 06:29
Hi Uwe,
Really nice blog to read. I have a large table (260 GB), and we usually face performance issues when accessing it. Currently we have pinned this table in flash storage, but the issue persists. What do you suggest: would this be resolved if we created a new disk group on flash storage and moved the table into it?
#26 by Uwe Hesse on April 8, 2014 - 07:54
Latif, I think it is unlikely that you will benefit from a diskgroup on flash storage with that table. Before you implement that, you should investigate with v$segment_statistics, and I bet you will see the major part of I/O requests for that table satisfied by "optimized physical read". You may have other performance problems with that table that cannot be resolved by hardware, like locking contention or buffer busy waits.
#27 by Yury on May 21, 2014 - 18:11
Hello, Uwe,
Can Exadata Smart FC cache temporary tables and the intermediate results of hashes and sorts?
Thank you.
#28 by Uwe Hesse on May 21, 2014 - 20:57
Yury, the Flashcache cannot get PGA content directly, because that lives at first in the memory of the database nodes. The temporary tablespace can be flashcached, though.
So if a global temporary table doesn’t fit into PGA memory completely, it may be stored partly in a temporary tablespace which is flashcached. Same for sorts etc.
#30 by Yury Pudovchenko on August 10, 2015 - 12:06
Hello, Uwe,
my question is about FC and LOBs.
The documentation says we can use the CELL_FLASH_CACHE clause for LOB segments (Oracle Exadata Storage Server Software User’s Guide, pages 7-60, 7-61).
But Oracle says nothing about the default policy.
Does the Flash Cache caching policy cache LOB segments by default? I mean LOBs in their own separate segment, not inline LOBs.
What about the old version of LOBs (Basic) versus SecureFiles? Is there a difference in how they are cached in FC, or are they treated the same?
Thank you!