Purpose of the Voting Disk for #Oracle RAC

The Voting Disk provides an additional communication path for the cluster nodes in case of problems with the Interconnect. It prevents Split-Brain scenarios. That is another topic from my recent course Oracle Grid Infrastructure 11g: Manage Clusterware and ASM that I’d like to share with the Oracle Community.

Under normal circumstances, the cluster nodes are able to communicate through the Interconnect. Not only do the cssd background processes exchange a network heartbeat that way, but things like Cache Fusion also use that path. The red lines in the picture symbolize the cssd network heartbeat. Additionally, the cssd processes also write regularly to the Voting Disk (also called Voting File) and exchange a disk heartbeat over that path, represented by the blue lines in the picture. Each cssd makes an entry for its own node and for the other nodes it can reach over the network:
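The disk-heartbeat bookkeeping just described can be sketched as a toy model. To be clear, this is my own illustration, not Oracle's actual on-disk format or any CSS API: each node's cssd periodically increments a heartbeat counter in its own slot of the voting file and records which peers it can currently reach over the network.

```python
import time

class VotingFile:
    """Toy model of the shared voting file: one slot per cluster node.
    (Illustrative only - the real voting file layout is Oracle-internal.)"""
    def __init__(self, nodes):
        self.slots = {n: {"heartbeat": 0, "reachable": set()} for n in nodes}

    def write_entry(self, node, reachable_peers):
        """What a cssd tick does in this model: bump the disk heartbeat
        and publish the node's current network view."""
        slot = self.slots[node]
        slot["heartbeat"] += 1                    # disk heartbeat counter
        slot["reachable"] = set(reachable_peers)  # peers seen via interconnect
        slot["written_at"] = time.time()

vf = VotingFile(["node1", "node2", "node3"])
# Under normal circumstances, every node reports seeing every other node:
vf.write_entry("node1", {"node2", "node3"})
vf.write_entry("node2", {"node1", "node3"})
vf.write_entry("node3", {"node1", "node2"})
```

Every node can then read all slots and notice when a peer's heartbeat counter stops advancing, even while the interconnect is down.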

Voting File in an Oracle Cluster under normal circumstances

Now, in case of a network error, a Split-Brain problem would occur without a Voting Disk. Suppose node1 has lost its network connection to the Interconnect. To prevent exactly that, redundant network cards have long been recommended; we introduced HAIP in 11.2.0.2 to make that easier to implement, without the need for bonding, by the way. But here, node1 cannot use the Interconnect anymore. It can still access the Voting Disk, though. Nodes 2 and 3 still see their heartbeats, but no longer node1's, which is indicated by the green check marks and red crosses in the picture. The node with the network problem gets evicted: a Poison Pill is placed into the Voting File for node1, whereupon the cssd of node1 commits suicide and leaves the cluster:

Split-Brain is prevented with the Voting File that got a Poison Pill placed

The pictures in this posting are almost identical to what I paint on the whiteboard during the course. Hope you find it useful :-)
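The eviction sequence shown above can be sketched as well. This is a deliberately simplified model of my own (the real CSS reconfiguration protocol also involves the master node and tie-breaking rules): nodes that can still reach each other form sub-clusters, the largest sub-cluster survives, and a poison pill is recorded for everyone else.

```python
def sub_clusters(view):
    """Group nodes into sub-clusters of mutual reachability.
    view: node -> set of peers that node can reach over the interconnect."""
    remaining = set(view)
    clusters = []
    while remaining:
        start = remaining.pop()
        group, frontier = {start}, [start]
        while frontier:
            n = frontier.pop()
            for p in view[n]:
                if p in remaining and n in view[p]:  # mutual reachability
                    remaining.remove(p)
                    group.add(p)
                    frontier.append(p)
        clusters.append(group)
    return clusters

def evict(view):
    """Largest sub-cluster survives; poison pills for the rest.
    (Ties are broken differently in real Clusterware.)"""
    clusters = sorted(sub_clusters(view), key=len, reverse=True)
    survivors = clusters[0]
    poisoned = {n for c in clusters[1:] for n in c}
    return survivors, poisoned

# node1 lost the interconnect; node2 and node3 still see each other:
view = {"node1": set(), "node2": {"node3"}, "node3": {"node2"}}
survivors, poisoned = evict(view)
print(sorted(survivors), sorted(poisoned))  # ['node2', 'node3'] ['node1']
```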

Related posting: Voting Disk and OCR in 11gR2: Some changes


  1. #1 by Anju Garg on September 20, 2013 - 16:34

  2. #2 by Ramesh on September 25, 2013 - 13:11

    Thanks for the voting disk concept. I always prefer your blog. Thank you very much for all the concepts.

    I have one question here: who puts the poison pill into the voting file? Also, please post an article on the split brain concept. It would be helpful for me.

    Thank you sir, once again.

    Thanks,
    Ramesh D

  3. #3 by John on November 4, 2013 - 14:11

    Hi,

    What would happen in the scenario of a 3-node RAC with 3 voting disks where the interconnect for each node goes down?
    If each node can still see the voting disks but not each other, how is it decided which node(s) to evict?
    What are the rules for the number of voting disks for RACs with more than 2 nodes?

    Many thanks,
    John

  4. #4 by Uwe Hesse on November 8, 2013 - 12:33

    Ramesh D, thank you for the nice feedback! One of the nodes of the majority places the poison pill. We have no split brain concept because the voting disk technique avoids it – not much to post about :-)

  5. #5 by Uwe Hesse on November 8, 2013 - 12:40

    John, there is no relation between the number of RAC nodes and the number of voting disks (except for Extended RAC, where you will have 1 on each side and 1 separately in the middle).

    If the interconnect goes down completely, the Master Node will survive and the others get evicted. It is important to keep in mind that the voting disk (regardless of the number of files) as a logical entity is still accessible while the Interconnect has issues.

    It provides a second communication path for the nodes if the network fails, that’s the point of it.

  6. #6 by Apostolos on February 10, 2014 - 17:34

    Hello Uwe,

    Thank you for sharing this with us.

    Would it be possible to clarify your sentence: “It can still access the Voting Disk, though.”? What do you mean by “access”, and *how* can it access the Voting Disk? Please correct me if I am wrong, but to my understanding the Interconnect network and the network that accesses the shared storage (including the voting disk) should be 2 different networks (2 distinct network infrastructures: cabling, switches, IP network, etc.). If the shared storage (that stores the voting disk) is on the same network as the private interconnect, then what is the point of having a voting disk in the first place? Am I correct? Would you consider it a “best practice” to have 2 distinct networks, one for the interconnect and one for the storage? Finally, does the Interconnect network require direct access to the shared storage (where the datafiles are stored)? To my understanding, it would be a valid configuration to have the Interconnect only for the network heartbeat and Cache Fusion, and the storage network (different cabling/IP network) for the disk heartbeat and for accessing the datafiles.

    I hope my questions were clear to you.
    Thank you very much,
    Apostolos

  7. #7 by Uwe Hesse on February 10, 2014 - 18:01

    Apostolos, yes, the connection to the shared storage will usually be over a different network than the one the interconnect uses. That is why it is often called the ‘private interconnect’. It would be a serious design mistake to have both the shared storage and the interconnect running over the same network if that network has no built-in redundancy to safeguard against single points of failure.

  8. #8 by Apostolos on February 10, 2014 - 18:15

    Hello Uwe,
    Thank you for your lightning reply.

    But even with built-in redundancy in place (for example NIC teaming/bonding, redundant switches, etc.), having the disk heartbeat over the same network that the network heartbeat uses defeats the purpose of the voting disk. Am I correct? The V.D. is there to inform the surviving nodes that network connectivity failed for one (or more) nodes. If they share the same network and that network fails for a node, both the network heartbeat and the disk heartbeat stop transmitting at the same time. So, the failed node will never receive the “poison pill”. Therefore, even with redundancy in place, to my understanding the private interconnect and the storage network should always be on different networks (i.e. different network infrastructure).

    Also, since such a configuration is feasible and supported by Oracle, how does G.I./RAC know which network to use for accessing the data files (and for the disk heartbeat) and which network to use for Cache Fusion and the network heartbeat? I mean that during the G.I. installation you can only differentiate between public, private (interconnect) and “do not use”. There is no distinction between the Interconnect and the storage network.

    Final question, can the private Interconnect and the storage network be on different IP networks (or subnets)?

    Your answers are highly appreciated.

    Many thanks,
    Apostolos

  9. #9 by Uwe Hesse on February 11, 2014 - 15:37

    Apostolos, the connection to the shared storage is an OS layer thing that is transparent to the Grid Infrastructure, therefore there is no such choice in the OUI. In general, you are right about having a separate network for both. Only we don’t do that ourselves with Exadata, where we use the same (highly redundant InfiniBand) network for both :-)

  10. #10 by Apostolos on February 11, 2014 - 15:55

    Uwe, thank you very much for your answers and valuable clarifications.

    Looking forward to your presentations at the OUG Ireland, which I will also attend.
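One more note on John’s question about the number of voting disks: a node must be able to access a strict majority of the configured voting files to stay in the cluster, which is why odd counts (1, 3, 5, …) are the usual recommendation. A minimal sketch of that rule:

```python
def survives(total_voting_files, accessible):
    """Strict-majority rule: a node keeps quorum only if it can access
    more than half of the configured voting files."""
    return accessible > total_voting_files // 2

print(survives(3, 2))  # True: 2 of 3 is a majority
print(survives(3, 1))  # False: the node would be evicted
print(survives(2, 1))  # False: an even count tolerates no more failures than the next odd count down
```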
