Skip to content

adibpat/gossip-based-heartbeat

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 

Repository files navigation

/* Single Author Info:
 *  Equal Contribution
 * Group Info:
 *  abpatwar Aditya B Patwardhan
 *  savidhal Shiaji A Vidhale
 *  smnaik Sanskruti M Naik
*/

NODE_FAILURE_DETECTION_USING_GOSSIP_BASED_HEARTBEAT_ALGORITHM
=============================================================

C++ program p4 with parameters:

1. number of peer nodes N (Total Number of Nodes in the system),
2. gossip parameter b (Pre-detremined Nodes to send heartbeats to),
3. gossip parameter c (iterations each node sends randomly chosen b neighbor list entries),
4. number of seconds after which a node is considered dead F, 
5. number of bad nodes that should fail B,
6. number of seconds to wait between failures P, 
7. the seed of the random number generator S, 
8. the total number of seconds to run T.

This node failure detection algorithm detects failures of nodes in a distributed system. The program creates a file called "endpoints" in the distributed file system and puts IP address and port number in a line for each node. 

The algorithm checks for changes in the heartbeat values of the nodes.No change in the value of heartbeat sent by the node for F seconds marks node failure.

The following section illustrates the performance of the gossip-based failure detection scheme for various input cases:

Here, we are choosing b different neighbors for every iteration. c is the total number of iterations per node.

CASE 1
======
N = 4, b = 1, c = 12, F = 3, B = 2, P = 4, S = 5, T = 18

For this set of inputs, the detection algorithm did not perform reliably.
Initially, each node had enough neighbors to send gossips to. But, it had to send only to b neighbors. After B nodes failed, each of the remaining two nodes had only one neighbor to gossip.

[abpatwar@eb2-2236-aw09 p4]$ cat list0
FAIL
0 7
1 2
2 2
3 2

Only two nodes were suposed to fail. But, the algorithm detected 3 failed nodes. 
This is because we have only one neighbor i.e b=1. Thus, the gossip does not spread quickly.
Adding to the trouble, the parameter F is small. This leads to inaccurate prediction of failed nodes.

CASE 2
======
N = 4, b = 2, c = 12, F = 3, B = 2, P = 4, S = 5, T = 18

For this set of inputs, the detection algorithm performed well.
Initially, each node had enough neighbors to send gossips to. After B nodes failed, each of the remaining two nodes had only one neighbor to gossip.

[abpatwar@eb2-2236-aw08 p4]$ cat list3
OK
0 6
1 3
2 12
3 12

For this set of inputs, the scheme detected failures with good accuracy. Though the time of failures are not perfect, they are in the range of 0-1 sec from actual failure time.

CASE 3
======
N = 4, b = 2, c = 12, F = 3, B = 2, P = 2, S = 5, T = 18


[abpatwar@eb2-2236-aw08 p4]$ cat list3
OK
0 3
1 2
2 12
3 12

The parameter P is reduced. This parameter is significant w.r.t the spread of gossip. 
With reduced P, the gossip does not spread well and the detection scheme is activated before every node receives gossips about all other nodes.

Correctness: 
Correctness is ensured when b is >= N/2.
This condition keeps sufficient number of neighbors per node and thus a reliable sprea

Fault Tolerance:
For detecting faults with accuracy, F should be large. Large F compensates the delays in the network.
Large P ensures that the topology is well discovered and gossips are spread throughout.
   

About

Gossip based Heartbeat Algorithm for Node Failure Detection in a Distributed System

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors