- Nov 30, 2009
IJCSI International Journal of Computer Science Issues, Vol. 6, No. 2, 2009
ISSN (Online): 1694-0784
ISSN (Print): 1694-0814

High Availability Cluster System for Local Disaster Recovery with Markov Modeling Approach

T.T.Lwin and T.Thein
University of Computer Studies, Yangon, Myanmar

Abstract

The need for high availability (HA) and disaster recovery (DR) in the IT environment is more stringent than in most other sectors of an enterprise. Many businesses require the availability of business-critical applications 24 hours a day, seven days a week, and can afford no data loss in the event of a disaster. It is vital that the IT infrastructure is resilient with regard to disruption, even site failures, and that business operations can continue without significant impact. As a result, DR has gained great importance in IT. Clustering multiple industry-standard servers together to allow workload sharing and fail-over capabilities is a low-cost approach. In this paper, we present the availability model through a Semi-Markov Process (SMP) and also analyze the difference in downtime between the SMP model and the approximate Continuous Time Markov Chain (CTMC) model. To acquire system availability, we perform numerical analysis and SHARPE tool evaluation.

Keywords: availability, cluster system, local disaster recovery, Markov modeling

1. Introduction

High availability clusters (also known as HA clusters or failover clusters) are computer clusters implemented to provide high availability of services. They operate by having redundant computers or nodes which are used to provide service when a system component fails.

A cluster is a collection of computer nodes (independent, self-contained computer systems) working together to provide a more reliable and powerful system than a single node alone [8]. Clustering has proven to be a very effective method for scaling to larger systems for added performance, as well as providing higher levels of availability and lower management costs. For this reason, software packages such as IBM's RS/6000 Cluster Technology [8] (i.e., Phoenix) and Microsoft's Cluster Services [5] (i.e., Wolfpack) are being used to build high availability systems.

Disaster recovery solutions have gained popularity in the past few years because of their ability to tolerate disasters and to achieve reliability and availability.

2. Related Work

Hunter [5] described some system characteristics that benefit from clustering, presented a two-node Microsoft Cluster Service (MSCS) cluster configuration, and also presented an availability model of that system using Markov modeling techniques.

The authors of [1] discussed high availability and disaster recovery solutions, and described how HA and DR solutions differ from one another and how they can be combined to provide the highest levels of resiliency for IT infrastructures.

Trivedi et al. [10] described an availability model for a high availability platform using a multi-level hierarchical composition approach that mixes reliability block diagrams and Markov chains, so as to allow detailed behavior to be captured while avoiding state space explosion.

Song et al. [9] provided novel solutions with three key components (availability modeling, model evaluation and data analysis) and examined numerical solutions for Markov models based on the uniformization method. That paper also presents a monitoring and data analysis framework, which is responsible for failure analysis and availability reconfiguration.

The semi-Markov decision model is a powerful tool for analyzing sequential decision processes with random decision epochs [2]. The authors of [2] presented an application of the Markov decision process algorithm, together with a joint optimization of the inspection rate and its corresponding maintenance policy.

3. System Architecture

The architecture is based on an active-passive high availability solution. Each service under high availability needs at least two identical servers: a primary host, on which the service runs, and one or more secondary hosts, able to recover the application. As a result of failure detection, the active-passive roles are switched. A heartbeat keep-alive system is used to monitor the health of the nodes in the cluster. A disaster recovery solution is typically composed of two nodes, one active and one passive. The active node is usually called the master or production node, and the passive node is called the secondary or standby node. During normal operation, the only working node is the master node; in the event of a node failover or switchover, the standby node takes over the production role by taking its IP number and completely replacing the master one. To maintain the standby node for failover, the standby node contains homogeneous installations and applications: data and configurations must also be constantly synchronized with the master node.

[Figure 1: System Framework. Diagram labels: Application Server A, Application Server B, Boot Drive (one per server), Heartbeat, Private, LAN/WAN, Data A, Data B.]

If a crash occurs and the data is not restored, it can have devastating consequences for a business. So it is imperative for companies to back up and recover data effectively and to protect themselves from huge losses in productivity and downtime.

In this way, hardware exposure is mitigated through physical hardware redundancy. Clustering provides high availability by protecting against a node failure. However, it does not protect against storage failures. Given the size of typical cluster environments, multiple hard disks are used to build large storage arrays. In network and system administration, when large numbers of any one device are used, failure is expected. When a hard disk fails, application disruption is unavoidable, as all the nodes in the cluster could be using that one particular disk as shared storage which contains all files.

With the widespread use of computers, data is becoming more and more important in human life. But all kinds of accidents and disasters occur frequently. Data corruption and data loss caused by various disasters have become more dominant, accounting for over 60% [1] of data loss. Recent high-profile data loss has raised awareness of the need to plan for recovery and continuity. Many data disaster tolerance technologies have been employed to increase the availability of data and to reduce the data damage caused by disasters [2].

A true disaster recovery solution is the ability to restore full systems quickly on available computing resources, which may be local but may also be remote if the situation dictates, and it must allow recovery from site-wide disasters. When the primary site is completely down, a secondary site located in a non-affected area would be used to restore services until the primary site comes back online.

4. Modeling and Analysis

We propose a two-component system in which one component is considered the active unit and the other a standby (spare) unit. The failure rates of the active unit and the standby unit are different, and the effect of a failure of the standby unit is also different from that of the active unit. We assume that the times to restoration and reboot are exponentially distributed with rates $\mu$ and $\beta$ respectively. We consider a routine diagnostic that is run every T time units, intended to detect a latent fault of the standby unit. Although unit failure and restoration times are exponentially distributed, the interval between routine diagnostics is deterministic, so the resulting process is not a continuous time Markov chain. The model for the system with the diagnostic routine is called a semi-Markov chain. To solve this model, we could crudely approximate the time to the next diagnostic as exponentially distributed with mean $T/2$. Descriptions of the states are shown in Table (1).

Table (1): State Description for Transitions model

State   Description
1       Both active and spare units are working
2       Protection switch fails to cover the failure of the active unit
3       When active unit fails, protection switch successfully restores service by the standby unit
4       The failure of the standby unit while the active unit is still working is detected immediately
5       The failure of the standby unit is not detected
6       The system is in failure state
[Figure 2: State Transition Model]

$\lambda$ = failure rate of an active unit
$\lambda_s$ = failure rate of a standby unit
$\mu$ = restoration rate of a failed unit
$\beta$ = reboot rate after an uncovered failure of the active unit
$c$ = coverage probability of an active unit
$c_s$ = coverage probability of a standby unit
$T$ = time units to detect the latent fault of the standby unit

We may compute the steady-state probabilities by first writing down the steady-state balance equations of Figure 2, which are as follows:

$$\mu P_3 + \mu P_4 = \lambda(1-c)P_1 + \lambda c P_1 + \lambda_s c_s P_1 + \lambda_s(1-c_s)P_1 \tag{1}$$

$$\lambda(1-c)P_1 = (\beta + \lambda_s)P_2 \tag{2}$$

$$\lambda c P_1 + \beta P_2 + \mu P_6 = \mu P_3 + \lambda_s P_3 \tag{3}$$

$$\lambda_s c_s P_1 + \frac{2}{T} P_5 + \mu P_6 = (\lambda + \mu)P_4 \tag{4}$$

$$\lambda_s(1-c_s)P_1 = \left(\lambda + \frac{2}{T}\right)P_5 \tag{5}$$

$$\lambda_s P_2 + \lambda_s P_3 + \lambda P_4 + \lambda P_5 = 2\mu P_6 \tag{6}$$

The conservation equation of Figure 2 is obtained by summing the probabilities of all states in the system; the sum is 1:

$$\sum_{i=1}^{n} P_i = 1 \tag{7}$$

Combining the above-mentioned balance equations with the conservation equation, and solving these simultaneous equations, we acquire the closed-form solution for the system. Using $P_3 + P_4 = \frac{\lambda+\lambda_s}{\mu}P_1$ from equation (1),

$$P_1 = \left[\,1 + \frac{\lambda(1-c)}{\lambda_s+\beta} + \frac{\lambda+\lambda_s}{\mu} + \frac{\lambda_s(1-c_s)}{\lambda+\frac{2}{T}} + \frac{P_6}{P_1}\right]^{-1} \tag{8}$$

where the ratio $P_6/P_1$ follows from equation (13), and

$$P_2 = \frac{\lambda(1-c)}{\lambda_s+\beta}\,P_1 \tag{9}$$
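The balance equations (1)-(6) and the conservation equation (7) can also be solved numerically, which gives a quick check on any closed-form expression. The following is a minimal sketch under stated assumptions: the function and variable names (`solve_linear`, `ctmc_steady_state`, `lam`, `lam_s`, `mu`, `beta`) are illustrative, the transition rates are read off the balance equations with the diagnostic approximated by an exponential of rate 2/T, and the parameter values in the usage lines are hypothetical rather than taken from the paper.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ctmc_steady_state(lam, lam_s, mu, beta, c, c_s, T):
    """Return [P1, ..., P6] for the approximate CTMC (diagnostic rate 2/T)."""
    d = 2.0 / T
    # rates[i][j] is the transition rate from state i+1 to state j+1.
    rates = [
        [0, lam * (1 - c), lam * c, lam_s * c_s, lam_s * (1 - c_s), 0],
        [0, 0, beta, 0, 0, lam_s],
        [mu, 0, 0, 0, 0, lam_s],
        [mu, 0, 0, 0, 0, lam],
        [0, 0, 0, d, 0, lam],
        [0, 0, mu, mu, 0, 0],
    ]
    n = len(rates)
    # Row j encodes: (flow into state j) - (flow out of state j) = 0.
    a = [[rates[i][j] for i in range(n)] for j in range(n)]
    for j in range(n):
        a[j][j] -= sum(rates[j])
    b = [0.0] * n
    # One balance equation is redundant; replace it by the conservation equation.
    a[n - 1] = [1.0] * n
    b[n - 1] = 1.0
    return solve_linear(a, b)


# Hypothetical parameter values (rates per hour, T in hours).
params = dict(lam=0.001, lam_s=0.00025, mu=1.0, beta=12.0, c=0.9, c_s=0.9, T=1.0)
probs = ctmc_steady_state(**params)
for state, p in enumerate(probs, start=1):
    print(f"P{state} = {p:.6e}")
```

Replacing one balance equation with the conservation equation is a common way to remove the redundancy in the singular balance system; any one of the six could be dropped.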
The remaining state probabilities follow by substitution into equations (3)-(6):

$$P_3 = \frac{\lambda c + \dfrac{(2\beta+\lambda_s)\,\lambda(1-c)}{2(\lambda_s+\beta)} + \dfrac{\lambda\,\lambda_s(1-c_s)}{2\left(\lambda+\frac{2}{T}\right)} + \dfrac{\lambda(\lambda+\lambda_s)}{2\mu}}{\mu + \dfrac{\lambda+\lambda_s}{2}}\;P_1 \tag{10}$$

$$P_4 = \frac{\lambda_s c_s + \dfrac{\lambda_s\,\lambda(1-c)}{2(\lambda_s+\beta)} + \dfrac{\left(\lambda+\frac{4}{T}\right)\lambda_s(1-c_s)}{2\left(\lambda+\frac{2}{T}\right)} + \dfrac{\lambda_s(\lambda+\lambda_s)}{2\mu}}{\mu + \dfrac{\lambda+\lambda_s}{2}}\;P_1 \tag{11}$$

$$P_5 = \frac{\lambda_s(1-c_s)}{\lambda+\frac{2}{T}}\,P_1 \tag{12}$$

$$P_6 = \frac{\lambda_s P_2 + \lambda_s P_3 + \lambda P_4 + \lambda P_5}{2\mu} \tag{13}$$

4.1 Semi-Markov Model Analysis

A better approach would be to take the time to the next diagnostic to be uniformly distributed over [0, T], resulting in a semi-Markov chain. This is indicated in Figure 2 by the transition labeled U(0, T). Because its evolution occurs in two stages of transitions, the SMP is described by a transition probability matrix P and the vector of sojourn time distributions H(t):

$$H_1(t) = 1 - e^{-(\lambda+\lambda_s)t} \tag{14}$$

$$H_2(t) = 1 - e^{-(\beta+\lambda_s)t} \tag{15}$$

$$H_3(t) = 1 - e^{-(\lambda_s+\mu)t} \tag{16}$$

$$H_4(t) = 1 - e^{-(\lambda+\mu)t} \tag{17}$$

$$H_5(t) = \begin{cases} 1 - \left(1 - \dfrac{t}{T}\right)e^{-\lambda t}, & t < T, \\[4pt] 1, & t \ge T, \end{cases} \tag{18}$$

$$H_6(t) = 1 - e^{-2\mu t} \tag{19}$$

Let $X \sim EXP(\lambda)$ and $Y \sim U(0, T)$ be random variables. Then
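$$P(X > Y) = \int_0^T P(X > t)\, f_Y(t)\, dt = \int_0^T e^{-\lambda t}\,\frac{1}{T}\, dt = \frac{1}{\lambda T}\left(1 - e^{-\lambda T}\right) \tag{20}$$

The one-step transition probability matrix P of the DTMC embedded at the times of transitions and the state probabilities of the embedded DTMC are given by the following equations respectively (rows and columns are indexed by states 1 to 6).

$$P = \begin{bmatrix}
0 & \dfrac{\lambda(1-c)}{\lambda+\lambda_s} & \dfrac{\lambda c}{\lambda+\lambda_s} & \dfrac{\lambda_s c_s}{\lambda+\lambda_s} & \dfrac{\lambda_s(1-c_s)}{\lambda+\lambda_s} & 0 \\[6pt]
0 & 0 & \dfrac{\beta}{\beta+\lambda_s} & 0 & 0 & \dfrac{\lambda_s}{\beta+\lambda_s} \\[6pt]
\dfrac{\mu}{\lambda_s+\mu} & 0 & 0 & 0 & 0 & \dfrac{\lambda_s}{\lambda_s+\mu} \\[6pt]
\dfrac{\mu}{\lambda+\mu} & 0 & 0 & 0 & 0 & \dfrac{\lambda}{\lambda+\mu} \\[6pt]
0 & 0 & 0 & \dfrac{1}{\lambda T}\left(1-e^{-\lambda T}\right) & 0 & 1-\dfrac{1}{\lambda T}\left(1-e^{-\lambda T}\right) \\[6pt]
0 & 0 & \dfrac{1}{2} & \dfrac{1}{2} & 0 & 0
\end{bmatrix} \tag{21}$$

$$v = \left[\,v_{1,1},\ v_{1C},\ v_{0,1},\ v_{1,0},\ v_{1D},\ v_{0,0}\,\right] \tag{22}$$

The components of $v$ are written $v_1, \dots, v_6$ below, in the same order. To obtain the steady-state probabilities, solve the equation

$$v = vP \tag{23}$$

This yields

$$v_2 = \frac{\lambda(1-c)}{\lambda+\lambda_s}\,v_1 \tag{24}$$

$$v_3 = \frac{\lambda c}{\lambda+\lambda_s}\,v_1 + \frac{\beta}{\beta+\lambda_s}\,v_2 + \frac{1}{2}\,v_6 \tag{25}$$

$$v_4 = \frac{\lambda_s c_s}{\lambda+\lambda_s}\,v_1 + \frac{1}{\lambda T}\left(1-e^{-\lambda T}\right)v_5 + \frac{1}{2}\,v_6 \tag{26}$$

$$v_5 = \frac{\lambda_s(1-c_s)}{\lambda+\lambda_s}\,v_1 \tag{27}$$

$$v_6 = \frac{2\left[\,v_1 - \dfrac{\mu}{\lambda_s+\mu}\left(\dfrac{\lambda c}{\lambda+\lambda_s}\,v_1 + \dfrac{\beta}{\beta+\lambda_s}\,v_2\right) - \dfrac{\mu}{\lambda+\mu}\left(\dfrac{\lambda_s c_s}{\lambda+\lambda_s}\,v_1 + \dfrac{1}{\lambda T}\left(1-e^{-\lambda T}\right)v_5\right)\right]}{\dfrac{\mu}{\lambda_s+\mu} + \dfrac{\mu}{\lambda+\mu}} \tag{28}$$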
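The embedded-chain solution can be sanity-checked by computing the vector v from the first row of P and the relations obtained from v = vP, then confirming that the result really is a fixed point of P. The sketch below assumes the matrix entries of equation (21) and fixes the free scale by setting v1 = 1; the function names (`embedded_dtmc`, `prob_x_exceeds_y`), the parameter names and the numeric values are illustrative assumptions, not taken from the paper. It also estimates the integral in equation (20) with a midpoint rule as an independent check of the closed form.

```python
import math


def embedded_dtmc(lam, lam_s, mu, beta, c, c_s, T):
    """Return (P, v): the matrix of equation (21) and v scaled so that v1 = 1."""
    q = -math.expm1(-lam * T) / (lam * T)  # P(X > Y), equation (20)
    a = lam + lam_s
    P = [
        [0, lam * (1 - c) / a, lam * c / a, lam_s * c_s / a, lam_s * (1 - c_s) / a, 0],
        [0, 0, beta / (beta + lam_s), 0, 0, lam_s / (beta + lam_s)],
        [mu / (lam_s + mu), 0, 0, 0, 0, lam_s / (lam_s + mu)],
        [mu / (lam + mu), 0, 0, 0, 0, lam / (lam + mu)],
        [0, 0, 0, q, 0, 1 - q],
        [0, 0, 0.5, 0.5, 0, 0],
    ]
    v1 = 1.0
    v2 = v1 * P[0][1]                    # equation (24)
    v5 = v1 * P[0][4]                    # equation (27)
    in3 = v1 * P[0][2] + v2 * P[1][2]    # flow into state 3, excluding state 6
    in4 = v1 * P[0][3] + v5 * P[4][3]    # flow into state 4, excluding state 6
    # v1 = v3 * P[2][0] + v4 * P[3][0], with v3 = in3 + v6/2 and v4 = in4 + v6/2.
    v6 = 2.0 * (v1 - P[2][0] * in3 - P[3][0] * in4) / (P[2][0] + P[3][0])
    v3 = in3 + 0.5 * v6                  # equation (25)
    v4 = in4 + 0.5 * v6                  # equation (26)
    return P, [v1, v2, v3, v4, v5, v6]


def prob_x_exceeds_y(lam, T, steps=100000):
    """Midpoint-rule estimate of the integral in equation (20)."""
    dt = T / steps
    return sum(math.exp(-lam * (k + 0.5) * dt) for k in range(steps)) * dt / T


# Hypothetical parameter values (rates per hour, T in hours).
P, v = embedded_dtmc(lam=0.001, lam_s=0.00025, mu=1.0, beta=12.0,
                     c=0.9, c_s=0.9, T=1.0)
vP = [sum(v[i] * P[i][j] for i in range(6)) for j in range(6)]
print("max |v - vP| =", max(abs(x - y) for x, y in zip(v, vP)))
print("closed form vs numeric P(X > Y):", P[4][3], prob_x_exceeds_y(0.001, 1.0))
```

Because v is only determined up to a constant factor, the scaling v1 = 1 is harmless: the semi-Markov state probabilities depend only on the ratios of the components.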
The mean sojourn time $h_i$ in state $i$ is

$$h_i = \int_0^{\infty} \left(1 - H_i(t)\right) dt \tag{29}$$

$$h_1 = \frac{1}{\lambda+\lambda_s} \tag{30}$$

$$h_2 = \frac{1}{\beta+\lambda_s} \tag{31}$$

$$h_3 = \frac{1}{\lambda_s+\mu} \tag{32}$$

$$h_4 = \frac{1}{\lambda+\mu} \tag{33}$$

$$h_5 = \frac{1}{\lambda} - \frac{1}{\lambda^2 T}\left(1 - e^{-\lambda T}\right) \tag{34}$$

$$h_6 = \frac{1}{2\mu} \tag{35}$$

The state probabilities of the semi-Markov chain are

$$\pi_i = \frac{v_i h_i}{\sum_j v_j h_j} \tag{36}$$

where $i, j \in \{(1,1),\ 1C,\ (0,1),\ (1,0),\ 1D,\ (0,0)\}$. Written out in terms of $\pi_1$, with the ratios $v_3/v_1$, $v_4/v_1$ and $v_6/v_1$ taken from equations (25), (26) and (28):

$$\pi_1 = \frac{h_1}{h_1 + h_2\,\dfrac{\lambda(1-c)}{\lambda+\lambda_s} + h_3\,\dfrac{v_3}{v_1} + h_4\,\dfrac{v_4}{v_1} + h_5\,\dfrac{\lambda_s(1-c_s)}{\lambda+\lambda_s} + h_6\,\dfrac{v_6}{v_1}} \tag{37}$$

$$\pi_2 = \frac{h_2}{h_1}\,\frac{\lambda(1-c)}{\lambda+\lambda_s}\,\pi_1 \tag{38}$$

$$\pi_3 = \frac{h_3}{h_1}\,\frac{v_3}{v_1}\,\pi_1 \tag{39}$$
$$\pi_4 = \frac{h_4}{h_1}\,\frac{v_4}{v_1}\,\pi_1 \tag{40}$$

$$\pi_5 = \frac{h_5}{h_1}\,\frac{\lambda_s(1-c_s)}{\lambda+\lambda_s}\,\pi_1 \tag{41}$$

$$\pi_6 = \frac{h_6}{h_1}\,\frac{v_6}{v_1}\,\pi_1 \tag{42}$$

5. Experimental Results

The exact parameter values for the model are not known; however, good estimates over a range of model parameters are assumed. Figure 3 plots the difference between the downtime (minutes per year) estimates obtained using the SMP model and those obtained by approximating the U(0, T) distribution by an exponential distribution with mean T/2. We take the values $c = 0.9$, $c_s = 0.9$, $\mu = 1$ per hour, $\beta = 12$ per hour, and $\lambda_s = \lambda/4$. We see that the higher the $\mu/\lambda$ ratio, the lower the downtime computed by the two models.

Availability models capture the failure and repair behavior of systems and their components. States of the underlying Markov chain are classified as up states or down states. The system is not available in state 2 and state 6. The system availability in the steady state is defined as follows:

$$\text{Availability} = 1 - \text{Unavailability} = 1 - (\pi_2 + \pi_6) \tag{43}$$

[Figure 3: Difference in downtime of the SMP model and the approximate CTMC model. Series: DT(ctmc), DT(smp). Vertical axis: Downtime, 0 to 40. Horizontal axis: 1 to 4.]

5.1 Validation of Closed-form Results

To verify the validity of our formula derivations, we compare the results obtained from the closed-form solution with the results obtained from the numerical solution by SHARPE. We found that the results agree.

[Figure 4: Downtime of the CTMC model. Series: DT(derivation), DT(SHARPE). Vertical axis: DT, 0 to 40. Horizontal axis: 1 to 4.]

[Figure 5: Downtime of the SMP model. Series: DT(SHARPE), DT(derivation). Vertical axis: 0 to 35. Horizontal axis: 1 to 4.]
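The comparison between the two models can be reproduced in outline with a short script. This is a sketch under stated assumptions, not the authors' SHARPE model: it solves both chains numerically rather than through the closed forms, treats states 2 and 6 as the down states, and converts unavailability to minutes per year assuming a 365-day year. All function names, the choice of T = 1 hour and the values of the active-unit failure rate `lam` are hypothetical; only c, c_s, the restoration and reboot rates and the relation lam_s = lam/4 follow the values quoted in the text.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def stationary(w):
    """Balance 'flow in = flow out' for weights w[i][j], normalised to sum 1.

    With transition rates this is the CTMC steady state; with transition
    probabilities (rows summing to 1) it is the solution of v = vP.
    """
    n = len(w)
    a = [[w[i][j] for i in range(n)] for j in range(n)]
    for j in range(n):
        a[j][j] -= sum(w[j])
    a[-1] = [1.0] * n  # replace one redundant equation by the normalisation
    return solve_linear(a, [0.0] * (n - 1) + [1.0])


def downtime_ctmc(lam, lam_s, mu, beta, c, c_s, T):
    """Downtime (minutes/year) with the diagnostic approximated by rate 2/T."""
    d = 2.0 / T
    rates = [
        [0, lam * (1 - c), lam * c, lam_s * c_s, lam_s * (1 - c_s), 0],
        [0, 0, beta, 0, 0, lam_s],
        [mu, 0, 0, 0, 0, lam_s],
        [mu, 0, 0, 0, 0, lam],
        [0, 0, 0, d, 0, lam],
        [0, 0, mu, mu, 0, 0],
    ]
    p = stationary(rates)
    return (p[1] + p[5]) * MINUTES_PER_YEAR


def downtime_smp(lam, lam_s, mu, beta, c, c_s, T):
    """Downtime (minutes/year) with the diagnostic time uniform on [0, T]."""
    q = -math.expm1(-lam * T) / (lam * T)  # P(X > Y)
    a = lam + lam_s
    P = [
        [0, lam * (1 - c) / a, lam * c / a, lam_s * c_s / a, lam_s * (1 - c_s) / a, 0],
        [0, 0, beta / (beta + lam_s), 0, 0, lam_s / (beta + lam_s)],
        [mu / (lam_s + mu), 0, 0, 0, 0, lam_s / (lam_s + mu)],
        [mu / (lam + mu), 0, 0, 0, 0, lam / (lam + mu)],
        [0, 0, 0, q, 0, 1 - q],
        [0, 0, 0.5, 0.5, 0, 0],
    ]
    v = stationary(P)
    h = [1 / a, 1 / (beta + lam_s), 1 / (lam_s + mu), 1 / (lam + mu),
         1 / lam + math.expm1(-lam * T) / (lam ** 2 * T), 1 / (2 * mu)]
    weights = [vi * hi for vi, hi in zip(v, h)]  # numerators of pi_i
    return (weights[1] + weights[5]) / sum(weights) * MINUTES_PER_YEAR


for lam in (0.0005, 0.001, 0.002, 0.004):  # hypothetical failure rates per hour
    args = dict(lam=lam, lam_s=lam / 4, mu=1.0, beta=12.0, c=0.9, c_s=0.9, T=1.0)
    print(lam, downtime_ctmc(**args), downtime_smp(**args))
```

A useful property of this setup is that when c_s = 1 the undetected-failure state is never entered, every remaining sojourn time is exponential, and the two models must then give the same downtime.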
6. Conclusion

Organizations today face a tough challenge in choosing an appropriate high availability solution that meets their business requirements and IT budgets. To meet this requirement, organizations must provide both high availability and disaster recovery. High availability systems require fewer failures and faster repair. In this paper we presented a high availability cluster with failover for disaster events. We presented a Markov model and expressed availability and downtime in terms of the parameters of the model. We evaluated the feasibility of our clustering model using the SHARPE tool.

References

[1] D. Clitherow, M. Brookbanks, N. Clayton, and G. Spear, "Combining High Availability and Disaster Recovery Solutions for Critical IT Environments", IBM Systems Journal 47, No. 4, 563-575 (2008).
[2] D. Chen, K. S. Trivedi, "Optimization for condition-based maintenance with semi-Markov decision process". Available online at www.sciencedirect.com
[3] R. Gamache, R. Short, and M. Massa, "Windows NT Clustering Service", IEEE Computer, October 1998, pp. 55-61.
[4] C. Hirel, R. A. Sahner, X. Zang, K. S. Trivedi, "Reliability and Performability Modeling using SHARPE 2000", Computer Performance Evaluation/TOOLS 2000, Lecture Notes in Computer Science, Vol. 1786, Springer-Verlag, 2000, pp. 345-349.
[5] S. W. Hunter and W. E. Smith, "Availability Modeling and Analysis of a Two Node Cluster", Proceedings of the 5th International Conference on Information Systems, Analysis and Synthesis, Orlando, FL, October 1999.
[6] Th. Lumpp, J. Schneider, J. Holtz, M. Mueller, N. Lenz, A. Biazetti, and D. Petersen, "From High Availability and Disaster Recovery to Business Continuity Solutions", IBM Systems Journal 47, No. 4, 605-619.
[7] M. Malhotra, A. Reibman, "Selecting and Implementing Phase Approximations for Semi-Markov Models", Volume 9, Issue 4, 1993, pp. 473-506.
[8] G. F. Pfister, In Search of Clusters: The Coming Battle in Lowly Parallel Computing, Prentice Hall, Englewood Cliffs, NJ, 1998.
[9] H. Song, C. Leangsuksun, R. Nassar, "Availability Modeling and Evaluation on High Performance Cluster Computing Systems", Journal of Research and Practice in Information Technology, Vol. 38, No. 4, November 2006.
[10] K. S. Trivedi, R. Vasireddy, D. Trindade, S. Nathan, and R. Castro, "Modeling high availability systems", Proc. Pacific Rim Dependability Conference, 2006.
[11] K. S. Trivedi, Probability and Statistics with Reliability, Queuing, and Computer Science Applications, John Wiley and Sons, 2002.
[12] M. Wiboonrat, "Transformation of System Failure Life Cycle", International Journal of Management Science and Engineering Management, Vol. 4 (2008) No. 2, pp. 143-152.