An Algorithm Providing Fault-Tolerance for Layered Distributed Systems

M. Taghelit, S. Haddad and P. Sens
Université Pierre et Marie Curie, C.N.R.S. MASI
4, Place Jussieu, 75252 Paris Cedex 05, France
e-mail: [email protected]

Abstract
This paper presents a new approach for fault-tolerance in layered distributed systems. We develop the dynamic regeneration method, in which faulty software components are dynamically regenerated. In contrast to other techniques which duplicate critical components, this method does not increase the complexity of the system and tolerates an unlimited number of failures. To apply this technique, we develop a method for the design of reliable systems. We transform an initial software architecture into a model which encloses homogeneous elements. Our study relies on the OSI standard model defined within the ISO organization. So, from an initial layered distributed architecture, we exhibit a homogeneous communication chain. The building and the maintenance of this chain in an unreliable environment is achieved by an algorithm using dynamic regeneration.

Keywords: fault-tolerance, dynamic regeneration, distributed systems, layered systems, end-to-end communication.

1. Introduction
Distributed systems provide new opportunities for developing high-performance applications; at the same time, because of the dependency between the components, such systems are particularly fragile: one component failure may imply the failure of the whole system. So it is essential to have a fault-tolerant management. Several methods based on redundancy techniques exist [7][1]. These methods prevent the failure of critical elements by duplicating them in several copies. There are two basic kinds of redundancy [6]: the passive one, where copies run only if the element fails, and the active one, where all copies run in parallel with the element. These techniques imply a great overhead on the system and tolerate a limited number of failures (this number is proportional to the number of copies).

We propose a new solution that offers an optimum degree of tolerance for a low cost: the dynamic regeneration. Instead of duplicating the elements, they are regenerated in case of failure.

From the OSI standard model defined within the ISO organization, we generalize this kind of architecture to any communication chain. A communication chain allows a dialogue between an initial and a terminal entity through inner ones which relay messages.

The aim of the proposed algorithm is to build and preserve the communication chain in an unreliable environment. The entities of the chain are dynamically generated, unlike the other techniques where elements have a static existence. When a failure of an entity is detected, we regenerate a new entity and integrate it into the chain.

In the first part, we describe how to obtain a communication chain from the OSI model. In the second part, we outline the different methods providing fault-tolerance and we introduce the principle of the dynamic regeneration. The third part informally presents the algorithm and the fourth one describes it. Finally, the last part shows the correctness of the algorithm.

2. Functional scheme for layered distributed systems
The ISO organization defines an architecture for communicating open systems (the OSI standard) [3]. This hierarchical model is composed of seven layers. Each layer has a specific functionality and the communication between two layers of the same level is done through the lower layers. Although the OSI standard does not impose a specific localization of the layers, current systems locally implement all of them.

Figure 1: The OSI model

We would like to make this kind of architecture reliable. For that, we modify the basic functional scheme. The first step consists in unlayering the architecture. The notion of layer disappears and this new architecture, even though identical to the old one, makes a chain composed of communicating entities. This unlayering shows three kinds of entities: the initial entity at the beginning of the chain, the ending entity at the end of the chain and the inner entities, each of which has a predecessor and a successor. All functionalities of one layer are included in an entity, but we generalize the initial model by making no assumption about the localization.

Figure 2: Unlayering of the model

The global architecture being defined, we will proceed to the functional division of entities. All the entities have three functionalities: the receipt and the sending of messages, which are homogeneous for all the entities, and the processing, which is heterogeneous. We bring together these functions in two specialized modules: one communication module that carries out the communication tasks and another one, the processing module, that carries out the processing. The communication module becomes the communication interface between an entity and the others.

Figure 3: Functional division

Communication modules (CM) compose a homogeneous communication chain, since each module has the same functionality (sending/receipt). The processing modules remain heterogeneous and each of them locally communicates with the communication module of the same entity. Thus, afterwards we will pay attention to ensuring communication between entities, and this is done by preserving the communication chain in case of failures.

Note that the modularity of an entity implies a limited propagation of the failures. The failure of the processing module does not mean the breaking of communication. Moreover, the separation of functionalities allows a more precise diagnosis of failures and thus a more efficient recovery.

Figure 5: Communication chain

3. Tolerance by dynamic regeneration
Generally, techniques providing fault-tolerance use redundancy methods [7]. These methods duplicate a basic element in several copies. In this way, they prevent the failure of the basic element, called the primary.

There are two kinds of redundancy techniques: the active redundancy, where copies are active in parallel with the primary element, and the passive redundancy, where copies are active only if the primary element fails.

3.1. Active redundancy
Each copy receives the same input and does the same treatment. Only one result is taken into account [8]. To achieve this, two different methods are currently used.

In the first one, a vote mechanism chooses one result from among all of those produced by the redundant elements. A disagreement between elements provides a failure detection. The simple majority is the most used vote technique [6].

Figure 6: The vote mechanism

In the second method, only the result of the primary element is considered. If the primary module fails, it is replaced by one of its copies. The recovery is very simple: it only consists in taking new results on the output of the chosen copy.

The vote technique allows a limited degree of tolerance (i.e. the number of element failures): for a number of n redundant elements, a majority vote tolerates only n/2 failures. Moreover, the vote component can itself fail. This kind of failure is catastrophic, and a replication of the vote component is recommended.

The second technique allows a greater degree of tolerance than the vote. For n redundant elements, it tolerates n-1 failures. But this method does not prevent a faulty behaviour of the primary element. Contrary to the vote technique, erroneous results produced by the primary element will be taken into account.

These methods lead to a great overhead of treatment. If the degree of redundancy is n (i.e. there are n redundant elements), there must be n treatments for the production of one output. Moreover, each redundant element has a failure probability and so the global failure probability is greater. But active redundancy provides a fast failure recovery, because it requires no state restoration (all the time, copies do the same process and are in the same state). This method is currently used for reliable real-time systems where the recovery time is bounded [10].
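The vote mechanism of the first method can be illustrated by a minimal sketch. The function name `majority_vote` and the representation of results as plain Python values are assumptions made for illustration; the paper does not prescribe an implementation.

```python
from collections import Counter


def majority_vote(results):
    """Return (value, disagreement) for the outputs of n redundant copies.

    value is the result produced by a strict majority of the copies, or
    None if no strict majority exists. disagreement is True as soon as
    two copies differ, which serves as a failure detection.
    """
    if not results:
        raise ValueError("at least one result is required")
    counts = Counter(results)
    value, votes = counts.most_common(1)[0]
    disagreement = len(counts) > 1
    if votes * 2 <= len(results):
        # No strict majority: the vote cannot mask the failures.
        return None, disagreement
    return value, disagreement
```

With three copies, `majority_vote([4, 4, 5])` masks the single faulty copy and still reports the disagreement, whereas two faulty copies out of three can no longer be masked.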

3.2. Passive redundancy
A redundant element can only start running when the primary element is failing. All the time, only one element is active.

Figure 8: Passive redundancy

When a copy replaces the faulty primary element, a previous state of the primary element must be restored. Passive redundancy therefore also needs a checkpoint management. This kind of technique is expensive in processing time and space. Moreover, like the active redundancy, the failure probability of redundant modules is non-null. Generally, this technique has a lower cost than active redundancy and it is often preferred to this one [9].

The main drawbacks of redundancy techniques are the increase of the system's complexity and the limited fault-tolerance degree. We propose a new technique that solves these problems while taking advantage of the characteristics of our architecture and of the functional division: the dynamic regeneration.

3.3. Principle of the dynamic regeneration
The entities are no longer duplicated but generated dynamically after their failure. The generated element has no prior existence and is integrated into the system only after a failure. This element has the same functionality as the faulty element. This approach implies the following points:

• The system has as many elements as a fault-free system. So, it is less complex than a system using redundancy techniques.

• The overhead is only related to the fault detection.

• The fault-tolerance degree is not limited.

• The generated elements have no localization constraint, unlike the redundant elements which have a fixed localization. A new module may be dynamically generated on a non-faulty host.

More precisely, the treatment of a failure is composed of three steps: the detection, the regeneration of a new element and its insertion in the chain.

Figure 9: The dynamic regeneration

Each module has to supervise the one it has generated (its successor) and to regenerate it in case of failure. This ensures a decentralized control which answers our tolerance and performance requirements.

4. Implementation of the dynamic regeneration
4.1. Targets of the algorithm

• The building of the communication chain

The building of the chain consists in generating a finite set of communication modules such that each inner module has a predecessor and a successor. Unlike the other systems where the elements are static, the modules are generated dynamically. On the other hand, an initialization step is required. This step consists in generating the first module, which is done on the user's initiative.

• Maintenance of the communication chain in an unreliable environment

If one or several modules fail, the chain breaks and no communication between the initial and ending entities is possible. To keep end-to-end communication, it is necessary to replace the faulty modules. This is carried out by the dynamic regeneration of the faulty modules.

4.2. Principles of the algorithm
We want to construct and keep a chain of N modules, each module communicating with the adjacent modules. There are four principles:

• Activation principle: Each non-ending module carries on the building of the chain.

Each module with no successor and which is not the Nth one generates a successor. This allows the building of the chain to be carried on if it is not complete and the chain to be rebuilt if modules fail.

• Knowledge principle: Each module knows all its successors.

When a module fails, it is regenerated according to the activation principle. Then, this new module needs to be attached to the predecessor of the faulty module (the one which generated it) and to its successor. The regenerating module must know its two immediate successors so as to give to the regenerated module the information necessary to attach itself to the chain. But that is not enough in case of multiple failures of adjacent modules. In that case, each regenerated module has to regenerate a successor until a valid successor has been found. So, to restore the chain whatever the number of faulty adjacent modules, each module has to know all its successors.

If the regenerated module is the initial one, it cannot receive information about its successors from its predecessor, since the latter does not exist. In order to avoid the building of a new chain, the initial module must keep in stable memory the information about its successors. This information will be communicated to it if it is regenerated after a failure.

• Purge principle: Each non-initial module with no predecessor destroys itself after a finite time.

When the predecessor of a module fails, this module must be eliminated if no new module has replaced the predecessor after a finite time. In this way, we avoid the creation of "parasite" sub-chains.

• Suspicion principle: The validity of the chain is periodically tested.

Periodically, each module supervises its successor. This mechanism allows faulty modules to be detected.

More precisely, the detection is based on an acknowledgement mechanism. Each module periodically transmits a control message to its successor so as to verify that it is still valid. If after a while t no answer is given, the successor is considered faulty. This mechanism can bring false detections, whose rate is proportional to t.
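The acknowledgement-based detection of the suspicion principle can be sketched as follows. The class name `SuccessorMonitor`, the explicit `now` arguments (used instead of a real clock so that the sketch stays deterministic) and the time unit are assumptions made for illustration only.

```python
class SuccessorMonitor:
    """Tracks acknowledgements from a successor and flags it after a delay t."""

    def __init__(self, timeout, now=0.0):
        self.timeout = timeout   # the delay t of the suspicion principle
        self.last_ack = now      # time of the last acknowledgement received

    def on_ack(self, now):
        """Record that the successor answered a control message at time now."""
        self.last_ack = now

    def is_suspected(self, now):
        """True if no acknowledgement arrived during the last timeout units."""
        return now - self.last_ack > self.timeout
```

A successor that is alive but answers later than the timeout is suspected as well, which is the false detection case discussed above.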

4.3. A few executions

• The chain building

The activation principle allows the chain building. The building of a chain of length N is done in N steps. Each step corresponds to the generation of a new module and its joining to the chain already built. This joining is achieved when all its predecessors know its existence. The new module communicates its identity to its predecessor (the generating module), which sends it back to its own predecessor and so on until the initial module receives it. Then, the initial module transmits an acknowledgement which is retransmitted from successor to successor until it arrives at the new module. It is only at this moment that the new module is joined to the chain. This process is repeated until the joining of the N-th module.

• Chain maintenance in case of failures

The principles of suspicion, activation and knowledge allow the chain maintenance. Assume that two modules, CMi and CMi+1, fail at the same time. The CMi-1 module detects the CMi module failure by the suspicion principle and regenerates a substitute module CMi' by the activation principle. The substitute CMi' module detects the CMi+1 failure and follows the same process, that is, the regeneration of the substitute CMi+1' module to which it gives its own identity and those of the CMi+2 to CMN modules according to the knowledge principle. Then, the CMi+1' module connects itself to the CMi+2 module and thus restores the chain. An example of this situation is shown in figure 10.

Figure 10: Recovery from multiple failures

With the knowledge principle, it is possible to restore a communication chain, whatever the number of faulty modules. The failure of the first communication module (CM1 module) is particular, because its detection and regeneration are done by the user.

• Creation and destruction of a parasite sub-chain

We saw that the probability of false detection is non-null. In case of false detection, a new module is regenerated and will be reconnected to the successor of the module supposed to be faulty. The supposed faulty module becomes isolated (with no successor and no predecessor). It constitutes a sub-chain which can grow by following the activation principle. Such a case is shown in figure 11.

Figure 11: Creation of a sub-chain from an isolated module (CM3)

This situation is not the only one where a sub-chain is created. Several adjacent modules which cannot be accessed can also constitute a sub-chain. The destruction of a parasite sub-chain is achieved by the purge principle.
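The recovery from several adjacent failures can be illustrated by a deliberately simplified, sequential sketch. The function `recover`, the representation of the chain as a list of identities and the `next_id` iterator are assumptions made for illustration; the message exchanges and the state changes of the real algorithm are not modelled here.

```python
from itertools import count


def recover(chain, failed, next_id):
    """Rebuild a chain after failures, following the knowledge principle.

    chain   -- list of module identities, chain[0] being the initial module
    failed  -- set of identities of failed modules (initial module excluded)
    next_id -- iterator yielding fresh identities for substitute modules
    """
    if chain[0] in failed:
        raise ValueError("the initial module is regenerated by the user")
    repaired = [chain[0]]
    for position in range(1, len(chain)):
        module = chain[position]
        if module in failed:
            # The predecessor (repaired[-1]) detects the failure and generates
            # a substitute at the same level; the substitute inherits the
            # knowledge vector chain[position + 1:] and repeats the check
            # on its own successor.
            module = next(next_id)
        repaired.append(module)
    return repaired
```

For instance, `recover([1, 2, 3, 4, 5], {2, 3}, count(100))` replaces the two adjacent faulty modules by substitutes 100 and 101, and the second substitute reconnects to module 4.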

4.4. Definition of the algorithm
4.4.1. Description of a module
The state transitions of a communication module are given by the reduced graph of figure 12. For the sake of simplicity, only the main transitions are described. In this graph, we specify neither the various recovery cases (during or after the complete building of the chain) nor the fact that a module can fail at any moment and thus in any state.

Figure 12: State transitions of a communication module

Non-existent: the module does not exist.
Generated: the CMi module has just been generated.
WaitConnPred: the CMi module gave its identity to the module which generated it (CMi-1 module) and waits for the acknowledgement message.
WaitConnSucc: the CMi module generated a CMi+1 module, to which it gave its identity, and waits for the reception of the CMi+1 identity.
Connected: the CMi module received the identity from the module it has generated (CMi+1 module).
WaitReConnPred: the CMi module detected its disconnection from its predecessor; it waits for a reconnection message.

Each module has a set of variables including its level in the chain and a knowledge vector which contains all its successors' identities.
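The six states and the per-module variables can be written down as a small data structure. This is a sketch only: the Python names (`State`, `CommunicationModule`, `learn_successor`) are assumptions, and the transitions of figure 12 are not encoded.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class State(Enum):
    NON_EXISTENT = auto()
    GENERATED = auto()
    WAIT_CONN_PRED = auto()
    WAIT_CONN_SUCC = auto()
    CONNECTED = auto()
    WAIT_RECONN_PRED = auto()


@dataclass
class CommunicationModule:
    identity: int
    level: int                      # position of the module in the chain (1..N)
    state: State = State.GENERATED  # a module starts in the Generated state
    knowledge: list = field(default_factory=list)  # identities of all successors

    def learn_successor(self, identity):
        """Append a newly announced successor to the knowledge vector."""
        if identity not in self.knowledge:
            self.knowledge.append(identity)
```

Keeping the knowledge vector ordered by level means that its first entry is always the immediate successor, which is what a regenerating module needs first.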

4.4.2. Description of message types
The various message types that are exchanged between communication modules are:

New: allows a newly generated module to be known by all its predecessors. Each predecessor which receives this message memorizes the identity of the newly generated module and transfers it to its own predecessor. When this message arrives at the first module of the chain, this module memorizes the identity in the stable memory and sends an acknowledgement message to its successor.
AckNew: is the acknowledgement message for the New message sent by the newly generated module. It is used to confirm to this module that it is actually known by all its predecessors.
Update: in case of recovery, this message is sent by the regenerated module in order to inform the successor that it is its new predecessor. This message is always taken into account and acknowledged.
AckUpdate: confirms to a regenerated module that its successor considers it as its new predecessor.

When a module is generated, or regenerated, it receives the identity and the knowledge vector from its predecessor (the generating module). When the initial module is generated, or regenerated, it receives the knowledge vector saved in stable memory.

This example shows the generation and the insertion of the fourth module. In the first step, the third module generates a new module and changes to the WaitConnSucc (WCS) state. The new module is in the Generated (G) state. In the second step, the new module sends its identity (New message) to its predecessor and changes to the WaitConnPred (WCP) state. The New message is retransmitted from predecessor to predecessor. In the last step, the initial module sends the acknowledgement of the New message (AckNew message), which is retransmitted from successor to successor. When the new module receives the AckNew message, it generates a successor and changes to the WaitConnSucc state. This process is repeated until the generation and insertion of the Nth module.

This second example shows the recovery of a faulty module. In the first step, the second module detects the failure of the third one. It regenerates a substitute module and changes to the WaitConnSucc (WCS) state. The fourth module detects its disconnection from the faulty module and changes to the WaitReConnPred (WRCP) state. The regenerated module is in the Generated state. In the second step, the substitute module is recognized by its predecessors, as in the above example, sends an Update message to its successor and changes to the WaitConnSucc state. In the last step, the fourth module receives the Update message, acknowledges it by sending an AckUpdate message and changes to the Connected state. When the substitute module receives the AckUpdate message, it changes to the Connected state.

4.4.3. Definition of rules
We give the various rules for the building of the chain, its rebuilding after the failure of particular modules and the elimination of sub-chains.

4.4.3.1. Rules of building
Initialization rule: The original chain creation demand comes from the user, who generates the first module (CM1 module). When this first module is generated (Generated state), it then also generates the second module, and so on.

Connection rule: When a module is generated, it simultaneously receives the identity of the module which generated it (except for the first module). A module that is in the Generated state sends its identity to its predecessor (the generating module) through the emission of a New message, and waits for an acknowledgement (WaitConnPred state). The identity of a communication module encloses its localization, the localization of the associated processing module, etc.

Knowledge rule: A module which receives its successor's identity from a newly generated module memorizes it and sends it back to its predecessor. If the receiver is the initial module of the chain, it saves the identity in stable memory and sends an acknowledgement message (AckNew message) to its successor. Moreover, if the receiver is in the WaitConnSucc state (it has no successor), then the message sender is considered as its successor and the receiver changes to the Connected state (it generated the sender of the message).

Generation rule: When a module CMi receives an AckNew message from its predecessor (CMi-1), it sends it to its successor (CMi+1). If CMi is the newly generated one (WaitConnPred state), two cases are possible:
• CMi is the Nth module of the chain (i.e. the last one): then it changes to the Connected state and the building is over;
• CMi is not the last of the chain (i ≠ N): two other cases can occur. Either it received the identity of CMi+1 (rebuilding step), and then it can connect itself to it and the rebuilding is over; or CMi did not receive the identity of CMi+1, and then it generates a successor and changes to the WaitConnSucc state (building step).

When a module detects its successor's failure, it only generates another module when it is connected to a predecessor.
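The building rules can be exercised on a failure-free run with a short simulation. It is a sketch under strong assumptions: identities are simply 1..n, messages are delivered instantly and in order, and the function `build_chain` as well as the hop counter are introduced here for illustration.

```python
def build_chain(n):
    """Build a chain of n modules; return (knowledge vectors, message hops).

    knowledge[i] lists the successors known by module i.
    """
    if n < 1:
        raise ValueError("the chain needs at least the initial module")
    knowledge = {1: []}          # Initialization rule: the user generates CM1
    hops = 0
    for new in range(2, n + 1):
        knowledge[new] = []      # module new-1 generates module new
        # Connection and Knowledge rules: the New message climbs from
        # predecessor to predecessor up to the initial module.
        for pred in range(new - 1, 0, -1):
            knowledge[pred].append(new)
            hops += 1
        # Generation rule: AckNew comes back down from successor to successor.
        hops += new - 1
    return knowledge, hops
```

In this simplified model, joining the j-th module costs 2(j-1) message hops, so building the whole chain costs N(N-1) hops, for example 12 hops for N = 4.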

4.4.3.2. Rules of maintenance
The rebuilding rules of the chain consist in starting the recovery process as soon as a failure is detected. Two cases must be taken into account: either the chain is already built, or it is in a building or rebuilding step.

Regeneration rule: When a CMi-1 module detects the failure of its successor, the CMi module, it replaces it by generating a CMi' substitute module and waits in the WaitConnSucc state. At the same time, it gives it its own coordinates and those of the CMi+1 to CMN modules, if they exist.

Update rule: A CMi module receiving an Update message from a CMj module always acknowledges it, and considers the CMj module as its new predecessor.

Purge rule: When a module that is disconnected from its predecessor does not receive a reconnection demand (Update message) during a particular while, it destroys itself. This rule allows the gradual elimination of the modules of a wrongly created sub-chain.

4.5. Verification of the algorithm

4.5.1. Model
The specification of a distributed application always requires the choice of a model. Two possibilities are mainly considered: either the programming language supposed to build the application, or a formal model such as C.C.S. [5] or Petri nets [2]. Though the first solution is more attractive for reasons of reduced cost, it does not include any proving method, which makes it unsuitable for complex systems that require a validity proof. With the second solution, some validation tools are available which allow the formal proving of the system correctness.

We choose a formal model, more general than the ones mentioned above and highly inspired from [4]: the so-called event model. It is worthwhile to choose a general model rather than others which are more specific.

4.5.1.1. Definition of transition systems
The event model describes a system with the totality of its states and the set of events that make it change from one state to another.

Definition 1: A transition system $S$ is a pair $(E, R)$ where:
• $E$ is the set of system states.
• $R$ is the set of system rules, where for each $r \in R$:
  . $r.p : E \to \{\mathrm{true}, \mathrm{false}\}$ is a predicate which defines the guard
  . $r.a : E \to E$ is a function which defines the action

More precisely, $r.p$ is a predicate on a system state which takes the true value if the $r.a$ action is possible in that state, and the false one if the action is not possible in that state. Thus, $R$ defines the system behaviour and a step of the system is defined by:

$e \xrightarrow{r} e'$ if and only if $r.p(e)$ and $e' = r.a(e)$

Such a model is called an event model, for the event is the occurrence of an action. Then, an event makes the system change from one state to another.
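Definition 1 translates almost literally into code. The sketch below is illustrative only: the names `Rule` and `steps` and the toy counter system are assumptions, with `guard` standing for r.p and `action` for r.a.

```python
from dataclasses import dataclass
from typing import Callable, Hashable


@dataclass(frozen=True)
class Rule:
    name: str
    guard: Callable[[Hashable], bool]       # r.p
    action: Callable[[Hashable], Hashable]  # r.a


def steps(state, rules):
    """Return every (rule name, next state) reachable from state in one step."""
    return [(r.name, r.action(state)) for r in rules if r.guard(state)]


# Toy system: a counter that may be incremented while it is below 3.
RULES = [Rule("inc", lambda e: e < 3, lambda e: e + 1)]
```

A state in which no guard holds, such as 3 in the toy system, has no outgoing step.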

4.5.1.2. Proving techniques on the model
System states and properties may be expressed as predicates. The proof method based on the capture of system states and properties by predicates, and which is adapted to our model, is called the assertions-oriented method.

We are often interested in ensuring that if a system verifies a property at a given moment or in a given state, this property is retained whatever the system evolution. Predicates which express this kind of property are called invariants.

Given a transition system, we generally study the behaviour of the system when started in certain initial states. It is, of course, unlikely that we are interested in considering all elements of $E$ as possible initial states. When a system starts in $e_0$, the only interesting states are those reached by the system from $e_0$.

Definition 2: Let $S = (E, R)$ be a transition system and $e_0 \in E$. We define the function $Acc$ over $E$ as follows:
(i) $e_0 \in Acc(e_0)$
(ii) if $e \in Acc(e_0)$ and $e \xrightarrow{r} e'$ then $e' \in Acc(e_0)$
$Acc(e)$ is the set of all states reachable from $e$.

Definition 3: Let $S = (E, R)$ be a transition system and $e_0 \in E$. A predicate $P$ is said to be $e_0$-invariant if for each $e \in Acc(e_0)$, $P(e)$ is true.

To prove that a predicate $P$ is $e_0$-invariant requires proving that $P$ is true for all states of $Acc(e_0)$. This is tedious for complex systems and impossible when $Acc(e_0)$ is infinite. Following Keller [4], we propose a more restrictive concept, namely induction.

Definition 4: Let $S = (E, R)$ be a transition system and $e_0 \in E$. A predicate $P$ is said to be $e_0$-inductive provided that:
(i) $P(e_0)$, and
(ii) $\forall e, e' \in E$, $P(e)$ AND $e \xrightarrow{r} e'$ implies $P(e')$

The power which lies in the induction principle is that it does not require a complete characterization of reachability in order to demonstrate that a predicate is invariant. Proving that a predicate remains true after every action which makes the system change from one state to another is enough to conclude that this predicate is invariant. Thus, we just have to study the initial state and the rules.

It is not a restriction to consider an inductive predicate instead of an invariant predicate, because the first one is obtained by strengthening the second one and includes it.

Proposition 1 [4]: Not every invariant is an inductive invariant, but any invariant has an inductive invariant which implies it.
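The difference between an invariant and an inductive predicate can be checked mechanically on a small finite system. The sketch below is illustrative: the functions `acc`, `is_invariant` and `is_inductive`, the encoding of rules as (guard, action) pairs and the toy system are assumptions, and the exhaustive enumeration only works when the state set is finite.

```python
def acc(e0, rules):
    """Return Acc(e0), assuming it is finite; rules are (guard, action) pairs."""
    reached, frontier = {e0}, [e0]
    while frontier:
        e = frontier.pop()
        for guard, action in rules:
            if guard(e):
                nxt = action(e)
                if nxt not in reached:
                    reached.add(nxt)
                    frontier.append(nxt)
    return reached


def is_invariant(p, e0, rules):
    """P is e0-invariant: it holds in every reachable state."""
    return all(p(e) for e in acc(e0, rules))


def is_inductive(p, e0, rules, states):
    """P is e0-inductive: P(e0), and every step from a P-state stays in P."""
    return p(e0) and all(
        p(action(e))
        for e in states if p(e)
        for guard, action in rules if guard(e)
    )


# Toy system over E = {0, ..., 9}: the single rule adds 2 modulo 10.
E = range(10)
RULES = [(lambda e: True, lambda e: (e + 2) % 10)]
```

Starting from 0, the predicate "e ≠ 5" is invariant but not inductive, because the unreachable state 3 satisfies it and steps to 5. The stronger predicate "e is even" is inductive and implies it, which is the situation described by Proposition 1.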

The discussion above has involved the use of the induction principle to show that certain properties always hold. However, there are other conditions which might be required before a system can be considered correct.

In some cases, it is desirable that a system always terminates for certain initial values. This is the case with a system designed to complete a specific task. On the other hand, many systems are designed not to terminate, or to terminate only in abnormal situations.

Our algorithm matches the first case. The chain building must terminate when N modules are generated and each of them is in its "Connected" state. This constitutes a home state, noted $e_H$, for the system. However, it is not obvious to show that, starting from an initial state, the system will inevitably reach its home state. Hence we use another method for proving termination.

Definition 5: Let $S = (E, R)$ be a transition system and $e_0 \in E$. We say that a function $\eta : E \to \omega$ (any well-ordered set will do in place of $\omega$) is an $R'$-norm ($R' \subset R$) with minimal state $e_H$ provided that:
(i) $\eta(e)$ is minimal iff $e = e_H$
(ii) $\forall e \in Acc(e_0)$, $\eta(e)$ is not minimal
$\forall e \in E$, $\forall r \in R'$, $e \xrightarrow{r} e'$ implies $\eta(e') < \eta(e)$

$R'$ is a well-founded subset of $R$ such that each rule of $R'$ decreases $\eta$. To prove the termination, we associate an $R'$-norm function $\eta$ with the system such that $\eta$ decreases each time an action is executed by the system and is minimal for the state we wish the system to reach. This means that no matter what state the system goes to, it must always get to its home state.

4.5.2. Modelling
4.5.2.1. Problems related to the model
Time does not exist in the event model. Thus, it is tricky to model events whose occurrences obey temporal constraints. Scheduling may also be tedious to achieve. The non-existence of time sometimes implies uncontrolled event occurrences which express conditions easily avoidable in the real system. The time abstraction is not an inconvenience, since the model allows the description of a behaviour which includes the real system behaviour.

4.5.2.2. Modelling choice
To model our system we choose an approach based on the characterization of its states. The communication chain is composed of a set of modules and a communication channel. The module state is defined by the values of its variables, and the channel state is defined by the set of messages sent but not yet received. The system state is then defined by a combination of the states of all the modules and of the channel. In this way, we can study the system by means of its states and the actions which make it change from one state to another. An instance of an action is called an event, and each sequence of events expresses the system behaviour.

We assume that the system is composed of an infinite array (mod[]) of modules. Each module is indexed with its identity, which designates it in a unique manner. We give the structure components of a module which are used to prove the correctness of the algorithm in the next section.

mod[Id]: array of modules indexed with Id
  Id: Integer ∪ {-1} is the module identity
  MLev: [1, ..., n] ∪ {-1} expresses the module level in the chain
  Pred: Id ∪ {-1} is its predecessor's identity
  Succ: Id ∪ {-1} is its successor's identity
  State: [Non-existent, Generated, WaitConnPred, WaitConnSucc, Connected, WaitReConnPred]

mod[Id] is the module indexed by its identity Id. When a component is not defined, we give it the -1 value. We use a global variable Next to allocate a unique identity for each module.

Initially, all the modules have their State component equal to Non-existent and Next is equal to one (1).

The channel is modelled by a multi-set "Channel". The different types and contents of the exchanged messages are given in the following. We give only the messages that are used to prove the correctness of the algorithm in the next section.

New = (type = New, Sender, receiver, source-sender, Level)
AckNew = (type = AckNew, Sender, receiver, final-receiver)

Two actions are defined on the messages:
• Channel := Channel - m, expresses the reception of the message m.
• Channel := Channel + (type, comp1, comp2, ...), expresses the emission of the message (type, comp1, comp2, ...).
Moreover,
• Channel ≥ m, expresses that there is at least one message m in the channel.
• Channel = ∅, expresses that the channel is empty.
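The multi-set "Channel" and its four operations map directly onto a counting container. The sketch below uses `collections.Counter` from the Python standard library; the class name `Channel`, its method names and the encoding of messages as tuples are assumptions made for illustration.

```python
from collections import Counter


class Channel:
    """Multi-set of messages that were sent but not yet received."""

    def __init__(self):
        self._messages = Counter()

    def emit(self, message):
        """Channel := Channel + m"""
        self._messages[message] += 1

    def receive(self, message):
        """Channel := Channel - m (requires Channel >= m)"""
        if not self.contains(message):
            raise KeyError(f"message not in channel: {message!r}")
        self._messages[message] -= 1
        if self._messages[message] == 0:
            del self._messages[message]

    def contains(self, message):
        """Channel >= m"""
        return self._messages[message] > 0

    def is_empty(self):
        """Channel = empty set"""
        return not self._messages
```

A multi-set rather than a set is needed because the same message may be in transit more than once, for example when it is emitted twice before the first copy is received.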

4.5.2.3. Proof
The assertions-oriented method relies on predicates that often express system global variables. In this way, the user has a general view of the system and can then better master its evolution in case of event instances.

For the sake of simplicity and for lack of space, we give the correctness of the algorithm for the chain building in a reliable environment.

At each time, in the chain building step we have:

$I_0$: $\exists\, 0 \le k \le N$ modules, $\forall i$, $i \le k \Leftrightarrow \mathrm{mod}[i].\mathrm{state} \ne \text{Non-existent}$

which expresses that $k$ modules have been generated. $N$ is the length of the chain we want to build.

The invariant $I$, which expresses all the states stemming from the algorithm execution, i.e. the states of the $k$ modules and of the channel, is defined as follows:

$I = I_0$ AND ($I_{deb}$ OR $I_1$ OR $I_2$ OR $I_3$ OR $I_4$ OR $I_5$)

where

$I_{deb}$: $k = 0$ (the building is not started)

$I_1$: $\forall i \in [1, k-2]$, $\mathrm{mod}[i].\mathrm{state} = \mathrm{Connected}$; $\mathrm{mod}[k-1].\mathrm{state} = \mathrm{WaitConnSucc}$; $\mathrm{mod}[k].\mathrm{state} = \mathrm{Generated}$

$I_2$: $\forall i \in [1, k-2]$, $\mathrm{mod}[i].\mathrm{state} = \mathrm{Connected}$; $\mathrm{mod}[k-1].\mathrm{state} = \mathrm{WaitConnSucc}$; $\mathrm{mod}[k].\mathrm{state} = \mathrm{WaitConnPred}$; $\mathrm{Channel} = m$, $m.\mathrm{type} = \mathrm{New}$

$I_3$: $\forall i \in [1, k-1]$, $\mathrm{mod}[i].\mathrm{state} = \mathrm{Connected}$; $\mathrm{mod}[k].\mathrm{state} = \mathrm{WaitConnPred}$; $\mathrm{Channel} = m$, $m.\mathrm{type} = \mathrm{New}$

$I_4$: $\forall i \in [1, k-1]$, $\mathrm{mod}[i].\mathrm{state} = \mathrm{Connected}$; $\mathrm{mod}[k].\mathrm{state} = \mathrm{WaitConnPred}$; $\mathrm{Channel} = m$, $m.\mathrm{type} = \mathrm{AckNew}$

$I_5$: $\forall i \in [1, k]$, $\mathrm{mod}[i].\mathrm{state} = \mathrm{Connected}$; $\mathrm{Channel} = \emptyset$; $k = N$

Informally, the $I_{deb}$ predicate takes into account the system states for which the chain building has not started. This predicate becomes and remains false as soon as the initialization starts. The last predicate $I_5$ expresses the termination condition of the building. The other predicates take into account the situations where the building is still in progress.

We now show that the previously defined invariant remains true whatever the action that can change the state of the system. For that, we systematically study all the actions which can alter $I$. We consider one by one each predicate of $I$ and establish that if an action of the system alters the considered predicate, then one of the other predicates of $I$ becomes true.

Let $I_1$ be true. The only possible event is the sending of a New message by the k-th module. Two cases are possible. If the k-th module is the initial one ($k = 1$), then it generates a successor and the predicate remains true. Otherwise ($k \ne 1$), it sends its identity (included in a New message) to its predecessor and changes to the WaitConnPred state. This alters $I_1$ but makes $I_2$ true.

We proceed in the same manner with the other predicates to demonstrate that $I$ is an inductive invariant, since we take into account all the possible rules of the system.

5. Conclusion
We have presented an algorithm providing fault-tolerance for layered distributed systems. From the OSI model, we have considered a communication chain. Our algorithm ensures the building and the preserving of the chain in an unreliable environment. This is achieved by introducing the dynamic regeneration of faulty elements. In contrast to other methods, the dynamic regeneration method tolerates an unlimited number of failures with a smaller overhead. Naturally, this technique is only applicable to software architectures. The correctness of the algorithm is formally proved.

As a prospect, we consider the generalization of the dynamic regeneration to any architecture where each component has one predecessor and one or more successors. This kind of architecture includes rings, trees and so on, which are the most often used.

References
[1] A. Avizienis, The N-version Approach to Fault-Tolerant Software, IEEE Transactions on Software Engineering, vol. n°12, December 1985.

[2] G.W. BRAMS (collective name), Réseaux de Petri: Théorie et pratique, Edited by Masson, Vol. 1 and 2, Paris, 1982 and 1983.

[3] J. Henshall, S. Shaw, OSI Explained: End-to-End Computer Standards, Second Edition, Ellis Horwood Limited, 1990.

[4] R.M. Keller, Formal Verification of Parallel Programs, Communications of the ACM, July 1976, Vol. 19, No. 7, pp. 371-384.

[5] R. Milner, Communication and Concurrency, Edited by Prentice Hall, 1989.

[6] V.P. Nelson, Fault-Tolerant Computing: Fundamental Concepts, COMPUTER, July 1990.

[7] B. Randell, Design Fault Tolerance, The Evolution of Fault-Tolerant Computing, A. Avizienis, H. Kopetz, J-C. Laprie, Edited by Springer-Verlag, 1987, Vol. 1, pp. 251-270.

[8] S.S.B. Shi, G.G. Belford, Consistent Replicated Transactions, A Highly Reliable Program Execution Environment, Eighth Symposium on Reliable Distributed Systems, Seattle, Washington, October 1989, pp. 30-41.

[9] N.A. Speirs, P.A. Barett, Using Passive Replicates in DELTA-4 to Provide Dependable Distributed Computing, 19th Int. Symposium on Fault-Tolerant Computing Systems, 1989.

[10] P. Thambidurai, K.S. Trivedi, Transient Overloads in Fault-Tolerant Real-Time Systems, Real Time Systems Symposium, Santa Monica, California, 1989, pp. 126-133.