soft-upgrade vs failover and back to/from 2nd-ary system


soft-upgrade vs failover and back to/from 2nd-ary system

Reto Kramer-2
I very much admire the soft-upgrade approach in Erlang, and OTP's
support for coordinating it, but I am puzzled as to why it had to be
invented. This must be due to my ignorance of the context of the
problem (and the protocols involved), I'm sure - help me out please!

Context: Imagine a system that requires two nodes for fault tolerance.
Each node must be able to take over the other node's traffic (and state
if protocols are stateful) at any one point to handle the fault of one
of them.

For such architectures, system upgrade can be performed by artificially
evacuating a node, restarting it (VM process) with the new version of
the code and rebalancing the traffic. This works really well if the
protocols used to talk to these nodes support some form of redirection
(either in the sender process, or in an intermediary such as a load
balancer for http traffic).

Q: When does it not work well?

Q: Are there guidelines as to when I should invest in writing
soft-upgradable code, rather than getting away with the above brute
force approach to system upgrade?

Q: Many systems that run Erlang do indeed contain redundant CPU boards
(or multiple machines). Is there an easy way to characterize why the
brute force upgrade approach did not work in those systems (AXD 301
comes to mind of course) and the soft-upgrade approach had to be
invented?

I could not find guidelines for when to use brute-force upgrade in a
dual node system vs soft-upgrade in the documentation or papers
(comparing the two in general terms, or specific examples of pro/cons)
- can anyone point me at material?  I fear the answer must be obvious
or trivial, or left to the reader ;-)   In reality I found that live
system upgrade is a massive headache (for successful systems only ;-)
and it's odd that not more is written about how to architect for it
from the beginning, what the limitations and pitfalls are with
either approach etc.

Thanks,
- Reto




soft-upgrade vs failover and back to/from 2nd-ary system

Ulf Wiger-5
On 2005-01-22 07:08:43, Reto Kramer <kramer> wrote:

> Context: Imagine a system that requires two nodes for fault tolerance.  
> Each node must be able to take over the other node's traffic (and state  
> if protocols are stateful) at any one point to handle the fault of one  
> of them.
>
> For such architectures, system upgrade can be performed by artificially  
> evacuating a node, restarting it (VM process) with the new version of  
> the code and rebalancing the traffic. This works really well if the  
> protocols used to talk to these nodes support some form of redirection  
> (either in the sender process, or in an intermediary such as a load  
> balancer for http traffic).
>
> Q: When does it not work well?

There are indeed good reasons to always upgrade a redundant system
using the redundancy mechanisms - esp. since that mechanism sometimes
is the only reasonable option.

For systems that have no redundancy, soft upgrade is a better option
than designing for redundancy anyway and then, e.g., starting a second
node and doing a redundancy upgrade. One could of course argue that
if the system has no redundancy, then downtime during upgrade must
be acceptable.

> Q: Are there guidelines as to when I should rather invest in writing  
> soft-upgradable code when I can get away with the above brute force  
> approach to system upgrade?

For debugging and patching, soft upgrade is superb. You can fairly
easily write code that is soft-upgradeable in Erlang/OTP, and using
it, you can swiftly load instrumented code or correct minor software
bugs without the users even noticing.
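As a minimal sketch of what "soft-upgradeable" means in practice (the
module and record below are invented for illustration, not taken from
any real system), a gen_server mainly needs a code_change/3 callback
that converts the old state term into the new layout:

```erlang
-module(session).
-behaviour(gen_server).
-vsn("2").

-export([start_link/0, init/1, handle_call/3, handle_cast/2,
         handle_info/2, code_change/3, terminate/2]).

%% v2 of the state adds a 'peer' field that v1 did not have.
-record(state, {count = 0, peer = undefined}).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

init([]) ->
    {ok, #state{}}.

handle_call(count, _From, S = #state{count = C}) ->
    {reply, C, S}.

handle_cast(bump, S = #state{count = C}) ->
    {noreply, S#state{count = C + 1}}.

handle_info(_Info, S) ->
    {noreply, S}.

%% Called by the release handler during a soft upgrade: convert the
%% old (v1) state tuple into the new record layout, in place, without
%% the server ever going down.
code_change(_OldVsn, {state, Count}, _Extra) ->
    {ok, #state{count = Count, peer = undefined}};
code_change(_OldVsn, State, _Extra) ->
    {ok, State}.

terminate(_Reason, _State) ->
    ok.
```

The discipline is mostly in keeping code_change/3 in step with every
change to the state representation; the rest of the module is an
ordinary gen_server.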

I've had occasions where I've developed server applications, and
had the server up and running all the time, always correcting errors
and adding new features through soft upgrade, and not restarting the
server for weeks. Very convenient, even if not perhaps strictly
necessary.


> Q: Many systems that run Erlang do indeed contain redundant CPU boards  
> (or multiple machines). Is there an easy way to characterize why the  
> brute force upgrade approach did not work in those systems (AXD 301  
> comes to mind of course) and the soft-upgrade approach had to be  
> invented?

AXD 301 supports a wide range of upgrade techniques, from soft upgrade
to system reboot with an upgraded configuration database. One reason for
this is that the AXD301 project started roughly at the same time as
the first version of OTP was being developed. Our understanding of
software upgrade using OTP in a very large system was understandably
poor in the beginning (it had never been done before!), so we kept
inventing ways to do it, until we eventually had support for almost
all techniques you can think of. (:

Redundancy upgrade is in there somewhere between the extremes, and is
one of the more useful techniques, but soft upgrade is used quite
often, esp. for error correction packages.


> I could not find guidelines for when to use brute-force upgrade in a  
> dual node system vs soft-upgrade in the documentation or papers  
> (comparing the two in general terms, or specific examples of pro/cons) -  
> can anyone point me at material?  I fear the answer must be obvious or  
> trivial, or left to the reader ;-)   In reality I found that live system  
> upgrade is a massive headache (for successful systems only ;-) and it's  
> odd that not more is written about how to architect for it from the  
> beginning, what the limitations and pitfalls are with either approach  
> etc.

I don't think such documentation exists, unfortunately.
And I agree - live system upgrade _is_ a massive headache, esp. for
large systems.

Regards,
Uffe
--
Using Opera's revolutionary e-mail client: http://www.opera.com/m2/



soft-upgrade vs failover and back to/from 2nd-ary system

Matthias Lang-2
In reply to this post by Reto Kramer-2
Reto Kramer writes:

 > Context: Imagine a system that requires two nodes for fault tolerance.
 > Each node must be able to take over the other node's traffic (and state
 > if protocols are stateful) at any one point to handle the fault of one
 > of them.

 > For such architectures, system upgrade can be performed by artificially
 > evacuating a node, restarting it (VM process) with the new version of
 > the code and rebalancing the traffic. This works really well if the
 > protocols used to talk to these nodes support some form of redirection
 > (either in the sender process, or in an intermediary such as a load
 > balancer for http traffic).

 > Q: When does it not work well?

A1) When you only have one node

A2) When the state is long lived and difficult, or impossible, to
    transfer from one node to another.

HTTP is pretty much the opposite of A2. In many telco applications, A2
describes the situation perfectly. On one telco voice application I
worked on, a typical upgrade/patch meant:

  1. Block new calls to the node.

  2. Wait until all calls end (i.e. people finish talking).

  3. Do the upgrade

  4. Unblock

Waiting for everyone to finish talking can take a long time. There are
two ways to reduce the wait: first, upgrade in the middle of the
night. Second, once you've waited (say) an hour, there'll just be a
handful of callers left, so you could just disconnect them and let the
helpdesk handle the complaints.
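A sketch of steps 1 and 2 (all names below are invented for
illustration): a small gate process that counts active calls, rejects
new ones once blocked, and tells the operator when the node has
drained:

```erlang
-module(drain).
-export([start/0, new_call/0, call_ended/0, block/0, wait_drained/1]).

start() ->
    register(drain, spawn(fun() -> loop(0, open) end)).

new_call()   -> req(new_call).   %% ok while open, rejected once blocked
call_ended() -> req(call_ended).
block()      -> req(block).      %% step 1: refuse new calls

%% Step 2: wait until every active call has ended (or give up).
wait_drained(Timeout) ->
    drain ! {self(), drained},
    receive drained -> ok after Timeout -> timeout end.

req(Msg) ->
    drain ! {self(), Msg},
    receive {reply, R} -> R end.

loop(Active, Mode) ->
    receive
        {From, new_call} when Mode =:= open ->
            From ! {reply, ok},
            loop(Active + 1, Mode);
        {From, new_call} ->                      %% blocked: reject
            From ! {reply, rejected},
            loop(Active, Mode);
        {From, call_ended} ->
            From ! {reply, ok},
            loop(Active - 1, Mode);
        {From, block} ->
            From ! {reply, ok},
            loop(Active, blocked);
        %% A pending 'drained' request sits in the mailbox until the
        %% guard holds, at which point the waiting operator is released.
        {From, drained} when Active =:= 0, Mode =:= blocked ->
            From ! drained,
            loop(Active, Mode)
    end.
```

Steps 3 and 4 (the upgrade itself and unblocking) then happen while the
gate reports zero active calls.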

On that system, we could also insert "small" patches by loading new
code into Erlang. That eliminated all the waiting and sprinting down
the corridor to escape enraged helpdesk people.
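For what it's worth, the "small patch" route looks roughly like this
from the shell of the live node (call_handler is a made-up module
name):

```erlang
%% Compile and load the patched module on the live node. Running
%% processes switch over at their next fully qualified
%% Module:function(...) call; old and new code briefly coexist.
1> c(call_handler).
{ok,call_handler}
%% Propagate the newly loaded module to all connected nodes:
2> nl(call_handler).
abcast
```

c/1 and nl/1 are shell conveniences (compile-and-load locally, and
load on every visible node); for anything beyond a quick patch, OTP's
release_handler and .appup files do the same thing in a controlled,
scripted way.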

---

Hot code loading isn't as general as 'evacuate-upgrade-restart' with
isolated, duplicated hardware. You can't upgrade the OS or VM by
reloading code. But in many systems it is simpler. In such systems you
handle most upgrades without any downtime and then accept a couple of
minutes per year of _planned_ downtime to upgrade the OS. In return,
you get a simpler (== less unplanned downtime) and cheaper system.

Matt



soft-upgrade vs failover and back to/from 2nd-ary system

Reto Kramer-2
> A1) When you only have one node
>
> A2) When the state is long lived and difficult, or impossible, to
>     transfer from one node to another.
>
> HTTP is pretty much the opposite of A2. In many telco applications, A2
> describes the situation perfectly.

Matthias, can you give me an additional clarification w.r.t. the
telco domain. (A2) implies that if the system that owns the state
crashes, the state is gone. I assume the telco applications you're
referring to use a definition of availability that does not count such
crashes as dropped calls?  I.e. are there telco applications in which
one can lose the call signaling state and, as long as the voice trunk
remains up, the call continues and is not counted as a drop?  I.e. as
long as one is able to set up new calls (on a fresh backup system that
needed none of the lost state transferred at all) life is good (modulo
the lost opportunity to charge for a call)?

> On one telco voice application I
> worked on, a typical upgrade/patch meant:
>
>   1. Block new calls to the node.
>
>   2. Wait until all calls end (i.e. people finish talking).
>
>   3. Do the upgrade
>
>   4. Unblock
>
> Waiting for everyone to finish talking can take a long time. There are
> two ways to reduce the wait: first, upgrade in the middle of the
> night. Second, once you've waited (say) an hour, there'll just be a
> handful of callers left, so you could just disconnect them and let the
> helpdesk handle the complaints.
>
> On that system, we could also insert "small" patches by loading new
> code into Erlang. That eliminated all the waiting and sprinting down
> the corridor to escape enraged helpdesk people.
>
> ---
>
> Hot code loading isn't as general as 'evacuate-upgrade-restart' with
> isolated, duplicated hardware. You can't upgrade the OS or VM by
> reloading code. But in many systems it is simpler. In such systems you
> handle most upgrades without any downtime and then accept a couple of
> minutes per year of _planned_ downtime to upgrade the OS. In return,
> you get a simpler (== less unplanned downtime) and cheaper system.
>
> Matt
>



soft-upgrade vs failover and back to/from 2nd-ary system

Matthias Lang-2

 Matthias> > A2) When the state is long lived and difficult, or
 Matthias> >     impossible, to transfer from one node to another.
[...]
 Matthias> > In many telco applications, A2 describes the situation perfectly.

    Reto> Matthias, can you give me an additional clarification
    Reto> w.r.t. to the telco domain. (A2) implies that if the system
    Reto> that owns the state crashes, the state is gone. I assume the
    Reto> telco applications you're referring to use a definition of
    Reto> availability that does not count such crashes as dropped
    Reto> calls?  

Such events should be (and are) counted as dropped calls. But one
dropped call isn't the end of the world. It happens. That's why the
requirements specify nonzero limits on the number of dropped calls.

The granularity of fault recovery is a design choice you make once
you've seen the requirements. Take a voicemail system. Imagine someone
pulls out both power plugs while you're listening to one of your
messages. Some possible ways the system could appear to the
subscriber:

   1. You never notice anything, i.e. the message keeps playing
      without so much as a hiccup.

or 2. There's a slight pop in the middle of the message

or 3. The whole system hiccups, e.g. the message starts over from the
      start, or perhaps you go back to the menu.

or 4. The call gets dropped, i.e. you have to call voicemail again

or 5. The call gets dropped. You try and call voicemail again but
      it's busy. You try again after five minutes and it works.

or 6. The call gets dropped and it takes several hours before voicemail
      works again.

or 7. All your messages get erased

I think all good voicemail systems settle for #4. Trying to do better
than that introduces a lot of complexity to deal with an unlikely
event. Cheap systems do #5. #6 and #7 are unacceptable.

Maybe I exaggerated the difference to HTTP. If I was on CNN's homepage
and the browser was in the middle of downloading the large picture on
the front page when someone pulled the power plug(s) on the CNN
webserver I happened to be using, I'd be pretty surprised if the load
balancer/failover system was smart enough to transfer the HTTP and TCP
state so that the image arrived whole anyway.

There are people who make voicemail (and IVR) systems _and_ HTTP
robustifiers on this list. Maybe they'd care to comment what their
systems do.

    Reto> I.e. are there telco applications in which one can
    Reto> lose the call signaling state and as long as the voice
    Reto> trunk remains up the call continues and is not counted as a
    Reto> drop?  I.e. as long as one is able to set up new calls (on a
    Reto> fresh backup system that needed none of the lost state
    Reto> transferred at all) life is good (modulo the lost opportunity
    Reto> to charge for a call)?

Keeping the voice connection up when the signalling state has been
lost is bad. It leaks connection resources and leaves subscribers
stuck in broken calls. Better to keep it simple and just drop the call
that triggered the problem.

Matthias



soft-upgrade vs failover and back to/from 2nd-ary system

Massimo Cesaro-2
On Wed, 2005-01-26 at 11:33, Matthias Lang wrote:
>
> Keeping the voice connection up when the signalling state has been
> lost is bad. It leaks connection resources and leaves subscribers
> stuck in broken calls. Better to keep it simple and just drop the call
> that triggered the problem.
>
> Matthias
On the other hand, for IP telephony keeping connections up even when the
signalling state is lost is acceptable. Given that the job of the
(stateless) call agent is mainly to set up calls between intelligent
gateways (i.e. telephones), a redundant system can take care of the
hangup at the end of the call if the primary system crashes after
establishing it. The two endpoints have an HTTP-like approach to the
signalling server, so the failover/failback mechanism fits pretty well.

Massimo





soft-upgrade vs failover and back to/from 2nd-ary system

Vance Shipley-2
In reply to this post by Reto Kramer-2
Reto,

I worked on Nortel Meridian 1 PBXs years ago.  As I recall the way
they handled a core processor crash was to audit the time switch
when it returned to service and build state for the connected calls.

        -Vance

On Tue, Jan 25, 2005 at 10:19:30PM -0800, Reto Kramer wrote:
}  
}  Matthias, can you give me an additional clarification w.r.t. the
}  telco domain. (A2) implies that if the system that owns the state
}  crashes, the state is gone. I assume the telco applications you're
}  referring to use a definition of availability that does not count such
}  crashes as dropped calls?  I.e. are there telco applications in which
}  one can lose the call signaling state and as long as the voice trunk
}  remains up the call continues and is not counted as a drop?  I.e. as
}  long as one is able to set up new calls (on a fresh backup system that
}  needed none of the lost state transferred at all) life is good (modulo
}  the lost opportunity to charge for a call)?