Understanding fault tolerance

In digital media streaming, fault tolerance is the ability of a streaming media system to maintain, or at least recover, service after a system fault. One measure of fault tolerance is how likely a fault in the system is to result in failure. Another is the availability of the system: the percentage of time that the system is up.
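As an illustration of the up-time measure, the following minimal sketch computes availability from observed up-time and down-time. The function name and the sample figures are hypothetical and are not taken from Windows Media Services.

```python
def availability_percent(uptime_hours: float, downtime_hours: float) -> float:
    """Return availability as the percentage of observed time the system was up."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return 100.0 * uptime_hours / total


# Hypothetical figures: 9 hours of down-time in an 8,760-hour year.
print(f"{availability_percent(8751, 9):.2f}%")  # prints 99.90%
```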

A streaming media system is a chain of components stretching from the content origin to the consumer. As with the links of a chain, every component must adequately perform its assigned task, or the system as a whole fails.

Faults can occur anywhere in a streaming media system. Relative to Windows Media Services, upstream faults involve the source of the content, such as an encoder or digital media library. Downstream faults involve the distribution of the content to the client, such as faults in distribution servers or cache/proxy servers.

[Figure: Upstream and downstream components]

The key to fault tolerance in a streaming media system is redundancy. A system that relies on a single component at any stage in the media distribution process is vulnerable to failure.
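The effect of redundancy can be estimated with standard reliability arithmetic. The sketch below assumes that components fail independently, which real systems only approximate; the function names and the 0.99 figures are hypothetical. A chain of single components is only as available as the product of its stages, while a redundant stage fails only if every replica fails.

```python
from math import prod
from typing import Iterable


def series_availability(stages: Iterable[float]) -> float:
    """Availability of a chain in which every stage must work."""
    return prod(stages)


def redundant_availability(replicas: Iterable[float]) -> float:
    """Availability of one stage that works if at least one replica works."""
    return 1.0 - prod(1.0 - a for a in replicas)


# Hypothetical chain: encoder, server, distribution server, each 99% available.
print(f"{series_availability([0.99, 0.99, 0.99]):.4f}")  # prints 0.9703

# The same stage served by two independent 99% replicas.
print(f"{redundant_availability([0.99, 0.99]):.4f}")  # prints 0.9999
```

Under these assumptions, three single 99% components in a row yield roughly 97% availability, whereas duplicating one stage raises that stage to about 99.99%.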

Upstream faults

The failure of an input to Windows Media Services, whether an encoder, a remote publishing point, or a file server, is particularly challenging because the system administrator may not be aware that there is a problem. When an upstream content source fails or disconnects, an error is written to the Troubleshooting tab and the session log, but there is no overt indication of a problem in Windows Media Services.

You can minimize the risk of an upstream fault by using multiple content sources for your publishing point. Multiple content sources can consist of redundant encoders or alternate content files that the publishing point can use if the primary content source is unavailable.
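The fallback idea can be sketched in general terms as follows. This is not how Windows Media Services implements source selection; it only illustrates trying content sources in priority order. The function, the availability check, and the source addresses are all hypothetical.

```python
from typing import Callable, Iterable, Optional


def pick_source(sources: Iterable[str],
                is_available: Callable[[str], bool]) -> Optional[str]:
    """Return the first available source in priority order, or None."""
    for source in sources:
        if is_available(source):
            return source
    return None


# Hypothetical sources: a primary encoder, a redundant encoder, a fallback file.
sources = [
    "http://encoder1.example.com:8080",
    "http://encoder2.example.com:8080",
    "C:\\WMPub\\WMRoot\\fallback.wmv",
]

# Simulate the primary encoder being down.
down = {"http://encoder1.example.com:8080"}
print(pick_source(sources, lambda s: s not in down))
```

With the primary encoder marked as unavailable, the sketch selects the redundant encoder. If both encoders were unavailable, it would fall back to the content file.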

[Figure: Protect against encoder failures]

Downstream faults

Failure of the Windows Media server or one of its downstream components, such as a distribution server, can prevent clients from receiving the content they requested. Clustering, which means using multiple Windows Media servers to stream the same content, reduces the risk of interrupted service.

Clustering is a valuable fault tolerance technique because reduced capacity or failure of any one server is unlikely to interrupt the whole system. If one server stops responding, the workload of the failed server can be immediately and seamlessly transferred to the other servers.
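The transfer of workload can be pictured with the following simplified sketch, which spreads the client connections of a failed server evenly across the remaining servers. In a real cluster, this job belongs to the load-balancing layer; the function and server names here are hypothetical, and the sketch ignores per-server capacity limits.

```python
from typing import Dict


def redistribute(load_by_server: Dict[str, int], failed: str) -> Dict[str, int]:
    """Return new per-server loads after spreading the failed server's clients."""
    survivors = {name: load for name, load in load_by_server.items()
                 if name != failed}
    if not survivors:
        raise RuntimeError("no servers left to take over the workload")
    orphaned = load_by_server.get(failed, 0)
    # Give each survivor an equal share; hand out any remainder one by one.
    share, extra = divmod(orphaned, len(survivors))
    for i, name in enumerate(sorted(survivors)):
        survivors[name] += share + (1 if i < extra else 0)
    return survivors


# Hypothetical three-server cluster with 100 clients each; wms2 stops responding.
print(redistribute({"wms1": 100, "wms2": 100, "wms3": 100}, "wms2"))
# prints {'wms1': 150, 'wms3': 150}
```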

[Figure: Highly fault-tolerant system]


© 2005 Microsoft Corporation. All rights reserved.