Redundant Array of Inexpensive Servers (RAIS)
Project Overview
This project will explore intelligent high availability protocols for
server clusters where each machine in the cluster does not have to have
the same function as the other machines in the cluster. In traditional
high availability or load balancing devices, such as the Cisco
LocalDirector, an incoming TCP request is received and is subsequently
forwarded to one of a specific number of servers which must all serve the
same content and/or services. The idea of RAIS is to provide the backend
internal communication to allow a cluster of heterogeneous servers to
provide high availability and load balancing services within the cluster.
Why Study High Availability?
Availability is one of the three main components of computer security and
deals with whether or not a machine's services are available for use by
legitimate users. In industry, where mission critical applications demand
high availability and where downtime could cost tens of thousands of
dollars per minute, high availability solutions are especially important.
Traditional solutions are not intelligent--when a failure occurs, a single
fixed action is taken based on a very limited state machine. This simplicity,
while extremely easy to implement, does not allow for intelligent
failovers. An example of an intelligent failover would be the dynamic (and
automatic) reassignment of failed services or resources to other
machines in the cluster based on the other machines' pre-existing services
and resources as well as other crucial factors including network latency
and network/CPU usage. In essence, intelligent high availability protocols
should preserve a specific quality of service as configured by the administrator
in addition to providing high availability and load balancing services.
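The kind of intelligent failover decision described above can be sketched as follows. This is only an illustration under stated assumptions: the `Candidate` record, the `pick_failover_target` function, the threshold parameters, and the ranking order are all hypothetical and are not part of any RAIS specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Candidate:
    """Hypothetical snapshot of one machine that could take over a service."""
    name: str
    services: Set[str] = field(default_factory=set)  # services already hosted
    latency_ms: float = 0.0  # measured network latency to the machine
    cpu_usage: float = 0.0   # fraction of CPU in use, 0.0 to 1.0
    net_usage: float = 0.0   # fraction of network capacity in use, 0.0 to 1.0


def pick_failover_target(
    candidates: List[Candidate],
    max_latency_ms: float,
    max_cpu: float,
    max_net: float,
) -> Optional[Candidate]:
    """Return the machine that should take over a failed service, or None.

    The three limits stand in for the quality of service configured by the
    administrator (an assumption about how that configuration might look).
    """
    # Drop machines that would violate the configured quality of service.
    eligible = [
        c for c in candidates
        if c.latency_ms <= max_latency_ms
        and c.cpu_usage <= max_cpu
        and c.net_usage <= max_net
    ]
    if not eligible:
        return None
    # Assumed ranking: fewest existing services first, then lowest CPU usage,
    # network usage, and latency.
    return min(
        eligible,
        key=lambda c: (len(c.services), c.cpu_usage, c.net_usage, c.latency_ms),
    )
```

A traditional fixed-action failover corresponds to always returning the same machine. Here the choice depends on the state of every surviving machine, and returning None signals that no machine can take the service without violating the configured limits.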
Preliminary Ideas
Each cluster in a network is assigned a unique 16-bit cluster id written as
two dotted octets. Each host in a cluster is assigned a unique 16-bit
host id and is given a host-cluster id with the cluster id being the
16 most significant bits and the host id being the 16 least significant
bits. This allows for up to 65536 clusters with 65536 hosts per cluster
(65535 of each if one value, such as 0, is reserved) and should be ample for
most purposes. In addition, each host will have a unique IP address.
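A minimal sketch of the id layout described above. The function names are hypothetical, and it assumes that the dotted notation for a 16-bit id means two 8-bit values written in decimal (high byte first).

```python
from typing import Tuple

MAX_ID = 0xFFFF  # largest value that fits in 16 bits


def pack_host_cluster_id(cluster_id: int, host_id: int) -> int:
    """Combine a 16-bit cluster id and a 16-bit host id into one 32-bit value."""
    for name, value in (("cluster_id", cluster_id), ("host_id", host_id)):
        if not 0 <= value <= MAX_ID:
            raise ValueError(f"{name} must fit in 16 bits, got {value}")
    # Cluster id occupies the 16 most significant bits, host id the 16 least.
    return (cluster_id << 16) | host_id


def unpack_host_cluster_id(host_cluster_id: int) -> Tuple[int, int]:
    """Split a 32-bit host-cluster id back into (cluster_id, host_id)."""
    return (host_cluster_id >> 16) & MAX_ID, host_cluster_id & MAX_ID


def format_dotted(id16: int) -> str:
    """Render a 16-bit id as two dotted decimal parts (assumed notation)."""
    return f"{(id16 >> 8) & 0xFF}.{id16 & 0xFF}"


def parse_dotted(text: str) -> int:
    """Parse a 'high.low' dotted string back into a 16-bit id."""
    high, low = (int(part) for part in text.split("."))
    if not (0 <= high <= 0xFF and 0 <= low <= 0xFF):
        raise ValueError(f"each part must be in 0..255, got {text!r}")
    return (high << 8) | low
```

For example, cluster `10.1` with host id 2 packs to `0x0A010002`, and two hosts belong to the same cluster exactly when their host-cluster ids agree in the top 16 bits.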
A cluster will be defined as a grouping of servers with the same cluster ID
(or top 16 bits in the host-cluster id). Each cluster will contain servers
grouped into 3 categories: 1) active, 2) standby, 3) cluster master. An
active server denotes a machine that is currently providing services. A
server that is on standby will act like a disk that is in the hot-spare
pool for many RAID implementations--namely, it will act as a spare for
either active or cluster master servers and will be readily available to be
made active or to replace the cluster master. The cluster master is the
server that maintains the many tables for the servers--namely: 1) services
table, 2) latency table, 3) cpu usage table, 4) network usage table,
5) server status table and possibly others. This server will facilitate
communication to other servers and will probe (at a given interval) each
system to check whether or not the services defined for the system are
available. It is also in charge of assigning failed services or resources
to other active or standby servers. After updating each of the tables (above),
the cluster master will send a sync message to all servers with the same
cluster id--the servers will then fetch a copy of the tables from the
cluster master via multicast.
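The probe-and-sync cycle of the cluster master might look roughly like the sketch below. It makes several assumptions that the text does not state: a service is identified by a TCP port, a successful TCP connection counts as "available", and the sync message carries a version counter. The names `ClusterMaster`, `tcp_probe`, `probe_all`, and `sync_message` are hypothetical, and actually sending the message and serving the tables over multicast is left out.

```python
import socket
from typing import Callable, Dict, List, Tuple


def tcp_probe(address: str, port: int, timeout: float = 1.0) -> bool:
    """Assumed availability check: can a TCP connection be opened?"""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


class ClusterMaster:
    """Keeps a services table and a server status table for one cluster."""

    def __init__(
        self,
        services_table: Dict[str, List[int]],
        probe: Callable[[str, int], bool] = tcp_probe,
    ) -> None:
        self.services_table = services_table  # host address -> defined ports
        self.status_table: Dict[str, Dict[int, bool]] = {}
        self.version = 0  # incremented after every table update (assumption)
        self.probe = probe

    def probe_all(self) -> List[Tuple[str, int]]:
        """Probe every defined service once and return the failed ones."""
        failed: List[Tuple[str, int]] = []
        for address, ports in self.services_table.items():
            results = {port: self.probe(address, port) for port in ports}
            self.status_table[address] = results
            failed.extend((address, p) for p, ok in results.items() if not ok)
        self.version += 1
        return failed

    def sync_message(self) -> dict:
        """Build the message telling servers to fetch the updated tables."""
        return {"type": "sync", "version": self.version}
```

In a running system `probe_all` would be called at the configured interval, the returned failures would be handed to whatever logic reassigns services, and `sync_message` would then be sent to every server with the same cluster id.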
The obvious single point of failure is the cluster master; however, the
standby and active servers can also detect a failed cluster master and can
facilitate its replacement with either a standby server (first choice) or
an active server (second choice).
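The replacement order above (standby first, active second) can be sketched as a selection function. The `Role` enum and `choose_new_master` are hypothetical names, and breaking ties by lowest host id is an assumption, chosen so that every surviving server reaches the same answer from the same table.

```python
from enum import Enum
from typing import Dict, Optional


class Role(Enum):
    ACTIVE = "active"
    STANDBY = "standby"
    MASTER = "cluster master"


def choose_new_master(roles: Dict[int, Role], failed_master: int) -> Optional[int]:
    """Pick the host id that should replace a failed cluster master.

    `roles` maps host id -> current role. Standby servers are preferred;
    active servers are used only when no standby server is left.
    """
    survivors = {h: r for h, r in roles.items() if h != failed_master}
    for preferred in (Role.STANDBY, Role.ACTIVE):
        candidates = [h for h, r in survivors.items() if r is preferred]
        if candidates:
            # Assumed tie-break: lowest host id wins.
            return min(candidates)
    return None  # no server is left to take over
```

Because the function depends only on the shared role table and the id of the failed master, servers holding the same copy of the tables would agree on the replacement without further negotiation.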
Resources