Redundant Array of Inexpensive Servers (RAIS)
This project will explore intelligent high availability protocols for
server clusters in which the machines need not all serve the same
function. In traditional
high availability or load balancing devices, such as the Cisco
LocalDirector, an incoming TCP request is received and is subsequently
forwarded to one of a specific number of servers which must all serve the
same content and/or services. The idea of RAIS is to provide the backend
internal communication to allow a cluster of heterogeneous servers to
provide high availability and load balancing services within the cluster.
Why Study High Availability?
Availability is one of the three main components of computer security and
deals with whether or not a machine's services are available for use by
legitimate users. In industry, where mission critical applications demand
high availability and where downtime could cost tens of thousands of
dollars per minute, high availability solutions are especially important.
Traditional solutions are not intelligent--when a failure occurs, a single
predefined action is taken based on a very limited state machine. This simplicity,
while extremely easy to implement, does not allow for intelligent
failovers. An example of an intelligent failover would be the dynamic (and
automatic) reassignment of failed services or resources to other
machines in the cluster based on those machines' pre-existing services
and resources as well as other crucial factors including network latency
and network/CPU usage. In essence, intelligent high availability protocols
should preserve a specific quality of service as configured by the administrator
in addition to providing high availability and load balancing services.
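The reassignment criteria above can be sketched as a simple scoring function. The weights, field names, and candidate format here are illustrative assumptions, not part of the protocol:

```python
# Hypothetical scoring sketch: choose the takeover candidate for a
# failed service by weighting the metrics named in the text (network
# latency, CPU usage, network usage). Weights are assumptions.

def takeover_score(latency_ms, cpu_pct, net_pct):
    """Lower is better; a loaded or distant server scores higher."""
    return 0.5 * latency_ms + 0.3 * cpu_pct + 0.2 * net_pct

def pick_takeover(candidates):
    """candidates: list of (host_id, latency_ms, cpu_pct, net_pct)."""
    return min(candidates,
               key=lambda c: takeover_score(c[1], c[2], c[3]))[0]

# Example: host 2 is idle and close, so it receives the failed service.
servers = [(1, 20.0, 90.0, 40.0),   # busy
           (2, 5.0, 10.0, 5.0),     # idle standby
           (3, 80.0, 30.0, 20.0)]   # distant
print(pick_takeover(servers))  # -> 2
```

A real implementation would also weigh the services a candidate already runs, as the text notes, but the principle is the same: rank the surviving servers and assign the failed resource to the best-scoring one.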
Each cluster in a network is assigned a unique 16-bit cluster id written
as two dotted octets. Each host in a cluster is assigned a unique 16-bit
host id and is given a host-cluster id with the cluster id being the
16 most significant bits and the host id being the 16 least significant
bits. This allows for up to 65535 clusters with 65535 hosts per cluster
and should be ample for most purposes. In addition each host will have a
unique IP address.
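The id layout described above can be sketched directly. The helper names and the dotted-octet rendering are assumptions about the intended notation:

```python
# Sketch of the 32-bit host-cluster id: the 16-bit cluster id occupies
# the most significant bits, the 16-bit host id the least significant.

def make_host_cluster_id(cluster_id, host_id):
    assert 0 <= cluster_id <= 0xFFFF and 0 <= host_id <= 0xFFFF
    return (cluster_id << 16) | host_id

def split_host_cluster_id(hcid):
    return (hcid >> 16) & 0xFFFF, hcid & 0xFFFF

def dotted_octets(value16):
    """Render a 16-bit id as two dotted octets, e.g. 0x0102 -> '1.2'."""
    return f"{(value16 >> 8) & 0xFF}.{value16 & 0xFF}"

hcid = make_host_cluster_id(0x0102, 0x0007)
print(split_host_cluster_id(hcid))  # -> (258, 7)
print(dotted_octets(0x0102))        # -> 1.2
```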
A cluster will be defined as a grouping of servers with the same cluster ID
(or top 16 bits in the host-cluster id). Each cluster will contain servers
grouped into 3 categories: 1) active, 2) standby, 3) cluster master. An
active server denotes a machine that is currently providing services. A
server that is on standby will act like a drive in the hot-spare pool of
many RAID implementations--namely, it will act as a spare for
either active or cluster master servers and will be readily available to be
made active or to replace the cluster master. The cluster master is the
server that maintains the cluster's tables--namely: 1) services
table, 2) latency table, 3) CPU usage table, 4) network usage table,
5) server status table, and possibly others. This server will facilitate
communication to other servers and will probe (at a given interval) each
system to check whether or not the services defined for the system are
available. It is also in charge of assigning failed services or resources
to other active or standby servers. After updating each of the tables (above),
the cluster master will send a sync message to all servers with the same
cluster id--the servers will then fetch a copy of the tables from the
cluster master via multicast.
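The cluster master's state might be organized as below. The table names follow the list above; the SYNC message format and the `record_metrics` helper are illustrative assumptions, and the multicast fetch itself is left out since the text does not specify it:

```python
# Minimal sketch of the cluster master's per-cluster state: the five
# tables named in the text, each keyed by host id.

class ClusterMaster:
    def __init__(self, cluster_id):
        self.cluster_id = cluster_id
        self.tables = {
            "services": {},  # host id -> list of services it provides
            "latency": {},   # host id -> round-trip latency (ms)
            "cpu": {},       # host id -> CPU usage (percent)
            "network": {},   # host id -> network usage (percent)
            "status": {},    # host id -> "active"|"standby"|"failed"
        }

    def record_metrics(self, host_id, latency, cpu, net, status):
        """Update the tables after probing one host (assumed helper)."""
        self.tables["latency"][host_id] = latency
        self.tables["cpu"][host_id] = cpu
        self.tables["network"][host_id] = net
        self.tables["status"][host_id] = status

    def sync_message(self):
        """Message multicast to every host with this cluster id; each
        host then fetches a fresh copy of the tables."""
        return {"type": "SYNC", "cluster": self.cluster_id}

m = ClusterMaster(cluster_id=0x0102)
m.record_metrics(7, latency=5.0, cpu=10.0, net=5.0, status="active")
print(m.sync_message())  # -> {'type': 'SYNC', 'cluster': 258}
```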
The obvious single point of failure is the cluster master; however, the
standby and active servers can also detect a failed cluster master and can
facilitate its replacement with either a standby server (first) or an
active server (second).
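That replacement ordering can be sketched as a small election routine. The input format (a host id to role mapping) and the lowest-id tie-break are assumptions:

```python
# Sketch of the master-replacement order: promote a standby server
# first; only if none survives does an active server take over.

def elect_new_master(status):
    """status: host id -> 'active' | 'standby' | 'failed'."""
    for preferred in ("standby", "active"):
        candidates = sorted(h for h, role in status.items()
                            if role == preferred)
        if candidates:
            return candidates[0]  # assumed tie-break: lowest host id
    return None  # no surviving servers

print(elect_new_master({1: "failed", 2: "active", 3: "standby"}))  # -> 3
print(elect_new_master({1: "failed", 2: "active"}))                # -> 2
```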