Outline for January 23, 2001
1. Greetings and felicitations!
a. No class on January 25 or January 30; no office hours on Wednesday, January 24 or Monday, January 29
b. Extra office hour: Friday January 26: 11 AM-1 PM
2. Distributed system?
a. What is it?
b. Why use it?
3. System Architectures
a. minicomputer mode
b. workstation model
c. processor pool
4. Issues
a. global knowledge
b. naming
c. scalability
d. compatibility
e. process synchronization, communication
f. security
g. structure
5. Networks
a. goals
b. message, packet, subnet, session
c. switching: circuit, store-and-forward, message, packet, virtual circuit, dynamic routing
d. OSI model: PDUs, layering
i. physical: ethernet, aloha, etc.
ii. data link layer: frames, parity checks, link encryption
iii. network layer: virtual circult vs. datagram, routing via flooding, static routes, dynamic routes, central-
ized routing vs. distributed routing; congestion solutions (packet discarding, isarithmic, choke packets)
iv. transport: services provided (UDP vs. TCP), functions to higher layers, addressing schemes (flat, DNS,
etc.), gateway fragmentation and reassembly
v. session: adds session characteristics like authentication
vi. presentation: compression, end-to-end encryption, virtual terminal
vii. application: user-level programs
6. Clocks
a. happened-before relation
b. Lamport's distributed clocks: a Æ b means C(a) < C(b)
c. Example where C(a) < C(b) does not mean a Æ b
d. Vector clocks and causal relation
e. ordering of messages so you receive them in the order sent
i. why
ii. for broadcast (ISIS): Birman-Schiper-Stephenson
iii. for point to point: Schiper-Eggli-Sandoz
7. Global state
a. Show problem of slicing state when something is in transit
b. Define local state; send(mij) ? LSi iff time of send(mij) < current time of LSi; similar for receive
c. transit(LSi, LSj); inconsistent(LSi, LSj); consistent state is one with inconsistent set empty for all pairs LSi,
LSj
d. Consistent global state: Chandry-Lamport
8. Termination detection
a. Haung
Lamport's Clocks
Introduction
Lamport's clocks keep a virtual time among distributed systems. The goal is to provide an ordering upon events
within the system.
Notation
· Pi process
· Ci. clock associated with process Pi
Protocol
1. Increment clock Ci between any two successive events in process Pi: Ci ¨ Ci + d (d > 0)
2. Let event a be the sending of a message by process Pi; the timestamp is ta = Ci(a) after the clock is incremented.
Let b be the receipt of that message by Pj. Then when Pj receives the message, Cj ¨ max(Cj, ta + d) (d > 0)
Example
Assume all clocks start at 0, and d is 1 (that is, each event incrememts the clock by 1). The events and clocks are:
e11: C1 ¨ 1
e31: C3 ¨ 1; timestamp t3,1 of message is 1
e21: C2 ¨ 2 as t3,1 = 1, after the increment C2 ¨ 1, and C2 ¨ max(C2, t3,1 + 1) = max(1, 1 + 1) = max(1, 2) = 2
e22: C2 ¨ 3; timestamp t2,2 of message is 3
e12: C1 ¨ 2; timestamp t1,2 of message is 2
e23: C2 ¨ 4 as t1,2 = 2, after the increment C2 ¨ 4, and C2 ¨ max(C2, t1,2 + 1) = max(4, 2 + 1) = max(4, 2) = 4
e24: C2 ¨ 5; timestamp t2,4 of message is 5
e13: C1 ¨ 4 as t2,2 = 3, after the increment C1 ¨ 3, and C1 ¨ max(C1, t2,2 + 1) = max(3, 3 + 1) = max(3, 4) = 4
e32: C3 ¨ 6 as t2,4 = 5, after the increment C3 ¨ 2, and C3 ¨ max(C3, t2,4 + 1) = max(2, 5 + 1) = max(2, 6) = 6
Problem
Clearly, if a Æ b, then C(a) < C(b). But if C(a) < C(b), does a Æ b?
The answer, surprisingly, is not necessarily. In the above example, C3(e31) = 1 < 2 = C1(e12). But e31 and e12 are caus-
ally unrelated; that is, e31 Æ/e12. However, C1(e11) = 1 < 6 = C3(e32), and clearly e11 Æ e32. Hence one cannot say
one way or the other.
Vector Clocks
Introduction
This is based upon Lamport's clocks, but each process keeps track of what is believes the other processes' interrnal
clocks are (hence the name, vector clocks). The goal is to provide an ordering upon events within the system.
Notation
· n processes
· Pi process
· Ci. vector clock associated with process Pi; jth element is Ci[j] and contains Pi's latest value for the current time
in process Pk.
Protocol
1. Increment clock Ci between any two successive events in process Pi: Ci[i] ¨ Ci[i] + d (d > 0)
2. Let event a be the sending of a message by process Pi; its vector timestamp is ta = Ci(a) after the clock is incre-
mented.. Let b be the receipt of that message by Pj. Then when Pj receives the message, it updates its vector
clock for all k = 1, -, n: Cj[k] ¨ max(Cj[k], ta[k])
Example
Here is the progression of time for the three processes:
e11: C1 ¨ (1, 0, 0)
e31: C3 ¨ (0, 0, 1); timestamp t3,1 of message is (0, 0, 1)
e21: C2 ¨ (0, 1, 1) as t3,1 = (0, 0, 1), C2 ¨ (0, 1, 0) after the increment, C2[0] ¨ max(C2[0], t3,1[0]) = max(0, 0) = 0,
C2[1] ¨ max(C2[1], t3,1[1]) = max(1, 0) = 1, and C2[2] ¨ max(C2[2], t3,1[2]) = max(0, 1) = 1
e22: C2 ¨ (0, 2, 1); timestamp t2,1 of message is (0, 2, 1)
e12: C1 ¨ (2, 0, 0); timestamp t1,1 of message is (2, 0, 0)
e23: C2 ¨ (2, 3, 1) as t1,1 = (2, 0, 0), C2 ¨ (0, 3, 1) after the increment, C2[0] ¨ max(C2[0], t1,1[0]) = max(0, 2) = 2,
C2[1] ¨ max(C2[1], t1,1[1]) = max(3, 0) = 3, and C2[2] ¨ max(C2[2], t1,1[2]) = max(0, 1) = 1
e24: C2 ¨ (2, 3, 1); timestamp t2,2 of message is (2, 3, 1)
e13: C1 ¨ (3, 2, 1) as t2,1 = (0, 2, 1), C1 ¨ (3, 0, 0) after the increment, C1[0] ¨ max(C1[0], t2,1[0]) = max(3, 0) = 3,
C1[1] ¨ max(C1[1], t2,1[1]) = max(2, 0) = 2, and C1[2] ¨ max(C1[2], t2,1[2]) = max(1, 0) = 1
e32: C3 ¨ (2, 3, 2) as t2,2 = (2, 3, 1), C3 ¨ (0, 0, 2) after the increment, C3[0] ¨ max(C3[0], t2,2[0]) = max(2, 0) = 2,
C3[1] ¨ max(C3[1], t2,2[1]) = max(3, 0) = 3, and C3[2] ¨ max(C3[2], t2,2[2]) = max(1, 2) = 2
Notice that C1(e11) < C3(e32), so e11 Æ e32, but C1(e11) and C3(e31) are incomparable, so e11 and e31 are concurrent.
Birman-Schiper-Stephenson Protocol
Introduction
The goal of this protocol is to preserve ordering in the sending of messages. For example, if send(m1) Æ send(m2),
then for all processes that receive both m1 and m2, receive(m1)Æ receive(m2). The basic idea is that m2 is not given to
the process until m1 is given. This means a buffer is needed for pending deliveries. Also, each message has an associ-
ated vector that contains information for the recipient to determine if another message preceded it. Also, we shall
assume all messages are broadcast. Clocks are updated only when messages are sent.
Notation
· n processes
· Pi process
· Ci. vector clock associated with process Pi; jth element is Ci[j] and contains Pi's latest value for the current time
in process Pk.
· tm vector timestamp for message m (stamped after local clock is incremented)
Protocol
Pi broadcasts a message
1. Pi increments Ci[i] and sets the timestamp tm = Ci for message m.
Pj receives a message from Pi
1. When Pj, j i, receives m with timestamp tm, it delays the message's delivery until both:
a. Cj[i] = tm[i] - 1; and
b. for all k n and k i, Cj[k] tm[k].
2. When the message is delivered to Pj, update Pj's vector clock for all k = 1, -, n: Cj[k] ¨ max(Cj[k], ta[k])
3. Check buffered messages to see if any can be delivered.
Example
Here is the protocol applied to the above situation:
e31: P3 sends message a; C3 = (0, 0, 1); ta = (0, 0, 1)
e21: P2 receives message a. As C2 = (0, 0, 0), C2[3] = ta[3] - 1 = 1 - 1 = 0 and C2[1] ta[1] and C2[2] ta[2] = 0. So
the message is accepted, and C2 is set to (0, 0, 1)
e22: P2 sends message b; C2 = (0, 1, 1); tb = (0, 1, 1)
e11: P1 receives message a. As C1 = (0, 0, 0), C1[3] = ta[3] - 1 = 1 - 1 = 0 and C1[1] ta[1] and C1[2] ta[2] = 0. So
the message is accepted, and C1 is set to (0, 0, 1)
e12: P1 receives message b. As C1 = (0, 0, 1), C1[2] = tb[2] - 1 = 1- 1 = 0 and C1[1] tb[1] and C1[3] tb[2] = 0. So
the message is accepted, and C1 is set to (0, 1, 1)
e32: P3 receives message b. As C3 = (0, 0, 1), C3[2] = tb[2] - 1 = 1 - 1 = 1 and C1[1] tb[1] and C1[3] tb[2] = 0. So
the message is accepted, and C3 is set to (0, 1, 1)
Now, suppose ta arrived as event e12, and tb as event e11:
Then the progression of time in P1 goes like this:
e31: P3 sends message a; C3 = (0, 0, 1); ta = (0, 0, 1)
e21: P2 receives message a. As C2 = (0, 0, 0), C2[3] = ta[3] - 1 = 1 - 1 = 0 and C2[1] ta[1] and C2[2] ta[2] = 0. So
the message is accepted, and C2 is set to (0, 0, 1)
e22: P2 sends message b; C2 = (0, 1, 1); tb = (0, 1, 1)
e11: P1 receives message b. As C1 = (0, 0, 0), C1[2] = tb[2] - 1 = 1 - 1 = 0 and C1[1] tb[1], but C1[3] < tb[3], so the
message is held until another message arrives. The vector clock updating algorithm is not run.
e12: P1 receives message a. As C1 = (0, 0, 0), C1[3] = ta[3] - 1 = 1 - 1 = 0, C1[1] ta[1], and C1[2] ta[2]. The mes-
sage is accepted and C1 is set to (0, 0, 1). Now the queue is checked. As C1[2] = tb[2] - 1 = 1 - 1 = 0, C1[1] tb[1],
and C1[3] tb[3], that message is accepted and C1 is set to (0, 1, 1).
e32: P3 receives message b. As C3 = (0, 0, 1), C3[2] = tb[2] - 1 = 1 - 1 = 1 and C1[1] tb[1] and C1[3] tb[2] = 0. So
the message is accepted, and C3 is set to (0, 1, 1)
Schiper-Eggli-Sandoz Protocol
Introduction
The goal of this protocol is to ensure that messages are given to the receiving processes in order of sending. Unlike
the Birman-Schiper-Stephenson protocol, it does not require using broadcast messages. Each message has an associ-
ated vector that contains information for the recipient to determine if another message preceded it. Clocks are updated
only when messages are sent.
Notation
· n processes
· Pi process
· Ci. vector clock associated with process Pi; jth element is Ci[j] and contains Pi's latest value for the current time
in process Pk.
· tm vector timestamp for message m (stamped after local clock is incremented)
· ti current time at process Pi
· Vi vector of Pi's previously sent messages; Vi[j] = tm, where the last message sent to Pj has the vector timestamp
tm; Vi[j][k] is the kth component of Vi[j].
· Vm vector accompanying message m
Protocol
Pi sends a message to Pj
1. Pi sends message m, timestamped tm, and Vi, to process Pj .
2. Pi sets Vi[j] ¨ tm.
Pj receives a message from Pi
1. When Pj, j i, receives m, it delays the message's delivery if both:
a. Vm[j] is set; and
b. Vm[j] < tj
Otherwise it is queued for later delivery.
2. If the message can be delivered to Pj, the following three actions occur:
a. Update all set elements of Vj with the corresponding elements of Vm, except for Vj[j], as follows:
i. If Vj[k] and Vm[k] are uninitialized, do nothing.
ii. If Vj[k] is uninitialized and Vm[k] is initialized, set Vj[k] ¨ Vm[k].
iii. If both Vj[k] and Vm[k] are initialized, Vj[k][k¢] ¨ max(Vj[k][k¢], Vm[k][k¢]) for all k¢ = 1, -, n
3. Update Pj's vector clock for all k = 1, -, n: Cj[k] ¨ max(Cj[k], tm[k])
4. Check buffered messages to see if any can be delivered.
Example
Here is the protocol applied to the above situation. [ ... ] and { ... } are like ( ... ) but used when too many parentheses
would be confusing (to me, at any rate!):
e31: P3 sends message m3,1 to P2. C3 = (0, 0, 1); t3,1 ¨ (0, 0, 1), V3,1 ¨ (?, ?, ?); V3 ¨ [ ?, (0, 0, 1), ? ]
e21: P2 receives message m3,1 from P3. As V3,1[2] = (?, ?, ?)[2] is uninitialized, the message is accepted.
V2 ¨ [ ?, ?, ? ] and C2 ¨ max[(0, 0, 0), (0, 0, 1)] = (0, 0, 1)
e22: P2 sends message m2,1 to P1. C2 ¨ (0, 1, 1); t2,1 ¨ (0, 1, 1), V2,1 ¨ [ ?, ?, ? ]; V2 ¨ [ (0, 1, 1), ?, ? ]
e11: P1 sends message m1,1 to P3. C1 ¨ (1, 0, 0); t1,1 ¨ (1, 0, 0), V1,1 ¨ ( ?, ?, ? ); V1 ¨ [ ?, ?, (1, 0, 0) ]
e32: P3 receives message m1,1 from P1. As V1,1[3] = ( ?, ?, ? )[3] is uninitialized, the message is accepted.
V3 ¨ [ ?, (0, 0, 1), ? ] and C3 ¨ max[(0, 0, 1), (1, 0, 0)] = (1, 0, 1).
e12: P1 receives message m2,1 from P2. As V2,1[1] = [ ?, ?, ? ][1] is uninitialized, the message is accepted.
V1 ¨ [ ?, ?, (1, 0, 0) ] and C1 ¨ max[(1, 0, 0), (0, 1, 1)] = (1, 1, 1)
e23: P2 sends message m2,2 to P1. C2 ¨ (0, 2, 1); t2,2 ¨ (0, 2, 1), V2,2 ¨ [ (0, 1, 1), ?, ? ]; V2 ¨ [ (0, 2, 1), ?, ? ]
e13: P1 receives message m2,2 from P2. As V2,2[1] = (0, 1, 1) < (1, 1, 1) = C1, the message is accepted.
V1 ¨ [ ?, ?, (1, 0, 0) ] and C1 ¨ max[(0, 2, 1), (1, 1, 1)] = (1, 2, 1)
Now, suppose m2,1 arrived as event e13, and m2,2 as event e12:
e31: P3 sends message m3,1 to P2. C3 = (0, 0, 1); t3,1 ¨ (0, 0, 1), V3,1 ¨ (?, ?, ?); V3 ¨ [ ?, (0, 0, 1), ? ]
e21: P2 receives message m3,1 from P3. As V3,1[2] = (?, ?, ?)[2] is uninitialized, the message is accepted.
V2 ¨ [ ?, ?, ? ] and C2 ¨ max[(0, 0, 0), (0, 0, 1)] = (0, 0, 1)
e22: P2 sends message m2,1 to P1. C2 ¨ (0, 1, 1); t2,1 ¨ (0, 1, 1), V2,1 ¨ [ ?, ?, ? ]; V2 ¨ [ (0, 1, 1), ?, ? ]
e11: P1 sends message m1,1 to P3. C1 ¨ (1, 0, 0); t1,1 ¨ (1, 0, 0), V1,1 ¨ ( ?, ?, ? ); V1 ¨ [ ?, ?, (1, 0, 0) ]
e32: P3 receives message m1,1 from P1. As V1,1[3] = ( ?, ?, ? )[3] is uninitialized, the message is accepted.
V3 ¨ [ ?, (0, 0, 1), ? ] and C3 ¨ max[(0, 0, 1), (1, 0, 0)] = (1, 0, 1).
e23: P2 sends message m2,2 to P1. C2 ¨ (0, 2, 1); t2,2 ¨ (0, 2, 1), V2,2 ¨ [ (0, 1, 1), ?, ? ]; V2 ¨ [ (0, 2, 1), ?, ? ]
e12: P1 receives message m2,2 from P2. But V2,2[1] = (0, 1, 1) 0, Wj > 0. (Wi' is the new weight of Pi.)
2. Send B(Wj) to Pj
Pj receives a computation message B(W) from Pi
1. Wj = Wj + W
2. If Pj is idle, Pj becomes active
Pi becomes idle:
1. Send C(Wi) to P0
2. Wi = 0
3. Pi becomes idle
Pi receives a control message C(W):
1. Wi = Wi + W
2. If Wi = 1, the computation has completed.
Example
The picture shows a process P0, designated the controlling agent, with
W0 = 1. It asks P1 and P2 to do some computation. It sets W1 to 0.2, W2 to
0.3, and W3 to 0.5. P2 in turn asks P3 and P4 to do some computations. It
sets W3 to 0.1 and W4 to 0.1.
When P3 terminates, it sends C(W3) = C(0.1) to P2, which changes W2 to
0.1 + 0.1 = 0.2.
When P2 terminates, it sends C(W2) = C(0.2) to P0, which changes W0 to
0.5 + 0.2 = 0.7.
When P4 terminates, it sends C(W4) = C(0.1) to P0, which changes W0 to
0.7 + 0.1 = 0.8.
When P1 terminates, it sends C(W1) = C(0.2) to P0, which changes W0 to
0.8 + 0.2 = 1.
P0 thereupon concludes that the computation is finished.
Total number of messages passed: 8 (one to start each computation, one to return the weight).