[Nets-seminars] Thursday, 20th Oct, 11 AM: SOSP practice talk by Joshua Leners (UT Austin)

Mon Oct 17 18:02:22 BST 2011

Greetings, everyone.

SOSP 2011, this year's instance of the biennial premier CS systems
research conference, is next week in Cascais, Portugal. It is my
pleasure to announce that Joshua Leners of UT Austin will give a
practice talk on his accepted SOSP paper here at UCL this Thursday at
11 AM. Joshua is advised by Mike Walfish, now an Assistant Professor
at UT Austin, whom many of you will remember from his term visiting
the Networks Research Group here at UCL CS in 2008.

Josh's work offers an exciting new approach to detecting failures in
distributed systems. Failure detection is a central and very
challenging problem, with major implications both for robustness and
performance.

This is a chance to hear a talk on state-of-the-art distributed
systems research, and also to help Joshua prepare for the notoriously
critical SOSP audience by asking hard questions and giving him
feedback on how to improve his talk.

All are strongly encouraged to attend! The full talk announcement follows.

See you there,
-Brad, bkarp at cs.ucl.ac.uk

------

Detecting Failures in Distributed Systems with the FALCON Spy Network

                           Joshua Leners
                   University of Texas at Austin

                     11 AM, 20th October 2011
                Room SB1, 188 Tottenham Court Road

Abstract:

A common way for a distributed system to tolerate crashes is to
explicitly detect them and then recover from them. Interestingly,
detection can take much longer than recovery, as a result of many
advances in recovery techniques, making failure detection the dominant
factor in these systems' unavailability when a crash occurs.

In this talk, I will present the design, implementation, and
evaluation of Falcon, a failure detector with several desirable
properties. First, Falcon's common-case detection time is sub-second,
which keeps unavailability low. Second, Falcon is reliable: it never
reports a process as down when it is actually up. Third, Falcon
sometimes kills to achieve reliable detection but aims to kill the
smallest needed component. Falcon achieves these features by
coordinating a network of *spies*, each monitoring a layer of the
system. Falcon's main cost is a small amount of platform-specific
logic. Falcon is thus the first failure detector that is fast,
reliable, and viable. As such, it could change the way that a class of
distributed systems is built.

Bio:

Joshua Leners is a 4th-year PhD student at UT Austin, advised by Mike
Walfish.