CS199-6 Project Ideas

Fast and robust wide-area remote execution

Contact Brent Chun (bnc@intel-research.net) if you have any questions about this project.

Deploying and managing wide-area services on PlanetLab requires coordination and control of geographically dispersed sets of computational resources. Currently, such tasks are performed using the Secure Shell Protocol (SSH) as a base primitive and by layering various shell scripts on top. This approach, while straightforward and practical, suffers along at least two dimensions. First, performance is abysmal. Each connection requires establishment of a new SSH session, and the client is required to manage N such sessions using some combination of serial and parallel connections. Second, control over large sets of remote processes is flaky at best. For example, a Ctrl-C will usually result in remote processes being killed; at other times, straggler processes remain. If the SSH client exits in an unexpected way (e.g., it is sent a SIGKILL, the node crashes, etc.), the results are even less predictable.

The dual problems of coordination and control of a set of distributed resources have existed for quite some time in the parallel computing community. A recent example of a system that addresses these problems is the GEXEC remote execution system developed at Caltech. GEXEC provides fast, RSA-authenticated remote execution of parallel and distributed jobs on a cluster of machines. Internally, it builds an overlay, an n-ary tree of TCP sockets and threads, which it uses to propagate control and data up and down the tree between the client and the remote nodes running a particular parallel program. By using hierarchical control, it distributes both the work and the resource usage associated with large amounts of parallelism across multiple nodes, thereby eliminating problems associated with single-node resource limits (e.g., limits on the number of file descriptors on front-end nodes) and achieving significant speedups as well.
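
To make the hierarchical-control idea concrete, here is a minimal sketch (not GEXEC's actual code) of how a node in a flat list of N nodes could compute its children under a fixed fanout; each node would then open TCP connections only to those children and relay control and data through them:

    /*
     * A minimal sketch (not GEXEC's actual code) of the n-ary control tree.
     * Given a flat list of nnodes hosts and a fixed fanout, node i forwards
     * control and data to nodes FANOUT*i+1 .. FANOUT*i+FANOUT, spreading
     * connections and work over the whole tree instead of a single client.
     */
    #include <stdio.h>

    #define FANOUT 4            /* fanout is an assumption, not GEXEC's value */

    /* Store the child indices of node 'self'; return how many it has. */
    int tree_children(int self, int nnodes, int children[FANOUT])
    {
        int count = 0;
        for (int j = 1; j <= FANOUT; j++) {
            int child = FANOUT * self + j;
            if (child < nnodes)
                children[count++] = child;
        }
        return count;
    }

    int main(void)
    {
        int children[FANOUT];
        int n = tree_children(0, 21, children);   /* root of a 21-node job */
        for (int i = 0; i < n; i++)
            printf("root relays to node %d\n", children[i]);
        return 0;
    }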

The goal of this project is to build a fast and robust wide-area remote execution system for PlanetLab. It will involve leveraging the techniques used in GEXEC for scalability, constructing bandwidth/latency aware overlays (which are important since we cannot assume a fast local area network), and augmenting the approach taken in GEXEC to account for node and network failures to make it robust. Other differences that must be taken into account for the wide-area case include eliminating assumptions of a uniform UID/GID namespace and a shared filesystem such as NFS or AFS. The former might be addressed by bootstrapping a new RSA key pair over an initial SSH session to all nodes. (The cost of this gets amortized over multiple uses of the fast remote execution system.) The latter might be addressed by utilizing a scalable file distribution system (e.g., see related project on this below).

plush: The PlanetLab distributed shell

Contact Brent Chun (bnc@intel-research.net) if you have any questions about this project.

Since their introduction over three decades ago, Unix shells have experienced enormous success. Two of the key reasons for this success have been the shell's ability to expose underlying OS abstractions and capabilities in a convenient manner and its mechanisms for facilitating composition and reuse of simple commands and procedures via the shell's underlying scripting language. For example, interprocess communication between two processes via a Unix pipe is achieved in the shell by simply separating two commands with the '|' character. Similarly, I/O redirection (i.e., >, >>, >&, etc.) and managing sets of processes and process groups are also vastly easier using the shell, as opposed to having to write special programs to perform these tasks using the operating system's underlying system calls.
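
As a reminder of how much the shell hides, here is a rough sketch of what a single pipeline like "ls | wc -l" costs when written directly against the underlying POSIX system calls (error handling omitted):

    /*
     * A sketch of "ls | wc -l" without the shell, written directly against
     * the POSIX system calls (error handling omitted for brevity).
     */
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void)
    {
        int fd[2];
        pipe(fd);                          /* fd[0]: read end, fd[1]: write end */

        if (fork() == 0) {                 /* child 1: ls, stdout into the pipe */
            dup2(fd[1], STDOUT_FILENO);
            close(fd[0]); close(fd[1]);
            execlp("ls", "ls", (char *)0);
        }
        if (fork() == 0) {                 /* child 2: wc -l, stdin from the pipe */
            dup2(fd[0], STDIN_FILENO);
            close(fd[0]); close(fd[1]);
            execlp("wc", "wc", "-l", (char *)0);
        }
        close(fd[0]); close(fd[1]);        /* parent closes both ends */
        wait(NULL); wait(NULL);
        return 0;
    }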

In the spirit of the original Unix shell, the goal of this project is to write plush, the PlanetLab distributed shell. Whereas traditional Unix shells serve as the interface to an operating system running on a single node, plush will serve as the interface to networks of virtual machines running across the wide area. Similar to the original design of the Unix shell, the key goals in designing plush are exposing underlying abstractions (including core PlanetLab services) in a convenient manner and facilitating reuse and composition of common tasks. As part of the implementation, plush should make it easy for users to deploy, manage, and interact with PlanetLab services. A significant piece of implementing this will involve communication and coordination with multiple PlanetLab nodes either directly or indirectly through other PlanetLab services.

A good example of a system that has also attempted to build a shell as an interface to a distributed system (and hence might be a good starting point) is JXTA. JXTA is a network programming and computing platform designed to solve a number of problems in modern distributed computing, in particular in the realm of P2P applications. The JXTA shell provides a convenient interface to the JXTA system by making it easy for users to interact with its underlying abstractions and to perform common tasks (e.g., joining, leaving, and discovering peer groups). Analogous to the JXTA shell, we would like plush to be a convenient interface to PlanetLab and to make it easy for users to deploy, manage, discover, and interact with wide area network services. If successful, plush could substantially reduce the barrier to entry to using PlanetLab as well as facilitate rapid development of increasingly complex PlanetLab services.

Parallel pipelined wide-area file distribution

Contact Brent Chun (bnc@intel-research.net) if you have any questions about this project.

File distribution to a collection of remote nodes is a common task when developing and deploying wide-area applications. For example, binaries and related files often need to be pushed out to nodes in a slice before a service can be instantiated. Today on PlanetLab, this task is accomplished by building on top of the SSH protocol. That is, files are distributed by essentially iterating over a set of nodes (potentially in parallel) and using the scp program to push a set of files out. Distributed filesystems simplify this task, but still may suffer from performance problems since a common pattern during development is a write followed immediately by a set of N simultaneous reads. For example, while AFS employs client-side caching for performance, all readers would still need to initially perform a read on a central server hosting the relevant files.

Another way to view this common case of a write followed immediately by multiple reads is as a reliable multicast problem. We can view the node pushing the update out as the sender and the nodes receiving the updated files as receivers in a multicast group. Taking this view, one possible way to approach the problem might be to simply create a multicast group, perform a barrier (i.e., wait until all nodes are ready to receive), and multicast the file to all receivers. Unfortunately, native IP multicast on the Internet is still not widely deployed (and is unlikely to ever be) and, additionally, does not guarantee reliable data delivery. An alternative approach, and one which has seen significant recent research activity, is to build an application-level multicast overlay, where the participating nodes are responsible for forwarding packets to the other members of the group (e.g., over a tree).

The goal of this project is to build the fastest wide-area file distribution system in existence using an application-level overlay. We ask the question: what is the fastest way to push a sequence of bytes out to N nodes in a wide-area network? Naturally, a system that addresses this question will need to employ various types of parallelism. For example, it might build an overlay out of the nodes as a tree and exploit parallelism across multiple paths and multiple levels in the tree (i.e., pipelining the data). Furthermore, because it operates in the wide area, it will need to build the overlay to account for available bandwidth between the different nodes in the tree (i.e., it cannot assume a fast local area network). Finally, because not all bits in the files being distributed change between updates, the system may also want to exploit sending deltas, as opposed to entire files. A few useful starting points for examining this problem include Overcast, PCP, and the rsync program.
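
For instance, a sketch of the per-node forwarding loop in such a pipelined tree might look like the following (sockets are assumed to be connected already; overlay construction and failure recovery, which the real system needs, are omitted):

    /*
     * A sketch of the per-node forwarding loop in a pipelined distribution
     * tree. Sockets are assumed to be connected already; error recovery and
     * overlay construction are omitted.
     */
    #include <unistd.h>

    #define CHUNK 65536

    void forward(int parent_fd, int child_fds[], int nchildren)
    {
        char buf[CHUNK];
        ssize_t n;

        /* Push each chunk to the children as soon as it arrives, so data
           flows through every level of the tree concurrently. */
        while ((n = read(parent_fd, buf, sizeof(buf))) > 0) {
            for (int i = 0; i < nchildren; i++) {
                ssize_t off = 0;
                while (off < n) {          /* socket writes may be partial */
                    ssize_t w = write(child_fds[i], buf + off, n - off);
                    if (w <= 0)
                        return;            /* child failed; tree must be repaired */
                    off += w;
                }
            }
        }
    }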

Distributed ephemeral log service

Contact Timothy Roscoe (troscoe@intel-research.net) or Matt Welsh (mdw@cs.berkeley.edu) if you have any questions about this project.

Planetary-scale services generally need to "publish" information either to the outside world or to other nodes running the same service. For example, Brent's NetBait (http://netbait.planet-lab.org) service collects information on worm infections from every PlanetLab node, and currently publishes it to a single Web page -- hardly very scalable. Lots of services want to share information between nodes, but unfortunately we don't have a good mechanism for doing this. We're not going to mount a common NFS filesystem on every PlanetLab node, and many existing techniques (sending large log files to a central location) are heavyweight.

At the same time we don't need a shared "filesystem" between nodes - that would imply that any node can modify any file at any time, and that all nodes must have the same view of every file's contents. Alas, this is very hard to get right, though several research projects are working on it. For many services, we can get away with something much leaner than a filesystem -- such as a distributed append-only log.

Such a system provides two operations:

    /* Write a record to the log from the 'data' buffer for 'length' bytes */
    write_log_record(char *data, int length);
and
    /* 
     * Read a record from the log of node 'nodeaddr' with up to '*length' 
     * bytes (the actual length will be stored in '*length')
     */
    char *read_log_record(node_address_t nodeaddr, int *length);
Any node can write to its "own" log, and any node can read from the log of other nodes. How do you implement this efficiently? One way is that every "write" operation sends the new log entry to some set of nearby nodes. When a "read" operation occurs, a node checks to see if it has any data locally, and if not asks some other nodes around it. Several extensions are possible, such as associating a user-defined key with each record, allowing a program to read only those log entries matching a certain key.
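
A service might use the two calls roughly as follows ('peer_addr' and 'process_record' are placeholders for illustration):

    /* A sketch of how a service might use the two calls; 'peer_addr' and
       'process_record' are placeholders for illustration. */
    char msg[] = "worm probe observed from 10.0.0.1";
    write_log_record(msg, sizeof(msg));            /* append to our own log */

    int len = 4096;                                /* largest record we accept */
    char *rec = read_log_record(peer_addr, &len);  /* 'len' now holds the actual size */
    if (rec != NULL)
        process_record(rec, len);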

The beauty of this system is that each node can simply throw away "old" log entries when it no longer wishes to store them; the semantics of read_log_record() are that only "recent" log entries are returned. For example, the Web server hosting NetBait infection listings needs to periodically call read_log_record() to fetch the latest updates, but no other node needs to store log entries permanently. Talk to Mothy about the Palimpsest storage system for more information on this kind of thing.

TRON: Tornado-encoded resilient overlay networks

Contact Brent Chun (bnc@intel-research.net) and Aki Nakao (nakao@cs.princeton.edu) if you have any questions about this project.

Tornado codes are erasure codes (also called forward error correction codes) which are both computationally inexpensive and space-efficient. Given a piece of data consisting of k packets, the basic idea is that a sender encodes the data and sends n packets (n > k) to a receiver, with the property that any k of the packets received are enough to reconstruct the original data. One practical use of this property is the design of file transfer protocols which deliver good end-to-end bandwidth between a sender and a receiver (e.g., see Digital Fountain). For example, consider a file transfer protocol that uses Tornado encoding on the data and UDP as the underlying transport protocol. Since the data is Tornado-encoded, packet loss due to UDP's unreliability can be tolerated. At the same time, since UDP is used as the transport, end-to-end bandwidth is less sensitive to "background" packet loss. By comparison, file transfer protocols built on top of TCP would see considerably lower throughput in the presence of even small amounts of packet loss, since TCP backs off aggressively in response to loss.
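
The sender side of such a protocol might look roughly like this; tornado_encode below is a hypothetical stand-in for a real encoder, not an existing library call:

    /*
     * A sketch of the sender; tornado_encode is a hypothetical stand-in for
     * a real encoder, and packet framing is simplified for illustration.
     */
    #include <sys/socket.h>

    struct packet { int index; char payload[1400]; };

    /* hypothetical: split 'data' into k source packets, return n > k encoded
       packets, any k of which are enough to reconstruct the original data */
    struct packet *tornado_encode(const char *data, int len, int *k, int *n);

    void send_file(int udp_sock, const char *data, int len)
    {
        int k, n;
        struct packet *pkts = tornado_encode(data, len, &k, &n);

        /* No retransmission logic: losing some packets is fine as long as
           any k of the n reach the receiver. */
        for (int i = 0; i < n; i++)
            send(udp_sock, &pkts[i], sizeof(pkts[i]), 0);
    }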

Resilient overlay networks (RONs) form routing overlays on top of the existing Internet's routing infrastructure. Through continuous monitoring of network paths between overlay nodes, RONs are capable of detecting and recovering from path outages and periods of degraded performance within several seconds, as opposed to minutes or more with the existing Internet routing substrate. Furthermore, by measuring the quality of network paths between RON nodes, a RON can choose to route packets directly over the Internet or by way of other RON nodes, optimizing application-specific routing metrics.

The goal of this project is to marry Tornado encoding with RONs and build a file transfer service for PlanetLab that is both fast and robust to network failures. The service will provide transparent Tornado encoding and encryption of application data and route the encoded data as UDP packets over a RON on PlanetLab. The project will involve writing a client and a server, modifying an existing RON, deploying the service on PlanetLab, and (time permitting) measuring its performance in the real world. Aki Nakao at Princeton has already constructed a system (described in this paper) which provides general-purpose mechanisms for building service overlays and has deployed instances of such overlays on PlanetLab (e.g., an overlay that routes web traffic more reliably). The routing overlay for the file transfer service's Tornado-encoded packets would simply be another instance of such an overlay.

Visualization of service overlays

Contact Brent Chun (bnc@intel-research.net) if you have any questions about this project.

The power of visualization as a means for analyzing, comprehending, and explaining the phenomena underlying large complex data sets is not to be underestimated. Its application to various application domains within computer science, in particular computer networks, has no doubt been an invaluable, if not simply very cool, tool for understanding the structure and behavior of complex systems (e.g., see the Internet Mapping Project, CAIDA's visualization tools, and the Atlas of Cyberspace).

One class of emerging planetary-scale applications that has been gaining popularity recently is service overlays. Examples include resilient transport, distributed object location, peer-to-peer storage, and multicast. Most of these services behave in essentially the same way: participants collectively and cooperatively implement some service for the benefit of all overlay members. On the other hand, the underlying distributed algorithms employed by these services, and the resulting behavior of those algorithms under different workloads and real-world phenomena such as network congestion, routing failures, and node failures, can vary tremendously. Furthermore, despite the common need to build and manage a network overlay, each of these services really is providing a different high-level service.

The goal of this project is to develop a visualization system that helps us understand the behavior of service overlays on a large scale at several different levels. For example, Aki Nakao has a port of MIT's Resilient Overlay Network to PlanetLab. One possible visualization project might involve deploying a RON on PlanetLab to improve the reliability and performance of web surfing and developing a system to visualize its behavior over time in response to real usage and network congestion. At the lowest level, the system might provide a network map that reflects the current state of link quality amongst the different overlay participants (e.g., bandwidth, latency, loss). At the next level, the system might provide a view of the different overlay routes from a source to a given destination (i.e., what's the best overlay route to a given web site?). At the next level, the system might gather statistics on usage and visualize this by overlaying it onto a graphical representation of the overlay's network topology.

Measurement and visualization of PlanetLab connectivity

Contact Timothy Roscoe (troscoe@intel-research.net) if you have any questions about this project.

Partly due to its age (or lack of it), very little is known about how network connectivity between PlanetLab nodes varies over time. Overlay networks, P2P systems, distributed hash tables, and other techniques are designed to route around network and node failures; however, it is unclear whether these systems work well under realistic, long-term network behavior. It's important to obtain a good set of long-term traces of node and link lifetimes, and to use those traces to evaluate overlay network schemes.

The goal of this project is to build a distributed monitoring system that measures the liveness of links between nodes on PlanetLab, and produces a long-term trace of network behavior as observed from those nodes. Several "interfaces" to this data are needed:

  1. a graphical visualization which displays connectivity and connection latency between PlanetLab nodes,
  2. a service, available to an application running on a PlanetLab node, that allows it to obtain the information in a useful format, and
  3. a historical database which collects the data over a long time period so that trends can be analyzed.
There are several things that could be measured, for instance ping times or null RPC times between measurement processes on different nodes. It would be interesting to display such measurements relative to geographical distance, which can be obtained since we know where each PlanetLab node is on the Earth's surface. Relating the data to the Autonomous Systems (ASes) where the nodes reside would also be a good application-level empirical complement to all the Internet measurement research already in progress out there.
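
For example, a single null-RPC probe over an already-connected TCP socket could be timed as follows (a sketch; a real monitor would also need connection setup, timeouts, and a measurement schedule):

    /*
     * A sketch of a single probe: time a one-byte "null RPC" round trip over
     * an already-connected TCP socket.
     */
    #include <sys/time.h>
    #include <unistd.h>

    /* Returns the round-trip time in milliseconds, or -1.0 on failure. */
    double null_rpc_ms(int sock)
    {
        struct timeval t0, t1;
        char ping = 'p', pong;

        gettimeofday(&t0, NULL);
        if (write(sock, &ping, 1) != 1 || read(sock, &pong, 1) != 1)
            return -1.0;                   /* peer unreachable: log a failure */
        gettimeofday(&t1, NULL);

        return (t1.tv_sec - t0.tv_sec) * 1000.0 +
               (t1.tv_usec - t0.tv_usec) / 1000.0;
    }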

Making the service scalable also presents challenges as the number of nodes in PlanetLab grows (and it will during the course of the class). It's clearly a good idea to aggregate/share information between all the nodes at a single site, for instance.

StealthWeb - Censorship-resistant publication of Web documents

Contact Matt Welsh (mdw@cs.berkeley.edu) or Timothy Roscoe (troscoe@intel-research.net) if you have any questions about this project.

Political dissidents, governments-in-exile, and those harboring unpopular opinions face the risk of censorship, even on the Internet. The German government has forced several Web sites to shut down because they contained material that was deemed illegal in that country. The goal of this project is to build a distributed Web server that makes it hard for information to be removed -- the very act of removing data from one site causes it to be rebuilt elsewhere.

While several P2P systems (such as FreeNet) have attempted to provide censor-proof data dissemination, to date these systems require using arcane command-line tools and essentially only store flat text files. What we really need is a distributed Web server that can host censor-proof content, complete with hyperlinks, that can be used from any Web browser.

A major component of this will be some kind of proxy service that translates from HTTP and URL addresses into requests for pages that are stored across a wide range of back-end machines. A user can either download and run the proxy on their local machine, or point to a number of "public" proxies -- all that's needed is to configure the Web browser to use the local or remote HTTP proxy. You can use erasure coding techniques to scatter "fragments" of stored Web pages across many nodes, or use a simpler replication strategy. In any case, Web pages published on the StealthWeb should always be available even if a significant fraction of the back-end Web servers are taken down. Ideally, bringing up a new StealthWeb server automatically replicates some portion of the hosted web pages, making it hard to "bring down" the entire StealthWeb.
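
One simple (and purely illustrative) way the proxy might decide where a page's fragments live is to hash the URL with successive salts and index into the list of back-end servers; the constants and the hash function below are assumptions, and a real system would use a cryptographic hash:

    /*
     * A purely illustrative sketch of fragment placement: hash the URL with
     * successive salts and index into the list of back-end servers.
     */
    #define NFRAGMENTS 8            /* fragments per page (assumed)          */
    #define NSERVERS   64           /* back-end StealthWeb servers (assumed) */

    static unsigned long hash_str(const char *s, unsigned long seed)
    {
        unsigned long h = seed;
        while (*s)
            h = h * 31 + (unsigned char)*s++;
        return h;
    }

    /* Fill 'servers' with the index of the node holding each fragment. */
    void fragment_servers(const char *url, int servers[NFRAGMENTS])
    {
        for (int i = 0; i < NFRAGMENTS; i++)
            servers[i] = (int)(hash_str(url, i + 1) % NSERVERS);
    }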

Self-healing wide area network services

Contact Brent Chun (bnc@intel-research.net) if you have any questions about this project.

One of the key challenges in maintaining a highly available wide-area network service is being able to automatically detect and respond to faults in a timely manner. For example, consider Akamai's content distribution network, which serves web content for major sites such as AOL, Yahoo, and CNN and relies on an overlay network consisting of 13,000 servers scattered around the world. In such a system, failures in both software and hardware are inevitable due to the vast number of components involved. Manually detecting dead servers and starting replacements becomes infeasible in such a system, not to mention being both time consuming and highly error prone. Instead, what you would really like is a system that is capable of both detecting failures in a distributed manner and automatically repairing them without human intervention.

Interestingly, one class of distributed "applications" that has this property and has been shown to be highly resilient in the presence of failures is Internet worms (see Code Red v2 and Nimda). For example, despite widespread filtering of worm replication traffic at ISPs and the availability of software patches that fix the underlying vulnerabilities exploited by these worms, all of the high-profile worms of recent years (e.g., Code Red, Nimda) continue to survive on the Internet. The ability of these worms to automatically replicate themselves on additional machines is a powerful capability. Early research in the early 1980s on worm programs that performed useful system functions recognized this as well, along with the associated risks of self-replicating programs.

"Even worse, however, were the unstable worms, which suddenly seemed to grow out of control..."
-- "The 'Worm' Programs -- Early Experience with a Distributed Computation" (March 1982)

The goal of this project is to leverage the power of self-replicating code to construct a framework for building highly available wide-area network services that other applications can use. Much of the related work in this area has mainly focused on systems running on local area networks, where networks are fast and network partitions are rare. Furthermore, despite all this work, there has yet to emerge a widely used generic framework that provides automatic fault detection and repair for either local or wide-area network services. This project aims to build such a framework by leveraging dynamic slices on PlanetLab and using them as a distributed sandbox for service replication and containment. The end result should be a new PlanetLab service that automatically maintains some desired level of availability for specific PlanetLab services by detecting faults and replicating services on additional nodes as needed. One starting point for this might be George Candea's work on Medusa.

DIM: Decentralized instant messaging

Contact Timothy Roscoe (troscoe@intel-research.net) if you have any questions about this project.

Instant Messaging Systems have actually been around for ages but have achieved huge popularity in the last few years.

Instant messaging implemented over PlanetLab has a number of advantages over existing solutions beyond the ability to flame other team members without AOL or MSN censoring your insults. In particular, messages can be generated by things other than human beings, and sent to things other than human beings. One can imagine implementing some plush-like features using IM, for instance, or having some PlanetLab events generate Instant Messages which are then sent to some combination of interested humans and software logging engines.

Simply deploying Jabber servers on PlanetLab would give us a lot of this, but we can do better. User identifiers on Jabber are tied to a specific Jabber server domain (much like email addresses). By using a Distributed Hash Table to store rendezvous points for logged-in IM users, we can implement the Jabber protocol (or something like it) while allowing a user to connect to any server they like in a scalable manner.
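
A sketch of the rendezvous scheme, assuming a hypothetical DHT interface (dht_put/dht_get below are placeholders, not an existing library):

    /*
     * A sketch of the rendezvous scheme. dht_put and dht_get are hypothetical
     * placeholders for whatever DHT the project ends up using.
     */
    typedef struct { char host[64]; int port; } server_addr_t;

    /* hypothetical DHT operations, keyed by the user identifier */
    int dht_put(const char *key, const void *val, int len);
    int dht_get(const char *key, void *val, int maxlen);

    /* On login, the server the user connected to publishes itself as that
       user's rendezvous point. */
    void on_login(const char *user, const server_addr_t *myself)
    {
        dht_put(user, myself, sizeof(*myself));
    }

    /* To deliver a message, any server looks up where the user currently is. */
    int route_message(const char *to_user, server_addr_t *out)
    {
        return dht_get(to_user, out, sizeof(*out));
    }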

This poses some problems for fault tolerance, since we can't afford to have a user become unreachable just because a single PlanetLab node falls down. Some kind of replication strategy for finding a user's rendezvous point is therefore required. It would also be useful for users to be able to select either a rendezvous point which is "close" to them (in terms of latency), or a server which is close, or both.

The goal of this project would be a workable, "any-server to any-client" (machine) instant messaging infrastructure over PlanetLab. Adopting an existing client-server protocol like Jabber allows existing clients to work and reduces the work on less interesting bits of the project - the meat of the task is in using a DHT to route messages to the right server at the right moment. Performance should be measured in terms of throughput, latency, time to recover from node failures, and how long it takes to move a client from one server to another.

A Steganographic Storage System

Contact Timothy Roscoe (troscoe@intel-research.net) if you have any questions about this project.

The Mnemosyne system provided steganographic storage of data. What does this mean?

Well, a traditional encrypting file system makes it hard for an adversary to read the contents of files without a key. In contrast, a steganographic storage system has the property that an adversary can't even determine the existence of files without the key. Steganographic systems are intended, among other things, to be resistant to "rubber hose cryptanalysis". It's not necessarily about hiding messages in pictures.

Mnemosyne provided steganographic properties by dispersing information into a large, wide-area distributed block store in a way that was hard for an adversary to predict or reconstruct, since the system was widely shared. This introduced a further problem of overwriting other people's data, since by definition users could not know where that data was. This was addressed using erasure (or loss) codes.
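
One illustrative way to get such an unpredictable placement is to derive each block's location by repeatedly hashing a secret key, so only the key holder can regenerate (or even recognize) the sequence; the hash and constants below are assumptions, not Mnemosyne's actual scheme:

    /*
     * An illustrative sketch (not Mnemosyne's actual scheme): derive the
     * location of each block by repeatedly hashing a secret key. A real
     * system would use a keyed cryptographic hash.
     */
    #define NSTORES 128              /* number of block stores (assumed) */

    static unsigned long mix(unsigned long x)
    {
        x ^= x >> 13;
        x *= 2654435761UL;           /* Knuth's multiplicative constant */
        x ^= x >> 17;
        return x;
    }

    /* Which block store holds the i-th block written under 'secret_key'. */
    int block_location(unsigned long secret_key, int i)
    {
        unsigned long h = secret_key;
        for (int j = 0; j <= i; j++)
            h = mix(h + j);
        return (int)(h % NSTORES);
    }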

Unfortunately, the source code to Mnemosyne is not available. This project would attempt to implement the basic functionality of such a system, using block stores on PlanetLab nodes. The goal is a system where users can connect to any set of nodes in the system, and route erasure-coded data anonymously to a pseudo-random sequence of block stores, possibly using a mix net.