See also a list of classical papers in distributed systems by various authors
General Readings
- Overview Papers
- Andrew S. Tanenbaum and Robbert van Renesse, ``Distributed
Operating Systems'', ACM Computing Surveys, Vol. 17, No. 4, Pages 419-470,
December 1985
- E. Levy and A. Silberschatz, ``Distributed File Systems: Concepts and Examples'', ACM Computing Surveys, Vol. 22, No. 4, Pages 321-374, December 1990
- A. Tanenbaum, "Can we make operating systems reliable and secure?", IEEE Computer
- Tanenbaum on microkernels
- Also read the Linus Torvalds vs. Tanenbaum debate on OS kernels
- Barham et al. Xen and the Art of Virtualization. SOSP 2003
- Popek and Goldberg Formal Requirements for Virtualizable Third Generation Architectures CACM 1974
Readings for Chapter 2 Communication
- Remote Procedure Call
- Andrew Birrell and Bruce Nelson, Implementing Remote Procedure Calls, ACM Transactions on Computer Systems, Vol. 2, No. 1, Pages 39-59, February 1984.
- B. Bershad, T. Anderson, E. Lazowska, and H. Levy, Lightweight Remote Procedure Call,
Proceedings of the 12th ACM Symposium on Operating Systems Principles,
Operating Systems Review, Vol. 23, No. 5, Pages 102-113, December 1989
- Sun RPC documentation
- Java RMI documentation
- Google Protocol Buffers
Readings for Chapter 3 Processes
- Process and Thread Management
- Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy, The Performance Implications of Thread Management Alternatives for Shared-Memory Multiprocessors, IEEE Transactions on Computers, Vol. 38, No. 12, Pages 1631-1644, December 1989
- Scheduling
- D. L. Black, Scheduling Support for Concurrency and Parallelism in the Mach Operating System, IEEE Computer, 23, 5, Pages 35-43, May 1990.
- Process Migration
- F. Douglis and J. Ousterhout, "Process Migration in the Sprite Operating System: A Status Report"
- M. Theimer, K. Lantz, D. Cheriton, "Preemptable Remote Execution", Proceedings of the 10th SOSP, Operating Systems Review, Vol. 19, No. 5, Pages 2-12, December 1985
- The Worldwide Computer. An operating system spanning the Internet would
harness the power of millions of the world's networked PCs. Scientific American, February 2002
- Virtual Machine Live Migration
- Clark et al. Live Migration of Virtual Machines. NSDI 2005
- Post-copy migration Hines and Gopalan. Post-Copy Based Live Virtual Machine Migration Using Adaptive Pre-Paging and Dynamic Self-Ballooning. VEE 2009.
- Distributed Computing
- Condor: A Hunter of Idle Workstations, Proc of IEEE ICDCS 1988.
- Jim Basney and Miron Livny, "Deploying a High Throughput Computing Cluster", High Performance Cluster Computing,
Rajkumar Buyya, Editor, Vol. 1, Chapter 5, Prentice Hall PTR, May 1999.
More information about Condor is available at its homepage http://www.cs.wisc.edu/condor/
Readings for Chapter 4 Naming
- Butler Lampson, Designing a global name service. Proc. 4th ACM Symposium on Principles of Distributed Computing, Minaki, Ontario, 1986, pp 1-10
Readings for Chapter 5 Synchronization
- Synchronization
- Leslie Lamport, Michael Melliar-Smith, "Byzantine Clock Synchronization", Proceedings of the Third Annual ACM Symposium on Principles of Distributed Computing (August, 1984), 68-74.
- Leslie Lamport, "Synchronizing Time Servers", SRC Research Report 18 (June 1987).
- P. Ramanathan, K. G. Shin, and R. W. Butler, "Fault-tolerant clock synchronization in distributed systems", IEEE Computer, vol. 23, pp. 33-42, Oct. 1990
- Mills, D., "Network Time Protocol (Version 3)", RFC 1305, March 1992.
- Mills, D., "Improved Algorithms for Synchronizing Computer Network Clocks", IEEE/ACM Transactions on Networking, IEEE Communications Society, 1994
- Logical Clocks
- Leslie Lamport, "Time, Clocks and the Ordering of Events in a Distributed System",
Communications of the ACM 21, 7 (July 1978), 558-565. Reprinted in
several collections, including Distributed Computing: Concepts and
Implementations, McEntire et al., ed. IEEE Press, 1984.
- Global State
- K. Mani Chandy, Leslie Lamport, "Distributed snapshots: determining global states of distributed systems", ACM Transactions on Computer Systems (TOCS), Volume 3, Issue 1, Pages 63-75
- K. Mani Chandy, Jayadev Misra, "Termination Detection of Diffusing Computations in Communicating Sequential Processes", ACM Transactions on Programming Languages and Systems (TOPLAS), Volume 4, Issue 1 (January 1982), Pages 37-43
- Edsger W. Dijkstra, "Termination detection for diffusing computations", EWD 687a, 1979
- Mutual Exclusion
- K. Mani Chandy, Jayadev Misra, "A distributed algorithm for detecting resource deadlocks in distributed systems",
Proceedings of the first ACM SIGACT-SIGOPS Symposium on Principles of
Distributed Computing, Ottawa, Canada, Pages 157-164
- K. Mani Chandy, Jayadev Misra, Laura M. Haas, "Distributed deadlock detection", ACM Transactions on Computer Systems (TOCS), Volume 1, Issue 2 (May 1983), Pages 144-156
- K. Mani Chandy, Jayadev Misra, "The drinking philosophers problem",
ACM Transactions on Programming Languages and Systems (TOPLAS),
Volume 6, Issue 4 (October 1984), Lecture Notes in Computer Science
Vol. 174, Pages 632-646
- G. Ricart and A. K. Agrawala, "An Optimal Algorithm for Mutual Exclusion in Computer Networks", In Communications of the ACM, 24(1):9-17, January 1981
- Distributed Transactions
- P. A. Bernstein, V. Hadzilacos, and N. Goodman, "Concurrency Control and Recovery in Database Systems", Addison-Wesley, 1987
- Jim Gray, "The Transaction Concept, Virtues And Limitations", Proceedings of 7th VLDB, Cannes, France, 1981, pp. 144-154
- Butler Lampson, How to build a highly available system using consensus. In Distributed Algorithms, ed. Babaoglu and Marzullo, Lecture Notes in Computer Science 1151, Springer, 1996, pp 1-17
- C.A.R. Hoare. Communicating Sequential Processes. Prentice Hall, 1985
- Edsger W. Dijkstra, "A short introduction to the art of programming", EWD 316, 1971
- Edsger W. Dijkstra, "The humble programmer", EWD 340, 1972
- Leslie Lamport, "A New Solution of Dijkstra's Concurrent Programming Problem", Communications of the ACM 17, 8 (August 1974), 453-455.
- Leslie Lamport, "A New Approach to Proving the Correctness of Multiprocess Programs", ACM Transactions on Programming Languages and Systems 1, 1 (July 1979), 84-97.
- Leslie Lamport, Susan Owicki, "Proving Liveness Properties of Concurrent Programs", ACM Transactions on Programming Languages and Systems 4, 3 (July 1982), 455-495.
- Consensus
- Leslie Lamport. Paxos Made Simple
- Raft Ongaro and Ousterhout. In Search of an Understandable Consensus Algorithm. USENIX ATC 2014.
Readings for Chapter 6 Consistency and Replication
- Distribution Protocols
- C. Gray and D. Cheriton, "Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency", Proceedings of the 12th ACM Symposium on Operating Systems Principles, 1989
- K. Petersen, M. J. Spreitzer, D. B. Terry, M. M. Theimer, and A. Demers, "Flexible Update Propagation for Weakly Consistent Replication", Proc. of the 16th ACM Symposium on Operating Systems Principles (SOSP-16), Saint-Malo, France, October 5-8, 1997, pp. 288-301.
- A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, and D. Terry, "Epidemic algorithms for replicated database maintenance", In PODC, 1987.
- Consistency Protocols
- Gifford, D., "Weighted voting for replicated data", In: Proceedings of the 7th ACM Symposium on Operating System Principles (1979), 150-162
- Modern Systems
- MapReduce Dean and Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004
- Spark Zaharia et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing . NSDI 2012
- Spanner Corbett et al. Spanner: Google’s Globally-Distributed Database. OSDI 2012.
- Big Table Chang et al. Bigtable: A Distributed Storage System for Structured Data. OSDI 2006
- DynamoDB DeCandia et al. Dynamo: Amazon’s Highly Available Key-value Store. SOSP 2007
- Bayou Terry et al. Managing Update Conflicts in Bayou,
a Weakly Connected Replicated Storage System. SOSP 1995
- Bitcoin Satoshi Nakamoto. Bitcoin: A Peer-to-Peer Electronic Cash System
- Akamai Dilley et al. Globally distributed content delivery. IEEE Internet Computing. September-October 2002.
- Distributed Hash Tables Gribble et al. Scalable, Distributed Data Structures for Internet Service Construction. OSDI 2000.
- Scaling applications
- Nishtala et al. Scaling Memcache at Facebook NSDI 2013
- Scaling Wikipedia
- Netflix Scalability Stack
- Scaling Reddit
- Scaling LinkedIn
Readings for Chapter 7 Fault Tolerance
- Edsger W. Dijkstra, "Self-stabilization in spite of distributed control", EWD 391, 1973
- Edsger W. Dijkstra, "Self-stabilizing systems in spite of distributed control", EWD 426, 1974
- Jim Gray, Leslie Lamport, "Consensus on Transaction Commit", MSR-TR-2003-96, January 2004, 32 p.
- Jim Gray, "Why Do Computers Stop and What Can Be Done About It?", 6th International Conference on Reliability and Distributed Databases, June 1987
- Jim Gray, "Notes on Database Operating Systems",
Operating Systems, an Advanced Course, Bayer et al., eds., Lecture
notes in Computer Science 60, Springer-Verlag, 1978, pp. 393-481.
- Leslie Lamport, Marshall Pease, Robert Shostak, "The Byzantine Generals Problem", ACM Transactions on Programming Languages and Systems 4, 3 (July 1982), 382-401.
- Leslie Lamport, "The Part-Time Parliament", ACM Transactions on Computer Systems 16, 2 (May 1998), 133-169.
- Atomic Multicast
- Kenneth P. Birman and Thomas Joseph, "Exploiting Virtual Synchrony in distributed systems", In Proceedings of the 11th ACM Symposium on Operating Systems Principles, pages 123-138, Austin, Texas, November 1987
- André Schiper, Kenneth Birman, Pat Stephenson, "Lightweight causal and atomic group multicast", ACM Transactions on Computer Systems (TOCS), Volume 9, Issue 3, Pages 272-314, 1991
- Practical Byzantine Fault Tolerance Castro and Liskov. OSDI 1999.
Readings for Chapter 8 Security
- Butler Lampson, M. Abadi, M. Burrows, E. Wobber. "Authentication in distributed systems: Theory and practice", ACM Trans. Computer Systems 10, 4 (Nov. 1992), pp 265-310
Readings for Chapter 10: Distributed File Systems
- NFS
- NFS Version 4, as specified in the NFS v4 RFC.
- Zebra Network File System
- Serverless Network File System
- Google File System Ghemawat et al. The Google File System. SOSP 2003.
- CAP Theorem Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. PODC 2002.