See also a list of classical papers in distributed systems by various authors
- Overview Papers
- Andrew S. Tanenbaum and Robbert van Renesse, ``Distributed
Operating Systems'', Computing Surveys, Vol. 17, No. 4, Pages 419-470, December 1985
- E. Levy and A. Silberschatz, ``Distributed File Systems: Concepts and Examples'', ACM Computing Surveys, Vol. 22, No. 4, Pages 321-374, December 1990
- A. Tanenbaum, "Can we make operating systems reliable and secure?", IEEE Computer
- Tanenbaum on
- Also read the Linus Torvalds vs. Tanenbaum debate on OS kernels
- Barham et al. Xen and the Art of Virtualization. SOSP 2003
- Popek and Goldberg. Formal Requirements for Virtualizable Third Generation Architectures. CACM 1974
Readings for Chapter 2 Communication
- Remote Procedure Call
- Andrew Birrell and Bruce Nelson, Implementing Remote Procedure Calls, ACM Transactions on Computer Systems, Vol. 2, No. 1, Pages 39-59, February 1984.
- B. Bershad, T. Anderson, E. Lazowska, and H. Levy, Lightweight Remote Procedure Call,
Proceedings of the 12th ACM Symposium on Operating Systems Principles,
Operating Systems Review, Vol. 23, No. 5, Pages 102-113, December 1989
- Sun RPC documentation
- Java RMI documentation
- Google Protocol Buffers
Readings for Chapter 3 Processes
- Process and Thread Management
- Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy, The Performance Implications of Thread Management Alternatives for Shared-Memory Multiprocessors, IEEE Transactions on Computers, Vol. 38, No. 12, Pages 1631-1644, December 1989
- D. L. Black, Scheduling Support for Concurrency and Parallelism in the Mach Operating System, IEEE Computer, Vol. 23, No. 5, Pages 35-43, May 1990.
- Process Migration
- F. Douglis and J. Ousterhout, "Process Migration in the Sprite Operating System: A Status Report"
- M. Theimer, K. Lantz, D. Cheriton, "Preemptable Remote Execution", Proceedings of the 10th SOSP, Operating Systems Review, Vol. 19, No. 5, Pages 2-12, December 1985
- Worldwide Computer. An operating system spanning the Internet would
harness the power of millions of the world's networked PCs. Scientific American, February 2002
- Virtual Machine Live Migration
- Clark et al. Live Migration of Virtual Machines. NSDI 2005
- Post-copy migration: Hines and Gopalan. Post-Copy Based Live Virtual Machine Migration Using Adaptive Pre-Paging and Dynamic Self-Ballooning. VEE 2009.
- Distributed Computing
- Condor: A Hunter of Idle Workstations, Proc of IEEE ICDCS 1988.
- Jim Basney and Miron Livny, "Deploying a High Throughput Computing Cluster", High Performance Cluster Computing,
Rajkumar Buyya, Editor, Vol. 1, Chapter 5, Prentice Hall PTR, May 1999.
More information about Condor is available at its homepage http://www.cs.wisc.edu/condor/
Readings for Chapter 4 Naming
- Butler Lampson, Designing a global name service. Proc. 4th ACM Symposium on Principles of Distributed Computing, Minaki, Ontario, 1986, pp 1-10
Readings for Chapter 5 Synchronization
- Leslie Lamport, Michael Melliar-Smith, "Byzantine Clock Synchronization", Proceedings of the Third Annual ACM Symposium on Principles of Distributed Computing (August, 1984), 68-74.
- Leslie Lamport, "Synchronizing Time Servers", SRC Research Report 18 (June 1987).
- P. Ramanathan, K. G. Shin, and R. W. Butler, "Fault-tolerant clock synchronization in distributed systems", IEEE Computer, vol. 23, pp. 33-42, Oct. 1990
- Mills, D., "Network Time Protocol (Version 3)", RFC 1305, March 1992.
- Mills, D., "Improved Algorithms for Synchronizing Computer Network Clocks", IEEE/ACM Transactions on Networking, IEEE Communications Society, 1994
- Leslie Lamport, "Time, Clocks and the Ordering of Events in a Distributed System",
Communications of the ACM 21, 7 (July 1978), 558-565. Reprinted in
several collections, including Distributed Computing: Concepts and
Implementations, McEntire et al., ed. IEEE Press, 1984.
- K. Mani Chandy, Leslie Lamport, "Distributed snapshots: determining global states of distributed systems", ACM Transactions on Computer Systems (TOCS), Volume 3, Issue 1, Pages 63-75
- K. Mani Chandy, Jayadev Misra, "Termination Detection of Diffusing Computations in Communicating Sequential Processes", ACM Transactions on Programming Languages and Systems (TOPLAS), Volume 4, Issue 1 (January 1982), Pages 37-43
- Edsger W. Dijkstra, "Termination detection for diffusing computations", EWD 687a, 1979
- K. Mani Chandy, Jayadev Misra, "A distributed algorithm for detecting resource deadlocks in distributed systems",
Proceedings of the First ACM SIGACT-SIGOPS Symposium on Principles of
Distributed Computing, Ottawa, Canada, Pages 157-164
- K. Mani Chandy, Jayadev Misra, Laura M. Haas, "Distributed deadlock detection", ACM Transactions on Computer Systems (TOCS), Volume 1, Issue 2 (May 1983), Pages 144-156
- K. Mani Chandy, Jayadev Misra, "The drinking philosophers problem",
ACM Transactions on Programming Languages and Systems (TOPLAS),
Volume 6, Issue 4 (October 1984), Lecture Notes in Computer Science
Vol. 174, Pages 632-646
- G. Ricart and A. K. Agrawala, "An Optimal Algorithm for Mutual Exclusion in Computer Networks", In Communications of the ACM, 24(1):9-17, January 1981
- Butler Lampson, How to build a highly available system using consensus. In Distributed Algorithms, ed. Babaoglu and Marzullo, Lecture Notes in Computer Science 1151, Springer, 1996, pp 1-17
- C.A.R. Hoare. Communicating Sequential Processes. Prentice Hall, 1985
- Edsger W. Dijkstra, "A short introduction to the art of programming", EWD 316, 1971
- Edsger W. Dijkstra, "The humble programmer", EWD 340, 1972
- Leslie Lamport, "A New Solution of Dijkstra's Concurrent Programming Problem", Communications of the ACM 17, 8 (August 1974), 453-455.
- Leslie Lamport, "A New Approach to Proving the Correctness of Multiprocess Programs", ACM Transactions on Programming Languages and Systems 1, 1 (July 1979), 84-97.
- Leslie Lamport, Susan Owicki, "Proving Liveness Properties of Concurrent Programs", ACM Transactions on Programming Languages and Systems 4, 3 (July 1982), 455-495.
- P. A. Bernstein, V. Hadzilacos, and N. Goodman, "Concurrency Control and Recovery in Database Systems", Addison-Wesley, 1987
- Jim Gray, "The Transaction Concept, Virtues And Limitations", Proceedings of 7th VLDB, Cannes, France, 1981, pp. 144-154
- Leslie Lamport. Paxos Made Simple
- Raft: Ongaro and Ousterhout. In Search of an Understandable Consensus Algorithm. USENIX ATC 2014.
Readings for Chapter 6 Consistency and Replication
- C. Gray and D. Cheriton, "Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency", Proceedings of the 12th ACM Symposium on Operating Systems Principles, 1989
- K. Petersen, M. J. Spreitzer, D. B. Terry, M. M. Theimer, and A. J. Demers, "Flexible Update Propagation for Weakly Consistent Replication", Proc. of the 16th ACM Symp. on Op. Syst. Prin. (SOSP-16), Saint-Malo, France, Oct. 5-8, 1997, pp. 288-301.
- A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, and D. Terry, "Epidemic algorithms for replicated database maintenance", In PODC, 1987.
- Gifford, D., "Weighted voting for replicated data", In: Proceedings of 7th ACM Symposium on Operating System Principles (1979), 150-162
- MapReduce: Dean and Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004
- Spark: Zaharia et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012
- Spanner: Corbett et al. Spanner: Google’s Globally-Distributed Database. OSDI 2012.
- Bigtable: Chang et al. Bigtable: A Distributed Storage System for Structured Data. OSDI 2006
- DynamoDB: DeCandia et al. Dynamo: Amazon’s Highly Available Key-value Store. SOSP 2007
- Bayou: Terry et al. Managing Update Conflicts in Bayou,
a Weakly Connected Replicated Storage System. SOSP 1995
- Bitcoin: Satoshi Nakamoto. Bitcoin: A Peer-to-Peer Electronic Cash System
- Akamai: Dilley et al. Globally distributed content delivery. IEEE Internet Computing. September-October 2002.
- Distributed Hash Tables: Gribble et al. Scalable, Distributed Data Structures for Internet Service Construction. OSDI 2000.
- Nishtala et al. Scaling Memcache at Facebook. NSDI 2013
- Scaling Wikipedia
- Netflix Scalability Stack
- Scaling Reddit
- Scaling LinkedIn
Readings for Chapter 7 Fault Tolerance
- Edsger W. Dijkstra, "Self-stabilization in spite of distributed control", EWD 391, 1973
- Edsger W. Dijkstra, "Self-stabilizing systems in spite of distributed control", EWD 426, 1974
- Jim Gray, Leslie Lamport, "Consensus on Transaction Commit", MSR-TR-2003-96, January 2004, 32 p.
- Jim Gray, "Why Do Computers Stop and What Can We Do About It", 6th International Conference on Reliability and Distributed Databases, June 1987
- Jim Gray, "Notes on Database Operating Systems",
Operating Systems, an Advanced Course, Bayer et al., eds., Lecture
notes in Computer Science 60, Springer-Verlag, 1978, pp. 393-481.
- Leslie Lamport, Marshall Pease, Robert Shostak, "The Byzantine Generals Problem", ACM Transactions on Programming Languages and Systems 4, 3 (July 1982), 382-401.
- Leslie Lamport, "The Part-Time Parliament", ACM Transactions on Computer Systems 16, 2 (May 1998), 133-169.
- Atomic Multicast
- Practical Byzantine Fault Tolerance. Castro and Liskov. OSDI 1999.
- Kenneth P. Birman and Thomas Joseph, "Exploiting Virtual Synchrony in distributed systems", In Proceedings of the 11th ACM Symposium on Operating Systems Principles, pages 123--138, Austin, Texas, November 1987
- André Schiper, Kenneth Birman, Pat Stephenson, "Lightweight causal and atomic group multicast", ACM Transactions on Computer Systems (TOCS), Volume 9, Issue 3, Pages 272-314, 1991
Readings for Chapter 8 Security
- Butler Lampson, M. Abadi, M. Burrows, E. Wobber. "Authentication in distributed systems: Theory and practice", ACM Trans. Computer Systems 10, 4 (Nov. 1992), pp 265-310
Readings for Chapter 10: Distributed File Systems
- NFS Version 4. The NFS v4 RFC is here.
- Zebra Network File System
- Serverless Network File System
- Google File System: Ghemawat et al. The Google File System. SOSP 2003.
- CAP Theorem: Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. PODC 2002.