Book: Site Reliability Engineering: How Google Runs Production Systems

Bibliography

Adams B., Bellomo S., Bird C., Marshall-Keim T., Khomh F., Moir K. The Practice and Future of Release Engineering: A Roundtable with Three Release Engineers. IEEE Software, vol. 32, no. 2 (March/April 2015), pp. 42–49.

Aguilera M.K. Stumbling over Consensus Research: Misunderstandings and Issues. Replication, Lecture Notes in Computer Science 5959, 2010.

Allspaw J., Robbins J. Web Operations: Keeping the Data on Time: O’Reilly, 2010.

Allspaw J. Blameless PostMortems and a Just Culture. Blog post, 2012.

Allspaw J. Trade-Offs Under Pressure: Heuristics and Observations of Teams Resolving Internet Service Outages. MSc thesis, Lund University, 2015.

Anantharaju S. Automating web application security testing. Blog post, July 2007.

Ananthanarayanan R. et al. Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams. SIGMOD ’13, 2013.

Andrieux A., Czajkowski K., Dan A. et al. Web Services Agreement Specification (WS-Agreement). September 2005.

Bailis P., Ghodsi A. Eventual Consistency Today: Limitations, Extensions, and Beyond. ACM Queue, vol. 11, no. 3, 2013.

Bainbridge L. Ironies of Automation. Automatica, vol. 19, no. 6, November 1983.

Baker J. et al. Megastore: Providing Scalable, Highly Available Storage for Interactive Services. Proceedings of the Conference on Innovative Data System Research, 2011.

Barroso L.A. Warehouse-Scale Computing: Entering the Teenage Decade. Talk at 38th Annual Symposium on Computer Architecture, video available online, 2011.

Barroso L.A., Clidaras J., Hölzle U. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Second Edition. Morgan & Claypool, 2013.

Bennett C., Tseitlin A. Chaos Monkey Released Into The Wild. Blog post, July 2012.

Bland M. Goto Fail, Heartbleed, and Unit Testing Culture. Blog post, June 2014.

Bock L. Work Rules! Twelve Books, 2015.

Bolosky W.J., Bradshaw D., Haagens R.B., Kusters N.P., Li P. Paxos Replicated State Machines as the Basis of a High-Performance Data Store. Proc. NSDI 2011, 2011.

Boysen P.G. Just Culture: A Foundation for Balanced Accountability and Patient Safety. The Ochsner Journal, Fall 2013.

Brasseur V.M. Failure: Why it happens & How to benefit from it. YAPC 2015.

Brewer E. Lessons From Giant-Scale Services. IEEE Internet Computing, vol. 5, no. 4, July/August 2001.

Brewer E. CAP Twelve Years Later: How the Rules Have Changed. Computer, vol. 45, no. 2, February 2012.

Brooker M. Exponential Backoff and Jitter. AWS Architecture Blog, March 2015.

Brooks Jr. F.P. No Silver Bullet — Essence and Accidents of Software Engineering. The Mythical Man-Month, Boston: Addison-Wesley, 1995, pp. 180–186.

Brutlag J. Speed Matters. Google Research Blog, June 2009.

Bull G.M. The Dartmouth Time-sharing System. Ellis Horwood, 1980.

Burgess M. Principles of Network and System Administration. Wiley, 1999.

Burrows M. The Chubby Lock Service for Loosely-Coupled Distributed Systems. OSDI’06: Seventh Symposium on Operating System Design and Implementation, November 2006.

Burns B., Grant B., Oppenheimer D., Brewer E., Wilkes J. Borg, Omega, and Kubernetes. ACM Queue, vol. 14, no. 1, 2016.

Castro M., Liskov B. Practical Byzantine Fault Tolerance. Proc. OSDI 1999, 1999.

Chambers C., Raniwala A., Perry F., Adams S., Henry R., Bradshaw R., Weizenbaum N. FlumeJava: Easy, Efficient Data-Parallel Pipelines. ACM SIGPLAN Conference on Programming Language Design and Implementation, 2010.

Chandra T.D., Toueg S. Unreliable Failure Detectors for Reliable Distributed Systems. J. ACM, 1996.

Chandra T., Griesemer R., Redstone J. Paxos Made Live - An Engineering Perspective. PODC ’07: 26th ACM Symposium on Principles of Distributed Computing, 2007.

Chang F. et al. Bigtable: A Distributed Storage System for Structured Data. OSDI’06: Seventh Symposium on Operating System Design and Implementation, November 2006.

Chrousos G.P. Stress and Disorders of the Stress System. Nature Reviews Endocrinology, vol. 5, no. 7, 2009.

Clos C. A Study of Non-Blocking Switching Networks. Bell System Technical Journal, vol. 32, no. 2, 1953.

Contavalli C., Gaast W. van der, Lawrence D., Kumari W. Client Subnet in DNS Queries. IETF Internet-Draft, 2015.

Conway M.E. Design of a Separable Transition-Diagram Compiler. Commun. ACM 6, 7 (July 1963), 396–408.

Conway P. Preservation in the Digital World. Report published by the Council on Library and Information Resources, 1996.

Cook R.I. How Complex Systems Fail. Web Operations: O’Reilly, 2010.

Corbett J.C. et al. Spanner: Google’s Globally-Distributed Database. OSDI ’12: Tenth Symposium on Operating System Design and Implementation, October 2012.

Cranmer J. Visualizing code coverage. Blog post, March 2010.

Dean J., Barroso L.A. The Tail at Scale. Communications of the ACM, vol. 56, 2013.

Dean J., Ghemawat S. MapReduce: Simplified Data Processing on Large Clusters. OSDI’04: Sixth Symposium on Operating System Design and Implementation, December 2004.

Dean J. Software Engineering Advice from Building Large-Scale Distributed Systems. Stanford CS297 class lecture, Spring 2007.

Dekker S. Reconstructing human contributions to accidents: the new view on error and performance. Journal of Safety Research, vol. 33, no. 3, 2002.

Dekker S. The Field Guide to Understanding “Human Error”, 3rd edition: Ashgate, 2014.

Dickson C. How Embracing Continuous Release Reduced Change Complexity. Presentation at USENIX Release Engineering Summit West 2014, video available online.

Durmer J., Dinges D. Neurocognitive Consequences of Sleep Deprivation. Seminars in Neurology, vol. 25, no. 1, 2005.

Eisenbud D.E. et al. Maglev: A Fast and Reliable Software Network Load Balancer. NSDI’16: 13th USENIX Symposium on Networked Systems Design and Implementation, March 2016.

Erenkrantz J.R. Release Management Within Open Source Projects. Proceedings of the 3rd Workshop on Open Source Software Engineering, Portland, Oregon, May 2003.

Fischer M.J., Lynch N.A., Paterson M.S. Impossibility of Distributed Consensus with One Faulty Process. J. ACM, 1985.

Fitzpatrick B.W., Collins-Sussman B. Team Geek: A Software Developer’s Guide to Working Well with Others: O’Reilly, 2012.

Floyd S., Jacobson V. The Synchronization of Periodic Routing Messages. IEEE/ACM Transactions on Networking, vol. 2, issue 2, April 1994, pp. 122–136.

Ford D. et al. Availability in Globally Distributed Storage Systems. Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation, 2010.

Fox A., Brewer E.A. Harvest, Yield, and Scalable Tolerant Systems. Proceedings of the 7th Workshop on Hot Topics in Operating Systems, Rio Rico, Arizona, March 1999.

Fowler M. GUI Architectures. Blog post, 2006.

Gall J. SYSTEMANTICS: How Systems Really Work and How They Fail, 1st ed. Pocket, 1977.

Gall J. The Systems Bible: The Beginner’s Guide to Systems Large and Small, 3rd ed. General Systemantics Press/Liberty, 2003.

Gawande A. The Checklist Manifesto: How to Get Things Right. Henry Holt and Company, 2009.

Ghemawat S., Gobioff H., Leung S-T. The Google File System. 19th ACM Symposium on Operating Systems Principles, October 2003.

Gilbert S., Lynch N. Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. ACM SIGACT News, vol. 33, no. 2, 2002.

Glass R. Facts and Fallacies of Software Engineering. Addison-Wesley Professional, 2002.

Golab W. et al. Eventually Consistent: Not What You Were Expecting? ACM Queue, vol. 12, no. 1, 2014.

Graham P. Maker’s Schedule, Manager’s Schedule. Blog post, July 2009.

Gupta A., Shute J. High-Availability at Massive Scale: Building Google’s Data Infrastructure for Ads. Workshop on Business Intelligence for the Real Time Enterprise, 2015.

Hamilton J. On Designing and Deploying Internet-Scale Services. Proceedings of the 21st Large Installation System Administration Conference, November 2007.

Hanks S., Li T., Farinacci D., Traina P. Generic Routing Encapsulation over IPv4 networks. IETF Informational RFC, 1994.

Hickins M. Tape Rescues Google in Lost Email Scare. Digits, Wall Street Journal, 1 March 2011.

Hixson D. Capacity Planning. ;login:, vol. 40, no. 1, February 2015.

Hixson D. The Systems Engineering Side of Site Reliability Engineering. ;login:, vol. 40, no. 3, June 2015.

Hodges J. Notes on Distributed Systems for Young Bloods. Blog post, 14 January 2013.

Holmwood L. Applying Cardiac Alarm Management Techniques to Your On-Call. Blog post, 26 August 2014.

Humble J., Read C., North D. The Deployment Production Line. Proceedings of the IEEE Agile Conference, July 2006.

Humble J., Farley D. Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation. Addison-Wesley, 2010.

Hunt P., Konar M., Junqueira F.P., Reed B. ZooKeeper: Wait-free coordination for Internet-scale systems. USENIX ATC, 2010.

International Atomic Energy Agency. Safety of Nuclear Power Plants: Design, SSR-2/1, 2012.

Jain S. et al. B4: Experience with a Globally-Deployed Software Defined WAN. SIGCOMM’13.

Jones C., Underwood T., Nukala S. Hiring Site Reliability Engineers. ;login:, vol. 40, no. 3, June 2015.

Junqueira F., Mao Y., Marzullo K. Classic Paxos vs. Fast Paxos: Caveat Emptor. Proc. HotDep ’07, 2007.

Junqueira F.P., Reed B.C., Serafini M. Zab: High-performance broadcast for primary-backup systems. Dependable Systems & Networks (DSN), 2011 IEEE/IFIP 41st International Conference on 27 Jun 2011: 245–256.

Kahneman D. Thinking, Fast and Slow. Farrar, Straus and Giroux, 2011.

Karger D. et al. Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web. Proc. STOC ’97, 29th annual ACM symposium on theory of computing, 1997.

Kemper C. Build in the Cloud: How the Build System Works. Google Engineering Tools blog post, August 2011.

Kendrick S. What Takes Us Down? ;login:, vol. 37, no. 5, October 2012.

Kincaid J. T-Mobile Sidekick Disaster: Danger’s Servers Crashed, And They Don’t Have A Backup. Techcrunch. n.p., 10 Oct. 2009. Web. 20 Jan. 2015.

Kingsbury K. The trouble with timestamps. Blog post, 2013.

Kirsch J., Amir Y. Paxos for System Builders: An Overview. Proc. LADIS ’08, 2008.

Klau R. How Google Sets Goals: OKRs. Blog post, October 2012.

Klein D.V. A Forensic Analysis of a Distributed Two-Stage Web-Based Spam Attack. Proceedings of the 20th Large Installation System Administration Conference, December 2006.

Klein D.V., Betser D.M., Monroe M.G. Making Push On Green a Reality. ;login:, vol. 39, no. 5, October 2014.

Krattenmaker T. Make Every Meeting Matter. Harvard Business Review, February 27, 2008.

Kreps J. Getting Real About Distributed System Reliability. Blog post, 19 March 2012.

Krishnan K. Weathering The Unexpected. Communications of the ACM, vol. 55, no. 11, November 2012.

Kumar A. et al. BwE: Flexible, Hierarchical Bandwidth Allocation for WAN Distributed Computing. SIGCOMM ’15.

Lamport L. The Part-Time Parliament. ACM Transactions on Computer Systems 16, 2, May 1998.

Lamport L. Paxos Made Simple. ACM SIGACT News 121, December 2001.

Lamport L. Fast Paxos. Distributed Computing 19.2, October 2006.

Limoncelli T.A., Chalup S.R., Hogan C.J. The Practice of Cloud System Administration: Designing and Operating Large Distributed Systems, Volume 2. Addison-Wesley, 2014.

Loomis J. How to Make Failure Beautiful: The Art and Science of Postmortems. Web Operations: O’Reilly, 2010.

Lu H. et al. Existential Consistency: Measuring and Understanding Consistency at Facebook. SOSP ’15, 2015.

Mao Y., Junqueira F.P., Marzullo K. Mencius: Building Efficient Replicated State Machines for WANs. OSDI ’08, 2008.

Maslow A.H. A Theory of Human Motivation. Psychological Review 50 (4), 1943.

Maurer B. Fail at Scale. ACM Queue, vol. 13, no. 12, 2015.

Mayer M. This site may harm your computer on every search result?!?! Blog post, January 2009.

McIlroy M.D. A Research Unix Reader: Annotated Excerpts from the Programmer’s Manual, 1971–1986.

McNutt D. Maintaining Consistency in a Massively Parallel Environment. Presentation at USENIX Configuration Management Summit 2013, video available online.

McNutt D. Accelerating the Path from Dev to DevOps. ;login:, vol. 39, no. 2, April 2014.

McNutt D. The 10 Commandments of Release Engineering. Presentation at 2nd International Workshop on Release Engineering 2014, April 2014.

McNutt D. Distributing Software in a Massively Parallel Environment. Presentation at USENIX LISA 2014, video available online.

Microsoft TechNet. What is SNMP? Last modified March 28, 2003.

Meadows D. Thinking in Systems. Chelsea Green, 2008.

Menage P. Adding Generic Process Containers to the Linux Kernel. Proc. of Ottawa Linux Symposium, 2007.

Merchant N. Culture Trumps Strategy, Every Time. Harvard Business Review, March 22, 2011.

Mockapetris P. Domain Names - Implementation and Specification. IETF Internet Standard, 1987.

Moler C. Matrix Computation on Distributed Memory Multiprocessors. Hypercube Multiprocessors 1986, 1987.

Moraru I., Andersen D.G., Kaminsky M. Egalitarian Paxos. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-12-108, 2012.

Moraru I., Andersen D.G., Kaminsky M. Paxos Quorum Leases: Fast Reads Without Sacrificing Writes. Proc. SOCC ’14, 2014.

Morgenthaler J.D., Gridnev M., Sauciuc R., Bhansali S. Searching for Build Debt: Experiences Managing Technical Debt at Google. Proceedings of the 3rd Int’l Workshop on Managing Technical Debt, 2012.

Narla C., Salas D. Hermetic Servers. Blog post, 2012.

Nelson B. The Data on Diversity. Communications of the ACM, vol. 57, 2014.

Nichols K., Jacobson V. Controlling Queue Delay. ACM Queue, vol. 10, no. 5, 2012.

O’Connor P., Kleyner A. Practical Reliability Engineering, 5th edition. Wiley, 2012.

Ohno T. Toyota Production System: Beyond Large-Scale Production. Productivity Press, 1988.

Ongaro D., Ousterhout J. In Search of an Understandable Consensus Algorithm (Extended Version).

Peng D., Dabek F. Large-scale Incremental Processing Using Distributed Transactions and Notifications. Proc. of the 9th USENIX Symposium on Operating System Design and Implementation, November 2010.

Perrow C. Normal Accidents: Living with High-Risk Technologies. Princeton University Press, 1999.

Perry A.R. Engineering Reliability into Web Sites: Google SRE. Proc. of LinuxWorld 2007, 2007.

Pike R., Dorward S., Griesemer R., Quinlan S. Interpreting the Data: Parallel Analysis with Sawzall. Scientific Programming Journal, vol. 13, no. 4, 2005.

Potvin R., Levenberg J. The Motivation for a Monolithic Codebase: Why Google stores billions of lines of code in a single repository. Communications of the ACM, forthcoming July 2016. Video available on YouTube.

Rooney J.J., Vanden Heuvel L.N. Root Cause Analysis for Beginners. Quality Progress, July 2004.

Saint-Exupéry A. de. Terre des Hommes. Paris: Le Livre de Poche, 1939, in translation by Lewis Galantière as Wind, Sand and Stars.

Sambasivan R.R., Fonseca R., Shafer I., Ganger G.R. So, You Want To Trace Your Distributed System? Key Design Insights from Years of Practical Experience. Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-14-102, 2014.

Santos N., Schiper A. Tuning Paxos for High-Throughput with Batching and Pipelining. 13th Int’l Conf. on Distributed Computing and Networking, 2012.

Sarter N.B., Woods D.D., Billings C.E. Automation Surprises. Handbook of Human Factors & Ergonomics, 2nd edition, G. Salvendy (ed.), Wiley, 1997.

Schmidt E., Rosenberg J., Eagle A. How Google Works: Grand Central Publishing, 2014.

Schwartz B. The Factors That Impact Availability, Visualized. Blog post, 21 December 2015.

Schneider F.B. Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial. ACM Computing Surveys, vol. 22, no. 4, 1990.

Securities and Exchange Commission. Order In the Matter of Knight Capital Americas LLC. File 3-15570, 2013.

Shao G., Berman F., Wolski R. Master/Slave Computing on the Grid. Heterogeneous Computing Workshop, 2000.

Shute J. et al. F1: A Distributed SQL Database That Scales. Proc. VLDB 2013, 2013.

Sigelman B.H. et al. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Google Technical Report, 2010.

Singh A. et al. Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network. SIGCOMM ’15.

Skelton M. Operability can Improve if Developers Write a Draft Run Book. Blog post, 16 October 2013.

Treynor Sloss B. Gmail back soon for everyone. Blog post, 28 February 2011.

Tatham S. How to Report Bugs Effectively, 1999.

Verma A., Pedrosa L., Korupolu M.R., Oppenheimer D., Tune E., Wilkes J. Large-scale cluster management at Google with Borg. Proceedings of the European Conference on Computer Systems, 2015.

Wallace D.R., Fujii R.U. Software Verification and Validation: An Overview. IEEE Software, vol. 6, no. 3 (May 1989), pp. 10–17.

Ward R., Beyer B. BeyondCorp: A New Approach to Enterprise Security. ;login:, vol. 39, no. 6, December 2014.

Whittaker J.A., Arbon J., Carollo J. How Google Tests Software: Addison-Wesley, 2012.

Wood A. Predicting Software Reliability. Computer, vol. 29, no. 11, 1996.

Wright H.K. Release Engineering Processes, Their Faults and Failures (section 7.2.2.2). PhD thesis, University of Texas at Austin, 2012.

Wright H.K., Perry D.E. Release Engineering Practices and Pitfalls. Proceedings of the 34th International Conference on Software Engineering (ICSE ’12), IEEE, 2012, pp. 1281–1284.

Wright H.K., Jasper D., Klimek M., Carruth C., Wan Z. Large-Scale Automated Refactoring Using ClangMR. Proceedings of the 29th International Conference on Software Maintenance (ICSM ’13), IEEE, 2013, pp. 548–551.

ZooKeeper Project (Apache Foundation). ZooKeeper Recipes and Solutions. ZooKeeper 3.4 documentation, 2014.

