A survey of cloud network fault diagnostic systems and tools
Abstract
Keywords
References
Aceto G, Botta A, de Donato W, et al., 2013. Cloud monitoring: a survey. Comput Netw, 57(9):2093–2115. https://doi.org/10.1016/j.comnet.2013.04.001
Andreyev A, 2014. Introducing Data Center Fabric, the Next-Generation Facebook Data Center Network. https://engineering.fb.com/2014/11/14/production-engineering/introducing-data-center-fabric-the-next-generation-facebook-data-center-network/
Armbrust M, Fox A, Griffith R, et al., 2010. A view of cloud computing. Commun ACM, 53(4):50–58. https://doi.org/10.1145/1721654.1721672
Arzani B, Ciraci S, Loo BT, et al., 2016. Taking the blame game out of data centers operations with NetPoirot. Proc ACM SIGCOMM Conf, p.440–453. https://doi.org/10.1145/2934872.2934884
Arzani B, Ciraci S, Chamon L, et al., 2018. 007: democratically finding the cause of packet drops. Proc 15th USENIX Conf on Networked Systems Design and Implementation, p.419–435.
Bahl P, Chandra R, Greenberg A, et al., 2007. Towards highly reliable enterprise network services via inference of multi-level dependencies. Proc Conf on Applications, Technologies, Architectures, and Protocols for Computer Communications, p.13–24. https://doi.org/10.1145/1282380.1282383
Bannour F, Souihi S, Mellouk A, 2018. Distributed SDN control: survey, taxonomy, and challenges. IEEE Commun Surv Tutor, 20(1):333–354. https://doi.org/10.1109/COMST.2017.2782482
Calder M, Schröder M, Gao R, et al., 2018. Odin: Microsoft’s scalable fault-tolerant CDN measurement system. Proc 15th USENIX Conf on Networked Systems Design and Implementation, p.501–517.
Casella G, Berger RL, 2002. Statistical Inference (2nd Ed.). Duxbury Press, Pacific Grove, USA.
Claise B, Sadasivan G, Valluri V, et al., 2004. RFC 3954: Cisco Systems NetFlow Services Export Version 9. https://www.hjp.at/doc/rfc/rfc3954.html
Dhamdhere A, Teixeira R, Dovrolis C, et al., 2007. NetDiagnoser: troubleshooting network unreachabilities using end-to-end probes and routing data. Proc ACM CoNEXT Conf, p.1–12. https://doi.org/10.1145/1364654.1364677
Duffield N, Haffner P, Krishnamurthy B, et al., 2009. Rule-based anomaly detection on IP flows. IEEE INFOCOM, p.424–432. https://doi.org/10.1109/INFCOM.2009.5061947
Fang CR, Liu HY, Miao M, et al., 2020. VTrace: automatic diagnostic system for persistent packet loss in cloud-scale overlay network. Proc Annual Conf of the ACM Special Interest Group on Data Communication on the Applications, Technologies, Architectures, and Protocols for Computer Communication, p.31–43. https://doi.org/10.1145/3387514.3405851
Ganguli S, Corbett T, 2019. Gartner Magic Quadrant for Network Performance Monitoring and Diagnostics.
Garfinkel SL, 1999. Architects of the Information Society: Thirty-Five Years of the Laboratory for Computer Science at MIT. The MIT Press, Cambridge, USA.
Geng YL, Liu SY, Yin Z, et al., 2019. SIMON: a simple and scalable method for sensing, inference and measurement in data center networks. Proc 16th USENIX Conf on Networked Systems Design and Implementation, p.549–564.
Gong CY, Liu J, Zhang Q, et al., 2010. The characteristics of cloud computing. Proc 39th Int Conf on Parallel Processing Workshops, p.275–279. https://doi.org/10.1109/ICPPW.2010.45
Guo CX, Yuan LH, Xiang D, et al., 2015. Pingmesh: a large-scale system for data center network latency measurement and analysis. Proc ACM Conf on Special Interest Group on Data Communication, p.139–152. https://doi.org/10.1145/2785956.2787496
Herodotou H, Ding BL, Balakrishnan S, et al., 2014. Scalable near real-time failure localization of data center networks. Proc 20th ACM SIGKDD Int Conf on Knowledge Discovery and Data Mining, p.1689–1698. https://doi.org/10.1145/2623330.2623365
Huang P, Guo CX, Zhou LD, et al., 2017. Gray failure: the Achilles’ heel of cloud-scale systems. Proc 16th Workshop on Hot Topics in Operating Systems, p.150–155. https://doi.org/10.1145/3102980.3103005
Jin YC, Renganathan S, Ananthanarayanan G, et al., 2019. Zooming in on wide-area latencies to a global cloud provider. Proc ACM Conf on Special Interest Group on Data Communication, p.104–116. https://doi.org/10.1145/3341302.3342073
Kanuparthy P, Dovrolis C, 2014. Pythia: diagnosing performance problems in wide area providers. Proc USENIX Conf on USENIX Annual Technical Conf, p.371–382.
Kim C, Bhide P, Doe E, et al., 2015. In-Band Network Telemetry via Programmable Dataplanes. Technical Specification P, 4:2015.
Li Z, Cheng Q, Hsieh K, et al., 2020. Gandalf: an intelligent, end-to-end analytics service for safe deployment in large-scale cloud infrastructure. Proc 17th USENIX Symp on Networked Systems Design and Implementation, p.389–402.
Marston S, Li Z, Bandyopadhyay S, et al., 2011. Cloud computing—the business perspective. Dec Support Syst, 51(1):176–189. https://doi.org/10.1016/j.dss.2010.12.006
Mell P, Grance T, 2011. The NIST Definition of Cloud Computing. Gaithersburg: Computer Security Division, Information Technology Laboratory.
Moshref M, Yu ML, Govindan R, et al., 2016. Trumpet: timely and precise triggers in data centers. Proc ACM SIGCOMM Conf, p.129–143. https://doi.org/10.1145/2934872.2934879
Padmanabhan VN, Ramabhadran S, Padhye J, 2005. Net-Profiler: profiling wide-area networks using peer cooperation. Proc 4th Int Conf on Peer-to-Peer Systems, p.80–92. https://doi.org/10.1007/11558989_8
Peng YH, Yang J, Wu C, et al., 2017. deTector: a topology-aware monitoring system for data center networks. Proc USENIX Conf on USENIX Annual Technical Conf, p.55–68.
Roskind J, 2013. Quick UDP Internet Connections: Multiplexed Stream Transport over UDP. https://docs.google.com/document/d/1RNHkx_VvKWyWg6Lr8SZ-saqsQx7rFV-ev2jRFUoVD34/
Roy A, Zeng HY, Bagga J, et al., 2015. Inside the social network’s (datacenter) network. Proc ACM Conf on Special Interest Group on Data Communication, p.123–137. https://doi.org/10.1145/2785956.2787472
Roy A, Zeng HY, Bagga J, et al., 2017. Passive realtime datacenter fault detection and localization. Proc 14th USENIX Symp on Networked Systems Design and Implementation, p.595–612.
Tan C, Jin Z, Guo CX, et al., 2019. NetBouncer: active device and link failure localization in data center networks. Proc 16th USENIX Conf on Networked Systems Design and Implementation, p.599–614.
Tibshirani R, 1996. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B, 58(1):267–288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x
Veloso B, Malheiro B, Burguillo JC, et al., 2020. Impact of trust and reputation based brokerage on the CloudAnchor platform. Int Conf on Practical Applications of Agents and Multi-agent Systems, p.303–314.
Wang M, Li BC, Li ZP, 2004. sFlow: towards resource-efficient and agile service federation in service overlay networks. Proc 24th Int Conf on Distributed Computing Systems, p.628–635. https://doi.org/10.1109/ICDCS.2004.1281630
Wang T, Zhang WB, Ye CY, et al., 2016. FD4C: automatic fault diagnosis framework for web applications in cloud computing. IEEE Trans Syst Man Cybern Syst, 46(1):61–75. https://doi.org/10.1109/TSMC.2015.2430834
Widanapathirana C, Li J, Sekercioglu YA, et al., 2011. Intelligent automated diagnosis of client device bottlenecks in private clouds. Proc 4th IEEE Int Conf on Utility and Cloud Computing, p.261–266. https://doi.org/10.1109/UCC.2011.42
Wu X, Turner D, Chen CC, et al., 2012. NetPilot: automating datacenter network failure mitigation. Proc Conf on Applications, Technologies, Architectures, and Protocols for Computer Communication, p.419–430. https://doi.org/10.1145/2342356.2342438
Yu D, Zhu YB, Arzani B, et al., 2019. dShark: a general, easy to program and scalable framework for analyzing in-network packet traces. Proc 16th USENIX Conf on Networked Systems Design and Implementation, p.207–220.
Yu ML, Greenberg A, Maltz D, et al., 2011. Profiling network performance for multi-tier data center applications. Proc 8th USENIX Conf on Networked Systems Design and Implementation, p.57–70.
Zeng HY, Mahajan R, McKeown N, et al., 2015. Measuring and Troubleshooting Large Operational Multipath Networks with Gray Box Testing. Technical Report MSR-TR-2015-55 (Microsoft Research).
Zhang Q, Yu G, Guo CX, et al., 2018. Deepview: virtual disk failure diagnosis and pattern detection for Azure. Proc 15th USENIX Conf on Networked Systems Design and Implementation, p.519–532.
Zhu YB, Kang NX, Cao JX, et al., 2015. Packet-level telemetry in large datacenter networks. ACM SIGCOMM Comput Commun Rev, p.479–491. https://doi.org/10.1145/2829988.2787483
Zhuo DY, Ghobadi M, Mahajan R, et al., 2017. Understanding and mitigating packet corruption in data center networks. Proc ACM Conf on Special Interest Group on Data Communication, p.362–375. https://doi.org/10.1145/3098822.3098849