Advanced Telemetry Correlation Techniques for Real-Time Reliability Engineering in Edge-Cloud Systems
Abstract
The rapid expansion of edge-cloud computing architectures has introduced new observability challenges that arise from the geographic dispersion of workloads, heterogeneous infrastructure substrates, and bandwidth-constrained interconnection fabrics. Traditional telemetry pipelines, designed for centralized cloud environments with homogeneous connectivity, do not deliver the cross-layer diagnostic visibility that real-time reliability engineering requires across distributed edge and cloud tiers. This paper presents a context-aware telemetry correlation engine (CATCE) that uses semantic enrichment, adaptive signal fusion, and topology-guided causal reasoning to correlate metrics, distributed traces, structured logs, and infrastructure events across edge-cloud boundaries in real time. The framework introduces a context propagation protocol that preserves causal relationships even under intermittent edge connectivity, and a resource-proportional analysis strategy that distributes computational workloads according to the capacity available at each tier. Evaluation on a production-representative testbed with 18 edge locations and a multi-region cloud backbone shows a 71.2 percent reduction in mean fault isolation time relative to conventional centralized approaches, with diagnostic accuracy sustained at 94.6 percent under degraded network conditions. The findings offer actionable architectural guidance for reliability engineering teams operating at the edge-cloud frontier.
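As an illustration of the resource-proportional idea only, the following minimal sketch splits a batch of analysis tasks across tiers in proportion to their available capacity. It is not the CATCE implementation: the function name `allocate_proportionally`, the tier names, the capacity figures, and the largest-remainder rounding rule are all assumptions introduced for this example.

```python
from typing import Dict


def allocate_proportionally(capacity: Dict[str, float], total_tasks: int) -> Dict[str, int]:
    """Split total_tasks across tiers in proportion to non-negative capacity.

    Hypothetical helper: uses largest-remainder rounding so that the
    integer shares always sum to total_tasks.
    """
    total_capacity = sum(capacity.values())
    if total_capacity <= 0:
        raise ValueError("at least one tier must have positive capacity")

    # Exact (fractional) share for each tier.
    exact = {tier: total_tasks * cap / total_capacity for tier, cap in capacity.items()}
    # Round every share down first.
    shares = {tier: int(value) for tier, value in exact.items()}
    # Hand the remaining tasks to the tiers with the largest fractional parts.
    leftover = total_tasks - sum(shares.values())
    by_remainder = sorted(exact, key=lambda tier: exact[tier] - shares[tier], reverse=True)
    for tier in by_remainder[:leftover]:
        shares[tier] += 1
    return shares


# Assumed capacities in arbitrary units (for example, spare CPU cores per tier).
shares = allocate_proportionally({"edge": 2.0, "regional": 6.0, "cloud": 12.0}, 100)
print(shares)  # {'edge': 10, 'regional': 30, 'cloud': 60}
```

In this sketch the edge tier holds one tenth of the total capacity and so receives one tenth of the work. A real system would presumably refresh the capacity figures continuously and account for link bandwidth as well.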
