Alibaba Cloud has unveiled proprietary technology that it says significantly improves server fault prediction and detection. According to the company, the technique detects problems about ten percent better than comparable technologies.
This breakthrough was detailed in a paper presented at the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
The paper stresses that reliability matters in public clouds, which makes predicting failures valuable. The authors note that log files contain useful information on “exceptions” that signal potential performance issues. They argue that existing tools, which use machine learning and deep learning to predict failures, overlook timestamps as clear indicators of trouble.
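To illustrate why timestamps can carry signal on their own, here is a minimal Python sketch that measures the gap between consecutive log entries. It is not taken from the paper: the log format (an ISO 8601 timestamp, a space, then the message), the sample lines, and the function name `inter_arrival_gaps` are all assumptions made up for this example.

```python
from datetime import datetime

def inter_arrival_gaps(lines):
    """Return the seconds elapsed between consecutive log entries.

    Assumes each line starts with an ISO 8601 timestamp followed by a space.
    """
    stamps = [datetime.fromisoformat(line.split(" ", 1)[0]) for line in lines]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]

# Made-up sample: the same message repeats, but the gaps shrink sharply.
sample_logs = [
    "2024-01-01T00:00:00 kernel: corrected memory error",
    "2024-01-01T00:10:00 kernel: corrected memory error",
    "2024-01-01T00:10:02 kernel: corrected memory error",
    "2024-01-01T00:10:03 kernel: corrected memory error",
]

gaps = inter_arrival_gaps(sample_logs)
print(gaps)
```

A model that reads only the message text sees four identical lines here, while the gaps show the same event arriving far more frequently, which is the kind of information the authors say text-only approaches miss.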
In response, Alibaba Cloud developed its own tool, Time-Aware Attention-Based Transformer (TAAT), to analyze timestamp data effectively. TAAT complements existing machine learning tools by incorporating Bidirectional Encoder Representations from Transformers (BERT), a language model from Google that has been used to predict server failures. The paper argues that BERT on its own does not fully leverage log timestamps.
The resulting system pairs BERT-based failure analysis with TAAT’s analysis of timestamp data from log files. The paper sets out the mathematics Alibaba uses to analyze log information, an approach the company says yields a ten percent improvement in fault prediction and better reliability for cloud infrastructure.
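The general idea of letting time influence attention can be sketched in a few lines of Python. This is explicitly not the TAAT formulation from the paper: a simple linear age penalty is swapped in for illustration, and the function name `time_aware_weights`, the `decay` parameter, and the sample values are hypothetical.

```python
import math

def time_aware_weights(scores, timestamps, now, decay=0.01):
    """Softmax over content scores, each reduced by a linear age penalty.

    scores:     content-based relevance of each log entry (assumed given)
    timestamps: arrival time of each entry, in seconds
    now:        reference time, in seconds
    decay:      hypothetical penalty per second of age
    """
    adjusted = [s - decay * (now - t) for s, t in zip(scores, timestamps)]
    peak = max(adjusted)  # subtract the max for numerical stability
    exps = [math.exp(a - peak) for a in adjusted]
    total = sum(exps)
    return [e / total for e in exps]

# Three entries with identical content scores but different ages.
weights = time_aware_weights(
    scores=[1.0, 1.0, 1.0],
    timestamps=[0.0, 590.0, 600.0],
    now=600.0,
)
print(weights)
```

With identical content scores, the most recent entries receive the largest weights, so timing alone changes what the model attends to.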
One key advantage of TAAT is its ability to provide useful insights without requiring expert analysis, reducing the need for specialized knowledge of cloud crashes. The tool is currently in use at Alibaba Cloud.
While TAAT is not available for download, Alibaba Cloud has shared a massive dataset comprising “∼2.7 billion syslogs from ∼300,000 servers in a four-month period of the real productional system of Alibaba Cloud” to support researchers in developing their own log sampling strategies for future failure prediction efforts.
The authors have also released a video detailing TAAT’s operation.