Case Study | Memory Failure Prediction | Tencent Cloud Solutions
Intel® Memory Failure Prediction at Tencent

Intel® Memory Failure Prediction substantially improves memory reliability through online machine learning and reduces downtime

Business: Tencent is one of the biggest cloud solution providers in China, with a presence across three continents.

Challenges
• Real-time visibility into memory health
• Effective DIMM replacement strategy
• Predictive insights into server memory... uptime and workload transfer

Solution
• Intel® Memory Failure Prediction

[Image: Tencent Seafront Towers in Shenzhen, China]

Executive Summary

Tencent, a leading China-based global cloud-solutions provider with operations in APAC, Europe and North America, set up Intel® Memory Failure Prediction (Intel® MFP) in a test deployment covering thousands of servers based on Intel® Xeon® Scalable Processors, with the goal of reducing downtime caused by server memory failures. Tencent’s IT staff deployed Intel® MFP in their data center and integrated it into their existing management systems to analyze server memory failures, predict potential future failures, reduce downtime, and improve their current Dual Inline Memory Module (DIMM) replacement and upgrade policies.

The deployment improved memory reliability by basing predictions on micro-level memory failure information captured from the operating system’s Error Detection and Correction (EDAC) driver, which stores historical memory error logs. Intel® MFP also gave Tencent’s IT staff enough information to address potential memory issues proactively and to replace failing DIMMs before they reached a terminal stage and caused server failures, thus reducing downtime.

This initial test deployment indicated a 5X improvement in DIMM-level failure prediction. If Tencent deployed Intel® MFP across all of its data centers, it would improve the effectiveness of server-reliability-aware workload management and decrease the percentage of Uncorrectable Errors (UEs), and therefore significantly reduce downtime.
Additionally, Tencent’s operational efficiency would improve, and its spending on unnecessary DIMM purchases would decrease.

Intel® Memory Failure Prediction at Tencent
• Reduces uncorrectable memory errors
• Simplifies workload migration decision making
• Improves DIMM failure prediction 5X
• Optimizes page offlining policies
• Improves DIMM toss & purchase decisions
• Reduces downtime caused by server memory failures

Background

Memory failures are among the most critical hardware failures that occur in data centers today. Intel® MFP is a perfect solution for organizations, such as online and cloud service providers, that depend heavily on server reliability, availability and serviceability (RAS). Intel® MFP predicts memory failure events by analyzing historical data, so that potentially catastrophic events can be prevented before they happen. Intel® MFP is vendor agnostic and works in conjunction with other data center management solutions, including Intel® Data Center Manager (Intel® DCM). Once deployed, the resulting data can be used to analyze and predict server memory issues before they happen.

Tencent deployed Intel® MFP in a test environment containing thousands of servers with Intel® Xeon® Scalable Processors to gain better insight into their memory health. Intel® MFP monitored the health of the servers’ Dynamic Random Access Memory (DRAM) modules and provided administrators with critical information about them, including a health score based on their historical data.

Intel® MFP Provides Real-time Memory Health Insights

Intel® MFP uses online machine learning to analyze the historical data collected on server memory down to the DIMM, bank, column, row, and cell levels, and assigns a memory health score to predict potential future failures.
The resulting analysis and health scores indicated the potential for a large number of memory issues within Tencent’s test environment, including both Correctable Errors (CE) and Uncorrectable Errors (UE).
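To make the idea of a per-DIMM health score more concrete, the sketch below groups correctable-error records by DIMM and by faulty cell (bank, row, column) and derives a simple score from them. It is an illustration only: the case study does not describe Intel® MFP's model or data format, so the record fields, the function name `toy_health_scores`, and the penalty weights are all hypothetical, and the scoring rule is a plain heuristic rather than machine learning.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CorrectableError:
    """One correctable-error record; field names are hypothetical."""
    dimm: str
    bank: int
    row: int
    column: int


def toy_health_scores(errors):
    """Return {dimm: score in [0, 100]}; higher means healthier.

    Illustrative heuristic only (not Intel MFP's method): each error
    costs 1 point and each distinct faulty cell (bank, row, column)
    costs 5 more, so errors spread over many cells score worse than
    repeated errors in a single cell.
    """
    counts = defaultdict(int)   # total errors seen per DIMM
    cells = defaultdict(set)    # distinct faulty cells per DIMM
    for e in errors:
        counts[e.dimm] += 1
        cells[e.dimm].add((e.bank, e.row, e.column))
    return {
        dimm: max(0, 100 - counts[dimm] - 5 * len(cells[dimm]))
        for dimm in counts
    }


# Made-up sample history: DIMM_A1 has three errors in two cells,
# DIMM_B2 has a single error.
history = [
    CorrectableError("DIMM_A1", bank=0, row=10, column=3),
    CorrectableError("DIMM_A1", bank=0, row=10, column=3),
    CorrectableError("DIMM_A1", bank=2, row=77, column=9),
    CorrectableError("DIMM_B2", bank=1, row=5, column=5),
]
scores = toy_health_scores(history)
print(scores)  # {'DIMM_A1': 87, 'DIMM_B2': 94}
```

In this made-up history, DIMM_A1 scores lower than DIMM_B2 because it has more errors and they touch more cells. A real predictor would also weigh factors such as error timing and would be validated against observed failures.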

Notices and Disclaimers

Intel does not control or audit third-party data.  You should consult other sources to evaluate accuracy.

Memory failure prediction results provided through the use of Intel MFP are estimated and may vary based on differences in system hardware, software, or configuration. Results are derived using multi-dimensional models and algorithms to predict potential memory failures and do not constitute a representation or guarantee regarding memory failure.