2022-09-06: Anomaly Detection for Manufactured Computer Component Failures: Summer Internship Experience at Microsoft Corporation
This summer, I was accepted as a graduate student intern on the Quality Management System (QMS) team, a subdivision of the Cloud Hardware and Infrastructure Engineering (CHIE) organization at Microsoft Corporation, located in Redmond, Washington, USA. Microsoft is an international organization whose mission is to empower every person and every organization on the planet to achieve more. This year, Microsoft continued its student internship program for Summer 2022, and approximately 4,000 students joined Microsoft in the USA to work on various projects. Due to the ongoing COVID-19 pandemic, most internships were a hybrid of onsite and remote work.
My internship was a 12-week program that started on May 23rd, 2022. During this internship, I worked mostly remotely as a data scientist intern under the supervision of Nikolaj Lunoee and Kapil Jain; however, I visited the Redmond campus during the week of July 25th through July 29th, 2022. Throughout the program, I attended daily meetings with the entire QMS team and monthly meetings with the CHIE team. The daily QMS meetings were for sharing my progress and obtaining feedback to resolve issues or improve the solution. I usually had a one-on-one meeting with my mentor Kapil Jain every day to discuss my progress and any issues I faced, and I met with my manager Nikolaj Lunoee at least once a week.
Project
QMS Business Nature
The QMS team within the CHIE organization is responsible for systematically driving the quality of hardware components. One of the metrics used to monitor the quality of the different hardware components is the Part Replacement Rate (PRR), which is computed from the hardware components that fail after installation in the data center. Examples of the components the QMS monitors include memory modules, solid-state drives, etc. These components are present in different servers located in different computer racks across multiple data centers. They also consist of thousands of different sub-parts, which makes it infeasible to manually monitor the different parts within each component. The current manual approach could let a faulty part go undetected for a long time, which in turn can negatively affect other properly functioning parts. Thus, the goal of my project was to design an anomaly detector that can automatically track the performance of these different parts and components.
I began my internship by familiarizing myself with the QMS domain and the PRR data, reading the available documentation and taking relevant courses to build my understanding of the problem I was to tackle. I also scheduled one-on-one meetings with each member of the team to learn about each person's work and contributions. In addition, I familiarized myself with the Microsoft Azure Synapse Analytics and Azure Machine Learning platforms, since these are the platforms my team uses to carry out its operations.
I started by exploring the data to get a basic understanding of its features. Then, I conducted both univariate and bivariate analyses of the different features using Python packages such as Plotly. I also performed a substantial amount of data preprocessing and cleaning to prepare the data for modeling. One of the challenges I faced with the data was differentiating between a PRR anomaly and an accepted variation. I resolved this by using the modified z-score (a robust statistical measure based on the median and the median absolute deviation) to find a threshold between anomalies and accepted variation. I then used this threshold to create a ground truth for the data in order to evaluate the performance of my models.
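As a rough illustration, here is a minimal sketch of that labeling step, assuming a one-dimensional PRR series. The data are made up, and the 3.5 cutoff is the threshold commonly suggested by Iglewicz and Hoaglin, not necessarily the one I tuned on the real data:

```python
import numpy as np

def modified_z_scores(values):
    # Robust z-score: uses the median and the median absolute
    # deviation (MAD) instead of the mean and standard deviation,
    # so a few extreme values cannot distort the threshold.
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    return 0.6745 * (values - median) / mad

# Hypothetical weekly PRR values for one part.
prr = np.array([0.40, 0.50, 0.45, 0.48, 2.10, 0.52, 0.47])
is_anomaly = np.abs(modified_z_scores(prr)) > 3.5  # ground-truth labels
print(is_anomaly)  # only the 2.10 reading is flagged
```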
I then proceeded to explore different anomaly detection methods such as unsupervised K-Nearest Neighbors, Isolation Forest, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), and Autoencoders.
Summary of Explored Anomaly Detection Models
- K-Nearest Neighbors (KNN): KNN is a supervised learning algorithm, but in the context of anomaly detection it functions in an unsupervised manner. It belongs to the nearest-neighbors family of models, whose fundamental assumption is that similar observations lie in proximity to each other, while outliers are usually lonely observations that sit farther from any cluster of similar observations. I used the KNN model available in the Python Outlier Detection (PyOD) package and optimized its parameters using grid search with cross-validation to improve the model's performance (see the KNN sketch after this list).
- Isolation Forest (IForest): It isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of that feature; anomalies require fewer splits to isolate. A similar optimization approach was used to improve the IForest model (sketched after this list).
- K-Means: It randomly selects initial cluster centroids based on the specified number of clusters, then assigns each observation to the closest centroid and recomputes the centroids; this process continues until the centroids no longer change. I used the silhouette score metric to select an optimal number of clusters for my K-Means model (see the sketch after this list).
- Density-Based Spatial Clustering of Applications with Noise (DBSCAN): It finds core samples of high density and expands clusters from them, and it works well for data containing clusters of similar density. One of the most important parameters to optimize for DBSCAN is "eps", the maximum distance between two samples for one to be considered as in the neighborhood of the other. I used a nearest-neighbors technique to find an optimal value of eps (see the sketch after this list).
- Local Outlier Factor (LOF): It measures the local deviation of the density of a given observation with respect to its neighbors. The anomaly score of each observation is called the LOF; it is local in that the score depends on how isolated the observation is with respect to its surrounding neighborhood. I used LOF in the context of novelty detection and optimized it with grid search (see the sketch after this list).
- One-Class Support Vector Machine (OCSVM): It estimates the support of a high-dimensional distribution, flagging observations that fall outside that support. I mainly optimized the kernel and tolerance hyper-parameters of the model (see the sketch after this list).
- Autoencoder: It is a type of neural network that is trained to copy its input to its output. It first compresses its input into a lower-dimensional latent representation and then expands it back to the original input dimension; the underlying assumption is that noise (outliers) gets removed during this compression, so anomalies reconstruct poorly. I optimized the model's hyper-parameters using keras-tuner (see the sketch after this list).
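Below is a minimal sketch of the PyOD KNN detector mentioned above. The synthetic data, the candidate values of n_neighbors, and scoring with F1 against a precomputed ground truth are illustrative stand-ins for my actual tuning setup:

```python
import numpy as np
from pyod.models.knn import KNN
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (190, 4)),   # normal observations
               rng.normal(6, 1, (10, 4))])   # injected anomalies
y_true = np.r_[np.zeros(190, dtype=int), np.ones(10, dtype=int)]

# Small grid over n_neighbors, scored against the ground truth
# derived from the modified z-score step.
for k in (3, 5, 10, 20):
    labels = KNN(n_neighbors=k).fit(X).labels_  # PyOD: 1 = outlier
    print(k, f1_score(y_true, labels))
```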
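A corresponding Isolation Forest sketch, here via scikit-learn; the contamination value (the expected anomaly fraction) is an illustrative assumption:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (190, 4)),
               rng.normal(6, 1, (10, 4))])

# Anomalies are isolated in fewer random splits, so they end up
# with shorter average path lengths across the trees.
iso = IsolationForest(n_estimators=200, contamination=0.05,
                      random_state=0)
pred = iso.fit_predict(X)                 # -1 = anomaly, 1 = normal
print("anomalies found:", (pred == -1).sum())
```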
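For K-Means, a sketch of silhouette-based selection of the number of clusters, followed by one plausible way to flag anomalies (observations farthest from their assigned centroid); the 95th-percentile cutoff is an illustrative choice, not necessarily what I used:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 4)),
               rng.normal(8, 1, (100, 4))])  # two synthetic clusters

# Pick the k that maximizes the silhouette score.
scores = {k: silhouette_score(
                 X,
                 KMeans(n_clusters=k, n_init=10,
                        random_state=0).fit_predict(X))
          for k in range(2, 8)}
best_k = max(scores, key=scores.get)

km = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(X)
dist = km.transform(X).min(axis=1)        # distance to own centroid
is_anomaly = dist > np.quantile(dist, 0.95)
print("best k:", best_k, "anomalies:", is_anomaly.sum())
```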
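For DBSCAN, a sketch of the nearest-neighbors (k-distance) heuristic for choosing eps. In practice one reads the elbow off a plot of the sorted k-distances; the 95th percentile below is a crude stand-in for that elbow:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (190, 4)),
               rng.normal(6, 1, (10, 4))])

k = 5
# n_neighbors=k+1 because each point's nearest neighbor is itself.
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, -1])            # distance to k-th neighbor
eps = k_dist[int(0.95 * len(k_dist))]     # elbow proxy

labels = DBSCAN(eps=eps, min_samples=k).fit_predict(X)
print("anomalies (label -1):", (labels == -1).sum())
```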
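For LOF in novelty-detection mode, the detector is fit on data assumed to be clean and then used to score unseen observations, roughly as follows:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, (200, 4))       # assumed-clean history
X_new = np.vstack([rng.normal(0, 1, (5, 4)),
                   rng.normal(6, 1, (2, 4))])

# novelty=True enables predict() on unseen data; n_neighbors was
# among the hyper-parameters covered by my grid search.
lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train)
print(lof.predict(X_new))                  # -1 = novelty, +1 = inlier
```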
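An OCSVM sketch in the same novelty-detection style; the rbf kernel, nu, and tol values are illustrative rather than my tuned settings:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, (200, 4))
X_new = np.vstack([rng.normal(0, 1, (5, 4)),
                   rng.normal(6, 1, (2, 4))])

# nu upper-bounds the fraction of training errors; the kernel and
# tol were the main hyper-parameters I tuned.
oc = OneClassSVM(kernel="rbf", nu=0.05, tol=1e-4).fit(X_train)
print(oc.predict(X_new))                   # -1 = outlier, +1 = inlier
```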
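Finally, a minimal Keras autoencoder sketch. The layer widths and the 95th-percentile reconstruction-error cutoff are illustrative; in my project, keras-tuner searched over such choices:

```python
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (500, 16)).astype("float32")

# Compress to a small latent space, then reconstruct; anomalies
# tend to have a large reconstruction error.
model = keras.Sequential([
    keras.layers.Input(shape=(16,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(3, activation="relu"),    # latent bottleneck
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(16, activation="linear"),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, X, epochs=20, batch_size=32, verbose=0)

err = np.mean((model.predict(X, verbose=0) - X) ** 2, axis=1)
is_anomaly = err > np.quantile(err, 0.95)    # top 5% of errors
print("anomalies:", is_anomaly.sum())
```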