
Long Zhang

About me

Long is currently a senior SRE at Electrolux, working on the observability and reliability of its IoT system. Long holds a Ph.D. degree in software reliability from KTH Royal Institute of Technology, Sweden. His research focuses on self-healing software, chaos engineering, and antifragile systems. During his Ph.D. studies, he was supervised by Professors Martin Monperrus and Benoit Baudry, and funded by the Wallenberg AI, Autonomous Systems and Software Program (WASP). Long received his BE and ME degrees in software engineering from Harbin Institute of Technology, China. Before his Ph.D. studies, Long worked at Tencent as a software developer and project manager, where he was responsible for the design and development of university-enterprise cooperation projects.

My research interests

Chaos engineering, Self-healing software, Antifragile systems

Work and study experience

Senior Site Reliability Engineer at Electrolux (since 2022.10)

Ph.D. student in computer science, KTH Royal Institute of Technology (2018.01 - 2022.11)

Manager of University Relations at Tencent (2015.07 ~ 2017.11)

Assistant Engineer (Intern) at Tencent (2014.07 ~ 2015.07, 2012.07 ~ 2013.07)

Master and bachelor in software engineering, Harbin Institute of Technology, China

Brief Introduction to My Research Work

I mainly focus on the resilience of software systems. It's impossible to predict every failure or unanticipated situation a system will face, especially once it is deployed in production. So it's important to improve a system's resilience, enabling it to withstand perturbations and heal itself.

At the conceptual level, I use chaos engineering to address this problem. Chaos engineering is the practice of experimenting on a distributed system in order to build confidence in the system's capability to withstand unexpected conditions in production. In other words, it means breaking things on purpose. We should change our perspective on errors: instead of trying to prevent them all the time, we trigger them in controlled situations. At the delivery level, I'll implement a software system that can be easily combined with customers' applications. While the chaos system is running, it automatically monitors the target application, conducts chaos experiments, and analyzes the weak points in the application's resilience.
For example, we actively shut down one of the service instances to check whether the routing node can smoothly redirect the traffic to the remaining working ones. Experiments like this give us more chances to analyze and learn from the system's error-handling behavior, and ultimately to improve its resilience.
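
The instance-shutdown experiment can be illustrated with a small, self-contained simulation. This is only a minimal sketch under stated assumptions, not the chaos system described here: the `Instance`, `Router`, and `run_experiment` names are hypothetical, the router is a simple in-memory round-robin with failover, and "shutting down" an instance just flips a flag. It follows the usual shape of a chaos experiment: observe the steady state, inject a fault, then check whether the steady state still holds.

```python
import random


class Instance:
    """Hypothetical in-memory stand-in for one service instance."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.name  # report which instance served the request


class Router:
    """Hypothetical round-robin routing node that fails over on errors."""

    def __init__(self, instances):
        self.instances = instances
        self._next = 0

    def route(self, request):
        # Try each instance at most once, starting at the round-robin cursor.
        for _ in range(len(self.instances)):
            instance = self.instances[self._next]
            self._next = (self._next + 1) % len(self.instances)
            try:
                return instance.handle(request)
            except ConnectionError:
                continue  # redirect the request to the next instance
        raise RuntimeError("no healthy instance available")


def send_traffic(router, num_requests):
    """Send requests and return (list of serving instances, failure count)."""
    served, failures = [], 0
    for i in range(num_requests):
        try:
            served.append(router.route(f"req-{i}"))
        except RuntimeError:
            failures += 1
    return served, failures


def run_experiment(num_instances=3, num_requests=30, seed=0):
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    instances = [Instance(f"instance-{i}") for i in range(num_instances)]
    router = Router(instances)

    # 1. Steady state: all instances are up.
    _, baseline_failures = send_traffic(router, num_requests)

    # 2. Fault injection: shut down one instance on purpose.
    victim = rng.choice(instances)
    victim.alive = False

    # 3. Observation: does the router still serve every request?
    served, failures_after_fault = send_traffic(router, num_requests)

    return {
        "victim": victim.name,
        "baseline_failures": baseline_failures,
        "failures_after_fault": failures_after_fault,
        "served_after_fault": set(served),
    }
```

Calling `run_experiment()` returns a small report. If `failures_after_fault` stays at zero and the victim no longer appears in `served_after_fault`, the simulated router has redirected traffic as hoped; a non-zero failure count would point to a resilience weakness worth investigating.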

Keywords: chaos engineering, fault injection, monitoring, self-healing, reliability, availability, antifragile, software engineering