CSE Colloquium: Challenges and Opportunities in Deploying Parallel Filesystems in the Modern Cloud
Abstract
High-Performance Computing (HPC) has never been a stagnant area to work in, with the preferred languages, libraries, operating systems or distributions, processors, interconnect technologies, and storage media (to name a few) changing constantly. However, while the “how” has constantly evolved, the “where” has almost entirely remained the same through these decades of HPC advancement: on-premise. With the advent and rampant growth of modern clouds, the fiscal and technological rationales for continuing to make huge investments in on-premise HPC systems become more difficult to defend each and every year. However, to enable that “where” to shift from on-prem to the cloud, modern cloud vendors must innovate to deliver the software and hardware ecosystems HPC users have become accustomed to while still enabling the most salient feature of life in the cloud: transiency.
One major component of any HPC solution is a parallel filesystem up to the task of keeping the applications and associated expensive processors well-fed. The most popular incarnation of this is the Lustre Filesystem, which is used by the majority of the supercomputers on the TOP500 list. However, numerous challenges, and with them opportunities, arise when attempting to deploy Lustre into a modern cloud like Microsoft Azure. This talk will highlight the most interesting challenges the Azure Managed Lustre Filesystem team faced while determining how to deploy, monitor, and manage a parallel filesystem in Azure, and will detail both currently implemented and future opportunities for innovation that emerged when the cloud and HPC storage ecosystems collided.
Bio
Ellis Wilson is a Principal Software Engineer Manager on the Azure Managed Lustre Filesystem team at Microsoft in Pittsburgh, Pennsylvania. Prior to joining Microsoft in 2021, he spent a decade at Panasas as Software Architect working on the PanFS parallel filesystem and associated storage appliance. He received his PhD in Computer Science from the Pennsylvania State University under Mahmut Kandemir, having presented a dissertation focused on NAND flash firmware technology, parallel filesystems, and multi-protocol filesystem interactions.
Event Contact: Timothy Zhu