publications
2022
- One Big Happy Family: Sharing the S3 Layer between Ceph, CORTX, and DAOS. Andriy Tkachuk, and Zuhair Alsader. In Proceedings of the 2022 Workshop on Emerging Open Storage Systems and Solutions for Data Intensive Computing, 2022
Object storage has transformed the storage industry. Freed from the complex hierarchical organization of file systems, object storage systems have achieved tremendous growth and scalability in the past two decades. However, object storage systems for Enterprise/Cloud computing and those for High Performance Computing (HPC) have had some differences, the main one being the client interface. Enterprise computing has preferred a GET-PUT interface which, similar to Map-Reduce, has enabled tremendous human productivity by simplifying the interface. In contrast, typical HPC frameworks tend to optimize computational productivity over human productivity, which means they prefer more complex, low-level interfaces with more flexibility and more ability to optimize. Given this, it is no surprise that three object storage systems (Ceph, CORTX, and DAOS) originally motivated by HPC all have similar low-level interfaces (librados, libmotr, and libdaos respectively). However, given the increased convergence of Cloud and HPC, these object storage systems also need to support the industry-standard interface, which has become Amazon’s S3 protocol. In this talk, we will discuss how the Ceph project was the first to add an S3 layer, how they later made it modular so that multiple object backends could share it, and how two small groups of engineers have added modular backends for both CORTX and DAOS.
2020
- Optimizing MPI Collective Operations for Cloud Deployments. AlSader, Zuhair. 2020
Cloud infrastructures are increasingly being adopted as a platform for high performance computing (HPC) science and engineering applications. For HPC applications, the Message-Passing Interface (MPI) is widely used. Among MPI operations, collective operations are the most I/O-intensive and performance-critical. However, classical MPI implementations are inefficient on cloud infrastructures because they are implemented at the application layer using network-oblivious communication patterns. These patterns do not differentiate between local and cross-rack communication and hence do not exploit the inherent locality between processes collocated on the same node or the same rack of nodes. Consequently, they can suffer from high network overheads when communicating across racks. In this thesis, we present COOL, a simple and generic approach for MPI collective operations. COOL enables highly efficient designs for collective operations in the cloud. We then present a system design based on COOL that describes how to implement frequently used collective operations. Our design efficiently uses the intra-rack network while significantly reducing cross-rack communication, thus improving application performance and scalability. We use software-defined networking capabilities to build more efficient network paths for I/O-intensive collective operations. Our analytic evaluation shows that our design significantly reduces the network overhead across racks. Furthermore, when compared with OpenMPI and MPICH, our design reduces the latency of collective operations by a factor of log N, where N is the total number of processes, decreases the number of exchanged messages by a factor of N, and reduces the network load by up to an order of magnitude. These significant improvements come at the cost of a small increase in the computation load on a few processes.
2018
- COOL: A Cloud-Optimized Structure for MPI Collective Operations. Mohammed Alfatafta, Zuhair AlSader, and Samer Al-Kiswany. In 2018 IEEE 11th International Conference on Cloud Computing (CLOUD), Jul 2018
We present COOL, a simple and generic structure for MPI collective operations. COOL enables highly efficient designs for all collective operations in the cloud. We then present a system design based on COOL that implements frequently used collective operations. Our design efficiently uses the intra-rack network while minimizing cross-rack communication, thus improving application performance and scalability. We use recent software-defined networking capabilities to build optimal network paths for I/O-intensive collective operations. Our analytical evaluation shows that our design imposes the least possible network overhead across racks. Furthermore, when compared with OpenMPI and MPICH, our design reduces the number of steps to only three, decreases the number of exchanged messages by a factor of N, where N is the total number of processes, and reduces the network load by up to an order of magnitude. These significant improvements come at the cost of a modest increase in the computation load on a few processes.