June 13 at IMT Palaiseau

6th Edition of the Workshop

Performance and Scalability of Storage Systems

Registration

Per3S is a French-speaking workshop on the performance of storage systems and their scalability.

Data storage as addressed at Per3S thus covers the HPC world as well as Cloud architectures oriented toward extreme scalability. The talks address application-level, system-level, or architectural topics. This 6th edition of Per3S aims to bring together, for one day, junior and senior researchers from academia and industry to discuss these problems.

Overview

50 Participants

Previous editions of Per3S have managed to aggregate a small community of researchers and industry practitioners active on storage topics, with a capacity of 60 participants.

4 sessions

The day's program includes four sessions: one on scalability topics for data storage, one on Cloud-related aspects, a third on the low-level considerations that enable scaling, and finally a session dedicated to the work of PhD students and young researchers.

15 Presentations

Including 5 flash presentations by PhD students: a good panorama of storage research activities in France in a single day.

Program

The workshop spans a full day, from 9h to 17h30, with four sessions:

  • 9h-9h30 Welcome
  • 9h30-12h Storage at Scale
  • 12h-12h30 PhD Students' Flash Presentations
  • 12h30-14h ** Buffet and Posters **
  • 14h-15h30 Cloud and Storage
  • 15h30-16h ** Coffee Break **
  • 16h-17h30 Low-level and Scale Enablers

9h30-12h Storage at Scale

9h30 Amphi Thévenin

Towards a Better Understanding and Evaluation of Tree Structures on SSDs

Radu Stoica, IBM Zurich

Solid-state drives (SSDs) are extensively used to deploy persistent data stores, as they provide low latency random access, high write throughput, high data density, and low cost. Tree-based data structures are widely used to build persistent data stores, and indeed they lie at the backbone of many of the data management systems used in production and research today. In this paper, we show that benchmarking a persistent tree-based data structure on an SSD is a complex process, which may easily incur subtle pitfalls that can lead to an inaccurate performance assessment. At a high-level, these pitfalls stem from the interaction of complex software running on complex hardware. On one hand, tree structures implement internal operations that have nontrivial effects on performance. On the other hand, SSDs employ firmware logic to deal with the idiosyncrasies of the underlying flash memory, which are well known to lead to complex performance dynamics. We identify seven benchmarking pitfalls using RocksDB and WiredTiger, two widespread implementations of an LSM-Tree and a B+Tree, respectively. We show that such pitfalls can lead to incorrect measurements of key performance indicators, hinder the reproducibility and the representativeness of the results, and lead to suboptimal deployments in production environments. We also provide guidelines on how to avoid these pitfalls to obtain more reliable performance measurements, and to perform more thorough and fair comparison among different design points.
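As an illustration of the kind of pitfall the abstract alludes to (this sketch is ours, not from the paper): reporting only an average hides the periodic stalls caused by LSM compactions or SSD garbage collection, so tail percentiles must be measured as well.

```python
# Hypothetical illustration: a mean latency can look healthy while the
# tail reveals background activity (compaction, garbage collection).
def summarize(latencies_us):
    """Return (mean, p99) latency in microseconds for a list of samples."""
    s = sorted(latencies_us)
    mean = sum(s) / len(s)
    p99 = s[min(len(s) - 1, int(0.99 * len(s)))]
    return mean, p99

# 99 fast operations at 100 us, plus one 50 ms stall (e.g. a compaction)
samples = [100] * 99 + [50_000]
mean, p99 = summarize(samples)
# The mean (599 us) looks acceptable; the p99 (50 ms) tells the real story.
```

This is exactly why a benchmark must run long enough to reach steady state and report distributions, not just averages.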

10h00 Amphi Thévenin

IO-SETS: Simple and efficient approaches for I/O

Francieli Zanon Boito, Inria Bordeaux

One of the main performance issues faced by high-performance platforms is the delay caused by concurrent applications performing I/O. When this happens, applications sharing the I/O bandwidth create congestion, which delays the execution time and affects the platform’s overall performance. Several solutions have been proposed to tackle I/O congestion. Among those solutions, I/O scheduling strategies are an essential solution to this problem in HPC systems. Currently, their main drawback is the amount of information needed. In this work, we propose a novel framework for I/O management, IO-SETS. We present the potential of this framework through a simple scheduling solution called SET-10, an I/O scheduling solution that demands minimum information from users and can be easily integrated into HPC environments.
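A minimal sketch of the set-assignment idea, assuming (as we understand SET-10) that an application's set is the order of magnitude of its mean time between I/O phases; the actual policy also assigns per-set priorities, which this sketch omits.

```python
import math

def set10(mean_time_between_io_s):
    """SET-10 sketch: map an application to a set by the base-10 order of
    magnitude of its mean time between I/O phases -- the only input needed."""
    return math.floor(math.log10(mean_time_between_io_s))

# Applications whose I/O periods are within 10x of each other share a set;
# applications in different sets are handled independently.
apps = {"A": 5.0, "B": 8.0, "C": 120.0}
sets = {name: set10(t) for name, t in apps.items()}
# A and B land in set 0, C in set 2
```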

10h30 Amphi Thévenin

Using metadata matching for prefetching strategy auto-optimization

Sophie Robert-Hayek, Atos

As High Performance Computing systems become more and more complex, optimal performance can only be obtained through self-adaptive behavior. This often requires some knowledge of incoming applications, even before the application is launched, in order to customize runtime environments to the particular needs of this application. In this talk, we discuss the architecture of an auto-tuner which relies on metadata matching to match an incoming application with a database of already known applications, to predict their I/O behavior and customize the runtime parameters of a prefetching strategy. This system has been implemented in the case of a scenario representative of conditions we expect to encounter in production, where we show an improvement of 28% in terms of I/O performance compared to the default parametrization for a set of benchmarks. Besides, we demonstrate that this tuner's architecture provides a negligible overhead when used in a high-traffic scenario, making it resilient for parallel use on a production HPC cluster.
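A hypothetical sketch of the matching step (the metadata keys and the 0.5 threshold are our inventions, not the talk's implementation): an incoming job's metadata is compared against a database of known applications, and the best match's tuned prefetch parameters are reused.

```python
def match_score(meta_a, meta_b):
    """Fraction of metadata fields (hypothetical keys) on which two jobs agree."""
    keys = set(meta_a) | set(meta_b)
    return sum(meta_a.get(k) == meta_b.get(k) for k in keys) / len(keys)

def tune(incoming, known_apps, default_params):
    """Pick the prefetch parameters of the best-matching known application,
    falling back to defaults when nothing matches well enough."""
    best, score = None, 0.0
    for app in known_apps:
        s = match_score(incoming, app["meta"])
        if s > score:
            best, score = app, s
    return best["prefetch"] if best and score >= 0.5 else default_params

known = [{"meta": {"exe": "lmp", "nodes": 16}, "prefetch": {"window_mb": 64}}]
params = tune({"exe": "lmp", "nodes": 16}, known, {"window_mb": 8})
# A perfect match, so the tuned window (64 MB) is reused
```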

11h00 Amphi Thévenin

From small files to no files

Marco Aldinucci, University of Torino, Italy

Modern distributed high-performance storage systems saturate the network bandwidth, and the margins for improvement at the software level are tiny. Due to metadata access, they might be troubled with massive access to small files. An example is the Software Heritage (SH) dataset, half petabytes of files with an average size of 3kBytes (Terabytes of metadata). While working with SH, we developed the idea of substituting files with in-memory streams. We did it living in dread with the fear of asking application programmers to rewrite their lovely antique legacy code exploiting the POSIX interface, and up to now, we did not. In the talk, we will introduce CAPIO (Cross-Application Programmable I/O) design principles and the current state of development of the prototype.

11h30 Amphi Thévenin

Investigating IO bottlenecks with EZTrace

François Trahay, IMT-Palaiseau

Identifying the part of a distributed application that is affecting performance is complex. EZTrace helps developers investigate performance issues by generating execution traces of parallel applications. We show that EZTrace can locate the process that suffers from I/O contention, and how IOTracer helps identify the precise origin of the bottleneck in the I/O stack.


12h-12h30 Flash Presentations of the PhD Students' Posters

12h00 Amphi Thévenin

Sharing I/O nodes between applications

Alexis Bandet

With the upcoming arrival of exascale machines, the rise of Big Data, and the growth of machine learning in computing centers, the gap between compute performance and storage-system performance is an ever-growing obstacle to maximizing machine efficiency. An I/O forwarding layer, in software and hardware, coordinates I/O flows between the compute nodes and the file system. We propose an approach for sharing the resources of this I/O forwarding layer using greedy algorithms, transparent to the user and requiring minimal a priori knowledge of the applications. With this approach, we can halve the number of I/O nodes needed while having only a small impact on system performance.
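A toy illustration of greedy I/O-node sharing (our sketch, not the talk's algorithm): each application is assigned to the currently least-loaded I/O node, largest demands first.

```python
def assign_greedy(apps, n_ionodes):
    """Greedy sharing sketch: give each application (name, bandwidth demand)
    to the currently least-loaded I/O node, processing big demands first."""
    load = [0.0] * n_ionodes
    placement = {}
    for name, demand in sorted(apps, key=lambda a: -a[1]):
        target = load.index(min(load))  # least-loaded node so far
        placement[name] = target
        load[target] += demand
    return placement, load

apps = [("A", 6.0), ("B", 3.0), ("C", 2.0), ("D", 1.0)]
placement, load = assign_greedy(apps, 2)
# A -> node 0; B, C, D -> node 1; both nodes end up with a load of 6.0
```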

12h05 Amphi Thévenin

Pythia: Runtime decisions based on prediction

Alexis Colin, Télécom SudParis

Runtime systems are commonly used by parallel applications in order to efficiently exploit the underlying hardware resources. A runtime system hides the complexity of the hardware and exposes a high-level interface to application developers. To exploit the hardware efficiently, a runtime system makes decisions based on heuristics that estimate the future behavior of the application. We propose Pythia, a library that serves as an oracle able to predict the future behavior of an application, so that a runtime system can make wiser decisions. Pythia relies on the deterministic nature of many HPC applications: by recording an execution trace, Pythia captures the application's main behavior. The trace can then be provided for future executions of the application, and a runtime system can ask for predictions of the program's future behavior.
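The record-then-predict idea can be sketched as follows (a toy model; Pythia's actual trace format and API differ):

```python
class Oracle:
    """Toy trace-based oracle: assuming the application is deterministic,
    an event sequence recorded from one run predicts the next run."""
    def __init__(self, trace):
        self.trace = trace
        self.pos = 0

    def predict_next(self):
        """Return the event expected next, or None past the end of the trace."""
        if self.pos < len(self.trace):
            return self.trace[self.pos]
        return None

    def observe(self, event):
        """Advance while the run follows the recorded trace; on divergence,
        stop predicting rather than guess."""
        if self.predict_next() == event:
            self.pos += 1
        else:
            self.pos = len(self.trace)

oracle = Oracle(["open", "read", "read", "close"])
oracle.observe("open")
# the oracle now predicts "read" as the next event
```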

12h10 Amphi Thévenin

Revisiting storage virtualization for modern devices

Damien Thenot, VATES, Télécom SudParis

For decades, the performance cost of virtualizing a hard drive was acceptable because a hard drive was orders of magnitude slower than the virtualization layer. In recent years, however, new storage devices with low latency and high throughput have emerged (e.g. NVMe). These devices make old storage virtualization techniques obsolete, because the virtualization layer prevents a virtual machine from fully leveraging modern storage capabilities. Nowadays the best performance is obtained by passing a full disk through to a virtual machine. Unfortunately, passthrough rules out important functionality such as Virtual Disk Image (VDI) snapshots, virtual machine migration to another host, sharing a storage device with other VMs, and anything else where the host needs access to the disk. New ways of virtualizing storage devices therefore need to be developed, combining the performance that the new technologies bring with the functionality that virtualization offers. In this presentation, I will introduce the current I/O software stack in XCP-ng, a Xen-based virtualization platform, and discuss the challenges encountered while trying to enhance the storage software stack of a hypervisor.

12h15 Amphi Thévenin

Design of an I/O-efficient random forest training algorithm

Camélia Slimani, Laboratoire des sciences et techniques de l'information, de la communication et de la connaissance

Random forests are a supervised classification method. The principle is to train on several subsets of the training dataset separately, building one tree per subset. Each tree classifies the elements of its subset according to a number of properties of the dataset, so that a leaf of the tree forms a "more or less" pure class (all elements in the leaf belong to the same class). Training each tree consists of selecting the dataset properties that yield the purest possible leaf nodes. This amounts to partitioning the tree's elements by each property to determine the best one, and therefore to scanning the data subset several times. In a memory-constrained context, where the dataset is larger than the available memory, this causes parts of the dataset to be swapped out to secondary storage and thus generates a large volume of I/O, especially when the dataset has many properties. In our experiments, I/O time represents up to 90% of the total execution time for a dataset twice as large as the available memory. In this work, we revisit the random forest algorithm to make it less I/O-hungry. The proposed method is based on two mechanisms: the first reorganizes the dataset to increase spatial locality; the second loads data on demand, only when it is needed to process a tree node. Our evaluation shows that the time to build a random forest is reduced by 70% on average.
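A simplified sketch of the on-demand loading mechanism (our illustration; the column accessor, the mean-valued threshold, and the impurity proxy are all stand-ins): only the feature column currently being evaluated is fetched from storage, instead of keeping the whole dataset resident.

```python
def best_feature_on_demand(load_column, feature_names, labels):
    """Evaluate candidate split features one column at a time, via
    load_column (a hypothetical storage-backed accessor), and keep the best."""
    def impurity_after_split(col, threshold):
        # Crude purity proxy: fraction of samples whose side of the split
        # disagrees with that side's majority label (smaller is purer).
        left = [l for v, l in zip(col, labels) if v <= threshold]
        right = [l for v, l in zip(col, labels) if v > threshold]
        err = 0
        for side in (left, right):
            if side:
                majority = max(set(side), key=side.count)
                err += sum(1 for l in side if l != majority)
        return err / len(labels)

    best, best_err = None, float("inf")
    for name in feature_names:
        col = load_column(name)       # only this column touches storage
        thr = sum(col) / len(col)     # naive threshold: the column mean
        err = impurity_after_split(col, thr)
        if err < best_err:
            best, best_err = name, err
    return best

# Tiny in-memory stand-in for the storage layer
data = {"x": [1, 2, 8, 9], "y": [5, 1, 5, 1]}
labels = ["a", "a", "b", "b"]
best = best_feature_on_demand(lambda n: data[n], ["x", "y"], labels)
# "x" separates the two classes perfectly at its mean, so it wins
```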

12h20 Amphi Thévenin

Data placement in a federated cloud based on hybrid storage systems

Amina Chikhaoui, Université des Sciences et de la Technologie Houari Boumediene, UBO

To manage clients' I/O performance, current cloud infrastructures host different storage classes with different performance and price characteristics, while, to optimize network latency, cloud service providers (CSPs) try to dynamically move objects closer to their users. From this perspective, efficiently placing client objects for a cloud that is part of a federation is a real challenge. We use a cost model subject to various constraints related to the heterogeneity of local and federated storage classes and services, the clients' workloads, and their SLAs. To solve the resulting multi-objective problem, we developed CDP-NSGAIIIR (a Constrained Data Placement matheuristic based on NSGA-II with Injection and Repair functions), a data placement matheuristic that adds injection and repair functions to NSGA-II. The injection function aims to improve the quality of NSGA-II's solutions: it computes solutions with an exact method and injects them into NSGA-II's initial population. The repair function guarantees that solutions obey the problem's constraints, thus avoiding the exploration of large sets of infeasible solutions.

12h25 Amphi Thévenin

Investigating allocation of heterogeneous storage resources on HPC systems

Julien Monniot, Inria Rennes

The ability of large-scale infrastructures to store and retrieve a massive amount of data is now decisive to scale up scientific applications. However, there is an ever-widening gap between I/O and computing performance: over the last ten years, the ratio of I/O bandwidth to computing power has been divided by ~10 on the top 3 supercomputers of the Top500. A way to mitigate this gap consists of deploying new intermediate storage tiers (node-local storage, burst-buffers, ...) and their underlying technologies (NVMeoF, NVRAM, ...) between the compute nodes and the traditional global shared parallel file-system. Unfortunately, without advanced techniques to allocate and size these resources, they remain underutilized. To address this problem, we investigate how heterogeneous storage resources can be allocated on an HPC platform, in a similar way as compute resources. In that regard, we introduce StorAlloc, a simulator used as a testbed for assessing storage-aware job scheduling algorithms and evaluating various storage infrastructures.
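The kind of allocation question StorAlloc explores can be illustrated with a deliberately naive first-fit policy (our sketch, not a StorAlloc algorithm): jobs request burst-buffer capacity, and each request is placed on the first tier with room left.

```python
def first_fit(requests_gb, tiers):
    """Toy storage-aware allocator: place each job's burst-buffer request
    on the first tier [name, free capacity in GB] that can hold it."""
    alloc = {}
    for job, need in requests_gb:
        for tier in tiers:
            name, free = tier
            if free >= need:
                tier[1] = free - need  # consume capacity on this tier
                alloc[job] = name
                break
        else:
            alloc[job] = None  # rejected: would fall back to the global PFS
    return alloc

tiers = [["bb-node", 100], ["nvme-pool", 500]]
alloc = first_fit([("j1", 80), ("j2", 50), ("j3", 600)], tiers)
# j1 -> bb-node, j2 -> nvme-pool, j3 rejected (None)
```

A simulator like StorAlloc exists precisely to compare such policies against smarter, topology-aware ones before committing to an infrastructure design.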


14h-15h30 Cloud and Storage

14h00 Amphi Thévenin

Object Storage - What shirt suits you?

Charlotte Letamendia, OVHCloud

OVHcloud simplifies Object Storage data management with three new S3 storage tiers designed around use cases: AI & media, web & backup, and archive. OVHcloud will present the product line, along with best practices for managing data that bring efficiency, cost optimization, and better resiliency.

14h30 Amphi Thévenin

Time to Revisit Erasure Codes in Data-intensive Clusters

Shadi Ibrahim, Inria

Replication has been successfully employed and practiced to ensure high data availability in large-scale distributed storage systems. However, with the relentless growth of generated and collected data, replication has become expensive not only in terms of storage cost but also in terms of network cost and hardware cost. Traditionally, erasure coding (EC) is employed as a cost-efficient alternative to replication when high access latency to the data can be tolerated. However, with the continuous reduction in its CPU overhead, EC is performed on the critical path of data access. For instance, EC has been integrated into the last major release of Hadoop Distributed File System (HDFS) which is the primary storage backend for data analytic frameworks such as Hadoop and Spark. This talk explores some of the potential benefits of erasure coding in data-intensive clusters and discusses aspects that can help to realize EC effectively for data-intensive applications.
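The storage-cost argument is easy to make concrete. For instance, HDFS's erasure coding ships with a Reed-Solomon RS(6,3) policy, which tolerates the loss of any 3 blocks for 1.5x raw storage, versus 3x for 3-way replication:

```python
def storage_overhead(data_blocks, parity_blocks):
    """Raw bytes stored per byte of user data under RS(k, m) erasure coding."""
    return (data_blocks + parity_blocks) / data_blocks

replication = 3.0             # 3-way replication: 3x raw storage, survives 2 losses
ec = storage_overhead(6, 3)   # RS(6, 3): 1.5x raw storage, survives any 3 losses
savings = 1 - ec / replication  # EC needs half the raw capacity replication does
```

The trade-off, as the talk discusses, is the CPU cost of encoding/decoding and the network cost of degraded reads, which is why EC on the critical path only became attractive recently.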

15h00 Amphi Thévenin

Leaderless State-Machine Replication: An Overview

Pierre SUTRA, Télécom SudParis

Modern cloud applications replicate their data in multiple geographical locations and require strong consistency guarantees for their most critical data. These guarantees are usually provided via state-machine replication (SMR). Recent advances in SMR have focused on leaderless protocols, which improve the availability and performance of traditional Paxos-based solutions. This talk offers an overview of this new generation of replication protocols.


16h-17h30 Low-level and Scale Enablers

16h00 Amphi Thévenin

Phobos: a scale-out object store implementing tape library support

Thomas Leibovici, CEA, Patrice Lucas, CEA - Philippe Deniel, CEA

Phobos is an open-source parallel object store designed to manage large volumes of data. It can manage various kinds of storage, from SSD devices to tape libraries. Phobos is developed at CEA, where it has been in production since 2016 to manage the many petabytes of the France Genomique dataset, hosted in the TGCC compute center. Very large datasets are handled efficiently on inexpensive media without sacrificing scalability, performance, or fault-tolerance requirements. Phobos offers different layouts, such as mirrored double writes and erasure coding. I/O on magnetic tapes is optimized through dedicated scheduling policies applied when allocating storage resources. Phobos natively supports the control of tape libraries over SCSI and relies on well-known standards (such as writing to tapes via LTFS) to avoid any dependency on a proprietary format. It provides several interfaces, including S3 and the ability to act as an HSM backend for Lustre in a Lustre/HSM configuration. Its API also makes it easy to add other front-ends, including an interface presenting data as a POSIX filesystem (under development). This talk presents the design of Phobos, intended for use in an exascale context, as well as some future use cases it can address effectively.

16h30 Amphi Thévenin

Handling IO data with PDI and Optimizing away IO with PDI/Deisa

Amal GUEROUDJI, Maison de la Simulation

In this talk, we present the PDI Data Interface, designed to handle HPC simulation data in a flexible way. PDI separates data generation from data management through a plugin system (HDF5, NetCDF, FTI, etc.), enabling I/O optimization experts to adopt the best solution in each situation without modifying or even recompiling simulation codes. Because, from a performance point of view, the best I/O is the I/O you don't do, PDI also offers plugins for in situ data handling (Melissa, Sensei, etc.). Such plugins reduce the quantity of data that is written and fully leverage the HPC platform. We focus on the Deisa plugin, which interfaces MPI simulations with Dask-based in situ data analytics.

17h00 Amphi Thévenin

Spintronics: from device to system for low-power, reliable applications and non-conventional computing

Guillaume PRENAT, Université Grenoble Alpes, CEA-Grenoble

In this talk, we will present how spintronics can contribute to pushing forward the limits of microelectronics scaling. We will first introduce this emerging non-volatile technology and explain how the standard design flow has to be adapted to integrate spintronic devices. Then, we will give examples of hybrid CMOS/spintronics circuits used for specific purposes, including low power consumption, reliability improvement, and non-conventional computing.

Organizers

Per3S has been run for six years by a Steering Committee, plus a local committee that changes with each edition.

Registration (Mandatory)

Access to the IMT site is controlled; registration is mandatory.

Leaving us your email address allows us to keep you informed about logistics (room changes, etc.). Emails will not be kept after the conference (GDPR).

Getting There

The workshop will take place in the new Institut Mines-Télécom building in Palaiseau. For architecture enthusiasts: this building earned its designers the 2020 Pritzker Prize.

Institut Mines Telecom Palaiseau

19 place Marguerite Perey, 91120 Palaiseau

François's phone: 06 49 62 61 79 (in case of problems)

Access map

Our Sponsors

Per3S warmly thanks its sponsors: