The scope of storage covered by Per3S includes HPC systems as well as Cloud architectures, the common point between both being scalability. Presentations and talks focus on applications, systems, or architectures. This 7th edition aims to gather, during one day, researchers from academia and industry, experienced or junior, as well as storage users and customers, with the sole purpose of exchanging ideas and fostering the community.
Per3S is a workshop aiming to bring together the scientific and technological storage community to discuss and address issues and challenges associated with performance and data operations at scale. These topics cover HPC storage as well as Cloud-oriented architectures, both sharing the need for extreme scale.
Per3S fully encourages young researchers to present their work by submitting an abstract. The abstract can relate to original or on-going work with fresh problems and solutions, or to work already submitted and/or accepted at an international conference, in order to be the subject of discussions. Previous editions of Per3S have successfully fostered a community of researchers from both academia and industry working on storage technologies. The audience is around 50 people.
The program is organized around three sessions: the first is dedicated to Cloud, storage technologies and data management; the second focuses on posters for interactive discussion; and the third is centered on HPC storage technologies, and Lustre in particular.
Each poster comes with an additional flash presentation. Within a single day, attendees get a comprehensive overview of storage activities in France.
The Per3S workshop spans a full day from 9:00 to 17:30, with a total of three sessions:
In 2006, British mathematician Clive Humby described data as "the new oil". Later on, with the growth of the Web, it became apparent how prophetic this statement was. In order to derive value from data, businesses allocate significant resources towards storing and managing it. Centralized storage has been the prevailing method for storing data for many years, where a third-party provider manages and stores data on large servers. In such a storage system, no matter how stringent the security measures are, the data center authority holds the encryption keys for personal, sensitive, and business information placed in their custody. This raises potential concerns for accessibility, transparency, privacy, sovereignty and control. Moreover, a single point of failure presents an easier and more rewarding target for hackers attempting to gain access to a significant amount of data. Decentralized storage solutions were developed to address these issues, among others. They offer a secure and resilient way to store data by distributing storage responsibilities among multiple participants. This approach leverages a network of storage devices contributed by independent users as a shared pool of storage space, while ensuring the persistence and availability of the stored data even when individual peers in the network are unreliable. In this talk we give an overview of the architecture of decentralized cloud storage systems, discuss their advantages and limitations compared to traditional storage platforms, and present the challenges to be addressed to build such a solution.
Graphs are widely used as an analysis tool to model interactions in various domains, such as social networks, finance, and computer security. However, many graphs are subject to continuous or sporadic changes, and analyzing their history is often more relevant than analyzing a single static state. Managing the history of these graphs unlocks many query capabilities, such as reconstructing the state of the graph before a system failure, thereby helping to prevent the causes of malfunction. In this presentation, we focus on the use of Thing'in, a graph management platform developed by Orange. Thing'in is used to manage an IoT graph in which the nodes mostly represent connected objects and the edges represent the connections between them. Studying the history of the Thing'in graph allows its clients to better analyze the underlying systems, notably in smart factories for detecting the causes of system failures and tracking product paths along the manufacturing chain. Our main objective is thus to design a temporal graph management system and integrate it into the Thing'in platform. To this end, we developed Clock-G, a temporal graph management system, by addressing various challenges such as the query language, query optimization, and graph storage. This system stores graphs compactly and allows their history to be queried intuitively using a query language. Moreover, our query evaluator uses the history of the data already inserted into Clock-G to choose
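As an illustration of the kind of point-in-time reconstruction described above, here is a minimal, hypothetical Python sketch of a temporal edge store with validity intervals; it is not Clock-G's actual storage layout or query language.

```python
from collections import defaultdict

class TemporalGraph:
    """Toy temporal edge store: each edge carries one or more validity intervals."""
    def __init__(self):
        self.edges = defaultdict(list)  # (src, dst) -> list of (t_start, t_end)

    def add_edge(self, src, dst, t_start, t_end=float("inf")):
        self.edges[(src, dst)].append((t_start, t_end))

    def snapshot(self, t):
        """Reconstruct the set of edges that were valid at time t."""
        return {e for e, intervals in self.edges.items()
                if any(start <= t < end for start, end in intervals)}

g = TemporalGraph()
g.add_edge("sensor-1", "gateway-A", t_start=0, t_end=50)   # connection later removed
g.add_edge("sensor-1", "gateway-B", t_start=50)            # still valid
print(g.snapshot(10))   # {('sensor-1', 'gateway-A')}
print(g.snapshot(60))   # {('sensor-1', 'gateway-B')}
```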
During this talk, we will detail, with a concrete use case and demonstration, how to leverage cloud infrastructures to build a meteorological report map. We will present three classes of object storage (high performance, standard and archive) and explain how to select the right class of storage, balancing resiliency, performance and price over the long lifecycle of the data.
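As a hedged illustration of lifecycle-based class selection, the sketch below uses boto3 against a hypothetical S3-compatible endpoint; the endpoint, bucket name and storage-class identifiers are assumptions (AWS-style names), and the provider's actual class names may differ.

```python
import boto3

# Hypothetical S3-compatible endpoint and bucket; storage class names are
# illustrative AWS-style identifiers, the provider's actual classes may differ.
s3 = boto3.client("s3", endpoint_url="https://s3.example-cloud.com")

s3.put_bucket_lifecycle_configuration(
    Bucket="weather-reports",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "cool-down-old-reports",
            "Filter": {"Prefix": "reports/"},
            "Status": "Enabled",
            # Move objects to a cheaper class after 30 days, then expire them.
            "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
            "Expiration": {"Days": 365},
        }]
    },
)
```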
As Product Manager, Charlotte Letamendia drives the roadmap of the storage services managed by OVHcloud, the European cloud leader. Exchanging daily with users of cloud infrastructure, IT administrators, devops and developers, she passionately debates needs and products. Deployment at scale, performance, agility, cost, scalability, resilience: so many challenges to confront with market trends, standards, and innovations. Her role as product manager is, humbly, to carry the voice of the users and rally the research and development teams, to turn that vision and those innovations into a reality. Charlotte is an engineer by training, in telecommunications (ENST Paris), and has more than 15 years of experience in product management in the high-growth tech industry, with specific experience in digital video, media devices, and cloud infrastructure. When not talking about products, Charlotte is passionate about outdoor sports and adventures; you will find her in the mountains or on a bicycle.
Stochastic gradient descent (SGD) is the most prevalent algorithm for training Deep Neural Networks (DNNs). SGD iterates over the input data set in each training epoch, processing data samples in a random-access fashion. Because this puts enormous pressure on the I/O subsystem, the most common approach to distributed SGD in HPC environments is to replicate the entire dataset to node-local SSDs. However, due to rapidly growing data set sizes, this approach has become increasingly infeasible. Surprisingly, the questions of why and to what extent random access is required have not received much attention in the literature from an empirical standpoint. In this work, we revisit data shuffling in DL workloads to investigate the viability of partitioning the dataset among workers and performing only a partial distributed exchange of samples in each training epoch. Through extensive experiments on up to 2,048 GPUs of ABCI and 4,096 compute nodes of Fugaku, we demonstrate that in practice the validation accuracy of global shuffling can be maintained when carefully tuning the partial distributed exchange. We provide a solution implemented in PyTorch that enables users to control the proposed data exchange scheme.
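The following is a minimal single-process simulation of such a partial exchange, not the authors' PyTorch implementation: each worker keeps its own partition and only a fraction `p` of samples is pooled, shuffled and redistributed each epoch (in a real deployment this pool exchange would be a collective communication between workers).

```python
import numpy as np

def partial_exchange(partitions, p, rng):
    """Each worker contributes a fraction p of its (locally shuffled) sample
    indices to a shared pool; the pool is shuffled and redistributed evenly."""
    contributed, kept = [], []
    for part in partitions:
        rng.shuffle(part)                     # local shuffle within the worker
        k = int(p * len(part))
        contributed.append(part[:k])
        kept.append(part[k:])
    pool = np.concatenate(contributed)
    rng.shuffle(pool)
    chunks = np.array_split(pool, len(partitions))
    return [np.concatenate([kept[w], chunks[w]]) for w in range(len(partitions))]

rng = np.random.default_rng(0)
indices = rng.permutation(100_000)            # sample indices of the dataset
partitions = np.array_split(indices, 8)       # 8 workers, disjoint partitions
for epoch in range(5):
    partitions = partial_exchange(partitions, p=0.25, rng=rng)
    # each worker would then iterate over its own partition for this epoch
```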
Edge computing promises to extend Clouds by moving computation close to data sources to facilitate short-running and low-latency applications and services. Providing fast and predictable service provisioning time presents a new and mounting challenge as the scale of Edge-servers grows and the heterogeneity of networks between them increases. This work is driven by a simple question: can we place container images across Edge-servers in such a way that an image can be retrieved by any Edge-server quickly and in a predictable time? We propose two novel container image placement algorithms based on k-Center optimization. In particular, we introduce a formal model to tackle the problem of reducing the maximum retrieval time of container images. Based on the model, we present KCBP and KCBP-WC, two placement algorithms which target reducing the maximum retrieval time of container images to any Edge-server. Evaluations using trace-driven simulations demonstrate that KCBP and KCBP-WC can be applied to various network configurations and reduce the maximum retrieval time of container images.
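For readers unfamiliar with k-Center, the sketch below shows the classical greedy (farthest-first) 2-approximation on a toy distance matrix; it illustrates the underlying optimization only and is not the KCBP or KCBP-WC algorithm itself.

```python
import numpy as np

def greedy_k_center(dist, k):
    """Classic farthest-first traversal, a 2-approximation for k-Center.
    `dist` is an n x n matrix of pairwise retrieval costs between servers;
    returns the indices of the k chosen centers (replica holders)."""
    centers = [0]                              # arbitrary first center
    closest = dist[0].copy()                   # distance to the nearest center
    for _ in range(k - 1):
        nxt = int(np.argmax(closest))          # server currently worst served
        centers.append(nxt)
        closest = np.minimum(closest, dist[nxt])
    return centers, closest.max()              # centers and max retrieval cost

rng = np.random.default_rng(1)
pts = rng.random((20, 2))                      # 20 Edge-servers in a toy "latency space"
dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
centers, worst = greedy_k_center(dist, k=3)
print(centers, round(worst, 3))
```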
The world of computational science faces a paradox due to the separation of HPC and Cloud computing. HPC users seek access to the data analysis and automation tools provided by Cloud technologies, while Cloud users desire the powerful resources of HPC centers. One way to address these requirements is by partitioning the infrastructure into HPC and Cloud stacks. This, however, can lead to increased maintenance costs and decreased utilization. Additionally, transferring data between different computing systems can be a bottleneck, since the amount of data generated by scientific simulations and analyses continues to grow exponentially. This talk will present a unified architecture for the collocation of HPC and Cloud applications on the same hardware. Our solution exploits Kubernetes as the runtime manager, Slurm for resource allocation, and Singularity as a rootless containerization engine. The proposed architecture will be demonstrated through two use cases that utilize visual workflows (Argo). The first use case involves a multi-stage genome analysis workflow that stores data on an on-demand object storage backend (S3), while the second use case integrates MPI codes and post-analysis stages into a fully automated workflow.
Magnetic tapes are often considered as an outdated storage technology, yet they are still used to store huge amounts of data. Their main interests are a large capacity and a low price per gigabyte, which come at the cost of a file access time much larger than on disks. With tapes, finding the right ordering of multiple file accesses is thus key to performance. Moving the reading head back and forth along a kilometer long tape has a non-negligible cost and unnecessary movements thus have to be avoided. However, the optimization of tape request ordering has rarely been studied in the scheduling literature, much less than I/O scheduling on disks. For instance, minimizing the average service time for several read requests on a linear tape remains an open question. Therefore, in this paper, we aim at improving the quality of service experienced by users of tape storage systems, and not only the peak performance of such systems. To this end, we propose a reasonable polynomial-time exact algorithm while this problem and simpler variants have been conjectured NP-hard. We also refine the proposed model by considering U-turn penalty costs accounting for inherent mechanical accelerations. Then, we propose low-cost variants of our optimal algorithm by restricting the solution space, yet still yielding an accurate suboptimal solution. Finally, we compare our algorithms to existing solutions from the literature on logs of the mass storage management system of a major datacenter. This allows us to assess the quality of previous solutions and the improvement achieved by our low-cost algorithm. Aiming for reproducibility, we make available the complete implementation of the algorithms used in our evaluation, alongside the dataset of tape requests that is, to the best of our knowledge, the first of its kind to be publicly released.
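A toy brute-force example, ignoring U-turn penalties and read times, illustrates how much the chosen ordering affects the average service time on a linear tape; this is for intuition only and is unrelated to the paper's polynomial-time algorithm.

```python
from itertools import permutations

# Toy model: files lie at positions on a linear tape, the head starts at 0 and
# moves at unit speed; the service time of a request is the time at which its
# file has been reached.
positions = {"A": 120, "B": 30, "C": 200, "D": 45}

def avg_service_time(order):
    head, t, total = 0, 0.0, 0.0
    for f in order:
        t += abs(positions[f] - head)   # travel time to reach the file
        head = positions[f]
        total += t                      # completion time of this request
    return total / len(order)

best = min(permutations(positions), key=avg_service_time)
print(best, round(avg_service_time(best), 1))              # best ordering by brute force
print(round(avg_service_time(("A", "B", "C", "D")), 1))    # naive FIFO order, for comparison
```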
Peer-to-peer cloud storage solutions offer a way to strengthen data security by removing any central authority who could undermine confidentiality and reduce availability of the data. One topic of interest in peer-to-peer cloud storage is group-based file sharing and collaboration. This poster will present Group Key Agreement Protocols, which provide a way for members of a group to agree on a group key that can be used to secure communication. This led me to study and compare different protocols. Most of these protocols maintain a tree structure to organize intermediate keys and use the root key as the group key. This structure is quite efficient, as it reaches logarithmic complexity when performing operations on the tree: adding or removing members and updating members' keys. This work will discuss the two families of protocols: Collaborative Group Key Agreement Protocols and Continuous Group Key Agreement Protocols. Using these protocols in a peer-to-peer environment also requires addressing the untrusted delivery service, which is a centralized point used to exchange protocol messages between members.
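A minimal sketch of the tree idea, assuming plain hashes instead of the Diffie-Hellman keys and key-derivation functions used by real protocols: the root of a binary tree of member secrets serves as the group key, and a member update changes the root. For brevity the sketch rebuilds the whole tree, whereas real protocols only recompute the logarithmic path from the updated leaf to the root.

```python
import hashlib, os

def h(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()

def build_tree(leaves):
    """Binary key tree: each internal node key is derived from its two children;
    the root (tree[-1][0]) is used as the group key. Assumes len(leaves) is a
    power of two, for simplicity."""
    level, tree = leaves[:], [leaves[:]]
    while len(level) > 1:
        level = [h(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        tree.append(level)
    return tree

members = [os.urandom(32) for _ in range(8)]    # per-member leaf secrets
old_group_key = build_tree(members)[-1][0]

members[3] = os.urandom(32)                     # member 3 refreshes its secret
new_group_key = build_tree(members)[-1][0]
assert new_group_key != old_group_key           # the group key rotates
```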
Interplanetary File System (IPFS) is an open-source content-addressable peer-to-peer system that provides large-scale distributed data storage and delivery. IPFS is designed to share immutable data, thus modifying data results in a new copy. This may lead to a high cost in terms of storage, data transfer, and latency. For example, in collaborative applications like text editing, a new version of the whole document will be generated and replicated when a single letter is written or deleted. Conflict-free Replicated Data Types (CRDTs) are specific data types built in a way that mutable data can be managed without the need for consensus-based concurrency control. They are a promising solution for merging modifications on mutable data in IPFS. Previous efforts have explored how to implement such CRDTs in IPFS and discussed the limitations and several potential optimizations, but they have not been implemented and evaluated in a real IPFS deployment. In this work-in-progress, we conduct experiments to show the deficiencies of IPFS when dealing with mutable data, and to evaluate the convergence time, traffic usage and scalability of our implementations of CRDTs in IPFS for two simple data types (i.e., Counter and Set). Our implementations are inspired by previous work and written in Go. They utilize libP2P as a communication system and use a rendez-vous point to connect peers together.
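As an illustration of the Counter data type mentioned above, here is a state-based grow-only counter in Python; the actual implementations of this work are written in Go on top of libP2P.

```python
class GCounter:
    """State-based grow-only counter CRDT: one entry per replica, merged by
    taking the element-wise maximum, so concurrent updates always converge."""
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}

    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

a, b = GCounter("peer-a"), GCounter("peer-b")
a.increment(3); b.increment(2)
a.merge(b); b.merge(a)            # states can be exchanged in any order, any number of times
assert a.value() == b.value() == 5
```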
The Steinbuch Centre for Computing (SCC) at Karlsruhe Institute of Technology (KIT) designed a data transfer federation in the context of the NFDI4Ing project and also in close collaboration with University Heidelberg in the bwHPC-S5 project. The proposed federation will allow researchers to access and transfer data between different storage systems using their home organization's user account, based on their access rights to the resources in a collaborative project. The storage systems in a federation are either dedicated large-scale systems or systems associated with High-Performance Computing. At SCC, we have integrated our Large Scale Data Facility: Online Storage (LSDF OS) with the WebDAV protocol and OAuth2 authentication to enable third-party applications to access the storage service. Beyond improved support for programmatic access using OAuth2, the WebDAV server enables users to inspect the LSDF OS filesystem via their web browser. To enable data movement, we have deployed an instance of the File Transfer Service (FTS) on-premises and integrated it with our identity provider (IdP). FTS is a low-level data management service that schedules reliable bulk transfer of files from one site to another. It is an open-source software developed by CERN that distributes the majority of the Large Hadron Collider (LHC) data across the Worldwide LHC Computing Grid (WLCG) infrastructure. However, FTS is designed to be used with one identity provider, whereas in a federation, more than one IdP is involved. To address this limitation, we have designed a central IdP to issue tokens that are recognizable by the downstream identity providers which manage users' access to their corresponding storage service in the federation. At each storage service, tokens issued by the central IdP are resolved to the corresponding user identity at the local IdP via a mapping policy. To achieve this, a unified approach in the federation regarding the information included in the token is necessary. This will enable users to have seamless access to a wider range of resources, thereby enhancing collaboration between research centers.
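A hedged sketch of what programmatic access could look like once a token has been obtained: a WebDAV PROPFIND request carrying an OAuth2 bearer token. The endpoint URL and token handling are illustrative assumptions, not the actual LSDF OS configuration.

```python
import requests

# Hypothetical endpoint and token: list a directory on a WebDAV-enabled storage
# service using an OAuth2 access token obtained from the identity provider.
WEBDAV_URL = "https://lsdf-webdav.example.org/project-x/"
ACCESS_TOKEN = "<oauth2-access-token>"

resp = requests.request(
    "PROPFIND",                               # WebDAV listing method
    WEBDAV_URL,
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}", "Depth": "1"},
)
resp.raise_for_status()
print(resp.text)                              # XML multistatus response with directory entries
```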
One of the biggest challenges facing HPC storage systems today is the placement and migration of data across different storage levels. To decide when to move a file from one tier to another, placement policies have been defined and implemented in multi-tiered storage systems. For instance, file placement tasks in CEA supercomputers are performed using RobinHood, a tool for applying and planning data placement policies. This tool works at the file granularity. The problem of selecting the subset of data to keep in a low-capacity, high-speed memory has been studied for implementing memory page caches in operating systems. The ARC (Adaptive Replacement Cache) algorithm, or its adaptations, are state-of-the-art solutions to manage page caches. ARC is an adaptive caching algorithm designed to consider both recency and frequency of accesses. It manages the cache with two LRU lists at a block granularity. In this work, our aim is to determine whether the ARC algorithm is transposable to the management of a two-tier (HDD-SSD) storage architecture at a file granularity. Our strategy seeks to strike a trade-off between recency and frequency of accesses in order to keep the recently and frequently used files in the top tier, while keeping the logic of ARC, as it has proved to be highly efficient. To evaluate our method, we use a storage simulator that we have enriched with our adapted ARC. This simulator allows us to replay several I/O traces on a multi-tier storage architecture. We compare the SSD hit ratio obtained with our proposition and with usual policies such as LRU and LFU.
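To make the recency/frequency trade-off concrete, here is a deliberately simplified two-list cache at file granularity; it omits the ghost lists and the adaptive target size of real ARC and is not the adaptation evaluated in this work.

```python
from collections import OrderedDict

class TwoListFileCache:
    """Simplified ARC-inspired policy at file granularity: T1 holds files seen
    once (recency), T2 holds files seen at least twice (frequency). Ghost lists
    and the adaptive target size of real ARC are omitted for brevity."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.t1, self.t2 = OrderedDict(), OrderedDict()

    def access(self, path):
        if path in self.t2:
            self.t2.move_to_end(path)          # frequent file stays hot
            return True
        if path in self.t1:
            del self.t1[path]
            self.t2[path] = None               # promoted on second access
            return True
        self._evict_if_full()
        self.t1[path] = None                   # first access goes to the recency list
        return False                           # miss: file fetched from the HDD tier

    def _evict_if_full(self):
        if len(self.t1) + len(self.t2) >= self.capacity:
            victim = self.t1 if self.t1 else self.t2
            victim.popitem(last=False)         # evict the least recently used file

cache = TwoListFileCache(capacity=3)
hits = [cache.access(f) for f in ["a", "b", "a", "c", "d", "a"]]
print(hits)   # [False, False, True, False, False, True]
```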
Serverless computing relies on the composition of short-lived stateless functions that are executed following events such as HTTP requests. These functions are mostly deployed in containers called "replicas". Scaling a FaaS application, i.e. maintaining a consistent level of performance, consists in growing or shrinking the pool of function replicas following load peaks. When a function is requested while no replica exists on the cluster, it goes through a cold start that adds an initialization delay to the function's response time. Function images are stored in repositories on dedicated nodes and pulled by worker nodes where and when functions are deployed. Depending on image size, this can have detrimental consequences on request latency in deployments where cold starts dominate a function's total response time. To mitigate these issues, we propose a distributed function cache that opportunistically takes advantage of available disk space and memory on worker nodes in the cluster. Such nodes are highly heterogeneous: we consider functions that can be executed on different platform architectures, ranging from CPUs to GPUs and FPGAs. Furthermore, storage on these nodes exhibits various levels of performance and cost, ranging from HDDs to SSDs and NVM devices. By proactively caching functions from the same application using adequate storage on the same nodes, we seek to minimize cold starts and data movement to improve total response times. We evaluate our platform in a simulation environment using workload traces derived from Microsoft's Azure Functions, enriched with measurements from a deepfake detection project at the B<>com Institute of Research and Technology.
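A purely illustrative scoring heuristic (not the proposed caching policy) gives the flavour of the placement decision: prefer nodes with a matching platform, fast storage and enough free space. Node names, tiers and sizes are hypothetical.

```python
# Hypothetical worker nodes with heterogeneous platforms and storage tiers.
NODES = [
    {"name": "edge-1", "platform": "gpu",  "storage": "nvme", "free_gb": 40},
    {"name": "edge-2", "platform": "cpu",  "storage": "ssd",  "free_gb": 200},
    {"name": "edge-3", "platform": "fpga", "storage": "hdd",  "free_gb": 500},
]
STORAGE_SPEED = {"nvme": 3, "ssd": 2, "hdd": 1}   # higher is faster

def pick_cache_node(image_size_gb, platform):
    """Pick a node to cache a function image on: matching platform, enough free
    space, fastest storage first, largest remaining free space as tie-breaker."""
    candidates = [n for n in NODES
                  if n["platform"] == platform and n["free_gb"] >= image_size_gb]
    return max(candidates,
               key=lambda n: (STORAGE_SPEED[n["storage"]], n["free_gb"]),
               default=None)

print(pick_cache_node(image_size_gb=8, platform="gpu")["name"])   # edge-1
```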
The central focus of this work is the Gaussian Mixture Model (GMM), a machine learning model widely used for density estimation and cluster analysis. The Expectation-Maximization (EM) algorithm is commonly used to train GMMs. One of the main challenges facing this algorithm with the shift towards embedded systems is crippling memory constraints. Indeed, EM is an iterative algorithm that requires several scans of the data, and thus several I/O operations. Hence, when the dataset cannot be fully stored in the main memory, its execution is slowed down by the huge data movements (I/Os) between the main memory and the secondary storage. In this work, we present an I/O optimization of the EM algorithm for GMMs that relies on two main contributions: (1) a divide-and-conquer strategy that divides the dataset into chunks, learns the GMM separately in each chunk and combines the results incrementally; (2) a strategy that restricts the training of the GMM to a subset of data while producing sufficiently good accuracy, by utilizing the information learned from the first chunk.
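A naive sketch of the divide step only, using scikit-learn: fit a small GMM per chunk and form a global mixture as the size-weighted union of the chunk components. This is not the incremental combination or subset-training strategy proposed in the work.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm_by_chunks(chunks, k_per_chunk=2):
    """Fit one small GMM per in-memory chunk and return the size-weighted
    union of all chunk components as a global mixture (naive combination)."""
    weights, means, covs, total = [], [], [], 0
    for chunk in chunks:                       # each chunk fits in main memory
        gmm = GaussianMixture(n_components=k_per_chunk).fit(chunk)
        weights.append(gmm.weights_ * len(chunk))
        means.append(gmm.means_)
        covs.append(gmm.covariances_)
        total += len(chunk)
    return (np.concatenate(weights) / total,   # global mixture weights
            np.concatenate(means),
            np.concatenate(covs))

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-3, 1, (5000, 2)), rng.normal(3, 1, (5000, 2))])
chunks = np.array_split(rng.permutation(data), 10)   # stand-in for out-of-core chunk reads
w, mu, sigma = fit_gmm_by_chunks(chunks)
print(np.round(mu, 1))
```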
One of the most popular algorithms for data analysis is K-means, where each cluster is reduced to a mean that is representative of the cluster's data observations. The K-means algorithm iterates over all data elements (the data set) multiple times until it converges. When the data set is larger than the available memory workspace, traditional K-means execution time is dominated by I/Os, since secondary storage is used as an extension of the main memory. To tackle this issue, K-MLIO (K-Means with Low I/Os), a state-of-the-art method, was proposed to decorrelate the number of data scans from the convergence speed by applying a divide-and-conquer approach. In our work, we aim to propose an any-time and energy-efficient version of K-MLIO. By any-time, we mean a version that is able to give a result under a time constraint. Our proposal relies on two ideas: (1) since K-MLIO acts on multiple small chunks of data (the divide step), our assumption is that it is possible to consider only a subset of chunks to meet the time constraint, at the expense of a (small) clustering error; (2) it is possible to reduce the power consumption of the method by applying a dynamic voltage and frequency scaling (DVFS) technique during chunk processing. The proposed method consists in speculating on the worst-case processing time of the remaining chunks to tune the minimum frequency to be used and the number of data chunks to drop (I/Os to reduce), so as to meet the time constraint with the smallest possible increase in error. Preliminary experiments show encouraging results compared to K-MLIO.
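The following sketch conveys the any-time idea only (DVFS cannot be portably demonstrated from user space): chunks are processed while the speculated worst-case time of the next chunk still fits in the budget, and the remaining chunks are dropped. MiniBatchKMeans stands in for the chunk-level clustering; this is not K-MLIO itself.

```python
import time
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def anytime_kmeans(chunks, k, budget_s):
    """Process chunks until the speculated worst-case time of the next chunk
    would exceed the remaining time budget, then stop (drop remaining I/Os)."""
    model = MiniBatchKMeans(n_clusters=k)
    start, worst_chunk_time, processed = time.perf_counter(), 0.0, 0
    for chunk in chunks:
        elapsed = time.perf_counter() - start
        if elapsed + worst_chunk_time > budget_s:
            break                                   # drop remaining chunks
        t0 = time.perf_counter()
        model.partial_fit(chunk)                    # one pass over this chunk
        worst_chunk_time = max(worst_chunk_time, time.perf_counter() - t0)
        processed += 1
    return model, processed

rng = np.random.default_rng(0)
data = rng.normal(size=(200_000, 8))
chunks = np.array_split(data, 40)
model, used = anytime_kmeans(chunks, k=5, budget_s=0.5)
print(f"processed {used}/40 chunks within the budget")
```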
Lustre is the leading open-source and open-development file system for HPC. Around two thirds of the top 100 supercomputers use Lustre. It is a community-developed technology with contributors from around the world. Lustre currently supports many HPC infrastructures beyond scientific research, such as financial services, energy, manufacturing, and life sciences, and in recent years it has been leveraged by cloud solutions to bring its performance benefits to a variety of new use cases (particularly relating to AI). This talk will reflect on the current state of the Lustre ecosystem and will also include the latest news relating to Lustre community releases (LTS releases and major releases), the roadmap, and details of features under development.
Large Language Models (LLMs) are fast becoming the key differentiating tool for NLP as they offer significant accuracy gains. This has been seen recently through the impressive success of ChatGPT and other LLM-powered tools such as the new Microsoft Bing. These large models are scale-up models with billions to trillions of parameters and consistently outperform fine-tuned models with a more limited corpus. Between 2019 and 2023 the model size has been multiplied by 1000x, and the recent regained attention for sparse models will not slow down this trend; but training and inference at this scale are challenging and put enormous pressure on memory resources. To overcome that memory limitation and allow for the next 1000x increase in model size, a few software stacks have been developed to optimize memory usage and to offload parts of the model from GPU memory to other forms of storage such as local NVMe drives, therefore swapping the model in and out during the run. We show how LLM offloading performance is dominated by storage throughput and how Lustre consistently outperforms, by a large margin, the local RAID of a DGX A100 system while allowing bigger models to be run: up to 24 trillion parameters on a single DGX A100 during our testing. We notably show a 1.69x inference throughput speed-up by using Lustre over the local RAID for the BLOOM model (176 billion parameters).
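The talk does not name a specific stack; one widely used example of this kind of offloading is DeepSpeed ZeRO-Infinity, whose configuration can point parameter and optimizer offload at an NVMe or parallel-file-system path. The snippet below is a hedged configuration sketch; the paths and sizes are illustrative assumptions.

```python
# Minimal DeepSpeed ZeRO-Infinity style configuration (assumption: the talk's
# "software stacks" include this kind of tool; paths and sizes are illustrative).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,                              # partition params, gradients, optimizer state
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/lustre/offload",      # could equally point to a local RAID
        },
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/lustre/offload",
        },
    },
}
# Passed to deepspeed.initialize(model=model, config=ds_config, ...): the engine
# then swaps tensors between GPU memory and the storage path during the run, so
# offloading performance is bound by the storage throughput.
```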
As supercomputers are becoming faster and faster, so does their data output. Since regularly accessed data must be stored and made available quickly to users, it is important to put it on fast storage systems. However, these tend to have a low capacity, meaning we must be able to choose which data should remain on those types of storage systems, and which can be placed on slower but higher-capacity systems. As such, it is important to be able to accurately know the state of a filesystem at any point, but using the conventional means provided by the operating system for this, for instance filesystem traversals, can be time consuming if done regularly. Moreover, these operations impose a heavy load on the filesystem, making it slower. To counter these problems, we created a suite of tools called RobinHood that aims to mirror a filesystem in a database, and to use the latter to define policies that will manage data placement according to usage.
HPC storage systems are usually shared by multiple organisations and users with various applications. Those applications can exhibit all kinds of I/O patterns or metadata accesses, assuming they are alone in consuming storage resources. So, how do we prevent these different HPC jobs or clients from impacting each other? How do we prioritize particular HPC jobs or nodes? How do we maintain acceptable storage performance for all consumers? The Lustre filesystem already implements a QoS solution for these purposes, named TBF (Token Bucket Filter). This functionality has been in production for two years at the CEA. This presentation will describe what Lustre TBF is, how it can be used, and what its current limitations are.
The ability of large-scale infrastructures to store and retrieve a massive amount of data is now decisive to scale up scientific applications. However, there is an ever-widening gap between I/O and computing performance on these systems. A way to mitigate this gap consists of deploying new intermediate storage tiers (node-local storage, burst buffers, …) between the compute nodes and the traditional global shared parallel file system. Unfortunately, without advanced techniques to allocate and size these resources, they often remain underutilized. To address this problem, we investigate how heterogeneous storage resources can be allocated on an HPC platform, in a similar way to compute resources. In that regard, we introduce StorAlloc, a simulator used as a testbed for assessing storage-aware job scheduling algorithms and evaluating various storage infrastructures.
Over the past years, the storage landscape has grown in both the size and the variety of storage systems. Unlike HPC compute systems, storage systems are not fully replaced periodically, and old systems are routinely operated alongside newer technologies. This proves to be challenging for users whose data is distributed across these different systems and their various APIs. To facilitate users' data movement across the boundaries of old, new and different systems, the Data Mover was developed as part of the Fenix Data Services. In this talk we describe the Data Mover service, how it is operated and what options it provides to users. We will motivate the development of the Data Mover by introducing the complex infrastructure of the Juelich storage systems and how the Data Mover helps users move their data between components.
Since its inception seven years ago, Per3S has been managed by a steering committee; the committee is fluid and tends to evolve from one edition to the next.
Provided email addresses will only be used to contact participants for logistical purposes or to broadcast last-minute changes. The email addresses will not be kept after the workshop (GDPR).
The workshop will be held at La Maison des Mines et des Ponts, a building of the prestigious Ecole des Mines et des Ponts, in the heart of the Latin Quarter on the left bank of Paris.
The last four editions of Per3S are available at the following addresses (the 2020 and 2021 editions were canceled due to the pandemic).
6th edition: 2022
5th edition: 2019
4th edition: 2018
3rd edition: 2017