Storage in Per3S is understood broadly: it includes HPC systems as well as Cloud architectures, the common point between the two being scalability. Presentations and talks focus on applications, systems, or architectures. This edition aims to gather, during one day, researchers from academia and industry, experienced or junior, as well as storage users and customers, with the sole purpose of exchanging ideas and fostering the community.
Per3S is a workshop aiming to bring together the scientific and technological storage community to discuss and address issues and challenges associated with performance and data operations at scale. These topics cover HPC storage as well as Cloud-oriented architectures, both sharing the need for extreme scale.
Per3S strongly encourages young researchers to present their work by submitting an abstract. The abstract can describe original work, ongoing work with fresh problems and solutions, or work already submitted and/or accepted at an international conference, so that it can be the subject of discussion. Previous editions of Per3S have successfully fostered a community of researchers from both academia and industry working on storage technologies. The audience is around 50 people.
The program is organized around three sessions: the first dedicated to Cloud, storage technologies, and data management; the second a poster session for interactive discussion; and the third and last centered on HPC storage technologies, and Lustre in particular.
Each poster comes with an additional flash presentation. Get, within a single day, a comprehensive overview of the storage activities in France.
The Per3S workshop spans a full day, from 9:00 to 17:30, with a total of three sessions plus two distinguished talks and a concluding panel session:
Modern block-addressable NVMe SSDs provide much higher bandwidth and similar performance for random and sequential access. Persistent key-value stores (KVs) designed for earlier storage devices, using either Log-Structured Merge (LSM) or B trees, do not take full advantage of these new devices. Logic to avoid random accesses, expensive operations for keeping data sorted on disk, and synchronization bottlenecks make these KVs CPU-bound on NVMe SSDs. We present a new persistent KV design. Unlike earlier designs, no attempt is made at sequential access, and data is not sorted when stored on disk. A shared-nothing philosophy is adopted to avoid synchronization overhead. Together with batching of device accesses, these design decisions make for read and write performance close to device bandwidth. Finally, maintaining an inexpensive partial sort in memory produces adequate scan performance. We implement this design in KVell, the first persistent KV able to utilize modern NVMe SSDs at maximum bandwidth. We compare KVell against available state-of-the-art LSM and B tree KVs, both with synthetic benchmarks and production workloads. KVell achieves throughput at least 2x that of its closest competitor on read-dominated workloads, and 5x on write-dominated workloads. For workloads that contain mostly scans, KVell performs comparably or better than its competitors. KVell provides maximum latencies an order of magnitude lower than the best of its competitors, even on scan-based workloads.
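To make the core idea concrete, here is a minimal sketch of an "unsorted on disk, indexed in memory" key-value store; it only illustrates the design principle described in the abstract, not KVell's implementation, and the file path and record layout are invented for the example.

```python
# Minimal sketch of an "unsorted on disk, indexed in DRAM" KV store.
# Illustration of the design idea only, not KVell's code; the file name
# and record format are invented for the example.
import os

class UnsortedKV:
    def __init__(self, path):
        self.f = open(path, "a+b")
        self.index = {}                  # key -> (offset, length), kept in memory

    def put(self, key: bytes, value: bytes):
        # Append the value wherever the write pointer is: no sorting and
        # no compaction, so a write costs a single device access.
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(value)
        self.index[key] = (offset, len(value))

    def get(self, key: bytes) -> bytes:
        offset, length = self.index[key]
        self.f.seek(offset)
        return self.f.read(length)

    def scan(self, start: bytes, count: int):
        # Scans only need the *keys* sorted; values stay unsorted on disk.
        keys = sorted(k for k in self.index if k >= start)[:count]
        return [(k, self.get(k)) for k in keys]

kv = UnsortedKV("/tmp/demo.kv")
kv.put(b"a", b"1"); kv.put(b"c", b"3"); kv.put(b"b", b"2")
print(kv.scan(b"a", 3))
```

The sketch leaves out the other two ingredients mentioned in the abstract, batching of device accesses and the shared-nothing partitioning that removes synchronization between worker threads.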
Data management at scale is a challenge for both academic and private institutions. Indeed, large-scale scientific simulation programs require more and more data, in very sophisticated analysis workflows. For instance, the LHC experiment at CERN generates 30 PB of data every year, with a total of 100 PB permanently archived. Mass storage is also central in Cloud infrastructures, which have to deal with exabytes of data (for instance, Google offers 27 EB of free storage with Gmail). Such amounts of data are stored in dedicated data centers, with two main paradigms: cold storage using magnetic tapes (mass capacity, slow access time) and disks for intermediate- to short-term storage (so-called "hot" data). Tapes are interfaced with disks through dedicated software layers, which do not allow direct access to tapes from computing resources, be they in a Cloud environment or on computing clusters. Hence, data movement between cold and hot storage becomes more and more critical to the performance of both computing and storage infrastructures. We present in this talk a roadmap for building a better continuum between cold and hot storage, by deriving new software interactions between the resource management system and the tape systems. This roadmap is expected to be highly collaborative, especially with academic and private actors using mass tape storage systems (CEA, Cloud providers, etc.).
TBA
Persistent memory (PMEM) offers durability at byte granularity and operates at nearly the speed of volatile memory. While this technology holds great promise for improving performance in large databases and analytics systems, current PMEM runtimes fail to provide a simple interface. This is because developers must manually specify the write set of a failure-atomic block. To avoid the burden of manually specifying the write set, we propose to use low-level hardware features managed by the operating system. To access these features, we isolate the system part of a PMEM runtime, and define three new system primitives: two to define a failure-atomic section and one to recover and map a PMEM in a process. We implemented this interface in VolipMem, which leverages virtualization to expose a page table in a process. We integrated VolipMem into three language runtimes, and evaluated these runtimes with two databases and several libraries. Our results show that VolipMem is both easy to use and efficient.
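As an illustration of the programming model described above, the following sketch shows how application code might use such an interface. The names pmem_map_recover, fas_begin and fas_end are hypothetical placeholders, not VolipMem's actual primitives, and are stubbed here so the example runs.

```python
# Hypothetical illustration of a three-primitive PMEM interface.
# pmem_map_recover / fas_begin / fas_end are invented names standing in for
# the real system primitives; they are stubbed so the example is runnable.
from contextlib import contextmanager

def pmem_map_recover(path):
    # Stand-in for: recover and map a PMEM pool into the process address space.
    print(f"mapping persistent region {path}")
    return {"balance": 0}            # pretend this dict lives in PMEM

def fas_begin():
    print("failure-atomic section: begin (OS starts tracking the write set)")

def fas_end():
    print("failure-atomic section: end (writes made durable atomically)")

@contextmanager
def failure_atomic():
    # The write set is tracked by the OS/hardware, so the application does
    # not have to declare modified addresses by hand.
    fas_begin()
    try:
        yield
    finally:
        fas_end()

pool = pmem_map_recover("/mnt/pmem/pool")
with failure_atomic():
    pool["balance"] += 100           # either fully persisted or not at all
```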
In-network caching plays a pivotal role in enabling scalable, low-latency data delivery in Information-Centric Networking (ICN). However, as data volumes continue to surge, existing storage infrastructures—especially multi-tiered systems with varying cost and performance characteristics—struggle to keep pace. Meanwhile, critical concerns such as energy efficiency, bandwidth constraints, and Quality of Service (QoS) are often overlooked. Current caching strategies fall into two camps: centralized approaches that optimize local cache performance but neglect network-wide costs, and distributed methods that encourage global coordination but often ignore hardware heterogeneity and fail to enforce fine-grained QoS. To address these challenges, we introduce two complementary solutions. QM-ARC is a centralized, QoS-aware multi-tier adaptive replacement strategy that extends ARC by incorporating application and user priorities through a penalty-based model inspired by Service Level Agreements (SLAs). Complementing this, CL2SM (Cache Less to Save More) is a distributed caching algorithm designed to optimize content placement and replication across multi-tier nodes while minimizing total system cost. It integrates a holistic cost model covering hardware depreciation, bandwidth, energy consumption, and SLA penalties. Together, these strategies offer a unified, intelligent, and cost-effective caching framework tailored for the evolving demands of next-generation ICN.
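As a rough illustration of the penalty-based idea (not the QM-ARC algorithm itself), the sketch below weights each object's eviction score by an SLA-style penalty, so that low-priority content is evicted before high-priority content of similar recency.

```python
# Toy illustration of penalty-weighted eviction; the scoring formula is
# invented for the example and is not the QM-ARC algorithm.
import time

class PenaltyAwareCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = {}              # key -> (last_access, sla_penalty)

    def access(self, key, sla_penalty=1.0):
        if key not in self.items and len(self.items) >= self.capacity:
            self.evict()
        self.items[key] = (time.monotonic(), sla_penalty)

    def evict(self):
        now = time.monotonic()
        # Older objects make better victims, but a high SLA penalty
        # (the cost of a miss) protects an object from eviction.
        victim = max(self.items,
                     key=lambda k: (now - self.items[k][0]) / self.items[k][1])
        del self.items[victim]

cache = PenaltyAwareCache(capacity=2)
cache.access("video/popular", sla_penalty=5.0)   # premium content
cache.access("log/archive",   sla_penalty=0.5)   # best-effort content
cache.access("video/new",     sla_penalty=5.0)   # evicts "log/archive" first
print(sorted(cache.items))
```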
Abstract to be provided
As the energy consumption of modern supercomputers can rival that of a small city, resulting in significant costs and greenhouse gas emissions, energy efficiency in scientific computing has become an essential concern. Consequently, "green computing" is now a major scientific and industrial challenge. Although storage devices consume relatively little energy compared to processors and their cooling systems, storage-related bottlenecks can prolong application runtimes, thereby indirectly increasing overall energy consumption. This is particularly critical in high-performance computing (HPC), where I/O operations are often sporadic and bursty, as there is a performance gap of several orders of magnitude between volatile memory and storage systems. In this context, we identify two promising levers for optimizing I/O energy consumption: advising functions and dynamic voltage and frequency scaling (DVFS). As leveraging these mechanisms effectively requires a precise yet efficient model of I/O behavior, in this work we first present our past contributions to the modeling of HPC I/O, and then present our work-in-progress studies on the energy-saving potential of advising functions and DVFS. We show that while advising functions have varying effects depending on the PFS, they can positively affect I/O performance, and that DVFS can be used on some I/O patterns to reduce I/O energy consumption at a small performance cost.
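As an example of an advising function, the snippet below (a minimal Linux-only illustration, unrelated to the specific measurements mentioned above) uses posix_fadvise to announce a sequential read and then drop the file's pages from the page cache.

```python
# Minimal example of an advising function on Linux: tell the kernel the file
# will be read sequentially, then drop its pages from the page cache.
import os

path = "/tmp/data.bin"
with open(path, "wb") as f:                              # create a small test file
    f.write(os.urandom(4 << 20))

fd = os.open(path, os.O_RDONLY)
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)     # whole file, sequential access
while os.read(fd, 1 << 20):                              # stream it in 1 MiB chunks
    pass
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)       # evict the cached pages
os.close(fd)
```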
Identifying performance bottlenecks in a parallel application is tedious, especially because it requires analyzing the behaviour of various software components, as bottlenecks may have several causes and symptoms. For example, a load imbalance may cause long MPI waiting times, or contention on disk may degrade the performance of I/O operations. Detecting a performance problem means investigating the execution of an application and applying several performance analysis techniques. To do so, one can use a tracing tool to collect information describing the behaviour of the application. Tracing an application may alter its performance, and can create thousands of heavy trace files, especially at large scale. Most importantly, the post-mortem analysis needs to load these thousands of trace files in memory and process them. This quickly becomes impractical for large-scale applications, as memory gets exhausted and the number of opened files exceeds the system capacity. We propose PALLAS, a generic trace format tailored for conducting various post-mortem performance analyses of traces describing large executions of HPC applications. During the execution of the application, PALLAS collects events and detects their repetitions on the fly. When storing the trace to disk, PALLAS groups the data from similar events or groups of events together in order to later speed up trace reading. We demonstrate that the PALLAS online detection of the program structure does not significantly degrade the performance of the applications. Moreover, the PALLAS format allows faster trace analysis compared to other evaluated trace formats. Overall, the PALLAS trace format allows the interactive trace analysis that is required when a user investigates a performance problem.
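As a toy illustration of on-the-fly repetition detection (a simplification, not the actual PALLAS format or algorithm), the sketch below compresses an event stream by replacing consecutive repetitions of short event sequences with a pattern and a repeat count.

```python
# Toy illustration of detecting repetitions in an event stream; this is a
# simplification for exposition, not the PALLAS trace format or algorithm.
def compress(events, max_pattern=4):
    out = []                                   # list of (pattern, repeat_count)
    i = 0
    while i < len(events):
        best = ([events[i]], 1)
        for n in range(1, max_pattern + 1):    # candidate pattern length
            pattern = events[i:i + n]
            reps = 1
            while events[i + reps * n:i + (reps + 1) * n] == pattern:
                reps += 1
            if reps > 1 and reps * n > best[1] * len(best[0]):
                best = (pattern, reps)
        out.append(best)
        i += len(best[0]) * best[1]
    return out

trace = ["MPI_Send", "MPI_Recv"] * 3 + ["write"] * 4 + ["MPI_Barrier"]
print(compress(trace))
# [(['MPI_Send', 'MPI_Recv'], 3), (['write'], 4), (['MPI_Barrier'], 1)]
```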
Abstract to be provided
Today, large language models have demonstrated their strengths in various tasks, ranging from reasoning and code generation to complex problem solving. However, this advancement comes with a high computational cost and requires considerable memory to store the model parameters and request context, making it challenging to deploy these models on edge devices while ensuring real-time responses and data privacy. The rise of edge accelerators has significantly boosted on-device processing capabilities; however, memory is still a major bottleneck, a problem that is especially significant for large language models with high memory requirements, while it is unclear whether using all model layers is crucial for maintaining generation quality. In addition, the varying workloads on edge devices call for an adaptive solution for the efficient utilization of hardware resources. In this paper, we propose a flexible layer-wise compression approach that produces multiple model variants. Our solution leverages a smart storage mechanism to ensure efficient storage and rapid loading of the most appropriate variant, tailored to the Quality of Service (QoS) requirements and the dynamic workload of the system.
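As a rough illustration of variant selection under QoS constraints (the variants and numbers below are invented for the example, not taken from the work), a runtime could pick the highest-quality variant that fits the current memory budget and latency target.

```python
# Toy illustration of picking a model variant under a memory budget and a
# latency target; the variants and all numbers are invented for the example.
variants = [
    {"name": "full-32-layers",   "mem_gb": 14.0, "latency_ms": 95, "quality": 1.00},
    {"name": "pruned-24-layers", "mem_gb": 10.5, "latency_ms": 72, "quality": 0.97},
    {"name": "pruned-16-layers", "mem_gb":  7.0, "latency_ms": 50, "quality": 0.91},
]

def select_variant(free_mem_gb, latency_budget_ms):
    # Keep only variants that satisfy the QoS constraints, then take the
    # one with the highest generation quality.
    feasible = [v for v in variants
                if v["mem_gb"] <= free_mem_gb and v["latency_ms"] <= latency_budget_ms]
    return max(feasible, key=lambda v: v["quality"]) if feasible else None

print(select_variant(free_mem_gb=11.0, latency_budget_ms=80)["name"])
# -> pruned-24-layers
```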
Software-Defined Storage (SDS) is gaining traction in edge environments due to its adaptability and decoupling from hardware constraints. However, characterizing and optimizing SDS performance remains challenging, particularly in resource-constrained systems. In this work, we present a systematic tracing-based methodology for detecting and analyzing I/O performance bottlenecks across the Linux storage stack—from the block layer down to device-level interactions. By leveraging low-level tracers, we trace I/O operations to identify the root causes of performance degradation, such as disk queuing delays and write amplification. By simulating diverse I/O workloads, we collect trace data that helps pinpoint specific issues such as throughput degradation, latency spikes, and inefficient I/O patterns. Based on the insights gathered, we propose corrective optimization strategies at the OS level, such as dynamic I/O throttling, aimed at mitigating the identified bottlenecks. Using a Ceph-based deployment as a case study, we aim to demonstrate how our methodology can be applied to effectively diagnose performance issues in near real-time and propose actionable solutions to enhance storage efficiency and maintain stable performance in edge computing environments.
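As a toy illustration of this kind of post-processing (the record format below is invented and does not correspond to any particular tracer's output), per-request queueing delay and end-to-end latency can be derived from queued/dispatched/completed events and used to flag latency spikes.

```python
# Toy illustration of post-processing block-layer trace records to spot
# queueing delays and latency spikes; the record format is invented.
records = [
    # (timestamp_s, request_id, event)  "Q" = queued, "D" = dispatched, "C" = completed
    (0.000, 1, "Q"), (0.001, 1, "D"), (0.004, 1, "C"),
    (0.002, 2, "Q"), (0.020, 2, "D"), (0.045, 2, "C"),   # long queueing + service
]

def analyze(records, spike_threshold_s=0.010):
    times = {}
    for ts, rid, ev in records:
        times.setdefault(rid, {})[ev] = ts
    for rid, t in sorted(times.items()):
        queueing = t["D"] - t["Q"]          # time spent waiting in the device queue
        total = t["C"] - t["Q"]             # end-to-end I/O latency
        flag = "  <-- spike" if total > spike_threshold_s else ""
        print(f"req {rid}: queueing {queueing*1e3:.1f} ms, total {total*1e3:.1f} ms{flag}")

analyze(records)
```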
High-performance computing (HPC) installations usually front a large-capacity HDD tier with a limited, low-latency SSD tier. Data migration is driven by Hierarchical Storage Management (HSM) software such as the Robinhood Policy Engine, which operates strictly at file granularity. This constraint prevents direct reuse of the many block-level cache algorithms proposed in the literature. Our previous multi-criteria replacement policy, MC-ARC, showed that combining recency, frequency, predicted lifetime and user fairness at file scope can outperform classical block caches. Analysis of production traces nevertheless uncovered three recurring pathologies: (P1) cache pollution by very large or rarely used files; (P2) post-burst persistence of files whose references occur in short, periodic I/O bursts; (P3) thrashing when several large files alternate and collectively overflow SSD capacity. We introduce FABME—a new file-level placement policy that preserves MC-ARC’s multi-criteria spirit while explicitly addressing P1–P3. Running fully on-line and relying solely on metadata already captured by Robinhood changelogs, FABME fuses three components: Admission control: on every miss, the policy decides whether to admit the file to SSD or confine it to HDD, using file size, access density, inactivity window, and a benefit–cost estimate; Burst-aware prefetch and eviction: a lightweight detector predicts periodic bursts, triggers just-in-time prefetching, and schedules early eviction to avoid post-burst pollution; MC-ARC multi-criteria eviction: for admissible files, the final victim is selected with the original recency–frequency–lifetime–fairness score. Ongoing experiments on Yombo, Synthesized Google I/O Traces and IBM ObjectStoreTrace, executed with our open two-tier simulator, will report hit rate, trace-processing time and migration overhead.
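As a rough illustration of the admission-control step (the thresholds and benefit-cost formula below are invented, not FABME's actual policy), a miss handler could combine file size, recent access density and inactivity window to decide whether a file deserves a place on the SSD tier.

```python
# Toy illustration of an admission-control decision on a cache miss; the
# thresholds and the benefit/cost formula are invented, not FABME itself.
def admit_to_ssd(file_size_bytes, accesses_last_day, idle_seconds,
                 ssd_cost_per_gb=0.25, hit_gain_per_access=1.0):
    size_gb = file_size_bytes / 1e9
    if size_gb > 50:                      # (P1) very large files pollute the SSD tier
        return False
    if idle_seconds > 7 * 24 * 3600:      # long inactivity window: likely cold data
        return False
    benefit = accesses_last_day * hit_gain_per_access
    cost = size_gb * ssd_cost_per_gb      # space the file would occupy on SSD
    return benefit > cost

print(admit_to_ssd(2e9,  accesses_last_day=40, idle_seconds=600))   # True
print(admit_to_ssd(80e9, accesses_last_day=40, idle_seconds=600))   # False (too large)
```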
Abstract to be provided
Abstract to be provided
Lustre is the leading open-source and open-development file system for HPC. Around two thirds of the top 100 supercomputers use Lustre. It is a community developed technology with contributors from around the world. Lustre currently supports many HPC infrastructures beyond scientific research, such as financial services, energy, manufacturing, and life sciences and in recent years has been leveraged by cloud solutions to bring its performance benefits to a variety of new use cases (particularly relating to AI). This talk will reflect on the current state of the Lustre ecosystem and also will include the latest news relating to Lustre community releases (LTS releases and major releases), the roadmap, and details of features under development.
Ephemeral I/O services, explored in the IO-SEA project and being extended in the EUPEX project, help data-intensive HPC workflows minimize data movement, keep active data close to compute nodes for their entire duration, and isolate I/O-intensive steps from other applications sharing the same file systems. They range from workflow-dedicated parallel file systems to Burst Buffers and more specific object storage services, and they run on dedicated hardware resources. In this talk, we will introduce the main concepts, the initial user interfaces and performance results, and discuss the challenges still to be addressed to make this a mainstream solution.
In 2024, the Juelich Supercomputing Centre successfully deployed the sixth iteration of its central storage system, JUelich STorage (JUST6), a 154 PiB disk storage system based on IBM's Storage Scale System 6000. This milestone was achieved after migrating over 32 PiB of data to the new infrastructure. The Centre is now expanding its storage capabilities with the deployment of the 300 PiB disk-based ExaSTORE and the 29 PiB NVMe-based ExaFLASH systems, both designed to support the upcoming Exascale system JUPITER. This presentation will provide an overview of the Juelich storage landscape, highlighting the challenges and strategies for deploying, migrating, and maintaining complex storage infrastructures at scale.
Over the last 10 years, the Mochi project has explored the notion of composition to foster research and development of HPC data services. Adopted by an increasing number of international users, its collection of components and its methodology allow for rapid development of new data services tailored to specific use cases and applications. In this talk we will look back on this decade of research in and around Mochi, providing a critical retrospective with lessons learned and perspectives for the future of HPC data service research.
Phobos, standing for Parallel Heterogeneous OBject Store, is software designed to handle large volumes of data across different storage technologies. It was initially designed to manage tape storage, and thus offers tape lifecycle features such as repack, but it can also store data on POSIX and RADOS systems, which can be used as cache systems before tape archival. Multiple interfaces have been developed to use Phobos as an HSM backend, for instance for Lustre and iRODS. This presentation will give an update on the current status of Phobos development and detail some new features such as object copy management.
The increasing gap between compute and I/O speeds in high-performance computing (HPC) systems imposes the need for techniques to improve applications’ I/O performance. Such techniques must rely on assumptions about I/O behavior in order to efficiently allocate I/O resources such as burst buffers, to schedule accesses to the shared parallel file system or to delay certain applications at the batch scheduler level to prevent contention, for instance. In this paper, we verify these common assumptions about I/O behavior, specifically about temporal behavior, using over 440,000 traces from real HPC systems. By combining traces from diverse systems, we characterize the behaviors observed in real HPC workloads. Among other findings, we show that I/O activity tends to last for a few seconds, and that periodic jobs are the minority, but responsible for a large portion of the I/O time. Furthermore, we make projections for the expected improvement yielded by popular approaches for I/O performance improvement. Our work provides valuable insights to everyone working to alleviate the I/O bottleneck in HPC.
Since its inception 8 years ago, Per3S has been managed by a steering committee. The committee is fluid and tends to evolve from one edition to the next.
Provided email addresses will only be used to contact participants for logistical purposes or to broadcast last-minute changes. The email addresses will not be kept after the workshop (GDPR).
The workshop will be held at La Maison des Mines et des Ponts, a building of the prestigious Ecole des Mines et des Ponts, in the heart of the Latin Quarter on the left bank of Paris.
The last three editions of Per3S are available at the following addresses.
8th edition: 2024
7th edition: 2023
6th edition: 2022