<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.9.5">Jekyll</generator><link href="https://www.devitocodes.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://www.devitocodes.com/" rel="alternate" type="text/html" /><updated>2025-03-19T10:09:13+00:00</updated><id>https://www.devitocodes.com/feed.xml</id><title type="html">Devito Codes</title><subtitle>Devito is a Python package to implement optimized stencil computation (e.g., finite differences, image processing, machine learning) from high-level symbolic problem definitions. Devito builds on SymPy and employs automated code generation and just-in-time compilation to execute optimized computational kernels on several computer platforms, including CPUs, GPUs, and clusters thereof.</subtitle><entry><title type="html">Benchmarking DevitoPRO on Intel® Xeon® 6 Processors</title><link href="https://www.devitocodes.com/granite" rel="alternate" type="text/html" title="Benchmarking DevitoPRO on Intel® Xeon® 6 Processors" /><published>2025-02-07T00:01:00+00:00</published><updated>2025-02-07T00:01:00+00:00</updated><id>https://www.devitocodes.com/granite</id><content type="html" xml:base="https://www.devitocodes.com/granite"><![CDATA[<h2 id="introduction">Introduction</h2>

<p>High-performance computing (HPC) plays a critical role in scientific and engineering applications, from seismic imaging to simulation. DevitoPRO is designed to help researchers and engineers automate HPC code generation, allowing them to focus on algorithmic development for their domain-specific challenges while ensuring their simulations run efficiently on modern hardware.</p>

<p>As part of our ongoing efforts to optimize performance across multiple architectures, this post presents benchmark results for <strong>DevitoPRO on Intel® Xeon® 6 6980P processors</strong>, comparing them to the previous-generation <strong>5th Gen Intel® Xeon® Platinum 8592+ processors</strong>. These results provide insights into <strong>computational throughput, data transfer performance, and mixed-precision acceleration</strong>—all key factors in achieving <strong>high-performance finite-difference seismic imaging kernels</strong>.</p>

<h2 id="benchmarking-setup">Benchmarking Setup</h2>

<p>The goal of this benchmarking study is to evaluate:</p>

<ul>
  <li>Generational performance improvements when moving from 5th Gen Intel® Xeon® Platinum 8592+ processors (Emerald Rapids) to Intel® Xeon® 6 6980P processors (Granite Rapids).</li>
  <li>The impact of mixed-precision computing, where FP16 is used for storage to improve memory bandwidth utilization while FP32 is retained for arithmetic to maintain numerical stability and accuracy, and its effect on compute throughput and data movement efficiency.</li>
</ul>

<h3 id="benchmarks">Benchmarks</h3>

<p>The benchmarks use two workloads:</p>

<ol>
  <li>Acoustic anisotropic propagator (acoustic TTI model) – A widely used model in seismic imaging for energy applications, testing floating-point operations per second (FLOPS) and data transfer rates.</li>
  <li>Elastic propagator – A more complex model incorporating mixed-precision techniques (FP32/FP16) to evaluate performance improvements in multi-node MPI environments.</li>
</ol>
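<p>Both propagators are, at their core, explicit finite-difference time-stepping loops. As a rough illustration of the kind of update they perform, here is a hand-written NumPy sketch of a second-order leapfrog acoustic step (a toy illustration only, not the optimized code DevitoPRO generates):</p>

```python
import numpy as np

def acoustic_step(u_prev, u_curr, c2_dt2, h2):
    """One leapfrog update of the 3D constant-density acoustic wave
    equation: u_next = 2*u_curr - u_prev + (c*dt)^2 * laplacian(u_curr).
    Second order in time and space; boundary values are simply carried over."""
    lap = (u_curr[:-2, 1:-1, 1:-1] + u_curr[2:, 1:-1, 1:-1] +
           u_curr[1:-1, :-2, 1:-1] + u_curr[1:-1, 2:, 1:-1] +
           u_curr[1:-1, 1:-1, :-2] + u_curr[1:-1, 1:-1, 2:] -
           6.0 * u_curr[1:-1, 1:-1, 1:-1]) / h2
    u_next = u_curr.copy()  # keep boundary values unchanged in this sketch
    u_next[1:-1, 1:-1, 1:-1] = (2.0 * u_curr[1:-1, 1:-1, 1:-1]
                                - u_prev[1:-1, 1:-1, 1:-1]
                                + c2_dt2 * lap)
    return u_next
```

<p>Each update touches a small neighbourhood of every grid point, which is why these kernels are dominated by memory traffic rather than arithmetic.</p>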

<h3 id="processor-specifications">Processor Specifications</h3>

<table>
  <thead>
    <tr>
      <th style="text-align: center">Processor</th>
      <th style="text-align: center">Physical Cores</th>
      <th style="text-align: center">Sockets</th>
      <th style="text-align: center">Architecture</th>
      <th style="text-align: center">HBM</th>
      <th style="text-align: center">Memory</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: center">5th Gen Intel® Xeon® Platinum 8592+ (Emerald Rapids)</td>
      <td style="text-align: center">128</td>
      <td style="text-align: center">2</td>
      <td style="text-align: center">x86_64</td>
      <td style="text-align: center">No</td>
      <td style="text-align: center">Supports DDR5 memory with an eight-channel interface</td>
    </tr>
    <tr>
      <td style="text-align: center">Intel® Xeon® 6 6980P (Granite Rapids)</td>
      <td style="text-align: center">256</td>
      <td style="text-align: center">2</td>
      <td style="text-align: center">x86_64</td>
      <td style="text-align: center">No</td>
      <td style="text-align: center">Supports DDR5 memory with a twelve-channel interface</td>
    </tr>
  </tbody>
</table>

<p>Benchmarks were conducted using identical compilers, software environments, and
simulation parameters to ensure fair comparisons.</p>

<ul>
  <li>Compilers: Intel(R) oneAPI DPC++/C++ Compiler 2024.2.1</li>
  <li>MPI Library: Intel(R) MPI 2021.13</li>
  <li>Optimization Flags: <code class="language-plaintext highlighter-rouge">-O3 -g -fPIC -Wall -std=c99 -xHost -fp-model=fast -qopt-zmm-usage=high -shared -qopenmp</code></li>
  <li>Software Stack: Benchmarks were conducted using <strong>DevitoPRO v4.8.x</strong>, which includes <strong>an experimental mixed-precision implementation</strong>. While this version incorporates <strong>early optimizations for FP32/FP16 computing</strong>, ongoing development efforts are focused on further refining precision handling, improving stability, and enhancing performance scalability across diverse hardware architectures.</li>
</ul>
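<p>Devito selects the compiler toolchain, parallel programming model, and MPI mode at runtime through environment variables. As a hedged sketch of how such a setup might be configured (the values shown, such as <code>icx</code> for the oneAPI compiler, are assumptions for this study; consult the Devito documentation for the authoritative variable list):</p>

```python
import os

# Configure Devito's JIT toolchain before importing the package;
# these variables are read when kernels are generated and compiled.
os.environ["DEVITO_ARCH"] = "icx"         # Intel oneAPI DPC++/C++ compiler (assumed value)
os.environ["DEVITO_LANGUAGE"] = "openmp"  # emit OpenMP-parallel kernels
os.environ["DEVITO_MPI"] = "1"            # enable MPI domain decomposition
```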

<h2 id="performance-results">Performance Results</h2>

<h3 id="generational-performance-gains-for-acoustic-tti">Generational Performance Gains for acoustic TTI</h3>

<p>The acoustic TTI benchmark measures the efficiency of seismic wave propagation simulations, a key workload in geophysical exploration. The benchmark was conducted using a 1024×2048×1024 computational grid with 5000 time steps, ensuring a realistic and computationally demanding test case. The simulations were executed using <strong>DevitoPRO</strong>, leveraging optimized code generation for modern CPU architectures.</p>

<table>
  <thead>
    <tr>
      <th style="text-align: center">Metric</th>
      <th style="text-align: center">5th Gen Intel® Xeon® Platinum 8592+ (Emerald Rapids)</th>
      <th style="text-align: center">Intel® Xeon® 6 6980P (Granite Rapids)</th>
      <th style="text-align: center">Improvement</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: center">Operations</td>
      <td style="text-align: center">2.28 TFlops</td>
      <td style="text-align: center">5.96 TFlops</td>
      <td style="text-align: center">2.6x faster</td>
    </tr>
    <tr>
      <td style="text-align: center">FD-throughput (GPts/s)</td>
      <td style="text-align: center">7.45 GPts/s</td>
      <td style="text-align: center">15.74 GPts/s</td>
      <td style="text-align: center">2.1x faster</td>
    </tr>
  </tbody>
</table>
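<p>The FD-throughput figures relate directly to wall-clock time: GPts/s is the number of grid-point updates performed per second, i.e. grid size × time steps ÷ runtime. A quick back-of-envelope check against the table (the runtimes here are derived from the throughput figures, not measured values from the benchmark logs):</p>

```python
# FD-throughput (GPts/s) = grid points * time steps / runtime / 1e9
nx, ny, nz, nt = 1024, 2048, 1024, 5000
total_updates = nx * ny * nz * nt          # ~1.07e13 grid-point updates

runtime_emr = total_updates / 7.45e9       # implied seconds on EMR
runtime_gnr = total_updates / 15.74e9      # implied seconds on GNR
speedup = runtime_emr / runtime_gnr        # ~2.1x, consistent with the table
```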

<p>To fully exploit the hardware capabilities, the benchmark used a <strong>NUMA-aware hybrid MPI-OpenMP configuration</strong>, optimizing both <strong>computation</strong> and <strong>data locality</strong>:</p>

<ul>
  <li><strong>Emerald Rapids (EMR) Configuration:</strong>
    <ul>
      <li><strong>Physical cores per socket:</strong> 64</li>
      <li><strong>Total physical cores (dual-socket):</strong> 128</li>
      <li><strong>NUMA domain per socket:</strong> 2</li>
      <li><strong>MPI ranks:</strong> 4 (<strong>1 per NUMA domain</strong>)</li>
      <li><strong>OpenMP threads per rank:</strong> 32</li>
      <li><strong>Pinning:</strong> <code class="language-plaintext highlighter-rouge">I_MPI_PIN_DOMAIN=numa</code>, <code class="language-plaintext highlighter-rouge">I_MPI_PIN_ORDER=bunch</code>, <code class="language-plaintext highlighter-rouge">I_MPI_PIN_CELL=core</code></li>
    </ul>
  </li>
  <li><strong>Granite Rapids (GNR) Configuration:</strong>
    <ul>
      <li><strong>Physical cores per socket:</strong> 128</li>
      <li><strong>Total physical cores (dual-socket):</strong> 256</li>
      <li><strong>NUMA domain per socket:</strong> 3</li>
      <li><strong>MPI ranks:</strong> 6 (<strong>1 per NUMA domain</strong>)</li>
      <li><strong>OpenMP threads per rank:</strong> 42 (the NUMA domains on GNR are unbalanced: per socket, two domains have 43 cores and one has 42. We use 42 threads per domain to keep the configuration uniform.)</li>
      <li><strong>Pinning:</strong> <code class="language-plaintext highlighter-rouge">I_MPI_PIN_DOMAIN=numa</code>, <code class="language-plaintext highlighter-rouge">I_MPI_PIN_ORDER=bunch</code>, <code class="language-plaintext highlighter-rouge">I_MPI_PIN_CELL=core</code></li>
    </ul>
  </li>
</ul>

<p>The NUMA-aware process placement ensured that:</p>

<ul>
  <li>MPI ranks were confined within NUMA nodes, minimizing cross-socket communication overhead.</li>
  <li>OpenMP threads were pinned to physical cores within NUMA domains, reducing memory latency.</li>
  <li>Halo exchange efficiency was maximized through optimized memory access patterns.</li>
</ul>

<h3 id="key-factors-driving-performance-gains">Key Factors Driving Performance Gains</h3>

<p>Granite Rapids exhibited over twice the performance of Emerald Rapids due to:</p>

<ol>
  <li>Higher Memory Bandwidth:
    <ul>
      <li>GNR features twelve DDR5 memory channels per socket, improving data movement efficiency.</li>
    </ul>
  </li>
  <li>Increased Core Count:
    <ul>
      <li>GNR doubles the physical core count (256 vs. 128) compared to EMR, significantly boosting parallel execution.</li>
    </ul>
  </li>
</ol>

<p>These architectural and software improvements collectively delivered 2.6x higher floating-point performance and 2.1x higher finite-difference throughput (GPts/s), demonstrating the generational leap in efficiency from Emerald Rapids to Granite Rapids.</p>

<h3 id="mixed-precision-performance-gains-for-isotropic-elastic-on-granite-rapids-gen-6">Mixed-precision performance gains for Isotropic Elastic on Granite Rapids (Gen 6)</h3>

<p>The isotropic elastic benchmark evaluates the efficiency of multi-component wave
propagation simulations, a crucial workload in geophysical imaging. This test
was conducted on a 1024 × 2048 × 1024 computational grid with 5000 time steps.</p>

<p>Unlike acoustic TTI, which was benchmarked exclusively in FP32 precision,
isotropic elastic simulations were tested in both FP32-only and mixed FP32/FP16
modes. The introduction of mixed precision yielded significant computational and
memory efficiency improvements.</p>

<p>We use the same hybrid MPI-OpenMP parallelism with NUMA-aware pinning as with
the acoustic TTI benchmark.</p>

<table>
  <thead>
    <tr>
      <th style="text-align: center">Precision Mode</th>
      <th style="text-align: center">Operations</th>
      <th style="text-align: center">Compute Throughput (GPts/s)</th>
      <th style="text-align: center">Improvement</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: center">FP32</td>
      <td style="text-align: center">1.02 TFlops</td>
      <td style="text-align: center">3.59 GPts/s</td>
      <td style="text-align: center">-</td>
    </tr>
    <tr>
      <td style="text-align: center">FP32/FP16 (Mixed)</td>
      <td style="text-align: center">2.37 TFlops</td>
      <td style="text-align: center">8.30 GPts/s</td>
      <td style="text-align: center">~2.3x faster</td>
    </tr>
  </tbody>
</table>

<h3 id="why-does-mixed-precision-matter">Why Does Mixed Precision Matter?</h3>

<p>Switching from FP32-only computation to a mixed FP32/FP16 approach provided a
~2.3x speedup, achieved through a balanced approach that optimizes both storage
and arithmetic precision. Since finite difference methods are memory-bound, the
key to performance gains lies in reducing memory bandwidth pressure while
maintaining numerical accuracy.</p>

<p>The mixed-precision strategy used in DevitoPRO follows this principle:</p>

<ul>
  <li>FP16 for storage: Reduces memory footprint and accelerates data movement.</li>
  <li>FP32 for arithmetic: Preserves computational accuracy by minimizing rounding errors.</li>
</ul>

<p>This approach ensures that precision is maintained where it matters most while
taking advantage of FP16’s efficiency for memory operations.</p>
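<p>The storage/arithmetic split can be illustrated in a few lines of NumPy (a toy sketch of the principle only; DevitoPRO performs the conversion inside its compiled kernels):</p>

```python
import numpy as np

# Wavefields held in FP16 halve the bytes moved through the memory hierarchy...
u = np.linspace(0.0, 1.0, 64 * 64).reshape(64, 64).astype(np.float16)
v = np.linspace(1.0, 2.0, 64 * 64).reshape(64, 64).astype(np.float16)

# ...while arithmetic is promoted to FP32 to limit rounding error.
w = u.astype(np.float32) * v.astype(np.float32)

assert u.nbytes * 2 == w.nbytes   # FP16 storage is half the FP32 footprint
assert w.dtype == np.float32      # computation stays in single precision
```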

<h3 id="key-benefits-of-mixed-precision">Key Benefits of Mixed Precision</h3>

<ol>
  <li>Lower Memory Footprint:
    <ul>
      <li>FP16 values take up half the memory of FP32, doubling effective cache capacity and reducing pressure on memory bandwidth.</li>
      <li>More wavefield data can fit in fast-access memory, improving locality and cache reuse.</li>
    </ul>
  </li>
  <li>Reduced Communication Overhead:
    <ul>
      <li>In MPI-based distributed environments, using FP16 for storage reduces halo exchange size, halving inter-rank data transfer costs.</li>
      <li>This is particularly beneficial in multi-node scaling scenarios, where communication is a major bottleneck.</li>
    </ul>
  </li>
  <li>Faster Computation:
    <ul>
      <li>Intel Xeon 6 processors feature optimized FP16 vector and tensor operations, accelerating data movement and memory loads.</li>
      <li>Arithmetic remains in FP32, avoiding excessive rounding errors while still benefiting from higher memory bandwidth efficiency.</li>
    </ul>
  </li>
</ol>

<p>By carefully combining FP16 for storage and FP32 for arithmetic, DevitoPRO
achieves significant speedups while ensuring numerical stability, making this
approach ideal for large-scale elastic wave simulations.</p>

<h3 id="impact-on-elastic-wave-simulations">Impact on Elastic Wave Simulations</h3>

<p>Elastic wave simulations require multiple coupled wavefields, significantly
increasing memory and computational demands.  By leveraging mixed FP32/FP16
precision, DevitoPRO achieves:</p>

<ul>
  <li>Higher performance with reduced memory bandwidth constraints.</li>
  <li>More efficient inter-rank communication, crucial for large-scale multi-node workloads.</li>
  <li>Improved scalability, making mixed precision a practical choice for high-fidelity seismic modeling.</li>
</ul>

<p>These optimizations ensure Granite Rapids delivers superior elastic wave
simulation performance, making it a compelling choice for next-generation
geophysical imaging workloads.</p>

<h2 id="key-takeaways">Key Takeaways</h2>

<ul>
  <li>
    <p>Intel Xeon 6 delivers substantial generational performance gains for finite-difference simulations, achieving 2.6× higher floating-point performance and 2.1× higher finite-difference throughput (GPts/s) compared to its predecessor.</p>
  </li>
  <li>
    <p>Mixed-precision computing (FP32/FP16) significantly enhances efficiency by reducing memory footprint and accelerating computations, making it a crucial optimization for large-scale seismic workloads. Using FP16 for storage while maintaining FP32 for arithmetic strikes the right balance between performance and numerical accuracy.</p>
  </li>
  <li>
    <p>DevitoPRO automates high-performance code generation, allowing users to fully exploit modern hardware without requiring manual tuning. Optimized memory layouts, vectorization, and hybrid MPI/OpenMP parallelism are applied transparently, enabling peak efficiency across different architectures.</p>
  </li>
  <li>
    <p>Hardware-aware code generation is essential for performance portability, ensuring that computational workloads can scale efficiently on diverse hardware platforms. These results reinforce DevitoPRO’s approach, where automated performance tuning maximizes computational efficiency without sacrificing precision.</p>
  </li>
</ul>

<h2 id="looking-ahead">Looking Ahead</h2>

<p>At Devito Codes, we remain committed to hardware-neutral performance
optimization. While this benchmarking study focuses on Intel Xeon 6, we are
actively working on:</p>

<ul>
  <li>
    <p>Performance analysis on other architectures, including AMD EPYC, NVIDIA Hopper, and ARM-based HPC systems.</p>
  </li>
  <li>
    <p>Refining mixed-precision strategies to balance accuracy, performance, and memory efficiency.</p>
  </li>
  <li>
    <p>Expanding code portability, ensuring DevitoPRO runs optimally across a diverse range of HPC platforms.</p>
  </li>
</ul>

<p>Users and organizations interested in seismic imaging, medical imaging, or
wave-based simulations can explore DevitoPRO’s capabilities at
<a href="https://www.devitocodes.com/features/">https://www.devitocodes.com/features/</a>.</p>

<p>Would you like more details? Feel free to reach out!</p>]]></content><author><name>mlouboutin</name></author><category term="Intel/HPC" /><category term="DSL" /><category term="Python" /><category term="HPC" /><category term="RTM" /><category term="FWI" /><category term="Seismic" /><category term="Elastic" /><category term="Mixed-precision" /><summary type="html"><![CDATA[Explore the cutting-edge performance of DevitoPRO on Intel® Xeon® 6 6980P processors in this comprehensive benchmark study. This blog post reveals how next-generation hardware accelerates finite-difference seismic imaging simulations, delivering over 2.6× higher computational throughput and 2.1× faster compute performance compared to 5th Gen Intel® Xeon® Platinum systems. With a focus on both acoustic TTI and elastic propagator models, we examine the benefits of mixed-precision techniques that combine FP16 storage with FP32 arithmetic for optimized memory bandwidth and accuracy. Learn how NUMA-aware hybrid MPI/OpenMP configurations further boost performance, enabling scalable and efficient high-fidelity geophysical simulations. 
Discover the full benchmark results.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.devitocodes.com/images/make-subsurface-from-granite.png" /><media:content medium="image" url="https://www.devitocodes.com/images/make-subsurface-from-granite.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Accurate and Robust Propagators and Gradients for Land Seismic Imaging</title><link href="https://www.devitocodes.com/IM" rel="alternate" type="text/html" title="Accurate and Robust Propagators and Gradients for Land Seismic Imaging" /><published>2024-08-21T00:01:00+00:00</published><updated>2024-08-21T00:01:00+00:00</updated><id>https://www.devitocodes.com/IM</id><content type="html" xml:base="https://www.devitocodes.com/IM"><![CDATA[<p>At Devito Codes, we continue to push the boundaries of seismic imaging and wave
propagation. Our latest innovation, to be showcased at IMAGE’24, addresses one
of the most challenging aspects of land seismic imaging: accurately modeling
wavefields in the presence of complex topography using finite-difference
methods, all without the need for unstructured mesh generation.</p>

<h2 id="imaging-from-topography">Imaging from Topography</h2>

<p>High-quality topography handling enables calculation of accurate gradients and FWI topographic updates, even in settings where topographic variation is extreme. If forward and adjoint modeling can capture the physics of the free-surface, these effects become data rather than noise; leveraging this data enhances resolution and illumination, whilst minimizing the requisite preprocessing and streamlining imaging workflows. However, failure to accurately account for the effects of topographic variation leads to images which are unfocused and artefact-prone at best, and entirely incoherent at worst. It follows that for confident and robust land seismic imaging, accurate topography implementation is crucial.</p>

<h4 id="illumination-corrected-gradient-for-an-fwi-tomographic-update">Illumination-corrected gradient for an FWI tomographic update</h4>
<p><img src="/images/corrected_fwi_gradient.png" alt="Example of a corrected gradient for an FWI tomographic update" /></p>

<h4 id="corresponding-inverse-scattering-imaging-condition">Corresponding inverse-scattering imaging condition</h4>
<p><img src="/images/gradient_laplacian_alt_cmap.png" alt="Example of an inverse-scattering imaging condition" /></p>

<h2 id="understanding-the-challenge">Understanding the Challenge</h2>

<p>Traditionally, incorporating topography into solvers has been a complex and
time-consuming task, often requiring the development of bespoke solutions that
are difficult to generalize across different wave equations. While
finite-element methods and mesh generation offer potential solutions, they come
with significant challenges, including a lack of software and practical
experience in developing finite-element inversion methods for exploration
geophysics and the absence of robust automatic mesh generation techniques. These
factors make finite-element approaches less attractive for building efficient
seismic imaging processing pipelines, further complicating the incorporation of
complex topography for land seismic imaging.</p>

<h2 id="a-general-approach-to-modelling-complex-topography">A General Approach to Modelling Complex Topography</h2>

<p>To solve this problem, we’ve developed Schism, a module that automatically
generates immersed-boundary operators for Devito. Schism simplifies the
integration of complex topographies into wave propagation models, separating
topography specification from numerical implementation. This allows imaging
algorithms to leverage sophisticated topography handling routines without added
complexity.</p>

<p>Immersed boundaries represent free surfaces as sharp interfaces on a regular
grid, accurately reflecting the true surface position. This method eliminates
the need for mesh generation or curvilinear grids, by constructing suitable
field extensions beyond the boundary, which are incorporated into
surface-adjacent FD operators, enforcing the necessary boundary conditions.</p>
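<p>In the simplest one-dimensional case, such a field extension amounts to populating ghost points above the surface so that the boundary condition is satisfied implicitly. The antisymmetric (odd) extension below enforces a pressure-free surface, u = 0, at index 0; this is only a toy illustration of the idea, not Schism's generalized multi-dimensional operators:</p>

```python
import numpy as np

def extend_free_surface(u, n_ghost):
    """Odd extension of a 1D field about a free surface at index 0,
    enforcing u = 0 there via u[-i] = -u[i]. Surface-adjacent FD
    stencils can then be applied unmodified using the ghost values."""
    ghost = -u[1:n_ghost + 1][::-1]
    return np.concatenate([ghost, u])

u = np.array([0.0, 1.0, 2.0, 3.0])    # u[0] lies on the free surface
ext = extend_free_surface(u, 2)       # [-2., -1., 0., 1., 2., 3.]
```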

<p>Our paper, <a href="https://doi.org/10.1190/geo2023-0515.1">A Novel Immersed Boundary Approach for Irregular Topography with
Acoustic Wave Equations</a>, published
earlier this year in <em>Geophysics</em>, demonstrates this approach by modeling wave
propagation around the complex topography of Mount St. Helens.</p>

<p>As illustrated below, the 3D free-surface wavefield is accurately rendered,
capturing the intricate interactions of the wavefield with steep, uneven
terrain. This showcases the power of the immersed boundary method in addressing
real-world seismic challenges.</p>

<p><img src="/images/StHelens.png" alt="Wave propagation around the complex topography of Mount St. Helens" /></p>

<h2 id="applying-to-land-seismic-data">Applying to Land Seismic Data</h2>

<p>We are collaborating with service companies like
<a href="https://www.s-cube.com/partnerships/xwi-plus-devito/">S-Cube</a> to apply these methods to
real-world land seismic applications. Accurately modeling acoustic and elastic
TTI wavefield behavior in environments with pronounced and irregular topography
is essential. Our immersed boundary method effectively captures topographic
effects such as surface multiples, amplitude variations, and diffraction around
obstacles while maintaining the regular computational grids commonly used in
imaging applications. This makes it an ideal choice for integrating topography
handling into existing solvers and imaging workflows.</p>

<p>The innovation here lies in the robustness and generality of the approach,
allowing integration into existing seismic data processing pipelines. The
boundary treatment is tied to the discretization, boundary conditions, and
geometry but not to the governing equations. Through symbolic computation and a
generalized mathematical approach, immersed boundary treatments can be generated
for various equations and boundary conditions, ensuring flexibility and ease of
use. For example, higher-order free surface conditions derived from isotropic
acoustic wave equations behave similarly in acoustic TTI contexts, offering a
computationally efficient alternative.</p>

<h2 id="join-us-at-image24">Join Us at IMAGE’24</h2>

<p>Our immersed boundary support is now available as an experimental feature in
DevitoPRO. If you are interested in learning more about Schism, Devito, or
DevitoPRO, we invite you to go and see Dr. Ed Caunt at his IMAGE’24 presentations:</p>

<h4 id="wednesday-2808-1000">Wednesday 28/08 10:00</h4>

<p><strong>S-Cube &amp; Devito Codes: New Developments for Advanced Physics</strong>, S-Cube &amp; ThinkOnward Booth (1659) in the Digitalisation Pavilion.</p>

<h4 id="thursday-2908-0800-0940am">Thursday 29/08 08:00-09:40AM</h4>

<p><strong>Recent Advances in Seismic Modeling 1</strong>, (Session ID: SMT P1) Poster Station 3 (3rd Level) A, in the George R. Brown Convention Center</p>
<ul>
  <li><strong>An immersed boundary topography approach for TTI acoustic propagation</strong></li>
  <li><strong>Towards elastic-free surface topography with immersed boundaries</strong></li>
</ul>]]></content><author><name>ecaunt</name></author><category term="RTM/FWI" /><category term="DSL" /><category term="Python" /><category term="HPC" /><category term="RTM" /><category term="FWI" /><category term="Seismic" /><summary type="html"><![CDATA[Devito Codes has developed a new module to improve land seismic imaging by accurately modeling wavefields in complex topographies without the need for unstructured mesh generation. This innovation uses an immersed boundary method, representing free surfaces on a regular grid and enforcing boundary conditions through field extensions. This approach, showcased in their recent paper and collaboration with S-Cube, allows seamless integration into existing seismic data processing pipelines. Schism’s flexibility and efficiency make it ideal for handling topography in seismic imaging. Learn more at IMAGE'24, where Dr. Ed Caunt will present these advancements.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.devitocodes.com/images/land-waves.png" /><media:content medium="image" url="https://www.devitocodes.com/images/land-waves.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Seismic Imaging Performance with AWS Graviton4</title><link href="https://www.devitocodes.com/graviton4" rel="alternate" type="text/html" title="Seismic Imaging Performance with AWS Graviton4" /><published>2024-08-19T00:01:00+00:00</published><updated>2024-08-19T00:01:00+00:00</updated><id>https://www.devitocodes.com/graviton4</id><content type="html" xml:base="https://www.devitocodes.com/graviton4"><![CDATA[<p>With the general availability of <a href="https://www.aboutamazon.com/news/aws/graviton4-aws-cloud-computing-chip">AWS
Graviton4</a>,
we explore its performance advancements over previous Graviton generations using
DevitoPRO’s standard 3D acoustic wave propagation kernels. These Full Waveform
Inversion (FWI) and Reverse Time Migration (RTM) propagation kernels serve as
benchmarks to develop cost models and assess computational efficiency.</p>

<p>The <a href="https://aws.amazon.com/ec2/instance-types/r8g/">AWS Graviton4</a> exhibits substantial performance improvements compared to its
predecessors. For <strong>3D Isotropic acoustic</strong>, the AWS Graviton4 (r8g.24xlarge)
delivers performance that is approximately 2.7 times faster than Graviton2 and
81% faster than Graviton3. In the <a href="https://doi.org/10.1190/1.3269902">3D Fletcher-Du-Fowler
TTI</a> benchmark, the Graviton4 outperforms
Graviton2 by over 3.4 times and is 51% faster than Graviton3. Similarly, the <a href="https://library.seg.org/doi/10.1190/segam2016-13878451.1">3D
Self-adjoint TTI</a>
benchmark on AWS Graviton4 runs nearly 3.6 times faster than Graviton2 and 80%
faster than Graviton3.</p>

<p>These results underscore the significant generational improvements in the
Graviton4, particularly for memory-bound high-performance computing
applications.</p>

<h3 id="graviton4---amazon-ec2-r8g-instances">Graviton4 - Amazon EC2 R8g Instances</h3>

<p>The Amazon EC2 instance type <em>r8g.24xlarge</em> is a single-socket Graviton4. The
Graviton4 processors offer significant advancements over the Graviton3. The
Graviton4 has 96 Neoverse V2 cores and a revamped memory subsystem, featuring
<a href="https://www.nextplatform.com/2023/11/28/aws-adopts-arm-v2-cores-for-expansive-graviton4-server-cpu/">12 DDR5-5600 channels and a peak memory bandwidth of 536.7 GB/s, which is 75%
higher than the
Graviton3.</a>
These improvements are particularly beneficial for memory-bound HPC
applications, such as finite-difference based FWI/RTM operators.</p>
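<p>For a memory-bound stencil, peak memory bandwidth puts a hard ceiling on achievable throughput in grid points per second. A crude roofline estimate (the bytes moved per grid-point update is a modelling assumption, here 16 bytes: three FP32 reads and one FP32 write with perfect cache reuse of stencil neighbours):</p>

```python
# Bandwidth roofline: throughput ceiling = bandwidth / bytes per point
peak_bw = 536.7e9        # Graviton4 peak memory bandwidth, bytes/s
bytes_per_point = 4 * 4  # assumed traffic per grid-point update (FP32)

ceiling_gpts = peak_bw / bytes_per_point / 1e9   # upper bound in GPts/s
```

<p>How close measured throughput comes to such a ceiling indicates how well a kernel saturates the memory subsystem.</p>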

<h3 id="compiling-on-graviton-using-gcc-141">Compiling on Graviton using GCC-14.1</h3>

<p>To optimize the performance of our benchmarks on the AWS Graviton4 processor, we
built <a href="https://gcc.gnu.org/gcc-14/">GCC 14.1</a> from source, rather than the
system default for Amazon Linux 2023 (6.1.94-99.176.amzn2023.aarch64) which is
GCC 11.4.1. GCC 14.1 has a range of improvements that enhance the compiler’s
ability to leverage the advanced features of the Neoverse cores, particularly
the Neoverse V2 architecture used in Graviton4.</p>

<p>Key enhancements in GCC 14 include improved support for vectorization and new
optimizations that are tailored for Neoverse V2 cores. These improvements allow
for better exploitation of the high memory bandwidth and increased core count in
Graviton4, resulting in more efficient execution of high-performance computing
workloads. Additionally, the release includes enhancements in
auto-vectorization, which are particularly beneficial for memory-bound
applications like seismic imaging and simulation tasks.</p>

<p>Devito uses the following GCC flags depending on the generation of Graviton to
maximize performance on a given instance. As there is just one NUMA domain in
all cases, we parallelize with pure OpenMP.</p>

<p>Graviton2 (Neoverse-N1):</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gcc-14 <span class="nt">-mcpu</span><span class="o">=</span>neoverse-n1 <span class="nt">-O3</span> <span class="nt">-g</span> <span class="nt">-fPIC</span> <span class="nt">-Wall</span> <span class="nt">-std</span><span class="o">=</span>c99 <span class="nt">-Wno-unused-result</span> <span class="nt">-Wno-unused-variable</span> <span class="nt">-Wno-unused-but-set-variable</span> <span class="nt">-ffast-math</span> <span class="nt">-shared</span> <span class="nt">-fopenmp</span>
</code></pre></div></div>

<p>Graviton3 (Neoverse-V1):</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gcc-14 <span class="nt">-mcpu</span><span class="o">=</span>neoverse-v1 <span class="nt">-O3</span> <span class="nt">-g</span> <span class="nt">-fPIC</span> <span class="nt">-Wall</span> <span class="nt">-std</span><span class="o">=</span>c99 <span class="nt">-Wno-unused-result</span> <span class="nt">-Wno-unused-variable</span> <span class="nt">-Wno-unused-but-set-variable</span> <span class="nt">-ffast-math</span> <span class="nt">-shared</span> <span class="nt">-fopenmp</span>
</code></pre></div></div>

<p>Graviton4 (Neoverse-V2):</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gcc-14 <span class="nt">-mcpu</span><span class="o">=</span>neoverse-v2 <span class="nt">-O3</span> <span class="nt">-g</span> <span class="nt">-fPIC</span> <span class="nt">-Wall</span> <span class="nt">-std</span><span class="o">=</span>c99 <span class="nt">-Wno-unused-result</span> <span class="nt">-Wno-unused-variable</span> <span class="nt">-Wno-unused-but-set-variable</span> <span class="nt">-ffast-math</span> <span class="nt">-shared</span> <span class="nt">-fopenmp</span>
</code></pre></div></div>

<h3 id="3d-acoustic-benchmarks">3D acoustic benchmarks</h3>

<p>To evaluate the performance of the Graviton4, we conducted benchmarks using
three standard 3D acoustic wave propagation kernels. These tests were run in
single precision with OpenMP thread parallelism. Each benchmark was carefully
autotuned to optimize parameters such as cache block size, and the best result
from multiple runs was recorded.</p>

<p>These benchmarks are categorized based on their operational intensity. The
isotropic acoustic benchmark is the simplest among them, commonly used in
seismic imaging, and its performance is primarily influenced by memory
bandwidth. The Fletcher-Du-Fowler TTI kernel, which is frequently utilized by
hardware vendors for benchmarking, represents a moderate level of complexity. In
contrast, the self-adjoint TTI kernel, designed for robustness and accuracy, is
employed in production workloads, reflecting the most demanding computational
requirements for acoustic FWI/RTM. This categorization allows for a
comprehensive evaluation of performance across a spectrum of real-world
scenarios.</p>

<p>All three benchmarks were run with a fixed problem size of 512x512x512 grid
points, for a total of 400 time-steps, with space-order 8 and time-order 2.</p>
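<p>For orientation, "space-order 8" means each second derivative is approximated with a 9-point (8th-order) central-difference stencil along each axis. The 1-D NumPy sketch below illustrates such a stencil; it is an illustration of the discretization only, not the Devito-generated benchmark kernel.</p>

```python
import numpy as np

# Standard 8th-order central-difference coefficients for the second
# derivative (offset: weight); the stencil spans 4 points on each side.
COEFFS = {0: -205.0 / 72.0, 1: 8.0 / 5.0, 2: -1.0 / 5.0,
          3: 8.0 / 315.0, 4: -1.0 / 560.0}

def d2_order8(f, h):
    """8th-order second derivative of a 1-D array at interior points."""
    out = COEFFS[0] * f[4:-4].copy()
    for k in range(1, 5):
        out += COEFFS[k] * (f[4 + k:len(f) - 4 + k] + f[4 - k:-4 - k])
    return out / h**2

# Exact (to rounding) on polynomials up to degree 8, e.g. f(x) = x**2:
x = np.linspace(0.0, 1.0, 21)
d2 = d2_order8(x**2, x[1] - x[0])   # ~2 at every interior point
```

<p>Applied along all three axes of a 3D grid, the space-order-8 Laplacian reads 25 points per output value, which is one reason the isotropic kernel is primarily limited by memory bandwidth.</p>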

<p>In the bar chart below we show the performance of Graviton3 and Graviton4
relative to Graviton2. The <em>r6g.16xlarge</em>, <em>r7g.16xlarge</em> and
<em>r8g.16xlarge</em> all have 64 cores, which highlights the per-core
performance improvement. In contrast, the <em>r8g.24xlarge</em> (a
single-socket Graviton4) has a total of 96 cores. We can see some
strong-scaling limits when running the isotropic acoustic benchmark as we go
from the <em>r8g.16xlarge</em> to the <em>r8g.24xlarge</em> (~10% performance
improvement). However, this does not appear to be an issue when running more
complex propagator kernels such as TTI.</p>

<p><img src="/images/performance-relative-G2.png" alt="Performance relative to Graviton2" /></p>

<h3 id="price-performance">Price performance</h3>

<p>The overall picture for price-performance also looks good for AWS users,
though there are some nuances. In the bar chart below we combine the AWS
on-demand price of each instance with the benchmark performance in
<em>giga-points-per-second</em> to create a <em>tera-points-per-dollar</em>
(TP/$) metric. This measures how much work gets done per dollar, allowing
users to estimate the <em>price-to-solution</em>.</p>
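<p>To make the unit conversion concrete, here is a small helper showing how a throughput in giga-points-per-second and an hourly on-demand price combine into tera-points-per-dollar. The function name and example numbers are ours, for illustration only.</p>

```python
def tera_points_per_dollar(gpoints_per_second, usd_per_hour):
    """Convert sustained throughput (GP/s) and instance price ($/hr) to TP/$."""
    gpoints_per_hour = gpoints_per_second * 3600.0   # seconds per hour
    tpoints_per_hour = gpoints_per_hour / 1000.0     # giga -> tera
    return tpoints_per_hour / usd_per_hour

# A kernel sustaining 1 GP/s on a $3.60/hr instance delivers 1 TP/$:
example = tera_points_per_dollar(1.0, 3.60)
```

<p>Doubling throughput at a fixed hourly price, or halving the price at fixed throughput, both double the TP/$ value.</p>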

<p>For the isotropic acoustic and the self-adjoint acoustic TTI, we can see that
the Graviton4 delivers the highest throughput per dollar, followed by Graviton3
and Graviton2.</p>

<p>However, the Fletcher Du Fowler TTI benchmark presents an exception. In this
case, Graviton3 provides the highest throughput per dollar, followed by
Graviton4 and then Graviton2. Although both the isotropic acoustic and
self-adjoint TTI benchmarks are 81% faster on Graviton4 than on Graviton3, the
Fletcher Du Fowler TTI benchmark is only 51% faster on Graviton4 than on
Graviton3. Given that the theoretical increase in memory bandwidth is 75%, this
discrepancy warrants a deeper performance profiling analysis to understand why
this particular benchmark underperforms relative to the other benchmarks.</p>

<p><img src="/images/TP_per_dollar.png" alt="Price performance" /></p>

<h3 id="graviton4-r8g16xlarge-vs-r8g24xlarge">Graviton4 r8g.16xlarge vs r8g.24xlarge</h3>

<p>When choosing between the <em>r8g.16xlarge</em> and <em>r8g.24xlarge</em>
instances, it is important to consider the specific characteristics of your
workload. For workloads where the problem domain is too small to benefit from
strong scaling across all available cores, allocating the entire node and
running multiple shots per node can provide better value. This approach not
only maximizes resource utilization but also avoids the potential impact of
noisy neighbors in multi-tenant environments, to which an <em>r8g.16xlarge</em>
instance would be exposed.</p>

<p>By fully utilizing the <em>r8g.24xlarge</em> instance, which contains all 96 cores of the
Graviton4 processor, you can achieve more consistent performance, as the risk of
resource contention from other tenants is minimized. This strategy ensures you
get the best possible value from the Graviton4 architecture for your HPC tasks.</p>

<h3 id="conclusion">Conclusion</h3>

<p>The benchmarking results clearly demonstrate the impressive capabilities of the
AWS Graviton4 processor. With its increased core count and higher memory
bandwidth, the Graviton4 significantly enhances the performance of
high-performance computing (HPC) applications. Our benchmarks show that
DevitoPRO, utilizing AWS Graviton4, achieves substantial performance gains over
previous Graviton generations, particularly for memory-bound applications such
as seismic imaging.</p>

<p>The Graviton4’s advancements in core architecture and memory subsystem allow
DevitoPRO to run efficiently without requiring extensive modifications. This
highlights the ease of integration and the high productivity potential that
DevitoPRO offers for cutting-edge seismic imaging applications. The performance
improvements observed across various benchmarks—3D Isotropic Acoustic, Fletcher
Du Fowler TTI, and Self-adjoint TTI—underscore the generational leap in
computational efficiency provided by Graviton4.</p>

<p>Moreover, the Graviton4 processor also offers favorable price-performance
ratios, making it a cost-effective choice for demanding workloads. While certain
benchmarks like Fletcher Du Fowler TTI indicate areas for further optimization,
the overall enhancements in speed and throughput per dollar make Graviton4 an
attractive option for HPC users.</p>

<p>Our work with AWS underscores our commitment to delivering top-tier
performance across diverse hardware platforms, ensuring that our users can
achieve the best possible outcomes in seismic imaging and beyond.</p>

<h3 id="acknowledgements">Acknowledgements</h3>

<p>Many thanks to the AWS team for their generous provision of credits and
technical support, which made this benchmarking study possible. Their continued
collaboration and support are invaluable in our ongoing efforts to optimize
DevitoPRO for cutting-edge seismic imaging and high-performance computing
applications.</p>]]></content><author><name>ggorman</name></author><category term="AWS" /><category term="DSL" /><category term="Python" /><category term="HPC" /><category term="RTM" /><category term="FWI" /><category term="Seismic" /><category term="AWS" /><category term="Graviton" /><summary type="html"><![CDATA[AWS Graviton4 demonstrates significant performance improvements for seismic imaging using DevitoPRO's 3D acoustic wave propagation kernels. Benchmarks show Graviton4 is up to 3.6 times faster than Graviton2 and up to 81% faster than Graviton3, especially benefiting memory-bound HPC applications. Compiling with GCC 14.1 optimizes performance on Graviton4’s Neoverse V2 cores. While there are some performance nuances, overall Graviton4 delivers superior throughput per dollar, making it a cost-effective choice for demanding workloads. These advancements underscore Graviton4's capabilities in enhancing computational efficiency for seismic imaging and other high-performance computing tasks.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.devitocodes.com/images/Devito_with_arm_in_hand.png" /><media:content medium="image" url="https://www.devitocodes.com/images/Devito_with_arm_in_hand.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Advancements in elastic wave solvers using DevitoPRO and mixed-precision</title><link href="https://www.devitocodes.com/elastic" rel="alternate" type="text/html" title="Advancements in elastic wave solvers using DevitoPRO and mixed-precision" /><published>2024-08-15T00:01:00+00:00</published><updated>2024-08-15T00:01:00+00:00</updated><id>https://www.devitocodes.com/elastic</id><content type="html" xml:base="https://www.devitocodes.com/elastic"><![CDATA[<p>At Devito Codes, we are making significant strides in developing highly optimized elastic wave solvers using DevitoPRO. 
Elastic wave inversion provides a more comprehensive understanding of subsurface properties than acoustic inversion. By capturing both P-wave and S-wave data, elastic inversion enhances the resolution and accuracy of subsurface models, which is crucial for exploration geophysics and other applications. However, elastic RTM/FWI is significantly more complex and computationally expensive than acoustic RTM/FWI, making it vital to develop fast, innovative solutions.</p>

<p>Utilizing mixed-precision methods, we have found that we can nearly double the performance of wave propagators. Comparisons with standard 32-bit floating point implementations have shown negligible numerical errors. While mixed-precision support in DevitoPRO is still under development, we are working with energy and computer industry collaborators to accelerate our roadmap to bring mixed-precision support into production quickly.</p>

<h4 id="processor-technology-trends">Processor Technology Trends</h4>

<p>The rapid growth in machine learning and AI is driving all processor manufacturers to better support these workloads by devoting more silicon to 16-bit floating point (FP16) and other reduced-precision floating-point datatypes, with <a href="https://nvidianews.nvidia.com/news/nvidia-blackwell-platform-arrives-to-power-a-new-era-of-computing">Nvidia announcing FP4</a> and <a href="https://ir.amd.com/news-events/press-releases/detail/1201/amd-accelerates-pace-of-data-center-ai-innovation-and">AMD announcing FP4 and FP6</a> this year for their next generation of processors. At the same time, memory bandwidth on these processors is increasing more slowly than computational power.</p>

<p>These hardware trends challenge the vast majority of HPC code developers, who have traditionally relied upon FP64 arithmetic to maintain accuracy, and seismic imaging workloads, which predominantly use FP32.</p>

<p>For these reasons, mathematical software programmers across all HPC communities are increasingly focused on developing algorithms and software support for accuracy-preserving mixed precision. However, this will not be straightforward, necessitating an interdisciplinary approach and changes throughout the full software stack to achieve reliable results.</p>
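<p>A small NumPy experiment (ours, not from the post) shows why "accuracy-preserving" matters: naively summing many small FP16 terms stalls once the running total is large enough that each addend falls below half a unit in the last place, whereas accumulating in FP32 from FP16 storage stays accurate.</p>

```python
import numpy as np

terms = np.full(20000, 0.001, dtype=np.float16)   # true sum is ~20.0

# Naive FP16 accumulation: with only 11 significand bits, the running
# total stalls -- at 4.0 the FP16 spacing (~0.0039) is large enough that
# adding 0.001 rounds straight back to 4.0.
fp16_sum = np.float16(0.0)
for t in terms:
    fp16_sum = np.float16(fp16_sum + t)

# Accuracy-preserving alternative: FP16 storage, FP32 accumulation.
fp32_sum = terms.astype(np.float32).sum()
```

<p>The naive loop returns 4.0 instead of ~20, while the FP32 accumulation recovers the correct result from the same FP16-stored data.</p>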

<h4 id="devitopro-for-agility-and-performance-portability">DevitoPRO for agility and performance portability</h4>

<p>DevitoPRO enhances Devito’s core functionalities by incorporating advanced compiler techniques for automatic optimization of stencil computations. Users can write complex finite-difference schemes in high-level Python code, which is then translated into optimized, parallelized C/CUDA/HIP/SYCL code suitable for various hardware architectures. DevitoPRO includes many advanced algorithmic optimizations, such as the expanding-box technique, which focuses computation on active domains to reduce overhead.</p>

<p>Beyond performance enhancements, DevitoPRO offers essential features for production-level seismic imaging. It supports compression-based asynchronous serialization and intelligent data-streaming techniques for efficient disk-host-GPU transfers, significantly improving data management and performance during reverse time migration (RTM) and full-waveform inversion (FWI). These capabilities make DevitoPRO an indispensable tool for high-performance and scalable seismic imaging solutions.</p>
<h4 id="reduced-memory-pressure-and-mixed-precision">Reduced memory pressure and mixed-precision</h4>

<p>In DevitoPRO, we successfully doubled the speed of elastic propagators by using reduced-precision storage for model parameters and wavefields while using mixed-precision for floating-point arithmetic. This approach benefits from reduced memory pressure and makes use of available hardware support for mixed-precision. Our comparisons with pure FP32 implementations have shown negligible numerical errors, validating the effectiveness of this method.</p>
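<p>The storage/compute split can be sketched in a few lines of NumPy: keep the field in FP16 to halve the bytes moved through the memory system, but promote to FP32 before any arithmetic. This is our simplified stand-in for the compiler-generated kernels; the toy 5-point stencil and array sizes are chosen purely for illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(seed=0)
wavefield = rng.standard_normal((64, 64), dtype=np.float32)

# Reduced-precision storage: half the bytes moved per field access
stored = wavefield.astype(np.float16)

def laplacian(f):
    """5-point Laplacian; inputs are promoted to FP32 before any arithmetic."""
    f32 = f.astype(np.float32)
    return (-4.0 * f32[1:-1, 1:-1] + f32[:-2, 1:-1] + f32[2:, 1:-1]
            + f32[1:-1, :-2] + f32[1:-1, 2:])

reference = laplacian(wavefield)   # pure FP32 path
mixed = laplacian(stored)          # FP16 storage, FP32 arithmetic
max_abs_err = float(np.abs(mixed - reference).max())  # small, but non-zero
```

<p>With a fixed seed the maximum absolute error is small but non-zero, mirroring the "negligible numerical errors" observed when comparing against a pure FP32 implementation.</p>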

<h4 id="example-elastic-vti-running-on-intel-sapphire-rapids">Example: Elastic VTI running on Intel Sapphire Rapids</h4>

<p>As an example, we benchmarked a discretized form of elastic VTI running on the SEAM 3D model (Fehler and Larner, 2008) using DevitoPRO on a single-socket Intel Sapphire Rapids. The velocity-stress equations are solved using a space order of 8 and a first-order leap-frog time discretization on an 800x400x400 grid. Using FP32, the best result was 2.45 GP/s (giga-points-per-second), while using mixed-precision we achieved 4.1 GP/s, a 1.7x speedup.</p>

<p>For comparison, the left-hand panels below show a shot record and a wavefield snapshot from the mixed-precision run; the right-hand panels show the absolute error between the mixed-precision and standard FP32 solutions, scaled up by a factor of 500 so that it is visible.</p>

<p><img src="/images/rec-crop.png" alt="Shot record comparison" /> 
<img src="/images/tauxx-crop.png" alt="Wavefield comparison" /></p>
<h4 id="conclusion">Conclusion</h4>

<p>A major barrier to the adoption of mixed-precision arithmetic in production codes is the dual complexity of designing robust algorithms and managing the resulting software. DevitoPRO overcomes this by insulating application developers from this complexity through the Devito domain-specific language (DSL) and compiler pipeline.</p>

<p>Another significant barrier to developing mixed-precision software is the lack of standardized reduced-precision floating-point formats in programming languages and standard libraries such as MPI. Although initiatives such as the <a href="https://www.opencompute.org/blog/amd-arm-intel-meta-microsoft-nvidia-and-qualcomm-standardize-next-generation-narrow-precision-data-formats-for-ai">Microscaling Formats (MX) Alliance</a> are actively developing standards to simplify the integration and adoption of these new data formats across the industry, it may take years for these standards to be widely available throughout the entire software ecosystem. In the meantime, DevitoPRO has implemented portability layers that enable users to leverage reduced precision on currently supported platforms.</p>

<p>Even in cases where the mathematical formulation of the problem needs adjustment to make it more amenable to reduced-precision floating point arithmetic, these changes are at a high level where it is relatively straightforward to test and evaluate different variations. This agility is one of the reasons why DevitoPRO will be instrumental in the industry transition to leveraging mixed-precision hardware platforms.</p>

<p>Early results are promising, showing a 70% improvement in performance. Ongoing efforts aim to develop more sophisticated compiler passes to fully utilize mixed precision on modern processors, potentially achieving even greater speedups in the future. By continuing to innovate and optimize, DevitoPRO is poised to make elastic modeling and inversion more accessible and efficient, benefiting a wide range of scientific and engineering applications.</p>]]></content><author><name>mlouboutin</name></author><category term="RTM/FWI" /><category term="DSL" /><category term="Python" /><category term="HPC" /><category term="RTM" /><category term="FWI" /><category term="Seismic" /><category term="Elastic" /><category term="Mixed-precision" /><summary type="html"><![CDATA[Devito Codes is developing advanced elastic wave solvers using DevitoPRO, offering improved subsurface modeling by leveraging both P-wave and S-wave data. Experiments show the use of mixed-precision methods in DevitoPRO has nearly doubled the performance of wave propagators, showing negligible numerical errors compared to standard 32-bit implementations. Despite challenges in developing robust mixed-precision software, DevitoPRO simplifies this process through its domain-specific language and compiler pipeline, making it an essential tool for seismic imaging. 
Early benchmarks show a 70% performance improvement, with ongoing efforts to further optimize mixed-precision use, potentially leading to even greater speedups.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.devitocodes.com/images/making-waves.png" /><media:content medium="image" url="https://www.devitocodes.com/images/making-waves.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Announcing JUDI’s Support for DevitoPRO</title><link href="https://www.devitocodes.com/judi" rel="alternate" type="text/html" title="Announcing JUDI’s Support for DevitoPRO" /><published>2024-08-13T00:01:00+00:00</published><updated>2024-08-13T00:01:00+00:00</updated><id>https://www.devitocodes.com/judi</id><content type="html" xml:base="https://www.devitocodes.com/judi"><![CDATA[<p>As part of our ongoing efforts to enable rapid innovation and performance portability across the industry, we are pleased to announce that as of <a href="https://slimgroup.github.io/JUDI.jl/dev/">JUDI (Julia Devito Inversion framework)</a> v3.4.5, <a href="https://www.devitocodes.com/features/">DevitoPRO</a> is supported as a bring-your-own-license (BYOL) feature. This brings production-grade performance portability and scalability to JUDI solvers for modeling, inversion, machine learning, and much more.</p>

<p><a href="https://slimgroup.github.io/JUDI.jl/dev/">JUDI (the Julia Devito Inversion framework)</a> is an open-source Julia-based package for large-scale seismic modeling and inversion designed by Georgia Tech’s Seismic Laboratory for Imaging and Modeling (<a href="https://slim.gatech.edu">SLIM</a>) to translate wave-based
algorithms into fast, scalable code suitable for industry-size 3D problems on clusters or in the cloud. Built on top of <a href="https://www.devitoproject.org/">Devito</a>, a Python domain-specific language for automated finite-difference computations, JUDI leverages Devito’s symbolic API to generate high-performance wave propagation kernels. This integration combines Devito’s computational power with Julia’s flexibility, enabling efficient simulations and the implementation of PDE-constrained optimizations like full-waveform inversion (FWI) and imaging (LS-RTM). JUDI’s modeling operators can also be integrated into neural networks for physics-augmented deep learning, as highlighted in SLIM’s Leading Edge article, <a href="https://library.seg.org/doi/full/10.1190/tle42070474.1">Learned
multiphysics inversion with differentiable programming and machine
learning</a>.</p>

<p>The integration of DevitoPRO into JUDI enables large-scale and real-world seismic inversion simulations by introducing key performance and memory management improvements. These include support for CUDA/HIP/SYCL for GPUs, domain-specific optimizations such as automatic source/receiver expanding box in the propagator for both forward and adjoint solves, and asynchronous wavefield serialization with lossy/lossless compression. Additionally, the fully supported DevitoPRO decoupler allows single-shot domain decomposition over multiple devices or NUMA domains while maintaining a serial Julia process. These features enable state-of-the-art performance and scalability, particularly for long-offset FWI, high-frequency RTM, and full-wavefield imaging, all of which have a large memory footprint and computational requirements. Users with a DevitoPRO license benefit from enhanced performance, improved scalability, and an easier transition from prototyping to production environments, facilitating the application of their projects to real-world scenarios.</p>

<h2 id="judi-use-cases-from-slim">JUDI use cases from SLIM</h2>

<p>JUDI has been successfully applied in various advanced research and cloud imaging projects, highlighting its versatility and power in seismic modeling and inversion. One significant use case is <a href="https://slim.gatech.edu/Publications/Public/Journals/Geophysics/2024/yin2023wise/paper.html">WISE: full-Waveform variational Inference via Subsurface Extensions</a>, where JUDI is used to combine variational inference and conditional normalizing flows for probabilistic full-waveform inference. This approach helps reduce the reliance on accurate initial migration-velocity models and enables reliable uncertainty quantification in velocity models, showcasing JUDI’s capability to generate high-quality seismic images that include uncertainty.</p>

<p>Another notable application is in <a href="https://slim.gatech.edu/Publications/Public/Conferences/SEG/2021/yin2021SEGcts/yin2021SEGcts.html">compressive time-lapse seismic monitoring of carbon storage and sequestration</a>. JUDI was employed in this project to improve the efficiency and accuracy of time-lapse seismic data acquisition using a joint recovery model. This method allows for high-quality monitoring of CO<sub>2</sub> plumes over extended periods, which is crucial for carbon capture and storage (CCS) projects.</p>

<p>In the realm of <strong>machine learning and inversion</strong>, JUDI was integral to the development of a <a href="https://slim.gatech.edu/Publications/Public/Conferences/SEG/2022/yin2022SEGlci/paper.html">learned coupled inversion framework</a> for carbon sequestration monitoring. This framework uses Fourier Neural Operators (FNOs) to estimate permeability from time-lapse seismic data, enabling the forecasting of CO<sub>2</sub> plume behavior with improved accuracy and computational efficiency.</p>

<p>JUDI also forms an essential component in the development of a Digital Twin for Geological Carbon Storage. In this approach, JUDI provides wave simulation and imaging capabilities that undergird training of the Digital Twin’s generative neural networks (see <a href="https://library.seg.org/doi/10.1190/tle42110730.1">President’s Page: <em>Digital twins in the era of generative AI</em></a>).</p>

<p>Furthermore, JUDI has played a role in the development of <strong>serverless imaging on the cloud</strong>, which involves deploying seismic imaging processes on cloud infrastructure without the need for traditional server management. This approach allows for scalable and cost-effective seismic data processing, making it accessible for large-scale industrial applications (see <a href="https://ieeexplore.ieee.org/document/9044390">An Event-Driven Approach to Serverless Seismic Imaging in the Cloud</a>).</p>

<p>These examples underline JUDI’s extensive applicability across various domains in seismic research, particularly in leveraging <a href="https://slim.gatech.edu/research">advanced computational methods</a> and cloud-based solutions.</p>

<h3 id="conclusion-and-outlook">Conclusion and Outlook</h3>

<p>The integration of DevitoPRO into JUDI represents a significant advancement in the field of seismic imaging and inversion, offering users enhanced performance, scalability, and new capabilities. As we look to the future, several promising developments are on the horizon. These include the continuous expansion of features, such as more sophisticated wave-equation solvers and advanced optimization techniques. Additionally, the integration of cutting-edge machine learning algorithms, generative AI, and physics-informed neural networks is expected to further improve the accuracy and computational efficiency of seismic imaging and inversion.</p>

<p>Continued collaboration within the user community will be essential to drive innovation and address emerging challenges. Open-source contributions, feedback, and joint research initiatives will play a crucial role in advancing JUDI and DevitoPRO, ensuring they remain at the forefront of seismic technology.</p>

<p>The ongoing support and development by Devito Codes, particularly through the contributions of <a href="https://www.devitocodes.com/about/#mathias-louboutin-senior-solution-architect">Dr Mathias Louboutin</a>, who has been instrumental in both JUDI’s initial development and its continued evolution, underline the commitment to delivering value to the open-source community. This collaboration bridges the gap between academic research and production-level applications, positioning JUDI and Devito Codes as leaders in providing comprehensive solutions to accelerate innovations in the field of seismic imaging and inversion.</p>

<p>We encourage users to explore these new functionalities and actively engage with the community to advance further this powerful toolset, continuing the journey of innovation and excellence in seismic research and applications.</p>]]></content><author><name>Gerard Gorman (CEO)</name></author><category term="RTM/FWI" /><category term="DSL" /><category term="Python" /><category term="HPC" /><category term="RTM" /><category term="FWI" /><category term="Seismic" /><category term="Julia" /><category term="ML" /><category term="AI" /><summary type="html"><![CDATA[JUDI's latest update, version 3.4.5, now supports DevitoPRO as a bring-your-own-license feature, significantly enhancing its seismic imaging and inversion capabilities. This integration introduces advanced performance and scalability, particularly for large-scale simulations, performance portability across all major CPUs and GPUs and a wide range of domain-specific optimizations. JUDI, an open-source Julia-based framework, already excels in high-performance wave propagation and machine learning integration. The update enables cutting-edge applications such as probabilistic full-waveform inversion, generative AI, carbon storage monitoring, and serverless cloud imaging. 
This collaboration marks a major step forward in bridging academic research and production-level seismic applications, driving innovation and excellence in the field.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.devitocodes.com/images/judi.png" /><media:content medium="image" url="https://www.devitocodes.com/images/judi.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Announcing Devito and DevitoPRO v4.8.7</title><link href="https://www.devitocodes.com/v4.8.7" rel="alternate" type="text/html" title="Announcing Devito and DevitoPRO v4.8.7" /><published>2024-06-11T00:01:00+00:00</published><updated>2024-06-11T00:01:00+00:00</updated><id>https://www.devitocodes.com/v4.8.7</id><content type="html" xml:base="https://www.devitocodes.com/v4.8.7"><![CDATA[<p>Devito and DevitoPRO v4.8.7 brings a host of new features, enhancements, and
optimizations to our high-performance computing (HPC) tools.</p>

<h3 id="about-open-source-devito">About open-source Devito</h3>

<p><a href="https://www.devitoproject.org/">Devito</a> is an invaluable tool for solving
partial differential equations (PDEs) on structured grids, generating optimized
C code from a high-level symbolic specification, facilitating HPC applications.</p>

<h4 id="key-features">Key Features:</h4>

<ul>
  <li>High-productivity symbolic PDE solver specification with symbolic computation capabilities.</li>
  <li>Automatic code generation, producing finite-difference (FD) kernels, targeting a diverse range of architectures.</li>
  <li>Comprehensive, multi-level performance optimization.</li>
</ul>
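<p>To make the first two bullets concrete, the sketch below hand-codes the kind of explicit finite-difference update whose generation Devito automates; in Devito the same scheme is written symbolically (roughly <code>Eq(u.forward, solve(u.dt - nu*u.dx2, u.forward))</code>) and the optimized loop nest is produced for you. The 1-D heat equation and all names here are our illustrative choices, not taken from the release notes.</p>

```python
import numpy as np

# Hand-written explicit kernel for u_t = nu * u_xx -- the kind of
# low-level update loop that Devito's generated C code implements
# (with vectorization, blocking, and parallelism added automatically).
def heat_step(u, nu, dt, dx):
    unew = u.copy()
    unew[1:-1] += nu * dt / dx**2 * (u[2:] - 2.0 * u[1:-1] + u[:-2])
    return unew

x = np.linspace(0.0, 1.0, 51)
u = np.sin(np.pi * x)            # initial condition, zero at both ends
for _ in range(100):             # dt respects the stability bound dt <= dx**2/(2*nu)
    u = heat_step(u, nu=1.0, dt=1e-4, dx=x[1] - x[0])
```

<p>After 100 steps the sine profile has decayed but kept its shape, matching the analytic behaviour of the heat equation.</p>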

<h3 id="about-devitopro">About DevitoPRO</h3>

<p><a href="https://www.devitocodes.com/">DevitoPRO</a> builds upon Devito open-source’s
foundation, adding advanced features and optimizations specifically for
high-productivity HPC. This commercial extension targets seismic imaging and
inversion workloads, supporting multiple hardware architectures, including AMD,
Intel and Nvidia GPUs, to deliver enhanced performance and reduced development
time.</p>

<h4 id="key-features-1">Key Features:</h4>

<ul>
  <li>Additional DSL abstractions for developing RTM/FWI solutions.</li>
  <li>Enhanced compiler and toolchain for enterprise needs.</li>
  <li>Advanced GPU acceleration using vendor-native languages (CUDA, HIP, SYCL).</li>
</ul>

<p>Devito and DevitoPRO are released in lockstep. The new features and improvements
in version 4.8.7, detailed below, collectively enhance the performance,
usability, and scalability of both tools, making them indispensable for
high-performance computing applications in fields such as seismic imaging and
inversion.</p>

<h2 id="new-features-in-devito-v487">New Features in Devito v4.8.7</h2>

<h3 id="api-enhancements">API Enhancements</h3>

<ul>
  <li><strong>Time Derivatives Expansion</strong>: Expand time derivatives in generated code for better clarity and consistency in computations.</li>
  <li><strong>Improved Derivatives API</strong>: Cleaner abstractions that allow interpolants and interpolated derivatives to be specified compactly.</li>
  <li><strong>SparseFunction Improvements</strong>: Revamp sparse subfunction setup and sparse dimension handling to optimize grid operations.</li>
  <li><strong>Improve examples</strong>: Clarify sparse function setup and interpolation examples.</li>
</ul>

<h3 id="compiler-improvements">Compiler Improvements</h3>

<ul>
  <li><strong>Device-Aware Blocking</strong>: Implement device-aware blocking and refine related tests to improve performance on multiple architectures.</li>
  <li><strong>Elementary Functions Optimization</strong>: Make code generation of elementary functions dtype-aware for improved type handling.</li>
  <li><strong>ConditionalDimension Placement</strong>: Fix placement of <code class="language-plaintext highlighter-rouge">ConditionalDimension</code> within loop nests when used in combination with subdomains.</li>
  <li><strong>Memory Management</strong>: Restructure <code class="language-plaintext highlighter-rouge">MemoryAllocator</code> hierarchy to streamline memory allocation and deallocation processes.</li>
</ul>

<h3 id="mpi-and-parallel-computing">MPI and Parallel Computing</h3>

<ul>
  <li><strong>MPI Initialization and Finalization</strong>: Refine optional MPI initialization and finalization to better manage parallel execution environments.</li>
  <li><strong>Threading Support</strong>: Initialize MPI with threading support to leverage multi-threading capabilities.</li>
  <li><strong>C-Level MPI_Allreduce</strong>: Add support for C-level <code class="language-plaintext highlighter-rouge">MPI_Allreduce</code> to facilitate efficient distributed reductions.</li>
  <li><strong>Data Gathering Fixes</strong>: Fix data gathering for sparse functions to ensure accurate data collection in parallel computations.</li>
  <li><strong>Halo Touch Sequentialization</strong>: Sequentialize halo touch operations to prevent race conditions in parallel environments.</li>
  <li><strong>MPI0 Logging Level</strong>: Add <code class="language-plaintext highlighter-rouge">MPI0</code> logging level to restrict performance logging to rank 0, reducing overhead in multi-rank setups.</li>
</ul>

<h3 id="tutorials-and-examples">Tutorials and Examples</h3>

<ul>
  <li><strong>ADER-FD</strong>: Add notebook demonstrating implementation of ADER-FD schemes.</li>
  <li><strong>Tutorials for new API features</strong>: Add and update examples for enhanced FD API and sinc interpolation/injection.</li>
</ul>

<h3 id="testing-and-continuous-integration-ci">Testing and Continuous Integration (CI)</h3>

<ul>
  <li><strong>Parallel Marker Revamp</strong>: Revamp parallel markers in tests to improve test coverage and reliability in parallel execution scenarios.</li>
</ul>

<h3 id="docker-and-build-enhancements">Docker and Build Enhancements</h3>

<ul>
  <li><strong>Docker Build Fixes</strong>: Fix Nvidia Docker builds and Intel OneAPI setups to streamline containerized deployments and compatibility.</li>
  <li><strong>ARM Architecture Support</strong>: Build ARM base images to support ARM architecture, extending the range of supported hardware platforms.</li>
</ul>

<h3 id="miscellaneous-enhancements">Miscellaneous Enhancements</h3>

<ul>
  <li><strong>Logging and Error Handling Improvements</strong>: Improve logging and error handling for MPI and compiler sniff operations to provide clearer diagnostics and error messages.</li>
  <li><strong>Code Cleanup and Refactoring</strong>: Perform extensive code cleanup and refactoring, including polishing built-ins, updating cached properties, and removing unnecessary code to maintain a clean and efficient codebase.</li>
  <li><strong>Dependency Updates</strong>: Drop support for Python 3.7 and update requirements for dependencies like <code class="language-plaintext highlighter-rouge">mpi4py</code> to ensure the project stays up-to-date with the latest software versions and standards.</li>
</ul>

<h2 id="new-features-in-devitopro-v487">New Features in DevitoPRO v4.8.7</h2>

<h3 id="performance-optimizations">Performance Optimizations</h3>

<ul>
  <li><strong>Advanced Compiler Optimizations</strong>: Implement various compiler optimizations to reduce runtime and increase computational efficiency.</li>
  <li><strong>Multi-GPU Scaling</strong>: Enhance scaling across multiple GPUs, allowing better utilization of hardware resources and significantly improving performance in large-scale simulations.</li>
  <li><strong>Device-Aware Blocking</strong>: Add device-aware blocking to optimize memory and compute resource allocation for heterogeneous computing environments, including CPU and GPU.</li>
</ul>

<h3 id="api-enhancements-1">API Enhancements</h3>

<ul>
  <li><strong>Extended Symbolic Functionality</strong>: Introduce new symbolic operations and functions to support more sophisticated mathematical models and simulations.</li>
  <li><strong>Boundary Conditions Support</strong>: Improve support for specifying boundary conditions and provide greater flexibility in defining simulation domains.</li>
</ul>

<h3 id="parallel-computing-and-mpi">Parallel Computing and MPI</h3>

<ul>
  <li><strong>Improved MPI Parallelism</strong>: Enhance parallel computing capabilities using MPI, ensuring better scalability on large clusters and more efficient parallel execution.</li>
  <li><strong>C-Level MPI_Allreduce</strong>: Add support for C-level <code class="language-plaintext highlighter-rouge">MPI_Allreduce</code> operations to facilitate efficient distributed reductions, improving the performance of parallel algorithms.</li>
</ul>

<h3 id="debugging-and-profiling-tools">Debugging and Profiling Tools</h3>

<ul>
  <li><strong>New Debugging Tools</strong>: Introduce new debugging tools and logging capabilities, making it easier to troubleshoot and optimize simulations.</li>
  <li><strong>Performance Profiling</strong>: Integrate performance profiling tools to help developers identify and address bottlenecks in their code, ensuring optimal performance.</li>
</ul>

<h3 id="compiler-improvements-1">Compiler Improvements</h3>

<ul>
  <li><strong>Memory Management</strong>: Restructure the <code class="language-plaintext highlighter-rouge">MemoryAllocator</code> hierarchy to streamline memory management processes, reducing overhead and improving performance.</li>
</ul>

<h3 id="miscellaneous-enhancements-1">Miscellaneous Enhancements</h3>

<ul>
  <li><strong>Preconditioning Techniques</strong>: Add new preconditioning techniques to improve the convergence rates of iterative solvers, making simulations more efficient and accurate.</li>
  <li><strong>Code Cleanup and Refactoring</strong>: Perform extensive code cleanup and refactoring, including polishing built-ins, updating cached properties, and removing unnecessary code, ensuring a clean and maintainable codebase.</li>
  <li><strong>Dependency Updates</strong>: Update software requirements and dependencies to ensure compatibility with the latest versions and standards, including dropping support for outdated Python versions and updating key libraries.</li>
</ul>]]></content><author><name>Gerard Gorman (CEO)</name></author><category term="RTM/FWI" /><category term="DSL" /><category term="Python" /><category term="HPC" /><category term="RTM" /><category term="FWI" /><category term="Seismic" /><summary type="html"><![CDATA[Devito and DevitoPRO v4.8.7 introduce numerous enhancements and optimizations for performance portable RTM/FWI. Devito, an open-source tool for solving partial differential equations, features improved symbolic computation, code generation, and performance optimizations. DevitoPRO, the commercial extension, offers advanced features for seismic imaging, supporting multiple hardware architectures, and enhanced performance. New version updates include API enhancements, compiler improvements, parallel computing refinements, better memory management, and updated tutorials. These updates collectively improve usability, scalability, and performance, making them essential for HPC applications in fields like seismic imaging and inversion.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.devitocodes.com/images/release-4.8.7.png" /><media:content medium="image" url="https://www.devitocodes.com/images/release-4.8.7.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">S-Cube integrates DevitoPRO for performance-portable innovation</title><link href="https://www.devitocodes.com/s-cube" rel="alternate" type="text/html" title="S-Cube integrates DevitoPRO for performance-portable innovation" /><published>2024-05-31T00:01:00+00:00</published><updated>2024-05-31T00:01:00+00:00</updated><id>https://www.devitocodes.com/s-cube</id><content type="html" xml:base="https://www.devitocodes.com/s-cube"><![CDATA[<p>At Devito Codes, we take pride in fostering collaborations that push the
boundaries of technological innovation. Our ongoing work with
<a href="https://www.s-cube.com/">S-Cube</a> exemplifies how synergy between companies can
drive advancements in computational science.  Through this collaboration, we
have harnessed our strengths in developing high-productivity, accurate, and
performance-portable wave propagators using DevitoPRO, while S-Cube has
developed several cutting-edge algorithms for seismic imaging, such as <a href="https://www.s-cube.com/">XWI
(X-Wave Full Waveform Inversion)</a>. XWI boasts superior
predictive power, enabling more accurate subsurface models in complex geological
settings.  Implementing these innovative algorithms on top of the DevitoPRO
framework ensures they are performance-portable across all major CPUs (ARM64 and
x86_64) and GPUs (AMD/HIP, Intel/SYCL, Nvidia/CUDA), providing complete
flexibility and efficiency in diverse computational environments.</p>

<h4 id="enhancing-wave-propagation-technology">Enhancing Wave Propagation Technology</h4>

<p>Devito Codes has consistently delivered state-of-the-art solutions for wave
propagation problems, emphasizing productivity and accuracy. Our operators are
designed to be performance-portable, ensuring they can run efficiently on all
major CPU and GPU platforms. This flexibility is crucial in the dynamic
landscape of computational geophysics, where adaptability to different
computational environments is vital for optimizing time-to-solution and
price-to-solution, and for overcoming hardware supply constraints.</p>

<p>In collaboration with S-Cube and AWS, we benchmarked DevitoPRO operators on a
wide range of AWS SKUs so price performance could be accurately estimated before
running any seismic imaging processes. This enabled S-Cube to optimize the
efficiency of seismic workloads on AWS.</p>

<p>DevitoPRO is also enabling S-Cube to quickly develop the next generation of
elastic wave solvers to extend their existing state-of-the-art seismic imaging
algorithms. Rapid innovation in these areas is essential to improving the
accuracy and efficiency of characterizing complex geological formations. We
achieve performance-portability and accelerated innovation in seismic data
processing by combining DevitoPRO code generation capabilities with S-Cube’s
innovative inversion algorithms.</p>

<h4 id="the-power-of-collaboration">The Power of Collaboration</h4>

<p>Working together, Devito Codes and S-Cube have demonstrated that collaboration
can lead to breakthroughs that would be challenging to achieve independently.
Our joint efforts have resulted in a suite of tools that have enhanced S-Cube’s
capabilities and brought DevitoPRO to a wider range of companies.</p>

<ul>
  <li><strong>Innovation</strong>: Integrating DevitoPRO into S-Cube’s seismic imaging algorithms has led to accelerated innovation in the development of new methodologies that improve the accuracy of subsurface imaging.</li>
  <li><strong>Performance</strong>: By leveraging our performance-portable operators, S-Cube’s algorithms can run on a variety of hardware platforms, making advanced seismic imaging techniques more accessible.</li>
</ul>

<h4 id="achievements-through-partnership">Achievements Through Partnership</h4>

<p>The collaboration between Devito Codes and S-Cube is a testament to what can be
achieved when companies work together towards a common goal. It underscores the
importance of combining expertise from different domains to tackle complex
challenges. This partnership has not only enhanced our technological offerings
but also provided valuable insights that will guide future developments.</p>

<p>At Devito Codes, we are committed to continuing our work with a diverse range of
service companies, cloud vendors, and hardware providers. Our goal is to ensure
that our solutions remain at the forefront of innovation, providing our clients
with the tools they need to succeed in an ever-evolving industry.</p>

<p>By fostering such collaborations, we aim to contribute to the advancement of
computational science and its applications in geophysics and beyond. The success
of our partnership with S-Cube is a clear indication that, working together, we
can achieve remarkable results and drive the field forward.</p>]]></content><author><name>Gerard Gorman (CEO)</name></author><category term="RTM/FWI" /><category term="DSL" /><category term="Python" /><category term="HPC" /><category term="Cloud" /><category term="AWS" /><category term="RTM" /><category term="FWI" /><category term="Seismic" /><summary type="html"><![CDATA[At Devito Codes, we drive innovation through collaboration, exemplified by our partnership with S-Cube. Combining our expertise in performance-portable wave propagators with S-Cube's advanced seismic imaging algorithms like XWI, we enhance subsurface imaging accuracy and efficiency. Our joint efforts ensure our solutions run seamlessly across major CPUs and GPUs, optimizing performance and cost-efficiency. This collaboration has accelerated the development of next-gen elastic wave solvers, highlighting the power of teamwork in advancing computational geophysics and beyond.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.devitocodes.com/images/collaborate.png" /><media:content medium="image" url="https://www.devitocodes.com/images/collaborate.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Announcing DevitoPRO SYCL Code Generation</title><link href="https://www.devitocodes.com/sycl" rel="alternate" type="text/html" title="Announcing DevitoPRO SYCL Code Generation" /><published>2024-05-13T00:01:00+00:00</published><updated>2024-05-13T00:01:00+00:00</updated><id>https://www.devitocodes.com/sycl</id><content type="html" xml:base="https://www.devitocodes.com/sycl"><![CDATA[<p>This week at ISC 2024 in Hamburg, we are thrilled to introduce SYCL code
generation support for DevitoPRO, specifically optimized for Intel’s Data Center
GPU Max 1100 and 1550. This advancement, developed in collaboration with Intel,
extends our OpenMP offloading support in open-source Devito and provides a
robust SYCL capability essential for delivering high-performance for seismic
imaging workloads on Intel GPUs.</p>

<p>SYCL is a versatile C++-based parallel programming framework that facilitates
code portability across diverse computing architectures including CPUs, GPUs,
and FPGAs from various vendors. The integration of SYCL into DevitoPRO means
users can now deploy their existing Devito applications on Intel GPUs
effortlessly: specifying a different target architecture is all that is needed
for just-in-time compilation to reap the performance benefits of SYCL.</p>

<p>This update empowers DevitoPRO users with true performance portability across
all major CPU and GPU vendors.</p>

<h4 id="overview-of-devito-and-devitopro">Overview of Devito and DevitoPRO</h4>

<p>Devito and DevitoPRO provide high-level abstractions that shield developers from
the complexities of porting and optimizing code across different GPU platforms.
For HPC specialists, Devito also offers the capability to tweak the generated
code, offering further customization. This strategy significantly cuts
development time and avoids vendor lock-in, granting users genuine flexibility
in their hardware choices.</p>

<p><strong>Devito</strong>: A robust, open-source Python-based DSL and compiler, Devito
capitalizes on high-level symbolic definitions to produce optimized
finite-difference computational kernels across multiple CPU and GPU platforms.
Developed initially at Imperial College London in collaboration with the SLIM
group at GaTech, Devito supports the MPI, OpenMP, and OpenACC parallel
programming models, providing a high-productivity solution for both academic and commercial
applications.</p>

<p><strong>DevitoPRO</strong>: Serving primarily the energy sector, DevitoPRO is an enhanced
commercial version of Devito designed for maximizing performance portability in
seismic imaging. With the new addition of SYCL code generation for Intel GPUs,
DevitoPRO now offers greater adaptability across GPU platforms from all leading
vendors.</p>

<h4 id="new-features-in-devitopro">New Features in DevitoPRO</h4>

<p>We continuously refine our code generation through iterative benchmarking
against manually optimized codes. This process ensures DevitoPRO not only
matches but frequently surpasses the performance of hand-tuned implementations
across various GPUs. The new SYCL integration allows for seamless switching
between target backends, ensuring application consistency and performance across
different architectures. Thanks to Intel’s support, we’ve also incorporated
an Intel GPU Max 1100 into our development cluster to boost our testing and
optimization capabilities.</p>

<h4 id="open-source-contributions-and-openmp-support">Open Source Contributions and OpenMP Support</h4>

<p>The open-source iteration of Devito includes OpenMP support for Intel GPUs,
broadening its usability across various research and development applications.
Our vibrant community on Slack and GitHub is instrumental in continually
enhancing Devito, ensuring it stays at the cutting edge of computational science
for simulations, inversions, and optimizations based on finite differences.</p>

<h4 id="conclusion">Conclusion</h4>

<p>The introduction of SYCL code generation in DevitoPRO marks a crucial
advancement in our mission to deliver high-performance, high-productivity
computing solutions across major CPU and GPU platforms. We value and encourage user
feedback to further refine and evolve our technologies.</p>

<p>Please <a href="mailto:gerard@DevitoCodes.com">contact us</a> for trial licenses or
benchmarks of DevitoPRO with the new SYCL capabilities. For more details or to
start utilizing the new features of DevitoPRO, visit our website and reach out
to our team through the contact links provided.</p>]]></content><author><name>Gerard Gorman (CEO)</name></author><category term="Intel/SYCL" /><category term="DSL" /><category term="Python" /><category term="HPC" /><category term="Cloud" /><category term="AMD" /><category term="ARM" /><category term="Intel" /><category term="Nvidia" /><summary type="html"><![CDATA[Devito Codes introduces SYCL code generation in DevitoPRO optimized for Intel's GPU Max Series 1100 and 1550, enhancing performance in high-compute tasks like seismic imaging. This update, developed in collaboration with Intel, enables seamless use of existing Devito application code across various architectures without modifications. This enhancement solidifies DevitoPRO's commitment to performance portability and high productivity across all major HPC processor architectures.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.devitocodes.com/images/intel_data_center_gpu_max_series.png" /><media:content medium="image" url="https://www.devitocodes.com/images/intel_data_center_gpu_max_series.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Devito Codes on the road to IMAGE 2023</title><link href="https://www.devitocodes.com/image2023" rel="alternate" type="text/html" title="Devito Codes on the road to IMAGE 2023" /><published>2023-08-21T00:01:00+00:00</published><updated>2023-08-21T00:01:00+00:00</updated><id>https://www.devitocodes.com/image2023</id><content type="html" xml:base="https://www.devitocodes.com/image2023"><![CDATA[<h1 id="devitopro-at-image-2023">DevitoPRO at IMAGE 2023</h1>

<p>Devito Codes, an Independent Software Vendor (ISV), takes great pride in specializing in Devito, an open-source Domain Specific Language (DSL) crafted for optimizing finite-difference operators used in RTM and FWI. Our enterprise solution, DevitoPRO, is a cutting-edge platform that facilitates simulation, inversion, and optimization by automatically generating tuned parallel software for various architectures, including AMD, ARM, Intel, and Nvidia. Primarily targeting the field of exploration geophysics, DevitoPRO seamlessly blends ease of use with high performance, drastically reducing development time from months to days, and ensuring performance portability across various computer systems.</p>

<h2 id="engaging-at-image-2023">Engaging at IMAGE 2023</h2>

<p>We’re eagerly looking forward to engaging with the community at IMAGE 2023 this year. The past year has been filled with excitement, innovation, and substantial growth in both the open-source Devito platform and our enterprise product, DevitoPRO.</p>

<p>Our collaboration with processor manufacturers like Nvidia, AMD, and Intel has been instrumental in optimizing Devito and DevitoPRO on all major CPUs and GPUs. We also actively work with Cloud providers such as AWS and Azure to benchmark and optimize containerized deployments with Devito. Our approach has been collaborative and forward-thinking, ensuring we stay at the forefront of technological advancement.</p>

<h2 id="year-highlights-since-image2022">Year Highlights Since IMAGE 2022</h2>

<ul>
  <li><strong>Increased Adoption</strong>: Over the past year, DevitoPRO has seen a surge in adoption by Energy producers and service companies, both for R&amp;D and production runs.</li>
  <li><strong>Benchmarking Efforts</strong>: Working closely with the geophysics community, processor manufacturers, and cloud companies to develop a standardized cross-platform benchmarking suite.</li>
  <li><strong>Growing the Team</strong>: In July, we welcomed long-time collaborator Dr. Mathias Louboutin to our team as Senior Solution Architect, a fantastic addition to support the community in developing solutions on top of DevitoPRO and driving new features.</li>
</ul>

<h3 id="key-features-and-updates">Key Features and Updates</h3>

<p>The Devito compiler has seen an impressive array of enhancements, fixes, and new features:</p>

<h4 id="1-compiler-enhancements-and-fixes"><strong>1. Compiler Enhancements and Fixes</strong>:</h4>
<p>Many improvements cater to various compiler capabilities, including compatibility with different processors, optimization enhancements, and more.</p>

<h4 id="2-parallelism-and-synchronization"><strong>2. Parallelism and Synchronization</strong>:</h4>
<p>Focus on augmenting parallelization, blocking, and synchronization logic to boost efficiency.</p>

<h4 id="3-gpu-support"><strong>3. GPU Support</strong>:</h4>
<p>Numerous enhancements for AMD, Intel and Nvidia GPUs.</p>

<h4 id="4-buffering-and-memory-management"><strong>4. Buffering and Memory Management</strong>:</h4>
<p>Revisions to buffering logic and memory handling functionalities.</p>

<h4 id="5-code-generation-and-linearization"><strong>5. Code Generation and Linearization</strong>:</h4>
<p>Refinements to enhance code manipulation and efficiency.</p>

<h4 id="6-cse-and-optimization"><strong>6. CSE and Optimization</strong>:</h4>
<p>Robust improvements in Common Subexpression Elimination and other optimization strategies.</p>

<h4 id="7-testing-and-documentation"><strong>7. Testing and Documentation</strong>:</h4>
<p>Comprehensive updates to ensure code quality and thorough documentation.</p>

<h4 id="8-mpi-openmp-and-hpc-related-enhancements"><strong>8. MPI, OpenMP, and HPC-related Enhancements</strong>:</h4>
<p>Enhancing support for parallel processing and high-performance computing.</p>

<h4 id="9-miscellaneous-improvements-and-fixes"><strong>9. Miscellaneous Improvements and Fixes</strong>:</h4>
<p>General enhancements to improve functionality and user experience.</p>

<h4 id="10-docker-updates"><strong>10. Docker Updates</strong>:</h4>
<p>Optimizations to the Docker environment for a seamless development experience on different platforms.</p>

<h4 id="11-architecture-and-other-enhancements"><strong>11. Architecture and Other Enhancements</strong>:</h4>
<p>Broad support for various processors and technologies, along with general improvements in data handling, benchmarking, profiling, and more.</p>

<p>The above changes represent a considerable evolution of the Devito compiler, encapsulating efficiency, compatibility, GPU support, parallelism, memory management, testing, and robustness.</p>

<h3 id="recent-updates-to-devitopro">Recent Updates to DevitoPRO</h3>

<p>A brief rundown of the substantial advancements in DevitoPRO includes:</p>

<ul>
  <li><strong>Docker Integration, Submodule Handling, Allocator and Abox Fixes</strong>: These include a wide range of integrations, fixes, and improvements.</li>
  <li><strong>Parametric Blocking, MPI Integration, Tuning, and Optimization</strong>: Key advancements in various technical aspects.</li>
  <li><strong>Compression, Serialization, Testing, and Benchmarking Enhancements</strong>: Significant strides in data handling and performance metrics.</li>
  <li><strong>New Demos and Examples, Docker and CI/CD Configuration Enhancements, Miscellaneous Adjustments</strong>: Introduction of new features, improvements, and extensive development in testing, demos, and more.</li>
</ul>

<h2 id="about-devito-codes">About Devito Codes</h2>

<p>Devito Codes is dedicated to pioneering solutions in the realm of computational
imaging, in particular in the field of exploration geophysics and ultrasound
imaging. DevitoPRO is designed to optimize and automate the generation of
finite-difference kernels. With a focus on high performance, high-level
abstractions and user-friendliness, we are committed to reducing development
time while enhancing cross-platform compatibility and performance portability.
By working closely with the community and industry leaders, we are continually
shaping the future of computational geophysics. We invite you to connect with
us at IMAGE 2023 and explore the exciting possibilities that Devito Codes has
to offer!</p>]]></content><author><name>Gerard Gorman (CEO)</name></author><category term="HPC" /><category term="DSL" /><category term="Python" /><category term="HPC" /><category term="Cloud" /><category term="AMD" /><category term="ARM" /><category term="Intel" /><category term="Nvidia" /><summary type="html"><![CDATA[Devito Codes is showcasing DevitoPRO at IMAGE 2023, a cutting-edge platform for optimizing finite-difference operators in RTM and FWI. It enables simulation and optimization across various architectures, reducing development time. The past year saw increased adoption, collaboration with tech giants, team growth, substantial enhancements to the compiler, and advancements in DevitoPRO, confirming their commitment to innovation in computational geophysics.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.devitocodes.com/images/hand_of_bits.png" /><media:content medium="image" url="https://www.devitocodes.com/images/hand_of_bits.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Cross-platform seismic imaging benchmarking</title><link href="https://www.devitocodes.com/benchmarking" rel="alternate" type="text/html" title="Cross-platform seismic imaging benchmarking" /><published>2023-02-26T00:00:00+00:00</published><updated>2023-02-26T00:00:00+00:00</updated><id>https://www.devitocodes.com/benchmarking</id><content type="html" xml:base="https://www.devitocodes.com/benchmarking"><![CDATA[<p>We have launched a groundbreaking framework to benchmark seismic imaging
workloads across different platforms. This initiative brings together key
stakeholders from Devito, hardware vendors, cloud providers, and the broader
industry, setting a stage for standardization, reproducibility, and elevated
performance in seismic imaging workloads.</p>

<p>Key Highlights:</p>

<ol>
  <li><strong>Standardization and Reproducibility</strong>:
    <ul>
      <li>Advocates for standardized comparisons and robust performance data, aiding organizations in insightful hardware or cloud system acquisitions.</li>
    </ul>
  </li>
  <li><strong>Efficient Resource Utilization</strong>:
    <ul>
      <li>Promotes code/data reuse and minimizes redundant efforts, leading to efficient resource and human capital utilization.</li>
    </ul>
  </li>
  <li><strong>Extendable and Automated Workflow</strong>:
    <ul>
      <li>The flexible architecture allows for extensibility and employs automation for a streamlined benchmarking process, catering to evolving needs.</li>
    </ul>
  </li>
  <li><strong>Community Engagement</strong>:
    <ul>
      <li>The initiative welcomes community engagement and dialogue, laying the groundwork for future collaborations, workshops, and benchmarking expansions.</li>
    </ul>
  </li>
  <li><strong>Transparency and Validation</strong>:
    <ul>
      <li>Even in its alpha phase, the emphasis on transparency and validation of benchmark data ensures responsible use of preliminary data. However, because of the commercial sensitivity of the data, the benchmarking data is only available under NDA.</li>
    </ul>
  </li>
</ol>

<p>This framework signifies a substantial stride towards nurturing a collaborative
ecosystem aimed at advancing standardization and optimization of seismic imaging
workloads across diverse computing architectures. The collaborative ethos
facilitated by this platform is geared towards driving notable advancements in
seismic imaging performance, contributing to the overarching goal of efficient
resource utilization and heightened computational capabilities.</p>

<h3 id="overview">Overview</h3>

<p>The framework is build on <a href="https://github.com/">GitHub</a>-based having being
heavily influenced by our existing CI/CD framework. The platform includes a
development cluster of servers with various computer architectures, configured
as <a href="https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners">GitHub self-hosted
runners</a>,
and utilizes <a href="https://docs.github.com/en/actions">GitHub Actions</a> to automate
workflows.</p>

<p>One of the advantages of this approach is that it naturally supports automation,
standardized and reproducible comparisons of different methods, hardware, and
skills. It also enables collaboration and code/data reuse, ultimately leading to
better performance of the software and efficient use of human capital. The
platform is also easily extendable: new benchmarks can be added to the GitHub
Actions workflow, and more servers (either on-prem or cloud-based) can be
brought in by configuring additional self-hosted runners.</p>

<h3 id="benchmarking-as-a-platform">Benchmarking as a platform</h3>

<p>The key idea behind the seismic imaging benchmarking platform is to bring
together stakeholders in the industry, such as energy companies, service
companies, processor manufacturers, and academic researchers, to standardize
benchmarking of seismic imaging kernels.</p>

<p>The objective is to enable accurate and reproducible benchmark experiments,
facilitate collaboration and code/data reuse, reduce the duplication of effort
and improve the overall performance of seismic imaging software. Additionally,
robust performance data will help organizations make informed purchasing
decisions for on-premise or cloud computing systems.</p>

<p>Overall, the proposed platform aims to address the common issues in benchmarking
seismic imaging kernels, such as differences in the PDEs, discretization,
algorithmic optimizations, and runtime choices, and provide a more standardized
and reproducible approach for comparing different methods, hardware, and skills.</p>

<h4 id="anatomy-of-a-standard-benchmark">Anatomy of a standard benchmark</h4>

<div class="mermaid">
  graph TB
    subgraph Standard: Benchmark setup/input
    A(Problem specification: PDEs, BCs, grid size/shape, ...) 
    end
    subgraph Concrete implementation 
    A--&gt;B1(OSS Devito)
    A--&gt;B2(DevitoPRO)
    A--&gt;B3(Hardware vendor implementation)
    A--&gt;B4(Other ISV, research, proprietary implementations)
    end
    subgraph Execution environment
    B2--&gt;C1(Singularity)
    B2--&gt;C2(Docker)
    B2--&gt;C3(Conda)
    end
    subgraph Target platform
    C1--&gt;D1(On-Prem dev-cluster)
    C1--&gt;D2(Vendor/slurm dev-cluster)
    C1--&gt;D3(Public Cloud)
    D1--&gt;E1(Vendor CPUs)
    D1--&gt;E2(Vendor GPUs)
    end
    subgraph Standard: benchmark output
    F(JSON: performance metrics, solution norms, status, implementation specific metadata)
    E1--&gt;F
    E2--&gt;F
    end;
</div>
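The standardized JSON output at the bottom of the diagram can be sketched as a plain record; the field names below are illustrative rather than the platform's actual schema, with the metric fields left unset since they would be filled in from a real run:

```python
import json
import platform
import time

# Hypothetical benchmark record mirroring the standardized output described
# above: performance metrics, solution norms, status, and metadata.
record = {
    "benchmark": "iso",
    "status": "completed",
    "metrics": {"gflops_s": None, "gpoints_s": None},  # from the operator's performance summary
    "solution_norm": None,                             # used to validate across implementations
    "metadata": {
        "implementation": "DevitoPRO",
        "host": platform.node(),
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    },
}
payload = json.dumps(record, indent=2)
print(payload)
```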

<h3 id="the-software-infrastructure">The software infrastructure</h3>

<p>We have created a GitHub-based extensible framework for benchmarking seismic
imaging kernels. The Seismic Benchmark Platform e-infrastructure comprises
GitHub Actions for automating workflows and a development cluster of servers
with various computer architectures, each configured as a GitHub self-hosted
runner.</p>

<p>GitHub Actions is a feature that allows users to automate software development
workflows. It allows users to create custom workflows, called actions, triggered
by specific events such as a code push, pull request, or the creation of an
issue. These workflows include building and testing code, deploying software,
and integrating with other tools. With GitHub Actions, users can automate
repetitive tasks, reduce manual errors, and improve the overall efficiency of
their development process. In our case, GitHub Actions are used to</p>

<ul>
  <li>Execute one or more benchmarks.</li>
  <li>Upload benchmark data to a results repository.</li>
  <li>Post-process benchmark data.</li>
</ul>

<h4 id="workflow-of-benchmark-automation-with-github-actions">Workflow of benchmark automation with GitHub Actions</h4>

<div class="mermaid">
  graph TB
    subgraph GitHub Action: manual event trigger
      A(Benchmark matrix of jobs: benchmarks x architectures)
      B(GitHub actions schedules individual jobs to self-hosted runners)
      A--&gt;B
    end
    subgraph Foreach benchmark job
      C(Job allocated to self-hosted runner)
      D(Setup execution environment)
      E(Run benchmark)
      F(Push benchmark output to data repo)
      B--&gt;C
      C--&gt;D
      D--&gt;E
      E--&gt;F
    end
    subgraph GitHub Action: triggered by data push
      G(Process data)
      H(Publish results to gh-pages)
      F--&gt;G
      G--&gt;H
    end;
</div>

<p>GitHub Actions can run on either GitHub-hosted runners or self-hosted runners.
Self-hosted runners are used to execute a workflow on machines the users have
direct access to, rather than on GitHub-managed infrastructure. Self-hosted
runners allow users more control over the environment in which their workflows
run, including access to specific software, libraries, or hardware resources.
Users can also use self-hosted runners to run workflows on-premises, in a
virtual private cloud, or in a hybrid environment. Self-hosted runners are a
flexible solution for organizations that have specific requirements for their
development environments and need more control over their workflow execution.</p>

<p>For the work described here, we have configured the following self-hosted runners:</p>

<ul>
  <li>NVIDIA A100-PCIE-40GB (on-prem)</li>
  <li>NVIDIA Tesla PG503-216 (on-prem)</li>
  <li>AMD Instinct™ MI210 (on-prem)</li>
  <li>Intel(R) Xeon(R) Gold 5218R CPU (on-prem)</li>
  <li>AMD EPYC 7413 24-Core Processor (on-prem)</li>
</ul>

<p>The advantages of this design based on GitHub Actions are</p>

<ul>
  <li>Fully automated workflow.</li>
  <li>Fully reproducible:
    <ul>
      <li>All software is maintained in GitHub repositories.</li>
    </ul>
  </li>
  <li>Reuses existing CI/CD infrastructure and know-how.</li>
  <li>Readily extendable:
    <ul>
      <li>Add benchmarks by adding extra jobs to the GitHub Actions workflow.</li>
      <li>Add more servers by configuring GitHub self-hosted runners.</li>
      <li>Self-hosted runners can be bare-metal servers or run in the Cloud.</li>
    </ul>
  </li>
</ul>

<p>While the vision is to advance standardization in our industry and grow a
community around this platform, it is also straightforward to fork our codebase
and create a private instance with proprietary benchmarks.</p>

<p>Another fundamental aspect of our software infrastructure is the use of virtual
containers, in particular Docker. This makes it straightforward to configure new
machines and reproduce performance results. In our experience, virtual
containers are the only realistic way of maintaining and extending a software
and hardware infrastructure like the one we envision in this project.</p>
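To illustrate the idea, a runner's execution environment can be pinned down in a Dockerfile along these lines (the base image, package set, and script name are hypothetical, not our actual container definition):

```dockerfile
# Hypothetical sketch of a benchmark container for a CUDA-capable runner;
# the base image and installed packages are illustrative only.
FROM nvidia/cuda:11.8.0-devel-ubuntu22.04
RUN apt-get update && apt-get install -y python3 python3-pip git
RUN pip3 install devito
COPY run_benchmark.sh /opt/benchmarks/
WORKDIR /opt/benchmarks
ENTRYPOINT ["./run_benchmark.sh"]
```

Because the entire toolchain is captured in the image, bringing a new machine online reduces to installing the container runtime and registering the self-hosted runner.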

<p>Currently, three benchmarks are configured:</p>

<h4 id="isotropic-acoustic">Isotropic acoustic</h4>

<ul>
  <li>Shortcut: iso</li>
  <li>Dimensions: 512x512x512</li>
  <li>Number of time steps: 400</li>
  <li>Space order: 8</li>
  <li>Time order: 2</li>
</ul>

<h4 id="fletcher-and-fowler-tti">Fletcher and Fowler TTI</h4>

<ul>
  <li>Shortcut: tti_fl</li>
  <li>Dimensions: 512x512x512</li>
  <li>Number of time steps: 400</li>
  <li>Space order: 8</li>
  <li>Time order: 2</li>
</ul>

<h4 id="skew-adjoint-tti">Skew-adjoint TTI</h4>

<ul>
  <li>Shortcut: tti_sa</li>
  <li>Dimensions: 512x512x512</li>
  <li>Number of time steps: 400</li>
  <li>Space order: 8</li>
  <li>Time order: 2</li>
</ul>

<h3 id="results-preview">Results preview</h3>

<p>We have included a snapshot of results below. FLOPS (Floating Point Operations
Per Second) is a well-recognized measure in High-Performance Computing (HPC),
but it may not always be the most revealing. Its value can be inflated by
employing inefficient numerical methods. In seismic imaging, GPts/s (Giga-Points
Per Second) — also termed giga-cells-per-second — is often favored. This metric
directly measures work throughput, offering a clearer gauge of performance. In
essence, GPts/s helps in accurately estimating the time or cost required to
solve a specific problem.</p>
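To make the metric concrete, here is a minimal sketch of how GPts/s is computed (not the benchmark harness itself): the number of grid points updated, times the number of time steps, divided by wall-clock time.

```python
def gpts_per_s(shape, timesteps, runtime_s):
    """Throughput in giga-points per second for a stencil run.

    shape      -- grid dimensions, e.g. (512, 512, 512)
    timesteps  -- number of time steps executed
    runtime_s  -- wall-clock time of the time loop, in seconds
    """
    points = 1
    for n in shape:
        points *= n
    return points * timesteps / runtime_s / 1e9

# A 512^3 grid run for 400 time steps in ~1 second of wall-clock time
# corresponds to roughly 53.7 GPts/s.
print(round(gpts_per_s((512, 512, 512), 400, 1.0), 1))  # → 53.7
```

Unlike FLOPS, this number cannot be inflated by doing redundant arithmetic per point, which is why it gives a clearer gauge of time-to-solution.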

<p><strong>Disclaimer:</strong> <em>Although we have exerted maximum effort to guarantee precise,
equitable, and replicable results, it is crucial to understand that the
benchmarking framework is still in the alpha development phase. Consequently,
the benchmarks provided here are preliminary and subject to change. The
benchmark data should not be considered comprehensive or final, and are not
suited for making any financial decisions.</em></p>

<p><em>No warranty, express or implied, is provided with the data. The information is
supplied on an “as is” basis. We expressly disclaim, to the maximum extent
permitted by law, any liability for any damages or losses, direct or
consequential, resulting from the use of these benchmarks. Please utilize this
information responsibly, keeping in mind its tentative nature.</em></p>

<h3 id="3d-isotropic-acoustic">3D Isotropic acoustic</h3>

<table>
  <thead>
    <tr>
      <th>Processor</th>
      <th>GPts/s</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>NVIDIA A100-80GB</td>
      <td>62.7</td>
    </tr>
    <tr>
      <td>NVIDIA A100-PCIE-40GB</td>
      <td>54.2</td>
    </tr>
    <tr>
      <td>AMD Instinct™ MI250</td>
      <td>54</td>
    </tr>
    <tr>
      <td>NVIDIA Tesla PG503-216 (V100)</td>
      <td>31</td>
    </tr>
    <tr>
      <td>AMD Instinct™ MI210</td>
      <td>29.2</td>
    </tr>
    <tr>
      <td>Intel(R) Xeon(R) Gold 5218R CPU</td>
      <td>7.97</td>
    </tr>
    <tr>
      <td>AMD EPYC 7413 24-Core Processor</td>
      <td>1.49</td>
    </tr>
  </tbody>
</table>

<h3 id="3d-fletcher-du-fowler-tti">3D Fletcher Du Fowler TTI</h3>

<table>
  <thead>
    <tr>
      <th>Processor</th>
      <th>GPts/s</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>AMD Instinct™ MI250</td>
      <td>16.1</td>
    </tr>
    <tr>
      <td>NVIDIA A100-80GB</td>
      <td>12.4</td>
    </tr>
    <tr>
      <td>NVIDIA A100-PCIE-40GB</td>
      <td>12.2</td>
    </tr>
    <tr>
      <td>AMD Instinct™ MI210</td>
      <td>9.83</td>
    </tr>
    <tr>
      <td>NVIDIA Tesla PG503-216 (V100)</td>
      <td>9.34</td>
    </tr>
    <tr>
      <td>Intel(R) Xeon(R) Gold 5218R CPU</td>
      <td>1.72</td>
    </tr>
    <tr>
      <td>AMD EPYC 7413 24-Core Processor</td>
      <td>0.797</td>
    </tr>
  </tbody>
</table>

<h3 id="future-work">Future work</h3>

<ul>
  <li>Add a link to the page capturing the benchmark characteristics.</li>
  <li>Add more metrics, such as FLOPS.</li>
  <li>Add more benchmarks:
    <ul>
      <li>Laplacian operator (a trivial case helps when working with vendors).</li>
      <li>Gradient operators to stress backward propagation.</li>
      <li>Elastic formulation, as these are of growing importance.</li>
    </ul>
  </li>
  <li>Add support for third parties to provide their own implementations of the benchmarks.</li>
  <li>Add more benchmark configurations:
    <ul>
      <li>MPI for NUMA CPUs.</li>
      <li>MPI for multiple GPUs per server.</li>
    </ul>
  </li>
  <li>Add more bare-metal nodes to the development cluster.</li>
  <li>Engage with hardware vendors to obtain test nodes.</li>
  <li>Add Cloud-based self-hosted runners:
    <ul>
      <li>Ideally, these would be configured as on-demand runners.</li>
    </ul>
  </li>
  <li>Community engagement:
    <ul>
      <li>Organize benchmarking workshops.</li>
      <li>Engage with hardware and Cloud vendors to review and optimize benchmarks.</li>
    </ul>
  </li>
</ul>

<h3 id="acknowledgements">Acknowledgements</h3>

<p>Many thanks to Chevron for the funding and feedback to kickstart this
initiative.  We would also like to thank AMD, AWS, Dell, Nvidia and
Supermicro for providing hardware and cloud resources.</p>]]></content><author><name>Gerard Gorman, Fabio Luporini</name></author><category term="DevOps" /><category term="benchmarking" /><category term="HPC" /><category term="Cloud" /><category term="AMD" /><category term="ARM" /><category term="Intel" /><category term="Nvidia" /><summary type="html"><![CDATA[A framework for cross-platform benchmarking of seismic imaging workloads is described. The vision for this platform is that it will be used to support collaboration between Devito Codes, hardware vendors and Cloud providers to continuously optimize the performance of seismic imaging workloads.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.devitocodes.com/images/robot_devito_pro_banner.png" /><media:content medium="image" url="https://www.devitocodes.com/images/robot_devito_pro_banner.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>