<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.9.5">Jekyll</generator><link href="https://www.devitocodes.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://www.devitocodes.com/" rel="alternate" type="text/html" /><updated>2025-03-19T10:09:13+00:00</updated><id>https://www.devitocodes.com/feed.xml</id><title type="html">Devito Codes</title><subtitle>Devito is a Python package to implement optimized stencil computation (e.g., finite differences, image processing, machine learning) from high-level symbolic problem definitions. Devito builds on SymPy and employs automated code generation and just-in-time compilation to execute optimized computational kernels on several computer platforms, including CPUs, GPUs, and clusters thereof.</subtitle><entry><title type="html">Benchmarking DevitoPRO on Intel® Xeon® 6 Processors</title><link href="https://www.devitocodes.com/granite" rel="alternate" type="text/html" title="Benchmarking DevitoPRO on Intel® Xeon® 6 Processors" /><published>2025-02-07T00:01:00+00:00</published><updated>2025-02-07T00:01:00+00:00</updated><id>https://www.devitocodes.com/granite</id><content type="html" xml:base="https://www.devitocodes.com/granite"><![CDATA[<h2 id="introduction">Introduction</h2>

<p>High-performance computing (HPC) plays a critical role in scientific and engineering applications, from seismic imaging to simulation. DevitoPRO is designed to help researchers and engineers automate HPC code generation, allowing them to focus on algorithmic development for their domain-specific challenges while ensuring their simulations run efficiently on modern hardware.</p>

<p>As part of our ongoing efforts to optimize performance across multiple architectures, this post presents benchmark results for <strong>DevitoPRO on Intel® Xeon® 6 6980P processors</strong>, comparing them to the previous-generation <strong>5th Gen Intel® Xeon® Platinum 8592+ processors</strong>. These results provide insights into <strong>computational throughput, data transfer performance, and mixed-precision acceleration</strong>—all key factors in achieving <strong>high-performance finite-difference seismic imaging kernels</strong>.</p>

<h2 id="benchmarking-setup">Benchmarking Setup</h2>

<p>The goal of this benchmarking study is to evaluate:</p>

<ul>
  <li>Generational performance improvements when moving from 5th Gen Intel® Xeon® Platinum 8592+ processors (Emerald Rapids) to Intel® Xeon® 6 6980P processors (Granite Rapids).</li>
  <li>The impact of mixed-precision computing, where FP16 is used for storage to improve memory bandwidth utilization while FP32 is retained for arithmetic to maintain numerical stability and accuracy, and its effect on compute throughput and data movement efficiency.</li>
</ul>

<h3 id="benchmarks">Benchmarks</h3>

<p>The benchmarks use two workloads:</p>

<ol>
  <li>Acoustic anisotropic propagator (acoustic TTI model) – A widely used model in seismic imaging for energy applications, testing floating-point operations per second (FLOPS) and data transfer rates.</li>
  <li>Elastic propagator – A more complex model incorporating mixed-precision techniques (FP32/FP16) to evaluate performance improvements in multi-node MPI environments.</li>
</ol>
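<p>Both propagators are, at their core, explicit finite-difference time-stepping loops. As a rough illustration of the kind of update they perform, here is a hand-written NumPy sketch of a second-order leapfrog acoustic step (a toy illustration only, not the optimized code DevitoPRO generates):</p>

```python
import numpy as np

def acoustic_step(u_prev, u_curr, c2_dt2, h2):
    """One leapfrog update of the 3D constant-density acoustic wave
    equation: u_next = 2*u_curr - u_prev + (c*dt)^2 * laplacian(u_curr).
    Second order in time and space; boundary values are simply carried over."""
    lap = (u_curr[:-2, 1:-1, 1:-1] + u_curr[2:, 1:-1, 1:-1] +
           u_curr[1:-1, :-2, 1:-1] + u_curr[1:-1, 2:, 1:-1] +
           u_curr[1:-1, 1:-1, :-2] + u_curr[1:-1, 1:-1, 2:] -
           6.0 * u_curr[1:-1, 1:-1, 1:-1]) / h2
    u_next = u_curr.copy()  # keep boundary values unchanged in this sketch
    u_next[1:-1, 1:-1, 1:-1] = (2.0 * u_curr[1:-1, 1:-1, 1:-1]
                                - u_prev[1:-1, 1:-1, 1:-1]
                                + c2_dt2 * lap)
    return u_next
```

<p>Each update touches a small neighbourhood of every grid point, which is why these kernels are dominated by memory traffic rather than arithmetic.</p>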

<h3 id="processor-specifications">Processor Specifications</h3>

<table>
  <thead>
    <tr>
      <th style="text-align: center">Processor</th>
      <th style="text-align: center">Physical Cores</th>
      <th style="text-align: center">Sockets</th>
      <th style="text-align: center">Architecture</th>
      <th style="text-align: center">HBM</th>
      <th style="text-align: center">Memory</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: center">5th Gen Intel® Xeon® Platinum 8592+ (Emerald Rapids)</td>
      <td style="text-align: center">128</td>
      <td style="text-align: center">2</td>
      <td style="text-align: center">x86_64</td>
      <td style="text-align: center">No</td>
      <td style="text-align: center">Supports DDR5 memory with an eight-channel interface</td>
    </tr>
    <tr>
      <td style="text-align: center">Intel® Xeon® 6 6980P (Granite Rapids)</td>
      <td style="text-align: center">256</td>
      <td style="text-align: center">2</td>
      <td style="text-align: center">x86_64</td>
      <td style="text-align: center">No</td>
      <td style="text-align: center">Supports DDR5 memory with a twelve-channel interface</td>
    </tr>
  </tbody>
</table>

<p>Benchmarks were conducted using identical compilers, software environments, and
simulation parameters to ensure fair comparisons.</p>

<ul>
  <li>Compilers: Intel(R) oneAPI DPC++/C++ Compiler 2024.2.1</li>
  <li>MPI Library: Intel(R) MPI 2021.13</li>
  <li>Optimization Flags: <code class="language-plaintext highlighter-rouge">-O3 -g -fPIC -Wall -std=c99 -xHost -fp-model=fast -qopt-zmm-usage=high -shared -qopenmp</code></li>
  <li>Software Stack: Benchmarks were conducted using <strong>DevitoPRO v4.8.x</strong>, which includes <strong>an experimental mixed-precision implementation</strong>. While this version incorporates <strong>early optimizations for FP32/FP16 computing</strong>, ongoing development efforts are focused on further refining precision handling, improving stability, and enhancing performance scalability across diverse hardware architectures.</li>
</ul>
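<p>Devito selects the compiler toolchain, parallel programming model, and MPI mode at runtime through environment variables. As a hedged sketch of how such a setup might be configured (the values shown, such as <code>icx</code> for the oneAPI compiler, are assumptions for this study; consult the Devito documentation for the authoritative variable list):</p>

```python
import os

# Configure Devito's JIT toolchain before importing the package;
# these variables are read when kernels are generated and compiled.
os.environ["DEVITO_ARCH"] = "icx"         # Intel oneAPI DPC++/C++ compiler (assumed value)
os.environ["DEVITO_LANGUAGE"] = "openmp"  # emit OpenMP-parallel kernels
os.environ["DEVITO_MPI"] = "1"            # enable MPI domain decomposition
```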

<h2 id="performance-results">Performance Results</h2>

<h3 id="generational-performance-gains-for-acoustic-tti">Generational Performance Gains for acoustic TTI</h3>

<p>The acoustic TTI benchmark measures the efficiency of seismic wave propagation simulations, a key workload in geophysical exploration. The benchmark was conducted using a 1024×2048×1024 computational grid with 5000 time steps, ensuring a realistic and computationally demanding test case. The simulations were executed using <strong>DevitoPRO</strong>, leveraging optimized code generation for modern CPU architectures.</p>

<table>
  <thead>
    <tr>
      <th style="text-align: center">Metric</th>
      <th style="text-align: center">5th Gen Intel® Xeon® Platinum 8592+ (Emerald Rapids)</th>
      <th style="text-align: center">Intel® Xeon® 6 6980P (Granite Rapids)</th>
      <th style="text-align: center">Improvement</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: center">Operations</td>
      <td style="text-align: center">2.28 TFlops</td>
      <td style="text-align: center">5.96 TFlops</td>
      <td style="text-align: center">2.6x faster</td>
    </tr>
    <tr>
      <td style="text-align: center">FD-throughput (GPts/s)</td>
      <td style="text-align: center">7.45 GPts/s</td>
      <td style="text-align: center">15.74 GPts/s</td>
      <td style="text-align: center">2.1x faster</td>
    </tr>
  </tbody>
</table>
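<p>The FD-throughput figures relate directly to wall-clock time: GPts/s is the number of grid-point updates performed per second, i.e. grid size × time steps ÷ runtime. A quick back-of-envelope check against the table (the runtimes here are derived from the throughput figures, not measured values from the benchmark logs):</p>

```python
# FD-throughput (GPts/s) = grid points * time steps / runtime / 1e9
nx, ny, nz, nt = 1024, 2048, 1024, 5000
total_updates = nx * ny * nz * nt          # ~1.07e13 grid-point updates

runtime_emr = total_updates / 7.45e9       # implied seconds on EMR
runtime_gnr = total_updates / 15.74e9      # implied seconds on GNR
speedup = runtime_emr / runtime_gnr        # ~2.1x, consistent with the table
```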

<p>To fully exploit the hardware capabilities, the benchmark used a <strong>NUMA-aware hybrid MPI-OpenMP configuration</strong>, optimizing both <strong>computation</strong> and <strong>data locality</strong>:</p>

<ul>
  <li><strong>Emerald Rapids (EMR) Configuration:</strong>
    <ul>
      <li><strong>Physical cores per socket:</strong> 64</li>
      <li><strong>Total physical cores (dual-socket):</strong> 128</li>
      <li><strong>NUMA domain per socket:</strong> 2</li>
      <li><strong>MPI ranks:</strong> 4 (<strong>1 per NUMA domain</strong>)</li>
      <li><strong>OpenMP threads per rank:</strong> 32</li>
      <li><strong>Pinning:</strong> <code class="language-plaintext highlighter-rouge">I_MPI_PIN_DOMAIN=numa</code>, <code class="language-plaintext highlighter-rouge">I_MPI_PIN_ORDER=bunch</code>, <code class="language-plaintext highlighter-rouge">I_MPI_PIN_CELL=core</code></li>
    </ul>
  </li>
  <li><strong>Granite Rapids (GNR) Configuration:</strong>
    <ul>
      <li><strong>Physical cores per socket:</strong> 128</li>
      <li><strong>Total physical cores (dual-socket):</strong> 256</li>
      <li><strong>NUMA domain per socket:</strong> 3</li>
      <li><strong>MPI ranks:</strong> 6 (<strong>1 per NUMA domain</strong>)</li>
      <li><strong>OpenMP threads per rank:</strong> 42 (the NUMA domains on GNR are unbalanced: per socket, two domains have 43 cores and one has 42. We use 42 threads per domain to keep the configuration uniform.)</li>
      <li><strong>Pinning:</strong> <code class="language-plaintext highlighter-rouge">I_MPI_PIN_DOMAIN=numa</code>, <code class="language-plaintext highlighter-rouge">I_MPI_PIN_ORDER=bunch</code>, <code class="language-plaintext highlighter-rouge">I_MPI_PIN_CELL=core</code></li>
    </ul>
  </li>
</ul>

<p>The NUMA-aware process placement ensured that:</p>

<ul>
  <li>MPI ranks were confined within NUMA nodes, minimizing cross-socket communication overhead.</li>
  <li>OpenMP threads were pinned to physical cores within NUMA domains, reducing memory latency.</li>
  <li>Halo exchange efficiency was maximized through optimized memory access patterns.</li>
</ul>

<h3 id="key-factors-driving-performance-gains">Key Factors Driving Performance Gains</h3>

<p>Granite Rapids exhibited over twice the performance of Emerald Rapids due to:</p>

<ol>
  <li>Higher Memory Bandwidth:
    <ul>
      <li>GNR features twelve DDR5 memory channels per socket, improving data movement efficiency.</li>
    </ul>
  </li>
  <li>Increased Core Count:
    <ul>
      <li>GNR doubles the physical core count (256 vs. 128) compared to EMR, significantly boosting parallel execution.</li>
    </ul>
  </li>
</ol>

<p>These architectural and software improvements collectively delivered 2.6x higher floating-point performance and 2.1x higher finite-difference throughput (GPts/s), demonstrating the generational leap in efficiency from Emerald Rapids to Granite Rapids.</p>

<h3 id="mixed-precision-performance-gains-for-isotropic-elastic-on-granite-rapids-gen-6">Mixed-precision performance gains for Isotropic Elastic on Granite Rapids (Gen 6)</h3>

<p>The isotropic elastic benchmark evaluates the efficiency of multi-component wave
propagation simulations, a crucial workload in geophysical imaging. This test
was conducted on a 1024 × 2048 × 1024 computational grid with 5000 time steps.</p>

<p>Unlike acoustic TTI, which was benchmarked exclusively in FP32 precision,
isotropic elastic simulations were tested in both FP32-only and mixed FP32/FP16
modes. The introduction of mixed precision yielded significant computational and
memory efficiency improvements.</p>

<p>We use the same hybrid MPI-OpenMP parallelism with NUMA-aware pinning as with
the acoustic TTI benchmark.</p>

<table>
  <thead>
    <tr>
      <th style="text-align: center">Precision Mode</th>
      <th style="text-align: center">Operations</th>
      <th style="text-align: center">Compute Throughput (GPts/s)</th>
      <th style="text-align: center">Improvement</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: center">FP32</td>
      <td style="text-align: center">1.02 TFlops</td>
      <td style="text-align: center">3.59 GPts/s</td>
      <td style="text-align: center">-</td>
    </tr>
    <tr>
      <td style="text-align: center">FP32/FP16 (Mixed)</td>
      <td style="text-align: center">2.37 TFlops</td>
      <td style="text-align: center">8.30 GPts/s</td>
      <td style="text-align: center">~2.3x faster</td>
    </tr>
  </tbody>
</table>

<h3 id="why-does-mixed-precision-matter">Why Does Mixed Precision Matter?</h3>

<p>Switching from FP32-only computation to a mixed FP32/FP16 approach provided a
~2.3x speedup, achieved through a balanced approach that optimizes both storage
and arithmetic precision. Since finite difference methods are memory-bound, the
key to performance gains lies in reducing memory bandwidth pressure while
maintaining numerical accuracy.</p>

<p>The mixed-precision strategy used in DevitoPRO follows this principle:</p>

<ul>
  <li>FP16 for storage: Reduces memory footprint and accelerates data movement.</li>
  <li>FP32 for arithmetic: Preserves computational accuracy by minimizing rounding errors.</li>
</ul>

<p>This approach ensures that precision is maintained where it matters most while
taking advantage of FP16’s efficiency for memory operations.</p>
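<p>The storage/arithmetic split can be illustrated in a few lines of NumPy (a toy sketch of the principle only; DevitoPRO performs the conversion inside its compiled kernels):</p>

```python
import numpy as np

# Wavefields held in FP16 halve the bytes moved through the memory hierarchy...
u = np.linspace(0.0, 1.0, 64 * 64).reshape(64, 64).astype(np.float16)
v = np.linspace(1.0, 2.0, 64 * 64).reshape(64, 64).astype(np.float16)

# ...while arithmetic is promoted to FP32 to limit rounding error.
w = u.astype(np.float32) * v.astype(np.float32)

assert u.nbytes * 2 == w.nbytes   # FP16 storage is half the FP32 footprint
assert w.dtype == np.float32      # computation stays in single precision
```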

<h3 id="key-benefits-of-mixed-precision">Key Benefits of Mixed Precision</h3>

<ol>
  <li>Lower Memory Footprint:
    <ul>
      <li>FP16 values take up half the memory of FP32, doubling effective cache capacity and reducing pressure on memory bandwidth.</li>
      <li>More wavefield data can fit in fast-access memory, improving locality and cache reuse.</li>
    </ul>
  </li>
  <li>Reduced Communication Overhead:
    <ul>
      <li>In MPI-based distributed environments, using FP16 for storage reduces halo exchange size, halving inter-rank data transfer costs.</li>
      <li>This is particularly beneficial in multi-node scaling scenarios, where communication is a major bottleneck.</li>
    </ul>
  </li>
  <li>Faster Computation:
    <ul>
      <li>Intel Xeon 6 processors feature optimized FP16 vector and tensor operations, accelerating data movement and memory loads.</li>
      <li>Arithmetic remains in FP32, avoiding excessive rounding errors while still benefiting from higher memory bandwidth efficiency.</li>
    </ul>
  </li>
</ol>

<p>By carefully combining FP16 for storage and FP32 for arithmetic, DevitoPRO
achieves significant speedups while ensuring numerical stability, making this
approach ideal for large-scale elastic wave simulations.</p>

<h3 id="impact-on-elastic-wave-simulations">Impact on Elastic Wave Simulations</h3>

<p>Elastic wave simulations require multiple coupled wavefields, significantly
increasing memory and computational demands.  By leveraging mixed FP32/FP16
precision, DevitoPRO achieves:</p>

<ul>
  <li>Higher performance with reduced memory bandwidth constraints.</li>
  <li>More efficient inter-rank communication, crucial for large-scale multi-node workloads.</li>
  <li>Improved scalability, making mixed precision a practical choice for high-fidelity seismic modeling.</li>
</ul>

<p>These optimizations ensure Granite Rapids delivers superior elastic wave
simulation performance, making it a compelling choice for next-generation
geophysical imaging workloads.</p>

<h2 id="key-takeaways">Key Takeaways</h2>

<ul>
  <li>
    <p>Intel Xeon 6 delivers substantial generational performance gains for finite-difference simulations, achieving 2.6× higher floating-point performance and 2.1× higher finite-difference throughput (GPts/s) compared to its predecessor.</p>
  </li>
  <li>
    <p>Mixed-precision computing (FP32/FP16) significantly enhances efficiency by reducing memory footprint and accelerating computations, making it a crucial optimization for large-scale seismic workloads. Using FP16 for storage while maintaining FP32 for arithmetic strikes the right balance between performance and numerical accuracy.</p>
  </li>
  <li>
    <p>DevitoPRO automates high-performance code generation, allowing users to fully exploit modern hardware without requiring manual tuning. Optimized memory layouts, vectorization, and hybrid MPI/OpenMP parallelism are applied transparently, enabling peak efficiency across different architectures.</p>
  </li>
  <li>
    <p>Hardware-aware code generation is essential for performance portability, ensuring that computational workloads can scale efficiently on diverse hardware platforms. These results reinforce DevitoPRO’s approach, where automated performance tuning maximizes computational efficiency without sacrificing precision.</p>
  </li>
</ul>

<h2 id="looking-ahead">Looking Ahead</h2>

<p>At Devito Codes, we remain committed to hardware-neutral performance
optimization. While this benchmarking study focuses on Intel Xeon 6, we are
actively working on:</p>

<ul>
  <li>
    <p>Performance analysis on other architectures, including AMD EPYC, NVIDIA Hopper, and ARM-based HPC systems.</p>
  </li>
  <li>
    <p>Refining mixed-precision strategies to balance accuracy, performance, and memory efficiency.</p>
  </li>
  <li>
    <p>Expanding code portability, ensuring DevitoPRO runs optimally across a diverse range of HPC platforms.</p>
  </li>
</ul>

<p>Users and organizations interested in seismic imaging, medical imaging, or
wave-based simulations can explore DevitoPRO’s capabilities at
<a href="https://www.devitocodes.com/features/">https://www.devitocodes.com/features/</a>.</p>

<p>Would you like more details? Feel free to reach out!</p>]]></content><author><name>mlouboutin</name></author><category term="Intel/HPC" /><category term="DSL" /><category term="Python" /><category term="HPC" /><category term="RTM" /><category term="FWI" /><category term="Seismic" /><category term="Elastic" /><category term="Mixed-precision" /><summary type="html"><![CDATA[Explore the cutting-edge performance of DevitoPRO on Intel® Xeon® 6 6980P processors in this comprehensive benchmark study. This blog post reveals how next-generation hardware accelerates finite-difference seismic imaging simulations, delivering over 2.6× higher computational throughput and 2.1× faster compute performance compared to 5th Gen Intel® Xeon® Platinum systems. With a focus on both acoustic TTI and elastic propagator models, we examine the benefits of mixed-precision techniques that combine FP16 storage with FP32 arithmetic for optimized memory bandwidth and accuracy. Learn how NUMA-aware hybrid MPI/OpenMP configurations further boost performance, enabling scalable and efficient high-fidelity geophysical simulations. 
Discover the full benchmark results.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.devitocodes.com/images/make-subsurface-from-granite.png" /><media:content medium="image" url="https://www.devitocodes.com/images/make-subsurface-from-granite.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Accurate and Robust Propagators and Gradients for Land Seismic Imaging</title><link href="https://www.devitocodes.com/IM" rel="alternate" type="text/html" title="Accurate and Robust Propagators and Gradients for Land Seismic Imaging" /><published>2024-08-21T00:01:00+00:00</published><updated>2024-08-21T00:01:00+00:00</updated><id>https://www.devitocodes.com/IM</id><content type="html" xml:base="https://www.devitocodes.com/IM"><![CDATA[<p>At Devito Codes, we continue to push the boundaries of seismic imaging and wave
propagation. Our latest innovation, to be showcased at IMAGE’24, addresses one
of the most challenging aspects of land seismic imaging: accurately modeling
wavefields in the presence of complex topography using finite-difference
methods, all without the need for unstructured mesh generation.</p>

<h2 id="imaging-from-topography">Imaging from Topography</h2>

<p>High-quality topography handling enables calculation of accurate gradients and FWI topographic updates, even in settings where topographic variation is extreme. If forward and adjoint modeling can capture the physics of the free-surface, these effects become data rather than noise; leveraging this data enhances resolution and illumination, whilst minimizing the requisite preprocessing and streamlining imaging workflows. However, failure to accurately account for the effects of topographic variation leads to images which are unfocused and artefact-prone at best, and entirely incoherent at worst. It follows that for confident and robust land seismic imaging, accurate topography implementation is crucial.</p>

<h4 id="illumination-corrected-gradient-for-an-fwi-tomographic-update">Illumination-corrected gradient for an FWI tomographic update</h4>
<p><img src="/images/corrected_fwi_gradient.png" alt="Example of a corrected gradient for an FWI tomographic update" /></p>

<h4 id="corresponding-inverse-scattering-imaging-condition">Corresponding inverse-scattering imaging condition</h4>
<p><img src="/images/gradient_laplacian_alt_cmap.png" alt="Example of an inverse-scattering imaging condition" /></p>

<h2 id="understanding-the-challenge">Understanding the Challenge</h2>

<p>Traditionally, incorporating topography into solvers has been a complex and
time-consuming task, often requiring the development of bespoke solutions that
are difficult to generalize across different wave equations. While
finite-element methods and mesh generation offer potential solutions, they come
with significant challenges, including a lack of software and practical
experience in developing finite-element inversion methods for exploration
geophysics and the absence of robust automatic mesh generation techniques. These
factors make finite-element approaches less attractive for building efficient
seismic imaging processing pipelines, further complicating the incorporation of
complex topography for land seismic imaging.</p>

<h2 id="a-general-approach-to-modelling-complex-topography">A General Approach to Modelling Complex Topography</h2>

<p>To solve this problem, we’ve developed Schism, a module that automatically
generates immersed-boundary operators for Devito. Schism simplifies the
integration of complex topographies into wave propagation models, separating
topography specification from numerical implementation. This allows imaging
algorithms to leverage sophisticated topography handling routines without added
complexity.</p>

<p>Immersed boundaries represent free surfaces as sharp interfaces on a regular
grid, accurately reflecting the true surface position. This method eliminates
the need for mesh generation or curvilinear grids, by constructing suitable
field extensions beyond the boundary, which are incorporated into
surface-adjacent FD operators, enforcing the necessary boundary conditions.</p>
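<p>In the simplest one-dimensional case, such a field extension amounts to populating ghost points above the surface so that the boundary condition is satisfied implicitly. The antisymmetric (odd) extension below enforces a pressure-free surface, u = 0, at index 0; this is only a toy illustration of the idea, not Schism's generalized multi-dimensional operators:</p>

```python
import numpy as np

def extend_free_surface(u, n_ghost):
    """Odd extension of a 1D field about a free surface at index 0,
    enforcing u = 0 there via u[-i] = -u[i]. Surface-adjacent FD
    stencils can then be applied unmodified using the ghost values."""
    ghost = -u[1:n_ghost + 1][::-1]
    return np.concatenate([ghost, u])

u = np.array([0.0, 1.0, 2.0, 3.0])    # u[0] lies on the free surface
ext = extend_free_surface(u, 2)       # [-2., -1., 0., 1., 2., 3.]
```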

<p>Our paper, <a href="https://doi.org/10.1190/geo2023-0515.1">A Novel Immersed Boundary Approach for Irregular Topography with
Acoustic Wave Equations</a>, published
earlier this year in <em>Geophysics</em>, demonstrates this approach by modeling wave
propagation around the complex topography of Mount St. Helens.</p>

<p>As illustrated below, the 3D free-surface wavefield is accurately rendered,
capturing the intricate interactions of the wavefield with steep, uneven
terrain. This showcases the power of the immersed boundary method in addressing
real-world seismic challenges.</p>

<p><img src="/images/StHelens.png" alt="Wave propagation around the complex topography of Mount St. Helens" /></p>

<h2 id="applying-to-land-seismic-data">Applying to Land Seismic Data</h2>

<p>We are collaborating with service companies like
<a href="https://www.s-cube.com/partnerships/xwi-plus-devito/">S-Cube</a> to apply these methods to
real-world land seismic applications. Accurately modeling acoustic and elastic
TTI wavefield behavior in environments with pronounced and irregular topography
is essential. Our immersed boundary method effectively captures topographic
effects such as surface multiples, amplitude variations, and diffraction around
obstacles while maintaining the regular computational grids commonly used in
imaging applications. This makes it an ideal choice for integrating topography
handling into existing solvers and imaging workflows.</p>

<p>The innovation here lies in the robustness and generality of the approach,
allowing integration into existing seismic data processing pipelines. The
boundary treatment is tied to the discretization, boundary conditions, and
geometry but not to the governing equations. Through symbolic computation and a
generalized mathematical approach, immersed boundary treatments can be generated
for various equations and boundary conditions, ensuring flexibility and ease of
use. For example, higher-order free surface conditions derived from isotropic
acoustic wave equations behave similarly in acoustic TTI contexts, offering a
computationally efficient alternative.</p>

<h2 id="join-us-at-image24">Join Us at IMAGE’24</h2>

<p>Our immersed boundary support is now available as an experimental feature in
DevitoPRO. If you are interested in learning more about Schism, Devito, or
DevitoPRO, we invite you to go and see Dr. Ed Caunt at his IMAGE’24 presentations:</p>

<h4 id="wednesday-2808-1000">Wednesday 28/08 10:00</h4>

<p><strong>S-Cube &amp; Devito Codes: New Developments for Advanced Physics</strong>, S-Cube &amp; ThinkOnward Booth (1659) in the Digitalisation Pavilion.</p>

<h4 id="thursday-2908-0800-0940am">Thursday 29/08 08:00-09:40AM</h4>

<p><strong>Recent Advances in Seismic Modeling 1</strong>, (Session ID: SMT P1) Poster Station 3 (3rd Level) A, in the George R. Brown Convention Center</p>
<ul>
  <li><strong>An immersed boundary topography approach for TTI acoustic propagation</strong></li>
  <li><strong>Towards elastic-free surface topography with immersed boundaries</strong></li>
</ul>]]></content><author><name>ecaunt</name></author><category term="RTM/FWI" /><category term="DSL" /><category term="Python" /><category term="HPC" /><category term="RTM" /><category term="FWI" /><category term="Seismic" /><summary type="html"><![CDATA[Devito Codes has developed a new module to improve land seismic imaging by accurately modeling wavefields in complex topographies without the need for unstructured mesh generation. This innovation uses an immersed boundary method, representing free surfaces on a regular grid and enforcing boundary conditions through field extensions. This approach, showcased in their recent paper and collaboration with S-Cube, allows seamless integration into existing seismic data processing pipelines. Schism’s flexibility and efficiency make it ideal for handling topography in seismic imaging. Learn more at IMAGE'24, where Dr. Ed Caunt will present these advancements.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.devitocodes.com/images/land-waves.png" /><media:content medium="image" url="https://www.devitocodes.com/images/land-waves.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Seismic Imaging Performance with AWS Graviton4</title><link href="https://www.devitocodes.com/graviton4" rel="alternate" type="text/html" title="Seismic Imaging Performance with AWS Graviton4" /><published>2024-08-19T00:01:00+00:00</published><updated>2024-08-19T00:01:00+00:00</updated><id>https://www.devitocodes.com/graviton4</id><content type="html" xml:base="https://www.devitocodes.com/graviton4"><![CDATA[<p>With the general availability of <a href="https://www.aboutamazon.com/news/aws/graviton4-aws-cloud-computing-chip">AWS
Graviton4</a>,
we explore its performance advancements over previous Graviton generations using
DevitoPRO’s standard 3D acoustic wave propagation kernels. These Full Waveform
Inversion (FWI) and Reverse Time Migration (RTM) propagation kernels serve as
benchmarks to develop cost models and assess computational efficiency.</p>

<p>The <a href="https://aws.amazon.com/ec2/instance-types/r8g/">AWS Graviton4</a> exhibits substantial performance improvements compared to its
predecessors. For <strong>3D Isotropic acoustic</strong>, the AWS Graviton4 (r8g.24xlarge)
delivers performance that is approximately 2.7 times faster than Graviton2 and
81% faster than Graviton3. In the <a href="https://doi.org/10.1190/1.3269902">3D Fletcher-Du-Fowler
TTI</a> benchmark, the Graviton4 outperforms
Graviton2 by over 3.4 times and is 51% faster than Graviton3. Similarly, the <a href="https://library.seg.org/doi/10.1190/segam2016-13878451.1">3D
Self-adjoint TTI</a>
benchmark on AWS Graviton4 runs nearly 3.6 times faster than Graviton2 and 80%
faster than Graviton3.</p>

<p>These results underscore the significant generational improvements in the
Graviton4, particularly for memory-bound high-performance computing
applications.</p>

<h3 id="graviton4---amazon-ec2-r8g-instances">Graviton4 - Amazon EC2 R8g Instances</h3>

<p>The Amazon EC2 instance type <em>r8g.24xlarge</em> is a single-socket Graviton4. The
Graviton4 processors offer significant advancements over the Graviton3. The
Graviton4 has 96 Neoverse V2 cores and a revamped memory subsystem, featuring
<a href="https://www.nextplatform.com/2023/11/28/aws-adopts-arm-v2-cores-for-expansive-graviton4-server-cpu/">12 DDR5-5600 channels and a peak memory bandwidth of 536.7 GB/s, which is 75%
higher than the
Graviton3.</a>
These improvements are particularly beneficial for memory-bound HPC
applications, such as finite-difference based FWI/RTM operators.</p>
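<p>For a memory-bound stencil, peak memory bandwidth puts a hard ceiling on achievable throughput in grid points per second. A crude roofline estimate (the bytes moved per grid-point update is a modelling assumption, here 16 bytes: three FP32 reads and one FP32 write with perfect cache reuse of stencil neighbours):</p>

```python
# Bandwidth roofline: throughput ceiling = bandwidth / bytes per point
peak_bw = 536.7e9        # Graviton4 peak memory bandwidth, bytes/s
bytes_per_point = 4 * 4  # assumed traffic per grid-point update (FP32)

ceiling_gpts = peak_bw / bytes_per_point / 1e9   # upper bound in GPts/s
```

<p>How close measured throughput comes to such a ceiling indicates how well a kernel saturates the memory subsystem.</p>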

<h3 id="compiling-on-graviton-using-gcc-141">Compiling on Graviton using GCC-14.1</h3>

<p>To optimize the performance of our benchmarks on the AWS Graviton4 processor, we
built <a href="https://gcc.gnu.org/gcc-14/">GCC 14.1</a> from source, rather than the
system default for Amazon Linux 2023 (6.1.94-99.176.amzn2023.aarch64) which is
GCC 11.4.1. GCC 14.1 has a range of improvements that enhance the compiler’s
ability to leverage the advanced features of the Neoverse cores, particularly
the Neoverse V2 architecture used in Graviton4.</p>

<p>Key enhancements in GCC 14 include improved support for vectorization and new
optimizations that are tailored for Neoverse V2 cores. These improvements allow
for better exploitation of the high memory bandwidth and increased core count in
Graviton4, resulting in more efficient execution of high-performance computing
workloads. Additionally, the release includes enhancements in
auto-vectorization, which are particularly beneficial for memory-bound
applications like seismic imaging and simulation tasks.</p>

<p>Devito uses the following GCC flags depending on the generation of Graviton to
maximize performance on a given instance. As there is just one NUMA domain in
all cases, we parallelize with pure OpenMP.</p>

<p>Graviton2 (Neoverse-N1):</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gcc-14 <span class="nt">-mcpu</span><span class="o">=</span>neoverse-n1 <span class="nt">-O3</span> <span class="nt">-g</span> <span class="nt">-fPIC</span> <span class="nt">-Wall</span> <span class="nt">-std</span><span class="o">=</span>c99 <span class="nt">-Wno-unused-result</span> <span class="nt">-Wno-unused-variable</span> <span class="nt">-Wno-unused-but-set-variable</span> <span class="nt">-ffast-math</span> <span class="nt">-shared</span> <span class="nt">-fopenmp</span>
</code></pre></div></div>

<p>Graviton3 (Neoverse-V1):</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gcc-14 <span class="nt">-mcpu</span><span class="o">=</span>neoverse-v1 <span class="nt">-O3</span> <span class="nt">-g</span> <span class="nt">-fPIC</span> <span class="nt">-Wall</span> <span class="nt">-std</span><span class="o">=</span>c99 <span class="nt">-Wno-unused-result</span> <span class="nt">-Wno-unused-variable</span> <span class="nt">-Wno-unused-but-set-variable</span> <span class="nt">-ffast-math</span> <span class="nt">-shared</span> <span class="nt">-fopenmp</span>
</code></pre></div></div>

<p>Graviton4 (Neoverse-V2):</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gcc-14 <span class="nt">-mcpu</span><span class="o">=</span>neoverse-v2 <span class="nt">-O3</span> <span class="nt">-g</span> <span class="nt">-fPIC</span> <span class="nt">-Wall</span> <span class="nt">-std</span><span class="o">=</span>c99 <span class="nt">-Wno-unused-result</span> <span class="nt">-Wno-unused-variable</span> <span class="nt">-Wno-unused-but-set-variable</span> <span class="nt">-ffast-math</span> <span class="nt">-shared</span> <span class="nt">-fopenmp</span>
</code></pre></div></div>

<h3 id="3d-acoustic-benchmarks">3D acoustic benchmarks</h3>

<p>To evaluate the performance of the Graviton4, we conducted benchmarks using
three standard 3D acoustic wave propagation kernels. These tests were run in
single precision with OpenMP thread parallelism. Each benchmark was carefully
autotuned to optimize parameters such as cache block size, and the best result
from multiple runs was recorded.</p>

<p>These benchmarks are categorized based on their operational intensity. The
isotropic acoustic benchmark is the simplest among them, commonly used in
seismic imaging, and its performance is primarily influenced by memory
bandwidth. The Fletcher-Du-Fowler TTI kernel, which is frequently utilized by
hardware vendors for benchmarking, represents a moderate level of complexity. In
contrast, the self-adjoint TTI kernel, designed for robustness and accuracy, is
employed in production workloads, reflecting the most demanding computational
requirements for acoustic FWI/RTM. This categorization allows for a
comprehensive evaluation of performance across a spectrum of real-world
scenarios.</p>

<p>All three benchmarks were run with a fixed problem size of 512x512x512 grid
points, for a total of 400 time-steps, with space-order 8 and time-order 2.</p>
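<p>For orientation, "space-order 8" means each second derivative is approximated with a 9-point (8th-order) central-difference stencil along each axis. The 1-D NumPy sketch below illustrates such a stencil; it is an illustration of the discretization only, not the Devito-generated benchmark kernel.</p>

```python
import numpy as np

# Standard 8th-order central-difference coefficients for the second
# derivative (offset: weight); the stencil spans 4 points on each side.
COEFFS = {0: -205.0 / 72.0, 1: 8.0 / 5.0, 2: -1.0 / 5.0,
          3: 8.0 / 315.0, 4: -1.0 / 560.0}

def d2_order8(f, h):
    """8th-order second derivative of a 1-D array at interior points."""
    out = COEFFS[0] * f[4:-4].copy()
    for k in range(1, 5):
        out += COEFFS[k] * (f[4 + k:len(f) - 4 + k] + f[4 - k:-4 - k])
    return out / h**2

# Exact (to rounding) on polynomials up to degree 8, e.g. f(x) = x**2:
x = np.linspace(0.0, 1.0, 21)
d2 = d2_order8(x**2, x[1] - x[0])   # ~2 at every interior point
```

<p>Applied along all three axes of a 3D grid, the space-order-8 Laplacian reads 25 points per output value, which is one reason the isotropic kernel is primarily limited by memory bandwidth.</p>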

<p>In the bar chart below we show the performance of Graviton3 and Graviton4
relative to Graviton2. The <em>r6g.16xlarge</em>, <em>r7g.16xlarge</em> and
<em>r8g.16xlarge</em> all have 64 cores, which highlights the per-core
performance improvement. In contrast, the <em>r8g.24xlarge</em> (a
single-socket Graviton4) has a total of 96 cores. We can see some
strong-scaling limits when running the isotropic acoustic benchmark as we go
from the <em>r8g.16xlarge</em> to the <em>r8g.24xlarge</em> (~10% performance
improvement). However, this does not appear to be an issue when running more
complex propagator kernels such as TTI.</p>

<p><img src="/images/performance-relative-G2.png" alt="Performance relative to Graviton2" /></p>

<h3 id="price-performance">Price performance</h3>

<p>The overall picture for price-performance also looks good for AWS users,
though there are some nuances. In the bar chart below we combine the AWS
on-demand price of each instance with the benchmark performance in
<em>giga-points-per-second</em> to create a <em>tera-points-per-dollar</em>
(TP/$) metric. This measures how much work gets done per dollar, allowing
users to estimate the <em>price-to-solution</em>.</p>
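<p>To make the unit conversion concrete, here is a small helper showing how a throughput in giga-points-per-second and an hourly on-demand price combine into tera-points-per-dollar. The function name and example numbers are ours, for illustration only.</p>

```python
def tera_points_per_dollar(gpoints_per_second, usd_per_hour):
    """Convert sustained throughput (GP/s) and instance price ($/hr) to TP/$."""
    gpoints_per_hour = gpoints_per_second * 3600.0   # seconds per hour
    tpoints_per_hour = gpoints_per_hour / 1000.0     # giga -> tera
    return tpoints_per_hour / usd_per_hour

# A kernel sustaining 1 GP/s on a $3.60/hr instance delivers 1 TP/$:
example = tera_points_per_dollar(1.0, 3.60)
```

<p>Doubling throughput at a fixed hourly price, or halving the price at fixed throughput, both double the TP/$ value.</p>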

<p>For the isotropic acoustic and the self-adjoint acoustic TTI, we can see that
the Graviton4 delivers the highest throughput per dollar, followed by Graviton3
and Graviton2.</p>

<p>However, the Fletcher Du Fowler TTI benchmark presents an exception. In this
case, Graviton3 provides the highest throughput per dollar, followed by
Graviton4 and then Graviton2. Although both the isotropic acoustic and
self-adjoint TTI benchmarks are 81% faster on Graviton4 than on Graviton3, the
Fletcher Du Fowler TTI benchmark is only 51% faster on Graviton4 than on
Graviton3. Given that the theoretical increase in memory bandwidth is 75%, this
discrepancy warrants a deeper performance profiling analysis to understand why
this particular benchmark underperforms relative to the other benchmarks.</p>

<p><img src="/images/TP_per_dollar.png" alt="Price performance" /></p>

<h3 id="graviton4-r8g16xlarge-vs-r8g24xlarge">Graviton4 r8g.16xlarge vs r8g.24xlarge</h3>

<p>When choosing between the <em>r8g.16xlarge</em> and <em>r8g.24xlarge</em>
instances, it is important to consider the specific characteristics of your
workload. For workloads where the problem domain is too small to benefit from
strong scaling across all available cores, allocating the entire node and
running multiple shots per node can provide better value. This approach not
only maximizes resource utilization but also avoids the potential impact of
noisy neighbors in multi-tenant environments, to which an <em>r8g.16xlarge</em>
instance would be exposed.</p>

<p>By fully utilizing the <em>r8g.24xlarge</em> instance, which contains all 96 cores of the
Graviton4 processor, you can achieve more consistent performance, as the risk of
resource contention from other tenants is minimized. This strategy ensures you
get the best possible value from the Graviton4 architecture for your HPC tasks.</p>

<h3 id="conclusion">Conclusion</h3>

<p>The benchmarking results clearly demonstrate the impressive capabilities of the
AWS Graviton4 processor. With its increased core count and higher memory
bandwidth, the Graviton4 significantly enhances the performance of
high-performance computing (HPC) applications. Our benchmarks show that
DevitoPRO, utilizing AWS Graviton4, achieves substantial performance gains over
previous Graviton generations, particularly for memory-bound applications such
as seismic imaging.</p>

<p>The Graviton4’s advancements in core architecture and memory subsystem allow
DevitoPRO to run efficiently without requiring extensive modifications. This
highlights the ease of integration and the high productivity potential that
DevitoPRO offers for cutting-edge seismic imaging applications. The performance
improvements observed across various benchmarks—3D Isotropic Acoustic, Fletcher
Du Fowler TTI, and Self-adjoint TTI—underscore the generational leap in
computational efficiency provided by Graviton4.</p>

<p>Moreover, the Graviton4 processor also offers favorable price-performance
ratios, making it a cost-effective choice for demanding workloads. While certain
benchmarks like Fletcher Du Fowler TTI indicate areas for further optimization,
the overall enhancements in speed and throughput per dollar make Graviton4 an
attractive option for HPC users.</p>

<p>Our work with AWS underscores our commitment to delivering top-tier
performance across diverse hardware platforms, ensuring that our users can
achieve the best possible outcomes in seismic imaging and beyond.</p>

<h3 id="acknowledgements">Acknowledgements</h3>

<p>Many thanks to the AWS team for their generous provision of credits and
technical support, which made this benchmarking study possible. Their continued
collaboration and support are invaluable in our ongoing efforts to optimize
DevitoPRO for cutting-edge seismic imaging and high-performance computing
applications.</p>]]></content><author><name>ggorman</name></author><category term="AWS" /><category term="DSL" /><category term="Python" /><category term="HPC" /><category term="RTM" /><category term="FWI" /><category term="Seismic" /><category term="AWS" /><category term="Graviton" /><summary type="html"><![CDATA[AWS Graviton4 demonstrates significant performance improvements for seismic imaging using DevitoPRO's 3D acoustic wave propagation kernels. Benchmarks show Graviton4 is up to 3.6 times faster than Graviton2 and up to 81% faster than Graviton3, especially benefiting memory-bound HPC applications. Compiling with GCC 14.1 optimizes performance on Graviton4’s Neoverse V2 cores. While there are some performance nuances, overall Graviton4 delivers superior throughput per dollar, making it a cost-effective choice for demanding workloads. These advancements underscore Graviton4's capabilities in enhancing computational efficiency for seismic imaging and other high-performance computing tasks.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.devitocodes.com/images/Devito_with_arm_in_hand.png" /><media:content medium="image" url="https://www.devitocodes.com/images/Devito_with_arm_in_hand.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Advancements in elastic wave solvers using DevitoPRO and mixed-precision</title><link href="https://www.devitocodes.com/elastic" rel="alternate" type="text/html" title="Advancements in elastic wave solvers using DevitoPRO and mixed-precision" /><published>2024-08-15T00:01:00+00:00</published><updated>2024-08-15T00:01:00+00:00</updated><id>https://www.devitocodes.com/elastic</id><content type="html" xml:base="https://www.devitocodes.com/elastic"><![CDATA[<p>At Devito Codes, we are making significant strides in developing highly optimized elastic wave solvers using DevitoPRO. 
Elastic wave inversion provides a more comprehensive understanding of subsurface properties than acoustic inversion. By capturing both P-wave and S-wave data, elastic inversion enhances the resolution and accuracy of subsurface models, which is crucial for exploration geophysics and other applications. However, elastic RTM/FWI is significantly more complex and computationally expensive than acoustic RTM/FWI, making it vital to develop fast, innovative solutions.</p>

<p>Utilizing mixed-precision methods, we have found that we can nearly double the performance of wave propagators. Comparisons with standard 32-bit floating point implementations have shown negligible numerical errors. While mixed-precision support in DevitoPRO is still under development, we are working with energy and computer industry collaborators to accelerate our roadmap to bring mixed-precision support into production quickly.</p>

<h4 id="processor-technology-trends">Processor Technology Trends</h4>

<p>The rapid growth in machine learning and AI is driving all processor manufacturers to better support these workloads by devoting more silicon to 16-bit floating point (FP16) and other reduced-precision floating-point datatypes, with <a href="https://nvidianews.nvidia.com/news/nvidia-blackwell-platform-arrives-to-power-a-new-era-of-computing">Nvidia announcing FP4</a> and <a href="https://ir.amd.com/news-events/press-releases/detail/1201/amd-accelerates-pace-of-data-center-ai-innovation-and">AMD announcing FP4 and FP6</a> this year for their next generation of processors. At the same time, memory bandwidth on these processors is increasing more slowly than computational power.</p>

<p>These hardware trends challenge the vast majority of HPC code developers, who have traditionally relied upon FP64 arithmetic to maintain accuracy, and seismic imaging workloads, which predominantly use FP32.</p>

<p>For these reasons, mathematical software programmers across all HPC communities are increasingly focused on developing algorithms and software support for accuracy-preserving mixed precision. However, this will not be straightforward, necessitating an interdisciplinary approach and changes throughout the full software stack to achieve reliable results.</p>
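<p>A small NumPy experiment (ours, not from the post) shows why "accuracy-preserving" matters: naively summing many small FP16 terms stalls once the running total is large enough that each addend falls below half a unit in the last place, whereas accumulating in FP32 from FP16 storage stays accurate.</p>

```python
import numpy as np

terms = np.full(20000, 0.001, dtype=np.float16)   # true sum is ~20.0

# Naive FP16 accumulation: with only 11 significand bits, the running
# total stalls -- at 4.0 the FP16 spacing (~0.0039) is large enough that
# adding 0.001 rounds straight back to 4.0.
fp16_sum = np.float16(0.0)
for t in terms:
    fp16_sum = np.float16(fp16_sum + t)

# Accuracy-preserving alternative: FP16 storage, FP32 accumulation.
fp32_sum = terms.astype(np.float32).sum()
```

<p>The naive loop returns 4.0 instead of ~20, while the FP32 accumulation recovers the correct result from the same FP16-stored data.</p>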

<h4 id="devitopro-for-agility-and-performance-portability">DevitoPRO for agility and performance portability</h4>

<p>DevitoPRO enhances Devito’s core functionalities by incorporating advanced compiler techniques for automatic optimization of stencil computations. Users can write complex finite-difference schemes in high-level Python code, which is then translated into optimized, parallelized C/CUDA/HIP/SYCL code suitable for various hardware architectures. DevitoPRO includes many advanced algorithmic optimizations, such as the expanding-box technique, which focuses computation on active domains to reduce overhead.</p>

<p>Beyond performance enhancements, DevitoPRO offers essential features for production-level seismic imaging. It supports compression-based asynchronous serialization and intelligent data-streaming techniques for efficient disk-host-GPU transfers, significantly improving data management and performance during reverse time migration (RTM) and full-waveform inversion (FWI). These capabilities make DevitoPRO an indispensable tool for high-performance and scalable seismic imaging solutions.</p>
<h4 id="reduced-memory-pressure-and-mixed-precision">Reduced memory pressure and mixed-precision</h4>

<p>In DevitoPRO, we successfully doubled the speed of elastic propagators by using reduced-precision storage for model parameters and wavefields while using mixed-precision for floating-point arithmetic. This approach benefits from reduced memory pressure and makes use of available hardware support for mixed-precision. Our comparisons with pure FP32 implementations have shown negligible numerical errors, validating the effectiveness of this method.</p>
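<p>The storage/compute split can be sketched in a few lines of NumPy: keep the field in FP16 to halve the bytes moved through the memory system, but promote to FP32 before any arithmetic. This is our simplified stand-in for the compiler-generated kernels; the toy 5-point stencil and array sizes are chosen purely for illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(seed=0)
wavefield = rng.standard_normal((64, 64), dtype=np.float32)

# Reduced-precision storage: half the bytes moved per field access
stored = wavefield.astype(np.float16)

def laplacian(f):
    """5-point Laplacian; inputs are promoted to FP32 before any arithmetic."""
    f32 = f.astype(np.float32)
    return (-4.0 * f32[1:-1, 1:-1] + f32[:-2, 1:-1] + f32[2:, 1:-1]
            + f32[1:-1, :-2] + f32[1:-1, 2:])

reference = laplacian(wavefield)   # pure FP32 path
mixed = laplacian(stored)          # FP16 storage, FP32 arithmetic
max_abs_err = float(np.abs(mixed - reference).max())  # small, but non-zero
```

<p>With a fixed seed the maximum absolute error is small but non-zero, mirroring the "negligible numerical errors" observed when comparing against a pure FP32 implementation.</p>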

<h4 id="example-elastic-vti-running-on-intel-sapphire-rapids">Example: Elastic VTI running on Intel Sapphire Rapids</h4>

<p>As an example, we benchmarked a discretized form of elastic VTI running on the SEAM 3D model (Fehler and Larner, 2008) using DevitoPRO on a single-socket Intel Sapphire Rapids. The velocity-stress equations are solved using a space order of 8 and a first-order leap-frog time discretization on an 800x400x400 grid. Using FP32, the best result was 2.45 GP/s (giga-points-per-second), while using mixed-precision we achieved 4.1 GP/s, a 1.7x speedup.</p>

<p>For comparison, the left-hand panels below show a shot record and a wavefield snapshot from the mixed-precision run; the right-hand panels show the absolute error between the mixed-precision and standard FP32 solutions, scaled up by a factor of 500 so that it is visible.</p>

<p><img src="/images/rec-crop.png" alt="Shot record comparison" /> 
<img src="/images/tauxx-crop.png" alt="Wavefield comparison" /></p>
<h4 id="conclusion">Conclusion</h4>

<p>A major barrier to the adoption of mixed-precision arithmetic in production codes is the dual complexity of designing robust algorithms and managing the resulting software. DevitoPRO overcomes this by insulating application developers from this complexity through the Devito domain-specific language (DSL) and compiler pipeline.</p>

<p>Another significant barrier to developing mixed-precision software is the lack of standardized reduced-precision floating-point formats in programming languages and standard libraries such as MPI. Although initiatives such as the <a href="https://www.opencompute.org/blog/amd-arm-intel-meta-microsoft-nvidia-and-qualcomm-standardize-next-generation-narrow-precision-data-formats-for-ai">Microscaling Formats (MX) Alliance</a> are actively developing standards to simplify the integration and adoption of these new data formats across the industry, it may take years for these standards to be widely available throughout the entire software ecosystem. In the meantime, DevitoPRO has implemented portability layers that enable users to leverage reduced precision on currently supported platforms.</p>

<p>Even in cases where the mathematical formulation of the problem needs adjustment to make it more amenable to reduced-precision floating point arithmetic, these changes are at a high level where it is relatively straightforward to test and evaluate different variations. This agility is one of the reasons why DevitoPRO will be instrumental in the industry transition to leveraging mixed-precision hardware platforms.</p>

<p>Early results are promising, showing a 70% improvement in performance. Ongoing efforts aim to develop more sophisticated compiler passes to fully utilize mixed precision on modern processors, potentially achieving even greater speedups in the future. By continuing to innovate and optimize, DevitoPRO is poised to make elastic modeling and inversion more accessible and efficient, benefiting a wide range of scientific and engineering applications.</p>]]></content><author><name>mlouboutin</name></author><category term="RTM/FWI" /><category term="DSL" /><category term="Python" /><category term="HPC" /><category term="RTM" /><category term="FWI" /><category term="Seismic" /><category term="Elastic" /><category term="Mixed-precision" /><summary type="html"><![CDATA[Devito Codes is developing advanced elastic wave solvers using DevitoPRO, offering improved subsurface modeling by leveraging both P-wave and S-wave data. Experiments show the use of mixed-precision methods in DevitoPRO has nearly doubled the performance of wave propagators, showing negligible numerical errors compared to standard 32-bit implementations. Despite challenges in developing robust mixed-precision software, DevitoPRO simplifies this process through its domain-specific language and compiler pipeline, making it an essential tool for seismic imaging. 
Early benchmarks show a 70% performance improvement, with ongoing efforts to further optimize mixed-precision use, potentially leading to even greater speedups.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.devitocodes.com/images/making-waves.png" /><media:content medium="image" url="https://www.devitocodes.com/images/making-waves.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Announcing JUDI’s Support for DevitoPRO</title><link href="https://www.devitocodes.com/judi" rel="alternate" type="text/html" title="Announcing JUDI’s Support for DevitoPRO" /><published>2024-08-13T00:01:00+00:00</published><updated>2024-08-13T00:01:00+00:00</updated><id>https://www.devitocodes.com/judi</id><content type="html" xml:base="https://www.devitocodes.com/judi"><![CDATA[<p>As part of our ongoing efforts to enable rapid innovation and performance portability across the industry, we are pleased to announce that as of <a href="https://slimgroup.github.io/JUDI.jl/dev/">JUDI (Julia Devito Inversion framework)</a> v3.4.5, <a href="https://www.devitocodes.com/features/">DevitoPRO</a> is supported as a bring-your-own-license (BYOL) feature. This brings production-grade performance portability and scalability to JUDI solvers for modeling, inversion, machine learning, and much more.</p>

<p><a href="https://slimgroup.github.io/JUDI.jl/dev/">JUDI (the Julia Devito Inversion framework)</a> is an open-source Julia-based package for large-scale seismic modeling and inversion designed by Georgia Tech’s Seismic Laboratory for Imaging and Modeling (<a href="https://slim.gatech.edu">SLIM</a>) to translate wave-based
algorithms into fast, scalable code suitable for industry-size 3D problems on clusters or in the cloud. Built on top of <a href="https://www.devitoproject.org/">Devito</a>, a Python domain-specific language for automated finite-difference computations, JUDI leverages Devito’s symbolic API to generate high-performance wave propagation kernels. This integration combines Devito’s computational power with Julia’s flexibility, enabling efficient simulations and the implementation of PDE-constrained optimizations like full-waveform inversion (FWI) and imaging (LS-RTM). JUDI’s modeling operators can also be integrated into neural networks for physics-augmented deep learning, as highlighted in SLIM’s Leading Edge article, <a href="https://library.seg.org/doi/full/10.1190/tle42070474.1">Learned
multiphysics inversion with differentiable programming and machine
learning</a>.</p>

<p>The integration of DevitoPRO into JUDI enables large-scale and real-world seismic inversion simulations by introducing key performance and memory management improvements. These include support for CUDA/HIP/SYCL for GPUs, domain-specific optimizations such as automatic source/receiver expanding box in the propagator for both forward and adjoint solves, and asynchronous wavefield serialization with lossy/lossless compression. Additionally, the fully supported DevitoPRO decoupler allows single-shot domain decomposition over multiple devices or NUMA domains while maintaining a serial Julia process. These features enable state-of-the-art performance and scalability, particularly for long-offset FWI, high-frequency RTM, and full-wavefield imaging, all of which have a large memory footprint and computational requirements. Users with a DevitoPRO license benefit from enhanced performance, improved scalability, and an easier transition from prototyping to production environments, facilitating the application of their projects to real-world scenarios.</p>

<h2 id="judi-use-cases-from-slim">JUDI use cases from SLIM</h2>

<p>JUDI has been successfully applied in various advanced research and cloud imaging projects, highlighting its versatility and power in seismic modeling and inversion. One significant use case is <a href="https://slim.gatech.edu/Publications/Public/Journals/Geophysics/2024/yin2023wise/paper.html">WISE: full-Waveform variational Inference via Subsurface Extensions</a>, where JUDI is used to combine variational inference and conditional normalizing flows for probabilistic full-waveform inference. This approach helps reduce the reliance on accurate initial migration-velocity models and enables reliable uncertainty quantification in velocity models, showcasing JUDI’s capability to generate high-quality seismic images that include uncertainty.</p>

<p>Another notable application is in <a href="https://slim.gatech.edu/Publications/Public/Conferences/SEG/2021/yin2021SEGcts/yin2021SEGcts.html">compressive time-lapse seismic monitoring of carbon storage and sequestration</a>. JUDI was employed in this project to improve the efficiency and accuracy of time-lapse seismic data acquisition using a joint recovery model. This method allows for high-quality monitoring of CO<sub>2</sub> plumes over extended periods, which is crucial for carbon capture and storage (CCS) projects.</p>

<p>In the realm of <strong>machine learning and inversion</strong>, JUDI was integral to the development of a <a href="https://slim.gatech.edu/Publications/Public/Conferences/SEG/2022/yin2022SEGlci/paper.html">learned coupled inversion framework</a> for carbon sequestration monitoring. This framework uses Fourier Neural Operators (FNOs) to estimate permeability from time-lapse seismic data, enabling the forecasting of CO<sub>2</sub> plume behavior with improved accuracy and computational efficiency.</p>

<p>JUDI also forms an essential component in the development of a Digital Twin for Geological Carbon Storage. In this approach, JUDI provides wave simulation and imaging capabilities that undergird training of the Digital Twin’s generative neural networks (see <a href="https://library.seg.org/doi/10.1190/tle42110730.1">President’s Page: <em>Digital twins in the era of generative AI</em></a>).</p>

<p>Furthermore, JUDI has played a role in the development of <strong>serverless imaging on the cloud</strong>, which involves deploying seismic imaging processes on cloud infrastructure without the need for traditional server management. This approach allows for scalable and cost-effective seismic data processing, making it accessible for large-scale industrial applications (see <a href="https://ieeexplore.ieee.org/document/9044390">An Event-Driven Approach to Serverless Seismic Imaging in the Cloud</a>).</p>

<p>These examples underline JUDI’s extensive applicability across various domains in seismic research, particularly in leveraging <a href="https://slim.gatech.edu/research">advanced computational methods</a> and cloud-based solutions.</p>

<h3 id="conclusion-and-outlook">Conclusion and Outlook</h3>

<p>The integration of DevitoPRO into JUDI represents a significant advancement in the field of seismic imaging and inversion, offering users enhanced performance, scalability, and new capabilities. As we look to the future, several promising developments are on the horizon. These include the continuous expansion of features, such as more sophisticated wave-equation solvers and advanced optimization techniques. Additionally, the integration of cutting-edge machine learning algorithms, generative AI, and physics-informed neural networks is expected to further improve the accuracy and computational efficiency of seismic imaging and inversion.</p>

<p>Continued collaboration within the user community will be essential to drive innovation and address emerging challenges. Open-source contributions, feedback, and joint research initiatives will play a crucial role in advancing JUDI and DevitoPRO, ensuring they remain at the forefront of seismic technology.</p>

<p>The ongoing support and development by Devito Codes, particularly through the contributions of <a href="https://www.devitocodes.com/about/#mathias-louboutin-senior-solution-architect">Dr Mathias Louboutin</a>, who has been instrumental in both JUDI’s initial development and its continued evolution, underline the commitment to delivering value to the open-source community. This collaboration bridges the gap between academic research and production-level applications, positioning JUDI and Devito Codes as leaders in providing comprehensive solutions to accelerate innovations in the field of seismic imaging and inversion.</p>

<p>We encourage users to explore these new functionalities and actively engage with the community to advance further this powerful toolset, continuing the journey of innovation and excellence in seismic research and applications.</p>]]></content><author><name>Gerard Gorman (CEO)</name></author><category term="RTM/FWI" /><category term="DSL" /><category term="Python" /><category term="HPC" /><category term="RTM" /><category term="FWI" /><category term="Seismic" /><category term="Julia" /><category term="ML" /><category term="AI" /><summary type="html"><![CDATA[JUDI's latest update, version 3.4.5, now supports DevitoPRO as a bring-your-own-license feature, significantly enhancing its seismic imaging and inversion capabilities. This integration introduces advanced performance and scalability, particularly for large-scale simulations, performance portability across all major CPUs and GPUs and a wide range of domain-specific optimizations. JUDI, an open-source Julia-based framework, already excels in high-performance wave propagation and machine learning integration. The update enables cutting-edge applications such as probabilistic full-waveform inversion, generative AI, carbon storage monitoring, and serverless cloud imaging. 
This collaboration marks a major step forward in bridging academic research and production-level seismic applications, driving innovation and excellence in the field.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.devitocodes.com/images/judi.png" /><media:content medium="image" url="https://www.devitocodes.com/images/judi.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Announcing Devito and DevitoPRO v4.8.7</title><link href="https://www.devitocodes.com/v4.8.7" rel="alternate" type="text/html" title="Announcing Devito and DevitoPRO v4.8.7" /><published>2024-06-11T00:01:00+00:00</published><updated>2024-06-11T00:01:00+00:00</updated><id>https://www.devitocodes.com/v4.8.7</id><content type="html" xml:base="https://www.devitocodes.com/v4.8.7"><![CDATA[<p>Devito and DevitoPRO v4.8.7 brings a host of new features, enhancements, and
optimizations to our high-performance computing (HPC) tools.</p>

<h3 id="about-open-source-devito">About open-source Devito</h3>

<p><a href="https://www.devitoproject.org/">Devito</a> is an invaluable tool for solving
partial differential equations (PDEs) on structured grids, generating optimized
C code from a high-level symbolic specification, facilitating HPC applications.</p>

<h4 id="key-features">Key Features:</h4>

<ul>
  <li>High-productivity symbolic PDE solver specification with symbolic computation capabilities.</li>
  <li>Automatic code generation, producing finite-difference (FD) kernels, targeting a diverse range of architectures.</li>
  <li>Comprehensive, multi-level performance optimization.</li>
</ul>
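<p>To make the first two bullets concrete, the sketch below hand-codes the kind of explicit finite-difference update whose generation Devito automates; in Devito the same scheme is written symbolically (roughly <code>Eq(u.forward, solve(u.dt - nu*u.dx2, u.forward))</code>) and the optimized loop nest is produced for you. The 1-D heat equation and all names here are our illustrative choices, not taken from the release notes.</p>

```python
import numpy as np

# Hand-written explicit kernel for u_t = nu * u_xx -- the kind of
# low-level update loop that Devito's generated C code implements
# (with vectorization, blocking, and parallelism added automatically).
def heat_step(u, nu, dt, dx):
    unew = u.copy()
    unew[1:-1] += nu * dt / dx**2 * (u[2:] - 2.0 * u[1:-1] + u[:-2])
    return unew

x = np.linspace(0.0, 1.0, 51)
u = np.sin(np.pi * x)            # initial condition, zero at both ends
for _ in range(100):             # dt respects the stability bound dt <= dx**2/(2*nu)
    u = heat_step(u, nu=1.0, dt=1e-4, dx=x[1] - x[0])
```

<p>After 100 steps the sine profile has decayed but kept its shape, matching the analytic behaviour of the heat equation.</p>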

<h3 id="about-devitopro">About DevitoPRO</h3>

<p><a href="https://www.devitocodes.com/">DevitoPRO</a> builds upon Devito open-source’s
foundation, adding advanced features and optimizations specifically for
high-productivity HPC. This commercial extension targets seismic imaging and
inversion workloads, supporting multiple hardware architectures, including AMD,
Intel and Nvidia GPUs, to deliver enhanced performance and reduced development
time.</p>

<h4 id="key-features-1">Key Features:</h4>

<ul>
  <li>Additional DSL abstractions for developing RTM/FWI solutions.</li>
  <li>Enhanced compiler and toolchain for enterprise needs.</li>
  <li>Advanced GPU acceleration using vendor-native languages (CUDA, HIP, SYCL).</li>
</ul>

<p>Devito and DevitoPRO are released in lockstep. The new features and improvements
in version 4.8.7, detailed below, collectively enhance the performance,
usability, and scalability of both tools, making them indispensable for
high-performance computing applications in fields such as seismic imaging and
inversion.</p>

<h2 id="new-features-in-devito-v487">New Features in Devito v4.8.7</h2>

<h3 id="api-enhancements">API Enhancements</h3>

<ul>
  <li><strong>Time Derivatives Expansion</strong>: Expand time derivatives in generated code for better clarity and consistency in computations.</li>
  <li><strong>Improved Derivatives API</strong>: Cleaner abstractions that allow interpolants and interpolated derivatives to be specified compactly.</li>
  <li><strong>SparseFunction Improvements</strong>: Revamp sparse subfunction setup and sparse dimension handling to optimize grid operations.</li>
  <li><strong>Improve examples</strong>: Clarify sparse function setup and interpolation examples.</li>
</ul>

<h3 id="compiler-improvements">Compiler Improvements</h3>

<ul>
  <li><strong>Device-Aware Blocking</strong>: Implement device-aware blocking and refine related tests to improve performance on multiple architectures.</li>
  <li><strong>Elementary Functions Optimization</strong>: Make code generation of elementary functions dtype-aware for improved type handling.</li>
  <li><strong>ConditionalDimension Placement</strong>: Fix placement of <code class="language-plaintext highlighter-rouge">ConditionalDimension</code> within loop nests when used in combination with subdomains.</li>
  <li><strong>Memory Management</strong>: Restructure <code class="language-plaintext highlighter-rouge">MemoryAllocator</code> hierarchy to streamline memory allocation and deallocation processes.</li>
</ul>

<h3 id="mpi-and-parallel-computing">MPI and Parallel Computing</h3>

<ul>
  <li><strong>MPI Initialization and Finalization</strong>: Refine optional MPI initialization and finalization to better manage parallel execution environments.</li>
  <li><strong>Threading Support</strong>: Initialize MPI with threading support to leverage multi-threading capabilities.</li>
  <li><strong>C-Level MPI_Allreduce</strong>: Add support for C-level <code class="language-plaintext highlighter-rouge">MPI_Allreduce</code> to facilitate efficient distributed reductions.</li>
  <li><strong>Data Gathering Fixes</strong>: Fix data gathering for sparse functions to ensure accurate data collection in parallel computations.</li>
  <li><strong>Halo Touch Sequentialization</strong>: Sequentialize halo touch operations to prevent race conditions in parallel environments.</li>
  <li><strong>MPI0 Logging Level</strong>: Add <code class="language-plaintext highlighter-rouge">MPI0</code> logging level to restrict performance logging to rank 0, reducing overhead in multi-rank setups.</li>
</ul>

<h3 id="tutorials-and-examples">Tutorials and Examples</h3>

<ul>
  <li><strong>ADER-FD</strong>: Add notebook demonstrating implementation of ADER-FD schemes.</li>
  <li><strong>Tutorials for new API features</strong>: Add and update examples for enhanced FD API and sinc interpolation/injection.</li>
</ul>

<h3 id="testing-and-continuous-integration-ci">Testing and Continuous Integration (CI)</h3>

<ul>
  <li><strong>Parallel Marker Revamp</strong>: Revamp parallel markers in tests to improve test coverage and reliability in parallel execution scenarios.</li>
</ul>

<h3 id="docker-and-build-enhancements">Docker and Build Enhancements</h3>

<ul>
  <li><strong>Docker Build Fixes</strong>: Fix Nvidia Docker builds and Intel OneAPI setups to streamline containerized deployments and compatibility.</li>
  <li><strong>ARM Architecture Support</strong>: Build ARM base images to support ARM architecture, extending the range of supported hardware platforms.</li>
</ul>

<h3 id="miscellaneous-enhancements">Miscellaneous Enhancements</h3>

<ul>
  <li><strong>Logging and Error Handling Improvements</strong>: Improve logging and error handling for MPI and compiler sniff operations to provide clearer diagnostics and error messages.</li>
  <li><strong>Code Cleanup and Refactoring</strong>: Perform extensive code cleanup and refactoring, including polishing built-ins, updating cached properties, and removing unnecessary code to maintain a clean and efficient codebase.</li>
  <li><strong>Dependency Updates</strong>: Drop support for Python 3.7 and update requirements for dependencies like <code class="language-plaintext highlighter-rouge">mpi4py</code> to ensure the project stays up-to-date with the latest software versions and standards.</li>
</ul>

<h2 id="new-features-in-devitopro-v487">New Features in DevitoPRO v4.8.7</h2>

<h3 id="performance-optimizations">Performance Optimizations</h3>

<ul>
  <li><strong>Advanced Compiler Optimizations</strong>: Implement various compiler optimizations to reduce runtime and increase computational efficiency.</li>
  <li><strong>Multi-GPU Scaling</strong>: Enhance scaling across multiple GPUs, allowing better utilization of hardware resources and significantly improving performance in large-scale simulations.</li>
  <li><strong>Device-Aware Blocking</strong>: Add device-aware blocking to optimize memory and compute resource allocation for heterogeneous computing environments, including CPU and GPU.</li>
</ul>

<h3 id="api-enhancements-1">API Enhancements</h3>

<ul>
  <li><strong>Extended Symbolic Functionality</strong>: Introduce new symbolic operations and functions to support more sophisticated mathematical models and simulations.</li>
  <li><strong>Boundary Conditions Support</strong>: Improve support for specifying boundary conditions and provide greater flexibility in defining simulation domains.</li>
</ul>

<h3 id="parallel-computing-and-mpi">Parallel Computing and MPI</h3>

<ul>
  <li><strong>Improved MPI Parallelism</strong>: Enhance parallel computing capabilities using MPI, ensuring better scalability on large clusters and more efficient parallel execution.</li>
  <li><strong>C-Level MPI_Allreduce</strong>: Add support for C-level <code class="language-plaintext highlighter-rouge">MPI_Allreduce</code> operations to facilitate efficient distributed reductions, improving the performance of parallel algorithms.</li>
</ul>

<h3 id="debugging-and-profiling-tools">Debugging and Profiling Tools</h3>

<ul>
  <li><strong>New Debugging Tools</strong>: Introduce new debugging tools and logging capabilities, making it easier to troubleshoot and optimize simulations.</li>
  <li><strong>Performance Profiling</strong>: Integrate performance profiling tools to help developers identify and address bottlenecks in their code, ensuring optimal performance.</li>
</ul>

<h3 id="compiler-improvements-1">Compiler Improvements</h3>

<ul>
  <li><strong>Memory Management</strong>: Restructure the <code class="language-plaintext highlighter-rouge">MemoryAllocator</code> hierarchy to streamline memory management processes, reducing overhead and improving performance.</li>
</ul>

<h3 id="miscellaneous-enhancements-1">Miscellaneous Enhancements</h3>

<ul>
  <li><strong>Preconditioning Techniques</strong>: Add new preconditioning techniques to improve the convergence rates of iterative solvers, making simulations more efficient and accurate.</li>
  <li><strong>Code Cleanup and Refactoring</strong>: Perform extensive code cleanup and refactoring, including polishing built-ins, updating cached properties, and removing unnecessary code, ensuring a clean and maintainable codebase.</li>
  <li><strong>Dependency Updates</strong>: Update software requirements and dependencies to ensure compatibility with the latest versions and standards, including dropping support for outdated Python versions and updating key libraries.</li>
</ul>]]></content><author><name>Gerard Gorman (CEO)</name></author><category term="RTM/FWI" /><category term="DSL" /><category term="Python" /><category term="HPC" /><category term="RTM" /><category term="FWI" /><category term="Seismic" /><summary type="html"><![CDATA[Devito and DevitoPRO v4.8.7 introduce numerous enhancements and optimizations for performance portable RTM/FWI. Devito, an open-source tool for solving partial differential equations, features improved symbolic computation, code generation, and performance optimizations. DevitoPRO, the commercial extension, offers advanced features for seismic imaging, supporting multiple hardware architectures, and enhanced performance. New version updates include API enhancements, compiler improvements, parallel computing refinements, better memory management, and updated tutorials. These updates collectively improve usability, scalability, and performance, making them essential for HPC applications in fields like seismic imaging and inversion.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.devitocodes.com/images/release-4.8.7.png" /><media:content medium="image" url="https://www.devitocodes.com/images/release-4.8.7.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">S-Cube integrates DevitoPRO for performance-portable innovation</title><link href="https://www.devitocodes.com/s-cube" rel="alternate" type="text/html" title="S-Cube integrates DevitoPRO for performance-portable innovation" /><published>2024-05-31T00:01:00+00:00</published><updated>2024-05-31T00:01:00+00:00</updated><id>https://www.devitocodes.com/s-cube</id><content type="html" xml:base="https://www.devitocodes.com/s-cube"><![CDATA[<p>At Devito Codes, we take pride in fostering collaborations that push the
boundaries of technological innovation. Our ongoing work with
<a href="https://www.s-cube.com/">S-Cube</a> exemplifies how synergy between companies can
drive advancements in computational science.  Through this collaboration, we
have harnessed our strengths in developing high-productivity, accurate, and
performance-portable wave propagators using DevitoPRO, while S-Cube has
developed several cutting-edge algorithms for seismic imaging, such as <a href="https://www.s-cube.com/">XWI
(X-Wave Full Waveform Inversion)</a>. XWI boasts superior
predictive power, enabling more accurate subsurface models in complex geological
settings.  Implementing these innovative algorithms on top of the DevitoPRO
framework ensures they are performance-portable across all major CPUs (ARM64 and
x86_64) and GPUs (AMD/HIP, Intel/SYCL, Nvidia/CUDA), providing complete
flexibility and efficiency in diverse computational environments.</p>

<h4 id="enhancing-wave-propagation-technology">Enhancing Wave Propagation Technology</h4>

<p>Devito Codes has consistently delivered state-of-the-art solutions for wave
propagation problems, emphasizing productivity and accuracy. Our operators are
designed to be performance-portable, ensuring they can run efficiently on all
major CPU and GPU platforms. This flexibility is crucial in the dynamic
landscape of computational geophysics, where adaptability to different
computational environments is vital for optimizing time-to-solution and
price-to-solution, and for overcoming hardware supply constraints.</p>

<p>In collaboration with S-Cube and AWS, we benchmarked DevitoPRO operators on a
wide range of AWS SKUs so price performance could be accurately estimated before
running any seismic imaging processes. This enabled S-Cube to optimize the
efficiency of seismic workloads on AWS.</p>

<p>DevitoPRO is also enabling S-Cube to quickly develop the next generation of
elastic wave solvers to extend their existing state-of-the-art seismic imaging
algorithms. Rapid innovation in these areas is essential to improving the
accuracy and efficiency of characterizing complex geological formations. We
achieve performance-portability and accelerated innovation in seismic data
processing by combining DevitoPRO code generation capabilities with S-Cube’s
innovative inversion algorithms.</p>

<h4 id="the-power-of-collaboration">The Power of Collaboration</h4>

<p>Working together, Devito Codes and S-Cube have demonstrated that collaboration
can lead to breakthroughs that would be challenging to achieve independently.
Our joint efforts have resulted in a suite of tools that have enhanced S-Cube’s
capabilities and brought DevitoPRO to a wider range of companies.</p>

<ul>
  <li><strong>Innovation</strong>: Integrating DevitoPRO into S-Cube’s seismic imaging algorithms has led to accelerated innovation in the development of new methodologies that improve the accuracy of subsurface imaging.</li>
  <li><strong>Performance</strong>: By leveraging our performance-portable operators, S-Cube’s algorithms can run on a variety of hardware platforms, making advanced seismic imaging techniques more accessible.</li>
</ul>

<h4 id="achievements-through-partnership">Achievements Through Partnership</h4>

<p>The collaboration between Devito Codes and S-Cube is a testament to what can be
achieved when companies work together towards a common goal. It underscores the
importance of combining expertise from different domains to tackle complex
challenges. This partnership has not only enhanced our technological offerings
but also provided valuable insights that will guide future developments.</p>

<p>At Devito Codes, we are committed to continuing our work with a diverse range of
service companies, cloud vendors, and hardware providers. Our goal is to ensure
that our solutions remain at the forefront of innovation, providing our clients
with the tools they need to succeed in an ever-evolving industry.</p>

<p>By fostering such collaborations, we aim to contribute to the advancement of
computational science and its applications in geophysics and beyond. The success
of our partnership with S-Cube is a clear indication that, working together, we
can achieve remarkable results and drive the field forward.</p>]]></content><author><name>Gerard Gorman (CEO)</name></author><category term="RTM/FWI" /><category term="DSL" /><category term="Python" /><category term="HPC" /><category term="Cloud" /><category term="AWS" /><category term="RTM" /><category term="FWI" /><category term="Seismic" /><summary type="html"><![CDATA[At Devito Codes, we drive innovation through collaboration, exemplified by our partnership with S-Cube. Combining our expertise in performance-portable wave propagators with S-Cube's advanced seismic imaging algorithms like XWI, we enhance subsurface imaging accuracy and efficiency. Our joint efforts ensure our solutions run seamlessly across major CPUs and GPUs, optimizing performance and cost-efficiency. This collaboration has accelerated the development of next-gen elastic wave solvers, highlighting the power of teamwork in advancing computational geophysics and beyond.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.devitocodes.com/images/collaborate.png" /><media:content medium="image" url="https://www.devitocodes.com/images/collaborate.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Announcing DevitoPRO SYCL Code Generation</title><link href="https://www.devitocodes.com/sycl" rel="alternate" type="text/html" title="Announcing DevitoPRO SYCL Code Generation" /><published>2024-05-13T00:01:00+00:00</published><updated>2024-05-13T00:01:00+00:00</updated><id>https://www.devitocodes.com/sycl</id><content type="html" xml:base="https://www.devitocodes.com/sycl"><![CDATA[<p>This week at ISC 2024 in Hamburg, we are thrilled to introduce SYCL code
generation support for DevitoPRO, specifically optimized for Intel’s Data Center
GPU Max 1100 and 1550. This advancement, developed in collaboration with Intel,
extends our OpenMP offloading support in open-source Devito and provides a
robust SYCL capability essential for delivering high-performance for seismic
imaging workloads on Intel GPUs.</p>

<p>SYCL is a versatile C++-based parallel programming framework that facilitates
code portability across diverse computing architectures including CPUs, GPUs,
and FPGAs from various vendors. The integration of SYCL into DevitoPRO means
users can now deploy their existing Devito applications on Intel GPUs
effortlessly: specifying a different target architecture is all that is needed
for just-in-time compilation to reap the performance benefits of SYCL.</p>

<p>This update empowers DevitoPRO users with true performance portability across
all major CPU and GPU vendors.</p>

<h4 id="overview-of-devito-and-devitopro">Overview of Devito and DevitoPRO</h4>

<p>Devito and DevitoPRO provide high-level abstractions that shield developers from
the complexities of porting and optimizing code across different GPU platforms.
For HPC specialists, Devito also offers the capability to tweak the generated
code, offering further customization. This strategy significantly cuts
development time and avoids vendor lock-in, granting users genuine flexibility
in their hardware choices.</p>

<p><strong>Devito</strong>: A robust, open-source Python-based DSL and compiler, Devito
capitalizes on high-level symbolic definitions to produce optimized
finite-difference computational kernels across multiple CPU and GPU platforms.
Developed initially at Imperial College London in collaboration with the SLIM
group at GaTech, Devito supports the MPI, OpenMP, and OpenACC parallel
programming models, providing a high-productivity solution for both academic and commercial
applications.</p>

<p><strong>DevitoPRO</strong>: Serving primarily the energy sector, DevitoPRO is an enhanced
commercial version of Devito designed for maximizing performance portability in
seismic imaging. With the new addition of SYCL code generation for Intel GPUs,
DevitoPRO now offers greater adaptability across GPU platforms from all leading
vendors.</p>

<h4 id="new-features-in-devitopro">New Features in DevitoPRO</h4>

<p>We continuously refine our code generation through iterative benchmarking
against manually optimized codes. This process ensures DevitoPRO not only
matches but frequently surpasses the performance of hand-tuned implementations
across various GPUs. The new SYCL integration allows for seamless switching
between target backends, ensuring application consistency and performance across
different architectures. Thanks to Intel’s support, we’ve also incorporated
an Intel GPU Max 1100 into our development cluster to boost our testing and
optimization capabilities.</p>

<h4 id="open-source-contributions-and-openmp-support">Open Source Contributions and OpenMP Support</h4>

<p>The open-source iteration of Devito includes OpenMP support for Intel GPUs,
broadening its usability across various research and development applications.
Our vibrant community on Slack and GitHub is instrumental in continually
enhancing Devito, ensuring it stays at the cutting edge of computational science
for simulations, inversions, and optimizations based on finite differences.</p>

<h4 id="conclusion">Conclusion</h4>

<p>The introduction of SYCL code generation in DevitoPRO marks a crucial
advancement in our mission to deliver high-performance, high-productivity
computing solutions across major CPU and GPU platforms. We value and encourage user
feedback to further refine and evolve our technologies.</p>

<p>Please <a href="mailto:gerard@DevitoCodes.com">contact us</a> for trial licenses or
benchmarks of DevitoPRO with the new SYCL capabilities. For more details or to
start utilizing the new features of DevitoPRO, visit our website and reach out
to our team through the contact links provided.</p>]]></content><author><name>Gerard Gorman (CEO)</name></author><category term="Intel/SYCL" /><category term="DSL" /><category term="Python" /><category term="HPC" /><category term="Cloud" /><category term="AMD" /><category term="ARM" /><category term="Intel" /><category term="Nvidia" /><summary type="html"><![CDATA[Devito Codes introduces SYCL code generation in DevitoPRO optimized for Intel's GPU Max Series 1100 and 1550, enhancing performance in high-compute tasks like seismic imaging. This update, developed in collaboration with Intel, enables seamless use of existing Devito application code across various architectures without modifications. This enhancement solidifies DevitoPRO's commitment to performance portability and high productivity across all major HPC processor architectures.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.devitocodes.com/images/intel_data_center_gpu_max_series.png" /><media:content medium="image" url="https://www.devitocodes.com/images/intel_data_center_gpu_max_series.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Devito Codes on the road to IMAGE 2023</title><link href="https://www.devitocodes.com/image2023" rel="alternate" type="text/html" title="Devito Codes on the road to IMAGE 2023" /><published>2023-08-21T00:01:00+00:00</published><updated>2023-08-21T00:01:00+00:00</updated><id>https://www.devitocodes.com/image2023</id><content type="html" xml:base="https://www.devitocodes.com/image2023"><![CDATA[<h1 id="devitopro-at-image-2023">DevitoPRO at IMAGE 2023</h1>

<p>Devito Codes, an Independent Software Vendor (ISV), takes great pride in specializing in Devito, an open-source Domain Specific Language (DSL) crafted for optimizing finite-difference operators used in RTM and FWI. Our enterprise solution, DevitoPRO, is a cutting-edge platform that facilitates simulation, inversion, and optimization by automatically generating tuned parallel software for various architectures, including AMD, ARM, Intel, and Nvidia. Primarily targeting the field of exploration geophysics, DevitoPRO seamlessly blends ease of use with high performance, drastically reducing development time from months to days, and ensuring performance portability across various computer systems.</p>

<h2 id="engaging-at-image-2023">Engaging at IMAGE 2023</h2>

<p>We’re eagerly looking forward to engaging with the community at IMAGE 2023 this year. The past year has been filled with excitement, innovation, and substantial growth in both the open-source Devito platform and our enterprise product, DevitoPRO.</p>

<p>Our collaboration with processor manufacturers like Nvidia, AMD, and Intel has been instrumental in optimizing Devito and DevitoPRO on all major CPUs and GPUs. We also actively work with Cloud providers such as AWS and Azure to benchmark and optimize containerized deployments with Devito. Our approach has been collaborative and forward-thinking, ensuring we stay at the forefront of technological advancement.</p>

<h2 id="year-highlights-since-image2022">Year Highlights Since IMAGE 2022</h2>

<ul>
  <li><strong>Increased Adoption</strong>: Over the past year, DevitoPRO has seen a surge in adoption by Energy producers and service companies, both for R&amp;D and production runs.</li>
  <li><strong>Benchmarking Efforts</strong>: Working closely with the geophysics community, processor manufacturers, and cloud companies to develop a standardized cross-platform benchmarking suite.</li>
  <li><strong>Growing the Team</strong>: In July, we welcomed long-time collaborator Dr. Mathias Louboutin to our team as Senior Solution Architect, a fantastic addition to support the community in developing solutions on top of DevitoPRO and driving new features.</li>
</ul>

<h3 id="key-features-and-updates">Key Features and Updates</h3>

<p>The Devito compiler has seen an impressive array of enhancements, fixes, and new features:</p>

<h4 id="1-compiler-enhancements-and-fixes"><strong>1. Compiler Enhancements and Fixes</strong>:</h4>
<p>Many improvements cater to various compiler capabilities, including compatibility with different processors, optimization enhancements, and more.</p>

<h4 id="2-parallelism-and-synchronization"><strong>2. Parallelism and Synchronization</strong>:</h4>
<p>Focus on augmenting parallelization, blocking, and synchronization logic to boost efficiency.</p>

<h4 id="3-gpu-support"><strong>3. GPU Support</strong>:</h4>
<p>Numerous enhancements for AMD, Intel and Nvidia GPUs.</p>

<h4 id="4-buffering-and-memory-management"><strong>4. Buffering and Memory Management</strong>:</h4>
<p>Revisions to buffering logic and memory handling functionalities.</p>

<h4 id="5-code-generation-and-linearization"><strong>5. Code Generation and Linearization</strong>:</h4>
<p>Refinements to enhance code manipulation and efficiency.</p>

<h4 id="6-cse-and-optimization"><strong>6. CSE and Optimization</strong>:</h4>
<p>Robust improvements in Common Subexpression Elimination and other optimization strategies.</p>

<h4 id="7-testing-and-documentation"><strong>7. Testing and Documentation</strong>:</h4>
<p>Comprehensive updates to ensure code quality and thorough documentation.</p>

<h4 id="8-mpi-openmp-and-hpc-related-enhancements"><strong>8. MPI, OpenMP, and HPC-related Enhancements</strong>:</h4>
<p>Enhancing support for parallel processing and high-performance computing.</p>

<h4 id="9-miscellaneous-improvements-and-fixes"><strong>9. Miscellaneous Improvements and Fixes</strong>:</h4>
<p>General enhancements to improve functionality and user experience.</p>

<h4 id="10-docker-updates"><strong>10. Docker Updates</strong>:</h4>
<p>Optimizations to the Docker environment for a seamless development experience on different platforms.</p>

<h4 id="11-architecture-and-other-enhancements"><strong>11. Architecture and Other Enhancements</strong>:</h4>
<p>Broad support for various processors and technologies, along with general improvements in data handling, benchmarking, profiling, and more.</p>

<p>The above changes represent a considerable evolution of the Devito compiler, encapsulating efficiency, compatibility, GPU support, parallelism, memory management, testing, and robustness.</p>

<h3 id="recent-updates-to-devitopro">Recent Updates to DevitoPRO</h3>

<p>A brief rundown of the substantial advancements in DevitoPRO includes:</p>

<ul>
  <li><strong>Docker Integration, Submodule Handling, Allocator and Abox Fixes</strong>: These include a wide range of integrations, fixes, and improvements.</li>
  <li><strong>Parametric Blocking, MPI Integration, Tuning, and Optimization</strong>: Key advancements in various technical aspects.</li>
  <li><strong>Compression, Serialization, Testing, and Benchmarking Enhancements</strong>: Significant strides in data handling and performance metrics.</li>
  <li><strong>New Demos and Examples, Docker and CI/CD Configuration Enhancements, Miscellaneous Adjustments</strong>: Introduction of new features, improvements, and extensive development in testing, demos, and more.</li>
</ul>

<h2 id="about-devito-codes">About Devito Codes</h2>

<p>Devito Codes is dedicated to pioneering solutions in the realm of computational
imaging, in particular in the field of exploration geophysics and ultrasound
imaging. DevitoPRO is designed to optimize and automate the generation of
finite-difference kernels. With a focus on high performance, high-level
abstractions and user-friendliness, we are committed to reducing development
time while enhancing cross-platform compatibility and performance portability.
By working closely with the community and industry leaders, we are continually
shaping the future of computational geophysics. We invite you to connect with
us at IMAGE 2023 and explore the exciting possibilities that Devito Codes has
to offer!</p>]]></content><author><name>Gerard Gorman (CEO)</name></author><category term="HPC" /><category term="DSL" /><category term="Python" /><category term="HPC" /><category term="Cloud" /><category term="AMD" /><category term="ARM" /><category term="Intel" /><category term="Nvidia" /><summary type="html"><![CDATA[Devito Codes is showcasing DevitoPRO at IMAGE 2023, a cutting-edge platform for optimizing finite-difference operators in RTM and FWI. It enables simulation and optimization across various architectures, reducing development time. The past year saw increased adoption, collaboration with tech giants, team growth, substantial enhancements to the compiler, and advancements in DevitoPRO, confirming their commitment to innovation in computational geophysics.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.devitocodes.com/images/hand_of_bits.png" /><media:content medium="image" url="https://www.devitocodes.com/images/hand_of_bits.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Cross-platform seismic imaging benchmarking</title><link href="https://www.devitocodes.com/benchmarking" rel="alternate" type="text/html" title="Cross-platform seismic imaging benchmarking" /><published>2023-02-26T00:00:00+00:00</published><updated>2023-02-26T00:00:00+00:00</updated><id>https://www.devitocodes.com/benchmarking</id><content type="html" xml:base="https://www.devitocodes.com/benchmarking"><![CDATA[<p>We have launched a groundbreaking framework to benchmark seismic imaging
workloads across different platforms. This initiative brings together key
stakeholders from Devito, hardware vendors, cloud providers, and the broader
industry, setting a stage for standardization, reproducibility, and elevated
performance in seismic imaging workloads.</p>

<p>Key Highlights:</p>

<ol>
  <li><strong>Standardization and Reproducibility</strong>:
    <ul>
      <li>Advocates for standardized comparisons and robust performance data, aiding organizations in insightful hardware or cloud system acquisitions.</li>
    </ul>
  </li>
  <li><strong>Efficient Resource Utilization</strong>:
    <ul>
      <li>Promotes code/data reuse and minimizes redundant efforts, leading to efficient resource and human capital utilization.</li>
    </ul>
  </li>
  <li><strong>Extendable and Automated Workflow</strong>:
    <ul>
      <li>The flexible architecture allows for extensibility and employs automation for a streamlined benchmarking process, catering to evolving needs.</li>
    </ul>
  </li>
  <li><strong>Community Engagement</strong>:
    <ul>
      <li>The initiative welcomes community engagement and dialogue, laying the groundwork for future collaborations, workshops, and benchmarking expansions.</li>
    </ul>
  </li>
  <li><strong>Transparency and Validation</strong>:
    <ul>
      <li>Even in its alpha phase, the emphasis on transparency and validation of benchmark data ensures responsible use of preliminary data. However, because of the commercial sensitivity of the data, the benchmarking data is only available under NDA.</li>
    </ul>
  </li>
</ol>

<p>This framework signifies a substantial stride towards nurturing a collaborative
ecosystem aimed at advancing standardization and optimization of seismic imaging
workloads across diverse computing architectures. The collaborative ethos
facilitated by this platform is geared towards driving notable advancements in
seismic imaging performance, contributing to the overarching goal of efficient
resource utilization and heightened computational capabilities.</p>

<h3 id="overview">Overview</h3>

<p>The framework is build on <a href="https://github.com/">GitHub</a>-based having being
heavily influenced by our existing CI/CD framework. The platform includes a
development cluster of servers with various computer architectures, configured
as <a href="https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners">GitHub self-hosted
runners</a>,
and utilizes <a href="https://docs.github.com/en/actions">GitHub Actions</a> to automate
workflows.</p>

<p>One of the advantages of this approach is that it naturally supports automation,
standardized and reproducible comparisons of different methods, hardware, and
skills. It also enables collaboration and code/data reuse, ultimately leading to
better performance of the software and efficient use of human capital. The
platform is also easily extendable: new benchmarks can be added to the GitHub
Actions workflow, and more servers (either on-prem or cloud-based) can be
brought in by configuring additional self-hosted runners.</p>

<h3 id="benchmarking-as-a-platform">Benchmarking as a platform</h3>

<p>The key idea behind the seismic imaging benchmarking platform is to bring
together stakeholders in the industry, such as energy companies, service
companies, processor manufacturers, and academic researchers, to standardize
benchmarking of seismic imaging kernels.</p>

<p>The objective is to enable accurate and reproducible benchmark experiments,
facilitate collaboration and code/data reuse, reduce the duplication of effort
and improve the overall performance of seismic imaging software. Additionally,
robust performance data will help organizations make informed purchasing
decisions for on-premise or cloud computing systems.</p>

<p>Overall, the proposed platform aims to address the common issues in benchmarking
seismic imaging kernels, such as differences in the PDEs, discretization,
algorithmic optimizations, and runtime choices, and provide a more standardized
and reproducible approach for comparing different methods, hardware, and skills.</p>

<h4 id="anatomy-of-a-standard-benchmark">Anatomy of a standard benchmark</h4>

<div class="mermaid">
  graph TB
    subgraph Standard: Benchmark setup/input
    A(Problem specification: PDEs, BCs, grid size/shape, ...) 
    end
    subgraph Concrete implementation 
    A--&gt;B1(OSS Devito)
    A--&gt;B2(DevitoPRO)
    A--&gt;B3(Hardware vendor implementation)
    A--&gt;B4(Other ISV, research, proprietary implementations)
    end
    subgraph Execution environment
    B2--&gt;C1(Singularity)
    B2--&gt;C2(Docker)
    B2--&gt;C3(Conda)
    end
    subgraph Target platform
    C1--&gt;D1(On-Prem dev-cluster)
    C1--&gt;D2(Vendor/slurm dev-cluster)
    C1--&gt;D3(Public Cloud)
    D1--&gt;E1(Vendor CPUs)
    D1--&gt;E2(Vendor GPUs)
    end
    subgraph Standard: benchmark output
    F(JSON: performance metrics, solution norms, status, implementation specific metadata)
    E1--&gt;F
    E2--&gt;F
    end;
</div>
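The standardized JSON output at the bottom of the diagram can be sketched as a plain record; the field names below are illustrative rather than the platform's actual schema, with the metric fields left unset since they would be filled in from a real run:

```python
import json
import platform
import time

# Hypothetical benchmark record mirroring the standardized output described
# above: performance metrics, solution norms, status, and metadata.
record = {
    "benchmark": "iso",
    "status": "completed",
    "metrics": {"gflops_s": None, "gpoints_s": None},  # from the operator's performance summary
    "solution_norm": None,                             # used to validate across implementations
    "metadata": {
        "implementation": "DevitoPRO",
        "host": platform.node(),
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    },
}
payload = json.dumps(record, indent=2)
print(payload)
```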

<h3 id="the-software-infrastructure">The software infrastructure</h3>

<p>We have created a GitHub-based extensible framework for benchmarking seismic
imaging kernels. The Seismic Benchmark Platform e-infrastructure comprises
GitHub Actions for automating workflows and a development cluster of servers
with various computer architectures, each configured as a GitHub self-hosted
runner.</p>

<p>GitHub Actions is a feature that allows users to automate software development
workflows. It allows users to create custom workflows, called actions, triggered
by specific events such as a code push, pull request, or the creation of an
issue. These workflows include building and testing code, deploying software,
and integrating with other tools. With GitHub Actions, users can automate
repetitive tasks, reduce manual errors, and improve the overall efficiency of
their development process. In our case, GitHub Actions are used to</p>

<ul>
  <li>Execute one or more benchmarks.</li>
  <li>Upload benchmark data to a results repository.</li>
  <li>Post-process benchmark data.</li>
</ul>

<h4 id="workflow-of-benchmark-automation-with-github-actions">Workflow of benchmark automation with GitHub Actions</h4>

<div class="mermaid">
  graph TB
    subgraph GitHub Action: manual event trigger
      A(Benchmark matrix of jobs: benchmarks x architectures)
      B(GitHub actions schedules individual jobs to self-hosted runners)
      A--&gt;B
    end
    subgraph Foreach benchmark job
      C(Job allocated to self-hosted runner)
      D(Setup execution environment)
      E(Run benchmark)
      F(Push benchmark output to data repo)
      B--&gt;C
      C--&gt;D
      D--&gt;E
      E--&gt;F
    end
    subgraph GitHub Action: triggered by data push
      G(Process data)
      H(Publish results to gh-pages)
      F--&gt;G
      G--&gt;H
    end;
</div>

<p>GitHub Actions can run on either GitHub-hosted runners or self-hosted runners.
Self-hosted runners are used to execute a workflow on machines the users have
direct access to, rather than on GitHub-managed infrastructure. Self-hosted
runners allow users more control over the environment in which their workflows
run, including access to specific software, libraries, or hardware resources.
Users can also use self-hosted runners to run workflows on-premises, in a
virtual private cloud, or in a hybrid environment. Self-hosted runners are a
flexible solution for organizations that have specific requirements for their
development environments and need more control over their workflow execution.</p>

<p>For the work described here, we have configured the following self-hosted runners:</p>

<ul>
  <li>NVIDIA A100-PCIE-40GB (on-prem)</li>
  <li>NVIDIA Tesla PG503-216 (on-prem)</li>
  <li>AMD Instinct™ MI210 (on-prem)</li>
  <li>Intel(R) Xeon(R) Gold 5218R CPU (on-prem)</li>
  <li>AMD EPYC 7413 24-Core Processor (on-prem)</li>
</ul>

<p>The advantages of this design based on GitHub Actions are</p>

<ul>
  <li>Fully automated workflow.</li>
  <li>Fully reproducible:
    <ul>
      <li>All software is maintained in GitHub repositories.</li>
    </ul>
  </li>
  <li>Reuses existing CI/CD infrastructure and know-how.</li>
  <li>Readily extendable:
    <ul>
      <li>Add benchmarks by adding extra jobs to the GitHub Actions workflow.</li>
      <li>Add more servers by configuring GitHub self-hosted runners.</li>
      <li>Self-hosted runners can be bare-metal servers or run in the Cloud.</li>
    </ul>
  </li>
</ul>

<p>While the vision is to advance standardization in our industry and grow a
community around this platform, it is also straightforward to fork our codebase
and create a private instance with proprietary benchmarks.</p>

<p>Another fundamental aspect of our software infrastructure is the use of virtual
containers, in particular Docker. This makes it straightforward to configure new
machines and reproduce performance results. In our experience, virtual
containers are the only realistic way of maintaining and extending a software
and hardware infrastructure like the one we envision in this project.</p>
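To illustrate the idea, a runner's execution environment can be pinned down in a Dockerfile along these lines (the base image, package set, and script name are hypothetical, not our actual container definition):

```dockerfile
# Hypothetical sketch of a benchmark container for a CUDA-capable runner;
# the base image and installed packages are illustrative only.
FROM nvidia/cuda:11.8.0-devel-ubuntu22.04
RUN apt-get update && apt-get install -y python3 python3-pip git
RUN pip3 install devito
COPY run_benchmark.sh /opt/benchmarks/
WORKDIR /opt/benchmarks
ENTRYPOINT ["./run_benchmark.sh"]
```

Because the entire toolchain is captured in the image, bringing a new machine online reduces to installing the container runtime and registering the self-hosted runner.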

<p>Currently, three benchmarks are configured:</p>

<h4 id="isotropic-acoustic">Isotropic acoustic</h4>

<ul>
  <li>Shortcut: iso</li>
  <li>Dimensions: 512x512x512</li>
  <li>Number of time steps: 400</li>
  <li>Space order: 8</li>
  <li>Time order: 2</li>
</ul>

<h4 id="fletcher-and-fowler-tti">Fletcher and Fowler TTI</h4>

<ul>
  <li>Shortcut: tti_fl</li>
  <li>Dimensions: 512x512x512</li>
  <li>Number of time steps: 400</li>
  <li>Space order: 8</li>
  <li>Time order: 2</li>
</ul>

<h4 id="skew-adjoint-tti">Skew-adjoint TTI</h4>

<ul>
  <li>Shortcut: tti_sa</li>
  <li>Dimensions: 512x512x512</li>
  <li>Number of time steps: 400</li>
  <li>Space order: 8</li>
  <li>Time order: 2</li>
</ul>

<h3 id="results-preview">Results preview</h3>

<p>We have included a snapshot of results below. FLOPS (Floating Point Operations
Per Second) is a well-recognized measure in High-Performance Computing (HPC),
but it may not always be the most revealing. Its value can be inflated by
employing inefficient numerical methods. In seismic imaging, GPts/s (Giga-Points
Per Second) — also termed giga-cells-per-second — is often favored. This metric
directly measures work throughput, offering a clearer gauge of performance. In
essence, GPts/s helps in accurately estimating the time or cost required to
solve a specific problem.</p>
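To make the metric concrete, here is a minimal sketch of how GPts/s is computed (not the benchmark harness itself): the number of grid points updated, times the number of time steps, divided by wall-clock time.

```python
def gpts_per_s(shape, timesteps, runtime_s):
    """Throughput in giga-points per second for a stencil run.

    shape      -- grid dimensions, e.g. (512, 512, 512)
    timesteps  -- number of time steps executed
    runtime_s  -- wall-clock time of the time loop, in seconds
    """
    points = 1
    for n in shape:
        points *= n
    return points * timesteps / runtime_s / 1e9

# A 512^3 grid run for 400 time steps in ~1 second of wall-clock time
# corresponds to roughly 53.7 GPts/s.
print(round(gpts_per_s((512, 512, 512), 400, 1.0), 1))  # → 53.7
```

Unlike FLOPS, this number cannot be inflated by doing redundant arithmetic per point, which is why it gives a clearer gauge of time-to-solution.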

<p><strong>Disclaimer:</strong> <em>Although we have exerted maximum effort to guarantee precise,
equitable, and replicable results, it is crucial to understand that the
benchmarking framework is still in the alpha development phase. Consequently,
the benchmarks provided here are preliminary and subject to change. The
benchmark data should not be considered comprehensive or final, and are not
suited for making any financial decisions.</em></p>

<p><em>No warranty, express or implied, is provided with the data. The information is
supplied on an “as is” basis. We expressly disclaim, to the maximum extent
permitted by law, any liability for any damages or losses, direct or
consequential, resulting from the use of these benchmarks. Please utilize this
information responsibly, keeping in mind its tentative nature.</em></p>

<h3 id="3d-isotropic-acoustic">3D Isotropic acoustic</h3>

<table>
  <thead>
    <tr>
      <th>Processor</th>
      <th>GPts/s</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>NVIDIA A100-80GB</td>
      <td>62.7</td>
    </tr>
    <tr>
      <td>NVIDIA A100-PCIE-40GB</td>
      <td>54.2</td>
    </tr>
    <tr>
      <td>AMD Instinct™ MI250</td>
      <td>54</td>
    </tr>
    <tr>
      <td>NVIDIA Tesla PG503-216 (V100)</td>
      <td>31</td>
    </tr>
    <tr>
      <td>AMD Instinct™ MI210</td>
      <td>29.2</td>
    </tr>
    <tr>
      <td>Intel(R) Xeon(R) Gold 5218R CPU</td>
      <td>7.97</td>
    </tr>
    <tr>
      <td>AMD EPYC 7413 24-Core Processor</td>
      <td>1.49</td>
    </tr>
  </tbody>
</table>

<h3 id="3d-fletcher-du-fowler-tti">3D Fletcher Du Fowler TTI</h3>

<table>
  <thead>
    <tr>
      <th>Processor</th>
      <th>GPts/s</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>AMD Instinct™ MI250</td>
      <td>16.1</td>
    </tr>
    <tr>
      <td>NVIDIA A100-80GB</td>
      <td>12.4</td>
    </tr>
    <tr>
      <td>NVIDIA A100-PCIE-40GB</td>
      <td>12.2</td>
    </tr>
    <tr>
      <td>AMD Instinct™ MI210</td>
      <td>9.83</td>
    </tr>
    <tr>
      <td>NVIDIA Tesla PG503-216 (V100)</td>
      <td>9.34</td>
    </tr>
    <tr>
      <td>Intel(R) Xeon(R) Gold 5218R CPU</td>
      <td>1.72</td>
    </tr>
    <tr>
      <td>AMD EPYC 7413 24-Core Processor</td>
      <td>0.797</td>
    </tr>
  </tbody>
</table>

<h3 id="future-work">Future work</h3>

<ul>
  <li>Add a link to the page capturing the benchmark characteristics.</li>
  <li>Add more metrics, such as FLOPS.</li>
  <li>Add more benchmarks:
    <ul>
      <li>Laplacian operator (a trivial case helps when working with vendors).</li>
      <li>Gradient operators to stress backward propagation.</li>
      <li>Elastic formulation, as these are of growing importance.</li>
    </ul>
  </li>
  <li>Add support for third parties to provide their own implementations of the benchmarks.</li>
  <li>Add more benchmark configurations:
    <ul>
      <li>MPI for NUMA CPUs.</li>
      <li>MPI for multiple GPUs per server.</li>
    </ul>
  </li>
  <li>Add more bare-metal nodes to the development cluster.</li>
  <li>Engage with hardware vendors to obtain test nodes.</li>
  <li>Add Cloud-based self-hosted runners:
    <ul>
      <li>Ideally, these would be configured as on-demand runners.</li>
    </ul>
  </li>
  <li>Community engagement:
    <ul>
      <li>Organize benchmarking workshops.</li>
      <li>Engage with hardware and Cloud vendors to review and optimize benchmarks.</li>
    </ul>
  </li>
</ul>

<h3 id="acknowledgements">Acknowledgements</h3>

<p>Many thanks to Chevron for the funding and feedback to kickstart this
initiative.  We would also like to thank AMD, AWS, Dell, Nvidia and
Supermicro for providing hardware and cloud resources.</p>]]></content><author><name>Gerard Gorman, Fabio Luporini</name></author><category term="DevOps" /><category term="benchmarking" /><category term="HPC" /><category term="Cloud" /><category term="AMD" /><category term="ARM" /><category term="Intel" /><category term="Nvidia" /><summary type="html"><![CDATA[A framework for cross-platform benchmarking of seismic imaging workloads is described. The vision for this platform is that it will be used to support collaboration between Devito Codes, hardware vendors and Cloud providers to continuously optimize the performance of seismic imaging workloads.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://www.devitocodes.com/images/robot_devito_pro_banner.png" /><media:content medium="image" url="https://www.devitocodes.com/images/robot_devito_pro_banner.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>