Jekyll2023-06-26T19:18:24+00:00https://integrated-earth.github.io//feed.xmlIntegrated Geodynamic Earth ModelsNSF funded project to generate integrated geodynamic Earth models.Remote Rendering Setup2023-06-01T10:00:00+00:002023-06-01T10:00:00+00:00https://integrated-earth.github.io//2023/06/01/remote<p>This post summarizes the steps needed for parallel remote rendering using ParaView. With this setup you can
visualize, on a laptop over the internet, very large simulation data that resides on a workstation.</p>
<h1 id="setup-server">Setup server</h1>
<p>After installing ParaView on the server, go into the bin/ folder and run (over an ssh terminal):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./mpiexec -np 16 ./pvserver -display :0
</code></pre></div></div>
<p>Note that you need to use the included <code class="language-plaintext highlighter-rouge">mpiexec</code>, which is likely different from the system MPI. This will
create 16 processes that take care of loading, filtering, and rendering the data.</p>
<h1 id="connectivity">Connectivity</h1>
<p>To connect to the server, you must be able to reach the machine (and port 11111) directly. This can be achieved
by connecting to the school VPN where the workstation sits, or alternatively by using <a href="https://tailscale.com">tailscale</a>.</p>
<h1 id="on-the-client">On the client</h1>
<p>You will need to install the same version of ParaView as the one running on the server. Simply open it up
and connect using “File”, “Connect”.</p>
<p>It is useful to enable debug output using “Edit”, “Settings”, “Render View”, “Show Annotation”. After loading
files it will look something like this:</p>
<p><img src="/images/remote-rendering.png" alt="" /></p>
<h1 id="references">References</h1>
<ul>
<li><a href="https://docs.paraview.org/en/latest/ReferenceManual/parallelDataVisualization.html">https://docs.paraview.org/en/latest/ReferenceManual/parallelDataVisualization.html</a></li>
</ul>
<p><em>(written by <a href="https://www.math.clemson.edu/~heister/">Timo Heister</a>)</em></p>A quick introduction to performance testing2023-02-03T15:00:00+00:002023-02-03T15:00:00+00:00https://integrated-earth.github.io//2023/02/03/perf<p>Today I would like to show how one can get a quick estimate of the performance impact of a specific code change using the Linux tool <code class="language-plaintext highlighter-rouge">perf</code> (see <a href="https://perf.wiki.kernel.org/index.php/Tutorial">perf tutorial</a> for an introduction).</p>
<h1 id="the-setup">The Setup</h1>
<p>I am considering the ASPECT pull request <a href="https://github.com/geodynamics/aspect/pull/5044">#5044</a> that removes
an unnecessary copy of a vector inside the linear solver. I was curious how much of a difference this makes
in practice.</p>
<p>First, we need to pick a suitable example prm file to run. The change
is inside the geometric multigrid solver, so we need to run a test
that uses it. We also want it to be large enough that we can easily
time it without too much noise. For this, we are going to pick
<a href="https://github.com/geodynamics/aspect/tree/main/benchmarks/nsinker">nsinker
benchmark</a>
and slightly modify the file (disable graphical output, disable
adaptive refinement, choose 6 global refinements). See
<a href="https://gist.github.com/tjhei/d895c50e2481d7f3b8013c69e9cf17a8">here</a>
for the file.</p>
<p>We can only get a good estimate for the performance difference if we
compare optimized versions. That’s why we need to compile both
versions of ASPECT (with and without the change) in <a href="https://aspect-documentation.readthedocs.io/en/latest/user/run-aspect/debug-mode.html?highlight=release">optimized
mode</a>. We
also use <a href="https://aspect-documentation.readthedocs.io/en/latest/user/methods/geometric-multigrid.html?highlight=candi#geometric-multigrid">native
optimizations</a>
as recommended for geometric multigrid.</p>
<h1 id="a-first-test">A first test</h1>
<p>With perf correctly configured, we can get a first idea about the program by running</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>perf stat ./aspect test.prm
</code></pre></div></div>
<p>and get something like the following output</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>perf stat ../aspect-new test.prm
-----------------------------------------------------------------------------
-- This is ASPECT, the Advanced Solver for Problems in Earth's ConvecTion.
-- . version 2.5.0-pre
-- . using deal.II 9.4.1 (dealii-9.4, 6a1115bbf6)
-- . with 32 bit indices and vectorization level 2 (256 bits)
-- . using Trilinos 13.2.0
-- . using p4est 2.3.2
-- . running in OPTIMIZED mode
-- . running with 1 MPI process
-----------------------------------------------------------------------------
Loading shared library <./libnsinker.so>
Vectorization over 4 doubles = 256 bits (AVX), VECTORIZATION_LEVEL=2
-----------------------------------------------------------------------------
-- For information on how to cite ASPECT, see:
-- https://aspect.geodynamics.org/citing.html?ver=2.5.0-pre&mf=1&sha=&src=code
-----------------------------------------------------------------------------
Number of active cells: 262,144 (on 7 levels)
Number of degrees of freedom: 8,861,381 (6,440,067+274,625+2,146,689)
*** Timestep 0: t=0 seconds, dt=0 seconds
Solving Stokes system...
GMG coarse size A: 81, coarse size S: 8
GMG n_levels: 7
Viscosity range: 0.01 - 100
GMG workload imbalance: 1
Stokes solver: 28+0 iterations.
Schur complement preconditioner: 29+0 iterations.
A block preconditioner: 29+0 iterations.
Relative nonlinear residual (Stokes system) after nonlinear iteration 1: 0.999967
Postprocessing:
System matrix memory consumption: 101.42 MB
Termination requested by criterion: end time
+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start | 117s | |
| | | |
| Section | no. calls | wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Assemble Stokes system rhs | 1 | 15.9s | 14% |
| Build Stokes preconditioner | 1 | 5.96s | 5.1% |
| Initialization | 1 | 0.106s | 0% |
| Postprocessing | 1 | 0.00504s | 0% |
| Setup dof systems | 1 | 10s | 8.5% |
| Setup initial conditions | 1 | 8.58s | 7.3% |
| Setup matrices | 1 | 3.37s | 2.9% |
| Solve Stokes system | 1 | 68.1s | 58% |
+---------------------------------+-----------+------------+------------+
-- Total wallclock time elapsed including restarts: 117s
-----------------------------------------------------------------------------
-- For information on how to cite ASPECT, see:
-- https://aspect.geodynamics.org/citing.html?ver=2.5.0-pre&mf=1&sha=&src=code
-----------------------------------------------------------------------------
Performance counter stats for '../aspect-new test.prm':
117,495.66 msec task-clock # 0.987 CPUs utilized
1,134 context-switches # 9.651 /sec
44 cpu-migrations # 0.374 /sec
4,838,162 page-faults # 41.177 K/sec
412,324,642,261 cycles # 3.509 GHz
1,021,213,808,438 instructions # 2.48 insn per cycle
122,571,132,407 branches # 1.043 G/sec
338,189,797 branch-misses # 0.28% of all branches
119.093952900 seconds time elapsed
112.405895000 seconds user
5.087723000 seconds sys
</code></pre></div></div>
<p>As you can see, we are indeed running in optimized mode, with
vectorization enabled, and we are solving a 3d problem with 8.8
million degrees of freedom. It takes about 68 seconds to solve the
Stokes system with a single MPI rank.</p>
<h1 id="the-real-setup">The real setup</h1>
<p>For a more realistic test, we will run the same program with 4 MPI ranks (this way at least some of the cost of possible changes in communication is accounted for) by using <code class="language-plaintext highlighter-rouge">mpirun -n 4 ./aspect</code>. Finally,
<code class="language-plaintext highlighter-rouge">perf</code> supports running the program several times and averaging the stats. This turns out to be necessary,
as the change is otherwise too small to detect.</p>
<p>Our final command line is thus</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>perf stat -r 10 mpirun -n 4 ../aspect test.prm
</code></pre></div></div>
<h1 id="the-result">The result</h1>
<p>The output without the patch is</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Performance counter stats for 'mpirun -n 4 ../aspect-old test.prm' (10 runs):
182,419.23 msec task-clock # 4.010 CPUs utilized ( +- 0.44% )
1,042 context-switches # 5.934 /sec ( +- 7.12% )
137 cpu-migrations # 0.780 /sec ( +- 4.94% )
2,016,941 page-faults # 11.485 K/sec ( +- 0.36% )
536,241,394,539 cycles # 3.054 GHz ( +- 0.25% )
1,180,113,849,900 instructions # 2.25 insn per cycle ( +- 0.12% )
159,889,552,768 branches # 910.491 M/sec ( +- 0.24% )
446,788,836 branch-misses # 0.28% of all branches ( +- 0.30% )
45.494 +- 0.200 seconds time elapsed ( +- 0.44% )
</code></pre></div></div>
<p>while the new version gives</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
Performance counter stats for 'mpirun -n 4 ../aspect-new test.prm' (10 runs):
174,309.09 msec task-clock # 3.880 CPUs utilized ( +- 0.21% )
1,350 context-switches # 7.787 /sec ( +- 4.85% )
102 cpu-migrations # 0.588 /sec ( +- 5.06% )
1,993,599 page-faults # 11.499 K/sec ( +- 0.38% )
522,637,629,676 cycles # 3.015 GHz ( +- 0.11% )
1,170,583,946,847 instructions # 2.24 insn per cycle ( +- 0.09% )
157,504,211,601 branches # 908.506 M/sec ( +- 0.17% )
448,626,534 branch-misses # 0.29% of all branches ( +- 0.23% )
44.9306 +- 0.0919 seconds time elapsed ( +- 0.20% )
</code></pre></div></div>
<h1 id="conclusion">Conclusion</h1>
<p>The new code executes about 1% fewer instructions, and the total runtime
decreases from 45.5 to 44.9 seconds (with some uncertainty, see
above). The Stokes solve takes around 26 seconds (not shown), which
means the patch improves the Stokes solve by about 2%.</p>
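<p>The arithmetic behind these percentages can be checked quickly. A sketch using the numbers quoted above (the 26-second Stokes time is the approximate value mentioned in the text, not read off a table):</p>

```python
# Total runtimes from the two averaged perf runs above (seconds):
old_total = 45.494    # without the patch
new_total = 44.9306   # with the patch
stokes = 26.0         # approximate Stokes solve time, as quoted in the text

saved = old_total - new_total
print(f"saved {saved:.2f} s, {saved / old_total:.1%} of total runtime")
print(f"relative to the Stokes solve: {saved / stokes:.1%}")
```

<p>Attributing the whole saving to the Stokes solve is reasonable here because the patch only touches the linear solver.</p>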
<p>What is not taken into account here is that the construction and
use of the vector also causes some MPI communication, which is
potentially more expensive when running large simulations on more than
a single node.</p>
<p><em>(written by <a href="https://www.math.clemson.edu/~heister/">Timo Heister</a>)</em></p>Videos and Interactive Visualization Tests2022-08-02T15:18:18+00:002022-08-02T15:18:18+00:00https://integrated-earth.github.io//2022/08/02/viz<p>This blog post is a quick progress report on experiments on video rendering and interactive visualizations. I am including instructions on how these were generated below.</p>
<h1 id="video-rendering">Video rendering</h1>
<p>After loading the single timestep into ParaView, we set up the following filter chain:</p>
<p>We then save the files as .png files using File->Save Animation… and render using ffmpeg:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ffmpeg -framerate 25 -i a.%04d.png -c:v libx264 -profile:v high -crf 20 -pix_fmt yuv420p output.mp4
</code></pre></div></div>
<h2 id="youtube-version">Youtube version</h2>
<iframe width="500" height="420" src="https://www.youtube.com/embed/LRs1Qdm-FSI" frameborder="0" allowfullscreen=""></iframe>
<p>See <a href="https://youtu.be/LRs1Qdm-FSI">https://youtu.be/LRs1Qdm-FSI</a></p>
<h2 id="self-hosted-and-directly-embedded">Self-hosted and directly embedded</h2>
<video width="500" height="420" muted="" autoplay="" controls="">
<source src="https://f.tjhei.info/files/fres-spherical-rotating-for-cover.mp4" type="video/mp4" />
</video>
<h1 id="interactive-visualizations">Interactive visualizations</h1>
<p>After experimenting with different technologies, we are quite happy with what <a href="https://kitware.github.io/glance/index.html">ParaView Glance</a> has to offer. It can open <code class="language-plaintext highlighter-rouge">vtkjs</code> files (these can be obtained by exporting directly from ParaView by clicking File -> Export Scene…) directly and allows the user to toggle visibility and rendering style for individual pieces of the visualization much like in ParaView itself.</p>
<p>You can generate state files by going to <a href="https://f.tjhei.info/glance">our Glance instance</a>, opening a local vtkjs file, and clicking “save state”.
To host the examples online, upload the state file to f.tjhei.info and provide the address of the file in a link of the form</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>https://f.tjhei.info/glance/?name=NAME&url=URL
</code></pre></div></div>
<p>where you replace NAME by a filename and URL by the full address on where the file can be downloaded from. Note that NAME has to have the same file extension as the file to load (in our examples it is <code class="language-plaintext highlighter-rouge">.glance</code>). Finally, the URL and the instance of Glance have to be on the same server (single origin browser policy), so files need to be served from f.tjhei.info.</p>
<p>Here are two examples (click the image to open glance):</p>
<p><a href="https://f.tjhei.info/glance/?name=faulting-demo.glance&url=https://f.tjhei.info/view-faulting-test/202278_17-26-18.glance"><img src="/images/glance-fault.png" alt="Faulting Demo" /></a>
<a href="https://f.tjhei.info/glance/?name=spherical-cover-v2.glance&url=https://f.tjhei.info/view-spherical-cover/spherical-cover-v2.glance"><img src="/images/glance-spherical.png" alt="Spherical Demo" /></a></p>
<h1 id="future-work">Future work</h1>
<ul>
<li>Test loading data files directly (VTI, etc.)</li>
<li>Test the animation support</li>
<li>Test problems with surface deformation</li>
<li>Using resampled spherical models</li>
</ul>
<h1 id="references">References</h1>
<ul>
<li><a href="https://kitware.github.io/glance/index.html">ParaView Glance</a></li>
</ul>Starting Earth Models2021-08-25T15:18:18+00:002021-08-25T15:18:18+00:00https://integrated-earth.github.io//2021/08/25/starting-earth-models<p>Mantle convection and the associated plate tectonics are some of the
most fundamental yet complex processes here on Earth. The complexity
arises from several physical processes governing the mantle
circulation at different temporal and spatial scales. With the recent
increase in the availability of computational resources and advanced
numerical techniques, global mantle flow models have become possible
to investigate the underlying physics of plate tectonics. Here, we
will discuss one such model developed as part of the NSF-funded
project,
<a href="https://integrated-earth.github.io/">Integrated Geodynamic Earth Models</a>.</p>
<p>We set up instantaneous mantle convection models using
<a href="https://aspect.geodynamics.org/">ASPECT</a>, an open-source code that
simulates problems in the Earth’s mantle, with the goal of reproducing
present-day GPS velocities and deformation patterns. The material
properties in our models are constrained using recent high-resolution
geophysical observations. The main components of our models and the
corresponding parameter values are described below:</p>
<p>1) Input global tomography model: Several global seismic tomography
models have emerged since the early 21st century, revealing detailed
heterogeneity throughout the mantle. The models differ in the input
travel times of a seismic phase and the frequency content of that
phase. Currently, we use the joint P- and S-wave tomography model,
LLNL-G3D-JPS, by Simmons et al. (2015), with a resolution of ~1 degree
in the upper mantle and ~2 degrees in the lower mantle.</p>
<p>2) Density model: Our models are driven by buoyancy forces, which are
calculated using a depth-dependent scaling of density anomalies with the
S-wave anomalies. We base crustal densities on the Crust1.0 model
(Laske et al., 2013), averaging over the upper-, middle-, and
lower-crust layers.</p>
<p>3) Temperature model: We compute temperature variations from a mantle
adiabat using a constant scaling factor, -4.2, applied to the S-wave
anomalies in the global tomography model (see Figure below). The
shorter-wavelength heterogeneity expected in the upper mantle is
often smoothed out in global tomography models. Therefore, we use a
high-resolution temperature model (TM1 in Tutu et al., 2018) in the
top 300 km that includes the variable ages of continental lithosphere,
the cooling of oceanic lithosphere, and cold slab structures.</p>
<p>4) Plate boundaries: We prescribe plate boundaries from the Global
Earthquake fault database in
<a href="https://github.com/GeodynamicWorldBuilder/WorldBuilder">WorldBuilder</a>,
an open-source code that can prescribe complex initial conditions in
geodynamic models.</p>
<p>5) Rheology computation: We use dislocation and diffusion creep with
different prefactors for different mineral phases. The average lateral
variations in viscosity are scaled to a reference viscosity profile
(Steinberger and Calderwood, 2006) that is consistent with the
observed geoid. Additionally, we weaken the plate boundaries to
localize deformation along them.</p>
<p>We include all these components in a modular fashion to test the
relative importance of each component in best matching the surface GPS
observations.</p>
<p><img src="/images/sem-fig1.png" alt="" /></p>
<p>To resolve the high deformation at the plate boundaries, we use
adaptive mesh refinement in ASPECT. Our current highest-resolution
models have a minimum cell size of ~10 km:</p>
<p><img src="/images/sem-fig2.png" alt="" /></p>Scaling parallel IO in ASPECT2021-08-15T15:18:18+00:002021-08-15T15:18:18+00:00https://integrated-earth.github.io//2021/08/15/aspect-io<p>The goal of this post is to evaluate the performance of generating
visualization output for large computations produced by ASPECT. ASPECT
uses deal.II to generate the visualization output. By default we
generate VTU files of the unstructured mesh. Instead of generating one
file per MPI rank, the output can be <em>grouped</em> to a specified number
of files (even a single one). These files are written using MPI I/O,
which should allow for fast performance.</p>
<h1 id="machine-setup-and-striping">Machine setup and striping</h1>
<ul>
<li>Computations done on Frontera, nsinker benchmark, adaptive refinement</li>
<li>Computations were done on 32 nodes (56 cores each)</li>
<li>/scratch1/ LFS filesystem has 16 OSTs (file servers), up to 60 GB/s</li>
<li>Striping can be enabled (-1 = maximum striping) by calling
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>lfs setstripe -c -1 <file or folder>
</code></pre></div> </div>
</li>
<li>default striping is 1 (disabled)</li>
</ul>
<h1 id="results">Results</h1>
<p><img src="/images/vtu-io-scaling.png" alt="" /></p>
<h1 id="conclusions">Conclusions</h1>
<p>This little experiment showed some interesting results:</p>
<ul>
<li>
<p>For now, grouping to 16 without striping gives the best performance. This is the default in ASPECT.</p>
</li>
<li>
<p>We can achieve up to 2 GB/s in performance. This is far from the theoretical maximum.</p>
</li>
<li>
<p>Overall, performance is good enough: The linear solver takes about 30 seconds for 800m DoFs (IO: 5 seconds).</p>
</li>
</ul>
<h1 id="future-work">Future work</h1>
<ul>
<li>
<p>Compare against HDF5 output.</p>
</li>
<li>
<p>Check why striping is slower than writing several files.</p>
</li>
<li>
<p>How about 32 files instead of 16?</p>
</li>
</ul>
<h1 id="references">References</h1>
<ul>
<li><a href="https://frontera-portal.tacc.utexas.edu/user-guide/files/">Frontera Guide</a></li>
</ul>Sampling AMR data, storing, and rendering it2020-08-15T15:18:18+00:002020-08-15T15:18:18+00:00https://integrated-earth.github.io//2020/08/15/structured-netcdf<p>The goal of this article is to compare storage formats and rendering of Finite
Element solutions produced with ASPECT. The computational mesh in ASPECT is a
collection of adaptively refined octrees in 3d, see an example image below.
For this experiment, we consider a stationary test benchmark in a unit cube
that is part of ASPECT, see
<a href="https://github.com/geodynamics/aspect/tree/master/benchmarks/nsinker">nsinker</a>.</p>
<p>As part of <a href="https://integrated-earth.github.io/">this NSF funded project</a>, we
are evaluating how sampling unstructured data to a structured mesh works, and
how storage and rendering of this structured data compares to the original
unstructured output produced by ASPECT and the underlying
<a href="https://dealii.org">deal.II library</a>.</p>
<h1 id="parallel-sampling-of-unstructured-amr-data">Parallel sampling of unstructured AMR data</h1>
<p>For this experiment, we wrote an ASPECT postprocessor that samples arbitrary
solution variables from an unstructured AMR grid onto a structured grid of
arbitrary resolution. The postprocessor can be run at different resolutions to
produce “multi-resolution” output.</p>
<p>The structured data is generated by looping over all cells, evaluating the
solution variables at each quadrature point for some quadrature rule (slower
to evaluate but more accurate with more quadrature points per cell). We then
use nearest neighbor interpolation (the values at the closest quadrature point
are used for each structured grid point). See the following image for an example:</p>
<p><img src="/images/draw-io-structured-sampling.png" alt="" /></p>
<p>Here, black is the unstructured mesh, red are the quadrature points in each
cell, blue circles are the points of the structured mesh, and the arrows
denote what data is used at each point. Notice that a more sophisticated
interpolation could be used, but this is certainly accurate enough for
graphical visualization.</p>
<p>Internally, the algorithm transforms each quadrature point location to index
space and then “splats” the solution onto nearby structured points. We keep
track of the real-world distance to the currently closest quadrature point at
each structured point (as an additional output variable). When “splatting”, we
only overwrite the current value if the new distance is smaller than the
stored one.</p>
<p>This also works in an MPI-parallel computation, because we can split the
structured mesh between processors and we know, based on a given index, who the
owner is. Each rank sends a list of indices with values and their distances to
the owner, which then performs the same “splat” operation as it does for
its own values.</p>
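<p>As a rough 1-d illustration of this splat-with-distance scheme (a simplified sketch, not the actual ASPECT code: each sample lands only on its single nearest grid point, and the real postprocessor works in 3d and exchanges the index/value/distance lists over MPI):</p>

```python
import math

def splat_nearest(points, values, n, lo, hi):
    """Nearest-neighbor "splat" of 1-d scattered samples onto a structured
    grid of n points spanning [lo, hi].  A per-point distance array decides
    whether an incoming sample overwrites the stored value."""
    h = (hi - lo) / (n - 1)
    grid = [0.0] * n
    dist = [math.inf] * n          # distance to the closest sample so far
    for x, v in zip(points, values):
        i = min(max(round((x - lo) / h), 0), n - 1)   # index space
        d = abs(x - (lo + i * h))  # real-world distance to the grid point
        if d < dist[i]:            # overwrite only if strictly closer
            dist[i] = d
            grid[i] = v
    return grid, dist

grid, dist = splat_nearest([0.1, 0.9, 1.2], [10.0, 20.0, 30.0], 3, 0.0, 2.0)
print(grid)   # 0.9 wins grid point 1 over 1.2, since it is closer
```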
<p>The structured mesh is then output in parallel using the netCDF library with
the HDF5 backend.</p>
<h1 id="structured-netcdf-vs-unstructured-compressed-vtu">Structured netCDF vs unstructured, compressed vtu</h1>
<p>We now compare the output of a structured netCDF file against the unstructured
VTU output (as done using deal.II). The example is done on a sequence of
adaptively refined meshes (based on the viscosity, visualized below). We
output 6 data values in double precision (this corresponds to x, y, and z
velocity, pressure, temperature, and viscosity here).</p>
<p>The unstructured solution in one of the intermediate steps looks like this (we
show surface rendering, volume rendering, and isosurface rendering with 3
contours):</p>
<p><img src="/images/result-amr.png" alt="" /></p>
<p>The unstructured data stores an unstructured list of cells (the leaves of the
octree) with vertex coordinates in a compressed VTU file format. This is done
by sending the binary representation of the data through zlib, followed by a
base64 encoding, to end up with a valid XML VTU file.</p>
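<p>The zlib-plus-base64 pipeline can be reproduced in a few lines (a sketch of the encoding idea only; the real VTU writer also prepends block-size headers that this example omits):</p>

```python
import base64
import struct
import zlib

# 1000 doubles of raw binary data, as a stand-in for a solution vector
data = struct.pack("<1000d", *([1.5] * 1000))

compressed = zlib.compress(data)          # binary -> smaller binary
encoded = base64.b64encode(compressed)    # binary -> ASCII, safe inside XML

print(len(data), len(compressed), len(encoded))

# the encoding round-trips exactly
assert zlib.decompress(base64.b64decode(encoded)) == data
```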
<p>The structured data at various resolutions looks like this:</p>
<p><img src="/images/result-structured.png" alt="" /></p>
<p>The netCDF file stores the data as binary using HDF5 directly and without
compression (also see below for compression).</p>
<p>Now to the results. We start with the structured data:</p>
<p><img src="/images/table-structured.png" alt="" /></p>
<p>The table shows the resolution, file size, memory consumption, and render
time inside ParaView for selected data points.</p>
<p>For comparison, this is the unstructured data:</p>
<p><img src="/images/table-amr.png" alt="" /></p>
<p>Here, the resolution refers to the maximum resolution (compared to the table
above). Notice that the file sizes are orders of magnitude smaller, while rendering
time and memory consumption inside ParaView are orders of magnitude larger.</p>
<p>To conclude:</p>
<p>First, contour rendering is very fast, regardless of method and resolution
(0.01s not shown in the table), while contour extraction is quite a lot slower
for an unstructured mesh.</p>
<p>Second, surface rendering is very fast for both methods (interactive
framerates even on a laptop without GPU rendering).</p>
<p>Third, file sizes for structured meshes quickly become very large but memory
consumption inside ParaView is very efficient.</p>
<h1 id="float-vs-double-netcdfhdf5">float vs double netCDF/hdf5</h1>
<p>An obvious question is if we can reduce the file size of the netCDF
files. First, let’s consider storing the data as floats instead of
doubles. The loss in accuracy is unlikely to be problematic for visualization,
but as expected, we save a factor of 2 in file size. Rendering performance and
memory consumption also improves by a similar factor:</p>
<p><img src="/images/table-float.png" alt="" /></p>
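<p>The factor of 2 is simply the ratio of element sizes (8-byte doubles vs. 4-byte floats); a quick stdlib check, independent of netCDF:</p>

```python
import array

values = [0.1 * i for i in range(100_000)]
as_double = array.array("d", values)   # 8 bytes per value
as_float = array.array("f", values)    # 4 bytes per value

print(as_double.itemsize * len(as_double))  # 800000 bytes
print(as_float.itemsize * len(as_float))    # 400000 bytes
```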
<h1 id="other-options">Other options</h1>
<p>netCDF (HDF5) supports compression of the data (using nc_def_var_deflate()),
but it turns out that this is not supported when writing data in parallel. This
makes it unusable for us during a computation. We could compress after the
fact using <code class="language-plaintext highlighter-rouge">nccopy</code>, as reading from compressed data should work in parallel.
See <a href="https://www.unidata.ucar.edu/blogs/developer/en/entry/netcdf_compression">this post on netCDF compression</a>
for more details.</p>
<h1 id="conclusions">Conclusions</h1>
<p>This little experiment showed some interesting results:</p>
<ol>
<li>
<p>Structured data is a lot more efficient for rendering (RAM consumption,
rendering time, contour extraction time)</p>
</li>
<li>
<p>High resolution structured data has very large file sizes (floats are a
good option, but files will still be quite large without additional
compression). Outputting at a lower resolution is certainly an attractive
option, as we are already sampling the data anyway.</p>
</li>
<li>
<p>All parts of the algorithms and the file sizes scale with the resolution,
making it cheaper to produce the output as well. This makes structured,
lower resolution output attractive as a data exchange format and to do
quick visualization.</p>
</li>
<li>
<p>Sampling to structured data and compressing the data could be done as a
postprocessing step in a separate code base after the fact if we store the
unstructured data.</p>
</li>
</ol>
<h1 id="future-work">Future work</h1>
<ul>
<li>
<p>Clean up the sampling process and merge into ASPECT.</p>
</li>
<li>
<p>Update the code to support spherical geometries (lat/long/depth).</p>
</li>
<li>
<p>Compare unstructured data against vtkHyperOctree or vtkHyperTreeGrid from
VTK. This requires investigation of the file formats.</p>
</li>
<li>
<p>Compare against OSPRay AMR rendering using <a href="https://www.willusher.io/publications/tamr">TAMR</a>.</p>
</li>
</ul>
<h1 id="references">References</h1>
<ul>
<li>
<p><a href="https://arxiv.org/pdf/1703.00212.pdf">Two New Contributions to the Visualization of AMR Grids:
I. Interactive Rendering of Extreme-Scale 2-Dimensional Grids
II. Novel Selection Filters in Arbitrary Dimension
</a></p>
</li>
<li>
<p><a href="https://dx.doi.org/10.1111/cgf.13958">Feng Wang, Nathan Marshak, Will Usher, Carsten Burstedde, Aaron Knoll, Timo Heister, Chris R. Johnson:
CPU Ray Tracing of Tree-Based Adaptive Mesh Refinement Data
Computer Graphics Forum, 2020.</a></p>
</li>
<li>
<p><a href="https://www.unidata.ucar.edu/blogs/developer/en/entry/netcdf_compression">About netCDF compression</a></p>
</li>
</ul>Welcome2020-03-13T15:18:18+00:002020-03-13T15:18:18+00:00https://integrated-earth.github.io//jekyll/update/2020/03/13/first-news<p>Hi there, this is the beginning of our project website for the NSF funded project “Collaborative Research: Development and Application of a Framework for Integrated Geodynamic Earth Models”. Stay tuned for more content.</p>