%include('html_intro.inc')
    <meta name="keywords" content="projects">
    <meta name="description" content="NCCS TechInt group thrust area">
    <title>Technology Integration Group: System Architecture and Resilience</title>
%include('hdbody.inc')
%include('hdrnav.inc')
<!-- ========================================================== -->
%include('content_intro.inc')
<div class="overview">
These projects relate to innovations in system architecture that aim
to improve performance, reduce variability, and enhance
reliability/resilience. They may also involve evaluation of new
processor technologies as applicable to the OLCF roadmap.
<div class="plinks">
<a href="#Spectral">Spectral</a>
<br><a href="#Many-Core">Functional Partitioning Runtime for Many-Core Nodes</a>
<br><a href="#TBSchedule">Top and Bottom Scheduling</a>
<br><a href="#SCFailure">Supercomputer Failure Analysis</a>
<br><a href="#Congestion">Interconnect Congestion Analysis</a>
%include('content_close.inc')
<!-- ========================================================== -->
%include('content_intro.inc')
    <h2 id="Spectral">Spectral</h2>
<p>Spectral is a software library that lets applications use the Summit
burst-buffer I/O system without code modification. Using function-call
interception, Spectral hooks into an application and detects when files
are written to specially configured directories. Upon detection, Spectral
schedules and manages the asynchronous drains from the burst buffer to the
parallel file system. Spectral has been tested successfully with GTC (Fortran)
and LAMMPS on OLCF Power clusters. In both applications, Spectral
significantly increased checkpointing performance while requiring no
modifications to the application source code.
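<p>The detect-then-drain flow described above can be sketched in a few lines
of Python. This is only an illustration under stated assumptions, not
Spectral's actual implementation: the <code>Drainer</code> class, its
<code>on_close</code> hook, and the directory arguments are hypothetical
names, and the interception layer that would call the hook is left out.
<pre>
```python
import queue
import shutil
import threading
from pathlib import Path


class Drainer:
    """Hypothetical sketch: copy files closed under burst_dir to pfs_dir
    on a background thread, so the caller never waits for the copy."""

    def __init__(self, burst_dir, pfs_dir):
        self.burst_dir = Path(burst_dir).resolve()
        self.pfs_dir = Path(pfs_dir).resolve()
        self.pending = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def on_close(self, path):
        """Assumed to be called by an intercept layer when a file is closed.

        Returns True if the file lies in the watched directory tree and
        was queued for draining, False otherwise.
        """
        path = Path(path).resolve()
        if self.burst_dir in path.parents:
            self.pending.put(path)
            return True
        return False

    def _drain(self):
        # Worker loop: take one queued file at a time and copy it out.
        while True:
            path = self.pending.get()
            try:
                target = self.pfs_dir / path.relative_to(self.burst_dir)
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(path, target)
            finally:
                self.pending.task_done()

    def wait(self):
        """Block until every queued drain has finished."""
        self.pending.join()
```
</pre>
<p>In this sketch the application thread only pays for a queue insertion;
the copy to the parallel file system happens in the background, and
<code>wait()</code> would be called once before the job exits.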
    <p>Contributors: Christopher Zimmer and Scott Atchley
    <p>Status: On-going
  <p>&nbsp; <a class="right" href="#">Top</a>
%include('content_close.inc')
<!-- ========================================================== -->
%include('content_intro.inc')
    <h2 id="Many-Core">Functional Partitioning Runtime for Many-Core Nodes</h2>
<p>With the leveling off of processor clock speeds, chip manufacturers have
increased the number of cores to consume the additional transistors promised by
Moore's Law. As we move towards exascale systems, we may see higher core counts
per node, including more cores per socket, more sockets, and add-in boards such
as GP-GPUs and many-core coprocessors connected via PCI Express. As users
initially port their applications to Titan's GP-GPUs, they may not fully utilize
the CPUs, leaving them available for functional partitioning.</p>
<p>This effort will explore providing runtime services to applications
using a small subset of cores on a many-core system by developing a
Functional Partitioning (FP) runtime environment. This environment
will partition a many-core node such that the data analysis tasks of an
end-to-end application (simulation + data analysis) can be scheduled
in situ, on the same node, alongside the application’s simulation job for
better end-to-end performance.</p>
<p> Services provided may include I/O buffering, which would allow the
application to resume computation more quickly, or various forms of
fine-grain analysis or transformation.</p>
    <p>Contributors: Scott Atchley, Saurabh Gupta, Ross Miller, and
    Sudharshan Vazhkudai
    <p>Status: On-going
<P>Selected publications:
M. Li, S. S. Vazhkudai, A.R. Butt, F. Meng, X. Ma, Y. Kim, C.
Engelmann, G. Shipman, "Functional Partitioning to Optimize End-to-End
Performance on Many-core Architectures", <i>Proceedings of Supercomputing
2010 (SC10): 23rd IEEE/ACM Int'l Conference on High Performance
Computing, Networking, Storage and Analysis</i>, New Orleans, Louisiana,
November 2010.
<a href="http://users.nccs.gov/~vazhkuda/FP.pdf">pdf</a>
  <p>&nbsp; <a class="right" href="#">Top</a>
%include('content_close.inc')
<!-- ========================================================== -->
%include('content_intro.inc')
    <h2 id="TBSchedule">Top and Bottom Scheduling</h2>
    To minimize the runtime variability of jobs and reduce node
    allocation fragmentation, we developed a dual-ended scheduling
    policy for the Moab scheduler on Titan in place of the first-fit
    policy. Our algorithm schedules large jobs top down and small jobs
    bottom up, thereby minimizing the fragmentation for large,
    long-running jobs that is caused by small, short-lived jobs.
    We also modified the assignment of nodes to a job to
    prioritize the z dimension (nodes within a rack), followed by the
    x dimension (racks in the same row, which offer
    better communication bandwidth), and then the y dimension (columns).
    Thousands of jobs are benefiting from this technique. This
    project received a Significant Event Award, a prestigious lab-wide
    recognition.
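    <p>The dual-ended idea can be illustrated with a minimal Python sketch.
    It is not the Moab policy itself: the function name
    <code>allocate</code>, the flat node numbering, and the
    <code>large_threshold</code> cutoff between small and large jobs are
    assumptions made for the example, and the z/x/y ordering is not modeled.
<pre>
```python
def allocate(free, count, large_threshold):
    """Pick `count` nodes from `free`, a set of free node indices.

    Jobs needing at least `large_threshold` nodes take the
    highest-numbered free nodes (top down); all other jobs take the
    lowest-numbered free nodes (bottom up). Returns the sorted list of
    chosen nodes and removes them from `free`, or returns None if the
    request cannot be satisfied.
    """
    if count > len(free):
        return None
    is_large = count >= large_threshold
    ordered = sorted(free, reverse=is_large)
    chosen = ordered[:count]
    free.difference_update(chosen)
    return sorted(chosen)


if __name__ == "__main__":
    free_nodes = set(range(16))            # a toy 16-node machine
    print(allocate(free_nodes, 2, 8))      # small job: [0, 1]
    print(allocate(free_nodes, 8, 8))      # large job: [8, 9, ..., 15]
    print(allocate(free_nodes, 3, 8))      # small job: [2, 3, 4]
```
</pre>
    <p>Because small jobs only ever consume nodes from the low end, the
    short-lived holes they leave behind stay away from the region used by
    large jobs.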
    <p>Contributors: Chris Zimmer, Scott Atchley, Sudharshan Vazhkudai
    <p>Status: Complete
    <p>Selected publications:
    Christopher Zimmer, Saurabh Gupta, Scott Atchley, Sudharshan S.
    Vazhkudai, and Carl Albing, "A Multi-faceted Approach to Job
    Placement for Improved Performance on Extreme-Scale Systems,"
    <i>Proceedings of Supercomputing 2016 (SC16): 29th Int'l
    Conference on High Performance Computing, Networking, Storage and
    Analysis</i>, Salt Lake City, UT, November 2016.
    <a href="http://users.nccs.gov/~vazhkuda/Titan-Scheduling.pdf">pdf</a>
  <p>&nbsp; <a class="right" href="#">Top</a>
%include('content_close.inc')
<!-- ========================================================== -->
%include('content_intro.inc')
    <h2 id="SCFailure">Supercomputer Failure Analysis</h2>
    This project analyzes temporal failure characteristics of Titan’s
    299,008 CPU cores and 18,688 GPUs to understand trends in machine
    failure, MTBF, single-bit errors, double-bit errors, off-the-bus
    errors, and the correlation of temperature with failure
    [TitanGPUReliability:HPCA15]. This study was the first of its kind
    for a large-scale GPU deployment. Based on the insights, we devised
    checkpointing advisory tools [LazyChkpt:DSN14] that are saving
    millions of core hours for production jobs, devised an
    energy-based scheduling algorithm that places large node-count GPU
    jobs, which stress the GPUs more, at the bottom of
    the rack, where it is much cooler than at the top, and
    cordoned off nodes with frequent failures.
    <p>TechInt contributors: Devesh Tiwari, Sudharshan Vazhkudai
    <p>Status: Complete
    <p>Selected publications:
    Devesh Tiwari, Saurabh Gupta, Jim Rogers, Don Maxwell, Paolo Rech,
    Sudharshan Vazhkudai, Daniel Oliveira, Dave Londo, Nathan
    DeBardeleben, Philippe Navaux, Luigi Carro, Arthur Buddy Bland,
    "Understanding GPU Errors on Large-scale HPC Systems and the
    Implications for System Design and Operation", <i>21st IEEE Symp.
    on High Performance Computer Architecture (HPCA)</i>, San
    Francisco, California, February 2015.
    <a href="http://users.nccs.gov/~vazhkuda/hpca.pdf">pdf</a>
    <br>Devesh Tiwari, Saurabh Gupta, Sudharshan S. Vazhkudai, "Lazy
    Checkpointing: Exploiting Temporal Locality in Failures to
    Mitigate Checkpointing Overheads on Extreme-Scale Systems",
    <i>Proceedings of the 44th Annual IEEE/IFIP Int'l Conference on
    Dependable Systems and Networks (DSN 2014)</i>, Atlanta, Georgia,
    June 2014.
    <a href="http://users.nccs.gov/~vazhkuda/DSN2014.pdf">pdf</a>
  <p>&nbsp; <a class="right" href="#">Top</a>
%include('content_close.inc')
<!-- ========================================================== -->
%include('content_intro.inc')
    <h2 id="Congestion">Interconnect Congestion Analysis</h2>
    <p>High-bandwidth file systems are critical to today’s
    supercomputing applications. To achieve the level of performance
    necessary for leadership-class applications, the underlying
    network must sustain high aggregate-bandwidth demands.
    Unfortunately, in such a large-scale network, congestion at
    routers limits overall I/O performance and causes high
    variability.
    <p>In this effort, our goal was to identify and understand
    bottlenecks in the interconnection network pertaining to file
    system I/O traffic. This work involved analyzing the impact of job
    placement and router placement on performance, and studying how these
    configurations play a role in reducing congestion in the
    interconnection network. During the course of this investigation
    we sought to:
    <ol class="numbered">
    <li class="numbered">Understand the dynamic behavior of the system via
    careful experimentation and investigation, and identify where
    bottlenecks occur,</li>
    <li>Develop simulation models of the network for an in-depth study
    of network dynamics in the presence of real-life workload scenarios,</li>
    <li>Identify heuristics or layouts that improve upon the existing
    topology, and</li>
    <li>Determine optimal I/O router layouts for the current topology and
    for several other network topologies for future systems.</li>
    </ol>
    <p>As a result of this research, we developed and deployed a
    lightweight, scalable mechanism to monitor the router traffic on
    Titan’s interconnect fabric (9600 routers). The tool is used to
    analyze interference among jobs.
    <p>Contributors: Saurabh Gupta, Chris Zimmer, and Sudharshan Vazhkudai
    <p>Status: On-going
  <p>&nbsp; <a class="right" href="#">Top</a>
%include('content_close.inc')
%include('html_close.inc')