| Title | High Performance Computing |
| Subtitle | 31st International Conference, ISC High Performance 2016 |
| Editors | Julian M. Kunkel, Pavan Balaji, Jack Dongarra |
| Video | http://file.papertrans.cn/427/426285/426285.mp4 |
| Overview | Includes supplementary material |
| Series | Lecture Notes in Computer Science |
| Description | This book constitutes the refereed proceedings of the 31st International Conference, ISC High Performance 2016 (formerly known as the International Supercomputing Conference), held in Frankfurt, Germany, in June 2016. The 25 revised full papers presented in this book were carefully reviewed and selected from 60 submissions. The papers cover the following topics: Autotuning and Thread Mapping; Data Locality and Decomposition; Scalable Applications; Machine Learning; Datacenters and Cloud; Communication Runtime; Intel Xeon Phi; Manycore Architectures; Extreme-scale Computations; and Resilience. |
| Publication Date | Conference proceedings 2016 |
| Keywords | architectures; dependable systems; fault-tolerance; machine learning; parallel computing methodologies |
| Edition | 1 |
| DOI | https://doi.org/10.1007/978-3-319-41321-1 |
| ISBN (softcover) | 978-3-319-41320-4 |
| ISBN (eBook) | 978-3-319-41321-1 |
| Series ISSN | 0302-9743 |
| Series E-ISSN | 1611-3349 |
| Copyright | Springer International Publishing Switzerland 2016 |
| 1 |
Front Matter |
|
| 2 |
Autotuning and Thread Mapping |
|
| 3 |
An Analytical Model-Based Auto-tuning Framework for Locality-Aware Loop Scheduling |
Rengan Xu,Sunita Chandrasekaran,Xiaonan Tian,Barbara Chapman |
|
Abstract
HPC developers aim to deliver the very best performance. To do so they constantly think about memory bandwidth, memory hierarchy, locality, floating-point performance, power/energy constraints, and so on. On the other hand, application scientists aim to write performance-portable code while exploiting the rich feature set of the hardware. By providing adequate hints to the compiler in the form of directives, appropriate executable code can be generated. There are tremendous benefits to using directive-based programming. However, applications are becoming more and more complex, and we need sophisticated tools such as auto-tuning to better explore the optimization space. Loops typically form a major and time-consuming portion of application code. Scheduling these loops involves mapping from the loop iteration space to the underlying platform, for example GPU threads. The user tries different scheduling techniques until the best one is identified. However, this process can be quite tedious and time-consuming, especially for relatively large applications, as the user needs to record the performance of every schedule's run. This paper aims to offer a better solution: an analytical model-based auto-tuning framework for locality-aware loop scheduling.
|
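As a concrete illustration of the search this framework automates, the sketch below times two hypothetical loop schedules and keeps the fastest; the schedule functions are invented stand-ins, and the paper's contribution is to predict the winner analytically rather than time every candidate run.

```python
# Minimal sketch of the schedule-search loop the paper automates:
# run each candidate loop schedule, record its time, keep the best.
import time

def schedule_row_major(a):
    return [x * 2.0 for x in a]                      # iterate 0..n-1 in order

def schedule_strided(a, stride=4):
    out = [0.0] * len(a)
    for s in range(stride):                          # emulate a strided mapping
        for i in range(s, len(a), stride):
            out[i] = a[i] * 2.0
    return out

data = list(range(1_000_000))
best = None
for name, sched in [("row-major", schedule_row_major),
                    ("strided", schedule_strided)]:
    t0 = time.perf_counter()
    sched(data)
    elapsed = time.perf_counter() - t0
    print(f"{name}: {elapsed:.3f}s")
    if best is None or elapsed < best[1]:
        best = (name, elapsed)
print("best schedule:", best[0])
```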
| 4 |
Performance, Design, and Autotuning of Batched GEMM for GPUs |
Ahmad Abdelfattah,Azzam Haidar,Stanimire Tomov,Jack Dongarra |
|
Abstract
The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in dense linear algebra, and is the key component for obtaining high performance in most LAPACK routines. As batched computations on relatively small problems continue to gain interest in many scientific applications, a need arises for a high-performance GEMM kernel for batches of small matrices. Such a kernel should be well designed and tuned to handle small sizes, and should maintain high performance for realistic test cases found in higher-level LAPACK routines and scientific computing applications in general. This paper presents a high-performance batched GEMM kernel for Graphics Processing Units (GPUs). We address batched problems with both fixed and variable sizes, and show that specialized GEMM designs and a comprehensive autotuning process are needed to handle problems of small sizes. For most performance tests reported in this paper, the proposed kernels outperform state-of-the-art approaches using a K40c GPU.
|
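The batched operation itself is simple to state; the NumPy sketch below (a stand-in for the GPU kernel, not the authors' code) shows the two cases the paper distinguishes, fixed-size and variable-size batches of small products.

```python
# What a batched GEMM computes: many small independent C_i = A_i @ B_i.
import numpy as np

rng = np.random.default_rng(0)

# Fixed-size batch: one fused call over (batch, m, k) x (batch, k, n) arrays.
A = rng.standard_normal((1000, 16, 16))
B = rng.standard_normal((1000, 16, 16))
C = np.einsum("bik,bkj->bij", A, B)       # equivalent to A[i] @ B[i] per i

# Variable-size batch: each problem in the batch has its own (m, k, n).
sizes = [(8, 12, 4), (16, 16, 16), (32, 8, 24)]
batch = [(rng.standard_normal((m, k)), rng.standard_normal((k, n)))
         for m, k, n in sizes]
results = [a @ b for a, b in batch]
print(C.shape, [r.shape for r in results])
```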
| 5 |
TCU: A Multi-Objective Hardware Thread Mapping Unit for HPC Clusters |
Ravi Kumar Pujari,Thomas Wild,Andreas Herkersdorf |
|
Abstract
Simultaneously meeting multiple, partially orthogonal optimization targets during thread scheduling on HPC and manycore platforms, such as maximizing CPU performance, meeting the deadlines of time-critical tasks, minimizing power, and securing thermal resilience, is a major challenge because of the associated scalability and thread-management overhead. We tackle these challenges by introducing the Thread Control Unit (TCU), a configurable, low-latency, low-overhead hardware thread mapper for the compute nodes of an HPC cluster. The TCU takes various sensor information into account and can map threads to the 4–16 CPUs of a compute node within a small and bounded number of clock cycles, in a round-robin, single-objective, or multi-objective manner. The TCU design can consider not just load-balancing or performance criteria but also physical constraints such as temperature limits, power budgets, and reliability aspects. Evaluations of different mapping policies show that multi-objective thread mapping achieves about 10 to 40 % lower mapping latency for periodic workloads compared to single-objective or round-robin policies. For bursty workloads under high load conditions, a 20 % reduction is achieved.
|
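To make the mapping policy concrete, here is a toy software model of one multi-objective mapping decision; the sensor values, weights, and thermal limit are invented, and the real TCU performs this selection in hardware within a bounded number of clock cycles.

```python
# Toy model of multi-objective thread mapping in the spirit of the TCU.
cpus = [
    {"id": 0, "load": 0.9, "temp_c": 78, "power_w": 15},
    {"id": 1, "load": 0.2, "temp_c": 55, "power_w": 9},
    {"id": 2, "load": 0.5, "temp_c": 70, "power_w": 11},
    {"id": 3, "load": 0.4, "temp_c": 85, "power_w": 13},  # too hot: avoid
]

TEMP_LIMIT_C = 80  # hypothetical thermal constraint

def mapping_score(cpu, w_load=1.0, w_power=0.05):
    """Lower is better; weights blend the partially orthogonal objectives."""
    if cpu["temp_c"] >= TEMP_LIMIT_C:      # hard physical constraint
        return float("inf")
    return w_load * cpu["load"] + w_power * cpu["power_w"]

def map_thread():
    return min(cpus, key=mapping_score)["id"]

print("thread mapped to CPU", map_thread())   # picks the idle, cool CPU 1
```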
| 6 |
Data Locality and Decomposition |
|
| 7 |
Dynamic Sparse-Matrix Allocation on GPUs |
James King,Thomas Gilray,Robert M. Kirby,Matthew Might |
|
Abstract
Sparse matrices are a core component in many numerical simulations, and their efficiency is essential to achieving high performance. Dynamic sparse-matrix allocation (insertion) can benefit a number of problems, such as sparse-matrix factorization, sparse matrix-matrix addition, static analysis (e.g., points-to analysis), computing transitive closure, and other graph algorithms. Existing sparse-matrix formats are poorly designed to handle dynamic updates. The compressed sparse-row (CSR) format is fully compact and must be rebuilt after each new entry. Ellpack (ELL) stores a constant number of entries per row, which allows for efficient insertion and sparse matrix-vector multiplication (SpMV), but is memory-inefficient and strictly limits row size. The coordinate (COO) format stores a list of entries and is efficient in both memory use and insertion time; however, it is much less efficient at SpMV. Hybrid ellpack (HYB) compromises by using a combination of ELL and COO, but degrades in performance as the COO portion fills up: rows that use the COO portion require it to be completely traversed during every SpMV operation. In this paper we introduce a new sparse-matrix format, dynamic compressed sparse row (DCSR), designed to support efficient dynamic updates.
|
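The trade-off the abstract describes is easy to demonstrate; the sketch below uses scipy.sparse as a CPU stand-in for the GPU formats, contrasting COO's cheap appends with CSR's compact structure and fast SpMV.

```python
# Why dynamic insertion favors COO while SpMV favors CSR.
import numpy as np
from scipy.sparse import coo_matrix

n = 1000
rows, cols, vals = [1, 5], [2, 7], [1.0, 2.0]

# COO: inserting an entry is an O(1) append to three parallel arrays.
rows.append(42); cols.append(99); vals.append(3.0)
A_coo = coo_matrix((vals, (rows, cols)), shape=(n, n))

# CSR: the row-pointer array makes it compact and fast for SpMV,
# but adding one more entry effectively rebuilds the structure.
A_csr = A_coo.tocsr()
x = np.ones(n)
y = A_csr @ x                       # SpMV is where CSR shines
print(A_coo.nnz, A_csr.nnz, y[1], y[5], y[42])
```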
| 8 |
An Efficient Parallel Load-Balancing Framework for Orthogonal Decomposition of Geometrical Data |
Bruno R. C. Magalhães,Farhan Tauheed,Thomas Heinis,Anastasia Ailamaki,Felix Schürmann |
|
Abstract
The accurate subdivision of spatially organized datasets is a complex problem in computer science, and is particularly important for load balancing in parallel environments. The problem is to (a) find a partitioning where each partition has the same number of elements and (b) minimize the communication between partitions (duplicate members). We present a novel parallel load-balancing framework, SBS, the first to our knowledge to perform accurate parallel partitioning of multidimensional data while requiring a fixed number of communication steps independent of network size or input data distribution. Compared to the state-of-the-art sampling and parallel partitioning methods adopted for HPC problems, it delivers better load balancing in a shorter time to solution. We analyse four partitioning schemes that SBS can be applied to, and evaluate our method on 4096 nodes of an IBM Blue Gene/Q supercomputer, partitioning up to 1 trillion elements and exhibiting almost-linear scaling properties.
|
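A minimal sketch of the balancing idea, under simplifying assumptions (1-D data on a single node, whereas SBS partitions multidimensional data across nodes with a fixed number of communication steps): sample the input, sort the sample, and read off approximately equal-count splitters.

```python
# Sample-based splitter selection for balanced partitioning (toy version).
import random
from bisect import bisect_right

random.seed(1)
points = [random.gauss(0.0, 1.0) for _ in range(100_000)]  # skewed input is fine
n_parts = 8

# Sort a small sample and take every (len/n_parts)-th value as a splitter.
sample = sorted(random.sample(points, 1024))
splitters = [sample[(i * len(sample)) // n_parts] for i in range(1, n_parts)]

counts = [0] * n_parts
for p in points:
    counts[bisect_right(splitters, p)] += 1
print(counts)   # roughly 12,500 elements per partition
```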
| 9 |
Parallel Community Detection Algorithm Using a Data Partitioning Strategy with Pairwise Subdomain Duplication |
Diana Palsetia,William Hendrix,Sunwoo Lee,Ankit Agrawal,Wei-keng Liao,Alok Choudhary |
|
Abstract
Community detection is an important data clustering technique for studying graph structures. Many serial algorithms have been developed and are well studied in the literature. As problem sizes grow, research attention has recently turned to parallelizing the technique. However, conventional parallelization strategies that divide the problem domain into non-overlapping subdomains do not scale with problem size and the number of processes. The main obstacle lies in the fact that graph algorithms often exhibit a high degree of data dependency, which makes developing scalable parallel algorithms a great challenge. We present PMEP, a distributed-memory parallel community detection algorithm that adopts an unconventional data partitioning strategy. PMEP divides a graph into subgraphs and assigns each pair of subgraphs to one process. This method duplicates a portion of the computational workload among processes in exchange for a significantly reduced communication cost in the later stages. After data partitioning, each process runs MEP on its assigned subgraph pair. MEP is a community detection algorithm based on the idea of maximizing equilibrium and purity.
|
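The pairwise assignment is simple to sketch: with k subgraphs, each of the k(k-1)/2 pairs goes to one process, so each subgraph is duplicated k-1 times. The snippet below only illustrates the assignment, not the MEP algorithm itself.

```python
# PMEP-style partitioning: every pair of subgraphs goes to one process,
# so communities spanning two subdomains can be found locally.
from itertools import combinations

k = 4                                  # number of subgraphs
pairs = list(combinations(range(k), 2))
n_processes = len(pairs)               # k*(k-1)/2 processes, one pair each

for rank, (i, j) in enumerate(pairs):
    print(f"process {rank}: runs MEP on subgraphs ({i}, {j})")
# Extra (duplicated) work is traded for much less later communication.
```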
| 10 |
TiDA: High-Level Programming Abstractions for Data Locality Management |
Didem Unat,Tan Nguyen,Weiqun Zhang,Muhammed Nufail Farooqi,Burak Bastem,George Michelogiannakis, et al. |
|
Abstract
The high energy cost of data movement relative to computation makes data locality management of paramount importance in programs. Managing data locality manually is not a trivial task and complicates programming. Tiling is a well-known approach that provides both data locality and parallelism in an application; however, there is no standard programming construct to express tiling at the application level. We have developed a multicore programming model, TiDA, based on tiling, and implemented the model as C++ and Fortran libraries. The programming model provides three high-level abstractions that hide the details of data decomposition, cache-locality optimization, and memory-affinity management from the application. In this paper we unveil the internals of the library and demonstrate the performance and programmability advantages of the model on five applications across multiple NUMA nodes. The library achieves up to 2.10x speedup over OpenMP in a single compute node for simple kernels, and up to 22x improvement over a single thread for a more complex combustion proxy application (SMC) on 24 cores.
|
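A plain-Python illustration of the tiling pattern that TiDA packages behind its abstractions (the tile size and kernel below are arbitrary choices, not TiDA's API): iterate tile-by-tile so each tile's working set stays cache-resident, and tiles become natural units of parallelism.

```python
# Tiled traversal of a 2-D array: loop over tiles, then within a tile.
import numpy as np

N, TILE = 1024, 64
a = np.arange(N * N, dtype=np.float64).reshape(N, N)
b = np.empty_like(a)

for ti in range(0, N, TILE):           # loop over tiles (parallelizable)
    for tj in range(0, N, TILE):
        tile = a[ti:ti + TILE, tj:tj + TILE]
        b[ti:ti + TILE, tj:tj + TILE] = 2.0 * tile   # work on one tile

assert np.allclose(b, 2.0 * a)
```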
| 11 |
Scalable Applications |
|
| 12 |
OpenAtom: Scalable Ab-Initio Molecular Dynamics with Diverse Capabilities |
Nikhil Jain,Eric Bohm,Eric Mikida,Subhasish Mandal,Minjung Kim,Prateek Jindal,Qi Li,Sohrab Ismail-Beigi, et al. |
|
Abstract
The complex interplay of tightly coupled, but disparate, computation and communication operations poses several challenges for simulating atomic-scale dynamics on multi-petaflops architectures. OpenAtom addresses these challenges by exploiting overdecomposition and asynchrony in Charm++, and scales to thousands of cores for realistic scientific systems with only a few hundred atoms. At the same time, it supports several interesting ab-initio molecular dynamics simulation methods, including the Car-Parrinello method, the Born-Oppenheimer method, k-points, parallel tempering, and path integrals. This paper showcases the diverse functionality as well as the scalability of OpenAtom via performance case studies, with a focus on recent additions and improvements. In particular, we study a metal organic framework (MOF) that consists of 424 atoms and is being explored as a candidate for a hydrogen storage material. Simulations of this system are scaled to large core counts on Cray XE6 and IBM Blue Gene/Q systems, and low time per step is demonstrated for simulating path integrals with 32 beads of MOF on 262,144 cores of Blue Gene/Q.
|
| 13 |
SPRITE: A Fast Parallel SNP Detection Pipeline |
Vasudevan Rengasamy,Kamesh Madduri |
|
Abstract
We present SPRITE, a new high-performance data analysis pipeline for detecting single-nucleotide polymorphisms (SNPs) in the human genome. A SNP detection pipeline for next-generation sequencing data uses several software tools, including tools for read alignment, processing alignment output, and SNP identification. We target end-to-end scalability and I/O efficiency in SPRITE by merging tools in this pipeline and eliminating redundancies. For a benchmark human whole-genome sequencing data set, SPRITE takes less than 50 min on 16 nodes of the TACC Stampede supercomputer. A key component of our optimized pipeline is a new parallel method and software tool for SNP detection. We find that the quality of its results (sensitivity and precision, using high-confidence variant calls as ground truth) is comparable to state-of-the-art SNP-calling software. A prototype implementation of SPRITE is publicly available.
|
| 14 |
Machine Learning |
|
| 15 |
Predictive Modeling for Job Power Consumption in HPC Systems |
Andrea Borghesi,Andrea Bartolini,Michele Lombardi,Michela Milano,Luca Benini |
|
Abstract
Power consumption is a critical aspect of next-generation High Performance Computing systems: supercomputers are expected to reach exascale in 2023, but this will require a significant improvement in energy efficiency. In this domain, power capping can significantly increase final energy efficiency by cutting cooling effort and worst-case design margins. A key aspect of an optimal implementation of power capping is the ability to estimate the power consumption of HPC applications before they run on the real system. In this paper we propose a machine-learning approach, based on the user's and application's resource requests, to accurately predict the power consumption of typical supercomputer workloads. We demonstrate our method on real production workloads executed on the Eurora supercomputer hosted at the CINECA computing center in Bologna, and we provide useful insights for applying our technique in other installations.
|
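A hedged sketch of the idea: learn job power from the resources a user requests. The features, synthetic data, and choice of a random-forest regressor below are illustrative assumptions, not the paper's actual model or the Eurora workload.

```python
# Predicting job power from requested resources (illustrative only).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_jobs = 500
# Features: [nodes, cores_per_node, gpus, requested_walltime_hours]
X = np.column_stack([
    rng.integers(1, 65, n_jobs),
    rng.integers(1, 17, n_jobs),
    rng.integers(0, 3, n_jobs),
    rng.uniform(0.1, 24.0, n_jobs),
])
# Synthetic "true" power: roughly proportional to cores and GPUs, plus noise.
y = 40 * X[:, 0] * X[:, 1] / 16 + 120 * X[:, 2] + rng.normal(0, 20, n_jobs)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print("predicted power (W):", model.predict([[16, 16, 2, 4.0]])[0])
```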
| 16 |
Towards Machine Learning on the Automata Processor |
Tommy Tracy II,Yao Fu,Indranil Roy,Eric Jonas,Paul Glendenning |
|
Abstract
A variety of applications employ ensemble methods, using a collection of decision trees, to quickly and accurately classify an input based on its vector of features. In this paper, we discuss the implementation of such a method, namely Random Forests, as the first machine learning algorithm to be executed on the Automata Processor (AP). The AP is an upcoming reconfigurable co-processor accelerator that supports the execution of numerous automata in parallel against a single input data-flow. Owing to this execution model, our approach is fundamentally different, translating Random Forest models from existing memory-bound tree-traversal algorithms to pipelined designs that use multiple automata to check all of the required thresholds independently and in parallel. We also describe techniques to handle floating-point feature values, which are not supported in the native hardware, pipelining of the execution stages, and compression of automata for the fastest execution times. The net result is a solution which, when evaluated using two applications, namely handwritten digit recognition and sentiment analysis, produces up to 63x and 93x speed-ups, respectively, over single-core state-of-the-art CPU-based implementations.
|
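The key transformation can be sketched in software: flatten each root-to-leaf path into a chain of per-feature threshold tests, which the AP can then match in parallel against the feature stream. The tiny two-feature tree below is invented for illustration.

```python
# Tree-traversal replaced by independent root-to-leaf threshold chains.
paths = [
    # (list of (feature_index, low, high) tests, predicted label)
    ([(0, float("-inf"), 2.5), (1, float("-inf"), 1.0)], "A"),
    ([(0, float("-inf"), 2.5), (1, 1.0, float("inf"))],  "B"),
    ([(0, 2.5, float("inf"))],                            "C"),
]

def classify(features):
    # On the AP all chains are matched in parallel against the input stream;
    # here we simply scan them sequentially.
    for tests, label in paths:
        if all(low <= features[f] < high for f, low, high in tests):
            return label

print(classify([1.0, 0.5]))   # -> "A"
print(classify([3.0, 9.9]))   # -> "C"
```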
| 17 |
AutoMOMML: Automatic Multi-objective Modeling with Machine Learning |
Prasanna Balaprakash,Ananta Tiwari,Stefan M. Wild,Laura Carrington,Paul D. Hovland |
|
Abstract
In recent years, automatic data-driven modeling with machine learning (ML) has received considerable attention as an alternative to analytical modeling for many modeling tasks. While ad hoc adoption of ML approaches has had success, the real potential for automation in data-driven modeling has yet to be achieved. We propose AutoMOMML, an end-to-end, ML-based framework to build predictive models for objectives such as performance and power. The framework adopts statistical approaches to reduce the modeling complexity and automatically identifies and configures the most suitable learning algorithm to model the required objectives based on hardware and application signatures. The experimental results using hardware counters as application signatures show that the median prediction errors of the performance, processor-power, and DRAM-power models are 13 %, 2.3 %, and 8 %, respectively.
|
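A minimal sketch of the "identify the most suitable learning algorithm" step, assuming scikit-learn candidates and synthetic data (the framework's actual search space and signatures differ): pick the candidate with the best cross-validation score.

```python
# Algorithm selection via cross-validation (illustrative candidate set).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 8))            # stand-in hardware counters
y = X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(0, 0.1, 300)  # nonlinear target

candidates = {
    "ridge": Ridge(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "knn": KNeighborsRegressor(),
}
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "-> selected:", best)
```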
| 18 |
Datacenters and Cloud |
|
| 19 |
Supercomputing Centers and Electricity Service Providers: A Geographically Distributed Perspective on Demand Management |
Tapasya Patki,Natalie Bates,Girish Ghatikar,Anders Clausen,Sonja Klingert,Ghaleb Abdulla, et al. |
|
Abstract
Supercomputing Centers (SCs) have high and variable power demands, which increase the challenges Electricity Service Providers (ESPs) face with regard to efficient electricity distribution and reliable grid operation. High penetration of renewable energy generation further exacerbates this problem. In order to develop a symbiotic relationship between SCs and their ESPs, and to support effective power management at all levels, it is critical to understand and analyze how the existing relationships were formed and how they are expected to evolve. In this paper, we first present results from a detailed, quantitative survey-based analysis and compare the perspectives of European grids and SCs with those of the United States (US). We then show that, contrary to expectation, SCs in the US are more open toward cooperating and developing demand-management strategies with their ESPs. In order to validate this result and to enable a thorough comparative study, we also conduct a qualitative analysis by interviewing three large-scale, geographically distributed sites: Oak Ridge National Laboratory (ORNL), Lawrence Livermore National Laboratory (LLNL), and the Leibniz Supercomputing Centre (LRZ).
|
| 20 |
Resource Management for Running HPC Applications in Container Clouds |
Stephen Herbein,Ayush Dusia,Aaron Landwehr,Sean McDaniel,Jose Monsalve,Yang Yang,Seetharami R. Seelam, et al. |
|
Abstract
Innovations in operating-system-level virtualization technologies such as resource control groups, isolated namespaces, and layered file systems have driven a new breed of virtualization solutions called containers. Applications running in containers depend on the host operating system (OS) for resource allocation, throttling, and prioritization. However, the OS is designed to provide only best-effort/fair-share resource allocation. This lack of resource management, of the kind found in virtual machine managers, constrains the use of containers and container-based clusters to workloads other than traditional high-performance computing (HPC) workflows. In this paper, we describe the problems that fair-share management of CPU, network bandwidth, and I/O bandwidth causes for HPC workloads, and we present mechanisms to allocate, throttle, and prioritize each of these three critical resources in containerized HPC environments. These mechanisms enable container-based HPC clusters to host applications with different resource requirements and to enforce effective resource use, so that a large collection of HPC applications can benefit from the flexibility, portability, and agile characteristics of containers.
|
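As one concrete (and heavily simplified) example of such throttling, the sketch below caps a process group's CPU share by writing Linux cgroup-v1 control files. It assumes a v1 hierarchy mounted at /sys/fs/cgroup and root privileges, and the group name is hypothetical; the paper's mechanisms are more elaborate and also cover network and I/O bandwidth.

```python
# CPU throttling for a container/job via the cgroup-v1 CFS bandwidth interface.
import os

CG = "/sys/fs/cgroup"

def limit_cpu(group: str, cores: float) -> None:
    """Cap a cgroup at `cores` CPUs using cpu.cfs_quota_us / cpu.cfs_period_us."""
    base = os.path.join(CG, "cpu", group)
    os.makedirs(base, exist_ok=True)      # creating the dir creates the cgroup
    period_us = 100_000
    with open(os.path.join(base, "cpu.cfs_period_us"), "w") as f:
        f.write(str(period_us))
    with open(os.path.join(base, "cpu.cfs_quota_us"), "w") as f:
        f.write(str(int(cores * period_us)))

def add_task(group: str, pid: int) -> None:
    # Move a process into the cgroup so the cap applies to it.
    with open(os.path.join(CG, "cpu", group, "tasks"), "w") as f:
        f.write(str(pid))

if __name__ == "__main__":
    limit_cpu("hpc_job_42", cores=1.5)   # hypothetical job name
    add_task("hpc_job_42", os.getpid())
```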