Error detection and correction / Fault-tolerant system / Redundancy / Computer cluster / Triple modular redundancy / Fault-tolerant computer systems / Computing / Computer architecture


Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing
David Fiala, Frank Mueller
North Carolina State University, Raleigh, NC
{dfiala,fmuelle}@ncsu.edu

Document Date: 2013-02-11 03:37:43



File Size: 349,46 KB


City

Salt Lake City / Dublin

Company

Checkpoint / Los Alamos National Laboratory / Lawrence Livermore National Laboratory / Sandia Corporation / RedMPI / AMD / Nvidia / Irish Research Council for Science / Engineering & Technology / IBM / Lockheed Martin Corporation / Oak Ridge National Laboratory / UT-Battelle LLC / Sandia National Laboratories

Country

United States / Ireland

Currency

USD

Event

FDA Phase / Product Issues / Reorganization

Facility

Laboratory Directed Research and Development Program / Frank Mueller North Carolina State University / RedMPI library

IndustryTerm

multicore chips / software failures / exascale computing / correction protocol / owned subsidiary / receiver-side protocol / experimental applications / system software layer / parallel debugging tools / communication protocol / progressive applications / input chain / message-passing applications / online comparison / information technology / internal communications / search area / energy-delay characteristics / capacity computing / computing / suited protocols / integrated fault injection tool / message correction protocol / software faults / large systems / Pure software-based solutions / control systems / consistency protocols / memory module chip / verifiable point-to-point communication protocol / long-running applications / redundant computing / correction protocols / large scale applications / weak scaling applications / Software redundancy

OperatingSystem

Linux / Fermi

Organization

U.S. Department of Energy / Irish Research Council for Science / Engineering & Technology / Oak Ridge National Lab / U.S. Securities and Exchange Commission / DED / National Science Foundation / Industrial Development Agency / Frank Mueller North Carolina State University Raleigh / U.S. Department of Energy’s National Nuclear Security Administration

Person

Kurt Ferreira / Ron Brightwell Sandia / David Fiala / Christian Engelmann Oak Ridge Natl / Frank Mueller North / Rolf Riesen

Position

memory controller

Product

MG 2x OV / Fermi Tesla GPUs / SDC injection / A. Targeted Fault Injections / SDC injections / All SDC injections / messages / MPI messages / memory / SDC

ProgrammingLanguage

R / C

ProvinceOrState

Utah / HEC

PublishedMedium

Engineering & Technology

Technology

radiation / receiver-side protocol / RAM / memory module chip / multicore chips / Linux / information technology / API / correction protocols / operating system / communication protocol / two consistency protocols / voting algorithm / 512 processors / message correction protocol / consistency protocols / verifiable point-to-point communication protocol / 128 processors / Ethernet / 1536 processors / correction protocol / simulation

SocialTag