Research: Big Data / Supercomputing
I joined the National Center for Supercomputing Applications (NCSA) in 2000 and remain actively involved as a Center Affiliate. My work over the last decade has relied heavily on high performance computing (HPC) techniques, and the use of advanced computing systems and algorithms to make sense of "big data": massive high-dimensionality datasets (such as every large human event on earth since WWII, or every news or social media post around the world on climate change).
- "Big Data". As with all terms du jour, nearly everyone today claims to be a "big data" expert, but analyzing truly large-scale data collections ranging into the hundreds of billions of data points across large numbers of dimensions requires highly specialized approaches. Many factors must be taken into account: computational (exploiting algorithmic optimization and data architecture techniques to make problems computationally tractable), data processing (sifting through the noise and making results robust against conflicting information), and interface (how does one visually present an analysis of a network with 500 million connections?).
- Fixed-Time Criticality. Most computing is based around finishing as much as you can as quickly as you can. However, in certain specialized application areas, there are fixed windows of time to make a decision. A massive collection of data is available that describes the current environment, and computation proceeds through the data, iteratively generating a higher-and-higher quality decision until the time limit is reached and the decision must be rendered regardless of whether all data has been examined. This kind of application area requires highly specialized strategies for "triaging" data and discarding non-relevant material.
- Hardware and Software Architectures. I've worked extensively with a wide variety of HPC systems, including large- and small-scale SMP systems, clusters, and distributed clusters; different types of memory hierarchies; disk, network, and in-system IO architectures; and optimization and instrumentation. I have extensive experience in massively-parallel application architectures and the requisite processes that underlie them.
- Transformation. High performance computing, or "supercomputing", is as much about the processes that transform a "grand challenge" problem into something computationally tractable as it is about hardware and software. How does one take the goal of "catalog all known human events and use them to predict future events", convert such an abstract thought into a computational approach, and then optimize that approach so that it can be carried out on current hardware within the time constraints of the application?
- Novel Applications. Traditional scientific research has historically been hypothesis-driven: a theory is generated and experiments are carried out to support or refute it. Big Data science, on the other hand, offers the tantalizing capability to examine a large experimental dataset and extract its notable features, including new insights from existing datasets. For example, in one recent project, a well-known longitudinal US government dataset was placed into an analytical system that performed pairwise correspondence analysis on 2.1 billion datapoints, automatically identifying unexpected relationships among core variables. The system discovered several previously unknown relationships in a dataset that has been widely used for nearly half a century.
- Technical Brief: IO Performance and Configuration Benchmarking on High Performance Cluster and SMP Computing Systems at the National Center for Supercomputing Applications. (2005, April 12). This is a technical summary brief of IO performance benchmarking performed on NCSA HPC systems on behalf of the United States National Archives and Records Administration. The purpose of this study was to examine various SMP and clustered filesystem environments to determine stability and performance curves over different types of access and storage patterns. One core question for the project was whether cheap commodity clusters and clustered filesystems can achieve sustained performance levels for large-scale document repositories similar to those possible through SMP "monolithic" storage systems.
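
The fixed-time pattern described under "Fixed-Time Criticality" above can be illustrated with a minimal Python sketch. It is not code from any of the projects mentioned here. The function name decide_within_deadline, the relevance and score callbacks, and the min_relevance threshold are all hypothetical, and the "decision" is simplified to picking the single best-scoring record. The sketch triages the data first (dropping non-relevant records and ordering the rest by relevance), then keeps improving its answer until the time window closes.

```python
import heapq
import time


def decide_within_deadline(records, relevance, score, time_budget_s,
                           min_relevance=0.0, clock=time.monotonic):
    """Return (best_record, records_examined) once the time budget expires.

    Hypothetical illustration: `relevance` is a cheap triage estimate and
    `score` is the expensive evaluation, both supplied by the caller.
    """
    deadline = clock() + time_budget_s

    # Triage: discard records below the relevance threshold and arrange the
    # rest so the most relevant ones are examined first.
    heap = []
    for index, record in enumerate(records):
        rel = relevance(record)
        if rel >= min_relevance:
            # Negate relevance because heapq is a min-heap; the index breaks
            # ties so the records themselves are never compared.
            heap.append((-rel, index, record))
    heapq.heapify(heap)

    best, best_score, examined = None, float("-inf"), 0

    # Anytime loop: the current best answer is always available, and it is
    # returned as-is when the deadline passes, even if data remains.
    while heap and clock() < deadline:
        _, _, record = heapq.heappop(heap)
        current = score(record)
        examined += 1
        if current > best_score:
            best, best_score = record, current

    return best, examined
```

A caller would pass the raw records, the two callbacks, and the length of the decision window in seconds. The clock parameter exists only so the deadline can be simulated in tests. In a real system the triage step would typically be far more elaborate than a single threshold.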