Research: Big Data

Analyzing truly large-scale data collections ranging into the hundreds of billions of data points across large numbers of dimensions requires highly specialized approaches. Many factors must be taken into account: computational (exploiting algorithmic optimization and data architecture techniques to make problems computationally tractable), data processing (sifting through the noise and making results robust against conflicting information), and interface (how does one visually present a network analysis of a network with 500 million connections?). My work makes use of a wide array of approaches including GIS, intelligence, network analysis, relationship extraction, sentiment analysis, and translation.

High Performance Computing (HPC) / Supercomputing

I joined the National Center for Supercomputing Applications (NCSA) in 2000 and remain actively involved as a Center Affiliate. My work over the last decade has relied heavily on high performance computing (HPC) techniques and the use of advanced computing systems and algorithms to make sense of "big data": massive high-dimensionality datasets (such as every large human event on earth since WWII, or every news or social media post around the world on climate change).

  • "Big Data". As with all terms du jour, nearly everyone today claims to be a "big data" expert, but analyzing truly large-scale data collections ranging into the hundreds of billions of data points across large numbers of dimensions requires highly specialized approaches. Many factors must be taken into account: computational (exploiting algorithmic optimization and data architecture techniques to make problems computationally tractable), data processing (sifting through the noise and making results robust against conflicting information), and interface (how does one visually present a network analysis of a network with 500 million connections?).
  • Fixed-Time Criticality. Most computing is based around finishing as much as you can as quickly as you can. However, in certain specialized application areas, there are fixed windows of time to make a decision. A massive collection of data is available that describes the current environment, and computation proceeds through the data, iteratively generating a higher-and-higher quality decision until the time limit is reached and the decision must be rendered regardless of whether all data has been examined. This kind of application area requires highly specialized strategies for "triaging" data and discarding non-relevant material.
  • Hardware and Software Architectures. I've worked extensively with a large variety of HPC systems, both large and small-scale SMP systems, clusters, distributed clusters, different types of memory hierarchies, disk, network, and in-system IO architectures, and optimization and instrumentation. I have extensive experience in massively-parallel application architectures and the requisite processes that underlie them.
  • Transformation. High performance computing, or "supercomputing", is as much about hardware and software as it is about the processes that transform a "grand challenge" problem into something computationally tractable. How does one take the goal of "catalog all known human events and use them to predict future events", convert such an abstract aim into a computational approach, and then optimize that approach such that it can be done using current hardware technology within the necessary time constraints of the application?
  • Novel Applications. Traditional scientific research has historically been hypothesis driven: a theory is generated and experiments carried out to support or refute it. Big Data science, on the other hand, offers the tantalizing capability to examine a large experimental dataset and extract the notable features, including the extraction of new insights from existing datasets. For example, in one recent project, a well-known longitudinal US government dataset was placed into an analytical system that performed pairwise correspondence analysis on 2.1 billion datapoints, automatically identifying unexpected relationships among core variables. The system discovered several previously-unknown relationships in a dataset which has been widely used for nearly half a century.
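The fixed-time criticality pattern described above is essentially an "anytime" computation: keep refining the best available answer until the deadline, then render the decision regardless of how much data remains. A minimal sketch in Python (all names are illustrative, not drawn from any production system):

```python
import time

def decide_under_deadline(records, score, deadline_s):
    """Refine the best available answer until a fixed time budget expires.

    records: iterable of candidate data items (ideally pre-triaged so the
             most promising material is examined first)
    score: function mapping an item to a numeric quality estimate
    deadline_s: wall-clock budget in seconds
    """
    start = time.monotonic()
    best_item, best_score = None, float("-inf")
    examined = 0
    for item in records:
        # Deadline reached: render the decision even if data remains unseen.
        if time.monotonic() - start >= deadline_s:
            break
        s = score(item)
        examined += 1
        if s > best_score:
            best_item, best_score = item, s
    return best_item, best_score, examined
```

With a generous budget the entire collection is examined; with a tight one, only the items that the triage stage ranked first ever get scored, which is why the data-ordering strategy matters as much as the scoring itself.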
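The pairwise scanning idea in the last bullet can be illustrated with a far simpler stand-in: an exhaustive scan that flags unexpectedly strong pairwise relationships among variables. (The project described used correspondence analysis; this sketch substitutes plain Pearson correlation, and the flagging threshold is arbitrary.)

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length numeric columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def flag_relationships(columns, threshold=0.8):
    """Exhaustively scan all variable pairs and flag strong associations.

    columns: dict mapping variable name -> list of numeric observations
    """
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged
```

At billions of data points the pair enumeration itself becomes the bottleneck, which is where the HPC techniques discussed above come in.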

Nearly all of my research themes make extensive use of "big data" supercomputing as their key enabler, but there are also some stand-alone studies I have conducted:


I am the chief architect of the largest open source intelligence project in academia, leveraging commercial, governmental, and declassified intelligence products to produce global databases of human activity across multiple disciplines over time. One project involves cataloging all "societal stability" events (riots, assassinations, protests, etc) across the world from 1946 to present using tens of millions of local news reports captured and translated from the local presses of each nation. I was also the chief architect of the NCSA VIAS project, one of the early "web-scale" industry monitoring systems. I am founder of the Carbon Capture Report, the premier global climate change news analytics service. I have worked extensively on a wide range of corporate intelligence, event mining, and media monitoring initiatives, especially issues such as trend mining and public perception.


I have worked extensively on a wide array of media analysis, brand mining, industry monitoring, and public perception problems, including numerous "grand challenge" problems.

  • Public Perception. Public perception monitoring is the use of sophisticated perceptual models of human response to stimuli to model the overall reaction to, and perception of, a given topic in the mainstream and social media spheres. I have worked extensively on public perception monitoring for a wide range of organizations and topical areas, including areas such as sentiment analysis.
  • Brand Mining. I've worked on numerous "brand mining" initiatives to develop realtime and broad-scoped metrics of how a particular corporate identity or brand is being perceived and portrayed in the mainstream and social media spheres.
  • Industry Monitoring. Brand mining is ideal for flagship brands, but most large corporations have hundreds or even thousands of distinct brands, making it difficult to see the company's overall performance in the marketplace. More importantly, trends in a single brand are not meaningful without the surrounding context of the overall marketplace's performance. For example, a monitoring system tracking an automobile brand 2000-2009 would note that coverage in 2009 was more subdued and had much less volume than 2000. This metric by itself would be misleading, however, since the automobile industry as a whole has fallen in coverage, and an individual brand's performance must be normalized by this larger shift. Industry monitoring captures the state of an entire industry and allows such comparisons.
  • Competitive and Risk Analysis. Industry trends and public perception patterns are all critical components in the risk analysis of new product launches and evolving brands. Robust insights into competitive risk are only possible with a detailed understanding of media flow and interaction patterns and the proper integration of monitoring technologies.
Carbon Capture Report

One of my current signature initiatives is the Carbon Capture Report, a global news and social media monitoring and public perception analytics platform for studying global patterns in climate change coverage. The system continually monitors the mainstream and social media spheres, including the blogosphere, Twitter, YouTube, and social bookmarking sites, in realtime, performing "crowdsourcing", advanced geographic intelligence analysis, data mining and analytics, and compiling autonomous bibliographies of all people and companies mentioned in the news and their relationships. The core underlying technology is easily adapted to other industries or topical areas and a wide range of more sophisticated analytical services are available.


I was the chief architect and technical director of the NCSA VIAS project. VIAS is an award-winning domain-specific information retrieval, archival, and processing system. It has been at the forefront of a number of Fortune 50 corporate intelligence initiatives and deployed extensively in data mining and intelligence applications for corporate, academic, and governmental clients over the better part of a decade. A VIAS Dynamic Knowledge Repository (DKR) is seeded with a set of topic words, URLs, and other information that define the operational parameters of its topical area. VIAS then monitors all known mailing lists, USENET groups, available wirefeeds, and other streaming sources, as well as autonomously crawls the web using a federation of directed web crawlers to collect information on this topic. All information is screened for relevance against the topical parameter space and then undergoes a series of metadata extraction and production processes, where filters identify person, company, and organization names, bibliographic references, and acronym resolutions, together with higher-order processing such as summary abstract generation. An advanced analytics and visualization suite offers a range of features based on this generated metadata, such as temporal entity graphs and relationship mining.


I've worked extensively on the intersection of the social and mainstream media spheres, including diffusion across geographic and typographic borders. I've developed a large array of robust fully automated analytical tools that have been applied in a number of large intelligence and media monitoring initiatives, including advanced crowdsourcing platforms.


In addition to the Drudge Report study above, I've done several other projects externally characterizing specific media properties, one of the most recent being a study of the Chicago Tribune's flagship web property:

  • Chicago Tribune: Content Velocity Analysis. This study, conducted for the Center for Research Libraries, was designed to answer key questions about the volume of new content added to the Chicago Tribune's website over a one-month period from September to October 2010, the overall rate of change, linking structure and ease of traversal for archival crawlers, and overall structure, linking, and content characterization considerations. Crawlers were used to archive all 105 gateway pages every 30 minutes, resulting in a total of 136,605 snapshots of the site's content.


I am the chief architect of the Social Political Economic Event Database (SPEED) at the University of Illinois Cline Center for Democracy, an effort to compile a global event database covering 172 countries from 1946-present. The resulting database will consist of all major human events in those categories occurring anywhere on earth in the past 60 years, offering an unprecedented view into the underpinnings of societal evolution. More than 1,200 variables are captured for each event, including date and location to city or landmark resolution. A combination of machine and human processes is used to automate much of the heavy lifting of event coding, while still allowing for the sophistication of interpretation afforded by human analysts. The combined event dataset takes the form of a vast network of human activity, with each event being a node and all 1,200 variables acting as links connecting those events. Predictive analytics and "scenario modeling" can ultimately use this event network to forecast sets of potential outcomes for emerging conflict across the globe.


Natural Language Processing

Natural language processing, the use of automated tools to extract meaning from text, is a core enabling technology underlying much of my work. I have been developing "web-scale" information extraction and unstructured NLP tools for more than a decade and have developed a number of innovative approaches to high-accuracy, robust text classification and tagging systems.


One of the most critical tasks when working with large document collections is classification and tagging: assigning category and topical keywords for filtering, search, and analysis tasks. For example, when searching for news articles about "emerging technology", there is no single set of keywords that can be used to search for such articles. Instead, text classification systems can "learn" from examples or automatically derive natural categories from document collections to evaluate whether a given document matches the learned criteria, permitting very sophisticated and multifaceted concepts to be reliably searched.
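As a toy illustration of "learning from examples", here is a minimal multinomial Naive Bayes text classifier. The production systems described in this section are far more sophisticated; the labels and training documents in the test are invented for illustration.

```python
from collections import Counter
from math import log

class TinyTextClassifier:
    """Minimal multinomial Naive Bayes: learns a concept from labeled examples."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()

    def train(self, labeled_docs):
        """labeled_docs: iterable of (text, label) pairs."""
        for text, label in labeled_docs:
            words = text.lower().split()
            self.word_counts.setdefault(label, Counter()).update(words)
            self.doc_counts[label] += 1
            self.vocab.update(words)

    def classify(self, text):
        """Return the label with the highest posterior log-probability."""
        total_docs = sum(self.doc_counts.values())
        v = len(self.vocab)
        best_label, best_lp = None, float("-inf")
        for label, counts in self.word_counts.items():
            # Log prior plus Laplace-smoothed log likelihood of each word.
            lp = log(self.doc_counts[label] / total_docs)
            total = sum(counts.values())
            for w in text.lower().split():
                lp += log((counts[w] + 1) / (total + v))
            if lp > best_lp:
                best_label, best_lp = label, lp
        return best_label
```

The Laplace smoothing is what lets the model score documents containing words it has never seen, a small instance of the robustness problem discussed below.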

  • High Accuracy. The models I've built on historical document collections for complex subjects achieve false negative and false positive rates of less than 1% when run against collections in the tens of millions of files, which is extremely accurate at that scale.
  • Robustness of Data. Traditional classification models work very well on the specific dataset they were trained on, but perform poorly when applied to other datasets. For example, a model trained to recognize labor riots in New York Times news articles would likely fare much worse when performing the same task on translated ITAR-TASS reports. The systems I've built are able to perform equally well on very different datasets, including translated broadcast intercepts.
  • Automatic Model Growing. One of the approaches I've found to yield the highest accuracy is to literally "grow" categorization models using a set of growth and fitness functions to automatically determine the parameter space yielding the highest accuracy.
  • Taxonomy Development. I have considerable experience working with domain subject experts to develop taxonomies suitable for automated classification tasks and the iterative process to determine consolidation boundaries.
  • Automated Taxonomy Development. I also have experience developing automated taxonomies using the underlying document collection as a guide.
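The "model growing" idea above can be sketched as a simple evolutionary search: each generation spawns mutated copies of the current best parameter vector and keeps whichever scores highest on a fitness function. This is a toy sketch under stated assumptions; the function names, mutation scheme, and defaults are all illustrative, not the actual growth and fitness functions used.

```python
import random

def grow_model(fitness, seed_params, generations=20, brood=8, step=0.1, rng=None):
    """'Grow' a model configuration toward higher fitness.

    fitness: function mapping a parameter vector to a score (higher is better)
    seed_params: starting parameter vector (list of floats)
    """
    rng = rng or random.Random(0)  # seeded for reproducibility
    best = list(seed_params)
    best_fit = fitness(best)
    for _ in range(generations):
        for _ in range(brood):
            # Mutate every parameter slightly and keep the child if it is fitter.
            child = [p + rng.uniform(-step, step) for p in best]
            f = fitness(child)
            if f > best_fit:
                best, best_fit = child, f
    return best, best_fit
```

In a classification setting the fitness function would be held-out accuracy, which makes each evaluation expensive and motivates running the search on HPC resources.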


So-called "structured data", which breaks information into preset fields with specific meaning associated with each field and each value (i.e., a database or spreadsheet), is extremely powerful in that it can be subjected to a wide range of statistical and analytical techniques. "Unstructured" data like textual documents are far more complex to incorporate into an analytical workflow in that they must first be converted into measurable quantities, known as "features".

It is very easy to build systems that perform very well in a laboratory setting, and "entity extraction" systems that identify names or keyword generation tools are all common computer science projects. Yet robustly operating over the full range of material found on the open web or in real-world document collections is a much harder problem. I have vast experience in building "web-scale" systems that maintain high accuracy at converting unstructured textual data into structured database fields using a combination of "expert"-derived rulesets and machine-based statistical and learning approaches.
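The ruleset half of that combination can be illustrated with a few regular-expression rules that pull structured fields out of free text. The patterns below are deliberately simplistic assumptions for illustration; real rulesets run to thousands of patterns layered with statistical models.

```python
import re

# Illustrative ruleset: one pattern per structured field.
RULES = {
    "company": re.compile(r"\b([A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*) (?:Inc|Corp|Ltd)\.?"),
    "money":   re.compile(r"\$\d+(?:\.\d+)?\s?(?:million|billion)?"),
    "year":    re.compile(r"\b(?:19|20)\d{2}\b"),
}

def extract_fields(text):
    """Convert unstructured text into a dict of field -> matched values."""
    record = {}
    for field, pattern in RULES.items():
        matches = [m.group(0) for m in pattern.finditer(text)]
        if matches:
            record[field] = matches
    return record
```

Patterns like these perform well on clean newswire and fail on the open web's messier prose, which is exactly the laboratory-versus-real-world gap described above.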


Another key research area I've worked extensively with is "page extraction": identifying the core textual content on a complex multifaceted web page and separating it from stylistic elements, banners, navigation bars, advertisements, user commenting sections, etc. The core algorithms I developed at the National Center for Supercomputing Applications have been in use for more than a decade and have been shown to be robust against content from all major categories of websites in all countries of the world in all languages, in HTML ranging from fully compliant to massively invalid. The system uses a highly adaptive architecture capable of realtime inference and reasoning, yet yields performance levels capable of operating at "web scale" in realtime.
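A bare-bones version of this idea is a text-density heuristic: navigation bars and ad slots are tag-heavy with little visible text, while article paragraphs are the reverse. This sketch is not the NCSA algorithm described above, which also smooths scores across neighboring blocks and adapts its thresholds; here the threshold is an arbitrary assumption.

```python
import re

TAG = re.compile(r"<[^>]+>")

def extract_main_text(html, min_density=10.0):
    """Keep lines whose ratio of visible text to markup is high.

    min_density: minimum characters of visible text per tag (illustrative).
    """
    kept = []
    for line in html.splitlines():
        tags = len(TAG.findall(line))          # markup weight of the line
        text = TAG.sub("", line).strip()       # visible text with tags removed
        if len(text) / (1 + tags) > min_density:
            kept.append(text)
    return "\n".join(kept)
```

Even this crude scorer separates article prose from link farms on simple pages; the hard cases are comment threads and heavily templated international sites.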

Network Analysis / Relationship Extraction

Many of my research themes make implicit or explicit use of large-scale network representation, visualization, and analysis, in particular the synthesis of structured and large unstructured data sources into single composite networks in the face of conflicting and incomplete information. For example, analyzing a large collection of news articles on a given industry and automatically identifying actors in that industry and their relationships.


Displaying large-scale network data in forms conducive to human analysis is a very complex endeavor: large-scale networks by their very nature consist of enormous levels of fine-grain detail arranged over a very large area. A good analogy is viewing imagery of the craters in the surface of Mars on a desktop computer: if one zooms in far enough to see the individual craters, the macro-level patterns of their arrangement is lost, but when one is zoomed out far enough to see the entire image at once, the smaller craters are lost. Only when using a high-resolution large-format display (in this case a 40-projector seamless tiled display wall) can one see the entire image at once at native resolution and understand the broader patterns in the crater impact sites. Similarly, network visualization tools must make the broader patterns in large-scale networks visible, while not obscuring the fine-level detail of those connections. Analytical tools must be able to make sense of the massive web of connections in network data and sift through the noise to find insight.

  • Network Visualization. There is no single approach that works well for all applications. Some tools are better suited for specific data types or for analysts versus decision makers. Display tools may be two dimensional desktop applications, three dimensional desktop, or three dimensional stereoscopic. Visualizations may use Euclidean, Hyperbolic, or other projections or subsetting techniques, each of which comes with specific interpretive nuances that must be taken into consideration.
  • Network Analysis. A wide range of very powerful techniques are available when data is in a network format: the interconnectedness of a network can be measured, determining its vulnerability to the elimination of selected nodes (i.e., are there specific nodes that can be removed to render a given network incomplete and actually sever communication between selected segments), and critical nexus nodes can be identified, for example.
  • Predictive Analytics. One application of network analysis is in predictive analytics. In one approach, known as "scenario modeling", a given sequence of events can be presented as a segment of a network and then overlaid onto a larger network of historical events. Much like a puzzle piece fitting into a puzzle, the small event sequence network will likely match several historical episodes in the larger network, which can then be highlighted for analysts to derive likely outcomes and possible remedial factors influencing the current episode.
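The vulnerability question in the second bullet can be sketched with a brute-force test: remove each node in turn and check by breadth-first search whether the rest of the network stays connected. This is illustrative only; production tools use linear-time articulation-point algorithms rather than this quadratic scan.

```python
from collections import defaultdict, deque

def reachable(adj, start, removed=frozenset()):
    """Set of nodes reachable from start by BFS, skipping 'removed' nodes."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in seen and nb not in removed:
                seen.add(nb)
                queue.append(nb)
    return seen

def cut_nodes(edges):
    """Brute-force the nexus nodes whose removal disconnects the network."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    nodes = set(adj)
    cuts = []
    for n in nodes:
        rest = nodes - {n}
        start = next(iter(rest))
        # If BFS from any surviving node no longer covers all survivors,
        # removing n severed communication between segments.
        if reachable(adj, start, removed={n}) != rest:
            cuts.append(n)
    return sorted(cuts)
```

On a 500-million-edge network even the linear-time variants require the distributed-memory techniques discussed in the HPC section.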

As part of a larger project with the United States National Archives and Records Administration (NARA), the ENRON email collection was used as a sample dataset to examine various modes of interaction with large-scale communication datasets.

  • NARA Email Network Visualization and Analysis. This progress report outlines a set of very basic, but extremely powerful, analytical visualizations designed to explore the ENRON email collection using large-scale display surfaces.


A key subcomponent of network analytics is "relationship extraction": the autonomous identification of different classes of connections between entities in a data repository and the conversion of those relationships into structured network connections. For example, a "relationship mining" system might analyze a corpus of news articles covering a specific industry and identify mentions of partnerships or other connections between companies and individuals in that space, and convert the unstructured knowledge represented in that large collection of textual documents into a structured network diagram that can be subjected to various network analysis techniques.

  • Converting Unstructured Data to Structured Networks. Unstructured textual document collections contain substantial amounts of information on specific entities and the connections between those entities. Using entity and relationship extraction tools, it is possible to automatically construct networks that represent the web of connections in textual documents. For example, as noted above, a relationship extraction system can produce a network showing the connections between various companies in a given industry from a corpus of news articles. Of particular interest, systems can often infer hidden relationships by examining patterns in the web of topics and relationships associated with each company.
  • Combining Structured and Unstructured Data. The most powerful networks combine the benefits of structured data with the untapped richness of unstructured data to synthesize extremely rich networks.
  • Conflicting / Incomplete Information. Large-scale networks constructed from real-world data must contend with conflicting and incomplete information sources. One input source might specify that person A is person B's brother, while another might list A as B's mother. A common name might appear in hundreds of unrelated contexts. All of these issues must be resolved to a sufficient level of accuracy, or the underlying analytical techniques made sufficiently robust, to permit informed decisions to be made using the data.

One system I developed that is in use at the University of Illinois exploits many aspects of network analysis and relationship extraction to monitor project-specific email mailing lists and automatically build complex models of communicative patterns among participants, including associating users based not only on structured information (who emails whom / who replies to whom), but also unstructured (who tends to ask similar questions or discusses similar topical areas).

GIS / Spatial Analysis

I have worked extensively with issues of spatial representation and applying GIS and spatial intelligence to global-scale problems, including the automated geocoding and analysis of translated historical news archives and broadcast intercepts numbering in the tens of millions of documents.

Geographic Information Systems (GIS) and the use of spatial information in non-traditional contexts is a tremendous growth area. For example, historians researching the life of a historical figure now use automated geocoding tools to identify geographic references in documents and letters about that person's life and arrange them on a map. Organizing such information spatially and through time can highlight previously undetected patterns in the individual's life.

  • Geocoding. The term "geocoding" refers to the use of software tools to examine textual documents, automatically identify geographic references, and disambiguate them to a specific set of latitude and longitude coordinates. Geocoding is a highly overused term in today's marketing world, but a true geocoding system is capable of using a variety of contextual information to achieve the highest level of accuracy. For example, an article in a local newspaper in Paris, Illinois that describes a sale at a "Paris department store" likely refers to a local store, not one in Paris, France. Conversely, an Iraqi newspaper report about events in "Cairo" probably refers to Cairo, Egypt, not one of the other 30+ Cairos in the world.
  • International and Historical Geocoding. Geocoding on a global scale and being able to properly disambiguate even obscure and/or localized geographic references from translated and historical materials, where names may have changed over time, is an extremely complex undertaking.
  • Spatial Analysis and Spatial Representation. Once information is in a spatial context, powerful analytical capabilities become available that represent a fundamentally new way of thinking for many organizations and research disciplines. The representation of that information in visual form adds a further layer of complexity. Simply placing dots on a map or coloring countries using a generic scale can result in wildly misleading interpretations: perceptual modeling is critical in representing spatial information.
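The "local paper" disambiguation rule from the geocoding bullet can be sketched as a scoring problem: rank candidate locations by a population prior, heavily re-weighted toward candidates in the document's own country. The mini-gazetteer, coordinates, and bias factor below are illustrative assumptions, not real gazetteer data.

```python
# Hypothetical mini-gazetteer: each place name maps to candidate locations.
GAZETTEER = {
    "Paris": [
        {"country": "France", "latlon": (48.86, 2.35), "population": 2_100_000},
        {"country": "United States", "latlon": (39.61, -87.70), "population": 8_000},
    ],
    "Cairo": [
        {"country": "Egypt", "latlon": (30.04, 31.24), "population": 9_500_000},
        {"country": "United States", "latlon": (37.01, -89.18), "population": 1_700},
    ],
}

def geocode(name, source_country=None, local_bias=1000.0):
    """Pick the most plausible candidate for a place name.

    source_country: country of the publication the mention appears in;
    candidates in that country get a large multiplicative boost.
    """
    best, best_score = None, -1.0
    for cand in GAZETTEER.get(name, []):
        score = cand["population"]
        if source_country and cand["country"] == source_country:
            score *= local_bias  # the "local paper" rule
        if score > best_score:
            best, best_score = cand, score
    return best
```

Real systems add many more context signals (nearby place names, topic, historical name changes), but the prior-plus-context structure is the same.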

Sentiment Analysis

I have worked extensively with sentiment analysis systems, applying them to international and domestic news coverage, historical and contemporary material, document collections, social media, and communications flows for academic, governmental, and corporate projects. I have also worked on cross-domain and learning systems for maximizing the accuracy of such systems by tailoring them for specific disciplines.

Sentiment mining, also known as "tone analysis", uses complex perceptual models to estimate the average emotional response of a reader to a given passage of text or the emotional state of the author of that text. Additional dimensions may measure the potential "energy" and "persuasiveness" of the text. Sentiment mining has become an increasingly popular tool for corporate "brand mining," offering a rough estimate of the overall tone of coverage of that organization.

  • Whole-Document vs Entity Sentiment. Basic systems estimate only the overall tone of a given document. A blogger posting that she loves the lens of her new camera but hates the battery life would result in a net tone score of neutral, because the extreme positive and extreme negative scores negate each other at the document level. Entity-level systems delve further to separate which aspects of the product or brand the customer likes and dislikes.
  • Human-Trained vs Learning Systems. Most current sentiment mining systems are built by having a large team of human editors (often college students) read through the dictionary or through a large body of documents, and assign scores from positive to negative to each word, which are then aggregated together. Learning systems learn the "contexts" of positive and negative words and continually update their internal lexicons with new words and "sayings," allowing them to evolve to changing language use.
  • Cross-Domain. Sentiment systems tend to perform best on the text they were initially trained for. A system trained on movie reviews will traditionally do poorly when scoring computer reviews or news articles. Domain customization and cross-domain techniques adjust the internal models to work on data from other disciplines. This requires an understanding of language use and perceptual models.
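The whole-document versus entity distinction above can be made concrete with a tiny aspect-level scorer: each sentiment word votes for the nearest aspect term, so a mixed review that nets out to "neutral" at the document level still yields informative per-aspect scores. The lexicon and aspect lists are invented for illustration; real systems learn them.

```python
# Illustrative lexicon and aspect terms (assumptions, not a real resource).
LEXICON = {"love": 1, "great": 1, "sharp": 1, "hate": -1, "poor": -1, "short": -1}
ASPECTS = {"lens", "battery", "screen"}

def aspect_sentiment(text):
    """Return (per-aspect scores, whole-document score) for a passage.

    Each sentiment word contributes to the document total and, when aspect
    terms are present, also votes for the nearest aspect by word distance.
    """
    words = [w.strip(".,!?").lower() for w in text.split()]
    aspect_pos = [(i, w) for i, w in enumerate(words) if w in ASPECTS]
    scores, doc_score = {}, 0
    for i, w in enumerate(words):
        polarity = LEXICON.get(w)
        if polarity is None:
            continue
        doc_score += polarity
        if aspect_pos:
            nearest = min(aspect_pos, key=lambda p: abs(p[0] - i))[1]
            scores[nearest] = scores.get(nearest, 0) + polarity
    return scores, doc_score
```

For the camera-review example in the first bullet, the document score comes out neutral while the lens and battery aspects score positive and negative respectively.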


I work extensively with translated material in much of my global work, including the approaches to translation, machine versus human translation, and the influence of translation on analysis. Translated material can arrive or be incorporated via many channels. There are many very different approaches to translation workflows and automated translation systems and one must understand the nuances of each when developing a workflow for the use of translated material in a research initiative.

  • Vernacular Coding. In some highly-specialized applications, such as scoring a series of news articles or documents by tone, discord between the source and target languages may prevent the translation from permitting accurate measurements of nuance. In these cases native speakers perform the entire coding task using the source text.
  • Commercial Translation. Usually single-iteration translation that captures the majority of the text's meaning. Native speakers or trained translators focus solely on converting material from the source to the target language. The translated material is then sent to highly trained specialists who analyze it and extract the needed information.
  • Machine Translation. Similar to commercial translation, but orders of magnitude faster and scalable to unlimited volume levels. Highly useful as a screening mechanism. Can use expert, statistical, or hybrid approaches.
  • Hybrid Systems. Systems like the intelligence community's National Virtual Translation Center combine machine translation for screening with human translation for highest accuracy.
  • Iterative Translation. The consumer of translated material works hand-in-hand with the translator to clarify the finest nuances of the translation (for example, is it a "no" or a "never").
  • Foreign Media Analysis. The use of translated material to examine factual or expressive trends in foreign media content.

I have worked extensively with translated material, from the workflows involved in preparing a 75+ language corpus for translation and assigning and managing large teams of human translators to produce nuanced translations, to the use of machine translation for screening and routing purposes. I've worked with the interpretability of translated material, especially the ways in which different approaches to machine translation affect secondary processes such as automated categorization and data mining. I recently authored the first unclassified comprehensive study of the Western intelligence community's use of foreign news and scientific material over the last 30 years, including its use of iterative translation.

Funding Analysis

I was the chief architect and developer of the Office of Naval Research's "federal funding search, discovery, and analysis system." This system combined grant and contracting opportunities from across the federal government and offered advanced trend mining, sophisticated visualizations and pattern analysis, and spatial analysis in a single integrated "funding opportunities portal" supporting national small business and entrepreneurship in service of federal needs.