Research: Network Analysis / Relationship Extraction
Many of my research themes make implicit or explicit use of large-scale network representation, visualization, and analysis. Of particular interest is the synthesis of structured data and large unstructured data sources into single composite networks when those sources conflict or are incomplete. One example is analyzing a large collection of news articles on a given industry to automatically identify the actors in that industry and their relationships.
ANALYTICS / VISUALIZATION
Displaying large-scale network data in forms conducive to human analysis is a very complex endeavor: large-scale networks by their very nature consist of enormous amounts of fine-grained detail arranged over a very large area. A good analogy is viewing imagery of the craters on the surface of Mars on a desktop computer: if one zooms in far enough to see the individual craters, the macro-level patterns of their arrangement are lost, but when one zooms out far enough to see the entire image at once, the smaller craters are lost. Only when using a high-resolution large-format display (in this case a 40-projector seamless tiled display wall) can one see the entire image at once at native resolution and understand the broader patterns in the crater impact sites. Similarly, network visualization tools must make the broader patterns in large-scale networks visible without obscuring the fine-level detail of individual connections. Analytical tools must be able to make sense of the massive web of connections in network data and sift through the noise to find insight.
- Network Visualization. There is no single approach that works well for all applications. Some tools are better suited to specific data types, or to analysts rather than decision makers. Display tools may be two-dimensional desktop applications, three-dimensional desktop applications, or three-dimensional stereoscopic environments. Visualizations may use Euclidean, hyperbolic, or other projections or subsetting techniques, each of which comes with specific interpretive nuances that must be taken into consideration.
- Network Analysis. A wide range of powerful techniques become available once data is in a network format. For example, the interconnectedness of a network can be measured to determine its vulnerability to the elimination of selected nodes (i.e., whether removing specific nodes would fragment the network and sever communication between selected segments), and critical nexus nodes can be identified.
- Predictive Analytics. One application of network analysis is predictive analytics. In one approach, known as "scenario modeling", a given sequence of events is represented as a segment of a network and then overlaid onto a larger network of historical events. Much like a puzzle piece fitting into a puzzle, the small event-sequence network will likely match several historical episodes in the larger network, which can then be highlighted for an analyst to derive likely outcomes and possible remedial factors influencing the current episode.
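The vulnerability question raised under Network Analysis can be illustrated with a small sketch. This is a hypothetical example and not the method used in any project described here: it assumes an undirected network given as a list of node pairs, and it flags a node as critical if removing it increases the number of disconnected pieces. The names `build_graph`, `component_count`, `cut_nodes`, and the sample `EDGES` are all illustrative assumptions. The brute-force approach is chosen for readability; it would be too slow for genuinely large networks.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def component_count(graph, removed=frozenset()):
    """Count connected components, ignoring any nodes in `removed`."""
    seen = set()
    count = 0
    for start in graph:
        if start in seen or start in removed:
            continue
        count += 1
        seen.add(start)
        queue = deque([start])
        while queue:  # breadth-first search over one component
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in seen and neighbor not in removed:
                    seen.add(neighbor)
                    queue.append(neighbor)
    return count


def cut_nodes(graph):
    """Return nodes whose removal splits the network into more pieces."""
    baseline = component_count(graph)
    return sorted(
        node for node in graph
        if component_count(graph, removed={node}) > baseline
    )


# Hypothetical network: two tightly knit groups joined only through C and D.
EDGES = [("A", "B"), ("B", "C"), ("A", "C"),
         ("C", "D"),
         ("D", "E"), ("E", "F"), ("D", "F")]

if __name__ == "__main__":
    print(cut_nodes(build_graph(EDGES)))
```

In this sample network, removing either of the two nodes on the bridge between the groups severs communication between them, while removing any other node leaves the rest connected.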
As part of a larger project with the United States National Archives and Records Administration (NARA), the ENRON email collection was used as a sample dataset to examine various modes of interaction with large-scale communication datasets.
- NARA Email Network Visualization and Analysis. This progress report outlines a set of very basic, but extremely powerful, analytical visualizations designed to explore the ENRON email collection using large-scale display surfaces.
RELATIONSHIP EXTRACTION
A key subcomponent of network analytics is "relationship extraction": the autonomous identification of different classes of connections between entities in a data repository and the conversion of those relationships into structured network connections. For example, a "relationship mining" system might analyze a corpus of news articles covering a specific industry, identify mentions of partnerships or other connections between companies and individuals in that space, and convert the unstructured knowledge represented in that large collection of textual documents into a structured network diagram that can be subjected to various network analysis techniques.
- Converting Unstructured Data to Structured Networks. Unstructured textual document collections contain substantial amounts of information on specific entities and the connections between those entities. Using entity and relationship extraction tools, it is possible to automatically construct networks that represent the web of connections in textual documents. For example, as noted above, a relationship extraction system can produce a network showing the connections between various companies in a given industry from a corpus of news articles. Of particular interest, systems can often infer hidden relationships by examining patterns in the web of topics and relationships associated with each company.
- Combining Structured and Unstructured Data. The most powerful networks combine the benefits of structured data with the untapped richness of unstructured data to synthesize extremely rich networks.
- Conflicting / Incomplete Information. Large-scale networks constructed from real-world data must contend with conflicting and incomplete information sources. One input source might specify that person A is person B's brother, while another might list A as B's mother. A common name might appear in hundreds of unrelated contexts. Either these issues must be resolved to a sufficient level of accuracy, or the underlying analytical techniques must be made sufficiently robust, to permit informed decisions to be made using the data.
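As a rough illustration of the ideas above, the following sketch shows one very simple way text might be turned into weighted network edges and one very simple way conflicting relationship claims might be reconciled. It is a toy under stated assumptions, not a description of any system discussed here: entities are assumed to be known in advance and matched by plain substring search, a relationship is assumed whenever two entities share a sentence, and conflicts are settled by majority vote with ties left unresolved. The function names, company names, and sample data are all hypothetical.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence_edges(documents, entities):
    """Count how often each pair of known entities shares a sentence."""
    weights = Counter()
    for text in documents:
        # Naive sentence split on terminal punctuation followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            present = sorted(e for e in entities if e in sentence)
            for pair in combinations(present, 2):
                weights[pair] += 1
    return weights


def resolve_conflicts(claims):
    """Keep the most frequently asserted label for each (source, target) pair.

    A tie for first place is recorded as None so that it can be flagged
    for review instead of being settled arbitrarily.
    """
    votes_by_pair = {}
    for source, target, label in claims:
        votes_by_pair.setdefault((source, target), Counter())[label] += 1
    resolved = {}
    for pair, votes in votes_by_pair.items():
        ranked = votes.most_common(2)
        tied = len(ranked) == 2 and ranked[0][1] == ranked[1][1]
        resolved[pair] = None if tied else ranked[0][0]
    return resolved


# Hypothetical inputs for illustration only.
DOCS = [
    "Acme announced a partnership with Globex. Initech declined to comment.",
    "Globex and Acme will share a pipeline. Acme also signed a deal with Initech.",
]
ENTITIES = ["Acme", "Globex", "Initech"]
CLAIMS = [
    ("A", "B", "brother"), ("A", "B", "brother"), ("A", "B", "mother"),
    ("C", "D", "partner"), ("C", "D", "rival"),
]

if __name__ == "__main__":
    print(dict(cooccurrence_edges(DOCS, ENTITIES)))
    print(resolve_conflicts(CLAIMS))
```

Sentence co-occurrence says nothing about what kind of relationship exists, and majority voting treats every source as equally reliable. A realistic system would need entity disambiguation for common names and some weighting of sources by trustworthiness.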
