Research: Natural Language Processing / Text Mining / Classification / Unstructured Text / Learning Systems
Natural language processing, the use of automated tools to extract meaning from text, is a core enabling technology underlying much of my work. I have been developing "web-scale" information extraction and NLP tools for unstructured text for more than a decade, and have developed a number of innovative approaches to robust, high-accuracy text classification and tagging.
CLASSIFICATION AND TAGGING
One of the most critical tasks when working with large document collections is classification and tagging: assigning categories and topical keywords for filtering, search, and analysis tasks. For example, when searching for news articles about "emerging technology", there is no single set of keywords that will reliably find such articles. Instead, text classification systems can "learn" from examples, or automatically derive natural categories from document collections, and then evaluate whether a given document matches the learned criteria, permitting very sophisticated and multifaceted concepts to be reliably searched.
- High Accuracy. The models I've built on historical document collections for complex subjects achieve false negative and false positive rates below 1% when run against collections in the tens of millions of files, which is extremely accurate at that scale.
- Robustness of Data. Traditional classification models work very well on the specific dataset they were trained on, but perform poorly when applied to other datasets. For example, a model trained to recognize labor riots in New York Times news articles would likely fare much worse when performing the same task on translated ITAR-TASS reports. The systems I've built are able to perform equally well on very different datasets, including translated broadcast intercepts.
- Automatic Model Growing. One of the approaches I've found to yield the highest accuracy is to literally "grow" categorization models using a set of growth and fitness functions to automatically determine the parameter space yielding the highest accuracy.
- Taxonomy Development. I have considerable experience working with domain subject experts to develop taxonomies suitable for automated classification tasks, and with the iterative process of determining consolidation boundaries.
- Automated Taxonomy Development. I also have experience developing automated taxonomies using the underlying document collection as a guide.
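To make the ideas of learning from examples and "growing" a model concrete, the following is a minimal illustrative sketch, not the system described above. It assumes a tiny two-class Naive Bayes text classifier and a simple evolutionary search in which the growth function is random mutation of two parameters (a smoothing constant and a decision threshold) and the fitness function is accuracy on held-out examples. All names and the toy data are hypothetical.

```python
import math
import random
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train(examples, alpha):
    """Fit a two-class multinomial Naive Bayes model with add-alpha smoothing."""
    counts = {True: Counter(), False: Counter()}
    docs = Counter()
    for text, label in examples:
        counts[label].update(tokenize(text))
        docs[label] += 1
    vocab = set(counts[True]) | set(counts[False])
    return counts, docs, vocab, alpha

def score(model, text):
    """Log-odds that the text belongs to the positive class."""
    counts, docs, vocab, alpha = model
    log_odds = math.log(docs[True] / docs[False])
    for label, sign in ((True, 1), (False, -1)):
        total = sum(counts[label].values()) + alpha * len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                log_odds += sign * math.log((counts[label][tok] + alpha) / total)
    return log_odds

def fitness(params, train_set, held_out):
    """Fitness function: accuracy on held-out examples for (alpha, threshold)."""
    alpha, threshold = params
    model = train(train_set, alpha)
    correct = sum((score(model, t) > threshold) == y for t, y in held_out)
    return correct / len(held_out)

def grow(train_set, held_out, generations=20, pop_size=8, seed=0):
    """Evolve (alpha, threshold) pairs: keep the fitter half, mutate it to refill."""
    rng = random.Random(seed)
    population = [(rng.uniform(0.01, 2.0), rng.uniform(-2.0, 2.0))
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population,
                        key=lambda p: fitness(p, train_set, held_out),
                        reverse=True)
        survivors = ranked[: pop_size // 2]
        # Growth function: multiplicative noise on alpha, additive on threshold.
        children = [(max(1e-3, a * math.exp(rng.gauss(0, 0.3))),
                     t + rng.gauss(0, 0.3))
                    for a, t in survivors]
        population = survivors + children
    return max(population, key=lambda p: fitness(p, train_set, held_out))

# Toy, made-up examples for an "emerging technology" category.
TRAIN = [
    ("startup unveils quantum computing chip", True),
    ("new battery technology promises faster charging", True),
    ("researchers demonstrate prototype fusion reactor", True),
    ("city council approves annual budget", False),
    ("local team wins championship game", False),
    ("heavy rain expected over the weekend", False),
]
HELD_OUT = [
    ("quantum battery prototype unveiled", True),
    ("council debates weekend game schedule", False),
]

best_alpha, best_threshold = grow(TRAIN, HELD_OUT)
print(best_alpha, best_threshold, fitness((best_alpha, best_threshold), TRAIN, HELD_OUT))
```

Because the fitter half of each generation is carried over unchanged, the best fitness in the population can never decrease from one generation to the next. A realistic search would cover a far larger parameter space and use much more held-out data than this toy example.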
UNSTRUCTURED TO STRUCTURED
So-called "structured data", which breaks information into preset fields with specific meaning associated with each field and each value (e.g., a database or spreadsheet), is extremely powerful in that it can be subjected to a wide range of statistical and analytical techniques. "Unstructured" data like textual documents are far more complex to incorporate into an analytical workflow in that they must first be converted into measurable quantities, known as "features".
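As a simple illustration of what converting text into "features" can mean, the sketch below builds bag-of-words count vectors, one of the most basic feature representations. It is only an assumed example for exposition; the function names are hypothetical, and real systems typically use much richer features.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_vocabulary(documents):
    """Map every distinct token in the collection to a fixed column index."""
    vocab = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(vocab)}

def to_features(text, vocabulary):
    """Turn one document into a count vector with one column per vocabulary token."""
    counts = Counter(tokenize(text))
    # Dicts preserve insertion order, so columns follow the sorted vocabulary.
    return [counts.get(tok, 0) for tok in vocabulary]

docs = ["Labor riots spread in the city.", "The city council met."]
vocabulary = build_vocabulary(docs)
print(list(vocabulary))
print(to_features("The city, the riots", vocabulary))
```

Each document becomes a fixed-length row of numbers, which is what allows it to be stored in a table and subjected to the same statistical techniques as any other structured data.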
It is very easy to build systems that perform very well in a laboratory setting: "entity extraction" systems that identify names and keyword generation tools are both common computer science projects. Yet robustly operating over the full range of material found on the open web or in real-world document collections is a much harder problem. I have extensive experience building "web-scale" systems that maintain high accuracy when converting unstructured textual data into structured database fields, using a combination of "expert"-derived rulesets and machine-based statistical and learning approaches.
PAGE EXTRACTION
Another key research area I've worked on extensively is "page extraction": identifying the core textual content on a complex multifaceted web page and separating it from stylistic elements, banners, navigation bars, advertisements, user comment sections, etc. The core algorithms I developed at the National Center for Supercomputing Applications have been in use for more than a decade and have been shown to be robust against content from all major categories of websites in all countries of the world in all languages, in HTML ranging from fully compliant to massively invalid. The system uses a highly adaptive architecture capable of realtime inference and reasoning, yet is fast enough to operate at "web scale" in realtime.
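For readers unfamiliar with the problem, the following minimal sketch shows one common textbook heuristic for page extraction. It is not the algorithm developed at NCSA, whose details are not given here. The sketch assumes that body text tends to sit in long blocks with few links, while navigation and advertising sit in short, link-heavy blocks. It uses Python's standard html.parser module, which tolerates malformed markup; the tag sets, thresholds, and names are all illustrative assumptions.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}
BLOCK_TAGS = {"p", "div", "td", "li", "article", "section", "h1", "h2", "h3", "br"}

class BlockCollector(HTMLParser):
    """Split a page into text blocks and record how much of each is link text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []      # finished (text, link_chars) pairs
        self._text = []       # pieces of the block currently being read
        self._link_chars = 0  # characters of the current block inside <a>
        self._skip_depth = 0  # > 0 while inside a tag from SKIP_TAGS
        self._link_depth = 0  # > 0 while inside <a>

    def _flush(self):
        text = " ".join("".join(self._text).split())  # normalize whitespace
        if text:
            self.blocks.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._flush()
            self._skip_depth += 1
        elif tag == "a":
            self._link_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "a":
            self._link_depth = max(0, self._link_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth:
            return
        self._text.append(data)
        if self._link_depth:
            self._link_chars += len(data.strip())

    def close(self):
        super().close()
        self._flush()  # keep any trailing text not closed by a block tag

def extract_main_text(html, min_chars=40, max_link_density=0.3):
    """Keep blocks that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    kept = [text for text, link_chars in parser.blocks
            if len(text) >= min_chars and link_chars / len(text) <= max_link_density]
    return "\n".join(kept)

SAMPLE = """<html><body><nav><a href="/">Home</a> <a href="/news">News</a></nav>
<div><a href="/a">Sports</a> <a href="/b">Weather</a> <a href="/c">Business</a>
<a href="/d">Opinion</a> <a href="/e">Technology</a> <a href="/f">Travel</a></div>
<p>Workers at the plant walked out on Tuesday after negotiations over pay collapsed, union officials said.</p>
<p>Share</p><script>var x = 1;</script></body></html>"""

print(extract_main_text(SAMPLE))
```

Fixed thresholds like these are the main weakness of such a heuristic: they rarely transfer across site designs and languages, which is the kind of brittleness a more adaptive architecture is meant to address.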