Research: Natural Language Processing / Text Mining / Classification / Unstructured Text / Learning Systems

Natural language processing, the use of automated tools to extract meaning from text, is a core enabling technology underlying much of my work. I have been developing "web-scale" information extraction and unstructured-text NLP tools for more than a decade and have developed a number of innovative approaches to robust, high-accuracy text classification and tagging.

CLASSIFICATION AND TAGGING

One of the most critical tasks when working with large document collections is classification and tagging: assigning categories and topical keywords to documents for filtering, search, and analysis. For example, no single set of keywords reliably finds news articles about "emerging technology". Instead, text classification systems can "learn" from labeled examples, or automatically derive natural categories from a document collection, and then evaluate whether a given document matches the learned criteria. This permits very sophisticated and multifaceted concepts to be searched reliably.
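To make "learning from examples" concrete, here is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing. It is a generic textbook technique chosen for illustration, not a description of my production systems; the class name, the labels, and the tiny training set are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Illustrative multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def classify(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokens:
                count = self.word_counts[label][tok] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training examples; a real system needs far more data.
clf = NaiveBayes()
clf.train("new quantum computing chip announced by startup", "emerging_tech")
clf.train("researchers unveil prototype battery using novel nanomaterials", "emerging_tech")
clf.train("local team wins championship game in overtime", "other")
clf.train("city council votes on new parking budget", "other")

label = clf.classify("startup unveils prototype quantum battery")
```

The query shares no single keyword with every training article, yet the classifier can still assign it to a category because it weighs the combined evidence of all its words.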

UNSTRUCTURED TO STRUCTURED

So-called "structured data", which breaks information into preset fields with a specific meaning associated with each field and each value (i.e., a database or spreadsheet), is extremely powerful in that it can be subjected to a wide range of statistical and analytical techniques. "Unstructured" data such as textual documents is far harder to incorporate into an analytical workflow because it must first be converted into measurable quantities, known as "features".
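As a small illustration of what converting text into "features" can mean, the sketch below builds bag-of-words count vectors, one of the simplest feature representations. The function names and sample sentences are hypothetical, and real systems typically use much richer features.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents):
    """Map every distinct term in the collection to a column index."""
    terms = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {term: idx for idx, term in enumerate(terms)}


def to_feature_vector(text, vocabulary):
    """Turn one document into a fixed-length vector of term counts."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for term, n in counts.items():
        if term in vocabulary:  # terms unseen at vocabulary time are ignored
            vector[vocabulary[term]] = n
    return vector


docs = ["The cat sat on the mat", "The dog chased the cat"]
vocabulary = build_vocabulary(docs)
rows = [to_feature_vector(d, vocabulary) for d in docs]
```

Each document becomes a row of numbers with one column per term, which is exactly the kind of tabular form that statistical tools expect.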

It is very easy to build systems that perform well in a laboratory setting: "entity extraction" systems that identify names and keyword generation tools are common computer science projects. Operating robustly over the full range of material found on the open web or in real-world document collections is a much harder problem. I have extensive experience building "web-scale" systems that maintain high accuracy when converting unstructured text into structured database fields, using a combination of "expert"-derived rulesets and statistical and machine learning approaches.
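The following toy sketch shows one way a hand-written rule and a corpus statistic can be combined for name extraction. It is an assumption-laden illustration, not my actual rulesets or models: the rule proposes runs of capitalized words as candidates, and a statistical filter keeps only candidates whose words are capitalized in most of their occurrences across the corpus. The pattern, the 0.8 threshold, and the sample corpus are all hypothetical.

```python
import re
from collections import Counter

# Rule: a candidate name is a run of two or more capitalized words.
CANDIDATE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")


def capitalization_stats(corpus):
    """Count how often each word appears capitalized versus in total."""
    capitalized, total = Counter(), Counter()
    for text in corpus:
        for word in re.findall(r"[A-Za-z]+", text):
            key = word.lower()
            total[key] += 1
            if word[0].isupper():
                capitalized[key] += 1
    return capitalized, total


def extract_names(text, capitalized, total, threshold=0.8):
    """Keep rule-proposed candidates whose words are usually capitalized."""
    names = []
    for match in CANDIDATE.finditer(text):
        words = [w.lower() for w in match.group().split()]
        ratios = [capitalized[w] / total[w] for w in words if total[w]]
        if ratios and min(ratios) >= threshold:
            names.append(match.group())
    return names


corpus = [
    "Ada Lovelace met Charles Babbage in London.",
    "Breaking News today as Grace Hopper spoke.",
    "The breaking news cycle never stops.",
]
capitalized, total = capitalization_stats(corpus)
records = [{"text": t, "names": extract_names(t, capitalized, total)} for t in corpus]
```

The rule alone would accept "Breaking News" as a name; the corpus statistic rejects it because those words also appear in lowercase elsewhere. Even this small example hints at why laboratory-grade extractors degrade on real-world text.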

PAGE EXTRACTION

Another key research area in which I have worked extensively is "page extraction": identifying the core textual content on a complex, multifaceted web page and separating it from stylistic elements, banners, navigation bars, advertisements, user comment sections, etc. The core algorithms I developed at the National Center for Supercomputing Applications have been in use for more than a decade and have been shown to be robust against content from all major categories of websites in all countries of the world in all languages, in HTML ranging from fully compliant to massively invalid. The system uses a highly adaptive architecture capable of realtime inference and reasoning, yet is fast enough to operate at "web scale".
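To give a flavor of the problem, here is a deliberately simple sketch of one common page extraction heuristic: split the page into text blocks, then keep blocks that are reasonably long and contain little link text. This is not the algorithm described above; the tag lists, thresholds, and sample page are hypothetical, and the sketch assumes boilerplate containers are properly closed, which real-world HTML often violates.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}
BLOCK_TAGS = {"p", "div", "td", "li", "article", "section", "h1", "h2", "h3"}


class BlockCollector(HTMLParser):
    """Collect (text, link_chars) pairs for each block-level element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0   # > 0 while inside a boilerplate container
        self.link_depth = 0   # > 0 while inside an <a> element
        self.blocks = []
        self.pieces, self.link_chars = [], 0

    def flush(self):
        text = " ".join(" ".join(self.pieces).split())  # normalize whitespace
        if text:
            self.blocks.append((text, self.link_chars))
        self.pieces, self.link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self.link_depth = max(0, self.link_depth - 1)

    def handle_data(self, data):
        if self.skip_depth:
            return
        self.pieces.append(data)
        if self.link_depth:
            self.link_chars += len(data.strip())


def extract_main_text(html, min_chars=40, max_link_density=0.3):
    """Return blocks that are long enough and not dominated by links."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector.flush()
    kept = [text for text, link_chars in collector.blocks
            if len(text) >= min_chars and link_chars / len(text) <= max_link_density]
    return "\n".join(kept)


sample_html = (
    '<html><body>'
    '<nav><a href="/">Home</a> <a href="/about">About</a></nav>'
    '<div class="ad"><a href="/buy">Buy now and save big on everything '
    'in the store today only</a></div>'
    '<p>The city council approved the new transit plan on Tuesday after '
    'months of public debate.</p>'
    '<p>Share</p>'
    '<footer>Copyright notice and site links</footer>'
    '</body></html>'
)
main_text = extract_main_text(sample_html)
```

A fixed-threshold heuristic like this breaks quickly on unusual layouts, unclosed tags, or short-paragraph content, which is why robustness across the full diversity of the web requires a far more adaptive approach.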