Seminars & Colloquia

Andrew McCallum

Computer Science, U. Massachusetts Amherst

"Information Extraction, Data Mining and Joint Inference"

Monday November 17, 2008 04:00 PM
Location: 3211, EB2 NCSU Centennial Campus

This talk is part of the Triangle Computer Science Distinguished Lecturer Series



Although information extraction and data mining appear together in many applications, their interface in most current systems would better be described as serial juxtaposition than as tight integration. Information extraction populates slots in a database by identifying relevant subsequences of text, but is usually not aware of the emerging patterns and regularities in the database. Data mining methods begin from a populated database, and are often unaware of where the data came from, or its inherent uncertainties. The result is that the accuracy of both suffers, and accurate mining of complex text sources has been beyond reach.

In this talk I will describe work in probabilistic models that perform joint inference across multiple components of an information processing pipeline in order to avoid the brittle accumulation of errors. The need for joint inference appears not only in extraction and data mining, but also in natural language processing, computer vision, robotics and elsewhere. I will argue that joint inference is the fundamental issue in artificial intelligence.

After briefly introducing conditional random fields, I will describe recent work in information extraction, entity resolution and alignment that uses joint inference, stochastic approximations, weighted first-order logic and other methods of probabilistic programming. I'll close with a demonstration of a new research paper search engine that leverages these techniques.

Joint work with colleagues at UMass: Charles Sutton, Aron Culotta, Khashayar Rohanemanesh, Greg Druck, Ben Wellner, Michael Hay, Xuerui Wang, David Mimno, Pallika Kanani, Kedare Bellare, Michael Wick, Rob Hall and Gideon Mann.

Short Bio:

Andrew McCallum is an Associate Professor and Director of the Information Extraction and Synthesis Laboratory in the Computer Science Department at the University of Massachusetts Amherst. He has published over 100 papers in many areas of AI, including natural language processing, machine learning, data mining and reinforcement learning, and his work has received over 10,000 citations. He received his PhD from the University of Rochester in 1995 with Dana Ballard and a postdoctoral fellowship from CMU with Tom Mitchell and Sebastian Thrun. Afterward he worked in an industrial research lab, where he spearheaded the creation of CORA, an early research paper search engine that used machine learning for spidering, extraction, classification and citation analysis. In the early 2000s he was Vice President of Research and Development at WhizBang Labs, a 170-person start-up company that used machine learning for information extraction from the Web. He is the recipient of two NSF ITR awards, the UMass NSM Distinguished Research Award, the UMass Lilly Teaching Fellowship, and the IBM Faculty Partnership Award. He was the Program Co-chair for the International Conference on Machine Learning (ICML) 2008, and is a member of the boards of the International Machine Learning Society and the CRA Computing Community Consortium, and of the editorial board of the Journal of Machine Learning Research. For the past ten years, McCallum has been active in research on statistical machine learning applied to text, especially information extraction, co-reference, document classification, clustering, finite state models, semi-supervised learning, and social network analysis. New work on search and bibliometric analysis of open-access research literature can be found at


Host: Alex Hartemink, Duke U.

