Speaker: AnHai Doan, Computer Science and Engineering, University of Washington
Learning to Map between Structured Representations of Data
Abstract: Virtually all information processing efforts encode data using some form of structured or semi-structured representation, such as relational schemas, ontologies, or XML DTDs. Whenever more than one representation is used, *semantic mappings* must be established between them to ensure interoperability. In particular, data processing across disparate sources, within an enterprise or on the WWW, requires knowing the semantic mappings between the source schemas or from them to a unified global schema. Today, such semantic mappings are created manually, in an extremely costly and error-prone process. As a result, the acquisition of mappings has become a key bottleneck in building large-scale data management systems.
In this talk I will describe solutions for semi-automatically creating semantic mappings. I describe three systems that deal with successively more expressive data representations and mapping classes. The first two systems, LSD and GLUE, find one-to-one mappings such as "address = location" in the context of (respectively) data integration and ontology matching. The third system, COMAP, finds more complex mappings such as "name = concatenation(first-name,last-name)". The key idea underlying these three systems is the incorporation of multiple types of knowledge and multiple machine learning techniques into all stages of the mapping process, with the goal of maximizing mapping accuracy. I present experiments on real-world data that validate the proposed solutions. Finally, I discuss how the solutions generalize previous work in databases and AI on creating semantic mappings.
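As a rough illustration of combining several sources of evidence to propose one-to-one mappings such as "address = location", the sketch below scores attribute pairs with two simple matchers (attribute-name similarity and overlap of data values) and takes a weighted average. It is not the LSD or GLUE implementation: the function names, the weights, and the sample data are all hypothetical, and the systems in the talk use trained learners rather than these fixed heuristics.

```python
from difflib import SequenceMatcher


def name_score(a, b):
    """Similarity of two attribute names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def value_score(values_a, values_b):
    """Jaccard overlap of the data values seen under two attributes."""
    set_a, set_b = set(values_a), set(values_b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def combined_score(src_name, src_vals, tgt_name, tgt_vals,
                   w_name=0.4, w_value=0.6):
    """Weighted average of the two matchers (weights are arbitrary assumptions)."""
    return (w_name * name_score(src_name, tgt_name)
            + w_value * value_score(src_vals, tgt_vals))


def match(source, target):
    """Map each source attribute to its highest-scoring target attribute."""
    mapping = {}
    for s_name, s_vals in source.items():
        best = max(
            target,
            key=lambda t: combined_score(s_name, s_vals, t, target[t]),
        )
        mapping[s_name] = best
    return mapping


# Hypothetical sample data: attribute name -> example values.
source = {
    "address": ["12 Oak St", "9 Elm Ave", "4 Pine Rd"],
    "phone": ["555-0101", "555-0102"],
}
target = {
    "location": ["9 Elm Ave", "12 Oak St", "77 Lake Dr"],
    "contact": ["555-0102", "555-0199"],
}

if __name__ == "__main__":
    print(match(source, target))
```

In this toy data the attribute names alone give little signal, so the shared values carry most of the weight; combining the two kinds of evidence is what lets the differently named attributes be paired.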
Short Bio: AnHai Doan received his BS degree in Computer Science from Kossuth Lajos University in Hungary in 1993. He received his MS degree in Computer Science from the University of Wisconsin-Milwaukee in 1996. Currently, he is enrolled in the PhD program in the Computer Science Department at the University of Washington. The title of his dissertation is "Learning to Translate between Structured Representations of Data". His research interests are in applying database and machine learning techniques to the problem of accessing and integrating data from a large number of heterogeneous and autonomous data sources.
Host: J. Doyle and D. Bahler, Computer Science