Jaewoo Kang (Ph.D., UW-Madison, 2003)

Assistant Professor
Department of Computer Science

NC State University

 

 


 

General Information

I joined the faculty of the Department of Computer Science at NC State University in August 2003. I received my Ph.D. in Computer Science from the University of Wisconsin-Madison in 2003, my M.S. from the University of Colorado at Boulder in 1996, and my B.S. from Korea University in Seoul in 1994. Prior to joining the UW-Madison Ph.D. program, I spent one and a half years in an industry research lab (AT&T Labs, Florham Park, NJ), and during the course of my Ph.D. work I spent two years at a start-up company, where I served as CTO and founder.

 


Research

Broadly, my research focuses on understanding the fundamental aspects of building a large-scale Internet information system that can answer complex queries over a very large number (billions) of heterogeneous Internet data sources. The main challenge in achieving this goal is to improve the expressive power of the system without degrading its scalability in either the number of data sources or the number of transactions. I address this challenge in particular through work on data integration, model management, query optimization, semi-structured data management, text mining, and statistical natural language processing.

 

My Students and Collaborators

 

 

Current Projects

 

Gene Signature Cube: Enabling Large-Scale Gene Expression Analysis

An area of much recent interest in bioinformatics is developing strategies for combining information across different biological experiments in order to answer broad questions or generate new hypotheses for further investigation. Microarray data is a major focus of these efforts, as it represents genetic information derived under many different conditions and tissue types, and it is readily available from a number of public repositories. Combining data from diverse sources allows scientists to perform global studies, such as identifying genes that are involved in different types of cancer, different tissues, or different stages of cancer progression. Such global studies are generally not possible with the standard results of individual microarray studies, which provide lists of genes that are differentially expressed under the specific experimental conditions but offer no way to extend the findings to other relevant experiments.

 

Integrating such microarray data, however, has been a major challenge for both the bioinformatics and the database communities because of the many experiment-specific variables involved in microarray experiments. Different experiments use different tissue types, examine different treatment strategies, and consider different stages of disease development. This, along with differences in the technology and protocols used in different labs, makes it difficult to combine data across experiments. The goal of this project is to make fundamental contributions toward addressing this problem. As the first step, we developed a novel data model, the gene signature vector, to address the heterogeneous microarray data problem. The proposed model transforms heterogeneous gene expression data from different sources into a uniform, comparable form by projecting the data onto a coherent information space. As the second step, we are currently developing a novel analysis framework, the gene signature cube, which combines large numbers of such gene signatures in a three-dimensional data structure that facilitates global studies through intuitive visualization and efficient exploratory access.

 

Collaborators: Jiong Yang (CS, Case Western Reserve University), Steffen Heber (CS and Bioinformatics, NCSU), Dahlia Nielsen (Statistics, NCSU)

 

Un-Interpreted Schema Matching

The schema matching problem, at the most basic level, is the problem of mapping schema elements (for example, columns in a relational database schema) in one information repository to the corresponding elements in a second repository. While schema matching has always been a problematic and interesting aspect of information integration, the problem is exacerbated as the number of information sources to be integrated, and hence the number of integration problems that must be solved, grows. Most previous solutions to the schema matching problem rely in some fashion on identifying "similar" column names in the schemas to be matched and/or recognizing common domains in the data stored in the schemas. For this reason, they are often developed with a specific target domain in mind and do not generally work for others.

In this work, I presented an automated technique designed to help in the particularly difficult cases in which the column names and data values are "opaque." The approach works by computing the "mutual information" between pairs of columns within each schema, and then using this statistical characterization of column pairs in one schema to propose matching pairs of columns in the other schema. One of the novel aspects of this approach is that it does not rely on interpreting the data values or schema instances; it uses only the statistically measured relational dependencies among the schema elements. As a result, this "Un-Interpreted Matching" technique does not depend on domain-specific knowledge and is therefore applicable to many different domains, including domains to which the system has not previously been exposed.
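
As a rough illustration of the idea (a minimal sketch, not the published algorithm), the Python code below computes the empirical mutual information for every pair of columns in each table and then picks the column mapping whose dependency structure agrees best. The function names, the squared-difference score, and the brute-force search over permutations are all assumptions made for this example; exhaustive search is practical only for a handful of columns.

```python
from collections import Counter
from itertools import permutations
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two equal-length columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # Sum of p(x,y) * log2( p(x,y) / (p(x) * p(y)) ) over observed value pairs.
    return sum((c / n) * log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def dependency_matrix(table):
    """Pairwise mutual information for all column pairs.

    `table` maps a column name to its list of values. The diagonal entry
    (a, a) is the entropy of column a.
    """
    return {(a, b): mutual_information(table[a], table[b])
            for a in table for b in table}


def best_mapping(table_a, table_b):
    """Hypothetical matcher: try every one-to-one column mapping and keep the
    one that minimizes the squared difference between dependency matrices."""
    names_a, names_b = list(table_a), list(table_b)
    mi_a, mi_b = dependency_matrix(table_a), dependency_matrix(table_b)

    def cost(perm):
        m = dict(zip(names_a, perm))
        return sum((mi_a[(p, q)] - mi_b[(m[p], m[q])]) ** 2
                   for p in names_a for q in names_a)

    return dict(zip(names_a, min(permutations(names_b), key=cost)))


if __name__ == "__main__":
    # Toy data: table_b holds the same relation with opaque names,
    # recoded values, and shuffled column order.
    table_a = {
        "city":  ["x", "x", "y", "y", "z", "z", "w", "w"],
        "state": ["p", "p", "p", "p", "q", "q", "q", "q"],
        "flag":  [0, 1, 0, 1, 0, 1, 0, 0],
    }
    table_b = {
        "c1": [7, 7, 7, 7, 8, 8, 8, 8],
        "c2": ["n", "y", "n", "y", "n", "y", "n", "n"],
        "c3": [10, 10, 20, 20, 30, 30, 40, 40],
    }
    print(best_mapping(table_a, table_b))
```

Because mutual information depends only on how values co-occur, not on what they are, renaming columns or recoding values leaves the dependency matrix unchanged, which is what lets a matcher of this kind work on opaque data.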

 

Stream Data Processing

Recently, the database research community has begun focusing its attention on query processing over continuous input streams rather than fixed-size stored data sets. In such environments, many assumptions made in traditional query processing are no longer valid, and new problems arise. In this work, I investigated algorithms for evaluating sliding window joins over pairs of unbounded streams. A number of questions arise. For example, how can an optimizer decide which algorithm to use when the traditional metric of execution time to completion does not apply, since the inputs to a sliding window join are infinite? What if one stream is much faster than the other? Can the optimizer take advantage of the differences in the input stream rates? If the query processor has limited computational and/or memory resources, how should these resources be distributed among the streams?

To address these questions, I introduced a unit-time-basis cost model to analyze the expected performance of these algorithms. Using this cost model, I proposed strategies for maximizing the efficiency of join processing in three scenarios. First, I considered the case where one stream is much faster than the other. I showed that asymmetric combinations of join algorithms (e.g., hash join on one input, nested-loops join on the other) can outperform symmetric join algorithm implementations. Second, I investigated the case where system resources are insufficient to keep up with the input streams. I showed that the number of join result tuples produced in this case can be maximized by properly allocating computing resources across the two input streams. Finally, I investigated strategies for maximizing the number of result tuples produced when memory is limited, and showed that proper memory allocation across the two input streams can result in significantly lower resource usage and/or more result tuples produced. Addressing all of these issues in a unified manner, I developed an optimization framework for sliding window join queries, and an experimental study showed it to be accurate and usable in practice.
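
To make the asymmetric case concrete, here is a minimal Python sketch of a time-based sliding window equi-join in which the two inputs are handled differently: the window of stream A is hash-indexed on the join key and probed by arriving B tuples, while the window of stream B is kept as a plain queue and scanned, nested-loops style, by arriving A tuples. The class name, the method names, and the expiry rule (a tuple expires once it is `window` or more time units old) are assumptions for illustration; which side should be hashed in a given setting is exactly the kind of decision the cost model is meant to inform, and this sketch does not make that decision.

```python
from collections import defaultdict, deque


class AsymmetricWindowJoin:
    """Equi-join of streams A and B over a time window of `window` units.

    Assumes tuples arrive with non-decreasing timestamps.
    """

    def __init__(self, window):
        self.window = window
        self.a_queue = deque()             # (timestamp, key, payload), arrival order
        self.a_index = defaultdict(list)   # key -> [(timestamp, payload), ...]
        self.b_queue = deque()             # (timestamp, key, payload), arrival order

    def _expire(self, now):
        """Drop tuples that have fallen out of the window on both sides."""
        cutoff = now - self.window
        while self.a_queue and self.a_queue[0][0] <= cutoff:
            _, key, _ = self.a_queue.popleft()
            bucket = self.a_index[key]
            bucket.pop(0)                  # oldest entry for this key expires first
            if not bucket:
                del self.a_index[key]
        while self.b_queue and self.b_queue[0][0] <= cutoff:
            self.b_queue.popleft()

    def on_a(self, ts, key, payload):
        """New A tuple: scan B's window (nested loops), then store the tuple."""
        self._expire(ts)
        results = [(payload, b_payload)
                   for (_, b_key, b_payload) in self.b_queue if b_key == key]
        self.a_queue.append((ts, key, payload))
        self.a_index[key].append((ts, payload))
        return results

    def on_b(self, ts, key, payload):
        """New B tuple: probe A's hash index, then store the tuple."""
        self._expire(ts)
        results = [(a_payload, payload)
                   for (_, a_payload) in self.a_index.get(key, [])]
        self.b_queue.append((ts, key, payload))
        return results


if __name__ == "__main__":
    join = AsymmetricWindowJoin(window=10)
    join.on_a(0, "k", "a0")
    print(join.on_b(5, "k", "b5"))    # joins with the A tuple still in the window
    print(join.on_b(12, "k", "b12"))  # the A tuple from time 0 has expired by now
```

Each arrival first expires old tuples, then produces results against the opposite window, then is stored in its own window, so every matching pair within the window is reported exactly once.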

 

For my old projects, please see here.

 


 

Teaching

  • Spring, 2006: CSC 440 - Database Management Systems.
  • Spring, 2006: CSC 742 - Database Management Systems.
  • Fall, 2005: CSC 440 - Database Management Systems.
  • Spring, 2005: CSC 591G - Information Integration and the Web.
  • Fall, 2004: CSC 540 - Database Management Concepts and Systems.
  • Spring, 2004: CSC 591G - Information Integration and the Web.
  • Fall, 2003: CSC 540 - Database Management Concepts and Systems.

 


 

Professional Activities

  • Program Committee, VLDB Workshop on Data Mining in Bioinformatics 2006
  • Program Committee, VLDB Workshop on CleanDB: Clean Databases 2006
  • Program Committee, AAAI NecTar Track 2006
  • Program Committee, ACM/IEEE JCDL 2006
  • Referee for ACM TODS 2005
  • Referee for IEEE TOC 2005
  • Program Committee, ACM SIGKDD Workshop BIOKDD 2005
  • Program Committee, ACM WIDM 2004, 2005
  • Referee for ACM TOIT 2004
  • Referee for IEEE TKDE 2004, 2005, 2006

 


 

Publications

Journals and Conferences

  • Bin Song, Jeong-Hyeon Choi, Guangyu Chen, Jacek Szymanski, Guo-Qiang Zhang, Anthony K. H. Tung, Jaewoo Kang, Sun Kim, and Jiong Yang: ARCS: An Aggregated Related Column Scoring Scheme for Aligned Sequences. Bioinformatics 2006 (in press).
  • Sungbo Seo, Jaewoo Kang, Dongwon Lee, Keun Ho Ryu: Multivariate Stream Data Classification Using Simple Text Classifiers. To appear in Proceedings of 17th International Conference on Database and Expert Systems Applications (DEXA), Krakow, Poland, September 2006.
  • Dongwon Lee, Jaewoo Kang, Prasenjit Mitra, C. Lee Giles, Byung-Won On: Are Your Citations Clean?: New Scenarios and Challenges in Maintaining Digital Libraries. Communications of the ACM 2006 (in press).
  • ByungWon On, Ergin Elmacioglu, Dongwon Lee, Jaewoo Kang, Jian Pei: An Effective Approach to Entity Resolution Problem Using QuasiClique and its Application to Digital Libraries (short paper). Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL), Chapel Hill, NC, June 2006.
  • Amit C. Awekar, Pabitra Mitra, Jaewoo Kang: Selective Hypertext Induced Topic Search (poster paper). Proceedings of the 15th International World Wide Web Conference (WWW), Edinburgh, Scotland, May 2006.
  • Sungbo Seo, Jaewoo Kang, and Keun Ho Ryu: Multivariate Stream Data Reduction in Sensor Network Applications. Proceedings of the Second International Symposium on Ubiquitous Intelligence and Smart Worlds (UISW), Nagasaki, Japan, December 2005, (LNCS 3823, Springer 2005, ISBN 3-540-30803-2).
  • Jaewoo Kang, Dongwon Lee, and Prasenjit Mitra: Identifying Value Mappings for Data Integration: An Unsupervised Approach. Proceedings of the International Conference on Web Information Systems Engineering (WISE), New York, NY, November 2005, (LNCS 3806, Springer 2005, ISBN 3-540-30017-1).
  • Jaewoo Kang, Tae Sik Han, Dongwon Lee, and Prasenjit Mitra: Establishing Value Mappings using Statistical Models and User Feedback. Proceedings of the ACM Conference on Information and Knowledge Management (CIKM), Bremen, Germany, October 2005.
  • Jaewoo Kang, Jiong Yang, Wanhong Xu, Pankaj Chopra: Integrating Heterogeneous Microarray Data Sources using Correlation Signatures. Proceedings of the International Workshop on Data Integration in the Life Sciences (DILS), San Diego, CA, August 2005, (LNCS 3615, Springer 2005, ISBN 3-540-27967-9).
  • Dongwon Lee, Byung-Won On, Jaewoo Kang, Sanghyun Park: Effective and Scalable Solutions for Mixed and Split Citation Problems in Digital Libraries. Proceedings of the ACM SIGMOD Workshop on Information Quality in Information Systems (IQIS), Baltimore, MD, June 2005.
  • Byung-Won On, Dongwon Lee, Jaewoo Kang, Prasenjit Mitra: Comparative Study of Name Disambiguation Problem using a Scalable Blocking-based Framework. Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL), Denver, CO, June 2005.
  • Ramrajprabu Balasubramanian, Injong Rhee, Jaewoo Kang: A Scalable Architecture for SIP Infrastructure using Content Addressable Networks. Proceedings of the IEEE International Conference on Communications (ICC), Seoul, Korea, May 2005.
  • Jaewoo Kang, Jeffrey F. Naughton: On Schema Matching with Opaque Column Names and Data Values. Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), San Diego, California, June 2003. [Conference Presentation]
  • Jaewoo Kang, Jeffrey F. Naughton, Stratis D. Viglas: Evaluating Window Joins over Unbounded Streams. Proceedings of the International Conference on Data Engineering (ICDE), Bangalore, India, March 2003. [Conference Presentation]
  • Jeffrey F. Naughton, David J. DeWitt, David Maier, Ashraf Aboulnaga, Jianjun Chen, Leonidas Galanis, Jaewoo Kang, Rajasekar Krishnamurthy, Qiong Luo, Naveen Prakash, Ravishankar Ramamurthy, Jayavel Shanmugasundaram, Feng Tian, Kristin Tufte, Stratis Viglas, Yuan Wang, Chun Zhang, Bruce Jackson, Anurag Gupta, Rushan Chen: The Niagara Internet Query System. IEEE Data Engineering Bulletin 24(2), p.27-33, 2001.
  • Mary F. Fernandez, Daniela Florescu, Jaewoo Kang, Alon Y. Levy, Dan Suciu: Catching the Boat with Strudel: Experiences with a Web-Site Management System. Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), p.414-425, Seattle, Washington, June 1998.
  • Mary F. Fernandez, Daniela Florescu, Jaewoo Kang, Alon Y. Levy, Dan Suciu: Overview of Strudel - A Web-Site Management System. Networking and Information Systems Journal, Volume 1, p.115-140, 1998.
  • Mary F. Fernandez, Daniela Florescu, Jaewoo Kang, Alon Y. Levy, Dan Suciu: System Demonstration - STRUDEL: A Web-Site Management System. Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), p. 549-552, Tucson, Arizona, June 1997.
  • Jaewoo Kang, Mark Choey, Andreas Weigend: Maximizing Risk-Adjusted Return in Financial Time Series. Computing Science & Statistics (28th Symposium INTERFACE), p.677-681, Sydney, Australia, July 1996.

Technical Reports

Other Publications


Patents

  • Method and Apparatus for Web Site Management
    U.S. Patent No. 5,956,720        September 21, 1999
    Inventors: Mary Fernandez, Daniela Florescu, Jaewoo Kang, Alon Levy, and Dan Suciu.

Awards

Microsoft Graduate Fellowship 2000-01
- IDB: Unified Query Interface for Information on the Web

 


Contact

Mailing Address: (US Postal Service)

 

Campus Box 8206

Computer Science Dept.

NC State University

Raleigh, NC 27695

FEDEX, UPS, and other commercial carriers:

ATTN: Carol Holloman

Department of Computer Science

NC State University

890 Oval Drive

3320 Engineering Building II

Raleigh, NC 27606

 

Office:
2272 Engineering Building II, Centennial Campus

E-mail: kang (at) csc.ncsu.edu
Home Page: http://www.csc.ncsu.edu/faculty/kang
Phone: (919) 513-7575

Fax: (919) 515-7896



Last revised:
June 4, 2006
kang@csc.ncsu.edu