Jaewoo Kang (Ph.D., UW-Madison, 2003)
Assistant Professor
I joined the faculty of the Department of Computer Science at NC State
University in August 2003. I received my Ph.D. in Computer Science from the
University of Wisconsin-Madison in 2003, M.S. from the University of Colorado
at Boulder in 1996,
and B.S. from Korea University at
Broadly, my research focuses on understanding the fundamental aspects of
building a large-scale internet information system that can answer complex
queries over a large number (billions) of heterogeneous internet data sources.
The main challenge in achieving this goal is to improve the expressive power of
the system without degrading its scalability in either the number of data
sources or the number of transactions. I tackle this challenge particularly in
the areas of data integration, model management, query optimization,
semi-structured data management, text mining, and statistical natural language
processing.
My Students and Collaborators
Current Projects
Gene Signature Cube: Enabling Large-Scale Gene Expression Analysis
An area of much recent interest in bioinformatics is developing strategies for
combining information across different biological experiments in order to
answer broad questions or generate new hypotheses for further investigation.
Microarray data is a major focus of these efforts, as it represents genetic
information derived under many different conditions and tissue types, and it is
readily available from a number of public repositories. Combining data from
diverse sources allows scientists to perform global studies such as identifying
genes that are involved in different types of cancer, different tissues, or
different stages of cancer progression. Such global studies are not generally
possible from the standard results of individual microarray studies, which list
the genes that are differentially expressed under the specific experimental
conditions but generally offer no extension to other relevant experiments.
Integrating such microarray data, however, has been a major challenge to both
the bioinformatics and the database communities because of the many
experiment-specific variables involved in microarray experiments. Different
experiments utilize different tissue types, examine different treatment
strategies, and consider different stages of disease development. This, along
with differences in the technology and protocols used in different labs, makes
it difficult to combine data across experiments. The goal of this project is to
make fundamental contributions toward addressing this problem. As the first
step, we developed a novel data model, the gene signature vector, to solve the
heterogeneous microarray data problem. The proposed model transforms
heterogeneous gene expression data from different sources into a uniform,
comparable form by projecting it onto a coherent information space. As the
second step, we are currently developing a novel analysis framework, the gene
signature cube, which combines large numbers of such gene signatures in a
three-dimensional data structure that facilitates global studies through
intuitive visualization and efficient exploratory access.
Collaborators: Jiong Yang (CS,
Un-Interpreted Schema Matching
The schema matching problem at the most basic level refers to the problem of
mapping schema elements (for example, columns in a relational database schema)
in one information repository to corresponding elements in a second repository.
While schema matching has always been a problematic and interesting aspect of
information integration, the problem is exacerbated as the number of
information sources to be integrated, and hence the number of integration
problems that must be solved, grows. Most previous solutions to the schema
matching problem rely in some fashion upon identifying "similar" column names
in the schemas to be matched and/or recognizing common domains in the data
stored in the schemas. For this reason, they are often developed with a
specific target domain in mind and do not generally work for others.
In this work I presented an automated technique that is designed to be of
assistance in the particularly difficult cases in which the column names and
data values are "opaque." This approach works by computing the
"mutual information" between pairs of columns within each schema, and
then using this statistical characterization of pairs of columns in one schema
to propose matching pairs of columns in the other schema. One of the novel
aspects of this approach is that it does not rely on the interpretation of the
data values or schema instances, but uses only the statistically measured
relational dependencies among the schema elements. As a result, this
"Un-Interpreted Matching" technique does not depend on domain-specific
knowledge, and hence is applicable to many different domains, including
domains to which the system has not previously been exposed.
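
The idea of matching on column-to-column dependencies alone can be illustrated with a small sketch. It is not the published algorithm: the squared-difference distance, the exhaustive search over permutations, and the names `mutual_information`, `dependency_matrix`, and `best_match` are all assumptions made for illustration. Each table is summarized by the empirical mutual information between every pair of its columns, and the mapping whose dependency structures agree best is proposed.

```python
from collections import Counter
from itertools import permutations
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two equal-length columns."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("columns must be non-empty and of equal length")
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    # Sum of p(x,y) * log2(p(x,y) / (p(x) * p(y))) over observed pairs.
    return sum(
        (c / n) * log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


def dependency_matrix(table):
    """Map every ordered column pair (a, b) to its mutual information.

    `table` is a dict from column name to a list of values. The diagonal
    entry (a, a) equals the entropy of column a.
    """
    return {
        (a, b): mutual_information(table[a], table[b])
        for a in table
        for b in table
    }


def best_match(table_a, table_b):
    """Propose a column mapping from table_a to table_b.

    Hypothetical matching criterion: try every one-to-one mapping and keep
    the one that minimizes the squared difference between the two dependency
    matrices. Exhaustive search is only practical for a handful of columns.
    """
    names_a, names_b = list(table_a), list(table_b)
    dep_a, dep_b = dependency_matrix(table_a), dependency_matrix(table_b)
    best, best_cost = None, float("inf")
    for perm in permutations(names_b, len(names_a)):
        mapping = dict(zip(names_a, perm))
        cost = sum(
            (dep_a[a1, a2] - dep_b[mapping[a1], mapping[a2]]) ** 2
            for a1 in names_a
            for a2 in names_a
        )
        if cost < best_cost:
            best, best_cost = mapping, cost
    return best
```

Because only value co-occurrence counts are used, the sketch never compares a column name or a data value across the two tables, which is the sense in which the matching is "un-interpreted."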
Stream Data Processing
Streaming Data & Join Optimization
Recently, the database research community has
begun focusing its attention on query processing over continuous input streams
rather than fixed-size stored data sets. In such environments, many assumptions
made in traditional query processing are no longer valid, and new problems
arise. In this work, I investigated algorithms for evaluating sliding window
joins over pairs of unbounded streams. A number of questions arise. For
example, how can an optimizer decide which algorithm to use when the
traditional metric of execution time to completion does not apply in a sliding
window join scenario, since the inputs are infinite? What if one stream is much
faster than the other? Can the optimizer take advantage of the differences in the
input stream rates? If the query processor has limited computational and/or
memory resources how should these resources be distributed among the streams?
To address the questions, I introduced a unit-time-basis cost model to analyze
the expected performance of these algorithms. Using this cost model, I proposed
strategies for maximizing the efficiency of processing joins in three
scenarios. First, I considered the case where one stream is much faster than
the other. I showed that asymmetric combinations of join algorithms (e.g.,
hash join on one input, nested-loops join on the other) can outperform
symmetric join algorithm implementations. Second, I investigated the case where
system resources are insufficient to keep up with the input streams. I showed
that we can maximize the number of join result tuples produced in this case by
properly allocating computing resources across the two input streams. Finally,
I investigated strategies for maximizing the number of result tuples produced
when memory is limited, and showed that proper memory allocation across the two
input streams can result in significantly lower resource usage and/or more
result tuples produced. Addressing all of these issues in a unified manner, I
developed a powerful optimization framework for sliding window join queries
and showed through an experimental study that it is correct and usable in
practice.
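
For readers unfamiliar with the setting, the following is a minimal sketch of a time-based sliding window equi-join over two streams. It is a generic symmetric hash join, not one of the asymmetric algorithms or the cost model from this work. The event format `(timestamp, stream, key, value)` and the names `WindowSide` and `sliding_window_join` are assumptions made for illustration. Each input keeps its own window size, which is the knob a memory-allocation strategy would adjust.

```python
from collections import defaultdict, deque


class WindowSide:
    """State for one join input: recent tuples, hash-indexed by join key."""

    def __init__(self, window):
        self.window = window              # how long a tuple stays joinable
        self.queue = deque()              # (timestamp, key) in arrival order
        self.index = defaultdict(deque)   # key -> deque of (timestamp, value)

    def expire(self, now):
        """Drop tuples whose timestamp is at or before now - window."""
        while self.queue and self.queue[0][0] <= now - self.window:
            _, key = self.queue.popleft()
            bucket = self.index[key]
            bucket.popleft()              # oldest tuple for this key is first
            if not bucket:
                del self.index[key]

    def insert(self, ts, key, value):
        self.queue.append((ts, key))
        self.index[key].append((ts, value))

    def probe(self, key):
        """Return the live (timestamp, value) pairs with the given key."""
        return self.index.get(key, ())


def sliding_window_join(events, window_a, window_b):
    """Yield (key, a_value, b_value) for matching tuples within the windows.

    `events` must be ordered by timestamp; each event is
    (timestamp, stream, key, value) with stream equal to "A" or "B".
    """
    sides = {"A": WindowSide(window_a), "B": WindowSide(window_b)}
    for ts, stream, key, value in events:
        own = sides[stream]
        other = sides["B" if stream == "A" else "A"]
        other.expire(ts)                  # discard stale tuples before probing
        for _, other_value in other.probe(key):
            if stream == "A":
                yield (key, value, other_value)
            else:
                yield (key, other_value, value)
        own.expire(ts)
        own.insert(ts, key, value)
```

Since the inputs are unbounded, the generator never "completes"; results are produced per arriving tuple, which is why a per-unit-time cost metric fits this setting better than time to completion.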
For my old projects, please see here.
Microsoft Graduate Fellowship 2000-01
- IDB: Unified Query Interface for Information on the Web
Mailing Address: (US Postal Service)
Campus
Computer Science Dept.
NC
FEDEX, UPS, and other commercial carriers:
ATTN: Carol Holloman
Department of Computer Science
NC
3320 Engineering Building II
Office: 2272 Engineering Building II, Centennial Campus
E-mail: kang (at) csc.ncsu.edu
Home Page: http://www.csc.ncsu.edu/faculty/kang
Phone: (919) 513-7575
Fax: (919) 515-7896
Last revised: June 4, 2006
kang@csc.ncsu.edu