Students Win Big on Big Data
Congratulations to NC State Computer Science PhD students Chin-Jung Hsu (First Place Poster Award), George Mathew (Third Place Poster Award), and Vivek Nair (Community Innovation Award), for their major accomplishments at the HPCC Systems Community Day last month in Atlanta, Georgia.
Developed by LexisNexis, the HPCC Systems Summit Community Day brings together researchers and individuals from academia and industry to exchange ideas and share experiences on advances in High-Performance Computing Clusters (HPCC). LexisNexis Risk Solutions has funded many research projects in computer science at NC State, helping students and facilities make an impact on research.
The first place poster, “HaaS: HPCC Systems-as-a-Service”, was written by Chin-Jung Hsu with the support of NC State University and LexisNexis Risk Solutions. HPCC Systems is an open source data analytics supercomputer developed by LexisNexis Risk Solutions and widely deployed to underpin the company's multi-billion dollar business. Hsu is advised by Dr. Vincent Freeh, Associate Professor of Computer Science at NC State.
The project develops HaaS, a command line tool that helps streamline system deployment and management on the AWS cloud. HaaS lets users deploy, save and restore an HPCC Systems cluster in minutes, and it enables a cost-effective workflow for running data analytics applications in the cloud. The tool, which has greatly benefited the HPCC community, can be downloaded from https://github.com/vin0110/haas.
The abstract for Hsu’s paper, “HaaS: HPCC Systems-as-a-Service”, follows: HPCC Systems is an open source data analytics supercomputer developed by LexisNexis Risk Solutions and is widely deployed to underpin its multi-billion business. HPCC Systems appears to be an intriguing alternative to Apache Hadoop and Spark. This project develops a command line tool, HaaS, that helps streamline system deployment and management on the AWS cloud. With HaaS, users are able to deploy, save and restore an HPCC Systems cluster in minutes. HaaS enables a cost-effective workflow of running data analytics applications in the cloud.
George Mathew’s third place poster, “Cohesive Framework for Legislative and Research Documents”, explores the connection between research and government legislation. After analyzing 100,000 legislative and 75,000 research documents, Mathew found a cosine similarity of up to 97% between the two document types. Future plans for the project include repeating the analysis on a larger body of research documents and legislation. Mathew is advised by Dr. Timothy Menzies, Professor of Computer Science at NC State.
The abstract for Mathew’s paper follows: What is the connection between the laws we write and the papers generated by researchers? Do government directives guide research? Does government legislation respond appropriately to new research results? How can we check?
To answer these questions, we explore text mining and LDA for legislative and research documents. Specifically, we a) build a vocabulary on corpus using the top words in each document type, b) construct a vectorized representation of the words, c) create vectors for each document using word vectors and d) generate a mapping between different document types based on document similarity. With this approach, we are able to achieve a cosine similarity of up to 97% between legislative and research documents (by way of comparison, the cosine distance between random items is around 75%).
This preliminary study was conducted on a relatively small set of 100,000 legislative and 75,000 research documents. Our future plans focus on repeating this analysis on a larger corpus and also handling temporal analysis of research with respect to legislation.
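The four steps (a) through (d) in the abstract above can be illustrated with a minimal Python sketch. It is not Mathew's implementation: the abstract does not say how the word vectors are built, so this sketch assumes simple one-hot word vectors as a stand-in, and every function name, the `k` parameter, and the sample sentences are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase a document and keep purely alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha()]


def top_words(docs, k):
    """Return the k most frequent words across a list of documents."""
    counts = Counter(w for d in docs for w in tokenize(d))
    return [w for w, _ in counts.most_common(k)]


def build_vocabulary(legislative, research, k=50):
    """Step (a): vocabulary from the top words of each document type."""
    words = set(top_words(legislative, k)) | set(top_words(research, k))
    return {w: i for i, w in enumerate(sorted(words))}


def word_vector(word, vocab):
    """Step (b): one-hot word vector (an assumption made for this sketch)."""
    vec = [0.0] * len(vocab)
    vec[vocab[word]] = 1.0
    return vec


def document_vector(text, vocab):
    """Step (c): average the word vectors of the in-vocabulary tokens."""
    tokens = [w for w in tokenize(text) if w in vocab]
    vec = [0.0] * len(vocab)
    for w in tokens:
        for i, x in enumerate(word_vector(w, vocab)):
            vec[i] += x
    return [x / len(tokens) for x in vec] if tokens else vec


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def map_documents(legislative, research, k=50):
    """Step (d): pair each legislative document with its most similar
    research document, returning (legislative_index, research_index, score)."""
    vocab = build_vocabulary(legislative, research, k)
    research_vecs = [document_vector(d, vocab) for d in research]
    mapping = []
    for i, doc in enumerate(legislative):
        leg_vec = document_vector(doc, vocab)
        sims = [cosine_similarity(leg_vec, rv) for rv in research_vecs]
        best = max(range(len(sims)), key=sims.__getitem__)
        mapping.append((i, best, sims[best]))
    return mapping


if __name__ == "__main__":
    # Made-up sample sentences, not drawn from the study's corpus.
    legislative = [
        "clean water act funding for water quality",
        "highway safety and vehicle emissions standards",
    ]
    research = [
        "vehicle emissions measurement on the highway",
        "water quality monitoring in clean rivers",
    ]
    for leg_idx, res_idx, score in map_documents(legislative, research):
        print(leg_idx, res_idx, round(score, 3))
```

With one-hot word vectors, each document vector reduces to a normalized bag-of-words count, so the scores only reflect shared vocabulary. A learned embedding would replace `word_vector` without changing the other steps.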
Vivek Nair, also advised by Dr. Menzies, received the Community Innovation Award at the HPCC Systems Community Day. Nair’s award celebrates his paper, “Spark and HPCC: Strangers no More”, and his innovative use of HPCC Systems.