Sunday, August 28, 2011

2011-08-28: KDD 2011 Trip Report

Author: Carlton Northern

The SIGKDD 2011 conference took place August 21 - 24 at the Manchester Grand Hyatt in San Diego, CA.  Researchers from all over the world interested in knowledge discovery and data mining were in attendance.  This conference in particular has a heavy statistical-analysis flavor, and many presentations were math intensive.

I was invited to present my master's project research at the Mining Data Semantics (MDS2011) Workshop of KDD.  In this paper, we present an approach for finding the social media profiles of people from a given organization.  This is possible due to the links created between members of an organization; for instance, co-workers or fellow students will likely friend each other, creating hyperlinks between their respective accounts.  These links, if public, can be mined and used to disambiguate candidate profiles that share the same names as the individuals we are searching for.  The following figure shows the number of profiles found from the ODU Computer Science student body for each respective social media site and the links found between them.

This picture represents the actual students themselves and the links between them.  Black nodes are undergrads, green nodes are grads, and red nodes are members of the WS-DL research group.
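As a rough sketch (not our actual implementation, and the profile URLs are made up for illustration), the link-based disambiguation idea works like this: among several candidate profiles sharing the name we are searching for, prefer the one with the most public links to members we have already confirmed.

```python
# A minimal sketch of link-based profile disambiguation.  Candidates map
# each candidate profile URL to the set of profiles it publicly links to;
# the candidate with the most links to confirmed members wins.

def disambiguate(candidates, confirmed_members):
    """candidates: {profile_url: set of linked profile_urls}
       confirmed_members: set of profile_urls known to be in the organization."""
    def score(profile):
        return len(candidates[profile] & confirmed_members)
    best = max(candidates, key=score)
    return best if score(best) > 0 else None  # no links -> can't disambiguate

confirmed = {"site.com/alice", "site.com/bob"}
candidates = {
    "site.com/carol1": {"site.com/alice", "site.com/bob"},  # links to 2 members
    "site.com/carol2": {"site.com/zed"},                    # links to none
}
print(disambiguate(candidates, confirmed))  # site.com/carol1
```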

These are the slides:

Here is the paper:

I've synopsized some of the interesting presentations from the conference:

Stephen Boyd - Stanford University "From Embedded Real-Time to Large-Scale Distributed".  Stephen Boyd's talk focused on his current research area of convex optimization.  He explained that convex optimization is a mathematical technique in which many complex problems of model fitting, resource allocation, engineering design, etc. can be transformed into a simple convex optimization problem, solved, and then transformed back into the original problem to obtain the solution.  He went on to explain how this can be applied everywhere from real-time embedded systems, such as hard disk drive head seeking, to large distributed systems such as California's power grid.
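As a toy illustration of why convexity matters (my own example, not one from the talk): for a convex function, a simple local method like gradient descent is guaranteed to converge to the global optimum.

```python
# Gradient descent on the one-dimensional convex function f(x) = (x - 3)^2.
# Because f is convex, the local descent method reaches the global minimum.

def minimize(grad, x0, step=0.1, iters=200):
    x = x0
    for _ in range(iters):
        x -= step * grad(x)  # move against the gradient
    return x

# f'(x) = 2 * (x - 3); the unique minimizer is x = 3.
x_star = minimize(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_star, 4))  # 3.0
```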

Amol Ghoting - IBM "NIMBLE: A Toolkit for the Implementation of Parallel Data Mining and Machine Learning Algorithms on MapReduce".  With Hadoop, you write a map function and a reduce function, where anything can be mapped to a (key, value) pair.  The problem with Hadoop is that it has a two-stage data flow, which can be cumbersome to program against.  Also, job scheduling and data management are handled by the user.  Lastly, code reuse and portability are diminished.  This toolkit tries to make the key features of Hadoop available to developers but without a Hadoop-specific implementation.  NIMBLE decouples algorithm computation from data management, parallel communication, and control.  It does this by using a series of basic datasets and basic tasks that form a DAG; tasks can spawn other tasks.  With this structure in place, simultaneous data and task parallelism is achievable.
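The two-stage data flow that NIMBLE abstracts away can be sketched in a few lines of plain Python (a single-machine toy, not Hadoop or NIMBLE itself): records are mapped to (key, value) pairs, grouped by key, and each group is reduced.

```python
# A minimal, single-process sketch of the MapReduce two-stage data flow.
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map stage: emit (key, value) pairs
            groups[key].append(value)       # shuffle: group values by key
    return {k: reducer(k, vs) for k, vs in groups.items()}  # reduce stage

# The canonical example: word count.
counts = map_reduce(
    ["a b a", "b c"],
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda key, values: sum(values),
)
print(counts)  # {'a': 2, 'b': 2, 'c': 1}
```

Even this tiny example shows the rigidity the talk criticized: any algorithm must be contorted into exactly one map and one reduce per pass, which is what NIMBLE's DAG of tasks relaxes.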

David Haussler – UC Santa Cruz “Cancer Genomics”.  The cost of DNA sequencing has dropped dramatically; it was tracking Moore’s law but is now falling 10-fold every two years, so entire genomes can now be sequenced cheaply.  His group helped create the Cancer Genome Atlas, through which 10,000 tumors will be sequenced in the next two years, and cancer genome sequencing will soon be a standard clinical practice.  Because each person’s DNA is different, and each tumor arising from a person’s DNA is different, a huge computational processing problem looms in the near future.

Ahmed Metwally - Google. "Estimating the number of people behind an IP Address".  Most research assumes one person per IP address, but this is often not the case.  The number of users behind an IP also changes over time; for instance, a hotel hosting a conference will have many more users than usual sharing the same IP address.  So how would one estimate the number of these users in a non-intrusive way?  One method is to look at trusted cookie counts; another is to look at the diversity of the traffic.  Google caps traffic volume per IP to stop people from gaming the system using the same IP address.  Google knows how many users share an IP address when they are logged in to Google's sites with a username and password, but some of Google's traffic comes from users without a Google account.  This research is for those who want to filter users without asking them for any identification, thus preserving their privacy.  The method is currently used at Google for detecting click fraud.
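The "trusted cookie count" idea can be sketched roughly as follows (a hypothetical simplification of mine, not Google's method): count the distinct long-lived cookies observed per IP as a lower-bound estimate of the users behind it.

```python
# A toy estimator of users per IP from (ip, cookie_id) request pairs:
# each distinct trusted cookie seen from an IP suggests a distinct user.
from collections import defaultdict

def users_per_ip(requests):
    """requests: iterable of (ip, cookie_id) pairs."""
    cookies = defaultdict(set)
    for ip, cookie in requests:
        cookies[ip].add(cookie)          # dedupe repeat visits by one user
    return {ip: len(c) for ip, c in cookies.items()}

log = [("10.0.0.1", "c1"), ("10.0.0.1", "c2"), ("10.0.0.1", "c1"),
       ("10.0.0.2", "c3")]
print(users_per_ip(log))  # {'10.0.0.1': 2, '10.0.0.2': 1}
```

In practice cookies are cleared, blocked, and shared across devices, which is why the talk combined this signal with traffic-diversity estimates.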

D. Sculley - Google "Detecting Adversarial Advertisements in the Wild".  An adversarial advertiser is one who uses Google AdWords or AdSense to advertise misleading products such as counterfeit goods or scams.  Most ads are good; only a small fraction are bad.  Google uses in-house trained analysts to hand-build rule-based models.  Letting these people build the rules themselves, rather than just performing repetitive tasks over and over, provided a great incentive and improved morale.  Automated methods are being used as well, but this part of the presentation went right over my head.

Chunyu Luo - University of Tennessee "Enhanced Investment Decisions in P2P Lending: An Investor Composition Perspective".  In this paper, they try to decide which loans are worth investing in; in other words, what makes a good loan?  They use a bipartite investment network with investors on one side, investees on the other, and loans as the edges between them.  Each loan can be considered a composition of many investors.  The idea is that by looking at the past performance of the other investors in a given loan, you can improve your prediction of that loan's return rate.  In an experiment on a P2P lending dataset, the composition method far outperformed the average return on investment.
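A simplified sketch of the composition intuition (my own toy version, not the paper's model): score a prospective loan by the investment-weighted average of its co-investors' historical returns.

```python
# Score a loan by the amount-weighted mean of its investors' past returns:
# a loan backed mostly by historically successful investors scores higher.

def composition_score(loan_investors, past_returns):
    """loan_investors: {investor: amount invested in this loan}
       past_returns:   {investor: average historical return rate}"""
    total = sum(loan_investors.values())
    return sum(amount * past_returns[inv]
               for inv, amount in loan_investors.items()) / total

# i2 has the better track record and the larger stake, so it dominates.
score = composition_score({"i1": 100, "i2": 300}, {"i1": 0.04, "i2": 0.08})
print(round(score, 3))  # 0.07
```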

Susan Imberman - College of Staten Island "From Market Baskets to Mole Rats:  Using Data Mining Techniques to Analyze RFID Data Describing Laboratory Animal Behavior".  This paper presents the data mining techniques used to analyze RFID data from a colony of mole rats.  Much as we use RFID in cars for tolls like EZ Pass, they tag the mole rats and record a reading whenever one passes specific points in the colony (a series of pipes and rooms).  K-means clustering revealed animal place preferences.  An adjacency matrix gave an idea of which mole rats liked to be near one another; it yielded three distinct subgraphs that corresponded well to the castes in mole rat colony structure: queens, large workers, and small workers.  Finally, they applied market-basket analysis, treating repeated mole rat behaviors like items in a shopping basket.
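The adjacency-matrix step can be sketched roughly like this (a hypothetical reconstruction of mine, with made-up reading fields): count how often pairs of tagged animals were read at the same antenna in the same time window, yielding the co-location counts that form the graph's edge weights.

```python
# Build co-location counts from RFID readings: two animals read at the
# same antenna in the same time window count as one "near each other" event.
from collections import defaultdict
from itertools import combinations

def colocation_matrix(readings):
    """readings: iterable of (time_window, antenna, animal_id)."""
    present = defaultdict(set)
    for window, antenna, animal in readings:
        present[(window, antenna)].add(animal)
    adj = defaultdict(int)
    for animals in present.values():
        for a, b in combinations(sorted(animals), 2):
            adj[(a, b)] += 1          # one shared sighting for this pair
    return dict(adj)

data = [(1, "pipeA", "rat1"), (1, "pipeA", "rat2"), (2, "pipeB", "rat1")]
print(colocation_matrix(data))  # {('rat1', 'rat2'): 1}
```

Thresholding these counts and keeping only strong edges is one way the three caste subgraphs could emerge.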

After the conference ended on Wednesday, Hurricane Irene was on track for a direct hit to Hampton Roads.  My flight was scheduled to arrive in Norfolk Friday night which was cutting it very close to the storm hitting on Saturday.  So I decided to extend the trip till Monday and ride out the storm here in sunny San Diego.  In total, I managed to miss a hurricane, a tornado, an earthquake, and a swamp fire.  I think I made a good decision...

2011-08-28: Fall 2011 WS-DL Classes

The Web Science and Digital Libraries Research Group is offering two classes for the fall 2011 semester. CS 895 Web-Based Information Retrieval will be offered on Tuesdays, 4:20-7:00 in room 2120 of the ECS building. This class will use the recent Croft, Metzler & Strohman book as the required text, and the Manning, Raghavan, & Schutze book as the recommended text. By choosing the former book as the primary guide for the course, we are intentionally providing a strong engineering component to the class (i.e., a level of coding and development is expected) as opposed to just a theoretical exploration of information retrieval. CS 751/851 Introduction to Digital Libraries is not a prerequisite, but it would help to be familiar with the material covered in that class.

Dr. Weigle will be teaching CS 795/895 Information Visualization on Thursdays, 9:30-12:15 in room 2120 of the ECS building. This class is a follow-on to the CS 796/896 Visual Analytics Seminar from last semester, but that course is not a prerequisite for this semester's course. Unlike the seminar from last semester, this course will be a "regular" course that will count toward one of the four courses required in the "advanced-level course requirement".

Students wishing to become active in the WS-DL research group should take one or both of these courses.


Tuesday, August 2, 2011

2011-07-26: Universal Access to All Knowledge

On July 26, 2011, the Web Science and Digital Library group at Old Dominion University hosted Kris Carpenter Negulescu, Director of the Web Group at the Internet Archive, who gave a talk entitled “Universal Access to All Knowledge”. The presentation started with an introduction to what the Internet Archive is, and then she gave some figures on the materials currently archived in the Internet Archive: text (2.9M+ books), moving images (542,500+ items), audio (950,000+ items), television broadcasts (1M+ hours), and web pages (150 billion+ pages). Moreover, she gave an overview of some of the special collections, such as K-12 students and NASA images.

After that, Kris explained the common collection strategies used by the Internet Archive to crawl the web. They regularly do a broad survey across a wide range of domains such as .com, .net, .org, etc. They also consider how frequently these websites change and give more support to sites without archiving capabilities of their own. The Internet Archive takes a special interest in exhaustive crawls when a webmaster decides to shut down a website and would like a final snapshot taken (for example, GeoCities). Another strategy is crawling by topic or specific collection, based on feedback from researchers or experts in that topic. In general, the key inputs for the seed URIs are nominations from various partners (e.g., domain experts, trusted directories, Wikipedia).

Kris then explained the methods of access to the web archive. The default method is "host based": browsing a website as it was, using the Wayback Machine. Other novel techniques include full-text search and metadata/catalog look-ups, and work is also under way on APIs mirroring the UI-based access.

Finally, she gave an overview about some special projects such as:
  • Data mining and extraction
  • Link domain of 20thCF or of an entire domain (e.g., .uk) from 1996-2010
  • Dynamic, on-demand archiving of video and Wikipedia annotation
  • Semantic data extraction, metadata services
The colloquium was recorded and is available below.

-- Ahmed AlSum