Thursday, February 14, 2019

2019-02-14: CISE papers need a shake -- spend more time on the data section

A Crucial Step for Averting AI Disasters

I know this is a large topic and I may not have enough evidence to convince everyone, but based on my experience reviewing journal articles and conference proceedings, I strongly feel that computer and information science and engineering (CISE) papers need to devote more text to describing and analyzing the data.

This argument partially comes from my background in astronomy and astrophysics. Astronomers and astrophysicists usually spend a huge chunk of text in their papers talking about the data they adopt, including but not limited to where the data were collected, why they did not use another dataset, how the raw data were pre-processed, and why they ruled out outliers. They also analyze the data and report statistical properties, trends, or biases to ensure that they are using legitimate points in their plots.

In contrast, in many papers I have read and reviewed, even at top conferences, CISE people do not often do such work. They usually assume that because a dataset was used before, they can use it too. Many emphasize the size of the data, but few look into the structure, completeness, taxonomy, noise, and potential outliers in the data. The consequence is that they spend a lot of space on algorithms and report results better than baselines, but that is not a guarantee of anything. Good CISE papers usually discuss the bias and potential risks caused by the data, but good papers are rare, even at top conferences.

Algorithms are one of the pillars of CISE, but this does not mean they are everything. An algorithm only provides the framework, like a photo frame. Data is like the photo. Without the right photo, the picture (frame+photo) will not look pleasing. Even if it looks pleasing for a particular photo, it won't for other photos. Of course, no algorithm will fit all data, but at least the paper should discuss what types of data the algorithm should be applied to.

The good news is that many CISE people have started paying attention to this problem. At the IEEE Big Data Conference, Blaise Aguera y Arcas, the Google AI director, emphasized that AI algorithms have to be accompanied by the right data to be ethical and useful. Recently, a WSJ article titled "A Crucial Step for Averting AI Disasters" echoed the idea. The article quoted Douglas Merrill: “The answer to almost every question in machine learning is more data.” I would supplement this by adding "right" after "more". If we claim we are doing Data Science, how can we neglect the first part?

Jian Wu 

Friday, February 8, 2019

2019-02-08: Google+ Is Being Shuttered, Have We Preserved Enough of It?

Update on 2019/03/27 at 18:29 GMT: The ArchiveTeam needs users to run Warrior instances to capture as much of Google+ as possible. If you have the resources, please pitch in. Details are on the ArchiveTeam Google+ project page.

In 2017 I reviewed many storytelling, curation, and social media tools to determine which might be a suitable replacement for Storify, should the need arise. Google+ was one of the tools under consideration. It did not make the list of top three, but it did produce quality surrogates.

On January 31, 2019, Sean Keane of CNET published an article indicating that the consumer version of Google+ will shut down on April 2, 2019. I knew that the social media service was being shut down in August, but I was surprised to see the new date. Google's blog mentions that they changed the deadline on December 10, 2018, for security reasons. David Rosenthal's recent blog post cites Google+ as yet another example of Google moving up a service decommission date.

This blog post is long because I am trying to answer several useful questions for would-be Google+ archivists. Here are the main bullet points:
  • End users can create a Google Takeout archive of their Google+ content. The pages from the archive do not use the familiar Google+ stylesheets. The archive only includes images that you explicitly posted to Google+.
  • Google+ pages load more content when a user scrolls. Webrecorder is the only web archiving tool that I know of that can capture this content.
  • Google+ consists of mostly empty, unused profiles. We can detect empty, unused profiles by page size. Profile pages less than 568kB are likely empty.
  • The robots.txt for Google+ does not block web archives.
  • Even when only considering estimates of active profiles, I estimate that less than 1% of Google+ is archived in either the Internet Archive or
  • I sampled some Google+ mementos from the Internet Archive and found a mean Memento Damage score of 0.347 on a scale where 0 indicates no damage. Though manual inspection does show missing images, stylesheets appear to be consistently present.
Update on 2019/03/20 at 10:37 GMT: Google has started sending email out to users telling them to get started downloading their Google+ data on March 31, 2019. Even though the service will shut down on April 2, it may take a lot of time for some users to save their data. This indicates that the April 2 shutdown date is likely a hard shutdown meaning that any data extraction in progress during the shutdown may not complete. Plan your use of Google Takeout and other preservation methods accordingly.
Google+ will join the long list of shuttered Web platforms. Verizon will be shuttering some former AOL and Yahoo! services in the next year. Here are some more recent examples:
Sometimes the service is not shuttered, but large swaths of content are removed, such as with Tumblr's recent crackdown on porn blogs, and Flickr's mass deletion of the photos of non-paying users.

The content of these services represents serious value for historians. Thus Geocities, Vine, and Tumblr were the targets of concerted hard-won archiving efforts.

Google launched Google+ in 2011. Writers have been declaring Google+ dead since its launch. Google+ has been unsuccessful for many reasons. Here are some mentioned in the news over the years:
As seen below, Google+ still has active users. I lost interest in 2016, but WS-DL Member Sawood Alam, Dave Matthews Band, and Marvel Entertainment still post content to the social network. Barack Obama did not last as long as I did, with his last post in 2015.

I stopped posting to Google+ in 2016.
WS-DL member Sawood Alam is a more active Google+ member, having posted 17 weeks ago.

Dave Matthews Band uses Google+ to advertise concerts. Their last post was 1 week ago.

Marvel Entertainment was still posting to Google+ while I was writing this blog post.

Barack Obama lost interest in Google+. His last post was on March 6, 2015.

Back in July of 2018, I analyzed how much of the U.S. Government's AHRQ websites were archived. Google+ is much bigger than those two sites. Google+ allows users to share content with small groups or the public. In this blog post, I focus primarily on public content and current content.

I will use the following Memento terminology in this blog post:
  • original resource - a live web resource
  • memento - an observation, a capture, of a live web resource
  • URI-R - the URI of an original resource
  • URI-M - the URI of a memento
ArchiveTeam has a wiki page devoted to the shutdown of Google+. They list the archiving status as "Not saved yet." As shown below, I have found less than 1% of Google+ pages in the Internet Archive or

Update on 2019/03/18 at 16:07 GMT: ArchiveTeam's archiving status has been updated to "In progress...". According to this article by the Verge, there is a concerted effort now underway by ArchiveTeam and the Internet Archive to archive parts of Google+. There are limitations to web archiving, as only up to 500 comments can be archived per post. To help in these efforts, please read the rest of this post so that you can ensure that your own Google+ data is preserved.
Update on 2019/03/18 at 16:21 GMT: On Twitter, Sawood Alam has mentioned that this Reddit post has more information on ArchiveTeam's efforts. For live tracking of the process, visit this page.
Update on 2019/03/26 at 12:21 GMT: The UK Web Archive has been asking for recommendations on UK-based Google+ accounts suitable for preservation. See the tweet below for more information:
Update on 2019/03/27 at 20:08 GMT: Via Google+, Edward Morbius informed me that the Google+ group Google+ Mass Migration is discussing migrating from Google+. Topics range from preservation of content to migration to other services.

In the spirit of my 2017 blog post about saving data from Storify, I cover how one can acquire their own Google+ data. My goal is to provide information for archivists trying to capture the accounts under their care. Finally, in the spirit of the AHRQ post, I discuss how I determined how much of Google+ is probably archived.

Saving Google+ Data

Google Takeout

There are professional services like PageFreezer that specialize in preserving Google+ content for professionals and companies. Here I focus on how individuals might save their content.

Google Takeout allows users to acquire their data from all of Google's services. 

Google provides Google Takeout as a way to download personal content for any of their services. After logging into Google Takeout, it presents you with a list of services. Click "Select None" and then scroll down until you see the Google+ entries.

Select "Google+ Stream" to get the content of your "main stream" (i.e., your posts). There are additional services from which you can download Google+ data. "Google+ Circles" allows you to download vCard-formatted data for your Google+ contacts. "Google+ Communities" allows you to download the content for your communities.

Once you have selected the desired services, click Next. Then click Create Archive on the following page. You will receive an email with a link allowing you to download your archive.

From the email sent by Google, a link to a page like the one in the screenshot allows one to download their data.

The Takeout archive is a zip file that decompresses into a folder containing an HTML file and a set of folders. These HTML pages include your posts, events, information about posts that you +1'd, comments you made on others' posts, poll votes, and photos.

Note that the actual files of some of these images are not part of this archive. It does include your profile pictures and pictures that you uploaded to posts. Images from any Google+ albums you created are also available. With a few exceptions, references to images from within the HTML files in the archive are all absolute URIs pointing to the live web. They will no longer function if the service hosting them is shut down. Anyone trying to use this Google Takeout archive will need to do some additional crawling for the missing image content.
Google Takeout (right) does not save some formatting elements in your Google+ posts (left). The image, in this case, was included in my Google Takeout download because it is one that I posted to the service.
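
As a starting point for that additional crawling, the following is a minimal sketch, not part of the original workflow, that lists the absolute image URIs referenced by the HTML files of an unzipped Takeout archive. The function names and the folder layout (any `.html` file under the unzipped folder) are assumptions for illustration, and the example URIs are hypothetical.

```python
from html.parser import HTMLParser
from pathlib import Path


class ImageSrcCollector(HTMLParser):
    """Collects the src of every <img> tag that uses an absolute http(s) URI."""

    def __init__(self):
        super().__init__()
        self.absolute_srcs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src and src.startswith(("http://", "https://")):
            self.absolute_srcs.append(src)


def absolute_image_uris(html_text):
    """Return absolute image URIs found in one HTML document, in order."""
    collector = ImageSrcCollector()
    collector.feed(html_text)
    return collector.absolute_srcs


def scan_takeout_folder(folder):
    """Return the sorted, de-duplicated absolute image URIs in all HTML files
    under `folder` (assumed here to be the unzipped Takeout archive)."""
    uris = set()
    for path in Path(folder).rglob("*.html"):
        text = path.read_text(encoding="utf-8", errors="replace")
        uris.update(absolute_image_uris(text))
    return sorted(uris)


if __name__ == "__main__":
    # Hypothetical markup: one absolute reference, one relative reference.
    sample = '<p><img src="https://images.example/a.jpg"><img src="photos/b.jpg"/></p>'
    print(absolute_image_uris(sample))
```

The resulting list could then be submitted to a web archive or fetched with a crawler before the shutdown date.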

One could use Webrecorder to preserve their profile pages. Webrecorder saves web content to WARCs for use in many web archive playback tools. I chose Webrecorder because Google+ pages require scrolling to load all content, and scrolling is a feature with which Webrecorder assists.

A screenshot of my public Google+ profile replayed in Webrecorder.
One of Webrecorder's strengths is the ability to authenticate to services. We should be able to use this authentication ability to capture private Google+ data.

I tried saving it using my native Firefox, but that did not work well. Unfortunately, as shown below, sometimes Google's cookie infrastructure got in the way of authenticating with Google from within Webrecorder.

In Firefox, Google does not allow me to log into my Google+ account via Webrecorder.

I recommend changing the internal browser to Chrome to preserve your profile page. I tried to patch the recording a few times to capture all of the JavaScript and images. Even in these cases, I was unable to record all private posts. If someone else has better luck with Webrecorder and their private data, please indicate how you got it to work in the comments.

Other Web Archiving Tools

The following methods only work on your public Google+ pages. Google+ supports a robots.txt that does not block web archives.

The robots.txt for as of February 5, 2019, is shown below:

You can manually browse through each of your Google+ pages and save them to multiple web archives using the Mink Chrome Extension. The screenshots below show it in action saving my public Google+ profile.

The Mink Chrome Extension in action, click to enlarge. Click the Mink icon to show the banner (far left), and then click on the "Archive Page To..." button (center left). From there choose the archive to which you wish to save the page (center right), or select "All Three Archives" to save to multiple archives. The far right displays a WebCite memento of my profile saved using this process.

Some web archives, including the Internet Archive, support forms where you can insert a URI and have it saved. Using the URIs of your Google+ public profile, collections, and other content, manually submit them to these forms and the content will be saved.

The Internet Archive (left) has a Save Page Now form as part of the Wayback Machine. The other archive (right) has similar functionality on its front page.
My Google+ profile saved using the Internet Archive's Save Page Now form.
If you have all of your Google+ profile, collection, community, post, photo, and other URIs in a file and wish to push them to web archives automatically, submit them to the ArchiveNow tool. By default, ArchiveNow can save them to several web archives, including the Internet Archive and WebCite. It also supports an additional archive if you have an API key for it.

Current Archiving Status of Google+

How Much of Google+ Should Be Archived?

This section is not about making relevance judgments based on the historical importance of specific Google+ pages. A more serious problem exists. Most Google+ profiles are indeed empty. Google made it quite difficult to enroll in Google's services without signing up for Google+ at the same time. At one time, if one wanted a Google account for Gmail, Android, Google Sheets, Hangouts, or a host of other services, they would inadvertently be signed up for Google+. Acquiring an actual count of active users has been difficult because Google reported engagement numbers for all services as if they were for Google+. President Obama, Tyra Banks, and Steven Spielberg have all hosted Google Hangouts. This participation can be misleading, as Google Hangouts and Photos were the features most often used, and these users may not have maintained a Google+ profile. Again, there are a lot of empty Google+ profiles.

In 2015, Forbes wrote that less than 1% of users (111 million) are genuinely active, citing a study done by Stone Temple Consulting. In 2018, Statistics Brain estimated 295 million active Google+ accounts.

As archivists trying to increase the quality of our captures, we need to detect the empty Google+ profiles. Crawlers start with seeds from sitemaps. I reviewed the Google+ robots.txt and found four sitemaps, one of which focuses on profiles. The profile sitemap consists of 50,000 additional sitemaps. Due to the number and size of the files, I did not download them all to get an exact profile count. Each consists of between 67,000 and 68,000 URIs for an estimated total of 3,375,000,000 Google+ profiles.

An example of an "empty" Google+ profile.

How do we detect accounts that were never used, like the one shown above? The sheer number of URIs makes it challenging to perform an extensive lexical analysis in a short amount of time, so I took a random sample of 601 profile page URIs from the sitemap. I chose the sample size using the Sample Size Calculator provided by Qualtrics and verified it with similar calculators provided by SurveyMonkey, Raosoft, Australian Bureau of Statistics, and Creative Research Systems. These sample sizes represent a confidence level of 95% and a margin of error of ±4%.
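
For readers who want to check the sample size without an online calculator, here is a minimal sketch. It assumes the calculators use Cochran's formula with a finite population correction, a z-score of 1.96 for 95% confidence, and the most conservative proportion of 0.5; those choices are my assumption, not something the calculators' documentation is quoted on here.

```python
import math


def sample_size(population, margin, z=1.96, p=0.5):
    """Cochran's sample size with finite population correction (assumed method).

    population: number of items being sampled from
    margin:     margin of error as a fraction (0.04 for +/-4%)
    z:          z-score for the confidence level (1.96 for 95%)
    p:          assumed proportion; 0.5 gives the largest sample
    """
    n0 = (z ** 2) * p * (1 - p) / margin ** 2      # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)           # finite population correction
    return math.ceil(n)


if __name__ == "__main__":
    # Estimated profile count from the sitemap, +/-4% margin of error.
    print(sample_size(3_375_000_000, 0.04))  # 601
```

Under the same assumptions, the function also returns 2,334 for the 83,162 URI-Rs sampled later in this post at a ±2% margin of error.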

Detecting unused and empty profiles is similar to the off-topic problem that I tackled for web archive collections, and it turns out that the size of the page is a good indicator of a profile being unused.  I attempted to download all 601 URIs with wget, but 18 returned a 404 status. A manual review of this sample indicated that profiles of size 568kB or higher contain at least one post. Anyone attempting to detect an empty Google+ profile can issue an HTTP HEAD and record the byte size from the Content-Length header. If the byte size is less than 568kB, then the page likely represents an empty profile and can be ignored.

One could automate this detection using a tool like curl. Below we see a command that extracts the status line, date, and content-length for an "empty" Google+ profile of 567,342 bytes:

The same command for a different profile URI shows a size of 720,352 bytes for a non-empty Google+ profile:
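
The same check can be sketched in Python. This is an illustration under stated assumptions, not the command used for this post: the function names are hypothetical, and I assume 1 kB means 1,024 bytes because the 720,352-byte profile is described as 703kB. The server may also omit Content-Length on a HEAD response, in which case the sketch returns None.

```python
import urllib.request

KB = 1024                    # assumption: the post's "kB" is 1,024 bytes
EMPTY_BELOW = 568 * KB       # profiles smaller than this are likely empty
FULL_PAGE_FROM = 663 * KB    # profiles this large fill the first scroll "page"


def classify_profile_size(num_bytes):
    """Label a profile page by its size, using the thresholds from this post."""
    if num_bytes < EMPTY_BELOW:
        return "likely empty"
    if num_bytes < FULL_PAGE_FROM:
        return "has posts"
    return "first page full"


def head_content_length(url, timeout=30):
    """Issue an HTTP HEAD and return Content-Length as an int, or None if absent."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        value = response.headers.get("Content-Length")
    return int(value) if value is not None else None


if __name__ == "__main__":
    # The two byte sizes reported above, classified without any network access.
    print(classify_profile_size(567_342))  # likely empty
    print(classify_profile_size(720_352))  # first page full
```

Keeping the classification separate from the network call makes it easy to apply the thresholds to sizes already recorded in a crawl log.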

An example of a 703kB Google+ profile with posts from 3 weeks ago.

An example of Google+ loading more posts in response to user scrolling. Note the partial blue circle on the bottom of the page, indicating that more posts will load.

As seen above, Google+ profiles load more posts on scroll. Profiles at 663kB or greater have filled the first "page" of scrolling. Any Google+ profile larger than this has more posts to view. Unfortunately, crawling tools must execute a scroll event on the page to load this additional content. Web archive recording tools that do not automatically scroll the page will not record this content.
This histogram displays counts of the file sizes of the downloaded Google+ profile pages. Most profiles are empty, hence a significant spike for the bin containing 554kB.
From my sample 57/601 (9.32%) had content larger than 568kB. Only 12/601 (2.00%) had content larger than 663kB, potentially indicating active users. By applying this 2% to the total number of profiles, we estimate that 67.5 million profiles are active. Of course, based on the sample size calculator, my estimate may be off by as much as 4%, leaving an upper estimate of 135 million, which is between the 111 million number from the 2015 Stone Temple study and the 295 million number from the 2018 StatisticsBrain web page. The inconsistencies are likely the result of the sitemap not reporting all profiles for the entire history of Google+ as well as differences in the definition of a profile between these studies.
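
The arithmetic behind these estimates can be reproduced with a short sketch. The per-file count of 67,500 is my assumption: it is the midpoint of the reported 67,000 to 68,000 range, and it reproduces the 3,375,000,000 total. The upper figure follows the post's own reasoning, which applies the 4% margin to the total.

```python
SITEMAP_FILES = 50_000
URIS_PER_FILE = 67_500   # assumed midpoint of the 67,000-68,000 range

total_profiles = SITEMAP_FILES * URIS_PER_FILE

SAMPLE_SIZE = 601
LARGER_THAN_663KB = 12   # sampled profiles that filled the first scroll "page"

# 12/601 is about 1.997%, which the post rounds to 2%.
active_pct = round(100 * LARGER_THAN_663KB / SAMPLE_SIZE)
MARGIN_PCT = 4           # margin of error from the sample size calculator

point_estimate = total_profiles * active_pct // 100
upper_estimate = total_profiles * MARGIN_PCT // 100

if __name__ == "__main__":
    print(f"{total_profiles:,}")   # 3,375,000,000
    print(f"{point_estimate:,}")   # 67,500,000
    print(f"{upper_estimate:,}")   # 135,000,000
```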

I looked at various news sources that had linked to Google+ profiles. The profile URIs from the sitemaps do not correspond to those often shared and linked by users. For example, my vanity Google+ profile URI contains +ShawnMJones, but the sitemap lists it as a numeric profile URI containing 115814011603595795989. Google+ uses the canonical link relation to link these two URIs but reveals this relation only in the HTML of these pages. For a tool to discover this relationship, it must dereference each sitemap profile URI, an expensive discovery process at scale. If Google had placed these relations in the HTTP Link header, then archivists could use an HTTP HEAD to discover the relationship. The additional URI forms make it difficult to use profile URIs from sitemaps alone for analysis.
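
To illustrate why discovery requires downloading the page body, here is a minimal sketch that pulls the canonical link relation out of already-downloaded HTML. The class and function names are hypothetical, the sample markup and URI are invented for illustration, and I make no claim about which of the two URI forms Google+ marks as canonical.

```python
from html.parser import HTMLParser


class CanonicalLinkFinder(HTMLParser):
    """Records the href of the first <link> element whose rel includes canonical."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def find_canonical(html_text):
    """Return the canonical URI declared in the HTML, or None if there is none."""
    finder = CanonicalLinkFinder()
    finder.feed(html_text)
    return finder.canonical


if __name__ == "__main__":
    # Hypothetical page body; a real profile page would be fetched first.
    page = ('<html><head><link rel="canonical" '
            'href="https://social.example/+VanityName"></head><body></body></html>')
    print(find_canonical(page))
```

Because the relation lives in the body, each lookup costs a full GET of a page that is several hundred kilobytes, which is what makes this expensive across billions of sitemap URIs.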

The content of the pages found at the vanity and numeric profile URIs is slightly different. Their SHA256 hashes do not match. A review in vimdiff indicates that the differences are self-referential identifiers in the HTML (i.e., JavaScript variables containing +ShawnMJones vs. 115814011603595795989), a nonce that is calculated by Google+ and inserted into the content when it serves each page, and some additional JavaScript. Visually they look the same when rendered.

How much of Google+ is archived?

The lack of easy canonicalization of profile URIs makes it challenging to use the URIs found in sitemaps for web archive analysis. I chose instead to evaluate the holdings reported by two existing web archives.

For comparison, I used numbers from the sitemaps downloaded directly from
I use these totals for comparison in the following sections.
Internet Archive Search Engine Result Pages
I turned to the Internet Archive to understand how many Google+ pages exist in its holdings. I downloaded the data file used in the AJAX call that produces the page shown in the screenshot below.

The Internet Archive reports 83,162 URI-Rs captured for

The Internet Archive reports 83,162 URI-Rs captured. As shown in the table below, I further analyzed the data file and broke it into profiles, posts, communities, collections, and other by URI.

Category | # in Internet Archive | % of Total from Sitemap
Collections | 1 | 0.00000572%
Communities | 0 | 0%
Posts | 12,946 | Not reported in sitemap
Profiles | 65,000 | 0.00193%
Topics | 0 | 0%
Other | 5,217 | Not reported in sitemap

The archived profile page URIs are both of the vanity and numeric types. Without dereferencing each, it is difficult to determine how much overlap exists. Assuming no overlap, the Internet Archive possesses 65,000 profile pages, which is far less than 1% of 3 billion profiles and 0.0481% of our estimate of 135 million active profiles from the previous section.

I randomly sampled 2,334 URI-Rs from this list, corresponding to a confidence level of 95% and a margin of error of ±2%. I downloaded TimeMaps for these URI-Rs and calculated a mean of 67.24 mementos per original resource.

Search Engine Result Pages of the Second Archive
As shown in the screenshot below, the second archive also provides a search interface on its web site. It reports 2,551 URI-Rs, but scraping its search pages returns 3,061 URI-Rs. I analyzed the URI-Rs returned from the scraping script to place them into the categories shown in the table below.

Category | # in Archive | % of Total from Sitemap
Collections | 10 | 0.0000572%
Communities | 0 | 0%
Photos | 22 | Not reported in sitemap
Posts | 1,994 | Not reported in sitemap
Profiles | 989 | 0.0000293%
Topics | 1 | 0.248%
Other | 45 | Not reported in sitemap

The second archive contains 989 profiles, a tiny percent of the 3 billion suggested by the sitemap and the 135 million active profile estimate that we generated in the previous section. It is Memento-compliant, so I attempted to download TimeMaps for these URI-Rs. For 354 URI-Rs, I received 404s for their TimeMaps, leaving me with 2,707 TimeMaps. Using these TimeMaps, I calculated a mean of 1.44 mementos per original resource.
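
A mean like this can be computed by counting the memento entries in each TimeMap. The sketch below assumes the TimeMaps are in the link format described by the Memento specification and have already been downloaded as text; the function names and the sample TimeMap, including its URIs, are hypothetical.

```python
import re
from statistics import mean

# One link-format entry: <URI> followed by its parameters, up to the next "<".
LINK_ENTRY = re.compile(r'<([^>]+)>\s*;([^<]*)')
REL_PARAM = re.compile(r'\brel="([^"]*)"')


def memento_uris(timemap_text):
    """Return the URI-Ms in a link-format TimeMap.

    An entry counts as a memento when its rel value contains the token
    "memento", which also covers "first memento" and "last memento".
    """
    uris = []
    for uri, params in LINK_ENTRY.findall(timemap_text):
        rel = REL_PARAM.search(params)
        if rel and "memento" in rel.group(1).split():
            uris.append(uri)
    return uris


def mean_mementos(timemaps):
    """Mean number of mementos per original resource across TimeMap texts."""
    return mean(len(memento_uris(text)) for text in timemaps)


SAMPLE_TIMEMAP = '''<http://example.com/page>; rel="original",
<http://archive.example/timemap/link/http://example.com/page>; rel="self"; type="application/link-format",
<http://archive.example/timegate/http://example.com/page>; rel="timegate",
<http://archive.example/20180101000000/http://example.com/page>; rel="first memento"; datetime="Mon, 01 Jan 2018 00:00:00 GMT",
<http://archive.example/20190101000000/http://example.com/page>; rel="last memento"; datetime="Tue, 01 Jan 2019 00:00:00 GMT"
'''

if __name__ == "__main__":
    print(len(memento_uris(SAMPLE_TIMEMAP)))  # 2
```

The pattern matches on angle brackets instead of splitting on commas because the datetime values themselves contain commas.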

Are these mementos of good quality?

It is not enough for archives merely to contain mementos. Their quality is relevant as well. Crawling web content often results in missing embedded resources such as stylesheets. Fortunately, Justin Brunelle developed an algorithm for scoring the quality of a memento that takes missing embedded resources into account. Erika Siregar developed the Memento Damage tool based on Justin's algorithm so that we can calculate these scores. I used Memento Damage to score the quality of some mementos from the Internet Archive.

The histogram of memento damage scores from our random sample shows that most have a damage score of 0.
Memento damage takes a long time to calculate, so I needed to keep the sample size small. I randomly sampled 383 URI-Rs from the list acquired from the Internet Archive and downloaded their TimeMaps. I acquired a list of 383 URI-Ms by randomly sampling 1 URI-M from each TimeMap. I then fed these URI-Ms into a local instance of the Memento Damage tool. The Memento Damage tool experienced errors for 41 URI-Ms.

This memento has the highest damage score of 0.941 in our sample. The raw size of its base page is 635 kB.

The mean damage score for these mementos is 0.347. A score of 0 indicates no damage. This score may be misleading, however, because more content is loaded via JavaScript when the user scrolls down the page. Most crawling software does not trigger this JavaScript code and hence misses this content.

The screenshot above displays the largest memento in our sample. The base page has a size of 1.3 MB and a damage score of 0.0. It is not a profile page, but a page for a single post with comments.
The screenshot above displays the smallest memento in our sample with a size greater than zero and no errors while computing damage. This single post page redirects to a page not captured by the Internet Archive. The base page has a size of 71kB and a damage score of 0.516.
The screenshot above displays a memento for a profile page of size 568kB, the lower bound of pages with posts from our earlier live sample. It has a memento damage score of 0.

This histogram displays the file sizes in our memento sample. Note how most have a size between 600kB and 700kB. 

As an alternative to memento damage, I also downloaded the raw memento content of the 383 mementos to examine their sizes.  The HTML has a mean size of 466kB and a median of 500kB. In this sample, we have mementos of posts and other types of pages mixed in. Post pages appear to be smaller. The memento of a profile page shown below still contains posts at 532kB. Mementos for profile pages smaller than this had just a user name and no posts. It is possible that the true lower bound in size is around 532kB.

This memento demonstrates a possible new lower bound in profile size at 532kB. The Internet Archive captured it in January of 2019.

Discussion and Conclusions

Google+ is being shut down on April 2, 2019. What direct evidence will future historians have of its existence? We have less than two months to preserve much of Google+. In this post, I detailed how users might preserve their profiles with Google Takeout, Webrecorder, and other web archiving tools.

I mentioned that there are questions about how many active users ever existed on Google+. In Google's attempt to make all of its services "social" it conflated the number of active Google+ users with active users of other Google services. Third-party estimates of active Google+ users over the years have ranged from 111 million to 295 million. With a sample size of 601 profiles from the profile sitemap, I estimated that the number might be as high as 135 million.

To archive non-empty Google+ pages, we have to be able to detect pages that are empty. I analyzed a small sample of Google+ profile pages and discovered that pages of size 663kB or larger contain enough posts to fill the first "page" of scrolling. I also discovered that inactive profile pages tend to be less than 568kB. Using the HEAD method of HTTP and the Content-Length header, archivists can use this value to detect unused or poorly contributed to Google+ profiles before downloading their content.

I estimated how much of Google+ exists in public web archives. Scraping URIs from the search result pages of the Internet Archive, the most extensive web archive, reveals only 83,162 URI-Rs for Google+. The second archive reports only 2,551 URI-Rs. Both have less than 1% of the totals of different Google+ page categories found in the sitemap. The fact that so few are archived may indicate that few archiving crawls found Google+ profiles because few web pages linked to them.

I sampled some mementos from the Internet Archive and found a mean damage score of 0.347 on a scale where 0 indicates no damage. Though manual inspection does show missing images, stylesheets appear to be consistently present.

Because Google+ uses page scrolling to load more content, many mementos will likely be of poor quality if recorded outside of tools like Webrecorder. With the sheer number of pages to preserve, we may have to choose quantity over quality.

If a sizable sample of those profiles is considered to be valuable to historians, then web archives have much catching up to do.

A concerted effort will be necessary to acquire a significant number of profile pages by April 2, 2019. My recommendations are for users to archive their public profile URIs with ArchiveNow, Mink, or the save page forms at web archives such as the Internet Archive. Archivists looking to archive Google+ more generally should download the topics sitemap and at least capture the 404 (four hundred four, not 404 status) topics pages using these same tools. Enterprising archivists can search news sources, like this Huffington Post article and this Forbes article, that feature popular and famous Google+ users. Sadly, because of the lack of links, much of the data from these articles is not in a machine-readable form. A Google+ archivist would need to search Google+ for these profile page URIs manually. Once that is done, the archivist can then save these URIs using the tools mentioned above.
Update on 2019/03/27 at 20:23 GMT: The ArchiveTeam is engaged in a heroic effort with the Internet Archive to try to archive as much of Google+ as possible. As of this moment, their project tracker currently shows 20,004,087 items (and rising) have been archived.

Due to its lower usage compared to other social networks and its controversial history, some may ask "Is Google+ worth archiving?" Only future historians will know, and by then it will be too late, so we must act now.

-- Shawn M. Jones

Sunday, February 3, 2019

2019-02-02: Two Days in Hawaii - the 33rd AAAI Conference on Artificial Intelligence (AAAI-19)

The 33rd AAAI Conference on Artificial Intelligence, the 31st Innovative Applications of Artificial Intelligence Conference, and the 9th Symposium on Educational Advances in Artificial Intelligence were held at the Hilton Hawaiian Village, Honolulu, Hawaii. I had one paper accepted by IAAI 2019 on Cleaning Noisy and Heterogeneous Metadata for Record Linking across Scholarly Big Datasets, coauthored with Athar Sefid (my student at PSU), Jing Zhao (my mentee at PSU), Lu Liu (a graduate student who published a Nature Letter), Cornelia Caragea (my collaborator at UIC), Prasenjit Mitra, and C. Lee Giles.
This year, AAAI received its greatest number of submissions -- 7095, double the number in 2018. There were 18191 reviews collected, and over 95% of papers had 3 reviews. There were 1147 papers accepted, or 16.2% of all submissions. This is the lowest acceptance rate in the conference's history. In total there were 122 technical sessions, 460 oral presentations (15 min talk), and 687 posters (2 min flash). People from China submitted the largest number of papers and got the largest number of papers accepted (382, about 16%). People from the US got 264 papers submitted (21%). Israel got the highest acceptance rate (24.4%). The MVP topics (machine learning, NLP, and computer vision) account for over 61% of all submissions and 59% of all accepted papers. The top 3 topics by submission increase are reasoning under uncertainty, applications, and humans and AI. The top 3 by submission decrease are cognitive systems, computational sustainability, and human computation and crowdsourcing. Papers with supplementary material had a higher acceptance rate (27%) than papers without it (12%).
IAAI 2019 was less competitive, with 118 submissions. The acceptance rate was 35%. There were 36 emerging applications (including ours) and 5 deployed applications. The deployed application awards were conferred to 5 papers:
  • Grading uncompilable programs by Rohit Takhar & Varun Aggarwal (Machine Learning India)
  • Automated Dispatch of Helpdesk Email Tickets by Atri Mandal et al. (IBM Research India)
  • Transforming Underwriting in the Life Insurance Industry by Marc Maier et al. (Massachusetts Mutual Life Insurance Company)
  • Large Scale Personalized Categorization of Financial Transactions by Christopher Lesner et al. (Intuit Inc.)
  • A Genetic Algorithm for Finding a Small and Diverse Set of Recent News Stories on a Given Subject: How We Generate AAAI’s AI-Alert by Joshua Eckroth and Eric Schoen (i2kconnect)
The Robert S. Engelmore Memorial Award was conferred to Milind Tambe (USC) for outstanding research in the area of multi-agent systems and their application to problems of societal significance. I know Tambe's work partially from his student Amulya Yadav whom I interviewed to the assistant professor position at Penn State IST. He is well-known for his work on connecting AI with social goods.
The classic paper award was conferred on Prem Melville et al. for their 2002 AAAI paper "Content-boosted collaborative filtering for improved recommendations" (cited 1621 times on Google Scholar). This work combined content-based prediction with collaborative filtering for recommendation systems, an approach that is now classic textbook material.
Due to the limited amount of time I spent in Hawaii, I went to only 3 invited talks.
The first was given by Cynthia Breazeal, director of the personal robots group at MIT. Her presentation was on a social robot called Jibo. Unlike Google Home and Amazon Echo, this robot focuses on social communication with people rather than selling products and controlling devices. It is based on the Bayesian Theory of Mind Communication Framework and Bloom’s learning theory. Jibo has been tested in early childhood education and in fostering community connection among older adults. The goal is to promote early learning with peers and to treat loneliness, helplessness, and boredom. It can talk like a human and perform some simple motions, such as dancing. My personal opinion is that we should be careful when using these robots. They may be used for medical treatment, but people should always be encouraged to reach out to other people instead of robots.
The second was given by Milind Tambe on "AI and Multiagent Systems Research for Social Good". He divided this broad topic into 3 aspects: public safety and security, conservation, and public health. He views social problems as multiagent systems and pointed out that the key research challenge is how to optimize our limited intervention resources when interacting with other agents. Examples include wildlife conservation, in which they used game theory to successfully predict poachers' activity in national parks in Uganda; homeless youth shelters in Los Angeles (this is Amulya's work); and patrol scheduling using game theory.
The last one was given by the world-famous deep learning expert Ian Goodfellow, Senior Staff Research Scientist at Google AI and an author of the widely used Deep Learning book. His talk was on "Adversarial Machine Learning"; he is, of course, the inventor of Generative Adversarial Networks (GANs). He described the prosperity of machine learning as a Cambrian Explosion and gave applications of GANs in security, model-based optimization, reinforcement learning, extreme reliability, label efficiency, domain adaptation, fairness, accountability, transparency, and finally neuroscience. His current research focuses on designing extremely reliable systems for use in autonomous vehicles, air traffic control, surgical robots, medical diagnosis, etc. Much of his data consists of images.
There were too many sessions, and although I was interested in many of them, I finally chose to focus on the NLP sessions. The paper titles can be found on the conference website. Most NLP papers use AI techniques to deal with fundamental NLP problems such as representation learning, sentence-level embedding, and entity and relation extraction. I summarize what I learned below:
(1) GANs, attentive models, and Reinforcement Learning (RL) are gaining more attention, especially the latter. For example, RL is used to learn sentence embeddings using attentive recursive trees (Jiaxin Shi et al.; Tsinghua University). RL is used to build a hierarchical framework for relation extraction (Takanobu et al.; Tsinghua University). An attentive GAN was used to generate chatbot responses (Yu Wu et al.; Beihang University). RL is used to generate topically coherent visual stories (Qiuyuan Huang et al.; MSR). Deep neural networks are still popular, but not as popular in NLP tasks.
(2) Zero-shot learning has become a popular topic. Zero-shot learning means learning to handle a target class or domain without any labeled instances of it. For example, Lee and Jha (MSR) presented Zero-shot Adaptive Transfer for Conversational Language Understanding, and Shruti Rijhwani (CMU) presented Zero-Shot Neural Transfer for Cross-Lingual Entity Linking.
(3) Entity and relation extraction, one of the fundamental tasks in NLP, is still not well solved. People are approaching this problem in different ways, but it seems that joint extraction works better than handling entities and relations separately. The model proposed in CoType by Xiang Ren et al. has become a baseline. New baselines have been proposed that perform better, though the boost is marginal. For example, Rijhwani et al. (CMU) proposed Zero-Shot Neural Transfer for Cross-Lingual Entity Linking. Changzhi Sun & Yuanbin Wu (East China Normal University) proposed Distantly Supervised Entity Relation Extraction with Adapted Manual Annotations. Gupta et al. (Siemens) proposed Neural Relation Extraction Within and Across Sentence Boundaries.
(4) Most advances in QA systems are still limited to the answer selection task. Generating natural language is still very difficult, even with DNNs. There is an interesting work by Lili Yao (Peking University) that generates short stories from a given keyphrase, but the code is not ready to be released.

(5) There is one paper on a framework for question generation from phrase extraction by Siyuan Wang et al. (Fudan University), which is related to my recent research in summarization. However, the input to the system is a single sentence rather than a paragraph, not to mention full text, so it is not directly applicable to our work. Some session names look interesting in general, but the papers are not very interesting because they usually focus on a very narrow topic.

The IAAI session I attended featured 5 presentations. 
·       Early-stopping of scattering pattern observation with Bayesian Modeling by Asahara et al. (Japan). This is a good example of applying AI to physics. They are basically using unsupervised learning to predict neutron scattering patterns. The goal is to reduce the cost of building equipment that generates powerful neutron beams.
·       Novelty Detection for Multispectral Images with Application to Planetary Exploration by Hannah R. Kerner et al. (ASU). The authors are designing AI techniques to facilitate fast decision making for the Mars project.
·       Expert Guided Rule Based Prioritization of Scientifically Relevant Images for Downlinking over Limited Bandwidth from Planetary Orbiters by Srija Chakraborty (ASU).
·       Ours on Cleaning Noisy and Heterogeneous Metadata for Record Linking across Scholarly Big Datasets. I received a comment and a question. The comment, from an audience member named Chris Lesner, pointed me to MinHashing over shingles, which could be another potential solution to the scalability problem. The question, from another attendee, was about understanding the entity matching problem. I also got the business card of Diane Oyen, a staff scientist in the Information Science group at Los Alamos National Lab. She has some interesting problems in plagiarism detection on which we could potentially collaborate.
·       A fast machine learning workflow for rapid phenotype prediction from whole shotgun metagenomes by Anna Paola Carrieri et al. (IBM)
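For readers unfamiliar with the MinHash-over-shingles idea mentioned in the comment on our paper, here is a minimal sketch in Python. It is not the method from our paper. The function names, the 3-character shingle size, the 128 hash functions, and the two example titles are all assumptions chosen for illustration. The idea is that two short signatures can be compared to estimate the Jaccard similarity of two strings without comparing their full shingle sets.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of character k-grams of a lowercased, whitespace-normalized string."""
    s = " ".join(text.lower().split())
    if len(s) < k:
        return {s}
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def _hash(shingle, seed):
    """Deterministic 64-bit hash; each seed acts as a different hash function."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(shingle_set, num_hashes=128):
    """For each hash function, keep the minimum hash value over the set."""
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity of two sets, for comparison with the estimate."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Hypothetical metadata strings for the same paper from two sources.
    a = shingles("Cleaning Noisy and Heterogeneous Metadata for Record Linking")
    b = shingles("Cleaning noisy & heterogeneous metadata for record linkage")
    print("exact    :", round(exact_jaccard(a, b), 3))
    print("estimated:", round(estimated_jaccard(minhash_signature(a), minhash_signature(b)), 3))
```

In a real record-linking pipeline the signatures would usually be grouped with locality-sensitive hashing, so that only records landing in the same bucket are compared. That grouping step is the part that addresses scalability, and it is left out of this sketch.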
One impression I had of the conference is that most presentations were terrible. Dr. Huan Sun, assistant professor at OSU, agreed. This is the disadvantage of sending students to the conference: it is much more beneficial to the students than to the audience. The slides were not readable, the presenters' voices were low, and many presenters did not spend enough time explaining key results, leaving essentially no room for high-quality questions: the audience just did not understand what they were talking about! In particular, although Chinese scholars got many papers accepted and presented, most did not present well. Most of the audience were swiping on their smartphones rather than listening to the talks.
Another impression is that the conference is too big! There is little chance to get enough coverage and meet with speakers. I was lucky to meet my old colleagues from Penn State: Madian Khabsa (now at Apple), Shuting Wang (now at Facebook), and Alex Ororbia II (now at RIT). I also met Prof. Huan Liu of ASU and had lunch with a few new friends from Apple.
Overall, the conference was well organized, although the program arrived very late, which delayed my trip planning. The location had very good scenery, except that it was too expensive. Registration was almost $1k, but lunch was not covered! Hawaii is very beautiful. I enjoyed Waikiki beach and came across a rainbow.

Jian Wu