Friday, October 19, 2018

2018-10-19: Some tricks to parse XML files

Recently I was parsing the ACM DL metadata in XML files. I thought parsing XML would be a straightforward job, given that Python has long offered sophisticated packages such as BeautifulSoup and lxml. But I still ran into several problems, and it took me quite a bit of time to figure out how to handle all of them. Here I share some tricks I learned. They are not meant to be a complete list, but the solutions are general enough to serve as starting points for future XML parsing jobs.


  • CDATA. CDATA sections appear in the values of many XML fields. CDATA stands for Character Data. Strings inside a CDATA section are not parsed; in other words, they are kept exactly as they are, including markup. One example is
    <script> 
    <![CDATA[  <message> Welcome to TutorialsPoint </message>  ]]>
    </script>
  • Encoding. Encoding is a pain in text processing. The problem is that there is no way to know what encoding a text uses before opening and reading it (at least in Python). So we must sniff it by trying to open and read the file with a candidate encoding. If the encoding is wrong, the program usually throws an error. In that case, we try another possible encoding. The "file" command in Linux reports encoding information, which is how I know there are two encodings in the ACM DL XML files: ASCII and ISO-8859.
  • HTML entities, such as &auml;. The only five built-in entities in XML are &quot;, &amp;, &apos;, &lt;, and &gt;. Any other entity must be defined in a DTD file that says what it means. For example, the DBLP.xml file comes with a DTD file. The ACM DL XML should have associated DTD files, proceedings.dtd and periodicals.dtd, but they are not in my dataset.
The following snippet of Python code solves all three problems above and gives me the correct parsing results.

import codecs
import logging

from bs4 import BeautifulSoup

encodings = ['ISO-8859-1', 'ascii']
for e in encodings:
    try:
        fh = codecs.open(confc['xmlfile'], 'r', encoding=e)
        fh.read()   # actually decode the file; a wrong encoding raises UnicodeDecodeError
        fh.close()
    except UnicodeDecodeError:
        logging.debug('got unicode error with %s, trying a different encoding' % e)
    else:
        logging.debug('opening the file with encoding: %s' % e)
        break

f = codecs.open(confc['xmlfile'], encoding=e)
soup = BeautifulSoup(f.read(), 'html.parser')


Note that we use codecs.open() instead of the Python built-in open(), and that we open the file twice: the first time only to check the encoding, and the second time to pass the whole file to a handle before it is parsed by BeautifulSoup. I found BeautifulSoup better than lxml for handling this XML parsing job, not just because it is easier to use but also because it lets you pick the parser. Note that I chose html.parser instead of the lxml parser, because the lxml parser was not able to parse all entries (for some unknown reason); other users have reported this on Stack Overflow.
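As a quick sanity check of the entity handling, here is a minimal sketch of my own (the record is made up for illustration, not taken from the ACM DL data) showing that html.parser decodes HTML entities such as &auml;, which plain XML only knows through a DTD:

from bs4 import BeautifulSoup

# &auml; is not one of the five built-in XML entities, but html.parser
# understands the full HTML entity set and decodes it to a character.
snippet = '<author>J&auml;rvelin</author>'
print(BeautifulSoup(snippet, 'html.parser').get_text())  # prints: Järvelin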

Jian Wu

Thursday, October 11, 2018

2018-10-11: iPRES 2018 Trip Report

September 24th marked the beginning of iPRES 2018 in Boston, MA, to which both Shawn Jones and I traveled from New Mexico to present our accepted papers: Measuring News Similarity Across Ten U.S. News Sites, The Off-Topic Memento Toolkit, and The Many Shapes of Archive-It.

iPRES ran paper and workshop sessions in parallel, so I will focus on the sessions I was able to attend. However, this year the organizers created and shared collaborative notes for all sessions with all attendees, to help those who could not attend individual sessions. All the presentation materials and associated papers were also made available via Google Drive.

Day 1 (September 24, 2018): Workshops & Tutorials

On the first day of iPRES, attendees gathered at the Joseph B. Martin Conference Center at Harvard Medical School to pick up their registration lanyards and iPRES swag.

Afterwards, there were workshops and tutorials scheduled throughout the day. Registrants needed to sign up early to get into these workshops. Many different topics were available for attendees to choose from, listed on the Open Science Framework event page. Shawn and I chose to attend:
  • Archiving Email: Strategies, Tools, Techniques. A tutorial by: Christopher John Prom and Tricia Patterson.
  • Human Scale Web Collecting for Individuals and Institutions (Webrecorder Workshop). A workshop by: Anna Perricci.
Our first session, on Archiving Email, consisted of talks and small-group discussions on various topics and tools for archiving email. It started with talks on the adoption of email preservation systems in our organizations. Within our group discussion, we found that few organizations have email preservation systems. I found the research ideas and topics stemming from these talks very interesting, especially the prospect of studying natural language in email content.
Many of the difficulties of archiving email unsurprisingly revolve around issues of privacy. The difficulties range from actually requesting and acquiring emails from users, to discovering and disclosing sensitive information inside emails, to other ethical decisions involved in preserving email.

Email preservation also has the challenge of curating at scale. As one can imagine, going through millions of emails inside a collection can be time-consuming and redundant, which calls for new tools to address these challenges.
This workshop also introduced many interesting tools for archiving and exploring email.

Many different workflows for archiving email with the aforementioned tools were explained thoroughly at the end of the session. These workflows covered migrations with different tools, accessing disk images of stored emails and attachments via emulation, and bit-level preservation.

Following the email archiving session, we continued on to the Human Scale Web Collecting for Individuals and Institutions session presented by Anna Perricci from the Webrecorder team.


Having used Webrecorder before, I was very excited for this session. Anna walked through the process of registering and starting your first collection. She explained how to start sessions and how collections are formed as easily as clicking through links on a website. Webrecorder handles JavaScript replay very efficiently: for example, videos streamed from a site like Vine or YouTube are recorded from the user's perspective and then available for replay later. Other examples included automated scrolling through Twitter feeds and capturing interactive news stories from the New York Times.
During the presentation, Anna showed Webrecorder's capability to extract mementos from other web archives in order to repair missing content. For example, it took CNN mementos from the Internet Archive dated after November 1, 2016, and fixed their replay by aggregating resources from other web archives and the live web, although this could also be potentially harmful. This is an example of Time Travel Reconstruct as implemented in pywb.

Ilya Kreymer presented the use of Docker containers for emulating different browser environments and how they could play an important role in replaying specific content like Flash. He demonstrated various open source tools available on GitHub, including pywb, Webrecorder WARC player, warcio, and warcit.
Ilya also teased Webrecorder's Auto Archiver prototype, a system that understands how Scalar websites work and can anticipate URI patterns and other behaviors of these platforms. Auto Archiver automates the capture of many different web resources on a website, including video and other sources.
Webrecorder Scalar automation demo for a Scalar website

To finish the first day, attendees were transported to a reception hosted at the MIT Samberg Conference Center accompanied by a great view of Boston.

Day 2 (September 25, 2018): Paper Presentations and Lightning Talks

To start the day, attendees gathered for the plenary session, which was opened with a statement from Chris Bourg.



Eve Blau then continued the session by presenting Urban Intermedia: City, Archive, Narrative, the capstone project of a Mellon grant. The talk was about a Mellon Foundation project, the Harvard Mellon Urban Initiative, a collaborative effort across multiple institutions in architecture, design, and the humanities. Using multimedia and visual constructs, it examined the processes and practices that shape geographical boundaries, focusing on blind spots in:
  • Planned / unplanned - informal processes
  • Migration / mobility, patterns, modalities of inclusion & exclusion
  • Dynamic of nature & technology, urban ecologies
After the keynote I hurried over to open the Web Preservation session with my paper on Measuring News Similarity Across Ten U.S. News Sites. I explained our methodology for selecting archived news sites, the tool top-news-selectors we created for mining archived news, how the similarity of news collections was calculated, the events that peaked in similarity, and how the U.S. election was recognized as a significant event across many of the news sites.


Following my presentation, Shawn Jones presented his paper The Off-Topic Memento Toolkit. Shawn's presentation focused on the many different use cases of Archive-It collections and then detailed how these collections can go off topic, for example through pages with missing resources at a point in time, content drift that brings different languages into a collection, and site redesigns. This led to the development of the Off-Topic Memento Toolkit, which detects off-topic mementos inside a collection by collecting each memento and assigning it a score, testing multiple different measures. In this study, Word Count had the highest accuracy and best F1 score for detecting off-topic mementos.

Shawn also presented his paper The Many Shapes of Archive-It. He explained how to understand Archive-It collections using their content, metadata (Dublin Core and custom fields), and collection structure, as well as the issues that come with these methods. Using 9,351 collections from Archive-It as data, Shawn explained the concept of growth curves for collections, which compare seed count, memento count, and memento-datetimes. Using different classifiers, Shawn showed that the structural features of a collection can be used to predict its semantic category, with Random Forest found to be the best classifier.


Following lunch, I headed to the amphitheater to see Dragan Espenschied's short paper presentation Fencing Apparently Infinite Objects. Dragan questioned how objects, here synonymous with a file or a collection of files, are bounded in digital preservation. He introduced the concept of "performative boundaries" to explain the different potentials of an object: bound, blurry, and boundless, using software examples such as early-2000s Microsoft Word (bound), Apple's QuickTime (blurry), and Instagram (boundless). He shared productive approaches for future replay of these objects:

  • Emulation of auxiliary machines
  • Synthetic stub services or simulations
  • Capture network traffic and re-enact on access 

Dragan Espenschied presenting on Apparently Infinite Objects 
The next presentation was Digital Preservation in Contemporary Visual Art Workflows by Laura Molloy, who presented remotely. This presentation showed that digital preservation of one's own work is usually not part of what is taught at art school, and argued that it should be. Digital technologies are widely used today for creating art in a variety of formats. When various artists were asked about digital preservation, here is how they answered:
“It’s not the kind of thing that gets taught in art school, is it?”
“You don’t need to be trained in [using and preserving digital objects]. It’s got to be instinctive and you just need to keep it very simple. Those technical things are invented by IT guys who don’t have any imagination.” 
The third presentation was by Morgane Stricot for her short paper Open the museum's gates to pirates: Hacking for the sake of digital art preservation. Morgane explained that software dependency is a major threat to digital art and that supporting media archaeology is required to preserve some forms of digital art. Backups of older operating systems (OS) on disk help avoid incompatibility issues. She also detailed how copyright-restricted software, for example older versions of Mac OS, is difficult to find, and how pirates as well as "harmless hackers" have cracks that provide access to these OS environments, while some remain unsalvageable.
The final paper presentation was by Claudia Roeck, on her long paper Evaluation of preservation strategies for an interactive, software-based artwork with complex behavior using the case study Horizons (2008) by Geert Mul. Claudia explored different possible preservation strategies for software, such as reprogramming in a different programming language, migration, virtualization, and emulation, as well as the significant properties that determine which qualities one would want to preserve. She used Horizons as an example project to explore the use cases and determined that reprogramming was the most suitable of the options considered. However, she noted that there was no clear winner for the best mid-term preservation strategy for the work.
For the rest of the day, lightning talks were available to attendees, and the room became packed with viewers. Some of these talks introduced preservation games to be held the next day, such as Save my Bits, Obsolescence, the Digital Preservation Storage Criteria Game, and more. Ilya, from Webrecorder, held a lightning talk showing a demo of the new Auto Archiver prototype for Webrecorder.


After the proceedings another fantastic reception was held, this time at the Harvard Art Museum.

Harvard Art Museum at night

Day 3 (September 26, 2018): Minute Madness, Poster Sessions, and Awards 

The day opened with a review of iPRES's achievements and challenges over the past 15 years, with a panel composed of William Kilbride, Eld Zierau, Cal Lee, and Barbara Sierman. Achievements included the innovation of new research as well as the courage to share and collaborate among peers with similar research interests. This led to iPRES's adoption of cross-domain preservation across libraries, archives, and digital art. Some of the challenges include archivists' decisions about what to do with past and future data, and conforming to the OAIS standard.
After talking about the past 15 years, it was time to talk about the next 15 years, with a panel composed of Corey Davis, Micky Lindlar, Sally Vermaaten, and Paul Wheatley. This panel discussed what would make it easier for more attendees to take part in the future. They discussed possible organizational models to emulate for regional meetings, such as code4lib and NDSR. There were also suggestions for updates to the Code of Conduct and the value it should hold going forward.
After the discussion panels it was time for minute madness. I had seen videos of this before, but it was the first time I had seen it in person, and I found it somewhat theatrical. Presenters had one minute to pitch their research so that we would visit them later during the poster session, and some of them put on quite a show, like Remco van Veenendaal. The topics ranged from workflow integration and new portals for preserving digital materials to code ethics and timelines detailing file formats.

After the minute madness, attendees wandered around to view the posters. The first poster I visited conveniently referenced work from our WSDL group!
Another interesting poster consisted of research into file format usage over time.
I was also surprised by the number of tools and technologies behind some of the new preservation platforms for government agencies that have emerged, like Vitam, the French government IT program for digital archiving.

Vitam poster presentation for their digital archiving architecture
Following the poster sessions, I was back to paper presentations, where Tomasz Miksa presented his long paper Defining requirements for machine-actionable data management plans. This talk was about machine-actionable data management plans (maDMPs), which are living documents automated by information collection and notification systems. He showed how current data management plans could be transformed to reuse existing standards such as Dublin Core and PREMIS.
Alex Green then went on to present her short paper Using blockchain to engender trust in public digital archives. She explained that archivists alter, migrate, normalize, and otherwise change digital files, but there is little proof that a researcher receives an authentic copy of a digital file. The ARCHANGEL project proposes to use blockchain to verify the integrity and provenance of these files. It is still unknown whether blockchain will prevail as a lasting technology, as it is still very new. David Rosenthal wrote a review of this paper on his blog.
I then went on to the Storage Organization and Integrity session to see the long paper presentation Checksums on Modern Filesystems, or: On the virtuous consumption of CPU cycles by Alex Garnett and Mike Winter. The talk focused on computing checksums on files to detect bit rot in digital objects and compared different approaches to verifying bit-level preservation. It showed that data integrity can be achieved when hardware and filesystems such as ZFS are dedicated to digital preservation. This work bridges digital preservation practices and high-performance computing for detecting bit rot.
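As a simple illustration of the underlying idea (my own sketch, not the systems discussed in the talk; the filename is hypothetical), a fixity check boils down to recording a checksum at ingest and recomputing it later:

import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file, reading it in 1 MB chunks."""
    h = hashlib.sha256()
    with open(path, 'rb') as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

stored = sha256_of('object.warc')              # recorded at ingest time
assert sha256_of('object.warc') == stored      # periodic re-check; a mismatch signals bit rot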

After this presentation I stayed for the short paper presentation The Oxford Common File Layout by David Wilcox. The Oxford Common File Layout (OCFL) is an effort to define a shared approach to file hierarchy for long-term preservation. The goals of the layout are structure at scale, readiness for migrations with minimal file transfers, and a design that can be managed by many different applications. With a set of defined principles for the file layout, such as the ability to log transactions on digital objects, a draft spec release is planned for sometime at the end of 2018.
This day closed with the award ceremony for best poster, short papers, and long papers. My paper, Measuring News Similarity Across Ten U.S. News Sites, was nominated for best long paper but did not prevail as the winner. The winners were as follows:
  • Best short paper: PREMIS 3 OWL Ontology: Engaging Sets of Linked Data
  • Best long paper: The Rescue of the Danish Bits - A case study of the rescue of bits and how the digital preservation community supported it  by Eld Zierau
  • Best poster award: Precise & Persistent Web Archive References by Eld Zierau



Day 4 (September 27, 2018): Conference Wrap-up

The final day of iPRES 2018 was composed of paper presentations, discussion panels, community discussions, and games. I chose to attend the paper presentations.

The first paper presentation I viewed was Between creators and keepers: How HNI builds its digital archive by Ania Molenda. Over 4 million documents were recorded to track progressive thinking in Dutch architecture. When converting and moving these materials into a digital archive, many issues were observed, such as duplicate materials, file formats with complex dependencies, the time and effort needed to digitize the multitude of documents, and knowledge about accessing these documents lost over time with no standards in place.

Afterwards I watched the presentation on Data Recovery and Investigation from 8-inch Floppy Disk Media: Three Use Cases by Abigail Adams. It covered the acquisition of three different floppy disk collections dating from 1977 to 1989! The presentation introduced me to unfamiliar hardware, software, and encodings required to recover data from floppy disk media, as well as a workflow for data recovery from these floppies.

The last paper presentation I attended was Email Preservation at Scale: Preliminary Findings Supporting the Use of Predictive Coding by Joanne Kaczmarek and Brent West. Having already been to the email preservation workshop, I was excited for this presentation and I was not let down. Using 20 GB of publicly available emails, they applied two different methods, a Capstone approach and a predictive coding approach, to discover sensitive content inside emails. With the predictive coding approach, which uses machine learning to train on and classify documents, they showed preliminary results indicating that automatic email classification can handle email at scale.

As a final farewell, attendees were handed bags of tulip buds and told this:
"An Honorary Award will be presented to the people with the best tulip pictures."
It seems William Kilbride, among others, has already got a leg up on the competition.
This marks the end of my first academic conference as well as my first visit to Boston, Massachusetts. It was an enjoyable experience with a lot of exposure to diverse research fields in digital preservation. I look forward to submitting work to this conference again and hearing about future research in the realm of digital preservation.


Resources for iPRES 2018:


Wednesday, October 10, 2018

2018-10-10: Americans More Open Than Asians to Sharing Personal Information on Twitter: A Paper Review

Mat Kelly reviews "A Personal Privacy Preserving Framework..." by Song et al. at SIGIR 2018.


Americans are more open to share personal aspects on the Web than Asians. — Song et al. 2018

I recently read a paper published at SIGIR 2018 by Song et al. titled "A Personal Privacy Preserving Framework: I Let You Know Who Can See What" (PDF). The title alone captivated my interest with the above claim deep within the text.

The authors' goal was to reduce users' privacy risks on social networks by determining who could see what sort of information they posted. They did so by establishing boundary regulations from a summary of the literature and associating them with 32 categories, each corresponding to a personal aspect of a user, broken down into 8 groups spanning personal attributes to life milestones. The authors then fed a list of keywords to the Twitter Search Service for each category they established. From this taxonomy they created a model to uncover personal aspects from users' posts. Their model, TOKEN (a forced acronym of laTent grOup multi-tasK lEarniNg), allowed the authors to create guidelines for users' information disclosure to four kinds of social circles and to generate a data set consisting of a rich set of privacy-oriented features (available here).

The authors noted that users' private tweets are very sparse, so they used the Twitter service to gather posts matching the categories in their taxonomy, collecting just over 269k tweets. To reduce noise in the collection, the authors filtered out tweets containing URLs that did not reference the users' other social media posts. Retweets and tweets of fewer than 50 characters were also excluded; the authors did not justify this exclusion.

To establish a ground truth, the authors used Amazon Mechanical Turk to have each post annotated with their selected categories. Turkers whose labels did not agree with the authors' sample at least 80% of the time were excluded from the results. This procedure resulted in just over 11k labeled posts. To determine inter-worker reliability, the authors employed Fleiss' kappa (PDF of the 1969 paper), adapting for the potential variance in labels per post by reducing the task to a binary classification, and found moderate agreement (a Fleiss coefficient of 0.43).
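For readers unfamiliar with the measure, here is a minimal sketch (a toy example of my own, not the authors' data) of how such an agreement score can be computed with statsmodels once the labels are reduced to the binary present/absent form described above:

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = posts, columns = Turkers; 1 = personal aspect present, 0 = absent (toy labels)
labels = np.array([[1, 1, 0],
                   [0, 0, 0],
                   [1, 1, 1],
                   [1, 0, 0]])

table, _ = aggregate_raters(labels)  # per-post counts for each of the two categories
print(fleiss_kappa(table))           # values of 0.41-0.60 are conventionally read as "moderate agreement"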

The authors then extracted a set of privacy-oriented linguistic features using Linguistic Inquiry Word Count (LIWC), a Privacy Dictionary (per Vasalou et al.'s 2011 JASIST work), Sentiment Analysis (via Stanford's NLP classifier), Sentence2Vector (with each tweet a sentence), and an ad hoc meta-feature approach. The aforementioned final approach considered the presence of hashtags, slang words, images, emojis, and user mentions. Slang, here, was identified using the Internet Slang Dictionary.

Following this analysis, the authors established a prediction component by first formulating a predictive model inter-relating each of the 32 "tasks" within the 8 "groups". The authors anticipated that tasks within the same group would share relevant features; e.g., "places planning to go" and "current location" would share common features within the location group of their taxonomy. From this initial formulation they established the matrix L, whose columns represent the latent features, and S, whose rows represent the weights of the features in L.

To solve for L and S, the authors optimized one variable while fixing the other in each iteration. To determine L, they took the derivative of their objective function (their Equation 5, see the paper) with respect to L, producing a linear system with a vector B representing the stacking of columns into a single matrix and A, a definite and invertible matrix. Computing S with L fixed is a bit more mathematically involved, and I will leave it as an exercise for the interested reader.

Prescription

...there is still a societal consensus that certain information is more private than the others (sic) from a general societal view. — A. Islam et al. 2014

The authors used Mechanical Turk to build guidelines regarding disclosure norms in different circles. This was performed on two selections of Turkers limited to the respective geographies of the U.S. and Asia. The authors note that 99% of the Asian participants were Indians. An anticipated real-world goal of the authors was that a user posting a tweet containing information on a health condition (for example) would set the privacy setting to share it only with her family members. This, I felt, would be an odd recommendation given:

  1. The corpus was of publicly available tweets.
  2. Twitter does not currently have a means of limiting who may see a tweet akin to services like Facebook.

This drastically reduces the usefulness of the recommendation, I feel, in the context of the medium observed.

Verification

The authors sought to detect privacy leakage by evaluating the precision of TOKEN using the S@K and P@K metrics, as they had previously done in Song et al. 2015 (IJCAI). Here, S@K represents the mean probability that a correct interest is captured within the top K recommended categories, and P@K is the proportion of the top K recommendations that are correct. They used a grid search strategy with 10-fold cross-validation to obtain the optimal parameters.
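My reading of those two metrics as a small sketch (illustrative only; the names and data structures are mine, not the authors'):

def s_at_k(ranked_categories, true_categories, k):
    """Mean probability that at least one correct category appears in the top K."""
    hits = [any(c in truth for c in ranked[:k])
            for ranked, truth in zip(ranked_categories, true_categories)]
    return sum(hits) / len(hits)

def p_at_k(ranked_categories, true_categories, k):
    """Mean proportion of the top K recommendations that are correct."""
    precisions = [sum(c in truth for c in ranked[:k]) / k
                  for ranked, truth in zip(ranked_categories, true_categories)]
    return sum(precisions) / len(precisions)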

Using S@K and P@K where K was set to 1, 3, and 5, the authors found LIWC to be most representative of the characterization of users' privacy features as compared to the aforementioned Privacy Dictionary, Sentence2Vector, etc. approaches. They attributed this to LIWC's inclusion of pronouns and verb tense that provide references and temporal hints.

In applying these feature configurations to their corpus, the authors noticed that timestamps played an important role in identifying private information leakage, so they took a detour to cursorily explore this. Based on the patterns found (pictured below), various activities peak at certain times of day, e.g., drug and alcohol tweets around "20pm" (sic). It is unclear from the paper whether this was applied to both the U.S. and the Asian results. Further, the multi-plot display with variable y-axis scales across plots produces a deceptive result that the authors do not address.

Plots per Song et al. show temporal patterns.

To finally validate their model compared to S@K and P@K, they used SVM, MTL_Lasso,

A different set of categories was compared to show similarities in sharing comfort between Americans and Asians. For these (healthcare treatments, health conditions, passing away, specific complaints, home address, current location, contact information, and places planning to go), American Turkers were much more restrictive about sharing with the outside world, while Asian Turkers exhibited a similarly and relatively conservative sentiment about sharing. From this, the authors concluded:

Americans are more open to share personal aspects on the Web than Asians. — Song et al. 2018

Take Home

I found this study interesting despite some of the methodological problems and derived conclusions. As I mentioned, the inability to regulate who sees a tweet when posting (a la Facebook) affects the nature of the tweet, with a likely bias toward tweeters who are less concerned about privacy. The authors did not mention whether each Turker was asked if they personally used Twitter, or whether the Turkers were even told that the texts they judged were tweets and not just "messages posted online". This context, if excluded, could make those judging the tweets unsuitable to do so. I would hope to see an expanded version of this study (say, posted to arXiv) with more comprehensive results, as the authors stated space was a limitation, but there was no indication of such plans.

—Mat (@machawk1)

Monday, September 3, 2018

2018-09-03: Trip Report for useR! 2018 Conference




This year I was really lucky to have my abstract and poster accepted for the useR! 2018 conference. useR! is an annual worldwide conference for the international R user and developer community. The fourteenth annual conference was held in the Southern Hemisphere, in Brisbane, Australia, from July 10 to 13, 2018. The four-day conference consisted of nineteen 3-hour tutorials, seven keynote speeches, and more than 200 contributed talks, lightning talks, and posters on using, extending, and deploying R. This year the program gathered almost 600 users of the data analysis language R, from all corners of the world and at various levels of expertise.

Distribution map of useR! 2018 participants across the globe


Fortunately, I was also granted a travel scholarship from useR! 2018 and could attend the conference, including the tutorial sessions, for free (thanks, useR! 2018).

Day 1 (July 10, 2018): Registration and Tutorial

The conference was held at the Brisbane Convention and Exhibition Centre (BCEC). Each participant registered at the secretariat desk and received a goodie bag containing a t-shirt, a pair of socks, and (if lucky) a lanyard. Name tags could be picked up from a board, ordered by last name.

The Secretariat Desk


T-shirt and name tag from useR! 2018


useR! 2018 was identified with hexagonal shapes, which could be found everywhere: on the name tags, the hex stickers, and of course the amazing hexwall designed by Mitchell O'Hara-Wild, who also wrote a blog post about how he created it. There was also a hexwall photo contest in which all conference attendees were asked to take a picture with the hexwall and post it on Twitter with the hashtag #hexwall.

Me and the hexwall


The R tutorials were conducted in parallel sessions from Tuesday to Wednesday morning (July 10-11, 2018). Each participant could attend a maximum of three tutorials. The first tutorial I attended was Wrangling Data in the Tidyverse by Simon Jackson.

This was my first time using the Tidyverse, and once I got familiar with it I found it really helpful for data transformation and visualization. Using example data from booking.com, we got hands-on experience with various data wrangling techniques such as handling missing values, reshaping, filtering, and selecting data. The thing I love most about the Tidyverse is the dplyr package. It comes with a very interesting pipe feature (%>%) that allows us to chain together many operations.

In the second tutorial, Statistical Models for Sport in R by Stephanie Kovalchik, we learned how to use R to implement statistical models that are common in sports statistics. The tutorial consisted of three parts:
  1. web scraping to gather and clean public sports data using RSelenium and rvest
  2. explore data with graphics
  3. implementing models: Bradley-Terry paired comparison models, the Pythagorean Theorem, Generalized Additive Models, and forecasting with Bayes.
During the tutorials session, I met three other Indonesians who are currently studying in Australia as Ph.D. students (small world!).
Indonesian students at useR! 2018

Day 2 (July 11, 2018): Tutorial, Opening Ceremony, and Poster Presentation. 

Tutorial

The morning session was filled with tutorials continuing the series that began the day before. I attended the tutorial Follow Me: Introduction to social media analysis in R by Maria Prokofieva, Saskia Freytag, and Anna Quaglieri.

Dr. Maria Prokofieva talked about social media analytics using R

During this 2.5-hour tutorial, we learned how to use the R libraries twitteR and rtweet to extract data from Twitter and then tokenize the tweets in the text column using tidytext. In general, the whole process is quite similar to what I learned in the Web Science class taught by Dr. Michael Nelson at Old Dominion University (ODU), except that everything is done in R instead of Python. At the end of the session, we were given a challenge to compare tweets mentioning Harry with tweets mentioning Meghan in the royal wedding time series. Answers were to be uploaded to Twitter using the hashtags #useR2018, #rstats, and #socialmediachallenge. All tutorial materials are available on R-Ladies Melbourne's GitHub.

R-Ladies Gathering

There was an R-Ladies gathering over lunch after the tutorial session. It was an excellent opportunity to meet other amazing R-Ladies members who have done various projects and research in R and have had their R libraries published on CRAN. It was really inspiring to hear their stories of promoting gender diversity in the R community. There are 75 R-Ladies groups spread across the globe. Unfortunately, there is no R-Ladies group in Indonesia at the moment. Maybe I should start one?
With Jenny Bryan during the R-Ladies meeting

Opening Ceremony, Keynote Speeches, and Poster Lightning Talk

At 1:30 pm, all conference attendees gathered in the auditorium for the Opening Ceremony. The event started with a Welcome to Country performance by Songwoman Maroochy, followed by an opening speech delivered by the useR! 2018 chief organizer, Professor Di Cook of the Department of Econometrics and Business Statistics at Monash University. In her remarks, Professor Cook encouraged all attendees to enjoy the meeting, learn as much as we can, and be mindful of ensuring that others have a good experience, too.

Opening speech by Professor Di Cook

By the way, for those who are curious, here's a sneak peek of the Songwoman Maroochy performance.



Next, we had a keynote speech by Steph de Silva, Beyond Syntax, Towards Culture: The Potentiality of Deep Open Source Communities.

After the keynote speech, there was a poster lightning talk session where every presenter was given a chance to advertise their work, let everyone know what it is about, and encourage people to come and see it during the poster session.
My poster lightning talk
Before ending the opening ceremony, there was another keynote speech by Kelly O'Briant of RStudio titled RStudio 2018 - Who We are and What We Do.

Poster Session.

The poster session wrapped up the day. I am so grateful that useR! 2018 used all-electronic posters, so we did not have to bother printing a large poster and carrying it across the globe to Australia. There were two poster sessions, one on Wednesday evening and another during lunch on Thursday. For the poster presentations, the conference committee provided twenty 47-inch TVs with HDMI connections for our laptops. This way, if someone asked, we could do a live demo or show a specific part of our code on the TV as well.

At this conference, I presented a poster titled AnalevR: An Interactive R-Based Analysis Environment for Utilizing BPS-Statistics Indonesia Data. The project idea originated from challenges we face at BPS-Statistics Indonesia. BPS produces a massive amount of strategic data every year. However, these data are still underutilized by the public because of issues such as bureaucratic procedure, fees, and long waits to have requested data processed. That is why we introduced AnalevR, an online R-based analysis environment that allows anyone, anywhere to access BPS data and perform analyses by typing R code into a notebook-like interface and getting the output immediately. The project is still a prototype and currently under development. The poster and the code are available on my GitHub.
Me during the poster session

Day 3 (July 12, 2018): Keynote Speech, Talk, Poster Presentation, and Conference Dinner

The agenda for day 3 was packed with two keynote speeches, several talks, a poster presentation, and the conference dinner.

Keynote Speech

The first keynote speech was The Grammar of Animation by Thomas Lin Pedersen (video, slides). In his speech, Pedersen explained that a visualization falls somewhere among three dimensions of DataViz nirvana: static, interactive, and animated. Each dimension has its own pros and cons. Mara Averick's tweet below gives a clearer illustration of this.
Pedersen implements this grammar concept by rewriting the gganimate package, which extends ggplot2 to include descriptions of animation such as transitions, views, and shadows. He made his presentation even more engaging by showing an example channeling Hans Rosling's 200 Countries, 200 Years, 4 Minutes visualization, built using the transition_time() function in gganimate.

The second keynote speech was Adventures with R: Two Stories of Analyses and a New Perspective on Data by Bill Venables. He discussed two recent analyses, one from psycholinguistics and the other from fisheries, that show the versatility of R in tackling the full range of challenges facing the statistician/modeler adventurer. He also compared Statistics and Data Science and how they relate to each other. The emerging field of data science is not a natural successor of Statistics; there are subtle differences between them. Professor Venables said that both are important and connected domains, but we have to think of them as bifurcating to some extent rather than taking on each other's roles. Things work best when the domain expert and the analyst work hand in hand.
Professor Venables ended his speech with two quotes that I would like to repeat here:

"The relationship between Mathematics and Statistics is like that between chemistry and winemaking. You can bring as much chemistry as you can to winemaking, but it takes more than chemistry to make a drinkable dry red wine." 

"Everyone here is smart, distinguish yourself by being kind."


There was a tribute to Bill Venables at the end of the event.

The Talk Sessions

There were 18 parallel sessions of talks conducted from 10:30 am to 4:50 pm. The sessions were held in three parts, separated by two tea breaks and one lunch break. I managed to attend eight talks covering topics in data handling and visualization.
  1. Statistical Inference: A Tidy Approach using R by Chester Ismay.
    Chester Ismay from DataCamp introduced the infer package, which was created to implement common classical inferential techniques in a tidyverse-friendly framework that is expressive of the underlying procedure. The package has four main objectives:
    1. Dataframe in, dataframe out
    2. Compose tests and intervals with pipes
    3. Unite computational and approximation methods
    4. Reading a chain of infer code should describe the inferential procedure
  2. Data Preprocessing using Recipes by Max Kuhn.
    Max Kuhn of RStudio gave a talk about the recipes package, which is aimed at preprocessing data for predictive modeling. Recipes works in three steps (recipe → prepare → bake):
    1. Create a recipe, which is the blueprint of how your data will be processed. No data has been modified at this point.
    2. Prepare the recipe using the training set. 
    3. Bake the training set and the test set. At this step, the actual modification will take place.
  3. Build Scalable Shiny Applications for Employee Attrition Prediction on Azure Cloud by Le Zhang
    Le Zhang of Microsoft delivered a talk about building a model for employee attrition prediction and deploying the analytical solution as a Shiny-based web service on the Azure cloud. The project is available on GitHub.
  4. Moving from Prototype to Production in R: A Look Inside the Machine Learning Infrastructure at Netflix by Bryan Galvin
    Bryan Galvin of Netflix gave the audience a look inside the machine learning infrastructure at Netflix. Galvin explained briefly how Netflix moves from prototype to production using R and a microframework named Metaflow. Here is the link to the slides.
  5. Rjs: Going Hand in Hand with Javascript by Jackson Kwok
    rjs is a package designed for combining JavaScript's visualization libraries with R's modeling packages to build tailor-made interactive apps. I think this package is super cool, and it was an absolute highlight of useR! 2018 for me. I will definitely spend some time learning it. Below is an example of an rjs implementation; check out the complete project on GitHub.
  6. Shiny meets Electron: Turn your Shiny App into a Standalone Desktop App in No Time by Katie Sasso
    Katie Sasso of Columbus Collaboratory shared how her team overcame the barriers to using Shiny in large-enterprise consulting by coupling R Portable with Electron. The result is a Shiny app in a stand-alone executable format. The details of her presentation, along with the source code and a tutorial video, are available on her GitHub.
  7. Combining R and Python with GraalVM by Stepan Sindelar
    Stepan Sindelar of Oracle Labs showed us how to combine R and Python into a polyglot application running on GraalVM, which enables operating on the same data without copying it when crossing language boundaries.
  8. Large Scale Data Visualization with Deck.gl and Shiny by Ian Hansel.
    Ian Hansel of Verge Labs talked about how to integrate deck.gl, a web data visualization framework released by Uber, with Shiny using the R package deckard.
Conference Dinner

The conference dinner ticket

The conference dinner could only be attended by people with a ticket. I was fortunate because, as a scholarship recipient, I got a free ticket for the dinner (again, thank you, useR! 2018 and R-Ladies Melbourne). There was a trivia quiz at the end of the dinner: attendees were grouped by the table they were sitting at and teamed up to answer all the questions on the question sheets. The solution to the quiz can be found here. The winning teams got free books as prizes.

The conference dinner and the trivia quiz
Day 4 (July 13, 2018): Keynote Speech, Talk, and Closing Ceremony

Keynote Speech

The last day of the conference started with the keynote speech Teaching R to New Users: From tapply to Tidyverse by Roger Peng. In his talk, Dr. Peng discussed teaching R and selling R to new users. It can be difficult to describe the value proposition of R to someone who has never seen it before: is it an interactive system for data analysis, or is it a sophisticated programming language for software developers? To answer this, Dr. Peng quoted a remark from John Chambers (one of the creators of the S language):

"The ambiguity [of the S language] is real and goes to a key objective: we wanted users to be able to begin in an interactive environment, where they did not consciously think of themselves as programming. Then as their needs became clearer and their sophistication increased, they should be able to slide gradually into programming, when the language and system aspects would become more important."

I think this is the beauty of R that attracts me: I do not have to jump into development directly, but can instead gradually transition into programming. To sum up, Dr. Peng shared keywords that can be useful in selling R to new users: free, open source, graphics, reproducibility - reporting - automation, R packages + community, RStudio, transferable skills, and jobs ($$).

Some tips for selling R by Dr. Roger Peng
The second keynote speech was R for Psychological Science by Danielle Navarro (video, slides). Dr. Navarro shared her experience teaching R to psychology students. Fear, apparently, is the main thing that prevents students from learning. She also talked about the difficulty she faced finding a good textbook for her class, which finally led her to write her own lecture notes. Her notes address student fears by using a relaxed style; this worked so well that she ended up with her own book and won a teaching award. Dr. Navarro ended her talk by encouraging everyone to conquer their fears and climb the mountain of R. It might not be easy to face the 'dragon' at the top, but there are always people who will support and help us. She reminded our community that we are stronger when we are kind to each other.
The third and last keynote was Code Smells and Feels by Jenny Bryan. She shared tips and tricks for writing code elegantly so that it is easier to understand and cheaper to modify. Some code smells apparently have official names, such as Primitive Obsession and Inappropriate Intimacy.
Here are some tips that I summarize from her talk:
  1. Write simple conditions
  2. Use helper functions
  3. Handle class properly
  4. Return and exit early
  5. Use polymorphism
  6. Use switch() if you need to dispatch different logic based on a string.
Besides the three great keynotes above, I also attended several short talks:
  1. Tidy forecasting in R by Rob Hyndman
  2. jstor: An R Package for Analysing Scientific Articles by Thomas Klebel
  3. What is in a name? 20 Years of R Release Management by Peter Dalgaard
  4. Sustainability Community Investment in Action - A Look at Some of the R Consortium Funded Grant Projects and Working Groups by Joseph Rickert
  5. What We are Doing in the R Consortium Funds by various funded researchers

Closing Ceremony

The closing speech was delivered by Professor Di Cook of the Department of Econometrics and Business Statistics at Monash University. There was also a small handover ceremony between Di Cook and Nathalie Vialaneix, who will organize next year's useR! 2019 in Toulouse, France.
At the end of the ceremony, the winners of the hexwall photo contest, chosen randomly, were announced.
It was indeed a delightful experience for me. I went home happy, with a list of homework and new packages to learn. For those who did not make it to the useR! 2018 conference, there is no need for FOMO: all talks and keynote speeches are posted online on the R Consortium's YouTube account.

I would like to thank Professor Di Cook of Monash University as well as R-Ladies Melbourne for giving me a scholarship and making it possible for me to attend this conference. I also congratulate the entire useR! 2018 organizing committee for their brilliant efforts in making this event a great success. I am really looking forward to joining next year's useR! 2019, which will be held July 9-12, 2019, in Toulouse, France. Do not miss the updates: check the website and follow the Twitter account @UseR2019_Conf and the hashtag #useR2019.

@erikaris