Thursday, June 26, 2014

2014-06-26: InfoVis Fall 2011 Class Projects

(Note: This is continuing a series of posts about visualizations created either by students in our research group or in our classes.)

I've been teaching the graduate Information Visualization course (then CS 795/895, now CS 725/825) since Fall 2011. Each semester, I assign an open-ended final project that asks students to create an interactive visualization of something they find interesting.  Here's an example of the project assignment. In this series of blog posts, I want to highlight a few of the projects from each course offering.  Some of these projects are still active and available for use, while others became inactive after their creators graduated.

The following projects are from the Fall 2011 semester.  Both Sawood and Corren are PhD students in our group.  Another nice project from this semester was done by our PhD student Yasmin AlNoamany and MS alum Kalpesh Padia.  The project led directly to Kalpesh's MS Thesis, which has its own blog post. (All class projects are listed in my InfoVis Gallery.)

K-12 Archive Explorer
Created by Sawood Alam and Chinmay Lokesh


The K-12 Web Archiving Program was developed for high schools in partnership with the Archive-It team at the Internet Archive and the Library of Congress. The program has been active since 2008 and allows students to capture web content to create collections that are archived for future generations. The visualization helps to aggregate this vast collection of information. The explorer (currently available at http://k12arch.herokuapp.com/) provides users with a single interface for fast exploration and visualization of the K-12 archive collections.

The video below provides a demo of the tool.




We Feel Fine: Visualizing the Psychological Valence of Emotions
Created by Corren McCoy and Elliot Peay



This work was inspired by the "We Feel Fine" project by Jonathan Harris and Sep Kamvar.  The creators harvested blog entries for occurrences of the phrases "I feel" and "I am feeling" to determine the emotion behind each statement. They collected and maintained a database of several million human feelings from prominent websites such as Myspace and Flickr. This work uses the "We Feel Fine" data to measure the nature and intensity of a person's emotional state as expressed in the emotion-laden sentiment of individual blog entries. The specific feeling-related words in the blogs are rated on a continuous 1 to 9 scale using a psychological valence score to determine the degree of happiness. This work also incorporates elements of the multi-dimensional color wheel of emotions popularized by Plutchik to visually show the similarities between words. For example, happy, positive feelings are bright yellow, while sad, negative feelings are dark blue. The final visualization combines a standard histogram describing the emotional states with an embedded word-frequency bar chart. We refer to this visualization technique as a "valence bar", which allows us to compare not only how the words used to express emotion have changed over time, but also how this usage differs between men and women.
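
As a rough illustration of the color encoding described above (a hypothetical sketch, not the project's actual code), a linear color scale in D3.js could map the 1 to 9 valence range onto a dark-blue-to-bright-yellow gradient:

    // Hypothetical sketch of the valence-to-color encoding; written against
    // D3.js v4+, not the code used in the actual project.
    var valenceColor = d3.scaleLinear()
        .domain([1, 9])                     // 1 = most negative, 9 = most positive valence
        .range(['#00008b', '#ffff00']);     // dark blue (sad) to bright yellow (happy)

    valenceColor(2);   // a dark blue for a sad, negative feeling
    valenceColor(8);   // a bright yellow for a happy, positive feeling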

The video below shows a screencast highlighting how the valence bars change for different age groups and different years.



-Michele

Monday, June 23, 2014

2014-06-23: Federal Big Data Summit


On June 19th and 20th, I attended the Federal Big Data Summit at the Ronald Reagan Building in the heart of Washington, D.C. The summit is hosted by the Advanced Technology Academic Research Center (ATARC).

I participated as an employee of the MITRE Corporation -- we help ATARC organize a series of collaboration sessions designed to identify and recommend solutions to big challenges in the federal government. I led a collaboration session between government, industry, and academic representatives on Big Data Analytics and Applications. The goal of the session was to facilitate discussions among the participants regarding the application of big data in the government and preparation for its continued growth in importance. The targeted topics included access to data in disconnected environments, interoperability between data providers, parallel processing (e.g., MapReduce), and moving from data to decision in an optimal fashion.

Due to the nature of the discussions (protected by the Chatham House Rule), I cannot elaborate on the specific attendees or specific discussions. In a few weeks, MITRE will produce a publicly released summary and set of recommendations for the federal government based on the discussions. When it is released, I will update this blog with a link to the report. It will be in a similar format and contain information at a similar level as the 2013 Federal Cloud Computing Summit deliverable.

On July 8th and 9th, I will be attending the Federal Cloud Computing Summit where I will run the MITRE-ATARC Collaboration Sessions on July 8th and moderate a panel of collaboration session participants on July 9th. Stay tuned for another blog posting on the Cloud Summit!

--Justin F. Brunelle

Wednesday, June 18, 2014

2014-06-18: Google and JavaScript


In this blog post, we detail three short tests in which we challenge the Google crawler's ability to index JavaScript-dependent representations. After an introduction to the problem space, we describe the three tests introduced below.
  1. String and DOM modification: we modify a string and insert it into the DOM. If the crawler cannot execute JavaScript on the client, the modified string will not be indexed by the Google crawler.
  2. Anchor Tag Translation: we decode an encoded URI and add it to the DOM using JavaScript. The Google crawler should index the decoded URI after discovering it from the JavaScript-dependent representation.
  3. Redirection via JavaScript: we use JavaScript to build a URI and redirect the browser to the newly built URI. The Google crawler should be able to index the resource to which JavaScript redirects.

Introduction

JavaScript continues to create challenges for web crawlers run by web archives and search engines. To summarize the problem, our web browsers are equipped with the ability to execute JavaScript on the client, while crawlers commonly do not have the same ability. As such, content created -- or requested, as in the case of Ajax -- by JavaScript is often missed by web crawlers. We discuss this problem and its impacts in more depth in our TPDL '13 paper.

Archival institutions and search engines are attempting to mitigate the impact JavaScript has on their archival and indexing effectiveness. For example, Archive-It has integrated Umbra into its archival process in an effort to capture representations dependent upon JavaScript. Google has announced that its crawler will index content created by JavaScript, as well. There is evidence that Google's crawler has been able to index JavaScript-dependent representations in the past, but the company has announced a commitment to improve and more widely use the capability.

We wanted to investigate how well the Google solution could index JavaScript-dependent representations. We created a set of three extremely simple tests to gain some insight into how Google's crawler operated.

Test 1: String and DOM Modification

To challenge the Google crawler in our first test, we constructed a test page with an MD5 hash string "1dca5a41ced5d3176fd495fc42179722" embedded in the Document Object Model (DOM). The page includes a JavaScript function that changes the hash string by performing a ROT13 translation on page load. The function overwrites the initial string with the ROT13-translated string "1qpn5n41prq5q3176sq495sp42179722".
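
For reference, a JavaScript function like the following is all that is needed to perform the translation (a minimal sketch; the element id and function names are illustrative, not the actual source of the test page):

    // Minimal sketch of the Test 1 behavior; element id and function names
    // are illustrative, not the actual test page source.
    function rot13(s) {
      return s.replace(/[a-zA-Z]/g, function (c) {
        var base = c <= 'Z' ? 65 : 97;
        return String.fromCharCode((c.charCodeAt(0) - base + 13) % 26 + base);
      });
    }

    window.onload = function () {
      // e.g., <div id="hash">1dca5a41ced5d3176fd495fc42179722</div>
      var el = document.getElementById('hash');
      el.textContent = rot13(el.textContent);   // overwrites the original string in the DOM
    };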

Before the page was published, both hash strings returned 0 results when searched in Google. Now, Google shows the result of the JavaScript ROT13 translation that was embedded in the DOM (1qpn5n41prq5q3176sq495sp42179722) but not the original string (1dca5a41ced5d3176fd495fc42179722). The Google crawler successfully passed this test and accurately crawled and indexed this JavaScript-dependent representation.

Test 2: Anchor Tag Translation

Continuing our investigation with a second test, we wanted to determine whether Google could discover a URI to add to its frontier if the anchor tag is generated by JavaScript and only inserted into the DOM after page load. We constructed a page that uses JavaScript to ROT13-decode the string "uggc://jjj.whfgvasoeharyyr.pbz/erqverpgGnetrg.ugzy" into a decoded URI. The JavaScript then inserts an anchor tag linking to the decoded URI. This test evaluates whether the Google crawler will extract the URI from the anchor tag after JavaScript performs the insertion or whether the crawler only indexes the original DOM before it is modified by JavaScript.
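
The page's behavior can be sketched roughly as follows (reusing the rot13() helper from the Test 1 sketch; the container id and surrounding markup are illustrative, not the actual test page source):

    // Minimal sketch of the Test 2 behavior; assumes the rot13() helper
    // from the Test 1 sketch and an illustrative container element.
    window.onload = function () {
      // ROT13-decodes to the URI of the hidden resource
      var uri = rot13('uggc://jjj.whfgvasoeharyyr.pbz/erqverpgGnetrg.ugzy');
      var a = document.createElement('a');
      a.href = uri;
      a.textContent = 'HIDDEN!';
      var container = document.getElementById('deep-web-link');
      container.appendChild(document.createTextNode('The deep web link is: '));
      container.appendChild(a);
    };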

The representation of the resource identified by the decoded URI contains the MD5 hash string "75ab17894f6805a8ad15920e0c7e628b". At the time of this blog posting's publication, this string returned 0 results in Google. To protect our experiment from contamination (i.e., linking to the resource from a source other than the JavaScript-reliant page), we will not post the URI of the hidden resource in this blog.


The text surrounding the anchor tag is "The deep web link is: " followed by the anchor tag, with the target being the decoded URI and the anchor text "HIDDEN!". If we search for the text surrounding the anchor tag, we receive a single result, which includes the link to the decoded URI. However, at the time of this blog posting's publication, the Google crawler has not discovered the hidden resource identified by the decoded URI. It appears Google's crawler is not extracting URIs for its frontier from JavaScript-reliant resources.

Test 3: Redirection via JavaScript

In a third test, we created two pages, one of which was linked from my homepage and is called "Google Test Page 1". This page has an MD5 hash string embedded in the DOM: "d41d8cd98f00b204e9800998ecf8427e".

A JavaScript function changes the hash code to "4e4eb73eaad5476aea48b1a849e49fb3" when the page's onload event fires. In short, when the page finishes loading in the browser, a JavaScript function will change the original hash string to a new hash string. After the DOM is changed, JavaScript constructs a URI string to redirect to another page.



In the impossible case (1==0 always evaluates to "false"), the redirect URI is testerpage1.php. This page does not exist. We put in this false URI to try to trick the Google crawler into indexing a page that never existed. (Google was not fooled.)

JavaScript constructs the URI of testerpage2.php, which has the hash string "13bbd0f0352dc9f61f8a3d8b015aef67" embedded in the DOM. This page -- prior to this blog post -- was not linked from anywhere, and Google cannot discover it without executing the JavaScript redirect embedded in Google Test Page 1. When we searched for the hash string, Google returned 0 results.
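
The redirect logic can be sketched roughly as follows (the element id and variable names are illustrative; the hash strings and page names are those described above):

    // Minimal sketch of the Test 3 behavior; element id and variable names
    // are illustrative, not the actual test page source.
    window.onload = function () {
      // replaces d41d8cd98f00b204e9800998ecf8427e in the DOM
      document.getElementById('hash').textContent = '4e4eb73eaad5476aea48b1a849e49fb3';

      var target;
      if (1 == 0) {                      // impossible branch: never taken
        target = 'testerpage1.php';      // a page that does not exist
      } else {
        target = 'testerpage2.php';      // the real, otherwise unlinked page
      }
      window.location.href = target;     // JavaScript-driven redirect
    };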

testerpage2.php also writes to a text file whenever the page is loaded. We waited for an entry to appear in the text file, indicating that the page had been loaded by a crawler. After that point, when we search Google for the hash string in testerpage2.php, we receive a result that shows the content and hash of testerpage2.php but shows the URI of the original Google Test Page 1.


While some may argue that the search result in our third test should show the URI of testerpage2.php, this is a choice by Google to provide the original URI rather than the URI of the redirect target.

Conclusion

This very simple test set shows that Google is effectively executing JavaScript and indexing the resulting representation. However, the crawler is not expanding its frontier to include URIs that are generated by JavaScript. In all, Google shows that crawling resources reliant on JavaScript is possible at Web scale, but more work is left to be done to properly crawl all JavaScript-reliant representations.

--Justin F. Brunelle

2014-06-18: Navy Hearing Conservation Program Visualizations

(Note: This is the first in a series of posts about visualizations created either by students in our research group or in our classes.)

The US Navy runs a Hearing Conservation Program (HCP), which aims to protect hearing and prevent hearing loss in service members.  Persons who are exposed to noise levels in the 85-100 dB range are enrolled in the program and have their hearing regularly tested.  During an audiogram, a beep is sounded at different frequencies with increasing volume.  The person being tested raises their hand when they hear the beep, and the frequency and volume (in dBA) are recorded.  A higher volume value means worse hearing (i.e., the beep had to be louder before it was audible). Not only are people in the HCP regularly tested, but they are also provided hearing protection to help prevent hearing loss.  The audiogram data includes information about the job the person currently holds as well as whether they are using hearing protection.

Researchers are interested in studying Noise Induced Hearing Loss (NIHL).  The theory behind NIHL is that if you're exposed to a massive noise event, you lose a lot of hearing instantly, but if you're exposed to long-term noise, there could be up to a 5-year lag before you notice hearing loss.  One goal of the HCP is to track hearing over time to see if this effect can be identified.  Hearing in the 4000-6000 Hz range is the most affected by NIHL.

We obtained a dataset of audiograms from the HCP with over 700,000 records covering 20 years of the program.  From this, PhD student Lulwah Alkwai produced three interactive visualizations.

In the first visualization, we show frequency (Hz) vs. hearing level (dB) averaged by job code.  The average over all persons with that job code is shown as the solid line (black is left ear, blue is right ear).  Normal impairment in each ear is shown as a dotted line.  The interactive visualization (currently available at https://ws-dl.cs.odu.edu/vis/Navy-HCP/hz-db.html) allows the user to explore the hearing levels of various job codes.  The visualization also includes representative job codes for the different hearing levels as a guide for the user.

The second visualization (currently available at https://ws-dl.cs.odu.edu/vis/Navy-HCP/age-year.html) shows the age of the person tested vs. the year in which they were tested.  The colored dots indicating hearing level use the same color scheme as the first visualization.  The visualization allows the user to filter the displayed data between all persons, those who used hearing protection, and those who used no hearing protection.  Note that this visualization shows only a sample of the full dataset.


The final, animated visualization (currently available at https://ws-dl.cs.odu.edu/vis/Navy-HCP/age-db.html) shows how age vs. total hearing (left ear hearing level + right ear hearing level) has changed through time.  Once the page loads, the animation begins, with the year shown in the bottom right corner.  The visualization is also interactive.  If the user hovers over the year, the automatic animation stops and the user takes control of the year displayed by moving the mouse left or right. As with the previous visualization, this shows only a sample of the full dataset.
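
The hover-to-scrub interaction can be sketched roughly as follows (a hypothetical illustration of the Gapminder-style pattern credited at the end of this post; the element names, year range, and updateChart() function are placeholders, not the visualization's actual code):

    // Hypothetical sketch of the hover-to-scrub interaction, written against
    // D3.js v4; names, the year range, and updateChart() are placeholders.
    var overlayWidth = 200;                        // width of a transparent rect over the year label
    var yearScale = d3.scaleLinear()
        .domain([1993, 2013])                      // illustrative span of the animation
        .range([0, overlayWidth])
        .clamp(true);

    function updateChart(year) { /* redraw the age vs. total hearing dots for this year */ }

    var timer = d3.interval(function (elapsed) {   // automatic animation, one year per second
      updateChart(1993 + Math.floor(elapsed / 1000) % 21);
    }, 1000);

    d3.select('.year-overlay')                     // the overlay element covering the year label
        .on('mouseover', function () { timer.stop(); })   // hovering stops the animation
        .on('mousemove', function () {
          // the mouse x position controls which year is displayed
          updateChart(Math.round(yearScale.invert(d3.mouse(this)[0])));
        });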


We created a short demo video of all three visualizations in action.



All three of these visualizations were made using the D3.js library, using examples from Mike Bostock's gallery.  The animated chart was based on Mike Bostock's D3.js recreation of the Gapminder Wealth and Health of Nations chart.

-Michele

Monday, June 2, 2014

2014-06-02: WikiConference USA 2014 Trip Report


Amid the smell of coffee and bagels, the crowd quieted down to listen to the opening by Jennifer Baek, who, in addition to getting us energized, also paused to recognize Adrianne Wadewitz and Cynthia Sheley-Nelson, two Wikipedians who, after contributing greatly to the Wikimedia movement, had recently passed away.  The mood became more uplifting as Sumana Harihareswara began her keynote, discussing the wonders of contributing knowledge and her experience with the Ada Initiative, Geek Feminism, and Hacker School.  She detailed how the Wikimedia culture can learn from the experiences at Hacker School, discussing different methods of learning and how these methods allow all of us to nurture learning in a group.  She went on to discuss the difference between liberty and hospitality and the importance of both to community, detailing how the group must ensure that individuals do not feel marginalized due to their gender or ethnicity, but also how good hospitality engenders contribution as well as learning.  Thus began WikiConference USA 2014 at 9:30 am on May 30, 2014.


At 10:30 am, I attended a session on the Global Economic Map by Alex Peek.  The Global Economic Map is a Wikidata project whose goal is to make economic data available to all in a consistent and easy to access format.  It pulls in sources of data from the World Bank, UN Statistics, U.S. Census, the Open Knowledge Foundation, and more.  It will use a bot to incorporate all of this data into a single location for processing.  They're looking for community engagement to assist in the work.

At 10:50 am, Katie Filbert detailed what Wikidata is and what the community can do with it. Wikidata consists mainly of bots collecting information from many sources and consolidating it into a usable format on top of MediaWiki.  The data is stored and accessible as XML and is also multi-lingual.  This data is also fed back into the other Wikimedia projects, like Wikipedia, for use in infoboxes.  They are incorporating Lua into the mix in order to allow the infoboxes to be more intelligent about what data they are displaying.  They will be developing a query interface for Wikidata so information can be more readily retrieved from their datastore.

At 11:24 am, Max Klein showed us how to answer big questions with Wikidata.  In addition to the possibilities expressed in previous talks, Wikidata aims to provide modeling of all Wikipedias, allowing further analysis and comparison between each Wikipedia.  He showed visual representations of the gender bias of each Wikipedia, how much each language writes about other languages, and a map of the connection of data within Wikipedia by geolocation.  He showed us the exciting Wikidata Toolkit that allows for translation from XML to RDF and other formats as well as simplifying queries.  The toolkit uses Java to generate JSON, which can be processed by Python for analysis.

At noon, Frances Hocutt gave a presentation on how to use the MediaWiki API to get data out of Wikipedia.  She expounded upon the ability to extract specific structured data, such as links, from given Wikipedia pages.  She mentioned that the data can be directly accessed in XML or JSON, but there is also the mwclient Python library which may be easier to use.  Afterwards, she led a workshop on using the API, guiding us through the API sandbox and the API documentation.
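
As a small illustration of the kind of call discussed in the talk (a sketch, not code from the workshop; the page title is arbitrary), the API can return the links on a Wikipedia page as JSON:

    // Sketch: fetch the outgoing wiki links on a page from the MediaWiki API.
    var endpoint = 'https://en.wikipedia.org/w/api.php' +
        '?action=query&prop=links&pllimit=max&format=json&formatversion=2' +
        '&titles=' + encodeURIComponent('Information visualization') +
        '&origin=*';                               // required for cross-origin browser requests

    fetch(endpoint)
      .then(function (response) { return response.json(); })
      .then(function (data) {
        data.query.pages[0].links.forEach(function (link) {
          console.log(link.title);                 // each link target on the page
        });
      });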

Our lunch was provided by the conference, and our lunch keynote was given by DC Vito.  He is a founding member of the Learning About Multimedia Project (LAMP), an organization that educates people about the effects of media in their lives.  He wanted to highlight their LAMPlatoon effort and specifically their Media Breaker tool, which allows users to take existing commercials and other media and edit them in a legal way to interject critical thinking and commentary.  They are working on a deal with the Wikimedia Foundation to crowdsource the legal review of the uploaded media so that his organization can avoid lawsuits.

At 3:15 pm, Mathias Klang gave a presentation concerning the attribution of images on Wikipedia and how Wikipedia deals with the copyright of images.  He highlighted that although images are important, the interesting part is often the caption and the metadata.  He mentioned how sharing came first on the web, and it is only recently that the law has begun to catch up with such easy-to-use licenses as Creative Commons.  His organization, Commons Machinery, is working to return that metadata, such as the license of the image, back to the image itself.  He revealed Elogio, a browser plugin that allows one to gather and store resources from around the web while storing their legal metadata for later use.  Then he mentioned that Wikimedia does not store the attribution metadata in a way that Elogio and other tools can find it.  One of the audience members indicated that Wikimedia is actively interested in this.

At 4:15 pm, Timothy A. Thompson and Mairelys Lemus-Rojas gave a presentation on the Remixing Archival Metadata Project (RAMP) Editor, a browser-based tool that uses traditional library finding aids to create individual and organization authority pages for creators of archival collections.  Under the hood, it takes in a traditional finding aid as an Encoded Archival Context-Corporate Bodies, Persons, and Families (EAC-CPF) record.  It then processes this file, pulling in relevant data from other external sources, such as Encoded Archival Description (EAD) files and OCLC.  Finally, it transforms the EAC-CPF records into wiki markup, allowing for direct publication to English Wikipedia via the Wikipedia API.  The goal is to improve Wikipedia's existing authority records for individuals and organizations with data from other sources.

At 5:45 pm, Jennifer Baek presented her closing remarks, mentioning the conference reception on Saturday at 6:00 pm.  Thus we closed out the first day and socialized with Sumana and several other Wikimedians for the next hour.

At 8:30 am on Saturday, our next morning keynote was given by Phoebe Ayers, who wanted to discuss the state of the Wikimedia movement and several community projects.  She detailed the growth of Wikipedia, even in the last few years, while also expressing concern over the dip in editing on English Wikipedia in recent years, echoing the concerns of a recent paper picked up by the popular press.  She showed that there are many classes being taught on Wikipedia right now.  She highlighted many of the current projects being worked on by the Wikimedia community, briefly focusing on the Wikidata Game as a way to encourage data contribution via gamification.  She mentioned that the Wikimedia Foundation has been focusing on a new visual editor and other initiatives to support its editors, including grantmaking.  She closed with the big questions that face the Wikimedia community, such as promoting growth, providing access for all, and fighting government and corporate censorship.  And our second day of fun had started.

At 10:15 am, I began my first talk.

In my talk, Reconstructing the past with MediaWiki, I detailed our attempts and successes in bringing temporal coherence to MediaWiki using the Memento MediaWiki Extension.  I partitioned the problem into the individual resources needed to faithfully reproduce a past revision of a web page.  I covered old HTML, images, CSS, and JavaScript, and how MediaWiki should be able to achieve temporal coherence because all of these resources are present in MediaWiki.
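
For readers unfamiliar with Memento, the extension implements the datetime negotiation defined in RFC 7089: a client asks a TimeGate for a page as it existed at a given datetime and is redirected to the matching old revision. A rough sketch of that exchange is below (the wiki URL is illustrative, and the exact TimeGate path depends on the wiki's configuration):

    // Rough sketch of Memento datetime negotiation against a wiki running
    // the extension; the URL below is illustrative, not a real endpoint.
    fetch('https://wiki.example.org/index.php/Special:TimeGate/Main_Page', {
      headers: { 'Accept-Datetime': 'Thu, 01 Dec 2011 00:00:00 GMT' },
      redirect: 'follow'
    }).then(function (response) {
      console.log(response.url);                               // URI of the selected memento (old revision)
      console.log(response.headers.get('Memento-Datetime'));   // datetime of that revision
    });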



My second talk, Using the Memento MediaWiki Extension to Avoid Spoilers, detailed a specific use case of the Memento MediaWiki Extension.  I showed how we could avoid spoilers by using Memento.  This generated a lot of interest from the crowd.  Some wanted to know when it would be implemented.  Others indicated that there were past efforts to implement spoiler notices in Wikipedia but they were never embraced by the Wikipedia development team.


Using the Memento MediaWiki Extension to Avoid Spoilers

At noon, Isarra Yos gave a presentation on vector graphics, detailing the importance of their use on Wikipedia.  She mentioned how Wikimedia uses librsvg for rendering SVG images, but Inkscape gets better results.  She has been unable to convince Wikimedia to change because of performance and security considerations.  She also detailed the issues in rendering complex images with vector graphics, and why bitmaps are used instead.  Using Inkscape, she showed how to convert bitmaps into vector images.

At 12:30 pm, Jon Liechty gave a presentation on languages in Wikimedia Commons.  He indicated that half of Wikimedia uses the English language template, but the rest of the languages fall off logarithmically.  He is concerned about the "exponential hole" separating the languages on each side of the curve.  He has reached out to different language communities to introduce Commons to them in order to get more participation from those groups.  He also indicated that some teachers are using Wikimedia Commons in their foreign language courses.

After lunch, Christie Koehler, a community builder from Mozilla, gave a presentation on encouraging community building in Wikipedia.  She indicated that community builders are not merely specialized people; all of us, by virtue of working together, are community builders.  She has been instrumental in growing Open Source Bridge, an event that brings together discussions on all kinds of open source projects, for both technical and maker communities.  According to her, a community provides access to experienced people you can learn from, and also provides experienced people the ability to deepen their skills by letting them share their knowledge in new ways.  She detailed how it is important for a community to be accessible socially and logistically; otherwise, the community will not be as successful.  She highlighted how a community must also preserve and share knowledge for the present and the future.  She mentioned that some resources in a community may be essential but may also be invisible until they are no longer available, so it is important to value those who maintain these resources.  She also mentioned how important it is for communities to value all contributions, not just those from the people who contribute most often.

At 3:15 pm, Jason Q. Ng gave a highly attended talk on a comparison of Chinese Wikipedia with Hudong Baike and Baidu Baike.  He works on Blocked on Weibo, a project showing what content Weibo blocks that is otherwise available on the web.  He mentioned that censorship can originate from a government, industry, or even users.  Sensitive topics flourish on Chinese Wikipedia, which creates problems for those entities that want to censor information.  Hudong Baike and Baidu Baike are far more dominant than Wikipedia in China, even though they censor their content.  He has analyzed articles and keywords across these three encyclopedias, using HTTP status codes, character count, number of likes, number of edits, number of deleted edits, and whether an article is locked from editing, to determine if a topic is censored in some way.

At 5:45 pm, James Hare gave closing remarks detailing the organizations that made the conference possible.  Richard Knipel, of Wikimedia NYC, told us about his organization and how they are trying to grow their Wikimedia chapter within the New York metropolitan area.  James Hare returned to the podium and told us about the reception upstairs.

At 6:00 pm, we all got together on the fifth floor, got to know each other, and discussed the events of the day at the conference reception.

Sunday was the unstructured unconference.  There were lightning talks and shorter discussions on digitizing books (George Chris), video on Wikimedia and the Internet Archive (Andrew Lih), new projects from Wikidata (Katie Filbert), contribution strategies for Wikipedia (Max Klein), low Earth micro-satellites (Gerald Shields), the importance of free access to laws via Hebrew WikiSource (Asaf Bartov), the MozillaWiki (Joelle Fleurantin), ACAWiki, religion on Wikipedia, Wikimedia program evaluation and design (Edward Galvez), Wikimedia meetups in various places, Wikipedia in education (Flora Calvez), and Issues with Wikimedia Commons (Jarek Tuszynski).

I spent time chatting with Sumana Harihareswara, Frances Hocutt, Katie Filbert, Brian Wolff, and others about the issues facing Wikimedia.  I was impressed by the combination of legal and social challenges to Wikimedia.  It helped me understand the complexity of their mission.

At the end of Sunday, information was exchanged, goodbyes were said, lights were turned off, and we all spread back to the corners of the Earth from which we came, but within each of us was a renewed spirit to improve the community of knowledge and contribute.


-Shawn M. Jones