Saturday, December 20, 2014

2014-12-20: Using Search Engine Queries For Reliable Links

Earlier this week Herbert brought to my attention Jon Udell's blog post about combating link rot by crafting search engine queries to "refind" content that periodically changes URIs as the hosting content management system (CMS) changes.

Jon has a series of columns for InfoWorld, and whenever InfoWorld changes their CMS the old links break and Jon has to manually refind all the new links and update his page.  For example, the old URI:

is currently:

The same content had at least one other URI as well, from at least 2009--2012:

The first reaction is to say InfoWorld should use "Cool URIs", mod_rewrite, or even handles.  In fairness, InfoWorld is still redirecting the second URI to the current URI:

And it looks like they kept redirecting the original URI to the current URI until sometime in 2014 and then quit; currently the original URI returns a 404:

Jon's approach is to give up on tracking different URIs for his hundreds of articles and instead use a combination of metadata (title & author) and the "site:" operator, submitted to a search engine, to locate the current URI (side note: this approach is really similar to OpenURL).  For example, the link for the article above would become:

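As a sketch, such a metadata-based query link can be generated mechanically. The snippet below is illustrative only: the query-URL format is Google's, and the column title is a made-up example, not one of Jon's actual articles.

```python
from urllib.parse import urlencode

def refind_query(title, author, site):
    """Build a search-engine query URL that 'refinds' an article by its
    stable metadata (title + author) scoped to a site, instead of
    linking to a CMS-specific URI that may rot."""
    q = '"{0}" "{1}" site:{2}'.format(title, author, site)
    return "" + urlencode({"q": q})

url = refind_query("An example column title", "Jon Udell", "")
```

The resulting URL survives a CMS migration because it depends only on the article's metadata and the site's domain, not on its path structure.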
Herbert had a number of comments, which I'll summarize as:
  • This problem is very much related to Martin's PhD research, in which web archives are used to generate lexical signatures to help refind the new URIs on the live web (see "Moved but not gone: an evaluation of real-time methods for discovering replacement web pages").  
  • Throwing away the original URI is not desirable because that is a useful key for finding the page in web archives.  The above examples used the Internet Archive's Wayback Machine, but Memento TimeGates and TimeMaps could also be used (see Memento 101 for more information).   
  • One solution to linking to a SE for discovery while retaining the original URI is to use the data-* attributes from HTML (see the "Missing Link" document for more information).  
For the latter point, including the original URI (and its publishing date), the SE URI, and the archived URI would result in HTML that looks like:
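For illustration, an enhanced link using the data-* attribute names from the "Missing Link" proposal might look like the following (the URIs, date, and link text here are placeholders, not an actual article):

```html
<a href=""
   data-originalurl=""
   data-versiondate="2010-03-09"
   data-versionurl="">
  An example column title
</a>
```

Here the href points to the SE query for discovery, data-originalurl preserves the original URI (with data-versiondate as its publishing date), and data-versionurl points to the archived copy.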

I posted a comment saying that a search engine's robots.txt would prevent archives like the Internet Archive from archiving the SERPs, and thus the archives would not discover (and archive) the new URIs themselves.  In an email conversation, Martin made the point that rewriting the link to a search engine assumes that the search engine's URI structure isn't going to change (anyone want to bet how many links to or queries are still working?).  It is also probably worth pointing out that while metadata like the title is not likely to change for Jon's articles, that's not always true for general web pages, whose titles often change (see "Is This A Good Title?").

In summary, Jon's use of SERPs as interstitial pages is an interesting approach to the common problem of link rot, at least for those who wish to maintain publication (or similar) lists.  While the SE URI is a good tactical solution, disposing of the original URI is a bad strategy for several reasons, including working against web archives instead of with them, and betting on the long-term stability of SEs.  The solution we need is a method to include more than one URI per HTML link, such as proposed in the "Missing Link" document.


Thursday, November 20, 2014

2014-11-20: Archive-It Partners Meeting 2014

I attended the 2014 Archive-It Partners Meeting in Montgomery, AL on November 18.  The meeting attendees are representatives from Archive-It partners with interests ranging from archiving webpages about art and music to archiving government webpages.  (Presentation slides are now available on the Archive-It wiki.)  This is ODU's third consecutive Partners Meeting (see trip reports from 2012 and 2013).

The morning program was focused on presentations from partners who are building collections.  Here's a brief overview of each of those.

Penny Baker and Susan Roeper from the Clark Art Institute talked about their experience in archiving the 2013 Venice Biennale international art exhibition (Archive-It collection) and plans for the upcoming exhibition.  Their collection includes exhibition catalogs, monographs, and press releases about the event.  The material also includes a number of videos (mainly from vimeo), which Archive-It can now capture.

Beth Downs from the Montana State Library (Archive-It collection) spoke about working with partners around the state to fulfill the state mandate to make all government documents publicly available and working to make the materials available to state employees, librarians, teachers, and the general public.  One of the nice things they've added to their site footer is a Page History link that goes directly to the Archive-It Wayback calendar page for the current page.

Beth has also provided instructions for their state agencies on how to include the Page History link and how to embed a Search box into the archive on their pages.  This could be easily adapted to point to other state government archives or to the general Internet Archive Wayback Machine.

Dory Bower from the US Government Printing Office talked about the FDLP (Federal Depository Library Program) Web Archive (Archive-It collections).  They have several archiving strategies and use Archive-It mainly for the more content-rich websites, along with born-digital materials.

Heather Slania, Director of the Betty Boyd Dettre Library and Research Center at the National Museum of Women in the Arts (Archive-It collections), spoke about the challenges of capturing dynamic content from artists' websites.  This includes animation, video (mainly Vimeo), and other types of Internet art.  She has initially focused on capturing the websites of a selection of Internet artists.  These sites include over 6000 videos (from just 30 artists).  The next step is to archive the work of video artists and web comics.  As part of this project, she has been considering what types of materials are currently capture-able and categorizing the amount of loss in the archived sites.  This is related to our group's recent work on measuring memento damage (pdf, slides) and investigating the archivability of websites over time (pdf at arXiv, slides).

Nicholas Taylor from Stanford University Libraries gave an overview of the 2013 NDSA (National Digital Stewardship Alliance) Survey Report (pdf).  The latest survey was conducted in 2013 and the first was done in 2011.  NDSA's goal is to conduct this every 2 years.  Nicholas had lots of great stats in his slides, but here are a few that I noted:
  • 50% of respondents were university programs
  • 7% affiliated with IIPC, 33% with NDSA, 45% with the Web Archiving Roundtable, 71% with Archive-It
  • many are concerned with capturing social media, databases, and video
  • about 80% of respondents are using external services for archiving, like Archive-It
  • 80% haven't transferred data to their local repository
  • many are using tools that don't support WARC (but the percentage using WARC has increased since 2011)
Abbie Nordenhaug and Sara Grimm from the Wisconsin Historical Society (Archive-It collections) presented next.  They're just getting started archiving in a systematic manner.  They have a range of state agency partners, with websites running from highly dynamic to fairly static.  So far, they've set up monthly, quarterly, semi-annual, and annual crawls for those sites.

After these presentations, it was time for lunch.  Since we were in Alabama, I found my way to Dreamland BBQ.

After lunch, the presentations focused on collaborations, an update on 2014-2015 Archive-It plans, BOF breakout sessions, and strategies and services.

Anna Perricci from Columbia University Libraries spoke about their experiences with collaborative web archiving projects (Archive-It collections), including the Collaborative Architecture, Urbanism, and Sustainability Web Archive (CAUSEWAY) collection and the Contemporary Composers Web Archive (CCWA) collection.

Kent Underwood, Head of the Avery Fisher Center for Music and Media at the NYU Libraries, spoke about web archiving for music history (Archive-It collection).  Kent gave an eloquent argument for web archiving:  "Today’s websites will become tomorrow’s historical documents, and archival websites must certainly be an integral part of tomorrow’s libraries. But websites are fragile and impermanent, and they cannot endure as historical documents without active curatorial attention and intervention. We must act quickly to curate and preserve the memory of the Internet now, while we have the chance, so that researchers of tomorrow will have the opportunity to discover their own past. The decisions and actions that we take today in web archiving will be crucial in determining what our descendants know and understand about their musical history and culture."

Patricia Carlson from Mount Dora High School in Florida spoke about Archive-It's K-12 Archiving Program and its impact on her students (Mount Dora's Archive-It collection).  She talked about its role in introducing her students to primary sources and metadata.  She's also been able to use things that they already do (like tag people on Facebook) as examples of adding metadata. The students have even made a video chronicling their archiving experiences.

After the updates on ongoing collaborations, Lori Donovan and Maria LaCalle from Archive-It gave an overview of Archive-It's 2014 activities and upcoming plans for 2015.  Archive-It currently has 330 partners in 48 US states (only missing Arkansas and North Dakota!) and 16 countries.  Since version 4.9, released in 2014, Archive-It crawls certain pages with Heritrix and Umbra, which allows Heritrix to access sites the same way a browser would.  This allows for the capture of client-side scripting (such as JavaScript) and improves the capture of social media sites.  There were several new features in the 5.0 release, among them integration with Google Analytics.  There will be both a winter 2014 release and a spring/summer 2015 release.  Several new features are planned for the spring/summer release, including a visual/UI redesign of the web app, the ability to move and share seeds between collections, the ability to manually rank metadata facets on the public site, enhanced integration with, an updated Wayback look and feel, and linking related pages on the Wayback calendar (in case the URI changed over time).

After a short break, we divided up into BOF groups:
  • v2
  • Researcher Services
  • Cross-archive collaboration
  • QA (quality assurance)
  • Archiving video, audio, animations, social media
  • State Libraries
I attended the Research Services BOF, led by Jefferson Bailey and Vinay Goel from Internet Archive and Archive-It.  Jefferson and Vinay described their intentions with launching research services and asked for feedback and requests.  The idea is to use the Internet Archive's big data infrastructure to process data and provide smaller datasets of derived data to partners from their collections.  This would allow researchers to work on smaller datasets that would be manageable without necessarily needing big data tools.  This could also be used to provide a teaser as to what's in the collection, highlight link structure in the collection, etc.  One of the initial goals is to seed example use cases of these derivative datasets to show others what might be possible.  The ultimate goal is to help people get more value from the archive.  Jefferson and Vinay talked in more detail about what's upcoming in the last talk of the meeting (see below). Most of the other participants in the BOF were interested in ways that their users could make research use out of their archived collections.

After the BOF breakout, the final session featured talks on strategies and services.

First up was yours truly (Michele Weigle from the WS-DL research group at Old Dominion University).  My talk was a quick update on several of our ongoing projects, funded by NEH Office of Digital Humanities and the Columbia University Libraries Web Archiving Incentives program.

The tools I mentioned (WARCreate, WAIL, and Mink) are all available from our Software page.  If you try them out, please let us know what you think (contact info is on the last slide).

Mohamed Farag from Virginia Tech's CTRnet research group presented their work on an event-focused crawler (EFC).  Their previous work on automatic seed generation from URIs shared on Twitter produced lots of seeds, but not all of them were relevant.  The new work allows a curator to select high-quality seed URIs and then uses the EFC to retrieve webpages that are highly similar to the seeds.  The EFC can also read WARCs and perform text analysis (entities, topics, etc.) on them.  This enables event modeling: describing what happened, where, and when.

In the final presentation of the meeting, Jefferson Bailey and Vinay Goel from Internet Archive spoke about building Archive-It Research Services, planned to launch in January 2015. The goals are to expand access models to web archives, enable new insights into collections, and facilitate computational analysis.  The plan is to leverage the Internet Archive's infrastructure for large-scale processing.  This could result in increasing the use, visibility, and value of Archive-It collections.  Initially, three main types of datasets are planned:
  • WAT - consists of key metadata from a WARC file, includes text data (title, meta-keywords, description) and link data (including anchor text) for HTML
  • LGA - longitudinal graph analysis - what links to what over time
  • WANE - web archive named entities
All of these datasets are significantly smaller than the original WARC files.  Jefferson and Vinay have built several visualizations based on some of this data for demonstration and will be putting some of these online.  Their future work includes developing programmatic APIs, custom datasets, and custom processing.

All in all, it was a great meeting with lots of interesting presentations.  It was good to see some familiar faces and to actually meet others I'd only previously emailed with.  It was also nice to be in an audience where I didn't have to motivate the need for web archiving.

There were several people live-tweeting the meeting (#ait14).  I'll conclude with some of the tweets.


Friday, November 14, 2014

2014-11-14: Carbon Dating the Web, version 2.0

For over a year, Hany SalahEldeen's Carbon Date service has been out of service, mainly because of API changes in some of the underlying modules on which the service is built.  Consequently, I have taken up the responsibility of maintaining the service, beginning with the following changes, now available in Carbon Date v2.0.

Carbon Date v2.0

The Carbon Date service now makes requests to the different modules (archives, backlinks, etc.) concurrently, through threading.
The server framework has been changed from Bottle to CherryPy, which is still a minimalist Python WSGI server, but a more robust framework featuring a threaded server.
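A minimal sketch of that fan-out pattern follows; the module names and return values are placeholders, not Carbon Date's actual internals.

```python
from concurrent.futures import ThreadPoolExecutor

def query_module(name):
    # Stand-in for a real module query (archives, backlinks, etc.);
    # each real call is dominated by network I/O, which is why
    # running them in threads overlaps the waiting.
    return (name, "2014-11-14")  # pretend each module returns a date

modules = ["archives", "backlinks", "last-modified"]

# Fan the module queries out across a thread pool and gather results.
with ThreadPoolExecutor(max_workers=len(modules)) as pool:
    results = dict(pool.map(query_module, modules))

earliest = min(results.values())  # the estimated creation date
```

Since each module query spends most of its time waiting on the network, threads give a near-linear speedup over querying the modules one after another.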

How to use the Carbon Date service

There are three ways:
  • Through the website. Given that carbon dating is highly computationally intensive, the site should be used just for small tests as a courtesy to other users. If you need to Carbon Date a large number of URLs, you should install the application locally ( or
  • Through the local server ( The second way to use the Carbon Date service is through the local server application, which can be found at the following repository: Consult for instructions on how to install the application.
  • Through the local application ( The third way to use the Carbon Date service is through the local Python application, which can be found at the following repository: Consult for instructions on how to install the application.

The backlinks efficiency problem

Upon running the Carbon Date service, you will notice a significant difference between the runtime of the backlinks module and that of the other modules; this is because the most expensive operation in the carbon dating process involves carbon dating the backlinks. Consequently, in the local application (, the backlinks module is switched off by default and is reactivated with the --compute-backlinks option. For example, to Carbon Date with the backlinks module switched on:
Some effort was put towards optimizing the backlinks module; however, my conclusion is that the current implementation cannot be optimized further.

This is because of the following cascade of operations associated with the inlinks:

Given a single backlink (an incoming link, or inlink, to the URL), the application retrieves all of that backlink's mementos (which could range from tens to hundreds).  Thereafter, the application searches those mementos for the first occurrence of the link.

At first glance, one might suggest binary search, since the mementos are in chronological order. However, the predicate "this memento contains the URL" is not monotone over time: a link can be added, removed, and added again across mementos. So if we check the midpoint memento and the URL is absent, we cannot narrow the search space by half, because either the left half or the right half of the list could still contain the first occurrence. Therefore, a linear scan is the only safe method.

I am grateful to everyone who contributed to the debugging of Carbon Date, such as George Micros and the members of the Old Dominion University Introduction to Web Science class (Fall 2014). Further recommendations or comments about how this service can be improved are welcome and will be appreciated.


Sunday, November 9, 2014

2014-11-09: Four WS-DL Classes for Spring 2015

We're excited to announce that four Web Science & Digital Library (WS-DL) courses will be offered in Spring 2015:
Web Programming, Big Data, Information Visualization, & Digital Libraries -- we have you covered for Spring 2015.


Monday, October 27, 2014

2014-10-27: 404/File Not Found: Link Rot, Legal Citation and Projects to Preserve Precedent

Herbert and I attended the "404/File Not Found: Link Rot, Legal Citation and Projects to Preserve Precedent" workshop at the Georgetown Law Library on Friday, October 24, 2014.  Although the origins of this workshop are many, catalysts for it probably include the recent Liebler & Liebert study about link rot in Supreme Court opinions, the paper by Zittrain, Albert, and Lessig about and the problem of link rot in the scholarly and legal record, and the resulting popular media coverage (e.g., NPR and the NYT).

The speakers were naturally drawn from the legal community at large, but some notable exceptions included David Walls from the GPO, Jefferson Bailey from the Internet Archive, and Herbert Van de Sompel from LANL. The event was streamed and recorded, and videos + slides will be available from the Georgetown site soon so I will only hit the highlights below. 

After a welcome from Michelle Wu, the director of the Georgetown Law Library, the workshop started with an excellent keynote from the always entertaining Jonathan Zittrain, called "Cites and Sites: A Call To Arms".  The theme of the talk centered around "Core Purpose of .edu", which he broke down into:
  1. Cultivation of Scholarly Skills
  2. Access to the world's information
  3. Freely disseminating what we know
  4. Contributing actively and fiercely to the development of free information platforms

For each bullet he gave numerous anecdotes and examples; some innovative, and some humorous and/or sad.  For the last point he mentioned Memento,, and timed-release crypto.

Next up was a panel with David Walls (GPO), Karen Eltis (University of Ottawa), and Ed Walters (Fastcase).  David mentioned the Federal Depository Library Program Web Archive; Karen talked about the web giving us "permanence where we don't want it and transience where we require longevity" (I tweeted about our TPDL 2011 paper showing that, for music videos on YouTube, individual URIs die all the time but the content just shows up elsewhere); and Ed generated a buzz in the audience when he announced that, in rendering their pages, Fastcase ignores the links because of the problem of link rot.  (Panel notes from Aaron Kirschenfeld.)

The next panel had Raizel Liebler (Yale), author of one of the legal link rot studies mentioned above and an author of one of the useful handouts about links in the 2013-2014 Supreme Court documents.  Rod Wittenberg (Reed Tech) talked about the findings of the Chesapeake Digital Preservation Group and gave a data dump about link rot in Lexis-Nexis and the resulting commercial impact (wait for the slides).  (Panel notes from Aaron Kirschenfeld.)

After lunch, Roger Skalbeck (Georgetown) gave a webmaster's take on the problem, talking about best practices, URL rewriting, and other topics -- as well as coining the wonderful phrase "link rot deniers".  During this talk I also tweeted TimBL's classic 1998 resource "Cool URIs Don't Change".

Next was Jefferson Bailey (IA) and Herbert.  Jefferson talked about web archiving, the IA, and won approval from the audience for his references to Lionel Hutz and HTTP status dogs.  Herbert's talk was entitled "Creating Pockets of Persistence", and covered a variety of topics, obviously including Memento and Hiberlink.

The point is to examine web archiving activities with an eye to the goal of making access to the past web:
  1. Persistent
  2. Precise
  3. Seamless
Even though this was a gathering of legal scholars, the point was to focus on technologies and approaches that are useful across all interested communities.  He also gave examples from our "Thoughts on Referencing, Linking, Reference Rot" (aka "missing link") document, which was also included in the list of handouts.  The point of this effort is to enhance existing links (with archived versions, mirror versions, etc.), but not at the expense of removing the link to the original URI and the datetime of the intended link.  See our previous blog post on this paper and a similar one for Wikipedia.

The closing session was Leah Prescott (Georgetown; subbing for Carolyn Cox), Kim Dulin (Harvard), and E. Dana Neacşu (Columbia).  Leah talked some more about the Chesapeake Digital Preservation Group and how their model of placing materials in a repository doesn't completely map to the model of web archiving (note: this actually has fascinating implications for Memento that are beyond the scope of this post).  Kim gave an overview of Harvard's archive, and Dana gave an overview of a prior archiving project at Columbia.  Note that recently received a Mellon Foundation grant (via Columbia) to add Memento capability.

Thanks to Leah Prescott and everyone else that organized this event.  It was an engaging, relevant, and timely workshop.  Herbert and I met several possible collaborators that we will be following up with. 


-- Michael

Thursday, October 16, 2014

2014-10-16: Grace Hopper Celebration of Women in Computing (GHC) 2014

Photo credit to my friend Mona El Mahdy
I was thrilled and humbled to attend, for the second time, the Grace Hopper Celebration of Women in Computing (GHC) 2014, the world's largest gathering of women technologists. GHC is presented by the Anita Borg Institute for Women and Technology, which was founded by Dr. Anita Borg and Dr. Telle Whitney in 1994 to bring together the research and career interests of women in computing and to encourage the participation of women in computing. The twentieth-anniversary GHC was held in Phoenix, Arizona on October 8-10, 2014. This year, GHC almost doubled its attendance from last year, to 8,000 women with research and business interests, from about 67 countries and about 900 organizations, who came to get inspired, gain expertise, get connected, and have fun.

Aida Ghazizadeh from the Department of Computer Science at Old Dominion University was also awarded a travel scholarship to attend this year's GHC. I hope ODU will have more participation in the upcoming years.

The conference theme this year was "Everywhere. Everyone.": computer technologies are everywhere, and everyone should be included in driving innovation. There were multiple technical tracks featuring the latest technologies in many fields, such as cloud computing, data science, security, and Apple's Swift Playgrounds. Conference presenters represented many different sectors, such as academia, industry, and government. The non-profit organization Computing Research Association Committee on Women in Computing (CRA-W) also offered sessions targeted towards academia and business. I had a chance to attend the CRA-W Graduate Cohort Workshop in 2013, which was held in Boston, MA, and wrote a blog post about it.

The first day started off with Dr. Telle Whitney, the president and CEO of the Anita Borg Institute, welcoming the 8,000 conference attendees. She mentioned how GHC started in 1994 in Washington, DC, to bring together the research and career interests of women in computing and encourage the participation of women in computing. "Join in, connect with one another, be inspired by our speakers, be inspired by our award winners, develop your own skill and knowledge at the leadership workshops and at the technical sessions, let's all inspire and increase the ratio, and make technology for everyone everywhere," Whitney said. Then she introduced Alex Wolf, the president of the Association for Computing Machinery (ACM) and a professor of computing at Imperial College London, UK, for opening remarks.

Ruthe Farmer
Barbara Birungi and Durdana Habib
After the opening keynote, the ABIE Awards for Social Impact and Change Agent were presented by the awards' sponsors. The recognitions went to Ruthe Farmer, Barbara Birungi, and Durdana Habib, who gave nice and motivating talks. Some highlights from Farmer's talk were:
  • "The next time you witness a technical woman doing something great, please tell her, or better, tell others about her."
  • "The core of aspiration in computing is a powerful formula: recognition plus community."
  • "Technical women are not outliers."
  • "Heads up to all of you employers out there. There is a legion of young women heading your way that will be negotiating their salaries ... so budget accordingly!"

The keynote of the first day was by Shafi Goldwasser, RSA Professor of Electrical Engineering and Computer Science at MIT and 2012 recipient of the Turing Award, about the history and benefits of cryptography, along with her own work in the field. She discussed the challenges in encryption and cloud computing. Here are some highlights from Goldwasser's talk:
  • "With the magic of cryptography, we can get the benefits of technology without the risks."
  • "Cryptography is not just about finding the bad guys, it is really about correctness, and privacy of computation"
  • "I believe that a lot of the challenges for the future of computer science are to think about new representations of data. And these new representations of data will enable us to solve the challenges of the future."

Picture taken from My Ramblings blog
After the keynote, we attended the Scholarship Recipients Lunch, which was sponsored this year by Apple. We had engineers from Apple at each table to talk with us during the lunch.

The sessions started after the lunch break. I attended the CRA-W track "Finding Your Dream Job Presentations", which had presentations by Jaeyeon Jung from Microsoft Research and Lana Yarosh from the University of Minnesota. The session targeted late-stage graduate students, helping them decide how to apply for jobs, how to prepare for interviews, and how to negotiate a job offer. The presenters allotted a big time slot for questions after they finished their presentations. For more information about the session and its highlights, here is an informative blog post:
GHC14 - Finding your Dream Job Presentations

A global community of women leaders panel
The next session I attended was the "A Global Community of Women Leaders" panel in the career track, moderated by Jody Mahoney (Anita Borg Institute). The panelists were Sana Odeh (New York University), Judith Owigar (AkiraChix), Sheila Campbell (United States Peace Corps), and Andrea Villanes (North Carolina State University). They explained their roles in increasing the number of women in computing and the best ways to identify global technology leaders, based on their experience. At the end, they opened questions to the audience. "In the Middle East, women in technology represent a big ratio of the people in computing," said Sana Odeh.

There were many interesting sessions, such as "Building Your Professional Persona Presentations" and "Building Your Professional Network Presentations", on how to build your professional image and how to promote yourself and present your ideas in a concise and appealing way. These two blog posts cover the sessions in detail:
Facebook booth in the career fair #GHC14
In the meantime, the career fair launched on the first day, Wednesday, October 8, from 4:30 to 6:30 p.m., and continued through the second day and part of the third day. The career fair is a great forum for facilitating open conversations about career positions in industry and academia. Many famous companies (such as Google, Facebook, Microsoft, IBM, Yahoo, and Thomson Reuters), many universities (such as Stanford University, Carnegie Mellon University, The George Washington University, and Virginia Tech), and non-profit organizations (such as CRA-W) were represented. Each company had many representatives to discuss the different opportunities they have for women. The poster session was held in the evening.

Cotton candy in the career fair #GHC14
Like last year, Thomson Reuters attracted many women's attention with a great promotion: bringing in caricature artists. Other companies used nice ideas to promote themselves, such as cotton candy. There were many representatives promoting each organization and also interviewing. I enjoyed being among all of these women in the career fair, which inspired me to think about how to direct my future in a way that contributes to computing and encourages many other women into computing. My advice to anyone going to GHC next year: print many copies of your resume to be prepared for the career fair.

Day 2 started with a welcome to the audience from Barb Gee, the vice president of programs for the Anita Borg Institute. Gee presented the Girl Rising video clip "I'm not a number".

After the clip, Dr. Whitney introduced the special guest, the amazing Megan Smith, the new Chief Technology Officer of the United States and previously a vice president at Google[x]. Smith was a keynote speaker last year, when she gave a very inspiring talk entitled "Passion, Adventure and Heroic Engineering". Smith welcomed the audience and talked about her new position as the CTO of the United States. She expressed her happiness to serve the president of the USA and to serve her country. "Let's work together to bring everyone along and to bring technology that we know how to solve the problems with," Smith said at the end of her short, inspiring talk.

Dr. Whitney talked about the Building Recruiting And Inclusion for Diversity (BRAID) initiative between the Anita Borg Institute and Harvey Mudd College to increase diversity among computer science undergraduates. The BRAID initiative is funded by Facebook, Google, Intel, and Microsoft.

The 2014 GHC Technical Leadership ABIE Award went to Anne Condon, a professor and the head of the Department of Computer Science at the University of British Columbia. Condon donated her award to Grace Hopper India and to programs of the Computing Research Association (CRA).

Maria Klawe on the right and Satya Nadella on the left
Satya Nadella, the Chief Executive Officer (CEO) of Microsoft, in an interesting conversation with Maria Klawe, the president of Harvey Mudd College, was the second keynote of GHC 2014. Nadella is the first male speaker at GHC. He was asked many interesting questions, one of which was: "Microsoft has competitors like Apple, Google, Facebook, Amazon. What can Microsoft uniquely do in this new world?" Nadella answered that the two things he believes Microsoft contributes to the world are productivity and the platform. Klawe continued, "it is not a competition, it is a partnership".

In answer to a tough question, "Why does Microsoft hire fewer female engineers than male engineers?", Nadella said that all the companies now have their numbers out there. Microsoft's number is about 17%, almost the same as Google's and Facebook's, and a little below Apple's. He said, "the real issue in our company is how to make sure that we are getting women who are very capable into the company and well represented".

In response to a question about how to ask for a raise in salary, Nadella said: "It's not really about asking for a raise, but knowing and having faith that the system will give you the right raise." Nadella drew a torrent of criticism and irate reactions on Twitter.

Nadella later apologized for his "inarticulate" remarks in a tweet, followed by a statement issued to Microsoft employees, which was published on the company's website.

"I answered that question completely wrong," said Nadella. "I believe men and women should get equal pay for equal work. And when it comes to career advice on getting a raise when you think it’s deserved, Maria’s advice was the right advice. If you think you deserve a raise, you should just ask."

Day 3 started with some announcements from the ABI board, followed by the best poster announcements and the Awards Presentation. The last keynote was by Dr. Arati Prabhakar, the Director of the Defense Advanced Research Projects Agency (DARPA). Dr. Prabhakar talked about "how do we shape our times with the technology that we work on and are passionate about?" She shared several neat technologies in her keynote, starting with a video of a quadriplegic woman using her thoughts to control a robotic arm through a brain-computer interface. She talked about building technologies at DARPA and answered many questions at the end related to her work there. It was amazing to see a successful woman who creates technology that serves her country. The keynote ended with a nice video promoting GHC 2015.

Latest trends and technical challenges of big data panel
After the keynote, I attended the "Latest Trends and Technical Challenges of Big Data Analytics Panel", which was moderated by Amina Eladdadi (College of Saint Rose). The panelists were Dr. Bouchra Bouqata from GE, Dr. Kaoutar El Maghraoui from IBM, Dr. Francine Berman from RPI, and Dr. Deborah Agarwal from LBNL. This panel focused on discussing new data-driven Big Data Analytics technologies, infrastructure, and challenges, and the panelists introduced use cases from industry and academia. Big data faces many challenges: storage, security (specifically for cloud computing), and the scale of the data and bringing everything together to solve the problem.

ArabWIC lunch table
After the panel, I attended the career fair and then the Arab Women in Computing (ArabWIC) meeting during lunch. I had my first real experience with the ArabWIC organization at GHC 2013, and ArabWIC had more participation this year. I also attended the ArabWIC reception, sponsored by the Qatar Computing Research Institute (QCRI), on Wednesday night and got a chance to connect with many Arab women in computing from business and academia.

After that I attended the "Data Science in Social Media Analysis Presentations", which included three presentations about data analysis. The three useful presentations were:
"How to be a data scientist?" by Christina Zou
The presenters talked about real-life projects. The highlights of the presentations were:

  • "Improving the accuracy is what we strove for."
  • "It’s important to understand the problem."
  • "Divide the problem into pieces."
After the presentations, I talked to Christina about my research, and she gave me some ideas that I'll apply.

The picture taken from the GHC Facebook page

At the end of the day, the Friday celebration, sponsored by Google, Microsoft, and GoDaddy, began at 7:30. The dance floor was full of amazing ladies celebrating and dancing with glow sticks!

It was fantastic meeting a large number of like-minded peers and future employers. I'm pleased to have had this great opportunity, which allowed me to network and communicate with many great women in computing. GHC allowed me to discuss my research ideas with many senior women and get positive feedback on them. I came back with multiple ideas that will help me shape the next phase of my research and my career path.


    Tuesday, October 7, 2014

    2014-10-07: FluNet Visualization

    (Note: This wraps up the current series of posts about visualizations created either by students in our research group or in our classes. I'll post more after the Spring 2015 offering of the course.)

    I've been teaching the graduate Information Visualization course since Fall 2011.  In this series of posts, I'm highlighting a few of the projects from each course offering.  (Previous posts: Fall 2011, Fall 2012, 2013)

    The final visualization in this series is an interactive visualization of the World Health Organization's global influenza data, created by Ayush Khandelwal and Reid Rankin in the Fall 2013 InfoVis course. The visualization is currently available at and is best viewed in Chrome.

    The Global Influenza Surveillance and Response System (GISRS) has been in operation since 1995 and aggregates data weekly from laboratories and flu centers around the world. The FluNet website was constructed to provide access to this data, but does not include interactive visualizations.

    This project presents an interactive visualization of all of the GISRS data available through FluNet as of October 2013. The main visualization is an animated 3D choropleth globe where hue corresponds to virus lineage (influenza type A or type B) and color intensity corresponds to infection level. This shows the propagation of influenza across the globe over time.  The globe is also semi-transparent, so that the user can see how influenza infection rates change on the opposite hemisphere. The user may pick a specific time period or press the play button and watch the yearly cycle of infection play itself out on the globe's surface.
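The hue-plus-intensity color encoding can be sketched in a few lines; the RGB values below are hypothetical illustrations, not the colors the visualization actually uses (which are computed with D3.js):

```python
def infection_color(lineage, level, max_level):
    """Map a virus lineage and infection level to an RGB triple.

    Hue encodes lineage (red-ish for influenza type A, blue-ish for
    type B); intensity scales with the infection level, blending from
    white (no infection) toward the full lineage hue.
    """
    t = min(level / max_level, 1.0)
    base = (200, 30, 30) if lineage == "A" else (30, 30, 200)
    return tuple(round(255 + (c - 255) * t) for c in base)
```

Blending toward white rather than black keeps low-infection countries visually quiet on a light basemap while fully saturating the worst-hit countries.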

    The visualization also includes the option to show a 2D version of the globe, using the Natural Earth projection.

    There is a stacked area slider located under the globe for navigating through time (an example of a "scented widget").  The stacked area chart provides a view of the progression of infection levels over time and is shown on a cubic-root scale to compensate for the peaks during the 2009 flu pandemic.
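The cubic-root scale is easy to sketch; with a hypothetical maximum weekly count, the transform keeps the 2009 pandemic spike from dwarfing ordinary seasons (in D3 this corresponds to a power scale with exponent 1/3):

```python
def cube_root_scale(domain_max, range_max):
    """Map [0, domain_max] onto [0, range_max] on a cube-root scale."""
    def scale(value):
        return (value / domain_max) ** (1.0 / 3.0) * range_max
    return scale

# Hypothetical counts: a pandemic peak 27x an ordinary season lands
# only 3x higher on the chart, so ordinary seasons stay visible.
scale = cube_root_scale(domain_max=27000, range_max=100)
```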

    If the user clicks on a country, a popout chart will be displayed, showing a single year of data for that country, centered on the current point in time.  The default view is a stacked area chart, but there are options to show either a streamgraph or an expanded 100% stacked area chart.  The popout chart animates with the choropleth.

    The video below shows a demo:

    Although the data was freely available from the GISRS website, there was still a significant amount of data cleaning involved.  Both OpenRefine and Mr. Data Converter were used to clean and format the data into JSON.  The D3.js, NVD3, and TopoJSON libraries were used to create the visualization.
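The cleaning step can be sketched with the standard library; the FluNet-style column names below are hypothetical, and the project itself used OpenRefine and Mr. Data Converter rather than a script like this:

```python
import csv
import io
import json

# Two hypothetical rows of FluNet-style surveillance data.
raw = """Country,Year,Week,INF_A,INF_B
Egypt,2009,23,140,12
Egypt,2009,24,,7
"""

records = []
for row in csv.DictReader(io.StringIO(raw)):
    records.append({
        "country": row["Country"],
        "year": int(row["Year"]),
        "week": int(row["Week"]),
        # Blank cells mean "not reported"; normalize them to zero.
        "influenza_a": int(row["INF_A"] or 0),
        "influenza_b": int(row["INF_B"] or 0),
    })

as_json = json.dumps(records)  # ready for D3 to consume
```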

    Our future work on this project involves turning this into an extensible framework that can be used to show other global datasets over time.


    Friday, October 3, 2014

    2014-10-03: Integrating the Live and Archived Web Viewing Experience with Mink

    The goal of the Memento project is to provide a tighter integration between the past and current web.    There are a number of clients now that provide this functionality, but they remain silent about the archived page until the user remembers to invoke them (e.g., by right-clicking on a link).

    We have created another approach based on persistently reminding the user just how well archived (or not) are the pages they visit.  The Chrome extension Mink (short for Minkowski Space) queries all the public web archives (via the Memento aggregator) in the background and will display the number of mementos (that is, the number of captures of the web page) available at the bottom right of the page.  Selecting the indicator allows quick access to the mementos through a dropdown.  Once in the archives, returning to the live web is as simple as clicking the "Back to Live Web" button.
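The memento-counting step boils down to parsing an application/link-format TimeMap, as returned by a Memento aggregator, and counting the links whose rel attribute includes the "memento" relation. A minimal sketch (the sample TimeMap below is illustrative, not a real aggregator response):

```python
import re

sample_timemap = (
    '<http://example.com/>; rel="original",\n'
    '<http://archive.example/tm/http://example.com/>; rel="self",\n'
    '<http://archive.example/20090101/http://example.com/>; '
    'rel="first memento"; datetime="Thu, 01 Jan 2009 00:00:00 GMT",\n'
    '<http://archive.example/20120615/http://example.com/>; '
    'rel="memento"; datetime="Fri, 15 Jun 2012 00:00:00 GMT",\n'
    '<http://archive.example/20140920/http://example.com/>; '
    'rel="last memento"; datetime="Sat, 20 Sep 2014 00:00:00 GMT"'
)

def count_mementos(timemap_text):
    """Count links whose rel attribute includes the 'memento' relation."""
    count = 0
    for link in timemap_text.split(",\n"):
        match = re.search(r'rel="([^"]*)"', link)
        # rel may carry several relations, e.g. "first memento".
        if match and "memento" in match.group(1).split():
            count += 1
    return count
```

A production parser would handle commas inside quoted attribute values more carefully; Mink itself does this in JavaScript inside the extension.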

    For the case where there are too many mementos to make navigating an extensive list usable (think captures), we have provided a "Miller Columns" interface that allows hierarchical navigation and is common in many operating systems (though most don't know it by name).

    For the opposite case, where there are no mementos for a page, Mink provides a one-click interface to submit the page to the Internet Archive for immediate preservation and provides just-as-quick access to the newly archived page.
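The one-click submission maps naturally onto the Internet Archive's Save Page Now endpoint; a minimal sketch of constructing that request URL (the page URI below is just an example):

```python
def save_page_now_url(uri):
    """Build the Internet Archive Save Page Now URL for a page URI;
    fetching the resulting URL triggers an immediate capture."""
    return "https://web.archive.org/save/" + uri

snapshot_request = save_page_now_url("http://example.com/")
```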

    Mink can be used concurrently with Memento for Chrome, which provides a different modality of letting the user specify desired Memento-Datetime as well as reading cues provided by the HTML pages themselves.  For those familiar with Memento terminology, Memento for Chrome operates on TimeGates and Mink operates on TimeMaps.  We also presented a poster about Mink at JCDL 2014 in London (proceedings, poster, video).

    Mink is for Chrome, free, publicly available (go ahead and try it now!), and open source (so you know there's no funny business going on).

    —Mat (@machawk1)

    Thursday, September 25, 2014

    2014-09-25: Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

    The Internet Archive (IA) and Open Library offer over 6 million fully accessible public domain eBooks. While casually browsing the scanned book collection, I searched for the term "dictionary" to see how many dictionaries they have. I found several dictionaries in various languages and randomly picked A Dictionary of the English Language (1828) - Samuel Johnson, John Walker, Robert S. Jameson from the search results. I opened the dictionary in fullscreen mode using IA's open-source online BookReader application. This book reader has common tools for browsing an image-based book, such as flipping pages, seeking a page, zooming, and changing the layout. Its toolbar also has some interesting features like reading aloud and full-text searching. I wondered how it could possibly perform text searching and read aloud a scanned raster-image-based book. I peeked into the page source code, which pointed me to some documentation pages, and realized it uses an Optical Character Recognition (OCR) engine called ABBYY FineReader to power these features.

    I was curious to find out how the term "dictionary" was defined in a dictionary of the early 19th century, so I gave the "search inside" feature of IA's BookReader a try and searched for the term "dictionary". It took about 40 seconds to search for the lookup term in a book with 850 pages and returned three results. Unfortunately, they pointed to the title and advertisement pages where the term appeared, but not the page where it was defined. After this failed OCR attempt, I manually flipped pages in the BookReader back and forth, the way word lookup is performed in printed dictionaries, until I reached the appropriate page. Then I located the term on the page, and the definition there was, "A book containing the words of any language in alphabetical order, with explanations of their meaning; a lexicon; a vocabulary; a word-book." I thought I would give the "search inside" feature another try. According to the definition above, a dictionary is a book, hence I chose "book" as the next lookup term. This time the BookReader took about 50 seconds and returned 174 possible places where the term was highlighted in the entire book. These matches included derived words and the definitions or examples of other words where the term "book" appeared. Although the OCR engine did work, the goal of finding the definition of the lookup term was still not achieved.

    After experimenting with an English dictionary, I was tempted to give another language a try. When it comes to a non-Latin language, there is no better choice for me than Urdu. Urdu is a Right-to-Left (RTL) complex-script language influenced by Arabic and Persian; it shares a lot of vocabulary and grammar rules with Hindi, is spoken by more than 100 million people globally (the majority in Pakistan and India), and happens to be my mother tongue as well. I picked an old dictionary entitled Farhang-e-Asifia (1908) - Sayed Ahmad Dehlavi (four volumes). I searched for several terms one after the other, but every time the response was "No matches were found.", although I verified that the terms existed in the book. It turns out that ABBYY FineReader claims OCR support for about 190 languages, but it does not cover more than 60% of the world's 100 most popular languages, and the recognition accuracy for the supported languages is not reliable.

    Dictionaries are a condensed collection of the words and definitions of a language, capturing the essence of the cultural vocabulary of the era in which they were prepared; hence they have great archival value and are of equal interest to linguists and archivists. Improving the accessibility of preserved scanned dictionaries will make them more useful not only for linguists and archivists, but for general users too. Unlike general literature, dictionaries have some special characteristics: they are sorted to make word lookup easy, and lookup in a dictionary is fielded searching as opposed to full-text searching. These special properties can be leveraged when developing an application for accessing scanned dictionaries.

    To solve the scanned-dictionary exploration and word lookup problem, we chose a crowdsourced manual approach that works well for every language regardless of how poorly it is supported by OCR engines. In our approach, the pages or words of each dictionary are indexed manually so that the appropriate pages can be loaded for a lookup word. Our indexing approach is progressive, so it increases the usefulness and ease of lookup as more crowdsourced energy is put into the system. It starts from the base case, "Ordered Pages", which is at least as good as IA's current BookReader. In the next stage, the dictionary can move into the "Sparse Index" state, in which the first lookup word of each page is indexed; this is sufficient to determine the page where any arbitrary lookup word can be found, if it exists in the dictionary. To further improve accessibility, an exhaustive "Full Index" can be prepared that indexes every single lookup word in the dictionary with its corresponding pages, as opposed to just the first lookup word of each page. This index is very helpful for dictionaries where the sorting of words is not linear. To determine the exact location of a lookup word on a page, a "Location Index" highlights the place on the page where the lookup word is located, drawing the user's attention there. Apart from indexing, we have introduced an annotation feature with which users can link various resources to words on dictionary pages. Users are encouraged to help improve the various indexes and annotations as they use the application. For a more detailed description of our approach, please read our technical report:
    Sawood Alam, Fateh ud din B Mehmood, Michael L. Nelson. Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages. Technical Report arXiv:1409.1284, 2014.
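The "Sparse Index" lookup amounts to a binary search over the first lookup word of each page; a minimal sketch, with a hypothetical five-page index:

```python
import bisect

# First lookup word of each page, in page order (hypothetical data).
first_words = ["aardvark", "bat", "cat", "dog", "eel"]

def page_for(word, index):
    """Return the 0-based number of the page whose range covers `word`."""
    # bisect_right finds the first page whose opening word sorts after
    # `word`; the word, if present in the dictionary, is one page back.
    pos = bisect.bisect_right(index, word)
    return max(pos - 1, 0)
```

This is why indexing only one word per page suffices: the sort order of the dictionary fills in the rest.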
    We have built an online application called "Dictionary Explorer" that utilizes the indexing described above and has an interface suited to dictionaries. The application serves as an explorer of various dictionaries in various languages, and at the same time it presents various context-aware controls for contributing feedback to the indexes and annotations. In the Dictionary Explorer, the user selects a lookup language, which loads a tree-like word index in the sidebar for the selected language and various tabs in the center region; each tab corresponds to one monolingual or multilingual dictionary that has indexes in the selected language. The user can then either directly type the lookup term in the search field or locate it in the sidebar by expanding the corresponding prefixes. Once the lookup is performed, all the tabs are loaded simultaneously with the appropriate pages corresponding to the lookup term in each dictionary. If the location index is available for the lookup word, a pin is placed where the word appears on the page, allowing interaction with the word and its annotations. A special tab accumulates all the related resources, such as user-contributed definitions, audio, video, images, examples, and resources from third-party online dictionaries and services.

    Following are some feature highlights to summarize the Dictionary Explorer application:
    • Support for various indexing stages.
    • Indexes in multiple languages and multiple monolingual and multilingual dictionaries in each language.
    • Bidirectional (right-to-left and left-to-right) language support.
    • Multiple input methods such as keyboard input, on-screen keyboard, and word prefix tree.
    • Simultaneous lookup in multiple dictionaries.
    • Pagination and zoom controls.
    • Interactive location marker pins.
    • Context aware user feedback and annotations.
    • Separate tab for related resources such as user contributions, related media, and third-party resources.
    • API for third-party applications.
    We have successfully developed a progressive approach to indexing that enables lookup in scanned dictionaries of any language with very little initial effort and improves over time as more people interact with the dictionaries. In the future we want to explore the specific challenges of indexing and interaction in other languages, such as Mandarin or Japanese, where dictionaries are not sorted simply by their huge alphabets. We also want to utilize the indexes that users have developed over time to predict pages for lookup terms in dictionaries that are not indexed yet or are only partially indexed. We have the intuition that we can automatically predict the pages of an arbitrary dictionary for a lookup term with acceptable variance by aligning the pages of a dictionary with one or more resources, such as the indexes of other dictionaries in the same language, a corpus of the language, the most popular words in the language, and partial indexes of the dictionary itself.


    Sawood Alam