Posts

2017-08-07: rel="canonical" does not mean what you think it means

The rel="identifier" draft has been submitted to the IETF. Some of the feedback we've received via Twitter and email consists of variations of 'why don't you use rel="canonical" to link to the DOI?' We discussed this in our original blog post about rel="identifier", but in fairness that post covered a great many things, and through updates and comments it has become quite lengthy. The short answer is that rel="canonical" handles cases where there are two or more URIs for a single resource (AKA "URI aliases"), whereas rel="identifier" specifies relationships between multiple resources. Having two or more URIs for the same resource is also known as "DUST: different URLs, similar text". This is commonplace with SEO and catalogs (see the 2009 Google blog post and help center article about rel="canonical"). RFC 6596 gives abstract examples, but below we will examine real world e

2017-07-24: Replacing Heritrix with Chrome in WAIL, and the release of node-warc, node-cdxj, and Squidwarc

I have written posts detailing how an archive's modifications to the JavaScript of a web page being replayed collided with the JavaScript libraries used by the page, and how JavaScript + CORS is a deadly combination during replay. Today I am here to announce the release of a suite of high-fidelity web archiving tools that help to mitigate the problems surrounding web archiving and a dynamic, JavaScript-powered web. To demonstrate this, consider the image above: the left-hand screenshot shows today's cnn.com archived and replayed in WAIL, whereas the right-hand screenshot shows cnn.com in the Internet Archive on 2017-07-24T16:00:02. In this post, I will be covering: updates to WAIL, the release of node-warc, the release of node-cdxj, and the release of Squidwarc.

WAIL

Let me begin by announcing that WAIL has transitioned away from using Heritrix as the primary preservation method. Instead, WAIL now directly uses a full Chrome browser (provided by Electron) as the pres

2017-07-19: Archives Unleashed 4.0: Web Archive Datathon Trip Report

They: Hey Sawood, nice to see you again.
Me: Hi, I am glad to see you too.
They: Did you attend all the hackathons, I mean datathons?
Me: Yes, I have attended all four Archives Unleashed events so far.
They: How did you like them?
Me: Well, there is a reason why I attended all of them, despite being a seemingly busy PhD researcher.
They: So, what is your research about?
Me: I am trying to profile various web archives to build a high-level understanding of their holdings, primarily for the sake of efficiently routing Memento aggregation requests, but there can be many more use cases for such profiles...
[and the conversation continues...]

On day zero of Archives Unleashed 4.0 in London, conversations among many familiar and unfamiliar faces started with travel and lodging related questions, but soon turned to mass storage challenges, scaling issues, quality and coverage of web archives, long-term maintenance of archival tools, documentation and d

2017-07-06: Web Science 2017 Trip Report

I was fortunate enough to have the opportunity to present Yasmin AlNoamany's work at Web Science 2017. Dr. Nelson offers an excellent class on Web Science, but it has been years since I took it, and I was still uncertain about the current state of the art. Web Science 2017 took place in Troy, a small city in upstate New York that is home to Rensselaer Polytechnic Institute (RPI). The RPI team had organized an excellent conference focused on a variety of Web Science topics, including cyberbullying, taxonomies, social media, and ethics.

Keynote Speakers

Day One

The opening keynote by Steffen Staab from the Institute for Web Science and Technologies (WeST) was entitled "The Web We Want". He discussed how we need to determine which values we want to uphold before deciding on the web we want. Dr. Staab defined three key values: accessibility for the disabled, freedom from harassment, and a useful semantic web. Staab detailed the MAMEM project wh