Posts

Showing posts with the label Python

2018-03-14: Twitter Follower Count History via the Internet Archive

The USA Gymnastics team's follower count shows significant growth during the years the Olympics are held. Twitter's API limits our ability to collect historical data about a user's followers: it does not reveal when one account started following another, and without that information the growth of an account's popularity cannot be tracked. Another pitfall is that when an account is deleted, Twitter provides no data about the account after the deletion date; it is as if the account never existed. However, this information can be gathered from the Internet Archive. If the account is popular enough to have been archived, then a follower count for a specific date can be collected. The previous method to determine followers over time is to plot the users in the order the API returns them against their join dates. This works on the assumption that the Twitter API returns followers in the order they started following the account being observed. The creation
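The lookup step can be sketched against the Internet Archive's CDX API, which lists captures of a URL with their timestamps. This is a minimal sketch, not the post's actual code; the sample response below is illustrative data in the CDX JSON layout, not a real query result:

```python
import json
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"

def cdx_query_url(page_url, from_date, to_date):
    """Build a CDX API query that lists captures of page_url as JSON rows."""
    params = {
        "url": page_url,
        "from": from_date,
        "to": to_date,
        "output": "json",
        "filter": "statuscode:200",
    }
    return CDX_ENDPOINT + "?" + urlencode(params)

# Illustrative CDX JSON response: the first row is a header, and each
# later row describes one capture (timestamp is YYYYMMDDhhmmss).
SAMPLE_RESPONSE = json.dumps([
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,twitter)/usagym", "20160801120000", "https://twitter.com/USAGym",
     "text/html", "200", "ABCDEF", "12345"],
    ["com,twitter)/usagym", "20170801120000", "https://twitter.com/USAGym",
     "text/html", "200", "ABCDEG", "12399"],
])

def capture_timestamps(cdx_json_text):
    """Return the capture timestamps from a CDX JSON response."""
    rows = json.loads(cdx_json_text)
    header, captures = rows[0], rows[1:]
    ts_index = header.index("timestamp")
    return [row[ts_index] for row in captures]

print(capture_timestamps(SAMPLE_RESPONSE))
# → ['20160801120000', '20170801120000']
```

Each capture can then be fetched from `web.archive.org/web/<timestamp>/<url>` and scraped for the follower count shown on that date.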

2017-11-22: Deploying the Memento-Damage Service

Many web services, such as archive.is, Archive-It, the Internet Archive, and the UK Web Archive, provide archived web pages, or mementos, for us to use. Web archivists have now shifted their focus from how to make a good archive to measuring how well an archive has preserved a page. This raises the question of how to objectively measure the damage to a memento in a way that matches human perception. Related to this, Justin Brunelle devised a prototype for measuring the impact of missing embedded resources (the damage) on a web page. In his IJDL paper (and the earlier JCDL version), Brunelle describes how the quality of a memento depends on the availability of its resources. The straight percentage of missing resources in a memento is not always a good indicator of how "damaged" it is. For example, one page could be missing several small icons whose absence users never even notice, and a second page could be missing a single embedd
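The weighted-damage idea — counting a missing resource by its importance rather than one-for-one — can be sketched as below. The type weights and resource list are hypothetical illustrations, not the Memento-Damage service's actual model:

```python
# Illustrative sketch: weight missing resources by size and type rather
# than counting them equally. These weights are made up for the example.
TYPE_WEIGHTS = {"image": 1.0, "css": 2.0, "javascript": 0.5, "iframe": 3.0}

def damage_score(resources):
    """resources: list of dicts with 'type', 'bytes', and 'missing' keys.
    Returns the weighted fraction of the page that is missing, in [0, 1]."""
    total = sum(TYPE_WEIGHTS.get(r["type"], 1.0) * r["bytes"] for r in resources)
    missing = sum(TYPE_WEIGHTS.get(r["type"], 1.0) * r["bytes"]
                  for r in resources if r["missing"])
    return missing / total if total else 0.0

page = [
    {"type": "image", "bytes": 500, "missing": True},     # tiny icon
    {"type": "image", "bytes": 500, "missing": True},     # tiny icon
    {"type": "iframe", "bytes": 99000, "missing": False}, # the main embed
]
# Two of three resources (67%) are missing, yet the weighted damage is tiny:
print(round(damage_score(page), 3))  # → 0.003
```

The example makes the paper's point concrete: a raw missing-resource percentage would call this page badly damaged, while a size- and type-aware score barely registers the lost icons.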

2017-02-22: Archive Now (archivenow): A Python Library to Integrate On-Demand Archives

Examples: Archive Now (archivenow) CLI. A small part of my research is to ensure that certain web pages are preserved in public web archives so that they will, hopefully, be available and retrievable whenever needed in the future. Because archivists believe that "lots of copies keep stuff safe", I have created a Python library (Archive Now) to push web resources into several on-demand archives, such as the Internet Archive, WebCite, Perma.cc, and Archive.is. If, for any reason, one archive stops serving temporarily or permanently, it is likely that copies can be fetched from the other archives. With Archive Now, one command like: $ archivenow --all www.cnn.com is sufficient for the current CNN homepage to be captured and preserved by all archives configured in this Python library. Archive Now allows you to accomplish the following major tasks: a web page can be pushed into one archive, a web page can be pushed into multiple archives, a web page can be pushed into all archi
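The "lots of copies" idea behind `--all` can be sketched as a fan-out over per-archive submission functions. The `push_ia` and `push_broken` helpers below are hypothetical stand-ins, not archivenow's actual API:

```python
def push_to_archives(url, pushers):
    """Submit url to every archive pusher, collecting results and failures
    so that one archive being down does not stop the others."""
    results = {}
    for name, push in pushers.items():
        try:
            results[name] = push(url)
        except Exception as exc:  # a down archive must not abort the rest
            results[name] = f"error: {exc}"
    return results

# Hypothetical per-archive submitters; each returns a memento URL.
def push_ia(url):
    return f"https://web.archive.org/web/{url}"

def push_broken(url):
    raise ConnectionError("archive temporarily unavailable")

print(push_to_archives("www.cnn.com", {"ia": push_ia, "wc": push_broken}))
```

The design choice worth noting is that each archive is tried independently and errors are recorded rather than raised, which is what makes redundant copies useful in practice.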

2016-06-03: Lipstick or Ham: Next Steps for WAIL

The development, state, and future of 🐳 Web Archiving Integration Layer. 💄∨🐷? Some time ago I created and deployed the Web Archiving Integration Layer (frequently abbreviated as WAIL), an application that provides users with pre-configured local instances of Heritrix and OpenWayback. This tool was originally created for the Personal Digital Archiving 2013 conference and has gone through a metamorphosis. The original impetus for creating the application was that the browser-based WARCreate extension required some sort of server-like software to save files locally, because of the limitations of the Google Chrome API and JavaScript at the time (2012). WARCreate would perform an HTTP POST to this local server instance, which would then return an HTTP response with an appropriate MIME type that would cause the browser to download the file. I initially used XAMPP for this with a PHP script within the Apache instance. Th
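The POST-then-download trick described above can be sketched with Python's standard-library HTTP server. This is a minimal stand-in for the XAMPP/PHP setup, not WAIL's actual code; the port and filename are made up:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class DownloadHandler(BaseHTTPRequestHandler):
    """Accept a POSTed body and echo it back with headers that make the
    browser save the response as a file instead of rendering it."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        self.send_response(200)
        # octet-stream plus Content-Disposition triggers a browser download
        self.send_header("Content-Type", "application/octet-stream")
        self.send_header("Content-Disposition",
                         'attachment; filename="capture.warc"')
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

# Usage (blocks until interrupted):
# HTTPServer(("127.0.0.1", 8042), DownloadHandler).serve_forever()
```

The extension would POST the captured WARC data to this local endpoint, and the `Content-Disposition: attachment` header is what prompts the browser to write it to disk.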

2016-03-22: Language Detection: Where to start?

Language detection is not a simple task, and no method achieves 100% accuracy. Different packages are available online for detecting different languages, and I have used several methods and tools to detect the language of websites and other texts. Here is a review of the methods I came across while working on my JCDL 2015 paper, How Well are Arabic Websites Archived?. I discuss detecting a webpage's language using the HTTP Content-Language header and the HTML lang attribute. In addition, I reviewed several language detection packages, including Guess-Language, Python-Language Detector, LangID, and the Google Language Detection API; since Python is my favorite coding language, I searched for tools written in Python. I found that a primary way to detect the language of a webpage is to use the HTTP Content-Language header and the HTML lang attribute. However, only a small percentage of pages include the language tag, and sometimes the detected language is affected by the browser setti
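The header-and-tag approach can be sketched with the standard library alone; the sample page and header values below are illustrative:

```python
import re

def language_from_headers(headers):
    """Read the HTTP Content-Language header, e.g. 'ar' or 'en-US'.
    Returns the first listed language code, lowercased, or None."""
    value = headers.get("Content-Language", "")
    first = value.split(",")[0].strip().lower()
    return first or None

def language_from_html(html):
    """Read the lang attribute of the <html> tag, e.g. <html lang="ar">."""
    match = re.search(r'<html[^>]*\blang=["\']?([A-Za-z-]+)', html, re.I)
    return match.group(1).lower() if match else None

page = '<html lang="ar"><head><title>مثال</title></head></html>'
print(language_from_headers({"Content-Language": "ar, en"}))  # → ar
print(language_from_html(page))                               # → ar
```

Both signals are self-reported by the server or page author, which is why they are often absent or wrong and why content-based detectors such as LangID remain necessary.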