Posts

Showing posts with the label research

2020-05-28: Richard Pates (Computer Science PhD Student)

Welcome to my profile on Blogger! My name is Richard Pates, and I joined the Web Science and Digital Libraries (WS-DL) research group in the Department of Computer Science (CS) at Old Dominion University (ODU) during the Summer of 2020 as a PhD student in CS. I am advised by Dr. Jian Wu as a member of the research team in the Lab for Applied Machine Learning and Natural Language Processing Systems (LAMP-SYS) Group, working on the Mining Electronic Theses and Dissertations (METD) Project. After earning the Master of Science in Computer Science (MSCS) from ODU during the Fall of 2018, I was approved to join the PhD program in CS during the Spring of 2019, jointly advised by Dr. Ravi Mukkamala and Dr. Cong Wong, with interests in Artificial Intelligence (AI), Cybersecurity, and Systems. This year my main goal in the PhD program will be to advance to PhD Candidate during the Fall of 2020 (Current Academic Calendar), having made the Doctoral Dissert…

2020-01-09: Kritika Garg (Computer Science PhD Student)

I am Kritika Garg, a first-year Ph.D. student at Old Dominion University. My research interests are in the fields of web archiving, social media, and natural language processing. I joined Old Dominion University in the Fall of 2019 under the supervision of Dr. Michael L. Nelson and Dr. Michele C. Weigle. I work with the Web Science and Digital Libraries Research Group (WS-DL), whose focus is in the fields of web archiving, digital preservation, social media, and human-computer interaction. My current research is in web archiving: analyzing the access patterns of robots and humans in web archives and studying whether the patterns prevalent in the Internet Archive are present across other web archives. I completed my undergraduate degree at Guru Gobind Singh Indraprastha University in June 2019. During my undergrad, I started attending various tech events organized by groups such as Google Developer Group, PyDelhi, Women Techmakers, Women Who Code, …

2016-08-15: Mementos In the Raw, Take Two

In a previous post, we discussed a way to use the existing Memento protocol combined with link headers to access unaltered (raw) archived web content. Interest in unaltered content has grown as more use cases arise for web archives. Ilya Kreymer and David Rosenthal had previously suggested that a new dimension of content negotiation would be necessary to allow clients to access unaltered content. That idea was not originally pursued because it would have required the standardization of new HTTP headers. At the time, none of us were aware of the standard Prefer header from RFC 7240. Prefer can solve this problem in an intuitive way, much like their original suggestion of content negotiation. To recap, most web archives augment mementos when presenting them to the user, often for usability or legal purposes. The figures below show examples of these augmentations. Figure 1: The PRONI web archive augments mementos for user experience; augmentations outlined in red. Fi…
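The mechanics described above can be sketched with a pair of small helpers. This is a minimal illustration of RFC 7240 semantics, not the archive-side implementation: the preference token "original-content" is an assumed name for whatever token a client and archive agree on, and per RFC 7240 a server that honors a preference echoes it back in a Preference-Applied response header.

```python
# Sketch of RFC 7240 Prefer-header negotiation for raw mementos.
# The token "original-content" below is illustrative (an assumption),
# not a value defined by the RFC or by any particular archive.

def build_memento_request_headers(prefer: str) -> dict:
    """Headers asking an archive for an unaltered (raw) memento."""
    return {
        "Accept": "text/html",
        "Prefer": prefer,  # RFC 7240: a hint the server may ignore
    }

def preference_applied(response_headers: dict, token: str) -> bool:
    """RFC 7240: honored preferences are echoed in Preference-Applied."""
    applied = response_headers.get("Preference-Applied", "")
    return token in [p.strip() for p in applied.split(",")]
```

Because Prefer is only a hint, a client must check Preference-Applied before assuming the body is the raw content; otherwise it should fall back to stripping augmentations itself.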

2016-07-24: Improve research code with static type checking

The Pain of Late Bug Detection. [The web] is big. Really big. You just won't believe how vastly, hugely, mindbogglingly big it is... [1] When it comes to quick implementation, Python is an efficient language used by many web archiving projects. Indeed, a quick search of GitHub for WARC and Python yields a list of 80 projects and forks. Python is also the language used for my research into the temporal coherence of existing web archive holdings. The sheer size of the Web means lots of variation and lots of low-frequency edge cases. These variations and edge cases are naturally reflected in web archive holdings, so code used to research the Web and web archives naturally contains many, many code branches. Python struggles under these conditions: minor changes can easily introduce bugs that go undetected until much later, and later for Python means at run time. Indeed, the sheer number of edge cases introduces code branches that are exercised so infrequent…
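The problem above can be made concrete with a short sketch. The WARC-header parsing here is hypothetical, but it shows the pattern static checking rewards: annotate a function as returning Optional, and a checker such as mypy forces every caller to handle the None branch at check time, rather than on a rare record months later at run time.

```python
# Minimal sketch of catching the None edge case statically. Running
# `mypy` on a caller that uses the result without a None guard is
# rejected before any code executes; the guarded version below passes.
from typing import Optional

def parse_memento_datetime(header: Optional[str]) -> Optional[str]:
    """Return the Memento-Datetime value, or None if the header is absent."""
    if header is None:
        return None
    return header.strip()

def archive_year(header: Optional[str]) -> int:
    dt = parse_memento_datetime(header)
    # Without this guard, mypy flags dt[-4:] ("dt may be None") --
    # exactly the low-frequency branch that hides until run time.
    if dt is None:
        return 0
    return int(dt[-4:])
```

The annotations cost a few characters per signature, and the checker then exercises every branch of the type logic on every run, including the ones real inputs hit once in a million records.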

2016-04-27: Mementos in the Raw

While analyzing mementos in a recent experiment, we discovered problems processing archived content. Many web archives augment the mementos they serve with additional archive-specific information, including HTML, text, and JavaScript. We were attempting to compare content across many web archives and had to develop custom solutions to remove these augmentations. Most archives augment their mementos in order to provide additional user-experience features, such as navigation to additional mementos, by rewriting links and providing additional discovery tools. From an end-user perspective, these augmented mementos enhance the usability and overall experience of web archives and are the default case for user access to mementos. An example from the PRONI web archive is shown below, with the augmentations outlined in red. Others have requirements to differentiate archived content from live content, because they expose archived content to web search engines. Below, we see that a Goog…
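A sketch of one such custom solution, for one archive only: the Internet Archive's Wayback Machine has bracketed its injected toolbar with HTML comment markers (the marker text below may vary across Wayback versions, and other archives use different markers or none at all, which is exactly why cross-archive comparison required per-archive code).

```python
# Hedged sketch: strip one archive's augmentations via its comment
# markers. This handles only the Wayback-style toolbar insert; link
# rewriting and other archive-specific changes need separate handling.
import re

WAYBACK_BANNER = re.compile(
    r"<!--\s*BEGIN WAYBACK TOOLBAR INSERT\s*-->.*?"
    r"<!--\s*END WAYBACK TOOLBAR INSERT\s*-->",
    re.DOTALL,
)

def strip_wayback_toolbar(html: str) -> str:
    """Remove the injected toolbar markup, leaving the original content."""
    return WAYBACK_BANNER.sub("", html)
```

Multiply this by every archive in a study, each with its own markup, and the appeal of a protocol-level way to request raw content becomes clear.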

2015-06-26: PhantomJS+VisualEvent or Selenium for Web Archiving?

My research and niche within the WS-DL research group focuses on understanding how the adoption of JavaScript and Ajax is impacting our archives. I leave the details as an exercise to the reader (D-Lib Magazine 2013, TPDL 2013, JCDL 2014, IJDL 2015), but the proverbial bumper sticker is that JavaScript makes archiving more difficult because the traditional archival tools are not equipped to execute JavaScript. For example, Heritrix (the Internet Archive's automatic archival crawler) executes HTTP GET requests for archival target URIs on its frontier and archives the HTTP response headers and the content returned from the server when the URI is dereferenced. Heritrix "peeks" into embedded JavaScript and extracts any URIs it can discover, but does not execute any client-side scripts. As such, Heritrix will miss any URIs constructed in the JavaScript and any embedded resources loaded via Ajax. For example, the Kelly Blue Book Car Values website (Figure 1) uses…
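The "peek" strategy described above can be sketched in a few lines. This is a simplified stand-in for Heritrix's extractors, not its actual code: it scans JavaScript source for string-literal URIs, which is exactly why a URI assembled at run time (by concatenation, or returned from an Ajax call) stays invisible to a crawler that never executes the script.

```python
# Rough sketch of static URI "peeking" in embedded JavaScript.
# Only complete string literals are discoverable; run-time-constructed
# URIs and Ajax-loaded resources are missed, as the example shows.
import re

URI_LITERAL = re.compile(r"""["'](https?://[^"']+)["']""")

def peek_uris(script: str) -> list:
    """Return the string-literal URIs found in a JavaScript source."""
    return URI_LITERAL.findall(script)

js = """
var img = "http://example.com/static/car.png";   // literal: discoverable
var api = "http://example.com/api/" + modelId;   // only the base is seen;
loadValues(api);                                 // the full URI is missed
"""
```

Running `peek_uris(js)` finds the image URI and the API base, but never the full per-model URI the page actually loads, so those responses are absent from the archive.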

2011-08-28: KDD 2011 Trip Report

Author: Carlton Northern. The SIGKDD 2011 conference took place August 21-24 at the Hyatt Manchester in San Diego, CA. Researchers from all over the world interested in knowledge discovery and data mining were in attendance. This conference in particular has a heavy statistical-analysis flavor, and many presentations were math intensive. I was invited to present my master's project research at the Mining Data Semantics (MDS2011) Workshop of KDD. In this paper, we present an approach to find the social media profiles of people from an organization. This is possible due to the links created between members of an organization; for instance, co-workers or students will likely friend each other, creating hyperlinks between their respective accounts. These links, if public, can be mined and used to disambiguate other profiles that may share the same names as the individuals we are searching for. The following figure shows the number of profiles found from the ODU Computer Science st…
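The disambiguation idea above can be sketched simply. This is a hedged illustration of the intuition, not the paper's actual method: among candidate profiles sharing a target's name, prefer the one with the most public friend links back into the set of already-identified organization profiles. All names and links below are invented for the example.

```python
# Hedged sketch of link-based profile disambiguation. A candidate whose
# friend links land inside the known organization set is more likely to
# be the organization member than a same-named stranger.

def disambiguate(candidates: dict, known_org_profiles: set) -> str:
    """candidates maps a profile URL to the set of profile URLs it links to.

    Returns the candidate with the most links into known_org_profiles.
    """
    def org_link_count(profile: str) -> int:
        return len(candidates[profile] & known_org_profiles)

    return max(candidates, key=org_link_count)
```

Each newly confirmed profile can then be added to the known set, so the method bootstraps: every resolved member makes the next same-named candidate easier to disambiguate.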