Friday, April 18, 2014

2014-04-18: Grad Cohort Workshop (CRA-W) 2014 Trip Report

Last week on April 10-11, 2014 I attended the Graduate Cohort Workshop 2014 that took place at the Hyatt Regency in Santa Clara. While there, I enjoyed the nice weather of California and saw the home to the headquarters of several high-tech companies.

CRA-W (Computing Research Association's Committee on the Status of Women in Computing Research) sponsors a number of activities focused on helping graduate students succeed in CSE research careers. These include educational and community-building events and mentoring.

The workshop is a CRA-W event with several goals, including (1) increasing the number of women in computing, (2) providing strategies and information on navigating graduate school, (3) giving early insight into career paths, and (4) offering a chance to meet and network with speakers and other graduate students.

Women students in their first, second, or third year of graduate school in computer science and engineering or a closely related field, who are studying at a US or Canadian institution, are eligible to apply to attend the event. This year was the eleventh year of the workshop, which started in 2004. In that first workshop, there were only 100 applicants and all were accepted; this year there were 503 applications, of which only 304 were accepted.

There were general sessions for the whole audience, as well as three simultaneous sessions aimed at students in their first, second, and third year of graduate school. Attendees could go to whichever sessions they thought were relevant.

The program agenda is available on the Grad Cohort Workshop website, along with agendas and slides from previous years. The slides from this year's talks will be uploaded as well.
Friday morning started with registration, followed by breakfast, where I got to meet some wonderful graduate students and we shared our personal experiences in graduate school. We also exchanged our CONNECT IDs. CONNECT provides conferences with a searchable online attendee list. It allows us to upload our picture, name, school, year in graduate school, interests, and personal website link and share them with other attendees. CONNECT also lets us look at other attendees' profiles, find people with similar interests, and send messages.
After that there was a welcome session that explained what the workshop is about, why it is important, why we were there, and how we were selected to attend.

Dr. Tracy Camp from the Colorado School of Mines presented the first session, Networking. She started by introducing herself and talking about her professional and personal background. Then she shared some networking strategies and stressed that networking is not genetic; it is a skill that can be developed. She also discussed how networking goes in all directions: up, down, and across. At the end of the session she had each of us practice one-on-one conversation with the people seated in front of and behind us.
In the second session, Dr. Yuanyuan Zhou, a Professor at the University of California, San Diego, presented on Finding a Research Topic. She first talked about her personal experience of struggling to find what she wanted to work on and what she was passionate about.
During her talk she noted that a zigzag path toward your research topic is fine and that you should not expect to find it in one shot. Some pointers to help you find the right path are: (1) find your own strengths and know what to look for in a topic, (2) ask what interests you and play to your strengths, (3) set your goals and milestones so you can finish successfully, and (4) think outside the box. She also showed some other graduate students' experiences in finding a research topic.

After that there was a lunch break, where we found tables labeled with research topic tags. I sat at the visualization table, but was able to talk about web archiving as well. It was interesting to talk to both graduate students and professors with similar interests.

Next, I attended the session on “Balancing Graduate School and Personal Life”, presented by Dr. Yanlei Diao, an Associate Professor in the Department of Computer Science at the University of Massachusetts Amherst, and Dr. Angela Demke Brown, an Associate Professor in the Department of Computer Science at the University of Toronto. The talk was about how to set long- and short-term goals, achieve them, and enjoy one day at a time. Always set target dates and manage your time. The main tip was to manage your time by choosing activities carefully. Another was to treat grad school as a job, keeping work and personal life separate.

After that, Dr. Farnam Jahanian, who leads the US NSF's Directorate for Computer and Information Science and Engineering (CISE), talked about the “Future of Computer Science”. In the talk, he described who belongs to the CISE community: 61% Computer Science and Information Science & Computer Engineering, 24% Science and Humanities, 12% Engineering (excluding computer engineering), and 3% Interdisciplinary Centers. He also pointed out the divisions and core research areas.

Then he mentioned the six Emerging Frontiers: (1) Data Explosion, (2) Smart Systems: Sensing, Analysis and Decision, such as Environment Sensing, People-Centric Sensing, Energy Response, and Smart Health Care, (3) Expanding the Limits of Computation, (4) Secure Cyberspace: securing our nation's cyberspace, (5) Universal Connectivity, and (6) Augmenting Human Capabilities.
He also mentioned some awards that are granted each year to explore the frontiers of computing.
After that there was a poster session with about 90 posters covering a range of interesting research topics. The primary research areas were Networking, HCI, AI, Databases, Graphics, Security, and many other areas of computing research.

Between sessions there were breaks where snacks were provided and the sponsors shared information on their work and job openings.

Saturday morning started with breakfast and a session on “Strategies for Human-Human Interaction” presented by three speakers: Dr. Amanda Stent, Principal Research Scientist at Yahoo; Dr. Laura Haas, IBM researcher; and Dr. Margaret Martonosi, Professor at Princeton University.
The session started with a short introduction of all the speakers. The talk then focused on (1) interaction strategies between faculty and students, (2) the challenges of being a woman in a computing technology field, and (3) examples of uncomfortable situations that may occur and how to respond.
After that I attended a session on “Building Self Confidence”, presented by Dr. Julia Hirschberg, a Professor and the Department Chair at Columbia University. The talk mainly focused on (1) how to recover from not doing as well in a course as you expected, (2) the frustration of not knowing what your specific research project is, (3) the feeling that you don’t know as much as your fellow graduate students, and (4) examples of situations that may occur and how to build your self-confidence in your own way.

Then there was the Wrap-up and Final Remarks session, where all the speakers and attendees were thanked for coming. After that, lunch was provided.

Finally, there was a “Resume Writing Clinic” and an “Individual Advising” session where the speakers provided one-on-one help to attendees who wanted it.

It was nice to attend these sessions and to meet all the wonderful women in computing, both professors and students, from all over the world, and to share our thoughts and experiences in graduate school.

--Lulwah Alkwai

Special thanks to Professor Michele C. Weigle for editing this post

Thursday, April 17, 2014

2014-04-17: TimeGate Design Options For MediaWiki

We've been working on the development, testing, and improvement of the Memento MediaWiki Extension.  One of our principal concerns is performance.

The Memento MediaWiki Extension supports all Memento concepts:
  • Original Resource (URI-R) - in MediaWiki parlance referred to as a "topic URI"
  • Memento (URI-M) - called "oldid page" in MediaWiki
  • TimeMap (URI-T) - analogous to the MediaWiki history page, but in a machine readable format
  • TimeGate (URI-G) - no native equivalent in MediaWiki; acquires a datetime from the Memento client, supplies back the appropriate URI-M for the client to render
This article will focus primarily on the TimeGate (URI-G), specifically the analysis of two different alternatives in the implementation of TimeGate.  In this article we use the following terms to refer to these two alternatives:
  • Special:TimeGate - where we use a MediaWiki Special Page to act as a URI-G explicitly
  • URI-R=URI-G - where a URI-R acts as a URI-G if it detects an Accept-Datetime header in the request
Originally, the Memento MediaWiki Extension used Special:TimeGate.
A Special:TimeGate datetime negotiation session would proceed as follows, as described in Pattern 2.1 in Section 4.2.1 of RFC 7089:
  1. HEAD request is sent with Accept-Datetime header to the URI-R*; URI-R responds with a Link header containing the location of the URI-G
  2. GET request is sent with Accept-Datetime header to the URI-G; URI-G responds with a 302 response header containing the location of the URI-M
  3. GET request is sent to the URI-M; URI-M responds with a 200 response header and Memento content
Obviously, this consists of 3 separate round trips between the client and server.  This URI-G architecture is referred to as Special:TimeGate.
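Every negotiation above starts with the client's Accept-Datetime request header. As a minimal sketch (not the extension's or any Memento client's actual code), the header value can be built in HTTP-date form with Python's standard library. The function name and the example date are illustrative assumptions.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_headers(when: datetime) -> dict:
    """Build request headers carrying an Accept-Datetime value (hypothetical helper)."""
    # HTTP dates are expressed in GMT, so normalise the datetime to UTC first.
    when_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}


# Illustrative date only; any timezone-aware datetime works.
headers = accept_datetime_headers(datetime(2011, 7, 12, tzinfo=timezone.utc))
print(headers["Accept-Datetime"])
```

A client would attach these headers to the HEAD or GET request in step 1 (and, for Special:TimeGate, again in step 2).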
The duration for Special:TimeGate is represented by:
dstg = a + RTT + b + RTT + M + RTT
dstg = 3RTT + a + b + M                            (1)
  • a - time to generate the initial URI-R response in step 1
  • b - time to generate the URI-G response in step 2
  • M - time to generate the URI-M response in step 3
  • RTT - round trip time for each request
Based on a conversation with the Wikimedia team, we chose to optimize this exchange by reducing the number of round trips, effectively implementing Pattern 1.1 in Section 4.1.1 of RFC 7089:
  1. HEAD request is sent with Accept-Datetime header to the URI-R; URI-R responds with a 302 response header containing the location of the URI-M
  2. GET request is sent to the URI-M; URI-M responds with a 200 response header and Memento content
This URI-G architecture is referred to as URI-R=URI-G.

The duration for URI-R=URI-G is represented by:
drg = B + RTT + M + RTT
drg = 2RTT + B + M                                 (2)
  • B - time to generate the URI-G response in step 1
  • M - time to generate the URI-M response in step 2
  • RTT - round trip time for each request
Intuitively, URI-R=URI-G should be faster.  It has fewer round trips to make between client and server.

For URI-R=URI-G to be the better choice, drg < dstg, which is the same as the following derived relationship:
2RTT + B + M < 3RTT + a + b + M
2RTT - 2RTT + B + M < 3RTT - 2RTT + a + b + M
B + M - M < RTT + a + b + M - M


B < RTT + a + b                                       (3)
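Equations (1) and (2) and inequality (3) can be checked numerically with a small sketch. The function names and the sample values below are illustrative assumptions, not measurements.

```python
def duration_special_timegate(rtt, a, b, m):
    """Equation (1): three round trips plus the three generation times."""
    return 3 * rtt + a + b + m


def duration_uri_r_equals_uri_g(rtt, big_b, m):
    """Equation (2): two round trips plus the two generation times."""
    return 2 * rtt + big_b + m


def uri_r_equals_uri_g_is_faster(rtt, a, b, big_b):
    """Inequality (3): M cancels out, so only RTT, a, b and B matter."""
    return big_b < rtt + a + b


# Hypothetical values in seconds, chosen only to exercise the formulas.
rtt, a, b, big_b, m = 0.05, 0.1, 0.6, 1.24, 0.3
d_stg = duration_special_timegate(rtt, a, b, m)
d_rg = duration_uri_r_equals_uri_g(rtt, big_b, m)
faster = uri_r_equals_uri_g_is_faster(rtt, a, b, big_b)
print(f"d_stg = {d_stg:.2f} s, d_rg = {d_rg:.2f} s, URI-R=URI-G faster: {faster}")
```

Comparing the two durations directly and evaluating (3) give the same answer, which is the point of the derivation: the time to generate the memento itself does not affect the choice.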
First, let's try to acquire the value of the term a.

After review of the Wikimedia architecture, it also became apparent that caching was an important aspect of our design and architecture plans.  Because the initial architecture utilized a Special:TimeGate URI and 302 responses are not supposed to be cached, caching was not of much concern.  Now that we've decided to pursue URI-R=URI-G, it becomes even more important.
Experiments with Varnish (the caching server used by Wikimedia) indicate that the Vary header correctly indicates what representations of the resource are to be cached.  If the URI-R contains a Vary header with the value Accept-Datetime, this indicates to Varnish that it should cache each URI-R representation in response to an Accept-Datetime in the request for that URI-R.  Other values of the Vary header have a finite number of values, but Accept-Datetime can have a near-infinite number of values, making caching near useless for URI-R=URI-G.
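The effect of varying on Accept-Datetime can be illustrated with a toy cache. This is only a simplified model of Vary-based cache keys, not Varnish's implementation; the class name and the example URL are made up for illustration.

```python
class ToyVaryCache:
    """Toy cache keyed on the URL plus the request headers named in Vary."""

    def __init__(self):
        self.entries = {}
        self.hits = 0
        self.misses = 0

    def fetch(self, url, request_headers, vary=("Accept-Datetime",)):
        # One cache entry per distinct combination of URL and varied header values.
        key = (url,) + tuple(request_headers.get(name) for name in vary)
        if key in self.entries:
            self.hits += 1
        else:
            self.misses += 1
            self.entries[key] = f"response for {key}"
        return self.entries[key]


cache = ToyVaryCache()

# Ordinary visitors send no Accept-Datetime, so they share a single entry.
for _ in range(3):
    cache.fetch("/wiki/Example", {})

# Memento clients may each ask for a different datetime: every request misses.
for day in ("10", "11", "12"):
    cache.fetch("/wiki/Example", {"Accept-Datetime": f"Tue, {day} Jul 2011 00:00:00 GMT"})

print(cache.hits, cache.misses, len(cache.entries))
```

In this model the three plain requests produce one miss and two hits, while the three datetime-negotiated requests produce three separate entries and no hits.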

Those visitors of a URI-R that don't use Accept-Datetime in the request header will be able to reap the benefits of caching readily.  Memento users of a URI-R=URI-G system will never reap this benefit, because Memento clients send an initial Accept-Datetime with every initial request.
Caching is important to our duration equations because a good caching server returns a cached URI-R in a matter of milliseconds, meaning our value of a in (3) above is incredibly small, on the order of 0.1 seconds on average.

Next we attempt to get the values of b and B in (3) above.

To get a good range of values, we conducted testing using the benchmarking tool Siege on our demonstration wiki.  The test machine is running an Apache HTTP Server 2.2.15 on top of Red Hat Enterprise Linux 6.5.  This server is a virtual machine consisting of two 2.4 GHz Intel Xeon CPUs and 2 GB of RAM.  The test machine consists of two installs of MediaWiki containing the Memento MediaWiki Extension: one with Special:TimeGate implemented, and a second using URI-R=URI-G.

Both TimeGate implementations use the same function for datetime negotiation.  The only major difference being whether they are called from a topic page (URI-R) or a Special page.

Tests were performed against localhost so that the results would not benefit from the installed Varnish caching server.

The output from siege looks like the following:

This output was processed to extract the 302 responses, which correspond to those instances of datetime negotiation (the 200 responses are just siege dutifully following the 302 redirect). The URI then indicates which version of the Memento MediaWiki Extension is installed. URIs beginning with /demo-special use the Special:TimeGate design option. URIs beginning with /demo use the URI-R=URI-G design option. From these lines we can compare the amount of time it takes to perform datetime negotiation using each design option.
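The extraction step described above might look like the following sketch. The line layout in the regular expression and the sample lines are assumptions made for illustration (status code, elapsed seconds, byte count, then method and path); they are not the actual output from our test runs, and the paths are hypothetical.

```python
import re

# Assumed layout of one siege log line; adjust the pattern to the real output.
LINE = re.compile(
    r"HTTP/\d\.\d\s+(\d{3})\s+([\d.]+)\s+secs:\s+\d+\s+bytes\s+==>\s+(?:[A-Z]+\s+)?(\S+)"
)


def negotiation_durations(lines):
    """Collect the durations of 302 responses, grouped by design option."""
    durations = {"Special:TimeGate": [], "URI-R=URI-G": []}
    for line in lines:
        match = LINE.search(line)
        if not match or match.group(1) != "302":
            continue  # skip 200s: siege following the redirect
        seconds, path = float(match.group(2)), match.group(3)
        # Check the longer prefix first, since /demo is a prefix of /demo-special.
        if path.startswith("/demo-special"):
            durations["Special:TimeGate"].append(seconds)
        elif path.startswith("/demo"):
            durations["URI-R=URI-G"].append(seconds)
    return durations


# Hypothetical sample lines, for illustration only.
sample = [
    "HTTP/1.1 302   0.63 secs:       0 bytes ==> GET  /demo-special/index.php/Special:TimeGate/Example",
    "HTTP/1.1 200   0.41 secs:   20122 bytes ==> GET  /demo-special/index.php?title=Example&oldid=1",
    "HTTP/1.1 302   1.10 secs:       0 bytes ==> GET  /demo/index.php/Example",
]
result = negotiation_durations(sample)
print(result)
```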

The date of Mon, 30 Jun 2011 00:00:00 GMT was used for datetime negotiation, because the test wiki contains fan-created data for the popular book series A Song of Ice And Fire (aka Game of Thrones), and this date corresponds to a book released during the wiki's use.

Figure 1: Differences in URI-G performance between URI-R=URI-G and Special:TimeGate
Figure 1 shows the results of performing datetime negotiation against 6304 different wiki pages.  The plot shows the difference between the URI-R=URI-G durations and the Special:TimeGate durations. Seeing as most values are above 0, there is a marked benefit to using Special:TimeGate.

Why the big difference?  It turns out that the earliest MediaWiki hook in the chain that we can use for URI-R=URI-G is ArticleViewHeader, because we needed something that provides an object that allows access to both the request (for finding Accept-Datetime) and response (for providing a 302) at the same time.  This hook is called once all of the data for a page has been loaded, leading to a lot of extra processing that is not incurred by the Special:TimeGate implementation.

Figure 2:  Histogram showing the range of URI-R=URI-G values
Figure 2 shows a histogram with 12 buckets containing the values for the durations of URI-R=URI-G.  The minimum value is 0.56 seconds.  The maximum value is 12.06 seconds.  The mean is 1.24 seconds. The median is 0.77 seconds. The biggest bucket spans 0 and 1.0.
Figure 3: Histogram showing the range of Special:TimeGate values
Figure 3 shows a histogram also with 12 buckets (for comparison) containing the values for the duration of Special:TimeGate.  The Special:TimeGate values only stretch between 0.22 and 1.75 seconds. The mean is 0.6 seconds. The median is 0.59 seconds. The biggest bucket spans 0.5 and 0.6.

Using this data, we can derive a solution for (3).  The values for B range from 0.56 to 12.06 seconds.  The values for b range from 0.22 to 1.75 seconds.

Now, the values of RTT can be considered.

The round trip time (RTT) is a function of the transmission delay (dt) and propagation delay (dp):
RTT = dt + dp                                                (4)
And transmission delay is a function of the number of bits (N) divided by the rate of transmission (R)
dt = N / R                                                      (5)
The average TimeGate request-response pair consists of a 300 Byte HTTP HEAD request header + 600 Byte HTTP 302 response header + 20 Byte TCP header + 20 Byte IP header = 940 Byte  = 7520 bit payload.

For 1G wireless telephony (28,800 bps), the end user would experience a transmission delay of
dt = 7520 b / 28800 bps
dt = 0.26 s
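The arithmetic for equation (5) can be reproduced with a short sketch; the constant and function names are illustrative.

```python
# Sizes in bytes, taken from the request-response breakdown above.
REQUEST_HEADER = 300
RESPONSE_HEADER = 600
TCP_HEADER = 20
IP_HEADER = 20


def transmission_delay(bits, rate_bps):
    """Equation (5): dt = N / R, in seconds."""
    return bits / rate_bps


payload_bits = (REQUEST_HEADER + RESPONSE_HEADER + TCP_HEADER + IP_HEADER) * 8
dt = transmission_delay(payload_bits, 28_800)
print(f"N = {payload_bits} bits, dt = {dt:.2f} s")
```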
So, in our average case for both URI-G implementations (using a = 0.1 for a cached URI-R in (3)):
B < RTT + a + b
B < dp + dt + a + b
1.24 s < dp + dt + 0.1 s + 0.6 s
we replace dt with our value for 1G wireless telephony:
1.24 s < dp + 0.26 s + 0.1 s + 0.6 s
1.24 s < dp + 0.96 s
So, an end user with 1G wireless telephony would need to experience an additional 0.28 s of propagation delay in order for URI-R=URI-G to even be comparable to Special:TimeGate.

Propagation delay is a function of distance and propagation speed:
dp = d / sp                                                (6)
Seeing as 1G wireless signals travel at the speed of light, the distance one would need to transmit a signal to make URI-R=URI-G viable becomes
0.28 s = d / (299,792,458 m/s)
(0.28 s) (299,792,458 m/s) = d
d = 83,941,888.24 m = 83,942 km = 52,159 miles
This is more than the circumference of the Earth.  Even if we used copper wire (which is worse) rather than radio waves, the order of magnitude is still the same.  Considering the amount of redundancy on the Internet, the probability of hitting this distance is quite low, so let's ignore propagation delay for the rest of this article.
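As a rough check of this comparison, the required propagation delay and distance can be recomputed from the average values used above. This is a sketch; the names are illustrative, and the Earth's circumference is an approximate equatorial figure added for the comparison.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458
EARTH_CIRCUMFERENCE_KM = 40_075  # approximate equatorial value (assumption for comparison)


def required_propagation_delay(big_b, dt, a, b):
    """Smallest dp (seconds) at which B < dp + dt + a + b starts to hold."""
    return big_b - (dt + a + b)


def distance_for_delay(dp_seconds, speed=SPEED_OF_LIGHT_M_PER_S):
    """Rearranged equation (6): d = dp * sp, in metres."""
    return dp_seconds * speed


dp = required_propagation_delay(big_b=1.24, dt=0.26, a=0.1, b=0.6)
distance_km = distance_for_delay(dp) / 1000
print(f"dp = {dp:.2f} s, d = {distance_km:,.0f} km")
print("exceeds Earth's circumference:", distance_km > EARTH_CIRCUMFERENCE_KM)
```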

That brings us back to transmission delay.  At what transmission delay, and essentially what bandwidth, does URI-R=URI-G win out over Special:TimeGate using our average values for the generation of the 302 response?
B < dt + a + b from (3) and (4), dropping dp
1.24 s < dt + 0.1 s + 0.6 s
1.24 s < dt + 0.7 s
0.54 s < dt

dt = N / R
0.54 s = 7520 b / R
(0.54 s) R = 7520 b
R = 7520 b / 0.54 s
R = 13,926 bps ≈ 13.9 kbps
Thus, those MediaWiki sites with users using something slower than a 14.4 modem will benefit from the URI-R=URI-G implementation for TimeGate using our average values for the generation of TimeGate responses.
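The break-even bandwidth can be recomputed with the same average values. This is a sketch with illustrative names, ignoring propagation delay as in the text.

```python
def break_even_rate_bps(payload_bits, big_b, a, b):
    """Rate R at which dt = N / R equals B - a - b (propagation delay ignored)."""
    required_dt = big_b - a - b
    return payload_bits / required_dt


rate = break_even_rate_bps(payload_bits=7520, big_b=1.24, a=0.1, b=0.6)
print(f"R = {rate:,.0f} bps")
print("slower than a 14.4 kbps modem:", rate < 14_400)
```

Links slower than this rate make each extra round trip expensive enough for URI-R=URI-G to win; anything faster favors Special:TimeGate under these averages.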

Therefore, we have decided that Special:TimeGate provides the best performance in spite of the extra request needed between the client and server.  The reason that the intuitive choice did not work out in most cases is due to idiosyncrasies in the MediaWiki architecture, rather than network concerns, as originally assumed.

--Shawn M. Jones

* It is not necessary for a client to send an Accept-Datetime header to a URI-R.  Most Memento clients do (and RFC 7089 Section 3.1 demonstrates this), in hopes that they encounter a URI-R=URI-G pattern and can save on an extra request.

Tuesday, April 1, 2014

2014-04-01: Yesterday's (Wiki) Page, Today's Image?

Web pages, being complex documents, contain embedded resources like images.  As practitioners of digital preservation well know, ensuring that the correct embedded resource is captured when the main page is preserved presents a very difficult problem.  In A Framework for Evaluation of Composite Memento Temporal Coherence, Scott Ainsworth, Michael L. Nelson, and Herbert Van de Sompel explore this very concept.

Figure 1: Web Archive Weather Underground Example Showing the Different Ages of Embedded Resources
In Figure 1, borrowed from that paper, we see a screenshot of the Web Archive's December 9, 2004 memento from Weather Underground.  Even though the ages of most of these embedded images differ greatly from that of the main page, they don't really impact its meaning.  Of interest is the weather map, which differs by 9 months and shows clear skies even though the forecast on the main page calls for clouds and light rain.

The Web Archive, as a service external to the resource that it is trying to preserve, only has access to resources that exist at the time it can make a crawl, leading to inconsistencies.  Wikis, on the other hand, have access to all resources under their control, embedded or otherwise.

This is why it is surprising that MediaWiki, even though it allows for access to all previous revisions of a given page, does not tie the datetime of those embedded resources back to that main page.

A pertinent example is that of the Wikipedia article Same-sex marriage law in the United States by state.

Figure 2:  Screenshot of Wikipedia article on Same-sex marriage law in the United States by state
Figure 2 shows the current (as of this writing) version of this article, complete with a color-coded map indicating the types of same-sex marriage laws applying to each state.  In this case, the correctness of the version of the embedded resource is pertinent to the understanding of the article.

Figure 3: Screenshot of the same Wikipedia page, but for a revision from June of 2013
Figure 3 shows a June 2013 revision of this article, with the same color-coded map.  This is a problem because it is an old revision of the article with the same version of this color-coded map.  When accessing the June 2013 version of the article on Wikipedia, I get the March 2014 version of the embedded resource.  To ensure that this revision makes sense to the reader, the map from Figure 4 should be displayed with the article instead.  As Figure 5 shows, Wikipedia has all previous revisions of this resource.

Figure 4: The June 2013 revision of the embedded map resource
Figure 5:  Listing of all of the revisions of the map resource on Wikipedia

For this particular topic, any historian (or paralegal) attempting to trace the changes in laws on this topic will be confused when presented with a map that does not match the text, and may question the validity of this resource as a whole.
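The desired behavior amounts to picking, for each embedded image, the newest revision that is not later than the article revision being viewed. A minimal sketch of that selection follows, assuming timestamps are 14-digit strings (year, month, day, hour, minute, second) so that string order matches chronological order. The function name and the sample timestamps are hypothetical, and this is not MediaWiki code.

```python
from bisect import bisect_right


def revision_in_effect(image_timestamps, page_timestamp):
    """Return the newest image timestamp not later than the page revision, or None."""
    ordered = sorted(image_timestamps)
    # bisect_right gives the count of image revisions at or before the page revision.
    index = bisect_right(ordered, page_timestamp)
    return ordered[index - 1] if index else None


# Hypothetical upload times for one image and one article revision.
map_revisions = ["20130301000000", "20130626120000", "20140320000000"]
chosen = revision_in_effect(map_revisions, "20130630000000")
print(chosen)
```

If the article revision predates every upload of the image, the function returns None, and a caller would have to decide whether to show nothing or fall back to the oldest revision.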

We tried to address this issue with the Memento MediaWiki extension.  MediaWiki provides the ImageBeforeProduceHTML hook, which appears to do what we want.  It provides a $file argument, giving access to the LocalFile object for the image.  It also provides a $time argument that signifies the timestamp of the file in 'YYYYMMDDHHIISS' string form, or false for current.

We were perplexed when the hook did not perform as expected, so we examined the source of MediaWiki version 1.22.5.  Below we see the makeImageLink function that calls the hook on line 569 of Linker.php.

We see that later on, inside this conditional block, $time is used on line 655 as an argument to the makeThumbLink2 function (bottom of code snippet).
And, within the makeThumbLink2 function, it gets used to make a boolean argument for a call to the function makeBrokenImageLinkObj on line 861.
Back inside the makeImageLink function, we see a second opportunity to use the $time value on line 675, but again it is used to create a boolean argument to the same function.
Note that its timestamp value in 'YYYYMMDDHHIISS' string form is never actually used as prescribed.  So, the documentation for the ImageBeforeProduceHTML hook is incorrect on the use of this $time argument.  In fact, the hook was introduced in MediaWiki version 1.13.0 and this code doesn't appear to have changed much since that time.  It is possible that the $time functionality is intended to be implemented in a future version.

Alternatively, we considered using the &$res argument from that hook to replace the HTML with the images of our choosing, but we would still need to use the object provided by the $file argument, which has no ready-made way to select a specific revision of the embedded resource.

At this point, in spite of having all of the data needed to solve this problem, MediaWiki, and transitively Wikipedia, does not currently support rendering old revisions of articles as they truly looked in the past.

--Shawn M. Jones