Monday, March 23, 2015

2015-03-23: 2015 Capital Region Celebration of Women in Computing (CAPWIC)

On February 27-28, I attended the 2015 Capital Region Celebration of Women in Computing (CAPWIC) in Harrisonburg, VA, on the campus of James Madison University.  Two of our graduating Master's students, Apeksha Barhanpur (ACM president) and Kayla Henneman (ACM-W president), attended with me.

With the snow that had blanketed the Hampton Roads region, we were lucky to get out of town on Friday morning.  We were also lucky that Harrisonburg had received its foot of snow the previous weekend, so there had been plenty of time for the roads to be cleared.  We had some lovely scenery to view along the way.

We arrived a little late on Friday afternoon, but Apeksha and Kayla were able to attend "How to Get a Tech Job" by Ann Lewis, Director of Engineering at Pedago.  This talk focused on how each student has to pick the right field of technology for their career. The speaker presented some basic information on the different fields of technology and different levels of job positions and companies. The speaker also mentioned the "Because Software is Awesome" Google Group, which is a private group for students seeking information on programming languages and career development.

While they attended the talk, I caught up with ODU alum and JMU assistant professor, Samy El-Tawab.

After a break, I put on my Graduate Program Director hat and gave a talk titled "What's Grad School All About?"


I got to reminisce about my grad school days, share experiences of encountering the imposter syndrome, and discuss the differences between the MS and PhD degrees in computer science.


After my talk, we set up for the College and Career Fair.  ODU served as an academic sponsor, meaning that we got a table where we were able to talk with several women interested in graduate school.  Apeksha and Kayla also got to pass out their resumes to the companies that were represented.

I also got to show off my deck of Notable Women in Computing playing cards.  (You can get your own deck at notabletechnicalwomen.org.)


Our dinner keynote, "Technology and Why Diversity Matters," was given by Sydney Klein, VP for Information Security and Risk Management at Capital One. (Capital One had a huge presence at the conference.) One thing she emphasized is that Capital One now sees itself as more of a technology company than a bank. Klein spoke about the importance of women in technology and the percentage of women represented in the field at various levels. She also mentioned various opportunities in the market for women.

After dinner, we had an ice breaker/contest where everyone was divided into groups, each tasked with creating a flag representing the group and its relation to the field of computer science. Apeksha was on the winning team!  Their flag represented the theme of the conference, “Women make the world work”, and how they were connected to the field of technology. Apeksha noted that it was a great experience to work with a group of women from different regions around the world.

On Saturday morning, Apeksha and Kayla attended the "Byte of Pi" talk given by Tejaswini Nerayanan and Courtney Christensen from FireEye. They demonstrated programming using the Raspberry Pi, a single-board computer, and gave the students a short demonstration of writing code and building projects.

Later Saturday, my grad school buddy, Mave Houston arrived for her talk.  Mave is the Founder and Head of USERLabs and User Research Strategy at Capital One. Mave gave a great talk, titled "Freedom to Fail". She also talked about using "stepping stones on the way to success." She let us play with Play-Doh, figuring out how to make a better toothbrush. My partner, a graduate student at Virginia State University, heard me talk about trying to get my kids interested in brushing their teeth and came up with a great idea for a toothbrush with buttons that would let them play games and give instructions while they brushed. Another group wanted to add a sensor that would tell people where they needed to focus their brushing.

We ended Saturday with a panel on graduate school that both Mave and I took part in; we hopefully encouraged some of the attending students to continue their studies.

-Michele

Tuesday, March 10, 2015

2015-03-10: Where in the Archive Is Michele Weigle?

(Title is an homage to a popular 1980s computer game "Where in the World Is Carmen Sandiego?")

I was recently working on a talk to present to the Southeast Women in Computing Conference about telling stories with web archives (slideshare). In addition to our Hurricane Katrina story, I wanted to include my academic story, as told through the archive.

I was a grad student at UNC from 1996-2003, and I found that my personal webpage there had been very well preserved.  It was captured 162 times between June 1997 and October 2013 (https://web.archive.org/web/*/http://www.cs.unc.edu/~clark/), so I was able to come up with several great snapshots of my time in grad school.

https://web.archive.org/web/20070912025322/http://www.cs.unc.edu/~clark/
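As an aside, you don't have to click through the calendar view to count captures: the Wayback Machine also exposes its index through a public CDX API. Here's a minimal Python sketch, assuming the endpoint at web.archive.org/cdx/search/cdx and its default space-separated text output; the helper name list_captures is mine.

    # Minimal sketch: list Wayback Machine captures of a URL via the CDX API.
    # Assumes the public endpoint and its default text output, where each
    # line is: urlkey timestamp original mimetype statuscode digest length
    from urllib.parse import urlencode
    from urllib.request import urlopen

    def list_captures(url):
        query = urlencode({"url": url})
        with urlopen("http://web.archive.org/cdx/search/cdx?" + query) as resp:
            lines = resp.read().decode("utf-8").splitlines()
        for line in lines:
            fields = line.split(" ")
            print(fields[1], fields[4])   # timestamp and HTTP status code
        print(len(lines), "captures")

    list_captures("http://www.cs.unc.edu/~clark/")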
Aside: My UNC page was archived 20 times in 2013, but the archived pages don't have the standard Wayback Machine banner, nor are their outgoing links rewritten to point to the archive. For example, see https://web.archive.org/web/20130203101303/http://www.cs.unc.edu/~clark/
Before I joined ODU, I was an Assistant Professor at Clemson University (2004-2006). The Wayback Machine shows that my Clemson home page was crawled only twice, both times in 2011 (https://web.archive.org/web/*/www.cs.clemson.edu/~mweigle/). Unfortunately, I no longer worked at Clemson in 2011, so both captures return 404s:


Sadly, there is no record of my Clemson home page. But, I can use the archive to prove that I worked there. The CS department's faculty page was captured in April 2006 and lists my name.

https://web.archive.org/web/20060427162818/http://www.cs.clemson.edu/People/faculty.shtml
Why does the 404 show up in the Wayback Machine's calendar view? Heritrix archives every response, no matter the status code, and everything that isn't a 500-level (server error) response is listed in the Wayback Machine. Redirects (300-level responses) and Not Founds (404s) do record the fact that the target webserver was up and running at the time of the crawl.
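Since the index records the status code of every capture, you can even ask for just the Not Founds. A small sketch, again assuming the public CDX API; its filter parameter restricts results by field:

    # Sketch: list only the captures of a URL that were recorded as 404s.
    from urllib.parse import urlencode
    from urllib.request import urlopen

    query = urlencode({
        "url": "www.cs.clemson.edu/~mweigle/",
        "filter": "statuscode:404",   # keep only Not Found responses
    })
    with urlopen("http://web.archive.org/cdx/search/cdx?" + query) as resp:
        for line in resp.read().decode("utf-8").splitlines():
            fields = line.split(" ")
            print(fields[1], fields[4], fields[2])   # timestamp, status, URL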

Wouldn't it be cool if, when I request a page that 404s, like http://www.cs.clemson.edu/~mweigle/, the archive could figure out that there is a similar page (http://www.cs.unc.edu/~clark/) that links to the requested page?
https://web.archive.org/web/20060718131722/http://www.cs.unc.edu/~clark/
It'd be even cooler if the archive could then figure out that the latest memento of that UNC page now links to my ODU page (http://www.cs.odu.edu/~mweigle/) instead of the Clemson page. Then, the archive could suggest http://www.cs.odu.edu/~mweigle/ to the user.

https://web.archive.org/web/20120501221108/http://www.cs.unc.edu/~clark/
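No archive offers such a suggestion service today, but the mechanics aren't far-fetched. Here's a rough sketch of the idea, assuming you already know a page that once linked to the dead URL (here, my old UNC page); the helper names are mine and the link extraction is deliberately naive:

    # Rough sketch of the suggestion idea (hypothetical; no such archive
    # service exists): fetch the latest memento of a page known to have
    # linked to the dead URL and surface its outlinks as candidates.
    import re
    from urllib.parse import urlencode
    from urllib.request import urlopen

    def latest_memento(url):
        # Use the CDX API to find the most recent capture timestamp.
        query = urlencode({"url": url})
        with urlopen("http://web.archive.org/cdx/search/cdx?" + query) as resp:
            last = resp.read().decode("utf-8").splitlines()[-1]
        return "http://web.archive.org/web/%s/%s" % (last.split(" ")[1], url)

    def candidate_replacements(linking_page):
        with urlopen(latest_memento(linking_page)) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        # Outgoing links in mementos are rewritten to /web/<timestamp>/<url>;
        # strip that rewriting to recover the original targets.
        return re.findall(
            r'href="(?:https?://web\.archive\.org)?/web/\d+[a-z_]*/(http[^"]+)"',
            html)

    for link in candidate_replacements("http://www.cs.unc.edu/~clark/"):
        print(link)   # candidates to suggest in place of the 404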
I joined ODU in August 2006.  Since then, my ODU home page has been saved 53 times (https://web.archive.org/web/*/http://www.cs.odu.edu/~mweigle/).

The only memento from 2014 is on Aug 9, 2014, but it returns a 302 redirecting to an earlier memento from 2013.



It appears that Heritrix crawled http://www.cs.odu.edu/~mweigle (note the lack of a trailing /), which resulted in a 302, but http://www.cs.odu.edu/~mweigle/ was never crawled. The Wayback Machine's canonicalization is likely the reason that the redirect points to the most recent capture of http://www.cs.odu.edu/~mweigle/. (That is, the Wayback Machine knows that http://www.cs.odu.edu/~mweigle and http://www.cs.odu.edu/~mweigle/ are really the same page.)
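You can actually see this canonicalization in the CDX index: the first field of each record is the canonicalized key (in SURT form), so both spellings should map to the same key. A sketch; the expected output in the comments is my assumption:

    # Sketch: show that both spellings of the URL canonicalize to the same
    # CDX key (the first field of each CDX record).
    from urllib.parse import urlencode
    from urllib.request import urlopen

    def urlkey(url):
        query = urlencode({"url": url, "limit": "1"})
        with urlopen("http://web.archive.org/cdx/search/cdx?" + query) as resp:
            first = resp.read().decode("utf-8").splitlines()[0]
        return first.split(" ")[0]

    print(urlkey("http://www.cs.odu.edu/~mweigle"))    # expect: edu,odu,cs)/~mweigle
    print(urlkey("http://www.cs.odu.edu/~mweigle/"))   # expect the same key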

My home page is managed by wiki software and the web server does some URL re-writing. Another way to get to my home page is through http://www.cs.odu.edu/~mweigle/Main/Home/, which has been saved 3 times between 2008 and 2010. (I switched to the wiki software sometime in May 2008.) See https://web.archive.org/web/*/http://www.cs.odu.edu/~mweigle/Main/Home/

Since these two URLs point to the same page, should their two TimeMaps be merged? What happens if, at some point in the future, I decide to stop using this particular wiki software and end up with http://www.cs.odu.edu/~mweigle/ and http://www.cs.odu.edu/~mweigle/Main/Home/ being two totally separate pages?
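Comparing the two TimeMaps programmatically is straightforward, since the Wayback Machine serves link-format TimeMaps through its Memento interface. A sketch, assuming the timemap/link endpoint and that only memento entries carry a datetime attribute:

    # Sketch: fetch the Memento TimeMap for each URL and count the mementos.
    from urllib.request import urlopen

    def memento_count(url):
        with urlopen("http://web.archive.org/web/timemap/link/" + url) as resp:
            body = resp.read().decode("utf-8")
        # In a link-format TimeMap, only memento entries have a datetime.
        return sum(1 for line in body.splitlines() if 'datetime="' in line)

    for u in ("http://www.cs.odu.edu/~mweigle/",
              "http://www.cs.odu.edu/~mweigle/Main/Home/"):
        print(u, memento_count(u))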

Finally, although my main ODU webpage itself is fairly well-archived, several of the links are not.  For example, http://www.cs.odu.edu/~mweigle/Resources/WorkingWithMe is not archived.


Also, several of the links that are archived have not been recently captured.  For instance, the page with my list of students was last archived in 2010 (https://web.archive.org/web/20100621205039/http://www.cs.odu.edu/~mweigle/Main/Students), but none of these students are still at ODU.

Now, I'm off to submit my pages to the Internet Archive's "Save Page Now" service!
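Save Page Now also works without the web form: requesting https://web.archive.org/save/<url> asks the archive to crawl the page and redirects to the freshly created memento. A sketch, using the under-archived pages above as the worklist (the redirect behavior is my assumption):

    # Sketch: submit pages to the "Save Page Now" service via /save/<url>.
    from urllib.request import urlopen

    pages = [
        "http://www.cs.odu.edu/~mweigle/Resources/WorkingWithMe",
        "http://www.cs.odu.edu/~mweigle/Main/Students",
    ]

    for page in pages:
        with urlopen("https://web.archive.org/save/" + page) as resp:
            print(resp.geturl())   # final URL should be the new memento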

--Michele

Monday, March 2, 2015

2015-03-02: Reproducible Research: Lessons Learned from Massive Open Online Courses

Source: Dr. Roger Peng (2011). Reproducible Research in Computational Science. Science 334: 122

Have you ever needed to look back at a program and research data from lab work performed last year, last month, or maybe last week and had a difficult time recalling how the pieces fit together? Or perhaps the reasoning behind the decisions you made while conducting your experiments is now obscure due to incomplete or poorly written documentation. I never gave this idea much thought until I enrolled in a series of Massive Open Online Courses (MOOCs) offered on the Coursera platform. The courses, which I took from August to December 2014, were part of a nine-course specialization in data science. The topics included R Programming, Statistical Inference, and Machine Learning.

Because these courses are entirely free, you might think they would lack academic rigor. That's not the case. In fact, these particular courses and others on Coursera are facilitated by many of the top research universities in the country; the ones I took were taught by professors in the biostatistics department of the Johns Hopkins Bloomberg School of Public Health. I found the work quite challenging and was impressed by the amount of material we covered in each four-week session. Thank goodness for the Q&A forums and the community teaching assistants, as the weekly pre-recorded lectures, quizzes, programming assignments, and peer reviews required a considerable amount of effort.

While the data science courses are primarily focused on data collection, analysis, and methods for producing statistical evidence, there was a persistent theme throughout: the notion of reproducible research. In the figure above, Dr. Roger Peng, a professor at Johns Hopkins University and one of the primary instructors for several of the courses in the specialization, illustrates the spectrum between no replication and full replication, which becomes possible when both the data and the computer code are made available. The theme was reinforced in the programming assignments. Each course concluded with a peer-reviewed major project where we were required to document our methodology, present findings, and provide the code to a group of anonymous reviewers: other students in the course. This task, in itself, was an excellent way to either confirm the validity of your approach or learn new techniques from someone else's submission.

If you're interested in more details, the following short lecture from one of the courses (16:05), also presented by Dr. Peng, gives a concise introduction to the overall concepts and ideas related to reproducible research.





I received an introduction to reproducible research as a component of the MOOCs, but you might be wondering why this concept is important to the data scientist, analyst, or anyone interested in preserving research material. Consider the media accounts in the latter part of 2014 of admonishments of scientists who could not adequately reproduce the results of groundbreaking stem cell research (Japanese Institute Fails to Reproduce Results of Controversial Stem-Cell Research), or the Duke University medical research scandal, which was documented in a 2012 segment of 60 Minutes. On the surface these may seem like isolated incidents, but they’re not.  With some additional investigation, I discovered studies, noted in a November 2013 edition of The Economist, which have shown reproducibility rates as low as 10% for landmark publications in scientific journals (Unreliable Research: Trouble at the Lab). In addition to a loss of credibility for the researcher and the associated institution, scientific discoveries that cannot be reproduced can lead to retracted publications, which affect not only the original researcher but anyone else whose work was informed by possibly erroneous results or faulty reasoning. The challenge of reproducibility is further compounded by technology advances that empower researchers to rapidly and economically collect very large data sets related to their discipline: data that is both volatile and complex. You need only think about how quickly a small data set can grow when it's aggregated with other data sources.


Cartoon by Sidney Harris (The New Yorker)


So, what steps should the researcher take to ensure reproducibility? I found an article published in 2013 that lists Ten Simple Rules for Reproducible Computational Research. These rules are a good summary of the ideas presented in the data science courses.
  • Rule 1: For Every Result, Keep Track of How It Was Produced. This should include the workflow for the analysis, shell scripts, along with the exact parameters and input that was used.
  • Rule 2: Avoid Manual Data Manipulation Steps. Any tweaking of data files or copying and pasting between documents should be performed by a custom script.
  • Rule 3: Archive the Exact Versions of All External Programs Used. This is needed to preserve dependencies between program packages and operating systems that may not be readily available at a later date.
  • Rule 4: Version Control All Custom Scripts. Exact reproduction of results may depend upon a particular script. Version control systems such as Subversion or Git can be used to track the evolution of code as it's being developed.
  • Rule 5: Record All Intermediate Results, When Possible in Standardized Formats. Intermediate results can reveal faulty assumptions and uncover bugs that may not be apparent in the final results.
  • Rule 6: For Analyses That Include Randomness, Note Underlying Random Seeds. Using the same random seed ensures exact reproduction of results rather than approximations (a minimal sketch of this rule follows the list).
  • Rule 7: Always Store Raw Data behind Plots. You may need to modify plots to improve readability. If raw data are stored in a systematic manner, you can modify the plotting procedure instead of redoing the entire analysis.
  • Rule 8: Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected. In order to validate and fully understand the main result, it is often useful to inspect the detailed values underlying any summaries.
  • Rule 9: Connect Textual Statements to Underlying Results. Statements that are connected to underlying results can include a simple file path to detailed results or the ID of a result in the analysis framework.
  • Rule 10: Provide Public Access to Scripts, Runs, and Results. Most journals allow articles to be supplemented with online material. As a minimum, you should submit the main data and source code as supplementary material and be prepared to respond to any requests for further data or methodology details by peers.
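To make Rule 6 concrete, here is a minimal sketch (in Python rather than R, purely for illustration): fixing the seed makes a "random" analysis exactly repeatable, which is what lets a reviewer regenerate your numbers instead of approximating them.

    # Minimal illustration of Rule 6: a fixed random seed makes a "random"
    # analysis exactly repeatable from run to run.
    import random

    def noisy_sample(seed, n=5):
        random.seed(seed)              # record this seed with your results
        return [random.gauss(0, 1) for _ in range(n)]

    print(noisy_sample(42) == noisy_sample(42))   # True: identical draws
    print(noisy_sample(42) == noisy_sample(43))   # False: a different draw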
In addition to the processing rules, we were also encouraged to adopt suitable technology packages as part of our toolkit. The following list represents just a few of the many products we used to assemble a reproducible framework and also introduce literate programming and analytical techniques into the assignments.
  • R and RStudio: Integrated development environment for R.
  • Sweave: An R package that allows you to embed R code in LaTeX documents.
  • Knitr: An R package for dynamic report generation that extends the Sweave approach. It supports publishing to the web using R Markdown and R HTML.
  • R Markdown: Integrates with knitr and RStudio. Allows you to execute R code in chunks and create reproducible documents for display on the web.
  • RPubs: Web publication tool for sharing R markdown files. The gallery of example documents illustrates some useful techniques.
  • Git and GitHub: Git is an open-source distributed version control system; GitHub provides hosting for Git repositories.
  • Apache Subversion (SVN): An open-source centralized version control system.
  • IPython Notebook: Creates literate webpages and documents interactively in Python. You can combine code execution, text, mathematics, plots, and rich media into a single document. This gallery of videos and screencasts includes tutorials and hands-on demonstrations.
  • Notebook Viewer: Web publication tool for sharing iPython notebook files.

As a result of my experience with the MOOCs, I now have a greater appreciation for the importance of reproducible research and all that it encompasses. For more information on the latest developments, you can refer to any of these additional resources or follow Dr. Peng (@rdpeng) on Twitter.

-- Corren McCoy