Monday, March 2, 2015

2015-03-02 Reproducible Research: Lessons Learned from Massive Open Online Courses

Source: Dr. Roger Peng (2011). Reproducible Research in Computational Science. Science 334: 122

Have you ever needed to look back at a program and research data from lab work performed last year, last month or maybe last week and had a difficult time recalling how the pieces fit together? Or, perhaps the reasoning behind the decisions you made while conducting your experiments is now obscure due to incomplete or poorly written documentation.  I never gave this idea much thought until I enrolled in a series of Massive Open Online Courses (MOOCs) offered on the Coursera platform. The courses, which I took during the period from August to December of 2014, were part of a nine course specialization in the area of data science. The various topics included R Programming, Statistical Inference and Machine Learning. Because these courses are entirely free, you might think they would lack academic rigor. That's not the case. In fact, these particular courses and others on Coursera are facilitated by many of the top research universities in the country. The courses I took were taught by professors in the biostatistics department of the Johns Hopkins Bloomberg School of Public Health. I found the work to be quite challenging and was impressed by the amount of material we covered in each four-week session. Thank goodness for the Q&A forums and the community teaching assistants as the weekly pre-recorded lectures, quizzes, programming assignments, and peer reviews required a considerable amount of effort each week.

While the data science courses are primarily focused on data collection, analysis and methods for producing statistical evidence, there was a persistent theme throughout -- this notion of reproducible research. In the figure above, Dr. Roger Peng, a professor at Johns Hopkins University and one of the primary instructors for several of the courses in the data science specialization, illustrates the gap between no replication and the possibilities for full replication when both the data and the computer code are made available. This was a recurring theme that was reinforced with the programming assignments. Each course concluded with a peer-reviewed major project where we were required to document our methodology, present findings and provide the code to a group of anonymous reviewers; other students in the course. This task, in itself, was an excellent way to either confirm the validity of your approach or learn new techniques from someone else's submission.

If you're interested in more details, the following short lecture from one of the courses (16:05), also presented by Dr. Peng, gives a concise introduction to the overall concepts and ideas related to reproducible research.

I received an introduction to reproducible research as a component of the MOOCs, but you might be wondering why this concept is important to the data scientist, analyst or anyone interested in preserving research material. Consider the media accounts in the latter part of 2014 of admonishments for scientists who could not adequately reproduce the results of groundbreaking stem cell research (Japanese Institute Fails to Reproduce Results of Controversial Stem-Cell Research) or the Duke University medical research scandal which was documented in a 2012 segment of 60 Minutes. On the surface these may seem like isolated incidents, but they’re not.  With some additional investigation, I discovered some studies, as noted in a November 2013 edition of The Economist, which have shown reproducibility rates as low as 10% for landmark publications posted in scientific journals (Unreliable Research: Trouble at the Lab). In addition to a loss of credibility for the researcher and the associated institution, scientific discoveries which cannot be reproduced can also lead to retracted publications which affect not only the original researcher but anyone else whose work was informed by possibly erroneous results or faulty reasoning. The challenge of reproducibility is further compounded by technology advances that empower researchers to rapidly and economically collect very large data sets related to their discipline; data which is both volatile and complex. You need only think about how quickly a small data set can grow when it's aggregated with other data sources.

Cartoon by Sidney Harris (The New Yorker)

So, what steps should the researcher take to ensure reproducibility? I found an article published in 2013, which lists Ten Simple Rules for Reproducible Computational Research. These rules are a good summary of the ideas that were presented in the data science courses.
  • Rule 1: For Every Result, Keep Track of How It Was Produced. This should include the workflow for the analysis, shell scripts, along with the exact parameters and input that was used.
  • Rule 2: Avoid Manual Data Manipulation Steps. Any tweaking of data files or copying and pasting between documents should be performed by a custom script.
  • Rule 3: Archive the Exact Versions of All External Programs Used. This is needed to preserve dependencies between program packages and operating systems that may not be readily available at a later date.
  • Rule 4: Version Control All Custom Scripts. Exact reproduction of results may depend upon a particular script. Archiving tools such as Subversion or Git can be used to track the evolution of code as its being developed.
  • Rule 5: Record All Intermediate Results, When Possible in Standardized Formats. Intermediate results can reveal faulty assumptions and uncover bugs that may not be apparent in the final results.
  • Rule 6: For Analyses That Include Randomness, Note Underlying Random Seeds. Using the same random seed ensures exact reproduction of results rather than approximations.
  • Rule 7: Always Store Raw Data behind Plots. You may need to modify plots to improve readability. If raw data are stored in a systematic manner, you can modify the plotting procedure instead of redoing the entire analysis.
  • Rule 8: Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected. In order to validate and fully understand the main result, it is often useful to inspect the detailed values underlying any summaries.
  • Rule 9: Connect Textual Statements to Underlying Results. Statements that are connected to underlying results can include a simple file path to detailed results or the ID of a result in the analysis framework.
  • Rule 10: Provide Public Access to Scripts, Runs, and Results. Most journals allow articles to be supplemented with online material. As a minimum, you should submit the main data and source code as supplementary material and be prepared to respond to any requests for further data or methodology details by peers.
In addition to the processing rules, we were also encouraged to adopt suitable technology packages as part of our toolkit. The following list represents just a few of the many products we used to assemble a reproducible framework and also introduce literate programming and analytical techniques into the assignments.
  • R and RStudio: Integrated development environment for R.
  • Sweave: An R package that allows you to embed R code in LaTeX documents.
  • Knitr: New enhancements to the Sweave package for dynamic report generation. It supports publishing to the web using R Markdown and R HTML.
  • R Markdown: Integrates with knitr and RStudio. Allows you to execute R code in chunks and create reproducible documents for display on the web.
  • RPubs: Web publication tool for sharing R markdown files. The gallery of example documents illustrates some useful techniques.
  • Git and GitHub: Open source version control repository.
  • Apache Subversion (SVN): Open source version control repository.
  • iPython Notebook: Creates literate webpages and documents interactively in Python. You can combine code execution, text, mathematics, plots and rich media into a single document. This gallery of videos and screencasts includes tutorials and hands-on demonstrations.
  • Notebook Viewer: Web publication tool for sharing iPython notebook files.

As a result of my experience with the MOOCs, I now have a greater appreciation for the importance of reproducible research and all that it encompasses. For more information on the latest developments, you can refer to any of these additional resources or follow Dr. Peng (@rdpeng) on Twitter.

-- Corren McCoy

No comments:

Post a Comment