Tuesday, September 8, 2015

2015-09-08: Releasing an Open Source Python Project, the Services That Brought py-memento-client to Life

The LANL Library Prototyping Team recently received correspondence from a member of the Wikipedia team requesting Python code that could find the best URI-M for an archived web page based on the date of the page revision. Collaborating with Wikipedia, Harihar Shankar, Herbert Van de Sompel, Michael Nelson, and I were able to create the py-mement-client Python library to suit the needs of pywikibot.

Over the course of library development, Wikipedia suggested the use of two services, Travis CI and Pypi, that we had not used before.  We were very pleased with the results of those services and learned quite a bit from the experience.  We have been using GitHub for years, and also include it here as part of the development toolchain for this Python project.

We present three online services that solved the following problems for our Python library:
  1. Where do we store source code and documentation for the long term? - GitHub
  2. How do we ensure the project is well tested in an independent environment?  - Travis CI
  3. Where do we store the final installation package for others to use? - Pypi
We start first with storing the source code.

GitHub

As someone who is concerned about the longevity of the scholarly record, I cannot emphasize enough how important it is to check your code in somewhere safe.  GitHub provides a wide variety of tools, at no cost, that allow one to preserve and share their source code.

Git and GitHub are not the same thing.  Git is just a source control system.  GitHub is a dedicated web site providing additional tools and hosting for git repositories.

Here are some of the benefits of just using Git (without GitHub):
  1. Distributed authoring - many people can work separately on the same code and commit to the same place
  2. Branching is built in, allowing different people to work on features in isolation (like unfinished support for TimeMaps)
  3. Tagging can easily be done to annotate a commit for release
  4. Many IDEs and other development tools support Git out of the box
  5. Ease of changing remote git repositories if switching from one site to another is required
  6. Every git clone is actually a copy of the master branch of the repository and all of its history, talk about LOCKSS!!!
That last one is important.  It means that all one needs to do is clone a git repository and they now have a local archive of that repository branch, with complete history, at the time of cloning.  This is in contrast to other source control systems, such as Subversion, where the server is the only place storing the full history of the repository.  Using git avoids this single point of failure, allowing us to still have a archival copy, including history, in the case that our git local server or GitHub goes away.


Here are some of the benefits of using GitHub:
  1. Collaboration with others inside and outside of the project team, through the use of pull requests, code review, and an issue tracker
  2. Provides a GUI for centralizing and supporting the project
  3. Allows for easy editing of documentation using Markdown, and also provides a wiki, if needed
  4. The wiki can also be cloned as a Git repository for archiving!
  5. Integrates with a variety of web services, such as Travis CI
  6. Provides release tools that allow adding of release notes to tags while providing compiled downloads for users
  7. Provides a pretty-parsed view of the code where quick edits can be made on the site itself
  8. Allows access from multiple Internet-connected platforms (phone, tablet, laptop, etc.)
  9. And so much more that we have not yet explored....
We use GitHub for all of these reasons and we are just scratching the surface.  Now that we have our source code centralized, how do we independently build and test it?

Travis CI

Travis CI provides a continuous integration environment for code. In our case, we use it to determine the health of the existing codebase.

We use it to evaluate code for the following:
  1. Does it compile? - tests for syntax and linking errors
  2. Can it be packaged? - tests for build script and linking errors
  3. Does it pass automated tests? - tests that the last changes have not broken functionality
Continuous integration provides an independent test of the code. In many cases, developers get code to work on their magic laptop or their magic network and it works for no one else. Continuous Integration is an attempt to mitigate that issue.

Of course, far more can be done with continuous integration, like publish released binaries, but with our time and budget, the above is all we have done thus far.

Travis CI provides a free continuous integration environment for code.  It easily integrates with GitHub.  In fact, if a user has a GitHub account, logging into Travis CI will produce a page listing all GitHub projects that they have access to. To enable a project for building, one just ticks the slider next to the desired project.

It then detects the next push to GitHub and builds the code based on the a .travis.yml file, if present in the root of the Git repository.

The .travis.yml file has a relatively simple syntax whereby one specifies the language, language version, environment variables, pre-requisite requirements, and then build steps.

Our .travis.yml looks as follows:

language: python
cache: # caching is only available for customers who pay
    directories:
        - $HOME/.cache/pip
python:
    - "2.7"
    - "3.4"
env:
    - DEBUG_MEMENTO_CLIENT=1
install: 
    - "pip install requests"
    - "pip install pytest-xdist"
    - "pip install ."
script:
    - python setup.py test
    - python setup.py sdist bdist_wheel
branches:
    only:
        - master

The language section tells Travis CI which language is used by the project. Many languages are available, including Ruby and Java.

The cache section allows caching of installed library dependencies on the server between builds. Unfortunately, the cache section is only available for paid customers.

The python section lists for which versions of Python the project will be built.  Travis CI will attempt a parallel build in every version specified here.  The Wikimedia folks wanted our code to work with both Python 2.7 and 3.4.

The env section contains environment variables for the build.

The install section runs any commands necessary for installing additional dependencies prior to the build.  We use it in this example to install dependencies for testing.  In the current version this section is removed because we now handle dependencies directly via Python's setuptools, but it is provided here for completeness.

The script section is where the actual build sequence occurs.  This is where the steps are specified for building and testing the code.   In our case, Python needs no compilation, so we skip straight to our automated tests before doing a source and binary package to ensure that our setup.py is configured correctly.

Finally, the branches section is where one can indicate additional branches to build.  We only wanted to focus on master for now.

There is extensive documentation indicating what else one can do with .travis.yml.

Once changes have have pushed to GitHub, Travis CI detects the push and begins a build.  As seen below, there are two builds for py-memento-client:  for Python 2.7 and 3.4.



Clicking on one of these boxes allows one to watch the results of a build in real time, as shown below. Also present is a link allowing one to download the build log for later use.


All of the builds that have been performed are available for review.  Each entry contains information about the the commit, including who performed the commit, as well as how long it took, when it took place, how many tests passed, and, most importantly, if it was successful.  Status is indicated by color:  green for success, red for failure, and yellow for in progress.


Using Travis CI we were able to provide an independent sanity check on py-memento-client, detecting test data that was network-dependent and also eliminating platform-specific issues.  We developed py-memento-client on OSX, tested it at LANL on OSX and Red Hat Enterprise Linux, but Travis CI runs on Ubuntu Linux so we now have confidence that our code performs well in different environments.
Closing thought:  all of this verification only works as well as the automated tests, so focus on writing good tests.  :)

Pypi

Finally, we wanted to make it straightforward to install py-memento-client and all of its dependencies:

pip install memento_client

Getting there required Pypi, a site that globally hosts Python projects (mostly libraries).  Pypi not only provides storage for built code so that others can download it, it also requires that metadata be provided so that others can see what functionality the project provides.  Below is an image of the Pypi splash page for the py-memento-client.


Getting support for Pypi and producing the data for this splash page required that we use Python setuptools for our build. Our setup.py file, inspired by Jeff Knupp's "Open Sourcing a Python Project the Right Way", provides support for a complete build of the Python project.  Below we highlight the setup function that is the cornerstone of the whole build process.

setup(
    name="memento_client",
    version="0.5.1",
    url='https://github.com/mementoweb/py-memento-client',
    license='LICENSE.txt',
    author="Harihar Shankar, Shawn M. Jones, Herbert Van de Sompel",
    author_email="prototeam@googlegroups.com",
    install_requires=['requests>=2.7.0'],
    tests_require=['pytest-xdist', 'pytest'],
    cmdclass={
        'test': PyTest,
        'cleanall': BetterClean
        },
    download_url="https://github.com/mementoweb/py-memento-client",
    description='Official Python library for using the Memento Protocol',
    long_description="""
The memento_client library provides Memento support, as specified in RFC 7089 (http://tools.ietf.org/html/rfc7089)
For more information about Memento, see http://www.mementoweb.org/about/.
This library allows one to find information about archived web pages using the Memento protocol.  It is the goal of this library to make the Memento protocol as accessible as possible to Python developers.
""",
    packages=['memento_client'],
    keywords='memento http web archives',
    extras_require = {
        'testing': ['pytest'],
        "utils": ["lxml"]
    },
    classifiers=[

        'Intended Audience :: Developers',

        'License :: OSI Approved :: BSD License',

        'Operating System :: OS Independent',

        'Topic :: Internet :: WWW/HTTP',
        'Topic :: Scientific/Engineering',
        'Topic :: Software Development :: Libraries :: Python Modules',
        'Topic :: Utilities',

        'Programming Language :: Python :: 2.7',
        'Programming Language :: Python :: 3.4'
    ]
)

Start by creating this function call to setup, supplying all of these named arguments.  Those processed by Pypy are name, version, url, license, author, download_url, description, long_description, keywords, and classifiers.  The other arguments are used during the build to install dependencies and run tests.

The name and version arguments are used as the title for the Pypi page.  They are also used by those running pip to install the software.  Without these two items, pip does not know what it is installing.

The url argument is interpreted by Pypi as Home Page and will display on the web page using that parameter.

The license argument is used to specify how the library is licensed. Here we have a defect, we wanted users to refer to our LICENSE.txt file, but Pypi interprets it literally, printing License: LICENSE.txt.  We may need to fix this.

The author argument maps to the Pypi Author field and will display literally as typed, so commas are used to separate authors.

The download_url argument maps to the Pypi Download URL field.

The description argument becomes the subheading of the Pypi splash page.

The long_description argument becomes the body text of the Pypi splash page.  All URIs become links, but attempts to put HTML into this field produced a spash page displaying HTML, so we left it as text until we required richer formatting.

The keywords argument maps to the Pypi Keywords field.

The classifiers argument maps to the Pypi Categories field.  When choosing classifiers for a project, use this registry.  This field is used to index the project on Pypi to make finding it easier for end user.

For more information on what goes into setup.py, check out "Packaging and Distributing Projects" and "The Python Package Index (PyPI)" on the Python.org site.

Once we had our setup.py configured appropriately, we had to register for an account with Pypi.  We then created a .pypirc file in the builder's home directory with the contents shown below.

[distutils]
index-servers =
    pypi

[pypi]
repository: https://pypi.python.org/pypi
username: hariharshankar
password: <password>

The username and password fields must both be present in this file. We encountered a defect while uploading the content whereby the setuptools did not prompt for the password if it was not present and the download failed.

Once that is in place, use the existing setup.py to register the project from the project's source directory:

python setup.py register

Once that is done, the project show up on the Pypi web site under the Pypi account. After that, publish it by typing:

python setup sdist upload

And now it will show up on Pypi for others to use.

Of course, one can also deploy code directly to Pypi using Travis CI, but we have not yet attempted this.

Conclusion


Open source development has evolved quite a bit over the last several years.  The first successful achievement being sites such as Freshmeat (now defunct) and SourceForge, providing free repositories and publication sites for projects.  GitHub fulfills this role now, but developers and researchers need more complex tools.

Travis CI, coupled with good automated tests, allows independent builds, and verification that software works correctly.  It ensures that a project not only compiles for users, but also passes functional tests in an independent environment.  As noted, one can even use it to deploy software directly.

Pypi is a Python-specific repository of Python libraries and other projects.  It is the backend of the pip tool commonly used by Python developers to install libraries.  Any serious Python development team should consider the use of Pypi for hosting and providing easy access to their code.

Using these three tools, we not only developed py-memento-client in a small amount of time, but we also independently tested, and published that library for others to enjoy.

--Shawn M. Jones
Graduate Research Assistant, Los Alamos National Laboratory
PhD Student, Old Dominion University

No comments:

Post a Comment