Rewrite Metadata Validator/SoC 2008
Introduction
Creative Commons participates in Google Summer of Code™ and has accepted a proposal by Hugo Dworak (see the abstract), based on Creative Commons' description of a task to rewrite its now-defunct metadata validator. Asheesh Laroia has been assigned as the mentor of the project. Work began on May 26th, 2008, as per the project timeline.
Synopsis
A licensor or a licensee wants to check whether a digitally embedded Creative Commons license associated with a file (a Web site in particular) is valid, does not use deprecated means to express it, and matches what the user expects. To do so, the user opens the validator's Web site and either pastes the source code as direct input, uploads a file, or provides a link from which to retrieve it. The software displays the results of the analysis in a human-readable manner.
Proposed timeline
- Week 1 — Preparing the SVN/Git or alike repository. Installing the required framework and libraries. Setting up temporary input (source code) and output (result) facilities. Testing the Python Web environment.
- Week 2 — Parsing cc-related RDFa information from well-formed XHTML files.
- Week 3 — Parsing cc-related RDF comments embedded in the XHTML code and those put directly in "head" and "body" elements.
- Week 4 — Parsing cc-related RDF files linked externally or embedded in the "link" element in the header section of the XHTML.
- Week 5 — Parsing cc-related dc-style XHTML-conforming information (embedded in "meta" elements or anchors).
- Week 6 — Ability to clean up the invalid XHTML code whenever possible.
- Midterm milestone — Developing a raw metadata validator capable of parsing Web sites and outputting valid cc-related information about them in a human-readable fashion.
- Week 7 — Parsing cc-related information contained within syndication feeds (RSS 1.0, RSS 2.0, Atom 1.0).
- Week 8 — Ability to submit the URI of a Web site to be parsed or to upload a file using a form. Auto-detection of the MIME content type of such submissions based on HTTP headers and file extension. Ability for the user to override the automatic choice.
- Week 9 — Traversal of embedded objects and special links to obtain more information about licensing terms.
- Week 10 — Generating and storing statistics about the effects of validation such as: content type, errors, means of input, types of licences.
- Week 11 — Extensive testing and providing automatic test suites covering all of the aforementioned capabilities.
- Week 12 — Writing the documentation summarising the architecture. Making the application more user-friendly. Cloning the layout of the Creative Commons Web site.
- Final milestone — A full-fledged Web application capable of parsing licensing information from a variety of sources.
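The content-type detection planned for week 8 could look roughly like the following. This is a minimal sketch using only the Python standard library; the function name `detect_content_type`, its parameters, and the precedence order (user override, then HTTP header, then file extension) are assumptions made for illustration and are not taken from the validator's source code.

```python
import mimetypes
from typing import Optional


def detect_content_type(filename: Optional[str] = None,
                        http_content_type: Optional[str] = None,
                        override: Optional[str] = None,
                        default: str = "application/octet-stream") -> str:
    """Pick a MIME type for a submission (hypothetical helper).

    Assumed precedence: explicit user override, then the HTTP
    Content-Type header, then a guess based on the file extension.
    """
    if override:
        return override.strip().lower()
    if http_content_type:
        # Drop parameters such as "; charset=utf-8".
        media_type = http_content_type.split(";", 1)[0].strip().lower()
        if media_type:
            return media_type
    if filename:
        guessed, _encoding = mimetypes.guess_type(filename)
        if guessed:
            return guessed
    return default
```

The override is checked first so that a user can always correct a wrong automatic choice, for example when a server labels a feed as plain text.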
Accessing the source code
The project has been split into two Git repositories called libvalidator and validator. The former is responsible for parsing the input with regard to licensing information. The latter is a Web application that is going to utilise the former to provide an interface for end users. To browse the source code, one only needs to follow the hyperlinks on the two repository names above. To download the source code to a local machine, one needs to install Git first (which is available under the git-core package in Ubuntu) and then issue the following commands in the console:
<source lang="bash">
git clone git://code.creativecommons.org/libvalidator.git
git clone git://code.creativecommons.org/validator.git
</source>
Related Web applications
- The W3C Markup Validation Service
- The W3C CSS Validation Service
- The W3C RDF Validation Service
- RDFa Distiller
- Online SWI-Prolog RDF parser demo
Web framework and libraries
With Python chosen as the programming language of the application, one has to decide which software to build the Web application on, so that not everything is written from scratch. First of all, a framework is needed to handle the tasks that are typically found in Web applications, such as the implementation of the MVC pattern, URL mapping, and so on. There are several Web application frameworks in Python, for instance Django and TurboGears. As Creative Commons uses the BSD-like licensed Pylons, this is the preferred choice for the project. Pylons supports many template systems, like Mako and Jinja; it is our design choice to use the BSD-like licensed Genshi for this purpose.
Beyond the framework, Python packages have to be chosen to facilitate parsing and extracting information from the documents. To detect the character encoding of files provided by the users of the validator, the dual-licensed (cc-by and LGPL 3 or later) encutils might be used. It is powered by the Universal Encoding Detector (if present) and is designed to handle XML (including XHTML and RSS) and HTML. Next, one cannot expect that the users will provide only well-formed documents; therefore µTidylib (available under an MIT-style license) and the BSD-like licensed Beautiful Soup may be used to clean up the ill-formed mark-up, so that it can still be parsed. Since Python packages based on tidy have poor Unicode support (as of July 2008), only Beautiful Soup will be used to provide the fallback.
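To illustrate the kind of work the encoding-detection step performs, here is a small standard-library sketch. It is a stand-in for illustration only and does not use or reproduce encutils or the Universal Encoding Detector: it merely checks for a byte order mark, an XML declaration, and a `meta` charset hint, in that order. The names `sniff_encoding` and `decode_document` are hypothetical.

```python
import codecs
import re

# UTF-32 marks are listed before UTF-16 because the UTF-32 LE mark
# begins with the same two bytes as the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.IGNORECASE)


def sniff_encoding(data: bytes, default: str = "utf-8") -> str:
    """Guess the encoding of a document from explicit hints only."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    head = data[:1024]  # declarations are expected near the start
    for pattern in (_XML_DECL, _META_CHARSET):
        match = pattern.search(head)
        if match:
            candidate = match.group(1).decode("ascii")
            try:
                # Normalise the label and reject unknown encodings.
                return codecs.lookup(candidate).name
            except LookupError:
                continue
    return default


def decode_document(data: bytes) -> str:
    """Decode a document, replacing undecodable bytes."""
    text = data.decode(sniff_encoding(data), errors="replace")
    # Remove a leading byte order mark left over after decoding.
    return text[1:] if text.startswith("\ufeff") else text
```

A real detector also has to deal with documents that carry no declaration at all, which is where statistical detection comes in.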
Once the document is well-formed and decoded properly, one can proceed to extracting the embedded information relevant to the license terms. This information can be represented in numerous ways. For instance, RDF data can be provided inside comments or as elements inside the head and body elements. Such data can also be encoded using the data: URI scheme or linked externally using the link element. To parse RDF data one can use the BSD-like licensed RDFLib, the dual-licensed (W3C Software License and GNU GPL 2 or newer) rdfxml.py, and many others. To do the same with RDFa one can employ librdfa (which has Python bindings and is licensed under GNU LGPL 2.1 or newer) or the MIT-style licensed rdfadict. However, we have decided to take advantage of pyRdfa (available under the W3C Software License) as it supports not only RDFa, but also Dublin Core expressed in HTML/XHTML meta and link elements.
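As an illustration of the first case mentioned above, RDF placed inside comments, the following sketch pulls license URIs out of an RDF/XML block embedded in an HTML comment. It uses only the standard library rather than RDFLib or pyRdfa, so it is a simplified stand-in and not the approach taken by libvalidator. The function name `licenses_from_rdf_comments` is hypothetical, and the sketch assumes the RDF block declares its own namespaces and uses either the older `http://web.resource.org/cc/` or the newer `http://creativecommons.org/ns#` vocabulary.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
CC_NAMESPACES = (
    "http://creativecommons.org/ns#",
    "http://web.resource.org/cc/",
)


class _CommentCollector(HTMLParser):
    """Collect the text of every comment in the mark-up."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


def licenses_from_rdf_comments(markup: str) -> list:
    """Return license URIs found in RDF/XML embedded in comments."""
    collector = _CommentCollector()
    collector.feed(markup)
    collector.close()
    found = []
    for comment in collector.comments:
        start = comment.find("<rdf:RDF")
        end = comment.rfind("</rdf:RDF>")
        if start == -1 or end == -1:
            continue  # not an RDF comment
        snippet = comment[start:end + len("</rdf:RDF>")]
        try:
            root = ET.fromstring(snippet)
        except ET.ParseError:
            continue  # ill-formed RDF/XML is skipped in this sketch
        for namespace in CC_NAMESPACES:
            for element in root.iter("{%s}license" % namespace):
                uri = element.get("{%s}resource" % RDF_NS)
                if uri and uri not in found:
                    found.append(uri)
    return found
```

A full RDF parser is still needed in practice, because the same statement can be serialised in several equivalent ways that this tag-based lookup does not cover.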
We shall use nose for unit testing and both setuptools and zc.buildout to handle dependencies and build Python packages. Git is the version control system of choice.
Helper scripts
The following tools have been developed to support the work, though they are not directly related to the project:
- Wine browsers setup — a Bash script to install Safari and Microsoft Internet Explorer with the Internet Explorer Developer Toolbar under Wine,
- IRC Scanner — a PHP 5 script that searches for occurrences of a particular nickname in the logs of the Creative Commons IRC channel,
- PyDev integration with the Darklooks theme — syntax highlighting on a dark background for PyDev.