Creative Commons participates in Google Summer of Code™ and has accepted a proposal (see the abstract) by Hugo Dworak, based on its description of a task to rewrite its now-defunct metadata validator. Asheesh Laroia has been assigned as the mentor of the project. The work began on May 26th, 2008, as per the project timeline.
A licensor or a licensee wants to check whether a digitally embedded Creative Commons license associated with a file (a Web site in particular) is valid, does not use deprecated means of expression, and matches what the user expects. To do so, one opens the validator Web site and either pastes the source code directly, uploads a file, or provides a link from which to retrieve it. The software then displays the results of the analysis in a human-readable manner.
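To make the three input modes concrete, here is a minimal sketch of how the validator might normalise them into one source string. The function name and parameters are illustrative assumptions, not the project's actual API, and the URL branch uses the standard library's urllib.request:

<source lang="python">
import urllib.request

def fetch_source(direct_input=None, uploaded_file=None, url=None):
    """Return the markup to validate, from whichever input the user chose.

    Illustrative sketch only: the real validator's interface may differ.
    """
    if direct_input is not None:
        # Source code pasted directly into the form.
        return direct_input
    if uploaded_file is not None:
        # File uploaded by the user; read its contents.
        return uploaded_file.read()
    if url is not None:
        # Link provided; retrieve the document over HTTP.
        return urllib.request.urlopen(url).read()
    raise ValueError("no input provided")
</source>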
The project has been split into two Git repositories, called libvalidator and validator. The former is responsible for parsing the input with regard to licensing information; the latter is a Web application that will use it to provide an interface for end users. To browse the source code, simply follow the two hyperlinks in the previous paragraph. To download the source code to a local machine, first install Git (available as the git-core package in Ubuntu) and then issue the following commands in the console:

<source lang="bash">
git clone git://code.creativecommons.org/libvalidator.git
git clone git://code.creativecommons.org/validator.git
</source>
With Python as the programming language of the application, one has to decide which software to use in building a Web application, so that not everything is written from scratch. First comes the framework, which will handle the tasks typically found in Web applications, such as implementing the MVC pattern, URL mapping, and so on. There are several Web application frameworks for Python, for instance Django and TurboGears. As Creative Commons uses the BSD-like licensed Pylons, this is the preferred choice for the project. Pylons supports many template systems, such as Mako and Jinja, and it is our design choice to use the BSD-like licensed Genshi for this purpose.
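To illustrate what URL mapping buys us, here is the idea reduced to its bare essence. This is a hypothetical sketch, not the Pylons API: a framework maintains a mapping from URL paths to controller callables and dispatches each request accordingly.

<source lang="python">
# Hypothetical sketch of URL mapping; names are illustrative, not Pylons'.

def validate_controller(params):
    # A controller receives request parameters and returns a response body.
    return "validating %s" % params.get("url", "(pasted input)")

routes = {
    "/validate": validate_controller,
}

def dispatch(path, params):
    # Look up the controller registered for this path and invoke it.
    controller = routes.get(path)
    if controller is None:
        return "404 Not Found"
    return controller(params)
</source>

A real framework adds pattern matching, HTTP handling, and template rendering on top of this core idea.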
Beyond the above, it comes down to choosing Python packages to facilitate parsing and extracting information from the documents. To detect the character encoding of files provided by the users of the validator, the dual-licensed (cc-by and LGPL 3 or later) encutils might be used. It is powered by the Universal Encoding Detector (if present) and is designed to handle XML (including XHTML and RSS) as well as HTML. Next, one cannot expect users to provide only well-formed documents; therefore µTidylib (available under an MIT-style license) and the BSD-like licensed Beautiful Soup may be used to clean up ill-formed markup so that it can still be parsed. Since Python packages based on tidy have poor Unicode support (as of July 2008), only Beautiful Soup will be used as the fallback.
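The decoding step can be pictured as a fallback chain. The sketch below is an assumption about the general shape, not the encutils implementation: attempt a strict UTF-8 decode first, then fall back to a lenient single-byte decoding (a real detector consults BOMs, XML declarations, and HTTP headers as well):

<source lang="python">
def decode_markup(raw_bytes):
    """Simplified fallback sketch; encutils performs real detection."""
    try:
        # Strict attempt: most modern documents are UTF-8.
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        # Lenient fallback: Latin-1 maps every byte to a code point,
        # so this never raises, at the cost of possible mojibake.
        return raw_bytes.decode("latin-1")
</source>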
Once the document is well-formed and decoded properly, one can proceed to extract the embedded information relevant to the license terms. Such information can be represented in numerous ways. For instance, RDF data can be provided inside comments or as elements within the head and body elements. It can also be encoded using the data: URI scheme or linked externally using the link element. To parse RDF data, one can use the BSD-like licensed RDFLib, the dual-licensed (W3C Software License and GNU GPL 2 or newer) rdfxml.py, and many others. To do the same with RDFa, one can employ librdfa (which has Python bindings and is licensed under GNU LGPL 2.1 or newer) or the MIT-style licensed rdfadict. However, we have decided to take advantage of pyRdfa (available under the W3C Software License), as it supports not only RDFa but also Dublin Core expressed in HTML/XHTML meta and link elements.
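As a taste of the simplest of these representations, the following sketch pulls license URIs out of <code>link</code> elements using the standard library's HTML parser (spelled <code>html.parser</code> in modern Python). This only covers the <code>rel="license"</code> case; the tools named above handle the full range of RDF and RDFa embeddings:

<source lang="python">
from html.parser import HTMLParser

class LicenseLinkExtractor(HTMLParser):
    """Collect href values of <link rel="license"> elements."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("rel") == "license":
            self.licenses.append(attrs.get("href"))

extractor = LicenseLinkExtractor()
extractor.feed('<html><head><link rel="license" '
               'href="http://creativecommons.org/licenses/by/3.0/"/>'
               '</head><body></body></html>')
</source>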
The following tools, though not directly related to the project, have been developed to facilitate the work: