Rewrite Metadata Validator/SoC 2008


Introduction

Creative Commons participates in Google Summer of Code™ and has accepted a proposal (see the abstract) by Hugo Dworak, based on its description of a task to rewrite its now-defunct metadata validator. Asheesh Laroia has been assigned as the mentor of the project. The work began on May 26th, 2008, as per the project timeline.

Synopsis

A licensor or a licensee wants to check whether a Creative Commons license digitally embedded in a file (a Web site in particular) is valid, does not use deprecated means of expressing it, and matches what the user expects. To do so, one opens the validator's Web site and either pastes the source code as direct input, uploads a file, or provides a link from which it can be retrieved. The software then displays the results of the analysis in a human-readable manner.

Proposed timeline

  • Week 1 — Preparing the SVN/Git or alike repository. Installing the required framework and libraries. Setting up temporary input (source code) and output (result) facilities. Testing the Python Web environment.
  • Week 2 — Parsing cc-related RDFa information from well-formed XHTML files.
  • Week 3 — Parsing cc-related RDF comments embedded in the XHTML code and those put directly in "head" and "body" elements.
  • Week 4 — Parsing cc-related RDF files linked externally or embedded in the "link" element in the header section of the XHTML.
  • Week 5 — Parsing cc-related dc-style XHTML-conforming information (embedded in "meta" elements or anchors).
  • Week 6 — Ability to clean up the invalid XHTML code whenever possible.
  • Midterm milestone — Developing a raw metadata validator capable of parsing Web sites and outputting valid cc-related information about them in a human-readable fashion.
  • Week 7 — Parsing cc-related information contained within syndication feeds (RSS 1.0, RSS 2.0, Atom 1.0).
  • Week 8 — Ability to submit the URI of a Web site to be parsed and to upload a file using a form. Auto-detection of the MIME content type of such submissions based on HTTP headers and the file extension. Ability for the user to override the automatic choice (a sketch of this detection follows the list).
  • Week 9 — Traversal of embedded objects and special links to obtain more information about licensing terms.
  • Week 10 — Generating and storing statistics about the effects of validation, such as content type, errors, means of input, and types of licenses.
  • Week 11 — Extensive testing and providing automatic test suites covering all of the aforementioned capabilities.
  • Week 12 — Writing documentation summarizing the architecture. Making the application more user-friendly. Cloning the layout of the Creative Commons Web site.
  • Final milestone — A full-fledged Web application capable of parsing licensing information from a variety of sources.
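
The MIME auto-detection planned for Week 8 can be prototyped with the Python standard library alone. The following is a minimal sketch, not the project's actual code: the function guess_content_type and its precedence rules (user override first, then the HTTP Content-Type header, then the file extension) are illustrative assumptions.

  import mimetypes

  def guess_content_type(url, content_type_header=None, user_override=None):
      # Precedence: explicit user choice, then the HTTP Content-Type header,
      # then a guess based on the file extension of the URL.
      if user_override:
          return user_override
      if content_type_header:
          # Strip parameters such as "; charset=utf-8".
          return content_type_header.split(';')[0].strip()
      guessed, _ = mimetypes.guess_type(url)
      return guessed or 'application/octet-stream'

  print(guess_content_type('http://example.org/page.html'))
  # -> text/html (guessed from the extension)
  print(guess_content_type('http://example.org/feed',
                           'application/rss+xml; charset=utf-8'))
  # -> application/rss+xml (taken from the header)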

Related Web applications

Web framework and libraries

With Python chosen as the programming language of the application, one has to decide which software to use in building the Web application, so that not everything is written from scratch. The first decision is the framework, which will handle the tasks typically found in Web applications, such as the implementation of the MVC pattern, URL mapping, and so on. There are several Web application frameworks in Python, for instance Django and TurboGears. As Creative Commons already uses the BSD-like licensed Pylons, it is the preferred choice for the project. Pylons supports many template systems, such as Mako and Jinja, and it is our design choice to use the BSD-like licensed Genshi for this purpose.
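
As a taste of the chosen templating system, here is a minimal, self-contained Genshi sketch, independent of Pylons; the template mark-up and the message variable are invented for the example:

  from genshi.template import MarkupTemplate

  # A template whose placeholder is filled with the validator's output.
  template = MarkupTemplate("""
  <html xmlns:py="http://genshi.edgewall.org/">
    <body>
      <p py:content="message">placeholder</p>
    </body>
  </html>
  """)

  stream = template.generate(message='The work is licensed under CC BY 3.0.')
  print(stream.render('xhtml'))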

Beyond the framework, one must choose Python packages to facilitate parsing and extracting information from the documents. To detect the character encoding of files provided by the users of the validator, the dual-licensed (cc-by and LGPL 3 or later) encutils shall be used; it is powered by the Universal Encoding Detector (if present) and is designed to handle XML (including XHTML and RSS) and HTML. Furthermore, one cannot expect that users will provide only well-formed documents, so µTidylib (available under an MIT-style license) and the BSD-like licensed Beautiful Soup may be used to clean up ill-formed mark-up so that it can still be parsed.
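
For illustration, a minimal sketch of such a clean-up with Beautiful Soup (shown with the modern bs4 package name; the 2008-era library was imported as BeautifulSoup, but the idea is identical):

  from bs4 import BeautifulSoup

  # An ill-formed fragment: the <b> and <p> elements are never closed.
  ill_formed = '<html><body><p>Licensed under <b>CC BY</body></html>'

  soup = BeautifulSoup(ill_formed, 'html.parser')
  print(soup.prettify())  # every element is closed, so the tree can be processed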

Once the document is well-formed and decoded properly, one can proceed to extracting the embedded information relevant to the license terms. It can be represented in numerous ways. For instance, RDF data can be provided inside comments or as elements inside the "head" and "body" elements. Such data can also be encoded using a data: URI scheme or linked externally using the "link" element. To parse RDF data one can use the BSD-like licensed RDFLib, the dual-licensed (W3C Software License and GNU GPL 2 or newer) rdfxml.py, and many others. To do the same with RDFa one can employ librdfa (which has Python bindings and is licensed under GNU LGPL 2.1 or newer) or the MIT-style licensed rdfadict.
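
As an illustration, a minimal RDFLib sketch; the sample RDF/XML document below is invented for the example and states a license with the ccREL cc:license predicate:

  import rdflib

  # A sample RDF/XML fragment of the kind embedded in comments or linked
  # externally, asserting the license of a work.
  RDF_XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                        xmlns:cc="http://creativecommons.org/ns#">
    <rdf:Description rdf:about="http://example.org/work">
      <cc:license rdf:resource="http://creativecommons.org/licenses/by/3.0/"/>
    </rdf:Description>
  </rdf:RDF>"""

  graph = rdflib.Graph()
  graph.parse(data=RDF_XML, format='xml')

  # Query the graph for every (work, cc:license, license) triple.
  CC_LICENSE = rdflib.URIRef('http://creativecommons.org/ns#license')
  for work, _, license_uri in graph.triples((None, CC_LICENSE, None)):
      print('%s is licensed under %s' % (work, license_uri))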