Rewrite Metadata Validator/SoC 2008
Introduction
Creative Commons participates in Google Summer of Code™ and has accepted a proposal (see the abstract) by Hugo Dworak, based on Creative Commons' description of the task: rewriting its now-defunct metadata validator. Asheesh Laroia has been assigned as mentor of the project. Work began on May 26, 2008, as per the project timeline.
Synopsis
A licensor or a licensee wants to check whether a digitally embedded Creative Commons license associated with a file (a Web site in particular) is valid, does not use deprecated means of expression, and matches what the user expects. To do so, the user opens the validator's Web site and either pastes the source code as direct input, uploads a file, or provides a link from which to retrieve it. The software then displays the results of the analysis in a human-readable manner.
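The core check described in the synopsis can be illustrated with a minimal sketch. It assumes well-formed XHTML input and looks only at the `rel="license"` convention; it is not a full RDFa processor and does not cover the other embedding methods listed in the timeline. The names `find_license_uris`, `check_license`, and `SAMPLE` are hypothetical and do not come from the project.

```python
import xml.etree.ElementTree as ET

# Common prefix of Creative Commons license URIs (used for a naive check only).
CC_PREFIX = "http://creativecommons.org/licenses/"


def find_license_uris(xhtml_source):
    """Return the href of every element whose rel attribute contains 'license'."""
    # ET.fromstring raises ParseError if the input is not well-formed XML.
    root = ET.fromstring(xhtml_source)
    uris = []
    for element in root.iter():
        rel_tokens = element.get("rel", "").lower().split()
        href = element.get("href")
        if "license" in rel_tokens and href:
            uris.append(href)
    return uris


def check_license(xhtml_source, expected_uri):
    """Summarise which license URIs were found and whether one matches."""
    found = find_license_uris(xhtml_source)
    return {
        "found": found,
        "is_cc": [uri for uri in found if uri.startswith(CC_PREFIX)],
        "matches_expected": expected_uri in found,
    }


# Hypothetical input, as a user might paste it into the validator.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Example</title></head>
  <body>
    <p>Licensed under
      <a rel="license" href="http://creativecommons.org/licenses/by/3.0/">CC BY 3.0</a>.
    </p>
  </body>
</html>"""

if __name__ == "__main__":
    print(check_license(SAMPLE, "http://creativecommons.org/licenses/by/3.0/"))
```

The result is a plain dictionary, so a human-readable report could be rendered from it in whatever form the Web front end requires.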
Proposed timeline
- Week 1 — Preparing an SVN, Git, or similar repository. Installing the required framework and libraries. Setting up temporary input (source code) and output (result) facilities. Testing the Python Web environment.
- Week 2 — Parsing cc-related RDFa information from well-formed XHTML files.
- Week 3 — Parsing cc-related RDF comments embedded in the XHTML code and those put directly in "head" and "body" elements.
- Week 4 — Parsing cc-related RDF files linked externally or embedded in the "link" element in the header section of the XHTML.
- Week 5 — Parsing cc-related dc-style XHTML-conforming information (embedded in "meta" elements or anchors).
- Week 6 — Ability to clean up the invalid XHTML code whenever possible.
- Midterm milestone — Developing a raw metadata validator capable of parsing Web sites and outputting valid cc-related information about them in a human-readable fashion.
- Week 7 — Parsing cc-related information contained within syndication feeds (RSS 1.0, RSS 2.0, Atom 1.0).
- Week 8 — Ability to submit the URI of a Web site to be parsed, or to upload a file, using a form. Auto-detection of the MIME content type of such submissions based on HTTP headers and file extension. Ability for the user to override the automatic choice.
- Week 9 — Traversal of embedded objects and special links to obtain more information about licensing terms.
- Week 10 — Generating and storing statistics about validation results, such as content type, errors, means of input, and types of licences.
- Week 11 — Extensive testing and providing automatic test suites covering all of the aforementioned capabilities.
- Week 12 — Writing the documentation summarising the architecture. Making the application more user-friendly. Cloning the layout of the Creative Commons Web site.
- Final milestone — A full-fledged Web application capable of parsing licensing information from a variety of sources.
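The content-type handling planned for Week 8 could be sketched roughly as follows. This is only an illustration under stated assumptions: the precedence order (user override, then HTTP header, then file extension), the function name `detect_content_type`, and the `EXTENSION_TYPES` table are all hypothetical, since the timeline does not specify them.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical extension table; a real validator would likely support more types.
EXTENSION_TYPES = {
    ".html": "text/html",
    ".htm": "text/html",
    ".xhtml": "application/xhtml+xml",
    ".rdf": "application/rdf+xml",
    ".rss": "application/rss+xml",
    ".atom": "application/atom+xml",
    ".xml": "application/xml",
}


def detect_content_type(content_type_header=None, location=None, override=None):
    """Pick a MIME type: user override, then HTTP header, then file extension."""
    if override:
        return override.strip().lower()
    if content_type_header:
        # Drop parameters such as "; charset=utf-8".
        media_type = content_type_header.split(";", 1)[0].strip().lower()
        if media_type:
            return media_type
    if location:
        # Use only the path, so query strings do not hide the extension.
        path = urlparse(location).path
        extension = posixpath.splitext(path)[1].lower()
        return EXTENSION_TYPES.get(extension)
    return None


if __name__ == "__main__":
    print(detect_content_type("application/xhtml+xml; charset=utf-8"))
    print(detect_content_type(None, "http://example.org/feed.atom"))
```

Returning `None` when nothing can be determined leaves it to the caller to ask the user for an explicit choice.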