Tracker CC Indexing
Google Summer of Code Project: “Indexing Embedded License Claims in Tracker”
Here are some relevant (now revised) sections of the Summer of Code application:
Formats and where XMP is stored in each:
* MP3 -- id3v24 - PRIV,XMP field
* PDF -- metadata field
* OGG -- XMP comment field
* JPEG -- Exif XML Packet field
* PNG -- iTXt, XML:com.adobe.xmp field
* HTML -- within <script type='text/xml'>
* All XML formats (SMIL, RSS, SVG) -- XMP may be placed in any valid location
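Whatever the container, an XMP packet is wrapped in <?xpacket begin=...?> and <?xpacket end=...?> markers, so a first pass can scan the raw bytes for them. The following is a minimal sketch of that idea, not Tracker code: the function name find_xmp_packet is hypothetical, and it assumes the packet is stored uncompressed and in a single-byte-compatible encoding such as UTF-8. Compressed PDF streams or compressed PNG iTXt chunks would have to be decoded first.

```python
import re

# Matches an XMP packet wrapper and captures everything between the
# opening <?xpacket begin=...?> and closing <?xpacket end=...?> markers.
XPACKET_RE = re.compile(
    rb"<\?xpacket\s+begin=.*?\?>(.*?)<\?xpacket\s+end=.*?\?>",
    re.DOTALL,
)


def find_xmp_packet(data: bytes):
    """Return the bytes between the xpacket markers, or None if absent.

    Assumption: the packet is uncompressed and contiguous in `data`.
    """
    match = XPACKET_RE.search(data)
    if match is None:
        return None
    return match.group(1).strip()
```

A per-format extractor would still be preferable where the container defines an exact location (for example the PNG iTXt chunk), with byte scanning as a fallback.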
Other forms of metadata:
* HTML -- RDFa of the form: <a rel="license" href="http://...">...</a>
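As an illustration of how such rel="license" links could be collected, here is a short Python sketch built on the standard library's html.parser. The class and function names are hypothetical, and also accepting <link> elements is an assumption that goes beyond the <a> form shown above.

```python
from html.parser import HTMLParser


class LicenseLinkParser(HTMLParser):
    """Collects href values of elements carrying rel="license"."""

    def __init__(self):
        super().__init__()
        self.license_urls = []

    def handle_starttag(self, tag, attrs):
        # Assumption: <link rel="license"> is treated like <a rel="license">.
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens, e.g. "license nofollow".
        rels = (attr.get("rel") or "").lower().split()
        href = attr.get("href")
        if "license" in rels and href:
            self.license_urls.append(href)


def extract_license_links(html):
    """Return all license URLs found in the given HTML text, in order."""
    parser = LicenseLinkParser()
    parser.feed(html)
    parser.close()
    return parser.license_urls
```

html.parser tolerates the malformed markup common on the web, which is why the sketch uses it rather than a strict XML parser.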
Here's a conservative timeline. I expect to get more done; as actual coding begins, I'll update this with more accurate estimates.
By week 2: Write an XMP parser, based on the recommendations and specifications at http://wiki.creativecommons.org/XMP. Tracker includes examples of parsing XML with GLib, which I will follow as a guide.
By week 4: Add support to Tracker to automatically read and index XMP sidecar files, which will take advantage of the XMP parser. (Done by modifying src/trackerd/tracker-metadata.c)
By week 5: Once Tracker indexes the metadata of independent XMP sidecar files, I will extend the extractors of various formats to index embedded XMP metadata, again using the XMP parser above. Primary formats include HTML, SVG, and PDF.
By week 7: Write a parser for RDFa for relevant formats. (Based on recommendations/specifications at http://wiki.creativecommons.org/RDFa)
By week 9: Put the RDFa parser to use in the extractors for the relevant formats.
By week 11: Add support for new formats, such as SMIL and RSS. As time permits, I may also extract metadata from other image formats such as TIFF, PNG, and GIF (all of which can contain XMP metadata). Formats that can reuse previously written code may take only a day, while others may take up to a week.
By week 12: Tie things up for submission of code
Formats to support:
* MP3
* OGG
* RSS
* SVG
* HTML
* JPEG
* PDF
* SMIL
With the exception of MP3, OGG, and RSS, all of the above formats embed information as XMP, so one XMP parser can serve most of them. This eases the burden of processing such a wide variety of formats.
The following is an excerpt of raw XMP describing a work licensed under the CC Attribution 3.0 license.
<?xpacket begin='' id=''?>
<x:xmpmeta xmlns:x='adobe:ns:meta/'>
  <rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
    <rdf:Description rdf:about='' xmlns:xapRights='http://ns.adobe.com/xap/1.0/rights/'>
      <xapRights:Marked>True</xapRights:Marked>
    </rdf:Description>
    <rdf:Description rdf:about='' xmlns:dc='http://purl.org/dc/elements/1.1/'>
      <dc:rights>
        <rdf:Alt>
          <rdf:li xml:lang='x-default'>This work is licensed under a Creative Commons Attribution 3.0 License.</rdf:li>
        </rdf:Alt>
      </dc:rights>
    </rdf:Description>
    <rdf:Description rdf:about='' xmlns:cc='http://web.resource.org/cc/'>
      <cc:license rdf:resource='http://creativecommons.org/licenses/by/3.0/'/>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
<?xpacket end='r'?>
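To make the structure of this excerpt concrete, the sketch below pulls out the two pieces of interest: the human-readable rights statement (dc:rights) and the license URL (cc:license). It is a Python illustration only, whereas the planned parser would be written in C with GLib. The function name parse_license_claim is hypothetical, and the input is assumed to be the x:xmpmeta element as text, with the xpacket wrapper already stripped.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as they appear in the XMP excerpt.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "cc": "http://web.resource.org/cc/",
}
RDF_RESOURCE = "{%s}resource" % NS["rdf"]


def parse_license_claim(xmp_xml):
    """Return (rights_statement, license_url); either may be None."""
    root = ET.fromstring(xmp_xml)

    # The rights statement is the first language alternative in dc:rights.
    statement = None
    li = root.find(".//dc:rights/rdf:Alt/rdf:li", NS)
    if li is not None and li.text:
        statement = li.text.strip()

    # The license URL is the rdf:resource attribute of cc:license.
    url = None
    lic = root.find(".//cc:license", NS)
    if lic is not None:
        url = lic.get(RDF_RESOURCE)

    return statement, url
```

Matching on namespace URIs rather than on the prefixes dc: and cc: matters because a producer is free to bind different prefixes to the same namespaces.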
To fit Tracker's metadata structure, the extracted license claim would be placed into the File:License field in a form such as: "This work is licensed under a Creative Commons Attribution 3.0 License (http://creativecommons.org/licenses/by/3.0/)". Such a format allows for searching the license field by both license name and URL.
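The combined form described above could be produced along these lines. This is a sketch with a hypothetical helper name, format_license_field; dropping the statement's trailing period before appending the URL is an assumption made to match the example string.

```python
def format_license_field(statement, url):
    """Combine a rights statement and a license URL into one field value.

    Falls back to whichever part is present when the other is missing.
    """
    # Assumption: a trailing period is dropped so the URL reads as part
    # of the same sentence, as in the example form.
    statement = (statement or "").strip().rstrip(".")
    url = (url or "").strip()
    if statement and url:
        return "%s (%s)" % (statement, url)
    return statement or url
```

Keeping both parts in one string means a plain text search on the field matches either the license name or its URL.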
The extracted claim will be stored in Tracker's File:License field.
* cclookup (http://wiki.creativecommons.org/CcLookup) -- Python application for extracting license RDF metadata or license metadata from MP3s. Its code may be adapted for parsing license claims in C.
* ccpublisher (http://wiki.creativecommons.org/CcPublisher) -- Licenses embedded by ccpublisher should all be correctly extracted.