Tracker CC Indexing

Google Summer of Code Project: “Indexing Embedded License Claims in Tracker”

Here are some relevant (now revised) sections of the Summer of Code application:

Progress

Format	Form of Metadata	Location of Metadata	Extraction with Tracker	Test content
MP3	XMP / Native id3 tags	The PRIV,XMP field / WCOP tag	Need to patch GStreamer to allow it to read the WCOP tag.	XMP embedded with Exempi / Tags embedded with id3v2
Vorbis	XMP / Native comment field	XMP comment field / LICENSE comment field	Extend the GStreamer extractor to check for the presence of an XMP comment field. GStreamer places this within the EXTENDED_COMMENTS tag (requires GStreamer 0.10.10).	XMP embedded with vorbiscomment
FLAC	Native comment fields (id3v2 or vorbis-style comments)	Same as with MP3 for id3v2 or Vorbis for vorbis-style comments	Support for reading either method is in GStreamer.	embedded with id3v2 or metaflac
JPEG	XMP	Exif XML Packet field	Extend the ImageMagick extractor, using 'convert file.jpg xmp:-' to read XMP	XMP embedded with Exempi
PNG	XMP	iTXt, XML:com.adobe.xmp field	Extend the PNG extractor, adding a check for XML:com.adobe.xmp. (For backwards compatibility, the ability to read iTXt in libpng is disabled by default until version 1.3.)	XMP embedded with Exempi
PDF	XMP	metadata field	Extend the current PDF extractor (which uses Poppler) to read the metadata field. Reading the metadata field isn't yet wrapped in Poppler's glib bindings, but I have written and submitted a patch.	XMP embedded with Exempi
HTML	RDFa	<a rel="license" href="..."></a>	Write a new HTML extractor, using libxml2, and scan for RDFa	Various actual sites, including creativecommons.org
SVG	RDF	/svg/metadata/rdf	Parse the XML directly, checking for the RDF schema used by Inkscape. Open question: should XMP be checked as well?	Inkscape
Any XML	XMP	Wherever valid	Write a generic XML extractor (and/or extractor for each particular format), scanning with libxml2
OpenOffice.org (OASIS)	OO.org CC License Add-In SoC Project is working on the spec			OO.org Add-In
MS Office	DocumentSummaryInformation Infile, CreativeCommons_LicenseURL property		Extend existing msoffice extractor	MSOffice Add-in
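
As an illustration of the PNG row above, here is a minimal sketch of locating an XMP packet inside an iTXt chunk. It is written in plain Python rather than against libpng or Tracker's PNG extractor, so the function names (`iter_png_chunks`, `read_png_xmp`) are hypothetical. It assumes the keyword is spelled `XML:com.adobe.xmp`, as the XMP specification defines it, and it does not verify chunk CRCs.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
XMP_KEYWORD = b"XML:com.adobe.xmp"


def iter_png_chunks(data):
    """Yield (chunk_type, chunk_body) pairs from raw PNG bytes.

    CRCs are skipped, not verified (a simplification for this sketch).
    """
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        yield ctype, body
        pos += 12 + length  # 4 length + 4 type + body + 4 CRC


def read_png_xmp(data):
    """Return the XMP text stored in an iTXt chunk, or None if absent."""
    for ctype, body in iter_png_chunks(data):
        if ctype != b"iTXt":
            continue
        keyword, _, rest = body.partition(b"\x00")
        if keyword != XMP_KEYWORD or len(rest) < 2:
            continue
        compressed = rest[0] == 1
        # Skip the compression flag and method bytes, then the language
        # tag and translated keyword, each terminated by a NUL byte.
        _lang, _, rest = rest[2:].partition(b"\x00")
        _translated, _, text = rest.partition(b"\x00")
        if compressed:
            text = zlib.decompress(text)
        return text.decode("utf-8")
    return None
```

Calling `read_png_xmp(open(path, "rb").read())` on a file tagged with Exempi should return the packet text, which can then be handed to whatever XMP parser is in use.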

Timeline

Here's a conservative timeline. I expect to get more done; as actual coding begins, I'll update this with more accurate estimates.

By week 2: Write an XMP parser, based on the recommendations and specifications at http://wiki.creativecommons.org/XMP. Tracker includes examples of parsing XML using glib, which I will follow as a guide.

By week 4: Add support to Tracker to automatically read and index XMP sidecar files, which will take advantage of the XMP parser. (Done by modifying src/trackerd/tracker-metadata.c)
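
The sidecar lookup in this step could work roughly as follows. This is a sketch in Python, not the planned change to tracker-metadata.c, and the two naming conventions it tries (`photo.xmp` and `photo.jpg.xmp`) are assumptions rather than anything Tracker defines. The function name `find_xmp_sidecar` is hypothetical.

```python
from pathlib import Path


def find_xmp_sidecar(media_path):
    """Return the path of an XMP sidecar for media_path, or None.

    Two naming conventions are tried; both are assumptions made for
    this sketch: replacing the extension and appending to it.
    """
    media = Path(media_path)
    candidates = [
        media.with_suffix(".xmp"),             # photo.jpg -> photo.xmp
        media.with_name(media.name + ".xmp"),  # photo.jpg -> photo.jpg.xmp
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

When a sidecar is found, its contents would go through the same XMP parser as embedded packets, so the indexer needs only one code path for license claims.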

By week 5: Once Tracker indexes the metadata of independent XMP sidecar files, I will extend the extractors of various formats to index embedded XMP metadata, reusing the XMP parser described above. Primary formats include HTML, SVG, and PDF.

By week 7: Write a parser for RDFa for relevant formats. (Based on recommendations/specifications at http://wiki.creativecommons.org/RDFa)

By week 9: Put the RDFa parser to use

By week 11: Add support for new formats, such as SMIL and RSS. As time permits, I may also extract metadata from other image formats such as TIFF, PNG, and GIF (all of which could contain XMP metadata). Formats that take advantage of previously written code may take only a day, while others may take up to a week.

By week 12: Tie things up for submission of code
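
For the RDFa milestone in the timeline above, the simplest license claim to detect in HTML is a link carrying rel="license". The sketch below uses Python's standard `html.parser` as a stand-in for the libxml2-based extractor planned for Tracker. The class and function names are hypothetical, and treating `<link>` elements the same way as `<a>` elements is an assumption of this sketch.

```python
from html.parser import HTMLParser


class LicenseLinkParser(HTMLParser):
    """Collect href values of <a> and <link> tags marked rel="license"."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr_map = dict(attrs)
        # rel holds space-separated tokens; compare them case-insensitively.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        href = attr_map.get("href")
        if "license" in rel_tokens and href:
            self.licenses.append(href)


def find_license_links(html_text):
    """Return all license URLs claimed in html_text, in document order."""
    parser = LicenseLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.licenses
```

This only covers the rel="license" pattern. Full RDFa processing (prefixes, `about`, `property`, and so on) would need considerably more than this.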

Implementation

Formats to support:

 * MP3
 * OGG
 * RSS
 * SVG
 * HTML
 * JPEG
 * PDF
 * SMIL

With the exception of MP3, OGG, and RSS, all of the above formats embed information as XMP. A single XMP parser can therefore serve most of them, which reduces the burden of processing such a wide variety of formats.
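
Because XMP packets are stored as plain text inside the host file, one generic fallback is to scan the raw bytes for the `x:xmpmeta` element. The sketch below shows that idea in Python; it is not how Tracker's extractors are structured. The function name `scan_for_xmp` is hypothetical, and the code assumes a UTF-8 encoded, uncompressed packet (packets in UTF-16 or inside compressed streams would be missed).

```python
import re

# Non-greedy match from the opening <x:xmpmeta ...> tag to its closing tag.
_XMPMETA_RE = re.compile(rb"<x:xmpmeta\b.*?</x:xmpmeta>", re.DOTALL)


def scan_for_xmp(raw):
    """Return the first embedded x:xmpmeta element as text, or None.

    raw is the complete file content as bytes. Only UTF-8 packets are
    handled (an assumption of this sketch).
    """
    match = _XMPMETA_RE.search(raw)
    if match is None:
        return None
    return match.group(0).decode("utf-8", errors="replace")
```

Format-specific extractors remain preferable where they exist, since they know where the packet lives and can handle compressed or re-encoded data.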

XMP

The following is an excerpt of raw XMP describing a work licensed under the CC Attribution 3.0 license.

<?xpacket begin='' id=''?>
<x:xmpmeta xmlns:x='adobe:ns:meta/'>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>

<rdf:Description rdf:about=''
 xmlns:xapRights='http://ns.adobe.com/xap/1.0/rights/'>
 <xapRights:Marked>True</xapRights:Marked>
</rdf:Description>

<rdf:Description rdf:about=''
 xmlns:dc='http://purl.org/dc/elements/1.1/'>
 <dc:rights>
  <rdf:Alt>
   <rdf:li xml:lang='x-default'>This work is licensed under a Creative Commons
Attribution 3.0 License.</rdf:li>
  </rdf:Alt>
 </dc:rights>
</rdf:Description>

<rdf:Description rdf:about=''
 xmlns:cc='http://web.resource.org/cc/'>
 <cc:license rdf:resource='http://creativecommons.org/licenses/by/3.0/'/>
</rdf:Description>

</rdf:RDF>
</x:xmpmeta>
<?xpacket end='r'?>

To fit Tracker's metadata structure, the extracted license claim would be placed into the File:License field in a form such as: "This work is licensed under a Creative Commons Attribution 3.0 License (http://creativecommons.org/licenses/by/3.0/)". Such a format allows for searching the license field by both license name and URL.

Specifically, the values selected by

//rdf:Description/dc:rights/rdf:Alt/rdf:li/text()

and

//rdf:Description/cc:license/@rdf:resource

will be stored in Tracker's File:License field. (The rdf, dc, and cc prefixes are bound to the namespace URIs declared in the excerpt above.)
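
The two lookups can be tried out with Python's standard library before committing to a C implementation. This is only a sketch: `extract_license_claim` is a hypothetical name, the real parser is planned in C on top of glib, and the input is assumed to be the `x:xmpmeta` element on its own. The output follows the "statement (URL)" form described for the File:License field.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace bindings, matching the XMP excerpt.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "cc": "http://web.resource.org/cc/",
}
RDF_RESOURCE = "{%s}resource" % NS["rdf"]


def extract_license_claim(xmpmeta_xml):
    """Build a File:License style string from an x:xmpmeta element.

    Returns "statement (url)" when both parts are present, whichever
    part exists when only one is, and None when neither is found.
    """
    root = ET.fromstring(xmpmeta_xml)
    statement = root.find(".//rdf:Description/dc:rights/rdf:Alt/rdf:li", NS)
    license_el = root.find(".//rdf:Description/cc:license", NS)

    text = None
    if statement is not None and statement.text:
        # Collapse line breaks and repeated spaces inside the statement.
        text = " ".join(statement.text.split())
    url = license_el.get(RDF_RESOURCE) if license_el is not None else None

    if text and url:
        return "%s (%s)" % (text.rstrip("."), url)
    return text or url
```

Keeping both the human-readable statement and the URL in one string is what allows a search on either the license name or its URL to match the same field.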

Links

Related works:

 cclookup (http://wiki.creativecommons.org/CcLookup) - Python application for extracting license RDF metadata or license metadata from MP3s. Its code may be adapted for parsing license claims in C.
 ccpublisher (http://wiki.creativecommons.org/CcPublisher) - Licenses embedded by ccpublisher should all be correctly extracted.
