Difference between revisions of "Tracker CC Indexing"

From Creative Commons

Revision as of 00:01, 19 June 2007

Google Summer of Code Project: “Indexing Embedded License Claims in Tracker”

Here are some relevant (now revised) sections of the Summer of Code application:


License Metadata Summary

Format | Form of Metadata | Location of Metadata | Links
MP3 | XMP / Native id3 tags | The PRIV,XMP field / WCOP tag | XMP Spec / ID3v2.3 Spec
Vorbis | XMP / Native comment field | XMP comment field / LICENSE comment field | Ogg Vorbis Docs
FLAC | Native comment fields (id3v2 or vorbis-style comments) | Same as with MP3 for id3v2 or Vorbis for vorbis-style comments | FLAC Format Spec
JPEG | XMP | APP1 Markers | XMP Spec
TIFF | XMP | XMLPacket tag | XMP Spec
PNG | XMP | iTXt, XML:com.adobe.xmp field | XMP Spec
GIF | XMP | Application block | XMP Spec
Matroska | Native tag | COPYRIGHT tag | Matroska Tagging Spec
Quicktime | Native tag | kMDItemCopyright (old) / kUserDataTextCopyright (new) tag | Quicktime 7 API Reference
PDF | XMP | metadata field | XMP Spec
HTML | RDFa | <a rel="license" href="..."></a> | CC Wiki, RDFa
SVG | RDF | /svg/metadata/rdf | CC Wiki, SVG, based on Inkscape
Any XML | XMP | Wherever valid | XMP Spec
OpenOffice.org (OASIS) | OO.org CC License Add-In SoC Project is working on the spec | |
MS Office | DocumentSummaryInformation Infile | CreativeCommons_LicenseURL property | Office Add-in
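
To make the MP3 row concrete, here is a minimal sketch of reading the WCOP (copyright/legal information URL) frame from an ID3v2.3 tag. It is written in Python for brevity rather than in Tracker's C, it is not the GStreamer extractor's code, and the function names (`read_wcop`, `build_demo_tag`) are made up for illustration. It assumes a tag without an extended header and without unsynchronisation.

```python
import struct


def synchsafe_to_int(data: bytes) -> int:
    """Decode a synchsafe integer (7 significant bits per byte)."""
    value = 0
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
    return value


def read_wcop(tag: bytes):
    """Return the WCOP URL from an ID3v2.3 tag, or None if it is absent."""
    if len(tag) < 10 or tag[:3] != b"ID3" or tag[3] != 3:
        return None  # not an ID3v2.3 tag
    if tag[5] & 0x40:
        return None  # extended header present; not handled in this sketch
    end = 10 + synchsafe_to_int(tag[6:10])
    pos = 10
    while pos + 10 <= min(end, len(tag)):
        frame_id = tag[pos:pos + 4]
        if frame_id == b"\x00\x00\x00\x00":
            break  # reached the padding area
        # In v2.3 the frame size is a plain big-endian integer.
        (size,) = struct.unpack(">I", tag[pos + 4:pos + 8])
        body = tag[pos + 10:pos + 10 + size]
        if frame_id == b"WCOP":
            # URL link frames hold ISO-8859-1 text with no encoding byte.
            return body.split(b"\x00", 1)[0].decode("latin-1")
        pos += 10 + size
    return None


def build_demo_tag(url: str) -> bytes:
    """Build a tiny in-memory ID3v2.3 tag with one TIT2 and one WCOP frame."""
    def frame(frame_id: bytes, body: bytes) -> bytes:
        return frame_id + struct.pack(">I", len(body)) + b"\x00\x00" + body

    frames = frame(b"TIT2", b"\x00Demo") + frame(b"WCOP", url.encode("latin-1"))
    size = len(frames)
    synchsafe = bytes((size >> shift) & 0x7F for shift in (21, 14, 7, 0))
    return b"ID3\x03\x00\x00" + synchsafe + frames


if __name__ == "__main__":
    demo = build_demo_tag("http://creativecommons.org/licenses/by/3.0/")
    print(read_wcop(demo))
```

The demo tag is built in memory so the sketch can be run without an actual MP3 file; a real extractor would read the tag from the start of the file.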

Indexing Licenses in Tracker Summary

Status | Format | Extraction Method | Test content
Done, GStreamer patch pending | MP3 | Reading native tags already complete. Maybe extend GStreamer extractor to read XMP. | XMP embedded with Exempi / Tags embedded with id3v2
In progress | Vorbis | Extend the GStreamer extractor to check for the presence of an XMP comment field. GStreamer places this within the EXTENDED_COMMENTS tag (requires GStreamer 0.10.10). | XMP embedded with vorbiscomment
Done, GStreamer patch pending | FLAC | Native tags already extracted through the GStreamer extractor. Maybe extend GStreamer extractor to read XMP. | Tags embedded with id3v2 or metaflac
In progress | JPEG | Extend the ImageMagick extractor, using 'convert file.jpg xmp:-' to read XMP. | XMP embedded with Exempi
In progress | TIFF | Extend the ImageMagick extractor, using 'convert file.tif xmp:-' to read XMP. | XMP embedded with Exempi (Note: there's a bug in Adobe's XMP SDK that prevents Exempi from embedding valid XMP)
Pending release of libpng-1.3 | PNG | Extend the PNG extractor, adding a check for XML:com.adobe.xmp. (For backwards compatibility, the ability to read iTXt in libpng is disabled by default until version 1.3.) | XMP embedded with Exempi
In progress | GIF | Would need to write a GIF extractor. | Palimpsest
Poppler patch pending | PDF | Extend the current PDF extractor (which uses Poppler) to read the metadata field. | XMP embedded with Exempi
In progress | HTML | Write a new HTML extractor, using libxml2, and scan for RDFa. | Various actual sites, including creativecommons.org
In progress | SVG | I could specifically parse the XML, checking for the RDF schema used by Inkscape. Should I check for XMP also? | Inkscape
 | Any XML | Write a generic XML extractor (and/or an extractor for each particular format), scanning with libxml2. |
Awaiting spec | OpenOffice.org (OASIS) | Extend OASIS extractor. | OO.org Add-In
In progress | MS Office | Extend existing msoffice extractor. | MSOffice Add-in
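
As an illustration of the PNG row, here is a rough sketch of the iTXt check done by walking the chunks directly instead of going through libpng. It is written in Python for brevity and is not the planned extractor code; the helper names are made up. The keyword is spelled `XML:com.adobe.xmp` (with dots), as in the XMP specification. The demo file contains only IHDR, iTXt and IEND chunks, which is enough for a chunk scanner but is not a displayable image.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
XMP_KEYWORD = b"XML:com.adobe.xmp"  # iTXt keyword used for XMP packets


def iter_chunks(data: bytes):
    """Yield (type, body) for each chunk in a PNG byte string."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        yield ctype, data[pos + 8:pos + 8 + length]
        pos += 12 + length  # length field + type + body + CRC


def find_xmp(data: bytes):
    """Return the XMP packet stored in an iTXt chunk, or None."""
    for ctype, body in iter_chunks(data):
        if ctype != b"iTXt":
            continue
        keyword, _, rest = body.partition(b"\x00")
        if keyword != XMP_KEYWORD or len(rest) < 2:
            continue
        compressed = rest[0] == 1
        # Skip compression flag and method, then language tag and
        # translated keyword (both null-terminated).
        _lang, _, rest = rest[2:].partition(b"\x00")
        _translated, _, text = rest.partition(b"\x00")
        if compressed:
            text = zlib.decompress(text)
        return text.decode("utf-8")
    return None


def make_chunk(ctype: bytes, body: bytes) -> bytes:
    """Serialize one PNG chunk, including its CRC."""
    crc = zlib.crc32(ctype + body) & 0xFFFFFFFF
    return struct.pack(">I", len(body)) + ctype + body + struct.pack(">I", crc)


def build_demo_png(xmp=None) -> bytes:
    """Build a chunk-level demo PNG, optionally carrying an XMP iTXt chunk."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    out = PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
    if xmp is not None:
        body = XMP_KEYWORD + b"\x00" + b"\x00\x00" + b"\x00" + b"\x00"
        out += make_chunk(b"iTXt", body + xmp.encode("utf-8"))
    return out + make_chunk(b"IEND", b"")


if __name__ == "__main__":
    print(find_xmp(build_demo_png("<x:xmpmeta xmlns:x='adobe:ns:meta/'/>")))
```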


Here's a conservative timeline. I expect to get more done; as actual coding begins, I'll update this with more accurate estimates.

UPDATE: It's going to be difficult to estimate when each piece will be finished. Patches are being sent upstream to various dependencies of Tracker, so completion of indexing for a particular format depends on approval and feedback from upstream, as well as Jamie's approval of patches into Tracker itself. While waiting on feedback or approval for one aspect, I'll work on other aspects of the project in parallel.


By week 2: Write an XMP parser. Tracker includes examples of parsing XML using glib which I will also follow as a guide. (Based on recommendations/specifications at http://wiki.creativecommons.org/XMP)

By week 4: Add support to Tracker to automatically read and index XMP sidecar files, which will take advantage of the XMP parser. (Done by modifying src/trackerd/tracker-metadata.c)

By week 5: Once Tracker indexes the metadata of independent XMP sidecar files, I will go about extending the extractors of various formats to index embedded XMP metadata. Again, the above XMP parser will be utilized. Primary formats include HTML, SVG, and PDF.

By week 7: Write a parser for RDFa for relevant formats. (Based on recommendations/specifications at http://wiki.creativecommons.org/RDFa)

By week 9: Put the RDFa parser to use

By week 11: Additionally add support for new formats, such as SMIL and RSS. As time permits, I may also extract metadata from other image formats such as TIFF, PNG, and GIF (all of which could contain XMP metadata). Formats that take advantage of previously written code may take only a day, while others may take up to a week.

By week 12: Tie things up for submission of code
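
For the RDFa step in the timeline above, the core of the HTML case is finding elements that carry rel="license" and reading their href. The following is only a sketch of that scan, written in Python with the standard library's `html.parser` rather than with libxml2 in C; the class and function names are made up for illustration, and it does not attempt full RDFa processing.

```python
from html.parser import HTMLParser


class LicenseLinkParser(HTMLParser):
    """Collect the href of every <a> or <link> element with rel="license"."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attributes = dict(attrs)
        # rel may hold several space-separated tokens, e.g. rel="license nofollow"
        rel_tokens = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "license" in rel_tokens and href:
            self.licenses.append(href)


def find_license_links(html: str):
    """Return the license URLs claimed by an HTML document, in document order."""
    parser = LicenseLinkParser()
    parser.feed(html)
    parser.close()
    return parser.licenses


if __name__ == "__main__":
    page = (
        '<p>This page is licensed under '
        '<a rel="license" href="http://creativecommons.org/licenses/by/3.0/">'
        'CC Attribution 3.0</a>. See also <a href="/about">about</a>.</p>'
    )
    print(find_license_links(page))
```

Treating rel as a list of tokens matters because pages often combine values, and a plain string comparison against "license" would miss those.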



The following is an excerpt of raw XMP describing a work licensed under the CC Attribution 3.0 license.

<?xpacket begin= id=?>
<x:xmpmeta xmlns:x='adobe:ns:meta/'>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>

<rdf:Description rdf:about=
<rdf:Description rdf:about=
   <rdf:li xml:lang='x-default'>This work is licensed under a Creative Commons Attribution 3.0 License.</rdf:li>
<rdf:Description rdf:about=
 <cc:license rdf:resource='http://creativecommons.org/licenses/by/3.0/'/>

</rdf:RDF>
</x:xmpmeta>
<?xpacket end='r'?>

To fit Tracker's metadata structure, the extracted license claim would be placed into the File:License field in a form such as: "This work is licensed under a Creative Commons Attribution 3.0 License (http://creativecommons.org/licenses/by/3.0/)". Such a format allows for searching the license field by both license name and URL.
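
Here is a rough sketch of how the license statement and URL could be pulled out of an XMP packet and combined into that form. It is Python rather than Tracker's C, and several details are assumptions on my part: the sample packet is a complete stand-in I wrote for the demo, the cc namespace URI (`http://creativecommons.org/ns#`) and the dc:rights / rdf:Alt wrapper around the rdf:li element are assumed, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "cc": "http://creativecommons.org/ns#",  # assumed namespace URI
}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Stand-in packet written for this sketch (not copied from a real file).
SAMPLE_XMP = """<?xpacket begin='' id=''?>
<x:xmpmeta xmlns:x='adobe:ns:meta/'>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
 <rdf:Description rdf:about='' xmlns:dc='http://purl.org/dc/elements/1.1/'>
  <dc:rights><rdf:Alt>
   <rdf:li xml:lang='x-default'>This work is licensed under a Creative Commons Attribution 3.0 License.</rdf:li>
  </rdf:Alt></dc:rights>
 </rdf:Description>
 <rdf:Description rdf:about='' xmlns:cc='http://creativecommons.org/ns#'>
  <cc:license rdf:resource='http://creativecommons.org/licenses/by/3.0/'/>
 </rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end='r'?>"""


def extract_license_claim(xmp: str):
    """Return (statement, url) from an XMP packet; either may be None."""
    root = ET.fromstring(xmp)
    statement = None
    for li in root.iterfind(".//dc:rights/rdf:Alt/rdf:li", NS):
        candidate = " ".join((li.text or "").split())  # collapse whitespace
        # Take the first entry, but let an x-default entry override it.
        if statement is None or li.get(XML_LANG) == "x-default":
            statement = candidate or None
    url = None
    license_element = root.find(".//cc:license", NS)
    if license_element is not None:
        url = license_element.get("{%s}resource" % NS["rdf"])
    return statement, url


def format_file_license(statement, url):
    """Combine statement and URL into a single File:License string."""
    if statement and url:
        return "%s (%s)" % (statement.rstrip("."), url)
    return statement or url


if __name__ == "__main__":
    print(format_file_license(*extract_license_claim(SAMPLE_XMP)))
```

Keeping the name and the URL in one string is what lets a search on File:License match either "Attribution 3.0" or the license URL.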





The extracted license claim will be stored in Tracker's File:License field.


Related works:

 cclookup (http://wiki.creativecommons.org/CcLookup) - Python application for extracting license RDF metadata, or license metadata from MP3s. Its code may be adapted for parsing license claims in C.
 ccpublisher (http://wiki.creativecommons.org/CcPublisher) - Licenses embedded by ccpublisher should all be correctly extracted.

Related links: