Tracker CC Indexing
Google Summer of Code Project: “Indexing Embedded License Claims in Tracker”
Here are some relevant (now revised) sections of the Summer of Code application:
License Metadata Summary
Format | Form of Metadata | Location of Metadata | Links |
Audio | |||
MP3 | XMP / Native id3 tags | The PRIV,XMP field / WCOP tag | XMP Spec, ID3v2.3 Spec |
Vorbis | XMP / Native comment field | XMP comment field / LICENSE comment field | Ogg Vorbis Docs |
FLAC | Native comment fields (id3v2 or vorbis-style comments) | Same as with MP3 for id3v2 or Vorbis for vorbis-style comments | FLAC Format Spec |
Monkey's Audio (APE) | Native Vorbis-like comment field | AFAIK, there is no standard tag spec | |
Images | |||
JPEG | XMP | APP1 Markers | XMP Spec |
JPEG-2000 | XMP | UUID Box | XMP Spec |
TIFF | XMP | XMLPacket tag | XMP Spec |
PNG | XMP | iTXt, XML:com:adobe:xmp field | XMP Spec |
GIF | XMP | Application block | XMP Spec |
PSD (Adobe Photoshop) | XMP | Resource block | XMP Spec |
Video | |||
AVI | ? | ? | XMP Spec |
Matroska | Native tag | COPYRIGHT tag | Matroska Tagging Spec |
Quicktime | Native tag | kMDItemCopyright (old) / kUserDataTextCopyright (new) tag | Quicktime 7 API Reference, XMP Spec |
OGG | No metadata standard | | Ogg Metadata Draft |
Documents | |||
PDF | XMP | metadata field | XMP Spec |
Postscript/EPS | XMP | Document-level metadata | XMP Spec |
HTML | RDFa | <a rel="license" href="..."></a> | CC Wiki, RDFa |
SMIL | RDF | /smil/head/metadata@id="meta-rdf"/RDF | CreativeCommons SMIL Module |
SVG | RDF | /svg/metadata/rdf | CC Wiki, SVG, based on Inkscape |
Any XML | XMP | Wherever valid | XMP Spec |
OpenOffice.org (OASIS) | OO.org CC License Add-In SoC Project is working on the spec | ||
MS Office | DocumentSummaryInformation Infile | CreativeCommons_LicenseURL property | Office Add-in |
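Most of the formats in the table above carry XMP as an embedded packet, wrapped in the xpacket markers shown in the XMP excerpt under Implementation. As a rough illustration of locating such a packet without any format-specific parsing, the sketch below scans raw file bytes for those markers. It is not how the Tracker extractors work (they go through GStreamer, ImageMagick, Poppler and so on); the function name find_xmp_packets is hypothetical, and the approach assumes an uncompressed, UTF-8 encoded packet stored contiguously in the file.

```python
import re
from typing import List

# One XMP packet: the opening xpacket processing instruction, the payload,
# and the closing xpacket processing instruction. Non-greedy matching keeps
# separate packets in the same file apart.
_XPACKET = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>",
    re.DOTALL,
)


def find_xmp_packets(data: bytes) -> List[bytes]:
    """Return the payload of every XMP packet found in raw file bytes.

    Assumption: the packet is stored uncompressed, in UTF-8, and in one
    piece. Compressed or split storage needs format-aware code instead.
    """
    return [match.group(1).strip() for match in _XPACKET.finditer(data)]
```

A caller would read the file in binary mode and pass the bytes in, for example find_xmp_packets(open(path, "rb").read()), where path is whatever file is being indexed. An empty list means no packet was found by scanning, which is not proof that the file has no license metadata.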
Indexing Licenses in Tracker Summary
Status | Format | Extraction Method | Test content |
Done, GStreamer patch pending | MP3 | Reading native tags already complete. Maybe extend GStreamer extractor to read XMP. | XMP embedded with Exempi / Tags embedded with id3v2 |
In progress | Vorbis | Extend the GStreamer extractor to check for the presence of an XMP comment field. GStreamer places this within the EXTENDED_COMMENTS tag (requires GStreamer 0.10.10). | XMP embedded with vorbiscomment |
Done, GStreamer patch pending | FLAC | Native tags already extracted through the GStreamer extractor. Maybe extend GStreamer extractor to read XMP. | Tags embedded with id3v2 or metaflac |
In progress | JPEG | Extend the ImageMagick extractor, using 'convert file.jpg xmp:-' to read XMP | XMP embedded with Exempi |
Patch pending | TIFF | Extend the ImageMagick extractor, using 'convert file.tif xmp:-' to read XMP | XMP embedded with Exempi (Note: there's a bug in Adobe's XMP SDK that prevents Exempi from embedding valid XMP) |
Pending release of libpng-1.3/Pending Tracker patch | PNG | Extend the PNG extractor, adding a check for XML:com:adobe:xmp. (For backwards compatibility, the ability to read iTXt in libpng is disabled by default until version 1.3.) | XMP embedded with Exempi |
In progress | GIF | Would need to write a GIF extractor | Palimpsest |
Poppler/Tracker patch pending | PDF | Extend the current PDF extractor (which uses Poppler) to read the metadata field. | XMP embedded with Exempi |
Patch pending | HTML | Write a new HTML extractor, using libxml2, and scan for RDFa | Various actual sites, including creativecommons.org |
In progress | SVG | Parse the XML directly, checking for the RDF schema used by Inkscape. Open question: should I also check for XMP? | Inkscape |
(not stated) | Any XML | Write a generic XML extractor (and/or an extractor for each particular format), scanning with libxml2 | |
Awaiting spec | OpenOffice.org (OASIS) | Extend OASIS extractor | OO.org Add-In |
Patch pending | MS Office | Extend existing msoffice extractor | MSOffice Add-in |
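For the HTML row above, the license claim is the href of a link marked rel="license". The planned extractor uses libxml2 from C; the sketch below swaps in Python's standard html.parser purely to illustrate the scan. The names LicenseLinkParser and find_license_links are hypothetical, and accepting both a and link elements is an assumption on my part rather than something taken from the extractor patch.

```python
from html.parser import HTMLParser


class LicenseLinkParser(HTMLParser):
    """Collect the href of every <a> or <link> element with rel="license"."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attributes = dict(attrs)
        # rel is a space-separated token list, so "license" may appear
        # alongside other values.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "license" in rel_tokens and href:
            self.licenses.append(href)


def find_license_links(html):
    """Return all license URLs claimed by an HTML document, in order."""
    parser = LicenseLinkParser()
    parser.feed(html)
    parser.close()
    return parser.licenses
```

A page can carry more than one license link (for example, one per embedded work), so the sketch returns a list and leaves it to the caller to decide which claim applies to the document as a whole.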
Timeline
UPDATE: It is difficult to estimate when each piece will be finished. Patches are being sent upstream to various dependencies of Tracker, so completion of indexing for a particular format depends on feedback and approval from those upstream projects, as well as on Jamie's approval of patches into Tracker itself. I'll be working on several aspects of the project in parallel while I await feedback or approval on any one of them.
See the progress table above for the current status of each format.
By week 12: Tie things up for submission of code
Implementation
XMP
The following is an excerpt of raw XMP describing a work licensed under the CC Attribution 3.0 license.
<?xpacket begin='' id=''?>
<x:xmpmeta xmlns:x='adobe:ns:meta/'>
 <rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
  <rdf:Description rdf:about='' xmlns:xapRights='http://ns.adobe.com/xap/1.0/rights/'>
   <xapRights:Marked>True</xapRights:Marked>
  </rdf:Description>
  <rdf:Description rdf:about='' xmlns:dc='http://purl.org/dc/elements/1.1/'>
   <dc:rights>
    <rdf:Alt>
     <rdf:li xml:lang='x-default'>This work is licensed under a Creative Commons Attribution 3.0 License.</rdf:li>
    </rdf:Alt>
   </dc:rights>
  </rdf:Description>
  <rdf:Description rdf:about='' xmlns:cc='http://web.resource.org/cc/'>
   <cc:license rdf:resource='http://creativecommons.org/licenses/by/3.0/'/>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>
<?xpacket end='r'?>
To fit Tracker's metadata structure, the extracted license claim would be placed into the File:License field in a form such as: "This work is licensed under a Creative Commons Attribution 3.0 License (http://creativecommons.org/licenses/by/3.0/)". Such a format allows for searching the license field by both license name and URL.
Specifically, the values selected by
/rdf:Description[@xmlns:dc='http://purl.org/dc/elements/1.1/']/dc:rights/rdf:Alt/rdf:li/text()
and
/rdf:Description[@xmlns:cc='http://web.resource.org/cc/']/cc:license/@rdf:resource
will be stored in Tracker's File:License field.
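The two lookups above and the combined "text (URL)" form can be sketched in a few lines. This is an illustration in Python using the standard xml.etree.ElementTree module, not the C code that goes into Tracker; the function name license_claim is hypothetical, and dropping the trailing period from the rights text is my own choice, made so the result matches the example form given earlier.

```python
import xml.etree.ElementTree as ET

# Namespace URIs taken from the XMP excerpt above.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "cc": "http://web.resource.org/cc/",
}
RDF_RESOURCE = "{%s}resource" % NS["rdf"]


def license_claim(xmp):
    """Build a File:License style string from the XML of an XMP packet."""
    root = ET.fromstring(xmp)
    text_node = root.find(".//rdf:Description/dc:rights/rdf:Alt/rdf:li", NS)
    link_node = root.find(".//rdf:Description/cc:license", NS)

    text = ""
    if text_node is not None and text_node.text:
        # Assumption: drop the trailing period so the URL can follow in
        # parentheses, as in the example form above.
        text = text_node.text.strip().rstrip(".")
    url = link_node.get(RDF_RESOURCE, "") if link_node is not None else ""

    if text and url:
        return "%s (%s)" % (text, url)
    # Fall back to whichever half of the claim is present, if any.
    return text or url
```

Returning whichever half is present, rather than nothing, keeps a file searchable by license name or by URL even when only one of the two fields was embedded.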
Links
Related works:
cclookup (http://wiki.creativecommons.org/CcLookup) - Python application for extracting license RDF metadata or license metadata from mp3s. Code may be adapted for parsing license claims in C.
ccpublisher (http://wiki.creativecommons.org/CcPublisher) - Licenses embedded by ccpublisher should all be correctly extracted.
Related links:
http://creativecommons.org/technology/usingmarkup
http://wiki.creativecommons.org/RDFa