Difference between revisions of "Tracker CC Indexing"

Latest revision as of 19:45, 21 September 2007

This article has been identified as a candidate for merging with Embedding Specifications.

The contents of this article have been identified as candidates for conversion to Semantic Markup. You can help Creative Commons by splitting the article and constructing ask queries. See the Semantic MediaWiki page for more information.

Google Summer of Code Project: “Indexing Embedded License Claims in Tracker”

Here are some relevant (now revised) sections of the Summer of Code application:

License Metadata Summary

Format	Form of Metadata	Location of Metadata	Links
Audio
MP3	XMP / Native id3 tags	The PRIV,XMP field / WCOP tag	XMP Spec ID3v2.3 Spec
Vorbis	XMP / Native comment field	XMP comment field / LICENSE comment field	Ogg Vorbis Docs
FLAC	Native comment fields (id3v2 or vorbis-style comments)	Same as with MP3 for id3v2 or Vorbis for vorbis-style comments	FLAC Format Spec
Monkey's Audio (APE)	Native Vorbis-like comment field	AFAIK, there is no standard tag spec
Images
JPEG	XMP	APP1 Markers	XMP Spec
JPEG 2000	XMP	UUID Box	XMP Spec
TIFF	XMP	XMLPacket tag	XMP Spec
PNG	XMP	iTXt, XML:com:adobe:xmp field	XMP Spec
GIF	XMP	Application block	XMP Spec
SVG	RDF	/svg/metadata/rdf	CC Wiki, SVG, based on Inkscape
PSD (Adobe Photoshop)	XMP	Resource block	XMP Spec
Video
AVI	?	?	XMP Spec
Matroska	Native tag	COPYRIGHT tag	Matroska Tagging Spec
Quicktime	Native tag	kMDItemCopyright(old)/kUserDataTextCopyright(new) tag	Quicktime 7 API Reference XMP Spec
OGG	No metadata standard		Ogg Metadata Draft
Theora	Theora comments (similar to Vorbis comments)		Theora Spec
Flash	RDF	?
Documents
PDF	XMP	metadata field	XMP Spec
Postscript/EPS	XMP	Document-level metadata	XMP Spec
HTML	RDFa	<a rel="license" href="..."></a>	CC Wiki, RDFa
SMIL	RDF	/smil/head/metadata@id="meta-rdf"/RDF	CreativeCommons SMIL Module
RSS 1.0	/RDF/channel/license or /RDF/channel/item/license		CreativeCommons RSS 1.0 Module
RSS 2.0	/rss/channel/cc:license or /rss/channel/item/cc:license		CreativeCommons RSS 2.0 Module
Atom	/feed/entry/link@rel=license		Atom License Extension
Any XML	XMP	Wherever valid	XMP Spec
OpenOffice.org (OASIS)	OO.org CC License Add-In SoC Project is working on the spec
MS Office (2003)	DocumentSummaryInformation Infile	CreativeCommons_LicenseURL property	Office Add-in
MS Office OpenXML (2007)	?	?	OpenXML Spec Relevant mailing list post
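
As an illustration of the HTML row above, the following is a minimal sketch of collecting license URLs from `<a rel="license" href="...">` anchors. It uses Python's standard `html.parser` rather than the libxml2-based extractor the project describes, and the names `LicenseLinkParser` and `find_license_links` are hypothetical, chosen only for this example.

```python
from html.parser import HTMLParser


class LicenseLinkParser(HTMLParser):
    """Collects href values of <a> tags whose rel attribute contains 'license'."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens, e.g. rel="license nofollow"
        rel_tokens = (attr.get("rel") or "").lower().split()
        href = attr.get("href")
        if "license" in rel_tokens and href:
            self.licenses.append(href)


def find_license_links(html_text):
    """Return all license URLs claimed in the given HTML text (hypothetical helper)."""
    parser = LicenseLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.licenses


page = (
    '<p>Licensed under '
    '<a rel="license" href="http://creativecommons.org/licenses/by/3.0/">CC BY 3.0</a>. '
    '<a href="/about">About</a></p>'
)
print(find_license_links(page))
```

Only anchors carrying the `license` token are reported; ordinary links such as the second anchor in the sample are ignored.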

Indexing Licenses in Tracker Summary

Status	Format	Extraction Method	Test content
Done, GStreamer patch pending	MP3	Reading native tags already complete. Maybe extend GStreamer extractor to read XMP.	XMP embedded with Exempi / Tags embedded with id3v2
In progress	Vorbis	Extend the GStreamer extractor to check for the presence of an XMP comment field. GStreamer places this within the EXTENDED_COMMENTS tag (requires GStreamer 0.10.10).	XMP embedded with vorbiscomment
Done, GStreamer patch pending	FLAC	Native tags already extracted through the GStreamer extractor. Maybe extend GStreamer extractor to read XMP.	embedded with id3v2 or metaflac
In progress	JPEG	Extend the Imagemagick extractor, using 'convert file.jpg xmp:-' to read XMP	XMP embedded with Exempi
Done	TIFF	Extend the Imagemagick extractor, using 'convert file.tif xmp:-' to read XMP	XMP embedded with Exempi (Note: there's a bug in Adobe's XMP SDK that prevents Exempi from embedding valid XMP)
Done	PNG	Extend the PNG extractor, adding a check for XML:com:adobe:xmp. (For backwards compatibility, the ability to read iTXt in libpng is disabled by default until version 1.3.)	XMP embedded with Exempi
In progress	GIF	Would need to write a GIF extractor	Palimpsest
Done	PDF	Extend the current PDF extractor (which uses Poppler) to read the metadata field.	XMP embedded with Exempi
Done	HTML	Write a new HTML extractor, using libxml2, and scan for RDFa	Various actual sites, including creativecommons.org
In progress	SVG	Parse the XML directly, checking for the RDF schema used by Inkscape. Open question: should XMP be checked as well?	Inkscape
	Any XML	Write a generic XML extractor (and/or extractor for each particular format), scanning with libxml2
Awaiting spec	OpenOffice.org (OASIS)	Extend OASIS extractor	OO.org Add-In
Done	MS Office (old format)	Extend existing msoffice extractor	MSOffice Add-in
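
Several rows above end with an XMP packet in hand (for example, the output of 'convert file.jpg xmp:-'). What follows is a rough sketch, not the project's actual C code, of pulling a license claim out of such a packet with Python's standard `xml.etree.ElementTree`. It assumes the claim is stored as `dc:rights` text plus a `cc:license` element with an `rdf:resource` attribute, using the `http://web.resource.org/cc/` namespace. The sample packet and the function name `extract_license_claim` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URIs assumed for this sketch.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
CC = "http://web.resource.org/cc/"

# Illustrative packet only; real packets carry many more properties.
SAMPLE_XMP = """<?xpacket begin='' id=''?>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
 <rdf:Description rdf:about='' xmlns:dc='http://purl.org/dc/elements/1.1/'>
  <dc:rights>
   <rdf:Alt>
    <rdf:li xml:lang='x-default'>This work is licensed under a Creative Commons Attribution 3.0 License.</rdf:li>
   </rdf:Alt>
  </dc:rights>
 </rdf:Description>
 <rdf:Description rdf:about='' xmlns:cc='http://web.resource.org/cc/'>
  <cc:license rdf:resource='http://creativecommons.org/licenses/by/3.0/'/>
 </rdf:Description>
</rdf:RDF>
<?xpacket end='r'?>"""


def extract_license_claim(xmp_text):
    """Return (rights statement, license URL); either may be None if absent."""
    root = ET.fromstring(xmp_text)

    # First cc:license element that carries an rdf:resource attribute.
    url = None
    for el in root.iter(f"{{{CC}}}license"):
        url = el.get(f"{{{RDF}}}resource")
        if url:
            break

    # First non-empty rdf:li found inside a dc:rights element.
    statement = None
    for rights in root.iter(f"{{{DC}}}rights"):
        for li in rights.iter(f"{{{RDF}}}li"):
            if li.text and li.text.strip():
                statement = " ".join(li.text.split())
                break
        if statement:
            break

    return statement, url


print(extract_license_claim(SAMPLE_XMP))
```

Returning the statement and URL separately leaves it to the caller to decide how to combine them into a single searchable field.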


@@ Line 1: / Line 1: @@
+[[Category:Technology]]
+[[Category:Developer]]
+[[Category:metadata]]
+[[Category:Tracker]]
+{{template:Merge|Embedding Specifications}}
+{{template:SMW}}
 == Google Summer of Code Project: “Indexing Embedded License Claims in Tracker” ==
-In progress...
+Here are some relevant (now revised) sections of the Summer of Code application:
-Here are some relevant sections of the Summer of Code application:
-== Implementation ==
-Formats to support:
-  * MP3
-  * OGG
-  * RSS
-  * SVG
-  * HTML
-  * JPEG
-  * PDF
-  * SMIL
-With the exception of MP3, OGG, and RSS, all of the above formats embed information as XMP. This relieves the burden of processing such a wide variety of formats.
-XMP/RDF and RDFa:
-Probably the most significant portion of the work will be processing XML XMP data. Once the XML data is extracted, it will be passed to a shared piece of code that indexes each field independently of its format.
-The following is an excerpt of raw XMP describing a work licensed under the CC Attribution 3.0 license.
- <?xpacket begin='' id=''?>
-  ...
-  <rdf:Description rdf:about=''
-   xmlns:dc='http://purl.org/dc/elements/1.1/'>
-   <dc:rights>
-    <rdf:Alt>
-     <rdf:li xml:lang='x-default' >This work is licensed under a Creative Commons
-  Attribution 3.0 License.</rdf:li>
-    </rdf:Alt>
-   </dc:rights>
-  </rdf:Description>
-  <rdf:Description rdf:about=''
-   xmlns:cc='http://web.resource.org/cc/'>
-   <cc:license rdf:resource='http://creativecommons.org/licenses/by/3.0/'/>
-  </rdf:Description>
-  ...
- <?xpacket end='r'?>
-To fit Tracker's metadata structure, the extracted license claim would be placed into the File:License field in a form such as: "This work is licensed under a Creative Commons Attribution 3.0 License (http://creativecommons.org/licenses/by/3.0/)". Such a format allows for searching the license field by both license name and URL.
-License claims will similarly be extracted from RDFa metadata embedded in XML formats.
-== Timeline ==
-Assuming the design phase is ironed out and the goals of the project are clearly assessed, coding will begin with an XMP parser. At the moment, I am not aware of any C libraries that perform this function, although I may use ccLookup (written in Python) as a guide. Tracker includes examples of parsing XML using glib, which I will also follow as a guide. This is a particularly important aspect that the rest of my project will build on. It should be done within the first month.
-I then will add support to Tracker to automatically read and index XMP sidecar files (I have the okay from Jamie from Tracker), which will take advantage of the XMP parser. Once Tracker indexes the metadata of independent XMP sidecar files, I will go about extending the extractors of various formats to index embedded XMP metadata. I expect that many extractors will be trivial to implement while others will require more technical research of the formats. However, most formats already take the form of XML and so extracting the relevant XMP/RDF element will be straightforward.
-In addition to parsing XML for XMP/RDF metadata, I will also write a parser for RDFa for relevant formats.
-I will then additionally add support for new formats to Tracker, such as SMIL and RSS. As time permits, I may also extract metadata from other image formats such as TIFF, PNG, and GIF (all of which could contain XMP metadata). Formats that take advantage of previously written code may take only a day, while others may take up to a week.
+== License Metadata Summary ==
-Also, time permitting, I could take my work and attempt to separate it into a stand-alone library. Doing so will be a constant consideration in the design of each stage of integration into Tracker. The feasibility of a stand-alone library, however, cannot be guaranteed, as it is contingent upon the requirements for successful integration into Tracker. At the very least, I would like my work to apply outside of Tracker, for example by indirectly promoting the embedding of license claims or by giving me the familiarity needed to write a library that extracts license claims outside of Tracker (such a library would be useful to Sidestream, which I mention in my bio).
+<table border="1">
+<tr>
+<td><strong>Format</strong></td><td><strong>Form of Metadata</strong></td><td><strong>Location of Metadata</strong></td><td><strong>Links</strong></td>
+</tr>
+<tr><td colspan="4"><strong>Audio</strong></td></tr>
+<tr>
+  <td>MP3</td>
+  <td>XMP / Native id3 tags</td>
+  <td>The PRIV,XMP field / WCOP tag</td>
+  <td>[http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf XMP Spec]
+[http://www.id3.org/id3v2.3.0 ID3v2.3 Spec]</td>
+</tr>
+<tr>
+  <td>Vorbis</td>
+  <td>XMP / Native comment field</td>
+  <td>XMP comment field / LICENSE comment field</td>
+  <td>[http://xiph.org/vorbis/doc/v-comment.html Ogg Vorbis Docs]</td>
+</tr>
+<tr>
+  <td>FLAC</td>
+  <td>Native comment fields (id3v2 or vorbis-style comments)</td>
+  <td>Same as with MP3 for id3v2 or Vorbis for vorbis-style comments</td>
+  <td>[http://flac.sourceforge.net/format.html#metadata_block_vorbis_comment FLAC Format Spec]</td>
+</tr>
+<tr>
+  <td>Monkey's Audio (APE)</td>
+  <td>Native Vorbis-like comment field</td>
+  <td>AFAIK, there is no standard tag spec</td>
+  <td></td>
+</tr>
+<tr><td colspan="4"><strong>Images</strong></td></tr>
+<tr>
+  <td>JPEG</td>
+  <td>XMP</td>
+  <td>APP1 Markers</td>
+  <td>[http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf XMP Spec]</td>
+</tr>
+<tr>
+  <td>JPEG 2000</td>
+  <td>XMP</td>
+  <td>UUID Box</td>
+  <td>[http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf XMP Spec]</td>
+</tr>
+<tr>
+  <td>TIFF</td>
+  <td>XMP</td>
+  <td>XMLPacket tag</td>
+  <td>[http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf XMP Spec]</td>
+</tr>
+<tr>
+  <td>PNG</td>
+  <td>XMP</td>
+  <td>iTXt, XML:com:adobe:xmp field</td>
+  <td>[http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf XMP Spec]</td>
+</tr>
+<tr>
+  <td>GIF</td>
+  <td>XMP</td>
+  <td>Application block</td>
+  <td>[http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf XMP Spec]</td>
+</tr>
+<tr>
+  <td>SVG</td>
+  <td>RDF</td>
+  <td>/svg/metadata/rdf</td>
+  <td>[http://wiki.creativecommons.org/SVG CC Wiki, SVG, based on Inkscape]</td>
+</tr>
+<tr>
+  <td>PSD (Adobe Photoshop)</td>
+  <td>XMP</td>
+  <td>Resource block</td>
+  <td>[http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf XMP Spec]</td>
+</tr>
+<tr><td colspan="4"><strong>Video</strong></td></tr>
+<tr>
+  <td>AVI</td>
+  <td>?</td>
+  <td>?</td>
+  <td>[http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf XMP Spec]</td>
+</tr>
+<tr>
+  <td>Matroska</td>
+  <td>Native tag</td>
+  <td>COPYRIGHT tag</td>
+  <td>[http://www.matroska.org/technical/specs/tagging/index.html Matroska Tagging Spec]</td>
+</tr>
+<tr>
+  <td>Quicktime</td>
+  <td>Native tag</td>
+  <td>kMDItemCopyright(old)/kUserDataTextCopyright(new) tag</td>
+  <td>[http://developer.apple.com/documentation/QuickTime/Conceptual/QT7UpdateGuide/Chapter03/chapter_3_section_1.html#//apple_ref/doc/uid/TP40001163-CH314-553378 Quicktime 7 API Reference]
+[http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf XMP Spec]</td>
+</tr>
+<tr>
+  <td>OGG</td>
+  <td>No metadata standard</td>
+  <td></td>
+  <td>[http://wiki.xiph.org/Metadata Ogg Metadata Draft]</td>
+</tr>
+<tr>
+  <td>Theora</td>
+  <td colspan="2">Theora comments (similar to Vorbis comments)</td>
+  <td>[http://www.theora.org/doc/Theora_I_spec.pdf Theora Spec]</td>
+</tr>
+<tr>
+  <td>Flash</td>
+  <td>RDF</td>
+  <td>?</td>
+  <td></td>
+</tr>
+<tr><td colspan="4"><strong>Documents</strong></td></tr>
+<tr>
+  <td>PDF</td>
+  <td>XMP</td>
+  <td>metadata field</td>
+  <td>[http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf XMP Spec]</td>
+</tr>
+<tr>
+  <td>Postscript/EPS</td>
+  <td>XMP</td>
+  <td>Document-level metadata</td>
+  <td>[http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf XMP Spec]</td>
+</tr>
+<tr>
+  <td>HTML</td>
+  <td>RDFa</td>
+  <td>&lt;a rel="license" href="..."&gt;&lt;/a&gt;</td>
+  <td>[http://wiki.creativecommons.org/RDFa CC Wiki, RDFa]</td>
+</tr>
+<tr>
+  <td>SMIL</td>
+  <td>RDF</td>
+  <td>/smil/head/metadata@id="meta-rdf"/RDF</td>
+  <td>[http://web.resource.org/cc/modules/smil/ CreativeCommons SMIL Module]</td>
+</tr>
+<tr>
+  <td>RSS 1.0</td>
+  <td colspan="2">/RDF/channel/license or /RDF/channel/item/license</td>
+  <td>[http://web.resource.org/rss/1.0/modules/cc/ CreativeCommons RSS 1.0 Module]</td>
+</tr>
+<tr>
+  <td>RSS 2.0</td>
+  <td colspan="2">/rss/channel/cc:license or /rss/channel/item/cc:license</td>
+  <td>[http://backend.userland.com/creativeCommonsRssModule CreativeCommons RSS 2.0 Module]</td>
+</tr>
+<tr>
+  <td>Atom</td>
+  <td colspan="2">/feed/entry/link@rel=license</td>
+  <td>[http://ietfreport.isoc.org/idref/draft-snell-atompub-feed-license/ Atom License Extension]</td>
+</tr>
+<tr>
+  <td>Any XML</td>
+  <td>XMP</td>
+  <td>Wherever valid</td>
+  <td>[http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf XMP Spec]</td>
+</tr>
+<tr>
+  <td>OpenOffice.org (OASIS)</td>
+  <td colspan="2">OO.org CC License Add-In SoC Project is working on the spec</td>
+  <td></td>
+</tr>
+<tr>
+  <td>MS Office (2003)</td>
+  <td>DocumentSummaryInformation Infile</td>
+  <td>CreativeCommons_LicenseURL property</td>
+  <td>[http://www.microsoft.com/downloads/details.aspx?FamilyID=113b53dd-1cc0-4fbe-9e1d-b91d07c76504&displaylang=en Office Add-in]</td>
+</tr>
+<tr>
+  <td>MS Office OpenXML (2007)</td>
+  <td>?</td>
+  <td>?</td>
+  <td>[http://www.ecma-international.org/publications/standards/Ecma-376.htm OpenXML Spec]
-LINKS
+[http://lists.ibiblio.org/pipermail/cc-devel/2007-June/000466.html Relevant mailing list post]</td>
+</tr>
-Related works:
+</table>
-  cclookup (http://wiki.creativecommons.org/CcLookup) - Python application for extracting license RDF metadata or license metadata from mp3s. Code may be adapted for parsing license claims in C
+== Indexing Licenses in Tracker Summary ==
-  ccpublisher (http://wiki.creativecommons.org/CcPublisher) - Licenses embedded by ccpublisher should all be correctly extracted.
-Related links:
+<table border="1">
-   http://creativecommons.org/technology/usingmarkup
+<tr>
-   http://wiki.creativecommons.org/RDFa
+<td><strong>Status</strong></td><td><strong>Format</strong></td><td><strong>Extraction Method</strong></td><td><strong>Test content</strong></td>
+</tr>
+<tr>
+  <td>Done, GStreamer patch pending</td>
+  <td>MP3</td>
+  <td>Reading native tags already complete.  Maybe extend GStreamer extractor to read XMP.</td>
+  <td>XMP embedded with Exempi / Tags embedded with id3v2</td>
+</tr>
+<tr>
+  <td>In progress</td>
+  <td>Vorbis</td>
+  <td>Extend the GStreamer extractor to check for the presence of an XMP comment field.  GStreamer places this within the EXTENDED_COMMENTS tag (requires GStreamer 0.10.10).</td>
+  <td>XMP embedded with vorbiscomment</td>
+</tr>
+<tr>
+  <td>Done, GStreamer patch pending</td>
+  <td>FLAC</td>
+  <td>Native tags already extracted through the GStreamer extractor.  Maybe extend GStreamer extractor to read XMP.</td>
+  <td>embedded with id3v2 or metaflac</td>
+</tr>
+<tr>
+  <td>In progress</td>
+  <td>JPEG</td>
+  <td>Extend the Imagemagick extractor, using 'convert file.jpg xmp:-' to read XMP</td>
+  <td>XMP embedded with Exempi</td>
+</tr>
+<tr>
+  <td>Done</td>
+  <td>TIFF</td>
+  <td>Extend the Imagemagick extractor, using 'convert file.tif xmp:-' to read XMP</td>
+   <td>XMP embedded with Exempi (Note: there's a bug in Adobe's XMP SDK that prevents Exempi from embedding valid XMP)</td>
+</tr>
+<tr>
+  <td>Done</td>
+  <td>PNG</td>
+  <td>Extend the PNG extractor, adding a check for XML:com:adobe:xmp.  (For backwards compatibility, the ability to read iTXt in libpng is disabled by default until version 1.3.)</td>
+  <td>XMP embedded with Exempi</td>
+</tr>
+<tr>
+  <td>In progress</td>
+  <td>GIF</td>
+  <td>Would need to write a GIF extractor</td>
+  <td>Palimpsest</td>
+</tr>
+<tr>
+  <td>Done</td>
+  <td>PDF</td>
+  <td>Extend the current PDF extractor (which uses Poppler) to read the metadata field.</td>
+  <td>XMP embedded with Exempi</td>
+</tr>
+<tr>
+  <td>Done</td>
+  <td>HTML</td>
+  <td>Write a new HTML extractor, using libxml2, and scan for RDFa</td>
+  <td>Various actual sites, including creativecommons.org</td>
+</tr>
+<tr>
+  <td>In progress</td>
+  <td>SVG</td>
+  <td>Parse the XML directly, checking for the RDF schema used by Inkscape.  Open question: should XMP be checked as well?</td>
+  <td>Inkscape</td>
+</tr>
+<tr>
+  <td></td>
+  <td>Any XML</td>
+  <td>Write a generic XML extractor (and/or extractor for each particular format), scanning with libxml2</td>
+  <td></td>
+</tr>
+<tr>
+  <td>Awaiting spec</td>
+  <td>OpenOffice.org (OASIS)</td>
+  <td>Extend OASIS extractor</td>
+   <td>OO.org Add-In</td>
+</tr>
+<tr>
+  <td>Done</td>
+  <td>MS Office (old format)</td>
+  <td>Extend existing msoffice extractor</td>
+  <td>[http://www.microsoft.com/downloads/details.aspx?FamilyID=113b53dd-1cc0-4fbe-9e1d-b91d07c76504&amp;displaylang=en MSOffice Add-in]</td>
+</tr>
+</table>