Difference between revisions of "Tracker CC Indexing"

From Creative Commons
[[Category:Technology]]
[[Category:Developer]]
[[Category:metadata]]
[[Category:Tracker]]
{{template:Merge|Embedding Specifications}}
{{template:SMW}}

== Google Summer of Code Project: “Indexing Embedded License Claims in Tracker” ==
 
Here are some relevant (now revised) sections of the Summer of Code application:

== License Metadata Summary ==

<table border="1">
<tr>
<td><strong>Format</strong></td><td><strong>Form of Metadata</strong></td><td><strong>Location of Metadata</strong></td><td><strong>Links</strong></td>
</tr>
<tr><td colspan="4"><strong>Audio</strong></td></tr>
<tr>
  <td>MP3</td>
  <td>XMP / Native id3 tags</td>
  <td>The PRIV,XMP field / WCOP tag</td>
  <td>[http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf XMP Spec]
[http://www.id3.org/id3v2.3.0 ID3v2.3 Spec]</td>
</tr>
<tr>
  <td>Vorbis</td>
  <td>XMP / Native comment field</td>
  <td>XMP comment field / LICENSE comment field</td>
  <td>[http://xiph.org/vorbis/doc/v-comment.html Ogg Vorbis Docs]</td>
</tr>
<tr>
  <td>FLAC</td>
  <td>Native comment fields (id3v2 or vorbis-style comments)</td>
  <td>Same as with MP3 for id3v2 or Vorbis for vorbis-style comments</td>
  <td>[http://flac.sourceforge.net/format.html#metadata_block_vorbis_comment FLAC Format Spec]</td>
</tr>
<tr>
  <td>Monkey's Audio (APE)</td>
  <td>Native Vorbis-like comment field</td>
  <td>AFAIK, there is no standard tag spec</td>
  <td></td>
</tr>
<tr><td colspan="4"><strong>Images</strong></td></tr>
<tr>
  <td>JPEG</td>
  <td>XMP</td>
  <td>APP1 Markers</td>
  <td>[http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf XMP Spec]</td>
</tr>
<tr>
  <td>JPEG 2000</td>
  <td>XMP</td>
  <td>UUID Box</td>
  <td>[http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf XMP Spec]</td>
</tr>
<tr>
  <td>TIFF</td>
  <td>XMP</td>
  <td>XMLPacket tag</td>
  <td>[http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf XMP Spec]</td>
</tr>
<tr>
  <td>PNG</td>
  <td>XMP</td>
  <td>iTXt chunk with the keyword XML:com.adobe.xmp</td>
  <td>[http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf XMP Spec]</td>
</tr>
<tr>
  <td>GIF</td>
  <td>XMP</td>
  <td>Application block</td>
  <td>[http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf XMP Spec]</td>
</tr>
<tr>
  <td>SVG</td>
  <td>RDF</td>
  <td>/svg/metadata/rdf</td>
  <td>[http://wiki.creativecommons.org/SVG CC Wiki, SVG, based on Inkscape]</td>
</tr>
<tr>
  <td>PSD (Adobe Photoshop)</td>
  <td>XMP</td>
  <td>Resource block</td>
  <td>[http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf XMP Spec]</td>
</tr>
<tr><td colspan="4"><strong>Video</strong></td></tr>
<tr>
  <td>AVI</td>
  <td>?</td>
  <td>?</td>
  <td>[http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf XMP Spec]</td>
</tr>
<tr>
  <td>Matroska</td>
  <td>Native tag</td>
  <td>COPYRIGHT tag</td>
  <td>[http://www.matroska.org/technical/specs/tagging/index.html Matroska Tagging Spec]</td>
</tr>
<tr>
  <td>Quicktime</td>
  <td>Native tag</td>
  <td>kMDItemCopyright(old)/kUserDataTextCopyright(new) tag</td>
  <td>[http://developer.apple.com/documentation/QuickTime/Conceptual/QT7UpdateGuide/Chapter03/chapter_3_section_1.html#//apple_ref/doc/uid/TP40001163-CH314-553378 Quicktime 7 API Reference]
[http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf XMP Spec]</td>
</tr>
<tr>
  <td>OGG</td>
  <td>No metadata standard</td>
  <td></td>
  <td>[http://wiki.xiph.org/Metadata Ogg Metadata Draft]</td>
</tr>
<tr>
  <td>Theora</td>
  <td colspan="2">Theora comments (similar to Vorbis comments)</td>
  <td>[http://www.theora.org/doc/Theora_I_spec.pdf Theora Spec]</td>
</tr>
<tr>
  <td>Flash</td>
  <td>RDF</td>
  <td>?</td>
  <td></td>
</tr>
<tr><td colspan="4"><strong>Documents</strong></td></tr>
<tr>
  <td>PDF</td>
  <td>XMP</td>
  <td>metadata field</td>
  <td>[http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf XMP Spec]</td>
</tr>
<tr>
  <td>Postscript/EPS</td>
  <td>XMP</td>
  <td>Document-level metadata</td>
  <td>[http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf XMP Spec]</td>
</tr>
<tr>
  <td>HTML</td>
  <td>RDFa</td>
  <td>&lt;a rel="license" href="..."&gt;&lt;/a&gt;</td>
  <td>[http://wiki.creativecommons.org/RDFa CC Wiki, RDFa]</td>
</tr>
<tr>
  <td>SMIL</td>
  <td>RDF</td>
  <td>/smil/head/metadata@id="meta-rdf"/RDF</td>
  <td>[http://web.resource.org/cc/modules/smil/ CreativeCommons SMIL Module]</td>
</tr>
<tr>
  <td>RSS 1.0</td>
  <td colspan="2">/RDF/channel/license or /RDF/channel/item/license</td>
  <td>[http://web.resource.org/rss/1.0/modules/cc/ CreativeCommons RSS 1.0 Module]</td>
</tr>
<tr>
  <td>RSS 2.0</td>
  <td colspan="2">/rss/channel/cc:license or /rss/channel/item/cc:license</td>
  <td>[http://backend.userland.com/creativeCommonsRssModule CreativeCommons RSS 2.0 Module]</td>
</tr>
<tr>
  <td>Atom</td>
  <td colspan="2">/feed/entry/link@rel=license</td>
  <td>[http://ietfreport.isoc.org/idref/draft-snell-atompub-feed-license/ Atom License Extension]</td>
</tr>
<tr>
  <td>Any XML</td>
  <td>XMP</td>
  <td>Wherever valid</td>
  <td>[http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf XMP Spec]</td>
</tr>
<tr>
  <td>OpenOffice.org (OASIS)</td>
  <td colspan="2">OO.org CC License Add-In SoC Project is working on the spec</td>
  <td></td>
</tr>
<tr>
  <td>MS Office (2003)</td>
  <td>DocumentSummaryInformation Infile</td>
  <td>CreativeCommons_LicenseURL property</td>
  <td>[http://www.microsoft.com/downloads/details.aspx?FamilyID=113b53dd-1cc0-4fbe-9e1d-b91d07c76504&displaylang=en Office Add-in]</td>
</tr>
<tr>
  <td>MS Office OpenXML (2007)</td>
  <td>?</td>
  <td>?</td>
  <td>[http://www.ecma-international.org/publications/standards/Ecma-376.htm OpenXML Spec]
[http://lists.ibiblio.org/pipermail/cc-devel/2007-June/000466.html Relevant mailing list post]</td>
</tr>
</table>
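
As an illustration of the HTML row above, here is a minimal Python sketch that collects the targets of <code>&lt;a rel="license" href="..."&gt;</code> links. It is not the Tracker extractor; the names <code>LicenseLinkParser</code> and <code>find_license_links</code> are hypothetical, and only the standard library's <code>html.parser</code> is assumed. It treats <code>rel</code> as a space-separated token list and does not attempt full RDFa processing.

```python
from html.parser import HTMLParser


class LicenseLinkParser(HTMLParser):
    """Collect href values of <a> elements whose rel attribute includes "license"."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        # rel is a space-separated token list, e.g. rel="license nofollow".
        # A valueless attribute is reported as None, hence the "or".
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "license" in rel_tokens and attr.get("href"):
            self.licenses.append(attr["href"])


def find_license_links(html_text):
    """Return every license URL claimed in html_text, in document order."""
    parser = LicenseLinkParser()
    parser.feed(html_text)
    return parser.licenses


# Hypothetical sample page fragment used only for demonstration.
sample = (
    '<p>This page is licensed under a '
    '<a rel="license" href="http://creativecommons.org/licenses/by/3.0/">'
    'Creative Commons Attribution 3.0 License</a>.</p>'
)
print(find_license_links(sample))
```

The same token test would apply to any other element that carries <code>rel</code>, so extending the sketch is a matter of widening the tag check.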

== Indexing Licenses in Tracker Summary ==

<table border="1">
<tr>
<td><strong>Status</strong></td><td><strong>Format</strong></td><td><strong>Extraction Method</strong></td><td><strong>Test content</strong></td>
</tr>
<tr>
  <td>Done, GStreamer patch pending</td>
  <td>MP3</td>
  <td>Reading native tags already complete.  Maybe extend GStreamer extractor to read XMP.</td>
  <td>XMP embedded with Exempi / Tags embedded with id3v2</td>
</tr>
<tr>
  <td>In progress</td>
  <td>Vorbis</td>
  <td>Extend the GStreamer extractor to check for the presence of an XMP comment field.  GStreamer places this within the EXTENDED_COMMENTS tag (requires GStreamer 0.10.10).</td>
  <td>XMP embedded with vorbiscomment</td>
</tr>
<tr>
  <td>Done, GStreamer patch pending</td>
  <td>FLAC</td>
  <td>Native tags already extracted through the GStreamer extractor.  Maybe extend GStreamer extractor to read XMP.</td>
  <td>embedded with id3v2 or metaflac</td>
</tr>
<tr>
  <td>In progress</td>
  <td>JPEG</td>
  <td>Extend the ImageMagick extractor, using 'convert file.jpg xmp:-' to read XMP</td>
  <td>XMP embedded with Exempi</td>
</tr>
<tr>
  <td>Done</td>
  <td>TIFF</td>
  <td>Extend the ImageMagick extractor, using 'convert file.tif xmp:-' to read XMP</td>
  <td>XMP embedded with Exempi (Note: there's a bug in Adobe's XMP SDK that prevents Exempi from embedding valid XMP)</td>
</tr>
<tr>
  <td>Done</td>
  <td>PNG</td>
  <td>Extend the PNG extractor, adding a check for the XML:com.adobe.xmp keyword.  (For backwards compatibility, the ability to read iTXt in libpng is disabled by default until version 1.3.)</td>
  <td>XMP embedded with Exempi</td>
</tr>
<tr>
  <td>In progress</td>
  <td>GIF</td>
  <td>Would need to write a GIF extractor</td>
  <td>Palimpsest</td>
</tr>
<tr>
  <td>Done</td>
  <td>PDF</td>
  <td>Extend the current PDF extractor (which uses Poppler) to read the metadata field.</td>
  <td>XMP embedded with Exempi</td>
</tr>
<tr>
  <td>Done</td>
  <td>HTML</td>
  <td>Write a new HTML extractor, using libxml2, and scan for RDFa</td>
  <td>Various actual sites, including creativecommons.org</td>
</tr>
<tr>
  <td>In progress</td>
  <td>SVG</td>
  <td>I could specifically parse the XML, checking for the RDF schema used by Inkscape.  Should I check for XMP also???</td>
  <td>Inkscape</td>
</tr>
<tr>
  <td></td>
  <td>Any XML</td>
  <td>Write a generic XML extractor (and/or extractor for each particular format), scanning with libxml2</td>
  <td></td>
</tr>
<tr>
  <td>Awaiting spec</td>
  <td>OpenOffice.org (OASIS)</td>
  <td>Extend OASIS extractor</td>
  <td>OO.org Add-In</td>
</tr>
<tr>
  <td>Done</td>
  <td>MS Office (old format)</td>
  <td>Extend existing msoffice extractor</td>
  <td>[http://www.microsoft.com/downloads/details.aspx?FamilyID=113b53dd-1cc0-4fbe-9e1d-b91d07c76504&amp;displaylang=en MSOffice Add-in]</td>
</tr>
</table>
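
To make the PNG row more concrete, the following Python sketch walks the chunks of a PNG byte stream and returns the text of the first iTXt chunk carrying the XMP keyword. It is only an illustration under stated assumptions, not the libpng-based Tracker extractor: the keyword is written as <code>XML:com.adobe.xmp</code>, the spelling used by the XMP specification, and the function names and the sample byte stream are hypothetical. The sample is a bare signature plus two chunks, enough for the parser but not a displayable image, and CRCs are not verified.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
XMP_KEYWORD = b"XML:com.adobe.xmp"  # keyword spelling per the XMP specification


def iter_png_chunks(data):
    """Yield (chunk_type, chunk_data) pairs from raw PNG bytes."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos = 8
    while pos + 8 <= len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        yield ctype, data[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC


def extract_xmp_from_png(data):
    """Return the XMP packet as text, or None if no XMP iTXt chunk exists."""
    for ctype, body in iter_png_chunks(data):
        if ctype != b"iTXt":
            continue
        keyword, _, rest = body.partition(b"\x00")
        if keyword != XMP_KEYWORD or len(rest) < 2:
            continue
        compressed = rest[0] == 1  # rest[1] is the compression method
        # Skip the null-terminated language tag and translated keyword.
        _lang, _, rest = rest[2:].partition(b"\x00")
        _translated, _, text = rest.partition(b"\x00")
        if compressed:
            text = zlib.decompress(text)
        return text.decode("utf-8")
    return None


def make_chunk(ctype, body):
    """Build one PNG chunk (demo helper for the sample below)."""
    crc = zlib.crc32(ctype + body) & 0xFFFFFFFF
    return struct.pack(">I", len(body)) + ctype + body + struct.pack(">I", crc)


# Hypothetical, minimal byte stream for demonstration only.
xmp_packet = '<x:xmpmeta xmlns:x="adobe:ns:meta/"></x:xmpmeta>'
itxt_body = XMP_KEYWORD + b"\x00" + b"\x00\x00" + b"\x00" + b"\x00" + xmp_packet.encode("utf-8")
sample_png = PNG_SIGNATURE + make_chunk(b"iTXt", itxt_body) + make_chunk(b"IEND", b"")
print(extract_xmp_from_png(sample_png))
```

Once the packet text is in hand, the license claim itself still has to be read out of the RDF, which this sketch leaves to an XML or XMP library.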

Latest revision as of 18:45, 21 September 2007
