Difference between revisions of "Tracker CC Indexing"

Revision as of 00:07, 3 May 2007

Google Summer of Code Project: “Indexing Embedded License Claims in Tracker”

In progress...

Here are some relevant sections of the Summer of Code application:

Implementation

Formats to support:

 * MP3
 * OGG
 * RSS
 * SVG
 * HTML
 * JPEG
 * PDF
 * SMIL

With the exception of MP3, OGG, and RSS, all of the above formats can carry embedded metadata as XMP. A single XMP parser therefore covers most of the list, which reduces the burden of processing such a wide variety of formats.

XMP/RDF and RDFa:

Probably the most significant portion of the work will be processing XMP data, which is serialized as XML. Once the XML data is extracted, it will be passed to a shared piece of code that indexes each field independently of the file's format.

The following is an excerpt of raw XMP describing a work licensed under the CC Attribution 3.0 license.

<?xpacket begin='' id=''?>
 ...
 <rdf:Description rdf:about=''
  xmlns:dc='http://purl.org/dc/elements/1.1/'>
  <dc:rights>
   <rdf:Alt>
    <rdf:li xml:lang='x-default' >This work is licensed under a Creative Commons
 Attribution 3.0 License.</rdf:li>
   </rdf:Alt>
  </dc:rights>
 </rdf:Description>
 <rdf:Description rdf:about=''
  xmlns:cc='http://web.resource.org/cc/'>
  <cc:license rdf:resource='http://creativecommons.org/licenses/by/3.0/'/>
 </rdf:Description>
 ...
<?xpacket end='r'?>

To fit Tracker's metadata structure, the extracted license claim would be placed into the File:License field in a form such as: "This work is licensed under a Creative Commons Attribution 3.0 License (http://creativecommons.org/licenses/by/3.0/)". Such a format allows for searching the license field by both license name and URL.
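
The mapping from the XMP excerpt to a single license string can be sketched as follows. This is only an illustration in Python using the standard library, not the planned C implementation; the function name `license_field_from_rdf` and the `SAMPLE` document are hypothetical, and the sketch assumes the `rdf:RDF` element has already been isolated from the packet.

```python
import xml.etree.ElementTree as ET

# Namespace URIs taken from the XMP excerpt above.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
CC = "http://web.resource.org/cc/"


def license_field_from_rdf(rdf_xml):
    """Build a 'statement (url)' string from dc:rights and cc:license.

    Returns None when neither element is present. (Hypothetical helper.)
    """
    root = ET.fromstring(rdf_xml)

    statement = None
    for rights in root.iter("{%s}rights" % DC):
        for li in rights.iter("{%s}li" % RDF):
            # Collapse line breaks inside the text and drop a trailing period.
            text = " ".join((li.text or "").split()).rstrip(".")
            if text:
                statement = text
                break
        if statement:
            break

    url = None
    for lic in root.iter("{%s}license" % CC):
        url = lic.get("{%s}resource" % RDF)
        if url:
            break

    if statement and url:
        return "%s (%s)" % (statement, url)
    return statement or url


# Hypothetical sample mirroring the excerpt, wrapped in an rdf:RDF root.
SAMPLE = """<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
 <rdf:Description rdf:about=''
  xmlns:dc='http://purl.org/dc/elements/1.1/'>
  <dc:rights>
   <rdf:Alt>
    <rdf:li xml:lang='x-default'>This work is licensed under a Creative Commons
 Attribution 3.0 License.</rdf:li>
   </rdf:Alt>
  </dc:rights>
 </rdf:Description>
 <rdf:Description rdf:about=''
  xmlns:cc='http://web.resource.org/cc/'>
  <cc:license rdf:resource='http://creativecommons.org/licenses/by/3.0/'/>
 </rdf:Description>
</rdf:RDF>"""

print(license_field_from_rdf(SAMPLE))
```

Keeping both the human-readable statement and the URL in one string is what makes the field searchable by either license name or URL.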

License claims will similarly be extracted from RDFa metadata embedded in XML formats.
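
As a rough illustration of the RDFa case, the sketch below collects the targets of elements marked `rel="license"`, which is the common way a license claim is attached to a page. It is a simplification in Python, not a full RDFa processor: it ignores `about`, `property`, and prefixed names, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class LicenseLinkParser(HTMLParser):
    """Collect href values of elements whose rel attribute includes 'license'."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # rel may hold several space-separated tokens; a missing value is None.
        rels = (attr.get("rel") or "").split()
        if "license" in rels and attr.get("href"):
            self.licenses.append(attr["href"])


def find_rdfa_licenses(markup):
    """Return the license URLs claimed in a piece of (X)HTML markup."""
    parser = LicenseLinkParser()
    parser.feed(markup)
    parser.close()
    return parser.licenses


page = ('<p>Licensed under <a rel="license" '
        'href="http://creativecommons.org/licenses/by/3.0/">CC BY 3.0</a>.</p>')
print(find_rdfa_licenses(page))
```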

Timeline

By week 2: Write an XMP parser. Tracker includes examples of parsing XML using glib which I will also follow as a guide. (Based on recommendations/specifications at http://wiki.creativecommons.org/XMP)

By week 4: Add support to Tracker to automatically read and index XMP sidecar files, which will take advantage of the XMP parser. (Done by modifying src/trackerd/tracker-metadata.c)
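
Locating a sidecar file for a given media file might look like the sketch below. It is written in Python for brevity rather than in Tracker's C, and the two naming patterns it tries (`photo.xmp` and `photo.jpg.xmp` next to `photo.jpg`) are assumptions about common practice, not something Tracker defines; `find_sidecar` is a hypothetical name.

```python
from pathlib import Path


def find_sidecar(media_path):
    """Return the path of an XMP sidecar next to media_path, or None.

    Assumed naming patterns: 'name.xmp' and 'name.ext.xmp'.
    """
    media = Path(media_path)
    candidates = [
        media.with_suffix(".xmp"),              # photo.jpg -> photo.xmp
        media.with_name(media.name + ".xmp"),   # photo.jpg -> photo.jpg.xmp
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

Once a sidecar is found, its contents can be handed to the same XMP parser used for embedded packets, with the resulting fields attributed to the media file rather than to the sidecar itself.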

By week 5: Once Tracker indexes the metadata of independent XMP sidecar files, I will extend the extractors of various formats to index embedded XMP metadata. Again, the above XMP parser will be reused. Primary formats include HTML, SVG, and PDF.
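
Because an XMP packet is delimited by `<?xpacket begin=...?>` and `<?xpacket end=...?>` markers, an extractor can often find it by scanning the raw bytes of a file without understanding the container format. The Python sketch below illustrates the idea under two assumptions: the packet is stored as UTF-8, and it is not compressed or encrypted inside the file. The name `extract_xmp_packets` is hypothetical.

```python
import re

# Match everything between the xpacket begin and end processing instructions.
XPACKET_RE = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>",
    re.DOTALL,
)


def extract_xmp_packets(data):
    """Return the body of each XMP packet found in a bytes object."""
    return [
        match.group(1).decode("utf-8", errors="replace")
        for match in XPACKET_RE.finditer(data)
    ]


# Hypothetical file contents: binary data surrounding one small packet.
blob = b"\xff\xd8junk<?xpacket begin='' id=''?><x>hi</x><?xpacket end='r'?>\x00more"
print(extract_xmp_packets(blob))
```

The extracted packet body is then ordinary XML and can go through the same parser as a sidecar file.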

By week 7: Write a parser for RDFa for relevant formats. (Based on recommendations/specifications at http://wiki.creativecommons.org/RDFa)

By week 11: Additionally add support for new formats, such as SMIL and RSS. As time permits, I may also extract metadata from other image formats such as TIFF, PNG, and GIF (all of which could contain XMP metadata). Formats that take advantage of previously written code may take only a day, while others may take up to a week.

By week 12: Tie things up for submission of code

Links

Related works:

 cclookup (http://wiki.creativecommons.org/CcLookup) - Python application for extracting license RDF metadata or license metadata from MP3s. Its code may be adapted for parsing license claims in C.
 ccpublisher (http://wiki.creativecommons.org/CcPublisher) - Licenses embedded by ccpublisher should all be correctly extracted.

@@ Line 50: / Line 50: @@
 == Timeline ==
-Assuming ironing out of the design phase and a clear assessment of goals of the project, coding will begin with an XMP parser. At the moment, I am not aware of any C libraries that perform this function, although I may use ccLookup (written in Python) as a guide. Tracker includes examples of parsing XML using glib which I will also follow as a guide. This is particularly important aspect that the rest of my project will build off of. It should be done before the first month.
+<a target="_blank" href="http://www.google.com/calendar/render?cid=o7o9ms264j8p9h2doaqpdbikh8%40group.calendar.google.com"><img src="http://www.google.com/calendar/images/ext/gc_button1_en.gif" border=0></a>
-I then will add support to Tracker to automatically read and index XMP sidecar files (I have the okay from Jamie from Tracker), which will take advantage of the XMP parser. Once Tracker indexes the metadata of independent XMP sidecar files, I will go about extending the extractors of various formats to index embedded XMP metadata. I expect that many extractors will be trivial to implement while others will require more technical research of the formats. However, most formats already take the form of XML and so extracting the relevant XMP/RDF element will be straightforward.
+By week 2: Write an XMP parser. Tracker includes examples of parsing XML using glib which I will also follow as a guide.  (Based on recommendations/specifications at http://wiki.creativecommons.org/XMP)
-In addition to parsing XML for XMP/RDF metadata, I will also write a parser for RDFa for relevant formats.
+By week 4: Add support to Tracker to automatically read and index XMP sidecar files, which will take advantage of the XMP parser.  (Done by modifying src/trackerd/tracker-metadata.c)
-I will then additionally add support for new formats to Tracker, such as SMIL and RSS. As time permits, I may also extract metadata from other image formats such as TIFF, PNG, and GIF (all of which could contain XMP metadata). Formats that take advantage of previously written code may take only a day, while others may take up to a week.
+By week 5: Once Tracker indexes the metadata of independent XMP sidecar files, I will go about extending the extractors of various formats to index embedded XMP metadata.  Again, the above XMP parser will be utilized.  Primary formats include HTML, SVG, and PDF.
-Also, time permitting, I could take my work and attempt to separate it into a stand-alone library. Doing so will be a constant consideration in the design of each stage of integration into Tracker. The feasibility of a stand-alone library, however cannot be guaranteed, as the feasibility of doing so is contingent upon the requirements for successful integration into Tracker. At the very least, I would like my work to apply outside of Tracker, such as indirectly promoting embedding license claims or gaining familiarity in order to write a library to extract license claims from outside of Tracker (such a library would be useful to Sidestream, which I mention in my bio).
+By week 7: Write a parser for RDFa for relevant formats. (Based on recommendations/specifications at http://wiki.creativecommons.org/RDFa)
-LINKS
+By week 11: Additionally add support for new formats, such as SMIL and RSS. As time permits, I may also extract metadata from other image formats such as TIFF, PNG, and GIF (all of which could contain XMP metadata). Formats that take advantage of previously written code may take only a day, while others may take up to a week.
+By week 12: Tie things up for submission of code
+== Links ==
 Related works: