DiscoverEd Metadata

Latest revision as of 18:10, 9 June 2010

This is a basic guide to increasing the discoverability of online educational resources by preparing them for inclusion into search engines that utilize structured data, like DiscoverEd. This guide contains technical language and sample XHTML and RDFa.

DiscoverEd is an experimental project from Creative Commons intended to explore how structured data may be used to enhance the search experience. Metadata about the resources, including the license and subject information available, are exposed in the search result set. We are particularly interested in open educational resources (OER) and are collaborating with other open education projects to improve search and discovery capabilities for OER, using DiscoverEd and other available tools. For in-depth details, read the white paper that describes the goals and design of DiscoverEd.

This page is meant to be a quick checklist for maximizing the discoverability of your resources in DiscoverEd and similarly designed search engines. Not all of these steps are necessary for inclusion into DiscoverEd. For example, structured data are not technically required for resources to be included in search results, but without them users of the search engine will be provided with very little information about your resources.

Resource Feed

DiscoverEd uses resource feeds to direct its resource crawl. To index your educational resources, DiscoverEd needs the URL of an RSS or Atom feed that is limited to those resources. Few sites consist entirely of educational materials; most also contain "About" pages, links to staff profiles, and so on. An index of educational resources should contain only actual educational materials, which reduces or eliminates the clutter that typically accompanies web-scale queries.

DiscoverEd consumes the feeds for each site that has been listed for inclusion. Your feed essentially provides a URL "road map" of your resources, which can then be used to run a directed crawl of the resources you curate. In other words, the crawler knows where the relevant resources are located because you, the curator, have pointed at them directly using the feed.

Many curatorial sites already have feed functionality (RSS or Atom) or support the Open Archives Initiative's Protocol for Metadata Harvesting (OAI-PMH). The MIT OpenCourseWare site, for example, allows you to subscribe to a feed of the courses, which means that you can get an update every time a course is added, deleted, or changed. This type of feed usually also contains a list of the URLs for every course already on the site. Both feeds and OAI-PMH provide a convenient method of polling, allowing the system to periodically check for new resources. Once a feed is set up, the DiscoverEd system can be kept up to date with minimal oversight.
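
As a rough illustration, a feed limited to educational resources could be as small as the RSS 2.0 sketch below. The site name, course titles, and all URLs are hypothetical placeholders on example.org; only the element names come from the RSS 2.0 format, and nothing here is specific to how DiscoverEd itself reads feeds.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <!-- Required channel elements: title, link, description -->
    <title>Example OCW: Courses</title>
    <link>http://ocw.example.org/courses/</link>
    <description>Educational resources only; no "About" or staff pages.</description>

    <!-- One item per educational resource; each link is a crawl target -->
    <item>
      <title>Math 101</title>
      <link>http://ocw.example.org/math/101</link>
    </item>
    <item>
      <title>Biology 201</title>
      <link>http://ocw.example.org/biology/201</link>
    </item>
  </channel>
</rss>
```

Keeping non-educational pages out of the feed is what makes the crawl "directed": every link the crawler follows is one the curator chose to list.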

Resource Metadata

Once you have located the URL of a feed that is limited to your educational resources, a good next step toward increasing their discoverability is to provide metadata about those resources. We recommend XHTML+RDFa for metadata encoding and transport.

As a curator, you have certain goals for the resources you curate. Generally, you want curated resources to be as easy to find as possible. Core to this goal is enabling machines to detect and interpret metadata about the resources, such as title, language, or licensing terms, in a way that is interoperable with as many detection and interpretation methods as possible. Interoperability here means not only that different programs can read particular metadata properties, but also that the vocabularies themselves, which are sets of related properties, can evolve and be extended. It is also important that potential extensions be backward compatible: existing tools should not be disrupted when new properties are added. If possible, existing tools should even be able to handle basic aspects of new properties. This is precisely the kind of "interoperability of meaning" that RDF is designed to support.

For this and other reasons, the ideal method for metadata encoding/transport is XHTML+RDFa. We believe this has the broadest possible exposure for current and future software agents. For more information as to why we recommend and require RDFa for metadata transport, see the CC REL W3C specification and our white paper. For technical information on XHTML and RDFa, see the W3C RDFa Primer.

This section outlines some of the RDFa metadata Creative Commons is collecting for the DiscoverEd project and gives some examples of using RDFa in XHTML documents. These metadata are extracted from the document at crawl time. While our metadata store may include additional metadata from resources, only the following fields are exposed by default in the search results:

  • Title
  • Summary
  • License
  • Education level
  • Language
  • Subject

Title (DCT:title)
A brief descriptive title for the resource.

Summary (DCT:description)
A relatively short summary or synopsis of the resource.

License (DCT:license, cc:license, xhtml:license)
The stable URL of the work's license; e.g., http://creativecommons.org/licenses/by/3.0/. If you are using Creative Commons licenses, we also recommend following the CC REL specification for identifying further CC license metadata.
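
Each of the three license properties listed above can be written as a rel value on a link in XHTML+RDFa. The fragment below is a sketch, assuming the dc and cc prefixes are bound to http://purl.org/dc/terms/ and http://creativecommons.org/ns# on an enclosing element, as in the examples later on this page. Any one of the three forms is enough.

```xml
<!-- xhtml:license: the reserved keyword "license" needs no prefix -->
<a rel="license" href="http://creativecommons.org/licenses/by/3.0/">Attribution 3.0</a>

<!-- cc:license: requires xmlns:cc="http://creativecommons.org/ns#" -->
<a rel="cc:license" href="http://creativecommons.org/licenses/by/3.0/">Attribution 3.0</a>

<!-- DCT:license: requires xmlns:dc="http://purl.org/dc/terms/" -->
<a rel="dc:license" href="http://creativecommons.org/licenses/by/3.0/">Attribution 3.0</a>
```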

See the CC with syndication formats documentation for more information on including this in a bootstrap feed.

Education level (DCT:educationLevel)
What grade(s) or age-level(s) this material is suitable for. The education level should indicate all levels (student ages) for which the resource is deemed appropriate. Though we accept any descriptions that seem appropriate to you, please consider using one of the following schemas:

  • primary, secondary, tertiary, adult;
  • K,1,2,3,...,20 (where the number refers to the actual grade-level).

You may include equivalent terms as well by specifying more than one value for DCT:educationLevel. For example, you might include a separate DCT:educationLevel value for each of 9, 10, and secondary.
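
In XHTML+RDFa, multiple education levels might be marked up as in the sketch below: one element per value, each carrying the same property. The fragment assumes the dc prefix is bound to http://purl.org/dc/terms/ on an enclosing element; the surrounding paragraph text is only illustrative.

```xml
<p>Grade level:
  <span property="dc:educationLevel">9</span>,
  <span property="dc:educationLevel">10</span>
  (<span property="dc:educationLevel">secondary</span>)
</p>
```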

Language (xml:lang, DCT:language)
The language(s) of the referenced resource (not of your site). Specify the value as described by RFC 4646; for example, en for English. To distinguish English (United States) from English (United Kingdom), specify en-US and en-GB, respectively.

In an Atom 1.0 feed, the language is specified as the xml:lang attribute of the content element. Multiple languages in a single entry are not supported.
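
The two fragments below sketch how a region-specific language tag could appear in each format. The visible text is a placeholder, and the RDFa fragment assumes the dc prefix is bound to http://purl.org/dc/terms/ on an enclosing element.

```xml
<!-- XHTML+RDFa: the content attribute carries the machine-readable tag,
     while the element text stays human-readable -->
<p>Language: <span property="dc:language" content="en-GB">English (United Kingdom)</span></p>

<!-- Atom 1.0: xml:lang on the entry's content element -->
<content type="text" xml:lang="en-US">Basic mathematics for 5th graders</content>
```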

Subject (DCT:subject)
The subject(s) of the resource; e.g., mathematics. The subject refers to the actual content in the resource; i.e., what the resource is about. For many resources, more than one subject will be necessary; in these cases, simply specify multiple subject elements. Ideally you should try to limit the contents of the subject to only those subjects that are objectively reflective of the entire resource. Other types of categories (opinions, metrics, etc.) may have other vocabularies available which are more appropriate.
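
A resource with more than one subject might be marked up as in the following sketch, with one element per subject. The subject values are illustrative only, and the dc prefix is assumed to be bound to http://purl.org/dc/terms/ on an enclosing element.

```xml
<p>Subjects:
  <span property="dc:subject">Mathematics</span>,
  <span property="dc:subject">Arithmetic</span>
</p>
```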

Note about RDFa Vocabularies

Notice that each metadata label is preceded by a prefix of either dc or xhtml. In the RDFa specification, these prefixes indicate which vocabulary defines the properties, or metadata terms. We recommend the Dublin Core vocabulary for the majority of properties because of its widespread adoption. For license, we recommend using the xhtml namespace because it is built into the XHTML specification and is equivalent to other definitions of the property.

Examples

[X]HTML + RDFa

The following is an example of how a resource at http://ocw.example.org/math/101 could be annotated with machine-readable metadata, including license and attribution information. This is our preferred manner for encoding this information as it exposes the metadata to a much wider range of clients.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:cc="http://creativecommons.org/ns#">
  <head>
   <title>OER Site</title>
  </head>

  <body>
     <h1 property="dc:title">Math 101</h1>
     <h2>by <a href="http://example.org/~johnq" property="dc:author cc:attributionName" rel="cc:attributionURL">John Q. Public</a></h2>
     <p property="dc:description">Basic mathematics for 5th graders</p>
     <p>Subjects: <span property="dc:subject">Math</span></p>
     <p>Grade level: <span property="dc:educationLevel">5</span></p>
     <p>Language: <span property="dc:language" content="en">English</span></p>
     <p>License: <a href="http://creativecommons.org/licenses/by/3.0/" rel="license">Attribution 3.0</a></p>

     <p>Lorem ipsum, etc, etc.</p>

  </body>
</html>

If a site aggregates resources such that the metadata appear on a page other than the actual resource, the about attribute can be used to indicate that the metadata are about a different resource. For example, the following page could be published at http://commons.oer.example.org/math/101 and still refer to the same resource as the previous example:


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/">
  <head>
   <title>OER Site</title>
  </head>

  <body>
     <div about="http://ocw.example.org/math/101">
       <h1 property="dc:title">Math 101</h1>
       <h2>by <span property="dc:author">John Q. Public</span></h2>
       <p property="dc:description">Basic mathematics for 5th graders</p>
       <p>Subjects: <span property="dc:subject">Math</span></p>
       <p>Grade level: <span property="dc:educationLevel">5</span></p>
       <p>Language: <span property="dc:language" content="en">English</span></p>
       <p>License: <a href="http://creativecommons.org/licenses/by/3.0/" rel="license">Attribution 3.0</a></p>
     </div>

     <p>Lorem ipsum, etc, etc.</p>

  </body>
</html>

Atom 1.0 Example

Here is a sample one-entry Atom 1.0 feed that implements the guidelines above. Note that including additional metadata in the feed is optional and is considered inferior to including it with the resource itself using RDFa.

<feed xmlns="http://www.w3.org/2005/Atom">
  <id>http://oersite.example.org/cc/</id>
  <title>OER Aggregation Web Site</title>
  <updated>2008-01-16T12:00:00Z</updated>
  <link rel="self" href="http://oersite.example.org/cc/atom.xml" type="application/atom+xml" />
  <author>
    <name>John Q. Public</name>
    <email>webmaster@oersite.org</email>
  </author>
  <entry>
    <id>tag:ocw.org,2007-10-15:/math/101</id>
    <updated>2007-10-15T12:00:00Z</updated>
    <link href="http://ocw.example.org/math/101" />
    <title>Math 101</title>
    <summary>Basic mathematics for 5th graders</summary>
    <link rel="license" href="http://creativecommons.org/licenses/by/3.0/" />
    <category term="dc:subject:Math" />
    <category term="dc:educationLevel:5" />
    <content type="xhtml" xml:lang="en"><div xmlns="http://www.w3.org/1999/xhtml">The content</div></content>
  </entry>
</feed>