DiscoverEd Metadata

From Creative Commons
Revision as of 19:06, 9 June 2010 by Akozak

DiscoverEd is an experimental project from Creative Commons intended to explore how structured data may be used to enhance the search experience. Metadata about the resources, including the license and subject information available, are exposed in the search result set. We are particularly interested in open educational resources (OER) and are collaborating with other open education projects to improve search and discovery capabilities for OER, using DiscoverEd and other available tools. For in-depth details, read the white paper that describes the goals and design of DiscoverEd.

This page is a quick checklist for maximizing the discoverability of your resources in DiscoverEd and similarly designed search engines. Not all of these steps are necessary for inclusion in DiscoverEd. For example, structured data are not technically required for resources to appear in search results, but without them users of the search engine will see very little information about your resources.

Resource Feed

DiscoverEd uses resource feeds to direct its resource crawl. In order to index your educational resources, DiscoverEd needs the URL of an RSS or Atom feed that is limited to your educational resources. Few sites consist entirely of educational materials; most also include "About" pages, links to staff profiles, and so on. An index of educational resources should contain only actual educational materials, which reduces or eliminates the clutter that typically accompanies web-scale queries.

DiscoverEd consumes the feeds for each site that has been listed for inclusion. Your feed essentially provides a URL "road map" of your resources, which can then be used to run a directed crawl of the resources you curate. In other words, the crawler knows where the relevant resources are located because you, the curator, have pointed at them directly using the feed.
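The "road map" idea can be sketched in a few lines of Python. This is only an illustration of reading resource URLs out of an Atom feed, not the DiscoverEd crawler itself; the sample feed, the example.org URLs, and the function name `resource_urls` are hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

# Hypothetical two-entry feed; the example.org URLs are placeholders.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>OER Site</title>
  <entry>
    <title>Math 101</title>
    <link href="http://example.org/courses/math-101" />
  </entry>
  <entry>
    <title>Physics 101</title>
    <link rel="alternate" href="http://example.org/courses/physics-101" />
    <link rel="license" href="http://example.org/licenses/by" />
  </entry>
</feed>"""


def resource_urls(feed_xml):
    """Return the resource URL of each entry, in feed order."""
    root = ET.fromstring(feed_xml)
    urls = []
    for entry in root.findall(ATOM_NS + "entry"):
        for link in entry.findall(ATOM_NS + "link"):
            # In Atom, a link without a rel attribute is treated as rel="alternate".
            if link.get("rel", "alternate") == "alternate" and link.get("href"):
                urls.append(link.get("href"))
                break  # one resource URL per entry is enough for the crawl list
    return urls


print(resource_urls(SAMPLE_FEED))
```

Only the entry links are collected, so pages that are not listed in the feed never enter the crawl.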

Many curatorial sites already have feed functionality (RSS or Atom) or support the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). The MIT OpenCourseWare site, for example, allows you to subscribe to a feed of its courses, which means that you can get an update every time a course is added, deleted, or changed. This type of feed also usually contains a list of the URLs for every course already on the site. Both feeds and OAI-PMH also provide a convenient method of polling, allowing the system to periodically check for new resources. Once a feed is set up, the DiscoverEd system can be kept up to date with minimal oversight.
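As a rough illustration of polling, the sketch below compares the URLs found in the latest poll with those already indexed. The function name `new_resources` and the sample URLs are hypothetical, and how DiscoverEd actually tracks indexed resources is not described here.

```python
def new_resources(seen_urls, polled_urls):
    """Return URLs from the latest poll that are not indexed yet, in feed order."""
    seen = set(seen_urls)
    fresh = []
    for url in polled_urls:
        if url not in seen:
            fresh.append(url)
            seen.add(url)  # also drops duplicates within a single poll
    return fresh


indexed = ["http://example.org/courses/math-101"]
latest_poll = [
    "http://example.org/courses/math-101",
    "http://example.org/courses/physics-101",
]
print(new_resources(indexed, latest_poll))
```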

Resource Metadata

Once you have located the URL of a feed that is limited to your educational resources, a good next step toward increasing their discoverability is to provide metadata about those resources. We recommend XHTML+RDFa for metadata encoding and transport.

As a curator, you have certain goals for the resources you curate. Generally, you want curated resources to be as easy to find as possible. Core to this goal is enabling machines to detect and interpret metadata about the resources, such as title, language, or licensing terms, in a way that is interoperable with as many detection and interpretation methods as possible. Interoperability here means not only that different programs can read particular metadata properties, but also that the vocabularies themselves, which are sets of related properties, can evolve and be extended. It is also important that potential extensions be backward compatible: existing tools should not be disrupted when new properties are added. If possible, existing tools should even be able to handle basic aspects of new properties. This is precisely the kind of "interoperability of meaning" that RDF is designed to support.

For this and other reasons, the ideal method for metadata encoding/transport is XHTML+RDFa. We believe this has the broadest possible exposure for current and future software agents. For more information as to why we recommend and require RDFa for metadata transport, see the CC REL W3C specification and our white paper. For technical information on XHTML and RDFa, see the W3C RDFa Primer.

This section outlines some of the RDFa metadata Creative Commons is collecting for the DiscoverEd project and gives some examples of using RDFa in XHTML documents. These metadata are extracted from the document at crawl time. While our metadata store may include additional metadata information from resources, these fields are exposed by default in the search results:

  • Title
  • Summary
  • License
  • Education level
  • Language
  • Subject

Note about RDFa Vocabularies

Notice that each metadata label below is preceded by a prefix such as DCT, cc, or xhtml (the markup examples later on this page use dc for Dublin Core). In RDFa, these prefixes indicate which vocabulary defines each property, or metadata term. We recommend the Dublin Core vocabulary for the majority of properties because of its widespread adoption. For license, we recommend using the xhtml namespace because it is built into the XHTML specification and is equivalent to other definitions of the property.

Title (DCT:title)
A brief descriptive title for the resource.

Summary (DCT:description)
A relatively short summary or synopsis of the resource.

License (DCT:license, cc:license, xhtml:license)
The stable URL of the work's license. If you are using Creative Commons licenses, we also recommend following the CC REL specification for identifying further CC license metadata.

See the CC with syndication formats documentation for more information on including this in a bootstrap feed.

Education level (DCT:educationLevel)

What grade(s) or age-level(s) this material is suitable for. The education level should indicate all levels (student ages) for which the resource is deemed appropriate. Though we accept any descriptions that seem appropriate to you, please consider using one of the following schemas:

  • primary, secondary, tertiary, adult;
  • K,1,2,3,...,20 (where the number refers to the actual grade-level).

You may include equivalent terms as well by specifying more than one value for DCT:educationLevel. For example, you might include a separate DCT:educationLevel tag for 9, 10, and secondary.
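The two suggested schemas can be checked mechanically. The sketch below is a hypothetical helper, not part of DiscoverEd, that labels each education level value as belonging to the named schema, the grade schema, or neither; values outside both schemas are still acceptable, as noted above.

```python
NAMED_LEVELS = {"primary", "secondary", "tertiary", "adult"}
GRADE_LEVELS = {"K"} | {str(n) for n in range(1, 21)}  # K, 1, 2, ..., 20


def classify_level(value):
    """Return 'named', 'grade', or 'other' for one education level value."""
    text = value.strip()
    if text.lower() in NAMED_LEVELS:
        return "named"
    if text.upper() == "K" or text in GRADE_LEVELS:
        return "grade"
    return "other"  # free-form descriptions are allowed, just not normalized


# A resource tagged with equivalent terms from both schemas.
for level in ["9", "10", "secondary"]:
    print(level, "->", classify_level(level))
```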

Language (xml:lang, DCT:language)
The language(s) of the referenced resource (not of your site). When specifying the language for a resource, the value should be specified as described by RFC 4646. For example, use en for English. To distinguish English (United States) from English (United Kingdom), specify the language as en-US and en-GB, respectively.

In an Atom 1.0 feed, the language is specified as the xml:lang attribute of the content element. Multiple languages in a single entry are not supported.
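A quick sanity check for language values might look like the following. It deliberately covers only the simple "language" and "language-REGION" forms shown above (such as en and en-US), not the full RFC 4646 grammar with script, variant, or numeric region subtags; the function name is hypothetical.

```python
import re

# Two- or three-letter language code, optionally followed by a two-letter region.
_SIMPLE_TAG = re.compile(r"^([A-Za-z]{2,3})(?:-([A-Za-z]{2}))?$")


def normalize_language_tag(tag):
    """Return the tag in conventional casing (en, en-US), or None if it is not a simple tag."""
    match = _SIMPLE_TAG.match(tag.strip())
    if match is None:
        return None
    language, region = match.groups()
    if region is None:
        return language.lower()
    return language.lower() + "-" + region.upper()


for candidate in ["en", "EN-us", "en_US", "English"]:
    print(repr(candidate), "->", normalize_language_tag(candidate))
```

Tags are case-insensitive, so the helper accepts any casing and returns the customary lowercase language and uppercase region.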

Subject (DCT:subject)
The subject(s) of the resource; e.g., mathematics. The subject refers to the actual content in the resource; i.e., what the resource is about. For many resources, more than one subject will be necessary; in these cases, simply specify multiple subject elements. Ideally you should try to limit the contents of the subject to only those subjects that are objectively reflective of the entire resource. Other types of categories (opinions, metrics, etc.) may have other vocabularies available which are more appropriate.



The following is an example of how a resource could be annotated with machine-readable metadata, including license and attribution information. This is our preferred manner for encoding this information as it exposes the metadata to a much wider range of clients.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "">
<html xmlns="">
  <head>
    <title>OER Site</title>
  </head>
  <body>
    <h1 property="dc:title">Math 101</h1>
    <h2>by <a href="" property="dc:author cc:attributionName" rel="cc:attributionURL">John Q. Public</a></h2>
    <p property="dc:description">Basic mathematics for 5th graders</p>
    <p>Subjects: <span property="dc:subject">Math</span></p>
    <p>Grade level: <span property="dc:educationLevel">5</span></p>
    <p>Language: <span property="dc:language" content="en">English</span></p>
    <p>License: <a href="" rel="license">Attribution 3.0</a></p>

    <p>Lorem ipsum, etc, etc.</p>
  </body>
</html>
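To give a feel for how a client might read such annotations, here is a simplified Python sketch using the standard library's html.parser. It is not a conforming RDFa processor and not the DiscoverEd extractor: it assumes that elements carrying a property attribute contain only text, it ignores prefix declarations, and the class name and the example.org license URL are hypothetical.

```python
from html.parser import HTMLParser

SAMPLE_PAGE = """
<h1 property="dc:title">Math 101</h1>
<p property="dc:description">Basic mathematics for 5th graders</p>
<p>Subjects: <span property="dc:subject">Math</span></p>
<p>Language: <span property="dc:language" content="en">English</span></p>
<p>License: <a href="http://example.org/licenses/by" rel="license">Attribution</a></p>
"""


class PropertyCollector(HTMLParser):
    """Collect (name, value) pairs from property attributes and rel="license" links."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._open = None  # [property names, text parts] for the element being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("rel") == "license" and attrs.get("href"):
            self.pairs.append(("license", attrs["href"]))
        names = (attrs.get("property") or "").split()
        if not names:
            return
        if attrs.get("content") is not None:
            # An explicit content attribute overrides the element's visible text.
            self.pairs.extend((name, attrs["content"]) for name in names)
        else:
            self._open = [names, []]

    def handle_data(self, data):
        if self._open is not None:
            self._open[1].append(data)

    def handle_endtag(self, tag):
        # Assumes no nested markup inside an element that carries a property.
        if self._open is not None:
            names, parts = self._open
            text = "".join(parts).strip()
            self.pairs.extend((name, text) for name in names)
            self._open = None


collector = PropertyCollector()
collector.feed(SAMPLE_PAGE)
for name, value in collector.pairs:
    print(name, "=", value)
```

Note how the language value comes from the content attribute (en) rather than the human-readable text (English).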


If a site aggregates resources such that the metadata appear on a page other than the actual resource, the about attribute can be used to indicate that the metadata are about a different resource. For example, the following page could be published at a different URL and still refer to the same resource as the previous example:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "">
<html xmlns="">
  <head>
    <title>OER Site</title>
  </head>
  <body>
    <div about="">
      <h1 property="dc:title">Math 101</h1>
      <h2>by <span property="dc:author">John Q. Public</span></h2>
      <p property="dc:description">Basic mathematics for 5th graders</p>
      <p>Subjects: <span property="dc:subject">Math</span></p>
      <p>Grade level: <span property="dc:educationLevel">5</span></p>
      <p>Language: <span property="dc:language" content="en">English</span></p>
      <p>License: <a href="" rel="license">Attribution 3.0</a></p>
    </div>

    <p>Lorem ipsum, etc, etc.</p>
  </body>
</html>
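The effect of the about attribute can be illustrated with a small Python sketch that walks well-formed XHTML and attaches each property to the nearest enclosing about value, falling back to the page's own URL. This is a simplification of the RDFa subject rules (for instance, it does not resolve relative URLs), and the example.org URLs and function name are hypothetical.

```python
import xml.etree.ElementTree as ET

PAGE_URL = "http://example.org/catalog"  # hypothetical aggregator page

SAMPLE = """<body>
  <div about="http://example.org/courses/math-101">
    <h1 property="dc:title">Math 101</h1>
    <p>Grade level: <span property="dc:educationLevel">5</span></p>
  </div>
  <p property="dc:title">Course catalog</p>
</body>"""


def statements(element, subject):
    """Yield (subject, property, value) triples, honoring nested about scopes."""
    # An about attribute changes the subject for this element and its descendants.
    subject = element.get("about", subject)
    for name in element.get("property", "").split():
        value = element.get("content")
        if value is None:
            value = "".join(element.itertext()).strip()
        yield (subject, name, value)
    for child in element:
        yield from statements(child, subject)


triples = list(statements(ET.fromstring(SAMPLE), PAGE_URL))
for triple in triples:
    print(triple)
```

Properties inside the div describe the course, while the property outside it describes the aggregator page itself.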


Atom 1.0 Example

Here is a sample one-entry Atom 1.0 feed that implements the guidelines above. Note that including additional metadata in the feed is optional and considered inferior to including it with the resource itself using RDFa.

<feed xmlns="">
  <title>OER Aggregation Web Site</title>
  <link rel="self" href="" type="application/atom+xml" />
  <author>
    <name>John Q. Public</name>
  </author>
  <entry>
    <link href="" />
    <title>Math 101</title>
    <summary>Basic mathematics for 5th graders</summary>
    <link rel="license" href="" />
    <category term="dc:subject:Math" />
    <category term="dc:educationLevel:5" />
    <content type="xhtml" xml:lang="en">The content</content>
  </entry>
</feed>
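For readers who want to check their own feed, the following Python sketch pulls the same fields out of a single Atom entry. It assumes the category convention used in the sample (a term of the form prefix:property:value); the function name, the dictionary layout, and the example.org URLs are hypothetical, and this is not the code DiscoverEd runs.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE_ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom">
  <link href="http://example.org/courses/math-101" />
  <title>Math 101</title>
  <summary>Basic mathematics for 5th graders</summary>
  <link rel="license" href="http://example.org/licenses/by" />
  <category term="dc:subject:Math" />
  <category term="dc:educationLevel:5" />
  <content type="text" xml:lang="en">The content</content>
</entry>"""


def entry_metadata(entry_xml):
    """Return title, summary, license, language, and category metadata for one entry."""
    entry = ET.fromstring(entry_xml)
    meta = {
        "title": entry.findtext(ATOM + "title"),
        "summary": entry.findtext(ATOM + "summary"),
        "license": None,
        "language": None,
        "categories": {},
    }
    for link in entry.findall(ATOM + "link"):
        if link.get("rel") == "license":
            meta["license"] = link.get("href")
    content = entry.find(ATOM + "content")
    if content is not None:
        meta["language"] = content.get(XML_LANG)
    for category in entry.findall(ATOM + "category"):
        # "dc:subject:Math" -> property "dc:subject", value "Math"
        parts = category.get("term", "").split(":", 2)
        if len(parts) == 3:
            prop = parts[0] + ":" + parts[1]
            meta["categories"].setdefault(prop, []).append(parts[2])
    return meta


print(entry_metadata(SAMPLE_ENTRY))
```

Collecting category values into lists allows an entry to carry several subjects or education levels.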