Field Query Mapping

Latest revision as of 19:24, 30 August 2010

Contact: Asheesh Laroia
Project: AgShare
Status: In Development

Operators of a DiscoverEd instance may wish to let users search specific metadata fields easily. For example, http://discovered.labs.creativecommons.org/ lets users search for works "tagged" with "banana" by searching for tag:banana. (The predicate behind "tag" is the term "subject" as specified by the Dublin Core.)

Prefixes such as tag: are currently hard-coded in DiscoverEd. This feature aims to move them into a configuration file.

This feature was defined and developed during the June 2010 DiscoverEd Sprint.

Requirements

When DiscoverEd crawls feeds and resources and saves metadata such as the page title, it converts this information into RDF triples; those triples are eventually saved on disk in a triple store, namely Jena.

We will create a new configuration file that stores a list of mappings from predicate URIs to search prefixes. For example, we might list "method:" as shorthand for the RDF predicate <http://purl.org/dc/terms/instructionalMethod>, a.k.a. "dct:instructionalMethod". At indexing time, a Lucene column (a "field" in Lucene's own terminology) called "method" will be created in the Lucene document for each resource that has the dct:instructionalMethod predicate set in the Jena store.

Then, at search time, Nutch's built-in query parser handles the query, e.g., "method:yaddayadda".
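
The indexing and search steps described above can be sketched roughly as follows. This is an illustrative Python model only, not DiscoverEd's actual Java code: the function names build_fields and matches, the example resource URI, and the in-memory triple list are all assumptions made for the sketch, and the matcher is a naive stand-in for Nutch's query parser.

```python
from collections import defaultdict

# Hypothetical in-memory version of the mapping: predicate URI -> search prefix.
PREFIXES = {
    "http://purl.org/dc/terms/instructionalMethod": "method",
}


def build_fields(resource_uri, triples, prefixes=PREFIXES):
    """Collect index fields for one resource from (subject, predicate, object) triples.

    Only predicates listed in the mapping produce a field; the rest are ignored.
    """
    fields = defaultdict(list)
    for subject, predicate, obj in triples:
        if subject == resource_uri and predicate in prefixes:
            fields[prefixes[predicate]].append(obj)
    return dict(fields)


def matches(fields, query):
    """Naive stand-in for the query parser: handles only prefix:value or prefix:"value"."""
    prefix, _, value = query.partition(":")
    return value.strip('"') in fields.get(prefix, [])


# Example data (hypothetical resource and triples).
triples = [
    ("http://example.org/lesson",
     "http://purl.org/dc/terms/instructionalMethod",
     "Experiential learning"),
    ("http://example.org/lesson",
     "http://purl.org/dc/terms/title",
     "A lesson"),
]
doc = build_fields("http://example.org/lesson", triples)
```

Here the title triple is dropped because its predicate is not in the mapping, while the instructionalMethod triple becomes a value in the "method" field that a method: query can then match.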

How to use

Note: There is an implementation of this in the current version of DiscoverEd (as of 2010-08-30), but it ignores the excludecurator argument.

Let's say you want to allow users to perform this query:

 method:"Experiential learning"

and retrieve all web pages in your index that have a metadatum with predicate <http://purl.org/dc/terms/instructionalMethod> and value "Experiential learning".

To do so, first edit conf/nutch-site.xml. Add this XML inside the <configuration> block.

 <property>
     <name>query.basic.method.boost</name>
     <value>1.0</value>
 </property>

This block of XML tells Nutch to accept the "method:" prefix in search queries. The value of this property indicates the weight the search engine should assign to this term.

Next, edit conf/discovered-search-prefixes.xml. Add this XML inside the <configuration> block.

 <property>
     <name>http://purl.org/dc/terms/instructionalMethod</name>
     <value>method</value>
 </property>

This block of XML tells DiscoverEd to copy the values of this predicate out of the Jena store and into a Lucene column named "method", where Nutch's basic query parser can find them.
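
As a rough check that the two files stay in sync, the prefix mapping can be read with a few lines of Python. This is a sketch under assumptions: load_prefixes and required_boost_properties are hypothetical helper names, the sample text simply wraps the property shown above in a <configuration> block, and the boost property name follows the query.basic.method.boost pattern used earlier on this page.

```python
import xml.etree.ElementTree as ET

# Sample content in the shape of conf/discovered-search-prefixes.xml.
SAMPLE = """<configuration>
<property>
    <name>http://purl.org/dc/terms/instructionalMethod</name>
    <value>method</value>
</property>
</configuration>"""


def load_prefixes(xml_text):
    """Return {predicate URI: prefix} from <property> entries in the given XML text."""
    root = ET.fromstring(xml_text)
    mapping = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name and value:  # skip incomplete entries
            mapping[name.strip()] = value.strip()
    return mapping


def required_boost_properties(prefixes):
    """List the nutch-site.xml property names each prefix needs, one per prefix."""
    return sorted("query.basic.%s.boost" % p for p in set(prefixes.values()))


prefixes = load_prefixes(SAMPLE)
boosts = required_boost_properties(prefixes)
```

Comparing the names in boosts against the properties actually present in conf/nutch-site.xml would reveal a prefix that is mapped but not yet accepted by the query parser.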

Implementation

  • Added a sample configuration file
  • Added code to our IndexFilter that looks for relevant triples and stores them in the Lucene document
  • Fixed the underlying bug that kept these columns from appearing in Lucene.

Next steps

  • Rewrite this feature to be compatible with excludecurator.