Creative Commons - User contributions [en]

Field Query Mapping

2010-06-29T15:41:32Z

Dithyramble: /* How to use */

{{DiscoverEd Specification
|contact=Asheesh Laroia
|project=AgShare
|status=In Development
}}
Operators of a DiscoverEd instance may wish to let users search specific metadata fields easily. For example, http://discovered.creativecommons.org/search/ lets users find works "tagged" with "banana" by searching for tag:banana. (The predicate behind "tag" is the term "subject" as specified by the Dublin Core.)

These prefixes, such as tag:, are currently hard-coded in DiscoverEd. This feature aims to move them into a configuration file.

This feature was defined and developed during the [[DiscoverEd Sprint (June, 2010)|June 2010 DiscoverEd Sprint]].

== Requirements ==

When DiscoverEd crawls feeds and resources and saves metadata such as the page title, it converts this information into RDF triples; those triples are eventually saved on disk in a triple store, namely Jena.

We will create a new configuration file that stores a list of mappings from predicate URIs to search prefixes. For example, we might list "method:" as a shorthand for the RDF predicate <http://purl.org/dc/terms/instructionalMethod>, a.k.a. "dct:instructionalMethod". At indexing time, a Lucene column called "method" will be created in the Lucene document for each resource that has the dct:instructionalMethod predicate set in the Jena store.

Then, at search time, Nutch's built-in query parser handles the query, e.g., "method:yaddayadda".
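The indexing step described above can be sketched roughly as follows. This is a minimal illustration, not the DiscoverEd IndexFilter: the class <code>PredicateFieldMapper</code>, its nested <code>Triple</code> type, and the method <code>fieldsFor</code> are hypothetical names, and the triples are passed in as a plain list rather than read from Jena.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: turns RDF triples into (field, value) pairs for one resource. */
class PredicateFieldMapper {
    /** A bare-bones triple; the real code would read these from the Jena store. */
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    /** Predicate URI -> Lucene field name, e.g. dct:instructionalMethod -> "method". */
    private final Map<String, String> predicateToField;

    PredicateFieldMapper(Map<String, String> predicateToField) {
        this.predicateToField = predicateToField;
    }

    /** Returns {field, value} pairs for each triple about resourceUri with a mapped predicate. */
    List<String[]> fieldsFor(String resourceUri, List<Triple> triples) {
        List<String[]> fields = new ArrayList<String[]>();
        for (Triple t : triples) {
            String field = predicateToField.get(t.predicate);
            if (field != null && t.subject.equals(resourceUri)) {
                fields.add(new String[] { field, t.object });
            }
        }
        return fields;
    }
}
```

Each returned pair would then be added to the Lucene document for that resource, so that a query such as method:yaddayadda has a field to match against.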

== How to use ==

Let's say you want to allow users to perform this query:

<blockquote><pre>
method:"Experiential learning"
</pre></blockquote>

and retrieve all web pages in your index that have a metadatum with predicate <http://purl.org/dc/terms/instructionalMethod> and value "Experiential learning".

To do so, first edit <code>conf/nutch-site.xml</code>. Add this XML inside the <configuration> block.

<blockquote><pre>
<property>
<name>query.basic.method.boost</name>
<value>1.0</value>
</property>
</pre></blockquote>

This block of XML tells Nutch to accept the "method:" prefix in search queries. The value of this property indicates the weight the search engine should assign to this term.

Next, edit <code>conf/discovered-search-prefixes.xml</code>. Add this XML inside the <configuration> block.

<blockquote><pre>
<property>
<name>http://purl.org/dc/terms/instructionalMethod</name>
<value>method</value>
</property>
</pre></blockquote>

This block of XML tells DiscoverEd to copy matching values out of the Jena store and into the Lucene index, where Nutch's basic query parser can find them.
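To make the file format concrete, here is a minimal sketch of reading such a prefixes file into a map from predicate URI to prefix. It uses only the JDK's DOM parser; the class name <code>SearchPrefixConfig</code> and its <code>parse</code> method are hypothetical, and the sketch assumes every <property> has both a <name> and a <value>. DiscoverEd itself may well load the file differently.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical sketch: reads predicate-to-prefix mappings from configuration XML. */
class SearchPrefixConfig {
    /** Parses property entries whose name is a predicate URI and whose value is a prefix. */
    static Map<String, String> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> mapping = new LinkedHashMap<String, String>();
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element p = (Element) props.item(i);
                // Assumes each property holds exactly one name and one value element.
                String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
                String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
                mapping.put(name, value);
            }
            return mapping;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse prefix configuration", e);
        }
    }
}
```

With the example above, the resulting map would have the single key http://purl.org/dc/terms/instructionalMethod mapped to "method".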

== Implementation ==

* Added a sample configuration file
* Added code to our IndexFilter that looks for relevant triples and stores them in the Lucene document
* Problem: The Lucene documents do not seem to show our column, so we're going back to the drawing board and carefully reading the [http://wiki.apache.org/nutch/HowToMakeCustomSearch relevant Nutch documentation] to make sure we're using the APIs correctly

== Deferred until later ==

* Handling provenance with regard to this. Based on the current plan for how to handle curator exclusion, and using the above example, we have to make sure that instead of adding merely the column "method", we add something like "curator1:method", "curator2:method", and so on. (This may be out of date; see the spec for Excluding curators.)

Field Query Mapping

2010-06-29T15:40:25Z

Dithyramble: /* How to use */


Field Query Mapping

2010-06-29T15:26:40Z

Dithyramble: Explain how to use field query mapping


User Supplied Metadata

2010-06-23T04:26:03Z

Dithyramble: /* What needs to be done */

{{DiscoverEd Specification
|contact=Raphael Krut-Landau
|project=AgShare
|status=In Development
}}

== The story from the user's point of view ==

A moment a bit like this is fairly common. You've asked a search engine to tell you what it knows about a particular query. When the engine returns with its listing of results, you see a particular result that could be categorized more usefully. You want to tell the search engine: bring up this result when a user searches for such-and-such a word.

== Requirements ==

In this feature, we allow you, as a user of DiscoverEd, to associate a tag with a search result that you see on your screen. Next to all search results there is a small link reading "Add a tag"; click this to open a brightly colored box where you can enter the tag. The box has a small "submit" link; click this and you immediately see the word alongside all the other tags that the engine associates with the result, if there were any.

== Implementation ==

The brightly colored box mentioned above is an HTML form. The POST handler which accepts the user's submission creates (if necessary) a new Jena triple store whose URI represents the person who filled in the form. The handler then inserts a new RDF triple into this triple store:

result_uri, dct:subject, tag

(Side note: The word "subject" above might confuse you a bit if you are into RDF. In RDF, "subject" usually means the subject of a triple (subject, predicate, object). In the Dublin Core terms (DCT), subject means a ''topic''. We use it here to mean "is tagged with".)

We want to ensure that this new tag appears whenever anybody now or in the future chances upon the search result in question using this particular installation of the DiscoverEd search engine. Here's how the engine will do that. From time to time, a webmaster asks his copy of DiscoverEd to "crawl" — that is, to download copies of web pages from the internet and put their text, and other information about them, into the search engine's Lucene database. We want to make sure that the user-submitted tag is included among that information we store in Lucene.

So there will be a bit of code that runs whenever you ask DiscoverEd to perform a crawl. During the crawl, when we are inserting information about a particular URL into the Lucene database, this code looks in all the Jena triple stores for any tags associated with that URL. It then adds each tag as a new field in the Lucene document for that URL. In the parlance of Lucene, this amounts to a new column (or, you could say, a new kind of field), named something like 18__dct_subject. The number 18 signifies the user who submitted the tag via the brightly colored box mentioned above.
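As a rough sketch of that crawl-time step, the code below collects the tags for one URL from several per-user stores and names each field after the user who supplied it. Everything here is an assumption made for illustration: the class <code>UserTagFields</code>, the nested maps standing in for the Jena triple stores, and the exact naming rule (user number, two underscores, then the prefixed predicate with its colon replaced by an underscore), which is inferred only from the single example 18__dct_subject.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: gathers user-supplied tags for one URL as Lucene-style fields. */
class UserTagFields {
    /** Builds a name such as "18__dct_subject" from a tagger id and a prefixed predicate. */
    static String fieldName(int taggerId, String prefixedPredicate) {
        return taggerId + "__" + prefixedPredicate.replace(':', '_');
    }

    /**
     * storesByTagger maps tagger id -> (resource URL -> tags) and stands in for the
     * per-user Jena triple stores. Returns {fieldName, tag} pairs for the given URL,
     * ordered by tagger id.
     */
    static List<String[]> tagFieldsFor(String url,
            Map<Integer, Map<String, List<String>>> storesByTagger) {
        List<String[]> fields = new ArrayList<String[]>();
        Map<Integer, Map<String, List<String>>> ordered =
                new TreeMap<Integer, Map<String, List<String>>>(storesByTagger);
        for (Map.Entry<Integer, Map<String, List<String>>> store : ordered.entrySet()) {
            List<String> tags = store.getValue().get(url);
            if (tags == null) {
                continue; // this user has not tagged the URL
            }
            for (String tag : tags) {
                fields.add(new String[] { fieldName(store.getKey(), "dct:subject"), tag });
            }
        }
        return fields;
    }
}
```

Keeping the user number in the field name means the index still records who supplied each tag, which is what later provenance-aware filtering would need.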

== What works ==

Look in the branch <tt>add_tagging_form</tt> (at time of writing, this pointed to [http://gitorious.org/discovered/repo/commit/a2af4aea3270e4a663abc2eb89c310e1ab5148c8 a2af4aea3270e4a663abc2eb89c310e1ab5148c8]).

* We can add a tag to the RdfStore and retrieve it, using the bean API for both adding and retrieving. (Nothing crazy-special.)
* The search results JSP has the add-a-tag form

== What needs to be done ==
* Make <tt>org.creativecommons.learn.test.AddATag.testCheckThatResourceIsSearchableViaTag</tt> pass.
* Write a test that the HTML form submits to a POST handler which adds a tag to the RdfStore. This code adds a tag: <tt>org.creativecommons.learn.Tag.add(taggerURI, resourceURI, tag);</tt>

User Supplied Metadata

2010-06-23T04:24:02Z

Dithyramble: /* What needs to be done */

Dithyramble: beginning of a write up of a spec

{{DiscoverEd Specification
|contact=Raphael Krut-Landau
|project=AgShare
|status=In Development
}}

== The story from the user's point of view ==

A moment a bit like this is fairly common. You've asked a search engine to tell you what it knows about a particular query, "sustainability water" for instance. When the engine returns with its listing of results, you see a particular result that could be categorized more effectively. You want to teach the engine a new fact about one of those ecological pages.

== Requirements ==

In this feature, we allow you, as a user of DiscoverEd, to associate a new word with a search result that you see on your screen. Next to all search results there is a small link reading "Add a tag"; click this to open a brightly colored rectangular box where you can enter a new word. The box has a small "submit" link; click this and you immediately see the word alongside all the other tags that the engine associates with the result, if there were any.

== Implementation ==

The brightly colored box mentioned above is an HTML form. The POST handler which accepts the user's submission writes a new RDF quad to the Jena quad store, consisting of four strings:

submitter_uri, result_uri, dct:subject, tag

Note that the word "subject" above might confuse you a bit if you are into RDF. In RDF, "subject" usually means the subject of a triple (subject, predicate, object). In the Dublin Core terms (DCT), subject means a ''topic''. We use it to pick out the concept, 'is tagged with'.

We want to ensure that this new tag appears whenever anybody now or in the future chances upon the search result in question using this particular installation of the DiscoverEd search engine. Here's how the engine will do that. From time to time, a webmaster asks his copy of DiscoverEd to "crawl" — that is, to download copies of web pages from the internet and put their text, and other information about them, into the search engine's Lucene database. We want to make sure that the user-submitted tag is included among that information we store in Lucene.

So there will be a bit of code that runs whenever you ask DiscoverEd to perform a crawl. This code looks in the Jena quad store for any tags stored there. It then adds these tags to Lucene. In the parlance of Lucene, it adds a new column (or, you could say, a new kind of field). The column is named something like 18__dct_subject, where 18 signifies the user who submitted a tag via the brightly colored box mentioned above.

Field Query Mapping

2010-06-22T15:05:42Z

Dithyramble:

Field Query Mapping

2010-06-22T14:58:47Z

Dithyramble: disambig "subject"

DiscoverEd/Meetings/2010/06/21

2010-06-21T18:19:16Z

Dithyramble:

Asheesh, Nathan and Raffi were on this phone call.

== Sprint Follow-up ==
* Outstanding tasks
** Asheesh writes up a team report for his team (-:
*** structured as a spec page
** Raffi writes up a team report for the tag-adding team
* Issues from sprint
** NY has been refactoring the RdfStore, and was sad to see half-finished refactorings in the codebase. Going forward, we should pay more attention to these refactorings. When we add a new helper method that simply calls an existing method, maybe we can simply replace the old method. That way we could avoid leaving both the old and new versions in the class.
* Tests
** running from Ant: Asheesh and Raffi will confirm that the tests in the branch "next" do pass. Nathan will push an ant target he wrote once his laptop is resuscitated. AL and RKL will then use ant to run the tests.
** source tree separation (src/tests/... instead of src/java/...)
*** This seems to be the pattern Nathan has observed in large Java projects. It also allows you to easily create a "run-time" that excludes your testing code.

What we were working on before the sprint:
== Excluding a curator from a search ==
* We had written a large test, and were in the process of breaking it up into smaller pieces which could be individually tested. - Raffi
* That test was called "MinusCurator", and it sort of overwhelmingly failed. Asheesh began work with Tim on migrating the TripleStoreIndexer to use the new document.add(String, String) method from Nutch rather than LuceneWriter.add(Field, String). The latter is deprecated, and moreover seems to not quite work. Yesterday Asheesh began writing a few helper methods and tests in a branch to help complete this migration.

== Next steps ==

The current goal is to make sure TripleStoreIndexer works. It's pretty deeply broken if we know we can't write a single TripleStore-based value into Lucene. Raffi pointed out that the tag-addition team has a test for this which he thinks already passes.

After that, we will work on landing Tim's and Asheesh's code from the sprint, namely the work on creating new Lucene columns that represent particular RDF predicates, controlled simply by a configuration file.

[http://piratepad.net/5OhqF55lTk The history of this document lives here at PiratePad]

DiscoverEd Sprint (June, 2010)

2010-06-14T16:34:16Z

Dithyramble: /* Attendees */ fix minor typo

[[Category:DiscoverEd]]

== Overview ==

* '''What:''' A sprint on development of [[DiscoverEd]]; see [[:Category:DiscoverEd_Specification|DiscoverEd Specifications]] for possible areas of work
* '''When:''' Tuesday, June 15 through Thursday, June 17, 2010
* '''Where:''' [http://vudat.msu.edu/location/ Wills House], Michigan State University, East Lansing, MI ([http://maps.google.com/maps?f=q&source=s_q&hl=en&geocode=&q=101+Wills+House,+east+lansing,+mi&sll=37.0625,-95.677068&sspn=41.818029,58.447266&ie=UTF8&hq=&hnear=Wills+House,+East+Lansing,+Ingham,+Michigan+48823&z=14 map])

== Attendees ==

* Asheesh Laroia (OpenHatch / Creative Commons)
* Raphael Krut-Landau (OpenHatch / Creative Commons)
* [[User:Nathan Yergler|Nathan Yergler]] (Creative Commons)
* Alex Kozak (Creative Commons)
* Ali Asad Lotia (open.michigan)
* Kevin Coffman (open.michigan)
* ''add your name and affiliation here''

== Travel & Accommodations ==

* [http://vudat.msu.edu/directions/ Directions to Wills House]
** Brendan Guenther will provide MSU Guest Parking Passes when you arrive in the morning
* Area Hotels (mention MSU for discounted rate)
** [http://www.marriott.com/hotels/travel/lants-towneplace-suites-east-lansing/ TownePlace Suites by Marriott]
** [http://www.hamptoninn.com/en/hp/hotels/index.jhtml?ctyhocn=LANETHX Hampton Inn]

== Agenda ==

''This is a draft, subject to change.''

=== Tuesday ===

* 9:00 AM - Welcome
* DiscoverEd Context: Why, When, Where are we going? (10 min)
* Introductions<br/>Developers introduce themselves, give brief statement on what they've done with DiscoverEd, what they're interested in working on. (5 min ea.)
* MSU Context: AgShare, FSKN, etc (Chris Geith, 10-15 min)
* 10:30 AM - Identify Themes, Possible Blocks of Work
* 11:30 AM - Pair up and begin work
* 4:45 PM - Brief report back from each group: unexpected issues, things to bring to the group as a whole, etc.

=== Wednesday ===

No schedule; work as pairs.

=== Thursday ===

* 3:30 PM - Pairs begin to make sure work is pushed to Gitorious, determine next steps
* 4:00 PM - Full report back: pairs report on progress and state of their code. Pairs are encouraged to identify follow up steps they will take post sprint to see tasks to completion.

== Preparation ==

In order to minimize time spent configuring laptops, etc, please try to do the following before arriving at the sprint:

# Make sure you have prerequisite software installed and working:
#* git (Windows users, see http://progit.org/book/ch1-4.html and http://code.google.com/p/msysgit/)
#* Java 1.6 JDK
#* Eclipse (not required, but can make life easier)
# Generate an SSH public key (if needed; see http://progit.org/book/ch4-3.html for some instructions)
# Create a [http://gitorious.org Gitorious] account and add your SSH key to it

=== Code Preparation ===

''TBD''

Linked Data Curation

2010-06-09T14:29:17Z

Dithyramble: minor typo

{{DiscoverEd Specification
|contact=Nathan Yergler
|project=DiscoverEd
|status=Draft
}}
DiscoverEd currently relies on feeds or OAI-PMH to aggregate resources from a curator. Third-party curators may wish to provide a list of resources with additional metadata without the overhead or restrictions of providing feeds or OAI-PMH support. This specification describes a linked, structured data model for curating resources for DiscoverEd without feeds or OAI-PMH.

== Requirements ==

== Resources ==

* [http://www.openarchives.org/ore/ OAI-ORE] (Object Reuse & Exchange)

Resource Analytics

2010-06-09T14:28:19Z

Dithyramble: /* Requirements */ minor typo

{{DiscoverEd Specification
|contact=Nathan Yergler
|project=AgShare
|status=Draft
}}
Curators (both publishers and third party) are interested in verifying that their resources are ingested correctly, how often they are searched for, and how often they are used. Operators are interested in exploring searches that are not successful, and how users interact with DiscoverEd. Analytics will provide information about indexed resources, searches performed, and user activity in a DiscoverEd instance.

== Requirements ==

* Provide a web-based dashboard which displays the status of curator indexes. This includes: last aggregation, last crawl, number of resources indexed, any errors which occurred during aggregation or crawl.
* Provide a web page which displays search analytics. This includes basic web analytics, such as number of visitors, bounce rate, referrers, and time spent on site. It also includes DiscoverEd-specific analytics, including:
** searches grouped by number of results (for exploring queries that have no results),
** popular search terms, and
** resource refinements (i.e., do users refine by curator? subject? license?)

This may also include resource click-through tracking and reporting.

Metadata Provenance

2010-06-09T14:26:23Z

Dithyramble: "short coming" -> "shortcoming"

{{DiscoverEd Specification
|contact=Nathan Yergler
|project=AgShare
|status=In Development
}}
{{Draft}}

The initial version of DiscoverEd does not include provenance support. Provenance means tracking the source of resource metadata. Due to this limitation, DiscoverEd has limited ability to filter by curator. While you can filter for resources with a specific curator, the remaining search terms are not limited to metadata provided by that curator. This is a significant shortcoming for resources with multiple curators.

== Requirements ==

* The provenance of metadata discovered through RSS, Atom, and OAI-PMH is stored in the RDF Store.
* Metadata extracted from structured data is stored with provenance reflecting the page it was extracted from.
* Users can filter a query to exclude a curator, and metadata provided by that curator is not considered for other query terms. For example, "<code>-curator:http://example.org subject:biology cells</code>" would return results containing the term "cells", with the subject tag "biology" provided by a curator <strong>other than</strong> http://example.org.
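The exclusion rule in the last requirement can be illustrated with a small sketch. It treats each piece of metadata as a (curator, field, value) statement and asks whether any curator other than the excluded one supplied the wanted value. The class <code>CuratorExclusion</code>, its <code>Statement</code> type, and the method <code>matchesExcluding</code> are hypothetical; this shows only the intended semantics, not how DiscoverEd or Lucene would evaluate the query.

```java
import java.util.List;

/** Hypothetical sketch of the "exclude a curator" matching rule. */
class CuratorExclusion {
    /** One piece of metadata together with the curator who provided it. */
    static final class Statement {
        final String curator;
        final String field;
        final String value;

        Statement(String curator, String field, String value) {
            this.curator = curator;
            this.field = field;
            this.value = value;
        }
    }

    /** True if a curator other than excludedCurator asserted field = value. */
    static boolean matchesExcluding(List<Statement> metadata, String excludedCurator,
            String field, String value) {
        for (Statement s : metadata) {
            if (!s.curator.equals(excludedCurator)
                    && s.field.equals(field)
                    && s.value.equals(value)) {
                return true;
            }
        }
        return false;
    }
}
```

Under this rule, a resource whose only "biology" subject tag came from http://example.org would not match the example query, while one tagged "biology" by any other curator would.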

DiscoverEd/Development notes

2010-06-07T18:15:15Z

Dithyramble: Created page with 'To start using Eclipse on another computer, you'll need to go File > Import and specify the root directory in the repo (We actually haven't totally checked that this works yet.)'

To start using Eclipse on another computer, you'll need to go to File > Import and specify the root directory of the repo.

(We actually haven't totally checked that this works yet.)

2010-05-07T18:46:41Z

Dithyramble: +cat +stub

[[Category:DiscoverEd]]
{{Stub}}

=== Check out and build the source code ===
<pre>
$ git clone git://gitorious.org/discovered/repo.git discovered
$ cd discovered
$ ant
</pre>

=== Add a curator and a feed ===

DiscoverEd uses feeds to help identify resources to crawl. Feeds are provided by curators, who can also provide metadata about resources.

<pre>
$ ./bin/feeds addcurator "ND OCW" http://ocw.nd.edu/
$ ./bin/feeds addfeed rss http://ocw.nd.edu/front-page/courselist/rss http://ocw.nd.edu/
</pre>

=== Aggregate and crawl resources ===

<pre>
$ ./bin/feeds aggregate
$ mkdir seed
$ ./bin/feeds seed > seed/urls.txt
$ ant -f dedbuild.xml crawl
</pre>

=== Run the web server ===

DiscoverEd/Install manually

2010-05-07T18:45:51Z

Dithyramble: This text was moved from DiscoverEd Quickstart
