[[Category:DiscoverEd]]

The AgShare deployment works analogously to the CC Labs deployment of DiscoverEd. Some important things to note:

* Username: '''agshare'''
* Host name: '''search.agshare.org''' (currently the same as discovered.labs.creativecommons.org)

So, for example, to set up your environment, do:

$ sudo su - agshare

Given that, give [[Running DiscoverEd]] a look!

== Deploying new WARs ==

To deploy a new war, do this:

* rm -rf ~/tomcat/webapps/ROOT
* cp nutch-1.1.war ~/tomcat/webapps/ROOT.war

Then restart Tomcat.
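
The two steps above can be wrapped in a small helper so the unpacked app is never removed when the new WAR is missing. This is only a sketch: <code>deploy_war</code> is a hypothetical function name, not something shipped with DiscoverEd, and the default webapps path is assumed from the commands above.

```shell
#!/bin/sh
# Sketch only: deploy_war is a hypothetical helper, not part of DiscoverEd.
# Usage: deploy_war WAR_FILE [WEBAPPS_DIR]
deploy_war() {
    war=$1
    # Assumed default, taken from the manual steps above.
    webapps=${2:-$HOME/tomcat/webapps}

    # Refuse to touch the current deployment if the new WAR is missing.
    if [ ! -f "$war" ]; then
        echo "deploy_war: no such file: $war" >&2
        return 1
    fi

    # Remove the unpacked app so Tomcat re-expands the new WAR on restart.
    rm -rf "$webapps/ROOT"
    cp "$war" "$webapps/ROOT.war"
}
```

For example, <code>deploy_war nutch-1.1.war</code> performs the same two steps; Tomcat still needs to be restarted afterwards.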

== Restarting Tomcat ==

The AgShare deployment uses a Tomcat instance in its $HOME (supported by the tomcat6-instance-create script). It's wrapped as "/etc/init.d/agshare" so the boot process can use it. But you can restart it this way:

* ~/tomcat/bin/shutdown.sh
* ~/tomcat/bin/startup.sh

== Starting Tomcat at boot ==

/etc/rc.local contains a call to run ~/tomcat/bin/startup.sh as the agshare user. That's kind of hackish, I realize.

== Piwik analytics ==

We use a self-hosted package called [http://piwik.org/ Piwik] to record search engine queries and measure traffic to the website. All the data stays with us.

You can use the [http://search.agshare.org/static/piwik/piwik/index.php Piwik admin interface] to view the stats, if you have an account. If you want an account, talk to Nathan.

=== Piwik general configuration ===

* Configuration: It uses a MySQL database. You can see the details in the Piwik configuration file.
* Path on the server: '''/var/www/search.agshare.org/www/static/piwik/piwik'''
* Web serving: Apache + mod_php5 serve it up. We set up '''/var/www/search.agshare.org/www/static''' to be served by Apache; you can see that in /etc/apache2/sites-available/search.agshare.org.

To get piwik running, we had to add piwik to the default template. See the "changes" section below for more info.

=== Site search ===

We added the [http://github.com/BeezyT/piwik-sitesearch sitesearch plugin] (still in beta; see [http://dev.piwik.org/trac/ticket/49 this Piwik ticket]) to let us analyze site search.

The site search plugin requires that we:
* Change the default translations so that they
* Configure it: In the [http://search.agshare.org/static/piwik/piwik/index.php?module=SiteSearch&action=admin&idSite=1&period=day&date=yesterday Site Search settings], I set the "Search URL" to "search.jsp" (no leading slash) and the "Search Parameter" to "query". This matches [http://search.agshare.org/search.jsp?query=body queries like this].

Piwik SiteSearch can keep track of the number of results that the search engine returns for each query. To do that, it needs to be able to "scrape" the information out of the web page, or alternatively have the servlet provide it. I chose the "scrape" option. I implemented that in [http://gitorious.org/+discovereders/discovered/agshare-live/commit/4ffdd225670c9af3e57d686111907aa5e5d150fe a commit].

== Version control ==

The AgShare deployment's git repository can be [http://gitorious.org/+discovereders/discovered/agshare-live/ found on Gitorious].

When you want to back up the AgShare deployment's git state, just do:

$ git push mirror --mirror

2010-10-05T15:00:03Z

Paulproteus: /* Monday, August 16, 2010 */

[[Category:DiscoverEd]]
__TOC__

= Tue Oct 5, 2010 =

* Nathan, Asheesh: http://openetherpad.org/2010-10-05

= Monday, August 16, 2010 =

* Nathan, Asheesh: http://piratepad.net/vMJnNgqP5q

= Tuesday, August 10, 2010 =

* Nathan, Asheesh: http://piratepad.net/F3pDrLemz4

= Monday, June 28, 2010 =

* Nathan, Asheesh, and Raffi: [http://piratepad.net/rOBKTtpocp Notes]
* TripleStoreIndexer
** NY thinks this is fixed, want to sync up https://www.pivotaltracker.com/story/show/4001001
** This is working in next and master
** Accepted Pivotal story
* MinusCurator
** status update
*** Still in progress
* Field Mapping work (from sprint)
** Was blocked by TripleStoreIndexer
** Going to work on for 3 hours, landing work from sprint
* MakeSeed
** NY thought this was fixed, but it's open in Pivotal ( https://www.pivotaltracker.com/story/show/3888957 )
** Need to double check other verbs
* Priorities:
** Field Mapping
** Make sure feeds verbs handle provenance correctly
** MinusCurator

= Monday, June 21, 2010 =

* Nathan, Asheesh and Raffi: [http://piratepad.net/5OhqF55lTk Notes]

= Monday, June 7, 2010 =

AL: Branch cleanup

A new developer should begin to add work on top of master.

The branch "provenance_tests" contains work on OAI-PMH and RDFa. It's also where we'll add the work on minus-curator.

"provenance_tests" will become "next"

AL: OAI-PMH provenance implementation

Defer until after the minus-curator implementation.

RKL: RDFa extraction and indexing

We can parse the RDFa title by adding the RDFa-parsing logic in the *wrong* place (in the feed updating method). I've tried to move it into the *right* place (a plugin, so we don't hit the network twice) but the plugin isn't being executed. I must have configured the plugin incorrectly.
Raffi should look in the Hadoop log, not standard out, for his logging message.
Raffi should push the branch to Nathan and he'll sanity-check the configuration.

AL/RKL: -curator status

Let's get this as far along as possible before the sprint!

NRY: "Feature" pages in CC wiki

In preparation for the sprint next week, Brendan from MSU asked if we could create general "Feature" write-ups for things we're working on for AgShare. Each would describe one feature (e.g., Provenance, Analytics), so he can evaluate where it intersects with his FSKN work.

I think each Feature maps to one or more Stories in Pivotal. My initial stub is at http://wiki.creativecommons.org/Curator_Filtering (probably going to rename to Provenance or something like that).

= Monday, June 2, 2010 =

== Things Asheesh wants ==

* We work more clearly out of Pivotal Tracker.
** While on the phone, Asheesh updated it.

== Things Nathan wants done before the sprint ==

* Branch cleanup
** Asheesh just did this.
* Data imported through OAI-PMH has the provenance of its feed. Damn the torpedos^Wtests.

Then we'll work on the "minus curator" story in Pivotal Tracker.

== Thursday departure planning ==

At some point, Nathan wants to figure out his Thursday departure planning on the sprint's last day.

AgShare/Tech

2010-09-17T22:31:34Z

Paulproteus: /* Deploying new WARs */

AgShare/Tech

2010-09-17T22:28:49Z

Paulproteus:

Running DiscoverEd

2010-09-17T22:24:23Z

Paulproteus: /* Crawl */

[[Category:DiscoverEd]]

{{Infobox|This page contains raw documentation, some of which is only applicable to our DiscoverEd deployment. It will be massaged into more general docs in the fullness of time.}}

== Instructions for running a crawl ==

Tips:
* For long aggregates and crawls, run in 'screen'.

Three phases to the process of updating the index:
# Aggregation (polling feeds old and new)
# crawling
# merging (merging the new index with the existing one).

=== Set up environment ===

Execute these commands to set up your environment for running the tools. It also places you in the discovered user's account.

<pre>
$ sudo su - discovered
$ cd code
</pre>

=== Managing Feeds ===

The feeds script (./bin/feeds) allows you to add curators or feeds.
Running it without parameters will show the sub-commands.
Feeds and curators are identified by URL (and yes, it's picky -- http://example.org is not the same as http://example.org/ ).

==== Notes ====

*For each feed packaged by an OPML feed, the curator is set by the feed title. The OPML consumer will only add the curator/feed if the feed isn't already in the system.

*If you add a feed that already exists, you'll just overwrite the old one (since it's a triple store and the URI is the identifier). The same goes for curators; they're also identified by URI. It's more likely you'd get two curators, but so long as you're dealing with the same feed URL you won't get dupes.

=== Aggregation ===

Aggregation polls the feeds and adds new resources to the triple store. It will also poll any OPML feeds and add the new feeds it finds.

<pre>$ ./bin/feeds aggregate</pre>

=== Crawl ===

Before you crawl you need to make a seed which tells the crawler what to retrieve.

If the directory "seed/" does not exist, create it with

<pre>
mkdir seed
</pre>

Then create the seed list of URLs:

<pre>
$ ./bin/feeds seed > ./seed/crawl-urls.txt
</pre>

When the crawl runs it will look in ./seed/ and open every file it finds there, expecting to find one URL per line (so remove files when you don't want them to be crawled).
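
Because every file in ./seed/ is read, a stray file or a malformed line ends up in the crawl. A quick pre-flight check could look like the sketch below. <code>check_seeds</code> is a hypothetical helper, and it assumes that seed URLs always start with http:// or https:// and contain no whitespace; adjust the pattern if your feeds produce other schemes.

```shell
#!/bin/sh
# Sketch only: check_seeds is a hypothetical helper, not part of DiscoverEd.
# Prints file:line:content for every line that is not a bare http(s) URL
# and returns non-zero if any such line was found.
# Usage: check_seeds [SEED_DIR]
check_seeds() {
    dir=${1:-./seed}
    bad=0
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # -v selects lines that do NOT match; -n adds line numbers.
        # The extra /dev/null argument makes grep prefix the file name.
        if grep -n -v -E '^https?://[^[:space:]]+$' "$f" /dev/null; then
            bad=1
        fi
    done
    return "$bad"
}
```

Running <code>check_seeds</code> from the code directory before starting the crawl lists any offending lines.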

To run the actual crawl do:
<pre>
$ ./bin/crawl_and_merge.sh
</pre>

Finally, restart Tomcat (the Java app server) to make sure the new index is being used:

<pre>
$ sudo /etc/init.d/tomcat6 restart
</pre>

== Managing curators and feeds ==

On discovered.labs.creativecommons.org in the $HOME/code directory, running ./bin/feeds with no parameters shows the list of subcommands:

<pre>
listfeeds list all feeds
listcurators list all curators
addfeed add a feed
resetfeed reset the last aggregation date for a feed
addcurator add a curator
rmfeed remove a feed
setcurator set the curator for a feed
aggregate
dump
seed
</pre>

Each one is run as an argument to the feeds script (i.e. ./bin/feeds [command] [parameter1] [parameter2]...)

=== addfeed ===
<pre>
addfeed [feed_type] [feed_url] [curator_url]
</pre>

Assuming you've added the curator for this feed with addcurator, the curator URL will set the curator (so you don't need to set it again with setcurator).

Feed type notes: "rss" is a parser that does RSS/Atom sniffing.

=== addcurator ===
<pre>
addcurator [curator_name] [curator_url]
</pre>

Curator names with spaces should be surrounded by quotation marks (e.g. addcurator "CC Open Textbook Project" http://www.collegeopentextbooks.org/)

=== setcurator ===
<pre>
setcurator [feed_url] [curator_url]
</pre>

== Deploying new WARs ==

To deploy a new war, do this:

* sudo rm -rf /var/lib/tomcat6/webapps/search/ # clear the existing app to force redeployment
* sudo cp nutch-1.1.war /var/lib/tomcat6/webapps/search.war
* sudo /etc/init.d/tomcat6 restart

== Things the server administrator should know ==

=== JAVA_HOME ===

Many of our scripts require a JAVA_HOME environment variable to be set. For our convenience, we configured ''discovered.labs.creativecommons.org'' to have JAVA_HOME set for every user. We did that by adding this to '''/etc/profile''':

JAVA_HOME=/usr/lib/jvm/java-6-openjdk/ ; export JAVA_HOME
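
A script that depends on this setting can fail early with a clear message instead of a confusing Java error. The following is a minimal sketch; <code>require_java_home</code> is a hypothetical name, and the check for <code>bin/java</code> assumes a standard JDK or JRE layout.

```shell
#!/bin/sh
# Sketch only: require_java_home is a hypothetical helper.
# Returns non-zero (with a message on stderr) if JAVA_HOME is unusable.
require_java_home() {
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set; check /etc/profile" >&2
        return 1
    fi
    # Assumes the usual layout where the launcher lives in bin/java.
    if [ ! -x "$JAVA_HOME/bin/java" ]; then
        echo "JAVA_HOME=$JAVA_HOME has no executable bin/java" >&2
        return 1
    fi
}
```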

=== Maximum open files ===

Tomcat and Nutch sometimes have problems opening files. This is because they've exceeded the number of open files that a process can have.

To address this, we added this to '''/etc/security/limits.conf''':

### For Tomcat etc.
* soft nofile 4096
* hard nofile 4096
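
To confirm the limit is in effect for the current shell, you can compare <code>ulimit -n</code> with the configured value. This is a sketch under the assumption that the limits are applied at login (so run it from a fresh login session); <code>check_nofile</code> is a hypothetical helper.

```shell
#!/bin/sh
# Sketch only: check_nofile is a hypothetical helper.
# Usage: check_nofile [MINIMUM]   (defaults to 4096, the value configured above)
check_nofile() {
    want=${1:-4096}
    # Open-file limit of the current shell.
    have=$(ulimit -n)
    if [ "$have" = "unlimited" ]; then
        return 0
    fi
    if [ "$have" -lt "$want" ]; then
        echo "open-file limit is $have, expected at least $want" >&2
        return 1
    fi
}
```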

Running DiscoverEd

2010-09-17T22:24:07Z

Paulproteus: /* Crawl */

[[Category:DiscoverEd]]

{{Infobox|This page contains raw documentation, some of which is only applicable to our DiscoverEd deployment. It will be massaged into more general docs in the fullness of time.}}

== Instructions for running a crawl ==

Tips:
* For long aggregates and crawls, run in 'screen'.

Three phases to the process of updating the index:
# Aggregation (polling feeds old and new)
# crawling
# merging (merging the new index with the existing one).

=== Set up environment ===

Execute these commands to set up your environment for running the tools. It also places you in the discovered user's account.

<pre>
$ sudo su - discovered
$ cd code
</pre>

=== Managing Feeds ===

The feeds script (./bin/feeds) allows you to add curators or feeds.
Running it without parameters will show the sub-commands.
Feeds and curators are identified by URL (and yes, it's picky -- http://example.org is not the same as http://example.org/ ).

==== Notes ====

*For each feed packaged by an OPML feed, the curator is set by the feed title. The OPML consumer will only add the curator/feed if the feed isn't already in the system.

*If you add a feed that already exists, you'll just overwrite the old one (since it's a triple store and the URI is the identifier). The same goes for curators; they're also identified by URI. It's more likely you'd get two curators, but so long as you're dealing with the same feed URL you won't get dupes.

=== Aggregation ===

Aggregation polls the feeds and adds new resources to the triple store. It will also poll any OPML feeds and add the new feeds it finds.

<pre>$ ./bin/feeds aggregate</pre>

=== Crawl ===

Before you crawl you need to make a seed which tells the crawler what to retrieve.

If the directory "seed/" does not exist, create it with

<pre>
mkdir seed
</pre>

Then create the seed list of URLs:

<pre>
$ ./bin/feeds seed > ./seed/crawl-urls.txt
</pre>

When the crawl runs it will look in ./seed/ and open every file it finds there, expecting to find one URL per line (so remove files when you don't want them to be crawled).

To run the actual crawl do:
<pre>
$ ./bin/crawl_and_merge.sh
</pre>

Finally, restart Tomcat (the Java app server) to make sure the new index is being used:

<pre>
$ sudo /etc/init.d/tomcat6 restart
</pre>

== Managing curators and feeds ==

On discovered.labs.creativecommons.org in the $HOME/code directory, running ./bin/feeds with no parameters shows the list of subcommands:

<pre>
listfeeds list all feeds
listcurators list all curators
addfeed add a feed
resetfeed reset the last aggregation date for a feed
addcurator add a curator
rmfeed remove a feed
setcurator set the curator for a feed
aggregate
dump
seed
</pre>

Each one is run as an argument to the feeds script (i.e. ./bin/feeds [command] [parameter1] [parameter2]...)

=== addfeed ===
<pre>
addfeed [feed_type] [feed_url] [curator_url]
</pre>

Assuming you've added the curator for this feed with addcurator, the curator URL will set the curator (so you don't need to set it again with setcurator).

Feed type notes: "rss" is a parser that does RSS/Atom sniffing.

=== addcurator ===
<pre>
addcurator [curator_name] [curator_url]
</pre>

Curator names with spaces should be surrounded by quotation marks (e.g. addcurator "CC Open Textbook Project" http://www.collegeopentextbooks.org/)

=== setcurator ===
<pre>
setcurator [feed_url] [curator_url]
</pre>

== Deploying new WARs ==

To deploy a new war, do this:

* sudo rm -rf /var/lib/tomcat6/webapps/search/ # clear the existing app to force redeployment
* sudo cp nutch-1.1.war /var/lib/tomcat6/webapps/search.war
* sudo /etc/init.d/tomcat6 restart

== Things the server administrator should know ==

=== JAVA_HOME ===

Many of our scripts require a JAVA_HOME environment variable to be set. For our convenience, we configured ''discovered.labs.creativecommons.org'' to have JAVA_HOME set for every user. We did that by adding this to '''/etc/profile''':

JAVA_HOME=/usr/lib/jvm/java-6-openjdk/ ; export JAVA_HOME

=== Maximum open files ===

Tomcat and Nutch sometimes have problems opening files. This is because they've exceeded the number of open files that a process can have.

To address this, we added this to '''/etc/security/limits.conf''':

### For Tomcat etc.
* soft nofile 4096
* hard nofile 4096

Running DiscoverEd

2010-09-10T15:44:26Z

Paulproteus: /* Deploying new WARs */

[[Category:DiscoverEd]]

{{Infobox|This page contains raw documentation, some of which is only applicable to our DiscoverEd deployment. It will be massaged into more general docs in the fullness of time.}}

== Instructions for running a crawl ==

Tips:
* For long aggregates and crawls, run in 'screen'.

Three phases to the process of updating the index:
# Aggregation (polling feeds old and new)
# crawling
# merging (merging the new index with the existing one).

=== Set up environment ===

Execute these commands to set up your environment for running the tools. It also places you in the discovered user's account.

<pre>
$ sudo su - discovered
$ cd code
</pre>

=== Managing Feeds ===

The feeds script (./bin/feeds) allows you to add curators or feeds.
Running it without parameters will show the sub-commands.
Feeds and curators are identified by URL (and yes, it's picky -- http://example.org is not the same as http://example.org/ ).

==== Notes ====

*For each feed packaged by an OPML feed, the curator is set by the feed title. The OPML consumer will only add the curator/feed if the feed isn't already in the system.

*If you add a feed that already exists, you'll just overwrite the old one (since it's a triple store and the URI is the identifier). The same goes for curators; they're also identified by URI. It's more likely you'd get two curators, but so long as you're dealing with the same feed URL you won't get dupes.

=== Aggregation ===

Aggregation polls the feeds and adds new resources to the triple store. It will also poll any OPML feeds and add the new feeds it finds.

<pre>$ ./bin/feeds aggregate</pre>

=== Crawl ===

Before you crawl you need to make a seed which tells the crawler what to retrieve.

If the directory "seed/" does not exist, create it with

<pre>
mkdir seed
</pre>

Then create the seed list of URLs:

<pre>
$ ./bin/feeds seed > ./seed/crawl-urls.txt
</pre>

When the crawl runs it will look in ./seed/ and open every file it finds there, expecting to find one URL per line (so remove files when you don't want them to be crawled).

To run the actual crawl do:

<pre>
$ ant -f dedbuild.xml crawl
</pre>

This reads the seed files and runs the crawl. The result is a new index in the oenutch directory, in a directory whose name is derived from a timestamp: for example, crawl-20090730201000 for a crawl run on July 30, 2009 at 8:10 PM. After the crawl completes, you need to merge the new index with the old one.

The production index lives in '''$HOME/production-crawl''' ($HOME is set to /var/www/discovered.labs.creativecommons.org).

To merge the index run:

<pre>
$ ./bin/merge ./crawl-<timestamp>-merged $HOME/production-crawl ./crawl-<timestamp>
</pre>

The target directory (the first parameter) will be created for you. The second parameter doesn't change, and the third parameter is the directory just created by the crawl.

After the merge completes (assuming it does so successfully) you'll want to move it into the production directory. Do something like:

<pre>
$ mkdir -p ~/archived-crawls/$(date -I)
$ mv ~/production-crawl ~/archived-crawls/$(date -I)
</pre>

to rename the existing index so you can go back to it if necessary.

Then you can do

<pre>
$ mv ./crawl-new-dir-merged ~/production-crawl
</pre>

And finally restart Tomcat (the Java app server) to make sure the new index is being used:

<pre>
$ sudo /etc/init.d/tomcat6 restart
</pre>
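
The archive-and-swap steps above can be sketched as one function that stops at the first failure, so a failed archive never leaves the site without a production index. <code>swap_index</code> is a hypothetical helper, not part of DiscoverEd; it uses <code>date +%Y-%m-%d</code>, which produces the same format as <code>date -I</code>, and it refuses to run if an index was already archived under today's date.

```shell
#!/bin/sh
# Sketch only: swap_index is a hypothetical helper, not part of DiscoverEd.
# Usage: swap_index MERGED_DIR [BASE_DIR]   (BASE_DIR defaults to $HOME)
swap_index() {
    merged=$1
    base=${2:-$HOME}
    stamp=$(date +%Y-%m-%d)
    archive=$base/archived-crawls/$stamp

    if [ ! -d "$merged" ]; then
        echo "swap_index: no such directory: $merged" >&2
        return 1
    fi
    # Do not overwrite or nest into an archive made earlier the same day.
    if [ -e "$archive/production-crawl" ]; then
        echo "swap_index: $archive/production-crawl already exists" >&2
        return 1
    fi

    # Each step runs only if the previous one succeeded.
    mkdir -p "$archive" &&
    mv "$base/production-crawl" "$archive/" &&
    mv "$merged" "$base/production-crawl"
}
```

For example, <code>swap_index ./crawl-new-dir-merged</code> replaces the two manual <code>mv</code> steps; Tomcat still needs to be restarted afterwards.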

== Managing curators and feeds ==

On discovered.labs.creativecommons.org in the $HOME/code directory, running ./bin/feeds with no parameters shows the list of subcommands:

<pre>
listfeeds list all feeds
listcurators list all curators
addfeed add a feed
resetfeed reset the last aggregation date for a feed
addcurator add a curator
rmfeed remove a feed
setcurator set the curator for a feed
aggregate
dump
seed
</pre>

Each one is run as an argument to the feeds script (i.e. ./bin/feeds [command] [parameter1] [parameter2]...)

=== addfeed ===
<pre>
addfeed [feed_type] [feed_url] [curator_url]
</pre>

Assuming you've added the curator for this feed with addcurator, the curator URL will set the curator (so you don't need to set it again with setcurator).

Feed type notes: "rss" is a parser that does RSS/Atom sniffing.

=== addcurator ===
<pre>
addcurator [curator_name] [curator_url]
</pre>

Curator names with spaces should be surrounded by quotation marks (e.g. addcurator "CC Open Textbook Project" http://www.collegeopentextbooks.org/)

=== setcurator ===
<pre>
setcurator [feed_url] [curator_url]
</pre>

== Deploying new WARs ==

To deploy a new war, do this:

* sudo rm -rf /var/lib/tomcat6/webapps/search/ # clear the existing app to force redeployment
* sudo cp nutch-1.1.war /var/lib/tomcat6/webapps/search.war
* sudo /etc/init.d/tomcat6 restart

== Things the server administrator should know ==

=== JAVA_HOME ===

Many of our scripts require a JAVA_HOME environment variable to be set. For our convenience, we configured ''discovered.labs.creativecommons.org'' to have JAVA_HOME set for every user. We did that by adding this to '''/etc/profile''':

JAVA_HOME=/usr/lib/jvm/java-6-openjdk/ ; export JAVA_HOME

=== Maximum open files ===

Tomcat and Nutch sometimes have problems opening files. This is because they've exceeded the number of open files that a process can have.

To address this, we added this to '''/etc/security/limits.conf''':

### For Tomcat etc.
* soft nofile 4096
* hard nofile 4096

Running DiscoverEd

2010-09-10T15:43:57Z

Paulproteus:

[[Category:DiscoverEd]]

{{Infobox|This page contains raw documentation, some of which is only applicable to our DiscoverEd deployment. It will be massaged into more general docs in the fullness of time.}}

== Instructions for running a crawl ==

Tips:
* For long aggregates and crawls, run in 'screen'.

Three phases to the process of updating the index:
# Aggregation (polling feeds old and new)
# crawling
# merging (merging the new index with the existing one).

=== Set up environment ===

Execute these commands to set up your environment for running the tools. It also places you in the discovered user's account.

<pre>
$ sudo su - discovered
$ cd code
</pre>

=== Managing Feeds ===

The feeds script (./bin/feeds) allows you to add curators or feeds.
Running it without parameters will show the sub-commands.
Feeds and curators are identified by URL (and yes, it's picky -- http://example.org is not the same as http://example.org/ ).

==== Notes ====

*For each feed packaged by an OPML feed, the curator is set by the feed title. The OPML consumer will only add the curator/feed if the feed isn't already in the system.

*If you add a feed that already exists, you'll just overwrite the old one (since it's a triple store and the URI is the identifier). The same goes for curators; they're also identified by URI. It's more likely you'd get two curators, but so long as you're dealing with the same feed URL you won't get dupes.

=== Aggregation ===

Aggregation polls the feeds and adds new resources to the triple store. It will also poll any OPML feeds and add the new feeds it finds.

<pre>$ ./bin/feeds aggregate</pre>

=== Crawl ===

Before you crawl you need to make a seed which tells the crawler what to retrieve.

If the directory "seed/" does not exist, create it with

<pre>
mkdir seed
</pre>

Then create the seed list of URLs:

<pre>
$ ./bin/feeds seed > ./seed/crawl-urls.txt
</pre>

When the crawl runs it will look in ./seed/ and open every file it finds there, expecting to find one URL per line (so remove files when you don't want them to be crawled).

To run the actual crawl do:

<pre>
$ ant -f dedbuild.xml crawl
</pre>

This reads the seed files and runs the crawl. The result is a new index in the oenutch directory, in a directory whose name is derived from a timestamp: for example, crawl-20090730201000 for a crawl run on July 30, 2009 at 8:10 PM. After the crawl completes, you need to merge the new index with the old one.

The production index lives in '''$HOME/production-crawl''' ($HOME is set to /var/www/discovered.labs.creativecommons.org).

To merge the index run:

<pre>
$ ./bin/merge ./crawl-<timestamp>-merged $HOME/production-crawl ./crawl-<timestamp>
</pre>

The target directory (the first parameter) will be created for you. The second parameter doesn't change, and the third parameter is the directory just created by the crawl.

After the merge completes (assuming it does so successfully) you'll want to move it into the production directory. Do something like:

<pre>
$ mkdir -p ~/archived-crawls/$(date -I)
$ mv ~/production-crawl ~/archived-crawls/$(date -I)
</pre>

to rename the existing index so you can go back to it if necessary.

Then you can do

<pre>
$ mv ./crawl-new-dir-merged ~/production-crawl
</pre>

And finally restart Tomcat (the Java app server) to make sure the new index is being used:

<pre>
$ sudo /etc/init.d/tomcat6 restart
</pre>

== Managing curators and feeds ==

On discovered.labs.creativecommons.org in the $HOME/code directory, running ./bin/feeds with no parameters shows the list of subcommands:

<pre>
listfeeds list all feeds
listcurators list all curators
addfeed add a feed
resetfeed reset the last aggregation date for a feed
addcurator add a curator
rmfeed remove a feed
setcurator set the curator for a feed
aggregate
dump
seed
</pre>

Each one is run as an argument to the feeds script (i.e. ./bin/feeds [command] [parameter1] [parameter2]...)

=== addfeed ===
<pre>
addfeed [feed_type] [feed_url] [curator_url]
</pre>

Assuming you've added the curator for this feed with addcurator, the curator URL will set the curator (so you don't need to set it again with setcurator).

Feed type notes: "rss" is a parser that does RSS/Atom sniffing.

=== addcurator ===
<pre>
addcurator [curator_name] [curator_url]
</pre>

Curator names with spaces should be surrounded by quotation marks (e.g. addcurator "CC Open Textbook Project" http://www.collegeopentextbooks.org/)

=== setcurator ===
<pre>
setcurator [feed_url] [curator_url]
</pre>

== Deploying new WARs ==

To deploy a new war, do this:

* sudo rm -rf /var/lib/tomcat6/webapps/search/ # clear the existing app to force redeployment
* sudo cp nutch-1.1.war /var/lib/tomcat6/webapps/search.war
* sudo /etc/init.d/tomcat6 restart

== Things the server administrator should know ==

=== JAVA_HOME ===

Many of our scripts require a JAVA_HOME environment variable to be set. For our convenience, we configured ''discovered.labs.creativecommons.org'' to have JAVA_HOME set for every user. We did that by adding this to '''/etc/profile''':

JAVA_HOME=/usr/lib/jvm/java-6-openjdk/ ; export JAVA_HOME

=== Maximum open files ===

Tomcat and Nutch sometimes have problems opening files. This is because they've exceeded the number of open files that a process can have.

To address this, we added this to '''/etc/security/limits.conf''':

### For Tomcat etc.
* soft nofile 4096
* hard nofile 4096

AgShare/Tech

2010-09-08T23:06:39Z

Paulproteus:

AgShare/Tech

2010-09-08T23:01:59Z

Paulproteus:

Running DiscoverEd

2010-09-08T22:50:11Z

Paulproteus:

[[Category:DiscoverEd]]

{{Infobox|This page contains raw documentation, some of which is only applicable to our DiscoverEd deployment. It will be massaged into more general docs in the fullness of time.}}

== Instructions for running a crawl ==

Tips:
* For long aggregates and crawls, run in 'screen'.

Three phases to the process of updating the index:
# Aggregation (polling feeds old and new)
# crawling
# merging (merging the new index with the existing one).

=== Set up environment ===

Execute these commands to set up your environment for running the tools. It also places you in the discovered user's account.

<pre>
$ sudo su - discovered
$ cd code
</pre>

=== Managing Feeds ===

The feeds script (./bin/feeds) allows you to add curators or feeds.
Running it without parameters will show the sub-commands.
Feeds and curators are identified by URL (and yes, it's picky -- http://example.org is not the same as http://example.org/ ).

==== Notes ====

*For each feed packaged by an OPML feed, the curator is set by the feed title. The OPML consumer will only add the curator/feed if the feed isn't already in the system.

*If you add a feed that already exists, you'll just overwrite the old one (since it's a triple store and the URI is the identifier). The same goes for curators; they're also identified by URI. It's more likely you'd get two curators, but so long as you're dealing with the same feed URL you won't get dupes.

=== Aggregation ===

Aggregation polls the feeds and adds new resources to the triple store. It will also poll any OPML feeds and add the new feeds it finds.

<pre>$ ./bin/feeds aggregate</pre>

=== Crawl ===

Before you crawl you need to make a seed which tells the crawler what to retrieve.

If the directory "seed/" does not exist, create it with

<pre>
mkdir seed
</pre>

Then create the seed list of URLs:

<pre>
$ ./bin/feeds seed > ./seed/crawl-urls.txt
</pre>

When the crawl runs it will look in ./seed/ and open every file it finds there, expecting to find one URL per line (so remove files when you don't want them to be crawled).

To run the actual crawl do:

<pre>
$ ant -f dedbuild.xml crawl
</pre>

This reads the seed files and runs the crawl. The result is a new index in the oenutch directory, in a directory whose name is derived from a timestamp: for example, crawl-20090730201000 for a crawl run on July 30, 2009 at 8:10 PM. After the crawl completes, you need to merge the new index with the old one.

The production index lives in '''$HOME/production-crawl''' ($HOME is set to /var/www/discovered.labs.creativecommons.org).

To merge the index run:

<pre>
$ ./bin/merge ./crawl-<timestamp>-merged $HOME/production-crawl ./crawl-<timestamp>
</pre>

The target directory (the first parameter) will be created for you. The second parameter doesn't change, and the third parameter is the directory just created by the crawl.

After the merge completes (assuming it does so successfully) you'll want to move it into the production directory. Do something like:

<pre>
$ mkdir -p ~/archived-crawls/$(date -I)
$ mv ~/production-crawl ~/archived-crawls/$(date -I)
</pre>

to rename the existing index so you can go back to it if necessary.

Then you can do

<pre>
$ mv ./crawl-new-dir-merged ~/production-crawl
</pre>

And finally restart Tomcat (the Java app server) to make sure the new index is being used:

<pre>
$ sudo /etc/init.d/tomcat6 restart
</pre>

== Managing curators and feeds ==

On discovered.labs.creativecommons.org in the $HOME/code directory, running ./bin/feeds with no parameters shows the list of subcommands:

<pre>
listfeeds list all feeds
listcurators list all curators
addfeed add a feed
resetfeed reset the last aggregation date for a feed
addcurator add a curator
rmfeed remove a feed
setcurator set the curator for a feed
aggregate
dump
seed
</pre>

Each one is run as an argument to the feeds script (i.e. ./bin/feeds [command] [parameter1] [parameter2]...)

=== addfeed ===
<pre>
addfeed [feed_type] [feed_url] [curator_url]
</pre>

Assuming you've added the curator for this feed with addcurator, the curator URL will set the curator (so you don't need to set it again with setcurator).

Feed type notes: "rss" is a parser that does RSS/Atom sniffing.

=== addcurator ===
<pre>
addcurator [curator_name] [curator_url]
</pre>

Curator names with spaces should be surrounded by quotation marks (e.g. addcurator "CC Open Textbook Project" http://www.collegeopentextbooks.org/)

=== setcurator ===
<pre>
setcurator [feed_url] [curator_url]
</pre>

== Deploying new WARs ==

To deploy a new war, do this:

* sudo cp nutch-1.1.war /var/lib/tomcat6/webapps/search.war
* sudo /etc/init.d/tomcat6 restart

== Things the server administrator should know ==

=== JAVA_HOME ===

Many of our scripts require a JAVA_HOME environment variable to be set. For our convenience, we configured ''discovered.labs.creativecommons.org'' to have JAVA_HOME set for every user. We did that by adding this to '''/etc/profile''':

JAVA_HOME=/usr/lib/jvm/java-6-openjdk/ ; export JAVA_HOME

=== Maximum open files ===

Tomcat and Nutch sometimes have problems opening files. This is because they've exceeded the number of open files that a process can have.

To address this, we added this to '''/etc/security/limits.conf''':

### For Tomcat etc.
* soft nofile 4096
* hard nofile 4096

AgShare/Tech

2010-09-08T22:49:05Z

Paulproteus: Created page with "Category:DiscoverEd The AgShare deployment works analogously to the CC Labs deployment of DiscoverEd. Some important things to note: * Username: '''agshare''' * Host name: ..."

AgShare

2010-09-08T22:36:12Z

Paulproteus:

[[Category:DiscoverEd]]

Creative Commons is a partner on the [http://www.oerafrica.org/agshare AgShare] project. Supported by the Bill and Melinda Gates Foundation, the goal of this planning and pilot project is to create a scalable and sustainable collaboration of existing organizations for African publishing, localizing, and sharing of teaching and learning materials that fill critical resource gaps in African MSc agriculture curriculum and that can be modified for other downstream uses.

== Technical notes ==

* [[AgShare/Tech]]

PuSH Feed Type

2010-09-07T15:49:14Z

Paulproteus:

{{DiscoverEd Specification
|contact=Asheesh Laroia
|project=AgShare
|status=Draft
}}
The people who run a DiscoverEd instance may wish to be updated nearly immediately when there are new resources published by a curator.

Right now, DiscoverEd instances aggregate feeds and crawl every once in a while, often manually at the behest of the search engine operator. PubSubHubbub (PuSH) provides a way for a DiscoverEd instance to subscribe to feeds and receive automatic, nearly instantaneous notification of new information in the feed.

This can be built on top of existing Atom/RSS feeds that curators already publish.

This feature was defined and developed during the fun DC meeting thing. (Nathan, did that meeting have a name?)

== Requirements ==

A complete implementation of this specification would provide the following things.

* DiscoverEd can discover a PuSH ''hub'' mentioned in a feed.
* DiscoverEd can register itself as a ''subscriber'' to that feed on that hub. (To do that, it has to provide a URL on the DiscoverEd instance that, when the feed is updated, the hub should POST to.)
* When the hub pings DiscoverEd to say there is an update to that feed, DiscoverEd re-aggregates data from that feed, does a crawl, and merges the index.
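
The first two requirements can be sketched roughly as follows. This is an illustration in Python, not DiscoverEd code: the sample feed, the example.org URLs, and the function names are all made up, and the form parameters follow the PubSubHubbub 0.3 draft (which includes hub.verify).

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ATOM_NS = "{http://www.w3.org/2005/Atom}"

# Hypothetical feed; a real curator feed would also contain entries.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example curator feed</title>
  <link rel="hub" href="http://hub.example.org/"/>
  <link rel="self" href="http://curator.example.org/feed.atom"/>
</feed>
"""


def discover_hub(feed_xml):
    """Return (hub_url, topic_url) found in the feed; either may be None."""
    root = ET.fromstring(feed_xml)
    hub = topic = None
    # Atom <link> elements also appear inside RSS feeds as atom:link,
    # so search every descendant rather than only the top level.
    for link in root.iter(ATOM_NS + "link"):
        rel = link.get("rel")
        if rel == "hub" and hub is None:
            hub = link.get("href")
        elif rel == "self" and topic is None:
            topic = link.get("href")
    return hub, topic


def subscription_body(topic_url, callback_url):
    """Build the form-encoded body a subscriber POSTs to the hub."""
    return urlencode({
        "hub.mode": "subscribe",
        "hub.topic": topic_url,
        "hub.callback": callback_url,
        "hub.verify": "sync",  # part of the 0.3 draft; an assumption here
    })


hub, topic = discover_hub(SAMPLE_FEED)
# The callback URL is hypothetical; it would live on the DiscoverEd instance.
body = subscription_body(topic, "http://discovered.example.org/push/callback")
print(hub)
print(body)
```

The body would be sent as an application/x-www-form-urlencoded POST to the discovered hub URL; the network call is left out so the sketch runs offline.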

== Status ==

* This draft document has been written. That's all.
* NSDL is interested in trying this with us.

== Questions ==

* Can we make things as simple as this:
** OER Africa adds <link rel="hub"...>
** They do nothing else.
** The chosen hub polls the feed, and when there are updates, pings us.
** Then we get real-time updates with basically no effort from OER Africa.
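
If it does work that way, the only moving part on our side is the callback URL. A minimal sketch of what it might do, assuming the PubSubHubbub 0.3 draft's verification handshake (the hub sends hub.mode, hub.topic, and hub.challenge, and the subscriber echoes the challenge). The function names, the example URLs, and the idea of a pending list are hypothetical, not existing DiscoverEd code.

```python
def handle_verification(params, subscribed_topics):
    """Answer a hub's verification GET.

    params: dict of query-string parameters sent by the hub.
    subscribed_topics: set of feed URLs this instance asked to follow.
    Returns (http_status, response_body).
    """
    mode = params.get("hub.mode")
    topic = params.get("hub.topic")
    challenge = params.get("hub.challenge")
    if mode == "subscribe" and topic in subscribed_topics and challenge:
        # Echoing the challenge confirms that we requested this subscription.
        return 200, challenge
    # Anything we did not ask for is refused.
    return 404, ""


def handle_notification(feed_url, pending):
    """Record that feed_url changed; aggregation and crawling happen later.

    feed_url is assumed to be worked out by the caller, for example from a
    per-feed callback URL.
    """
    if feed_url not in pending:
        pending.append(feed_url)
    return 204, ""


topics = {"http://curator.example.org/feed.atom"}
ok = handle_verification(
    {"hub.mode": "subscribe",
     "hub.topic": "http://curator.example.org/feed.atom",
     "hub.challenge": "abc123"},
    topics,
)
pending = []
handle_notification("http://curator.example.org/feed.atom", pending)
print(ok, pending)
```

Keeping the notification handler to "note the feed and return" leaves the expensive re-aggregate, crawl, and merge steps to a separate job.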

DiscoverEd/Install manually

2010-09-07T14:38:47Z

Paulproteus: /* Switching to MySQL */

[[Category:DiscoverEd]]

{{Infobox|
[[DiscoverEd]] is based on [http://nutch.apache.org/ Nutch]. As such, you may wish to consult the [http://wiki.apache.org/nutch/ Nutch Wiki] for general deployment questions.}}

{{Stub}}

=== Check out and build the source code ===

<pre>
$ git clone git://gitorious.org/discovered/repo.git discovered
$ cd discovered
$ ant
</pre>

=== Add a curator and a feed ===

DiscoverEd uses feeds to help identify resources to crawl. Feeds are provided by curators, who can also provide metadata about resources.

<pre>
$ ./bin/feeds addcurator "ND OCW" http://ocw.nd.edu/
$ ./bin/feeds addfeed rss http://ocw.nd.edu/front-page/courselist/rss http://ocw.nd.edu/
</pre>

=== Aggregate and crawl resources ===

<pre>
$ ./bin/feeds aggregate
$ mkdir seed
$ ./bin/feeds seed > seed/urls.txt
$ ant -f dedbuild.xml crawl
</pre>

=== Run the web application ===

Edit conf/nutch-site.xml to point to your crawl location.

<pre>
$ ant war
$ [copy the war file to your J2EE container]
</pre>

=== Switching to MySQL ===

By default, DiscoverEd (at least on the ''next'' branch) uses an on-disk database called Derby for storing resource metadata. You should use a different database, like MySQL, in production.

To do that, edit '''conf/discovered.xml''' and update the following sections as appropriate:

<pre>
<property>
<name>rdfstore.db.driver</name>
<value>com.mysql.jdbc.Driver</value>
</property>

<property>
<name>rdfstore.db.url</name>
<value>jdbc:mysql://localhost/discovered?autoReconnect=true</value>
</property>

<property>
<name>rdfstore.db.user</name>
<value>discovered</value>
</property>

<property>
<name>rdfstore.db.password</name>
<value></value>
</property>

</pre>
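
As a quick sanity check after editing, a sketch like the one below can list the rdfstore.db.* settings that a configuration file defines. It is not part of DiscoverEd, and it assumes the properties sit inside a <configuration> root element, as in Nutch-style configuration files; the sample text and function names are illustrative.

```python
import xml.etree.ElementTree as ET

# Assumption: <property> elements are wrapped in a <configuration> root.
SAMPLE = """<configuration>
<property>
<name>rdfstore.db.driver</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>rdfstore.db.url</name>
<value>jdbc:mysql://localhost/discovered?autoReconnect=true</value>
</property>
<property>
<name>rdfstore.db.user</name>
<value>discovered</value>
</property>
<property>
<name>rdfstore.db.password</name>
<value></value>
</property>
</configuration>"""

REQUIRED = (
    "rdfstore.db.driver",
    "rdfstore.db.url",
    "rdfstore.db.user",
    "rdfstore.db.password",
)


def rdfstore_settings(xml_text):
    """Collect every rdfstore.db.* property into a name -> value dict."""
    root = ET.fromstring(xml_text)
    settings = {}
    for prop in root.iter("property"):
        name = prop.findtext("name", default="")
        if name.startswith("rdfstore.db."):
            # An empty <value></value> element yields an empty string.
            settings[name] = prop.findtext("value", default="")
    return settings


def missing(settings):
    """Return the required property names that are absent."""
    return [name for name in REQUIRED if name not in settings]


settings = rdfstore_settings(SAMPLE)
print(settings["rdfstore.db.driver"])
print(missing(settings))
```

To check a real file, read '''conf/discovered.xml''' and pass its contents to rdfstore_settings in place of SAMPLE.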

== Known issues ==

=== Derby and OAI-PMH aren't compatible ===

If you use the default backend (Derby), OAI-PMH crawls won't work; you'll get SQL syntax errors from the code instead. We haven't fully diagnosed the problem. If you hit it, we suggest switching to MySQL as described in the "Switching to MySQL" section.

PuSH Feed Type

2010-09-07T14:04:00Z

Paulproteus:

PuSH Feed Type

2010-09-07T14:01:43Z

Paulproteus: /* Requirements */

PuSH Feed Type

2010-09-07T14:01:07Z

Paulproteus:

{{DiscoverEd Specification
|contact=Asheesh Laroia
|project=AgShare
|status=Draft
}}
The people who run a DiscoverEd instance may wish to be updated nearly immediately when there are new resources published by a curator.

Right now, DiscoverEd instances aggregate feeds and crawl every once in a while, often manually at the behest of the search engine operator. PubSubHubbub (PuSH) provides a way for a DiscoverEd instance to subscribe to feeds and receive automatic, nearly instantaneous notification of new information in the feed.

This can be built on top of existing Atom/RSS feeds that curators already publish.

This feature was defined and developed during the fun DC meeting thing. (Nathan, did that meeting have a name?)

== Requirements ==

A complete implementation of this specification would provide the following things.

* DiscoverEd can discover a PuSH ''hub'' mentioned in a feed.
* DiscoverEd can register itself as a ''subscriber'' to that feed on that hub.
* When the hub pings DiscoverEd with an update to that feed, DiscoverEd re-aggregates data from that feed, does a crawl, and merges the index.

== Status ==

* This draft document has been written. That's all.

2010-06-21T20:06:36Z

Paulproteus: /* Requirements */

Field Query Mapping

2010-06-21T18:49:08Z

Paulproteus: Created page with '{{DiscoverEd Specification |contact=Asheesh Laroia |project=AgShare |status=In Development }} The people who run a DiscoverEd may wish to let users search specific metadata easil…'

DiscoverEd Glossary

2010-06-15T13:52:48Z

Paulproteus:

[[Category:DiscoverEd]]
__TOC__

=== Curator ===

An agent (individual, organization, or group) that identifies resources for inclusion in the DiscoverEd index. A curator may be the creator or publisher of the resources, or a third party that identifies existing resources and possibly adds metadata. A curator provides one or more feeds identifying the resources to be indexed. (FIXME: Add an example curator.)

=== Feed ===

A list or map of resources to be included in the index. A feed is associated with a particular curator, and may also include metadata about the resource. Feed is used as a generic term to include Atom/RSS (parsed using Rome) and OAI-PMH endpoints. (FIXME: Add a sample feed curated by somebody.)

=== Resource ===

A single resource to be indexed, identified by a curator. Metadata about the resource may be included with it as [[RDFa]], or provided by the curator. (FIXME: Link to a sample resource.)

=== SKOS ===

[http://www.w3.org/2004/02/skos/ SKOS] is a set of specifications and standards to support the use of knowledge organization systems (KOS) such as thesauri, classification schemes, subject heading lists and taxonomies within the framework of the Semantic Web. (FIXME: Add a link to a SKOS data set.)

DiscoverEd/Development notes

2010-06-15T13:04:57Z

Paulproteus:

== Eclipse ==

If you use Eclipse, you'll be pleased to know that the repository contains an Eclipse project file. To get going, choose "Create a new project from existing sources." This should import all that is necessary into Eclipse.

=== Eclipse, Nutch, and the class path ===

The class path can get messy, because ant manages it one way and Eclipse another. As a result, things that work in Eclipse can fail in the ant targets.

FIXME: Write more problems and solutions here.

[[Category:DiscoverEd]]

DiscoverEd/Development notes

2010-06-13T14:25:03Z

Paulproteus: /* Eclipse = */

DiscoverEd/Development notes

2010-06-13T14:24:15Z

Paulproteus:

== Eclipse ==

If you use Eclipse, you'll be pleased to know that the repository contains an Eclipse project file. To get going, choose "Create a new project from existing sources." This should import all that is necessary into Eclipse.

=== Eclipse, Nutch, and the class path ===

The class path can get messy, because ant manages it one way and Eclipse another. As a result, things that work in Eclipse can fail in the ant targets.

FIXME: Write more problems and solutions here.