Running DiscoverEd


This page contains raw documentation, some of which is only applicable to our DiscoverEd deployment. It will be massaged into more general docs in the fullness of time.

Instructions for running a crawl

Tips:

  • For long aggregates and crawls, run in 'screen'.
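A typical workflow (the session name here is just an illustration) is to start a named screen session, run the long job inside it, detach, and reattach later to check on it:

$ screen -S discovered-crawl
$ ./bin/feeds aggregate
(detach with Ctrl-a d; later, reattach with: screen -r discovered-crawl)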

Three phases to the process of updating the index:

  1. Aggregation (polling feeds, old and new)
  2. Crawling
  3. Merging (combining the new index with the existing one)

Set up environment

Execute this command to set up your environment for running the tools. It places you in the discovered user's account:

$ sudo su - discovered

Switching to MySQL

By default, DiscoverEd (at least on the next branch) uses an on-disk database called Derby for storing resource metadata. You should use a different database, like MySQL, in production.

To do that, edit conf/discovered.xml and update the following sections as appropriate:

<property>
  <name>rdfstore.db.driver</name>
  <value>com.mysql.jdbc.Driver</value>
</property>

<property>
  <name>rdfstore.db.url</name>
  <value>jdbc:mysql://localhost/discovered?autoReconnect=true</value>
</property>

<property>
  <name>rdfstore.db.user</name>
  <value>discovered</value>
</property>

<property>
  <name>rdfstore.db.password</name>
  <value></value>
</property>
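If the database and user named above don't exist yet, you'll also need to create them. A minimal sketch, assuming a local MySQL server; the database and user names match the config above, and the password ('secret') is only a placeholder:

$ mysql -u root -p
mysql> CREATE DATABASE discovered DEFAULT CHARACTER SET utf8;
mysql> GRANT ALL PRIVILEGES ON discovered.* TO 'discovered'@'localhost' IDENTIFIED BY 'secret';
mysql> FLUSH PRIVILEGES;

Whatever password you choose goes into the rdfstore.db.password property.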

Managing Feeds

The feeds script (./bin/feeds) allows you to add curators or feeds. Running it without parameters will show the sub-commands. Feeds and curators are identified by URL (and yes, it's picky -- http://example.org is not the same as http://example.org/ ).
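To illustrate that pickiness (the curator name and URLs here are purely hypothetical), these two commands would create two distinct curator entries rather than updating one, because the URIs differ only by the trailing slash:

$ ./bin/feeds addcurator "Example Curator" http://example.org
$ ./bin/feeds addcurator "Example Curator" http://example.org/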

Notes

  • For each feed listed in an OPML feed, the curator is set from the feed title. The OPML consumer will only add the curator/feed if the feed isn't already in the system.
  • If you add a feed that already exists, you'll just overwrite the old one (since it's a triple store, the URI is the identifier). The same goes for curators; they're also identified by URI. It's more likely you'd end up with two curators, but as long as you're dealing with the same feed URL you won't get duplicates.

Aggregation

Aggregation polls the feeds and adds new resources to the triple store. It will also poll any OPML feeds and add the new feeds it finds.

$ ./bin/feeds aggregate

Crawl

Before you crawl, you need to generate a seed file, which tells the crawler what to retrieve.

$ ./bin/feeds seed > ./seed/crawl-urls.txt

When the crawl runs it will look in ./seed/ and open every file it finds there, expecting to find one URL per line (so remove files when you don't want them to be crawled).
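For example (the stale file name is hypothetical), to check what is queued for crawling and drop a seed file you no longer want included:

$ ls ./seed/
$ rm ./seed/old-crawl-urls.txt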

To run the actual crawl do:

$ ant -f ccbuild.xml crawl

This will read the seed files and run the crawl. The result is a new index in the oenutch directory; the directories have timestamp-derived names. For example, crawl-20090730201000 for a crawl run on July 30, 2009 at 8:10 PM. After the crawl completes you need to merge the new index with the old one.

The production index lives in /var/www/discovered.labs.creativecommons.org/production-crawl.

To merge the index run:

$ ./bin/merge ./crawl-<timestamp>-merged /var/www/discovered.labs.creativecommons.org/production-crawl ./crawl-<timestamp>

The target directory (the first parameter) will be created for you. The second parameter doesn't change, and the third parameter is the directory just created by the crawl.
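For example, using the timestamp from the example crawl above (purely illustrative):

$ ./bin/merge ./crawl-20090730201000-merged /var/www/discovered.labs.creativecommons.org/production-crawl ./crawl-20090730201000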

After the merge completes (assuming it does so successfully) you'll want to move the merged index into the production location. Do something like:

$ mv /var/www/discovered.labs.creativecommons.org/production-crawl /var/www/discovered.labs.creativecommons.org/production-crawl.20090730

to rename the existing index so you can go back to it if necessary.

Then you can do

$ mv ./crawl-<timestamp>-merged /var/www/discovered.labs.creativecommons.org/production-crawl

And finally restart Tomcat (the Java app server) to make sure the new index is being used:

$ sudo /etc/init.d/tomcat5.5 restart

Managing curators and feeds

On a6, in the /var/www/discovered.creativecommons.org/oenutch directory, running ./bin/feeds with no parameters shows the list of subcommands:

  listfeeds        list all feeds
  listcurators     list all curators
  addfeed          add a feed
  resetfeed        reset the last aggregation date for a feed
  addcurator       add a curator
  rmfeed           remove a feed
  setcurator       set the curator for a feed
  aggregate
  dump
  seed

Each one is run as an argument to the feeds script (i.e. ./bin/feeds [command] [parameter1] [parameter2]...)
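For example, to see what is currently registered in the store:

$ ./bin/feeds listfeeds
$ ./bin/feeds listcurators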

addfeed

addfeed [feed_type] [feed_url] [curator_url]

Assuming you've already added the curator for this feed with addcurator, passing its URL here associates it with the feed (so you don't need to set it again with setcurator).

Feed type notes: "rss" is a parser that does RSS/Atom sniffing.
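For example (the feed and curator URLs are hypothetical), to register an RSS/Atom feed for an existing curator:

$ ./bin/feeds addfeed rss http://example.org/oer-feed.rss http://example.org/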

addcurator

addcurator [curator_name] [curator_url]

Curator names with spaces should be surrounded by quotation marks (e.g. addcurator "CC Open Textbook Project" http://www.collegeopentextbooks.org/)

setcurator

setcurator [feed_url] [curator_url]
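For example (hypothetical URLs), to point an existing feed at a different curator:

$ ./bin/feeds setcurator http://example.org/oer-feed.rss http://example.org/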