Running DiscoverEd
This page contains raw documentation, some of which is only applicable to our DiscoverEd deployment. It will be massaged into more general docs in the fullness of time.
Instructions for running a crawl
Tips:
- For long aggregates and crawls, run inside screen so the job survives a dropped SSH connection (see the sketch below).
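A minimal sketch of that workflow (the session name "crawl" is just an example, not something the scripts require):

$ screen -S crawl            # start a named screen session
$ ./bin/feeds aggregate      # run the long job inside it
# press Ctrl-a d to detach; the job keeps running
$ screen -r crawl            # reattach later to check on it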
There are three phases to the process of updating the index:
- aggregation (polling feeds, old and new)
- crawling (fetching the seeded URLs)
- merging (merging the new index with the existing one)
Set up environment
Execute these commands to set up your environment for running the tools. It also places you in the discovered user's account.
$ sudo su - discovered
$ cd code
Managing Feeds
The feeds script (./bin/feeds) allows you to add curators or feeds. Running it without parameters will show the sub-commands. Feeds and curators are identified by URL (and yes, it's picky: http://example.org is not the same as http://example.org/).
Notes
- For each feed listed in an OPML feed, the curator is set from the feed title. The OPML consumer will only add the curator/feed if the feed isn't already in the system.
- If you add a feed that already exists, you'll simply overwrite the old one, since the data lives in a triple store and the feed URI is the identifier. The same goes for curators; they're also identified by URI. Duplicates are more likely on the curator side, but as long as you stick to the same feed URL you won't get duplicate feeds.
Aggregation
Aggregation polls the feeds and adds new resources to the triple store. It will also poll any OPML feeds and add the new feeds it finds.
$ ./bin/feeds aggregate
Crawl
Before you crawl you need to make a seed which tells the crawler what to retrieve.
If the directory "seed/" does not exist, create it with:
$ mkdir seed
Then create the seed list of URLs:
$ ./bin/feeds seed > ./seed/crawl-urls.txt
When the crawl runs it looks in ./seed/ and opens every file it finds there, expecting one URL per line (so remove any files you don't want crawled).
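A file in ./seed/ is just a plain text list, one URL per line; for example (these URLs are only placeholders):

http://example.org/courses/physics-101
http://example.org/courses/intro-to-statistics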
To run the actual crawl (and the merge step), do:
$ ./bin/crawl-and-merge.sh
Finally, restart Tomcat (the Java app server) to make sure the new index is being used:
$ sudo /etc/init.d/tomcat6 restart
Managing curators and feeds
On discovered.labs.creativecommons.org in the $HOME/code directory, running ./bin/feeds with no parameters shows the list of subcommands:
listfeeds       list all feeds
listcurators    list all curators
addfeed         add a feed
resetfeed       reset the last aggregation date for a feed
addcurator      add a curator
rmfeed          remove a feed
setcurator      set the curator for a feed
aggregate
dump
seed
Each one is run as an argument to the feeds script (i.e. ./bin/feeds [command] [parameter1] [parameter2]...)
addfeed
addfeed [feed_type] [feed_url] [curator_url]
Assuming you've added the curator for this feed with addcurator, the curator URL will set the curator (so you don't need to set it again with setcurator).
Feed type notes: "rss" is a parser that does RSS/Atom sniffing.
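For example (the feed URL here is made up for illustration; the curator URL is the one used in the addcurator example below):

$ ./bin/feeds addfeed rss http://www.collegeopentextbooks.org/feed http://www.collegeopentextbooks.org/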
addcurator
addcurator [curator_name] [curator_url]
Curator names with spaces should be surrounded by quotation marks (e.g. addcurator "CC Open Textbook Project" http://www.collegeopentextbooks.org/)
setcurator
setcurator [feed_url] [curator_url]
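This sets (or changes) the curator recorded for an existing feed. For example, assuming both the feed and the curator have already been added (the URLs here are placeholders):

$ ./bin/feeds setcurator http://example.org/feed.rss http://example.org/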
Deploying new WARs
To deploy a new WAR, do the following:
- sudo rm -rf /var/lib/tomcat6/webapps/search/ # clear the existing app to force redeployment
- sudo cp nutch-1.1.war /var/lib/tomcat6/webapps/search.war
- sudo /etc/init.d/tomcat6 restart
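After the restart, Tomcat should unpack the new WAR. A quick sanity check (assuming the default Tomcat 6 layout used above) is to look for the recreated search/ directory:

$ ls /var/lib/tomcat6/webapps/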
Things the server administrator should know
JAVA_HOME
Many of our scripts require a JAVA_HOME environment variable to be set. For our convenience, we configured discovered.labs.creativecommons.org to have JAVA_HOME set for every user. We did that by adding this to /etc/profile:
JAVA_HOME=/usr/lib/jvm/java-6-openjdk/ ; export JAVA_HOME
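To confirm the setting is picked up, log in as any user and check it (a quick sanity check, not part of the original setup):

$ echo $JAVA_HOME
/usr/lib/jvm/java-6-openjdk/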
Maximum open files
Tomcat and Nutch sometimes have trouble opening files because they have hit the limit on how many files a single process may have open at once.
To address this, we added this to /etc/security/limits.conf:
### For Tomcat etc.
*    soft    nofile    4096
*    hard    nofile    4096
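The new limits only take effect for new login sessions. To verify, log back in as the user that runs the crawls and check the soft limit (the expected value assumes the settings above):

$ ulimit -n
4096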