Hacking DiscoverEd
How to deploy a hackable DiscoverEd, make changes, and update your deployment
Check out and build the source code
$ git clone git://gitorious.org/discovered/repo.git discovered
$ cd discovered
$ git checkout (whatever branch we're working on today)
$ ant
Add a curator and a feed
DiscoverEd uses feeds to help identify resources to crawl. Feeds are provided by curators, who can also provide metadata about resources.
By default DiscoverEd uses MySQL and looks for a database called discovered. Configure your database settings by editing conf/discovered.xml.
Make sure the database exists and then:
$ ./bin/feeds addcurator "ND OCW" http://ocw.nd.edu/
$ ./bin/feeds addfeed rss http://ocw.nd.edu/english/@@rss http://ocw.nd.edu/
See DiscoverEd Feeds for information on supported feed types.
More information on ./bin/feeds commands is available at Running DiscoverEd (some of that information is specific to discovered.cc).
Aggregate and crawl resources
$ ./bin/feeds aggregate
$ ./bin/feeds seed > seed/urls.txt
$ ant -f dedbuild.xml crawl
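If you repeat these three steps often, they can be wrapped in a small shell function that stops at the first failure. This is only a sketch: refresh_crawl is a hypothetical name, it assumes you run it from the root of the checkout, and the mkdir step is an assumption in case the seed directory does not exist yet.

```shell
# Hypothetical helper; run it from the root of the DiscoverEd checkout.
refresh_crawl() {
    # Stop at the first failing step so a failed aggregate never feeds a crawl.
    ./bin/feeds aggregate || return 1
    # Assumption: the seed directory may be missing on a fresh checkout.
    mkdir -p seed || return 1
    ./bin/feeds seed > seed/urls.txt || return 1
    ant -f dedbuild.xml crawl
}
```

After sourcing the function, run refresh_crawl whenever you add a curator or feed and want the crawl to pick it up.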
Run the web application
Edit conf/nutch-site.xml to point to your crawl location.
$ ant war
$ cp build/nutch-1.1.war [substitute the location for your J2EE container here; e.g., /var/lib/tomcat6/webapps]
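The copy step can be made a little safer by checking that the war file and the target directory exist before copying. The following is a sketch under assumptions: deploy_war is a hypothetical helper, and the webapps directory you pass in depends entirely on your container.

```shell
# Hypothetical helper; run it from the root of the checkout after "ant war".
deploy_war() {
    webapps_dir=$1
    war=build/nutch-1.1.war
    # Refuse to continue if the war file has not been built yet.
    if [ ! -f "$war" ]; then
        echo "missing $war; run 'ant war' first" >&2
        return 1
    fi
    # Refuse to continue if the container directory does not exist.
    if [ ! -d "$webapps_dir" ]; then
        echo "no such directory: $webapps_dir" >&2
        return 1
    fi
    cp "$war" "$webapps_dir/"
}
```

For a Tomcat 6 install like the one mentioned above, the call would be deploy_war /var/lib/tomcat6/webapps.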
Hacking The Code
- Run Eclipse
- Do File -> Import...
- When it asks for an import source, choose "General -> File System"
- Select the location of your source tree
- Click Finish
(There are three options: 1. "Existing projects into workspace", 2. "Create from existing source", 3. "File System". Some of these trigger an error regarding Nutch MP3 code.)
The DiscoverEd source code lives in two locations:
- ded/src/java contains DiscoverEd-specific code, primarily related to interfacing with the RDF store.
- src/plugins/cclearn contains the DiscoverEd Nutch plugin, which provides some filtering features to Nutch and ensures that metadata indexed in the RDF store is injected into the Lucene index.
Generally, the plugin may depend upon code in the ded/src/java tree, but classes in the plugin may not be available to that code.
Committing Changes and Merging to the Main Repository
Troubleshooting
I get a big long Java backtrace talking about Jena and MySQL the first time I run the code
This means that you need to CREATE DATABASE discovered in MySQL. DiscoverEd stores its data in MySQL by default, and you need to either (a) create that database, or (b) choose a different configuration file.
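Creating the database from the command line might look like the sketch below. It assumes a local MySQL server and an account that is allowed to create databases; the root account in the comment is only an example, and DB_NAME and CREATE_SQL are illustrative variable names.

```shell
# "discovered" is the default database name DiscoverEd looks for.
DB_NAME=discovered
CREATE_SQL="CREATE DATABASE IF NOT EXISTS ${DB_NAME};"

# Print the statement so you can check it before running it.
echo "$CREATE_SQL"

# To apply it, pipe the statement into the mysql client, for example:
#   echo "$CREATE_SQL" | mysql -u root -p
```

IF NOT EXISTS makes the statement safe to run again if the database is already there.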
Database permissions
You might need to change the MySQL credentials or database configuration value in conf/discovered.xml. DiscoverEd does not require that you use the root user; it does require that the database already exist.
JAVA_HOME on a Mac
Mac users setting JAVA_HOME should use /usr/libexec/java_home to determine the current JAVA_HOME.
If you're really lazy, add export JAVA_HOME=`/usr/libexec/java_home` to .bash_profile and it will set JAVA_HOME each time you start a login shell. (This is a good idea!)
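If the same .bash_profile is shared with non-Mac machines, a guarded version avoids errors where the helper is missing. This is a sketch, not a required setup; java_home_helper is just an illustrative variable name.

```shell
# Only set JAVA_HOME when the macOS helper is present and executable.
java_home_helper=/usr/libexec/java_home
if [ -x "$java_home_helper" ]; then
    JAVA_HOME="$("$java_home_helper")"
    # Export it so that ant and other child processes can see it.
    export JAVA_HOME
fi
```

On systems without the helper, the snippet leaves JAVA_HOME untouched.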
Error message: "Feature 'http://apache.org/xml/features/xinclude' is not recognized."
"You probably have an older version of Xerces somewhere in your classpath or something is overriding the default parser configuration with one that doesn't support XInclude." (http://marc.info/?l=xerces-j-user&m=117066278506146&w=2)
AccessControlException
When starting Tomcat, if you get a traceback like this in your tomcat log (e.g., in /var/lib/tomcat6/logs/localhost-$date.log):
SEVERE: Exception sending context initialized event to listener instance of class org.apache.nutch.searcher.NutchBean$NutchBeanConstructor
java.lang.RuntimeException: java.security.AccessControlException: access denied (java.lang.reflect.ReflectPermission suppressAccessChecks)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1377)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
and so on, try changing the Tomcat policy in /etc/tomcat6/policy.d/04webapps.policy. Add these lines in the grant {} block:
// Attempt to get Nutch working
// Courtesy of Alex McLintock at http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200907.mbox/<d398ec7f0907041237j6acffe0fm10b7cd374a77795b@mail.gmail.com>
permission java.security.AllPermission;
This is obviously inappropriate for any site running a public instance of DiscoverEd. But it might be useful for your local dev environment. If you know how to specify a class level permission, please update this document.
Missing build/plugins
Be sure to run `ant` in the root repo directory.