Hacking DiscoverEd

From Creative Commons
Revision as of 14:20, 16 June 2010 by Dithyramble (talk | contribs) (Troubleshooting: +two more troubleshooting sections)

How to deploy a hackable DiscoverEd, make changes, and update your deployment

Check out and build the source code

$ git clone git://gitorious.org/discovered/repo.git discovered
$ cd discovered
$ git checkout (whatever branch we're working on today)
$ ant

Add a curator and a feed

DiscoverEd uses feeds to help identify resources to crawl. Feeds are provided by curators, who can also provide metadata about resources.

By default DiscoverEd uses MySQL and looks for a database called discovered. Configure your database settings by editing conf/discovered.xml.

Make sure the database exists and then:

$ ./bin/feeds addcurator "ND OCW" http://ocw.nd.edu/ 
$ ./bin/feeds addfeed rss http://ocw.nd.edu/english/@@rss http://ocw.nd.edu/

See DiscoverEd Feeds for information on supported feed types.

For more information on ./bin/feeds commands, see Running DiscoverEd (some of the information there is specific to discovered.cc).

Aggregate and crawl resources

$ ./bin/feeds aggregate
$ ./bin/feeds seed > seed/urls.txt
$ ant -f dedbuild.xml crawl
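
The three steps above can be chained so that a failed step stops the rest. This is a minimal sketch, not part of the DiscoverEd repository: the names run, refresh_index, and DRY_RUN are invented for the example, and the mkdir -p seed step assumes the seed directory may not exist yet. By default it only prints the commands.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

# Echo a command, then execute it unless we are in dry-run mode.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" != "1" ]; then
        "$@"
    fi
}

# Aggregate feeds, write the seed list, then crawl; stop at the first failure.
refresh_index() {
    run ./bin/feeds aggregate &&
    run mkdir -p seed &&
    run sh -c './bin/feeds seed > seed/urls.txt' &&
    run ant -f dedbuild.xml crawl
}

refresh_index
```

Run it with DRY_RUN=0 from the root of your checkout to execute the commands for real.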

Run the web application

Edit conf/nutch-site.xml to point to your crawl location.

$ ant war
$ cp build/nutch-1.1.war [substitute the location for your J2EE container here; e.g., /var/lib/tomcat6/webapps]
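
If you redeploy often, the copy step can be wrapped in a small function that checks its inputs first. This is a sketch only; deploy_war is a hypothetical helper name, and both paths are passed in rather than assumed.

```shell
# Hypothetical helper: copy a built WAR into a servlet container's webapps directory.
deploy_war() {
    war="$1"       # e.g. build/nutch-1.1.war
    webapps="$2"   # e.g. /var/lib/tomcat6/webapps
    if [ ! -f "$war" ]; then
        echo "missing $war; run 'ant war' first" >&2
        return 1
    fi
    if [ ! -d "$webapps" ]; then
        echo "no such directory: $webapps" >&2
        return 1
    fi
    cp "$war" "$webapps/"
}
```

For example: deploy_war build/nutch-1.1.war /var/lib/tomcat6/webapps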

Hacking The Code

  • Run Eclipse
  • Do File -> Import...
    • When it offers "Existing projects into workspace," choose "General -> File System" instead
    • Select the location of your source tree
    • Click Finish

(There are three options: 1. "Existing projects into workspace", 2. "Create from existing source", 3. "File System". Some of these trigger an error regarding Nutch MP3 code.)

The DiscoverEd source code lives in two locations:

  • ded/src/java contains DiscoverEd specific code, primarily related to interfacing with the RDF store.
  • src/plugins/cclearn contains the DiscoverEd Nutch plugin, which provides some filtering features to Nutch and ensures that metadata indexed in the RDF store is injected into the Lucene index.

Generally, the plugin may depend upon code in the ded/src/java tree, but classes in the plugin may not be available to that code.

Committing Changes and Merging to the Main Repository

Troubleshooting

I get a big long Java backtrace talking about Jena and MySQL the first time I run the code

This means that you need to CREATE DATABASE discovered in MySQL. DiscoverEd stores its data in MySQL by default, and you need to either (a) create that database, or (b) choose a different configuration file.
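
For option (a), the statement can be generated from the shell. A minimal sketch, assuming the default database name discovered; create_db_sql is a hypothetical helper name, and IF NOT EXISTS is added so that rerunning it is harmless.

```shell
# Hypothetical helper: print the SQL that creates the database DiscoverEd expects.
create_db_sql() {
    # $1 is the database name; "discovered" is the default named in this document
    printf 'CREATE DATABASE IF NOT EXISTS `%s`;\n' "${1:-discovered}"
}

create_db_sql
```

To apply it, pipe the output into the MySQL client using an account that is allowed to create databases, for example: create_db_sql | mysql -u root -p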

Database permissions

You might need to change the MySQL credentials or database configuration value in conf/discovered.xml. DiscoverEd does not require that you use the root user; it does require that the database already exist.

JAVA_HOME on a Mac

Mac users setting JAVA_HOME should use /usr/libexec/java_home to determine the current JAVA_HOME.

If you're really lazy, add export JAVA_HOME=`/usr/libexec/java_home` to .bash_profile and it will set JAVA_HOME each time you invoke a shell. (This is a good idea!)
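
A slightly more defensive variant for .bash_profile is sketched below. It leaves JAVA_HOME untouched when the helper is missing (for example on a non-Mac machine sharing the same profile) or when it reports no JDK. The variable name detected is a scratch name used only in this example.

```shell
# Sketch for ~/.bash_profile: export JAVA_HOME only when the macOS helper reports a JDK.
if [ -x /usr/libexec/java_home ]; then
    # "detected" is a scratch variable introduced for this example
    if detected="$(/usr/libexec/java_home 2>/dev/null)"; then
        export JAVA_HOME="$detected"
    fi
    unset detected
fi
```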

Error message: "Feature 'http://apache.org/xml/features/xinclude' is not recognized."

"You probably have an older version of Xerces somewhere in your classpath or something is overriding the default parser configuration with one that doesn't support XInclude." (http://marc.info/?l=xerces-j-user&m=117066278506146&w=2)

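To see which Xerces jars are present, you can search the source tree and your container's library directories. This is a rough sketch: find_xerces_jars is a hypothetical helper, and it assumes the jar file names contain "xerces" (as in xercesImpl.jar), which will not catch a parser bundled under another name.

```shell
# Hypothetical helper: list every Xerces jar under a directory (default: current directory).
find_xerces_jars() {
    find "${1:-.}" -type f -iname '*xerces*.jar' 2>/dev/null
}

find_xerces_jars .
```

Pass another directory, such as your Tomcat installation, to check what the container itself puts on the classpath.
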
AccessControlException

When starting Tomcat, if you get a stack trace like this in your Tomcat log (e.g., in /var/lib/tomcat6/logs/localhost-$date.log):

SEVERE: Exception sending context initialized event to listener instance of class org.apache.nutch.searcher.NutchBean$NutchBeanConstructor
java.lang.RuntimeException: java.security.AccessControlException: access denied (java.lang.reflect.ReflectPermission suppressAccessChecks)
       at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
       at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1377)
       at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
       at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
       at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
       at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)

and so on, try changing the Tomcat policy in /etc/tomcat6/policy.d/04webapps.policy. Add these lines in the grant {} block:

   // Attempt to get Nutch working
   // Courtesy of Alex McLintock at http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200907.mbox/<d398ec7f0907041237j6acffe0fm10b7cd374a77795b@mail.gmail.com>
   permission java.security.AllPermission;

This is obviously inappropriate for any site running a public instance of DiscoverEd, but it might be useful for your local dev environment. If you know how to specify a class-level permission, please update this document.
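
One possible narrower alternative, untested against DiscoverEd, is a grant block scoped to the web application's codeBase that allows only the permission named in the trace. The sketch below prints such a block. The function name nutch_policy is hypothetical, the directory name nutch-1.1 is assumed from the WAR file name, and Nutch may well need further permissions, which would show up as new AccessControlExceptions in the log.

```shell
# Hypothetical helper: print a grant block limited to one webapp (directory name in $1).
nutch_policy() {
    # \${catalina.base} is escaped so the policy file receives the literal property reference
    cat <<EOF
grant codeBase "file:\${catalina.base}/webapps/$1/-" {
    permission java.lang.reflect.ReflectPermission "suppressAccessChecks";
};
EOF
}

nutch_policy nutch-1.1
```

The printed block would be added to the policy file as its own grant block, in place of the AllPermission line, followed by a Tomcat restart.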

Missing build/plugins

Be sure to run ant in the root repo directory.

Missing parse-mp3 plugin

Remove that source folder from the build path (in Eclipse, Project > Properties > Java Build Path > Source).

Eclipse complains: Wrong version number in .class file

Use Java 1.6 as your compiler. Be sure to use the right JVM for this project.