Difference between revisions of "Hacking DiscoverEd"

From Creative Commons
(Hacking The Code: add info for indenting with spaces instead of tabs in Eclipse 3.5)
Line 13: Line 13:
  
 
= Add a curator and a feed =
 
= Add a curator and a feed =
 +
 +
By default DiscoverEd uses Derby and will create the on-disk database if needed.  See [[DiscoverEd/Installation Instructions|the installation instructions]] for information on using other databases, such as MySQL.
  
 
DiscoverEd uses feeds to help identify resources to crawl.  Feeds are provided by curators, who can also provide metadata about resources.
 
DiscoverEd uses feeds to help identify resources to crawl.  Feeds are provided by curators, who can also provide metadata about resources.
 
By default DiscoverEd uses MySQL and looks for a database called discovered.  '''Configure your database settings by editing <code>conf/discovered.xml</code>.'''
 
 
Make sure the database exists and then:
 
  
 
<pre>
 
<pre>
Line 39: Line 37:
 
= Run the web application =
 
= Run the web application =
  
'''Edit conf/nutch-site.xml to point to your crawl location.'''
+
You can run the web front-end using [http://en.wikipedia.org/wiki/Jetty_%28web_server%29 Jetty] (included with your checkout) by running:
  
<code>
+
<pre>
$ ant war
+
$ ant -f dedbuild.xml serve
$ cp build/nutch-1.1.war [substitute the location for your J2EE container here; ie, /var/lib/tomcat6/webapps ]
+
</pre>
</code>
 
  
 
= Hacking The Code  =
 
= Hacking The Code  =
Line 83: Line 80:
 
= Troubleshooting =
 
= Troubleshooting =
  
==== I get a big long Java backtrace talking about Jena and MySQL the first time I run the code ====
+
==== I get a big long Java backtrace talking about Jena and MySQL ====
  
This means that you need to CREATE DATABASE discovered in MySQL. DiscoverEd stores its data in MySQL by default, and you need to either (a) create that database, or (b) choose a different configuration file.
+
If you've configured DiscoverEd to use MySQL as the database backend, you'll need to create the database first.
  
 
==== Database permissions ====
 
==== Database permissions ====
Line 101: Line 98:
  
 
==== Error message: "Feature 'http://apache.org/xml/features/xinclude' is not recognized." ====
 
==== Error message: "Feature 'http://apache.org/xml/features/xinclude' is not recognized." ====
 +
 
"You probably have an older version of Xerces somewhere in your classpath or something is overriding the default parser configuration with one that doesn't support XInclude." (http://marc.info/?l=xerces-j-user&m=117066278506146&w=2)
 
"You probably have an older version of Xerces somewhere in your classpath or something is overriding the default parser configuration with one that doesn't support XInclude." (http://marc.info/?l=xerces-j-user&m=117066278506146&w=2)
  

Revision as of 02:16, 24 August 2010

How to deploy a hackable DiscoverEd, make changes, and update your deployment

Check out and build the source code

git clone git://gitorious.org/discovered/repo.git discovered
cd discovered
git checkout (whatever branch we're working on today)
ant

Add a curator and a feed

By default DiscoverEd uses Derby and will create the on-disk database if needed. See the installation instructions for information on using other databases, such as MySQL.

DiscoverEd uses feeds to help identify resources to crawl. Feeds are provided by curators, who can also provide metadata about resources.

$ ./bin/feeds addcurator "ND OCW" http://ocw.nd.edu/ 
$ ./bin/feeds addfeed rss http://ocw.nd.edu/english/@@rss http://ocw.nd.edu/

See DiscoverEd Feeds for information on supported feed types.

More information on ./bin/feeds commands is at Running DiscoverEd (some of that information is specific to discovered.cc).

Aggregate and crawl resources

$ ./bin/feeds aggregate
$ ./bin/feeds seed > seed/urls.txt
$ ant -f dedbuild.xml crawl
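
If you run these three steps often, they can be wrapped in one shell function so that a failure in one step stops the rest. This is only a sketch, assuming you run it from the root of your checkout; the function name ded_refresh and the DRY_RUN variable are made up for this example and are not part of DiscoverEd.

```shell
# Hypothetical helper (not shipped with DiscoverEd): run the aggregate,
# seed, and crawl steps in order, stopping at the first failure.
# Set DRY_RUN=1 to print the commands instead of running them.
ded_refresh() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "./bin/feeds aggregate"
        echo "./bin/feeds seed > seed/urls.txt"
        echo "ant -f dedbuild.xml crawl"
        return 0
    fi
    ./bin/feeds aggregate || return 1
    # Make sure the seed directory exists (harmless if it already does).
    mkdir -p seed || return 1
    ./bin/feeds seed > seed/urls.txt || return 1
    ant -f dedbuild.xml crawl
}

# Example:
#   DRY_RUN=1 ded_refresh   # show what would run
#   ded_refresh             # actually run it
```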

Run the web application

You can run the web front-end using Jetty (included with your checkout) by running:

$ ant -f dedbuild.xml serve

Hacking The Code

  • Run Eclipse
  • Do File -> Import...
    • When it offers "Existing projects into workspace," choose "General -> File System" instead
    • Select the location of your source tree
    • Click Finish

(There are three options: 1. "Existing projects into workspace", 2. "Create from existing source", 3. "File System". Some of these trigger an error regarding Nutch MP3 code.)

The DiscoverEd source code lives in two locations:

  • ded/src/java contains DiscoverEd specific code, primarily related to interfacing with the RDF store.
  • src/plugins/cclearn contains the DiscoverEd Nutch plugin, which provides some filtering features to Nutch and ensures metadata indexed in the RDF store is injected into the Lucene index.

Generally, the plugin may depend upon code in the ded/src/java tree, but classes in the plugin may not be available to that code.

Note: The DiscoverEd developers will consider you extra special if you indent your code using spaces instead of tabs. You may even earn a gold star.

To use spaces for all indentation for all Java projects in Eclipse 3.5 (Galileo):

  1. Open Preferences
  2. Expand the Java group
  3. Expand the Code Style subgroup within the Java group
  4. Select Formatter
  5. Click on "New" in the Formatter section and name your profile
  6. Check that the "Indentation" tab is active
  7. Select "Spaces only" from the "Tab policy" dropdown
  8. Click "Apply" and/or "OK"

You may also do this on a per-project basis by setting it as a project property. The general process is the same.
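
To check whether any tabs have slipped into the Java sources before you commit, something like the following may help. It is a rough sketch: the function name find_tabbed_java is invented for this example, it flags a tab anywhere in a file (not only in indentation), and it relies on the --include option, which GNU and BSD grep support but POSIX does not require.

```shell
# Hypothetical helper: list .java files under the given directories that
# contain a tab character anywhere in the file.
find_tabbed_java() {
    tab=$(printf '\t')
    grep -rl --include='*.java' "$tab" "$@"
}

# Example (from the root of the checkout):
#   find_tabbed_java ded/src/java src/plugins/cclearn
```

No output means no tabs were found in the directories you passed.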

Committing Changes and Merging to the Main Repository

Troubleshooting

I get a big long Java backtrace talking about Jena and MySQL

If you've configured DiscoverEd to use MySQL as the database backend, you'll need to create the database first.

Database permissions

You might need to change the MySQL credentials or database configuration value in conf/discovered.xml. DiscoverEd does not require that you use the root user; it does require that the database already exist.
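
Since the database has to exist beforehand, here is one way to create it from the shell. This is a sketch under assumptions: the helper name ded_create_db_sql, the database name "discovered", and the account "deduser" are placeholders, so use whatever conf/discovered.xml actually names.

```shell
# Hypothetical helper: print the SQL that creates the named database.
# The name is checked against a conservative character set first.
ded_create_db_sql() {
    case "$1" in
        ''|*[!A-Za-z0-9_]*)
            echo "invalid database name: $1" >&2
            return 1
            ;;
    esac
    printf 'CREATE DATABASE IF NOT EXISTS `%s`;\n' "$1"
}

# Example, assuming a MySQL account "deduser" that is allowed to create
# databases (both names here are placeholders):
#   ded_create_db_sql discovered | mysql -u deduser -p
```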

JAVA_HOME on a Mac

Mac users setting JAVA_HOME should use /usr/libexec/java_home to determine the current JAVA_HOME.

If you're really lazy, add export JAVA_HOME=`/usr/libexec/java_home` to .bash_profile and it will set JAVA_HOME each time you start a login shell. (This is a good idea!)
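
If you share one .bash_profile between a Mac and other machines, a guarded version avoids errors where /usr/libexec/java_home does not exist. A minimal sketch; the function name set_java_home and its optional path argument are inventions for this example.

```shell
# Sketch for ~/.bash_profile: set JAVA_HOME from /usr/libexec/java_home when
# that tool exists and succeeds (it is macOS-specific); otherwise leave
# JAVA_HOME untouched.
set_java_home() {
    helper="${1:-/usr/libexec/java_home}"   # path can be overridden for testing
    if [ -x "$helper" ] && home=$("$helper"); then
        export JAVA_HOME="$home"
    fi
}
set_java_home
```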

Error message: "Feature 'http://apache.org/xml/features/xinclude' is not recognized."

"You probably have an older version of Xerces somewhere in your classpath or something is overriding the default parser configuration with one that doesn't support XInclude." (http://marc.info/?l=xerces-j-user&m=117066278506146&w=2)

AccessControlException

When starting Tomcat, if you get a traceback like this in your Tomcat log (e.g., in /var/lib/tomcat6/logs/localhost-$date.log):

SEVERE: Exception sending context initialized event to listener instance of class org.apache.nutch.searcher.NutchBean$NutchBeanConstructor
java.lang.RuntimeException: java.security.AccessControlException: access denied (java.lang.reflect.ReflectPermission suppressAccessChecks)
       at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
       at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1377)
       at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
       at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
       at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
       at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)

and so on, try changing the Tomcat policy in /etc/tomcat6/policy.d/04webapps.policy. Add these lines in the grant {} block:

   // Attempt to get Nutch working
   // Courtesy of Alex McLintock at http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200907.mbox/<d398ec7f0907041237j6acffe0fm10b7cd374a77795b@mail.gmail.com>
   permission java.security.AllPermission;

This is obviously inappropriate for any site running a public instance of DiscoverEd, but it might be useful for your local dev environment. If you know how to specify a class-level permission, please update this document.

Missing build/plugins

Be sure to run ant in the root repo directory.

Missing parse-mp3 plugin

Remove that source folder from the build path (in Eclipse, Project > Properties > Java Build Path > Source).

Eclipse complains: Wrong version number in .class file

Use Java 1.6 as your compiler. Be sure to use the right JVM for this project.