Difference between revisions of "Hacking DiscoverEd"

Revision as of 15:26, 15 June 2010

How to deploy a hackable DiscoverEd, make changes, and update your deployment

Check out and build the source code

$ git clone git://gitorious.org/discovered/repo.git discovered
$ cd discovered
$ git checkout BRANCH    # BRANCH = whatever branch we're working on today
$ ant

Add a curator and a feed

DiscoverEd uses feeds to help identify resources to crawl. Feeds are provided by curators, who can also provide metadata about resources.

By default, DiscoverEd uses MySQL and looks for a database called discovered. To change this, edit conf/discovered.xml.

Make sure the database exists and then:

$ ./bin/feeds addcurator "ND OCW" http://ocw.nd.edu/ 
$ ./bin/feeds addfeed rss http://ocw.nd.edu/english/@@rss http://ocw.nd.edu/

See DiscoverEd Feeds for information on supported feed types.

More information on the "./bin/feeds" commands is available at http://wiki.creativecommons.org/Running_DiscoverEd (some of it is specific to discovered.cc).
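If you have several curators to register, a small helper can generate the commands for you. This is only a sketch: the pipe-separated input format and the function name feed_commands are made up for illustration, and the argument order is copied from the two example commands above (feed type, feed URL, curator URL). It prints the commands instead of running them, so you can review them first.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads lines of the form
#   curator name|curator URL|feed type|feed URL
# and prints the matching ./bin/feeds commands (it does not run them).
feed_commands() {
    local name site type feed
    while IFS='|' read -r name site type feed; do
        # Skip blank lines.
        [ -n "$name" ] || continue
        printf './bin/feeds addcurator "%s" %s\n' "$name" "$site"
        printf './bin/feeds addfeed %s %s %s\n' "$type" "$feed" "$site"
    done
}

# Example: prints the two commands shown earlier in this section.
printf '%s\n' 'ND OCW|http://ocw.nd.edu/|rss|http://ocw.nd.edu/english/@@rss' | feed_commands
```

Once the output looks right, pipe it to sh from the top of the source tree. Curator names containing double quotes would need extra escaping.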

Aggregate and crawl resources

$ ./bin/feeds aggregate
$ ./bin/feeds seed > seed/urls.txt
$ ant -f dedbuild.xml crawl
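When you update a deployment you will probably repeat these three steps, so it can help to wrap them in a script that stops at the first failure. The following is a minimal sketch, not part of DiscoverEd: the names step, refresh_index, and DRY_RUN are assumptions. By default it only prints what it would do.

```shell
#!/usr/bin/env bash
# Dry-run by default: each step is printed but not executed.
# Set DRY_RUN=0 in the environment to really run the commands.
DRY_RUN="${DRY_RUN:-1}"

# Print one command line and, unless in dry-run mode, execute it.
step() {
    echo "+ $1"
    if [ "$DRY_RUN" = "0" ]; then
        bash -c "$1" || { echo "failed: $1" >&2; return 1; }
    fi
}

# The three commands from this section, chained so that a failure
# in one step prevents the later steps from running.
refresh_index() {
    step "./bin/feeds aggregate" &&
    step "./bin/feeds seed > seed/urls.txt" &&
    step "ant -f dedbuild.xml crawl"
}

refresh_index
```

Run it from the top of the source tree, for example with DRY_RUN=0 set in the environment once the printed steps look right.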

Run the web application

Edit conf/nutch-site.xml to point to your crawl location.

$ ant war
$ cp build/nutch-1.1.war [substitute the location of your J2EE container here; e.g., /var/lib/tomcat6/webapps]
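The copy step fails in confusing ways if the war has not been built yet or the container directory is wrong. Here is a small sketch of a safer copy; the function name deploy_war is hypothetical, and the Tomcat path in the usage comment is just the example location mentioned above, so substitute your own.

```shell
#!/usr/bin/env bash
# Copy a built war into a container's webapps directory, with
# basic checks so mistakes are reported clearly.
deploy_war() {
    local war="$1" webapps="$2"
    if [ ! -f "$war" ]; then
        echo "missing war: $war (run 'ant war' first)" >&2
        return 1
    fi
    if [ ! -d "$webapps" ]; then
        echo "no such container directory: $webapps" >&2
        return 1
    fi
    cp "$war" "$webapps"/
}

# Usage (the path is an example; yours may differ):
#   deploy_war build/nutch-1.1.war /var/lib/tomcat6/webapps
```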

Hacking The Code

  • Run Eclipse
  • Do File -> Import...
    • When the import dialog offers "Existing projects into workspace," choose "General -> File System" instead
    • Select the location of your source tree
    • Click Finish

(There are three options: 1. "Existing projects into workspace", 2. "Create from existing source", 3. "File System". Some of these trigger an error regarding Nutch MP3 code.)

The DiscoverEd source code lives in two locations:

  • ded/src/java contains DiscoverEd-specific code, primarily related to interfacing with the RDF store.
  • src/plugins/cclearn contains the DiscoverEd Nutch plugin, which provides some filtering features to Nutch and ensures that metadata indexed in the RDF store is injected into the Lucene index.

Generally, the plugin may depend upon code in the ded/src/java tree, but classes in the plugin may not be available to that code.

Committing Changes and Merging to the Main Repository

Troubleshooting

I get a big long Java backtrace talking about Jena and MySQL the first time I run the code

This means that you need to CREATE DATABASE discovered in MySQL. DiscoverEd stores its data in MySQL by default, and you need to either (a) create that database, or (b) choose a different configuration file.
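For option (a), the statement can be generated and fed to the mysql client. This is a minimal sketch: the function name create_db_sql is made up, and the root user with a password prompt in the usage comment is an assumption about your MySQL setup.

```shell
#!/usr/bin/env bash
# Print the SQL that creates the database DiscoverEd looks for.
# The name defaults to "discovered", the default mentioned on this page.
create_db_sql() {
    printf 'CREATE DATABASE IF NOT EXISTS `%s`;\n' "${1:-discovered}"
}

create_db_sql

# To apply it, pipe the statement into the mysql client as a user
# that is allowed to create databases (assumed here to be root):
#   create_db_sql | mysql -u root -p
```

IF NOT EXISTS makes the statement safe to run again if the database is already there.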

Database permissions

You might need to change the MySQL credentials or database configuration value in conf/discovered.xml. DiscoverEd does not require that you use the root user; it does require that the database already exist.

JAVA_HOME on a Mac

Mac users setting JAVA_HOME should use /usr/libexec/java_home to determine the current JAVA_HOME.

If you're really lazy, add export JAVA_HOME=`/usr/libexec/java_home` to .bash_profile and it will set JAVA_HOME each time you invoke a shell. (This is a good idea!)
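If you share one .bash_profile between a Mac and other machines, a guarded version avoids errors where the helper does not exist. This is a sketch under that assumption; the function name detect_java_home is hypothetical.

```shell
#!/usr/bin/env bash
# Print the JDK path reported by the macOS helper, or fall back to
# the existing JAVA_HOME when the helper is missing (e.g. on Linux).
# The optional argument overrides the helper path.
detect_java_home() {
    local helper="${1:-/usr/libexec/java_home}"
    if [ -x "$helper" ]; then
        "$helper"
    else
        printf '%s\n' "${JAVA_HOME:-}"
    fi
}

# Export the result so that ant and other child processes see it.
JAVA_HOME="$(detect_java_home)"
export JAVA_HOME
```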

Error message: "Feature 'http://apache.org/xml/features/xinclude' is not recognized."

"You probably have an older version of Xerces somewhere in your classpath or something is overriding the default parser configuration with one that doesn't support XInclude." (http://marc.info/?l=xerces-j-user&m=117066278506146&w=2)