<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
		<id>https://wiki.creativecommons.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Paulproteus</id>
		<title>Creative Commons - User contributions [en]</title>
		<link rel="self" type="application/atom+xml" href="https://wiki.creativecommons.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Paulproteus"/>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/wiki/Special:Contributions/Paulproteus"/>
		<updated>2026-06-13T07:59:00Z</updated>
		<subtitle>User contributions</subtitle>
		<generator>MediaWiki 1.30.0</generator>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=MediaWiki_Feed_Type&amp;diff=43055</id>
		<title>MediaWiki Feed Type</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=MediaWiki_Feed_Type&amp;diff=43055"/>
				<updated>2010-10-14T14:56:18Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: /* Further work */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{DiscoverEd Specification&lt;br /&gt;
|contact=Asheesh Laroia&lt;br /&gt;
|project=AgShare&lt;br /&gt;
|status=Complete&lt;br /&gt;
}}&lt;br /&gt;
Publishers who store and edit text in MediaWiki may want a particular ''category'' from their wiki aggregated into a DiscoverEd instance. This specification describes:&lt;br /&gt;
&lt;br /&gt;
* How publishers can set up their wikis to be aggregated by DiscoverEd&lt;br /&gt;
* How DiscoverEd instance maintainers can aggregate content from such a wiki&lt;br /&gt;
* The changes that we made to DiscoverEd to support this.&lt;br /&gt;
&lt;br /&gt;
It's pretty easy, all around. Relax, and read on.&lt;br /&gt;
&lt;br /&gt;
== Guide for publishers ==&lt;br /&gt;
&lt;br /&gt;
DiscoverEd can pull content from a category in your wiki. Our code relies on the following:&lt;br /&gt;
&lt;br /&gt;
* The MediaWiki API enabled, at least the read-only portion. (This is the default. [http://www.mediawiki.org/wiki/API:Restricting_API_usage More info here].)&lt;br /&gt;
* A &amp;lt;link rel=&amp;quot;search&amp;quot;&amp;gt; tag pointing to the OpenSearch API on your wiki. This is also enabled by default, but if you customize the theme, you might accidentally remove this tag. We need it there.&lt;br /&gt;
&lt;br /&gt;
We detect the MediaWiki API path by fetching a page on your wiki and finding a link to another MediaWiki PHP file, from which we derive the URL of the API. Right now, we rely on the &amp;lt;link rel=&amp;quot;search&amp;quot;&amp;gt; tag for that detection, so if you remove the OpenSearch header, we can't find the API URL.&lt;br /&gt;
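The detection described above can be sketched roughly as follows. This is an illustrative Python sketch, not the DiscoverEd code itself: the names SearchLinkFinder and guess_api_url are hypothetical, and it assumes that api.php lives in the same directory as the file the OpenSearch link points to.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SearchLinkFinder(HTMLParser):
    """Hypothetical helper: remembers the href of the first link tag with rel "search"."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "link" and attributes.get("rel") == "search" and self.href is None:
            self.href = attributes.get("href")


def guess_api_url(page_url, page_html):
    """Return a guessed api.php URL, or None when the page has no search link."""
    finder = SearchLinkFinder()
    finder.feed(page_html)
    if finder.href is None:
        # No OpenSearch link on the page, so there is nothing to derive the API URL from.
        return None
    # Resolve the (possibly relative) href against the page URL.
    search_url = urljoin(page_url, finder.href)
    # Assumption: api.php sits next to that file, so only the file name is swapped.
    return urljoin(search_url, "api.php")
```

If the search link is missing, the function returns None, which mirrors the failure mode described above.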
&lt;br /&gt;
== Guide for DiscoverEd sysadmins ==&lt;br /&gt;
&lt;br /&gt;
* There is a new feed type: ''mediawiki-category''. If you add a feed with that type, set the URL of the feed to the MediaWiki category. For example:&lt;br /&gt;
** to slurp in resources from Category:DiscoverEd on this wiki, where the wiki is curated by http://creativecommons.org/, enter this:&lt;br /&gt;
** bin/feeds addfeed mediawiki-category http://wiki.creativecommons.org/Category:DiscoverEd http://creativecommons.org/&lt;br /&gt;
* The provenance of a Resource we find in the category is the full category URL.&lt;br /&gt;
* The code logs its exceptions extensively, so if you are missing data you expected to have, do read the log.&lt;br /&gt;
&lt;br /&gt;
== DiscoverEd code changes ==&lt;br /&gt;
&lt;br /&gt;
The changes from [http://gitorious.org/discovered/repo/commit/a60ed73cc793f74f9cdafd853cdec208eda77a78 a60ed73cc793f74f9cdafd853cdec208eda77a78] to [http://gitorious.org/discovered/repo/commit/7500f599a9e5667db7644984dfb1d8d549a48597 7500f59] represent the initial implementation. A summary of the implementation details:&lt;br /&gt;
&lt;br /&gt;
* Factored out the RSS feed aggregation into a separate class&lt;br /&gt;
* Created a new valid feed type, ''mediawiki-category''&lt;br /&gt;
* Created a MediaWikiCategory class which can, starting from a category page, find the API URL, query it for all the pages in that category, and create a Resource representing each such page.&lt;br /&gt;
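To make the last step concrete, here is a rough Python sketch of querying the API for the pages in a category. It is not the MediaWikiCategory class itself: the function names are hypothetical, and it assumes the standard MediaWiki query module with list=categorymembers and JSON output.

```python
import json
from urllib.parse import urlencode


def category_members_url(api_url, category_title, limit=500):
    """Build the API request URL that lists the pages in one category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category_title,
        "cmlimit": limit,
        "format": "json",
    }
    return api_url + "?" + urlencode(params)


def page_titles(response_text):
    """Extract the page titles from a categorymembers JSON response."""
    data = json.loads(response_text)
    return [member["title"] for member in data["query"]["categorymembers"]]
```

Each returned title would then become one Resource. Large categories need more than one request, which this sketch leaves out.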
&lt;br /&gt;
== Further work ==&lt;br /&gt;
&lt;br /&gt;
Things that would be nice, but that the world will probably never see:&lt;br /&gt;
&lt;br /&gt;
* It would be nice if MediaWiki had a &amp;lt;link rel=&amp;quot;api&amp;quot;&amp;gt; or similar that unambiguously pointed to the API.&lt;br /&gt;
* It would be interesting if MediaWiki just created an RSS feed, in &amp;quot;MIT OCW format&amp;quot;, for each category.&lt;br /&gt;
* We extract ''extremely'' little metadata right now from the pages: just the title. It would be nice if there were a reasonable way to store and extract metadata. A decision by us could make a big difference.&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=MediaWiki_Feed_Type&amp;diff=43054</id>
		<title>MediaWiki Feed Type</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=MediaWiki_Feed_Type&amp;diff=43054"/>
				<updated>2010-10-14T14:51:30Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: /* Guide for DiscoverEd sysadmins */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{DiscoverEd Specification&lt;br /&gt;
|contact=Asheesh Laroia&lt;br /&gt;
|project=AgShare&lt;br /&gt;
|status=Complete&lt;br /&gt;
}}&lt;br /&gt;
Publishers who store and edit text in MediaWiki may want a particular ''category'' from their wiki aggregated into a DiscoverEd instance. This specification describes:&lt;br /&gt;
&lt;br /&gt;
* How publishers can set up their wikis to be aggregated by DiscoverEd&lt;br /&gt;
* How DiscoverEd instance maintainers can aggregate content from such a wiki&lt;br /&gt;
* The changes that we made to DiscoverEd to support this.&lt;br /&gt;
&lt;br /&gt;
It's pretty easy, all around. Relax, and read on.&lt;br /&gt;
&lt;br /&gt;
== Guide for publishers ==&lt;br /&gt;
&lt;br /&gt;
DiscoverEd can pull content from a category in your wiki. Our code relies on the following:&lt;br /&gt;
&lt;br /&gt;
* The MediaWiki API enabled, at least the read-only portion. (This is the default. [http://www.mediawiki.org/wiki/API:Restricting_API_usage More info here].)&lt;br /&gt;
* A &amp;lt;link rel=&amp;quot;search&amp;quot;&amp;gt; tag pointing to the OpenSearch API on your wiki. This is also enabled by default, but if you customize the theme, you might accidentally remove this tag. We need it there.&lt;br /&gt;
&lt;br /&gt;
We detect the MediaWiki API path by fetching a page on your wiki and finding a link to another MediaWiki PHP file, from which we derive the URL of the API. Right now, we rely on the &amp;lt;link rel=&amp;quot;search&amp;quot;&amp;gt; tag for that detection, so if you remove the OpenSearch header, we can't find the API URL.&lt;br /&gt;
&lt;br /&gt;
== Guide for DiscoverEd sysadmins ==&lt;br /&gt;
&lt;br /&gt;
* There is a new feed type: ''mediawiki-category''. If you add a feed with that type, set the URL of the feed to the MediaWiki category. For example:&lt;br /&gt;
** to slurp in resources from Category:DiscoverEd on this wiki, where the wiki is curated by http://creativecommons.org/, enter this:&lt;br /&gt;
** bin/feeds addfeed mediawiki-category http://wiki.creativecommons.org/Category:DiscoverEd http://creativecommons.org/&lt;br /&gt;
* The provenance of a Resource we find in the category is the full category URL.&lt;br /&gt;
* The code logs its exceptions extensively, so if you are missing data you expected to have, do read the log.&lt;br /&gt;
&lt;br /&gt;
== DiscoverEd code changes ==&lt;br /&gt;
&lt;br /&gt;
The changes from [http://gitorious.org/discovered/repo/commit/a60ed73cc793f74f9cdafd853cdec208eda77a78 a60ed73cc793f74f9cdafd853cdec208eda77a78] to [http://gitorious.org/discovered/repo/commit/7500f599a9e5667db7644984dfb1d8d549a48597 7500f59] represent the initial implementation. A summary of the implementation details:&lt;br /&gt;
&lt;br /&gt;
* Factored out the RSS feed aggregation into a separate class&lt;br /&gt;
* Created a new valid feed type, ''mediawiki-category''&lt;br /&gt;
* Created a MediaWikiCategory class which can, starting from a category page, find the API URL, query it for all the pages in that category, and create a Resource representing each such page.&lt;br /&gt;
&lt;br /&gt;
== Further work ==&lt;br /&gt;
&lt;br /&gt;
Things that would be nice, but that the world will probably never see:&lt;br /&gt;
&lt;br /&gt;
* It would be nice if MediaWiki had a &amp;lt;link rel=&amp;quot;api&amp;quot;&amp;gt; or similar that unambiguously pointed automated agents to the API.&lt;br /&gt;
* It would be interesting if MediaWiki just created an RSS feed, in &amp;quot;MIT OCW format&amp;quot;, for each category.&lt;br /&gt;
* We extract ''extremely'' little metadata right now from the pages: just the title. It would be nice if there were a reasonable way to store and extract metadata. A decision by us could make a big difference.&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=MediaWiki_Feed_Type&amp;diff=43053</id>
		<title>MediaWiki Feed Type</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=MediaWiki_Feed_Type&amp;diff=43053"/>
				<updated>2010-10-14T14:47:41Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: Created page with &amp;quot;{{DiscoverEd Specification |contact=Asheesh Laroia |project=AgShare |status=Complete }} Publishers who store and edit text in MediaWiki may want a particular ''category'' from th...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{DiscoverEd Specification&lt;br /&gt;
|contact=Asheesh Laroia&lt;br /&gt;
|project=AgShare&lt;br /&gt;
|status=Complete&lt;br /&gt;
}}&lt;br /&gt;
Publishers who store and edit text in MediaWiki may want a particular ''category'' from their wiki aggregated into a DiscoverEd instance. This specification describes:&lt;br /&gt;
&lt;br /&gt;
* How publishers can set up their wikis to be aggregated by DiscoverEd&lt;br /&gt;
* How DiscoverEd instance maintainers can aggregate content from such a wiki&lt;br /&gt;
* The changes that we made to DiscoverEd to support this.&lt;br /&gt;
&lt;br /&gt;
It's pretty easy, all around. Relax, and read on.&lt;br /&gt;
&lt;br /&gt;
== Guide for publishers ==&lt;br /&gt;
&lt;br /&gt;
DiscoverEd can pull content from a category in your wiki. Our code relies on the following:&lt;br /&gt;
&lt;br /&gt;
* The MediaWiki API enabled, at least the read-only portion. (This is the default. [http://www.mediawiki.org/wiki/API:Restricting_API_usage More info here].)&lt;br /&gt;
* A &amp;lt;link rel=&amp;quot;search&amp;quot;&amp;gt; tag pointing to the OpenSearch API on your wiki. This is also enabled by default, but if you customize the theme, you might accidentally remove this tag. We need it there.&lt;br /&gt;
&lt;br /&gt;
We detect the MediaWiki API path by fetching a page on your wiki and finding a link to another MediaWiki PHP file, from which we derive the URL of the API. Right now, we rely on the &amp;lt;link rel=&amp;quot;search&amp;quot;&amp;gt; tag for that detection, so if you remove the OpenSearch header, we can't find the API URL.&lt;br /&gt;
&lt;br /&gt;
== Guide for DiscoverEd sysadmins ==&lt;br /&gt;
&lt;br /&gt;
* There is a new feed type: ''mediawiki-category''. If you add a feed with that type, set the URL of the feed to the MediaWiki category.&lt;br /&gt;
* The provenance of a Resource we find in the category is the category URL.&lt;br /&gt;
* The code logs its exceptions extensively, so if you are missing data you expected to have, do read the log.&lt;br /&gt;
&lt;br /&gt;
== DiscoverEd code changes ==&lt;br /&gt;
&lt;br /&gt;
The changes from [http://gitorious.org/discovered/repo/commit/a60ed73cc793f74f9cdafd853cdec208eda77a78 a60ed73cc793f74f9cdafd853cdec208eda77a78] to [http://gitorious.org/discovered/repo/commit/7500f599a9e5667db7644984dfb1d8d549a48597 7500f59] represent the initial implementation. A summary of the implementation details:&lt;br /&gt;
&lt;br /&gt;
* Factored out the RSS feed aggregation into a separate class&lt;br /&gt;
* Created a new valid feed type, ''mediawiki-category''&lt;br /&gt;
* Created a MediaWikiCategory class which can, starting from a category page, find the API URL, query it for all the pages in that category, and create a Resource representing each such page.&lt;br /&gt;
&lt;br /&gt;
== Further work ==&lt;br /&gt;
&lt;br /&gt;
Things that would be nice, but that the world will probably never see:&lt;br /&gt;
&lt;br /&gt;
* It would be nice if MediaWiki had a &amp;lt;link rel=&amp;quot;api&amp;quot;&amp;gt; or similar that unambiguously pointed automated agents to the API.&lt;br /&gt;
* It would be interesting if MediaWiki just created an RSS feed, in &amp;quot;MIT OCW format&amp;quot;, for each category.&lt;br /&gt;
* We extract ''extremely'' little metadata right now from the pages: just the title. It would be nice if there were a reasonable way to store and extract metadata. A decision by us could make a big difference.&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=AgShare/Tech&amp;diff=42833</id>
		<title>AgShare/Tech</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=AgShare/Tech&amp;diff=42833"/>
				<updated>2010-10-12T15:45:40Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: /* Piwik general configuration */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:DiscoverEd]]&lt;br /&gt;
&lt;br /&gt;
The AgShare deployment works analogously to the CC Labs deployment of DiscoverEd. Some important things to note:&lt;br /&gt;
&lt;br /&gt;
* Username: '''agshare'''&lt;br /&gt;
* Host name: '''search.agshare.org''' (currently the same as discovered.labs.creativecommons.org)&lt;br /&gt;
&lt;br /&gt;
So, for example, to set up your environment, do:&lt;br /&gt;
&lt;br /&gt;
 $ sudo su - agshare&lt;br /&gt;
&lt;br /&gt;
Given that, give [[Running DiscoverEd]] a look!&lt;br /&gt;
&lt;br /&gt;
== Deploying new WARs ==&lt;br /&gt;
&lt;br /&gt;
To deploy a new war, do this:&lt;br /&gt;
&lt;br /&gt;
* rm -rf ~/tomcat/webapps/ROOT&lt;br /&gt;
* cp nutch-1.1.war ~/tomcat/webapps/ROOT.war&lt;br /&gt;
&lt;br /&gt;
Then restart Tomcat.&lt;br /&gt;
&lt;br /&gt;
== Restarting Tomcat ==&lt;br /&gt;
&lt;br /&gt;
The AgShare deployment uses a Tomcat instance in its $HOME (supported by the tomcat6-instance-create script). It's wrapped as &amp;quot;/etc/init.d/agshare&amp;quot; so the boot process can use it. But you can restart it this way:&lt;br /&gt;
&lt;br /&gt;
* ~/tomcat/bin/shutdown.sh&lt;br /&gt;
* ~/tomcat/bin/startup.sh&lt;br /&gt;
&lt;br /&gt;
== Starting Tomcat at boot ==&lt;br /&gt;
&lt;br /&gt;
/etc/rc.local contains a call to run ~/tomcat/bin/startup.sh as the agshare user. That's kind of hackish, I realize.&lt;br /&gt;
&lt;br /&gt;
== Piwik analytics ==&lt;br /&gt;
&lt;br /&gt;
We use a self-hosted package called [http://piwik.org/ Piwik] to record search engine queries and measure traffic to the website. All the data stays with us.&lt;br /&gt;
&lt;br /&gt;
You can use the [http://search.agshare.org/static/piwik/piwik/index.php Piwik admin interface] to view the stats, if you have an account. If you want an account, talk to Nathan.&lt;br /&gt;
&lt;br /&gt;
=== Piwik general configuration ===&lt;br /&gt;
&lt;br /&gt;
* Configuration: It uses a MySQL database. You can see the details in the Piwik configuration file.&lt;br /&gt;
* Path on the server: '''/var/www/search.agshare.org/www/static/piwik/piwik'''&lt;br /&gt;
* Web serving: Apache + mod_php5 serve it up. We set up '''/var/www/search.agshare.org/www/static''' to be served by Apache; you can see that in /etc/apache2/sites-available/search.agshare.org.&lt;br /&gt;
&lt;br /&gt;
To get piwik running, we had to add piwik to the default template. I implemented that in [http://gitorious.org/+discovereders/discovered/agshare-live/commit/b396498f99de2d21259aed48bbf7918f7cf436d2 a commit].&lt;br /&gt;
&lt;br /&gt;
=== Site search ===&lt;br /&gt;
&lt;br /&gt;
We added [http://github.com/BeezyT/piwik-sitesearch the sitesearch plugin] (still in beta; see [http://dev.piwik.org/trac/ticket/49 this Piwik ticket]) to let us analyze site search.&lt;br /&gt;
&lt;br /&gt;
The site search plugin requires that we:&lt;br /&gt;
* Change the default translations so that they &lt;br /&gt;
* Configure it: In the [http://search.agshare.org/static/piwik/piwik/index.php?module=SiteSearch&amp;amp;action=admin&amp;amp;idSite=1&amp;amp;period=day&amp;amp;date=yesterday Site Search settings], I set the &amp;quot;Search URL&amp;quot; to &amp;quot;search.jsp&amp;quot; (no leading slash) and the &amp;quot;Search Parameter&amp;quot; to &amp;quot;query&amp;quot;. This matches [http://search.agshare.org/search.jsp?query=body queries like this].&lt;br /&gt;
&lt;br /&gt;
Piwik SiteSearch can keep track of the number of results that the search engine returns for each query. To do that, it needs either to &amp;quot;scrape&amp;quot; the information out of the web page or to have the servlet provide it. I chose the &amp;quot;scrape&amp;quot; option and implemented it in [http://gitorious.org/+discovereders/discovered/agshare-live/commit/4ffdd225670c9af3e57d686111907aa5e5d150fe a commit].&lt;br /&gt;
&lt;br /&gt;
== Version control ==&lt;br /&gt;
&lt;br /&gt;
The AgShare deployment's git repository can be [http://gitorious.org/+discovereders/discovered/agshare-live/ found on Gitorious]. It is available from within the AgShare deployment as a ''git remote'' named ''mirror''.&lt;br /&gt;
&lt;br /&gt;
When you want to back up the AgShare deployment's git state, just do:&lt;br /&gt;
&lt;br /&gt;
 $ git push mirror --mirror&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=AgShare/Tech&amp;diff=42832</id>
		<title>AgShare/Tech</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=AgShare/Tech&amp;diff=42832"/>
				<updated>2010-10-12T15:44:05Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: /* Version control */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:DiscoverEd]]&lt;br /&gt;
&lt;br /&gt;
The AgShare deployment works analogously to the CC Labs deployment of DiscoverEd. Some important things to note:&lt;br /&gt;
&lt;br /&gt;
* Username: '''agshare'''&lt;br /&gt;
* Host name: '''search.agshare.org''' (currently the same as discovered.labs.creativecommons.org)&lt;br /&gt;
&lt;br /&gt;
So, for example, to set up your environment, do:&lt;br /&gt;
&lt;br /&gt;
 $ sudo su - agshare&lt;br /&gt;
&lt;br /&gt;
Given that, give [[Running DiscoverEd]] a look!&lt;br /&gt;
&lt;br /&gt;
== Deploying new WARs ==&lt;br /&gt;
&lt;br /&gt;
To deploy a new war, do this:&lt;br /&gt;
&lt;br /&gt;
* rm -rf ~/tomcat/webapps/ROOT&lt;br /&gt;
* cp nutch-1.1.war ~/tomcat/webapps/ROOT.war&lt;br /&gt;
&lt;br /&gt;
Then restart Tomcat.&lt;br /&gt;
&lt;br /&gt;
== Restarting Tomcat ==&lt;br /&gt;
&lt;br /&gt;
The AgShare deployment uses a Tomcat instance in its $HOME (supported by the tomcat6-instance-create script). It's wrapped as &amp;quot;/etc/init.d/agshare&amp;quot; so the boot process can use it. But you can restart it this way:&lt;br /&gt;
&lt;br /&gt;
* ~/tomcat/bin/shutdown.sh&lt;br /&gt;
* ~/tomcat/bin/startup.sh&lt;br /&gt;
&lt;br /&gt;
== Starting Tomcat at boot ==&lt;br /&gt;
&lt;br /&gt;
/etc/rc.local contains a call to run ~/tomcat/bin/startup.sh as the agshare user. That's kind of hackish, I realize.&lt;br /&gt;
&lt;br /&gt;
== Piwik analytics ==&lt;br /&gt;
&lt;br /&gt;
We use a self-hosted package called [http://piwik.org/ Piwik] to record search engine queries and measure traffic to the website. All the data stays with us.&lt;br /&gt;
&lt;br /&gt;
You can use the [http://search.agshare.org/static/piwik/piwik/index.php Piwik admin interface] to view the stats, if you have an account. If you want an account, talk to Nathan.&lt;br /&gt;
&lt;br /&gt;
=== Piwik general configuration ===&lt;br /&gt;
&lt;br /&gt;
* Configuration: It uses a MySQL database. You can see the details in the Piwik configuration file.&lt;br /&gt;
* Path on the server: '''/var/www/search.agshare.org/www/static/piwik/piwik'''&lt;br /&gt;
* Web serving: Apache + mod_php5 serve it up. We set up '''/var/www/search.agshare.org/www/static''' to be served by Apache; you can see that in /etc/apache2/sites-available/search.agshare.org.&lt;br /&gt;
&lt;br /&gt;
To get piwik running, we had to add piwik to the default template. See the &amp;quot;changes&amp;quot; section below for more info.&lt;br /&gt;
&lt;br /&gt;
=== Site search ===&lt;br /&gt;
&lt;br /&gt;
We added [http://github.com/BeezyT/piwik-sitesearch the sitesearch plugin] (still in beta; see [http://dev.piwik.org/trac/ticket/49 this Piwik ticket]) to let us analyze site search.&lt;br /&gt;
&lt;br /&gt;
The site search plugin requires that we:&lt;br /&gt;
* Change the default translations so that they &lt;br /&gt;
* Configure it: In the [http://search.agshare.org/static/piwik/piwik/index.php?module=SiteSearch&amp;amp;action=admin&amp;amp;idSite=1&amp;amp;period=day&amp;amp;date=yesterday Site Search settings], I set the &amp;quot;Search URL&amp;quot; to &amp;quot;search.jsp&amp;quot; (no leading slash) and the &amp;quot;Search Parameter&amp;quot; to &amp;quot;query&amp;quot;. This matches [http://search.agshare.org/search.jsp?query=body queries like this].&lt;br /&gt;
&lt;br /&gt;
Piwik SiteSearch can keep track of the number of results that the search engine returns for each query. To do that, it needs either to &amp;quot;scrape&amp;quot; the information out of the web page or to have the servlet provide it. I chose the &amp;quot;scrape&amp;quot; option and implemented it in [http://gitorious.org/+discovereders/discovered/agshare-live/commit/4ffdd225670c9af3e57d686111907aa5e5d150fe a commit].&lt;br /&gt;
&lt;br /&gt;
== Version control ==&lt;br /&gt;
&lt;br /&gt;
The AgShare deployment's git repository can be [http://gitorious.org/+discovereders/discovered/agshare-live/ found on Gitorious]. It is available from within the AgShare deployment as a ''git remote'' named ''mirror''.&lt;br /&gt;
&lt;br /&gt;
When you want to back up the AgShare deployment's git state, just do:&lt;br /&gt;
&lt;br /&gt;
 $ git push mirror --mirror&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=AgShare/Tech&amp;diff=42831</id>
		<title>AgShare/Tech</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=AgShare/Tech&amp;diff=42831"/>
				<updated>2010-10-12T15:43:46Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: /* Piwik analytics */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:DiscoverEd]]&lt;br /&gt;
&lt;br /&gt;
The AgShare deployment works analogously to the CC Labs deployment of DiscoverEd. Some important things to note:&lt;br /&gt;
&lt;br /&gt;
* Username: '''agshare'''&lt;br /&gt;
* Host name: '''search.agshare.org''' (currently the same as discovered.labs.creativecommons.org)&lt;br /&gt;
&lt;br /&gt;
So, for example, to set up your environment, do:&lt;br /&gt;
&lt;br /&gt;
 $ sudo su - agshare&lt;br /&gt;
&lt;br /&gt;
Given that, give [[Running DiscoverEd]] a look!&lt;br /&gt;
&lt;br /&gt;
== Deploying new WARs ==&lt;br /&gt;
&lt;br /&gt;
To deploy a new war, do this:&lt;br /&gt;
&lt;br /&gt;
* rm -rf ~/tomcat/webapps/ROOT&lt;br /&gt;
* cp nutch-1.1.war ~/tomcat/webapps/ROOT.war&lt;br /&gt;
&lt;br /&gt;
Then restart Tomcat.&lt;br /&gt;
&lt;br /&gt;
== Restarting Tomcat ==&lt;br /&gt;
&lt;br /&gt;
The AgShare deployment uses a Tomcat instance in its $HOME (supported by the tomcat6-instance-create script). It's wrapped as &amp;quot;/etc/init.d/agshare&amp;quot; so the boot process can use it. But you can restart it this way:&lt;br /&gt;
&lt;br /&gt;
* ~/tomcat/bin/shutdown.sh&lt;br /&gt;
* ~/tomcat/bin/startup.sh&lt;br /&gt;
&lt;br /&gt;
== Starting Tomcat at boot ==&lt;br /&gt;
&lt;br /&gt;
/etc/rc.local contains a call to run ~/tomcat/bin/startup.sh as the agshare user. That's kind of hackish, I realize.&lt;br /&gt;
&lt;br /&gt;
== Piwik analytics ==&lt;br /&gt;
&lt;br /&gt;
We use a self-hosted package called [http://piwik.org/ Piwik] to record search engine queries and measure traffic to the website. All the data stays with us.&lt;br /&gt;
&lt;br /&gt;
You can use the [http://search.agshare.org/static/piwik/piwik/index.php Piwik admin interface] to view the stats, if you have an account. If you want an account, talk to Nathan.&lt;br /&gt;
&lt;br /&gt;
=== Piwik general configuration ===&lt;br /&gt;
&lt;br /&gt;
* Configuration: It uses a MySQL database. You can see the details in the Piwik configuration file.&lt;br /&gt;
* Path on the server: '''/var/www/search.agshare.org/www/static/piwik/piwik'''&lt;br /&gt;
* Web serving: Apache + mod_php5 serve it up. We set up '''/var/www/search.agshare.org/www/static''' to be served by Apache; you can see that in /etc/apache2/sites-available/search.agshare.org.&lt;br /&gt;
&lt;br /&gt;
To get piwik running, we had to add piwik to the default template. See the &amp;quot;changes&amp;quot; section below for more info.&lt;br /&gt;
&lt;br /&gt;
=== Site search ===&lt;br /&gt;
&lt;br /&gt;
We added [http://github.com/BeezyT/piwik-sitesearch the sitesearch plugin] (still in beta; see [http://dev.piwik.org/trac/ticket/49 this Piwik ticket]) to let us analyze site search.&lt;br /&gt;
&lt;br /&gt;
The site search plugin requires that we:&lt;br /&gt;
* Change the default translations so that they &lt;br /&gt;
* Configure it: In the [http://search.agshare.org/static/piwik/piwik/index.php?module=SiteSearch&amp;amp;action=admin&amp;amp;idSite=1&amp;amp;period=day&amp;amp;date=yesterday Site Search settings], I set the &amp;quot;Search URL&amp;quot; to &amp;quot;search.jsp&amp;quot; (no leading slash) and the &amp;quot;Search Parameter&amp;quot; to &amp;quot;query&amp;quot;. This matches [http://search.agshare.org/search.jsp?query=body queries like this].&lt;br /&gt;
&lt;br /&gt;
Piwik SiteSearch can keep track of the number of results that the search engine returns for each query. To do that, it needs either to &amp;quot;scrape&amp;quot; the information out of the web page or to have the servlet provide it. I chose the &amp;quot;scrape&amp;quot; option and implemented it in [http://gitorious.org/+discovereders/discovered/agshare-live/commit/4ffdd225670c9af3e57d686111907aa5e5d150fe a commit].&lt;br /&gt;
&lt;br /&gt;
== Version control ==&lt;br /&gt;
&lt;br /&gt;
The AgShare deployment's git repository can be [http://gitorious.org/+discovereders/discovered/agshare-live/ found on Gitorious].&lt;br /&gt;
&lt;br /&gt;
When you want to back up the AgShare deployment's git state, just do:&lt;br /&gt;
&lt;br /&gt;
 $ git push mirror --mirror&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=AgShare/Tech&amp;diff=42830</id>
		<title>AgShare/Tech</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=AgShare/Tech&amp;diff=42830"/>
				<updated>2010-10-12T15:15:40Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:DiscoverEd]]&lt;br /&gt;
&lt;br /&gt;
The AgShare deployment works analogously to the CC Labs deployment of DiscoverEd. Some important things to note:&lt;br /&gt;
&lt;br /&gt;
* Username: '''agshare'''&lt;br /&gt;
* Host name: '''search.agshare.org''' (currently the same as discovered.labs.creativecommons.org)&lt;br /&gt;
&lt;br /&gt;
So, for example, to set up your environment, do:&lt;br /&gt;
&lt;br /&gt;
 $ sudo su - agshare&lt;br /&gt;
&lt;br /&gt;
Given that, give [[Running DiscoverEd]] a look!&lt;br /&gt;
&lt;br /&gt;
== Deploying new WARs ==&lt;br /&gt;
&lt;br /&gt;
To deploy a new war, do this:&lt;br /&gt;
&lt;br /&gt;
* rm -rf ~/tomcat/webapps/ROOT&lt;br /&gt;
* cp nutch-1.1.war ~/tomcat/webapps/ROOT.war&lt;br /&gt;
&lt;br /&gt;
Then restart Tomcat.&lt;br /&gt;
&lt;br /&gt;
== Restarting Tomcat ==&lt;br /&gt;
&lt;br /&gt;
The AgShare deployment uses a Tomcat instance in its $HOME (supported by the tomcat6-instance-create script). It's wrapped as &amp;quot;/etc/init.d/agshare&amp;quot; so the boot process can use it. But you can restart it this way:&lt;br /&gt;
&lt;br /&gt;
* ~/tomcat/bin/shutdown.sh&lt;br /&gt;
* ~/tomcat/bin/startup.sh&lt;br /&gt;
&lt;br /&gt;
== Starting Tomcat at boot ==&lt;br /&gt;
&lt;br /&gt;
/etc/rc.local contains a call to run ~/tomcat/bin/startup.sh as the agshare user. That's kind of hackish, I realize.&lt;br /&gt;
&lt;br /&gt;
== Piwik analytics ==&lt;br /&gt;
&lt;br /&gt;
We use a self-hosted package called [http://piwik.org/ Piwik] to record search engine queries and measure traffic to the website.&lt;br /&gt;
&lt;br /&gt;
You can use the [http://search.agshare.org/static/piwik/piwik/index.php Piwik interface from here] if you have an account. If you want an account, talk to Nathan.&lt;br /&gt;
&lt;br /&gt;
=== Piwik general configuration ===&lt;br /&gt;
&lt;br /&gt;
* Configuration: It uses a MySQL database. You can see the details in the Piwik configuration file.&lt;br /&gt;
* Path on the server: '''/var/www/search.agshare.org/www/static'''&lt;br /&gt;
* Web serving: Apache + mod_php5 serve it up. We set up '''/var/www/search.agshare.org/www/static''' to be served by Apache; you can see that in /etc/apache2/sites-available/search.agshare.org.&lt;br /&gt;
&lt;br /&gt;
=== Site search ===&lt;br /&gt;
&lt;br /&gt;
We added [http://github.com/BeezyT/piwik-sitesearch the sitesearch plugin] (still in beta; see [http://dev.piwik.org/trac/ticket/49 this Piwik ticket]) to let us analyze site search.&lt;br /&gt;
&lt;br /&gt;
The site search plugin requires that we:&lt;br /&gt;
* Add piwik to the default template&lt;br /&gt;
* Change the default translations so that they &lt;br /&gt;
* Configure it: &lt;br /&gt;
&lt;br /&gt;
You can administer &lt;br /&gt;
&lt;br /&gt;
* Plugins: '''piwik/plugins/SiteSearch''' is a git clone of&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=DiscoverEd/Meetings&amp;diff=42563</id>
		<title>DiscoverEd/Meetings</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=DiscoverEd/Meetings&amp;diff=42563"/>
				<updated>2010-10-05T15:56:42Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: /* Mon, Sep 13, 2010 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:DiscoverEd]]&lt;br /&gt;
__TOC__&lt;br /&gt;
&lt;br /&gt;
= Tue Oct 5, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh: http://openetherpad.org/2010-10-05&lt;br /&gt;
&lt;br /&gt;
= Thu Sep 30, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh: http://openetherpad.org/2010-09-30&lt;br /&gt;
&lt;br /&gt;
= Tue, Sep 28, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh: http://piratepad.net/2010-09-28&lt;br /&gt;
&lt;br /&gt;
= Mon, Sep 20, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh: http://openetherpad.com/2010-09-20&lt;br /&gt;
&lt;br /&gt;
= Mon, Sep 13, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh: http://piratepad.net/2010-09-13&lt;br /&gt;
&lt;br /&gt;
= Mon, Aug 30, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh: http://piratepad.net/2010-08-30&lt;br /&gt;
&lt;br /&gt;
= Monday, August 16, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh: http://piratepad.net/vMJnNgqP5q&lt;br /&gt;
&lt;br /&gt;
= Tuesday, August 10, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh: http://piratepad.net/F3pDrLemz4&lt;br /&gt;
&lt;br /&gt;
= Monday, June 28, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh, and Raffi: [http://piratepad.net/rOBKTtpocp Notes]&lt;br /&gt;
* TripleStoreIndexer&lt;br /&gt;
** NY thinks this is fixed, want to sync up https://www.pivotaltracker.com/story/show/4001001&lt;br /&gt;
** This is working in next and master&lt;br /&gt;
** Accepted Pivotal story&lt;br /&gt;
* MinusCurator&lt;br /&gt;
** status update&lt;br /&gt;
*** Still in progress&lt;br /&gt;
* Field Mapping work (from sprint)&lt;br /&gt;
** Was blocked by TripleStoreIndexer&lt;br /&gt;
** Going to work on for 3 hours, landing work from sprint&lt;br /&gt;
* MakeSeed&lt;br /&gt;
** NY thought this was fixed, but it's open in Pivotal ( https://www.pivotaltracker.com/story/show/3888957 )&lt;br /&gt;
** Need to double check other verbs&lt;br /&gt;
* Priorities:&lt;br /&gt;
** Field Mapping&lt;br /&gt;
** Make sure feeds verbs handle provenance correctly&lt;br /&gt;
** MinusCurator&lt;br /&gt;
&lt;br /&gt;
= Monday, June 21, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh and Raffi: [http://piratepad.net/5OhqF55lTk Notes]&lt;br /&gt;
&lt;br /&gt;
= Monday, June 7, 2010 =&lt;br /&gt;
&lt;br /&gt;
AL: Branch cleanup&lt;br /&gt;
&lt;br /&gt;
A new developer should begin to add work on top of master.&lt;br /&gt;
&lt;br /&gt;
The branch &amp;quot;provenance_tests&amp;quot; contains work on OAI-PMH and RDFa. It's also where we'll add the work on minus-curator.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;provenance_tests&amp;quot; will become &amp;quot;next&amp;quot;&lt;br /&gt;
&lt;br /&gt;
AL: OAI-PMH provenance implementation&lt;br /&gt;
&lt;br /&gt;
Defer to post -curator implementation&lt;br /&gt;
&lt;br /&gt;
RKL: RDFa extraction and indexing&lt;br /&gt;
&lt;br /&gt;
We can parse the RDFa title by adding the RDFa-parsing logic in the *wrong* place (in the feed updating method). I've tried to move it into the *right* place (a plugin, so we don't hit the network twice) but the plugin isn't being executed. I must have configured the plugin incorrectly.&lt;br /&gt;
Raffi should look in the Hadoop log, not standard out, for his logging message.&lt;br /&gt;
Raffi should push the branch to Nathan and he'll sanity-check the configuration.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
AL/RKL: -curator status&lt;br /&gt;
&lt;br /&gt;
Let's get this as far along as possible before the sprint!&lt;br /&gt;
&lt;br /&gt;
NRY: &amp;quot;Feature&amp;quot; pages in CC wiki&lt;br /&gt;
&lt;br /&gt;
In preparation for the sprint next week, Brendan from MSU asked if we could create general &amp;quot;Feature&amp;quot; write-ups for things we're working on for AgShare.  These would describe some feature (e.g., Provenance, Analytics), so he could evaluate where they intersect with his FSKN work.  &lt;br /&gt;
&lt;br /&gt;
I think each Feature maps to one or more Stories in Pivotal.  My initial stub is at http://wiki.creativecommons.org/Curator_Filtering (probably going to rename to Provenance or something like that).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Monday, June 2, 2010 =&lt;br /&gt;
&lt;br /&gt;
== Things Asheesh wants ==&lt;br /&gt;
&lt;br /&gt;
* We work more clearly out of Pivotal Tracker.&lt;br /&gt;
** While on the phone, Asheesh updated it.&lt;br /&gt;
&lt;br /&gt;
== Things Nathan wants done before the sprint ==&lt;br /&gt;
&lt;br /&gt;
* Branch cleanup&lt;br /&gt;
** Asheesh just did this.&lt;br /&gt;
* Data imported through OAI-PMH has the provenance of its feed. Damn the torpedos^Wtests.&lt;br /&gt;
&lt;br /&gt;
Then we'll work on the &amp;quot;minus curator&amp;quot; story in Pivotal Tracker.&lt;br /&gt;
&lt;br /&gt;
== Thursday departure planning ==&lt;br /&gt;
&lt;br /&gt;
At some point, Nathan wants to figure out his Thursday departure planning on the sprint's last day.&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=DiscoverEd/Meetings&amp;diff=42562</id>
		<title>DiscoverEd/Meetings</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=DiscoverEd/Meetings&amp;diff=42562"/>
				<updated>2010-10-05T15:56:14Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: /* Mon, Sep 20, 2010 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:DiscoverEd]]&lt;br /&gt;
__TOC__&lt;br /&gt;
&lt;br /&gt;
= Tue Oct 5, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh: http://openetherpad.org/2010-10-05&lt;br /&gt;
&lt;br /&gt;
= Thu Sep 30, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh: http://openetherpad.org/2010-09-30&lt;br /&gt;
&lt;br /&gt;
= Tue, Sep 28, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh: http://piratepad.net/2010-09-28&lt;br /&gt;
&lt;br /&gt;
= Mon, Sep 20, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh: http://openetherpad.com/2010-09-20&lt;br /&gt;
&lt;br /&gt;
= Mon, Sep 13, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh: http://piratepad.net/2010-09-13&lt;br /&gt;
&lt;br /&gt;
= Monday, August 16, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh: http://piratepad.net/vMJnNgqP5q&lt;br /&gt;
&lt;br /&gt;
= Tuesday, August 10, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh: http://piratepad.net/F3pDrLemz4&lt;br /&gt;
&lt;br /&gt;
= Monday, June 28, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh, and Raffi: [http://piratepad.net/rOBKTtpocp Notes]&lt;br /&gt;
* TripleStoreIndexer&lt;br /&gt;
** NY thinks this is fixed, want to sync up https://www.pivotaltracker.com/story/show/4001001&lt;br /&gt;
** This is working in next and master&lt;br /&gt;
** Accepted Pivotal story&lt;br /&gt;
* MinusCurator&lt;br /&gt;
** status update&lt;br /&gt;
*** Still in progress&lt;br /&gt;
* Field Mapping work (from sprint)&lt;br /&gt;
** Was blocked by TripleStoreIndexer&lt;br /&gt;
** Going to work on for 3 hours, landing work from sprint&lt;br /&gt;
* MakeSeed&lt;br /&gt;
** NY thought this was fixed, but it's open in Pivotal ( https://www.pivotaltracker.com/story/show/3888957 )&lt;br /&gt;
** Need to double check other verbs&lt;br /&gt;
* Priorities:&lt;br /&gt;
** Field Mapping&lt;br /&gt;
** Make sure feeds verbs handle provenance correctly&lt;br /&gt;
** MinusCurator&lt;br /&gt;
&lt;br /&gt;
= Monday, June 21, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh and Raffi: [http://piratepad.net/5OhqF55lTk Notes]&lt;br /&gt;
&lt;br /&gt;
= Monday, June 7, 2010 =&lt;br /&gt;
&lt;br /&gt;
AL: Branch cleanup&lt;br /&gt;
&lt;br /&gt;
A new developer should begin to add work on top of master.&lt;br /&gt;
&lt;br /&gt;
The branch &amp;quot;provenance_tests&amp;quot; contains work on OAI-PMH and RDFa. It's also where we'll add the work on minus-curator.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;provenance_tests&amp;quot; will become &amp;quot;next&amp;quot;&lt;br /&gt;
&lt;br /&gt;
AL: OAI-PMH provenance implementation&lt;br /&gt;
&lt;br /&gt;
Defer to post -curator implementation&lt;br /&gt;
&lt;br /&gt;
RKL: RDFa extraction and indexing&lt;br /&gt;
&lt;br /&gt;
We can parse the RDFa title by adding the RDFa-parsing logic in the *wrong* place (in the feed updating method). I've tried to move it into the *right* place (a plugin, so we don't hit the network twice) but the plugin isn't being executed. I must have configured the plugin incorrectly.&lt;br /&gt;
Raffi should look in the Hadoop log, not standard out, for his logging message.&lt;br /&gt;
Raffi should push the branch to Nathan and he'll sanity-check the configuration.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
AL/RKL: -curator status&lt;br /&gt;
&lt;br /&gt;
Let's get this as far along as possible before the sprint!&lt;br /&gt;
&lt;br /&gt;
NRY: &amp;quot;Feature&amp;quot; pages in CC wiki&lt;br /&gt;
&lt;br /&gt;
In preparation for the sprint next week, Brendan from MSU asked if we could create general &amp;quot;Feature&amp;quot; write-ups for things we're working on for AgShare.  These would describe some feature (e.g., Provenance, Analytics), so he could evaluate where they intersect with his FSKN work.  &lt;br /&gt;
&lt;br /&gt;
I think each Feature maps to one or more Stories in Pivotal.  My initial stub is at http://wiki.creativecommons.org/Curator_Filtering (probably going to rename to Provenance or something like that).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Monday, June 2, 2010 =&lt;br /&gt;
&lt;br /&gt;
== Things Asheesh wants ==&lt;br /&gt;
&lt;br /&gt;
* We work more clearly out of Pivotal Tracker.&lt;br /&gt;
** While on the phone, Asheesh updated it.&lt;br /&gt;
&lt;br /&gt;
== Things Nathan wants done before the sprint ==&lt;br /&gt;
&lt;br /&gt;
* Branch cleanup&lt;br /&gt;
** Asheesh just did this.&lt;br /&gt;
* Data imported through OAI-PMH has the provenance of its feed. Damn the torpedos^Wtests.&lt;br /&gt;
&lt;br /&gt;
Then we'll work on the &amp;quot;minus curator&amp;quot; story in Pivotal Tracker.&lt;br /&gt;
&lt;br /&gt;
== Thursday departure planning ==&lt;br /&gt;
&lt;br /&gt;
At some point, Nathan wants to figure out his Thursday departure planning on the sprint's last day.&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=DiscoverEd/Meetings&amp;diff=42561</id>
		<title>DiscoverEd/Meetings</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=DiscoverEd/Meetings&amp;diff=42561"/>
				<updated>2010-10-05T15:55:51Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: /* Tue, Sep 28, 2010 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:DiscoverEd]]&lt;br /&gt;
__TOC__&lt;br /&gt;
&lt;br /&gt;
= Tue Oct 5, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh: http://openetherpad.org/2010-10-05&lt;br /&gt;
&lt;br /&gt;
= Thu Sep 30, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh: http://openetherpad.org/2010-09-30&lt;br /&gt;
&lt;br /&gt;
= Tue, Sep 28, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh: http://piratepad.net/2010-09-28&lt;br /&gt;
&lt;br /&gt;
= Mon, Sep 20, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh: http://openetherpad.com/2010-09-20&lt;br /&gt;
&lt;br /&gt;
= Monday, August 16, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh: http://piratepad.net/vMJnNgqP5q&lt;br /&gt;
&lt;br /&gt;
= Tuesday, August 10, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh: http://piratepad.net/F3pDrLemz4&lt;br /&gt;
&lt;br /&gt;
= Monday, June 28, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh, and Raffi: [http://piratepad.net/rOBKTtpocp Notes]&lt;br /&gt;
* TripleStoreIndexer&lt;br /&gt;
** NY thinks this is fixed, want to sync up https://www.pivotaltracker.com/story/show/4001001&lt;br /&gt;
** This is working in next and master&lt;br /&gt;
** Accepted Pivotal story&lt;br /&gt;
* MinusCurator&lt;br /&gt;
** status update&lt;br /&gt;
*** Still in progress&lt;br /&gt;
* Field Mapping work (from sprint)&lt;br /&gt;
** Was blocked by TripleStoreIndexer&lt;br /&gt;
** Going to work on for 3 hours, landing work from sprint&lt;br /&gt;
* MakeSeed&lt;br /&gt;
** NY thought this was fixed, but it's open in Pivotal ( https://www.pivotaltracker.com/story/show/3888957 )&lt;br /&gt;
** Need to double check other verbs&lt;br /&gt;
* Priorities:&lt;br /&gt;
** Field Mapping&lt;br /&gt;
** Make sure feeds verbs handle provenance correctly&lt;br /&gt;
** MinusCurator&lt;br /&gt;
&lt;br /&gt;
= Monday, June 21, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh and Raffi: [http://piratepad.net/5OhqF55lTk Notes]&lt;br /&gt;
&lt;br /&gt;
= Monday, June 7, 2010 =&lt;br /&gt;
&lt;br /&gt;
AL: Branch cleanup&lt;br /&gt;
&lt;br /&gt;
A new developer should begin to add work on top of master.&lt;br /&gt;
&lt;br /&gt;
The branch &amp;quot;provenance_tests&amp;quot; contains work on OAI-PMH and RDFa. It's also where we'll add the work on minus-curator.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;provenance_tests&amp;quot; will become &amp;quot;next&amp;quot;&lt;br /&gt;
&lt;br /&gt;
AL: OAI-PMH provenance implementation&lt;br /&gt;
&lt;br /&gt;
Defer to post -curator implementation&lt;br /&gt;
&lt;br /&gt;
RKL: RDFa extraction and indexing&lt;br /&gt;
&lt;br /&gt;
We can parse the RDFa title by adding the RDFa-parsing logic in the *wrong* place (in the feed updating method). I've tried to move it into the *right* place (a plugin, so we don't hit the network twice) but the plugin isn't being executed. I must have configured the plugin incorrectly.&lt;br /&gt;
Raffi should look in the Hadoop log, not standard out, for his logging message.&lt;br /&gt;
Raffi should push the branch to Nathan and he'll sanity-check the configuration.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
AL/RKL: -curator status&lt;br /&gt;
&lt;br /&gt;
Let's get this as far along as possible before the sprint!&lt;br /&gt;
&lt;br /&gt;
NRY: &amp;quot;Feature&amp;quot; pages in CC wiki&lt;br /&gt;
&lt;br /&gt;
In preparation for the sprint next week, Brendan from MSU asked if we could create general &amp;quot;Feature&amp;quot; write-ups for things we're working on for AgShare.  These would describe some feature (e.g., Provenance, Analytics), so he could evaluate where they intersect with his FSKN work.  &lt;br /&gt;
&lt;br /&gt;
I think each Feature maps to one or more Stories in Pivotal.  My initial stub is at http://wiki.creativecommons.org/Curator_Filtering (probably going to rename to Provenance or something like that).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Monday, June 2, 2010 =&lt;br /&gt;
&lt;br /&gt;
== Things Asheesh wants ==&lt;br /&gt;
&lt;br /&gt;
* We work more clearly out of Pivotal Tracker.&lt;br /&gt;
** While on the phone, Asheesh updated it.&lt;br /&gt;
&lt;br /&gt;
== Things Nathan wants done before the sprint ==&lt;br /&gt;
&lt;br /&gt;
* Branch cleanup&lt;br /&gt;
** Asheesh just did this.&lt;br /&gt;
* Data imported through OAI-PMH has the provenance of its feed. Damn the torpedos^Wtests.&lt;br /&gt;
&lt;br /&gt;
Then we'll work on the &amp;quot;minus curator&amp;quot; story in Pivotal Tracker.&lt;br /&gt;
&lt;br /&gt;
== Thursday departure planning ==&lt;br /&gt;
&lt;br /&gt;
At some point, Nathan wants to figure out his Thursday departure planning on the sprint's last day.&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=DiscoverEd/Meetings&amp;diff=42560</id>
		<title>DiscoverEd/Meetings</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=DiscoverEd/Meetings&amp;diff=42560"/>
				<updated>2010-10-05T15:55:32Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: /* Thu Sep 30, 2010 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:DiscoverEd]]&lt;br /&gt;
__TOC__&lt;br /&gt;
&lt;br /&gt;
= Tue Oct 5, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh: http://openetherpad.org/2010-10-05&lt;br /&gt;
&lt;br /&gt;
= Thu Sep 30, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh: http://openetherpad.org/2010-09-30&lt;br /&gt;
&lt;br /&gt;
= Tue, Sep 28, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh: http://piratepad.net/2010-09-28&lt;br /&gt;
&lt;br /&gt;
= Monday, August 16, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh: http://piratepad.net/vMJnNgqP5q&lt;br /&gt;
&lt;br /&gt;
= Tuesday, August 10, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh: http://piratepad.net/F3pDrLemz4&lt;br /&gt;
&lt;br /&gt;
= Monday, June 28, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh, and Raffi: [http://piratepad.net/rOBKTtpocp Notes]&lt;br /&gt;
* TripleStoreIndexer&lt;br /&gt;
** NY thinks this is fixed, want to sync up https://www.pivotaltracker.com/story/show/4001001&lt;br /&gt;
** This is working in next and master&lt;br /&gt;
** Accepted Pivotal story&lt;br /&gt;
* MinusCurator&lt;br /&gt;
** status update&lt;br /&gt;
*** Still in progress&lt;br /&gt;
* Field Mapping work (from sprint)&lt;br /&gt;
** Was blocked by TripleStoreIndexer&lt;br /&gt;
** Going to work on for 3 hours, landing work from sprint&lt;br /&gt;
* MakeSeed&lt;br /&gt;
** NY thought this was fixed, but it's open in Pivotal ( https://www.pivotaltracker.com/story/show/3888957 )&lt;br /&gt;
** Need to double check other verbs&lt;br /&gt;
* Priorities:&lt;br /&gt;
** Field Mapping&lt;br /&gt;
** Make sure feeds verbs handle provenance correctly&lt;br /&gt;
** MinusCurator&lt;br /&gt;
&lt;br /&gt;
= Monday, June 21, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh and Raffi: [http://piratepad.net/5OhqF55lTk Notes]&lt;br /&gt;
&lt;br /&gt;
= Monday, June 7, 2010 =&lt;br /&gt;
&lt;br /&gt;
AL: Branch cleanup&lt;br /&gt;
&lt;br /&gt;
A new developer should begin to add work on top of master.&lt;br /&gt;
&lt;br /&gt;
The branch &amp;quot;provenance_tests&amp;quot; contains work on OAI-PMH and RDFa. It's also where we'll add the work on minus-curator.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;provenance_tests&amp;quot; will become &amp;quot;next&amp;quot;&lt;br /&gt;
&lt;br /&gt;
AL: OAI-PMH provenance implementation&lt;br /&gt;
&lt;br /&gt;
Defer to post -curator implementation&lt;br /&gt;
&lt;br /&gt;
RKL: RDFa extraction and indexing&lt;br /&gt;
&lt;br /&gt;
We can parse the RDFa title by adding the RDFa-parsing logic in the *wrong* place (in the feed updating method). I've tried to move it into the *right* place (a plugin, so we don't hit the network twice) but the plugin isn't being executed. I must have configured the plugin incorrectly.&lt;br /&gt;
Raffi should look in the Hadoop log, not standard out, for his logging message.&lt;br /&gt;
Raffi should push the branch to Nathan and he'll sanity-check the configuration.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
AL/RKL: -curator status&lt;br /&gt;
&lt;br /&gt;
Let's get this as far along as possible before the sprint!&lt;br /&gt;
&lt;br /&gt;
NRY: &amp;quot;Feature&amp;quot; pages in CC wiki&lt;br /&gt;
&lt;br /&gt;
In preparation for the sprint next week, Brendan from MSU asked if we could create general &amp;quot;Feature&amp;quot; write-ups for things we're working on for AgShare.  These would describe some feature (e.g., Provenance, Analytics), so he could evaluate where they intersect with his FSKN work.  &lt;br /&gt;
&lt;br /&gt;
I think each Feature maps to one or more Stories in Pivotal.  My initial stub is at http://wiki.creativecommons.org/Curator_Filtering (probably going to rename to Provenance or something like that).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Monday, June 2, 2010 =&lt;br /&gt;
&lt;br /&gt;
== Things Asheesh wants ==&lt;br /&gt;
&lt;br /&gt;
* We work more clearly out of Pivotal Tracker.&lt;br /&gt;
** While on the phone, Asheesh updated it.&lt;br /&gt;
&lt;br /&gt;
== Things Nathan wants done before the sprint ==&lt;br /&gt;
&lt;br /&gt;
* Branch cleanup&lt;br /&gt;
** Asheesh just did this.&lt;br /&gt;
* Data imported through OAI-PMH has the provenance of its feed. Damn the torpedos^Wtests.&lt;br /&gt;
&lt;br /&gt;
Then we'll work on the &amp;quot;minus curator&amp;quot; story in Pivotal Tracker.&lt;br /&gt;
&lt;br /&gt;
== Thursday departure planning ==&lt;br /&gt;
&lt;br /&gt;
At some point, Nathan wants to figure out his Thursday departure planning on the sprint's last day.&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=DiscoverEd/Meetings&amp;diff=42559</id>
		<title>DiscoverEd/Meetings</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=DiscoverEd/Meetings&amp;diff=42559"/>
				<updated>2010-10-05T15:55:00Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: /* Tue Oct 5, 2010 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:DiscoverEd]]&lt;br /&gt;
__TOC__&lt;br /&gt;
&lt;br /&gt;
= Tue Oct 5, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh: http://openetherpad.org/2010-10-05&lt;br /&gt;
&lt;br /&gt;
= Thu Sep 30, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh: http://openetherpad.org/2010-09-30&lt;br /&gt;
&lt;br /&gt;
= Monday, August 16, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh: http://piratepad.net/vMJnNgqP5q&lt;br /&gt;
&lt;br /&gt;
= Tuesday, August 10, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh: http://piratepad.net/F3pDrLemz4&lt;br /&gt;
&lt;br /&gt;
= Monday, June 28, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh, and Raffi: [http://piratepad.net/rOBKTtpocp Notes]&lt;br /&gt;
* TripleStoreIndexer&lt;br /&gt;
** NY thinks this is fixed, want to sync up https://www.pivotaltracker.com/story/show/4001001&lt;br /&gt;
** This is working in next and master&lt;br /&gt;
** Accepted Pivotal story&lt;br /&gt;
* MinusCurator&lt;br /&gt;
** status update&lt;br /&gt;
*** Still in progress&lt;br /&gt;
* Field Mapping work (from sprint)&lt;br /&gt;
** Was blocked by TripleStoreIndexer&lt;br /&gt;
** Going to work on for 3 hours, landing work from sprint&lt;br /&gt;
* MakeSeed&lt;br /&gt;
** NY thought this was fixed, but it's open in Pivotal ( https://www.pivotaltracker.com/story/show/3888957 )&lt;br /&gt;
** Need to double check other verbs&lt;br /&gt;
* Priorities:&lt;br /&gt;
** Field Mapping&lt;br /&gt;
** Make sure feeds verbs handle provenance correctly&lt;br /&gt;
** MinusCurator&lt;br /&gt;
&lt;br /&gt;
= Monday, June 21, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh and Raffi: [http://piratepad.net/5OhqF55lTk Notes]&lt;br /&gt;
&lt;br /&gt;
= Monday, June 7, 2010 =&lt;br /&gt;
&lt;br /&gt;
AL: Branch cleanup&lt;br /&gt;
&lt;br /&gt;
A new developer should begin to add work on top of master.&lt;br /&gt;
&lt;br /&gt;
The branch &amp;quot;provenance_tests&amp;quot; contains work on OAI-PMH and RDFa. It's also where we'll add the work on minus-curator.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;provenance_tests&amp;quot; will become &amp;quot;next&amp;quot;&lt;br /&gt;
&lt;br /&gt;
AL: OAI-PMH provenance implementation&lt;br /&gt;
&lt;br /&gt;
Defer to post -curator implementation&lt;br /&gt;
&lt;br /&gt;
RKL: RDFa extraction and indexing&lt;br /&gt;
&lt;br /&gt;
We can parse the RDFa title by adding the RDFa-parsing logic in the *wrong* place (in the feed updating method). I've tried to move it into the *right* place (a plugin, so we don't hit the network twice) but the plugin isn't being executed. I must have configured the plugin incorrectly.&lt;br /&gt;
Raffi should look in the Hadoop log, not standard out, for his logging message.&lt;br /&gt;
Raffi should push the branch to Nathan and he'll sanity-check the configuration.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
AL/RKL: -curator status&lt;br /&gt;
&lt;br /&gt;
Let's get this as far along as possible before the sprint!&lt;br /&gt;
&lt;br /&gt;
NRY: &amp;quot;Feature&amp;quot; pages in CC wiki&lt;br /&gt;
&lt;br /&gt;
In preparation for the sprint next week, Brendan from MSU asked if we could create general &amp;quot;Feature&amp;quot; write-ups for things we're working on for AgShare.  These would describe some feature (e.g., Provenance, Analytics), so he could evaluate where they intersect with his FSKN work.  &lt;br /&gt;
&lt;br /&gt;
I think each Feature maps to one or more Stories in Pivotal.  My initial stub is at http://wiki.creativecommons.org/Curator_Filtering (probably going to rename to Provenance or something like that).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Monday, June 2, 2010 =&lt;br /&gt;
&lt;br /&gt;
== Things Asheesh wants ==&lt;br /&gt;
&lt;br /&gt;
* We work more clearly out of Pivotal Tracker.&lt;br /&gt;
** While on the phone, Asheesh updated it.&lt;br /&gt;
&lt;br /&gt;
== Things Nathan wants done before the sprint ==&lt;br /&gt;
&lt;br /&gt;
* Branch cleanup&lt;br /&gt;
** Asheesh just did this.&lt;br /&gt;
* Data imported through OAI-PMH has the provenance of its feed. Damn the torpedos^Wtests.&lt;br /&gt;
&lt;br /&gt;
Then we'll work on the &amp;quot;minus curator&amp;quot; story in Pivotal Tracker.&lt;br /&gt;
&lt;br /&gt;
== Thursday departure planning ==&lt;br /&gt;
&lt;br /&gt;
At some point, Nathan wants to figure out his Thursday departure planning on the sprint's last day.&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=DiscoverEd/Meetings&amp;diff=42555</id>
		<title>DiscoverEd/Meetings</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=DiscoverEd/Meetings&amp;diff=42555"/>
				<updated>2010-10-05T15:00:03Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: /* Monday, August 16, 2010 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:DiscoverEd]]&lt;br /&gt;
__TOC__&lt;br /&gt;
&lt;br /&gt;
= Tue Oct 5, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh: http://openetherpad.org/2010-10-05&lt;br /&gt;
&lt;br /&gt;
= Monday, August 16, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh: http://piratepad.net/vMJnNgqP5q&lt;br /&gt;
&lt;br /&gt;
= Tuesday, August 10, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh: http://piratepad.net/F3pDrLemz4&lt;br /&gt;
&lt;br /&gt;
= Monday, June 28, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh, and Raffi: [http://piratepad.net/rOBKTtpocp Notes]&lt;br /&gt;
* TripleStoreIndexer&lt;br /&gt;
** NY thinks this is fixed, want to sync up https://www.pivotaltracker.com/story/show/4001001&lt;br /&gt;
** This is working in next and master&lt;br /&gt;
** Accepted Pivotal story&lt;br /&gt;
* MinusCurator&lt;br /&gt;
** status update&lt;br /&gt;
*** Still in progress&lt;br /&gt;
* Field Mapping work (from sprint)&lt;br /&gt;
** Was blocked by TripleStoreIndexer&lt;br /&gt;
** Going to work on for 3 hours, landing work from sprint&lt;br /&gt;
* MakeSeed&lt;br /&gt;
** NY thought this was fixed, but it's open in Pivotal ( https://www.pivotaltracker.com/story/show/3888957 )&lt;br /&gt;
** Need to double check other verbs&lt;br /&gt;
* Priorities:&lt;br /&gt;
** Field Mapping&lt;br /&gt;
** Make sure feeds verbs handle provenance correctly&lt;br /&gt;
** MinusCurator&lt;br /&gt;
&lt;br /&gt;
= Monday, June 21, 2010 =&lt;br /&gt;
&lt;br /&gt;
* Nathan, Asheesh and Raffi: [http://piratepad.net/5OhqF55lTk Notes]&lt;br /&gt;
&lt;br /&gt;
= Monday, June 7, 2010 =&lt;br /&gt;
&lt;br /&gt;
AL: Branch cleanup&lt;br /&gt;
&lt;br /&gt;
A new developer should begin to add work on top of master.&lt;br /&gt;
&lt;br /&gt;
The branch &amp;quot;provenance_tests&amp;quot; contains work on OAI-PMH and RDFa. It's also where we'll add the work on minus-curator.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;provenance_tests&amp;quot; will become &amp;quot;next&amp;quot;&lt;br /&gt;
&lt;br /&gt;
AL: OAI-PMH provenance implementation&lt;br /&gt;
&lt;br /&gt;
Defer to post -curator implementation&lt;br /&gt;
&lt;br /&gt;
RKL: RDFa extraction and indexing&lt;br /&gt;
&lt;br /&gt;
We can parse the RDFa title by adding the RDFa-parsing logic in the *wrong* place (in the feed updating method). I've tried to move it into the *right* place (a plugin, so we don't hit the network twice) but the plugin isn't being executed. I must have configured the plugin incorrectly.&lt;br /&gt;
Raffi should look in the Hadoop log, not standard out, for his logging message.&lt;br /&gt;
Raffi should push the branch to Nathan and he'll sanity-check the configuration.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
AL/RKL: -curator status&lt;br /&gt;
&lt;br /&gt;
Let's get this as far along as possible before the sprint!&lt;br /&gt;
&lt;br /&gt;
NRY: &amp;quot;Feature&amp;quot; pages in CC wiki&lt;br /&gt;
&lt;br /&gt;
In preparation for the sprint next week, Brendan from MSU asked if we could create general &amp;quot;Feature&amp;quot; write-ups for things we're working on for AgShare.  These would describe some feature (e.g., Provenance, Analytics), so he could evaluate where they intersect with his FSKN work.  &lt;br /&gt;
&lt;br /&gt;
I think each Feature maps to one or more Stories in Pivotal.  My initial stub is at http://wiki.creativecommons.org/Curator_Filtering (probably going to rename to Provenance or something like that).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Monday, June 2, 2010 =&lt;br /&gt;
&lt;br /&gt;
== Things Asheesh wants ==&lt;br /&gt;
&lt;br /&gt;
* We work more clearly out of Pivotal Tracker.&lt;br /&gt;
** While on the phone, Asheesh updated it.&lt;br /&gt;
&lt;br /&gt;
== Things Nathan wants done before the sprint ==&lt;br /&gt;
&lt;br /&gt;
* Branch cleanup&lt;br /&gt;
** Asheesh just did this.&lt;br /&gt;
* Data imported through OAI-PMH has the provenance of its feed. Damn the torpedos^Wtests.&lt;br /&gt;
&lt;br /&gt;
Then we'll work on the &amp;quot;minus curator&amp;quot; story in Pivotal Tracker.&lt;br /&gt;
&lt;br /&gt;
== Thursday departure planning ==&lt;br /&gt;
&lt;br /&gt;
At some point, Nathan wants to figure out his Thursday departure planning on the sprint's last day.&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=AgShare/Tech&amp;diff=41922</id>
		<title>AgShare/Tech</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=AgShare/Tech&amp;diff=41922"/>
				<updated>2010-09-17T22:31:34Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: /* Deploying new WARs */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:DiscoverEd]]&lt;br /&gt;
&lt;br /&gt;
The AgShare deployment works analogously to the CC Labs deployment of DiscoverEd. Some important things to note:&lt;br /&gt;
&lt;br /&gt;
* Username: '''agshare'''&lt;br /&gt;
* Host name: '''search.agshare.org''' (currently the same as discovered.labs.creativecommons.org)&lt;br /&gt;
&lt;br /&gt;
So, for example, to set up your environment, do:&lt;br /&gt;
&lt;br /&gt;
 $ sudo su - agshare&lt;br /&gt;
&lt;br /&gt;
Given that, give [[Running DiscoverEd]] a look!&lt;br /&gt;
&lt;br /&gt;
== Deploying new WARs ==&lt;br /&gt;
&lt;br /&gt;
To deploy a new WAR, do this:&lt;br /&gt;
&lt;br /&gt;
* rm -rf ~/tomcat/webapps/ROOT&lt;br /&gt;
* cp nutch-1.1.war ~/tomcat/webapps/ROOT.war&lt;br /&gt;
&lt;br /&gt;
Then restart Tomcat.&lt;br /&gt;
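&lt;br /&gt;
The deploy steps above can be sketched as a small shell function. This is only an illustration, not a script that exists on the host: the name deploy_war and its two arguments are hypothetical, and it assumes the Tomcat directory is laid out like ~/tomcat as described on this page.&lt;br /&gt;
&lt;br /&gt;
```shell
# Hypothetical helper (not part of the deployment): deploy_war WAR_FILE TOMCAT_DIR
# Replaces the ROOT webapp of the given Tomcat instance with WAR_FILE.
deploy_war() {
    war_file="$1"
    tomcat_dir="$2"

    # Stop early if the WAR is missing, before removing anything.
    if [ ! -f "$war_file" ]; then
        echo "WAR file not found: $war_file"
        return 1
    fi

    # Remove the previously unpacked webapp so stale files are not served.
    rm -rf "$tomcat_dir/webapps/ROOT"

    # Install the new WAR as ROOT.war; Tomcat normally unpacks it on start.
    cp "$war_file" "$tomcat_dir/webapps/ROOT.war"
}
```
&lt;br /&gt;
Calling deploy_war nutch-1.1.war ~/tomcat would mirror the two commands listed above. Checking for the WAR first means a mistyped file name does not leave the instance without a ROOT webapp.&lt;br /&gt;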
&lt;br /&gt;
== Restarting Tomcat ==&lt;br /&gt;
&lt;br /&gt;
The AgShare deployment uses a Tomcat instance in its $HOME (supported by the tomcat6-instance-create script). It's wrapped as &amp;quot;/etc/init.d/agshare&amp;quot; so the boot process can use it. But you can restart it this way:&lt;br /&gt;
&lt;br /&gt;
* ~/tomcat/bin/shutdown.sh&lt;br /&gt;
* ~/tomcat/bin/startup.sh&lt;br /&gt;
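&lt;br /&gt;
The two restart commands above could be wrapped in one function, roughly as follows. The name restart_tomcat and its argument are hypothetical; the sketch only assumes that shutdown.sh and startup.sh live under the bin directory of the Tomcat instance, as shown in the list.&lt;br /&gt;
&lt;br /&gt;
```shell
# Hypothetical helper (not part of the deployment): restart_tomcat TOMCAT_DIR
# Runs the instance's own shutdown and startup scripts in order.
restart_tomcat() {
    tomcat_dir="$1"

    # Ask the running instance to stop; carry on even if it was not running.
    "$tomcat_dir/bin/shutdown.sh" || echo "shutdown.sh failed; Tomcat may not have been running"

    # Start the instance again; the function returns startup.sh's exit status.
    "$tomcat_dir/bin/startup.sh"
}
```
&lt;br /&gt;
For the AgShare instance this would be invoked as restart_tomcat ~/tomcat. Note that shutdown.sh can return before Tomcat has fully stopped, so a more careful script might wait between the two steps.&lt;br /&gt;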
&lt;br /&gt;
== Starting Tomcat at boot ==&lt;br /&gt;
&lt;br /&gt;
/etc/rc.local contains a call to run ~/tomcat/bin/startup.sh as the agshare user. That's kind of hackish, I realize.&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=AgShare/Tech&amp;diff=41921</id>
		<title>AgShare/Tech</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=AgShare/Tech&amp;diff=41921"/>
				<updated>2010-09-17T22:28:49Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:DiscoverEd]]&lt;br /&gt;
&lt;br /&gt;
The AgShare deployment works analogously to the CC Labs deployment of DiscoverEd. Some important things to note:&lt;br /&gt;
&lt;br /&gt;
* Username: '''agshare'''&lt;br /&gt;
* Host name: '''search.agshare.org''' (currently the same as discovered.labs.creativecommons.org)&lt;br /&gt;
&lt;br /&gt;
So, for example, to set up your environment, do:&lt;br /&gt;
&lt;br /&gt;
 $ sudo su - agshare&lt;br /&gt;
&lt;br /&gt;
Given that, give [[Running DiscoverEd]] a look!&lt;br /&gt;
&lt;br /&gt;
== Deploying new WARs ==&lt;br /&gt;
&lt;br /&gt;
To deploy a new WAR, do this:&lt;br /&gt;
&lt;br /&gt;
* cp nutch-1.1.war ~/tomcat/webapps/ROOT.war&lt;br /&gt;
&lt;br /&gt;
Then restart Tomcat.&lt;br /&gt;
&lt;br /&gt;
== Restarting Tomcat ==&lt;br /&gt;
&lt;br /&gt;
The AgShare deployment uses a Tomcat instance in its $HOME (supported by the tomcat6-instance-create script). It's wrapped as &amp;quot;/etc/init.d/agshare&amp;quot; so the boot process can use it. But you can restart it this way:&lt;br /&gt;
&lt;br /&gt;
* ~/tomcat/bin/shutdown.sh&lt;br /&gt;
* ~/tomcat/bin/startup.sh&lt;br /&gt;
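&lt;br /&gt;
The deploy and restart steps can be combined into one small script. This is only a sketch: the run helper, the deploy_war function, and the DRY_RUN and TOMCAT_DIR variables are names made up for illustration, and it assumes you are already logged in as the agshare user.&lt;br /&gt;

```shell
#!/bin/sh
# Sketch: copy a WAR into the per-user Tomcat and restart it.
# DRY_RUN defaults to 1 so the commands are only printed; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"
TOMCAT_DIR="${TOMCAT_DIR:-$HOME/tomcat}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it
# and stop at the first failure.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@" || exit 1
  fi
}

deploy_war() {
  run cp "$1" "$TOMCAT_DIR/webapps/ROOT.war"
  run "$TOMCAT_DIR/bin/shutdown.sh"
  run "$TOMCAT_DIR/bin/startup.sh"
}

deploy_war nutch-1.1.war
```

Printing the commands first makes it easy to check the paths before touching the running instance.&lt;br /&gt;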
&lt;br /&gt;
== Starting Tomcat at boot ==&lt;br /&gt;
&lt;br /&gt;
/etc/rc.local contains a call to run ~/tomcat/bin/startup.sh as the agshare user. That's kind of hackish, I realize.&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=Running_DiscoverEd&amp;diff=41920</id>
		<title>Running DiscoverEd</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=Running_DiscoverEd&amp;diff=41920"/>
				<updated>2010-09-17T22:24:23Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: /* Crawl */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:DiscoverEd]]&lt;br /&gt;
&lt;br /&gt;
{{Infobox|This page contains raw documentation, some of which is only applicable to our DiscoverEd deployment.  It will be massaged into more general docs in the fullness of time.}}&lt;br /&gt;
&lt;br /&gt;
== Instructions for running a crawl ==&lt;br /&gt;
&lt;br /&gt;
Tips: &lt;br /&gt;
* For long aggregates and crawls, run in 'screen'.&lt;br /&gt;
&lt;br /&gt;
Updating the index has three phases:&lt;br /&gt;
# Aggregation (polling feeds old and new)&lt;br /&gt;
# Crawling&lt;br /&gt;
# Merging (combining the new index with the existing one)&lt;br /&gt;
&lt;br /&gt;
=== Set up environment ===&lt;br /&gt;
&lt;br /&gt;
Execute these commands to set up your environment for running the tools.  It also places you in the discovered user's account.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sudo su - discovered&lt;br /&gt;
$ cd code&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Managing Feeds ===&lt;br /&gt;
&lt;br /&gt;
The feeds script (./bin/feeds) allows you to add curators or feeds. &lt;br /&gt;
Running it without parameters will show the sub-commands.  &lt;br /&gt;
Feeds and curators are identified by URL (and yes, it's picky -- http://example.org is not the same as http://example.org/ ).&lt;br /&gt;
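&lt;br /&gt;
Assuming identifiers are compared as exact strings, which is what the note about the trailing slash suggests, a small hypothetical helper (not part of ./bin/feeds) shows why the two URLs count as different:&lt;br /&gt;

```shell
# Hypothetical helper: compare two identifiers as exact strings,
# with no URL normalization at all.
same_identifier() {
  if [ "$1" = "$2" ]; then
    echo "same"
  else
    echo "different"
  fi
}

# The trailing slash alone makes these two distinct identifiers.
same_identifier "http://example.org" "http://example.org/"   # prints: different
```

In practice this means copying a feed or curator URL exactly as it was first entered.&lt;br /&gt;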
&lt;br /&gt;
==== Notes ====&lt;br /&gt;
&lt;br /&gt;
*For each feed packaged by an OPML feed, the curator is set by the feed title. The OPML consumer will only add the curator/feed if the feed isn't already in the system. &lt;br /&gt;
&lt;br /&gt;
*If you add a feed that already exists, you'll just overwrite the old one (since it's a triple store and the URI is the identifier). The same goes for curators; they're also identified by URI.  It's more likely you'd get two curators, but so long as you're dealing with the same feed URL you won't get duplicates.&lt;br /&gt;
&lt;br /&gt;
=== Aggregation ===&lt;br /&gt;
&lt;br /&gt;
Aggregation polls the feeds and adds new resources to the triple store.  It will also poll any OPML feeds and add the new feeds it finds.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;$ ./bin/feeds aggregate&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Crawl ===&lt;br /&gt;
&lt;br /&gt;
Before you crawl you need to make a seed which tells the crawler what to retrieve.&lt;br /&gt;
&lt;br /&gt;
If the directory &amp;quot;seed/&amp;quot; does not exist, create it with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir seed&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then create the seed list of URLs:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./bin/feeds seed &amp;gt; ./seed/crawl-urls.txt&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the crawl runs it will look in ./seed/ and open every file it finds there, expecting to find one URL per line (so remove files when you don't want them to be crawled).&lt;br /&gt;
&lt;br /&gt;
To run the actual crawl do:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./bin/crawl_and_merge.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Finally, restart Tomcat (the Java app server) to make sure the new index is being used:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sudo /etc/init.d/tomcat6 restart&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
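&lt;br /&gt;
The whole update can be strung together roughly as follows. This is a sketch rather than a tested deployment script: the run helper, the update_index function, and the DRY_RUN variable are made up for illustration, and it assumes you are in the code directory as described under "Set up environment".&lt;br /&gt;

```shell
#!/bin/sh
# Sketch of one index update, run from the code directory.
# DRY_RUN defaults to 1 so the commands are only printed; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it
# and stop at the first failure.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@" || exit 1
  fi
}

update_index() {
  run ./bin/feeds aggregate
  run mkdir -p seed
  # The seed list is written by redirecting the output of the seed subcommand.
  run sh -c './bin/feeds seed > ./seed/crawl-urls.txt'
  run ./bin/crawl_and_merge.sh
  run sudo /etc/init.d/tomcat6 restart
}

update_index
```

Since aggregation and crawling can take a long time, a script like this is best started inside 'screen'.&lt;br /&gt;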
&lt;br /&gt;
== Managing curators and feeds ==&lt;br /&gt;
&lt;br /&gt;
On discovered.labs.creativecommons.org in the $HOME/code directory, running ./bin/feeds with no parameters shows the list of subcommands:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  listfeeds        list all feeds&lt;br /&gt;
  listcurators     list all curators&lt;br /&gt;
  addfeed          add a feed&lt;br /&gt;
  resetfeed        reset the last aggregation date for a feed&lt;br /&gt;
  addcurator       add a curator&lt;br /&gt;
  rmfeed           remove a feed&lt;br /&gt;
  setcurator       set the curator for a feed&lt;br /&gt;
  aggregate&lt;br /&gt;
  dump&lt;br /&gt;
  seed&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Each one is run as an argument to the feeds script (i.e. ./bin/feeds [command] [parameter1] [parameter2]...)&lt;br /&gt;
&lt;br /&gt;
=== addfeed ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
addfeed [feed_type] [feed_url] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Assuming you've added the curator for this feed with addcurator, the curator URL will set the curator (so you don't need to set it again with setcurator).&lt;br /&gt;
&lt;br /&gt;
Feed type notes: &amp;quot;rss&amp;quot; is a parser that does RSS/Atom sniffing.&lt;br /&gt;
&lt;br /&gt;
=== addcurator ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
addcurator [curator_name] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Curator names with spaces should be surrounded by quotation marks (e.g. addcurator &amp;quot;CC Open Textbook Project&amp;quot; http://www.collegeopentextbooks.org/)&lt;br /&gt;
&lt;br /&gt;
=== setcurator ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
setcurator [feed_url] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
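&lt;br /&gt;
Putting addcurator and addfeed together might look like the sketch below. The curator name and both URLs are hypothetical, as are the FEEDS variable and the register_feed function; only the subcommand names and argument order come from the usage lines above.&lt;br /&gt;

```shell
# Sketch: register a curator, then a feed that belongs to it.
# FEEDS defaults to printing the commands; set FEEDS=./bin/feeds to run them.
FEEDS="${FEEDS:-echo ./bin/feeds}"
CURATOR_URL="http://example.org/"          # hypothetical curator URL
FEED_URL="http://example.org/feed.xml"     # hypothetical feed URL

register_feed() {
  # Quote the curator name because it contains a space.
  $FEEDS addcurator "Example Curator" "$CURATOR_URL"
  # Reuse the same variable so the curator URL matches character for character.
  $FEEDS addfeed rss "$FEED_URL" "$CURATOR_URL"
}

register_feed
```

Keeping the curator URL in one variable avoids accidentally creating a second curator through a stray trailing slash.&lt;br /&gt;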
&lt;br /&gt;
== Deploying new WARs ==&lt;br /&gt;
&lt;br /&gt;
To deploy a new WAR, do this:&lt;br /&gt;
&lt;br /&gt;
* sudo rm -rf /var/lib/tomcat6/webapps/search/ # clear the existing app to force redeployment&lt;br /&gt;
* sudo cp nutch-1.1.war /var/lib/tomcat6/webapps/search.war&lt;br /&gt;
* sudo /etc/init.d/tomcat6 restart&lt;br /&gt;
&lt;br /&gt;
== Things the server administrator should know ==&lt;br /&gt;
&lt;br /&gt;
=== JAVA_HOME ===&lt;br /&gt;
&lt;br /&gt;
Many of our scripts require a JAVA_HOME environment variable to be set. For our convenience, we configured ''discovered.labs.creativecommons.org'' to have JAVA_HOME set for every user. We did that by adding this to '''/etc/profile''':&lt;br /&gt;
&lt;br /&gt;
 JAVA_HOME=/usr/lib/jvm/java-6-openjdk/ ; export JAVA_HOME&lt;br /&gt;
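&lt;br /&gt;
A quick way to confirm the setting took effect is a check like the one below. It is a sketch: check_java_home is a made-up function name, and testing for bin/java under JAVA_HOME is an assumption about what the scripts need.&lt;br /&gt;

```shell
# Sketch: report whether JAVA_HOME is set and contains an executable bin/java.
check_java_home() {
  if [ -z "${JAVA_HOME:-}" ]; then
    echo "JAVA_HOME is not set"
    return 1
  fi
  if [ -x "$JAVA_HOME/bin/java" ]; then
    echo "JAVA_HOME looks usable: $JAVA_HOME"
  else
    echo "no java executable under $JAVA_HOME/bin"
    return 1
  fi
}

# Run the check once; "|| true" keeps a failed check from aborting a calling script.
check_java_home || true
```

Remember that /etc/profile is only read by login shells, so open a new login session before checking.&lt;br /&gt;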
&lt;br /&gt;
=== Maximum open files ===&lt;br /&gt;
&lt;br /&gt;
Tomcat and Nutch sometimes have problems opening files. This is because they've exceeded the number of open files that a process can have.&lt;br /&gt;
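&lt;br /&gt;
To see which limits currently apply to your shell, something like the following can help. It is a sketch; show_nofile_limits is a made-up function name, and the values it prints depend on the machine.&lt;br /&gt;

```shell
# Sketch: print the soft and hard open-file limits for the current shell.
show_nofile_limits() {
  echo "soft: $(ulimit -S -n)"
  echo "hard: $(ulimit -H -n)"
}

show_nofile_limits
```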
&lt;br /&gt;
To address this, we added this to '''/etc/security/limits.conf''':&lt;br /&gt;
&lt;br /&gt;
 ### For Tomcat etc.&lt;br /&gt;
 *               soft    nofile          4096&lt;br /&gt;
 *               hard    nofile          4096&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=Running_DiscoverEd&amp;diff=41919</id>
		<title>Running DiscoverEd</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=Running_DiscoverEd&amp;diff=41919"/>
				<updated>2010-09-17T22:24:07Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: /* Crawl */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:DiscoverEd]]&lt;br /&gt;
&lt;br /&gt;
{{Infobox|This page contains raw documentation, some of which is only applicable to our DiscoverEd deployment.  It will be massaged into more general docs in the fullness of time.}}&lt;br /&gt;
&lt;br /&gt;
== Instructions for running a crawl ==&lt;br /&gt;
&lt;br /&gt;
Tips: &lt;br /&gt;
* For long aggregates and crawls, run in 'screen'.&lt;br /&gt;
&lt;br /&gt;
Updating the index has three phases:&lt;br /&gt;
# Aggregation (polling feeds old and new)&lt;br /&gt;
# Crawling&lt;br /&gt;
# Merging (combining the new index with the existing one)&lt;br /&gt;
&lt;br /&gt;
=== Set up environment ===&lt;br /&gt;
&lt;br /&gt;
Execute these commands to set up your environment for running the tools.  It also places you in the discovered user's account.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sudo su - discovered&lt;br /&gt;
$ cd code&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Managing Feeds ===&lt;br /&gt;
&lt;br /&gt;
The feeds script (./bin/feeds) allows you to add curators or feeds. &lt;br /&gt;
Running it without parameters will show the sub-commands.  &lt;br /&gt;
Feeds and curators are identified by URL (and yes, it's picky -- http://example.org is not the same as http://example.org/ ).&lt;br /&gt;
&lt;br /&gt;
==== Notes ====&lt;br /&gt;
&lt;br /&gt;
*For each feed packaged by an OPML feed, the curator is set by the feed title. The OPML consumer will only add the curator/feed if the feed isn't already in the system. &lt;br /&gt;
&lt;br /&gt;
*If you add a feed that already exists, you'll just overwrite the old one (since it's a triple store and the URI is the identifier). The same goes for curators; they're also identified by URI.  It's more likely you'd get two curators, but so long as you're dealing with the same feed URL you won't get duplicates.&lt;br /&gt;
&lt;br /&gt;
=== Aggregation ===&lt;br /&gt;
&lt;br /&gt;
Aggregation polls the feeds and adds new resources to the triple store.  It will also poll any OPML feeds and add the new feeds it finds.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;$ ./bin/feeds aggregate&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Crawl ===&lt;br /&gt;
&lt;br /&gt;
Before you crawl you need to make a seed which tells the crawler what to retrieve.&lt;br /&gt;
&lt;br /&gt;
If the directory &amp;quot;seed/&amp;quot; does not exist, create it with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir seed&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then create the seed list of URLs:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./bin/feeds seed &amp;gt; ./seed/crawl-urls.txt&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the crawl runs it will look in ./seed/ and open every file it finds there, expecting to find one URL per line (so remove files when you don't want them to be crawled).&lt;br /&gt;
&lt;br /&gt;
To run the actual crawl do:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./bin/crawl_and_merge.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Finally, restart Tomcat (the Java app server) to make sure the new index is being used:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sudo /etc/init.d/tomcat6 restart&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Managing curators and feeds ==&lt;br /&gt;
&lt;br /&gt;
On discovered.labs.creativecommons.org in the $HOME/code directory, running ./bin/feeds with no parameters shows the list of subcommands:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  listfeeds        list all feeds&lt;br /&gt;
  listcurators     list all curators&lt;br /&gt;
  addfeed          add a feed&lt;br /&gt;
  resetfeed        reset the last aggregation date for a feed&lt;br /&gt;
  addcurator       add a curator&lt;br /&gt;
  rmfeed           remove a feed&lt;br /&gt;
  setcurator       set the curator for a feed&lt;br /&gt;
  aggregate&lt;br /&gt;
  dump&lt;br /&gt;
  seed&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Each one is run as an argument to the feeds script (i.e. ./bin/feeds [command] [parameter1] [parameter2]...)&lt;br /&gt;
&lt;br /&gt;
=== addfeed ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
addfeed [feed_type] [feed_url] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Assuming you've added the curator for this feed with addcurator, the curator URL will set the curator (so you don't need to set it again with setcurator).&lt;br /&gt;
&lt;br /&gt;
Feed type notes: &amp;quot;rss&amp;quot; is a parser that does RSS/Atom sniffing.&lt;br /&gt;
&lt;br /&gt;
=== addcurator ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
addcurator [curator_name] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Curator names with spaces should be surrounded by quotation marks (e.g. addcurator &amp;quot;CC Open Textbook Project&amp;quot; http://www.collegeopentextbooks.org/)&lt;br /&gt;
&lt;br /&gt;
=== setcurator ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
setcurator [feed_url] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Deploying new WARs ==&lt;br /&gt;
&lt;br /&gt;
To deploy a new WAR, do this:&lt;br /&gt;
&lt;br /&gt;
* sudo rm -rf /var/lib/tomcat6/webapps/search/ # clear the existing app to force redeployment&lt;br /&gt;
* sudo cp nutch-1.1.war /var/lib/tomcat6/webapps/search.war&lt;br /&gt;
* sudo /etc/init.d/tomcat6 restart&lt;br /&gt;
&lt;br /&gt;
== Things the server administrator should know ==&lt;br /&gt;
&lt;br /&gt;
=== JAVA_HOME ===&lt;br /&gt;
&lt;br /&gt;
Many of our scripts require a JAVA_HOME environment variable to be set. For our convenience, we configured ''discovered.labs.creativecommons.org'' to have JAVA_HOME set for every user. We did that by adding this to '''/etc/profile''':&lt;br /&gt;
&lt;br /&gt;
 JAVA_HOME=/usr/lib/jvm/java-6-openjdk/ ; export JAVA_HOME&lt;br /&gt;
&lt;br /&gt;
=== Maximum open files ===&lt;br /&gt;
&lt;br /&gt;
Tomcat and Nutch sometimes have problems opening files. This is because they've exceeded the number of open files that a process can have.&lt;br /&gt;
&lt;br /&gt;
To address this, we added this to '''/etc/security/limits.conf''':&lt;br /&gt;
&lt;br /&gt;
 ### For Tomcat etc.&lt;br /&gt;
 *               soft    nofile          4096&lt;br /&gt;
 *               hard    nofile          4096&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=Running_DiscoverEd&amp;diff=41561</id>
		<title>Running DiscoverEd</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=Running_DiscoverEd&amp;diff=41561"/>
				<updated>2010-09-10T15:44:26Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: /* Deploying new WARs */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:DiscoverEd]]&lt;br /&gt;
&lt;br /&gt;
{{Infobox|This page contains raw documentation, some of which is only applicable to our DiscoverEd deployment.  It will be massaged into more general docs in the fullness of time.}}&lt;br /&gt;
&lt;br /&gt;
== Instructions for running a crawl ==&lt;br /&gt;
&lt;br /&gt;
Tips: &lt;br /&gt;
* For long aggregates and crawls, run in 'screen'.&lt;br /&gt;
&lt;br /&gt;
Updating the index has three phases:&lt;br /&gt;
# Aggregation (polling feeds old and new)&lt;br /&gt;
# Crawling&lt;br /&gt;
# Merging (combining the new index with the existing one)&lt;br /&gt;
&lt;br /&gt;
=== Set up environment ===&lt;br /&gt;
&lt;br /&gt;
Execute these commands to set up your environment for running the tools.  It also places you in the discovered user's account.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sudo su - discovered&lt;br /&gt;
$ cd code&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Managing Feeds ===&lt;br /&gt;
&lt;br /&gt;
The feeds script (./bin/feeds) allows you to add curators or feeds. &lt;br /&gt;
Running it without parameters will show the sub-commands.  &lt;br /&gt;
Feeds and curators are identified by URL (and yes, it's picky -- http://example.org is not the same as http://example.org/ ).&lt;br /&gt;
&lt;br /&gt;
==== Notes ====&lt;br /&gt;
&lt;br /&gt;
*For each feed packaged by an OPML feed, the curator is set by the feed title. The OPML consumer will only add the curator/feed if the feed isn't already in the system. &lt;br /&gt;
&lt;br /&gt;
*If you add a feed that already exists, you'll just overwrite the old one (since it's a triple store and the URI is the identifier). The same goes for curators; they're also identified by URI.  It's more likely you'd get two curators, but so long as you're dealing with the same feed URL you won't get duplicates.&lt;br /&gt;
&lt;br /&gt;
=== Aggregation ===&lt;br /&gt;
&lt;br /&gt;
Aggregation polls the feeds and adds new resources to the triple store.  It will also poll any OPML feeds and add the new feeds it finds.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;$ ./bin/feeds aggregate&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Crawl ===&lt;br /&gt;
&lt;br /&gt;
Before you crawl you need to make a seed which tells the crawler what to retrieve.&lt;br /&gt;
&lt;br /&gt;
If the directory &amp;quot;seed/&amp;quot; does not exist, create it with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir seed&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then create the seed list of URLs:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./bin/feeds seed &amp;gt; ./seed/crawl-urls.txt&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the crawl runs it will look in ./seed/ and open every file it finds there, expecting to find one URL per line (so remove files when you don't want them to be crawled).&lt;br /&gt;
&lt;br /&gt;
To run the actual crawl do:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ant -f dedbuild.xml crawl&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will read the seed files and run the crawl.  The result is a new index in the oenutch directory; each crawl directory gets a timestamp-derived name, for example crawl-20090730201000 for a crawl run on July 30, 2009 at 8:10 PM.  After the crawl completes you need to merge the new index with the old one.&lt;br /&gt;
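&lt;br /&gt;
To make the naming pattern concrete, the sketch below builds a name in the same crawl-YYYYMMDDHHMMSS style from a date string. The crawl_dir_name function is made up for illustration and assumes GNU date (for the -d option); the crawl itself picks the real directory name.&lt;br /&gt;

```shell
# Sketch: format a date and time as a crawl-YYYYMMDDHHMMSS directory name.
# Assumes GNU date, which accepts a date string through -d.
crawl_dir_name() {
  date -d "$1" +crawl-%Y%m%d%H%M%S
}

crawl_dir_name "2009-07-30 20:10:00"   # prints: crawl-20090730201000
```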
&lt;br /&gt;
The production index lives in '''$HOME/production-crawl''' ($HOME is set to /var/www/discovered.labs.creativecommons.org).&lt;br /&gt;
&lt;br /&gt;
To merge the index run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./bin/merge ./crawl-&amp;lt;timestamp&amp;gt;-merged $HOME/production-crawl ./crawl-&amp;lt;timestamp&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The target directory (the first parameter) will be created for you. The second parameter doesn't change, and the third parameter is the directory just created by the crawl.&lt;br /&gt;
&lt;br /&gt;
After the merge completes (assuming it does so successfully) you'll want to move it into the production directory.  Do something like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mkdir -p ~/archived-crawls/$(date -I)&lt;br /&gt;
$ mv ~/production-crawl ~/archived-crawls/$(date -I)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
to rename the existing index so you can go back to it if necessary.&lt;br /&gt;
&lt;br /&gt;
Then you can do&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mv ./crawl-new-dir-merged ~/production-crawl&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
And finally restart Tomcat (the Java app server) to make sure the new index is being used:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sudo /etc/init.d/tomcat6 restart&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
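&lt;br /&gt;
The merge, archive, and swap steps above can be collected into one script. This is a sketch under assumptions: the run helper, the promote_crawl function, and the DRY_RUN and CRAWL_DIR variables are made up for illustration, and the default CRAWL_DIR is just the example directory name used earlier.&lt;br /&gt;

```shell
#!/bin/sh
# Sketch of the merge-and-promote steps, run from the code directory.
# DRY_RUN defaults to 1 so the commands are only printed; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"
# Hypothetical variable: the directory the crawl just produced.
CRAWL_DIR="${CRAWL_DIR:-./crawl-20090730201000}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it
# and stop at the first failure.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@" || exit 1
  fi
}

promote_crawl() {
  archive="$HOME/archived-crawls/$(date -I)"
  # Merge the production index with the new crawl into a fresh target directory.
  run ./bin/merge "$CRAWL_DIR-merged" "$HOME/production-crawl" "$CRAWL_DIR"
  # Keep the old index so it can be restored if necessary.
  run mkdir -p "$archive"
  run mv "$HOME/production-crawl" "$archive"
  # Move the merged index into place and restart Tomcat to pick it up.
  run mv "$CRAWL_DIR-merged" "$HOME/production-crawl"
  run sudo /etc/init.d/tomcat6 restart
}

promote_crawl
```

Stopping at the first failed command matters here, because moving the production index aside after a failed merge would leave the site without an index.&lt;br /&gt;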
&lt;br /&gt;
== Managing curators and feeds ==&lt;br /&gt;
&lt;br /&gt;
On discovered.labs.creativecommons.org in the $HOME/code directory, running ./bin/feeds with no parameters shows the list of subcommands:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  listfeeds        list all feeds&lt;br /&gt;
  listcurators     list all curators&lt;br /&gt;
  addfeed          add a feed&lt;br /&gt;
  resetfeed        reset the last aggregation date for a feed&lt;br /&gt;
  addcurator       add a curator&lt;br /&gt;
  rmfeed           remove a feed&lt;br /&gt;
  setcurator       set the curator for a feed&lt;br /&gt;
  aggregate&lt;br /&gt;
  dump&lt;br /&gt;
  seed&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Each one is run as an argument to the feeds script (i.e. ./bin/feeds [command] [parameter1] [parameter2]...)&lt;br /&gt;
&lt;br /&gt;
=== addfeed ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
addfeed [feed_type] [feed_url] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Assuming you've added the curator for this feed with addcurator, the curator URL will set the curator (so you don't need to set it again with setcurator).&lt;br /&gt;
&lt;br /&gt;
Feed type notes: &amp;quot;rss&amp;quot; is a parser that does RSS/Atom sniffing.&lt;br /&gt;
&lt;br /&gt;
=== addcurator ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
addcurator [curator_name] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Curator names with spaces should be surrounded by quotation marks (e.g. addcurator &amp;quot;CC Open Textbook Project&amp;quot; http://www.collegeopentextbooks.org/)&lt;br /&gt;
&lt;br /&gt;
=== setcurator ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
setcurator [feed_url] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Deploying new WARs ==&lt;br /&gt;
&lt;br /&gt;
To deploy a new WAR, do this:&lt;br /&gt;
&lt;br /&gt;
* sudo rm -rf /var/lib/tomcat6/webapps/search/ # clear the existing app to force redeployment&lt;br /&gt;
* sudo cp nutch-1.1.war /var/lib/tomcat6/webapps/search.war&lt;br /&gt;
* sudo /etc/init.d/tomcat6 restart&lt;br /&gt;
&lt;br /&gt;
== Things the server administrator should know ==&lt;br /&gt;
&lt;br /&gt;
=== JAVA_HOME ===&lt;br /&gt;
&lt;br /&gt;
Many of our scripts require a JAVA_HOME environment variable to be set. For our convenience, we configured ''discovered.labs.creativecommons.org'' to have JAVA_HOME set for every user. We did that by adding this to '''/etc/profile''':&lt;br /&gt;
&lt;br /&gt;
 JAVA_HOME=/usr/lib/jvm/java-6-openjdk/ ; export JAVA_HOME&lt;br /&gt;
&lt;br /&gt;
=== Maximum open files ===&lt;br /&gt;
&lt;br /&gt;
Tomcat and Nutch sometimes have problems opening files. This is because they've exceeded the number of open files that a process can have.&lt;br /&gt;
&lt;br /&gt;
To address this, we added this to '''/etc/security/limits.conf''':&lt;br /&gt;
&lt;br /&gt;
 ### For Tomcat etc.&lt;br /&gt;
 *               soft    nofile          4096&lt;br /&gt;
 *               hard    nofile          4096&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=Running_DiscoverEd&amp;diff=41560</id>
		<title>Running DiscoverEd</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=Running_DiscoverEd&amp;diff=41560"/>
				<updated>2010-09-10T15:43:57Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:DiscoverEd]]&lt;br /&gt;
&lt;br /&gt;
{{Infobox|This page contains raw documentation, some of which is only applicable to our DiscoverEd deployment.  It will be massaged into more general docs in the fullness of time.}}&lt;br /&gt;
&lt;br /&gt;
== Instructions for running a crawl ==&lt;br /&gt;
&lt;br /&gt;
Tips: &lt;br /&gt;
* For long aggregates and crawls, run in 'screen'.&lt;br /&gt;
&lt;br /&gt;
Updating the index has three phases:&lt;br /&gt;
# Aggregation (polling feeds old and new)&lt;br /&gt;
# Crawling&lt;br /&gt;
# Merging (combining the new index with the existing one)&lt;br /&gt;
&lt;br /&gt;
=== Set up environment ===&lt;br /&gt;
&lt;br /&gt;
Execute these commands to set up your environment for running the tools.  It also places you in the discovered user's account.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sudo su - discovered&lt;br /&gt;
$ cd code&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Managing Feeds ===&lt;br /&gt;
&lt;br /&gt;
The feeds script (./bin/feeds) allows you to add curators or feeds. &lt;br /&gt;
Running it without parameters will show the sub-commands.  &lt;br /&gt;
Feeds and curators are identified by URL (and yes, it's picky -- http://example.org is not the same as http://example.org/ ).&lt;br /&gt;
&lt;br /&gt;
==== Notes ====&lt;br /&gt;
&lt;br /&gt;
*For each feed packaged by an OPML feed, the curator is set by the feed title. The OPML consumer will only add the curator/feed if the feed isn't already in the system. &lt;br /&gt;
&lt;br /&gt;
*If you add a feed that already exists, you'll just overwrite the old one (since it's a triple store and the URI is the identifier). The same goes for curators; they're also identified by URI.  It's more likely you'd get two curators, but so long as you're dealing with the same feed URL you won't get duplicates.&lt;br /&gt;
&lt;br /&gt;
=== Aggregation ===&lt;br /&gt;
&lt;br /&gt;
Aggregation polls the feeds and adds new resources to the triple store.  It will also poll any OPML feeds and add the new feeds it finds.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;$ ./bin/feeds aggregate&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Crawl ===&lt;br /&gt;
&lt;br /&gt;
Before you crawl you need to make a seed which tells the crawler what to retrieve.&lt;br /&gt;
&lt;br /&gt;
If the directory &amp;quot;seed/&amp;quot; does not exist, create it with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir seed&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then create the seed list of URLs:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./bin/feeds seed &amp;gt; ./seed/crawl-urls.txt&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the crawl runs it will look in ./seed/ and open every file it finds there, expecting to find one URL per line (so remove files when you don't want them to be crawled).&lt;br /&gt;
&lt;br /&gt;
To run the actual crawl do:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ant -f dedbuild.xml crawl&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will read the seed files and run the crawl.  The result is a new index in the oenutch directory; each crawl directory gets a timestamp-derived name, for example crawl-20090730201000 for a crawl run on July 30, 2009 at 8:10 PM.  After the crawl completes you need to merge the new index with the old one.&lt;br /&gt;
&lt;br /&gt;
The production index lives in '''$HOME/production-crawl''' ($HOME is set to /var/www/discovered.labs.creativecommons.org).&lt;br /&gt;
&lt;br /&gt;
To merge the index run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./bin/merge ./crawl-&amp;lt;timestamp&amp;gt;-merged $HOME/production-crawl ./crawl-&amp;lt;timestamp&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The target directory (the first parameter) will be created for you. The second parameter doesn't change, and the third parameter is the directory just created by the crawl.&lt;br /&gt;
&lt;br /&gt;
After the merge completes (assuming it does so successfully) you'll want to move it into the production directory.  Do something like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mkdir -p ~/archived-crawls/$(date -I)&lt;br /&gt;
$ mv ~/production-crawl ~/archived-crawls/$(date -I)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
to rename the existing index so you can go back to it if necessary.&lt;br /&gt;
&lt;br /&gt;
Then you can do&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mv ./crawl-new-dir-merged ~/production-crawl&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
And finally restart Tomcat (the Java app server) to make sure the new index is being used:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sudo /etc/init.d/tomcat6 restart&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Managing curators and feeds ==&lt;br /&gt;
&lt;br /&gt;
On discovered.labs.creativecommons.org in the $HOME/code directory, running ./bin/feeds with no parameters shows the list of subcommands:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  listfeeds        list all feeds&lt;br /&gt;
  listcurators     list all curators&lt;br /&gt;
  addfeed          add a feed&lt;br /&gt;
  resetfeed        reset the last aggregation date for a feed&lt;br /&gt;
  addcurator       add a curator&lt;br /&gt;
  rmfeed           remove a feed&lt;br /&gt;
  setcurator       set the curator for a feed&lt;br /&gt;
  aggregate&lt;br /&gt;
  dump&lt;br /&gt;
  seed&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Each one is run as an argument to the feeds script (i.e. ./bin/feeds [command] [parameter1] [parameter2]...)&lt;br /&gt;
&lt;br /&gt;
=== addfeed ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
addfeed [feed_type] [feed_url] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Assuming you've added the curator for this feed with addcurator, the curator URL will set the curator (so you don't need to set it again with setcurator).&lt;br /&gt;
&lt;br /&gt;
Feed type notes: &amp;quot;rss&amp;quot; is a parser that does RSS/Atom sniffing.&lt;br /&gt;
&lt;br /&gt;
=== addcurator ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
addcurator [curator_name] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Curator names with spaces should be surrounded by quotation marks (e.g. addcurator &amp;quot;CC Open Textbook Project&amp;quot; http://www.collegeopentextbooks.org/)&lt;br /&gt;
&lt;br /&gt;
=== setcurator ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
setcurator [feed_url] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Deploying new WARs ==&lt;br /&gt;
&lt;br /&gt;
To deploy a new WAR, do this:&lt;br /&gt;
&lt;br /&gt;
* sudo rm -rf /var/lib/tomcat6/webapps/search/ # clear the existing app to force redeployment&lt;br /&gt;
* sudo cp nutch-1.1.war /var/lib/tomcat6/webapps/search.war&lt;br /&gt;
* sudo /etc/init.d/tomcat6 restart&lt;br /&gt;
&lt;br /&gt;
== Things the server administrator should know ==&lt;br /&gt;
&lt;br /&gt;
=== JAVA_HOME ===&lt;br /&gt;
&lt;br /&gt;
Many of our scripts require a JAVA_HOME environment variable to be set. For our convenience, we configured ''discovered.labs.creativecommons.org'' to have JAVA_HOME set for every user. We did that by adding this to '''/etc/profile''':&lt;br /&gt;
&lt;br /&gt;
 JAVA_HOME=/usr/lib/jvm/java-6-openjdk/ ; export JAVA_HOME&lt;br /&gt;
&lt;br /&gt;
=== Maximum open files ===&lt;br /&gt;
&lt;br /&gt;
Tomcat and Nutch sometimes have problems opening files. This is because they've exceeded the number of open files that a process can have.&lt;br /&gt;
&lt;br /&gt;
To address this, we added this to '''/etc/security/limits.conf''':&lt;br /&gt;
&lt;br /&gt;
 ### For Tomcat etc.&lt;br /&gt;
 *               soft    nofile          4096&lt;br /&gt;
 *               hard    nofile          4096&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=AgShare/Tech&amp;diff=41504</id>
		<title>AgShare/Tech</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=AgShare/Tech&amp;diff=41504"/>
				<updated>2010-09-08T23:06:39Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:DiscoverEd]]&lt;br /&gt;
&lt;br /&gt;
The AgShare deployment works analogously to the CC Labs deployment of DiscoverEd. Some important things to note:&lt;br /&gt;
&lt;br /&gt;
* Username: '''agshare'''&lt;br /&gt;
* Host name: '''search.agshare.org''' (currently the same as discovered.labs.creativecommons.org)&lt;br /&gt;
&lt;br /&gt;
So, for example, to set up your environment, do:&lt;br /&gt;
&lt;br /&gt;
 $ sudo su - agshare&lt;br /&gt;
&lt;br /&gt;
Given that, give [[Running DiscoverEd]] a look!&lt;br /&gt;
&lt;br /&gt;
== Deploying new WARs ==&lt;br /&gt;
&lt;br /&gt;
To deploy a new WAR, do this:&lt;br /&gt;
&lt;br /&gt;
* cp nutch-1.1.war ~/tomcat/webapps/ROOT.war&lt;br /&gt;
&lt;br /&gt;
Then restart Tomcat.&lt;br /&gt;
&lt;br /&gt;
== Restarting Tomcat ==&lt;br /&gt;
&lt;br /&gt;
The AgShare deployment uses a Tomcat instance in its $HOME (supported by the tomcat6-instance-create script). So to restart it, try:&lt;br /&gt;
&lt;br /&gt;
* ~/tomcat/bin/shutdown.sh&lt;br /&gt;
* ~/tomcat/bin/startup.sh&lt;br /&gt;
&lt;br /&gt;
== Starting Tomcat at boot ==&lt;br /&gt;
&lt;br /&gt;
/etc/rc.local contains a call to run ~/tomcat/bin/startup.sh as the agshare user. That's kind of hackish, I realize.&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=AgShare/Tech&amp;diff=41503</id>
		<title>AgShare/Tech</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=AgShare/Tech&amp;diff=41503"/>
				<updated>2010-09-08T23:01:59Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:DiscoverEd]]&lt;br /&gt;
&lt;br /&gt;
The AgShare deployment works analogously to the CC Labs deployment of DiscoverEd. Some important things to note:&lt;br /&gt;
&lt;br /&gt;
* Username: '''agshare'''&lt;br /&gt;
* Host name: '''search.agshare.org''' (currently the same as discovered.labs.creativecommons.org)&lt;br /&gt;
&lt;br /&gt;
So, for example, to set up your environment, do:&lt;br /&gt;
&lt;br /&gt;
 $ sudo su - agshare&lt;br /&gt;
&lt;br /&gt;
Given that, give [[Running DiscoverEd]] a look!&lt;br /&gt;
&lt;br /&gt;
== Deploying new WARs ==&lt;br /&gt;
&lt;br /&gt;
To deploy a new WAR, do this:&lt;br /&gt;
&lt;br /&gt;
* cp nutch-1.1.war ~/tomcat/webapps/ROOT.war&lt;br /&gt;
* ~/tomcat/bin/shutdown.sh&lt;br /&gt;
* ~/tomcat/bin/startup.sh&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=Running_DiscoverEd&amp;diff=41502</id>
		<title>Running DiscoverEd</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=Running_DiscoverEd&amp;diff=41502"/>
				<updated>2010-09-08T22:50:11Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:DiscoverEd]]&lt;br /&gt;
&lt;br /&gt;
{{Infobox|This page contains raw documentation, some of which is only applicable to our DiscoverEd deployment.  It will be massaged into more general docs in the fullness of time.}}&lt;br /&gt;
&lt;br /&gt;
== Instructions for running a crawl ==&lt;br /&gt;
&lt;br /&gt;
Tips: &lt;br /&gt;
* For long aggregates and crawls, run in 'screen'.&lt;br /&gt;
&lt;br /&gt;
Updating the index has three phases:&lt;br /&gt;
# Aggregation (polling feeds old and new)&lt;br /&gt;
# Crawling&lt;br /&gt;
# Merging (merging the new index with the existing one)&lt;br /&gt;
&lt;br /&gt;
=== Set up environment ===&lt;br /&gt;
&lt;br /&gt;
Execute these commands to set up your environment for running the tools.  They also place you in the discovered user's account.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sudo su - discovered&lt;br /&gt;
$ cd code&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Managing Feeds ===&lt;br /&gt;
&lt;br /&gt;
The feeds script (./bin/feeds) allows you to add curators or feeds. &lt;br /&gt;
Running it without parameters will show the sub-commands.  &lt;br /&gt;
Feeds and curators are identified by URL (and yes, it's picky -- http://example.org is not the same as http://example.org/ ).&lt;br /&gt;
&lt;br /&gt;
==== Notes ====&lt;br /&gt;
&lt;br /&gt;
*For each feed packaged by an OPML feed, the curator is set by the feed title. The OPML consumer will only add the curator/feed if the feed isn't already in the system. &lt;br /&gt;
&lt;br /&gt;
*If you add a feed that already exists, you'll just overwrite the old one (since it's a triple store and the URI is the identifier). Same with a curator; they're also identified by URI.  It's more likely you'd get two curators, but so long as you're dealing with the same feed URL you won't get dupes.&lt;br /&gt;
&lt;br /&gt;
=== Aggregation ===&lt;br /&gt;
&lt;br /&gt;
Aggregation polls the feeds and adds new resources to the triple store.  It will also poll any OPML feeds and add the new feeds it finds.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;$ ./bin/feeds aggregate&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Crawl ===&lt;br /&gt;
&lt;br /&gt;
Before you crawl you need to make a seed which tells the crawler what to retrieve.&lt;br /&gt;
&lt;br /&gt;
If the directory &amp;quot;seed/&amp;quot; does not exist, create it with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir seed&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then create the seed list of URLs:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./bin/feeds seed &amp;gt; ./seed/crawl-urls.txt&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the crawl runs it will look in ./seed/ and open every file it finds there, expecting to find one URL per line (so remove files when you don't want them to be crawled).&lt;br /&gt;
&lt;br /&gt;
To run the actual crawl do:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ant -f dedbuild.xml crawl&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will read the seed files and run the crawl.  The result of this is a new index in the oenutch directory; each crawl directory has a timestamp-derived name.  For example, crawl-20090730201000 is the directory for a crawl run on July 30, 2009 at 8:10 PM.  After the crawl completes you need to merge the new index with the old one.&lt;br /&gt;
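&lt;br /&gt;
Because the timestamp sorts lexicographically, the newest crawl directory can be picked out with a plain sort. This is a sketch only: the function name ''latest_crawl_dir'' is hypothetical, and it assumes every crawl directory is named ''crawl-'' followed by exactly 14 digits, as in the example above.&lt;br /&gt;
&lt;br /&gt;
```shell
# Hypothetical helper: print the name of the newest crawl directory.
# Assumption: crawl directories are named crawl-YYYYMMDDhhmmss (14 digits),
# so a lexicographic sort is also a chronological sort.
latest_crawl_dir() {
    base="${1:-.}"
    # The anchored pattern skips merged directories such as crawl-...-merged.
    ls "$base" | grep -E '^crawl-[0-9]{14}$' | sort | tail -n 1
}
```
&lt;br /&gt;
Run from the directory that holds the crawl output, ''latest_crawl_dir'' prints nothing if no crawl directory matches.&lt;br /&gt;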
&lt;br /&gt;
The production index lives in '''$HOME/production-crawl''' ($HOME is set to /var/www/discovered.labs.creativecommons.org).&lt;br /&gt;
&lt;br /&gt;
To merge the index run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./bin/merge ./crawl-&amp;lt;timestamp&amp;gt;-merged $HOME/production-crawl ./crawl-&amp;lt;timestamp&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The target directory (the first parameter) will be created for you. The second parameter doesn't change, and the third parameter is the directory just created by the crawl.&lt;br /&gt;
&lt;br /&gt;
After the merge completes (assuming it does so successfully) you'll want to move it into the production directory.  Do something like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mkdir -p ~/archived-crawls/$(date -I)&lt;br /&gt;
$ mv ~/production-crawl ~/archived-crawls/$(date -I)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
to rename the existing index so you can go back to it if necessary.&lt;br /&gt;
&lt;br /&gt;
Then you can do&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mv ./crawl-&amp;lt;timestamp&amp;gt;-merged ~/production-crawl&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
And finally restart Tomcat (the Java app server) to make sure the new index is being used:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sudo /etc/init.d/tomcat6 restart&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
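&lt;br /&gt;
The merge-and-swap steps above can be collected into a plan that is printed for review rather than executed. This is a sketch, not an existing DiscoverEd script: both function names are invented for illustration, and the second argument is assumed to be a date in the form that ''date -I'' prints.&lt;br /&gt;
&lt;br /&gt;
```shell
# Hypothetical helper: name of the merged index for a given crawl directory.
merged_dir_for() {
    # Drop one trailing slash, then append the -merged suffix used above.
    printf '%s-merged\n' "${1%/}"
}

# Hypothetical helper: print (do not run) the merge, archive, swap and
# restart commands from this page for one crawl directory.
plan_index_swap() {
    crawl_dir="${1%/}"
    archive_date="$2"
    merged_dir="$(merged_dir_for "$crawl_dir")"
    # Single-quoted formats keep $HOME and ~ literal in the printed plan.
    printf './bin/merge %s $HOME/production-crawl %s\n' "$merged_dir" "$crawl_dir"
    printf 'mkdir -p ~/archived-crawls/%s\n' "$archive_date"
    printf 'mv ~/production-crawl ~/archived-crawls/%s\n' "$archive_date"
    printf 'mv %s ~/production-crawl\n' "$merged_dir"
    printf 'sudo /etc/init.d/tomcat6 restart\n'
}
```
&lt;br /&gt;
For example, ''plan_index_swap ./crawl-20090730201000 "$(date -I)"'' prints five commands that can be checked and then run by hand, one at a time.&lt;br /&gt;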
&lt;br /&gt;
== Managing curators and feeds ==&lt;br /&gt;
&lt;br /&gt;
On discovered.labs.creativecommons.org in the $HOME/code directory, running ./bin/feeds with no parameters shows the list of subcommands:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  listfeeds        list all feeds&lt;br /&gt;
  listcurators     list all curators&lt;br /&gt;
  addfeed          add a feed&lt;br /&gt;
  resetfeed        reset the last aggregation date for a feed&lt;br /&gt;
  addcurator       add a curator&lt;br /&gt;
  rmfeed           remove a feed&lt;br /&gt;
  setcurator       set the curator for a feed&lt;br /&gt;
  aggregate&lt;br /&gt;
  dump&lt;br /&gt;
  seed&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Each one is run as an argument to the feeds script (i.e. ./bin/feeds [command] [parameter1] [parameter2]...)&lt;br /&gt;
&lt;br /&gt;
=== addfeed ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
addfeed [feed_type] [feed_url] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Assuming you've added the curator for this feed with addcurator, the curator URL will set the curator (so you don't need to set it again with setcurator).&lt;br /&gt;
&lt;br /&gt;
Feed type notes: &amp;quot;rss&amp;quot; is a parser that does RSS/Atom sniffing.&lt;br /&gt;
&lt;br /&gt;
=== addcurator ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
addcurator [curator_name] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Curator names with spaces should be surrounded by quotation marks (e.g. addcurator &amp;quot;CC Open Textbook Project&amp;quot; http://www.collegeopentextbooks.org/)&lt;br /&gt;
&lt;br /&gt;
=== setcurator ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
setcurator [feed_url] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Deploying new WARs ==&lt;br /&gt;
&lt;br /&gt;
To deploy a new WAR, do this:&lt;br /&gt;
&lt;br /&gt;
* sudo cp nutch-1.1.war /var/lib/tomcat6/webapps/search.war&lt;br /&gt;
* sudo /etc/init.d/tomcat6 restart&lt;br /&gt;
&lt;br /&gt;
== Things the server administrator should know ==&lt;br /&gt;
&lt;br /&gt;
=== JAVA_HOME ===&lt;br /&gt;
&lt;br /&gt;
Many of our scripts require a JAVA_HOME environment variable to be set. For our convenience, we configured ''discovered.labs.creativecommons.org'' to have JAVA_HOME set for every user. We did that by adding this to '''/etc/profile''':&lt;br /&gt;
&lt;br /&gt;
 JAVA_HOME=/usr/lib/jvm/java-6-openjdk/ ; export JAVA_HOME&lt;br /&gt;
&lt;br /&gt;
=== Maximum open files ===&lt;br /&gt;
&lt;br /&gt;
Tomcat and Nutch sometimes have problems opening files. This is because they've exceeded the number of open files that a process can have.&lt;br /&gt;
&lt;br /&gt;
To address this, we added this to '''/etc/security/limits.conf''':&lt;br /&gt;
&lt;br /&gt;
 ### For Tomcat etc.&lt;br /&gt;
 *               soft    nofile          4096&lt;br /&gt;
 *               hard    nofile          4096&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=AgShare/Tech&amp;diff=41501</id>
		<title>AgShare/Tech</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=AgShare/Tech&amp;diff=41501"/>
				<updated>2010-09-08T22:49:05Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: Created page with &amp;quot;Category:DiscoverEd  The AgShare deployment works analogously to the CC Labs deployment of DiscoverEd. Some important things to note:  * Username: '''agshare''' * Host name: ...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:DiscoverEd]]&lt;br /&gt;
&lt;br /&gt;
The AgShare deployment works analogously to the CC Labs deployment of DiscoverEd. Some important things to note:&lt;br /&gt;
&lt;br /&gt;
* Username: '''agshare'''&lt;br /&gt;
* Host name: '''search.agshare.org''' (currently the same as discovered.labs.creativecommons.org)&lt;br /&gt;
&lt;br /&gt;
So, for example, to set up your environment, do:&lt;br /&gt;
&lt;br /&gt;
 $ sudo su - agshare&lt;br /&gt;
&lt;br /&gt;
Given that, give [[Running DiscoverEd]] a look!&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=AgShare&amp;diff=41500</id>
		<title>AgShare</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=AgShare&amp;diff=41500"/>
				<updated>2010-09-08T22:36:12Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:DiscoverEd]]&lt;br /&gt;
&lt;br /&gt;
Creative Commons is a partner on the [http://www.oerafrica.org/agshare AgShare] project.  Supported by the Bill and Melinda Gates Foundation, the goal of this planning and pilot project is to create a scalable and sustainable collaboration of existing organizations for African publishing, localizing, and sharing of teaching and learning materials that fill critical resource gaps in African MSc agriculture curriculum and that can be modified for other downstream uses.&lt;br /&gt;
&lt;br /&gt;
== Technical notes ==&lt;br /&gt;
&lt;br /&gt;
* [[AgShare/Tech]]&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=PuSH_Feed_Type&amp;diff=41338</id>
		<title>PuSH Feed Type</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=PuSH_Feed_Type&amp;diff=41338"/>
				<updated>2010-09-07T15:49:14Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{DiscoverEd Specification&lt;br /&gt;
|contact=Asheesh Laroia&lt;br /&gt;
|project=AgShare&lt;br /&gt;
|status=Draft&lt;br /&gt;
}}&lt;br /&gt;
The people who run a DiscoverEd instance may wish to be updated nearly immediately when a curator publishes new resources.&lt;br /&gt;
&lt;br /&gt;
Right now, DiscoverEd instances aggregate feeds and crawl every once in a while, often manually at the behest of the search engine operator. PubSubHubbub provides a way for the DiscoverEd instance to subscribe to feeds and receive automatic, nearly instantaneous notification of new information in the feed.&lt;br /&gt;
&lt;br /&gt;
This can be built on top of existing Atom/RSS feeds that curators already publish.&lt;br /&gt;
&lt;br /&gt;
This feature was defined and developed during the fun DC meeting thing. (Nathan, did that meeting have a name?)&lt;br /&gt;
&lt;br /&gt;
== Requirements ==&lt;br /&gt;
&lt;br /&gt;
A complete implementation of this specification would provide the following things.&lt;br /&gt;
&lt;br /&gt;
* DiscoverEd can discover a PuSH ''hub'' mentioned in a feed.&lt;br /&gt;
* DiscoverEd can register itself as a ''subscriber'' to that feed on that hub. (To do that, it has to provide a URL on the DiscoverEd instance that, when the feed is updated, the hub should POST to.)&lt;br /&gt;
* When the hub pings DiscoverEd to say there is an update to that feed, it re-aggregates data from that feed, does a crawl, and merges the index.&lt;br /&gt;
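&lt;br /&gt;
To make the subscription step concrete, here is a sketch of the request a subscriber sends to a hub. Nothing here is implemented in DiscoverEd: the function name is hypothetical, the three URLs are placeholders, and the ''hub.*'' parameter names are taken from the PubSubHubbub 0.3 draft. The function only prints a curl command; it does not contact any hub.&lt;br /&gt;
&lt;br /&gt;
```shell
# Hypothetical helper: print (do not run) a curl command that asks a hub
# to subscribe a callback URL to a feed.
# Arguments: hub URL, feed (topic) URL, callback URL on the DiscoverEd instance.
# Assumption: the hub follows the PubSubHubbub 0.3 draft, which expects
# hub.mode, hub.topic, hub.callback and hub.verify as form fields.
push_subscribe_command() {
    hub_url="$1"
    topic_url="$2"
    callback_url="$3"
    # curl joins repeated -d and --data-urlencode options into one form body.
    printf 'curl -d hub.mode=subscribe -d hub.verify=async --data-urlencode hub.topic=%s --data-urlencode hub.callback=%s %s\n' \
        "$topic_url" "$callback_url" "$hub_url"
}
```
&lt;br /&gt;
The callback URL is the address the hub would later POST to when the feed changes, which is the URL the second requirement above says DiscoverEd has to provide.&lt;br /&gt;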
&lt;br /&gt;
== Status ==&lt;br /&gt;
&lt;br /&gt;
* This draft document has been written. That's all.&lt;br /&gt;
* NSDL is interested in trying this with us.&lt;br /&gt;
&lt;br /&gt;
== Questions ==&lt;br /&gt;
&lt;br /&gt;
* Can we make things as simple as this:&lt;br /&gt;
** OER Africa adds &amp;lt;link rel=&amp;quot;hub&amp;quot;...&amp;gt;&lt;br /&gt;
** They do nothing else.&lt;br /&gt;
** The chosen hub polls the feed, and when there are updates, pings us.&lt;br /&gt;
** Then we get real-time updates with basically no effort from OER Africa.&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=DiscoverEd/Install_manually&amp;diff=41331</id>
		<title>DiscoverEd/Install manually</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=DiscoverEd/Install_manually&amp;diff=41331"/>
				<updated>2010-09-07T14:38:47Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: /* Switching to MySQL */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:DiscoverEd]]&lt;br /&gt;
&lt;br /&gt;
{{Infobox|&lt;br /&gt;
[[DiscoverEd]] is based on [http://nutch.apache.org/ Nutch].  As such, you may wish to consult the [http://wiki.apache.org/nutch/ Nutch Wiki] for general deployment questions.}}&lt;br /&gt;
&lt;br /&gt;
{{Stub}}&lt;br /&gt;
&lt;br /&gt;
=== Check out and build the source code ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ git clone git://gitorious.org/discovered/repo.git discovered&lt;br /&gt;
$ cd discovered&lt;br /&gt;
$ ant&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Add a curator and a feed ===&lt;br /&gt;
&lt;br /&gt;
DiscoverEd uses feeds to help identify resources to crawl.  Feeds are provided by curators, who can also provide metadata about resources.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./bin/feeds addcurator &amp;quot;ND OCW&amp;quot; http://ocw.nd.edu/ &lt;br /&gt;
$ ./bin/feeds addfeed rss http://ocw.nd.edu/front-page/courselist/rss http://ocw.nd.edu/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Aggregate and crawl resources ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./bin/feeds aggregate&lt;br /&gt;
$ mkdir seed&lt;br /&gt;
$ ./bin/feeds seed &amp;gt; seed/urls.txt&lt;br /&gt;
$ ant -f dedbuild.xml crawl&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Run the web application ===&lt;br /&gt;
&lt;br /&gt;
Edit conf/nutch-site.xml to point to your crawl location.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ant war&lt;br /&gt;
$ [copy the war file to your J2EE container]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Switching to MySQL ===&lt;br /&gt;
&lt;br /&gt;
By default, DiscoverEd (at least on the ''next'' branch) uses an on-disk database called Derby for storing resource metadata. You should use a different database, like MySQL, in production.&lt;br /&gt;
&lt;br /&gt;
To do that, edit '''conf/discovered.xml''' and update the following sections as appropriate:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;property&amp;gt;&lt;br /&gt;
  &amp;lt;name&amp;gt;rdfstore.db.driver&amp;lt;/name&amp;gt;&lt;br /&gt;
  &amp;lt;value&amp;gt;com.mysql.jdbc.Driver&amp;lt;/value&amp;gt;&lt;br /&gt;
&amp;lt;/property&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;property&amp;gt;&lt;br /&gt;
  &amp;lt;name&amp;gt;rdfstore.db.url&amp;lt;/name&amp;gt;&lt;br /&gt;
  &amp;lt;value&amp;gt;jdbc:mysql://localhost/discovered?autoReconnect=true&amp;lt;/value&amp;gt;&lt;br /&gt;
&amp;lt;/property&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;property&amp;gt;&lt;br /&gt;
  &amp;lt;name&amp;gt;rdfstore.db.user&amp;lt;/name&amp;gt;&lt;br /&gt;
  &amp;lt;value&amp;gt;discovered&amp;lt;/value&amp;gt;&lt;br /&gt;
&amp;lt;/property&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;property&amp;gt;&lt;br /&gt;
  &amp;lt;name&amp;gt;rdfstore.db.password&amp;lt;/name&amp;gt;&lt;br /&gt;
  &amp;lt;value&amp;gt;&amp;lt;/value&amp;gt;&lt;br /&gt;
&amp;lt;/property&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
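&lt;br /&gt;
After editing, a quick check can confirm which driver the file names. This is a sketch with a made-up function name; it only looks for the driver class string shown above and does not test the database connection.&lt;br /&gt;
&lt;br /&gt;
```shell
# Hypothetical helper: report whether a DiscoverEd config file names the
# MySQL JDBC driver class. Defaults to conf/discovered.xml.
uses_mysql_driver() {
    config_file="${1:-conf/discovered.xml}"
    # -F matches the class name literally; -q suppresses the matching line.
    if grep -qF 'com.mysql.jdbc.Driver' "$config_file"; then
        echo "mysql"
    else
        echo "not mysql"
    fi
}
```
&lt;br /&gt;
Run ''uses_mysql_driver'' from the top of the checkout; it prints ''mysql'' only when the driver class appears in the file.&lt;br /&gt;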
&lt;br /&gt;
== Known issues ==&lt;br /&gt;
&lt;br /&gt;
=== Derby and OAI-PMH aren't compatible ===&lt;br /&gt;
&lt;br /&gt;
If you use the default backend, OAI-PMH crawls won't work. Instead, you'll get SQL syntax errors from the code. We haven't fully diagnosed the problem; if you hit errors like that, we suggest you switch to MySQL as per the &amp;quot;Switching to MySQL&amp;quot; section.&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=PuSH_Feed_Type&amp;diff=41330</id>
		<title>PuSH Feed Type</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=PuSH_Feed_Type&amp;diff=41330"/>
				<updated>2010-09-07T14:04:00Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{DiscoverEd Specification&lt;br /&gt;
|contact=Asheesh Laroia&lt;br /&gt;
|project=AgShare&lt;br /&gt;
|status=Draft&lt;br /&gt;
}}&lt;br /&gt;
The people who run a DiscoverEd instance may wish to be updated nearly immediately when a curator publishes new resources.&lt;br /&gt;
&lt;br /&gt;
Right now, DiscoverEd instances aggregate feeds and crawl every once in a while, often manually at the behest of the search engine operator. PubSubHubbub provides a way for the DiscoverEd instance to subscribe to feeds and receive automatic, nearly instantaneous notification of new information in the feed.&lt;br /&gt;
&lt;br /&gt;
This can be built on top of existing Atom/RSS feeds that curators already publish.&lt;br /&gt;
&lt;br /&gt;
This feature was defined and developed during the fun DC meeting thing. (Nathan, did that meeting have a name?)&lt;br /&gt;
&lt;br /&gt;
== Requirements ==&lt;br /&gt;
&lt;br /&gt;
A complete implementation of this specification would provide the following things.&lt;br /&gt;
&lt;br /&gt;
* DiscoverEd can discover a PuSH ''hub'' mentioned in a feed.&lt;br /&gt;
* DiscoverEd can register itself as a ''subscriber'' to that feed on that hub. (To do that, it has to provide a URL on the DiscoverEd instance that, when the feed is updated, the hub should POST to.)&lt;br /&gt;
* When the hub pings DiscoverEd to say there is an update to that feed, it re-aggregates data from that feed, does a crawl, and merges the index.&lt;br /&gt;
&lt;br /&gt;
== Status ==&lt;br /&gt;
&lt;br /&gt;
* This draft document has been written. That's all.&lt;br /&gt;
* NSDL is interested in trying this with us.&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=PuSH_Feed_Type&amp;diff=41329</id>
		<title>PuSH Feed Type</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=PuSH_Feed_Type&amp;diff=41329"/>
				<updated>2010-09-07T14:01:43Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: /* Requirements */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{DiscoverEd Specification&lt;br /&gt;
|contact=Asheesh Laroia&lt;br /&gt;
|project=AgShare&lt;br /&gt;
|status=Draft&lt;br /&gt;
}}&lt;br /&gt;
The people who run a DiscoverEd instance may wish to be updated nearly immediately when a curator publishes new resources.&lt;br /&gt;
&lt;br /&gt;
Right now, DiscoverEd instances aggregate feeds and crawl every once in a while, often manually at the behest of the search engine operator. PubSubHubbub provides a way for the DiscoverEd instance to subscribe to feeds and receive automatic, nearly instantaneous notification of new information in the feed.&lt;br /&gt;
&lt;br /&gt;
This can be built on top of existing Atom/RSS feeds that curators already publish.&lt;br /&gt;
&lt;br /&gt;
This feature was defined and developed during the fun DC meeting thing. (Nathan, did that meeting have a name?)&lt;br /&gt;
&lt;br /&gt;
== Requirements ==&lt;br /&gt;
&lt;br /&gt;
A complete implementation of this specification would provide the following things.&lt;br /&gt;
&lt;br /&gt;
* DiscoverEd can discover a PuSH ''hub'' mentioned in a feed.&lt;br /&gt;
* DiscoverEd can register itself as a ''subscriber'' to that feed on that hub. (To do that, it has to provide a URL on the DiscoverEd instance that, when the feed is updated, the hub should POST to.)&lt;br /&gt;
* When the hub pings DiscoverEd to say there is an update to that feed, it re-aggregates data from that feed, does a crawl, and merges the index.&lt;br /&gt;
&lt;br /&gt;
== Status ==&lt;br /&gt;
&lt;br /&gt;
* This draft document has been written. That's all.&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=PuSH_Feed_Type&amp;diff=41328</id>
		<title>PuSH Feed Type</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=PuSH_Feed_Type&amp;diff=41328"/>
				<updated>2010-09-07T14:01:07Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{DiscoverEd Specification&lt;br /&gt;
|contact=Asheesh Laroia&lt;br /&gt;
|project=AgShare&lt;br /&gt;
|status=Draft&lt;br /&gt;
}}&lt;br /&gt;
The people who run a DiscoverEd instance may wish to be updated nearly immediately when a curator publishes new resources.&lt;br /&gt;
&lt;br /&gt;
Right now, DiscoverEd instances aggregate feeds and crawl every once in a while, often manually at the behest of the search engine operator. PubSubHubbub provides a way for the DiscoverEd instance to subscribe to feeds and receive automatic, nearly instantaneous notification of new information in the feed.&lt;br /&gt;
&lt;br /&gt;
This can be built on top of existing Atom/RSS feeds that curators already publish.&lt;br /&gt;
&lt;br /&gt;
This feature was defined and developed during the fun DC meeting thing. (Nathan, did that meeting have a name?)&lt;br /&gt;
&lt;br /&gt;
== Requirements ==&lt;br /&gt;
&lt;br /&gt;
A complete implementation of this specification would provide the following things.&lt;br /&gt;
&lt;br /&gt;
* DiscoverEd can discover a PuSH ''hub'' mentioned in a feed.&lt;br /&gt;
* DiscoverEd can register itself as a ''subscriber'' to that feed on that hub.&lt;br /&gt;
* When the hub pings DiscoverEd with an update to that feed, it re-aggregates data from that feed, does a crawl, and merges the index.&lt;br /&gt;
&lt;br /&gt;
== Status ==&lt;br /&gt;
&lt;br /&gt;
* This draft document has been written. That's all.&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=PuSH_Feed_Type&amp;diff=41291</id>
		<title>PuSH Feed Type</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=PuSH_Feed_Type&amp;diff=41291"/>
				<updated>2010-09-05T20:25:54Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: Created page with &amp;quot;{{DiscoverEd Specification |contact=Asheesh Laroia |project=AgShare |status=Draft }} The people who run a DiscoverEd instance may wish to be updated nearly-immediately when there...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{DiscoverEd Specification&lt;br /&gt;
|contact=Asheesh Laroia&lt;br /&gt;
|project=AgShare&lt;br /&gt;
|status=Draft&lt;br /&gt;
}}&lt;br /&gt;
The people who run a DiscoverEd instance may wish to be updated nearly immediately when a curator publishes new resources.&lt;br /&gt;
&lt;br /&gt;
Right now, DiscoverEd instances aggregate feeds and crawl every once in a while, often manually at the behest of the search engine operator. PubSubHubbub provides a way for the DiscoverEd instance to subscribe to feeds and receive automatic, nearly instantaneous notification of new information in the feed.&lt;br /&gt;
&lt;br /&gt;
This can be built on top of existing Atom/RSS feeds that curators already publish.&lt;br /&gt;
&lt;br /&gt;
This feature was defined and developed during the fun DC meeting thing. (Nathan, did that meeting have a name?)&lt;br /&gt;
&lt;br /&gt;
== Requirements ==&lt;br /&gt;
&lt;br /&gt;
To be considered working, this specification would provide the following things.&lt;br /&gt;
&lt;br /&gt;
* Instructions for any Atom/RSS feed to participate in the PuSH network.&lt;br /&gt;
* DiscoverEd can accept POST requests from a PuSH hub. These POST requests are how the hub notifies the Nutch-based DiscoverEd instance of feed updates.&lt;br /&gt;
* DiscoverEd can subscribe to updates from a PuSH hub (whose address can be configured in the configuration file).&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=Metadata_Provenance&amp;diff=40754</id>
		<title>Metadata Provenance</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=Metadata_Provenance&amp;diff=40754"/>
				<updated>2010-08-30T20:07:11Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: /* Requirements */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{DiscoverEd Specification&lt;br /&gt;
|contact=Nathan Yergler&lt;br /&gt;
|project=AgShare&lt;br /&gt;
|status=In Development&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
The initial version of DiscoverEd does not include provenance support.  Provenance means tracking the source of resource metadata.  Due to this limitation, DiscoverEd has limited ability to filter by curator.  While you can filter for resources with a specific curator, the remaining search terms are not limited to metadata provided by that curator.  This is a significant shortcoming for resources with multiple curators.&lt;br /&gt;
&lt;br /&gt;
== Requirements ==&lt;br /&gt;
&lt;br /&gt;
* The provenance of metadata discovered through RSS, Atom, and OAI-PMH is stored in the RDF Store.&lt;br /&gt;
* Metadata extracted from structured data is stored with provenance reflecting the page it was extracted from.&lt;br /&gt;
* Users can filter a query to exclude a curator, and metadata provided by that curator is not considered for other query terms.  For example, &amp;quot;&amp;lt;code&amp;gt;excludecurator:http://example.org subject:biology cells&amp;lt;/code&amp;gt;&amp;quot; would return results containing the term &amp;quot;cells&amp;quot;, with the subject tag &amp;quot;biology&amp;quot; provided by a curator &amp;lt;strong&amp;gt;other than&amp;lt;/strong&amp;gt; http://example.org.&lt;br /&gt;
&lt;br /&gt;
== Status ==&lt;br /&gt;
&lt;br /&gt;
Provenance support was initially added with table prefixes, and later refactored to use [http://www4.wiwiss.fu-berlin.de/bizer/ng4j/ Named Graphs for Jena].  Provenance support has landed in &amp;lt;tt&amp;gt;next&amp;lt;/tt&amp;gt;, and is running on [http://discovered.labs.creativecommons.org Labs].&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=Field_Query_Mapping&amp;diff=40751</id>
		<title>Field Query Mapping</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=Field_Query_Mapping&amp;diff=40751"/>
				<updated>2010-08-30T19:24:50Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{DiscoverEd Specification&lt;br /&gt;
|contact=Asheesh Laroia&lt;br /&gt;
|project=AgShare&lt;br /&gt;
|status=In Development&lt;br /&gt;
}}&lt;br /&gt;
The people who run a DiscoverEd instance may wish to let users search specific metadata easily. For example, http://discovered.labs.creativecommons.org/ lets users search for works &amp;quot;tagged&amp;quot; with &amp;quot;banana&amp;quot; by searching for tag:banana. (In particular, the predicate for &amp;quot;tag&amp;quot; is the term &amp;quot;subject&amp;quot; as specified by the Dublin Core.)&lt;br /&gt;
&lt;br /&gt;
These prefixes, like tag:, are stored in the DiscoverEd code right now. This feature aims to move those into a configuration file.&lt;br /&gt;
&lt;br /&gt;
This feature was defined and developed during the [[DiscoverEd Sprint (June, 2010)|June 2010 DiscoverEd Sprint]].&lt;br /&gt;
&lt;br /&gt;
== Requirements ==&lt;br /&gt;
&lt;br /&gt;
When DiscoverEd crawls feeds and resources and saves metadata such as the page title, it converts this information into RDF triples; those triples are eventually saved on disk in a triple store, namely Jena.&lt;br /&gt;
&lt;br /&gt;
We will create a new configuration file that stores a list of mappings from predicate URIs to search prefixes. For example, we might list &amp;quot;method:&amp;quot; as a shorthand for the RDF predicate &amp;lt;http://purl.org/dc/terms/instructionalMethod&amp;gt;, a.k.a. &amp;quot;dct:instructionalMethod&amp;quot;. At indexing time, a Lucene column called &amp;quot;method&amp;quot; will be created in the Lucene documents corresponding to each resource that has the dct:instructionalMethod predicate set in the Jena store.&lt;br /&gt;
&lt;br /&gt;
Then, at search time, Nutch's built-in query parser handles the query, e.g., &amp;quot;method:yaddayadda&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
== How to use ==&lt;br /&gt;
&lt;br /&gt;
'''Note''': There is an implementation of this in the current version of DiscoverEd (as of 2010-08-30), but it ignores the ''excludecurator'' argument.&lt;br /&gt;
&lt;br /&gt;
Let's say you want to allow users to perform this query:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&amp;lt;pre&amp;gt;&lt;br /&gt;
 method:&amp;quot;Experiential learning&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and retrieve all web pages in your index that have a metadatum with predicate &amp;lt;http://purl.org/dc/terms/instructionalMethod&amp;gt; and value &amp;quot;Experiential learning&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
To do so, first edit &amp;lt;code&amp;gt;conf/nutch-site.xml&amp;lt;/code&amp;gt;. Add this XML inside the &amp;lt;configuration&amp;gt; block.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&amp;lt;pre&amp;gt;&lt;br /&gt;
 &amp;lt;property&amp;gt;&lt;br /&gt;
     &amp;lt;name&amp;gt;query.basic.method.boost&amp;lt;/name&amp;gt;&lt;br /&gt;
     &amp;lt;value&amp;gt;1.0&amp;lt;/value&amp;gt;&lt;br /&gt;
 &amp;lt;/property&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This block of XML tells Nutch to accept the &amp;quot;method:&amp;quot; prefix in search queries. The value of this property indicates the weight the search engine should assign to this term.&lt;br /&gt;
&lt;br /&gt;
Next, edit &amp;lt;code&amp;gt;conf/discovered-search-prefixes.xml&amp;lt;/code&amp;gt;. Add this XML inside the &amp;lt;configuration&amp;gt; block.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&amp;lt;pre&amp;gt;&lt;br /&gt;
 &amp;lt;property&amp;gt;&lt;br /&gt;
     &amp;lt;name&amp;gt;http://purl.org/dc/terms/instructionalMethod&amp;lt;/name&amp;gt;&lt;br /&gt;
     &amp;lt;value&amp;gt;method&amp;lt;/value&amp;gt;&lt;br /&gt;
 &amp;lt;/property&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This block of XML tells DiscoverEd to copy data out of the Jena store and paste it into a format where Nutch's basic query parser can find it.&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
&lt;br /&gt;
* Added a sample configuration file&lt;br /&gt;
* Added code to our IndexFilter that looks for relevant triples and stores them in the Lucene document&lt;br /&gt;
* We had a problem with these columns not appearing in Lucene, but we fixed the underlying bug that caused that.&lt;br /&gt;
&lt;br /&gt;
== Next steps ==&lt;br /&gt;
&lt;br /&gt;
* Rewriting this to be compatible with ''excludecurator''.&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=Running_DiscoverEd&amp;diff=40749</id>
		<title>Running DiscoverEd</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=Running_DiscoverEd&amp;diff=40749"/>
				<updated>2010-08-30T19:17:15Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:DiscoverEd]]&lt;br /&gt;
&lt;br /&gt;
{{Infobox|This page contains raw documentation, some of which is only applicable to our DiscoverEd deployment.  It will be massaged into more general docs in the fullness of time.}}&lt;br /&gt;
&lt;br /&gt;
== Instructions for running a crawl ==&lt;br /&gt;
&lt;br /&gt;
Tips: &lt;br /&gt;
* For long aggregates and crawls, run in 'screen'.&lt;br /&gt;
&lt;br /&gt;
Updating the index has three phases:&lt;br /&gt;
# Aggregation (polling feeds old and new)&lt;br /&gt;
# Crawling&lt;br /&gt;
# Merging (combining the new index with the existing one)&lt;br /&gt;
&lt;br /&gt;
=== Set up environment ===&lt;br /&gt;
&lt;br /&gt;
These commands set up your environment for running the tools and place you in the discovered user's account.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sudo su - discovered&lt;br /&gt;
$ cd code&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Managing Feeds ===&lt;br /&gt;
&lt;br /&gt;
The feeds script (./bin/feeds) allows you to add curators or feeds. &lt;br /&gt;
Running it without parameters will show the sub-commands.  &lt;br /&gt;
Feeds and curators are identified by URL (and yes, it's picky -- http://example.org is not the same as http://example.org/ ).&lt;br /&gt;
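&lt;br /&gt;
As a minimal sketch of why the trailing slash matters, assume identifiers are compared as exact strings. The helper name same_identifier is hypothetical and is not part of ./bin/feeds:&lt;br /&gt;
&lt;br /&gt;
```shell
# Hypothetical helper: succeeds only when two identifiers match exactly.
# Assumption: feed and curator URLs are compared character for character.
same_identifier() {
    [ "$1" = "$2" ]
}

if same_identifier "http://example.org" "http://example.org/"; then
    echo "same identifier"
else
    echo "different identifiers"
fi
```
&lt;br /&gt;
Under that assumption the two URLs are treated as different identifiers, so it is worth copying a feed or curator URL exactly as it was first entered.&lt;br /&gt;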
&lt;br /&gt;
==== Notes ====&lt;br /&gt;
&lt;br /&gt;
*For each feed packaged by an OPML feed, the curator is set by the feed title. The OPML consumer will only add the curator/feed if the feed isn't already in the system. &lt;br /&gt;
&lt;br /&gt;
*If you add a feed that already exists, you'll just overwrite the old one (since it's a triple store and the URI is the identifier). The same goes for curators; they're also identified by URI.  It's more likely you'd get two curators, but so long as you're dealing with the same feed URL you won't get duplicates.&lt;br /&gt;
&lt;br /&gt;
=== Aggregation ===&lt;br /&gt;
&lt;br /&gt;
Aggregation polls the feeds and adds new resources to the triple store.  It will also poll any OPML feeds and add the new feeds it finds.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;$ ./bin/feeds aggregate&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Crawl ===&lt;br /&gt;
&lt;br /&gt;
Before you crawl you need to make a seed which tells the crawler what to retrieve.&lt;br /&gt;
&lt;br /&gt;
If the directory &amp;quot;seed/&amp;quot; does not exist, create it with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir seed&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then create the seed list of URLs:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./bin/feeds seed &amp;gt; ./seed/crawl-urls.txt&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the crawl runs it will look in ./seed/ and open every file it finds there, expecting to find one URL per line (so remove files when you don't want them to be crawled).&lt;br /&gt;
&lt;br /&gt;
To run the actual crawl do:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ant -f dedbuild.xml crawl&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will read the seed files and run the crawl.  The result is a new index in the oenutch directory, in a subdirectory named after the crawl's timestamp: for example, crawl-20090730201000 for a crawl run on July 30, 2009 at 8:10 PM.  After the crawl completes you need to merge the new index with the old one.&lt;br /&gt;
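&lt;br /&gt;
The naming scheme can be sketched as follows. The format (year, month, day, hour, minute, second) is inferred from the single example above, and the function name crawl_dir_name is hypothetical:&lt;br /&gt;
&lt;br /&gt;
```shell
# Hypothetical helper: build a crawl directory name from timestamp parts.
# Arguments: year month day hour minute second, each already zero-padded.
crawl_dir_name() {
    printf 'crawl-%s%s%s%s%s%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# A crawl run on July 30, 2009 at 8:10 PM:
crawl_dir_name 2009 07 30 20 10 00

# The same pattern for the current time, using the standard date command:
date '+crawl-%Y%m%d%H%M%S'
```
&lt;br /&gt;
Because the name sorts chronologically, listing the crawl directories in name order also lists them from oldest to newest.&lt;br /&gt;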
&lt;br /&gt;
The production index lives in '''/var/www/discovered.labs.creativecommons.org/production-crawl'''.&lt;br /&gt;
&lt;br /&gt;
To merge the index run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./bin/merge ./crawl-&amp;lt;timestamp&amp;gt;-merged /var/www/discovered.labs.creativecommons.org/production-crawl ./crawl-&amp;lt;timestamp&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The target directory (the first parameter) will be created for you. The second parameter doesn't change, and the third parameter is the directory just created by the crawl.&lt;br /&gt;
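&lt;br /&gt;
A small sketch of how the three parameters relate, assuming the crawl directory from the earlier example. The function merge_command is hypothetical; it only prints the command so that it can be checked before running it by hand:&lt;br /&gt;
&lt;br /&gt;
```shell
# Assumption: this is the directory the crawl just created.
crawl_dir="./crawl-20090730201000"
# The production index location given above.
production="/var/www/discovered.labs.creativecommons.org/production-crawl"

# Hypothetical helper: print the merge command for a crawl directory.
# Parameter order: target directory, production index, new crawl directory.
merge_command() {
    printf './bin/merge %s-merged %s %s\n' "$1" "$2" "$1"
}

merge_command "$crawl_dir" "$production"
```
&lt;br /&gt;
Deriving the target name from the crawl directory keeps the first and third parameters in step, which avoids merging one crawl into a directory named after another.&lt;br /&gt;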
&lt;br /&gt;
After the merge completes (assuming it does so successfully) you'll want to move it into the production directory.  Do something like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mkdir -p ~/archived-crawls/$(date -I)&lt;br /&gt;
$ mv ~/production-crawl ~/archived-crawls/$(date -I)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
to rename the existing index so you can go back to it if necessary.&lt;br /&gt;
&lt;br /&gt;
Then you can do&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mv ./crawl-new-dir-merged ~/production-crawl&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
And finally restart Tomcat (the Java app server) to make sure the new index is being used:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sudo /etc/init.d/tomcat6 restart&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Managing curators and feeds ==&lt;br /&gt;
&lt;br /&gt;
On a6, in the /var/www/discovered.creativecommons.org/oenutch directory, running ./bin/feeds with no parameters shows the list of subcommands:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  listfeeds        list all feeds&lt;br /&gt;
  listcurators     list all curators&lt;br /&gt;
  addfeed          add a feed&lt;br /&gt;
  resetfeed        reset the last aggregation date for a feed&lt;br /&gt;
  addcurator       add a curator&lt;br /&gt;
  rmfeed           remove a feed&lt;br /&gt;
  setcurator       set the curator for a feed&lt;br /&gt;
  aggregate&lt;br /&gt;
  dump&lt;br /&gt;
  seed&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Each one is run as an argument to the feeds script (i.e. ./bin/feeds [command] [parameter1] [parameter2]...)&lt;br /&gt;
&lt;br /&gt;
=== addfeed ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
addfeed [feed_type] [feed_url] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Assuming you've added the curator for this feed with addcurator, the curator URL will set the curator (so you don't need to set it again with setcurator).&lt;br /&gt;
&lt;br /&gt;
Feed type notes: &amp;quot;rss&amp;quot; is a parser that does RSS/Atom sniffing.&lt;br /&gt;
&lt;br /&gt;
=== addcurator ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
addcurator [curator_name] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Curator names with spaces should be surrounded by quotation marks (e.g. addcurator &amp;quot;CC Open Textbook Project&amp;quot; http://www.collegeopentextbooks.org/)&lt;br /&gt;
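&lt;br /&gt;
The reason for the quotation marks is ordinary shell word splitting. A minimal sketch, using a hypothetical count_args helper in place of the feeds script:&lt;br /&gt;
&lt;br /&gt;
```shell
# Hypothetical helper: report how many arguments the shell passed along.
count_args() {
    echo "$#"
}

# Quoted: the name stays one argument, followed by the URL (prints 2).
count_args "CC Open Textbook Project" http://www.collegeopentextbooks.org/

# Unquoted: the name is split into four words, plus the URL (prints 5).
count_args CC Open Textbook Project http://www.collegeopentextbooks.org/
```
&lt;br /&gt;
Without the quotes, addcurator would receive five arguments instead of the two it expects.&lt;br /&gt;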
&lt;br /&gt;
=== setcurator ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
setcurator [feed_url] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Deploying new WARs ==&lt;br /&gt;
&lt;br /&gt;
To deploy a new war, do this:&lt;br /&gt;
&lt;br /&gt;
* sudo cp nutch-1.1.war /var/lib/tomcat6/webapps/search.war&lt;br /&gt;
* sudo /etc/init.d/tomcat6 restart&lt;br /&gt;
&lt;br /&gt;
== Things the server administrator should know ==&lt;br /&gt;
&lt;br /&gt;
=== JAVA_HOME ===&lt;br /&gt;
&lt;br /&gt;
Many of our scripts require a JAVA_HOME environment variable to be set. For our convenience, we configured ''discovered.labs.creativecommons.org'' to have JAVA_HOME set for every user. We did that by adding this to '''/etc/profile''':&lt;br /&gt;
&lt;br /&gt;
 JAVA_HOME=/usr/lib/jvm/java-6-openjdk/ ; export JAVA_HOME&lt;br /&gt;
&lt;br /&gt;
=== Maximum open files ===&lt;br /&gt;
&lt;br /&gt;
Tomcat and Nutch sometimes have problems opening files. This is because they've exceeded the number of open files that a process can have.&lt;br /&gt;
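&lt;br /&gt;
To see the limit that applies to the current shell and the processes it starts, the ulimit builtin can be used. This is a generic check, not something specific to our deployment:&lt;br /&gt;
&lt;br /&gt;
```shell
# Print the current soft limit on open files for this shell session.
open_file_limit() {
    ulimit -n
}

open_file_limit
```
&lt;br /&gt;
The printed value depends on the system and on the user's session, so run it as the user that starts Tomcat or Nutch.&lt;br /&gt;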
&lt;br /&gt;
To address this, we added this to '''/etc/security/limits.conf''':&lt;br /&gt;
&lt;br /&gt;
 ### For Tomcat etc.&lt;br /&gt;
 *               soft    nofile          4096&lt;br /&gt;
 *               hard    nofile          4096&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=Field_Query_Mapping&amp;diff=40402</id>
		<title>Field Query Mapping</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=Field_Query_Mapping&amp;diff=40402"/>
				<updated>2010-08-24T18:18:21Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{DiscoverEd Specification&lt;br /&gt;
|contact=Asheesh Laroia&lt;br /&gt;
|project=AgShare&lt;br /&gt;
|status=In Development&lt;br /&gt;
}}&lt;br /&gt;
The people who run a DiscoverEd instance may wish to let users search specific metadata easily. For example, http://discovered.creativecommons.org/search/ lets users search for works &amp;quot;tagged&amp;quot; with &amp;quot;banana&amp;quot; by searching for tag:banana. (In particular, the predicate for &amp;quot;tag&amp;quot; is the term &amp;quot;subject&amp;quot; as specified by the Dublin Core.)&lt;br /&gt;
&lt;br /&gt;
These prefixes, like tag:, are stored in the DiscoverEd code right now. This feature aims to move those into a configuration file.&lt;br /&gt;
&lt;br /&gt;
This feature was defined and developed during the [[DiscoverEd Sprint (June, 2010)|June 2010 DiscoverEd Sprint]].&lt;br /&gt;
&lt;br /&gt;
== Requirements ==&lt;br /&gt;
&lt;br /&gt;
When DiscoverEd crawls feeds and resources and saves metadata such as the page title, it converts this information into RDF triples; those triples are eventually saved on disk in a triple store, namely Jena.&lt;br /&gt;
&lt;br /&gt;
We will create a new configuration file that stores a list of mappings from predicate URIs to search prefixes. For example, we might list &amp;quot;method:&amp;quot; as a shorthand for the RDF predicate &amp;lt;http://purl.org/dc/terms/instructionalMethod&amp;gt;, a.k.a. &amp;quot;dct:instructionalMethod&amp;quot;. At indexing time, a Lucene column called &amp;quot;method&amp;quot; will be created in the Lucene documents corresponding to each resource that has the dct:instructionalMethod predicate set in the Jena store.&lt;br /&gt;
&lt;br /&gt;
Then, at search time, Nutch's built-in query parser handles the query, e.g., &amp;quot;method:yaddayadda&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
== How to use ==&lt;br /&gt;
&lt;br /&gt;
Let's say you want to allow users to perform this query:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&amp;lt;pre&amp;gt;&lt;br /&gt;
 method:&amp;quot;Experiential learning&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and retrieve all web pages in your index that have a metadatum with predicate &amp;lt;http://purl.org/dc/terms/instructionalMethod&amp;gt; and value &amp;quot;Experiential learning&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
To do so, first edit &amp;lt;code&amp;gt;conf/nutch-site.xml&amp;lt;/code&amp;gt;. Add this XML inside the &amp;lt;configuration&amp;gt; block.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&amp;lt;pre&amp;gt;&lt;br /&gt;
 &amp;lt;property&amp;gt;&lt;br /&gt;
     &amp;lt;name&amp;gt;query.basic.method.boost&amp;lt;/name&amp;gt;&lt;br /&gt;
     &amp;lt;value&amp;gt;1.0&amp;lt;/value&amp;gt;&lt;br /&gt;
 &amp;lt;/property&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This block of XML tells Nutch to accept the &amp;quot;method:&amp;quot; prefix in search queries. The value of this property indicates the weight the search engine should assign to this term.&lt;br /&gt;
&lt;br /&gt;
Next, edit &amp;lt;code&amp;gt;conf/discovered-search-prefixes.xml&amp;lt;/code&amp;gt;. Add this XML inside the &amp;lt;configuration&amp;gt; block.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&amp;lt;pre&amp;gt;&lt;br /&gt;
 &amp;lt;property&amp;gt;&lt;br /&gt;
     &amp;lt;name&amp;gt;http://purl.org/dc/terms/instructionalMethod&amp;lt;/name&amp;gt;&lt;br /&gt;
     &amp;lt;value&amp;gt;method&amp;lt;/value&amp;gt;&lt;br /&gt;
 &amp;lt;/property&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This block of XML tells DiscoverEd to copy data out of the Jena store and paste it into a format where Nutch's basic query parser can find it.&lt;br /&gt;
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
&lt;br /&gt;
* Added a sample configuration file&lt;br /&gt;
* Added code to our IndexFilter that looks for relevant triples and stores them in the Lucene document&lt;br /&gt;
* We had a problem with these columns not appearing in Lucene, but we fixed the underlying bug that caused that.&lt;br /&gt;
&lt;br /&gt;
== Next steps ==&lt;br /&gt;
&lt;br /&gt;
* Testing and deployment.&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=Running_DiscoverEd&amp;diff=40323</id>
		<title>Running DiscoverEd</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=Running_DiscoverEd&amp;diff=40323"/>
				<updated>2010-08-23T15:51:05Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: /* Deploying new WARs */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:DiscoverEd]]&lt;br /&gt;
&lt;br /&gt;
{{Infobox|This page contains raw documentation, some of which is only applicable to our DiscoverEd deployment.  It will be massaged into more general docs in the fullness of time.}}&lt;br /&gt;
&lt;br /&gt;
== Instructions for running a crawl ==&lt;br /&gt;
&lt;br /&gt;
Tips: &lt;br /&gt;
* For long aggregates and crawls, run in 'screen'.&lt;br /&gt;
&lt;br /&gt;
Updating the index has three phases:&lt;br /&gt;
# Aggregation (polling feeds old and new)&lt;br /&gt;
# Crawling&lt;br /&gt;
# Merging (combining the new index with the existing one)&lt;br /&gt;
&lt;br /&gt;
=== Set up environment ===&lt;br /&gt;
&lt;br /&gt;
These commands set up your environment for running the tools and place you in the discovered user's account.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sudo su - discovered&lt;br /&gt;
$ cd code&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Managing Feeds ===&lt;br /&gt;
&lt;br /&gt;
The feeds script (./bin/feeds) allows you to add curators or feeds. &lt;br /&gt;
Running it without parameters will show the sub-commands.  &lt;br /&gt;
Feeds and curators are identified by URL (and yes, it's picky -- http://example.org is not the same as http://example.org/ ).&lt;br /&gt;
&lt;br /&gt;
==== Notes ====&lt;br /&gt;
&lt;br /&gt;
*For each feed packaged by an OPML feed, the curator is set by the feed title. The OPML consumer will only add the curator/feed if the feed isn't already in the system. &lt;br /&gt;
&lt;br /&gt;
*If you add a feed that already exists, you'll just overwrite the old one (since it's a triple store and the URI is the identifier). The same goes for curators; they're also identified by URI.  It's more likely you'd get two curators, but so long as you're dealing with the same feed URL you won't get duplicates.&lt;br /&gt;
&lt;br /&gt;
=== Aggregation ===&lt;br /&gt;
&lt;br /&gt;
Aggregation polls the feeds and adds new resources to the triple store.  It will also poll any OPML feeds and add the new feeds it finds.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;$ ./bin/feeds aggregate&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Crawl ===&lt;br /&gt;
&lt;br /&gt;
Before you crawl you need to make a seed which tells the crawler what to retrieve.&lt;br /&gt;
&lt;br /&gt;
If the directory &amp;quot;seed/&amp;quot; does not exist, create it with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir seed&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then create the seed list of URLs:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./bin/feeds seed &amp;gt; ./seed/crawl-urls.txt&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the crawl runs it will look in ./seed/ and open every file it finds there, expecting to find one URL per line (so remove files when you don't want them to be crawled).&lt;br /&gt;
&lt;br /&gt;
To run the actual crawl do:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ant -f dedbuild.xml crawl&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will read the seed files and run the crawl.  The result is a new index in the oenutch directory, in a subdirectory named after the crawl's timestamp: for example, crawl-20090730201000 for a crawl run on July 30, 2009 at 8:10 PM.  After the crawl completes you need to merge the new index with the old one.&lt;br /&gt;
&lt;br /&gt;
The production index lives in '''/var/www/discovered.labs.creativecommons.org/production-crawl'''.&lt;br /&gt;
&lt;br /&gt;
To merge the index run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./bin/merge ./crawl-&amp;lt;timestamp&amp;gt;-merged /var/www/discovered.labs.creativecommons.org/production-crawl ./crawl-&amp;lt;timestamp&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The target directory (the first parameter) will be created for you. The second parameter doesn't change, and the third parameter is the directory just created by the crawl.&lt;br /&gt;
&lt;br /&gt;
After the merge completes (assuming it does so successfully) you'll want to move it into the production directory.  Do something like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mkdir -p ~/archived-crawls/$(date -I)&lt;br /&gt;
$ mv ~/production-crawl ~/archived-crawls/$(date -I)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
to rename the existing index so you can go back to it if necessary.&lt;br /&gt;
&lt;br /&gt;
Then you can do&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mv ./crawl-new-dir-merged ~/production-crawl&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
And finally restart Tomcat (the Java app server) to make sure the new index is being used:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sudo /etc/init.d/tomcat5.5 restart&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Managing curators and feeds ==&lt;br /&gt;
&lt;br /&gt;
On a6, in the /var/www/discovered.creativecommons.org/oenutch directory, running ./bin/feeds with no parameters shows the list of subcommands:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  listfeeds        list all feeds&lt;br /&gt;
  listcurators     list all curators&lt;br /&gt;
  addfeed          add a feed&lt;br /&gt;
  resetfeed        reset the last aggregation date for a feed&lt;br /&gt;
  addcurator       add a curator&lt;br /&gt;
  rmfeed           remove a feed&lt;br /&gt;
  setcurator       set the curator for a feed&lt;br /&gt;
  aggregate&lt;br /&gt;
  dump&lt;br /&gt;
  seed&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Each one is run as an argument to the feeds script (i.e. ./bin/feeds [command] [parameter1] [parameter2]...)&lt;br /&gt;
&lt;br /&gt;
=== addfeed ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
addfeed [feed_type] [feed_url] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Assuming you've added the curator for this feed with addcurator, the curator URL will set the curator (so you don't need to set it again with setcurator).&lt;br /&gt;
&lt;br /&gt;
Feed type notes: &amp;quot;rss&amp;quot; is a parser that does RSS/Atom sniffing.&lt;br /&gt;
&lt;br /&gt;
=== addcurator ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
addcurator [curator_name] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Curator names with spaces should be surrounded by quotation marks (e.g. addcurator &amp;quot;CC Open Textbook Project&amp;quot; http://www.collegeopentextbooks.org/)&lt;br /&gt;
&lt;br /&gt;
=== setcurator ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
setcurator [feed_url] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Deploying new WARs ==&lt;br /&gt;
&lt;br /&gt;
To deploy a new war, do this:&lt;br /&gt;
&lt;br /&gt;
* sudo cp nutch-1.1.war /var/lib/tomcat6/webapps/search.war&lt;br /&gt;
* sudo /etc/init.d/tomcat6 restart&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=Running_DiscoverEd&amp;diff=40322</id>
		<title>Running DiscoverEd</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=Running_DiscoverEd&amp;diff=40322"/>
				<updated>2010-08-23T15:49:59Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:DiscoverEd]]&lt;br /&gt;
&lt;br /&gt;
{{Infobox|This page contains raw documentation, some of which is only applicable to our DiscoverEd deployment.  It will be massaged into more general docs in the fullness of time.}}&lt;br /&gt;
&lt;br /&gt;
== Instructions for running a crawl ==&lt;br /&gt;
&lt;br /&gt;
Tips: &lt;br /&gt;
* For long aggregates and crawls, run in 'screen'.&lt;br /&gt;
&lt;br /&gt;
Updating the index has three phases:&lt;br /&gt;
# Aggregation (polling feeds old and new)&lt;br /&gt;
# Crawling&lt;br /&gt;
# Merging (combining the new index with the existing one)&lt;br /&gt;
&lt;br /&gt;
=== Set up environment ===&lt;br /&gt;
&lt;br /&gt;
These commands set up your environment for running the tools and place you in the discovered user's account.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sudo su - discovered&lt;br /&gt;
$ cd code&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Managing Feeds ===&lt;br /&gt;
&lt;br /&gt;
The feeds script (./bin/feeds) allows you to add curators or feeds. &lt;br /&gt;
Running it without parameters will show the sub-commands.  &lt;br /&gt;
Feeds and curators are identified by URL (and yes, it's picky -- http://example.org is not the same as http://example.org/ ).&lt;br /&gt;
&lt;br /&gt;
==== Notes ====&lt;br /&gt;
&lt;br /&gt;
*For each feed packaged by an OPML feed, the curator is set by the feed title. The OPML consumer will only add the curator/feed if the feed isn't already in the system. &lt;br /&gt;
&lt;br /&gt;
*If you add a feed that already exists, you'll just overwrite the old one (since it's a triple store and the URI is the identifier). The same goes for curators; they're also identified by URI.  It's more likely you'd get two curators, but so long as you're dealing with the same feed URL you won't get duplicates.&lt;br /&gt;
&lt;br /&gt;
=== Aggregation ===&lt;br /&gt;
&lt;br /&gt;
Aggregation polls the feeds and adds new resources to the triple store.  It will also poll any OPML feeds and add the new feeds it finds.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;$ ./bin/feeds aggregate&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Crawl ===&lt;br /&gt;
&lt;br /&gt;
Before you crawl you need to make a seed which tells the crawler what to retrieve.&lt;br /&gt;
&lt;br /&gt;
If the directory &amp;quot;seed/&amp;quot; does not exist, create it with&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir seed&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then create the seed list of URLs:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./bin/feeds seed &amp;gt; ./seed/crawl-urls.txt&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the crawl runs it will look in ./seed/ and open every file it finds there, expecting to find one URL per line (so remove files when you don't want them to be crawled).&lt;br /&gt;
&lt;br /&gt;
To run the actual crawl do:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ant -f dedbuild.xml crawl&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will read the seed files and run the crawl.  The result is a new index in the oenutch directory, in a subdirectory named after the crawl's timestamp: for example, crawl-20090730201000 for a crawl run on July 30, 2009 at 8:10 PM.  After the crawl completes you need to merge the new index with the old one.&lt;br /&gt;
&lt;br /&gt;
The production index lives in '''/var/www/discovered.labs.creativecommons.org/production-crawl'''.&lt;br /&gt;
&lt;br /&gt;
To merge the index run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./bin/merge ./crawl-&amp;lt;timestamp&amp;gt;-merged /var/www/discovered.labs.creativecommons.org/production-crawl ./crawl-&amp;lt;timestamp&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The target directory (the first parameter) will be created for you. The second parameter doesn't change, and the third parameter is the directory just created by the crawl.&lt;br /&gt;
&lt;br /&gt;
After the merge completes (assuming it does so successfully) you'll want to move it into the production directory.  Do something like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mkdir -p ~/archived-crawls/$(date -I)&lt;br /&gt;
$ mv ~/production-crawl ~/archived-crawls/$(date -I)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
to rename the existing index so you can go back to it if necessary.&lt;br /&gt;
&lt;br /&gt;
Then you can do&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mv ./crawl-new-dir-merged ~/production-crawl&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
And finally restart Tomcat (the Java app server) to make sure the new index is being used:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sudo /etc/init.d/tomcat5.5 restart&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Managing curators and feeds ==&lt;br /&gt;
&lt;br /&gt;
On a6, in the /var/www/discovered.creativecommons.org/oenutch directory, running ./bin/feeds with no parameters shows the list of subcommands:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  listfeeds        list all feeds&lt;br /&gt;
  listcurators     list all curators&lt;br /&gt;
  addfeed          add a feed&lt;br /&gt;
  resetfeed        reset the last aggregation date for a feed&lt;br /&gt;
  addcurator       add a curator&lt;br /&gt;
  rmfeed           remove a feed&lt;br /&gt;
  setcurator       set the curator for a feed&lt;br /&gt;
  aggregate&lt;br /&gt;
  dump&lt;br /&gt;
  seed&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Each one is run as an argument to the feeds script (i.e. ./bin/feeds [command] [parameter1] [parameter2]...)&lt;br /&gt;
&lt;br /&gt;
=== addfeed ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
addfeed [feed_type] [feed_url] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Assuming you've added the curator for this feed with addcurator, the curator URL will set the curator (so you don't need to set it again with setcurator).&lt;br /&gt;
&lt;br /&gt;
Feed type notes: &amp;quot;rss&amp;quot; is a parser that does RSS/Atom sniffing.&lt;br /&gt;
&lt;br /&gt;
=== addcurator ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
addcurator [curator_name] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Curator names with spaces should be surrounded by quotation marks (e.g. addcurator &amp;quot;CC Open Textbook Project&amp;quot; http://www.collegeopentextbooks.org/)&lt;br /&gt;
&lt;br /&gt;
=== setcurator ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
setcurator [feed_url] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Deploying new WARs ==&lt;br /&gt;
&lt;br /&gt;
To deploy a new war, do this:&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=DiscoverEd&amp;diff=40321</id>
		<title>DiscoverEd</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=DiscoverEd&amp;diff=40321"/>
				<updated>2010-08-23T15:32:12Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;DiscoverEd is a search prototype developed by Creative Commons to explore metadata-enhanced search, specifically for OER.  While most search engines rely solely on algorithmic analyses of resources, DiscoverEd can incorporate [[DiscoverEd Data|data provided by the resource publisher or curator]]. DiscoverEd supports several common metadata formats, including OAI-PMH and RDFa.  The use of these formats allows otherwise unrelated educational projects, curators, and repositories to express facts about their resources in a way that tools (like DiscoverEd) can use for purposes like search and discovery.  DiscoverEd is a project that allows us to explore ways to improve search for OER, and simultaneously demonstrate the utility of structured data.  DiscoverEd is built on [http://lucene.apache.org/nutch/ Nutch].&lt;br /&gt;
&lt;br /&gt;
'''Creative Commons maintains an experimental instance of DiscoverEd at [http://discovered.creativecommons.org discovered.labs.creativecommons.org].'''&lt;br /&gt;
&lt;br /&gt;
== General Documentation ==&lt;br /&gt;
*[[DiscoverEd FAQ|FAQ]]&lt;br /&gt;
*[[DiscoverEd Glossary|Glossary]]&lt;br /&gt;
** Glossary of DiscoverEd-related terms.&lt;br /&gt;
*[[DiscoverEd Metadata|Metadata]]&lt;br /&gt;
** Basic guide on metadata markup for DiscoverEd.&lt;br /&gt;
&lt;br /&gt;
==Software Documentation ==&lt;br /&gt;
*[[DiscoverEd Quickstart|Quickstart]]&lt;br /&gt;
*[[/Install_manually|Installation instructions]]&lt;br /&gt;
*[[Running DiscoverEd]]&lt;br /&gt;
*[[DiscoverEd Data|Data]]&lt;br /&gt;
&lt;br /&gt;
== Developer Documentation ==&lt;br /&gt;
* Source repository  ([http://gitorious.org/discovered gitorious])&lt;br /&gt;
* Project planning ([https://www.pivotaltracker.com/projects/77041 Pivotal Tracker])&lt;br /&gt;
* [[DiscoverEd/Development notes|Development notes]]&lt;br /&gt;
* [[Hacking DiscoverEd]]&lt;br /&gt;
*[[:Category:DiscoverEd_Specification|DiscoverEd dev spec pages]]&lt;br /&gt;
* [[/Meetings]]&lt;br /&gt;
&lt;br /&gt;
== Additional Information ==&lt;br /&gt;
*[[Related Efforts]]&lt;br /&gt;
&lt;br /&gt;
[[Category:DiscoverEd]]&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=Running_DiscoverEd&amp;diff=40259</id>
		<title>Running DiscoverEd</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=Running_DiscoverEd&amp;diff=40259"/>
				<updated>2010-08-20T19:30:33Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: /* Crawl */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:DiscoverEd]]&lt;br /&gt;
&lt;br /&gt;
{{Infobox|This page contains raw documentation, some of which is only applicable to our DiscoverEd deployment.  It will be massaged into more general docs in the fullness of time.}}&lt;br /&gt;
&lt;br /&gt;
== Instructions for running a crawl ==&lt;br /&gt;
&lt;br /&gt;
Tips: &lt;br /&gt;
* For long aggregates and crawls, run in 'screen'.&lt;br /&gt;
&lt;br /&gt;
Updating the index has three phases:&lt;br /&gt;
# Aggregation (polling feeds, old and new)&lt;br /&gt;
# Crawling&lt;br /&gt;
# Merging (combining the new index with the existing one)&lt;br /&gt;
&lt;br /&gt;
=== Set up environment ===&lt;br /&gt;
&lt;br /&gt;
Execute this command to set up your environment for running the tools; it places you in the discovered user's account.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sudo su - discovered&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Managing Feeds ===&lt;br /&gt;
&lt;br /&gt;
The feeds script (./bin/feeds) allows you to add curators or feeds. &lt;br /&gt;
Running it without parameters will show the sub-commands.  &lt;br /&gt;
Feeds and curators are identified by URL (and yes, it's picky -- http://example.org is not the same as http://example.org/ ).&lt;br /&gt;
&lt;br /&gt;
==== Notes ====&lt;br /&gt;
&lt;br /&gt;
*For each feed packaged by an OPML feed, the curator is set by the feed title. The OPML consumer will only add the curator/feed if the feed isn't already in the system. &lt;br /&gt;
&lt;br /&gt;
*If you add a feed that already exists, you'll just overwrite the old one (since it's a triple store and the URI is the identifier). The same goes for a curator; curators are also identified by URI.  It's more likely you'd get two curators, but so long as you're dealing with the same feed URL you won't get duplicates.&lt;br /&gt;
&lt;br /&gt;
=== Aggregation ===&lt;br /&gt;
&lt;br /&gt;
Aggregation polls the feeds and adds new resources to the triple store.  It will also poll any OPML feeds and add the new feeds it finds.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;$ ./bin/feeds aggregate&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Crawl ===&lt;br /&gt;
&lt;br /&gt;
Before you crawl you need to make a seed which tells the crawler what to retrieve.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./bin/feeds seed &amp;gt; ./seed/crawl-urls.txt&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the crawl runs it will look in ./seed/ and open every file it finds there, expecting to find one URL per line (so remove files when you don't want them to be crawled).&lt;br /&gt;
&lt;br /&gt;
To run the actual crawl do:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ant -f dedbuild.xml crawl&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will read the seed files and run the crawl.  The result is a new index in the oenutch directory, stored in a directory with a timestamp-derived name: for example, crawl-20090730201000 for a crawl run on July 30, 2009 at 8:10 PM.  After the crawl completes, you need to merge the new index with the old one.&lt;br /&gt;
&lt;br /&gt;
The production index lives in '''/var/www/discovered.labs.creativecommons.org/production-crawl'''.&lt;br /&gt;
&lt;br /&gt;
To merge the index run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./bin/merge ./crawl-&amp;lt;timestamp&amp;gt;-merged /var/www/discovered.labs.creativecommons.org/production-crawl ./crawl-&amp;lt;timestamp&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The target directory (the first parameter) will be created for you. The second parameter doesn't change, and the third parameter is the directory just created by the crawl.&lt;br /&gt;
&lt;br /&gt;
After the merge completes (assuming it does so successfully) you'll want to move it into the production directory.  Do something like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mkdir -p ~/archived-crawls/$(date -I)&lt;br /&gt;
$ mv ~/production-crawl ~/archived-crawls/$(date -I)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
to archive the existing index so you can go back to it if necessary.&lt;br /&gt;
&lt;br /&gt;
Then you can do&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mv ./crawl-&amp;lt;timestamp&amp;gt;-merged ~/production-crawl&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
And finally restart Tomcat (the Java app server) to make sure the new index is being used:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sudo /etc/init.d/tomcat5.5 restart&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Managing curators and feeds ==&lt;br /&gt;
&lt;br /&gt;
On a6, in the /var/www/discovered.creativecommons.org/oenutch directory, running ./bin/feeds with no parameters shows the list of subcommands:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  listfeeds        list all feeds&lt;br /&gt;
  listcurators     list all curators&lt;br /&gt;
  addfeed          add a feed&lt;br /&gt;
  resetfeed        reset the last aggregation date for a feed&lt;br /&gt;
  addcurator       add a curator&lt;br /&gt;
  rmfeed           remove a feed&lt;br /&gt;
  setcurator       set the curator for a feed&lt;br /&gt;
  aggregate&lt;br /&gt;
  dump&lt;br /&gt;
  seed&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Each one is run as an argument to the feeds script (i.e. ./bin/feeds [command] [parameter1] [parameter2]...)&lt;br /&gt;
&lt;br /&gt;
=== addfeed ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
addfeed [feed_type] [feed_url] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Assuming you've added the curator for this feed with addcurator, the curator URL will set the curator (so you don't need to set it again with setcurator).&lt;br /&gt;
&lt;br /&gt;
Feed type notes: &amp;quot;rss&amp;quot; is a parser that does RSS/Atom sniffing.&lt;br /&gt;
&lt;br /&gt;
=== addcurator ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
addcurator [curator_name] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Curator names with spaces should be surrounded by quotation marks (e.g. addcurator &amp;quot;CC Open Textbook Project&amp;quot; http://www.collegeopentextbooks.org/)&lt;br /&gt;
&lt;br /&gt;
=== setcurator ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
setcurator [feed_url] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=DiscoverEd&amp;diff=40258</id>
		<title>DiscoverEd</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=DiscoverEd&amp;diff=40258"/>
				<updated>2010-08-20T19:28:05Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;DiscoverEd is a search prototype developed by Creative Commons to explore metadata-enhanced search, specifically for OER.  While most search engines rely solely on algorithmic analyses of resources, DiscoverEd can incorporate [[DiscoverEd Data|data provided by the resource publisher or curator]]. DiscoverEd supports several common metadata formats, including OAI-PMH and RDFa.  The use of these formats allows otherwise unrelated educational projects, curators, and repositories to express facts about their resources in a way that tools (like DiscoverEd) can use for purposes like search and discovery.  DiscoverEd is a project that allows us to explore ways to improve search for OER, and simultaneously demonstrate the utility of structured data.  DiscoverEd is built on [http://lucene.apache.org/nutch/ Nutch].&lt;br /&gt;
&lt;br /&gt;
'''Creative Commons maintains an experimental instance of DiscoverEd at [http://discovered.labs.creativecommons.org discovered.labs.creativecommons.org].'''&lt;br /&gt;
&lt;br /&gt;
== General Documentation ==&lt;br /&gt;
*[[DiscoverEd FAQ|FAQ]]&lt;br /&gt;
*[[DiscoverEd Glossary|Glossary]]&lt;br /&gt;
** Glossary of DiscoverEd-related terms.&lt;br /&gt;
*[[DiscoverEd Metadata|Metadata]]&lt;br /&gt;
** Basic guide on metadata markup for DiscoverEd.&lt;br /&gt;
&lt;br /&gt;
== Software Documentation ==&lt;br /&gt;
*[[DiscoverEd Quickstart|Quickstart]]&lt;br /&gt;
*[[Running DiscoverEd]]&lt;br /&gt;
*[[DiscoverEd Data|Data]]&lt;br /&gt;
&lt;br /&gt;
== Developer Documentation ==&lt;br /&gt;
* Source repository  ([http://gitorious.org/discovered gitorious])&lt;br /&gt;
* Project planning ([https://www.pivotaltracker.com/projects/77041 Pivotal Tracker])&lt;br /&gt;
* [[DiscoverEd/Development notes|Development notes]]&lt;br /&gt;
* [[Hacking DiscoverEd]]&lt;br /&gt;
*[[:Category:DiscoverEd_Specification|DiscoverEd dev spec pages]]&lt;br /&gt;
* [[/Meetings]]&lt;br /&gt;
&lt;br /&gt;
== Additional Information ==&lt;br /&gt;
*[[Related Efforts]]&lt;br /&gt;
&lt;br /&gt;
[[Category:DiscoverEd]]&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=DiscoverEd/Install_manually&amp;diff=40257</id>
		<title>DiscoverEd/Install manually</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=DiscoverEd/Install_manually&amp;diff=40257"/>
				<updated>2010-08-20T19:27:35Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:DiscoverEd]]&lt;br /&gt;
{{Stub}}&lt;br /&gt;
&lt;br /&gt;
=== Check out and build the source code ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ git clone git://gitorious.org/discovered/repo.git discovered&lt;br /&gt;
$ cd discovered&lt;br /&gt;
$ ant&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Add a curator and a feed ===&lt;br /&gt;
&lt;br /&gt;
DiscoverEd uses feeds to help identify resources to crawl.  Feeds are provided by curators, who can also provide metadata about resources.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./bin/feeds addcurator &amp;quot;ND OCW&amp;quot; http://ocw.nd.edu/ &lt;br /&gt;
$ ./bin/feeds addfeed rss http://ocw.nd.edu/front-page/courselist/rss http://ocw.nd.edu/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Aggregate and crawl resources ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./bin/feeds aggregate&lt;br /&gt;
$ mkdir seed&lt;br /&gt;
$ ./bin/feeds seed &amp;gt; seed/urls.txt&lt;br /&gt;
$ ant -f dedbuild.xml crawl&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Run the web application ===&lt;br /&gt;
&lt;br /&gt;
Edit conf/nutch-site.xml to point to your crawl location.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ant war&lt;br /&gt;
$ [copy the war file to your J2EE container]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Switching to MySQL ===&lt;br /&gt;
&lt;br /&gt;
By default, DiscoverEd (at least on the ''next'' branch) uses an on-disk database called Derby for storing resource metadata. You should use a different database, like MySQL, in production.&lt;br /&gt;
&lt;br /&gt;
To do that, edit '''conf/discovered.xml''' and update the following sections as appropriate:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;property&amp;gt;&lt;br /&gt;
  &amp;lt;name&amp;gt;rdfstore.db.driver&amp;lt;/name&amp;gt;&lt;br /&gt;
  &amp;lt;value&amp;gt;com.mysql.jdbc.Driver&amp;lt;/value&amp;gt;&lt;br /&gt;
&amp;lt;/property&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;property&amp;gt;&lt;br /&gt;
  &amp;lt;name&amp;gt;rdfstore.db.url&amp;lt;/name&amp;gt;&lt;br /&gt;
  &amp;lt;value&amp;gt;jdbc:mysql://localhost/discovered?autoReconnect=true&amp;lt;/value&amp;gt;&lt;br /&gt;
&amp;lt;/property&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;property&amp;gt;&lt;br /&gt;
  &amp;lt;name&amp;gt;rdfstore.db.user&amp;lt;/name&amp;gt;&lt;br /&gt;
  &amp;lt;value&amp;gt;discovered&amp;lt;/value&amp;gt;&lt;br /&gt;
&amp;lt;/property&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;property&amp;gt;&lt;br /&gt;
  &amp;lt;name&amp;gt;rdfstore.db.password&amp;lt;/name&amp;gt;&lt;br /&gt;
  &amp;lt;value&amp;gt;&amp;lt;/value&amp;gt;&lt;br /&gt;
&amp;lt;/property&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=Running_DiscoverEd&amp;diff=40256</id>
		<title>Running DiscoverEd</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=Running_DiscoverEd&amp;diff=40256"/>
				<updated>2010-08-20T19:27:15Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: /* Switching to MySQL */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:DiscoverEd]]&lt;br /&gt;
&lt;br /&gt;
{{Infobox|This page contains raw documentation, some of which is only applicable to our DiscoverEd deployment.  It will be massaged into more general docs in the fullness of time.}}&lt;br /&gt;
&lt;br /&gt;
== Instructions for running a crawl ==&lt;br /&gt;
&lt;br /&gt;
Tips: &lt;br /&gt;
* For long aggregates and crawls, run in 'screen'.&lt;br /&gt;
&lt;br /&gt;
Updating the index has three phases:&lt;br /&gt;
# Aggregation (polling feeds, old and new)&lt;br /&gt;
# Crawling&lt;br /&gt;
# Merging (combining the new index with the existing one)&lt;br /&gt;
&lt;br /&gt;
=== Set up environment ===&lt;br /&gt;
&lt;br /&gt;
Execute this command to set up your environment for running the tools; it places you in the discovered user's account.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sudo su - discovered&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Managing Feeds ===&lt;br /&gt;
&lt;br /&gt;
The feeds script (./bin/feeds) allows you to add curators or feeds. &lt;br /&gt;
Running it without parameters will show the sub-commands.  &lt;br /&gt;
Feeds and curators are identified by URL (and yes, it's picky -- http://example.org is not the same as http://example.org/ ).&lt;br /&gt;
&lt;br /&gt;
==== Notes ====&lt;br /&gt;
&lt;br /&gt;
*For each feed packaged by an OPML feed, the curator is set by the feed title. The OPML consumer will only add the curator/feed if the feed isn't already in the system. &lt;br /&gt;
&lt;br /&gt;
*If you add a feed that already exists, you'll just overwrite the old one (since it's a triple store and the URI is the identifier). The same goes for a curator; curators are also identified by URI.  It's more likely you'd get two curators, but so long as you're dealing with the same feed URL you won't get duplicates.&lt;br /&gt;
&lt;br /&gt;
=== Aggregation ===&lt;br /&gt;
&lt;br /&gt;
Aggregation polls the feeds and adds new resources to the triple store.  It will also poll any OPML feeds and add the new feeds it finds.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;$ ./bin/feeds aggregate&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Crawl ===&lt;br /&gt;
&lt;br /&gt;
Before you crawl you need to make a seed which tells the crawler what to retrieve.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./bin/feeds seed &amp;gt; ./seed/crawl-urls.txt&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the crawl runs it will look in ./seed/ and open every file it finds there, expecting to find one URL per line (so remove files when you don't want them to be crawled).&lt;br /&gt;
&lt;br /&gt;
To run the actual crawl do:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ant -f dedbuild.xml crawl&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will read the seed files and run the crawl.  The result is a new index in the oenutch directory, stored in a directory with a timestamp-derived name: for example, crawl-20090730201000 for a crawl run on July 30, 2009 at 8:10 PM.  After the crawl completes, you need to merge the new index with the old one.&lt;br /&gt;
&lt;br /&gt;
The production index lives in '''/var/www/discovered.labs.creativecommons.org/production-crawl'''.&lt;br /&gt;
&lt;br /&gt;
To merge the index run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./bin/merge ./crawl-&amp;lt;timestamp&amp;gt;-merged /var/www/discovered.labs.creativecommons.org/production-crawl ./crawl-&amp;lt;timestamp&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The target directory (the first parameter) will be created for you. The second parameter doesn't change, and the third parameter is the directory just created by the crawl.&lt;br /&gt;
&lt;br /&gt;
After the merge completes (assuming it does so successfully) you'll want to move it into the production directory.  Do something like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mv /usr/local/nutch/crawl /usr/local/nutch/crawl.20090730&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
to rename the existing index so you can go back to it if necessary.&lt;br /&gt;
&lt;br /&gt;
Then you can do&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mv ./crawl-&amp;lt;timestamp&amp;gt;-merged /usr/local/nutch/crawl&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
And finally restart Tomcat (the Java app server) to make sure the new index is being used:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sudo /etc/init.d/tomcat5.5 restart&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Managing curators and feeds ==&lt;br /&gt;
&lt;br /&gt;
On a6, in the /var/www/discovered.creativecommons.org/oenutch directory, running ./bin/feeds with no parameters shows the list of subcommands:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  listfeeds        list all feeds&lt;br /&gt;
  listcurators     list all curators&lt;br /&gt;
  addfeed          add a feed&lt;br /&gt;
  resetfeed        reset the last aggregation date for a feed&lt;br /&gt;
  addcurator       add a curator&lt;br /&gt;
  rmfeed           remove a feed&lt;br /&gt;
  setcurator       set the curator for a feed&lt;br /&gt;
  aggregate&lt;br /&gt;
  dump&lt;br /&gt;
  seed&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Each one is run as an argument to the feeds script (i.e. ./bin/feeds [command] [parameter1] [parameter2]...)&lt;br /&gt;
&lt;br /&gt;
=== addfeed ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
addfeed [feed_type] [feed_url] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Assuming you've added the curator for this feed with addcurator, the curator URL will set the curator (so you don't need to set it again with setcurator).&lt;br /&gt;
&lt;br /&gt;
Feed type notes: &amp;quot;rss&amp;quot; is a parser that does RSS/Atom sniffing.&lt;br /&gt;
&lt;br /&gt;
=== addcurator ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
addcurator [curator_name] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Curator names with spaces should be surrounded by quotation marks (e.g. addcurator &amp;quot;CC Open Textbook Project&amp;quot; http://www.collegeopentextbooks.org/)&lt;br /&gt;
&lt;br /&gt;
=== setcurator ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
setcurator [feed_url] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=Running_DiscoverEd&amp;diff=40255</id>
		<title>Running DiscoverEd</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=Running_DiscoverEd&amp;diff=40255"/>
				<updated>2010-08-20T19:27:03Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: ccbuild.xml =&amp;gt; dedbuild.xml&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:DiscoverEd]]&lt;br /&gt;
&lt;br /&gt;
{{Infobox|This page contains raw documentation, some of which is only applicable to our DiscoverEd deployment.  It will be massaged into more general docs in the fullness of time.}}&lt;br /&gt;
&lt;br /&gt;
== Instructions for running a crawl ==&lt;br /&gt;
&lt;br /&gt;
Tips: &lt;br /&gt;
* For long aggregates and crawls, run in 'screen'.&lt;br /&gt;
&lt;br /&gt;
Updating the index has three phases:&lt;br /&gt;
# Aggregation (polling feeds, old and new)&lt;br /&gt;
# Crawling&lt;br /&gt;
# Merging (combining the new index with the existing one)&lt;br /&gt;
&lt;br /&gt;
=== Set up environment ===&lt;br /&gt;
&lt;br /&gt;
Execute this command to set up your environment for running the tools; it places you in the discovered user's account.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sudo su - discovered&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Switching to MySQL ===&lt;br /&gt;
&lt;br /&gt;
By default, DiscoverEd (at least on the ''next'' branch) uses an on-disk database called Derby for storing resource metadata. You should use a different database, like MySQL, in production.&lt;br /&gt;
&lt;br /&gt;
To do that, edit '''conf/discovered.xml''' and update the following sections as appropriate:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;property&amp;gt;&lt;br /&gt;
  &amp;lt;name&amp;gt;rdfstore.db.driver&amp;lt;/name&amp;gt;&lt;br /&gt;
  &amp;lt;value&amp;gt;com.mysql.jdbc.Driver&amp;lt;/value&amp;gt;&lt;br /&gt;
&amp;lt;/property&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;property&amp;gt;&lt;br /&gt;
  &amp;lt;name&amp;gt;rdfstore.db.url&amp;lt;/name&amp;gt;&lt;br /&gt;
  &amp;lt;value&amp;gt;jdbc:mysql://localhost/discovered?autoReconnect=true&amp;lt;/value&amp;gt;&lt;br /&gt;
&amp;lt;/property&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;property&amp;gt;&lt;br /&gt;
  &amp;lt;name&amp;gt;rdfstore.db.user&amp;lt;/name&amp;gt;&lt;br /&gt;
  &amp;lt;value&amp;gt;discovered&amp;lt;/value&amp;gt;&lt;br /&gt;
&amp;lt;/property&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;property&amp;gt;&lt;br /&gt;
  &amp;lt;name&amp;gt;rdfstore.db.password&amp;lt;/name&amp;gt;&lt;br /&gt;
  &amp;lt;value&amp;gt;&amp;lt;/value&amp;gt;&lt;br /&gt;
&amp;lt;/property&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Managing Feeds ===&lt;br /&gt;
&lt;br /&gt;
The feeds script (./bin/feeds) allows you to add curators or feeds. &lt;br /&gt;
Running it without parameters will show the sub-commands.  &lt;br /&gt;
Feeds and curators are identified by URL (and yes, it's picky -- http://example.org is not the same as http://example.org/ ).&lt;br /&gt;
&lt;br /&gt;
==== Notes ====&lt;br /&gt;
&lt;br /&gt;
*For each feed packaged by an OPML feed, the curator is set by the feed title. The OPML consumer will only add the curator/feed if the feed isn't already in the system. &lt;br /&gt;
&lt;br /&gt;
*If you add a feed that already exists, you'll just overwrite the old one (since it's a triple store and the URI is the identifier). The same goes for a curator; curators are also identified by URI.  It's more likely you'd get two curators, but so long as you're dealing with the same feed URL you won't get duplicates.&lt;br /&gt;
&lt;br /&gt;
=== Aggregation ===&lt;br /&gt;
&lt;br /&gt;
Aggregation polls the feeds and adds new resources to the triple store.  It will also poll any OPML feeds and add the new feeds it finds.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;$ ./bin/feeds aggregate&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Crawl ===&lt;br /&gt;
&lt;br /&gt;
Before you crawl you need to make a seed which tells the crawler what to retrieve.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./bin/feeds seed &amp;gt; ./seed/crawl-urls.txt&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the crawl runs it will look in ./seed/ and open every file it finds there, expecting to find one URL per line (so remove files when you don't want them to be crawled).&lt;br /&gt;
&lt;br /&gt;
To run the actual crawl do:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ant -f dedbuild.xml crawl&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will read the seed files and run the crawl.  The result is a new index in the oenutch directory, stored in a directory with a timestamp-derived name: for example, crawl-20090730201000 for a crawl run on July 30, 2009 at 8:10 PM.  After the crawl completes, you need to merge the new index with the old one.&lt;br /&gt;
&lt;br /&gt;
The production index lives in '''/var/www/discovered.labs.creativecommons.org/production-crawl'''.&lt;br /&gt;
&lt;br /&gt;
To merge the index run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./bin/merge ./crawl-&amp;lt;timestamp&amp;gt;-merged /var/www/discovered.labs.creativecommons.org/production-crawl ./crawl-&amp;lt;timestamp&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The target directory (the first parameter) will be created for you. The second parameter doesn't change, and the third parameter is the directory just created by the crawl.&lt;br /&gt;
&lt;br /&gt;
After the merge completes (assuming it does so successfully) you'll want to move it into the production directory.  Do something like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mv /usr/local/nutch/crawl /usr/local/nutch/crawl.20090730&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
to rename the existing index so you can go back to it if necessary.&lt;br /&gt;
&lt;br /&gt;
Then you can do&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mv ./crawl-&amp;lt;timestamp&amp;gt;-merged /usr/local/nutch/crawl&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
And finally restart Tomcat (the Java app server) to make sure the new index is being used:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sudo /etc/init.d/tomcat5.5 restart&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Managing curators and feeds ==&lt;br /&gt;
&lt;br /&gt;
On a6, in the /var/www/discovered.creativecommons.org/oenutch directory, running ./bin/feeds with no parameters shows the list of subcommands:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  listfeeds        list all feeds&lt;br /&gt;
  listcurators     list all curators&lt;br /&gt;
  addfeed          add a feed&lt;br /&gt;
  resetfeed        reset the last aggregation date for a feed&lt;br /&gt;
  addcurator       add a curator&lt;br /&gt;
  rmfeed           remove a feed&lt;br /&gt;
  setcurator       set the curator for a feed&lt;br /&gt;
  aggregate&lt;br /&gt;
  dump&lt;br /&gt;
  seed&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Each one is run as an argument to the feeds script (i.e. ./bin/feeds [command] [parameter1] [parameter2]...)&lt;br /&gt;
&lt;br /&gt;
=== addfeed ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
addfeed [feed_type] [feed_url] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Assuming you've added the curator for this feed with addcurator, the curator URL will set the curator (so you don't need to set it again with setcurator).&lt;br /&gt;
&lt;br /&gt;
Feed type notes: &amp;quot;rss&amp;quot; is a parser that does RSS/Atom sniffing.&lt;br /&gt;
&lt;br /&gt;
=== addcurator ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
addcurator [curator_name] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Curator names with spaces should be surrounded by quotation marks (e.g. addcurator &amp;quot;CC Open Textbook Project&amp;quot; http://www.collegeopentextbooks.org/)&lt;br /&gt;
&lt;br /&gt;
=== setcurator ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
setcurator [feed_url] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=Running_DiscoverEd&amp;diff=40254</id>
		<title>Running DiscoverEd</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=Running_DiscoverEd&amp;diff=40254"/>
				<updated>2010-08-20T19:25:45Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: Update production crawl location&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:DiscoverEd]]&lt;br /&gt;
&lt;br /&gt;
{{Infobox|This page contains raw documentation, some of which is only applicable to our DiscoverEd deployment.  It will be massaged into more general docs in the fullness of time.}}&lt;br /&gt;
&lt;br /&gt;
== Instructions for running a crawl ==&lt;br /&gt;
&lt;br /&gt;
Tips: &lt;br /&gt;
* For long aggregates and crawls, run in 'screen'.&lt;br /&gt;
&lt;br /&gt;
Updating the index has three phases:&lt;br /&gt;
# Aggregation (polling feeds, old and new)&lt;br /&gt;
# Crawling&lt;br /&gt;
# Merging (combining the new index with the existing one)&lt;br /&gt;
&lt;br /&gt;
=== Set up environment ===&lt;br /&gt;
&lt;br /&gt;
Execute this command to set up your environment for running the tools; it places you in the discovered user's account.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sudo su - discovered&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Switching to MySQL ===&lt;br /&gt;
&lt;br /&gt;
By default, DiscoverEd (at least on the ''next'' branch) uses an on-disk database called Derby for storing resource metadata. You should use a different database, like MySQL, in production.&lt;br /&gt;
&lt;br /&gt;
To do that, edit '''conf/discovered.xml''' and update the following sections as appropriate:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;property&amp;gt;&lt;br /&gt;
  &amp;lt;name&amp;gt;rdfstore.db.driver&amp;lt;/name&amp;gt;&lt;br /&gt;
  &amp;lt;value&amp;gt;com.mysql.jdbc.Driver&amp;lt;/value&amp;gt;&lt;br /&gt;
&amp;lt;/property&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;property&amp;gt;&lt;br /&gt;
  &amp;lt;name&amp;gt;rdfstore.db.url&amp;lt;/name&amp;gt;&lt;br /&gt;
  &amp;lt;value&amp;gt;jdbc:mysql://localhost/discovered?autoReconnect=true&amp;lt;/value&amp;gt;&lt;br /&gt;
&amp;lt;/property&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;property&amp;gt;&lt;br /&gt;
  &amp;lt;name&amp;gt;rdfstore.db.user&amp;lt;/name&amp;gt;&lt;br /&gt;
  &amp;lt;value&amp;gt;discovered&amp;lt;/value&amp;gt;&lt;br /&gt;
&amp;lt;/property&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;property&amp;gt;&lt;br /&gt;
  &amp;lt;name&amp;gt;rdfstore.db.password&amp;lt;/name&amp;gt;&lt;br /&gt;
  &amp;lt;value&amp;gt;&amp;lt;/value&amp;gt;&lt;br /&gt;
&amp;lt;/property&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Managing Feeds ===&lt;br /&gt;
&lt;br /&gt;
The feeds script (./bin/feeds) allows you to add curators or feeds. &lt;br /&gt;
Running it without parameters will show the sub-commands.  &lt;br /&gt;
Feeds and curators are identified by URL (and yes, it's picky -- http://example.org is not the same as http://example.org/ ).&lt;br /&gt;
&lt;br /&gt;
==== Notes ====&lt;br /&gt;
&lt;br /&gt;
*For each feed packaged by an OPML feed, the curator is set by the feed title. The OPML consumer will only add the curator/feed if the feed isn't already in the system. &lt;br /&gt;
&lt;br /&gt;
*If you add a feed that already exists, you'll just overwrite the old one (since it's a triple store and the URI is the identifier). The same goes for a curator; curators are also identified by URI.  It's more likely you'd get two curators, but so long as you're dealing with the same feed URL you won't get duplicates.&lt;br /&gt;
&lt;br /&gt;
=== Aggregation ===&lt;br /&gt;
&lt;br /&gt;
Aggregation polls the feeds and adds new resources to the triple store.  It will also poll any OPML feeds and add the new feeds it finds.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;$ ./bin/feeds aggregate&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Crawl ===&lt;br /&gt;
&lt;br /&gt;
Before you crawl you need to make a seed which tells the crawler what to retrieve.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./bin/feeds seed &amp;gt; ./seed/crawl-urls.txt&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the crawl runs it will look in ./seed/ and open every file it finds there, expecting to find one URL per line (so remove files when you don't want them to be crawled).&lt;br /&gt;
&lt;br /&gt;
To run the actual crawl do:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ant -f ccbuild.xml crawl&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will read the seed files and run the crawl.  The result is a new index directory inside the oenutch directory, named after the time of the crawl: for example, crawl-20090730201000 for a crawl run on July 30, 2009 at 8:10 PM.  After the crawl completes you need to merge the new index with the old one.&lt;br /&gt;
&lt;br /&gt;
The production index lives in '''/var/www/discovered.labs.creativecommons.org/production-crawl'''.&lt;br /&gt;
&lt;br /&gt;
To merge the index run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./bin/merge ./crawl-&amp;lt;timestamp&amp;gt;-merged /var/www/discovered.labs.creativecommons.org/production-crawl ./crawl-&amp;lt;timestamp&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The target directory (the first parameter) will be created for you. The second parameter doesn't change, and the third parameter is the directory just created by the crawl.&lt;br /&gt;
&lt;br /&gt;
After the merge completes successfully, move the merged index into the production directory.  First do something like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mv /usr/local/nutch/crawl /usr/local/nutch/crawl.20090730&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
to rename the existing index so you can go back to it if necessary.&lt;br /&gt;
&lt;br /&gt;
Then you can do&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mv ./crawl-new-dir-merged /usr/local/nutch/crawl&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
And finally restart Tomcat (the Java app server) to make sure the new index is being used:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sudo /etc/init.d/tomcat5.5 restart&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Managing curators and feeds ==&lt;br /&gt;
&lt;br /&gt;
On a6, in the /var/www/discovered.creativecommons.org/oenutch directory, running ./bin/feeds with no parameters shows the list of subcommands:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  listfeeds        list all feeds&lt;br /&gt;
  listcurators     list all curators&lt;br /&gt;
  addfeed          add a feed&lt;br /&gt;
  resetfeed        reset the last aggregation date for a feed&lt;br /&gt;
  addcurator       add a curator&lt;br /&gt;
  rmfeed           remove a feed&lt;br /&gt;
  setcurator       set the curator for a feed&lt;br /&gt;
  aggregate&lt;br /&gt;
  dump&lt;br /&gt;
  seed&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Each one is run as an argument to the feeds script (i.e. ./bin/feeds [command] [parameter1] [parameter2]...)&lt;br /&gt;
&lt;br /&gt;
=== addfeed ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
addfeed [feed_type] [feed_url] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Assuming you've added the curator for this feed with addcurator, the curator URL will set the curator (so you don't need to set it again with setcurator).&lt;br /&gt;
&lt;br /&gt;
Feed type notes: &amp;quot;rss&amp;quot; is a parser that does RSS/Atom sniffing.&lt;br /&gt;
&lt;br /&gt;
=== addcurator ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
addcurator [curator_name] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Curator names with spaces should be surrounded by quotation marks (e.g. addcurator &amp;quot;CC Open Textbook Project&amp;quot; http://www.collegeopentextbooks.org/)&lt;br /&gt;
&lt;br /&gt;
=== setcurator ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
setcurator [feed_url] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=Running_DiscoverEd&amp;diff=40248</id>
		<title>Running DiscoverEd</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=Running_DiscoverEd&amp;diff=40248"/>
				<updated>2010-08-20T18:55:04Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:DiscoverEd]]&lt;br /&gt;
&lt;br /&gt;
{{Infobox|This page contains raw documentation, some of which is only applicable to our DiscoverEd deployment.  It will be massaged into more general docs in the fullness of time.}}&lt;br /&gt;
&lt;br /&gt;
== Instructions for running a crawl ==&lt;br /&gt;
&lt;br /&gt;
Tips: &lt;br /&gt;
* For long aggregates and crawls, run in 'screen'.&lt;br /&gt;
&lt;br /&gt;
There are three phases to updating the index:&lt;br /&gt;
# Aggregation (polling feeds old and new)&lt;br /&gt;
# Crawling&lt;br /&gt;
# Merging (combining the new index with the existing one)&lt;br /&gt;
&lt;br /&gt;
=== Set up environment ===&lt;br /&gt;
&lt;br /&gt;
Execute these commands to set up your environment for running the tools.  This also places you into a sub-shell, so you'll have to log out twice to disconnect:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ cd /var/www/discovered.creativecommons.org/oenutch&lt;br /&gt;
$ ./bin/env.sh&lt;br /&gt;
$ export JAVA_HOME=/usr/lib/jvm/java-6-sun&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Switching to MySQL ===&lt;br /&gt;
&lt;br /&gt;
By default, DiscoverEd (at least on the ''next'' branch) uses an on-disk database called Derby for storing resource metadata. You should use a different database, like MySQL, in production.&lt;br /&gt;
&lt;br /&gt;
To do that, edit '''conf/discovered.xml''' and update the following sections as appropriate:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;lt;property&amp;gt;&lt;br /&gt;
  &amp;lt;name&amp;gt;rdfstore.db.driver&amp;lt;/name&amp;gt;&lt;br /&gt;
  &amp;lt;value&amp;gt;com.mysql.jdbc.Driver&amp;lt;/value&amp;gt;&lt;br /&gt;
&amp;lt;/property&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;property&amp;gt;&lt;br /&gt;
  &amp;lt;name&amp;gt;rdfstore.db.url&amp;lt;/name&amp;gt;&lt;br /&gt;
  &amp;lt;value&amp;gt;jdbc:mysql://localhost/discovered?autoReconnect=true&amp;lt;/value&amp;gt;&lt;br /&gt;
&amp;lt;/property&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;property&amp;gt;&lt;br /&gt;
  &amp;lt;name&amp;gt;rdfstore.db.user&amp;lt;/name&amp;gt;&lt;br /&gt;
  &amp;lt;value&amp;gt;discovered&amp;lt;/value&amp;gt;&lt;br /&gt;
&amp;lt;/property&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;property&amp;gt;&lt;br /&gt;
  &amp;lt;name&amp;gt;rdfstore.db.password&amp;lt;/name&amp;gt;&lt;br /&gt;
  &amp;lt;value&amp;gt;&amp;lt;/value&amp;gt;&lt;br /&gt;
&amp;lt;/property&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Managing Feeds ===&lt;br /&gt;
&lt;br /&gt;
The feeds script (./bin/feeds) allows you to add curators or feeds. &lt;br /&gt;
Running it without parameters will show the sub-commands.  &lt;br /&gt;
Feeds and curators are identified by URL (and yes, it's picky -- http://example.org is not the same as http://example.org/ ).&lt;br /&gt;
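&lt;br /&gt;
Because the match is on the exact URL string, it can help to settle on one spelling before calling ./bin/feeds. The following is a minimal sketch, not part of DiscoverEd: the helper name canonical_url is hypothetical, and the choice to add a trailing slash to a bare host is an assumption made for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical helper, not part of DiscoverEd: pick one spelling of a URL
# and use it every time that feed or curator is passed to ./bin/feeds.
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url):
    """Return url with an empty path replaced by "/"; leave other URLs unchanged."""
    parts = urlsplit(url)
    if parts.path == "":
        parts = parts._replace(path="/")
    return urlunsplit(parts)


# Compared as plain strings, the two spellings are different identifiers.
print("http://example.org" == "http://example.org/")
# After canonicalization both spellings agree.
print(canonical_url("http://example.org") == canonical_url("http://example.org/"))
```
&lt;br /&gt;
Whichever spelling you choose, the point is to use the same one in addfeed, addcurator, and setcurator.&lt;br /&gt;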
&lt;br /&gt;
==== Notes ====&lt;br /&gt;
&lt;br /&gt;
*For each feed packaged by an OPML feed, the curator is set by the feed title. The OPML consumer will only add the curator/feed if the feed isn't already in the system. &lt;br /&gt;
&lt;br /&gt;
*If you add a feed that already exists, you'll just overwrite the old one, since it's a triple store and the URI is the identifier. The same goes for curators, which are also identified by URI. It's more likely you'd get two curators, but so long as you're dealing with the same feed URL you won't get duplicates.&lt;br /&gt;
&lt;br /&gt;
=== Aggregation ===&lt;br /&gt;
&lt;br /&gt;
Aggregation polls the feeds and adds new resources to the triple store.  It will also poll any OPML feeds and add the new feeds it finds.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;$ ./bin/feeds aggregate&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Crawl ===&lt;br /&gt;
&lt;br /&gt;
Before you crawl you need to make a seed which tells the crawler what to retrieve.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./bin/feeds seed &amp;gt; ./seed/crawl-urls.txt&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the crawl runs it will look in ./seed/ and open every file it finds there, expecting to find one URL per line (so remove files when you don't want them to be crawled).&lt;br /&gt;
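&lt;br /&gt;
To preview what a crawl would pick up, the behaviour described above can be imitated in a few lines. This is a sketch under stated assumptions, not the crawler's own code: the function name read_seed_urls is hypothetical, and skipping blank lines and sub-directories is an assumption.&lt;br /&gt;
&lt;br /&gt;
```python
import os


def read_seed_urls(seed_dir):
    """Collect every non-blank line from every regular file in seed_dir."""
    urls = []
    for name in sorted(os.listdir(seed_dir)):
        path = os.path.join(seed_dir, name)
        # Assumption: only plain files count as seed files.
        if not os.path.isfile(path):
            continue
        with open(path, encoding="utf-8") as handle:
            # One URL per line; blank lines are skipped (an assumption).
            urls.extend(line.strip() for line in handle if line.strip())
    return urls
```
&lt;br /&gt;
Calling read_seed_urls("./seed") before a crawl shows which URLs leftover files would still contribute.&lt;br /&gt;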
&lt;br /&gt;
To run the actual crawl do:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ant -f ccbuild.xml crawl&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will read the seed files and run the crawl.  The result is a new index directory inside the oenutch directory, named after the time of the crawl: for example, crawl-20090730201000 for a crawl run on July 30, 2009 at 8:10 PM.  After the crawl completes you need to merge the new index with the old one.&lt;br /&gt;
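&lt;br /&gt;
If several crawl directories have piled up, the name alone is enough to find the most recent one. The sketch below assumes the suffix is always year, month, day, hour, minute, second, which is inferred from the single example crawl-20090730201000; the function names are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
from datetime import datetime

PREFIX = "crawl-"


def crawl_time(dirname):
    """Parse the timestamp out of a name such as crawl-20090730201000."""
    if not dirname.startswith(PREFIX):
        raise ValueError("not a crawl directory name: " + dirname)
    # Assumed layout: YYYYMMDDHHMMSS, inferred from the example name.
    return datetime.strptime(dirname[len(PREFIX):], "%Y%m%d%H%M%S")


def newest_crawl(dirnames):
    """Return the name with the latest timestamp, or None if none parse."""
    dated = []
    for name in dirnames:
        try:
            dated.append((crawl_time(name), name))
        except ValueError:
            # Skip anything that is not a plain crawl directory.
            continue
    return max(dated)[1] if dated else None
```
&lt;br /&gt;
Names that carry extra text after the timestamp fail to parse and are ignored by newest_crawl.&lt;br /&gt;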
&lt;br /&gt;
The production index lives in /usr/local/nutch/crawl.&lt;br /&gt;
&lt;br /&gt;
To merge the index run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ ./bin/merge ./crawl-&amp;lt;timestamp&amp;gt;-merged /usr/local/nutch/crawl ./crawl-&amp;lt;timestamp&amp;gt;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The target directory (the first parameter) will be created for you. The second parameter doesn't change, and the third parameter is the directory just created by the crawl.&lt;br /&gt;
&lt;br /&gt;
After the merge completes successfully, move the merged index into the production directory.  First do something like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mv /usr/local/nutch/crawl /usr/local/nutch/crawl.20090730&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
to rename the existing index so you can go back to it if necessary.&lt;br /&gt;
&lt;br /&gt;
Then you can do&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ mv ./crawl-new-dir-merged /usr/local/nutch/crawl&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
And finally restart Tomcat (the Java app server) to make sure the new index is being used:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
$ sudo /etc/init.d/tomcat5.5 restart&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Managing curators and feeds ==&lt;br /&gt;
&lt;br /&gt;
On a6, in the /var/www/discovered.creativecommons.org/oenutch directory, running ./bin/feeds with no parameters shows the list of subcommands:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  listfeeds        list all feeds&lt;br /&gt;
  listcurators     list all curators&lt;br /&gt;
  addfeed          add a feed&lt;br /&gt;
  resetfeed        reset the last aggregation date for a feed&lt;br /&gt;
  addcurator       add a curator&lt;br /&gt;
  rmfeed           remove a feed&lt;br /&gt;
  setcurator       set the curator for a feed&lt;br /&gt;
  aggregate&lt;br /&gt;
  dump&lt;br /&gt;
  seed&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Each one is run as an argument to the feeds script (i.e. ./bin/feeds [command] [parameter1] [parameter2]...)&lt;br /&gt;
&lt;br /&gt;
=== addfeed ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
addfeed [feed_type] [feed_url] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Assuming you've added the curator for this feed with addcurator, the curator URL will set the curator (so you don't need to set it again with setcurator).&lt;br /&gt;
&lt;br /&gt;
Feed type notes: &amp;quot;rss&amp;quot; is a parser that does RSS/Atom sniffing.&lt;br /&gt;
&lt;br /&gt;
=== addcurator ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
addcurator [curator_name] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Curator names with spaces should be surrounded by quotation marks (e.g. addcurator &amp;quot;CC Open Textbook Project&amp;quot; http://www.collegeopentextbooks.org/)&lt;br /&gt;
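&lt;br /&gt;
When the command is driven from a script rather than typed at a shell, passing the arguments as a list sidesteps the quoting question. This is only a sketch: the helper addcurator_argv is hypothetical, and it builds the command without running it.&lt;br /&gt;
&lt;br /&gt;
```python
import shlex

FEEDS = "./bin/feeds"


def addcurator_argv(name, url):
    """Build the argument list for addcurator; each list item is one argument."""
    return [FEEDS, "addcurator", name, url]


argv = addcurator_argv("CC Open Textbook Project", "http://www.collegeopentextbooks.org/")
# shlex.join (Python 3.8 or later) shows the equivalent shell command line
# with quoting applied to the name that contains spaces.
print(shlex.join(argv))
```
&lt;br /&gt;
The list can be handed to subprocess.run as it is, in which case the curator name reaches the script as a single argument without any quotation marks.&lt;br /&gt;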
&lt;br /&gt;
=== setcurator ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
setcurator [feed_url] [curator_url]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=Field_Query_Mapping&amp;diff=35911</id>
		<title>Field Query Mapping</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=Field_Query_Mapping&amp;diff=35911"/>
				<updated>2010-06-21T20:06:36Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: /* Requirements */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{DiscoverEd Specification&lt;br /&gt;
|contact=Asheesh Laroia&lt;br /&gt;
|project=AgShare&lt;br /&gt;
|status=In Development&lt;br /&gt;
}}&lt;br /&gt;
The people who run a DiscoverEd instance may wish to let users search specific metadata easily. For example, http://discovered.creativecommons.org/search/ lets users search for works &amp;quot;tagged&amp;quot; with &amp;quot;banana&amp;quot; as a Dublin Core subject by searching for tag:banana. &lt;br /&gt;
&lt;br /&gt;
These prefixes, like tag:, are stored in the DiscoverEd code right now. This feature aims to move those into a configuration file.&lt;br /&gt;
&lt;br /&gt;
This feature was defined and developed during the [[DiscoverEd Sprint (June, 2010)|June 2010 DiscoverEd Sprint]]&lt;br /&gt;
&lt;br /&gt;
== Requirements ==&lt;br /&gt;
&lt;br /&gt;
When DiscoverEd crawls feeds and resources and saves metadata such as the page title, it converts this information into RDF triples; those triples are eventually saved on disk in a triple store, Jena.&lt;br /&gt;
&lt;br /&gt;
We will create a new configuration file that stores a list of mappings between short prefixes and RDF predicate URIs. For example, it could state that &amp;quot;method:&amp;quot; is shorthand for the RDF predicate http://purl.org/dc/terms/instructionalMethod, AKA dct:instructionalMethod. At indexing time, a Lucene column called &amp;quot;method&amp;quot; will be created in the Lucene document for each resource that has the dct:instructionalMethod predicate set.&lt;br /&gt;
&lt;br /&gt;
Then, at search time, Nutch's built-in query parser handles the query.&lt;br /&gt;
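&lt;br /&gt;
The index-time half of this can be sketched in a few lines. This is an illustration only, not the DiscoverEd IndexFilter: the names FIELD_FOR_PREDICATE and index_fields are hypothetical, and a plain dictionary stands in for both the configuration file and the Lucene document.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical in-memory form of the configuration: predicate URI to column name.
FIELD_FOR_PREDICATE = {
    "http://purl.org/dc/terms/instructionalMethod": "method",
}


def index_fields(triples):
    """Map (predicate, value) pairs for one resource to column values.

    Predicates without a configured column are left out.
    """
    fields = {}
    for predicate, value in triples:
        field = FIELD_FOR_PREDICATE.get(predicate)
        if field is not None:
            fields.setdefault(field, []).append(value)
    return fields


resource = [
    ("http://purl.org/dc/terms/instructionalMethod", "lecture"),
    ("http://purl.org/dc/terms/title", "Soil basics"),
]
print(index_fields(resource))
```
&lt;br /&gt;
Adding a new searchable prefix would then mean adding one entry to the mapping rather than changing code.&lt;br /&gt;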
&lt;br /&gt;
== Implementation ==&lt;br /&gt;
&lt;br /&gt;
* Added a sample configuration file&lt;br /&gt;
* Added code to our IndexFilter that looks for relevant triples and stores them in the Lucene document&lt;br /&gt;
* Problem: The Lucene documents do not seem to show our column, so we're going back to the drawing board and carefully reading the [http://wiki.apache.org/nutch/HowToMakeCustomSearch relevant Nutch documentation] to make sure we're using the APIs correctly.&lt;br /&gt;
&lt;br /&gt;
==Deferred until later==&lt;br /&gt;
&lt;br /&gt;
* Handling provenance with regard to this.&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=Field_Query_Mapping&amp;diff=35906</id>
		<title>Field Query Mapping</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=Field_Query_Mapping&amp;diff=35906"/>
				<updated>2010-06-21T18:49:08Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: Created page with '{{DiscoverEd Specification |contact=Asheesh Laroia |project=AgShare |status=In Development }} The people who run a DiscoverEd may wish to let users search specific metadata easil…'&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{DiscoverEd Specification&lt;br /&gt;
|contact=Asheesh Laroia&lt;br /&gt;
|project=AgShare&lt;br /&gt;
|status=In Development&lt;br /&gt;
}}&lt;br /&gt;
The people who run a DiscoverEd instance may wish to let users search specific metadata easily. For example, http://discovered.creativecommons.org/search/ lets users search for works &amp;quot;tagged&amp;quot; with &amp;quot;banana&amp;quot; as a Dublin Core subject by searching for tag:banana. &lt;br /&gt;
&lt;br /&gt;
These prefixes, like tag:, are stored in the DiscoverEd code right now. This feature aims to move those into a configuration file.&lt;br /&gt;
&lt;br /&gt;
This feature was defined and developed during the [[DiscoverEd Sprint (June, 2010)|June 2010 DiscoverEd Sprint]]&lt;br /&gt;
&lt;br /&gt;
== Requirements ==&lt;br /&gt;
&lt;br /&gt;
(still writing)&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=DiscoverEd_Glossary&amp;diff=35465</id>
		<title>DiscoverEd Glossary</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=DiscoverEd_Glossary&amp;diff=35465"/>
				<updated>2010-06-15T13:52:48Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:DiscoverEd]]&lt;br /&gt;
__TOC__&lt;br /&gt;
&lt;br /&gt;
=== Curator ===&lt;br /&gt;
&lt;br /&gt;
An agent (individual, organization, group) which identifies resources for inclusion in the DiscoverEd index.  A curator may be the creator or publisher of the resources, or may be a third party which identifies existing resources and possibly adds additional metadata.  A curator provides one or more feeds identifying the resources to be indexed. (FIXME: Add an example curator.)&lt;br /&gt;
&lt;br /&gt;
=== Feed ===&lt;br /&gt;
&lt;br /&gt;
A list or map of resources to be included in the index.  A feed is associated with a particular curator, and may also include metadata about the resource.  Feed is used as a generic term to include Atom/RSS (parsed using Rome) and OAI-PMH endpoints. (FIXME: Add a sample feed curated by somebody.)&lt;br /&gt;
&lt;br /&gt;
=== Resource ===&lt;br /&gt;
&lt;br /&gt;
A single resource to be indexed, identified by a curator.  Metadata about the resource may be included with it as [[RDFa]], or provided by the curator. (FIXME: Link to a sample resource.)&lt;br /&gt;
&lt;br /&gt;
=== SKOS ===&lt;br /&gt;
&lt;br /&gt;
[http://www.w3.org/2004/02/skos/ SKOS] is a set of specifications and standards to support the use of knowledge organization systems (KOS) such as thesauri, classification schemes, subject heading lists and taxonomies within the framework of the Semantic Web. (FIXME: Add a link to a SKOS data set.)&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=DiscoverEd/Development_notes&amp;diff=35428</id>
		<title>DiscoverEd/Development notes</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=DiscoverEd/Development_notes&amp;diff=35428"/>
				<updated>2010-06-15T13:04:57Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Eclipse ==&lt;br /&gt;
&lt;br /&gt;
If you use Eclipse, you'll be pleased to know that the repository contains an Eclipse project file. To get going, choose &amp;quot;Create a new project from existing sources.&amp;quot; This should import all that is necessary into Eclipse.&lt;br /&gt;
&lt;br /&gt;
=== Eclipse, Nutch, and the class path ===&lt;br /&gt;
&lt;br /&gt;
You can end up in a mess with the class path, since ant has one way of managing the class path, whereas Eclipse has a second. So things that work in Eclipse can fail in the ant targets.&lt;br /&gt;
&lt;br /&gt;
FIXME: Write more problems and solutions here.&lt;br /&gt;
&lt;br /&gt;
[[Category:DiscoverEd]]&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=DiscoverEd/Development_notes&amp;diff=35405</id>
		<title>DiscoverEd/Development notes</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=DiscoverEd/Development_notes&amp;diff=35405"/>
				<updated>2010-06-13T14:25:03Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: /* Eclipse = */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Eclipse ==&lt;br /&gt;
&lt;br /&gt;
If you use Eclipse, you'll be pleased to know that the repository contains an Eclipse project file. To get going, choose &amp;quot;Create a new project from existing sources.&amp;quot; This should import all that is necessary into Eclipse.&lt;br /&gt;
&lt;br /&gt;
=== Eclipse, Nutch, and the class path ===&lt;br /&gt;
&lt;br /&gt;
You can end up in a mess with the class path, since ant has one way of managing the class path, whereas Eclipse has a second. So things that work in Eclipse can fail in the ant targets.&lt;br /&gt;
&lt;br /&gt;
FIXME: Write more problems and solutions here.&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	<entry>
		<id>https://wiki.creativecommons.org/index.php?title=DiscoverEd/Development_notes&amp;diff=35404</id>
		<title>DiscoverEd/Development notes</title>
		<link rel="alternate" type="text/html" href="https://wiki.creativecommons.org/index.php?title=DiscoverEd/Development_notes&amp;diff=35404"/>
				<updated>2010-06-13T14:24:15Z</updated>
		
		<summary type="html">&lt;p&gt;Paulproteus: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Eclipse ==&lt;br /&gt;
&lt;br /&gt;
If you use Eclipse, you'll be pleased to know that the repository contains an Eclipse project file. To get going, choose &amp;quot;Create a new project from existing sources.&amp;quot; This should import all that is necessary into Eclipse.&lt;br /&gt;
&lt;br /&gt;
=== Eclipse, Nutch, and the class path ===&lt;br /&gt;
&lt;br /&gt;
You can end up in a mess with the class path, since ant has one way of managing the class path, whereas Eclipse has a second. So things that work in Eclipse can fail in the ant targets.&lt;br /&gt;
&lt;br /&gt;
FIXME: Write more problems and solutions here.&lt;/div&gt;</summary>
		<author><name>Paulproteus</name></author>	</entry>

	</feed>