Nutch hello world
Download and install Ant
Download and install Cygwin
Download HBase 0.94.14 (note: the stable mirror link below serves hbase-0.98.9-hadoop2)
http://mirrors.cnnic.cn/apache/hbase/stable/hbase-0.98.9-hadoop2-bin.tar.gz
Set JAVA_HOME in .bashrc
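The JAVA_HOME setting can be sketched as two lines appended to ~/.bashrc. The JDK path below is an assumption (a Cygwin-style path chosen for illustration); substitute the directory where your JDK is actually installed.

```shell
# Hypothetical JDK location; replace with your actual install directory.
export JAVA_HOME="/cygdrive/c/Java/jdk1.7.0"
# Put the JDK tools on PATH as well.
export PATH="$JAVA_HOME/bin:$PATH"
```

Open a new Cygwin shell (or run `source ~/.bashrc`) so the setting takes effect.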
Download the Nutch source package
http://mirror.bit.edu.cn/apache/nutch/2.2.1/
cd apache-nutch-2.2.1
Run ant
There is now a directory runtime/local that contains a ready-to-use Nutch installation.
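A quick way to confirm the build produced the local runtime is to look for the nutch launcher script. The helper below is a minimal sketch; the function name `check_runtime` is made up for this note.

```shell
# Report whether the build produced the local runtime.
# $1 is the Nutch source directory (apache-nutch-2.2.1 in this note).
check_runtime() {
  if [ -f "$1/runtime/local/bin/nutch" ]; then
    echo "ready"
  else
    echo "missing"
  fi
}
```

For example, `check_runtime apache-nutch-2.2.1` run from the parent directory of the source tree.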
Customize your crawl properties
Add your agent name in the value field of the http.agent.name property in conf/nutch-site.xml.
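A property entry of roughly the following shape, placed inside the `<configuration>` element of conf/nutch-site.xml, would set the agent name. The value shown is only a placeholder; pick a name that identifies your crawler.

```xml
<property>
  <name>http.agent.name</name>
  <!-- Placeholder value; replace with your own agent name. -->
  <value>My Nutch Spider</value>
</property>
```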
Edit the file conf/regex-urlfilter.txt and replace
# accept anything else
+.
with a regular expression matching the domain you wish to crawl. For example, if you wished to limit the crawl to the nutch.apache.org domain, the line should read:
+^http://([a-z0-9]*\.)*nutch.apache.org/
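To get a feel for which URLs such a line lets through, the pattern can be tried out with grep. This is only an approximation: Nutch evaluates the filter with Java regular expressions, and the helper name `url_accepted` is invented for this sketch.

```shell
# Pattern from the filter line, without the leading "+" (the accept marker).
pattern='^http://([a-z0-9]*\.)*nutch.apache.org/'

# Succeeds when the URL in $1 matches the pattern.
url_accepted() {
  printf '%s\n' "$1" | grep -Eq "$pattern"
}

url_accepted 'http://nutch.apache.org/downloads.html' && echo "accepted"
url_accepted 'http://example.com/' || echo "rejected"
```

Subdomains of nutch.apache.org match as well, because the parenthesized group may repeat any number of times.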
Specify the GORA backend in $NUTCH_HOME/conf/nutch-site.xml
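For HBase, the backend is selected through the storage.data.store.class property. The fragment below is a sketch of that entry as it would appear inside `<configuration>`; the description text is illustrative.

```xml
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>
```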
Ensure the HBase gora-hbase dependency is available in $NUTCH_HOME/ivy/ivy.xml
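In ivy.xml this usually means uncommenting the gora-hbase dependency line. The fragment below shows the expected shape; the `rev` value is an assumption for Nutch 2.2.1, so keep whatever revision your ivy.xml already lists.

```xml
<!-- rev is an assumption here; keep the revision already present in your ivy.xml. -->
<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />
```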
Ensure that HBaseStore is set as the default datastore in $NUTCH_HOME/conf/gora.properties. Further documentation for HBaseStore is available on the Apache Gora site.
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
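The setting can be double-checked from the shell. The helper below is a minimal sketch (the name `datastore_default` is made up for this note) that prints the active value and ignores commented-out lines.

```shell
# Print the value of gora.datastore.default from the properties file in $1.
# If the key appears more than once, the last uncommented entry is printed.
datastore_default() {
  grep -E '^gora\.datastore\.default=' "$1" | tail -n 1 | cut -d= -f2-
}
```

For example, `datastore_default "$NUTCH_HOME/conf/gora.properties"` should print org.apache.gora.hbase.store.HBaseStore once the file is configured as above.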
Run ant runtime
Configure SSH for Cygwin
http://hbase.apache.org/cygwin.html
Start HBase
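HBase ships a bin/start-hbase.sh launcher in its distribution. The wrapper below is a small sketch around it; the function name `start_hbase` and the directory argument are assumptions for illustration.

```shell
# Start HBase using the launcher script shipped in the distribution.
# $1 is the unpacked HBase directory (hypothetical; pass your own path).
start_hbase() {
  if [ -x "$1/bin/start-hbase.sh" ]; then
    "$1/bin/start-hbase.sh"
  else
    echo "start-hbase.sh not found under $1" >&2
    return 1
  fi
}
```

Call it with the directory where the HBase tarball was unpacked, for example `start_hbase "$HOME/hbase"` if that is where you extracted it.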
http://wiki.apache.org/nutch/NutchTutorial
Author -
LastMod 2015-01-06