Nutch hello world

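Download and install Apache Ant.

Download and install Cygwin.

Download HBase 0.94.14. Note that the link below is for a different release, the 0.98.9 hadoop2 build under stable/, so pick the archive for the version you intend to run.

http://mirrors.cnnic.cn/apache/hbase/stable/hbase-0.98.9-hadoop2-bin.tar.gz

Set JAVA_HOME in ~/.bashrc.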
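
As a minimal sketch, the lines added to ~/.bashrc might look like the following. The JDK path is a placeholder assumption, not something these notes specify; under Cygwin a Windows JDK is normally reachable below /cygdrive.

```shell
# Hypothetical JDK location: replace with the directory of your own JDK.
export JAVA_HOME="/cygdrive/c/Java/jdk1.7.0"
# Put the JDK tools first on the PATH so java and javac resolve to this JDK.
export PATH="$JAVA_HOME/bin:$PATH"
```

Open a new shell (or source ~/.bashrc) afterwards so the variables take effect.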

Download the Nutch 2.2.1 source package from:

http://mirror.bit.edu.cn/apache/nutch/2.2.1/

Unpack it and change into the source directory:

cd apache-nutch-2.2.1

Run ant.

This creates the directory runtime/local, which contains a ready-to-use Nutch installation.

Customize your crawl properties

Add your agent name in the value field of the http.agent.name property in conf/nutch-site.xml, for example:

    <property>
      <name>http.agent.name</name>
      <value>My Nutch Spider</value>
    </property>

Edit the file conf/regex-urlfilter.txt and replace

# accept anything else

+.

with a regular expression matching the domain you wish to crawl. For example, to limit the crawl to the nutch.apache.org domain, the line should read:

+^http://([a-z0-9]*\.)*nutch\.apache\.org/
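
To see which URLs a filter line like this lets through, the pattern can be tried out in the shell before crawling. This is only a sketch: it uses grep -E as a stand-in for the Java regex engine that Nutch uses, which should behave the same for a pattern this simple, and it writes every dot as a literal "\." so that it cannot match an arbitrary character. The helper name matches and the sample URLs are made up for illustration.

```shell
# The filter pattern without the leading "+" (in regex-urlfilter.txt, "+" means "accept").
pattern='^http://([a-z0-9]*\.)*nutch\.apache\.org/'

# matches URL: succeeds when URL would be accepted by the pattern.
matches() {
  printf '%s\n' "$1" | grep -Eq "$pattern"
}

# Try a few sample URLs and report the decision for each.
for url in "http://nutch.apache.org/" \
           "http://www.nutch.apache.org/docs" \
           "http://example.com/"; do
  if matches "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

The first two URLs are accepted and the third is rejected, because the host must end in nutch.apache.org, optionally preceded by subdomain labels.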

Specify the Gora backend in $NUTCH_HOME/conf/nutch-site.xml by setting the storage.data.store.class property to org.apache.gora.hbase.store.HBaseStore.

  • Ensure the gora-hbase dependency is present, and not commented out, in $NUTCH_HOME/ivy/ivy.xml.

  • Ensure that HBaseStore is set as the default datastore in $NUTCH_HOME/conf/gora.properties:

    gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
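
The two checks above can be scripted roughly as follows. The function name check_gora_hbase is hypothetical, and the ivy.xml test is a plain text match, so it cannot tell whether the dependency line sits inside an XML comment; treat it as a quick sanity check rather than a validation of the build.

```shell
# check_gora_hbase DIR: DIR is the Nutch source root ($NUTCH_HOME in the text).
# Returns 0 when both settings are found, 1 otherwise.
check_gora_hbase() {
  nutch_home="$1"
  status=0

  # gora.properties must name HBaseStore as the default datastore.
  if ! grep -Eq '^[[:space:]]*gora\.datastore\.default[[:space:]]*=[[:space:]]*org\.apache\.gora\.hbase\.store\.HBaseStore[[:space:]]*$' \
      "$nutch_home/conf/gora.properties"; then
    echo "gora.properties: HBaseStore is not the default datastore"
    status=1
  fi

  # ivy.xml must mention the gora-hbase module (text match only; an
  # occurrence inside an XML comment is NOT detected).
  if ! grep -q 'name="gora-hbase"' "$nutch_home/ivy/ivy.xml"; then
    echo "ivy.xml: no gora-hbase dependency found"
    status=1
  fi

  return "$status"
}
```

A typical call would be check_gora_hbase "$NUTCH_HOME" && echo "Gora/HBase settings look consistent".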

Run ant runtime again so that the build picks up the configuration changes.

Configure SSH for Cygwin:

http://hbase.apache.org/cygwin.html

Start HBase by running bin/start-hbase.sh from the HBase directory.

http://wiki.apache.org/nutch/NutchTutorial

http://wiki.apache.org/nutch/Nutch2Tutorial

http://hbase.apache.org/book/quickstart.html