<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	
	xmlns:georss="http://www.georss.org/georss"
	xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
	>

<channel>
	<title>Hadoop - Cognizant Transmutation</title>
	<atom:link href="https://www.ibd.com/tag/hadoop/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.ibd.com</link>
	<description>Internet Bandwidth Development: Composting the Internet for over Two Decades</description>
	<lastBuildDate>Thu, 05 Aug 2021 06:21:03 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.1</generator>

<image>
	<url>https://i0.wp.com/www.ibd.com/wp-content/uploads/2019/01/fullsizeoutput_7ae8.jpeg?fit=32%2C32&#038;ssl=1</url>
	<title>Hadoop - Cognizant Transmutation</title>
	<link>https://www.ibd.com</link>
	<width>32</width>
	<height>32</height>
</image> 
<atom:link rel="hub" href="https://pubsubhubbub.appspot.com"/><atom:link rel="hub" href="https://pubsubhubbub.superfeedr.com"/><atom:link rel="hub" href="https://websubhub.com/hub"/><site xmlns="com-wordpress:feed-additions:1">156814061</site>	<item>
		<title>Upcoming Mini-tutorial at BigDataCamp: How to Build a Hadoop Cluster from Scratch in 20 Minutes by CTO of Infochimps</title>
		<link>https://www.ibd.com/scalable-deployment/1341/</link>
		
		<dc:creator><![CDATA[Robert J Berger]]></dc:creator>
		<pubDate>Sat, 11 Feb 2012 03:08:16 +0000</pubDate>
				<category><![CDATA[Opscode Chef]]></category>
		<category><![CDATA[Scalable Deployment]]></category>
		<category><![CDATA[Sysadmin]]></category>
		<category><![CDATA[Apache Hadoop]]></category>
		<category><![CDATA[BigData]]></category>
		<category><![CDATA[Chef]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Infochimps]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Opscode]]></category>
		<guid isPermaLink="false">http://blog2.ibd.com/?p=1341</guid>

					<description><![CDATA[<p>Flip Kromer (@mrflip), CTO of Infochimps, will give an overview and tutorial on using the latest version of Ironfan (which until today was called cluster_chef) at&#8230;</p>
<p>The post <a href="https://www.ibd.com/scalable-deployment/1341/">Upcoming Mini-tutorial at BigDataCamp: How to Build a Hadoop Cluster from Scratch in 20 Minutes by CTO of Infochimps</a> first appeared on <a href="https://www.ibd.com">Cognizant Transmutation</a>.</p>]]></description>
										<content:encoded><![CDATA[<p><a href="http://www.infochimps.com/"><img decoding="async" loading="lazy" class="alignleft wp-image-1356 size-full" title="chimpmark" src="https://i0.wp.com/www.ibd.com/wp-content/uploads/2012/02/chimpmark.png?resize=200%2C200" alt="Infochimps Icon" width="200" height="200" srcset="https://i0.wp.com/www.ibd.com/wp-content/uploads/2012/02/chimpmark.png?w=200&amp;ssl=1 200w, https://i0.wp.com/www.ibd.com/wp-content/uploads/2012/02/chimpmark.png?resize=150%2C150&amp;ssl=1 150w" sizes="(max-width: 200px) 100vw, 200px" data-recalc-dims="1" /></a><a href="https://github.com/mrflip">Flip Kromer</a> (<a title="Flip Twitter Handle" href="https://twitter.com/#!/mrflip">@mrflip</a>), CTO of <a href="http://blog.infochimps.com/">Infochimps</a>, will give an overview and tutorial on using the latest version of <a title="Ironfan Github repo" href="https://github.com/infochimps/ironfan" target="_blank" rel="noopener">Ironfan</a> (which until today was called <a href="https://github.com/infochimps/cluster_chef/tree/version_3">cluster_chef</a>) at the <a href="http://www.bigdatacamp.org/siliconvalley/2012-02-27/">BigDataCamp unconference</a> put on by <a href="http://twitter.com/davenielsen">Dave Nielsen</a> just before <a href="http://strataconf.com/strata2012?cmp=af-conf-st12-affiliate-bdc">O&#8217;Reilly&#8217;s Strata Conference</a>, Feb 27 from 5:30pm to 10pm.</p>
<p>We&#8217;ve been using cluster_chef at <a title="Runa Home Page" href="http://www.runa.com" target="_blank" rel="noopener">Runa</a> as the basis of our Chef management for our entire production environment for the last few months. I&#8217;m very excited about what Flip and his team have done to turn Ironfan into a pretty nice way to orchestrate the <a title="Opscode Home" href="http://wiki.opscode.com" target="_blank" rel="noopener">Chef</a> deployment of complex clusters of servers, with a focus on supporting the <a title="Apache Hadoop Home" href="http://hadoop.apache.org/" target="_blank" rel="noopener">Hadoop Ecosystem</a> and <a title="AWS Home" href="http://aws.amazon.com" target="_blank" rel="noopener">EC2</a>. It&#8217;s not specific to Hadoop or EC2, but it has a lot of support for both, which is what we really liked.</p>
<p>Here&#8217;s the blurb:</p>
<p class="p1"><strong>How to Build a Hadoop Cluster from Scratch in 20 Minutes</strong></p>
<p class="p1">In this tutorial, Flip Kromer, CTO of Infochimps will introduce Ironfan (formerly Cluster Chef), Infochimps&#8217; open-source tool for orchestrated systems provisioning.  It builds on Chef, <a class="zem_slink" title="Opscode" href="http://www.opscode.com" rel="homepage">Opscode</a>&#8216;s beloved open-source tool for provisioning cloud machines and adds a number of superpowers that allow you to provision and deploy coordinated clusters of machines all at once.  Stop monkeying around, spending days or weeks spinning up clusters, and manually copying and pasting IP addresses to glue all the pieces together.  Just spin them up when you need them and kill them when you don&#8217;t; let Ironfan handle the details.  Now, you can spend your money, time and engineering focus on more important things &#8211; like finding insights in your data.</p>
<p class="p1">The Ironfan demo will run approximately 30 minutes and we welcome all attendees to bring short demos (3-5 minutes) of awesome things they have done with Chef or Ironfan (formerly Cluster Chef).  Flip will also be available for Q&amp;A at the end of this session or later in the evening over beers.</p>
<p class="p1">You should sign up in advance for the BigDataCamp at<a href="http://bigdatacamp-santaclara-2012-eivtefrnd.eventbrite.com/"> http://bigdatacamp-santaclara-2012-eivtefrnd.eventbrite.com/</a> Its free but they like to know how many are coming.</p>
<div class="zemanta-pixie" style="margin-top: 10px; height: 15px;"><a class="zemanta-pixie-a" title="Enhanced by Zemanta" href="http://www.zemanta.com/"><img decoding="async" class="zemanta-pixie-img" style="border: none; float: right;" src="https://i0.wp.com/img.zemanta.com/zemified_e.png" alt="Enhanced by Zemanta" data-recalc-dims="1" /></a></div><p>The post <a href="https://www.ibd.com/scalable-deployment/1341/">Upcoming Mini-tutorial at BigDataCamp: How to Build a Hadoop Cluster from Scratch in 20 Minutes by CTO of Infochimps</a> first appeared on <a href="https://www.ibd.com">Cognizant Transmutation</a>.</p>]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">1341</post-id>	</item>
		<item>
		<title>HBase/Hadoop on Mac OS X (Pseudo-Distributed)</title>
		<link>https://www.ibd.com/howto/hbase-hadoop-on-mac-ox-x/</link>
					<comments>https://www.ibd.com/howto/hbase-hadoop-on-mac-ox-x/#comments</comments>
		
		<dc:creator><![CDATA[Robert J Berger]]></dc:creator>
		<pubDate>Mon, 03 May 2010 03:50:13 +0000</pubDate>
				<category><![CDATA[HowTo]]></category>
		<category><![CDATA[Macintosh]]></category>
		<category><![CDATA[Scalable Deployment]]></category>
		<category><![CDATA[Sysadmin]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Mac OS X]]></category>
		<guid isPermaLink="false">http://blog2.ibd.com/?p=565</guid>

					<description><![CDATA[<p>I wanted to do some experimenting with various tools for doing Hadoop and HBase activities and didn&#8217;t want to have to bother making it work&#8230;</p>
<p>The post <a href="https://www.ibd.com/howto/hbase-hadoop-on-mac-ox-x/">HBase/Hadoop on Mac OS X (Pseudo-Distributed)</a> first appeared on <a href="https://www.ibd.com">Cognizant Transmutation</a>.</p>]]></description>
										<content:encoded><![CDATA[<p>I wanted to do some experimenting with various tools for doing Hadoop and HBase activities and didn&#8217;t want to have to bother making it work with our Cluster in the Cloud. I just wanted a simple experimental environment on my MacBook Pro running Snow Leopard Mac OS X.</p>
<p>So I thought it was time to revisit installing Hadoop and HBase on the Mac using the latest versions of everything. This will be deployed in Pseudo-Distributed mode, native to Mac OS X. Some folks actually create a set of Linux VMs with a full Hadoop/HBase stack and run that on the Mac, but that is a bit of overkill for now.</p>
<p>These instructions mainly follow the standard instructions for <a href="http://hadoop.apache.org/common/docs/current/quickstart.html" target="_blank">Apache Hadoop</a> and <a href="http://hadoop.apache.org/hbase/docs/current/api/overview-summary.html#pseudo-distrib" target="_blank">Apache HBase</a></p>
<h2>Prerequisites</h2>
<p>The Mac OS X Xcode developer tools, which include Java 1.6.x. You can get these for free from the <a href="https://developer.apple.com/mac/" target="_blank">Apple Mac Dev Center</a>. You have to become a member, but a free membership is available.</p>
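<p>You can confirm that Java is available from the command line; the exact version string will depend on your Xcode release:</p>
<pre><code>java -version</code></pre>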
<h2>Download and Unpack Latest Distros</h2>
<p>You can get a link to a mirror for Hadoop via the <a href="http://www.apache.org/dyn/closer.cgi/hadoop/core/" target="_blank">Hadoop Apache Mirror link</a> and for HBase at the <a href="http://www.apache.org/dyn/closer.cgi/hadoop/hbase/" target="_blank">HBase Apache Mirror link</a>. Each of those links will bring you to a suggested mirror for Hadoop or HBase. Once you click on the suggested link, it will bring you to a mirror with the recent releases. You can click on the <em>stable</em> link, which will then bring you to a directory that has the latest stable Hadoop (as of this writing: hadoop-0.20.2.tar.gz) or HBase (as of this writing: hbase-0.20.3.tar.gz). Click on those tar.gz files to download them.</p>
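<p>If you would rather fetch the tarballs from the command line, something like the following works (the mirror hostname here is a placeholder; substitute the one the mirror page hands you):</p>
<pre><code>curl -O http://a.mirror.example.org/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
curl -O http://a.mirror.example.org/hadoop/hbase/hbase-0.20.3/hbase-0.20.3.tar.gz</code></pre>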
<p>I am going to keep the distros in ~/work/pkgs: I unpack the tar files there as numbered versions and then create symbolic links to them in ~/work. But you can do all this in any directory that you control:</p>
<pre><code>cd ~/work
mkdir -p pkgs
cd pkgs
tar xvzf hadoop-0.20.2.tar.gz
tar xvzf hbase-0.20.3.tar.gz
cd ..
ln -s pkgs/hadoop-0.20.2 hadoop
ln -s pkgs/hbase-0.20.3 hbase
mkdir -p hadoop/logs
mkdir -p hbase/logs</code></pre>
<p>Now you can have your tools all access ~/work/hadoop or ~/work/hbase and not care what version it is. You can update to a later version just by downloading and untarring the new distro and then changing the symbolic links.</p>
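<p>For example, upgrading to a hypothetical newer Hadoop release would look like this (the version number is illustrative):</p>
<pre><code>cd ~/work/pkgs
tar xvzf hadoop-0.20.3.tar.gz
cd ..
rm hadoop                       # removes only the old symlink
ln -s pkgs/hadoop-0.20.3 hadoop</code></pre>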
<h2>Configure Hadoop</h2>
<p>All the configuration files mentioned here will be in <em>~/work/hadoop/conf</em>. In this example we are assuming that the Hadoop servers will only be accessed from <em>localhost</em>. If you need to make them accessible from other hosts or VMs on your LAN that support Bonjour, you could use the Bonjour name (i.e., the name of your Mac followed by .local, such as <em>mymac.local</em>) instead of <em>localhost</em> in the following Hadoop and HBase configurations.</p>
<h3>hadoop-env.sh</h3>
<p>Mainly, you need to tell Hadoop where your JAVA_HOME is.</p>
<p>Add the following line below the commented-out JAVA_HOME line in hadoop-env.sh:</p>
<pre><code>export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home</code></pre>
<h3>core-site.xml</h3>
<pre><code>&lt;?xml version="1.0"?&gt;
&lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&gt;

&lt;configuration&gt;
  &lt;property&gt;
    &lt;name&gt;fs.default.name&lt;/name&gt;
    &lt;value&gt;hdfs://localhost:9000&lt;/value&gt;
  &lt;/property&gt;
&lt;/configuration&gt;</code></pre>
<h3>hdfs-site.xml</h3>
<pre><code>&lt;?xml version="1.0"?&gt;
&lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&gt;

&lt;configuration&gt;
  &lt;property&gt;
    &lt;name&gt;dfs.replication&lt;/name&gt;
    &lt;value&gt;1&lt;/value&gt;
  &lt;/property&gt;
&lt;/configuration&gt;</code></pre>
<h3>mapred-site.xml</h3>
<pre><code>&lt;?xml version="1.0"?&gt;
&lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&gt;

&lt;configuration&gt;
  &lt;property&gt;
    &lt;name&gt;mapred.job.tracker&lt;/name&gt;
    &lt;value&gt;localhost:9001&lt;/value&gt;
  &lt;/property&gt;
&lt;/configuration&gt;</code></pre>
<h3>Make sure you can ssh without a password to the hostname used in the configs</h3>
<p>The Hadoop and HBase start/stop scripts use ssh to access the various servers. Since we are running in Pseudo-Distributed mode, everything runs on the <em>localhost</em>, but the scripts still need to be able to ssh to it.</p>
<h4>Check that you can ssh to the <em>localhost</em> (or whatever hostname you used in the above configs)</h4>
<p>We&#8217;re assuming that we&#8217;ll be running the Hadoop/HBase servers as the same user as our login. You can set things up to run as a hadoop user, but it&#8217;s kind of complicated on Mac OS X. See the section <em>File System Layout</em> in an earlier post, <em><a href="http://blog2.ibd.com/scalable-deployment/hadoop-hdfs-and-hbase-on-ubuntu/" target="_blank">Hadoop, HDFS and Hbase on Ubuntu &amp; Macintosh Leopard</a>.</em> That section and a few other points throughout that post describe how to create and use a hadoop user to run the Hadoop and HBase servers.</p>
<p>Back to just doing this as our own user. Test that you can ssh to the <em>localhost</em> without a password:</p>
<pre>ssh localhost</pre>
<p>If you see output like the following that ends with a password prompt, then you need to add a key to your ssh setup that does not require a password (you may need to answer yes if asked whether you want to continue connecting).</p>
<pre>The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is 3c:5d:6a:39:64:78:02:9d:a3:c9:69:68:50:23:71:eb.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Password:</pre>
<p>To create a passwordless key and add it to your set of authorized keys that can access your host, do the following (as yourself, not as root. The id_dsa file name can be arbitrary):</p>
<pre>ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa_for_hadoop
cat ~/.ssh/id_dsa_for_hadoop.pub &gt;&gt; ~/.ssh/authorized_keys</pre>
<p>If you have strong alternative opinions on how to set up your own keys to accomplish the same thing, please do it your own way. This is just the basic way of doing passwordless ssh; you may want to use a key you already have lying around or some other mechanism.</p>
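<p>For instance, if you stick with the dedicated key created above, an entry in ~/.ssh/config will point ssh at it (a minimal sketch; adjust the file name if you picked a different one):</p>
<pre><code># ~/.ssh/config
Host localhost
    IdentityFile ~/.ssh/id_dsa_for_hadoop</code></pre>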
<h3>Start Hadoop</h3>
<h4>One time format of  Hadoop File System</h4>
<p>Before the first time you use Hadoop, and only once, you have to format the Hadoop File System. Don&#8217;t do this again once you have data in your Hadoop file system, as it will erase anything you might have saved there. You may have to run this command again if you somehow mangle your file system, but it&#8217;s not something to do lightly a second time.</p>
<pre>~/work/hadoop/bin/hadoop namenode -format</pre>
<p>If all goes well, you should see something like:</p>
<pre>10/05/02 18:45:04 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = Psion.local/192.168.50.16
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
10/05/02 18:45:04 INFO namenode.FSNamesystem: fsOwner=rberger,rberger,admin,com.apple.access_screensharing,_developer,_lpoperator,_lpadmin,_appserveradm,_appserverusr,localaccounts,everyone,com.apple.sharepoint.group.2,com.apple.sharepoint.group.3,dev,com.apple.sharepoint.group.1,workgroup
10/05/02 18:45:04 INFO namenode.FSNamesystem: supergroup=supergroup
10/05/02 18:45:04 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/05/02 18:45:04 INFO common.Storage: Image file of size 97 saved in 0 seconds.
10/05/02 18:45:04 INFO common.Storage: Storage directory /tmp/hadoop-rberger/dfs/name has been successfully formatted.
10/05/02 18:45:04 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at Psion.local/192.168.50.16
************************************************************/</pre>
<h4>Starting and stopping Hadoop</h4>
<p>Now you can start Hadoop. You will use this command to start Hadoop in general:</p>
<pre>~/work/hadoop/bin/start-all.sh</pre>
<p>You can stop Hadoop with the command</p>
<pre>~/work/hadoop/bin/stop-all.sh</pre>
<p>But remember if you are running HBase, stop that first, then stop Hadoop.</p>
<h3>Making sure Hadoop is working</h3>
<p>You can see the Hadoop logs in ~/work/hadoop/logs</p>
<p>You should be able to see the Hadoop Namenode web interface at <a href="http://localhost:50070/" target="_blank">http://localhost:50070/</a> and the JobTracker web interface at <a href="http://localhost:50030/" target="_blank">http://localhost:50030/</a>. If not, check that you have 5 Java processes running, each of which has one of the following as the end of its command line (as seen from a <code>ps ax | grep hadoop</code> command):</p>
<pre>org.apache.hadoop.mapred.JobTracker
org.apache.hadoop.hdfs.server.namenode.NameNode
org.apache.hadoop.mapred.TaskTracker
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode
org.apache.hadoop.hdfs.server.datanode.DataNode</pre>
<p>If you do not see these 5 processes, check the logs in ~/work/hadoop/logs/*.{out,log} for messages that might give you a hint as to what went wrong.</p>
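<p>If the JDK&#8217;s <code>jps</code> tool is on your PATH, it gives a quicker view of the same thing:</p>
<pre><code>jps
# should list NameNode, DataNode, SecondaryNameNode,
# JobTracker and TaskTracker (plus Jps itself)</code></pre>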
<h4>Run some example map/reduce jobs</h4>
<p>The Hadoop distro comes with some example/test map/reduce jobs. Here we&#8217;ll run them and make sure things are working end to end.</p>
<pre><code>cd ~/work/hadoop
# Copy the input files into the distributed filesystem
# (there will be no output visible from the command):
bin/hadoop fs -put conf input
# Run some of the examples provided:
# (there will be a large amount of INFO statements as output)
bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
# Examine the output files:
bin/hadoop fs -cat output/part-00000
</code></pre>
<p>The resulting output should be something like:</p>
<pre>3	dfs.class
2	dfs.period
1	dfs.file
1	dfs.replication
1	dfs.servers
1	dfsadmin
1	dfsmetrics.log</pre>
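<p>You can also poke around the distributed filesystem directly. Note that the example job refuses to run if its output directory already exists, so remove it before re-running:</p>
<pre><code># from ~/work/hadoop
bin/hadoop fs -ls input
bin/hadoop fs -ls output
bin/hadoop fs -rmr output</code></pre>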
<h2>Configuring HBase</h2>
<p>The following config files all reside in <em>~/work/hbase/conf</em>. As mentioned earlier, use an FQDN or a Bonjour name instead of localhost if you need remote clients to access HBase. But if you don&#8217;t use localhost here, make sure you make the same change in the Hadoop config.</p>
<h3>hbase-env.sh</h3>
<p>Add the following line below the commented-out JAVA_HOME line in hbase-env.sh:</p>
<pre><code>export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home</code></pre>
<p>Add the following line below the commented-out HBASE_CLASSPATH= line:</p>
<pre><code>export HBASE_CLASSPATH=${HOME}/work/hadoop/conf</code></pre>
<h3>hbase-site.xml</h3>
<pre><code>&lt;?xml version="1.0"?&gt;
&lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&gt;
&lt;?xml version="1.0"?&gt;&lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&gt;
&lt;configuration&gt;
  &lt;property&gt;
    &lt;name&gt;hbase.rootdir&lt;/name&gt;
    &lt;value&gt;hdfs://localhost:9000/hbase&lt;/value&gt;
    &lt;description&gt;The directory shared by region servers.
    &lt;/description&gt;
  &lt;/property&gt;
&lt;/configuration&gt;
</code></pre>
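<p>With Hadoop already running, you can now start HBase:</p>
<pre>~/work/hbase/bin/start-hbase.sh</pre>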
<h3>Making Sure HBase is Working</h3>
<p>If you do a <code>ps ax | grep hbase</code> you should see two Java processes. One should end with:<br />
<code>org.apache.hadoop.hbase.zookeeper.HQuorumPeer start</code><br />
And the other should end with:<br />
<code>org.apache.hadoop.hbase.master.HMaster start</code><br />
Since we are running in the Pseudo-Distributed mode, there will not be any explicit regionservers running. If you have problems, check the logs in ~/work/hbase/logs/*.{out,log}</p>
<h3>Testing HBase using the HBase Shell</h3>
<p>From the unix prompt give the following command:</p>
<pre>~/work/hbase/bin/hbase shell</pre>
<p>Here are some example commands from the Apache HBase installation instructions:</p>
<pre>hbase&gt; # Type "help" to see shell help screen
hbase&gt; help
hbase&gt; # To create a table named "mylittletable" with a column family of "mylittlecolumnfamily", type
hbase&gt; create "mylittletable", "mylittlecolumnfamily"
hbase&gt; # To see the schema for the "mylittletable" table you just created and its single "mylittlecolumnfamily", type
hbase&gt; describe "mylittletable"
hbase&gt; # To add a row whose id is "myrow", to the column "mylittlecolumnfamily:x" with a value of 'v', do
hbase&gt; put "mylittletable", "myrow", "mylittlecolumnfamily:x", "v"
hbase&gt; # To get the cell just added, do
hbase&gt; get "mylittletable", "myrow"
hbase&gt; # To scan your new table, do
hbase&gt; scan "mylittletable"</pre>
<p>You can stop HBase with the command:</p>
<pre>~/work/hbase/bin/stop-hbase.sh</pre>
<p>Once that has stopped, you can stop Hadoop:</p>
<pre>~/work/hadoop/bin/stop-all.sh</pre>
<h2>Conclusion</h2>
<p>You should now have a fully working Pseudo-Distributed Hadoop/HBase setup on your Mac. It is not suitable for any kind of large-data or production project; in fact it will probably fall over if you try to do anything with lots of data or high volumes of I/O. HBase does not seem to work well until you have 4&#8211;5 regionservers.</p>
<p>But this Pseudo-Distributed version should be fine for doing experiments with tools and small data sets.</p>
<p>Now I can get on with playing with <a href="http://github.com/clj-sys/cascading-clojure" target="_blank">Cascading-Clojure</a> and <a href="http://nathanmarz.com/blog/introducing-cascalog/" target="_blank">Cascalog</a>!</p><p>The post <a href="https://www.ibd.com/howto/hbase-hadoop-on-mac-ox-x/">HBase/Hadoop on Mac OS X (Pseudo-Distributed)</a> first appeared on <a href="https://www.ibd.com">Cognizant Transmutation</a>.</p>]]></content:encoded>
					
					<wfw:commentRss>https://www.ibd.com/howto/hbase-hadoop-on-mac-ox-x/feed/</wfw:commentRss>
			<slash:comments>25</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">565</post-id>	</item>
		<item>
		<title>Experience installing Hbase 0.20.0 Cluster on Ubuntu 9.04 and EC2</title>
		<link>https://www.ibd.com/howto/experience-installing-hbase-0-20-0-cluster-on-ubuntu-9-04-and-ec2/</link>
					<comments>https://www.ibd.com/howto/experience-installing-hbase-0-20-0-cluster-on-ubuntu-9-04-and-ec2/#comments</comments>
		
		<dc:creator><![CDATA[Robert J Berger]]></dc:creator>
		<pubDate>Sat, 05 Sep 2009 01:34:41 +0000</pubDate>
				<category><![CDATA[HowTo]]></category>
		<category><![CDATA[Runa]]></category>
		<category><![CDATA[Scalable Deployment]]></category>
		<category><![CDATA[AWS]]></category>
		<category><![CDATA[Cloud Computing]]></category>
		<category><![CDATA[EC2]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Sysadmin]]></category>
		<category><![CDATA[ubuntu]]></category>
		<guid isPermaLink="false">http://blog2.ibd.com/?p=237</guid>

					<description><![CDATA[<p>NOTE (Sep 7 2009): Updated info on need to use Amazon Private DNS Names and clarified the need for the masters, slaves and regionservers files.&#8230;</p>
<p>The post <a href="https://www.ibd.com/howto/experience-installing-hbase-0-20-0-cluster-on-ubuntu-9-04-and-ec2/">Experience installing Hbase 0.20.0 Cluster on Ubuntu 9.04 and EC2</a> first appeared on <a href="https://www.ibd.com">Cognizant Transmutation</a>.</p>]]></description>
										<content:encoded><![CDATA[<p><strong>NOTE (Sep 7 2009):</strong> Updated info on need to use Amazon Private DNS Names and clarified the need for the masters, slaves and regionservers files. Also updated to use HBase 0.20.0 Release Candidate 3</p>
<h2>Introduction</h2>
<p>As someone who has &#8220;skipped&#8221; Java and wants to learn as little as possible about it, and who has not had much experience with Hadoop so far, I found that HBase deployment has a big learning curve. So some of the things I describe below may be obvious to those who have experience in those domains.</p>
<h2>Where are the docs for HBase 0.20?</h2>
<p>If you go to the HBase wiki, you will find that there is not much documentation on the 0.20 version. This puzzled me, since all the twittering, blog posting and other buzz was talking about people using 0.20 even though it&#8217;s &#8220;pre-release.&#8221;</p>
<p>One of the great things about going to meetups such as the <a title="HBase Meetup" href="http://www.meetup.com/hbaseusergroup/" target="_blank">HBase Meetup</a> is that you can talk to the folks who actually wrote the thing and ask them, &#8220;Where is the documentation for HBase 0.20?&#8221;</p>
<p>Turns out it&#8217;s in the HBase 0.20.0 distribution, in the docs directory. The easiest thing is to get the <a href="http://people.apache.org/~stack/hbase-0.20.0-candidate-3" target="_blank">pre-built 0.20.0 release candidate 3</a>. If you download the source from the version control repository you have to build the documentation using Ant. If you are a Java/Ant kind of person it might not be hard, but just to build the docs you have to meet a number of dependencies first.</p>
<h2>What we learnt with 0.19.x</h2>
<p>We have been learning a lot about making an HBase cluster work at a basic level. I had a lot of problems getting 0.19.x running beyond a single node in Pseudo-Distributed mode. I think a lot of my problems were just not understanding how it all fit together with Hadoop and what the different startup/shutdown scripts did.</p>
<p>Then we finally tried the <a href="http://issues.apache.org/jira/browse/HBASE-838" target="_blank">HBase EC2 Scripts</a>, even though they use an AMI based on Fedora 8 and seemed hard-wired to 0.19.0. It&#8217;s a pretty nice script if you want an opinionated HBase cluster setup, and it did educate us on how to get a cluster to go. It has a bit of strangeness in having a script in /root/hbase_init that is called at boot time to configure all the Hadoop and HBase conf files and then call the Hadoop and HBase startup scripts. Something like this is kind of needed for Amazon EC2, since you don&#8217;t really know what the IP address/FQDN is until boot time.</p>
<p>The scripts also set up an Amazon Security Group for the cluster master and one for the rest of the cluster. I believe it then uses this as a way to identify the group as well.</p>
<p>The main thing we got out of it was that by going through the /root/hbase_init script we were able to figure out the process for bringing up Hadoop/HBase as a cluster.</p>
<p>We did build a staging cluster with this script, and we were able to pretty easily change it to use 0.19.3 instead of 0.19.0. But its opinions differed from ours on many things. Plus, after talking to the folks at the HBase Meetup, and having all sorts of weird problems with our app on 0.19.3, we were convinced that our future is in HBase 0.20. And 0.20 introduces new things like using ZooKeeper to manage master selection, so it seems it&#8217;s not worth it for us to continue using this script. Though it helped our learning quite a bit!</p>
<h2>Building an HBase 0.20.0 Cluster</h2>
<p>This post will use the HBase pre-built Release Candidate 3 and the pre-built standard Hadoop 0.20.0.</p>
<p>This post will show how to do all this &#8220;by hand&#8221;. Hopefully we&#8217;ll have an article on how to do all this with Chef sometime soon.</p>
<p>The HBase folks say that you really should have at least 5 regionservers and one master. The master and several of the regionservers can also run the ZooKeeper quorum. Of course the master server is also going to run the Hadoop NameNode and Secondary NameNode; the 5 other nodes then run the Hadoop HDFS DataNodes as well as the HBase regionservers. When you build out larger clusters, you will probably want to dedicate machines to ZooKeeper and hot-standby HBase masters. The NameNode is still a Single Point of Failure (SPOF); rumour has it that this will be fixed in Hadoop 0.21.</p>
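<p>As a rough sketch, the role layout for such a minimal cluster looks like this (hostnames are placeholders):</p>
<pre>master    NameNode, SecondaryNameNode, HMaster, ZooKeeper
node1     DataNode, HRegionServer, ZooKeeper
node2     DataNode, HRegionServer, ZooKeeper
node3-5   DataNode, HRegionServer</pre>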
<p>We&#8217;re not using Map/Reduce yet so we won&#8217;t go into that, but it&#8217;s just a matter of different startup scripts to make the same nodes run Map/Reduce as well as HDFS and HBase.</p>
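<p>(For reference, the Map/Reduce daemons have their own start/stop scripts in the same bin directory; we don&#8217;t run them in this post:)</p>
<pre>/mnt/hadoop/bin/start-mapred.sh
/mnt/hadoop/bin/stop-mapred.sh</pre>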
<p>In this example, we&#8217;re installing and running everything as root. It can also be done as a special user like hadoop, as described in the earlier blog post <a href="http://blog2.ibd.com/scalable-deployment/hadoop-hdfs-and-hbase-on-ubuntu/" target="_blank">Hadoop, HDFS and Hbase on Ubuntu &amp; Macintosh Leopard</a>.</p>
<h2 style="font-size: 1.17em;">Getting the pre-requisites in order</h2>
<p>We started with the vanilla <a href="http://alestic.com/" target="_blank">alestic</a> Ubuntu 9.04 Jaunty 64-bit Server AMI, <a href="http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1951&amp;categoryID=101" target="_blank">ami-5b46a732</a>, and instantiated 6 High-CPU Large instances. You really want as much memory and as many cores as you can get. You can do the following by hand or combine it with the shell scripting described below in the section <em>Installing Hadoop and HBase</em>.</p>
<pre>apt-get update
apt-get upgrade</pre>
<p>Then added via apt-get install:</p>
<pre>apt-get install sun-java6-jdk</pre>
<h3>Downloading Hadoop and HBase</h3>
<p>You can use the production Hadoop 0.20.0 release. You can find it via the mirror list at http://www.apache.org/dyn/closer.cgi/hadoop/core/. The examples below use one mirror:</p>
<pre>wget http://mirror.cloudera.com/apache/hadoop/core/hadoop-0.20.0/hadoop-0.20.0.tar.gz</pre>
<p>You can download the HBase 0.20.0 Release Candidate 3 in a prebuilt form from <a href="http://people.apache.org/~stack/hbase-0.20.0-candidate-3/" target="_blank">http://people.apache.org/~stack/hbase-0.20.0-candidate-3/</a>. (You can get the source out of version control, <a href="http://hadoop.apache.org/hbase/version_control.html" target="_blank">http://hadoop.apache.org/hbase/version_control.html</a>, but you&#8217;ll have to figure out how to build it.)</p>
<pre>wget http://people.apache.org/~stack/hbase-0.20.0-candidate-3/hbase-0.20.0.tar.gz</pre>
<h3>Installing Hadoop and HBase</h3>
<p>Assume that you are working in your home directory on the master server, that the versioned packages will be unpacked into /mnt/pkgs, and that symlinks in /mnt will point at the current hadoop and hbase homes.</p>
<p>You can do some simple shell scripting to do the following on all the nodes at once:</p>
<p>Create a file named <em>servers</em> containing the fully qualified domain names of all your servers, with &#8220;localhost&#8221; standing in for the master.</p>
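<p>For the two-host layout used in the configs later in this post (where the master also doubles as a datanode), <em>servers</em> might look like:</p>
<pre>localhost
domU-12-31-39-06-9D-C1.compute-1.internal</pre>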
<p>Make sure you can ssh to all the servers from the master, ideally using ssh keys. On the master:</p>
<pre>ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub &gt;&gt; ~/.ssh/authorized_keys</pre>
<p>On each of your regionservers, make sure that the id_dsa.pub is also in authorized_keys (don&#8217;t delete any other keys you already have there!).</p>
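<p>One way to push the key out to every host, assuming you can still reach each of them with a password or an existing key (a sketch using the servers file from above):</p>
<pre>for host in `cat servers`
do
  cat ~/.ssh/id_dsa.pub | ssh $host 'cat &gt;&gt; ~/.ssh/authorized_keys'
done</pre>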
<p>Now with a bit of shell command line scripting you can install on all your servers at once:</p>
<pre>for host in `cat servers`
 do
 echo $host
 ssh $host 'apt-get update; apt-get upgrade; apt-get install sun-java6-jdk'
 scp ~/hadoop-0.20.0.tar.gz ~/hbase-0.20.0.tar.gz $host:
 ssh $host 'mkdir -p /mnt/pkgs; cd /mnt/pkgs; tar xzf ~/hadoop-0.20.0.tar.gz; tar xzf ~/hbase-0.20.0.tar.gz; ln -s /mnt/pkgs/hadoop-0.20.0 /mnt/hadoop; ln -s /mnt/pkgs/hbase-0.20.0 /mnt/hbase'
done</pre>
<h4>Use Amazon Private DNS Names in Config files</h4>
<p>So far I have found that it&#8217;s best to use the Amazon Private DNS names in the Hadoop and HBase config files. It looks like HBase uses the system hostname to determine various things at runtime, and this is always the Private DNS name. It also means that it&#8217;s difficult to use the web GUI interfaces to HBase from outside of the Amazon cloud. I set up a &#8220;desktop&#8221; version of Ubuntu running in the Amazon cloud that I VNC (or NX) into and use its browser to view the web interface.</p>
<p>In any case, Amazon instances normally have limited TCP/UDP access to the outside world due to the default security group settings. You would have to add the various ports used by HBase and Hadoop to the security group to allow outside access.</p>
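<p>With the classic EC2 API tools, opening a port looks roughly like this (the group name, port, and source address are all illustrative; 60010 happens to be the HBase master web UI):</p>
<pre>ec2-authorize default -P tcp -p 60010 -s 203.0.113.5/32</pre>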
<p>If you do use the Amazon Public DNS names in the config files, there will be startup errors like the following for each instance that is assigned to the zookeeper quorum (there may be other errors as well, but these are the most obvious):</p>
<pre>ec2-75-101-104-121.compute-1.amazonaws.com: java.io.IOException: Could not find my address: domU-12-31-39-06-9D-51.compute-1.internal in list of ZooKeeper quorum servers
ec2-75-101-104-121.compute-1.amazonaws.com:     at org.apache.hadoop.hbase.zookeeper.HQuorumPeer.writeMyID(HQuorumPeer.java:128)
ec2-75-101-104-121.compute-1.amazonaws.com:     at org.apache.hadoop.hbase.zookeeper.HQuorumPeer.main(HQuorumPeer.java:67)</pre>
<h3>Configuring Hadoop</h3>
<p>Now you have to configure Hadoop on the master in /mnt/hadoop/conf:</p>
<h4>hadoop-env.sh:</h4>
<p>The minimal things to change are:</p>
<p>Set your JAVA_HOME to where the java package is installed. On Ubuntu:</p>
<pre>export JAVA_HOME=/usr/lib/jvm/java-6-sun</pre>
<p>Add the hbase path to the HADOOP_CLASSPATH:</p>
<pre>export HADOOP_CLASSPATH=/mnt/hbase/hbase-0.20.0.jar:/mnt/hbase/hbase-0.20.0-test.jar:/mnt/hbase/conf</pre>
<h4>core-site.xml:</h4>
<p>Here is what we used, primarily setting where the Hadoop files live and the NameNode host and port:</p>
<pre>&lt;?xml version="1.0"?&gt;
&lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&gt;

&lt;configuration&gt;
   &lt;property&gt;
     &lt;name&gt;hadoop.tmp.dir&lt;/name&gt;
     &lt;value&gt;/mnt/hadoop&lt;/value&gt;
   &lt;/property&gt;

   &lt;property&gt;
     &lt;name&gt;fs.default.name&lt;/name&gt;
     &lt;value&gt;hdfs://domU-12-31-39-06-9D-51.compute-1.internal:50001&lt;/value&gt;
   &lt;/property&gt;

   &lt;property&gt;
     &lt;name&gt;tasktracker.http.threads&lt;/name&gt;
     &lt;value&gt;80&lt;/value&gt;
   &lt;/property&gt;
&lt;/configuration&gt;</pre>
<h4>mapred-site.xml:</h4>
<p>Even though we are not currently using Map/Reduce this is a basic config:</p>
<pre>&lt;?xml version="1.0"?&gt;
&lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&gt;

&lt;configuration&gt;
   &lt;property&gt;
     &lt;name&gt;mapred.job.tracker&lt;/name&gt;
     &lt;value&gt;domU-12-31-39-06-9D-51.compute-1.internal:50002&lt;/value&gt;
   &lt;/property&gt;

   &lt;property&gt;
     &lt;name&gt;mapred.tasktracker.map.tasks.maximum&lt;/name&gt;
     &lt;value&gt;4&lt;/value&gt;
   &lt;/property&gt;

   &lt;property&gt;
     &lt;name&gt;mapred.tasktracker.reduce.tasks.maximum&lt;/name&gt;
     &lt;value&gt;4&lt;/value&gt;
   &lt;/property&gt;

   &lt;property&gt;
     &lt;name&gt;mapred.output.compress&lt;/name&gt;
     &lt;value&gt;true&lt;/value&gt;
   &lt;/property&gt;

   &lt;property&gt;
     &lt;name&gt;mapred.output.compression.type&lt;/name&gt;
     &lt;value&gt;BLOCK&lt;/value&gt;
   &lt;/property&gt;
&lt;/configuration&gt;</pre>
<h4>hdfs-site.xml:</h4>
<p>The main thing to change based on your config is dfs.replication. It should be no greater than the total number of datanodes/regionservers.</p>
<pre>&lt;?xml version="1.0"?&gt;
&lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&gt;

&lt;configuration&gt;
   &lt;property&gt;
     &lt;name&gt;dfs.client.block.write.retries&lt;/name&gt;
     &lt;value&gt;3&lt;/value&gt;
   &lt;/property&gt;

   &lt;property&gt;
     &lt;name&gt;dfs.replication&lt;/name&gt;
     &lt;value&gt;3&lt;/value&gt;
   &lt;/property&gt;
&lt;/configuration&gt;</pre>
<p>Put the fully qualified domain name of your master in the file <em>masters</em> and the names of the datanodes in the file <em>slaves</em>.</p>
<h4>masters:</h4>
<pre>domU-12-31-39-06-9D-51.compute-1.internal</pre>
<h4>slaves:</h4>
<pre>domU-12-31-39-06-9D-C1.compute-1.internal
domU-12-31-39-06-9D-51.compute-1.internal</pre>
<p>We did not change any of the other files so far.</p>
<p>Now copy these files to the data-nodes:</p>
<pre>for host in `cat slaves`
do
  echo $host
  scp slaves masters hdfs-site.xml hadoop-env.sh core-site.xml ${host}:/mnt/hadoop/conf
done</pre>
<p>Also format HDFS on the master:</p>
<pre>/mnt/hadoop/bin/hadoop namenode -format</pre>
<h3>Configuring HBase</h3>
<h4>hbase-env.sh:</h4>
<p>Similar to hadoop-env.sh, you must set JAVA_HOME:</p>
<pre>export JAVA_HOME=/usr/lib/jvm/java-6-sun</pre>
<p>and add the hadoop conf directory to the HBASE_CLASSPATH:</p>
<pre>export HBASE_CLASSPATH=/mnt/hadoop/conf</pre>
<p>And for the master you will want to say:</p>
<pre>export HBASE_MANAGES_ZK=true</pre>
<h4>hbase-site.xml:</h4>
<p>Mainly you need to define the HBase master, the HBase rootdir, and the list of ZooKeepers. We also had to bump up hbase.zookeeper.property.maxClientCnxns from the default of 30 to 300.</p>
<pre>&lt;?xml version="1.0"?&gt;
&lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&gt;
&lt;configuration&gt;
   &lt;property&gt;
     &lt;name&gt;hbase.master&lt;/name&gt;
     &lt;value&gt;domU-12-31-39-06-9D-51.compute-1.internal:60000&lt;/value&gt;
   &lt;/property&gt;

   &lt;property&gt;
     &lt;name&gt;hbase.rootdir&lt;/name&gt;
     &lt;value&gt;hdfs://domU-12-31-39-06-9D-51.compute-1.internal:50001/hbase&lt;/value&gt;
   &lt;/property&gt;
   &lt;property&gt;
     &lt;name&gt;hbase.zookeeper.quorum&lt;/name&gt;
     &lt;value&gt;domU-12-31-39-06-9D-51.compute-1.internal,domU-12-31-39-06-9D-C1.compute-1.internal,domU-12-31-39-06-9D-51.compute-1.internal&lt;/value&gt;
   &lt;/property&gt;
   &lt;property&gt;
     &lt;name&gt;hbase.cluster.distributed&lt;/name&gt;
     &lt;value&gt;true&lt;/value&gt;
   &lt;/property&gt;
   &lt;property&gt;
     &lt;name&gt;hbase.zookeeper.property.maxClientCnxns&lt;/name&gt;
     &lt;value&gt;300&lt;/value&gt;
   &lt;/property&gt;
&lt;/configuration&gt;</pre>
<p>You will also need a file called regionservers. Normally it contains the same hostnames as the Hadoop slaves file:</p>
<h4>regionservers:</h4>
<pre>domU-12-31-39-06-9D-C1.compute-1.internal
domU-12-31-39-06-9D-51.compute-1.internal</pre>
<p>Copy the files to the region-servers:</p>
<pre>for host in `cat regionservers`
do
  echo $host
  scp hbase-env.sh hbase-site.xml regionservers ${host}:/mnt/hbase/conf
done</pre>
<h3>Starting Hadoop and HBase</h3>
<p>On the master:</p>
<p>(This just starts the Hadoop File System services, not Map/Reduce services)</p>
<pre>/mnt/hadoop/bin/start-dfs.sh</pre>
<p>Then start HBase:</p>
<pre>/mnt/hbase/bin/start-hbase.sh</pre>
<p>You can shut things down by doing the reverse:</p>
<pre>/mnt/hbase/bin/stop-hbase.sh
/mnt/hadoop/bin/stop-dfs.sh</pre>
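<p>To have the daemons come up at boot you can wrap these commands in an /etc/init.d-style script. A bare sketch (assuming everything runs as root, as in this post):</p>
<pre>#!/bin/sh
# /etc/init.d/hadoop-hbase -- bare-bones sketch
case "$1" in
  start)
    /mnt/hadoop/bin/start-dfs.sh
    /mnt/hbase/bin/start-hbase.sh
    ;;
  stop)
    /mnt/hbase/bin/stop-hbase.sh
    /mnt/hadoop/bin/stop-dfs.sh
    ;;
esac</pre>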
<p>A fuller version of such init scripts is described in the <em>Ubuntu /etc/init.d style startup scripts</em> section of the earlier blog post: <a href="http://blog2.ibd.com/scalable-deployment/hadoop-hdfs-and-hbase-on-ubuntu/" target="_blank">Hadoop, HDFS and Hbase on Ubuntu &amp; Macintosh Leopard</a>.</p><p>The post <a href="https://www.ibd.com/howto/experience-installing-hbase-0-20-0-cluster-on-ubuntu-9-04-and-ec2/">Experience installing Hbase 0.20.0 Cluster on Ubuntu 9.04 and EC2</a> first appeared on <a href="https://www.ibd.com">Cognizant Transmutation</a>.</p>]]></content:encoded>
					
					<wfw:commentRss>https://www.ibd.com/howto/experience-installing-hbase-0-20-0-cluster-on-ubuntu-9-04-and-ec2/feed/</wfw:commentRss>
			<slash:comments>10</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">237</post-id>	</item>
		<item>
		<title>Want to work at a Startup with Cool Tech? (HBase, Clojure, Chef, Swarms, Javascript, Ruby &#038; Rails)</title>
		<link>https://www.ibd.com/macintosh/want-to-work-at-a-startup-with-cool-tech-hbase-clojure-chef-swarms-javascript-ruby-rails/</link>
		
		<dc:creator><![CDATA[Robert J Berger]]></dc:creator>
		<pubDate>Fri, 28 Aug 2009 18:15:01 +0000</pubDate>
				<category><![CDATA[Macintosh]]></category>
		<category><![CDATA[Opscode Chef]]></category>
		<category><![CDATA[Ruby / Rails]]></category>
		<category><![CDATA[Runa]]></category>
		<category><![CDATA[Scalable Deployment]]></category>
		<category><![CDATA[AWS]]></category>
		<category><![CDATA[Git]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[rabbitmq]]></category>
		<category><![CDATA[tweekts]]></category>
		<category><![CDATA[ubuntu]]></category>
		<guid isPermaLink="false">http://blog2.ibd.com/?p=253</guid>

					<description><![CDATA[<p>Opportunity Knocks Runa.com, the startup where I am CTO, is looking for great developers to join our small agile team. We&#8217;re an early stage, pre-series-A&#8230;</p>
<p>The post <a href="https://www.ibd.com/macintosh/want-to-work-at-a-startup-with-cool-tech-hbase-clojure-chef-swarms-javascript-ruby-rails/">Want to work at a Startup with Cool Tech? (HBase, Clojure, Chef, Swarms, Javascript, Ruby & Rails)</a> first appeared on <a href="https://www.ibd.com">Cognizant Transmutation</a>.</p>]]></description>
										<content:encoded><![CDATA[<h1 style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;"><strong>Opportunity Knocks</strong></h1>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">Runa.com, the startup where I am CTO, is looking for great developers to join our small agile team. We&#8217;re an early stage, pre-series-A startup (presently funded with strategic investments from two large corporations). Runa offers a SaaS to on-line merchant that allows them to offer dynamic product and consumer specific promotions embeded in their website. This will be a very large positive disruption to the online retailing world.</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;"><span style="text-decoration: underline;">Techie keywords:</span> <strong>clojure, hadoop, hbase, rabbitmq, erlang, chef, swarm computing, ruby, rails, javascript, amazon EC2, emacs, Macintosh, Linux, selenium, test/behavior driven development, agile, lean, XP, scalability</strong></p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">If you&#8217;re interested, email  <a href="mailto:jobs@runa.com">jobs@runa.com</a></p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">If you want to know more, read on!</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<h1 style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;"><strong>What do we do</strong></h1>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">Runa aims to provide the top of the long tail thru the middle of the top 500 online retailers with tools/services that companies like amazon.com use/provide. These smaller guys can&#8217;t afford or don&#8217;t have the resources to do anything on that scale, but by using our SaaS services, they can make more money while providing customers with greater value.</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">The first service we&#8217;re building is what we call Dynamic Sale Price.</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">It&#8217;s a simple concept &#8211; it allows the online-retailer to offer a sale price for each product on his site, personalized to the individual consumer who is browsing it. By using this service, merchants are able to &#8211;</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<ul>
<li>Increase conversion (get them to buy!) and</li>
<li>Offer consumers a special price which maximizes the merchant&#8217;s profit</li>
</ul>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">This is different from &#8220;dumb-discounting&#8221; where something is marked-down, and everyone sees the same price. This service is more like airline or hotel pricing which varies from day to day, but much more dynamic and real-time. Further, it is based on broad statistical factors AND individual consumer behavior. After all, if you lower prices enough, consumers will buy. Instead, we dynamically lower prices to a point where statistically, that consumer is most likely to buy.</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<h1 style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;"><strong>How we do it</strong></h1>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">Runa does this by performing statistical analysis and pattern recognition of what consumers are doing on the merchant sites. This includes browsing products on various pages, adding and removing items from carts, and purchasing or abandoning the carts. We track consumers as they browse, and collect vast quantities of this click-stream data. By mining this data and applying algorithms to determine a price point per consumer based on their behavior, we&#8217;re able to  maximize both conversion (getting the consumer to buy) AND merchant profit.</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">We also offer the merchant comprehensive reports based on analysis of the mountains of data we collect. Since the data tracks consumer activity down to the individual product SKU level (for each individual consumer), we can provide very rich analytics.  This is a tool that merchants need today, but don&#8217;t have the resources to build for themselves.</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<h1 style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;"><strong>The business model</strong></h1>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">For reference, it is useful to understand the affiliate marketing space. Small-to-medium merchants (our target audience) pay affiliates up to 40% of a sale price. Yes, 40%. The average is in the 20% range.</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">We charge our merchants around 10% of sales the Runa delivers. Our merchants are happy to pay it, because it is a performance-based pay, lower than what they pay affiliates, and there is zero up-front cost to the service. In fact, the above mentioned analytics reports are free.</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">We&#8217;re targeting e-commerce PLATFORMS (as opposed to individual merchants); in this way, we&#8217;re able to scale up merchant-acquisition. We have 10 early-customer merchants right now, with about 100 more planned to go live in the next 2-3 months. By the end of next year, we&#8217;re targeting about 1,000 merchants and 10,000 merchants the following year. Our channel deployment model makes these goals achievable.</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">At something like a 5 to 10% service charge, and a typical merchant having between 500K to 1M in sales per year, this is a VERY profitable business model. That is, of course, if we&#8217;re successful&#8230; but we&#8217;re seeing very positive signs so far.</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<h1 style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;"><strong>Technology</strong></h1>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">Most of our front-end stuff (like the merchant-dashboard, reports, campaign management) is built with Ruby on Rails. Our merchant integration requires browser-side Javascript magic. All our analytics (batch-processing) and real-time pricing services are written in Clojure. We use RabbitMQ for all our messaging needs. We store data in HBase. We&#8217;re deployed on Amazon&#8217;s EC2.</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">Here are a few blog postings about what we&#8217;ve been up to &#8211;</p>
<p><a href="http://s-expressions.com/2009/05/02/startup-logbook-distributed-clojure-system-in-production-v02/" target="_blank">Distributed Clojure system in production</a><br />
<a href="http://s-expressions.com/2009/04/12/using-messaging-for-scalability/" target="_blank">Using messaging for scalability</a><br />
<a href="http://s-expressions.com/2009/03/31/capjure-a-simple-hbase-persistence-layer/" target="_blank">Capjure: a simple HBase persistence layer</a><br />
<a href="http://s-expressions.com/2009/01/28/startup-logbook-clojure-in-production-release-v01/" target="_blank">Clojure in production<br />
</a><span style="color: #0000ee; "><span style="text-decoration: underline;"><a href="http://blog2.ibd.com/scalable-deployment/experience-installing-hbase-0-20-0-cluster-on-ubuntu-9-04-and-ec2/" target="_blank">Experience installing Hbase 0.20.0 Cluster on Ubuntu 9.04 and EC2</a></span></span></p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">We&#8217;ve also open-sourced a few of our projects &#8211;</p>
<p><a href="http://github.com/amitrathore/swarmiji/tree/master" target="_blank">swarmiji</a> &#8211; A distributed computing system to write and run Clojure code in parallel, across CPUs<br />
<a href="http://github.com/amitrathore/capjure/tree/master" target="_blank">capjure</a> &#8211; Clojure persistence for HBase</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<h1 style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;"><strong>Culture at Runa</strong></h1>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">We&#8217;re a small team, very passionate about what we do. We&#8217;re focused on delivering a ground-breaking, disruptive service that will allow merchants to really change the way they sell online. We work start-up hours, but we&#8217;re flexible and laid-back about it. We know that a healthy personal life is important for a good professional life. We work with each other to support it.</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">We use an agile process with a lot of influences from the &#8220;Lean&#8221;:http://en.wikipedia.org/wiki/Lean_software_development and &#8220;Kanban&#8221;:http://leansoftwareengineering.com/2007/08/29/kanban-systems-for-software-development/ world. We use &#8220;Mingle&#8221;:http://studios.thoughtworks.com/mingle-agile-project-management to run our development process. Everything, OK mostly everything <img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /> is covered by automated tests, so we can change things as needed.</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">We&#8217;re all Apple in the office &#8211; developers get a MacPro with a nice 30&#8243; screen, and a nice 17&#8243; MacBook Pro.  We deploy on Ubuntu servers.  Aeron chairs are cliché, yes; but, very comfy.</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">The environment is chilled out&#8230; you can wear shorts and sandals to work&#8230;  Very flat organization, very non-bureaucratic&#8230; nice open spaces (no cubes!). Lunch is brought in on most days! Beer and snacks are always in the fridge.</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">We&#8217;re walking distance to the San Antonio Caltrain station (biking distance from the Mountain View Caltrain/VTA lightrail station).</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<h1 style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;"><strong>What&#8217;s in it for you</strong></h1>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<ul>
<li>Competitive salaries, and lots of stock-options</li>
<li>Cutting edge technology stack</li>
<li>Fantastic business opportunity, and early-stage (= great time to join!)</li>
<li>Developer #5 &#8211; means plenty of influence on foundational architecture and design</li>
<li>Smart, full bandwidth, fun people to work with</li>
<li>Very comfortable, nice office environment</li>
<li>We have a &#8220;No Assholes&#8221; policy</li>
</ul>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<h1 style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;"><strong>OK!</strong></h1>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">So, if you&#8217;re interested, email us at <a href="mailto:jobs@runa.com">jobs@runa.com</a></p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">No recruiters please!</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">We would prefer folks who are already in the Bay Area (but if you not local and are really great let&#8217;s talk!)</p>
<p>The post <a href="https://www.ibd.com/macintosh/want-to-work-at-a-startup-with-cool-tech-hbase-clojure-chef-swarms-javascript-ruby-rails/">Want to work at a Startup with Cool Tech? (HBase, Clojure, Chef, Swarms, Javascript, Ruby & Rails)</a> first appeared on <a href="https://www.ibd.com">Cognizant Transmutation</a>.</p>]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">253</post-id>	</item>
		<item>
		<title>Hadoop, HDFS and Hbase on Ubuntu &#038; Macintosh Leopard</title>
		<link>https://www.ibd.com/runa/hadoop-hdfs-and-hbase-on-ubuntu/</link>
					<comments>https://www.ibd.com/runa/hadoop-hdfs-and-hbase-on-ubuntu/#comments</comments>
		
		<dc:creator><![CDATA[Robert J Berger]]></dc:creator>
		<pubDate>Tue, 06 Jan 2009 02:19:16 +0000</pubDate>
				<category><![CDATA[Runa]]></category>
		<category><![CDATA[Scalable Deployment]]></category>
		<category><![CDATA[Sysadmin]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[ubuntu]]></category>
		<guid isPermaLink="false">http://blog2.ibd.com/?p=95</guid>

					<description><![CDATA[<p>UPDATE: This has been replaced by a newer post Experience installing Hbase 0.20.0 Cluster on Ubuntu 9.04 and EC2 . I found that using the&#8230;</p>
<p>The post <a href="https://www.ibd.com/runa/hadoop-hdfs-and-hbase-on-ubuntu/">Hadoop, HDFS and Hbase on Ubuntu & Macintosh Leopard</a> first appeared on <a href="https://www.ibd.com">Cognizant Transmutation</a>.</p>]]></description>
										<content:encoded><![CDATA[<p><strong>UPDATE: </strong>This has been replaced by a newer post <a href="http://blog2.ibd.com/scalable-deployment/experience-installing-hbase-0-20-0-cluster-on-ubuntu-9-04-and-ec2/" target="_blank">Experience installing Hbase 0.20.0 Cluster on Ubuntu 9.04 and EC2</a> . I found that using the pre-built distributions of Hadoop and HBase much better than trying to build from source. I need more Java/Ant-fu to do the build from scratch. The HBase-0.20.0 Release Candidates are really great and seemingly easier to get the cluster going than previous releases.</p>
<h2>Introduction</h2>
<p>Hadoop and MapReduce are all the rage nowadays, so we figure we should be using them too.</p>
<p>HBase is an implementation of Google&#8217;s Bigtable. It&#8217;s built on top of the Hadoop Distributed File System (HDFS).</p>
<p>It&#8217;s trivial to install HBase standalone on top of a local filesystem, but I had some difficulty getting it working on top of HDFS in the &#8220;Pseudo-Distributed&#8221; mode.</p>
<h2>Follow the Instructions</h2>
<p>I set up Hadoop with no problems following the <a href="http://hadoop.apache.org/core/docs/current/quickstart.html#PseudoDistributed">instructions on the Hadoop site</a> for Pseudo-Distributed Operation, which runs HBase on top of HDFS but with everything on one server (i.e., it&#8217;s configured pretty much like a cluster, but all the pieces are on the same host). Another helpful set of instructions is at <a href="http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Single-Node_Cluster%29">Running Hadoop On Ubuntu Linux (Single-Node Cluster)</a>.</p>
<p>I followed the <a href="http://hadoop.apache.org/hbase/docs/current/api/overview-summary.html#overview_description">HBase installation instructions</a>, also for Pseudo-Distributed Operation.</p>
<p>A few things to be aware of (a quick sanity check follows the list):</p>
<ul>
<li>Make sure that the Hadoop and HBase major version numbers are the same<br />
(I used Hadoop 0.18.2 and HBase 0.18.1)</li>
<li>Make sure that the Hadoop and HBase trees, as well as the directories and files that hold the HDFS filesystem, are owned by hadoop:hadoop (you have to create the user and group)</li>
<li>There is no need to disable IPv6, despite what some sites say</li>
</ul>
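<p>Once the trees are unpacked as laid out below, a quick (hypothetical) sanity check for the first two points:</p>
<pre># Sanity check (assumes the /usr/local layout described below)
/usr/local/hadoop/bin/hadoop version
ls -ld /usr/local/hadoop /usr/local/hbase /var/hadoop_datastore</pre>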
<p>You can download the Hadoop tar file from <a href="http://www.apache.org/dyn/closer.cgi/hadoop/core/" target="_blank">http://www.apache.org/dyn/closer.cgi/hadoop/core/</a> and the HBase tar file from <a href="http://www.apache.org/dyn/closer.cgi/hadoop/hbase/" target="_blank">http://www.apache.org/dyn/closer.cgi/hadoop/hbase/</a>. They are also available as git repositories via:</p>
<pre>git clone git://git.apache.org/hadoop.git
git clone git://git.apache.org/hbase.git</pre>
<p>You can track a particular branch with the following commands (we&#8217;re stuck at Hadoop 0.19.1 / HBase 0.19.0):</p>
<pre>cd hadoop
git branch --track release-0.19.1 origin/tags/release-0.19.1
git checkout release-0.19.1
cd ../hbase
git branch --track 0.19.0 origin/tags/0.19.0
git checkout 0.19.0</pre>
<p>Then build in each directory. As far as I can tell you just need the default ant build, but you can build the jar as well:</p>
<pre>cd ../hadoop
ant
ant jar</pre>
<pre>cd ../hbase
ant
ant jar</pre>
<h2>Biggest Problem I Had</h2>
<p>The thing that took the longest to get right was accessing HBase from other hosts. You would think you could put the DNS Fully Qualified Domain Name (FQDN) in the config file. It turns out that, by default, the Hadoop tools don&#8217;t seem to use the host&#8217;s DNS resolver, just what is in /etc/hosts (as far as I can tell). So you have to use the IP address in the config file.</p>
<p>I believe there are ways to configure around this, but I haven&#8217;t found them yet.</p>
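<p>Since the tools appear to consult only /etc/hosts, a workaround I have not verified is to pin the master&#8217;s name to its address there on every node that needs to reach it (the hostname below is a placeholder):</p>
<pre># Untested workaround: map the master's FQDN to its IP in /etc/hosts
echo "192.168.10.50   master.example.com" &gt;&gt; /etc/hosts</pre>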
<h2>Configuration Examples</h2>
<h2>File System Layout</h2>
<p>I untarred the distributions into /usr/local/pkgs, made symbolic links to /usr/local/hadoop and /usr/local/hbase, and created the directory that Hadoop/HDFS will use for storage.</p>
<p>For Ubuntu:</p>
<pre>sudo addgroup hadoop
sudo adduser --ingroup hadoop hadoop</pre>
<p>For Mac:</p>
<p>Create a Home Directory</p>
<pre>mkdir /Users/_hadoop</pre>
<p>Find an unused groupid by seeing what ids are already in use:</p>
<pre>sudo dscl . -list /Groups PrimaryGroupID | cut -c 32-34 | sort -rn</pre>
<p>Then find an unused userid by seeing what userid&#8217;s are in use:</p>
<pre>sudo dscl . -list /Users UniqueID | cut -c 20-22 | sort -rn</pre>
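<p>Or, as a convenience (just a sketch), take the highest id currently in use and add one:</p>
<pre>sudo dscl . -list /Users UniqueID | awk '{print $2}' | sort -n | tail -1</pre>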
<p>Pick a number that is in neither list. In our case we will use 402 for both the userid and groupid for _hadoop (Mac OS X puts an underscore in front of daemon user/group names). We will also append the record name hadoop, without the underscore, so the account can be referred to either way:</p>
<pre>sudo dscl . -create /Groups/_hadoop PrimaryGroupID 402
sudo dscl . -append /Groups/_hadoop RecordName hadoop</pre>
<p>Use the same number (402 in our case) as the UniqueID and PrimaryGroupID in the following commands:</p>
<pre>sudo dscl . -create /Users/_hadoop UniqueID 402
sudo dscl . -create /Users/_hadoop RealName "Hadoop Service"
sudo dscl . -create /Users/_hadoop PrimaryGroupID 402
sudo dscl . -create /Users/_hadoop NFSHomeDirectory /Users/_hadoop
sudo dscl . -append /Users/_hadoop RecordName hadoop</pre>
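<p>You can verify the new account took by reading it back (the attributes should match what was created above):</p>
<pre>sudo dscl . -read /Users/_hadoop UniqueID PrimaryGroupID NFSHomeDirectory RecordName</pre>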
<p>For both Ubuntu and Mac (note that on the Mac the user/group will be named _hadoop):</p>
<pre>cd /usr/local/pkgs
tar xzf hadoop-0.18.2.tar.gz
tar xzf hbase-0.18.1.tar.gz

cd ..
ln -s /usr/local/pkgs/hadoop-0.18.2 hadoop
ln -s /usr/local/pkgs/hbase-0.18.1 hbase
mkdir /var/hadoop_datastore
chown -R hadoop:hadoop hadoop/ hbase/ /var/hadoop_datastore
# On the Mac, also: chown -R hadoop:hadoop /Users/_hadoop</pre>
<h2>Hadoop Config files</h2>
<p>The following are all in /usr/local/hadoop/conf</p>
<h4>hadoop-env.sh</h4>
<p>You need to set the JAVA_HOME variable. I installed Java 6 via Synaptic. You can also install it with:</p>
<pre>apt-get install sun-java6-jdk</pre>
<p>The Macintosh is easy if you have an Intel Core 2 Duo (the original Intel Core Duo doesn&#8217;t count); Apple only supports Java 1.6 on their 64-bit processors. If you have a 32-bit processor, like the first-generation 17&#8243; MacBook Pro or first-generation Mac Mini, or you have a PPC, see <a href="http://wiki.netbeans.org/JavaFXAndJDK6On32BitMacOS" target="_blank">Tech Tip: How to Set Up JDK 6 and JavaFX on 32-bit Intel Macs</a></p>
<p>So my config is (only the things I changed, the rest was left as is):</p>
<pre>...
# The java implementation to use.  Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
 export JAVA_HOME=/usr/lib/jvm/java-6-sun
...</pre>
<p>For the Macintosh:</p>
<pre>export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/Current</pre>
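<p>On either platform, a quick check that JAVA_HOME points at a working JDK (Ubuntu path shown):</p>
<pre>/usr/lib/jvm/java-6-sun/bin/java -version</pre>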
<h4>hadoop-site.xml</h4>
<pre>&lt;?xml version="1.0"?&gt;
&lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&gt;
&lt;!-- Put site-specific property overrides in this file. --&gt;
&lt;configuration&gt;
&lt;property&gt;
  &lt;name&gt;hadoop.tmp.dir&lt;/name&gt;
  &lt;value&gt;/var/hadoop_datastore/hadoop-${user.name}&lt;/value&gt;
  &lt;description&gt;A base for other temporary directories.&lt;/description&gt;
&lt;/property&gt;

&lt;property&gt;
  &lt;name&gt;fs.default.name&lt;/name&gt;
  &lt;value&gt;hdfs://localhost:54310&lt;/value&gt;
  &lt;description&gt;The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.&lt;/description&gt;
&lt;/property&gt;

&lt;property&gt;
  &lt;name&gt;mapred.job.tracker&lt;/name&gt;
  &lt;value&gt;localhost:54311&lt;/value&gt;
  &lt;description&gt;The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  &lt;/description&gt;
&lt;/property&gt;

&lt;property&gt;
  &lt;name&gt;dfs.replication&lt;/name&gt;
  &lt;value&gt;1&lt;/value&gt;
  &lt;description&gt;Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  &lt;/description&gt;
&lt;/property&gt;
&lt;!-- As per note in http://mail-archives.apache.org/mod_mbox/hadoop-hbase-user/200810.mbox/&lt;C20126171.post@talk.nabble.com&gt; --&gt;
&lt;property&gt;
  &lt;name&gt;dfs.datanode.socket.write.timeout&lt;/name&gt;
  &lt;value&gt;0&lt;/value&gt;
&lt;/property&gt;

&lt;property&gt;
   &lt;name&gt;dfs.datanode.max.xcievers&lt;/name&gt;
   &lt;value&gt;1023&lt;/value&gt;
&lt;/property&gt;
&lt;/configuration&gt;</pre>
<h2>HBase Config Files</h2>
<p>The following are all in /usr/local/hbase/conf</p>
<h4>hbase-env.sh</h4>
<p>Again, just need to set up JAVA_HOME:</p>
<pre>...
# The java implementation to use.  Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
export JAVA_HOME=/usr/lib/jvm/java-6-sun
...</pre>
<p>For the Macintosh:</p>
<pre>export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/Current</pre>
<h4>hbase-site.xml</h4>
<p>Here is where I wanted to give an FQDN for the host that runs the hbase.master, but I had to use an IP address instead.</p>
<pre>&lt;?xml version="1.0"?&gt;
&lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&gt;
&lt;configuration&gt;
  &lt;property&gt;
    &lt;name&gt;hbase.rootdir&lt;/name&gt;
    &lt;value&gt;hdfs://localhost:54310/hbase&lt;/value&gt;
    &lt;description&gt;The directory shared by region servers.
    Should be fully-qualified to include the filesystem to use.
    E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR
    &lt;/description&gt;
  &lt;/property&gt;

  &lt;property&gt;
    &lt;name&gt;hbase.master&lt;/name&gt;
    &lt;value&gt;192.168.10.50:60000&lt;/value&gt;
    &lt;description&gt;The host and port that the HBase master runs at.
    &lt;/description&gt;
  &lt;/property&gt;
&lt;/configuration&gt;</pre>
<h2>Formatting the Name Node</h2>
<p>You must do this as the same user that will be running the daemon (hadoop):</p>
<pre>su hadoop -s /bin/sh -c "/usr/local/hadoop/bin/hadoop namenode -format"</pre>
<p>on the Mac:</p>
<pre>/usr/bin/su _hadoop /usr/local/hadoop/bin/hadoop namenode -format</pre>
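<p>If the format succeeded, the name node&#8217;s image directory should now exist under the datastore (the path follows from hadoop.tmp.dir above and the default dfs.name.dir):</p>
<pre>ls -l /var/hadoop_datastore/hadoop-hadoop/dfs/name</pre>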
<h2>Setup passphraseless ssh</h2>
<p>Now check that you can ssh to the localhost without a passphrase:</p>
<pre>su - hadoop
ssh localhost</pre>
<p>If you cannot ssh to localhost without a passphrase, execute the following commands (as hadoop):</p>
<pre>$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub &gt;&gt; ~/.ssh/authorized_keys</pre>
<h2>Ubuntu /etc/init.d style startup scripts</h2>
<p>I scoured the InterTubes for example hadoop/hbase startup scripts and found absolutely none! I ended up creating minimal ones that are so far only suited to the Pseudo-Distributed Operation mode, as they just call the start-all / stop-all scripts.</p>
<h4>/etc/init.d/hadoop</h4>
<p>Create the place it will put its startup logs:</p>
<pre>mkdir /var/log/hadoop</pre>
<p>Create /etc/init.d/hadoop with the following:</p>
<pre>#!/bin/sh
### BEGIN INIT INFO
# Provides:          hadoop services
# Required-Start:    $network
# Required-Stop:     $network
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Description:       Hadoop services
# Short-Description: Enable Hadoop services including hdfs
### END INIT INFO
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
HADOOP_BIN=/usr/local/hadoop/bin
NAME=hadoop
DESC=hadoop
USER=hadoop
ROTATE_SUFFIX=
test -x $HADOOP_BIN || exit 0
RETVAL=0
set -e
cd /

start_hadoop () {
    set +e
    su $USER -s /bin/sh -c $HADOOP_BIN/start-all.sh &gt; /var/log/hadoop/startup_log
    case "$?" in
      0)
        echo SUCCESS
        RETVAL=0
        ;;
      1)
        echo TIMEOUT - check /var/log/hadoop/startup_log
        RETVAL=1
        ;;
      *)
        echo FAILED - check /var/log/hadoop/startup_log
        RETVAL=1
        ;;
    esac
    set -e
}

stop_hadoop () {
    set +e
    if [ $RETVAL = 0 ] ; then
        su $USER -s /bin/sh -c $HADOOP_BIN/stop-all.sh &gt; /var/log/hadoop/shutdown_log
        RETVAL=$?
        if [ $RETVAL != 0 ] ; then
            echo FAILED - check /var/log/hadoop/shutdown_log
        fi
    else
        echo No nodes running
        RETVAL=0
    fi
    set -e
}

restart_hadoop() {
    stop_hadoop
    start_hadoop
}

case "$1" in
    start)
        echo -n "Starting $DESC: "
        start_hadoop
        echo "$NAME."
        ;;
    stop)
        echo -n "Stopping $DESC: "
        stop_hadoop
        echo "$NAME."
        ;;
    force-reload|restart)
        echo -n "Restarting $DESC: "
        restart_hadoop
        echo "$NAME."
        ;;
    *)
        echo "Usage: $0 {start|stop|restart|force-reload}" &gt;&amp;2
        RETVAL=1
        ;;
esac
exit $RETVAL</pre>
<h4>/etc/init.d/hbase</h4>
<p>Create the place it will put its startup logs:</p>
<pre>mkdir /var/log/hbase</pre>
<p>Create /etc/init.d/hbase with the following:</p>
<pre>#!/bin/sh
### BEGIN INIT INFO
# Provides:          hbase services
# Required-Start:    $network
# Required-Stop:     $network
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Description:       Hbase services
# Short-Description: Enable Hbase services including hdfs
### END INIT INFO

PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
HBASE_BIN=/usr/local/hbase/bin
NAME=hbase
DESC=hbase
USER=hadoop
ROTATE_SUFFIX=
test -x $HBASE_BIN || exit 0
RETVAL=0
set -e
cd /

start_hbase () {
    set +e
    su $USER -s /bin/sh -c $HBASE_BIN/start-hbase.sh &gt; /var/log/hbase/startup_log
    case "$?" in
      0)
        echo SUCCESS
        RETVAL=0
        ;;
      1)
        echo TIMEOUT - check /var/log/hbase/startup_log
        RETVAL=1
        ;;
      *)
        echo FAILED - check /var/log/hbase/startup_log
        RETVAL=1
        ;;
    esac
    set -e
}

stop_hbase () {
    set +e
    if [ $RETVAL = 0 ] ; then
        su $USER -s /bin/sh -c $HBASE_BIN/stop-hbase.sh &gt; /var/log/hbase/shutdown_log
        RETVAL=$?
        if [ $RETVAL != 0 ] ; then
            echo FAILED - check /var/log/hbase/shutdown_log
        fi
    else
        echo No nodes running
        RETVAL=0
    fi
    set -e
}

restart_hbase() {
    stop_hbase
    start_hbase
}

case "$1" in
    start)
        echo -n "Starting $DESC: "
        start_hbase
        echo "$NAME."
        ;;
    stop)
        echo -n "Stopping $DESC: "
        stop_hbase
        echo "$NAME."
        ;;
    force-reload|restart)
        echo -n "Restarting $DESC: "
        restart_hbase
        echo "$NAME."
        ;;
    *)
        echo "Usage: $0 {start|stop|restart|force-reload}" &gt;&amp;2
        RETVAL=1
        ;;
esac
exit $RETVAL</pre>
<h4>Set up the init system</h4>
<p>This assumes you put the above init files in /etc/init.d. Give hbase a later start sequence than hadoop&#8217;s default of 20 so it starts after HDFS, and an earlier stop sequence so it shuts down first:</p>
<pre>chmod +x /etc/init.d/{hbase,hadoop}
update-rc.d hadoop defaults
update-rc.d hbase defaults 25 15</pre>
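<p>You can confirm the generated start ordering (runlevel 2 shown):</p>
<pre>ls /etc/rc2.d | grep -E 'hadoop|hbase'
# S20hadoop should sort before S25hbase</pre>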
<p>You can now start / stop hadoop by saying:</p>
<pre>/etc/init.d/hadoop start</pre>
<pre>/etc/init.d/hadoop stop</pre>
<p>And similarly with hbase</p>
<pre>/etc/init.d/hbase start</pre>
<pre>/etc/init.d/hbase stop</pre>
<p>Make sure you start hadoop before hbase, and stop hbase before you stop hadoop.</p>
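<p>A convenience pairing that respects that ordering:</p>
<pre>/etc/init.d/hadoop start &amp;&amp; /etc/init.d/hbase start
/etc/init.d/hbase stop &amp;&amp; /etc/init.d/hadoop stop</pre>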
<h2>Macintosh launchd style startup</h2>
<p>Starting processes on Macintosh Leopard is pretty easy with launchd/launchctl.</p>
<p>For hadoop, create a file /Library/LaunchAgents/com.yourdomain.hadoop.plist with the following content (replace yourdomain with the domain you want to use for this class of apps):</p>
<pre><code>&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd"&gt;
&lt;plist version="1.0"&gt;
&lt;dict&gt;
    &lt;key&gt;GroupName&lt;/key&gt;
    &lt;string&gt;_hadoop&lt;/string&gt;
    &lt;key&gt;KeepAlive&lt;/key&gt;
    &lt;true/&gt;
    &lt;key&gt;Label&lt;/key&gt;
    &lt;string&gt;com.yourdomain.hadoop&lt;/string&gt;
    &lt;key&gt;ProgramArguments&lt;/key&gt;
    &lt;array&gt;
        &lt;string&gt;/usr/local/hadoop/bin/start-all.sh&lt;/string&gt;
    &lt;/array&gt;
    &lt;key&gt;RunAtLoad&lt;/key&gt;
    &lt;true/&gt;
    &lt;key&gt;ServiceDescription&lt;/key&gt;
    &lt;string&gt;Hadoop Process&lt;/string&gt;
    &lt;key&gt;UserName&lt;/key&gt;
    &lt;string&gt;_hadoop&lt;/string&gt;
&lt;/dict&gt;
&lt;/plist&gt;
</code></pre>
<p>And for hbase, /Library/LaunchAgents/com.yourdomain.hbase.plist:</p>
<pre><code>&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd"&gt;
&lt;plist version="1.0"&gt;
&lt;dict&gt;
	&lt;key&gt;GroupName&lt;/key&gt;
	&lt;string&gt;_hadoop&lt;/string&gt;
	&lt;key&gt;KeepAlive&lt;/key&gt;
	&lt;true/&gt;
	&lt;key&gt;Label&lt;/key&gt;
	&lt;string&gt;com.yourdomain.hbase&lt;/string&gt;
	&lt;key&gt;ProgramArguments&lt;/key&gt;
	&lt;array&gt;
		&lt;string&gt;/usr/local/hbase/bin/start-hbase.sh&lt;/string&gt;
	&lt;/array&gt;
	&lt;key&gt;RunAtLoad&lt;/key&gt;
	&lt;true/&gt;
	&lt;key&gt;UserName&lt;/key&gt;
	&lt;string&gt;_hadoop&lt;/string&gt;
&lt;/dict&gt;
&lt;/plist&gt;
</code></pre>
<p>Set the owner to root and the mode to 644:</p>
<pre>chown root /Library/LaunchAgents/com.yourdomain.hadoop.plist /Library/LaunchAgents/com.yourdomain.hbase.plist
chmod 644 /Library/LaunchAgents/com.yourdomain.hadoop.plist /Library/LaunchAgents/com.yourdomain.hbase.plist</pre>
<p>The next time you restart, it should start hbase and hadoop. You can also start them manually with the commands:</p>
<pre>sudo launchctl load /Library/LaunchAgents/com.yourdomain.hadoop.plist
sudo launchctl load /Library/LaunchAgents/com.yourdomain.hbase.plist</pre>
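<p>To stop them manually, unload in the reverse order (hbase first):</p>
<pre>sudo launchctl unload /Library/LaunchAgents/com.yourdomain.hbase.plist
sudo launchctl unload /Library/LaunchAgents/com.yourdomain.hadoop.plist</pre>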
<h2>Conclusion</h2>
<p>You should now be able to see the HBase web interface at http://&lt;your domain name&gt;:60010</p>
<p>If you have problems check /var/log/{hbase,hadoop}/startup_log as well as /usr/local/hadoop/logs/hadoop-hadoop-namenode-yourhostname.log and /usr/local/hbase/logs/hbase-hadoop-master-yourhostname.log</p>
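<p>Beyond the web interface, one way to sanity-check HDFS itself from the command line (run as the hadoop user, using the layout above):</p>
<pre>su hadoop -s /bin/sh -c "/usr/local/hadoop/bin/hadoop dfsadmin -report"
su hadoop -s /bin/sh -c "/usr/local/hadoop/bin/hadoop dfs -ls /"</pre>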
<p>The error messages are pretty poor (i.e., useless, as far as I could tell, when tracking down the FQDN/IP address problem), but better than nothing.</p>
<p>I will post an update when I deploy a Full Cluster.</p><p>The post <a href="https://www.ibd.com/runa/hadoop-hdfs-and-hbase-on-ubuntu/">Hadoop, HDFS and Hbase on Ubuntu & Macintosh Leopard</a> first appeared on <a href="https://www.ibd.com">Cognizant Transmutation</a>.</p>]]></content:encoded>
					
					<wfw:commentRss>https://www.ibd.com/runa/hadoop-hdfs-and-hbase-on-ubuntu/feed/</wfw:commentRss>
			<slash:comments>8</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">95</post-id>	</item>
		<item>
		<title>The Commoditization of Massive Data Analysis</title>
		<link>https://www.ibd.com/scalable-deployment/the-commoditization-of-massive-data-analysis/</link>
					<comments>https://www.ibd.com/scalable-deployment/the-commoditization-of-massive-data-analysis/#comments</comments>
		
		<dc:creator><![CDATA[Robert J Berger]]></dc:creator>
		<pubDate>Thu, 20 Nov 2008 07:26:24 +0000</pubDate>
				<category><![CDATA[Scalable Deployment]]></category>
		<category><![CDATA[analytics]]></category>
		<category><![CDATA[Cloud Computing]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<guid isPermaLink="false">http://blog2.ibd.com/?p=39</guid>

					<description><![CDATA[<p>Today&#8217;s article in O&#8217;Reilly&#8217;s Radar by Joseph Hellerstein, is a concise synopsis of the state-of-the-art large scale data analysis. It compares the Enterprise IT dominant&#8230;</p>
<p>The post <a href="https://www.ibd.com/scalable-deployment/the-commoditization-of-massive-data-analysis/">The Commoditization of Massive Data Analysis</a> first appeared on <a href="https://www.ibd.com">Cognizant Transmutation</a>.</p>]]></description>
										<content:encoded><![CDATA[<p>Today&#8217;s <a href="http://radar.oreilly.com/2008/11/the-commoditization-of-massive.html">article</a> in O&#8217;Reilly&#8217;s Radar by <a href="http://db.cs.berkeley.edu/jmh">Joseph Hellerstein</a>, is a concise synopsis of the state-of-the-art large scale data analysis. It compares the Enterprise IT dominant Relational Database paradigm to the emerging (with a bullet!) <a href="http://en.wikipedia.org/wiki/MapReduce">MapReduce</a> / <a href="http://hadoop.apache.org/">Hadoop</a> technologies.<img decoding="async" loading="lazy" class="alignleft size-full wp-image-38" src="https://i0.wp.com/www.ibd.com/wp-content/uploads/2008/11/8d67d34f-6a0d-43ac-9322-9f64e3dc981d.jpg?resize=300%2C71" alt="Hadoop Logo" width="300" height="71" data-recalc-dims="1" /></p>
<p>Professor Hellerstein, from UC Berkeley lives this stuff as a leading researcher on databases and distributed systems.  He is also an advisor to Greenplum, one of the start-ups mentioned in the article that is involved in commercializing MapReduce Tech and writes the <a href="http://databeta.wordpress.com/"> data beta blog</a>.</p>
<p><img decoding="async" loading="lazy" class="alignleft size-full wp-image-35" src="https://i0.wp.com/www.ibd.com/wp-content/uploads/2008/11/862cd098-9190-49d8-97dc-d026b8f0c83c.jpg?resize=173%2C52" alt="Greeplum Logo" width="173" height="52" data-recalc-dims="1" /></p>
<p><img decoding="async" loading="lazy" class="size-full wp-image-34 alignright" src="https://i0.wp.com/www.ibd.com/wp-content/uploads/2008/11/27f6fea8-abdd-4e43-9602-c75e6a39b568.jpg?resize=100%2C87" alt="Aster Logo" width="100" height="87" data-recalc-dims="1" /></p>
<p>The article discusses how some companies (and they are companies, with proprietary tech and nary a free download link on their home pages), such as <a href="http://www.asterdata.com/index.php">Aster Data</a> and <a href="http://www.greenplum.com/">Greenplum</a>, are promoting hybrid Relational Database / MapReduce data warehouse products. These may get some traction in the Enterprise, but with any success they will eventually get squashed and/or assimilated by Oracle and thus stay in the IT realm (IMHO).</p>
<p>The more interesting space is the multiverse of open source tools that are <img decoding="async" loading="lazy" class="alignleft  wp-image-36" src="https://i0.wp.com/www.ibd.com/wp-content/uploads/2008/11/ecb3533e-ec6a-46f7-be6f-1b5f3991e815.jpg?resize=180%2C120" alt="Pig" width="180" height="120" data-recalc-dims="1" /> pushing the evolution of the underlying Hadoop MapReduce, as well as the growing set of tools being layered on top of Hadoop such as <a href="http://wiki.apache.org/hadoop/Hive">Hive</a>, originally developed by <a href="http://www.facebook.com/note.php?note_id=16121578919">Facebook Engineering</a>, and <a href="http://research.yahoo.com/node/90">Pig</a>, started by Yahoo Research. Both are sets of tools, including a query language interface, for doing ad-hoc analysis of massive data sets.</p>
<p>Hellerstein calls all of this a <a href="http://www.cccblog.org/2008/10/20/the-data-centric-gambit/"><em>renaissance in computer science research</em></a> and urges folks to look toward standardizing the upper layers of the Hadoop hierarchy, particularly the query language.</p>
<blockquote><p>There is a debate brewing among data systems cognoscenti as to the best way to do data analysis at this scale. The old guard in the Enterprise IT camp tends to favor relational databases and the SQL language, while the web upstarts have rallied around the MapReduce programming model popularized at Google, and cloned in open source as Apache Hadoop. Hadoop is in wide use at companies like Yahoo! and Facebook, and gets a lot of attention in tech blogs as the next big open source project. But if you mention Hadoop in a corporate IT shop you are often met with blank stares &#8212; SQL is ubiquitous in those environments. There is still a surprising disconnect between these developer communities, but I expect that to change over the next year or two.</p>
<p>We are at the beginning of what I call The Industrial Revolution of Data. We&#8217;re not quite there yet, since most of the digital information available today is still individually &#8220;handmade&#8221;: prose on web pages, data entered into forms, videos and music edited and uploaded to servers. But we are starting to see the rise of automatic data generation &#8220;factories&#8221; such as software logs, UPC scanners, RFID, GPS transceivers, video and audio feeds. These automated processes can stamp out data at volumes that will quickly dwarf the collective productivity of content authors worldwide.</p>
<p>Meanwhile, disk capacities are growing exponentially, so the cost of archiving this data remains modest. And there are plenty of reasons to believe that this data has value in a wide variety of settings. The last step of the revolution is the commoditization of data analysis software, to serve a broad class of users.</p>
<p>-snip-</p></blockquote><p>The post <a href="https://www.ibd.com/scalable-deployment/the-commoditization-of-massive-data-analysis/">The Commoditization of Massive Data Analysis</a> first appeared on <a href="https://www.ibd.com">Cognizant Transmutation</a>.</p>]]></content:encoded>
					
					<wfw:commentRss>https://www.ibd.com/scalable-deployment/the-commoditization-of-massive-data-analysis/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">39</post-id>	</item>
	</channel>
</rss>
