<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	
	xmlns:georss="http://www.georss.org/georss"
	xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
	>

<channel>
	<title>Hadoop - Cognizant Transmutation</title>
	<atom:link href="https://www.ibd.com/tag/hadoop/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.ibd.com</link>
	<description>Internet Bandwidth Development: Composting the Internet for over Two Decades</description>
	<lastBuildDate>Thu, 05 Aug 2021 06:21:03 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.1</generator>

<image>
	<url>https://i0.wp.com/www.ibd.com/wp-content/uploads/2019/01/fullsizeoutput_7ae8.jpeg?fit=32%2C32&#038;ssl=1</url>
	<title>Hadoop - Cognizant Transmutation</title>
	<link>https://www.ibd.com</link>
	<width>32</width>
	<height>32</height>
</image> 
<atom:link rel="hub" href="https://pubsubhubbub.appspot.com"/><atom:link rel="hub" href="https://pubsubhubbub.superfeedr.com"/><atom:link rel="hub" href="https://websubhub.com/hub"/><site xmlns="com-wordpress:feed-additions:1">156814061</site>	<item>
		<title>Upcoming Mini-tutorial at BigDataCamp: How to Build a Hadoop Cluster from Scratch in 20 Minutes by CTO of Infochimps</title>
		<link>https://www.ibd.com/scalable-deployment/1341/</link>
		
		<dc:creator><![CDATA[Robert J Berger]]></dc:creator>
		<pubDate>Sat, 11 Feb 2012 03:08:16 +0000</pubDate>
				<category><![CDATA[Opscode Chef]]></category>
		<category><![CDATA[Scalable Deployment]]></category>
		<category><![CDATA[Sysadmin]]></category>
		<category><![CDATA[Apache Hadoop]]></category>
		<category><![CDATA[BigData]]></category>
		<category><![CDATA[Chef]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Infochimps]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Opscode]]></category>
		<guid isPermaLink="false">http://blog2.ibd.com/?p=1341</guid>

					<description><![CDATA[<p>Flip Kromer (@mrflip), CTO of Infochimps, will give an overview and tutorial on using the latest version of Ironfan (which until today was called cluster_chef) at&#8230;</p>
<p>The post <a href="https://www.ibd.com/scalable-deployment/1341/">Upcoming Mini-tutorial at BigDataCamp: How to Build a Hadoop Cluster from Scratch in 20 Minutes by CTO of Infochimps</a> first appeared on <a href="https://www.ibd.com">Cognizant Transmutation</a>.</p>]]></description>
										<content:encoded><![CDATA[<p><a href="http://www.infochimps.com/"><img decoding="async" loading="lazy" class="alignleft wp-image-1356 size-full" title="chimpmark" src="https://i0.wp.com/www.ibd.com/wp-content/uploads/2012/02/chimpmark.png?resize=200%2C200" alt="Infochimps Icon" width="200" height="200" srcset="https://i0.wp.com/www.ibd.com/wp-content/uploads/2012/02/chimpmark.png?w=200&amp;ssl=1 200w, https://i0.wp.com/www.ibd.com/wp-content/uploads/2012/02/chimpmark.png?resize=150%2C150&amp;ssl=1 150w" sizes="(max-width: 200px) 100vw, 200px" data-recalc-dims="1" /></a><a href="https://github.com/mrflip">Flip Kromer</a> (<a title="Flip Twitter Handle" href="https://twitter.com/#!/mrflip">@mrflip</a>), CTO of <a href="http://blog.infochimps.com/">Infochimps</a>, will give an overview and tutorial on using the latest version of <a title="Ironfan Github repo" href="https://github.com/infochimps/ironfan" target="_blank" rel="noopener">Ironfan</a> (which until today was called <a href="https://github.com/infochimps/cluster_chef/tree/version_3">cluster_chef</a>) at the <a href="http://www.bigdatacamp.org/siliconvalley/2012-02-27/">BigDataCamp unconference</a> put on by <a href="http://twitter.com/davenielsen">Dave Nielsen</a> just before <a href="http://strataconf.com/strata2012?cmp=af-conf-st12-affiliate-bdc">O&#8217;Reilly&#8217;s Strata Conference</a>, Feb 27 from 5:30pm to 10pm.</p>
<p>We&#8217;ve been using cluster_chef at <a title="Runa Home Page" href="http://www.runa.com" target="_blank" rel="noopener">Runa</a> as the basis of our Chef management for our entire production environment for the last few months. I&#8217;m very excited about what Flip and his team have done to turn Ironfan into a pretty nice way to orchestrate the <a title="Opscode Home" href="http://wiki.opscode.com" target="_blank" rel="noopener">Chef</a> deployment of complex clusters of servers, with a focus on supporting the <a title="Apache Hadoop Home" href="http://hadoop.apache.org/" target="_blank" rel="noopener">Hadoop Ecosystem</a> and <a title="AWS Home" href="http://aws.amazon.com" target="_blank" rel="noopener">EC2</a>. It&#8217;s not specific to Hadoop or EC2, but it has a lot of support for both, which is what we really liked.</p>
<p>Here&#8217;s the blurb:</p>
<p class="p1"><strong>How to Build a Hadoop Cluster from Scratch in 20 Minutes</strong></p>
<p class="p1">In this tutorial, Flip Kromer, CTO of Infochimps will introduce Ironfan (formerly Cluster Chef), Infochimps&#8217; open-source tool for orchestrated systems provisioning.  It builds on Chef, <a class="zem_slink" title="Opscode" href="http://www.opscode.com" rel="homepage">Opscode</a>&#8216;s beloved open-source tool for provisioning cloud machines and adds a number of superpowers that allow you to provision and deploy coordinated clusters of machines all at once.  Stop monkeying around, spending days or weeks spinning up clusters, and manually copying and pasting IP addresses to glue all the pieces together.  Just spin them up when you need them and kill them when you don&#8217;t; let Ironfan handle the details.  Now, you can spend your money, time and engineering focus on more important things &#8211; like finding insights in your data.</p>
<p class="p1">The Ironfan demo will run approximately 30 minutes and we welcome all attendees to bring short demos (3-5 minutes) of awesome things they have done with Chef or Ironfan (formerly Cluster Chef).  Flip will also be available for Q&amp;A at the end of this session or later in the evening over beers.</p>
<p class="p1">You should sign up in advance for the BigDataCamp at<a href="http://bigdatacamp-santaclara-2012-eivtefrnd.eventbrite.com/"> http://bigdatacamp-santaclara-2012-eivtefrnd.eventbrite.com/</a> Its free but they like to know how many are coming.</p>
<div class="zemanta-pixie" style="margin-top: 10px; height: 15px;"><a class="zemanta-pixie-a" title="Enhanced by Zemanta" href="http://www.zemanta.com/"><img decoding="async" class="zemanta-pixie-img" style="border: none; float: right;" src="https://i0.wp.com/img.zemanta.com/zemified_e.png" alt="Enhanced by Zemanta" data-recalc-dims="1" /></a></div><p>The post <a href="https://www.ibd.com/scalable-deployment/1341/">Upcoming Mini-tutorial at BigDataCamp: How to Build a Hadoop Cluster from Scratch in 20 Minutes by CTO of Infochimps</a> first appeared on <a href="https://www.ibd.com">Cognizant Transmutation</a>.</p>]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">1341</post-id>	</item>
		<item>
		<title>HBase/Hadoop on Mac OS X (Pseudo-Distributed)</title>
		<link>https://www.ibd.com/howto/hbase-hadoop-on-mac-ox-x/</link>
					<comments>https://www.ibd.com/howto/hbase-hadoop-on-mac-ox-x/#comments</comments>
		
		<dc:creator><![CDATA[Robert J Berger]]></dc:creator>
		<pubDate>Mon, 03 May 2010 03:50:13 +0000</pubDate>
				<category><![CDATA[HowTo]]></category>
		<category><![CDATA[Macintosh]]></category>
		<category><![CDATA[Scalable Deployment]]></category>
		<category><![CDATA[Sysadmin]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Mac OS X]]></category>
		<guid isPermaLink="false">http://blog2.ibd.com/?p=565</guid>

					<description><![CDATA[<p>I wanted to do some experimenting with various tools for doing Hadoop and HBase activities and didn&#8217;t want to have to bother making it work&#8230;</p>
<p>The post <a href="https://www.ibd.com/howto/hbase-hadoop-on-mac-ox-x/">HBase/Hadoop on Mac OS X (Pseudo-Distributed)</a> first appeared on <a href="https://www.ibd.com">Cognizant Transmutation</a>.</p>]]></description>
										<content:encoded><![CDATA[<p>I wanted to do some experimenting with various tools for doing Hadoop and HBase activities and didn&#8217;t want to have to bother making it work with our Cluster in the Cloud. I just wanted a simple experimental environment on my MacBook Pro running Snow Leopard Mac OS X.</p>
<p>So I thought it was time to revisit installing Hadoop and HBase on the Mac using the latest versions of everything. This will be deployed in Pseudo-Distributed mode, native to Mac OS X. Some folks actually create a set of Linux VMs with a full Hadoop/HBase stack and run that on the Mac, but that is a bit of overkill for now.</p>
<p>These instructions mainly follow the standard instructions for <a href="http://hadoop.apache.org/common/docs/current/quickstart.html" target="_blank">Apache Hadoop</a> and <a href="http://hadoop.apache.org/hbase/docs/current/api/overview-summary.html#pseudo-distrib" target="_blank">Apache HBase</a></p>
<h2>Prerequisites</h2>
<p>The Mac OS X Xcode developer tools, which include Java 1.6.x. You can get these for free from the <a href="https://developer.apple.com/mac/" target="_blank">Apple Mac Dev Center</a>. You have to become a member, but a free membership is available.</p>
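<p>You can confirm that Java is available from the command line; the exact version string will depend on your Xcode release:</p>
<pre><code>java -version</code></pre>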
<h2>Download and Unpack Latest Distros</h2>
<p>You can get a link to a mirror for Hadoop via the <a href="http://www.apache.org/dyn/closer.cgi/hadoop/core/" target="_blank">Hadoop Apache Mirror link</a> and for HBase at the <a href="http://www.apache.org/dyn/closer.cgi/hadoop/hbase/" target="_blank">HBase Apache Mirror link</a>. Each of those links will bring you to a suggested mirror for Hadoop or HBase. Once you click on the suggested link, it will bring you to a mirror with the recent releases. You can click on the <em>stable</em> link, which will then bring you to a directory that has the latest stable Hadoop (as of this writing: hadoop-0.20.2.tar.gz) or HBase (as of this writing: hbase-0.20.3.tar.gz). Click on those tar.gz files to download them.</p>
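<p>If you would rather fetch the tarballs from the command line, something like the following works (the mirror hostname here is a placeholder; substitute the one the mirror page hands you):</p>
<pre><code>curl -O http://a.mirror.example.org/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
curl -O http://a.mirror.example.org/hadoop/hbase/hbase-0.20.3/hbase-0.20.3.tar.gz</code></pre>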
<p>I am going to keep the distros in ~/work/pkgs: I unpack the tar files there as numbered versions and then create symbolic links to them in ~/work. But you can do all this in any directory that you control:</p>
<pre><code>cd ~/work
mkdir -p pkgs
cd pkgs
tar xvzf hadoop-0.20.2.tar.gz
tar xvzf hbase-0.20.3.tar.gz
cd ..
ln -s pkgs/hadoop-0.20.2 hadoop
ln -s pkgs/hbase-0.20.3 hbase
mkdir -p hadoop/logs
mkdir -p hbase/logs</code></pre>
<p>Now you can have your tools all access ~/work/hadoop or ~/work/hbase and not care what version it is. You can update to a later version just by downloading and untarring the new distro and then changing the symbolic links.</p>
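<p>For example, upgrading to a hypothetical newer Hadoop release would look like this (the version number is illustrative):</p>
<pre><code>cd ~/work/pkgs
tar xvzf hadoop-0.20.3.tar.gz
cd ..
rm hadoop                       # removes only the old symlink
ln -s pkgs/hadoop-0.20.3 hadoop</code></pre>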
<h2>Configure Hadoop</h2>
<p>All the configuration files mentioned here will be in <em>~/work/hadoop/conf</em>. In this example we are assuming that the Hadoop servers will only be accessed from <em>localhost</em>. If you need to make them accessible from other hosts or VMs on your LAN that support Bonjour, you could use the Bonjour name (i.e., the name of your Mac followed by .local, such as <em>mymac.local</em>) instead of <em>localhost</em> in the following Hadoop and HBase configurations.</p>
<h3>hadoop-env.sh</h3>
<p>Mainly, you need to tell Hadoop where your JAVA_HOME is.</p>
<p>Add the following line below the commented-out JAVA_HOME line in hadoop-env.sh:</p>
<pre><code>export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home</code></pre>
<h3>core-site.xml</h3>
<pre><code>&lt;?xml version="1.0"?&gt;
&lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&gt;

&lt;configuration&gt;
  &lt;property&gt;
    &lt;name&gt;fs.default.name&lt;/name&gt;
    &lt;value&gt;hdfs://localhost:9000&lt;/value&gt;
  &lt;/property&gt;
&lt;/configuration&gt;</code></pre>
<h3>hdfs-site.xml</h3>
<pre><code>&lt;?xml version="1.0"?&gt;
&lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&gt;

&lt;configuration&gt;
  &lt;property&gt;
    &lt;name&gt;dfs.replication&lt;/name&gt;
    &lt;value&gt;1&lt;/value&gt;
  &lt;/property&gt;
&lt;/configuration&gt;</code></pre>
<h3>mapred-site.xml</h3>
<pre><code>&lt;?xml version="1.0"?&gt;
&lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&gt;

&lt;configuration&gt;
  &lt;property&gt;
    &lt;name&gt;mapred.job.tracker&lt;/name&gt;
    &lt;value&gt;localhost:9001&lt;/value&gt;
  &lt;/property&gt;
&lt;/configuration&gt;</code></pre>
<h3>Make sure you can ssh without a password to the hostname used in the configs</h3>
<p>The Hadoop and HBase start/stop scripts use ssh to access the various servers. Since we are running in Pseudo-Distributed mode, everything runs on the <em>localhost</em>, but the scripts still need to be able to ssh to it.</p>
<h4>Check that you can ssh to the <em>localhost</em> (or whatever hostname you used in the above configs)</h4>
<p>We&#8217;re assuming that we&#8217;ll be running the Hadoop/HBase servers as the same user as our login. You can set things up to run as a hadoop user, but it&#8217;s kind of complicated on Mac OS X. See the section <em>File System Layout</em> in an earlier post, <em><a href="http://blog2.ibd.com/scalable-deployment/hadoop-hdfs-and-hbase-on-ubuntu/" target="_blank">Hadoop, HDFS and Hbase on Ubuntu &amp; Macintosh Leopard</a>.</em> That section and a few other points throughout that post describe how to create and use a hadoop user to run the Hadoop and HBase servers.</p>
<p>Back to just doing this as our own user. Test that you can ssh to the <em>localhost</em> without a password:</p>
<pre>ssh localhost</pre>
<p>If you see output like the following that ends with a password prompt, then you need to add a key to your ssh setup that does not require a password (you may need to answer yes if asked whether you want to continue connecting).</p>
<pre>The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is 3c:5d:6a:39:64:78:02:9d:a3:c9:69:68:50:23:71:eb.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Password:</pre>
<p>To create a passwordless key and add it to your set of authorized keys that can access your host, do the following (as yourself, not as root. The id_dsa file name can be arbitrary):</p>
<pre>ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa_for_hadoop
cat ~/.ssh/id_dsa_for_hadoop.pub &gt;&gt; ~/.ssh/authorized_keys</pre>
<p>If you have strong alternative opinions on how to set up your own keys to accomplish the same thing, please do it your own way. This is just the basic way of doing passwordless ssh; you may want to use a key you already have lying around or some other mechanism.</p>
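<p>For instance, if you stick with the dedicated key created above, an entry in ~/.ssh/config will point ssh at it (a minimal sketch; adjust the file name if you picked a different one):</p>
<pre><code># ~/.ssh/config
Host localhost
    IdentityFile ~/.ssh/id_dsa_for_hadoop</code></pre>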
<h3>Start Hadoop</h3>
<h4>One time format of  Hadoop File System</h4>
<p>Before the first time you use Hadoop, and only once, you have to format the Hadoop File System. Don&#8217;t do this again once you have data in your Hadoop file system, as it will erase anything you might have saved there. You may have to run this command again if you somehow mangle your file system, but it&#8217;s not something to do lightly a second time.</p>
<pre>~/work/hadoop/bin/hadoop namenode -format</pre>
<p>If all goes well, you should see something like:</p>
<pre>10/05/02 18:45:04 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = Psion.local/192.168.50.16
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
10/05/02 18:45:04 INFO namenode.FSNamesystem: fsOwner=rberger,rberger,admin,com.apple.access_screensharing,_developer,_lpoperator,_lpadmin,_appserveradm,_appserverusr,localaccounts,everyone,com.apple.sharepoint.group.2,com.apple.sharepoint.group.3,dev,com.apple.sharepoint.group.1,workgroup
10/05/02 18:45:04 INFO namenode.FSNamesystem: supergroup=supergroup
10/05/02 18:45:04 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/05/02 18:45:04 INFO common.Storage: Image file of size 97 saved in 0 seconds.
10/05/02 18:45:04 INFO common.Storage: Storage directory /tmp/hadoop-rberger/dfs/name has been successfully formatted.
10/05/02 18:45:04 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at Psion.local/192.168.50.16
************************************************************/</pre>
<h4>Starting and stopping Hadoop</h4>
<p>Now you can start Hadoop. You will use this command to start Hadoop in general:</p>
<pre>~/work/hadoop/bin/start-all.sh</pre>
<p>You can stop Hadoop with the command</p>
<pre>~/work/hadoop/bin/stop-all.sh</pre>
<p>But remember if you are running HBase, stop that first, then stop Hadoop.</p>
<h3>Making sure Hadoop is working</h3>
<p>You can see the Hadoop logs in ~/work/hadoop/logs</p>
<p>You should be able to see the Hadoop Namenode web interface at <a href="http://localhost:50070/" target="_blank">http://localhost:50070/</a> and the JobTracker web interface at <a href="http://localhost:50030/" target="_blank">http://localhost:50030/</a>. If not, check that you have 5 Java processes running, each of which has one of the following as the end of its command line (as seen from a <code>ps ax | grep hadoop</code> command):</p>
<pre>org.apache.hadoop.mapred.JobTracker
org.apache.hadoop.hdfs.server.namenode.NameNode
org.apache.hadoop.mapred.TaskTracker
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode
org.apache.hadoop.hdfs.server.datanode.DataNode</pre>
<p>If you do not see these 5 processes, check the logs in ~/work/hadoop/logs/*.{out,log} for messages that might give you a hint as to what went wrong.</p>
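<p>If the JDK&#8217;s <code>jps</code> tool is on your PATH, it gives a quicker view of the same thing:</p>
<pre><code>jps
# should list NameNode, DataNode, SecondaryNameNode,
# JobTracker and TaskTracker (plus Jps itself)</code></pre>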
<h4>Run some example map/reduce jobs</h4>
<p>The Hadoop distro comes with some example/test map/reduce jobs. Here we&#8217;ll run them and make sure things are working end to end.</p>
<pre><code>cd ~/work/hadoop
# Copy the input files into the distributed filesystem
# (there will be no output visible from the command):
bin/hadoop fs -put conf input
# Run some of the examples provided:
# (there will be a large amount of INFO statements as output)
bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
# Examine the output files:
bin/hadoop fs -cat output/part-00000
</code></pre>
<p>The resulting output should be something like:</p>
<pre>3	dfs.class
2	dfs.period
1	dfs.file
1	dfs.replication
1	dfs.servers
1	dfsadmin
1	dfsmetrics.log</pre>
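<p>You can also poke around the distributed filesystem directly. Note that the example job refuses to run if its output directory already exists, so remove it before re-running:</p>
<pre><code># from ~/work/hadoop
bin/hadoop fs -ls input
bin/hadoop fs -ls output
bin/hadoop fs -rmr output</code></pre>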
<h2>Configuring HBase</h2>
<p>The following config files all reside in <em>~/work/hbase/conf</em>. As mentioned earlier, use an FQDN or a Bonjour name instead of localhost if you need remote clients to access HBase. But if you don&#8217;t use localhost here, make sure you make the same change in the Hadoop config.</p>
<h3>hbase-env.sh</h3>
<p>Add the following line below the commented-out JAVA_HOME line in hbase-env.sh:</p>
<pre><code>export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home</code></pre>
<p>Add the following line below the commented-out HBASE_CLASSPATH= line:</p>
<pre><code>export HBASE_CLASSPATH=${HOME}/work/hadoop/conf</code></pre>
<h3>hbase-site.xml</h3>
<pre><code>&lt;?xml version="1.0"?&gt;
&lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&gt;
&lt;?xml version="1.0"?&gt;&lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&gt;
&lt;configuration&gt;
  &lt;property&gt;
    &lt;name&gt;hbase.rootdir&lt;/name&gt;
    &lt;value&gt;hdfs://localhost:9000/hbase&lt;/value&gt;
    &lt;description&gt;The directory shared by region servers.
    &lt;/description&gt;
  &lt;/property&gt;
&lt;/configuration&gt;
</code></pre>
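<p>With Hadoop already running, you can now start HBase:</p>
<pre>~/work/hbase/bin/start-hbase.sh</pre>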
<h3>Making Sure HBase is Working</h3>
<p>If you do a <code>ps ax | grep hbase</code> you should see two Java processes. One should end with:<br />
<code>org.apache.hadoop.hbase.zookeeper.HQuorumPeer start</code><br />
And the other should end with:<br />
<code>org.apache.hadoop.hbase.master.HMaster start</code><br />
Since we are running in the Pseudo-Distributed mode, there will not be any explicit regionservers running. If you have problems, check the logs in ~/work/hbase/logs/*.{out,log}</p>
<h3>Testing HBase using the HBase Shell</h3>
<p>From the unix prompt give the following command:</p>
<pre>~/work/hbase/bin/hbase shell</pre>
<p>Here are some example commands from the Apache HBase installation instructions:</p>
<pre>hbase&gt; # Type "help" to see shell help screen
hbase&gt; help
hbase&gt; # To create a table named "mylittletable" with a column family of "mylittlecolumnfamily", type
hbase&gt; create "mylittletable", "mylittlecolumnfamily"
hbase&gt; # To see the schema for the "mylittletable" table you just created and its single "mylittlecolumnfamily", type
hbase&gt; describe "mylittletable"
hbase&gt; # To add a row whose id is "myrow", to the column "mylittlecolumnfamily:x" with a value of 'v', do
hbase&gt; put "mylittletable", "myrow", "mylittlecolumnfamily:x", "v"
hbase&gt; # To get the cell just added, do
hbase&gt; get "mylittletable", "myrow"
hbase&gt; # To scan your new table, do
hbase&gt; scan "mylittletable"</pre>
<p>You can stop HBase with the command:</p>
<pre>~/work/hbase/bin/stop-hbase.sh</pre>
<p>Once that has stopped, you can stop Hadoop:</p>
<pre>~/work/hadoop/bin/stop-all.sh</pre>
<h2>Conclusion</h2>
<p>You should now have a fully working Pseudo-Distributed Hadoop/HBase setup on your Mac. It is not suitable for any kind of large-data or production project; in fact it will probably fall over if you try to do anything with lots of data or high volumes of I/O. HBase does not seem to work well until you have 4&#8211;5 regionservers.</p>
<p>But this Pseudo-Distributed version should be fine for doing experiments with tools and small data sets.</p>
<p>Now I can get on with playing with <a href="http://github.com/clj-sys/cascading-clojure" target="_blank">Cascading-Clojure</a> and <a href="http://nathanmarz.com/blog/introducing-cascalog/" target="_blank">Cascalog</a>!</p><p>The post <a href="https://www.ibd.com/howto/hbase-hadoop-on-mac-ox-x/">HBase/Hadoop on Mac OS X (Pseudo-Distributed)</a> first appeared on <a href="https://www.ibd.com">Cognizant Transmutation</a>.</p>]]></content:encoded>
					
					<wfw:commentRss>https://www.ibd.com/howto/hbase-hadoop-on-mac-ox-x/feed/</wfw:commentRss>
			<slash:comments>25</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">565</post-id>	</item>
		<item>
		<title>Experience installing Hbase 0.20.0 Cluster on Ubuntu 9.04 and EC2</title>
		<link>https://www.ibd.com/howto/experience-installing-hbase-0-20-0-cluster-on-ubuntu-9-04-and-ec2/</link>
					<comments>https://www.ibd.com/howto/experience-installing-hbase-0-20-0-cluster-on-ubuntu-9-04-and-ec2/#comments</comments>
		
		<dc:creator><![CDATA[Robert J Berger]]></dc:creator>
		<pubDate>Sat, 05 Sep 2009 01:34:41 +0000</pubDate>
				<category><![CDATA[HowTo]]></category>
		<category><![CDATA[Runa]]></category>
		<category><![CDATA[Scalable Deployment]]></category>
		<category><![CDATA[AWS]]></category>
		<category><![CDATA[Cloud Computing]]></category>
		<category><![CDATA[EC2]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Sysadmin]]></category>
		<category><![CDATA[ubuntu]]></category>
		<guid isPermaLink="false">http://blog2.ibd.com/?p=237</guid>

					<description><![CDATA[<p>NOTE (Sep 7 2009): Updated info on need to use Amazon Private DNS Names and clarified the need for the masters, slaves and regionservers files.&#8230;</p>
<p>The post <a href="https://www.ibd.com/howto/experience-installing-hbase-0-20-0-cluster-on-ubuntu-9-04-and-ec2/">Experience installing Hbase 0.20.0 Cluster on Ubuntu 9.04 and EC2</a> first appeared on <a href="https://www.ibd.com">Cognizant Transmutation</a>.</p>]]></description>
										<content:encoded><![CDATA[<p><strong>NOTE (Sep 7 2009):</strong> Updated info on need to use Amazon Private DNS Names and clarified the need for the masters, slaves and regionservers files. Also updated to use HBase 0.20.0 Release Candidate 3</p>
<h2>Introduction</h2>
<p>As someone who has &#8220;skipped&#8221; Java and wants to learn as little as possible about it, and who has not had much experience with Hadoop so far, I found that HBase deployment has a big learning curve. So some of the things I describe below may be obvious to those who have experience in those domains.</p>
<h2>Where are the docs for HBase 0.20?</h2>
<p>If you go to the HBase wiki, you will find that there is not much documentation on the 0.20 version. This puzzled me, since all the twittering, blog posting and other buzz was talking about people using 0.20 even though it&#8217;s &#8220;pre-release.&#8221;</p>
<p>One of the great things about going to meetups such as the <a title="HBase Meetup" href="http://www.meetup.com/hbaseusergroup/" target="_blank">HBase Meetup</a> is that you can talk to the folks who actually wrote the thing and ask them, &#8220;Where is the documentation for HBase 0.20?&#8221;</p>
<p>Turns out it&#8217;s in the HBase 0.20.0 distribution, in the docs directory. The easiest thing is to get the <a href="http://people.apache.org/~stack/hbase-0.20.0-candidate-3" target="_blank">pre-built 0.20.0 release candidate 3</a>. If you download the source from the version control repository you have to build the documentation using Ant. If you are a Java/Ant kind of person it might not be hard, but just to build the docs you have to meet a number of dependencies first.</p>
<h2>What we learnt with 0.19.x</h2>
<p>We have been learning a lot about making an HBase cluster work at a basic level. I had a lot of problems getting 0.19.x running beyond a single node in Pseudo-Distributed mode. I think a lot of my problems were just not understanding how it all fit together with Hadoop and what the different startup/shutdown scripts did.</p>
<p>Then we finally tried the <a href="http://issues.apache.org/jira/browse/HBASE-838" target="_blank">HBase EC2 Scripts</a>, even though they use an AMI based on Fedora 8 and seemed hard-wired to 0.19.0. It&#8217;s a pretty nice script if you want an opinionated HBase cluster setup, and it did educate us on how to get a cluster to go. It has a bit of strangeness in having a script in /root/hbase_init that is called at boot time to configure all the Hadoop and HBase conf files and then call the Hadoop and HBase startup scripts. Something like this is kind of needed for Amazon EC2, since you don&#8217;t really know what the IP address/FQDN is until boot time.</p>
<p>The scripts also set up an Amazon Security Group for the cluster master and one for the rest of the cluster. I believe it then uses this as a way to identify the group as well.</p>
<p>The main thing we got out of it was that by going through the /root/hbase_init script we were able to figure out the process for bringing up Hadoop/HBase as a cluster.</p>
<p>We did build a staging cluster with this script, and we were able to pretty easily change it to use 0.19.3 instead of 0.19.0. But its opinions differed from ours on many things. Plus, after talking to the folks at the HBase Meetup, and having all sorts of weird problems with our app on 0.19.3, we were convinced that our future is in HBase 0.20. And 0.20 introduces new things like using ZooKeeper to manage master selection, so it seems it&#8217;s not worth it for us to continue using this script. Though it helped our learning quite a bit!</p>
<h2>Building an HBase 0.20.0 Cluster</h2>
<p>This post will use the HBase pre-built Release Candidate 3 and the pre-built standard Hadoop 0.20.0.</p>
<p>This post will show how to do all this &#8220;by hand&#8221;. Hopefully we&#8217;ll have an article on how to do all this with Chef sometime soon.</p>
<p>The HBase folks say that you really should have at least 5 regionservers and one master. The master and several of the regionservers can also run the ZooKeeper quorum. Of course the master server is also going to run the Hadoop NameNode and Secondary NameNode; the 5 other nodes then run the Hadoop HDFS DataNodes as well as the HBase regionservers. When you build out larger clusters, you will probably want to dedicate machines to ZooKeeper and hot-standby HBase masters. The NameNode is still a Single Point of Failure (SPOF); rumour has it that this will be fixed in Hadoop 0.21.</p>
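<p>As a rough sketch, the role layout for such a minimal cluster looks like this (hostnames are placeholders):</p>
<pre>master    NameNode, SecondaryNameNode, HMaster, ZooKeeper
node1     DataNode, HRegionServer, ZooKeeper
node2     DataNode, HRegionServer, ZooKeeper
node3-5   DataNode, HRegionServer</pre>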
<p>We&#8217;re not using Map/Reduce yet so we won&#8217;t go into that, but it&#8217;s just a matter of different startup scripts to make the same nodes run Map/Reduce as well as HDFS and HBase.</p>
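<p>(For reference, the Map/Reduce daemons have their own start/stop scripts in the same bin directory; we don&#8217;t run them in this post:)</p>
<pre>/mnt/hadoop/bin/start-mapred.sh
/mnt/hadoop/bin/stop-mapred.sh</pre>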
<p>In this example, we&#8217;re installing and running everything as root. It can also be done as a special user like hadoop, as described in the earlier blog post <a href="http://blog2.ibd.com/scalable-deployment/hadoop-hdfs-and-hbase-on-ubuntu/" target="_blank">Hadoop, HDFS and Hbase on Ubuntu &amp; Macintosh Leopard</a>.</p>
<h2 style="font-size: 1.17em;">Getting the pre-requisites in order</h2>
<p>We started with the vanilla <a href="http://alestic.com/" target="_blank">alestic</a> Ubuntu 9.04 Jaunty 64-bit Server AMI, <a href="http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1951&amp;categoryID=101" target="_blank">ami-5b46a732</a>, and instantiated 6 High-CPU Large instances. You really want as much memory and as many cores as you can get. You can do the following by hand or combine it with the shell scripting described below in the section <em>Installing Hadoop and HBase</em>.</p>
<pre>apt-get update
apt-get upgrade</pre>
<p>Then added via apt-get install:</p>
<pre>apt-get install sun-java6-jdk</pre>
<h3>Downloading Hadoop and HBase</h3>
<p>You can use the production Hadoop 0.20.0 release. You can find it via the mirror list at http://www.apache.org/dyn/closer.cgi/hadoop/core/. The examples below use one mirror:</p>
<pre>wget http://mirror.cloudera.com/apache/hadoop/core/hadoop-0.20.0/hadoop-0.20.0.tar.gz</pre>
<p>You can download the HBase 0.20.0 Release Candidate 3 in a prebuilt form from <a href="http://people.apache.org/~stack/hbase-0.20.0-candidate-3/" target="_blank">http://people.apache.org/~stack/hbase-0.20.0-candidate-3/</a>. (You can get the source out of version control, <a href="http://hadoop.apache.org/hbase/version_control.html" target="_blank">http://hadoop.apache.org/hbase/version_control.html</a>, but you&#8217;ll have to figure out how to build it.)</p>
<pre>wget http://people.apache.org/~stack/hbase-0.20.0-candidate-3/hbase-0.20.0.tar.gz</pre>
<h3>Installing Hadoop and HBase</h3>
<p>Assume that you are working in your home directory on the master server, that the versioned packages will be unpacked into /mnt/pkgs, and that symlinks in /mnt will point at the current hadoop and hbase homes.</p>
<p>You can do some simple shell scripting to do the following on all the nodes at once:</p>
<p>Create a file named <em>servers</em> containing the fully qualified domain names of all your servers, with &#8220;localhost&#8221; standing in for the master.</p>
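<p>For the two-host layout used in the configs later in this post (where the master also doubles as a datanode), <em>servers</em> might look like:</p>
<pre>localhost
domU-12-31-39-06-9D-C1.compute-1.internal</pre>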
<p>Make sure you can ssh to all the servers from the master, ideally using ssh keys. On the master:</p>
<pre>ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub &gt;&gt; ~/.ssh/authorized_keys</pre>
<p>On each of your regionservers, make sure that the id_dsa.pub is also in authorized_keys (don&#8217;t delete any other keys you already have there!).</p>
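<p>One way to push the key out to every host, assuming you can still reach each of them with a password or an existing key (a sketch using the servers file from above):</p>
<pre>for host in `cat servers`
do
  cat ~/.ssh/id_dsa.pub | ssh $host 'cat &gt;&gt; ~/.ssh/authorized_keys'
done</pre>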
<p>Now with a bit of shell command line scripting you can install on all your servers at once:</p>
<pre>for host in `cat servers`
 do
 echo $host
 ssh $host 'apt-get update; apt-get upgrade; apt-get install sun-java6-jdk'
 scp ~/hadoop-0.20.0.tar.gz ~/hbase-0.20.0.tar.gz $host:
 ssh $host 'mkdir -p /mnt/pkgs; cd /mnt/pkgs; tar xzf ~/hadoop-0.20.0.tar.gz; tar xzf ~/hbase-0.20.0.tar.gz; ln -s /mnt/pkgs/hadoop-0.20.0 /mnt/hadoop; ln -s /mnt/pkgs/hbase-0.20.0 /mnt/hbase'
done</pre>
<h4>Use Amazon Private DNS Names in Config files</h4>
<p>So far I have found that it&#8217;s best to use the Amazon Private DNS names in the Hadoop and HBase config files. It looks like HBase uses the system hostname to determine various things at runtime, and this is always the Private DNS name. It also means that it&#8217;s difficult to use the web GUI interfaces to HBase from outside of the Amazon cloud. I set up a &#8220;desktop&#8221; version of Ubuntu running in the Amazon cloud that I VNC (or NX) into and use its browser to view the web interface.</p>
<p>In any case, Amazon instances normally have limited TCP/UDP access to the outside world due to the default security group settings. You would have to add the various ports used by HBase and Hadoop to the security group to allow outside access.</p>
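<p>With the classic EC2 API tools, opening a port looks roughly like this (the group name, port, and source address are all illustrative; 60010 happens to be the HBase master web UI):</p>
<pre>ec2-authorize default -P tcp -p 60010 -s 203.0.113.5/32</pre>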
<p>If you do use the Amazon Public DNS names in the config files, there will be startup errors like the following for each instance that is assigned to the zookeeper quorum (there may be other errors as well, but these are the most obvious):</p>
<pre>ec2-75-101-104-121.compute-1.amazonaws.com: java.io.IOException: Could not find my address: domU-12-31-39-06-9D-51.compute-1.internal in list of ZooKeeper quorum servers
ec2-75-101-104-121.compute-1.amazonaws.com:     at org.apache.hadoop.hbase.zookeeper.HQuorumPeer.writeMyID(HQuorumPeer.java:128)
ec2-75-101-104-121.compute-1.amazonaws.com:     at org.apache.hadoop.hbase.zookeeper.HQuorumPeer.main(HQuorumPeer.java:67)</pre>
<h3>Configuring Hadoop</h3>
<p>Now you have to configure Hadoop on the master in /mnt/hadoop/conf:</p>
<h4>hadoop-env.sh:</h4>
<p>The minimal things to change are:</p>
<p>Set your JAVA_HOME to where the java package is installed. On Ubuntu:</p>
<pre>export JAVA_HOME=/usr/lib/jvm/java-6-sun</pre>
<p>Add the hbase path to the HADOOP_CLASSPATH:</p>
<pre>export HADOOP_CLASSPATH=/mnt/hbase/hbase-0.20.0.jar:/mnt/hbase/hbase-0.20.0-test.jar:/mnt/hbase/conf</pre>
<h4>core-site.xml:</h4>
<p>Here is what we used, primarily setting where the Hadoop files live and the NameNode host and port:</p>
<pre>&lt;?xml version="1.0"?&gt;
&lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&gt;

&lt;configuration&gt;
   &lt;property&gt;
     &lt;name&gt;hadoop.tmp.dir&lt;/name&gt;
     &lt;value&gt;/mnt/hadoop&lt;/value&gt;
   &lt;/property&gt;

   &lt;property&gt;
     &lt;name&gt;fs.default.name&lt;/name&gt;
     &lt;value&gt;hdfs://domU-12-31-39-06-9D-51.compute-1.internal:50001&lt;/value&gt;
   &lt;/property&gt;

   &lt;property&gt;
     &lt;name&gt;tasktracker.http.threads&lt;/name&gt;
     &lt;value&gt;80&lt;/value&gt;
   &lt;/property&gt;
&lt;/configuration&gt;</pre>
<h4>mapred-site.xml:</h4>
<p>Even though we are not currently using Map/Reduce this is a basic config:</p>
<pre>&lt;?xml version="1.0"?&gt;
&lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&gt;

&lt;configuration&gt;
   &lt;property&gt;
     &lt;name&gt;mapred.job.tracker&lt;/name&gt;
     &lt;value&gt;domU-12-31-39-06-9D-51.compute-1.internal:50002&lt;/value&gt;
   &lt;/property&gt;

   &lt;property&gt;
     &lt;name&gt;mapred.tasktracker.map.tasks.maximum&lt;/name&gt;
     &lt;value&gt;4&lt;/value&gt;
   &lt;/property&gt;

   &lt;property&gt;
     &lt;name&gt;mapred.tasktracker.reduce.tasks.maximum&lt;/name&gt;
     &lt;value&gt;4&lt;/value&gt;
   &lt;/property&gt;

   &lt;property&gt;
     &lt;name&gt;mapred.output.compress&lt;/name&gt;
     &lt;value&gt;true&lt;/value&gt;
   &lt;/property&gt;

   &lt;property&gt;
     &lt;name&gt;mapred.output.compression.type&lt;/name&gt;
     &lt;value&gt;BLOCK&lt;/value&gt;
   &lt;/property&gt;
&lt;/configuration&gt;</pre>
<h4>hdfs-site.xml:</h4>
<p>The main thing to change based on your config is dfs.replication. It should be no greater than the total number of datanodes/regionservers.</p>
<pre>&lt;?xml version="1.0"?&gt;
&lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&gt;

&lt;configuration&gt;
   &lt;property&gt;
     &lt;name&gt;dfs.client.block.write.retries&lt;/name&gt;
     &lt;value&gt;3&lt;/value&gt;
   &lt;/property&gt;

   &lt;property&gt;
     &lt;name&gt;dfs.replication&lt;/name&gt;
     &lt;value&gt;3&lt;/value&gt;
   &lt;/property&gt;
&lt;/configuration&gt;</pre>
<p>Put the fully qualified domain name of your master in the file <em>masters</em> and the names of the datanodes in the file <em>slaves</em>.</p>
<h4>masters:</h4>
<pre>domU-12-31-39-06-9D-51.compute-1.internal</pre>
<h4>slaves:</h4>
<pre>domU-12-31-39-06-9D-C1.compute-1.internal
domU-12-31-39-06-9D-51.compute-1.internal</pre>
<p>We did not change any of the other files so far.</p>
<p>Now copy these files to the data-nodes:</p>
<pre>for host in `cat slaves`
do
  echo $host
  scp slaves masters hdfs-site.xml hadoop-env.sh core-site.xml ${host}:/mnt/hadoop/conf
done</pre>
<p>Also format HDFS on the master:</p>
<pre>/mnt/hadoop/bin/hadoop namenode -format</pre>
<h3>Configuring HBase</h3>
<h4>hbase-env.sh:</h4>
<p>Similar to hadoop-env.sh, you must set JAVA_HOME:</p>
<pre>export JAVA_HOME=/usr/lib/jvm/java-6-sun</pre>
<p>and add the hadoop conf directory to the HBASE_CLASSPATH:</p>
<pre>export HBASE_CLASSPATH=/mnt/hadoop/conf</pre>
<p>And for the master you will want to say:</p>
<pre>export HBASE_MANAGES_ZK=true</pre>
<h4>hbase-site.xml:</h4>
<p>Mainly you need to define the HBase master, the HBase rootdir, and the list of ZooKeepers. We also had to bump up hbase.zookeeper.property.maxClientCnxns from the default of 30 to 300.</p>
<pre>&lt;?xml version="1.0"?&gt;
&lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&gt;
&lt;configuration&gt;
   &lt;property&gt;
     &lt;name&gt;hbase.master&lt;/name&gt;
     &lt;value&gt;domU-12-31-39-06-9D-51.compute-1.internal:60000&lt;/value&gt;
   &lt;/property&gt;

   &lt;property&gt;
     &lt;name&gt;hbase.rootdir&lt;/name&gt;
     &lt;value&gt;hdfs://domU-12-31-39-06-9D-51.compute-1.internal:50001/hbase&lt;/value&gt;
   &lt;/property&gt;
   &lt;property&gt;
     &lt;name&gt;hbase.zookeeper.quorum&lt;/name&gt;
     &lt;value&gt;domU-12-31-39-06-9D-51.compute-1.internal,domU-12-31-39-06-9D-C1.compute-1.internal,domU-12-31-39-06-9D-51.compute-1.internal&lt;/value&gt;
   &lt;/property&gt;
   &lt;property&gt;
     &lt;name&gt;hbase.cluster.distributed&lt;/name&gt;
     &lt;value&gt;true&lt;/value&gt;
   &lt;/property&gt;
   &lt;property&gt;
     &lt;name&gt;hbase.zookeeper.property.maxClientCnxns&lt;/name&gt;
     &lt;value&gt;300&lt;/value&gt;
   &lt;/property&gt;
&lt;/configuration&gt;</pre>
<p>You will also need a file called regionservers. Normally it contains the same hostnames as the Hadoop slaves file:</p>
<h4>regionservers:</h4>
<pre>domU-12-31-39-06-9D-C1.compute-1.internal
domU-12-31-39-06-9D-51.compute-1.internal</pre>
<p>Copy the files to the region-servers:</p>
<pre>for host in `cat regionservers`
do
  echo $host
  scp hbase-env.sh hbase-site.xml regionservers ${host}:/mnt/hbase/conf
done</pre>
<h3>Starting Hadoop and HBase</h3>
<p>On the master:</p>
<p>(This just starts the Hadoop File System services, not Map/Reduce services)</p>
<pre>/mnt/hadoop/bin/start-dfs.sh</pre>
<p>Then start HBase:</p>
<pre>/mnt/hbase/bin/start-hbase.sh</pre>
<p>You can shut things down by doing the reverse:</p>
<pre>/mnt/hbase/bin/stop-hbase.sh
/mnt/hadoop/bin/stop-dfs.sh</pre>
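<p>To have the daemons come up at boot you can wrap these commands in an /etc/init.d-style script. A bare sketch (assuming everything runs as root, as in this post):</p>
<pre>#!/bin/sh
# /etc/init.d/hadoop-hbase -- bare-bones sketch
case "$1" in
  start)
    /mnt/hadoop/bin/start-dfs.sh
    /mnt/hbase/bin/start-hbase.sh
    ;;
  stop)
    /mnt/hbase/bin/stop-hbase.sh
    /mnt/hadoop/bin/stop-dfs.sh
    ;;
esac</pre>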
<p>A fuller version of such init scripts is described in the <em>Ubuntu /etc/init.d style startup scripts</em> section of the earlier blog post: <a href="http://blog2.ibd.com/scalable-deployment/hadoop-hdfs-and-hbase-on-ubuntu/" target="_blank">Hadoop, HDFS and Hbase on Ubuntu &amp; Macintosh Leopard</a>.</p><p>The post <a href="https://www.ibd.com/howto/experience-installing-hbase-0-20-0-cluster-on-ubuntu-9-04-and-ec2/">Experience installing Hbase 0.20.0 Cluster on Ubuntu 9.04 and EC2</a> first appeared on <a href="https://www.ibd.com">Cognizant Transmutation</a>.</p>]]></content:encoded>
					
					<wfw:commentRss>https://www.ibd.com/howto/experience-installing-hbase-0-20-0-cluster-on-ubuntu-9-04-and-ec2/feed/</wfw:commentRss>
			<slash:comments>10</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">237</post-id>	</item>
		<item>
		<title>Want to work at a Startup with Cool Tech? (HBase, Clojure, Chef, Swarms, Javascript, Ruby &#038; Rails)</title>
		<link>https://www.ibd.com/macintosh/want-to-work-at-a-startup-with-cool-tech-hbase-clojure-chef-swarms-javascript-ruby-rails/</link>
		
		<dc:creator><![CDATA[Robert J Berger]]></dc:creator>
		<pubDate>Fri, 28 Aug 2009 18:15:01 +0000</pubDate>
				<category><![CDATA[Macintosh]]></category>
		<category><![CDATA[Opscode Chef]]></category>
		<category><![CDATA[Ruby / Rails]]></category>
		<category><![CDATA[Runa]]></category>
		<category><![CDATA[Scalable Deployment]]></category>
		<category><![CDATA[AWS]]></category>
		<category><![CDATA[Git]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[rabbitmq]]></category>
		<category><![CDATA[tweekts]]></category>
		<category><![CDATA[ubuntu]]></category>
		<guid isPermaLink="false">http://blog2.ibd.com/?p=253</guid>

					<description><![CDATA[<p>Opportunity Knocks Runa.com, the startup where I am CTO, is looking for great developers to join our small agile team. We&#8217;re an early stage, pre-series-A&#8230;</p>
<p>The post <a href="https://www.ibd.com/macintosh/want-to-work-at-a-startup-with-cool-tech-hbase-clojure-chef-swarms-javascript-ruby-rails/">Want to work at a Startup with Cool Tech? (HBase, Clojure, Chef, Swarms, Javascript, Ruby & Rails)</a> first appeared on <a href="https://www.ibd.com">Cognizant Transmutation</a>.</p>]]></description>
										<content:encoded><![CDATA[<h1 style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;"><strong>Opportunity Knocks</strong></h1>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">Runa.com, the startup where I am CTO, is looking for great developers to join our small agile team. We&#8217;re an early stage, pre-series-A startup (presently funded with strategic investments from two large corporations). Runa offers a SaaS to on-line merchant that allows them to offer dynamic product and consumer specific promotions embeded in their website. This will be a very large positive disruption to the online retailing world.</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;"><span style="text-decoration: underline;">Techie keywords:</span> <strong>clojure, hadoop, hbase, rabbitmq, erlang, chef, swarm computing, ruby, rails, javascript, amazon EC2, emacs, Macintosh, Linux, selenium, test/behavior driven development, agile, lean, XP, scalability</strong></p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">If you&#8217;re interested, email  <a href="mailto:jobs@runa.com">jobs@runa.com</a></p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">If you want to know more, read on!</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<h1 style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;"><strong>What do we do</strong></h1>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">Runa aims to provide the top of the long tail thru the middle of the top 500 online retailers with tools/services that companies like amazon.com use/provide. These smaller guys can&#8217;t afford or don&#8217;t have the resources to do anything on that scale, but by using our SaaS services, they can make more money while providing customers with greater value.</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">The first service we&#8217;re building is what we call Dynamic Sale Price.</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">It&#8217;s a simple concept &#8211; it allows the online-retailer to offer a sale price for each product on his site, personalized to the individual consumer who is browsing it. By using this service, merchants are able to &#8211;</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<ul>
<li>Increase conversion (get them to buy!) and</li>
<li>Offer consumers a special price which maximizes the merchant&#8217;s profit</li>
</ul>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">This is different from &#8220;dumb-discounting&#8221; where something is marked-down, and everyone sees the same price. This service is more like airline or hotel pricing which varies from day to day, but much more dynamic and real-time. Further, it is based on broad statistical factors AND individual consumer behavior. After all, if you lower prices enough, consumers will buy. Instead, we dynamically lower prices to a point where statistically, that consumer is most likely to buy.</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<h1 style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;"><strong>How we do it</strong></h1>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">Runa does this by performing statistical analysis and pattern recognition of what consumers are doing on the merchant sites. This includes browsing products on various pages, adding and removing items from carts, and purchasing or abandoning the carts. We track consumers as they browse, and collect vast quantities of this click-stream data. By mining this data and applying algorithms to determine a price point per consumer based on their behavior, we&#8217;re able to  maximize both conversion (getting the consumer to buy) AND merchant profit.</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">We also offer the merchant comprehensive reports based on analysis of the mountains of data we collect. Since the data tracks consumer activity down to the individual product SKU level (for each individual consumer), we can provide very rich analytics.  This is a tool that merchants need today, but don&#8217;t have the resources to build for themselves.</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<h1 style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;"><strong>The business model</strong></h1>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">For reference, it is useful to understand the affiliate marketing space. Small-to-medium merchants (our target audience) pay affiliates up to 40% of a sale price. Yes, 40%. The average is in the 20% range.</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">We charge our merchants around 10% of sales the Runa delivers. Our merchants are happy to pay it, because it is a performance-based pay, lower than what they pay affiliates, and there is zero up-front cost to the service. In fact, the above mentioned analytics reports are free.</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">We&#8217;re targeting e-commerce PLATFORMS (as opposed to individual merchants); in this way, we&#8217;re able to scale up merchant-acquisition. We have 10 early-customer merchants right now, with about 100 more planned to go live in the next 2-3 months. By the end of next year, we&#8217;re targeting about 1,000 merchants and 10,000 merchants the following year. Our channel deployment model makes these goals achievable.</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">At something like a 5 to 10% service charge, and a typical merchant having between 500K to 1M in sales per year, this is a VERY profitable business model. That is, of course, if we&#8217;re successful&#8230; but we&#8217;re seeing very positive signs so far.</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<h1 style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;"><strong>Technology</strong></h1>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">Most of our front-end stuff (like the merchant-dashboard, reports, campaign management) is built with Ruby on Rails. Our merchant integration requires browser-side Javascript magic. All our analytics (batch-processing) and real-time pricing services are written in Clojure. We use RabbitMQ for all our messaging needs. We store data in HBase. We&#8217;re deployed on Amazon&#8217;s EC2.</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">Here are a few blog postings about what we&#8217;ve been up to &#8211;</p>
<p><a href="http://s-expressions.com/2009/05/02/startup-logbook-distributed-clojure-system-in-production-v02/" target="_blank">Distributed Clojure system in production</a><br />
<a href="http://s-expressions.com/2009/04/12/using-messaging-for-scalability/" target="_blank">Using messaging for scalability</a><br />
<a href="http://s-expressions.com/2009/03/31/capjure-a-simple-hbase-persistence-layer/" target="_blank">Capjure: a simple HBase persistence layer</a><br />
<a href="http://s-expressions.com/2009/01/28/startup-logbook-clojure-in-production-release-v01/" target="_blank">Clojure in production<br />
</a><span style="color: #0000ee; "><span style="text-decoration: underline;"><a href="http://blog2.ibd.com/scalable-deployment/experience-installing-hbase-0-20-0-cluster-on-ubuntu-9-04-and-ec2/" target="_blank">Experience installing Hbase 0.20.0 Cluster on Ubuntu 9.04 and EC2</a></span></span></p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">We&#8217;ve also open-sourced a few of our projects &#8211;</p>
<p><a href="http://github.com/amitrathore/swarmiji/tree/master" target="_blank">swarmiji</a> &#8211; A distributed computing system to write and run Clojure code in parallel, across CPUs<br />
<a href="http://github.com/amitrathore/capjure/tree/master" target="_blank">capjure</a> &#8211; Clojure persistence for HBase</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<h1 style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;"><strong>Culture at Runa</strong></h1>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">We&#8217;re a small team, very passionate about what we do. We&#8217;re focused on delivering a ground-breaking, disruptive service that will allow merchants to really change the way they sell online. We work start-up hours, but we&#8217;re flexible and laid-back about it. We know that a healthy personal life is important for a good professional life. We work with each other to support it.</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">We use an agile process with a lot of influences from the &#8220;Lean&#8221;:http://en.wikipedia.org/wiki/Lean_software_development and &#8220;Kanban&#8221;:http://leansoftwareengineering.com/2007/08/29/kanban-systems-for-software-development/ world. We use &#8220;Mingle&#8221;:http://studios.thoughtworks.com/mingle-agile-project-management to run our development process. Everything, OK mostly everything <img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /> is covered by automated tests, so we can change things as needed.</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">We&#8217;re all Apple in the office &#8211; developers get a MacPro with a nice 30&#8243; screen, and a nice 17&#8243; MacBook Pro.  We deploy on Ubuntu servers.  Aeron chairs are cliché, yes; but, very comfy.</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">The environment is chilled out&#8230; you can wear shorts and sandals to work&#8230;  Very flat organization, very non-bureaucratic&#8230; nice open spaces (no cubes!). Lunch is brought in on most days! Beer and snacks are always in the fridge.</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">We&#8217;re walking distance to the San Antonio Caltrain station (biking distance from the Mountain View Caltrain/VTA lightrail station).</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<h1 style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;"><strong>What&#8217;s in it for you</strong></h1>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<ul>
<li>Competitive salaries, and lots of stock-options</li>
<li>Cutting edge technology stack</li>
<li>Fantastic business opportunity, and early-stage (= great time to join!)</li>
<li>Developer #5 &#8211; means plenty of influence on foundational architecture and design</li>
<li>Smart, full bandwidth, fun people to work with</li>
<li>Very comfortable, nice office environment</li>
<li>We have a &#8220;No Assholes&#8221; policy</li>
</ul>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<h1 style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;"><strong>OK!</strong></h1>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana; min-height: 15.0px;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">So, if you&#8217;re interested, email us at <a href="mailto:jobs@runa.com">jobs@runa.com</a></p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">No recruiters please!</p>
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">
<p style="margin: 0.0px 0.0px 0.0px 0.0px; line-height: 14.0px; font: 12.0px Verdana;">We would prefer folks who are already in the Bay Area (but if you not local and are really great let&#8217;s talk!)</p>
<p>The post <a href="https://www.ibd.com/macintosh/want-to-work-at-a-startup-with-cool-tech-hbase-clojure-chef-swarms-javascript-ruby-rails/">Want to work at a Startup with Cool Tech? (HBase, Clojure, Chef, Swarms, Javascript, Ruby & Rails)</a> first appeared on <a href="https://www.ibd.com">Cognizant Transmutation</a>.</p>]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">253</post-id>	</item>
		<item>
		<title>Hadoop, HDFS and Hbase on Ubuntu &#038; Macintosh Leopard</title>
		<link>https://www.ibd.com/runa/hadoop-hdfs-and-hbase-on-ubuntu/</link>
					<comments>https://www.ibd.com/runa/hadoop-hdfs-and-hbase-on-ubuntu/#comments</comments>
		
		<dc:creator><![CDATA[Robert J Berger]]></dc:creator>
		<pubDate>Tue, 06 Jan 2009 02:19:16 +0000</pubDate>
				<category><![CDATA[Runa]]></category>
		<category><![CDATA[Scalable Deployment]]></category>
		<category><![CDATA[Sysadmin]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[ubuntu]]></category>
		<guid isPermaLink="false">http://blog2.ibd.com/?p=95</guid>

					<description><![CDATA[<p>UPDATE: This has been replaced by a newer post Experience installing Hbase 0.20.0 Cluster on Ubuntu 9.04 and EC2 . I found that using the&#8230;</p>
<p>The post <a href="https://www.ibd.com/runa/hadoop-hdfs-and-hbase-on-ubuntu/">Hadoop, HDFS and Hbase on Ubuntu & Macintosh Leopard</a> first appeared on <a href="https://www.ibd.com">Cognizant Transmutation</a>.</p>]]></description>
										<content:encoded><![CDATA[<p><strong>UPDATE: </strong>This has been replaced by a newer post <a href="http://blog2.ibd.com/scalable-deployment/experience-installing-hbase-0-20-0-cluster-on-ubuntu-9-04-and-ec2/" target="_blank">Experience installing Hbase 0.20.0 Cluster on Ubuntu 9.04 and EC2</a> . I found that using the pre-built distributions of Hadoop and HBase much better than trying to build from source. I need more Java/Ant-fu to do the build from scratch. The HBase-0.20.0 Release Candidates are really great and seemingly easier to get the cluster going than previous releases.</p>
<h2>Introduction</h2>
<p>Hadoop and MapReduce are all the rage nowadays, so we figure we should be using them too.</p>
<p>HBase is an implementation of Google&#8217;s Bigtable. It&#8217;s built on top of the Hadoop Distributed File System (HDFS).</p>
<p>It&#8217;s trivial to install HBase standalone on top of a local filesystem, but I had some difficulty getting it working on top of HDFS in the &#8220;Pseudo-Distributed&#8221; mode.</p>
<h2>Follow the Instructions</h2>
<p>I set up Hadoop with no problems following the <a href="http://hadoop.apache.org/core/docs/current/quickstart.html#PseudoDistributed">instructions on the Hadoop site</a> for Pseudo-Distributed Operation, which runs HBase on top of HDFS but with everything on one server (i.e., it&#8217;s configured pretty much like a cluster, but all the pieces are on the same host). Another helpful set of instructions is at <a href="http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Single-Node_Cluster%29">Running Hadoop On Ubuntu Linux (Single-Node Cluster)</a>.</p>
<p>I followed the <a href="http://hadoop.apache.org/hbase/docs/current/api/overview-summary.html#overview_description">HBase installation instructions</a>, also for Pseudo-Distributed Operation.</p>
<p>A few things to be aware of (a quick sanity check follows the list):</p>
<ul>
<li>Make sure that the Hadoop and HBase major version numbers are the same<br />
(I used Hadoop 0.18.2 and HBase 0.18.1)</li>
<li>Make sure that the Hadoop and HBase trees, as well as the directories and files that hold the HDFS filesystem, are owned by hadoop:hadoop (you have to create the user and group)</li>
<li>There is no need to disable IPv6, despite what some sites say</li>
</ul>
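<p>Once the trees are unpacked as laid out below, a quick (hypothetical) sanity check for the first two points:</p>
<pre># Sanity check (assumes the /usr/local layout described below)
/usr/local/hadoop/bin/hadoop version
ls -ld /usr/local/hadoop /usr/local/hbase /var/hadoop_datastore</pre>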
<p>You can download the Hadoop tar file from <a href="http://www.apache.org/dyn/closer.cgi/hadoop/core/" target="_blank">http://www.apache.org/dyn/closer.cgi/hadoop/core/</a> and the HBase tar file from <a href="http://www.apache.org/dyn/closer.cgi/hadoop/hbase/" target="_blank">http://www.apache.org/dyn/closer.cgi/hadoop/hbase/</a>. They are also available as git repositories via:</p>
<pre>git clone git://git.apache.org/hadoop.git
git clone git://git.apache.org/hbase.git</pre>
<p>You can track a particular branch with the following commands (we&#8217;re stuck at Hadoop 0.19.1 / HBase 0.19.0):</p>
<pre>cd hadoop
git branch --track release-0.19.1 origin/tags/release-0.19.1
git checkout release-0.19.1
cd ../hbase
git branch --track 0.19.0 origin/tags/0.19.0
git checkout 0.19.0</pre>
<p>Then build in each directory. As far as I can tell you just need the default ant build, but you can build the jar as well:</p>
<pre>cd ../hadoop
ant
ant jar</pre>
<pre>cd ../hbase
ant
ant jar</pre>
<h2>Biggest Problem I Had</h2>
<p>The thing that took the longest to get right was accessing HBase from other hosts. You would think you could put the DNS Fully Qualified Domain Name (FQDN) in the config file. It turns out that, by default, the Hadoop tools don&#8217;t seem to use the host&#8217;s DNS resolver, just what is in /etc/hosts (as far as I can tell). So you have to use the IP address in the config file.</p>
<p>I believe there are ways to configure around this, but I haven&#8217;t found them yet.</p>
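<p>Since the tools appear to consult only /etc/hosts, a workaround I have not verified is to pin the master&#8217;s name to its address there on every node that needs to reach it (the hostname below is a placeholder):</p>
<pre># Untested workaround: map the master's FQDN to its IP in /etc/hosts
echo "192.168.10.50   master.example.com" &gt;&gt; /etc/hosts</pre>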
<h2>Configuration Examples</h2>
<h2>File System Layout</h2>
<p>I untarred the distributions into /usr/local/pkgs, made symbolic links to /usr/local/hadoop and /usr/local/hbase, and created the directory that Hadoop/HDFS will use for storage.</p>
<p>For Ubuntu:</p>
<pre>sudo addgroup hadoop
sudo adduser --ingroup hadoop hadoop</pre>
<p>For Mac:</p>
<p>Create a Home Directory</p>
<pre>mkdir /Users/_hadoop</pre>
<p>Find an unused groupid by seeing what ids are already in use:</p>
<pre>sudo dscl . -list /Groups PrimaryGroupID | cut -c 32-34 | sort -rn</pre>
<p>Then find an unused userid by seeing what userid&#8217;s are in use:</p>
<pre>sudo dscl . -list /Users UniqueID | cut -c 20-22 | sort -rn</pre>
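<p>Or, as a convenience (just a sketch), take the highest id currently in use and add one:</p>
<pre>sudo dscl . -list /Users UniqueID | awk '{print $2}' | sort -n | tail -1</pre>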
<p>Pick a number that is in neither list. In our case we will use 402 for both the userid and groupid for _hadoop (Mac OS X puts an underscore in front of daemon user/group names). We will also append the record name hadoop, without the underscore, so the account can be referred to either way:</p>
<pre>sudo dscl . -create /Groups/_hadoop PrimaryGroupID 402
sudo dscl . -append /Groups/_hadoop RecordName hadoop</pre>
<p>Use the same number (402 in our case) as the UniqueID and PrimaryGroupID in the following commands:</p>
<pre>sudo dscl . -create /Users/_hadoop UniqueID 402
sudo dscl . -create /Users/_hadoop RealName "Hadoop Service"
sudo dscl . -create /Users/_hadoop PrimaryGroupID 402
sudo dscl . -create /Users/_hadoop NFSHomeDirectory /Users/_hadoop
sudo dscl . -append /Users/_hadoop RecordName hadoop</pre>
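<p>You can verify the new account took by reading it back (the attributes should match what was created above):</p>
<pre>sudo dscl . -read /Users/_hadoop UniqueID PrimaryGroupID NFSHomeDirectory RecordName</pre>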
<p>For both Ubuntu and Mac (note that on the Mac the user/group will be named _hadoop):</p>
<pre>cd /usr/local/pkgs
tar xzf hadoop-0.18.2.tar.gz
tar xzf hbase-0.18.1.tar.gz

cd ..
ln -s /usr/local/pkgs/hadoop-0.18.2 hadoop
ln -s /usr/local/pkgs/hbase-0.18.1 hbase
mkdir /var/hadoop_datastore
chown -R hadoop:hadoop hadoop/ hbase/ /var/hadoop_datastore
# On the Mac, also: chown -R hadoop:hadoop /Users/_hadoop</pre>
<h2>Hadoop Config files</h2>
<p>The following are all in /usr/local/hadoop/conf</p>
<h4>hadoop-env.sh</h4>
<p>You need to set the JAVA_HOME variable. I installed Java 6 via Synaptic. You can also install it with:</p>
<pre>apt-get install sun-java6-jdk</pre>
<p>The Macintosh is easy if you have an Intel Core 2 Duo (the original Intel Core Duo doesn&#8217;t count); Apple only supports Java 1.6 on their 64-bit processors. If you have a 32-bit processor, like the first-generation 17&#8243; MacBook Pro or first-generation Mac Mini, or you have a PPC, see <a href="http://wiki.netbeans.org/JavaFXAndJDK6On32BitMacOS" target="_blank">Tech Tip: How to Set Up JDK 6 and JavaFX on 32-bit Intel Macs</a></p>
<p>So my config is (only the things I changed, the rest was left as is):</p>
<pre>...
# The java implementation to use.  Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
 export JAVA_HOME=/usr/lib/jvm/java-6-sun
...</pre>
<p>For the Macintosh:</p>
<pre>export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/Current</pre>
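<p>On either platform, a quick check that JAVA_HOME points at a working JDK (Ubuntu path shown):</p>
<pre>/usr/lib/jvm/java-6-sun/bin/java -version</pre>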
<h4>hadoop-site.xml</h4>
<pre>&lt;?xml version="1.0"?&gt;
&lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&gt;
&lt;!-- Put site-specific property overrides in this file. --&gt;
&lt;configuration&gt;
&lt;property&gt;
  &lt;name&gt;hadoop.tmp.dir&lt;/name&gt;
  &lt;value&gt;/var/hadoop_datastore/hadoop-${user.name}&lt;/value&gt;
  &lt;description&gt;A base for other temporary directories.&lt;/description&gt;
&lt;/property&gt;

&lt;property&gt;
  &lt;name&gt;fs.default.name&lt;/name&gt;
  &lt;value&gt;hdfs://localhost:54310&lt;/value&gt;
  &lt;description&gt;The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.&lt;/description&gt;
&lt;/property&gt;

&lt;property&gt;
  &lt;name&gt;mapred.job.tracker&lt;/name&gt;
  &lt;value&gt;localhost:54311&lt;/value&gt;
  &lt;description&gt;The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  &lt;/description&gt;
&lt;/property&gt;

&lt;property&gt;
  &lt;name&gt;dfs.replication&lt;/name&gt;
  &lt;value&gt;1&lt;/value&gt;
  &lt;description&gt;Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  &lt;/description&gt;
&lt;/property&gt;
&lt;!-- As per note in http://mail-archives.apache.org/mod_mbox/hadoop-hbase-user/200810.mbox/&lt;C20126171.post@talk.nabble.com&gt; --&gt;
&lt;property&gt;
  &lt;name&gt;dfs.datanode.socket.write.timeout&lt;/name&gt;
  &lt;value&gt;0&lt;/value&gt;
&lt;/property&gt;

&lt;property&gt;
   &lt;name&gt;dfs.datanode.max.xcievers&lt;/name&gt;
   &lt;value&gt;1023&lt;/value&gt;
&lt;/property&gt;
&lt;/configuration&gt;</pre>
<h2>HBase Config Files</h2>
<p>The following are all in /usr/local/hbase/conf</p>
<h4>hbase-env.sh</h4>
<p>Again, just need to set up JAVA_HOME:</p>
<pre>...
# The java implementation to use.  Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
export JAVA_HOME=/usr/lib/jvm/java-6-sun
...</pre>
<p>For the Macintosh:</p>
<pre>export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/Current</pre>
<h4>hbase-site.xml</h4>
<p>Here is where I wanted to give an FQDN for the host that runs the hbase.master, but I had to use an IP address instead.</p>
<pre>&lt;?xml version="1.0"?&gt;
&lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&gt;
&lt;configuration&gt;
  &lt;property&gt;
    &lt;name&gt;hbase.rootdir&lt;/name&gt;
    &lt;value&gt;hdfs://localhost:54310/hbase&lt;/value&gt;
    &lt;description&gt;The directory shared by region servers.
    Should be fully-qualified to include the filesystem to use.
    E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR
    &lt;/description&gt;
  &lt;/property&gt;

  &lt;property&gt;
    &lt;name&gt;hbase.master&lt;/name&gt;
    &lt;value&gt;192.168.10.50:60000&lt;/value&gt;
    &lt;description&gt;The host and port that the HBase master runs at.
    &lt;/description&gt;
  &lt;/property&gt;
&lt;/configuration&gt;</pre>
<h2>Formatting the Name Node</h2>
<p>You must do this as the same user that will be running the daemon (hadoop):</p>
<pre>su hadoop -s /bin/sh -c "/usr/local/hadoop/bin/hadoop namenode -format"</pre>
<p>on the Mac:</p>
<pre>/usr/bin/su _hadoop /usr/local/hadoop/bin/hadoop namenode -format</pre>
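<p>If the format succeeded, the name node&#8217;s image directory should now exist under the datastore (the path follows from hadoop.tmp.dir above and the default dfs.name.dir):</p>
<pre>ls -l /var/hadoop_datastore/hadoop-hadoop/dfs/name</pre>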
<h2>Setup passphraseless ssh</h2>
<p>Now check that you can ssh to the localhost without a passphrase:</p>
<pre>su - hadoop
ssh localhost</pre>
<p>If you cannot ssh to localhost without a passphrase, execute the following commands (as hadoop):</p>
<pre>$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub &gt;&gt; ~/.ssh/authorized_keys</pre>
<h2>Ubuntu /etc/init.d style startup scripts</h2>
<p>I scoured the InterTubes for example hadoop/hbase startup scripts and found absolutely none! I ended up creating minimal ones that are so far only suited to the Pseudo-Distributed Operation mode, as they just call the start-all / stop-all scripts.</p>
<h4>/etc/init.d/hadoop</h4>
<p>Create the place it will put its startup logs:</p>
<pre>mkdir /var/log/hadoop</pre>
<p>Create /etc/init.d/hadoop with the following:</p>
<pre>#!/bin/sh
### BEGIN INIT INFO
# Provides:          hadoop services
# Required-Start:    $network
# Required-Stop:     $network
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Description:       Hadoop services
# Short-Description: Enable Hadoop services including hdfs
### END INIT INFO
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
HADOOP_BIN=/usr/local/hadoop/bin
NAME=hadoop
DESC=hadoop
USER=hadoop
ROTATE_SUFFIX=
test -x $HADOOP_BIN || exit 0
RETVAL=0
set -e
cd /

start_hadoop () {
    set +e
    su $USER -s /bin/sh -c $HADOOP_BIN/start-all.sh &gt; /var/log/hadoop/startup_log
    case "$?" in
      0)
        echo SUCCESS
        RETVAL=0
        ;;
      1)
        echo TIMEOUT - check /var/log/hadoop/startup_log
        RETVAL=1
        ;;
      *)
        echo FAILED - check /var/log/hadoop/startup_log
        RETVAL=1
        ;;
    esac
    set -e
}

stop_hadoop () {
    set +e
    if [ $RETVAL = 0 ] ; then
        su $USER -s /bin/sh -c $HADOOP_BIN/stop-all.sh &gt; /var/log/hadoop/shutdown_log
        RETVAL=$?
        if [ $RETVAL != 0 ] ; then
            echo FAILED - check /var/log/hadoop/shutdown_log
        fi
    else
        echo No nodes running
        RETVAL=0
    fi
    set -e
}

restart_hadoop() {
    stop_hadoop
    start_hadoop
}

case "$1" in
    start)
        echo -n "Starting $DESC: "
        start_hadoop
        echo "$NAME."
        ;;
    stop)
        echo -n "Stopping $DESC: "
        stop_hadoop
        echo "$NAME."
        ;;
    force-reload|restart)
        echo -n "Restarting $DESC: "
        restart_hadoop
        echo "$NAME."
        ;;
    *)
        echo "Usage: $0 {start|stop|restart|force-reload}" &gt;&amp;2
        RETVAL=1
        ;;
esac
exit $RETVAL</pre>
<h4>/etc/init.d/hbase</h4>
<p>Create the place it will put its startup logs:</p>
<pre>mkdir /var/log/hbase</pre>
<p>Create /etc/init.d/hbase with the following:</p>
<pre>#!/bin/sh
### BEGIN INIT INFO
# Provides:          hbase services
# Required-Start:    $network
# Required-Stop:     $network
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Description:       Hbase services
# Short-Description: Enable Hbase services including hdfs
### END INIT INFO

PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
HBASE_BIN=/usr/local/hbase/bin
NAME=hbase
DESC=hbase
USER=hadoop
ROTATE_SUFFIX=
test -x $HBASE_BIN || exit 0
RETVAL=0
set -e
cd /

start_hbase () {
    set +e
    su $USER -s /bin/sh -c $HBASE_BIN/start-hbase.sh &gt; /var/log/hbase/startup_log
    case "$?" in
      0)
        echo SUCCESS
        RETVAL=0
        ;;
      1)
        echo TIMEOUT - check /var/log/hbase/startup_log
        RETVAL=1
        ;;
      *)
        echo FAILED - check /var/log/hbase/startup_log
        RETVAL=1
        ;;
    esac
    set -e
}

stop_hbase () {
    set +e
    if [ $RETVAL = 0 ] ; then
        su $USER -s /bin/sh -c $HBASE_BIN/stop-hbase.sh &gt; /var/log/hbase/shutdown_log
        RETVAL=$?
        if [ $RETVAL != 0 ] ; then
            echo FAILED - check /var/log/hbase/shutdown_log
        fi
    else
        echo No nodes running
        RETVAL=0
    fi
    set -e
}

restart_hbase() {
    stop_hbase
    start_hbase
}

case "$1" in
    start)
        echo -n "Starting $DESC: "
        start_hbase
        echo "$NAME."
        ;;
    stop)
        echo -n "Stopping $DESC: "
        stop_hbase
        echo "$NAME."
        ;;
    force-reload|restart)
        echo -n "Restarting $DESC: "
        restart_hbase
        echo "$NAME."
        ;;
    *)
        echo "Usage: $0 {start|stop|restart|force-reload}" &gt;&amp;2
        RETVAL=1
        ;;
esac
exit $RETVAL</pre>
<h4>Set up the init system</h4>
<p>This assumes you put the above init files in /etc/init.d. Give hbase a later start sequence than hadoop&#8217;s default of 20 so it starts after HDFS, and an earlier stop sequence so it shuts down first:</p>
<pre>chmod +x /etc/init.d/{hbase,hadoop}
update-rc.d hadoop defaults
update-rc.d hbase defaults 25 15</pre>
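<p>You can confirm the generated start ordering (runlevel 2 shown):</p>
<pre>ls /etc/rc2.d | grep -E 'hadoop|hbase'
# S20hadoop should sort before S25hbase</pre>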
<p>You can now start / stop hadoop by saying:</p>
<pre>/etc/init.d/hadoop start</pre>
<pre>/etc/init.d/hadoop stop</pre>
<p>And similarly with hbase</p>
<pre>/etc/init.d/hbase start</pre>
<pre>/etc/init.d/hbase stop</pre>
<p>Make sure you start hadoop before hbase, and stop hbase before you stop hadoop.</p>
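<p>A convenience pairing that respects that ordering:</p>
<pre>/etc/init.d/hadoop start &amp;&amp; /etc/init.d/hbase start
/etc/init.d/hbase stop &amp;&amp; /etc/init.d/hadoop stop</pre>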
<h2>Macintosh launchd style startup</h2>
<p>Starting processes on Macintosh Leopard is pretty easy with launchd/launchctl.</p>
<p>For hadoop, create a file /Library/LaunchAgents/com.yourdomain.hadoop.plist with the following content (replace yourdomain with the domain you want to use for this class of apps):</p>
<pre><code>&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd"&gt;
&lt;plist version="1.0"&gt;
&lt;dict&gt;
    &lt;key&gt;GroupName&lt;/key&gt;
    &lt;string&gt;_hadoop&lt;/string&gt;
    &lt;key&gt;KeepAlive&lt;/key&gt;
    &lt;true/&gt;
    &lt;key&gt;Label&lt;/key&gt;
    &lt;string&gt;com.yourdomain.hadoop&lt;/string&gt;
    &lt;key&gt;ProgramArguments&lt;/key&gt;
    &lt;array&gt;
        &lt;string&gt;/usr/local/hadoop/bin/start-all.sh&lt;/string&gt;
    &lt;/array&gt;
    &lt;key&gt;RunAtLoad&lt;/key&gt;
    &lt;true/&gt;
    &lt;key&gt;ServiceDescription&lt;/key&gt;
    &lt;string&gt;Hadoop Process&lt;/string&gt;
    &lt;key&gt;UserName&lt;/key&gt;
    &lt;string&gt;_hadoop&lt;/string&gt;
&lt;/dict&gt;
&lt;/plist&gt;
</code></pre>
<p>And for hbase, /Library/LaunchAgents/com.yourdomain.hbase.plist:</p>
<pre><code>&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd"&gt;
&lt;plist version="1.0"&gt;
&lt;dict&gt;
	&lt;key&gt;GroupName&lt;/key&gt;
	&lt;string&gt;_hadoop&lt;/string&gt;
	&lt;key&gt;KeepAlive&lt;/key&gt;
	&lt;true/&gt;
	&lt;key&gt;Label&lt;/key&gt;
	&lt;string&gt;com.yourdomain.hbase&lt;/string&gt;
	&lt;key&gt;ProgramArguments&lt;/key&gt;
	&lt;array&gt;
		&lt;string&gt;/usr/local/hbase/bin/start-hbase.sh&lt;/string&gt;
	&lt;/array&gt;
	&lt;key&gt;RunAtLoad&lt;/key&gt;
	&lt;true/&gt;
	&lt;key&gt;UserName&lt;/key&gt;
	&lt;string&gt;_hadoop&lt;/string&gt;
&lt;/dict&gt;
&lt;/plist&gt;
</code></pre>
<p>Set the owner to root and the mode to 644:</p>
<pre>chown root /Library/LaunchAgents/com.yourdomain.hadoop.plist /Library/LaunchAgents/com.yourdomain.hbase.plist
chmod 644 /Library/LaunchAgents/com.yourdomain.hadoop.plist /Library/LaunchAgents/com.yourdomain.hbase.plist</pre>
<p>The next time you restart, it should start hbase and hadoop. You can also start them manually with the commands:</p>
<pre>sudo launchctl load /Library/LaunchAgents/com.yourdomain.hadoop.plist
sudo launchctl load /Library/LaunchAgents/com.yourdomain.hbase.plist</pre>
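<p>To stop them manually, unload in the reverse order (hbase first):</p>
<pre>sudo launchctl unload /Library/LaunchAgents/com.yourdomain.hbase.plist
sudo launchctl unload /Library/LaunchAgents/com.yourdomain.hadoop.plist</pre>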
<h2>Conclusion</h2>
<p>You should now be able to see the HBase web interface at http://&lt;your domain name&gt;:60010</p>
<p>If you have problems check /var/log/{hbase,hadoop}/startup_log as well as /usr/local/hadoop/logs/hadoop-hadoop-namenode-yourhostname.log and /usr/local/hbase/logs/hbase-hadoop-master-yourhostname.log</p>
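<p>Beyond the web interface, one way to sanity-check HDFS itself from the command line (run as the hadoop user, using the layout above):</p>
<pre>su hadoop -s /bin/sh -c "/usr/local/hadoop/bin/hadoop dfsadmin -report"
su hadoop -s /bin/sh -c "/usr/local/hadoop/bin/hadoop dfs -ls /"</pre>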
<p>The error messages are pretty poor (i.e., useless, as far as I could tell, when tracking down the FQDN/IP address problem), but better than nothing.</p>
<p>I will post an update when I deploy a Full Cluster.</p><p>The post <a href="https://www.ibd.com/runa/hadoop-hdfs-and-hbase-on-ubuntu/">Hadoop, HDFS and Hbase on Ubuntu & Macintosh Leopard</a> first appeared on <a href="https://www.ibd.com">Cognizant Transmutation</a>.</p>]]></content:encoded>
					
					<wfw:commentRss>https://www.ibd.com/runa/hadoop-hdfs-and-hbase-on-ubuntu/feed/</wfw:commentRss>
			<slash:comments>8</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">95</post-id>	</item>
		<item>
		<title>The Commoditization of Massive Data Analysis</title>
		<link>https://www.ibd.com/scalable-deployment/the-commoditization-of-massive-data-analysis/</link>
					<comments>https://www.ibd.com/scalable-deployment/the-commoditization-of-massive-data-analysis/#comments</comments>
		
		<dc:creator><![CDATA[Robert J Berger]]></dc:creator>
		<pubDate>Thu, 20 Nov 2008 07:26:24 +0000</pubDate>
				<category><![CDATA[Scalable Deployment]]></category>
		<category><![CDATA[analytics]]></category>
		<category><![CDATA[Cloud Computing]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<guid isPermaLink="false">http://blog2.ibd.com/?p=39</guid>

					<description><![CDATA[<p>Today&#8217;s article in O&#8217;Reilly&#8217;s Radar by Joseph Hellerstein, is a concise synopsis of the state-of-the-art large scale data analysis. It compares the Enterprise IT dominant&#8230;</p>
<p>The post <a href="https://www.ibd.com/scalable-deployment/the-commoditization-of-massive-data-analysis/">The Commoditization of Massive Data Analysis</a> first appeared on <a href="https://www.ibd.com">Cognizant Transmutation</a>.</p>]]></description>
										<content:encoded><![CDATA[<p>Today&#8217;s <a href="http://radar.oreilly.com/2008/11/the-commoditization-of-massive.html">article</a> in O&#8217;Reilly&#8217;s Radar by <a href="http://db.cs.berkeley.edu/jmh">Joseph Hellerstein</a>, is a concise synopsis of the state-of-the-art large scale data analysis. It compares the Enterprise IT dominant Relational Database paradigm to the emerging (with a bullet!) <a href="http://en.wikipedia.org/wiki/MapReduce">MapReduce</a> / <a href="http://hadoop.apache.org/">Hadoop</a> technologies.<img decoding="async" loading="lazy" class="alignleft size-full wp-image-38" src="https://i0.wp.com/www.ibd.com/wp-content/uploads/2008/11/8d67d34f-6a0d-43ac-9322-9f64e3dc981d.jpg?resize=300%2C71" alt="Hadoop Logo" width="300" height="71" data-recalc-dims="1" /></p>
<p>Professor Hellerstein, from UC Berkeley lives this stuff as a leading researcher on databases and distributed systems.  He is also an advisor to Greenplum, one of the start-ups mentioned in the article that is involved in commercializing MapReduce Tech and writes the <a href="http://databeta.wordpress.com/"> data beta blog</a>.</p>
<p><img decoding="async" loading="lazy" class="alignleft size-full wp-image-35" src="https://i0.wp.com/www.ibd.com/wp-content/uploads/2008/11/862cd098-9190-49d8-97dc-d026b8f0c83c.jpg?resize=173%2C52" alt="Greeplum Logo" width="173" height="52" data-recalc-dims="1" /></p>
<p><img decoding="async" loading="lazy" class="size-full wp-image-34 alignright" src="https://i0.wp.com/www.ibd.com/wp-content/uploads/2008/11/27f6fea8-abdd-4e43-9602-c75e6a39b568.jpg?resize=100%2C87" alt="Aster Logo" width="100" height="87" data-recalc-dims="1" /></p>
<p>The article discusses how some companies (and they are companies, with proprietary tech and nary a free download link on their home pages), such as <a href="http://www.asterdata.com/index.php">Aster Data</a> and <a href="http://www.greenplum.com/">Greenplum</a>, are promoting hybrid Relational Database / MapReduce data warehouse products. These may get some traction in the Enterprise, but with any success they will eventually get squashed and/or assimilated by Oracle and thus stay in the IT realm (IMHO).</p>
<p>The more interesting space is the multiverse of open source tools that are <img decoding="async" loading="lazy" class="alignleft  wp-image-36" src="https://i0.wp.com/www.ibd.com/wp-content/uploads/2008/11/ecb3533e-ec6a-46f7-be6f-1b5f3991e815.jpg?resize=180%2C120" alt="Pig" width="180" height="120" data-recalc-dims="1" /> pushing the evolution of the underlying Hadoop MapReduce, as well as the growing set of tools being layered on top of Hadoop such as <a href="http://wiki.apache.org/hadoop/Hive">Hive</a>, originally developed by <a href="http://www.facebook.com/note.php?note_id=16121578919">Facebook Engineering</a>, and <a href="http://research.yahoo.com/node/90">Pig</a>, started by Yahoo Research. Both are sets of tools, including a query language interface, for doing ad-hoc analysis of massive data sets.</p>
<p>Hellerstein calls all of this a <a href="http://www.cccblog.org/2008/10/20/the-data-centric-gambit/"><em>renaissance in computer science research</em></a> and urges folks to look toward standardizing the upper layers of the Hadoop hierarchy, particularly the query language.</p>
<blockquote><p>There is a debate brewing among data systems cognoscenti as to the best way to do data analysis at this scale. The old guard in the Enterprise IT camp tends to favor relational databases and the SQL language, while the web upstarts have rallied around the MapReduce programming model popularized at Google, and cloned in open source as Apache Hadoop. Hadoop is in wide use at companies like Yahoo! and Facebook, and gets a lot of attention in tech blogs as the next big open source project. But if you mention Hadoop in a corporate IT shop you are often met with blank stares &#8212; SQL is ubiquitous in those environments. There is still a surprising disconnect between these developer communities, but I expect that to change over the next year or two.</p>
<p>We are at the beginning of what I call The Industrial Revolution of Data. We&#8217;re not quite there yet, since most of the digital information available today is still individually &#8220;handmade&#8221;: prose on web pages, data entered into forms, videos and music edited and uploaded to servers. But we are starting to see the rise of automatic data generation &#8220;factories&#8221; such as software logs, UPC scanners, RFID, GPS transceivers, video and audio feeds. These automated processes can stamp out data at volumes that will quickly dwarf the collective productivity of content authors worldwide.</p>
<p>Meanwhile, disk capacities are growing exponentially, so the cost of archiving this data remains modest. And there are plenty of reasons to believe that this data has value in a wide variety of settings. The last step of the revolution is the commoditization of data analysis software, to serve a broad class of users.</p>
<p>-snip-</p></blockquote><p>The post <a href="https://www.ibd.com/scalable-deployment/the-commoditization-of-massive-data-analysis/">The Commoditization of Massive Data Analysis</a> first appeared on <a href="https://www.ibd.com">Cognizant Transmutation</a>.</p>]]></content:encoded>
					
					<wfw:commentRss>https://www.ibd.com/scalable-deployment/the-commoditization-of-massive-data-analysis/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">39</post-id>	</item>
	</channel>
</rss>
