Setting up a Raspberry Pi Hadoop Cluster
This is a “how to” step-by-step guide to set up a Raspberry Pi Hadoop Cluster.
Why?
This should probably be the question you have in mind: why would I want to build a Hadoop cluster? Here are a few reasons:
- Big Data has been a buzzword for nearly a decade. This article might be a useful introduction to an amazing journey into this industry.
- The cluster's hardware can be reused in other projects.
- You will touch on interesting topics such as networking, ssh or security.
- Pis are always fun to play with!
Santa’s hardware list
- 3x Raspberry Pi 3B+
- 3x Micro SD card 32GB
- 3x USB type C to USB 2.0 cables
- USB charger
- Ethernet Switch
- 4x Ethernet cables
- Acrylic case
Total (approximately): 180€
Set up the hardware
- Mount the Pis in the acrylic case
- Tape the charger and the Ethernet switch to the case
- Connect the wires
Flash OS image onto SD cards
- Download an Operating System (OS) for our Pis. For this guide, we are going to use Raspbian, the official Raspberry Pi OS.
- The next step is to write the OS image to our SD card. For this, we are going to use Win32 Disk Imager and the process is pretty simple: insert the SD card into our machine, open Win32 Disk Imager, select the image file that we want to flash and click on the Write option.
- Repeat this process for the remaining 2 SD cards using the same OS image.
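If you are flashing from a Linux machine instead of Windows, dd does the same job. This is only a minimal sketch: the image name and the device /dev/sdX are placeholders and must be replaced with your actual file and SD card device (double-check the device, dd overwrites it without asking):
# replace /dev/sdX with your SD card device (check with lsblk)
$ sudo dd if=raspbian.img of=/dev/sdX bs=4M status=progress conv=fsync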
🔓Enable SSH connection
As of the November 2016 release, Raspberry Pi OS has the SSH server disabled by default. Due to this, we need to manually enable it.
- Launch Raspberry Pi Configuration from the Preferences menu
- Navigate to the Interfaces tab
- Select Enabled next to SSH
- Click OK
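If a Pi has no monitor attached, SSH can also be enabled headlessly: the official images turn the SSH server on when they find a file named ssh in the boot partition of the SD card. A minimal sketch, assuming the boot partition is mounted at /media/boot on your machine (adjust the path to wherever your system mounts it):
$ touch /media/boot/ssh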
Identify our Pis on the network
- Connect the Pi to the switch and the power source.
- To identify the Pi's IP address, type ping raspberrypi.local in our terminal (if the name does not resolve, see the scan tip below).
- Type ssh pi@{ip}. Type yes when asked if you want to continue connecting.
- We will need to type the password. By default it's raspberry.
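If raspberrypi.local does not resolve (mDNS is not always available on every network), a ping scan is a handy fallback for spotting the Pis. This sketch assumes nmap is installed and that the LAN uses the 192.168.1.0/24 range:
$ sudo nmap -sn 192.168.1.0/24
Running it with sudo also shows MAC addresses and vendor hints, which makes the Raspberry Pi entries easy to identify in the output.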
It might be useful to keep the SSH credentials in the ~/.ssh/config of our machine:
Host pi.master
HostName 192.168.1.115
User pi

Host pi.node1
HostName 192.168.1.116
User pi

Host pi.node2
HostName 192.168.1.117
User pi
(You will have different IP addresses)
Expand the file system
(this needs to be done manually on each Pi)
After the SSH connection:
- Type sudo raspi-config to open the Configuration Tool.
- Navigate to 7 Advanced Options.
- Select A1 Expand Filesystem and click on Finish.
- Reboot the Pi by typing: sudo shutdown -r now.
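If clicking through the menu on each Pi gets tedious, raspi-config also has a non-interactive mode; this is a sketch of the same step (verify that your Raspbian release supports the nonint options):
$ sudo raspi-config nonint do_expand_rootfs
$ sudo shutdown -r now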
Hostnames
By default, all of the Pis are known as raspberrypi and have a single user, pi. Since we are creating a cluster, this might become a nightmare to manage in the future… So let's simplify! We will assign each Pi a hostname based on its position.
- 1st Pi is pi.master;
- 2nd Pi is pi.node1;
- 3rd Pi is pi.node2;
Edit the following files:
(this needs to be done manually on each Pi)
/etc/hosts, add the following:
192.168.1.115 pi.master
192.168.1.116 pi.node1
192.168.1.117 pi.node2
/etc/hostname, change its value:
# using master as example
pi.master
Once completed, we also need to reboot the Pi. Now, when the terminal is opened, instead of:
pi@raspberrypi:~ $
We should see:
pi@pi.master:~ $
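If you prefer to script the rename instead of editing the files by hand, here is a minimal sketch for the master (it assumes the stock hostname is still raspberrypi; use pi.node1 and pi.node2 on the other Pis, and the three cluster entries above still need to be appended to /etc/hosts):
$ sudo sed -i 's/raspberrypi/pi.master/g' /etc/hostname /etc/hosts
$ sudo shutdown -r now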
SSH — fine tune
Setup the SSH aliases
Edit the ~/.ssh/config with the same configuration as we have used above (Identify our Pis on the network step):
Host pi.master
HostName 192.168.1.115
User pi

Host pi.node1
HostName 192.168.1.116
User pi

Host pi.node2
HostName 192.168.1.117
User pi
Setup public/private key pairs
On each Pi, run the following command:
$ ssh-keygen -t ed25519
This command generates a public/private key pair within the ~/.ssh directory, which can be used to establish secure ssh connections without the need to type a password.
ℹ️ No passphrase is needed to protect access to the key pair.
- Concatenate the public keys into the authorized_keys file:
On the slave Pis, run the following command:
$ cat ~/.ssh/id_ed25519.pub | ssh pi@192.168.1.115 'cat >> .ssh/authorized_keys'
ℹ️ 192.168.1.115 is the pi.master
IP address
On the pi.master
, run the following command:
$ cat .ssh/id_ed25519.pub >> .ssh/authorized_keys
From this moment on, the master can connect to the slaves through ssh without typing a password.
# e.g.: ssh to the master
$ ssh pi.master
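As an alternative to the cat | ssh pipeline above, ssh-copy-id does the same job and also fixes the permissions on authorized_keys; a sketch, run from each slave:
$ ssh-copy-id -i ~/.ssh/id_ed25519.pub pi@192.168.1.115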
Sync the SSH configuration
To replicate the passwordless ssh
on all the slaves, run the following commands:
$ scp ~/.ssh/authorized_keys pi.nodeX:~/.ssh/authorized_keys
$ scp ~/.ssh/config pi.nodeX:~/.ssh/config
⚠️ pi.nodeX: the X represents the index of the slave node; run both commands once per slave (a small loop that does this is sketched below).
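Since there are only two slaves, the loop is short; this sketch relies on the pi.node1 and pi.node2 aliases defined in ~/.ssh/config:
$ for node in pi.node1 pi.node2; do scp ~/.ssh/authorized_keys ~/.ssh/config $node:~/.ssh/; done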
Helpers
Create scripts
As developers, we are lazy… Let's create a few bash helper functions in our ~/.bashrc:
- Get the hostnames of the slaves:
function nodes {
  # list every "pi" entry from /etc/hosts, excluding the local hostname
  grep "pi" /etc/hosts | awk '{print $2}' | grep -v $(hostname)
}
- Run a command on the whole cluster:
function cluster_cmd {
  # run the command on every slave over ssh, then run it locally as well
  for pi in $(nodes); do ssh $pi "$@"; done
  "$@"
}
- Reboot the cluster:
function cluster_reboot {
  cluster_cmd sudo shutdown -r now
}
- Send a file to the cluster:
function cluster_scp {
  # copy a local file to the same path on every slave
  for pi in $(nodes); do
    cat $1 | ssh $pi "sudo tee $1" > /dev/null 2>&1
  done
}
Sync in the cluster
Now, it's time to also add these helpers to the entire cluster.
$ source ~/.bashrc && cluster_scp ~/.bashrc
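With the helpers loaded, a quick sanity check is to run a harmless command across the whole cluster; because of the trailing "$@" in cluster_cmd, the local Pi answers last, so the output should look something like:
$ cluster_cmd hostname
pi.node1
pi.node2
pi.master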
Java
Hadoop 3.2.0 requires Java 8. You might need to switch to this version.
- Download Java 8:
$ sudo apt install openjdk-8-jdk
- Set Java version:
$ sudo update-alternatives --config java
The default version will have a * next to it. Type a selection number and hit Enter to set a different Java version as the system default.
(this should be done manually on each Pi)
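To confirm every node ended up on Java 8, the cluster_cmd helper from earlier comes in handy; each Pi should report a 1.8.x runtime:
$ cluster_cmd java -version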
Now that the initial setup is completed… First we will create a single-node setup on pi.master and then we will set up the multi-node cluster.
Single-node setup
On pi.master
:
Download and extract Hadoop
- Download:
$ cd && wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz
- Extract:
$ sudo tar -xvvf hadoop-3.2.0.tar.gz -C /opt/
$ rm hadoop-3.2.0.tar.gz
$ cd /opt
$ sudo mv hadoop-3.2.0 hadoop
Then, we need to change the ownership of Hadoop's directory:
$ sudo chown pi:pi -R /opt/hadoop
Environment vars
Edit the following files:
~/.bashrc:
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
/opt/hadoop/etc/hadoop/hadoop-env.sh:
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
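After reloading the shell, a quick check that the variables point to the right place; the first line of the output should mention Hadoop 3.2.0:
$ source ~/.bashrc
$ hadoop version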
HDFS
To get HDFS up and running, we need to modify the following files located in /opt/hadoop/etc/hadoop:
core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://pi.master:9000</value>
</property>
</configuration>
hdfs-site.xml:
<configuration>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///opt/hadoop_tmp/hdfs/datanode</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///opt/hadoop_tmp/hdfs/namenode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
mapred-site.xml:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
yarn-site.xml:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
⚠️️ The hdfs-site.xml
defines where the DataNode
and NameNode
store the data and also sets the replication. Due to this, we also need to create these folders with the right privileges:
$ sudo mkdir -p /opt/hadoop_tmp/hdfs/datanode
$ sudo mkdir -p /opt/hadoop_tmp/hdfs/namenode
$ sudo chown pi:pi -R /opt/hadoop_tmp
Once we've edited the config files, the next step is to format the HDFS:
$ hdfs namenode -format -force
Boot HDFS and YARN with the following commands:
$ $HADOOP_HOME/sbin/start-dfs.sh
$ $HADOOP_HOME/sbin/start-yarn.sh
… time to test!
… through the command jps, it should display the following services:
ResourceManager
NameNode
Jps
SecondaryNameNode
DataNode
NodeManager
… or, by creating a temporary directory:
$ hadoop fs -mkdir /tmp
$ hadoop fs -ls /
Found 1 items
drwxr-xr-x - pi supergroup 0 2021-06-05 20:04 /tmp
Hide annoying NativeCodeLoader warnings
We might see the warning util.NativeCodeLoader: Unable to load native-hadoop library (…). This warning isn't easy to solve: we could recompile the native library from scratch for this platform, but for this tutorial it is enough to hide it.
Add the following lines to the ~/.bashrc:
export HADOOP_HOME_WARN_SUPPRESS=1
export HADOOP_ROOT_LOGGER="WARN,DRFA"
Cluster setup
At this moment, we have a single node that works as both master and slave. It's time to set up our cluster.
Create Hadoop directories
Through the following commands, we will create the mandatory folders on all slave Pis.
$ cluster_cmd sudo mkdir -p /opt/hadoop_tmp/hdfs
$ cluster_cmd sudo chown pi:pi -R /opt/hadoop_tmp
$ cluster_cmd sudo mkdir -p /opt/hadoop
$ cluster_cmd sudo chown pi:pi /opt/hadoop
Sync Hadoop config
Copy all the files from /opt/hadoop to all the slave Pis.
$ for pi in $(nodes); do rsync -avxP $HADOOP_HOME $pi:/opt/; done
Configuring Hadoop on the Cluster
Edit the following config files located in /opt/hadoop/etc/hadoop:
core-site.xml, add the following:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://pi.master:9000</value>
</property>
</configuration>
hdfs-site.xml, add the following:
<configuration>
<property>
<name>dfs.datanode.data.dir</name>
<value>/opt/hadoop_tmp/hdfs/datanode</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/opt/hadoop_tmp/hdfs/namenode</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
</configuration>
mapred-site.xml, add the following:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>256</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>128</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>128</value>
</property>
</configuration>
yarn-site.xml, add the following:
<configuration>
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>pi.master</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>900</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>900</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>64</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
</configuration>
On $HADOOP_HOME/etc/hadoop/ we need to create two files so that Hadoop can identify which node is the master and which ones are the slaves.
- Create a file named master with the single line:
pi.master
- Create a file named workers and add the slaves (one per line):
pi.node1
pi.node2
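For reference, both files can also be created straight from the shell; a small sketch assuming $HADOOP_HOME is set as above:
$ echo "pi.master" > $HADOOP_HOME/etc/hadoop/master
$ printf "pi.node1\npi.node2\n" > $HADOOP_HOME/etc/hadoop/workers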
Edit the /etc/hosts file and remove the line:
127.0.0.1 pi.master
Now, let’s copy the hosts file to all the slaves:
$ cluster_scp /etc/hosts
… reboot the cluster:
$ cluster_reboot
Start Hadoop Cluster
On pi.master
run the command:
$ hdfs namenode -format -force
Now, boot HDFS and YARN:
$ $HADOOP_HOME/sbin/start-dfs.sh
$ $HADOOP_HOME/sbin/start-yarn.sh
Is it working?
Through the command hdfs dfsadmin -report we should get output similar to the following:
Configured Capacity: 60373532672 (56.23 GB)
Present Capacity: 41064591360 (38.24 GB)
DFS Remaining: 41064542208 (38.24 GB)
DFS Used: 49152 (48 KB)
DFS Used%: 0.00%
Replicated Blocks:
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
Low redundancy blocks with highest priority to recover: 0
Pending deletion blocks: 0
Erasure Coded Block Groups:
Low redundancy block groups: 0
Block groups with corrupt internal blocks: 0
Missing block groups: 0
Low redundancy blocks with highest priority to recover: 0
Pending deletion blocks: 0
-------------------------------------------------
Live datanodes (2):

Name: 192.168.1.116:9866 (pi.node1)
Hostname: pi.node1
Decommission Status : Normal
Configured Capacity: 30186766336 (28.11 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 8399671296 (7.82 GB)
DFS Remaining: 20462870528 (19.06 GB)
DFS Used%: 0.00%
DFS Remaining%: 67.79%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sat Jun 05 21:49:42 WEST 2021
Last Block Report: Sat Jun 05 21:41:48 WEST 2021
Num of Blocks: 0

Name: 192.168.1.117:9866 (pi.node2)
Hostname: pi.node2
Decommission Status : Normal
Configured Capacity: 30186766336 (28.11 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 8260870144 (7.69 GB)
DFS Remaining: 20601671680 (19.19 GB)
DFS Used%: 0.00%
DFS Remaining%: 68.25%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sat Jun 05 21:49:42 WEST 2021
Last Block Report: Sat Jun 05 21:41:48 WEST 2021
Num of Blocks: 0
… We also have a web interface (http://pi.master:9870) where we can explore the cluster info.
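As a final smoke test, we can submit one of the example MapReduce jobs that ships with the distribution; the jar path below assumes the Hadoop 3.2.0 layout used throughout this guide:
$ yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0.jar pi 4 1000
The job estimates π with 4 map tasks and 1000 samples per map, and its progress can also be followed in the YARN ResourceManager web UI (by default on http://pi.master:8088).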
Conclusion
And… That’s it folks! Now we have installed a Raspberry Pi Hadoop cluster.
I hope this guide has been helpful to you.