Setting up a Raspberry Pi Hadoop Cluster
This is a “how to” step-by-step guide to set up a Raspberry Pi Hadoop Cluster.
Why?
This should probably be the question you have in mind: why would I want to build a Hadoop cluster? Here are a few reasons:
- Big Data has been a buzzword for nearly a decade. This article might be a useful introduction to an amazing journey into this industry.
- The cluster's hardware can be reused in other projects.
- You will touch on interesting topics such as networking, ssh or security.
- Pis are always fun to play with!
Santa’s hardware list
- 3x Raspberry Pi 3B+
- 3x Micro SD card 32GB
- 3x USB type C to USB 2.0 cables
- USB charger
- Ethernet Switch
- 4x Ethernet cables
- Acrylic case
Total (approximately): 180€
Set up the hardware
- Mount the Pis in the acrylic case
- Tape the charger and the Ethernet switch to the case
- Connect the wires
Flash OS image onto SD cards
- Download an Operating System (OS) for our Pis. For this guide, we are going to use Raspbian, the official Raspberry Pi OS.
- The next step is to write the OS image to our SD card. For this, we are going to use Win32 Disk Imager and the process is pretty simple: insert the SD card into our machine, open Win32 Disk Imager, select the image file that we want to flash and click on the Write option.
- Repeat this process for the remaining 2 SD cards using the same OS image.
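If you are flashing from a Linux machine instead of Windows, dd does the same job. This is only a minimal sketch: the image name and the device /dev/sdX are placeholders and must be replaced with your actual file and SD card device (double-check the device, dd overwrites it without asking):
# replace /dev/sdX with your SD card device (check with lsblk)
$ sudo dd if=raspbian.img of=/dev/sdX bs=4M status=progress conv=fsync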
🔓Enable SSH connection
As of the November 2016 release, Raspberry Pi OS has the SSH server disabled by default. Due to this, we need to manually enable it.
- Launch Raspberry Pi Configuration from the Preferences menu
- Navigate to the Interfaces tab
- Select Enabled next to SSH
- Click OK
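If a Pi has no monitor attached, SSH can also be enabled headlessly: the official images turn the SSH server on when they find a file named ssh in the boot partition of the SD card. A minimal sketch, assuming the boot partition is mounted at /media/boot on your machine (adjust the path to wherever your system mounts it):
$ touch /media/boot/ssh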
Identify our Pis on the network
- Connect the Pi to the switch and the power source.
- To identify the Pi's IP address, type ping raspberrypi.local in our terminal (if the name does not resolve, see the scan tip below).
- Type ssh pi@{ip}. Type yes when asked if you want to continue connecting.
- We will need to type the password. By default it's raspberry.
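If raspberrypi.local does not resolve (mDNS is not always available on every network), a ping scan is a handy fallback for spotting the Pis. This sketch assumes nmap is installed and that the LAN uses the 192.168.1.0/24 range:
$ sudo nmap -sn 192.168.1.0/24
Running it with sudo also shows MAC addresses and vendor hints, which makes the Raspberry Pi entries easy to identify in the output.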
It might be useful to keep the SSH credentials in the ~/.ssh/config of our machine:
Host pi.master
HostName 192.168.1.115
User pi

Host pi.node1
HostName 192.168.1.116
User pi

Host pi.node2
HostName 192.168.1.117
User pi
(You will have different IP addresses)
Expand the file system
(this needs to be done manually on each Pi)
After the SSH connection:
- Type sudo raspi-config to open the Configuration Tool.
- Navigate to 7 Advanced Options.
- Select A1 Expand Filesystem and click on Finish.
- Reboot the Pi by typing: sudo shutdown -r now.
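If clicking through the menu on each Pi gets tedious, raspi-config also has a non-interactive mode; this is a sketch of the same step (verify that your Raspbian release supports the nonint options):
$ sudo raspi-config nonint do_expand_rootfs
$ sudo shutdown -r now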
Hostnames
By default, all of the Pis are known as raspberrypi and have a single user, pi. Since we are creating a cluster, this might become a nightmare to manage in the future… So let's simplify! We will assign each Pi a hostname based on its position.
- 1st Pi is pi.master;
- 2nd Pi is pi.node1;
- 3rd Pi is pi.node2;
Edit the following files:
(this needs to be done manually on each Pi)
/etc/hosts, add the following:
192.168.1.115 pi.master
192.168.1.116 pi.node1
192.168.1.117 pi.node2
/etc/hostname, change its value:
# using master as example
pi.master
Once completed, we also need to reboot the Pi. Now, when the terminal is opened, instead of:
pi@raspberrypi:~ $
We should see:
pi@pi.master:~ $
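If you prefer to script the rename instead of editing the files by hand, here is a minimal sketch for the master (it assumes the stock hostname is still raspberrypi; use pi.node1 and pi.node2 on the other Pis, and the three cluster entries above still need to be appended to /etc/hosts):
$ sudo sed -i 's/raspberrypi/pi.master/g' /etc/hostname /etc/hosts
$ sudo shutdown -r now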
SSH — fine tune
Setup the SSH aliases
Edit the ~/.ssh/config with the same configuration as we have used above (Identify our Pis on the network step):
Host pi.master
HostName 192.168.1.115
User pi

Host pi.node1
HostName 192.168.1.116
User pi

Host pi.node2
HostName 192.168.1.117
User pi
Setup public/private key pairs
On each Pi, run the following command:
$ ssh-keygen -t ed25519
This command generates a public/private key pair within the ~/.ssh directory, which can be used to establish secure ssh connections without the need to type a password.
ℹ️ No passphrase is needed to protect access to the key pair.
- Concatenate the public keys into the authorized_keys file:
On the slave Pis, run the following command:
$ cat ~/.ssh/id_ed25519.pub | ssh pi@192.168.1.115 'cat >> .ssh/authorized_keys'
ℹ️ 192.168.1.115 is the pi.master
IP address
On the pi.master
, run the following command:
$ cat .ssh/id_ed25519.pub >> .ssh/authorized_keys
From this moment on, the master can connect to the slaves through ssh without typing a password.
# e.g.: ssh to the master
$ ssh pi.master
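As an alternative to the cat | ssh pipeline above, ssh-copy-id does the same job and also fixes the permissions on authorized_keys; a sketch, run from each slave:
$ ssh-copy-id -i ~/.ssh/id_ed25519.pub pi@192.168.1.115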
Sync the SSH configuration
To replicate the passwordless ssh
on all the slaves, run the following commands:
$ scp ~/.ssh/authorized_keys pi.nodeX:~/.ssh/authorized_keys
$ scp ~/.ssh/config pi.nodeX:~/.ssh/config
⚠️ pi.nodeX: the X represents the index of the slave node; run both commands once per slave (a small loop that does this is sketched below).
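Since there are only two slaves, the loop is short; this sketch relies on the pi.node1 and pi.node2 aliases defined in ~/.ssh/config:
$ for node in pi.node1 pi.node2; do scp ~/.ssh/authorized_keys ~/.ssh/config $node:~/.ssh/; done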
Helpers
Create scripts
As developers, we are lazy… Let's create a few bash helper functions in our ~/.bashrc:
- Get the hostnames of the slaves:
function nodes {
  # list every "pi" entry from /etc/hosts, excluding the local hostname
  grep "pi" /etc/hosts | awk '{print $2}' | grep -v $(hostname)
}
- Run a command on the whole cluster:
function cluster_cmd {
  # run the command on every slave over ssh, then run it locally as well
  for pi in $(nodes); do ssh $pi "$@"; done
  "$@"
}
- Reboot the cluster:
function cluster_reboot {
  cluster_cmd sudo shutdown -r now
}
- Send a file to the cluster:
function cluster_scp {
  # copy a local file to the same path on every slave
  for pi in $(nodes); do
    cat $1 | ssh $pi "sudo tee $1" > /dev/null 2>&1
  done
}
Sync in the cluster
Now, it's time to also add these helpers to the entire cluster.
$ source ~/.bashrc && cluster_scp ~/.bashrc
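With the helpers loaded, a quick sanity check is to run a harmless command across the whole cluster; because of the trailing "$@" in cluster_cmd, the local Pi answers last, so the output should look something like:
$ cluster_cmd hostname
pi.node1
pi.node2
pi.master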
Java
Hadoop 3.2.0 requires Java 8. You might need to switch to this version.
- Download Java 8:
$ sudo apt install openjdk-8-jdk
- Set Java version:
$ sudo update-alternatives --config java
The default version will have a * next to it. Type a selection number and hit Enter to set a different Java version as the system default.
(this should be done manually on each Pi)
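To confirm every node ended up on Java 8, the cluster_cmd helper from earlier comes in handy; each Pi should report a 1.8.x runtime:
$ cluster_cmd java -version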
Now that the initial setup is completed… First we will create a single-node setup on pi.master and then we will set up the multi-node cluster.
Single-node setup
On pi.master
:
Download and extract Hadoop
- Download:
$ cd && wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz
- Extract:
$ sudo tar -xvvf hadoop-3.2.0.tar.gz -C /opt/
$ rm hadoop-3.2.0.tar.gz
$ cd /opt
$ sudo mv hadoop-3.2.0 hadoop
Then, we need to change the ownership of Hadoop's directory:
$ sudo chown pi:pi -R /opt/hadoop
Environment vars
Edit the following files:
~/.bashrc:
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
/opt/hadoop/etc/hadoop/hadoop-env.sh:
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
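After reloading the shell, a quick check that the variables point to the right place; the first line of the output should mention Hadoop 3.2.0:
$ source ~/.bashrc
$ hadoop version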
HDFS
To get HDFS up and running, we need to modify the following files located in /opt/hadoop/etc/hadoop:
core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://pi.master:9000</value>
</property>
</configuration>
hdfs-site.xml:
<configuration>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///opt/hadoop_tmp/hdfs/datanode</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///opt/hadoop_tmp/hdfs/namenode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
mapred-site.xml:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
yarn-site.xml:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
⚠️️ The hdfs-site.xml
defines where the DataNode
and NameNode
store the data and also sets the replication. Due to this, we also need to create these folders with the right privileges:
$ sudo mkdir -p /opt/hadoop_tmp/hdfs/datanode
$ sudo mkdir -p /opt/hadoop_tmp/hdfs/namenode
$ sudo chown pi:pi -R /opt/hadoop_tmp
Once we've edited the config files, the next step is to format the HDFS:
$ hdfs namenode -format -force
Boot HDFS and YARN with the following commands:
$ $HADOOP_HOME/sbin/start-dfs.sh
$ $HADOOP_HOME/sbin/start-yarn.sh
… time to test!
… through the command jps, it should display the following services:
ResourceManager
NameNode
Jps
SecondaryNameNode
DataNode
NodeManager
… or, by creating a temporary directory:
$ hadoop fs -mkdir /tmp
$ hadoop fs -ls /
Found 1 items
drwxr-xr-x - pi supergroup 0 2021-06-05 20:04 /tmp
Hide annoying NativeCodeLoader warnings
We might see the warning util.NativeCodeLoader: Unable to load native-hadoop library (…). This warning isn't easy to solve: we could recompile the native library from scratch for this platform, but for this tutorial it is enough to hide it.
Add the following lines to the ~/.bashrc:
export HADOOP_HOME_WARN_SUPPRESS=1
export HADOOP_ROOT_LOGGER="WARN,DRFA"
Cluster setup
At this moment, we have a single node that works as both master and slave. It's time to set up our cluster.
Create Hadoop directories
Through the following commands, we will create the mandatory folders on all slave Pis.
$ cluster_cmd sudo mkdir -p /opt/hadoop_tmp/hdfs
$ cluster_cmd sudo chown pi:pi -R /opt/hadoop_tmp
$ cluster_cmd sudo mkdir -p /opt/hadoop
$ cluster_cmd sudo chown pi:pi /opt/hadoop
Sync Hadoop config
Copy all the files from /opt/hadoop to all the slave Pis.
$ for pi in $(nodes); do rsync -avxP $HADOOP_HOME $pi:/opt/; done
Configuring Hadoop on the Cluster
Edit the following config files located in /opt/hadoop/etc/hadoop:
core-site.xml, add the following:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://pi.master:9000</value>
</property>
</configuration>
hdfs-site.xml, add the following:
<configuration>
<property>
<name>dfs.datanode.data.dir</name>
<value>/opt/hadoop_tmp/hdfs/datanode</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/opt/hadoop_tmp/hdfs/namenode</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
</configuration>
mapred-site.xml, add the following:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>256</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>128</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>128</value>
</property>
</configuration>
yarn-site.xml, add the following:
<configuration>
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>pi.master</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>900</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>900</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>64</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
</configuration>
On $HADOOP_HOME/etc/hadoop/ we need to create two files so that Hadoop can identify which node is the master and which ones are the slaves.
- Create a file named master with the single line:
pi.master
- Create a file named workers and add the slaves (one per line):
pi.node1
pi.node2
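For reference, both files can also be created straight from the shell; a small sketch assuming $HADOOP_HOME is set as above:
$ echo "pi.master" > $HADOOP_HOME/etc/hadoop/master
$ printf "pi.node1\npi.node2\n" > $HADOOP_HOME/etc/hadoop/workers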
Edit the /etc/hosts file and remove the line:
127.0.0.1 pi.master
Now, let’s copy the hosts file to all the slaves:
$ cluster_scp /etc/hosts
… reboot the cluster:
$ cluster_reboot
Start Hadoop Cluster
On pi.master
run the command:
$ hdfs namenode -format -force
Now, boot HDFS and YARN:
$ $HADOOP_HOME/sbin/start-dfs.sh
$ $HADOOP_HOME/sbin/start-yarn.sh
Is it working?
Through the command hdfs dfsadmin -report we should get output similar to the following:
Configured Capacity: 60373532672 (56.23 GB)
Present Capacity: 41064591360 (38.24 GB)
DFS Remaining: 41064542208 (38.24 GB)
DFS Used: 49152 (48 KB)
DFS Used%: 0.00%
Replicated Blocks:
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
Low redundancy blocks with highest priority to recover: 0
Pending deletion blocks: 0
Erasure Coded Block Groups:
Low redundancy block groups: 0
Block groups with corrupt internal blocks: 0
Missing block groups: 0
Low redundancy blocks with highest priority to recover: 0
Pending deletion blocks: 0
-------------------------------------------------
Live datanodes (2):

Name: 192.168.1.116:9866 (pi.node1)
Hostname: pi.node1
Decommission Status : Normal
Configured Capacity: 30186766336 (28.11 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 8399671296 (7.82 GB)
DFS Remaining: 20462870528 (19.06 GB)
DFS Used%: 0.00%
DFS Remaining%: 67.79%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sat Jun 05 21:49:42 WEST 2021
Last Block Report: Sat Jun 05 21:41:48 WEST 2021
Num of Blocks: 0

Name: 192.168.1.117:9866 (pi.node2)
Hostname: pi.node2
Decommission Status : Normal
Configured Capacity: 30186766336 (28.11 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 8260870144 (7.69 GB)
DFS Remaining: 20601671680 (19.19 GB)
DFS Used%: 0.00%
DFS Remaining%: 68.25%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sat Jun 05 21:49:42 WEST 2021
Last Block Report: Sat Jun 05 21:41:48 WEST 2021
Num of Blocks: 0
… We also have a web interface (http://pi.master:9870) where we can explore the cluster info.
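As a final smoke test, we can submit one of the example MapReduce jobs that ships with the distribution; the jar path below assumes the Hadoop 3.2.0 layout used throughout this guide:
$ yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0.jar pi 4 1000
The job estimates π with 4 map tasks and 1000 samples per map, and its progress can also be followed in the YARN ResourceManager web UI (by default on http://pi.master:8088).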
Conclusion
And… That’s it folks! Now we have installed a Raspberry Pi Hadoop cluster.
I hope this guide has been helpful to you.