Hadoop is a widely used open-source framework for distributed storage and processing of large data sets. This guide will walk you through the installation and configuration of Hadoop on an Ubuntu VPS, ensuring optimal performance and scalability for your big data projects.
10 min
Edited: 12-10-2024
Hadoop is an open-source framework designed for distributed storage and processing of large datasets across clusters of computers. It can be used for big data storage, data processing, data analytics, data lakes, and more.
Start by updating your Ubuntu system to ensure that all packages are up to date. Run the following commands:
sudo apt update
sudo apt upgrade
Hadoop requires Java to run, so install the default Java Development Kit (JDK) on your Ubuntu VPS:
sudo apt install openjdk-11-jdk -y
Verify the installation by checking the Java version:
java -version
You should see something like:
openjdk version "11.0.11"
For security and organizational purposes, it’s recommended to create a dedicated user for Hadoop:
sudo adduser hadoopuser
sudo usermod -aG sudo hadoopuser
Switch to the new user:
su - hadoopuser
Hadoop uses SSH to manage its nodes. Ensure that SSH is installed and set up:
sudo apt install ssh
Also, generate an SSH key pair for password-less SSH login:
ssh-keygen -t rsa -P ""
Add the generated public key to the authorized keys:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
Test SSH by running:
ssh localhost
You should be able to log in without a password prompt.
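After logging out of the test session with exit, you can also run a quick non-interactive check; with BatchMode enabled, ssh fails immediately instead of falling back to a password prompt, so a successful echo confirms key-based login is working:
ssh -o BatchMode=yes localhost 'echo "passwordless SSH to localhost works"'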
Download the latest stable version of Hadoop from the official Apache site:
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
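Optionally, verify the integrity of the download. Apache publishes a SHA-512 checksum file alongside each release; assuming it is available at the same URL with a .sha512 suffix, a check could look like this:
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz.sha512
sha512sum -c hadoop-3.3.6.tar.gz.sha512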
Extract the downloaded file:
tar -xzf hadoop-3.3.6.tar.gz
mv hadoop-3.3.6 ~/hadoop
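As a quick sanity check, the extracted directory should contain the usual Hadoop layout, including the bin, etc, sbin, and share subdirectories:
ls ~/hadoop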
Edit the .bashrc file to set the Hadoop environment variables:
nano ~/.bashrc
Add the following lines at the end of the file:
# Hadoop variables
export HADOOP_HOME=$HOME/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
Save and exit the file, then apply the changes:
source ~/.bashrc
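To confirm that the new variables are active in the current shell, you can run a quick check; these simply echo the values set above (the hadoop command itself may still complain about JAVA_HOME until it is configured in the next step):
echo $HADOOP_HOME
which hadoop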
Navigate to the Hadoop configuration folder:
cd ~/hadoop/etc/hadoop
Edit hadoop-env.sh to specify the Java home directory:
nano hadoop-env.sh
Look for the line containing JAVA_HOME and set it to your Java installation path:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
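If you are unsure of the exact JDK path on your system, one way to recover it (assuming the OpenJDK package installed via apt earlier) is:
dirname $(dirname $(readlink -f $(which java)))
Once JAVA_HOME is set in hadoop-env.sh, running hadoop version should print the release details, which makes a convenient checkpoint before editing the XML configuration files.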
Next, configure core-site.xml. Start by creating the directories Hadoop will use to store the HDFS file system:
mkdir -p ~/hadoopdata/hdfs/namenode
mkdir -p ~/hadoopdata/hdfs/datanode
Now edit the core-site.xml file:
nano core-site.xml
Add the following configuration:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
Next, configure HDFS. Edit the hdfs-site.xml file:
nano hdfs-site.xml
Add the following configuration to define the namenode and datanode directories:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoopuser/hadoopdata/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hadoopuser/hadoopdata/hdfs/datanode</value>
  </property>
</configuration>
Next, configure MapReduce. Edit the mapred-site.xml file:
nano mapred-site.xml
Add the following configuration to specify the MapReduce framework:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
Finally, configure YARN. Edit the yarn-site.xml file:
nano yarn-site.xml
Add the following configuration for YARN resource management:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
Before starting Hadoop for the first time, format the HDFS filesystem:
hdfs namenode -format
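The format command prints a long stream of log messages; near the end you should see a confirmation similar to the following (the path reflects the dfs.namenode.name.dir value configured above):
INFO common.Storage: Storage directory /home/hadoopuser/hadoopdata/hdfs/namenode has been successfully formatted.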
Start the Hadoop services in the following order:
start-dfs.sh
start-yarn.sh
You can verify that the Hadoop services are running by using the jps command, which should display processes like NameNode, DataNode, ResourceManager, and NodeManager.
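For reference, running jps on a healthy single-node setup typically produces output roughly like this (process IDs will differ, and a SecondaryNameNode is also started by start-dfs.sh):
jps
2401 NameNode
2563 DataNode
2789 SecondaryNameNode
3012 ResourceManager
3178 NodeManager
3405 Jps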
Hadoop provides a web interface for monitoring:
HDFS NameNode UI: http://localhost:9870
YARN ResourceManager UI: http://localhost:8088
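As a final smoke test, you can create a directory in HDFS, copy a file into it, and list the result. This is a minimal sketch that reuses one of the configuration files you just edited as sample data:
hdfs dfs -mkdir -p /user/hadoopuser
hdfs dfs -put ~/hadoop/etc/hadoop/core-site.xml /user/hadoopuser/
hdfs dfs -ls /user/hadoopuser
If the file appears in the listing (and under Utilities > Browse the file system in the NameNode UI), HDFS is working end to end.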
You now have a fully configured Hadoop installation on your Ubuntu VPS. This setup allows you to store and process large datasets across a distributed environment, setting the stage for scalable big data operations.