How to install Hadoop on Windows 10

This detailed step-by-step guide shows you how to install the latest Hadoop (v3.2.1) on Windows 10. It also provides a temporary fix for bug HDFS-14890 (java.lang.UnsupportedOperationException). It's based on the previous articles I published, with some updates to reflect the feedback collected from readers to make it easier for everyone to install.

Please follow all the instructions carefully. Once you complete the steps, you will have a shiny pseudo-distributed single node Hadoop to work with.

*The yellow elephant logo is a registered trademark of Apache Hadoop; the blue window logo is a registered trademark of Microsoft.

References

Refer to the following articles if you prefer to install other versions of Hadoop, if you want to configure a multi-node cluster, or if you want to use WSL.

  • Install Hadoop 3.0.0 on Windows (Single Node)
  • Configure Hadoop 3.1.0 in a Multi Node Cluster
  • Install Hadoop 3.2.0 on Windows 10 using Windows Subsystem for Linux (WSL)

Required tools

Before you start, make sure you have the following tools installed and enabled in Windows 10.

PowerShell

We will use this tool to download the package.

On my system, the PowerShell version information is listed below:

$PSVersionTable

Name                           Value
----                           -----
PSVersion                      5.1.18362.145
PSEdition                      Desktop
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0...}
BuildVersion                   10.0.18362.145
CLRVersion                     4.0.30319.42000
WSManStackVersion              3.0
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
Git Bash or 7 Zip

We will use Git Bash or 7-Zip to unzip the Hadoop binary package.

You can choose to install either tool or any other tool as long as it can unzip *.tar.gz files on Windows.

Command Prompt

We will use it to start Hadoop daemons and run some commands as part of the installation process.
Java JDK

JDK is required to run Hadoop as the framework is built using Java.

In my system, my JDK version is jdk1.8.0_161.

Check out the supported JDK version on the following page.

https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions

Now we will start the installation process. 

Step 1 — Download Hadoop binary package

Select download mirror link

Go to download page of the official website:

Apache Download Mirrors — Hadoop 3.2.1

Then choose one of the mirror links. The page lists the mirrors closest to you based on your location. I am choosing the following mirror link:

http://apache.mirror.digitalpacific.com.au/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz

info In the following sections, this URL will be used to download the package. Your URL might be different from mine; replace the link accordingly.

Download the package

info In this guide, I am installing Hadoop in the folder big-data on my F drive (F:\big-data). If you prefer to install on another drive, please remember to change the path accordingly in the following command lines. This directory is also called the destination directory in the following sections.

Open PowerShell and then run the following command lines one by one:

$dest_dir="F:\big-data"
$url = "http://apache.mirror.digitalpacific.com.au/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz"
$client = new-object System.Net.WebClient
$client.DownloadFile($url,$dest_dir+"\hadoop-3.2.1.tar.gz")

2020011800029-image.png

It may take a few minutes to download. 

Once the download completes, you can verify it:

PS F:\big-data> cd $dest_dir
PS F:\big-data> ls


    Directory: F:\big-data


Mode                LastWriteTime         Length Name
----                -------------         ------ ----
-a----       18/01/2020  11:01 AM      359196911 hadoop-3.2.1.tar.gz


PS F:\big-data>

You can also directly download the package through your web browser and save it to the destination directory.

warning Please keep this PowerShell window open, as we will use some variables from this session in the following steps. If you have already closed it, that is okay; just remember to reinitialise the above variables: $client and $dest_dir (a sketch follows below).
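
If you do need to reinitialise them in a new PowerShell window, a minimal sketch looks like the following (the URL is the mirror chosen above; replace it with your own):

$dest_dir="F:\big-data"
$url = "http://apache.mirror.digitalpacific.com.au/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz"
$client = new-object System.Net.WebClient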

Step 2 — Unpack the package

Now we need to unpack the downloaded package using a GUI tool (like 7-Zip) or the command line. I will use Git Bash to unpack it.

Open git bash and change the directory to the destination folder:

cd F:/big-data

And then run the following command to unzip:

tar -xvzf  hadoop-3.2.1.tar.gz

The command will take quite a few minutes as there are numerous files included and the latest version introduced many new features.

After the unzip command is completed, a new folder hadoop-3.2.1 is created under the destination folder. 

2020011801911-image.png

info When running the command you may see errors like the following:

tar: hadoop-3.2.1/lib/native/libhadoop.so: Cannot create symlink to ‘libhadoop.so.1.0.0’: No such file or directory

Please ignore them for now, as those native libraries are for Linux/UNIX; we will add the Windows native IO libraries in the following steps.

Step 3 — Install Hadoop native IO binary

Hadoop on Linux includes optional native IO support. However, native IO is mandatory on Windows, and without it you will not be able to get your installation working. The Windows native IO libraries are not included as part of the Apache Hadoop release, so we need to build and install them.

I also published another article with very detailed steps about how to compile and build native Hadoop on Windows: Compile and Build Hadoop 3.2.1 on Windows 10 Guide.

The build may take about one hour, and to save time, we can just download the binary package from GitHub.

warning These libraries are not signed and there is no guarantee that they are 100% safe. We use them purely for test and learn purposes.

Download all the files in the following location and save them to the bin folder under the Hadoop folder. For my environment, the full path is F:\big-data\hadoop-3.2.1\bin. Remember to change it to your own path accordingly.

https://github.com/cdarlint/winutils/tree/master/hadoop-3.2.1/bin

Alternatively, you can run the following commands in the previous PowerShell window to download:

$client.DownloadFile("https://github.com/cdarlint/winutils/raw/master/hadoop-3.2.1/bin/hadoop.dll",$dest_dir+"\hadoop-3.2.1\bin\hadoop.dll")
$client.DownloadFile("https://github.com/cdarlint/winutils/raw/master/hadoop-3.2.1/bin/hadoop.exp",$dest_dir+"\hadoop-3.2.1\bin\hadoop.exp")
$client.DownloadFile("https://github.com/cdarlint/winutils/raw/master/hadoop-3.2.1/bin/hadoop.lib",$dest_dir+"\hadoop-3.2.1\bin\hadoop.lib")
$client.DownloadFile("https://github.com/cdarlint/winutils/raw/master/hadoop-3.2.1/bin/hadoop.pdb",$dest_dir+"\hadoop-3.2.1\bin\hadoop.pdb")
$client.DownloadFile("https://github.com/cdarlint/winutils/raw/master/hadoop-3.2.1/bin/libwinutils.lib",$dest_dir+"\hadoop-3.2.1\bin\libwinutils.lib")
$client.DownloadFile("https://github.com/cdarlint/winutils/raw/master/hadoop-3.2.1/bin/winutils.exe",$dest_dir+"\hadoop-3.2.1\bin\winutils.exe")
$client.DownloadFile("https://github.com/cdarlint/winutils/raw/master/hadoop-3.2.1/bin/winutils.pdb",$dest_dir+"\hadoop-3.2.1\bin\winutils.pdb")
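
If you prefer not to repeat the long URL, the same downloads can be written as a short loop. This is just a sketch; it assumes the $client and $dest_dir variables from the earlier steps are still defined in your session:

$base = "https://github.com/cdarlint/winutils/raw/master/hadoop-3.2.1/bin/"
$files = "hadoop.dll","hadoop.exp","hadoop.lib","hadoop.pdb","libwinutils.lib","winutils.exe","winutils.pdb"
foreach ($f in $files) {
    # Download each native IO file into the Hadoop bin folder
    $client.DownloadFile($base + $f, $dest_dir + "\hadoop-3.2.1\bin\" + $f)
}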

After this, the bin folder looks like the following:
2020011803644-image.png

Step 4 — (Optional) Java JDK installation

Java JDK is required to run Hadoop. If you have not installed Java JDK please install it.

You can install JDK 8 from the following page:

https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

Once you complete the installation, please run the following command in PowerShell or Git Bash to verify:

$ java -version
java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)

If you get an error like 'cannot find java command or executable', don't worry; we will resolve it in the next step.

Step 5 — Configure environment variables

Now that we've downloaded and unpacked all the artefacts, we need to configure two important environment variables.

Configure JAVA_HOME environment variable

As mentioned earlier, Hadoop requires Java, and we need to configure the JAVA_HOME environment variable (it is not strictly mandatory, but I recommend setting it).

First, we need to find out the location of the Java JDK. On my system, the path is D:\Java\jdk1.8.0_161.

2020011804902-image.png

Your location can be different depending on where you installed your JDK.

And then run the following command in the previous PowerShell window:

SETX JAVA_HOME "D:\Java\jdk1.8.0_161"

Remember to quote the path especially if you have spaces in your JDK path.

info You can set up the environment variable at system level by adding the option /M; however, in case you don't have access to change system variables, you can set it up at user level.

The output looks like the following:

2020011805542-image.png

Configure HADOOP_HOME environment variable

Similarly, we need to create a new environment variable for HADOOP_HOME using the following command. The path should be your extracted Hadoop folder. For my environment it is F:\big-data\hadoop-3.2.1.

If you used PowerShell to download and if the window is still open, you can simply run the following command:

SETX HADOOP_HOME ($dest_dir+"\hadoop-3.2.1")

The output looks like the following screenshot:

2020011805959-image.png

Alternatively, you can specify the full path:

SETX HADOOP_HOME "F:\big-data\hadoop-3.2.1"

Now you can also verify the two environment variables in the system:

2020011810326-image.png
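
If you prefer the command line over the GUI, a quick check from a new PowerShell window could look like this (a sketch; the paths in the comments are the ones used in this guide):

# Print the two variables persisted by SETX (they are only visible in windows opened after SETX ran)
echo $env:JAVA_HOME     # e.g. D:\Java\jdk1.8.0_161
echo $env:HADOOP_HOME   # e.g. F:\big-data\hadoop-3.2.1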

Configure PATH environment variable

Once we finish setting up the above two environment variables, we need to add the bin folders to the PATH environment variable. 

If a PATH environment variable already exists in your system, you can also manually add the following two paths to it:

  • %JAVA_HOME%/bin
  • %HADOOP_HOME%/bin

Alternatively, you can run the following command to add them:

setx PATH "$env:PATH;$env:JAVA_HOME/bin;$env:HADOOP_HOME/bin"

If you don’t have other user variables setup in the system, you can also directly add a Path environment variable that references others to make it short:

2020011812142-image.png

Close the PowerShell window, open a new one, and type winutils.exe directly to verify that the above steps completed successfully:

2020011812322-image.png

You should also be able to run the following command:

hadoop -version
java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)

Step 6 — Configure Hadoop

Now we are ready to configure the most important part — Hadoop configurations which involves Core, YARN, MapReduce, HDFS configurations. 

Configure core site

Edit the file core-site.xml in the %HADOOP_HOME%\etc\hadoop folder. For my environment, the actual path is F:\big-data\hadoop-3.2.1\etc\hadoop.

Replace configuration element with the following:

<configuration>
   <property>
     <name>fs.default.name</name>
     <value>hdfs://0.0.0.0:19000</value>
   </property>
</configuration>

Configure HDFS

Edit the file hdfs-site.xml in the %HADOOP_HOME%\etc\hadoop folder.

Before editing, please create two folders in your system: one for the namenode directory and another for the data directory (a PowerShell sketch for creating them follows the list below). For my system, I created the following two sub-folders:

  • F:\big-data\data\dfs\namespace_logs
  • F:\big-data\data\dfs\data
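
For example, the two folders can be created from the same PowerShell session; this is a minimal sketch using the paths above (change them to your own destination directory):

# -Force also creates the intermediate data\dfs folders if they don't exist yet
New-Item -ItemType Directory -Force -Path "F:\big-data\data\dfs\namespace_logs"
New-Item -ItemType Directory -Force -Path "F:\big-data\data\dfs\data"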

Replace configuration element with the following (remember to replace the highlighted paths accordingly):

<configuration>
   <property>
     <name>dfs.replication</name>
     <value>1</value>
   </property>
   <property>
     <name>dfs.namenode.name.dir</name>
     <value>file:///F:/big-data/data/dfs/namespace_logs</value>
   </property>
   <property>
     <name>dfs.datanode.data.dir</name>
     <value>file:///F:/big-data/data/dfs/data</value>
   </property>
</configuration>

In Hadoop 3, the property names are slightly different from previous version. Refer to the following official documentation to learn more about the configuration properties:

Hadoop 3.2.1 hdfs-default.xml

info For DFS replication we configure it as 1 because we are configuring just a single node. By default the value is 3.

info The directory configurations are not mandatory; by default Hadoop will use its temporary folder. For this tutorial, I recommend customising the values.

Configure MapReduce and YARN site

Edit the file mapred-site.xml in the %HADOOP_HOME%\etc\hadoop folder.

Replace configuration element with the following:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property> 
        <name>mapreduce.application.classpath</name>
        <value>%HADOOP_HOME%/share/hadoop/mapreduce/*,%HADOOP_HOME%/share/hadoop/mapreduce/lib/*,%HADOOP_HOME%/share/hadoop/common/*,%HADOOP_HOME%/share/hadoop/common/lib/*,%HADOOP_HOME%/share/hadoop/yarn/*,%HADOOP_HOME%/share/hadoop/yarn/lib/*,%HADOOP_HOME%/share/hadoop/hdfs/*,%HADOOP_HOME%/share/hadoop/hdfs/lib/*</value>
    </property>
</configuration>

Edit the file yarn-site.xml in the %HADOOP_HOME%\etc\hadoop folder:

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>

Step 7 — Initialise HDFS & bug fix

Run the following command in Command Prompt 

hdfs namenode -format

This command failed with the following error and we need to fix it:

2020-01-18 13:36:03,021 ERROR namenode.NameNode: Failed to start namenode.
java.lang.UnsupportedOperationException
        at java.nio.file.Files.setPosixFilePermissions(Files.java:2044)
        at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.clearDirectory(Storage.java:452)
        at org.apache.hadoop.hdfs.server.namenode.NNStorage.format(NNStorage.java:591)
        at org.apache.hadoop.hdfs.server.namenode.NNStorage.format(NNStorage.java:613)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.format(FSImage.java:188)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:1206)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1649)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1759)
2020-01-18 13:36:03,025 INFO util.ExitUtil: Exiting with status 1: java.lang.UnsupportedOperationException

Refer to the following sub section (About 3.2.1 HDFS bug on Windows) about the details of fixing this problem.

Once this is fixed, the format command (hdfs namenode -format) will show something like the following:

20200119124224-image.png

About 3.2.1 HDFS bug on Windows

This is a bug with 3.2.1 release:

https://issues.apache.org/jira/browse/HDFS-14890

It will be resolved in version 3.2.2 and 3.3.0.

We can apply a temporary fix as the following change diff shows: 

Code fix for HDFS-14890

2020011830128-image.png

I’ve done the following to get this temporarily fixed before 3.2.2/3.3.0 is released:

  • Checkout the source code of Hadoop project from GitHub.
  • Checkout branch 3.2.1
  • Open pom file of hadoop-hdfs project
  • Update class StorageDirectory as described in the above code diff screen shot:
if (permission != null) {
  try {
    Set<PosixFilePermission> permissions =
        PosixFilePermissions.fromString(permission.toString());
    Files.setPosixFilePermissions(curDir.toPath(), permissions);
  } catch (UnsupportedOperationException uoe) {
    // Default to FileUtil for non-POSIX file systems
    FileUtil.setPermission(curDir, permission);
  }
}
  • Use Maven to rebuild this project as the following screenshot shows:

20200119123651-image.png
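
The exact Maven command is not shown above. A typical invocation for rebuilding only the hadoop-hdfs module would look roughly like the following; this is a sketch that assumes you are in the root of the checked-out Hadoop source tree and have the build prerequisites from the compile guide installed:

# Build just the hadoop-hdfs module (and the modules it depends on), skipping tests to save time
mvn clean package -pl hadoop-hdfs-project/hadoop-hdfs -am -DskipTests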

Fix bug HDFS-14890

I’ve uploaded the JAR file into the following location. Please download it from the following link:

https://github.com/FahaoTang/big-data/blob/master/hadoop-hdfs-3.2.1.jar 

Then rename the existing file hadoop-hdfs-3.2.1.jar to hadoop-hdfs-3.2.1.bk in the folder %HADOOP_HOME%\share\hadoop\hdfs.

Copy the downloaded hadoop-hdfs-3.2.1.jar into the folder %HADOOP_HOME%\share\hadoop\hdfs.
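
If the PowerShell session from the earlier steps is still open, the backup and download can also be done there. This is a sketch; it assumes $client and $dest_dir are still defined, and the URL below is the standard GitHub raw link for the file above:

$hdfs_dir = $dest_dir + "\hadoop-3.2.1\share\hadoop\hdfs"
# Keep the original jar as a backup
Rename-Item ($hdfs_dir + "\hadoop-hdfs-3.2.1.jar") "hadoop-hdfs-3.2.1.bk"
# Download the patched jar into the same folder
$client.DownloadFile("https://github.com/FahaoTang/big-data/raw/master/hadoop-hdfs-3.2.1.jar", $hdfs_dir + "\hadoop-hdfs-3.2.1.jar")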

warning This is just a temporary fix before the official fix is published. I publish it purely so we can complete the whole installation process, and there is no guarantee that this temporary fix won't cause any new issues.

Refer to this article for more details about how to build a native Windows Hadoop: Compile and Build Hadoop 3.2.1 on Windows 10 Guide.

Step 8 — Start HDFS daemons 

Run the following command to start HDFS daemons in Command Prompt:

%HADOOP_HOME%\sbin\start-dfs.cmd

Two Command Prompt windows will open: one for datanode and another for namenode as the following screenshot shows:

20200119124758-image.png

Step 9 — Start YARN daemons

warning You may encounter permission issues if you start the YARN daemons as a normal user. To avoid this, please open a Command Prompt window using Run as administrator.
Alternatively, you can follow this comment on this page, which doesn't require Administrator permission and uses a local Windows account:
https://kontext.tech/article/377/latest-hadoop-321-installation-on-windows-10-step-by-step-guide#comment314

Run the following command in an elevated Command Prompt window (Run as administrator) to start YARN daemons:

%HADOOP_HOME%\sbin\start-yarn.cmd

Similarly two Command Prompt windows will open: one for resource manager and another for node manager as the following screenshot shows:

20200119125930-image.png

Step 10 — Useful Web portals exploration

The daemons also host websites that provide useful information about the cluster.

HDFS Namenode information UI

http://localhost:9870/dfshealth.html#tab-overview

The website looks like the following screenshot:

20200119125549-image.png

HDFS Datanode information UI

http://localhost:9864/datanode.html

The website looks like the following screenshot:

20200119125647-image.png

YARN resource manager UI

http://localhost:8088

The website looks like the following screenshot:

20200119130013-image.png

Through Resource Manager, you can also navigate to any Node Manager:

20200119130224-image.png

Step 11 — Shutdown YARN & HDFS daemons

You don’t need to keep the services running all the time. You can stop them by running the following commands one by one:

%HADOOP_HOME%\sbin\stop-yarn.cmd
%HADOOP_HOME%\sbin\stop-dfs.cmd

check Congratulations! You’ve successfully completed the installation of Hadoop 3.2.1 on Windows 10.

Let me know if you encounter any issues. Enjoy your latest Hadoop on Windows 10.

Basic Hadoop on Windows

How to install Apache Hadoop (HDFS, YARN, MapReduce) on Windows 10.


Recommended Prerequisites

  • Be able to open PowerShell as administrator and run commands
  • Be able to stop a PowerShell process with CTRL+C
  • Be able to close PowerShell with ALT+SPACE C
  • Chocolatey, the Windows package manager
choco install hadoop -y
refreshenv
choco list -local

Close the window with ALT+SPACE C and reopen.

JDK 8 With No Spaces

Choco installs JDK 8 with spaces. We need a JDK 8 path with no spaces for Hadoop.

  • Hadoop cannot have spaces in the path.
  • Currently, YARN cannot run on a Java version greater than 8.
  • Keep Windows Environment Variables at the original Chocolatey path. (Windows 1 below.)
  • Use a special «no spaces in the path» JDK 8 version for Hadoop (set in hadoop-env.cmd below.)

IMPORTANT: In the following, use the path and file name specific to your JDK 8 file and installation.
My versions are only for illustration.

One way to get this version:

  • Go to https://adoptopenjdk.net/upstream.html
  • Download OpenJDK 8 (LTS) for Windows x64 JDK (about 103 MB) as a zip file to C:\.
  • If not permitted to download to C:\, download to your "Downloads" folder and move to C:\ when done.
  • Right-click / Extract all / Extract.
  • Move the openjdk-8u252-b09 folder from C:\OpenJDK8U-jdk_x64_windows_8u252b09 to C:\.

Now we have a JDK 8 path with no spaces for Hadoop.

Verify you have C:\openjdk-8u252-b09 (or similar) with the necessary files.
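
A quick way to confirm this from PowerShell (a sketch; substitute your own extracted folder name):

Test-Path "C:\openjdk-8u252-b09\bin\java.exe"    # should print True
& "C:\openjdk-8u252-b09\bin\java.exe" -version   # should report a 1.8.x version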

Edit Windows System Environment Variables

JAVA_HOME = location of your most current OpenJDK (this can have spaces — it is NOT the one we use).

HADOOP_HOME = C:\Hadoop\hadoop-3.3.0

PATH

  • only one JDK in the path
  • include %HADOOP_HOME%\bin
  • include C:\WINDOWS\system32

Add winutil files

  • Go to https://github.com/cdarlint/winutils.
  • Get winutils.exe and download the latest version to your C:\Hadoop\hadoop-3.3.0\bin folder. If you get a java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z error, then you may have to download the hadoop.dll file to your bin folder as well.

Windows 2 — Edit Hadoop Files

Open C:\Hadoop\hadoop-3.3.0\etc\hadoop in VS Code to edit these files as shown in ./etc-hadoop.

  • core-site.xml
  • hadoop-env.cmd (set JAVA_HOME=C:\openjdk-8u252-b09; no spaces, must be version 8)
  • hdfs-site.xml (path values must be formatted as shown in the example)
  • mapred-site.xml
  • workers
  • yarn-site.xml

Windows 3 — Format new namenode

Create and format a new namenode (at the path specified in hdfs-site.xml). In PowerShell as Administrator, run:
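
The command itself is not shown in these notes; based on the other guides in this document, the standard namenode format command is:

hdfs namenode -format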

Verify it runs without errors. The process will shutdown the new namenode after formatting.

Explore the new folder created at the path given in hdfs-site.xml.


Running a Pseudo-Distributed Hadoop Cluster on Windows

Start the namenode first. From your desktop, run PowerShell as Administrator and start the namenode service. Leave the window open to keep the process running.

Then start the datanode. Important: make sure the datanode folder does NOT exist. If it does exist, stop the process if necessary and delete the directory. Once it is gone, from your desktop, run PowerShell as Administrator and start the datanode service. Leave the window open to keep the process running. (A sketch of the typical start commands follows the notes below.)

Notes:

  • These locations are set in hdfs-site.xml. The datanode directory must NOT EXIST before running.
  • Open these directories when running to see what’s happening on disk.
  • Exit gracefully when done. If you leave processes running, you may need to close them manually with CTRL+ALT+DELETE / Task Manager. Find the process, right-click / End task.
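
The start commands themselves are not spelled out in these notes. Assuming the standard Hadoop command-line tools are on the PATH (as configured above), the two services are typically started in two separate Administrator PowerShell windows roughly like this:

# Window 1: start the namenode and leave it running
hdfs namenode

# Window 2: start the datanode and leave it running
hdfs datanode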

Monitoring Hadoop

Open a browser to http://localhost:9870/. Explore.

Interacting with HDFS

HDFS is a powerful, fast file system for managing large files with commodity hardware. It breaks each file added into blocks and replicates them across different machines (and different racks) on the cluster. By default, it creates 3 copies. It does this automatically.

We can list the contents, get help, make directories in HDFS to organize our work, put files into HDFS directories, and get files out of HDFS.

Make sure your nodes are running. From your desktop, open a third PowerShell window as Administrator and use hadoop fs to access the file system. Try these frequently used HDFS shell commands:

hadoop fs -ls /
hadoop fs -df hdfs:/
hadoop fs -mkdir /wordcount
hadoop fs -put C:/romeoandjuliet.txt /wordcount
hadoop fs -ls /
hadoop fs -ls /wordcount
hadoop fs -help
hadoop fs -mkdir -p project1

Start up YARN

Start your namenode and datanode services as shown above.

Open PowerShell as an Administrator and run:

C:\Hadoop\hadoop-3.3.0\sbin\start-yarn.cmd

This will start the YARN Resource Manager and the YARN Node Manager services.

Open a browser to http://localhost:8088/cluster/nodes to see your managed cluster. 

Troubleshooting

Issues? Key things to check for include the following.

Windows Environment Variables

  1. Verify you only have one java bin directory in your path (see the PowerShell check below).
  2. Verify the paths provided match the locations on your machine (use File Explorer to check).
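
Both checks can be started quickly from PowerShell, for example (a sketch; adjust the pattern to your own JDK folder names):

# List every PATH entry that mentions Java; there should be exactly one JDK bin folder
$env:Path -split ';' | Select-String -Pattern 'java|jdk'
# Confirm the Hadoop and Java variables point at real folders
Test-Path $env:HADOOP_HOME
Test-Path $env:JAVA_HOME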

References

  • Romeo and Juliet
  • Frequently Used HDFS Shell Commands

Repository

  • https://github.com/denisecase/basic-hadoop-on-windows

Hadoop Installation on Windows 10 using WSL

Samujjwaal Dey

This article provides a step-by-step guide to install Hadoop 3.3.0 on Windows 10 via WSL (Windows Subsystem for Linux). These instructions are also applicable for installing Hadoop on Linux systems.

Hadoop 3.3.0 (released on July 14, 2020) is the most recent release in the Apache Hadoop 3.3 line as of the time of this article.

Prerequisites

Follow the steps mentioned on the page below to first enable Windows Subsystem for Linux and then install a Linux distribution of your choice.

Windows Subsystem for Linux Installation Guide for Windows 10

Initial Linux Configuration (optional)

Create a dedicated Hadoop user

We will create a dedicated Hadoop user for executing Hadoop commands. Although not required, it is a recommended step as it separates the Hadoop installation from other software applications and user accounts running on the same machine.

  • Create a user group hadoop
sudo addgroup hadoop


  • Create a user hduser
sudo adduser hduser


It will ask for a password twice, followed by user details; just press Enter for each prompt and confirm Yes at the end.

  • Add user hduser to user group hadoop
sudo  adduser hduser hadoop


  • Add hduser to sudoers list so that hduser can do admin tasks
sudo visudo


Append the following after the line "# Allow members of group sudo to execute any command":

hduser ALL=(ALL) ALL 


Then save and close the file.

  • Logout of the current session
logout


  • Login again as hduser
su hduser
cd


Configure passphraseless ssh

Hadoop requires SSH access to manage its nodes.

This is a critical step; please make sure you follow the steps carefully.

Make sure you can SSH to localhost in Ubuntu:

ssh localhost


If you encounter the error ssh: connect to host localhost port 22: Connection refused, run the following commands:

sudo apt remove openssh-server
sudo apt install openssh-server
sudo service ssh start


If you cannot ssh to localhost without a passphrase, run the following command to initialize your private and public keys:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys


Now ssh localhost should work without any passphrase.

Hadoop Installation

Install Java JDK

Update the package index:

sudo apt update


Check whether Java is already installed:

java -version


If Java isn’t installed, install Java 8 from OpenJDK via the following command:

sudo apt-get install openjdk-8-jdk


Check the version installed:

java -version
openjdk version "1.8.0_275"
OpenJDK Runtime Environment (build 1.8.0_275-8u275-b01-0ubuntu1~20.04-b01)
OpenJDK 64-Bit Server VM (build 25.275-b01, mixed mode)


Download Hadoop binary

Visit the Hadoop releases page to find a download URL for Hadoop 3.3.0:

Hadoop Releases

Run the following command to download the latest binary for Hadoop 3.3.0:

wget https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz


Wait until the download is completed.

Unzip Hadoop binary

Create a hadoop folder under the user home folder:

mkdir ~/hadoop


Unzip the downloaded binary package:

tar -xvzf hadoop-3.3.0.tar.gz -C ~/hadoop


After the package is unzipped, change the current directory to the Hadoop folder:

cd ~/hadoop/hadoop-3.3.0/


Setup environment variables

nano ~/.bashrc


Set up the Hadoop and Java environment variables at the end of the .bashrc file as below, then save the file and close it.

#Set Hadoop-related environment variables
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=~/hadoop/hadoop-3.3.0
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME


To apply all these changes to the current terminal, execute the source command:

source ~/.bashrc


Make sure that Java and Hadoop have been properly installed on your system and can be accessed through the terminal:

java -version
openjdk version "1.8.0_275"
OpenJDK Runtime Environment (build 1.8.0_275-8u275-b01-0ubuntu1~20.04-b01)
OpenJDK 64-Bit Server VM (build 25.275-b01, mixed mode)


hadoop version
Hadoop 3.3.0
Source code repository https://gitbox.apache.org/repos/asf/hadoop.git -r aa96f1871bfd858f9bac59cf2a81ec470da649af
Compiled by brahma on 2020-07-06T18:44Z
Compiled with protoc 3.7.1
From source with checksum 5dc29b802d6ccd77b262ef9d04d19c4
This command was run using /home/daftdey/hadoop/hadoop-3.3.0/share/hadoop/common/hadoop-common-3.3.0.jar


Configure the pseudo-distributed mode (Single Node mode)

This page below is the official documentation to configure a single node cluster:

Pseudo-Distributed Operation

All the Hadoop configuration files should be located in the ~/hadoop/hadoop-3.3.0/etc/hadoop directory.

cd ~/hadoop/hadoop-3.3.0/etc/hadoop


The configuration setup steps are as follows:

  1. Edit file hadoop-env.sh:

    nano hadoop-env.sh
    

    Set Java environment variable as,

    # Java path
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
    
  2. Edit file core-site.xml:
    core-site.xml informs the Hadoop daemon where NameNode runs in the cluster. It contains configuration settings of Hadoop core such as I/O settings that are common to HDFS & MapReduce.

    nano core-site.xml
    

    Add the following configuration:

    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://localhost:9000</value>
        </property>
    </configuration>
    
  3. Edit file hdfs-site.xml:
    hdfs-site.xml contains configuration settings of HDFS daemons (i.e. NameNode, DataNode, Secondary NameNode). It also includes the replication factor and block size of HDFS.

    nano hdfs-site.xml
    

    Add the following configuration:

    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
    </configuration>
    
  4. Edit file mapred-site.xml:
    mapred-site.xml contains configuration settings of MapReduce application like the number of JVM that can run in parallel, the size of the mapper and the reducer process, CPU cores available for a process, etc.

    nano mapred-site.xml
    

    Add the following configuration:

    <configuration>
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
        <property>
            <name>mapreduce.application.classpath</name>
            <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
        </property>
    </configuration>
    
  5. Edit file yarn-site.xml:
    yarn-site.xml contains configuration settings of ResourceManager and NodeManager like application memory management size, the operation needed on program & algorithm, etc.

    nano yarn-site.xml
    

    Add the following configuration:

    <configuration>
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
        <property>
            <name>yarn.nodemanager.env-whitelist</name>
            <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
        </property>
    </configuration>
    

Format namenode

Go to the Hadoop home directory and format the Hadoop namenode:

cd ~/hadoop/hadoop-3.3.0
bin/hdfs namenode -format


This formats HDFS via the NameNode. Formatting the file system means initializing the directory specified by the dfs.namenode.name.dir property.

This command should be executed only the first time. Never format a running Hadoop filesystem; you will lose all the data stored in HDFS.

Run HDFS daemons

The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files stored in the HDFS and tracks all the files stored across the cluster.

On startup, a DataNode connects to the Namenode and it responds to the requests from the Namenode for different operations.

Start NameNode daemon and DataNode daemon

sbin/start-dfs.sh
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [tars]


Check status via jps command

jps
8544 SecondaryNameNode
8304 DataNode
8149 NameNode
8702 Jps


When the HDFS services are successfully initiated, you should be able to see these four processes running.

View name node portal

Now open the browser and go to the following URL to view the NameNode interface. Scroll down and see the number of live nodes.

http://localhost:9870/dfshealth.html#tab-overview

The web UI looks like this:

Namenode Web UI

Data nodes information can be also viewed through the Datanodes menu link:

Datanodes Web UI

Run YARN daemon

ResourceManager is the master that arbitrates all the available cluster resources and thus helps in managing the distributed applications running on the YARN system. Its work is to manage each NodeManagers and each application’s ApplicationMaster.

The NodeManager in each machine framework is the agent that is responsible for managing containers, monitoring their resource usage, and reporting the same to the ResourceManager.

Start ResourceManager daemon and NodeManager daemon

sbin/start-yarn.sh
Starting resourcemanager
Starting nodemanagers


Check status using the jps command

jps
11831 ResourceManager
11352 DataNode
12347 Jps
11196 NameNode
11981 NodeManager
11597 SecondaryNameNode


Once the YARN services are started, two more processes, for the NodeManager and the ResourceManager, can be seen running.

View YARN web portal

Now open the browser and go to the following URL to view the YARN resource manager web UI.

http://localhost:8088/cluster

The web UI looks like this:

YARN Web UI

Execution of all the Hadoop applications can be monitored through this web portal.

Shutdown Hadoop services

After completion of the execution of all Hadoop applications, the HDFS and YARN services can be stopped as follows:

sbin/stop-yarn.sh
sbin/stop-dfs.sh


Verify status via jps command, only the jps service should be running:

jps
13257 Jps


Summary

Congratulations, you have successfully installed a single-node Hadoop cluster in one go on your Linux subsystem of Windows 10. It's relatively easy compared to installing on Windows, as we don't need to download or build native Hadoop HDFS libraries.

Have fun exploring with Hadoop.

Try out an example application: Hadoop MapReduce computational model to perform analyses on DBLP publication data

If you encounter any issues while following the steps, please post a comment and I will try my best to help.

Further References

Install Hadoop 3.3.0 on Windows 10 using WSL

Install Hadoop: Setting up a Single Node Hadoop Cluster

There are numerous problems with installing and operating components such as Hadoop and Hive on Windows systems.

With the help of several Internet references, I completed the setup of a Hadoop and Hive development environment on Windows 10. This article documents the specific steps, the problems encountered, and the corresponding solutions for the entire build process.

Environmental Preparation

Software used and version notes:

  • Windows 10: the operating system
  • JDK 8: do not use JDK 9 or above for the time being, because unknown exceptions occur when starting the virtual machine
  • MySQL 8.x: used to manage Hive's metadata
  • Apache Hadoop 3.3.0
  • Apache Hive 3.1.2
  • Apache Hive 1.2.2 src: only the 1.x versions of the Hive source code provide .bat startup scripts; if you can write your own scripts, you do not need this source package
  • winutils (hadoop-3.3.0): startup dependencies for Hadoop on Windows

Some of the components are listed below with their corresponding download addresses:

  • Apache Hadoop 3.3.0:https://mirror.bit.edu.cn/apache/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
  • Apache Hive 3.1.2:https://mirrors.bfsu.edu.cn/apache/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
  • Apache Hive 1.2.2 src:https://mirrors.bfsu.edu.cn/apache/hive/hive-1.2.2/apache-hive-1.2.2-src.tar.gz
  • winutils:https://github.com/kontext-tech/winutils

After downloading these packages, install MySQL normally as a system service that starts automatically. Unzip hadoop-3.3.0.tar.gz, apache-hive-3.1.2-bin.tar.gz, apache-hive-1.2.2-src.tar.gz and winutils to a directory of your choice.

folder

Next, copy the files in the bin directory of the unpacked source package apache-hive-1.2.2-src.tar.gz to the bin directory of apache-hive-3.1.2-bin:

apache-hive-1.2.2-src.tar.gz

Then copy the hadoop.dll and winutils.exe files from the hadoop-3.3.0\bin directory in winutils to the bin folder in Hadoop's unpacked directory.

Hadoop

Finally, configure the JAVA_HOME and HADOOP_HOME environment variables, and add %JAVA_HOME%\bin and %HADOOP_HOME%\bin to the Path:

environment variables

Next, test it on the command line. If the above steps are OK, the console output will be as follows.

cmd

Configuring and starting Hadoop

In the etc\hadoop subdirectory of HADOOP_HOME, find and modify the following configuration files.

core-site.xml (the tmp directory here must be configured as an explicit, real directory; don't use the default tmp directory, otherwise you will encounter permission assignment failures later)

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>  
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/e:/LittleData/hadoop-3.3.0/data/tmp</value>
    </property>  
</configuration>

hdfs-site.xml (pre-create the nameNode and dataNode data storage directories here; note that each directory value should start with /. I pre-created the nameNode and dataNode subdirectories under HADOOP_HOME/data.)

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.http.address</name>
        <value>0.0.0.0:50070</value>
    </property>
    <property>    
        <name>dfs.namenode.name.dir</name>    
        <value>/e:/LittleData/hadoop-3.3.0/data/nameNode</value>    
    </property>    
    <property>    
        <name>dfs.datanode.data.dir</name>    
        <value>/e:/LittleData/hadoop-3.3.0/data/dataNode</value>  
    </property>
    <property>
        <name>dfs.permissions.enabled</name>
        <value>false</value>
    </property>
</configuration>

mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

yarn-site.xml

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
</configuration>

At this point, the minimal configuration is basically complete. Next you need to format the namenode and start the Hadoop services. Switch to the $HADOOP_HOME/bin directory and, in CMD, run the command hdfs namenode -format (remember not to run the format command repeatedly).


After formatting namenode, switch to the $HADOOP_HOME/sbin directory and execute the start-all.cmd script.


Here the command line will warn that the start-all.cmd script is deprecated and recommend using start-dfs.cmd and start-yarn.cmd instead. Similarly, executing stop-all.cmd gives a similar prompt, and you can use stop-dfs.cmd and stop-yarn.cmd instead. After start-all.cmd executes successfully, four JVM instances are created (in the figure above, the Shell window automatically opened four new tabs); at this point you can view the current JVM instances with jps.

λ jps
19408 ResourceManager
16324 NodeManager
14792 Jps
15004 NameNode
2252 DataNode

You can see that the four applications ResourceManager, NodeManager, NameNode and DataNode have been started, which means the standalone version of Hadoop started successfully. Exit these four processes with the stop-all.cmd command. You can check the status of scheduled tasks via http://localhost:8088/:

hadoop

Go to http://localhost:50070/ to see the status of HDFS and browse the files.

HDFS

To restart Hadoop: execute the stop-all.cmd script first, then the start-all.cmd script.

Configuring and starting Hive

Hive is built on HDFS, so make sure Hadoop is up and running first. The default file path for Hive in HDFS is prefixed with /user/hive/warehouse, so you can create this folder in HDFS from the command line first:

hdfs dfs -mkdir /user/hive/warehouse
hdfs dfs -chmod -R 777 /user/hive/warehouse

You also need to create and grant permissions to the tmp directory with the following command.

hdfs dfs -mkdir /tmp
hdfs dfs -chmod -R 777 /tmp

Add HIVE_HOME to the system variables, set its value to E:\LittleData\apache-hive-3.1.2-bin, and add %HIVE_HOME%\bin to the Path variable, similar to the previous configuration of HADOOP_HOME. Download a mysql-connector-java-8.0.x.jar and copy it to the $HIVE_HOME/lib directory.

mysql-connector-java-8.0.x.jar
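
Both steps can also be done from PowerShell. This is only a sketch: the paths follow the layout used in this article, and the connector file name and download location are illustrative, so use the actual jar you downloaded:

setx HIVE_HOME "E:\LittleData\apache-hive-3.1.2-bin"
# Copy the MySQL JDBC driver into Hive's lib folder
Copy-Item "$env:USERPROFILE\Downloads\mysql-connector-java-8.0.22.jar" "E:\LittleData\apache-hive-3.1.2-bin\lib\"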

To create the Hive configuration files, there are corresponding configuration file templates in the $HIVE_HOME/conf directory; they need to be copied and renamed as follows (see the PowerShell sketch after this list):

  • $HIVE_HOME/conf/hive-default.xml.template => $HIVE_HOME/conf/hive-site.xml
  • $HIVE_HOME/conf/hive-env.sh.template => $HIVE_HOME/conf/hive-env.sh
  • $HIVE_HOME/conf/hive-exec-log4j.properties.template => $HIVE_HOME/conf/hive-exec-log4j.properties
  • $HIVE_HOME/conf/hive-log4j.properties.template => $HIVE_HOME/conf/hive-log4j.properties
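
A sketch of the same copies in PowerShell, using the literal Hive folder from this article:

$conf = "E:\LittleData\apache-hive-3.1.2-bin\conf"
Copy-Item "$conf\hive-default.xml.template"           "$conf\hive-site.xml"
Copy-Item "$conf\hive-env.sh.template"                "$conf\hive-env.sh"
Copy-Item "$conf\hive-exec-log4j.properties.template" "$conf\hive-exec-log4j.properties"
Copy-Item "$conf\hive-log4j.properties.template"      "$conf\hive-log4j.properties"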

properties

Modify the hive-env.sh script by adding the following to the end.

export HADOOP_HOME=E:\LittleData\hadoop-3.3.0
export HIVE_CONF_DIR=E:\LittleData\apache-hive-3.1.2-bin\conf
export HIVE_AUX_JARS_PATH=E:\LittleData\apache-hive-3.1.2-bin\lib

Modify the hive-site.xml file, mainly by changing the following property items:

  • hive.metastore.warehouse.dir = /user/hive/warehouse (the data storage directory for Hive; this is the default value)
  • hive.exec.scratchdir = /tmp/hive (the temporary data directory for Hive; this is the default value)
  • javax.jdo.option.ConnectionURL = jdbc:mysql://localhost:3306/hive?characterEncoding=UTF-8&amp;serverTimezone=UTC (JDBC connection for the database storing Hive metadata)
  • javax.jdo.option.ConnectionDriverName = com.mysql.cj.jdbc.Driver (database driver for Hive metadata storage)
  • javax.jdo.option.ConnectionUserName = root (database user for Hive metadata storage)
  • javax.jdo.option.ConnectionPassword = root (password of the database where Hive metadata is stored)
  • hive.exec.local.scratchdir = E:/LittleData/apache-hive-3.1.2-bin/data/scratchDir (create the local directory $HIVE_HOME/data/scratchDir)
  • hive.downloaded.resources.dir = E:/LittleData/apache-hive-3.1.2-bin/data/resourcesDir (create the local directory $HIVE_HOME/data/resourcesDir)
  • hive.querylog.location = E:/LittleData/apache-hive-3.1.2-bin/data/querylogDir (create the local directory $HIVE_HOME/data/querylogDir)
  • hive.server2.logging.operation.log.location = E:/LittleData/apache-hive-3.1.2-bin/data/operationDir (create the local directory $HIVE_HOME/data/operationDir)
  • datanucleus.autoCreateSchema = true (optional)
  • datanucleus.autoCreateTables = true (optional)
  • datanucleus.autoCreateColumns = true (optional)
  • hive.metastore.schema.verification = false (optional)

After the modification, create a new database named hive in the local MySQL service. For the encoding and character set you can choose utf8mb4, which covers a wide range of characters (the official recommendation is latin1, but choosing a wider character set has no negative impact):

database

After the above preparations are done, you can perform the initialization of the Hive metadatabase by executing the following script in the $HIVE_HOME/bin directory.

hive --service schematool -dbType mysql -initSchema

Here there is a small problem: line 3215 of the hive-site.xml file contains an unrecognizable character.


This unrecognizable character causes exceptions when Hive executes commands and needs to be removed. When the console outputs Initialization script completed and schemaTool completed, the metastore database has been initialized successfully.


In the $HIVE_HOME/bin directory, you can connect to Hive via hive.cmd (close the console to exit)

Try to create a table t_test

hive>  create table t_test(id INT,name string);
hive>  show tables;

Check http://localhost:50070/ to confirm that the t_test table has been created successfully.


Try to execute a write statement and a query statement.

hive>  insert into t_test(id,name) values(1,'throwx');
hive>  select * from t_test;


It took more than 30 seconds to write and 0.165 seconds to read.

Connecting to Hive using JDBC

HiveServer2 is Hive's server-side interface module, which must be started for remote clients to write and query data in Hive. Currently, this module is still based on Thrift RPC; it is an improved version of HiveServer and supports multi-client access, authentication and other functions. The following common HiveServer2 properties can be modified in the configuration file hive-site.xml:

  • hive.server2.thrift.min.worker.threads = 5 (minimum number of worker threads; default value is 5)
  • hive.server2.thrift.max.worker.threads = 500 (maximum number of worker threads; default value is 500)
  • hive.server2.thrift.port = 10000 (the TCP port number to listen on; default value is 10000)
  • hive.server2.thrift.bind.host = 127.0.0.1 (bound host; default value is 127.0.0.1)
  • hive.execution.engine = mr (execution engine; default value is mr)

Execute the following command in the $HIVE_HOME/bin directory to start HiveServer2

hive.cmd --service hiveserver2

The client needs to introduce hadoop-common and hive-jdbc dependencies, and try to correspond to the Hadoop and Hive versions.

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>3.3.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>3.1.2</version>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-jdbc</artifactId>
    <version>2.3.5.RELEASE</version>
</dependency>

The hadoop-common dependency chain is quite long and will download a lot of other related dependencies along with it, so you may want to kick off that dependency download in a Maven project ahead of time. Finally, add a unit test class HiveJdbcTest:

@Slf4j
public class HiveJdbcTest {

    private static JdbcTemplate TEMPLATE;
    private static HikariDataSource DS;

    @BeforeClass
    public static void beforeClass() throws Exception {
        HikariConfig config = new HikariConfig();
        config.setDriverClassName("org.apache.hive.jdbc.HiveDriver");
        // I modified the corresponding configuration in hive-site.xml, because the port is not the default 10000
//        config.setJdbcUrl("jdbc:hive2://127.0.0.1:10091");
        config.setJdbcUrl("jdbc:hive2://127.0.0.1:10091/db_test");
        DS = new HikariDataSource(config);
        TEMPLATE = new JdbcTemplate(DS);
    }

    @AfterClass
    public static void afterClass() throws Exception {
        DS.close();
    }

    @Test
    public void testCreateDb() throws Exception {
        TEMPLATE.execute("CREATE DATABASE db_test");
    }

    @Test
    public void testCreateTable() throws Exception {
        TEMPLATE.execute("CREATE TABLE IF NOT EXISTS t_student(id INT,name string,major string)");
        log.info("Create t_student table successfully");
    }

    @Test
    public void testInsert() throws Exception {
        int update = TEMPLATE.update("INSERT INTO TABLE t_student(id,name,major) VALUES(?,?,?)", p -> {
            p.setInt(1, 10087);
            p.setString(2, "throwable");
            p.setString(3, "math");
        });
        log.info("Write to t_student successfully, update the number of records:{}", update);  // Curiously, the data is written but the returned update count is 0
    }

    @Test
    public void testSelect() throws Exception {
        List<Student> result = TEMPLATE.query("SELECT * FROM t_student", rs -> {
            List<Student> list = new ArrayList<>();
            while (rs.next()) {
                Student student = new Student();
                student.setId(rs.getLong("id"));
                student.setName(rs.getString("name"));
                student.setMajor(rs.getString("major"));
                list.add(student);
            }
            return list;
        });
        // Print log: query t_student successful, results:[HiveJdbcTest.Student(id=10087, name=throwable, major=math)]
        log.info("The query of t_student is successful, the result:{}", result);
    }

    @Data
    private static class Student {

        private Long id;
        private String name;
        private String major;
    }
}

Possible problems encountered

Java virtual machine startup failure

The root cause has been identified: Hadoop cannot run on any JDK 9+ version. It is recommended to switch to any minor version of JDK 8.

Hadoop execution file not found exception

Make sure you have copied the hadoop.dll and winutils.exe files from the hadoop-3.3.0\bin directory in winutils to the bin folder of Hadoop's unpacked directory.

When executing the start-all.cmd script, the batch script may not be found. This problem occurred on my work development machine but could not be reproduced on my home machine. The workaround is to add cd $HADOOP_HOME as the first line of the start-all.cmd script, for example cd E:\LittleData\hadoop-3.3.0.

Unable to access localhost:50070

This is generally because the hdfs-site.xml configuration is missing the dfs.http.address configuration item; add:

<property>
    <name>dfs.http.address</name>
    <value>0.0.0.0:50070</value>
</property>

Then just call stop-all.cmd and then call start-all.cmd to restart Hadoop.

Hive connection to MySQL exception

Check whether the MySQL driver package has been copied correctly to $HIVE_HOME/lib, and check whether the four properties such as javax.jdo.option.ConnectionURL are configured correctly. If they are all correct, check whether the MySQL version is incompatible or the server version does not match the driver version.

Hive can’t find the batch file

The error generally looks like xxx.cmd' is not recognized as an internal or external command..., which usually means Hive's command execution failed; you need to copy all the .cmd scripts from the bin directory of the Hive 1.x source package to the corresponding $HIVE_HOME/bin directory.

Folder permission issues

Common exceptions such as CreateSymbolicLink errors can prevent Hive from writing data with the INSERT or LOAD commands. Such problems can be solved as follows.

Press Win + R and run gpedit.msc, then go to Computer Configuration > Windows Settings > Security Settings > Local Policies > User Rights Assignment > Create symbolic links, and add the current user.


Or just start CMD with administrator account or administrator privileges, and then execute the corresponding script to start Hadoop or Hive.

SessionNotRunning exception

This exception may occur when starting HiveServer2 or when an external client connects to HiveServer2, specifically a java.lang.ClassNotFoundException: org.apache.tez.dag.api.TezConfiguration exception. The solution is to change the hive.execution.engine property value in the hive-site.xml configuration file from tez to mr and then restart HiveServer2. Because Tez is not integrated, the restart will still report an error, but after 60000 ms it will automatically retry the startup (the retry usually succeeds):


This is a legacy issue, but it does not affect the client’s normal connection, only the startup time will be 60 seconds longer.

HiveServer2 port conflict

Modify the value of the hive.server2.thrift.port property in the configuration file hive-site.xml to an unoccupied port, and restart HiveServer2.

Data node security mode exception

Generally a SafeModeException appears, with the prompt Safe mode is ON. Run the command hdfs dfsadmin -safemode leave to leave safe mode.

It is also common for an exception to occur when a Hive JDBC client connects to the HiveServer2 service, specifically with the message User: xxx is not allowed to impersonate anonymous. In this case, modify the Hadoop configuration file core-site.xml and add:

<property>
    <name>hadoop.proxyuser.xxx.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.xxx.groups</name>
    <value>*</value>
</property>

Then just restart the Hadoop service.

MapRedTask’s Permissions Problem

This exception is commonly thrown when Hive performs an INSERT or LOAD operation through a JDBC client connected to the HiveServer2 service; it is generally described as Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask. Permission denied: user=anonymous, access=EXECUTE, inode="/tmp/hadoop-yarn":xxxx:supergroup:drwx------. Just give the anonymous user read and write access to the /tmp directory with the command hdfs dfs -chmod -R 777 /tmp.

Summary

It is better to build the Hadoop and Hive development environment directly on a Linux or Unix system; the file path and permission quirks of Windows can lead to many unexpected problems. This article draws on a large number of Internet resources and introductory books on Hadoop and Hive, which I won't list here; I am standing on the shoulders of giants.


Reference https://www.throwx.cn/2020/11/03/hadoop-hive-dev-env-in-win-10/

To set up and install Hadoop in pseudo-distributed mode on Windows 10, use the steps given below. Let's discuss them one by one.

Step 1: Download Binary Package :

Download the latest binary from the following site as follows.

http://hadoop.apache.org/releases.html

For reference, you can save the file to the following folder:

C:\BigData

Step 2: Unzip the binary package

Open Git Bash, change directory (cd) to the folder where you saved the binary package, and then unzip it as follows.

$ cd C:\BigData
MINGW64: C:\BigData
$ tar -xvzf hadoop-3.1.2.tar.gz

In my case, the Hadoop binary is extracted to C:\BigData\hadoop-3.1.2.

Next, go to this GitHub repo and download the bin folder as a zip file as shown below. Extract the zip and copy all the files present under the bin folder to C:\BigData\hadoop-3.1.2\bin. Replace the existing files as well.

Step 3: Create folders for datanode and namenode :

  • Go to C:/BigData/hadoop-3.1.2 and create a folder 'data'. Inside the 'data' folder create two folders, 'datanode' and 'namenode'. Your files on HDFS will reside under the datanode folder.

  • Set Hadoop Environment Variables
  • Hadoop requires the following environment variables to be set.
HADOOP_HOME="C:\BigData\hadoop-3.1.2"
HADOOP_BIN="C:\BigData\hadoop-3.1.2\bin"
JAVA_HOME=<Root of your JDK installation>
  • To set these variables, navigate to My Computer or This PC.
Right-click -> Properties -> Advanced System settings -> Environment variables. 
  • Click New to create a new environment variable.

  • If you don't have Java 1.8 installed, then you'll have to download and install it first. If the JAVA_HOME environment variable is already set, check whether the path has any spaces in it (e.g. C:\Program Files\Java\…). Spaces in the JAVA_HOME path will lead to issues. There is a trick to get around it: replace 'Program Files' with 'Progra~1' in the variable value. Ensure that the Java version is 1.8 and JAVA_HOME points to JDK 1.8.

Step 4: To make Short Name of Java Home path

  • Set Hadoop Environment Variables
  • Edit PATH Environment Variable

  • Click on New and add %JAVA_HOME%, %HADOOP_HOME%, %HADOOP_BIN%, and %HADOOP_HOME%\sbin to your PATH one by one.

  • Now we have set the environment variables, we need to validate them. Open a new Windows Command prompt and run an echo command on each variable to confirm they are assigned the desired values.
echo %HADOOP_HOME%
echo %HADOOP_BIN%
echo %PATH%
  • If the variables are not initialised yet, it is likely because you are testing them in an old session. Make sure you have opened a new Command Prompt to test them.

Step 5: Configure Hadoop

Once environment variables are set up, we need to configure Hadoop by editing the following configuration files.

hadoop-env.cmd
core-site.xml
hdfs-site.xml
mapred-site.xml
yarn-site.xml
hadoop-env.cmd

First, let's configure the Hadoop environment file. Open C:\BigData\hadoop-3.1.2\etc\hadoop\hadoop-env.cmd and add the content below at the bottom.

set HADOOP_PREFIX=%HADOOP_HOME%
set HADOOP_CONF_DIR=%HADOOP_PREFIX%\etc\hadoop
set YARN_CONF_DIR=%HADOOP_CONF_DIR%
set PATH=%PATH%;%HADOOP_PREFIX%\bin

Step 6: Edit hdfs-site.xml 

After editing core-site.xml, you need to set the replication factor and the locations of the namenode and datanodes. Open C:\BigData\hadoop-3.1.2\etc\hadoop\hdfs-site.xml and add the below content within the <configuration> </configuration> tags.

<configuration>
 <property>
    <name>dfs.replication</name>
    <value>1</value>
 </property>
 <property>
    <name>dfs.namenode.name.dir</name>
    <value>C:\BigData\hadoop-3.1.2\data\namenode</value>
 </property>
 <property>
    <name>dfs.datanode.data.dir</name>
    <value>C:\BigData\hadoop-3.1.2\data\datanode</value>
 </property>
</configuration>

Step 7: Edit core-site.xml

Now, configure Hadoop Core's settings. Open C:\BigData\hadoop-3.1.2\etc\hadoop\core-site.xml and add the below content within the <configuration> </configuration> tags.

<configuration>
 <property>
   <name>fs.default.name</name>
   <value>hdfs://0.0.0.0:19000</value>
 </property>  
</configuration>

Step 8: YARN configurations

Edit file yarn-site.xml      

Make sure the following entries are existing as follows.

<configuration>
 <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
 </property>
 <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
 </property>
</configuration>

Step 9: Edit mapred-site.xml

Finally, let's configure properties for the MapReduce framework. Open C:\BigData\hadoop-3.1.2\etc\hadoop\mapred-site.xml and add the below content within the <configuration> </configuration> tags. If you don't see mapred-site.xml, then open the mapred-site.xml.template file and rename it to mapred-site.xml.

<configuration>
 <property>
    <name>mapreduce.job.user.name</name>    <value>%USERNAME%</value>
 </property>
 <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
 </property>
 <property>
    <name>yarn.apps.stagingDir</name>    <value>/user/%USERNAME%/staging</value>
 </property>
 <property>
    <name>mapreduce.jobtracker.address</name>
    <value>local</value>
 </property>
</configuration>

Check if the file C:\BigData\hadoop-3.1.2\etc\hadoop\slaves is present; if it's not, create one, add localhost to it, and save it (a PowerShell one-liner for this follows below).
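
If the file is missing, it can be created in one line from PowerShell (a sketch using the path from this article):

Set-Content -Path "C:\BigData\hadoop-3.1.2\etc\hadoop\slaves" -Value "localhost"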

Step 10: Format Name Node :

To format the Name Node, open a new Windows Command Prompt and run the command below. It might give you a few warnings; ignore them.

  • hadoop namenode -format

                       Format Hadoop Name Node

Step 11: Launch Hadoop :

Open a new Windows Command Prompt and make sure to run it as an Administrator to avoid permission errors. Once opened, execute the start-all.cmd command. Since we have added %HADOOP_HOME%\sbin to the PATH variable, you can run this command from any folder. If you haven't done so, go to the %HADOOP_HOME%\sbin folder and run the command.

For reference, you can check the screenshot below: 4 new Command Prompt windows will open, one for each of the 4 daemon processes:

  • namenode
  • datanode
  • node manager
  • resource manager

Don’t close these windows, minimize them. Closing the windows will terminate the daemons. You can run them in the background if you don’t like to see these windows.

Step 12: Hadoop Web UI

Finally, let's monitor how the Hadoop daemons are doing. You can also use the Web UI for all kinds of administrative and monitoring purposes. Open your browser and get started.

Step 13: Resource Manager

Open localhost:8088 to open Resource Manager

Step 14: Node Manager

Open localhost:8042 to open Node Manager

Step 15: Name Node :

Open localhost:9870 to check out the health of Name Node

Step 16: Data Node :

Open localhost:9864 to check out Data Node
