Why Hadoop on Cygwin is a bad idea?

Cygwin is a DLL (cygwin1.dll) that acts as a Linux API emulation layer, providing substantial Linux API functionality, together with a collection of tools that give a Linux look and feel. Although Cygwin is a really nice emulation layer, it is not 24x7 ready. Running Hadoop on Cygwin on production servers is a bad idea for the following reasons:
- First of all, it is officially "for development purposes only"
- It can be quite tricky to install Cygwin and SSHD components on all of your servers.
- Like any other software, Cygwin has its own bugs, and these add to the bugs you already have in Hadoop. Sometimes you will end up with something like:
2010-xx-xx xx:xx:xx,430 WARN mapred.TaskTracker - Error initializing attempt_201001280757_0129_m_000002_0:
org.apache.hadoop.util.Shell$ExitCodeException: assertion "root_idx != -1" failed: file "/ext/build/netrel/src/cygwin-1.7.1-1/winsup/cygwin/mount.cc", line 363, function: void mount_info::init()
Frame Function Args
00289984 77461184 (00000084, 0000EA60, 00000000, 00289AA8)
00289998 77461138 (00000084, 0000EA60, 000000A4, 00289A8C)
End of stack trace
- Windows has a slow process startup time compared to Linux, while Hadoop does part of its work by running shell commands (measuring disk size and file size, starting the Mapper and Reducer). Even though this works well on Linux, on Windows it results in poor performance
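The process-startup cost is easy to observe in isolation. Here is a minimal micro-benchmark sketch (not part of the patch; the class and method names are my own) that times repeated spawns of a trivial command, which is roughly the overhead Hadoop pays every time it shells out:

```java
public class SpawnCost {
    // Time n spawns of a do-nothing command and return the average in milliseconds.
    // "cmd /c exit" on Windows, "true" elsewhere: both exit immediately with code 0.
    static long averageSpawnMillis(int n) throws Exception {
        String os = System.getProperty("os.name").toLowerCase();
        String[] cmd = os.contains("win")
                ? new String[]{"cmd", "/c", "exit"}
                : new String[]{"true"};
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            new ProcessBuilder(cmd).start().waitFor();
        }
        return (System.nanoTime() - start) / n / 1000000L;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("average spawn time: " + averageSpawnMillis(20) + " ms");
    }
}
```

Running this on a Windows box and on a comparable Linux box makes the per-process penalty visible directly.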
Cluster Setup

In this article I assume that you are installing Hadoop on a single machine. For a multi-server setup, repeat all the steps in this document on every server.
First of all, download Hadoop 0.20.2 from the Apache mirrors site and configure it. Please note that you should use the Windows path separator "\" in paths to files or folders on the local filesystem.
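For example, a property pointing at the local filesystem in one of the conf/*.xml configuration files might look like this (the C:\hadoop paths are placeholders of my own, not values shipped with the patch):

```xml
<!-- note the Windows "\" separator in values that name local paths -->
<property>
  <name>dfs.data.dir</name>
  <value>C:\hadoop\data\dfs</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>C:\hadoop\tmp</value>
</property>
```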
Now you'll need the patched Hadoop, the Windows shell scripts, and the Java Service Wrapper configuration files to be able to run the JobTracker, NameNode, TaskTracker, and DataNode as Windows services. All these components can be downloaded from the Hadoop Jira. Please download the file Hadoop-0.20.2-patched.zip. In case you want to build Hadoop yourself, read the Building Patched Hadoop From Source section of this document.
Unpack the downloaded archive to a directory of your choice and copy:
- hadoop-0.20.2-core.jar file and service folder to the root of your Hadoop installation
- cpappend.bat, hadoop.bat files from bin folder to the bin folder of your Hadoop installation
- commons-compress-1.0.jar, jna-3.2.2.jar, commons-io-1.4.jar from lib folder to the lib folder of your Hadoop installation
Start a Windows Command Shell and go to the service\bin folder of your Hadoop installation. If you are installing on Windows 7 or Windows Server 2008, start the Command Shell as an administrator. Run the install command for each service configuration, for example:
InstallService.bat ..\conf\JobTracker.conf

You will be asked to enter the password for the account you set in the HADOOP_USER environment variable, and you should see the following output:
wrapper | Hadoop XXXXXXX installed.

Finally, format the DFS filesystem. To do so, go to the bin folder in the root of your Hadoop installation and run the shell command
hadoop.bat namenode -format

Now you are ready to start Hadoop. Open Services (services.msc) and start the services in the following order:
- Hadoop NameNode
- Hadoop DataNode
- Hadoop JobTracker
- Hadoop TaskTracker
Cluster Deinstallation

To remove the services, go to the service\bin directory of your Hadoop installation and run, for each service configuration:
UninstallService.bat ..\conf\JobTracker.conf

These commands will stop all the Hadoop Windows services and remove them.
How does it work?

Hadoop uses Linux shell commands to accomplish some of its tasks. For example, it uses the Linux df and du commands to measure folder size and to get file system disk space usage. We implemented this functionality with the help of JNA, which gives us access to the native shared libraries Kernel32.dll and Advapi32.dll.
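The patch itself binds Kernel32.dll through JNA; as a rough, portable illustration of the kind of numbers the df replacement has to report, here is a sketch using plain java.io.File (Java 6+). This is a stand-in of my own, not the JNA code from the patch:

```java
import java.io.File;

public class DiskReport {
    // Portable stand-in for the 'df' information Hadoop needs; the patch
    // obtains the same figures from Kernel32.dll via JNA instead.
    static String report(File root) {
        long capacity = root.getTotalSpace();   // total size of the partition, in bytes
        long available = root.getUsableSpace(); // bytes available to this JVM
        return "capacity=" + capacity + " available=" + available
                + " used=" + (capacity - available);
    }

    public static void main(String[] args) {
        File root = new File(args.length > 0 ? args[0] : ".");
        System.out.println(report(root));
    }
}
```

The JNA route is still needed for functionality java.io.File does not cover, such as querying user account information through Advapi32.dll.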
Building Patched Hadoop From Source

You can build Hadoop on both Windows and Linux. To build Hadoop on Windows you will need Cygwin. First check out the Hadoop 0.20.2 source code and our patch from the Hadoop Jira. Put the patch in the folder where you checked out Hadoop and apply it by issuing
patch -p0 < HADOOP-6767.patch
Now simply build Hadoop
ant clean jar

The built Hadoop will be located in the build folder.
Shortcomings

Although we tried to test our patch as thoroughly as we could, there may still be numerous bugs in it. Here is a list of the known shortcomings of the patch:
- We haven't tested patched Hadoop with contributed modules
- The JNA library is provided under the LGPL 2.1 license, which is not fully compatible with Hadoop's license
- I have only patched Hadoop 0.20.2, but I am planning to provide patches for Hadoop 0.18 and the current trunk later
- JNA is not the best choice for accessing Windows native API functions