Custom Input Format: Before implementing a custom InputFormat, let's first look at what an InputFormat is. InputFormat describes the input specification for a Map-Reduce job. (wiki) The Map-Reduce framework relies on the InputFormat of the job to: validate the input specification of the job; split up the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper. … More Custom Input Format in MapReduce
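As a quick illustration of the pieces involved, here is a minimal sketch of a custom InputFormat. The class name UnsplittableTextInputFormat and the choice to reuse LineRecordReader are my own assumptions for the example, not details from the post; the sketch simply marks every file as unsplittable so that each file goes to exactly one Mapper.

```java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical custom InputFormat: reuses the standard LineRecordReader but
// disables splitting, so one Mapper processes each whole file line by line.
public class UnsplittableTextInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Each file becomes exactly one InputSplit (and thus one Mapper).
        return false;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        return new LineRecordReader();
    }
}
```

A job would then select it with job.setInputFormatClass(UnsplittableTextInputFormat.class).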
This post is a continuation of Part I; please check Part I here. We have downloaded Hadoop and configured SSH as well. Now we are going to start with the Hadoop configuration files. 3. /usr/local/hadoop/etc/hadoop/core-site.xml: This file contains configuration properties that Hadoop uses when starting up. It can be used to override the default … More Hadoop Installation : 2.6.0 Part II
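For reference, a typical override in core-site.xml looks like the sketch below; the fs.defaultFS value hdfs://localhost:9000 and the hadoop.tmp.dir path are assumptions for a single-node setup, not values quoted from the post.

```xml
<?xml version="1.0"?>
<!-- Minimal single-node sketch of core-site.xml (values are assumptions). -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value> <!-- default filesystem URI -->
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value> <!-- base directory for Hadoop temp files -->
  </property>
</configuration>
```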
In this post, I am giving the step-by-step process to install Hadoop 2.6.0. To see the distribution/version you are using, you can try lsb_release -a or cat /etc/lsb-release to find out the version of Ubuntu. Installing Java: the Hadoop framework is written in Java!! user1@localhost:~$ cd ~ pwd # Update the source list k@laptop(local … More Hadoop Installation : 2.6.0 Part I
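Gathering the version-check commands from the excerpt above into one place (the apt commands that follow them are cut off in the excerpt, so only these are shown):

```bash
# Check which Ubuntu release you are running before installing Hadoop 2.6.0
lsb_release -a
cat /etc/lsb-release
```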
A blog post from Cloudera. Apache Sqoop provides a framework to move data between HDFS and relational databases in parallel using Hadoop’s MR framework. As Hadoop becomes more popular in enterprises, there is a growing need to move data from non-relational sources, such as mainframe datasets, to Hadoop. The following are possible reasons for this: HDFS … More Moving Big Data from Mainframe to Hadoop
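As a hedged example of the parallel data movement the post refers to, a plain relational import with Sqoop looks roughly like this; the connection string, table name, and target directory are placeholders made up for illustration, not details from the Cloudera post.

```bash
# Import a relational table into HDFS with 4 parallel map tasks
# (all names below are hypothetical placeholders).
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/sales/orders \
  --num-mappers 4
```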
Listen in on any conversation about big data, and you’ll probably hear mention of Hadoop or Apache Spark. Here’s a brief look at what they do and how they compare. 1: They do different things. Hadoop and Apache Spark are both big-data frameworks, but they don’t really serve the same purposes. Hadoop is essentially a … More Hadoop vs Spark
The logical records that FileInputFormats define do not usually fit neatly into HDFS blocks. For example, a TextInputFormat’s logical records are lines, which will cross HDFS boundaries more often than not. This has no bearing on the functioning of your program—lines are not missed or broken, for example—but it’s worth knowing about, as it does … More How does Hadoop process records split across block boundaries?
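To make the idea concrete, here is a toy simulation (plain Java, not Hadoop source code) of the convention line-oriented readers follow: the reader that owns the earlier range finishes the line that straddles its end boundary, and a reader that does not start at offset 0 skips the partial line it begins in, so every line is read exactly once.

```java
import java.util.ArrayList;
import java.util.List;

// Toy simulation (not Hadoop code) of how line readers cooperate across a
// block/split boundary: the earlier reader finishes the straddling line, and
// the later reader skips the partial line it starts in.
public class SplitBoundaryDemo {

    static List<String> readSplit(String data, int start, int end) {
        int pos = start;
        if (start != 0) {
            // Not the first split: skip to the character after the first '\n',
            // because that partial line belongs to the previous reader.
            int nl = data.indexOf('\n', start);
            pos = (nl == -1) ? data.length() : nl + 1;
        }
        List<String> lines = new ArrayList<>();
        // Read whole lines that begin before 'end'; the last line is allowed
        // to run past the boundary.
        while (pos < end && pos < data.length()) {
            int nl = data.indexOf('\n', pos);
            if (nl == -1) { lines.add(data.substring(pos)); pos = data.length(); }
            else { lines.add(data.substring(pos, nl)); pos = nl + 1; }
        }
        return lines;
    }

    public static void main(String[] args) {
        String file = "first line\nsecond line\nthird line\n";
        int boundary = 15; // falls in the middle of "second line"
        System.out.println(readSplit(file, 0, boundary));             // [first line, second line]
        System.out.println(readSplit(file, boundary, file.length())); // [third line]
    }
}
```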
Apache Hive: The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. We used Apache Hive in a recent project, and I was assigned to tune our application, … More Apache Hive Performance Tuning
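The snippet below is a sketch of commonly tuned Hive knobs, not necessarily the ones used in that project; the table and column names are made up for illustration.

```sql
-- Commonly tuned Hive settings (illustrative; not the post's actual tuning).
SET hive.exec.parallel=true;                    -- run independent stages in parallel
SET hive.exec.dynamic.partition=true;           -- allow dynamic partition inserts
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.vectorized.execution.enabled=true;     -- vectorized execution (ORC data)

-- Partitioning plus a columnar format reduces the data scanned per query
-- (hypothetical table for illustration).
CREATE TABLE web_logs (
  user_id STRING,
  url     STRING
)
PARTITIONED BY (dt STRING)
STORED AS ORC;
```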