All about Elasticsearch Part I

Search is a common building block for applications. Whether we are searching Wikipedia or our log files, the behaviour is similar: a query is entered and the most relevant documents are returned. The core data structure for search is an inverted index. Elasticsearch is a scalable, resilient search tool that shards and replicates a search index. …
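To make the core idea concrete, here is a minimal sketch of an inverted index in Python. The documents, IDs, and the `search` helper are made up for illustration; a real engine like Elasticsearch adds tokenization, relevance scoring, sharding, and replication on top of this structure.

```python
from collections import defaultdict

# Toy corpus: document ID -> text (IDs and texts are invented examples).
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "the quick dog",
}

# The inverted index maps each term to the set of document IDs containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(query):
    """Return IDs of documents containing every term in the query."""
    terms = query.split()
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

Looking up a term is a dictionary access rather than a scan over every document, which is what makes the inverted index the natural structure for search.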

Custom Input Format in MapReduce

Before implementing a custom InputFormat, let's first answer the question: what is an InputFormat? InputFormat describes the input specification for a MapReduce job (wiki). The MapReduce framework relies on the InputFormat of the job to: validate the input specification of the job; split up the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper. …

Hadoop Installation: 2.6.0 Part II

This post is a continuation of Part I; please check Part I here. We have downloaded Hadoop and configured SSH as well. Now we are going to start with the Hadoop configuration files. 3. /usr/local/hadoop/etc/hadoop/core-site.xml: This file contains configuration properties that Hadoop uses when starting up, and it can be used to override the default …
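For reference, a core-site.xml for a single-node setup typically looks like the fragment below. The value shown is the common single-node default for Hadoop 2.x; the host and port are assumptions and should match your own cluster.

```xml
<!-- Illustrative core-site.xml fragment for a single-node Hadoop 2.x setup.
     Adjust the host/port for your cluster. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```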

Open Data Platform

The Open Data Platform (ODP) initiative is an industry effort focused on simplifying adoption of Apache Hadoop for the enterprise, and enabling big data solutions to flourish through improved ecosystem interoperability. It relies on the governance of the Apache Software Foundation community to innovate and deliver the Apache project technologies included in the ODP core …

Moving Big Data from Mainframe to Hadoop

A blog post from Cloudera. Apache Sqoop provides a framework to move data between HDFS and relational databases in parallel using Hadoop's MapReduce framework. As Hadoop becomes more popular in enterprises, there is a growing need to move data from non-relational sources, such as mainframe datasets, to Hadoop. Following are possible reasons for this: HDFS …

Hadoop vs Spark

Listen in on any conversation about big data, and you’ll probably hear mention of Hadoop or Apache Spark. Here’s a brief look at what they do and how they compare. 1: They do different things. Hadoop and Apache Spark are both big-data frameworks, but they don’t really serve the same purposes. Hadoop is essentially a …

How does Hadoop process records split across block boundaries?

The logical records that FileInputFormats define do not usually fit neatly into HDFS blocks. For example, a TextInputFormat’s logical records are lines, which will cross HDFS block boundaries more often than not. This has no bearing on the functioning of your program—lines are not missed or broken, for example—but it’s worth knowing about, as it does …
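The reason lines are neither missed nor broken can be sketched in a few lines of Python. This is a simplified, assumed model of what Hadoop's LineRecordReader does, not the real implementation: every split except the first skips its (possibly partial) first line, because the previous split reads past its own end to finish the line it started.

```python
def read_lines_for_split(data: bytes, start: int, length: int):
    """Return the complete lines a split of [start, start+length) is responsible for."""
    end = start + length
    pos = start
    if start != 0:
        # The previous split finishes the line we land in the middle of,
        # so skip past the next newline before reading.
        nl = data.find(b"\n", start)
        if nl == -1:
            return []
        pos = nl + 1
    lines = []
    # A line whose start position is at or before `end` belongs to this
    # split, even if its bytes continue past the split boundary.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:
            lines.append(data[pos:])  # final line with no trailing newline
            break
        lines.append(data[pos:nl])
        pos = nl + 1
    return lines
```

Running this over consecutive splits of the same buffer yields every line exactly once, with lines that straddle a boundary assigned wholly to the split in which they begin.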