All about Elasticsearch Part I

Search is a common building block for applications. Whether we are searching Wikipedia or our log files, the behaviour is similar: a query is entered and the most relevant documents are returned. The core data structure for search is an inverted index. Elasticsearch is a scalable, resilient search tool that shards and replicates a search index. …
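To make the core idea concrete, here is a minimal sketch of an inverted index in Python. The documents, IDs, and the `search` helper are made up for illustration; a real engine like Elasticsearch adds tokenization, relevance scoring, sharding, and replication on top of this structure.

```python
from collections import defaultdict

# Toy corpus: document ID -> text (IDs and texts are invented examples).
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "the quick dog",
}

# The inverted index maps each term to the set of document IDs containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(query):
    """Return IDs of documents containing every term in the query."""
    terms = query.split()
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

Looking up a term is a dictionary access rather than a scan over every document, which is what makes the inverted index the natural structure for search.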

Custom Input Format in MapReduce

Before implementing a custom InputFormat, let's first answer the question: what is an InputFormat? InputFormat describes the input specification for a MapReduce job (wiki). The MapReduce framework relies on the InputFormat of the job to: validate the input specification of the job; split up the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper. …

Hadoop Installation: 2.6.0 Part II

This post is a continuation of Part I; please check Part I here. We have downloaded Hadoop and configured SSH as well. Now we are going to start with the Hadoop configuration files. 3. /usr/local/hadoop/etc/hadoop/core-site.xml: This file contains configuration properties that Hadoop uses when starting up, and it can be used to override the default …
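For reference, a core-site.xml for a single-node setup typically looks like the fragment below. The value shown is the common single-node default for Hadoop 2.x; the host and port are assumptions and should match your own cluster.

```xml
<!-- Illustrative core-site.xml fragment for a single-node Hadoop 2.x setup.
     Adjust the host/port for your cluster. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```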

Open Data Platform

The Open Data Platform (ODP) initiative is an industry effort focused on simplifying adoption of Apache Hadoop for the enterprise, and enabling big data solutions to flourish through improved ecosystem interoperability. It relies on the governance of the Apache Software Foundation community to innovate and deliver the Apache project technologies included in the ODP core …

Moving Big Data from Mainframe to Hadoop

A blog post from Cloudera. Apache Sqoop provides a framework to move data between HDFS and relational databases in parallel using Hadoop's MapReduce framework. As Hadoop becomes more popular in enterprises, there is a growing need to move data from non-relational sources, such as mainframe datasets, to Hadoop. Following are possible reasons for this: HDFS …

Hadoop vs Spark

Listen in on any conversation about big data, and you’ll probably hear mention of Hadoop or Apache Spark. Here’s a brief look at what they do and how they compare. 1: They do different things. Hadoop and Apache Spark are both big-data frameworks, but they don’t really serve the same purposes. Hadoop is essentially a …

How does Hadoop process records split across block boundaries?

The logical records that FileInputFormats define do not usually fit neatly into HDFS blocks. For example, a TextInputFormat’s logical records are lines, which will cross HDFS block boundaries more often than not. This has no bearing on the functioning of your program—lines are not missed or broken, for example—but it’s worth knowing about, as it does …
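The reason lines are neither missed nor broken can be sketched in a few lines of Python. This is a simplified, assumed model of what Hadoop's LineRecordReader does, not the real implementation: every split except the first skips its (possibly partial) first line, because the previous split reads past its own end to finish the line it started.

```python
def read_lines_for_split(data: bytes, start: int, length: int):
    """Return the complete lines a split of [start, start+length) is responsible for."""
    end = start + length
    pos = start
    if start != 0:
        # The previous split finishes the line we land in the middle of,
        # so skip past the next newline before reading.
        nl = data.find(b"\n", start)
        if nl == -1:
            return []
        pos = nl + 1
    lines = []
    # A line whose start position is at or before `end` belongs to this
    # split, even if its bytes continue past the split boundary.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:
            lines.append(data[pos:])  # final line with no trailing newline
            break
        lines.append(data[pos:nl])
        pos = nl + 1
    return lines
```

Running this over consecutive splits of the same buffer yields every line exactly once, with lines that straddle a boundary assigned wholly to the split in which they begin.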