Listen in on any conversation about big data, and you’ll probably hear mention of Hadoop or Apache Spark. Here’s a brief look at what they do and how they compare.
1: They do different things. Hadoop and Apache Spark are both big-data frameworks, but they don’t really serve the same purposes. Hadoop is essentially a distributed data infrastructure: It distributes massive data collections across multiple nodes within a cluster of commodity servers, which means you don’t need to buy and maintain expensive custom hardware. It also indexes and keeps track of that data, enabling big-data processing and analytics far more effectively than was possible previously. Spark, on the other hand, is a data-processing tool that operates on those distributed data collections; it doesn’t do distributed storage.
2: You can use one without the other. Hadoop includes not just a storage component, known as the Hadoop Distributed File System, but also a processing component called MapReduce, so you don’t need Spark to get your processing done. Conversely, you can also use Spark without Hadoop. Spark does not come with its own file management system, though, so it needs to be integrated with one — if not HDFS, then another cloud-based data platform. Spark was designed for Hadoop, however, so many agree they’re better together.
3: Spark is speedier. Spark is generally a lot faster than MapReduce because of the way it processes data. While MapReduce operates in steps, Spark operates on the whole data set in one fell swoop. “The MapReduce workflow looks like this: read data from the cluster, perform an operation, write results to the cluster, read updated data from the cluster, perform next operation, write next results to the cluster, etc.,” explained Kirk Borne, principal data scientist at Booz Allen Hamilton. Spark, on the other hand, completes the full data analytics operations in-memory and in near real-time: “Read data from the cluster, perform all of the requisite analytic operations, write results to the cluster, done,” Borne said. Spark can be as much as 10 times faster than MapReduce for batch processing and up to 100 times faster for in-memory analytics, he said.
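Borne's two workflows can be sketched in plain Python. This is a toy illustration, not real Hadoop or Spark code: a temp file stands in for HDFS, and each "MapReduce job" reads its input from disk and writes its output back, while the Spark-style version runs the whole pipeline in memory.

```python
import json, os, tempfile

# Toy sketch (plain Python, not real Hadoop/Spark APIs): every MapReduce
# stage writes its output to "the cluster" -- here a temp file standing in
# for HDFS -- and the next stage reads it back.
def mr_stage(fn, in_path, out_path):
    with open(in_path) as f:
        records = json.load(f)
    with open(out_path, "w") as f:
        json.dump([fn(x) for x in records], f)

data = list(range(10))
tmp = tempfile.mkdtemp()
paths = [os.path.join(tmp, f"stage{i}.json") for i in range(3)]
with open(paths[0], "w") as f:
    json.dump(data, f)

mr_stage(lambda x: x * 2, paths[0], paths[1])  # job 1: read, operate, write
mr_stage(lambda x: x + 1, paths[1], paths[2])  # job 2: read updated data, write
with open(paths[2]) as f:
    mr_result = json.load(f)

# Spark-style: the whole pipeline runs in memory in "one fell swoop";
# nothing touches disk between the two operations.
spark_result = [(x * 2) + 1 for x in data]
assert mr_result == spark_result
```

Both paths compute the same answer; the difference is that the MapReduce path pays a disk round-trip between every pair of operations, which is exactly where the 10x–100x gap comes from.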
4: You may not need Spark’s speed. MapReduce’s processing style can be just fine if your data operations and reporting requirements are mostly static and you can wait for batch-mode processing. But if you need to do analytics on streaming data, like from sensors on a factory floor, or have applications that require multiple operations, you probably want to go with Spark. Most machine-learning algorithms, for example, require multiple operations. Common applications for Spark include real-time marketing campaigns, online product recommendations, cybersecurity analytics and machine log monitoring.
5: Failure recovery: different, but still good. Hadoop is naturally resilient to system faults or failures since data are written to disk after every operation. Spark has similar built-in resiliency by virtue of storing its data objects in resilient distributed datasets (RDDs) spread across the cluster. “These data objects can be stored in memory or on disks, and RDD provides full recovery from faults or failures,” Borne pointed out.
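The key idea behind RDD recovery is lineage: a partition is defined by its immutable base data plus the chain of transformations that produced it, so a lost in-memory copy can simply be recomputed. Here is a minimal toy sketch of that idea in plain Python; `ToyRDD` and its methods are illustrative names, not Spark's API.

```python
# Toy lineage-based recovery (illustrative, not Spark's actual API): an RDD
# partition is its immutable base data plus a chain of transformations, so a
# lost partition is recomputed rather than restored from a disk checkpoint.
class ToyRDD:
    def __init__(self, base, lineage=()):
        self.base = base          # immutable source records
        self.lineage = lineage    # recorded chain of transformations

    def map(self, fn):
        # Transformations never mutate; they return a new RDD with a
        # longer lineage.
        return ToyRDD(self.base, self.lineage + (fn,))

    def compute(self):
        out = list(self.base)
        for fn in self.lineage:
            out = [fn(x) for x in out]
        return out

rdd = ToyRDD([1, 2, 3]).map(lambda x: x * 10).map(lambda x: x + 1)
cached = rdd.compute()   # [11, 21, 31], held in memory
cached = None            # simulate losing the in-memory partition
recovered = rdd.compute()  # recovery = replay the lineage from base data
assert recovered == [11, 21, 31]
```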
Use cases for Spark:
The main use cases for Spark are iterative machine-learning algorithms and interactive analytics.
From the ML side:
Most ML algorithms run over the same data set iteratively, and in MapReduce there was no easy way to communicate shared state from one iteration to the next.
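A toy plain-Python sketch makes the cost concrete (no real Hadoop/Spark APIs here; `load_data` is an illustrative stand-in for reading from HDFS). An iterative algorithm, here a trivial 1-D gradient descent toward the data's mean, touches the full data set every iteration; in the MapReduce model each iteration is a fresh job that reloads the data, while a cached in-memory copy pays the load once.

```python
# Count how often the "cluster" gets scanned under each model.
loads = {"count": 0}

def load_data():  # illustrative stand-in for an HDFS read
    loads["count"] += 1
    return [2.0, 4.0, 6.0]

# MapReduce-style: one job per iteration, each reloading the input.
w = 0.0
for _ in range(20):
    data = load_data()
    grad = sum(w - x for x in data) / len(data)  # d/dw of mean squared error
    w -= 0.5 * grad
mapreduce_loads = loads["count"]   # 20 loads for 20 iterations

# Spark-style: load once, cache in memory, iterate over the cached copy.
loads["count"] = 0
cached = load_data()
w = 0.0
for _ in range(20):
    grad = sum(w - x for x in cached) / len(cached)
    w -= 0.5 * grad
spark_loads = loads["count"]       # 1 load, regardless of iteration count

assert (mapreduce_loads, spark_loads) == (20, 1)
assert abs(w - 4.0) < 1e-3         # both converge to the mean of the data
```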
There has been an attempt to formalize this in the form of MapReduce design patterns for various use cases in the book Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer (Synthesis Lectures on Human Language Technologies).
Most of these techniques relied on Java-specific features in Hadoop (ThreadLocal, etc.), even though MapReduce in theory offered no shared-state communication model.
Spark is the next stage in the evolution of this. The fundamental thinking is that fine-grained mutable state is a very low-level abstraction and building block for ML algorithms; hence Spark was an attempt to raise the abstraction to coarse-grained immutable data called RDDs (Resilient Distributed Datasets).
Since HDFS never really supported concurrent appends from multiple writers anyway, it follows that RDDs give up little by being immutable, whereas you gain a lot by having both immutability and a higher level of abstraction to begin with for big data.
If communicating shared state was one problem, the other was that MapReduce was initially created for batch analytics, with only two operators, map and reduce. It was becoming very clear, however, that most interactive analytics queries required many chained map/reduce jobs to achieve their purpose.
Cascading was one way to approach this. Another was to create a high-level SQL-like language and compile it down to MapReduce jobs (Hive, Pig). However, since all these jobs made multiple passes over the data (each time loading from HDFS), they could not achieve the latencies expected of interactive analytics.
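To see why even a simple query needs several chained jobs, here is a toy plain-Python model of the map/shuffle/reduce cycle (an illustrative sketch, not Hive or Hadoop code). A "top words" query compiles to two jobs: one to count, one to sort, and in classic Hadoop the first job's output would go back to HDFS before the second could start.

```python
from collections import defaultdict

docs = ["spark is fast", "hadoop is storage", "spark is in memory"]

def mr_job(records, map_fn, reduce_fn):
    """One MapReduce job: map every record, shuffle by key, reduce each group."""
    groups = defaultdict(list)
    for rec in records:
        for k, v in map_fn(rec):
            groups[k].append(v)
    return [reduce_fn(k, vs) for k, vs in groups.items()]

# Job 1: word count.
counts = mr_job(
    docs,
    map_fn=lambda line: [(w, 1) for w in line.split()],
    reduce_fn=lambda w, ones: (w, sum(ones)),
)

# Job 2: global sort by count (route everything to one key so a single
# reducer sees all the counts).
top = mr_job(
    counts,
    map_fn=lambda wc: [(None, wc)],
    reduce_fn=lambda _, wcs: sorted(wcs, key=lambda wc: -wc[1]),
)[0]

assert top[0] == ("is", 3)
```

In real Hadoop each `mr_job` call is a separate cluster job with an HDFS round-trip in between, which is the latency Hive and Pig could not escape.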
The Hadoop ecosystem quickly realized that generating MapReduce jobs and running them sequentially was not the right approach for interactive analytics, and that there needed to be a way to operate directly on HDFS. Google Dremel, Cloudera Impala and others (I call it “Data Center SQL”: SQL with multi-level serving trees operating directly on HDFS) were one approach.
Another point to note is that main memory became much cheaper during this time.
Spark, however, took the approach that batch analytics plus in-memory RDDs could achieve the same latencies as the Impala/Dremel approaches. So, to be precise, Spark is a batch analytics system that can masquerade as an interactive analytics system because it operates on in-memory RDDs and because of the caching this makes possible.
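The "masquerade" works because only the first query pays the full scan. A minimal plain-Python sketch of that behavior (illustrative names, not Spark's API): the first action over a cached data set reads from storage, and every later query over the same data is a cache hit.

```python
# Count how many times the slow storage scan actually runs.
scans = {"count": 0}

def scan_table():  # illustrative stand-in for a full HDFS table scan
    scans["count"] += 1
    return [("a", 1), ("b", 2), ("a", 3)]

class CachedDataset:
    """Toy cached data set: scan storage once, serve later reads from memory."""
    def __init__(self, source):
        self.source = source
        self._cache = None

    def rows(self):
        if self._cache is None:    # only the first access touches storage
            self._cache = self.source()
        return self._cache

ds = CachedDataset(scan_table)
q1 = sum(v for k, v in ds.rows() if k == "a")  # first query: pays the scan
q2 = max(v for k, v in ds.rows())              # later query: cache hit

assert (q1, q2) == (4, 3)
assert scans["count"] == 1   # interactive-feeling latency after the first query
```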