1. One of the main limitations of MapReduce is that it persists the full dataset to HDFS after running each job. This is very expensive, because it incurs disk I/O of three times the size of the dataset (for replication) plus a similar amount of network I/O. Spark takes a more holistic view of a pipeline of operations. When the output of an operation needs to be fed into another operation, Spark passes the data directly without writing to persistent storage. This improvement over MapReduce came from Microsoft’s Dryad paper and is not original to Spark.
2. The main innovation of Spark was to introduce an in-memory caching abstraction. This makes Spark ideal for workloads where multiple operations access the same input data. Users can instruct Spark to cache input data sets in memory, so they don’t need to be read from disk for each operation.
3. What about Spark jobs that would boil down to a single MapReduce job? In many cases these also run faster on Spark than on MapReduce. The primary advantage Spark has here is that it can launch tasks much faster. MapReduce starts a new JVM for each task, which can take seconds once you count loading JARs, JIT compilation, parsing configuration XML, etc. Spark keeps an executor JVM running on each node, so launching a task is simply a matter of making an RPC to it and passing a Runnable to a thread pool, which takes single-digit milliseconds.
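To get a feel for the task-launch point, here is a rough analogy in plain Python rather than on the JVM. It is only a sketch: `work`, `run_in_new_process` and `timed` are names made up for this illustration, a fresh Python interpreter stands in for a fresh JVM, and a long-lived `ThreadPoolExecutor` stands in for a Spark executor. The absolute timings will differ from machine to machine and say nothing about real Spark or MapReduce numbers.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def work(x):
    """A trivially small task, so launch overhead dominates the timing."""
    return x * x


def run_in_new_process(x):
    """MapReduce-style launch: start a fresh interpreter for a single task."""
    out = subprocess.run(
        [sys.executable, "-c", f"print({x} * {x})"],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(out.stdout)


def timed(fn, *args):
    """Return (result, elapsed seconds) for one call of fn."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Spark-style launch: the pool is created once and stays alive between tasks.
pool = ThreadPoolExecutor(max_workers=2)
pool.submit(work, 0).result()  # warm up so a worker thread already exists

process_result, process_secs = timed(run_in_new_process, 7)
pool_result, pool_secs = timed(lambda: pool.submit(work, 7).result())
pool.shutdown()

print(f"new process: {process_result} in {process_secs * 1000:.2f} ms")
print(f"thread pool: {pool_result} in {pool_secs * 1000:.2f} ms")
```

Both paths compute the same answer; the only difference is how much setup happens before the task body runs, which is the effect described in point 3.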
Lastly, a common misconception probably worth mentioning is that Spark somehow runs entirely in memory while MapReduce does not. This is simply not the case. Spark’s shuffle implementation works very similarly to MapReduce’s: each record is serialized and written out to disk on the map side and then fetched and deserialized on the reduce side.
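The shuffle pattern described above can be sketched in a few lines of plain Python. This is a toy model, not Spark's or MapReduce's actual shuffle code: `map_side_write`, `reduce_side_read`, the `part-N.bin` file naming and the use of `pickle` and CRC32 partitioning are all assumptions made for the illustration. The point is only that records cross a disk boundary between the map side and the reduce side.

```python
import pickle
import tempfile
import zlib
from collections import defaultdict
from pathlib import Path


def partition_for(key, num_reducers):
    """Pick a reduce partition for a key with a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def map_side_write(records, num_reducers, shuffle_dir):
    """Map side: bucket records by partition, then serialize each bucket to disk."""
    buckets = defaultdict(list)
    for key, value in records:
        buckets[partition_for(key, num_reducers)].append((key, value))
    for partition, bucket in buckets.items():
        with open(Path(shuffle_dir) / f"part-{partition}.bin", "wb") as f:
            pickle.dump(bucket, f)


def reduce_side_read(partition, shuffle_dir):
    """Reduce side: fetch one partition file, deserialize it, and sum per key."""
    path = Path(shuffle_dir) / f"part-{partition}.bin"
    if not path.exists():  # no record hashed to this partition
        return {}
    with open(path, "rb") as f:
        records = pickle.load(f)
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value
    return dict(totals)


records = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
num_reducers = 2

with tempfile.TemporaryDirectory() as shuffle_dir:
    map_side_write(records, num_reducers, shuffle_dir)
    totals = {}
    for partition in range(num_reducers):
        totals.update(reduce_side_read(partition, shuffle_dir))

print(totals)
```

Every record is written to and read back from a file even though the whole dataset would easily fit in memory, which is why "Spark runs entirely in memory" is misleading.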
We can also consider the points below.
MapReduce (MR) was developed during the early and mid 2000s, when RAM wasn’t as cheap and most CPUs were 32-bit. Thus, it was designed to rely heavily on disk I/O. Spark, on the other hand (the RDD abstraction, to be exact), was built in an era of 64-bit computers that could address terabytes of RAM, which had become a lot cheaper. Thus, Spark is first and foremost an in-memory technology and hence a lot faster.
To see this in action you can run the same computation on equivalent nodes, one with MR and the other with Spark. You should see that with the MR job a lot of RAM isn’t being used at any one time, while in Spark the RAM utilisation is mostly maxed out.
One disadvantage of Spark is that, since data isn’t persisted to disk mid-computation, if many nodes fail at once the lost data has to be recomputed from the raw data. But generally even this should take less time than the equivalent MR job.
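The recompute-from-raw-data idea can be illustrated with a small toy model. This is not Spark's RDD implementation: the `LineageDataset` class and its `map`, `compute` and `lose` methods are hypothetical names invented for this sketch. The assumption is simply that each dataset remembers its raw input partitions and the chain of transformations applied to them, so a lost in-memory partition can be rebuilt by replaying that chain.

```python
class LineageDataset:
    """Toy dataset that keeps raw partitions plus the transforms applied to them."""

    def __init__(self, raw_partitions, transforms=()):
        self.raw_partitions = raw_partitions  # stands in for data in durable storage
        self.transforms = tuple(transforms)   # the lineage: functions applied in order
        self.materialized = {}                # computed partitions held in memory

    def map(self, fn):
        """Record one more transform without computing anything yet."""
        return LineageDataset(self.raw_partitions, self.transforms + (fn,))

    def compute(self, index):
        """Return partition `index`, replaying the lineage if it is not in memory."""
        if index not in self.materialized:
            rows = self.raw_partitions[index]
            for fn in self.transforms:
                rows = [fn(row) for row in rows]
            self.materialized[index] = rows
        return self.materialized[index]

    def lose(self, index):
        """Simulate a node failure that wipes one in-memory partition."""
        self.materialized.pop(index, None)


raw = [[1, 2], [3, 4]]
dataset = LineageDataset(raw).map(lambda x: x + 1).map(lambda x: x * 10)

before = dataset.compute(1)  # computed from the raw data, then kept in memory
dataset.lose(1)              # the in-memory copy disappears
after = dataset.compute(1)   # rebuilt by replaying both transforms

print(before, after)
```

Nothing intermediate was ever written to disk here, so recovery costs a full replay of the lineage for the lost partition. That is the trade-off the paragraph above describes.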