Apache Spark is a fast and general engine for large-scale data processing.
Although MapReduce is great for large-scale data processing, it is not friendly to iterative algorithms or interactive analytics, because the data must be reloaded for each iteration, or be materialized and replicated on the distributed file system between successive jobs. Apache Spark is designed to solve this problem by means of in-memory computing. The overall framework and parallel computing model of Spark are similar to MapReduce's, but with an important innovation: the resilient distributed dataset (RDD).
An RDD is a read-only collection of objects partitioned across a cluster of computers. But the semantics of RDDs go well beyond simple parallelization:
- The elements of an RDD don’t have to exist in physical memory. In this sense, an element of an RDD is an expression rather than a value: the value can be computed by evaluating the expression when necessary.
- Lazy and Ephemeral
- One can construct an RDD from a file, or by transforming an existing RDD with operations such as map(), filter(), cartesian(), etc. However, no real data loading or computation happens at construction time. Instead, RDDs are materialized on demand when they are used in some operation, and are discarded from memory after use.
- Caching and Persistence
- We can cache a dataset in memory across operations, which allows future actions to be much faster. Caching is a key tool for iterative algorithms and fast interactive use. Caching is actually a special case of persistence, which allows different storage levels, e.g. persisting the dataset on disk, persisting it in memory as serialized Java objects (to save space), replicating it across nodes, or storing it off-heap in Tachyon. These levels are set by passing a StorageLevel object to persist(); the cache() method is a shorthand for the default storage level, StorageLevel.MEMORY_ONLY (store deserialized objects in memory).
- Fault Tolerant
- If any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
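The lazy-evaluation and lineage ideas above can be condensed into a few lines of Python. This is a toy model for illustration only, not Spark's actual implementation; the class and method names are made up:

```python
# A minimal sketch of two RDD ideas: partitions are computed lazily from a
# recorded lineage of transformations, so a "lost" partition can always be
# rebuilt by re-running the transformations that originally created it.

class SketchRDD:
    def __init__(self, compute, parent=None):
        self.compute = compute   # function: partition data -> partition data
        self.parent = parent     # lineage pointer to the parent RDD

    def map(self, f):
        # Record the transformation; nothing is computed yet (laziness).
        return SketchRDD(lambda part: [f(x) for x in part], parent=self)

    def materialize(self, source_partition):
        # Walk the lineage back to the source, then apply each step in order.
        chain = []
        node = self
        while node is not None:
            chain.append(node.compute)
            node = node.parent
        data = source_partition
        for step in reversed(chain):
            data = step(data)
        return data

source = SketchRDD(lambda part: part)             # identity "load" step
doubled = source.map(lambda x: 2 * x).map(lambda x: x + 1)

# "Losing" a partition is harmless: materialize() just recomputes it
# from the lineage and the original source data.
print(doubled.materialize([1, 2, 3]))             # prints [3, 5, 7]
```

Note that constructing `doubled` touches no data at all; the work happens only when a partition is materialized, and the same lineage can be replayed as many times as needed.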
The operations on RDDs take user-defined functions, which are closures in the functional-programming sense, as Spark is implemented in Scala. A closure can refer to variables in the scope where it was created, and these variables are copied to the workers when Spark runs the closure. Spark optimizes this process with shared variables for a couple of cases:
- Broadcast variables
- If a large piece of read-only data is used in multiple operations, it is better to copy it to each worker only once. This can be achieved with broadcast variables, which are created from a variable v by calling SparkContext.broadcast(v); the broadcast value can then be accessed through its value method.
- Accumulators
- Accumulators are variables that are only “added” to through an associative operation and can therefore be efficiently supported in parallel. They can be used to implement counters or sums. Only the driver program can read the accumulator’s value. Spark natively supports accumulators of numeric types.
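The contract of both kinds of shared variables can be mimicked in a small Python sketch. This is a toy model, not Spark's API; the lookup table and the process function are made-up examples:

```python
# A minimal sketch of the two shared-variable kinds: a broadcast value is
# shipped to each worker once and then only read, while an accumulator only
# supports an associative "add" from tasks, with the total read by the driver.

class Broadcast:
    def __init__(self, value):
        self.value = value        # read-only payload, sent to workers once

class Accumulator:
    def __init__(self, zero):
        self._value = zero
    def add(self, term):          # tasks may only add, never read
        self._value += term
    @property
    def value(self):              # only the "driver" reads the total
        return self._value

lookup = Broadcast({"a": 1, "b": 2})  # stands in for a large read-only table
errors = Accumulator(0)               # counter for malformed records

def process(record):
    if record not in lookup.value:    # every task reads the same broadcast copy
        errors.add(1)                 # count bad records via the accumulator
        return 0
    return lookup.value[record]

results = [process(r) for r in ["a", "b", "x", "a"]]
print(results, errors.value)          # prints [1, 2, 0, 1] 1
```

The asymmetry is the point: workers never mutate the broadcast value and never read the accumulator, which is what makes both safe and cheap to support in parallel.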
By reusing cached data in RDDs, Spark offers a great performance improvement over MapReduce (10x ~ 100x faster), which makes it very suitable for iterative machine learning algorithms. The built-in MLlib implements common machine learning algorithms including linear SVM, logistic regression, linear, Lasso, and ridge regression, decision trees, naive Bayes, collaborative filtering, k-means, SVD, and PCA. There are also efforts to enable declarative queries in SQL or HiveQL through Spark SQL, and GraphX was recently introduced for parallel graph computation. Both Spark SQL and GraphX extend the Spark RDD to support their computations. We will discuss these features in other posts.
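Why caching pays off for iterative algorithms can be seen with a counting toy in Python. This is an illustration, not a benchmark; load_data and the gradient step are made up:

```python
# Count how often the expensive "load" step runs in an iterative loop,
# with and without a cache. Without caching, every pass reloads the data
# (MapReduce-style); with caching, it is loaded once and reused in memory.

loads = {"count": 0}

def load_data():                  # stands in for reading from HDFS etc.
    loads["count"] += 1
    return [1.0, 2.0, 3.0, 4.0]

def gradient_pass(data, w):       # one step of a toy 1-D least-squares fit
    grad = sum(2 * (w * x - x) * x for x in data) / len(data)
    return w - 0.05 * grad

# Uncached: reload the data on every iteration.
w = 0.0
for _ in range(5):
    w = gradient_pass(load_data(), w)
uncached_loads = loads["count"]   # 5 loads for 5 iterations

# Cached: load once, then iterate over the in-memory copy (cache()-style).
loads["count"] = 0
cached = load_data()
w = 0.0
for _ in range(5):
    w = gradient_pass(cached, w)
cached_loads = loads["count"]     # 1 load for 5 iterations

print(uncached_loads, cached_loads)   # prints 5 1
```

With N iterations the uncached loop pays N load costs while the cached loop pays one, which is exactly the access pattern that dominates iterative machine learning workloads.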
On the other hand, Spark is independent of the underlying storage system, as MapReduce is. It is the application developers’ duty to organize data, e.g. by building and using indexes, and by partitioning and collocating related data sets. These tasks are critical for interactive analytics: caching alone is neither sufficient nor effective for extremely large data sets.