Apache Spark

  • Parallel Processing framawork

  • Primerily address machine learning limitations of Hadoop

  • Developed in University of Berkely in 2014

  • Developed using SCALA

Scala

  • High Level Programming Language

  • Support both functional and Object Oriented Programing

Functional Programming

int a = 10;
a++;
print(a)

Object a will be created and value of a is saved as 10. State of original variable is changed with a++.

int a = 10;
int a1 = a+1;
print(a)

As we are storing increamented value in a1 , state of original variable a is preserved .

When function is called with variable , original variables are preserved and new variables are created. This help in avoiding race condition when multi-threaded operations are executed as state of object does not change.

This helps to enable parallelism

Example of Functional Programming languages - LISP , SCALA,Haskel

Spark vs Hadoop

  • Spark has similar architecture as Hadoop including master-slave and master node- data node.

  • Spark outperforms Hadoop with 3X-10X improvements

  • Spark is popular for ML workload due to Iterative Workload Support , Hadoop is not suitable for iterative workload due to Disk-IO operation

  • Spark also relay on Disk-IO operation however it is controllable programittically.

Last updated