Apache Spark is an open-source analytics engine for big data workloads. It handles both batch and real-time analytics and data processing. Spark provides native bindings for the Java, Scala, Python, and R programming languages. It also includes libraries for building applications for machine learning (MLlib), stream processing (Spark Streaming), and graph processing (GraphX).
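- Performance: often cited as 10 to 100 times faster than Hadoop MapReduce, largely because Spark keeps intermediate data in memory.
- Ease of development: Spark SQL (a high-performance SQL engine) and the DataFrame API.
- Language support: Java, Scala, Python, R.
- Storage: HDFS, cloud storage.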
- Resource Management: YARN, Mesos, Kubernetes.
- Deployment: with Hadoop (data lake) or without Hadoop (lakehouse in the cloud).
- Spark DataFrame and API
- Spark Database and SQL
