Apache Spark: Overview

Many different tools are used to work with big data. Even for the same task there are often several technologies, each with its own strengths and drawbacks, and it can be hard to navigate this variety and choose the right one.

To help with this, we will introduce one of these tools - Apache Spark. You will learn what it is, how it is used when working with big data, and what problems it solves. We will also compare it with a similar technology - Hadoop MapReduce.

What is Apache Spark

Apache Spark is a platform for cluster computing and large-scale data processing in the big data field. Spark keeps data in RAM and rarely touches disk, which makes it very fast.

Apache Spark is fully compatible with the Hadoop ecosystem and integrates easily into existing solutions. It does not have its own data store and can work with various sources: HDFS, Hive, S3, HBase, Cassandra, and others. It supports several programming languages: Scala, Python, Java, R, and SQL.
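As a rough illustration, here is a minimal Scala sketch of reading data from one of these sources and cleaning it up. The file path, column names, and cleaning steps are invented for this example and are not tied to any particular dataset.

// A minimal sketch; the path and column names are placeholders.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("spark-overview-example")
  .getOrCreate()

// Read a CSV file from HDFS (an S3 path such as "s3a://bucket/users.csv" works the same way).
val users = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///data/users.csv")

// Typical cleaning steps: drop duplicates, filter out bad rows, sort the result.
val cleaned = users
  .dropDuplicates("id")
  .filter(col("age").isNotNull && col("age") > 0)
  .orderBy(col("signup_date"))

cleaned.show(10)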

Spark is used for data-processing tasks such as filtering, sorting, cleaning, and validation. Its place in the big data workflow is shown in the diagram below:

Spark consists of several components.

Apache Spark Core

This is the execution engine that underpins the entire platform. The core interacts with storage systems, manages memory, and schedules and balances the load across the cluster. It also provides the APIs for the supported programming languages.
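As a small illustration of the low-level RDD API that Spark Core exposes, here is a Scala sketch; the data and the number of partitions are arbitrary examples.

// Distribute a local collection across the cluster and process it in parallel.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("core-example").getOrCreate()
val sc = spark.sparkContext

val numbers = sc.parallelize(1 to 1000000, numSlices = 8)
val sumOfSquares = numbers.map(n => n.toLong * n).reduce(_ + _)

println(s"Sum of squares: $sumOfSquares")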

Spark SQL

This module simplifies working with structured data and lets you run queries in SQL. Its main purpose is to let data engineers focus on their use cases instead of thinking about the distributed nature of the underlying storage.
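A minimal sketch of what this looks like in Scala; the table name, columns, and rows are invented for the example, standing in for data loaded from Hive, HDFS, or another source.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-example").getOrCreate()
import spark.implicits._

// A small in-memory DataFrame used as example data.
val orders = Seq(
  ("alice", "books", 12.50),
  ("bob",   "games", 59.99),
  ("alice", "games", 19.99)
).toDF("customer", "category", "amount")

// Register it as a temporary view and query it with plain SQL.
orders.createOrReplaceTempView("orders")
spark.sql(
  """SELECT customer, SUM(amount) AS total
    |FROM orders
    |GROUP BY customer
    |ORDER BY total DESC""".stripMargin
).show()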

Streaming

Provides scalable, high-performance, and fault-tolerant processing of real-time data streams. Kafka, Flume, Kinesis, and other systems can act as data sources for Spark.
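Here is a minimal Structured Streaming sketch that reads from Kafka and prints records to the console. The broker address and topic name are placeholders, and it assumes the Kafka connector (spark-sql-kafka-0-10) is on the classpath.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("streaming-example").getOrCreate()

// Subscribe to a Kafka topic; each record arrives as a binary key/value pair.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS value")

// Write the decoded stream to the console; in production this would be a
// sink such as a data lake table or another Kafka topic.
val query = events.writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()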

MLlib

This is Spark's scalable machine learning library. It implements common machine learning algorithms such as clustering, regression, classification, and collaborative filtering.
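A rough sketch using k-means from the DataFrame-based spark.ml API; the data points are made up for illustration.

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("mllib-example").getOrCreate()
import spark.implicits._

// A tiny, made-up dataset with two obvious clusters.
val points = Seq((0.0, 0.1), (0.2, 0.0), (9.0, 9.1), (9.2, 8.9))
  .toDF("x", "y")

// MLlib models expect a single vector column of features.
val features = new VectorAssembler()
  .setInputCols(Array("x", "y"))
  .setOutputCol("features")
  .transform(points)

// Fit a k-means model with two clusters and label each point.
val model = new KMeans().setK(2).setSeed(42L).fit(features)
model.transform(features).select("x", "y", "prediction").show()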

GraphX

Serves for manipulating graphs and processing them in parallel. GraphX can compute metrics such as graph connectivity, degree distribution, and average path length. It can also join graphs and transform them quickly. In addition to the built-in graph operations, GraphX ships a library of graph algorithms that includes PageRank.
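A small sketch: a tiny hand-built graph and its PageRank scores. The vertices, edges, and tolerance value are made up for the example.

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("graphx-example").getOrCreate()
val sc = spark.sparkContext

// Vertices are (id, attribute) pairs; edges connect vertex ids.
val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"),
  Edge(2L, 3L, "follows"),
  Edge(3L, 1L, "follows")
))

val graph = Graph(vertices, edges)

// Run PageRank until the scores converge within the given tolerance.
val ranks = graph.pageRank(0.001).vertices
ranks.collect().foreach { case (id, rank) => println(s"$id -> $rank") }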

Comparison of Apache Spark and Hadoop MapReduce

Previously, big data was mostly processed with MapReduce, the Hadoop component that pioneered large-scale data processing. But it has two major problems: it writes intermediate results to disk, which slows processing down, and it works only in batch mode, so it is poorly suited to real-time processing.

In 2014 another tool, Spark, began to gain popularity, and by now it has largely supplanted MapReduce. Apache Spark was designed to address the shortcomings of MapReduce while keeping its benefits: it processes data in memory instead of writing intermediate results to disk, and its Streaming module handles real-time data in addition to batch workloads.

To sum up the comparison briefly: Hadoop MapReduce is an aging technology, while Apache Spark has practically become the standard in the big data field.

How clouds can power Apache Spark

Horizontal scalability is very important for big data processing: the more nodes in the cluster, the faster the data is processed, and Spark is no exception. Building data-processing clusters on your own hardware is difficult and often unprofitable. It is hard to guess in advance how many servers you will need and how powerful they should be: buy too many and they sit idle, buy too few and processing takes too long. This is why cloud technologies are worth considering.

Cloud platforms can provide a huge amount of resources on demand. Under a light load you can run a few small nodes; when the load grows, you can quickly add new nodes to the cluster. This lets you use resources much more efficiently.
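On the Spark side, this elasticity is supported by dynamic resource allocation, which requests and releases executors as the load changes. A minimal configuration sketch in Scala; the executor limits are arbitrary examples, not recommendations.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("elastic-cluster-example")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "2")
  .config("spark.dynamicAllocation.maxExecutors", "50")
  // Dynamic allocation needs either the external shuffle service or
  // shuffle tracking so executors can be released safely.
  .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
  .getOrCreate()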

Conclusion