Spark pt1
Category : Datascience Tag : gds bigdata
June 6, 2021, midnight

Short Description :
Spark part 1: Spark basics
source : datak

<h2>About</h2><p>This series explains Spark, from basic concepts and infrastructure to basic syntax.</p><ol><li>Spark basics</li><li>Infrastructure and Databricks</li><li>pySpark syntax</li></ol><p><br></p><h3>What is Apache Spark?</h3><p><a href="https://spark.apache.org/" target="_blank">Apache Spark</a>&nbsp;is an open-source framework designed for parallelized, distributed processing of big data workloads. Spark offers APIs in multiple languages, including Java, Scala, and Python. Its syntax is simple, keeping the complex parallelized, distributed machinery at an abstract level.<img src="/media/django-summernote/2021-06-06/87d84038-493d-4309-b241-1cd38a6425bf.png" style="width: 814px;"></p>
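<p>As a small illustration of that abstraction, here is a minimal pySpark sketch; the input file <code>events.csv</code> and its <code>country</code> column are hypothetical, and the same few lines run unchanged on a laptop or on a large cluster.</p><pre>
from pyspark.sql import SparkSession

# Build (or reuse) the single entry point to Spark.
spark = SparkSession.builder.appName("intro-sketch").getOrCreate()

# "events.csv" is a hypothetical input file; Spark distributes the
# read and the aggregation across whatever workers are available.
df = spark.read.csv("events.csv", header=True, inferSchema=True)
df.groupBy("country").count().show()
</pre>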
<h4>Spark Components</h4><blockquote><ol><li><b>Apache Spark Core</b>: Spark Core is the underlying general execution engine for the Spark platform; all other functionality is built on top of it. It provides in-memory computing and the ability to reference datasets in external storage systems.</li><li><b>Spark SQL</b>: Spark SQL is Apache Spark's module for working with structured data. The interfaces offered by Spark SQL give Spark more information about the structure of both the data and the computation being performed.</li><li><b>Spark Streaming</b>: This component allows Spark to process real-time streaming data. Data can be ingested from many sources such as Kafka, Flume, and HDFS (Hadoop Distributed File System), processed using complex algorithms, and pushed out to file systems, databases, and live dashboards.</li><li><b>MLlib (Machine Learning Library)</b>: Apache Spark is equipped with a rich library known as MLlib. It contains a wide array of machine learning algorithms (classification, regression, clustering, and collaborative filtering) as well as tools for constructing, evaluating, and tuning ML pipelines; see the sketch after this list. All of these functionalities help Spark scale out across a cluster.</li><li><b>GraphX</b>: Spark also comes with GraphX, a library for manipulating graph data and performing graph computations. GraphX unifies the ETL (Extract, Transform, and Load) process, exploratory analysis, and iterative graph computation within a single system.</li></ol></blockquote><p>Resource:&nbsp;<a href="https://chartio.com/learn/data-analytics/what-is-spark/" target="_blank">What is Spark?</a></p>
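<p>A hedged MLlib sketch: the toy data and the column names <code>f1</code>, <code>f2</code>, and <code>label</code> are assumptions for illustration. It chains feature assembly and a logistic regression into one ML pipeline that can be fitted, evaluated, and tuned as a unit.</p><pre>
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Toy training data with hypothetical feature columns f1, f2 and a label.
train = spark.createDataFrame(
    [(0.0, 1.1, 0), (2.0, 1.0, 1), (2.1, 1.2, 1), (0.1, 0.9, 0)],
    ["f1", "f2", "label"],
)

# A Pipeline chains the feature step and the estimator into one workflow.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(train)

model.transform(train).select("label", "prediction").show()
</pre>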
<p><br></p><h3>Objective of Spark and differences from Hadoop</h3><p>Hadoop is a well-known framework for distributed processing, but it has a few constraints when it is used for machine learning:</p><ol><li>It is not designed to use the memory of each distributed machine efficiently.</li><li>It requires storage access for iterative processing.</li><li>It requires storage access every time the same data is reused.</li></ol><p>Iterative passes over the same data are exactly what machine learning needs, so these constraints become obvious when Hadoop is used for machine learning.</p><p>Spark is designed to solve these issues. It is built around a distributed in-memory abstraction called RDD (Resilient Distributed Datasets), which splits data into partitions held in memory on each machine. Spark can therefore avoid repeated storage access and process data in memory, as the sketch below illustrates.</p>
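<p>A minimal sketch of that in-memory reuse (the numbers are arbitrary): the cached RDD is computed once and then served from executor memory on the second pass instead of being recomputed or re-read from storage.</p><pre>
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-sketch").getOrCreate()
sc = spark.sparkContext

# parallelize() splits the collection into partitions across the cluster.
rdd = sc.parallelize(range(1000000), 8)

# cache() keeps the computed partitions in memory on the executors.
squares = rdd.map(lambda x: x * x).cache()

total = squares.sum()    # first action: computes and caches the partitions
count = squares.count()  # second action: reuses the in-memory partitions
print(total, count)
</pre>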
<p><br></p><h3>Spark Use Cases</h3><h4>1. If the project requires a huge amount of data</h4><p>Spark can handle petabytes of data as well as terabytes. If the data cannot be processed by one machine, that is when Spark shines.</p><h4>2. If the project requires real-time processing</h4><p>One week of processing time on one laptop can be replaced with one hour of Spark processing. If that reduction in processing time is a big deal for the business, that is when Spark shines.</p><h4>3. If the project requires machine learning on a huge dataset</h4><p>For machine learning on a small-to-medium dataset, a single computer is usually enough. If machine learning uses a huge dataset and its analysis demands large resources, Spark gives us an advantage.</p><h4>4. If the project already uses Hadoop</h4><p>Spark can run on a Hadoop YARN (Yet Another Resource Negotiator) cluster, so Spark can coexist with Hadoop; a sketch follows below.</p>
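<p>A hedged sketch of point 4: assuming Spark is installed on the Hadoop cluster and <code>HADOOP_CONF_DIR</code> (or <code>YARN_CONF_DIR</code>) points at the cluster configuration, only the master setting changes; the rest of the application stays the same.</p><pre>
from pyspark.sql import SparkSession

# Ask YARN for executors instead of running locally. Assumes the
# environment is configured so Spark can find the YARN cluster.
spark = (
    SparkSession.builder
    .appName("yarn-sketch")
    .master("yarn")
    .getOrCreate()
)

print(spark.sparkContext.master)  # prints "yarn" on the cluster
</pre>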