Apache Spark Tutorial for free @mindmajix


Aarusha

Dec 27, 2016, 2:05:50 AM
to Common Crawl

APACHE SPARK TUTORIAL

This tutorial gives an overview of Apache Spark and introduces its fundamental components.

  • The Spark project consists of multiple components: Spark Core and Resilient Distributed Datasets (RDDs), Spark SQL, Spark Streaming, MLlib (the machine learning library), and GraphX.
  • Spark Core and Resilient Distributed Datasets (RDDs): Spark Core is the foundation of the overall project. It provides distributed task dispatching, scheduling, and basic I/O functionality. The fundamental programming abstraction is the Resilient Distributed Dataset (RDD), a logical collection of data partitioned across machines. RDDs can be created by referencing datasets in external storage systems, or by applying coarse-grained transformations (e.g., map, filter, reduce, join) to existing RDDs. The RDD abstraction is exposed through a language-integrated API in Java, Python, and Scala, similar to local, in-process collections. This reduces programming complexity because applications manipulate RDDs much the same way they would manipulate local collections of data (a minimal Scala sketch follows this list).
  • Spark SQL: Spark SQL is a component on top of Spark Core that introduces a data abstraction called SchemaRDD (renamed DataFrame in Spark 1.3), which provides support for structured and semi-structured data. Spark SQL provides a domain-specific language for manipulating this abstraction in Scala, Java, or Python. It also provides SQL language support, with command-line interfaces and an ODBC/JDBC server (see the SQL sketch after this list).
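To make the RDD model concrete, here is a minimal, self-contained Scala sketch. It runs Spark locally; the application name and the sample data are illustrative and not taken from the tutorial above.

    import org.apache.spark.{SparkConf, SparkContext}

    object RddExample {
      def main(args: Array[String]): Unit = {
        // Run locally with all available cores; a cluster master URL works the same way.
        val conf = new SparkConf().setAppName("rdd-example").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // Build an RDD from a local collection; parallelize partitions it across workers.
        val numbers = sc.parallelize(1 to 100)

        // Transformations such as filter and map are lazy: they only record lineage.
        val evenSquares = numbers.filter(_ % 2 == 0).map(n => n * n)

        // reduce is an action: it triggers execution and returns a value to the driver.
        val total = evenSquares.reduce(_ + _)
        println(s"Sum of squares of even numbers in 1..100: $total")

        sc.stop()
      }
    }

Note that nothing is computed until reduce runs; the filter and map calls only extend the lineage graph that Spark uses to recompute lost partitions.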
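And a corresponding Spark SQL sketch. It uses the SparkSession entry point and the DataFrame API (the successor to SchemaRDD); the view name and sample rows are made up for illustration.

    import org.apache.spark.sql.SparkSession

    object SparkSqlExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("spark-sql-example")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // A small structured dataset built in-process; real jobs would read JSON, Parquet, etc.
        val people = Seq(("Alice", 34), ("Bob", 28), ("Carol", 45)).toDF("name", "age")

        // Register the DataFrame as a temporary view so it can be queried with plain SQL.
        people.createOrReplaceTempView("people")
        spark.sql("SELECT name FROM people WHERE age > 30").show()

        // The same query expressed through the domain-specific language.
        people.filter($"age" > 30).select("name").show()

        spark.stop()
      }
    }

Both the SQL string and the DSL form compile to the same query plan, so the choice between them is a matter of style.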
For more information, click the link below.