Spark with Python Tickets by Walsoul Pvt Lt, Monday, December 02, 2019, Bengaluru Event

Event Information

Introduction to Big Data and Distributed Computing :

Big data analysis is the future. This section of the course will help you understand the need for distributed computation.

Introduction to data.

Data Science: a vision.

Big data Introduction.

Parallel computation.

Problem with parallel computation.

Traditional parallel computation systems.

Hadoop :

Introduction to Hadoop.

Hadoop Components.

HDFS and its architecture.

HDFS Commands

◦ mkdir

◦ ls

◦ rmdir and rm

◦ copyFromLocal

◦ put

◦ cat

◦ copyToLocal

◦ get

◦ touchz

◦ mv

◦ cp

◦ distcp

◦ etc.

fsimage and edits log files.

Hadoop property files.

Introduction to MapReduce.

Shortcomings of MapReduce.

Python : Refresher

Introduction to Python

Jupyter

Python variables and data types.

Operators in Python.

Introduction to interactive mode and script-based programming

Python Collections (Lists, Dictionaries, etc.)

Control Flow and looping in Python

Functions in Python (declaration, definition, types and calling)

Object-oriented Python.

NumPy

Spark Introduction :

Introduction to Spark.

Spark and Hadoop (Similarity and Differences)

Spark Execution (Master-Slave System, Driver, Cluster Manager and Executors)

Spark Shell

Resilient Distributed Dataset (RDD)

Operations On RDD :

Creation of RDD

Transformation and Action Introduction

Lazy evaluation

Some Important Transformations :

filter

map

flatMap

distinct

sample

union

intersection

subtract

cartesian

Some Important Actions :

first

take

top

reduce

fold

aggregate

foreach

count

collect

Creation of Pair RDD

Some Important Transformations on Pair RDD :

combineByKey

mapValues

groupByKey

reduceByKey

sortByKey

subtractByKey

Joins and their types

cogroup

Some Important Actions on Pair RDD :

lookup

collectAsMap

countByKey

Hands-on practice with all the functions
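The difference between transformations, actions and lazy evaluation listed in this section can be previewed without a cluster. The sketch below is a plain-Python analogy, not the Spark API: the built-in map and filter stand in for transformations, functools.reduce stands in for an action, and the names square and calls are hypothetical helpers used only to record when work actually happens.

```python
from functools import reduce

calls = []  # records every element the map step actually touches


def square(x):
    """Square a number and note that it was processed."""
    calls.append(x)
    return x * x


numbers = range(1, 6)

# "Transformations": map and filter only build lazy iterators; nothing runs yet.
squared = map(square, numbers)
even_squares = filter(lambda v: v % 2 == 0, squared)
touched_before_action = len(calls)

# "Action": reduce consumes the pipeline and forces the whole computation.
total = reduce(lambda a, b: a + b, even_squares)
touched_after_action = len(calls)

print(touched_before_action, touched_after_action, total)
```

No element is processed until the reducing step asks for a result, which is the same idea behind Spark deferring work until an action is called.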

Fault tolerance and Persistence :

RDD lineage

persistence

Benefit of persistence

Optimizing Spark program

Introduction to partitioning

Inbuilt partitioners (Hash and Range)

Benefits of partitioning

groupByKey and reduceByKey comparison

Spark broadcasting and accumulators
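The groupByKey and reduceByKey comparison in this section comes down to how many records cross the network during the shuffle. The sketch below is a plain-Python simulation under simple assumptions, not Spark code: the two lists in partitions stand in for two partitions of a pair RDD, and the function names group_then_sum and combine_then_sum are hypothetical.

```python
from collections import defaultdict

# Two simulated partitions of (key, value) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]


def group_then_sum(parts):
    """groupByKey style: every record is shuffled, then values are summed."""
    shuffled = [pair for part in parts for pair in part]
    groups = defaultdict(list)
    for key, value in shuffled:
        groups[key].append(value)
    return {k: sum(vs) for k, vs in groups.items()}, len(shuffled)


def combine_then_sum(parts):
    """reduceByKey style: each partition pre-combines before the shuffle."""
    shuffled = []
    for part in parts:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        shuffled.extend(local.items())  # one record per key per partition
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


grouped_result, grouped_shuffled = group_then_sum(partitions)
reduced_result, reduced_shuffled = combine_then_sum(partitions)
print(grouped_result, grouped_shuffled)
print(reduced_result, reduced_shuffled)
```

Both approaches give the same totals, but the pre-combining version moves fewer records, which is why reduceByKey is usually preferred for aggregations.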

IO in Spark :

Text File

CSV File

JSON

Data from HDFS

Spark Streaming :

Introduction to Spark Streaming

Transformation

Reading from HDFS

Window Concept

Push-based and pull-based receivers

Kafka integration with Streaming.

Performance

SparkSQL :

Introduction to SparkSQL

SparkSQL data types

DataFrame: an introduction.

Creation of a dataframe.

Summary statistics on DataFrame.

Aggregation on Given Data.

Data joining.

SparkSQL and SQL

Introduction to Hive.

Using data from Hive and HiveQL.

Optimizing SparkSQL code.

Spark Code Deployment and Cluster Managers :

Submitting Spark code in local mode

Submitting Spark code on Standalone cluster manager.

Submitting Spark code on YARN

Submitting Spark code on Mesos

Note : Every part of the course includes hands-on exercises. A number of objective questions will help you test your understanding.

Projects :

Project 1 : Spark core can be used for data preparation and aggregation. Aggregation will be implemented using Spark core APIs.

The MovieLens data will be used for data aggregation.

Project 2 : Implementing word frequency visualization for streaming data using Kafka and Spark Streaming integration.

Project 3 : Implementation of a moving average using SparkSQL.

Project 4 : Data preprocessing, data manipulation and aggregation using SparkSQL. It will be done using real-time data.
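For Project 3, it may help to see the arithmetic of a moving average before expressing it in SparkSQL. The sketch below is plain Python rather than SparkSQL, and it assumes a trailing window that averages whatever values are available until the window fills; the function name moving_average, the window size and the sample values are all hypothetical.

```python
from collections import deque


def moving_average(values, window):
    """Return the trailing moving average of values over the given window."""
    if window <= 0:
        raise ValueError("window must be a positive integer")
    recent = deque(maxlen=window)  # holds at most `window` latest values
    running_sum = 0.0
    averages = []
    for value in values:
        if len(recent) == window:
            running_sum -= recent[0]  # this value is about to leave the window
        recent.append(value)
        running_sum += value
        averages.append(running_sum / len(recent))
    return averages


sample = [10, 20, 30, 40]
result = moving_average(sample, 3)
print(result)
```

Keeping a running sum avoids re-adding the whole window for every row, which is the same saving a windowed aggregation aims for on large data.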

Venue

BTM 2nd Stage

773, 3rd Floor, 7th Cross, 16th Main, Bengaluru, India

Walsoul Pvt Lt

Joined on Apr 10, 2019