Introduction to Big Data and Distributed Computing :
Big data analysis is the future. This section of the course will help you understand the need for distributed computation.
Introduction to data.
Data Science: a vision.
Introduction to big data.
Parallel computation.
Problems with parallel computation.
Traditional parallel computation systems.
Hadoop :
Introduction to Hadoop.
Hadoop components.
HDFS and its architecture.
HDFS commands
  ◦ mkdir
  ◦ ls
  ◦ rmdir and rm
  ◦ copyFromLocal
  ◦ put
  ◦ cat
  ◦ copyToLocal
  ◦ get
  ◦ touchz
  ◦ mv
  ◦ cp
  ◦ distcp
  ◦ etc.
fsimage and edits log files.
Hadoop property files.
Introduction to MapReduce.
Shortcomings of MapReduce.
Scala :
Introduction to Scala.
Scala variables.
Operators in Scala.
Introduction to interactive mode and script-based programming.
Scala data types and operations on them.
Scala collections (Tuple, Map, etc.).
Control flow and looping in Scala.
Functions in Scala (declaration, definition, types and calling).
Object-oriented Scala.
Introduction to functional programming in Scala.
Pattern matching: an introduction.
Spark Introduction :
Introduction to Spark.
Spark and Hadoop (similarities and differences).
Spark execution (master-slave system, driver, cluster manager and executors).
Spark shell.
Resilient Distributed Dataset (RDD).
Operations On RDD :
Creation of RDDs.
Introduction to transformations and actions.
Lazy evaluation.
Some important transformations : filter, map, flatMap, distinct, sample, union, intersection, subtract, cartesian.
Some important actions : first, take, top, reduce, fold, aggregate, foreach, count, collect.
Creation of pair RDDs.
Some important transformations on pair RDDs : combineByKey, mapValues, groupByKey, reduceByKey, sortByKey, subtractByKey.
Joins and their types.
cogroup.
Some important actions on pair RDDs : lookup, collectAsMap, countByKey.
Hands-on practice with all the functions.
Fault tolerance and Persistence :
RDD lineage.
Persistence.
Benefits of persistence.
Optimizing Spark programs :
Introduction to partitioning.
Built-in partitioners (hash and range).
Benefits of partitioning.
Comparison of groupByKey and reduceByKey.
Spark broadcast variables and accumulators.
IO in Spark :
Text files.
CSV files.
JSON.
Data from HDFS.
Spark Streaming :
Introduction to Spark Streaming.
Transformations.
Reading from HDFS.
The window concept.
Push-based and pull-based receivers.
Kafka integration with Spark Streaming.
Performance.
SparkSQL :
Introduction to SparkSQL.
SparkSQL data types.
DataFrame: an introduction.
Creation of a DataFrame.
Summary statistics on a DataFrame.
Aggregation on given data.
SparkSQL and SQL.
Introduction to Hive.
Using data from Hive and HiveQL.
Optimizing SparkSQL code.
Spark Code Deployment and cluster managers :
Submitting Spark code on the Standalone cluster manager.
Submitting Spark code on YARN.
Submitting Spark code on Mesos.
Note : Every part of the course will include hands-on work. A number of objective questions will help you exercise your brain.
Projects :
Project 1 : Data preparation and aggregation with Spark core. Aggregation will be implemented using the Spark core APIs.
MovieLens data will be used for the aggregation.
Project 2 : Visualizing word frequencies in streaming data using Kafka and Spark Streaming integration.
Project 3 : Implementing a moving average using SparkSQL.
Project 4 : Data preprocessing, data manipulation and aggregation using SparkSQL. This project will use real-time data.
Venue
BTM 2nd Stage
773, 3rd Floor, 7th Cross, 16th Main, Bengaluru, India