Big Data with PySpark


Learning Path Description

Advance your data skills by mastering Apache Spark. Using the Spark Python API, PySpark, you will leverage parallel computation over large datasets and get ready for high-performance machine learning. From cleaning data to creating features and implementing machine learning models, you'll execute end-to-end workflows with Spark. The track ends with building a recommendation engine using the popular MovieLens dataset and the Million Song Dataset.

Skills You Will Gain

Courses In This Learning Path

Total Duration: 4 hours
Level: Beginner
Learn Type: Certifications

Introduction to PySpark

This course will show you how to use Spark with Python. Spark lets you perform parallel computations over large datasets and integrates easily with Python through PySpark, the Python package that makes all of this possible. Using data on flights between Portland and Seattle, you'll learn how to wrangle the data and build a machine learning pipeline that predicts whether flights will be delayed. Get ready for high-performance machine learning by adding Spark to your Python code!

Total Duration: 4 hours
Level: Intermediate
Learn Type: Certifications

Big Data Fundamentals with PySpark

Big Data has attracted a lot of attention in recent years and is now a mainstream topic for many companies. But what exactly is Big Data? This course introduces the fundamentals of Big Data with PySpark. Spark is a framework for "lightning-fast cluster computing" on Big Data: a data processing engine that can run programs up to 100x faster than Hadoop in memory and 10x faster on disk. PySpark makes Spark programming available in Python and includes powerful libraries such as Spark SQL for structured queries and MLlib for machine learning. You'll analyze the works of William Shakespeare, explore Fifa 2018 data, and cluster genomic datasets. This course will give you a solid understanding of PySpark and its general application to Big Data analysis.

Total Duration: 4 hours
Level: Intermediate
Learn Type: Certifications

Cleaning Data with PySpark

Working with data can be difficult, and working with millions or billions of rows can be frustrating in new ways. Perhaps you've inherited data processing code that ran on a laptop against very clean data, or you're responsible for moving a basic data process from prototype to production, or you've worked with real-world data: missing fields, unusual formatting, and volumes orders of magnitude larger. Even if you are not yet an expert, this course will teach you how to build data processing pipelines using Python and Apache Spark, and will cover the terminology and best practices for creating a data processing platform that is reliable, manageable, and easy to understand.

Total Duration: 4 hours
Level: Intermediate
Learn Type: Certifications

Feature Engineering with PySpark

Your job is to find meaning in chaos. Toy datasets like MTCars and Iris require careful curation; real data must be transformed before it is useful to machine learning algorithms that predict, extract, classify, or cluster. This course covers the details data scientists spend 70 to 80% of their time on: data wrangling and feature engineering. As datasets grow increasingly large and complex, you'll use the Big Data power of PySpark to tame them.

Total Duration: 4 hours
Level: Intermediate
Learn Type: Certifications

Machine Learning with PySpark

Spark is a powerful tool for Big Data. It transparently manages the allocation of compute tasks across a cluster, so operations run quickly and you can concentrate on the analysis rather than the technical details. In this course you'll learn how to get data into Spark and then work with three fundamental Spark machine learning techniques: Linear Regression, Logistic Regression/Classifiers, and Pipelines. Along the way you'll analyze large quantities of spam text messages as well as flight delays, gaining the knowledge and skills to harness Spark's power for your own machine learning projects.

Total Duration: 4 hours
Level: Intermediate
Learn Type: Certifications

Building Recommendation Engines with PySpark

This course will show you how to build recommendation engines in PySpark using Alternating Least Squares (ALS), drawing on both the MovieLens dataset and the Million Song Dataset. It also covers the code required to train, test, and implement ALS models on various types of customer data.
