Optimizing Apache Spark on Databricks

Course Features

Duration: 120 minutes
Delivery Method: Online
Available on: Downloadable Courses
Accessibility: Mobile, Desktop, Laptop
Language: English
Subtitles: English
Level: Beginner
Teaching Type: Self Paced
Video Content: 120 minutes

Course Description

Apache Spark is a fast, high-performance framework for big data processing. Even so, you might find that your Apache Spark code running on Azure Databricks still has issues, whether because it has difficulty ingesting data from different sources or because of performance problems such as disk I/O, network, or computation bottlenecks. In this course, you'll first learn about Delta Lake on Azure Databricks, which lets you store data for processing, insights, and machine learning on Delta tables, and you'll see how Auto Loader can be used to ingest streaming data. You'll then explore common performance bottlenecks you might encounter when processing data in Apache Spark, such as serialization, skew, and spill, and learn how to mitigate them by improving your processing code with disk partitioning, Z-order clustering, and bucketing.

Course Overview

International Faculty

Post Course Interactions

Hands-On Training, Instructor-Moderated Discussions

What You Will Learn

You will learn how Delta Lake on Azure Databricks allows you to store data for processing, insights, and machine learning on Delta tables, and you will see how you can mitigate data ingestion problems by using Auto Loader on Databricks to ingest streaming data
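As a rough illustration of that pattern, here is a minimal PySpark sketch of using Auto Loader to stream files into a Delta table, assuming a Databricks cluster; the storage paths and the events_bronze table name are hypothetical placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Auto Loader (the "cloudFiles" source) incrementally picks up new files
# landing in cloud storage. All paths below are hypothetical.
stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/tmp/schemas/events")  # tracks inferred schema
    .load("/mnt/raw/events")
)

# Write the stream into a Delta table; the checkpoint lets the stream
# resume where it left off after a restart.
query = (
    stream.writeStream.format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/events")
    .trigger(availableNow=True)  # process all pending files, then stop
    .toTable("events_bronze")
)
```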

Next, you will explore common performance bottlenecks that you are likely to encounter while processing data in Apache Spark, including issues with serialization, skew, spill, and shuffle
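To make the skew case concrete, the sketch below shows two common mitigations: turning on Spark's adaptive skew-join handling, and manually "salting" a hot join key. The tiny orders/customers DataFrames and the salt count of 8 are illustrative assumptions, not course material:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Option 1: let Adaptive Query Execution split oversized shuffle
# partitions automatically (enabled by default on recent runtimes).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Option 2: salt a skewed key so one hot value spreads over N partitions.
orders = spark.createDataFrame(
    [(1, 100.0), (1, 50.0), (2, 75.0)], ["customer_id", "amount"]
)
customers = spark.createDataFrame(
    [(1, "Alice"), (2, "Bob")], ["customer_id", "name"]
)

N = 8  # illustrative salt count
salted_orders = orders.withColumn("salt", (F.rand() * N).cast("long"))
salted_customers = customers.crossJoin(
    spark.range(N).withColumnRenamed("id", "salt")  # replicate small side N times
)
joined = salted_orders.join(salted_customers, ["customer_id", "salt"])
joined.show()
```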

You will learn techniques to mitigate these issues and see how you can improve the performance of your processing code using disk partitioning, Z-order clustering, and bucketing
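A hedged sketch of those three layout techniques in PySpark follows; the table and column names are hypothetical, and bucketing is shown on a Parquet table because Delta tables do not support bucketBy:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("2024-01-01", "US", 1), ("2024-01-02", "DE", 2)],
    ["event_date", "country", "value"],
)

# Disk partitioning: one directory per country value, so filters on
# country can skip whole directories.
df.write.format("delta").partitionBy("country").mode("overwrite").saveAsTable("events")

# Z-order clustering (a Databricks Delta command): co-locate rows with
# similar event_date values inside files to improve data skipping.
spark.sql("OPTIMIZE events ZORDER BY (event_date)")

# Bucketing: pre-shuffle rows into a fixed number of buckets by key,
# avoiding a shuffle when later joining on that key.
(
    df.write.format("parquet")
    .bucketBy(8, "country")
    .sortBy("country")
    .mode("overwrite")
    .saveAsTable("events_bucketed")
)
```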

Finally, you will learn how you can share resources on the cluster using scheduler pools and fair scheduling and how you can reduce disk read and write operations using caching on Delta tables
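The sketch below illustrates both ideas, assuming the cluster was started with spark.scheduler.mode set to FAIR; the etl_pool pool name and the events table are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# All jobs submitted from this thread now run in the named scheduler
# pool, so concurrent workloads share the cluster fairly rather than
# queuing behind one another.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "etl_pool")

spark.sql("SELECT COUNT(*) FROM events").show()

# Cache the table to cut repeated disk reads on subsequent queries.
spark.sql("CACHE TABLE events")        # Spark's in-memory table cache
# On Databricks, CACHE SELECT warms the Delta (disk) cache instead:
# spark.sql("CACHE SELECT * FROM events")
```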

When you are finished with this course, you will have the skills and knowledge of Spark performance optimization needed to get the best out of your Spark cluster

Course Instructors

Janani Ravi

Instructor

Janani has a Master's degree from Stanford and worked for 7+ years at Google. She was one of the original engineers on Google Docs and holds 4 patents for its real-time collaborative editing framework...