Optimizing Apache Spark on Databricks

Course Features

Duration: 120 minutes
Delivery Method: Online
Available on: Downloadable Courses
Accessibility: Mobile, Desktop, Laptop
Language: English
Subtitles: English
Level: Beginner
Teaching Type: Self Paced
Video Content: 120 minutes

Course Description

Apache Spark is a fast, high-performance framework for big data processing. Even so, you might find that your Apache Spark code running on Azure Databricks still has issues, whether because it has difficulty ingesting data from different sources or because of performance problems such as disk I/O, network, or computation bottlenecks. In this course, you'll first learn about Delta Lake on Azure Databricks, which lets you store data for processing, insights, and machine learning on Delta tables, and you'll see how Auto Loader can be used to ingest streaming data. You'll then explore common performance bottlenecks you might encounter when processing data in Apache Spark, such as serialization, skew, and spill, and learn how to mitigate them by improving your processing code with disk partitioning, Z-order clustering, and bucketing.

Course Overview

International Faculty

Post Course Interactions

Hands-On Training, Instructor-Moderated Discussions

What You Will Learn

You will learn how Delta Lake on Azure Databricks allows you to store data for processing, insights, and machine learning on Delta tables, and you will see how you can mitigate data ingestion problems by using Auto Loader on Databricks to ingest streaming data
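As a rough illustration of that pattern, here is a minimal PySpark sketch of using Auto Loader to stream files into a Delta table, assuming a Databricks cluster; the storage paths and the events_bronze table name are hypothetical placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Auto Loader (the "cloudFiles" source) incrementally picks up new files
# landing in cloud storage. All paths below are hypothetical.
stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/tmp/schemas/events")  # tracks inferred schema
    .load("/mnt/raw/events")
)

# Write the stream into a Delta table; the checkpoint lets the stream
# resume where it left off after a restart.
query = (
    stream.writeStream.format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/events")
    .trigger(availableNow=True)  # process all pending files, then stop
    .toTable("events_bronze")
)
```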

Next, you will explore common performance bottlenecks that you are likely to encounter while processing data in Apache Spark, including issues with serialization, skew, spill, and shuffle
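To make the skew case concrete, the sketch below shows two common mitigations: turning on Spark's adaptive skew-join handling, and manually "salting" a hot join key. The tiny orders/customers DataFrames and the salt count of 8 are illustrative assumptions, not course material:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Option 1: let Adaptive Query Execution split oversized shuffle
# partitions automatically (enabled by default on recent runtimes).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Option 2: salt a skewed key so one hot value spreads over N partitions.
orders = spark.createDataFrame(
    [(1, 100.0), (1, 50.0), (2, 75.0)], ["customer_id", "amount"]
)
customers = spark.createDataFrame(
    [(1, "Alice"), (2, "Bob")], ["customer_id", "name"]
)

N = 8  # illustrative salt count
salted_orders = orders.withColumn("salt", (F.rand() * N).cast("long"))
salted_customers = customers.crossJoin(
    spark.range(N).withColumnRenamed("id", "salt")  # replicate small side N times
)
joined = salted_orders.join(salted_customers, ["customer_id", "salt"])
joined.show()
```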

You will learn techniques to mitigate these issues and see how you can improve the performance of your processing code using disk partitioning, Z-order clustering, and bucketing
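A hedged sketch of those three layout techniques in PySpark follows; the table and column names are hypothetical, and bucketing is shown on a Parquet table because Delta tables do not support bucketBy:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("2024-01-01", "US", 1), ("2024-01-02", "DE", 2)],
    ["event_date", "country", "value"],
)

# Disk partitioning: one directory per country value, so filters on
# country can skip whole directories.
df.write.format("delta").partitionBy("country").mode("overwrite").saveAsTable("events")

# Z-order clustering (a Databricks Delta command): co-locate rows with
# similar event_date values inside files to improve data skipping.
spark.sql("OPTIMIZE events ZORDER BY (event_date)")

# Bucketing: pre-shuffle rows into a fixed number of buckets by key,
# avoiding a shuffle when later joining on that key.
(
    df.write.format("parquet")
    .bucketBy(8, "country")
    .sortBy("country")
    .mode("overwrite")
    .saveAsTable("events_bucketed")
)
```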

Finally, you will learn how you can share resources on the cluster using scheduler pools and fair scheduling and how you can reduce disk read and write operations using caching on Delta tables
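The sketch below illustrates both ideas, assuming the cluster was started with spark.scheduler.mode set to FAIR; the etl_pool pool name and the events table are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# All jobs submitted from this thread now run in the named scheduler
# pool, so concurrent workloads share the cluster fairly rather than
# queuing behind one another.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "etl_pool")

spark.sql("SELECT COUNT(*) FROM events").show()

# Cache the table to cut repeated disk reads on subsequent queries.
spark.sql("CACHE TABLE events")        # Spark's in-memory table cache
# On Databricks, CACHE SELECT warms the Delta (disk) cache instead:
# spark.sql("CACHE SELECT * FROM events")
```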

When you are finished with this course, you will have the skills and knowledge of Spark performance optimization needed to get the best out of your Spark cluster

Course Instructors

Janani Ravi

Instructor

Janani has a Master's degree from Stanford and worked for 7+ years at Google. She was one of the original engineers on Google Docs and holds 4 patents for its real-time collaborative editing framework...