Azure Databricks & Spark For Data Engineers (PySpark / SQL)
Real World Project on Formula1 Racing for Data Engineers using Azure Databricks, Delta Lake, Azure Data Factory [DP203]
What you’ll learn
- You will learn how to build a real-world data project using Azure Databricks and Spark Core. This course has been taught using real-world data from Formula1 motor racing.
- You will acquire professional-level data engineering skills in Azure Databricks, Delta Lake, Spark Core, Azure Data Lake Gen2, and Azure Data Factory (ADF).
- You will learn how to create notebooks, dashboards, clusters, cluster pools, and jobs in Azure Databricks.
- You will learn how to ingest and transform data using PySpark in Azure Databricks.
- You will learn how to transform and analyze data using Spark SQL in Azure Databricks.
- You will learn about Data Lake architecture and Lakehouse architecture. You will also learn how to implement a Lakehouse architecture solution using Delta Lake.
- You will learn how to create Azure Data Factory pipelines to execute Databricks notebooks.
- You will learn how to create Azure Data Factory triggers to schedule and monitor pipelines.
- You will gain the Azure Databricks and Data Factory skills required to pass the Azure Data Engineer Associate certification exam DP203. Still, the course’s primary objective is not to teach you to pass the exam.
- You will learn how to connect to Azure Databricks from Power BI to create reports.
Requirements
- All the code and step-by-step instructions are provided, but the skills below will greatly benefit your journey.
- Basic Python programming experience will be required
- Basic SQL knowledge will be required
- Knowledge of cloud fundamentals will be beneficial but not necessary
- An Azure subscription will be required; if you don’t have one, we will create a free account in the course
Description
Major updates to the course since the launch
March 2023 – New sections 6 and 7 added, and section 8 updated. These changes reflect the latest Databricks recommendations for accessing Azure Data Lake. They also make it easier for students to complete the course project using an Azure Student Subscription or a Corporate Subscription with limited access to Azure Active Directory.
December 2022 – Sections 3, 4 & 5 updated to reflect recent UI changes to Azure Databricks, with new lessons on functionality that Databricks recently added to Databricks clusters.
Welcome!
I look forward to helping you learn one of the most in-demand data engineering tools in the cloud, Azure Databricks! This course is taught by implementing a data engineering solution using Azure Databricks and Spark Core for a real-world project of analyzing and reporting on Formula1 motor racing data.
This is like no other course on Udemy for Azure Databricks. Once you have completed the course, including all the assignments, I strongly believe that you can start a real-world data engineering project independently and be proficient in Azure Databricks. I have also included lessons on Azure Data Lake Storage Gen2, Azure Data Factory, and Power BI. The course’s primary focus is Azure Databricks and Spark Core, but it also covers the relevant concepts and connectivity to the other technologies mentioned. Please note that the course doesn’t cover other aspects of Spark, such as Spark Streaming and Spark ML. Also, the course is taught using PySpark and Spark SQL; it doesn’t cover Scala or Java.
The course follows the logical progression of a real-world project implementation, with technical concepts explained as the Databricks notebooks are built. Even though this course is not specifically designed to teach you the skills required for passing the Azure Data Engineer Associate Certification Exam DP203, it can help you gain most of the skills required for the exam.
I value your time as much as I do mine. So, I have designed this course to be fast-paced and to the point. Also, the course is taught in simple English and has no jargon. I start the course from the basics, and you will be proficient in the technologies used by the end of the course.
Currently, the course teaches you the following:
Azure Databricks
- Building a solution architecture for a data engineering solution using Azure Databricks, Azure Data Lake Gen2, Azure Data Factory, and Power BI
- Creating and using Azure Databricks service and the architecture of Databricks within Azure
- Working with Databricks notebooks as well as using Databricks utilities, magic commands, etc.
- Passing parameters between notebooks as well as creating notebook workflows
- Creating, configuring, and monitoring Databricks clusters, cluster pools, and jobs
- Mounting Azure Storage in Databricks using secrets stored in Azure Key Vault
- Working with Databricks Tables, Databricks File System (DBFS), etc.
- Using Delta Lake to implement a solution using Lakehouse architecture
- Creating dashboards to visualize the outputs
- Connecting to the Azure Databricks tables from Power BI
Spark (Only PySpark and SQL)
- Spark architecture, Data Sources API, and Dataframe API
- PySpark – Ingestion of CSV files and simple and complex JSON files into the data lake as Parquet files/tables.
- PySpark – Transformations such as Filter, Join, Simple Aggregations, GroupBy, Window functions, etc.
- PySpark – Creating local and temporary views
- Spark SQL – Creating databases, tables, and views
- Spark SQL – Transformations such as Filter, Join, Simple Aggregations, GroupBy, Window functions, etc.
- Spark SQL – Creating local and temporary views
- Implementing full refresh and incremental load patterns using partitions
Delta Lake
- The emergence of Data Lakehouse architecture and the role of Delta Lake.
- Read, Write, Update, Delete, and Merge operations on Delta Lake using PySpark and SQL.
- History, Time Travel, and Vacuum
- Converting Parquet files to Delta files
- Implementing the incremental load pattern using Delta Lake
Azure Data Factory
- Creating pipelines to execute Databricks notebooks
- Designing robust pipelines to deal with unexpected scenarios, such as missing files
- Creating dependencies between activities as well as pipelines
- Scheduling the pipelines using data factory triggers to execute at regular intervals
- Monitoring the triggers/pipelines to check for errors/outputs.
Who this course is for:
- University students looking for a career in Data Engineering
- IT developers working in other disciplines who are trying to move to Data Engineering
- Data Engineers/Data Warehouse Developers currently working on on-premises technologies or other cloud platforms such as AWS or GCP who want to learn Azure Data Technologies.
- Data Architects looking to understand the Azure Data Engineering stack.