Apache Spark is the de facto standard for large-scale data processing. This course is the first in a series of courses leading to the IBM Advanced Data Science Specialization. Learning how to build a scalable platform for data science is essential because memory and CPU limitations are among the most important limiting factors in building advanced machine learning models.
This course will teach you how to use Python with Apache Spark. In the first two weeks, we will introduce Apache Spark and then learn how to use it to perform basic exploratory and pre-processing tasks. This exercise will also introduce you to data visualization and statistical methods. This will give you the knowledge necessary to assume the role of data engineer in any modern setting. It also gives you the foundation for your data science career.

Please have a look at the full specialization curriculum: https://www.coursera.org/specializations/advanced-data-science-ibm

If you choose to take this course and earn the Coursera course certificate, you will also earn an IBM digital badge. For more information about IBM digital badges, visit ibm.biz/badging.

This course will help you to:
* Recognize data patterns, deviations, inconsistencies, and outliers
* Identify useful techniques for working with big data, such as dimension reduction and feature selection methods
* Use advanced tools and charting libraries to:
  o Improve the efficiency of big data analysis with partitioning and parallel analysis
  o Visualize the data in a number of 2D and 3D formats (Box Plot, Run Chart, Scatter Plot, Pareto Chart, and Multidimensional Scaling)

For successful completion of the course, the following prerequisites are recommended:
* Basic programming skills in Python
* Basic math
* Basic SQL (you can pick it up at https://www.coursera.org/learn/sql-data-science if needed)

The technologies used in this course are introduced as necessary, so no previous knowledge of them is required.

Some learners find parts of the course material too complex. If you feel the same, please take a look at the materials below before you start this course. We've heard that it really helps. You can also try this course first and then, if you feel the need, go to these materials. They are free:
https://cognitiveclass.ai/learn/spark
https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/f8982db1-5e55-46d6-a272-fd11b670be38/view?access_token=533a1925cd1c4c362aabe7b3336b3eae2a99e0dc923ec0775d891c31c5bbbc68

This course takes four weeks, at 4-6 hours per week.