name: inter-slide
class: left, middle, inverse

{{ content }}

---
name: layout-general
layout: true
class: left, middle

<style>
.remark-slide-number {
  position: inherit;
}

.remark-slide-number .progress-bar-container {
  position: absolute;
  bottom: 0;
  height: 4px;
  display: block;
  left: 0;
  right: 0;
}

.remark-slide-number .progress-bar {
  height: 100%;
  background-color: red;
}

/* custom.css */
.plot-callout {
  width: 300px;
  bottom: 5%;
  right: 5%;
  position: absolute;
  padding: 0px;
  z-index: 100;
}

.plot-callout img {
  width: 100%;
  border: 1px solid #23373B;
}
</style>

---
class: middle, left, inverse

# Technologies Big Data : Spark Tips

### 2023-04-11

#### [Master I MIDS Master I Informatique]()

#### [Technologies Big Data](http://stephane-v-boucheron.fr/courses/isidata/)

#### [Amélie Gheerbrant, Stéphane Gaïffas, Stéphane Boucheron](http://stephane-v-boucheron.fr)

---
class: center, middle, inverse

# Spark tips and error messages

---
template: inter-slide

## Spark tips

---

### Tip 1. Use DataFrames instead of RDDs

- Instead of using the `RDD` API

```default
rdd = sc.textFile("/path/to/file.txt")
```

- Use the `DataFrame` API

```default
df = spark.read.text("/path/to/file.txt")
```

- The DataFrame API uses the .stress[`Catalyst`] optimizer to **improve** the execution plan of your Spark job

- The low-level `RDD` API does not

- Most of the **recent Spark advances** are improvements of the `SQL` engine, so DataFrames benefit from them while RDDs do not

---

### Tip 2. Avoid using regular expressions

- Java Regex is great to parse data in an expected structure

- But, unfortunately, it is generally a .stress[slow process] when processing millions of rows

- Increasing the parsing time of each row **a little bit** .stress[increases a lot] the duration of the entire job

- If possible, .stress[avoid using regexes] and try to load your data in a **more structured format**

---

### Tip 3. Joins: largest dataset on the left

- When joining two datasets where **one is smaller than the other**, you **must** put the .stress[largest on the left]

```default
joinedDF = largeDF.join(smallDF, on="id")
```

- The data specified .stress[on the left] is **static on the executors** while the data .stress[on the right] is **transferred** between the executors

- Something like

```default
joinedDF = smallDF.join(largeDF, on="id")
```

can take .stress[much longer] or even .stress[fail] if `largeDF` is large

---

### Tip 4. Joins: use broadcast joining

- Often, we need to join a **huge** dataframe with a **small** one

- Use .stress[broadcast joins] for joining small datasets to larger ones

```default
from pyspark.sql.functions import broadcast

joinedDF = largeDF.join(broadcast(smallDF), on="id")
```

- This usually leads to **much faster joins**, since it allows Spark to .stress[avoid shuffles]

---

### Tip 5. Use caching when repeating queries

- If you are constantly using the same DataFrame in multiple queries, you can use .stress[caching] or .stress[persistence]:

```default
df = spark.read.text("/path/to/file.txt").cache()
```

- But .stress[avoid overusing] this. Depending on the caching strategy (in-memory, then swap to disk), the cache can **end up being slower** than reading from the source

- Storage space used for caching means **less space** for processing

- Caching can **cost more** than reading the DataFrame (e.g. when only a few columns are useful and **predicate pushdown** could skip the rest)

---

### Tip 6. COMPUTE STATISTICS of tables

- **Before querying** a table, it can be helpful to .stress[compute the statistics] of that table, so that Catalyst can **find a better plan** to process it:

```default
query = "ANALYZE TABLE db.table COMPUTE STATISTICS"
spark.sql(query)
```

- However, Spark **does not always get everything** it needs from the broad `COMPUTE STATISTICS` call above

---

### Tip 6. COMPUTE STATISTICS of tables

- It also helps to .stress[analyze specific columns], so that **Catalyst** can better estimate operations on those columns

- It's recommended to **COMPUTE STATISTICS** for any .stress[columns involved in filtering and joining]:

```default
query = ("ANALYZE TABLE db.table COMPUTE STATISTICS"
         " FOR COLUMNS joinColumn, filterColumn")
spark.sql(query)
```

---

### Tip 7. Shuffles: know your data

- Shuffle is the .stress[transportation of data between workers] across a Spark cluster's network

- It's central for operations where a **reorganization of data is required**, referred to as .stress[wide dependencies] (**wide** vs **narrow** dependencies)

- This kind of operation is .stress[usually the bottleneck] of your Spark application

- To use Spark well, you .stress[need to know what you shuffle], and for this it's **essential that you know your data**

---

### Tip 8. Shuffles: beware of skews

- .stress[Skew] is an .stress[imbalance] in the **distribution of your data**

- If you fail to account for **how your data is distributed**, you may find that Spark naively places an overwhelming .stress[majority of rows on one executor], and a .stress[fraction on all the rest]

- This is **skew**, and .stress[it will kill your application], whether by causing **out of memory** errors, **network timeouts**, or **exponentially long running processes** that never terminate

---

### Tip 9. Partitions: change the default

- It's **absolutely essential** to .stress[adapt the number of partitions] to the kind of problem you're solving

- The default value of `spark.sql.shuffle.partitions` is 200. It controls the .stress[number of partitions] used by .stress[shuffles] (= the number of partitions in the **resulting** DataFrame or RDD)

- The number of shuffle partitions .stress[does not change] with the **data size**: for **small** data, 200 is **overkill**; for **large** data, it does not **effectively use all the resources**

- **Rule of thumb**: set this configuration to the .stress[number of cores] you have available .stress[across all your executors]

```default
spark.conf.set("spark.sql.shuffle.partitions", 42)
```

---

### Tip 10. Partitions: well-distributed columns

- A powerful way to **control Spark shuffles** is to .stress[partition your data intelligently]

- Partitioning on the .stress[right column] (or set of columns) helps to .stress[balance] the **amount of data mapped across the cluster network** in order to perform actions

- Partitioning on a **unique ID** is generally a good strategy, but **don't partition** on **sparsely filled columns** (with many NAs) or on columns that **over-represent particular values** (see the sketch on the next slide)
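---

### Tip 10. Partitions: well-distributed columns

- A minimal sketch of explicit repartitioning (`user_id` and `country` are hypothetical column names, and the partition count should match your cluster, cf. Tip 9):

```default
# Hash-partition the rows on a well-distributed, (nearly) unique key
# before shuffle-heavy work
df = df.repartition(200, "user_id")

# By contrast, repartitioning on a sparse or highly repeated column,
# e.g. df.repartition(200, "country"), would send most rows to a few
# partitions and create skew
```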
---

### Tip 11. Joins again: highly flammable

- .stress[Joins] are .stress[shuffle offenders]. The dangers of SQL joining are amplified by the **scale** enabled by Spark

- Even joining **medium sized data** can .stress[cause an explosion] if there are .stress[repeated join values] on both sides of your join

- Million-row datasets with **"pseudo unique"** keys can .stress[explode] into a .stress[billion-row join]!

- **Join columns** with .stress[null values] usually mean .stress[massive skew] and an **explosive join**

- A solution is to .stress[pre-fill empty cells] with **arbitrary balanced values** (e.g. uniform random values) before running the join (see the sketch on the next slide)
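---

### Tip 11. Joins again: highly flammable

- A minimal sketch of this pre-filling, assuming a string join key named `key` (both `key` and `otherDF` are hypothetical names):

```default
from pyspark.sql import functions as F

# Replace null join keys by arbitrary, well-spread "salt" values, so
# that these rows are balanced across partitions instead of all being
# shuffled to the same executor. Like nulls in an inner join, the
# salted keys are assumed not to match any real key on the other side.
filledDF = df.withColumn(
    "key",
    F.coalesce(
        F.col("key"),
        F.concat(F.lit("na_"), (F.rand() * 1000).cast("int").cast("string"))
    )
)
joinedDF = filledDF.join(otherDF, on="key")
```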
---

### Tip 12. Is your data real yet?

- Don't forget that operations in Spark are divided between .stress[transformations] and .stress[actions]. Transformations are **lazy** operations, which allows Spark to **optimize your query**

- Transformations **set up** a DataFrame for changes (adding a column, joining it to another, etc.) but .stress[will not execute them] until an **action** is performed

- This can lead to **surprising results**: imagine that you create an id column using `monotonically_increasing_id`, and then join on that column. If you do not place an **action** in between, your values **have not been materialized**, and the result will be **non-deterministic!**

---

### Tip 13. Checkpointing is your friend

- .stress[Checkpointing] means .stress[saving data to disk] and .stress[reloading it back in], which would be **redundant** anywhere else besides Spark

- It **triggers an action** on any waiting transformations, and **truncates** the Spark **query plan** for the checkpointed data

- This action **shows up in your Spark UI**, indicating **where you are in your job**

- It can help to **conserve resources**, since it can **release memory** that would otherwise be cached for downstream access

- Checkpointed data is also a valuable source for **data-debugging** (see the sketch on the next slide)
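---

### Tip 13. Checkpointing is your friend

- A minimal sketch, which also materializes the lazily generated ids of Tip 12 (the checkpoint directory and `otherDF` are hypothetical):

```default
from pyspark.sql import functions as F

# The checkpoint directory must be reachable by the executors
# (on a real cluster, prefer a distributed filesystem such as HDFS)
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

df = df.withColumn("id", F.monotonically_increasing_id())

# checkpoint() triggers an action: the DataFrame is written to disk,
# read back, and its query plan is truncated, so the "id" values are
# frozen and the join below is deterministic
df = df.checkpoint()
joinedDF = df.join(otherDF, on="id")
```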
---

### Tip 14. Check your runtime with monitoring

- .stress[Spark UI] is your friend, and so are other **monitoring tools** that let you know how your run is going in **real time**

- The Spark UI contains information at the **job level**, the **stage level**, and the **executor level**. You can see whether the **volume of data** going to each **partition** or each **executor** makes sense, and whether some part of your job is taking **too much time**

- A monitoring tool that shows your **total memory** and **CPU usage** across executors is essential for **resource planning** and for "autopsies" on **failed jobs**

---

### Tip 15. CSV reading is brittle

- Naively reading CSVs in Spark can result in **silent** .stress[escape-character errors]

```default
df = spark.read.csv("quote-happy.csv")
```

- Your DataFrame **seems happy**: no runtime exceptions, and you can execute operations on it

- But after careful debugging, you realize that at some point in the data, **everything has shifted** by one or several columns!

- To be safe, you can include `escape` and `quote` options in your reads (see the sketch on the next slide). Even better: .stress[use `Parquet`] instead of CSV files!
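---

### Tip 15. CSV reading is brittle

- A minimal sketch of a more defensive read (the option values are assumptions about how this particular file is quoted, not universal defaults):

```default
# State explicitly how fields are quoted and how quote characters are
# escaped inside quoted fields; multiLine handles quoted fields that
# contain line breaks
df = (spark.read
      .option("header", True)
      .option("quote", '"')
      .option("escape", '"')
      .option("multiLine", True)
      .csv("quote-happy.csv"))
```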
---

### Tip 16. Parquet is your friend

- Read/write operations are .stress[orders of magnitude more efficient] with `Parquet` than with uncompressed CSV files

- Parquet is "columnar": it **reads only** the columns required for a SQL query and **skips over** those that are not requested

- It also supports .stress[predicate pushdown] for filtering operations: queries run **only on relevant subsets** of the values

- Switching from CSV to Parquet is the .stress[first thing you can do] to **improve performance**

- If you are generating Parquet files from another format (using PyArrow, Pandas, etc.), be aware that creating a **single** Parquet file gives up a .stress[major benefit of the format]: you need to **partition it**! (see the sketch on the next slide)
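---

### Tip 16. Parquet is your friend

- A minimal sketch of a partitioned write (`date` is a hypothetical partitioning column):

```default
# One Parquet directory is written per distinct value of "date";
# queries filtering on "date" then read only the relevant directories
(df.write
   .partitionBy("date")
   .parquet("/path/to/output"))
```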
---

### Tip 17. Problems with UDFs

.stress[UDF] = User Defined Function = something **very convenient**

```default
>>> from pyspark.sql import functions as F, types as T
>>> data = [{'a': 1, 'b': 0}, {'a': 10, 'b': 3}]
>>> df = spark.createDataFrame(data)
>>> def calculate_a_b_ratio(a, b):
...     if b > 0:
...         return a / b
...     return 0.
>>> udf_ratio_calculation = F.udf(calculate_a_b_ratio, T.FloatType())
>>> df = df.withColumn('a_b_ratio_float', udf_ratio_calculation('a', 'b'))
>>> df.show()
+---+---+---------------+
|  a|  b|a_b_ratio_float|
+---+---+---------------+
|  1|  0|            0.0|
| 10|  3|      3.3333333|
+---+---+---------------+
```

---

### Tip 17. Problems with UDFs

UDFs are .stress[excruciatingly slow] with `pyspark`, and Spark .stress[won't complain] if the **return type is incorrect**: it just returns `null`s

```default
>>> udf_ratio_calculation = F.udf(calculate_a_b_ratio, T.DecimalType())
>>> df = df.withColumn('a_b_ratio_dec', udf_ratio_calculation('a', 'b'))
>>> df.show()
+---+---+---------------+-------------+
|  a|  b|a_b_ratio_float|a_b_ratio_dec|
+---+---+---------------+-------------+
|  1|  0|            0.0|         null|
| 10|  3|      3.3333333|         null|
+---+---+---------------+-------------+
```

```default
>>> udf_ratio_calculation = F.udf(calculate_a_b_ratio, T.BooleanType())
>>> df = df.withColumn('a_b_ratio_bool', udf_ratio_calculation('a', 'b'))
>>> df.show()
+---+---+---------------+-------------+--------------+
|  a|  b|a_b_ratio_float|a_b_ratio_dec|a_b_ratio_bool|
+---+---+---------------+-------------+--------------+
|  1|  0|            0.0|         null|          null|
| 10|  3|      3.3333333|         null|          null|
+---+---+---------------+-------------+--------------+
```

---

### Tip 18. Use all of the resources

- Spark **driver memory** and **executor memory** are set by default to 1 GB

- It is in general very useful to take a look at the .stress[many configuration parameters] and their defaults:

.center[https://spark.apache.org/docs/latest/configuration.html]

- Many things there can **influence your Spark application**

- When running **locally**, adjust `spark.driver.memory` to something that's reasonable for your system, e.g. `"8g"`

- When running on a **cluster**, you might also want to tweak `spark.executor.memory` (it depends on your cluster and its configuration)

---

### Tip 18. Use all of the resources

```default
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.set('spark.executor.memory', '16g')
conf.set('spark.driver.memory', '8g')

spark_session = SparkSession.builder \
    .config(conf=conf) \
    .appName('Name') \
    .getOrCreate()
```

---
class: center, middle, inverse

# 7.2. Interpret error messages

---

# Interpret error messages

- Error messages .stress[don't mean what they say]

- It takes quite a while to understand that Spark **complains about one thing** when the problem is **somewhere else**

- `"Connection reset by peer"` often means that you have skews and .stress[one particular worker has run out of memory]

- `"java.net.SocketTimeoutException: Write timed out"` can mean that the .stress[number of partitions is too high], so that the filesystem is too slow to handle the **number of simultaneous writes** attempted by Spark

---

# Interpret error messages

- `"Total size of serialized results [...] is bigger than spark.driver.maxResultSize"` can mean that the .stress[number of partitions is too high] and the **results can't fit onto a particular worker**

- `"Column a is not a member of table b"`: you have a .stress[SQL join error]. Try your **job locally on a small sample** to avoid reverse-engineering such errors

- Sometimes you get a true `"out of memory"` error. You can **increase the size of individual workers**, but before you do that, ask yourself: .stress[is the data well distributed]?

---

# Interpret error messages

- `"ClassNotFoundException"`: usually raised when you are trying to **connect** your application to an external database. Here is an example

.center[<img src="figs/classnotfounderror.png" style="width: 100%;" />]

---

# Interpret error messages

- It means that Spark **cannot find the necessary jar driver** to connect to the database

- You need to **provide the correct jars** to your application, using the Spark configuration or as a command line argument

```default
from pyspark import SparkConf
from pyspark.sql import SparkSession

jars = "/full/path/to/postgres.jar,/full/path/to/other/jar"

conf = SparkConf()
conf.set("spark.jars", jars)

spark = SparkSession.builder.config(conf=conf)\
    .appName('test').getOrCreate()
```

or

```bash
spark-submit --jars /full/path/to/postgres.jar,/full/path/to/other/jar ...
```

---

# Interpret error messages

- All the jars must be accessible to **all nodes**, not just local to the driver

- This error might also mean a **Spark version mismatch** between the cluster components

- Make sure there are **no spaces** around the commas in the list of jars

---

# Interpret error messages

Trying to connect to a database: `"java.sql.SQLException: No suitable driver"`

.center[<img src="figs/sqlerror.png" style="width: 100%;" />]

---

# Interpret error messages

This error happens while trying to save to a database: `"java.lang.NullPointerException"`

.center[<img src="figs/nullpointer.png" style="width: 100%;" />]

---

# Interpret error messages

These errors usually mean that we forgot to set the driver, `"org.postgresql.Driver"` for `Postgres`:

```default
df = spark.read.format('jdbc').options(
    url='db_url',
    driver='org.postgresql.Driver',  # <-- here
    dbtable='table_name',
    user='user',
    password='password'
).load()
```

and also make sure that the drivers' jars are set.

---

# Interpret error messages

Horrible error: `'NoneType' object has no attribute '_jvm'`

.center[<img src="figs/nonetypeerror.png" style="width: 85%;" />]

...it mainly comes from two mistakes

---

# Interpret error messages

1) You are using pyspark functions without having an active Spark session

```default
from pyspark.sql import SparkSession, functions as F


class A(object):
    def __init__(self):
        self.calculations = F.col('a') / F.col('b')

...

# Instantiating A without an active spark session
# will give you this error
a = A()
```

---

# Interpret error messages

2) You are using pyspark functions within a UDF:

```default
# Create a dataframe
data = [{'a': 1, 'b': 0}, {'a': 10, 'b': 3}]
df = spark.createDataFrame(data)

# Define a simple function that returns the max of a and b
def calculate_a_b_max(a, b):
    return F.max([a, b])

# and a udf for this function - notice the return datatype
udf_max_calculation = F.udf(calculate_a_b_max, T.FloatType())
df = df.withColumn('a_b_max', udf_max_calculation('a', 'b'))
df.show()
```

We CANNOT use pyspark functions inside a UDF: a UDF operates row by row, while pyspark functions operate on whole columns.

---
class: center, middle, inverse

# Thank you !