

Showing posts from August, 2019

Databricks Spark DataFrame FAQs

DataFrame FAQs

This FAQ addresses common use cases and example usage of the available APIs. For more detailed API descriptions, see the PySpark documentation.

How can I get better performance with DataFrame UDFs?

If the functionality exists in the built-in functions, using them will perform better than a UDF. Example usage below. Also see the pyspark.sql.functions documentation.

We use the built-in functions and the withColumn() API to add new columns. We could also have used withColumnRenamed() to replace an existing column after the transformation.

from pyspark.sql import functions as F
from pyspark.sql.types import *

# Build an example DataFrame dataset to work with.
dbutils.fs.rm("/tmp/dataframe_sample.csv", True)
dbutils.fs.put("/tmp/dataframe_sample.csv", """id|end_date|start_date|location
1|2015-10-14 00:00:00|2015-09-14 00:00:00|CA-SF
2|2015-10-15 01:00:20|2015-08-14 00:00:00|CA-SD
3...
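The excerpt above cuts off mid-example, so here is a minimal self-contained sketch of the same advice: the built-in to_timestamp and datediff functions are evaluated inside the Catalyst optimizer, while an equivalent Python UDF (hypothetical, shown only for contrast) forces row-by-row serialization into Python workers. The DataFrame is built with createDataFrame rather than dbutils so the snippet runs outside Databricks; the column names follow the sample data above.

from datetime import datetime
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()

# A small DataFrame shaped like the sample data above (assumed schema).
df = spark.createDataFrame(
    [(1, "2015-09-14 00:00:00", "2015-10-14 00:00:00", "CA-SF"),
     (2, "2015-08-14 00:00:00", "2015-10-15 01:00:20", "CA-SD")],
    ["id", "start_date", "end_date", "location"],
)

# Preferred: built-in functions stay inside the JVM and can be optimized.
with_builtin = df.withColumn(
    "duration_days",
    F.datediff(F.to_timestamp("end_date"), F.to_timestamp("start_date")),
)

# For contrast, a hypothetical Python UDF doing the same work: every row is
# shipped to a Python worker, so it is typically much slower.
@F.udf(returnType=LongType())
def duration_days_udf(start, end):
    fmt = "%Y-%m-%d %H:%M:%S"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).days

with_udf = df.withColumn("duration_days", duration_days_udf("start_date", "end_date"))

with_builtin.show()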

Cost-Based Optimizer

Cost-Based Optimizer in Databricks

Spark SQL can use a Cost-Based Optimizer (CBO) to improve query plans. This is especially useful for queries with multiple joins. For the CBO to work, it is critical to collect table and column statistics and keep them up to date. This functionality requires Databricks Runtime 3.3 or above.

Collect statistics

To get the full benefit of the CBO, it is important to collect both column statistics and table statistics. Statistics can be collected using the ANALYZE TABLE command.

Tip: To keep the statistics up to date, run ANALYZE TABLE after writing to the table.

Verify query plans

There are several ways to verify the query plan.

EXPLAIN command

Use the SQL EXPLAIN command to check whether the plan uses statistics. If statistics are missing, the query plan might not be optimal. Below is a sample explain plan:

== Optimized Logical Plan ==
Aggregate [s_store...
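As a rough sketch of the workflow described above, the snippet below collects table and column statistics and then inspects a plan through spark.sql(). The table store_sales and the columns ss_item_sk and ss_store_sk are placeholders chosen for illustration; substitute the tables and join/filter columns from the actual query.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Table-level statistics: row count and size in bytes.
spark.sql("ANALYZE TABLE store_sales COMPUTE STATISTICS")

# Column-level statistics for the columns used in joins and filters.
spark.sql("ANALYZE TABLE store_sales COMPUTE STATISTICS FOR COLUMNS ss_item_sk, ss_store_sk")

# Confirm that statistics are now attached to the table metadata.
spark.sql("DESCRIBE EXTENDED store_sales").show(truncate=False)

# Inspect the optimized plan; with fresh statistics the CBO can choose better join orders.
spark.sql(
    "EXPLAIN SELECT ss_store_sk, COUNT(*) FROM store_sales GROUP BY ss_store_sk"
).show(truncate=False)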