
Posts

Showing posts from 2019

Error: The specified driver class (org.postgres.Driver) is not available!

SQL Workbench error for PostgreSQL connections: "The specified driver class (org.postgres.Driver) is not available!" This error can appear while connecting to a PostgreSQL database in SQL Workbench. It usually means the PostgreSQL driver cannot be found by the Workbench tool, which can happen if the folder containing the driver has been moved or deleted. Solution: 1. Open Workbench and go to File -> Manage Drivers. 2. Select PostgreSQL. 3. Under the Library option, browse to the folder where the driver is located, select the driver, and click Open. You can download the latest PostgreSQL JDBC driver at: https://jdbc.postgresql.org/download.html 4. Click OK to close the Manage Drivers window. 5. Try to connect to the PostgreSQL database again with the correct credentials; it should now connect.
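For reference, here is a minimal sketch of making the same connection from Python through JDBC, assuming the jaydebeapi package is installed; the jar path, host, database name, and credentials below are placeholders, and the key detail is the fully qualified driver class org.postgresql.Driver plus the jar downloaded from jdbc.postgresql.org.

import jaydebeapi

conn = jaydebeapi.connect(
    "org.postgresql.Driver",                  # fully qualified PostgreSQL driver class
    "jdbc:postgresql://localhost:5432/mydb",  # JDBC URL (placeholder host and database)
    ["myuser", "mypassword"],                 # placeholder credentials
    "/path/to/postgresql-driver.jar",         # jar downloaded from jdbc.postgresql.org
)

cursor = conn.cursor()
cursor.execute("SELECT version()")
print(cursor.fetchone())
cursor.close()
conn.close()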

What is a DAG?

What is a DAG? Airflow refers to what we've been calling "pipelines" as  DAGs  (directed acyclic graphs). In computer science, a  directed acyclic graph  is simply a graph whose connections have a direction and which contains no cycles: the workflow only flows one way, and you can never loop back to a step you have already completed. Each "step" in the workflow is a node (a  vertex ), reached via the previous step until the workflow ends; the connection between two nodes is an  edge . If this remains unclear, consider how nodes in a tree data structure relate to one another. Every node has a "parent" node, which of course means that a child node cannot be its parent's parent. That's it - there's no need for fancy language here. Nodes in a DAG can have numerous "child" nodes. Interestingly, a "child" node can  also  have multiple parents (this is where our tree analogy fails us). Here's an example: An example DAG structure. In the above example, the DAG begins with nodes 1, 2 and 3 kicking things off. At variou...
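To make the structure concrete, here is a small illustrative sketch (plain Python, not Airflow-specific) of a DAG as an adjacency list, with a topological ordering that only exists because the graph is directed and acyclic; the node numbers and edges are hypothetical.

# A DAG as an adjacency list: keys are nodes, values are their child nodes.
dag = {
    1: [4],
    2: [4, 5],   # a node can have several children
    3: [5],
    4: [6],
    5: [6],      # node 6 has multiple parents (4 and 5)
    6: [],
}

def topological_order(graph):
    """Return nodes in an order where every parent precedes its children.
    Raises ValueError if the graph contains a cycle (i.e. is not a DAG)."""
    # Count incoming edges for each node.
    indegree = {node: 0 for node in graph}
    for children in graph.values():
        for child in children:
            indegree[child] += 1
    # Repeatedly emit nodes with no remaining unprocessed parents.
    ready = [n for n, d in indegree.items() if d == 0]
    order = []
    while ready:
        node = ready.pop()
        order.append(node)
        for child in graph[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if len(order) != len(graph):
        raise ValueError("graph contains a cycle, so it is not a DAG")
    return order

print(topological_order(dag))   # e.g. [3, 2, 5, 1, 4, 6]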

Databricks Spark DataFrame FAQs

DataFrame FAQs This FAQ addresses common use cases and example usage of the available APIs. For more detailed API descriptions, see the  PySpark documentation . How can I get better performance with DataFrame UDFs? If the functionality exists in the available built-in functions, using these will perform better; example usage is below. Also see the  pyspark.sql.functions documentation . We use the built-in functions and the  withColumn()  API to add new columns. We could have also used  withColumnRenamed()  to replace an existing column after the transformation.

from pyspark.sql import functions as F
from pyspark.sql.types import *

# Build an example DataFrame dataset to work with.
dbutils.fs.rm("/tmp/dataframe_sample.csv", True)
dbutils.fs.put("/tmp/dataframe_sample.csv", """id|end_date|start_date|location
1|2015-10-14 00:00:00|2015-09-14 00:00:00|CA-SF
2|2015-10-15 01:00:20|2015-08-14 00:00:00|CA-SD
3...
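As a rough sketch of the built-in-function approach, the snippet below computes a duration column with withColumn() and pyspark.sql.functions instead of a Python UDF; it assumes a local SparkSession and builds the sample rows in memory rather than reading the DBFS file above.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Sample rows mirroring the id|end_date|start_date|location dataset above.
df = spark.createDataFrame(
    [(1, "2015-10-14 00:00:00", "2015-09-14 00:00:00", "CA-SF"),
     (2, "2015-10-15 01:00:20", "2015-08-14 00:00:00", "CA-SD")],
    ["id", "end_date", "start_date", "location"],
)

# Built-in functions run inside the JVM, avoiding Python UDF serialization costs.
result = (df
          .withColumn("end_ts", F.to_timestamp("end_date"))
          .withColumn("start_ts", F.to_timestamp("start_date"))
          .withColumn("duration_days", F.datediff(F.col("end_ts"), F.col("start_ts"))))

result.select("id", "location", "duration_days").show()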

Cost-Based Optimizer

Cost-Based Optimizer in Databricks: Spark SQL can use a Cost-Based Optimizer (CBO) to improve query plans. This is especially useful for queries with multiple joins. For this to work it is critical to collect table and column statistics and keep them up to date. This functionality requires Databricks Runtime 3.3 or above. Collect statistics: To get the full benefit of the CBO it is important to collect both  column statistics  and  table statistics . Statistics can be collected using the ANALYZE TABLE command. Tip: To keep the statistics up to date, run  ANALYZE   TABLE  after writing to the table. Verify query plans: There are several ways to verify the query plan. EXPLAIN  command: Use the SQL EXPLAIN command to check whether the plan uses statistics. If statistics are missing, the query plan might not be optimal. Below is a sample explain plan: == Optimized Logical Plan == Aggregate [s_store...
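A minimal sketch of collecting statistics and checking the plan from a notebook, assuming an existing SparkSession; the table and column names below are placeholders, not the ones from the post's example.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Table-level statistics (total size and row count).
spark.sql("ANALYZE TABLE store_sales COMPUTE STATISTICS")

# Column-level statistics, which the CBO uses for join ordering and selectivity estimates.
spark.sql("ANALYZE TABLE store_sales COMPUTE STATISTICS FOR COLUMNS ss_store_sk, ss_quantity")

# Check whether the optimized plan carries statistics.
spark.sql(
    "EXPLAIN SELECT ss_store_sk, SUM(ss_quantity) FROM store_sales GROUP BY ss_store_sk"
).show(truncate=False)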