Accelerate queries with Delta

When you run a query against a Delta table that consists of many small files, Databricks may display a hint such as: "This query is on a Delta table with many small files. To improve the performance of queries, run the OPTIMIZE command on the table `Schema`.`table_name`."

Optimize a table

This is similar to the collect-stats command found in most relational database systems. After you have made many changes to a table, you can end up with a lot of small files. To improve the speed of read queries, you can use OPTIMIZE to collapse the small files into larger ones.
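As a minimal sketch of how this situation arises (the events table, its columns, and the inserted values are all hypothetical), many small appends each write their own small file, which OPTIMIZE then compacts:

-- Hypothetical example: each small append can write its own small file
INSERT INTO events VALUES (1, 'click', '2020-01-01');
INSERT INTO events VALUES (2, 'view', '2020-01-01');
-- ...many more small writes accumulate over time...

-- Compact the accumulated small files into larger ones
OPTIMIZE events;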

Queries over the Databricks Delta table run much faster after OPTIMIZE. The exact speedup depends on the configuration of the cluster you are running on, but it is typically 5-10x compared to the same table before optimization.

Either of the commands below can be used to optimize the table: the first addresses it by path, the second by table name.
OPTIMIZE delta.`/delta/events`
OPTIMIZE events
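
To verify what OPTIMIZE did, you can inspect the table's history. This is a sketch assuming the events table above; in Delta Lake, DESCRIBE HISTORY lists past operations on a table, and the OPTIMIZE entry's operation metrics report details such as how many files were removed and added:

-- Show recent operations on the table; the OPTIMIZE row includes
-- operation metrics such as numRemovedFiles and numAddedFiles
DESCRIBE HISTORY events;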
