Skip to main content

Garbage collection in Databricks

Clean up snapshots

Delta Lake provides snapshot isolation for reads, which means that it is safe to run OPTIMIZE even while other users or jobs are querying the table. Eventually however, you should clean up old snapshots. You can do this by running the VACUUM command:

VACUUM events
You control the age of the latest retained snapshot by using the RETAIN <N> HOURS option:
VACUUM events RETAIN 24 HOURS

Test the garbage collection

You can specify DRY RUN to test the garbage collection and return a list of files to be deleted:

VACUUM events DRY RUN

Configure the retention threshold

The VACUUM command removes any files that are no longer in the latest state of the transaction log for the table and are older than a retention threshold. The default threshold is 7 days, but you can specify an alternate retention interval. For example, to delete all stale files older than 8 days, you can execute the following SQL command:
VACUUM events RETAIN 192 HOURS

Comments

Popular posts from this blog

Learn GitHub

Learn GitHub git init git add file.txt git commit -m "my first commit" git remote add origin https://github.com/dansullivanma/devlops_data_sci.git git clone https://github.com/dansullivanma/devlops_data_sci.git