Skip to main content

Time Travel - Query an earlier version of the table

Time travel in Databricks

Delta Lake time travel allows you to query an older snapshot of a Delta Lake table.
For timestamp_string, only date or timestamp strings are accepted. For example, "2019-01-01" and "2019-01-01'T'00:00:00.000Z".
To query an older version of a table, specify a version or timestamp in a SELECT statement. For example, to query version 0 from the history above, use:
Copy to clipboardCopy
SELECT * FROM events VERSION AS OF 0
or
Copy to clipboardCopy
SELECT * FROM events TIMESTAMP AS OF '2019-01-29 00:37:58'
Note
Because version 1 is at timestamp '2019-01-29 00:38:10', to query version 0 you can use any timestamp in the range '2019-01-29 00:37:58' to '2019-01-29 00:38:09' inclusive.
DataFrameReader options allow you to create a DataFrame from a Delta Lake table that is fixed to a specific version of the table.
Copy to clipboardCopy
df1 = spark.read.format("delta").option("timestampAsOf", timestamp_string).load("/delta/events")
df2 = spark.read.format("delta").option("versionAsOf", version).load("/delta/events")

Comments

Popular posts from this blog

Learn GitHub

Learn GitHub git init git add file.txt git commit -m "my first commit" git remote add origin https://github.com/dansullivanma/devlops_data_sci.git git clone https://github.com/dansullivanma/devlops_data_sci.git

Garbage collection in Databricks

Clean up snapshots Delta Lake provides snapshot isolation for reads, which means that it is safe to run  OPTIMIZE  even while other users or jobs are querying the table. Eventually however, you should clean up old snapshots. You can do this by running the  VACUUM  command: VACUUM events You control the age of the latest retained snapshot by using the  RETAIN   <N>   HOURS  option: VACUUM events RETAIN 24 HOURS Test the garbage collection You can specify  DRY   RUN  to test the garbage collection and return a list of files to be deleted: VACUUM events DRY RUN Configure the retention threshold The  VACUUM  command removes any files that are no longer in the latest state of the transaction log for the table and are older than a retention threshold. The default threshold is 7 days, but you can specify an alternate retention interval. For example, to delete all stale files older t...