Z-Ordering in Databricks
Z-Ordering is a technique to colocate related information in the same set of files. This co-locality is automatically used by Delta Lake on Databricks data-skipping algorithms to dramatically reduce the amount of data that needs to be read. To Z-Order data, you specify the columns to order on in the
ZORDER BY
clause:OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)
You can specify multiple columns for
ZORDER BY
as a comma-separated list. However, the effectiveness of the locality drops with each additional column. Z-Ordering on columns that do not have statistics collected on them would be ineffective and a waste of resources as data skipping requires column-local stats such as min, max, and count. You can configure statistics collection on certain columns by re-ordering columns in the schema and/or increasing the number of columns to collect statistics on. See the section Data skipping for more details.
Note
- Z-Ordering aims to produce evenly-balanced data files with respect to the number of tuples, but not necessarily data size on disk. The two measures are most often correlated, but there can be situations when that is not the case, leading to skew in optimize task times.
- For example, if you
ZORDER BY
date and your most recent records are all much wider (for example longer arrays or string values) than the ones in the past, it is expected that theOPTIMIZE
job’s task durations will be skewed, as well as the resulting file sizes. This is, however, only a problem for theOPTIMIZE
command itself; it should not have any negative impact on subsequent queries.
Comments
Post a Comment