
Schema on Read and Schema on Write

Schema on read: Write your data first, figure out what it is later.

Hive (in some cases), Hadoop, and many NoSQL systems in general use "schema on read" -- the schema is applied as the data is being read off of the data store.

Benefits of schema on read:
  • Flexibility in defining how your data is interpreted at load time
    • This lets you evolve your "schema" over time
    • This lets you keep different versions of your "schema"
    • This allows the original source data format to change without consolidating everything into one format
  • You get to keep your original data
  • You can load your data before you know what to do with it (so you don't drop it on the ground)
  • Flexibility to store unstructured, unclean, and/or unorganized data
Downsides of schema on read:
  • Generally less efficient, because you have to reparse and reinterpret the data every time (this can be expensive with verbose formats like XML)
  • The data is not self-documenting (i.e., you can't look at a schema to figure out what the data is)
  • More error prone; your analytics have to account for dirty data
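As a minimal illustration (not tied to Hive or any particular system; the record layout is invented for the example), the schema-on-read idea can be sketched in Python: the raw records are stored untouched, and the reader applies types and defaults at read time.

```python
import json

# Raw records stored as-is; no schema is enforced when they are written.
raw_lines = [
    '{"id": "1", "name": "widget", "price": "9.99"}',
    '{"id": "2", "name": "gadget"}',  # dirty data: price is missing
]

def read_with_schema(line):
    """Apply the schema at read time: cast types, default missing fields."""
    record = json.loads(line)
    return {
        "id": int(record["id"]),
        "name": record["name"],
        # The reader, not the writer, has to account for dirty data.
        "price": float(record.get("price", 0.0)),
    }

rows = [read_with_schema(line) for line in raw_lines]
print(rows)
```

Note that the casting and defaulting run on every read, which is exactly the efficiency cost described above, but changing the schema is just a matter of changing `read_with_schema`.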

Schema on write: Figure out what your data is first, then write it.

A traditional relational database stores the data with the schema in mind: it knows that the second column is an integer, it knows that the table has 40 columns, and so on. You therefore need to specify your schema ahead of time and have it well planned out. This is "schema on write" -- the schema is applied when the data is written to the data store.
Benefits of schema on write:
  • Better type safety, and data cleansing is done for the data at rest
  • Typically more efficient (in storage size and computation), since the data is already parsed
Downsides of schema on write:
  • You have to plan your schema before you store the data (i.e., you have to do ETL up front)
  • Typically you throw away the original data, which can be bad if there is a bug in your ingest process
  • It is harder to have different views of the same data
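The write-time counterpart can be sketched with Python's built-in `sqlite3` module (the table and fields are invented for the example): the schema is declared before any data arrives, and the ETL step casts and validates each record as it is written, so readers see only clean, typed rows.

```python
import sqlite3

# The schema is declared up front, before any data is stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER NOT NULL, price REAL NOT NULL)")

def write_with_schema(conn, raw):
    """ETL at write time: cast to the declared types before storing."""
    row = (int(raw["id"]), float(raw["price"]))
    conn.execute("INSERT INTO events VALUES (?, ?)", row)

write_with_schema(conn, {"id": "1", "price": "9.99"})

# A record that violates the schema is rejected at write time,
# not discovered later by a reader.
try:
    write_with_schema(conn, {"id": "2"})  # missing price
except KeyError:
    print("rejected at write time")

print(conn.execute("SELECT id, price FROM events").fetchall())
```

Once a row is in, every query gets typed values for free; the trade-off is that the original raw record (`{"id": "2"}`) is gone unless you keep it somewhere else yourself.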
