Data Engineering


Showing posts from March, 2019

Schema on Read and Schema on Write

Schema on Read: write your data first, figure out what it is later. Hive (in some cases), Hadoop, and many NoSQL systems are "schema on read" systems: the schema is applied as the data is read off the data store.

Benefits of schema on read:

- Flexibility in defining how your data is interpreted when it is read
- This gives you the ability to evolve your "schema" as time goes on
- This allows you to have different versions of your "schema"
- This allows the original source data format to change without having to consolidate to one data format
- You get to keep your original data
- You can load your data before you know what to do with it (so you don't drop it on the ground)
- Gives you flexibility in being able to store unstructured, unclean, and/or unorganized data

Downsides of schema on read:

- Generally it is less efficient because you have to reparse and reinterpret the dat...
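To make the idea concrete, here is a minimal Python sketch of schema on read. It is an illustration only, not how Hive or Hadoop implement it. The sample records, the two schema versions (`SCHEMA_V1`, `SCHEMA_V2`), and the helpers `to_int` and `read_with_schema` are all hypothetical names made up for this example. The raw lines are stored untouched, and a schema is applied only when they are read, so two versions of the schema can be used over the same data.

```python
import json

# Raw records are kept exactly as they arrived; nothing is validated on write.
RAW_LINES = [
    '{"user": "ann", "age": "34"}',
    '{"user": "bob", "age": "n/a", "country": "DE"}',
    '{"user": "cy"}',
]


def to_int(value):
    """Return value as an int, or None if it cannot be parsed."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


# Two hypothetical schema versions: field name -> converter function.
SCHEMA_V1 = {"user": str, "age": to_int}
SCHEMA_V2 = {"user": str, "age": to_int, "country": str}


def read_with_schema(lines, schema):
    """Parse every raw line and apply the schema while reading."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, convert in schema.items():
            raw = record.get(field)
            # Missing fields become None instead of causing a failure.
            row[field] = convert(raw) if raw is not None else None
        rows.append(row)
    return rows


rows_v1 = read_with_schema(RAW_LINES, SCHEMA_V1)
rows_v2 = read_with_schema(RAW_LINES, SCHEMA_V2)
print(rows_v1)
print(rows_v2)
```

Note that every call to `read_with_schema` parses the raw text again, which is the efficiency cost mentioned in the downsides: the flexibility is paid for at read time.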

Learn GitHub

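git init                                   # create a new local repository
git add file.txt                           # stage file.txt for the next commit
git commit -m "my first commit"            # record the staged changes
git remote add origin https://github.com/dansullivanma/devlops_data_sci.git   # link the local repository to GitHub
git clone https://github.com/dansullivanma/devlops_data_sci.git               # alternatively, copy an existing repository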