What is a DAG?

Airflow refers to what we've been calling "pipelines" as DAGs (directed acyclic graphs). In computer science, a directed acyclic graph is a set of steps joined by one-way connections that never loop back on themselves, so a workflow modeled this way only flows in a single direction. Each "step" in the workflow (a vertex, also called a node) is reached via the previous step in the workflow until we reach the end. The connection between two vertices is called an edge.
If this remains unclear, consider how nodes in a tree data structure relate to one another. Every node except the root has a "parent" node, which of course means that a child node cannot be its parent's parent. That's it - there's no need for fancy language here.
Nodes in a DAG can have numerous "child" nodes. Interestingly, a "child" node can also have multiple parents (this is where our tree analogy fails us). Here's an example:

An example DAG structure.

In the above example, the DAG begins with nodes 1, 2 and 3 kicking things off. At various points in the pipeline, information is consolidated or broken out. Eventually, the DAG ends with node 8.
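
To make the structure concrete, here is a minimal sketch in plain Python (not Airflow code) using the standard library's `graphlib` module, available in Python 3.9 and later. The wiring between the steps is an assumption made up for illustration: the only details taken from the example above are that steps 1, 2 and 3 start the DAG and step 8 ends it. The helper names `run_order` and `is_dag` are hypothetical too.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical wiring: each key is a step, each value is the set of its parents.
# Steps 1, 2 and 3 have no parents, so they never appear as keys.
parents = {
    4: {1, 2},   # consolidates the output of steps 1 and 2
    5: {3},
    6: {4},      # step 4 is broken out into steps 6 and 7
    7: {4, 5},   # a child with multiple parents
    8: {6, 7},   # final step
}


def run_order(graph):
    """Return one valid execution order in which parents precede children."""
    return list(TopologicalSorter(graph).static_order())


def is_dag(graph):
    """Return True if the graph has no cycles, False otherwise."""
    try:
        run_order(graph)
        return True
    except CycleError:
        return False


order = run_order(parents)
print(order)

# A step that is its own grandparent breaks the "acyclic" rule.
looped = {"a": {"b"}, "b": {"a"}}
print(is_dag(parents), is_dag(looped))
```

Several orders can be valid for the same graph, so the exact list printed may vary. What is guaranteed is that every step appears after all of its parents, which is the same guarantee a scheduler relies on when deciding which task can run next.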
