What strategies are there to document data lineage and keep it updated with a minimum amount of maintenance?
Quickly communicating data lineage to other stakeholders in our organization has become increasingly difficult as we scale.
What are effective strategies to address this and keep the documentation maintained?
An example would be customer data that is stored in a data warehouse, processed in various ways, and used for analysis and reporting. The audience would be members of the business intelligence and analytics teams as well as product managers. The data lineage can change as developers add or modify the code.
This post was sourced from https://writers.stackexchange.com/q/33479. It is licensed under CC BY-SA 3.0.
2 answers
As your question is fairly general, this answer is too.
I'd document the ETL process in the target database itself. Since you have in-house developers, this documentation step will have to be built into your extraction development. It's best to add a bit of architecture description and versioning as well, pitched at the level of understanding of your consumers.
Following this approach, I'd include a similar step in the actual extraction run itself; even a straightforward run date and a free-form comments field can be worth gold when you attempt to reconstruct trends a year later.
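To make that concrete, here is a minimal sketch of what recording run metadata in the target database could look like, using Python and SQLite; the table and column names (`etl_lineage`, `run_date`, `code_version`, and so on) are invented for illustration.

```python
import sqlite3
from datetime import datetime, timezone

# A minimal sketch, assuming SQLite and hypothetical table/column names:
# each ETL run writes a row describing what it read, what it wrote,
# which code version produced it, and a free-form comment.
conn = sqlite3.connect("warehouse.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS etl_lineage (
        run_id       INTEGER PRIMARY KEY AUTOINCREMENT,
        run_date     TEXT NOT NULL,   -- ISO-8601 timestamp of the run
        source       TEXT NOT NULL,   -- input table or feed
        target       TEXT NOT NULL,   -- output table
        code_version TEXT NOT NULL,   -- e.g. a git commit hash
        comment      TEXT             -- free-form notes, worth gold later
    )
""")

def record_run(source: str, target: str, code_version: str, comment: str = "") -> None:
    """Insert one lineage row at the end of an extraction run."""
    conn.execute(
        "INSERT INTO etl_lineage (run_date, source, target, code_version, comment) "
        "VALUES (?, ?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), source, target, code_version, comment),
    )
    conn.commit()

# Example: called from the extraction job itself.
record_run("crm.customers", "dw.dim_customer", "a1b2c3d", "added loyalty-tier column")
```

Because the job writes its own lineage row on every run, the record stays current without a separate maintenance step.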
This post was sourced from https://writers.stackexchange.com/a/33480. It is licensed under CC BY-SA 3.0.
0 comment threads
I worked on a project that sounds similar to yours, in which the ability to identify every process that touched every piece of data was vital: it minimized the amount of code validation that had to be redone whenever any other piece of code changed.
To show the flow of data through processes, we used a three-column vertical swim-lane diagram: the first column held the inputs, the middle column the processes, and the third column the outputs.
Of course, an output could be the input to another process, in which case we would draw a flow line looping it back to the first column. These diagrams could potentially get quite long, but the virtue was that you could follow any data back up an easily traced path through all the processes it passed through and identify any data that it was melded with by any process. And because the inputs, processes, and outputs were in separate columns, you could easily identify any input, output, or process you were interested in and trace its role in the system.
Updating it was reasonably simple as well. If you added a new process, you added a new row to the diagram, inserted the process in the middle column, and then reconnected the data input and output lines. If you added a new data source, you added a new row, inserted the new source, and connected it to the relevant processes.
We also used shading to distinguish between internal and external inputs and processes.
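For teams that would rather generate such a diagram than hand-draw it, here is a minimal sketch in Python that emits Graphviz DOT from a list of (input, process, output) records. All table and process names are invented for illustration, and Graphviz infers the columns from edge direction rather than pinning each node to a fixed lane the way a true swim-lane tool would.

```python
# Hypothetical lineage records: (input, process, output); all names invented.
FLOWS = [
    ("crm_export",   "load_customers", "dim_customer"),
    ("web_events",   "sessionize",     "fact_sessions"),
    ("dim_customer", "churn_model",    "churn_scores"),
]
EXTERNAL = {"crm_export", "web_events"}  # shaded differently, as in the answer above

def to_dot(flows) -> str:
    """Render lineage records as Graphviz DOT: boxes for data, ellipses for processes."""
    inputs  = {i for i, _, _ in flows}
    procs   = {p for _, p, _ in flows}
    outputs = {o for _, _, o in flows}
    lines = ["digraph lineage {", "  rankdir=LR;", "  node [shape=box];"]
    for name in inputs | outputs:
        fill = "lightgrey" if name in EXTERNAL else "white"
        lines.append(f'  "{name}" [style=filled, fillcolor={fill}];')
    for p in procs:
        lines.append(f'  "{p}" [shape=ellipse];')
    for i, p, o in flows:
        lines.append(f'  "{i}" -> "{p}" -> "{o}";')
    lines.append("}")
    return "\n".join(lines)

print(to_dot(FLOWS))  # pipe the output into `dot -Tsvg` to render
```

Adding a new process or source is then just a new tuple in `FLOWS`, which mirrors the row-insertion update described above.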
0 comment threads