How to create a useful diff of a markdown writing project iteration
I wrote a play that I performed this summer, and have done significant re-writes for an upcoming second performance run. In order to help me relearn my lines, I think it would be useful to see which were changed/removed/added all together, rather than just reading through the new version and hoping for it to stick.
I wrote my play in markdown (using iaWriter), and I have both versions saved separately.
I'm also a programmer and am used to version control, so I tried making a git repository where I started with a txt file that had the original version, then pasted the new version on top of it and committed the changes. I was hoping the diff would be more useful than it was, but it's at least something! (I was hoping, for instance, to see lines that are 95% identical identified as the same line, but often they were not, because multiple lines show up between them.)
So I'm wondering whether there is a better way to get a useful diff of two plain text files, one that identifies word-based rather than line-based differences.
And if so (even just for git diffs), whether there is a way to output it in a form that can be saved in useful reading formats.
This post was sourced from https://writers.stackexchange.com/q/32235. It is licensed under CC BY-SA 3.0.
2 answers
I presume you have already tried the flags --minimal and --ignore-all-space.
If line breaks due to added/deleted/changed words are causing the problem with diff (making it look like changes appeared that are just format changes), I would suggest (since you can program this yourself easily) you pre-process both files, to produce two new files for comparison.
The idea is to have ONE line for diff to process per element. If it were just a story, I'd say combine all lines in each paragraph into a single line, even if it is thousands of characters long, so that diff compares each paragraph for changes, not each line.
(If you use diff's side-by-side output, use the --width=nnn flag so that long lines are not cut off; I'd set --width=16384 or so.)
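A minimal sketch of that pre-processing step in Python, assuming paragraphs are separated by one or more blank lines. The function name join_paragraphs is hypothetical, and the splitting rule is an assumption about how the files are laid out.

```python
import re

def join_paragraphs(text):
    """Collapse each blank-line-separated paragraph into one long line."""
    # Split wherever a blank (empty or whitespace-only) line separates paragraphs.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Replace internal line breaks and runs of whitespace with single spaces.
    return [" ".join(p.split()) for p in paragraphs if p.strip()]

if __name__ == "__main__":
    sample = "It was a dark\nand stormy night.\n\nThe rain fell\nin torrents."
    for line in join_paragraphs(sample):
        print(line)
```

Writing the returned lines to a new file for each version gives diff one line per paragraph to compare.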
You'll have to recognize, based on your formatting, where each type of element begins and ends. I am not familiar with play formatting, but in a screenplay I would want to compare slug lines (setting), exposition paragraphs, character dialogue labels, and character dialogue itself.
In a screenplay, I think these identifications could probably be done by counting leading spaces, and/or checking for ALL CAPS or keywords (INT., EXT., etc), and using blank lines as signals (if you are double-spaced, then ignore a single blank line, but two in a row signal the end of an element). I would write a simple state machine (each state being what type of element I am building for output, or that I am seeking text for the start of another element, etc), but the logic is pretty simple no matter how you code it.
Five lines of dialogue then become a single line in the output file, so a new scene in the output might have line breaks like this:
EXT. Park Park, NIGHT
[five lines of exposition combined here, setting up DIANE and EDGAR jogging, about to be mugged]
DIANE
[ten lines of dialogue combined on this line, a story about work]
EDGAR
[two or three lines combined here]
That should ignore reflow changes the editor did and identify only where word changes occurred.
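A simplified sketch of that element-building logic in Python. It is not a full state machine: the only state is whether a text element is currently being accumulated. It assumes a single blank line ends an element, slug lines start with INT. or EXT., and a character label is an ALL CAPS line of at most three words. All of those rules, and the name flatten_script, are assumptions to adapt to your own formatting.

```python
import re

SLUG = re.compile(r"^(INT\.|EXT\.)")

def flatten_script(lines):
    """Return one output line per element: slug, character label, or text block."""
    out, buffer = [], []

    def flush():
        # Emit the element being built, if any, as a single line.
        if buffer:
            out.append(" ".join(buffer))
            buffer.clear()

    for raw in lines:
        line = raw.strip()
        if not line:
            flush()                      # blank line ends the current element
        elif SLUG.match(line) or (line.isupper() and len(line.split()) <= 3):
            flush()                      # slug lines and labels stand alone
            out.append(line)
        else:
            buffer.append(line)          # dialogue or exposition: keep accumulating
    flush()
    return out
```

A short line of shouted dialogue in ALL CAPS would be mistaken for a character label under these rules, so the label test is the part most likely to need tuning.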
However, it becomes harder to identify where in the original files the text appeared. To solve that, I would add more to the output file, and use another flag of diff: --ignore-matching-lines=RE
RE here is a regular expression. The idea is that before (or after, your choice) you output any built-up element, you report on a separate line the original line numbers from the original file and the modified file, with a marker that won't be found in the text, like the hash sign '#' or a pair of them. Since you know the marker will be the first character of the 'notation' line, set your regular expression to '^#', and those lines will be ignored. So the line in the output might look like:
# 1023 1027
meaning line 1023 in the first file, line 1027 in the second. If you prefer, make it more complex by counting pages too. I haven't tried it, but I think these should be output as part of the context when reporting a changed line.
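A sketch of how the notation lines could be produced, assuming blank-line-separated paragraphs. Here each pre-processed file carries only its own line numbers, since that is all a single pass over one file knows; producing the paired form shown above would need a further step that matches elements across the two files. The function name and the marker parameter are hypothetical.

```python
def flatten_with_markers(text, marker="#"):
    """Join each paragraph into one line, preceded by a '<marker> <start line>' line."""
    out, buffer, start = [], [], None
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if line:
            if not buffer:
                start = number           # remember where this paragraph began
            buffer.append(line)
        elif buffer:
            out.append(f"{marker} {start}")
            out.append(" ".join(buffer))
            buffer = []
    if buffer:                           # last paragraph with no trailing blank line
        out.append(f"{marker} {start}")
        out.append(" ".join(buffer))
    return out
```

With both versions written out this way (say to old_flat.txt and new_flat.txt, names made up here), the comparison would be `diff --ignore-matching-lines='^#' old_flat.txt new_flat.txt`. Two caveats: GNU diff only drops a change when every line in it matches the expression, and Markdown headings also start with '#', so a different marker may be safer for a Markdown source.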
You can also tell diff to output only the changes with --suppress-common-lines (this applies to its side-by-side output).
That may not be exactly what you want; but once you identify the changes, you might be able to highlight them somehow in the new script (color or bolding or whatever) as a practice script.
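One way to do that highlighting, sketched with Python's standard difflib module: compare the old and new versions of a passage word by word and wrap the changed words in Markdown bold. The name highlight_changes is hypothetical, and using ** as the highlight is an assumption based on the script being written in Markdown.

```python
import difflib

def highlight_changes(old, new):
    """Return `new` with words that differ from `old` wrapped in ** (Markdown bold)."""
    old_words, new_words = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    pieces = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        chunk = " ".join(new_words[j1:j2])
        if tag == "equal":
            pieces.append(chunk)
        elif tag in ("replace", "insert"):
            pieces.append(f"**{chunk}**")
        # "delete" adds nothing: those words no longer exist in the new text
    return " ".join(pieces)

if __name__ == "__main__":
    print(highlight_changes("I went to the park", "I ran to the old park"))
```

Because the text is split on whitespace, line breaks inside the passage are lost, so this works best when applied to one paragraph or one speech at a time. Words that were only deleted leave no trace in the output.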
Since you are identifying elements and in particular wish to isolate changed dialogue, it would be easy in your code to do all of this but suppress the output (in the two files to compare) of anything BUT the dialogue of your one character. The hashtag for original file line numbers would remain the same.
If you make the character name a variable or an argument to the pre-processing program you write, you can make practice scripts for each character.
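A sketch of that per-character filter, assuming the input has already been flattened to one element per line, that a character label is an ALL CAPS line, and that the speech is the single line right after it. It also assumes the '#' line-number lines are not present in the input; carrying them through would take extra handling. The function name is hypothetical.

```python
def lines_for_character(elements, character):
    """Keep only the dialogue line that directly follows the given character's label."""
    kept, speaking = [], False
    for element in elements:
        if element.isupper():
            # An all-caps element is treated as a character label.
            speaking = element.strip() == character.upper()
        elif speaking:
            kept.append(element)
            speaking = False             # the speech was flattened to one line
    return kept
```

Running this on both flattened versions with the same character name gives two much shorter files to diff.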
Have fun coding.
I'm not sure about using Git commit. I don't use it much.
How about tokenizing your text in Python? Then calculate the entropy per sentence. This would show differences between sentences. Alternatively, you could do this for words or n-grams. The output can be stored in a list and written to a txt file.
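# Python 3: Shannon entropy (bits per character) of a string s
import math
from collections import Counter

def entropy(s):
    # Count each character, then sum -p * log2(p) over the distribution
    p, lns = Counter(s), float(len(s))
    return -sum(count / lns * math.log(count / lns, 2) for count in p.values())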
Or perhaps use one of the fuzzy search algorithms to find sentences that are not an exact match but are a fuzzy one (e.g. one substituted character).
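A sketch of the fuzzy-match idea using difflib.get_close_matches from Python's standard library, assuming the two versions have already been split into lists of sentences. The function name, the 0.6 cutoff, and the choice of difflib rather than a dedicated fuzzy search library are all assumptions.

```python
import difflib

def pair_changed_sentences(old_sentences, new_sentences, cutoff=0.6):
    """For each new sentence with no exact match, find its closest old sentence."""
    old_set = set(old_sentences)
    pairs = []
    for sentence in new_sentences:
        if sentence in old_set:
            continue                     # unchanged, nothing to relearn
        match = difflib.get_close_matches(sentence, old_sentences, n=1, cutoff=cutoff)
        # None as the old side means no old sentence was similar enough.
        pairs.append((match[0] if match else None, sentence))
    return pairs
```

Each returned pair is (old sentence or None, new sentence), so reworded lines show up next to their originals and brand-new lines are paired with None.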
Perhaps even easier, but not as powerful, Google Docs has a version history where you can view all edits for each day you have edited the file. It is also possible to export to Markdown.
Let us know your solution.
This post was sourced from https://writers.stackexchange.com/a/32241. It is licensed under CC BY-SA 3.0.