Post History
Answer
#4: Attribution notice removed
Source: https://writers.stackexchange.com/a/32239 License name: CC BY-SA 3.0 License URL: https://creativecommons.org/licenses/by-sa/3.0/
#3: Attribution notice added
Source: https://writers.stackexchange.com/a/32239 License name: CC BY-SA 3.0 License URL: https://creativecommons.org/licenses/by-sa/3.0/
#2: Initial revision
I presume you have already tried the flags **--minimal** and **--ignore-all-space**.

If line breaks due to added/deleted/changed words are causing the problem with diff (making it look like changes appeared that are just format changes), I would suggest (since you can program this yourself easily) you pre-process both files, to produce two new files for comparison.

The idea is to have ONE line for diff to process per **_element_**: So if it were just a story, I'd say combine all lines in each paragraph as a single line, even if thousands of characters long, so what diff does is compare each **paragraph** for changes, not each **line**. (If you use diff's side-by-side output, set the **--width=nnn** flag so long lines are not truncated; I'd set **--width=16384** or so.)

You'll have to recognize, based on your formatting, where each type of **element** begins and ends. I am not familiar with play formatting, but in a screenplay I would want to compare slug lines (setting), exposition paragraphs, character dialogue labels, and character dialogue itself.

In a screenplay, I think these identifications could probably be done by counting leading spaces, and/or checking for ALL CAPS or keywords (INT., EXT., etc), and using blank lines as signals (if you are double-spaced, then ignore a single blank line, but two in a row signal the end of an element).

I would write a simple state machine (each state being what type of element I am building for output, or that I am seeking text for the start of another element, etc), but the logic is pretty simple no matter how you code it. So five lines of dialogue become a single line in the output file.

So in the output a new scene might have line breaks like this:

> EXT. Park Park, NIGHT
>
> [five lines of exposition combined here, setting up DIANE and EDGAR jogging, about to be mugged]
>
> DIANE
>
> [ten lines of dialogue combined on this line, a story about work]
>
> EDGAR
>
> [two or three lines combined here]

That should ignore reflow changes the editor did and identify only where word changes occurred. However, it becomes harder to identify where in the original files the text appeared. To solve that, I would add more to the output file, and use another flag of diff:

**--ignore-matching-lines=RE**

RE here is a regular expression. So the idea is, before (or after, your choice) you output any built-up element, report on a separate line the original line numbers from the original file, and the modified file, with a marker that won't be found in the text: Like the hashtag '#', or a pair of them. Since you know that will begin on the first character of the 'notation' line, set your regular expression to **'^#'**, and those lines will be ignored. So the line in the output might look like

> # 1023 1027

meaning line 1023 in the first file, line 1027 in the second. If you prefer, make it more complex by counting pages too. I haven't tried it, but I think these should be output as part of the context when reporting a changed line.

In side-by-side mode you can also tell diff to output only the changes, with **--suppress-common-lines**.

That may not be exactly what you want; but once you **identify** the changes, you might be able to highlight them somehow in the new script (color or bolding or whatever) as a practice script.

Since you are identifying elements and in particular wish to isolate changed dialogue, it would be easy in your code to do all of this but suppress the output (in the two files to compare) of anything BUT the dialogue of your one character. The hashtag for original file line numbers would remain the same. If you make the character name a variable or argument to the pre-processing program you write, you can make practice scripts for each character.

Have fun coding.
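
For anyone who wants a starting point, here is a minimal sketch of the pre-processor described above, in Python. It is only one way to do it and it rests on several assumptions of mine: the script is single-spaced, so any blank line ends an element; a slug line starts with INT. or EXT.; and a block whose first line is ALL CAPS is a character label followed by that character's dialogue. The names `blocks`, `elements` and `flatten` are hypothetical. As a simplification, each marker line carries only the line number from the file being processed, rather than the numbers from both files.

```python
import re

# Assumption: slug lines begin with INT. or EXT.
SLUG = re.compile(r"^(INT\.|EXT\.)")


def blocks(lines):
    """Yield (first_line_number, stripped_lines) for each run of non-blank lines."""
    start, buf = None, []
    for number, raw in enumerate(lines, start=1):
        text = raw.strip()
        if text:
            if not buf:
                start = number
            buf.append(text)
        elif buf:
            # A blank line closes the element being built.
            yield start, buf
            buf = []
    if buf:
        yield start, buf


def elements(lines):
    """Yield (line_number, kind, text, speaker), one entry per element."""
    for start, buf in blocks(lines):
        head = buf[0]
        if SLUG.match(head):
            yield start, "slug", " ".join(buf), None
        elif head.isupper():
            # Heuristic: an ALL CAPS first line is a character label.
            yield start, "label", head, head
            if len(buf) > 1:
                yield start + 1, "dialogue", " ".join(buf[1:]), head
        else:
            yield start, "exposition", " ".join(buf), None


def flatten(lines, character=None):
    """Return output lines: a '# N' marker, then the element on one line.

    If character is given, keep only that character's dialogue.
    """
    out = []
    for number, kind, text, speaker in elements(lines):
        if character is not None and (kind != "dialogue" or speaker != character):
            continue
        out.append(f"# {number}")
        out.append(text)
    return out
```

You would run `flatten` over the lines of each version of the script, write the results to two new files, and compare those with something like `diff --ignore-matching-lines='^#' old.flat new.flat` (the file names are placeholders). Passing `character="DIANE"` keeps only that character's dialogue, for a practice script. The ALL CAPS test is crude and will also catch things like transitions, so expect to adjust it to your own formatting.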