Diff Tool Knows What You Mean

We will admit to not being particularly artistic, but we do remember an art teacher telling us that sometimes it is better to draw what isn’t there instead of what’s there — a concept known as negative space. [Wilfred] makes a similar point when explaining his “fantastic diff” tool called, appropriately, difftastic. He points out that when comparing two programs, the goal isn’t so much to determine what changed, but rather what stayed the same. The more you can identify as the same, the less you have to show as a change.

The tool compares source code in a smart way, assisted by tree-sitter which has many different languages already parsed, at least well enough for this purpose. According to [Wilfred’s] post the tool supports 44 different languages ranging from bash and YAML, Verilog to VHDL, and C++ to Rust, among others.

Of course, the tool by itself is worth taking note of. But the real gems in the write-up are things like tree-sitter and a lucid description of the algorithm (borrowed from autochrome) for working out the minimal set of changes.

The code is still under development and the output is not always as clear as he would like. Still, a pretty good tool and a great write-up on the development challenges.

Although Verilog and VHDL are a start, we really want diff for schematics. Oh, and PCB layouts, don’t forget those,e either.

20 thoughts on “Diff Tool Knows What You Mean

  1. Interesting topic. The logical approach is nice, though the visual presentation (and color choice) of WinDiff seems a lot easier on the eyes. The dark background and colorful text is straining to the eyes. It’s much nicer to have thick colored bars of somewhat pastel colors as background color, with neutral text color.

    Good color contrast, yet not overwhelming. I find people often struggle with the choice of easy to distinguish color and good design, also in scientific publication. A real issue.

    Visual diff is something I will try to use more, lots of things really fail not based on the tools, but based on the time you have to get used to them and integrate them in your workflow (getting around bugs etc. and getting used to it).

    After reading the article:
    If you have a parser that understands the semantics of the sourcecode, you can reorganize it as well and pretty print it to make comparison easier. To make a proper structural diff tool that respects the intent as well as the original formatting is quite complex, though.

    It’s also a shame Pascal/Delphi is not supported by those libraries.

      1. Good default colors matter for accessibility, since it’s a pain to change it everywhere. This is not a matter of taste, but usability.

        Also people will take screenshots for articles or books or tutorials, where the colors cannot be selected.

        1. “Good default colors matter for accessibility” – In an OS or OS Distribution.

          “it’s a pain to change it everywhere” – Indeed! This is why to the fullest extent possible every terminal or gui application should follow the terminal or desktop environment’s color scheme settings respectively. If you are thinking about this much at the application level you are doing it wrong.

          Don’t ignore the user’s system-wide preferences to force upon them the theme that someone else finds most pleasant or that someone who is differently sighted finds necessary. They will have their own preferences on their own computers.

          “Also people will take screenshots for articles or books or tutorials, where the colors cannot be selected.” – If one wants to post or publish pictures of their screen and they are concerned that their readers might not like or have the ability to see their eclectic color choices.. Well that’s up to that person to decide their priorities and set their theme accordingly. This is NOT the application designer’s problem.

          1. If you need more information than is in the system-theme, such as additional colors for color coding then consider calculating complimentary colors of the various colors that are part of the theme to automatically create something that goes well with it.

  2. I love the idea behind the tool but the implementation of the tool itself is lacking as it goes through some unnecessary steps as a result of using tree-sitter. I’m sure it’s fine for minor comparisons but with project-wide parsing the delay will be quite noticeable.

  3. i guess i prefer braindead tools, this seems way overwrought to me. the problem i have with diff in real life is that sometimes for large changesets it keys on the wrong unchanged parts and winds up presenting an indecipherable mess, showing the changes for a function that changed a lot mixed in with code from a function that didn’t change at all. i could imagine this doing better at that but i also could easily imagine it being much harder to figure out what it did when it screws the pooch. real changes are a wild world!

    but i have a story.

    a year or two ago, i had a diff that took too long to complete. it was big files but still, i couldn’t believe diff was being janky at me. the thing i love about diff’s simplicity is that it always just works, it doesn’t get into O(n^2) on magic input.

    so i did “apt update && apt install diff” and the new version didn’t have that problem because “diff going O(n^2)” really isn’t any more acceptable to the diff maintainers than it is to me. and everyone lived happily ever after.

    1. I’m intrigued as to how a diff operation could be anything less than O(n^2).
      For each character (shortest substring) in src1 you need to eventually compare it against each character in src2, hence n*n comparisons.
      Even if you use vectorisation that doesn’t give you an nx speedup.

      Heck, even if you only diff a hash of each line, that’s still 2n for the raw hash generation, and then n*n for the line by line hash compares.

      But perhaps you mean it should not become O(n^3)

      1. You can run a rolling checksum on both file. They will automatically synchronize on similar content, so you’re left with the windows where there’s a difference. In these windows, you can run a standard diff algorithm so you don’t have a O(N^2) for the whole document, but only for the changed part (which are usually sparse). The whole algorithm run in O(n log n) time.

      2. you don’t need to eventually compare every char in src1 with every char in src2. if diff could really understand re-ordering, i think it might devolve to that. but since it assumes that src1 and src2 are in the same order (re-orderings are represented as insertion here and deletion there), once it finds a matching sequence, it doesn’t ever compare things before it with things after it. on top of that, like sweethack said, you can use various hash/checksum approaches to compare larger blocks at once.

        i don’t know what the actual final complexity of the algorithm is but i use diff a lot, and sometimes on very large files, and it is not typically O(n^2). and when it was, it really was just for a short moment because of a novel feature that they hadn’t finished refining.

      1. I have downloaded tree-sitter-linux-x64.gz . I have `npm i tree-sitter-verilog` . What should I do next, follow the “http://tree-sitter.github.io/tree-sitter/creating-parsers” ? I am a little bit lost.

Leave a Reply

Please be kind and respectful to help make the comments section excellent. (Comment Policy)

This site uses Akismet to reduce spam. Learn how your comment data is processed.