Introduction
Difftastic is a structural diff tool that understands syntax. It supports over 30 programming languages and when it works, it's fantastic.
Difftastic is open source software (MIT license) and available on GitHub.
This copy of the manual describes version 0.64.0. The changelog records which features and bug fixes are in each version.
This manual is also available in Chinese.
Syntactic Diffing
Difftastic detects the language, parses the code, and then compares the syntax trees. Let's look at an example.
// old.rs
let ts_lang = guess(path, guess_src).map(tsp::from_language);
// new.rs
let ts_lang = language_override
    .or_else(|| guess(path, guess_src))
    .map(tsp::from_language);
$ difft old.rs new.rs
1 1 let ts_lang = language_override
. 2     .or_else(|| guess(path, guess_src))
. 3     .map(tsp::from_language);
Notice how difftastic recognises that .map is unchanged, even though
it's now on a new line with whitespace.
A line-oriented diff does a much worse job here.
$ diff -u old.rs new.rs
@@ -1 +1,3 @@
-let ts_lang = guess(path, guess_src).map(tsp::from_language);
+let ts_lang = language_override
+    .or_else(|| guess(path, guess_src))
+    .map(tsp::from_language);
Some line-oriented diff tools also highlight word changes (e.g. GitHub
or git's --word-diff). They still don't understand the code
though. Difftastic will always find matched delimiters: you can see
the closing ) from or_else has been highlighted.
Fallback Line-Oriented Diffing
If input files are not in a format that difftastic understands, it uses a conventional line-oriented diff with word highlighting.
Difftastic will also use line-oriented diffing when given extremely large inputs.
Installation
Diffstatic can be installed as pre-built binaries or using various package managers.
Pre-Built Binaries
Diffstatic releases are published as GitHub releases with pre-built binaries for Windows, macOS and Linux.
Open the latest release page, download the file matching your OS and CPU architecture, and extract the difft executable application file.
Package Manager
macOS
If you're a Homebrew user, you can install
difftastic with brew.
$ brew install difftastic
Linux and Unix
If you're an Arch Linux user, you can install
difftastic
with pacman.
$ sudo pacman -S difftastic
If you're a Nix user, you can install
difftastic
with nix-env.
$ nix-env --install difftastic
If you're a Fedora user, you can install difftastic with dnf.
$ sudo dnf install difftastic
If you're a FreeBSD user, you can install
difftastic
with pkg.
$ sudo pkg install difftastic
Windows
If you're a Windows user using Windows Package Manager (WinGet), you can install difftastic with winget.
$ winget install difftastic
If you're a Windows user using Scoop, you can install
difftastic
with scoop.
$ scoop install difftastic
If you're a Windows user using Chocolatey, you can install
difftastic
with choco.
$ choco install difftastic
Full Package Listing
This table lists all the platforms that have packaged difftastic.
Installing From Source
Build Requirements
Difftastic is written in Rust, so you will need Rust installed. I recommend rustup to install Rust. Difftastic requires Rust version 1.74.1 or later.
You will also need a C++ compiler that supports C++14. If you're using GCC, you need at least version 8.
Build
You can download and build difftastic on crates.io with Cargo (which is part of Rust).
$ cargo install --locked difftastic
Difftastic uses the cc crate for building C/C++ dependencies. This
allows you to use environment variables CC and CXX to control the
compiler used (see the cc
docs).
See contributing for instructions on debug builds.
(Optional) Install a MIME Database
If a MIME database is available, difftastic will use it to detect
binary files more accurately. This is the same database used by the
file command, so you probably already have it.
The MIME database path is specified in the XDG specification. The database should be at one of the following paths:
- /usr/share/mime/magic
- /usr/local/share/mime/magic
- $HOME/.local/share/mime/magic
Packaging Difftastic
This page contains recommendations for people creating a difftastic package.
Note that the difftastic author only provides the source code and the prebuilt binaries on GitHub. Packages have been created by other people -- thank you!
Packaging The Binary
Difftastic can be built with cargo. The compiled binary will be at
target/release/difft when using the following command.
$ cargo build --release
C library dependencies are built with the cc crate, which exposes
various environment
variables
to configure the C toolchain (e.g. CFLAGS).
Reproducible Builds
Difftastic's build script (the build.rs file) uses Rayon to build C
libraries in parallel, which can lead to minor ordering changes in the
final binary.
You can avoid this by disabling Rayon parallelism.
$ RAYON_NUM_THREADS=1 cargo build --release
MIME Database
Difftastic depends on tree_magic_mini, which accesses the MIME database on the current system. The MIME database is used to recognise file types, so difftastic does not try to compoare binary files as text.
This means that the difftastic package should depend on a MIME database package, if available.
Difftastic respects the XDG base
specification
to find the MIME database files. These files are typically at
/usr/share/mime/, /usr/local/share/mime/ or
/opt/homebrew/share/mime/.
Man Page
As of difftastic 0.58, a man page is available. See the file
difft.1.
This file is generated from difft.1.md, but the generated difft.1
is included in the repository for convenience.
The Manual
Please consider including the difftastic manual with your
package. These are HTML files that can be generated with mdbook. The
following command generates HTML at manual/book/.
$ cd manual
$ mdbook build
manual/book.toml also references a script
replace_version_placeholder.sh that replaces occurrences of
DFT_VERSION_HERE in the manual. For packaging, it may be easier to
remove the configuration from book.toml and replace the text
directly.
-[preprocessor.replace-version-placeholder]
-command = "./replace_version_placeholder.sh"
$ export CURRENTVERSION="7.8.9"
$ sed -i "s/DFT_VERSION_HERE/$CURRENTVERSION/g" -i src/introduction.md
Testing
If your packaging tool supports testing, consider running the difftastic unit tests.
# Run the normal tests.
$ cargo test
# Run the tests that depend on the
# MIME database being present.
$ cargo test -- --ignored
Usage
This page describes how to use the difft binary directly. See also
the Git, Mercurial and
Fossil pages for instructions on how to configure
them to use difftastic.
File Arguments
Diffing Files
$ difft FIRST-FILE SECOND-FILE
# For example:
$ difft sample_files/simple_1.js sample_files/simple_2.js
Diffing Directories
$ difft FIRST-DIRECTORY SECOND-DIRECTORY
# For example:
$ difft sample_files/dir_1/ sample_files/dir_2/
Difftastic will recursively walk the two directories, diffing files with the same name.
The --skip-unchanged option is useful when diffing directories that
contain many unchanged files.
Reading stdin
You can read a file from stdin by specifying - as the file path.
$ difft - SECOND-FILE
# For example:
$ cat sample_files/simple_1.js | difft - sample_files/simple_2.js
Files With Conflicts
(Added in version 0.50.)
If you have a file with <<<<<<< conflict markers, you can pass it as
a single argument to difftastic. Difftastic will construct the two
file states and diff those.
$ difft FILE-WITH-CONFLICTS
# For example:
$ difft sample_files/conflicts.el
Configuration Options
Every difftastic option can be set with a command line argument or an
environment variable. For example, DFT_BACKGROUND=light is equivalent to
--background=light.
Environment variables are often useful when using VCS tools like git,
because they invoke the difft binary directly.
For a full list of configuration options, see --help.
$ difft --help
...
OPTIONS:
        --background <BACKGROUND>
            Set the background brightness. Difftastic will prefer brighter colours on dark backgrounds.
            [env: DFT_BACKGROUND=]
            [default: dark]
            [possible values: dark, light]
...
Exit Codes
2: Difftastic was given invalid arguments. This includes invalid usage (e.g. the wrong number of arguments) as well as paths that difftastic cannot read (e.g. non-existent paths or insufficient permissions).
1: When called with --exit-code, difftastic will return an exit code
of 1 when it finds any syntactic changes (in text files) or byte changes
(in binary files).
0: All other cases.
Git
Difftastic can be used an external diff command in git, allowing difftastic to be used with any git subcommand.
Warning: git v2.43.1 and earlier can crash when using an external diff and file permissions have changed.
If you can't upgrade git, use the difftool configuration described
below.
One-Off Usage
You can set the diff.external configuration option when running git diff, or set the
GIT_EXTERNAL_DIFF environment variable.
View uncommitted changes with difftastic:
$ git -c diff.external=difft diff
Other git commands also require the --ext-diff argument in order to
use diff.external.
View changes from the most recent commit with difftastic:
$ git -c diff.external=difft show --ext-diff
View changes from recent commits on the current branch with difftastic:
$ git -c diff.external=difft log -p --ext-diff
Regular Usage
If you like difftastic, we recommend that you configure git aliases so you can use difftastic more easily.
[alias]
    # Difftastic aliases, so `git dlog` is `git log` with difftastic and so on.
    dlog = -c diff.external=difft log --ext-diff
    dshow = -c diff.external=difft show --ext-diff
    ddiff = -c diff.external=difft diff
The author likes the following additional aliases to reduce typing:
[alias]
    # `git log` with patches shown with difftastic.
    dl = -c diff.external=difft log -p --ext-diff
    # Show the most recent commit with difftastic.
    ds = -c diff.external=difft show --ext-diff
    # `git diff` with difftastic.
    dft = -c diff.external=difft diff
Difftastic By Default
If you want to use difftastic as your default diff tool, add the
following to your ~/.gitconfig.
[diff]
    external = difft
This changes git diff to use difftastic, and other commands now only
require --ext-diff.
$ git diff
$ git show --ext-diff
$ git log -p --ext-diff
If you've configured difftastic as the default diff tool, you can
opt-out for an individual command with --no-ext-diff.
$ git diff --no-ext-diff
Difftool
Git also has a difftool feature which allows users to invoke CLI or GUI comparison tools.
For best results, we recommend using -c diff.external=difft as
described above. Git passes more information to the external diff,
including file permission changes and rename information, so
difftastic can show more information.
To define a difftool named difftastic, add the following to your
~/.gitconfig.
[difftool "difftastic"]
    # See `man git-difftool` for a description of MERGED, LOCAL and REMOTE.
    cmd = difft "$MERGED" "$LOCAL" "abcdef1" "100644" "$REMOTE" "abcdef2" "100644"
You can now use difftastic as a difftool:
$ git difftool -t difftastic
For the best results when using difftastic as a difftool, we recommend the following additional git configuration:
[difftool]
    # Run the difftool immediately, don't ask 'are you sure' each time.
    prompt = false
[pager]
    # Use a pager if the difftool output is larger than one screenful,
    # consistent with the behaviour of `git diff`.
    difftool = true
[diff]
    # Set difftastic as the default difftool, so we don't need to specify
    # `-t difftastic` every time.
    tool = difftastic
Mercurial
Mercurial supports external diff
tools with the
Extdiff extension. Enable it by adding an entry to extensions in
your .hgrc.
[extensions]
extdiff =
You can then run hg extdiff -p difft instead of hg diff
(assumes the difft binary is on your $PATH).
You can also define an alias to run difftastic with hg. Add the
following to your .hgrc to run difftastic with hg dft.
[extdiff]
cmd.dft = difft
# You can add further options which will be passed to the command line, e.g.
# opts.dft = --background light
All options of hg diff are also supported by hg dft; for example,
hg dft --stat will show statistics of changed lines and hg dft -r 42 -r 45
will show the diff between two revisions.
hg log -p
Mercurial does not have a way of changing the default diff tool, at least to the author's knowledge.
If you just want to view the diff of the most recent commit, you can use the following.
hg dft -r .^ -r .
This is equivalent to hg log -l 1 -p, although it does not show the
commit message.
Fossil
Fossil supports external diff commands by setting diff-command for the current repository:
fossil settings diff-command difft
To use difftastic for all repositories, use --global:
fossil settings diff-command --global difft
Skip difftastic on Fossil
If you set difftastic as Fossil's diff command, but you need to use Fossil's internal diff once, use -i to skip difftastic once:
fossil diff -i
If you want to remove difftastic from one repository (or globally), use unset:
fossil unset diff-command
unset also supports the --global flag.
Jujutsu
Jujutsu supports external diff commands.
To use difftastic for diffing in jujutsu, add the following to your user configuration file.
[ui]
diff.tool = ["difft", "--color=always", "$left", "$right"]
Languages Supported
This page lists all the languages supported by difftastic. You can
also view the languages supported in your current installed version
with difft --list-languages.
Programming Languages
Structured Text Formats
Language Detection
Difftastic guesses the language used based on the file extension, file name, and the contents of the first few lines.
To see the languages available, and the associated file names, use the
--list-languages option.
$ difft --list-languages
...
XML
 *.ant *.csproj *.plist *.resx *.svg *.ui *.vbproj *.xaml *.xml *.xsd *.xsl *.xslt *.zcml App.config nuget.config packages.config .classpath .cproject .project
YAML
 *.yaml *.yml
Zig
 *.zig
You can override language detection for specific file globs using the
--override option.
$ difft --override=GLOB:NAME FIRST-FILE SECOND-FILE
# For example, treating .h files as C rather than C++:
$ difft --override=*.h:c sample_files/preprocessor_1.h sample_files/preprocessor_2.h
See difft --help for more examples of --override usage.
Internals: Parsing
Difftastic uses tree-sitter to build a parse tree. The parse tree is then converted to a simpler tree which can be diffed.
Parsing with Tree-sitter
Difftastic relies on tree-sitter to understand syntax. You can view
the parse tree that tree-sitter produces using the --dump-ts
flag.
$ difft --dump-ts sample_files/javascript_simple_1.js | head
program (0, 0) - (7, 0)
  comment (0, 0) - (0, 8) "// hello"
  expression_statement (1, 0) - (1, 6)
    call_expression (1, 0) - (1, 5)
      identifier (1, 0) - (1, 3) "foo"
      arguments (1, 3) - (1, 5)
        ( (1, 3) - (1, 4) "("
        ) (1, 4) - (1, 5) ")"
    ; (1, 5) - (1, 6) ";"
  expression_statement (2, 0) - (2, 6)
Simplified Syntax
Difftastic converts the tree-sitter parse tree to a simplified syntax tree. The syntax tree is a uniform representation where everything is either an atom (e.g. integer literals, comments, variable names) or a list (consisting of the open delimiter, children and the close delimiter).
The flag --dump-syntax will display the syntax tree generated for a
file.
$ difft --dump-syntax sample_files/before.js
[
    Atom id:1 {
        content: "// hello",
        position: "0:0-8",
    },
    List id:2 {
        open_content: "",
        open_position: "1:0-0",
        children: [
          ...
Conversion Process
The simple representation of the difftastic parse tree makes diffing much easier. Converting the detailed tree-sitter parse tree is a recursive tree walk, treating tree-sitter leaf nodes as atoms. There are two exceptions.
(1) Tree-sitter parse trees sometimes include unwanted structure. Some grammars consider string literals to be a single token, whereas others treat strings as a complex structure where the delimiters are separate.
tree_sitter_parser.rs uses atom_nodes to mark specific tree-sitter
node names as flat atoms even if the node has children.
(2) Tree-sitter parse trees include open and closing delimiters as
tokens. A list [1] will have a parse tree that includes [ and ]
as nodes.
$ echo '[1]' > example.js
$ difft --dump-ts example.js
program (0, 0) - (1, 0)
  expression_statement (0, 0) - (0, 3)
    array (0, 0) - (0, 3)
      [ (0, 0) - (0, 1) "["
      number (0, 1) - (0, 2) "1"
      ] (0, 2) - (0, 3) "]"
tree_sitter_parser.rs uses open_delimiter_tokens to ensure that
[ and ] are used as delimiter content in the enclosing list,
rather than converting them to atoms.
Difftastic can match up atoms that occur in different parts of the
simplified syntax tree. If e.g. a [ is treated as an atom,
difftastic might match it with another [ elsewhere. The resulting
diff would be unbalanced, highlighting different numbers of open and
close delimiters.
Lossy Syntax Trees
The simplified syntax tree only stores node content and node position. It does not store whitespace between nodes, and position is ignored during diffing.
Internals: Diffing
Difftastic treats diff calculations as a route finding problem on a directed acyclic graph.
Graph Representation
A vertex in the graph represents a position in two syntax trees.
The start vertex has both positions pointing to the first syntax node in both trees. The end vertex has both positions just after the last syntax node in both trees.
Consider comparing A with X A.
START
+---------------------+
| Left: A  Right: X A |
|       ^         ^   |
+---------------------+
END
+---------------------+
| Left: A  Right: X A |
|        ^           ^|
+---------------------+
From the start vertex, we have two options:
- we can mark the first syntax node on the left as novel, and advance to the next syntax node on the left (vertex 1 above), or
- we can mark the first syntax node on the right as novel, and advance to the next syntax node on the right (vertex 2 above).
            START
            +---------------------+
            | Left: A  Right: X A |
            |       ^         ^   |
            +---------------------+
                   /       \
     Novel atom L /         \ Novel atom R
1                v       2   v
+---------------------+  +---------------------+
| Left: A  Right: X A |  | Left: A  Right: X A |
|        ^        ^   |  |       ^           ^ |
+---------------------+  +---------------------+
Choosing "novel atom R" to vertex 2 will turn out to be the best choice. From vertex 2, we can see three routes to the end vertex.
            2
            +---------------------+
            | Left: A  Right: X A |
            |       ^           ^ |
            +---------------------+
                   /    |   \
     Novel atom L /     |    \ Novel atom R
                 v      |     v
+---------------------+ | +---------------------+
| Left: A  Right: X A | | | Left: A  Right: X A |
|        ^          ^ | | |       ^            ^|
+---------------------+ | +---------------------+
  |                     |                    |
  | Novel atom R        | Nodes match        | Novel atom L
  |                     |                    |
  |         END         v                    |
  |         +---------------------+          |
  +-------->| Left: A  Right: X A |<---------+
            |        ^           ^|
            +---------------------+
Comparing Routes
We assign a cost to each edge. Marking a syntax node as novel is worse than finding a matching syntax node, so the "novel atom" edge has a higher cost than the "syntax nodes match" edge.
The best route is the lowest cost route from the start vertex to the end vertex.
Finding The Best Route
Difftastic uses Dijkstra's algorithm to find the best (i.e. lowest cost) route.
One big advantage of this algorithm is that we don't need to construct the graph in advance. Constructing the whole graph would require exponential memory relative to the number of syntax nodes. Instead, vertex neighbours are constructed as the graph is explored.
There are lots of resources explaining Dijkstra's algorithm online, but I particularly recommend the graph search section of Red Blob Games.
Tricky Cases
Tree diffing is challenging in some situations. This page discusses difficult cases, and how difftastic handles them.
Not all of these cases work well in difftastic yet.
Adding Delimiters
;; Before
x
;; After
(x)
Possible result: (x)
Desired result: (x)
This is tricky because x has changed its depth in the tree, but x
itself is unchanged.
Not all tree diff algorithms handle this case. It is also challenging to display this case clearly: we want to highlight the changed delimiters, but not their content. This is challenging in larger expressions.
Difftastic: Difftastic considers nodes to be equal even at different depths, achieving the desired result in this case.
Changing Delimiters
;; Before
(x)
;; After
[x]
Desired result: (x), [x]
As with the wrapping case, we want to highlight the delimiters rather
than the x.
Difftastic: Difftastic handles this correctly through its tree diffing.
Expanding Delimiters
;; Before
(x) y
;; After
(x y)
Possible result 1: (x) y, (x y)
Possible result 2: (x) y, (x y)
It's not clear which is better in this case.
Difftastic: Difftastic currently shows result 2, but this case is sensitive to the cost model. Some previous versions of difftastic have shown result 1.
Contracting Delimiters
;; Before
(x y)
;; After
(x) y
This case is similar to the expanding delimiter case.
Disconnected Delimiters
;; Before
(foo (bar))
;; After
(foo (novel) (bar))
Desired result: (foo (novel) (bar))
It is easy to end up with
(foo (novel) (bar)),
where a later pair of delimiters are chosen.
Rewrapping Large Nodes
;; Before
[[foo]]
(x y)
;; After
([[foo]] x y)
We want to highlight [[foo]] being moved inside the
parentheses. However, a naive syntax differ prefers to consider a removal
of () in the before and an addition of () in the after to be more
minimal diff.
(Reported as issue 44.)
Reordering Within A List
;; Before
(x y)
;; After
(y x)
Desired result: (y x)
We want to highlight the list contents and not the delimiters.
Middle Insertions
// Before
foo(bar(123))
// After
foo(extra(bar(123)))
Desired result: foo(extra(bar(123)))
We want to consider both foo and bar to be unchanged. This case is
challenging for diffing algorithms that do a bottom-up then top-down
matching of trees.
Punctuation Atoms
// Before
foo(1, bar)
// After
foo(bar, 2)
Possible result: foo(bar, 2)
Desired result: foo(bar, 2)
There are two atoms inside the () that we could consider as
unchanged, either the bar or the ,. (We can't consider both to be
unchanged as they're reordered.)
We want to consider bar to be unchanged, as it's a more important
atom than the , punctuation atom. Doing this is in a
language-agnostic way is difficult, so difftastic has a small list of
punctuation characters that always get lower priority than other
atoms.
Sliders (Flat)
Sliders are a common problem in text based diffs, where lines are matched in a confusing way.
They typically look like this. The diff has to arbitrarily choose a line containing delimiter, and it chooses the wrong one.
+ }
+
+ function foo () {
  }
git-diff has some heuristics to reduce the risk of this (e.g. the "patience diff"), but it can still occur.
There's a similar problem in tree diffs.
;; Before
A B
C D
;; After
A B
A B
C D
Possible result:
A B
A B
C D
Preferred result:
A B
A B
C D
Ideally we'd prefer marking contiguous nodes as novel. From the perspective of a longest-common-subsequence algorithm, these two choices are equivalent.
Sliders (Nested)
// Before
old1(old2)
// After
old1(new1(old2))
Possible result: old1(new1(old2))
Desired result: old1(new1(old2))
The correct answer depends on the language. Most languages want to prefer the inner delimiter, whereas Lisps and JSON prefer the outer delimiter.
Minimising Depth Changes
// Before
if true {
  foo(123);
}
foo(456);
// After
foo(789);
Do we consider foo(123) or foo(456) to match with foo(789)?
Difftastic prefers foo(456) by preferring nodes at the same nesting depth.
Replacements With Minor Similarities
// Before
function foo(x) { return x + 1; }
// After
function bar(y) { baz(y); }
Possible result: function bar(y) { baz(y); }
In this example, we've deleted a function and written a completely
different one. A tree-based diff could match up the function and the
outer delimiters, resulting in a confusing display showing lots of
small changes.
As with sliders, the replacement problem can also occur in textual line-based diffs. Line-diffs struggle if there are a small number of common lines. The more precise, granular behaviour of tree diffs makes this problem much more common though.
Matching Substrings In Comments
// Before
/* The quick brown fox. */
foobar();
// After
/* The slow brown fox. */
foobaz();
foobar and foobaz are completely different, and their common
prefix fooba should not be matched up. However, matching common
prefixes or suffixes for comments is desirable.
Multiline Comments
// Before
/* Hello
 * World. */
// After
if (x) {
  /* Hello
   * World. */
}
The inner content of these two comments is technically different. We want to treat them as identical however.
Reflowing Doc Comments
Block comments have prefixes that aren't meaningful.
// Before
/* The quick brown fox jumps 
 * over the lazy dog. */
// After
/* The quick brown fox immediately
 * jumps over the lazy dog. */
The inner content has changed from jumps * over to immediately * jumps over. However, the * is decorative and we don't care that
it's moved.
Small Changes To Large Strings
// Before
"""A very long string
with lots of words about
lots of stuff."""
// After
"""A very long string
with lots of NOVEL words about
lots of stuff."""
It would be correct to highlight the entire string literal as being removed and replaced with a new string literal. However, this makes it hard to see what's actually changed.
It's clear that variable names should be treated atomically, and comments are safe to show subword changes. It's not clear how to handle a small change in a 20 line string literal.
It's tempting to split strings on spaces and diff that, but users
still want to know when whitespace changes inside strings. " " and
"  " are not the same.
Autoformatter Punctuation
// Before
foo("looooong", "also looooong");
// After
foo(
  "looooong",
  "novel",
  "also looooong",
);
Autoformatters (e.g. prettier) will sometimes add or remove punctuation when formatting. Commas and parentheses are the most common.
Syntactic diffing can ignore whitespace changes, but it has to assume punctuation is meaningful. This can lead to punctuation changes being highlighted, which may be quite far from the relevant content change.
Unordered Data Types
// Before
set(1, 2)
// After
set(2, 1)
Users may expect difftastic to find no changes here. This is difficult for several reasons.
For programming languages, side effects might make the order
relevant. set(foo(), bar()) might behave differently to set(bar(), foo()).
For configuration languages like JSON or YAML, some parser
implementations do actually expose ordering information
(e.g. object_pairs_hook=OrderedDict in Python, or serde_json's
preserve_order feature in Rust).
To make matters worse, unordered tree diffing is NP-hard.
For the unordered case, it turns out that all of the problems in general are NP-hard. Indeed, the tree edit distance and alignment distance problems are even MAX SNP-hard.
Difftastic: Difftastic considers ordering to be meaningful everywhere, so it will always report ordering changes.
Novel Blank Lines
Blank lines are challenging for syntactic diffs. We are comparing syntactic tokens, so we don't see blank lines.
// Before
A
B
// After
A
B
Generally we want syntactic diffing to ignore blank lines. In this first example, this should show no changes.
This is occasionally problematic, as it can hide accidental code reformatting.
// Before
A
B
// After
A
X
Y
B
In this second example, we've inserted X and Y and a blank line. We want to highlight the blank line as an addition.
// Before
A
B
// After
A
X
B
In this third example, the syntactic diffing only sees an addition. From the user's perspective, there has also been a removal of two blank lines.
Invalid Syntax
There's no guarantee that the input we're given is valid syntax. Even if the code is valid, it might use syntax that isn't supported by the parser.
Difftastic: Difftastic will fall back to a line-oriented diff if any parse errors occur, to avoid diffing incomplete syntax trees. When this occurs, the file header reports the error count.
$ difft sample_files/syntax_error_1.js sample_files/syntax_error_2.js
sample_files/syntax_error_after.js --- Text (2 errors, exceeded DFT_PARSE_ERROR_LIMIT)
...
Users may opt-in to syntactic diffing by setting
DFT_PARSE_ERROR_LIMIT to a larger value. In this mode, difftastic
treats tree-sitter error nodes as atoms and performs a tree diff as
normal.
Contributing
Building
Install Rust with rustup, then clone the code.
$ git clone git@github.com:Wilfred/difftastic.git
$ cd difftastic
Difftastic uses Cargo for building.
$ cargo build
Debug builds are significantly slower than release builds. For files with more than fifty lines, it's usually worth using an optimised build.
$ cargo build --release
Manual
This website is generated with mdbook. mdbook can be installed with Cargo.
Note: difftastic uses an older Rust toolchain version. You have to run cargo install mdbook outside of the repository directory. Otherwise, installation fails.
$ cargo install mdbook
You can then use the mdbook binary to build and serve the site
locally.
$ cd manual
$ mdbook serve
API Documentation
You can browse the internal API documentation generated by rustdoc here.
Difftastic's internal docs are not available on docs.rs, as it does not support binary crates today.
Testing
$ cargo test
There are also several files in sample_files/ that you can use.
The best way to test difftastic is to look at history from a real
project. Set GIT_EXTERNAL_DIFF to point to your current build.
For example, you can run difftastic on its own source code.
$ GIT_EXTERNAL_DIFF=./target/release/difft git log -p --ext-diff -- src
Logging
Difftastic uses the pretty_env_logger library to log some additional
debug information.
$ DFT_LOG=debug cargo run sample_files/old.jsx sample_files/new.jsx
See the env_logger
documentation for full details.
Releasing
Use Cargo to create a new release, and tag it in git. Difftastic has a helper script for this:
$ ./scripts/release.sh
You can now increment the version in Cargo.toml and add a new entry to CHANGELOG.md.
Vendoring
Git Subtrees
Tree-sitter parsers are sometimes packaged on npm, sometimes packaged on crates.io, and have different release frequencies. Difftastic uses git subtrees (not git submodules) to track parsers.
Updating a parser
To update a parser, pull commits from the upstream git repository. For example, the following command will update the Java parser:
$ git subtree pull --prefix=vendored_parsers/tree-sitter-java git@github.com:tree-sitter/tree-sitter-java.git master
To see when each parser was last updated, use the following shell command:
$ for d in $(git log | grep git-subtree-dir | tr -d ' ' | cut -d ":" -f2 | sort); do echo "$d"; git log --pretty="  %cs" -n 1 $d; done
Adding A Parser
Finding a parser
New parsers for difftastic must be reasonably complete and maintained.
There are many tree-sitter parsers available, and the tree-sitter website includes a list of some well-known parsers.
Add the source code
Once you've found a parser, add it as a git subtree to
vendored_parsers/. We'll use
tree-sitter-json as
an example.
$ git subtree add --prefix=vendored_parsers/tree-sitter-json https://github.com/tree-sitter/tree-sitter-json.git master
Configure the build
Cargo does not allow packages to include subdirectories that contain a
Cargo.toml. Add a symlink to the src/ parser subdirectory.
$ cd vendored_parsers
$ ln -s tree-sitter-json/src tree-sitter-json-src
You can now add the parser to build by including the directory in
build.rs.
TreeSitterParser {
    name: "tree-sitter-json",
    src_dir: "vendored_parsers/tree-sitter-json-src",
    extra_files: vec![],
},
If your parser includes custom C or C++ files for lexing (e.g. a
scanner.cc), add them to extra_files.
Configure parsing
Add an entry to tree_sitter_parser.rs for your language.
Json => {
    let language = unsafe { tree_sitter_json() };
    TreeSitterConfig {
        language,
        atom_nodes: vec!["string"].into_iter().collect(),
        delimiter_tokens: vec![("{", "}"), ("[", "]")],
        highlight_query: ts::Query::new(
            language,
            include_str!("../../vendored_parsers/highlights/json.scm"),
        )
        .unwrap(),
        sub_languages: vec![],
    }
}
atom_nodes is a list of tree-sitter node names that should be
treated as atoms even though the nodes have children. This is common
for things like string literals or interpolated strings, where the
node might have children for the opening and closing quote.
If you don't set atom_nodes, you may notice added/removed content
shown in white. This is usually a sign that child node should have its
parent treated as an atom.
delimiter_tokens are delimiters that difftastic stores on
the enclosing list node. This allows difftastic to distinguish
delimiter tokens from other punctuation in the language.
If you don't set delimiter_tokens, difftastic will consider the
tokens in isolation, and may think that a ( was added but the )
was unchanged.
You can use difft --dump-ts foo.json to see the results of the
tree-sitter parser, and difft --dump-syntax foo.json to confirm that
you've set atoms and delimiters correctly.
sub-languages is empty for most languages: see the code documentation for details.
Configure language detection
Update language_name in guess_language.rs to detect your new
language. Insert a match arm like:
Json => "json",
There may also file names or shebangs associated with your language; configure those
by adapting the language_globs, from_emacs_mode_header and from_shebang functions
in that file.
GitHub's linguist definitions
are a useful source of common file extensions.
Syntax highlighting (Optional)
To add syntax highlighting for your language, you'll also need a symlink
to the queries/highlights.scm file, if available.
$ cd vendored_parsers/highlights
$ ln -s ../tree-sitter-json/queries/highlights.scm json.scm
Test It
Search GitHub for a popular repository in your target language (example search) and confirm that git history looks sensible with difftastic.
Add a regression test
Finally, add a regression test for your language. This ensures that the output for your test file doesn't change unexpectedly.
Regression test files live in sample_files/ and have the form
foo_1.abc and foo_2.abc.
$ nano simple_1.json
$ nano simple_2.json
Run the regression test script and update the .expected file.
$ ./sample_files/compare_all.sh
$ cp sample_files/compare.result sample_files/compare.expected
Maintenance
To update a parser that is already imported, use git subtree pull.
$ git subtree pull --prefix=vendored_parsers/tree-sitter-json git@github.com:tree-sitter/tree-sitter-json.git master
Profiling
If you have a file that's particularly slow, you can use cargo-flamegraph to see which functions are slow.
$ CARGO_PROFILE_RELEASE_DEBUG=true cargo flamegraph --bin difft -- sample_files/slow_1.rs sample_files/slow_2.rs
It's also worth looking at memory usage, as graph traversal bugs can lead to huge memory consumption.
$ /usr/bin/time -v ./target/release/difft sample_files/slow_1.rs sample_files/slow_2.rs
If timing measurement are noisy, Linux's perf tool will report
instructions executed, which is more stable.
$ perf stat ./target/release/difft sample_files/slow_1.rs sample_files/slow_2.rs
$ perf stat ./target/release/difft sample_files/typing_1.ml sample_files/typing_2.ml
Many more profiling techniques are discussed in The Rust Performance Book.
Glossary
Atom: An atom is an item in difftastic's syntax tree structure that has no children. It represents things like literals, variable names, and comments. See also 'list'.
Delimiter: A paired piece of syntax. A list has an open delimiter
and a close delimiter, such as [ and ]. Delimiters may not be
punctuation (e.g. begin and end) and may be empty strings (e.g. infix
syntax converted to difftastic's syntax tree).
Hunk: A group of lines displayed together in the diff output. Increasing the number of context lines increases the size of the hunk.
LHS: Left-hand side. Difftastic compares two items, and LHS refers to the first item. See also 'RHS'.
Line-oriented: A traditional diff that compares which lines have been added or removed, unlike difftastic. For example, GNU diff or the diffs displayed on GitHub.
List: A list is an item in difftastic's syntax tree structure that has an open delimiter, children, and a close delimiter. It represents things like expressions and function definitions. See also 'atom'.
Novel: An addition or a removal. Syntax is novel if it occurs in only one of the two items being compared.
RHS: Right-hand side. Difftastic compares two items, and RHS refers to the second item. See also 'LHS'.
Root: A syntax tree without a parent node. Roots represent top-level definitions in the file being diffed.
Slider: A diffing situation where there are multiple minimal diffs possible, due to adjacent content. It is possible to 'slide' to produce better results in this situation. See the discussion in Tricky Cases.
Syntax node: An item in difftastic's syntax tree structure. Either an atom or a list.
Token: A small piece of syntax tracked by difftastic (e.g. $x,
function or ]), for highlighting and aligned display. This is
either an atom or a non-empty delimiter.
Alternative Projects
Many different tools exist for diffing files. This section of the manual discusses the design of other tools that have influenced difftastic.
Tree Diffing
This page summarises some of the other tree diffing tools available.
If you're in a hurry, start by looking at Autochrome. It's extremely capable, and has an excellent description of the design.
If you're interested in a summary of the academic literature, this blog post (and its accompanying paper -- mirrored under a CC BY-NC license) are great resources.
json-diff (2012)
Languages: JSON
Algorithm: Pairwise comparison
Output: CLI colours
json-diff performs a
structural diff of JSON files. It considers subtrees to be different
if they don't match exactly, so e.g. "foo" and ["foo"] are
entirely different.
json-diff is also noteworthy for its extremely readable display of results.
GumTree (2014)
Languages: ~10 programming
languages
Parser: Several, including srcML
Algorithm: Top-down, then bottom-up
Output: HTML, Swing GUI, or text
GumTree can parse several programming languages and then performs a tree-based diff, outputting an HTML display.
The GumTree algorithm is described in the associated paper 'Fine-grained and accurate source code differencing' by Falleri et al (DOI, PDF). It performs a greedy top-down search for identical subtrees, then performs a bottom-up search to match up the rest.
Tree Diff (2017)
Languages: S-expression data format
Algorithm: A* search
Output: Merged s-expression file
Tristan Hume wrote a tree diffing algorithm during his 2017 internship at Jane Street. The source code is not available, but he has a blog post discussing the design in depth.
This project finds minimal diffs between s-expression files used as
configuration by Jane Street. It uses A* search to find the minimal
diff between them, and builds a new s-expression with a section marked
with :date-switch for the differing parts.
(Jane Street also has patdiff, but that seems to be a line-oriented
diff with some whitespace/integer display polish. It doesn't
understand that e.g. whitespace in "foo " is meaningful).
Autochrome (2017)
Languages: Clojure
Parser: Custom, preserves comments
Algorithm: Dijkstra (previously A* search)
Output: HTML
Autochrome parses Clojure with a custom parser that preserves comments. Autochrome uses Dijkstra's algorithm to compare syntax trees.
Autochrome's webpage includes worked examples of the algorithm and a discussion of design tradeoffs. It's a really great resource for understanding tree diffing techniques in general.
graphtage (2020)
Languages: JSON, XML, HTML, YAML, plist, and CSS
Parser: json5, pyYAML, ignores comments
Algorithm: Levenshtein distance
Output: CLI colours
graphtage compares structured data by parsing into a generic file format, then displaying a diff. It even allows things like diffing JSON against YAML.
As with json-diff, it does not consider ["foo"] and "foo" to have
any similarities.
Diffsitter (2020)
Parser: Tree-sitter
Algorithm: Longest-common-subsequence
Output: CLI colours
Diffsitter is another tree-sitter based diff tool. It uses LCS diffing on the leaves of the syntax tree.
sdiff (2021)
Languages: Scheme
Parser: Scheme's built-in read, ignores comments
Algorithm: MH-Diff from the Chawathe paper
Output: CLI colours
Semantically meaningful S-expression diff: Tree-diff for lisp source code was presented at FOSDEM 2021.