What would make hg diff see a txt file as binary?

Matt Mackall mpm at selenic.com
Sat Aug 30 23:18:53 UTC 2014


On Sat, 2014-08-30 at 13:09 -0400, Harry Putnam wrote:
> Sorry, but no, it is NOT FALSE here.

Ok. Then your files must be big enough that the NUL sits beyond the part
of the file that file(1) and diff(1) actually inspect.

Both of these files contain a NUL and are identical except for one line
near the end:

$ file a b
a: data
b: ASCII text
$ diff -u a b | cat -v
--- a	2014-08-31 00:16:16.602198317 +0200
+++ b	2014-08-31 00:16:21.006195420 +0200
@@ -16381,4 +16381,5 @@
 hello
 hello
 hello
+hello
 ^@ goodbye
\ No newline at end of file
$ ls -n a b
-rw-r--r-- 1 1000 1000 98307 Aug 31 00:16 a
-rw-r--r-- 1 1000 1000 98313 Aug 31 00:16 b

My version of file(1) looks for a NUL in the first 16k lines (which
could be quite a lot of data).
My version of diff(1) looks in the first 4k bytes.
Git checks the first 8000 bytes (not 8192).
CVS checks somewhere between 512 and 8k bytes.
SVN (as far as I can decipher the source) checks the whole file.
Mercurial looks anywhere in the file.

But they all use the same basic rule: if NULs, then not text.
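
As a rough illustration of that rule (a sketch only, not the actual code of
any of the tools above), the check can be written as a NUL search over either
a prefix or the whole buffer. The function name has_nul and its limit
parameter are made up for this example, and the sample data is merely a
guess at file "a" that is consistent with the size shown by ls:

```python
def has_nul(data: bytes, limit=None) -> bool:
    """Return True if a NUL byte occurs in data, or in its first `limit` bytes.

    limit=None means "scan everything", as Mercurial is described as doing.
    """
    window = data if limit is None else data[:limit]
    return b"\0" in window


# Assumed shape of file "a": 16383 "hello" lines, then a NUL line
# with no trailing newline.
sample = b"hello\n" * 16383 + b"\0 goodbye"

print(len(sample))            # 98307, the size ls reports for "a"
print(has_nul(sample, 4096))  # False: a 4k-byte prefix never reaches the NUL
print(has_nul(sample, 8000))  # False: neither does an 8000-byte prefix
print(has_nul(sample))        # True: a whole-file scan finds it
```

With a small limit the file is reported as text, and with no limit it is
reported as binary, which is the disagreement seen between diff(1) and hg diff.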

Diff, file, and CVS are making an explicit compromise: they're designed
for files that might not fit in memory, so they try not to read the
whole file or make multiple read passes. This is not surprising, given
they were created in an era when Unix machines might have ~1MB of RAM.
They'd LIKE to scan the whole file for NUL, but that may be too expensive.

Mercurial is designed to assume every file it's going to work with will
fit in memory so it can use faster in-memory algorithms. Since it's
already read the whole file, checking the whole thing for NUL is not an
issue.
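
The two strategies differ mainly in how much they read. A minimal sketch,
assuming hypothetical helpers prefix_has_nul and whole_file_has_nul (these
names and the 4096-byte default are illustrative, not taken from any tool's
source):

```python
import os
import tempfile


def prefix_has_nul(path, limit=4096):
    """Bounded-memory check: read at most `limit` bytes from the start."""
    with open(path, "rb") as f:
        return b"\0" in f.read(limit)


def whole_file_has_nul(path):
    """Whole-file check: assumes the file fits in memory."""
    with open(path, "rb") as f:
        return b"\0" in f.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "a")
    with open(path, "wb") as f:
        # Text-looking content with a single NUL near the end.
        f.write(b"hello\n" * 16383 + b"\0 goodbye")

    print(prefix_has_nul(path))      # False: the NUL is past the first 4096 bytes
    print(whole_file_has_nul(path))  # True
```

The first form never holds more than limit bytes at once. The second costs
nothing extra for a tool that has already loaded the file for other reasons.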

Lastly, Mercurial doesn't actually care; it's just trying to save your
eyeballs from looking at garbage. If you do hg diff -a, it'll gladly
give you a correct but completely unreadable diff between two JPEGs.

-- 
Mathematics is the supreme nostalgia of our time.




