Position of merge/diff in en/decode chain

Hans-Peter Oeri hp at oeri.ch
Mon Apr 19 14:42:35 UTC 2010


I intend to use hg for repositories with lots of big 'pseudo binary'
files. 'Pseudo' because they are actually mainly compressed text files,
like ODF or OOXML. In hg's docs I found en/decode filters, which would
be great for this purpose. However, the documentation only mentions line
endings and keywords as use cases for filters, which drastically
limits their scope. It also does not define WHERE in the chain the
different hg operations take place. Please allow me to present my
thoughts, as I didn't find anything on that topic:

/------------\ encoded  /-------------\  decoded /--------------\
| Repository | - - - - <  Filter       > - - - - | Working Dir  |
| (text)     |  form    | (bijective) |   form   | (pseudo-bin) |
\------------/          \-------------/          \--------------/

The filter being truly bijective, the two forms of the file in
question are equivalent, but not equal. By definition, the repository
stores the encoded files[1] and, by axiom, the working directory holds
"decoded" files. By trial and error I found that hg seems to 'mix' those
states, e.g. by applying changeset information (encoded) directly to
working directory files (decoded).
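
To make "equivalent, but not equal" concrete, here is a minimal Python
sketch. It is only an illustration: a plain zlib stream stands in for a
real ODF/OOXML container, and the names encode() and decode() are my own
placeholders, not anything hg provides.

```python
import zlib


def encode(working_bytes: bytes) -> bytes:
    """Working dir -> repository: unpack to the plain-text form."""
    return zlib.decompress(working_bytes)


def decode(repo_bytes: bytes) -> bytes:
    """Repository -> working dir: repack with a fixed compression level."""
    return zlib.compress(repo_bytes, 9)


text = b"<doc>\n<p>Hello</p>\n</doc>\n"
working = decode(text)

# Equivalent but not equal: different bytes, same content.
assert working != text
assert encode(working) == text

# The round trip is only stable because decode() is deterministic here
# (same input, same level, same zlib build).
assert decode(encode(working)) == working
```

Note the assumption hidden in the last line: a file written by some other
packer would come back with different bytes, so the filter is bijective
only for files that the decoder itself produced.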

Now, IF I'm right and understand correctly, I would suggest clearly
defining where the different operations should take place and thereby
what the scope of "filtering" is:

ENcoding should enhance 'revision controllability' in all cases
(really?). Automatic operations should therefore default to operating on
the ENcoded form - probably initiating a further DEcoding step (to the
working dir) afterwards.

MERGING
Internal merging should always be tried on the ENcoded form - which is
the form changesets are saved in. This works fine for the "old" line
endings and keyword use cases as well as the 'pseudo bin' case.
However, what to do in case of conflicts? *.rej files are most certainly
not en-/decodable by the same filters as the original file and - apart
from the simple line endings case - not applicable to DEcoded files.
Changeset information, on the other hand, is most probably not DEcodable
either (in order to apply it to DEcoded files).
Without a more specific (external) merge tool, the most dogmatic
solution would probably be to keep that file in ENcoded form and let the
user sort out the mess... probably add an option to hg resolve that
initiates DEcoding of the user's merge effort.
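
The order of steps I have in mind can be sketched as follows. This is a
hedged illustration only: the merge is deliberately trivial (whole-file,
not hg's real line-based merge), zlib again stands in for the real
container format, and all function names are hypothetical.

```python
import zlib


def encode(working_bytes: bytes) -> bytes:
    """Working dir -> repository form (placeholder filter)."""
    return zlib.decompress(working_bytes)


def decode(repo_bytes: bytes) -> bytes:
    """Repository -> working dir form (placeholder filter)."""
    return zlib.compress(repo_bytes, 9)


def merge_encoded(base: bytes, local: bytes, other: bytes):
    """Trivial whole-file three-way merge; returns (result, conflict)."""
    if local == other or other == base:
        return local, False   # nothing new on the other side
    if local == base:
        return other, False   # only the other side changed
    return local, True        # both sides changed: conflict


def merge_working_file(base_repo: bytes, local_working: bytes,
                       other_repo: bytes):
    """ENcode the working file, merge, DEcode only on success."""
    local = encode(local_working)
    merged, conflict = merge_encoded(base_repo, local, other_repo)
    if conflict:
        # Keep the ENcoded form so the user can sort it out by hand.
        return merged, True
    return decode(merged), False
```

On a conflict the caller gets the ENcoded text back; the DEcoding step
would then be the job of the resolve option suggested above.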

External merging might be desired on the encoded as well as the decoded
form; e.g. opening a word processor with "document changes" or a graphics
editor with layers. The configuration could cater to this with a flag
per merge tool.

DIFFING
"Diffing" seems to be used in two senses: a) machine-readable,
applicable diffs and b) human-readable, informational diffs (e.g. [2]).
This mixup is unfortunate. Has anyone considered splitting those use
cases? At first glance, it would be cool to create "human diffs" for
different kinds of data. Apart from that use case mixup, diffs could be
taken from both the encoded and the decoded form (again the word
processor / graphics examples).
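
As a rough illustration of a "human diff" taken from the ENcoded form,
here is a small Python sketch using the standard difflib module. The
zlib stream is again only a stand-in for a real document container, and
human_diff() is a name I made up for this example.

```python
import difflib
import zlib


def encode(working_bytes: bytes) -> bytes:
    """Working dir -> repository form (placeholder filter)."""
    return zlib.decompress(working_bytes)


def human_diff(old_working: bytes, new_working: bytes):
    """Unified diff of two working-dir files, taken on the ENcoded text."""
    old_lines = encode(old_working).decode("utf-8").splitlines(keepends=True)
    new_lines = encode(new_working).decode("utf-8").splitlines(keepends=True)
    return list(difflib.unified_diff(old_lines, new_lines, "a/doc", "b/doc"))


old_working = zlib.compress(b"<p>Hello</p>\n<p>World</p>\n", 9)
new_working = zlib.compress(b"<p>Hello</p>\n<p>Mercurial</p>\n", 9)

for line in human_diff(old_working, new_working):
    print(line, end="")
```

Diffing the two compressed byte strings directly would tell a reader
nothing useful, whereas the diff of the ENcoded text shows exactly which
line changed.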

MQ/Export
The documentation page warns about problems with MQ/Export. Provided
that those consistently work on ENcoded forms and then decode to the
working dir (which MQ currently does not...), I don't see any systematic
problems with filters.

What do you think? Am I completely mistaken?
HPO

[1] http://mercurial.selenic.com/wiki/EncodeDecodeFilter
[2] http://www-verimag.imag.fr/~moy/opendocument/



