Filter for uncompressed storage of zipped document formats like docx (http://stackoverflow.com/questions/3298525/version-control-for-docx-and-pdf)

Andreas Gobell andreasgobell at gmx.de
Mon May 9 17:12:42 UTC 2011


On 2011-05-05 at 12:00 Didly wrote:

> On Thu, May 5, 2011 at 10:11 AM, Martin Geisler <mg at aragost.com> wrote:
>> Andreas Gobell <andreasgobell at gmx.de> writes:
>> 
>>> Dear Mercurial team,
>>> 
>>> I am setting up Mercurial as my Version Control System. In my case it
>>> is not only meant to manage source code but also Microsoft Word
>>> documents in the docx format and some binary files.
>>> 
>>> I already wrote some scripts to handle diffing and merging of docx
>>> files in Word. Another goal was to improve the delta compression in
>>> the repository. To achieve this I first had tried putting directories
>>> containing the extracted docx contents under version control. This
>>> worked fine for the repository but the usage was cumbersome because of
>>> the necessary conversion between the directories and the docx files. I
>>> then stumbled upon the thread
>>> http://stackoverflow.com/questions/3298525/version-control-for-docx-and-pdf
>>> where Martin Geisler mentions Mercurial's Filter System. This seemed a
>>> good solution as it is completely transparent to the user.
>>> 
>>> As Martin stated that he is interested in a solution to this problem
>>> and I haven't found an extension on the internet I am sending the
>>> filter extension that I've written. I've done some tests and compared
>>> the space required for storing standard compressed docx files, docx
>>> with no compression created manually before a commit and docx
>>> processed by my filter and the results show clear space savings for
>>> the filter version (and of course the manually uncompressed docx). I
>>> also tested odt files created in LibreOffice with the filter and they
>>> work as well.
>>> 
>>> I am new to Mercurial and I haven't written Python for a few years so
>>> I would be very glad to hear about improvements and comments.
>> 
>> Great work, the extension looks good!
>> 
>> You should definitely publish it somewhere more permanently: put it in a
>> public repository (create one on bitbucket.org or code.google.com if you
>> don't have one already) and create a wiki page for it:
>> 
>>  http://mercurial.selenic.com/wiki/DoczipExtension?action=edit&template=ExtensionTemplate
>> 
>> Then add a link to the new page here:
>> 
>>  http://mercurial.selenic.com/wiki/UsingExtensions
>> 
>> Also, send this to the TortoiseHg guys -- maybe Steve will bundle the
>> extension with TortoiseHg since it seems particularly useful in
>> Word-heavy (e.g., Windows) environments.
>> 
>> --
>> Martin Geisler
>> 
> 
> Please do send it (you can send an email to thg-dev at googlegroups.com).
> I think this would be really useful for a lot of users!
> 
> Does your filter work for regular zip files too?
> 
> BTW, how does mercurial's storage compression compare to the zip
> format compression of text files?
> 
> Angel

Thanks for the positive comments.

I created the repository https://bitbucket.org/gobell/hg-zipdoc/ and the wiki page http://mercurial.selenic.com/wiki/ZipdocExtension, and added an entry to http://mercurial.selenic.com/wiki/UsingExtension.
I renamed the extension to zipdoc: it sounds better and reflects that it is really just a filter that processes zip files, although it is intended for zipped document formats. It works for any zip file, including regular zips.
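
To illustrate the idea of storing an archive uncompressed so that the repository can compute useful deltas on its contents, here is a minimal standalone sketch. It is written for current Python 3 and uses only the standard zipfile module. It is not the actual zipdoc code, and the function name store_uncompressed is hypothetical.

```python
import io
import zipfile


def store_uncompressed(data):
    """Return the ZIP archive in `data` (bytes) rewritten without compression.

    Hypothetical helper for illustration only; not part of zipdoc.
    """
    out_buffer = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data), "r") as source, \
            zipfile.ZipFile(out_buffer, "w", zipfile.ZIP_STORED) as target:
        for info in source.infolist():
            # Keep the member name, timestamp and attributes,
            # but mark the member as stored (no compression).
            stored = zipfile.ZipInfo(info.filename, info.date_time)
            stored.compress_type = zipfile.ZIP_STORED
            stored.external_attr = info.external_attr
            target.writestr(stored, source.read(info))
    return out_buffer.getvalue()
```

The reverse direction would copy the members the same way with zipfile.ZIP_DEFLATED instead of zipfile.ZIP_STORED.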

To improve the extension I added error handling for the case where the file passed to the filter is not a zip file or is broken. I think a common case would be a symlink to a zip file that is under version control.

    try:
        uncompressedDoc = zipfile.ZipFile(memoryInFile, "r")
    except zipfile.BadZipfile:
        # Not a readable ZIP archive: report it and return the data unchanged.
        userInterface = ui.ui()
        userInterface.note(_("zipdoc: Skipped decode due to bad ZIP archive. "
                             "Either the file is not a ZIP (might be a link "
                             "to a ZIP file) or the archive is broken.\n"))
        return s

Is this the correct way to obtain a reference to the ui object?
Is it possible to get the name/path of the file that is passed to the filter, so that I can give better user feedback by naming the affected file in the message?

Cheers
Andreas Gobell



