Filter for uncompressed storage of zipped document formats like docx (http://stackoverflow.com/questions/3298525/version-control-for-docx-and-pdf)
Andreas Gobell
andreasgobell at gmx.de
Mon May 9 17:12:42 UTC 2011
On 2011-05-05 at 12:00 Didly wrote:
> On Thu, May 5, 2011 at 10:11 AM, Martin Geisler <mg at aragost.com> wrote:
>> Andreas Gobell <andreasgobell at gmx.de> writes:
>>
>>> Dear Mercurial team,
>>>
>>> I am setting up Mercurial as my Version Control System. In my case it
>>> is not only meant to manage source code but also Microsoft Word
>>> documents in the docx format and some binary files.
>>>
>>> I already wrote some scripts to handle diffing and merging of docx
>>> files in Word. Another goal was to improve the delta compression in
>>> the repository. To achieve this I first had tried putting directories
>>> containing the extracted docx contents under version control. This
>>> worked fine for the repository but the usage was cumbersome because of
>>> the necessary conversion between the directories and the docx files. I
>>> then stumbled upon the thread
>>> http://stackoverflow.com/questions/3298525/version-control-for-docx-and-pdf
>>> where Martin Geisler mentions Mercurial's Filter System. This seemed a
>>> good solution as it is completely transparent to the user.
>>>
>>> As Martin stated that he is interested in a solution to this problem
>>> and I haven't found an extension on the internet I am sending the
>>> filter extension that I've written. I've done some tests and compared
>>> the space required for storing standard compressed docx files, docx
>>> with no compression created manually before a commit and docx
>>> processed by my filter and the results show clear space savings for
>>> the filter version (and of course the manually uncompressed docx). I
>>> also tested odt files created in LibreOffice with the filter and they
>>> work as well.
>>>
>>> I am new to Mercurial and I haven't written Python for a few years so
>>> I would be very glad to hear about improvements and comments.
>>
>> Great work, the extension looks good!
>>
>> You should definitely publish it somewhere more permanently: put it in a
>> public repository (create one on bitbucket.org or code.google.com if you
>> don't have one already) and create a wiki page for it:
>>
>> http://mercurial.selenic.com/wiki/DoczipExtension?action=edit&template=ExtensionTemplate
>>
>> Then add a link to the new page here:
>>
>> http://mercurial.selenic.com/wiki/UsingExtensions
>>
>> Also, send this to the TortoiseHg guys -- maybe Steve will bundle the
>> extension with TortoiseHg since it seems particularly useful in
>> Word-heavy (e.g., Windows) environments.
>>
>> --
>> Martin Geisler
>>
>
> Please do send it (you can send an email to thg-dev at googlegroups.com).
> I think this would be really useful for a lot of users!
>
> Does your filter work for regular zip files too?
>
> BTW, how does mercurial's storage compression compare to the zip
> format compression of text files?
>
> Angel
Thanks for the positive comments.
I created the repository https://bitbucket.org/gobell/hg-zipdoc/ and the wiki page http://mercurial.selenic.com/wiki/ZipdocExtension, and added an entry to http://mercurial.selenic.com/wiki/UsingExtension.
I renamed the extension to zipdoc. It sounds better and reflects what the extension is: a filter that processes zip files, intended for zipped document formats. It works for any zip file, including regular zips.
To improve the extension I added error handling for the case where the file passed to the filter is not a zip file or is a broken archive. I think a common case would be a version-controlled symlink to a zip file.
    try:
        uncompressedDoc = zipfile.ZipFile(memoryInFile, "r")
    except zipfile.BadZipfile:
        userInterface = ui.ui()
        userInterface.note(_("zipdoc: Skipped decode due to bad ZIP archive. Either the file is not a ZIP (might be a link to a ZIP file) or the archive is broken.\n"))
        return s
Is this the correct way to obtain a reference to the ui object?
Is it possible to get the name/path of the file that is passed to the filter, so that I can give better user feedback by naming the affected file in the message?
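To make the behaviour described above concrete, here is a minimal standalone sketch of the idea: rewrite a zip archive so that every member is stored uncompressed, and hand the data back unchanged if it is not a readable zip. It is not the actual zipdoc code and does not use Mercurial at all. The names store_uncompressed, warn and filename are made up for illustration, and whether Mercurial passes a file name to a filter is exactly the open question above, so filename is just an optional argument here. The sketch targets Python 3, where the exception is spelled zipfile.BadZipFile.

```python
import io
import zipfile


def store_uncompressed(data, filename=None, warn=None):
    """Return the zip archive in `data` rewritten with stored (uncompressed) members.

    `data` holds the raw bytes of the file. If it is not a readable zip
    archive, the bytes are returned unchanged. `warn` is a hypothetical
    callback taking one string; `filename` is only used in the message.
    """
    try:
        source = zipfile.ZipFile(io.BytesIO(data), "r")
    except zipfile.BadZipFile:
        if warn is not None:
            warn("zipdoc: skipped %s: not a ZIP archive or archive is broken\n"
                 % (filename or "<unknown file>"))
        return data

    out = io.BytesIO()
    with source, zipfile.ZipFile(out, "w", zipfile.ZIP_STORED) as target:
        for info in source.infolist():
            # Keep member name, timestamp and attributes; drop the compression.
            stored = zipfile.ZipInfo(info.filename, info.date_time)
            stored.compress_type = zipfile.ZIP_STORED
            stored.external_attr = info.external_attr
            target.writestr(stored, source.read(info))
    return out.getvalue()
```

Called with the bytes of a docx file, this returns a new archive with the same members in stored form. Called with something that is not a zip (for example the contents of a symlink), it returns the input untouched and reports the file name through warn if one was given.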
Cheers
Andreas Gobell