[PATCH] Decode UTF-8 e-mail headers
Mads Kiilerich
mads at kiilerich.com
Tue Nov 12 13:00:16 UTC 2013
On 11/12/2013 11:36 AM, funman at videolan.org wrote:
> # HG changeset patch
> # User Rafaël Carré <funman at videolan.org>
> # Date 1384251185 -3600
> # Tue Nov 12 11:13:05 2013 +0100
> # Branch stable
> # Node ID fb293d62a1e6e75a640a15006e014dbf2ffb96e3
> # Parent ba6486076429e5c20d910b8a5d4f8acf1e9dc1b1
> Decode UTF-8 e-mail headers
Thank you for your patch.
Note that Mercurial rigidly follows the rules on
http://mercurial.selenic.com/wiki/ContributingChanges , and the patch
description should thus be something like "patch: decode UTF-8 e-mail
headers".
Most of the analysis and description in the "0" mail should also be
included in the commit message.
It might also be worth mentioning that this apparently is all about
supporting RFC 2047 (or later).
> diff -r ba6486076429 -r fb293d62a1e6 mercurial/patch.py
> --- a/mercurial/patch.py Wed Nov 06 10:20:18 2013 -0800
> +++ b/mercurial/patch.py Tue Nov 12 11:13:05 2013 +0100
> @@ -12,6 +12,7 @@
> # load. This was not a problem on Python 2.7.
> import email.Generator
> import email.Parser
> +import email.header
>
> from i18n import _
> from node import hex, short
> @@ -162,6 +163,29 @@
> Any item in the returned tuple can be None. If filename is None,
> fileobj did not contain a patch. Caller must unlink filename when done.'''
>
> + def email_decode(h):
> + '''Decode ?=UTF-8? from e-mail headers.'''
Shouldn't it just decode headers no matter what encoding they have?
email_decode could be anything ... and underscores are not welcome in
Mercurial. I suggest naming it something like decodemimeheader ... even
though it is a bit long ...
> + if h is None:
> + return None
> + res = ''
> + pairs = email.header.decode_header(h)
> + if pairs is None:
> + return None
Will pairs ever be None?
> + n = len(pairs)
> + pair = 0
> + for p in pairs:
> + pair += 1
It seems like it would be more elegant to use the pattern
for pair, p in enumerate(pairs):
(- if you you need the enumeration...)
> + if p[1] == 'utf-8':
> + res += p[0]
> + if pair < n:
> + res += ' '
String concatenation is relatively expensive. It is no real issue here,
but it would probably be more elegant to use
res = []
...
res.append(...)
...
return ' '.join(res)
> + elif p[1] is None:
> + res += p[0]
> + if pair < n:
> + res += ' '
This is silently skipping chunks with other encodings?
> +
> + return encoding.tolocal(res)
It seems like it would be better if this function really returned utf-8.
I understand that it cause problems with conversions done in other
layers ... but that sounds like a problem elsewhere.
> +
> # attempt to detect the start of a patch
> # (this heuristic is borrowed from quilt)
> diffre = re.compile(r'^(?:Index:[ \t]|diff[ \t]|RCS file: |'
> @@ -174,8 +198,8 @@
> try:
> msg = email.Parser.Parser().parse(fileobj)
>
> - subject = msg['Subject']
> - user = msg['From']
> + subject = email_decode(msg['Subject'])
> + user = email_decode(msg['From'])
> if not subject and not user:
> # Not an email, restore parsed headers if any
> subject = '\n'.join(': '.join(h) for h in msg.items()) + '\n'
> diff -r ba6486076429 -r fb293d62a1e6 tests/test-import.t
> --- a/tests/test-import.t Wed Nov 06 10:20:18 2013 -0800
> +++ b/tests/test-import.t Tue Nov 12 11:13:05 2013 +0100
> @@ -1156,3 +1156,19 @@
> line
>
> $ cd ..
> +
> + $ cat > utf8.patch <<EOF
> + > From: =?UTF-8?q?=C3=AB?=
> + > Subject: patch
> + > diff --git /dev/null b/a
> + > --- /dev/null
> + > +++ b/a
> + > @@ -0,0 +1,1 @@
> + > +a
> + > EOF
> + $ hg init utf
> + $ cd utf
> + $ hg import ../utf8.patch
> + applying ../utf8.patch
> + $ hg log | grep ^user -
> + user: ë
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel at selenic.com
> http://selenic.com/mailman/listinfo/mercurial-devel
More information about the Mercurial-devel
mailing list