Issue850997
Created on 2003-11-29 01:24 by mhammond, last changed 2010-02-05 16:57 by ezio.melotti.
| File name | Uploaded | Description |
| mbcs_errors.py | mhammond, 2003-11-29 01:24 | Trivial demo of the bug |
| mbcs_errors.patch | mhammond, 2003-11-29 01:38 | Working patch, but with a few issues |
msg19177 - (view) |
Author: Mark Hammond (mhammond) * |
Date: 2003-11-29 01:24 |
|
The following snippet:
>>> u'@test-\u5171'.encode("mbcs", "strict")
'@test-?'
Should raise a UnicodeError. The errors param is
completely ignored, and the function always works as
though errors='replace'.
Attaching a test case, and the start of a patch. The
patch has a number of issues:
* I'm not sure which error handlers are considered 'mandatory'.
I have handled 'strict', 'ignore' and 'replace'.
However, 'ignore' and 'replace' currently behave exactly
the same (i.e. both replace).
* The Windows functions don't tell us exactly what
character failed in the conversion. Thus, the
exception I raise implies the first character is the
one that failed. For the same reason, I have made no
attempt to support error callbacks.
Comments/guidance appreciated.
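
For comparison, here is a minimal sketch of how the three error handlers conventionally behave. It uses the portable "ascii" codec as a stand-in, because "mbcs" exists only on Windows, and it is written for Python 3, so the results are bytes rather than the Python 2 str shown in the snippet above. The helper name encode_with is made up for this illustration.

```python
text = "@test-\u5171"

def encode_with(errors, codec="ascii"):
    """Encode `text` with the given error handler.

    Returns the encoded bytes, or the exception if encoding fails.
    "ascii" is only a stand-in for "mbcs", which is Windows-only.
    """
    try:
        return text.encode(codec, errors)
    except UnicodeEncodeError as exc:
        return exc

strict_result = encode_with("strict")    # a UnicodeEncodeError instance
ignore_result = encode_with("ignore")    # the failing character is dropped
replace_result = encode_with("replace")  # the failing character becomes "?"
```

With a codec that honours the errors argument, 'strict' reports the position of the offending character, and 'ignore' and 'replace' give different output.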
|
|
msg19178 - (view) |
Author: Mark Hammond (mhammond) * |
Date: 2003-11-29 01:31 |
|
Attaching a patch. This patch also attempts to handle
Decode, but I haven't worked out how to exercise this
code path, i.e. what mbcs-encoded string can I pass that
cannot be converted to unicode?
As I mentioned, the patch has a few issues.
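
As an illustration of the kind of input that cannot be converted to Unicode, the sketch below uses Python's own "cp932" codec (Python 3) rather than "mbcs". The input ends with a lone lead byte of a double-byte sequence. Whether the same bytes would make the Windows conversion function report a failure on a system whose ANSI code page is 932 is an assumption that has not been checked here.

```python
# b"\x81" is a lead byte in cp932 and must be followed by a trail byte.
data = b"abc\x81"

try:
    data.decode("cp932", "strict")
    decode_error = None
except UnicodeDecodeError as exc:
    # Python's cp932 codec rejects the incomplete sequence.
    decode_error = exc

# With "replace", decoding succeeds and the bad byte is substituted.
replaced = data.decode("cp932", "replace")
```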
|
|
msg19179 - (view) |
Author: Thomas Heller (theller) * |
Date: 2003-11-29 15:18 |
|
No idea why this was assigned to me - unicode is certainly
not one of my strengths.
|
|
msg19180 - (view) |
Author: Martin v. Löwis (loewis) |
Date: 2003-12-01 21:25 |
|
The conventional semantics of "ignore" would be "remove the
failing characters from the output". This would be difficult
to implement if the Microsoft API provides no detailed error
indication.
You could try to get more detailed error indication by
re-encoding the resulting string with a NULL buffer,
counting the number of characters that have successfully
been encoded, at least in the .decode case.
In the .encode case, you could try using \0 as the default
char. To my knowledge, no ACP ever uses \0 in a multi-byte
string.
What is the meaning of the WC_DEFAULTCHAR flag, in
WideCharToMultiByte, and why are you not using it?
I'm somewhat concerned with backwards compatibility, since
the mbcs codec has never returned errors. So this should be
applied to 2.4 only, and listed in whatsnew.tex.
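
One way to recover position information from a conversion that only ever replaces is a round trip: encode, decode the result, and compare with the input. The sketch below (Python 3) shows that idea only. It is not what the attached patch does, and the helper name strict_encode is hypothetical. It assumes the codec substitutes one character per failing input character, and it would also flag any "best fit" substitutions as failures.

```python
def strict_encode(text, codec):
    """Encode `text`, raising if any character does not survive a round trip.

    Emulates 'strict' on top of an encoder that only supports 'replace'.
    """
    encoded = text.encode(codec, "replace")
    round_trip = encoded.decode(codec, "replace")
    if round_trip == text:
        return encoded
    # Locate the first character that changed during the round trip.
    pos = next(
        (i for i, (a, b) in enumerate(zip(text, round_trip)) if a != b),
        min(len(text), len(round_trip)),
    )
    end = min(pos + 1, len(text))
    raise UnicodeEncodeError(codec, text, pos, end,
                             "character lost in round trip")

ok = strict_encode("abc", "ascii")
try:
    strict_encode("@test-\u5171", "ascii")
    failure = None
except UnicodeEncodeError as exc:
    failure = exc
```

The extra decode pass costs time on every call, which is part of why this is only a sketch and not a proposed fix.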
|
|
msg82015 - (view) |
Author: Daniel Diniz (ajaksu2) |
Date: 2009-02-14 11:35 |
|
Is this behavior still present? If so, is it still interesting to change it?
|
|
msg82133 - (view) |
Author: Mark Hammond (mhammond) * |
Date: 2009-02-14 22:40 |
|
It is still present, but I'm not sure what problems can be seen due to
this, so I can't comment on its desirability. It would also introduce a
backwards compatibility concern, but I don't have enough experience to know
how much of a problem that would be in practice either.
|
|
| Date | User | Action | Args |
| 2010-02-05 16:57:18 | ezio.melotti | set | nosy: + ezio.melotti; versions: + Python 2.7, Python 3.2 |
| 2009-02-14 22:40:35 | mhammond | set | messages: + msg82133 |
| 2009-02-14 12:14:14 | theller | set | nosy: - theller |
| 2009-02-14 11:35:45 | ajaksu2 | set | nosy: + ajaksu2; messages: + msg82015; components: + Unicode; keywords: + patch; type: feature request; stage: test needed |
| 2003-11-29 01:24:21 | mhammond | create | |