P4XFER-8

ronprestenback (ronprestenback)
ronprestenback created this job, modified by Robert Cowham
Open
Encoding issues

I encountered a file with an accented character in its filename: Melée.jpg.  The character was stored as the single byte E9, whereas its UTF-8 encoding is C3 A9.  The file that was downloaded from the source server had the correct characters in its name, but the name being tracked by the script had incorrect characters (usually the Unicode replacement character, U+FFFD).  When the script tries to open that file for add on the target server, it fails because the filename it has in memory no longer matches the name of the file on disk.  Additionally, the logging library threw exceptions when trying to log the filename, which led to some additional "fun".
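The mismatch can be reproduced outside the script. The sketch below assumes the name arrives as Windows-1252 bytes (the report only establishes the E9 byte, so the code page is an assumption), and the variable names are illustrative rather than taken from P4Transfer.

```python
# Assumed input: the filename as raw bytes, with e-acute stored as the single byte E9.
raw_name = b"Mel\xe9e.jpg"

# Decoding with the matching code page recovers the accented character.
as_cp1252 = raw_name.decode("cp1252")

# A lenient UTF-8 decode substitutes U+FFFD, because a lone E9 byte
# followed by an ASCII letter is not a valid UTF-8 sequence.
as_utf8_lossy = raw_name.decode("utf-8", errors="replace")

# A strict UTF-8 decode raises instead of substituting.
try:
    raw_name.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Once the replacement character is in the name, re-encoding it can no
# longer reproduce the original bytes, so a lookup by that name would miss.
round_trip_matches = as_utf8_lossy.encode("utf-8") == raw_name

print(repr(as_cp1252), repr(as_utf8_lossy), strict_failed, round_trip_matches)
```

The lossy decode is the step that makes the in-memory name diverge from the bytes that were originally stored; after that point the original name cannot be recovered from the string alone.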
I'm still not sure exactly which encoding is the "right" one for this (since Python uses different names for some encodings than what I could find elsewhere online), but I was able to work around the issue by setting self.p4.encoding = '1252' in P4Base.connect.
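On the naming question, Python's codec registry can report which codec a given spelling resolves to. This is a small sketch using only the standard library; whether the p4 connection passes its encoding value through the same registry is an assumption here.

```python
import codecs

# Several spellings that might appear in documentation for the same code page.
aliases = ("1252", "cp1252", "windows-1252")

# codecs.lookup() normalizes the alias and returns the canonical codec name.
resolved = {alias: codecs.lookup(alias).name for alias in aliases}

for alias, canonical in resolved.items():
    print(f"{alias!r} -> {canonical!r}")
```

All three spellings resolve to the same codec, so '1252' and 'cp1252' should behave identically wherever a Python codec name is accepted.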

This worked great for the filename, but introduced a new issue.  Now that the p4 connection was running in 1252 encoding, it threw exceptions when it encountered a changelist description (in a different changelist) that contained those Microsoft "smart quotes", or as I like to call them, "screw-up-perforce-encoding quotes" (they also cause p4's encoding detection algorithm to fail when adding a text file, if the Unicode characters don't occur in the first X bytes of an otherwise ANSI-compatible file... and they frequently don't).
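One plausible mechanism for the smart-quote failure, assuming the description was stored as UTF-8 (the report does not say which exception was raised): the UTF-8 encoding of the closing quote ends in byte 9D, which Python's cp1252 codec leaves unassigned. The sketch below shows that, along with a hypothetical helper, decode_with_fallback, that tries several encodings per value instead of one encoding per connection. The helper is not part of P4Transfer or P4Python.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Hypothetical helper: try each encoding in order, then decode lossily."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Nothing matched; substitute U+FFFD rather than raising.
    return raw.decode(encodings[0], errors="replace")


# A description with smart quotes, assumed here to be stored as UTF-8 bytes.
description = "\u201cfixed\u201d".encode("utf-8")

# Decoding those bytes as cp1252 fails on the 0x9D byte.
try:
    description.decode("cp1252")
    cp1252_ok = True
except UnicodeDecodeError:
    cp1252_ok = False

# The fallback handles both the UTF-8 description and the cp1252 filename.
decoded_description = decode_with_fallback(description)
decoded_filename = decode_with_fallback(b"Mel\xe9e.jpg")

print(cp1252_ok, repr(decoded_description), repr(decoded_filename))
```

Trying UTF-8 first is a heuristic: non-ASCII cp1252 text rarely forms valid UTF-8 sequences, so a successful UTF-8 decode is usually the right one, but it is not a guarantee.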

Note: copy/pasting that é character will often change the code page used to encode it, so you'll need to type it manually if you're trying to repro the issue.  The problematic byte is E9, which you can type on a US keyboard by holding ALT while typing 0233 on the numeric keypad (it has to be the numpad; the numbers at the top of the keyboard don't work, as those are accelerator/shortcut keys).  I had to use HexEdit4 to verify the character was still encoded as E9, because copy/pasting the filename would "helpfully" re-encode that character using UTF-8.
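As an alternative to typing the character by hand, the repro name can be built from an escape sequence so that no editor or clipboard gets a chance to re-encode it. This is a minimal sketch; hex_bytes is a hypothetical helper that stands in for the hex editor check, and it shows how the name would be encoded rather than what a particular filesystem stores.

```python
def hex_bytes(text, encoding):
    """Hypothetical helper: hex dump of text under the given encoding."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


# Build the name with an escape, so the source file itself stays pure ASCII.
repro_name = "Mel\u00e9e.jpg"

cp1252_dump = hex_bytes(repro_name, "cp1252")
utf8_dump = hex_bytes(repro_name, "utf-8")

print("cp1252:", cp1252_dump)
print("utf-8: ", utf8_dump)
```

The cp1252 dump contains the single E9 byte, while the UTF-8 dump contains C3 A9 in the same position, which is the difference the hex editor was being used to check.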
Status: Open
Project: perforce-software-p4transfer
Severity: B
Reported By: ronprestenback
Reported Date:
Modified By: Robert Cowham
Modified Date:
Owned By: ronprestenback
Dev Notes: There is a new charset option which may help with this.
Note that invalid encodings on Windows are sometimes best handled with Python 2.7 on Windows...
Type: Bug