Unicode Support
The following Unicode options apply to both Import Mode and Convert Mode. The charset options apply only to Import Mode, where files are translated through the Perforce client. In Convert Mode, archive files are always written in UTF-8 for a Unicode-enabled Perforce server.
Defaults (for non-Unicode servers):
com.p4convert.p4.unicode=false
com.p4convert.p4.translate=true
com.p4convert.p4.charset=<none>
In some situations it is preferable to import text files as-is (untranslated).
Typically this is true for a non-Unicode environment where all the users are on
Windows clients. To disable translation, set the following option to false.
com.p4convert.p4.translate=false
When translation is disabled, high-ASCII text uses the content type TEXT-RAW
and the following warning is disabled:
... Non-unicode server, downgrading file to text
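To make "high-ASCII" concrete: bytes above 0x7F have no meaning in 7-bit ASCII and are interpreted differently by each code page. The Python sketch below is purely illustrative and is not part of the converter; the helper name `has_high_ascii` is hypothetical.

```python
def has_high_ascii(data: bytes) -> bool:
    """Return True if any byte falls outside the 7-bit ASCII range."""
    return any(b > 0x7F for b in data)


# "café" as saved by a Windows-1252 editor: é is the single byte 0xE9.
cp1252_bytes = "café".encode("windows-1252")

print(has_high_ascii(b"plain text"))  # False
print(has_high_ascii(cp1252_bytes))   # True

# The same bytes are not valid UTF-8, so they cannot be translated
# reliably without knowing the code page they were written in.
try:
    cp1252_bytes.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")
```

Importing such content untranslated sidesteps the need to guess the original code page.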
Recommended configuration for a Unicode conversion:
com.p4convert.p4.unicode=true
com.p4convert.p4.translate=true
com.p4convert.p4.charset=utf8
For Unicode conversions set the JVM arg:
-Dfile.encoding=UTF-8
Some Solaris and Linux conversions may need the locale set:
export LC_ALL=en_GB.UTF-8
Once a Perforce server is switched to Unicode-enabled mode (-xi), all client
workspaces need to define a character set. For details see:
http://answers.perforce.com/articles/KB_Article/Internationalization-and-Localization
Note
A non-Unicode enabled Perforce Server will accept UTF16 files.
Normalisation
Platform Unicode normalisation is detected when the configuration file
is generated; however, it can be changed by setting the following
configuration option to NFC or NFD:
com.p4convert.p4.normalisation=NFD
The default detection is based on the following:
Platform | Normalisation |
---|---|
Windows | |
Mac | |
*nix/*nux | |
Sun | |
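For readers unfamiliar with the two forms, the following Python sketch (an illustration using the standard `unicodedata` module, not the converter's own code) shows how the same visible character differs between NFC and NFD:

```python
import unicodedata

composed = "\u00e9"      # é as a single code point (NFC form)
decomposed = "e\u0301"   # e followed by a combining acute accent (NFD form)

# The two strings render identically but are not equal code point for code point.
print(composed == decomposed)                                  # False
print(unicodedata.normalize("NFD", composed) == decomposed)    # True
print(unicodedata.normalize("NFC", decomposed) == composed)    # True

# Their UTF-8 encodings differ as well.
print(composed.encode("utf-8"))     # b'\xc3\xa9'
print(decomposed.encode("utf-8"))   # b'e\xcc\x81'
```

Because the byte sequences differ, a path containing accented characters can compare unequal if it is stored in one form and looked up in the other, so the chosen form needs to be applied consistently.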
Subversion Properties
By default, the converter parses Subversion properties as UTF-8 strings.
The conversion uses properties such as svn:log and svn:author for
attributes in Perforce and must decode the byte sequence to UTF-8. In
some data sets, Windows users may have added high-ASCII characters in one
or more code pages. To handle this, this release supports a
configuration option:
com.p4convert.svn.propTextType=UNKNOWN
The following strings denote the supported char-sets:
Big5 | IBM424_rtl | ISO-8859-7 | UTF-16LE |
BINARY | ISO-2022-CN | ISO-8859-8 | UTF-32BE |
EUC-JP | ISO-2022-JP | ISO-8859-9 | UTF-32LE |
EUC-KR | ISO-2022-KR | KOI8-R | UTF-8 |
GB18030 | ISO-8859-1 | Shift_JIS | windows-1251 |
IBM420_ltr | ISO-8859-2 | UNKNOWN | windows-1252 |
IBM420_rtl | ISO-8859-5 | US-ASCII | windows-1254 |
IBM424_ltr | ISO-8859-6 | UTF-16BE | windows-1256 |
The first scan is always UTF-8, followed by the character set given in the
configuration option. BINARY implies a skip and the string
<binary property> is inserted.
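The scan order described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the converter's implementation: the function name `decode_property` is hypothetical, the fallback value is assumed to be usable as a Python codec name, and the behaviour of UNKNOWN is not modelled because it is not described here.

```python
BINARY_PLACEHOLDER = "<binary property>"


def decode_property(raw: bytes, prop_text_type: str) -> str:
    """Decode a property value: UTF-8 first, then the configured fallback."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass  # Not valid UTF-8; fall through to the configured option.

    if prop_text_type == "BINARY":
        # Skip decoding and insert the placeholder string instead.
        return BINARY_PLACEHOLDER
    if prop_text_type == "UNKNOWN":
        raise NotImplementedError("UNKNOWN is not modelled in this sketch")

    # Assumption: any other value is treated as a Python codec name.
    return raw.decode(prop_text_type)


# Valid UTF-8 is returned unchanged, whatever the fallback is.
print(decode_property("Zoë".encode("utf-8"), "windows-1252"))  # Zoë
# A Windows-1252 byte (0xEB) fails the UTF-8 scan and uses the fallback.
print(decode_property(b"Zo\xeb", "windows-1252"))              # Zoë
# With BINARY, undecodable data is replaced by the placeholder.
print(decode_property(b"\xff\xfe\x00", "BINARY"))              # <binary property>
```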