Unicode Support

The Unicode options below apply to both Import Mode and Convert Mode. The charset option applies only to Import Mode, where files are translated through the Perforce client. In Convert Mode, archive files for a Unicode-enabled Perforce server are always written in UTF-8.

Defaults (for non-Unicode servers):

com.p4convert.p4.unicode=false
com.p4convert.p4.translate=true
com.p4convert.p4.charset=<none>

In some situations it is preferable to import text files as-is (untranslated). This is typically the case in a non-Unicode environment where all users are on Windows clients. To disable translation, set the following option to false:

com.p4convert.p4.translate=false

When translation is disabled, high-ASCII text uses the content type TEXT-RAW and the following warning is suppressed:

... Non-unicode server, downgrading file to text

Recommended configuration for a Unicode conversion:

com.p4convert.p4.unicode=true
com.p4convert.p4.translate=true
com.p4convert.p4.charset=utf8

For Unicode conversions, set the following JVM argument:

-Dfile.encoding=UTF-8
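To confirm that the JVM actually picked up the argument, a small stand-alone check can help. The sketch below is not part of the converter; the class name EncodingCheck is hypothetical, and it only reports what the standard Java runtime sees.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the converter: reports the JVM's
// effective default encoding so the -Dfile.encoding setting can be verified.
public class EncodingCheck {

    // True when the JVM default charset is UTF-8.
    static boolean defaultIsUtf8() {
        return StandardCharsets.UTF_8.equals(Charset.defaultCharset());
    }

    public static void main(String[] args) {
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());
        System.out.println("UTF-8 default   = " + defaultIsUtf8());
    }
}
```

Run it with the same JVM and the same -Dfile.encoding=UTF-8 argument used for the conversion. On Java 18 and later the default charset is already UTF-8 unless it is overridden.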

Some Solaris and Linux conversions may need the locale set:

export LC_ALL=en_GB.UTF-8 

Once a Perforce server is switched to Unicode enabled mode (-xi), all client workspaces need to define a character set. For details see:

http://answers.perforce.com/articles/KB_Article/Internationalization-and-Localization

Note

A non-Unicode enabled Perforce Server will accept UTF16 files.

Normalisation

The platform's Unicode normalisation form is detected when the configuration file is generated. To override the detected value, set the following configuration option to NFC or NFD:

com.p4convert.p4.normalisation=NFD

The default detection is based on the following:

Platform     Normalization
Windows      NFC
Mac          NFD
*nix/*nux    NFC
Sun          NFC
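The difference between the two forms is easiest to see on a single accented character. The following is a minimal sketch using the standard java.text.Normalizer class; the class name NormalisationDemo and the sample file name are illustrative assumptions, and the code does not show how the converter itself applies the setting.

```java
import java.text.Normalizer;

// Illustrative only: shows how the same file name differs under NFC and NFD.
public class NormalisationDemo {

    // Normalise a path to the requested form, given as "NFC" or "NFD".
    static String normalise(String path, String form) {
        return Normalizer.normalize(path, Normalizer.Form.valueOf(form));
    }

    public static void main(String[] args) {
        String name = "caf\u00e9.txt"; // 'e' with acute accent as one precomposed code point

        String nfc = normalise(name, "NFC"); // keeps the precomposed character
        String nfd = normalise(name, "NFD"); // splits it into 'e' + combining accent U+0301

        System.out.println("NFC length: " + nfc.length());
        System.out.println("NFD length: " + nfd.length());
        System.out.println("equal as strings: " + nfc.equals(nfd));
    }
}
```

The two strings render identically but compare as different, which is why a path created on a Mac (NFD) may not match the same path recorded on Windows or Linux (NFC) unless both are normalised to one form.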

Subversion Properties

By default, the converter parses Subversion properties as UTF-8 strings. The conversion uses properties such as svn:log and svn:author for attributes in Perforce, so it must decode their byte sequences as UTF-8. In some data sets, Windows users may have added high-ASCII characters in one or more code pages. For these cases, this release supports the following configuration option:

com.p4convert.svn.propTextType=UNKNOWN

The following strings denote the supported character sets:

Big5         IBM424_rtl   ISO-8859-7   UTF-16LE
BINARY       ISO-2022-CN  ISO-8859-8   UTF-32BE
EUC-JP       ISO-2022-JP  ISO-8859-9   UTF-32LE
EUC-KR       ISO-2022-KR  KOI8-R       UTF-8
GB18030      ISO-8859-1   Shift_JIS    windows-1251
IBM420_ltr   ISO-8859-2   UNKNOWN      windows-1252
IBM420_rtl   ISO-8859-5   US-ASCII     windows-1254
IBM424_ltr   ISO-8859-6   UTF-16BE     windows-1256

The converter always scans the property as UTF-8 first, then falls back to the character set named by the configuration option. BINARY skips decoding, and the string <binary property> is inserted in place of the value.
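The UTF-8-first order can be sketched with the standard java.nio.charset API. This is an illustration of the fallback order only, not the converter's code: the class PropertyDecoder and its methods are hypothetical, the UNKNOWN setting is not modelled, and returning the placeholder when the fallback also fails is an assumption.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a "UTF-8 first, then configured charset" decode.
public class PropertyDecoder {

    static final String PLACEHOLDER = "<binary property>";

    // Strict decode: throws on invalid input instead of substituting U+FFFD.
    static String strictDecode(byte[] raw, Charset cs) throws CharacterCodingException {
        return cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(raw))
                .toString();
    }

    // propTextType stands in for the value of com.p4convert.svn.propTextType.
    static String decode(byte[] raw, String propTextType) {
        try {
            return strictDecode(raw, StandardCharsets.UTF_8); // first scan is always UTF-8
        } catch (CharacterCodingException notUtf8) {
            // Not valid UTF-8: fall through to the configured option.
        }
        if ("BINARY".equals(propTextType)) {
            return PLACEHOLDER; // BINARY means skip decoding
        }
        try {
            return strictDecode(raw, Charset.forName(propTextType));
        } catch (CharacterCodingException | IllegalArgumentException e) {
            // Assumption: unknown charset names and undecodable bytes
            // both fall back to the placeholder in this sketch.
            return PLACEHOLDER;
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = {0x63, 0x61, 0x66, (byte) 0xE9}; // "caf" + 0xE9, not valid UTF-8
        System.out.println(decode(latin1, "ISO-8859-1"));
        System.out.println(decode(latin1, "BINARY"));
    }
}
```

Strict decoding matters here: with Java's default replacement behaviour, invalid UTF-8 would be converted silently to U+FFFD and the fallback character set would never be tried.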