Unicode

From X-Plane SDK

Unicode Background

There are four classes of character encoding to consider:

  • [ASCII] - 7-bit character codes in the 0-127 range, normally stored one per 8-bit byte. ASCII is the lowest common denominator; all of the other encodings listed here assign their first 128 codes to match ASCII.
  • 8-bit "code pages" - a number of encodings use the first 128 codes from ASCII, and then fill the remaining 128 codes (128-255) with additional symbols. The most common ones seen by X-Plane users are [ISO-Latin-1] on Windows (in practice usually the closely related Windows-1252) and [MacRoman] on Mac. Both of these cover the Western European Latin-script languages well but contain no Cyrillic characters.
  • [UTF8] - Unicode characters encoded as one to four 8-bit bytes each. UTF8 maps such that:
    • Pure-ASCII strings are the same in UTF8 and 8-bit ASCII.
    • The byte sequence for one UTF8 character never appears inside the byte sequence for another. This means that string match, parse and search can be performed byte by byte even if multi-byte codes are used.
    • The UTF8 byte stream is 'marked': continuation bytes always have the bit pattern 10xxxxxx and lead bytes never do, so given a big pile of bytes, we can locate the first byte in any multi-byte character sequence.
  • [UTF16] - Unicode characters encoded as one or two 16-bit code units each. UTF16 maps such that:
    • All Unicode characters in the 'basic multilingual plane' take one 16-bit code unit.
    • All ASCII characters map to the low 128 codes (and will thus appear in memory as a null byte paired with an ASCII byte).
    • All other Unicode characters are formed by using 'surrogate pairs', which combine two 16-bit code units.
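
The 'marked' property of UTF8 and the surrogate-pair rule of UTF16 can both be sketched in a few lines of C. This is an illustration only; the function names are made up for this page and are not part of the SDK.

```c
#include <stddef.h>
#include <stdint.h>

/* A UTF8 continuation byte always has the bit pattern 10xxxxxx. */
static int utf8_is_continuation(unsigned char byte)
{
    return (byte & 0xC0) == 0x80;
}

/* Given an index anywhere inside a UTF8 buffer, step back to the first
 * byte of the character that contains it. */
size_t utf8_char_start(const char *buf, size_t index)
{
    while (index > 0 && utf8_is_continuation((unsigned char)buf[index])) {
        index--;
    }
    return index;
}

/* Encode one Unicode code point as UTF16. Returns the number of 16-bit
 * code units written (1 or 2), or 0 if the code point cannot be encoded. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* reserved for surrogates */
    if (cp > 0x10FFFF) return 0;                 /* beyond the Unicode range */
    if (cp < 0x10000) {                          /* basic multilingual plane */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                               /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

For example, in the bytes of "aзz" (where з is the two bytes D0 B7), calling utf8_char_start with index 2 steps back to index 1, the lead byte of з.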

Not all of these encodings are equally flexible. Generally:

  • Every other encoding can represent all ASCII strings.
  • The two Unicode encodings (UTF8, UTF16) can represent any string that the other encodings can, as well as each other's strings.
  • MacRoman, ISO-Latin-1 and other 'code pages' are limited: they only contain a finite number of "international characters" and thus can represent only a subset of all possible strings.
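
To illustrate why the Unicode encodings can always absorb a code page string, here is a minimal sketch that converts ISO-Latin-1 to UTF8. It relies on the fact that each ISO-Latin-1 byte value equals its Unicode code point; the function name is hypothetical, and other code pages (MacRoman, Windows code pages) would need a lookup table instead.

```c
#include <stddef.h>

/* Convert an ISO-Latin-1 string to UTF8. Every Latin-1 byte has a Unicode
 * code point, so the conversion never fails for lack of a character; it
 * only fails if the output buffer is too small.
 * Returns the number of bytes written (excluding the terminator), or -1. */
long latin1_to_utf8(const char *in, char *out, size_t out_size)
{
    size_t n = 0;
    for (; *in != '\0'; in++) {
        unsigned char c = (unsigned char)*in;
        if (c < 0x80) {
            /* ASCII: one byte, unchanged */
            if (n + 1 >= out_size) return -1;
            out[n++] = (char)c;
        } else {
            /* 0x80-0xFF: two UTF8 bytes, 110000xx 10xxxxxx */
            if (n + 2 >= out_size) return -1;
            out[n++] = (char)(0xC0 | (c >> 6));
            out[n++] = (char)(0x80 | (c & 0x3F));
        }
    }
    if (n >= out_size) return -1;
    out[n] = '\0';
    return (long)n;
}
```

The reverse direction is the one that can fail: a UTF8 string containing a Cyrillic character has nothing to map to in ISO-Latin-1.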

Unicode and OS APIs

OS X / Macintosh

OS X natively runs in UTF8; Unix-style APIs will natively return UTF8 paths.

The legacy OS 9/Carbon FileSpec routines typically return MacRoman (or the current system script), and thus cannot represent all possible strings.

The CoreFoundation APIs contain conversion utilities - that is, a CFString can have its contents extracted to any encoding scheme (although extraction will fail, for example, if you try to extract Chinese characters into an ASCII format).

Windows

The Win32 API contains two sets of APIs: the "narrow" APIs (e.g. CreateFileA) and the "Wide" APIs (e.g. CreateFileW).

  • The narrow APIs will accept characters from the current code page, which will vary depending on the user's computer. Thus these APIs can handle some non-ASCII characters (and hopefully the code page is set appropriately for the user's language preference) but not all strings will be available.
  • The wide APIs will accept UTF16 strings.

The underlying OS runs in UTF16, thus the user can create files and directories that cannot be accessed with the narrow APIs.
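
One common way to reach every file from code that holds UTF8 strings is to convert to UTF16 and call the wide APIs. The sketch below uses the Win32 MultiByteToWideChar call and the C runtime's _wfopen; the function name open_utf8_path is made up for illustration. On other platforms it falls back to fopen(), on the assumption that the 8-bit file APIs there already take UTF8.

```c
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Open a file whose path is given as UTF8. Returns NULL on failure. */
FILE *open_utf8_path(const char *utf8_path, const char *mode)
{
#ifdef _WIN32
    wchar_t wide_path[MAX_PATH];
    wchar_t wide_mode[16];

    /* Convert UTF8 to UTF16. A return value of 0 means the conversion
     * failed (for example, the result did not fit in the buffer). */
    if (MultiByteToWideChar(CP_UTF8, 0, utf8_path, -1, wide_path, MAX_PATH) == 0)
        return NULL;
    if (MultiByteToWideChar(CP_UTF8, 0, mode, -1, wide_mode, 16) == 0)
        return NULL;

    /* _wfopen is the wide-character counterpart of fopen. */
    return _wfopen(wide_path, wide_mode);
#else
    /* Assumption: the platform's 8-bit file APIs already accept UTF8. */
    return fopen(utf8_path, mode);
#endif
}
```

The fixed MAX_PATH buffer keeps the sketch short; a real plugin would probably size the buffer from the length that MultiByteToWideChar reports.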

Linux

On Linux, UTF8 is the 'native' encoding for all strings, and is the format accepted by system routines like fopen().

Ben says: can anyone verify the above notes or fix it?

Unicode, X-Plane, and the SDK

The level of support for Unicode in X-Plane and the SDK varies with X-Plane's version; this documentation applies only to the latest patches of each version.

All X-Plane SDK APIs are based on 8-bit characters; thus the only possible inputs are ASCII, some kind of code page or UTF8; the XPLM APIs will not accept 16-bit characters.

Note: X-Plane has never, in any shipping version, correctly handled local code page characters for string drawing; X-Plane's font was revised at the same time that the XPLMDrawString API started to accept UTF8 strings. Therefore plugins should always draw with either UTF8 (if targeting X-Plane 9 or newer) or ASCII (for older versions of X-Plane).
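
A plugin that keeps its strings in UTF8 but must also run on older versions needs an ASCII fallback before drawing. The following is one possible sketch, not an SDK routine: the name utf8_to_ascii_fallback is hypothetical, and replacing each non-ASCII character with '?' is just one reasonable policy.

```c
#include <stddef.h>

/* Copy a UTF8 string into out, replacing every non-ASCII character with a
 * single '?'. The output is truncated if out_size is too small.
 * Returns the number of bytes written, excluding the terminator. */
size_t utf8_to_ascii_fallback(const char *utf8, char *out, size_t out_size)
{
    size_t n = 0;
    if (out_size == 0) return 0;
    for (; *utf8 != '\0' && n + 1 < out_size; utf8++) {
        unsigned char c = (unsigned char)*utf8;
        if (c < 0x80) {
            out[n++] = (char)c;   /* plain ASCII: copy unchanged */
        } else if ((c & 0xC0) != 0x80) {
            out[n++] = '?';       /* lead byte: one '?' per character */
        }
        /* continuation bytes (10xxxxxx) are skipped */
    }
    out[n] = '\0';
    return n;
}
```

On X-Plane 9 or newer the original UTF8 string would be passed to XPLMDrawString unchanged; only on older versions would the fallback copy be drawn.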

X-Plane 6, 7, and 8

X-Plane 6, 7 and 8 always operate using the local code page for the OS-native 8-bit character APIs - typically MacRoman on OS X, ISO-Latin-1 (or the user's code page) on Windows, and UTF8 on Linux.

In these versions, X-Plane cannot function if there are characters in any of its file paths that are outside the local code page. (For example, on OS X, the sim will fail if Cyrillic characters are present in the name of the folder containing X-Plane.)

The drawable character set in these versions of X-Plane is strictly ASCII; non-ASCII characters will fail to print.

X-Plane 9

X-Plane 9 is Unicode-aware, but the XPLM APIs are not. Therefore the sim can operate correctly given any file paths with any character set, but plugin loading will fail unless the paths to plugins are fully representable in the local code page. For example, on OS X French diacritical marks like é will not stop plugin operation, but a Cyrillic character like з will stop plugin loading.
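
An installer or diagnostic tool could check in advance whether a path is likely to cause this problem. The sketch below decodes a UTF8 path and tests whether every character fits in ISO-Latin-1 (U+0000 to U+00FF). ISO-Latin-1 is used here only as a stand-in: the real local code page (MacRoman, or a Windows code page) has a different repertoire, and the function names are hypothetical.

```c
#include <stdint.h>

/* Decode one UTF8 character starting at *p and advance *p past it.
 * Returns the code point, or 0xFFFFFFFF on malformed input.
 * (Simplified: overlong encodings are not rejected.) */
static uint32_t utf8_next(const unsigned char **p)
{
    const unsigned char *s = *p;
    uint32_t cp;
    int extra;

    if (s[0] < 0x80)                { cp = s[0];        extra = 0; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; extra = 1; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; extra = 2; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; extra = 3; }
    else return 0xFFFFFFFFu;        /* stray continuation or invalid byte */

    s++;
    while (extra-- > 0) {
        /* Each following byte must be 10xxxxxx; this also stops at '\0'. */
        if ((*s & 0xC0) != 0x80) return 0xFFFFFFFFu;
        cp = (cp << 6) | (uint32_t)(*s++ & 0x3F);
    }
    *p = s;
    return cp;
}

/* Returns 1 if every character of the UTF8 path fits in ISO-Latin-1,
 * 0 if some character does not (or the input is malformed). */
int path_fits_latin1(const char *utf8_path)
{
    const unsigned char *p = (const unsigned char *)utf8_path;
    while (*p != '\0') {
        if (utf8_next(&p) > 0xFF) return 0;
    }
    return 1;
}
```

With this check, a plugin folder named with é passes, while one named with з is flagged.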

X-Plane 9 interprets strings to XPLMDrawString as UTF8; X-Plane's native font can handle Latin, Cyrillic, and Greek characters.

Proposed Future Behavior

(The following is a proposal for possible X-Plane Unicode support; it is not currently available.)

The SDK will correctly load all plugins no matter what characters are in their file paths.

The SDK will contain a new feature selector (available via the 2.0 XPLMEnableFeature API) that causes the XPLM APIs to treat all file paths as UTF8 rather than the local code page; the default behavior will match X-Plane 9. If a plugin 'opts in' by setting the feature, then it will receive UTF8 paths and be able to read any file path.

(Note that if a plugin is in an unreadable file path for the local code page and does not opt for UTF8, routines like XPLMGetPluginInfo and XPLMGetSystemPath may return bogus results.)

Ben's random thoughts on Unicode:

General Approach

Allow UTF8 in place of ASCII strings for char* based APIs.

Unicode Drawing

  • X-Plane will natively support some range of Unicode characters.
  • Pass strings to X-Plane as UTF8 for drawing; this allows for any character set.
  • How does X-Plane publish the range of available characters? Perhaps by the capabilities system? (Do you need to enable support for a char range, or are they no-op capabilities?)
  • New global message when sim language is changed.
  • Open issue: six Mac-Roman characters are supported - confirm apps aren't using this obscure fact.

Unicode and the File System

The SDK's default file path handling will be the historical standard:

  • Mac-Roman HFS paths on OS X.
  • Windows code page on Windows.
  • UTF8 on Linux.

For Mac and Windows, a new capability will advertise support for UTF8 paths. Plugins that enable this capability will then get UTF8 paths. (This will have the advantage of also giving Mac clients POSIX paths, which are more convenient than HFS.)

Keyboard Input

There are fundamentally two separate problems: the keyboard as a 102-button joystick, for which the virtual key system was invented (virtual keys represent physical pressable things and are not confounded by modifiers) and how to type meaningful text (for which character codes apply).

Virtual Keys

Problems with the current virtual key system in a Unicode/foreign keyboard environment:

  • dead keys (euro-native keys on Mac)
  • ambiguous match-up (partial match from ASCII, partial hard coded)
  • incorrect match-up (ASCII on Russian Windows keyboard)

Idea:

  • use unused enum range to map to raw keys 0-127, e.g. vkey 4 = unidentified raw key 50
  • names guessed from key layout config info
  • keys mapped to defaults based on ASCII translation
  • open issue: how to deal with num pad ambiguities, etc.?
  • we have to use a three-level system:

  1. first try to map known ASCII output back to the vkey via the key map (tracks 'a' through the azerty swizzle)
  2. second, have a table of known special keys (e.g. F1-F24, escape, arrows)
  3. finally, simply pick an unused slot and pull an "unidentified vkey" - try to use system resources to make a good name.
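
The three levels could be sketched roughly as follows. Everything here is hypothetical: the enum names, the numeric values, the raw key codes in the table and the size of the "unidentified" range are placeholders chosen for illustration, not the SDK's actual virtual key definitions.

```c
#include <stddef.h>

/* All names and numeric values below are hypothetical. */
enum {
    VK_NONE = -1,
    VK_ESCAPE = 0x1B,
    VK_0 = 0x30,                 /* digits: assumed to share ASCII codes */
    VK_A = 0x41,                 /* letters: assumed to share ASCII codes */
    VK_F1 = 0x70,
    VK_UNIDENTIFIED_BASE = 0xE0, /* assumed start of an unused enum range */
    VK_UNIDENTIFIED_COUNT = 16
};

struct special_key { int raw_key; int vkey; };

/* Level 2 table: keys that produce no ASCII output (raw codes made up). */
static const struct special_key special_keys[] = {
    { 53,  VK_ESCAPE },
    { 122, VK_F1 },
};

/* Level 3 state: which raw key owns each "unidentified" slot. */
static int unidentified_raw[VK_UNIDENTIFIED_COUNT];
static int unidentified_used = 0;

/* Map a raw key plus the ASCII character the current layout produces for
 * it (0 if none) to a virtual key. */
int map_raw_key(int raw_key, char ascii_output)
{
    size_t i;
    int slot;

    /* Level 1: follow the layout's ASCII output back to a vkey, so 'a'
     * is found wherever the layout (e.g. AZERTY) has placed it. */
    if (ascii_output >= 'a' && ascii_output <= 'z')
        return VK_A + (ascii_output - 'a');
    if (ascii_output >= 'A' && ascii_output <= 'Z')
        return VK_A + (ascii_output - 'A');
    if (ascii_output >= '0' && ascii_output <= '9')
        return VK_0 + (ascii_output - '0');

    /* Level 2: table of known special keys. */
    for (i = 0; i < sizeof special_keys / sizeof special_keys[0]; i++) {
        if (special_keys[i].raw_key == raw_key)
            return special_keys[i].vkey;
    }

    /* Level 3: reuse the slot already given to this raw key, or take the
     * next unused one. */
    for (slot = 0; slot < unidentified_used; slot++) {
        if (unidentified_raw[slot] == raw_key)
            return VK_UNIDENTIFIED_BASE + slot;
    }
    if (unidentified_used == VK_UNIDENTIFIED_COUNT)
        return VK_NONE;          /* out of spare slots */
    unidentified_raw[unidentified_used] = raw_key;
    return VK_UNIDENTIFIED_BASE + unidentified_used++;
}
```

The sketch leaves out the naming step (asking the system for a readable key name) and the num pad ambiguity noted as an open issue.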

Text Input

Legacy APIs are incapable of receiving Unicode because the "char" field is only one byte wide. Existing plugins will continue to receive only ASCII characters in the "char" field.

Extended function calls can let plugins register for Unicode callbacks, providing a single UTF32 (32-bit) character code.
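
Since the rest of the SDK would take UTF8, a plugin receiving UTF32 character codes would typically re-encode each one before appending it to a text buffer. Here is a minimal sketch of that step; the text_field structure and the text_field_on_char callback are hypothetical, as no such callback exists in the SDK today.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one UTF32 code point as UTF8. Returns the number of bytes
 * written to out (1 to 4), or 0 if the value is not a valid character. */
size_t utf8_encode(uint32_t cp, char out[4])
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate halves */
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (char)(0xF0 | (cp >> 18));
        out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical plugin-side text field holding UTF8 text. */
struct text_field {
    char   text[256];
    size_t length;
};

/* Hypothetical callback body: append one typed character.
 * Returns 1 if the character was added, 0 if it was rejected. */
int text_field_on_char(struct text_field *field, uint32_t utf32_char)
{
    char bytes[4];
    size_t i;
    size_t n = utf8_encode(utf32_char, bytes);

    if (n == 0 || field->length + n >= sizeof field->text)
        return 0;
    for (i = 0; i < n; i++)
        field->text[field->length++] = bytes[i];
    field->text[field->length] = '\0';
    return 1;
}
```

Delivering one whole code point per callback means the plugin never sees half of a multi-byte sequence, which is the main reason to prefer UTF32 here over sending UTF8 bytes one at a time.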

NOTE: in the long term we need input-method compatibility between the sim and the SDK, but we don't even have this in the sim yet.