Unicode in 0.4.0 Savefiles? (Was: Unicode Font...)

Pipian · Post by **Pipian** » 07 Sep 2004 00:14

Certainly a tricky bit, and clearly not in the works until the string code is cleaned up. I'm starting this thread anyway, however, to get ideas and such...

Two primary items I'm wondering about:

1. UTF-8 or UTF-16?
2. Where are the fonts currently stored, and how will we adopt fonts that need to be larger in terms of pixels (CJK in particular)?

Pipian · Post by **Pipian** » 07 Sep 2004 00:35

Scratch needing to find it. More important is how it decides how to allocate colors on them, given that it tends to end up with two colors (shadowing) but the images only have two colors...

Hadez · Post by **Hadez** » 07 Sep 2004 19:28

Pipian wrote: Certainly a tricky bit, and clearly not in the works until the string code is cleaned up. I'm starting this thread anyway, however, to get ideas and such...

Two primary items I'm wondering about:

1. UTF-8 or UTF-16?
2. Where are the fonts currently stored, and how will we adopt fonts that need to be larger in terms of pixels (CJK in particular)?

I heard Unicode support is on the OpenTTD roadmap (expected in some future version). Images of letters are stored in some GRF files. Hmm, what's that program's name? GRF...Decode? Search the forums or, hm, Google. This program can decode GRF files (including OpenTTD's). I hope that helps.

Pipian · Post by **Pipian** » 08 Sep 2004 15:31

I know. I got that much, and I'm reconstructing the shadow bits from memory and double checking. Thus far I've got capital Greek and Cyrillic and lower-case Greek letters for the basic text done (small caps and newspaper headlines have not been touched yet).

The following three files show the entire text that I've made so far, as well as a mockup screenshot in Greek and its English equivalent (note that I've not tried to make the script "Dr", so I'm working with the Greek characters instead.) Also note that transliterations are not meant to be accurate.

The full character list has alternate characters for capital delta, capital lambda, capital tsa (ca?), capital shcha, lowercase tau, and lowercase phi. Ya is under yu.

I will post a Russian mockup as soon as I finish the lower case cyrillic letters.

Do not quote me on these translations either.

ChrisCF · Post by **ChrisCF** » 09 Sep 2004 14:17

Ooooh. Looks good, keep it up - getting it to work would be the clincher, though. Καλι τυχη (good luck)

Pipian · Post by **Pipian** » 14 Sep 2004 00:37

I've created the Cyrillic main character set and I've also finished the tiny-char set. Right now I'm slowly working on the main Latin Unicode char set as well (e.g. the U+0100 and U+0200 blocks).

elomage · Post by **elomage** » 29 Nov 2004 05:49

Any idea when your unicode subset will be part of the release?

I have a few Latvian letters I'd like to add (mostly latin letters with bars above or commas below). I could update the font image table if you tell me how.

--Leo (Latvian translation)

Bjarni · Post by **Bjarni** » 06 Dec 2004 10:08

elomage wrote:Any idea when your unicode subset will be part of the release?

when it's done

If somebody wants to do it, it would be done faster (wow, what a surprise).
The easiest way to do it would likely be to switch to UTF-8, because then the current array of chars would not need to be changed. The encode/decode would then need to be modified. Getting the length of a string is currently based on counting the number of chars (bytes). Of course, this needs to be changed too. If you want to do it, please let us know.
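
To illustrate the string-length point, here is a minimal sketch in Python (not OpenTTD code; the helper name `utf8_char_count` is made up for this example). It assumes the input is valid UTF-8 and counts characters by skipping continuation bytes, which always have the bit pattern 10xxxxxx.

```python
def utf8_char_count(data: bytes) -> int:
    """Count code points in valid UTF-8 by ignoring continuation bytes (10xxxxxx)."""
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


# "Rīga" contains one two-byte character (ī, U+012B).
encoded = "Rīga".encode("utf-8")
byte_length = len(encoded)               # what a byte-based strlen would report
char_length = utf8_char_count(encoded)   # what the text layout actually needs
```

Any code that uses the byte count as a character count (for truncation, cursor position, or width estimates) would need the second form instead.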

elomage · Post by **elomage** » 18 Dec 2004 09:04

OK, I can look into the code for support for Unicode strings and possible solutions that could still keep UTF-8 for internal purposes while using a special re-coding table. Or some other solution, maybe an extra data structure that keeps track of special characters and is used when rendering the text... Or I could investigate how painful it really is to go to pure UTF-16 and fix strlen() all over the place...

Actually, I already fixed a minor bug in strgen. It always wanted to use forward (Unix-like) slashes, and would not load lang/english.txt on Windows because of the hardcoded forward slash. Now, if it is compiled under Win32, the code is using proper (back) slashes.

The only problem is, I can not submit to SVN (no proper rights...) Help?

--Leo (Latvian translation)

DaleStan · Post by **DaleStan** » 18 Dec 2004 09:16

elomage wrote:The only problem is, I can not submit to SVN (no proper rights...) Help?

Until/unless you receive write rights to SVN, post diffs on the forums.

Bjarni · Post by **Bjarni** » 18 Dec 2004 10:11

DaleStan wrote:
elomage wrote:The only problem is, I can not submit to SVN (no proper rights...) Help?
Until/unless you receive write rights to SVN, post diffs on the forums.

NO!
The developers might miss it and then it will never reach the svn server

Upload it as a patch on sourceforge instead
HackyKid looked a bit at Unicode too, but he got busy with networking and making OTTD talk directly to a main server to get a list of active servers instead, so I guess it's OK if somebody else does it.

StavrosG · Post by **StavrosG** » 18 Dec 2004 23:56

When will the Greek characters make it into the game? I am interested in translating the game to Greek, but up to 0.3.4 this is not possible

elomage · Post by **elomage** » 19 Dec 2004 02:42

OK, I can do the Unicode port. Please tell me if there is anything already done. Specifically, I'd like to get the font image files (Pipian did some work on them already, Greek and Cyrillic fonts?) so that I do not redo anything that's done already. But I'll focus on the code and the Latvian font images (a few Latvian letters have bars on top (AEIU), kinda-asterisk on top (CSZ) and commas under (KLG), etc).

First attempt would be to translate the funny letters to similar-looking ones in strgen. This way we preserve the same UTF-8.
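
A rough sketch of that first attempt, in Python rather than in strgen itself. The function name `fold_to_base_letters` is hypothetical, and the approach (Unicode NFD decomposition, then dropping combining marks) is only one possible way to map accented letters to similar-looking plain ones.

```python
import unicodedata


def fold_to_base_letters(text: str) -> str:
    """Replace accented letters with their unaccented base letters.

    NFD splits e.g. 'ī' into 'i' + COMBINING MACRON; the combining marks
    are then dropped, leaving only the base letters.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Latvian place names lose their macrons and cedillas but stay readable.
folded = [fold_to_base_letters(name) for name in ("Rīga", "Jūrmala", "Ķekava")]
```

The result is lossy, so it only works as a stopgap until the proper glyphs can be displayed.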

Second attempt will be enabling true UTF-16, or UTF-8 with custom, language specific font image maps.

Would anybody suggest where in the code the major string processing is done? In the meantime I'll look on my own (strings.c looks like a good start), but I would not want to miss anything. Thanks.

Let me know if you'll give me SVN write access, or whether I should keep sending patches till you trust me more.

I just sent a patch on sourceforge about Strgen slash problem in Windows.

Also, I modified the VC6 project "lang" for the language file processing automation (it would not work right away and could not find the files; it looks like somebody had made a few assumptions about the default directories). I used the VC6 environment variables to point to the project and target directories instead of explicit file names, and also added a check for autodependencies. Now it recompiles only those language.txt files that have been modified. This is development-related stuff, not runtime, so I dunno if you want me to submit this patch as well.

Cheers!
--Leo

Pipian · Post by **Pipian** » 08 Jan 2005 04:20

I will try to get back to work on fixing up a full set of Latin, Greek, and Cyrillic characters that should allow for translation into all European languages. The tricky thing lies in the fact that the bitmaps (I believe) are limited in size, and I'm not certain what this size is. As a post of my Latin Extended-A work would show, the bitmap restriction tends to wreak havoc on character design. (Note that Latin Extended-A, necessary for Eastern European language translations, is not done yet, and only goes through "Latin Small Letter L With Middle Dot" (U+0140).)

And this is just the main text. The small text found in the "View All Trains" dialog (for example) would be almost impossible to make workable unique diacritics for, and the newspaper font only marginally less so (the problem stemming from only giving enough height for one pixel worth of diacritics)

Pipian · Post by **Pipian** » 08 Jan 2005 07:25

Here's what I've got so far that's truly complete. Each table is aligned as per Unicode.

Colors other than blue and red apply only to the current charset (which shows current allocation of codepoints).

Blue is allocated in the Unicode 4.0.1 standard. (There are two extra spots that will be allocated in the Cyrillic block in 4.1. These are not marked blue).

Red is not allocated in that standard to any visible character.

Orange is a character that is not representative of the character allocated to that codepoint. These would probably be best either in the dingbats and geometric shapes sections (Airplane moved to U+2708, Check Mark to U+2713, "Ballot" X to U+2717, Big Up Triangle to U+25B2, Big Down Triangle to U+25BC, Big Right Triangle to U+25B6, Small Up Triangle to U+25B4, Small Down Triangle to U+25BE) or to Private Use (the rest of them)

Cyan is a character allocated in the Windows codepages, but allocated in Unicode to control characters.

Green is a character allocated in Unicode as a visible character, but not shown in OpenTTD.

Pink are characters edited from the originals in order to be able to distinguish them from other characters in the Latin Extended A block. (mainly telling Breves from Carons from Macrons)

Note that there are several undesigned characters in the ASCII/Latin-1 blocks and a lot in the Cyrillic block.

The intent is to get the characters specified in the WGL4 character (sub)set designed, with other characters pending on judged importance.

Also note that the capital y with dieresis will be copied from the original charset to Latin Extended A.
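
For reference, the relocations suggested for the orange characters can be written down as a small lookup table. This is only an illustrative Python sketch: the dictionary name `SPECIAL_GLYPH_CODEPOINTS` and the glyph labels are informal, while the code points are the ones listed above. The standard `unicodedata` module is used to print the official character names as a sanity check.

```python
import unicodedata

# Informal glyph label -> Unicode code point proposed in the post above.
SPECIAL_GLYPH_CODEPOINTS = {
    "Airplane": 0x2708,
    "Check Mark": 0x2713,
    "Ballot X": 0x2717,
    "Big Up Triangle": 0x25B2,
    "Big Down Triangle": 0x25BC,
    "Big Right Triangle": 0x25B6,
    "Small Up Triangle": 0x25B4,
    "Small Down Triangle": 0x25BE,
}

for label, codepoint in SPECIAL_GLYPH_CODEPOINTS.items():
    # Print the code point next to its official Unicode character name.
    print(f"{label}: U+{codepoint:04X} {unicodedata.name(chr(codepoint))}")
```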

Bjarni · Post by **Bjarni** » 08 Jan 2005 10:43

just to make sure you are up to date, we switched to ISO-8859-15

Pipian · Post by **Pipian** » 08 Jan 2005 17:19

Not quite, unless you changed the charset in the past three months by removing the standard-currency sign (0xA4) replacing it with the Euro-sign (which was at 0x80), changed out 0xA6 and 0xA8 with S-with-hacek (capital and small respectively), replaced 0xBE with Capital-Y-with-dieresis (which was at 0x9F), and shuffled some of those special characters around in the orange spots to make way for OE ligature and Z-with-hacek.

In any case, the idea of these glyphs is primarily to show Unicode, for which 0x0080-0x00FF is the set of characters found in ISO-8859-1, not ISO-8859-15, which maps the characters to other locations (the Euro, for one, is in the currency block in Unicode, not at 0x00A4).
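
The byte positions where the two standard charsets disagree can be listed mechanically. The sketch below uses Python's built-in `iso-8859-1` and `iso-8859-15` codecs; it says nothing about OpenTTD's own internal table, which is what the comparison above is actually about, and the helper name `differing_bytes` is made up for the example.

```python
def differing_bytes(codec_a: str, codec_b: str) -> dict:
    """Map each byte value to (char_a, char_b) wherever the two codecs disagree."""
    diffs = {}
    for value in range(0x100):
        raw = bytes([value])
        char_a = raw.decode(codec_a)
        char_b = raw.decode(codec_b)
        if char_a != char_b:
            diffs[value] = (char_a, char_b)
    return diffs


LATIN1_VS_LATIN9 = differing_bytes("iso-8859-1", "iso-8859-15")

for value, (old, new) in sorted(LATIN1_VS_LATIN9.items()):
    print(f"0x{value:02X}: {old!r} -> {new!r}")
```

Only eight positions differ, among them 0xA4 (currency sign versus Euro sign), 0xA6 and 0xA8 (S with hacek) and 0xBE (capital Y with dieresis).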

Pipian · Post by **Pipian** » 10 Jan 2005 07:38

In a follow-up to this, I would like to say that the normal font with the WGL4 character set (plus the Drachma character and current specials) is mostly complete (all characters drawn through U+2265).

The small font has some work done with basic Greek and Cyrillic support complete, and work on the newspaper font is ongoing.

I suppose it really should be brought up, however, as to how Unicode will be implemented in the future for coding purposes (especially given impending savefile updates which will hopefully plan for future Unicode implementation)

By this, I mean to pose the question as to whether or not UTF-8 or UTF-16 will be used for internal representation of strings.

The advantage of UTF-16 is that strlen and various functions that depend on length or character position are much easier to program (each character having two bytes, at least within the Basic Multilingual Plane), but conversely, it has a lot more nulls, and thus cannot be represented easily as a default string, and will be longer, especially for languages that use much Latin text.

The advantage of UTF-8 is that it is considerably easier to put into effect now by only handling accented characters (i.e. all characters >= 0x80) with proper UTF-8 escapes, and thus merely use an escaped character mapping to the standard internal representation until proper Unicode support is coded into OpenTTD (of course, special consideration would have to be given to escapes of characters not in the Latin-1 block, such as the Euro symbol, and the OTTD special characters, such as the check, X, vehicles, and triangles). The downside, however, is that since a character can be represented by multiple escape characters, strlen and other functions dependent on character position are much harder to implement.
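
The size and null-byte trade-off can be made concrete with a small Python sketch. The helper name `encoding_stats` is invented for the example, and little-endian UTF-16 without a byte order mark is an arbitrary choice here.

```python
def encoding_stats(text: str) -> dict:
    """Compare the UTF-8 and UTF-16 (little-endian, no BOM) forms of a string."""
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-le")
    return {
        "chars": len(text),
        "utf8_bytes": len(utf8),
        "utf16_bytes": len(utf16),
        # Zero bytes would terminate a plain C string early.
        "utf16_nul_bytes": utf16.count(0),
    }


english = encoding_stats("Train")
russian = encoding_stats("Поезд")
```

For the ASCII-only string, UTF-16 doubles the size and every second byte is zero; for the Cyrillic string both encodings take two bytes per character.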

Perhaps the best implementation would, for now, convert between UTF-8 and the current encoding (so that nothing in the current code has to change other than converting internal representations to and from UTF-8 when saving and loading), and move to UTF-16 for future internal representations.

Nevertheless, I believe it is important to settle the Unicode approach now, while the save-file standard is being rewritten, so that backwards compatibility is more feasible, rather than having to break compatibility a second time because the text fields lack UTF-8 escapes.

Pipian · Post by **Pipian** » 10 Jan 2005 17:32

A tentative mapping from OpenTTD to Unicode is attached. All control characters have been escaped using the standard control character 0x009F, which stands for Application Program Command sequence, and terminated with 0x009C, which is the String Terminator sequence to signify the termination of the control sequence.
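
A minimal sketch of how such escaped control sequences could be written and read back, in Python. The code points U+009F (Application Program Command) and U+009C (String Terminator) are the ones named above; using the control code's name as the payload between them, and the helper names `escape_control` and `split_controls`, are assumptions made purely for illustration and are not taken from the attached mapping.

```python
APC = "\u009f"  # Application Program Command: opens a control sequence
ST = "\u009c"   # String Terminator: closes it


def escape_control(name: str) -> str:
    """Wrap a control code name (hypothetical payload format) in APC ... ST."""
    return f"{APC}{name}{ST}"


def split_controls(text: str) -> list:
    """Split text into ("text", ...) and ("control", ...) parts."""
    parts = []
    pos = 0
    while pos < len(text):
        start = text.find(APC, pos)
        if start == -1:
            parts.append(("text", text[pos:]))
            break
        if start > pos:
            parts.append(("text", text[pos:start]))
        end = text.find(ST, start + 1)
        if end == -1:
            raise ValueError("unterminated control sequence")
        parts.append(("control", text[start + 1:end]))
        pos = end + 1
    return parts


sample = "Cost: " + escape_control("BLACK") + "100"
parsed = split_controls(sample)
```

Control codes that take arguments (such as SETXY) would need a payload format that carries those arguments as well, which this sketch does not attempt.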

All Unicode characters except the block from 0xB4 to 0xB9 are mapped to proper characters. The latter 5 are mapped to the Private Use Area.

Only thing I'm not sure on is how to distinguish BLACK from SETXY.

Pipian · Post by **Pipian** » 12 Jan 2005 04:01

Some musing about displaying these characters.

To my understanding, these characters would almost certainly have to be implemented as newgrf, which would thus require replacing existing character positions in the old grf (i.e. you could not logically display all of Cyrillic, all of Latin and Latin-1, and some Greek at the same time).

The fact that the language is chosen at run-time (rather than before) tells me that such newgrf support would require dialog-language-dependent dynamic overlaying of characters.

Furthermore, unicode support would thus be complicated by not only introducing a mapping from old files to UTF-8, but also by mapping from internal UTF-16 to the limited ~224 characters which we can actually overlay (which is further restricted by certain characters being required by necessity, such as all currency signs, characters typable from the keyboard, and the TTD special characters.)

In reality, this slims us to the approximately 96 characters in the unmapped 0x80-0x9F block and the mapped 0xC0-0xFF block. (Note that in this case I'm referring to mapping to display characters, not mapping to a character set and then generating the text, but rather mapping during the process of displaying)

Note that mapping 0x80-0x9F is most likely a non-issue, given my assumption that these control characters are not actually used in the final display, and are used only as style guidelines when formatting text. If these characters are an integral part of any text-mapping process (i.e. images in the 0x80-0x9F block cannot be characters that are used when mapping, lest control characters break), then we're left with only the 64 characters in that last block.
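
The slot counts quoted above follow from simple range arithmetic, sketched here in Python. Reading the "~224" figure as the range 0x20-0xFF is an assumption; the other two ranges are the ones named in the text.

```python
def block_size(first: int, last: int) -> int:
    """Number of code positions in an inclusive range."""
    return last - first + 1


displayable_slots = block_size(0x20, 0xFF)   # assumed source of the "~224" figure
control_block = block_size(0x80, 0x9F)       # unmapped 0x80-0x9F block
upper_block = block_size(0xC0, 0xFF)         # mapped 0xC0-0xFF block
overlay_slots = control_block + upper_block  # the "approximately 96" figure
```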

My proposition, in this case (to make mapping somewhat easier), would be to map all required special characters that are not present in the default ASCII charset to the 0x80-0x9F block (and of course reinstate 0x7B-0x7C as displayable), and then use various mappings on top of ASCII (particularly ISO 8859 variants) as guidelines for how dynamic overlaying would be performed (i.e. replacing é with щ and so forth).
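
As a sketch of what using the ISO 8859 variants as guidelines could look like, the Python snippet below builds one overlay table per charset for the 0xC0-0xFF block, using Python's built-in `iso-8859-1` and `iso-8859-5` codecs. The function name `overlay_table` is hypothetical, and nothing here touches actual sprites; it only shows which character each byte position would display under each overlay.

```python
def overlay_table(charset: str, first: int = 0xC0, last: int = 0xFF) -> dict:
    """Map each byte position in the overlay block to the character it shows."""
    return {value: bytes([value]).decode(charset) for value in range(first, last + 1)}


LATIN_OVERLAY = overlay_table("iso-8859-1")
CYRILLIC_OVERLAY = overlay_table("iso-8859-5")

# The same byte position shows a different glyph depending on the active overlay.
position = 0xE9
latin_glyph = LATIN_OVERLAY[position]
cyrillic_glyph = CYRILLIC_OVERLAY[position]
```

With these two tables, position 0xE9 is é under the Latin overlay and щ under the Cyrillic one, matching the example above.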

This would then leave the only tricky issue as requiring multiple town-name generation lists for each display-language/town-language option (you couldn't easily use Polish names with their diacritics with the Cyrillic charset, as you may not have enough space to map the necessary characters). I would be more than willing to set up the necessary character maps for this. It'd just require knowing what characters are to be allowed/needed for each town-name set, and allowed/needed for each language.

The only misgiving in such a system would be that non-alphabetic systems (such as Chinese, Japanese, Hindi, etc.) could never be displayed due to the sheer number of characters that would require replacement (not to mention font size issues: the smallest pixel font able to display Chinese characters is probably no less than 12px, more than twice the size of the small-style 5-6px characters). They would at the least require a newgrf system so robust that it could selectively replace sprites based on the character requested, rather than just replacing a preset list.

In summary, my view of how Unicode would be best implemented would require several steps (correct me if it's too improbable)

1. Simple replacement of external strings in lng and sav files with UTF-8 (or a preselected endian form of UTF-16) strings, mapping back to the old OpenTTD charmap until Unicode issues can be addressed more directly. This is perhaps easiest and can be done by merely making UTF-8 or UTF-16 standard in savefiles, and using two functions (to and from the internal representation) for mapping purposes. This could probably be done without too much hassle in a couple of hours of coding, if that, as all that would be affected is saving and loading files.

2. Implement Unicode on a per-language basis, with some dynamic runtime overlaying of sprites using newgrf to (hopefully) support any combination of display-language and town-name language. This would be a bit more labor-intensive. This would require:
a) ensuring that newgrf can replace sprites in-game
b) tying such dynamic replacement to dialog language selection
c1) as well as town-name selection
c2) or making town-name selection display-language dependent
d) ensuring that UTF-8/16 is used properly in all strings in memory
e) ensuring that such strings can be properly mapped and handled in whatever function displays text

3. (hopefully, with or without #2) Implement complete Unicode support by eschewing newgrf overlays altogether, for text at least, and hopefully allowing for drastically different pixel sizes for fonts. This is probably the most labor-intensive job, and may or may not be done (with or without #2) depending on how much work people want to put into Unicode support.
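
The two mapping functions from step 1 could look roughly like the Python sketch below. It is not OpenTTD code: the names `internal_to_utf8`, `utf8_to_internal` and `SPECIALS` are invented, the table only contains the two positions mentioned earlier in this thread (Euro sign at 0x80, capital Y with dieresis at 0x9F), and treating every other internal byte as its Latin-1 equivalent is an assumption. A real table would also have to cover the control codes and the remaining special glyphs.

```python
# Internal byte -> Unicode character, for positions that differ from Latin-1.
# Only the two positions mentioned earlier in the thread are listed here.
SPECIALS = {0x80: "\u20ac", 0x9F: "\u0178"}
REVERSE_SPECIALS = {char: byte for byte, char in SPECIALS.items()}


def internal_to_utf8(raw: bytes) -> bytes:
    """Convert an internal single-byte string to UTF-8 for saving."""
    text = "".join(SPECIALS.get(byte, chr(byte)) for byte in raw)
    return text.encode("utf-8")


def utf8_to_internal(data: bytes, fallback: int = ord("?")) -> bytes:
    """Convert UTF-8 from a savefile back to the internal single-byte form."""
    out = bytearray()
    for char in data.decode("utf-8"):
        if char in REVERSE_SPECIALS:
            out.append(REVERSE_SPECIALS[char])
        elif ord(char) < 0x100 and ord(char) not in SPECIALS:
            out.append(ord(char))
        else:
            # No glyph for this character in the old charmap.
            out.append(fallback)
    return bytes(out)


internal = b"Caf\xe9 \x80 5"
saved = internal_to_utf8(internal)
loaded = utf8_to_internal(saved)
```

Strings that only use characters from the old charmap round-trip unchanged, while anything outside it falls back to a placeholder until proper Unicode display exists.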
