Japanese_Localization


Preferred settings (in chiq.cf):

$locale = 'Japanese';
$charset = 'utf-8';
Laurent_Safa? (Tuesday, January 21, 2003, 07:57):
Hi,
I have been evaluating various wiki engines to start a team-wide message board at my company, and chiq_chaq appears to be the best so far in terms of ease of use, ease of installation, interface layout, and more (the list of good points is too long to write down). Furthermore, it almost supports Japanese characters :) so we can mix English and Japanese content with almost no problem, except for some two-byte KATAKANA characters that confuse chiq.pl.

Let me explain. I was using the following settings:


$locale = 'Japanese';
$charset = 'shift-jis';
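
Under these settings, the second byte of certain two-byte Shift-JIS katakana characters is itself one of chiq_chaq's span markers (^, _ or `), which is what confuses the parser. A minimal illustration (using the byte values noted in the patch below):

# Katakana TA is 0x83 0x5E in Shift-JIS; its trailing byte is ASCII '^',
# chiq_chaq's superscript marker.
my $ta = "\x83\x5E";
printf "%02X %02X\n", map { ord } split //, $ta;             # prints: 83 5E
print "trailing byte is '^'\n" if substr($ta, 1, 1) eq '^';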

TEMPORARY SOLUTION

As a fix, I modified the function convert_span_patterns in Format.pm (see the code below). The only solution I found was to temporarily replace the problematic characters with special tags of my own, so as to avoid confusion during the pattern conversion. Sorry for the very C-ish code; it is my first try with Perl, so I am sure there are much better ways to do the same thing.


### convert_span_patterns() # Apply a set of pattern conversions. ##############
###   $_ - string with text to convert
sub convert_span_patterns {

    # Replace some problematic katakana characters with a temporary tag
    # to avoid collisions with chiq_chaq's standard escape characters ^ _ and `
    s/\203\136/--TA--/g;  # katakana TA = 0x83 ^
    s/\203\140/--TI--/g;  # katakana TI = 0x83 `
    s/\203\137/--DA--/g;  # katakana DA = 0x83 _

    # Apply chiq-chaq's standard pattern conversion
    foreach my $pattern (keys(%span_patterns)) {      # format character spans
        s!$pattern!&wrap_tag($1,$span_patterns{$pattern})!egms;
    }

    # Restore katakana from temporary tags --TA--, --TI-- and --DA--
    s/--TA--/\203\136/g;
    s/--TI--/\203\140/g;
    s/--DA--/\203\137/g;
}

QUESTION

Could this code be inserted into the standard distribution of chiq_chaq, so that I wouldn't need to patch my install the next time you release a new version? (If you did, I believe you would execute the additional code only for $locale = 'Japanese'.)

Anyway, thanks for the great tool.
Regards.

Yonat_Sharon (Tuesday, January 21, 2003, 09:21):
Wow, this is wonderfully fascinating!
I wonder: Do words_with_underscores work with these characters? Do all characters work well in page titles? I made some improvements in this area, so if you encountered problems you may want to test the new version.

As for the code change, I would rather put it in the function span_delimited in Format.pm – please try changing its last line to:

qr{(?:(?<=_|[^\w\|\203])|^)$delimiter$delimited_span$delimiter(?=_|[^\w\|]|$)};
(I.e., add \203 to the characters that can't precede a marked span.)
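
The intent can be checked in isolation (a sketch with made-up strings, instantiating the pattern with ^ as the delimiter):

my $span = qr{(\S(?s:.*?\S)??)};
my $old  = qr{(?:(?<=_|[^\w\|])|^)\^$span\^(?=_|[^\w\|]|$)};      # current lookbehind class
my $new  = qr{(?:(?<=_|[^\w\|\203])|^)\^$span\^(?=_|[^\w\|]|$)};  # with \203 added
my $text = "\203^sup^";   # katakana TA (\203 ^) followed by "sup^"
print "old: ", ($text =~ $old ? "span '$1'" : "no span"), "\n";   # span 'sup' -- eats TA's second byte
print "new: ", ($text =~ $new ? "span '$1'" : "no span"), "\n";   # no span -- TA stays intact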

Please tell me how this turns out. Thanks!

Yonat_Sharon (Tuesday, January 21, 2003, 18:07):
Also:
Would you consider distributing the Japanese localization at SourceForge?
Would you like me to send you the almost-1.5 version so that you can try it out, and also translate some additions?
Laurent_Safa? (Wednesday, January 22, 2003, 08:44):
Yonat,
I have tried your suggested code change and it works fine with \203 ` and \203 _ but it fails with \203 ^ when surrounded by ^ to make superscript.

→ The sequence ^ \203 ^ ^ doesn't parse correctly and results in the following HTML code:

    "<sup>\203</sup>^"
⚠ This problem doesn't occur with the initial solution I proposed on the 21st.

As for your question, I will definitely distribute the Japanese localization of the UI at SourceForge (I think this can be done sometime next month). And I am interested in trying out the almost-1.5 version; I have written my e-mail address at laurent_safa? so you can mail it to me.

Yonat_Sharon (Wednesday, January 22, 2003, 10:36):
Here is a fix for this problem; I hope this one works:

sub span_delimited { # pattern for text enclosed between two delimiter chars
    my $delimiter = quotemeta($_[0]);
    my $delimited_span = qr{(\S(?s:.*?\S)??)};
    qr{(?:(?<=_|[^\w\|$delimiter\203])|^)$delimiter$delimited_span$delimiter(?=_|[^\w\|$delimiter]|$)};
}
Laurent_Safa? (Thursday, January 23, 2003, 07:52):
This solution has the same problem as the former one: it seems to fail when processing nested sequences of escape characters. Indeed, your patch checks the character before the opening delimiter, but not the one before the closing delimiter. This may be the problem in sequences such as ^ \203 ^ ^ .
I think we need some rule like "\203 can't precede either the opening or the closing delimiter of a marked span", but I don't know if that can be expressed in one line of Perl. We may be stretching Perl to its limits (well, the limits of the Perl I know are very limited indeed :) ).
I tried the following, but it didn't work either (it adds \203 to the right-hand lookahead as well):

sub span_delimited { # pattern for text enclosed between two delimiter chars
    my $delimiter = quotemeta($_[0]);
    my $delimited_span = qr{(\S(?s:.*?\S)??)};
    qr{(?:(?<=_|[^\w\|$delimiter\203])|^)$delimiter$delimited_span$delimiter(?=_|[^\w\|$delimiter\203]|$)};
}
Yonat_Sharon (Thursday, January 23, 2003, 08:48):
Another try:
Try changing the second line of span_delimited above to:

my $delimited_span = qr{(\S(?s:.*?[^\s\203])??)};
Laurent_Safa? (Thursday, January 23, 2003, 10:18):
Great, that worked just fine! Now span_delimited looks like this:


sub span_delimited { # pattern for text enclosed between two delimiter chars
    my $delimiter = quotemeta($_[0]);
    my $delimited_span = qr{(\S(?s:.*?[^\s\203])??)};
    qr{(?:(?<=_|[^\w\|$delimiter\203])|^)$delimiter$delimited_span$delimiter(?=_|[^\w\|$delimiter\203]|$)};
}
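
As a minimal check (with ^ as the delimiter), the troublesome sequence ^ \203 ^ ^ now yields a span whose content is the intact two-byte pair:

my $delimiter = quotemeta('^');
my $delimited_span = qr{(\S(?s:.*?[^\s\203])??)};
my $pat = qr{(?:(?<=_|[^\w\|$delimiter\203])|^)$delimiter$delimited_span$delimiter(?=_|[^\w\|$delimiter\203]|$)};
if ("^\203^^" =~ $pat) {
    printf "span bytes: %02X %02X\n", map { ord } split //, $1;   # 83 5E = katakana TA, intact
}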

However, I discovered other characters that are not properly handled by this patch. The explanation: \203 is not the only such lead byte :( . I have found a description of Japanese encodings that explains the situation very well in English, with some illustrations.
Reading this explanation, it seems to me that we need to replace \203 with the whole range [0x80-0xA0, 0xE0-0xEF] or, as an approximation, anything above 128.
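
In regex terms, that rule would presumably be a character class like this (a sketch, using the ranges above in octal):

my $lead_byte = qr{[\200-\240\340-\357]};   # 0x80-0xA0 and 0xE0-0xEF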

Adee_Ran (Thursday, January 23, 2003, 15:04):
This looks to me like the beginning of a need for multibyte-character support in chiq_chaq. AFAIK the right way to do that is to use Unicode.
I am no expert in Unicode, but I found out that if you save a text file in utf8 (or Unicode) format in Notepad, any Hebrew character is converted to a sequence of \xD7 plus some other byte above \x80. I suspect that the \203 and other prefixes Laurent encountered are the Japanese counterparts.

By using the pragma 'use utf8', the file can be read and handled correctly; that is, two-byte characters should be treated as a single entity. I tried adding it to chiq.pl but it crashed, of course, since the current contents are saved as single-byte characters.
But I think converting all the contents to UTF-8 and adding this pragma may work.

Laurent, you may try the following code to help investigate the problem:


use utf8;

while (<STDIN>) {
    my @chars = split '';            # split the line into individual characters
    foreach (@chars) {
        my $o = ord $_;              # ordinal value of the character
        print "'$_':$o\n";
    }
}

Run it with a Utf8-encoded text file and see if it produces the right (Japanese) characters. If you remove the 'use utf8' I believe you will see garbage bytes instead of Japanese.
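
On Perl 5.8 one might instead request the decoding explicitly with an I/O layer, since 'use utf8' mainly affects literals in the source itself. A sketch:

binmode STDIN,  ':encoding(UTF-8)';   # decode input from UTF-8
binmode STDOUT, ':encoding(UTF-8)';   # encode output back to UTF-8

while (<STDIN>) {
    foreach my $char (split //) {
        printf "'%s':%d\n", $char, ord $char;   # one entry per character, not per byte
    }
}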

Lastly, this is considered experimental as of Perl 5.6, but I think it only means that later versions may use different syntax.

Yonat_Sharon (Thursday, January 23, 2003, 17:45):
Adee, I haven't tried using utf8 yet, but I have some doubts about Perl's support for it. I suspect I'll need to implement the support for multi-byte characters myself, or use some modules from CPAN. It's on my list, though.
I don't understand why use utf8 caused a crash – for characters in the range 0-127 utf8 is one-byte.

Laurent, you can change \203 to \200-\377 to fix this. However, from the email you sent me I see that chiq_chaq does not recognize link words in Japanese. This may be either because of Perl's inability to handle shift-jis, or because the $locale you used is not recognized by the server. You can run "locale -a" to see the names of the locales available on the server (on Unix).
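
Spelled out, the substitution would presumably go in all three places \203 appears in the function above (a sketch):

my $delimited_span = qr{(\S(?s:.*?[^\s\200-\377])??)};
qr{(?:(?<=_|[^\w\|$delimiter\200-\377])|^)$delimiter$delimited_span$delimiter(?=_|[^\w\|$delimiter\200-\377]|$)};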

Adee_Ran (Thursday, January 23, 2003, 19:44):
See also http://www.perldoc.com/perl5.6.1/pod/perlguts.html#What-is-Unicode%2c-anyway-.
Yonat_Sharon (Thursday, January 23, 2003, 23:46):
Thanks.
Laurent_Safa? (Friday, January 24, 2003, 05:57):
Adee, I processed a UTF-8 file with your sample code but it displayed garbage bytes (I am using Perl v5.8.0).
Yonat, changing \203 to \200-\377 only partly solved the problem. Also, I am not sure yet how to retrieve the system's locale (I am running a Japanese Windows 2000 Professional PC, so I believe it is Japanese).
However, I have good news: considering all the comments so far, I modified the locale configuration in chiq.cf to use UTF-8 encoding for .chiq files and HTML rendering.

# Language locale for determining legal word characters and date format: @@@
$locale = 'Japanese';
$charset = 'utf-8';

And it worked! :D

→ All Japanese characters are properly displayed
→ Bold, italic and superscript now work fine
⚠ There is still a problem: words_with_underscores don't work with 2-byte letters

This last issue is a minor one, since we can always use roman_words_with_underscores? until it is resolved, so I plan to move on to actual use of chiq_chaq in the team. That will help further test the support of more Japanese characters.

Yonat_Sharon (Friday, January 24, 2003, 07:49):
That's good to know!
I will start testing "use utf8" right after the coming release. I hope it will solve the problem with link words in Japanese.
Laurent_Safa? (Friday, January 24, 2003, 08:18):
Yonat, one thing I forgot: Japanese documents don't contain space characters most of the time, so it may be very difficult to write words_with_underscores in Japanese.
For example, "my name is Laurent" would be watasinonamaeharorandesu (I use a roman-letter transcription of the Japanese). There is absolutely no space, and without a good knowledge of the language you have no way to find the word boundaries (in this case watasi no namae ha roran desu).
You may want to allow another way to define pages, for example enclosing the page name in brackets. The above sentence would then look like watasinonamaeha[roran]desu.
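
A sketch of such a rule (hypothetical code, not part of chiq_chaq; make_link stands in for whatever would produce the actual link):

my $text = "watasinonamaeha[roran]desu";
$text =~ s{\[([^\[\]]+)\]}{make_link($1)}eg;       # treat bracketed text as a page name
print "$text\n";
sub make_link { qq{<a href="$_[0]">$_[0]</a>} }    # hypothetical stand-in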
Yonat_Sharon (Friday, January 24, 2003, 08:39):
Interesting. Alternate linking notation is on my list too.