Saturday, March 31, 2007

Converting Orchid corpus to XML

The Orchid corpus is a Thai part-of-speech annotated corpus that used to be freely available on Nectec's website. (I wish it would become available again.) Its format is quite unusual, which makes it inconvenient to handle, so I wrote a script to convert it to XML. Now I can use an XML parser such as pulldom and work with a familiar API, e.g. (pull)DOM.

An example of the Orchid corpus format:

    %metadata
    %metadata
    #P1
    #1
    blaa blaa blaa//
    blaa/NNNN
    blaa/NNNN
    blaa/NNNN
    //

An example of the XML produced from the Orchid corpus format:

    <corpus>
      <document author="abcd" ...>
        <paragraph>
          <sentence raw_txt="blaa blaa blaa">
            <word surface="blaa" pos="NNNN"/>
            <word surface="blaa" pos="NNNN"/>
            <word surface="blaa" pos="NNNN"/>
            <word surface="blaa" pos="NNNN"/>
          </sentence>
        </paragraph>
      </document>
      ...
    </corpus>

The TEI format is probably well suited to this job, but I am just too lazy to read the specification.
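A conversion like the one described above might look roughly like the following Python sketch. It is not the actual script, and it only handles the simplified format from the example: lines starting with % are treated as metadata and skipped (how they map to attributes such as author is not shown here), #P starts a paragraph, # followed by a number starts a sentence, the next line is the raw text ending in //, each following line is word/POS, and a lone // closes the sentence. The function name orchid_to_xml and the sample data are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.dom import pulldom


def orchid_to_xml(lines):
    """Convert lines in the simplified Orchid-like format to an XML element."""
    corpus = ET.Element("corpus")
    document = ET.SubElement(corpus, "document")
    paragraph = None
    sentence = None
    expect_raw = False  # True right after a "#<n>" sentence marker

    for line in lines:
        line = line.strip()
        if not line or line.startswith("%"):
            # Assumption: metadata lines are skipped in this sketch.
            continue
        if line.startswith("#P"):
            paragraph = ET.SubElement(document, "paragraph")
            sentence = None
        elif line.startswith("#"):
            if paragraph is None:
                paragraph = ET.SubElement(document, "paragraph")
            sentence = ET.SubElement(paragraph, "sentence")
            expect_raw = True
        elif sentence is None:
            continue  # stray line outside any sentence
        elif expect_raw:
            # Raw sentence text, terminated by "//".
            raw = line[:-2] if line.endswith("//") else line
            sentence.set("raw_txt", raw)
            expect_raw = False
        elif line == "//":
            sentence = None  # end of the sentence
        else:
            # Split on the last "/" so the surface form may contain "/".
            surface, _, pos = line.rpartition("/")
            ET.SubElement(sentence, "word", surface=surface, pos=pos)
    return corpus


sample = """%metadata
%metadata
#P1
#1
blaa blaa blaa//
blaa/NNNN
blaa/NNNN
blaa/NNNN
//
"""

corpus = orchid_to_xml(sample.splitlines())
xml_text = ET.tostring(corpus, encoding="unicode")

# Read the result back with pulldom, collecting (surface, pos) pairs.
words = []
for event, node in pulldom.parseString(xml_text):
    if event == pulldom.START_ELEMENT and node.tagName == "word":
        words.append((node.getAttribute("surface"), node.getAttribute("pos")))
print(words)
```

The real corpus has more cases than this (the metadata fields in particular), so the sketch would need extending before it could be run on the actual files.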

Wednesday, March 28, 2007

Displaying multilingual text in SVG using Firefox

In Khem's tree editor, SVG is used to display trees in Firefox. Firefox 2.x on Windows XP displays English and Thai text in SVG correctly. But when I tried Firefox 2.x on Mac OS X, the Thai, Bengali and Chinese text were rendered as boxes, as shown below.
[firefox screenshot]
(using the following code)

    <svg xmlns="http://www.w3.org/2000/svg"
         xmlns:xlink="http://www.w3.org/1999/xlink"
         version="1.1" baseProfile="full">
      <text x="50" y="50" font-size="16" fill="blue">
        Wikipedia 維基百科 วิกิพีเดีย উইকিপিডিয়া
      </text>
    </svg>
So I tried assigning a font family to the text, as in the following code:

    <svg xmlns="http://www.w3.org/2000/svg"
         xmlns:xlink="http://www.w3.org/1999/xlink"
         version="1.1" baseProfile="full">
      <text x="50" y="50" font-family="Garuda" font-size="16" fill="blue">
        Wikipedia 維基百科 วิกิพีเดีย উইকিপিডিয়া
      </text>
    </svg>
It works: Firefox displays the Thai text correctly. However, Firefox still cannot display the Bengali and Chinese text, as shown below.
[firefox screenshot]
I also tried other font families, i.e. Times, Sans and Helvetica, but with those only the English text was displayed.
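Trying several font families by hand gets tedious, so a small Python script can generate one test file per family. This is only a sketch for reproducing the experiment: the helper names make_svg and write_test_files and the output file names are made up, and the list of families is simply the ones mentioned above.

```python
import os
from xml.sax.saxutils import escape, quoteattr

SAMPLE_TEXT = "Wikipedia 維基百科 วิกิพีเดีย উইকিপিডিয়া"
FAMILIES = [None, "Garuda", "Times", "Sans", "Helvetica"]


def make_svg(font_family=None, text=SAMPLE_TEXT):
    """Return the test SVG as a string, optionally with a font-family."""
    font_attr = ""
    if font_family:
        font_attr = " font-family=" + quoteattr(font_family)
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" '
        'xmlns:xlink="http://www.w3.org/1999/xlink" '
        'version="1.1" baseProfile="full">\n'
        '  <text x="50" y="50"' + font_attr + ' font-size="16" fill="blue">'
        + escape(text) + "</text>\n"
        "</svg>\n"
    )


def write_test_files(directory):
    """Write one SVG file per font family into the given directory."""
    paths = []
    for family in FAMILIES:
        path = os.path.join(directory, "test-%s.svg" % (family or "default"))
        with open(path, "w", encoding="utf-8") as f:
            f.write(make_svg(family))
        paths.append(path)
    return paths
```

Calling write_test_files(".") writes the files into the current directory; each one can then be opened in Firefox to see which scripts are displayed.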
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.