Saturday, March 31, 2007
Converting Orchid corpus to XML
The Orchid corpus is a Thai part-of-speech annotated corpus that used to be freely available on NECTEC's website. (I hope it becomes available again.) Its format is rather unusual, which makes it inconvenient to handle, so I wrote a script to convert it to XML. Now I can process it with an XML parser such as pulldom, through a familiar API such as (pull)DOM.
An example of the Orchid corpus format:
%metadata
%metadata
#P1
#1
blaa blaa blaa//
blaa/NNNN
blaa/NNNN
blaa/NNNN
//
An example of the XML produced from the Orchid corpus format:
<corpus>
<document author="abcd" ...>
<paragraph>
<sentence raw_txt="blaa blaa blaa">
<word surface="blaa" pos="NNNN"/>
<word surface="blaa" pos="NNNN"/>
<word surface="blaa" pos="NNNN"/>
</sentence>
</paragraph>
</document>
...
</corpus>
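The mapping between the two examples above can be sketched roughly as follows. This is not the script I actually wrote, only a minimal illustration that assumes the simplified format shown in this post: lines starting with `%` are metadata, `#P<n>` opens a paragraph, `#<n>` opens a sentence, a line ending in `//` carries the raw sentence text, `surface/POS` lines are words, and a bare `//` closes the sentence. The function name `orchid_to_xml` is made up for this example, and metadata is simply skipped because the real field names are not shown here.

```python
import xml.etree.ElementTree as ET


def orchid_to_xml(lines):
    """Convert lines in the simplified Orchid-like format to a <corpus> element."""
    corpus = ET.Element("corpus")
    document = ET.SubElement(corpus, "document")
    paragraph = None
    sentence = None
    for raw_line in lines:
        line = raw_line.rstrip("\r\n")
        if not line or line.startswith("%"):
            # Assumption: metadata lines are skipped in this sketch.
            continue
        if line.startswith("#P"):
            paragraph = ET.SubElement(document, "paragraph")
            sentence = None
        elif line.startswith("#"):
            if paragraph is None:
                paragraph = ET.SubElement(document, "paragraph")
            sentence = ET.SubElement(paragraph, "sentence")
        elif sentence is None:
            continue  # stray line outside any sentence
        elif line == "//":
            sentence = None  # end of the current sentence
        elif line.endswith("//"):
            sentence.set("raw_txt", line[:-2])
        elif "/" in line:
            # Split on the last slash so a surface form may itself contain "/".
            surface, _, pos = line.rpartition("/")
            ET.SubElement(sentence, "word", surface=surface, pos=pos)
    return corpus


sample = """%metadata
%metadata
#P1
#1
blaa blaa blaa//
blaa/NNNN
blaa/NNNN
blaa/NNNN
//
""".splitlines()

root = orchid_to_xml(sample)
print(ET.tostring(root, encoding="unicode"))
```

The real corpus files may contain constructs that this sketch does not handle, so treat it as a starting point only.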
The TEI format would probably suit this job, but I am just too lazy to read the specification.
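Once the corpus is in XML like the example above, reading it back with pulldom might look like the sketch below. It uses Python's standard `xml.dom.pulldom` module; the helper name `read_sentences` and the inline `sample_xml` string are made up for illustration, and the element and attribute names are the ones from the example in this post.

```python
from xml.dom import pulldom

sample_xml = """<corpus>
<document author="abcd">
<paragraph>
<sentence raw_txt="blaa blaa blaa">
<word surface="blaa" pos="NNNN"/>
<word surface="blaa" pos="NNNN"/>
<word surface="blaa" pos="NNNN"/>
</sentence>
</paragraph>
</document>
</corpus>"""


def read_sentences(xml_text):
    """Yield (raw_txt, [(surface, pos), ...]) for each <sentence> element."""
    events = pulldom.parseString(xml_text)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "sentence":
            # Pull the rest of this sentence into a small DOM subtree.
            events.expandNode(node)
            words = [
                (w.getAttribute("surface"), w.getAttribute("pos"))
                for w in node.getElementsByTagName("word")
            ]
            yield node.getAttribute("raw_txt"), words


for raw_txt, words in read_sentences(sample_xml):
    print(raw_txt, words)
```

Only one sentence is held in memory at a time, which is the main reason to prefer pulldom over a full DOM for a large corpus.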
Labels:
corpus,
format,
orchid corpus,
part-of-speech,
thai,
XML
Wednesday, March 28, 2007
Displaying multilingual text in SVG using Firefox
In Khem's tree editor, SVG is used to display trees in Firefox. Firefox 2.x on Windows XP displays English and Thai text in SVG correctly. But when I tried Firefox 2.x on Mac OS X, the Thai, Bengali and Chinese text was rendered as boxes, as shown below.
(using the following code)
<svg xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="1.1"
baseProfile="full">
<text x="50" y="50"
font-size="16" fill="blue" >
Wikipedia 維基百科 วิกิพีเดีย উইকিপিডিয়া
</text>
</svg>
So I tried assigning a font family to the text, as in the following code:
<svg xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="1.1"
baseProfile="full">
<text x="50" y="50"
font-family="Garuda" font-size="16"
fill="blue" >
Wikipedia 維基百科 วิกิพีเดีย উইকিপিডিয়া
</text>
</svg>
It works: Firefox displays the Thai text correctly. However, Firefox still cannot display the Bengali and Chinese text, as shown below.
I tried other font families, i.e. Times, Sans and Helvetica, but with those only the English text was displayed.
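Trying font families one by one gets tedious, so a small script can generate one test file per family to open in Firefox. The sketch below is only a convenience for repeating the experiment above, not a fix. The function names and the output file naming are made up for this example, and the files are written to a temporary directory.

```python
import os
import tempfile
from xml.sax.saxutils import escape, quoteattr

SAMPLE_TEXT = "Wikipedia 維基百科 วิกิพีเดีย উইকিপিডিয়া"


def make_test_svg(font_family, text=SAMPLE_TEXT):
    """Return an SVG document that draws `text` using `font_family`."""
    return (
        '<svg xmlns="http://www.w3.org/2000/svg"\n'
        '     version="1.1" baseProfile="full">\n'
        '  <text x="50" y="50" font-family=%s font-size="16" fill="blue">\n'
        "    %s\n"
        "  </text>\n"
        "</svg>\n"
    ) % (quoteattr(font_family), escape(text))


def write_test_files(families, out_dir):
    """Write one SVG file per font family and return the file paths."""
    paths = []
    for family in families:
        path = os.path.join(out_dir, "font_test_%s.svg" % family)
        with open(path, "w", encoding="utf-8") as f:
            f.write(make_test_svg(family))
        paths.append(path)
    return paths


if __name__ == "__main__":
    out_dir = tempfile.mkdtemp(prefix="svg_font_test_")
    for path in write_test_files(["Garuda", "Times", "Sans", "Helvetica"], out_dir):
        print(path)
```

Each printed path can be opened directly in the browser to see which scripts come out as boxes for that family.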
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.