Saturday, March 31, 2007
Converting Orchid corpus to XML
The Orchid corpus is a Thai part-of-speech annotated corpus that used to be freely available on Nectec's website. (I hope it will become available again.) Its format is rather unusual, which makes it inconvenient to handle, so I wrote a script to convert it to XML. Now I can use an XML parser such as pulldom and work with the corpus through a familiar API, e.g. (pull)DOM.
An example of the Orchid corpus format:
%metadata
%metadata
#P1
#1
blaa blaa blaa//
blaa/NNNN
blaa/NNNN
blaa/NNNN
//
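The conversion could look roughly like the sketch below. It is not my actual script, only a minimal illustration that assumes the simplified layout shown above: lines starting with % are metadata, #P opens a paragraph, any other # line opens a sentence, the next line is the raw text ending in //, each following line is surface/POS, and a lone // closes the sentence. The function name orchid_to_xml is made up for this example, and metadata lines are simply skipped.

```python
import xml.etree.ElementTree as ET


def orchid_to_xml(lines):
    """Build a <corpus> tree from lines in the simplified format shown above."""
    corpus = ET.Element("corpus")
    document = ET.SubElement(corpus, "document")
    paragraph = None
    sentence = None
    expect_raw = False  # True right after a sentence marker such as "#1"

    for line in lines:
        line = line.rstrip("\r\n")
        if expect_raw:
            # Assumption: the raw sentence text sits on one line ending in "//".
            sentence.set("raw_txt", line[:-2] if line.endswith("//") else line)
            expect_raw = False
        elif not line or line.startswith("%"):
            # Metadata is skipped; its layout is not covered by this sketch.
            continue
        elif line.startswith("#P"):
            paragraph = ET.SubElement(document, "paragraph")
        elif line.startswith("#"):
            if paragraph is None:
                paragraph = ET.SubElement(document, "paragraph")
            sentence = ET.SubElement(paragraph, "sentence")
            expect_raw = True
        elif line == "//":
            sentence = None  # end of the current sentence
        elif sentence is not None:
            # Split on the last "/" so a surface form may itself contain "/".
            surface, _, pos = line.rpartition("/")
            ET.SubElement(sentence, "word", surface=surface, pos=pos)
    return corpus


sample = """%metadata
%metadata
#P1
#1
blaa blaa blaa//
blaa/NNNN
blaa/NNNN
blaa/NNNN
//
"""

root = orchid_to_xml(sample.splitlines())
print(ET.tostring(root, encoding="unicode"))
```

Building the tree with ElementTree rather than concatenating strings means attribute values get escaped automatically, which matters once real text with quotes or ampersands shows up.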
An example of the resulting XML:
<corpus>
<document author="abcd" ...>
<paragraph>
<sentence raw_txt="blaa blaa blaa">
<word surface="blaa" pos="NNNN"/>
<word surface="blaa" pos="NNNN"/>
<word surface="blaa" pos="NNNN"/>
</sentence>
</paragraph>
</document>
...
</corpus>
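Reading the converted file back with pulldom might look like the sketch below. It assumes the element and attribute names from the example above (word, surface, pos); the helper name iter_tagged_words and the inline sample are made up for illustration.

```python
from xml.dom import pulldom

xml_text = """<corpus>
<document>
<paragraph>
<sentence raw_txt="blaa blaa blaa">
<word surface="blaa" pos="NNNN"/>
<word surface="blaa" pos="NNNN"/>
<word surface="blaa" pos="NNNN"/>
</sentence>
</paragraph>
</document>
</corpus>"""


def iter_tagged_words(xml_string):
    """Yield (surface, pos) pairs from <word> elements as they stream by."""
    for event, node in pulldom.parseString(xml_string):
        if event == pulldom.START_ELEMENT and node.tagName == "word":
            yield node.getAttribute("surface"), node.getAttribute("pos")


pairs = list(iter_tagged_words(xml_text))
print(pairs)
```

Because pulldom streams events instead of loading the whole tree, this should stay cheap even on a large corpus file; for a file on disk, pulldom.parse(filename) can be used in place of parseString.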
The TEI format is probably suitable for this job, but I am just too lazy to read the specification.
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.
1 comment:
FWIW, it's been back at http://www.hlt.nectec.or.th/orchid/