Saturday, March 31, 2007

Converting Orchid corpus to XML

The Orchid corpus is a Thai part-of-speech annotated corpus that used to be freely available on Nectec's website. (I hope it becomes available again.) Its format is quite unusual, which makes it inconvenient to handle, so I just wrote a script to convert it to XML. Now I can process it with an XML parser through a familiar API such as pulldom or DOM.

An example of the Orchid corpus format:

    %metadata
    %metadata
    #P1
    #1
    blaa blaa blaa//
    blaa/NNNN
    blaa/NNNN
    blaa/NNNN
    //

An example of the XML produced from the Orchid corpus format:

    <corpus>
      <document author="abcd" ...>
        <paragraph>
          <sentence raw_txt="blaa blaa blaa">
            <word surface="blaa" pos="NNNN"/>
            <word surface="blaa" pos="NNNN"/>
            <word surface="blaa" pos="NNNN"/>
            <word surface="blaa" pos="NNNN"/>
          </sentence>
        </paragraph>
      </document>
      ...
    </corpus>

The TEI format is probably better suited to this job, but I am just too lazy to read the specification.
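For illustration, here is a minimal Python sketch of such a conversion. It is not the original script. It assumes only the simplified layout shown in the example above: `%` lines are metadata, `#P<n>` opens a paragraph, `#<n>` opens a sentence, the next line is the raw text ending in `//`, each following line is `surface/POS`, and a lone `//` closes the sentence. The names `orchid_to_xml` and `iter_words` are hypothetical, the whole input is treated as one document, and metadata lines are skipped because the mapping from metadata to attributes is not shown here.

```python
import re
import xml.etree.ElementTree as ET
from xml.dom import pulldom

# Sample input in the simplified layout assumed by this sketch.
SAMPLE = """%metadata
%metadata
#P1
#1
blaa blaa blaa//
blaa/NNNN
blaa/NNNN
blaa/NNNN
//
"""


def orchid_to_xml(text):
    """Build a <corpus> element tree from text in the assumed layout."""
    corpus = ET.Element("corpus")
    # Assumption: the whole input is a single document.
    document = ET.SubElement(corpus, "document")
    paragraph = None
    sentence = None
    expect_raw = False  # True right after a "#<n>" sentence marker

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if expect_raw:
            # The line after "#<n>" is the raw sentence text, ending in "//".
            raw = line[:-2] if line.endswith("//") else line
            sentence.set("raw_txt", raw)
            expect_raw = False
        elif line.startswith("%"):
            continue  # metadata is skipped in this sketch
        elif re.fullmatch(r"#P\d+", line):
            paragraph = ET.SubElement(document, "paragraph")
            sentence = None
        elif re.fullmatch(r"#\d+", line):
            if paragraph is None:
                paragraph = ET.SubElement(document, "paragraph")
            sentence = ET.SubElement(paragraph, "sentence")
            expect_raw = True
        elif line == "//":
            sentence = None  # end of the current sentence
        elif sentence is not None and "/" in line:
            # Split on the last "/" so a surface form may itself contain "/".
            surface, _, pos = line.rpartition("/")
            ET.SubElement(sentence, "word", surface=surface, pos=pos)
    return corpus


def iter_words(xml_string):
    """Yield (surface, pos) pairs from the XML using pulldom."""
    for event, node in pulldom.parseString(xml_string):
        if event == pulldom.START_ELEMENT and node.tagName == "word":
            yield node.getAttribute("surface"), node.getAttribute("pos")


if __name__ == "__main__":
    xml_string = ET.tostring(orchid_to_xml(SAMPLE), encoding="unicode")
    print(xml_string)
    for surface, pos in iter_words(xml_string):
        print(surface, pos)
```

Reading the result back with pulldom keeps memory use low, because only the current `word` element is held at a time. The real corpus files may need extra handling (for example character encoding or line continuations) that this sketch does not attempt.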

1 comment:

Conductor said...

FWIW, it's been back at http://www.hlt.nectec.or.th/orchid/

Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.