Monday, December 25, 2006

GIZA++ Guide

A newer and easier guide for Ubuntu/Debian users is available at
  • Firstly, we have to prepare two text files with the same number of lines, where each line in one file is the translation of the same line in the other. For example,

    ฉัน กิน ข้าว
    ฉัน ไป โรงเรียน

    I eat rice.
    I go to school.
  • Secondly, generate the vocabulary files and the correspondence file using plain2snt.out. For example: plain2snt.out eng.txt tha.txt. It should generate eng_tha.snt, eng.vcb and tha.vcb.
  • Thirdly, write a configuration file. For example,

    outputfileprefix play_giza
    sourcevocabularyfile eng.vcb
    targetvocabularyfile tha.vcb
    c eng_tha.snt
  • Finally, run GIZA++ with the command "GIZA++ config". The final result should then be in the file (be careful if you use Mac OS X).
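
Before running the steps above, it can help to confirm that the two text files really have the same number of lines, and to generate the configuration file from a script. The following is only a minimal sketch in Python: the helper names `count_lines`, `check_parallel` and `write_config` are made up for illustration and are not part of GIZA++; the file names and the four configuration keys are the ones used in the example above.

```python
from pathlib import Path


def count_lines(path):
    """Return the number of lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_parallel(source_path, target_path):
    """Raise ValueError unless both files have the same number of lines."""
    n_source = count_lines(source_path)
    n_target = count_lines(target_path)
    if n_source != n_target:
        raise ValueError(
            f"{source_path} has {n_source} lines but {target_path} has {n_target}"
        )
    return n_source


def write_config(config_path, prefix, source_vcb, target_vcb, snt_file):
    """Write the four-line configuration file shown in the guide."""
    lines = [
        f"outputfileprefix {prefix}",
        f"sourcevocabularyfile {source_vcb}",
        f"targetvocabularyfile {target_vcb}",
        f"c {snt_file}",
    ]
    Path(config_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
```

For the example in this guide, one would call `check_parallel("eng.txt", "tha.txt")` before plain2snt.out, and `write_config("config", "play_giza", "eng.vcb", "tha.vcb", "eng_tha.snt")` before running GIZA++.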

GIZA++: XML output

The alignment output from GIZA++ is in a special format. It looks nice and readable, but I just don't want to write a parser. Hence I modified GIZA++ to output XML instead. [Download the patch]
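
For readers who would rather not patch GIZA++, a small parser is another option. The sketch below is in Python and rests on an assumption that this post does not show: that the alignment line of each sentence pair looks like `NULL ({ }) I ({ 1 }) eat ({ 2 }) rice. ({ 3 })`, that is, each word followed by the 1-based positions it is aligned to in the other sentence. The function name `parse_alignment_line` is made up for illustration; check the layout against your own output file before relying on it.

```python
import re

# One "word ({ i j ... })" group; the list of positions may be empty.
TOKEN_RE = re.compile(r"(\S+) \(\{([\d ]*)\}\)")


def parse_alignment_line(line):
    """Return a list of (word, [1-based positions]) pairs for one line."""
    return [
        (word, [int(i) for i in indices.split()])
        for word, indices in TOKEN_RE.findall(line)
    ]
```

The result is a plain list of tuples, so it can be turned into XML, JSON or anything else with a few more lines.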

Sunday, December 24, 2006

GIZA++ on Mac OS X (HFS+)

Today I found that and are the same file on HFS+ (the file system used on my iBook), presumably because HFS+ is case-insensitive by default. Now I know why in my working directory is not the same as what is mentioned in GIZA++'s README. A workaround is as follows:
diff -Nuar GIZA++-v2/ GIZA++-v2-osx/
--- GIZA++-v2/ Tue Sep 30 21:24:18 2003
+++ GIZA++-v2-osx/     Sat Dec 23 18:16:08 2006
@@ -318,8 +318,8 @@
     d4file = Prefix + ".d4." + number ;
     d4file2 = Prefix + ".D4." + number ;
     d5file = Prefix + ".d5." + number ;
-      alignfile = Prefix + ".A3." + number ;
-      test_alignfile = Prefix + ".tst.A3." + number ;
+      alignfile = Prefix + ".uA3." + number ;
+      test_alignfile = Prefix + ".tst.uA3." + number ;
     p0file = Prefix + ".p0_3." + number ;
   // clear count tables
I noticed this after running GIZA++ on NetBSD, where the result was just like the one described in the README. Update: I have since switched from Mac OS X to Ubuntu.
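
To see which output names would clash on a case-insensitive file system, one can group the names by their case-folded form. This is a rough sketch in Python: `case_collisions` is a made-up helper, `str.casefold` only approximates the comparison HFS+ actually performs, and the prefix and number in the usage example are hypothetical (the `.d4.` and `.D4.` suffixes are taken from the diff above).

```python
from collections import defaultdict


def case_collisions(names):
    """Group names that become identical when letter case is ignored."""
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    # Only groups with more than one name are collisions.
    return [group for group in groups.values() if len(group) > 1]
```

For instance, `case_collisions(["play_giza.d4.1", "play_giza.D4.1", "play_giza.p0_3.1"])` reports the first two names as one group, while a list of names that differ in more than case yields an empty list.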
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.