Mapping Malayalam Unicode characters to an 8-bit representation in Lua while preserving the existing ASCII characters.


In natural language processing, handling Unicode text gets somewhat complex, especially when characters must be accessed at random positions in a sequence. Unicode text is usually stored in a variable-width encoding such as UTF-8, which is clever and compact but inherently sequential: to find the n-th character you have to scan from the start, which makes random access a headache. The options are limited, and doing it properly needs error checking and other corrective mechanisms. So most of the time it is a good idea to work out which Unicode characters our domain actually uses and map them to single-byte representations, the way ASCII characters are stored.
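To make the random-access problem concrete, here is a minimal sketch. It assumes the source file is saved as UTF-8, so the Malayalam letter in the literal is stored as three bytes; the variable names are only illustrative.

```lua
-- "അ" (U+0D05) is stored as three UTF-8 bytes, so byte indexing splits it.
local s = "aഅb"

print(#s)              -- 5 bytes, although there are only 3 characters
print(s:byte(1, -1))   -- 97  224  180  133  98

-- Asking for "the second character" by position returns only the first
-- byte of "അ", not the whole letter.
local piece = s:sub(2, 2)
print(#piece, piece:byte())  -- 1  224
```

Lua's `#` operator and `string.sub` both count bytes, which is why a one-byte-per-character mapping makes indexing simple again.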

Some mapping techniques already exist, such as the WX notation used for Indian languages. It maps each Unicode character to a unique, phonetically similar ASCII character. For example:

  • a for അ.
  • A for ആ.
  • i for ഇ.
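A mapping of this kind can be sketched as a plain Lua table. The table below is hypothetical and contains only the three pairs listed above; a real WX table covers the whole alphabet.

```lua
-- Hypothetical three-entry table built only from the examples above.
local wx_to_malayalam = { a = "അ", A = "ആ", i = "ഇ" }

-- Build the reverse table so the mapping works in both directions.
local malayalam_to_wx = {}
for ascii, mal in pairs(wx_to_malayalam) do
  malayalam_to_wx[mal] = ascii
end

print(wx_to_malayalam["A"])   -- ആ
print(malayalam_to_wx["ഇ"])   -- i
```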

Recently, I ran into the same problem and needed a character mapping of my own.

Situation:

As part of my academic major project, which deals with present-day Malayalam text data, I realised that almost all web pages and books use a mixed approach. Malayalam text nowadays is not purely Malayalam: it contains both Malayalam and English words to complete a meaningful sentence. In order to deal with this, we must treat common English words as part of the ordinary Malayalam vocabulary as well. Sometimes these English words are written in English characters, and sometimes in Malayalam characters.

Advantages in Lua:

My project is being developed in Lua to take advantage of Torch support. Another remarkable Lua feature that helps here is that Lua strings are eight-bit clean: each position in a string can hold any byte value. A byte (8 bits) can represent 2^8 (256) different values, and ASCII reserves only the first 2^7 (128) of them for its characters. Since the other 128 values are unused, we are free to use them.
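A quick way to convince yourself of the eight-bit-clean behaviour is to build a string from arbitrary byte values and read them back. This is only a small check, not part of the project code.

```lua
-- Build a string from arbitrary byte values, including 0 and values above 127.
local s = string.char(65, 0, 128, 255)
print(#s)             -- 4: the zero byte does not terminate the string
print(s:byte(1, -1))  -- 65  0  128  255

-- Every one of the 256 byte values survives a round trip through a string.
local ok = true
for value = 0, 255 do
  if string.char(value):byte() ~= value then ok = false end
end
print(ok)             -- true
```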

Malayalam Unicode

From the Malayalam Unicode table, we can see that the Malayalam character set occupies only 128 positions (U+0D00 to U+0D7F) in the whole of Unicode. So we can easily map them onto the 128 slots left over after the ASCII range.
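The arithmetic behind such a mapping can be sketched as follows. The helper names are hypothetical, and the layout (block offset added to 128) is one straightforward choice rather than necessarily the one used in the final code.

```lua
local BLOCK_START = 0x0D00  -- first code point of the Malayalam block
local BLOCK_END   = 0x0D7F  -- last code point of the Malayalam block
local HIGH_START  = 128     -- first byte value above the ASCII range

-- Hypothetical helper: code point -> single byte value (nil if outside the block).
local function codepoint_to_byte(cp)
  if cp < BLOCK_START or cp > BLOCK_END then return nil end
  return HIGH_START + (cp - BLOCK_START)
end

-- Hypothetical helper: byte value -> code point (nil for ASCII bytes).
local function byte_to_codepoint(b)
  if b < HIGH_START or b > 255 then return nil end
  return BLOCK_START + (b - HIGH_START)
end

print(BLOCK_END - BLOCK_START + 1)       -- 128
print(codepoint_to_byte(0x0D05))         -- 133 (the letter അ)
print(codepoint_to_byte(0x0D7F))         -- 255
print(byte_to_codepoint(133) == 0x0D05)  -- true
```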

Coding

Since it is very hard to type these remaining 128 characters on a keyboard, we will make use of Lua's string.byte( give_me_a_character ) and string.char( give_me_a_number ). Everything else is easy: just play with Lua's table data structure.
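As a rough illustration of how string.byte and string.char fit together, here is a minimal sketch that converts UTF-8 text in both directions. It is not the notebook code: the function names `to_8bit` and `from_8bit` are made up for this example, and it assumes the input contains only ASCII and characters from the Malayalam block (anything else, such as a zero-width joiner, would need extra handling).

```lua
-- In UTF-8 the block U+0D00..U+0D7F is always byte 224 (0xE0), then 180 or
-- 181 (0xB4/0xB5), then a byte in 128..191. Each such sequence becomes one
-- byte in 128..255; ASCII bytes are left untouched.
local function to_8bit(text)
  return (text:gsub("\224([\180\181])([\128-\191])", function(b2, b3)
    local offset = (b2:byte() - 180) * 64 + (b3:byte() - 128)
    return string.char(128 + offset)
  end))
end

-- Reverse step: every byte in 128..255 is expanded back to its 3-byte sequence.
local function from_8bit(packed)
  return (packed:gsub("[\128-\255]", function(ch)
    local offset = ch:byte() - 128
    return string.char(224, 180 + math.floor(offset / 64), 128 + offset % 64)
  end))
end

local original = "Lua അആഇ ok"
local packed = to_8bit(original)
print(#original, #packed)             -- 16  10
print(packed:byte(5, 7))              -- 133  134  135
print(from_8bit(packed) == original)  -- true
```

After conversion, `packed:sub(n, n)` returns exactly one character, which is the random access the mapping was meant to provide.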

Here's a quick evaluation of the different Lua chunks needed for the mapping code: Jupyter notebook. Here's the final code: Jupyter notebook.

Note: Create a Malayalam text file before running the final code: /data/malayalam_in.txt.

The Indian Script Code for Information Interchange (ISCII) works this way (Wikipedia).
