Selecting Thai Words

Defining a dictionary

In this section, you'll:
  • Create an HTML file containing some Thai text and an English translation withnospacesbetweenwords
  • Create a dictionary object for all the words in the Thai text and all the words that can be found in the English string.
Download the source files Test here

Mimicking Thai text

During this tutorial, you'll be working with some text in Thai. To make it easier to understand what your code is doing, it might be easier to use a translation of the Thai text in English, from which all the spaces have been removed. You can then perform exactly the same actions on the English textwithnospaces, even if you understand no Thai.

You can start by creating a file named index.html with the following HTML code:

index.html

<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>Selecting Thai words</title>
</head>

<body>
  <p lang="th">งมเข็มในมหาสมุทร
  พูดไปสองไพเบี้ย นิ่งเสียตำลึงทอง
  ตากลม  ตา&#8203;กลม</p>

  <p lang="enx">lookforaneedleinthesea
  talkisworthtuppence silenceisworthgold
  exposedtotheair roundeye</p>

  <script src="js/dictionaries.js"></script>
</body>
</html>

For now, the text in the HTML is simply to give you an idea of how Thai text looks. Later, in the section on using the arrow keys to jump to the neighbouring words, you'll actually be interacting with Thai text.

ISO language codes

The International Organization for Standardization (ISO) has published a list of two- and three-letter language codes. You can use the lang attribute of an HTML element to indicate which language the text of the element is in. The standard codes for English and Thai en and th.

Normally, you would add this attribute to the <html> element ...

<!DOCTYPE html>
<html lang="en">
   <head>
    ...
  <head>
  <body>
    ...
  </body>
</html>

... to indicate the language of the entire page. But this page is in two languages, so it makes sense to give the attribute to the individual <p> tags.

Since the "English" used here to mimic the structure of Thai text is non-standard English, I have used a non-standard code: enx.

Zero-width space

The character encoded as &#8203; is a "zero-width space". It is used here to force a word-break in the string ตากลม so that it will be treated as ตา กลม (eye round) and not ตาก ลม (exposed-to air).

Creating a dictionary

The JavaScript file below creates a global object called dico, with two dictionary maps: one for the non-standard enx language code, one for th, the Thai language.

If you are reading this in English, the chances are that you already know all the English words, so a pronunciation guide and translation is provided only for the Thai words.

Notice that the enx dictionary contains all the words that can be found in the enx strings, even those that are not used to create meaning from the strings. For testing purposes, the enx dictionary object contains the invented word "forane", which can be found in the chunk "foraneedle"

js/dictionaries.js

"use strict"

var dico = {}

;(function addDictionaries() {
  dico = {

    dictionaries: {
      enx: {
        " ": 0
      , "a": 0
      , "air": 0
      , "exposed": 0
      , "eye": 0
      , "for": 0
      , "forane": 0 // nonsense word for testing
      , "gold": 0
      , "he": 0
      , "in": 0
      , "is": 0    
      , "look": 0
      , "need": 0
      , "needle": 0
      , "old": 0
      , "or": 0
      , "pence": 0
      , "round": 0
      , "sea": 0
      , "silence": 0
      , "talk": 0
      , "the": 0
      , "these": 0
      , "to": 0
      , "tot": 0
      , "tup": 0
      , "tuppence": 0
      , "up": 0
      , "worth": 0
      }

    , th: {
        " ": {}
      , "​": {} // &#8203; = zero-width space
      , "กลม": {
          "pronunciation": "glohm-"
        , "translation": "round; circular"
        }
      , "งม" : {
          "pronunciation": "ngohm-"
        , "translation": "to grope; search; seek; fumble for"
        }
      , "ตา": {
          "pronunctiation": "dtaa-"
        , "translation": "eye; maternal grandfather"
        }
      , "ตาก": {
          "pronunctiation": "dtaak_"
        , "translation": "[is] exposed (e.g., to the sun)"
        }
      , "ตำ" : {
          "pronunciation": "dtam-"
        , "translation": "beat; pound an object; pulverize; to pierce; puncture; prick"
        }
      , "ตำลึง" : {
          "pronunciation": "dtam- leung-"
        , "translation": "ancient Thai monetary unit"
        }
      , "ทอง" : {
          "pronunciation": "thaawng-"
        , "translation": " gold"
        }
      , "นิ่ง" : {
          "pronunciation": "ning`"
        , "translation": "[is] still; immobile; silent; motionless; quiet"
        }
      , "พูด" : {
          "pronunciation": "phuut`"
        , "translation": " to speak; to talk; to say"
        }
      , "มหา" : {
          "pronunciation": "ma' haa´"
        , "translation": "great; omnipotent; large; many; much; maximal; paramount; exalted"
        }
      , "มหาสมุทร" : {
          "pronunciation": "ma' haa´ sa_ moot_"
        , "translation": "ocean"
        }
      ,"ลม": {
          "pronunciation": "lohm-"
        , "translation": "air; wind; storm"
        }
      , "สมุทร" : {
          "pronunciation": "sa_  moot_"
        , "translation": "ocean; sea"
        }
      , "สอง" : {
          "pronunciation": "saawng´"
        , "translation": "two"
        }
      , "เข็ม" : {
          "pronunciation": "khem´"
        , "translation": "clasp; brooch; safety pin; needle; sewing pin"
        }
      , "เบี้ย" : {
          "pronunciation": "biia`"
        , "translation": "a cowrie shell [formerly used as] money"
        }
      , "เสีย" : {
          "pronunciation": "siia´"
        , "translation": "to spend; use up; lose; give up; sacrifice; pay"
        }
      , "ใน" : {
          "pronunciation": "nai-"
        , "translation": "in; inside; within; amidst; into; on; at a particular time"
        }
      , "ไป" : {
          "pronunciation": "bpai-"
        , "translation": "to go; <subject> goes"
        }
      , "ไพ" : {
          "pronunciation": "phai-"
        , "translation": "a certain old coin equal in value to 1/32 baht"
        }
      }
    }
  }
})()

Creating a trie

In this section, you'll learn how to:
  • Create a trie object
Download the source files

The word "trie" comes from the French word for "to choose, classify, filter, pick out, separate, sort". It also sounds like the English word "tree". In computing, it refers to a data structure than starts at a common root and divides multiple times to end in a unique leaf.

You can add the code below to your dictionaries.js file.

js/dictionaries.js

"use strict"

var dico = {}

;(function addDictionaries() {
  dico = {

    dictionaries: {
      enx: {
        // code omitted for clarity
      }
    , th: {
        // code omitted for clarity
      }
    }

  , tries: {}

  , initialize: function createTries(){
      for (var languageCode in this.dictionaries) {
        this.tries[languageCode] = 
          this.createTrie(this.dictionaries[languageCode])
      }

      return this
    }

  , createTrie: function createTrie(languageMap) {
      var trie = {}
      var end = "$"
      var word
        , ii
        , length
        , last
        , chars
        , char
        , path
        , place


      for (word in languageMap) {
        chars = word.split("")
        path = trie
   
        for (ii = 0, length = word.length; ii < length; ii += 1) {
          char = chars[ii]
          place = path[char]
   
          if (!place) {
            place = {}
            path[char] = place
          }
          path = place
        }
        path[end] = true
      }

      return trie
    }
  }.initialize()
})()

console.log(JSON.stringify(dico.tries.enx))

The createTrie method creates an empty trie object, and then takes each word key in a given dictionary and splits it into its individual characters. It then checks if there is already a word which begins with the first letter. If not, it creates a new entry for that letter in the trie object; if so, it checks whether the existing sub-object already contains a sub-object for the next letter. In this manner, it works its way along until it reaches the end of the word, when it adds an entry for "$": true to the current branch, as a final leaf.

Console output

If you save your changes, relaunch your page and check the Developer Console, you should see output that looks something like this (but a bit less prettified).

{" ":{"$":true}
,"a":{"$":true
     ,"i":{"r":{"$":true}}}
,"e":{"x":{"p":{"o":{"s":{"e":{"d":{"$":true}}}}}}
     ,"y":{"e":{"$":true}}}
,"f":{"o":{"r":{"$":true
               ,"a":{"n":{"e":{"$":true}}}}}}
,"g":{"o":{"l":{"d":{"$":true}}}}
,"h":{"e":{"$":true}}
,"i":{"n":{"$":true}
     ,"s":{"$":true}}
,"l":{"o":{"o":{"k":{"$":true}}}}
,"n":{"e":{"e":{"d":{"$":true
                    ,"l":{"e":{"$":true}}}}}}
,"o":{"l":{"d":{"$":true}}
     ,"r":{"$":true}}
,"p":{"e":{"n":{"c":{"e":{"$":true}}}}}
,"r":{"o":{"u":{"n":{"d":{"$":true}}}}}
,"s":{"e":{"a":{"$":true}}
     ,"i":{"l":{"e":{"n":{"c":{"e":{"$":true}}}}}}}
,"t":{"a":{"l":{"k":{"$":true}}}
     ,"h":{"e":{"$":true
               ,"s":{"e":{"$":true}}}}
     ,"o":{"$":true,"t":{"$":true}}
     ,"u":{"p":{"$":true
               ,"p":{"e":{"n":{"c":{"e":{"$":true}}}}}}}}
,"u":{"p":{"$":true}}
,"w":{"o":{"r":{"t":{"h":{"$":true}}}}}}

Note how the letters of the word "these" (shown in red) have been added to the trie object.

Detecting words with a trie

In this section, you'll learn how to:
  • Use a trie object to analyze a string of text and split it into known words.
Download the source files Test Here

Looking for known words

Now that you have a trie object, you can use it identify words in a string. The splitIntoWords method in the code listing below will step through the string one character at a time, matching it to a path through the trie. If it fails to find a word at any point, it will backtrack and start from the last point where it found a whole word.

js/.dictionaries.js

"use strict"

var dico = {}

;(function addDictionaries() {
  dico = {

    dictionaries: {
      // code omitted for clarity
    }

  , tries: {}

  , initialize: function createTries(){
      // code omitted for clarity
    }

  , createTrie: function createTrie(languageMap) {
      // code omitted for clarity
    }

  , splitIntoWords: function splitIntoWords(string, languageCode) {
      var trie = this.tries[languageCode]
      var alternatives = []
      var path = trie
      var words = []
      var word = ""
      var found = false
      var char
        , next
        , alternative

      words.index = 0

      for (var ii = 0, total = string.length; ii < total ; ii += 1) {
        char = string[ii]
        next = path[char]

        if (next) { // { "$": 0, "a": { ... }. ...}          
          word += char

          if (next.$) {
            if (found) {
              // We've already added a shorter word to the list.
              // Push the list with the shorter word onto the
              // alternatives array and continue with the longer word
              alternative = words.slice(0)
              alternative.index = words.index
              alternatives.push(alternative)

              words.index -= words.pop().length // remove shorter word
            }
            // Add this whole word to the list
            words.push(word)
            words.index += word.length
            found = true
          }

          path = next

        } else if (found) {
          // !next, but we have a complete word already
          // This character could be the start of a new word
          word = char
          path = trie[char]
          if (!path) {
            // It's imposible to start a word with this letter
            backtrack("Initial letter absent: " + char)

          } else if (path.$) {
            // This may be a one-letter word: found remains true
            words.push(word)
            words.index += word.length
            next = path

          } else {
            found = false
          }

        } else {
          // !next && !found: stop trying to complete this word
          backtrack("Word fragment: " + word)
        }
      }

      if (!next || !next.$) {
        backtrack("Fail – Unknown word: ", word)
      }

      return words

      function backtrack(reason) {
        console.log(reason)

        // Try the longest earlier alternative ...
        words = alternatives.pop()
        // ... if one exists ...
        if (!words) {
          console.log("Failed to segment")
          ii = total
          return
        }
        // ... and see if all the remaining characters cand be
        // segmented into words
        ii = words.index - 1
        path = trie
        word = ""
        found = false
      }
    }
  }.initialize()
})()

console.log(dico.splitIntoWords("lookforaneedleinthesea", "enx"))
console.log(dico.splitIntoWords(
  "งมเข็มในมหาสมุทร "
+ "พูดไปสองไพเบี้ย นิ่งเสียตำลึงทอง "
+ "ตากลม ตา​กลม"
, "th"))

The output in the Developer Console should look something like this:

["look", "for", "a", "needle", "in", "these", "a", index: 22]
["งม", "เข็ม", "ใน", "มหาสมุทร", " ", "พูด", "ไป", "สอง",
   "ไพ", "เบี้ย", " ", "นิ่ง", "เสีย", "ตำลึง", "ทอง", " ", 
   "ตาก", "ลม", " ", "ตา", "​", "กลม", index: 62]

You'll notice that the last two words in English are "these", "a" and not "the", "sea". The last chunks of Thai has been split differently from the previous chunk, because of the zero-width space &#8203;.

splitIntoWords

The splitIntoWords method accepts a string of text and a language code. It uses the language code to select the appropriate language trie. It takes the first character in the string and looks for it at the top level of the trie, then takes the second letter, and looks for it in the branch of the trie defined by the previous letter, adding it to its word variable as it goes. It will continue to do this until one of three things happens:

  • It meets an object with a "$": true key/value pair
  • It finds an object which contains no key for the letter in the string that it is currently looking at
  • It reaches the end of the text

Adding words

The algorithm will step through the string and tunnel into the trie until it has created the string "look". At this point, it will encounter a {"$": true} object. This tells it to add the word "look" to its current words list, and to set a flag to say that at least one word has been found. Since an array is special kind of object, it can add a property to the array: it adds an index key with a value equal to the length of the string read so far. You'll see in a moment how this is used.

Preparing alternatives

The splitIntoWords algorithm is greedy: it will keep looking for the longest word that fits, and this may lead it into a blind alley. To test this, you'll find the nonsense word "forane" in the dictionary. After finding "look", the algorithm keeps moving forward. When it has read "for", it will find an object with both a "$": true key pair and the key "a", with a submap that it can continue to follow.

It adds the word "for" to the words array, and updates the index of the array: index is now the sum of the lengths of the word in the array.

The process continues until the {"$": true} object, indicating the end of the nonsense word "forane", is found. This time, however, there is already a word that uses the letters "for" in the words array. The splitIntoWords function first uses words.slice(0) to copy the array, and also copies the value of index, to create a full clone. This clone is added to the alternatives array. In the original words array, the last word ("for") is removed and replaced with "forane", and the index value is updated.

A new word now begins, starting with the letter "e", but at the next iteration, no sub-map of trie is found that corresponds to the word fragment "ed". The splitIntoWords function has reached the end of a blind alley: it calls the backtrack function. This drops the current words array and replaces it with the longest array in alternatives: ["look", "for"], with its index value of 7. It resets the ii iterator so that the next iteration starts building a new word from position ii = 7: the index of the next letter after "lookfor". This time, it finds the words "a" and "need", which is soon replaced by "needle", and continues to read "these", and "a". This is a complete word and the final letter in the string, so the process stops there.

Issues with a dictionary approach

This approach suffers from two major problems:

  • It is possible to split the letters in the wrong place and still create a set of valid (but meaningless) words
  • If a word is not in the dictionary, it will fail

Ambiguous word boundaries

The algorithm has no "intelligence" built in, so it has no reason to reject "these", "a" and replace it with "the", "sea". Its greediness has led it into error. Refusing to be greedy could lead it to make different errors.

People who speak English know almost instantly where to split what they hear into words, based on meaning and statistics. Until you read this paragraph, it's unlikely that you have ever encountered a meaningful sentence that ends with the words "in these a".

As far as we know, computers are totally unaware of meaning, but a quick search on Google will show you that, statistically, "in the sea" is over 300 times more likely to appear in written English than "in these a". It should therefore be possible to improve the accuracy of the splitIntoWords algorithm by checking the statistical likelihood of each subset of words appearing together. However, this will require access to a vast database: not just a dictionary of words but a store of phrases indexed by the words they contain.

It is not feasible to load this database in your browser. It uses much less bandwidth to send the text for analysis to the server, where it can be treated by all the CPU power necessary, and then to send the results back.

Unknown words

In this simple implementation, the absence of a single word from the dictionary will cause the algorithm to fail. A more robust version would necessarily be slower, and, without a database to provide statistical input, may still produce erroneous results.

A rule-based system for detecting syllables

Even in a language where there are no gaps between words, there are several places that are clearly boundaries between syllables. In this section, you'll learn about:
  • Characters that only appear at the beginning of a Thai syllable
  • Characters that appear only at the end of a Thai syllable
  • Thai numbers
  • Boundaries between Thai characters and non-Thai characters
  • Common Thai words that cannot be created by chance across a word boundary

You'll create a series of regular expressions to detect any of these boundaries, and use them to determine places where there must be a syllable boundary.

Download the source files Test Here

The approach described in this section is inspired by the ThaiWrap bookmarklet / ตัวตัดบรรทัดข้อความไทย written by Arthit Suriyawongkul / อาทิตย์ สุริยะวงศ์กุล

Initial and final letters in Thai

In English, you are used to seeing vowel sounds that are with multiple vowel characters. For example: "fair" and "fire". In the word "fire", the final "e" appears after the last consonant, but it is not pronounced. Instead, its existence changes the sound of the preceding "i".

In Thai, certain vowels are written before the first consonant, but are pronounced after the consonant. The Thai tone-twister ไม้ ใหม่ ไม่ ไหม้ ไหม (máai mài mâi mâi mái, green wood doesn't burn, does it?) uses two of these vowels: ไ and ใ. The character for "m" is ม, which is written after the vowel. You can therefore be sure that wherever you find a ไ or ใ character, it is the first character in a syllable.

The five vowels that work this way are เ แ โ ใ ไ. Fortunately, they have been placed together in the Thai Unicode block, so you can use the regular expression /[เ-ไ]/ for them.

There are also a certain number of characters that only ever appear at the end of a syllable. These are more scattered so you need to use each individual character in the regular expression that identifies them all: /[ฯะำฺๅๅๆ๎]/. If you encounter any of these characters in Thai text, you can be 100% sure that the next character is the start of a new syllable.

Numbers

The Thai numbers from 0 to 9 are written as:

๐ ๑ ๒ ๓ ๔ ๕ ๖ ๗ ๘ ๙

Anywhere where a number is followed by a non-number, or a non-number is followed by a number, must be the end of a word. These numbers can be expressed in the regular expression: /[๐-๙]/. To define all the Thai characters that are not numbers, you can use /[ก-ฺเ-๎]/

Non-Thai characters

Thai people often use Western numerals (0 - 9), and Western names may be written in Roman characters. You can be sure that any transition from Thai to non-Thai will occur at a word boundary.

Common words

Out of the top 100 words in Thai, there are 30 which form unambiguous syllables. Out of the top 2500, there are 51 such words, as shown in the following regular expression:

/(เป็น|ใน|จะ|ไม่|และ|ได้|ให้|ความ|แล้ว|กับ|อยู่|หรือ|กัน|จาก|เขา|ต้อง|ด้วย|นั้น|ผู้|ซึ่ง|โดย|ใช้|ยัง|เข้า|ถึง|เพราะ|จึง|ไว้|ทั้ง|ถ้า|ส่วน|อื่น|สามารถ|ใหม่|ใช่|ใด|ช่วย|ใหญ่|เล็ก|ใส่|เท่า|ใกล้|ทั่ว|ฉบับ|ใต้|เร็ว|ไกล|เช้า|ซ้ำ|เนื่อง|ค้น)/

If you encounter any of these words, you can be sure that the preceding character is the end of the previous syllable and the next character is the beginning of the following syllable.

Rule-based segmentation

You can add the segment object, as shown below, to your dictionaries.js script. When you call segment.th(string), it will search through the string, applying each regular expression in turn, looking for all these identifiable syllable boundaries. It adds the position of these boundaries to an array. Some boundaries will be found multiple times, and these are filtered out. In the last step, it takes each boundary point, starting from the end, and replaces it with the segment of the string that starts from that point.

js/dictionaries.js

"use strict"

var dico = {}
var segment

;(function addDictionaries() {
  dico = {
    // code omitted for clarity
  }.initialize()

  segment = {
    th: function thai(string) {
      // unambiguous words that are common, like prepositions
      var cw = "(เป็น|ใน|จะ|ไม่|และ|ได้|ให้|ความ|แล้ว|กับ|อยู่|หรือ|กัน|จาก|เขา|ต้อง|ด้วย|นั้น|ผู้|ซึ่ง|โดย|ใช้|ยัง|เข้า|ถึง|เพราะ|จึง|ไว้|ทั้ง|ถ้า|ส่วน|อื่น|สามารถ|ใหม่|ใช่|ใด|ช่วย|ใหญ่|เล็ก|ใส่|เท่า|ใกล้|ทั่ว|ฉบับ|ใต้|เร็ว|ไกล|เช้า|ซ้ำ|เนื่อง|ค้น)"
      // leading chars
      var lc = "[เ-ไ]"
      // final chars
      var fc = "[ฯะำฺๅๅๆ๎]"
      // thai chars
      var tc = "ก-ฺเ-๎" // not including numbers + ฿, ๏ or ๚๛
      var no = "๐-๙"
      var isThai = /[ก-ฺเ-๎]/
      var isNumber = /[๐-๙]/

      var regexes = [
        // characters that start a syllable
        new RegExp(lc, "g")
        // characters than end a syllable
      , new RegExp(fc, "g")
         // non-number followed by any Thai number
      , new RegExp("[^"+no+"](?=["+no+"])", "g")
         // Thai number followed by any non-number
      , new RegExp("["+no+"](?=[^"+no+"])", "g")
         // Thai character followed by a non-Thai character
      , new RegExp("["+tc+"](?=[^"+tc+"])", "g")
        // non-Thai character followed by Thai character
      , new RegExp("[^"+tc+"](?=["+tc+"])", "g")
        // any char followed by known word
      , new RegExp("."+cw+"", "g")
        // known word followed by any character
      , new RegExp(cw+"(.)", "g")
        // beginning of a space
      , /.\s/g
        // end of a space
      , /\s+./g
      ]

      var adjustments = [
        // characters that start a syllable
        0
        // characters that end a syllable
      , 1
        // non-number followed by any Thai number
      , 1
        // Thai number followed by any non-number
      , 1
        // Thai character followed by a non-Thai character
      , 1
        // non-Thai character followed by Thai character
      , 1
        // any char followed by known word
      , 1
        // known word followed by any character
      , true
        // spaces
      , 1
      , true
      ]

      function numerical(a, b) {
        return a - b
      }

      function removeDuplicates(value, index, array) {
        return array.indexOf(value) === index
      }

      function splitIntoWords(string) {
        var segments = [0]
        var total
          , ii
          , regex
          , adjust
          , result
          , split
          , start
          , end

        for (ii = 0, total = regexes.length; ii < total; ii++) {
          regex = regexes[ii]
          adjust = adjustments[ii]
          while (result = regex.exec(string)) {
            split = result.index + (adjust === true
                           ? result[0].length - 1
                           : adjust)
            segments.push(split)
          }
        }

        segments.sort(numerical)
        segments = segments.filter(removeDuplicates)

        end = string.length
        ii = segments.length
        while (ii--) {
          start = segments[ii]
          segments[ii] = string.slice(start, end)
          end = start
        }  

        return segments
      }

      return splitIntoWords(string)
    }
  }
})()

console.log(dico.splitIntoWords(
  "งมเข็มในมหาสมุทร "
+ "พูดไปสองไพเบี้ย นิ่งเสียตำลึงทอง "
+ "ตากลม ตา​กลม"
, "th"
))
console.log(segment.th(
  "งมเข็มในมหาสมุทร "
+ "พูดไปสองไพเบี้ย นิ่งเสียตำลึงทอง "
+ "ตากลม ตา​กลม"
))

Results

In the Developer Console, you will see the output of the dictionary-based and the rule-based algorithms. Differences are marked in colour below:

["งม","เข็ม","ใน","มหาสมุทร"," ",
   "พูด","ไป","สอง","ไพ","เบี้ย"," ","นิ่ง","เสีย","ตำลึง","ทอง"," ",
   "ตาก","ลม"," ",ตา","​","กลม", index: 62]

["งม","เข็ม","ใน","มหาสมุทร"," ",
   "พูด","ไปสอง","ไพ","เบี้ย"," ","นิ่ง","เสียตำ","ลึงทอง"," ",
   "ตากลม"," ",ตา","​","กลม"]

In the majority of cases, the two approaches agree. However:

  • The rule-based approach does not identify a boundary in four cases (one green, one blue and two either side of the red chunk)
  • The rule-based approach splits one word (shown in red) along its syllable boundaries, and the two halves fail to be detached from the adjacent words.

This rule-based approach can function with moderate accuracy with a dictionary of only a few dozen words. It may make fail to recognize all the syllable boundaries, but it does not fail if it encounters an unknown word.

Asynchronous calls

Correctly identifying word boundaries in Thai is not a task that is easy to perform in the browser. A better solution is to send the Thai text to a server for treatment, and to use the results as soon as they are available.

Over the next three sections you will be learning to:

  • Make an asynchronous call to a simulated server
  • Make this call each time new Thai text is found in the web page, whether this is when the page is loaded or when new text is added on the fly
  • Use the results of these calls to identify the neighbouring word when the arrow keys are pressed
In this section, you'll learn how to:
  • Create an ASYNC object that behaves like Meteor.call in a Meteor environment, or as a wrapper for an plain AJAX call
  • Add a getWordSegmentation method to your global dico object
  • Test an asynchronous call to getWordSegmentation and observe the results in the Developer Console.
Download the source files Test Here

The ASYNC object

You can start by creating a file called async.js and saving it in your js/ folder. Here's the code that defines the essential call method:

js/async.js

"use strict"

var ASYNC

;(function async(){
  ASYNC = {
    call: function call (methodName) {
      var method = this.methods[methodName]
      var args = [].slice.call(arguments)
      var callback = args.pop()
      var result
        , error
   
      if (typeof method === "function") {
        args.shift()
        
        if (typeof callback === "function") {
          // Simulate asynchronous call
          result = method.apply(this, args)

          if (result instanceof Error) {
            error = result
            result = null
          } else {
            error = null
          }

          setTimeout(function () {
            callback(error, result)
          }, Math.floor(Math.random() * 2000))

        } else {
          // Return result synchronously
          args.push(callback)
          return method.apply(this, args)
        }

      } else {
        error = new Error("Unknown method: " + methodName)
        if (typeof callback === "function") {
          callback(error)
        } else {
          return error
        }
      }
    }

  , methods: {}
  }
})()

Just as with Meteor.call you call this using the syntax:

ASYNC.call("methodName", arguments ... , callback function)

If a callback function is given, it will return the result asynchronously (after a random delay). If no callback is given, a result will be returned synchronously.

The getWordSegmentation method

You can add any method you like to the methods object. Here's how you can add a getWordSegmentation method, which will call a getWordMap method that you haven't yet added to the global dico object.

js/async.js

"use strict"

var ASYNC

;(function async(){
  ASYNC = {
    call: function call(methodName) {      
      // code omitted for clarity
    }

  , methods: {
     /**
       * getWordSegmentation calls dico.getWordMap() for each 
       * string in the map object, and returns an object defining
       * the positionsof the start and end of each word in each
       * string.
       * @param  {object} map has the format
       *                    { <lang code>: [
       *                        <string>
       *                      , ...
       *                      ]
       *                    , ...
       *                    }
       * @return {object}   { <lang code>: [
       *                        { text: <string>
       *                        , offsets: { starts: [
       *                                       <integer>
       *                                     , ...
       *                                     ]
       *                                   , ends: [
       *                                       <integer>
       *                                     , ...
       *                                     ]
       *                                   }
       *                           }
       *                         , ...
       *                         ]
       *                       , ...
       *                       }
       */ 
    , getWordSegmentation: function getWordSegmentation(map) {      
        var result = {}
        var lang

        for (lang in map) {
          treatLanguage(lang)
        }

        return result

        function treatLanguage(lang) {
          var stringArray = map[lang]
          var langArray = result[lang]

          if (!langArray) {
            langArray = []
            result[lang] = langArray
          }

          ;(function treatStrings() {
            var ii = stringArray.length
            var wordMap

            while (ii--) {
              try {
                wordMap = dico.getWordMap(stringArray[ii], lang)
                langArray.push(wordMap)
              } catch (error) {
                result = error
                ii = 0
              }
            }
          })()
        }
      }
    }
  }
})()

The getWordSegmentation expects as an argument an object map whose keys are lang ISO codes, and whose properties are arrays of strings in the given language. You'll have the chance to test it now.

Editing index.html

You'll your to declare your new async.js script in your index.html file. At the same time, you can add a little JavaScript test to call the getWordSegmentation method in the ASYNC object.

index.html

<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>Selecting Thai words</title>
</head>

<body>
  <p lang="th">งมเข็มในมหาสมุทร พูดไปสองไพเบี้ย นิ่งเสียตำลึงทอง ตากลม  ตา​กลม</p>
  <p lang="enx">lookforaneedleinthesea talkisworthtuppence silenceisworthgold exposedtotheair roundeye</p>

  <script src="js/dictionaries.js"></script>
  <script src="js/async.js"></script>
  <script>
    var map = {
      th: [
        "งมเข็มในมหาสมุทร"
      , "พูดไปสองไพเบี้ย นิ่งเสียตำลึงทอง"
      , "ตากลม ตา\u200bกลม"
      ]
    , enx: [
        "lookforaneedleinthesea"
      , "talkisworthtuppence silenceisworthgold"
      , "exposedtotheair roundeye"
      ]
    }

    ASYNC.call("getWordSegmentation", map, callback)

    function callback(error, result) {
      console.log(error || JSON.stringify(result))
    }
  </script>
</body>
</html>

Provoking an error

You may have noticed that you haven't yet added a getWordMap method to your dico object. The JavaScript engine in your browser certainly will. If you save your edited files and refresh your browser, you should see the following in the Developer Console:

TypeError: dico.getWordMap is not a function(…)

The getWordMap method

To fix this error, you need to add a getWordMap method to your dico object.

js/dictionaries.js

"use strict"

var dico
var segment

;(function addDictionaries() {
  dico = {

    dictionaries: {
      // code omitted for clarity
    }

  , tries: {}

  , initialize: function createTries(){
      // code omitted for clarity
    }

  , createTrie: function createTrie(languageMap) {
      // code omitted for clarity
    }

  , splitIntoWords: function splitIntoWords(string, languageCode){
      // code omitted for clarity
    }

  , getWordMap: function getWordMap(string, languageCode) {
      var segments = this.splitIntoWords(string, languageCode)
      var regex = /[^\s!-\/:-@[-`{-~\u00A0-¾—-⁊\u200b]/
      var offsets

      if (!segments) {
        // Split words in unknown language by ASCII word boundaries
        segments = string.split(/\b/)
      }

      offsets = getOffsets()

      return { 
        text: string
      , offsets: offsets
      }

      function getOffsets() {
        var starts = []
        var ends = []
        var total = segments.length
        var index = 0
        var ii
          , segment
          , length
        
        for (ii = 0; ii < total; ii += 1) {
          segment = segments[ii]

          if (regex.test(segment)) {
            // This segment contains word characters
            starts.push(index)
            index += segment.length
            ends.push(index)
          } else {
            // This segment is punctuation or whitespace
            index += segment.length
          }
        }

        return { starts: starts, ends: ends }
      }
    }
  }.initialize()

  segment = {
    // code omitted for clarity
  }
})()

The getWordMap method accepts a string and a language code as arguments. It first checks if the dico object has a dictionary for the given language, and if so, it uses that to split the string into known words. In practice, this would be done on the server by a specialized application. If no dictionary exists for the given language, it splits the string into words and spaces using the simplistic /\b/ regular expression, which really only works will with plain English text.

For example, it will split the string "talkisworthtuppence silenceisworthgold" into:

["talk", "is", "worth", "tuppence", " ", "silence", "is", "worth", "gold"]

The getOffsets function

The getOffsets function looks at each entry in the segments array in turn, and checks whether or not it contains a non-space character. The regular expression it uses a to do this is rather overkill for this proof-of-concept.

/[^\s!-\/:-@[-`{-~\u00A0-¾—-⁊\u200b]/

This checks that the string contains none of the non-word characters (such as whitespace, zero-width spaces and punctuation) that are used in any European language. If so, the segment must be a word.

The getOffsets function populates two lists – starts and ends – with the index of the starting point and end point of each word. For example, for the string "talkisworthtuppence silenceisworthgold", starts will be ...

[0, 4, 6, 11, 20, 27, 29, 34]

... indicating that the word "talk" starts at 0 , the word "is" starts at 4 , and so on. The ends array will be ...

   [4, 6, 11, 19, 27, 29, 34, 38]

Each end point corresponds to the start point of the next word except for 19 which indicates the end of "tuppence" and the beginning of a space, while 20 in the starts array indicates the beginning of the word "silence".

Testing a second time

Now that you have added the getWordMap method, you can save your files and refresh your browser. This time you should see more interesting output in the Developer Console. Here's a prettified version, complete with colouring so that you can see how the index numbers and the word boundaries match each other.

{ "th": [
    { "text": "ตากลม ตา​กลม"
    , "offsets": {
        "starts": [0,3,6,9]
      , "ends":     [3,5,8,12]
      }
    }
  , { "text": "พูดไปสองไพเบี้ย นิ่งเสียตำลึงทอง"
    , "offsets": {
        "starts": [0,3,5,8,10,16,20,24,29]
      , "ends":     [3,5,8,10,15,20,24,29,32]
      }
    }
  , { "text": "งมเข็มในมหาสมุทร"
    , "offsets": {
        "starts":[0,2,6,8]
      , "ends":    [2,6,8,16]
      }
    }
  ]
, "enx": [
    { "text": "exposedtotheair roundeye"
    , "offsets": {
        "starts": [0,7,9,12,16,21]
      , "ends":     [7,9,12,15,21,24]
      }
    }
  , { "text": "talkisworthtuppence silenceisworthgold"
    , "offsets": {
        "starts": [0,4,6,11,20,27,29,34]
      , "ends":     [4,6,11,19,27,29,34,38]
      }
    }
  , { "text": "lookforaneedleinthesea"
    , "offsets": {
        "starts": [0,4,7,8,14,16,19]
      , "ends":     [4,7,8,14,16,19,22]
      }
    }
  ]
}
Figure 1. Prettified output from getWordSegmentation

Detecting text changes

In this section, you'll learn how to:
  • Use a MutationObserver object to detect each time the text in an HTML page changes
  • Use a TreeWalker object to find which textNodes have changed
  • Detect the lang associated with a given textNode
  • Check whether the user can select the text in a given textNode
  • Maintain a map of word boundary data for all relevant textNodes on the page
Download the source files Test Here

Editing the index.html file

To test the new features that you will be adding in this section, you can edit your index.html file as shown below. These changes:

  • Set the default value of the lang attribute of the entire HTML page to en (English)
  • Add text that will not need to be treated
  • Add spans to the Thai (th) and EnglishNoSpaces (enx) text, to create a number of textNodes
  • Add two spans with a class of unselectable, which will also not need to be treated
  • Add a button and code so that you can remove the unselectable class, after the page has loaded
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Selecting Thai words</title>
  <link rel="stylesheet" type="text/css" href="css/style.css">
</head>

<body>
  <h1>Word Segmentation Test</h1>
  <p>The default <em>lang</em> of this page is <em>en</em>.
  The paragraphs below are in Thai (<em>th</em>) and
  EnglishNoSpaces (<em>enx</em>).</p>
  <p lang="th">งมเข็มในมหาสมุทร
  <span>พูดไปสองไพเบี้ย นิ่งเสียตำลึงทอง</span>
  <span class="unselectable">คำเหล่านี้ไม่สามารถเลือกได้</span>
  ตากลม  ตา​กลม</p>
  <p lang="enx">lookforaneedleinthesea
  <span>talkisworthtuppence silenceisworthgold</span>
  <span class="unselectable">thewordsherecannotbeselected</span>
  exposedtotheair roundeye</p>

  <button type="button">Remove .unselectable class</button>

  <script>
  document.querySelector("button").onclick = function (event) {
    var spans = document.querySelectorAll(".unselectable")
    for (var ii = 0, length = spans.length; ii < length; ii += 1) {
      spans[ii].classList.remove("unselectable")
    }
    this.setAttribute("disabled", true)
  }
  </script>

  <script src="js/dictionaries.js"></script>
  <script src="js/async.js"></script>
  <script src="js/selection.js"></script>
</body>
</html>

Updating the dictionaries

This new version of index.html includes new words that you'll need to add to the dictionaries in the dico object:

js/dictionaries.js

"use strict"

var dico
var segment

;(function addDictionaries() {
  dico = {

    dictionaries: {
      enx: {
        " ": 0
      , "a": 0
      , "air": 0
      , "be": 0
      , "can": 0
      , "cannot": 0
      , "elect": 0
      , "elected": 0
      , "ere": 0
      , "exposed": 0
      , "eye": 0
      , "for": 0
      , "gold": 0
      , "he": 0
      , "here": 0
      , "hew": 0
      , "in": 0
      , "is": 0    
      , "look": 0
      , "need": 0
      , "needle": 0
      , "not": 0
      , "old": 0
      , "or": 0
      , "pence": 0
      , "ran": 0
      , "round": 0
      , "sea": 0
      , "select": 0
      , "selected": 0
      , "she": 0
      , "silence": 0
      , "talk": 0
      , "the": 0
   // , "these": 0
      , "to": 0
   // , "tot": 0
      , "tup": 0
      , "tuppence": 0
      , "up": 0
      , "word": 0     
      , "words": 0
      , "worth": 0
      }

    , th: {
        " ": {}
      , "​": {} // ​
      // code omitted for clarity
      , "ไพ" : {
          "pronunciation": "phai-"
        , "translation": "a certain old coin equal in value to 1/32 baht"
        }
      , "คำ": {
          "pronunciation": "kham-"
        , "translation": "term; discourse; a mouthful or bite; morsel [numerical classifier for a word, an answer to a question, a spoonful of food]"
        }
      , "เหล่า": {
          "pronunciation": "lao_"
        , "translation": "[numerical classifier for groups of items]; a group [of items or things]"
        }
      , "นี้": {
          "pronunciation": "nee'"
        , "translation": "this; these; [is] now"
        }
      , "ไม่": {
          "pronunciation": "mai`"
        , "translation": "not; no; [auxiliary verb] does not; has not; is not; [negator particle]"
        }
      , "สามารถ": {
          "pronunciation": "saa´ maat`"
        , "translation": "capable; able; to have the ability to; can; a Thai given name"
        }
      , "เลือก": {
          "pronunciation": "leuuak`"
        , "translation": "to select or choose; elect; pick"
        }
      , "ได้": {
          "pronunciation": "dai`"
        , "translation": "can; to be able; is able; am able; may; might [auxiliary of potential, denoting possbility, ability, or permission]; to receive; to obtain; acquire; get; have got; to pick out; to choose; to pass [a test or an exam]"
        }
      }
    }

  , tries: {}

  , initialize: function createTries(){
      // code omitted for clarity
    }

  , createTrie: function createTrie(languageMap) {
      // code omitted for clarity
    }

  , splitIntoWords: function splitIntoWords(string, languageCode) {
      // code omitted for clarity
    }
  }.initialize()

  segment = {
    // code omitted for clarity
  }
})()

Making text unselectable

The unselectable class does nothing without a CSS rule. You can create a file called styles.css and save it in a folder named css alongside your HTML document.

css/styles.css

body {
  width: 22em;
}

.unselectable {
  -webkit-touch-callout: none; /* iOS Safari */
  -webkit-user-select: none;   /* Chrome/Safari/Opera */
  -khtml-user-select: none;    /* Konqueror */
  -moz-user-select: none;      /* Firefox */
  -ms-user-select: none;       /* Internet Explorer/Edge */
  user-select: none;
  -webkit-user-drag: none;
  user-drag: none;
  color: #999;
}

Detecting the value of the lang attribute for a node

In the current circumstances, you want to ask the server only for the word boundary data for textNodes in Thai or in our invented EnglishNoSpaces. To choose only text nodes in these languages, you need no detect the closest parent that has a lang attribute, and return the value of that attribute.

You can create a new file named selection.js and save it in your js/ folder. Here's a getLang function. You can leave it in the global namespace for now.

js/selection.js

function getLang(node) {
  var lang = ""

  if(!node.closest) {
    node = node.parentNode
  }
  node = node.closest("[lang]")
  if (node) {
    lang = node.getAttribute("lang")
  }

  return lang
}

You can test this in the Developer Console:

getLang(document.body)
"en"
getLang(document.querySelector(".unselectable"))
"th"
getLang(document.querySelector("[lang=enx] span"))
"enx"
getLang(document.querySelector("button"))
"en"

Early versions of Internet Explorer do not support the element.closest() method. If you need to support such browsers, you can use this polyfill.

Detecting unselectable nodes

There is no point asking the server send word boundary information on text that cannot be selected. To make the contents of an HTML element unselectable, your can create a CSS rule based on the user-select property. As of August 2016, he major browser vendors have not reached agreement on how this is to be implemented, so there are a number of properties with vendor prefixes that you need to test for.

You can use getComputedStyle(element) to obtain an object with key/value pairs for all properties applied to an element. You can then use the array.every method to check all the versions of user-select and to stop looking any further if any of them has a value of none.

js/selection.js

function elementIsSelectable(element) {
  var prefixes = [
    "-webkit-"
  , "-khtml-"
  , "-moz-"
  , "-ms-"
  , ""
  ]
  var style = window.getComputedStyle(element)

  var selectable = prefixes.every(function check(key) {
    key += "user-select"
    return style.getPropertyValue(key) !== "none"
  })

  return selectable
}

You can test this in the Developer Console:

elementIsSelectable(document.body)
true
elementIsSelectable(document.querySelector(".unselectable"))
false
elementIsSelectable(document.querySelector("[lang=enx] span"))
true
elementIsSelectable(document.querySelector("button"))
true

Detecting changes to text on the page

To detect changes to the DOM, you can create a MutationsObserver object, and tell it which function to call when an mutation is observed.

In the code listing below, the getLang and elementIsSelectable functions have been moved inside an immediately-invoked function expression, to take them out of global namespace, now that you have had a chance to test them.

"use strict"

;(function selection() {
  var observer = new MutationObserver(checkForAlteredTextNodes)

  observer.observe(document.body, { 
    childList: true
  , attributes: true
  , subtree: true
  })

  function getLang(node) {
    // code omitted for clarity
  }

  function elementIsSelectable(element) {
    // code omitted for clarity
  }

  function checkForAlteredTextNodes(mutations) {
    mutations.forEach(populateNewTextArray)

    function populateNewTextArray(mutation) {
      // TODO
      console.log(mutation.type, mutation.target, mutation)
    }
  }
})()

A MutationObserver is set up in two steps. First you create the object and define the function to call when any changes are detected. Then you call observer.observe(target, mutationOptions), to tell it which target element to observe, and what changes to observe on that element.

The options object needs to set at least one of the following properties to true: attributes, characterData or childList. If your target is a specific textNode then it is useful to observe characterData. If you want to observe changes to the text of all the children of a element (as you do here), then you need to set both childList and subtree to true. In the test that you will run shortly, you will convert unselectable text to selectable text by removing a class from a node; to observe this change, you will also need to set attributes to true.

The populateNewTextArray function doesn't do much yet, but it does display information about the mutations that are observed in the Developer Console. Here's what you should see when you refresh the page:

The MutationRecord, with its type and target, that is created when the page is loaded
Figure 2. A MutationRecord with its type and target, that is created when the page is loaded

When the page is loaded, the childList of the document.body is altered. When you click on the button, you will change the classList attribute of two spans, and the disabled attribute of the button itself. Here is what these mutations look like:

The Mutation Records created when you click on the button
Figure 3. The Mutation Records created when you click on the button

Generating the map to send to getWordSegmentation

The purpose of the populateNewTextArray function is to walk through the body of the entire document, looking for textNodes that have been added or changed, and specifically those that need to be set to be treated by the getWordSegmentation method. Once a string has been segmented into words, its segmentation data will be stored in a wordsMap object to be used as needed. The keys in the original emptywordsMap object also indicate which languages need such special treatment.

Using a TreeWalker

An efficient way of doing this is to use a TreeWalker object. The document.createTreeWalker method accepts up to four arguments. The fourth is not useful here. The three arguments you can use are:

  • The root node for the tree to walk
  • An NodeFilter integer to indicate which types of nodes in the tree you want to consider
  • A filter object with an acceptNode method, which will be called for each node that matches the types indicated by the NodeFilter integer

Normally, the acceptNode method is used to return true or false, to indicate whether the node should be returned and added to an array for further treatment. In the code listing below, the acceptNode method is co-opted to do all the work of building the map that will be sent to the getWordSegmentation method.

js/selection.js

"use strict"

;(function selection() {
  var wordsMap = { th: {}, enx: {} }
  var observer = new MutationObserver(checkForAlteredTextNodes)

  observer.observe(document.body, { 
    childList: true
  , attributes: true
  , subtree: true
  })

  function getLang(node) {
    // code omitted for clarity
  }

  function elementIsSelectable(element) {
    // code omitted for clarity
  }

  function checkForAlteredTextNodes(mutations) {
    var newTextMap = {} // { <lang>: [<string>, ...], ...}
    var newTextFound = false

    mutations.forEach(populateNewTextArray)

    if (newTextFound) {
      ASYNC.call("getWordSegmentation", newTextMap, updateWordsMap)
    }

    function populateNewTextArray(mutation) {
      var filter = {
        acceptNode: function(node) {
          var text = node.data
          var lang
            , map
            , langArray

          if (!/^\s*$/.test(text)) {
            lang = getLang(node)

            if (map = wordsMap[lang]){
              if (elementIsSelectable(node.parentNode)) {
                if (!map[text]) {
                  if (!(langArray = newTextMap[lang])) {
                    langArray = []
                    newTextMap[lang] = langArray
                  }

                  langArray.push(text)
                  map[text] = true
                  newTextFound = true
                }
              }
            }
          }
        }
      }
      var walker = document.createTreeWalker(
        mutation.target
      , NodeFilter.SHOW_TEXT
      , filter
      )

      while (walker.nextNode()) {
        // Action occurs in filter.acceptNode 
      }
    }

    function updateWordsMap(error, result) {
      // TODO
      console.log(error || JSON.stringify(result))
    }
  }
})()

For each textNode, the filter.acceptNode function is called. This runs through the following checklist:

if (!/^\s*$/.test(text)) {...}
Does this textNode contain anything other than whitespace?
lang = getLang(node); if (map = wordsMap[lang]) {...}
Is this textNode in a language that needs custom word segmentation?
if (elementIsSelectable(node.parentNode)) {...}
Does this textNode contain text that the user can select?
if (!map[text]) {
Has the string data found in this textNode not already been treated, or started to be treated?

If the answer to all these questions is Yes, then the string data of the textNode will be added to the appropriate language array in newTextMap, and an entry will be added in the appropriate place in the wordsMap object, to indicate that the string is about to be treated. If an identical string is found later, the script will know not to make a new request for treatment.

When you save your changes and refresh the page in your browser, you should see the same output as is shown in Figure 1 at the end of the previous section. This time, however, the text is taken directly from the HTML page, and not hard-coded in JavaScript.

Output generated automatically from newTextMap
Figure 4. Figure 1. Output generated automatically from newTextMap

If you click on the button, the spans that currently have the unselectable class will lose that class, and the getWordSegmentation method will be called again, with just the text that has become selectable:

Making text selectable will trigger the checkForAlteredTextNodes function again
Figure 5. Making text selectable will trigger the checkForAlteredTextNodes function again

updateWordsMap

The last step for this section is to update the wordsMap object, to store the data retrived by the asynchronous call to getWordSegmentation. You can make the changes shown below in your selection.js script, and then refresh the page in your browser.

js/selection.js

"use strict"

;(function selection(){
  var wordsMap = { th: {}, enx: {} }
  var observer = new MutationObserver(checkForAlteredTextNodes)

  observer.observe(document.body, { 
    childList: true
  , attributes: true
  , subtree: true
  })

  function getLang(node) {
    // code omitted for clarity
  }

  function elementIsSelectable(element) {
    // code omitted for clarity
  }

  function checkForAlteredTextNodes(mutations) {
    // code omitted for clarity
  }

  /**
   * updateWordsMap is triggered by a callback from the ASYNC
   * call to getWordSegmentation. For each language and each
   * text key, it replaces the existing `true` entry with the
   * `offsets` array for the word boundaries in the given
   * string.
   * @param  {Error|null}  error
   * @param  {null|object} result 
   *                       { <lang>: [
   *                           { "text": <string>
   *                           , "offsets": {
   *                               "starts": [<integer>, ...]
   *                             , "ends": [<integers>, ...]
   *                             }
   *                           }
   *                         , ...
   *                         ]
   *                       , ...
   *                       }
   */
  function updateWordsMap(error, result) {    
    if (error) {
      return console.log(error)
    }

    var lang
      , langArray
      , langMap
      , ii
      , textData

    for (lang in result) {
      langMap = wordsMap[lang]
      langArray = result[lang]

      ii = langArray.length
      while (ii--) {
        textData = langArray[ii]
        langMap[textData.text] = textData.offsets
      }
    }

    console.log(JSON.stringify(wordsMap))
  }
})()

Jump to adjacent words

In the Text Selection tutorial, you can see how to develop a script named selection.js that allows you to use the arrow keys to select the word adjacent to the current selection.

The script developed in that tutorial only works with languages where words are separated by spaces. In this current tutorial, you have learned how to generate word segmentation data for a string of Thai text. Now you're ready to merge the two selection.js scripts so that you can use the arrow keys to jump to adjacent words in Thai text too.

In this section, you'll discover how to extend the selection.js script from the Text Selection tutorial so that it also works with the Thai language. You see how to:
  • Add the functions you have developed in this tutorial to the original selection.js script
  • Modify the jumpLeft and jumpRight functions of the original selection.js script so that they work with Thai text
Download the source files Test Here

The original selection.js script is quite lengthy. This section shows only the parts of it where you need to make changes. It's a good idea to download the full source code and follow where the changes have been made.

Merging the two selection.js scripts

The abridged code listing below shows where the code from your new selection.js script can be added to the original script. Note that the jumpLeft and jumpRight functions that you will be looking at here are different from those in the original version.

js/selection.js

"use strict"

;(function selection(){
  // code omitted for clarity

  var box = document.body // querySelector(".box")
  var wordsMap = { th: {}, enx: {} }
  var observer = new MutationObserver(checkForAlteredTextNodes)

  observer.observe(document.body, { 
    childList: true
  , attributes: true
  , subtree: true
  })

  box.ondblclick = selectHyphenatedWords
  document.body.onkeydown = jumpToNextWord

  function selectHyphenatedWords(event) {
    // code omitted for clarity
  }

  function extendSelectionBackBeforeHypen(string, offset) {
    // code omitted for clarity
  }

  function extendSelectionForwardAfterHyphen(string, offset) { 
    // code omitted for clarity
  }

  function jumpToNextWord (event) {
    // code omitted for clarity
  }

  function jumpLeft() {
    // code here will be edited
  }

  function jumpRight() {
    // code here will be edited
  }

  function getAdjacentTextNode(node, whichSibling, arrayMethod) {
    // code omitted for clarity
  }

  function scrollIntoView(range) {
    // code omitted for clarity
  }

  function getLang(node) {
    // code omitted for clarity
  }

  function elementIsSelectable(element) {
    // code omitted for clarity
  }

  function checkForAlteredTextNodes(mutations) {
    // code omitted for clarity
  }

  function updateWordsMap(error, result) {
    // code omitted for clarity
  }
})()

Checking lang in jumpLeft

The technique that you will use for selecting the adjacent word will be different when the language is Thai. The code listing below shows how jumpLeft can be split into two internal functions: jumpLeftWithRegex, which behaves use almost exactly the same code as the original jumpLeft function, and jumpLeftWithMap which will use the data in the wordsMap object. A switch statement uses the lang attribute applied to the container node to determine which approach should be taken.

Note the minor change in the jumpLeftWithRegex function. If the current word is the first in the current container, word to select will be in the previous textNode. It's possible that the previous textNode is in a different language, and the jumpLeftWithRegex function might not be appropriate for that language.

The solution is to:

  • Create a new range object (so that the selection is not changed if there is nothing earlier to select)
  • Place the range.endOffset at the end of the previous textNode
  • Call the parent jumpLeft function recursively.

The endOffset cannot be placed earlier than the startOffset, so startOffset is now set to exactly the same point. You'll see the importance of this in a moment. This technique ensures that the function used for selecting the last word in the previous textNode is appropriate for the language of that node.

js/selection.js

...

  function jumpLeft() {
    container = range.endContainer
    var lang = getLang(container)

    switch (lang) {
      case "th":
      case "enx":
        return jumpLeftByMap(container)
      default:
        return jumpLeftByRegex(container)
    }

    function jumpLeftByMap(container) {
      // TODO
    }

    function jumpLeftByRegex(container) {
      var string = container.textContent
      var result = getPreviousWord(string, range.startOffset)
      var startOffset
        , endOffset
        , rangeData

      if (!result) {
        // There are no more words in this text node. Try the previous.
        container = getAdjacentTextNode(
          container
        , "previousSibling"
        , "pop"
        )

        if (container) {
          range = document.createRange()
          range.setEnd(container, container.textContent.length)
          return jumpLeft()

        } else {
          // We're at the very beginning of the selectable text. 
          // There's nothing earlier to select.
          return
        }
      }

      startOffset = result.index
      endOffset = startOffset + result[0].length

      rangeData = {
        container: container
      , startOffset: startOffset
      , endOffset: endOffset
      , string: string
      }

      return rangeData

      function getPreviousWord(string, offset) {
        string = string.substring(0, offset)
        var result
          , temp

        while (temp = lastWordRegex.exec(string)) {
          result = temp
        }

        return result
      }
    }
  }

...

jumpLeftByMap

The jumpLeftByMap function first checks that there is an entry in wordsMap for the textContent of the current node. If the server has not yet called back with the result of the getWordSegmentation call, then the only solution is to cancel the action.

In a production version of this feature, it would be good to show a tooltip or a spinner to justify why nothing is happening.

If the server has already responded, then endOffset is set to range.endOffset + (range.startOffset === string.length) and the function setOffsets is called to set the values of startOffset and endOffset as required to select the previous word.

In most cases, range.startOffset === string.length will be false so endOffset will be set to range.endOffset. When jumpLeft is called recursively because the current selection is the first word in the next textNode, then the instruction range.setEnd(container, container.textContent.length) described above will also set the startOffset to the end of the previous container. In this case range.startOffset === string.length will be true and endOffset will be set to range.endOffset + 1. This ensures that the setOffsets function will correctly choose the start and end offsets for the last word in the current container.

js/selection.js

...

    function jumpLeftByMap(container) {
      var string = container.textContent
      var wordMap = wordsMap[lang][string]
      if (typeof wordMap !== "object")) {
        return console.log("No word map for ", string)
      }
      var wordEnds = wordMap.ends
      var wordStarts = wordMap.starts
      var endOffset = range.endOffset
                    + (range.startOffset === string.length)
      var startOffset
      var rangeData
      var result = setOffsets(endOffset)

      if (!result) { 
        // TODO
      }

      rangeData = {
        container: container
      , startOffset: startOffset
      , endOffset: endOffset
      , string: string
      }

      return rangeData

      function setOffsets(offset) {
        var ii = wordEnds.length

        while (ii--) {
          if ((endOffset = wordEnds[ii]) < offset) {
            startOffset = wordStarts[ii]
            ii = 0
          }
        }

        return (startOffset !== undefined)
      }
    }

...

setOffsets

The wordEnds variable is an array of index numbers of the next character after the end of each word. For the sentence "lookforaneedleinthesea" wordEnds will be [4,7,8,14,16,19,22]. The setOffsets function receives an offset which may be the current range.endOffset (if a word in this textNode is already selected), or the-last-item-in-wordEnds-plus-1 (if the first word in the next textNode is selected). It works backwards from the end of the wordEnds, looking for the first number that is smaller. If it finds such a number, it sets startOffsetto the corresponding entry in wordStarts. However, if the first word in the current textNode is selected, then it won't find a smaller number, and startOffset will remain undefined. When this happens, setOffsets will return false. This will trigger a recursive jump to the previous textNode, as you'll see below.

js/selection.js

...

    function jumpLeftByMap(container) {
      var string = container.textContent
      var wordMap = wordsMap[lang][string]
      if (!wordMap) {
        return console.log("No word map for ", string)
      }
      var wordEnds = wordMap.ends
      var wordStarts = wordMap.starts
      var endOffset = range.endOffset
                    + (range.startOffset === string.length)
      var startOffset
      var rangeData
      var result = setOffsets(endOffset)

      if (!result) {     
        container = getAdjacentTextNode(
          container
        , "previousSibling"
        , "pop"
        )

        if (container) {
          range = document.createRange()
          range.setEnd(container, container.textContent.length)
          return jumpLeft()

        } else {
          // We're at the very beginning of the selectable text.
          // There's nothing earlier to select.
          return
        }
      }

      rangeData = {
        container: container
      , startOffset: startOffset
      , endOffset: endOffset
      , string: string
      }

      return rangeData

      function setOffsets(offset) {
        // code omitted for clarity
      }
    }
...

The technique for jumping back to the previous textNode is identical to the way it is done in jumpLeftByRegex

Jumping to the right

Jumping to the right works in a very similar way.

js/selection.js

...

  function jumpRight() {
    container = range.startContainer
    var lang = getLang(container)

    switch (lang) {
      case "th":
      case "enx":
        return jumpRightByMap(container)
      default:
        return jumpRightByRegex(container)
    }

    function jumpRightByMap(container) {
      var string = container.textContent
      var wordMap = wordsMap[lang][string]
      if (typeof wordMap !== "object") {
        return console.log("No word map for ", string)
      }
      var wordEnds = wordMap.ends
      var wordStarts = wordMap.starts
      var startOffset = range.startOffset - (!range.endOffset)
      var endOffset
      var rangeData
      var result = setOffsets(startOffset)

      if (!result) {     
        container = getAdjacentTextNode(
          container
        , "nextSibling"
        , "shift"
        )

        if (container) {
          range = document.createRange()
          range.setStart(container, 0)
          return jumpRight()

        } else {
          // We're at the very beginning of the selectable text.
          // There's nothing earlier to select.
          return
        }
      }

      rangeData = {
        container: container
      , startOffset: startOffset
      , endOffset: endOffset
      , string: string
      }

      return rangeData

      function setOffsets(offset) {
        var ii
        var length = wordStarts.length

        for (ii = 0; ii < length; ii += 1) {
          if ((startOffset = wordStarts[ii]) > offset) {
            endOffset = wordEnds[ii]
            ii = length
          }
        }

        return (endOffset !== undefined)
      }
    }

    function jumpRightByRegex(container) {
      var startOffset = range.endOffset
      var string = container.textContent
      var result = nextWordRegex.exec(string.substring(startOffset))
      var endOffset
        , rangeData

      if (result) {
        startOffset += result[0].length

      } else {
        // There are no more words in this text node. Try the next.
        container = getAdjacentTextNode(
          container
        , "nextSibling"
        , "shift"
        )

        if (container) {
          range = document.createRange()
          range.setStart(container, 0)
          return jumpRight()

        } else {
          // We're at the very end of the selectable text. There's
          // nothing more to select.
          return
        }
      }

      result = wordEndRegex.exec(string.substring(startOffset))
      endOffset = startOffset + result.index

      rangeData = {
        startOffset: startOffset
      , endOffset: endOffset
      , string: string
      }

      return rangeData
    }
  }

You can now double-click on a word to select it and use the left and right arrow keys to move the selection to the adjacent word.

Querying an online dictionary

In this section, you'll learn how to:
  • Query the dico object asynchronously the translation of the current selection
  • Display the translation in your web page
  • Cache translations in order to display them faster the second and subsequent times they are requested
Download the source files Test Here

Simplifying index.hmtl

In this final section, you can simplify your index.html file by removing all the text in EnglishNoSpaces. To limit the action of the word selection feature, you can create a box around the Thai text. You can also add a <div> element to contain the translation, when it becomes available.

index.html

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Selecting Thai words</title>
  <link rel="stylesheet" type="text/css" href="css/style.css">
</head>

<body>
  <h1>Word Segmentation Test</h1>
  <p>Double-click to select a Thai word to see its pronounciation
  and translation. Use the left and right arrow keys to select
  neighbouring words.</p>

  <div class="box" lang="th">
    <p>งมเข็มในมหาสมุทร</p>
    <p>พูดไปสองไพเบี้ย นิ่งเสียตำลึงทอง</p>
    <p>ตากลม ตา​กลม</p>
  </div>

  <div id="translation"><div>

  <script src="js/dictionaries.js"></script>
  <script src="js/async.js"></script>
  <script src="js/selection.js"></script>
</body>
</html>

You can add some cosmetic CSS, to make the page more elegant.

css/style.css

body {
  width: 22em;
}

.box {
background-color: #eef;
border: 1px solid #ccf;
border-style:inset;
}
.box p {
  margin: 0.5em;
}

dt {
  font-weight: bold;
}

Adding a getWordData method to the dictionary

The th entry in your dictionary is a map object where the keys are Thai words and the values are objects with the format:

{ pronunctiation: <string>
  , translation: <string>
  }

You can add a getWordData method to the dico object that will return an object with the format:

{ pronunctiation: <string>
  , translation: <string>
  , word: <string>
  , lang: <string>
  }

However, if a request is made for a word that doesn't exist, or for a language for which there is no dictionary, it makes sense to return an object with the format ...

{ 
  , ⚠: "no data available"
  , word: <string>
  , lang: <string>
  }

... where ⚠ is encoded in JavaScript as "\u26A0".

A fast way to make a deep clone of a JavaScript object is to use clone = JSON.parse(JSON.stringify(original))

js/dictionary.js

"use strict"

var dico
var segment

;(function addDictionaries() {
  dico = {

    dictionaries: {
      // code omitted for clarity
    }

  , tries: {}

  , initialize: function createTries(){
      // code omitted for clarity
    }

  , createTrie: function createTrie(languageMap) {
      // code omitted for clarity
    }

  , splitIntoWords: function splitIntoWords(string, languageCode) {
      // code omitted for clarity
    }

  , getWordMap: function getWordMap(string, languageCode) {
      // code omitted for clarity
    }

  , getWordData: function getWordData(word, languageCode) {
      var dictionary = this.dictionaries[languageCode]
      var data

      if (dictionary) {
        data = dictionary[word]
        if (data) {
          data = JSON.parse(JSON.stringify(data))
        } else {
          data = { "\u26A0": "no data available" }
        }
        data.word = word
        data.lang = languageCode
      }

      return data
    }
  }.initialize()

  segment = {
    // code omitted for clarity
  }
})()

Defining an asynchronous method for calling dico.getWordData

To call dico.getWordData asynchronously, you can add the following getTranslation method to your ASYNC object:

js/async.js

"use strict"

var ASYNC

;(function async(){
  ASYNC = {
    // code omitted for clarity
    }

  , methods: {
      getWordSegmentation: function getWordSegmentation(map) {
        // code omitted for clarity
      }

    , getTranslation: function getTranslation(word, languageCode) {
        try {
          return dico.getWordData(word, languageCode)
        } catch (error) {
          return error
        }
      }
    }
  }
})()

Updating selection.js

With all these changes in place, you can now add a requestTranslation function to your selection.js script. You can add calls to this from 3 places:

  • Any mouseup event on the div.box element, in case the user clicks and drags to make a custom selection
  • Any dblclick event on the div.box element, after a complete word (which may contain hyphens) is selected
  • Any time the user presses the left or right arrow key to select a new word

As you will see in the code listing below, the requestTranslation function can work both asynchronously and synchronously. A wordsCache object is created when the code is loaded. The requestTranslation function first checks if this wordsCache object already contains an entry for the selected word and language. If not, it makes a call to ASYNC.call("getTranslation", ...), using the nested showTranslation function as a callback. If the wordsCache object already contains an entry for the word, the showTranslation function is called immediately, with the alreadyCached argument set to true.

If the translation is retrieved via the asynchronous call to ASYNC.call("getTranslation", ...), the alreadyCached argument will be undefined, and the incoming data will be added to wordsCach, so that it can be retrieved locally the next time. If the selection has changed in the meantime, the operation ends there. If the selected is string is identical to the word property of the incoming data object, then a header and a description list element (<dl>) are created to show information about the word, and displayed in the div#translation element.

js/selection.js

;(function selection(){
  // code omitted for clarity

  var box = document.querySelector(".box")
  var wordsCache = {}
  var wordsMap = { th: {}, enx: {} }
  var observer = new MutationObserver(checkForAlteredTextNodes)

  observer.observe(document.body, { 
    childList: true
  , attributes: true
  , subtree: true
  })

  box.onmouseup = requestTranslation
  box.ondblclick = selectHyphenatedWords
  document.body.onkeydown = jumpToNextWord

  function selectHyphenatedWords(event) {
    // code omitted for clarity

    if (selectionUpdated) {
      selection.removeAllRanges()
      selection.addRange(range)
    }

    scrollIntoView(range)
    requestTranslation()
  }

  // code omitted for clarity

  function jumpToNextWord (event) {
    // code omitted for clarity

    selection.removeAllRanges()
    selection.addRange(range)
    scrollIntoView(range)
    requestTranslation()
  }
  
  // code omitted for clarity

  function requestTranslation() {
    var word = selection.toString()
    var lang = getLang(selection.getRangeAt(0).startContainer)
    var data

    // Ignore word if it is all whitespace
    word = word.match(lastWordRegex)
    if (word) {
      word = word[0]
    } else {
      return
    }

    if ((data = wordsCache[lang]) && (data = data[word])) {
      return showTranslation(null, data, true)
    }
    
    ASYNC.call("getTranslation", word, lang, showTranslation)

    function showTranslation (error, data, alreadyCached) {
      if (error) {
        return console.log(error)
      }

      var pTranslation = document.getElementById("translation")
      var string
        , word
        , lang

      if (data) {
        word = data.word
        lang = data.lang
        if (!alreadyCached) {
          addToWordsCache(lang, word, data)
        }

        if (word !== selection.toString()) {
          return
        }

        string = "<h3>" + word + "</h3><dl>"

        for (var key in data) {
          if (key !== "word" && key !== "lang") {
            string += "<dt>" + key + "</dt>"
            string += "<dd>" + data[key] + "</dd>"
          }
        }

        string += "</dl>"
      }

      pTranslation.innerHTML = string
    }

    function addToWordsCache(lang, word, data) {
      var langMap = wordsCache[lang]
      if (!langMap) {
        langMap = {}
        wordsCache[lang] = langMap
      }

      langMap[word] = data
    }
  }
})()

Testing

After you save your changes and relaunch your web page, you can test the translation feature. You can try both double-clicking on a Thai word or clicking and dragging to make a custom selection.

See what happens when you:
  • Select the individual syllables "มหา" or "สมุทร" in the word "มหาสมุทร"
  • Select all the characters between two spaces, so that more than one word is selected
  • Use the right arrow key to jump to the next word. You can start by moving slowly from word to word, waiting until the asynchronous call returns, and then use the left arrow key to jump back to words whose translation has already been cached. You'll see that the response is much faster.

In this section, you've completed a simulation of calling a server in order to get both word boundary data and word definitions, using a number of techniques that you discovered earlier in this tutorial.

Conclusion

If you have made the effort to get this far, well done!

Here's what you've learnt:
  • Finding word boundaries in certain languages requires a call to a dedicated application on a server
  • You can detect when the text on a page changes, and call the server after each change to obtain the word boundary data for the new text
  • You can cache the results of calls to the server to speed up future operations
  • You can use the word boundary data to determine the next and previous words, and use the arrow keys to jump between them.
  • You can query a database for information about the words

Where to go from here

The Language Reference tutorial builds on the techniques that you have learned here, to connect to a Wiktionary page when you select a word in the box at the top of the NoteBook popup window.

Overview

This tutorial is part of a series that describes how to build a Chrome extension that helps you to learn a foreign language. With the extension, any text you select in any web page will be copied to a pop-up window. If you select a word in the pop-up window, the appropriate Wiktionary page for that word will be shown in an iFrame. An earlier tutoral shows you how to get this to work for European languages, which use spaces as word boundaries. This tutorial shows you how to extend it to Thai (and by extension to other Asian languages), where no spaces are used between words.

In a browser, you expect a double-click to select a whole word, regardless of which language that word is written in. A browser like Chrome, the word boundaries for many languages are automatically detected. Other browsers may fail to indentify word boundaries in non-European languages and select either syllables or whole phrases instead. If you want to select a whole word in such browsers, or without using a double-click, you need to know how to analyze text to find word boundaries.

The aim of this tutorial is to develop a proof-of-concept system which will allow you to use the arrow keys to jump to neighbouring Thai words. As you will see, distinguishing boundaries between Thai words is best performed asynchronously on the server. This tutorial will show you how to create a minimal word boundary detection system that will run in your browser, and to make asynchronous calls to it, to simulate a connection to the server.

The final step is to simulate a connection to a server to retrieve the dictionary definition of the selected word.

Word boundaries

In spoken language, words tend to flow together. In a sentence like "We went on a long journey", you have to wait until the word "journey" is spoken before you can be sure that "a long" is two words. Conversely, if the sentence was "We went on along the road", you would need to wait until the word "the" to know that the same sounds were the single word "along".

In many written languages, there are visual signs to indicate where words end. In European languages descended from Latin and Greek, a space is used between words, but there is still some ambiguity about whether "can't" is one word or two. In Arabic, letters have different shapes depending on whether they appear at the beginning, in the middle or the end of a word, as an additional clue.

However, in other languages, such as Chinese and Japanese, there are spaces between sentences, but no spaces between words. In Thai and Lao, there may also be spaces between clauses, so the space character itself is ambiguous. In Vietnamese, there is a space between each syllable, while several syllables may combine to make a single word.

This tutorial explores two simple techniques for segmenting written Thai into words. However, there is no generic automated solution for this, for several reasons, including:

  • Ambiguous word boundaries
  • Unknown words

Ambiguous word boundaries

Word boundaries found in identical character sequences can depend on the meaning, which is inaccessible to anyone (including a computer) who does not speak the language. A classic example in Thai is ตากลม, which can be ตา|กลม (round eyes) or ตาก|ลม (exposed to the air). You see a similar kind of ambiguity in the following English sentences:

  • Those who died are devils
  • The do or die daredevils
  • The door did not open

If this were written with no spaces, a machine or non-English speaker would have no way of knowing where to place the word boundaries. There are multiple ways of dividing the letters to yield real English words.

Unknown words

A dictionary-based approach fails as soon as it encounters an unknown word. At best, it can skip over chunks that do not correspond to any entry in the dictionary, but this can be a costly and error-prone process. If it had no spaces, and the name "Houdini" was not in the dictionary, a sentence like ...

A noted wit, Houdini told many jokes.

... could be segmented as ...

a noted with ou din it old many jokes

... or ...

a noted wi thou din it old many jokes

... with one unknown word leading to a longer garbled passage, and a misleading chunk being labelled "unknown".

This tutorial will not address either of these issues, which are still the subject of much research.

In this tutorial, you will be learning to:
  • Create a trie from a dictionary
  • Use the trie to detect word boundaries in EnglishNoSpaces
  • Apply the trie technique to Thai text
  • Use a "rule-based" system to identify syllable boundaries in Thai text
  • Run an asynchronous method (to simulate an AJAX request to a server) to obtain the word segmentation of a chunk of text
  • Use the callback result to select the word indicated by a click on a press on an arrow key.

Note: the selection.js code used in this tutorial is describe in more detail here.