Defining a dictionary
- Create an HTML file containing some Thai text and an English translation withnospacesbetweenwords
- Create a dictionary object for all the words in the Thai text and all the words that can be found in the English string.
Mimicking Thai text
During this tutorial, you'll be working with some text in Thai. To make it easier to understand what your code is doing, it might be easier to use a translation of the Thai text in English, from which all the spaces have been removed. You can then perform exactly the same actions on the English textwithnospaces, even if you understand no Thai.
You can start by creating a file named index.html
with the following HTML code:
index.html
<!DOCTYPE html> <html> <head> <meta charset="utf-8"> <title>Selecting Thai words</title> </head> <body> <p lang="th">งมเข็มในมหาสมุทร พูดไปสองไพเบี้ย นิ่งเสียตำลึงทอง ตากลม ตา​กลม</p> <p lang="enx">lookforaneedleinthesea talkisworthtuppence silenceisworthgold exposedtotheair roundeye</p> <script src="js/dictionaries.js"></script> </body> </html>
For now, the text in the HTML is simply to give you an idea of how Thai text looks. Later, in the section on using the arrow keys to jump to the neighbouring words, you'll actually be interacting with Thai text.
ISO language codes
The International Organization for Standardization (ISO) has published a list of two- and three-letter language codes. You can use the lang
attribute of an HTML element to indicate which language the text of the element is in. The standard codes for English and Thai en
and th
.
Normally, you would add this attribute to the <html>
element ...
<!DOCTYPE html> <html lang="en"> <head> ... <head> <body> ... </body> </html>
... to indicate the language of the entire page. But this page is in two languages, so it makes sense to give the attribute to the individual <p>
tags.
Since the "English" used here to mimic the structure of Thai text is non-standard English, I have used a non-standard code: enx
.
Zero-width space
The character encoded as ​
is a "zero-width space". It is used here to force a word-break in the string ตากลม so that it will be treated as ตา กลม (eye round) and not ตาก ลม (exposed-to air).
Creating a dictionary
The JavaScript file below creates a global object called dico
, with two dictionary maps: one for the non-standard enx
language code, one for th
, the Thai language.
If you are reading this in English, the chances are that you already know all the English words, so a pronunciation guide and translation is provided only for the Thai words.
Notice that the enx
dictionary contains all the words that can be found in the enx
strings, even those that are not used to create meaning from the strings. For testing purposes, the enx
dictionary object contains the invented word "forane", which can be found in the chunk "foraneedle"
js/dictionaries.js
"use strict" var dico = {} ;(function addDictionaries() { dico = { dictionaries: { enx: { " ": 0 , "a": 0 , "air": 0 , "exposed": 0 , "eye": 0 , "for": 0 , "forane": 0 // nonsense word for testing , "gold": 0 , "he": 0 , "in": 0 , "is": 0 , "look": 0 , "need": 0 , "needle": 0 , "old": 0 , "or": 0 , "pence": 0 , "round": 0 , "sea": 0 , "silence": 0 , "talk": 0 , "the": 0 , "these": 0 , "to": 0 , "tot": 0 , "tup": 0 , "tuppence": 0 , "up": 0 , "worth": 0 } , th: { " ": {} , "": {} // ​ = zero-width space , "กลม": { "pronunciation": "glohm-" , "translation": "round; circular" } , "งม" : { "pronunciation": "ngohm-" , "translation": "to grope; search; seek; fumble for" } , "ตา": { "pronunctiation": "dtaa-" , "translation": "eye; maternal grandfather" } , "ตาก": { "pronunctiation": "dtaak_" , "translation": "[is] exposed (e.g., to the sun)" } , "ตำ" : { "pronunciation": "dtam-" , "translation": "beat; pound an object; pulverize; to pierce; puncture; prick" } , "ตำลึง" : { "pronunciation": "dtam- leung-" , "translation": "ancient Thai monetary unit" } , "ทอง" : { "pronunciation": "thaawng-" , "translation": " gold" } , "นิ่ง" : { "pronunciation": "ning`" , "translation": "[is] still; immobile; silent; motionless; quiet" } , "พูด" : { "pronunciation": "phuut`" , "translation": " to speak; to talk; to say" } , "มหา" : { "pronunciation": "ma' haa´" , "translation": "great; omnipotent; large; many; much; maximal; paramount; exalted" } , "มหาสมุทร" : { "pronunciation": "ma' haa´ sa_ moot_" , "translation": "ocean" } ,"ลม": { "pronunciation": "lohm-" , "translation": "air; wind; storm" } , "สมุทร" : { "pronunciation": "sa_ moot_" , "translation": "ocean; sea" } , "สอง" : { "pronunciation": "saawng´" , "translation": "two" } , "เข็ม" : { "pronunciation": "khem´" , "translation": "clasp; brooch; safety pin; needle; sewing pin" } , "เบี้ย" : { "pronunciation": "biia`" , "translation": "a cowrie shell [formerly used as] money" } , "เสีย" : { "pronunciation": "siia´" , "translation": "to spend; use up; lose; give up; sacrifice; pay" } , "ใน" : { "pronunciation": "nai-" , "translation": "in; inside; within; amidst; into; on; at a particular time" } , "ไป" : { "pronunciation": "bpai-" , "translation": "to go; <subject> goes" } , "ไพ" : { "pronunciation": "phai-" , "translation": "a certain old coin equal in value to 1/32 baht" } } } } })()
- An HTML page for testing
- Two JavaScript dictionary objects containing a limited number of words in English and Thai.
In the next section, you'll use these dictionaries to create a trie object for each language.
Creating a trie
- Create a trie object
The word "trie" comes from the French word for "to choose, classify, filter, pick out, separate, sort". It also sounds like the English word "tree". In computing, it refers to a data structure than starts at a common root and divides multiple times to end in a unique leaf.
You can add the code below to your dictionaries.js
file.
js/dictionaries.js
"use strict" var dico = {} ;(function addDictionaries() { dico = { dictionaries: { enx: { // code omitted for clarity } , th: { // code omitted for clarity } } , tries: {} , initialize: function createTries(){ for (var languageCode in this.dictionaries) { this.tries[languageCode] = this.createTrie(this.dictionaries[languageCode]) } return this } , createTrie: function createTrie(languageMap) { var trie = {} var end = "$" var word , ii , length , last , chars , char , path , place for (word in languageMap) { chars = word.split("") path = trie for (ii = 0, length = word.length; ii < length; ii += 1) { char = chars[ii] place = path[char] if (!place) { place = {} path[char] = place } path = place } path[end] = true } return trie } }.initialize() })() console.log(JSON.stringify(dico.tries.enx))
The createTrie
method creates an empty trie
object, and then takes each word key in a given dictionary and splits it into its individual characters. It then checks if there is already a word which begins with the first letter. If not, it creates a new entry for that letter in the trie
object; if so, it checks whether the existing sub-object already contains a sub-object for the next letter. In this manner, it works its way along until it reaches the end of the word, when it adds an entry for "$": true
to the current branch, as a final leaf.
Console output
If you save your changes, relaunch your page and check the Developer Console, you should see output that looks something like this (but a bit less prettified).
{" ":{"$":true} ,"a":{"$":true ,"i":{"r":{"$":true}}} ,"e":{"x":{"p":{"o":{"s":{"e":{"d":{"$":true}}}}}} ,"y":{"e":{"$":true}}} ,"f":{"o":{"r":{"$":true ,"a":{"n":{"e":{"$":true}}}}}} ,"g":{"o":{"l":{"d":{"$":true}}}} ,"h":{"e":{"$":true}} ,"i":{"n":{"$":true} ,"s":{"$":true}} ,"l":{"o":{"o":{"k":{"$":true}}}} ,"n":{"e":{"e":{"d":{"$":true ,"l":{"e":{"$":true}}}}}} ,"o":{"l":{"d":{"$":true}} ,"r":{"$":true}} ,"p":{"e":{"n":{"c":{"e":{"$":true}}}}} ,"r":{"o":{"u":{"n":{"d":{"$":true}}}}} ,"s":{"e":{"a":{"$":true}} ,"i":{"l":{"e":{"n":{"c":{"e":{"$":true}}}}}}} ,"t":{"a":{"l":{"k":{"$":true}}} ,"h":{"e":{"$":true ,"s":{"e":{"$":true}}}} ,"o":{"$":true,"t":{"$":true}} ,"u":{"p":{"$":true ,"p":{"e":{"n":{"c":{"e":{"$":true}}}}}}}} ,"u":{"p":{"$":true}} ,"w":{"o":{"r":{"t":{"h":{"$":true}}}}}}
Note how the letters of the word "these" (shown in red) have been added to the trie
object.
In the next section, you'll use your trie object to detect the word boundaries in a given string.
Detecting words with a trie
- Use a trie object to analyze a string of text and split it into known words.
Looking for known words
Now that you have a trie object, you can use it identify words in a string. The splitIntoWords
method in the code listing below will step through the string one character at a time, matching it to a path through the trie. If it fails to find a word at any point, it will backtrack and start from the last point where it found a whole word.
js/.dictionaries.js
"use strict" var dico = {} ;(function addDictionaries() { dico = { dictionaries: { // code omitted for clarity } , tries: {} , initialize: function createTries(){ // code omitted for clarity } , createTrie: function createTrie(languageMap) { // code omitted for clarity } , splitIntoWords: function splitIntoWords(string, languageCode) { var trie = this.tries[languageCode] var alternatives = [] var path = trie var words = [] var word = "" var found = false var char , next , alternative words.index = 0 for (var ii = 0, total = string.length; ii < total ; ii += 1) { char = string[ii] next = path[char] if (next) { // { "$": 0, "a": { ... }. ...} word += char if (next.$) { if (found) { // We've already added a shorter word to the list. // Push the list with the shorter word onto the // alternatives array and continue with the longer word alternative = words.slice(0) alternative.index = words.index alternatives.push(alternative) words.index -= words.pop().length // remove shorter word } // Add this whole word to the list words.push(word) words.index += word.length found = true } path = next } else if (found) { // !next, but we have a complete word already // This character could be the start of a new word word = char path = trie[char] if (!path) { // It's imposible to start a word with this letter backtrack("Initial letter absent: " + char) } else if (path.$) { // This may be a one-letter word: found remains true words.push(word) words.index += word.length next = path } else { found = false } } else { // !next && !found: stop trying to complete this word backtrack("Word fragment: " + word) } } if (!next || !next.$) { backtrack("Fail – Unknown word: ", word) } return words function backtrack(reason) { console.log(reason) // Try the longest earlier alternative ... words = alternatives.pop() // ... if one exists ... if (!words) { console.log("Failed to segment") ii = total return } // ... and see if all the remaining characters cand be // segmented into words ii = words.index - 1 path = trie word = "" found = false } } }.initialize() })() console.log(dico.splitIntoWords("lookforaneedleinthesea", "enx")) console.log(dico.splitIntoWords( "งมเข็มในมหาสมุทร " + "พูดไปสองไพเบี้ย นิ่งเสียตำลึงทอง " + "ตากลม ตากลม" , "th"))
The output in the Developer Console should look something like this:
["look", "for", "a", "needle", "in", "these", "a", index: 22] ["งม", "เข็ม", "ใน", "มหาสมุทร", " ", "พูด", "ไป", "สอง", "ไพ", "เบี้ย", " ", "นิ่ง", "เสีย", "ตำลึง", "ทอง", " ", "ตาก", "ลม", " ", "ตา", "", "กลม", index: 62]
You'll notice that the last two words in English are "these", "a"
and not "the", "sea"
. The last chunks of Thai has been split differently from the previous chunk, because of the zero-width space ​
.
splitIntoWords
The splitIntoWords
method accepts a string of text and a language code. It uses the language code to select the appropriate language trie. It takes the first character in the string and looks for it at the top level of the trie, then takes the second letter, and looks for it in the branch of the trie defined by the previous letter, adding it to its word
variable as it goes. It will continue to do this until one of three things happens:
- It meets an object with a
"$": true
key/value pair - It finds an object which contains no key for the letter in the string that it is currently looking at
- It reaches the end of the text
Adding words
The algorithm will step through the string and tunnel into the trie until it has created the string "look". At this point, it will encounter a {"$": true}
object. This tells it to add the word "look" to its current words
list, and to set a flag to say that at least one word has been found. Since an array is special kind of object, it can add a property to the array: it adds an index
key with a value equal to the length of the string read so far. You'll see in a moment how this is used.
Preparing alternatives
The splitIntoWords
algorithm is greedy: it will keep looking for the longest word that fits, and this may lead it into a blind alley. To test this, you'll find the nonsense word "forane" in the dictionary. After finding "look", the algorithm keeps moving forward. When it has read "for", it will find an object with both a "$": true
key pair and the key "a", with a submap that it can continue to follow.
It adds the word "for" to the words
array, and updates the index
of the array: index
is now the sum of the lengths of the word in the array.
The process continues until the {"$": true}
object, indicating the end of the nonsense word "forane", is found. This time, however, there is already a word that uses the letters "for" in the words
array. The splitIntoWords
function first uses words.slice(0)
to copy the array, and also copies the value of index
, to create a full clone. This clone is added to the alternatives
array. In the original words
array, the last word ("for") is removed and replaced with "forane", and the index
value is updated.
A new word now begins, starting with the letter "e", but at the next iteration, no sub-map of trie
is found that corresponds to the word fragment "ed". The splitIntoWords
function has reached the end of a blind alley: it calls the backtrack
function. This drops the current words
array and replaces it with the longest array in alternatives
: ["look", "for"]
, with its index
value of 7. It resets the ii
iterator so that the next iteration starts building a new word from position ii = 7
: the index of the next letter after "lookfor". This time, it finds the words "a" and "need", which is soon replaced by "needle", and continues to read "these", and "a". This is a complete word and the final letter in the string, so the process stops there.
Issues with a dictionary approach
This approach suffers from two major problems:
- It is possible to split the letters in the wrong place and still create a set of valid (but meaningless) words
- If a word is not in the dictionary, it will fail
Ambiguous word boundaries
The algorithm has no "intelligence" built in, so it has no reason to reject "these", "a" and replace it with "the", "sea". Its greediness has led it into error. Refusing to be greedy could lead it to make different errors.
People who speak English know almost instantly where to split what they hear into words, based on meaning and statistics. Until you read this paragraph, it's unlikely that you have ever encountered a meaningful sentence that ends with the words "in these a".
As far as we know, computers are totally unaware of meaning, but a quick search on Google will show you that, statistically, "in the sea" is over 300 times more likely to appear in written English than "in these a". It should therefore be possible to improve the accuracy of the splitIntoWords
algorithm by checking the statistical likelihood of each subset of words appearing together. However, this will require access to a vast database: not just a dictionary of words but a store of phrases indexed by the words they contain.
It is not feasible to load this database in your browser. It uses much less bandwidth to send the text for analysis to the server, where it can be treated by all the CPU power necessary, and then to send the results back.
Unknown words
In this simple implementation, the absence of a single word from the dictionary will cause the algorithm to fail. A more robust version would necessarily be slower, and, without a database to provide statistical input, may still produce erroneous results.
In the next section, you'll see a different approach based on the spelling rules of the Thai language.
A rule-based system for detecting syllables
- Characters that only appear at the beginning of a Thai syllable
- Characters that appear only at the end of a Thai syllable
- Thai numbers
- Boundaries between Thai characters and non-Thai characters
- Common Thai words that cannot be created by chance across a word boundary
You'll create a series of regular expressions to detect any of these boundaries, and use them to determine places where there must be a syllable boundary.
Download the source files Test HereThe approach described in this section is inspired by the ThaiWrap bookmarklet / ตัวตัดบรรทัดข้อความไทย written by Arthit Suriyawongkul / อาทิตย์ สุริยะวงศ์กุล
Initial and final letters in Thai
In English, you are used to seeing vowel sounds that are with multiple vowel characters. For example: "fair" and "fire". In the word "fire", the final "e" appears after the last consonant, but it is not pronounced. Instead, its existence changes the sound of the preceding "i".
In Thai, certain vowels are written before the first consonant, but are pronounced after the consonant. The Thai tone-twister ไม้ ใหม่ ไม่ ไหม้ ไหม (máai mài mâi mâi mái, green wood doesn't burn, does it?) uses two of these vowels: ไ and ใ. The character for "m" is ม, which is written after the vowel. You can therefore be sure that wherever you find a ไ or ใ character, it is the first character in a syllable.
The five vowels that work this way are เ แ โ ใ ไ. Fortunately, they have been placed together in the Thai Unicode block, so you can use the regular expression /[เ-ไ]/
for them.
There are also a certain number of characters that only ever appear at the end of a syllable. These are more scattered so you need to use each individual character in the regular expression that identifies them all: /[ฯะำฺๅๅๆ๎]/
. If you encounter any of these characters in Thai text, you can be 100% sure that the next character is the start of a new syllable.
Numbers
The Thai numbers from 0 to 9 are written as:
๐ ๑ ๒ ๓ ๔ ๕ ๖ ๗ ๘ ๙
Anywhere where a number is followed by a non-number, or a non-number is followed by a number, must be the end of a word. These numbers can be expressed in the regular expression: /[๐-๙]/
. To define all the Thai characters that are not numbers, you can use /[ก-ฺเ-๎]/
Non-Thai characters
Thai people often use Western numerals (0 - 9), and Western names may be written in Roman characters. You can be sure that any transition from Thai to non-Thai will occur at a word boundary.
Common words
Out of the top 100 words in Thai, there are 30 which form unambiguous syllables. Out of the top 2500, there are 51 such words, as shown in the following regular expression:
/(เป็น|ใน|จะ|ไม่|และ|ได้|ให้|ความ|แล้ว|กับ|อยู่|หรือ|กัน|จาก|เขา|ต้อง|ด้วย|นั้น|ผู้|ซึ่ง|โดย|ใช้|ยัง|เข้า|ถึง|เพราะ|จึง|ไว้|ทั้ง|ถ้า|ส่วน|อื่น|สามารถ|ใหม่|ใช่|ใด|ช่วย|ใหญ่|เล็ก|ใส่|เท่า|ใกล้|ทั่ว|ฉบับ|ใต้|เร็ว|ไกล|เช้า|ซ้ำ|เนื่อง|ค้น)/
If you encounter any of these words, you can be sure that the preceding character is the end of the previous syllable and the next character is the beginning of the following syllable.
Rule-based segmentation
You can add the segment
object, as shown below, to your dictionaries.js
script. When you call segment.th(string)
, it will search through the string, applying each regular expression in turn, looking for all these identifiable syllable boundaries. It adds the position of these boundaries to an array. Some boundaries will be found multiple times, and these are filtered out. In the last step, it takes each boundary point, starting from the end, and replaces it with the segment of the string that starts from that point.
js/dictionaries.js
"use strict" var dico = {} var segment ;(function addDictionaries() { dico = { // code omitted for clarity }.initialize() segment = { th: function thai(string) { // unambiguous words that are common, like prepositions var cw = "(เป็น|ใน|จะ|ไม่|และ|ได้|ให้|ความ|แล้ว|กับ|อยู่|หรือ|กัน|จาก|เขา|ต้อง|ด้วย|นั้น|ผู้|ซึ่ง|โดย|ใช้|ยัง|เข้า|ถึง|เพราะ|จึง|ไว้|ทั้ง|ถ้า|ส่วน|อื่น|สามารถ|ใหม่|ใช่|ใด|ช่วย|ใหญ่|เล็ก|ใส่|เท่า|ใกล้|ทั่ว|ฉบับ|ใต้|เร็ว|ไกล|เช้า|ซ้ำ|เนื่อง|ค้น)" // leading chars var lc = "[เ-ไ]" // final chars var fc = "[ฯะำฺๅๅๆ๎]" // thai chars var tc = "ก-ฺเ-๎" // not including numbers + ฿, ๏ or ๚๛ var no = "๐-๙" var isThai = /[ก-ฺเ-๎]/ var isNumber = /[๐-๙]/ var regexes = [ // characters that start a syllable new RegExp(lc, "g") // characters than end a syllable , new RegExp(fc, "g") // non-number followed by any Thai number , new RegExp("[^"+no+"](?=["+no+"])", "g") // Thai number followed by any non-number , new RegExp("["+no+"](?=[^"+no+"])", "g") // Thai character followed by a non-Thai character , new RegExp("["+tc+"](?=[^"+tc+"])", "g") // non-Thai character followed by Thai character , new RegExp("[^"+tc+"](?=["+tc+"])", "g") // any char followed by known word , new RegExp("."+cw+"", "g") // known word followed by any character , new RegExp(cw+"(.)", "g") // beginning of a space , /.\s/g // end of a space , /\s+./g ] var adjustments = [ // characters that start a syllable 0 // characters that end a syllable , 1 // non-number followed by any Thai number , 1 // Thai number followed by any non-number , 1 // Thai character followed by a non-Thai character , 1 // non-Thai character followed by Thai character , 1 // any char followed by known word , 1 // known word followed by any character , true // spaces , 1 , true ] function numerical(a, b) { return a - b } function removeDuplicates(value, index, array) { return array.indexOf(value) === index } function splitIntoWords(string) { var segments = [0] var total , ii , regex , adjust , result , split , start , end for (ii = 0, total = regexes.length; ii < total; ii++) { regex = regexes[ii] adjust = adjustments[ii] while (result = regex.exec(string)) { split = result.index + (adjust === true ? result[0].length - 1 : adjust) segments.push(split) } } segments.sort(numerical) segments = segments.filter(removeDuplicates) end = string.length ii = segments.length while (ii--) { start = segments[ii] segments[ii] = string.slice(start, end) end = start } return segments } return splitIntoWords(string) } } })() console.log(dico.splitIntoWords( "งมเข็มในมหาสมุทร " + "พูดไปสองไพเบี้ย นิ่งเสียตำลึงทอง " + "ตากลม ตากลม" , "th" )) console.log(segment.th( "งมเข็มในมหาสมุทร " + "พูดไปสองไพเบี้ย นิ่งเสียตำลึงทอง " + "ตากลม ตากลม" ))
Results
In the Developer Console, you will see the output of the dictionary-based and the rule-based algorithms. Differences are marked in colour below:
["งม","เข็ม","ใน","มหาสมุทร"," ", "พูด","ไป","สอง","ไพ","เบี้ย"," ","นิ่ง","เสีย","ตำลึง","ทอง"," ", "ตาก","ลม"," ",ตา","","กลม", index: 62] ["งม","เข็ม","ใน","มหาสมุทร"," ", "พูด","ไปสอง","ไพ","เบี้ย"," ","นิ่ง","เสียตำ","ลึงทอง"," ", "ตากลม"," ",ตา","","กลม"]
In the majority of cases, the two approaches agree. However:
- The rule-based approach does not identify a boundary in four cases (one green, one blue and two either side of the red chunk)
- The rule-based approach splits one word (shown in red) along its syllable boundaries, and the two halves fail to be detached from the adjacent words.
This rule-based approach can function with moderate accuracy with a dictionary of only a few dozen words. It may make fail to recognize all the syllable boundaries, but it does not fail if it encounters an unknown word.
Attempting to solve the problem in the browser appears unrealistic. In the next section, you'll see how to make an asynchronous call, to simulate a request for word segmentation data from the server.
Asynchronous calls
Correctly identifying word boundaries in Thai is not a task that is easy to perform in the browser. A better solution is to send the Thai text to a server for treatment, and to use the results as soon as they are available.
Over the next three sections you will be learning to:
- Make an asynchronous call to a simulated server
- Make this call each time new Thai text is found in the web page, whether this is when the page is loaded or when new text is added on the fly
- Use the results of these calls to identify the neighbouring word when the arrow keys are pressed
- Create an ASYNC object that behaves like
Meteor.call
in a Meteor environment, or as a wrapper for an plain AJAX call - Add a
getWordSegmentation
method to your globaldico
object - Test an asynchronous call to
getWordSegmentation
and observe the results in the Developer Console.
The ASYNC object
You can start by creating a file called async.js
and saving it in your js/
folder. Here's the code that defines the essential call
method:
js/async.js
"use strict" var ASYNC ;(function async(){ ASYNC = { call: function call (methodName) { var method = this.methods[methodName] var args = [].slice.call(arguments) var callback = args.pop() var result , error if (typeof method === "function") { args.shift() if (typeof callback === "function") { // Simulate asynchronous call result = method.apply(this, args) if (result instanceof Error) { error = result result = null } else { error = null } setTimeout(function () { callback(error, result) }, Math.floor(Math.random() * 2000)) } else { // Return result synchronously args.push(callback) return method.apply(this, args) } } else { error = new Error("Unknown method: " + methodName) if (typeof callback === "function") { callback(error) } else { return error } } } , methods: {} } })()
Just as with Meteor.call
you call this using the syntax:
ASYNC.call("methodName", arguments ... , callback function)
If a callback function is given, it will return the result asynchronously (after a random delay). If no callback is given, a result will be returned synchronously.
The getWordSegmentation
method
You can add any method you like to the methods
object. Here's how you can add a getWordSegmentation
method, which will call a getWordMap
method that you haven't yet added to the global dico
object.
js/async.js
"use strict" var ASYNC ;(function async(){ ASYNC = { call: function call(methodName) { // code omitted for clarity } , methods: { /** * getWordSegmentation calls dico.getWordMap() for each * string in the map object, and returns an object defining * the positionsof the start and end of each word in each * string. * @param {object} map has the format * { <lang code>: [ * <string> * , ... * ] * , ... * } * @return {object} { <lang code>: [ * { text: <string> * , offsets: { starts: [ * <integer> * , ... * ] * , ends: [ * <integer> * , ... * ] * } * } * , ... * ] * , ... * } */ , getWordSegmentation: function getWordSegmentation(map) { var result = {} var lang for (lang in map) { treatLanguage(lang) } return result function treatLanguage(lang) { var stringArray = map[lang] var langArray = result[lang] if (!langArray) { langArray = [] result[lang] = langArray } ;(function treatStrings() { var ii = stringArray.length var wordMap while (ii--) { try { wordMap = dico.getWordMap(stringArray[ii], lang) langArray.push(wordMap) } catch (error) { result = error ii = 0 } } })() } } } } })()
The getWordSegmentation
expects as an argument an object map whose keys are lang ISO codes, and whose properties are arrays of strings in the given language. You'll have the chance to test it now.
Editing index.html
You'll your to declare your new async.js
script in your index.html
file. At the same time, you can add a little JavaScript test to call the getWordSegmentation
method in the ASYNC
object.
index.html
<!DOCTYPE html> <html> <head> <meta charset="utf-8"> <title>Selecting Thai words</title> </head> <body> <p lang="th">งมเข็มในมหาสมุทร พูดไปสองไพเบี้ย นิ่งเสียตำลึงทอง ตากลม ตากลม</p> <p lang="enx">lookforaneedleinthesea talkisworthtuppence silenceisworthgold exposedtotheair roundeye</p> <script src="js/dictionaries.js"></script> <script src="js/async.js"></script> <script> var map = { th: [ "งมเข็มในมหาสมุทร" , "พูดไปสองไพเบี้ย นิ่งเสียตำลึงทอง" , "ตากลม ตา\u200bกลม" ] , enx: [ "lookforaneedleinthesea" , "talkisworthtuppence silenceisworthgold" , "exposedtotheair roundeye" ] } ASYNC.call("getWordSegmentation", map, callback) function callback(error, result) { console.log(error || JSON.stringify(result)) } </script> </body> </html>
Provoking an error
You may have noticed that you haven't yet added a getWordMap
method to your dico
object. The JavaScript engine in your browser certainly will. If you save your edited files and refresh your browser, you should see the following in the Developer Console:
TypeError: dico.getWordMap is not a function(…)
The getWordMap
method
To fix this error, you need to add a getWordMap
method to your dico
object.
js/dictionaries.js
"use strict" var dico var segment ;(function addDictionaries() { dico = { dictionaries: { // code omitted for clarity } , tries: {} , initialize: function createTries(){ // code omitted for clarity } , createTrie: function createTrie(languageMap) { // code omitted for clarity } , splitIntoWords: function splitIntoWords(string, languageCode){ // code omitted for clarity } , getWordMap: function getWordMap(string, languageCode) { var segments = this.splitIntoWords(string, languageCode) var regex = /[^\s!-\/:-@[-`{-~\u00A0-¾—-⁊\u200b]/ var offsets if (!segments) { // Split words in unknown language by ASCII word boundaries segments = string.split(/\b/) } offsets = getOffsets() return { text: string , offsets: offsets } function getOffsets() { var starts = [] var ends = [] var total = segments.length var index = 0 var ii , segment , length for (ii = 0; ii < total; ii += 1) { segment = segments[ii] if (regex.test(segment)) { // This segment contains word characters starts.push(index) index += segment.length ends.push(index) } else { // This segment is punctuation or whitespace index += segment.length } } return { starts: starts, ends: ends } } } }.initialize() segment = { // code omitted for clarity } })()
The getWordMap
method accepts a string and a language code as arguments. It first checks if the dico
object has a dictionary for the given language, and if so, it uses that to split the string into known words. In practice, this would be done on the server by a specialized application. If no dictionary exists for the given language, it splits the string into words and spaces using the simplistic /\b/
regular expression, which really only works will with plain English text.
For example, it will split the string "talkisworthtuppence silenceisworthgold" into:
["talk", "is", "worth", "tuppence", " ", "silence", "is", "worth", "gold"]
The getOffsets
function
The getOffsets
function looks at each entry in the segments
array in turn, and checks whether or not it contains a non-space character. The regular expression it uses a to do this is rather overkill for this proof-of-concept.
/[^\s!-\/:-@[-`{-~\u00A0-¾—-⁊\u200b]/
This checks that the string contains none of the non-word characters (such as whitespace, zero-width spaces and punctuation) that are used in any European language. If so, the segment must be a word.
The getOffsets
function populates two lists – starts
and ends
– with the index of the starting point and end point of each word. For example, for the string "talkisworthtuppence silenceisworthgold", starts
will be ...
[0, 4, 6, 11, 20, 27, 29, 34]
... indicating that the word "talk" starts at 0
, the word "is" starts at 4
, and so on. The ends
array will be ...
[4, 6, 11, 19, 27, 29, 34, 38]
Each end point corresponds to the start point of the next word except for 19
which indicates the end of "tuppence" and the beginning of a space, while 20
in the starts
array indicates the beginning of the word "silence".
Testing a second time
Now that you have added the getWordMap
method, you can save your files and refresh your browser. This time you should see more interesting output in the Developer Console. Here's a prettified version, complete with colouring so that you can see how the index numbers and the word boundaries match each other.
{ "th": [
{ "text": "ตากลม ตากลม"
, "offsets": {
"starts": [0,3,6,9]
, "ends": [3,5,8,12]
}
}
, { "text": "พูดไปสองไพเบี้ย นิ่งเสียตำลึงทอง"
, "offsets": {
"starts": [0,3,5,8,10,16,20,24,29]
, "ends": [3,5,8,10,15,20,24,29,32]
}
}
, { "text": "งมเข็มในมหาสมุทร"
, "offsets": {
"starts":[0,2,6,8]
, "ends": [2,6,8,16]
}
}
]
, "enx": [
{ "text": "exposedtotheair roundeye"
, "offsets": {
"starts": [0,7,9,12,16,21]
, "ends": [7,9,12,15,21,24]
}
}
, { "text": "talkisworthtuppence silenceisworthgold"
, "offsets": {
"starts": [0,4,6,11,20,27,29,34]
, "ends": [4,6,11,19,27,29,34,38]
}
}
, { "text": "lookforaneedleinthesea"
, "offsets": {
"starts": [0,4,7,8,14,16,19]
, "ends": [4,7,8,14,16,19,22]
}
}
]
}
getWordSegmentation
dictionaries.js
script behave asynchronously, is if it were on a server. In particular, you've
- Created an
async.js
script that allows you to simulate an asynchronous call to a server - Added a
getWordMap
method to yourdico
object that takes a string and a language code as in input and returns an array of start and end points of the words in the string - Used
ASYNCH.call
to send a map of strings in various languages and to receive a callback some time later with the results of the call.
For now, you've simply tested your system using a hard-coded map. In the next section, you'll see how to use a MutationObserver
to detect whenever new text is added to the web page, and to trigger ASYNC.call("getWordSegmentation", ...)
automatically.
Detecting text changes
- Use a
MutationObserver
object to detect each time the text in an HTML page changes - Use a
TreeWalker
object to find which textNodes have changed - Detect the
lang
associated with a giventextNode
- Check whether the user can select the text in a given
textNode
- Maintain a map of word boundary data for all relevant
textNode
s on the page
Editing the index.html
file
To test the new features that you will be adding in this section, you can edit your index.html
file as shown below. These changes:
- Set the default value of the
lang
attribute of the entire HTML page toen
(English) - Add text that will not need to be treated
- Add spans to the Thai (
th
) and EnglishNoSpaces (enx
) text, to create a number oftextNode
s - Add two spans with a class of
unselectable
, which will also not need to be treated - Add a button and code so that you can remove the
unselectable
class, after the page has loaded
<!DOCTYPE html> <html lang="en"> <head> <meta charset="utf-8"> <title>Selecting Thai words</title> <link rel="stylesheet" type="text/css" href="css/style.css"> </head> <body> <h1>Word Segmentation Test</h1> <p>The default <em>lang</em> of this page is <em>en</em>. The paragraphs below are in Thai (<em>th</em>) and EnglishNoSpaces (<em>enx</em>).</p> <p lang="th">งมเข็มในมหาสมุทร <span>พูดไปสองไพเบี้ย นิ่งเสียตำลึงทอง</span> <span class="unselectable">คำเหล่านี้ไม่สามารถเลือกได้</span> ตากลม ตากลม</p> <p lang="enx">lookforaneedleinthesea <span>talkisworthtuppence silenceisworthgold</span> <span class="unselectable">thewordsherecannotbeselected</span> exposedtotheair roundeye</p> <button type="button">Remove .unselectable class</button> <script> document.querySelector("button").onclick = function (event) { var spans = document.querySelectorAll(".unselectable") for (var ii = 0, length = spans.length; ii < length; ii += 1) { spans[ii].classList.remove("unselectable") } this.setAttribute("disabled", true) } </script> <script src="js/dictionaries.js"></script> <script src="js/async.js"></script> <script src="js/selection.js"></script> </body> </html>
Updating the dictionaries
This new version of index.html
includes new words that you'll need to add to the dictionaries in the dico
object:
js/dictionaries.js
"use strict" var dico var segment ;(function addDictionaries() { dico = { dictionaries: { enx: { " ": 0 , "a": 0 , "air": 0 , "be": 0 , "can": 0 , "cannot": 0 , "elect": 0 , "elected": 0 , "ere": 0 , "exposed": 0 , "eye": 0 , "for": 0 , "gold": 0 , "he": 0 , "here": 0 , "hew": 0 , "in": 0 , "is": 0 , "look": 0 , "need": 0 , "needle": 0 , "not": 0 , "old": 0 , "or": 0 , "pence": 0 , "ran": 0 , "round": 0 , "sea": 0 , "select": 0 , "selected": 0 , "she": 0 , "silence": 0 , "talk": 0 , "the": 0 // , "these": 0 , "to": 0 // , "tot": 0 , "tup": 0 , "tuppence": 0 , "up": 0 , "word": 0 , "words": 0 , "worth": 0 } , th: { " ": {} , "": {} // // code omitted for clarity , "ไพ" : { "pronunciation": "phai-" , "translation": "a certain old coin equal in value to 1/32 baht" } , "คำ": { "pronunciation": "kham-" , "translation": "term; discourse; a mouthful or bite; morsel [numerical classifier for a word, an answer to a question, a spoonful of food]" } , "เหล่า": { "pronunciation": "lao_" , "translation": "[numerical classifier for groups of items]; a group [of items or things]" } , "นี้": { "pronunciation": "nee'" , "translation": "this; these; [is] now" } , "ไม่": { "pronunciation": "mai`" , "translation": "not; no; [auxiliary verb] does not; has not; is not; [negator particle]" } , "สามารถ": { "pronunciation": "saa´ maat`" , "translation": "capable; able; to have the ability to; can; a Thai given name" } , "เลือก": { "pronunciation": "leuuak`" , "translation": "to select or choose; elect; pick" } , "ได้": { "pronunciation": "dai`" , "translation": "can; to be able; is able; am able; may; might [auxiliary of potential, denoting possbility, ability, or permission]; to receive; to obtain; acquire; get; have got; to pick out; to choose; to pass [a test or an exam]" } } } , tries: {} , initialize: function createTries(){ // code omitted for clarity } , createTrie: function createTrie(languageMap) { // code omitted for clarity } , splitIntoWords: function splitIntoWords(string, languageCode) { // code omitted for clarity } }.initialize() segment = { // code omitted for clarity } })()
Making text unselectable
The unselectable
class does nothing without a CSS rule. You can create a file called styles.css
and save it in a folder named css
alongside your HTML document.
css/styles.css
body { width: 22em; } .unselectable { -webkit-touch-callout: none; /* iOS Safari */ -webkit-user-select: none; /* Chrome/Safari/Opera */ -khtml-user-select: none; /* Konqueror */ -moz-user-select: none; /* Firefox */ -ms-user-select: none; /* Internet Explorer/Edge */ user-select: none; -webkit-user-drag: none; user-drag: none; color: #999; }
Detecting the value of the lang
attribute for a node
In the current circumstances, you want to ask the server only for the word boundary data for textNode
s in Thai or in our invented EnglishNoSpaces. To choose only text nodes in these languages, you need no detect the closest parent that has a lang
attribute, and return the value of that attribute.
You can create a new file named selection.js
and save it in your js/
folder. Here's a getLang
function. You can leave it in the global namespace for now.
js/selection.js
function getLang(node) { var lang = "" if(!node.closest) { node = node.parentNode } node = node.closest("[lang]") if (node) { lang = node.getAttribute("lang") } return lang }
You can test this in the Developer Console:
getLang(document.body) "en" getLang(document.querySelector(".unselectable")) "th" getLang(document.querySelector("[lang=enx] span")) "enx" getLang(document.querySelector("button")) "en"
Early versions of Internet Explorer do not support the element.closest()
method. If you need to support such browsers, you can use this polyfill.
Detecting unselectable nodes
There is no point asking the server send word boundary information on text that cannot be selected. To make the contents of an HTML element unselectable, your can create a CSS rule based on the user-select
property. As of August 2016, he major browser vendors have not reached agreement on how this is to be implemented, so there are a number of properties with vendor prefixes that you need to test for.
You can use getComputedStyle(element)
to obtain an object with key/value pairs for all properties applied to an element. You can then use the array.every
method to check all the versions of user-select
and to stop looking any further if any of them has a value of none
.
js/selection.js
function elementIsSelectable(element) { var prefixes = [ "-webkit-" , "-khtml-" , "-moz-" , "-ms-" , "" ] var style = window.getComputedStyle(element) var selectable = prefixes.every(function check(key) { key += "user-select" return style.getPropertyValue(key) !== "none" }) return selectable }
You can test this in the Developer Console:
elementIsSelectable(document.body) true elementIsSelectable(document.querySelector(".unselectable")) false elementIsSelectable(document.querySelector("[lang=enx] span")) true elementIsSelectable(document.querySelector("button")) true
Detecting changes to text on the page
To detect changes to the DOM, you can create a MutationsObserver
object, and tell it which function to call when an mutation is observed.
In the code listing below, the getLang
and elementIsSelectable
functions have been moved inside an immediately-invoked function expression, to take them out of global namespace, now that you have had a chance to test them.
"use strict" ;(function selection() { var observer = new MutationObserver(checkForAlteredTextNodes) observer.observe(document.body, { childList: true , attributes: true , subtree: true }) function getLang(node) { // code omitted for clarity } function elementIsSelectable(element) { // code omitted for clarity } function checkForAlteredTextNodes(mutations) { mutations.forEach(populateNewTextArray) function populateNewTextArray(mutation) { // TODO console.log(mutation.type, mutation.target, mutation) } } })()
A MutationObserver
is set up in two steps. First you create the object and define the function to call when any changes are detected. Then you call observer.observe(target, mutationOptions)
, to tell it which target element to observe, and what changes to observe on that element.
The options object needs to set at least one of the following properties to true
: attributes
, characterData
or childList
. If your target is a specific textNode
then it is useful to observe characterData
. If you want to observe changes to the text of all the children of a element (as you do here), then you need to set both childList
and subtree
to true
. In the test that you will run shortly, you will convert unselectable text to selectable text by removing a class from a node; to observe this change, you will also need to set attributes
to true
.
The populateNewTextArray
function doesn't do much yet, but it does display information about the mutations that are observed in the Developer Console. Here's what you should see when you refresh the page:

When the page is loaded, the childList
of the document.body
is altered. When you click on the button, you will change the classList
attribute of two spans, and the disabled
attribute of the button itself. Here is what these mutations look like:

Generating the map to send to getWordSegmentation
The purpose of the populateNewTextArray
function is to walk through the body of the entire document, looking for textNode
s that have been added or changed, and specifically those that need to be set to be treated by the getWordSegmentation
method. Once a string has been segmented into words, its segmentation data will be stored in a wordsMap
object to be used as needed. The keys in the original emptywordsMap
object also indicate which languages need such special treatment.
Using a TreeWalker
An efficient way of doing this is to use a TreeWalker
object. The document.createTreeWalker
method accepts up to four arguments. The fourth is not useful here. The three arguments you can use are:
- The root node for the tree to walk
- An NodeFilter integer to indicate which types of nodes in the tree you want to consider
- A filter object with an
acceptNode
method, which will be called for each node that matches the types indicated by the NodeFilter integer
Normally, the acceptNode
method is used to return true
or false
, to indicate whether the node should be returned and added to an array for further treatment. In the code listing below, the acceptNode
method is co-opted to do all the work of building the map that will be sent to the getWordSegmentation
method.
js/selection.js
"use strict" ;(function selection() { var wordsMap = { th: {}, enx: {} } var observer = new MutationObserver(checkForAlteredTextNodes) observer.observe(document.body, { childList: true , attributes: true , subtree: true }) function getLang(node) { // code omitted for clarity } function elementIsSelectable(element) { // code omitted for clarity } function checkForAlteredTextNodes(mutations) { var newTextMap = {} // { <lang>: [<string>, ...], ...} var newTextFound = false mutations.forEach(populateNewTextArray) if (newTextFound) { ASYNC.call("getWordSegmentation", newTextMap, updateWordsMap) } function populateNewTextArray(mutation) { var filter = { acceptNode: function(node) { var text = node.data var lang , map , langArray if (!/^\s*$/.test(text)) { lang = getLang(node) if (map = wordsMap[lang]){ if (elementIsSelectable(node.parentNode)) { if (!map[text]) { if (!(langArray = newTextMap[lang])) { langArray = [] newTextMap[lang] = langArray } langArray.push(text) map[text] = true newTextFound = true } } } } } } var walker = document.createTreeWalker( mutation.target , NodeFilter.SHOW_TEXT , filter ) while (walker.nextNode()) { // Action occurs in filter.acceptNode } } function updateWordsMap(error, result) { // TODO console.log(error || JSON.stringify(result)) } } })()
For each textNode
, the filter.acceptNode
function is called. This runs through the following checklist:
if (!/^\s*$/.test(text)) {...}
- Does this textNode contain anything other than whitespace?
lang = getLang(node); if (map = wordsMap[lang]) {...}
- Is this textNode in a language that needs custom word segmentation?
if (elementIsSelectable(node.parentNode)) {...}
- Does this textNode contain text that the user can select?
if (!map[text]) {
- Has the string data found in this textNode not already been treated, or started to be treated?
If the answer to all these questions is Yes, then the string data
of the textNode will be added to the appropriate language array in newTextMap
, and an entry will be added in the appropriate place in the wordsMap
object, to indicate that the string is about to be treated. If an identical string is found later, the script will know not to make a new request for treatment.
When you save your changes and refresh the page in your browser, you should see the same output as is shown in Figure 1 at the end of the previous section. This time, however, the text is taken directly from the HTML page, and not hard-coded in JavaScript.

newTextMap
If you click on the button, the spans that currently have the unselectable
class will lose that class, and the getWordSegmentation
method will be called again, with just the text that has become selectable:

checkForAlteredTextNodes
function againupdateWordsMap
The last step for this section is to update the wordsMap
object, to store the data retrived by the asynchronous call to getWordSegmentation
. You can make the changes shown below in your selection.js
script, and then refresh the page in your browser.
js/selection.js
"use strict" ;(function selection(){ var wordsMap = { th: {}, enx: {} } var observer = new MutationObserver(checkForAlteredTextNodes) observer.observe(document.body, { childList: true , attributes: true , subtree: true }) function getLang(node) { // code omitted for clarity } function elementIsSelectable(element) { // code omitted for clarity } function checkForAlteredTextNodes(mutations) { // code omitted for clarity } /** * updateWordsMap is triggered by a callback from the ASYNC * call to getWordSegmentation. For each language and each * text key, it replaces the existing `true` entry with the * `offsets` array for the word boundaries in the given * string. * @param {Error|null} error * @param {null|object} result * { <lang>: [ * { "text": <string> * , "offsets": { * "starts": [<integer>, ...] * , "ends": [<integers>, ...] * } * } * , ... * ] * , ... * } */ function updateWordsMap(error, result) { if (error) { return console.log(error) } var lang , langArray , langMap , ii , textData for (lang in result) { langMap = wordsMap[lang] langArray = result[lang] ii = langArray.length while (ii--) { textData = langArray[ii] langMap[textData.text] = textData.offsets } } console.log(JSON.stringify(wordsMap)) } })()
- Used the
closest
method to find the closest element in the hierarchy with alang
attribute - Used
getComputedStyle
to determine if the text in a given node is selectable - Used a
MutationsObserver
to detect when text in the page changes - Created a
TreeWalker
to find which textNodes have changed
In the next section, you'll use the code that you have created so far to extend a jumpToNextWord function, where pressing arrow keys will select the neighbouring word.
Jump to adjacent words
In the Text Selection tutorial, you can see how to develop a script named selection.js
that allows you to use the arrow keys to select the word adjacent to the current selection.
The script developed in that tutorial only works with languages where words are separated by spaces. In this current tutorial, you have learned how to generate word segmentation data for a string of Thai text. Now you're ready to merge the two selection.js
scripts so that you can use the arrow keys to jump to adjacent words in Thai text too.
selection.js
script from the Text Selection tutorial so that it also works with the Thai language. You see how to:
- Add the functions you have developed in this tutorial to the original
selection.js
script - Modify the
jumpLeft
andjumpRight
functions of the originalselection.js
script so that they work with Thai text
The original selection.js
script is quite lengthy. This section shows only the parts of it where you need to make changes. It's a good idea to download the full source code and follow where the changes have been made.
Merging the two selection.js
scripts
The abridged code listing below shows where the code from your new selection.js
script can be added to the original script. Note that the jumpLeft
and jumpRight
functions that you will be looking at here are different from those in the original version.
js/selection.js
"use strict" ;(function selection(){ // code omitted for clarity var box = document.body // querySelector(".box") var wordsMap = { th: {}, enx: {} } var observer = new MutationObserver(checkForAlteredTextNodes) observer.observe(document.body, { childList: true , attributes: true , subtree: true }) box.ondblclick = selectHyphenatedWords document.body.onkeydown = jumpToNextWord function selectHyphenatedWords(event) { // code omitted for clarity } function extendSelectionBackBeforeHypen(string, offset) { // code omitted for clarity } function extendSelectionForwardAfterHyphen(string, offset) { // code omitted for clarity } function jumpToNextWord (event) { // code omitted for clarity } function jumpLeft() { // code here will be edited } function jumpRight() { // code here will be edited } function getAdjacentTextNode(node, whichSibling, arrayMethod) { // code omitted for clarity } function scrollIntoView(range) { // code omitted for clarity } function getLang(node) { // code omitted for clarity } function elementIsSelectable(element) { // code omitted for clarity } function checkForAlteredTextNodes(mutations) { // code omitted for clarity } function updateWordsMap(error, result) { // code omitted for clarity } })()
Checking lang
in jumpLeft
The technique that you will use for selecting the adjacent word will be different when the language is Thai. The code listing below shows how jumpLeft
can be split into two internal functions: jumpLeftWithRegex
, which behaves use almost exactly the same code as the original jumpLeft
function, and jumpLeftWithMap
which will use the data in the wordsMap
object. A switch
statement uses the lang
attribute applied to the container node to determine which approach should be taken.
Note the minor change in the jumpLeftWithRegex
function. If the current word is the first in the current container, word to select will be in the previous textNode. It's possible that the previous textNode is in a different language, and the jumpLeftWithRegex
function might not be appropriate for that language.
The solution is to:
- Create a new
range
object (so that the selection is not changed if there is nothing earlier to select) - Place the
range.endOffset
at the end of the previous textNode - Call the parent
jumpLeft
function recursively.
The endOffset
cannot be placed earlier than the startOffset
, so startOffset
is now set to exactly the same point. You'll see the importance of this in a moment. This technique ensures that the function used for selecting the last word in the previous textNode is appropriate for the language of that node.
js/selection.js
... function jumpLeft() { container = range.endContainer var lang = getLang(container) switch (lang) { case "th": case "enx": return jumpLeftByMap(container) default: return jumpLeftByRegex(container) } function jumpLeftByMap(container) { // TODO } function jumpLeftByRegex(container) { var string = container.textContent var result = getPreviousWord(string, range.startOffset) var startOffset , endOffset , rangeData if (!result) { // There are no more words in this text node. Try the previous. container = getAdjacentTextNode( container , "previousSibling" , "pop" ) if (container) { range = document.createRange() range.setEnd(container, container.textContent.length) return jumpLeft() } else { // We're at the very beginning of the selectable text. // There's nothing earlier to select. return } } startOffset = result.index endOffset = startOffset + result[0].length rangeData = { container: container , startOffset: startOffset , endOffset: endOffset , string: string } return rangeData function getPreviousWord(string, offset) { string = string.substring(0, offset) var result , temp while (temp = lastWordRegex.exec(string)) { result = temp } return result } } } ...
jumpLeftByMap
The jumpLeftByMap
function first checks that there is an entry in wordsMap
for the textContent of the current node. If the server has not yet called back with the result of the getWordSegmentation
call, then the only solution is to cancel the action.
In a production version of this feature, it would be good to show a tooltip or a spinner to justify why nothing is happening.
If the server has already responded, then endOffset
is set to range.endOffset + (range.startOffset === string.length)
and the function setOffsets
is called to set the values of startOffset
and endOffset
as required to select the previous word.
In most cases, range.startOffset === string.length
will be false
so endOffset
will be set to range.endOffset
. When jumpLeft
is called recursively because the current selection is the first word in the next textNode, then the instruction range.setEnd(container, container.textContent.length)
described above will also set the startOffset
to the end of the previous container. In this case range.startOffset === string.length
will be true
and endOffset
will be set to range.endOffset + 1
. This ensures that the setOffsets
function will correctly choose the start and end offsets for the last word in the current container.
js/selection.js
... function jumpLeftByMap(container) { var string = container.textContent var wordMap = wordsMap[lang][string] if (typeof wordMap !== "object")) { return console.log("No word map for ", string) } var wordEnds = wordMap.ends var wordStarts = wordMap.starts var endOffset = range.endOffset + (range.startOffset === string.length) var startOffset var rangeData var result = setOffsets(endOffset) if (!result) { // TODO } rangeData = { container: container , startOffset: startOffset , endOffset: endOffset , string: string } return rangeData function setOffsets(offset) { var ii = wordEnds.length while (ii--) { if ((endOffset = wordEnds[ii]) < offset) { startOffset = wordStarts[ii] ii = 0 } } return (startOffset !== undefined) } } ...
setOffsets
The wordEnds
variable is an array of index numbers of the next character after the end of each word. For the sentence "lookforaneedleinthesea" wordEnds
will be [4,7,8,14,16,19,22]
. The setOffsets
function receives an offset which may be the current range.endOffset
(if a word in this textNode is already selected), or the-last-item-in-wordEnds
-plus-1 (if the first word in the next textNode is selected). It works backwards from the end of the wordEnds
, looking for the first number that is smaller. If it finds such a number, it sets startOffset
to the corresponding entry in wordStarts
. However, if the first word in the current textNode is selected, then it won't find a smaller number, and startOffset
will remain undefined. When this happens, setOffsets
will return false
. This will trigger a recursive jump to the previous textNode, as you'll see below.
js/selection.js
... function jumpLeftByMap(container) { var string = container.textContent var wordMap = wordsMap[lang][string] if (!wordMap) { return console.log("No word map for ", string) } var wordEnds = wordMap.ends var wordStarts = wordMap.starts var endOffset = range.endOffset + (range.startOffset === string.length) var startOffset var rangeData var result = setOffsets(endOffset) if (!result) { container = getAdjacentTextNode( container , "previousSibling" , "pop" ) if (container) { range = document.createRange() range.setEnd(container, container.textContent.length) return jumpLeft() } else { // We're at the very beginning of the selectable text. // There's nothing earlier to select. return } } rangeData = { container: container , startOffset: startOffset , endOffset: endOffset , string: string } return rangeData function setOffsets(offset) { // code omitted for clarity } } ...
The technique for jumping back to the previous textNode is identical to the way it is done in jumpLeftByRegex
Jumping to the right
Jumping to the right works in a very similar way.
js/selection.js
... function jumpRight() { container = range.startContainer var lang = getLang(container) switch (lang) { case "th": case "enx": return jumpRightByMap(container) default: return jumpRightByRegex(container) } function jumpRightByMap(container) { var string = container.textContent var wordMap = wordsMap[lang][string] if (typeof wordMap !== "object") { return console.log("No word map for ", string) } var wordEnds = wordMap.ends var wordStarts = wordMap.starts var startOffset = range.startOffset - (!range.endOffset) var endOffset var rangeData var result = setOffsets(startOffset) if (!result) { container = getAdjacentTextNode( container , "nextSibling" , "shift" ) if (container) { range = document.createRange() range.setStart(container, 0) return jumpRight() } else { // We're at the very beginning of the selectable text. // There's nothing earlier to select. return } } rangeData = { container: container , startOffset: startOffset , endOffset: endOffset , string: string } return rangeData function setOffsets(offset) { var ii var length = wordStarts.length for (ii = 0; ii < length; ii += 1) { if ((startOffset = wordStarts[ii]) > offset) { endOffset = wordEnds[ii] ii = length } } return (endOffset !== undefined) } } function jumpRightByRegex(container) { var startOffset = range.endOffset var string = container.textContent var result = nextWordRegex.exec(string.substring(startOffset)) var endOffset , rangeData if (result) { startOffset += result[0].length } else { // There are no more words in this text node. Try the next. container = getAdjacentTextNode( container , "nextSibling" , "shift" ) if (container) { range = document.createRange() range.setStart(container, 0) return jumpRight() } else { // We're at the very end of the selectable text. There's // nothing more to select. return } } result = wordEndRegex.exec(string.substring(startOffset)) endOffset = startOffset + result.index rangeData = { startOffset: startOffset , endOffset: endOffset , string: string } return rangeData } }
You can now double-click on a word to select it and use the left and right arrow keys to move the selection to the adjacent word.
selection.js
script with the selection.js
script from an earlier tutorial. You've split the jumpLeft
and jumpRight
functions to create two ways to select the adjacent word, depending on the language.
In the next section, you'll see how to simulate querying the server for the translation of the selected word.
Querying an online dictionary
- Query the
dico
object asynchronously the translation of the current selection - Display the translation in your web page
- Cache translations in order to display them faster the second and subsequent times they are requested
Simplifying index.hmtl
In this final section, you can simplify your index.html
file by removing all the text in EnglishNoSpaces. To limit the action of the word selection feature, you can create a box around the Thai text. You can also add a <div> element to contain the translation, when it becomes available.
index.html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Selecting Thai words</title>
<link rel="stylesheet" type="text/css" href="css/style.css">
</head>
<body>
<h1>Word Segmentation Test</h1>
<p>Double-click to select a Thai word to see its pronounciation
and translation. Use the left and right arrow keys to select
neighbouring words.</p>
<div class="box" lang="th">
<p>งมเข็มในมหาสมุทร</p>
<p>พูดไปสองไพเบี้ย นิ่งเสียตำลึงทอง</p>
<p>ตากลม ตากลม</p>
</div>
<div id="translation"><div>
<script src="js/dictionaries.js"></script>
<script src="js/async.js"></script>
<script src="js/selection.js"></script>
</body>
</html>
You can add some cosmetic CSS, to make the page more elegant.
css/style.css
body {
width: 22em;
}
.box {
background-color: #eef;
border: 1px solid #ccf;
border-style:inset;
}
.box p {
margin: 0.5em;
}
dt {
font-weight: bold;
}
Adding a getWordData
method to the dictionary
The th
entry in your dictionary is a map object where the keys are Thai words and the values are objects with the format:
{ pronunctiation: <string> , translation: <string> }
You can add a getWordData
method to the dico
object that will return an object with the format:
{ pronunctiation: <string> , translation: <string> , word: <string> , lang: <string> }
However, if a request is made for a word that doesn't exist, or for a language for which there is no dictionary, it makes sense to return an object with the format ...
{ , ⚠: "no data available" , word: <string> , lang: <string> }
... where ⚠ is encoded in JavaScript as "\u26A0"
.
A fast way to make a deep clone of a JavaScript object is to use clone = JSON.parse(JSON.stringify(original))
js/dictionary.js
"use strict" var dico var segment ;(function addDictionaries() { dico = { dictionaries: { // code omitted for clarity } , tries: {} , initialize: function createTries(){ // code omitted for clarity } , createTrie: function createTrie(languageMap) { // code omitted for clarity } , splitIntoWords: function splitIntoWords(string, languageCode) { // code omitted for clarity } , getWordMap: function getWordMap(string, languageCode) { // code omitted for clarity } , getWordData: function getWordData(word, languageCode) { var dictionary = this.dictionaries[languageCode] var data if (dictionary) { data = dictionary[word] if (data) { data = JSON.parse(JSON.stringify(data)) } else { data = { "\u26A0": "no data available" } } data.word = word data.lang = languageCode } return data } }.initialize() segment = { // code omitted for clarity } })()
Defining an asynchronous method for calling dico.getWordData
To call dico.getWordData
asynchronously, you can add the following getTranslation
method to your ASYNC
object:
js/async.js
"use strict" var ASYNC ;(function async(){ ASYNC = { // code omitted for clarity } , methods: { getWordSegmentation: function getWordSegmentation(map) { // code omitted for clarity } , getTranslation: function getTranslation(word, languageCode) { try { return dico.getWordData(word, languageCode) } catch (error) { return error } } } } })()
Updating selection.js
With all these changes in place, you can now add a requestTranslation
function to your selection.js
script. You can add calls to this from 3 places:
- Any mouseup event on the
div.box
element, in case the user clicks and drags to make a custom selection - Any dblclick event on the
div.box
element, after a complete word (which may contain hyphens) is selected - Any time the user presses the left or right arrow key to select a new word
As you will see in the code listing below, the requestTranslation
function can work both asynchronously and synchronously. A wordsCache
object is created when the code is loaded. The requestTranslation
function first checks if this wordsCache
object already contains an entry for the selected word and language. If not, it makes a call to ASYNC.call("getTranslation", ...)
, using the nested showTranslation
function as a callback. If the wordsCache
object already contains an entry for the word, the showTranslation
function is called immediately, with the alreadyCached
argument set to true
.
If the translation is retrieved via the asynchronous call to ASYNC.call("getTranslation", ...)
, the alreadyCached
argument will be undefined
, and the incoming data will be added to wordsCach
, so that it can be retrieved locally the next time. If the selection has changed in the meantime, the operation ends there. If the selected is string is identical to the word
property of the incoming data
object, then a header and a description list element (<dl>
) are created to show information about the word, and displayed in the div#translation
element.
js/selection.js
;(function selection(){ // code omitted for clarity var box = document.querySelector(".box") var wordsCache = {} var wordsMap = { th: {}, enx: {} } var observer = new MutationObserver(checkForAlteredTextNodes) observer.observe(document.body, { childList: true , attributes: true , subtree: true }) box.onmouseup = requestTranslation box.ondblclick = selectHyphenatedWords document.body.onkeydown = jumpToNextWord function selectHyphenatedWords(event) { // code omitted for clarity if (selectionUpdated) { selection.removeAllRanges() selection.addRange(range) } scrollIntoView(range) requestTranslation() } // code omitted for clarity function jumpToNextWord (event) { // code omitted for clarity selection.removeAllRanges() selection.addRange(range) scrollIntoView(range) requestTranslation() } // code omitted for clarity function requestTranslation() { var word = selection.toString() var lang = getLang(selection.getRangeAt(0).startContainer) var data // Ignore word if it is all whitespace word = word.match(lastWordRegex) if (word) { word = word[0] } else { return } if ((data = wordsCache[lang]) && (data = data[word])) { return showTranslation(null, data, true) } ASYNC.call("getTranslation", word, lang, showTranslation) function showTranslation (error, data, alreadyCached) { if (error) { return console.log(error) } var pTranslation = document.getElementById("translation") var string , word , lang if (data) { word = data.word lang = data.lang if (!alreadyCached) { addToWordsCache(lang, word, data) } if (word !== selection.toString()) { return } string = "<h3>" + word + "</h3><dl>" for (var key in data) { if (key !== "word" && key !== "lang") { string += "<dt>" + key + "</dt>" string += "<dd>" + data[key] + "</dd>" } } string += "</dl>" } pTranslation.innerHTML = string } function addToWordsCache(lang, word, data) { var langMap = wordsCache[lang] if (!langMap) { langMap = {} wordsCache[lang] = langMap } langMap[word] = data } } })()
Testing
After you save your changes and relaunch your web page, you can test the translation feature. You can try both double-clicking on a Thai word or clicking and dragging to make a custom selection.
- Select the individual syllables "มหา" or "สมุทร" in the word "มหาสมุทร"
- Select all the characters between two spaces, so that more than one word is selected
- Use the right arrow key to jump to the next word. You can start by moving slowly from word to word, waiting until the asynchronous call returns, and then use the left arrow key to jump back to words whose translation has already been cached. You'll see that the response is much faster.
In this section, you've completed a simulation of calling a server in order to get both word boundary data and word definitions, using a number of techniques that you discovered earlier in this tutorial.
Conclusion
If you have made the effort to get this far, well done!
- Finding word boundaries in certain languages requires a call to a dedicated application on a server
- You can detect when the text on a page changes, and call the server after each change to obtain the word boundary data for the new text
- You can cache the results of calls to the server to speed up future operations
- You can use the word boundary data to determine the next and previous words, and use the arrow keys to jump between them.
- You can query a database for information about the words
Where to go from here
The Language Reference tutorial builds on the techniques that you have learned here, to connect to a Wiktionary page when you select a word in the box at the top of the NoteBook popup window.
Overview
This tutorial is part of a series that describes how to build a Chrome extension that helps you to learn a foreign language. With the extension, any text you select in any web page will be copied to a pop-up window. If you select a word in the pop-up window, the appropriate Wiktionary page for that word will be shown in an iFrame. An earlier tutoral shows you how to get this to work for European languages, which use spaces as word boundaries. This tutorial shows you how to extend it to Thai (and by extension to other Asian languages), where no spaces are used between words.
In a browser, you expect a double-click to select a whole word, regardless of which language that word is written in. A browser like Chrome, the word boundaries for many languages are automatically detected. Other browsers may fail to indentify word boundaries in non-European languages and select either syllables or whole phrases instead. If you want to select a whole word in such browsers, or without using a double-click, you need to know how to analyze text to find word boundaries.
The aim of this tutorial is to develop a proof-of-concept system which will allow you to use the arrow keys to jump to neighbouring Thai words. As you will see, distinguishing boundaries between Thai words is best performed asynchronously on the server. This tutorial will show you how to create a minimal word boundary detection system that will run in your browser, and to make asynchronous calls to it, to simulate a connection to the server.
The final step is to simulate a connection to a server to retrieve the dictionary definition of the selected word.
Word boundaries
In spoken language, words tend to flow together. In a sentence like "We went on a long journey", you have to wait until the word "journey" is spoken before you can be sure that "a long" is two words. Conversely, if the sentence was "We went on along the road", you would need to wait until the word "the" to know that the same sounds were the single word "along".
In many written languages, there are visual signs to indicate where words end. In European languages descended from Latin and Greek, a space is used between words, but there is still some ambiguity about whether "can't" is one word or two. In Arabic, letters have different shapes depending on whether they appear at the beginning, in the middle or the end of a word, as an additional clue.
However, in other languages, such as Chinese and Japanese, there are spaces between sentences, but no spaces between words. In Thai and Lao, there may also be spaces between clauses, so the space character itself is ambiguous. In Vietnamese, there is a space between each syllable, while several syllables may combine to make a single word.
This tutorial explores two simple techniques for segmenting written Thai into words. However, there is no generic automated solution for this, for several reasons, including:
- Ambiguous word boundaries
- Unknown words
Ambiguous word boundaries
Word boundaries found in identical character sequences can depend on the meaning, which is inaccessible to anyone (including a computer) who does not speak the language. A classic example in Thai is ตากลม, which can be ตา|กลม (round eyes) or ตาก|ลม (exposed to the air). You see a similar kind of ambiguity in the following English sentences:
- Those who died are devils
- The do or die daredevils
- The door did not open
If this were written with no spaces, a machine or non-English speaker would have no way of knowing where to place the word boundaries. There are multiple ways of dividing the letters to yield real English words.
Unknown words
A dictionary-based approach fails as soon as it encounters an unknown word. At best, it can skip over chunks that do not correspond to any entry in the dictionary, but this can be a costly and error-prone process. If it had no spaces, and the name "Houdini" was not in the dictionary, a sentence like ...
A noted wit, Houdini told many jokes.
... could be segmented as ...
a noted with ou din it old many jokes
... or ...
a noted wi thou din it old many jokes
... with one unknown word leading to a longer garbled passage, and a misleading chunk being labelled "unknown".
This tutorial will not address either of these issues, which are still the subject of much research.
- Create a trie from a dictionary
- Use the trie to detect word boundaries in EnglishNoSpaces
- Apply the trie technique to Thai text
- Use a "rule-based" system to identify syllable boundaries in Thai text
- Run an asynchronous method (to simulate an AJAX request to a server) to obtain the word segmentation of a chunk of text
- Use the callback result to select the word indicated by a click on a press on an arrow key.
Note: the selection.js
code used in this tutorial is describe in more detail here.