Can you support Glagolitic?
May 25th, 2007 by Andrew Jenks
The Glagolitic alphabet was invented during the 9th century by the missionaries St Cyril (c. 827-869 AD) and St Methodius (c. 815-885 AD) in order to translate the Bible and other religious works into the language of the Great Moravia region. What does that have to do with e-discovery? Believe it or not, everything, if you care about Unicode.
There are different revisions of Unicode, each adding or improving writing systems for use in the electronic world. Within Unicode there is a concept of encoding, and encoding is where things can get crazy. An encoding is the way in which the computer uses a series of numbers to store the characters that are represented. It leaves the display up to the software (browser, office document, email client) and does a nice job of separating the two. There are limitations to encoding documents this way, but I’m not going into that now. You can check the Wikipedia page if you want more info.
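To make the idea of “a series of numbers” concrete, here is a minimal Python sketch. The sample character is my own choice for illustration; it only shows that one character has a single Unicode code point but different stored bytes under different encodings.

```python
# One character, three different byte sequences depending on the encoding.
text = "é"

code_point = ord(text)                # the number Unicode assigns: 233 (U+00E9)
as_latin1 = text.encode("latin-1")    # ISO 8859-1 stores it as one byte
as_utf8 = text.encode("utf-8")        # UTF-8 stores it as two bytes
as_utf16 = text.encode("utf-16-le")   # UTF-16 (little-endian) stores two other bytes

for name, raw in [("latin-1", as_latin1), ("utf-8", as_utf8), ("utf-16-le", as_utf16)]:
    # Print the raw byte values that would actually sit on disk.
    print(f"{name:10} {list(raw)}")
```

The display software only produces the right glyph if it knows which of these byte layouts it is looking at.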
Back to our problem: what do you need to look for in a vendor that can support Unicode? Well, there are different standards that all get called “Unicode”. The most common is ISO 8859, which strictly speaking is not Unicode at all but a family of older single-byte encodings. Its most common part, ISO 8859-1 (Latin-1), covers a couple dozen languages, but they’re all based on the Latin character set. So if you care about Norwegian, a vendor that can support this standard is going to be fine for the job. Where it begins to get tricky is with the “double-byte” characters. ISO 10646 is the standard that goes beyond single-byte encodings like ISO 8859, and it defines the same character set as Unicode. Characters (letters, numbers, symbols, ideograms, logograms, etc.) from the many languages, scripts, and traditions of the world are represented in ISO 10646 with unique code points. The inclusiveness of ISO 10646 is continually improving as characters from previously unrepresented writing systems are added. This standard allows Discovery Mining to support almost every written language and almost every language supported by machines. As originally designed, it defined 128 groups of 256 planes of 256 rows of 256 cells, for a total of 2,147,483,648 possible code positions, vastly more than the 256 available in any single part of ISO 8859. Yes, this includes the CJK (Chinese, Japanese, Korean) set.
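The arithmetic above is easy to check, and the same few lines show why a Latin-only encoding cannot hold Glagolitic. This is a small sketch; the variable names are mine, and U+2C00 is the code point Unicode assigns to the first capital letter of the Glagolitic block.

```python
import unicodedata

# Code positions in the original ISO 10646 layout versus one ISO 8859 part.
iso10646_positions = 128 * 256 * 256 * 256   # groups * planes * rows * cells
iso8859_positions = 256                      # one byte per character

# A Glagolitic letter has a code point, but no slot in a single-byte Latin set.
azu = "\u2c00"
name = unicodedata.name(azu)
utf8_length = len(azu.encode("utf-8"))       # bytes needed to store it in UTF-8

try:
    azu.encode("latin-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False

print(iso10646_positions, iso8859_positions, name, utf8_length, fits_in_latin1)
```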
What also makes this tricky is that within each language there are “sub” encodings as well, further complicating the issue. Electronic discovery vendors that support Unicode are sometimes fooled by the lack of reliable metadata about the files. Not every file has a label identifying which encoding was used to store it, and the label that does accompany some files is occasionally incorrect. What does this mean? It means that when data is collected, even if it’s collected in a forensically sound way, you only have a “decent” chance of being able to get the encoding directly from the file. Let’s use an example:
If your encoding metadata is wrong and the processing engine sees a series of numbers like 84-104-101 (in ASCII, the word “The”), it can’t know whether they mean “The” or “and” or just gibberish. The computer spits out a word anyway, and it takes a native speaker to realize that you processed gibberish.
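A short Python sketch of that failure mode follows. The Japanese sample string and the choice of Shift_JIS, Latin-1, and the EBCDIC code page cp037 are assumptions picked for illustration, not a description of any particular data set.

```python
# The three numbers from the example, read under two different labels.
raw = bytes([84, 104, 101])
as_ascii = raw.decode("ascii")     # "The"
as_ebcdic = raw.decode("cp037")    # an IBM EBCDIC code page: other characters, no error

# Japanese text stored as Shift_JIS but labelled as Latin-1.
japanese = "日本語"
stored = japanese.encode("shift_jis")
misread = stored.decode("latin-1")      # wrong label: decodes "successfully" into gibberish
correct = stored.decode("shift_jis")    # right label: the original text comes back

print(repr(as_ascii), repr(as_ebcdic))
print(repr(misread), repr(correct))
```

Note that the wrong decode raises no exception. The software has no built-in way to notice the mistake, which is why a human reader is usually the one who catches it.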
In the example above your vendor gives you data that “looks” right, but when the native Japanese speaker reads it, it means nothing. Since Unicode only took shape during the 1990s, and wasn’t really widely used until the last few years, there is an enormous amount of old data encoded in a variety of non-standard ways. This is a huge problem for processing foreign language data because you want to provide customers with non-gibberish for them to review and search. The problem is exacerbated by the fact that it’s very common to find two or more different encodings within each chunk of data that Discovery Mining receives from our customers, with some of the files stored on disk in one manner and other files stored in another (e.g. a set of emails with attachments from both Microsoft and IBM products).
Does this mean you should give up? No. What most vendors in our space do, especially for double-byte characters, is employ a person to make sure that the encoding on the documents is correct: a sort of native-speaker human QC. This is good, but it can slow things down or add cost to your Unicode project. Another issue is multiple encoding types in one string of documents. What happens when an email is encoded one way, the attachment another, and the reply differently from both? There is another way: use software.
As you can imagine, the software on your computer must be able to switch between encoding types on the fly without degrading the content on your screen. Because, as of today, most encoding metadata cannot be trusted, Microsoft and other software vendors have written “smart” programs to get around this. This is the approach of Discovery Mining. It is harder, but in the long run better. It gives greater than 99% accuracy on a mixed set of documents for both processing and search. We use the declared encoding as a hint, but we don’t trust it for processing. Typically we see about a 65% accuracy rate on the encoding contained in the metadata we are processing; our decoder software picks up the additional 34%. Discovery Mining’s software solves the problem by looking at each individual file, automatically identifying the correct encoding and the language that is used, converting it to Unicode, and then processing it with our normal Unicode facilities. What’s nice about this approach is that we can process data, regardless of language, in the “normal” way and be confident that it’s correct to the native speaker who has to review the documents.
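To give a feel for the “use the label, but don’t trust it” idea, here is a deliberately simple Python sketch. It is not Discovery Mining’s decoder, whose internals are not described here. The function name `decode_untrusted` and the candidate list are hypothetical, and the only check used is whether a strict decode succeeds.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical candidate list, ordered from strictest to most permissive.
CANDIDATES = ("utf-8", "shift_jis", "euc_jp", "cp1252")


def decode_untrusted(
    data: bytes,
    declared: Optional[str],
    candidates: Sequence[str] = CANDIDATES,
) -> Tuple[str, str]:
    """Return (encoding_used, text), treating the declared label as a hint only."""
    # Try the declared label first, then every other candidate in order.
    to_try = ([declared] if declared else []) + [c for c in candidates if c != declared]
    for encoding in to_try:
        try:
            # Strict decoding: invalid byte sequences raise instead of being replaced.
            return encoding, data.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong encoding for these bytes, or a label Python does not know.
            continue
    # Last resort: Latin-1 maps every byte to some character, so it cannot fail.
    return "latin-1", data.decode("latin-1")


# A Shift_JIS file whose metadata wrongly claims UTF-8.
mislabelled = "日本語".encode("shift_jis")
print(decode_untrusted(mislabelled, declared="utf-8"))
```

A real detector needs much more than this. Single-byte code pages accept almost any byte sequence, so a successful decode proves little on its own, and production tools typically add statistical or language-based checks on top.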
While there are many vendors in the market that are beginning to support Unicode, I’d say that asking what “type” of Unicode is supported, and how encoding issues are handled, is a great place to start the questions.