Sunday, June 29, 2014

Testing OCR Program for Japanese Text

History 

For around two decades I have used JDict, a Japanese dictionary application. Even now I still think it is the best (and free) Japanese dictionary.
Yet two decades is a long time to stand still, so I tried to find a better variant. From what I could gather, the dictionary data is based on EDict, maintained by Jim Breen of Monash University. While looking for free applications that rely on EDict, I found these:
  • Word Processor - JWPce 
  • OCR Programs - KanjiTomo (Java - Windows, Linux, Mac) 
  • OCR Programs - Capture2Text (AutoHotKey - Windows) 
The word processor works well with decent functionality while the OCR programs...

Testing OCR Program for Japanese Text 

Armed with the keyword "OCR Japanese", I searched for other alternatives; the search yielded one more: SmartOCR Lite.

Capture2Text 

Under the hood it relies on NHocr and Tesseract as the OCR engines. The developer has added another layer: binarization (aka thresholding) to sharpen smudged characters (see how awesome it is).
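For readers unfamiliar with binarization, here is a minimal sketch in Python of one common approach, Otsu's global threshold, applied to a grayscale image given as a list of rows (values 0-255). It is an illustration only: I don't know which thresholding method Capture2Text actually uses, and the function names are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that best separates dark from light pixels.

    Otsu's method: try every level and keep the one that maximizes the
    between-class variance of the two resulting groups.
    """
    hist = [0] * 256
    for row in pixels:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_threshold, best_variance = 0, -1.0
    weight_bg = 0  # number of pixels at or below the candidate level
    sum_bg = 0     # sum of their intensities
    for level in range(256):
        weight_bg += hist[level]
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        sum_bg += level * hist[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(pixels, threshold=None):
    """Map every pixel to pure black (0) or pure white (255)."""
    if threshold is None:
        threshold = otsu_threshold(pixels)
    return [[0 if value <= threshold else 255 for value in row] for row in pixels]
```

Light gray smudges end up on the white side and dark strokes on the black side, which is why this kind of step can help an OCR engine with scanned pages.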
Steps to Test:
  1. Download - the latest is v3.3; I tried v3.1. 
  2. Extract to any folder 
  3. Execute 
  4. Select a section of Japanese characters 
Result
I tried several times in Windows XP; it never worked. Since I have no idea how to troubleshoot it, I put this one aside.

KanjiTomo

Wow, the engine is the developer's own creation.
Steps to Test: 
  1. Download 
  2. Extract to any folder 
  3. Install JDK/JRE (the developer recommends JDK over JRE) 
  4. Execute, wait until the loading process is finished 
  5. I needed to change these settings (YMMV) 
    1. Un-check Automatic OCR 
    2. Settings > Text Orientation > Vertical 
    3. Settings > Text Colors > Black on White 
  6. Open an image 
  7. Select 1 or 2 Japanese characters 
Result
でもこの年になって ==> becomes
で|この年仁ケつて ==> correct detection: 5/9
I make no assumption that the OCR quality is bad; the application highlights "the best match" yet provides an opportunity to select another from a list of similar characters. Alas, my XP had no East Asian language support at the time - thus I never changed from the default.
One obvious drawback is the inability to process multiple columns. As shown in the second image, 6 characters are treated as a single character. This drawback is compounded by the fit-to-screen image loading: I have to select little by little. A zoom feature is provided to address this issue; but how do I zoom to the top 20% of the image? I can't drag beyond the top of the screen.
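The 5/9 figure is a plain position-by-position comparison of the two strings. A minimal sketch in Python (the function name is hypothetical, and it assumes both strings have the same length, which holds here):

```python
def positional_accuracy(expected, actual):
    """Count the positions where both strings hold the same character."""
    if len(expected) != len(actual):
        raise ValueError("this simple comparison needs equal-length strings")
    correct = sum(1 for e, a in zip(expected, actual) if e == a)
    return correct, len(expected)


correct, total = positional_accuracy("でもこの年になって", "で|この年仁ケつて")
print(f"{correct}/{total}")  # prints 5/9
```

If the OCR output had dropped or split a character, the lengths would differ and an edit-distance measure would be the better fit.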
Edit: 4-Jul-2014 70:20 GMT+7
KanjiTomo's creator has kindly advised using the middle mouse button to drag the zoom window. Coincidentally I stumbled upon Joel's article on UI design - a nice piece, though he took his sweet time to get his message across.

SmartOCR Lite

Test Data
Since this application is a product - instead of an individual effort - I have higher expectations; hence a more thorough test.
Test data consists of 3 strips of 4koma manga, with all pictures removed and the characters realigned. This tedious extra step is required because:
  • Feeding a whole page causes false detection (e.g. a picture of a mouth is detected as a character)
  • Fragment ordering is a mess. It's easier to order the source once than to re-order the result as many times as the number of tests.
Test data was created by basic copy-n-paste operations; there was no extra effort to sharpen the characters or to clean the surroundings (of light gray spots).


Steps to Test: 
  1. Install Microsoft .NET Framework 1.1 
  2. Install Microsoft .NET Framework Japanese language pack
  3. Install SmartOCR Lite (The original download link is down and I got my copy from a server that is inaccessible from other countries)
  4. Restart PC
  5. Open an image (Ctrl+O)
  6. On the result window - to the left of image window - change display mode (表示モード) to horizontal (横書き).
  7. (Ignoring the fact that the result window is dominated by ? instead of proper characters) Save the result as HTML.
    1. If you save it as txt, it will be unreadable in Notepad. To read it you'll need Notepad++ and select Encoding > Character Sets > Japanese > Shift-JIS. 
    2. HTML has a different problem - some characters might be detected but not displayed due to a CSS bug. I need to check the source in Notepad++.
  8. Compare the result against the original.
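Since detected characters can sit in the saved HTML without being displayed, one way to check without reading the source by hand is to strip the markup and look at the text nodes. Below is a minimal Python sketch using the standard html.parser module. It assumes the saved page is Shift-JIS encoded like the txt output (I have not verified this for the HTML), and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <style> and <script>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Styling is ignored, so text hidden by CSS is collected as well.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(raw_bytes, encoding="shift_jis"):
    """Decode the saved page and return its text, one text node per line."""
    parser = TextExtractor()
    parser.feed(raw_bytes.decode(encoding, errors="replace"))
    parser.close()
    return "\n".join(parser.parts)
```

Reading the file in binary mode and passing the bytes to extract_text() is enough; the result is an ordinary Python string that can be compared against the original text.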
Result
On the same phrase (でもこの年になって) used to test KanjiTomo, SmartOCR managed to read it flawlessly. The overall performance, however, is a different story. As shown in the following table, the success rate is merely 60%.
Recapitulation    Total  Kanji  Kana  Others
Total chars         299     80   182      37
Not Detected        103     17    60      26
Wrong Detection      15      6     4       5
Recognized          181     57   118       6
Success Rate        60%    71%   65%     16%
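The arithmetic behind the table is simple: recognized = total - not detected - wrong detection, and success rate = recognized / total. A short Python sketch that recomputes the rows from the counts above (the function names are hypothetical):

```python
counts = {
    # category: (total characters, not detected, wrong detection)
    "Kanji": (80, 17, 6),
    "Kana": (182, 60, 4),
    "Others": (37, 26, 5),
}


def success_rate(total, not_detected, wrong):
    """Return (recognized count, recognized / total)."""
    recognized = total - not_detected - wrong
    return recognized, recognized / total


def summarize(counts):
    """Per-category results plus a 'Total' row summed over all categories."""
    rows = {name: success_rate(*values) for name, values in counts.items()}
    grand_totals = [sum(column) for column in zip(*counts.values())]
    rows["Total"] = success_rate(*grand_totals)
    return rows


for name, (recognized, rate) in summarize(counts).items():
    print(f"{name}: {recognized} recognized, {rate:.1%}")
```

The recognized counts match the table (57, 118, 6 and 181), and the overall rate works out to about 60.5%.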
The biggest miss came from different fonts. Observing the ん character in the 2nd 4koma reveals at least 3 different fonts. SmartOCR failed completely on the thick fonts in the 2nd 4koma.
Another flaw: SmartOCR always failed to detect double punctuation marks (!! and !?) - granted, some of its guesses are visually quite close to the marks.
List of invalid detections, in the format actual (wrong):
Kanji
(ロ)

(助)

(口)

(凄)

(渠)

(」」)
Kana
(乳)

(桝)

(靖)

(ば)
Others
!! (11日)
!! (岬)
!! (11H)
!? (17)
!? (盟)
I didn't expect a 100% success rate from this data - 2 characters are smudged beyond recognition. Nevertheless, the low rate came as a surprise. In particular, these misses were unexpected:
  • The title of the 2nd 4koma is big and clear; how could it read 懐(凄) wrongly?
  • The repeated phrase (動物園) in the 1st 4koma. It was correct the first time, yet on later occurrences it failed: 動(助) and 園(口). 
Another Drawback
Prior to the language pack installation, JWPce (in WinXP) worked with UTF-8. I could type there, save a txt file and open it on another box (Windows 7). After the installation, everything works with Shift-JIS. No matter what I do, newer txt files are unreadable in Windows 7.
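One possible workaround, which I have not tried on this setup, is to re-encode the file before moving it to the other machine. The Python sketch below assumes the file really is Shift-JIS (cp932 is the Windows flavour of it) and writes UTF-8 with a byte order mark, which Notepad generally recognizes. The function name and file names are hypothetical.

```python
from pathlib import Path


def reencode(src, dst, src_encoding="cp932", dst_encoding="utf-8-sig"):
    """Rewrite a text file in another encoding and return the decoded text.

    Raises UnicodeDecodeError if src is not valid in src_encoding, which is
    a useful hint that the guess about the source encoding was wrong.
    """
    text = Path(src).read_bytes().decode(src_encoding)
    Path(dst).write_bytes(text.encode(dst_encoding))
    return text
```

For example, reencode("notes_sjis.txt", "notes_utf8.txt") would leave the original untouched and write a UTF-8 copy next to it.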

KanjiTomo Redux

Previously, I accepted all the 口 from KanjiTomo, saved them as a txt file and opened it in Windows 7. This method produced the 5/9 detection rate.
After the language pack installation I can see the default selection as well as the other alternatives in the list.
Steps to Test:
Using the same test data as SmartOCR Lite
  1. Read all characters detected incorrectly by SmartOCR
  2. Read (random picks) those unreadable by SmartOCR
  3. Read (random picks) those detected correctly by SmartOCR
Result
Of the 15 invalid detections, お always got 七 as "best fit" even though お was listed as an alternative. KanjiTomo always failed to detect double punctuation marks (!! and !?) as well; but the other 9 were detected correctly.
Those unreadable by SmartOCR were detected correctly by KanjiTomo.
Those detected correctly by SmartOCR were detected correctly by KanjiTomo as well.
The KanjiTomo OCR engine has better detection capability - considering the caveats below.

Caveats
Character selection in KanjiTomo is fragile. Precision in click-n-drag is paramount.
Sometimes a single character is detected as 2 characters. Other times (and this is the worst) different boxing yields a different character. Refer to the image on the side. Observe how the 2nd attempt (with tight boxing) produces the correct character while the first attempt (with almost equal spacing on all sides; the box doesn't intersect a neighboring character) produces the wrong character.
To clarify: it's not about tightness. It is simply that different boxing may yield different characters. In fact, for this particular character getting the wrong result is harder than getting the correct one. 

I got the impression this application relies on a dictionary to guess some characters. If I box 3 characters that form a phrase/meaning, I get that phrase. On the other hand, if I box 2 characters that don't form any, the default for the 1st and/or 2nd character may be wrong or blank.
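To illustrate what such dictionary-assisted guessing could look like, here is a toy Python sketch. It is purely my speculation about the behavior, not KanjiTomo's actual algorithm; the function name, the word list and the candidate lists are all made up, and it assumes every position has at least one candidate.

```python
from itertools import product


def pick_by_dictionary(candidates, dictionary):
    """Choose one character per position, preferring a known word.

    candidates: one list per character position, best visual match first.
    Returns the first combination found in the dictionary; if none is
    found, falls back to the best visual match at every position.
    """
    for combo in product(*candidates):
        word = "".join(combo)
        if word in dictionary:
            return word
    return "".join(options[0] for options in candidates)


dictionary = {"動物園"}                                 # toy word list
candidates = [["助", "動"], ["物", "牧"], ["口", "園"]]  # made-up OCR candidates
print(pick_by_dictionary(candidates, dictionary))      # prints 動物園
```

With only two boxed characters that form no word, the same function falls back to the first candidate at each position, which would match the wrong defaults I saw.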

Conclusion

For quick and dirty processing, SmartOCR Lite is the clear winner among the 3 participants; however, considering its weakness, it is a better fit for processing a textbook with a plain font. Applying binarization (aka thresholding) as in Capture2Text may increase the success rate a bit, but I doubt the underlying OCR engine can properly process thick, rounded fonts like those in the test data above.
For accuracy KanjiTomo is the clear winner.
Considering the hassle of KanjiTomo, the recommended way is to use SmartOCR first, followed by KanjiTomo to fill in the blanks.
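The fill-in-the-blanks step can be sketched in Python as a position-wise merge. This assumes both results are already aligned character by character and that undetected characters are marked with ? (as in the SmartOCR result window); the function name and the sample strings are made up for illustration.

```python
def fill_in_blanks(primary, secondary, missing="?"):
    """Keep each primary character unless it is the missing marker,
    in which case take the secondary character at the same position."""
    if len(primary) != len(secondary):
        raise ValueError("both results must cover the same positions")
    return "".join(s if p == missing else p for p, s in zip(primary, secondary))


smartocr_result = "でも?の年に?って"     # made-up sample with two blanks
kanjitomo_result = "でもこの年仁なって"   # made-up sample with one wrong character
print(fill_in_blanks(smartocr_result, kanjitomo_result))  # prints でもこの年になって
```

Note that this only fills blanks; characters that SmartOCR detected wrongly stay wrong and still need a manual check.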