History
For around 2 decades I have JDict, a Japanese dictionary application. Until now I still think this is the best (and free) Japanese dictionary.Yet two decades is a long time for stagnation, thus I tried to find a better variant. From what I could gather, the dictionary data is based on EDict, maintained by Jim Breen, Monash University. While trying to find free applications that rely on EDict, I found these:
- Word Processor - JWPce
- OCR Programs - KanjiTomo (Java - Windows, Linux, Mac)
- OCR Programs - Capture2Text (AutoHotKey - Windows)
Testing OCR Program for Japanese Text
Armed with "OCR Japanese" keyword I tried to find other alternatives; my search yielded another: SmartOCR Lite.Capture2Text
Under the hood it relies on NHocr and Tesseract as the OCR engine. The developer has added another layer: binarization (aka thresholding) to sharpen a smudge character (see how awesome it is).Steps to Test:
- Download -- the latest is v3.3, I tried v3.1.
- Extract to any folder
- Execute
- Select a section of Japanese characters
I tried several times in Windows XP; never worked. Since I have no idea how to troubleshoot I put this one aside
KanjiTomo
Wow, the engine is developer's own creation.Steps to Test:
- Download
- Extract to any folder
- Install JDK/JRE (the developer recommends JDK over JRE)
- Execute, wait until the loading process is finished
- I need to do these (ymmv)
- Un-check Automatic OCR
- Settings > Text Orientation > Vertical
- Settings > Text Colors > Black on White
- Open an image
- Select 1 or 2 Japanese characters
でもこの年になって ==> becomes
で|この年仁ケつて ==> correct detection: 5/9
I make no assumption that OCR quality is bad; the application highlight "the best match" yet provided us with an opportunity to select another from the list of similar characters. Alas, my XP had no support for East Asian language then - thus I never change from default.
One obvious drawback is the incapability to process multiple columns. As shown in the second image, 6 characters are considered as single character. This drawback is confounded by fit to screen image loading implementation. The drawback means I have to select little by little. Zoom feature has been provided to address this issue;
Edit: 4-Jul-2014 70:20 GMT+7
KanjiTomo creator has kindly advised to use middle mouse button to drag the zoom window. Coincidently I stumbled upon Joel's article on UI design - a nice piece; though he took his sweet time to deliver his message across.
KanjiTomo creator has kindly advised to use middle mouse button to drag the zoom window. Coincidently I stumbled upon Joel's article on UI design - a nice piece; though he took his sweet time to deliver his message across.
SmartOCR Lite
Test DataSince this application is a product - instead of individual effort - I have higher expectation; hence a more thorough test.
Test data consists of 3 strips of 4koma manga; removing all pictures and realigning the characters. This tedious extra step is required because:
- Feeding a whole page causes false detection (ex: a picture of mouth is detected as a character)
- Fragments ordering is a mess. It's easier to order the source once rather than to re-order the result as many times as the number of tests.
Steps to Test:
- Install Microsoft .Net Framework 1.1
- Install Microsoft .NET framework japanese language pack
- Install SmartOCR Lite (The original download link is down and I got my copy from a server that is inaccessible from other countries)
- Restart PC
- Open an image (Ctrl+O)
- On the result window - to the left of image window - change display mode (表示モード) to horizontal (横書き).
- (Ignoring the fact that result window is dominated with ? - instead of proper characters) Save the result into html.
- If you save it to txt, it would be unreadable in Notepad. To read it you'll need Notepad++ and select Encoding > Character Sets > Japanese > Shift-JS.
- Html has different problem - some characters might be detected but not displayed due to css bug. I need to check the source in Notepad++.
- Compare the result against original.
On the same phrase (でもこの年になって) used to test KanjiTomo, SmartOCR managed to read them flawlessly. The overall performance, however, is a different story. As shown in the following table, the success rate is merely 60%.
Recapitulation | Total | Kanji | Kana | Others |
---|---|---|---|---|
Total chars | 299 | 80 | 182 | 37 |
Not Detected | 103 | 17 | 60 | 26 |
Wrong Detection | 15 | 6 | 4 | 5 |
Recognized | 181 | 57 | 118 | 6 |
Success Rate | 60% | 71% | 65% | 16% |
Another flaw: SmartOCR always failed to detect double punctuation marks (!! and !?) - granted, some are quite close to the marks.
List of invalid detection, with the format actual (wrong) | ||||||
---|---|---|---|---|---|---|
Kanji | 買 (ロ) | 動 (助) | 園 (口) | 懐 (凄) | 菜 (渠) | 二 (」」) |
Kana | え (乳) | が (桝) | お (靖) | は (ば) | ||
Others | !! (11日) | !! (岬) | !! (11H) | !? (17) | !? (盟) |
- Title of 2nd 4koma is big and clear; how could it read 懐(凄) wrongly?
- Repeated phrase (動物園) in 1st 4koma. It was correct the 1st time, yet on the next recurrences it failed: 動(助) and 園(口).
Prior to language pack installation, JWPce (in WinXP) works with UTF-8. I can type here, save it as txt file and open it in another box (Windows 7). After installation, everything work with Shift-JIS. No matter what I do newer txt file is unreadable in Windows 7.
KanjiTomo Redux
Previously, I accepted all the 口 from KanjiTomo, save them as txt file and open them in Windows 7. This method produced 5/9 detection rate.After language pack installation I can see the default selection as well as other alternatives in the list.
Steps to Test:
Using the same test data as SmartOCR Lite
- Read all that detected incorrectly by SmartOCR
- Read (random picks) those unreadable by SmartOCR
- Read (random picks) those detected correctly by SmartOCR.
Result
From 15 invalid detection, お always got 七 as "best fit" even though お was listed as an alternative. Kanjimoto always failed to detect double punctuation marks (!! and !?) as well; but the other 9 were detected correctly.
Those unreadable by SmartOCR were detected correctly by KanjiMoto.Those detected correctly by SmartOCR were detected correctly by KanjiMoto as well.
KanjiMoto OCR engine has better detection capability - considering caveats below.
Caveats
Character selection in KanjiMoto is fragile. Precision in click-n-drag is paramount.
Sometimes single character can be detected as 2 characters. Other times (and this is the worst) different boxing yields different character. Refer to the image on the side. Observe how the 2nd attempt (with tight boxing) produces correct character while the first attempt (with almost equal spacing on all sides; box doesn't intersect neighboring character) produces wrong character.
To clarify: it's not about tightness. Simply that different boxing may yield different characters. In fact for this particular character getting the wrong result is harder than getting the correct one.
I got the impression this application relies on dictionary to guess some characters. If I box 3 characters that form a phrase/meaning I'd got that phrase. On the other hand, if I box 2 characters that doesn't form any, the default for 1st &/ 2nd characters may be wrong or blank.
Conclusion
For quick and dirty processing, SmartOCR Lite is the clear winner among 3 participants; however, considering its weakness, it is better fit to process a textbook with plain font. Applying binarization (aka thresholding) from Capture2Text may increase success rate a bit, but I doubt the underlying OCR engine can properly process thick-rounded fonts as per test data above.For accuracy KanjiMoto is the clear winner.
Considering the hassle of KanjiMoto, the recommended way is to use SmartOCR first followed by KanjiMoto to fill in the blanks.