History
For around two decades I have used JDict, a Japanese dictionary application. I still think it is the best free Japanese dictionary. Yet two decades is a long time for stagnation, so I tried to find a better variant. From what I could gather, the dictionary data is based on EDict, maintained by Jim Breen of Monash University. While looking for free applications that rely on EDict, I found these:
- Word Processor - JWPce
- OCR Programs - KanjiTomo (Java - Windows, Linux, Mac)
- OCR Programs - Capture2Text (AutoHotKey - Windows)
Testing OCR Program for Japanese Text
Armed with the keyword "OCR Japanese" I tried to find other alternatives; my search yielded another: SmartOCR Lite.
Capture2Text
Under the hood it relies on NHocr and Tesseract as the OCR engines. The developer has added another layer: binarization (aka thresholding) to sharpen smudged characters (see how awesome it is).
Steps to Test:
- Download -- the latest is v3.3, I tried v3.1.
- Extract to any folder
- Execute
- Select a section of Japanese characters
I tried several times in Windows XP; it never worked. Since I have no idea how to troubleshoot it, I put this one aside.
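To make the binarization (thresholding) idea mentioned above concrete, here is a minimal sketch in plain Python. It is not Capture2Text's code, and I do not know which thresholding method it uses; Otsu's method is swapped in here only because it is a common way to pick a threshold automatically. The function names and the sample pixel values are made up for illustration.

```python
def otsu_threshold(pixels):
    """Pick the gray level (0-255) that best separates dark and light pixels.

    `pixels` is a flat list of grayscale values. The threshold chosen is the
    one that maximizes the between-class variance (Otsu's method).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels at or below the candidate threshold
    sum0 = 0    # sum of gray levels of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue            # nothing on the dark side yet
        w1 = total - w0
        if w1 == 0:
            break               # nothing left on the light side
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(pixels, threshold):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [0 if p <= threshold else 255 for p in pixels]


# Hypothetical smudged strip: dark ink values mixed with light paper values.
strip = [10, 20, 30, 200, 210, 220]
clean = binarize(strip, otsu_threshold(strip))
print(clean)
```

The point of the step is that the OCR engine then sees only ink or paper, with the gray smudges in between forced to one side or the other.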
KanjiTomo
Wow, the engine is the developer's own creation.
Steps to Test:
- Download
- Extract to any folder
- Install JDK/JRE (the developer recommends JDK over JRE)
- Execute, wait until the loading process is finished
- I need to do these (ymmv)
- Un-check Automatic OCR
- Settings > Text Orientation > Vertical
- Settings > Text Colors > Black on White
- Open an image
- Select 1 or 2 Japanese characters
でもこの年になって ==> becomes
で|この年仁ケつて ==> correct detection: 5/9
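The 5/9 figure is a position-by-position comparison of the original phrase against the OCR output. A small sketch of that count follows; the function name is my own, and it assumes both strings have the same length, which holds for this sample.

```python
def positional_matches(expected, actual):
    """Count positions where the OCR output has exactly the expected character."""
    return sum(1 for e, a in zip(expected, actual) if e == a)


expected = "でもこの年になって"
actual = "で|この年仁ケつて"

matches = positional_matches(expected, actual)
print(f"correct detection: {matches}/{len(expected)}")
```

Note that the small っ and the full-size つ are different characters, so that position counts as a miss.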
I make no assumption that the OCR quality is bad; the application highlights "the best match" yet gives us the opportunity to select another from the list of similar characters. Alas, my XP had no support for East Asian languages then, so I never changed from the default.
One obvious drawback is the inability to process multiple columns. As shown in the second image, 6 characters are treated as a single character. This drawback is compounded by the fit-to-screen image loading implementation, which means I have to select little by little. A zoom feature has been provided to address this issue.
Edit: 4-Jul-2014 70:20 GMT+7
KanjiTomo's creator has kindly advised using the middle mouse button to drag the zoom window. Coincidentally, I stumbled upon Joel's article on UI design - a nice piece, though he took his sweet time to get his message across.
SmartOCR Lite
Test Data
Since this application is a product - instead of an individual effort - I have higher expectations; hence a more thorough test.
Test data consists of 3 strips of 4koma manga, with all pictures removed and the characters realigned. This tedious extra step is required because:
- Feeding a whole page causes false detection (ex: a picture of mouth is detected as a character)
- Fragments ordering is a mess. It's easier to order the source once rather than to re-order the result as many times as the number of tests.
Steps to Test:
- Install Microsoft .NET Framework 1.1
- Install Microsoft .NET Framework Japanese language pack
- Install SmartOCR Lite (The original download link is down and I got my copy from a server that is inaccessible from other countries)
- Restart PC
- Open an image (Ctrl+O)
- On the result window - to the left of image window - change display mode (表示モード) to horizontal (横書き).
- (Ignoring the fact that the result window is dominated by ? instead of proper characters) Save the result as HTML.
- If you save it as txt, it will be unreadable in Notepad. To read it you'll need Notepad++ and select Encoding > Character Sets > Japanese > Shift-JIS.
- HTML has a different problem: some characters might be detected but not displayed due to a CSS bug. I need to check the source in Notepad++.
- Compare the result against original.
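The unreadable txt file in the steps above is an encoding mismatch, so an alternative to switching the encoding in Notepad++ is to convert the bytes to UTF-8 once. A minimal sketch follows. It assumes the saved file is in the Windows flavour of Shift-JIS, which Python calls `cp932`; I have not verified which variant SmartOCR Lite actually writes. The function name is hypothetical.

```python
def sjis_bytes_to_utf8(data: bytes) -> bytes:
    """Re-encode Shift-JIS (Windows code page 932) bytes as UTF-8."""
    text = data.decode("cp932")   # bytes -> str, interpreting them as Shift-JIS
    return text.encode("utf-8")   # str -> bytes that UTF-8 aware editors can read


# Stand-in for the contents of a saved result file.
sample = "でもこの年になって"
raw = sample.encode("cp932")

converted = sjis_bytes_to_utf8(raw)
print(converted.decode("utf-8"))
```

For a real file, read it with `open(path, "rb")`, pass the bytes through the function, and write the result back with `open(new_path, "wb")`.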
On the same phrase (でもこの年になって) used to test KanjiTomo, SmartOCR managed to read them flawlessly. The overall performance, however, is a different story. As shown in the following table, the success rate is merely 60%.
Recapitulation | Total | Kanji | Kana | Others |
---|---|---|---|---|
Total chars | 299 | 80 | 182 | 37 |
Not Detected | 103 | 17 | 60 | 26 |
Wrong Detection | 15 | 6 | 4 | 5 |
Recognized | 181 | 57 | 118 | 6 |
Success Rate | 60% | 71% | 65% | 16% |
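The success rates in the table are simply recognized characters divided by total characters per category. The sketch below recomputes them from the counts above and checks that each column adds up; the dictionary layout and function names are my own. Computed this way the Total comes out just above 60.5%, so the 60% in the table appears to be rounded down.

```python
# Counts copied from the recapitulation table.
counts = {
    "Total":  {"chars": 299, "not_detected": 103, "wrong": 15, "recognized": 181},
    "Kanji":  {"chars": 80,  "not_detected": 17,  "wrong": 6,  "recognized": 57},
    "Kana":   {"chars": 182, "not_detected": 60,  "wrong": 4,  "recognized": 118},
    "Others": {"chars": 37,  "not_detected": 26,  "wrong": 5,  "recognized": 6},
}


def success_rate(row):
    """Fraction of characters that were recognized correctly."""
    return row["recognized"] / row["chars"]


def is_consistent(row):
    """Every character must be either missed, misread, or recognized."""
    return row["not_detected"] + row["wrong"] + row["recognized"] == row["chars"]


for name, row in counts.items():
    print(f"{name}: {success_rate(row):.1%} (adds up: {is_consistent(row)})")
```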
Another flaw: SmartOCR always failed to detect double punctuation marks (!! and !?) - granted, some are quite close to the marks.
List of invalid detections, in the format actual (detected):
Type | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
Kanji | 買 (ロ) | 動 (助) | 園 (口) | 懐 (凄) | 菜 (渠) | 二 (」」) |
Kana | え (乳) | が (桝) | お (靖) | は (ば) | | |
Others | !! (11日) | !! (岬) | !! (11H) | !? (17) | !? (盟) | |
- Title of 2nd 4koma is big and clear; how could it read 懐(凄) wrongly?
- Repeated phrase (動物園) in 1st 4koma. It was correct the 1st time, yet on the next recurrences it failed: 動(助) and 園(口).
Prior to the language pack installation, JWPce (in WinXP) worked with UTF-8: I could type there, save the text as a txt file and open it on another box (Windows 7). After installation, everything works with Shift-JIS. No matter what I do, newer txt files are unreadable in Windows 7.
KanjiTomo Redux
Previously, I accepted all the 口 from KanjiTomo, saved them as a txt file and opened it in Windows 7. This method produced a 5/9 detection rate.
After the language pack installation I can see the default selection as well as the other alternatives in the list.
Steps to Test:
Using the same test data as SmartOCR Lite
- Read all those detected incorrectly by SmartOCR
- Read (random picks) those unreadable by SmartOCR
- Read (random picks) those detected correctly by SmartOCR.
Result
Of the 15 invalid detections, お always got 七 as the "best fit" even though お was listed as an alternative. KanjiTomo always failed to detect the double punctuation marks (!! and !?) as well, but the other 9 were detected correctly.
Those unreadable by SmartOCR were detected correctly by KanjiTomo.
Those detected correctly by SmartOCR were detected correctly by KanjiTomo as well.
KanjiTomo's OCR engine has the better detection capability - considering the caveats below.
Caveats
Character selection in KanjiTomo is fragile. Precision in click-n-drag is paramount.
Sometimes a single character is detected as 2 characters. Other times (and this is the worst) different boxing yields a different character. Refer to the image on the side. Observe how the 2nd attempt (with tight boxing) produces the correct character while the first attempt (with almost equal spacing on all sides; the box doesn't intersect a neighboring character) produces the wrong character.
To clarify: it's not about tightness. Simply that different boxing may yield different characters. In fact for this particular character getting the wrong result is harder than getting the correct one.
I got the impression this application relies on a dictionary to guess some characters. If I box 3 characters that form a phrase/meaning, I get that phrase. On the other hand, if I box 2 characters that don't form any, the default for the 1st and/or 2nd character may be wrong or blank.
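To illustrate how a dictionary could rescue individually misread characters, here is a toy sketch. It is not KanjiTomo's algorithm; the function name, the candidate lists and the scores are all invented for illustration. Each boxed character has a ranked list of look-alike candidates, and a combination that forms a dictionary word is preferred over the per-character best guesses.

```python
from itertools import product


def pick_reading(candidates, dictionary):
    """Choose a reading for a run of boxed characters.

    `candidates` holds one list per box, each a list of (character, score)
    pairs with the best visual match first. If some combination of
    candidates is a dictionary word, the highest-scoring such word wins;
    otherwise fall back to the top candidate of every box.
    """
    best_word, best_score = None, float("-inf")
    for combo in product(*candidates):
        word = "".join(ch for ch, _ in combo)
        score = sum(s for _, s in combo)
        if word in dictionary and score > best_score:
            best_word, best_score = word, score
    if best_word is not None:
        return best_word
    return "".join(box[0][0] for box in candidates)


# Made-up scores: the wrong look-alikes rank first in two of the three boxes.
boxes = [
    [("助", 0.9), ("動", 0.8)],
    [("物", 0.9)],
    [("口", 0.7), ("園", 0.6)],
]

print(pick_reading(boxes, {"動物園"}))  # dictionary available
print(pick_reading(boxes, set()))       # no dictionary word matches
```

With the word in the dictionary the run reads as 動物園; without it, the per-box best guesses are returned unchanged, which matches the behaviour described above for boxes that do not form a phrase.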
Conclusion
For quick and dirty processing, SmartOCR Lite is the clear winner among the 3 participants; however, considering its weaknesses, it is a better fit for processing a textbook with a plain font. Applying binarization (aka thresholding) from Capture2Text may increase the success rate a bit, but I doubt the underlying OCR engine can properly process thick, rounded fonts like those in the test data above.
For accuracy, KanjiTomo is the clear winner.
Considering the hassle of KanjiTomo, the recommended way is to use SmartOCR first, followed by KanjiTomo to fill in the blanks.
Hello, I'm the author of KanjiTomo
Thank you for trying out the program, I have a few comments.
You should not run the program in a computer that doesn't have Japanese fonts. Not only are results not displayed, but actual detection accuracy will be bad. If you later install a font pack, please re-install the program so that cache is generated again; this might improve accuracy.
Kana detection is not as good as kanji detection; I'm assuming that people who most benefit from this program already know kana, so I have not made that much effort in kana detection compared to kanji.
Having to select individual words/characters is by design; the program is written for interactive use while reading text, not for batch operations.
You can move the zoom window to the top part of the screen by grabbing it (not the title bar) with the middle mouse button.
I admit that boundary box detection doesn't always work reliably. This is an example of a problem that is easy for humans but difficult for computers. The actual OCR is the opposite; after correct boundary is in place, kanji detection is quite reliable.
You should try the automatic OCR mode again; it is much faster than drawing boxes.
You are right that KanjiTomo uses a dictionary to improve accuracy. If two characters are close matches but one is part of a word rather than a single character, it is prioritized.
>> I admit that boundary box detection doesn't always work reliably. ...
Just a thought...
What if you add another checkbox "snap to grid" when "automatic ocr" is disabled?
With default == true, implement limited "automatic ocr" to stretch/shrink the selection box a bit before attempting to segregate characters.
It might be too much hassle to implement; "auto" expands a dot, whereas this idea expands/contracts an area.
Snap to grid wouldn't work because manual box drawing is really a fallback for situations when grid/character detection has already failed. So the program doesn't know in this case where the grid is located.
Wow, I didn't expect to get feedback from you.
>> Kana detection is not as good as kanji detection; ...
>> Having to select individual words/characters is by design; ...
>> I admit that boundary box detection doesn't always work reliably. ...
>> You are right that KanjiTomo uses dictionary to improve accuracy. ...
Noted
>> You can move the zoom window to top part of screen by grabbing it (not the title bar) with middle mouse button.
Good to know. May I suggest adding a [?] in the bar for people (like me) who don't read documentation?
>> You should not run the program in a computer that doesn't have Japanese fonts.
I've installed the pack and retested; see KanjiTomo Redux.
>> You should try the automatic OCR mode again, it is much faster than drawing boxes.
Nice, very nice.
Tbh, before installing language pack, automatic OCR never yielded anything.
From the console I got either:
- OCR time limit exceeded
- or exception: java.awt.image.RasterFormatException: (x + width) is outside raster
at kanjitomo.reader.RikaiThread.run(RikaiThread.java:239)
at kanjitomo.reader.Reader$OCRThread.run(Reader.java:3589)
The bottom of the stack is the innermost, right?
Cheers,
Most important part of stack trace is usually near the top, but it's best to paste the full trace anyway. RasterFormatException might be related to missing language pack but since Japanese fonts are needed anyway, there's probably nothing to be done with this error except give a better error message if possible.
I think so; I could use the auto-OCR after Japanese font installation.
I uploaded the logs at http://1drv.ms/1xr1Qd8 anyway - jic.
Another log came from ubuntu: java.lang.IllegalArgumentException: The window must use a translucency-compatible graphics configuration <-- should be clear enough as well.
Error about translucency might be related to ubuntu's window manager, but I don't have ubuntu installed so I can't test it. Translucency feature is required by the program because that is how boxes are painted in automatic mode.
You can try this free online Japanese OCR to convert an image to text.
I use onlineocr.org, which can process more than 200 files for free without registration.