I've posted a similar thread in the Java forum, but I think it is interesting on a wider scale, too. So do you have any experience with OCR software, libraries, or SDKs? What would you recommend, what are the pitfalls, how do you optimize performance, etc.? If you write about one, please mention the platforms it is available for, its licence, and maybe the supported character sets. Did you integrate a library into your software, or use a standalone application?
So it looks like I am the only one interested in this, but since I've seen threads like this go mostly unanswered in several forums, I thought I would sum up the findings I have so far.

I have tried a couple of applications. As a first pass, I just wanted to see what they could do with a screenshot of a PDF file, to simulate ideal scanning. The next round will be testing the engines with real scans, and finding out the right scanning settings and the preprocessing the pictures need.

Simple OCR (commercial) was a letdown, since it didn't support Latin2 characters (or at least the demo lacked it), so it went nuts on my sample page.

I was really excited about GOCR and Conjecture, as open source OCR that seemed like the easiest to incorporate into an application. Well, it had problems with Latin2, too - maybe it could be extended in some way, but if there were an out-of-the-box solution...

Tesseract from Google: same problem as above, only US charset support, although since it is open source and the training code is included, it might be worth a second look if everything else fails...

And here come the big guns. These are commercial ones, but they seem to be fine:

ABBYY FineReader 8.0 - this one worked perfectly.

ScanSoft OmniPage 15.0 - at first I started it in some batch mode, and it just hung on the first page. Then I found out that it can work as a normal application, like ABBYY FineReader, so I tried that too, and it worked fine. I'll give the batch mode another go, because it seems to have very interesting automation in it that would suit my needs well.

Both of the latter two support a multitude of languages and charsets. Anyway, unfortunately it looks like I'll have to go with one of the commercial ones. They both have SDK licenses (probably as COM modules), which sounds interesting, although I think right now they are too expensive for my customer.
So probably I won't implement full-scale integrations, and some things will have to be done manually right now, but since the amount of OCRing is minimal, it should work out fine...
I highly recommend TOCR (http://www.transym.com). It's no frills and it works really, really well. It's not free but it's CHEAP for what you get. And support is responsive too. Unfortunately, it doesn't run on *nix, only Windows. You get a library you can work into your code. All I had to do was modify the VB demo for my needs and it's been excellent.
Enlai

PS: No, I don't work for them - I'm a paying customer!
So I've tried TOCR, and it was about on par with GOCR. It doesn't support Latin2 characters and simply skipped a large part of my test data, but what it did recognize was 99% correct. However, the provided SDK might be useful in some cases. I've only tested it with the application provided, so if there is a way to feed it some other charsets it might be useful (although it seemed to have some problems with accents...).
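For anyone who wants to put numbers on comparisons like this, here is a rough sketch of how an engine's output could be scored against a hand-typed reference. It is not what I used above; the function names and the sample strings are made up for illustration, and it only relies on Python's standard difflib module. It gives a character-level match ratio plus the set of accented (non-ASCII) characters the engine never produced, which is a quick way to spot missing Latin2 support.

```python
import difflib


def char_accuracy(expected, recognized):
    """Fraction of the expected characters that the OCR output matched, in order."""
    if not expected:
        return 1.0
    # autojunk=False keeps difflib from ignoring frequent characters on long pages.
    matcher = difflib.SequenceMatcher(None, expected, recognized, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(expected)


def missed_characters(expected, recognized):
    """Non-ASCII characters present in the reference but absent from the OCR output."""
    return {ch for ch in expected if ord(ch) > 127 and ch not in recognized}


if __name__ == "__main__":
    # Hypothetical sample: reference text vs. what an ASCII-only engine might return.
    reference = "árvíztűrő"
    ocr_output = "arvizturo"
    print("accuracy:", round(char_accuracy(reference, ocr_output), 3))
    print("missed:", sorted(missed_characters(reference, ocr_output)))
```

Keep in mind this measures accuracy over the whole page, so an engine that skips a large block of text scores low even if the part it did read was nearly perfect.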
I am new to this forum. I have downloaded the GOCR OCR software.
But I don't know how to run it. There is no EXE file in the downloaded folder, only one batch file, and when I double-click that, a DOS window opens and closes within a second.