Scanned and OCR'd PDF--OCR content is not indexed (Network Steve Forum)

I am setting up a new SharePoint 2013 install, and have put a handful of files in a doc library to test search. The content has been indexed, and I can find the content inside many files and file types without issue--including "native" PDF files. However, it doesn't seem to index the content of a scanned and OCR'd (text with image overlay) PDF. I have verified that the text is indeed in the OCR layer by copying and pasting phrases, and I also confirmed that the crawl log shows the file as successfully crawled. The filename is also indexed.

So... it would seem that the SharePoint 2013 indexer does not index the text in scanned and OCR'd PDF files. Am I missing something? Can anyone else confirm this behavior?

Thanks!

Ryan

May 31st, 2013 11:59pm

Hi,

I haven't tried this myself, but I know 2013 has a new internal PDF converter module. It might very well be that this module fails on your files. I also remember someone telling me you cannot change this behavior to use one of the iFilters for PDF either (which do work for OCR files).

You should file a support ticket with MS on this issue.

Thanks,
Mikael Svenson

June 3rd, 2013 5:57pm

Thank you for your response, Mikael. So do you know if this worked for FAST for 2010? I seem to recall that it did, but it has been a while since I checked. I have tried multiple image-overlay PDF files with OCR text--very standard stuff, just B&W images with the OCR text behind the image (and selectable)--and although it shows the file as indexed, no hits are returned for words I know exist in the OCR text (and are OCR'd correctly--I cut and pasted the text to verify).
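Besides copy and paste, the text layer can be checked directly in the file. The following is only a rough diagnostic sketch in Python, not anything SharePoint itself does: it inflates Flate-compressed streams and counts PDF text-showing operators, and it also reports whether text rendering mode 3 (invisible text) is used, which some OCR tools choose for their text layer. The function name and the returned keys are made up for illustration, and the byte-level scan is a heuristic that ignores object streams, other filters, and encryption.

```python
import re
import zlib

# Raw bytes between "stream" and "endstream" for every stream object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# Text-showing operators: "(..) Tj", "<..> Tj", "(..) '" and "[..] TJ".
TEXT_SHOW_RE = re.compile(rb"[)>]\s*(?:Tj|')|\]\s*TJ")
# "3 Tr" selects text rendering mode 3, i.e. invisible text.
INVISIBLE_RE = re.compile(rb"(?<![\d.])3\s+Tr\b")


def _inflate(raw: bytes) -> bytes:
    """Return the inflated stream, or the raw bytes if it is not Flate data."""
    try:
        # decompressobj tolerates the end-of-line bytes that trail the data.
        return zlib.decompressobj().decompress(raw)
    except zlib.error:
        return raw


def inspect_pdf_text_layer(pdf_bytes: bytes) -> dict:
    """Heuristically report text operators found in a PDF's streams."""
    text_show_ops = 0
    invisible_text = False
    for match in STREAM_RE.finditer(pdf_bytes):
        data = _inflate(match.group(1))
        if b"BT" not in data:  # no "begin text" operator in this stream
            continue
        text_show_ops += len(TEXT_SHOW_RE.findall(data))
        if INVISIBLE_RE.search(data):
            invisible_text = True
    return {"text_show_ops": text_show_ops, "invisible_text": invisible_text}
```

Calling inspect_pdf_text_layer(open("sample.pdf", "rb").read()) on a native PDF and on an OCR'd one (the file name is a placeholder) gives two reports to compare. A count of zero suggests an image-only file; a non-zero count means there is text for an indexer to extract. A full PDF library would be more reliable than this scan.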

June 3rd, 2013 7:04pm

Hi,

I believe it did, and it used pdf2html.exe for the conversion. 2013 uses something different, but I'm not sure exactly what. Maybe a custom-written parser.

Thanks,
Mikael Svenson

June 3rd, 2013 7:28pm

Thank you very much. I have submitted a support ticket and will update this with the results.

June 3rd, 2013 8:51pm

Hi,

I know this is an old thread, but I have the same issue. I would like to know what solution Microsoft gave you.

February 18th, 2014 8:16am

Hi,

Your solution in 2013 is to add a content enrichment stage which does the OCR for you after the built-in PDF parser has executed.

You can for example implement Abbyy or similar in this stage.

Either that, or implement the OCR as an event receiver on the library. There are no other solutions at the moment for this.
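To make the enrichment idea concrete, here is a language-neutral sketch of the decision such a stage would make, written in Python for brevity. A real content enrichment callout in SharePoint 2013 is a web service with its own contract, which is not shown here. Everything named below (CrawledItem, enrich_with_ocr, min_chars, and the ocr callable standing in for Abbyy or a similar engine) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CrawledItem:
    """Hypothetical view of what the stage receives for one document."""
    file_extension: str   # e.g. "pdf"
    parsed_body: str      # text the built-in parser managed to extract
    raw_bytes: bytes      # original file content


def enrich_with_ocr(
    item: CrawledItem,
    ocr: Callable[[bytes], str],
    min_chars: int = 20,
) -> Optional[str]:
    """Return OCR text for PDFs whose parsed body is (nearly) empty.

    Returns None when the item should be left untouched.
    """
    if item.file_extension.lower().lstrip(".") != "pdf":
        return None  # only PDFs are handled in this sketch
    if len(item.parsed_body.strip()) >= min_chars:
        return None  # the built-in parser already produced usable text
    text = ocr(item.raw_bytes).strip()
    return text or None  # an empty OCR result is treated as "no enrichment"
```

The min_chars threshold is an arbitrary assumption; the point is only that OCR runs after the built-in parser and only when that parser came back with little or no text.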

Thanks,
Mikael Svenson

February 18th, 2014 8:55am

We are using Read I.R.I.S as our OCR software, but I am still not getting my document contents in the search results.

February 18th, 2014 9:04am

Hi,

So you have a text layer in the PDF, but it's not showing up in search, is that correct? If so, there is a bug, as the built-in PDF extractor should pull out the text layer.

Thanks,
Mikael Svenson

February 18th, 2014 9:06am

Yes, your understanding is correct. So what can the solution be?

February 18th, 2014 9:09am

Hi,

File a ticket with Microsoft Support about it. If possible, I would also try a sample file in Office 365 and see if it works there or not.
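To run the same test against both environments, it can help to issue an identical query to each. A minimal sketch, assuming the search REST endpoint (/_api/search/query with a querytext parameter) is reachable on both sites: the helper below only builds the URL for a phrase you know is in the OCR text. The function name and the site URL in the usage note are placeholders, and authentication is left out entirely.

```python
from urllib.parse import quote


def build_search_url(site_url: str, phrase: str, file_type: str = "pdf") -> str:
    """Build a search REST URL that looks for an exact phrase in one file type."""
    # Double quotes inside the phrase would end the exact-phrase match early.
    cleaned = phrase.replace('"', " ").strip()
    kql = f'"{cleaned}" filetype:{file_type}'
    # querytext is a quoted string literal; embedded single quotes are doubled.
    literal = "'" + kql.replace("'", "''") + "'"
    encoded = quote(literal, safe="")
    return site_url.rstrip("/") + "/_api/search/query?querytext=" + encoded
```

For example, build_search_url("https://contoso.example/sites/test", "some phrase from the scan") gives a URL to request from an authenticated session. If the phrase is found in one environment but not the other, that is useful detail for the support ticket.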

Thanks,
Mikael Svenson

February 18th, 2014 9:10am

I filed a support ticket back in June 2013, and they acknowledged that it is a bug and were submitting it to development, but with no ETA on a fix (understandably). They thought that it would go into the next rollup. However, I haven't heard anything back from support since, and they have not responded to my emails requesting an update (yet). I'll send another today, as this is a rollout-blocking issue for us.

February 18th, 2014 3:03pm

So your question is how to index OCR'd PDF documents using SharePoint 2013? I searched the internet and found this blog, which says: "In Sharepoint versions prior to 2013 there was no PDF Icon and PDF documents would not be indexed for Sharepoint search unless a separate iFilter was installed."

I do not know whether it can serve your purpose.

http://www.aquaforest.com/wp/index.php/configuring-sharepoint-for-pdf-files/

February 21st, 2014 2:59am

Hi,

That post is for 2007 and 2010. 2013 has a different architecture where you cannot override the built-in converters (which you would gather from reading the thread :) )

Thanks,
Mikael Svenson

February 21st, 2014 2:17pm

To clarify:

- From what I've read, iFilters can still be installed, but as Mikael said, they can't override the built-in file format handlers in 2013. 2013 has a built-in handler for PDFs, whereas previous versions required a PDF iFilter for indexing PDFs that have text content. If one could install the Adobe PDF iFilter in 2013 successfully, it would resolve the issue in this thread, but PDF iFilters don't work in 2013.

- Aquaforest makes a product that OCRs PDF files. That takes an image-only PDF and makes the file searchable, but it is not an indexer. Rather, it enables an index engine to make a big collection of OCR'd PDF files searchable via a search engine.

- The built-in PDF handler in 2013 does index native PDFs. It does not index OCR'd PDF files.

So, that's the issue for which I submitted the ticket to Microsoft. In our case, we don't need to OCR our PDF files--they are already OCR'd. But they don't show up in searches.

(Regarding Aquaforest... I've talked with someone there previously--for a non-SharePoint DMS--and they seem to make a cool product, but I don't have any personal experience using it.)

February 21st, 2014 5:26pm

You can try this free online OCR tool to convert images to text (http://www.online-code.net/ocr.html).

September 9th, 2015 10:40pm
