While crawling a PDF file, the crawler returns the description shown below,
where the first line is the file name with its URL and the rest is the description of the PDF file. The description shows numbers in place of the PDF's text.
The PDF we crawled was:
Please suggest a way to solve this.
Let me know if you need any further details.
I tried to get a solution from the iTextSharp community but have had no reply so far. In the meantime, I tried another iTextSharp method:
iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(). With this method, the PDF attached to this message, which previously produced numbers, now shows the proper data.
Could you please look at this method and confirm whether it is fine to use in place of the existing one?
Below is a screenshot of the code I changed to get the PDF's response. I changed the code in the "Arachnode.Plugins.CrawlActions.ManageLuceneDotNetIndexes" class, in the "PerformAction()" method.
Let me know if you have any concerns.
I got that method from a forum, but it has one more overload, which does not require a text extraction strategy to be specified.
Using that overload, I crawled the same PDF file again. It previously did not show proper data and now shows the actual content. I changed the code as shown below.
Please let me know whether we can use this as a solution. Below is a screenshot of the code change in the same class, "Arachnode.Plugins.CrawlActions.ManageLuceneDotNetIndexes", in the "PerformAction()" method.
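For readers following along, here is a rough sketch of how the two PdfTextExtractor.GetTextFromPage() overloads discussed above can be called with iTextSharp 5.x. It is not the code from the poster's change or from arachnode.net: the class name PdfTextSketch, the method ExtractText and the useExplicitStrategy flag are made up for illustration, and SimpleTextExtractionStrategy is only one of the strategies that could be passed.

```csharp
using System;
using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfTextSketch
{
    // Hypothetical helper, not part of arachnode.net.
    // Extracts the text of every page using one of the two overloads.
    public static string ExtractText(byte[] pdfBytes, bool useExplicitStrategy)
    {
        var text = new StringBuilder();
        var reader = new PdfReader(pdfBytes);
        try
        {
            // iTextSharp page numbers are 1-based.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // A fresh strategy per page, since a strategy accumulates
                // the text it has seen so far.
                string pageText = useExplicitStrategy
                    ? PdfTextExtractor.GetTextFromPage(reader, page, new SimpleTextExtractionStrategy())
                    : PdfTextExtractor.GetTextFromPage(reader, page);
                text.AppendLine(pageText);
            }
        }
        finally
        {
            reader.Close();
        }
        return text.ToString();
    }

    public static void Main(string[] args)
    {
        // args[0] is assumed to be the path of a local copy of the PDF.
        byte[] pdfBytes = File.ReadAllBytes(args[0]);
        Console.WriteLine(ExtractText(pdfBytes, useExplicitStrategy: false));
    }
}
```

Running both variants against the same file and comparing the output is a quick way to see whether the choice of overload makes any difference for a given PDF.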
Hmm... I don't see this. What is the full AbsoluteUri of the .pdf from atkins, etc.?
Looks like changes have been made to the source over there. Could this be the cause? Could you perform a diff between your source and the trunk?
Apologies for the delay in replying.
First, I made some changes in "ManageLuceneDotNetIndexes.cs" and "CustomManageLuceneDotNetIndexes.cs". I added a "text" field to the document as below:
document.Add(new Field("text", UserDefinedFunctions.ExtractText(contentToIndex).Value, Field.Store.YES, Field.Index.ANALYZED));
This "text" field holds the response of the PDF file or web page.
That is, when crawling such a PDF, the PDF's response comes back as numbers.
You can check this from the
response variable "stringBuilder", as highlighted below, by crawling that PDF again in the class "ManageLuceneDotNetIndexes.cs": just add a breakpoint on CreateDocument() and check the value of the "stringBuilder" variable.
Below is the URL that produces numbers in the description ("text" field) when crawled.
Also, the screenshot you shared has no description. It shows only "...".
Let me know if you need any further details.
I can take a look in a day or less... busy, busy...
Looks like iTextSharp isn't crashing when I use the latest version.
It still doesn't know how to extract the text though... looking at it further...
As best as I can tell, the fonts/text are actually vector positions. So, there isn't any text to capture. ???
I updated itextsharp.dll FWIW. It seems to have fixed a problem in reading the byte into the PdfReader object.
Thanks for the reply. Can you please tell me which project you updated so I can get the latest from SVN?
I updated the Library and the Plugins projects. You can always look at SVN > Check for Modifications to see what has changed.
Apologies for the delay in replying. I took the latest from SVN and checked the same file by crawling it again.
It still shows the same issue.
I took the latest copy of both the Plugins and Library projects.
Please suggest a way forward.
The issue is that it appears there isn't any text to extract. AN uses iTextSharp, so beyond this, if iTextSharp doesn't work to extract the text it looks like an iTextSharp bug, or... it also could be that the text is presented in vector format and there is a code per letter.
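If the "code per letter" explanation is right, one way to at least notice the problem before indexing is to check whether the extracted string is mostly digits. The sketch below is only an illustration of that idea. The class ExtractionCheck, the method LooksLikeGlyphCodes and the 0.8 threshold are assumptions made up for this example and are not part of arachnode.net or iTextSharp.

```csharp
using System;
using System.Linq;

public static class ExtractionCheck
{
    // Hypothetical helper: returns true when digits make up more than
    // digitRatioThreshold of the non-whitespace characters, which suggests
    // the extractor returned glyph codes rather than readable text.
    public static bool LooksLikeGlyphCodes(string extracted, double digitRatioThreshold = 0.8)
    {
        if (string.IsNullOrWhiteSpace(extracted))
        {
            return false; // nothing extracted, so nothing to judge
        }

        int nonWhitespace = extracted.Count(c => !char.IsWhiteSpace(c));
        int digits = extracted.Count(c => char.IsDigit(c));

        return (double)digits / nonWhitespace > digitRatioThreshold;
    }
}
```

A caller could skip adding the "text" field, or log the AbsoluteUri for review, when the check returns true. The threshold is arbitrary and would need tuning, since documents such as numeric tables are legitimately digit-heavy.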
Yes, I will take a look.
What have you discovered in the usage of the text strategies?
Yes, this is correct and should match what I checked in.