arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop

Crawler is not crawling .docx files

Answered (Verified) This post has 1 verified answer | 3 Replies | 2 Followers

Top 10 Contributor
82 Posts
InvestisDev posted on Thu, Mar 22 2012 7:16 AM

Hello,

The crawler is not crawling .docx files. Their content type is "application/vnd.openxmlformats-officedocument.wordprocessingml.document". Other extensions such as .pptx and .xlsx are working properly.

Please suggest how I can get .docx files crawled.

Thanks,

 

Answered (Verified) Verified Answer

Top 10 Contributor
1,905 Posts
Verified by InvestisDev

1261.arachnode.docx

Adding a test .docx...

The .docx in this post is coming back as 'application/octet-stream'...

Set a breakpoint in DataManager.cs at the line shown in the screenshot.  Take the values from the DataType and add them to the cfg.AllowedDataTypes table.
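
To illustrate why the value seen at the breakpoint matters, here is a minimal, generic sketch of a content-type allow-list. It is not arachnode.net's code: the `ALLOWED_CONTENT_TYPES` set and the `is_allowed` function are hypothetical names invented for this example, and the real crawler reads its allowed types from the cfg.AllowedDataTypes table. The sketch only shows that a lookup keyed on the reported content type fails when the server labels a .docx as 'application/octet-stream', even if the official .docx type is already listed.

```python
# Hypothetical allow-list; stands in for rows configured by the user.
DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
ALLOWED_CONTENT_TYPES = {"application/msword", DOCX}


def is_allowed(reported_content_type, allowed):
    """Return True if the reported Content-Type is in the allow-list.

    Parameters such as '; charset=...' are stripped and case is ignored.
    """
    media_type = reported_content_type.split(";", 1)[0].strip().lower()
    return media_type in allowed


# A server that reports the official .docx type passes the check.
print(is_allowed(DOCX, ALLOWED_CONTENT_TYPES))

# A server that reports the same file generically does not.
print(is_allowed("application/octet-stream", ALLOWED_CONTENT_TYPES))

# Adding the value the server actually reports lets the file through.
extended = ALLOWED_CONTENT_TYPES | {"application/octet-stream"}
print(is_allowed("application/octet-stream", extended))
```

The point of the sketch is that the allow-list has to contain whatever the server actually sends, which is why reading the real value in the debugger comes first.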

Using this .docx file: {http://www.ecma-international.org/news/TC45_current_work/Ecma%20TC45%20OOXML%20Standard%20-%20Draft%201.3.docx}...

Looks like it's working for me.

Let me know...

Thanks,
Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies

Top 10 Contributor
1,905 Posts

Add the content type (appropriate row, etc.) to cfg.AllowedDataTypes.

Look in DisallowedAbsoluteUris.  Do you see your .docx files there?  If so, does the reason read: "Disallowed by unknown content type"?


Top 10 Contributor
82 Posts

I added a content type for .docx to the cfg.AllowedDataTypes table.

I then checked the DisallowedAbsoluteUris table, where the reason reads "Disallowed by unassigned DataType.".

I added the following values to the cfg.AllowedDataTypes table:

ID = 1003

ContentTypeTypeID = 1 (the same ContentTypeTypeID is used for the .doc extension)

Name = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"

Is this ContentTypeTypeID correct?

Please let me know how to solve this.

Thanks,

 


copyright 2004-2017, arachnode.net LLC