Indexing HNSCs with SP2013 Enterprise Search

Hello forum,

I am able to crawl the root site collection of my web app (https://qaserver); however, I am not able to crawl the 3 HNSCs (host-named site collections) that are connected to the web app. I am able to browse them, but not index their content.

I get the following error in the crawl log:

"Item not crawled due to one of the following reasons: Preventive crawl rule; Specified content source hops/depth exceeded; URL has query string parameter; Required protocol handler not found; Preventive robots directive. ( This item was deleted because it was excluded by a crawl rule. )"

I don't have any crawl rules, hop/depth limits, query string parameters, or a robots.txt file. The required protocol handler should be HTTPS, which works via the browser without issue.

Is there anything specific I need to do to have search index HNSCs?

Farm Info:

  1. SP2013 Standard, March 2013 CU
  2. SSL (443)
  3. The wildcard SSL cert does NOT cover the web app URL; however, I have enabled the option to ignore SSL cert warnings.
  4. There is also an F5 used for load balancing and portal login authentication/authorization.

Any ideas?

January 6th, 2014 3:40pm

When you crawl the root site collection (https://qaserver), the crawler will pick up the HNSCs as well -- you do not specify HNSCs as separate content sources. Your root URL should also be valid for the wildcard cert in use.
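For reference, a content source for this setup would carry only the web application URL as its start address. A minimal PowerShell sketch (the SSA and content source names are placeholders):

  # Bind one start address -- the web app URL; the crawler discovers the HNSCs from it
  $ssa = Get-SPEnterpriseSearchServiceApplication -Identity "Search Service Application"
  New-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa `
      -Name "Local SharePoint Sites" -Type SharePoint `
      -StartAddresses "https://qaserver"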

January 6th, 2014 3:48pm

Trevor,

Yes, I see that I do not need to add the HNSCs to the content source. I am simply pointing to my web app URL (https://qaserver)...

So I have HNSCs with URLs of https://docs.portal.com, https://appcatalog.portal.com and https://www.portal.com on web app https://qaserver. The SSL cert is a wildcard *.portal.com (covering all three HNSCs)... however, it does not cover https://qaserver (the web app itself)...
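(For context, HNSCs like these are created against the web application from PowerShell roughly as follows; a sketch using this thread's URLs, with placeholder owner and template:)

  # Each HNSC is bound to the web app via -HostHeaderWebApplication
  New-SPSite -Url "https://docs.portal.com" `
      -HostHeaderWebApplication "https://qaserver" `
      -Name "Docs" -OwnerAlias "DOMAIN\spadmin" -Template "STS#0"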

What are my options?

January 6th, 2014 3:58pm

Hi,

I am not able to find the exact TechNet article, but have you created a path-based root site collection in your web app https://qaserver - that is, at the default '/' path? If not, create one at / and try crawling again.
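A minimal sketch of creating that path-based root site collection (owner and template are placeholders):

  # Root site collection at '/' on the web app -- the crawler requires one to exist
  New-SPSite -Url "https://qaserver" -Name "Root" `
      -OwnerAlias "DOMAIN\spadmin" -Template "STS#0"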

January 6th, 2014 4:27pm

Sangeetha,

I am able to crawl the root site collection, just not the host-named site collections hanging off the web app. I now think this is because those HNSC URLs are being sent to our F5 for authorization/authentication (a login page).

The web app/root site collection is 'https://qaserver', which does not have an external DNS record.

Host-named site collections (all of which have external DNS entries pointing to an F5 pool) are:

  • https://www.portal.com
  • https://docs.portal.com
  • https://appcatalog.portal.com

Is there any way, via alternate access mappings, to keep the crawler from hitting external DNS?

January 6th, 2014 7:18pm

You can use SiteDataServers to point the crawler at a specific target server, but remember the crawler doesn't hit the HNSC URLs; it only crawls the "root" URL and picks up everything else from there.

https://crawl.codeplex.com/
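For anyone looking into the SiteDataServers approach, it is set per zone on the web application object; a sketch, with a placeholder crawl-target URI:

  # Point the crawler at a dedicated server for the Default zone
  $wa = Get-SPWebApplication "https://qaserver"
  $targets = New-Object "System.Collections.Generic.List[System.Uri]"
  $targets.Add((New-Object System.Uri("https://crawl-wfe")))   # placeholder target server
  $wa.SiteDataServers.Add([Microsoft.SharePoint.Administration.SPUrlZone]::Default, $targets)
  $wa.Update()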

January 6th, 2014 10:12pm

AAM alone might not be sufficient; you would have to create the corresponding A records as well. Since I think you mentioned your DNS records are in the external zone, you may have to create split DNS... or do you already have a split environment for the internal zone? Check out this article - http://blogs.technet.com/b/tothesharepoint/archive/2011/05/25/coordinating-urls-with-aam-and-dns.aspx - it might help you map the AAMs to DNS.
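As a stopgap on the crawl server alone, hosts-file entries pointing the HNSC names at an internal WFE do the same job as split DNS; a sketch, with a placeholder IP:

  # Run on the crawl server; 10.0.0.10 stands in for the internal WFE address
  $hosts = "$env:SystemRoot\System32\drivers\etc\hosts"
  Add-Content -Path $hosts -Value "10.0.0.10  www.portal.com"
  Add-Content -Path $hosts -Value "10.0.0.10  docs.portal.com"
  Add-Content -Path $hosts -Value "10.0.0.10  appcatalog.portal.com"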

Still, your problem is strange, because if you are able to browse the HNSCs by their FQDNs from the crawl server using the default content access account, there should not be any issue. But you are saying you can already do this, right?

I am curious to know how you solve this, as I might be designing an SP2013 farm with F5 for SSL offloading as well - please post your solution :)

January 7th, 2014 12:29pm

Trevor... 

  • Web app: https://qaserver
  • HNSCs: https://www.portal.com; https://docs.portal.com; https://appcatalog.portal.com
  • SSL cert: *.portal.com
  • Search content source: https://qaserver
  • Farm search settings set to ignore SSL warnings (see the sketch below)
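That last setting can be applied from PowerShell as well; a minimal sketch:

  # Farm-wide search setting that tolerates certificate name/trust warnings
  Set-SPEnterpriseSearchService -IgnoreSSLWarnings $true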

When I crawl, I get:

The crawler could not communicate with the server. Check that the server is available and that the firewall access is configured correctly. If the repository was temporarily unavailable, an incremental crawl will fix this error. ( Error from SharePoint site: HttpStatusCode ServiceUnavailable The request failed with HTTP status 503: Service Unavailable.

The SSL cert I have does not cover the web app URL; it only covers the HNSCs. Even if I had another cert for https://qaserver, how would I apply two certs to one web app?

Any ideas?

January 7th, 2014 8:58pm

Change the web app URI to https://root.portal.com. And like Sangeetha said, make sure you have a root site collection.
January 7th, 2014 9:00pm

I do have a root site collection on the web app... so, you are saying to change my web app from https://qaserver to https://root.portal.com?

January 7th, 2014 9:05pm

Can/should the root site collection be an HNSC (https://root.portal.com), as opposed to a path-based site collection (https://portal/)?

January 7th, 2014 10:14pm

Issue resolved.

I had to rename the web app from https://qaserver to 'root.portal.com' so that it is covered by our wildcard SSL cert (*.portal.com), as Trevor mentioned above. In addition, I had to bypass F5 authentication for my crawl service account for all URLs.
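For anyone following along, the rename boils down to updating the Default zone public URL; a sketch (the matching IIS binding and DNS record are separate, manual steps):

  # Repoint the web app's Default zone URL to a name the wildcard cert covers
  Set-SPAlternateURL -Identity "https://qaserver" -Url "https://root.portal.com" -Zone Default
  # The IIS site binding (host header + SSL cert) and DNS entry must be updated by hand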

All well now. Thanks Sangeetha and Trevor

January 8th, 2014 5:25pm

Hello,

I have a problem very similar to this one, but our SharePoint site collections are not all subdomains of a single domain.

The root site collection of web application is like this:

https://vanity.mycompany.com

And our sites are like these:

https://sp.firstcompany.com

https://sp.secondcompany.com

https://sp.thirdcompany.com


So as you can see, we cannot use a single wildcard SSL certificate for all the sites; instead we use SNI in IIS to bind a different SSL certificate to every site collection.

The issue is that the crawler cannot access the HNSCs, and this error is logged for every single HNSC:

Access is denied. Verify that either the Default Content Access Account has access to this repository, or add a crawl rule to crawl this repository. If the repository being crawled is a SharePoint repository, verify that the account you are using has "Full Read" permissions on the SharePoint Web Application being crawled. ( Error from SharePoint site: HttpStatusCode Unauthorized The request failed with HTTP status 401


I have checked, and the default content access account has Full Read on the web application. Just to check things out, I ran Check Permissions for the default content access account on one of the HNSCs, and it is allowed to access the site and its contents.
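For what it's worth, the web application user policies (and their bound roles, e.g. Full Read) can be listed from PowerShell; a sketch using this thread's root URL:

  # Show each policy account and its bound roles on the web application
  $wa = Get-SPWebApplication "https://vanity.mycompany.com"
  $wa.Policies | ForEach-Object {
      "{0} -> {1}" -f $_.UserName, (($_.PolicyRoleBindings | ForEach-Object { $_.Name }) -join ", ")
  }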


Do you have any idea where I have gone wrong?

Thanks

Omid

August 13th, 2015 3:32am

This topic is archived. No further replies will be accepted.
