What are robots.txt files, and how do they block and allow the crawlers of different search engines?

The robots.txt file is the first file a search engine requests when it crawls your site. It lets you tell search engines which parts of your site they should not crawl and index. Have your robots.txt file in place before your site goes live. This is especially important if you use faceted navigation, which can generate a large number of URLs that all point to what search engines see as the same content. Because duplicate content hurts your search engine ranking, you should use robots.txt to keep crawlers away from pages that appear to be duplicates.
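
To see how a well-behaved crawler interprets these rules, you can use Python's standard-library urllib.robotparser module. The sketch below fetches a site's robots.txt and asks whether a given user agent may crawl a given URL; the domain and paths are hypothetical placeholders, not real endpoints.

from urllib.robotparser import RobotFileParser

# Hypothetical site, used for illustration only.
parser = RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")
parser.read()  # fetch and parse the live robots.txt

# Ask the same question a crawler asks before fetching each page.
for url in ("https://www.example.com/",
            "https://www.example.com/facet/facet-value-1"):
    verdict = "allowed" if parser.can_fetch("Googlebot", url) else "blocked"
    print(url, "->", verdict)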

Common robots.txt Directives:

The following sample robots.txt files illustrate commonly used patterns for allowing and disallowing crawling. Note that wildcard (*) matching within paths is an extension honored by major crawlers such as Googlebot and Bingbot; it is not part of the original robots.txt standard.

Allow all web crawlers to crawl all content:

User-agent: *
Disallow:

Block all web crawlers from all content:

User-agent: *
Disallow: /

Block a specific web crawler from all content:

User-agent: Googlebot
Disallow: /

Block a specific web crawler from a specific facet and all its values:

User-agent: Googlebot
Disallow: /facet/*

Block all crawlers from a specific facet, regardless of where it appears in the URL path:

User-agent: *
Disallow: */facet/*

Allow all crawlers to crawl a specific facet value within a facet, regardless of where the facet appears in the URL path:

User-agent: *
Disallow: */facet/*
Allow: */facet/facet-value-1

Allow all crawlers to crawl a specific facet value only when the facet appears first in the URL path:

User-agent: *
Disallow: /facet/*
Allow: /facet/facet-value-1
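
The facet examples above depend on two precedence rules that major crawlers such as Googlebot document: the rule with the longest matching pattern wins, and Allow wins over Disallow on a tie. Python's standard urllib.robotparser only does plain prefix matching, so the sketch below is a simplified model, under those assumptions, of wildcard matching: it translates each * into a regular expression and picks the longest matching pattern.

import re

def pattern_to_regex(pattern):
    # Escape regex metacharacters, then turn the robots.txt '*' wildcard
    # into '.*'. A trailing '$' anchors the match; otherwise match a prefix.
    anchored = pattern.endswith("$")
    body = re.escape(pattern.rstrip("$")).replace(r"\*", ".*")
    return re.compile(body + ("$" if anchored else ""))

def is_allowed(url_path, rules):
    # rules: (directive, pattern) pairs for one user agent.
    # Longest matching pattern wins; Allow beats Disallow on a tie.
    best = None
    for directive, pattern in rules:
        if pattern and pattern_to_regex(pattern).match(url_path):
            key = (len(pattern), directive == "Allow")
            if best is None or key > best:
                best = key
    return best is None or best[1]

# The "allow one facet value in any position" example from above.
rules = [("Disallow", "*/facet/*"), ("Allow", "*/facet/facet-value-1")]
print(is_allowed("/shoes/facet/facet-value-1", rules))  # True: Allow is longer
print(is_allowed("/shoes/facet/facet-value-2", rules))  # False: only Disallow matches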

Block all web crawlers from adding items to the cart by following “Add to Cart” links:

User-agent: *
Disallow: /additemtocart.nl
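
Because the rule above is a plain path prefix with no wildcards, you can verify it directly with Python's standard urllib.robotparser by feeding the parser the rules as literal lines; the product URLs below are hypothetical examples.

from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /additemtocart.nl",
])

# A compliant crawler refuses the cart link but may fetch other pages.
print(parser.can_fetch("Googlebot", "https://www.example.com/additemtocart.nl?item=123"))  # False
print(parser.can_fetch("Googlebot", "https://www.example.com/products"))  # True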
