What is a robots.txt file and how to use it

Robots.txt – General information

Robots.txt is a text file located in the site’s root directory that tells search engines’ crawlers and spiders which pages and files on the website you want or don’t want them to visit. Usually, site owners strive to be noticed by search engines, but there are cases when it’s not needed: for instance, if you store sensitive data or you want to save bandwidth by excluding heavy pages with images from crawling.

When a crawler accesses a site, it first requests a file named ‘/robots.txt’. If such a file is found, the crawler checks it for instructions on which parts of the website it may crawl.
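
For example, a minimal robots.txt file that lets all crawlers visit everything except a /private/ directory (the directory name is just an illustration) could look like this:

   User-agent: *
   Disallow: /private/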

NOTE: There can be only one robots.txt file per website. A robots.txt file for an addon domain needs to be placed in the corresponding document root.
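
As a sketch, assuming a typical cPanel-style layout where the main site lives in public_html and each addon domain has its own folder inside it, the files would sit at:

   public_html/robots.txt                   (main domain)
   public_html/addondomain.com/robots.txt   (addon domain)

(The folder name addondomain.com is only a placeholder; use your addon domain’s actual document root.)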

Google’s official stance on the robots.txt file

Robots.txt and SEO

Removing exclusion of images

The default robots.txt file in some CMS versions is set up to exclude your images folder. This issue doesn’t occur in the newest CMS versions, but older versions should be checked.

This exclusion means your images will not be indexed and shown in Google Image Search. Having them indexed is something you would normally want, as it can improve your visibility in search results.

Should you want to change this, open your robots.txt file and remove the line that says:

   Disallow: /images/ 

Adding a reference to your sitemap.xml file

If you have a sitemap.xml file (and you should, as it helps search engines discover and index your pages), it is a good idea to include the following line in your robots.txt file:

  Sitemap: http://www.domain.com/sitemap.xml 

(Update this line with your own domain name and sitemap location.)
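
Putting the pieces together, a simple robots.txt file that applies the points above might look like this (www.domain.com and /admin/ are placeholders used only for illustration):

   User-agent: *
   Disallow: /admin/
   Sitemap: http://www.domain.com/sitemap.xml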

Miscellaneous remarks

  • Don’t block CSS, JavaScript and other resource files by default. Blocking them prevents Googlebot from properly rendering the page and understanding that your site is mobile-optimized.
  • You can also use the file to prevent specific pages from being indexed, like login or 404 pages, but this is better done with the robots meta tag (see the example after this list).
  • Adding disallow statements to a robots.txt file does not remove content; it simply blocks access for spiders. If there is content that you want removed from the index, it’s better to use a noindex meta tag.
  • As a rule, the robots.txt file should never be used to handle duplicate content. There are better ways, like the rel="canonical" link element, which is placed in the HTML head of a webpage.
  • Always keep in mind that robots.txt is a blunt instrument. There are often other tools at your disposal that can do a better job, such as the parameter handling tools within Google and Bing Webmaster Tools, the X-Robots-Tag HTTP header and the meta robots tag.
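
As a brief sketch of the alternatives mentioned above, both the robots meta tag and the rel="canonical" link element are placed in the HTML head of a page (the URL is a placeholder):

   <!-- On a page you don’t want indexed (e.g. a login page): -->
   <meta name="robots" content="noindex">

   <!-- On a duplicate page, pointing search engines to the preferred version: -->
   <link rel="canonical" href="http://www.domain.com/preferred-page/">

The X-Robots-Tag does the same job as the robots meta tag but is sent as an HTTP response header, which makes it useful for non-HTML files such as PDFs.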