What is a robots.txt file and how to use it

Robots.txt – General information

Robots.txt is a text file located in the site’s root directory that tells search engine crawlers and spiders which pages and files of your website you want or don’t want them to visit. Usually, site owners strive to be noticed by search engines, but there are cases when it’s not needed: for instance, if you store sensitive data or you want to save bandwidth by excluding heavy pages with images from indexing.

When a crawler accesses a site, the first thing it requests is a file named ‘/robots.txt’. If such a file is found, the crawler checks it for the website indexation instructions.
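As a rough illustration of that first step, the Python sketch below derives the robots.txt location from any page URL on the same site. The function name robots_url and the sample page URL are assumptions made for this example only.

```python
from urllib.parse import urljoin


def robots_url(page_url: str) -> str:
    """Return the robots.txt location for the site that serves page_url."""
    # The leading slash makes urljoin drop the page's own path and query,
    # so the result always points at the root of the same host.
    return urljoin(page_url, "/robots.txt")


print(robots_url("http://www.domain.com/blog/post.html?id=7"))
# http://www.domain.com/robots.txt
```

A crawler would fetch this URL before requesting anything else from the site.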

NOTE: There can be only one robots.txt file per website. A robots.txt file for an addon domain needs to be placed in that domain’s own document root.

Google’s official stance on the robots.txt file

A robots.txt file consists of records made up of two kinds of lines: one line with a user-agent name (the search engine crawler the record applies to) and one or several lines starting with the directive

 Disallow: 

Robots.txt has to be created in the UNIX text format.

Basics of robots.txt syntax

Usually, a robots.txt file contains something like this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~different/

In this example, three directories (‘/cgi-bin/’, ‘/tmp/’ and ‘/~different/’) are excluded from indexation.
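To see how a crawler might interpret these rules, here is a minimal sketch using Python’s standard urllib.robotparser module. The crawler name ‘ExampleBot’ and the test paths are hypothetical and only serve as an illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~different/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
# 'ExampleBot' is a made-up crawler name; it falls under the '*' record.
for path in ("/cgi-bin/script.pl", "/tmp/cache.html", "/~different/", "/index.html"):
    print(path, parser.can_fetch("ExampleBot", path))
```

The three disallowed directories should be reported as False, while ‘/index.html’ stays crawlable.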

NOTE: Every directory is written on a separate line. You can’t write ‘Disallow: /cgi-bin/ /tmp/’ in one line, nor can you break up one directive Disallow or User-agent into several lines – use a new line to separate directives from each other.

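‘Star’ (*) in the User-agent field means ‘any web crawler’. It is not a general-purpose wildcard: under the original robots.txt standard, directives of the type ‘Disallow: *.gif’ or ‘User-agent: Mozilla*’ are not supported. Some major crawlers, such as Googlebot, do accept ‘*’ inside Disallow paths as an extension, but you can’t rely on every crawler doing so. Please pay attention to such logical mistakes, as they are the most common ones.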
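The effect of this mistake can be sketched with Python’s standard urllib.robotparser module, which matches Disallow values as plain path prefixes. The helper name allowed and the crawler name ‘ExampleBot’ are assumptions for this example; other parsers may behave differently.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, path: str) -> bool:
    """Report whether the hypothetical 'ExampleBot' may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("ExampleBot", path)


# An attempt to block every GIF with a wildcard pattern.
WILDCARD_ATTEMPT = "User-agent: *\nDisallow: *.gif\n"
# A plain path prefix, which every parser understands.
PREFIX_RULE = "User-agent: *\nDisallow: /images/\n"

print(allowed(WILDCARD_ATTEMPT, "/images/photo.gif"))  # True: the rule matched nothing
print(allowed(PREFIX_RULE, "/images/photo.gif"))       # False: the directory is blocked
```

With a prefix-only parser, the wildcard rule silently has no effect, which is why such mistakes are easy to miss.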

Other common mistakes are typos: misspelled directories or user-agents, missing colons after User-agent and Disallow, etc. As your robots.txt file gets more complicated, it becomes easier for an error to slip in, so validation tools come in handy: http://tool.motoricerca.info/robots-checker.phtml
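A few of these mistakes can also be caught with a short script. The following Python sketch is not a full validator; the function name lint_robots and the list of accepted field names are assumptions for this example and can be extended as needed.

```python
# Field names this sketch accepts; extend the set if you use other directives.
KNOWN_FIELDS = {"user-agent", "disallow", "sitemap"}


def lint_robots(text: str) -> list:
    """Return human-readable problems found in robots.txt content."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        field, colon, value = line.partition(":")
        name = field.strip().lower()
        if not colon:
            problems.append(f"line {number}: missing colon")
        elif name not in KNOWN_FIELDS:
            problems.append(f"line {number}: unknown field '{field.strip()}'")
        elif name == "disallow" and len(value.split()) > 1:
            problems.append(f"line {number}: more than one path in a single Disallow")
    return problems


sample = "User-agent *\nDisalow: /tmp/\nDisallow: /cgi-bin/ /tmp/\nDisallow: /ok/\n"
for problem in lint_robots(sample):
    print(problem)
```

For the sample above, the script flags the missing colon on line 1, the misspelled directive on line 2 and the two paths squeezed into line 3.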

Examples of usage

Here are some useful examples of robots.txt usage:

Prevent the whole site from indexation by all web crawlers:

User-agent: *
Disallow: /

Allow all web crawlers to index the whole site:

User-agent: *
Disallow:


Prevent only several directories from indexation:

User-agent: *
Disallow: /cgi-bin/ 


Prevent the site’s indexation by a specific web crawler:

User-agent: Bot1
Disallow: /

Find the list with all user-agents’ names split into categories here.

Allow indexation to a specific web crawler and prevent indexation from others:

User-agent: Opera 9
Disallow:

User-agent: *
Disallow: /
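The per-crawler behaviour of this example can be checked with Python’s standard urllib.robotparser module. This is only a sketch: ‘Bot1’ stands in for any other crawler, and the page path is made up.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Opera 9
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The named crawler gets its own record, where an empty Disallow allows everything.
print(parser.can_fetch("Opera 9", "/page.html"))  # True
# Any other crawler falls through to the '*' record and is blocked.
print(parser.can_fetch("Bot1", "/page.html"))     # False
```

A crawler uses the record that names it and falls back to the ‘*’ record only when no specific record matches.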

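Prevent all the files from indexation except a single one:

This is difficult under the original robots.txt standard, which has no ‘Allow’ directive (many major crawlers, including Googlebot, do support ‘Allow’ as an extension, and it is part of RFC 9309). To stay compatible with all crawlers, move all the files you want to hide into a subdirectory and disallow it, leaving the one file you want indexed outside that subdirectory: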

User-agent: *
Disallow: /docs/

You can also use an online robots.txt file generator here.

Robots.txt and SEO

Removing exclusion of images

The default robots.txt file in some CMS versions is set up to exclude your images folder. This issue doesn’t occur in the newest CMS versions, but the older versions need to be checked.

This exclusion means your images will not be indexed or included in Google Image Search. You would normally want them to appear there, as it increases your SEO rankings.

Should you want to change this, open your robots.txt file and remove the line that says:

Disallow: /images/

Adding reference to your sitemap.xml file 

If you have a sitemap.xml file (and you should, as it increases your SEO rankings), it is a good idea to include the following line in your robots.txt file:

Sitemap: http://www.domain.com/sitemap.xml

(Update this line with your own domain name and sitemap file name.)
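To confirm that the sitemap reference is readable, you can parse the file with Python’s standard urllib.robotparser module. The sketch below assumes Python 3.8 or newer, where the site_maps() method is available, and reuses the placeholder domain from above.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Sitemap: http://www.domain.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of sitemap URLs, or None if the file lists none.
print(parser.site_maps())
# ['http://www.domain.com/sitemap.xml']
```

The Sitemap line is independent of the User-agent records, so it can be placed anywhere in the file.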

Miscellaneous remarks

  • Don’t block CSS, JavaScript and other resource files by default. Blocking them prevents Googlebot from properly rendering the page and understanding that your site is mobile-optimized.
  • You can also use the file to prevent specific pages from being indexed, like login or 404 pages, but this is better done using the robots meta tag.
  • Adding disallow statements to a robots.txt file does not remove content. It simply blocks spiders from accessing it. If there is content that you want to remove from search results, it’s better to use a meta noindex.
  • As a rule, the robots.txt file should never be used to handle duplicate content. There are better ways, like the rel=canonical tag, which is a part of the HTML head of a webpage.
  • Always keep in mind that robots.txt is not subtle. There are often other tools at your disposal that can do a better job, like the parameter handling tools within Google and Bing Webmaster Tools, the X-Robots-Tag header and the meta robots tag.

Robots.txt for WordPress

WordPress creates a virtual robots.txt file once you publish your first post. However, if you already have a real robots.txt file on your server, WordPress won’t add a virtual one.

A virtual robots.txt doesn’t exist as a file on the server; you can only access it via the following link: http://www.yoursite.com/robots.txt

By default, it will have Google’s Mediabot allowed, a bunch of spambots disallowed and some standard WordPress folders and files disallowed.

If you haven’t created a real robots.txt yet, create one with any text editor and upload it to the root directory of your server via FTP.

Blocking main WordPress directories

Every WordPress installation has three standard directories (wp-content, wp-admin and wp-includes) that don’t need to be indexed.

Don’t choose to disallow the whole wp-content folder though, as it contains an ‘uploads’ subfolder with your site’s media files that you don’t want to be blocked. That’s why you need to proceed as follows:

Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/ 
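The sketch below checks these directives with Python’s standard urllib.robotparser module. Note the assumptions: a ‘User-agent: *’ line is added in front, since Disallow lines only take effect inside a record, and the crawler name ‘ExampleBot’ and the file paths are made up.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

checks = [
    "/wp-admin/options.php",
    "/wp-content/plugins/example/plugin.php",
    "/wp-content/uploads/2013/05/photo.jpg",
]
for path in checks:
    # True means the hypothetical 'ExampleBot' may crawl the path.
    print(path, parser.can_fetch("ExampleBot", path))
```

The admin and plugin paths should come back as False, while the file under ‘uploads’ remains crawlable.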

Blocking on the basis of your site structure

Every blog can be structured in various ways:

a) On the basis of categories
b) On the basis of tags
c) On the basis of both, or of neither
d) On the basis of date-based archives

a) If your site is category-structured, you don’t need to have the Tag archives indexed. Find your tag base in the Permalinks options page under the Settings menu. If the field is left blank, the tag base is simply ‘tag’:

Disallow: /tag/

b) If your site is tag-structured, you need to block the category archives. Find your category base and use the following directive:

Disallow: /category/ 

c) If you use both categories and tags, you don’t need to use any directives. If you use neither, you need to block both of them:

Disallow: /tag/
Disallow: /category/

d) If your site is structured on the basis of date-based archives, you can block those in the following ways:

Disallow: /2010/
Disallow: /2011/
Disallow: /2012/
Disallow: /2013/

NOTE: You can’t use Disallow: /20*/ here as such a directive will block every single blog post or page that starts with the number ’20’.
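Since each year needs its own line, the list can be generated rather than typed by hand. Here is a small Python sketch; the function name year_archive_rules is an assumption for this example.

```python
def year_archive_rules(first_year: int, last_year: int) -> str:
    """Build one Disallow line per yearly archive, both ends inclusive."""
    years = range(first_year, last_year + 1)
    return "\n".join(f"Disallow: /{year}/" for year in years)


print(year_archive_rules(2010, 2013))
```

The output can be pasted under the relevant User-agent line of your robots.txt file.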

Duplicate content issues in WordPress

By default, WordPress has duplicate pages, which do no good to your SEO rankings. To fix this, we advise you not to use robots.txt, but to go with a subtler way instead: the ‘rel = canonical’ tag, which you place in the <head> section of your pages to specify the only correct canonical URL. This way, search engines will treat the canonical URL as the version of the page to index.
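For reference, the canonical tag is a single link element. The Python sketch below builds one for a given URL; the function name canonical_link_tag and the sample URL are assumptions for this example, and in practice a WordPress theme or SEO plugin usually outputs the tag for you.

```python
from html import escape


def canonical_link_tag(url: str) -> str:
    """Return the link element that declares url as the canonical address."""
    # escape() protects characters such as '&' and '"' inside the attribute.
    return f'<link rel="canonical" href="{escape(url, quote=True)}">'


print(canonical_link_tag("http://www.domain.com/sample-post/"))
# <link rel="canonical" href="http://www.domain.com/sample-post/">
```

Every duplicate version of a page should carry the same tag pointing at the one canonical URL.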
