A Guide to the Robots.txt Exclusion Protocol

Learn how to use robots.txt to control how search engine crawlers, or spiders, access your site and which areas they may crawl and index.

By Tim Trott • SEO Guide • February 1, 2013

This article is part of a series of articles. Please use the links below to navigate between the articles.

  1. SEO Strategy - Search Engine Optimization Tips in 2025
  2. A Guide to the Robots.txt Exclusion Protocol
  3. What Are XML Sitemaps and Why Do You Need One?
  4. How to Use Google Search Central (formerly Google Webmaster Tools)
  5. Google Analytics for Tracking Website Visitor Statistics
  6. How to Start Earning Money with Google Adsense in 2025
  7. Website Loading Times Are Vital - How to Improve Your Performance
  8. How To Improve Website Speed By Enabling GZip Compression
  9. How to Leverage Google Trends for Effective Keyword Research
  10. Top 8 Best Free Search Engine Optimisation Websites & Tools
A Guide to the Robots.txt Exclusion Protocol

A web crawler is sometimes also called a spider or a bot. These are automated programs that systematically browse the World Wide Web, typically for web indexing, although they can be used to gather data of any kind. Sometimes these bots are overzealous in their crawling and generate thousands of hits per hour, which can effectively amount to a denial-of-service attack on your site.

The robots.txt file is the standard way of giving instructions about a website to web crawlers. This standard is called the Robots Exclusion Protocol.

When a robot wants to crawl a site, it first checks the robots.txt file to see whether it is allowed to crawl the site and whether there are any areas it should ignore.

The robots.txt file is a plain text file placed in the root of the website, for example http://www.example.com/robots.txt.
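To see how a well-behaved crawler applies this in practice, here is a minimal sketch using Python's built-in urllib.robotparser module. The bot name and the page URL are illustrative placeholders, not anything defined in this article.

import urllib.robotparser

# Point the parser at the site's robots.txt and download it
# (this performs a real HTTP request).
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

# Ask whether our (hypothetical) bot may fetch a given page.
page = "http://www.example.com/some-page.html"
if rp.can_fetch("ExampleBot", page):
    print("Allowed to crawl", page)
else:
    print("robots.txt asks us not to crawl", page)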

What is the robots.txt File?

The most basic robots.txt file looks like this:

User-agent: *
Disallow:

This creates a single rule which allows all web crawlers access to the entire site.

User-agent: * indicates that the following rule applies to all spider bots.
Disallow: The empty disallow field indicates nothing is blocked, and every link can be crawled.

The opposite would be to block access for all web crawlers and prevent the site from being indexed.

User-agent: *
Disallow: /

User-agent: * indicates that the following rule applies to all spider bots.
Disallow: / indicates that the root URL, and everything under it, is disallowed or forbidden.

You can specify which URLs are blocked in the Disallow field.

User-agent: *
Disallow: /wp-admin/
Disallow: /private/
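As a quick local check of what these rules actually block, the same urllib.robotparser module can parse the rules directly without fetching anything. The paths being tested and the bot name are illustrative placeholders.

import urllib.robotparser

# The rules from the example above, supplied as a list of lines.
rules = [
    "User-agent: *",
    "Disallow: /wp-admin/",
    "Disallow: /private/",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("AnyBot", "/wp-admin/options.php"))   # False - under /wp-admin/
print(rp.can_fetch("AnyBot", "/private/notes.html"))     # False - under /private/
print(rp.can_fetch("AnyBot", "/blog/hello-world/"))      # True - not disallowed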

Care must be taken when disallowing resources, as malicious users may read the robots.txt file to locate hidden areas and target them. For example, the rules above reveal that the site is running WordPress and that there is a URL containing private resources; an attacker could then tailor an attack to WordPress or investigate those private resources. This is a form of internal information disclosure.

Robots.txt only asks search engines not to crawl or list a web page; it does not hide the page from anyone and should never be used as a security measure.

Limiting Bot Access with robots.txt

You can also limit access for individual bots by naming them in the User-agent field.

Here are a few of the most popular web crawler user agents to use:

  • Googlebot - Google's own web crawler
  • Mediapartners-Google - Google AdSense/AdWords
  • Bingbot - Microsoft Bing
  • MSNBot - Microsoft's old MSN bot
  • Slurp - Yahoo! Search
  • ia_archiver - Internet Archive

You can give access to specific bots while blocking all others:

User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

User-agent: Slurp
Disallow:

# Everyone Else (NOT allowed)
User-agent: *
Disallow: / 

In theory, this should block all "bad bots", i.e. those that scrape content and hog bandwidth. In practice, however, bad bots do not honour robots.txt rules or even request the file; the search engines and spiders that do follow it are precisely the ones you want indexing your site.
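To see how this user-agent matching behaves, here is a short sketch using urllib.robotparser with a trimmed-down version of the rules above; "SomeScraper" is an invented name standing in for any bot not listed by name.

import urllib.robotparser

# A shortened version of the policy above: Googlebot is allowed,
# every other user agent falls under the catch-all block.
rules = [
    "User-agent: Googlebot",
    "Disallow:",
    "",
    "User-agent: *",
    "Disallow: /",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("Googlebot", "/index.html"))    # True  - explicitly allowed
print(rp.can_fetch("SomeScraper", "/index.html"))  # False - caught by User-agent: *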

Validating robots.txt with Google Search Console

Once you have created and uploaded a robots.txt file, you can use Google Search Console (formerly Webmaster Tools) to check it for errors and to test whether the rules behave as expected for several user agents.
