The robots.txt File

The purpose of a robots.txt file is to tell search engine spiders and other bots which areas of a website they may and may not crawl. Most 'well behaved' robots or spiders will ask for the robots.txt file when they visit a site. The robots.txt file is a plain ASCII text file located in the root directory of a site, like this:

http://www.yourURL.com/robots.txt 

For example: http://www.google.com/robots.txt
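Because the file always sits in the root directory, its address can be worked out from any page URL on the same site. The following is a minimal sketch using Python's standard library; the function name `robots_txt_url` and the sample page URL are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt address for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; robots.txt lives at the site root,
    # so the page's own path, query string, and fragment are dropped.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.yourURL.com/articles/page.htm?id=7"))
```

A crawler would typically run a step like this once per site before requesting any other page.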

The protocol for its use, along with more information about the robots.txt file, can be found at The Web Robots Pages.

A robots.txt file will look something like this if the spider is to crawl everything on the site:

User-agent: *
Disallow:

(the * means 'all' spiders/crawlers/robots/bots)

If the search engine spiders are to crawl nothing on the site, it will look like this:

User-agent: *
Disallow: /

If a specific file is not to be crawled, then it would look like this:

User-agent: *
Disallow: /specificfile.htm
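To see how a well-behaved robot might interpret the three rule sets above, here is a rough sketch using Python's standard `urllib.robotparser` module. The helper `can_crawl`, the robot name `ExampleBot`, and the page `index.htm` are assumptions for illustration only; real crawlers may apply the rules somewhat differently.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# The three examples from the article, as plain text.
ALLOW_ALL = "user-agent:*\ndisallow:\n"
BLOCK_ALL = "user-agent:*\ndisallow: /\n"
BLOCK_ONE = "user-agent:*\ndisallow: /specificfile.htm\n"

PAGE = "http://www.yourURL.com/index.htm"             # hypothetical page
BLOCKED = "http://www.yourURL.com/specificfile.htm"   # file named in the rule

for label, rules in [("allow all", ALLOW_ALL),
                     ("block all", BLOCK_ALL),
                     ("block one file", BLOCK_ONE)]:
    print(label,
          can_crawl(rules, "ExampleBot", PAGE),
          can_crawl(rules, "ExampleBot", BLOCKED))
```

An empty `disallow:` value leaves everything open, `disallow: /` closes the whole site, and a specific path closes only URLs that begin with that path.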

You can also name specific spiders that you want to keep out, and it would look like this (many of the ones disallowed in the example below are email address harvesters):

User-agent: Black Hole
Disallow: /

User-agent: Titan
Disallow: /

User-agent: WebStripper
Disallow: /

User-agent: NetMechanic
Disallow: /

User-agent: CherryPicker
Disallow: /

User-agent: EmailCollector
Disallow: /

User-agent: EmailSiphon
Disallow: /
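The effect of naming individual spiders can be checked the same way. The sketch below feeds two of the records from the list above to Python's standard `urllib.robotparser`; the helper `is_blocked` and the robot name `ExampleBot` are hypothetical. Keep in mind that this only shows what a robot that honours robots.txt would do; an address harvester is free to ignore the file.

```python
from urllib.robotparser import RobotFileParser

# Two records taken from the list above. Each Disallow line directly
# follows its User-agent line, and a blank line separates the records.
ROBOTS_TXT = """\
User-agent: EmailCollector
Disallow: /

User-agent: EmailSiphon
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_blocked(agent: str) -> bool:
    """Return True if `agent` is told to stay away from the site root."""
    return not parser.can_fetch(agent, "http://www.yourURL.com/")


print(is_blocked("EmailSiphon"))   # a named spider
print(is_blocked("ExampleBot"))    # hypothetical spider that is not named
```

Since this file has no `User-agent: *` record, any spider that is not named is left unrestricted.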

A robots.txt file can be checked at Robots.txt Checker.

What do others think:

The robot.txt primer (WebProWorld)
How to keep bad robots, spiders and web crawlers away
Robots tutorial (SEO Ranke)