The purpose of a robots.txt file is to tell search engines and other bots which areas of a website they may and may not spider or crawl. Most 'well-behaved' robots or spiders will ask for the robots.txt file when they visit a site. The robots.txt file is a plain ASCII text file located in the root directory of a site, like this:
http://www.yourURL.com/robots.txt
For example: http://www.google.com/robots.txt
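Because the file always sits in the root directory, its address can be worked out from any page URL on the same site. The following is a minimal sketch using Python's standard urllib.parse module; the function name robots_url is a hypothetical helper chosen for illustration, not part of any robots.txt tool.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt address for the site that hosts page_url.

    Hypothetical helper: keeps the scheme and host, and replaces the
    path, query string and fragment with /robots.txt.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.google.com/search?q=robots"))
```

Any page on the site yields the same robots.txt address, since only the scheme and host name are kept.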
The protocol for its use and more information about the robots.txt file are available at The Web Robots Pages.
A robots.txt file will look something like this if the spider is to crawl everything on the site:
user-agent:*
disallow:
(The * means 'all' spiders/crawlers/robots/bots.)
If the search engine spiders are to crawl nothing on the site, it will look like this:
user-agent:*
disallow: /
If a specific file is not to be crawled, it would look like this:
user-agent:*
disallow: /specificfile.htm
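One way to see how a crawler might read the three files above is to feed them to Python's standard urllib.robotparser module. This is only a sketch of how that one parser interprets the rules, and other robots may behave differently; the helper name allowed and the user-agent name ExampleBot are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# The three example files from the text, as strings.
ALLOW_ALL = "user-agent:*\ndisallow:\n"
BLOCK_ALL = "user-agent:*\ndisallow: /\n"
BLOCK_ONE = "user-agent:*\ndisallow: /specificfile.htm\n"


def allowed(robots_txt: str, path: str, agent: str = "ExampleBot") -> bool:
    """Return True if the parser says `agent` may fetch `path`.

    Hypothetical helper; "ExampleBot" is a made-up crawler name.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


print(allowed(ALLOW_ALL, "/index.htm"))          # True: empty disallow blocks nothing
print(allowed(BLOCK_ALL, "/index.htm"))          # False: "/" matches every path
print(allowed(BLOCK_ONE, "/specificfile.htm"))   # False: the named file is blocked
print(allowed(BLOCK_ONE, "/index.htm"))          # True: other files stay crawlable
```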
You can also name specific spiders that you want to keep out, and it would look like this (many of the ones disallowed in the example below are email address hunters):
User-agent: Black Hole
Disallow: /
User-agent: Titan
Disallow: /
User-agent: WebStripper
Disallow: /
User-agent: NetMechanic
Disallow: /
User-agent: CherryPicker
Disallow: /
User-agent: EmailCollector
Disallow: /
User-agent: EmailSiphon
Disallow: /
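A list like the one above can also be generated from a plain list of names rather than typed by hand. The sketch below builds the same records and then checks them with Python's standard urllib.robotparser; the function name build_blocklist and the user-agent name ExampleBot are hypothetical, and the blank line written between records is a convention assumed here.

```python
from urllib.robotparser import RobotFileParser

# Spider names taken from the example in the text.
BLOCKED_BOTS = [
    "Black Hole", "Titan", "WebStripper", "NetMechanic",
    "CherryPicker", "EmailCollector", "EmailSiphon",
]


def build_blocklist(bots):
    """Return robots.txt text with one 'Disallow: /' record per name.

    Hypothetical helper; records are separated by a blank line.
    """
    records = [f"User-agent: {name}\nDisallow: /" for name in bots]
    return "\n\n".join(records) + "\n"


robots_txt = build_blocklist(BLOCKED_BOTS)

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A named spider is refused everywhere on the site.
print(parser.can_fetch("EmailSiphon", "/index.htm"))
# A spider that is not named (made-up name) is not restricted by this file.
print(parser.can_fetch("ExampleBot", "/index.htm"))
```

Keep in mind that robots.txt only asks spiders to stay away; a spider that ignores the file is not stopped by it.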
A robots.txt file can be checked at Robots.txt Checker.