Thursday, June 2, 2011

Why We Optimize Robots.txt

What Is Robots.txt

The simplest way to check whether a site has a robots.txt file is to type /robots.txt after the site's URL (for example, http://www.example.com/robots.txt). This displays the robots exclusion file for that particular site.
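If you want to script that check, here is a minimal Python sketch; http://www.example.com and the helper name has_robots_txt are only placeholders for illustration, not part of any library.

# Minimal sketch: check whether a site publishes a robots.txt file.
import urllib.request
import urllib.error

def has_robots_txt(site):   # hypothetical helper name, for illustration only
    try:
        with urllib.request.urlopen(site.rstrip("/") + "/robots.txt") as response:
            return response.getcode() == 200
    except urllib.error.HTTPError:
        return False   # e.g. 404: the site publishes no robots.txt

print(has_robots_txt("http://www.example.com"))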

Whenever a crawler visits a website, it first looks for the robots.txt file, which tells the robot which pages it should not visit on that site. Keep in mind that robots.txt is only an instruction to web robots; badly behaved robots can simply ignore it.
How to Add robots.txt

For example
User-agent: *   # You can also name a particular robot here, such as Googlebot.
Disallow: /

In the above example, "User-agent: *" means the section applies to all robots, and "Disallow: /" tells the robot not to visit any page on the site. There is usually no need for an explicit "Allow" instruction, because allowing is the default: anything not disallowed may be crawled.
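To see how a well-behaved robot interprets those two lines, here is a minimal sketch using Python's standard urllib.robotparser module; the page URLs are placeholders.

# Feed the example rules to the standard-library robots.txt parser.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])

# With "Disallow: /" in place, no page on the site may be fetched by any robot.
print(rp.can_fetch("*", "http://www.example.com/index.html"))   # False
print(rp.can_fetch("Googlebot", "http://www.example.com/"))     # False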

How to Exclude Sub-Pages and Directories

In the same way, we can exclude specific directories and files that we do not want robots to crawl.

For example
User-agent: *
Disallow: /tmp/

The easy way to disallow several files is to put them all into a separate directory and then disallow that directory, or you can disallow each page explicitly:

For example
Disallow: /spam.html
Disallow: /code.html
and so on.
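As a quick sanity check, the same standard-library parser can confirm that only the listed paths (and anything under the disallowed directory) are blocked; the file names below simply mirror the examples above.

# Only /tmp/..., /spam.html and /code.html are blocked; other pages stay crawlable.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /tmp/",
    "Disallow: /spam.html",
    "Disallow: /code.html",
])

print(rp.can_fetch("*", "http://www.example.com/tmp/cache.html"))  # False
print(rp.can_fetch("*", "http://www.example.com/spam.html"))       # False
print(rp.can_fetch("*", "http://www.example.com/about.html"))      # True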

Another way to exclude a particular page is to add a "robots" meta tag to the HTML of that page.

For example
<meta name="robots" content="noindex, nofollow"> (if you don't want the page indexed or its links followed)
OR
<meta name="robots" content="noindex"> (if you only want to keep the page out of the search index).

We can also use wildcards in robots.txt: '*' matches any sequence of characters in a URL, and '$' anchors the pattern to the end of the URL.

Let's make this clear with an example. On a dynamic website it is very important to prevent crawling of URLs that contain duplicate content, such as URLs generated by search results.

How to Avoid Search Query URLs

For example, http://www.vmoptions.com/?search=hertz.co.uk and http://www.vmoptions.com/?search=travel-guard.co.uk are URLs dynamically generated by search results. The root of the problem is the "?search" parameter. We can add a wildcard rule to our robots.txt file that blocks every URL containing this term.

For example:
Disallow: /*?search= (robots will not crawl any URL containing this term anywhere in the URL). In the same way, to block URLs that end with .gif, you could use the following entry:
Disallow: /*.gif$

While working with wildcards in these query-blocking rules, keep in mind that robots.txt is case sensitive.
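To illustrate roughly how a wildcard-aware crawler reads such rules, here is a small Python sketch that turns a rule into a regular expression; rule_to_regex is a made-up helper for illustration only, not any crawler's real matcher, and note that the matching is case sensitive, just like robots.txt itself.

# Rough illustration of wildcard matching: '*' matches any characters, '$' anchors the end.
import re

def rule_to_regex(rule):   # hypothetical helper, for illustration only
    pattern = re.escape(rule).replace(r"\*", ".*").replace(r"\$", "$")
    return re.compile("^" + pattern)   # case sensitive by default

search_rule = rule_to_regex("/*?search=")
gif_rule = rule_to_regex("/*.gif$")

print(bool(search_rule.match("/?search=hertz.co.uk")))   # True  -> blocked
print(bool(gif_rule.match("/images/logo.gif")))          # True  -> blocked
print(bool(gif_rule.match("/images/logo.gif?v=2")))      # False -> '$' requires .gif at the very end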
