What is Robots.txt
Whenever a crawler visits a website, it first looks for the robots.txt file on that site, which tells the robot which pages it should not visit. A point to remember is that robots.txt is only an instruction to web robots; a robot can choose to ignore it.
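As an illustration, here is a minimal sketch (in Python, using the standard library's urllib.robotparser module) of how a polite crawler might consult robots.txt before fetching a page; the site address and crawler name are just placeholders:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")   # placeholder site
rp.read()   # download and parse the file

# Ask whether a given robot may fetch a given URL.
if rp.can_fetch("MyCrawler", "https://www.example.com/private/page.html"):
    print("Allowed to crawl this page")
else:
    print("Disallowed by robots.txt")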
How to Add robots.txt
For example
User-agent: *   # you can also name a particular robot here, such as Googlebot
Disallow: /
In the example above, "User-agent: *" means the section applies to all robots, and "Disallow: /" tells them not to visit any page on the site. There is no need for an "Allow" instruction here, because allowing is the default behaviour.
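To see what these two lines mean in practice, here is a small sketch (again Python's urllib.robotparser, with placeholder URLs) showing that every robot is refused every path:

import urllib.robotparser

rules = [
    "User-agent: *",
    "Disallow: /",
]
rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("Googlebot", "http://www.example.com/"))             # False
print(rp.can_fetch("SomeOtherBot", "http://www.example.com/page.html"))  # False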
How to Avoid Sub-Pages/Directories
For example
User-agent: *
Disallow: /tmp/
The easy way to disallow several files is to put them all into a separate directory and then disallow that directory. Alternatively, you can explicitly disallow each page.
For example
Disallow: /spam.html
Disallow: /code.html
and so on.
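Both approaches can be checked with the same kind of sketch; assuming the example paths above, the directory rule blocks everything under /tmp/, while the per-page rules block only the listed files:

import urllib.robotparser

by_directory = ["User-agent: *", "Disallow: /tmp/"]
by_page = ["User-agent: *", "Disallow: /spam.html", "Disallow: /code.html"]

for label, rules in (("directory rule", by_directory), ("per-page rules", by_page)):
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(rules)
    print(label,
          rp.can_fetch("MyCrawler", "http://www.example.com/tmp/draft.html"),
          rp.can_fetch("MyCrawler", "http://www.example.com/spam.html"))
# directory rule: False True  -> everything under /tmp/ is blocked, /spam.html is not
# per-page rules: True False  -> only the listed pages are blocked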
The other way to exclude a particular page is to put the "robots" meta tag in the code of that page.
For example
<meta name="robots" content="noindex, nofollow"> (if you don't want robots to index the page or follow its links)
OR
<meta name="robots" content="noindex"> (if you only want to keep the page out of search results).
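For illustration, a crawler that respects this tag might scan the page's HTML for it along these lines (a rough Python sketch using the standard html.parser module; the sample page is made up):

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives found in a page's robots meta tag."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives += [d.strip().lower() for d in content.split(",")]

page = '<html><head><meta name="robots" content="noindex, nofollow"></head><body>Hello</body></html>'
parser = RobotsMetaParser()
parser.feed(page)

print("noindex" in parser.directives)   # True: keep this page out of the index
print("nofollow" in parser.directives)  # True: do not follow links on this page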
We can also use wildcards in robots.txt: '*' matches any sequence of characters, and '$' matches the end of the URL. Let's make this clear with an example. On a dynamic website it is very important to stop crawlers from visiting URLs that contain duplicate content, such as URLs created by search results.
How To Avoid Search Queries
URLs like http://www.vmoptions.com/?search=hertz.co.uk and http://www.vmoptions.com/?search=travel-guard.co.uk are generated dynamically by search results. The cause of the problem is the "?search" parameter. We can add a wildcard rule to our robots.txt file that blocks every URL containing this term.
For example:
Disallow: /*?search=
This rule blocks any URL containing "?search=" anywhere in it. In the same way, to block URLs that end with .gif, you could use the following entry:
Disallow: /*.gif$
While working with wildcards, keep in mind that robots.txt rules are case-sensitive.
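As a final illustration, the wildcard matching described above can be sketched in Python with a small helper (rule_matches is a made-up function, not part of any library) that translates a robots.txt pattern into a regular expression:

import re

def rule_matches(rule, path):
    # Hypothetical helper: turn a robots.txt pattern into a regular expression.
    # '*' becomes '.*' and a trailing '$' anchors the match at the end of the URL.
    pattern = re.escape(rule).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    return re.match(pattern, path) is not None

print(rule_matches("/*?search=", "/?search=hertz.co.uk"))       # True  - blocked
print(rule_matches("/*.gif$", "/images/photo.gif"))             # True  - blocked
print(rule_matches("/*.gif$", "/images/photo.gif?size=large"))  # False - does not end in .gif
print(rule_matches("/*.gif$", "/images/PHOTO.GIF"))             # False - matching is case-sensitive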