Use our Robots.txt generator to create a robots.txt file.
Use our Robots.txt analyzer to analyze your robots.txt file today.
Google also offers a similar tool inside Google Search Console (formerly Google Webmaster Central), which also shows Google crawl errors for your site.
Allow indexing of everything
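The standard way to allow everything is an empty Disallow rule (having no robots.txt file at all has the same effect):

```
User-agent: *
Disallow:
```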
Disallow indexing of everything
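A single slash blocks the entire site for all compliant crawlers:

```
User-agent: *
Disallow: /
```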
Disallow indexing of a specific folder
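Here `/folder/` is a placeholder — substitute your own path. The trailing slash limits the rule to that directory:

```
User-agent: *
Disallow: /folder/
```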
Disallow Googlebot from indexing a folder, except for one file in that folder
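With `/folder/` and `page.html` as placeholder names, the Allow line carves one file out of the blocked folder:

```
User-agent: Googlebot
Disallow: /folder/
Allow: /folder/page.html
```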
Google and Microsoft's Bing allow the use of wildcards in robots.txt files.
To block access to all URLs that include a question mark (?), you could use the following entry:
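Here `*` matches any sequence of characters, so `/*?` matches every URL that contains a question mark:

```
User-agent: *
Disallow: /*?
```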
You can use the $ character to match the end of the URL. For instance, to block any URLs that end in .asp, you could use the following entry:
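The `$` anchors the match to the end of the URL, so this blocks Googlebot from any URL ending in .asp:

```
User-agent: Googlebot
Disallow: /*.asp$
```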
Part of creating a clean and effective robots.txt file is ensuring that your site structure and filenames are created based on sound strategy. What are some of my favorite tips?
Google has begun entering search phrases into search forms, which may waste PageRank and has caused some duplicate content issues. If you do not have a lot of domain authority, you may want to consider blocking Google from indexing your search page URL. If you are unsure of the URL of your search page, you can conduct a search on your site and see what URL appears.
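For instance, on a default WordPress install the internal search URL takes the form `/?s=query`. Assuming that pattern (substitute whatever URL your own search actually uses), the block might look like:

```
User-agent: *
Disallow: /?s=
```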
The catch, as noticed by Sugarrae, is that URLs which are already indexed but are then set to noindex in robots.txt will throw errors in Google's Search Console (formerly known as Google Webmaster Tools). Google's John Mueller also recommended against using noindex in robots.txt, and as of September 1, 2019 Google no longer supports the unofficial noindex directive in robots.txt at all.
In this guest post about 301 redirects and .htaccess, Tony Spencer offers tips on how to prevent the SSL https version of your site from getting indexed alongside the http version. In the years since it was originally published, Google has indicated a preference for ranking the HTTPS version of a site over the HTTP version, but there are still ways to shoot yourself in the foot if the site is not redirected or canonicalized properly.
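One common .htaccess sketch (assuming Apache with mod_rewrite enabled) 301-redirects all HTTP traffic to HTTPS so that only one version of each URL can be indexed:

```
RewriteEngine On
RewriteCond %{HTTPS} off
RewriteRule ^(.*)$ https://%{HTTP_HOST}/$1 [R=301,L]
```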
Throughout the years some people have tried to hijack other sites using nefarious techniques with web proxies. Google, Yahoo! Search, Microsoft Live Search, and Ask all allow site owners to authenticate their bots.
Aren't we a tricky one!
Originally robots.txt only supported a disallow directive, but some search engines also support an allow directive. The allow directive is poorly documented and may be handled differently by different search engines. Semetrical shared information about how Google handles the allow directive. Their research showed:
The number of characters you use in the directive path is critical in the evaluation of an Allow against a Disallow. The rule to rule them all is as follows:
A matching Allow directive beats a matching Disallow only if it contains an equal or greater number of characters in the path
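A worked example of that character-count rule, using placeholder paths: the Disallow path `/folder/` is 8 characters and the Allow path `/folder/page.html` is 17, so for the URL `/folder/page.html` the longer Allow wins and the page stays crawlable, while everything else in `/folder/` (which matches only the Disallow) stays blocked:

```
User-agent: Googlebot
Disallow: /folder/
Allow: /folder/page.html
```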
| Tool | Crawled by Googlebot? | Appears in index? | Consumes PageRank? | Risks & notes | Format |
|---|---|---|---|---|---|
| robots.txt | no | If document is linked to, it may appear URL-only, or with data from links or trusted third-party data sources like the ODP | yes | People can look at your robots.txt file to see what content you do not want indexed; many new launches are discovered by people watching for changes in a robots.txt file. Using wildcards incorrectly can be expensive! Complex wildcards can also be used. | |
| robots meta noindex tag | yes | no | yes, but can pass on much of its PageRank by linking to other pages | Links on a noindex page are still crawled by search spiders even if the page does not appear in the search results (unless they are also nofollowed). A page using noindex in conjunction with nofollow (next row) can accumulate PageRank, but does not pass it on to other pages. | `<meta name="robots" content="noindex">` or, combined with nofollow: `<meta name="robots" content="noindex,nofollow">` |
| robots meta nofollow tag | destination page only crawled if linked to from other documents | destination page only appears if linked to from other documents | no, PageRank not passed to destination | If you are pushing significant PageRank into a page and do not allow PageRank to flow out from that page, you may waste significant link equity. | `<meta name="robots" content="nofollow">` or, combined with noindex: `<meta name="robots" content="noindex,nofollow">` |
| link rel=nofollow | destination page only crawled if linked to from other documents | destination page only appears if linked to from other documents | Using this may waste some PageRank; it is recommended for user-generated content areas. | If you are doing something borderline spammy and are using nofollow on internal links to sculpt PageRank, then you look more like an SEO and are more likely to be penalized by a Google engineer for "search spam." | `<a href="http://destination.com/" rel="nofollow">link text</a>` |
| rel=canonical | yes; multiple versions of the page may be crawled and may appear in the index | pages still appear in the index; this is taken as a hint rather than a directive | PageRank should accumulate on the destination target | With tools like 301 redirects and rel=canonical there might be some small amount of PageRank bleed, particularly with rel=canonical, since both versions of the page stay in the search index. | `<link rel="canonical" href="http://www.site.com/great-page" />` |
Want more great SEO insights? Read our SEO blog to keep up with the latest search engine news, and subscribe to our SEO training program to get cutting-edge tips we do not share with the general public. Our training program also offers exclusive SEO videos.