The Complete Guide to Robots.txt
Matthew Henry Sep 15 2016
Robots.txt is a small text file that lives in the root directory of a website. It tells well-behaved crawlers whether to crawl certain parts of the site or not. The file uses simple syntax so that it is easy for crawlers to parse (which makes it easy for webmasters to write, too). Write it well, and you’ll be in indexed heaven. Write it poorly, and you might end up hiding your entire site from search engines.
There is no official standard for the file. Robotstxt.org is often treated as such a resource, but this site only describes the original standard from 1994. It’s a place to start, but you can do more with robots.txt than the site outlines, such as using wildcards, sitemap links, and the “Allow” directive. All major search engines support these extensions.
In a perfect world, no one would need robots.txt. If all pages on a site are intended for public consumption, then, ideally, search engines should be allowed to crawl all of them. But we don’t live in a perfect world. Many sites have spider traps, canonical URL issues, and non-public pages that need to be kept out of search engines. Robots.txt is used to move your site closer to perfect.
How Robots.txt Works
If you’re already familiar with the directives of robots.txt but worried you’re doing it wrong, skip on down to the Common Mistakes section. If you’re new to the whole thing, read on.
Make a robots.txt file using any plain text editor. It must live in the root directory of the site and must be named “robots.txt” (yes, this is obvious). You cannot use the file in a subdirectory.
If the domain is example.com, then the robots.txt URL should be:
http://example.com/robots.txt
The HTTP specification defines ‘user-agent’ as the thing that is sending the request (as opposed to the ‘server’ which is the thing that is receiving the request). Strictly speaking, a user-agent can be anything that requests web pages, including search engine crawlers, web browsers, or obscure command line utilities.
In a robots.txt file, the user-agent directive is used to specify which crawler should obey a given set of rules. This directive can be either a wildcard to specify that rules apply to all crawlers:
User-agent: *
Or it can be the name of a specific crawler:
User-agent: Googlebot
Learn more about giving directives to multiple user-agents in Other user-agent pitfalls.
You should follow the user-agent line with one or more disallow directives:
User-agent: *
Disallow: /junk-page
The above example will block all URLs whose path starts with “/junk-page”:
It will not block any URL whose path does not start with “/junk-page”. The following URL will not be blocked:
The key thing here is that disallow is a simple text match. Whatever comes after the “Disallow:” is treated as a simple string of characters (with the notable exceptions of * and $, which I’ll get to below). This string is compared to the beginning of the path part of the URL (everything from the first slash after the domain to the end of the URL) which is also treated as a simple string. If they match, the URL is blocked. If they don’t, it isn’t.
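If it helps to see the rule as code, here is a minimal Python sketch of the comparison described above. It is an illustration only, not how any particular crawler is implemented. The helper names `path_of` and `is_disallowed` and the example URLs are made up for this sketch, and wildcards are deliberately left out.

```python
from urllib.parse import urlsplit


def path_of(url: str) -> str:
    """Return everything after the domain: the path plus any query string."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return f"{path}?{parts.query}" if parts.query else path


def is_disallowed(url: str, disallow_value: str) -> bool:
    """Plain prefix match: no wildcards, no end-of-string operator."""
    if not disallow_value:
        # An empty Disallow value blocks nothing.
        return False
    return path_of(url).startswith(disallow_value)


# Hypothetical URLs checked against "Disallow: /junk-page"
print(is_disallowed("http://example.com/junk-page", "/junk-page"))               # True
print(is_disallowed("http://example.com/junk-page?usefulness=0", "/junk-page"))  # True
print(is_disallowed("http://example.com/junk-pages-are-us", "/junk-page"))       # True
print(is_disallowed("http://example.com/subdir/junk-page", "/junk-page"))        # False
```

Note the third URL: because this is a prefix match, "/junk-pages-are-us" is blocked too.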
The Allow directive is not part of the original standard, but it is now supported by all major search engines.
You can use this directive to specify exceptions to a disallow rule, if, for example, you have a subdirectory you want to block but you want one page within that subdirectory crawled:
This example will block the following URLs:
But it will not block any of the following:
Again, this is a simple text match. The text after the “Allow:” is compared to the beginning of the path part of the URL. If they match, the page will be allowed even when there is a disallow somewhere else that would normally block it.
The wildcard operator is also supported by all major search engines. This allows you to block pages when part of the path is unknown or variable. For example:
The * (asterisk) means “match any text.” The above directive will block all the following URLs:
Be careful! The above will also block the following URLs (which might not be what you want):
Another useful extension is the end-of-string operator:
The $ means the URL must end at that point. This directive will block the following URL:
But it will not block any of the following:
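As a rough illustration of the two operators, the sketch below translates a rule path into a regular expression. This is an assumption about one reasonable way to model the behavior, not code taken from any search engine. The names `rule_to_regex` and `matches`, and the sample rules and paths, are invented for the example.

```python
import re


def rule_to_regex(rule_path: str) -> re.Pattern:
    """Translate a robots.txt path with * and a trailing $ into a regex."""
    anchored = rule_path.endswith("$")
    body = rule_path[:-1] if anchored else rule_path
    # Escape each literal piece, then join the pieces with "any text".
    pattern = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.compile(pattern + (r"\Z" if anchored else ""))


def matches(rule_path: str, url_path: str) -> bool:
    # re.match anchors at the start of the path, like the prefix rule.
    return rule_to_regex(rule_path).match(url_path) is not None


# Wildcard: * stands for any text, including slashes and query strings.
print(matches("/users/*/settings", "/users/alice/settings"))            # True
print(matches("/users/*/settings", "/users/alice/search?q=/settings"))  # True

# End-of-string: the URL must stop right where the $ is.
print(matches("/useless-page$", "/useless-page"))                       # True
print(matches("/useless-page$", "/useless-pages-and-more"))             # False
print(matches("/useless-page$", "/useless-page?a=b"))                   # False
```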
But let’s say you’re really shy. You might want to block everything using robots.txt for a staging site (more on this later) or a mirror site. If you have a private site for use by a few people who know how to find it, you’d also want to block the whole site from being crawled.
To block the entire site, use a disallow followed by a slash:
User-agent: *
Disallow: /
I can think of two reasons you might choose to create a robots.txt file when you plan to allow everything:
- As a placeholder, to make it clear to anyone else who works on the site that you are allowing everything on purpose.
- To prevent failed requests for robots.txt from showing up in the request logs.
To allow the entire site, you can use an empty disallow:
User-agent: *
Disallow:
Alternatively, you can just leave the robots.txt file blank, or not have one at all. Crawlers will crawl everything unless you tell them not to.
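One way to sanity-check the block-everything and allow-everything cases is with Python's standard-library `urllib.robotparser`. This is just one parser, and it does not implement every extension the major search engines support, so treat the sketch as a quick check rather than a statement about how any search engine behaves. The `can_crawl` helper, the `ExampleBot` name, and the URL are made up for the example.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, url: str, agent: str = "ExampleBot") -> bool:
    """Parse robots.txt text and ask whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


BLOCK_EVERYTHING = "User-agent: *\nDisallow: /\n"
ALLOW_EVERYTHING = "User-agent: *\nDisallow:\n"
EMPTY_FILE = ""

print(can_crawl(BLOCK_EVERYTHING, "http://example.com/any/page"))  # False
print(can_crawl(ALLOW_EVERYTHING, "http://example.com/any/page"))  # True
print(can_crawl(EMPTY_FILE, "http://example.com/any/page"))        # True
```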
Though it’s optional, many robots.txt files will include a sitemap directive:
This specifies the location of a sitemap file. A sitemap is a specially formatted file that lists all the URLs you want to be crawled. It’s a good idea to include this directive if your site has an XML sitemap.
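If you want to confirm that a sitemap directive is being picked up, Python's standard-library parser can read it back (the `site_maps()` method needs Python 3.8 or later). The robots.txt text and the sitemap URL below are assumptions made up for this sketch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the sitemap URL is an assumption for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /junk-page

Sitemap: http://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)  # ['http://example.com/sitemap.xml']
```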
Common Mistakes Using Robots.txt
I see many, many incorrect uses of robots.txt. The most serious are trying to use the file to keep certain directories secret and trying to use it to block hostile crawlers.
The most serious consequence of misusing robots.txt is accidentally hiding your entire site from crawlers. Pay close attention to these things.
Forgetting to un-hide when you go to production
All staging sites (that are not already hidden behind a password) should have robots.txt files that block everything, because they’re not intended for public viewing. But when your site goes live, you’ll want everyone to see it. Don’t forget to remove or edit this file.
Otherwise, the entire live site will vanish from search results.
You can check the live robots.txt file when you test, or set things up so you don’t have to remember this extra step. Put the staging server behind a password using a simple protocol like Digest Authentication. Then you can give the staging server the same robots.txt file that you intend to deploy on the live site. When you deploy, you just copy everything. As a bonus, you won’t have members of the public stumbling across your staging site.
Trying to block hostile crawlers
I have seen robots.txt files that try to explicitly block known bad crawlers, like this:
User-agent: EmailWolf 1.00
Disallow: /
This is pointless. It’s like leaving a note on the dashboard of your car that says: “Dear thieves: Please do not steal this car. Thanks!”
Robots.txt is strictly voluntary. Polite crawlers like search engines will obey it. Hostile crawlers, like email harvesters, will not. Crawlers are under no obligation to follow the guidelines in robots.txt, but major ones choose to do so.
If you’re trying to block bad crawlers, use user-agent blocking or IP blocking instead.
Trying to keep directories secret
If you have files or directories that you want to keep hidden from the public, do not EVER just list them all in robots.txt like this:
This will do more harm than good, for obvious reasons. It gives hostile crawlers a quick, easy way to find the files that you do not want them to find.
It’s like leaving a note on your car that says: “Dear thieves: Please do not look in the yellow envelope marked ‘emergency cash’ hidden in the glove compartment of this car. Thanks!”
The only reliable way to keep a directory hidden is to put it behind a password. If you absolutely cannot put it behind a password, here are three band-aid solutions.
- Block based on the first few characters of the directory name.
If the directory is “/xyz-secret-stuff/” then block it like this:
Disallow: /xyz-
- Block with robots meta tag.
Add the following to the HTML code:
<meta name="robots" content="noindex,nofollow">
- Block with the X-Robots-Tag header.
Add something like this to the directory’s .htaccess file:
Header set X-Robots-Tag "noindex,nofollow"
Again, these are band-aid solutions. None of these are substitutes for actual security. If it really needs to be kept secret, then it really needs to be behind a password.
Accidentally blocking unrelated pages
Suppose you need to block the page:
http://example.com/admin
And also everything in the directory:
http://example.com/admin/
The obvious way would be to do this:
Disallow: /admin
This will block the things you want, but now you’ve also accidentally blocked an article page about pet care:
This article will disappear from the search results along with the pages you were actually trying to block.
Yes, it’s a contrived example, but I have seen this sort of thing happen in the real world. The worst part is that it usually goes unnoticed for a very long time.
The safest way to block both /admin and /admin/ without blocking anything else is to use two separate lines:
Disallow: /admin$
Disallow: /admin/
Remember, the dollar sign is an end-of-string operator that says “URL must end here.” The directive will match /admin but not /administer.
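Here is a small Python sketch of why the two-line version is safer. The `blocks` helper is a simplified model of wildcard and end-of-string matching written for this example, not any crawler's real code, and the article path is hypothetical.

```python
import re


def blocks(rule_path: str, url_path: str) -> bool:
    """True if a Disallow rule (optional * and trailing $) matches the path."""
    anchored = rule_path.endswith("$")
    body = rule_path[:-1] if anchored else rule_path
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.match(regex + (r"\Z" if anchored else ""), url_path) is not None


ARTICLE = "/administer-medication-to-your-cat"  # hypothetical article path

# The single broad rule catches the article by accident.
print(blocks("/admin", ARTICLE))  # True

# The two narrower rules block /admin and /admin/... but not the article.
safe_rules = ["/admin$", "/admin/"]
for path in ["/admin", "/admin/", "/admin/login", ARTICLE]:
    print(path, any(blocks(rule, path) for rule in safe_rules))
# /admin True
# /admin/ True
# /admin/login True
# /administer-medication-to-your-cat False
```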
Trying to put robots.txt in a subdirectory
Suppose you only have control over one subdirectory of a huge website.
If you need to block some pages, you may be tempted to try to add a robots.txt file like this:
This does not work. The file will be ignored. The only place you can put a robots.txt file is the site root.
If you do not have access to the site root, you can’t use robots.txt. One alternative is to block the pages using robots meta tags. Or, if you have control over the .htaccess file (or equivalent), you can block pages using the X-Robots-Tag header.
Trying to target specific subdomains
Suppose you have a site with many different subdomains:
You may be tempted to create a single robots.txt file and then try to block the subdomains from it, like this:
This does not work. There is no way to specify a subdomain (or a domain) in a robots.txt file. A given robots.txt file applies only to the subdomain it was loaded from.
So is there a way to block certain subdomains? Yes. To block some subdomains and not others, you need to serve different robots.txt files from the different subdomains.
These robots.txt files would block everything:
And these would allow everything:
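If all the subdomains are served by the same application, one possible approach is to choose the robots.txt body based on the requested hostname. The sketch below assumes exactly that setup. The hostnames and the `robots_txt_for` helper are hypothetical, and wiring it into your web server or framework is left out.

```python
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
ALLOW_ALL = "User-agent: *\nDisallow:\n"

# Hypothetical hostnames that should stay out of search engines.
BLOCKED_HOSTS = {"admin.example.com", "staging.example.com"}


def robots_txt_for(host: str) -> str:
    """Pick the robots.txt body to serve for a request's Host header."""
    hostname = host.split(":")[0].lower()  # drop any port, ignore case
    return BLOCK_ALL if hostname in BLOCKED_HOSTS else ALLOW_ALL


print(robots_txt_for("staging.example.com") == BLOCK_ALL)  # True
print(robots_txt_for("www.example.com") == ALLOW_ALL)      # True
```

Each subdomain still serves its own /robots.txt; the code only decides which body goes out for which host.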
Using inconsistent type case
Paths are case sensitive. The directive:
Disallow: /acme/
will not block “/Acme/” or “/ACME/”.
If you need to block them all, you need a separate disallow line for each:
Disallow: /acme/
Disallow: /Acme/
Disallow: /ACME/
Forgetting the user-agent line
The user-agent line is critical to using robots.txt. A file must have a user-agent line before any allows or disallows. If the entire file looks like this:
Nothing will actually be blocked, because there is no user-agent line at the top. This file must read:
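You can see the effect with Python's standard-library parser, which (like the crawlers described here) ignores rules that appear before any user-agent line. Other parsers may differ in detail, so take this as a sketch. The `/private/` path, the `ExampleBot` name, and the `allowed` helper are made up for the example.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, path: str) -> bool:
    """Ask Python's parser whether a hypothetical bot may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("ExampleBot", "http://example.com" + path)


MISSING_USER_AGENT = "Disallow: /private/\n"
WITH_USER_AGENT = "User-agent: *\nDisallow: /private/\n"

print(allowed(MISSING_USER_AGENT, "/private/page"))  # True: the rule is ignored
print(allowed(WITH_USER_AGENT, "/private/page"))     # False: the rule applies
```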
Other user-agent pitfalls
There are other pitfalls of incorrect user-agent use. Say you have three directories that need to be blocked for all crawlers, and also one page that should be explicitly allowed on Google only. The obvious (but incorrect) approach might be to try something like this:
This file actually allows Google to crawl everything on the site. Googlebot (and most other crawlers) will only obey the rules under the most specific user-agent line that matches it, and will ignore all others. In this example, it will obey the rules under “User-agent: Googlebot” and will ignore the rules under “User-agent: *”.
To accomplish this goal, you need to repeat the same disallow rules for each user-agent block, like this:
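The sketch below reproduces the pitfall and the fix with Python's standard-library parser, which also applies only the matching user-agent block. The directory and page names are invented for this example. One caveat: Python's parser uses the first matching rule within a block, so the Allow line is listed first here; that ordering is a quirk of this parser, not a general requirement.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical paths; the real directories were not specified.
WRONG = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /dontcrawl/

User-agent: Googlebot
Allow: /private/public-page
"""

FIXED = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /dontcrawl/

User-agent: Googlebot
Allow: /private/public-page
Disallow: /admin/
Disallow: /private/
Disallow: /dontcrawl/
"""


def allowed(robots_txt: str, agent: str, path: str) -> bool:
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, "http://example.com" + path)


print(allowed(WRONG, "Googlebot", "/admin/secret"))         # True: oops
print(allowed(WRONG, "OtherBot", "/admin/secret"))          # False
print(allowed(FIXED, "Googlebot", "/admin/secret"))         # False
print(allowed(FIXED, "Googlebot", "/private/public-page"))  # True
```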
Forgetting the leading slash in the path
Suppose you want to block the URL:
And you have the following (incorrect) robots.txt file:
This will not block anything at all. The path must start with a slash. If it does not, it can never match anything. The correct way to block a URL is:
Tips for Using Robots.txt
Now that you know how not to send hostile crawlers right to your secret stuff or disappear your site from search results, here are some tips to help you improve your robots.txt files. Doing it well isn’t going to boost your ranking (that’s what strategic SEO and content are for, silly), but at least you’ll know the crawlers are finding what you want them to find.
Competing allows and disallows
The allow directive is used to specify exceptions to a disallow rule. The disallow rule blocks an entire directory (for example), and the allow rule unblocks some of the URLs within that directory. This raises the question, if a given URL can match either of two rules, how does the crawler decide which one to use?
Not all crawlers handle competing allows and disallows exactly the same way, but Google gives priority to the rule whose path is longer (in terms of character count). It is really that simple. If both paths are the same length, then allow has priority over disallow. For example, suppose the robots.txt file is:
Allow: /baddir/goodpage
Disallow: /baddir/
The path “/baddir/goodpage” is 16 characters long, and the path “/baddir/” is only 8 characters long. In this case, the allow wins over the disallow.
The following URLs will be allowed:
And the following will be blocked:
Now consider the following example:
Allow: /some
Disallow: /*page
Will these directives block the following URL?
http://example.com/somepage
Yes. The path “/some” is 5 characters long, and the path “/*page” is 6 characters long, so the disallow wins. The allow is ignored, and the URL will be blocked.
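The longest-path rule is easy to model in a few lines of Python. This is a sketch of the precedence logic as described in this section, not Google's actual implementation. The names `pattern_matches` and `is_allowed` and the test paths are assumptions made for the example.

```python
import re


def pattern_matches(rule_path: str, url_path: str) -> bool:
    """Match a rule path (optional * and trailing $) against a URL path."""
    anchored = rule_path.endswith("$")
    body = rule_path[:-1] if anchored else rule_path
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.match(regex + (r"\Z" if anchored else ""), url_path) is not None


def is_allowed(url_path: str, rules: list) -> bool:
    """rules is a list of (directive, path) pairs: 'allow' or 'disallow'."""
    best = None  # (path length, is_allow) of the winning rule so far
    for directive, rule_path in rules:
        if pattern_matches(rule_path, url_path):
            candidate = (len(rule_path), directive == "allow")
            # Longer path wins; on a tie, True (allow) sorts above False.
            if best is None or candidate > best:
                best = candidate
    # No matching rule at all means the URL is allowed.
    return True if best is None else best[1]


first = [("allow", "/baddir/goodpage"), ("disallow", "/baddir/")]
print(is_allowed("/baddir/goodpage", first))   # True  (16 characters beats 8)
print(is_allowed("/baddir/otherpage", first))  # False (only the disallow matches)

second = [("allow", "/some"), ("disallow", "/*page")]
print(is_allowed("/somepage", second))         # False (6 characters beats 5)
```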
Block a specific query parameter
Suppose you want to block all URLs that include the query parameter “id,” such as:
You might be tempted to do something like this:
Disallow: /*id=
This will block the URLs you want, but will also block any other query parameters that end with “id”:
So how do you block “id” without blocking “userid” or “bid”?
If you know “id” will always be the first parameter, use a question mark, like this:
Disallow: /*?id=
This directive will block:
But it will not block:
If you know “id” will never be the first parameter, use an ampersand, like this:
Disallow: /*&id=
This directive will block:
But it will not block:
The safest approach is to do both:
Disallow: /*?id=
Disallow: /*&id=
There is no reliable way to match both with a single line.
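To compare the naive rule with the two-line version, here is a short Python sketch. The `blocks` helper is a simplified model of wildcard matching written for this example, and the URLs are hypothetical.

```python
import re


def blocks(rule_path: str, url: str) -> bool:
    """True if a Disallow rule (optional * and trailing $) matches path+query."""
    anchored = rule_path.endswith("$")
    body = rule_path[:-1] if anchored else rule_path
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.match(regex + (r"\Z" if anchored else ""), url) is not None


NAIVE = ["/*id="]
SAFE = ["/*?id=", "/*&id="]

for url in ["/page?id=123", "/page?sort=asc&id=123",
            "/page?userid=9", "/page?a=b&bid=7"]:
    naive = any(blocks(rule, url) for rule in NAIVE)
    safe = any(blocks(rule, url) for rule in SAFE)
    print(url, naive, safe)
# /page?id=123 True True
# /page?sort=asc&id=123 True True
# /page?userid=9 True False
# /page?a=b&bid=7 True False
```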
Blocking URLs that contain unsafe characters
Suppose you need to block a URL that contains characters that are not URL safe. One common scenario where this can happen is when server-side template code is accidentally exposed to the web. For example:
http://example.com/search?q=<% var_name %>
If you try to block that URL like this, it won’t work:
Disallow: /search?q=<% var_name %>
If you test this directive in Google’s robots.txt testing tool (available in Search Console), you will find that it does not block the URL. Why? Because the directive is actually checked against the URL:
http://example.com/search?q=%3C%%20var_name%20%%3E
All web user-agents, including crawlers, will automatically URL-encode any characters that are not URL-safe. Those characters include: spaces, less-than or greater-than signs, single-quotes, double-quotes, and non-ASCII characters.
The correct way to block a URL containing unsafe characters is to block the escaped version:
Disallow: /search?q=%3C%%20var_name%20%%3E
The easiest way to get the escaped version of the URL is to click on the link in a browser and then copy & paste the URL from the address field.
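If you would rather generate the escaped version in code, Python's `urllib.parse.quote` can approximate what a browser does. The `safe` argument below is an assumption chosen so that the path separators, query punctuation, and literal percent signs are left alone; a real browser's encoding rules are more detailed, so double-check the result against the address bar.

```python
from urllib.parse import quote

raw_path = "/search?q=<% var_name %>"

# Percent-encode unsafe characters roughly the way a browser or crawler would.
# Keeping "/?=&%" unescaped is an assumption chosen to mirror that behavior.
escaped_path = quote(raw_path, safe="/?=&%")
print(escaped_path)  # /search?q=%3C%%20var_name%20%%3E

# A rule written with the raw characters never matches the escaped URL...
print(escaped_path.startswith("/search?q=<% var_name %>"))          # False
# ...but a rule written with the escaped characters does.
print(escaped_path.startswith("/search?q=%3C%%20var_name%20%%3E"))  # True
```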
How to match a dollar sign
Suppose you want to block all URLs that contain a dollar sign, such as:
The following will not work:
Disallow: /*$
This directive will actually block everything on the site. A dollar sign, when used at the end of a directive, means “URL ends here.” So the above will block every URL whose path starts with a slash, followed by zero or more characters, followed by the end of the URL. This rule applies to any valid URL. To get around it, the trick is to put an extra asterisk after the dollar sign, like this:
Disallow: /*$*
Here, the dollar sign is no longer at the end of the path, so it loses its special meaning. This directive will match any URL that contains a literal dollar sign. Note that the sole purpose of the final asterisk is to prevent the dollar sign from being the last character.
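The difference between the two rules shows up clearly in a quick Python model. As before, the `blocks` helper is a simplified sketch that treats a dollar sign as special only when it is the last character of the rule, which is the behavior described above; the URLs are hypothetical.

```python
import re


def blocks(rule_path: str, url: str) -> bool:
    """Model of Disallow matching: * is a wildcard, a final $ ends the URL."""
    anchored = rule_path.endswith("$")
    body = rule_path[:-1] if anchored else rule_path
    # Any $ that is not the last character is escaped and matched literally.
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.match(regex + (r"\Z" if anchored else ""), url) is not None


# "/*$" means: slash, anything, end of URL. That is every URL.
print(blocks("/*$", "/about"))                # True
print(blocks("/*$", "/store?price=$10"))      # True

# "/*$*" means: slash, anything, a literal dollar sign, anything.
print(blocks("/*$*", "/about"))               # False
print(blocks("/*$*", "/store?price=$10"))     # True
```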
Fun fact: Google, in its journey toward semantic search, will often correctly interpret misspelled or malformed directives. For example, Google will accept any of the following without complaint:
This does NOT mean you should neglect the formatting and spelling of directives, but if you do make a mistake, Google will often let you get away with it. However, other crawlers probably won’t.
Pet peeve: People often use trailing wildcards in robots.txt files. This is harmless, but it’s also useless; I consider it bad form.
For instance, a directive such as “Disallow: /somedir*” does exactly the same thing as “Disallow: /somedir”.
When I see this, I think, “This person does not understand how robots.txt works.” I see it a LOT.
Remember, robots.txt has to be in the root directory, has to start with a user-agent line, cannot block hostile crawlers, and should not be used to keep directories secret. Much of the confusion around using this file stems from the fact that people expect it to be more complex than it is. It’s really, really simple.
Now, go forth and block your pages with confidence. Just don’t block your live site, don’t rely on robots.txt to hide your secret stuff, and don’t expect it to stop hostile crawlers. I hope this guide prepared you to use robots.txt without screwing something up, but if you need more guidance, check out Robotstxt.org or Google’s Robots.txt Specifications.
Matthew Henry is Portent’s resident SEO tools developer and math wizard.