The Complete Guide to Robots.txt

Matthew Henry, SEO Fellow

Robots.txt is a small text file that lives in the root directory of a website. It tells well-behaved crawlers whether to crawl certain parts of the site or not. The file uses a simple syntax that is easy for crawlers to parse (which makes it easy for webmasters to write, too). Write it well, and you’ll be in indexed heaven. Write it poorly, and you might end up hiding your entire site from search engines.

There is no official standard for the file. Robotstxt.org is often treated as such a resource, but this site only describes the original standard from 1994. It’s a place to start, but you can do more with robots.txt than the site outlines, such as using wildcards, sitemap links, and the “Allow” directive. All major search engines support these extensions.

In a perfect world, no one would need robots.txt. If all pages on a site are intended for public consumption, then, ideally, search engines should be allowed to crawl all of them. But we don’t live in a perfect world. Many sites have spider traps, canonical URL issues, and non-public pages that need to be kept out of search engines. Robots.txt is used to move your site closer to perfect.

How Robots.txt Works

If you’re already familiar with the directives of robots.txt but worried you’re doing it wrong, skip on down to the Common Mistakes section. If you’re new to the whole thing, read on.

The file

Make a robots.txt file using any plain text editor. It must live in the root directory of the site and must be named “robots.txt” (yes, this is obvious). You cannot use the file in a subdirectory.

If the domain is example.com, then the robots.txt URL should be:

http://example.com/robots.txt

The HTTP specification defines ‘user-agent’ as the thing that is sending the request (as opposed to the ‘server’ which is the thing that is receiving the request). Strictly speaking, a user-agent can be anything that requests web pages, including search engine crawlers, web browsers, or obscure command line utilities.

User-agent directive

In a robots.txt file, the user-agent directive is used to specify which crawler should obey a given set of rules. This directive can be either a wildcard to specify that rules apply to all crawlers:

User-agent: *

Or it can be the name of a specific crawler:

User-agent: Googlebot

Learn more about giving directives to multiple user-agents in Other user-agent pitfalls.

Disallow directive

Follow the user-agent line with one or more disallow directives:

User-agent: *
Disallow: /junk-page

The above example will block all URLs whose path starts with “/junk-page”:

http://example.com/junk-page
http://example.com/junk-page?usefulness=0
http://example.com/junk-page/whatever
http://example.com/junk-pages-and-how-to-keep-them-out-of-search-results

It will not block any URL whose path does not start with “/junk-page”. The following URL will not be blocked:

http://example.com/subdir/junk-page

The key thing here is that disallow is a simple text match. Whatever comes after the “Disallow:” is treated as a simple string of characters (with the notable exceptions of * and $, which I’ll get to below). This string is compared to the beginning of the path part of the URL (everything from the first slash after the domain to the end of the URL) which is also treated as a simple string. If they match, the URL is blocked. If they don’t, it isn’t.
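
To make that concrete, here’s a minimal Python sketch of the comparison for a plain (wildcard-free) rule. The helper name is mine, not from any specification:

def is_blocked(rule_path, url_path):
    # A plain Disallow rule is just a prefix match against the URL path.
    return url_path.startswith(rule_path)

print(is_blocked("/junk-page", "/junk-page?usefulness=0"))  # True
print(is_blocked("/junk-page", "/subdir/junk-page"))        # False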

Allow directive

The Allow directive is not part of the original standard, but it is now supported by all major search engines.

You can use this directive to specify exceptions to a disallow rule, if, for example, you have a subdirectory you want to block but you want one page within that subdirectory crawled:

User-agent: *
Allow: /nothing-good-in-here/except-this-one-page
Disallow: /nothing-good-in-here/

This example will block the following URLs:

http://example.com/nothing-good-in-here/
http://example.com/nothing-good-in-here/somepage
http://example.com/nothing-good-in-here/otherpage
http://example.com/nothing-good-in-here/?x=y

But it will not block any of the following:

http://example.com/nothing-good-in-here/except-this-one-page
http://example.com/nothing-good-in-here/except-this-one-page-because-i-said-so
http://example.com/nothing-good-in-here/except-this-one-page/that-is-really-a-directory
http://example.com/nothing-good-in-here/except-this-one-page?a=b&c=d

Again, this is a simple text match. The text after the “Allow:” is compared to the beginning of the path part of the URL. If they match, the page will be allowed even when there is a disallow somewhere else that would normally block it.

Wildcards

The wildcard operator is also supported by all major search engines. This allows you to block pages when part of the path is unknown or variable. For example:

Disallow: /users/*/settings

The * (asterisk) means “match any text.” The above directive will block all the following URLs:

http://example.com/users/alice/settings
http://example.com/users/bob/settings
http://example.com/users/tinkerbell/settings
http://example.com/users/chthulu/settings

Be careful! The above will also block the following URLs (which might not be what you want):

http://example.com/users/alice/extra/directory/levels/settings
http://example.com/users/alice/search?q=/settings
http://example.com/users/alice/settings-for-your-table

End-of-string operator

Another useful extension is the end-of-string operator:

Disallow: /useless-page$

The $ means the URL must end at that point. This directive will block the following URL:

http://example.com/useless-page

But it will not block any of the following:

http://example.com/useless-pages-and-how-to-avoid-creating-them
http://example.com/useless-page/
http://example.com/useless-page?a=b
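
If it helps to think of these two operators in regex terms, here’s a rough Python sketch of how a directive’s path could be translated. This is my own illustration, not how any particular crawler is implemented:

import re

def rule_to_regex(pattern):
    # "*" matches any run of characters (including slashes); a trailing "$"
    # anchors the match at the end of the URL. Everything else is literal.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.compile(body + ("$" if anchored else ""))

print(bool(rule_to_regex("/users/*/settings").match("/users/alice/settings")))  # True
print(bool(rule_to_regex("/useless-page$").match("/useless-page?a=b")))         # False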

Blocking everything

But let’s say you’re really shy. You might want to block everything using robots.txt for a staging site (more on this later) or a mirror site. If you have a private site for use by a few people who know how to find it, you’d also want to block the whole site from being crawled.

To block the entire site, use a disallow followed by a slash:

User-agent: *
Disallow: /

Allowing everything

I can think of two reasons you might choose to create a robots.txt file when you plan to allow everything:

  • As a placeholder, to make it clear to anyone else who works on the site that you are allowing everything on purpose.
  • To prevent failed requests for robots.txt from showing up in the request logs.

To allow the entire site, you can use an empty disallow:

User-agent: *
Disallow:

Alternatively, you can just leave the robots.txt file blank, or not have one at all. Crawlers will crawl everything unless you tell them not to.

Sitemap directive

Though it’s optional, many robots.txt files will include a sitemap directive:

Sitemap: http://example.com/sitemap.xml

This specifies the location of a sitemap file. A sitemap is a specially formatted file that lists all the URLs you want to be crawled. It’s a good idea to include this directive if your site has an XML sitemap.
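
Putting the pieces together, a small robots.txt file that blocks a couple of directories and advertises a sitemap might look like this (the paths here are placeholders, not a recommendation):

User-agent: *
Disallow: /admin/
Disallow: /private/

Sitemap: http://example.com/sitemap.xml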

Common Mistakes Using Robots.txt

I see many, many incorrect uses of robots.txt. Among the worst are trying to use the file to keep certain directories secret or to block hostile crawlers.

The most serious consequence of misusing robots.txt, though, is accidentally hiding your entire site from crawlers. Pay close attention to the following pitfalls.

Forgetting to un-hide when you go to production

All staging sites (that are not already hidden behind a password) should have robots.txt files that block crawling, because they’re not intended for public viewing. But when your site goes live, you’ll want everyone to see it. Don’t forget to remove or edit this file.

Otherwise, the entire live site will vanish from search results, all thanks to these two little lines:

User-agent: *
Disallow: /

You can check the live robots.txt file when you test, or set things up so you don’t have to remember this extra step. Put the staging server behind a password using a simple protocol like Digest Authentication. Then you can give the staging server the same robots.txt file that you intend to deploy on the live site. When you deploy, you just copy everything. As a bonus, you won’t have members of the public stumbling across your staging site.
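
If the staging server happens to run Apache, a bare-bones Digest Authentication setup in the staging site’s .htaccess might look something like this sketch. The realm name and file path are placeholders, and mod_auth_digest must be enabled:

# Create the password file first, e.g.: htdigest -c /path/to/.htdigest "Staging" someuser
AuthType Digest
AuthName "Staging"
AuthUserFile /path/to/.htdigest
Require valid-user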

Trying to block hostile crawlers

I have seen robots.txt files that try to explicitly block known bad crawlers, like this:

User-agent: DataCha0s/2.0
Disallow: /
User-agent: ExtractorPro
Disallow: /
User-agent: EmailSiphon
Disallow: /
User-agent: EmailWolf 1.00
Disallow: /

This is pointless. It’s like leaving a note on the dashboard of your car that says: “Dear thieves: Please do not steal this car. Thanks!”

Robots.txt is strictly voluntary. Crawlers are under no obligation to follow it; polite crawlers, like the major search engines, choose to obey it, while hostile crawlers, like email harvesters, simply ignore it.

If you’re trying to block bad crawlers, use user-agent blocking or IP blocking instead.
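
On an Apache server with mod_rewrite, for example, user-agent blocking might be sketched in .htaccess roughly like this (the bot names are just the ones from the example above; anything that matches gets a 403):

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (DataCha0s|ExtractorPro|EmailSiphon|EmailWolf) [NC]
RewriteRule .* - [F,L]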

Trying to keep directories secret

If you have files or directories that you want to keep hidden from the public, do not EVER just list them all in robots.txt like this:

User-agent: *
Disallow: /secret-stuff/
Disallow: /compromising-photo.jpg
Disallow: /big-list-of-plaintext-passwords.csv

This will do more harm than good. It gives hostile crawlers a quick, easy list of exactly the files you do not want them to find.

It’s like leaving a note on your car that says: “Dear thieves: Please do not look in the yellow envelope marked ‘emergency cash’ hidden in the glove compartment of this car. Thanks!”

The only reliable way to keep a directory hidden is to put it behind a password. If you absolutely cannot put it behind a password, here are three band-aid solutions.

  1. Block based on the first few characters of the directory name.
     If the directory is “/xyz-secret-stuff/” then block it like this:

     Disallow: /xyz-

  2. Block with robots meta tag.
     Add the following to the HTML code:

     <meta name="robots" content="noindex,nofollow">

  3. Block with the X-Robots-Tag header.
     Add something like this to the directory’s .htaccess file:

     Header set X-Robots-Tag "noindex,nofollow"

Again, these are band-aid solutions. None of these are substitutes for actual security. If it really needs to be kept secret, then it really needs to be behind a password.

Accidentally blocking unrelated pages

Suppose you need to block the page:

http://example.com/admin

And also everything in the directory:

http://example.com/admin/

The obvious way would be to do this:

Disallow: /admin

This will block the things you want, but now you’ve also accidentally blocked an article page about pet care:

http://example.com/administer-medication-to-your-cat-the-easy-way.html

This article will disappear from the search results along with the pages you were actually trying to block.

Yes, it’s a contrived example, but I have seen this sort of thing happen in the real world. The worst part is that it usually goes unnoticed for a very long time.

The safest way to block both /admin and /admin/ without blocking anything else is to use two separate lines:

Disallow: /admin$
Disallow: /admin/

Remember, the dollar sign is an end-of-string operator that says “URL must end here.” The first directive will match /admin but not /administer, while the second covers everything inside the /admin/ directory.

Trying to put robots.txt in a subdirectory

Suppose you only have control over one subdirectory of a huge website.

http://example.com/userpages/yourname/

If you need to block some pages, you may be tempted to try to add a robots.txt file like this:

http://example.com/userpages/yourname/robots.txt

This does not work. The file will be ignored. The only place you can put a robots.txt file is the site root.

If you do not have access to the site root, you can’t use robots.txt. One alternative is to block the pages with robots meta tags. Or, if you have control over the .htaccess file (or equivalent), you can block them with the X-Robots-Tag header.

Trying to target specific subdomains

Suppose you have a site with many different subdomains:

http://example.com/
http://admin.example.com/
http://members.example.com/
http://blog.example.com/
http://store.example.com/

You may be tempted to create a single robots.txt file and then try to block the subdomains from it, like this:

http://example.com/robots.txt
 
User-agent: *
Disallow: admin.example.com
Disallow: members.example.com

This does not work. There is no way to specify a subdomain (or a domain) in a robots.txt file. A given robots.txt file applies only to the subdomain it was loaded from.

So is there a way to block certain subdomains? Yes. To block some subdomains and not others, you need to serve different robots.txt files from the different subdomains.

These robots.txt files would block everything:

http://admin.example.com/robots.txt
http://members.example.com/robots.txt
 
User-agent: *
Disallow: /

And these would allow everything:

http://example.com/
http://blog.example.com/
http://store.example.com/
 
User-agent: *
Disallow:

Using inconsistent type case

Paths are case sensitive.

Disallow: /acme/

Will not block “/Acme/” or “/ACME/”.

If you need to block them all, you need a separate disallow line for each:

Disallow: /acme/
Disallow: /Acme/
Disallow: /ACME/

Forgetting the user-agent line

The user-agent line is critical to using robots.txt. A file must have a user-agent line before any allows or disallows. If the entire file looks like this:

Disallow: /this
Disallow: /that
Disallow: /whatever

Nothing will actually be blocked, because there is no user-agent line at the top. This file must read:

User-agent: *
Disallow: /this
Disallow: /that
Disallow: /whatever

Other user-agent pitfalls

There are other pitfalls of incorrect user-agent use. Say you have three directories that need to be blocked for all crawlers, and also one page that should be explicitly allowed on Google only. The obvious (but incorrect) approach might be to try something like this:

User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /dontcrawl/
User-agent: Googlebot
Allow: /dontcrawl/exception

This file actually allows Google to crawl everything on the site. Googlebot (and most other crawlers) will only obey the rules under the most specific matching user-agent line and will ignore all others. In this example, it will obey the rules under “User-agent: Googlebot” and will ignore the rules under “User-agent: *”.

To accomplish this goal, you need to repeat the same disallow rules for each user-agent block, like this:

User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /dontcrawl/
User-agent: Googlebot
Disallow: /admin/
Disallow: /private/
Disallow: /dontcrawl/
Allow: /dontcrawl/exception

Forgetting the leading slash in the path

Suppose you want to block the URL:

http://example.com/badpage

And you have the following (incorrect) robots.txt file:

User-agent: *
Disallow: badpage

This will not block anything at all. The path must start with a slash. If it does not, it can never match anything. The correct way to block a URL is:

User-agent: *
Disallow: /badpage

Tips for Using Robots.txt

Now that you know how not to send hostile crawlers right to your secret stuff or disappear your site from search results, here are some tips to help you improve your robots.txt files. Doing it well isn’t going to boost your ranking (that’s what strategic SEO and content are for, silly), but at least you’ll know the crawlers are finding what you want them to find.

Competing allows and disallows

The allow directive is used to specify exceptions to a disallow rule: the disallow rule blocks an entire directory (for example), and the allow rule unblocks some of the URLs within that directory. This raises a question: if a given URL matches both an allow and a disallow, how does the crawler decide which one to use?

Not all crawlers handle competing allows and disallows exactly the same way, but Google gives priority to the rule whose path is longer (in terms of character count). It is really that simple. If both paths are the same length, then allow has priority over disallow. For example, suppose the robots.txt file is:

User-agent: *
Allow: /baddir/goodpage
Disallow: /baddir/

The path “/baddir/goodpage” is 16 characters long, and the path “/baddir/” is only 8 characters long. In this case, the allow wins over the disallow.

The following URLs will be allowed:

http://example.com/baddir/goodpage
http://example.com/baddir/goodpagesarehardtofind
http://example.com/baddir/goodpage?x=y

And the following will be blocked:

http://example.com/baddir/
http://example.com/baddir/otherpage
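
For plain prefix rules like these (no wildcards), the precedence logic boils down to something like this Python sketch. It’s a rough approximation of Google’s documented behavior, not the official algorithm:

def is_allowed(path, rules):
    # rules is a list of ("Allow" or "Disallow", pattern) pairs.
    # The longest matching pattern wins; on a tie, Allow beats Disallow;
    # if nothing matches, the URL is allowed by default.
    best_len, allowed = -1, True
    for kind, pattern in rules:
        if path.startswith(pattern):
            if len(pattern) > best_len or (len(pattern) == best_len and kind == "Allow"):
                best_len, allowed = len(pattern), (kind == "Allow")
    return allowed

rules = [("Allow", "/baddir/goodpage"), ("Disallow", "/baddir/")]
print(is_allowed("/baddir/goodpage?x=y", rules))  # True  (16 beats 8)
print(is_allowed("/baddir/otherpage", rules))     # False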

Now consider the following example:

User-agent: *
Allow: /some
Disallow: /*page

Will these directives block the following URL?

http://example.com/somepage

Yes. The path “/some” is 5 characters long, and the path “/*page” is 6 characters long, so the disallow wins. The allow is ignored, and the URL will be blocked.

Block a specific query parameter

Suppose you want to block all URLs that include the query parameter “id,” such as:

http://example.com/somepage?id=123
http://example.com/somepage?a=b&id=123

You might be tempted to do something like this:

Disallow: /*id=

This will block the URLs you want, but it will also block any URL where some other query parameter name ends with “id”:

http://example.com/users?userid=a0f3e8201b
http://example.com/auction?num=9172&bid=1935.00

So how do you block “id” without blocking “userid” or “bid”?

If you know “id” will always be the first parameter, use a question mark, like this:

Disallow: /*?id=

This directive will block:

http://example.com/somepage?id=123

But it will not block:

http://example.com/somepage?a=b&id=123

If you know “id” will never be the first parameter, use an ampersand, like this:

Disallow: /*&id=

This directive will block:

http://example.com/somepage?a=b&id=123

But it will not block:

http://example.com/somepage?id=123

The safest approach is to do both:

Disallow: /*?id=
Disallow: /*&id=

There is no reliable way to match both with a single line.

Blocking URLs that contain unsafe characters

Suppose you need to block a URL that contains characters that are not URL safe. One common scenario where this can happen is when server-side template code is accidentally exposed to the web. For example:

http://example.com/search?q=<% var_name %>

If you try to block that URL like this, it won’t work:

User-agent: *
Disallow: /search?q=<% var_name %>

If you test this directive in Google’s robots.txt testing tool (available in Search Console), you will find that it does not block the URL. Why? Because the directive is actually checked against the URL:

http://example.com/search?q=%3C%%20var_name%20%%3E

All web user-agents, including crawlers, will automatically URL-encode any characters that are not URL-safe. Those characters include: spaces, less-than or greater-than signs, single-quotes, double-quotes, and non-ASCII characters.

The correct way to block a URL containing unsafe characters is to block the escaped version:

User-agent: *
Disallow: /search?q=%3C%%20var_name%20%%3E

The easiest way to get the escaped version of the URL is to click on the link in a browser and then copy & paste the URL from the address field.
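
If you’d rather compute the escaped form than copy it from a browser, Python’s standard library can do it. A small sketch; the “safe” character set here is an assumption chosen to leave the URL’s structural characters (and the literal percent signs) untouched, so the output matches the escaped URL above:

from urllib.parse import quote

path = "/search?q=<% var_name %>"
# Escape unsafe characters (spaces, angle brackets) the way a browser would.
print(quote(path, safe="/?&=%"))  # /search?q=%3C%%20var_name%20%%3E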

How to match a dollar sign

Suppose you want to block all URLs that contain a dollar sign, such as:

http://example.com/store?price=$10

The following will not work:

Disallow: /*$

This directive will actually block everything on the site. A dollar sign, when used at the end of a directive, means “URL ends here.” So the above will block every URL whose path starts with a slash, followed by zero or more characters, followed by the end of the URL, which describes every valid URL. To get around this, the trick is to put an extra asterisk after the dollar sign, like this:

Disallow: /*$*

Here, the dollar sign is no longer at the end of the path, so it loses its special meaning. This directive will match any URL that contains a literal dollar sign. Note that the sole purpose of the final asterisk is to prevent the dollar sign from being the last character.

An Addendum

Fun fact: Google, in its journey toward semantic search, will often correctly interpret misspelled or malformed directives. For example, Google will accept any of the following without complaint:

UserAgent: *
Disallow /this
Dissalow: /that

This does NOT mean you should neglect the formatting and spelling of directives, but if you do make a mistake, Google will often let you get away with it. However, other crawlers probably won’t.

Pet peeve: People often use trailing wildcards in robots.txt files. This is harmless, but it’s also useless; I consider it bad form.

For instance:

Disallow: /somedir/*

Does exactly the same thing as:

Disallow: /somedir/

When I see this, I think, “This person does not understand how robots.txt works.” I see it a LOT.

Summary

Remember, robots.txt has to be in the root directory, has to start with a user-agent line, cannot block hostile crawlers, and should not be used to keep directories secret. Much of the confusion around using this file stems from the fact that people expect it to be more complex than it is. It’s really, really simple.

Now, go forth and block your pages with confidence. Just don’t block your live site, don’t advertise your secret stuff, and don’t expect it to stop hostile crawlers. I hope this guide prepared you to use robots.txt without screwing something up, but if you need more guidance, check out Robotstxt.org or Google’s Robots.txt Specifications.

Matthew Henry, SEO Fellow

As an SEO Fellow, Matthew is Portent's resident SEO tools developer and math wizard.

Comments

    1. In your example, Google would be blocked, and all other crawlers would be allowed. In general, Google will obey whichever rules have the most specific user-agent, and will ignore all others. In this case, everything under “User-Agent: GoogleBot” will be obeyed (by Google), and everything under “User-Agent: *” will be ignored.

    1. I’m pretty sure blocking spiders in robots.txt only serves to make your PBN (or any site for that matter) more suspicious. If you’re going to use a PBN you should make it look as natural as possible – and a “natural” blog would never block spiders.

  1. This is great, thanks!
    I most definitely have never, ever pushed a site live with robots.txt blocking all crawlers. Nope, never. It’s such a non-issue, I definitely don’t compulsively check newly launched sites.

  2. Hey Matthew,
    What’s your take on the Crawl-Delay directive in robots.txt? I know some bots ignore this directive, but some do follow it, right? Is it useful or a waste of time? Would love to hear your take on it!

    1. I tend to think of crawl-delay as a non-standard extension. It is not supported by Google at all. It is supported by Bing and a few others however. The basic idea is that you can specify “Crawl-delay: X” (in which X is a number) and then the crawler will only request a maximum of one page per X seconds. In practice, this isn’t as useful as it sounds. It’s generally to your advantage to get as much of your content indexed as possible, as quickly as possible. Deliberately slowing the crawler down is likely to lead to incomplete crawling, and is usually not necessary. A properly configured server should be able to handle search engine crawler traffic just fine. If your server is so flaky that it can’t handle being crawled, then you really, really need to replace your server.

  3. In the robots.txt syntax, I still can’t figure out whether the * wildcard can match slashes within URLs. For example, let’s say I don’t want any directory named “foo” to be crawled, no matter how deep in the site hierarchy. So I’d want to block …mysite/foo/ but also block …mysite/blobby/foo/whatever/ Would this work?:

    User-agent: *
    Disallow: *foo*

    … or does it need to be:
    Disallow: */foo/

    …or something else?

    Another question, does it matter whether Disallow: and the path is separated by a space, or a tab?

    Thanks much!

    1. The wildcard matches ANY characters, including slashes.
      So the following:

      User-agent: *
      Disallow: *foo

      would block any path that contains “foo”, possibly including some things you might not want to block, like:

      https://example.com/store/petfood/

      If you specifically want to block a directory named “foo”, but do not want to block other directories or files that merely contain the word “foo”, then you need to include the slashes, as in your second example:

      User-agent: *
      Disallow: */foo/

      This will block:

      https://example.com/foo/
      https://example.com/foo/somefile
      https://example.com/somedir/foo/somefile

      But it will not block:

      https://example.com/store/petfood/

      To answer your other question, if you use a tab instead of a space, it will (at the time I am writing this) work exactly the same way as a space, at least on Google. (I just tested this.) However, this is undocumented behavior and therefore not something you should rely upon. I would just use a space, because that’s what the specification says you should do.

  4. Can a Disallow directive include MULTIPLE wildcards in a single line?

    E.g. Disallow: /base/*/subone/*/subtwo/

    Which, if it works/is meaningful, would block URLs like:
    /base/abc/subone/def/ghi/subtwo
    /base/subone/123/456/789/subtwo/anything

    1. Yes, you can indeed do this, and it works as you describe. This can come in handy if you have a really convoluted directory structure. If you do find yourself needing to do this often though, it might be worth it to simplify your directory structure.

  5. hi, is it possible to limit the depth of folder/subfolders. I want to disallow to crawl deeper than 3 subfolder. Is it possible or would disallow:/*/*/*/ solve the issue.
    Thanks
    Michael

  6. Hi Matt. I have a question regarding crawl priority within the robots.txt file. If I am looking to ensure that a group of urls are prioritized by crawlers within a specific search engine. For Instance Googlebot.
    Is it best practice to then include an allow crawler for a content grouping?

    such as:
    User-agent: Googlebot
    Allow: /acme-anvils

    1. Robots.txt does not affect crawl prioritization, at least not for any search engine that I’m aware of. You can either allow crawling a page, or you can disallow it. That’s it. There is no way to allow some pages “more” and allow other pages “less”. Priority will be chosen by the search engines, using whatever arcane rules they choose to follow. Your best bet is to make sure your new content has prominent links from one or more high-profile, frequently updated pages on the site.
