Field Guide to Spider Traps: An SEO’s Companion
Matthew Henry Feb 2 2016
If search engines can’t crawl your site, SEO efforts do not amount to much. One of the problems I see most often is the ‘spider trap’. Traps kill crawls and hurt indexation. Here’s how you find and fix them:
What is a spider trap?
A spider trap is a structural issue that causes a web crawler to get stuck in a loop, loading meaningless ‘junk’ pages forever. The junk pages might be any combination of the following:
- Exact duplicates (endless different URLs that point to the same page)
- Near-duplicates (pages that differ only by some small detail, e.g. the crumb trail)
- The same information presented in endless different ways (e.g. millions of different ways to sort and filter a list of 1000 products)
- Pages that are technically unique, but provide no useful information. (e.g. an event calendar that goes thousands of years into the future)
The special (worst) case: E-commerce sites
E-commerce sites are particularly good at creating spider traps. They often have product category pages you can sort and filter using multiple criteria such as price, color, style and product type. These pages often have URLs like “www.site.com/category?pricerange=1020&color=blue,red&style=long&type=pencils.”
If you have, say, ten product types, seven brands, six colors and ten price ranges, each of which can be switched on or off independently, and four ways to sort it all, then you’ll have (I had to take my socks off for this one) 2^33 × 4 = 34,359,738,368 possible combinations. This number will vary depending on the number of options. The point is, it’s a big number.
That’s not ‘infinite,’ but it’s ‘an awful lot.’
Different causes, same result
This can make it impossible for a search engine to index all of the content on the site, and can prevent the pages that do get indexed from ranking well. There are a few reasons why this is bad for SEO:
- It forces the search engines to waste most of their crawl budget loading useless near-duplicate pages. As a result, the search engines are often so busy with this, they never get around to loading all of the real pages that might otherwise rank well.
- If the trap-generated pages are duplicates of a ‘real’ page (e.g. a product page, blog post etc.) then this may prevent the original page from ranking well by diluting link equity.
- Quality-ranking algorithms like Google Panda may give the site a bad score because the site appears to consist mostly of low-quality or duplicate pages.
The result is the same: Lousy rankings. Lost revenue. Fewer leads. Unhappy bosses.
How to identify a spider trap
- Start a crawl of the site and let it run for a while.
- If the crawl eventually finishes by itself, then there is no spider trap.
- If the crawl keeps running for a very long time, then there might be a spider trap (or the site might just be very large).
- Stop the crawl.
- Export a list of URLs.
- If you find a pattern where all of the new URLs look suspiciously similar to each other, then a spider trap is likely.
- Spot-check a few of these suspiciously similar URLs in a browser.
- If the URLs all return exactly the same page, then the site definitely has a spider trap.
- If the URLs return pages that are technically slightly different, but contain the same basic information, then a spider trap is very likely.
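Spotting "suspiciously similar" URLs by eye gets tedious on a large export. The sketch below is one possible way to do it, not a feature of any particular crawler: it reduces each URL to a rough "shape" (digits in the path replaced by a placeholder, query values dropped) and counts how often each shape occurs. The function names and the shape rule are assumptions made for illustration.

```python
import re
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def url_shape(url: str) -> str:
    """Reduce a URL to a rough shape so that similar URLs collapse together."""
    parts = urlsplit(url)
    # Replace every run of digits in the path with a placeholder.
    path = re.sub(r"\d+", "N", parts.path)
    # Keep only the query parameter names (sorted); drop their values.
    names = sorted({name for name, _ in parse_qsl(parts.query, keep_blank_values=True)})
    query = "?" + "&".join(names) if names else ""
    return parts.netloc + path + query


def most_common_shapes(urls, top=5):
    """Return the most frequent URL shapes with their counts."""
    return Counter(url_shape(u) for u in urls).most_common(top)
```

If one shape accounts for most of the late-crawl URLs, those are the URLs to spot-check in a browser.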
There are a lot of ways to create spider traps. Every time I think I have seen them all, our crawler finds another. These are the most common:
Expanding URL Trap
An expanding URL trap can be especially difficult to see in a browser, because it is usually caused by one or more malformed links buried deeply in the site. As with any spider trap, the easiest way to spot it is to crawl the site with a crawler-based tool. If the site has this issue, the crawl will reveal the following things:
- At first the crawl will run normally. The spider trap will be invisible until the crawler finishes crawling most of the normal (non-trap) pages on the site. If the site is very large, this may take a while.
- At some point in the crawl, the list of crawled URLs will get stuck in an unnatural-looking pattern in which each new URL is a slightly longer near-copy of the previous one.
- Each new URL will contain an area of repeating characters that gets longer with each new step.
- As the crawl continues, the URLs will get longer and longer until they are hundreds or thousands of characters long.
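The symptoms above can be checked mechanically in an exported URL list. This is a minimal sketch under two assumptions of mine: that the repeated part is a whole path segment repeated back to back, and that 300 characters is a sensible length alarm. Both thresholds are arbitrary and should be tuned per site.

```python
from urllib.parse import urlsplit


def longest_segment_run(url: str) -> int:
    """Length of the longest run of identical consecutive path segments."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    longest = run = 0
    previous = None
    for segment in segments:
        run = run + 1 if segment == previous else 1
        longest = max(longest, run)
        previous = segment
    return longest


def looks_like_expanding_url(url: str, max_run: int = 2, max_length: int = 300) -> bool:
    """Flag URLs that are very long or repeat one path segment many times."""
    return len(url) > max_length or longest_segment_run(url) > max_run
```

Patterns where two or more segments alternate are not caught by the run check. They will only trip the length check once the URL has grown long enough.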
In most expanding URL spider traps, the file path is the part of the URL that gets longer. There are three ingredients that must all be present in order for this to happen:
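The site uses URL rewrite rules to convert path components into query parameters. For example, if the public URL is:
http://example.com/products/12345/xl/extra-large-blue-widget
then, on the server side, the rewrite rules might convert this to an internal URL that passes the product number (12345) and the size (xl) as query parameters.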
In this example the “/extra-large-blue-widget” part is discarded, because it is ‘decorative’ text, added solely to get keywords into the URL.
The rewrite rules are configured to ignore anything beyond the part of the URL they care about. For example, in:
http://example.com/products/12345/xl/extra-large-blue-widget
the rewrite rules would silently discard everything after “/products/12345/xl/”. You could replace the rest of the URL with any text at all, and the server would return exactly the same page.
The final ingredient is a malformed relative link that accidentally adds new directory levels to the current URL. There are many different ways this can happen. For example:
If the page:
http://example.com/products/12345/xl/extra-large-blue-widget
contains a link that is supposed to look like this:
<a href="/about/why-you-should-care-about-blue-widgets">
which would point to the URL:
http://example.com/about/why-you-should-care-about-blue-widgets
but the author accidentally leaves out the leading slash:
<a href="about/why-you-should-care-about-blue-widgets">
This link actually points to the URL:
http://example.com/products/12345/xl/about/why-you-should-care-about-blue-widgets
If you repeatedly click on this link, you will be taken through the following URLs:
http://example.com/products/12345/xl/about/about/why-you-should-care-about-blue-widgets

If the page:
http://example.com/products/12345/xl/extra-large-blue-widget
contains a link that is supposed to look like this:
<a href="http://www.othersite.com/">
which would point to the URL:
http://www.othersite.com/
but the author leaves out the “http://”:
<a href="www.othersite.com/">
This link actually points to the URL:
http://example.com/products/12345/xl/www.othersite.com/
If you repeatedly click on this link, you will be taken through the following URLs:
http://example.com/products/12345/xl/www.othersite.com/

If the page:
http://example.com/products/12345/xl/extra-large-blue-widget
contains a link that is supposed to look like this:
<a href="http://www.othersite.com/">
which would point to the URL:
http://www.othersite.com/
but the HTML was pasted from a word processor, which silently converted the quote marks into curly quotes:
<a href=“http://www.othersite.com/”>
This looks the same to us human beings, but to a browser or a search engine, those curly quotes are not valid quotation marks. From the browser/crawler’s point of view, this tag will look like:
<a href=(SOME UNICODE CHARACTER)http://www.othersite.com/(SOME UNICODE CHARACTER)>
As a result, the final URL will become the following mangled mess:
http://example.com/products/12345/xl/%E2%80%9Chttp://www.othersite.com/%E2%80%9D
If you repeatedly click on this link, you will be taken through the following URLs:
http://example.com/products/12345/xl/%E2%80%9Chttp://www.othersite.com/%E2%80%9Chttp://www.othersite.com/%E2%80%9D
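To see how a malformed relative link compounds, you can resolve it repeatedly the way a crawler would. The sketch below uses Python's standard `urljoin`. The helper name `follow_repeatedly` is mine, and the URLs are the ones from the first example above.

```python
from urllib.parse import urljoin


def follow_repeatedly(start_url: str, href: str, clicks: int) -> list[str]:
    """Resolve the same href against each resulting page, as a crawler would."""
    urls = []
    current = start_url
    for _ in range(clicks):
        current = urljoin(current, href)
        urls.append(current)
    return urls


page = "http://example.com/products/12345/xl/extra-large-blue-widget"

# Missing leading slash: every click adds another "about/" directory level.
for url in follow_repeatedly(page, "about/why-you-should-care-about-blue-widgets", 3):
    print(url)
```

With the leading slash restored (`/about/why-you-should-care-about-blue-widgets`), every click resolves to the same URL and the trap disappears.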
This issue can be challenging to fix. Here are some things that may help:
- Track down and fix the malformed link(s) that are creating the extra directory levels. This will fix the problem for now. Be aware that the problem is likely to return in the future if/when another bad link is added.
- If you have the technical skill, you can add rules to the server config to limit rewrites to URLs with a specific number of slashes in them. Any URL with the wrong number of slashes should not be rewritten. This will cause malformed relative links to return a 404 error (as they should).
- If all else has failed, you may be able to block the trap URLs using robots.txt.
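The idea behind the second fix, rewriting only URLs with exactly the expected number of path levels, can be sketched independently of any particular server. The example below is an illustration in Python rather than real server configuration. The URL scheme and the names `PRODUCT_URL` and `rewrite` are assumptions based on the example URLs in this article.

```python
import re

# Hypothetical URL scheme: /products/<id>/<size>/<decorative-slug>
# The slug pattern excludes "/", so extra directory levels cannot match.
PRODUCT_URL = re.compile(
    r"/products/(?P<product_id>\d+)/(?P<size>[a-z]+)/(?P<slug>[a-z0-9-]+)"
)


def rewrite(path: str):
    """Return the internal parameters for a product URL, or None for a 404."""
    match = PRODUCT_URL.fullmatch(path)
    if match is None:
        return None  # wrong number of levels: the caller should answer 404
    # The decorative slug is matched but deliberately not passed on.
    return {"product_id": match["product_id"], "size": match["size"]}
```

Because the whole path must match, a URL produced by a malformed relative link gets no rewrite and can fall through to a 404 instead of silently returning the product page.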
Mix & Match Trap
This issue happens when a site has a finite number of items that can be sorted and filtered in a virtually unlimited number of different ways. This is most common on large online stores that offer multiple ways to filter and sort lists of products.
The key things that define a mix & match trap are:
- The site offers many, many different ways to sort/filter the same list of products. For example: by category, manufacturer, color, size, price range, special offers etc.
- It is possible to mix filter types. For example, you could view a list of all products of a specific brand that are also a specified color. If it is only possible to filter by brand or filter by color, but not by both, then this is probably not a spider trap.
- Often, it is also possible to arbitrarily combine multiple choices from the same filter type. For example, by viewing a single list of all products that are any of red, blue, or mauve. A site can still have a mix & match trap without this, but it makes the issue much, much worse. It can easily increase the number of possible pages a trillion times or more.
This type of filter creates a trap because each option multiplies the number of possibilities by two or more. If there are many filters, the number of combinations can get extremely large very quickly. For example, if there are 40 different on/off filtering options, then there will be 2^40, or over a trillion, different ways to view the same list of products.
A more concrete example:
- Suppose an online store has a few thousand products. The list can be sorted by any of: price ascending, price descending, name ascending, or name descending. That’s four possible views.
- There is also an option to limit the results to just items that are on sale. This doubles the number of possible views, so the total number is now 8.
- Results can also be limited to any combination of four price ranges: $0–$10, $10–$50, $50–$200, and $200–$1000. The user may select any combination of these (e.g. they can select both $0–$10 and $200–$1000 at the same time). This increases the number of possibilities by 2^4, or 16 times, so the total number of views is now 128.
- Results can also be limited to any combination of five sizes: XS, S, M, L, or XL. This increases the number of possibilities by 2^5, or 32 times, so the total number of views is now 4096.
- Results can be limited to any combination of 17 available colors. This increases the number of possibilities by 2^17, or 131,072 times, so the total number of views is now 536,870,912.
- Last but not least, the results can also be limited to any combination of 26 possible brand names. This increases the number of possibilities by 2^26, or 67,108,864 times, so the total number of views is now 36,028,797,018,963,968. (!!!)
That’s 36 quadrillion different ways to view the same list of a few thousand products. For all practical purposes, this can be considered infinite. To make matters worse, the vast majority of these will contain zero or one items.
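The arithmetic in this example is easy to reproduce for your own site. The sketch below assumes, as the example does, that each multi-select filter with n options contributes a factor of 2^n (every option is independently on or off) and that exactly one sort order is active at a time. The function name `count_views` is mine.

```python
def count_views(sort_orders: int, multi_select_filters: list[int]) -> int:
    """Count distinct views: one sort order times every on/off combination."""
    total = sort_orders
    for options in multi_select_filters:
        total *= 2 ** options  # each option can be selected or not
    return total


# Sale toggle (1 option), price ranges (4), sizes (5), colors (17), brands (26).
print(f"{count_views(4, [1, 4, 5, 17, 26]):,}")  # prints 36,028,797,018,963,968
```

Plugging in your own filter counts gives a quick estimate of how large the crawlable URL space is before a crawler has to discover it the hard way.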
This issue can be extremely difficult to fix. The best way to deal with it is to not create the issue in the first place.
- Consider offering fewer options. Seriously. More choices is not always better.
- Depending on the URL scheme, it may be possible to limit the extent of the trap by using robots.txt to block any page with more than a small number of filters applied. This must be done very carefully. Block too much, and the crawler will no longer be able to find all of the products. Block too little, and the site will still be effectively infinite (mere billions of pages instead of quadrillions).
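Before writing a blocking rule, it helps to know how many filter choices each crawled URL carries. The sketch below is an auditing aid, not a robots.txt rule. It assumes filters live in the query string with comma-separated multiple choices, as in the example URL earlier in this article. The `NON_FILTER_PARAMS` set and the threshold of two are hypothetical.

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical parameters that are not filters and should not be counted.
NON_FILTER_PARAMS = {"sort", "page"}


def count_filter_choices(url: str) -> int:
    """Count the filter choices in a URL, treating commas as multiple choices."""
    total = 0
    for name, value in parse_qsl(urlsplit(url).query):
        if name in NON_FILTER_PARAMS:
            continue
        total += len([choice for choice in value.split(",") if choice])
    return total


def should_block(url: str, max_choices: int = 2) -> bool:
    """True when a URL carries more filter choices than we want crawled."""
    return count_filter_choices(url) > max_choices
```

Running this over a crawl export shows how many URLs a given threshold would exclude, which makes it easier to check that every product is still reachable through the URLs left unblocked.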
Calendar Trap
This issue is easy to identify. If the site has a calendar page, go to it. Try clicking the ‘next year’ (or ‘next month’) button repeatedly. If you can eventually go centuries into the future, then the site has a calendar trap.
This issue happens when an event calendar is capable of showing any future date, far beyond the point where there could plausibly be any events to display. From the search engine’s point of view, this creates an infinite supply of ‘junk’ pages—pages that contain no useful information. As spider traps go, this is a comparatively minor one, but still worth fixing.
- Add code to selectively insert a robots noindex tag into the calendar page when it is more than X years into the future.
- Block a distant future time range using robots.txt. (For example “Disallow: /calendar?year=203” to block the years 2030 through 2039—just make sure you change this before the year 2030 actually happens.)
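The first option, a conditional noindex tag, might look something like the following sketch. The function name, the month-based comparison, and the two-year default are assumptions of mine. How the tag gets into the page template depends on the site's own code.

```python
from datetime import date


def robots_meta_for_calendar(year: int, month: int, today: date,
                             max_years_ahead: int = 2) -> str:
    """Return a robots noindex tag for calendar pages too far in the future."""
    months_ahead = (year - today.year) * 12 + (month - today.month)
    if months_ahead > max_years_ahead * 12:
        return '<meta name="robots" content="noindex">'
    return ""  # near-term pages stay indexable
```

Unlike a hard-coded robots.txt rule, a check relative to today's date does not need to be updated as the years pass.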
Session ID Trap
This trap can be identified by crawling the site and checking the list of crawled URLs for session ID parameters.
The key things to look for:
- After the crawl has been running a while, all of the new URLs will include a query parameter with a name like ‘jsessionid’, ‘sessid’, ‘sid’, or the like.
- The value of this parameter will always be a long random-looking code string.
- These code strings will all be the same length.
- The code strings might all be unique, or some of them might be duplicates.
- If you open two URLs that differ only by the value of the code string they will return exactly the same page.
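The checks above can be partly automated by stripping the suspected session parameter from every crawled URL and seeing which URLs collapse into the same page. This is a rough sketch that only handles session IDs carried as ordinary query parameters. The parameter names come from the list above, and the helper names are mine.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

SESSION_PARAMS = {"jsessionid", "sessid", "sid"}


def strip_session_id(url: str) -> str:
    """Return the URL with any session ID query parameter removed."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


def group_by_page(urls):
    """Map each session-free URL to the crawled URLs that differ only by session ID."""
    groups = defaultdict(list)
    for url in urls:
        groups[strip_session_id(url)].append(url)
    return {page: members for page, members in groups.items() if len(members) > 1}
```

If `group_by_page` returns large groups, spot-check a couple of members from one group in a browser to confirm that they return the same page.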
How this works:
1. If a URL that does not have a session ID is requested, the site will redirect the request to a version of the URL that has a session ID appended.
2. If a request URL does have a session ID, then the server returns the requested page, but it appends the session ID to each of the internal links on the page.
3. In theory, if #1 and #2 above were implemented perfectly without missing any links or redirects, then all of the URLs in the crawl would wind up with the same session ID and the crawl would end normally. In actual practice, this almost never happens. The implementation inevitably misses at least a few links.
4. If the site contains even one internal link that is missing the session ID (because #2 above was implemented incompletely), then the site will generate a brand new session ID each time this link is followed. From the crawler’s point of view, each time it follows the link it will be taken to a whole new copy of the site, with a new session ID.
5. If there are any URLs that are not properly redirected when requested without a session ID (because #1 above was implemented incompletely), and the URL can also be reached through a link that does not have a session ID (because #2 was also implemented incompletely), then every link on this new page will effectively point to a brand new copy of the site.
To further complicate things, on some sites all of the above is implemented conditionally—the site first attempts to store session info in a cookie, and if this fails, then it redirects to a URL with a session ID. This really just makes the problem harder to find, because it hides the issue when the site is viewed by a human being with a browser.
To deal with this issue, you will need to remove the session IDs from all redirects and all links. Details of how to do this depend on implementation. It is critical to remove all of them. If you overlook even one source of session IDs, the crawl will still have infinite URLs.
Spider traps can have a variety of causes, and they can vary in severity from “less than optimal” to “biblical disaster”. The one thing they all have in common is they all unnecessarily throw obstacles in the search engines’ path. This will inevitably lead to incomplete crawling and a lower rank than the site deserves. The search engines are your friends. They have the potential to bring in huge amounts of highly qualified traffic. It is in your interests to make their job as easy as possible.
Spider traps are also one of the most difficult SEO problems to find, diagnose and fix. Try the techniques I have outlined above. If you have questions, please leave a comment below.
Matthew Henry is Portent's resident SEO tools developer and math wizard.