Faceted Navigation and SEO: A Deeper Look

Matthew Henry Aug 26 2018

The complex web of factors that determine page counts for a site with faceted navigation. It's about the SEO, folks

tl;dr: Skip to each “Takeaways” section if you want a few ideas for handling faceted navigation and SEO. But do so at your own risk. The “why” is as important as the “what.”

Helpfullee brought up an excellent point, asking whether we know that search engines actually generate and crawl these faceted pages. The answer is “yes.” But don’t take our word for it: Googlebot is fickle, and every site’s different. Review your log files, filtering for Googlebot. That’ll show you whether Googlebot is hitting all the thin or content-free pages generated by faceted navigation.

If you have ever shopped for anything online, you’ve seen faceted navigation. This is the list of clickable options, usually in the left panel, that can be used to filter results by brand, price, color, etc. Faceted navigation makes it possible to mix & match options in any combination the user wishes. It’s popular on large online stores because it allows the user to precisely drill down to only the things they are interested in.

An example of faceted navigation

But this can cause huge problems for search engines because it generates billions of useless near-duplicate pages. This wastes crawl budget, lowers the chances that all of the real content will get indexed, and it gives the search engines the message that the site is mostly low-quality junk pages (because, at this point, it is).

Many articles talk about faceted navigation and how to mitigate the SEO problems that it causes. Those are reactive strategies: How to prevent the search engines from crawling and indexing the billions of pages your faceted navigation created.

This is not one of those how-to articles.

Instead, it’s about the decisions that create massive duplication and how to avoid them from the start. It’s about the seemingly innocuous UX choices and their unintended consequences. My goal is to give you a deeper understanding of how each decision affects crawlability and final page counts. I’m hoping this will give you knowledge you can use, both to avoid problems before they start and to mitigate problems that can’t be avoided.

Match Types and Grouping

Faceted navigation is typically divided into groups, with a list of clickable options in each group. There might be one group for brand names, another for sizes, another for colors, etc. The options in a group can be combined in any of a few different ways:

  • “AND” matching — With this match type, the store only shows an item if it matches all of the selected options. “AND” matching is most often used for product features where it is assumed the shopper is looking for a specific combination of features, and is only interested in a product if it has all of them. (e.g., headphones that are both wireless and noise-canceling)
  • “OR” matching — With this match type, the store shows items that match any of the selected options. This can be used for lists of brand names, sizes, colors, price ranges, and many other things. The assumption here is that the user is interested in a few different things, and wants to see a combined list that includes all of them. (e.g., all ski hats available in red, pink or yellow).
  • “Radio button” matching — With this match type, only one option may be selected at a time. Selecting one option deselects all others. The assumption here is that the options are 100% mutually exclusive, and nobody would be interested in seeing more than one of them at a time. Radio buttons are often used to set sort order. They are also sometimes used to choose between mutually exclusive categories (e.g., specifying the smartphone brand/model when shopping for phone cases). Some radio button implementations require at least one selected option (e.g., for sort order), and others don’t (e.g., for categories).

The options within a given group can be combined using any one of these match types, but the groups themselves are almost always combined with each other using “AND” matching. For example, if you select red and green from the “colors” group, and you select XL and XXL from the “sizes” group, then you will get a list of every item that is both one of those two colors and one of those two sizes.
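The combining rule described above can be sketched in a few lines of Python. This is a minimal illustration, not any particular store's implementation: the data layout (an item as a dict of option sets, a selection as a dict of chosen options per group) and the names `group_matches` and `item_matches` are assumptions made for the example.

```python
def group_matches(item_values, selected, match_type):
    """Check one facet group against one item.

    item_values: set of option values the item has for this group.
    selected:    set of option values the shopper has chosen in this group.
    match_type:  "and", "or", or "radio" (radio behaves like "or" with at
                 most one selected value).
    """
    if not selected:
        return True  # nothing selected in this group, so it filters nothing
    if match_type == "and":
        return selected <= item_values   # item must have every selected value
    return bool(selected & item_values)  # "or" / "radio": any selected value


def item_matches(item, selection, match_types):
    """Groups are combined with each other using AND."""
    return all(
        group_matches(item.get(group, set()), selected, match_types[group])
        for group, selected in selection.items()
    )


# Hypothetical catalog and selection, mirroring the colors/sizes example.
match_types = {"color": "or", "size": "or", "features": "and"}
items = [
    {"color": {"red"}, "size": {"XL"}, "features": {"wireless"}},
    {"color": {"green"}, "size": {"M"}, "features": {"wireless"}},
    {"color": {"blue"}, "size": {"XXL"}, "features": set()},
]
selection = {"color": {"red", "green"}, "size": {"XL", "XXL"}}
results = [item for item in items if item_matches(item, selection, match_types)]
print(len(results))  # 1
```

Only the first item is returned: it is one of the two selected colors and one of the two selected sizes. The second item has a selected color but the wrong size, and the third has a selected size but the wrong color.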

A typical real-world website will have several groups using different match types, with many options between them. The total number of combinations can get quite large:

The above example has just over 17 billion possible combinations. Note that the total number of actual pages will be much larger than this because the results from some combinations will be split across many pages.

For faceted navigation, page counts are ultimately determined by three main things:

  1. The total number of possible combinations of options — In the simplest case (with only “AND” & “OR” matching, and no blocking) the number of combinations will be 2^n, where n is the number of options. For example, if you have 12 options, then there will be 2^12, or 4,096 possible combinations. This gets a bit more complicated when some of the groups are radio buttons, and it gets a lot more complicated when you start blocking things.
  2. The number of matching items found for a given combination — The number of matching items is determined by many factors, including match type, the total number of products, the fraction of products matched by each filter option, and the amount of overlap between options.
  3. The maximum number of items to be displayed per page — This is an arbitrary choice set by the site designer. You can set this to any number you want. A bigger number means fewer pages but more clutter on each of them.
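As a rough sketch of the first and third factors, the arithmetic might look like this in Python. The function names and the tuple layout are hypothetical, blocking is ignored, and the sketch assumes an empty result still renders one "Zero items found" page.

```python
import math


def combination_count(groups):
    """groups: list of (option_count, match_type, required) tuples.

    "and"/"or" groups allow any subset of their options: 2 ** n.
    "radio" groups allow exactly one option, plus "none selected"
    unless a choice is required.
    """
    total = 1
    for option_count, match_type, required in groups:
        if match_type == "radio":
            total *= option_count if required else option_count + 1
        else:
            total *= 2 ** option_count
    return total


def pages_for(matching_items, items_per_page):
    """Pages needed for one combination (assumed: at least one page)."""
    return max(1, math.ceil(matching_items / items_per_page))


print(combination_count([(12, "or", False)]))     # 4096
print(combination_count([(32, "radio", False)]))  # 33
print(pages_for(2000, 10))                        # 200
```

The second factor (how many items each combination matches) depends on the catalog itself, so it cannot be reduced to a one-line formula like the other two.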

 

Test: How Does Match Type Affect Page Counts?

The choice of match type affects the page count by influencing both the number of combinations of options and also the number of matching items per combination.

How were these results calculated?
All of the numeric results in this article were generated by a simulation script written for this purpose. This script works by modeling the site as a multi-dimensional histogram, which is then repeatedly scaled and re-combined with itself each time a new faceted nav option is added to the simulated site. The script simulates gigantic sites with many groups of different option types relatively quickly. (For previous articles, I have always generated crawl data using an actual crawler, running on a test website made up of real HTML pages. That works fine when there are a few tens of thousands of pages, but some of the tests for this article have trillions of pages. That would take my crawler longer than all of recorded human history to crawl. Civilizations rise and fall over centuries. I decided not to wait that long.)

Test #1 — Simple “AND” Matching

Suppose we have a site with the following properties:

  • The faceted nav consists of one big group, with 32 filtering options that can be selected in any combination.
  • There are 10,000 products.
  • On average, each filtering option matches 20% of the products.
  • The site displays (up to) 10 products per page.
  • Options are combined using “AND” matching.

The above assumptions give you a site with:

  • 4,294,967,296 different combinations of options
  • 4,295,064,687 pages.
  • 4,294,724,471 empty results.

The obvious: The number of pages is enormous, and the vast majority of them are empty results. For every 12,625 pages on this site, one shows actual products. The rest show the aggravating “Zero items found” message. This is a terrible user experience and a colossal waste of crawl budget. But it’s also an opportunity.

So what can we do about all those empty results? If you are in control of the server side code, you can remove them. Any option that would lead to a page that says “Zero items found” should either be grayed out (and no longer coded as a link) or, better yet, removed entirely. This needs to be evaluated on the server side each time a new page is requested. If this is done correctly, then each time the user clicks on another option, all of the remaining options that would have led to an empty result will disappear. This reduces the number of pages, and it also dramatically improves the user experience. The user no longer has to stumble through a maze of mostly dead ends to find the rare combinations that show products.
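A minimal sketch of that server-side check, assuming “AND” matching and an in-memory catalog where each product is simply the set of options it has. The function names are hypothetical, and a real store would run an equivalent query against its database or search index.

```python
def matching_items(products, selected):
    """AND matching: keep products that have every selected option."""
    return [p for p in products if selected <= p]


def visible_options(products, all_options, selected):
    """Unselected options that would still lead to at least one product."""
    current = matching_items(products, selected)
    return {
        option
        for option in all_options - selected
        if any(option in p for p in current)
    }


# Hypothetical three-product catalog.
products = [
    {"wireless", "noise-canceling"},
    {"wireless", "waterproof"},
    {"wired"},
]
all_options = {"wireless", "wired", "noise-canceling", "waterproof"}

print(sorted(visible_options(products, all_options, {"wireless"})))
# ['noise-canceling', 'waterproof']
```

Once “wireless” is selected, “wired” is dropped from the list because no product has both. Only the options returned by `visible_options` would be rendered as links on the page.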

So let’s try this.

Test #2 — “AND” Matching, With Empty Results Removed

This test is identical to Test #1, except now all links that lead to empty results are silently removed.

This time, we get:

  • 1,149,017 (reachable) combinations of options.
  • 1,246,408 pages.
  • 0 empty results. (obviously, because we’ve removed them)

This may still seem like a lot, but it’s a significant improvement over the previous test. The page count has gone from billions down to just over one million. This is also a much better experience for the users, as they will no longer see any useless options that return zero results. Any site that has faceted nav should be doing this by default.

Test #3 — “OR” Matching

This test uses the same parameters as Test #1, except it uses “OR” matching:

  • The faceted nav still has 32 filtering options
  • There are still 10,000 products.
  • Each filtering option still matches 20% of the products.
  • The site still displays 10 products per page.
  • Options are now combined using “OR” matching instead of “AND” matching.

This gives us:

  • 4,294,967,296 different combinations of options.
  • 4,148,637,734,396 pages (!)
  • 0 empty results.

The number of combinations is precisely the same, but the number of pages is much higher now (966 times higher), and there are no longer any empty results. Why is the page count so high? Because, with “OR” matching, every time you click on a new option the number of matching items increases. This is the opposite of “AND” matching, where the number decreases. In this test, most combinations now include almost all of the products on the site. In Test #1, most combinations produced empty results.

There are no empty results at all in this new site. The only way there could be an empty result would be if you chose to include a filtering option that never matches anything (which would be kind of pointless). The strategy of blocking empty results does not affect this match type.
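The contrast between the two match types can be reproduced at toy scale by brute force. The sketch below is not the simulation script used for the figures in this article (that one models the site as a histogram). It simply enumerates every combination for a small, randomly generated catalog, with hypothetical defaults of 8 options and 200 products, and counts pages and empty results under both match types.

```python
import itertools
import math
import random


def simulate(option_count=8, product_count=200, match_rate=0.2,
             per_page=10, seed=1):
    rng = random.Random(seed)
    # Each product is the set of option indexes it matches.
    products = [
        {o for o in range(option_count) if rng.random() < match_rate}
        for _ in range(product_count)
    ]
    totals = {"and": {"pages": 0, "empty": 0}, "or": {"pages": 0, "empty": 0}}
    combos = 0
    for size in range(option_count + 1):
        for combo in itertools.combinations(range(option_count), size):
            combos += 1
            chosen = set(combo)
            if not chosen:
                # Nothing selected: both match types show the whole catalog.
                counts = {"and": product_count, "or": product_count}
            else:
                counts = {
                    "and": sum(1 for p in products if chosen <= p),
                    "or": sum(1 for p in products if chosen & p),
                }
            for match_type, count in counts.items():
                # Assumed: an empty result still costs one page.
                totals[match_type]["pages"] += max(1, math.ceil(count / per_page))
                totals[match_type]["empty"] += int(count == 0)
    return combos, totals


combos, totals = simulate()
print(combos, totals)
```

Whatever the random draw, every combination matches at least as many products under “OR” as under “AND”, so the “OR” page total can never be lower and the “OR” empty-result count can never be higher.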

Test #4 — Radio Buttons

This test uses radio button matching.

If we repeat Test #1, but with radio button matching, we get:

  • 33 different combinations of options.
  • 7,400 pages.
  • 0 empty results.

This is outrageously more efficient than any of the others. The downside of radio button matching is that it’s much more restrictive in terms of user choice.
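The radio button figures are small enough to check by hand. The sketch below assumes every option matches exactly 20% of the products (the test only says 20% on average) and that no option is required, so the "nothing selected" state counts as one combination.

```python
import math

options = 32
products = 10_000
match_percent = 20
per_page = 10

# One combination per option, plus the "nothing selected" state.
combinations = options + 1

pages_unfiltered = math.ceil(products / per_page)             # 1,000 pages
items_per_option = products * match_percent // 100            # 2,000 items
pages_per_option = math.ceil(items_per_option / per_page)     # 200 pages
total_pages = pages_unfiltered + options * pages_per_option

print(combinations, total_pages)  # 33 7400
```

Because only one option can be active at a time, the page count grows linearly with the number of options instead of doubling with each one.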

The takeaway: Always at least consider using radio button matching when you can get away with it (any time the options are mutually exclusive). It will have a dramatic effect on page counts.

Recap of Tests #1–4:

Test | Configuration | Page count
1 | “AND” matching (without blocking empty results) | 4,295,064,687
2 | “AND” matching, with empty results blocked | 1,246,408
3 | “OR” matching | 4,148,637,734,396
4 | Radio buttons | 7,400

Takeaways

  • The choice of match type is important and profoundly impacts page counts.
  • “OR” matching can lead to extremely high page counts.
  • “AND” matching isn’t as bad, provided you are blocking empty results.
  • You should always block empty results.
  • Blocking empty results helps with “AND” matching, but doesn’t affect “OR” matching.
  • Always use radio buttons when the options are mutually exclusive.

How Grouping Affects Page Count

So far, we have looked at page counts for sites that have one big group of options with the same match type. That’s unrealistic. On a real website, there will usually be many groups with different match types. The exact way the options are separated into groups is another factor that can affect page counts.

Test #5 — “OR” Matching, Split Into Multiple Groups

Let’s take the original parameters from Test #3:

  • The faceted nav has a total of 32 filtering options.
  • There are 10,000 products.
  • On average, each filtering option matches 20% of the products.
  • The site displays up to 10 products per page.
  • Options are combined using “OR” matching.

But this time, we’ll redo the test several times, and each time, we’ll split the 32 options into a different number of groups.

This gives us:

Configuration | Pages | Empty Results
1 group with 32 options | 4,148,637,734,396 | 0
2 groups with 16 options per group | 2,852,936,777,269 | 0
4 groups with 8 options per group | 466,469,159,950 | 0
8 groups with 4 options per group | 5,969,194,867 | 290,250,752
16 groups with 2 options per group | 4,296,247,759 | 4,275,284,621

The interesting thing here is that the last two tests have some empty results. Yes, all groups used “OR” matching, and yes, I told you “OR” matching does not produce empty results. So what’s going on here? Remember, no matter which match types are used within each group, the groups are combined with each other using “AND” matching. So, if you break an “OR” group into many smaller “OR” groups, you get behavior closer to an “AND” group.

Another way to put it: Suppose there are eight groups with four options each, and the user has selected exactly one option from each group. For any item to show up in those results, the item would have to match all eight of those selected options. This is functionally identical to what you would get if those eight selected options were part of an “AND” group.

If you are blocking empty results (which you should be doing anyway), then the actual page counts for the last two tests will be much smaller than is shown in this table. Before you get all excited, note that you have to have quite a few groups before this starts happening. It’s possible some site might be in a market where it makes sense to have eight groups with four options each, but it isn’t something that will happen often.

The boring but more practical observation is that even breaking the group into two parts reduces the page count noticeably. The difference isn’t huge, but it’s enough to be of some value. If a group of options that uses “OR” matching can be logically separated into two or more smaller groups, then it may be worth doing.

Test #6 — “AND” Matching, Split Into Multiple Groups

(I’m including this test because, if I don’t, people will tell me I forgot to do this one)

This test is the same as Test #5, but with “AND” matching instead of “OR” matching (and empty results are now being blocked).

Configuration | Pages
1 group with 32 options | 1,246,408
2 groups with 16 options per group | 1,246,408
4 groups with 8 options per group | 1,246,408
8 groups with 4 options per group | 1,246,408
16 groups with 2 options per group | 1,246,408

Yep. They all have the same number of pages. How can this be? The options within each group use “AND” matching, and groups are combined with each other using “AND” matching, so it doesn’t matter if you have one group or several. They are functionally identical.

Takeaway

If you want to split up an “AND” group because you think it will make sense to the user or will look nicer on the page, then go for it, but it will not affect page counts.

Other Things that Affect Page Counts

Test #7 — Changing “Items per Page”

This test uses the following parameters:

  • The faceted nav consists of five groups, with varying option counts, and a mix of different match types.
  • There are 10,000 products.
  • On average, each filtering option matches 20% of the products.
  • Links to empty results are blocked.

The test was repeated with different values for “Items per Page.”

This gives us:

Configuration | Page Count
10 items per page | 18,690,151,025
32 items per page | 10,808,363,135
100 items per page | 8,800,911,375
320 items per page | 8,309,933,890
1,000 items per page | 8,211,780,310

This makes a difference when the values are small, but the effect tapers off as the values get larger.

Test #8 — Adding a Pagination Limit

Some sites, especially some very large online stores, try to reduce database load by setting a “pagination limit.” This is an arbitrary upper limit to the number of pages that can be returned for a given set of results.

For example, if a given filter combination matches 512,000 products, and the site is set up to show 10 products per page, this particular combination would normally create 51,200 pages. Some sites set an arbitrary limit of, say, 100. If the user clicks all the way to page 100, there is no link to continue further.
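The arithmetic in that example can be written as a small helper. This is only a sketch: `page_count` and its parameters are hypothetical names, and it assumes a result set always renders at least one page.

```python
import math


def page_count(matching_items, per_page, pagination_limit=None):
    """Pages for one filter combination, optionally capped by a limit."""
    pages = max(1, math.ceil(matching_items / per_page))
    if pagination_limit is not None:
        pages = min(pages, pagination_limit)
    return pages


print(page_count(512_000, 10))        # 51200
print(page_count(512_000, 10, 100))   # 100
```

The cap only bites on combinations that would otherwise exceed it. A combination that matches 8 products takes one page with or without the limit.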

These sites do this because, compared to delivering pages at the start of a pagination structure, delivering pages deeper in a pagination structure creates a massive load on the database (for technical reasons beyond the scope of this article). The larger the site, the greater the load, so the largest sites have to set the arbitrary limit.

This test uses the following parameters:

  • The faceted nav consists of five groups, with varying option counts, and a mix of different match types.
  • There are 500,000 products.
  • On average, each filtering option matches 20% of the products.
  • Links to empty results are blocked.

The test was repeated with different values for the pagination limit.

This gives us:

Pagination Limit | Total Page Count
5 | 12,079,937,370
10 | 13,883,272,770
20 | 15,312,606,795
40 | 16,723,058,170
80 | 17,680,426,670
160 | 18,252,882,040
(no limit) | 18,690,151,025

That’s definitely an improvement, but it’s underwhelming. If you cut the pagination limit in half, you don’t wind up with half as many pages. It’s more in the neighborhood of 90% as many. But this improvement is free because this type of limit is usually added for reasons other than SEO.

Pagination Takeaways

Test 7:

  • For lower values, changing “Items per Page” improves page counts by a noticeable amount.
  • When the values get higher, the effect tapers off. This is happening because most of the results now fit on one page. (and the page count can’t get lower than one)

Test 8:

  • If you have a huge site implementing a pagination limit primarily for database performance reasons, you may see a minor SEO benefit as a free bonus.
  • If you’re not also doing this to reduce database load, it’s not worth it.

Selectively Blocking Crawlers

All of the tests so far let the crawler see all of the human-accessible pages. Now let’s look at strategies that work by blocking pages via robots meta, robots.txt, etc.

Before we do that, we need to be clear about what “page count” really means. There are actually three different “page counts” that matter here:

  1. Human-readable page count — Pages that can be viewed by a human being with a browser.
  2. Crawlable page count — Pages that a search engine crawler is allowed to request.
  3. Indexable page count — The number of pages that the search engine is allowed to index, and to potentially show in search results.

The crawlable page count is important because it determines how much crawl budget is wasted. This will affect how thoroughly and how frequently the real content on the site gets crawled. The indexable page count is important because it effectively determines how many thin, near-duplicate pages the search engines will try to index. This is likely to affect the rankings of the real pages on the site.

Test #9 — Selection Limit via Robots Meta with “noindex, nofollow”

In this test, if the number of selected options on the page gets above a pre-specified limit, then <meta name="robots" content="noindex,nofollow"> will be inserted into the HTML. This tells the search engines not to index the page or follow any links from it.
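On the server side, the rule could be as small as the following sketch. The function name and the way selected options are passed in are assumptions, and a real template would place the returned tag inside the page's head section.

```python
def robots_meta(selected_options, selection_limit):
    """Return the robots meta tag for a faceted page, or "" if none is needed."""
    if len(selected_options) > selection_limit:
        return '<meta name="robots" content="noindex,nofollow">'
    return ""


print(robots_meta(["red", "xl", "wool"], selection_limit=2))
print(repr(robots_meta(["red"], selection_limit=2)))
```

With a limit of 2, the page with three selected options gets the tag and the page with one selected option does not.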

This test uses the following parameters:

  • The faceted nav consists of five groups, with varying option counts, and a mix of different match types.
  • There are 10,000 products.
  • On average, each filtering option matches 20% of the products.
  • Links to empty results are blocked.

For this test, the “selection limit” is varied from 0 to 5. Any page where the number of selected options is larger than this selection limit will be blocked, via robots meta tag with noindex, nofollow.

Selection limit | Crawlable pages | Indexable pages
0 | 11,400 | 1,000
1 | 79,640 | 11,400
2 | 470,760 | 79,640
3 | 2,282,155 | 470,760
4 | 9,269,631 | 2,282,155
5 | 32,304,462 | 9,269,631
(no limit) | 18,690,151,025 | 18,690,151,025

In these results, both indexable and crawlable page counts are reduced dramatically, but the number of crawlable pages is reduced by much less. Why? Because a robots meta tag is part of the HTML code of the page it is blocking. That means the crawler has to load the page in order to find out it has been blocked. A robots meta tag can block indexing, but it can’t block crawling. It still wastes crawl budget.

You might well ask: If robots meta can’t directly block a page from being crawled, then why is the crawlable page count reduced at all? Because crawlers can no longer reach the deepest pages: The pages that link to those pages are no longer followed or indexed. Robots meta can’t directly block crawling of a particular page, but it can block the page indirectly, by setting “nofollow” for all of the pages that link to it.
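This indirect effect is easy to see on a toy facet graph. The sketch below is an illustration under simplifying assumptions, not the simulation behind the table: every option can be freely combined, there is no pagination, and each page links to every page with one more option selected.

```python
from collections import deque


def crawl(option_count, selection_limit):
    """Breadth-first crawl of a toy facet graph.

    Each page is a frozenset of selected options. Pages over the limit
    carry "noindex, nofollow": they are fetched, but neither indexed
    nor followed.
    """
    start = frozenset()
    seen = {start}
    queue = deque([start])
    crawled = indexed = 0
    while queue:
        page = queue.popleft()
        crawled += 1                  # the crawler had to fetch the page
        if len(page) > selection_limit:
            continue                  # noindex, nofollow: stop here
        indexed += 1
        for option in range(option_count):
            if option not in page:
                target = page | {option}
                if target not in seen:
                    seen.add(target)
                    queue.append(target)
    return crawled, indexed


print(crawl(option_count=5, selection_limit=1))  # (16, 6)
print(crawl(option_count=5, selection_limit=2))  # (26, 16)
```

The crawler always reaches exactly one level past the limit, so the crawlable count at one limit equals the indexable count at the next. The same staircase pattern appears in the table above.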

Test #10 — Repeat of Test #9, But With “noindex, follow”

This is a repeat of test #9, except now the pages are blocked by a robots meta tag with “noindex, follow” instead of “noindex, nofollow.” This tells the crawler that it still shouldn’t index the page, but it is OK to follow the links from it.

(I’m only including this one because, if I don’t, someone is bound to tell me I forgot to include it.)

Selection limit | Crawlable pages | Indexable pages
0 | 18,690,151,025 | 1,000
1 | 18,690,151,025 | 11,400
2 | 18,690,151,025 | 79,640
3 | 18,690,151,025 | 470,760
4 | 18,690,151,025 | 2,282,155
5 | 18,690,151,025 | 9,269,631
(no limit) | 18,690,151,025 | 18,690,151,025

This scheme reduces the number of indexable pages, but it does nothing whatsoever to prevent wasted crawl budget. Wasted crawl budget is the main problem that needs to be solved here, so this makes this scheme useless. There are some use cases (unrelated to faceted nav) where “noindex, follow” is a good choice, but this isn’t one of them.

Can the selection limit be implemented with robots.txt?

As shown in test #9, using robots meta tags to implement a selection limit is not ideal, because robots meta tags are part of the HTML of the page. The crawler has to load each page before it can find out if the page is blocked. This wastes crawl budget.

So what about using robots.txt instead? Robots.txt seems like a better choice for this, because it blocks pages from being crawled, unlike robots meta, which blocks pages from being indexed and/or followed. But can robots.txt be used to selectively block pages based on how many options they have selected? The answer is: it depends.

This depends on the URL structure. In some cases it’s simple, in others it’s difficult or impossible.

For example, if the URL structure uses some completely impenetrable format like base-64-encoded JSON:

https://example.com/products?p=WzczLCA5NCwgMTkxLCAxOThd

Then you are out of luck. You cannot use robots.txt to filter this, because there’s no way for robots.txt to tell how many selected options there are. You’ll have to use robots meta or the X-Robots-Tag HTTP header (both of which can be generated by the server-side code, which has access to the decoded version of the query data).
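As a hedged sketch of the server-side alternative: the example value above happens to decode to the JSON list [73, 94, 191, 198], so the server can count the selected options and send an X-Robots-Tag response header when the count is over the limit. The function name, the returned dict of headers, and the limit are assumptions for illustration.

```python
import base64
import json


def robots_header(p_value, selection_limit):
    """Decode the opaque filter parameter and decide on an X-Robots-Tag header.

    Returns a dict of extra HTTP response headers (empty when the page is
    within the selection limit).
    """
    selected = json.loads(base64.b64decode(p_value))
    if len(selected) > selection_limit:
        return {"X-Robots-Tag": "noindex, nofollow"}
    return {}


print(robots_header("WzczLCA5NCwgMTkxLCAxOThd", selection_limit=2))
# {'X-Robots-Tag': 'noindex, nofollow'}
```

Like the robots meta tag, this header is only seen after the crawler has requested the URL, so it limits indexing and link-following but does not prevent the request itself.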

On the other hand, if all filter options are specified as a single underscore-separated list of ID numbers in the query string, like this:

https://example.com/products?filters=73_94_191_198

Then you can easily block all pages that have more than (for example) two options selected, by doing this:

User-agent: *
Disallow: /products?*filters=*_*_
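Before deploying a rule like this, it can help to check it against sample URLs. Python's standard `urllib.robotparser` does not interpret `*` wildcards, so the sketch below uses a small hand-written matcher. It assumes the common wildcard semantics (`*` matches any run of characters, a trailing `$` anchors the end, and rules match from the start of the path), and the helper names are hypothetical.

```python
import re


def rule_to_regex(rule):
    """Translate a robots.txt path rule with * and $ wildcards into a regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile("^" + pattern + ("$" if anchored else ""))


def is_disallowed(path_and_query, rule):
    return rule_to_regex(rule).match(path_and_query) is not None


RULE = "/products?*filters=*_*_"
for url in ("/products?filters=73",
            "/products?filters=73_94",
            "/products?filters=73_94_191",
            "/products?filters=73_94_191_198"):
    print(url, is_disallowed(url, RULE))
```

The first two URLs are allowed and the last two are blocked, which matches the "more than two options" intent. One caveat: the rule counts any underscore that appears after `filters=`, so a URL such as `/products?filters=73&utm_source=a_b` would also be blocked even though only one option is selected.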

So let’s try this.

Test #11 — Selection Limit, via Robots.txt

This is a repeat of test #9, except now the pages are blocked using robots.txt instead of robots meta.

Selection limit | Crawlable pages | Indexable pages
0 | 1,000 | 1,000
1 | 11,400 | 11,400
2 | 79,640 | 79,640
3 | 470,760 | 470,760
4 | 2,282,155 | 2,282,155
5 | 9,269,631 | 9,269,631
(no limit) | 18,690,151,025 | 18,690,151,025

Takeaways

  • Blocking pages based on a selection limit is a very effective way to reduce page counts.
  • Implementing this with robots.txt is best.
  • But you can only use robots.txt if the URL structure allows it.
  • Implementing this with robots meta is less effective than robots.txt, but still useful.

Summary

Faceted navigation is one of the thorniest SEO challenges large sites face. Don’t wait to address issues after you’ve built your site. Plan ahead. Use robots.txt, look at selection options, and “think” like a search engine.

A little planning can improve use of crawl budget, boost SEO, and improve the user experience.

6 Comments

  1. whiner


    Hi Matthew, first of all, great article – have you seen any examples of pages with faceted nav with a proper SEO implementation?

    • Matthew Henry


      Amazon.com is a good example of a site that has put a lot of thought and effort into their faceted nav.

  2. Great post, and covers the combinatorial possibilities well. That said, I have some questions about the assumptions and conclusions here.
    1) Assumption that Google or some other crawler can actually activate the selectors. Does it? If yes, there might be an SEO issue. If no, then the only thing crawled/indexed is the initial content there by default. I agree that content will be crawled, so designers need to be mindful of what shows by default. I know this because I can check results from some online data studio embeds I have out there.
    2) Assumption that if the crawler could actually use the selectors, it would. I think you have the combination possibilities correct, but I think the execution of this would not be practical.
    Case: 4,294,967,296 combinations. I don’t know how many hits could be executed on the server without it crashing, but let’s assume a very liberal 1000 per second. That would mean 4,294,967 seconds would be required to go through the combinations. That would take roughly 50 days of non-stop queries! Whose server would allow that before blocking? Would any crawler be willing to tie up its resources for that amount of time on a single site? Not to mention the thousands of similar faceted selectors out there.
    I respectfully agree with the UX conclusions you presented here and appreciate the depth of the exploration of possible combinations. I know there are testing implications – how do you know your results are valid when developing one of these? Super difficult to validate ALL the possible results.
    I’m not sure about the DB implications – there surely can be some if the queries are not properly structured. Some of the impact there will depend on actual usage – these are usually fairly simple SQL queries compared to some out there (think of an enterprise ERP system). You definitely hit an optimization opportunity looking at the number of results returned.
    If you think I’m off here please hit me up on twitter @helpfullee
    With respect…

    • Matthew Henry


      Thank you for your thoughtful analysis. I’ll do my best to address each of your points:

      On assumption 1: The article is assuming the options are coded as actual links, with each link leading to a new, unique URL. This is the most common way these sites are coded.

      On assumption 2: You raise a good point. If there are (for example) 4,294,967,296 combinations, I agree that a search engine crawler will only crawl a tiny fraction of this. After maybe a few hundreds of thousands (or, at most, millions) of pages, the crawler will decide it is getting nothing but junk, and will give up. My main point is that having a search engine give up because your site is mostly junk is itself a really bad thing that you should go to great lengths to avoid. The reason four billion crawlable pages is better than four trillion is not because the search engines will crawl any more or less of them. It’s better because it is one step closer to getting the count down to something reasonable. You are not done yet, but every journey must start with the first step.

      Testing implications: I cannot, of course, guarantee that my testing script is perfectly accurate in all cases. What I can say is that I put a lot of effort into getting it as accurate as I could. The algorithm went through many development iterations, and at each stage, I tested simulated results of sites that had “mere” hundreds of thousands of pages against actual crawls. I rejected some of my earlier algorithms because they were not accurate enough.

      DB implications: The pagination limit exists on some very large sites because, for this particular type of query, very deep pagination tends to be more expensive than shallow pagination. Also, the chances that a real human being would have a legitimate reason to want to see page 6000 are pretty small. This only matters when the site is insanely large and gets a lot of traffic. It comes down to a simple cost/benefit analysis.

  3. Hi Mathew,

    Thank you for sharing such a wonderful post.

    I had doubts about the limits of the robots.txt file, but you have now cleared them up with a good example.

    Thanks a lot.

  4. Take a look at the PRG Pattern, which is the best way to get faceted search to work for you AND Google ;-)
