Marianne Sweeny // May 15 2013
When I meet with clients or present at conferences, I am always asked: “How do I rank high on Google for (insert keyword-phrase-du-jour)?” I give the standard answer: “Only the search engineers at Google can tell you and they aren’t talking.”
Inevitably, the questioner looks dejected, mutters a slur on my credentials, and walks away. I scream silently in my head: “Don’t kill the messenger because we are all hapless Wile E. Coyotes chasing the Larry and Sergey Road Runner with zero chance of catching them, no matter what we order from ACME!”
Thirteen years ago, before the Cone of Silence dropped on Google’s method of operation, we got a glimpse of the method behind their madness. This, combined with the common knowledge of the foundational tenets of all search engines, gives us some idea of what’s going on behind that not-so-simple box on the white page.
In this post, I am going to explore the 3 algorithms that we know for sure Google is using to produce search results, and speculate about the 200+ other algorithms that we suspect they are using based on patent filings, reverse engineering, and the Ouija board.
What is an algorithm (you might ask)?
There are many definitions of algorithm. The National Institute of Standards and Technology defines an algorithm as “a computable set of steps to achieve a desired result.” Ask a developer and they will tell you that an algorithm is “a set of instructions (procedures or functions) that is used to accomplish a certain task.” My favorite definition, and the one that I’m going with, comes from MIT’s Kevin Slavin’s TED Talk “How Algorithms Shape Our World”: algorithms are “math that computers use to decide stuff.”
The most famous Google algorithm is PageRank, a pre-query value that has no relationship to the search query. In its infancy, the PageRank algorithm used links pointing to a page as an indication of its importance. Larry Page, after whom the algorithm is named, borrowed the academic citation model, in which papers citing another paper count as endorsements of its authority. Strangely enough, academia does not have citation rings or citation-buying schemes the way the web does with links. Warning: scary, eye-bleeding computational math ahead.
To combat spam, a Random Surfer algorithm was added to PageRank. This algorithm “imagined” a Random Surfer that traveled the Web and followed the links on each page. However, sometimes the Random Surfer would arbitrarily, much like us thought-processing bipeds, not return to the original page and keep going, or would stop following links and “jump” to another page. The algorithm steps are:
That’s the benefit of algorithms: no overtime, and they never get tired or bored.
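For the curious, here is a minimal Python sketch of the Random Surfer idea, written as a simple power iteration. It is an illustration, not Google's implementation: the `pagerank` function, the `toy_web` link graph, the 50 iterations, and the 0.85 damping value (the chance the surfer follows a link rather than jumping) are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank for a dict of {page: [pages it links to]}.

    damping is the assumed probability that the surfer follows a link;
    with probability (1 - damping) the surfer jumps to a random page.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with its share of the random jumps.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly across its outbound links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links: the surfer jumps anywhere.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" collects links from a, b, and d.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 4))
```

In this toy graph, page "d" has no inbound links, so the only rank it ever receives comes from the random jump. That is the floor every page gets no matter how unloved it is.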
The Surf’s Up, Dude algorithm worked for about 10 minutes before the SEO community found the hole in its wet suit and used it to manipulate rankings. In the early 2000s, processors caught up to computational mathematics and Google was able to deploy the Hilltop Algorithm (around 2001). This algorithm was the first introduction of semantic influence on search results, inasmuch as a machine can be trained to understand semantics.
Hilltop is like a linguistic Ponzi scheme that attributes a quality to links based on the authority of the document pointing the link to the page. One of Hilltop’s algorithms segments the web into a corpus of broad topics. If a document in a topic area has lots of links from unaffiliated experts within the same topic area, that document must be an authority. Links from authority documents carry more weight. Authority documents tend to link to other authorities on the same subject and to Hubs, pages that have lots of links to documents on the same subject.
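The “unaffiliated experts” part can be sketched in a few lines of Python. This is a rough illustration only: treating “affiliated” as “same hostname,” the `min_votes` threshold of 2, and every URL below are assumptions made for the example, not details taken from Google.

```python
from urllib.parse import urlparse


def unaffiliated_votes(expert_links, target):
    """Count distinct expert hosts linking to target, ignoring its own host.

    expert_links is a list of (source_url, destination_url) pairs taken
    from pages already judged to be experts on the topic.
    """
    target_host = urlparse(target).netloc
    hosts = {
        urlparse(source).netloc
        for source, destination in expert_links
        if destination == target
    }
    hosts.discard(target_host)  # a site cannot vouch for itself
    return len(hosts)


def is_authority(expert_links, target, min_votes=2):
    """Assumed rule: an authority needs votes from several separate hosts."""
    return unaffiliated_votes(expert_links, target) >= min_votes


# Hypothetical links from expert pages on one topic.
links = [
    ("https://expert-one.example/guide", "https://target.example/page"),
    ("https://expert-one.example/faq", "https://target.example/page"),
    ("https://expert-two.example/list", "https://target.example/page"),
    ("https://target.example/about", "https://target.example/page"),
    ("https://expert-one.example/guide", "https://other.example/page"),
]
print(unaffiliated_votes(links, "https://target.example/page"))
print(is_authority(links, "https://other.example/page"))
```

Two links from the same expert host count once, and the target's own pages do not count at all, which is the whole point of requiring the experts to be unaffiliated.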
The Topic-Sensitive PageRank algorithm is a set of algorithms that take the semantic reasoning a few steps further. Ostensibly the algorithm uses the Open Directory ontology (dmoz.org) to sort documents by topic.
Another algorithm calculates a score for context sensitive relevance rank based on a set of “vectors”. These vectors represent the context of term use in a document, the context of the term used in the history of queries, and the context of previous use by the user as contained in the user profile.
So, I know what you’re thinking. How can they do that for the whole web? They don’t. They use predictive modeling algorithms to perform these operations on a representational subset of the web, collect the vectors, and apply the findings to all of the “nearest neighbors.”
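Here is a small Python sketch of that “nearest neighbor” shortcut. It assumes documents have already been turned into numeric vectors and that a small sample has already been scored; the vectors, the scores, and the choice of cosine similarity as the closeness measure are all made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbor_score(vector, scored_sample):
    """Borrow the score of the most similar document in the scored sample.

    scored_sample is a list of (vector, score) pairs for the small,
    representative subset that was analyzed in full.
    """
    _, best_score = max(
        scored_sample, key=lambda pair: cosine(vector, pair[0])
    )
    return best_score


# Hypothetical representative sample: (context vector, relevance score).
sample = [
    ([1.0, 0.0, 0.2], 0.9),
    ([0.0, 1.0, 0.1], 0.3),
]

# A document that was never analyzed in full inherits its neighbor's score.
print(nearest_neighbor_score([0.8, 0.1, 0.3], sample))
```

The expensive analysis happens once on the sample, and everything else gets a cheap similarity lookup.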
[Added May 16, 2013]
There are a lot of algorithms for indexing, processing and clustering documents that I left out because including them would have many of you face-first-in-your-cereal-from-boredom. However, it is NOT OK to leave out the mother of all information retrieval algorithms, TF-IDF, known affectionately to search geeks as Term Frequency-Inverse Document Frequency.
Introduced in the 1970s, this primary ranking algorithm uses the presence, number of occurrences, and locations of occurrence to produce a statistical weight on the importance of a particular term in the document. It includes a normalization feature to prevent long, boring documents from taking up residence in search results due to the sheer nature of their girth. This is my favorite algorithm because it supports Woody Allen’s maxim that 80% of success is showing up.
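A bare-bones Python sketch of one common textbook form of TF-IDF follows. It covers term counts, length normalization, and rarity across the collection, but it ignores where in the document a term appears. The three sample documents and the whitespace tokenizer are assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)  # dividing by length keeps long documents in check
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "google ranks pages",
    "google indexes pages and pages and more pages",
    "coyote orders rockets",
]
scores = tf_idf(docs)
for doc, doc_scores in zip(docs, scores):
    top_term = max(doc_scores, key=doc_scores.get)
    print(repr(doc), "->", top_term)
```

A term that shows up in only one document ("coyote") outweighs one that shows up in most of them ("google"), and repeating a word helps only in proportion to the document's length.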
All of the search engines closely guard their complete algorithm structure for ranking documents. However, we live in a wonderful country that has patent protection for ideas. These patents provide insight into Google’s thinking and you can usually pinpoint which ones are deployed.
Panda, the most famous update, is an evolving set of algorithms that are combined to determine the quality of the content and user experience on a particular website. There are algorithms that apply decision trees to large data sets of user behavior.
These decision trees look at if this/then that:
Complementing the decision trees could be any one of a number of page layout algorithms that determine the number and placement of images on a page in relation to the amount of content and to a searcher’s focus of attention.
Following on the heels of Panda are the Penguin algorithms. These algorithms are specifically targeted at detecting and removing web spam. They use Google’s vast data resources to evaluate the quality of links pointing to a site, measure the rate of link acquisition, the link source relationship to the page subject, shared domain ownership of the linking sites, and relationships between the linking sites.
Once a site passes an established threshold, another algorithm likely flags the site for additional review by a human evaluator or automatically re-ranks the page so that it drops in search results.
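To make the threshold idea concrete, here is a purely speculative Python sketch. Nothing in it comes from Google: the `LinkProfile` fields, the equal weighting, the "10x typical" velocity cap, and both thresholds are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class LinkProfile:
    """Hypothetical link signals for one site (all values are assumptions)."""
    off_topic_share: float         # fraction of inbound links from unrelated topics, 0..1
    shared_owner_share: float      # fraction from domains with the same owner, 0..1
    new_links_per_week: float      # current rate of link acquisition
    typical_links_per_week: float  # the site's historical rate


def spam_score(profile):
    """Blend the signals into a single 0..1 score using equal weights."""
    ratio = profile.new_links_per_week / max(profile.typical_links_per_week, 1.0)
    velocity = min(ratio / 10.0, 1.0)  # 10x the usual pace counts as maxed out
    return (profile.off_topic_share + profile.shared_owner_share + velocity) / 3.0


def review_action(score, review_threshold=0.5, demote_threshold=0.8):
    """Decide what happens once a score crosses an assumed threshold."""
    if score >= demote_threshold:
        return "demote"
    if score >= review_threshold:
        return "human_review"
    return "no_action"


quiet_site = LinkProfile(0.1, 0.0, 5, 10)
odd_site = LinkProfile(0.7, 0.6, 40, 10)
link_farm = LinkProfile(0.9, 0.9, 500, 10)
for site in (quiet_site, odd_site, link_farm):
    print(review_action(spam_score(site)))
```

The two-tier outcome mirrors the either/or described above: borderline sites go to a person, and the blatant ones get re-ranked automatically.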
As with the formula for Coca-Cola or the recipe for Colonel Sanders’ Kentucky Fried Chicken, specifics on what Google uses to decide who gets where in the search results set are a closely guarded secret. Instead of speculating on what we might know, let’s focus on what we do know:
Are there any major algorithms we missed? Let us know in the comments.
Marianne considers herself fortunate to be able to combine her passions for search and user experience as a Search Strategist at Portent. Springtime finds her teaching Introduction to Information Retrieval at the University of Washington iSchool. Her aspiration is to regain her Google Adwords certification.