Cloaking and doorway pages – the world of Black Hat SEO


Cloaking and doorway pages are two common, though morally dubious, techniques for optimizing one’s rank on popular search engines. However, these techniques are almost always against search engines’ guidelines, thus putting them in the realm of so-called “Black Hat” SEO.

In this article, you’ll get an extensive introduction to how doorway pages and cloaking work and why many site owners use these techniques despite the risk of being completely delisted from search engine results.

  1. SEO – Search Engine Optimization
  2. Doorway Pages
  3. Cloaking
  4. Conclusion

SEO – Search Engine Optimization

SEO, or Search Engine Optimization, is the process of systematically working to raise a site’s ranking on the various major search engines. These techniques include backlinking, internal linking, optimizing your landing page for crawlers, content generation with high-value keywords, etc. 

But if getting to the top of the search results for specific keywords is so important, why wouldn’t site owners simply pay Google to display their sites above the results? Google provides just this service, after all.

Why would they instead spend so much money and jump through so many hoops to optimize organic search ranking?

It’s because a high organic ranking is simply more valuable than an ad displayed above the results. Google’s users often begin their web surfing journey there – often many times per day. That means they see Google’s search results format over and over, and with time they become used to Google’s ad format.

So used to it, in fact, that they don’t even glance at the ads. Habit and experience lead them to subconsciously separate the ads from the organic results and skip straight to those results without even registering the ads.

This is a phenomenon known as “Banner Blindness.” 

Furthermore, adblocker usage continues to grow, and adblockers block Google’s search ads just like any other ad. In the United States, over 25% of consumers now use adblockers, according to Statista. In Germany, eMarketer reports adblocker usage of over 30%!

By optimizing search engine ranking, you sidestep both Banner Blindness and adblockers. So, as you can see, a strong organic ranking is often significantly more valuable than a paid search ad.

High search rankings are so valuable, in fact, that some site owners are willing to stray into the dark world of Black Hat SEO.

Black Hat SEO

“Black Hat” is a term borrowed from the hacking community, where it describes activity that is outside the law.

In the hacking community, this tends to mean outright illegal. A black hat hacker is likely someone with whom the police would like to have a word.

In SEO, the meaning is (usually) a little less severe.

Black Hat SEO is generally not illegal in the sense that it is breaking laws, but it does violate search engine rules. And breaking these rules can result in a site being entirely delisted, i.e. no longer being displayed in the results at all. 

So in real terms, “Black Hat” SEO is significantly tamer than other activities that bear the same title. 

As you may have already guessed, cloaking and doorway pages are two prevalent forms of SEO that are widely considered Black Hat.

Before we can understand these techniques, we need to look at the main point of interaction between a search engine and a website. 

And that point of interaction is called the web crawler (or spider).

Search Engine Crawlers

For consumers, the World Wide Web consists primarily of various webpages that can be visited. However, the web is also heavily decentralized. This lack of centralization means that there is no one list of every active site on the internet. 

So no one knows how many sites there are on the web at a given time. This is precisely the question that the first web crawler was designed to answer.

Matthew Gray built that web crawler at MIT in 1993. Named the World Wide Web Wanderer, its purpose was to document and measure the expansion of the internet. 

After a while, it also began indexing the sites visited – rather than merely counting them. This index laid the foundation for a searchable database of extant and visitable websites. 

Matthew Gray later went on to work at Google and helped build the world’s leading search engine – a service which today is responsible for a considerable percentage of Google’s revenue. 

Since 1993, the base technology on which search engines depend has not fundamentally changed. It still depends on “crawlers” similar to the Perl-based World Wide Web Wanderer that Matthew Gray built at MIT. These crawlers are smarter and more efficient than ever before, but they are still fundamentally doing the same thing.

How does a crawler work? 

A web crawler, such as Google’s Googlebot, will begin with a list of URLs called “seeds” that it needs to visit. These could be new sites or they could be sites it’s revisiting for new content. Different search engines have different policies regarding crawler behavior. 

The crawler will then access these sites, find hyperlinks (to visit later), and copy the information that it finds.
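The link-finding step can be sketched with Python’s standard library. This is an illustrative toy, not how Googlebot actually works; the HTTP fetch is replaced with an inline HTML string so the example runs without network access, and the URLs are made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Relative links are resolved against the page's URL
                    self.links.append(urljoin(self.base_url, value))

# A real crawler would fetch this HTML over HTTP; here it's inlined.
html = '<a href="/about">About</a> <a href="http://other.example/">Other</a>'
parser = LinkExtractor("http://example.com/")
parser.feed(html)
print(parser.links)  # the URLs queued for a later visit
```

A real crawler would push these discovered URLs back onto its queue of “seeds,” which is how it expands outward from its starting list.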

Next, this information is indexed: the search engine tries to determine what the site is about, what kind of content is hosted on it, etc. The crawler itself does not do this. Instead, it happens later on the search engine’s own computers, after the website has already been crawled.

The information that the crawler found is analyzed and stored in the search engine’s enormous index. And this index is large. According to Google, their index contains over 100 million gigabytes of data.

This is the index that is then searched when someone makes a search request on, say, Google or Bing. 

N.B. Crawling and indexing are not the same thing

A crawler can discover a site, but that doesn’t mean it is necessarily indexed. If webmasters don’t want a site, or a page of a site, to appear in search results, they can use a “noindex” META tag.

This would tell any crawlers that discover the site that it doesn’t want to be stored in the search engine’s index. And if it’s not indexed, it isn’t served in their results. 
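Here is a minimal sketch of how an indexer might honor that tag, again using Python’s standard-library HTML parser. It is a simplification: real search engines also respect the `X-Robots-Tag` HTTP header and crawler-specific meta names like `googlebot`.

```python
from html.parser import HTMLParser

class NoindexChecker(HTMLParser):
    """Flags a page whose <meta name="robots"> content includes 'noindex'."""
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            name = (a.get("name") or "").lower()
            content = (a.get("content") or "").lower()
            if name == "robots" and "noindex" in content:
                self.noindex = True

checker = NoindexChecker()
checker.feed('<head><meta name="robots" content="noindex, nofollow"></head>')
print(checker.noindex)  # True -> skip this page when building the index
```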

That said, most site owners want their sites to be crawled, indexed, and served in search results. They want their websites not only to be served but to rank highly in those results for pertinent keywords. 

To achieve this, some site owners are willing to enter the world of Black Hat SEO. 

And one of the most common Black Hat techniques is the doorway page.

Doorway pages 

Doorway pages are sites that are designed specifically and exclusively for search engine crawlers. They use a level of search engine optimization and keyword density that would have severely adverse effects on the actual user experience. 

When you optimize a page for machines, it’s usually far from aesthetically pleasing. Sometimes it’s not even human-readable.

But the goal is for the visitor to never even see the doorway page. It’s a doorway; you’re supposed to pass through it, not stop and stare at it.

The exact definition of a doorway page, though, can be a subject of debate. And it’s an important one because if Google decides that you are using a doorway page, then your site could be delisted.

So here is Google’s definition:


Doorway pages

Doorways are sites or pages created to rank highly for specific search queries. They are bad for users because they can lead to multiple similar pages in user search results, where each result ends up taking the user to essentially the same destination. They can also lead users to intermediate pages that are not as useful as the final destination.

Here are some examples of doorways:

– Having multiple domain names or pages targeted at specific regions or cities that funnel users to one page

– Pages generated to funnel visitors into the actual usable or relevant portion of your site(s)

– Substantially similar pages that are closer to search results than a clearly defined, browseable hierarchy


Some doorway pages encourage a user to very quickly move onto the actual content and away from the doorway itself (using an aggressive call to action, for example). 

However, most site owners who use doorway pages do not want their users to know the doorway page was even there, much less see it. 

For this reason, a lot of site owners who use doorway pages also use META refresh, JavaScript redirection, or server-side redirection.

Again, doorway pages are designed and optimized for machine reading. That is to say, they are mostly full of keyword-dense text (that often might not make sense) with very little design, JavaScript enhancement, imagery, or anything else that a crawler can’t use.

A doorway page is just that; it’s a doorway that you walk through on your way to something else. 

META Refresh, JavaScript, and other forms of redirection

META refresh is a refresh command that can be written directly into the HTML of a page.

In short, it causes the browser to refresh the page as soon as it is done loading.

It’s called “META” refresh because it lives in an HTML META tag. However, it’s not limited to refreshing the page; META refresh can send the user to a totally new URL.

The tag sits in the header of a page. That means it is loaded first, and the first thing it does is tell the browser to reload immediately – but with a different URL:

    <meta http-equiv="refresh" content="0; url=http://example.com/" />

You can read more about META refresh on the World Wide Web Consortium’s website.

This technique can be used to create a client-side redirect in the user’s browser. The search engine’s crawler would still read the content of the page, which would then be indexed and ranked. The actual human user, however, would never have the time to read it.

JavaScript redirects are also a popular way to quickly and quietly move a user from a doorway page to the real destination the site owner wants them to find.

Google considers JavaScript redirects to be against its webmaster guidelines if they are designed to show the crawler one thing, but actual human users something else.

Some site owners, instead of using client-side redirection techniques (META refresh, JavaScript), will use server-side redirection techniques.
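A server-side redirect simply answers the request with an HTTP 3xx status and a Location header, before any page is rendered. Here is a minimal sketch as a Python WSGI app; the destination URL and the doorway path are made-up placeholders, and the request is simulated in-process rather than served over HTTP.

```python
# The doorway URL answers every request with a 302 pointing at the real page.
REAL_DESTINATION = "http://example.com/real-content"  # placeholder URL

def doorway_app(environ, start_response):
    # No body worth rendering: the whole point is the Location header.
    start_response("302 Found", [("Location", REAL_DESTINATION)])
    return [b""]

# Simulate one request without running a server:
captured = {}
def fake_start_response(status, headers):
    captured["status"] = status
    captured["headers"] = dict(headers)

doorway_app({"PATH_INFO": "/cheap-flights-berlin"}, fake_start_response)
print(captured["status"], captured["headers"]["Location"])
```

Because the redirect happens on the server, there is nothing in the delivered HTML for a user (or a casual inspection) to see, which is exactly why the technique is popular for doorway pages.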

Multiple doors, same destination

Frequently there is not just one doorway page redirecting to one site. Rather, there will be one real site and many, many doorway pages, each designed to rank well with specific keywords or locations. 

These sites, in turn, all redirect to the exact same destination.

Google and other search engines penalize those who use doorway pages for this reason (among others). By optimizing various doorway pages but redirecting users to the same site, the site owner makes Google’s algorithm think it’s giving users something closely related to what they’re looking for, when in reality that is not the case.

This has a strong negative effect on user experience. 

Doorway Pages are NOT Landing Pages

It’s important to note that landing pages are not the same thing as doorway pages. A landing page is generally used in online marketing campaigns as an intermediary between the advertisement itself and the final destination.

The landing page serves to educate the user and encourage them to move to the next step by explaining or presenting the product or service itself.

In short, it is an intermediary page that is directed at users/consumers.

A doorway page, on the other hand, is specifically designed for search engine crawlers and would, ideally, never actually be seen by the user.

Cloaking 

Cloaking is a much more aggressive relative of the doorway page and, for search engines, a much graver sin.

So what is it, exactly? Cloaking is a technique whereby the server identifies a search engine’s web crawler requesting a URL and returns a doorway page – and not just any doorway page, but a very aggressive one at that.

When a human user – or at least an IP address that the server doesn’t identify as a crawler – tries to access the site, it returns the real content.

In this way, the crawler never has the chance to discover the real content.

The content is fully “cloaked” behind a façade that the server presents the crawler. 

Google defines cloaking as:


Cloaking

Cloaking refers to the practice of presenting different content or URLs to human users and search engines. Cloaking is considered a violation of Google’s Webmaster Guidelines because it provides our users with different results than they expected.

Some examples of cloaking include:

– Serving a page of HTML text to search engines, while showing a page of images or Flash to users

– Inserting text or keywords into a page only when the User-agent requesting the page is a search engine, not a human visitor

If your site uses technologies that search engines have difficulty accessing, like JavaScript, images, or Flash, see our recommendations for making that content accessible to search engines and users without cloaking.

If a site gets hacked, it’s not uncommon for the hacker to use cloaking to make the hack harder for the site owner to detect. Read more about hacked sites.


As you can see, Google is not a fan of cloaking. This is because, with cloaking, Google doesn’t actually know what content the user will really see. It might send users to a page full of spam or malware while telling them in the search results that it’s exactly what they were seeking.

Again, this is different from doorway pages with sneaky redirects. With cloaking, the site identifies the crawler’s request and directly serves it a page of its own to read and index. There isn’t any redirect; human users are directly served the real page.

How it works 

Site owners who employ cloaking techniques have to recognize crawlers among all the different sources making HTTP requests.

In order to do this, the server looks at the information available to it when it receives an HTTP request.

That is, it looks at the IP address of the requesting party as well as the HTTP User Agent Header information that arrives with the request. 

HTTP User Agent

In order to identify Googlebot and other crawlers (though Googlebot is by far the most important), cloakers will look at two things.

The first of these is the HTTP User Agent Header.

When a user requests a web page, the User Agent Header gives the server information regarding the client. This is generally information about the user’s operating system, browser, browser version number, etc. 

Googlebot and other crawlers also have to provide a User Agent Header when making an HTTP request. And Googlebot’s User Agent looks significantly different from a real user’s.

For example here is my User Agent: 

Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:69.0) Gecko/20100101 Firefox/69.0

While Googlebot’s looks like this: 

Googlebot/2.1 (+http://www.google.com/bot.html)

As you can see, there is a considerable difference between the two. Mine displays all kinds of information regarding my computer, operating system version, browser, browser engine, etc. 

Why does the browser send this information? Because it tells the server what kind of site it should return. If the user is on an old browser that doesn’t support a feature, they can be served an older version of the site or one encouraging them to upgrade. If the user is on mobile, a mobile site can be presented, etc.

Googlebot just returns the fact that… it’s Googlebot.
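The naive User-Agent-based half of cloaking can be sketched in a few lines of Python. The crawler tokens below are real substrings of those crawlers’ User Agents, but the page-choosing logic and file names are a simplified illustration, not anyone’s production code.

```python
# Serve a keyword-stuffed doorway page to anything that looks like a
# crawler, and the real page to everyone else.
CRAWLER_TOKENS = ("Googlebot", "bingbot", "DuckDuckBot")

def choose_page(user_agent):
    if any(token in user_agent for token in CRAWLER_TOKENS):
        return "doorway.html"   # crawler-optimized, keyword-dense version
    return "real.html"          # the page human visitors actually see

print(choose_page("Googlebot/2.1 (+http://www.google.com/bot.html)"))
# -> doorway.html
print(choose_page("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; "
                  "rv:69.0) Gecko/20100101 Firefox/69.0"))
# -> real.html
```

Of course, anyone can send a Googlebot User Agent, which is exactly why cloakers also check the second signal: the IP address.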

IP Address

The second way of identifying Googlebot is by comparing the IP address requesting the page with a list of known Googlebot IP addresses.

Now, Google doesn’t provide a public list of IP addresses that can be used to verify Googlebot – that would make abuse too easy for bad actors. Instead, Google recommends that you do a reverse DNS lookup on any requesting party that feeds your server a Googlebot HTTP User Agent.

Google wants you to be able to verify Googlebot because many sites, such as those with paywalls, give it access to their premium content so that it can be indexed. These sites don’t want to provide that content to just anyone with a Googlebot HTTP User Agent, as the header can be faked. On the other hand, Google doesn’t want to make cloaking too easy, and therefore keeps its IP addresses private and changes them on occasion.

However, despite Google’s best efforts, collecting and maintaining a list of Googlebot IP addresses isn’t particularly tricky, and many sites provide lists. 
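The reverse-DNS verification Google recommends can be sketched with Python’s socket module: look up the PTR record for the IP, check that the hostname belongs to Google, then do a forward lookup to confirm the hostname really maps back to the same IP (otherwise a spoofed PTR record would pass). The network-dependent function below is a simplified sketch and isn’t executed here; only the pure hostname check is exercised.

```python
import socket

def is_google_hostname(hostname):
    """Googlebot's reverse-DNS names end in googlebot.com or google.com."""
    return hostname.endswith(".googlebot.com") or hostname.endswith(".google.com")

def verify_googlebot(ip):
    """Reverse lookup, check the domain, then forward lookup to confirm.
    Simplified: a stricter check would compare against all forward records."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)        # reverse DNS (PTR)
        if not is_google_hostname(hostname):
            return False
        return socket.gethostbyname(hostname) == ip      # forward DNS confirms
    except OSError:
        return False

# The hostname check alone can be exercised without any network:
print(is_google_hostname("crawl-66-249-66-1.googlebot.com"))  # True
print(is_google_hostname("fake-googlebot.attacker.example"))  # False
```

Note that this is the verification mechanism offered to legitimate site owners; cloakers typically invert it, using the same lookups (or harvested IP lists) to decide when to serve the fake page.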

Between HTTP User Agent Headers and Googlebot IP address lists, an unscrupulous site owner can positively identify Googlebot (most of the time, anyway).

And when Googlebot is discovered, the server can feed it a crawler-optimized page. 

Why would one cloak?

So why would a non-spammer use something like cloaking? Well, as Google mentions in its webmaster guidelines’ section on cloaking, some site owners want to use it to inform the crawler of content that would otherwise be invisible to it. This is sometimes the case for sites that have a significant amount of JavaScript or Flash content.

That said, there are other ways to inform search engines of such content – ways that don’t involve sneaky redirects, doorway pages, or cloaking.

These techniques are simply too often used in bad faith for search engines to allow them.

Conclusion

Search engines are designed to find, index, sort, and rank websites. As the majority of web browsing sessions begin with a search, optimizing your site to be displayed as high as possible has become the norm.

As the crawler is a machine, creating content optimized for the crawler often means that the material is less optimized for humans. Doorway pages and cloaking offer two workarounds: with these methods, crawler optimization is maximized without sacrificing the quality of the page that real visitors see.

This, however, leaves Google in the dark regarding the actual content to which it’s sending users. And Google does not like this, since it means it could be sending users to spam- or malware-filled sites while presenting those sites in the results as exactly what the user was seeking.

As such, they strongly oppose these practices, even though they have occasional legitimate uses.

And, as they are against Google’s and most other search engines’ policies, they are decidedly Black Hat SEO techniques.


Mobinner is a High-Performance Demand-Side Platform. Since 2017, we’ve been helping customers drive conversions, build brands, and acquire users. See how our platform can help your business meet its goals.


