Use Robots Exclusion to inoculate yourself against a potential domain spam penalty.

by Detlev Johnson

Domains abound on the Web. There are big ones and small ones, but most of all, there are a lot of spammy ones! Discussions of spam typically center on hidden text, bad cloaking and keyword stuffing, but spammy domain networks optimized for search engines are getting negative attention these days. Search engines have learned to cope with "search engine optimizers" and their domains. Today, you run the risk of getting banned for spam if your sites appear as duplicate listings.

What about near duplicates? Whatever the hosting situation, if your optimized pages are machine generated, chances are that the process is creating a spammy situation. Anytime a page is optimized for a single keyphrase, secondary (and sometimes primary) keyphrases tend to flood results with listings that look mighty spammy. Machine-generated pages are just a bad idea. Write your content by hand: each page will end up unique, and you can rest assured that the results you obtain will be seen as legitimate.

Let's take a look at how spider-based search engines deal with duplicate domains, which indicates their level of automatic spam detection. There are bona fide sites whose situation can be instructive. Setting aside the celebrated Wine.com case (link no longer available), let's look at a site where the host-master mapped four domains to the same content folder.

This arrangement, in effect, creates four identical sites, and we can find duplicate listings for them.

All the spider-based search engines have encountered the four domains. In such a situation, the thing to do would be to use individual IPs and a 301 server message (or at least provide unique folders and use the Robots Exclusion protocol to disallow three of the four domains) so that the search engines don't index duplicates. At the very least, Robots Exclusion should be used. Since that was not done in this case, we will look at duplicate listings to see where the risk of being kicked out for spam is strongest.
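
As a rough illustration of the Robots Exclusion option, the following Python sketch gives each domain its own robots.txt and checks the result with the standard library's robots.txt parser. The host names, the `ExampleBot` agent and the helper functions are hypothetical stand-ins for sites A through D, not details from the site discussed here, and the sketch assumes each domain can serve its own file (the "unique folders" arrangement).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical host names standing in for site A (primary) and sites B-D.
PRIMARY_HOST = "www.site-a.example"
DUPLICATE_HOSTS = ["www.site-b.example", "www.site-c.example", "www.site-d.example"]

# An empty Disallow value permits everything; "Disallow: /" blocks everything.
ALLOW_ALL = "User-agent: *\nDisallow:\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"


def robots_txt_for(host):
    """Return the robots.txt body this sketch would serve for a given host."""
    return ALLOW_ALL if host == PRIMARY_HOST else BLOCK_ALL


def may_crawl(host, path="/index.html", agent="ExampleBot"):
    """Parse the host's robots.txt and report whether the agent may fetch the path."""
    parser = RobotFileParser()
    parser.parse(robots_txt_for(host).splitlines())
    return parser.can_fetch(agent, f"http://{host}{path}")


if __name__ == "__main__":
    for host in [PRIMARY_HOST] + DUPLICATE_HOSTS:
        print(host, "crawlable" if may_crawl(host) else "disallowed")
```

Only the primary domain stays open to compliant robots in this sketch. Note that it cannot work if all four domains keep sharing one folder and one robots.txt, because then every domain would serve the same rules.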

I will describe (and list) the indexing of the sites as site A (the primary domain), B, C and D. Five major spider-based engines were tested (Google, Inktomi, FAST, Altavista and Teoma). All four domains resolve to the same IP, which made it easy to use FAST's IP filter to confirm that there really aren't more than four domains. Because these four domains are mapped to the same IP and content folder, they are true and exact duplicates. All four sites share the same robots.txt, although its syntax and the hosting setup do nothing to resolve the duplicates issue.

Search engine indexing (number of pages in the index)
Domain   Google   Inktomi   FAST   Altavista   Teoma
A        159      82        34     1           58
B        none     none      none   23          4
C        1        none      72     none        1
D        none     none      1      16          4

Inktomi has it completely correct, and Google nearly does. FAST and Altavista give prominence to the wrong domain, while Teoma has dupes from all four domains but gives prominence to the correct one.
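
The reading of the index counts can be reproduced with a short Python sketch. The numbers are copied from the listing of pages per engine, with "none" entered as 0; the `summarize` helper and the rule it applies (the domain with the most indexed pages counts as the prominent one) are assumptions made for illustration.

```python
# Pages indexed per engine for domains A-D, with "none" recorded as 0.
INDEXED_PAGES = {
    "Google":    {"A": 159, "B": 0,  "C": 1,  "D": 0},
    "Inktomi":   {"A": 82,  "B": 0,  "C": 0,  "D": 0},
    "FAST":      {"A": 34,  "B": 0,  "C": 72, "D": 1},
    "Altavista": {"A": 1,   "B": 23, "C": 0,  "D": 16},
    "Teoma":     {"A": 58,  "B": 4,  "C": 1,  "D": 4},
}

PRIMARY = "A"  # site A is the primary domain


def summarize(counts, primary=PRIMARY):
    """Return (most-indexed domain, sorted list of non-primary domains with pages)."""
    dominant = max(counts, key=counts.get)
    duplicates = sorted(d for d, n in counts.items() if d != primary and n > 0)
    return dominant, duplicates


if __name__ == "__main__":
    for engine, counts in INDEXED_PAGES.items():
        dominant, duplicates = summarize(counts)
        verdict = "correct domain" if dominant == PRIMARY else "wrong domain"
        print(f"{engine}: most pages from {dominant} ({verdict}), dupes from {duplicates or 'none'}")
```

Under that rule, Inktomi shows no duplicates, Google shows one stray domain, FAST and Altavista favor a duplicate over site A, and Teoma lists all three duplicates while still favoring site A.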

Search engines which rely heavily on hypertext to calculate importance (and they all do to a large degree these days) have it mostly right. However, you run a tremendous risk having dupes in these engines because their technology makes it a simple task to discover, analyze and reject duplicate domains. If a spam engineer finds duplicate listings, chances are they will take action against you, especially when someone complains about "flooding."

Each duplicate page can improperly "flood" a query and unfairly dominate search results. If you have duplicate sites or operate PPC landing sites, do not neglect to use the Robots Exclusion Protocol to manage your domains. Use the proper syntax and disallow robots from crawling duplicates. Their index count may drop, but you are doing the search engines a favor, and you can rest easy at night knowing you've done the right thing!

What if you find yourself in a situation where you have changed business models or created a new site that you want to promote? Well, let me tell you, SuccessWorks finds itself in that position today. Our company focus has shifted toward servicing XML Trusted Feed clients and publishing original material on SEO Writing (like Heather's free book chapters).

We have a domain hosted at PositionTech, where security is, rightfully, so tight that I have trouble getting remote FTP privileges. If you are a customer of theirs, rest assured your credit card data is safe. SuccessWorks also operates SuccessWks.com, which reflects Heather's services prior to my joining the company. Therefore, we have more than one domain that could potentially appear in results. All the content has purposely been written from scratch, and no true duplicate pages are indexed.

The plan we are implementing now is to use JavaScript or Meta-Refresh to direct users to the most recent versions of documents while I get the Robots Exclusion protocol established. The JavaScript will indicate to the host-master where we intend to redirect so that he (or she) can put the final piece in place: a 301 server message (Object permanently moved).
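
For readers who want to see what that final piece amounts to, here is a minimal Python sketch of a server answering requests for an old domain with a 301 status and a Location header. It is not our host's configuration: the host names, the port and the `redirect_target` and `serve` helpers are hypothetical, and on a production server the same result would normally come from the web server's own redirect settings.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical host names; substitute the real primary and legacy domains.
PRIMARY_HOST = "www.primary.example"
LEGACY_HOSTS = {"www.legacy.example", "legacy.example"}


def redirect_target(host_header, path):
    """Return the primary-domain URL for a legacy host, or None if no redirect applies."""
    hostname = host_header.split(":")[0].lower()  # drop any port suffix
    if hostname in LEGACY_HOSTS:
        return f"http://{PRIMARY_HOST}{path}"
    return None


class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        target = redirect_target(self.headers.get("Host", ""), self.path)
        if target is not None:
            # 301 tells browsers and spiders the document has permanently moved.
            self.send_response(301)
            self.send_header("Location", target)
            self.end_headers()
        else:
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"primary site content\n")


def serve(port=8000):
    """Run the sketch locally; call serve() by hand to try it."""
    HTTPServer(("127.0.0.1", port), RedirectHandler).serve_forever()
```

The redirect keeps the requested path, so a visitor (or spider) asking the old domain for a given document lands on the same document at the primary domain.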