This post on HN:
And the one it refers to got me thinking about what SEO, spam and other tactics are really about, and how you can draw the line between what really qualifies as ‘spam’ and what is still ‘clean’.
It’s a surprisingly hard problem; it seems that everybody has their own definition of what qualifies as spam and what does not.
Let me outline a bunch of commonly used (and imo very spammy) SEO techniques to illustrate what this is all about:
Tactic 1, the link farm:
A bunch of ‘feeder’ websites are created that rank high for certain keywords and that are basically just gateways into the ‘real’ site, the payload.
Tactic 2, the recyclers:
People scan the web for content that ranks high for certain keywords; they then copy this content shamelessly and start promoting it, eventually reaching parity with the original, or even displacing it.
Imagine the number of pages in google were an even million, to make the calculations a bit simpler. If you have a website that depends solely on traffic and your ‘product’ is on a one-page website, then, all other things being equal, there is a 1 in 1,000,000 chance that your page will be the one visited by some random visitor typing words into google.
Words that are ‘special’, closely related to your product, will turn up your page more frequently than others, but there is still only the one ‘page slot’ used in google.
If, at the other extreme, all of the million pages that google links to were yours, the chance would be ‘1’. There would be no room for your competitors. And that’s what all these spam sites are about: artificially increasing the footprint of a site in google with more ‘slots’ to catch users, and artificially increasing the ranking of those pages beyond what their natural position in terms of authority should be.
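To make the slot arithmetic concrete, here is a minimal sketch of that ‘everything else being equal’ model: a random searcher lands on any indexed page with equal probability, so your share of the slots is your share of the visitors. The numbers and the function name are mine, purely for illustration.

```python
# Toy model of the 'page slot' idea: with N indexed pages and a perfectly
# even playing field, a random visitor is equally likely to land on any page,
# so your expected share of visitors equals your share of slots.
TOTAL_PAGES = 1_000_000  # the 'even million' from the example above

def visit_probability(your_pages: int, total_pages: int = TOTAL_PAGES) -> float:
    """Chance that a random visitor ends up on one of your pages."""
    return your_pages / total_pages

print(visit_probability(1))            # one honest page:          0.000001
print(visit_probability(10_000))       # a 10,000-page link farm:  0.01
print(visit_probability(TOTAL_PAGES))  # every slot is yours:      1.0
```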
So, this is just another runaway arms race. If your competitor does it, he’ll have you for lunch. Theoretically, of course, the person with a million pages will have a million lowly ranked pages, and yours will be ‘special’ and ranked higher. But in practice google’s algorithms are far from perfect, and the lack of perfection is large enough to make these page space wars a very lucrative way to make money.
Some keywords are so valuable that the spammers have completely skewed the frequency for those words vs other words in normal use. For instance, viagra occurs a whopping 42 million times, and diabetes clocks in at 60 million, and that’s not counting all the variations on the name.
We could probably lose 40 of the 42 million pages without a problem.
Another analogy. If the web were a street with stores on both sides, then the fraction of the total storefront length that your store occupies determines the likelihood of a customer entering your store versus someone else’s store. And in a real-life mall, that’s exactly how it works.
If search engines worked perfectly and were able to detect these artificially inflated storefronts 100% reliably, the spam would die out quickly. As long as search engines are imperfect in this respect, you can expect the problem to be here to stay.
So, back to Max’s project and the poster’s response to it. What Max advocates is not artificially increasing your storefront; that would be spam. What he advocates is executing many websites in parallel, with a minimum of effort and a maximum of automation, in order to limit the amount of time you have to spend on each project. And that’s fine, even if none of those are going to be the next google or ebay. As long as someone finds a site useful and you’re able to make more money on it than you invested, you’re doing good.
As soon as you start gaming the search engines in order to give that little project a larger footprint than it deserves, you’re crossing over into spam territory.