Google is like a giant vacuum, sucking in and indexing content as it finds it on the web. When it finds fresh content it hasn’t seen before, it indexes the page and everything on it. It also records the page and site where it first found that content. If it later comes across the same or very similar content on a different page or website, it mostly ignores it as duplicate content. This behavior can be exploited, in what I’m calling an EarlyBird attack, to keep competitor websites from ever seeing the light of day, or the web browser of a prospective client.
Search engines (I’m using Google in this article) return a list of results when a visitor searches for keywords. They decide which results to return based on many different criteria. Depth of content on a page, uniqueness of content and relevance to the searched words are just a few of them. To a website owner, a spot on the first page of these search results is coveted. It means traffic and, depending on the niche (narrow area of interest), it can mean a LOT of traffic.
Websites can generate income for their owners through advertising and affiliate marketing. Getting to the first page of search engine results for their target words and phrases could mean the difference between making $5 a day and $5,000 a day. Yes, it can be that dramatic. For this reason, some less-than-honest website owners will resort to dishonest methods of getting to that first page.
Here’s an example of how this EarlyBird attack works.
RunningWebsiteA is a great site about the latest running shoes. Every day, it posts new articles and information about the world of running, jogging, racing and the shoes used by the competitors. The Google search engine gets accustomed to new content appearing on RunningWebsiteA several times a day. For this reason, it checks RunningWebsiteA for new content several times a day (the exact rate varies; this is just an example). RunningWebsiteA is raking in hundreds of dollars a day from advertisers and affiliate programs.
Joe Smith has an interest in running shoes, sees that the niche is profitable and wants to start RunningWebsiteB in the same niche. He prepares dozens of articles for his new website, registers RunningWebsiteB.com, then publishes his content.
The owner of RunningWebsiteA is shady and knows about the EarlyBird attack. Every day, either programmatically or manually, he grabs a copy of the web domains registered in the last twenty-four hours. Then, either in Excel or using a script, he searches that list for words relating to his niche (running, shoes, etc.). He notices RunningWebsiteB.com because of the word “running” in the domain name and adds it to his list of possible competitors. Once he has finished looking through the list, he immediately visits the new websites that look like they might threaten his dominance in the search engines.
He lands on RunningWebsiteB.com and sees stiff competition that might end up beating him in search engine results. Again, being less than honest, the owner of RunningWebsiteA copies all the content from RunningWebsiteB and publishes it as articles on his own website. Since Google is already checking his website several times a day for new content, it sucks up the content and records that it first found it on RunningWebsiteA.
Since RunningWebsiteB is brand new, Google won’t index it as frequently. When it does get around to indexing it, a few days after the site has started, it finds content that appears to duplicate RunningWebsiteA and ignores it. Meanwhile, over the next few weeks, the owner of RunningWebsiteA keeps scraping any new content off RunningWebsiteB every day and adding it as his own, which Google gobbles up and attributes to him.
The result is that the articles from RunningWebsiteB never get found by anyone who’s searching for information. RunningWebsiteA has stolen all of its content and, unless the owner of RunningWebsiteB is aware of what’s going on, he’ll never know why his site didn’t take off.
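How a search engine actually decides that two pages hold “the exact or very similar content” isn’t public. One common textbook way to measure it is to break each text into overlapping word groups (“shingles”) and compare the two sets with Jaccard similarity. The sketch below only illustrates that general idea; it is not Google’s method, and the function names and the shingle size of 3 are made up for this example.

```python
def shingles(text, size=3):
    """Return the set of overlapping `size`-word groups in `text` (hypothetical helper)."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(text_a, text_b, size=3):
    """Share of shingles the two texts have in common: 0.0 (none) to 1.0 (identical sets)."""
    set_a, set_b = shingles(text_a, size), shingles(text_b, size)
    if not set_a or not set_b:
        # One of the texts is shorter than a single shingle; treat as no overlap.
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    original = "the new racing flat is lighter than last year's model"
    scraped = "the new racing flat is lighter than last year's model"
    unrelated = "our bakery opens at seven every morning of the week"
    print(jaccard_similarity(original, scraped))
    print(jaccard_similarity(original, unrelated))
```

A site owner could run the same comparison between his own article and a suspicious page: a score close to 1.0 suggests the text was copied rather than written independently.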
Exploiting this behavior requires only a few things:
- An existing website with frequently updated content that Google checks often
- The daily list of newly registered domains
- A script to search the list, check the matching sites and scrape their content
No, I’m not writing anyone a script, but for someone with a financial incentive and the inclination to do it, it really isn’t a difficult task. I wouldn’t be surprised if the EarlyBird attack is already being exploited by less-than-honest niche website owners.
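The #1 takeaway from this is to check search engines for exact, unique phrases used in your website content, even (and especially) if your site is new. Put a distinctive sentence from one of your articles in quotation marks and search for it; if a site other than yours shows up, someone has copied your content.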
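Checking by hand gets tedious once you have more than a few articles, so here is a minimal sketch of the defensive side. It picks the longest sentences from an article (on the assumption that long sentences are less likely to appear elsewhere by coincidence) and turns each into a quoted, exact-phrase Google search URL that you can open in a browser. The function names, the 8-word minimum and the 12-word cap are my own choices for illustration, not anything Google prescribes.

```python
import re
from urllib.parse import quote_plus


def distinctive_sentences(text, count=3, min_words=8):
    """Return up to `count` sentences with at least `min_words` words, longest first."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    candidates = [s.strip() for s in sentences if len(s.split()) >= min_words]
    # Assumption: longer sentences make better fingerprints for a page.
    candidates.sort(key=lambda s: len(s.split()), reverse=True)
    return candidates[:count]


def exact_phrase_url(sentence, max_words=12):
    """Build a Google search URL for the sentence as a quoted (exact) phrase."""
    words = sentence.rstrip(".!?").split()[:max_words]
    phrase = " ".join(words)
    return "https://www.google.com/search?q=" + quote_plus(f'"{phrase}"')


if __name__ == "__main__":
    article = (
        "Short one. This trail shoe uses a dual density foam midsole "
        "that resists compression on long descents. Buy now!"
    )
    for sentence in distinctive_sentences(article):
        print(exact_phrase_url(sentence))
```

Open each printed URL and look at which sites come back. The script deliberately stops at building the URLs and does not fetch or parse search results automatically.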