Saturday, 4 January 2014

How Exactly Does Google Handle Duplicate Content?


The internet of today is filled with content. More content is added every day than can possibly be read. But frankly, not all of the content that gets published is original - far from it. Duplicate content affects the search rankings of a web page - but how much effect does it actually have? A lot of bloggers and website owners ask this question every day, concerned about the content they've (necessarily or otherwise) copied, or that has been copied from them. Today, we'll try to answer this question in light of advice from Google itself.


Duplicate content happens a lot!

Alas, it's inevitable. Content continues to be duplicated today. In fact, roughly 30% of the content on the web is duplicate, and that only counts exact duplication. If you factor in article spinning and other such practices, the number climbs much higher - perhaps to 50% or more (nobody can be sure).

What you can be sure about is that your content always runs the risk of being duplicated. The more popular you are, the likelier you are to find your content duplicated.

But...

...not all duplicate content is plagiarised or a copyright violation! Sometimes, people will quote a paragraph or a few lines from another blog or website, say from a news or press release. Often, websites have canonical versions of the same pages for different regions on different domains, such as .com and .co.uk. You can also find duplicate versions of important pages, such as Terms of Service (ToS) pages.

So how exactly does Google treat duplicate content?

This means that Google does not - nay, cannot - penalize every website with duplicate content, because if it did, that would adversely affect its search quality. So what is the solution? The solution lies in grouping all duplicate content together, then showing the best contender in the search results. For example, if the BBC publishes a breaking news story and 10 other news sites copy that release, Google will detect the duplication, group those 10 copies (plus the 11th, the BBC's original) together, choose the best candidate from the group, and give it the top spot. The other pages will be pushed much lower in the search results.
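
To make the grouping idea concrete, here is a minimal sketch of how near-duplicate detection could work in principle, using word shingles and Jaccard similarity. This is purely illustrative - the function names, the shingle size, and the similarity threshold are my own assumptions, and Google's actual systems are far more sophisticated and not public.

    # Illustrative sketch only: word shingles + Jaccard similarity,
    # a common textbook technique for spotting near-duplicate text.
    # This is NOT Google's algorithm.

    def shingles(text, k=5):
        """Return the set of k-word shingles in a piece of text."""
        words = text.lower().split()
        return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

    def jaccard(a, b):
        """Similarity of two shingle sets: 0.0 = unrelated, 1.0 = identical."""
        if not a and not b:
            return 1.0
        return len(a & b) / len(a | b)

    def group_duplicates(pages, threshold=0.8):
        """Greedily group pages whose shingle similarity exceeds the threshold."""
        groups = []  # each group is a list of (url, shingle_set) pairs
        for url, text in pages:
            s = shingles(text)
            for group in groups:
                if jaccard(s, group[0][1]) >= threshold:
                    group.append((url, s))
                    break
            else:
                groups.append([(url, s)])
        return [[url for url, _ in group] for group in groups]

    pages = [
        ("bbc.co.uk/news/1", "The quick brown fox jumps over the lazy dog near the river bank today"),
        ("copycat.com/a",    "The quick brown fox jumps over the lazy dog near the river bank today"),
        ("unrelated.org/b",  "A completely different article about gardening tips for the winter season"),
    ]
    print(group_duplicates(pages))
    # [['bbc.co.uk/news/1', 'copycat.com/a'], ['unrelated.org/b']]

Once pages land in the same group, the search engine only has to decide which member of the group deserves to be shown.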

But how does Google know which contender is the original or best source? I'm glad you asked - it is an extremely important question. The problem is usually solved by weighing many factors. These include the authority of each source (PageRank etc.), the date and time the content was published on each page, and the structure of each copy - that is, the nature of the keywords and links in the content. Are they relevant to the host page? Elements such as internal links will be most relevant on the original source rather than on the copies.
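
To show how such signals might be combined, here is a toy scoring sketch. The signal names, the weights, and the idea of a simple weighted sum are assumptions invented for this example - they are not Google's actual formula, which is unknown and certainly far more complex.

    # Toy illustration of picking the "best" copy from a duplicate group.
    # The signals and weights below are made up for this example.

    from dataclasses import dataclass

    @dataclass
    class Candidate:
        url: str
        authority: float            # e.g. a PageRank-like score, normalised to 0..1
        published_ts: int           # Unix timestamp of first publication
        internal_link_ratio: float  # share of links in the copy pointing to its own site

    def pick_canonical(candidates):
        """Return the candidate most likely to be the original source."""
        earliest = min(c.published_ts for c in candidates)
        latest = max(c.published_ts for c in candidates)
        span = max(latest - earliest, 1)

        def score(c):
            freshness = 1.0 - (c.published_ts - earliest) / span  # earliest copy scores highest
            return 0.5 * c.authority + 0.3 * freshness + 0.2 * c.internal_link_ratio

        return max(candidates, key=score)

    group = [
        Candidate("bbc.co.uk/news/1", authority=0.9, published_ts=1_388_800_000, internal_link_ratio=0.8),
        Candidate("copycat.com/a",    authority=0.3, published_ts=1_388_830_000, internal_link_ratio=0.1),
    ]
    print(pick_canonical(group).url)   # bbc.co.uk/news/1

The point of the sketch is simply that an authoritative source that published first and whose links fit its own site naturally comes out on top, which matches the advice below.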

So if you are the original source, you need not worry. If you've copied from somewhere, it won't result in a penalty, but you won't really benefit from it either.

But what about people who simply republish RSS feeds or posts from other sites on their own? Well, in that case, Google has to step in. If your blog is hosted on Blogger, blatant plagiarism can get it banned. Someone might file a copyright violation complaint against you, which can only mean trouble. And even if you manage to avert such cases, you'll eventually see a ranking drop in the search results. The effect won't be apparent at first, but given time, it will become profound.

Did that clear up the questions in your mind? If you still have more, you know where to ask them :) Cheers.

