By Jill Whalen © 2012, Used by Permission
As part of my SEO for 2013 and beyond series, I promised to provide more in-depth information about the “SEO killers” I mentioned last time.
Today I’m delving into duplicate content as it relates to SEO. My SEO audits of sites that lost traffic over the past year and a half showed that duplicate content was present on most of them. While it generally goes hand in hand with other SEO problems, duplicate content comes in so many forms that I found it to be the single most prevalent problem affecting a website’s success with Google. Before 2011, duplicate content was simply filtered out of the search results and that was that. However, post-Panda/Penguin, dupe content on websites can often have major repercussions.
How to Check for Duplicate Content
While there are numerous duplicate content checkers available, the simplest method is to copy random snippets of content from various pages of a website, wrap each one in quotation marks, and do a Google search. If a lot of dupes show up in the search results, you may have a Google duplicate content issue.
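For example, you might paste a sentence from one of your product pages into Google exactly as shown below (the snippet here is made up purely for illustration):
"Our handcrafted bear cave photo gifts make a great present for any wildlife lover"
If that exact phrase turns up on several of your own URLs, or on other people’s sites, you’ve found duplicate content worth investigating.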
Another good way to check for duplicate content is to look in your Google Webmaster Tools account. Under “Optimization” you’ll see “HTML Improvements.” If you click “Duplicate Title Tags,” you may learn about duplicate content you had no idea existed on your website.
Causes of Duplicate Content
There are many reasons why a website can end up with dupe content. Sometimes it’s just laziness on the part of the site owner. Other times it’s an attempt to gain more keyword traffic. In many cases, however, duplicate content is simply a mistake caused by technical issues.
For instance, one website I reviewed had their entire glossary and FAQ sections duplicated because they existed in both the root directory and a directory specifically for English-language pages. This caused lots of the same content to be indexed by Google on different (but similar) URLs, such as these:
http://www.example.com/medical-glossary.html
http://www.example.com/en/medical-glossary.html
The fix for this is of course to choose only one place to house the content.
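If the site runs on Apache, one way to handle the leftover /en/ URLs is a sitewide 301 redirect rule in the .htaccess file. This is just a minimal sketch assuming Apache with mod_rewrite enabled; your own CMS or host may offer its own redirect tool instead:
RewriteEngine On
RewriteRule ^en/(.*)$ http://www.example.com/$1 [R=301,L]
That tells Google (and visitors) that everything under /en/ has permanently moved to the equivalent URL in the root directory.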
Another site I reviewed had inadvertent duplicate content because they had too many similar but slightly different categories for their products, such as these:
http://www.example.com/bear-cave-photos/2428/dept
http://www.example.com/bear-cave-photo-gifts/1290/dept
They had more than 20 different URLs that all contained pretty much the same products. Interestingly enough, many of them were bringing in Google traffic directly. While that sounds like a good thing on the surface, I believe that if all the URLs had been consolidated into one, that single URL would have acquired more weight and overall Google PageRank. This in turn would have given it an even better chance of showing up for more targeted keyword phrases.
Another fix for this might be to use the canonical link element — i.e., rel=canonical — pointing to one main page. But my first choice would always be to fix the URLs by consolidating them. (Don’t forget to 301-redirect the others.)
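If consolidating really isn’t an option, the canonical link element goes in the <head> section of each near-duplicate page and points to the one version you want Google to index. A simplified example using the URLs above, assuming the first one is the page you want to keep:
<link rel="canonical" href="http://www.example.com/bear-cave-photos/2428/dept" />
Keep in mind that Google treats rel=canonical as a strong hint rather than a directive, which is one more reason I prefer consolidating and 301-redirecting when you can.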
I also found websites that had duplicate content issues simply because they used both initial capital letters and all lowercase, like these:
/people/JaneAusten
/people/janeausten
And then there was the old dupe content because the site appeared with both HTTP and HTTPS.
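Both of these are usually best fixed with sitewide 301 redirects to whichever version you’ve chosen as the one to keep. Here’s a minimal .htaccess sketch (again assuming Apache with mod_rewrite) that forces the HTTPS version; case problems are trickier, since lowercasing URLs generally requires a RewriteMap defined in the main server configuration or a fix at the application level:
RewriteEngine On
RewriteCond %{HTTPS} off
RewriteRule ^(.*)$ https://%{HTTP_HOST}/$1 [R=301,L]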
The most common cause of inadvertent duplicate content like the examples above is a poorly designed or misconfigured content management system (CMS).
I’ve seen CMS’s that output both parameter-laden URLs and clean URLs for the same products, such as these:
http://www.example.com/index.php?manufacturers_id=5555
http://www.example.com/brand-m-5555.html
None of this would be a problem if Google (and all search engines) did a better job of understanding that all of the above are simply technical issues. And while they should be able to, they often don’t. In fact, it seems that they’re allowing this sort of thing to hurt a website’s visibility in the search results more often today than they did years ago. I’ve always said that there was no such thing as a duplicate content penalty, but today there is in fact one (or more). My guess is that Google wants to encourage webmasters to do a better job of cleaning up the messes their CMS’s leave behind, because that makes Google’s job of crawling and indexing much easier.
Categorization Gone Crazy
Beyond technical issues, another common reason for duplicate content problems is that some products fit into multiple categories. For instance, an “Outdoor Gear” type of site may have multiple target audiences such as hikers, runners, cyclists, snowmobile riders, motorcyclists, ATV riders, etc. And some of their accessories — backpacks, jackets, gloves, etc. — may be of interest to several audiences. If the website is categorized by target market rather than by product type, and the products are found in each of those categories (under different URLs), that can lead to major duplicate content issues.
To fix category problems, re-categorize the site by product type (which may or may not be ideal) or ensure that no matter which category users enter from, they always end up at the same product page URL. (The page itself could explain exactly who might need the product shown.)
A Rose by Any Other Color
A similar duplicate content problem can occur when products come in various colors or sizes. If each of those different sized or colored products has its own page with the same basic description, it’s certainly a duplicate content issue. But that’s the way many CMS’s seem to work. It would be much better for both usability and search engines if those types of products simply had one page with the option to choose sizes and colors, etc., right there on that page.
Tags That Go to Infinity and Beyond
Using WordPress as your CMS isn’t always the answer to duplicate content issues either. Many sites that use it go crazy with their tagging of blog posts. New tags are made up for every new post, and each post gets tagged with a handful or more of them. Google then indexes the zillions of tag landing pages, each of which either has just one post tagged to it or contains the same posts as a bunch of other tag landing pages. It’s especially bad when the tag pages display complete blog posts rather than just the first paragraph or so.
My recommendation for sorting out that kind of mess is to create only a limited number of tags that the bloggers can use — perhaps 20 at most. If for whatever reason that’s not possible and you want to use your tags for keyword stuffing (a la The Huffington Post), then be sure to add nofollow to the tag links and noindex to the tag landing pages to avoid mega duplicate content Google problems.
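In practice, the noindex goes in a robots meta tag in the <head> of each tag archive page (many WordPress SEO plugins offer a setting for this), and the nofollow goes on the tag links themselves. Simplified examples, with a made-up tag name:
<meta name="robots" content="noindex, follow">
<a href="/tag/bear-caves" rel="nofollow">bear caves</a>
The noindex, follow combination keeps the tag pages out of Google’s index while still allowing Google to follow the links on them through to the posts.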
Once, Twice, Three Times You’re Out!
Of course, some duplicate content issues exist out of plain old laziness. I’ve run across many sites that put some content on the home page and then repeat the exact same thing on nearly every other page. What’s even worse is when other sites are using that same content as well!
And then there are the duplicate content issues that some companies create for themselves because they develop additional mini-sites (aka doorway domains) to try to gain even more search engine listings. In other words, if you are Cruella D’ville Inc. and sell Dalmatian blankets on your main website at CruellaD.com/dalmatian-blankets, it’s no longer a good idea to also sell the same ones at DalmatianBlankets.com — especially if you’re using the same basic content.
Don’t Let Greed and Laziness Bring You Down
Another form of lazy duplicate content is what’s been known for years as “Madlib Spam.” One common example of this is the site that offers the same service in multiple cities. They create individual city landing pages with the same basic content, switching out only the city name. I’ve even seen auto-generated madlib spam on product sites trying to capture all sorts of long-tail keyword traffic related to their products. While some of the content may make sense to the reader, more often than not it is gibberish. The sad thing is that even really good sites sometimes use this technique. But in 2013 and beyond this is exactly the kind of thing that could end up bringing a good site to a grinding Google halt. Google no longer stops at punishing just the auto-generated areas — even the good parts of a site could take a hit.
As you can see, duplicate content comes in many forms. While every website is likely to have a bit of it here and there, if your site has any of the issues mentioned here, you should set aside some time and money to clear it up, especially if you’ve noticed a significant loss in Google traffic at some point.
Great thoughts here on duplicate content, thanks Karon. More things to clean up now. Let’s go and do some wiping. 🙂
Jill, you advise that bloggers use only a limited number of tags to avoid loads of tags with only one post associated with them. Makes good sense to me.
One thing I’m curious about, though: even with a limited number of tags, search engines could in theory still see those tag pages as duplicate content.
I presume therefore that search engines also look at other things, such as directory structure (e.g. http://www.mysite.com/blog/tag/tag-name), to establish whether to treat content as unreasonably duplicate or not.
Do you happen to know if this is the case?
@Kevin, I don’t think the structure has anything to do with it. If you have more than one tag landing page, each with exactly the same content, then of course it’s duplicate content. That doesn’t mean it will necessarily be a problem, but it could be, especially if you have tons of those pages on your site.
I try to ensure that each tag (or category) page has a slightly different list of posts. There will of course be overlap, but that shouldn’t be a problem, IMO.
Thanks Jill
Sounds like a good idea to review your tags from time to time, just to ensure that they don’t get out of hand.