A Powerful Robots.txt Technique
If you use a content management system like WordPress, Joomla, Drupal or osCommerce, you may be suffering from duplicate content issues. These can be really difficult problems to solve. Sometimes they require complicated .htaccess rewrite rules and endless hours spent finding links to nofollow. But there is an easier solution for most of these problems.
The trick here is to use your Google Webmaster Tools account and your robots.txt file.
1. Identify Duplicate Content Issues
One way to see whether your site has duplicate content issues is to check your Google Webmaster Tools account. Look under Diagnostics -> HTML Suggestions and check for duplicate title tags and meta descriptions. If you find some, you may simply be repeating the same title and description on several pages of your site. However, it may also indicate that your website is generating duplicate pages: if your site serves the same page at several URLs, the titles and meta descriptions will obviously be identical. In either case, you will want to correct the duplication.
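If you would rather check this yourself, the same idea can be sketched in a few lines of Python. This is only a rough illustration: it assumes you already have a list of (URL, title) pairs from a crawl or an export, and the function name and the example.com URLs are made up for the example.

```python
from collections import defaultdict

def find_duplicate_titles(pages):
    """Group URLs by normalized title; keep only titles shared by 2+ URLs."""
    by_title = defaultdict(list)
    for url, title in pages:
        # Normalize case and surrounding whitespace so near-identical titles group together.
        by_title[title.strip().lower()].append(url)
    return {title: urls for title, urls in by_title.items() if len(urls) > 1}

# Hypothetical sample data standing in for a real crawl or export.
pages = [
    ("http://example.com/widgets", "Blue Widgets"),
    ("http://example.com/widgets?page=1", "Blue Widgets"),
    ("http://example.com/about", "About Us"),
]

duplicates = find_duplicate_titles(pages)
for title, urls in duplicates.items():
    print(title, "->", urls)
```

Any title that shows up under more than one URL is a candidate for either a rewritten title tag or a robots.txt rule.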
Another thing to look for is sections of your website such as feeds, comment pages, and review pages. These pages generally create duplicate content, and you do not want the search engines crawling them.
2. How To Block Search Engines From Indexing Duplicate Content
Once you have identified the sections of your website that should be blocked from the search engines, create a robots.txt file. A robots.txt file is extremely simple to create, but it can also be incredibly damaging if you make one small mistake: a stray "Disallow: /" line, for example, blocks your entire site. I'm not going to go into how to make a robots.txt file, because http://www.robotstxt.org does a great job of explaining it.
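Because one small mistake can hurt, it is worth testing your rules before you upload them. Here is a minimal sketch using Python's standard urllib.robotparser module. The /feed/ and /comments/ paths and the example.com URLs are placeholders we made up, so swap in the sections you actually identified.

```python
from urllib.robotparser import RobotFileParser

# Draft robots.txt content; the blocked paths are hypothetical examples.
robots_txt = """\
User-agent: *
Disallow: /feed/
Disallow: /comments/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

def is_allowed(url, agent="*"):
    """Return True if the draft rules let the given user agent fetch the URL."""
    return parser.can_fetch(agent, url)

# Pages that must stay crawlable, and pages that should be blocked.
for url in ("http://example.com/about", "http://example.com/feed/rss"):
    print(url, "allowed" if is_allowed(url) else "blocked")
```

If a page you care about comes back as blocked, fix the file before it goes live.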
3. How To Attack Difficult Sections Of Your Website
A lot of content management systems like to use URLs that have question marks and equal signs in them. Luckily, Google allows you to block strings within URLs by using the * wildcard. Here is an example:
Disallow: /*?page=
This statement will block any URL that has the string ?page= in it. This is handy because most robots.txt examples only show how to block individual webpages and directories, not URLs that contain long, complicated strings. We just had a case where this little technique came in very handy!
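To see which URLs a pattern like this would catch, here is a rough Python sketch. It assumes the rule is written in Google's wildcard form as Disallow: /*?page= and uses a small hand-written matcher (the function names are ours), since, as far as we know, Python's built-in robots.txt parser only does plain prefix matching. It only approximates how Google reads the pattern and is not Google's own matching code.

```python
import re
from urllib.parse import urlsplit

def pattern_to_regex(pattern):
    """Translate a wildcard Disallow pattern: * matches any characters,
    and a trailing $ anchors the pattern to the end of the URL."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def is_blocked(url, pattern):
    """Check the URL's path plus query string against the pattern."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return pattern_to_regex(pattern).match(target) is not None

RULE = "/*?page="  # assumed form of the rule described above

for url in (
    "http://example.com/products?page=2",
    "http://example.com/products",
    "http://example.com/products?sort=asc&page=2",
):
    print(url, "blocked" if is_blocked(url, RULE) else "allowed")
```

Note that the last URL is not caught, because the literal string ?page= never appears in it. If page can show up after other parameters, you would need a second rule for &page= as well.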
Here are some good resources on robots.txt:
Robots.txt info for WordPress Websites.
Incredibly insightful post about robots.txt on SEOBook.com