5 Ways to Avoid Duplicate Content and Indexing Problems


Before a page can rank well, it must be crawled and indexed.

More than any other type of site, eCommerce sites are known for developing URL structures that create crawling and indexing problems with search engines. It's important to keep this under control to avoid duplicate content and crawl budget complications.

Here are 5 ways to keep your eCommerce site's indexing optimal.


1.- Know what is in the Google index

To start, it's important to regularly check how many of your pages Google reports as indexed. To do this, run a “site:example.com” search on Google to see how many of your site's pages Google knows about.


While Google's Webmaster Trends Analyst, Gary Illyes, has mentioned that this number is just a rough estimate, it's the easiest way to identify if something is seriously wrong with your site's indexing.

Regarding the number of pages reported in its index, Bing's Stefan Weitz also admitted that Bing will

…guess the number, which is usually wrong… I think Google has had it for so long that people expect to see it there.

The page counts from your content management system (CMS) or eCommerce platform, your sitemap, and your server files should match almost perfectly, or at least have any discrepancies addressed and explained. Those numbers, in turn, should be roughly aligned with what a Google “site:” operator search returns.

A site developed with SEO in mind helps considerably here, by avoiding the duplicate content and structural issues that create indexing problems.

Although too few results in the index can be a problem, too many results are also a problem, as this can mean you have duplicate content in search results. Although Illyes has confirmed that there is no “duplicate content penalty,” duplicate content still hurts your crawl budget and can dilute your pages' authority across the duplicates.


If Google returns too few results:

  • Identify which pages in your sitemap are not showing up in your Google Analytics organic search traffic. (Use a long date range.)
  • Search a representative sample of these pages on Google to identify which ones are actually missing from the index. (You don't need to do this for every page.)
  • Identify patterns on pages that are not indexed and address them systematically on your site to increase those pages' chances of indexing. Patterns to look for include duplicate content issues, lack of inbound internal links, non-inclusion in the XML sitemap, unintentional non-indexing or canonicalization, and HTML with serious validation errors.
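The first step above can be sketched in a few lines of Python: parse the XML sitemap and subtract the set of URLs that appeared in your analytics export. The sitemap string and the analytics URL set below are illustrative stand-ins for your real data.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def pages_missing_traffic(sitemap_xml: str, traffic_urls: set) -> list:
    """Return sitemap URLs that never appeared in an organic-traffic export."""
    root = ET.fromstring(sitemap_xml)
    sitemap_urls = [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc")]
    return [url for url in sitemap_urls if url not in traffic_urls]

# Illustrative sitemap and analytics data.
sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/widgets</loc></url>
  <url><loc>https://example.com/widgets/blue</loc></url>
</urlset>"""

seen_in_analytics = {"https://example.com/", "https://example.com/widgets"}

print(pages_missing_traffic(sitemap, seen_in_analytics))
# ['https://example.com/widgets/blue']
```

The pages this returns are your candidates for the “site:” spot checks in the second step.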

If Google is returning too many results:

  • Run a site crawl with Screaming Frog, DeepCrawl, Sitebulb or a similar tool and identify pages with duplicate titles, as they typically have duplicate content.
  • Determine what is causing the duplicates and remove them. There are several causes and solutions that will make up most of the rest of this post.

2.- Optimize sitemaps, robots.txt and navigation links

These three elements are critical to strong indexing and have been covered in depth elsewhere, but I would be remiss if I didn't mention them here.

I cannot stress enough the importance of a comprehensive sitemap. In fact, it seems we've reached the point where it's even more important than your internal links. Gary Illyes recently confirmed that even search results for “core” keywords (as opposed to long-tail keywords) can include pages without inbound links, even without internal links. The only way Google could have known about these pages is through the sitemap.

It's important to note that Google and Bing guidelines still say that pages should be accessible from at least one link, and sitemaps in no way downplay the importance of this.

It's equally important to make sure your robots.txt file is functional, doesn't block Google from any part of your site you want indexed, and declares the location of your sitemap. A functional robots.txt file is very important: according to Illyes, a broken one can cause Google to stop indexing your site entirely.
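A minimal robots.txt that does all three of these things might look like the sketch below; the disallowed paths and sitemap URL are placeholders for your own.

```text
# Allow all crawlers, keep the cart and internal search out of the crawl,
# and declare the sitemap location.
User-agent: *
Disallow: /cart/
Disallow: /search
Allow: /

Sitemap: https://example.com/sitemap.xml
```

Note that Disallow only controls crawling, not indexing; pages you want removed from the index need noindex or canonicalization, covered later in this post.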

Finally, an intuitive and logical navigation link structure is a must for good indexing. In addition to the fact that every page waiting to be indexed must be accessible from at least one link on your site, good UX practices are essential. Categorization is essential for this.

For example, psychologist George Miller's research, summarized by the Interaction Design Foundation, suggests that the human mind can only hold about seven chunks of information in short-term memory at a time.

I recommend designing your navigation structure around this limitation, and perhaps even going further: five categories per menu section and five subcategories per dropdown menu may be easier still to navigate.

Here are some important points that Google representatives have made about navigation and indexing:

  • Accordions and tabs that hide navigation elements are fine if they improve the user experience. In a mobile-first world, hiding items this way doesn't hurt indexing.
  • Use breadcrumb navigation; breadcrumbs are included in the PageRank calculation.
  • Google Webmaster Trends analyst John Mueller has said that any standard menu style, such as a mega menu or drop-down menu, is fine, but poor URL structures that produce too many URLs for a single page are a problem.
  • Gary Illyes has also said that you should avoid using the nofollow attribute on your own content or internal links.
  • Googlers have stated many times that internal link anchor text is a factor, so make sure your navigation links are descriptive and useful, and avoid keyword stuffing.
  • Avoid infinite spaces or spider traps. These are typically created when interactive site features, such as calendars, are implemented with links.
  • Run a crawler on your site to determine if you end up crawling more pages than you expect to find, as this can help you identify navigation links that create duplicates, infinite spaces, and other problems.
  • Keep your URLs as close to the root as is practical from a user experience (UX) perspective. Gary Illyes has said that pages further from the root will be crawled and discovered less frequently.
  • Make sure your site's full navigation is accessible from mobile devices, as mobile-first indexing means this is the version Google is using to index your site.

Bing recommends the following:

  • Keyword-rich URLs that avoid session variables and docIDs.
  • A highly functional site structure that encourages internal linking.
  • A hierarchy of organized content.

3.- Get control over URL parameters

URL parameters are a very common cause of “infinite spaces” and duplicate content, which severely limits crawl budget and can dilute signals. They are variables added to your URL structure that carry server instructions used to do things like:

  • Sort items
  • Store user session information
  • Filter items
  • Customize the appearance of the page
  • Return on-site search results
  • Track advertising campaigns or pass information to Google Analytics

If you use Screaming Frog, you can identify URL parameters in the URI tab by selecting “Parameters” from the “Filter” drop-down menu.

Examine the different types of URL parameters at play. Any URL parameters that do not have a significant impact on the content, such as ad campaign tags, ranking, filtering, and personalization, should be treated with a noindex or canonicalization directive (and never both). More on this later.
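One practical way to reason about this is to normalize URLs yourself: strip the parameters that don't affect content and see what remains. The sketch below assumes a made-up list of non-content parameters; your own list will depend on your platform.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Parameters assumed (for illustration) to have no effect on page content.
NON_CONTENT_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sort", "sessionid"}

def canonical_url(url: str) -> str:
    """Drop non-content query parameters, keeping the rest in their original order."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in NON_CONTENT_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

print(canonical_url("https://example.com/widgets?color=blue&sort=price&utm_source=news"))
# https://example.com/widgets?color=blue
```

If two crawled URLs normalize to the same string, they are duplicates, and the normalized form is a natural candidate for the canonical URL.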

Bing also offers a handy tool to override selected URL parameters in the Configure My Site section of Bing Webmaster Tools.

If parameters have a significant impact on content in a way that creates pages that are not duplicates, here are some of Google's recommendations on proper implementation:

  • Use standard URL encoding, in the “?key=value&” format. Do not use non-standard encodings such as square brackets or commas.
  • Use parameters, never file paths, to convey values that do not have a significant impact on the content of the page.
  • User-generated values that do not have a significant impact on the content should be placed in a filtering directory that can be hidden with robots.txt, or handled with some form of noindexing or canonicalization.
  • Use cookies instead of extraneous parameters if a large number of them are necessary for user sessions to eliminate content duplication that burdens web crawlers.
  • Don't generate parameters for user filters that produce no results, so empty pages are not indexed or taxed by web crawlers.
  • Only allow pages to be crawled if they produce new content for search engines.
  • Don't allow links to be clicked for categories or filters that don't have products.


4.- Good and bad filters

When should a filter be crawlable by search engines, and when should it not be indexed or canonicalized? My rule of thumb, influenced by Google recommendations above, is that “good” filters:

  • Act as a meaningful extension of your product categories, producing distinct but strong pages.
  • Help specify a product.

These pages are, or should be, indexed. “Bad” filters, in my opinion:

  • Rearrange content without changing it in other ways, such as sorting by price or popularity.
  • Maintain user preferences that change the layout or design but do not affect the content.

These types of filters should not be indexed, and instead they should be addressed with AJAX, noindex or canonicalization directives.

Bing warns webmasters not to use the AJAX pushState function to create URLs with duplicate content, as this defeats the purpose.


5.- Appropriate use of noindex and canonicalization

Noindexing tells search engines not to index a page, while canonicalization tells search engines that two or more URLs are actually the same page, but one is the “official” canonical page.

For duplicates or near-duplicates, canonicalization is preferred in most cases, since it preserves SEO authority, but it is not always possible. In some circumstances you don't want any version of the page indexed, in which case noindex should be used.

Don't use noindex and canonicalization at the same time. John Mueller warned against this because you could tell search engines not to index the canonical page as well as duplicates, although he said Google would probably treat the canonical tag as an error.
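In HTML, the two directives look like the fragments below; the URL is a placeholder. Per Mueller's warning, a page should carry one or the other, never both.

```html
<!-- Duplicate or near-duplicate page: point to the canonical version. -->
<link rel="canonical" href="https://example.com/widgets" />

<!-- Page that should not appear in search results at all. -->
<meta name="robots" content="noindex" />
```

The canonical tag goes in the head of every duplicate; the noindex meta tag goes in the head of the page you want kept out of the index.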

Here are things that should be canonicalized:

  • Canonicalize duplicates created by navigation parameters and faceted URLs to the standard version of the page.
  • Canonicalize paginated content to a consolidated “view all” page.
  • Canonicalize any A/B split or multivariate tests to the official URL.

Here are things I recommend noindexing:

  • Any membership areas or staff login pages.
  • Any shopping cart and thank you pages.
  • Internal site search results pages. Illyes has said: “Generally, they are not that useful to users and we have some algorithms that try to get rid of them…”
  • Any duplicate page that cannot be canonicalized.
  • Narrow product categories that are not unique enough to their parent categories.

As an alternative to canonicalization, Bing recommends using its URL normalization feature, found in Bing Webmaster Tools. This limits the amount of crawling needed and allows newer content to be indexed more easily.