Beginner's Guide to SEO, Chapter 2: Crawling, Indexing and Ranking
It's been a while since the first part of the Beginner's Guide to SEO (Chapter 1: SEO 101), but after a brief hiatus, we're back to share the next installment, Chapter 2.
Chapter 2: How Search Engines Work: Crawling, Indexing and Ranking
First, show up.
As we mentioned in Chapter 1, search engines are answer machines. They exist to discover, understand, and organize Internet content in order to deliver the most relevant results to the questions users ask.
To appear in search results, your content must first be visible to search engines. This is arguably the most important piece of the SEO puzzle: if your site can't be found, there's no way it will ever appear in the SERPs (Search Engine Results Pages).
How do search engines work?
Search engines have three main functions:
Crawl: scour the Internet for content, reviewing the code/content of each URL found.
Index: store and organize the content found during the crawling process. Once a page is in the index, it's in the running to be displayed as a result for relevant queries.
Rank: provide the pieces of content that will best answer a user's query, ordering search results from most to least helpful for that particular query.
What is search engine crawling?
Crawling is the discovery process in which search engines send out a team of robots (known as crawlers or spiders) to find new and updated content. Content can vary: it might be a web page, an image, a video, a PDF, etc., but regardless of the format, content is discovered through links.
The bot starts by fetching a few web pages and then follows the links on those pages to find new URLs. By hopping along this path of links, crawlers can find new content and add it to their index, a massive database of discovered URLs, to be retrieved later when a searcher is looking for information that the content at that URL is a good match for.
What is a search engine index?
Search engines process and store the information they find in an index, a huge database of all the content they have discovered and consider good enough to serve up to searchers.
Search Engine Ranking
When someone performs a search, search engines scour their index for highly relevant content and then order that content in the hope of answering the user's query. This ordering of search results by relevance is known as ranking. In general, you can assume that the higher a website ranks, the more relevant the search engine believes that site is to the query.
It is possible to block search engine crawlers from part or all of your site, or tell search engines to avoid storing certain pages in their index. While there may be reasons for doing so, if you want users to find your content, you must first ensure that it is accessible to crawlers and can be indexed. Otherwise, it's as good as invisible.
By the end of this chapter, you'll have the context you need to work with the search engine, and not against it!
Note: In SEO, not all search engines are the same
Many beginners wonder about the relative importance of particular search engines. Most people know that Google has the largest market share, but how important is it to optimize for Bing, Yahoo, and others? The truth is that despite the existence of more than 30 major web search engines, the SEO community really only pays attention to Google. Why? The short answer is that Google is where the vast majority of people search the web. If we include Google Images, Google Maps, and YouTube (a Google property), more than 90% of web searches happen on Google: nearly 20 times Bing and Yahoo combined.
Crawling: Can search engines find your site?
As you just learned, making sure your site is crawled and indexed is a prerequisite to appearing in the SERPs. First things first, you can check how many pages of your website have been indexed by Google using “site:yourdomain.com”, an advanced search operator.
Head to Google and type “site:yourdomain.com” into the search bar. This will return the results Google has in its index for the specified site:
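You can scope the operator to a whole domain or to a section of it (yourdomain.com stands in for your actual domain):

    site:yourdomain.com
    site:yourdomain.com/blog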
The number of results Google displays (the “About __ results” count near the top of the results page) is not exact, but it gives you a solid idea of which pages on your site are indexed and how they currently appear in search results.
For more accurate results, monitor and use the Index Coverage report in Google Search Console. You can sign up for a free Google Search Console account if you don't currently have one. With this tool, you can submit sitemaps for your site and monitor how many submitted pages have actually been added to Google's index, among other things.
If you don't appear anywhere in the search results, there are a few possible reasons:
- Your site is brand new and has not been crawled yet.
- Your site is not linked from any external website.
- Your site navigation makes it difficult for a bot to crawl it effectively.
- Your site contains some basic code called crawler directives that blocks search engines.
- Your site has been penalized by Google for fraudulent tactics.
If your site has no other sites linking to it, you can still get it indexed by submitting your XML sitemap in Google Search Console or by submitting individual URLs to Google. There's no guarantee they'll include a submitted URL in their index, but it's worth a try!
Can search engines see your entire site?
Sometimes a search engine can find parts of your site by crawling, but other pages or sections may be obscured for one reason or another. It's important to make sure search engines can discover all the content you want indexed, and not just your home page.
Ask yourself this: can bots crawl through your website, and not just to it?
Is your content hidden behind login forms?
If you require users to log in, fill out forms, or answer surveys before accessing certain content, search engines won't see those protected pages. A crawler is definitely not going to log in.
Are you relying on search forms?
Robots cannot use search forms. Some people mistakenly believe that if they place a search box on their site, search engines will be able to find everything their visitors search for; in reality, a crawler can't type queries into that box.
Is text hidden within non-text content?
Non-text media forms (images, video, GIF, etc.) should not be used to display text that you want to index. While search engines are getting better at image recognition, there's no guarantee they can read and understand it just yet. It is always better to add text within the markup of your website.
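As a quick illustration (the file name and wording here are hypothetical), compare text baked into an image with text placed directly in the markup:

    <!-- Text rendered inside an image: crawlers may not be able to read it -->
    <img src="/images/summer-sale-banner.png" alt="Summer sale: 20% off all shoes">

    <!-- Better: the message lives in the HTML itself -->
    <h2>Summer sale: 20% off all shoes</h2>

Descriptive alt text, as in the first snippet, at least gives search engines a textual hint about what the image shows.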
Can search engines follow your site navigation?
Just as a crawler needs to discover your site via links from other sites, it needs a path of links on your own site to guide it from page to page. If you have a page you want search engines to find, but it isn't linked to from any other page, it's as good as invisible. Many sites make the critical mistake of structuring their navigation in ways that are inaccessible to search engines, hindering their ability to get listed in search results.
Common navigation errors that can prevent crawlers from seeing your entire site:
- Having a mobile navigation that shows different results than your desktop navigation
- Any type of navigation where the menu items are not in the HTML, such as JavaScript-enabled navigations. Google has gotten much better at crawling and understanding JavaScript, but it's still not a perfect process. The more surefire way to ensure something gets found, understood, and indexed by Google is to put it in the HTML (see the sketch below).
- Personalization, or showing unique navigation to a specific type of visitor versus others, which could appear to be cloaking to a search engine crawler
- Forgetting to link to a primary page on your website through your navigation. Remember, links are the paths crawlers follow to new pages!
This is why it's essential that your website has clear navigation and a helpful URL folder structure.
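To make the JavaScript point above concrete, here is a minimal sketch (the URLs and labels are hypothetical) of navigation crawlers can reliably follow: plain anchor links present directly in the HTML:

    <nav>
      <a href="/mens-shoes/">Men's Shoes</a>
      <a href="/womens-shoes/">Women's Shoes</a>
      <a href="/sale/">Sale</a>
    </nav>

If those links are only injected by a script after the page loads, or are click handlers without real href attributes, a crawler may never discover the pages behind them.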
Information architecture
Information architecture is the practice of organizing and labeling content on a website to improve efficiency and findability for users. The best information architecture is intuitive, meaning users shouldn't have to think very hard to move through your website or find something.
Your site should also have a helpful 404 page (page for content not found) for when a visitor clicks on a dead link or misspells a URL. The best 404 pages allow users to click back to your site so they don't bounce just because they tried to access a non-existent link.
Tell search engines how to crawl your site
In addition to making sure crawlers can reach your most important pages, it's also pertinent to note that you'll have pages on your site you don't want them to find. These can include things like old URLs with thin content, duplicate URLs (such as sort-and-filter parameters for e-commerce), special promo code pages, staging or test pages, and so on.
Blocking pages from search engines can also help crawlers prioritize your most important pages and maximize your crawl budget (the average number of pages a search engine bot will crawl on your site).
Crawl directives allow you to control what you want Googlebot to crawl and index using a robots.txt file, meta tags, and/or a sitemap.xml file.
Robots.txt
Robots.txt files live in the root directory of a website (for example, yourdomain.com/robots.txt) and suggest which parts of your site search engines should and shouldn't crawl via specific robots.txt directives. This is a great solution when you're trying to keep search engines away from non-private pages on your site.
You don't want to use robots.txt to hide private/sensitive pages, because the file itself is publicly accessible: users and bots alike can open it and see exactly which URLs you're trying to keep out of sight.
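A minimal robots.txt sketch (the folder and parameter names are hypothetical) that keeps crawlers out of a staging area and out of parameter-generated duplicates, while pointing them to the sitemap:

    # Applies to all crawlers
    User-agent: *
    # Don't crawl the staging/test area
    Disallow: /staging/
    # Don't crawl duplicate URLs created by sort parameters
    Disallow: /*?sort=
    # Tell crawlers where the sitemap lives
    Sitemap: https://yourdomain.com/sitemap.xml

Note that the * wildcard in Disallow rules is honored by Google and Bing but isn't part of the original robots.txt standard, so smaller crawlers may ignore it.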
Pro tip:
- If Googlebot cannot find a robots.txt file for a site (HTTP status code 40X), it proceeds to crawl the site.
- If Googlebot finds a robots.txt file for a site (HTTP status code 20X), it will generally comply with the suggestions and proceed to crawl the site.
- If Googlebot doesn't get a 20X or 40X HTTP status code when requesting robots.txt (for example, a 50X server error), it can't determine whether a robots.txt file exists and won't crawl your site.
Meta directives
The two types of meta directives are the meta robots tag (the more commonly used) and the x-robots-tag. Each provides crawlers with stronger instructions on how to crawl and index a URL's content.
The x-robots-tag provides more flexibility and functionality if you want to block search engines at scale, because it can use regular expressions, block non-HTML files, and apply sitewide noindex tags.
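As a hedged sketch of both types: the meta robots tag goes in a page's <head>, while the X-Robots-Tag example below shows one common way to send the header from Apache (assuming mod_headers is enabled; the exact syntax depends on your server):

    <!-- Meta robots tag: keep this page out of the index, but follow its links -->
    <meta name="robots" content="noindex, follow">

    # Apache config: apply noindex via the HTTP header to every PDF on the site
    <FilesMatch "\.pdf$">
      Header set X-Robots-Tag "noindex"
    </FilesMatch>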
These are the better options for blocking search engines from the most sensitive/private* URLs.
* For very sensitive URLs, it is recommended to remove them or require a secure login to view the pages.
WordPress tip: In Dashboard > Settings > Reading, make sure the “Search Engine Visibility” box is unchecked; checking it blocks search engines from your site via your robots.txt file!
Avoid these common pitfalls, and you'll have clean, crawlable content that allows bots easy access to your pages.
Once you've ensured that your site has been crawled, the next order of business is to make sure it can be indexed.
Sitemaps
A sitemap is exactly what it sounds like: a list of URLs on your site that crawlers can use to discover and index your content. One of the easiest ways to ensure Google finds your highest-priority pages is to create a file that meets Google's standards and submit it through Google Search Console. While submitting a sitemap doesn't replace the need for good site navigation, it can certainly help crawlers follow a path to all of your important pages.
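A minimal sitemap.xml sketch following the sitemaps.org protocol (the URLs and date are hypothetical; <loc> is the only required child tag):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://yourdomain.com/</loc>
        <!-- Optional: when the page last changed -->
        <lastmod>2024-06-01</lastmod>
      </url>
      <url>
        <loc>https://yourdomain.com/important-page</loc>
      </url>
    </urlset>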
URL parameters in Google Search Console
Some sites (most commonly e-commerce) make the same content available at multiple different URLs by appending certain parameters to the URLs. If you've ever shopped online, chances are you've narrowed your search with filters. For example, you might search for “shoes” on Amazon and then refine your search by size, color, and style. Each time you refine, the URL changes slightly. How does Google know which version of the URL to serve to searchers? Google does a good job of working out the representative URL on its own, but you can use the URL Parameters feature in Google Search Console to tell Google exactly how you want your pages treated.