How Crawlers Work

When you define a crawler, you choose one or more classifiers that evaluate the format of your data to infer a schema. When the crawler runs, the first classifier in your list that successfully recognizes your data store is used to create a schema for your table. You can use built-in classifiers or define your own. Custom classifiers are defined in a separate operation, before you define the crawler. AWS Glue provides built-in classifiers to infer schemas from common file formats, including JSON, CSV, and Apache Avro. For the current list of built-in classifiers in AWS Glue, see Built-in classifiers in AWS Glue.
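
For concreteness, here is a minimal boto3 sketch of that order of operations: the custom classifier is created first, then the crawler that references it. Every name here, along with the grok pattern, role ARN, and S3 path, is a hypothetical placeholder.

import boto3

glue = boto3.client("glue")

# Custom classifiers are defined in a separate operation, before the crawler.
glue.create_classifier(
    GrokClassifier={
        "Name": "my-log-classifier",
        "Classification": "custom-logs",
        "GrokPattern": "%{TIMESTAMP_ISO8601:ts} %{GREEDYDATA:message}",
    }
)

# The crawler tries the listed custom classifiers in order, then the built-ins.
glue.create_crawler(
    Name="my-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="my_database",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/raw-data/"}]},
    Classifiers=["my-log-classifier"],
)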

When you define a crawler, you also choose a database to contain the metadata tables it creates. If your crawler does not specify a database, your tables are placed in the default database. In addition, each table has a classification column that is filled in by the classifier that first successfully recognized the data store.


If the file that is crawled is compressed, the crawler must download it to process it. When a crawler runs, it interrogates files to determine their format and compression type and writes these properties into the Data Catalog. Some file formats (for example, Apache Parquet) enable you to compress parts of the file as it is written. For these files, the compressed data is an internal component of the file, and AWS Glue does not populate the compressionType property when it writes tables into the Data Catalog. In contrast, if an entire file is compressed by a compression algorithm (for example, gzip), then the compressionType property is populated when tables are written into the Data Catalog.
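
One way to see these properties is to read a table back from the Data Catalog. A short sketch, reusing the hypothetical names from above:

import boto3

glue = boto3.client("glue")

table = glue.get_table(DatabaseName="my_database", Name="raw_data")["Table"]
params = table.get("Parameters", {})

print(params.get("classification"))   # e.g. "json", set by the winning classifier
print(params.get("compressionType"))  # e.g. "gzip"; may be absent for formats
                                      # like Parquet with internal compression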


If your crawler runs more than once, perhaps on a schedule, it looks for new or changed files or tables in your data store. The output of the crawler includes new tables and partitions found since a previous run.
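
Scheduling such repeat runs is part of the crawler definition; a sketch with the hypothetical crawler name from above (Glue schedules use the six-field AWS cron syntax):

import boto3

glue = boto3.client("glue")

# Run every day at 02:00 UTC; each run picks up only new or changed
# files and tables relative to the previous run.
glue.update_crawler(Name="my-crawler", Schedule="cron(0 2 * * ? *)")
glue.start_crawler(Name="my-crawler")  # or kick off a run on demand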


Google Search is a fully automated search engine that uses software known as web crawlers to explore the web regularly and find pages to add to our index. In fact, the vast majority of pages listed in our results aren't manually submitted for inclusion, but are found and added automatically when our web crawlers explore the web. This document explains the stages of how Search works in the context of your website. Having this base knowledge can help you fix crawling issues, get your pages indexed, and learn how to optimize how your site appears in Google Search.


Before we get into the details of how Search works, it's important to note that Google doesn't accept payment to crawl a site more frequently, or rank it higher. If anyone tells you otherwise, they're wrong.


The first stage is finding out what pages exist on the web. There isn't a central registry of all web pages, so Google must constantly look for new and updated pages and add them to its list of known pages. This process is called "URL discovery". Some pages are known because Google has already visited them. Other pages are discovered when Google follows a link from a known page to a new page: for example, a hub page, such as a category page, links to a new blog post. Still other pages are discovered when you submit a list of pages (a sitemap) for Google to crawl.


Once Google discovers a page's URL, it may visit (or "crawl") the page to find out what's on it. We use a huge set of computers to crawl billions of pages on the web. The program that does the fetching is called Googlebot (also known as a crawler, robot, bot, or spider). Googlebot uses an algorithmic process to determine which sites to crawl, how often, and how many pages to fetch from each site. Google's crawlers are also programmed to avoid crawling a site too fast and overloading it. This mechanism is based on the responses of the site (for example, HTTP 500 errors mean "slow down").
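
Googlebot's actual rate control isn't public, but the response-driven idea can be sketched as an exponential backoff on server errors. Everything below, including the user agent string, is illustrative:

import time
import requests

def polite_fetch(url, delay=1.0, max_delay=60.0, retries=5):
    # One adaptive-politeness sketch: back off exponentially whenever the
    # server returns a 5xx response ("slow down"); otherwise keep a
    # baseline per-request delay for the host.
    for _ in range(retries):
        response = requests.get(url, headers={"User-Agent": "example-bot/0.1"},
                                timeout=10)
        if response.status_code < 500:
            time.sleep(delay)
            return response
        delay = min(delay * 2, max_delay)
        time.sleep(delay)
    return None  # give up for now; retry this URL in a later crawl cycle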


During the crawl, Google renders the page and runs any JavaScript it finds using a recent version of Chrome, similar to how your browser renders pages you visit. Rendering is important because websites often rely on JavaScript to bring content to the page, and without rendering Google might not see that content.
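
One way to observe what rendering adds is to load a page in a headless browser and inspect the resulting DOM. This sketch uses Chromium via Playwright purely as an illustration; Google uses its own rendering service:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/")
    html = page.content()  # the DOM after JavaScript ran, not the raw response
    browser.close()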


After a page is crawled, Google tries to understand what the page is about. This stage is called indexing, and it includes processing and analyzing the textual content and key content tags and attributes, such as title elements and alt attributes, images, videos, and more.
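
Those tags and attributes are ordinary parse targets. A toy extractor built on Python's standard-library HTMLParser shows the kind of signals involved; the sample markup is made up:

from html.parser import HTMLParser

class SignalExtractor(HTMLParser):
    # Collects the title element's text and every image alt attribute.
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.alts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "img":
            self.alts.extend(v for k, v in attrs if k == "alt" and v)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

extractor = SignalExtractor()
extractor.feed('<title>Bike Repair</title><img src="x.jpg" alt="a fixed-gear bike">')
print(extractor.title, extractor.alts)  # Bike Repair ['a fixed-gear bike']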


During the indexing process, Google determines whether a page is a duplicate of another page on the internet or the canonical. The canonical is the page that may be shown in search results. To select the canonical, we first group together (a process also known as clustering) the pages we found on the internet that have similar content, and then we select the one that's most representative of the group. The other pages in the group are alternate versions that may be served in different contexts, such as when the user is searching from a mobile device or looking for a very specific page from that cluster.
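
A toy version of clustering makes the idea concrete: group pages whose normalized content matches exactly, then pick one representative per group. Real canonical selection relies on similarity measures and many more signals; the shortest-URL tiebreak here is purely illustrative:

import hashlib

def pick_canonicals(pages):
    # pages: dict of url -> text content. Cluster by a hash of the
    # whitespace-normalized text, then choose the shortest URL per cluster.
    clusters = {}
    for url, text in pages.items():
        key = hashlib.sha256(" ".join(text.split()).encode()).hexdigest()
        clusters.setdefault(key, []).append(url)
    return {min(urls, key=len): urls for urls in clusters.values()}

print(pick_canonicals({
    "https://example.com/post?ref=feed": "Hello  world",
    "https://example.com/post": "Hello world",
}))  # {'https://example.com/post': [both URLs]}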


Google also collects signals about the canonical page and its contents, which may be used in the next stage, where we serve the page in search results. Some signals include the language of the page, the country the content is local to, and the usability of the page.


The collected information about the canonical page and its cluster may be stored in the Google index, a large database hosted on thousands of computers. Indexing isn't guaranteed; not every page that Google processes will be indexed.


When a user enters a query, our machines search the index for matching pages and return the results we believe are the highest quality and most relevant to the user's query. Relevancy is determined by hundreds of factors, which could include information such as the user's location, language, and device (desktop or phone). For example, searching for "bicycle repair shops" would show different results to a user in Paris than it would to a user in Hong Kong.
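
The matching step can be sketched with a toy inverted index: map each term to the documents containing it, then intersect the postings for a query. Actual serving then ranks these candidates by the factors above:

from collections import defaultdict

index = defaultdict(set)  # term -> ids of documents containing it
docs = {
    1: "bicycle repair shops in paris",
    2: "modern bicycle designs",
}
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(query):
    # Return documents that match every query term.
    terms = query.lower().split()
    if not terms:
        return []
    return sorted(set.intersection(*(index[t] for t in terms)))

print(search("bicycle repair"))  # [1]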


Based on the user's query, the search features that appear on the search results page also change. For example, searching for "bicycle repair shops" will likely show local results and no image results, while searching for "modern bicycle" is more likely to show image results but not local results. You can explore the most common UI elements of Google web search in our Visual Element gallery.




All commercial search engine crawlers begin crawling a website by downloading its robots.txt file, which contains rules about which pages search engines should or should not crawl on the website. The robots.txt file may also point to sitemaps: lists of URLs that the site wants a search engine crawler to crawl.
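
Python's standard library includes a robots.txt parser, which makes this first step easy to demonstrate; the domain and user agent below are placeholders:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

print(rp.can_fetch("example-bot", "https://example.com/private/page"))
print(rp.site_maps())  # Sitemap: lines declared in robots.txt, if any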


Search engine crawlers use a number of algorithms and rules to determine how frequently a page should be re-crawled and how many pages on a site should be indexed. For example, a page that changes on a regular basis may be crawled more frequently than one that is rarely modified.
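
One common heuristic of this kind, shown as a sketch rather than any engine's actual algorithm, is to halve the revisit interval when a page has changed since the last visit and double it when it hasn't:

def next_interval(hours, changed, lo=1.0, hi=24.0 * 30):
    # Halve the interval for a page that changed, double it otherwise,
    # within sane bounds.
    return max(lo, hours / 2) if changed else min(hi, hours * 2)

interval = 24.0
for changed in [True, True, False, False]:
    interval = next_interval(interval, changed)
    print(interval)  # 12.0, 6.0, 12.0, 24.0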


If a URL points to a non-text file type such as an image, video, or audio file, search engines will typically not be able to read the content of the file beyond the associated filename and metadata.


Crawlers discover new pages by re-crawling existing pages they already know about, then extracting the links to other pages to find new URLs. These new URLs are added to the crawl queue so that they can be downloaded at a later date.
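
In its simplest form, that discovery loop is a queue of URLs plus a link extractor. The sketch below omits the politeness, robots.txt checks, and URL normalization a real crawler needs, and caps itself so it terminates:

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

seen = {"https://example.com/"}
queue = deque(seen)
while queue and len(seen) < 50:  # cap the sketch at 50 discovered URLs
    url = queue.popleft()
    html = urlopen(url).read().decode("utf-8", errors="replace")
    extractor = LinkExtractor()
    extractor.feed(html)
    for href in extractor.links:
        absolute = urljoin(url, href)
        if absolute not in seen:
            seen.add(absolute)
            queue.append(absolute)  # download at a later date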


Sitemaps contain sets of URLs and can be created by a website to provide search engines with a list of pages to be crawled. They can help search engines find content hidden deep within a website and give webmasters better control over, and insight into, which areas of the site are indexed and how often they are recrawled.
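
Reading a sitemap is a small amount of XML handling. The sketch below pulls every <loc> URL from a hypothetical sitemap so they can be added to the crawl queue:

import xml.etree.ElementTree as ET
from urllib.request import urlopen

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
tree = ET.parse(urlopen("https://example.com/sitemap.xml"))
urls = [loc.text for loc in tree.findall(".//sm:loc", NS)]
print(len(urls), "URLs to add to the crawl queue")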


Alternatively, individual page submissions can often be made directly to search engines via their respective interfaces. This manual method of page discovery can be used when new content is published on the site, or when changes have taken place and you want to minimize the time it takes for search engines to see the changed content.


Google states that you should use XML sitemaps for large URL volumes, but the manual submission method is sometimes convenient when submitting a handful of pages. It is also important to note that Google limits webmasters to 10 URL submissions per day.




As we mentioned in Chapter 1, search engines are answer machines. They exist to discover, understand, and organize the internet's content in order to offer the most relevant results to the questions searchers are asking.


In order to show up in search results, your content needs to first be visible to search engines. It's arguably the most important piece of the SEO puzzle: if your site can't be found, there's no way you'll ever show up in the SERPs (search engine results pages).


When someone performs a search, search engines scour their index for highly relevant content and then order that content in the hopes of solving the searcher's query. This ordering of search results by relevance is known as ranking. In general, you can assume that the higher a website ranks, the more relevant the search engine believes that site is to the query.


One way to check your indexed pages is "site:yourdomain.com", an advanced search operator. Head to Google and type "site:yourdomain.com" into the search bar. This returns the results Google has in its index for the specified site.


For more accurate results, monitor and use the Index Coverage report in Google Search Console. You can sign up for a free Google Search Console account if you don't currently have one. With this tool, you can submit sitemaps for your site and monitor how many submitted pages have actually been added to Google's index, among other things.
