News

Martin Splitt explains how Google selects a canonical page

Google users have long asked how the search engine detects duplicate pages among the billions of pages created every day. Martin Splitt, a Developer Advocate at Google, has shared some notes on how canonical pages are detected. He explained how Google filters duplicate or fraudulent pages out of its search results.

He also described how Google weighs at least twenty different signals to identify a canonical page, and how it uses machine learning to do so.

Martin explains that Google first collects signals for every newly created page. In the next step, its systems detect the duplicate pages.

Duplicates are detected and clustered together, so that every page in a cluster is known to be a duplicate of the others. The next task is to identify the leader page of each cluster, which becomes the canonical.

Martin also described how duplicates are detected. The content of each page is first reduced to a hash or checksum, and the checksums are then compared. A checksum is a compact digest of the content, which Martin likened to a fingerprint. Comparing these digests is much easier than comparing the full content.
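The fingerprint idea can be illustrated with a minimal Python sketch. It is not Google's implementation: the choice of SHA-256, the whitespace and case normalization, and the function names `fingerprint` and `cluster_exact_duplicates` are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def fingerprint(text: str) -> str:
    """Reduce page content to a fixed-length checksum (hypothetical helper)."""
    # Assumption: collapse whitespace and ignore case before hashing.
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cluster_exact_duplicates(pages: dict[str, str]) -> list[list[str]]:
    """Group URLs whose content produces the same checksum."""
    clusters = defaultdict(list)
    for url, content in pages.items():
        clusters[fingerprint(content)].append(url)
    # Only groups with more than one URL are duplicate clusters.
    return [sorted(urls) for urls in clusters.values() if len(urls) > 1]


# Made-up example pages.
pages = {
    "http://example.com/a": "Hello   World",
    "https://example.com/a": "hello world",
    "https://example.com/b": "Something else",
}
duplicate_clusters = cluster_exact_duplicates(pages)
```

Here the two `/a` URLs end up in one cluster because their normalized content hashes to the same value, while `/b` stays on its own. Only the short checksums are compared, never the full page text.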

This process can catch both exact-duplicate and near-duplicate pages. Comparing those checksums is what allows similar content to be eliminated.
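An ordinary cryptographic hash only matches exact copies, so catching near-duplicates needs a comparison that tolerates small differences. The article does not say which technique Google uses. The sketch below uses word shingles and Jaccard similarity, one common approach, purely as an illustration. The shingle size and the 0.7 threshold are arbitrary assumptions.

```python
def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Return the set of overlapping word sequences of the given size."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: str, b: str, size: int = 3) -> float:
    """Share of shingles the two texts have in common (0.0 to 1.0)."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def is_near_duplicate(a: str, b: str, threshold: float = 0.7) -> bool:
    """Assumed threshold: treat pages at or above it as near-duplicates."""
    return jaccard(a, b) >= threshold


# Made-up example texts that differ in a single word.
page_a = "the quick brown fox jumps over the lazy dog"
page_b = "the quick brown fox jumps over the lazy cat"
similarity = jaccard(page_a, page_b)  # 6 shared shingles out of 8 -> 0.75
```

The two example texts differ by one word, share six of their eight distinct shingles, and are therefore flagged as near-duplicates under the assumed threshold.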

Reducing a cluster to a single page is not so easy. Sometimes it is hard even for a human to choose which page should appear in search results. This is where the signals come in. They include whether the URL uses HTTPS, whether the page is listed in a sitemap, whether a redirect is present, and so on.

Machine learning is important in this step. Applying all of these signals to the clusters correctly and working out how much weight each should carry is very hard for humans. Manually adjusting signal weights is a nightmare for the developers, and doing it by hand takes a long time.
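To make the idea of weighted signals concrete, here is a small sketch of how a leader page might be picked from a cluster. The three signals come from the article, but the `Candidate` class, the weight values, and the scoring formula are hypothetical. In the process Martin describes, the weights are learned by a machine learning system rather than typed in by hand as they are here.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One page in a duplicate cluster (hypothetical structure)."""
    url: str
    is_https: bool
    in_sitemap: bool
    is_redirect_target: bool


# Assumed weights for illustration only; real weights are not public.
WEIGHTS = {"is_https": 1.0, "in_sitemap": 0.5, "is_redirect_target": 2.0}


def score(candidate: Candidate) -> float:
    """Weighted sum of the signals that are present on the candidate."""
    return sum(w * float(getattr(candidate, name)) for name, w in WEIGHTS.items())


def pick_canonical(cluster: list[Candidate]) -> Candidate:
    """Choose the highest-scoring page as the leader of the cluster."""
    return max(cluster, key=score)


# Made-up cluster of two duplicate pages.
cluster = [
    Candidate("http://example.com/a", False, False, False),
    Candidate("https://example.com/a", True, True, False),
]
canonical = pick_canonical(cluster)
```

With these made-up weights the HTTPS page listed in the sitemap wins. Changing a single weight can change the winner across many clusters at once, which is why tuning the weights by hand is so painful.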

Martin also said that users don’t like to see the same content repeated in search results. Storage space is not infinite either. That is why Google has to perform canonicalization.
