Saturday, March 3, 2012

How Do Search Engines Work?


It is generally assumed that search engines start their crawl from seed sites. These are websites that have been manually identified as highly authoritative and trusted, such as yahoo.com, Microsoft.com and adobe.com. Search engines follow the links on these sites to find other web documents (web pages, images, videos, Word documents, PDF files etc.) on the web. When a web document is found, search engines crawl it (i.e. parse its code) and then, if it is worth indexing, index it (i.e. store certain parts of the document on hard drives located at data centers around the world). There are also different levels of indexing. Your document can be stored either permanently or temporarily in the main index. It can also be stored in the supplemental (or secondary) index or in a specialized index (such as blog search, image search or product search).
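To make the idea of link-following concrete, here is a minimal sketch of discovery from seed sites. It assumes a tiny in-memory "web" instead of real HTTP requests; the domain names, the TOY_WEB dictionary and the crawl function are all made up for illustration and are not how any particular search engine is implemented.

```python
from collections import deque

# Hypothetical toy web: URL -> list of URLs that page links to.
TOY_WEB = {
    "seed.example": ["a.example", "b.example"],
    "a.example": ["b.example", "c.example"],
    "b.example": ["seed.example"],
    "c.example": [],
    "orphan.example": ["a.example"],  # no page links here, so it is never found
}

def crawl(seeds, web):
    """Breadth-first discovery of pages reachable from the seed sites."""
    discovered = []          # pages in the order they were first visited
    seen = set(seeds)        # avoids visiting the same page twice
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        discovered.append(url)
        # "Parsing" a page here just means reading its outgoing links.
        for link in web.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return discovered

print(crawl(["seed.example"], TOY_WEB))
# ['seed.example', 'a.example', 'b.example', 'c.example']
```

Note that orphan.example is never discovered, because no crawled page links to it. That is the practical consequence of link-based discovery.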

Here is one thing to keep in mind: crawling and indexing are different processes. A search engine that crawls a web document will not necessarily index it. Search engines don’t index spam documents, or documents that are duplicates of, or very similar to, other documents. They are believed to use a link analysis technique known as ‘TrustRank’ to separate useful web pages from spam. Search engines are also thought to store semantically connected documents together in the index for faster retrieval later, an approach associated with ‘Latent Semantic Indexing’ (LSI). I will talk about semantic connectivity in detail later in the post.
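The "crawled but not indexed" decision for near-duplicates can be sketched with word shingles and Jaccard similarity. This is only one common way to compare documents, chosen here for illustration. The function names, the shingle size of 3 and the 0.8 threshold are assumptions, not values any search engine is known to use.

```python
def shingles(text, k=3):
    """Return the set of k-word sequences (shingles) in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Share of shingles the two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def should_index(candidate, indexed_texts, threshold=0.8):
    """Index a crawled document only if it is not too similar to one already indexed."""
    cand = shingles(candidate)
    return all(jaccard(cand, shingles(text)) < threshold for text in indexed_texts)

already_indexed = ["the quick brown fox jumps over the lazy dog"]
print(should_index("the quick brown fox jumps over the lazy dog", already_indexed))   # False
print(should_index("search engines crawl and index web documents", already_indexed))  # True
```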

Now, when a user makes a search query, search engines first retrieve from the index all the documents that are relevant to the query and then sort them in decreasing order of importance. Both the relevance and the importance of a web document are determined through a method known as ‘Document Analysis’, which consists of:
1. Semantic Analysis
2. Link Analysis (or Citation Analysis)
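The two-step flow described above (retrieve what is relevant, then sort by importance) can be sketched with a small inverted index. The documents, the importance scores and the function names are invented for this example. Relevance is reduced to "contains every query term", which is far cruder than real semantic analysis.

```python
from collections import defaultdict

# Hypothetical documents and precomputed importance scores (e.g. from link analysis).
DOCS = {
    "d1": "taj mahal agra travel guide",
    "d2": "agra hotels near taj mahal",
    "d3": "new york statue of liberty tickets",
}
IMPORTANCE = {"d1": 0.9, "d2": 0.4, "d3": 0.7}

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(query, index, importance):
    terms = query.lower().split()
    if not terms:
        return []
    # Step 1 (relevance): keep documents that contain every query term.
    matches = set.intersection(*(index.get(term, set()) for term in terms))
    # Step 2 (importance): sort the matches from most to least important.
    return sorted(matches, key=lambda doc_id: importance[doc_id], reverse=True)

INDEX = build_inverted_index(DOCS)
print(search("taj mahal", INDEX, IMPORTANCE))  # ['d1', 'd2']
```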

Semantic Analysis
It is done to determine the semantic connectivity between words or phrases, i.e. how words and phrases are generally associated with each other. For example, ‘Statue of Liberty’ is commonly associated with ‘New York’. Similarly, ‘Agra’ is commonly associated with ‘Taj Mahal’. Search engines use different methods to determine semantic connectivity:
i. They use their own dictionaries and thesauri.
ii. They use Fuzzy Set Theory, i.e. search engines measure how often words/phrases are used together or in close proximity, and in what context they are used together.
iii. Topic Modeling: through this method, search engines try to resolve mathematically the relationships between words or phrases and to judge whether a set of content is relevant to a search query. LDA (Latent Dirichlet Allocation), LSI (Latent Semantic Indexing), LSA (Latent Semantic Analysis), pLSA (probabilistic Latent Semantic Analysis) etc. are all different ways to implement topic modeling.
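As a rough illustration of measuring how words are "used together or in close proximity", here is a small co-occurrence counter. It is a deliberately simple stand-in and not an implementation of fuzzy set theory, LDA or LSA. The corpus, the window size of 5 words and the function names are assumptions made up for this sketch.

```python
from collections import Counter

# Hypothetical mini corpus.
CORPUS = [
    "the taj mahal is in agra",
    "agra is famous for the taj mahal",
    "the statue of liberty is in new york",
]

def cooccurrence_counts(sentences, window=5):
    """Count how often two different words appear within `window` words of each other."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            # Look only ahead, so each pair of positions is counted once.
            for other in words[i + 1:i + 1 + window]:
                if other != word:
                    counts[frozenset((word, other))] += 1
    return counts

def association(counts, a, b):
    """Number of close co-occurrences of words a and b (0 if never seen together)."""
    return counts[frozenset((a, b))]

COUNTS = cooccurrence_counts(CORPUS)
print(association(COUNTS, "agra", "taj"))      # 2
print(association(COUNTS, "agra", "liberty"))  # 0
```

In this toy corpus ‘agra’ and ‘taj’ come out as associated while ‘agra’ and ‘liberty’ do not, which mirrors the Agra/Taj Mahal example above.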

Link (or Citation) Analysis
Search engines perform link analysis to measure the quantity and quality of the inbound links (both internal and external) and citations that point to a web document. It is also used to separate useful web pages from spam (TrustRank).
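Here is a minimal sketch of how trust could flow through links from a trusted seed set. It is loosely modeled on the published TrustRank idea (a PageRank-style iteration whose starting score sits entirely on the seeds) and is heavily simplified. The link graph, the damping factor of 0.85, the iteration count and the function name are all assumptions for illustration, not how a production search engine works.

```python
# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "trusted.example": ["good.example"],
    "good.example": ["trusted.example", "ok.example"],
    "ok.example": [],
    "spam1.example": ["spam2.example"],   # spam pages only link to each other
    "spam2.example": ["spam1.example"],
}

def trust_scores(links, seeds, damping=0.85, iterations=50):
    """Propagate trust from the seed pages along outgoing links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    # All initial trust sits on the manually chosen seed pages.
    base = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    score = dict(base)
    for _ in range(iterations):
        nxt = {p: (1 - damping) * base[p] for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if not targets:
                continue  # page with no outlinks: its trust is not passed on in this sketch
            share = damping * score[page] / len(targets)
            for target in targets:
                nxt[target] += share
        score = nxt
    return score

scores = trust_scores(LINKS, seeds={"trusted.example"})
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

Because no trusted page links to the two spam pages, they end up with a score of zero however many links they exchange with each other. Pages reachable from the seed receive a positive score.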
