Simply put, word segmentation cuts a sentence into N words. It is divided into Chinese word segmentation and English word segmentation. The search engine has its own dictionary database containing a large number of words, and it segments text by checking it against that dictionary. At this stage it also removes useless words, that is, words that carry no search value.
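Dictionary-based segmentation with useless-word removal can be sketched roughly as follows. This is a minimal illustration, not how any particular search engine does it: it assumes forward maximum matching as the lookup strategy, and the names `segment_chinese`, `segment_english`, `DICTIONARY`, and `STOP_WORDS`, along with the toy word lists, are hypothetical.

```python
# Toy dictionary and useless-word list (hypothetical sample data).
DICTIONARY = {"搜索", "引擎", "搜索引擎", "优化", "网站"}
STOP_WORDS = {"的", "了", "the", "and"}


def segment_chinese(text, dictionary, stop_words, max_len=4):
    """Forward maximum matching: at each position take the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    # Drop useless words after segmentation.
    return [w for w in words if w not in stop_words]


def segment_english(text, stop_words):
    """English words are already separated by spaces, so splitting is enough."""
    return [w for w in text.lower().split() if w not in stop_words]


print(segment_chinese("搜索引擎的优化", DICTIONARY, STOP_WORDS))
print(segment_english("The search engine and the spider", STOP_WORDS))
```

Matching the longest word first keeps "搜索引擎" as one word instead of splitting it into "搜索" and "引擎"; a real dictionary would be far larger than this sample.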
After structuring, the page still contains content the search engine does not need, such as navigation bar text, bottom menu text, and copyright information. The search engine needs only the main content, so at this point it denoises the structured page. Simply put, denoising deletes all text other than the main content, for example the menu text and the copyright text at the bottom.
What does structuring a web page mean? As we know, a web page is composed of HTML, and what the search engine spider crawls back is also the page's HTML code. Simply put, structuring deletes the HTML code and leaves only the content. Figure 1 shows the page before structuring, and Figure 2 shows the page after structuring.
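As a rough illustration of structuring, the sketch below strips the HTML code from a page and keeps only the text blocks. It assumes Python's standard `html.parser` module; the class `TextExtractor`, the function `structure`, and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text blocks and skips the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep non-empty text that is not inside a skipped element.
        if text and self._skip_depth == 0:
            self.parts.append(text)


def structure(html):
    """Return the text blocks of a page with all HTML code removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.parts


# Hypothetical sample page.
page = (
    "<html><head><title>Demo</title><style>p{color:red}</style></head>"
    "<body><div id='nav'>Home | About</div><p>Main article text.</p>"
    "<div id='footer'>Copyright Example Site</div></body></html>"
)
print(structure(page))
```

The navigation and copyright text are still present after this step, which is why denoising is a separate stage.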
The data analysis system is the second system in the search engine's overall process. It follows the spider system and mainly handles the pages the spider crawls back. Today, Jack Bauer will explain in detail how the data analysis system works, along with several important knowledge points about it. As just mentioned, the data analysis system mainly analyzes the pages the spider crawls back. So how does it analyze them? It mainly covers the following points.
Page duplicate check
The page duplicate check is easy to understand: after the spider has crawled all the pages of your site, the search engine compares each page with the other crawled pages to see whether any content is duplicated. If it is, the duplicate is deleted.
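One way to picture the duplicate check is the sketch below. The comparison method is an assumption, since the text does not say how pages are compared: it uses word shingles with Jaccard similarity and a fixed threshold. The names `shingles`, `jaccard`, and `deduplicate`, the threshold value, and the sample pages are hypothetical.

```python
def shingles(text, k=3):
    """Return the set of k-word sequences (shingles) in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Share of shingles two pages have in common (1.0 means identical sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(pages, threshold=0.8):
    """Keep the first page seen and drop later pages that are too similar.

    pages maps a URL to the page's extracted text.
    """
    kept = {}
    for url, text in pages.items():
        signature = shingles(text)
        if all(jaccard(signature, other) < threshold for other in kept.values()):
            kept[url] = signature
    return list(kept)


# Hypothetical sample pages: the second one duplicates the first.
pages = {
    "/a": "search engines crawl pages and analyze their content",
    "/a?ref=1": "search engines crawl pages and analyze their content",
    "/b": "word segmentation cuts a sentence into words",
}
print(deduplicate(pages))
```

Comparing every page against every kept page is fine for a small example but would be too slow at search engine scale.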
How does the data analysis system determine which text is menu text and which is copyright information?
It is actually very simple: by comparison. Across a site's content pages, only the main content differs, and everything else is almost the same. Every page has a navigation bar with the same text, and the same is true of the copyright notice. The system also analyzes the HTML source.
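The comparison idea can be sketched as follows: a text block that shows up on most pages of a site is treated as noise and removed. This is a simplified assumption about how the comparison might work, and it does not model the HTML source analysis mentioned above. The function `denoise`, the `max_share` cutoff, and the sample pages are hypothetical.

```python
from collections import Counter


def denoise(pages, max_share=0.5):
    """Remove text blocks that appear on more than max_share of the pages.

    pages is a list of pages, each a list of text blocks.
    """
    counts = Counter()
    for blocks in pages:
        # Count each block once per page, even if it repeats within the page.
        counts.update(set(blocks))
    limit = max_share * len(pages)
    return [[b for b in blocks if counts[b] <= limit] for blocks in pages]


# Hypothetical site: navigation and copyright text repeat on every page.
site = [
    ["Home | About", "Article one text.", "Copyright Example Site"],
    ["Home | About", "Article two text.", "Copyright Example Site"],
    ["Home | About", "Article three text.", "Copyright Example Site"],
]
print(denoise(site))
```

Only the article text survives, because the navigation and copyright blocks appear on all three pages.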
Web page content denoising
Web page structuring