Recognition of common areas in a web page using visual information: A possible application in a page classification
2002
Conference paper (Published version)
Abstract
Extracting and processing information from Web pages is an important task in many areas, such as constructing search engines, information retrieval, and data mining from the Web. A common approach in the extraction process is to represent a page as a "bag of words" and then to perform additional processing on such a flat representation. In this paper we propose a new, hierarchical representation that includes browser screen coordinates for every HTML object in a page. Using visual information, one is able to define heuristics for the recognition of common page areas such as the header, left and right menus, footer, and center of a page. We show in initial experiments that, using our heuristics, the defined objects are recognized properly in 73% of cases. Finally, we show that a Naive Bayes classifier, taking into account the proposed representation, clearly outperforms the same classifier using only information about the content of documents.
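The abstract does not give the heuristics themselves, so the following is only a minimal sketch of how screen coordinates could be mapped to the five areas it names. The `Box` type, the `classify_area` function, and all threshold fractions are hypothetical choices made for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Rendered bounding box of one HTML object, in browser screen pixels.

    Hypothetical structure: the paper's actual representation is hierarchical
    and is not reproduced here.
    """
    tag: str
    x: float
    y: float
    width: float
    height: float


def classify_area(box, page_width, page_height,
                  top_frac=0.15, bottom_frac=0.15, side_frac=0.25):
    """Assign a box to header, footer, left menu, right menu, or center.

    The fractions are illustrative assumptions, not values from the paper.
    """
    right = box.x + box.width
    bottom = box.y + box.height
    # Entirely inside the top strip of the page.
    if bottom <= top_frac * page_height:
        return "header"
    # Starts inside the bottom strip of the page.
    if box.y >= (1 - bottom_frac) * page_height:
        return "footer"
    # Entirely inside the left strip.
    if right <= side_frac * page_width:
        return "left menu"
    # Starts inside the right strip.
    if box.x >= (1 - side_frac) * page_width:
        return "right menu"
    return "center"


if __name__ == "__main__":
    # Example page of 1000 x 2000 pixels with a banner and a side list.
    banner = Box("img", 0, 0, 1000, 100)
    links = Box("ul", 0, 400, 200, 800)
    print(classify_area(banner, 1000, 2000))
    print(classify_area(links, 1000, 2000))
```

A rule set of this kind only looks at where an object is rendered, which is the information a flat "bag of words" representation discards.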
Source:
2002 IEEE International Conference on Data Mining, Proceedings, 2002, 250-257
Institution/group
GraFar
TY  - CONF
AU  - Kovačević, Miloš
AU  - Dilligenti, M
AU  - Gori, M
AU  - Milutinović, V
PY  - 2002
UR  - https://grafar.grf.bg.ac.rs/handle/123456789/33
C3  - 2002 IEEE International Conference on Data Mining, Proceedings
T1  - Recognition of common areas in a web page using visual information: A possible application in a page classification
EP  - 257
SP  - 250
DO  - 10.1109/ICDM.2002.1183910
ER  -
@conference{
  author = "Kovačević, Miloš and Dilligenti, M and Gori, M and Milutinović, V",
  year = "2002",
  booktitle = "2002 IEEE International Conference on Data Mining, Proceedings",
  title = "Recognition of common areas in a web page using visual information: A possible application in a page classification",
  pages = "250-257",
  doi = "10.1109/ICDM.2002.1183910"
}
Kovačević, M., Dilligenti, M., Gori, M., & Milutinović, V. (2002). Recognition of common areas in a web page using visual information: A possible application in a page classification. In 2002 IEEE International Conference on Data Mining, Proceedings, 250-257. https://doi.org/10.1109/ICDM.2002.1183910
Kovačević M, Dilligenti M, Gori M, Milutinović V. Recognition of common areas in a web page using visual information: A possible application in a page classification. In 2002 IEEE International Conference on Data Mining, Proceedings. 2002:250-257. doi:10.1109/ICDM.2002.1183910
Kovačević, Miloš, Dilligenti, M, Gori, M, Milutinović, V, "Recognition of common areas in a web page using visual information: A possible application in a page classification" in 2002 IEEE International Conference on Data Mining, Proceedings (2002):250-257, https://doi.org/10.1109/ICDM.2002.1183910