Recognition of common areas in a web page using a visualization approach
Abstract
Extracting and processing information from web pages is an important task in many areas, such as constructing search engines, information retrieval, and data mining from the Web. A common approach in the extraction process is to represent a page as a "bag of words" and then to perform additional processing on this flat representation. In this paper we propose a new, hierarchical representation that includes the browser screen coordinates for every HTML object in a page. Using this spatial information, one can define heuristics for recognizing common page areas such as the header, left and right menus, footer, and center of a page. Initial experiments show that, using our heuristics, the defined objects are recognized properly in 73% of cases.
Source:
Artificial Intelligence: Methodology, Systems and Applications, Proceedings, 2002, 2443, 203-212
Institution/group:
GraFar
TY  - JOUR
AU  - Kovačević, Miloš
AU  - Dilligenti, M
AU  - Gori, M
AU  - Milutinović, V
PY  - 2002
UR  - https://grafar.grf.bg.ac.rs/handle/123456789/37
T2  - Artificial Intelligence: Methodology, Systems and Applications, Proceedings
T1  - Recognition of common areas in a web page using a visualization approach
EP  - 212
SP  - 203
VL  - 2443
UR  - https://hdl.handle.net/21.15107/rcub_grafar_37
ER  -
@article{
author = "Kovačević, Miloš and Dilligenti, M and Gori, M and Milutinović, V",
year = "2002",
journal = "Artificial Intelligence: Methodology, Systems and Applications, Proceedings",
title = "Recognition of common areas in a web page using a visualization approach",
pages = "203-212",
volume = "2443",
url = "https://hdl.handle.net/21.15107/rcub_grafar_37"
}
Kovačević, M., Dilligenti, M., Gori, M., & Milutinović, V. (2002). Recognition of common areas in a web page using a visualization approach. In Artificial Intelligence: Methodology, Systems and Applications, Proceedings, 2443, 203-212. https://hdl.handle.net/21.15107/rcub_grafar_37
Kovačević M, Dilligenti M, Gori M, Milutinović V. Recognition of common areas in a web page using a visualization approach. In: Artificial Intelligence: Methodology, Systems and Applications, Proceedings. 2002;2443:203-212. https://hdl.handle.net/21.15107/rcub_grafar_37
Kovačević, Miloš, Dilligenti, M, Gori, M, Milutinović, V, "Recognition of common areas in a web page using a visualization approach" in Artificial Intelligence: Methodology, Systems and Applications, Proceedings, 2443 (2002):203-212, https://hdl.handle.net/21.15107/rcub_grafar_37