Abstract:
Existing multi-record webpage extraction methods usually analyze the document object model (DOM) tree longitudinally as a whole. The computed structural similarity is always low, so record regions cannot be identified correctly. Unlike previous work, the proposed method, data record extraction based on DOM tree hierarchical feature (DEBHF), analyzes the DOM tree transversely by distinguishing the different roles that nodes play at different levels. The problem of searching for similar sub-trees is thereby converted into the problem of searching for similar sub-blocks within data blocks. Finally, a two-way search for non-overlapping, repeated sub-blocks is used to segment the record regions. Experimental results show that the proposed approach can handle webpages whose records existing methods cannot extract, and extraction results on different data sources demonstrate its effectiveness. © 2015, Journal of Pattern Recognition and Artificial Intelligence. All rights reserved.
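To make the idea of searching for non-overlapping, repeated sub-blocks concrete, the following is a minimal sketch and not the DEBHF algorithm itself, whose details the abstract does not give. It assumes a data block has already been flattened into a list of child tag names, uses exact equality where the paper presumably uses a similarity measure, and interprets "two-way" as extending a seed block both forward and backward. The function name `find_repeated_blocks` and its parameters are hypothetical.

```python
from typing import List, Optional, Tuple


def find_repeated_blocks(
    tags: List[str], min_repeats: int = 2
) -> Optional[Tuple[int, int, int]]:
    """Return (start, block_len, repeats) for the run of consecutive,
    non-overlapping identical sub-blocks that covers the most items,
    or None if no sub-block repeats at least `min_repeats` times.

    Assumption: exact tag equality stands in for structural similarity.
    """
    n = len(tags)
    best: Optional[Tuple[int, int, int]] = None
    best_covered = 0
    # Try short sub-blocks first so ties favour the smallest repeating unit.
    for block_len in range(1, n // min_repeats + 1):
        for seed in range(0, n - block_len + 1):
            block = tags[seed:seed + block_len]
            # Forward pass: extend while the next chunk equals the seed block.
            end = seed + block_len
            while tags[end:end + block_len] == block:
                end += block_len
            # Backward pass: extend while the previous chunk equals the seed block.
            start = seed
            while start - block_len >= 0 and tags[start - block_len:start] == block:
                start -= block_len
            covered = end - start
            repeats = covered // block_len
            if repeats >= min_repeats and covered > best_covered:
                best = (start, block_len, repeats)
                best_covered = covered
    return best


# Hypothetical flattened data block: a heading, three records, a footer.
example = ["h2", "img", "p", "a", "img", "p", "a", "img", "p", "a", "footer"]
result = find_repeated_blocks(example)
```

For the example list, the sketch reports a sub-block of length 3 starting at index 1 and repeated 3 times, which corresponds to three candidate records of the form img, p, a.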
Source:
Pattern Recognition and Artificial Intelligence
ISSN: 1003-6059
CN: 34-1089/TP
Year: 2015
Issue: 2
Volume: 28
Page: 125-131