Multirecord webpage extraction based on DOM tree hierarchical feature

Author：

Chen, Q.-L. [1] | Liao, X.-W. [2] (Scholars：廖祥文) | Wei, J.-J. [3] | Chen, G.-L. [4] (Scholars：陈国龙)

Indexed by：

Scopus PKU CSCD

Abstract：

Existing multirecord webpage extraction methods usually analyze the document object model (DOM) tree longitudinally as a whole. The computed structural similarity is always low, and therefore record regions cannot be identified correctly. In contrast to previous work, a method named data record extraction based on DOM tree hierarchical feature (DEBHF) is proposed, which analyzes the DOM tree transversely by distinguishing the different roles of nodes at different levels. The problem of searching for similar sub-trees is thus converted into the problem of searching for similar sub-blocks within data blocks. Finally, a two-way search for non-overlapping, repeated sub-blocks is adopted to segment the record regions. Experimental results show that the proposed approach can handle webpages that the existing methods cannot, and the extraction results on different data sources demonstrate its effectiveness. © 2015, Journal of Pattern Recognition and Artificial Intelligence. All rights reserved.

Keyword：

Extraction algorithm; Information extraction; Multirecord webpage

Affiliation：

[ 1 ] [Chen, Q.-L.] College of Mathematics and Computer Science, Fuzhou University, Fuzhou, 350108, China
[ 2 ] [Liao, X.-W.] College of Mathematics and Computer Science, Fuzhou University, Fuzhou, 350108, China
[ 3 ] [Wei, J.-J.] College of Mathematics and Computer Science, Fuzhou University, Fuzhou, 350108, China
[ 4 ] [Chen, G.-L.] College of Mathematics and Computer Science, Fuzhou University, Fuzhou, 350108, China

Reprint Author's Address：

廖祥文
[Liao, X.-W.] College of Mathematics and Computer Science, Fuzhou University, China

Source：

Pattern Recognition and Artificial Intelligence

ISSN： 1003-6059

CN： 34-1089/TP

Year： 2015

Issue： 2

Volume： 28

Page： 125-131

Affiliated Colleges：

School of Mathematics and Statistics; data not explicitly attributed to a specific college or department
