TY - GEN
T1 - EPCI
T2 - 16th International World Wide Web Conference, WWW2007
AU - Tashiro, Takashi
AU - Ueda, Takanori
AU - Hori, Taisuke
AU - Hirate, Yu
AU - Yamana, Hayato
PY - 2007
Y1 - 2007
N2 - In this paper, we propose EPCI, a new system that extracts potentially copyright-infringing texts from the Web. EPCI works in four steps: (1) generating a set of queries from a given copyright-reserved seed text, (2) submitting every query to a search engine API, (3) gathering the search result Web pages in rank order, starting from the top, until the similarity between the seed text and the result pages falls below a given threshold, and (4) merging all the gathered pages and re-ranking them by their similarity. Our experiment with 40 seed texts shows that EPCI extracts 132 potentially copyright-infringing Web pages per seed text with 94% precision on average.
AB - In this paper, we propose EPCI, a new system that extracts potentially copyright-infringing texts from the Web. EPCI works in four steps: (1) generating a set of queries from a given copyright-reserved seed text, (2) submitting every query to a search engine API, (3) gathering the search result Web pages in rank order, starting from the top, until the similarity between the seed text and the result pages falls below a given threshold, and (4) merging all the gathered pages and re-ranking them by their similarity. Our experiment with 40 seed texts shows that EPCI extracts 132 potentially copyright-infringing Web pages per seed text with 94% precision on average.
KW - Copy detection
KW - Information retrieval
UR - http://www.scopus.com/inward/record.url?scp=35348850182&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=35348850182&partnerID=8YFLogxK
U2 - 10.1145/1242572.1242740
DO - 10.1145/1242572.1242740
M3 - Conference contribution
AN - SCOPUS:35348850182
SN - 1595936548
SN - 9781595936547
T3 - 16th International World Wide Web Conference, WWW2007
SP - 1151
EP - 1152
BT - 16th International World Wide Web Conference, WWW2007
Y2 - 8 May 2007 through 12 May 2007
ER -