EPCI: Extracting potentially copyright infringement texts from the web

Takashi Tashiro*, Takanori Ueda, Taisuke Hori, Yu Hirate, Hayato Yamana

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Citations (Scopus)

Abstract

In this paper, we propose EPCI, a new system that extracts potentially copyright-infringing texts from the Web. EPCI works in four steps: (1) it generates a set of queries from a given copyrighted seed text; (2) it submits every query to a search engine API; (3) for each query, it gathers the result Web pages in rank order, starting from the top, until the similarity between the seed text and a result page falls below a given threshold; and (4) it merges all gathered pages and re-ranks them by their similarity to the seed text. In an experiment with 40 seed texts, EPCI extracted, on average, 132 potentially copyright-infringing Web pages per seed text with 94% precision.
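
The four steps in the abstract can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not say how queries are generated or how similarity is measured, so the fixed-width query windows (`make_queries`), the word-trigram overlap measure (`similarity`), the default threshold, and the injected `search` callable that stands in for a search engine API are all assumptions made for this example.

```python
from typing import Callable, Iterable, List, Set, Tuple

def shingles(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def similarity(seed: str, page: str, n: int = 3) -> float:
    """Assumed measure: fraction of the seed's n-grams that also occur in the page."""
    seed_grams = shingles(seed, n)
    if not seed_grams:
        return 0.0
    return len(seed_grams & shingles(page, n)) / len(seed_grams)

def make_queries(seed: str, width: int = 8) -> List[str]:
    """Step 1 (assumed strategy): cut the seed text into consecutive word windows."""
    words = seed.split()
    return [" ".join(words[i:i + width]) for i in range(0, len(words), width)]

def epci_sketch(
    seed: str,
    search: Callable[[str], Iterable[Tuple[str, str]]],
    threshold: float = 0.3,
) -> List[Tuple[str, float]]:
    """Run steps 1-4; `search` yields (url, page_text) pairs in rank order."""
    best = {}
    for query in make_queries(seed):              # step 1
        for url, page_text in search(query):      # step 2
            score = similarity(seed, page_text)
            if score < threshold:                 # step 3: stop at first low-similarity page
                break
            best[url] = max(score, best.get(url, 0.0))
    # Step 4: merge results across queries and re-rank by similarity.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)
```

Passing `search` in as a parameter keeps the sketch independent of any particular search engine API. Note that, because gathering stops at the first page below the threshold, a similar page ranked lower in the same result list is missed for that query unless another query surfaces it.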

Original language: English
Title of host publication: 16th International World Wide Web Conference, WWW2007
Pages: 1151-1152
Number of pages: 2
Publication status: Published - 2007
Event: 16th International World Wide Web Conference, WWW2007 - Banff, AB, Canada
Duration: 2007 May 8 to 2007 May 12

Publication series

Name: 16th International World Wide Web Conference, WWW2007

Conference

Conference: 16th International World Wide Web Conference, WWW2007
Country/Territory: Canada
City: Banff, AB
Period: 07/5/8 to 07/5/12

Keywords

  • Copy detection
  • Information retrieval

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Software
