

Please use this identifier to cite or link to this item: http://hdl.handle.net/2005/1346

Title: Near-Duplicate Detection Using Instance Level Constraints
Authors: Patel, Vishal
Advisors: Bhattacharyya, Chiranjib
Keywords: Document Clustering - Artificial Intelligence
Latent Dirichlet Allocation
Information Retrieval
Near-Duplicate Detection
Constrained Clustering
Group LDA
Duplicate Bug Report Detection
Near-Duplicate Document Detection
Submitted Date: Aug-2009
Series/Report no.: G23536
Abstract: For near-duplicate document detection, the bag-of-words comparison approaches used in the information retrieval community are not sufficiently accurate. This work presents a novel approach for the setting in which instance-level constraints are given for documents and the near-duplicates of a new query document must be retrieved. The framework incorporates the instance-level constraints and clusters documents into groups using a novel clustering approach, Grouped Latent Dirichlet Allocation (gLDA). A distance metric is then learned for each cluster using the large margin nearest neighbor algorithm, and finally documents are ranked for a new, unseen document using the learned distance metrics. Experimental results on several datasets demonstrate that our clustering method (gLDA with side constraints) performs better than other clustering methods and that the overall approach outperforms other near-duplicate detection algorithms.
Abstract file URL: http://etd.ncsi.iisc.ernet.in/abstracts/1740/G23536-Abs.pdf
URI: http://hdl.handle.net/2005/1346
Appears in Collections: Computer Science and Automation (csa)
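
The retrieval stage described in the abstract (choose a cluster for the query, then rank that cluster's documents under the cluster's own learned metric) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the clusters are taken as already computed (gLDA itself is not implemented here), each cluster's metric is given as a linear map L of the kind a large margin nearest neighbor learner would produce, and the query is assigned to the cluster with the nearest centroid, which is an assumed rule. The names `rank_near_duplicates`, `clusters`, and `metrics` and the toy vectors are hypothetical.

```python
from math import fsum


def transform(L, v):
    """Apply a linear map L (a list of rows) to the vector v."""
    return [fsum(l_ij * v_j for l_ij, v_j in zip(row, v)) for row in L]


def sq_dist(L, a, b):
    """Squared distance ||L(a - b)||^2 under the linear map L."""
    diff = [x - y for x, y in zip(a, b)]
    return fsum(t * t for t in transform(L, diff))


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [fsum(column) / n for column in zip(*vectors)]


def rank_near_duplicates(query, clusters, metrics):
    """Return (cluster_id, doc_ids sorted from nearest to farthest).

    clusters: {cluster_id: {doc_id: vector}}, assumed precomputed.
    metrics:  {cluster_id: L}, one linear map per cluster, assumed learned.
    """
    # Assumed assignment rule: nearest centroid, measured with that
    # cluster's own metric.
    best = min(
        clusters,
        key=lambda c: sq_dist(
            metrics[c], query, centroid(list(clusters[c].values()))
        ),
    )
    docs = clusters[best]
    # Rank the documents of the chosen cluster by distance to the query.
    ranked = sorted(docs, key=lambda d: sq_dist(metrics[best], query, docs[d]))
    return best, ranked


# Hypothetical toy data: two clusters of 2-dimensional document vectors.
clusters = {
    "A": {"a1": [1.0, 0.0], "a2": [0.9, 0.2]},
    "B": {"b1": [0.0, 1.0], "b2": [0.1, 0.8]},
}
metrics = {
    "A": [[1.0, 0.0], [0.0, 1.0]],  # identity: plain Euclidean distance
    "B": [[1.0, 0.0], [0.0, 2.0]],  # stretches the second dimension
}

print(rank_near_duplicates([0.05, 0.95], clusters, metrics))
```

Keeping one metric per cluster means the same pair of vectors can be near in one group and far in another, which is the point of learning the metrics separately.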

Files in This Item:

File          Description   Size        Format
G23536.pdf                  756.87 kB   Adobe PDF

Items in etd@IISc are protected by copyright, with all rights reserved, unless otherwise indicated.
