Internal Note
Report number CERN-STUDENTS-Note-2014-128
Title Author Clustering on Large Bibliographies
Author(s) Sterz, Christoph (CERN)
Corporate author(s) CERN. Geneva. IT Department
Imprint 28 Aug 2014
Subject category Computing and Computers ; Information Transfer and Management
Keywords invenio ; bibauthorid ; author disambiguation ; library management
Abstract We analyze and design an algorithm for clustering large sets of authors in bibliographies. Rather than using a distance function for pairwise comparison, the algorithm transforms the data into a multidimensional metric space, which makes it similar to locality-sensitive hashing. The task lies in the field of record linkage. The algorithm, which is part of the digital asset management software Invenio, was designed and run on the data of the CERN Document Server (CDS), which consists of more than 1.7 million metadata entries. Meant as a prototype, the algorithm performs efficiently, clustering all authors on CDS in under 30 minutes. We discuss extensions that improve the recall rate, which still remains inferior to that of the clustering approach currently in use.
Submitted by christoph.sterz@cern.ch

 


Record created 2014-08-28, last modified 2015-01-15


Access to fulltext:
Download fulltext
PDF