Historical Duplicate Detection and Quality Hierarchy Settings for the I-Share Universal Catalog

This document provides historical context for the Duplicate Detection and Quality Hierarchy settings of the I-Share Universal Catalog from 2002 to 2011. The current settings can be found at Duplicate Detection and Quality Hierarchy Settings for the I-Share Universal Catalog.

Duplicate Detection and Quality Hierarchy Recommendations For the Universal Catalog

Original version created by the VISCom Cataloging/Universal Catalog Task Force
May 6, 2002

Fifth revised version incorporating new ILCSO member libraries (2003 and 2004) and changes recommended by IUAG CCAC
July 12, 2004

The goal of the Universal Catalog is to present to users a deduplicated union catalog of the holdings of all the contributing libraries using the bibliographic record of highest quality. To do this, we must define when bibliographic records are duplicates, and when duplicates are detected, we must define how the better record will be chosen.

Duplicate Detection

Four indexes are available for duplicate detection: Original (OCLC) control number (035O), LCCN (010A), ISSN (022A), and ISBN (020A). The duplicate detection process is accomplished by assigning weights to these indexes when they match and by establishing a sum of weights that will cause actions. Below is the recommendation from Endeavor for these values and the recommendation of the VISCom Cataloging/Universal Catalog Task Force (“the Task Force”).

Setting                         Endeavor Recommendation   Task Force Recommendation
Duplicate handling              Replace                   Replace
Cancellation                    None                      None
Duplicate Replace               100                       100
Duplicate Warn                  100                       100
Selected Indexes and Weights:
035O                            100                       100
010A                            50                        60
022A                            100                       70
020A                            100                       40

The Task Force has decided to modify Endeavor’s recommendation for duplicate detection based on research done by the ILCSO Maintenance Committee several years ago. In the fall of 1999 a report was generated that identified all the bibs in DRA that matched on ISBN or ISSN. A sampling of matching groups of ISBN numbers found that 38% of them represented bibliographically different items. A sampling of matching groups of ISSN numbers indicated 48% represented different formats, 28% were for preceding/succeeding titles, and 6% were for different publishers. The LCCN is also problematic as a matching tool because this field was not present in ILCSO data prior to 1998. Using these three indexes as 100% indicators of duplicate bibliographic records would result in significant misrepresentations over time.

Further investigation by the Task Force suggests that the reliability of using ISBN or ISSN for duplicate detection is greatly improved when these numbers are used in conjunction with the LCCN. Use of these indexes is highly desirable for the handling of loads of bibliographic records from sources other than OCLC. Because of this, the Task Force recommends that these three indexes each be weighted as above, so that a match on any two of them is required to identify records as duplicates.
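The arithmetic behind this recommendation can be sketched as follows. This is an illustrative sketch only, not Voyager's implementation: the function names are hypothetical, and the assumption that a sum equal to the threshold counts as a duplicate is inferred from the statement that any two of the three indexes suffice (LCCN plus ISBN sums to exactly 100).

```python
# Weights copied from the recommendation table, keyed by index code.
ENDEAVOR_WEIGHTS = {"035O": 100, "010A": 50, "022A": 100, "020A": 100}
TASK_FORCE_WEIGHTS = {"035O": 100, "010A": 60, "022A": 70, "020A": 40}

# "Duplicate Replace=100" from the same table.
REPLACE_THRESHOLD = 100


def match_score(matched_indexes, weights):
    """Sum the weights of every index on which two records match."""
    return sum(weights[index] for index in matched_indexes)


def is_duplicate(matched_indexes, weights=TASK_FORCE_WEIGHTS,
                 threshold=REPLACE_THRESHOLD):
    """Assumed rule: records are duplicates when the sum reaches the threshold."""
    return match_score(matched_indexes, weights) >= threshold


# Under the Task Force weights, an ISBN match alone is not enough,
# but ISBN plus LCCN (40 + 60) reaches the threshold.
isbn_only = is_duplicate({"020A"})
isbn_and_lccn = is_duplicate({"020A", "010A"})
```

Under the Endeavor weights the same ISBN-only match would score 100 and be treated as a duplicate, which is the behavior the Task Force weights are meant to avoid.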

Quality Hierarchy

Once two bibliographic records have been determined to represent the same item, Voyager can evaluate them to decide which should be used in the Universal Catalog and which should not. There are four criteria that can be used alone or in combination:

Cataloging agency (040$a) – Any library in the world
Encoding Level
Contributing agency (040$d) – An ILCSO library’s database name
Record type/bib level

The Quality Hierarchy table consists of four columns containing the appropriate codes for the above, with the highest line being the highest quality. An asterisk can be used in any column to indicate that any value can match that position.
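As a rough illustration of how a table line with asterisks might be compared against a record, consider the sketch below. It is a model of the description above, not Voyager code: the function names and the sample record values ("DLC", "am") are hypothetical, the two table lines are borrowed from the 2004 initial hierarchy later in this document, and a single space stands in for a blank encoding level.

```python
WILDCARD = "*"


def line_matches(line, record):
    """A line matches when every column is an asterisk or equals the record value."""
    return all(pattern == WILDCARD or pattern == value
               for pattern, value in zip(line, record))


def first_matching_line(table, record):
    """Return the 0-based position of the highest line that matches, or None."""
    for position, line in enumerate(table):
        if line_matches(line, record):
            return position
    return None


# Columns: (040$a, Elvl, 040$d, Type/Blvl). Highest quality comes first.
sample_table = [
    ("*", "*", "ARUdb", "*"),
    ("*", "*", "BENdb", "*"),
]

# Hypothetical record contributed from the BENdb database.
sample_record = ("DLC", " ", "BENdb", "am")
sample_position = first_matching_line(sample_table, sample_record)
```

A smaller position means a higher line in the table, and therefore a record treated as higher quality.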

The Task Force recommends that we have two Quality Hierarchy tables—one for the initial build of the Universal Catalog, and a second one for ongoing additions to the Universal Catalog after the initial build.

ILCSO added 12 new member libraries to the consortium in 2003 and 9 new libraries in 2004. After each round of new library conversions, ILCSO has elected to rebuild the entire Universal Catalog from scratch. The concept of using two quality hierarchies, one for the initial rebuild, and a second for the ongoing (daily) additions to the UC, remains as a viable option for our environment.

2004 Rebuild Initial Quality Hierarchy

Library 040$a Elvl 040$d Type/Blvl
Aurora * * ARUdb *
Benedictine * * BENdb *
Bradley * * BRAdb *
Catholic Theo * * CTUdb *
Chicago State * * CSUdb *
Columbia * * COLdb *
Concordia * * CONdb *
DePaul * * DPUdb *
Dominican * * DOMdb *
Eastern * * EIUdb *
Elmhurst * * ELMdb *
Governors State * * GSUdb *
Greenville * * GRNdb *
IMSA * * IMSdb *
Ill. State Lib * * ISLdb *
ISU * * ISUdb *
Ill. Valley CC * * IVCdb *
Ill. Wesleyan * * IWUdb *
Joliet Jr Coll. * * JOLdb *
Judson * * JUDdb *
Kankakee CC * * KCCdb *
Lake Forest * * LFCdb *
Lewis * * LEWdb *
Lincoln Chr * * LCCdb *
McKendree * * MCKdb *
Millikin * * MILdb *
Natl Louis * * NLUdb *
North Central * * NCCdb *
Northeastern * * NEIdb *
Northern * * NIUdb *
Oakton * * OAKdb *
Roosevelt * * ROUdb *
Saint Xavier * * SXUdb *
Sch Art Inst * * SAIdb *
SIU-C * * SICdb *
SIU-E * * SIEdb *
Trinity * * TRNdb *
Triton * * TRTdb *
UIS * * UISdb *
UIUC * * UIUdb *
UIC * * UICdb *
Western * * WIUdb *
IIT * * IITdb *
SIUM * * SIMdb *
Lewis and Clark CC * * LACdb *
Danville Area CC * * DACdb *
Lincoln Land CC * * LLCdb *
Parkland Coll * * PRKdb *
North Park * * NPUdb *
Sauk Valley CC * * SVCdb *
Northern Baptist Theo Sem * * NBTdb *
Heartland CC * * HRTdb *
John Wood CC * * JWCdb *
Illinois Central Coll * * ICCdb *
Eureka Coll * * ERKdb *
Quincy Univ * * QCYdb *
Harper Coll * * WRHdb *
Olivet Nazarene * * ONUdb *
Augustana Coll * * AUGdb *
Wheaton Coll * * WHEdb *
Illinois Coll * * ILCdb *
Newberry Lib * * NBYdb *
Univ of St. Francis * * USFdb *
Kendall Coll * * KENdb *
Robert Morris Coll * * RMCdb *

In the initial 2004 rebuild, the majority of bibs from the former DRA libraries should be identical due to that system being a single, shared bibliographic database, so the alphabetical order of these libraries is of little consequence. Near the end of the list, and out of alphabetic order, is Illinois Institute of Technology, which has indicated a willingness for other bibs to take precedence over theirs. Following IIT is the SIU Medical school, whose cataloging policy in their former local system was to use only MeSH headings. UIC, higher in the hierarchy than SIUM, supports a large medical library and has used both LCSH and MeSH headings on their bib records.

The new ILCSO member libraries from 2003 and 2004 were added as the final entries to the initial quality hierarchy for each year's rebuild. Records from the new member libraries will go into the Universal Catalog when they are unique to ILCSO. The order of the new libraries was determined by the ILCSO Office after reviewing each library's converted bib data, looking at a variety of factors including title filing indicator values, presence of obsolete subject headings, uniform title application, and the presence of multiple call number schemes in bib records.

Following the completion of the initial rebuild, the following Quality Hierarchy replaces the one above, for processing the routine, daily updates to the UC:

Ongoing Load Quality Hierarchy

040$a Elvl 040$d Type/Blvl
* (blank) * *
* 1 * *
* I * *
* L * *
* 4 * *
* 7 * *
* 5 * *
* K * *
* M * *
* * * *

Of the four possible data elements to determine quality, we have decided that the cataloging agency (040$a) is not useful, and so we have filled in its column with asterisks.

We have established a hierarchy of encoding level codes as indicated in the Elvl column. Not included are codes 2 (less-than-full level, material not examined), 3 (Abbreviated level), 8 (Prepublication level), E (System-identified error in tapeloaded record), and J (Deleted record). These represent low levels of quality and will all match the last line of the table, making them easily replaced.
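The effect of the Elvl column can be sketched as a simple ordered lookup. This is an illustration under stated assumptions rather than the actual Voyager logic: the names are hypothetical, a single space represents the blank encoding level, and only the Elvl column is modeled because the other three columns are all asterisks in the Ongoing Load Quality Hierarchy.

```python
# Elvl column of the Ongoing Load Quality Hierarchy, top line first.
# " " is an assumed stand-in for the blank code; "*" is the final catch-all line.
ONGOING_ELVL_ORDER = [" ", "1", "I", "L", "4", "7", "5", "K", "M", "*"]

CATCH_ALL_LINE = len(ONGOING_ELVL_ORDER) - 1


def elvl_line(elvl):
    """Return the 0-based table line that an encoding level matches."""
    try:
        return ONGOING_ELVL_ORDER.index(elvl)
    except ValueError:
        # Codes not listed (such as 2, 3, 8, E, J) match only the last line.
        return CATCH_ALL_LINE


# The codes omitted from the table all land on the final line.
omitted_lines = {code: elvl_line(code) for code in ["2", "3", "8", "E", "J"]}
```

Because every omitted code resolves to the last line, a record carrying one of them never outranks a record whose encoding level is listed explicitly.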

The final line in the table will force a replace in cases where neither the record in the UC nor the incoming record matches a higher line in the table. We believe that in general this will contribute newer versions of bib records to the UC. This is beneficial for serial records, which are often updated as frequency or publication status changes, and for other records as they are enhanced or corrected. However, it is possible that an incoming record may actually be of lesser quality than the one currently in the UC. In these cases an unfortunate result seems unavoidable.

At the recommendation of IUAG’s Cooperative Cataloging and Authority Control Committee (CCAC), approved in principle by IUAG, a change was made to the table above with the 2004 rebuild project. In past builds of the UC, special preference was given to three ILCSO libraries with OCLC “Enhance” status for particular types of materials. With the 2004 rebuild, this preference was removed from the ongoing quality hierarchy. CCAC made this recommendation based on several factors: real-life examples where bib records with OCLC numbers were replaced by bibs without OCLC numbers due to the special preference; potential problems with the new “What other libraries own this item” feature in WebVoyage when the bib does not contain an OCLC number; and a relatively small number of actual Enhance transactions by these libraries in recent years.

How We Think the Quality Hierarchy Works

Bib in UC                      Incoming Duplicate             Result
Matches higher line in table   Matches lower line in table    Bib in UC retained, Incoming discarded
Matches lower line in table    Matches higher line in table   Bib in UC is replaced by Incoming bib
Matches same line in table     Matches same line in table     Bib in UC is replaced by Incoming bib
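The three outcomes above can be expressed as a small decision function. As the section title suggests, this reflects our understanding of the behavior rather than documented Voyager internals. The function name and return strings are hypothetical, and line positions are assumed to be 0-based with 0 as the highest line.

```python
def resolve_duplicate(uc_line, incoming_line):
    """Decide which bib survives, given the table line each record matches.

    A smaller number means a higher line in the Quality Hierarchy table.
    """
    if uc_line < incoming_line:
        # The bib in the UC matches a higher line: keep it.
        return "retain UC bib"
    # Incoming matches a higher line or the same line: it replaces the UC bib.
    return "replace with incoming bib"


# A tie goes to the incoming record, which is why two records that both
# fall to the final catch-all line still result in a replace.
tie_result = resolve_duplicate(9, 9)
```

The tie rule is what makes the final all-asterisk line act as a forced replace.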