Matching records in SUNCAT
SUNCAT is a union catalogue with a deduplicated view of serials records held in research libraries throughout the United Kingdom, plus the ISSN and CONSER databases.
A union catalogue shows all records from varying sources combined into a single database. The deduplicated view shows a single, “preferred” record for one title, with all associated holdings displaying, rather than showing every record that has been contributed to SUNCAT individually.
The preferred record
The “preferred” record is the bibliographic record displayed in the union view, along with all the holdings from its set. Bibliographic fields from the other records in the set are not displayed, but they can still be searched, because all the records (not just the preferred record) are indexed. The preferred record is the fullest, most complete bibliographic record in the set, as measured by the presence of certain fields (for example, multiple subject headings, title, main and added entries). Each occurrence of one of these fields is awarded a point; the record with the most points becomes the preferred record. The preferred record in a set may therefore change as new libraries are added and records are upgraded.
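The point count described above can be sketched in a few lines of Python. This is only an illustration: the list of scored tags, the record layout (a list of tag/value pairs) and the tie-breaking rule are assumptions, since the text does not say which fields earn points or how ties are resolved.

```python
# Hypothetical list of MARC tags that earn one point per occurrence.
# The actual list used by SUNCAT is not given in the text.
SCORED_TAGS = {"100", "110", "245", "650", "700", "710"}


def completeness_score(record):
    """Count one point for each occurrence of a scored tag.

    `record` is assumed to be a list of (tag, value) pairs.
    """
    return sum(1 for tag, _value in record if tag in SCORED_TAGS)


def choose_preferred(records):
    """Return the record with the most points.

    Ties go to the earliest record in the list (an assumption).
    """
    return max(records, key=completeness_score)
```

Because the score is recomputed from whatever records are currently in the set, adding a fuller record or upgrading an existing one can change which record is preferred.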
The SUNCAT match / merge algorithm
To produce the deduplicated view, the bibliographic records are matched and merged at the point of display by a complex algorithm. It is based on the algorithm developed for Melvyl, the union catalogue of the California Digital Library (CDL). The original CDL algorithm has been changed to reflect UK data variations, for example by including the BNB and added entries.
The algorithm works on a points scoring system, which compares records and weighs them up against each other. (This is different from determining the preferred record, which takes the best record from a set which has already been matched and merged.) The points assigned can be either positive or negative; for example, if two records both have an ISSN, but they do not match, then negative points are awarded. Each field used in the matching algorithm has an allotted fixed score. The number of points awarded is dependent on the fields present in the records being compared, and the validity of the match between those fields. If one record has a field which is not present in the record to which it is being matched, the second record is not penalised in any way for not having that field. Only fields which are essential in determining a specific title are included in the match/merge algorithm. Optional fields, such as 5XX notes, and 6XX subject headings, are not. If the records compared reach or cross a pre-determined threshold of points, they are deemed a match, and are merged together.
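As a rough illustration of this scoring model, the sketch below compares two records field by field. The field names and point values are invented for the example; only the general behaviour (fixed scores per field, negative points for a conflict, no penalty for an absent field, and a threshold, here the +800 figure quoted for the full match) comes from the description.

```python
# Hypothetical weights: (points when both values agree, points when they conflict).
WEIGHTS = {
    "issn": (500, -300),
    "title": (400, -200),
    "place": (100, -50),
}
THRESHOLD = 800


def score_pair(a, b, weights=WEIGHTS):
    """Total the points for two records given as dicts of field -> value."""
    total = 0
    for field, (agree, conflict) in weights.items():
        va, vb = a.get(field), b.get(field)
        if va is None or vb is None:
            continue  # a field missing from either record is not penalised
        total += agree if va == vb else conflict
    return total


def is_match(a, b):
    """Records match when the total reaches or crosses the threshold."""
    return score_pair(a, b) >= THRESHOLD
```

With these illustrative weights, two records agreeing on ISSN and title score 900 and match even if one lacks a place of publication, whereas conflicting ISSNs pull the total well below the threshold.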
The matching process occurs in three stages.
In the first stage, a pool of potential duplicate records is retrieved by searching on three values: the ISSN, the LCCN and the short title. These values are normalized by removing such characters as suffixes, prefixes and hyphens. The pool can contain at most 500 records. If the 500-record limit is reached, the search is modified to include place of publication as a fourth search term.
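The retrieval of the candidate pool might look something like the following sketch. Several details are assumptions made for the example: records are plain dicts, a candidate qualifies if it shares any one of the three values, normalization is reduced to dropping hyphens and spaces and lower-casing, and a pool that is still too large after adding place of publication is simply truncated.

```python
POOL_LIMIT = 500  # maximum pool size given in the text


def normalise(value):
    """Simplified normalization: drop hyphens and spaces, lower-case."""
    return value.replace("-", "").replace(" ", "").lower()


def candidate_pool(target, catalogue, limit=POOL_LIMIT):
    """Return potential duplicates of `target` found in `catalogue`."""
    keys = ("issn", "lccn", "short_title")

    def shares_key(rec):
        # A candidate qualifies if any one of the three values agrees.
        return any(
            target.get(k) and rec.get(k)
            and normalise(target[k]) == normalise(rec[k])
            for k in keys
        )

    pool = [rec for rec in catalogue if shares_key(rec)]
    if len(pool) >= limit and target.get("place"):
        # Limit reached: add place of publication as a fourth search term.
        wanted = normalise(target["place"])
        pool = [r for r in pool if r.get("place") and normalise(r["place"]) == wanted]
    return pool[:limit]
```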
The second stage is a “quick match”, in which duplicates are identified using a limited algorithm. At this stage, a match is only possible when the records match on the LCCN, ISSN or BNB (whichever scores higher) and on the title. The points for the LCCN / ISSN / BNB and the title are calculated; if they reach the match threshold, the records are considered a match.
If the “quick match” does not reach the match threshold, a “full match” is calculated in the third stage. The LCCN / ISSN / BNB weight from the second stage is kept, the title is recalculated (if necessary), and the following elements are checked: first date of publication, country of publication, place of publication, main and added entries. Values for matches on each of these elements vary. When all values are totalled, if they reach the threshold of +800 points, the records are deemed a match and are merged together; otherwise, they are not a match.
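The quick match and full match stages can be sketched as follows. Only the +800 threshold and the list of elements come from the description; every individual weight, the field names, and the choice to take the highest-scoring identifier among those present in both records are assumptions made for illustration. Title recalculation is omitted.

```python
THRESHOLD = 800  # match threshold given in the text

# Hypothetical weights: (points on agreement, points on conflict).
ID_WEIGHTS = {"lccn": (500, -300), "issn": (500, -300), "bnb": (500, -300)}
TITLE_WEIGHT = (400, -400)
FULL_WEIGHTS = {
    "first_date": (150, -150),
    "country": (50, -50),
    "place": (100, -100),
    "entries": (150, -150),
}


def field_score(a, b, field, weight):
    """Score one field; a field missing from either record scores 0."""
    va, vb = a.get(field), b.get(field)
    if va is None or vb is None:
        return 0
    return weight[0] if va == vb else weight[1]


def identifier_score(a, b):
    """Use the highest score among identifiers present in both records."""
    scores = [
        field_score(a, b, f, w)
        for f, w in ID_WEIGHTS.items()
        if a.get(f) is not None and b.get(f) is not None
    ]
    return max(scores) if scores else 0


def match(a, b):
    """Return (stage, points); stage is 'quick', 'full' or 'none'."""
    quick = identifier_score(a, b) + field_score(a, b, "title", TITLE_WEIGHT)
    if quick >= THRESHOLD:
        return "quick", quick
    # Full match: keep the identifier and title points, add the other elements.
    full = quick + sum(field_score(a, b, f, w) for f, w in FULL_WEIGHTS.items())
    return ("full", full) if full >= THRESHOLD else ("none", full)
```

With these illustrative weights, two records sharing an ISSN and title match at the quick stage, while records with no shared identifier need agreement on the title and on all four remaining elements to reach 800.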
List of Common Titles
Titles which occur commonly, such as “journal” or “report”, could cause mismatches. To overcome this, a list of common titles has been created, currently consisting of around 400 titles. These titles are entered into a specific table. When matching takes place, the algorithm consults this table; if the records being matched have a title on this table, far fewer points are awarded for the title field. The rest of the fields then have to match exactly for the records as a whole to match, so records whose title is on the list of common titles are unlikely to match incorrectly. The list will be extended as data from new contributing libraries is added to SUNCAT, after thorough checking to ensure that each title should be included.
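A minimal sketch of the common-titles lookup is shown below. The two sample titles are taken from the text; the table contents beyond those, the point values and the way titles are normalized are assumptions for the example.

```python
# Sample entries only; the real table holds around 400 titles.
COMMON_TITLES = {"journal", "report"}


def title_weight(title, full=400, reduced=50):
    """Return the points available for an agreeing title.

    Titles found in the common-titles table earn far fewer points.
    Both point values are hypothetical.
    """
    key = " ".join(title.lower().split())  # lower-case, collapse whitespace
    return reduced if key in COMMON_TITLES else full
```

Under these assumed values, a title such as “Report” contributes so little that the remaining fields must all agree before two records can reach the match threshold.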
The SUNCAT ID
The SUNCAT ID (“SUNCAT Identifier”) is one of the major developments that EDINA and Ex Libris have worked on together. At present, it matches existing sets in SUNCAT, with the aim that it will eventually match at a “work” level, with one SUNCAT ID per journal title. Matching will improve as Contributing Libraries upgrade their records and these are loaded into SUNCAT, and as database maintenance work on the SUNCAT IDs is factored into the normal workflow at the SUNCAT end.
One of the known problems with the matching algorithm was that it allowed for “overlapping sets”, whereby a single record may belong to more than one set, resulting in duplicated holdings. As a record may not have more than one SUNCAT ID, the records which were in these overlapping sets have been assigned to a single set, and given the SUNCAT ID for that set. A tag in the bibliographic record alerts the user that this record may match with more than one set. One of the tasks for the SUNCAT team is to check these records, and ensure that they are in the appropriate set; if not, they can be assigned the SUNCAT ID of the correct set. The SUNCAT team will also have the authority to combine or separate sets, should that be necessary.
Development of the SUNCAT ID entails a major change in the basic concept of the Union Catalogue, which had previously been based on an entirely automatic and dynamic procedure. The SUNCAT ID creates a more “fixed” union catalogue. The SUNCAT ID is stored in the 049 tag, in the $a, and is in the form of a 9-digit number, with a 2-digit modulus 11 check on the end, preceded by the letters “SC”. It looks like this:
000000005 049 L $$aSC00087765701
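The format of the identifier can be checked with a short pattern, sketched here in Python. The function name is hypothetical, and the sketch only splits the value into its parts: the text does not say how the modulus 11 check is calculated, so the check digits are not verified.

```python
import re

# "SC", then a 9-digit number, then 2 check digits.
SUNCAT_ID = re.compile(r"SC(\d{9})(\d{2})")


def parse_suncat_id(value):
    """Split a SUNCAT ID into (number, check digits), or return None.

    Only the shape is validated; the check digits are not recomputed.
    """
    m = SUNCAT_ID.fullmatch(value)
    if m is None:
        return None
    return m.group(1), m.group(2)
```

For the example above, `parse_suncat_id("SC00087765701")` gives the number `000877657` and the check digits `01`.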
The matching algorithm described above is used to assign the SUNCAT ID to the appropriate record. This is done before the file is loaded into the database, after the data conversion has been run and checked. As the record is loaded into the database, matching is on the SUNCAT ID only. If the SUNCAT IDs are the same, the records are considered the same; if the SUNCAT IDs differ, they are not a match. If either or both records lack a SUNCAT ID, the regular matching algorithm is used. However, this scenario is highly unlikely to occur, as all records are run through a process to assign SUNCAT IDs before load. The addition of the SUNCAT ID has improved the matching results by removing the overlapping sets described above.
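The load-time decision described above reduces to a simple rule, sketched below. The record layout (a dict with a `suncat_id` key) and the function names are hypothetical; `fallback` stands in for the regular matching algorithm.

```python
def same_set(a, b, fallback):
    """Decide at load time whether two records belong together.

    If both records carry a SUNCAT ID, only the IDs are compared.
    If either ID is missing, defer to `fallback(a, b)`, which stands
    in for the regular matching algorithm.
    """
    id_a, id_b = a.get("suncat_id"), b.get("suncat_id")
    if id_a and id_b:
        return id_a == id_b
    return fallback(a, b)
```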