Data processing (initial load)

The data supplied by the contributing library has to undergo a reasonable amount of manipulation in order for the bibliographic records and their holdings to display in the same fashion as the rest of the records in SUNCAT.

On first contact with the library, when SUNCAT approaches them to ask whether they want to join as a contributor, a basic questionnaire is sent out. This includes questions relating to the LMS (and any immediate plans to change it), the catalogue and its records, and the rate at which those records are changed. A further questionnaire is sent out when arranging a time for depositing that library's data onto the ftp site. This questionnaire is more detailed, asking about specific local practices.

The SUNCAT team then drafts a data specification, using the responses from both questionnaires together with an examination of the data itself. The data file is pulled across from the ftp site, renamed according to SUNCAT practices (library_name.date_of_file.aa_extension), and, if necessary, a MARC dump is created so that the file is legible to the human eye. Data from Talis and Aleph libraries does not need to be put into a MARC dump, as it is already legible and searchable: Talis outputs its files in a text format, and Aleph in an Aleph sequential format. Other libraries may send their files as text only; this is particularly true of libraries that do not use the MARC standard in their bibliographic records.

There is constant communication with the contributing library during the drafting of the data specification, usually to ask for clarification where the responses in the data questionnaire differ from the data supplied. Occasionally, SUNCAT will highlight inconsistencies between the two; these can be caused by a problem created when the data was exported, or by an obsolete local library practice which has been brought to light when compiling the data specification. In this way, a mutually beneficial relationship is established between SUNCAT and the contributing library. Sometimes the problems discussed result in a new extract being FTPed to the SUNCAT site, but this is rare, as the SUNCAT team can normally accommodate most local practices.

The data specification itself is divided into three parts: general information, holdings, and fixes.

  1. The general information is concerned with such details as the MARC format used, and where the local control number is stored.
  2. The holdings section indicates which fields the summary holdings are kept in, but their treatment is covered under "fixes".
  3. The fixes section is concerned with the fields that will need manipulation to dovetail this data in with that already extant in SUNCAT. This includes local tags to be stripped or retained. If they are retained, and they reflect a purely local interest, a $5 with the library's MARC organisational code (or equivalent) is added. Some standard manipulation is done to all bibliographic records, to ensure uniformity of display. Such fixes, for example changing 245$h[computer file] to 245$h[electronic resource], bring the data in line with the latest developments in AACR and MARC.
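The fixes described in the list above can be sketched in a few lines of Python. This is only an illustration, not the SUNCAT conversion programme: the record representation (a list of dictionaries), the function name apply_fixes, the local tags 590 and 945, and the organisation code "XX-Example" are all assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple

Subfield = Tuple[str, str]  # (subfield code, value)


def apply_fixes(
    fields: List[Dict],
    strip_tags: Set[str],
    retain_local_tags: Set[str],
    org_code: str,
) -> List[Dict]:
    """Return a new field list with the (hypothetical) specification applied."""
    fixed = []
    for field in fields:
        tag = field["tag"]
        if tag in strip_tags:
            continue  # local tag the specification says to strip
        subfields: List[Subfield] = list(field["subfields"])
        if tag == "245":
            # Standard fix applied to every record.
            subfields = [
                (code, value.replace("[computer file]", "[electronic resource]"))
                if code == "h"
                else (code, value)
                for code, value in subfields
            ]
        if tag in retain_local_tags and not any(c == "5" for c, _ in subfields):
            # Retained local field: mark it with the library's code in $5.
            subfields.append(("5", org_code))
        fixed.append({"tag": tag, "ind": field["ind"], "subfields": subfields})
    return fixed


record = [
    {"tag": "245", "ind": "00", "subfields": [("a", "Nature"), ("h", "[computer file]")]},
    {"tag": "590", "ind": "  ", "subfields": [("a", "Local note")]},
    {"tag": "945", "ind": "  ", "subfields": [("a", "Local processing data")]},
]
result = apply_fixes(
    record, strip_tags={"945"}, retain_local_tags={"590"}, org_code="XX-Example"
)
```

In this sketch the 945 field is dropped, the 590 field gains a $5, and the 245$h is rewritten; which tags are stripped or retained would in practice come from the agreed data specification.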

Particular attention is paid to holdings, as this is where the greatest diversity of practices is encountered. The holdings data has to be slotted into an 852 tag ($a: institution code; $b: sublocation; $h: shelfmark; $3: summary holdings information). There are very few libraries which are able to export their holdings in the format that is required by SUNCAT, so much of the data specification is concerned with this.
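As a rough illustration of the target layout, the following Python sketch assembles an 852 tag from separately exported holdings values. The function name build_852, the sample values, and the decision to omit the indicators from the display string are assumptions; the subfield order simply follows the order in which the subfields are listed above.

```python
from typing import Optional


def build_852(
    institution: str,
    sublocation: Optional[str] = None,
    shelfmark: Optional[str] = None,
    summary_holdings: Optional[str] = None,
) -> str:
    """Build a display string for an 852 tag, skipping empty subfields."""
    parts = [
        ("a", institution),       # institution code
        ("b", sublocation),       # sublocation
        ("h", shelfmark),         # shelfmark
        ("3", summary_holdings),  # summary holdings information
    ]
    return "852 " + "".join(f"${code}{value}" for code, value in parts if value)


full = build_852("EXAMPLE", "Main Library", "Per. 050", "v.1 (1990)-")
partial = build_852("EXAMPLE", shelfmark="Per. 050")
```

Here full holds "852 $aEXAMPLE$bMain Library$hPer. 050$3v.1 (1990)-", while partial carries only the $a and $h. The real work lies in extracting these four values from whatever fields a given library uses, which is what the data specification records.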

For those libraries which do not use MARC, the data specification itself tends to be very basic; the majority of the data manipulation takes place in a separate process, in which the data is converted into a form that follows MARC. This process is written specifically for each non-MARC library, but there are similarities between them. For example, this is the instruction covering the title:

"Title:" should be converted to 245 $a, first indicator 0, second indicator 0, unless the title begins with a definite or indefinite article, in whatever language the title has been transcribed in. In that case, the second indicator should be the number of characters (including spaces and apostrophes) before the next word (e.g., "Le" would become second indicator 3; "L'" would be 2). This only applies to articles at the beginning of the 245$a.

If the article finishes with an apostrophe, capitalise the first word of the title, if it is not already capitalised.
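The indicator rule can be sketched as follows. This is a simplified illustration: the ARTICLES tuple is a small, assumed sample rather than a complete list, the function name title_indicators is hypothetical, only the straight apostrophe is recognised, and no attempt is made to detect the language of the title (so a leading word that merely looks like an article would be miscounted).

```python
from typing import Tuple

# Assumed sample of initial articles, each including its trailing space
# or apostrophe so that its length equals the second indicator value.
ARTICLES = (
    "the ", "an ", "a ",
    "le ", "la ", "les ", "l'",
    "der ", "die ", "das ",
    "el ", "los ", "las ",
)


def title_indicators(title: str) -> Tuple[str, str]:
    """Return (indicators, title text) for a 245$a."""
    lowered = title.lower()
    for article in ARTICLES:
        if lowered.startswith(article) and len(title) > len(article):
            rest = title[len(article):]
            if article.endswith("'"):
                # Article ends in an apostrophe: capitalise the next word.
                rest = rest[0].upper() + rest[1:]
            return "0" + str(len(article)), title[: len(article)] + rest
    return "00", title


print(title_indicators("Le Monde"))        # ('03', 'Le Monde')
print(title_indicators("L'année"))         # ('02', "L'Année")
print(title_indicators("Les Prix Nobel"))  # ('04', 'Les Prix Nobel')
print(title_indicators("Accountancy"))     # ('00', 'Accountancy')
```

Because every entry ends in a space or an apostrophe, "Les " is never mistaken for "Le ", and the length of the matched entry is exactly the number of characters to skip.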

If there is an equals sign ("=") after a phrase in the Title, then add the information after the equals sign in a 245$b [this is an example of a parallel title]. Make sure that there is only one space after the equals sign and the start of the new word.

For example:

Title: Zbornik Radoba = Proceedings

would become:

24500 $aZbornik Radoba =$bProceedings

If there is a colon (":") after a phrase in the Title, then separate the colon from the word by one space, and add the information after the colon in a 245$b [this is an example of a remainder of title].

For example:

Title: Accountancy: the Journal of the Institute of Chartered Accountants of England and Wales

would become:

245 00 $aAccountancy :$bthe Journal of the Institute of Chartered Accountants of England and Wales

245$b is a non-repeatable subfield; there can only be one instance per 245.

If there is both an equals sign and a colon in the same title, then treat whichever one comes first as the one to be acted on.

For example:

Title: Les Prix Nobel = The Nobel Prizes: Nobel Prizes, Presentations, Biographies and Lectures

would become:

24504 $aLes Prix Nobel =$bThe Nobel Prizes: Nobel Prizes, Presentations, Biographies and Lectures
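The equals-sign and colon rules, including the "whichever comes first" tie-break, can be expressed as a short Python sketch. The function name split_title is hypothetical, indicators are left out, and the subfield text is emitted with no space after the subfield code, as in the Accountancy example.

```python
def split_title(title: str) -> str:
    """Split a title into 245 $a and $b at the first '=' or ':'."""
    positions = [p for p in (title.find("="), title.find(":")) if p != -1]
    if not positions:
        return f"$a{title.strip()}"
    cut = min(positions)          # whichever mark comes first is acted on
    mark = title[cut]
    main = title[:cut].rstrip()   # exactly one space is re-added before the mark
    rest = title[cut + 1:].strip()
    # Only one $b is produced, since 245$b cannot be repeated.
    return f"$a{main} {mark}$b{rest}"


print(split_title("Zbornik Radoba = Proceedings"))
# $aZbornik Radoba =$bProceedings
print(split_title("Accountancy: the Journal of the Institute"))
# $aAccountancy :$bthe Journal of the Institute
print(split_title("Les Prix Nobel = The Nobel Prizes: Nobel Prizes"))
# $aLes Prix Nobel =$bThe Nobel Prizes: Nobel Prizes
```

Any later equals sign or colon stays inside the $b untouched, which is how the Les Prix Nobel example keeps its colon.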

If there is a full stop after a phrase in the Title, followed by a space, then add a space before the full stop, and replace the full stop with a forward slash ("/") before adding the rest of the field as a 245$c.
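A minimal sketch of the full-stop rule is given below. The function name split_statement and the sample title are invented for illustration, and the sketch deliberately ignores the problem of abbreviations (a title such as "Proc. of the Society" would be split at the wrong place), which a real conversion would need to handle.

```python
def split_statement(title: str) -> str:
    """Move the text after the first '. ' into a 245$c, preceded by ' /'."""
    head, separator, tail = title.partition(". ")
    if not separator:
        return f"$a{title}"  # no full stop followed by a space
    return f"$a{head} /$c{tail.strip()}"


print(split_statement("Annual report. Example Society"))
# $aAnnual report /$cExample Society
```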

As can be seen, every tag, subfield and punctuation mark has to be covered in this document, to ensure that the data is correctly converted into MARC. Unfortunately, the SUNCAT team cannot be so diligent regarding adherence to AACR2 without re-cataloguing the entire record!

Once the data specification has been drafted, the SUNCAT team sends it to the contributing library for approval. This may result in revisions to the data specification, if the library thinks that its data may be misrepresented. Once the data specification has been agreed upon, the data is passed through the conversion programmes, to bring it into line with the other data already in SUNCAT. The conversion processes are adapted to fit the manipulation set out in the data specification.

Once the conversion process is complete, the resulting output file is checked carefully against the data specification. The locations tables, which control the display of the 852$b, are also updated at this time. Reports are created listing the records which have failed to meet the entry standard, and have thus been rejected, together with any character conversion problems (Aleph uses Unicode as its character set). These reports are sent to the contributing library for correction. The corrected records are expected to be returned to SUNCAT in the course of supplying future updates.

The entry standard allows the majority of records into SUNCAT; it only rejects those which lack the basic fields, as their absence would preclude any matching taking place. So, records with no 008 tag, no 245$a or no 852 tag, records which are not coded as a serial, and records whose LDR field has a "d" in character position 5 are rejected. Records are rejected after the conversion process is complete, so all nomenclature refers to the finished SUNCAT records. Any other library-specific rejection criteria are also included in the report.
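The entry standard can be pictured as a simple check run over each converted record. The Python sketch below is an illustration only: the function name rejection_reasons and the record representation are hypothetical, and it assumes that "coded as a serial" means an "s" in leader position 7 (the MARC 21 bibliographic level), which the text above does not state explicitly.

```python
from typing import List, Tuple

# A field is (tag, [(subfield code, value), ...]); control fields such as
# 008 are represented here with a single ("", data) pair.
Field = Tuple[str, List[Tuple[str, str]]]


def rejection_reasons(leader: str, fields: List[Field]) -> List[str]:
    """Return the reasons a record fails the entry standard (empty if it passes)."""
    tags = {tag for tag, _ in fields}
    reasons = []
    if "008" not in tags:
        reasons.append("no 008")
    has_245a = any(
        tag == "245" and any(code == "a" and value.strip() for code, value in subs)
        for tag, subs in fields
    )
    if not has_245a:
        reasons.append("no 245$a")
    if "852" not in tags:
        reasons.append("no 852")
    if len(leader) < 8 or leader[7] != "s":  # assumption: LDR/07 "s" marks a serial
        reasons.append("not coded as a serial")
    if len(leader) > 5 and leader[5] == "d":  # LDR/05 "d"
        reasons.append("LDR/05 is d")
    return reasons


accepted = rejection_reasons(
    "00000nas a2200000 a 4500",
    [("008", [("", "fixed-length data")]),
     ("245", [("a", "Nature")]),
     ("852", [("a", "EXAMPLE")])],
)
rejected = rejection_reasons("00000dam a2200000 a 4500", [("245", [("b", "x")])])
```

In this sketch accepted is an empty list, while rejected collects all five reasons; the lists of reasons are the kind of material that would feed the rejection report, alongside any library-specific criteria.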

Once the rejection reports have been sent to the contributing library, the data is then loaded into SUNCAT, using a process which automatically matches and merges the records into the main database. This is the final task undertaken by the SUNCAT bibliographic team on the initial data load.