MARC Tools & MARC::Record errors

I know next to nothing about MARC, though being a shambrarian I have to fight it sometimes. My knowledge is somewhat binary: absolutely nothing for most fields/subfields/tags, but ‘fairly OK’ for the bits I’ve had to wrestle with.

[If you don’t know that MARC21 is an ageing bibliographic metadata standard, move on. This is not the blog post you’re looking for]

Recent encounters with MARC

  • Importing MARC files into our Library System (Talis Capita Alto), mainly for our e-journals (so users can search our catalogue and find a link to a journal if we subscribe to it online). Many of the MARC records were of poor quality and often did not even state that the item was (a) a journal (b) online. Additionally, Alto will only import a record if there is a 001 field, even though the first thing it does is move the 001 field to the 035 field and create its own. To handle these I used a very simple script – using MARC::Record – to run through the MARC file and add a 001/006/007 where required (see the sketch after this list).
  • Setting up sabre – a web catalogue which searches the records of both the University of Sussex and the University of Brighton – for which we need to pre-process the MARC records to add extra fields, in particular a field to tell the software (vufind) which organisation the record came from.
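
A minimal sketch of the kind of script I mean, using MARC::Record via MARC::Batch. The file names and the pattern for the invented 001 are made up for illustration; the real script also added 006/007 fields along the same lines:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use MARC::Batch;
    use MARC::Field;

    # Hypothetical input/output file names.
    my $batch = MARC::Batch->new( 'USMARC', 'ejournals.mrc' );
    open my $out, '>', 'ejournals-fixed.mrc' or die $!;

    my $count = 0;
    while ( my $record = $batch->next() ) {
        $count++;
        # Alto refuses records with no 001, so invent one where it is missing.
        unless ( $record->field('001') ) {
            my $id = sprintf 'TEMP%06d', $count;    # made-up identifier pattern
            $record->insert_fields_ordered( MARC::Field->new( '001', $id ) );
        }
        print {$out} $record->as_usmarc();
    }
    close $out;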

Record problems

One of the issues was that not all the records from the University of Brighton were present in sabre. Where were they going missing? Were they being exported from the Brighton system? Copied to the sabre server OK? Output by the Perl script? Lost during the vufind import process?
To answer these questions I needed to see what was in the MARC files. The problem is that MARC is a binary format, so you can’t just fire up vi to investigate. The first tool of the trade is a quick script using MARC::Record to convert a MARC file to a text file (a sketch of such a script follows). But this wasn’t getting to the bottom of it, which led me to a few PC tools that were of use.
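
Such a script can be just a few lines – something like this, assuming the file name is passed on the command line:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use MARC::Batch;

    # Dump each record in a MARC file as human-readable text.
    my $batch = MARC::Batch->new( 'USMARC', $ARGV[0] );
    while ( my $record = $batch->next() ) {
        print $record->as_formatted(), "\n\n";
    }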

PC Tools

MarcEdit : Probably the best known PC application. It allows you to convert a MARC file to text, and contains an editor as well as a host of other tools. A good Swiss army knife.
MARCView : Originally from Systems Planning and now provided by OCLC, I had not come across MARCView until recently. It allows you to browse and search through a file containing MARC records, though the browsing element does not work on larger files.
[Screenshot: MARCView]

USEMARCON is the final utility. It comes with a GUI, and both can be downloaded from The National Library of Finland; the British Library also has some information on it. Its main use is to convert MARC files from one type of MARC to another, something I haven’t looked into, but the GUI provides a way to delve into a set of MARC records.

Back to the problem…

So we were pre-processing MARC records from two universities before importing them into vufind, using a Perl script which had been supplied by another university.

It turned out the script was crashing on certain records, and all records after the problematic record were not being processed. It wasn’t just that script: any Perl script using MARC::Record (and MARC::Batch) would crash when it hit a certain point.

By writing a simple script that just printed out each record, we could at least see the record immediately before the one causing the crash (i.e. the last in the list of output). This is where the PC applications were useful: once we knew the record before the problematic record, we could find it using the PC viewers and then move on to the next record.

The issue was certain characters (here in the 245 and 700 fields). I haven’t got to the bottom of what the exact issue is. There are two popular encodings for MARC records, MARC-8 and UTF-8, and the one in use can be designated in the Leader (position 9). I think Alto (via its marcgrabber tool) exports in MARC-8, but perhaps the characters in the record did not match the specified encoding.
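
For what it’s worth, you can inspect the declared encoding with MARC::Record by looking at the Leader – though, as described below, a script like this dies on exactly the records that are broken. A sketch:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use MARC::Batch;

    my $batch = MARC::Batch->new( 'USMARC', $ARGV[0] );
    while ( my $record = $batch->next() ) {
        # Leader position 9 (zero-indexed) is the character coding scheme:
        # 'a' means UCS/Unicode (UTF-8); a blank means MARC-8.
        my $enc = substr( $record->leader(), 9, 1 ) eq 'a' ? 'UTF-8' : 'MARC-8';
        printf "%s: %s\n", $enc, $record->title() || '[no 245]';
    }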

The title (245) in the original catalogue looks like this:

One workaround was to use a slightly hidden feature of MarcEdit to convert the file to UTF-8:

I was then able to run the records through the Perl script and import them into vufind.

But clearly this was not a sustainable solution. Copying files to my PC and running MarcEdit was not something that would be easy to automate.

Back to MARC::Record

The error message produced looked something like this:

utf8 "xC4" does not map to Unicode at /usr/lib/perl/5.10/Encode.pm line 174

I didn’t find much help via Google, though I did find a few mentions of this error relating to working with MARC records.

The issue was that the script loops through each record, and the moment it tries to start a loop iteration with a record it does not like, it crashes. So there is no way to check for certain characters in the record: by the time you could, it is already too late.

Unless we use something like exceptions. The closest thing Perl has to exceptions out of the box is eval.

By putting the whole loop into an eval, if it hits a problem the eval simply passes the flow down to the ‘or do’ part of the code. But we want to continue processing the records, so this simply calls the eval again, until it reaches the end of the file. You can see a basic working example of this here.
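
A minimal sketch of the pattern (the file name is invented; your normal per-record processing goes inside the inner loop):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use MARC::Batch;

    my $batch = MARC::Batch->new( 'USMARC', 'records.mrc' );    # hypothetical file
    my $count = 0;
    my $done  = 0;

    until ($done) {
        eval {
            while ( my $record = $batch->next() ) {
                $count++;
                # ... normal per-record processing goes here ...
            }
            $done = 1;    # next() returned undef: clean end of file
            1;            # make the eval return true on success
        } or do {
            # A record MARC::Record cannot parse: log it and carry on
            # from the next record in the file.
            warn "Skipping bad record after record $count: $@";
        };
    }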

So if you’re having problems processing a file of MARC records using Perl’s MARC::Record / MARC::Batch, try wrapping the loop in an eval. You’ll still lose the records it cannot process, but it won’t stop in its tracks (and you can output an error log to record the numbers of the records with errors).

Post-script

So, after pulling my hair out, I finally found a way to process a file which contains records that cause MARC::Record to crash. It had caused me much stress, as I needed to get this working, quickly, and in an automated manner. As I said, the script had been passed to us by another university, and it already did quite a few things, so I was a little unwilling to rewrite it in another language (though a good candidate would be PHP, as the vufind import script is written in that language and didn’t seem to have these problems).

But in writing this blog post, I was searching Google to re-find the various sites and pages I had come across when I encountered the problem. And in doing so I found this: http://keeneworks.posterous.com/marcrecord-and-utf

Yes. I had actually already resolved the issue, and blogged about it, back in early May. I had somehow – worryingly – completely forgotten all of this. Unbelievable! You can find a copy of a script based on that solution (which is a little similar to the one above) here.

So there you are: a few PC applications and a couple of solutions to a Perl/MARC issue.

Summon @ Huddersfield

I attended an event at Huddersfield looking at their and Northumbria’s experience of Summon. http://library.hud.ac.uk/blogs/summon4hn/?p=22 These are my rough notes – take them all with a pinch of salt.

The day reaffirmed my view of Summon: it is groundbreaking in the library market, with no major stumbling blocks. They are very aware that coverage is key and seem to be steadily adding items and publishers. It searches items that an organisation has access to (though users can tick an option to search all items in the knowledge base, not just those they can access). They have good metadata, merging records from a number of sources and making use of subject headings (to refine or exclude from the search).

There was general consensus that it made sense to maintain only one knowledge base, and therefore, in this case, to use 360 Link if implementing Summon. There was also general dissatisfaction with federated search tools.

To me, and I must stress this is a personal view, there are two products that I have seen which are worth future consideration: Summon and Primo. Summon’s strength is in the e-resources realm and as a resource discovery service. Primo’s strength, while offering these features/services, is as an OPAC (with My Account features etc.) and personalisation (tags, lists). Both products are in a stage of rapid development.

In my view, the decision to implement one of these products – whichever it is – will have a chain reaction. And I think this is an important point. Using Sussex as an example: it currently has Aquabrowser (as a Library Catalogue), Talis Prism 2 (for borrower account, reservations, renewals), SFX (link resolver) and Metalib (federated search).

Two example scenarios (and I stress there are other products on the market and this is just my personal immediate thoughts):

One: Let’s say Sussex first decide to replace Metalib with Summon. They would probably cancel Metalib (Summon replaces it), and probably move from SFX to 360 Link (one knowledge base). They may then wish to review the Library Catalogue in a year’s time: Primo is no longer on the cards (too much crossover with Summon, which they now already have), so they either stick with Aquabrowser (but the new SaaS v3 release) or perhaps move to Prism 3 (Talis’ new-ish SaaS catalogue). Sussex would end up with no Ex Libris products, but would potentially subscribe to several Serials Solutions products.

Two: Let’s say Sussex decide to replace Aquabrowser with Primo (which acts more like a catalogue than Aquabrowser). They cancel Aquabrowser. Primo would (in addition to being the primary OPAC) have Summon-like functionality, allowing users to search a large database of items instantly, with relevance and facets, so Summon would not be an option. They stick with SFX (Metalib would become a side feature of Primo, with a Primo-like interface). With a number of Ex Libris products they would want to keep an eye on the Ex Libris URM (the next-generation LMS); they would have no Serials Solutions products.

The following are some notes from the day:

Sue White from Huddersfield Library started the day, saying it is probably the best decision they have ever made.

Helle Lauridsen from Serials Solutions Europe started with a basic introduction of what it is and what their key aims were (i.e. be like Google).
She emphasised that all content (different types and publishers) is treated equally.

“Better metadata for better findability”: they merge metadata elements, using Serials Solutions, Ulrich’s, Medline and CrossRef data to create the best record. The ‘record becomes incredibly rich’.

She went through all the new features added in the last 12 months, including a notable increase in the size of the knowledge base, ‘dials’ to play with the relevancy of different fields, and a recommender service coming.

She showed a list of example publishers, which included many familiar names; they have just signed with JSTOR. She showed the increase in ‘click-throughs’ for particular publishers; the biggest were for JSTOR and ScienceDirect. Newspapers have proved to be very popular.

There is an advanced search. There has been negative feedback: ‘please bring back title/author search’.

Eileen Hiller from Huddersfield talked about product selection. She mentioned having people from across the library and campus on the selection/implementation group, getting student feedback and talking to academics. They used good feedback in their communications (e.g. in the student newspaper and on their blog). A student feedback questionnaire has been useful.

Dave Pattern talked about the history of e-resource access at Huddersfield, which started with an online Word document and then a OneNote version. Metalib was slow, and they found more students using Google Scholar than Metalib.

They started with a blank sheet of paper and as a group thrashed out their ideal product, without knowing about Summon: a first-class search engine, a ‘one-stop shop’, improved systems management, etc. They invited a number of suppliers in, showed them the vision and asked them to present their product against it; Huddersfield rated each one against the vision and reported to the Library Management Group. Summon was the clear fit.

Implementation: it starts off with a US conference call and a MARC21 mapping spreadsheet (they went with the defaults). They add a unique id to the 999|a field.

Be realistic with early implementation: e.g. the library catalogue and repository are their only two local databases. Be aware of when your LMS deletes things flagged for deletion; Huddersfield had early issues with this.

Do you want your whole catalogue on Summon? ebook/ejournal records etc.

Summon originally screen-scraped for holdings/availability (Aquabrowser does this for Sussex), which could bring the traditional catalogue to its knees.

Moving to 360 Link makes your life much easier if moving to Summon: only one knowledge base to maintain.

They asked Elsevier to create a custom file of their ScienceDirect holdings to upload to 360.

Huddersfield found activating journals in 360 a quick process.

The 360 API is more open than the Summon API (for customers only); you can basically build your own interface. Virginia are using it to produce a mobile-friendly version of their catalogue; Huddersfield used it to identify problem MARC records.

94% of Huddersfield’s subscribed journals are on Summon (no agreement with the following: BSP 80%, ScienceDirect, JSTOR… Westlaw/LexisNexis 55%). They now have an agreement with LexisNexis and JSTOR, and are in discussion with Elsevier. They manage this level of coverage for these resources by using other sources for the data (e.g. publishers for Business Source Premier, and A&I databases for ScienceDirect).

Dummy journal records for journal titles (print and e) so that they are easily found on Summon. See this example.

Can recommend specific resources (‘you might be interested in ACM Digital Library’), can be useful for subjects like Law.

Summon at Huddersfield now has 60 million items indexed (see left-hand side for a breakdown). Judging by this, Summon seems to have 575 million items indexed in total.

Survey results: users found the screens easy to understand, and many (43%) refined their results. Dave thinks that Google now having facets on the left may increase facet usage. For 82%, results were relevant to their research topics.

They will go live in July. Currently working on training materials and staff training. Considering adding archives and Special Collections in the future.

Annette Coates, Digital Services Manager, Northumbria Uni.

She gave a history of e-resource provision from 2005 onwards: WebFeat (they brand it NORA, a name they are keeping for Summon). ‘We have the same issues with federated search that everyone else has.’ Both Northumbria and Huddersfield are keeping a separate A-Z list for e-resources (Northumbria are using LibGuides, like Sussex).

User evaluation: is it improving the user’s search experience? How can we improve it further? For the NORA user survey: timing is important, getting people involved, incentives, capturing the session. They will use the user feedback in a number of ways – ‘triangulate to ensure depth’, use good quotes as a marketing tool (including to library staff), feed back good and bad points to Serials Solutions, and use it to improve the way they show it to others…

Northumbria Summon implementation timeline

Q&A

Focus groups – any guidance?
They gave very little guidance in the focus groups, and let people play with it.

What is the position regarding authentication?

Northumbria use Citrix, and will be ‘Shibbolising’ their 360 Link.

Huddersfield channel as much as possible through EZproxy; they don’t have Shibboleth. They promote usage through their portal, which authenticates users.

No Shibboleth integration at the moment.

(There was discussion about how Summon may mean you can stop trying to add journal records, and how it can raise lots of questions… should Summon be the interface on your catalogue kiosks?)

You can send a list of ISSNs to Serials Solutions to see matches, to find out what your coverage would be.

There was a very vague indication that OPAC integration may be on the cards for Summon. This is an important thing IMO.

There were a number of comments about library staff being far more critical than users.

Summon ingests records (MARC) from the LMS four times a day, using the DLF standard for getting holdings data from the LMS (this is a good thing). Huddersfield wrote the DLF protocol code.

Q: Are Serials Solutions (ProQuest) struggling to get metadata from their direct competitors?

A (Serials Solutions): Ebsco is the main one, but we go direct to publishers. And for Elsevier, we are able to index it from elsewhere (and are in talks with them).

Q: For Lexis and Westlaw, where there is only 50% coverage, how do students know to go elsewhere (i.e. direct to the resource)?

A: For law students, they point them to the e-resource pages (wiki) as well as Summon, to promote direct access. Also (and perhaps more importantly) the recommender will be able to recommend Lexis/Westlaw for law searches.

Q: Can you search the whole Summon knowledge base, not just the things you subscribe to?
Yes.

Q: Are there personalisation options (saving lists, items, marking records)?
These may come in the future; Summon are thinking about it.

Mystery Solved

We recently (well, last summer) launched Aquabrowser as our main library catalogue. We provided a feedback link for people to comment on the new interface, as we were keen to pick up on functionality it lacked or issues we may not have thought of. You can see the feedback link on the green bar on the right; it asks the user to log in, and then provides a feedback form to leave a message.

Library search/discovery apps : intro

There’s a lot of talk in the Library world about ‘next generation catalogues’, library search tools and ‘discovery’. There’s good reason for this talk: in this domain the world has been turned on its head.

History in a nutshell:

  • The card catalogue became the online catalogue, the online catalogue let users search for physical items within the Library.
  • Journals became online journals. Libraries needed to let users find the online journals they subscribed to through various large and small publishers and hosts. They built simple in-house databases (containing journal titles and links to their homepages), or added them to the catalogue, or used a third party web based tool. As the number of e-journals grew, most ended up using the last option, a third party tool (which could offer other services, such as link resolving, and do much of the heavy lifting with managing a knowledge base).
  • But users wanted one place to search. Quite understandable. If you are after a journal, why should you look in one place to see if there is a physical copy, and another place if they had access to it online. Same with books/ebooks.
  • So libraries started to try and find ways to add online content to catalogue systems in bulk (which weren’t really designed for this).
[Screenshot: Aquabrowser – Uni Sussex beta catalogue]

The online catalogues (OPACs) were simple web interfaces supplied with the much larger library management system (ILS or LMS) which ran the backend the public never saw. These were nearly always slow, ugly, unloved and not very useful.

A couple of years ago(ish), we saw the birth of the next generation catalogue, or search and discovery tools. I could list them, but the Disruptive Technology Library Jester does an excellent job here. I strongly suggest you take a look.

Personally, I think the first I heard about was Aquabrowser: at the time, a new OPAC which was miles ahead of those supplied with Library systems and was (I think) unique as a web catalogue interface not associated with a particular system and, shock, not from an established library company. The second system I heard about was probably Primo from Ex Libris. At first I didn’t understand what it was: it sounds like Metalib (another product from the same company, which cross-searches various e-resources) – is Primo replacing it? Or replacing the OPAC? It took a while to appreciate that this was something that sat on top of the rest. From then: VuFind, LibraryFind and more.

While some were traditional commercial products (Primo, Encore, Aquabrowser), many more were open source solutions, a number of which were developed at American libraries, often built on common (and modern) technology stacks such as Apache Solr/Lucene, Drupal, PHP/Java, MySQL/Postgres etc.
[Screenshot: Primo – British Library]

In the last year or so a number of major libraries have started to use one of these ‘discovery systems’, for example: the BL and Oxford using Primo, the National Libraries of Scotland & Wales and Harvard purchasing Aquabrowser, and the LSE trying VuFind. At Sussex (where I work) we have purchased and implemented Aquabrowser. We’ve added data enrichments such as tables of contents (searchable and visible on records), book covers, and the ability to tag and review items (tagging/reviewing has been removed for various reasons).

It would be a mistake to put all of these into one basket. Some focus on being an OPAC replacement, others on being a unified search tool, searching both local and online items. Some focus on social tools, tagging and reviewing. Some work out of the box, others are just a set of components which a library can sew together, and some are ‘SaaS’.

It’s an area that is fast changing. Just recently an established library web app company announced a forthcoming product called ‘Summon’, which takes searching a library’s online content a step further.

So what do libraries go for? It’s not just potentially backing the wrong horse, but backing the wrong horse when everyone else has moved on to dog racing!

And within all this it is important to remember what users actually want. From the conversations and articles I’ve read, they want a Google search box, but one which returns results from trusted sources and academic content – whether they are looking for a specific book, a specific journal, a reference/citation, or one or many keywords. And not just one which searches the metadata, but one which brings back results based on the full text of items as well. There are some who worry that too many results are confusing; as Google proves, an intelligent ranking system makes the number of results irrelevant.

Setting up (and even reviewing) most of these systems takes time, and if users start to add data (tags, reviews) to one system, then changing could cause problems (so should we be using third-party tag/rating/review systems?).

You may be interested in some other articles I’ve written around this:

There’s a lot of talk about discovery tools, but what sort to go for, and who to back? And many issues have yet to be resolved. I’ll come on to those next…