MARC Tools & MARC::Record errors

I know next to nothing about MARC, though being a shambrarian I have to fight it sometimes. My knowledge is somewhat binary: absolutely nothing for most fields/subfields/tags, but ‘fairly OK’ for the bits I’ve had to wrestle with.

[If you don’t know that MARC21 is an ageing bibliographic metadata standard, move on. This is not the blog post you’re looking for]

Recent encounters with MARC

  • Importing MARC files into our Library System (Talis Capita Alto), mainly for our e-journals (so users can search our catalogue and find a link to a journal if we subscribe to it online). Many of the MARC records were of poor quality and often did not even state that the item was (a) a journal or (b) online. Additionally, Alto will only import a record if there is a 001 field, even though the first thing it does is move the 001 field to the 035 field and create its own. To handle these I used a very simple script – using MARC::Record – to run through the MARC file and add a 001/006/007 where required (see the sketch after this list).
  • Setting up sabre – a web catalogue which searches the records of both the University of Sussex and the University of Brighton – we needed to pre-process the MARC records to add extra fields, in particular a field to tell the software (vufind) which organisation the record came from.
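For illustration, here’s a rough sketch of that sort of pre-processing using MARC::Batch and MARC::Record. It is not our actual script: the filenames, the ‘TEMP’ 001 values and the choice of a 999 field for the owning institution are all assumptions for the example.

    #!/usr/bin/perl
    # Sketch only: add a 001 where missing and a local field naming the
    # source institution, then write the records back out.
    use strict;
    use warnings;
    use MARC::Batch;
    use MARC::Field;

    my $batch = MARC::Batch->new( 'USMARC', 'ejournals.mrc' );
    open my $out, '>', 'ejournals-fixed.mrc' or die $!;

    my $n = 0;
    while ( my $record = $batch->next() ) {
        $n++;

        # Alto refuses to import a record with no 001, so invent one if needed
        unless ( $record->field('001') ) {
            $record->insert_fields_ordered(
                MARC::Field->new( '001', sprintf( 'TEMP%06d', $n ) ) );
        }

        # a local field telling vufind which organisation the record came from
        $record->append_fields(
            MARC::Field->new( '999', ' ', ' ', a => 'University of Sussex' ) );

        print {$out} $record->as_usmarc();
    }
    close $out;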

Record problems

One of the issues was that not all the records from the University of Brighton were present in sabre. Where were they going missing? Were they being exported from the Brighton system? Copied to the sabre server OK? Being output by the Perl script? Lost during the vufind import process?
To answer these questions I needed to see what was in the MARC files; the problem is that MARC is a binary format, so you can’t just fire up vi to investigate. The first tool of the trade is a quick script using MARC::Record to convert a MARC file to a text file. But this wasn’t getting to the bottom of it. This led me to a few PC tools that were of use.
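That quick script is something along these lines (a sketch; the filename is just an example):

    #!/usr/bin/perl
    # Dump each record in a MARC file as plain text so it can be read in an editor.
    use strict;
    use warnings;
    use MARC::Batch;

    my $batch = MARC::Batch->new( 'USMARC', 'sabre.mrc' );
    my $count = 0;
    while ( my $record = $batch->next() ) {
        $count++;
        print "=== Record $count ===\n";
        print $record->as_formatted(), "\n\n";
    }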

PC Tools

MarcEdit : Probably the best known PC application. It allows you to convert a MARC file to text, and contains an editor as well as a host of other tools. A good Swiss army knife.
MARCView : Originally from Systems Planning and now provided by OCLC. I had not come across MARCView until recently. It allows you to browse and search through a file containing MARC records, though the browsing element does not work on larger files.
[screenshot: MARCView]

USEMARCON is the final utility. It comes with a GUI interface, and both can be downloaded from the National Library of Finland. The British Library also have some information on it. Its main use is to convert MARC files from one type of MARC to another, something I haven’t looked into, but the GUI provides a way to delve into a set of MARC records.

Back to the problem…

So we were pre-processing MARC records from two Universities before importing them into vufind, using a Perl script which had been supplied by another University.

It turned out the script was crashing on certain records, and all records after the problematic record were not being processed. It wasn’t just that script: any Perl script using MARC::Record (and MARC::Batch) would crash when it hit a certain point.

By writing a simple script that just printed out each record, we could at least see which record came immediately before the one causing the crash (i.e. the last in the list of output). This is where the PC applications were useful: once we knew the record before the problematic record, we could find it using the PC viewers and then move to the next record.

The issue was certain characters (here in the 245 and 700 fields). I haven’t got to the bottom of what the exact issue is. There are two popular encodings – MARC-8 and UTF-8 – and which one a record uses can be designated in the Leader (position 9). I think Alto (via its marcgrabber tool) exports in MARC-8, but perhaps the characters in the record did not match the specified encoding.
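If you want to see which encoding each record claims to be in, a small sketch like the following reports Leader position 9 (‘a’ for Unicode/UTF-8, blank for MARC-8). The filename is an example, and of course on a file with broken records this loop will hit the same crash described below.

    #!/usr/bin/perl
    # Report the declared character encoding of each record (Leader/09).
    use strict;
    use warnings;
    use MARC::Batch;

    my $batch = MARC::Batch->new( 'USMARC', 'export.mrc' );
    my $count = 0;
    while ( my $record = $batch->next() ) {
        $count++;
        my $enc = substr( $record->leader(), 9, 1 );
        printf "Record %d: Leader/09 = '%s' (%s)\n",
            $count, $enc, $enc eq 'a' ? 'UTF-8' : 'MARC-8';
    }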

The title (245) on the original catalogue looks like this:

One workaround was to use a slightly hidden feature of MarcEdit to convert the file to UTF-8:

I was then able to run the records through the Perl script and import them into vufind.

But clearly this was not a sustainable solution. Copying files to my PC and running MarcEdit was not something that would be easy to automate.

Back to MARC::Record

The error message produced looked something like this:

utf8 "xC4" does not map to Unicode at /usr/lib/perl/5.10/Encode.pm line 174

I didn’t find much help via Google, though did find a few mentions of this error related to working with MARC Records.

The issue was that the script loops through each record; the moment it tries to start a loop iteration with a record it does not like, it crashes, so there is no way to check for certain characters in the record – by then it is already too late.

Unless we use something like exceptions. The closest thing Perl has to these out of the box is eval.

By putting the whole loop into an eval, if it hits a problem the eval simply passes the flow down to the ‘or do’ part of the code. But we want to continue processing the records, so this simply calls the eval again, until it reaches the end of the file. You can see a basic working example of this here.
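In outline it looks something like this – a minimal sketch of the shape of the loop, not the production script, and the filename is an assumption:

    #!/usr/bin/perl
    # Keep looping over the batch inside an eval; when a record dies mid-decode,
    # log it and go round again rather than stopping.
    use strict;
    use warnings;
    use MARC::Batch;

    my $batch = MARC::Batch->new( 'USMARC', 'records.mrc' );
    my $count = 0;
    my $done  = 0;

    until ($done) {
        eval {
            while ( my $record = $batch->next() ) {
                $count++;
                # ... pre-process or print the record here ...
                print $record->title(), "\n";
            }
            $done = 1;    # reached the end of the file without dying
        } or do {
            # a record MARC::Record could not cope with: note it and carry on
            warn "Problem after record $count, skipping: $@";
        };
    }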

So if you’re having problems processing a file of MARC records using Perl and MARC::Record / MARC::Batch, try wrapping it in an eval. You’ll still lose the records it cannot process, but it won’t stop in its tracks (and you can output an error log recording the record numbers of the records with errors).

Post-script

So, after pulling my hair out, I finally found a way to process a file which contains records that cause MARC::Record to crash. It had caused me much stress as I needed to get this working, and quickly, in an automated manner. As I said, the script had been passed to us by another University and it already did quite a few things, so I was a little unwilling to rewrite it in another language (though a good candidate would be PHP, as the vufind script was written in that language and didn’t seem to have these problems).

But in writing this blog post, I was searching using Google to re-find the various sites and pages I had found when I encountered the problem. And in doing so I had found this: http://keeneworks.posterous.com/marcrecord-and-utf 

Yes. I had actually already resolved the issue, and blogged about it, back in early May. I had somehow – worryingly – completely forgotten all of this. Unbelievable! You can find a copy of a script based on that solution (which is a little similar to the one above) here.

So there you are: a few PC applications and a couple of solutions to a Perl/MARC issue.

short urls, perl and base64

One of my many many many faults is coming up with (in my blinkered eyes – good) ideas, thinking about them non-stop for 24 hours, developing every little detail and aspect, then spending a few hours doing some of the first things required, then getting bored and moving on to something else. Repeat ad nauseam.

Today’s brilliant plan (to take over the world)

Over the weekend it was ‘tinyurl.com’-style services, and specifically creating my own one.

I had been using is.gd almost non-stop all week; various things at work had meant sending out URLs to other people, both formally and on services like Twitter. Due to laziness it was nearly always easier to just make another short URL for the real URL in question than to find the one I made earlier. It seemed a waste: one more short code used up when it was not really needed. The more slapdash we are in needlessly creating short URLs, the quicker they become not-so-short URLs.

Creating my own seemed like a fairly easy thing to do: a short domain name, a bit of PHP or Perl and a MySQL database, a bookmarklet button, etc.
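The core really is small. Something like this sketch (the database name, table and columns are hypothetical, not anything I actually built) stores a URL and gets back the auto-increment id that a short code could be derived from:

    #!/usr/bin/perl
    # Sketch: insert a long URL into a MySQL table and fetch its auto-increment id.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'DBI:mysql:database=shortener', 'user', 'password',
        { RaiseError => 1 } );

    my $long_url = 'http://www.example.com/some/very/long/address';

    my $sth = $dbh->prepare('INSERT INTO urls (url) VALUES (?)');
    $sth->execute($long_url);

    # the auto-increment id is what the short code gets derived from
    my $id = $dbh->last_insert_id( undef, undef, 'urls', 'id' );
    print "stored with id $id\n";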

Developing the idea

But why would anyone use mine and not someone else’s?

My mind went along the route of doing more with the data collected (compared to tinyurl.com and is.gd). I noticed that when a popular news item / website / viral link comes out, many people will be creating the same short URL (especially on Twitter).

What if the service said how many – and who – had already shortened that URL? What if it made the list of all shortened URLs public (like the Twitter homepage)? Think of the stats and information that could be produced with data about the URLs being shortened, the number of click-throughs, etc., maybe even tags. Almost by accident I’m creating a bookmarking social networking site.

This would require the user to log in (whereas most services do not), which is not so good, but it would give it a slightly different edge to others and help fight spam, and it is not so much of a problem if users only have to log in once.

I like getting all wrapped up in an idea as it allows me to bump into things I would not otherwise. Like? Like…

  • This article runs through some of the current short URL services
  • The last one it mentions is snurl.com. I had come across the name on Twitter, but had no idea it offered so much more, with click-through stats and a record of the links you have shortened. It also has the domain name sn.im (.im being the Isle of Man). Looks excellent (but they stole some of my ideas!)

    [screenshot: snurl.com]
  • Even though domains like is.gd clearly exist, it seems – from the domain registrars I tried – that you cannot buy two-letter .gd domains, though three-letter ones seem to start from $25 a year.
  • The .im domain looked like it could be good. But what to call any potential service??? Hang on… what about tr.im! What a brilliant idea. Fits. Genius. Someone had, again, stolen my idea. Besides, when I saw it could cost several hundred pounds, other top-level domains started to look more attractive.
  • tr.im, mentioned above, is a little like snurl.com. Looks good, though mainly designed to work with Twitter. Includes lots of stats. Both have a nice UI. Damn these people who steal my ideas and implement them far better than I ever could. :)
  • Meanwhile… Shortly is an app you can download yourself to run your own short URL service.
  • Oh, and in terms of user authentication, a PHP user class seemed worth playing with.
  • Writing the code seemed fairly easy, but how would I handle creating those short codes (the random-looking digits after the domain name)? They seem to increment while staying as short as possible.
  • Meanwhile I remembered an old friend and colleague from Canterbury had written something like this years ago, and look! He had put the source code up as well.
  • This was good simple Perl, but I discovered that it just used hexadecimal numbers as the short codes, which themselves are just the hex version of the DB auto-increment id. Nice and simple, but it would mean the codes become longer more quickly than with other schemes.
  • I downloaded the script above and quickly got it working.
  • I asked on twitter and got lots of help from bencc (who wrote the script above) and lescarr.
  • Basically the path to go down was base 64 (i.e. 64 digits in the number system, instead of the usual 10), which was explained to me with the help of an awk script in a tweet. I got confused for a while as the only obvious base64 Perl library actually converts text/binary for MIME email, and created longer, not shorter, codes than the original (decimal) id numbers created by the database.
  • I did find a CPAN Perl module to convert decimal numbers to base 64, called Math::BaseCnv, which I was able to get working with ease (there’s a small sketch of this after the list).
  • It didn’t take long to edit the script from Ben’s spod.cx site and add the base 64 conversion so that it produced short codes using lower case, upper case and numbers.
  • You can see it yourself – if I haven’t broken it again – at http://u.nostuff.org/
  • You can even add a bookmarklet button using this code
  • Finally, something I should have done years ago: I set up mod_rewrite to make the links look nice, e.g. http://u.nostuff.org/3
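For what it’s worth, the decimal-to-base-64 step boils down to something like this sketch (using a made-up id rather than a real database row):

    #!/usr/bin/perl
    # Convert a decimal id to a base-64 short code and back with Math::BaseCnv.
    use strict;
    use warnings;
    use Math::BaseCnv qw( cnv );

    my $id   = 123456;                 # e.g. the row id of the stored URL
    my $code = cnv( $id, 10, 64 );     # decimal -> base 64
    print "http://u.nostuff.org/$code\n";

    # and back again when someone follows the short link
    my $back = cnv( $code, 64, 10 );
    print "$code maps back to id $back\n";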

So I haven’t built my (ahem, brilliant) idea. Of course the very things that would have made it different (openly showing which URLs have been bookmarked, by whom, how many click-throughs, and tags) were the very things that would make it time-consuming. And sites like snurl.com and tr.im had already done such a good job.

So while I’m not ruling out creating my own really simple service (and in fact u.nostuff.org already exists), and I learned about mod_rewrite, base64 on CPAN, and a bunch of other stuff, the world is spared yet another short URL service for the time being.