short urls, perl and base64

One of my many many many faults is coming up with (in my blinkered eyes, good) ideas, thinking about them non-stop for 24 hours and developing every little detail and aspect. Then spending a few hours doing some of the first things required. Then getting bored and moving on to something else. Repeat ad nauseam.

Today’s brilliant plan (to take over the world)

Over the weekend it was ‘tinyurl.com’ services and specifically creating my own one.

I had been using is.gd almost non-stop all week; various things at work had meant sending out URLs to other people, both formally and on services like Twitter. Due to laziness it was nearly always easier to just make another short URL for the real URL in question than to find the one I made earlier. It seemed a waste: one more short code used up when it was not really needed. The more slapdash we are in needlessly creating short URLs, the quicker they become not-so-short URLs.

Creating my own seemed like a fairly easy thing to do: a short domain name, a bit of PHP or Perl and a MySQL database, a bookmarklet button, etc.

Developing the idea

But why would anyone use mine and not someone else's?

My mind went along the route of doing more with the data collected (compared to tinyurl.com and is.gd). I noticed that when a popular news item / website / viral comes out, many people will be creating the same short URL (especially on Twitter).

What if the service said how many – and who – had already shortened that URL? What if it made the list of all shortened URLs public (like the Twitter homepage)? Think of the stats and information that could be produced with data about the URLs being shortened, the number of click-throughs, etc., maybe even tags. Almost by accident I'd be creating a social bookmarking / social networking site.

This would require users to log in (whereas most services do not), which is not so good, but it would give the service a slightly different edge to the others and help fight spam, and it is not so much of a problem if users only have to log in once.

I like getting all wrapped up in an idea as it allows me to bump into things I would not otherwise. Like? Like…

  • This article runs through some of the current short URL services
  • The last one it mentions is snurl.com. I had come across the name on Twitter, but had no idea it offered so much more, with click-through stats and a record of the links you have shortened. It also has the domain name sn.im (.im being the Isle of Man). Looks excellent (but they stole some of my ideas!)

  • Even though domains like is.gd clearly exist, it seems – from the domain registrars I tried – that you cannot buy two-letter .gd domains, though three-letter ones seem to start from $25 a year.
  • The .im domain looked like it could be good. But what to call any potential service??? Hang on… what about tr.im! What a brilliant idea. Fits. Genius. Someone had, again, already stolen my idea. Besides, when I saw it could cost several hundred pounds, other top-level domains started to look more attractive.
  • tr.im, mentioned above, is a little like snurl.com. Looks good, though mainly designed to work with Twitter. Includes lots of stats. Both have a nice UI. Damn these people who steal my ideas and implement them far better than I ever could. :)
  • Meanwhile… Shortly is an app you can download to run your own short URL service.
  • Oh, and in terms of user authentication, the 'php user class' seemed worth playing with.
  • Writing the code seemed fairly easy, but how would I handle creating those short codes (the seemingly random characters after the domain name)? They seem to increment while staying as short as possible.
  • Meanwhile I remembered that an old friend and colleague from Canterbury had written something like this years ago, and look! He had put the source code up as well.
  • This was good simple Perl, but I discovered that it just used hexadecimal numbers as the short codes, which are simply the hex version of the DB auto-increment id. Nice and simple, but it means the codes become longer more quickly than with other encodings.
  • I downloaded the script above and quickly got it working.
  • I asked on Twitter and got lots of help from bencc (who wrote the script above) and lescarr.
  • Basically the path to go down was base 64 (i.e. a number system with 64 digits instead of the usual 10), which was explained to me with the help of an awk script in a tweet. I got confused for a while as the only obvious base64 Perl lib (MIME::Base64) actually encodes text/binary for MIME email, and so produced longer, not shorter, codes than the original (decimal) id numbers created by the database.
  • I did find a CPAN Perl module to convert decimal numbers to base 64, called Math::BaseCnv, which I was able to get working with ease (there is a small sketch of the idea just after this list).
  • It didn’t take long to edit the script from Ben’s spod.cx site and add the base-64 code so that it produced short codes using lower case letters, upper case letters and numbers.
  • You can see it yourself – if I haven’t broken it again – at http://u.nostuff.org/
  • You can even add a bookmarklet button using this code
  • Finally, something I should have done years ago: I set up mod_rewrite to make the links look nice, e.g. http://u.nostuff.org/3
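
For the record, here is a minimal sketch of the base-64 trick. It assumes Math::BaseCnv's default 64-character digit set, and the subroutine names and example id are mine for illustration, so this is the idea rather than the actual code behind u.nostuff.org.

#!/usr/bin/perl
# Sketch only: turn a database auto-increment id into a short base-64 code
# and back again, using Math::BaseCnv from CPAN. The module's default
# 64-character digit set is used; the exact characters may differ from
# what u.nostuff.org produces.
use strict;
use warnings;
use Math::BaseCnv;

sub id_to_code {
    my ($id) = @_;
    return cnv($id, 10, 64);    # decimal -> base-64 string
}

sub code_to_id {
    my ($code) = @_;
    return cnv($code, 64, 10);  # base-64 string -> decimal, for the DB lookup
}

my $id   = 125_000;             # pretend this came back from the SQL insert
my $code = id_to_code($id);
print "id $id becomes code $code, which decodes back to ", code_to_id($code), "\n";

The appeal is simply that the codes stay short for longer: four base-64 characters cover over 16 million ids, whereas four hex characters run out at 65,536.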

So I haven’t built my (ahem, brilliant) idea. Of course the very things that would have made it different (openly showing what URLs have been bookmarked, by whom, how many click-throughs they got, and tags) were the very things that would make it time-consuming. And sites like snurl.com and tr.im had already done such a good job.

So while I’m not ruling out creating my own really simple service (and in fact u.nostuff.org already exists), and I learned about mod_rewrite, base 64 on CPAN, and a bunch of other stuff, the world is spared yet another short URL service for the time being.

webpad : a web based text editor

So I have WordPress (and in fact Drupal, Joomla, mediawiki, Moodle, damn those Dreamhost 1-click installs) as a way of running my website.

But there are still many pages which are outside of a content management system. Especially simple web app projects (such as ircount and stalisfield) and old html files.

It can be a pain to constantly FTP in to the server or use SSH. Editing over SSH is especially awkward on a dodgy wireless connection, or when you want to close the lid of your MacBook.

But trying to find something to fit this need didn’t come up with anything suitable. Many hits were TinyMCE clones: WYSIWYG HTML editors that convert input into HTML, which is no good for editing code.

Webpad screenshot

Until I came across Webpad. It not only suited my needs perfectly, but it is well designed and implemented.

After a quick install (more or less just copying the files), you enter a specified username and password, and once authenticated you are presented with a line of icons at the top. Simply select the ‘open’ icon to browse to the file you wish to edit on your web server and you’re away!

It’s simple, yet well written and serves its purpose well. If there was one thing I would suggest for future development it would be improved file management functionality. You can create directories and delete files from the file-open dialog box, but I can’t see a way to delete directories, or move/copy files. Deleting directories would be of use, as many web apps (wikis, blogs, CMSs) require you to upgrade the software, edit a config file, and then delete the install directory, or similar.

Oh, and it’s free!

Check out webpad by Beau Lebens on dentedreality.com.au

2008 : nostuff.org under review

It takes someone with a grossly over-inflated ego, who thinks their website is a trillion times more important than it actually is, to try and write a review of the previous year. What sort of idiot does that, as if anyone will read it!

Hello.

Think of it as a school report, annual appraisal, or cheap channel 4 air time filler around the new year.

nostuff.org has grown in the last year, and so has the readership. Some of the posts were even read by humans.

nostuff.org/words started out like many a blog, rambling on about my oh-so-important thoughts on the latest news story, gadget, or (I confess) the software I installed on my laptop (in my defence, as no one read it, I was using the blog as a personal notepad as I reinstalled said laptop).

In 2008 (well late 2007 if truth be told, but that would ruin the whole thing, is that what you want? do you?) something happened: some original content appeared on nostuff. I didn’t even copy it from somewhere else.

Part of this was due to me getting a little more geeky than I had been in a while. The web had become a read-only resource for me: I consumed (the entire) Wikipedia, other blogs and news sites, often becoming an expert on how to apply for a parking permit in some random US state. I was also watching too much crap TV (why was Quizcall so addictive?). The telly went off, Radio 4 went on, and I decided to actually do something (perhaps, somewhat belatedly, taking the advice of Why Don’t You).

ircount / Repository Statistics

One of these things I had been working on for well over a year. I had written a simple script which connected to ROAR each week and collected the count of records for each UK repository. I finally got around to writing a simple web interface to show all of this, which utilised the amazingly simple to use Google Charts. I announced this in March 2008, which was in no way set to coincide with the Open Repositories conference I was attending a few days later.

This went down well, and was probably the first time well-known websites had linked to me (not that I’m obsessed with hits and being linked to or anything). Peter Suber’s highly regarded Open Access News linked here; it felt good (I’m embarrassed to admit this, but I actually have a delicious.com ‘ego’ tag for this). In October I released an updated version of the site, now on my own website, which included stats for all Institutional Repositories, not just those in the UK.

A future development will be to report on ‘fulltext’ items, not just the number of records, though this will be a departure from just using ROAR as a datasource and will involve me connecting to individual repositories myself. In November I carried out a bit of research by playing with Tim Brody’s Perl library for connecting to OAI-PMH repositories.

Book catalogue using Talis Platform (and mashedlibraries)

The Talis Panlibus blog and ‘Talis Technical Development’ have a lot of posts discussing APIs and functionality, all of which was hard to visualise at the time (probably because a lot of the infrastructure was in active development). The Talis Platform is a place to store RDF data accessible via the web (like Amazon S3 is a place to store files accessible via the web). It has separate stores for different users/applications. I first explored this in the summer of 2007, and my steps at the time now look somewhat simple and naive!

In February 2008 I released the first version of my (simple) interface (use this link for the current version). This searches the ‘ukbib’ store, which is a Talis Platform store holding RDF data for book records. A separate store has holdings information for many libraries. The two can be linked via the ISBN (used a bit like a primary key in a relational database). The design of the Platform is such that you can merge multiple sources of data (from across the web) and it will bring back a single response with the data from the various sources.

In March I added the just-released Google Book API, both the static and dynamic versions. However the dynamic version (which should also show a book cover) only seems to work for a small number of books.

Mashedlibraries: In November I attended the excellent Mashedlibraries day. During the afternoon we had an opportunity to work on various things. I decided to provide a holdings page, with a Google map showing the location of libraries which hold the item. With no experience of the Google Maps API, nor JavaScript, nor how to get the information out of the Platform (and Talis Silkworm), this was no small task. Luckily Rob Styles worked with me and provided a huge amount of patient help.

I’ve also added ‘Google Friend Connect’ to the site, though it currently doesn’t really work, as comments appear for all items, rather than just for the item they were added to.

UK Universities

I’ve always (well, for the last few years, not so much when I was 5) taken an interest in the University league tables published by the national press. I often make the mistake of thinking that older and Russell Group universities are ‘better’ than others, but these league tables often show otherwise. The problem is they often come up with different results. But what if you combined all the scores from these tables to get an overall average, hopefully ironing out the oddities of particular methodologies?

I spent a bit of time adding the league tables of various papers (and international rankings) to a spreadsheet, and sticking it on the web – along with some comments – as both an Excel file and a Google spreadsheet. Before I published it, I also asked readers to suggest their top 20, to compare those we perceive to be ‘top’ with those that come top in these league tables.

This clearly hit a nerve. Or more to the point, hit a popular Google search term!

list of search terms for nostuff.org (most about Universities)

Annoyingly I was getting most hits to the post which asked readers to submit their own guesses at the top 20, and not to the post which listed the carefully compiled top 50 based on data from various sources (which is presumably of more use to most people carrying out such searches). This seemed to be due to the titles of the two posts, so, in my first ever attempt at SEO, I changed the title of the latter, and it then received more hits from search engines (I also added some bold links from the former ‘reader guesses’ post).

These posts have brought in a lot of readers who are searching Google to find out about Universities in the UK, and have had a massive effect on the number of hits the blog has received:

nostuff blog hits 2008 by month

A number of people (mainly outside the UK) have asked for recommendations, especially for particular subjects. I’ve avoided answering this directly, as I have no knowledge or experience to be able to answer it, though I have suggested the HERO and UCAS websites, as well as the Guardian and Times Higher Education sections.

I hadn’t predicted this interest from potential students, though it seems so obvious now. Still, it is good to see that something I did for my own interest may have been of some use to others.

Open Repositories 2008

As mentioned above, I attended the Open Repositories 2008 conference. In previous years it had been held across the globe, though this time it was held just along the south coast in Southampton. My Uni kindly funded my attendance, so long as I went via the cheapest rail ticket, for two days, with no hotel or expenses. Still, it meant I got to hear the shipping forecast for the first time ever as I rose early both mornings.

I had only just started using Twitter, and it was the first time I had blogged (and tweeted?) about a conference. In fact it was probably the first time I had really blogged about work stuff. It sounds very cynical (and I feel cheap saying it), but it helped to attract the attention of repository/library tech people to the blog.

Conferences

I also attended an event about a project – RIOJA – looking at how an ‘overlay’ journal could be implemented. That is, if repositories provide a way to access content (including content not published in a traditional journal), then an overlay journal could provide a way to peer review and categorise a subset of the items published in repositories, linking to the articles already available online but, by doing so, showing that an essential peer review (quality) process has taken place. The day also had speakers talking about other novel journal publishing concepts, and the REF.

It was the first time I had used CoveritLive, which Andy Powell has used (very successfully) many times. It was also the first time I had used a T-Mobile USB 3G network stick, which I, and others, had persuaded our library to purchase. I set it up on the Mac on the train to Cambridge, and I was glad I did: it meant I had a network connection throughout (power, as ever, was a different issue). My live blog is here, and also on the even less popular Sussex repository blog.

In December I attended an event called ‘Sitting on a Goldmine’ (part of the JISC TILE project), held in London. This was a fantastic day with great speakers and attendees, looking at how we can make use of usage data and user-generated data to create new services. My write-up is here.

And as previously mentioned, I also attended the Mashed Libraries event.

Mobile phones

I posted three articles about mobile phones: about why they are badly named, about the phones I have owned (not that you care), and my musings about getting a new smartphone (I’ve now got an iPhone in front of me; it may be common, but my god it is good).

Nostuff, web hosting and wordpress

Jisc Library Management System Review

I found the report with the above name via the Talis blog, found some time to read it, and made some notes, which I randomly decided to store on this blog. Turns out this was quite popular and quite a few people accessed it via Google (and via Tom Roper’s blog).

Other bits

Reviews

The shorts

Templates and look

In November I looked at some new plug-ins and themes for WordPress. The theme I currently use is Greening; I’ve modified it a bit to increase the font size, show the tags used for each post, and add ads. WordPress’ excellent widget system also comes in very handy.

Ads: I’ve had ads since around 2005. So far I have made (on paper) around $30 (I only get the money once I reach $100, which at this rate could take many years!). An added element to this is the large fluctuation in the pound/dollar exchange rate: a few months ago $100 was worth £50, now it is almost £100, so when that cheque gets sent makes a difference!

The ads have always been more of an experiment than hard and fast capitalism (but the extra cash is still appealing). I’ve tried to place them with a balance between visibility and not being too annoying (a small one at the top right, some at the bottom, and a few on the left), and hope no one objects too much.

The continuous increase in visits over the last year has seen an increase in click-throughs (which generate the revenue), especially in the last few months with the posts about University rankings in the UK.

Stats

I mentioned stats above. Over the last year visits/hits have been going up month on month.

2008 stats for nostuff.org blog

I collect stats via the WordPress Stats plugin, via Google Analytics, and via some rather basic web server reports. Of course they all report different numbers but more or less show the same thing; the table above is from the WordPress Stats plugin.

So 2008 was, relatively speaking, quite a good year, just don’t expect the same for 2009!

“Sitting on a gold mine” – improving provision and services for learners by aggregating and using ‘learner behaviour data’

I’m at a workshop today called “Sitting on a gold mine” – improving provision and services for learners by aggregating and using ‘learner behaviour data’ (it rolls off the tongue!), which is part of the wider JISC TILE project looking at, in a nutshell, how we can use data collected from users and user activity to provide useful services, and the issues and challenges involved (and some Library 2.0 concepts as well). As ever, these are just my notes; at some points I took more notes than others, there will be mistakes and I will badly misquote the speakers, so please keep this in mind.

There’s quite a bit of ‘workshop’ discussion coming up, which I’m a little tentative about: I can rant on about many things for hours, but I’m not sure I have a lot of views on this other than ‘this is good stuff’!

Pain Points & Vision – David Kay (TILE)

David gave an overview of the TILE project. Really interesting stuff, lots covered and good use of slides, but quite difficult to get everything down here.

TILE has three objectives

  • Capture scope/scale of Library 2.0
  • Identify significant challenges facing library system developments
  • Propose high level ‘library domain model’ positioning these challenges in the context of library ‘business processes’

You can get context from click streams; this is done by the likes of Amazon and e-music providers.

E.g. First year students searching for Napoleon also borrowed… they downloaded… they rated this resource… etc.

David referred to an idea of Lorcan Dempsey : we get too bogged down by the mechanics of journals and provision without looking at the wider business processes in the new ‘web’ environment.

Four ‘systems’ in the TILE architecture: library systems (LMS, cross search, ERM), the VLE, repositories, and associated content services. We looked at a model of how these systems interact, with the user in the middle.

Mark Tool (University of Stirling)

Mark (who used to be based down the road at the University of Brighton) talked about the different systems Stirling (and the other universities he has worked at) use, and how we all don’t really know how users use them. Not just now, but historical trends, e.g. are users using e-books more now than in the past?

These questions are important to lecturers as they point students to resources and systems, but what do users actually use, and how do they use them? There is also a quality issue: are we pointing them to the right resources? Are we getting good value for money, e.g. licence and staff costs for a VLE?

If we were to look at how different students look at different resources, would we see that ‘high achievers’ use different resources to weaker students? Could/should we point the weaker students to the resources that the former use? Obvious privacy implications.

Also could be of use when looking at new courses and programmes and how to resource them. Nationally, it might help guide us to which resources should be negotiated for at a national level.

Danger:

  • small crowd -> small dataset  -> can be misleading (one or two people can look like a trend)
  • HEIs are very different to each other.

Thinks we should run some smallish pilots and then validate the data collected by some other means.

Joy Palmer – MIMAS

Will mainly be talking about COPAC, which has done some really interesting stuff recently in opening up their data and APIs (see the COPAC blog).

What are COPAC working on:

  • Googlisation of records (will be available on Google soon)
  • Links to Digital content
  • Service coherency with zetoc and suncat
  • Personalisation tools / APIs
    • ‘My Bibliography’
    • Tagging facilities
    • Recommender functions
    • ummm other stuff I didn’t have time to note
  • Generally moving from a ‘Walled garden’ to something that can be mashed up [good!]

One example of a service from COPAC is the ‘My bibliography’ (or ‘marked list’ ) which can be exported in the ATOM format (which allows it to be used anywhere that takes an ATOM feed). These lists will be private by default but could be made public.

Talked about the general direction and ethos of COPAC development with lots of good examples, and the issues involved. One of the slides was titled:  From ‘service’ to ‘gravitational hub’ which I liked. She then moved on to her (and MIMAS/COPAC’s) perspective on the issue of using user generated data.

Workshop 1.

[Random notes from the group I was in, mainly the stuff I agreed with(!); there were three groups.] Talking about: should we do this? The threats (and which groups of people are affected by them). Good discussion. We talked about how these things could be useful, and why some may be averse to or cautious of it (including privacy, and treading on others’ areas – IT/the library telling academics that what they recommend to students is not being used, i.e. telling them they are doing it wrong, creates friction). Should we do this? It’s a blunt tool and may show misleading trends. But we need to give it a go and see what happens. Is it ‘anti-HE’ to be offering such services (i.e. recommending books)? No no no! Should we leave it to the likes of Google/Amazon? No, this is where the web is going. But there is real-world experience of things to be aware of, e.g. a catalogue ranking an edition of a book highly due to high usage led to a newer edition appearing further down the list. [Lots more discussion, I forget.]

Dave Pattern – Huddersfield.

[Dave is the systems librarian at Huddersfield, who has better ideas than me, then implements them better than I ever could, in a fraction of the time. He’s also a great speaker. I hate him. Check out his annoyingly fantastic blog]

Lots of data is generated just by us and our users doing what we need to do; we can dig into this. Dave started off talking about supermarket loyalty cards. Supermarkets were doing ‘people who bought this also bought’ 10 or more years ago. We can learn from them; we could do this.

We’ve been collecting circulation data for years; why haven’t we done anything (bar really basic stuff) with it?

Borrowing suggestions (‘people who borrowed this also borrowed’) are working at Huddersfield; librarians report it working well and suggesting the same books as they would.

Personalised suggestions: if you log in, it looks at what you have borrowed, and then at what other items were borrowed by those who borrowed the same things.

Lending paths: paths which join books together, potentially to predict what people will borrow and when particular books will be in high demand.

The library catalogue shows some book usage stats when used from a library staff PC (brilliant idea!); this can be broken down by different criteria (e.g. the courses borrowers are on).

Other functionality: keyword suggestions, and common zero-results keywords (e.g. newspapermen, asbo, disneyfication). Huddersfield have found digging into these useful.

He’s released anonymised circulation data as XML, with the approval of the library, for others to play with, and hopes other libraries will do the same. (This is a stupidly big announcement; it feels insulting to put it as just one sentence like this, perhaps I should enclose it in the <blink> tag!?) See his blog post.

(Note to self: don’t try to download a 50MB file via a 3G network USB stick – bad things happen to the MacBook.)

Mark van Harmelen

Due to said bad things I was slightly distracted during part of this talk. Being a man, I completely failed to multi-task.

This was an excellent talk (at a good level) about how the TILE project is building prototype/real system(s), with some really good models of how this will/could work. So far they have developed harvesting of data from institutions (and COPAC/similar services), adding ‘group use’ to their database; a searcher known to be a ‘chemistry student’ and ‘third year’ can then get relevant recommendations based on data from the groups they belong to. [I’m not doing this justice, but there were some really good models and examples of this working]

David Jennings – Music Recommender systems

First off he referred to the Googlezon film (never heard of this before) and the idea of Big Brother in the private sector, then moved on to talk about (the concept of) iPods which predict the music you want to hear next based on your mood, and even matchmaking based on how you react to music.

Discovery: we search, we browse, we wait for things to come along, we follow others, we avoid things everyone else listens to, etc.

He talked about Flickr’s (unpublished) popularity ranking as a way to bring things to the front based on views, comments, tags, etc.

Workshop 2:

Some random comments and notes from the second discussion session (from all groups)

One university’s experience was that just ‘putting it out there’ didn’t work: no one added tags to the catalogue; the conclusion was the need for a community.

Coldstart problem: new content not surfacing with the sort of things being discussed here.

Is a subject librarian’s (or researcher’s) recommendation of the same value as an undergrad’s?

Will library directors agree to library data being released in the same way as Huddersfield’s, even though it is anonymised? They may fear the risks and issues that could result, even if we/they are not sure what those risks are (will an academic take issue with a certain aspect of the released data?).

At a national level, if academics used these services to create reading lists, it may result in a homogenisation of teaching across the UK. There is also a risk of students’ reading focusing on a small group of items/books; we could end up with four books per subject!

Summary

This was an excellent event, and clearly some good and exciting work is taking place. What are my personal thoughts?…

This is one of those things that, once you get discussing it, you’re never quite sure why it hasn’t already been done, especially with circulation data. There’s a wide scope, from local library services (book recommendation) to national systems which use data from VLEs, registry systems and library systems. A lot of potential functionality, both in terms of direct user services and in informing HE (and others) to help them make decisions and tailor services for users.

Challenges include: privacy, copyright, resourcing (money) and the uncertainty of (and aversion to) change. The last one includes a multitude of issues: will making data available to others lead to a budget reduction for a particular department, will it create friction between different groups (e.g. between academics and central services such as Libraries and IT)?

Perhaps the biggest fear is not knowing what demons this will release. If you are a Library Director, and you authorise your organisation’s data to be made available – or the introduction of a service such as the ones discussed today – how will it come back to haunt you in the future? Will it lead to your institution making (negative) headlines? Will a system/service supplier sue you for giving away ‘their’ data?  Will academics turn on you in Senate for releasing data that puts them in a bad light? ‘Data’ always has more complex issues than ‘services’.

In HE (and I say this more after talking to various people at different institutions over the last few years) we are sometimes too fearful of the 20% instead of thinking about the 80% (or is that more like 5/95%?). We will always get complaints about new services and especially about changes. No one contacts you when you are doing well (how many people contact Tesco to tell them they have allocated the perfect amount of shelf space to bacon?!). We must not let complaints dictate how we do things or how we allocate time (though of course we should not ignore them; relevant points can often be found).

Large organisations – both public and private – can be well known for being inflexible. But for initiatives like this (and those in the future) to have a better chance of succeeding we need to look at how we can bring down the barriers to change. This is too big an issue to get into here, and the reasons are both big and many, from too many stakeholders requiring approval to a ‘wait until the summer vacation’ philosophy, from long-term budget planning to knock-on effects across the organisation (a change in department A means the training/documentation/website of department B needs to be changed first). Hmmmm, I seem to have moved away from TILE and on to a general rant offending the entire UK HE sector!

Thinking about Dave Pattern’s announcement, what will it take for other libraries to follow? First, the techy stuff: he has (I think) created his own XML schema (is that the right term?) and will be working on an API to access the data. The bad thing would be for a committee to take this and spend years to finally ‘approve’ it. The good thing would be for a few metadata/XML type people to suggest minor changes (if any) and endorse it as quickly as possible (which is no disrespect to Dave). Example: will the use of UCAS codes be a barrier to international adoption? (Can’t see why, just thinking out loud.) There was concern at the event that some library directors would be cautious in approving such things. This is perhaps understandable. However, I have to say I don’t even know who the Director of Huddersfield Information Services is, but my respect for the institution and the person in that role goes about as high as it will go when they do things like this. They have taken a risk, taken the initiative and been the first to do something like this (to the best of my knowledge) worldwide. I will buy them a beer should I ever meet them!

I’ll be watching any developments (and chatter) that result from this announcement, and thinking about how we can support/implement such an initiative here. In theory, once (programming) scripts have been written for a library system, it should be fairly trivial to port them to other customers of the same software (the work will probably include mapping departments to UCAS codes, and the way user affiliation to departments is stored may vary between universities). Perhaps universities could club together to work on creating the code required? I’m writing this a few hours after Dave made his announcement and already his blog article has many trackbacks and comments.

So in final, final conclusion. A good day, with good speakers and a good group of attendees from mixed backgrounds. Will watch developments with interest.

[First blog post using WordPress 2.7. Other blogs covering the event are Phil’s CETIS blog, and Dave Pattern has another blog entry on his talk. If you have written anything on this event then please let me know!]

Mashed Libraries

Exactly a week ago I was coming home from Mashed Libraries in London (Birkbeck).

I won’t bore you with details of the day (or, more to the point, I’m lazy and others have already done it better than I could (of course, I should have made each one of those words a link to a different blog, but I’m laz… oh never mind)).

Thanks to Owen Stephens for organising, UKOLN for sponsoring and Dave Flanders (and Birkbeck) for the room.

During the afternoon we all got to hacking with various sites and services.

I had previously played around with the Talis Platform (see long-winded commentary here; god, it seems weird that at the time I really didn’t have a clue what I was playing with, and it was only a year ago!).

I built a basic catalogue search based on the ukbib store. I called it Stalisfield (which is a small village in Kent).

But one area I had never got working was holdings, so I decided to set to work on that. Progress was slow, but then Rob Styles sat down next to me and things started to move. Rob helped create Talis Cenote (which I nicked most of the code from) and generally falls into that (somewhat large) group of ‘people much smarter than me’.

We (well, I) wanted to show which libraries had the book in question, and plot them on a Google Map. So once we had a list of libraries we needed to connect to another service to get the location of each of them. The service which fitted this need was the Talis Directory (Silkworm). This raised a point with me: it was a good job there was a Talis service which used the same underlying ID codes for the libraries, i.e. the holdings service and the directory both used the same ID number. It could have been a problem if we had needed to get the geo/location data from something like OCLC or Librarytechnology.org: what would we have searched on? A library’s name? Hardly a reliable term to use (e.g. the University of Sussex Library is called ‘UNIV OF SUSSEX LIBR’ in OCLC!). Do libraries need a code which can be used to cross-reference them between different web services (a little like ISBNs for books)?

Using the Talis Silkworm Directory was a little more challenging than first thought, and the end result was a very long URL which used SPARQL (something which looks like a steep learning curve to me!).

In the meantime, I signed up for Google Maps and gave myself a crash course in setting it up (I’m quite slow to pick these things up). So we had the longitude and latitude co-ordinates for each library, and we had a Google Map on the page; we just needed to connect the two.

Four people at Mashed Libraries trying to debug the last little bit of my code.

Time was running short, so I was glad to take a back seat and watch (and learn) while Rob went into speed-JavaScript mode. This last part proved to be elusive. The PHP code which was generating the JavaScript was just not quite working. In the end the (final) problem was related to the order I was outputting the code in, but we were out of time, and this required more than five minutes.

Back home, I fixed this (though I never would have known I needed to do this without help).

You can see an example here, and here and here (click on the link at the top to go back to the bib record for the item, which, by the way, should show a Google Book cover at the bottom, though this only works for a few books).

You can click on a marker to see the name of the library, and the balloon also has a link which should take you straight to the item in question on the library’s catalogue.

It is a little slow, partly due to my bad code and partly due to what it is doing:

  1. Connecting to the Talis Platform to get a list of libraries which have the book in question (quick)
  2. For each library, connect to the Talis Silkworm Directory and perform a SPARQL query to get back some XML which includes the geo co-ordinates (geo details are not available for all libraries; there is a rough sketch of this step just after this list).
  3. Finally generate some javascript code to plot each library on to a Google map.
  4. As this last point needs to be done in the <head> of the page, it is only at this point that we can push the page out to the browser.
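
To give an idea of what step 2 involves, here is a rough Perl sketch of querying a SPARQL endpoint over HTTP and pulling latitude/longitude values out of the standard SPARQL XML results format. The endpoint URL, the query and the geo property names are placeholders of my own rather than the real Silkworm details (and the real pages do all this in PHP); it is just meant to show the shape of the request.

#!/usr/bin/perl
# Sketch: ask a SPARQL endpoint for the lat/long of a library and print them.
# The endpoint URL and the predicates below are illustrative placeholders,
# not the actual Talis Silkworm Directory details.
use strict;
use warnings;
use LWP::UserAgent;
use URI;
use XML::LibXML;

my $endpoint    = 'http://example.org/directory/sparql';   # placeholder
my $library_uri = shift @ARGV or die "usage: $0 <library-uri>\n";

# Placeholder query using the common WGS84 geo vocabulary.
my $query = <<"SPARQL";
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT ?lat ?long WHERE {
  <$library_uri> geo:lat ?lat ;
                 geo:long ?long .
}
SPARQL

my $uri = URI->new($endpoint);
$uri->query_form(query => $query);

my $ua  = LWP::UserAgent->new;
my $res = $ua->get($uri, Accept => 'application/sparql-results+xml');
die 'SPARQL request failed: ' . $res->status_line unless $res->is_success;

# Parse the standard SPARQL XML results format.
my $doc = XML::LibXML->load_xml(string => $res->decoded_content);
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs(sr => 'http://www.w3.org/2005/sparql-results#');

for my $result ($xpc->findnodes('//sr:result')) {
    my $lat  = $xpc->findvalue('sr:binding[@name="lat"]/sr:literal',  $result);
    my $long = $xpc->findvalue('sr:binding[@name="long"]/sr:literal', $result);
    print "$lat,$long\n";    # ready to hand to the Google Maps JavaScript
}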

I added one last little feature.

It is all well and good to see which libraries have the item you are after, but you are probably interested in libraries near you. So I used the Maxmind GeoLite City code/library to get the user’s rough location, and then centred the map on this (which is clearly not good for those trying to use it outside the UK!). This seems to work most of the time, but it depends on your ISP; some seem more friendly in their design towards this sort of thing. Does the map centre on your location?
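
For completeness, the IP lookup itself is only a few lines. The live pages do it in PHP, but a Perl equivalent using the Geo::IP module with the GeoLite City database would look roughly like the sketch below; the fallback co-ordinates (roughly the middle of the UK) are an arbitrary choice of mine.

#!/usr/bin/perl
# Sketch: get a rough lat/long for a visitor's IP address using Maxmind's
# GeoLite City data via the Geo::IP module, to centre the Google Map on.
use strict;
use warnings;
use Geo::IP;

sub map_centre_for {
    my ($ip) = @_;
    my ($lat, $long) = (52.5, -1.9);    # fallback: roughly central UK
    my $gi = Geo::IP->open('GeoLiteCity.dat', GEOIP_STANDARD);
    if (my $record = $gi->record_by_addr($ip)) {
        ($lat, $long) = ($record->latitude, $record->longitude);
    }
    return ($lat, $long);
}

# In a CGI script the visitor's address is in $ENV{REMOTE_ADDR}.
my $ip = $ENV{REMOTE_ADDR} || '127.0.0.1';
my ($lat, $long) = map_centre_for($ip);
print "centre the map on $lat,$long\n";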

VAT ‘offset’? No, just a tax rise via the backdoor

Everyone in the UK will be aware that the ‘pre-budget report’ was released this week. Slightly oddly named, considering it contained more headlines than most full Budgets.

One of the things mentioned by every media outlet I have seen is the ‘offset’ increase in duty on alcohol and tobacco to counter-balance the temporary decrease in VAT.

The key word there is temporary. It appears next to the VAT cut, but not next to the increase in Duty to offset it.

So I did some poking around and came across this document: http://is.gd/98Et (pdf)

From page 25 (PDF page 30), paragraph 2.49:

“2.49 As set out in more detail in Chapter 5, alcohol and tobacco duties will be increased to offset the effects of the temporary reduction in VAT. Maintaining these increases after December 2009 will further support fiscal consolidation.”

That is no offset! That is a future tax rise. At the point VAT goes back up, this duty rise will stop being a temporary offset and start being an increase in the tax/duty we pay.

My point is not the rise in duty, but the way it has been presented by the Government.

Why have none of the papers and media outlets picked this up? Or have they? Let me know!

ecto : first impressions

I’ve heard good things about PC/Mac clients for writing blog posts, so I thought I would give one a go. I tried out ecto for OS X, publishing to my WordPress-based blog.

It did what was promised: it acted as a WYSIWYG blog composition tool, and in that sense it was easy to use and worked without problems. However, a few things:

  • I could only attach (as far as I could see) audio, pictures and movies. I wanted to attach a txt file (and may want to upload PDF/doc files) but could see no way of doing this.
  • I couldn’t send it to the blog as an unpublished draft, so I couldn’t send it to WordPress and then upload/link-to the text file using the wordpress interface before publishing.
  • Ecto is a generic blog tool, not specific to WordPress. While in many ways a good thing, it does have its downside: there are some options on the WordPress composition screen that I rarely use but do find useful occasionally, and it felt somewhat unsettling for them not to be there should I need them.
  • As a plus: the problem with the WordPress interface is that a lot of screen space is taken up by the menus at the top of the screen and other misc stuff, the edit space is somewhat small, and it is annoying to need to scroll both the screen and the text box. The ecto UI does not have this issue. But then WordPress 2.7 may address this problem.
  • One of the main plus points is being able to carry on editing offline, but with Google Gears you should be able to edit happily offline (I haven’t tried this yet).

So ecto is certainly worth trying if you are after an OS X-based blog client, and I chose it above others available based on reviews I had read, but for me, I think I will stick with the native web interface for the time being.

(posted using Ecto)

Playing with OAI-PMH with Simple DC

Setting up ircount has got me quite interested in OAI-PMH, so I thought I would have a little play. I was particularly interested in seeing if there was a way to count the number of full text items in a repository, as ROAR does not generally provide this information.

Perl script

I decided to use the HTTP::OAI Perl module by Tim Brody (who, not so coincidentally, is also responsible for ROAR, which ircount gets its data from).

A couple of hours later I have a very basic script which will roughly report on the number of records and the number of full text items within a repository, you just need to pass it a URL for the OAI-PMH interface.

To show the outcome of my efforts, here is the verbose output of the script when pointed at the University of Sussex repository (Sussex Research Online).

Here is the output for a sample record (see here for the actual OAI output for this record; you may want to ‘view source’ to see the XML):

oai:eprints.sussex.ac.uk:67 2006-09-19
Retreat of chalk cliffs in the eastern English Channel during the last century
relation: http://eprints.sussex.ac.uk/67/01/Dornbusch_coast_1124460539.pdf
MATCH http://eprints.sussex.ac.uk/67/01/Dornbusch_coast_1124460539.pdf
relation: http://www.journalofmaps.com/article_depository/europe/Dornbusch_coast_1124460539.pdf
dc.identifier: http://eprints.sussex.ac.uk/67/
full text found for id oai:eprints.sussex.ac.uk:67, current total of items with fulltext 6
id oai:eprints.sussex.ac.uk:67 is the 29 record we have seen

It first lists the identifier and date, and the next line shows the title. It then shows a dc.relation field which contains a fulltext item on the eprints server; because it looks like a fulltext item and is on the same server, the next line shows it has found a line that MATCHed the criteria, which means we add this item to the count of items with fulltext attached.

The next line is another dc.relation, again pointing to a fulltext URL for this item. However, this time it is on a different server (i.e. the publisher’s), so this line is not treated as a fulltext item, and so it does not show a MATCH (i.e. had the first relation line not existed, this record would not be considered one with a fulltext item).

Finally another dc.identifier is shown, then a summary generated by the script concluding that this item does have fulltext, is the sixth record seen with fulltext, and is the 29th record we have seen.

The script, as we will now see, has to use various ‘hacky’ methods to try and guess the number of fulltext items within a repository, as different systems populate simple Dublin Core in different ways.
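
To give a flavour of the approach, here is a stripped-down sketch along the lines of what the script does (it is not the actual script): harvest oai_dc records with HTTP::OAI, count every record, and treat a record as having fulltext if a dc.relation or dc.identifier URL points at a file on the same host as the OAI interface, or if a dc.format of ‘application/pdf’ is present (the Dspace case described below). The file-extension check and other details are my rough reconstruction of the heuristics rather than a faithful copy.

#!/usr/bin/perl
# Rough sketch: count records and guess the number of fulltext items in a
# repository over OAI-PMH, using Tim Brody's HTTP::OAI. The heuristics are
# approximate and mirror those described in this post.
use strict;
use warnings;
use HTTP::OAI;
use URI;

my $base_url = shift @ARGV or die "usage: $0 <oai-pmh-base-url>\n";
my $oai_host = URI->new($base_url)->host;

my $harvester = HTTP::OAI::Harvester->new(baseURL => $base_url);
my $response  = $harvester->ListRecords(metadataPrefix => 'oai_dc');
die $response->message if $response->is_error;

my $dc_ns = 'http://purl.org/dc/elements/1.1/';
my ($records, $fulltext) = (0, 0);

while (my $record = $response->next) {
    $records++;
    next unless $record->metadata;        # skip deleted records
    my $dom = $record->metadata->dom;     # XML::LibXML document

    my $has_fulltext = 0;

    # Eprints-style: a dc.relation or dc.identifier URL on the same host as
    # the OAI interface, which looks like a file rather than an abstract page.
    for my $field (qw(relation identifier)) {
        for my $node ($dom->getElementsByTagNameNS($dc_ns, $field)) {
            my $value = $node->textContent;
            next unless $value =~ m{^https?://}i;
            my $host = eval { URI->new($value)->host } or next;
            $has_fulltext = 1 if lc($host) eq lc($oai_host)
                                 and $value =~ m{\.(pdf|doc|ps|html?)$}i;
        }
    }

    # Dspace-style: no file URL in the DC at all, just a format hint.
    for my $node ($dom->getElementsByTagNameNS($dc_ns, 'format')) {
        $has_fulltext = 1 if $node->textContent eq 'application/pdf';
    }

    $fulltext++ if $has_fulltext;
}
die $response->message if $response->is_error;

print "$records records, of which roughly $fulltext appear to have fulltext\n";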

Repositories and OAI-PMH/Simple Dublin Core.

It quickly became clear on experimenting with different repositories that the different repository software populate Simple Dublin Core in a different manner. Here are some examples:

Eprints2: As you can see above in the Sussex example, fulltext items are added as a dc.relation field, but so too are any publisher/official URLs, which we don’t want to count. The only way to differentiate between the two is to check the domain name within the dc.relation URL and see if it matches that of the OAI interface we are working with. This is by no means solid: it is quite possible for a system to have more than one hostname, and what the user gives as the OAI URL may not match what the system gives as the URLs for fulltext items.

Eprints3: I’ll use the Warwick repository for this, see the HTML and OAI-PMH for the record used in this example.

<dc:format>application/pdf</dc:format>
<dc:identifier>http://wrap.warwick.ac.uk/46/1/WRAP_Slade_jel_paper_may07.pdf</dc:identifier>
<dc:relation>http://dx.doi.org/10.1257/jel.45.3.629</dc:relation>
<dc:identifier>Lafontaine, Francine and Slade, Margaret (2007) Vertical integration and firm boundaries: the evidence. Journal of Economic Literature, Vol.45 (No.3). pp. 631-687. ISSN 0022-0515</dc:identifier>
<dc:relation>http://wrap.warwick.ac.uk/46/</dc:relation>

Unlike Eprints2, the fulltext item is now in a dc.identifier field, while the official/publisher URL is still a dc.relation field, which makes it easier to count the former without the latter. EP3 also seems to provide a citation of the item, again in a dc.identifier. (As an aside: EPrints 3.0.3-rc-1, as used by Birkbeck and Royal Holloway, seems to act differently, missing out any reference to the fulltext.)

Dspace: I’ll use Leicester’s repository, see the HTML and OAI-PMH for the record used. (I was going to use Bath’s but looks like they have just moved to Eprints!)

<dc:identifier>http://hdl.handle.net/2381/12</dc:identifier>
<dc:format>350229 bytes</dc:format>
<dc:format>application/pdf</dc:format>

This is very different to Eprints. dc.identifier is used for a link to the HTML page for this item (like Eprints2, but unlike Eprints3, which uses dc.relation for this). However it does not mention either the fulltext item or the official/publisher URL at all (this record has both). The only clue that this has a fulltext item is the dc.format (‘application/pdf’), and so my hacked-up little script looks out for this as well.

I looked at a few other Dspace based repositories (Brunel HTML / OAI ; MIT HTML / OAI) and they seemed to produce the same sort of output, though not being familiar with Dspace I don’t know if this is because they were all the same version or if the OAI-PMH interface has stayed consistent between versions.

I haven’t even checked out Fedora, bepress Digital Commons or DigiTool yet (all this is actually quite time consuming).

Commentary

I’m reluctant to come up with any conclusions because I know the people who developed all this are so damn smart. When I read the articles and posts produced by those who were on the OAI-PMH working group, or were in some way involved, it is clear they have a vast understanding of standards, protocols, metadata and more. Much of what I have read is clear and well written, and yet I still struggle to understand it due to my own mental shortcomings!

Yet what I have found above seems to suggest we still have a way to go in getting this right.

Imagine a service which will use data from repositories: ‘Geography papers archive’, ‘UK Working papers online’, ‘Open Academic Books search’ (all fictional web sites/services which could be created which harvest data from repositories, based on a subject/type subset).

Repositories are all about open access to the full text of research, and it seems to me that harvesters need to be able to assume that the fulltext item, and other key elements, will be in a particular field. And perhaps it isn’t too wild to suggest that one field should be used for one purpose. For example, both Dspace and Eprints provide a full citation of the item in the DC metadata, which an external system may find useful in some way; however it is in the dc.identifier field, yet various other bits of information are also put in that very same field, so anyone wishing to extract citations would need to run some sort of messy test to try and ascertain which identifier field, if any, contains the citation they wish to use.

To some extent things can be improved by getting repository developers, harvester developers and OAI/DC experts round a table to agree a common way of using the format. Hmm, but does that ring any bells? I’ve always thought that the existence of the Bath Profile was probably a sign of underlying problems with Z39.50 (though I am almost totally ignorant of Z39.50). Even this would only solve some problems: the issue of multiple ‘real world’ elements being put into the same field (both identifier and relation are used for a multitude of purposes), as mentioned above, would remain.

I know nothing about metadata or web protocols (left to me, we would all revert to tab-delimited files!), so I am reluctant to suggest or declare what should happen. But there must be a better fit for our needs than Simple DC, Qualified DC being a candidate (I think; again, I know nuffing). See this page highlighting some of the issues with Simple DC.

I guess one problem is that it is easy to fall into the trap of presuming repository item = article/paper, when of course it could be almost anything; the former would be easy to narrowly define, but the latter – which is the reality – is much harder to give a clear schema for. Perhaps we need ‘profiles’ for the common item types (articles/theses/images). I think this is the point where people will point out that (a) this has been discussed a thousand times already and (b) it has probably already been done! So I’ll shut up and move on (here’s one example of what has already been said).

Other notes:

  • I wish OAI-PMH had a machine-readable way of telling clients whether they can harvest items, reuse the data, or even access it at all (apologies if it does allow this already). The human-readable text of an IR policy may forbid me from sucking up the data and making it searchable elsewhere, but how will I know this?
  • Peter Millington of RSP/SHERPA recently floated the idea of an OAI-PMH verb/command to report the total number of items. His point is that it should be simple for OAI servers to report such a number with ease (probably a simple SQL COUNT(*)), but at the moment OAI-PMH clients – like mine – have to manually count each item, parsing thousands of lines of data, which can take minutes and creates processing requirements for both server and client, just to answer the simple question of how many items there are. I echo and support Peter’s idea of creating a count verb to resolve this.
  • It would be very handy if OAI-PMH servers could give an application name and version number as part of the response to the ‘Identify’ verb; it would help when trying to work around the differences between applications and software versions.

Back to the script

Finally, I’m trying to judge how good the little script is: does it report an accurate number of fulltext items? If you run an IR and would be happy for me to run the script against your repository (I don’t think it creates a high load on the server), then please reply to this post, ideally with your OAI-PMH URL and how many fulltext items you think you have, though neither is essential. I’ll attach the results in a comment to this post.

Food for thought: I’m pondering the need to check the dc.type of an item and only count items of certain types. E.g. should we include images? One image of a piece of research sounds fine; 10,000 images suddenly distort the numbers. Should it include all items, or just those of certain types (article, thesis, etc.)?

Whale hunting terrorists

My Dearest Government,

It is really quite simple. Sometimes you suggest things which will infringe our privacy and our rights.

The average person will not like the idea of such things, but you will try and persuade us (and parliament) that they must happen for very good reasons, and that these laws will only be used in the extreme circumstances they were intended for, with lots of checks and balances.

Now, and here is the important bit so pay attention: when you use this very legislation for something completely unrelated to what it was intended for (let’s say, oh just for example, using anti-terrorism legislation on the UK-based assets of a small country because you don’t like the way they are handling a money problem), then our trust in you disappears. Completely.

And if you can’t be trusted not to misuse that, what hope have we that your plans to monitor every phone call and email will not be abused when you feel there is a great enough need?