“Sitting on a gold mine” – improving provision and services for learners by aggregating and using ‘learner behaviour data’

I’m at a workshop today called “Sitting on a gold mine” – improving provision and services for learners by aggregating and using ‘learner behaviour data’ (it rolls off the tongue!), which is part of the wider JISC TILE project looking at, in a nutshell, how we can use data collected from users and user activity to provide useful services, and the issues and challenges involved (plus some Library 2.0 concepts as well). As ever, these are just my notes; at some points I took more notes than others, there will be mistakes and I will badly misquote the speakers, so please keep this in mind.

There’s quite a bit of ‘workshop’ discussion coming up, which I’m a little tentative about: I can rant on about many things for hours, but I’m not sure I have a lot of views on this other than ‘this is good stuff’!

Pain Points & Vision – David Kay (TILE)

David gave an overview of the TILE project. Really interesting stuff, lots covered and good use of slides, but quite difficult to get everything down here.

TILE has three objectives:

  • Capture scope/scale of Library 2.0
  • Identify significant challenges facing library system developments
  • Propose high level ‘library domain model’ positioning these challenges in the context of library ‘business processes’

You can get context from click streams; this is done by the likes of Amazon and e-music providers.

E.g. First year students searching for Napoleon also borrowed… they downloaded… they rated this resource… etc.

David referred to an idea of Lorcan Dempsey’s: we get too bogged down by the mechanics of journals and provision without looking at the wider business processes in the new ‘web’ environment.

There are four ‘systems’ in the TILE architecture: library systems (LMS, cross search, ERM), the VLE, repositories, and associated content services. We looked at a model of how these systems interact, with the user in the middle.

Mark Tool (University of Stirling)

Mark (who used to be based down the road at the University of Brighton) talked about the different systems Stirling (and the other universities he has worked at) use, and how we don’t really know how users use them – not just now, but as historical trends, e.g. are users using e-books more now than in the past?

These questions are important to lecturers, as they point students to resources and systems, but what do users actually use, and how do they use them? There is also a quality issue: are we pointing them to the right resources? And are we getting good value for money, e.g. licence and staff costs for a VLE?

If we were to look at how different students use different resources, would we see that ‘high achievers’ use different resources to weaker students? Could/should we point the weaker students to the resources that the former use? There are obvious privacy implications.

It could also be of use when looking at new courses and programmes and how to resource them. Nationally, it might help guide which resources should be negotiated for at a national level.

Danger:

  • small crowd -> small dataset  -> can be misleading (one or two people can look like a trend)
  • HEIs are very different from each other.

He thinks we should run some smallish pilots and then validate the data collected by some other means.

Joy Palmer – MIMAS

Joy will mainly be talking about COPAC, which has done some really interesting stuff recently in opening up its data and APIs (see the COPAC blog).

What COPAC are working on:

  • Googlisation of records (will be available on Google soon)
  • Links to Digital content
  • Service coherency with zetoc and suncat
  • Personalisation tools / APIs
    • ‘My Bibliography’
    • Tagging facilities
    • Recommender functions
    • ummm other stuff I didn’t have time to note
  • Generally moving from a ‘Walled garden’ to something that can be mashed up [good!]

One example of a service from COPAC is ‘My Bibliography’ (or ‘marked list’), which can be exported in Atom format (allowing it to be used anywhere that takes an Atom feed). These lists will be private by default but can be made public.

She talked about the general direction and ethos of COPAC development with lots of good examples, and the issues involved. One of the slides was titled ‘From “service” to “gravitational hub”’, which I liked. She then moved on to her (and MIMAS/COPAC’s) perspective on the issue of using user-generated data.

Workshop 1.

[Random notes from the group I was in, mainly the stuff I agreed with(!) – there were three groups.] We talked about whether we should do this, the threats, and which groups of people are affected by those threats. Good discussion. We talked about how these things could be useful, and why some may be averse to or cautious of it (including privacy, and encroaching on others’ areas – IT/the library telling academics that what they recommend to students isn’t being used, i.e. telling them they are doing it wrong, creates friction). Should we do this? It is a blunt tool and may show false trends, but we need to give it a go and see what happens. Is it ‘anti-HE’ to offer such services (i.e. recommending books)? No, no, no! Should we leave it to the likes of Google/Amazon? No, this is where the web is going. But there was real-world experience of things to be aware of, e.g. a catalogue ranking one edition of a book highly due to high usage, leaving a newer edition further down the list. [Lots more discussion, which I forget.]

Dave Pattern – Huddersfield.

[Dave is the systems librarian at Huddersfield, who has better ideas than me, then implements them better than I ever could, in a fraction of the time. He’s also a great speaker. I hate him. Check out his annoyingly fantastic blog.]

Lots of data is generated just by doing what we and users need to do, and we can dig into it. Dave starts off talking about supermarket loyalty cards: supermarkets were doing ‘people who bought this also bought’ ten or more years ago. We can learn from them; we could do this.

We’ve been collecting circulation data for years – why haven’t we done anything (bar really basic stuff) with it?

Borrowing suggestions (‘people who borrowed this also borrowed’) are working well at Huddersfield; librarians report them suggesting the same books as they would.
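For what it’s worth, here’s a back-of-an-envelope sketch (in Perl, and definitely not Dave’s actual implementation) of the sort of co-occurrence counting I imagine sits behind a ‘people who borrowed this also borrowed’ feature, assuming you can get circulation data out as simple borrower/item pairs:

#!/usr/bin/perl
# Sketch only: guess at 'also borrowed' from a CSV of borrower_id,item_id pairs.
use strict;
use warnings;

my $seed = shift @ARGV or die "usage: $0 item_id [circ_data.csv]\n";

# Group the circulation history by borrower
my %items_by_borrower;
while (<>) {
    chomp;
    my ($borrower, $item) = split /,/;
    push @{ $items_by_borrower{$borrower} }, $item;
}

# For everyone who borrowed the seed item, count what else they borrowed
my %also;
for my $items (values %items_by_borrower) {
    next unless grep { $_ eq $seed } @$items;
    $also{$_}++ for grep { $_ ne $seed } @$items;
}

# Top ten suggestions by number of shared borrowers
my @top = (sort { $also{$b} <=> $also{$a} } keys %also)[0 .. 9];
print "People who borrowed $seed also borrowed:\n";
print "  $_ ($also{$_} borrowers)\n" for grep { defined } @top;

In real life you would do this inside the database and probably filter out the books everyone borrows regardless, but the principle is just counting shared borrowers.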

Personalised suggestions: if you log in, it looks at what you have borrowed and then at what other items people who borrowed the same things went on to borrow.

Lending paths: paths which join books together, potentially used to predict what people will borrow and when particular books will be in high demand.

The library catalogue shows some book usage stats when used from a library staff PC (brilliant idea!); these can be broken down by different criteria (e.g. the courses borrowers are on).

Other functionality: keyword suggestions, and common zero-result keywords (e.g. newspapermen, asbo, disneyfication). Huddersfield have found this digging useful.

He has released an XML file of anonymised circulation data, with the approval of the library, for others to play with, and hopes other libraries will do the same. (This is a stupidly big announcement; it feels insulting to put it in just one sentence like this – perhaps I should enclose it in the <blink> tag!?) See his blog post.

(Note to self: don’t try to download a 50MB file via a 3G USB stick – bad things happen to the MacBook.)

Mark van Harmelen

Due to said bad things I was slightly distracted during part of this talk. Being a man, I completely failed to multi-task.

This was an excellent talk (at a good level) about how the TILE project is building prototype/real system(s), with some really good models of how this will/could work. So far they have developed harvesting of data from institutions (and COPAC/similar services), adding ‘group use’ to their database; a searcher known to be a ‘chemistry student’ and ‘third year’ can then get relevant recommendations based on data from the groups they belong to. [I’m not doing this justice, but there were some really good models and examples of this working.]

David Jennings – Music Recommender systems

First off he refers to the Googlezon film (I’d never heard of this before) and the idea of Big Brother in the private sector, then moves on to talk about (the concept of) iPods which predict the music you want to hear next based on your mood, and even matchmaking based on how you react to music.

Discovery: we search, we browse, we wait for things to come along, we follow others, we avoid things everyone else listens to, etc.

He talked about Flickr’s (unpublished) popularity ranking as a way to bring things to the front based on views, comments, tags etc.

Workshop 2:

Some random comments and notes from the second discussion session (from all groups)

One university’s experience was that just ‘putting it out there’ didn’t work – no one added tags to the catalogue; the conclusion was the need for a community.

Cold-start problem: new content doesn’t surface with the sort of approaches being discussed here.

Is a subject librarian’s (or researcher’s) recommendation of the same value as an undergraduate’s?

Will Library Directors agree to library data being released in the same way as Huddersfield’s, even though it is anonymised? They may fear the risks and issues it could result in, even if we/they are not sure what those risks are (will an academic take issue with a certain aspect of the released data?).

At a national level, if academics used these services to create reading lists, it may result in a homogenisation of teaching across the UK. There is also a risk of students’ reading focusing on a small group of items/books – we could end up with four books per subject!

Summary

This was an excellent event, and clearly some good and exciting work is taking place. What are my personal thoughts?…

This is one of those things that once you get discussing it you’re never quite sure why it hasn’t already been done, especially with circulation data. There’s a wide scope, from local library services (book recommendation) to national systems which use data from VLEs, registry systems and library systems. A lot of potential functionality, both in terms of direct user services and informing HE (and others) to help them make decisions and tailor services for users.

Challenges include: privacy, copyright, resourcing (money) and the uncertainty of (and aversion to) change. The last one includes a multitude of issues: will making data available to others lead to a budget reduction for a particular department, will it create friction between different groups (e.g. between academics and central services such as Libraries and IT)?

Perhaps the biggest fear is not knowing what demons this will release. If you are a Library Director, and you authorise your organisation’s data to be made available – or the introduction of a service such as the ones discussed today – how will it come back to haunt you in the future? Will it lead to your institution making (negative) headlines? Will a system/service supplier sue you for giving away ‘their’ data?  Will academics turn on you in Senate for releasing data that puts them in a bad light? ‘Data’ always has more complex issues than ‘services’.

In HE (and I say this more after talking to various people at different institutions over the last few years) we are sometimes too fearful of the 20% instead of thinking about the 80% (or is that more like 5/95%). We will always get complaints about new services and especially about changes. No one contacts you when you are doing well (how many people contact Tesco to tell them they have allocated the perfect amount of shelf space to bacon?!). We must not let complaints dictate how we do things or how we allocate time (though of course we should not ignore them; relevant points can often be found).

Large organisations – both public and private – can be well known for being inflexible. But for initiatives like this (and those in the future) to have a better chance of succeeding we need to look at how we can bring down the barriers to change. This is too big an issue to get into here, and the reasons are both big and many, from too many stakeholders requiring approval to a ‘wait until the summer vacation’ philosophy, from long-term budget planning to knock-on effects across the organisation (a change in department A means the training/documentation/website of department B needs to be changed first). Hmmmm, I seem to have moved away from TILE and on to a general rant offending the entire UK HE sector!

Thinking about Dave Pattern’s announcement, what will it take for other libraries to follow? First, the techy stuff: he has (I think) created his own XML schema (is that the right term?) and will be working on an API to access the data. The bad outcome would be for a committee to take this and spend years finally ‘approving’ it; the good outcome would be for a few metadata/XML-type people to suggest minor changes (if any) and endorse it as quickly as possible (which is no disrespect to Dave). Example: will the use of UCAS codes be a barrier to international adoption? (I can’t see why; just thinking out loud.) There was concern at the event that some Library Directors would be cautious in approving such things. This is perhaps understandable. However, I have to say I don’t even know who the Director of Huddersfield Information Services is, but my respect for the institution and the person in that role goes about as high as it can go when they do things like this. They have taken a risk, taken the initiative and been the first to do something like this (to the best of my knowledge) worldwide. I will buy them a beer should I ever meet them!

I’ll be watching any developments (and chatter) that result from this announcement, and thinking about how we can support/implement such an initiative here. In theory, once (programming) scripts have been written for a library system, it should be fairly trivial to port them to other customers of the same software (the work will probably include mapping departments to UCAS codes, and the way user affiliation to departments is stored may vary between universities). Perhaps universities could club together to work on creating the code required? I’m writing this a few hours after Dave made his announcement and already his blog article has many trackbacks and comments.

So in final, final conclusion. A good day, with good speakers and a good group of attendees from mixed backgrounds. Will watch developments with interest.

[First blog post using WordPress 2.7. Other blogs covering the event include Phil’s CETIS blog, and Dave Pattern has another blog entry on his talk. If you have written anything on this event then please let me know!]

Mashed Libraries

Exactly a week ago I was coming home from Mashed Libraries in London (Birkbeck).

I won’t bore you with details of the day (or, more to the point, I’m lazy and others have already done it better than I could (of course, I should have made each one of those words a link to a different blog, but I’m laz… oh, never mind)).

Thanks to Owen Stephens for organising, UKOLN for sponsoring and Dave Flanders (and Birkbeck) for the room.

During the afternoon we all got to hacking with various sites and services.

I had previously played around with the Talis Platform (see long-winded commentary here – it seems weird that at the time I really didn’t have a clue what I was playing with, and it was only a year ago!).

I built a basic catalogue search based on the ukbib store. I called it Stalisfield (which is a small village in Kent).

But one area I had never got working was the holdings, so I decided to set to work on that. Progress was slow, but then Rob Styles sat down next to me and things started to move. Rob helped create Talis Cenote (which I nicked most of the code from) and generally falls into that (somewhat large) group of ‘people much smarter than me’.

We (well, I) wanted to show which libraries had the book in question, and plot them on a Google Map. So once we had a list of libraries we needed to connect to another service to get the location of each of them. The service which fitted this need was the Talis Directory (Silkworm). This raised a point with me: it was a good job there was a Talis service which used the same underlying ID codes for the libraries, i.e. the holdings service and the directory both used the same ID number. It could have been a problem if we had needed to get the geo/location data from something like OCLC or Librarytechnology.org – what would we have searched on? A library’s name? Hardly a reliable term to use (e.g. the University of Sussex Library is called ‘UNIV OF SUSSEX LIBR’ in OCLC!). Do libraries need a code which can be used to cross-reference them between different web services (a little like ISBNs for books)?

Using the Talis Silkworm Directory was a little more challenging than first thought, and the end result was a very long URL which used SPARQL (something which looks like a steep learning curve to me!).
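For the curious, the query was something along these lines. This is a reconstruction from memory rather than the actual code, so treat the store endpoint, the library URI and even the exact predicates as placeholders (the wgs84 ‘geo’ vocabulary is the sort of thing you would expect the directory to use for coordinates):

# Sketch: build the 'very long URL' by hand. The endpoint path and predicates
# are my best guess at the Talis Platform conventions, not copied from the real code.
use strict;
use warnings;
use URI::Escape qw(uri_escape);
use LWP::Simple qw(get);

my $library = 'http://example.org/silkworm/library/sussex';   # placeholder library URI
my $sparql  = <<"SPARQL";
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT ?lat ?long WHERE {
  <$library> geo:lat ?lat ;
             geo:long ?long .
}
SPARQL

my $endpoint = 'http://api.talis.com/stores/silkworm/services/sparql';  # from memory
my $xml      = get($endpoint . '?query=' . uri_escape($sparql));
defined $xml or die "no response from the directory\n";
print $xml;   # SPARQL results XML; pull ?lat and ?long out with your XML parser of choice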

In the meantime, I signed up for Google Maps and gave myself a crash course in setting it up (I’m quite slow to pick these things up). So we had the longitude and latitude co-ordinates for each library, and we had a Google Map on the page; we just needed to connect the two.

Four people at Mashed Library trying to debug the last little bit of my code.

Time was running short, so I was glad to take a back seat and watch (and learn) while Rob went into speed-JavaScript mode. This last part proved to be elusive: the PHP code which was generating the JavaScript was just not quite working. In the end the (final) problem was related to the order I was outputting the code in, but we were out of time, and this required more than five minutes.

Back home, I fixed this (though I never would have known I needed to do this without help).

You can see an example here, and here and here (click on the link at the top to go back to the bib record for the item, which, by the way, should show a Google Book cover at the bottom, though this only works for a few books).

You can click on a marker to see the name of the library, and the balloon also has a link which should take you straight to the item in question on the library’s catalogue.

It is a little slow, partly due to my bad code and partly due to what it is doing:

  1. Connecting to the Talis Platform to get a list of libraries which have the book in question (quick)
  2. For each library, connect to the Talis Silkworm Directory and perform a SPARQL query to get back some XML which includes the geo co-ordinates. (geo details not available for all libraries)
  3. Finally, generate some JavaScript code to plot each library on to a Google map (sketched just after this list).
  4. As this last point needs to be done in the <head> of the page, it is only at this point that we can push the page out to the browser.
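To give a flavour of steps 3 and 4, here is roughly the kind of thing the PHP was doing, rewritten as a Perl sketch that prints the JavaScript (this uses the 2008-era GMap2/GMarker API, and the @libraries data is made up – the real values came from the SPARQL lookups above). The lesson learned the hard way is that the map has to be created before any markers are added to it:

#!/usr/bin/perl
# Sketch: emit the <head> JavaScript for a Google Map with one marker per library.
use strict;
use warnings;

my @libraries = (
    { name => 'University of Sussex Library', lat => 50.8670, lng => -0.0875 },
    { name => 'Another Library',              lat => 53.6450, lng => -1.7780 },
);

# Create the map first, *then* add the overlays - the ordering that caught me out.
print <<'JS';
<script type="text/javascript">
function loadMap() {
  var map = new GMap2(document.getElementById("map"));
  map.setCenter(new GLatLng(54.0, -2.0), 5);  // rough UK-wide starting view
JS

for my $lib (@libraries) {
    printf qq(  map.addOverlay(new GMarker(new GLatLng(%f, %f), {title: "%s"}));\n),
        $lib->{lat}, $lib->{lng}, $lib->{name};
}

print <<'JS';
}
window.onload = loadMap;
</script>
JS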

I added one last little feature.

It is all well and good to see which libraries have the item you are after, but you are probably interested in libraries near you. So I used the Maxmind GeoLite City code library to get the user’s rough location, and then centre the map on this (which is clearly not much use for those trying to use it from outside the UK!). This seems to work most of the time, but it depends on your ISP – some seem more friendly in their design towards this sort of thing. Does the map centre on your location?
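The GeoLite library does the heavy lifting in the real page, but for completeness here is the same trick as a Perl sketch using the Geo::IP module against the GeoLiteCity database (the database path and the fallback coordinates are just placeholders):

# Sketch: centre the map on the visitor's rough location from their IP address.
use strict;
use warnings;
use Geo::IP;

# Path to the downloaded GeoLite City database is an assumption - put yours here.
my $gi = Geo::IP->open('/usr/local/share/GeoIP/GeoLiteCity.dat', GEOIP_STANDARD);

my $ip     = $ENV{REMOTE_ADDR} || '127.0.0.1';
my $record = $gi->record_by_addr($ip);

# Fall back to a rough centre-of-the-UK point when the lookup fails (some ISPs
# resolve everything to one place, which is why this only works "most of the time").
my ($lat, $lng) = $record ? ($record->latitude, $record->longitude) : (54.0, -2.0);

print "map.setCenter(new GLatLng($lat, $lng), 9);\n";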

Playing with OAI-PMH with Simple DC

Setting up ircount has got me quite interested in OAI-PMH, so I thought I would have a little play. I was particularly interested in seeing if there was a way to count the number of full text items in a repository, as ROAR does not generally provide this information.

Perl script

I decided to use the HTTP::OAI Perl module by Tim Brody (who, not so coincidentally, is also responsible for ROAR, which ircount gets its data from).

A couple of hours later I have a very basic script which will roughly report on the number of records and the number of full text items within a repository; you just need to pass it a URL for the OAI-PMH interface.
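The real script is longer and uglier, but the core of it looks something like this cut-down sketch (the heuristics are simplified and the error handling is minimal; helpfully, HTTP::OAI deals with resumption tokens for you, which is most of the battle):

#!/usr/bin/perl
# Cut-down sketch of the full-text counter: harvest oai_dc records and guess
# which ones have a full-text file attached.
use strict;
use warnings;
use HTTP::OAI;
use URI;

my $base = shift or die "usage: $0 <OAI-PMH base URL>\n";
my $host = URI->new($base)->host;
my $dc   = 'http://purl.org/dc/elements/1.1/';

my $harvester = HTTP::OAI::Harvester->new(baseURL => $base);
my $records   = $harvester->ListRecords(metadataPrefix => 'oai_dc');
die $records->message if $records->is_error;

my ($total, $fulltext) = (0, 0);
while (my $rec = $records->next) {
    my $metadata = $rec->metadata or next;   # deleted records have no metadata
    my $dom      = $metadata->dom;
    $total++;

    my $match = 0;
    # EPrints 2 style: the full text shows up as a dc:relation URL on the same host
    for my $n ($dom->getElementsByTagNameNS($dc, 'relation')) {
        my $url = $n->textContent;
        next unless $url =~ m{^https?://}i;
        my $rel_host = eval { URI->new($url)->host } || '';
        $match = 1 if $rel_host eq $host;
    }
    # EPrints 3 style: a dc:identifier pointing straight at a PDF
    for my $n ($dom->getElementsByTagNameNS($dc, 'identifier')) {
        $match = 1 if $n->textContent =~ /\.pdf$/i;
    }
    # DSpace style: no file URL at all, just dc:format of application/pdf
    for my $n ($dom->getElementsByTagNameNS($dc, 'format')) {
        $match = 1 if $n->textContent =~ m{application/pdf}i;
    }
    $fulltext++ if $match;
}
print "$total records seen, $fulltext look like they have full text\n";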

To show the outcome of my efforts, here is the verbose output of the script when pointed at the University of Sussex repository (Sussex Research Online).

Here is the output for a sample record (see here for the actual OAI output for this record; you may want to ‘view source’ to see the XML):

oai:eprints.sussex.ac.uk:67 2006-09-19
Retreat of chalk cliffs in the eastern English Channel during the last century
relation: http://eprints.sussex.ac.uk/67/01/Dornbusch_coast_1124460539.pdf
MATCH http://eprints.sussex.ac.uk/67/01/Dornbusch_coast_1124460539.pdf
relation: http://www.journalofmaps.com/article_depository/europe/Dornbusch_coast_1124460539.pdf
dc.identifier: http://eprints.sussex.ac.uk/67/
full text found for id oai:eprints.sussex.ac.uk:67, current total of items with fulltext 6
id oai:eprints.sussex.ac.uk:67 is the 29 record we have seen

It first lists the identifier and date, and the next line shows the title. It then shows a dc.relation field which contains a full text item on the EPrints server; because it looks like a full text item and is on the same server, the next line shows it has found a line that MATCHed the criteria, which means we add this item to the count of items with full text attached.

The next line is another dc.relation, again pointing to a fulltext URL for this item. However, this time it is on a different server (i.e. the publisher’s), so this line is not treated as a fulltext item and does not show a MATCH (i.e. had the first relation line not existed, this record would not be considered one with a fulltext item).

Finally another dc.identifier is shown, then a summary generated by the script concluding that this item does have fulltext, is the sixth record seen with fulltext, and is the 29th record we have seen.

The script, as we will now see, has to use various ‘hacky’ methods to try and guess the number of fulltext items within a repository, as different systems populate simple Dublin Core in different ways.

Repositories and OAI-PMH/Simple Dublin Core.

It quickly became clear on experimenting with different repositories that different repository software populates Simple Dublin Core in different ways. Here are some examples:

Eprints2: As you can see above in the Sussex example, fulltext items are added as a dc.relation field, but so too are any publisher/official URLs, which we don’t want to count. The only way to differentiate between the two is to check the domain name within the dc.relation URL and see if it matches that of the OAI interface we are working with. This is by no means solid: it is quite possible for a system to have more than one hostname, and what the user gives as the OAI URL may not match what the system gives as the URLs for fulltext items.

Eprints3: I’ll use the Warwick repository for this, see the HTML and OAI-PMH for the record used in this example.

<dc:format>application/pdf</dc:format>
<dc:identifier>http://wrap.warwick.ac.uk/46/1/WRAP_Slade_jel_paper_may07.pdf</dc:identifier>
<dc:relation>http://dx.doi.org/10.1257/jel.45.3.629</dc:relation>
<dc:identifier>Lafontaine, Francine and Slade, Margaret (2007) Vertical integration and firm boundaries: the evidence. Journal of Economic Literature, Vol.45 (No.3). pp. 631-687. ISSN 0022-0515</dc:identifier>
<dc:relation>http://wrap.warwick.ac.uk/46/</dc:relation>

Unlike Eprints2, the fulltext item is now in a dc.identifier field, while the official/publisher URL is still a dc.relation field, which makes it easier to count the former without the latter. EP3 also seems to provide a citation of the item, again in a dc.identifier. (As an aside: EPrints 3.0.3-rc-1, as used by Birkbeck and Royal Holloway, seems to act differently, missing out any reference to the fulltext.)

Dspace: I’ll use Leicester’s repository, see the HTML and OAI-PMH for the record used. (I was going to use Bath’s but looks like they have just moved to Eprints!)

<dc:identifier>http://hdl.handle.net/2381/12</dc:identifier>
<dc:format>350229 bytes</dc:format>
<dc:format>application/pdf</dc:format>

This is very different to EPrints. dc.identifier is used for a link to the HTML page for this item (like Eprints2, but unlike Eprints3, which uses dc.relation for this). However, it does not mention either the fulltext item or the official/publisher URL at all (this record has both). The only clue that this record has a full text item is the dc.format (‘application/pdf’), so my hacked-up little script looks out for this as well.

I looked at a few other Dspace based repositories (Brunel HTML / OAI ; MIT HTML / OAI) and they seemed to produce the same sort of output, though not being familiar with Dspace I don’t know if this is because they were all the same version or if the OAI-PMH interface has stayed consistent between versions.

I haven’t even checked out Fedora, bepress Digital Commons or DigiTool yet (all this is actually quite time consuming).

Commentary

I’m reluctant to come up with any conclusions because I know the people who developed all this are so damn smart. When I read the articles and posts produced by those who were on the OAI-PMH working group, or were in some way involved, it is clear they have a vast understanding of standards, protocols, metadata and more. Much of what I have read is clear and well written, and yet I still struggle to understand it due to my own mental shortcomings!

Yet what I have found above seems to suggest we still have a way to go in getting this right.

Imagine a service which will use data from repositories: ‘Geography papers archive’, ‘UK Working papers online’, ‘Open Academic Books search’ (all fictional web sites/services which could be created which harvest data from repositories, based on a subject/type subset).

Repositories are all about open access to the full text of research, and it seems to me that harvesters need to be able to presume that the fulltext item, and other key elements, will be in a particular field. And perhaps it isn’t too wild to suggest that one field should be used for one purpose. For example, both DSpace and EPrints provide a full citation of the item in the DC metadata, which an external system may find useful in some way; however, it is in the dc.identifier field, and various other bits of information are also in that very same field, so anyone wishing to extract citations would need to run some sort of messy test to try and ascertain which identifier field, if any, contains the citation they wish to use.
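To make that concrete, here is the sort of messy test I mean – a hedged sketch rather than code I actually run – where finding ‘the citation’ comes down to sniffing each dc.identifier value and guessing, which is exactly the problem:

# Sketch: guess which dc:identifier values are human-readable citations rather
# than URLs, handles or other identifiers. Entirely heuristic, and that's the point.
use strict;
use warnings;

sub looks_like_citation {
    my ($value) = @_;
    return 0 if $value =~ m{^https?://}i;           # a URL, not a citation
    return 0 if $value =~ m{^(hdl|doi|urn):}i;       # handle/DOI/URN schemes
    # A citation usually has a year in brackets, page numbers or an ISSN in it
    return 1 if $value =~ /\(\d{4}\)/
             || $value =~ /\bpp?\.\s*\d+/
             || $value =~ /ISSN\s*[\dX-]+/i;
    return 0;
}

# Example using the Warwick record quoted above
my @identifiers = (
    'http://wrap.warwick.ac.uk/46/1/WRAP_Slade_jel_paper_may07.pdf',
    'Lafontaine, Francine and Slade, Margaret (2007) Vertical integration and firm boundaries: the evidence. Journal of Economic Literature, Vol.45 (No.3). pp. 631-687. ISSN 0022-0515',
);
print "citation? ", (looks_like_citation($_) ? 'yes' : 'no'), " -> $_\n" for @identifiers;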

To some extent things can be improved by getting repository developers, harvester developers and OAI/DC experts round a table to agree a common way of using the format. Hmm, does that ring any bells? I’ve always thought that the existence of the Bath Profile was probably a sign of underlying problems with Z39.50 (though I am almost totally ignorant of Z39.50). Even this would only solve some problems: the issue of multiple ‘real world’ elements being put into the same field (both identifier and relation are used for multiple purposes), as mentioned above, would remain.

I know nothing about metadata or web protocols (left to me, we would all revert to tab-delimited files!), so I am reluctant to suggest or declare what should happen. But there must be a better fit for our needs than Simple DC – Qualified DC being a candidate (I think; again, I know nuffing). See this page highlighting some of the issues with Simple DC.

I guess one problem is that it is easy to fall into the trap of presuming repository item = article/paper, when of course it could be almost anything; the former would be easy to narrowly define, but the latter – which is the reality – is much harder to give a clear schema for. Perhaps we need ‘profiles’ for the common item types (articles/theses/images). I think this is the point where people will point out that (a) this has been discussed a thousand times already and (b) it has probably already been done! So I’ll shut up and move on (here’s one example of what has already been said).

Other notes:

  • I wish OAI-PMH had a machine-readable way of telling clients whether they can harvest items, reuse the data, or even access it at all (apologies if it does allow this already). The human-readable text of an IR policy may forbid me from sucking up the data and making it searchable elsewhere, but how will I know that?
  • Peter Millington of RSP/SHERPA recently floated the idea of an OAI-PMH verb/command to report the total number of items. His point is that it should be simple for OAI servers to report such a number with ease (probably a simple SQL COUNT(*)), but at the moment OAI-PMH clients – like mine – have to manually count each item, parsing thousands of lines of data, which can take minutes and creates processing requirements for both server and client, just to answer the simple question of how many items there are. I echo and support Peter’s idea of creating a count verb to resolve this.
  • It would be very handy if OAI-PMH servers gave an application name and version number as part of the response to the ‘Identify’ verb – very useful when trying to work around the differences between applications and software versions.

Back to the script

Finally, I’m trying to judge how good the little script is: does it report an accurate number of full text items? If you run an IR and would be happy for me to run the script against your repository (I don’t think it creates a high load on the server), then please reply to this post, ideally with your OAI-PMH URL and how many full text items you think you have, though neither is essential. I’ll attach the results as a comment to this post.

Food for thought: I’m pondering the need to check the dc.type of an item and only count items of certain types. For example, should we include images? One image of a piece of research sounds fine; 10,000 images suddenly distort the numbers. Should it include all items, or just those of certain types (article, thesis, etc.)?

ircount: new location, new functionality

A while ago, I released a simple website which reported on the number of items in UK repositories over time. It collected its data from ROAR, but by collecting it on a weekly basis it could provide a table showing growth week by week.

First it has a new home: http://www.nostuff.org/ircount/

Secondly, it now collects data for every institutional (and departmental) repository registered in ROAR across the world. Not just the UK. It has been collecting the data since July.

The country integration isn’t perfect: you have to select a country, and then you are more or less restricted to that country (though you can hack it – see the ‘info & help’), and there is a lot of potential for improving this. There are also a couple of bugs; for example, when comparing four repositories it seems to (a) forget which country you were dealing with, and (b) stop showing the graph/chart.

I’m currently looking at trying to make an educated guess at how many fulltext items are in a given repository. This is proving to be a steep learning curve in the joys of OAI-PMH, and in how the different repository systems (and the different versions of these systems) have allocated information about the fulltext to different Dublin Core (DC) elements. But that is for another post.

In the mean time, I hope the worldwide coverage is of some use, and feel free to leave any comments.

RIOJA – Repository Interface for Overlaid Journal Archives – live blog

Live blogging http://www.ucl.ac.uk/ls/rioja/ – see http://is.gd/N8T or, in fact, see below…

Update (8 July 2008):

This is my first attempt at using coveritlive.com (as used very successfully many times by Andy Powell at Eduserv). On the whole I thought it was successful, and the software was great.

A few points: the software doesn’t seem to allow you to edit entries; I guess this keeps with the spirit of liveblogging, but it can be a pain when you notice a mistake just after sending a message. In fact, trying to take notes and listen to what was being said led to quite a few typos etc., and the small font on the screen – further away on my lap than it would be on a desk – didn’t help. Secondly, it is difficult to let people know you are doing this (the other extreme would be to bombard mailing lists announcing your intentions), so some who may have had an interest in keeping an eye on the day’s events would not have been aware of it.

For me as a reader, I think I find the more traditional blog post of these events more useful as a (slightly rough) written report of the different talks. However, I was keen to try out this software and knew that someone else (who is far better at it than me) was doing this. At the recent Talis Xiphos research day there was a combination of the two (here and here), and I thought this worked pretty well for those watching the day from afar.

JISC Library Management System Review

The JISC and SCONUL have just released a Review (Horizon Scan) of Library Management Systems (LMS or ILS). Interesting stuff. A lot of it is to be expected, but there are some interesting findings:

  • Dire need to embrace web 2.0 and to get OPACs out of the 90s
  • The need to integrate with campus systems such as financial and registry systems.
  • Open source is used within current systems, and as systems in their own right; the latter have no real penetration in the UK market and are unlikely to gain it in the short term (lack of advantages), and are not currently more ‘open’ than other systems.
  • Most major systems are much of a muchness (they put it far more elegantly, though after glancing through 100-odd pages I’m not much inclined to find the exact quote), so there is little reason to change system at the moment. Especially as…
  • It is a good time for the role and definition of an LMS to be looked at.
  • We are already moving away from the ‘one big system’ approach, i.e. Aquabrowser, ERMs and metasearch are all separate products. We are almost certainly going to keep moving in this direction, and this is a good thing. It requires standards and cooperation for different systems to talk to each other.
  • The UK market is small compared to the global market, though not that different from the norm, except for the BL ILL server and our concept of Short Loans.
  • Libraries have not made good use of looking at how users work and interact with their systems.
  • There is a need and a movement to liberate data (silos and all that).
  • LMS may become a back of house system (though this should not be seen as a bad thing)
  • The report recommends that libraries increase the value of their investment by implementing additional services around the LMS. Which is fine if these new services use standard protocols to interact with the LMS; if they use proprietary APIs or talk directly to the database, then it is another thing locking the library in to one provider (or at least, another reason why changing LMS will have a large impact).
  • The procurement process is expensive overall, especially when LMSs are more or less the same-ish.
  • The report encourages libraries to review their contracts (i.e. not changing system, just getting more – or better value – out of the current system with a better contract).

One thing that interested me was the vendor comments (and something the report recommends to libraries) on the lack of consortia in the UK, noting that other countries make good use of them (i.e. this is best practice). I can certainly see scope for this, especially as (as the report notes) the software has already been designed to handle consortia. But what would the consortia be? Geographically based (London? South East? M25 group? Scotland?), or perhaps using other groupings (Russell Group, 94 Group, CURL), or perhaps smaller groupings based on counties or similar institutions?

I’ll quote two bullet points from their recommendations:

  • The focus on breaking down barriers to resources is endorsed, involving single sign on, unifying workflows and liberating metadata for re-use.
  • SOA-based interoperability across institutional systems is emphasised as the foundation for future services and possibly the de-coupling of LMS components

The report also says “Libraries currently remain unconvinced about the return on their investment in electronic resource management systems.” Not really, we’re just waiting for a good one to come to market :)

I liked a comment from one of the reference group: “Since around 2000 there has been a growth in the perception of the library collection not as something physical that you hold, but as something you organise access to.” Nothing new, but nicely put.

The report can be found here [pdf]. It took ages to find where it is on the JISC website (the JISC website? confusing? surely not!), but it is here – and the key item, the report itself, is at the very, very bottom of the page (why?). I originally found the report via Talis’s Panlibus blog.

or08: eprints track, session 2

After coffee, a little more talk about new features and the future, as we ran out of time before. Christopher Gutteridge has now turned up; he may have had a few grown-up fizzy drinks last night.

(I lost concentration here, so take this with a grain of salt.) An EPrints plugin will try to pick up when people enter their names wrongly (e.g. get first/last names mixed up). A ‘report an eprint’ (or ‘report an issue with an eprint’) link on item/record pages?

3.1 beta: should be released in a day or so. Live CD available.

When will the new template (for records/items) arrive, including related papers (or ‘people who liked this also liked…’)? An HTML designer is working on this. Abstract pages can be recreated daily for fresh data (e.g., I think, for stats/other papers).

People come in via Google for an item and then leave again. Soton ECS put links to the postgrad prospectus and more on the abstract pages for items, and found hits to the postgrad prospectus tripled.

Talking about more finely grained controls and privileges, i.e. who can edit what, and where, and giving people additional power. This includes, for example, letting a person edit the wording of fields/help but not edit the workflow.

11:42: now moving on to research assessment experience.

Bill Mortimer – Open University.

How the Open University used EPrints to support the RAE.

They used EPrints as a publication database because it was publicly available and helped increase citations, and also because of the reporting tool developed for EPrints.

The OU uses mediated deposit, but also imported records and self-deposit.

Only peer-reviewed items go in ORO. They had up to seven temporary ‘editors’ processing the buffer.

Very slow uptake when mediated. Now have just under 7,000 items in ORO.

They simplified the workflow (which of course EPrints 3+ has improved). Researchers were responsible for depositing items for RAE submission.

Pro: increased awareness (of the IR) and increased deposits.

Con: overlap in perceptions of ORO and the RAE process (some felt the RAE took over the IR). Lots of records, but only 16% carry full text (the percentage with full text varied by department).

A slide with some good ideas for the future – see the presentation at http://pubs.or08.ecs.soton.ac.uk/ (though it is not currently there).

12:06pm

Susan Miles – Kingston

A metadata-only repository at the moment, but they plan to add content and full text this year.

The university’s departmental structure and hierarchy has been the most controversial thing. They didn’t use the RAE tool – it wasn’t out of the box.

Subject team staff created records, but the focus moved to the collection of physical items. (Some) staff really got into the IR, but this had the downside that many left with their new skills and experience!

misc bits

  • non-existent items
  • people trying to pass off others’ work
  • items being removed and then re-entered constantly at the last minute for the RAE
  • overseas academics caused issues
  • proof of performances and other ‘arts’ outputs were a challenge (next time get the academics to do it)
  • a barrel moving back and forth in a room was a piece of research to be submitted for the RAE (how do you evidence that, and what metadata do you give it?)

Unexpectedly, there was lots of interest in the IR across the university – but there are lots of things in the buffer and no staff.

University committee has endorsed the IR as the source of publication data.

Because subject team staff were used for the IR RAE work, subject support now have good knowledge of the IR, which is good.

12:27

Wendy from Soton

A higher profile in the university due to the RAE work means people are including her – and the IR – more in discussions across campus, such as looking at the REF.

Question (from me): were any academics reluctant about, or against, their RAE information being put online? Answer: no.

[Anonymous comment: an e-theses mandate is being reviewed regarding animal rights issues etc.]

William Nixon: also planning to upload RAE data. He does not foresee any problems, BUT recommends not flagging items as RAE08, as some academics may have issues with this.

Les: HEFCE puts metadata for items submitted to the RAE on the web anyway.

Q for the OU: you currently only publish peer-reviewed items; do you plan to change this?

A: yes, this is being reviewed.

or08: Eprints 3.1

At the sucks-less-than-DSpace EPrints track today. First up: EPrints 3.1 and the future. I haven’t seen anything about 3.1 before this; v3 was released over a year ago, so I’m looking forward to seeing what is new.

9:10am: Les Carr is talking, reviewing v3, released last year. He is talking about the large amount of work surrounding a repository (for everyone), which he experienced first hand running the Soton ECS repository, and the work they have put in to help with this. He found that when he contacted academics to point out problems he had fixed with their items/records, they seemed pleased that someone was doing this. Last year they (the EPrints team) wanted to focus on ‘things on the ground’ to make things easier, and not focus too much on rejigging the internals.

9:20: 3.1 gives more control to users – manage the repository without needing technical time (especially as university IT services often want to just set something up and leave it). Showing citation impact for authors.

The EPrints 3 platform is built of two parts: the ‘core’ backend, and plugins. Plugins control everything you see (I didn’t know plugins were used to this extent). A lot of the new things are just new plugins ‘slotted in’. Plugins can be updated separately, which means upgrading specific parts of the functionality is easy and doesn’t affect the whole system.

Lots of things moved from the command line to the web interface.

Administration: a user interface for creating new fields and configuring administrative tasks (sounds good).

Easily extend metadata, what gets stored, in a nice user interface.

9:31: live demo of adding new fields via ‘manage metadata fields’; you can edit them for each dataset, e.g. document, eprints, users, imports. First you get a screen showing all existing fields, with a text field to enter a new field name (and something showing any half-created fields, so you can continue). The interface looks similar to creating an eprint item. You select from the different types of field, e.g. boolean, date, name, etc. – lots of them, with descriptions of what they are. One is a ‘set’, where you can add a list of defined options; another is ‘compound’, which can have various subfields. This is looking great.

9:38: the next screen has loads of options: required? include in export? index? As ‘name’ was selected on the previous screen, there are various options specific to the name field type, and lots more. There is help too (click on the ‘?’).

9:41: the next screen is a set of questions about how this is displayed in the user interface, i.e. the text the user would see, and help text. Again, it seems well designed. Editing the XML in the past wasn’t rocket science, but it was easy to forget steps or get the syntax wrong; plus (certainly for v2) you had to do it with no items in the archive (not easy on a live repository!).

By default new fields appear in the MISC step (screen) of the deposit process for users, which can be changed by editing the workflow.

9:53: configuration (via the web interface) is fairly crude at the moment but looks to be useful (though it wasn’t turned on for the demo repository); basically you can edit things that are in the cfg files. They plan to turn this into a full user interface in the future (not sure if for 3.1 or beyond).

9:58: running through some of the things in the cfg files, such as how to make a field mandatory only for theses.

Quality assurance: the ideas of an ‘issue’ (something amiss) and an ‘audit’.

Issue: stale or missing metadata. Issues are reported by item and also aggregated by repository; notification of issues can be emailed to authors. We can define all this – i.e. what counts as an issue – in the cfg files. It can also check for duplicates (good, as it will make the god-awful script we use at Sussex obsolete).

You can have a nightly audit and see if anyone has acted on the alerts and issues. Reports can be generated for people.

10:07: batch editing – do a search and then batch-change any fields for those search results. Nice. Running short of time, so no demo.

The ‘manage deposits’ screen (for users) has icons on the right of each of your items to view, delete, move them, etc. You can change what columns you see on this screen by using icons at the bottom of the screen, and you can also move them around.

Impact evidence: citation tracking – researchers can track citation counts and rank papers. Volatile fields don’t change the history of a record. Download counts come from IRStats.

Better bibliographies: you can reorder, choose what to view – better control. This is very much needed, as different researchers want their publication lists presented in different ways. It uses stylesheets.

Complex objects: all public objects have official URIs, and there is expanded document-level metadata.

Versioning (based on the VERSION project): ‘simple and useful’. Published material: ‘pre-, post- or reprints’. Unpublished material: early draft, working paper. Looks good.

10:19: improved document uploader – you can upload a zip file of many files.

10:25 discussion about versions, e.g. how a user may add a draft (with limited metadata) and then go on and re-edit the item later on when they have a published version.

‘Contributors’ field: roles taken from the DC relator names (225 of them). A large list of roles; you may want to cut it down.

A new skin, but not for 3.1: the record/abstract page will show a thumbnail of the item at the top, because the item is the important thing, not the metadata (which is what the UI emphasises at the moment) – in the same way that Flickr shows the photo as the main thing on the page, with the metadata at the bottom. Good idea; the new layout looks good.

The future (no time to talk about it properly): cloud computing – perhaps Amazon EPrints services (you just sign up for an IR on Amazon and one is automatically created); EPrints on top of Fedora (I saw folks on IRC talking about the same for DSpace the other day), or on the Microsoft offering just announced; and EPrints in a box (i.e. it comes out of the box as a pre-installed server) – Honeycomb.

UK repositories : growth of records

For a while now I’ve been running a weekly script which connects to ROAR and grabs the number of records for each UK-based institutional repository. I’ve finally got around to writing a web front end to this, which you can see here. It is all quite basic at the moment, and I have lots of ideas for improving it (one being to compare the average number of deposits per repository). Have a look and let me know what you think, and let me know of any bugs.

UK Repository Records Statistics (the name sucks!)
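For anyone curious about the plumbing, the weekly collection is conceptually very simple – something like the sketch below. The ROAR-fetching part is deliberately left as a stub, because I don’t want to misquote the exact export format here; the point is just ‘one row per repository per week’ in a little database, which the web front end then turns into growth tables.

#!/usr/bin/perl
# Conceptual sketch of the weekly collector: one (repository, week, record_count)
# row per run. get_roar_counts() is a stub - fill it in against whatever
# machine-readable export ROAR provides.
use strict;
use warnings;
use DBI;
use POSIX qw(strftime);

sub get_roar_counts {
    # Stub: return a hash of repository name => current record count,
    # however you end up extracting that from ROAR.
    return ( 'Sussex Research Online' => 1234, 'Another Repository' => 567 );
}

my $dbh = DBI->connect('dbi:SQLite:dbname=ircount.db', '', '', { RaiseError => 1 });
$dbh->do('CREATE TABLE IF NOT EXISTS counts
          (repository TEXT, week TEXT, records INTEGER)');

my $week   = strftime('%Y-%W', localtime);   # label the snapshot by year and week number
my %counts = get_roar_counts();
my $insert = $dbh->prepare('INSERT INTO counts (repository, week, records) VALUES (?, ?, ?)');
$insert->execute($_, $week, $counts{$_}) for sort keys %counts;

# The web front end then just compares each repository's latest row with earlier weeks.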