ircount : update

One Sunday morning in January this year I got an email sent automatically from the webhosting company. It contained the output of the script that ran weekly, when all ran fine the script produced no output. When something went wrong the error messages were emailed to me. Judging by the length of the email something big had gone wrong.

The script collected data from http://roar.eprints.org/ – to be used as this weeks ‘number of records’ for each repository.

The reason became clear quickly. A major revamp to ROAR had just been launch, showing off a new interface, which used the Eprints software as a platform (essential a repository or repositories). This was a great leap forward but unfortunately removed the simple text file I used to collect the data, and what was more, the IDs for each IR had changed.

I finally got around to fixing this in May. The most fiddly bit was linking the data I collected now with the data I already had. This involved matching URLs and repository names.

Anyways. Things should be more or less as they were. A few little tweaks have been added. A few bugs still remain.

As ever you can view the code and changes here: http://trac.nostuff.org/ircount/browser/trunk

And checkout the svn here: http://svn.nostuff.org/ircount/

ircount can be found here: http://www.nostuff.org/ircount/

ircount development

I’ve finally got around to spending a bit of time on the ircount code.

This post goes through some of the techy stuff behind it. If you’re just interested in features, I’m afraid there’s none yet, but you can now compare more than 4 repositories, but that’s as far as you’ll want to read. Continue reading

ircount : Repository Record Statistics

I’ve updated Repository Record  Statistics (which I refer to as ircount).

Some key points:

  • Repository names with non-standard characters were not displaying properly. They now should, though there are some parts of the site where I have not updated the code yet. Many Russian IRs are also not showing the correct name, even though it is correct in the database, which I will look in to.
  • If a repository changes its name (in ROAR) then the new name should be shown in ircount
  • When looking at the details for one Repository, you can now compare it with any other repository, not just those from the same country. You’ll see two drop down boxes, one for those from the same country (for convenience) and one listing all repositories.
  • I’ve removed the Full Text numbers, which only appeared for some repositories. They were inaccurate and not very useful.
  • There’s now a link on the homepage to ircount news (which is actually just posts on this blog which have been tagged ‘ircount’, like this one).

The code is now in a subversion (svn) repository, my first time using such a system, which should help me keep track of changes. I can make it available to others if anyone is interested.

The future

There are other changes planned…. At the moment this is all in one big database table. Each week it collects lots of info about repositories from ROAR, including their name etc, and saves one row per repository (for each week). This means lots of infomation (such as a repository’s name) is duplicated each week. It also means that when selecting the name to be displayed you have to be careful to select the latest entry for the repository in question (something which hit me badly when trying to fix the name problem, and something I still haven’t got around to fixing for all SQL queries). I’m working on improving the back-end design.

I’m also thinking about periodically connecting to the OAI-PMH interface for each IR to collect certain details directly. Though this will be quite a change of direction, at the moment, ircount’s philosophy has been simply collecting from ROAR and reporting on what it gets back. Do I want to go down the road and loose this simple model.

I’m also pondering on ways to keep track of the number of full text items in each IR (you can see some initall thoughts on this here), though this will open a big can of worms.

The stats table which shows growth of repositories (based on number of records) over time for a given country (this one), could do with some improvements. RSS feeds for various things are also on the to do list.

Technical details

How did I fix these things, and add the new features. I made these changes a few months a go (on a test area) and the exact details have already slipped my mind.

Funny characters in Repository names

Any name with a non-standard character (a-z, A-Z, 0-9) has always displayed as garbage, this became more of an issue when I expanded ircount to include repositories world-wide. Below you can see an example from a Swedish repository: Växjö University.Example of incorrect name

The script which collects the data each week outputs its collected data in to a text file as well as writing it all to the database (really just as a backup and for debugging). The names displayed in the text file were the same messed up format, just as they were on the website. As the text file had nothing to do with the MySQL database or PHP front-end, I concluded the problem was with the actually grabbing the data from the ROAR website, which uses PERL’s LWP::simple.

This was a red herring. In the end I knocked up a script which just collected the file and output it to a textfile, and all worked fine. I gradually added code from the main script and it all was still working fine. So why did not main script not work.

In the end, I can’t remember the details but I think starting a new log file, or using a different filename (stupid I know) and other random things made the funny character problem go away.  Which meant the problem was now with the DB after all, which made more sense.

In the end I found this excellent page http://www.gyford.com/phil/writing/2008/04/25/utf8_mysql_perl.php

Converting the database tables/fields to utf8_general_ci, and adding a couple of lines of code to the perl script to ensure the connection to the db was in utf (both outlined in the webpage linked to above) sorted this out. The final step was ensuring that the front end user interface selected the most recent repository name for a given IR, as older entries in the db would still have the incorrect name.

Country names

When I started collecting data for all repositories around the world I needed a way for users to be able to select a particular country. ROAR provided two digit codes, but how to display proper country names. I found a soultion using a simple to use PHP library detailed here (note it’s on two pages).

Known Bugs

  • Some Repository names still displayed wrong, e.g Russian names. Anyone know why?!
  • Table may show old names, and sometimes show two seperate rows if a repository has changed it’s name in ROAR
  • When comparing repositories, sometimes the graph does not display, especially when comparing four. This is due to the amount of data being passed to Google Charts is exceeding the maximum size of a URL (can’t remember of the top of my head but it is about 2,000 characters). This should be fixable as there is no need to pass so much data to Google charts, just need to be a little more intelligent in preparing the URL.
  • Some HTML tables are not displaying correctly, some border lines are missing, almost at random.

Playing with OAI-PMH with Simple DC

Setting up ircount has got me quite interested in OAI-PMH, so I thought I would have a little play. I was particularly interested in seeing if there was a way to count the number of full text items in a repository, as ROAR does not generally provide this information.

Perl script

I decided to use the http::oai perl module by Tim Brody (who not-so-coincidentally is also responsible for ROAR, which ircount gets its data from).

A couple of hours later I have a very basic script which will roughly report on the number of records and the number of full text items within a repository, you just need to pass it a URL for the OAI-PMH interface.

To show the outcome of my efforts, here is the verbose output of the script when pointed at the University of Sussex repository (Sussex Research Online).

Here is the output for a sample record (see here for the actual oai output for this record, you may want to ‘view source’ to see the XML):

oai:eprints.sussex.ac.uk:67 2006-09-19
Retreat of chalk cliffs in the eastern English Channel during the last century
relation: http://eprints.sussex.ac.uk/67/01/Dornbusch_coast_1124460539.pdf
MATCH http://eprints.sussex.ac.uk/67/01/Dornbusch_coast_1124460539.pdf
relation: http://www.journalofmaps.com/article_depository/europe/Dornbusch_coast_1124460539.pdf
dc.identifier: http://eprints.sussex.ac.uk/67/
full text found for id oai:eprints.sussex.ac.uk:67, current total of items with fulltext 6
id oai:eprints.sussex.ac.uk:67 is the 29 record we have seen

It first lists the identifier and date, the next line shows the title, it then shows a dc.relation field which contains a full text item on the eprints server, because it looks like a full text item and on the same server the next line shows it has found a line that MATCHed the criteria which means we add this item to the count of items with full text items attached.

The next line is another dc.identifier, again pointing to a fulltext URL for this item. However this time it is on a different server (i.e. the publishers), so this line is not treated as a fulltext item, and so it does not show a MATCH (i.e. had the first identifier line not existed, this record would not be considered one with a fulltext item).

Finally another dc.identifier is shown, then a summary generated by the script concluding that this item does have fulltext, is the sixth record seen with fulltext, and is the 29th record we have seen.

The script, as we will now see, has to use various ‘hacky’ methods to try and guess the number of fulltext items within a repository, as different systems populate simple Dublin Core in different ways.

Repositories and OAI-PMH/Simple Dublin Core.

It quickly became clear on experimenting with different repositories that the different repository software populate Simple Dublin Core in a different manner. Here are some examples:

Eprints2: As you can see above in the Sussex example, fulltext items are added as a dc.relation field, but so too are any publisher/official URLs, which we don’t want to count. The only way to differentiate between the two is to check the domain name within the dc.relation url and see if it matches that of the OAI interface we are working with. This is no means solid, quite possible for a system to have more than one hostname and what the user gives as the OAI URL may not match what the system gives as the URLs for fulltext items.

Eprints3: I’ll use the Warwick repository for this, see the HTML and OAI-PMH for the record used in this example.

<dc:format>application/pdf</dc:format>
<dc:identifier>http://wrap.warwick.ac.uk/46/1/WRAP_Slade_jel_paper_may07.pdf</dc:identifier>
<dc:relation>http://dx.doi.org/10.1257/jel.45.3.629</dc:relation>
<dc:identifier>Lafontaine, Francine and Slade, Margaret (2007) Vertical integration and firm boundaries: the evidence. Journal of Economic Literature, Vol.45 (No.3). pp. 631-687. ISSN 0022-0515</dc:identifier>
<dc:relation>http://wrap.warwick.ac.uk/46/</dc:relation>

Unlike Eprints2, the fulltext item is now in a dc.identifier field, the official/publisher URL is still a dc.relation field, which makes it easier to count the former without the latter. EP3 also seems to provide a citation of the item which is also in a dc.identifier as well. (as an aside: EPrints 3.0.3-rc-1, as used by Birkbeck and Royal Holloway, seems to act differently, missing out any reference to the fulltext).

Dspace: I’ll use Leicester’s repository, see the HTML and OAI-PMH for the record used. (I was going to use Bath’s but looks like they have just moved to Eprints!)

<dc:identifier>http://hdl.handle.net/2381/12</dc:identifier>
<dc:format>350229 bytes</dc:format>
<dc:format>application/pdf</dc:format>

This is very different to Eprints. DC.identifier is used for a link to the html page for this item (like eprints2 but unlike eprints3 which uses dc.relation for this). However it does not mention either the fulltext item or the official/publisher url at all (this record has both). The only clue that this has a full text item is the dc.format (‘application/pdf’), and so my hacked up little script looks out for this as well.

I looked at a few other Dspace based repositories (Brunel HTML / OAI ; MIT HTML / OAI) and they seemed to produce the same sort of output, though not being familiar with Dspace I don’t know if this is because they were all the same version or if the OAI-PMH interface has stayed consistent between versions.

I haven’t even checked out Fedora, bepress Digital Commons or DigiTool yet (all this is actually quite time consuming).

Commentary

I’m reluctant to come up with any conclusions because I know the people who developed all this are so damn smart. When I read the articles and posts produced by those (who were) on the OAI-PMH working group, or were in some way involved, it is clear they have a vast understanding of standards, protocols, metadata, and more. Much of what I have read is clear and well written and yet I still struggle to understand it due to my own metal shortcomings!

Yet what I have found above seems to suggest we still have a way to go in getting this right.

Imagine a service which will use data from repositories: ‘Geography papers archive’, ‘UK Working papers online’, ‘Open Academic Books search’ (all fictional web sites/services which could be created which harvest data from repositories, based on a subject/type subset).

Repositories are all about open access to the full text of research, and it seems to me that harvesters need to be able to presume that the fulltext item, and other key elements, will be in a particular field. And perhaps it isn’t too wild to suggest that one field should be used for one purpose, for example, both Dspace and Eprints provide a full citation of the item in the DC metadata, which an external system may find useful in some way, however it is in the dc.identifier field, yet various other bits of information are also in the very same field, so anyone wishing to extract citations would need to run some sort of messy test to try and ascertain which identifier field, if any, contains the citation they wish to use.

To some extent things can be improved by getting repository developers, harvester developers and OAI/DC experts round a table to agree a common way of using the format. Hmm, but ring any bells? I’ve always thought that the existence of the Bath profile was probably a sign of underlying problems with Z39.50 (though am almost totally ignorant on z39.50). even this will only solve some problems, the issue of multiple ‘real world’ elements being put in to the same field (both identifier and relation are used for a multiple of purposes), as mentioned above, is still a problem.

I know nothing about metadata nor web protocols (left with me, we would all revert to tab delimited files!), so am reluctant to suggest or declare what should happen. But there must be a better fit for our needs than Simple DC. Qualified DC being a candidate (I think, again, I know nuffing). see this page highlighting some of the issues with simple dc.

I guess one problem is that it is easy to fall in to the trap of presuming repository item = article/paper. When of course if could be almost anything, the former would be easy to narrowly define, but the latter – which is the reality – is much harder to give a clear schema for. Perhaps we need ‘profiles’ for the common different item types (articles/theses/images). I think this is the point that people will point out that (a) this has been discussed a thousand times already (b) has probably already been done!. So I’ll shut up and move on (here’s one example of what has already been said).

Other notes:

  • I wish OAI-PMH had a machine readable way of telling clients if they can harvest items, reuse the data, or even access it at all (apologies if it does allow this already). The human text of an IR policy may forbid me sucking up the data and making it searchable elsewhere, but how will I know this?
  • Peter Millington of RSP/SHERPA recently floated the idea of a OAI-PMH verb/command to report the total number of items. His point is that it should be simple for OAI servers to report such a number with ease (probably a simple SQL COUNT(*)) but at the moment OAI-PMH clients – like mine – have to manually count each item, parsing thousands of lines of data, which can take minutes, creating processing requirements for both server and client, to answer a simple question of how many items are there? I echo and support Peter’s idea of creating a count verb to resolve this.
  • Would be very handy if OAI-PMH servers could give an application name and version number as part of the response to the ‘Identify’ verb. Would be very useful when trying to work around the differences between applications and software versions.

Back to the script

Finally. I’m trying to judge how good the little script is, does it report an accurate number of full text items. If you run an IR and would be happy for me to run the script against your repository (I don’t think it creates a high load on the server), then please reply to this post. Ideally with your OAI-PMH URL and how many full text items you think you have, though neither are essential. I’ll attach the results to a comment to this post.

Food for thought, I’m pondering the need to check the dc.type of an item, and only count items of certain types, e.g. should we include images? one image of a piece of research sounds fine, 10,000 images suddenly distorts the numbers. Should it include all items, or just those that are of certain types (article, thesis etc)?

ircount : new location, new functionality

A while a go, I released a simple website which reported on the number of items in UK repositories over time. It collected its data from ROAR but by collecting it on a weekly basis could provide a table showing growth week by week.

First it has a new home: http://www.nostuff.org/ircount/

Secondly, it now collects data for every institutional (and departmental) repository registered in ROAR across the world. Not just the UK. It has been collecting the data since July.

The country integration isn’t perfect, you have to select a country, and then you are more or less restricted to that country (though you can hack it, see the ‘info&help’), and there is a lot of potential with improving this. There are also a couple of bugs, for example when comparing four repositories it seems to (a) forget which country you were dealing with, and (b) it stops showing the graph/chart.

I’m currently looking at trying to make an educated guess at how many fulltext items are in a given repository. This is proving to be a steep learning curve in the joys of OAI-PMH, and how the different repository systems (and the different versions on these systems) have allocated information about the fulltext in to different Dublin Core (DC) elements. But this is for another post.

In the mean time, I hope the worldwide coverage is of some use, and feel free to leave any comments.

UK repositories : growth of records

For a while now I’ve been running a weekly script which connects to ROAR and grabs the number of records for each UK based Institutional Repository. I’ve finally got around to writing a web front end to this, which you can see here. All quite basic at the moment, and I have lots of ideas of what I could do to improve this (one idea based on the compare average number of deposits per repository). Have a look and let me know what you think, and let me know of any bugs.

UK Repository Records Statistics (the name sucks!)