The Data Imperative: Libraries and Research Data : comment

I have put this into a separate post. It continues on from my previous post, but I didn’t want my notes of the day to be taken over by my ill-thought views.

Personal Thoughts

Reluctant to give some thoughts as I know so little about the service. However… (!)

There seem to be two clear areas here: Data formatting and Data storing. There is some linkage (preserving surely covers both: formats can become obsolete, servers die), yet the two seem to be somewhat separate.

Both require IT skills, but IT is a broad church. The former is technical metadata (and is very much IT and library) and in the general area that I see covered in the Eduserv efoundations blog.

The latter in its simplest form is hard core infrastructure: disks, SANs, servers, security. But it also has elements at the application level (how do we access it, using what software, repositories? CRIS? Fedora?).

On another issue, while it is easy to say that libraries should take the lead, I think we need to be cautious. With the current climate of frozen or decreasing budgets nationally, and journal subscription pressure, how wise is it to go to the University’s executive and demand funding for resources/staff for data management? We know it’s important and could make the process of research more efficient, but there are other things higher up a University’s list of priorities (NSS/attracting good students, REF, research funding). Even at a library level, journals help researchers do research (which brings funding), and keep students happy because we have the stuff they need (NSS). How many journals should we cancel to focus on Research Data? Why? The recent JISC call will help with providing a business case.

The problem at the moment is that there are not enough clear benefits for most Universities to steam ahead with this. Let’s clarify this: not enough benefits for the institution itself. The benefits are for the UK as a whole (actually, the whole world). It’s the UK-wide economy and research that will benefit. So maybe it needs UK-wide funding. It’s easier to convince someone (or something) to spend money when the benefits for them are clear. In this case the benefits are for the UK, so it should be the UK which sets aside explicit cash (via HEFCE, JISC, and so on).

And this is happening with the JISC call (talked about today); amongst other things it will help build examples.

But I’m not sure if the institutional level is the best one. Australia has been successful with a centralised approach. We have a number of small Universities, and those which only have one or two departments which are research active. Yet the resources/knowledge required of them will be similar to that of a large institution. Will this leave them at a disadvantage?

On another note, it seems the range of data is vast. When discussing this, I always – incorrectly – picture text based data, of varying size, perhaps using XML. Of course this is blinkered. For audio, images and similar, should a data service just provide a method to download, or a method to browse and view/listen? When it comes to storage and delivery, should we just treat all data as ‘blobs’ – things to be downloaded as a file, and we do nothing more with it? This makes it easy, and repository software applications (eprints/dspace/fedora) are well placed to cater to this need. But I get the impression that this is somewhat simplistic. Perhaps this means a data service needs a clear scope, otherwise we could end up building front end applications which mimic flickr, youtube and last.fm all in one. A costly path to go down.

[all views are my own. are wrong, badly worded, ill thought, why are you reading this?, just think the opposite and it will be right, etc]

Event: The Data Imperative: Libraries and Research Data

Today I’m at the one day event ‘The Data Imperative: Libraries and Research Data’ at the Oxford e-research Centre. As usual, these are my own rough notes. There are mistakes, gaps and my own interpretation of what was said.

Paul Jeffreys : Director of IT, Oxford University.

Started off giving an overview of where this has come from. e-Research is more than just e-Infrastructure. e-Research is not just about outputs, but outputs (articles/data) are a part of this, and a discrete area to work on.

This is a cross-discipline area, it needs academics, University executive, research office, IT and Library. Libraries have skills that have to be fed in to this.

EIDCSR : ‘Enough talking, let’s try and do it’, selected two research groups to work with, but not a pilot, a long term commitment. He talks about Oxford’s commitment to a data repository, it stresses cross agencies, mentions business models and feeds in to a senior research committee (the quote is far too long to add here!).

As each HEI is facing the same issue, it makes sense for there to be national activity. But how much is done locally and how much is done nationally?

What is the vision of research management data? To what extent is managing research data the role of the Library/librarians? Is data management and data repositories a new kind of activity? Is it Librarians or Information Professionals who are charged to take this forward? [cjk: i thought they were one and the same]

John K Milner : Project Manager UKRDS

Can’t just use existing subject specific data centres. Need for cross-discipline (eg climate change) and therefore universal standards and methods so one subject can use another subject’s data with ease.

Feasibility study:

Understand what is happening today? where are the gaps. Avoid re-inventing the wheel.

Four Case studies (Bristol, Leeds, Leicester, Oxford), views of ~700 researchers over all disciplines (inc the arts).

What did they learn?

About half of the data has a useful life of 10 years? 26% has ‘indefinite’ value, ie ‘keep for ever’. Nearly all kept locally (memory stick, departmental server, [cjk: not good!]).

21% use a national/international data centre. 18% share with them.

UK has rich landscape of facilities, skills and infrastructure.

The management of data from a research project is now starting to be directly funded, which is important.

What are others doing? Are we in step with other countries? Yes. US spending $100 million on 5 large data centres. Australians are leading in this area, and have a central approach to it. Canada and Germany also have similar developments.

Aim: to set up a framework for research data.

Why Pathfinder: not a pilot but the start of a long term commitment.

[my notes miss a bit here, had to deal with an urgent work issue]

Service must be useful and accessible. Need a framework for stakeholder engagement.

This is non-trivial. Lots of parties involved, a lot of effort needed.

Citation of datasets is of growing interest to some researchers, this may help engage the research community.

Showing a diagram of UKRDS Basic processes. Split between ‘Research Project process’, ‘Research data sharing process’ and ‘UKRDS Services and Administration’.

Diagram doesn’t focus on curation but on accessibility (inc discovery, stable storage, identity) as this seems like the most important part. Discovery:Google, Identity(auth):Shibboleth.

Making it happen.

Need clearly defined service elements, will involve DCC, RIN and data centres.

HEIs need a reliable back-office service to handle working with data.

UKRDS is extremely challenging, nothing is easy and it is expensive. Needs support of funders and HEIs, need the right bodies to show leadership and shape policy. It will take time.

Q: Is it limited to HEI or public sector (museums etc)? A: a more complicated issue, but they are working with the likes of Connecting for Health and DEFRA.

Q: Copyright. A: HEI often don’t own copyright. Data Management Plan (Wellcome are funding Data planning as part of funding)

Q: Is it retrospective? A: Could be. [he did say more]

Q: Could UKRDS influence ‘reputational kick back’ [nice phrase!] e.g. for the REF. A: Yes, in discussion with HEFCE.

Q: Research Councils A: they are in discussion with RCs but Wellcome very much taking the lead (leap of faith) in the area. The whole key is a ‘value proposition’ which makes a case for funding this.

Q/point: Engage government/politicians.

Q: Challenge in explaining what it is, especially for subjects which are already doing something with data. How can we tap in to those already doing it? A: there is sometimes a missing link between researchers and subject national data centres. No real relationship between the two. Which is a problem in cross-subject research.

Research data management at the University of Oxford: a case study for institutional engagement – Luis Martinez, OeRC, Sally Rumsey, Oxford University Library Service

More of an ‘in practice’ talk, rather than high level.

Luis Martinez

Scoping study: ‘DataShare project’. Talking to researchers they found some couldn’t understand their own old data, some wanted to publish their own data, some found data was lost when academics moved on.

Requirements: Advice/support across research cycle (where to store it, how, etc), Secure Storage for large datasets. Sustainable infrastructure.

Lots of different Oxford units need to be consulted (library, it, research technology, academics, legal, repository etc).

Findings after consultation: there is actually widespread expertise in data management and curation amongst service units, and other findings. DataShare: new models, tools, workflows for academic data sharing.

Data Audit Framework: (DAF) adapted this to Oxford needs and used it to document practices in research groups.

‘Policy-making for Research Data in Repositories: a guide’ [pdf]

The EIDCSR challenge: two units that both research around the human heart. The two groups share the data between them and agree to produce 3d models using the combined data. They are helping these groups do this, using a ‘life cycle approach’.

Using the DAF to capture the requirements. Participating in the UKRDS Pathfinder (as above).

They have a blog http://eidcsr.blogspot.com/

Sally Rumsey

Starts off by talking about the roles required regarding the library. They have Repository staff, librarians, curators, but not so sure about ‘data librarians’.

What sort of data should they be responsible for? Some stuff can go to a national service. There are vast datasets (eg Oxford Supercomputing centre): who has the expertise to make these specialised datasets available? Some departments already have provision in place, fine, why rock the boat.

Long tail. Everything else (not above). No other home, lots of it, Academics asking for it, highly individual (ie unique), humanities and sciences.

Things to consider: Live or changing data? Freely available or restricted? Long term post project?

Showing what looks like a list of random words/letters/strings of chars, an example of some data they were asked to look after from the English department.

Showing a diagram showing that Fedora (a repository system which is strong on metadata/structure but lacks an out of the box UI) is key to the setup. Many applications can sit on top of it. Institutional Repository is just one application which runs on top of Fedora.

ORA (IR) for DATA: actual data can be held anywhere in University but ORA is a place of discovery. Allows for referencing of data. Might want to link to ‘DataBank’ (a proof of concept to show what is possible).

Databank: how do you search/discover? First things added were audio files, perhaps then photos, how do you find them?

Showing Databank. Explaining that everything has a uid so we have cool URLs, and hence you can link to it [yes!]. Explaining how you can group an audio object, a related photo object and a related text object (perhaps explaining it).

End of morning discussion (I’ll just note some points I picked up):

This seems to raise such huge resource implications.

DAF is flexible, you can pick elements of it to use.

Non academic repositories, such as flickr, preservation issues, if they go down. [unlike the AHDS then!]

The Research Data Management Workforce – Alma Swan, Key Perspectives

Study commissioned by JISC, looking at the ‘supply of DS [data scientists] skills’.

NSF Roles:

  • Data Authors – produce data
  • Data Managers – more technical people – often work in partnership with data authors
  • Data Users
  • Data Scientists – expert data handlers and managers (perhaps ‘Data Manager’ was a confusing name).

Our Definitions (but in practice the roles and names are fuzzy):

  • Data creators or authors
  • Data Scientists
  • Data Managers
  • Data Librarians

Data Creators

Using DCC Curation lifecycle model, these are the outer ring. But not all of it, and they do things not on the ring, such as throw data away.

Shows picture of an academic’s office. Data is stored in random envelopes.

Data Scientists – the focus of this study

Work with the researchers, in the same lab. Do most things in the DCC model. Are computer scientists (or can be one), experts in database technologies, ensure systems are in place, format migration. A ‘translation service’ between Researchers and computer experts.

Lots of facts about this, based on the research. Often fallen in to the role by accident, often started out as a researcher. Domain (maths, chemistry) related or Computer training. Informatics Skills: well advanced in biology and chemistry. Majority have a further degree. Need People skills. Rapidly evolving area.

Data Librarians

Only a handful in the UK. Specific skills in data care, curation. Bottom half (or bottom two thirds) of DCC model.

Library schools have not yet geared up for training. Demand is low, no established career path. Good subject-based first degree is required.

Things are changing, eg library schools are creating courses/modules around this.

Future Roles of the library

train researchers to be more data aware

Pressing issue: inform researchers on data principles, eg ownership.

Open Data : datasets

A growing recognition across all disciplines that articles aren’t enough, datasets are what are needed to be in the open.

Datasets are a resource in their own right.

Publishers do not normally claim ownership of datasets. Some do (usual suspects).

Funder may own Data, Employers may own data. No one seems sure. Several entities may own the data.

In some areas of research journals play role in enforcement.

Some journals are just data.

Using PDF for data is very very not good.

Do we leave preservation of data to publishers [cjk: no! they should have nothing to do with this, the actors are Universities, their employees and their funders]

Simon Hodson – JISC Data Management Infrastructure Programme

Something of a problem, not easy to tackle. Would be a mistake for institutions to wait. The Call is designed to better understand how its data management facility can be taken forward.

Detailed business cases are needed.

Needs everyone (HEI, funders, data centres, RIN, etc) to be on board.

The Call will have an Advisory Group.

‘Exemplar’ projects and studies designed to help establish partnership between researchers, institutions, research councils.

See DCC as playing a major role in developing capacity and skills in the sector.

Tools and technologies: tools to help managers make the business case internally, institutional planning tools (building on DAT, DRAMBORA, and costing tools). Workshop 10th June at DCC to review progress/outcomes of DAT project.

Two calls planned for the early Autumn.

2 June Call: Infrastructure. To build examples within the sector. Requirements analysis -> Implementation plan -> Execution thereof -> business models.

Bids encouraged from consortia.

Briefing day 6 July. DCC will provide support for bids, including a specific helpdesk.

There may be a Digital Curation course in the next few weeks.

Libraries and Research Data Management; conclusions – Martin Lewis, Director of Library Services and University Librarian, University of Sheffield.

Martin had been chairing all day and here he sums up and brings the various threads together.

The library research data pyramid. Things at the bottom need to be in place before things higher up. At the bottom, training in library (confidence), Library schools. Then develop local data curation capacity, teach data literacy. Higher up: research data awareness, research data advice, Lead on local policy. At the very top ‘influence national data agenda’.

Summary

An excellent day and excellent knowledgeable speakers. Nice venue, and most importantly, I found the only plug socket in the room!

This is clearly an emerging area. Many are in the same position: they are aware of the (Open) Research Data developments, but nothing has yet happened at their university, nor are academics queuing up to demand such a service. This is a good thing and it needs to happen, and Universities need to start acting now. But there are many pressures on University resources at the moment. How high on the institutional priority list will this come?

[Very finally, I did another audioboo experiment. On the fly, with no pre-planning, I recorded about 2 minutes of talk during the lunch. It’s random, with no thought, many umms, a pointless ‘one more thing’ and basically wrong. laugh at it here]

“Sitting on a gold mine” – improving provision and services for learners by aggregating and using ‘learner behaviour data’

I’m at a workshop today called “Sitting on a gold mine” – improving provision and services for learners  by aggregating and using ‘learner behaviour data’ (it rolls off the tongue!), which is part of a wider JISC TILE project looking at, in a nutshell, how we can use data collected from user and user activity to provide useful services, and the issues and challenges involved (and some Library 2.0 concepts as well). As ever, these are just my notes, at some points I took more notes than others, there will be mistakes and I will badly misquote the speakers, please keep this in mind.

There’s quite a bit of ‘workshop’ discussion coming up, which I’m a little tentative about as I can rant on about many things for hours, but not sure I have a lot of views on this other than ‘this is good stuff’!

Pain Points & Vision – David Kay (TILE)

David gave an overview of the TILE project. Really interesting stuff, lots covered and good use of slides, but quite difficult to get everything down here.

TILE has three objectives

  • Capture scope/scale of Library 2.0
  • Identify significant challenges facing library system developments
  • Propose high level ‘library domain model’ positioning these challenges in the context of library ‘business processes’

You can get context from click streams, this is done by the likes of Amazon and e-music providers.

E.g. First year students searching for Napoleon also borrowed… they downloaded… they rated this resource… etc.

David referred to an idea of Lorcan Dempsey : we get too bogged down by the mechanics of journals and provision without looking at the wider business processes in the new ‘web’ environment.

Four ‘systems’ in the TILE architecture: Library systems (LMS, cross search, ERM), VLE, Repositories and associated content services, we looked at a model of how these systems interact with the user in the middle.

Mark Tool (University of Stirling)

Mark (who used to be based down the road at the University of Brighton) talking about the different systems Stirling (and the other Universities he has worked at) use and how we all don’t really know how users use them. Not just now, but historical trends, e.g. are users using e-books more now than in the past?

These questions are important to lecturers as they point students to resources and systems, but what do users actually use, and how do they use them? Also a quality issue: are we pointing them to the right resources? Are we getting good value for money? e.g. licence and staff costs for a VLE.

If we were to look at how different students look at different resources, would we see that ‘high achievers’ use different resources to weaker students? Could/should we point the weaker students to the resources that the former use? Obvious privacy implications.

Also could be of use when looking at new courses and programmes and how to resource them. Nationally, might help guide us to which resources we should be negotiating for at a national level.

Danger:

  • small crowd -> small dataset  -> can be misleading (one or two people can look like a trend)
  • HEI’s very different to each other.

Thinks we should run some smallish pilots and then validate the data collected by some other means.
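To picture the ‘small crowd’ danger, here is a rough sketch of my own (not something shown on the day). It assumes loan records are simple (borrower, item) pairs, and the helper names and the `MIN_BORROWERS` threshold are made up for illustration. The idea is to count distinct borrowers rather than raw loans, so that one keen person taking out the same book again and again does not look like a trend.

```python
from collections import defaultdict

# Hypothetical threshold: ignore items used by fewer distinct people than this.
MIN_BORROWERS = 3

def distinct_borrowers(loans):
    """Map each item to the set of distinct borrowers who took it out."""
    borrowers = defaultdict(set)
    for borrower, item in loans:
        borrowers[item].add(borrower)
    return borrowers

def trending_items(loans, min_borrowers=MIN_BORROWERS):
    """Return items borrowed by at least `min_borrowers` different people."""
    return sorted(
        item
        for item, people in distinct_borrowers(loans).items()
        if len(people) >= min_borrowers
    )

# Made-up data: book A has five loans but only one borrower.
loans = [
    ("u1", "A"), ("u1", "A"), ("u1", "A"), ("u1", "A"), ("u1", "A"),
    ("u2", "B"), ("u3", "B"), ("u4", "B"),
]
print(trending_items(loans))  # ['B']
```

Book A has more loans than book B, but only B clears the threshold. Where the threshold should sit is exactly the sort of thing a small pilot would need to work out.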

Joy Palmer – MIMAS

Will mainly be talking about COPAC, which has done some really interesting stuff recently in opening up their data and APIs (see the COPAC blog).

What are COPAC working on:

  • Googlisation of records (will be available on Google soon)
  • Links to Digital content
  • Service coherency with zetoc and suncat
  • Personalisation tools / APIs
    • ‘My Bibliography’
    • Tagging facilities
    • Recommend-er functions
    • ummm other stuff I didn’t have time to note
  • Generally moving from a ‘Walled garden’ to something that can be mashed up [good!]

One example of a service from COPAC is the ‘My bibliography’ (or ‘marked list’ ) which can be exported in the ATOM format (which allows it to be used anywhere that takes an ATOM feed). These lists will be private by default but could be made public.

Talked about the general direction and ethos of COPAC development with lots of good examples, and the issues involved. One of the slides was titled:  From ‘service’ to ‘gravitational hub’ which I liked. She then moved on to her (and MIMAS/COPAC’s) perspective on the issue of using user generated data.

Workshop 1.

[Random notes from the group I was in, mainly the stuff that I agreed with(!), there were three groups] Talking about whether we should do this, and the threats (and which groups of people are affected by these threats). Good discussion. We talked about how these things could be useful, and why some may be averse to/cautious of it (inc. privacy, and treading on others’ areas: IT/library telling academics that what they are recommending to students is not being used, i.e. telling them they are doing it wrong, creates friction). Should we do this? Blunt tool, may see wrong trends. But need to give it a go, and see what happens. Is it ‘anti-HE’ to be offering such services (i.e. recommending books)? No no no! Should we leave it to the likes of Google/Amazon? No, this is where the web is going. But there was real world experience of things to be aware of, e.g. a catalogue ranking an edition of a book high due to high usage led to a newer edition being further down the list. [lots more discussion, I forget]
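Thinking about the edition problem afterwards, one possible way round it (my own guess, not something proposed in the group) is to add up usage across all editions of a work and then show the newest edition. The sketch below assumes each catalogue record carries a hypothetical `work` identifier, a `year` and a `loans` count; real catalogues may not link editions this neatly.

```python
from collections import defaultdict

def rank_works(editions):
    """Rank works by combined loans, presenting each work's newest edition."""
    by_work = defaultdict(list)
    for edition in editions:
        by_work[edition["work"]].append(edition)

    ranked = []
    for work, group in by_work.items():
        total_loans = sum(e["loans"] for e in group)
        newest = max(group, key=lambda e: e["year"])
        ranked.append((total_loans, work, newest["year"]))
    # Highest combined usage first; ties broken alphabetically by work.
    ranked.sort(key=lambda row: (-row[0], row[1]))
    return [(work, year) for _, work, year in ranked]

# Made-up records: the old edition has nearly all the loans.
editions = [
    {"work": "Chemistry Basics", "year": 2001, "loans": 90},
    {"work": "Chemistry Basics", "year": 2008, "loans": 5},
    {"work": "Lab Methods", "year": 2006, "loans": 40},
]
print(rank_works(editions))
```

Here the 2008 edition inherits the popularity of the 2001 one, so it is listed first instead of being buried.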

Dave Pattern – Huddersfield.

[Dave is the system librarian at Huddersfield, who has ideas better than me, then implements them better than I ever could, in a fraction of the time. He’s also a great speaker. I hate him. Check out his annoyingly fantastic blog]

Lots of data is generated just by doing what we and users need to do, and we can dig into this. Dave starts off talking about supermarket loyalty cards. Supermarkets were doing ‘people who bought this also bought’ 10 or more years ago. We can learn from them, we could do this.

We’ve been collecting circ data for years, so why haven’t we done anything (bar real basic stuff) with it?

Borrowing suggestions (people who borrowed this also borrowed), working at Hud, librarians report it working well and suggesting the same books as they would.
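For my own benefit, a minimal sketch of how ‘people who borrowed this also borrowed’ could work. This is not Dave’s code; it assumes anonymised circulation data boiled down to (borrower, item) pairs, and the function names are made up. It simply counts how often two items turn up in the same person’s borrowing history.

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_cooccurrence(loans):
    """Count how often each pair of items was borrowed by the same person."""
    items_by_borrower = defaultdict(set)
    for borrower, item in loans:
        items_by_borrower[borrower].add(item)

    also_borrowed = defaultdict(Counter)
    for items in items_by_borrower.values():
        for a, b in combinations(sorted(items), 2):
            also_borrowed[a][b] += 1
            also_borrowed[b][a] += 1
    return also_borrowed

def suggestions(also_borrowed, item, limit=3):
    """Return up to `limit` items most often borrowed alongside `item`."""
    counts = also_borrowed.get(item, Counter())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:limit]]

# Made-up loans for three anonymised borrowers.
loans = [
    ("u1", "Napoleon"), ("u1", "Waterloo"),
    ("u2", "Napoleon"), ("u2", "Waterloo"), ("u2", "Trafalgar"),
    ("u3", "Napoleon"), ("u3", "Waterloo"),
]
table = build_cooccurrence(loans)
print(suggestions(table, "Napoleon"))  # ['Waterloo', 'Trafalgar']
```

A real service would presumably want a minimum count before suggesting anything, otherwise a single borrower’s odd combination becomes a recommendation.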

Personalised Suggestions, if you log in, looking at what they borrowed and then what other items those who borrowed the

Lending paths: paths which join books together. Potentially to predict what people will borrow and predict when particular books will be in high demand.

Library catalogue shows some book usage stats when used from a library staff PC (brilliant idea!) this can be broken down by different criteria (i.e. the courses borrowers are on).

Other functionality: Keyword suggestions, Common zero results keywords (eg, newspapermen, asbo, disneyfication). Huddersfield have found digging useful.
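The ‘common zero results keywords’ idea is simple enough to sketch. This assumes (my assumption, not Huddersfield’s actual setup) that the catalogue’s search log can be reduced to (keyword, result_count) pairs; the function name is made up.

```python
from collections import Counter

def common_zero_result_keywords(search_log, limit=10):
    """Return the most frequent keywords whose searches found nothing."""
    misses = Counter(
        keyword.strip().lower()
        for keyword, result_count in search_log
        if result_count == 0
    )
    return misses.most_common(limit)

# Made-up log entries; case and stray spaces are normalised before counting.
search_log = [
    ("asbo", 0),
    ("ASBO", 0),
    ("disneyfication", 0),
    ("napoleon", 12),
    ("asbo ", 0),
]
print(common_zero_result_keywords(search_log))
# [('asbo', 3), ('disneyfication', 1)]
```

The resulting list could then feed keyword suggestions or flag gaps in the collection.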

He’s released XML data of anonymised  circulation data, with approval of the library, for others to play with and hopes other libraries will do the same. (This is a stupidly big announcement, it feels insulting to put it just as one sentence like this, perhaps I should enclose it in the <blink> tag!?) See his blog post.

(note to self, don’t try to download 50mb file via 3g network usb stick – bad things happen to macbook)

Mark van Harmelen

Due to bad things was slightly distracted during part of this talk. Being a man completely failed to multi-task.

This was an excellent talk (at a good level) about how the TILE project is building prototype/real system(s). Some real good models of how this will/could work.  So far have developed harvesting data from institutions (and COPAC/similar services) and adding ‘group use’ to their database, a searcher known to be ‘chemistry student’ and ‘third year’ can then get relevant recommendations based on data from the groups they belong to. [I’m not doing this justice, but some really good models and examples of this working]
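To help me remember the ‘group use’ idea, a rough sketch under my own assumptions (this is not the TILE prototype): usage is harvested as (group, item) pairs, with groups such as ‘chemistry’ or ‘third year’, and a searcher gets items ranked by how often their groups used them. The function name and data shape are made up.

```python
from collections import Counter

def group_recommendations(usage, searcher_groups, limit=3):
    """Rank items by how often they were used by the searcher's groups.

    `usage` is an iterable of (group, item) pairs, already aggregated so
    that no individual borrower appears in the data.
    """
    scores = Counter(item for group, item in usage if group in searcher_groups)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:limit]]

# Made-up harvested usage.
usage = [
    ("chemistry", "Organic Chemistry"),
    ("chemistry", "Organic Chemistry"),
    ("third year", "Dissertation Skills"),
    ("chemistry", "Lab Safety"),
    ("history", "Napoleon"),
    ("third year", "Organic Chemistry"),
]
print(group_recommendations(usage, {"chemistry", "third year"}))
# ['Organic Chemistry', 'Dissertation Skills', 'Lab Safety']
```

Because only group labels are stored, this kind of model sidesteps some (not all) of the privacy worries raised earlier in the day.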

David Jennings – Music Recommender systems

First off refers to the Googlezon film (never heard of this before) and the idea of big brother in the private sector, and moves on and talks about (concept of) ipods which predict the music you want to hear next based on your mood and even matchmaking based on how you react to music.

Discovery: We search, we browse, we wait for things to come along, we follow others, we avoid things everyone else listens to, etc.

Talking about flickr’s (not published) popularity ranking as a way to bring things to the front based on views, comments, tags etc.

Workshop 2:

Some random comments and notes from the second discussion session (from all groups)

One University’s experience was that just ‘putting it out there’ didn’t work, no one added tags to catalogue, conclusion was the need of community.

Coldstart problem: new content not surfacing with the sort of things being discussed here.

Is a Subject Librarian’s (or researcher’s) recommendation of the same value as an undergrad’s?

Will Library Directors agree for library data to be released in the same way as Huddersfield, even though it is anonymised? They may fear the risks and issues that it could result in, even if we/they are not sure what those risks are (will an academic take issue with a certain aspect of the released data?).

At a national level, if academics used these services to create reading lists, it may result in homogenisation of teaching across the UK. Also a risk of students’ reading focusing on a small group of items/books; we could end up with four books per subject!

Summary

This was an excellent event, and clearly some good and exciting work is taking place. What are my personal thoughts?…

This is one of those things that once you get discussing it you’re never quite sure why it already hasn’t been done before, especially with circulation data. There’s a wide scope, from local library services (book recommendation) to national systems which use data from VLEs, registry systems and library systems. A lot of potential functionality, both in terms of direct user services and informing HE (and others) to help them make decisions and tailor services for users.

Challenges include: privacy, copyright, resourcing (money) and the uncertainty of (and aversion to) change. The last one includes a multitude of issues: will making data available to others lead to a budget reduction for a particular department, will it create friction between different groups (e.g. between academics and central services such as Libraries and IT)?

Perhaps the biggest fear is not knowing what demons this will release. If you are a Library Director, and you authorise your organisation’s data to be made available – or the introduction of a service such as the ones discussed today – how will it come back to haunt you in the future? Will it lead to your institution making (negative) headlines? Will a system/service supplier sue you for giving away ‘their’ data?  Will academics turn on you in Senate for releasing data that puts them in a bad light? ‘Data’ always has more complex issues than ‘services’.

In HE (and I say this more after talking to various people at different institutions over the last few years) we are sometimes too fearful of the 20% instead of thinking about the 80% (or is that more 5/95%). We will always get complaints about new services and especially about changes. No one contacts you when you are doing well (how many people contact Tesco to tell them they have allocated the perfect amount of shelf space to bacon?!) We must not let complaints dictate how we do things or how we allocate time (though of course not ignore them, relevant points can often be found).

Large organisations – both public and private – can be well known for being inflexible. But for initiatives like this (and those in the future) to have a better chance of succeeding we need to look at how we can bring down the barriers to change. This is too big an issue to get into here, and the reasons are both big and many, from too many stakeholders requiring approval to a ‘wait until the summer vacation’ philosophy, from long term budget planning to knock-on effects across the organisation (change in department A means training/documentation/website of Department B needs to be changed first). Hmmmm, seem to have moved away from TILE and on to a general rant offending the entire UK HE sector!

Thinking about Dave Pattern’s announcement, what will it take for other libraries to follow? First, techy stuff, he has (I think) created his own XML schema (is that the right term?) and will be working on an API to access the data. The bad thing would be for a committee to take this and spend years to finally ‘approve’ it. The Good thing would be for a few metadata/XML type people to suggest minor changes (if any) and endorse it as quickly as possible (which is no disrespect to Dave). Example: will the use of UCAS codes be a barrier for international adoption (can’t see why, just thinking out loud). There was concern at the event that some Library Directors would be cautious in approving such things. This is perhaps understandable. However, I have to say I don’t even know who the Director of Huddersfield Information Services is, but my respect for the institution and the person in that role goes about as high as it will go when they do things like this. They have taken a risk, taken the initiative and been the first to do something (to the best of my knowledge) worldwide. I will buy them a beer should I ever meet them!

I’ll be watching any developments (and chatter) that result from this announcement, and thinking about how we can support/implement such an initiative here. In theory once (programming) scripts have been written for a library system, it should be fairly trivial to port them to other customers of the same software (work will probably include mapping departments to UCAS codes, and the way user affiliation to departments is stored may vary between Universities). Perhaps Universities could club together to work on creating the code required? I’m writing this a few hours after Dave made his announcement and already his blog article has many trackbacks and comments.

So in final, final conclusion. A good day, with good speakers and a good group of attendees from mixed backgrounds. Will watch developments with interest.

[First blog post using WordPress 2.7, other blogs covering are Phil’s CETIS blog, and Dave Pattern has another blog entry on his talk. If you have written anything on this event then please let me know!]