Event: The Data Imperative: Libraries and Research Data

Today I’m at the one-day event ‘The Data Imperative: Libraries and Research Data’ at the Oxford e-Research Centre. As usual, these are my own rough notes. There are mistakes, gaps and my own interpretation of what was said.

Paul Jeffreys : Director of IT, Oxford University.

Started off giving an overview of where this has come from. e-Research is more than just e-Infrastructure. e-Research is not just about outputs, but outputs (articles/data) are a part of this, and a discrete area to work on.

This is a cross-discipline area, it needs academics, University executive, research office, IT and Library. Libraries have skills that have to be fed in to this.

EIDCSR : ‘Enough talking, let’s try and do it’, selected two research groups to work with, but not a pilot, a long term commitment. He talks about Oxford’s commitment to a data repository, it stresses cross agencies, mentions business models and feeds in to a senior research committee (the quote is far too long to add here!).

As each HEI is facing the same issue, national activity makes sense. But how much is done locally and how much nationally?

What is the vision of research management data? To what extent is managing research data the role of the Library/librarians? Are data management and data repositories a new kind of activity? Is it Librarians or Information Professionals who are charged to take this forward? [cjk: I thought they were one and the same]

John K Milner : Project Manager UKRDS

Can’t just use existing subject specific data centres. Need for cross-discipline (eg climate change) and therefore universal standards and methods so one subject can use another subject’s data with ease.

Feasibility study:

Understand what is happening today. Where are the gaps? Avoid re-inventing the wheel.

Four Case studies (Bristol, Leeds, Leicester, Oxford), views of ~700 researchers over all disciplines (inc the arts).

What did they learn?

About half of the data has a useful life of 10 years? 26% has ‘indefinite’ value, ie ‘keep for ever’. Nearly all kept locally (memory stick, departmental server, [cjk: not good!]).

21% use a national/international data centre. 18% share with them.

UK has rich landscape of facilities, skills and infrastructure.

The management of data from a research project is now starting to be directly funded, which is important.

What are others doing? Are we in step with other countries? Yes. US spending $100 million on 5 large data centres. Australians are leading in this area, and have a central approach to it. Canada and Germany also have similar developments.

Aim: to set up a framework for research data.

Why Pathfinder: not a pilot but the start of a long term commitment.

[my notes miss a bit here, had to deal with an urgent work issue]

Service must be useful and accessible. Need a framework for stakeholder engagement.

This is non-trivial. Lots of parties involved, a lot of effort needed.

Citation of datasets is of growing interest to some researchers, this may help engage the research community.

Showing a diagram of UKRDS Basic processes. Split between ‘Research Project process’, ‘Research data sharing process’ and ‘UKRDS Services and Administration’.

Diagram doesn’t focus on curation but on accessibility (inc discovery, stable storage, identity) as this seems like the most important part. Discovery: Google. Identity (auth): Shibboleth.

Making it happen.

Need clearly defined service elements, will involve DCC, RIN and data centres.

HEIs need a reliable back-office service to handle working with data.

UKRDS is extremely challenging, nothing is easy and it is expensive. Needs support of funders and HEIs, need the right bodies to show leadership and shape policy. It will take time.

Q: Is it limited to HEI or public sector (museums etc)? A: a more complicated issue, but they are working with the likes of Connecting for Health and DEFRA.

Q: Copyright. A: HEI often don’t own copyright. Data Management Plan (Wellcome are funding Data planning as part of funding)

Q: Is it retrospective? A: Could be. [he did say more]

Q: Could UKRDS influence ‘reputational kick back’ [nice phrase!] e.g. for the REF? A: Yes, in discussion with HEFCE.

Q: Research Councils A: they are in discussion with RCs but Wellcome very much taking the lead (leap of faith) in the area. The whole key is a ‘value proposition’ which makes a case for funding this.

Q/point: Engage government/politicians.

Q: Challenge in explaining what it is, especially for subjects which are already doing something with data. How can we tap in to those already doing it? A: there is sometimes a missing link between researchers and subject national data centres. No real relationship between the two. Which is a problem in cross-subject research.

Research data management at the University of Oxford: a case study for institutional engagement – Luis Martinez, OeRC, Sally Rumsey, Oxford University Library Service

More of an ‘in practice’ talk, rather than high level.

Luis Martinez

Scoping study: ‘DataShare project’. Talking to researchers they found some couldn’t understand their own old data, some wanted to publish their own data, some found data was lost when academics moved on.

Requirements: advice/support across the research cycle (where to store it, how, etc), secure storage for large datasets, sustainable infrastructure.

Lots of different Oxford units need to be consulted (library, IT, research technology, academics, legal, repository etc).

Findings after consultation: there is actually widespread expertise in data management and curation amongst service units, and other findings. DataShare: new models, tools, workflows for academic data sharing.

Data Audit Framework (DAF): adapted this to Oxford needs and used it to document practices in research groups.

‘Policy-making for Research Data in Repositories: a guide’ [pdf]

The EIDCSR challenge: two units that both research around the human heart. The two groups share the data between them and agree to produce 3D models using the combined data. They are helping these groups do this, using a ‘life cycle approach’.

Using the DAF to capture the requirements. Participating in the UKRDS Pathfinder (as above).

They have a blog http://eidcsr.blogspot.com/

Sally Rumsey

Starts off by talking about the roles required regarding the library. They have Repository staff, librarians, curators, but not so sure about ‘data librarians’.

What sort of data should they be responsible for? Some stuff can go to a national service. There are vast datasets (eg Oxford Supercomputing centre): who has the expertise to make these specialised datasets available? Some departments already have provision in place, fine, why rock the boat.

Long tail. Everything else (not above). No other home, lots of it, academics asking for it, highly individual (ie unique), humanities and sciences.

Things to consider: Live or changing data? Freely available or restricted? Long term post project?

Showing what looks like a list of random words/letters/strings of chars, an example of some data they were asked to look after from the English department.

Showing a diagram showing that Fedora (a repository system which is strong on metadata/structure but lacks an out of the box UI) is key to the setup. Many applications can sit on top of it. Institutional Repository is just one application which runs on top of Fedora.

ORA (IR) for DATA: actual data can be held anywhere in University but ORA is a place of discovery. Allows for referencing of data. Might want to link to ‘DataBank’ (a proof of concept to show what is possible).

Databank: how do you search/discover? First things added were audio files, perhaps then photos, how do you find them?

Showing Databank. Explaining that everything has a uid so we have cool URLs, and hence you can link to it [yes!]. Explaining how you can group an audio object, a related photo object and a related text object (perhaps explaining it).
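As a rough sketch of that idea (not how Databank is actually built; the class, the field names and the base URL below are all made up for illustration), each object gets a uid, its link is derived from that uid alone, and a group is just objects recording each other's uids:

```python
from dataclasses import dataclass, field
from uuid import uuid4

# Hypothetical base address, NOT the real Databank URL.
BASE_URL = "https://databank.example.org/objects/"


@dataclass
class DataObject:
    kind: str   # e.g. "audio", "photo", "text"
    title: str
    # Every object gets its own unique id when it is created.
    uid: str = field(default_factory=lambda: uuid4().hex)
    # uids of the other objects this one is grouped with.
    related: set = field(default_factory=set)

    @property
    def url(self) -> str:
        # The link depends only on the uid, so it stays the same
        # even if the title or the file behind it changes.
        return BASE_URL + self.uid


def link(*objects):
    """Record every object in the group as related to every other one."""
    for obj in objects:
        obj.related.update(o.uid for o in objects if o is not obj)


audio = DataObject("audio", "Interview recording")
photo = DataObject("photo", "Photo of the interviewee")
text = DataObject("text", "Notes explaining the recording")
link(audio, photo, text)

print(audio.url)
print(sorted(audio.related))
```

Because the relations are stored as uids rather than file paths, each related object can be reached through its own link wherever the data is physically held.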

End of morning discussion (I’ll just note some points I picked up):

This seems to raise such huge resource implications.

DAF is flexible, you can pick elements of it to use.

Non academic repositories, such as flickr, preservation issues, if they go down. [unlike the AHDS then!]

The Research Data Management Workforce – Alma Swan, Key Perspectives

Study commissioned by JISC, looking at the ‘supply of DS [data scientists] skills’.

NSF Roles:

  • Data Authors – produce data
  • Data Managers – more technical people – often work in partnership with data authors
  • Data Users
  • Data Scientists – expert data handlers and managers (perhaps ‘Data Manager’ was a confusing name).

Our Definitions (but in practice the roles and names are fuzzy):

  • Data creators or authors
  • Data Scientists
  • Data Managers
  • Data Librarians

Data Creators

Using the DCC Curation lifecycle model, these are the outer ring. But they don’t do all of it, and they do things not on the ring, such as throw data away.

Shows a picture of an academic’s office. Data is stored in random envelopes.

Data Scientists – the focus of this study

Work with the researchers, in the same lab. Do most things in the DCC model. Are computer scientists (or can be one), experts in database technologies, ensure systems are in place, format migration. A ‘translation service’ between Researchers and computer experts.

Lots of facts about this, based on the research. Often fallen in to the role by accident, often started out as a researcher. Domain (maths, chemistry) related or Computer training. Informatics Skills: well advanced in biology and chemistry. Majority have a further degree. Need People skills. Rapidly evolving area.

Data Librarians

Only a handful in the UK. Specific skills in data care, curation. Bottom half (or bottom two thirds) of DCC model.

Library schools have not yet geared up for training. Demand is low, no established career path. Good subject-based first degree is required.

Things are changing, eg library schools are creating courses/modules around this.

Future Roles of the library

Train researchers to be more data aware.

Pressing issue: inform researchers on data principles, eg ownership.

Open Data : datasets

A growing recognition across all disciplines that articles aren’t enough, datasets are what need to be in the open.

Datasets are a resource in their own right.

Publishers do not normally claim ownership of datasets. Some do (the usual suspects).

Funders may own data, employers may own data. No one seems sure. Several entities may own the data.

In some areas of research journals play a role in enforcement.

Some journals are just data.

Using PDF for data is very very not good.

Do we leave preservation of data to publishers? [cjk: no! they should have nothing to do with this, the actors are Universities, their employees and their funders]

Simon Hodson – JISC Data Management Infrastructure Programme

Something of a problem, not easy to tackle. Would be a mistake for institutions to wait. The Call is designed to better understand how its data management facility can be taken forward.

Detailed business cases are needed.

Needs everyone (HEI, funders, data centres, RIN, etc) to be on board.

The Call will have an Advisory Group.

Exemplar projects and studies designed to help establish partnership between researchers, institutions, research councils.

See DCC as playing a major role in developing capacity and skills in the sector.

Tools and technologies: tools to help managers make business case internally, institutional planning tools (building on DAF, DRAMBORA, and costing tools). Workshop 10th June DCC to review progress/outcomes of DAF project.

Two calls planned for the early Autumn.

2 June Call: Infrastructure. To build examples within the sector. Requirements analysis -> Implementation plan -> Execution thereof -> business models.

Bids encouraged from consortia.

Briefing day 6 July. DCC will provide support for bids, including a specific helpdesk.

There may be a Digital Curation course in the next few weeks.

Libraries and Research Data Management; conclusions – Martin Lewis, Director of Library Services and University Librarian, University of Sheffield.

Martin had been chairing all day and here he sums up and brings the various threads together.

The library research data pyramid. Things at the bottom need to be in place before things higher up. At the bottom, training in library (confidence), Library schools. Then develop local data curation capacity, teach data literacy. Higher up: research data awareness, research data advice, Lead on local policy. At the very top ‘influence national data agenda’.

Summary

An excellent day and excellent knowledgeable speakers. Nice venue, and most importantly, I found the only plug socket in the room!

This is clearly an emerging area. Many are in the same position: they are aware of the (Open) Research Data developments, but nothing has yet happened at their university, nor are academics queuing up to demand such a service. This is a good thing and it needs to happen, and Universities need to start acting now. But there are many pressures on University resources at the moment. How high on the institutional priority list will this come?

[Very finally, I did another audioboo experiment. On the fly, with no pre-planning, I recorded about 2 minutes of talk during the lunch. It’s random, with no thought, many umms, a pointless ‘one more thing’ and basically wrong. Laugh at it here]