Event: The Data Imperative: Libraries and Research Data

Today I’m at the one day event ‘The Data Imperative: Libraries and Research Data’ at the Oxford e-research Centre. As usual, these are my own rough notes. There are mistakes, gaps and my own interpretation of what was said.

Paul Jeffreys : Director of IT, Oxford University.

Started off giving an overview of where this has come from. e-Research is more than just e-Infrastructure. e-Research is not just about outputs, but outputs (articles/data) are a part of this, and a discrete area to work on.

This is a cross-discipline area, it needs academics, University executive, research office, IT and Library. Libraries have skills that have to be fed in to this.

EIDCSR : ‘Enough talking, let’s try and do it’. They selected two research groups to work with, not as a pilot but as a long term commitment. He talks about Oxford’s commitment to a data repository; it stresses working across agencies, mentions business models and feeds in to a senior research committee (the quote is far too long to add here!).

As each HEI is facing the same issue, it makes sense to have national activity. But how much is done locally and how much is done nationally?

What is the vision of research management data? To what extent is managing research data the role of the Library/librarians? Is data management and data repositories a new kind of activity? Is it Librarians or Information Professionals who are charged to take this forward? [cjk: i thought they were one and the same]

John K Milner : Project Manager UKRDS

Can’t just use existing subject specific data centres. Need for cross-discipline (eg climate change) and therefore universal standards and methods so one subject can use another subject’s data with ease.

Feasibility study:

Understand what is happening today. Where are the gaps? Avoid re-inventing the wheel.

Four Case studies (Bristol, Leeds, Leicester, Oxford), views of ~700 researchers over all disciplines (inc the arts).

What did they learn?

About half of the data has a useful life of 10 years? 26% has ‘indefinite’ value, ie ‘keep for ever’. Nearly all kept locally (memory stick, departmental server, [cjk: not good!]).

21% use a national/international data centre. 18% share with them.

UK has rich landscape of facilities, skills and infrastructure.

The management of data from a research project is now starting to be directly funded, which is important.

What are others doing? Are we in step with other countries? Yes. US spending $100 million on 5 large data centres. Australians are leading in this area, and have a central approach to it. Canada and Germany also have similar developments.

Aim: to set up a framework for research data.

Why Pathfinder: not a pilot but the start of a long term commitment.

[my notes miss a bit here, had to deal with an urgent work issue]

Service must be useful and accessible. Need a framework for stakeholder engagement.

This is non-trivial. Lots of parties involved, a lot of effort needed.

Citation of datasets is of growing interest to some researchers, this may help engage the research community.

Showing a diagram of UKRDS Basic processes. Split between ‘Research Project process’, ‘Research data sharing process’ and ‘UKRDS Services and Administration’.

Diagram doesn’t focus on curation but on accessibility (inc discovery, stable storage, identity) as this seems like the most important part. Discovery:Google, Identity(auth):Shibboleth.

Making it happen.

Need clearly defined service elements, will involve DCC, RIN and data centres.

HEIs need a reliable back-office service to handle working with data.

UKRDS is extremely challenging, nothing is easy and it is expensive. Needs support of funders and HEIs, need the right bodies to show leadership and shape policy. It will take time.

Q: Is it limited to HEI or public sector (museums etc)? A: a more complicated issue, but they are working with the likes of Connecting for Health and DEFRA.

Q: Copyright. A: HEI often don’t own copyright. Data Management Plan (Wellcome are funding Data planning as part of funding)

Q: Is it retrospective? A: Could be. [he did say more]

Q: Could UKRDS influence ‘reputational kick back’ [nice phrase!] e.g. for the REF? A: Yes, in discussion with HEFCE.

Q: Research Councils A: they are in discussion with RCs but Wellcome very much taking the lead (leap of faith) in the area. The whole key is a ‘value proposition’ which makes a case for funding this.

Q/point: Engage government/politicians.

Q: Challenge in explaining what it is, especially for subjects which are already doing something with data. How can we tap in to those already doing it? A: there is sometimes a missing link between researchers and subject national data centres. No real relationship between the two. Which is a problem in cross-subject research.

Research data management at the University of Oxford: a case study for institutional engagement – Luis Martinez, OeRC, Sally Rumsey, Oxford University Library Service

More of a ‘in practice’ talk, rather than high level.

Luis Martinez

Scoping study: ‘DataShare project’. Talking to researchers, they found some couldn’t understand their own old data, some wanted to publish their own data, and some found data was lost when academics moved on.

Requirements: Advice/support across research cycle (where to store it, how, etc), Secure Storage for large datasets. Sustainable infrastructure.

Lots of different Oxford units need to be consulted (library, IT, research technology, academics, legal, repository etc).

Findings after consultation: there is actually widespread expertise in data management and curation amongst service units, and other findings. DataShare: new models, tools, workflows for academic data sharing.

Data Audit Framework: (DAF) adapted this to Oxford needs and used it to document practices in research groups.

‘Policy-making for Research Data in Repositories: a guide’ [pdf]

The EIDCSR challenge: two units that both research around the human heart. The two groups share the data between them and agree to produce 3d models using the combined data. They are helping these groups do this, using a ‘life cycle approach’.

Using the DAF to capture the requirements. Participating in the UKRDS Pathfinder (as above).

They have a blog http://eidcsr.blogspot.com/

Sally Rumsey

Starts off by talking about the roles required regarding the library. They have Repository staff, librarians, curators, but not so sure about ‘data librarians’.

What sort of data should they be responsible for? Some stuff can go to a national service. There are vast datasets (eg Oxford Supercomputing centre): who has the expertise to make these specialised datasets available? Some departments already have provision in place; fine, why rock the boat.

Long tail. Everything else (not above). No other home, lots of it, Academics asking for it, highly individual (ie unique), humanities and sciences.

Things to consider: live or changing data? Freely available or restricted? Long term post project?

Showing what looks like a list of random words/letters/strings of chars, an example of some data they were asked to look after from the English department.

Showing a diagram showing that Fedora (a repository system which is strong on metadata/structure but lacks an out of the box UI) is key to the setup. Many applications can sit on top of it. Institutional Repository is just one application which runs on top of Fedora.

ORA (IR) for DATA: actual data can be held anywhere in University but ORA is a place of discovery. Allows for referencing of data. Might want to link to ‘DataBank’ (a proof of concept to show what is possible).

Databank: how do you search/discover? First things added were audio files, perhaps then photos, how do you find them?

Showing Databank. Explaining that everything has a uid so we have cool URLs, and hence you can link to it [yes!]. Explaining how you can group an audio object, a related photo object and a related text object (perhaps explaining it).

End of morning discussion (I’ll just note some points I picked up):

This seems to raise such huge resource implications.

DAF is flexible, you can pick elements of it to use.

Non academic repositories, such as flickr, preservation issues, if they go down. [unlike the AHDS then!]

The Research Data Management Workforce – Alma Swan, Key Perspectives

Study commissioned by JISC, looking at the ‘supply of DS [data scientists] skills’.

NSF Roles:

  • Data Authors – produce data
  • Data Managers – more technical people – often work in partnership with data authors
  • Data Users
  • Data Scientists – expert data handlers and managers (perhaps ‘Data Manager’ was a confusing name).

Our Definitions (but in practice the roles and names are fuzzy):

  • Data creators or authors
  • Data Scientists
  • Data Managers
  • Data Librarians

Data Creators

Using DCC Curation lifecycle model, these are the outer ring. But they don’t do all of it, and they do things not on the ring, such as throw data away.

Shows picture of an academic’s office. Data is stored in random envelopes.

Data Scientists – the focus of this study

Work with the researchers, in the same lab. Do most things in the DCC model. Are computer scientists (or can be one), experts in database technologies, ensure systems are in place, format migration. A ‘translation service’ between Researchers and computer experts.

Lots of facts about this, based on the research. Often fallen in to the role by accident, often started out as a researcher. Domain (maths, chemistry) related or Computer training. Informatics Skills: well advanced in biology and chemistry. Majority have a further degree. Need People skills. Rapidly evolving area.

Data Librarians

Only a handful in the UK. Specific skills in data care, curation. Bottom half (or bottom two thirds) of DCC model.

Library schools have not yet geared up for training. Demand is low, no established career path. Good subject-based first degree is required.

Things are changing, eg library schools are creating courses/modules around this.

Future Roles of the library

Train researchers to be more data aware.

Pressing issue: inform researchers on data principles, eg ownership.

Open Data : datasets

A growing recognition across all disciplines that articles aren’t enough, datasets are what are needed to be in the open.

Datasets are a resource in their own right.

Publishers do not normally claim ownership of datasets. Some do (the usual suspects).

Funder may own Data, Employers may own data. No one seems sure. Several entities may own the data.

In some areas of research journals play a role in enforcement.

Some journals are just data.

Using PDF for data is very very not good.

Do we leave preservation of data to publishers [cjk: no! they should have nothing to do with this, the actors are Universities, their employees and their funders]

Simon Hodson – JISC Data Management Infrastructure Programme

Something problem, not easy to tackle. Would be a mistake for institutions to wait. The Call is designed to better understand how its data management facility can be taken forward.

Detailed business cases are needed.

Needs everyone (HEI, funders, data centres, RIN, etc) to be on board.

The Call will have an Advisory Group.

Exemplar projects and studies designed to help establish partnership between researchers, institutions, research councils.

See DCC as playing a major role in developing capacity and skills in the sector.

Tools and technologies: tools to help managers make business case internally, institutional planning tools (building on DAT, DRAMBORA, and costing tools). Workshop 10th June DCC to review progress/outcomes of DAT project.

Two calls planned for the early Autumn.

2 June Call: Infrastructure. To build examples within the sector. Requirements analysis -> Implementation plan -> Execution thereof -> business models.

Bids encouraged from consortia.

Briefing day 6 July. DCC will provide support for bids, including a specific helpdesk.

There may be a Digital Curation course in the next few weeks.

Libraries and Research Data Management; conclusions – Martin Lewis, Director of Library Services and University Librarian, University of Sheffield.

Martin had been chairing all day and here he sums up and brings the various threads together.

The library research data pyramid. Things at the bottom need to be in place before things higher up. At the bottom, training in library (confidence), Library schools. Then develop local data curation capacity, teach data literacy. Higher up: research data awareness, research data advice, Lead on local policy. At the very top ‘influence national data agenda’.

Summary

An excellent day and excellent knowledgeable speakers. Nice venue, and most importantly, I found the only plug socket in the room!

This is clearly an emerging area. Many are in the same position: they are aware of the (Open) Research Data developments, but nothing has yet happened at their university, nor are academics queuing up to demand such a service. This is a good thing and it needs to happen, and Universities need to start acting now. But there are many pressures on University resources at the moment. How high on the institutional priority list will this come?

[Very finally, I did another audioboo experiment. On the fly, with no pre-planning, I recorded about 2 minutes of talk during the lunch. It’s random, with no thought, many umms, a pointless ‘one more thing’ and basically wrong. laugh at it here]

or08: eprints track, session 2

After coffee a little more talk about new features and the future as we ran out of time before. Christopher Gutteridge has now turned up, he may have had a few grown up fizzy drinks last night.

(lost concentration here: salt grain take) Eprints plugin will try and pick when people enter their names wrong (e.g. get first/lastnames mixed up). Report an eprint (or report an issue with an eprint) link on item/record pages?

3.1 beta: should be released in a day or so. Live CD available.

When will the new template (for records/items) including related papers (or ‘people who liked this also liked…’) be available? An html designer is working on this. Can recreate abstract pages daily for fresh data (e.g. I think for stats/other papers).

People come in via Google for an item and then leave again. Soton ecs put links to postgrad prospectus and more on abstract pages for items, found hits to postgrad prospectus tripled.

Talking about more finely grained controls and privileges, i.e. who can edit what, and where, and giving people additional power. Includes, for example, this person can edit wording of fields/help, but not edit workflow.

11:42: now moving on to research assessment experience.

Bill Mortimer – Open University.

How Open used eprints to support the RAE experience.

Used eprints as a publication database because it was publicly available and helped increase citations. Also because of the reporting tool developed for eprints.

Open use mediated deposit but also imported records and self deposit.

Only peer reviewed items in ORO. Had up to 7 temp ‘editors’ processing the buffer.

Very slow uptake when mediated. Now have just under 7,000 items in ORO.

Simplified the workflow (which of course ep3+ have improved). Researchers responsible for depositing items for RAE submission.

Pro: increased awareness (of IR), increased deposits.

Con: overlap of perceptions of ORO and RAE process (some felt RAE took over the IR). Lots of records but only 16% carry full text (% of full text varied by department).

Slide with some future ideas, good, see presentation on (though not currently there) http://pubs.or08.ecs.soton.ac.uk/

12:06

Susan Miles – Kingston

Metadata only repository at the moment but plan to add content and full text this year.

Uni departmental structure and hierarchy has been the most controversial thing. Didn’t use RAE tool, wasn’t out the box.

Subject team staff created records, but focus moved to collection of physical items. (some) staff really got in to the IR, but this had the downside that many left with their new skills and experience!

misc bits

  • non existent items
  • people trying to pass off others’ work
  • items being removed and then re-entered constantly at the last minute for the RAE
  • overseas academics caused issues.
  • proof of performances and other ‘arts’ outputs were a challenge (next time get the academics to do it).
  • a barrel moving back and forth in a room was a piece of research to be submitted for the RAE (How? Evidence, metadata)

Unexpected, but lots of interest in the IR across the University. But lots of things in the buffer and no staff.

University committee has endorsed the IR as the source of publication data.

Because of using subject team staff for IR RAE, subject support now have good knowledge of the IR, which is good.

12:27

Wendy from soton

higher profile in Uni due to RAE work means people are including her – and the IR – more in discussions across campus such as looking at the REF.

question (from me): were any academics reluctant/against their rae information being put online? Answer: no

[anon comment, etheses mandate being reviewed regarding animal rights issues etc]

William Nixon: also planning to upload rae data. Does not foresee any problems, BUT recommend to not flag items as rae08 as some academics may have issues with this.

Les: HEFCE put metadata for items submitted to rae on web anyway.

q for open: you currently only publish peer reviewed items, do you plan to change this?

a: yes reviewing.

or08: Eprints 3.1

At the sucks-less-than-dspace Eprints track today. First up Eprints 3.1 and Future. Haven’t seen anything about 3.1 before this, v3 was released over a year ago so looking forward to seeing what is new.

9:10am: Les Carr is talking. reviewing v3 released last year. Talking about the large amount of work surrounding a repository (for all), which he experienced first hand running the soton ecs repository, and the work they have put in to help this. He found that when he contacted academics to point out problems he has fixed with their items/records they seemed pleased that someone was doing this. Last year they (eprints team) wanted to focus on ‘things on the ground’ to make things easier and not focus too much on rejigging the internals.

9:20: 3.1 more control for users. manage the repository without needing technical time (especially as University IT services often want to just set something up and leave it). showing Citation impact for authors.

Eprints 3 platform is built of two parts: ‘core’ backend, and plugins. Plugins control everything you see (I didn’t know plugins were used to this extent). A lot of the new things are just new plugins ‘slotted in’. Plugins can be updated separately which means upgrading specific parts of functionality is easy and doesn’t affect the whole system.

Lots of things moved from the command line to the web interface.

Administration: user interface for creating new fields and configuring administrative tasks (sounds good).

Easily extend metadata, what gets stored, in a nice user interface.

9:31: live demo of adding new fields: ‘manage metadata fields’, you can edit them for each dataset e.g. document, eprints, users, imports. First get a screen showing all existing fields, a text field to enter a new field name (and something to show if you have any fields half created, to continue). Interface looks similar to creating an eprint item. select the different types of field e.g. boolean, date, name, etc, lots of them, with descriptions of what they are, also one is a set where you can add a list of defined options, another is compound which can have various subfields. This is looking great.

9:38: next screen, loads of options: required? include in export? index? As name was selected on prev screen various options specific to the name field type. lots more. Has help (click on the ‘?’).

9:41: next screen set of questions about how this is displayed in the user interface, i.e. text user would see, help text. Again seems well designed. Editing XML in the past wasn’t rocket science but it was easy to forget steps or get syntax wrong, plus (certainly for v2) you had to do it with no items in the archive (not easy on a live repository!)

By default new fields appear in the MISC step (screen) of the deposit process for users. which can be changed by editing the workflow.

9:53: configuration (via web interface), fairly crude at the moment but looks to be useful (though not turned on for the demo repository), basically can edit things that are in cfg files. plan to turn this in to a full user interface in the future (not sure if for 3.1 or beyond).

9:58: running through some of the things in the cfg files, such as how to make a field mandatory only for theses.

Quality Assurance. ideas of an ‘issue’ (something amiss) and an ‘audit’.

issue: stale, missing metadata. issues reported by item and also aggregated by depository. notification of issues can be emailed to authors. We can define all this, i.e. what counts as an issue, in the cfg files. can also check for duplicates (good as it will make my god awful script we use at Sussex obsolete).

Can have a nightly audit, and see if anyone has acted on the alerts and issues. reports can be generated for people.

10:07: batched editing. do a search and then batched change any fields for those search results. nice. running short of time so not demo’ing.

manage deposits screen (for users) icons on the right of each item of yours, to see, delete, move, etc. you change what columns you see on this screen by using icons at the bottom of the screen, can also move them around.

Impact Evidence: citation tracking, researchers can track citations counts and rank papers. volatile fields don’t change the history of a record. download counts from irstat.

Better bibliographies. can reorder, choose what to view, better control. this is very much needed as different researchers want their publication list in a different way. uses stylesheets.

Complex objects: all public objects have official URIs. expanded document-level metadata .

Versioning (based on VERSION project). ‘simple and useful’. published material ‘pre post or reprints’. unpublished material, early draft, working paper. looks good.

10:19: Improve Document uploader. can upload a zip file of many files.

10:25 discussion about versions, e.g. how a user may add a draft (with limited metadata) and then go on and re-edit the item later on when they have a published version.

‘Contributors’ field. roles taken from dc relator names (225). large list of roles, may want to cut down.

A new skin, but not for 3.1 – i.e. record/abstract page will show a thumbnail of the item at the top, because the item is the important thing not the metadata (which is what is emphasised in the ui at the moment), i.e. in the same way that flickr shows the photo as the main thing on the page, and metadata at the bottom, good idea. new layout looks good.

Future: no time to talk: cloud computing, amazon eprints services perhaps (you just sign up to a IR on amazon and one is automatically created). On top of Fedora (saw folks on IRC talking about the same for Dspace the other day), or the Microsoft offering just announced. In a box (i.e. comes out the box as a pre-installed server) honeycomb.

or08: live blogging experiment

Today – as you probably have seen – I posted some badly written notes that meant nothing to no one, and interested even fewer. This was my experiment in live blogging. I’ve seen others do it quite a bit recently and always thought it worked well, so wanted to give it a try.

Some thoughts:

  • Using a different tense is a little weird. Normally we write in the past tense if reviewing an event, when blogging as it happens I found myself switching between current and past tense (the latter out of habit). This wasn’t helped with no internet access before coffee, so I was writing in to a text editor (not that you wanted to know, called Smultron) something I planned to paste in to a blog post in the future, which when posted will be talking about the past, but I wanted it to read as if it was live!
  • I looked up at one point while switching between wordpress and twitter and saw two laptop screens of people in front of me, one had twitter across the screen, the other had the wordpress composition window. Am I boring or with the in-crowd?!
  • Perhaps the biggest point was my difficulty in note taking. I wanted to write stuff that other people not there would find useful. However, my notes were largely rather basic, not meaty enough to say much, someone reading would get a general idea what the talk was about. It would give someone a feel of the outline of a talk, but not what the key points were, something which I think is a crucial difference.
  • As well as taking notes, I had various tabs open, including the excellent crowdvine conference site, twitter, bloglines, google blog search (searching to see what turned up for ‘or08’… oh look! me! god I’m so vain). At times the note taking, twittering (and learning about tags on twitter) and checking out crowdvine, I would occasionally look up and have no idea what the presenter is talking about (I’m a man, I have evolved to be an expert single tasker). Must try and ensure I’m not being distracted from the actual reason I’m there.
  • My notes were rough. Not helped by the fact that the lecture hall was very full (and it wasn’t one of those poncy MBA lecture theaters with big wide seats), so I was being careful of my elbows – which limits typing, and for me, using the shift key. Does the embarrassment of badly typed, ill thought, ungrammatical notes get trumped by their potential interest to others and timeliness?
  • Timeliness is an important point, I could have waited until the end of the day but wanted to get them out straight away.
  • After morning coffee I had the internet. I sat down next to two people I had met before, while there were quite a few with their laptops open, they were not, and I felt a little self conscious. They were trying to listen to the talk, and here was this guy next to them mucking around on his laptop the whole time. Actually I don’t think they were bothered.
  • While talking about being self-conscious, does posting things as quickly as possible look like attention seeking and ego massaging? Never thought that about anyone else doing it so hopefully the answer is no (but then I love this sort of thing, so I won’t).

So will I do it again? Yes, and I like having the web to hand while at these things. I think I need to improve my note taking, and perhaps take more time writing up points (and my thoughts) on the things of interest rather than writing lots of little snippets. I basically need to take notes anyway (whether notepad and pen, MS Word or a blog), and it does make it stick in my head better than just sitting there, so I may as well make my notes open to others. The timeliness (that’s time – li – ness, not Time Lines!) is perhaps harder to argue, but I like the idea that things are hitting the web the moment they happen, so think I will continue it.

I remember a couple of years ago when the www 2006 conference was taking place (can’t believe it was two years ago), I was sitting at my desk watching flickr, blogs, and just about everything else being updated – a lot – in real time. The ability for me to see photos, watch videos and see notes of things that happened a couple of minutes ago was amazing and really helped capture the feel for the whole event.

Other bits

Battery was running low (why didn’t campus designers in the 60s think to add plug points in lecture theaters for laptops) so had to revert to pen/paper for session 3. All good talks, but the SWORD talk by Julie Allinson was excellent.

Didn’t stay for the poster session minute madness but of the few posters I did have a chance to see, the one for feedforward really caught my eye and just looks excellent.

Crowdvine (link to or08 on crowdvine)

This is an excellent tool, and I recommend it to anyone setting up a conference. Though I think web savvy crowds will get more out of it (e.g. integration with twitter and web feeds). It helped to put names to faces, but it also helped to get a feel for who are some of the more prominent people. For example: If Les Carr talks about Gnu Eprints, I know to listen as he manages the thing, and if Bill Hubbard talks about IRs I know to listen to what he says because he Manages Sherpa in the UK. However I couldn’t tell you the same about the Dspace or US equivalents. I still can’t tell you their names (I don’t do names) but I certainly recognised faces of those who seemed to be very active in their area. I know this sounds a little elitist or hierarchical, but it really isn’t meant to be.

Handy hint: if you want your profile page to be at the top of the conference homepage, just make superficial changes to it every few hours!

As someone mentioned on twitter, this, and every social networking site, needs much more than just ‘friend’. Perhaps: ‘i have seen a few emails from them on mailing lists and I may have even replied to one’, ‘I kinda stood in the same group as them during a coffee break at a conference once’ and ‘I read their blog and see them mentioned here and there so we are a little like friends’. I felt a little unsure when clicking on a few people as friend, but then they all added me back (except Christopher Gutteridge, bastard). Of course this is no different to facebook, the amount of people who have requested me as a friend who I swear I have never spoken to, even if they knew a girl who lived in the corridor above me at the first year of University (that’s a real one).

(PS I used too many brackets and exclamation marks in this blog post!)

or08: session 2b: Sustainability Issues

[again unedited, unchecked, sorry for mistakes!]

Warwick Cathro
Assistant Director General, National Library of Australia

[sorry didn’t take very good notes for Warwick’s good talk]

“towards the australian data commons” paper on the web for reference on Australian policy in this area.

various sites/projects:

arrow: aggregates IRs in Uni repositories, 90,000 records, expects to grow rapidly. not a ‘native search service’ intended to let others use the metadata.
future: evolve, supported financially by ‘Australian National Data Service’ (like everything else in this talk). will use shibb and poss openid.

registry services
[interesting stuff, another project, but didn’t make any notes]

pilin – identity management
handle mirror/proxy.
tools and define requirements for a national service
national persistent identifier service.

Obsolescence notification
aons project
toolkit on sourceforge
adapters for ir software
compares profile with data from external registries, for each registry they have built an adapter

Australian METS profile
encoding of preservation metadata
exchanging data format.
three layer model, top, generic profile, middle: content models, bottom: implementation profiles

———-

Libby Bishop (Leeds/Essex)

Timescapes: looking at relationships, family life (young people, fatherhood, older people).
But also building a data archive in the process, some objects not born digital.

400+ participants
5000+ objects
500+ gb size.

Sustainable = Shareable + desirable.

Share:
IP sorted, resource discovery, harvestability.

Desirability
what makes people want to use this, this issue is at the service
researchers are the primary audience, but also media, policy makers, students.
Longitudinal (new term to me) e.g. track people as they move through time
needs to be multimedia: voice, video, audio.
video helps to engage young people
thematic data
reuse helps make it desirable.

Distinctive features of timescapes
answer: data (primarily)
but also: multimedia, sensitive content, complex access
Longitudinal, dynamic updating.
Integration of research, archive and reuse.
researchers are central to the design, they interact with repository.

Timescape Repository (at leeds), Timescapes data preserved at UK Data Archive (essex):
no point recreating a preservation service at leeds. uses digitool at leeds because mandras (?) was. digitool not open. wanted to use an existing tool at leeds rather than setup a new one.

metadata:
lots of challenges, especially in what is needed.
lots of people, expertise, and different institutions.
researchers tend to be the experts and know their area,
and IR people know current practice in metadata.
looking in to how to mark up audiovisual, e.g. looking at a METS wrapper.
modifying depositor interface to repository, let people add their own metadata, with some stuff still being added by the IR staff.

showing an example of the sort of data (in a MS Word file) the researchers are collecting. need a fair bit of conversations to encourage researchers to do this. (transcript guidelines/forms)

back to sustainability:
“key strategies for sustainability”
– embedding in multiple institutions (can’t predict the future).
– building trust with researchers in what you are doing (and asking them to do) is essential, esp in long term.
– reuse!
lots of people want to be part of the project: affiliates programme. those who want to work closely have to agree to contribute their own data and reuse current data.

Summary:
researchers agreed to share and reuse data: success
waiting list of affiliates

issues:
quality of researcher-submitted data, some reluctant to share, digitool multimedia support limited.
Collaboration takes time, especially across institutions.

or08 session 1 (part3)

Rich Tags: cross repository browsing
ds
ECS, Southampton

categories can be unhelpful, if you don’t know the scheme (LOC!) it is hard to see related articles.

solution, unified categories across repositories. Automated.

aggregated data from oai, then got more from external sources.

eg.
institution name from whois, decade from date.
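To make the 'decade from date' bit concrete, here is a minimal sketch of deriving a decade category from whatever ends up in a record's date field. This is my own illustration, not the Rich Tags code; the function name is made up, and it assumes the first four-digit number in the field is the year.

```python
import re
from typing import Optional


def decade_facet(date_value: str) -> Optional[str]:
    """Return a decade label such as '2000s', or None if no year is found."""
    # Assumption: the first standalone four-digit number is the year.
    match = re.search(r"\b(\d{4})\b", date_value)
    if match is None:
        return None
    year = int(match.group(1))
    return f"{year // 10 * 10}s"
```

The institution-from-whois half would need a network lookup on the repository's domain, so I haven't tried to sketch that.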

tf-idf, algorithm for categories (?)
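For my own reference, a minimal textbook version of tf-idf in plain Python. I don't know exactly how the project used it for categories, so treat this as the general technique only: a term scores highly for a record if it is frequent in that record but rare across the whole collection. The function name and the tokenised input format are my assumptions.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Score each term in each tokenised document by tf * idf."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append(
            {
                term: (count / total) * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return scores
```

A term that appears in every document gets a score of zero, which is the point: it says nothing about what makes one record different from another.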

mSpace, ecs project.

richtags.org, example of above, with records from various universities.

categories come from dmoz

or08: session1 (part 2)

again unedited/checked:

I’m tagging with or08:

http://blogsearch.google.com/blogsearch?hl=en&q=or08&ie=UTF-8&scoring=d
http://technorati.com/search/or08?authority=a4&language=en

On the margins of scholarship

Richard Davis, Uni London Computer Centre

flickr, good example of an online document repository

“flickr for eprints”
1 – the data i enter to be used by other applications
2 – rss feeds i can use elsewhere
3 – clickable keywords, leading to similar articles in other repositories.

linnean online

demo’d images in an eprints ir, allowed photos to be previewed on the same page, bookmarks, comments.

may scan in original comments.

sneep. social networking extensions for eprints.
jisc funded, tagging, bookmarking, open source, exploit eprint objects

about giving choice, let users find it, e.g. facebook app. if we make it, it will be useful for someone (in a way we can’t predict).

web2.0 is raising expectations of what websites should do.

or08: session1 part1

again, unedited or checked. draft notes:

ian mulvany -nature (who produce connotea) – speaker.
david kane – Waterford Institute of Technology

openid

semantic web, very useful to get data from, but hard to do and very few do it.

contributing: plain text easy, semantic web hard
data mining: semantic web easy, plain text hard.

talked about how nature are working on integrating social tools with connotea.

“we want to connect repositories with connotea”.

connotea could act as an interchange with repositories.

showed how Waterford Institute of Technology catalogue uses tags etc, example of how social tools can be integrated.

openid:
your signon can be a URI, like a URL or email address,
When you try to sign on with openid, you are redirected to your openid provider (eg yahoo). Your details
are not shared with the site you are trying to access, just keys.
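A rough sketch of what that redirect looks like, assuming OpenID 2.0 style parameters (the talk didn't go in to this level of detail). The function name and all the example.org URLs are made up, and the real protocol also involves discovering the provider endpoint and verifying the signed response, which I've left out. The thing to notice is that no password goes anywhere near the site you are signing in to.

```python
from urllib.parse import urlencode


def build_auth_redirect(
    provider_endpoint: str, claimed_id: str, return_to: str, realm: str
) -> str:
    """Build the URL a site redirects the browser to so the provider can authenticate the user."""
    params = {
        "openid.ns": "http://specs.openid.net/auth/2.0",
        "openid.mode": "checkid_setup",
        "openid.claimed_id": claimed_id,
        "openid.identity": claimed_id,
        "openid.return_to": return_to,  # where the provider sends the user back
        "openid.realm": realm,  # the site asking for the sign-on
    }
    separator = "&" if "?" in provider_endpoint else "?"
    return provider_endpoint + separator + urlencode(params)


# Hypothetical values, for illustration only.
url = build_auth_redirect(
    "https://provider.example.org/auth",
    "https://me.example.org/",
    "https://site.example.org/return",
    "https://site.example.org/",
)
```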

security risk, one website (your open id provider) is the key to access to your details on many websites, i.e. phishing risk. something to be aware of.

of (?) allows one site to access things on another site (eg dopplr accessing your photos on flickr) without you having to give the first site your login details for the second.

connotea now supports open id. think it is the future.

or08: Open Repositories 2008

I’m at Open Repositories 2008.

Got in to southampton central at 8:15am, for a 9 start, thought i had loads of time but with the wait for the bus, the journey, and wandering around campus i only just got here for 9. so did just about everyone else, so no username and password allocated until coffee time :( [but now online, hence you seeing this!]

draft notes from the first session, unedited or checked.
peter murray-rust
repositories data

Believes data is the most important thing for scientists, as opposed to open access final full text.

“PDF destroys information” – Word contains data which pdf just loses. word files (and latex, xml) are useful for scientists as they can reuse the data, formulae, metadata etc contained within, which is lost in a pdf file.

academic theses are one of the most important things for institutions/researchers. electronic theses are going to be very powerful.

technical problems slowed the talk down.

showed pdb repository, protein data bank, going since the 70s. showed research he *put* in to the *repository* while working at glaxo.

message is that scientists are already putting stuff in repositories.

crystaleye, built/started by a postgrad. now has over 100,000 crystal structures. harvests from those that release their crystallography (acs, rsc, etc). links to paper via doi.

scientists will not put things in to repositories (presumes he means article-based IRs, as he has just been describing how scientists do!).

OSCAR text extraction. showed example of pasting the text of a PDF doc in to OSCAR, it produced a table of formulae that were contained within the text.
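To give a flavour of the idea (this is emphatically not how OSCAR works, just a toy of my own): pull anything that looks like a chemical formula out of plain text with a regular expression, then keep only candidates made up entirely of known element symbols. The tiny element list and the function name are assumptions for illustration, and it will happily produce false positives on ordinary capitalised abbreviations that happen to spell element symbols.

```python
import re

# Assumption: a deliberately tiny set of element symbols, enough for the demo.
ELEMENTS = {"H", "C", "N", "O", "Na", "Cl", "S", "P", "Ca", "Fe"}

# Two or more "symbol + optional count" tokens in a row, e.g. H2O or C6H12O6.
CANDIDATE = re.compile(r"\b(?:[A-Z][a-z]?\d*){2,}\b")
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def extract_formulae(text: str) -> list[str]:
    """Return formula-like strings in order of first appearance, without repeats."""
    found: list[str] = []
    for match in CANDIDATE.finditer(text):
        candidate = match.group(0)
        symbols = [symbol for symbol, _count in TOKEN.findall(candidate)]
        if all(symbol in ELEMENTS for symbol in symbols) and candidate not in found:
            found.append(candidate)
    return found
```

Run over a sentence like "We dissolved NaCl in H2O and added C6H12O6; see the PDF." it picks out the three formulae and drops "PDF", because D and F are not in the element list.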

Royal Society of Chemistry: PROSPECT , semantic markup of papers.
SPECTRa (cam/imperial), how can we capture data as part of the academic process

“do not try to invent electronic notebooks”, success rate approx zero. i.e. don’t try and make capturing data the integral part of their workflow.

No one knows when their paper gets published.

“get at the authoring process”, that is the key.