ircount : update

One Sunday morning in January this year I got an email sent automatically from the webhosting company. It contained the output of a script that ran weekly; when all ran fine the script produced no output, and when something went wrong the error messages were emailed to me. Judging by the length of the email, something big had gone wrong.

The script collected data from http://roar.eprints.org/ – to be used as that week’s ‘number of records’ for each repository.

The reason became clear quickly. A major revamp of ROAR had just been launched, showing off a new interface built on the EPrints software as a platform (essentially a repository of repositories). This was a great leap forward but unfortunately it removed the simple text file I used to collect the data, and, what was more, the IDs for each IR had changed.

I finally got around to fixing this in May. The most fiddly bit was linking the newly collected data with the data I already had. This involved matching URLs and repository names.

Anyways. Things should be more or less as they were. A few little tweaks have been added. A few bugs still remain.

As ever you can view the code and changes here: http://trac.nostuff.org/ircount/browser/trunk

And checkout the svn here: http://svn.nostuff.org/ircount/

ircount can be found here: http://www.nostuff.org/ircount/

Event: The Data Imperative: Libraries and Research Data

Today I’m at the one day event ‘The Data Imperative: Libraries and Research Data’ at the Oxford e-Research Centre. As usual, these are my own rough notes. There are mistakes, gaps and my own interpretation of what was said.

Paul Jeffreys : Director of IT, Oxford University.

Started off giving an overview of where this has come from. e-Research is more than just e-Infrastructure. e-Research is not just about outputs, but outputs (articles/data) are a part of it, and a discrete area to work on.

This is a cross-discipline area, it needs academics, University executive, research office, IT and Library. Libraries have skills that have to be fed in to this.

EIDCSR: ‘Enough talking, let’s try and do it’. Selected two research groups to work with; not a pilot, but a long term commitment. He quotes Oxford’s commitment to a data repository; it stresses cross-agency working, mentions business models and feeds into a senior research committee (the quote is far too long to add here!).

As each HEI is facing the same issue, it makes sense for national activity; but how much should be done locally and how much nationally?

What is the vision for research data management? To what extent is managing research data the role of the Library/librarians? Is data management and data repositories a new kind of activity? Is it Librarians or Information Professionals who are charged to take this forward? [cjk: I thought they were one and the same]

John K Milner : Project Manager UKRDS

Can’t just use existing subject specific data centres. Need for cross-discipline (eg climate change) and therefore universal standards and methods so one subject can use another subject’s data with ease.

Feasibility study:

Understand what is happening today and where the gaps are. Avoid re-inventing the wheel.

Four Case studies (Bristol, Leeds, Leicester, Oxford), views of ~700 researchers over all disciplines (inc the arts).

What did they learn?

About half of the data has a useful life of 10 years; 26% has ‘indefinite’ value, i.e. keep for ever. Nearly all of it is kept locally (memory stick, departmental server [cjk: not good!]).

21% use a national/international data centre. 18% share with them.

UK has rich landscape of facilities, skills and infrastructure.

The management of data from a research project is now starting to be directly funded, which is important.

What are others doing? Are we in step with other countries? Yes. US spending $100 million on 5 large data centres. Australians are leading in this area, and have a central approach to it. Canada and Germany also have similar developments.

Aim: to set up a framework for research data.

Why Pathfinder: not a pilot but the start of a long term commitment.

[my notes miss a bit here, had to deal with a urgent work issue]

Service must be useful and accessible. Need a framework for stakeholder engagement.

This is non-trivial. Lots of parties involved, a lot of effort needed.

Citation of datasets is of growing interest to some researchers, this may help engage the research community.

Showing a diagram of UKRDS basic processes, split between ‘Research Project process’, ‘Research data sharing process’ and ‘UKRDS Services and Administration’.

The diagram doesn’t focus on curation but on accessibility (inc discovery, stable storage, identity), as this seems like the most important part. Discovery: Google; identity (auth): Shibboleth.

Making it happen.

Need clearly defined service elements, will involve DCC, RIN and data centres.

HEIs need a reliable back-office service to handle working with data.

UKRDS is extremely challenging, nothing is easy and it is expensive. Needs support of funders and HEIs, need the right bodies to show leadership and shape policy. It will take time.

Q: Is it limited to HEIs or public sector (museums etc)? A: a more complicated issue, but they are working with the likes of Connecting for Health and DEFRA.

Q: Copyright. A: HEIs often don’t own copyright. Data Management Plans (Wellcome are funding data planning as part of their funding).

Q: Is it retrospective? A: Could be. [he did say more]

Q: Could UKRDS influence ‘reputational kick back’ [nice phrase!] e.g. for the REF? A: Yes, in discussion with HEFCE.

Q: Research Councils? A: they are in discussion with RCs, but Wellcome are very much taking the lead (leap of faith) in this area. The whole key is a ‘value proposition’ which makes a case for funding this.

Q/point: Engage government/politicians.

Q: Challenge in explaining what it is, especially for subjects which are already doing something with data. How can we tap into those already doing it? A: there is sometimes a missing link between researchers and subject national data centres, with no real relationship between the two, which is a problem in cross-subject research.

Research data management at the University of Oxford: a case study for institutional engagement – Luis Martinez, OeRC, Sally Rumsey, Oxford University Library Service

More of an ‘in practice’ talk, rather than a high level one.

Luis Martinez

Scoping study: ‘DataShare project‘. Talking to researchers, they found some couldn’t understand their own old data, some wanted to publish their own data, and some found data was lost when academics moved on.

Requirements: advice/support across the research cycle (where to store it, how, etc). Secure storage for large datasets. Sustainable infrastructure.

Lots of different Oxford units need to be consulted (library, IT, research technology, academics, legal, repository etc).

Findings after consultation: there is actually widespread expertise in data management and curation amongst service units, and other findings. DataShare: new models, tools, workflows for academic data sharing.

Data Audit Framework (DAF): adapted this to Oxford’s needs and used it to document practices in research groups.

‘Policy-making for Research Data in Repositories: a guide’ [pdf]

The EIDCSR challenge: two units that both research around the human heart. The two groups share the data between them and have agreed to produce 3D models using the combined data. They are helping these groups do this, using a ‘life cycle approach’.

Using the DAF to capture the requirements. Participating in the UKRDS Pathfinder (as above).

They have a blog http://eidcsr.blogspot.com/

Sally Rumsey

Starts off by talking about the roles required regarding the library. They have repository staff, librarians, curators, but are not so sure about ‘data librarians’.

What sort of data should they be responsible for? Some stuff can go to a national service. There are vast datasets (e.g. Oxford Supercomputing Centre); who has the expertise to make these specialised datasets available? Some departments already have provision in place; fine, why rock the boat?

The long tail: everything else (not the above). No other home, lots of it, academics asking for it, highly individual (i.e. unique), humanities and sciences.

Things to consider: live or changing data? Freely available or restricted? Long term, post-project?

Showing what looks like a list of random words/letters/strings of chars, an example of some data they were asked to look after from the English department.

Showing a diagram showing that Fedora (a repository system which is strong on metadata/structure but lacks an out of the box UI) is key to the setup. Many applications can sit on top of it; the Institutional Repository is just one application which runs on top of Fedora.

ORA (IR) for DATA: actual data can be held anywhere in University but ORA is a place of discovery. Allows for referencing of data. Might want to link to ‘DataBank‘ (a proof of concept to show what is possible).

Databank: how do you search/discover? First things added were audio files, perhaps then photos, how do you find them?

Showing Databank. Explaining that everything has a uid so we have cool URLs, and hence you can link to it [yes!]. Explaining how you can group an audio object, a related photo object and a related text object (perhaps explaining it).

End of morning discussion (I’ll just note some points I picked up):

This seems to raise such huge resource implications.

DAF is flexible, you can pick elements of it to use.

Non-academic repositories, such as Flickr: preservation issues if they go down. [unlike the AHDS then!]

The Research Data Management Workforce – Alma Swan, Key Perspectives

Study commissioned by JISC, looking at the ‘supply of DS [data scientists] skills’.

NSF Roles:

  • Data Authors – produce data
  • Data Managers – more technical people – often work in partnership with data authors
  • Data Users
  • Data Scientists – expert data handlers and managers (perhaps ‘Data Manager’ was a confusing name).

Our Definitions (but in practice the roles and names are fuzzy):

  • Data creators or authors
  • Data Scientists
  • Data Managers
  • Data Librarians

Data Creators

Using the DCC Curation Lifecycle model, these are the outer ring. But not all of it, and they do things not on the ring, such as throwing data away.

Shows a picture of an academic’s office. Data is stored in random envelopes.

Data Scientists – the focus of this study

Work with the researchers, in the same lab. Do most things in the DCC model. Are computer scientists (or can be), experts in database technologies; ensure systems are in place; format migration. A ‘translation service’ between researchers and computer experts.

Lots of facts about this, based on the research. Often fell into the role by accident, often started out as a researcher. Domain (maths, chemistry) related or computer training. Informatics skills: well advanced in biology and chemistry. The majority have a further degree. Need people skills. Rapidly evolving area.

Data Librarians

Only a handful in the UK. Specific skills in data care and curation. Bottom half (or bottom two thirds) of the DCC model.

Library schools have not yet geared up for training. Demand is low, no established career path. Good subject-based first degree is required.

Things are changing, eg library schools are creating courses/modules around this.

Future Roles of the library

train researchers to be more data aware

Pressing issue: informing researchers on data principles, e.g. ownership.

Open Data : datasets

A growing recognition across all disciplines that articles aren’t enough; datasets are what need to be in the open.

Datasets are a resource in their own right.

Publishers do not normally claim ownership of datasets. Some do (the usual suspects).

Funders may own data; employers may own data. No one seems sure. Several entities may own the data.

In some areas of research, journals play a role in enforcement.

Some journals are just data.

Using PDF for data is very very not good.

Do we leave preservation of data to publishers? [cjk: no! they should have nothing to do with this; the actors are universities, their employees and their funders]

Simon Hodson – JISC Data Management Infrastructure Programme

A big problem, not easy to tackle. It would be a mistake for institutions to wait. The Call is designed to better understand how institutional data management facilities can be taken forward.

Detailed business cases are needed.

Needs everyone (HEI, funders, data centres, RIN, etc) to be on board.

The Call will have an Advisory Group.

Exemplar projects and studies designed to help establish partnerships between researchers, institutions and research councils.

See DCC as playing a major role in developing capacity and skills in the sector.

Tools and technologies: tools to help managers make the business case internally, institutional planning tools (building on DAF, DRAMBORA, and costing tools). Workshop 10th June at the DCC to review progress/outcomes of the DAF project.

Two calls planned for the early Autumn.

2 June Call: Infrastructure. To build examples within the sector. Requirements analysis -> Implementation plan -> Execution thereof -> business models.

Bids encouraged from consortia.

Briefing day 6 July. DCC will provide support for bids, including a specific helpdesk.

There may be a Digital Curation course in the next few weeks.

Libraries and Research Data Management; conclusions – Martin Lewis, Director of Library Services and University Librarian, University of Sheffield.

Martin had been chairing all day and here he sums up and brings the various threads together.

The library research data pyramid: things at the bottom need to be in place before things higher up. At the bottom: training in the library (confidence), library schools. Then: develop local data curation capacity, teach data literacy. Higher up: research data awareness, research data advice, lead on local policy. At the very top: ‘influence national data agenda’.

Summary

An excellent day and excellent knowledgeable speakers. Nice venue, and most importantly, I found the only plug socket in the room!

This is clearly an emerging area. Many are in the same position: they are aware of the (Open) Research Data developments, but nothing has yet happened at their university, nor are academics queuing up to demand such a service. This is a good thing and it needs to happen, and universities need to start acting now. But there are many pressures on university resources at the moment. How high on the institutional priority list will this come?

[Very finally, I did another audioboo experiment. On the fly, with no pre-planning, I recorded about 2 minutes of talk during the lunch. It’s random, with no thought, many umms, a pointless ‘one more thing’ and basically wrong. laugh at it here]

Research in the Open: How Mandates Work in Practice

Today I’m at the RSP/RIN Research in the Open: How Mandates Work in Practice at the impressive RIBA 66 Portland Place.

Slides can be found here (not available when I made this post; a semi-excuse as to why my notes miss so much). These are rough notes, which I’m making available in case others are interested; apologies for mistakes and don’t take it as gospel!

After an introduction by Stéphane Goldstein, kicking off with Robert Kiley from the Wellcome Trust.

The Wellcome Trust has had a mandate since 2006: anyone receiving funding from the Wellcome Trust must deposit into PubMed Central, now UK PubMed Central. SHERPA Juliet lists 48 funder policies/mandates.

Two routes to complying with their mandate: (route 1) publish in an open access / hybrid journal (preferred); Wellcome will normally pay any associated fees. However, when paying the publisher they expect a certain level of service in return (deposit on behalf of the author, final version available at time of publication, a certain level of re-use). (Route 2) the author self-archives the author’s final version within 6 months of publication. It was stressed that the first option is very much preferred.

“Publication costs are legitimate research costs”. To fund Open Access fees for ALL research they fund would, they estimate, take up 1-2% of their budget.

Risk of ‘Double payment’ (author fees and subscriptions). OUP have a good model here.

Still to do:

  • Improve compliance (roughly 33%, significant increase after letters to VCs),
  • improve mechanisms (Elsevier introduced OA workflow which resulted in significant increase in deposits, but funders/institutions/publishers all need to play a part here),
  • Clarifying publishers’ OA policies (and re-use rights, didn’t catch this).

Research Councils UK – Astrid Wissenburg, ESRC

Starts off by talking about drivers for OA in the RCs: value for money, ensuring research is used, infrastructure and more.

Principles: Accessible, Quality (peer review), preservation (she’s moving through the slides fast)

An April 2009 study into OA impact provides options for the RCs to consider. Findings:

  • Significant shift in favour of OA over last decade
  • Knowledge/awareness still limited. Confusion
  • Engagement with OA varies by subject area.
  • Too early to assess the impact of RCs’ policies.
  • Drivers
    • Not speed of dissemination
    • principles of free access
    • co-authors views are a big influence (mandates less so!)
    • some evidence that OA increases citation just after publication
    • limited compliance monitoring by funders
    • concern about impact of learned societies (but no evidence of libraries cancelling journals)
    • little evidence of use by non-researchers (CJK comment: interesting, I would imagine this may grow, wish newspapers would link/cite journal articles)

Both models (OA journals/repositories) supported by RCs; level playing field.

Pay-to-publish findings: limited use; barriers: costs, awareness, not RAE. Would lead to a redistribution of costs from non-academic to academic areas.

OA deposit (repositories): applies to grant applications from 1 Oct 2006, so a three year project starting then will only be finished in Autumn 2009. Acknowledges embargoes but ‘at earliest opportunity’.

75% of researchers were not aware of the mandate. Diversity across subjects. ‘In general, no active deposit’.

A slide showing % of awareness broken down by RC, interesting.

From the highest level RCs are committed to supporting OA (this will increase). But change takes time.

Some issues: what to do with embargo periods; difficult for funders to manage (are there incentives we could use?); depends on the existence of repositories; multiple deposit options are confusing to researchers; awareness/understanding.

UKPubMed Central – Paul Davey, Engagement Manager, UKPubMed Central

Aims to become the information resource of choice for the biomedical sector.

Principles: freely available, added to UK pubmed central, freely copied and reused.

The Department of Health has a clear policy to make research freely available.

95% of papers submitted are taken care of (deposited?) by the authors; only 0.5% are submitted directly by academics (PIs/colleagues).

1.6 million papers in UK PubMed Central; 366,000 downloads last month.

Core benefits: transparency, cutting down duplication, greater visibility.

Text mining: grabbing key terms from an article (a little like OpenCalais does).

Mentions EBI’s CiteXplore, encouraging academics to link to other research.

UK PubMed Central includes a funding/grant search facility; articles can be linked to funding grants.

In short: backing from key funders, it will make researchers more efficient, and researchers’ visibility will increase.

Beta out in the Autumn, new site in Jan 2012.

Questions:

Q: worried about text mining, need for humans to moderate this. Response: limited funding in this area, so human intervention is also limited; really need a specialist to answer this fully.

Question about increasing the visibility of UK PubMed Central, referring to Google. Response: getting indexed by Google is very much part of increasing visibility.

Question about a Canadian ‘PubMed Central’. The response confirms this and mentions talk of a European PubMed Central, and the potential of European funders using UK PubMed Central as a place to deposit research (like everything here, not sure if I’ve noted this right).

PEER – Pioneering collaboration between publishers, repositories and researchers – Julia Wallace

Funded by EC, not a ‘publisher project’.

Three key stages of publication: NISO Author’s original, NISO Accepted Manuscript, NISO version of record.

Starts off talking about the project; interesting stuff, but I failed to take notes.

From the website:

PEER (Publishing and the Ecology of European Research), supported by the EC eContentplus programme, will investigate the effects of the large-scale, systematic depositing of authors’ final peer-reviewed manuscripts (so called Green Open Access or stage-two research output) on reader access, author visibility, and journal viability, as well as on the broader ecology of European research. The project is a collaboration between publishers, repositories and researchers and will last from 2008 to 2011.

Seven members, including a publisher group, a university, funders etc. Various publishers involved, big and small, and about six European repositories taking part.

Approach / content:

  • Publishers contribute 300 journals, plus control
  • Maximises deposit and access in participating repositories
  • 50% publisher submitted 50% author submitted.
  • Good quality, range of impact factors. Publishers set embargo periods, up to 36 months.

Publishers will deposit articles into the repositories via a central depot for their 50% of articles submitted (50% fulltext, metadata for the remaining 50%). Publishers will invite authors to deposit for the ‘author’ 50%.

Technical: using PDF/A-1 (where possible) and SWORD.

Three strands: Behaviour, Usage (looking at raw log files), Economic. Also looking at Model Development (the three strands will look in to this).

Question about why they chose PDF (not very good for text mining). A: the wide range of subjects and publishers means that PDF is the best fit.

Economic Implications of Alternative Scholarly Publishing Models, also Loughborough University’s Institutional Mandate – Charles Oppenheim, Loughborough University

‘Houghton report’ looks at costs and benefits of scholarly publishing.

Link to report http://hdl.handle.net/2134/4137

Link to main website and models http://www.cfses.com/EI-ASPM/

  • Massive savings by using OA, UK would benefit from this.
  • Savings include: quicker searching, less negotiations, savings not just in library budgets
  • 2,300 activity items costed.
  • This report is currently the final word in the economics of OA.
  • Charles talks about the various methods and work involved in producing this report.
  • a 5% increase in accessibility would lead to savings (or extra money to spend) in research/HE/RCs
  • Hard to compare UK toll/open access publishing costs as one pays for UK access to content from across the world, the other pays for UK content to be world wide accessible.
  • Keen to roll this out to other countries
  • Publishers’ response to the report: furious!

Now for something completely different: Loughborough approved a mandate a few months ago, to come into effect Oct 09. It is an integral part of academics’ personal research plans (only those research items in the IR will be considered at the review). They now have over 4,000 items.

Lunch and audioboo

During lunch I did an experiment using Audioboo. Would I be able to summarise the morning, on the fly with no planning, in a brief audio recording? The answer, as you can discover, is ‘no’, but it was fun to try, and it made me think about what I had taken in during the morning. Link to audioboo recording, or try the embedded version below.

Institutional Mandates – Paul Ayris, University College London

Paul starts off by showing a number of Venn diagrams, for example: 90% of UCL’s research is available online, 40% available to an NHS hospital.

What do UCL academics want?

  • as authors: visibility / impact
  • as readers: access
  • delivery 24×7 anywhere

UCL mandate, a case study:

Looking global is an important part of UCL (for PIs, rankings etc). A number of systems make up their publication infrastructure: Symplectic, IRIS, EPrints, a data mart (and Portico, FIS, HR). Symplectic (or a similar tool) and IRIS seem central in this model. They plan to automatically extract metadata from other external places (publication repositories).

How did they get the mandate? Paul spoke at UCL’s senate (Academic Board), which agreed that all academic staff should record their own publications in a UCL publication system, and that teaching materials should all be deposited in their EPrints system.

UCL are going to set up a publication board to oversee the OA rollout: to advise, monitor, oversee presentation and more.

Next steps: market/exploit, set standards for online publication, to advise on ongoing resource issues in this area. Also, establish processes, Statistics and management information, advise on multimedia, copyright issues.

‘Open Access is the natural way for a global university to achieve its objectives’

Question about blurring the line between dissemination and publication, and that some of UCL’s aims seem more fitting of ‘publication’. Paul agrees; they are still trying to figure this out.

HEFCE – Paul Hubbard, Head of Research Policy, HEFCE

Policy: Research is a process which leads to insights for sharing. So Scholarly Communication matters to HEFCE. Prompt and accessible publishing is essential for a world class research system.

Supporting research: JISC, RIN, Programmes to support national research libraries (UKRR), UKRDS. Mentions Boston Spa (BL) document centre as an example of our world class sharing.

Internet opens up new ways of scholarly communication and sharing.

What do HEFCE want to see:

  • Widest and earliest dissemination of public research.
  • IP shared effectively with the people best placed to exploit it (CJK comment: I don’t think it is publishers!)

Committed to: UK maintaining world leading research, funding that fosters autonomy and dynamism, research quality assessment regime that supports rather than inhibits new developments.

As we move forward things may be unclear, but those HEIs with repositories will be at an advantage.

Paul finishes up with a personal view of scholarly communications in 2030. He sees two forms of communication: discussion (building up ideas), and writing up a formal, firm idea/conclusion based on these. HEFCE supports – through the likes of JISC – a range of tools and systems to enable this. (Sorry, that was an awful summary; he said much more than that!)

Answered a question as to why IRs/HEIs are the places to administrate/manage this: websites people go to for research on a particular subject should be overlay systems harvesting from IRs.

[hmm, does ‘university requirement’ sound better than mandate?]

Institutional Policies and Processes for Mandate Compliance – Bill Hubbard, SHERPA, University of Nottingham

99.9% of academics do not object to Open Access, but you need to show them it will not change how they work. Librarians are going to be much more part of the research process. Most people (including most publishers) are in favour of Open Access.

Other pressures on the system: lack of peer reviewers, rising prices of journals, growing need for different forms of scholarly communications (e-lab books, multimedia), public demand for highest value for money (‘the public should get what they pay for’).

Not if we change, but how we change. Research has to change seamlessly. Mandates have a value-added basis with fast delivery of benefits. Need integrated processes and integrated support (we don’t want researchers to hear different messages from their uni, funder, publisher, etc).

Authors need to know ‘what do I need to do?’. Need to make it less confusing, and make it clear when they can get help.

First step compliance: how can funders improve compliance, how can authors be supported?

All 1994 Group and Russell Group universities now have an IR (Reading, I think, is just setting one up now).

Compliance for mandates makes it better for us admin/support staff, and for the University generally.

Institutions need a compliance officer (perhaps the repository manager). Funders need to ensure these people have the information they need. Share compliance information.

I’ve missed so much of Bill’s talk here; he moves fast (and passionately) and makes lots of points.

After Bill’s talk there was a panel session.

Twitter

Finally check out some of the useful tweets from the day. (Twitter search only goes back about a month or so, so this link may not work after a certain date). Jim Richardson also created a permanent copy with the (new to me) webcitation website.

Conclusions

With such dodgy note taking I feel some concise summary is in order!

  • Mandates are happening, by Universities and by Funders.
  • HEFCE want research to be accessible to as many as possible as quickly as possible.  Coming from HEFCE, this holds a lot of weight.
  • Funders (Research Councils / Wellcome) put mandates in place several years ago. They have not sat back and said ‘job done’; they are building on this foundation. How can they check? How can they enforce/encourage? How can they assist? How can they automate? How can they work with publishers and HE to share this information? Expect more to come in this area.
  • Wellcome Trust prefers submission to Open Access Journals rather than author depositing in to a repository at a later date.
  • HE mandates are coming; we already have a few in the UK. Making them an integral part of an academic’s review seems like a good idea. My opinion is that this is reasonable, even if there are those who disagree: surely an employer can (and does in every other sector) ask for a record of what an employee has been working on, and a copy of the end output, i.e. the full text in an IR.
  • The report ‘Economic implications of alternative scholarly publishing models : exploring the costs and benefits. JISC EI-ASPM Project‘ is a thorough, comprehensive look at the economic costs of Open Access and new forms of Scholarly Communications.
  • I think we are starting to see the larger universities developing sophisticated networks of systems to manage research/publications/OA/research-funding. See slide 10 of Paul Ayris’ presentation, and this article about Imperial’s setup, as two examples.
  • It makes sense to share information (between IT systems) between funders, HE and publishers. Examples: Funders sharing (bibliographic) information to a University about publications from its researchers, Universities (or publishers) passing information to funders linking publications to funding (or even the other way round?).
  • This is an area which is still developing, fast, and will of course involve a culture change. Publishers seem unsure how to handle this new world.

Playing with OAI-PMH with Simple DC

Setting up ircount has got me quite interested in OAI-PMH, so I thought I would have a little play. I was particularly interested in seeing if there was a way to count the number of full text items in a repository, as ROAR does not generally provide this information.

Perl script

I decided to use the HTTP::OAI Perl module by Tim Brody (who not-so-coincidentally is also responsible for ROAR, which ircount gets its data from).

A couple of hours later I had a very basic script which will roughly report on the number of records and the number of full text items within a repository; you just need to pass it a URL for the OAI-PMH interface.
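The core of the script looks something like this (a minimal sketch rather than the real thing, assuming HTTP::OAI’s harvester interface with XML::LibXML underneath; the simple ‘.pdf URL’ test here stands in for the messier per-software guesswork discussed later in this post):

#!/usr/bin/perl
use strict;
use warnings;
use HTTP::OAI;

# Usage: perl fulltext_count.pl http://eprints.example.ac.uk/perl/oai2
my $base = shift or die "usage: $0 <oai-pmh-base-url>\n";
my $harvester = HTTP::OAI::Harvester->new( baseURL => $base );

my ( $records, $fulltext ) = ( 0, 0 );

# ListRecords follows resumption tokens for us as we iterate
my $list = $harvester->ListRecords( metadataPrefix => 'oai_dc' );
die $list->message if $list->is_error;

while ( my $rec = $list->next ) {
    next unless $rec->metadata;    # skip deleted records
    $records++;
    my $dom = $rec->metadata->dom;
    # crude: any dc:identifier or dc:relation URL ending in .pdf
    # counts this record as having a fulltext item attached
    for my $node ( $dom->getElementsByLocalName('identifier'),
                   $dom->getElementsByLocalName('relation') ) {
        if ( $node->textContent =~ /\.pdf$/i ) {
            $fulltext++;
            last;
        }
    }
}
die $list->message if $list->is_error;

print "$records records, $fulltext with fulltext\n";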

To show the outcome of my efforts, here is the verbose output of the script when pointed at the University of Sussex repository (Sussex Research Online).

Here is the output for a sample record (see here for the actual oai output for this record, you may want to ‘view source’ to see the XML):

oai:eprints.sussex.ac.uk:67 2006-09-19
Retreat of chalk cliffs in the eastern English Channel during the last century
relation: http://eprints.sussex.ac.uk/67/01/Dornbusch_coast_1124460539.pdf
MATCH http://eprints.sussex.ac.uk/67/01/Dornbusch_coast_1124460539.pdf
relation: http://www.journalofmaps.com/article_depository/europe/Dornbusch_coast_1124460539.pdf
dc.identifier: http://eprints.sussex.ac.uk/67/
full text found for id oai:eprints.sussex.ac.uk:67, current total of items with fulltext 6
id oai:eprints.sussex.ac.uk:67 is the 29 record we have seen

It first lists the identifier and date; the next line shows the title. It then shows a dc.relation field which contains a full text item on the EPrints server. Because this looks like a full text item on the same server, the next line shows it has found a line that MATCHed the criteria, which means we add this item to the count of items with full text attached.

The next line is another dc.relation, again pointing to a fulltext URL for this item. However, this time it is on a different server (i.e. the publisher’s), so this line is not treated as a fulltext item and does not show a MATCH (i.e. had the first relation line not existed, this record would not be considered one with a fulltext item).

Finally another dc.identifier is shown, then a summary generated by the script concluding that this item does have fulltext, is the sixth record seen with fulltext, and is the 29th record we have seen.

The script, as we will now see, has to use various ‘hacky’ methods to try and guess the number of fulltext items within a repository, as different systems populate simple Dublin Core in different ways.

Repositories and OAI-PMH/Simple Dublin Core.

It quickly became clear on experimenting with different repositories that the different repository software populate Simple Dublin Core in a different manner. Here are some examples:

Eprints2: As you can see above in the Sussex example, fulltext items are added as a dc.relation field, but so too are any publisher/official URLs, which we don’t want to count. The only way to differentiate between the two is to check the domain name within the dc.relation URL and see if it matches that of the OAI interface we are working with. This is by no means solid: it is quite possible for a system to have more than one hostname, and what the user gives as the OAI URL may not match the URLs the system gives for fulltext items.

Eprints3: I’ll use the Warwick repository for this, see the HTML and OAI-PMH for the record used in this example.

<dc:format>application/pdf</dc:format>
<dc:identifier>http://wrap.warwick.ac.uk/46/1/WRAP_Slade_jel_paper_may07.pdf</dc:identifier>
<dc:relation>http://dx.doi.org/10.1257/jel.45.3.629</dc:relation>
<dc:identifier>Lafontaine, Francine and Slade, Margaret (2007) Vertical integration and firm boundaries: the evidence. Journal of Economic Literature, Vol.45 (No.3). pp. 631-687. ISSN 0022-0515</dc:identifier>
<dc:relation>http://wrap.warwick.ac.uk/46/</dc:relation>

Unlike Eprints2, the fulltext item is now in a dc.identifier field, while the official/publisher URL is still a dc.relation field, which makes it easier to count the former without the latter. EP3 also seems to provide a citation of the item, which goes in a dc.identifier field as well. (As an aside: EPrints 3.0.3-rc-1, as used by Birkbeck and Royal Holloway, seems to act differently, missing out any reference to the fulltext.)

Dspace: I’ll use Leicester’s repository, see the HTML and OAI-PMH for the record used. (I was going to use Bath’s but looks like they have just moved to Eprints!)

<dc:identifier>http://hdl.handle.net/2381/12</dc:identifier>
<dc:format>350229 bytes</dc:format>
<dc:format>application/pdf</dc:format>

This is very different to Eprints. Dc.identifier is used for a link to the HTML page for this item (like Eprints2, but unlike Eprints3, which uses dc.relation for this). However, it does not mention either the fulltext item or the official/publisher URL at all (this record has both). The only clue that this record has a full text item is the dc.format (‘application/pdf’), and so my hacked-up little script looks out for this as well.

I looked at a few other Dspace based repositories (Brunel HTML / OAI ; MIT HTML / OAI) and they seemed to produce the same sort of output, though not being familiar with Dspace I don’t know if this is because they were all the same version or if the OAI-PMH interface has stayed consistent between versions.

I haven’t even checked out Fedora, bepress Digital Commons or DigiTool yet (all this is actually quite time consuming).
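Pulling those cases together, the guesswork in my script boils down to something like the following sketch (has_fulltext, the host comparison and the ‘.pdf’ test are my own rough heuristics, nothing blessed by any of the repository platforms):

use URI;

# Rough heuristic: does this record appear to have a fulltext item?
# $dc is the parsed <oai_dc:dc> element, $oai_host the hostname of
# the OAI-PMH interface we are harvesting from.
sub has_fulltext {
    my ( $dc, $oai_host ) = @_;

    # EPrints 2 puts fulltext URLs in dc:relation, EPrints 3 in
    # dc:identifier; either way, only URLs on the same host as the
    # OAI interface count, to exclude publisher/official URLs
    for my $tag (qw( relation identifier )) {
        for my $node ( $dc->getElementsByLocalName($tag) ) {
            my $url = $node->textContent;
            next unless $url =~ m{^https?://};
            return 1 if URI->new($url)->host eq $oai_host
                     && $url =~ /\.pdf$/i;
        }
    }

    # DSpace never exposes the file URL in simple DC; the only
    # clue is a dc:format of 'application/pdf'
    for my $node ( $dc->getElementsByLocalName('format') ) {
        return 1 if $node->textContent =~ m{application/pdf};
    }

    return 0;
}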

Commentary

I’m reluctant to come up with any conclusions because I know the people who developed all this are so damn smart. When I read the articles and posts produced by those who were on the OAI-PMH working group, or were in some way involved, it is clear they have a vast understanding of standards, protocols, metadata, and more. Much of what I have read is clear and well written, and yet I still struggle to understand it due to my own mental shortcomings!

Yet what I have found above seems to suggest we still have a way to go in getting this right.

Imagine a service which will use data from repositories: ‘Geography papers archive’, ‘UK Working papers online’, ‘Open Academic Books search’ (all fictional web sites/services which could be created which harvest data from repositories, based on a subject/type subset).

Repositories are all about open access to the full text of research, and it seems to me that harvesters need to be able to presume that the fulltext item, and other key elements, will be in a particular field. And perhaps it isn’t too wild to suggest that one field should be used for one purpose. For example, both Dspace and Eprints provide a full citation of the item in the DC metadata, which an external system may find useful in some way. However, it is in the dc.identifier field, and various other bits of information are also in that same field, so anyone wishing to extract citations would need to run some sort of messy test to try and ascertain which identifier field, if any, contains the citation they wish to use.

To some extent things can be improved by getting repository developers, harvester developers and OAI/DC experts round a table to agree a common way of using the format. Hmm, ring any bells? I’ve always thought that the existence of the Bath Profile was probably a sign of underlying problems with Z39.50 (though I am almost totally ignorant on Z39.50). Even this would only solve some problems: the issue of multiple ‘real world’ elements being put into the same field (both identifier and relation are used for multiple purposes), as mentioned above, would remain.

I know nothing about metadata nor web protocols (left with me, we would all revert to tab delimited files!), so am reluctant to suggest or declare what should happen. But there must be a better fit for our needs than Simple DC, Qualified DC being a candidate (I think; again, I know nuffing). See this page highlighting some of the issues with Simple DC.

I guess one problem is that it is easy to fall into the trap of presuming repository item = article/paper, when of course it could be almost anything; the former would be easy to narrowly define, but the latter (which is the reality) is much harder to give a clear schema for. Perhaps we need ‘profiles’ for the common different item types (articles/theses/images). I think this is the point where people will point out that (a) this has been discussed a thousand times already and (b) it has probably already been done! So I’ll shut up and move on (here’s one example of what has already been said).

Other notes:

  • I wish OAI-PMH had a machine readable way of telling clients if they can harvest items, reuse the data, or even access it at all (apologies if it does allow this already). The human text of an IR policy may forbid me from sucking up the data and making it searchable elsewhere, but how will I know this?
  • Peter Millington of RSP/SHERPA recently floated the idea of an OAI-PMH verb/command to report the total number of items. His point is that it should be simple for OAI servers to report such a number (probably a simple SQL COUNT(*)), but at the moment OAI-PMH clients – like mine – have to manually count each item, parsing thousands of lines of data, which can take minutes and creates processing requirements for both server and client, just to answer the simple question of how many items there are. I echo and support Peter’s idea of creating a count verb to resolve this (see the sketch after this list).
  • Would be very handy if OAI-PMH servers could give an application name and version number as part of the response to the ‘Identify’ verb. Would be very useful when trying to work around the differences between applications and software versions.
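To illustrate Peter’s point, here is what a client currently has to do just to count items (a sketch, again assuming HTTP::OAI, which walks every resumption-token page for you; this paging is exactly the work a count verb would remove):

use strict;
use warnings;
use HTTP::OAI;

# The slow status quo: to answer 'how many items?' a client must
# page through every ListIdentifiers response via resumption tokens
my $harvester = HTTP::OAI::Harvester->new( baseURL => $ARGV[0] );
my $list = $harvester->ListIdentifiers( metadataPrefix => 'oai_dc' );
my $count = 0;
$count++ while $list->next;
print "$count items\n";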

Back to the script

Finally, I’m trying to judge how good the little script is: does it report an accurate number of full text items? If you run an IR and would be happy for me to run the script against your repository (I don’t think it creates a high load on the server), then please reply to this post, ideally with your OAI-PMH URL and how many full text items you think you have, though neither is essential. I’ll attach the results in a comment to this post.

Food for thought: I’m pondering the need to check the dc.type of an item, and only count items of certain types. E.g. should we include images? One image of a piece of research sounds fine; 10,000 images suddenly distort the numbers. Should it include all items, or just those of certain types (article, thesis etc)?
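If I do go down that route, the check might look something like this (hypothetical: the dc:type vocabulary varies wildly between repositories, so any whitelist would need tuning per system):

# Only count records whose dc:type is in a whitelist of types
my %countable = map { $_ => 1 } ( 'article', 'thesis', 'book section' );

sub is_countable_type {
    my ($dc) = @_;    # parsed <oai_dc:dc> element, as in the sketches above
    for my $node ( $dc->getElementsByLocalName('type') ) {
        return 1 if $countable{ lc $node->textContent };
    }
    return 0;
}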

ircount : new location, new functionality

A while ago, I released a simple website which reported on the number of items in UK repositories over time. It collected its data from ROAR, but by collecting it on a weekly basis it could provide a table showing growth week by week.

First it has a new home: http://www.nostuff.org/ircount/

Secondly, it now collects data for every institutional (and departmental) repository registered in ROAR across the world, not just the UK. It has been collecting this data since July.

The country integration isn’t perfect, you have to select a country, and then you are more or less restricted to that country (though you can hack it, see the ‘info&help’), and there is a lot of potential with improving this. There are also a couple of bugs, for example when comparing four repositories it seems to (a) forget which country you were dealing with, and (b) it stops showing the graph/chart.

I’m currently looking at trying to make an educated guess at how many fulltext items are in a given repository. This is proving to be a steep learning curve in the joys of OAI-PMH, and in how the different repository systems (and the different versions of these systems) have allocated information about the fulltext to different Dublin Core (DC) elements. But that is for another post.

In the mean time, I hope the worldwide coverage is of some use, and feel free to leave any comments.

or08: eprints track, session 2

After coffee, a little more talk about new features and the future, as we ran out of time before. Christopher Gutteridge has now turned up; he may have had a few grown-up fizzy drinks last night.

(Lost concentration here: take with a grain of salt.) An Eprints plugin will try to pick up when people enter their names wrong (e.g. get first/last names mixed up). A ‘report an eprint’ (or ‘report an issue with an eprint’) link on item/record pages?

3.1 beta: should be released in a day or so. Live CD available.

When will there be a new template (for records/items) including related papers (or ‘people who liked this also liked…’)? An HTML designer is working on this. Abstract pages can be recreated daily for fresh data (e.g. I think for stats/other papers).

People come in via Google for an item and then leave again. Soton ECS put links to the postgrad prospectus and more on abstract pages for items, and found hits to the postgrad prospectus tripled.

Talking about more finely grained controls and privileges, i.e. who can edit what, and where, and giving people additional power. Includes, for example: this person can edit the wording of fields/help, but not edit the workflow.

11:42: now moving on to research assessment experience.

Bill Mortimer – Open University.

How Open used eprints to support the RAE experience.

Used EPrints as a publication database because it was publicly available and helped increase citations. Also because of the reporting tool developed for EPrints.

Open used mediated deposit, but also imported records and self-deposit.

Only peer reviewed items in ORO. Had up to 7 temp ‘editors’ processing the buffer.

Very slow uptake when mediated. Now have just under 7,000 items in ORO.

Simplified the workflow (which of course ep3+ have improved). Researchers responsible for depositing items for RAE submission.

Pro: increased awareness (of the IR), increased deposits.

Con: overlap of perceptions of ORO and the RAE process (some felt the RAE took over the IR). Lots of records, but only 16% carry full text (the % of full text varied by department).

Slide with some future ideas, good; see the presentation at http://pubs.or08.ecs.soton.ac.uk/ (though not currently there).

12:06pm

Susan Miles – Kingston

Metadata-only repository at the moment, but they plan to add content and full text this year.

The uni departmental structure and hierarchy has been the most controversial thing. Didn’t use the RAE tool; it wasn’t out of the box.

Subject team staff created records, but focus moved to the collection of physical items. (Some) staff really got into the IR, but this had the downside that many left with their new skills and experience!

misc bits

  • non-existent items
  • people trying to pass off others’ work
  • items being removed and then re-entered constantly at the last minute for the rae
  • overseas academics caused issues.
  • proof of performances and other ‘arts’ outputs were a challenge (next time get the academics to do it).
  • a barrel moving back and forth in a room was a piece of research to be submitted for the RAE (how? evidence, metadata)

Unexpected, but lots of interest in the IR across the University. But lots of things in the buffer and no staff.

University committee has endorsed the IR as the source of publication data.

Because subject team staff were used for the IR RAE work, subject support now have good knowledge of the IR, which is good.

12:27

Wendy from Soton

A higher profile in the Uni due to RAE work means people are including her – and the IR – more in discussions across campus, such as looking at the REF.

Question (from me): were any academics reluctant or against their RAE information being put online? Answer: no.

[anon comment, etheses mandate being reviewed regarding animal rights issues etc]

William Nixon: also planning to upload RAE data. Does not foresee any problems, BUT recommends not flagging items as RAE08, as some academics may have issues with this.

Les: HEFCE put metadata for items submitted to the RAE on the web anyway.

Q for Open: you currently only publish peer reviewed items; do you plan to change this?

A: yes, reviewing.

or08: Eprints 3.1

At the sucks-less-than-dspace Eprints track today. First up: Eprints 3.1 and the future. I haven’t seen anything about 3.1 before this; v3 was released over a year ago, so I’m looking forward to seeing what is new.

9:10am: Les Carr is talking, reviewing v3, released last year. Talking about the large amount of work surrounding a repository (for everyone), which he experienced first hand running the Soton ECS repository, and the work they have put in to help with this. He found that when he contacted academics to point out problems he had fixed with their items/records, they seemed glad that someone was doing this. Last year the EPrints team wanted to focus on ‘things on the ground’ to make things easier, and not to focus too much on rejigging the internals.

9:20: 3.1 gives more control to users: manage the repository without needing technical time (especially as University IT services often want to just set something up and leave it). Showing citation impact for authors.

The Eprints 3 platform is built of two parts: a ‘core’ backend, and plugins. Plugins control everything you see (I didn’t know plugins were used to this extent). A lot of the new things are just new plugins ‘slotted in’. Plugins can be updated separately, which means upgrading specific parts of functionality is easy and doesn’t affect the whole system.

Lots of things moved from the command line to the web interface.

Administration: a user interface for creating new fields and configuring administrative tasks (sounds good).

Easily extend metadata, what gets stored, in a nice user interface.

9:31: live demo of adding new fields: ‘manage metadata fields’. You can edit them for each dataset, e.g. document, eprints, users, imports. First you get a screen showing all existing fields, and a text field to enter a new field name (and something to show if you have any fields half-created, to continue). The interface looks similar to creating an eprint item. Select from the different types of field, e.g. boolean, date, name, etc; lots of them, with descriptions of what they are. One is a ‘set’, where you can add a list of defined options; another is ‘compound’, which can have various subfields. This is looking great.

9:38: next screen, loads of options: required? include in export? index? As ‘name’ was selected on the previous screen, there are various options specific to the name field type, and lots more. Has help (click on the ‘?’).

9:41: next screen: a set of questions about how this is displayed in the user interface, i.e. the text users would see, help text. Again seems well designed. Editing XML in the past wasn’t rocket science, but it was easy to forget steps or get syntax wrong; plus (certainly for v2) you had to do it with no items in the archive (not easy on a live repository!)

By default new fields appear in the MISC step (screen) of the deposit process for users. which can be changed by editing the workflow.

9:53: configuration (via the web interface): fairly crude at the moment but looks to be useful (though not turned on for the demo repository); basically you can edit things that are in the cfg files. They plan to turn this into a full user interface in the future (not sure if for 3.1 or beyond).

9:58: running through some of the things in the cfg files, such as how to make a field mandatory only for theses.

Quality assurance: the ideas of an ‘issue’ (something amiss) and an ‘audit’.

Issue: stale or missing metadata. Issues are reported by item and also aggregated by repository. Notification of issues can be emailed to authors. We can define all this, i.e. what counts as an issue, in the cfg files. Can also check for duplicates (good, as it will make my god-awful script we use at Sussex obsolete).

There can be a nightly audit, to see if anyone has acted on the alerts and issues. Reports can be generated for people.

10:07: batch editing. Do a search and then batch-change any fields for those search results. Nice. Running short of time so not demoing.

Manage deposits screen (for users): icons on the right of each of your items, to see, delete, move, etc. You change what columns you see on this screen using icons at the bottom of the screen, and can also move them around.

Impact evidence: citation tracking; researchers can track citation counts and rank papers. Volatile fields don’t change the history of a record. Download counts come from IRStats.

Better bibliographies: can reorder, choose what to view, better control. This is very much needed, as different researchers want their publication list presented in different ways. Uses stylesheets.

Complex objects: all public objects have official URIs. Expanded document-level metadata.

Versioning (based on the VERSIONS project): ‘simple and useful’. Published material: ‘pre, post or reprints’. Unpublished material: early draft, working paper. Looks good.

10:19: improved document uploader: can upload a zip file of many files.

10:25: discussion about versions, e.g. how a user may add a draft (with limited metadata) and then go on and re-edit the item later on when they have a published version.

‘Contributors’ field: roles taken from the DC relator names (225 of them). A large list of roles; they may want to cut it down.

A new skin, but not for 3.1: the record/abstract page will show a thumbnail of the item at the top, because the item is the important thing, not the metadata (which is what is emphasised in the UI at the moment); in the same way that Flickr shows the photo as the main thing on the page, with metadata at the bottom. Good idea; the new layout looks good.

Future (no time to talk properly): cloud computing; Amazon EPrints services perhaps (you just sign up for an IR on Amazon and one is automatically created); EPrints on top of Fedora (saw folks on IRC talking about the same for DSpace the other day), or the Microsoft offering just announced; and in a box (i.e. comes out of the box as a pre-installed server), e.g. Honeycomb.

or08: live blogging experiment

Today – as you probably have seen – I posted some badly written notes that meant nothing to no one, and interested even fewer. This was my experiment in live blogging. I’ve seen others do it quite a bit recently and always thought it worked well, so I wanted to give it a try.

Some thoughts:

  • Using a different tense is a little weird. Normally we write in the past tense when reviewing an event; when blogging as it happens I found myself switching between present and past tense (the latter out of habit). This wasn’t helped by having no internet access before coffee, so I was writing into a text editor (not that you wanted to know: it’s called Smultron) something I planned to paste into a blog post in the future, which when posted would be talking about the past, but I wanted it to read as if it was live!
  • I looked up at one point while switching between wordpress and twitter and saw two laptop screens of people in front of me, one had twitter across the screen, the other had the wordpress composition window. Am I boring or with the in-crowd?!
  • Perhaps the biggest point was my difficulty in note taking. I wanted to write stuff that people who weren’t there would find useful. However, my notes were largely rather basic, not meaty enough to say much; someone reading them would get a general idea of what a talk was about. They would give someone a feel for the outline of a talk, but not what the key points were, which I think is a crucial difference.
  • As well as taking notes, I had various tabs open, including the excellent Crowdvine conference site, Twitter, Bloglines, and Google blog search (searching to see what turned up for ‘or08’… oh look! me! god I’m so vain). Between the note taking, twittering (and learning about tags on Twitter) and checking out Crowdvine, I would occasionally look up and have no idea what the presenter was talking about (I’m a man; I have evolved to be an expert single-tasker). Must try and ensure I’m not being distracted from the actual reason I’m there.
  • My notes were rough. Not helped by the fact that the lecture hall was very full (and it wasn’t one of those poncy MBA lecture theatres with big wide seats), so I was being careful of my elbows, which limits typing, and for me, using the shift key. Does the embarrassment of badly typed, ill thought out, ungrammatical notes get trumped by their potential interest to others and their timeliness?
  • Timeliness is an important point, I could have waited until the end of the day but wanted to get them out straight away.
  • After morning coffee I had the internet. I sat down next to two people I had met before; while there were quite a few people with their laptops open, they were not among them, and I felt a little self-conscious. They were trying to listen to the talk, and here was this guy next to them mucking around on his laptop the whole time. Actually, I don’t think they were bothered.
  • While talking about being self-conscious: does posting things as quickly as possible look like attention seeking and ego massaging? I never thought that about anyone else doing it, so hopefully the answer is no (but then I love this sort of thing, so I won’t).

So will I do it again? Yes, and I like having the web to hand while at these things. I think I need to improve my note taking, and perhaps take more time writing up points (and my thoughts) on the things of interest rather than writing lots of little snippets. I basically need to take notes anyway (whether with notepad and pen, MS Word or a blog), and it does make things stick in my head better than just sitting there, so I may as well make my notes open to others. The timeliness (that’s time – li – ness, not Time Lines!) is perhaps harder to argue, but I like the idea that things are hitting the web the moment they happen, so I think I will continue it.

I remember a couple of years ago when the WWW 2006 conference was taking place (can’t believe it was two years ago): I was sitting at my desk watching Flickr, blogs, and just about everything else being updated – a lot – in real time. The ability to see photos, watch videos and read notes of things that had happened a couple of minutes ago was amazing and really helped capture the feel of the whole event.

Other bits

Battery was running low (why didn’t campus designers in the 60s think to add plug points in lecture theatres for laptops?) so I had to revert to pen and paper for session 3. All good talks, but the SWORD talk by Julie Allinson was excellent.

I didn’t stay for the poster session minute madness, but of the few posters I did have a chance to see, the one for Feedforward really caught my eye and just looks excellent.

Crowdvine (link to or08 on crowdvine)

This is an excellent tool, and I recommend it to anyone setting up a conference, though I think web-savvy crowds will get more out of it (e.g. integration with Twitter and web feeds). It helped to put names to faces, but it also helped to get a feel for who some of the more prominent people are. For example: if Les Carr talks about GNU EPrints, I know to listen as he manages the thing, and if Bill Hubbard talks about IRs I know to listen to what he says because he manages SHERPA in the UK. However, I couldn’t tell you the same about the DSpace or US equivalents. I still can’t tell you their names (I don’t do names), but I certainly recognised the faces of those who seemed to be very active in their area. I know this sounds a little elitist or hierarchical, but it really isn’t meant to be.

Handy hint: if you want your profile page to be at the top of the conference homepage, just make superficial changes to it every few hours!

As someone mentioned on Twitter, this, and every social networking site, needs much more than just ‘friend’. Perhaps: ‘I have seen a few emails from them on mailing lists and I may have even replied to one’, ‘I kinda stood in the same group as them during a coffee break at a conference once’ and ‘I read their blog and see them mentioned here and there so we are a little like friends’. I felt a little unsure when clicking on a few people as friends, but then they all added me back (except Christopher Gutteridge, bastard). Of course this is no different to Facebook, and the amount of people who have requested me as a friend who I swear I have never spoken to, even if they knew a girl who lived in the corridor above me in the first year of university (that’s a real one).

(PS I used too many brackets and exclamation marks in this blog post!)

or08: session 2b: Sustainability Issues

[again unedited, unchecked, sorry for mistakes!]

Warwick Cathro
Assistant Director General, National Library of Australia

[sorry didn’t take very good notes for Warwick’s good talk]

The “Towards the Australian Data Commons” paper is on the web, for reference on Australian policy in this area.

various sites/projects:

ARROW: aggregates IRs from uni repositories; 90,000 records, expected to grow rapidly. Not a ‘native search service’: intended to let others use the metadata.
Future: evolve, supported financially by the ‘Australian National Data Service’ (like everything else in this talk). Will use Shibboleth and possibly OpenID.

Registry services
[interesting stuff, another project, but didn’t make any notes]

PILIN – identity management.
Handle mirror/proxy.
Tools, and defining requirements for a national service:
a national persistent identifier service.

Obsolescence notification
AONS project
Toolkit on SourceForge
Adapters for IR software
Compares a profile with data from external registries; for each registry they have built an adapter.

Australian METS profile
Encoding of preservation metadata,
exchange data format.
Three-layer model: top, generic profile; middle, content models; bottom, implementation profiles.

———-

Libby Bishop (Leeds/Essex)

Timescapes: looking at relationships and family life (young people, fatherhood, older people).
But also building a data archive in the process; some objects are not born digital.

400+ participants
5000+ objects
500+ gb size.

Sustainable = Shareable + desirable.

Share:
IP sorted, resource discovery, harvestability.

Desirability
What makes people want to use this? This issue is central to the service.
Researchers are the primary audience, but also media, policy makers, students.
Longitudinal (new term to me), e.g. track people as they move through time.
Needs to be multimedia: voice, video, audio.
Video helps to engage young people.
Thematic data.
Reuse helps make it desirable.

Distinctive features of Timescapes?
Answer: data (primarily),
but also: multimedia, sensitive content, complex access,
longitudinal, dynamic updating,
integration of research, archive and reuse.
Researchers are central to the design; they interact with the repository.

Timescapes Repository (at Leeds); Timescapes data preserved at the UK Data Archive (Essex):
No point recreating a preservation service at Leeds. Uses DigiTool at Leeds because mandras (?) was. DigiTool is not open. Wanted to use an existing tool at Leeds rather than set up a new one.

Metadata:
Lots of challenges, especially in deciding what is needed.
Lots of people, expertise, and different institutions.
Researchers tend to be the experts and know their area,
and IR people know current practice in metadata.
Looking into how to mark up audiovisual material, e.g. looking at a METS wrapper.
Modifying the depositor interface to the repository to let people add their own metadata, with some stuff still being added by the IR staff.

Showing an example of the sort of data (in an MS Word file) the researchers are collecting. It needs a fair bit of conversation to encourage researchers to do this (transcript guidelines/forms).

Back to sustainability:
‘Key strategies for sustainability’:
– embedding in multiple institutions (can’t predict the future);
– building trust with researchers in what you are doing (and asking them to do) is essential, especially in the long term;
– reuse!
Lots of people want to be part of the project: an affiliates programme. Those who want to work closely have to agree to contribute their own data and reuse current data.

Summary:
Researchers agreed to share and reuse data: success.
Waiting list of affiliates.

Issues:
Quality of researcher-submitted data; some reluctant to share; DigiTool multimedia support limited.
Collaboration takes time, especially across institutions.

or08 session 1 (part3)

Rich Tags: cross repository browsing
ds
ECS, Southampton

Categories can be unhelpful: if you don’t know the scheme (LoC!) it is hard to see related articles.

Solution: unified categories across repositories. Automated.

Aggregated data via OAI, then got more from external sources.

E.g. institution name from whois, decade from date.

TF-IDF algorithm for categories (?)

mSpace, an ECS project.

richtags.org, example of above, with records from various universities.

Categories come from DMOZ.