SQL grouping mulitple values in to one SELECT field

SQL is one of those things I always need to understand more than I do. Sometimes odd little problems can seem impossible with my limited understanding of the language.

One such hit this me this morning. We wanted a report from our Institutional Repository (a online archive of the University’s research publications). The report was to include title, journal title, type (journal article, book, etc), author(s) and departments for each item in the report.

Anyone with a basic understanding of relational databases will be able to picture the main table of publications, with another table for publication authors – one record for each author of each publication, with the publication’s id number to link them to the item in question. The same set up for the departments associated with a particular item.

With basic SQL, it would be easy to return a list which included multiple entries for items with multi authors, i.e. an item with three authors would be repeated three times in the list, each showing a different author. An item with three authors and two departments would be repeated six times to cover all combinations of authors and departments. Not very desirable. We could also provide a list which only shows each item once (as required) but only show one author and department (ie by using GROUP BY and perhaps a function to select the first author). Again would be less than ideal.

A bit of googling showed creating an agregate function in postgress (we are using mysql), and this solution which seems to use a tempory table.

The solution in the end was much more simpler, and involves using a function called group_concat.

The final SQL looked like this:

SELECT a.eprintid, a.type, a.title, a.publication,
GROUP_CONCAT(c.creators_given, " ", c.creators_family) as Authors,
GROUP_CONCAT(d.depts) as Departments
FROM `archive` a, `archive_creators` c, `archive_depts` d
WHERE a.eprintid = c.eprintid
AND a.eprintid = d.eprintid
GROUP BY c.eprintid, d.eprintid

The two group_concats will produce a list of authors (and a list of departments) both comma separated. The author’s is a little more complex as the name is within two separate fields, though this isn’t a problem, we just need to specify both with a space in between to make them readable.

or08: eprints track, session 2

After coffee a little more talk about new features and the future as we ran out of time before. Christopher Gutteridge has now turned up, he may have had a few grown up fizzy drinks last night.

(lost concentration here: salt grain take) Eprints plugin will try and pick when people enter their names wrong (e.g. get first/lastnames mixed up). Report an eprint (or report an issue with an eprint) link on item/record pages?

3.1 beta: should be released in a day or so. Live CD available.

When will the new template (for records/items) including related papers (or ‘people who liked this also liked…’), html designer working on this. Can recreate abstract pages daily for fresh data (e.g. i think for stats/other papers).

People come in via Google for an item and the leave again. Soton ecs put links to postgrad prospectus and more on abstract pages for items, found hits to postgrad prospectus tripled.

Talking about more finely grained controls ans privileges , i.e. who can edit what, and where, and giving people additional power. Includes, for example, this person can edit wording of fields/help, but not edit workflow.

11:42: now moving on to research assessment experience.

Bill Mortimer – Open University.

How Open used eprints to support the RAE experience.

used eprints as a publication database because it was publicly available and helped increase citations. Also because of the reporting tool developed for eprints.

Open use mediated deposit but also imported records and self deposit.

Only peer reviewed items in ORO. Had up to 7 temp ‘editors’ processing the buffer.

Very slow uptake when mediated. Now have just under 7,000 items in ORO.

Simplified the workflow (which of course ep3+ have improved). Researchers responsible for depositing items for RAE submission.

Pro: increased awareness (of IR) increased deposits.

con: overlap of perceptions of ORO and RAE process (some felt RAE took over the IR). Lots of records but only 16% carry full text (% of full text varied by department).

Slide with some future ideas, good, see presentation on (though not currently there) http://pubs.or08.ecs.soton.ac.uk/

12:06am

Susan Miles – Kingston

metdata only repository at the moment but plan to add content and full text this year.

uni departmental structure and hierarchy has been the most controversial thing. Didn’t use RAE tool, wasn’t out the box.

Subject team staff created records, but focused moved to collection of physical items. (some) staff really got in to the IR, but this had the downside that many left with their new skills and experience!

misc bits

  • non existent items
  • people trying to pass off others work
  • items being removed and then re-entered constantly at the last minute for the rae
  • over sees academics caused issues.
  • proof of performances and other ‘arts’ outputs were a challenge (next time get the academics to do it).
  • a barrel moving back and forth in a room was a piece of research to be submitted for the RAE (How. evidence, metadata)

Unexpected, but lots of interest in the IR across the University. But lots of things in the buffer and no staff.

University committee has endorsed the IR as the source of publication data.

Because of using subject team staff for IR RAE, subject support now have good knowledge of the IR, which is good.

12:27

Wendy from soton

higher profile in Uni due to RAE work means people are including her – and the IR – more in discussions across campus such as looking at the REF.

question (from me): were any academics reluctant/against their rae information being put online? Answer: no

[anon comment, etheses mandate being reviewed regarding animal rights issues etc]

William Nixon: also planning to upload rae data. Does not foresee any problems, BUT recommend to not flag items as rae08 as some academics may have issues with this.

Les: HEFCE put metadata for items submitted to rae on web anyway.

q for open: you are currently only published peer reviewed items, do you plan to change this.

a: yes reviewing.

or08: Eprints 3.1

At the sucks-less-than-dspace Eprints track today. First up Eprints 3.1 and Future. Haven’t seen anything about 3.1 before this, v3 was released over a year a go so looking forward to seeing what is new.

9:10am: Les Carr is talking. reviewing v3 released last year. Talking about the large amount of work surrounding a repository (for all), which he experienced first hand running the soton ecs repository, and the work they have put in to help this. He found that when he contacted academics to point out problems he has fixed with their items/records they seemed pleased glad that someone was doing this. Last year they (eprints team) wanted to focus on ‘things on the ground’ to make things easier and not focus too much on rejiging the internals.

9:20: 3.1 more control for users. manage the repository without needing technical time (especially as University IT services often want to just set something up and leave it). showing Citation impact for authors.

Eprints 3 platform is built of two parts: ‘core’ backend, and plugins. Plugins control everything you see (I didn’t know plugins were used to this extent). A lot of the new things are just new plugins ‘slotted in’. Plugins can be updated separately which means upgrading specific parts of functionality is easy and doesn’t affect the whole system.

Lots of things moved from the command line to the web interface.

Administration: user interface for creating new fields and and configuring administrative tasks (sounds good).

Easily extend metadata, what gets stored, in a nice user interface.

9:31: live demo of adding new fields: ‘manage metadate fields’, you can edit them for each dataset e.g. document, eprints, users, imports. First get a screen showing all existing fields, a text field to enter a new field name (and something to show if you have any fields half created, to continue). Interface looks similar to creating an eprint item. select the different types of field e.g. boolean date, name, etc, lots of them, with descriptions of what they are, also one is a set where you can add a list of defined options, another is compound which can have various subfields. This is looking great.

9:38: next screen, loads of options: required? include in export? index? As name was selected on prev screen various options specific to the name field type. lots more. Has help (click on the ‘?’).

9:41: next screen set of questions about how this is displayed in the user interface, i.e. text user would see, help text. Again seems well designed. Editing XML in the past wasn’t rocket science but it was easy to forget steps or get syntax wrong, plus (certainly for v2) you had to do it with no items in the archive (not easy on a live repository!)

By default new fields appear in the MISC step (screen) of the deposit process for users. which can be changed by editing the workflow.

9:53: configuration (via web interface), fairly crude at the moment but looks to be useful (though not turned on for the demo repository), basically can edit things that are in cfg files. plan to turn this in to a full user interface in the future (not sure if for 3.1 or beyond).

9:58: running through some of the thins in the cfg files, such as how to make a field mandatory only for theses.

Quality Assurance. ideas of an ‘issue’ (something amiss) and an ‘audit’.

issue: stale, missing metadata. issues reported by item and also aggregated by depository.notification of issues can be emailed to authors. We cn define all this, i.e. what counts as an issue in the cfg files. can also check for duplicates (good as it will make my god awful script we use at Sussex obsolete).

Can have a nightly audit, and see if anyone has acted on the alerts and issues. reports can be generated for people.

10:07: batched editing. do a search and then batched change any fields for those search results. nice. running short of time so not demo’ing.

manage deposits screen (for users) icons on the right of each item of yours, to see, delete, move, etc. you change what columns you see on this screen by using icons at the bottom of the screen, can also move them around.

Impact Evidence: citation tracking, researchers can track citations counts and rank papers. volatile fields don’t change the history of a record. download counts from irstat.

Better bibliographies. can reorder, choose what to view, better control. this is very much needed as different researchers want their publication list in a different way. uses stylesheets.

Complex objects: all public objects have official URIs. expanded document-level metadata .

Versioning (based on VERSION project). ‘simple and useful’. pubished material ‘pre post or reprints’. unpublished materail, early draft, working paper. looks good.

10:19: Improve Document uploader. can upload a zip file of many files.

10:25 discussion about versions, e.g. how a user may add a draft (with limited metadata) and then go on and re-edit the item later on when they have a published version.

‘Contributers’ field. roles taken from dc relator names (225). large list of roles, may want to cut down.

A new skin, but not for 3.1 – i.e. record/abstract page will show a thumbnail of the item at the top, because the item is the important thing not the metadata (which is what is emphasised in the ui at the moment), i.e. in the same way that flickr shows the photo as the main thing on the page, and metadata at the bottom, good idea. new layout looks good.

Future: no time to talk: cloud computing, amazon eprints services perhaps (you just sign up to a IR on amazon and one is automatically created). On top of Fedora (saw folks on IRC talking about the same for Dspace the other day), or the Microsoft offering just announced. In a box (i.e. comes out the box as a pre-installed server) honeycomb.