One Librarian's Take on the Repository Landscape

25 Oct 2023

I have been researching institutional repositories for the past four years, ever since I was part of a SCELC (a California consortium of non-UC schools) committee that assembled a repository landscape report. I am not an expert on repositories; I am not even primarily a repository librarian (I run one alongside many other applications). But I have many observations locked in my head that it would be constructive, and perhaps helpful for others, to write down.

CCA’s Situation

My thoughts on IRs are, of course, colored by my college’s experience with them. We have had an IR for over a decade now and it has served a broad array of use cases: syllabi archive, student works, awards juries, level (junior/senior) review, school-wide brainstorming project, archives, digitized artists’ books using Internet Archive’s Bookreader, commencement videos shared with parents, signed legal documents, media hosted so our catalog can link to them. And others that did not come to mind just now. The point being, the traditional open access faculty research use case, which our IR also serves, is only one among many functions for us and a relatively small one at that.

Our software is openEQUELLA, formerly EQUELLA. It was an independent Australian company when we selected it, was purchased by Pearson, dropped by Pearson, and subsequently open sourced. Over the lifespan of our repository, it has been supported by at least three different companies. I have my share of struggles with openEQUELLA, but it is enormously flexible software. It has a powerful contribution form builder, bulk metadata editor with scripting, custom display templates, custom item views (called hierarchies), robust APIs, and granular permissions. We are able to support such a broad range of functions due to its flexibility. Unfortunately, the open source community around it has disappeared and it is now essentially back to being the product of one company, Edalex, which has its own builds for its clients. We are looking for a replacement, which motivates much of my current IR research.

Not in Consideration

There are many repositories I only needed to briefly review before eliminating them from consideration. Bepress Digital Commons is owned by Elsevier, fully featured, but not open source. EPrints has a dated UI and is focused around journals, not our primary use case. CONTENTdm is more content showcase than full IR, and also not open source.

I will be honest: I didn’t do my full due diligence on DSpace. There are a lot of repositories to consider. I have seen clunky-looking DSpace sites and heard grumbling from colleagues, and that was enough. The new version (7) has a separate API layer and an Angular frontend, which is a promising approach. DSpace is Java, though, and I am wary from maintaining our current Java IR. Being open source and written in an interpreted language I can understand and contribute to is not vital, but it is very nice.

ArchivesSpace looked great at first, the documentation and APIs, but I thought I was missing something—how do I upload a file? You have to upload files elsewhere, then catalog in AS, which doesn’t fit our workflow. We need simple upload forms that do everything for our end users.

Archipelago Commons is unique; I regret that we cannot consider it more seriously. It has a lovely metadata story that sets it apart, positive community vibes, and, while Drupal and PHP aren’t my preference, it uses a modern suite of components. Unfortunately, proven longevity is one of our “must haves”; after our current experience, it is too risky to switch to something so new, with so small a user base.

Major Players

While some of the options above took time to consider, I fairly quickly identified the top contenders for us: Samvera (Hyku/Hyrax), Islandora, and InvenioRDM. There is an obvious commonality amongst these three: they are all open source, written in an interpreted language, and have large user communities of GLAM institutions. There are conferences, Slack/Discord spaces, listservs, vendors, user groups, etc. centered around these systems, which is important for a small institution with limited resources. We would probably be well off running any one of them, but below I present my thoughts on each and why we ultimately chose InvenioRDM. Another observation that confirms these are strong choices: all three of them came out with multiple releases, including features we would use, while I was researching. In fact, my estimations were constantly shifting based on recent developments and my perception of the direction of each project.

Samvera

Samvera was my initial favorite but also the first to be ruled out of the top three. It’s popular and backed by numerous large organizations and major grants. As a framework for building a repository, Hyrax seems great, but it also looks to require more devoted development time and do less out of the box. Samvera has a strong community with many adopters, a foundation, a code of conduct, fairly clear communication, an annual conference, all the signs of a mature and stable platform. The structure is a little confusing to me; there’s not only the name churn the software has undergone, which may appear minor but is the first obstacle anyone trying to learn about the system encounters, but also a number of GitHub issues from prior (grant-funded?) projects that lack comments and are left in ambiguous states on some Samvera repositories. Some of these features have been completed, and certainly required discussion, but that wasn’t evident. Maybe I was looking in the wrong place, but also part of a platform is how legible its resources are. By comparison, I know as little about Islandora as about Samvera, but found navigating their documentation and GitHub presence more intuitive.

The reason I ruled out Samvera was the lack of one specific feature: a write REST API. Our current repository functions as a syllabus archive thanks to an integration with an external system (plus I really like the power of having a REST API in my toolbox), and a requirement of our new system is fulfilling this function. Hyku/Hyrax has a module for a read API, but I didn’t see visible work on writing capabilities. I found that surprising, and I’m sure with some work one could be added, but I wanted a system that was closer to meeting our needs out of the box. I was also worried about work types, metadata flexibility, and metadata bulk editing. The multitenant features of Hyku, which are seeing a lot of active development with Hyku for Consortia, are unique in the repository ecosystem. They make for a powerful way to spin up usable, turnkey repositories for several organizations at once (whether it’s one site per member of a consortium or separate departmental sites under a university). Samvera’s focus, after initially being a repository-builder framework meant for organizations with a development team, now points toward expanding their market share with better migration tools and more power in its turnkey (Hyku) option. It has gone from being the favored option of large institutions to one of the best choices for smaller libraries.
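To make the write API requirement concrete, here is a minimal sketch of the kind of request an external system would send to create a record. Everything specific in it is an assumption for illustration: the `/api/records` path, the bearer-token header, the `metadata` envelope, and the made-up syllabus fields are hypothetical and not taken from any particular platform's documentation. The request is only built, not sent.

```python
import json
import urllib.request


def build_create_request(base_url, token, metadata):
    """Build (but do not send) a POST request that would create a record.

    The endpoint path and payload shape are hypothetical placeholders.
    """
    body = json.dumps({"metadata": metadata}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/records",  # hypothetical endpoint
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed token-based auth
        },
    )


# Made-up syllabus metadata, standing in for what an external system exports.
syllabus = {"title": "ARTED-101 Syllabus", "term": "Fall 2023"}
request = build_create_request("https://repo.example.edu/", "SECRET", syllabus)
print(request.get_method(), request.full_url)
```

A read-only API covers only the GET half of this exchange; the syllabus integration depends on the repository accepting the POST.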

Islandora

Islandora has a great community, based on my limited interaction with it. I had conversations on GitHub pull requests, on Slack, over email, and at a two-day Islandora Camp event that was absolutely wonderful, full of useful content and hosted by very knowledgeable community members. I learned about as much during the two days of Camp, which was affordably priced and included a gift coffee mug with pictures of our pets on it (!), as I did in months of research. Islandora has a host non-profit foundation and an annual conference. It’s my impression that the software will be well stewarded and viable for a long time.

I have some structural and user experience concerns with Islandora, from the perspective of wanting to allow inexpert users to contribute content with varying degrees of privacy. Islandora has a bifurcated content creation workflow—first create metadata, then attach “child” media files—which would confuse our users and introduce a point of failure into the contribution process. To make items private, we need to use a private file system not built into the demos I tried, and a “hierarchical access” contributed module from an Islandora vendor so that media children of private items would inherit their parent’s privacy restrictions. I believe these goals are achievable, but the more we fight the default configuration, the more worried I become. We cannot afford to run a wildly unique instance that diverges from the community and its support.

For us, some Islandora components add more maintenance cost than they provide value. Blazegraph, a triplestore with a query engine that hasn’t seen much development since 2016 when Amazon bought out its developers, sits alongside Islandora like a sidecar needing everything to be indexed in it without contributing core functionality back. Again, making some items private is a special consideration here. Fedora was a common pain point during my demos, both simply in running the app and as a common source of difficult-to-diagnose errors. A small library would probably benefit from dropping these and maybe some other services, and I admire Islandora 2’s microservice design which allows for this, but then…why use Islandora at all? We move closer and closer to building a bespoke Drupal site, not an Islandora site. We would not benefit from some of the main vectors of community development, which sadly echoes our current repository experience.

Drupal is not necessarily a negative for me. I’ve maintained Drupal sites. Islandora’s unique appeal over other systems is that most customization is accomplished through configuration in a GUI. Drupal provides a powerful suite of conceptual tools for building websites. I think Drupal’s success in libraries is largely explained by how these familiar structuring concepts can be used by non-coders: taxonomies, views, blocks, content types, contexts. It’s an immense architecture, but not all that hard for a librarian to understand, which means that I might be able to train other library staff to make major contributions to site design without needing to code. PHP is a bit of a minus as programming languages go, but we actually already run PHP apps and have developers on staff with PHP experience, whereas Samvera would be our first Ruby/Rails project.

Islandora’s focus appears to be on smoothing out the features, template, and deployment of their “2.0” major structural change which is tied to its (modern, version 8+) Drupal base. Running local demos over and over probably influences my perspective, but there seemed to be abundant activity on their Docker deployment project (isle-dc) and the starter site template, whereas their microservices framework (which performs things like derivative generation, integration with Fedora/Blazegraph) is mature and mostly stable. Islandora had more robust resources and documentation on deploying and running it in production than other options.

InvenioRDM

A big motivator in selecting Invenio was a meeting with the Caltech repository team, who graciously offered me their time to discuss IRs and answer my questions. They also run EPrints and Islandora, so they had some perspective on the relative merits of different systems. They were strong advocates for InvenioRDM, believed any new repository should be built with it, and thought that using Islandora as an IR would be a headache. I was surprised, but our discussion came before Islandora Camp, when I learned about some nuances mentioned above that we would be fighting against if we went with Islandora.

We are a Python shop, running multiple large Django applications, and I would have far more support for a Python project than one in another language. The Python ecosystem is familiar not only to our developers, but also our small sys admin / dev ops team. I was able to follow Invenio documentation on using a special “site code” module to make a trivial extension of the application in about an hour, whereas I would need to learn more framework idioms for other choices. Though Flask is not as familiar as Django for me and Invenio in particular uses some of the more complex Flask patterns (e.g. Blueprints) to facilitate being a modular application, its language is still a major appeal.

Invenio was also perhaps the platform that matured the most as I was researching. Multiple times, a critical feature that really benefits us suddenly appeared in the release notes. Though to be fair, sometimes new features seem to go unmentioned, and Invenio’s documentation can be spotty. A question asked in Discord was sometimes answered with a link to documentation that I had looked for but that either wasn’t linked from an index or had been added without any fanfare. The biggest issue with Invenio development is that the app carries a STEM focus which does not align with our community. Their roadmap includes items like GitHub integration and data citation support, which are of relatively little use to my college. In general, Invenio’s development speed seems faster than other platforms at this time, though that may also be because it is less mature.

Conclusion

These observations come firmly from my institutional context. I also didn’t talk much about our SCELC project, where we are looking for a consortial IR with particular attention to Hyku and DSpace. I would not necessarily write off any repository platform for another institution, and my main takeaway is that libraries are lucky to have many viable options in this area. In particular, the major open source platforms have strong communities, active development, and mature codebases.

Thank you to everyone who took the time to talk to me or dealt with my bothersome requests for help running repository demos. I hope this post is helpful to others, and I welcome any feedback or corrections.