What we don't know about open access: research questions in need of researchers

From the SPARC Open Access Newsletter, May 2008 edition

If a mature movement is one with a literature so vast that only specialists can master it, then the OA movement has been mature for several years.  But that only means that much is known about OA.  Much is still unknown, and much is changing so that much of our old knowledge doesn't apply to new circumstances. 

Here's an informal list of research questions whose answers would usefully fill out our current knowledge.  It's a personal list in the sense that it represents what I'd like to know myself to help me see the big picture, understand the devils in the details, and give better recommendations to researchers and institutions.  If I were advising Ph.D. students looking for research topics, I'd hand them this list and hope they'd find something they could sink their teeth into. 

Some of the questions must already be the subjects of research in progress.  Some already have partial answers in the literature.  Some are too small for an article and some are too large for a dissertation.  Some require the passage of time before we'll have enough evidence to make a serious attack on them. 

This list would be larger and better if it weren't just my own.  Therefore, I'll soon deposit it in the Open Access Directory (OAD), the wiki that Robin Peek and I launched just this week.  The basic idea behind the OAD is that a community can maintain a list far better than a single individual could maintain it alone.  It may take a few days to set up the new list, but watch my blog for news that it's ready for community editing.  Once it goes up, it will be as much yours as mine.  Please add new entries, revise existing entries, and send the URL to grad students and other researchers looking for ways to deepen our understanding of OA.

Open Access Directory

Alongside the new list on Research Questions, OAD will also include a second, related list on Research in Progress.  The second list will present ongoing research projects and contact info for their participants.  The idea here is simply to support collaboration and prevent unwanted duplication of effort.  When a question on the first list is already under investigation, someone can simply move the relevant details to the second list. 

I found it very difficult to stop writing this list once I started, for two somewhat conflicting reasons.  On the one hand, the catalog of what I don't know and would like to know is large.  Whenever I sit down to transcribe it, nothing is easier than to pose a new question.  On the other, some questions have to sneak up on me, and only strike when I'm doing something else, genuinely surprising me that I didn't think of them earlier.  For both reasons, I'm glad the list is moving to a wiki and not just to the SOAN archive.  I expect to be enlarging it myself.

(1) Access

* Publishers often assert that all or most of those who need access to peer-reviewed journal literature already have access.  Who doesn't have access?  What kinds of people don't have access and how well can we measure their numbers?

It's important to separate lay readers without access from professional researchers (in the academy, industry, and the professions) without access.  Among professional researchers without access, it would help to classify by country and field. 

It's also important to distinguish the lack of access from the demand for access.  Some of those without access may not care to have it.  How well can we measure the demand for access among those who don't currently have it?

Can we redo the estimates annually in order to have a moving measurement of our progress in closing the access gap and meeting the unmet demand?

* What is the current rate of self-archiving in different fields and countries?  Can we graph the change in these rates over time?  Can we disentangle spontaneous self-archiving from self-archiving encouraged or required by funders and universities?  Can we calculate both the percentage of self-archiving authors and the percentage of self-archived papers?

How accurately can we rank the disciplines by their levels of OA archiving?  Even if the widely-held assumption is correct that physics is first, what's the rest of the picture?

* What percentage of published articles from a given year or a given journal have OA copies somewhere online?  Can we break this down by permitted copies and unpermitted ones?  Can we break it down by OA preprints and OA postprints?  Can we break it down by field?  Can we collect these numbers easily enough to recompute them annually and chart future progress?

Bo-Christer Björk, Annikki Roos, and Mari Lauri just released a paper doing a significant part of this calculation.

* There is an emerging consensus to allow (or even require) self-archiving for the final version of the author's peer-reviewed manuscript, and to give the publisher exclusivity for the published edition.  Can we sample widely in different fields and come up with any useful generalizations on how widely these two versions differ from one another?

* How accurately can a university measure the "ullage" of its library? 

In 2004, I introduced the term "ullage" --after the word for the empty space at the top of a wine bottle-- for the gap between what a university directly offers through its library and the totality of literature to which faculty and students might need access.

Can a university measure its ullage in absolute numbers (of journals or journal articles) and percentages (of the journal literature in certain fields)?  Can we agree on the total number of journal articles (for a given year and perhaps for a given field) so that different universities could use the same denominators when making their calculations and comparing their results?

Can researchers outside a university compute its ullage using public information?  Can we do this for a wide range of institutions and map where ullage is higher and where it is lower?
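Here's a minimal sketch of the ullage arithmetic, assuming we already have two lists of journal identifiers: the titles a university's researchers might need and the titles its library actually provides.  The function name and the sample data are hypothetical placeholders of mine, and the hard part (agreeing on the list of needed titles, i.e. the denominator) is exactly what the questions above leave open.

```python
def ullage(needed_titles, held_titles):
    """Return (absolute gap, gap as a fraction of the needed titles)."""
    needed = set(needed_titles)
    # Titles researchers might need that the library does not provide.
    missing = needed - set(held_titles)
    share = len(missing) / len(needed) if needed else 0.0
    return len(missing), share


# Hypothetical data: the identifiers are placeholders, not real journals.
needed = {"J1", "J2", "J3", "J4", "J5"}
held = {"J1", "J3", "J9"}  # J9 is held but not on the "needed" list

gap, share = ullage(needed, held)
print(f"ullage: {gap} of {len(needed)} titles ({share:.0%})")
```

The same function works for articles rather than journals, as long as every university uses the same list for the denominator.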

* How accurately can a subscription journal measure the number of professional researchers in the relevant fields who don't have paid access to its contents? 

Can researchers, universities, libraries, and governments make their own measurements for a given journal, or verify a journal's measurement, without access to its list of subscribers?  How accurately, and how easily, can we reconstruct a journal's list of subscribers from OPACs? 

* What is the average number of peer-reviewed scholarly journals to which *public libraries* subscribe?  Can we break this down by field of journal and nationality of library? 

* Why does the OA impact advantage (even if it's mere correlation without causation) differ by field?  What are the key variables? 

* How much economic value is produced by OA to research literature and data?  That is, if the basic peer-reviewed literature and all associated data were OA, then what kinds of economic activity would that trigger and what is its total net value? 

This is the overarching question of the EASI-OA (Economic and Social Impacts of Open Access) research project.  There are many sub-questions awaiting exploration.

(2) Journal business models

* What is the breakdown of green journals by field?  Why do the fields differ in this respect?  What are the key variables?

* Identify the publishers who do not yet allow author-initiated self-archiving (with no fees and no delays).  Sort them from largest to smallest (by numbers of journals or articles published), and break them down by field.  If we redo the numbers every year, would we find that they are rising or falling?

If we identified the largest holdouts, how effectively would that information alone change behavior of researchers (as authors, referees, and editors)? 

* What percentage of peer-reviewed, free online journals go beyond removing price barriers to the removal of at least some permission barriers?  Of those removing permission barriers, how many use a CC-BY license (or equivalent), a CC-BY-NC license (or equivalent), and so on?

* Only a minority of OA journals charge author-side publication fees.  What are the other OA journal business models?  This may require a lot of emails and phone calls, since many journals don't give business details on their web sites.  The first phase of this research is simply to document the range of models actually in use.  The second phase is to study which models work best, and worst, in which niches.

* For journals that charge publication fees, what is the range of fees and the average?  Can we break this down by field and country?  Can we break it down by what the author (and readers) get for the money?  For example, some publish the articles under open licenses and some take the fee and leave users with the limitations and uncertainties of fair use.

* For journals that charge publication fees, how many waive the fee in cases of economic hardship?  What tests do they use, if any, to decide whether to give a waiver? 

* Can conventional subscription-based journals survive if they provide OA after an embargo period?  If so, what is the shortest embargo period compatible with their survival?  This will probably differ by field.

Since the rise of green OA hasn't yet triggered cancellations even in the field with the longest history and highest levels of self-archiving (physics), it may be too soon to make these measurements.  But how well can we estimate them?  How far can we base the estimates on actual renewal and cancellation decisions rather than on abstract preferences, current predictions, or hypothetical decisions?

One difficult variable is the effect of rising levels of green OA on subscriptions.  On that one, we may just have to wait.  But one variable we may be able to measure today is the rate at which usage and citations of articles (in a given field or from a given journal) decline after the date of publication.

* If not all disciplines will be like physics (in which high-level OA archiving coexists with TA journals, and publishers cannot identify any cancellations attributable to OA archiving), then which disciplines will and will not be like physics?  What are the key variables?  How can we know?

Why have subscription-based journals survived in physics, with no cancellations attributable to OA archiving?  Is this temporary or a sign of sustainable compatibility?  Will the advent of SCOAP3 change their business models before we have a chance to answer this question?

* If the rise of OA archiving starts to harm TA journals, will the journals tend to change their archiving policies (retreating from green), convert to gold OA, fold up, something else?  Can we estimate how many journals would take each of these options?  Can we break down the estimates by field and country?  Can we identify the key variables in their decisions?  Can we do better than merely asking editors and publishers for predictions or hypothetical decisions?

* How much does good journal management software reduce the cost of facilitating peer review and running a journal? 

* What percentage of journals pay editors or referees?  What are the percentages by discipline and nation?  When journals pay editors or referees, what are the average salaries, stipends, or fees?

* There is a very wide range of claims about the cost of publishing journals (per page or per article) and a very wide range of claims about the prices charged for journals (per page or per article).  What are the claims (who said what, when, and with respect to what journal or kind of journal)?  Can we explain the differences among them and produce estimates (for certain kinds of journal) that all stakeholders could accept?

* How many journals automatically deposit all their articles (or, all their OA articles) in a separate and independent OA repository?   How many have their own OA repositories? 

* What kinds of subsidies do TA journals get from public funds?  Can we quantify these subsidies?   How many countries pay these subsidies?  How many publishers and journals benefit from them? 

How large are the subsidies relative to publication costs?  That is, how much do journals depend on these subsidies?

How large are the subsidies relative to the same country's support for green OA?  For example, the NIH will spend about $2-4 million/year to implement its OA mandate, but spends about 10 times that amount ($30 million) in page charges and other subsidies for TA journals.

* In the world of newspapers, OA allows the publisher to raise advertising rates and revenue (something realized in 2007 by the New York Times and Wall Street Journal).   To what extent is the same true for scholarly journals?  If there is an analogous increase for scholarly journals, how large is it?  Does it vary by field?

* How often do authors ask to retain more rights than a journal's standard contract allows?  How often do journals accede to these author requests?  Can we classify these attempts and successes by field, country, and terms requested by authors?

How many journals that don't give blanket permission in advance for author-initiated self-archiving routinely give permission on request?

How many journals have accepted the terms of a given author addendum?  Can we compare the track records of different author addenda, and break down the results by field and country?

* Some journals report increased submissions after shortening their embargo period or converting to OA.  How general is this phenomenon?  Is it more likely to arise in some fields than in others?  How large are the increases and how can a given TA journal predict the result in its own case? 

* In my article on flipping a business model (SOAN for October 2007), I said:  "It's easy for a journal to measure the extent of the match between its reading and writing institutions.  Simply calculate the percentage of authors [published in the journal] who are affiliated with subscribing institutions.  Even journals that are quite sure they would never flip their business model should do the calculation.  The door may be open [to flipping the business model to OA], and that's a fact worth knowing." 

Can this calculation be done for a given journal by an outside researcher without access to the journal's list of subscribers, for example, by reconstructing an approximate subscriber list through OPACs?
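The calculation quoted above is simple once the two lists exist.  Here's a rough sketch, assuming one affiliation per published author and a subscriber list (whether the journal's own or one reconstructed from OPACs).  The function name and the institution names are hypothetical, and real affiliation strings would need much more careful matching than the case-insensitive comparison used here.

```python
def author_subscriber_match(author_affiliations, subscribers):
    """Fraction of authors whose institution subscribes to the journal."""
    if not author_affiliations:
        return 0.0
    # Naive normalization; an assumption made only to keep the sketch short.
    subs = {s.strip().lower() for s in subscribers}
    matched = sum(1 for a in author_affiliations if a.strip().lower() in subs)
    return matched / len(author_affiliations)


# Hypothetical data: one entry per published author.
affiliations = ["Univ A", "Univ B", "univ a", "Univ C"]
subscribers = ["Univ A", "Univ C"]

print(f"{author_subscriber_match(affiliations, subscribers):.0%} of authors "
      "are at subscribing institutions")
```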

* How many TA journals have OA backruns?  Can we break this down by field and country?

How many TA journals without an OA backrun would agree to make their backrun OA if only someone (like Google or the OCA) would pay to digitize it?

For TA journals with OA backruns, what is the average embargo or "moving wall" after which older issues become OA?  Can we break down the average durations by field and country?

(3) Book business models

* For a dual edition book (with OA and non-OA editions), how can we measure the sales the non-OA edition would have had in the absence of the OA edition?  If we think the OA edition increased (or decreased) net sales, how can we measure that increase (or decrease)?

The effect of an OA edition on sales will probably vary by type of book (e.g. monograph v. encyclopedia), but exactly how? 

If the OA edition comes out after the non-OA edition, how does the delay affect the impact on sales?  If the OA edition comes out simultaneously with the non-OA edition, but is discontinued after a time, how does that affect the impact on sales?

* How will improvements in ebook readers affect the economics of dual-edition books?  If OA editions currently increase the sales of print editions, how much of that effect is due to the fact that few people want to read a whole book on a screen?

* Does Google's opt-out Library Project increase the sales of scanned copyrighted books (as Google expects) or decrease their sales (as the publisher and author groups suing Google fear)? 

(4) Software

* Compare the free and open-source (FOSS) packages for creating and maintaining OA repositories.  Which features are present and absent in each?  Which are best suited for different kinds of users?  Perhaps include some non-FOSS packages.  

This job was done by Raym Crow in August 2004, and by several others since then for subsets of the available packages.  But it needs to be done again for the full range and latest versions of FOSS packages.

I was going to add the same question here about journal management software.  But just last month Johns Hopkins released its survey and evaluation of the packages, and even put it on a wiki for community updating.

(5) Researcher attitudes and practices

* How many authors are already allowed to self-archive (by the journal in which they published) and have not self-archived?  How many papers are already covered by these permissions but not yet self-archived?  Can we break down the numbers by field, country, and year?

* How many more faculty would routinely self-archive *if they knew* that it was lawful?  If they knew that it took an average of 6-10 minutes/paper?  If they knew that self-archiving increased citations 40-250% (on average, in different fields)? 

How much of the failure to self-archive is due to ignorance and misunderstanding?

* Do junior faculty deposit their work in archives, or submit it to OA journals, more often than senior faculty, perhaps because they grew up with the internet and more readily see the benefits of OA?  Or do senior faculty do so more often than junior faculty, perhaps because they already have tenure and can afford to disregard the criteria of conservative promotion and tenure committees? 

How will self-archiving rates change as today's junior faculty become tenured?  What about rates of submission to OA journals?

* Let's say that 20% of researchers publish 80% of the peer-reviewed articles.  (First, get the actual percentage of researchers who publish 80% of the articles.  But here assume it's 20%.)  Now study that 20%.  What do *those researchers* know about OA journals and OA archiving?  How often do they submit their work to OA journals, and how often do they deposit copies of their postprints in OA repositories?  If these researchers are clustered in certain fields, countries, or institutions, where are the largest clusters?
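The first step above (finding the actual percentage of researchers who publish 80% of the articles) can be sketched as follows, assuming we have a count of articles per researcher for some field and period.  The function name and the sample counts are hypothetical, and the sketch ignores co-authorship, which a real study would have to handle.

```python
def share_producing(article_counts, target_percent=80):
    """Smallest fraction of researchers who together publish at least
    target_percent of all articles (most prolific researchers first)."""
    counts = sorted(article_counts, reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    running = 0
    for i, c in enumerate(counts, start=1):
        running += c
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= target_percent * total:
            return i / len(counts)
    return 1.0


# Hypothetical data: articles per researcher, ten researchers in all.
counts = [40, 40, 5, 5, 2, 2, 2, 2, 1, 1]
print(f"{share_producing(counts):.0%} of researchers publish 80% of the articles")
```

With real data, the researchers in that top group are the ones to survey.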

* When researchers learn about a TA article of interest to them, how often do they look online for an OA copy?  When they do so, where do they look?

* What are the "best practices" for universities to increase the rate of self-archiving?  What do they cost?  How much do they boost the rate of deposits?

If some practices work best in some niches or circumstances, then in which niches or circumstances?

* When faculty have a choice between equally prestigious and equally suitable OA and TA journals, will they submit their work to the OA journal?  When the answer is no, what other variables come into play?

* When a new journal is excellent but little-known, what steps will tend to earn it prestige in proportion to its quality? 

(6) Universities

* Arthur Sale has collected evidence that university-level OA mandates succeed in driving up the rate of repository deposits toward 100%.

As more universities adopt mandates, and as existing mandates have time to work, there are opportunities to build on Sale's data with data from other universities. What does the newer and larger picture reveal?  If some mandates work better than others, what are the key variables? 

If more than one kind of policy effectively drives the deposit rate toward 100%, then what is the range?  This will matter if a given institution would find it politically easier to adopt one kind of policy than another.

* What lessons can be learned from institutions which have adopted OA mandates about how to formulate the policy, propose it, educate faculty and administrators about the issues, and build consensus for its adoption?

* If universities require faculty to retain key rights when publishing journal articles, with no opt-outs, would that (1) help authors in their negotiations with journals, or (2) hurt authors by causing journals to reject their papers?

If the answer is sometimes the former and sometimes the latter, then what are the key variables (for example, the size and prestige of the university, the number of universities with similar policies, the prestige of the journal)?  What kinds of policies are more likely to help authors?

* When universities require faculty to retain key rights when publishing journal articles, but allow opt-outs (like the Harvard policy of February 2008), then how often do faculty request opt-outs?  How often do journals demand that authors request opt-outs, and how often do authors refuse these demands?  How many journals would actually reject a paper (as opposed to threatening to reject a paper) from an author who refused to seek an opt-out?  Would this rate decline as more universities adopted similar policies?

* How much time does it take for a university to create and maintain an OAI-compliant OA repository?  How much does it cost the university, in hardware, software, and human resources?  If it depends on how much the university wants to do with the repository, and how much to educate users, then can we break down the costs for each layer of use and service?

* When universities recommend or encourage use of an author addendum, how many faculty actually try it?  How do universities stand behind the authors whose addendum was rejected?  (What practices are actually in use?)  How effective are these university actions in getting journals to accept rather than reject an addendum?

* How many universities mandate OA for electronic theses and dissertations?  Can we break this down by university size and country? 

Can we put together some "best practices" for policy exemptions and embargoes?

* For a given university and its actual serials budget and research output:  If we assume that all journals convert to OA and charge a given publication fee per paper (say, $2000), then will the total budget needed to cover faculty publication fees be larger or smaller than the budget now devoted to TA journals?  This calculation has been done before, at least three times, and in each case the authors assumed that all OA journals would charge publication fees and that all fees would be paid by universities.  The purpose of redoing the calculation is to make the assumptions more realistic.  First do the baseline calculation, above.  Then:  how does the number change when 10% (then 20%, then 30% etc.) of OA journals charge no publication fee at all?  How does the number change when 10% (then 20%, then 30% etc.) of those fees are paid by funding agencies rather than universities? 

For more detail on the previous calculations and my recommended refinements, see my article from SOAN for June 2006.
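The baseline calculation and the two refinements can be sketched in a few lines.  This is only an illustration under stated assumptions: the number of papers and the serials budget are made-up figures for a hypothetical university, the function name is mine, and I assume that the share of papers published in no-fee journals equals the share of journals charging no fee, which may not hold in practice.

```python
def university_fee_bill(papers, fee, no_fee_share=0.0, funder_paid_share=0.0):
    """Annual publication fees falling on the university.

    no_fee_share: fraction of papers appearing in OA journals with no fee
                  (assumed equal to the fraction of no-fee journals).
    funder_paid_share: fraction of the remaining fees paid by funders.
    """
    fee_charging_papers = papers * (1 - no_fee_share)
    total_fees = fee_charging_papers * fee
    return total_fees * (1 - funder_paid_share)


# Hypothetical inputs; none of these figures come from a real university.
PAPERS_PER_YEAR = 3000
FEE = 2000                   # the "say, $2000" from the text
SERIALS_BUDGET = 7_000_000   # made-up current budget for TA journals

for share in (0.0, 0.1, 0.2, 0.3):
    bill = university_fee_bill(PAPERS_PER_YEAR, FEE,
                               no_fee_share=share, funder_paid_share=share)
    verdict = "smaller" if bill < SERIALS_BUDGET else "larger"
    print(f"no-fee {share:.0%}, funder-paid {share:.0%}: "
          f"${bill:,.0f} ({verdict} than the serials budget)")
```

Varying the two shares independently, rather than together as in the loop, would show which assumption matters more for a given university.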

* What percentage of university libraries have negotiated full access privileges for walk-in patrons?  Is the trajectory up or down?

* How successful have libraries been in negotiating licensing terms that benefit authors (e.g. providing permission for self-archiving), not just terms that benefit readers?  What tactics are likely to increase that success rate? 

I know a handful of universities trying to negotiate better terms for their authors (not just their readers), but most of them are not yet ready to discuss their experience in public.  How many universities are already trying these negotiations?  How would their prospects improve if more universities joined the effort?

(7) Funding agencies

* What is the amount of the average research grant?  How does that compare with the average cost of publishing a peer-reviewed journal article?  (This is to compare the value added by funders with the value added by publishers.)

* We know roughly how much it costs the NIH to implement its OA policy ($2-4 million/year).  How much would it cost to expand the policy to cover all federally funded research?   Can we do the same calculation for other countries?

* If NIH agreed to pay processing fees charged by OA journals for all the journal articles produced by its research grants, assuming (say) $2000 per paper, then how much would that be?  How much would it be if it imposed a cap of 3 (then 5, 7, 9 etc.) on the number of fees paid per research project?  For a given cap (say, 3 fees per research project), what percentage of the average grant budget would be devoted to fees rather than new research?
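Here's a rough sketch of the capped-fee arithmetic, assuming we know how many papers each research project produces.  The project counts, the average grant size, and the function names are hypothetical placeholders, not NIH data.  The second function gives only an upper bound, reached when a project uses its full cap.

```python
def total_fees(papers_per_project, fee=2000, cap=None):
    """Total fees paid, with at most `cap` fees per project (None = no cap)."""
    total = 0
    for n in papers_per_project:
        paid = n if cap is None else min(n, cap)
        total += paid * fee
    return total


def max_fee_share_of_grant(cap, fee, average_grant):
    """Upper bound on the fraction of an average grant spent on fees."""
    return cap * fee / average_grant


# Hypothetical data: papers produced by three research projects.
projects = [1, 4, 10]
AVERAGE_GRANT = 300_000  # made-up figure

print("no cap:", total_fees(projects))
for cap in (3, 5, 7, 9):
    print(f"cap {cap}: ${total_fees(projects, cap=cap):,} total, "
          f"at most {max_fee_share_of_grant(cap, 2000, AVERAGE_GRANT):.1%} "
          "of an average grant")
```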

* What percentage of non-classified research in a given field is funded by government funding agencies?  By private funding agencies?  By universities?  By other sources?  What percentage is not funded at all?

Start with a given country and go through all the fields of the natural sciences, social sciences, and humanities.  Then do the same for some other countries.

What dollar amounts are associated with these percentage figures? 

* Take any set of important papers, for example, the top 10% by citation impact in 10 different fields.  What percentage of them are based on publicly-funded research?  That is, what percentage of them would be OA if the relevant government had adopted an OA mandate for publicly-funded research?  What percentage are OA today?

Do this exercise for many different sets of important papers.  For example, all the papers by a given year's Nobel prize winners; the consensus set of the 100 papers most important for understanding HIV/AIDS or climate change; etc.