Some limitations of DOAJ metadata for research purposes

by: Xuan Zhao, Luan Borges, & Heather Morrison

Abstract

The Directory of Open Access Journals (http://doaj.org) is an excellent service that fulfills many important functions, in particular facilitating access to a vetted collection of over 15,000 freely available peer-reviewed journals. The DOAJ search services and metadata download are very useful for researchers as well. The purpose of this post is to alert researchers to some limitations of the DOAJ metadata that they need to take into account to avoid drawing erroneous conclusions. First, when downloading DOAJ metadata, it is necessary to open the .csv file with a Unicode encoding in order to retain non-English characters. We open the file in Open Office for this reason, then save it as an Excel file. Second, the nature of the metadata means that some data is inserted in the wrong column; clean-up, as discussed below, is necessary before data analysis. Third, when journal editors or others working on their behalf enter metadata into DOAJ, research is not the primary purpose of the exercise; for this reason, in-depth assessment and corrections may be necessary before analysis. Below, we present publisher size analysis as an example of what researchers may encounter. Finally, because the main purpose of DOAJ is connecting readers with content, the metadata of interest to a particular research project may not be up to date. As demonstrated below, as of Jan. 5, 2021, only 30% of DOAJ journals have a “last update” date within the previous year (2020). We do not know whether the “last update” date reflects a full or partial metadata review. We illustrate the potential impact on research results with the example of the SKC longitudinal APC study. Of the 4,292 DOAJ journals that responded “yes” to the APC question, only 30% have a last update date of 2020 or 2021. Even for this 30% of journals, we have no way of knowing whether the APC status and/or amount itself was updated, or only other unrelated metadata.
This means that if we compare 2019 prices obtained from publisher websites in 2019 with 2021 DOAJ APC metadata, we will almost certainly get incorrect results, for example by falsely assuming that matching APC amounts mean no change in prices. DOAJ provides rich and useful metadata for the researcher, and the research question “is this journal listed in DOAJ?” is of value in and of itself. For this reason, we intend to continue using DOAJ metadata in addition to data derived from other sources, particularly data derived directly from publisher websites. See below for a link to an open data version of the DOAJ metadata reflecting the corrections explained in this post.

Details

Correcting for displaced observations

As previously mentioned, the first step to confidently use the DOAJ metadata for analysis and research is identifying and correcting data inserted in the wrong column, herein also called displaced observations. 

Below we can see an example of a displaced observation from the DOAJ metadata. Column BB has no variable name in its header, yet it contains some observations, apparently displaced one column to the right.

Table 1 – An example of misplaced data from 2021 DOAJ metadata

Users may follow different steps to correct for displaced data. Here we explain in more detail how we have identified these displacements and corrected them.  

Before proceeding with any analysis, it is important to become familiar with the DOAJ metadata. We recommend that users read the DOAJ Guide to applying, available online, because the metadata reflects responses to questions asked in the application process. The DOAJ metadata, as of 5 Jan. 2021, has 53 variables ranging from Journal Title to Country to Most recent article added. It may be helpful to start correcting observations in variables with easily identifiable responses, such as « Country » or « Country of Publisher », or in variables that allow only two types of answers (i.e., Yes or No), such as Author holds copyright without restrictions and APC. We recommend creating a pivot table to identify displaced observations and repeating this process until no observations remain in a wrong column.
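The pivot-table check described above can also be approximated in a short script. The following is only a minimal sketch under stated assumptions: the three-row sample and its column names are hypothetical stand-ins for the 53-column DOAJ export, and the helper functions value_counts and suspicious_rows are illustrative names, not part of any DOAJ tooling. It counts the distinct values in a Yes/No column and lists the rows whose value is neither, which are candidates for displaced observations.

```python
import csv
import io
from collections import Counter

# Hypothetical three-row sample; a real DOAJ export has 53 columns.
SAMPLE_CSV = """Journal title,Country of publisher,APC
Revista A,Brazil,Yes
Journal B,Canada,No
Journal C,Indonesia,https://example.org/fees
"""


def value_counts(rows, column):
    """Count each distinct value in one column, like a one-field pivot table."""
    return Counter(row[column].strip() for row in rows)


def suspicious_rows(rows, column, allowed):
    """Return (spreadsheet row number, value) pairs whose value is not allowed."""
    return [
        (number, row[column])
        for number, row in enumerate(rows, start=2)  # row 1 is the header
        if row[column].strip() not in allowed
    ]


# For a real file, open(path, newline="", encoding="utf-8") keeps
# non-English characters intact; the path is left to the reader.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
counts = value_counts(rows, "APC")
flagged = suspicious_rows(rows, "APC", {"Yes", "No"})

print(counts)
print(flagged)
```

Running the same check on each easily recognizable variable in turn, and again after every correction, mirrors the repeated pivot-table passes described above.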

When cleaning up the DOAJ metadata, users will notice that in some cases only one observation was displaced; in other cases, an entire row was displaced beginning at a specific variable. In the example highlighted in yellow below, all observations beginning at the variable Publisher were displaced one column to the right.

Table 2 – Line 36 illustrates an example of an entire row with displaced observations
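A whole-row displacement like this one can be corrected by pulling every value back one column. The sketch below is illustrative only: the shortened column list, the sample row, and the function name shift_row_left are hypothetical, and it assumes that the cell at the starting variable is empty and that every later value sits exactly one column too far to the right. Rows that do not fit this pattern should be inspected by hand.

```python
def shift_row_left(row, columns, start_column):
    """Return a copy of row with values pulled one column to the left.

    Assumes the cell at start_column is empty and that every later value
    sits one column too far to the right.
    """
    start = columns.index(start_column)
    if row[start_column].strip():
        raise ValueError("start column is not empty; inspect this row by hand")
    fixed = dict(row)
    # Each column from start_column onward takes the value of its right neighbour.
    for target, source in zip(columns[start:], columns[start + 1:]):
        fixed[target] = row[source]
    fixed[columns[-1]] = ""  # the overflow column is now empty
    return fixed


# Hypothetical, shortened column list; "Unnamed" stands in for the
# header-less overflow column (column BB in the spreadsheet).
COLUMNS = ["Journal title", "Publisher", "Country of publisher", "APC", "Unnamed"]
displaced = {
    "Journal title": "Journal C",
    "Publisher": "",
    "Country of publisher": "Example Press",
    "APC": "Indonesia",
    "Unnamed": "No",
}

repaired = shift_row_left(displaced, COLUMNS, "Publisher")
print(repaired)
```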

Data entry inconsistencies

When correcting for displaced observations, we also identified some inconsistencies in the way observations are registered in the DOAJ metadata. The table below lists the main visible inconsistencies found for some variables. In the majority of instances, the inconsistencies will not affect DOAJ users looking up information for a particular journal. However, it is important to take these inconsistencies into account before proceeding to any automated statistical analysis. For example, DOAJ metadata as is can be used to identify the number of journals with persistent article identifiers, but automated counting of DOI vs. ARK or other approaches would require some advance data manipulation.

| Variable | Example |
| --- | --- |
| Alternative title | Some journals' alternative titles are registered as a number. Some examples are “2300-6633” and “0”. |
| Keywords | Some observations contain stray numbering, leading whitespace or bullet characters, as follows: “6.         rheology, tribology, hydrodynamics, thermodynamics, mechanics of structures, mechatronics.”; “           water cycles, water environment, water treatment and reuse, water resource, water quality, hydrology”; “ •          natural sciences, •      environmental sciences, •      social sciences, agricultural sciences, veterinary medicine, medical sciences” |
| Copyright information URL | Some URLs are missing a letter at the beginning or the end. In this example there should be an « h » at the beginning and an « l » at the end of the link: ttp://www.emeraldgrouppublishing.com/services/publishing/jiuc/authors.htm |
| Plagiarism information URL | Some URLs are missing a letter at the beginning or the end. In this example there should be an « h » at the beginning and an « l » at the end of the link: ttp://www.emeraldgrouppublishing.com/services/publishing/jiuc/authors.htm |
| URL for journal’s instructions for authors | Some URLs are missing a letter at the beginning or the end. In this example there should be an « h » at the beginning of the URL: ttps://revistas.unasp.edu.br/LifestyleJournal/about/submissions |
| Other submission fees information URL | Some URLs have extra letters. This example has an extra letter « i » at the beginning of the URL: ihttps://journals.univie.ac.at/index.php/voebm/m/index. Other URLs are missing a letter at the beginning or the end. In these examples there should be an « h » at the beginning of the URL: ttp://psr.ui.ac.id/index.php/journal/about/submissions#authorGuidelines and ttps://www.karger.com/Journal/Guidelines/261897#sec62 |
| Preservation Services | Preservation services can be registered as a name or a website. |
| Preservation Service: national library | The national library preservation service can be registered as a name or a website. |
| Preservation information URL | Some URLs are missing letters at the beginning or the end. In these examples the first URL is missing « ht » and the second is missing « h » at the beginning: tps://periodicos.uff.br/revistagenero/about/editorialPolicies#focusAndScope and ttp://ejournal.stkip-pgri-sumbar.ac.id/index.php/economica |
| Deposit policy directory | The deposit policy directory can be registered as a name or a website. |
| Persistent article identifiers | Persistent article identifiers can be registered as an acronym (UDC, DOI, ARK), but also as a website or other string, such as dc.identifier.uri (DSpaceUnipr) or NBN http://www.depositolegale.it/national-bibliography-number/. Another example is the pair UDC and UDC (Universal decimal Classification), which are equivalent but were registered differently. |
| URL for journal’s Open Access statement | Some URLs are missing a letter at the beginning or the end, or have an extra letter at the beginning. This example has an extra letter « h » at the beginning of the URL: hhttp://www.revistas.usp.br/gestaodeprojetos/about |
Table 3 – Visible inconsistencies identified in the DOAJ metadata
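The URL problems listed in Table 3 are mostly damaged schemes (a missing « h », an extra « i » or « h »), which can be flagged automatically before any analysis. The sketch below is a heuristic under stated assumptions, not a validated repair method: the function names is_well_formed and suggest_scheme_fix are hypothetical, and the guess at the intended scheme relies only on whether the damaged prefix ends in « s ». It does not detect truncated endings such as « .htm » for « .html », and every suggestion should be checked by hand.

```python
import re

# A value counts as well formed here if it starts with http:// or https://.
WELL_FORMED = re.compile(r"^https?://\S+$", re.IGNORECASE)


def is_well_formed(url):
    """True if the value starts with a complete http or https scheme."""
    return bool(WELL_FORMED.match(url.strip()))


def suggest_scheme_fix(url):
    """Guess the intended scheme from a damaged prefix; None if no '://' is present."""
    scheme, separator, rest = url.strip().partition("://")
    if not separator or not rest:
        return None
    # Heuristic: a damaged prefix ending in "s" was probably "https".
    guess = "https" if scheme.lower().endswith("s") else "http"
    return f"{guess}://{rest}"


# Examples taken from Table 3.
examples = [
    "ttp://psr.ui.ac.id/index.php/journal/about/submissions#authorGuidelines",
    "ihttps://journals.univie.ac.at/index.php/voebm/m/index",
    "hhttp://www.revistas.usp.br/gestaodeprojetos/about",
    "tps://periodicos.uff.br/revistagenero/about/editorialPolicies#focusAndScope",
]

for url in examples:
    if not is_well_formed(url):
        print(url, "->", suggest_scheme_fix(url))
```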

Publisher’s names duplicates investigation and clean-up

The purpose of this project is to prepare a rough picture of publisher size for comparison with the findings of Solomon & Björk (2012). To support the publisher size analysis, we specifically investigated publisher duplicates and corrected most of the obvious errors, such as small differences in punctuation and/or characters, extra spaces at the beginning and/or the end, and minor differences in how the name of the same publisher was entered (please see the examples in Table 4 – Investigative Strategies – Publisher Names Duplicates).

The clean-up process was divided into three stages. Firstly, we created a pivot table for the publisher column to identify entries that differed slightly and were therefore not grouped together. Secondly, when potential duplicates were found, we conducted an investigation to confirm the duplicates and/or to decide which name to keep (in priority order: use the name with the most journal entries; correct a name with an obvious typo; use the first name listed). Please see the investigative strategies below:

Table 4 – Investigative Strategies – Publisher Names Duplicates

Thirdly, after identifying inconsistencies in publisher names, we created a table (please see Table 5 – Corrections Gathering – Publisher Names Duplicates) to record all the corrections to the variable Publisher. About 500 inconsistencies were corrected. As a result, the number of publishers in the pivot table decreased from 7,218 entries (data source: pivot table based on DOAJ metadata) to 6,804 entries (data source: pivot table based on the cleaned-up version of the database).

Table 5 – Corrections Gathering – Publisher Names Duplicates
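The obvious variations (punctuation, extra spaces, capitalization) can be grouped automatically before the manual investigation. The following is a minimal sketch rather than the procedure we followed in the spreadsheet: the publisher names are hypothetical, match_key and canonical_names are illustrative function names, and the script applies only two of the three priority rules (the name with the most journal entries, then the first name listed). Correcting obvious typos still requires human judgment.

```python
import re
import unicodedata
from collections import Counter, defaultdict


def match_key(name):
    """Reduce a publisher name to a comparison key: case, punctuation and spacing removed."""
    text = unicodedata.normalize("NFKC", name).casefold()
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation becomes a space
    return " ".join(text.split())  # collapse runs of whitespace


def canonical_names(publishers):
    """Map every raw name to the variant to keep within its group."""
    counts = Counter(publishers)  # iterates in first-seen order
    groups = defaultdict(list)
    for name in counts:
        groups[match_key(name)].append(name)
    mapping = {}
    for variants in groups.values():
        # max() returns the first of several equally frequent variants,
        # which implements "use the first name listed" as the tie-breaker.
        keep = max(variants, key=lambda variant: counts[variant])
        for variant in variants:
            mapping[variant] = keep
    return mapping


# Hypothetical publisher column with the kinds of variation described above.
publisher_column = [
    "Example University Press",
    "Example University Press ",
    "Example University Press.",
    "Example University Press",
    "Another Press",
]

mapping = canonical_names(publisher_column)
cleaned = [mapping[name] for name in publisher_column]
print(len(set(publisher_column)), "->", len(set(cleaned)))
```

Keeping the mapping as a separate table, rather than overwriting the original values, makes each correction reviewable, in the same spirit as Table 5.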

As illustrated in the two tables above, there were different types of data inconsistencies. In order to respect the metadata to the greatest extent possible, we acted prudently when making decisions. In some cases of minor variation, we followed the URLs to check the publisher websites and to collect convincing evidence. However, we encountered some complex challenges.

One of the challenges was language. Because of the large number and wide range of publishers (124 countries, 80 languages, DOAJ, 7 Feb. 2021) [https://doaj.org/], we were unable to identify all of the sources of information. In addition, when URLs were invalid or the information did not match, it was difficult to reach a precise answer. Moreover, among the 7,218 publisher name entries, some potential duplicates were not grouped together because they begin with different words. For example, “Editora da Universidade Estadual de Maringá (Eduem)” vs. “Eduem – Editora da Universidade Estadual de Maringá” and “Academica Brâncuşi” vs. “Editura Academica Brâncuşi”. Such entries were usually far apart in the sorted list and hard to detect. More details can be found in Table 6 below:

Different beginning words (examples):

“Academica Brâncuşi” vs. “Editura Academica Brâncuşi”;
“Alexandru Ioan Cuza University of Iaşi” vs. “Editura Universităţii ‘Alexandru Ioan Cuza’ Iaşi”;
“Editora da Universidade Estadual de Maringá (Eduem)” vs. “Eduem – Editora da Universidade Estadual de Maringá”
Table 6 – (1)
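Names that begin with different words end up far apart in an alphabetically sorted list, but they often share most of their words. One possible way to surface such pairs, which we did not use in this project, is to compare names as sets of words. The sketch below uses Jaccard similarity (shared words divided by all distinct words) with an arbitrary threshold of 0.5; the threshold and the function names are assumptions, and every pair it reports is only a candidate that still needs manual confirmation.

```python
import re
import unicodedata
from itertools import combinations


def tokens(name):
    """Split a publisher name into a set of lower-cased words without punctuation."""
    text = unicodedata.normalize("NFKC", name).casefold()
    return set(re.sub(r"[^\w\s]", " ", text).split())


def token_similarity(first, second):
    """Jaccard similarity: shared words divided by all distinct words."""
    a, b = tokens(first), tokens(second)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def candidate_pairs(names, threshold=0.5):
    """Return pairs of names whose word overlap reaches the threshold."""
    return [
        (first, second)
        for first, second in combinations(names, 2)
        if token_similarity(first, second) >= threshold
    ]


# The examples listed in Table 6 (1).
names = [
    "Academica Brâncuşi",
    "Editura Academica Brâncuşi",
    "Alexandru Ioan Cuza University of Iaşi",
    "Editura Universităţii ‘Alexandru Ioan Cuza’ Iaşi",
    "Editora da Universidade Estadual de Maringá (Eduem)",
    "Eduem – Editora da Universidade Estadual de Maringá",
]

pairs = candidate_pairs(names)
for first, second in pairs:
    print(first, "<->", second)
```

Comparing every pair is quadratic in the number of names, so for several thousand publishers this is slow but still feasible; a lower threshold finds more candidates at the cost of more false matches.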

Unmatched publisher names (examples):

| Original publisher names | Possible correct names | URLs |
| --- | --- | --- |
| Canadian Society for the Study of Education. | The Canadian Association for Curriculum Studies | https://jcacs.journals.yorku.ca/index.php/jcacs/index |
| Badan Penelitian dan Pengembangan Kesehatan | URL directs to a new web link: https://ejournal2.litbang.kemkes.go.id/index.php/jki/index whose publisher name is: Pusat Penelitian dan Pengembangan Biomedis dan Teknologi Dasar Kesehatan | http://ejournal.litbang.kemkes.go.id/index.php/jki |
| Shaheed Beheshti University of Medical Sciences and Health Services | Kowsarmedical | http://journals.sbmu.ac.ir/jme |
Table 6 – (2)

Invalid URLs (examples):

| Original publisher names | Original URLs (invalid) |
| --- | --- |
| Alborz University of Medical Sciences (the URLs wrongly direct to a website whose contents are meaningless; when we searched for the journal title, we were directed to this website: https://enterpathog.abzums.ac.ir/) | http://enterpathog.com/?page=home ; https://jehe.abzums.ac.ir/index.php?slc_lang=en&sid=1 |
| Instituto Nacional de Salud (INS) | http://revistas.ins.gov.py/index.php/rspp/ |
| Instituto Superior de Ciências de Educação do Huambo | http://revista.isced-hbo.ed.ao/rop/index.php/ROP/index |
Table 6 – (3)

Given the barriers and challenges mentioned above, we can draw a conclusion about the limitations of the publisher name clean-up project. Precision is not possible in this project because the question “who is the publisher” is complex. Instead of making any definitive claims about publisher size, we are primarily interested in whether the long tail effect (a few big publishers, a few more middle-sized, most very small) reported by Solomon & Björk (2012) can still be observed in DOAJ in 2021.

DOAJ metadata update analysis

The following analysis was conducted to determine whether DOAJ metadata on article processing charges (APCs), both charging status and amount, would be sufficient for SKC’s longitudinal study on APC trends over time. The answer is clearly no. The metadata for the vast majority of journals in DOAJ (overall and APC-charging) has not been updated for more than a year, and it is unknown whether the most recent update would have included an update to APC or other metadata. We will continue to use DOAJ metadata, as it is rich and the question “is this journal listed in DOAJ” is of value in and of itself. For price comparisons, however, we cannot rely on this data, as doing so would likely result in erroneous conclusions.

DOAJ journals by year of last update.

This chart illustrates the percentage of DOAJ journals by year of last update. Detailed figures are in the table below. Note that just under half of the journals were last updated 2 or more years ago (2018 or earlier).

DOAJ last update as of Jan. 5, 2021
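| Year | # journals last updated | % journals last updated |
| --- | --- | --- |
| 2015 | 294 | 2% |
| 2016 | 1,469 | 9% |
| 2017 | 2,864 | 18% |
| 2018 | 2,951 | 19% |
| 2019 | 3,412 | 22% |
| 2020 | 4,662 | 30% |
| 2021 | 39 | 0% |
| Total | 15,691 | 100% |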
Table 7
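The percentages in Table 7, and the "just under half" figure, follow directly from the counts. As a quick arithmetic check, the short sketch below recomputes them from the numbers in the table; the variable names are ours and purely illustrative.

```python
# Counts of journals by year of last update, copied from Table 7.
LAST_UPDATE_COUNTS = {
    2015: 294,
    2016: 1469,
    2017: 2864,
    2018: 2951,
    2019: 3412,
    2020: 4662,
    2021: 39,
}

total = sum(LAST_UPDATE_COUNTS.values())

# Share of journals per year, rounded to whole percentages as in the table.
shares = {year: round(100 * count / total) for year, count in LAST_UPDATE_COUNTS.items()}

# Share of journals last updated two or more years ago (2018 or earlier).
stale_share = sum(count for year, count in LAST_UPDATE_COUNTS.items() if year <= 2018) / total

print(total)
print(shares)
print(f"{stale_share:.1%} last updated in 2018 or earlier")
```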

DOAJ APC charging journals by year of last update

The chart above illustrates the percentage of journals that answered “yes” to the DOAJ question about charging APCs by year of last update. The table below provides the detailed figures. Note that only 30% of DOAJ journals that charge APCs were updated in the past year (2020 or 2021). It is also unknown whether in these cases the last update was a thorough review of the metadata, or might have been an update of non-APC data.

DOAJ last update APC journals only Jan. 5, 2021
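| Year of last update | # of journals last updated | % journals last updated |
| --- | --- | --- |
| 2015 | 47 | 1% |
| 2016 | 238 | 6% |
| 2017 | 499 | 12% |
| 2018 | 930 | 22% |
| 2019 | 1,286 | 30% |
| 2020 | 1,276 | 30% |
| 2021 | 16 | 0% |
| Total | 4,292 | 100% |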
Table 8
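A tally like the one in Table 8 can be reproduced from row-level metadata by keeping only the journals that answered “yes” to the APC question and grouping them by the year of their last update. The sketch below rests on several assumptions that should be checked against the actual export: the field names "APC" and "Last updated" and the ISO-style date format are hypothetical, as are the four sample rows and the function names.

```python
from collections import Counter


def apc_update_years(rows, apc_field="APC", date_field="Last updated"):
    """Count APC-charging journals by the year of their last update.

    Assumes dates begin with a four-digit year, e.g. "2020-11-03".
    """
    years = Counter()
    for row in rows:
        if row[apc_field].strip().lower() == "yes":
            years[row[date_field][:4]] += 1
    return years


def share_updated_since(years, first_year):
    """Fraction of counted journals whose last update is in first_year or later."""
    total = sum(years.values())
    recent = sum(count for year, count in years.items() if int(year) >= first_year)
    return recent / total if total else 0.0


# Hypothetical sample rows; field names and date format are assumptions.
sample_rows = [
    {"Journal title": "Journal A", "APC": "Yes", "Last updated": "2020-11-03"},
    {"Journal title": "Journal B", "APC": "Yes", "Last updated": "2018-02-14"},
    {"Journal title": "Journal C", "APC": "No", "Last updated": "2020-06-30"},
    {"Journal title": "Journal D", "APC": "Yes", "Last updated": "2019-09-09"},
]

years = apc_update_years(sample_rows)
recent_share = share_updated_since(years, 2020)
print(dict(years))
print(f"{recent_share:.0%} of APC-charging journals updated in 2020 or later")
```

Even a recent last update date does not show that the APC fields themselves were reviewed, so a tally of this kind gives only an upper bound on how many APC records could be current.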

A version of the Jan. 5, 2021 DOAJ metadata file reflecting the corrections explained above is available as open data here:

Directory of Open Access Journals; Zhao, Xuan; Borges, Luan; Morrison, Heather, 2021, “DOAJ_metadata_2021_01_05_with_SKC_clean_up”, https://doi.org/10.5683/SP2/G5LEXG, Scholars Portal Dataverse, V1

References

The Directory of Open Access Journals (DOAJ) online: https://doaj.org/

Solomon, D. J., & Björk, B. (2012). A study of open access journals using article processing charges. Journal of the American Society for Information Science and Technology, 63(8), 1485–1495. https://doi.org/10.1002/asi.22673

Cite as: Zhao, X., Borges, L., & Morrison, H. (2021). Some limitations of DOAJ metadata for research purposes. Sustaining the Knowledge Commons. https://sustainingknowledgecommons.org/2021/02/10/some-limitations-of-doaj-metadata-for-research-purposes/
