Explore Edoc Server with R and OAI-PMH

Everyone can access the metadata of Humboldt University’s edoc publication server via its OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) interface. With Kurt Hornik’s OAIHarvester package for R, you can download the dataset, clean it, and explore it in only a few lines of R code. The same techniques apply to any repository with an OAI-PMH interface.

Import libraries:

library("OAIHarvester")
library("tidyverse")

Import data:

edoc_url <- "http://edoc.hu-berlin.de/oai/request"
all_records <- oaih_list_records(edoc_url)
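
Harvesting every record can take a while on a large repository. OAI-PMH also supports selective harvesting by datestamp, so a smaller request is possible. The following is only a sketch: `harvest_window` is a hypothetical helper name, and it assumes that `oaih_list_records()` accepts `from` and `until` arguments given as "YYYY-MM-DD" strings. Note that these datestamps refer to when a record was created or changed in the repository, not to the publication date of the item.

```r
# Sketch: harvest only records whose repository datestamp falls in a window.
# `harvest_window` is a hypothetical helper, not part of OAIHarvester.
harvest_window <- function(base_url, from, until) {
  # Validate the dates early so a typo does not become a failed HTTP request
  stopifnot(
    !is.na(as.Date(from, format = "%Y-%m-%d")),
    !is.na(as.Date(until, format = "%Y-%m-%d"))
  )
  # Assumption: `from` and `until` are passed through to the OAI-PMH request
  OAIHarvester::oaih_list_records(base_url, from = from, until = until)
}

# Not run here because it needs network access:
# records_2017 <- harvest_window("http://edoc.hu-berlin.de/oai/request",
#                                "2017-01-01", "2017-12-31")
```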

Metadata is available in different formats. OAIHarvester defaults to the most basic one, oai_dc (unqualified Dublin Core), which exposes only a small subset of the available metadata in the following 15 fields:

 [1] "title"       "creator"     "subject"     "description" "publisher"   "contributor"
 [7] "date"        "type"        "format"      "identifier"  "source"      "language"   
[13] "relation"    "coverage"    "rights"
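
To see what happens at the protocol level, it can help to build the request URLs by hand. In OAI-PMH the `ListMetadataFormats` verb asks a repository which formats it offers, and `ListRecords` takes a `metadataPrefix` argument to select one. The helper `oai_request_url` below is a hypothetical name used for illustration only. As far as I know, OAIHarvester wraps the first request in `oaih_list_metadata_formats()`.

```r
# Hypothetical helper: build an OAI-PMH request URL from a verb and
# optional named arguments such as metadataPrefix.
oai_request_url <- function(base_url, verb, ...) {
  args <- c(list(verb = verb), list(...))
  # Percent-encode each value so it is safe inside a query string
  values <- vapply(
    args,
    function(a) utils::URLencode(as.character(a), reserved = TRUE),
    character(1)
  )
  paste0(base_url, "?", paste0(names(args), "=", values, collapse = "&"))
}

edoc_url <- "http://edoc.hu-berlin.de/oai/request"

# Which metadata formats does the repository offer?
formats_url <- oai_request_url(edoc_url, "ListMetadataFormats")

# Request records in a specific format
records_url <- oai_request_url(edoc_url, "ListRecords", metadataPrefix = "oai_dc")

print(formats_url)
print(records_url)
```

Opening either URL in a browser shows the raw XML response that the package parses for you.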

Clean & transform data:

all_metadata <- all_records[, "metadata"] # extract metadata column
all_metadata <- oaih_transform(all_metadata[sapply(all_metadata, length) > 0L]) # drop records without metadata, then parse xml
all_metadata <- as_tibble(all_metadata) # transform to tibble
all_metadata <- all_metadata %>% mutate(year = strtoi(str_sub(date, 1,4))) # extract year to own column
all_metadata <- all_metadata %>% mutate(language = unlist(language)) # unlist language
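
The `unlist(language)` step only works if every record carries exactly one language entry, which may not hold for every repository. A more defensive variant, sketched here on toy data rather than on the edoc records, keeps the first entry of each list element and falls back to `NA` when a field is empty. The tibble `toy` and the helper `first_or_na` are hypothetical names for illustration.

```r
library(tibble)
library(dplyr)
library(purrr)
library(stringr)

# Toy data that mimics list columns with one, several, or no entries
toy <- tibble(
  date     = list("2015-03-01", c("1999", "2001-05-02"), character(0)),
  language = list("ger", "eng", character(0))
)

# Hypothetical helper: first entry of a field, or NA if the field is empty
first_or_na <- function(x) if (length(x) > 0L) x[[1]] else NA_character_

toy_clean <- toy %>%
  mutate(
    date     = map_chr(date, first_or_na),
    language = map_chr(language, first_or_na),
    year     = as.integer(str_sub(date, 1, 4)) # first four characters are the year
  )

print(toy_clean)
```

Taking the first entry is a simplification. Whether the first date is the one you want depends on how the repository fills the field.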

Plot publications by date issued:

Note that “date issued” is the date of the first publication of an item, not the date it first appeared on edoc.

all_metadata %>% 
  ggplot(aes(x = year)) +
  geom_bar(width = 1.1) +
  labs(title = "Edoc items by date of first publication")

Plot usage of CC-licenses:

The result of this plot came as a surprise: I did not expect the overwhelming majority of items to use the most restrictive CC license (BY-NC-ND, ranked last in the openness ordering used here). Perhaps we can adjust our recommendations.

all_rights <- unlist(all_metadata[, "rights"])
all_rights <- as_tibble(all_rights)
all_rights <- filter(all_rights, str_detect(value, "http")) # only look at rights urls
cc_rights <- unlist(str_extract_all(all_rights$value, "(by([:alpha:]|-)*)|(zero)")) # extract cc licenses from the url column
cc_rights <- as_tibble(cc_rights)
licenses_by_openness <- tibble(
  value = c("zero", "by", "by-sa", "by-nd", "by-nc", "by-nc-sa", "by-nc-nd"), 
  openness = 1:7
  )
cc_rights <- cc_rights %>% 
  count(value) %>%
  inner_join(licenses_by_openness, by = "value")
  
ggplot(cc_rights, aes(x = reorder(value, openness), y = n)) +
  geom_col(aes(fill = openness)) +
  scale_fill_gradient(low = "green", high = "red", guide = "none") + # guide = FALSE is deprecated in current ggplot2
  labs(x = "CC licenses ordered from most open to least open", y = "Count", title = "CC licenses of edoc publications")
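
The pattern used above matches any "by..." substring, so it relies on the later join to discard false hits. A stricter alternative, sketched here on a few sample URLs that are not taken from the edoc data, anchors the match to the Creative Commons path and captures only the license code.

```r
library(stringr)

# Sample rights URLs for illustration only (not taken from the edoc data)
rights_urls <- c(
  "http://creativecommons.org/licenses/by-nc-nd/3.0/de/",
  "http://creativecommons.org/licenses/by/4.0/",
  "http://creativecommons.org/publicdomain/zero/1.0/",
  "http://rightsstatements.org/vocab/InC/1.0/"
)

# Capture the path segment after /licenses/ or /publicdomain/.
# str_match() returns a matrix; column 2 holds the first capture group.
license_code <- str_match(
  rights_urls,
  "creativecommons\\.org/(?:licenses|publicdomain)/([a-z-]+)/"
)[, 2]

print(license_code) # non-CC URLs yield NA
```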

Plot number of publications in English and German:

# absolute number of items per year, split by language
all_metadata %>%
  filter(language %in% c("eng", "ger")) %>%
  filter(year >= 1997 & year <= 2017) %>%
  ggplot() +
  geom_bar(mapping = aes(x = year, fill = language)) +
  labs(title = "Edoc items in English and German")

# same data, with each bar scaled to 1 to show the language share per year
all_metadata %>%
  filter(language %in% c("eng", "ger")) %>%
  filter(year >= 1997 & year <= 2017) %>%
  ggplot() +
  geom_bar(mapping = aes(x = year, fill = language), position = "fill") +
  labs(y = "share", title = "Share of edoc items in English and German")
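
With `position = "fill"`, each bar is scaled so that the languages in a given year add up to 1. If you want the numbers behind such a plot, the same shares can be computed directly. This is a minimal sketch on toy data; the tibble `toy_lang` is a hypothetical stand-in for the harvested records.

```r
library(tibble)
library(dplyr)

# Toy stand-in for the harvested records: one row per item
toy_lang <- tibble(
  year     = c(2000L, 2000L, 2000L, 2001L, 2001L),
  language = c("ger", "ger", "eng", "eng", "eng")
)

shares <- toy_lang %>%
  count(year, language) %>%        # items per year and language
  group_by(year) %>%
  mutate(share = n / sum(n)) %>%   # proportion within each year
  ungroup()

print(shares)
```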

References:

Hornik, Kurt. (2017). Metadata Harvesting with R and OAI-PMH. URL https://cran.r-project.org/web/packages/OAIHarvester/vignettes/oaih.pdf