Anyone can access the metadata of Humboldt University’s edoc publication server via its OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) interface. With Kurt Hornik’s OAIHarvester package for R you can download the dataset, clean it, and explore it with only a few lines of R code. The same techniques apply to any repository with an OAI-PMH interface.
library("OAIHarvester")
library("tidyverse")
edoc_url <- "http://edoc.hu-berlin.de/oai/request"
all_records <- oaih_list_records(edoc_url)
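Under the hood, a harvest is a series of plain HTTP GET requests against the base URL. As a rough sketch of what such a request looks like, the following builds a ListRecords URL by hand. The argument names `verb` and `metadataPrefix` come from the OAI-PMH specification; the helper `build_oai_request()` is a hypothetical name used for illustration and is not part of OAIHarvester.

```r
# Hypothetical helper, not part of OAIHarvester: assemble an OAI-PMH request URL.
build_oai_request <- function(base_url, verb, ...) {
  args <- list(verb = verb, ...)
  args <- args[!vapply(args, is.null, logical(1))]  # drop arguments that were left unset
  encoded <- vapply(
    args,
    function(a) utils::URLencode(as.character(a), reserved = TRUE),
    character(1)
  )
  paste0(base_url, "?", paste0(names(args), "=", encoded, collapse = "&"))
}

edoc_url <- "http://edoc.hu-berlin.de/oai/request"
request_url <- build_oai_request(edoc_url, "ListRecords", metadataPrefix = "oai_dc")
print(request_url)
```

Pasting a URL like this into a browser is a quick way to look at the raw XML a repository returns before harvesting it in full.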
Metadata is available in different formats. OAIHarvester defaults to the most basic one, oai_dc (unqualified Dublin Core), which gives us only a small subset of the available metadata:
[1] "title" "creator" "subject" "description" "publisher" "contributor"
[7] "date" "type" "format" "identifier" "source" "language"
[13] "relation" "coverage" "rights"
all_metadata <- all_records[, "metadata"] # extract metadata column
all_metadata <- oaih_transform(all_metadata[sapply(all_metadata, length) > 0L]) # drop records without metadata, then parse xml
all_metadata <- as_tibble(all_metadata) # transform to tibble
all_metadata <- all_metadata %>% mutate(year = strtoi(str_sub(date, 1,4))) # extract year to own column
all_metadata <- all_metadata %>% mutate(language = unlist(language)) # unlist language
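To see what the two cleaning steps do, here is a small sketch on made-up records. The tibble `toy_metadata` is an assumption that only mimics the shape of the harvested data; it is not real edoc output.

```r
library(dplyr)
library(stringr)
library(tibble)

# Made-up records that mimic the shape of the harvested metadata (not real edoc data).
toy_metadata <- tibble(
  date     = c("2015-03-01", "1998", "n.d."),
  language = list("eng", "ger", "ger")
)

toy_clean <- toy_metadata %>%
  mutate(
    # First four characters as an integer; NA if they are not a number.
    year = strtoi(str_sub(date, 1, 4), base = 10L),
    # Flatten the list column; works only if every record has exactly one language.
    language = unlist(language)
  )

print(toy_clean)
```

Dates that do not start with a four-digit year end up as `NA` and are dropped by ggplot2 with a warning. A record with zero or several language entries would make the `unlist()` step fail, because the result would no longer have one value per row.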
Note that the date field holds the “date issued”, which is the date an item was first published, not the date it first appeared on edoc.
all_metadata %>%
  ggplot(aes(x = year)) +
  geom_bar(width = 1.1) +
  labs(title = "Edoc items by date of first publication")
The result of the license plot came as a surprise: I did not expect the overwhelming majority of items to use the most restrictive CC license, by-nc-nd. Perhaps we can adjust our recommendations.
all_rights <- unlist(all_metadata[, "rights"])
all_rights <- as_tibble(all_rights)
all_rights <- filter(all_rights, str_detect(value, "http")) # only look at rights urls
cc_rights <- unlist(str_extract_all(all_rights$value, "(by([:alpha:]|-)*)|(zero)")) # extract cc licenses from the value column
cc_rights <- as_tibble(cc_rights)
licenses_by_openness <- tibble(
  value = c("zero", "by", "by-sa", "by-nd", "by-nc", "by-nc-sa", "by-nc-nd"),
  openness = 1:7
)
cc_rights <- cc_rights %>%
  count(value) %>%
  inner_join(licenses_by_openness, by = "value")
ggplot(cc_rights, aes(x = reorder(value, openness), y = n)) +
  geom_col(aes(fill = openness)) +
  scale_fill_gradient(low = "green", high = "red", guide = "none") + # guide = FALSE is deprecated in current ggplot2
  labs(x = "CC-Licenses ordered from most open to least open", y = "Count", title = "CC-Licenses of edoc publications")
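The regular expression in the extraction step is easier to judge on a handful of values. The sketch below applies it to a few example URLs; they follow the usual Creative Commons URL pattern, but the selection is made up for illustration and not taken from edoc.

```r
library(stringr)

# Example rights values, chosen for illustration (not harvested from edoc).
toy_rights <- c(
  "http://creativecommons.org/licenses/by-nc-nd/3.0/de/",
  "http://creativecommons.org/licenses/by/4.0/",
  "http://creativecommons.org/publicdomain/zero/1.0/",
  "http://rightsstatements.org/vocab/InC/1.0/"
)

# "by" followed by any run of letters and hyphens, or the literal "zero".
toy_licenses <- unlist(str_extract_all(toy_rights, "(by([:alpha:]|-)*)|(zero)"))
print(toy_licenses)
```

The last URL yields no match and simply drops out. Keep in mind that the pattern is not anchored to the license part of the URL, so a rights URL that happens to contain "by" somewhere else would produce a spurious hit; the later `inner_join()` against the table of known licenses filters out most of those.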
all_metadata %>%
  filter(language == "eng" | language == "ger") %>%
  filter(year >= 1997 & year <= 2017) %>%
  ggplot() +
  geom_bar(mapping = aes(x = year, fill = language)) +
  labs(title = "Edoc items in English and German")
all_metadata %>%
  filter(language == "eng" | language == "ger") %>%
  filter(year >= 1997 & year <= 2017) %>%
  ggplot() +
  geom_bar(mapping = aes(x = year, fill = language), position = "fill") + # stack bars to 100 % to show proportions
  labs(title = "Share of edoc items in English and German by year")
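With `position = "fill"` each bar is rescaled to a height of 1, so the plot shows proportions rather than counts. If you want the numbers behind it, a sketch like the following computes the share per year explicitly. The data in `toy_items` is made up for illustration and does not reflect real edoc figures.

```r
library(dplyr)
library(tibble)

# Made-up items for illustration, not real edoc figures.
toy_items <- tibble(
  year     = c(2016L, 2016L, 2016L, 2016L, 2017L, 2017L),
  language = c("ger", "ger", "ger", "eng", "ger", "eng")
)

language_share <- toy_items %>%
  count(year, language) %>%       # items per year and language
  group_by(year) %>%
  mutate(share = n / sum(n)) %>%  # proportion within each year
  ungroup()

print(language_share)
```

The same pipeline should work on `all_metadata` after the two `filter()` steps, assuming the `year` and `language` columns created earlier.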
Hornik, Kurt. (2017). Metadata Harvesting with R and OAI-PMH. URL https://cran.r-project.org/web/packages/OAIHarvester/vignettes/oaih.pdf