disambr is an R 📦 that provides a flexible framework for disambiguation of named entities. Currently this package implements the AEV algorithm (van den Akker et al., 2020) for Web of Science author disambiguation.
Basic usage amounts to piping disambiguation procedures together. Each procedure takes a list of sets as input and returns a list of sets, either adding a new set or modifying the input list.
data |>
  disambr_set_on_same_paper() |>
  disambr_set_similar_initials() |>
  disambr_set_similar_last_names()
A sequence of piped disambiguation procedures defines a disambiguation algorithm.
disambr_eva <- function(data) {
  data |>
    disambr_set_on_same_paper() |>
    disambr_set_similar_initials() |>
    disambr_set_similar_last_names()
}
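To make the "list of sets in, list of sets out" contract concrete, here is a minimal sketch with toy procedures. The `toy_*` functions, the `mentions` columns, and the choice to treat co-authors of one paper as different people are illustrative assumptions, not the package's actual implementation. The native pipe `|>` needs R 4.1 or later.

```r
# Toy data: one row per author mention (hypothetical columns)
mentions <- data.frame(
  paper_id  = c("p1", "p1", "p2"),
  last_name = c("Smith", "Jones", "Smith"),
  initials  = c("J", "A", "J"),
  stringsAsFactors = FALSE
)

# Accept raw data or a list of sets; always work with a list of sets
as_sets <- function(x) if (is.data.frame(x)) list(x) else x

# Hypothetical helper: dyads of rows that agree on `column`
toy_dyads <- function(d, column, set_type) {
  pairs <- t(combn(nrow(d), 2))
  same <- d[[column]][pairs[, 1]] == d[[column]][pairs[, 2]]
  dyads <- data.frame(id_1 = pairs[same, 1], id_2 = pairs[same, 2])
  attr(dyads, "disambr_set_type") <- set_type
  dyads
}

# Each toy procedure appends one new set to the list it receives
toy_set_on_same_paper <- function(sets) {
  sets <- as_sets(sets)
  c(sets, list(toy_dyads(sets[[1]], "paper_id", "different_entities")))
}

toy_set_similar_initials <- function(sets) {
  sets <- as_sets(sets)
  c(sets, list(toy_dyads(sets[[1]], "initials", "similar_entities")))
}

result <- mentions |>
  toy_set_on_same_paper() |>
  toy_set_similar_initials()

length(result)  # the input table plus one set per procedure
```

Because every step returns the same kind of object it receives, steps can be reordered or dropped without changing the surrounding code.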
The EVA-algorithm: An open-source solution for the disambiguation of author names in Web of Science data
Olmo R. van den Akker, Sacha Epskamp, Stanislav Vlasov
The creation of co-authorship networks is a valuable way to depict the social structure of scientific fields. However, these co-authorship networks often get distorted because of the problems of author name synonymy (the same author is split into two nodes because his name is spelled differently in different publications) and author name homonymy (different authors are compounded into one node because they share the same name). The practice of author name disambiguation (AND) tries to solve these problems by correctly identifying the authors of scientific articles.
Several algorithms have been put forward in the context of AND, but none of them are suitable for large datasets of the Web of Science database. Therefore, in an earlier part of this project we proposed a new unsupervised learning algorithm based on the most recent AND literature. This so-called AEV-algorithm involves two phases: a blocking phase, in which pairs of authors are selected that are sufficiently similar, and a disambiguation phase, in which similar author names are either split or combined into one node based on information retrieved from the Web of Science database. In the disambiguation phase, the algorithm uses information about co-authorship, e-mail addresses, institutional affiliations, cited references, and article keywords (van den Akker et al., 2020).
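The two phases can be illustrated with a toy sketch. This is not the AEV implementation: the similarity rule (last names within one edit, same first initial), the use of e-mail as the only evidence, and the `authors` table are all assumptions chosen to keep the example short.

```r
# Hypothetical author mentions
authors <- data.frame(
  id        = 1:4,
  last_name = c("Smith", "Smith", "Smyth", "Jones"),
  initials  = c("J", "J", "J", "A"),
  email     = c("js@uni.edu", "js@uni.edu", NA, "aj@lab.org"),
  stringsAsFactors = FALSE
)

# All unordered pairs of author ids (ids equal row numbers here)
pairs <- t(combn(authors$id, 2))

# Blocking phase: keep pairs whose last names differ by at most one edit
# and whose first initial matches
name_dist <- utils::adist(tolower(authors$last_name))
first_init <- substr(authors$initials, 1, 1)
is_candidate <- name_dist[pairs] <= 1 &
  first_init[pairs[, 1]] == first_init[pairs[, 2]]
candidates <- pairs[is_candidate, , drop = FALSE]

# Disambiguation phase: combine candidates that share a known e-mail address
e1 <- authors$email[candidates[, 1]]
e2 <- authors$email[candidates[, 2]]
same_person <- candidates[!is.na(e1) & !is.na(e2) & e1 == e2, , drop = FALSE]
```

Blocking keeps the expensive comparisons to a small number of candidate pairs, which is what makes the approach workable for large datasets.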
All disambiguation procedures in the disambr package work with sets. A set is basically any R object that can represent a mathematical set (e.g., a set of authors or a set of companies), with special attributes that disambr functions use to identify the kind of set they are working with or have produced (e.g., a set of authors that are likely to be the same person, or a set of companies that are definitely different companies).
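As a minimal sketch of this idea, the snippet below tags an ordinary data frame with base R attributes, using attribute names and values from the list that follows. The data and the `describe_set` helper are hypothetical and not part of the package.

```r
# A set stored as a table: one row per entity
similar_authors <- data.frame(
  author_id = c(11L, 42L),
  name      = c("Smith, J", "Smith, John"),
  stringsAsFactors = FALSE
)

# Attributes describe what kind of set this is
attr(similar_authors, "disambr_entity") <- "person"
attr(similar_authors, "disambr_set_type") <- "similar_entities"
attr(similar_authors, "disambr_set_collection") <- "single_set_table"

# Hypothetical helper: summarise a set from its attributes
describe_set <- function(set) {
  paste(attr(set, "disambr_set_type"), "of", attr(set, "disambr_entity"))
}

describe_set(similar_authors)
```

Keeping the metadata in attributes means the object still behaves like a plain data frame for any code that does not care about the set semantics.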
The attributes that are currently used to define/describe sets as well as their values are listed below:
- disambr_set_name
- disambr_entity: person, organization, publication
- disambr_set_type: similar_entities, different_entities
- disambr_set_coefficient
- disambr_set_collection:
  - single_set_table (first column assumed to store entity id, or entity id is just row number if entity_id_reference attribute is set to self, see below)
  - list_of_sets_as_lists (each set is a list of entity ids)
  - dyads_table (first and second columns assumed to be ids for the pair of entities)
- disambr_entity_id_reference: self, name of other set as in its set_name attribute
- disambr_entity_id_reference_md5_sum
- disambr_recipe: disambr procedures that were applied to produce given set(s)
  - procedure
  - file_name
  - file_md5sum (to check file identity later)
  - file_header (to check for consistency between read files)

This research was supported (in part) by the Fetzer Franklin Fund of the John E. Fetzer Memorial Trust.
van den Akker, O. R., Epskamp, S., & Vlasov, S. A. (2020). The AEV Algorithm—Author name disambiguation for large Web of Science datasets.
tag | description |
---|---|
FN | File Name |
VR | Version Number |
PT | Publication Type (J=Journal; B=Book; S=Series; P=Patent) |
AU | Authors |
AF | Author Full Name |
BA | Book Authors |
BF | Book Authors Full Name |
CA | Group Authors |
GP | Book Group Authors |
BE | Editors |
TI | Document Title |
SO | Publication Name |
SE | Book Series Title |
BS | Book Series Subtitle |
LA | Language |
DT | Document Type |
CT | Conference Title |
CY | Conference Date |
CL | Conference Location |
SP | Conference Sponsors |
HO | Conference Host |
DE | Author Keywords |
ID | Keywords Plus® |
AB | Abstract |
C1 | Author Address |
RP | Reprint Address |
EM | E-mail Address |
RI | ResearcherID Number |
OI | ORCID Identifier (Open Researcher and Contributor ID) |
FU | Funding Agency and Grant Number |
FX | Funding Text |
CR | Cited References |
NR | Cited Reference Count |
TC | Web of Science Core Collection Times Cited Count |
Z9 | Total Times Cited Count* |
U1 | Usage Count (Last 180 Days) |
U2 | Usage Count (Since 2013) |
PU | Publisher |
PI | Publisher City |
PA | Publisher Address |
SN | International Standard Serial Number (ISSN) |
EI | Electronic International Standard Serial Number (eISSN) |
BN | International Standard Book Number (ISBN) |
J9 | 29-Character Source Abbreviation |
JI | ISO Source Abbreviation |
PD | Publication Date |
PY | Year Published |
VL | Volume |
IS | Issue |
SI | Special Issue |
PN | Part Number |
SU | Supplement |
MA | Meeting Abstract |
BP | Beginning Page |
EP | Ending Page |
AR | Article Number |
DI | Digital Object Identifier (DOI) |
D2 | Book Digital Object Identifier (DOI) |
PG | Page Count |
P2 | Chapter Count (Book Citation Index) |
WC | Web of Science Categories |
SC | Research Areas |
GA | Document Delivery Number |
UT | Accession Number |
PM | PubMed ID |
ER | End of Record |
EF | End of File |
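In a Web of Science plaintext export, each line starts with one of these two-letter tags followed by a space, and continuation lines start with blanks instead of a tag. The sketch below parses a single record under that assumption; `parse_wos_record` and the sample record are hypothetical and not part of disambr.

```r
# A made-up record in WoS plaintext layout
record <- c(
  "PT J",
  "AU Smith, J",
  "   Jones, A",
  "TI An example title",
  "PY 2020",
  "ER"
)

parse_wos_record <- function(lines) {
  tag <- substr(lines, 1, 2)
  value <- trimws(substring(lines, 4))
  # Continuation lines have a blank tag: reuse the tag of the line above
  is_cont <- tag == "  "
  tag <- tag[!is_cont][cumsum(!is_cont)]
  # Group values by tag, keeping tags in order of first appearance
  fields <- split(value, factor(tag, levels = unique(tag)))
  # ER only marks the end of the record
  fields[names(fields) != "ER"]
}

fields <- parse_wos_record(record)
fields$AU
```

Multi-valued fields such as AU come back as character vectors with one element per line.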
name | comments |
---|---|
bibliometrix | reads only plaintext format into bibliometrixDB object |
wosr | Requires WoS API subscription |
refsplitr | package ‘refsplitr’ is not available (for R version 4.0.1) |
read.wos.R | Does not work… |
metagear | scrape_bibliography by DOI |
hindexcalculator | ? |
GitHub: https://github.com/massimoaria/bibliometrix
Site: https://bibliometrix.org/index.html
git clone https://github.com/massimoaria/bibliometrix
install.packages("bibliometrix")
library("bibliometrix")
library("magrittr")
bmdata <- convert2df(file = 'https://www.bibliometrix.org/datasets/wos_plaintext.txt', dbsource = 'wos', format = "plaintext")
bmdata %>% class
## [1] "data.frame" "bibliometrixDB"
bmdata %>% names
bmdata <- convert2df(file = 'https://www.bibliometrix.org/datasets/wos_plaintext.txt', dbsource = 'wos', format = "csv")
CRAN docs: https://cran.r-project.org/web/packages/metagear/metagear.pdf
GitHub: https://github.com/cran/metagear/
git clone https://github.com/cran/metagear/
scrape_bibliography
CRAN docs: https://cran.r-project.org/web/packages/BibPlots/BibPlots.pdf
CRAN docs: https://cran.r-project.org/web/packages/hindexcalculator/hindexcalculator.pdf
git clone https://github.com/ropensci/refsplitr
install.packages("refsplitr")
library("refsplitr")
Requires premium WoS API - https://clarivate.com/webofsciencegroup/solutions/xml-and-apis
CRAN doc: https://cran.r-project.org/web/packages/wosr/wosr.pdf
Site: https://github.com/vt-arc/wosr
GitHub: https://github.com/vt-arc/wosr
git clone https://github.com/vt-arc/wosr
install.packages("wosr")
library(wosr)
## Get session ID
sid <- auth("s.vlasov@tilburguniversity.edu", password = "")
## Error: No matches returned for Username s.vlasov@tilburguniversity.edu
## Query WoS to see how many results match your query
query <- 'TS = ("animal welfare") AND PY = (2002-2003)'
query_wos(query, sid = sid)
## Download data
pull_wos(query, sid = sid)
GitHub: https://github.com/alberto-martin/read.wos.R
git clone https://github.com/alberto-martin/read.wos.R
## load functions
## --------------------------------------------------------------------------------
source("../lib/read.wos.R/read.wos.functions.R")
## --------------------------------------------------------------------------------
## test
wos.data.mp <- read.wos(dir("../data/Journals in Mathematical Psychology", no.. = TRUE, full.names = TRUE))
## Error in substring(fields, 4) : invalid multibyte string at '<ff><fe>P'
## In addition: Warning message:
## In readLines(files[1], n = 1) : line 1 appears to contain an embedded nul
wos.data.mp <- read.wos("../data/Journals in Mathematical Psychology/Psychonomic Bulletin & Review 2.txt")
## Error in substring(fields, 4) : invalid multibyte string at '<ff><fe>P'
## In addition: Warning message:
## In readLines(files[1], n = 1) : line 1 appears to contain an embedded nul
wos.data <- read.wos("/mnt/md5/data/wos/wos-sci-expanded.firm-names-query.analytical-instruments/LN Public NAICS records from 10001 to 10500.txt")
## Error in substring(fields, 4) : invalid multibyte string at '<ff><fe>P'
## In addition: Warning message:
## In readLines(files[1], n = 1) : line 1 appears to contain an embedded nul
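The `<ff><fe>` bytes in these errors look like a UTF-16LE byte order mark, and "embedded nul" warnings are typical when UTF-16 text is read as a single-byte encoding. This is an inference from the messages, not something verified against the files above. Assuming that diagnosis, the sketch below builds a small stand-in file and reads it by declaring the encoding on the connection.

```r
# Build a small UTF-16LE file with a byte order mark (stand-in for a WoS export)
path <- tempfile(fileext = ".txt")
out <- file(path, open = "wb")
writeBin(as.raw(c(0xff, 0xfe)), out)
writeBin(
  iconv("PT J\nAU Smith, J\n", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]],
  out
)
close(out)

# Declare the encoding so readLines() receives re-encoded text
con <- file(path, encoding = "UTF-16LE")
lines <- readLines(con)
close(con)

# Drop a leading byte order mark in case it survived re-encoding
lines[1] <- sub("^\ufeff", "", lines[1])
lines
```

If the diagnosis holds, re-saving the exports as UTF-8 before calling read.wos might be another way around the error.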