Title: | Document-Level Matching Between Bibliographic Datasets |
---|---|
Description: | Identifies and visualizes document overlap in any number of bibliographic datasets. This package implements the identification of overlapping documents through the exact match of a unique identifier (e.g. Digital Object Identifier - DOI) and, for records where the identifier is absent, through a score calculated from a set of fields commonly found in bibliographic datasets (Title, Source, Authors and Publication Year). Additionally, it provides functions to visualize the results of the document matching through a Venn diagram and/or UpSet plot, as well as a summary of the matching procedure. |
Authors: | Gabriel Vieira [aut, cre, cph] , Jacqueline Leta [ctb] |
Maintainer: | Gabriel Vieira <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.0.3.9000 |
Built: | 2024-11-15 03:06:41 UTC |
Source: | https://github.com/gavieira/biblioverlap |
Shiny App for the biblioverlap package
biblioverApp(port = NULL, max_upload_size = 1000, launch.browser = TRUE)
port |
|
max_upload_size |
|
launch.browser |
|
Opens an instance of the biblioverlap UI.
#Running the ShinyApp
biblioverApp()
This function identifies document overlap between bibliographic datasets and records it through the use of Universally Unique Identifiers (UUID).
biblioverlap( db_list, matching_fields = default_matching_fields, n_threads = 1, ti_penalty = 0.1, ti_max = 0.6, so_penalty = 0.1, so_max = 0.3, au_penalty = 0.1, au_max = 0.3, py_max = 0.3, score_cutoff = 1 )
db_list |
|
matching_fields |
|
n_threads |
|
ti_penalty |
|
ti_max |
|
so_penalty |
|
so_max |
|
au_penalty |
|
au_max |
|
py_max |
|
score_cutoff |
|
In this procedure, any duplicates within the same dataset are removed. Then, a Universally Unique Identifier (UUID) is assigned to each record. If a match is found between two documents in a pairwise comparison, the UUID of the record from the first dataset is copied to the record in the second.
All preprocessing and modifications are performed on a copy of the original data, which is used internally by the program. After all pairwise comparisons are completed, the UUID data is added as a new column to the original data.
Thus, the db_list returned by this function contains the same fields provided by the user plus the UUID column with the overlap information. This allows for further analysis using other fields (e.g. 'number of citations' or 'document type').
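The procedure described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: it handles exact DOI matches only, deduplicates on the DOI column, and uses simple labels (`id-1-1`, ...) as stand-ins for real UUIDs. The toy data frames `db1` and `db2` are hypothetical.

```r
# Two toy datasets (hypothetical data; column names mirror the default matching fields)
db1 <- data.frame(DOI = c("10.1/a", "10.1/b", "10.1/b"),
                  Title = c("Paper A", "Paper B", "Paper B"),
                  stringsAsFactors = FALSE)
db2 <- data.frame(DOI = c("10.1/b", "10.1/c"),
                  Title = c("Paper B", "Paper C"),
                  stringsAsFactors = FALSE)

# Step 1: remove duplicates within each dataset
# (deduplicating on DOI is an assumption made for this sketch)
db1 <- db1[!duplicated(db1$DOI), ]
db2 <- db2[!duplicated(db2$DOI), ]

# Step 2: give every record its own identifier
# (plain labels stand in for real UUIDs here)
db1$UUID <- paste0("id-1-", seq_len(nrow(db1)))
db2$UUID <- paste0("id-2-", seq_len(nrow(db2)))

# Step 3: for every exact DOI match, copy the UUID of the record in the
# first dataset to the matching record in the second dataset
hit <- match(db2$DOI, db1$DOI)          # NA where db2 has no match in db1
db2$UUID[!is.na(hit)] <- db1$UUID[hit[!is.na(hit)]]

# Records that overlap now share an identifier
shared <- intersect(db1$UUID, db2$UUID)
```

Because the overlap is stored as an ordinary column, every other field in the data remains available for downstream filtering or aggregation.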
A list object containing:
(i) db_list: a modified version of db_list where matching documents share the same UUID
(ii) summary: a summary of the results of the matching procedure
In its internal data, the program will attempt to split the AU (Author) field to extract only the first author, for which it will calculate the Levenshtein distance.
It assumes that the AU field is separated by semicolons (";"). Thus, to perform the matching procedure correctly when the field uses another separator, the user can either: (i) change the separator to a semicolon; or (ii) create a new column containing only the first author.
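Both workarounds can be sketched in base R. The author strings, the `|` separator and the helper `first_author()` are hypothetical and exist only for this illustration. The Levenshtein distance is computed with `utils::adist()`, which may differ from the routine the package uses internally.

```r
# Hypothetical author fields; the second dataset separates authors with "|"
au_a <- c("Silva, J.; Souza, M.", "Lima, P.")
au_b <- c("Silva, J | Costa, R.", "Lima, P.")

# Option (i): change the separator to a semicolon
au_b_fixed <- gsub("|", ";", au_b, fixed = TRUE)

# Option (ii): build a column holding only the first author
# (first_author() is a helper defined for this sketch, not a package function)
first_author <- function(au) trimws(sub(";.*$", "", au))
fa_a <- first_author(au_a)
fa_b <- first_author(au_b_fixed)

# Levenshtein distance between first authors, compared position by position
dist <- mapply(function(x, y) utils::adist(x, y)[1, 1],
               fa_a, fa_b, USE.NAMES = FALSE)
```

A distance of zero means the first-author strings are identical; small positive values typically reflect formatting differences such as a missing period.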
#Example list of input dataframes
lapply(ufrj_bio_0122, head, n=1)

#List of columns for matching (identical to biblioverlap()'s defaults)
matching_cols <- list(DI = 'DOI',
                      TI = 'Title',
                      PY = 'Publication Year',
                      AU = 'Author/s',
                      SO = 'Source Title')

#Running document-level matching procedure (first two dataframes)
biblioverlap_results <- biblioverlap(ufrj_bio_0122[1:2], matching_fields = matching_cols)

#Taking a look at the matched db_list
lapply(biblioverlap_results$db_list, head, n=1)

#Taking a look at the matching results summary
biblioverlap_results$summary
Get all matches from a given subset of biblioverlap's results
get_all_subset_matches(subset_db_list, db_list)
subset_db_list |
|
db_list |
|
the subset data plus any other records outside the subset that have been matched to its documents
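The idea behind this return value can be sketched in base R: collect the UUIDs present in the subset, then keep every record in the full results that carries one of them. This is one plausible reading, not the package's actual implementation. The toy lists `full` and `subset_list` and their `Type` column are hypothetical.

```r
# Hypothetical matched results: records sharing a UUID are the same document
full <- list(
  A = data.frame(UUID = c("u1", "u2", "u3"),
                 Type = c("article", "article", "book chapter"),
                 stringsAsFactors = FALSE),
  B = data.frame(UUID = c("u1", "u3", "u4"),
                 Type = c("journal article", "article", "article"),
                 stringsAsFactors = FALSE)
)

# Subsetting by document type drops matches whose type label differs by source
subset_list <- lapply(full, function(db) db[db$Type == "article", ])

# Collect every UUID present anywhere in the subset
subset_uuids <- unique(unlist(lapply(subset_list, function(db) db$UUID)))

# Keep all records, in any dataset, that carry one of those UUIDs
recovered <- lapply(full, function(db) db[db$UUID %in% subset_uuids, ])
```

In this toy case the subset alone shows no overlap between `A` and `B`, while the recovered lists share the documents `u1` and `u3`.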
#Running document-level matching procedure for two datasets
biblioverlap_results <- biblioverlap(ufrj_bio_0122[1:2])$db_list

#Change the document type of one of the datasets from 'journal article' to
#'article' to emulate bibliographical source differences
biblioverlap_results[[2]][['Publication Type']] <- gsub('journal article', 'article',
                                                        biblioverlap_results[[2]][['Publication Type']])

#Generating venn diagram for the entire dataset
venn_plot(biblioverlap_results)

#Obtaining only the subset of records with publication type 'article'
biblioverlap_results_subset <- lapply(biblioverlap_results, function(db) {
  db[db$'Publication Type' == "article", ]
})

#Generating venn diagram for data subset
#Shows how many documents categorized as 'article' are unique to a given
#dataset and how many find a match against other documents in the subset
#(i.e. that are also categorized as 'article', in this example)
venn_plot(biblioverlap_results_subset)

#Recovering matches lost in the subsetting process
#due to bibliographical source differences
subset_all_matches <- get_all_subset_matches(biblioverlap_results_subset, biblioverlap_results)

#Generating venn diagram for data subset plus all its matches
#Shows how many documents categorized as 'article' are unique to a given
#dataset and how many find a match against any other document
venn_plot(subset_all_matches)
Plotting biblioverlap's matching summary
matching_summary_plot( matching_summary_df, add_logo = TRUE, text_size = 15, ... )
matching_summary_df |
|
add_logo |
|
text_size |
|
... |
|
a barplot summary of the matching results
#Running document-level matching procedure
biblioverlap_results <- biblioverlap(ufrj_bio_0122[1:2])

#Checking biblioverlap results (summary table)
biblioverlap_results$summary

#Plotting the matching summary
matching_summary_plot(biblioverlap_results$summary)
Merge multiple input files from the same source
merge_input_files(input_files, sep = ",", quote = "\"")
input_files |
|
sep |
|
quote |
|
It is fairly common to retrieve data from a single bibliographic database in small chunks. Thus, this function is designed to merge multiple files from the same source into a single file while also removing duplicate records.
a single dataframe with all unique records from the input files
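The merge-and-deduplicate idea can be sketched in base R as follows. This is an illustration under stated assumptions rather than the function's actual code: the toy chunks and the helper `read_one()` are hypothetical, and duplicates are taken to be fully identical rows.

```r
# Two hypothetical export chunks from the same source, sharing one record
chunk1 <- data.frame(DOI = c("10.1/a", "10.1/b"),
                     Title = c("Paper A", "Paper B"),
                     stringsAsFactors = FALSE)
chunk2 <- data.frame(DOI = c("10.1/b", "10.1/c"),
                     Title = c("Paper B", "Paper C"),
                     stringsAsFactors = FALSE)

f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")
write.csv(chunk1, file = f1, row.names = FALSE)
write.csv(chunk2, file = f2, row.names = FALSE)

# Read every file with the same separator and quote settings
# (read_one() is a helper defined for this sketch)
read_one <- function(path) {
  read.csv(path, sep = ",", quote = "\"",
           stringsAsFactors = FALSE, check.names = FALSE)
}

# Stack the chunks and drop rows that are exact duplicates
merged <- unique(do.call(rbind, lapply(c(f1, f2), read_one)))
rownames(merged) <- NULL

# Clean up the temporary files
unlink(c(f1, f2))
```

Stacking with `rbind()` requires the chunks to share the same columns, which is normally the case for exports from a single database.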
## Generating tempfiles
tempfile1 <- tempfile(fileext = ".csv")
tempfile2 <- tempfile(fileext = ".csv")
write.csv(ufrj_bio_0122$Biochemistry, file = tempfile1, row.names = FALSE)
write.csv(ufrj_bio_0122$Genetics, file = tempfile2, row.names = FALSE)

## Testing function
merged_files <- merge_input_files(c(tempfile1, tempfile2))
dim(merged_files)
head(merged_files)
Merge biblioverlap's results into a single dataframe
merge_results(db_list, filter = "none")
db_list |
|
filter |
|
a single dataframe containing data from db_list, featuring an additional 'SET_NAME' column to indicate from which dataset each record came
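As a rough base R sketch of the returned structure: each dataset is tagged with its name in a 'SET_NAME' column and the datasets are stacked. The two filtered views at the end are an assumed reading of the 'distinct' and 'matched' options ('distinct' as one row per UUID, 'matched' as only rows whose UUID occurs more than once). The package's exact behaviour may differ, and the toy `db_list` is hypothetical.

```r
# Hypothetical matched results: "u2" is the same document in both datasets
db_list <- list(
  Biochemistry = data.frame(UUID = c("u1", "u2"),
                            Title = c("Paper A", "Paper B"),
                            stringsAsFactors = FALSE),
  Genetics = data.frame(UUID = c("u2", "u3"),
                        Title = c("Paper B", "Paper C"),
                        stringsAsFactors = FALSE)
)

# Tag every record with the name of the dataset it came from, then stack
tagged <- Map(function(db, name) cbind(db, SET_NAME = name, stringsAsFactors = FALSE),
              db_list, names(db_list))
all_data <- do.call(rbind, tagged)
rownames(all_data) <- NULL

# Number of overlapping records: UUIDs that appear more than once
n_overlap <- sum(duplicated(all_data$UUID))

# Assumed reading of 'distinct': one row per UUID
distinct_data <- all_data[!duplicated(all_data$UUID), ]

# Assumed reading of 'matched': only rows whose UUID is shared
shared_ids <- all_data$UUID[duplicated(all_data$UUID)]
matched_data <- all_data[all_data$UUID %in% shared_ids, ]
```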
#Running document-level matching procedure for two datasets
biblioverlap_results <- biblioverlap(ufrj_bio_0122[1:2])

#Obtaining the results as a single dataframe (all records)
all_data <- merge_results(biblioverlap_results$db_list)

#Checking the total number of rows and of overlapping documents in the dataframe
nrow(all_data)
sum(duplicated(all_data$UUID))

#Obtaining only distinct records as a single dataframe
distinct_data <- merge_results(biblioverlap_results$db_list, filter = 'distinct')

#Checking the total number of rows and of overlapping documents in the dataframe
nrow(distinct_data)
sum(duplicated(distinct_data$UUID))

#Obtaining only matched records as a single dataframe
matched_data <- merge_results(biblioverlap_results$db_list, filter = 'matched')

#Checking the total number of rows and of overlapping documents in the dataframe
nrow(matched_data)
sum(duplicated(matched_data$UUID))
Data obtained from The Lens Scholarly Search on September 6, 2023.
The original data contained all documents from four major biological sciences fields published in 2022 by at least one author affiliated with the Universidade Federal do Rio de Janeiro (UFRJ). The data was then subsampled to documents published exclusively in January 2022 to reduce package size.
The biological disciplines featured in this dataset are Biochemistry, Genetics, Microbiology and Zoology.
ufrj_bio_0122
A named list with 4 elements. Each element is a dataframe that contains the following fields:
Unique identifier given to each record in The Lens database
Digital Object Identifier
Document title
Document publication year
Source (e.g. journal) where the document has been published
Document authors
Type of the document (e.g. 'journal article', 'book chapter')
Total number of citations received by document at the time of data recovery
Type of open access (e.g. gold, bronze, green)
Plotting UpSet plot from biblioverlap results
upset_plot(db_list, add_logo = TRUE, ...)
db_list |
|
add_logo |
|
... |
|
an UpSet plot representation of document overlap between the input datasets
#Running document-level matching procedure
biblioverlap_results <- biblioverlap(ufrj_bio_0122[1:2])

#Checking biblioverlap results (db_list)
lapply(biblioverlap_results$db_list, head, n=1)

#Plotting the UpSet plot
upset_plot(biblioverlap_results$db_list)
Plotting Venn Diagram from biblioverlap results
venn_plot(db_list, add_logo = TRUE, ...)
db_list |
|
add_logo |
|
... |
|
a Venn Diagram representation of document overlap between the input datasets
#Running document-level matching procedure
biblioverlap_results <- biblioverlap(ufrj_bio_0122[1:2])

#Checking biblioverlap results (db_list)
lapply(biblioverlap_results$db_list, head, n=1)

#Plotting the Venn diagram
venn_plot(biblioverlap_results$db_list)