Setting up environment
I developed and tested labyrinth on Fedora Linux 38 and 39. While this package does not contain any operating system-specific code, it has not been tested on other operating systems. In theory, labyrinth should work on other Unix-like operating systems as long as the required dependencies are installed.
If you need to rebuild this model, it is strongly recommended to use Fedora Linux or one of its downstream operating systems (e.g., Red Hat Enterprise Linux, CentOS Stream, Rocky Linux, AlmaLinux) and to follow the instructions above. Using a different environment may lead to unexpected issues or differences in results.
Red Hat Enterprise Linux, Rocky Linux, AlmaLinux, and CentOS Stream
The Fedora RPMs for R have been ported to CentOS/RHEL by the Extra Packages for Enterprise Linux (EPEL) project. These RPMs are also compatible with distributions derived from CentOS/RHEL. To use the EPEL repository, it is sufficient to download and install the appropriate epel-release RPM. Then R can be installed as described in the Fedora section.
- CentOS Stream 9
- Rocky Linux 9, AlmaLinux 9, and RHEL 9
- CentOS Stream 8
- Rocky Linux 8, AlmaLinux 8, and RHEL 8
- CentOS 7 and RHEL 7
```sh
sudo yum-config-manager --enable extras
sudo yum install dnf -y
sudo dnf install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm -y

# Use the PUIAS computational repo
curl -o /tmp/RPM-GPG-KEY-springdale http://springdale.math.ias.edu/data/puias/7/x86_64/os/RPM-GPG-KEY-springdale
sudo mv /tmp/RPM-GPG-KEY-springdale /etc/pki/rpm-gpg/RPM-GPG-KEY-springdale
# Quote 'EOF' so that $releasever and $basearch are written literally for dnf
tee /tmp/puias_computational.repo << 'EOF'
[puias_computational]
name=PUIAS computational Base $releasever - $basearch
mirrorlist=http://puias.math.ias.edu/data/puias/computational/$releasever/$basearch/mirrorlist
# baseurl=http://puias.math.ias.edu/data/puias/computational/$releasever/$basearch
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-springdale
EOF
sudo mv /tmp/puias_computational.repo /etc/yum.repos.d
sudo dnf update -y
```
Python 3
This R package relies on several Python packages. Users can install these packages using one of the following methods:
- System package manager on Linux

  ```sh
  # On Fedora, you can run:
  sudo dnf install python3-numpy python3-scikit-learn python3-scipy python3-seaborn python3-tqdm python3-beautifulsoup4 python3-selenium python3-lxml

  # On Debian, you can run:
  sudo apt install python3-numpy python3-sklearn python3-scipy python3-seaborn python3-tqdm python3-bs4 python3-selenium python3-lxml
  ```

- Python pip

- Anaconda / Miniconda

  ```sh
  conda create -n labyrinth numpy scikit-learn scipy seaborn tqdm beautifulsoup4 selenium lxml r-tidyverse r-devtools r-rcppeigen r-rcppprogress r-fastmatch r-rpca r-future.apply r-pbapply r-dbi r-rsqlite r-bursts r-patchwork r-furrr r-datapreparation r-tokenizers r-reticulate r-knitr r-progressr r-future.callr r-hrbrthemes r-proc r-ggthemes r-meta r-ggally r-matrixtests r-corrplot r-rstatix bioconductor-tcgabiolinks bioconductor-clusterprofiler bioconductor-fgsea bioconductor-deseq2 bioconductor-m3c -c bioconda -c conda-forge
  conda activate labyrinth
  Rscript -e "devtools::install_version('dbparser', version = '1.2.0')"
  Rscript -e "remotes::install_github('randef1ned/word2vec')"
  ```
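No commands are listed for the pip method above. As a sketch, the distribution packages from the Fedora list can be mapped to their PyPI equivalents and assembled into a `pip install` command; the name mapping below is an assumption based on common PyPI naming, not taken from the original instructions:

```python
# Hypothetical mapping from the Fedora package names listed above to PyPI names.
PYPI_NAMES = {
    "python3-numpy": "numpy",
    "python3-sklearn": "scikit-learn",
    "python3-scipy": "scipy",
    "python3-seaborn": "seaborn",
    "python3-tqdm": "tqdm",
    "python3-beautifulsoup4": "beautifulsoup4",
    "python3-selenium": "selenium",
    "python3-lxml": "lxml",
}

# Build the equivalent `pip install` command line.
command = "python3 -m pip install " + " ".join(PYPI_NAMES.values())
print(command)
# → python3 -m pip install numpy scikit-learn scipy seaborn tqdm beautifulsoup4 selenium lxml
```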
R
```r
install.packages(c('tidyverse', 'devtools', 'igraph', 'future.apply', 'pbapply', 'DBI', 'RSQLite', 'bursts', 'patchwork', 'furrr', 'M3C', 'dataPreparation', 'tokenizers', 'reticulate', 'magrittr', 'knitr', 'progressr', 'future.callr', 'hrbrthemes', 'pROC', 'ggthemes', 'meta', 'GGally', 'matrixTests', 'rstatix', 'corrplot', 'BiocManager'))
devtools::install_version('dbparser', version = '1.2.0')
remotes::install_github('randef1ned/word2vec')
remotes::install_github('randef1ned/diffusr')
```
This project uses a custom fork of the diffusr package, maintained by @randef1ned. The fork optimizes the package's computations, providing improved performance over the original version. View the documentation online.
Folder structure
The tools/ folder in this GitHub repository contains the code and scripts required for the pre-processing and processing steps of labyrinth.
Preprocessing steps
- Parse the drug names:
  - extract_names.Rmd: This R Markdown file is used for parsing and extracting names from the data.
- Download and read the data:
  - parse_wos.py: This Python script reads and parses the Web of Science (WoS) data.
  - export_edge_tsv.py: This Python script exports the edge data as a TSV (tab-separated values) file.
  - download_mesh_synonyms.py: This Python script downloads and processes the MeSH (Medical Subject Headings) synonyms.
  - clinical_trials_parser.py: This Python script parses the clinical trials data.
  - clean_wos.py: This Python script cleans and preprocesses the WoS data.
- Identify stopwords and run word2vec:
  - identify_stopwords.Rmd: This R Markdown file is used to identify and handle stopwords in the data.
  - run_word2vec.R: This R script runs the word2vec algorithm on the tokenized data.
  - compute_distance.R: This R script computes the distances between the word2vec embeddings for each drug.
- Construct the large knowledge network:
  - build_network.R: This R script builds the large network by linking the clinical trial IDs and paper IDs.
- Extract the stages of the drugs:
  - drug_status.R: This R script extracts the stages of the drugs from the data.
Construct the model of labyrinth and validate
- Construct the main knowledge network:
  - main_network.R: This R script constructs the main network for the project.
- Compute the details in the citation network:
  - citation_metric.R: This R script computes various metrics and details in the citation network.
- Download MeSH structures for validation:
  - download_mesh_structure.py: This Python script downloads the MeSH structure data for validation purposes.
- Download TCGA and process the data for validation:
  - download_tcga.R: This R script downloads and processes the TCGA (The Cancer Genome Atlas) data for validation.
Data preparation
| Data | Link |
|---|---|
| ChEMBL | http://doi.org/10.6019/CHEMBL.database.32 |
| UniChem | https://ftp.ebi.ac.uk/pub/databases/chembl/UniChem/data/table_dumps/ |
| DrugBank | https://go.drugbank.com/releases/5-1-10 |
| Comparative Toxicogenomics Database | https://ctdbase.org/downloads/ |
| Cochrane Library | https://www.cochranelibrary.com/central |
| GSE65185 | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE65185 |
| DisGeNET | https://www.disgenet.org/download/sqlite/current/2020 |