User Guide #3: Downloading and Joining Metadata Files
All data uploaded to the portal is associated with metadata. Not exactly sure what metadata really is? Read up on that here.
This user guide covers how to download metadata files associated with a specific study, and how to join those metadata files with the data files of that study (programmatically). We will use the Mouse M005 Metabolomics Study to illustrate.
This example should be helpful if you’re trying to answer a question such as, “Where can I find sex, age, or tissue type among these metadata files?” for human data, or, “Where can I find species, life stage, or taxon among these metadata files?” for non-human data.
The following can be done using the R client or Python client. Both sets of instructions are included below.
How to find and download metadata files
Once you have these files downloaded, did you know that you can join them together and match them to their associated data files? You can find instructions on how to do this programmatically below.
How to read metadata files
At this point, you should know that these metadata files are presented in three types: individual, biospecimen, and assay type. Let’s say you are looking for a specific subset of metadata information, such as sex or tissue type. How do you do that?
First, determine which file your information of interest would be stored in. For reference, see What is contained in each metadata file? Let’s use the examples of sex and tissue type. Sex is associated with the individual, so it would be located in the individual metadata file, while tissue type is associated with the specimen, so it would be located in the biospecimen metadata file.
To find these values, open each downloaded CSV file and find the column that holds the value of interest (for example, sex or tissue).
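The same lookup can be done programmatically. The snippet below is a minimal sketch in Python (pandas): it reads two small in-memory tables that stand in for the downloaded CSV files and reports which table holds each column of interest. The sample rows and the helper name find_column are illustrative assumptions, not part of the portal's files.

```python
import io

import pandas as pd

# Tiny stand-ins for the downloaded CSV files (made-up rows, for illustration only).
individual_csv = io.StringIO(
    "individualID,sex,species\nM1,female,Mus musculus\nM2,male,Mus musculus\n"
)
biospecimen_csv = io.StringIO(
    "specimenID,individualID,tissue\nS1,M1,liver\nS2,M2,plasma\n"
)

tables = {
    "individual": pd.read_csv(individual_csv, dtype=str),
    "biospecimen": pd.read_csv(biospecimen_csv, dtype=str),
}


def find_column(tables, column):
    """Return the names of the tables that contain the given column."""
    return [name for name, df in tables.items() if column in df.columns]


sex_location = find_column(tables, "sex")
tissue_location = find_column(tables, "tissue")
print(sex_location, tissue_location)
```

With real files, replace the io.StringIO objects with the paths of the downloaded CSVs.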
Instructions for using R client
How to join multiple files
In short, metadata files of the individual, biospecimen, and assay type can be joined together on the keys individualID and specimenID. Below are instructions on how to use R software to join multiple files.
First, you need the metadata files (see the instructions above).
Ensure you have installed and loaded the synapser and tidyverse packages in R to download data and perform data frame manipulations:
install.packages("synapser", repos = c("http://ran.synapse.org"))
install.packages(c("tidyverse"))
library(synapser)
library(tidyverse)
Ensure you are logged in to Synapse:
synLogin()
Use read_csv to read in the individual, biospecimen, and assay metadata files that you downloaded.
# Individual metadata
individual_metadata <- read_csv("files/individual_non_human_M005_Longevity Consortium_11-11-2024_final.csv", show_col_types = FALSE)
# Biospecimen metadata
biospecimen_metadata <- read_csv("files/biospecimen_non_human_M005_Longevity Consortium_11-11-2024_final.csv", show_col_types = FALSE)
# Assay metadata
assay_metadata <- read_csv("files/synapse_storage_manifest_assaymetabolomicstemplate.csv", show_col_types = FALSE)
Join the metadata using left_join, matching on specimenID, then on individualID:
joined_metadata <- assay_metadata %>%
  # join rows from biospecimen metadata that match on specimenID
  left_join(biospecimen_metadata, by = "specimenID") %>%
  # join rows from individual metadata that match on individualID
  left_join(individual_metadata, by = "individualID")
You have now bulk downloaded and joined metadata files!
Instructions for using Python client
How to import libraries and log in to Synapse
Make sure the pandas and synapseclient Python libraries are installed. Import both libraries and create a Synapse object, syn.
You will need a Synapse account to access Synapse data with synapseclient. You can supply credentials directly, for example syn.login("username", "password"), or store them in a local .synapseConfig file. The example below passes a personal access token through the authToken argument.
import pandas as pd
import synapseclient
syn = synapseclient.Synapse()
syn.login(authToken="")  # paste your Synapse personal access token between the quotes
How to download the metadata files
The three files needed for this tutorial are:
Individual Metadata
Biospecimen Metadata
Assay Metadata
Files can be downloaded manually through the ELITE Portal, or you can use the code below to download them with the Synapse Python client.
query = syn.tableQuery("SELECT * FROM syn52234677 WHERE ( ( \"Study\" = 'Mouse_M005_Study_Metabolomics' ) ) AND ( `resourceType` = 'metadata' )")
query.asDataFrame()
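# We use the downloadLocation argument to specify the download directory.
# (https://python-docs.synapse.org/build/html/index.html?highlight=get#synapseclient.Synapse.get)
# The Synapse IDs and file names below are examples; substitute the IDs that the query above returns for your study.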
individual_human_metadata = syn.get("syn10930250", downloadLocation = "./metadata") # LLFS_individual_human_metadata.csv
biospecimen_metadata = syn.get("syn21522653", downloadLocation = "./metadata") # LLFS_biospecimen_metadata.csv
RNAseq_metadata = syn.get("syn21499318", downloadLocation = "./metadata") # LLFS_assay_RNAseq_metadata.csv
snpArray_metadata = syn.get("syn21499317", downloadLocation = "./metadata") # LLFS_assay_snpArray_metadata.csv
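When there are several files to fetch, the repeated syn.get calls can be wrapped in a loop. The following is a minimal sketch, not part of the Synapse client: the helper name download_files is hypothetical, and it assumes syn is a logged-in synapseclient.Synapse object whose get method returns an entity with a path attribute pointing at the downloaded file.

```python
import os


def download_files(syn, synapse_ids, download_dir="./metadata"):
    """Download each Synapse ID into download_dir and return the local file paths.

    `syn` is assumed to be a logged-in synapseclient.Synapse object.
    """
    os.makedirs(download_dir, exist_ok=True)  # make sure the target directory exists
    paths = []
    for synapse_id in synapse_ids:
        entity = syn.get(synapse_id, downloadLocation=download_dir)
        paths.append(entity.path)  # local path of the downloaded file
    return paths
```

For example, download_files(syn, ["syn10930250", "syn21522653"]) would fetch the first two files listed above into ./metadata.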
How to read data from CSVs into pandas dataframes
individual_metadata = pd.read_csv("files/individual_non_human_M005_Longevity Consortium_11-11-2024_final.csv", dtype=str)
biospecimen_metadata = pd.read_csv("files/biospecimen_non_human_M005_Longevity Consortium_11-11-2024_final.csv", dtype=str)
assay_metadata = pd.read_csv("files/synapse_storage_manifest_assaymetabolomicstemplate.csv", dtype=str)
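The dtype=str argument tells pandas to keep every column as text. One likely reason to do this for metadata (an assumption; the guide does not state its motivation) is that identifier columns then keep their exact spelling, which matters when they are later used as join keys. A small sketch with made-up IDs shows the difference:

```python
import io

import pandas as pd

# Made-up IDs with leading zeros, for illustration only.
csv_text = "specimenID,individualID\n0012,007\n0013,008\n"

inferred = pd.read_csv(io.StringIO(csv_text))           # pandas guesses the column types
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str)  # every column is kept as text

print(inferred["specimenID"].tolist())  # [12, 13]
print(as_text["specimenID"].tolist())   # ['0012', '0013']
```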
How to join data
# Left-join biospecimen (on specimenID), then individual (on individualID), metadata onto the assay metadata
Joined_metadata = (
    assay_metadata
    .merge(biospecimen_metadata, on="specimenID", how="left")
    .merge(individual_metadata, on="individualID", how="left")
)
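To see what this join does, here is a minimal sketch on toy data using pandas merge, keyed on specimenID and then individualID as described earlier in this guide. The rows, the platform column, and the biospecimen_match column name are made up for illustration. The indicator argument is an optional extra that flags assay rows with no matching biospecimen record.

```python
import pandas as pd

# Toy tables standing in for the three metadata files (made-up values).
assay = pd.DataFrame(
    {"specimenID": ["S1", "S2", "S3"], "platform": ["LC-MS", "LC-MS", "LC-MS"]}
)
biospecimen = pd.DataFrame(
    {"specimenID": ["S1", "S2"], "individualID": ["M1", "M2"], "tissue": ["liver", "plasma"]}
)
individual = pd.DataFrame({"individualID": ["M1", "M2"], "sex": ["female", "male"]})

joined = (
    assay
    # keep every assay row; record whether a biospecimen row matched
    .merge(biospecimen, on="specimenID", how="left", indicator="biospecimen_match")
    # then attach the individual-level columns
    .merge(individual, on="individualID", how="left")
)

# Assay rows whose specimenID was missing from the biospecimen table.
unmatched = joined.loc[joined["biospecimen_match"] == "left_only", "specimenID"].tolist()
print(unmatched)
```

Because both merges are left joins starting from the assay table, the result has one row per assay record, and unmatched specimens show up with empty biospecimen and individual columns instead of being dropped.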
How to display joined data
pd.set_option("display.max_rows", 10)  # Change the second argument to None if you wish to display all rows
Joined_metadata  # Run this line to display the individual and biospecimen metadata for each assay record
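If you would rather not change the display setting for the whole session, pandas also provides option_context, which applies an option only inside a with block. A minimal sketch on a made-up 50-row table:

```python
import pandas as pd

# Made-up table with 50 rows, for illustration only.
df = pd.DataFrame({"specimenID": [f"S{i}" for i in range(50)]})

default_max_rows = pd.get_option("display.max_rows")

with pd.option_context("display.max_rows", 10):
    inside = pd.get_option("display.max_rows")  # 10 while inside the block
    preview = repr(df)                          # truncated text view of the table

restored = pd.get_option("display.max_rows")    # back to the previous setting
print(preview)
```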