User Guide #3: Joining Metadata with Data
All data uploaded to the portal is associated with metadata—read more about that here.
This user guide covers how to download metadata files associated with a specific study, and how to join those metadata files with the data files of that study (programmatically). We’ll use the same study—LLFS—to illustrate.
This example should be helpful if you’re trying to answer a question such as, “Where can I find sex, age, or tissue type among these metadata files?”
The following can be done using either the R client or the Python client. Both sets of instructions are included below.
How to find and download metadata files
Not exactly sure what metadata really is? Read up on that here.
So, you’ve narrowed down your data of interest using some combination of the data exploration and downloading tools above—but how do you find and download the associated metadata?
Metadata files are associated with specific studies. Regardless of how you narrowed down your data of interest, there is only one way to download metadata files. Reference the labelled images below along with the instructions.
Go to Explore → Studies (1)
From the list of studies, click on the name of the study (2) (you can find the study using the filtering tools or search tool)
On the resulting Study Details page (3), click the Study Data tab (4)
Scroll down to the Study Metadata (5) section
Then, filter the metadata down if you wish, and follow the same instructions as explained in Downloading data (above)
How to find and download metadata files
To find and download the metadata files associated with LLFS, we’ll follow the steps as outlined here:
Go to Explore → Studies and search for LLFS using the search tool in the top right
The LLFS study should appear in the list below—click on it to access its Study Details page
On the resulting page, click the Study Data tab
From the table of contents on the left, click Study Metadata
Download the metadata files using the same instructions as found in user guide #1, step 6
Once you have these files downloaded, did you know that you can join them together and combine them with the data files? You can find instructions on how to do this programmatically below.
How to read metadata files
At this point, you should know that these metadata files are presented in three types: individual, biospecimen, and assay type. Let’s say you are looking for a specific subset of metadata information, such as sex or tissue type. How do you do that?
First, determine which file your information of interest would be stored in. For reference, see What is contained in each metadata file? Let’s use the examples of sex and tissue type. Sex is associated with the individual, so it is located in the individual metadata file, while tissue type is associated with the specimen, so it is located in the biospecimen metadata file. To find these values, open each of the downloaded CSV files and find the column that represents the value (e.g., sex, tissue).
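As a minimal sketch of that lookup, the snippet below scans the header row of each CSV and reports which files contain a given column. The helper name `find_column` is hypothetical, and the tiny in-memory tables stand in for the downloaded files. The column names `sex` and `tissue` follow the examples above, so check the actual headers in your own downloads.

```python
import csv
import io


def find_column(files, column):
    """Return the names of the files whose header row contains `column`.

    `files` maps a file name to an open text stream. This is a hypothetical
    helper for illustration, not part of any portal tooling.
    """
    matches = []
    for name, stream in files.items():
        header = next(csv.reader(stream), [])
        if column in header:
            matches.append(name)
    return matches


# Toy stand-ins for the downloaded metadata files (all values are invented).
individual_csv = "individualID,sex\nA1,female\n"
biospecimen_csv = "individualID,specimenID,tissue\nA1,S1,blood\n"


def open_toy_files():
    """Return fresh streams so each search starts at the header row."""
    return {
        "individual": io.StringIO(individual_csv),
        "biospecimen": io.StringIO(biospecimen_csv),
    }


sex_files = find_column(open_toy_files(), "sex")
tissue_files = find_column(open_toy_files(), "tissue")
print(sex_files, tissue_files)
```

With real downloads, the streams would come from `open(path, newline="")` instead of `io.StringIO`.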
Instructions for using R client
How to join multiple files
In short, metadata files of the individual, biospecimen, and assay type can be joined together on the keys individualID and specimenID. Below are instructions on how to use R software to join multiple files.
First, you need the metadata files—see above
Install and load the tidyverse package in R to perform data frame manipulations:
install.packages("tidyverse")
library(tidyverse)
While reading in each metadata file with read_csv, specify the column types as character with “c”. Consistent column types ensure that common variables can be joined. This code joins data frames using all variables in common across the individual, biospecimen, and assay metadata files. A right_join preserves only the individuals and biospecimens that are characterized in each assay type: RNA-seq and SNP array.
individual <- read_csv("LLFS_individual_human_metadata.csv", col_types = cols(.default = "c"))
biospecimen <- read_csv("LLFS_biospecimen_metadata.csv", col_types = cols(.default = "c"))
rnaseq_assay <- read_csv("LLFS_assay_RNAseq_metadata.csv", col_types = cols(.default = "c"))
snparray_assay <- read_csv("LLFS_assay_snpArray_metadata.csv", col_types = cols(.default = "c"))
RNASeq <- reduce(list(individual, biospecimen, rnaseq_assay), right_join)
snpArray <- reduce(list(individual, biospecimen, snparray_assay), right_join)
Instructions for using Python client
How to import libraries and log in to Synapse
Make sure the pandas and synapseclient Python libraries are installed. Import both libraries and create a Synapse object, syn.
You will need a Synapse account to access Synapse data with synapseclient. You can supply a username and password, e.g. `syn.login("username", "password")`, or use a local .synapseConfig file to supply credentials.
# Import libraries
import pandas as pd
import synapseclient
# Create the Synapse client object
syn = synapseclient.Synapse()
# log in to Synapse
syn.login()
How to download the metadata files
The four files needed for this analysis are:
LLFS_individual_human_metadata.csv : syn10930250
LLFS_biospecimen_metadata.csv : syn21522653
LLFS_assay_RNAseq_metadata.csv : syn21499318
LLFS_assay_snpArray_metadata.csv : syn21499317
Files can be downloaded manually through the ELITE Portal, or you can use the code below to download them with the Synapse Python client. These files contain controlled-access human data and should only be downloaded in a secure environment.
# We use the downloadLocation argument to specify the download directory.
# (https://python-docs.synapse.org/build/html/index.html?highlight=get#synapseclient.Synapse.get)
individual_human_metadata = syn.get("syn10930250", downloadLocation = "./metadata") # LLFS_individual_human_metadata.csv
biospecimen_metadata = syn.get("syn21522653", downloadLocation = "./metadata") # LLFS_biospecimen_metadata.csv
RNAseq_metadata = syn.get("syn21499318", downloadLocation = "./metadata") # LLFS_assay_RNAseq_metadata.csv
snpArray_metadata = syn.get("syn21499317", downloadLocation = "./metadata") # LLFS_assay_snpArray_metadata.csv
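Rather than typing each file name again when loading the files, one option is to read from the local path recorded on the returned object. The sketch below assumes that the object returned by `syn.get` exposes the downloaded file's location as a `path` attribute. `read_metadata` is a hypothetical helper, and a stand-in object is used so the example runs without a Synapse login.

```python
import os
import tempfile
from types import SimpleNamespace

import pandas as pd


def read_metadata(entity):
    """Read a downloaded metadata CSV as strings, using the entity's local path."""
    return pd.read_csv(entity.path, dtype=str)


# Write a small invented CSV to a temporary directory.
tmp_dir = tempfile.mkdtemp()
toy_path = os.path.join(tmp_dir, "toy_individual_metadata.csv")
with open(toy_path, "w") as handle:
    handle.write("individualID,sex\n001,female\n002,male\n")

# Stand-in for a downloaded file entity (assumption: real entities expose `.path`).
toy_entity = SimpleNamespace(path=toy_path)
toy_df = read_metadata(toy_entity)
print(toy_df)
```

With a real download, the call would be `read_metadata(individual_human_metadata)`, which avoids mismatches between the downloaded file name and a hand-typed one.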
How to read data from CSVs into pandas dataframes
# File names must match the downloaded files exactly (names are case-sensitive on many systems)
individual_human_metadata_df = pd.read_csv("metadata/LLFS_individual_human_metadata.csv", dtype=str)
biospecimen_metadata_df = pd.read_csv("metadata/LLFS_biospecimen_metadata.csv", dtype=str)
RNAseq_metadata_df = pd.read_csv("metadata/LLFS_assay_RNAseq_metadata.csv", dtype=str)
snpArray_metadata_df = pd.read_csv("metadata/LLFS_assay_snpArray_metadata.csv", dtype=str)
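The `dtype=str` argument matters because identifier columns can otherwise be inferred as numbers in one file and as text in another, which changes the values and can break a join. A small illustration with invented IDs (the toy table and its column names are assumptions for the example):

```python
import io

import pandas as pd

toy_csv = "individualID,age\n001,70\n002,85\n"

# Default type inference turns the ID column into integers and drops the leading zeros.
inferred = pd.read_csv(io.StringIO(toy_csv))

# Reading everything as strings keeps the IDs exactly as written in the file.
as_text = pd.read_csv(io.StringIO(toy_csv), dtype=str)

print(inferred["individualID"].tolist())  # [1, 2]
print(as_text["individualID"].tolist())   # ['001', '002']
```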
How to join data
# Define function to right join multiple dfs
# Right join "preserves only the individuals and biospecimens that are characterized in each assay type: RNA-seq and SNP array."
# https://help.adknowledgeportal.org/apd/Use-Case-%233:-Working-with-File-Annotations-and-Metadata.2426208334.html
def right_join_multiple(left_df_list, right_df):
    for df in left_df_list:
        right_df = pd.merge(df, right_df, how='right')
    return right_df
# Use function to join biospecimen, individual and RNASeq metadata. Do the same with SNP array data
left_df_list = [biospecimen_metadata_df, individual_human_metadata_df]
Joined_RNAseq_metadata_df = right_join_multiple(left_df_list, RNAseq_metadata_df)
Joined_SNP_metadata_df = right_join_multiple(left_df_list, snpArray_metadata_df)
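To see what the right join keeps, here is a toy run with invented rows. It assumes, as described earlier in this guide, that the files share the keys `individualID` and `specimenID`. The sketch passes `on=` explicitly, whereas `pd.merge` without `on` joins on every column name the two frames have in common. The `platform` column and its values are made up for illustration.

```python
import pandas as pd

individual = pd.DataFrame(
    {"individualID": ["A1", "A2", "A3"], "sex": ["female", "male", "female"]}
)
biospecimen = pd.DataFrame(
    {
        "individualID": ["A1", "A2", "A3"],
        "specimenID": ["S1", "S2", "S3"],
        "tissue": ["blood", "blood", "blood"],
    }
)
# Only two of the three specimens were assayed in this toy example.
assay = pd.DataFrame(
    {"specimenID": ["S1", "S3"], "platform": ["toyPlatform", "toyPlatform"]}
)

# A right join keeps every row of the right-hand frame, so the result is
# limited to specimens (and their individuals) that appear in the assay table.
specimens_assayed = pd.merge(biospecimen, assay, how="right", on="specimenID")
joined = pd.merge(individual, specimens_assayed, how="right", on="individualID")

print(joined[["individualID", "specimenID", "sex", "tissue"]])
```

Individual A2 drops out of the result because specimen S2 has no row in the assay table.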
How to display joined data
pd.set_option("display.max_rows", 10) #Change the second argument to None (without quotes) if you wish to display all rows
Joined_RNAseq_metadata_df #Run this line to display individual and biospecimen metadata that have RNA Seq data
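Once the tables are joined, answering a question like the one at the top of this guide comes down to selecting columns. A minimal sketch on an invented table follows; the column names `sex` and `tissue` are assumptions based on the examples in this guide, so substitute the headers found in your own files.

```python
import pandas as pd

# Invented stand-in for a joined metadata table.
joined_toy = pd.DataFrame(
    {
        "individualID": ["A1", "A3", "A4"],
        "specimenID": ["S1", "S3", "S4"],
        "sex": ["female", "male", "female"],
        "tissue": ["blood", "blood", "saliva"],
    }
)

# Keep only the identifier columns and the fields of interest.
columns_of_interest = ["individualID", "specimenID", "sex", "tissue"]
subset = joined_toy[columns_of_interest]

# Count specimens for each combination of sex and tissue.
counts = subset.groupby(["sex", "tissue"]).size()
print(counts)
```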