
User Guide #3: Joining Metadata with Data

All data uploaded to the portal is associated with metadata—read more about that here.

This user guide covers how to download metadata files associated with a specific study, and how to join those metadata files with the data files of that study (programmatically). We’ll use the same study—LLFS—to illustrate.

This example should be helpful if you’re trying to answer a question such as, “Where can I find sex, age, or tissue type among these metadata files?”

The following can be done using either the R client or the Python client. Both sets of instructions are included below.

How to find and download metadata files

Not exactly sure what metadata really is? Read up on that here.

So, you’ve narrowed down your data of interest using some combination of the data exploration and downloading tools above—but how do you find and download the associated metadata?

Metadata files are associated with specific studies. Regardless of how you narrowed down your data of interest, there is only one way to download metadata files. Reference the labelled images below along with the instructions.

  1. Go to Explore → Studies (1)

  2. From the list of studies, click on the name of the study (2) (you can find the study using the filtering tools or search tool)

  3. On the resulting Study Details page (3), click the Study Data tab (4)

  4. Scroll down to the Study Metadata (5) section

  5. Then, filter the metadata down if you wish, and follow the same instructions as explained in Downloading data (above)

[Image: Explore Studies and Study Details pages, with labels (1) to (5)]

For a more specific breakdown of accessing metadata for a particular study, finding specific metadata of interest, and joining metadata files with data for that study, see our user guide #3: Accessing and Joining Metadata Files With Data.

How to find and download metadata files

To find and download the metadata files associated with LLFS, we’ll follow the steps as outlined here:

  1. Go to Explore → Studies and search for LLFS using the search tool in the top right

  2. The LLFS study should appear in the list below—click on it to access its Study Details page

  3. On the resulting page, click the Study Data tab

  4. From the table of contents on the left, click Study Metadata

  5. Download the metadata files using the same instructions as found in user guide #1, step 6

Once you have these files downloaded, did you know that you can join them together and combine them with the data files? You can find instructions on how to do this programmatically below.

How to read metadata files

At this point, you should know that these metadata files are presented in three types: individual, biospecimen, and assay type. Let’s say you are looking for a specific subset of metadata information, such as sex or tissue type. How do you do that?

First, determine which file would store the information you are interested in. For reference, see What is contained in each metadata file? Take sex and tissue type as examples. Sex is a property of the individual, so it is located in the individual metadata file. Tissue type is a property of the specimen, so it is located in the biospecimen metadata file. To find these values, open the relevant downloaded CSV file and look for the column that holds the value (e.g., sex, tissue).
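If you would rather check programmatically which file holds a given column, a minimal sketch in Python might look like the following. The small tables here are made-up stand-ins for the downloaded CSV files, and find_column is a hypothetical helper, not part of any client library.

```python
import pandas as pd

# Hypothetical stand-ins for the downloaded metadata files (values are made up).
metadata = {
    "individual": pd.DataFrame(
        {"individualID": ["ind1", "ind2"], "sex": ["female", "male"]}
    ),
    "biospecimen": pd.DataFrame(
        {
            "individualID": ["ind1", "ind2"],
            "specimenID": ["spec1", "spec2"],
            "tissue": ["blood", "blood"],
        }
    ),
}


def find_column(tables, column):
    """Return the names of the tables that contain the given column."""
    return [name for name, df in tables.items() if column in df.columns]


print(find_column(metadata, "sex"))     # ['individual']
print(find_column(metadata, "tissue"))  # ['biospecimen']
```

With the real files, you would build the dictionary from pd.read_csv calls instead of inline tables.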

Instructions for using R client

How to join multiple files

In short, metadata files of the individual, biospecimen, and assay type can be joined together on the keys individualID and specimenID. Below are instructions on how to use R software to join multiple files.

  1. First, you need the metadata files—see above

  2. Install and load the tidyverse package in R to perform data frame manipulations:

    • install.packages("tidyverse")

    • library(tidyverse)

  3. While reading in each metadata file with read_csv, specify the column types as character with “c”. Consistent column types ensure that common variables can be joined. This code joins data frames using all variables in common across the individual, biospecimen, and assay metadata files. A right_join preserves only the individuals and biospecimens that are characterized in each assay type: RNA-seq and SNP array.

    R
    individual <- read_csv("LLFS_individual_human_metadata.csv",
      col_types = cols(.default = "c")
    )
    biospecimen <- read_csv("LLFS_biospecimen_metadata.csv",
      col_types = cols(.default = "c")
    )
    rnaseq_assay <- read_csv("LLFS_assay_RNAseq_metadata.csv",
      col_types = cols(.default = "c")
    )
    snparray_assay <- read_csv("LLFS_assay_snpArray_metadata.csv",
      col_types = cols(.default = "c")
    )
    RNASeq <- reduce(
      list(individual, biospecimen, rnaseq_assay),
      right_join
    )
    snpArray <- reduce(
      list(individual, biospecimen, snparray_assay),
      right_join
    )

Instructions for using Python client

How to import libraries and log in to Synapse

Make sure the pandas and synapseclient Python libraries are installed. Import both libraries and create a Synapse object, syn.

You will need a Synapse account to access Synapse data with synapseclient. You can supply a username and password, for example `syn.login("username", "password")`, or use a local .synapseConfig file to supply credentials.

PY
# Import libraries
import pandas as pd
import synapseclient

# Create the Synapse client object
syn = synapseclient.Synapse()

# Log in to Synapse
syn.login()

How to download the metadata files

The four files needed for this analysis are:

  1. LLFS_individual_human_metadata.csv : syn10930250

  2. LLFS_biospecimen_metadata.csv : syn21522653

  3. LLFS_assay_RNAseq_metadata.csv : syn21499318

  4. LLFS_assay_snpArray_metadata.csv : syn21499317

Files can be downloaded manually through the ELITE Portal, or you can use the code below to download them with the Synapse Python client. These files contain controlled-access human data and should only be downloaded in a secure environment.

PY
# We use the downloadLocation argument to specify the download directory.
# (https://python-docs.synapse.org/build/html/index.html?highlight=get#synapseclient.Synapse.get)

individual_human_metadata = syn.get("syn10930250", downloadLocation = "./metadata") # LLFS_individual_human_metadata.csv
biospecimen_metadata = syn.get("syn21522653", downloadLocation = "./metadata") # LLFS_biospecimen_metadata.csv
RNAseq_metadata = syn.get("syn21499318", downloadLocation = "./metadata") # LLFS_assay_RNAseq_metadata.csv
snpArray_metadata = syn.get("syn21499317", downloadLocation = "./metadata") # LLFS_assay_snpArray_metadata.csv

How to read data from CSVs into pandas dataframes

PY
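# File names must match the downloaded files exactly (paths are case-sensitive on many systems).
# dtype=str reads every column as text so that shared ID columns have the same type in every table.
individual_human_metadata_df = pd.read_csv("metadata/LLFS_individual_human_metadata.csv", dtype=str)
biospecimen_metadata_df = pd.read_csv("metadata/LLFS_biospecimen_metadata.csv", dtype=str)
RNAseq_metadata_df = pd.read_csv("metadata/LLFS_assay_RNAseq_metadata.csv", dtype=str)
snpArray_metadata_df = pd.read_csv("metadata/LLFS_assay_snpArray_metadata.csv", dtype=str)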
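To illustrate why every column is read as text, here is a small sketch that reads the same CSV content with and without dtype=str. The CSV content and ID values are made up for illustration; the point is only that type inference can change an ID, so the same ID may no longer match across tables.

```python
import io

import pandas as pd

# Hypothetical CSV content; the ID values are made up for illustration.
csv_text = "individualID,age\n007,85\n012,90\n"

# Default behavior: pandas infers a numeric type and drops the leading zeros.
inferred = pd.read_csv(io.StringIO(csv_text))

# With dtype=str the IDs are kept exactly as written in the file.
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(inferred["individualID"].tolist())  # [7, 12]
print(as_text["individualID"].tolist())   # ['007', '012']
```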

How to join data

PY
# Define function to right join multiple dfs
# Right join "preserves only the individuals and biospecimens that are characterized in each assay type: RNA-seq and SNP array."
# https://help.adknowledgeportal.org/apd/Use-Case-%233:-Working-with-File-Annotations-and-Metadata.2426208334.html

def right_join_multiple(left_df_list, right_df):
    """Right join each data frame in left_df_list onto right_df, in order.

    With no "on" argument, pd.merge joins on all columns the two frames share.
    """
    for df in left_df_list:
        right_df = pd.merge(df, right_df, how="right")
    return right_df


# Use function to join biospecimen, individual and RNASeq metadata. Do the same with SNP array data
left_df_list = [biospecimen_metadata_df, individual_human_metadata_df]
Joined_RNAseq_metadata_df = right_join_multiple(left_df_list, RNAseq_metadata_df)
Joined_SNP_metadata_df = right_join_multiple(left_df_list, snpArray_metadata_df)
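To see what the right join keeps and drops, here is a minimal sketch on made-up tables. The values and the platform column are hypothetical, and the join keys are named explicitly here (specimenID, then individualID) rather than inferred from shared columns, which is an assumption about how the real files line up.

```python
import pandas as pd

# Hypothetical metadata tables (values are made up).
individual = pd.DataFrame(
    {"individualID": ["ind1", "ind2", "ind3"], "sex": ["female", "male", "female"]}
)
biospecimen = pd.DataFrame(
    {
        "individualID": ["ind1", "ind2", "ind3"],
        "specimenID": ["spec1", "spec2", "spec3"],
        "tissue": ["blood", "blood", "blood"],
    }
)
# Only two of the three specimens were characterized by this assay.
assay = pd.DataFrame(
    {"specimenID": ["spec1", "spec3"], "platform": ["platformA", "platformA"]}
)

# Join biospecimen onto the assay rows first, then individual onto the result.
with_specimen = pd.merge(biospecimen, assay, how="right", on="specimenID")
joined = pd.merge(individual, with_specimen, how="right", on="individualID")

print(joined["individualID"].tolist())  # ['ind1', 'ind3']
```

The individual ind2 has no row in the assay table, so it does not appear in the joined result.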

How to display joined data

PY
pd.set_option("display.max_rows", 10)   # Change the second argument to None if you wish to display all rows
Joined_RNAseq_metadata_df               # Run this line to display individual and biospecimen metadata that have RNA-seq data
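Once the tables are joined, a question such as "which sexes and tissue types are represented?" can be answered from a single table. The sketch below uses a made-up joined table; the column names sex and tissue follow the examples earlier in this guide, and the values are hypothetical.

```python
import pandas as pd

# Hypothetical joined table (values are made up).
joined = pd.DataFrame(
    {
        "individualID": ["ind1", "ind2", "ind3"],
        "specimenID": ["spec1", "spec2", "spec3"],
        "sex": ["female", "male", "female"],
        "tissue": ["blood", "blood", "blood"],
    }
)

# Keep only the columns of interest, then count rows per sex and tissue.
subset = joined[["individualID", "specimenID", "sex", "tissue"]]
counts = subset.groupby(["sex", "tissue"]).size()
print(counts)
```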
