Example Walkthrough

The CuBIDS workflow is currently being used in neuroimaging labs at a number of institutions, including the University of Pennsylvania, the Children's Hospital of Philadelphia, the Child Mind Institute, and the University of Minnesota's Masonic Institute for the Developing Brain. To demonstrate the utility of CuBIDS, here we apply the software to a small example dataset that is included in our GitHub repository. This example dataset does not contain any PHI.

Following the instructions on the installation page, you should have CuBIDS, DataLad, and the bids-validator installed inside a conda environment named "cubids". In this example, we use bids-validator version 1.7.2; a different validator version may produce slightly different validation TSV outputs, but the example should still be useful.
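If you are setting up from scratch, the commands below are one possible route. This is a minimal sketch rather than the authoritative procedure (see the installation page for that): it assumes conda and npm are available, that CuBIDS and DataLad install from PyPI, and that git-annex (required by DataLad) is already on your system.

$ conda create -n cubids python=3.10 -y  # Python version is an assumption; any recent version should work
$ conda activate cubids
$ pip install cubids datalad  # assumes git-annex is installed separately for DataLad
$ npm install -g bids-validator@1.7.2  # pin the validator version used in this example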

Throughout this example, we use DataLad for version control. Although DataLad is an optional dependency of CuBIDS, we use it here to demonstrate its powerful integration with CuBIDS and the benefits it can provide its users.

Now that we have installed CuBIDS and all necessary dependencies, we are ready to begin the curation process on our example dataset. We create a CuBIDS_Test directory to function as our working directory and navigate to it as follows.

$ mkdir $PWD/CuBIDS_Test
$ cd CuBIDS_Test
$ conda activate cubids

Now, download and unzip the example data (you can also do this in your Finder window):

$ curl -sSLO https://github.com/PennLINC/CuBIDS/raw/main/cubids/testdata/BIDS_Dataset.zip
$ unzip BIDS_Dataset.zip
$ rm BIDS_Dataset.zip

Identifying and removing PHI

As a first step, we use CuBIDS to identify the metadata fields present in the dataset, and remove any protected health information (PHI) or other sensitive fields. We want to do this before implementing any DataLad commands, as we must ensure PHI is not tracked as part of version control.

This is accomplished with the following command:

$ cubids print-metadata-fields BIDS_Dataset

This command returns a total of 66 fields, including acquisition parameters and other metadata fields present in the dataset's JSON sidecars. From the output we can see that the dataset contains (simulated) PHI: the PatientName field, which we wish to remove. To remove the PatientName field from the sidecars, we can use the command:

$ cubids remove-metadata-fields BIDS_Dataset --fields PatientName

This command should succeed silently.
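To confirm that the field is gone, one option (assuming standard Unix tools) is to re-print the metadata fields and search for PatientName; the grep should return no matches:

$ cubids print-metadata-fields BIDS_Dataset | grep PatientName  # no output means the field was removed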

Checking the BIDS dataset into DataLad

Now that all PHI has been removed from the metadata, we are ready to check our dataset into DataLad. To do this, we run the following command:

$ datalad create -c text2git BIDS_Dataset_DataLad

This command creates a new directory called BIDS_Dataset_DataLad, where DataLad will implement version control and provenance tracking while we carry out the rest of the CuBIDS workflow. The creation of the DataLad dataset is reflected in its version history, accessible with git log. At any point in the CuBIDS workflow, we can view a summary of our dataset's version history by running the following commands:

$ cd BIDS_Dataset_DataLad
$ git log --oneline
$ cd ..

These commands print the following to the terminal:

[Screenshot: _images/screenshot_1.png, git log output after creating the DataLad dataset]

Next, we copy the contents of our BIDS dataset into the newly created and currently empty DataLad dataset and save the changes.

$ cp -r BIDS_Dataset/* BIDS_Dataset_DataLad

In addition to being able to access the version history of our data at any point in this workflow, we can also check the status of untracked (not yet saved) changes using the datalad status command, as seen below:

$ cd BIDS_Dataset_DataLad && datalad status
$ cd ..

This command produces a description of the changes we have made to the data since the last commit (see below).

[Screenshot: _images/screenshot_2.png, datalad status output showing untracked files]

The output above shows all files as untracked, because we have copied the BIDS data into ~/CuBIDS_Test/BIDS_Dataset_DataLad but have not yet saved those changes. Our next step is to run datalad save. It is best practice to provide a detailed commit message, for example:

$ datalad save -d ~/CuBIDS_Test/BIDS_Dataset_DataLad -m "checked dataset into datalad"

At this stage, we also recommend removing the original BIDS_Dataset directory; its contents have been safely copied into, and are tracked in, BIDS_Dataset_DataLad. One way to remove it (a standard shell command, run from the CuBIDS_Test directory) is:
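$ rm -rf BIDS_Dataset

To confirm that everything is tracked, we can check the git history, which displays the version history of our dataset thus far: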

$ cd BIDS_Dataset_DataLad/
$ git log --oneline
$ cd ..

These commands produce the following output:

[Screenshot: _images/screenshot_3.png, git log showing the save commit]

As seen above, both the creation of the DataLad dataset and the initial check-in of the BIDS data are now reflected in the dataset's version control history.

Adding NIfTI Information to JSON Sidecars

Next, we add more image parameters to our sidecars so that we can better define our Key Groups. Historically, only a subset of the parameters in the NIfTI image header have been included in BIDS sidecars. Parameters such as image dimensions, number of volumes, image obliquity, and voxel sizes are all important data that can change how our pipelines will eventually run.

To add them to the sidecar metadata, run:

$ cubids add-nifti-info BIDS_Dataset_DataLad --use-datalad

This command adds the NIfTI header information to the JSON sidecars and saves those changes. To ensure that the command has executed properly, we can run cubids print-metadata-fields once more; the new NIfTI header fields now appear in the metadata. Because we ran cubids add-nifti-info with the --use-datalad flag, CuBIDS automatically saves the changes to the dataset, as reflected in the git log:

[Screenshot: _images/screenshot_4.png, git log entry for add-nifti-info]
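As a quick spot-check (assuming standard Unix tools, and assuming the added fields carry names such as NumVolumes and Obliquity, as they appear in the grouping outputs later in this walkthrough), we can filter the printed field list:

$ cubids print-metadata-fields BIDS_Dataset_DataLad | grep -E 'NumVolumes|Obliquity'  # should now list the NIfTI-derived fields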

BIDS validation

The next step in the CuBIDS workflow is to run BIDS validation to detect potential curation errors using cubids validate.

$ cubids validate BIDS_Dataset_DataLad v0 --sequential

Note

The --sequential flag forces the validator to treat each participant as its own BIDS dataset. This can be helpful for identifying heterogeneous elements, but can be slow for extremely large datasets.

This command produces the following TSV:

v0_validation.tsv

| files | type | severity | description | code | url | subject |
| --- | --- | --- | --- | --- | --- | --- |
| /sub-02/ses-phdiff/func/sub-02_ses-phdiff_task-rest_bold.nii.gz | PHASE_ENCODING_DIRECTION_NOT_DEFINED | warning | You should define 'PhaseEncodingDirection' for this file. If you don't provide this information field map correction will not be possible. | 7 | https://neurostars.org/search?q=PHASE_ENCODING_DIRECTION_NOT_DEFINED | sub-02 |
| /sub-03/ses-phdiff/dwi/sub-03_ses-phdiff_acq-HASC55AP_dwi.nii.gz | TOTAL_READOUT_TIME_NOT_DEFINED | warning | You should define 'TotalReadoutTime' for this file. If you don't provide this information field map correction using TOPUP might not be possible. | 9 | https://neurostars.org/search?q=TOTAL_READOUT_TIME_NOT_DEFINED | sub-03 |

This initial validation run reveals that Phase Encoding Direction (PED) is not specified for one of the task-rest BOLD scans. This is an important parameter for fieldmap correction in fMRIPrep, so knowing about the omission ahead of time is valuable. To resolve it, we could either find the PED for this scan elsewhere and add it to the sidecar, or remove the scan from the dataset. For the purposes of this demonstration, we elect to remove the scan. To do this, we run the cubids purge command.

cubids purge requires as input a list of files to cleanly "purge" from the dataset. You can create this file in any text editor, as long as it is saved as a plain-text .txt file. For this example, we created the following file:

$ cat no_ped.txt

/AN/EXAMPLE/PATH/CuBIDS_Test/BIDS_Dataset_DataLad/sub-02/ses-phdiff/func/sub-02_ses-phdiff_task-rest_bold.nii.gz

and saved it in our CuBIDS_Test directory.
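Equivalently, the file can be written from the shell; substitute your real absolute path for the example placeholder:

$ echo "/AN/EXAMPLE/PATH/CuBIDS_Test/BIDS_Dataset_DataLad/sub-02/ses-phdiff/func/sub-02_ses-phdiff_task-rest_bold.nii.gz" > no_ped.txt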

To safely purge this file from the dataset, run:

$ cubids purge BIDS_Dataset_DataLad no_ped.txt --use-datalad

We use cubids purge instead of simply deleting the scan because purge ensures that all associations, such as sidecars and IntendedFor references in fieldmaps, are also safely removed. CuBIDS reflects these deletions in the git history:

[Screenshot: _images/screenshot_5.png, git log entry for the purge]

Returning to v0_validation.tsv, we can also see that one DWI scan is missing TotalReadoutTime, a metadata field necessary for fieldmap correction. After conferring with our MR physicist and the scanner technician, we determine that TotalReadoutTime (TRT) was erroneously omitted from the DWI sidecars. The technician provides us with the correct value, so we can manually add it to the affected sidecar: open BIDS_Dataset_DataLad/sub-03/ses-phdiff/dwi/sub-03_ses-phdiff_acq-HASC55AP_dwi.json in an editor and add the following line:

"TotalReadoutTime": 0.0717598,

Add the line anywhere inside the curly braces, alongside the other parameter-value pairs, then save and close the JSON file.
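Alternatively, if jq is installed (an optional convenience, not a CuBIDS dependency), the same edit can be made from the shell:

$ jq '. + {"TotalReadoutTime": 0.0717598}' BIDS_Dataset_DataLad/sub-03/ses-phdiff/dwi/sub-03_ses-phdiff_acq-HASC55AP_dwi.json > tmp.json && mv tmp.json BIDS_Dataset_DataLad/sub-03/ses-phdiff/dwi/sub-03_ses-phdiff_acq-HASC55AP_dwi.json

Either way, we then save the latest changes to the dataset with a detailed commit message: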

$ datalad save -d BIDS_Dataset_DataLad -m "Added TotalReadoutTime to sub-03_ses-phdiff_acq-HASC55AP_dwi.json"

This change will be reflected in the git history.

[Screenshot: _images/screenshot_6.png, git log entry for the TotalReadoutTime fix]

To verify that there are no remaining validation errors, we rerun validation with the following command:

$ cubids validate BIDS_Dataset_DataLad v1 --sequential

This command should produce no TSV output; instead, it prints "No issues/warnings parsed, your dataset is BIDS valid." to the terminal, indicating that the dataset is now free of BIDS validation errors and warnings.

Visualizing metadata heterogeneity

Next, we'll use CuBIDS to gain insight into the dataset's structure, heterogeneity, and metadata errors. We'll do this with cubids group.

Large datasets almost inevitably contain multiple validation and metadata errors, so it is useful to run cubids validate and cubids group in parallel, as validation errors are better understood in the context of a dataset's heterogeneity. Seeing the metadata errors that grouping reveals (such as missing or incorrectly specified sidecar parameters) alongside the BIDS errors that the validator catches gives users a more comprehensive view of the issues they will need to fix during curation. Note that if users pass in just a filename prefix (e.g., v0) as the second argument, CuBIDS will place the four grouping outputs in bids_dir/code/CuBIDS; if they provide a path (e.g., /Users/scovitz/BIDS/V1), the output files will go to the specified location. An example command for running the grouping function is as follows:

$ cubids group BIDS_Dataset_DataLad v0

This command will produce four tables that describe the dataset’s heterogeneity in different ways.

  1. v0_summary.tsv contains all detected Key and Parameter groups and provides a high-level overview of the heterogeneity in the entire dataset.

  2. v0_files.tsv maps each imaging file in the BIDS directory to a Key and Parameter group.

  3. v0_AcqGrouping.tsv maps each session in the dataset to an Acquisition Group.

  4. v0_AcqGroupInfo.txt lists the set of scanning parameters present in each Acquisition Group.

By first examining v0_summary.tsv, users are given the opportunity to conduct metadata quality assurance (QA). The file can help identify instances of incomplete, incorrect, or unusable parameter groups, based on acquisition fields such as dimension and voxel sizes, number of volumes, obliquity, and more.
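Because the summary TSV is wide, it can help to skim a few columns at the terminal. One possible one-liner, assuming standard Unix tools, the default output location, and the column order shown in the table below (adjust the field numbers if your CuBIDS version orders columns differently):

$ cut -f5,7,8,19 BIDS_Dataset_DataLad/code/CuBIDS/v0_summary.tsv | column -t -s$'\t'  # KeyParamGroup, ParamGroup, Counts, NumVolumes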

While v0_validation.tsv identified all the BIDS validation errors present in the dataset, it did not identify potential issues within the sidecars' metadata. Below, we see instances of missing metadata fields in a handful of sidecars, which may impact successful execution of BIDS Apps.

v0_summary.tsv

(Empty cells indicate parameters that are not defined for that group.)

| Notes | ManualCheck | MergeInto | RenameKeyGroup | KeyParamGroup | KeyGroup | ParamGroup | Counts | Dim1Size | Dim2Size | Dim3Size | EchoTime | EffectiveEchoSpacing | FlipAngle | HasFieldmap | KeyGroupCount | Modality | NSliceTimes | NumVolumes | Obliquity | PartialFourier | PhaseEncodingDirection | RepetitionTime | TotalReadoutTime | UsedAsFieldmap | VoxelSizeDim1 | VoxelSizeDim2 | VoxelSizeDim3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | | | | datatype-anat_suffix-T1w__1 | datatype-anat_suffix-T1w | 1 | 4 | 180 | 216 | 180 | 0.0029 | | 8 | False | 4 | anat | 0 | 1.0 | False | | i- | 2.5 | | False | 1.0 | 1.0 | 1.0 |
| | | | | acquisition-HASC55AP_datatype-dwi_suffix-dwi__1 | acquisition-HASC55AP_datatype-dwi_suffix-dwi | 1 | 2 | 36 | 43 | 36 | 0.089 | 0.00069 | 90 | True | 4 | dwi | 36 | 61.0 | False | | j | 4.2 | 0.072 | False | 5.0 | 5.0 | 5.0 |
| | | | acquisition-HASC55APVARIANTNumVolumes_datatype-dwi_suffix-dwi | acquisition-HASC55AP_datatype-dwi_suffix-dwi__2 | acquisition-HASC55AP_datatype-dwi_suffix-dwi | 2 | 1 | 36 | 43 | 36 | 0.089 | 0.00069 | 90 | True | 4 | dwi | 36 | 10.0 | False | | j | 4.2 | 0.072 | False | 5.0 | 5.0 | 5.0 |
| | | | acquisition-HASC55APVARIANTEchoTime_datatype-dwi_suffix-dwi | acquisition-HASC55AP_datatype-dwi_suffix-dwi__3 | acquisition-HASC55AP_datatype-dwi_suffix-dwi | 3 | 1 | 36 | 43 | 36 | 0.07 | 0.00069 | 90 | True | 4 | dwi | 36 | 61.0 | False | | j | 4.2 | 0.072 | False | 5.0 | 5.0 | 5.0 |
| | | | | acquisition-v4_datatype-fmap_fmap-magnitude1_suffix-magnitude1__1 | acquisition-v4_datatype-fmap_fmap-magnitude1_suffix-magnitude1 | 1 | 4 | 36 | 43 | 36 | 0.004 | | 60 | False | 4 | fmap | 0 | 1.0 | False | 0.75 | j- | 1.5 | | False | 5.0 | 5.0 | 5.0 |
| | | | | acquisition-v4_datatype-fmap_fmap-magnitude2_suffix-magnitude2__1 | acquisition-v4_datatype-fmap_fmap-magnitude2_suffix-magnitude2 | 1 | 4 | 36 | 43 | 36 | 0.006 | | 60 | False | 4 | fmap | 0 | 1.0 | False | 0.75 | j- | 1.5 | | False | 5.0 | 5.0 | 5.0 |
| | | | | acquisition-v4_datatype-fmap_fmap-phasediff_suffix-phasediff__1 | acquisition-v4_datatype-fmap_fmap-phasediff_suffix-phasediff | 1 | 4 | 36 | 43 | 36 | | | 60 | False | 4 | fmap | 0 | 1.0 | False | 0.75 | j- | 1.5 | | True | 5.0 | 5.0 | 5.0 |
| | | | | datatype-fmap_direction-PA_fmap-epi_suffix-epi__1 | datatype-fmap_direction-PA_fmap-epi_suffix-epi | 1 | 3 | 36 | 43 | 36 | 0.089 | 0.00069 | 90 | False | 4 | fmap | 36 | 1.0 | False | | j- | 4.2 | 0.072 | True | 5.0 | 5.0 | 5.0 |
| | | | acquisition-VARIANTUnused_datatype-fmap_direction-PA_fmap-epi_suffix-epi | datatype-fmap_direction-PA_fmap-epi_suffix-epi__2 | datatype-fmap_direction-PA_fmap-epi_suffix-epi | 2 | 1 | 36 | 43 | 36 | 0.089 | 0.00069 | 90 | False | 4 | fmap | 36 | 1.0 | False | | j- | 4.2 | 0.072 | False | 5.0 | 5.0 | 5.0 |
| | | | | datatype-func_suffix-bold_task-rest__1 | datatype-func_suffix-bold_task-rest | 1 | 2 | 36 | 43 | 36 | 0.089 | 0.00069 | 70 | True | 3 | func | 36 | 61.0 | False | | j | 1.0 | 0.072 | False | 5.0 | 5.0 | 5.0 |
| | | | acquisition-VARIANTFlipAngle_datatype-func_suffix-bold_task-rest | datatype-func_suffix-bold_task-rest__2 | datatype-func_suffix-bold_task-rest | 2 | 1 | 36 | 43 | 36 | 0.089 | 0.00069 | 90 | True | 3 | func | 36 | 61.0 | False | | j | 1.0 | 0.072 | False | 5.0 | 5.0 | 5.0 |

Examining v0_summary.tsv, we can see that one DWI Parameter Group (acquisition-HASC55AP_datatype-dwi_suffix-dwi__2) contains only one scan (see the Counts column) with only 10 volumes (see the NumVolumes column). Since the majority of DWI scans in this dataset have 61 volumes, CuBIDS assigns this scan to a "Variant" (i.e., non-dominant) Parameter Group and automatically populates that group's RenameKeyGroup column in v0_summary.tsv with a suggested name: acquisition-HASC55APVARIANTNumVolumes_datatype-dwi_suffix-dwi. In this case, though, we elect to remove the scan, because it does not have enough volumes to be usable for most analyses. To do so, we could either use cubids purge again, or edit v0_summary.tsv by adding 0 to the MergeInto column of the row (Parameter Group) we want removed. This ensures that all scans in that Parameter Group (in this example, just one scan) are removed.

We make this change and save the edited version as v0_edited_summary.tsv, which we will pass to cubids apply in the next curation step.

v0_edited_summary.tsv

(Identical to v0_summary.tsv except for the 0 added to the MergeInto column of the acquisition-HASC55AP_datatype-dwi_suffix-dwi__2 row.)

| Notes | ManualCheck | MergeInto | RenameKeyGroup | KeyParamGroup | KeyGroup | ParamGroup | Counts | Dim1Size | Dim2Size | Dim3Size | EchoTime | EffectiveEchoSpacing | FlipAngle | HasFieldmap | KeyGroupCount | Modality | NSliceTimes | NumVolumes | Obliquity | PartialFourier | PhaseEncodingDirection | RepetitionTime | TotalReadoutTime | UsedAsFieldmap | VoxelSizeDim1 | VoxelSizeDim2 | VoxelSizeDim3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | | | | datatype-anat_suffix-T1w__1 | datatype-anat_suffix-T1w | 1 | 4 | 180 | 216 | 180 | 0.0029 | | 8 | FALSE | 4 | anat | 0 | 1.0 | FALSE | | i- | 2.5 | | FALSE | 1.0 | 1.0 | 1.0 |
| | | | | acquisition-HASC55AP_datatype-dwi_suffix-dwi__1 | acquisition-HASC55AP_datatype-dwi_suffix-dwi | 1 | 2 | 36 | 43 | 36 | 0.089 | 0.00069 | 90 | TRUE | 4 | dwi | 36 | 61.0 | FALSE | | j | 4.2 | 0.072 | FALSE | 5.0 | 5.0 | 5.0 |
| | | 0 | acquisition-HASC55APVARIANTNumVolumes_datatype-dwi_suffix-dwi | acquisition-HASC55AP_datatype-dwi_suffix-dwi__2 | acquisition-HASC55AP_datatype-dwi_suffix-dwi | 2 | 1 | 36 | 43 | 36 | 0.089 | 0.00069 | 90 | TRUE | 4 | dwi | 36 | 10.0 | FALSE | | j | 4.2 | 0.072 | FALSE | 5.0 | 5.0 | 5.0 |
| | | | acquisition-HASC55APVARIANTEchoTime_datatype-dwi_suffix-dwi | acquisition-HASC55AP_datatype-dwi_suffix-dwi__3 | acquisition-HASC55AP_datatype-dwi_suffix-dwi | 3 | 1 | 36 | 43 | 36 | 0.07 | 0.00069 | 90 | TRUE | 4 | dwi | 36 | 61.0 | FALSE | | j | 4.2 | 0.072 | FALSE | 5.0 | 5.0 | 5.0 |
| | | | | acquisition-v4_datatype-fmap_fmap-magnitude1_suffix-magnitude1__1 | acquisition-v4_datatype-fmap_fmap-magnitude1_suffix-magnitude1 | 1 | 4 | 36 | 43 | 36 | 0.004 | | 60 | FALSE | 4 | fmap | 0 | 1.0 | FALSE | 0.75 | j- | 1.5 | | FALSE | 5.0 | 5.0 | 5.0 |
| | | | | acquisition-v4_datatype-fmap_fmap-magnitude2_suffix-magnitude2__1 | acquisition-v4_datatype-fmap_fmap-magnitude2_suffix-magnitude2 | 1 | 4 | 36 | 43 | 36 | 0.006 | | 60 | FALSE | 4 | fmap | 0 | 1.0 | FALSE | 0.75 | j- | 1.5 | | FALSE | 5.0 | 5.0 | 5.0 |
| | | | | acquisition-v4_datatype-fmap_fmap-phasediff_suffix-phasediff__1 | acquisition-v4_datatype-fmap_fmap-phasediff_suffix-phasediff | 1 | 4 | 36 | 43 | 36 | | | 60 | FALSE | 4 | fmap | 0 | 1.0 | FALSE | 0.75 | j- | 1.5 | | TRUE | 5.0 | 5.0 | 5.0 |
| | | | | datatype-fmap_direction-PA_fmap-epi_suffix-epi__1 | datatype-fmap_direction-PA_fmap-epi_suffix-epi | 1 | 3 | 36 | 43 | 36 | 0.089 | 0.00069 | 90 | FALSE | 4 | fmap | 36 | 1.0 | FALSE | | j- | 4.2 | 0.072 | TRUE | 5.0 | 5.0 | 5.0 |
| | | | acquisition-VARIANTUnused_datatype-fmap_direction-PA_fmap-epi_suffix-epi | datatype-fmap_direction-PA_fmap-epi_suffix-epi__2 | datatype-fmap_direction-PA_fmap-epi_suffix-epi | 2 | 1 | 36 | 43 | 36 | 0.089 | 0.00069 | 90 | FALSE | 4 | fmap | 36 | 1.0 | FALSE | | j- | 4.2 | 0.072 | FALSE | 5.0 | 5.0 | 5.0 |
| | | | | datatype-func_suffix-bold_task-rest__1 | datatype-func_suffix-bold_task-rest | 1 | 2 | 36 | 43 | 36 | 0.089 | 0.00069 | 70 | TRUE | 3 | func | 36 | 61.0 | FALSE | | j | 1.0 | 0.072 | FALSE | 5.0 | 5.0 | 5.0 |
| | | | acquisition-VARIANTFlipAngle_datatype-func_suffix-bold_task-rest | datatype-func_suffix-bold_task-rest__2 | datatype-func_suffix-bold_task-rest | 2 | 1 | 36 | 43 | 36 | 0.089 | 0.00069 | 90 | TRUE | 3 | func | 36 | 61.0 | FALSE | | j | 1.0 | 0.072 | FALSE | 5.0 | 5.0 | 5.0 |

Applying changes

Now that all metadata issues have been addressed (both the validation output and the CuBIDS summary), we are ready to rename our files based on their RenameKeyGroup values and apply the requested deletion in v0_edited_summary.tsv. The cubids apply function renames scans in each Variant Parameter Group, flagging the deviant metadata parameters with "VARIANT" in the acquisition field; this is useful because the user can then see, in each scan's filename, which metadata parameters vary from those of the acquisition's Dominant Group. If the edited summary and files TSVs are located in the bids_dir/code/CuBIDS directory, the user may pass in just the filenames; otherwise, the paths to those files must be specified. We can execute cubids apply with the following command:

$ cubids apply BIDS_Dataset_DataLad v0_edited_summary.tsv v0_files.tsv v1 --use-datalad

Checking our git log, we can see that our changes from apply have been saved.

[Screenshot: _images/screenshot_7.png, git log entries from cubids apply]

We can check the four grouping TSVs that cubids apply produces (v1_*) to ensure they look as expected: all files with variant scanning parameters should have been renamed so that the acquisition fields of their filenames indicate the parameters that vary.
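One quick way to perform this check (again assuming standard Unix tools and the default output location) is to search the new files table for the suggested variant names:

$ grep VARIANT BIDS_Dataset_DataLad/code/CuBIDS/v1_files.tsv  # renamed scans should appear with VARIANT in their acquisition fields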

Exemplar testing

The curation of the dataset is complete; now it is time for preprocessing. To streamline this step, and as an added measure for reproducibility and quality assurance, CuBIDS facilitates it through the creation of an Exemplar Dataset: a subset of the full dataset that spans its full variation of acquisitions and parameters by including one subject from each Acquisition Group. By testing only one subject per Acquisition Group, users can pinpoint specific metadata values and scans that may trigger pipeline failures; those Acquisition Groups can then be evaluated in more detail and flagged for remediation or exclusion. The Exemplar Dataset can easily be created with the cubids copy-exemplars command, to which we pass v1_AcqGrouping.tsv (the acquisition grouping TSV produced after cubids apply) as input.

$ cubids copy-exemplars BIDS_Dataset_DataLad Exemplar_Dataset v1_AcqGrouping.tsv --use-datalad

Since we used the --use-datalad flag, Exemplar_Dataset is itself a DataLad dataset, with its version history tracked in its git log (see below):

[Screenshot: _images/screenshot_8.png, git log of the Exemplar_Dataset]

Once a preprocessing pipeline completes successfully on the Exemplar Dataset, the full dataset can be processed with confidence, since the pipeline's behavior across the full range of metadata heterogeneity in the dataset will already have been exercised during exemplar testing.
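For example, if fMRIPrep were the pipeline of choice, exemplar testing might look like the following. This is a sketch, not part of the CuBIDS workflow: it assumes Docker and the fmriprep-docker wrapper are installed, and the FreeSurfer license path is a placeholder.

$ pip install fmriprep-docker
$ fmriprep-docker Exemplar_Dataset Exemplar_Dataset/derivatives participant --fs-license-file /path/to/fs_license.txt  # /path/to/fs_license.txt is a placeholder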