Guidelines for Submitting Controlled Access Data

The purpose of this document is to provide guidelines for the submission of controlled access data to FaceBase. For a more general overview of Data Management and Sharing (DMS) with respect to controlled access data, please see Designating Scientific Data for Controlled Access (NIH).

Subject-level Data

Data and metadata produced from human subjects research often include sensitive subject-level information such as geolocations, dates, rare diseases, etc. By definition, human subjects data that are submitted to FaceBase will include some form of subject-level information. Typically, such data come in the form of metadata on the subjects, per-subject data files like images or genomics, and/or per-subject rows in a database. In the following sections, we discuss the standards for submitting each of these forms of data to FaceBase.

If your human subjects data include distinct consent groups or other cohort groupings, you should submit each grouping as a separate dataset. Doing so allows users of FaceBase to request only the data for a particular consent group or cohort, which streamlines the work of the NIH Data Access Committee (DAC) in reviewing requests for controlled access data. The online forms for entering metadata on datasets allow you to cross-reference related datasets. In addition, datasets are always grouped under your FaceBase project page.

De-identification

Investigators MUST de-identify their research data, removing all Personally Identifiable Information (PII) and Protected Health Information (PHI), to the extent possible. For example, explicit identifiers such as name, date of birth, address, and medical record number can and should be removed or otherwise de-identified before submission to FaceBase. However, it may be impossible to strip data of all implicitly or theoretically identifiable information. For example, some diseases are so rare that knowing the disease together with a broad geographic location or an age in years may be enough for a specialist in the disease to re-identify the individual. Investigators are not expected to de-identify data to such an extent that it would be impossible to re-identify the subjects by combining the data with other knowledge or out-of-band information. With this qualified definition in mind, investigators MUST de-identify data to the extent possible.

Metadata

Metadata is defined as “data that describes other data”. FaceBase provides a number of online forms for entering metadata. These forms cover details about experiments, biological samples, protocols, and other related information to help researchers understand and reuse your data. Investigators SHOULD submit de-identified metadata for their data. Investigators SHOULD enter as much demographic, disease, phenotype, anatomic, and other biological characteristics and experimental metadata as possible, so long as it does not include any PII or PHI. In addition, Investigators SHOULD NOT include implicitly identifiable information on subjects that could be combined with other information to re-identify the data.

Local Identifiers

Investigators MAY include a Local Identifier for each entry in the Biosample metadata, if applicable. The Local Identifier serves two purposes. First, if we or the users of your data have questions regarding your study, we can use the Local Identifier as a way to reference the subject or sample in question. Second, the Local Identifier may be used for mapping your submitted data files to the dataset metadata online. In addition, investigators MAY include optional identifiers for Subject ID to support longitudinal studies and Family ID to support proband information. These are separate fields in the Biosample metadata.

Supplementary Metadata

The FaceBase online data entry forms cover details that help other investigators find and reuse data. Some datasets, however, may need additional attributes to describe aspects of the data that fall outside what the FaceBase data entry forms are designed to support. Investigators MAY submit supplementary metadata in a structured file format, such as a Comma-Separated Values (CSV) formatted spreadsheet. In the supplementary metadata, Investigators MAY also include more sensitive information than would be allowed in the standard FaceBase metadata.

Files

Investigators typically submit data files as part of their dataset submission. Files MAY be organized in an arbitrary file hierarchy with any practical depth of subdirectories under the dataset base directory. Data files MAY use any file name except checksums.md5 and mappings.csv, which are reserved for the checksums and file mappings files described below. File names MUST NOT include any explicit PII or PHI.

The structure follows the form:

dataset/
    file1
    file2
    ...
    fileN

For example:

1-2345/
    mappings.csv
    checksums.md5
    dirA/
        dirB/
            fileA.ext
    dirC/
        fileB.ext
    fileC.ext
    dirZ/
        fileD.ext
    dirG/
        fileE.ext
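
The reserved file names can be checked before upload. The following Python sketch is illustrative only and is not a FaceBase tool: the function name reserved_name_conflicts is hypothetical, and it assumes that checksums.md5 and mappings.csv are legitimate only at the top level of the dataset directory, so any file with one of those names in a subdirectory is reported. It does not attempt to detect PII or PHI in file names, which requires human review.

```python
from pathlib import Path

# Names reserved by the submission guidelines.
RESERVED_NAMES = {"checksums.md5", "mappings.csv"}


def reserved_name_conflicts(dataset_dir):
    """Return relative paths of files in subdirectories that use a reserved name.

    Assumption: the reserved names are only allowed at the top level
    of the dataset directory.
    """
    base = Path(dataset_dir)
    conflicts = []
    for path in sorted(base.rglob("*")):
        if path.is_file() and path.name in RESERVED_NAMES and path.parent != base:
            conflicts.append(path.relative_to(base).as_posix())
    return conflicts
```

For the example dataset above, reserved_name_conflicts("1-2345") would return an empty list, because both reserved files sit at the top level.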

File Mappings for Subject-Level Data

When submitting subject-level data files, Investigators MAY also submit a file mappings.csv that maps the data files to the subject-level metadata in the FaceBase database via the Local Identifier. The mappings file MUST be named mappings.csv and be placed at the top level of the dataset upload directory. The format MUST be plain text Comma-Separated Values (CSV) with exactly two columns, local_identifier and filename. The local_identifier MUST match a value that you entered in the online Biosample metadata. The filename MAY include a relative path (e.g., path/to/file.ext) and MUST match a file under the dataset upload directory. Note that a local_identifier need not be unique within the mappings.csv file, so one identifier can map to several files. Each filename, however, MUST be unique within the mappings.csv file.

Example of a valid mappings.csv file:

local_identifier,filename
Cohort1SubA7,dirA/dirB/fileA.ext
Cohort1SubA7,dirC/fileB.ext
COHORT2SUB99,fileC.ext
siteA_Subj-H24,dirZ/fileD.ext
siteA_Subj-H24,dirG/fileE.ext
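
The local rules for mappings.csv can be checked before upload. The Python sketch below is one possible approach, not an official FaceBase validator; the function name validate_mappings is hypothetical. It checks the header, the two-column layout, the uniqueness of each filename, and that each filename exists under the dataset directory. It cannot check that a local_identifier matches the online Biosample metadata, since that information lives in FaceBase.

```python
import csv
from pathlib import Path


def validate_mappings(dataset_dir):
    """Return a list of problems found in dataset_dir/mappings.csv (empty if none)."""
    base = Path(dataset_dir)
    with open(base / "mappings.csv", newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle))

    # The file must start with exactly these two column names.
    if not rows or rows[0] != ["local_identifier", "filename"]:
        return ["header must be exactly: local_identifier,filename"]

    problems = []
    seen_filenames = set()
    for line_no, row in enumerate(rows[1:], start=2):
        if not row:
            continue  # ignore blank lines
        if len(row) != 2:
            problems.append(f"line {line_no}: expected 2 columns, found {len(row)}")
            continue
        local_identifier, filename = row
        if not local_identifier:
            problems.append(f"line {line_no}: empty local_identifier")
        # local_identifier may repeat, but each filename may appear only once.
        if filename in seen_filenames:
            problems.append(f"line {line_no}: duplicate filename {filename}")
        seen_filenames.add(filename)
        if not (base / filename).is_file():
            problems.append(f"line {line_no}: no such file {filename}")
    return problems
```

Calling validate_mappings("1-2345") on the example dataset would return an empty list if every listed file is present.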

Structured Data Files

When submitting “structured” data such as database extracts, spreadsheets, or other tabular data, Investigators MUST use standard Comma-Separated Values (CSV) format and MUST include a “Data Dictionary” to describe each file. Data dictionaries MUST be named DD_table_name.ext, corresponding to a specific tabular data file named table_name.csv. Data dictionaries MAY be in either CSV or Microsoft Excel Open XML Spreadsheet (XLSX) format. For example, if you submit a tabular file experiments.csv, then the data dictionary would be either DD_experiments.csv or DD_experiments.xlsx, depending on the choice of format. While FaceBase does not currently dictate a precise schema for the data dictionary, it MUST include the data element name (i.e., the table column name), data type, and definition, and SHOULD include allowable values, format, and any other relevant information to assist data consumers with using the table data. Data dictionaries SHOULD be column-major.
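
The naming rule for data dictionaries lends itself to a quick local check. The Python sketch below is illustrative and rests on two assumptions that the guidelines do not state: that every .csv file other than mappings.csv and the data dictionaries themselves is a tabular data file, and that a data dictionary sits in the same directory as the table it describes. The function name missing_data_dictionaries is hypothetical.

```python
from pathlib import Path


def missing_data_dictionaries(dataset_dir):
    """Return relative paths of CSV tables that have no DD_<table_name> file.

    Assumptions: every .csv except mappings.csv and DD_* files is a table,
    and its data dictionary lives in the same directory.
    """
    base = Path(dataset_dir)
    missing = []
    for table in sorted(base.rglob("*.csv")):
        if table.name == "mappings.csv" or table.name.startswith("DD_"):
            continue
        # A dictionary may be either DD_<table_name>.csv or DD_<table_name>.xlsx.
        candidates = [
            table.with_name(f"DD_{table.stem}{ext}") for ext in (".csv", ".xlsx")
        ]
        if not any(candidate.is_file() for candidate in candidates):
            missing.append(table.relative_to(base).as_posix())
    return missing
```

A dataset containing experiments.csv and DD_experiments.xlsx would yield an empty list; removing the dictionary would yield ["experiments.csv"].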

File Checksums

Investigators MAY submit the checksums for all files in a file named checksums.md5. The checksums file should be formatted according to the standard output of the GNU md5sum utility: one line per file, containing the MD5 digest followed by the file path relative to the dataset directory. If you are not familiar with this utility, we will provide detailed instructions. For example:

09f7e02f1290be211da707a266f153b3  dirA/dirB/fileA.ext
52f83ff6877e42f613bcd2444c22528c  dirC/fileB.ext
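
Where GNU md5sum is not available, an equivalent file can be produced with Python's standard hashlib module. The sketch below is not an official FaceBase script; the function names md5_of_file and write_checksums are hypothetical. It writes one line per file in the md5sum text-mode layout (digest, two spaces, relative path) and assumes file names contain no backslashes or newlines, which md5sum would escape.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksums(dataset_dir):
    """Write dataset_dir/checksums.md5 and return the lines written."""
    base = Path(dataset_dir)
    lines = []
    for path in sorted(base.rglob("*")):
        # Skip directories and the checksums file itself.
        if path.is_file() and path.name != "checksums.md5":
            relative = path.relative_to(base).as_posix()
            lines.append(f"{md5_of_file(path)}  {relative}\n")
    with open(base / "checksums.md5", "w", newline="\n", encoding="utf-8") as out:
        out.writelines(lines)
    return lines
```

Running write_checksums("1-2345") would create 1-2345/checksums.md5 with paths relative to the dataset directory, in the same layout as the example above.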