Documentation

All methods in this package can be called from the command line or directly from a Python console.

Import SCRIdb as:

import SCRIdb as sc

From the command line:

$ scridb -h
You are using the latest version 'v1.2.10-alpha.1'    OK!
usage: scridb [-h] [-c [CONFIG]] [-f [FILE]] [-v] [-o [RESULTS_OUTPUT]]
              [-j [JOBS]] [-e [EMAIL]] [-p [PEM]] [-a [AMI]]
              [-n [INSTANCE_TYPE]] [-dS [DOCKERIZEDSEQC]] [-tp [TOOL_PATH]]
              {data_submission,process,upload_stats,data_transfer,run} ...

positional arguments:
  {data_submission,process,upload_stats,data_transfer,run}
                        Required. Call one of the following sub-commands.
                        While the `process` sub-command calls `run`, `run` can
                        be called independently if a processing jobs.yml
                        already exists.
    data_submission     sub-command to submit new projects and samples to the
                        database and to collect parameters on library
                        preparations.
    process             sub-command to process newly delivered fastq files
                        from the genome center core.
    upload_stats        sub-command to upload stats following a successful
                        SEQC/scata/sera run. This sub-command accepts one or
                        more space-separated positional arguments of full
                        paths to jobs.yml files, or labels.json in the case of
                        hashtags. Use -o to provide the full path to an output
                        csv file with data results necessary for the
                        data_transfer sub-command.
    data_transfer       sub-command to transfer SEQC results to the
                        destination folder designated in the csv file
                        resulting from upload_stats. Use -o to provide the
                        full path to the csv file.
    run                 sub-command to call SEQC, scata (ATAC-seq), or sera
                        (Cell Ranger) independently, conditional on having a
                        jobs.yml file in the path defined by config.

optional arguments:
  -h, --help            show this help message and exit

global options:
  -c [CONFIG], --config [CONFIG]
                        path to database configuration file. Default:
                        $HOME/.config.json
  -f [FILE], --file [FILE]
                        With `data_submission` sub-command: string, path to
                        html form. With `process` sub-command: string, path to
                        `.csv` file listing newly delivered sequencing data
                        from the genome center, OR a comma-separated list of
                        sample names (no spaces) to override the `.csv` file. A
                        special case for `process` sub-command: pass `-`, and
                        follow the prompt on screen. For more details and
                        examples follow the documentation at
                        https://tinyurl.com/y5f9pzx8
  -v, --version         Print the package version and exit.

process options:
  -j [JOBS], --jobs [JOBS]
                        Default: `jobs.yml`. When used with `process` sub-
                        command, a <jobs.yml> is passed as the OUTPUT file
                        name instead of the default `jobs.yml`. When used with
                        `run` the provided jobs filename will override the
                        default `jobs.yml`.
  -e [EMAIL], --email [EMAIL]
                        Override email address to receive SEQC run summary in
                        config.
  -p [PEM], --pem [PEM]
                        Override path to AWS EC2 key pair file `.pem` in
                        config.
  -a [AMI], --ami [AMI]
                        Override Amazon Machine Image (AMI) in config.
  -n [INSTANCE_TYPE], --instance_type [INSTANCE_TYPE]
                        Instance Type in config for SEQC.
  -dS [DOCKERIZEDSEQC], --dockerized-SEQC [DOCKERIZEDSEQC]
                        Override default path in config to root directory of
                        the dockerized SEQC.
  -tp [TOOL_PATH], --tool_path [TOOL_PATH]
                        Provide path to the root directory of scata or sera.
                        This will override the default, assuming it is in
                        $HOME.

output options:
  -o [RESULTS_OUTPUT], --results_output [RESULTS_OUTPUT]
                        Default: ~/results_output.csv. Path to results_output
                        `csv` file. When used with `process` sub-command, a
                        `sample_data` data frame is written. When used with
                        `upload_stats` sub-command, a data frame with the
                        information needed as input for the `data_transfer`
                        sub-command is written. Before
                        invoking `data_transfer` sub-command, the user must
                        complete the `destination` column in the resulting
                        data frame. IN CASES WHERE `process` AND/OR
                        `upload_stats` ARE SKIPPED USE `-` AS THE ARGUMENT.
                        This is useful for cases where we need to share raw
                        data with collaborators without processing.
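
Taken together, a typical end-to-end sequence might look as follows (file
names are illustrative):

$ scridb -f submit_data.csv -o results_output.csv process
$ scridb -o results_output.csv upload_stats jobs.yml
# complete the `destination` column in results_output.csv, then:
$ scridb -o results_output.csv data_transfer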

Call the data_submission module as:

$ scridb data_submission -h
You are using the latest version 'v1.2.10-alpha.1'    OK!
usage: scridb data_submission [-h] [{submitnew,stats} [{submitnew,stats} ...]]

positional arguments:
  {submitnew,stats}  Choose the mode of action: {submitnew, stats}; submitnew:
                     submit new forms to the database; stats: insert new
                     library parameters to the stats table. Either one or
                     both may be chosen.

optional arguments:
  -h, --help         show this help message and exit

Data Submission from iLabs : data_submission

Data on newly submitted projects and samples is collected by downloading and parsing the submission forms from iLabs.

The module and sub-command can be called by:

$ scridb -f input.html data_submission submitnew

In the above example, the input html from iLabs is parsed and the new records are entered into the database.

Calling stats will record library preparation parameters for each sample:

$ scridb -f input.html data_submission stats

Another form of usage calls both submitnew and stats with sample submission forms:

$ scridb -f input.html data_submission submitnew stats
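
The same submission can be issued from a Python console through the entry
point documented below (a sketch; the exact type accepted by `mode` is an
assumption):

from SCRIdb.submission import data_submission_main

# Equivalent of `scridb -f input.html data_submission submitnew stats`;
# passing `mode` as a list of mode names is an assumption.
data_submission_main(fn="input.html", mode=["submitnew", "stats"])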

Parsing

SCRIdb.htmlparser.cleanhtml

HTML parser that returns clean parameters and data extracted from iLabs-submitted forms.

Data Submission

SCRIdb.submission.data_submission_main(fn, mode)

A method that reads HTML data and submits new records into the database.

SCRIdb.sql.sampledata_sql(kwargs)

A method to insert new sample records into the database.

SCRIdb.sql.projectdata_sql(kwargs)

A method to insert new project records into the database.

SCRIdb.sql.statsdata_sql(kwargs)

A method to record library preparation stats of samples.

SCRIdb.sql.get_hashtagBarcodes_id(seq[, bar])

Search and return ids of hashtag barcodes from the database.

SCRIdb.sql.get_sampleIndex(x)

If a sample_index id record is not found, records a new sample_index value in the database and returns its id.

SCRIdb.sql.sqlargs(kargs[, val])

A cleaner for NaN data types and empty values in a dictionary.
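
The idea behind such a cleaner can be sketched as follows (an illustrative
stand-in, not the package's exact implementation):

import math

def clean_sql_args(kwargs, val=None):
    # Map NaN floats and empty strings to `val` (default None) so the
    # database receives NULL instead of the string 'nan' or ''.
    return {
        k: val if v == "" or (isinstance(v, float) and math.isnan(v)) else v
        for k, v in kwargs.items()
    }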

Data Processing : process

Processing and spinning up pipelines on sequencing data returned from IGO. This process relies on a submit_data.csv file that can be downloaded using the RShiny interface. It is a three-column csv file of the form:

submit_data.csv:

IGO source path (proj_folder)  s3URI (s3_loc)                     Sample names (fastq)
-----------------------------  ---------------------------------  -----------------------------------
peerd/FASTQ/Project../sub../   s3://dp-lab-data/sc-seq/Project..  sample.. sample.. sample.. sample..

The module can be called as follows:

$ scridb -f submit_data.csv process
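
Alternatively, as described for the global -f option, a comma-separated list
of sample names (no spaces) can replace the csv file (names illustrative):

$ scridb -f sample1,sample2 process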

The process module workflow can be controlled through command-line switches:

$ scridb -f submit_data.csv process --no-rsync --runseqc

In the above example, no data is copied to S3 from the IGO peer drive and the processing pipeline is not called; a jobs.yml file is still generated for the provided samples.

List of controls:

--runseqc             Skip `seqc` run on submitted jobs.
--hashtag             Skip `hashtag` run on submitted jobs.
--atac                Skip `atac-seq` run on submitted jobs.
--CR                  Skip `Cell Ranger` run on submitted jobs.
--no-rsync            Skip copying files to AWS.
--seqc-args [KEY1=VAL1,KEY2=VAL2...]
                      Additional arguments to pass to SEQC.
--md5sums [MD5SUMS]   Path to MD5 hashes.
--save                Write `sample_data` to .csv output configured in
                      --results_output.

Other processing options are available to override default parameters of the different methods in process; these are passed on the command line to scridb, for example --tool_path:

-j [JOBS], --jobs [JOBS]
                    Default: `jobs.yml`. When used with `process` sub-
                    command, a <jobs.yml> is passed as the OUTPUT file
                    name instead of the default `jobs.yml`. When used with
                    `run` the provided jobs filename will override the
                    default `jobs.yml`.
-e [EMAIL], --email [EMAIL]
                    Override email address to receive SEQC run summary in
                    config.
-p [PEM], --pem [PEM]
                    Override path to AWS EC2 key pair file `.pem` in
                    config.
-dS [DOCKERIZEDSEQC], --dockerized-SEQC [DOCKERIZEDSEQC]
                    Override default path in config to root directory of
                    the dockerized SEQC.
-tp [TOOL_PATH], --tool_path [TOOL_PATH]
                    Provide path to the root directory of scata or sera.
                    This will override the default, assuming it is in
                    $HOME.
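
For example, to write the jobs file under a non-default name and route the
run summary to a different address (values illustrative):

$ scridb -f submit_data.csv -j jobs_batch2.yml -e user@example.org process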

Process

SCRIdb.worker.worker_main(f_in[, …])

A method to process raw sequencing data returned from IGO.

SCRIdb.tools.jobs_yml_config(sample_data[, …])

Constructor for a yaml-formatted file for batch processing of samples.

SCRIdb.tools.json_jobs(sample_data, config_path)

Constructor for json-formatted files for batch processing of hashtag samples.

SCRIdb.tools.put_object(dest_bucket_name, …)

Put an object in an Amazon S3 bucket, verifying the integrity of the uploaded object.

SCRIdb.tools.prepare_statements(sample_data)

A constructor for aws s3 cp commands and MySQL statements.

SCRIdb.tools.sample_data_frame(sd)

Constructor for data frame with samples to be processed.

SCRIdb.tools.filter_samples(df)

To prevent accidental runs on samples, performs a set of sanity checks on each sample in the data frame.

SCRIdb.tools.check_samples(x)

Check that records of a sample exist in the database.

SCRIdb.tools.execute_cmd([cmd])

Execute an external command.

SCRIdb.tools.get_cromwell_credentials([config])

Return path to Cromwell server credentials.

SCRIdb.tools.etag_compare(filename, etag)

Checks the integrity of files copied to S3.

SCRIdb.tools.etag_checksum(filename[, …])

Compute the ETag digest (which is not the MD5 digest) of an object.

SCRIdb.tools.md5_checksum(filename)

Compute the MD5 digest of an object.
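
The distinction matters because S3 reports an ETag rather than a plain MD5
for multipart uploads. A minimal sketch of both digests (illustrative, not
necessarily the package's implementation; the chunk size must match the part
size used for the multipart upload):

import hashlib

def md5_checksum(filename):
    # Plain MD5 of a local file, streamed in chunks.
    md5 = hashlib.md5()
    with open(filename, "rb") as f:
        for chunk in iter(lambda: f.read(8 * 1024 * 1024), b""):
            md5.update(chunk)
    return md5.hexdigest()

def etag_checksum(filename, chunk_size=8 * 1024 * 1024):
    # S3-style ETag: a plain MD5 for single-part objects; for multipart
    # uploads, the MD5 of the concatenated per-part digests plus "-<parts>".
    part_md5s = []
    with open(filename, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            part_md5s.append(hashlib.md5(chunk).digest())
    if len(part_md5s) == 1:
        return part_md5s[0].hex()
    return hashlib.md5(b"".join(part_md5s)).hexdigest() + "-" + str(len(part_md5s))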

SCRIdb.tools.get_bucket_contents(root_source)

List the contents of an object on S3.

SCRIdb.tools.sample_fun(x)

A function to return a new sample name omitting the IGO part.
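
A hypothetical sketch, assuming IGO ids are appended to sample names as an
`_IGO_` suffix (the naming convention itself is an assumption here):

import re

def strip_igo(sample_name):
    # Drop a trailing `_IGO_...` segment from the sample name (assumed format).
    return re.sub(r"_IGO_.*$", "", sample_name)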

Collect Sequencing Stats Parameters : upload_stats

Tools and methods to collect stats from successfully processed scRNAseq data. The module can be called as follows:

$ scridb -o <output.csv> upload_stats <jobs.yaml> [<jobs.yaml> ...]

Having a .yaml file makes it easier to process the command, but in the absence of such a file the command can be used by providing a single source path to the root directory of the project where the samples are stored. In this case, additional arguments are required to complete the action:

$ scridb upload_stats <path to root dir> [-s [SAMPLE_NAMES [SAMPLE_NAMES ...]]] [-i [SAMPLE_IDS [SAMPLE_IDS ...]]]
$ scridb upload_stats -h
You are using the latest version 'v1.2.10-alpha.1'    OK!
usage: scridb upload_stats [-h] [-s [SAMPLE_NAMES [SAMPLE_NAMES ...]]]
                           [-i [SAMPLE_IDS [SAMPLE_IDS ...]]]
                           [-r [RESULTS_FOLDER [RESULTS_FOLDER ...]]]
                           [--cellranger] [--hash_tags]
                           s3paths [s3paths ...]

positional arguments:
  s3paths               A single string or a list of paths to jobs.yml, OR a
                        single string giving the parent directory of a project
                        with samples listed in `sample_names`. For HASHTAGS: a
                        single string or a list of paths to labels.json, OR a
                        single string giving the root directory of a project
                        with samples listed in `sample_names`, CONDITIONAL on
                        using the `--hash_tags` switch.

optional arguments:
  -h, --help            show this help message and exit
  -s [SAMPLE_NAMES [SAMPLE_NAMES ...]]
                        A single string or a list of sample names. Used in the
                        case of uploading stats NOT using a jobs file.
  -i [SAMPLE_IDS [SAMPLE_IDS ...]]
                        A single string or a list of sample ids. Used in the
                        case of uploading stats NOT using a jobs file.
  -r [RESULTS_FOLDER [RESULTS_FOLDER ...]]
                        Folder containing the SEQC summary results. Overrides
                        the default `seqc-results` folder.
  --cellranger          Cell Ranger or ATAC-seq stats. Used in the case of
                        uploading stats NOT using the `jobs.yml` template as
                        input.
  --hash_tags           Hash tags switch. Used in the case of uploading stats
                        using the `labels.json` template as input.

Read and upload sequencing parameter stats

SCRIdb.upload_stats.upload_stats_main(…[, …])

Collects sequencing parameters from successfully processed scRNAseq samples, and uploads the data to the database.

SCRIdb.upload_stats.stats(s3paths[, …])

The core method which collects stats from successfully processed scRNAseq samples.

SCRIdb.upload_stats.get_stats(s3paths, …)

Process stats data and return a formatted structure compatible with the processing pipeline, to be uploaded to the database.

SCRIdb.upload_stats.read_stats(bucket, keys)

A method to collect sequencing parameters from output csv or json files from samples on S3, provided a bucket and a key.

SCRIdb.upload_stats.get_objects([bucket, …])

Filter keys of a bucket with objects matching a pattern.
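
Such filtering can be sketched with boto3 and fnmatch (bucket name, prefix,
and pattern are hypothetical; not the package's exact implementation):

import fnmatch
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
# Collect keys under a prefix whose names match a glob pattern.
keys = [
    obj["Key"]
    for page in paginator.paginate(Bucket="my-bucket", Prefix="sc-seq/")
    for obj in page.get("Contents", [])
    if fnmatch.fnmatch(obj["Key"], "*summary.csv")
]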

Data transfer : data_transfer

Tools to share outputs of processed samples with collaborators and team members.

This module can be called as:

$ scridb -o <input.csv> data_transfer

The input csv is generated by the earlier upload_stats step. In some cases, the module can be called without providing an input csv file, as follows:

$ scridb -o - data_transfer -i [SAMPLE_IDS [SAMPLE_IDS ...]] -t [TARGET]
$ scridb -o - data_transfer -h
You are using the latest version 'v1.2.10-alpha.1'    OK!
usage: scridb data_transfer [-h] [-i [SAMPLE_IDS [SAMPLE_IDS ...]]]
                            [-m {all,hashtags,TCR}] [-t [TARGET]]

optional arguments:
  -h, --help            show this help message and exit
  -i [SAMPLE_IDS [SAMPLE_IDS ...]]
                        A single string or a list of sample ids. Used when
                        transferring data NOT using an input csv file.
  -m {all,hashtags,TCR}, --mode {all,hashtags,TCR}
                        Default: all; a mode switch for hashtags and TCR data.
  -t [TARGET], --target [TARGET]
                        S3URI in the form of s3://bucket/key/, where data will
                        be copied to. In general, the expected `s3URI` is the
                        root path of `s3URI/<project_name>/<sample_name>/`,
                        where `<project_name>/<sample_name>/` is omitted.
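
For example (sample id and destination bucket are illustrative):

$ scridb -o - data_transfer -i 1234 -t s3://collaborator-bucket/shared/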

All methods for this module are also accessible through a Python console.

Collect samples and transfer to collaborators

SCRIdb.transfer.data_transfer([sample_ids, …])

Move the processed samples to the destination folder on AWS, given the provided parameters, including the destination parent s3uri.

SCRIdb.transfer.df_compiler(d[, m, dist])

Collect essential information for transferring samples to the destination folder on AWS.

SCRIdb.transfer.get_glob([sample_name, …])

Compile a list of awscli commands to move the desired data from a sample to a desired destination on s3uri.
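
Conceptually, the compiled commands are plain awscli invocations; a
hypothetical sketch (paths and sample name are made up for illustration):

sample = "sample_1"
src = "s3://dp-lab-data/sc-seq/my_project/" + sample + "/"
dest = "s3://collaborator-bucket/shared/my_project/" + sample + "/"
# One recursive copy command per sample, ready to be executed.
cmd = ["aws", "s3", "cp", src, dest, "--recursive"]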

SCRIdb.transfer.make_credentials([user, …])

A method to create an AWS user with credentials.