Documentation¶
All methods in this package can be called from the command line or directly from a Python console.
Import SCRIdb as:
import SCRIdb as sc
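The package's internal function names are not listed on this page, so as a minimal sketch of scripted use, the CLI itself can be driven from Python with subprocess; any scridb command shown below can be invoked the same way:
import subprocess

# Run the scridb help command and capture its output; swap in any
# sub-command from this page, e.g.
# ["scridb", "-f", "input.html", "data_submission", "submitnew"].
result = subprocess.run(["scridb", "-h"], capture_output=True, text=True, check=True)
print(result.stdout)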
In command line:
$ scridb -h
You are using the latest version 'v1.2.10-alpha.1' OK!
usage: scridb [-h] [-c [CONFIG]] [-f [FILE]] [-v] [-o [RESULTS_OUTPUT]]
[-j [JOBS]] [-e [EMAIL]] [-p [PEM]] [-a [AMI]]
[-n [INSTANCE_TYPE]] [-dS [DOCKERIZEDSEQC]] [-tp [TOOL_PATH]]
{data_submission,process,upload_stats,data_transfer,run} ...
positional arguments:
{data_submission,process,upload_stats,data_transfer,run}
Required. Call one of the following sub-commands.
While `process` sub-command calls `run`, `run` can be
called independently, if a processing jobs.yml already
exists.
data_submission sub-command to submit new projects and samples to the
database and to collect parameters on library
preparations.
process sub-command to process newly delivered fastq files
from the genome center core.
upload_stats sub-command to upload stats following successful
SEQC/scata/sera run. This sub-command accepts one or
more space separated positional arguments of full
paths to jobs.yml files, or labels.json in the case of
hashtags. Use -o to provide full path to output csv
file with data results necessary for data_transfer
sub-command.
data_transfer sub-command to transfer SEQC results to destination
folder designated in resulting csv file from
upload_stats. Use -o to provide the full path to the
csv file.
run sub-command to call SeqC or scata (ATAC-seq) or sera
(Cell Ranger) independently, conditional on having a
jobs.yml file in path defined by config.
optional arguments:
-h, --help show this help message and exit
global options:
-c [CONFIG], --config [CONFIG]
path to database configuration file. Default:
$HOME/.config.json
-f [FILE], --file [FILE]
With `data_submission` sub-command: string, path to
html form. With `process` sub-command: string, path to
`.csv` file listing newly delivered sequencing data
from the genome center. OR, a list of comma-separated
sample names (no spaces) overriding the `.csv` file. A
special case for `process` sub-command: pass `-`, and
follow the prompt on screen. For more details and
examples follow the documentation at
https://tinyurl.com/y5f9pzx8
-v, --version Print the package version and exit.
process options:
-j [JOBS], --jobs [JOBS]
Default: `jobs.yml`. When used with `process` sub-
command, a <jobs.yml> is passed as the OUTPUT file
name instead of the default `jobs.yml`. When used with
`run` the provided jobs filename will override the
default `jobs.yml`.
-e [EMAIL], --email [EMAIL]
Override email address to receive SEQC run summary in
config.
-p [PEM], --pem [PEM]
Override path to AWS EC key pair file `.pem` in
config.
-a [AMI], --ami [AMI]
Override Amazon Machine Image (AMI) in config.
-n [INSTANCE_TYPE], --instance_type [INSTANCE_TYPE]
Instance Type in config for SEQC.
-dS [DOCKERIZEDSEQC], --dockerized-SEQC [DOCKERIZEDSEQC]
Override default path in config to root directory of
the dockerized SEQC.
-tp [TOOL_PATH], --tool_path [TOOL_PATH]
Provide path to root directory of the scata or sera.
This will override the default, assuming it is in
$HOME.
output options:
-o [RESULTS_OUTPUT], --results_output [RESULTS_OUTPUT]
Default: ~/results_output.csv. Path to results_output
`csv` file. When used with `process` sub-command, a
`sample_data` data frame is written. When used with
`upload_stats` sub-command, a data frame with
necessary information used as input for
`data_transfer` sub-command, is written. Before
invoking `data_transfer` sub-command, the user must
complete the `destination` column in the resulting
data frame. IN CASES WHERE `process` AND/OR
`upload_stats` ARE SKIPPED USE `-` AS THE ARGUMENT.
This is useful for cases where we need to share raw
data with collaborators without processing.
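The configuration file referenced by -c stores the database connection settings together with the defaults that the flags above override. Its exact schema is defined by the package; the sketch below is hypothetical, with made-up values, and the key names simply mirror the command-line flags:
{
  "email": "user@example.org",
  "pem": "/Users/me/.aws/key-pair.pem",
  "ami": "ami-0123456789abcdef0",
  "instance_type": "r5.2xlarge",
  "dockerizedSEQC": "/Users/me/seqc-docker",
  "tool_path": "/Users/me"
}
Check your own $HOME/.config.json for the actual key names before editing.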
Call the data_submission module as:
$ scridb data_submission -h
You are using the latest version 'v1.2.10-alpha.1' OK!
usage: scridb data_submission [-h] [{submitnew,stats} [{submitnew,stats} ...]]
positional arguments:
{submitnew,stats} Choose the mode of action: {submitnew, stats}; submitnew:
submit new forms to the database; stats: insert new
library parameters to the stats table. Can choose either
one or both.
optional arguments:
-h, --help show this help message and exit
Data Submission from iLabs : data_submission¶
Data is collected from newly submitted project and sample forms by downloading and parsing them from iLabs.
The module and sub-command can be called by:
$ scridb -f input.html data_submission submitnew
In the above example, the input html from iLabs is parsed and the new records are entered into the database.
Calling stats will record library preparation parameters for each sample:
$ scridb -f input.html data_submission stats
Another form of usage calls both submitnew and stats with sample submission forms:
$ scridb -f input.html data_submission submitnew stats
Parsing¶
HTML parser that returns clean parameters and data extracted from iLabs submitted forms.
Data Submission¶
- A method that reads HTML data and submits new records into the database.
- A method to insert new records of samples into the database.
- A method to insert new records of projects into the database.
- A method to record library preparation stats of samples.
- Search and return ids of hashtag barcodes from the database; if a sample_index id record is not found, records a new …
- A cleaner for …
Data Processing : process¶
Processing and spinning up pipelines on sequencing data returned from IGO.
This process relies on a submit_data.csv file that can be downloaded using the RShiny interface. It is a three-column csv file in the form:
IGO source path (proj_folder) | s3URI (s3_loc) | Sample names (fastq)
---|---|---
peerd/FASTQ/Project../sub../ | s3://dp-lab-data/sc-seq/Project.. | sample.. sample.. sample.. sample..
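As plain text, the file is ordinary comma-separated data with one row per project folder; the project and sample names below are made up for illustration, and the header row and exact layout should be checked against a file downloaded from RShiny:
proj_folder,s3_loc,fastq
peerd/FASTQ/Project_10000/sub_01/,s3://dp-lab-data/sc-seq/Project_10000,sample_A sample_B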
The module can be called as follows:
$ scridb -f submit_data.csv process
The process module workflow can be controlled through switches on the command line:
$ scridb -f submit_data.csv process --no-rsync --runseqc
In the above example, no data will be copied to S3 from the IGO peer drive, and the processing pipeline will not be called; however, a jobs.yml file will be generated for the provided samples.
List of controls:
--runseqc Skip `seqc` run on submitted jobs.
--hashtag Skip `hashtag` run on submitted jobs.
--atac Skip `atac-seq` run on submitted jobs.
--CR Skip `Cell Ranger` run on submitted jobs.
--no-rsync Skip copying files to AWS.
--seqc-args [KEY1=VAL1,KEY2=VAL2...]
Additional arguments to pass to SEQC; see the example
after this list.
--md5sums [MD5SUMS] Path to MD5 hashes.
--save Write `sample_data` to .csv output configured in
--results_output.
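For example, extra SEQC parameters are forwarded as comma-separated key=value pairs; the parameter name below (min-poly-t) is illustrative only and should be checked against the SEQC documentation:
$ scridb -f submit_data.csv process --seqc-args min-poly-t=0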
Other processing options are available to override default parameters passed to the different methods in process; these are passed on the command line to scridb, for example --tool_path:
-j [JOBS], --jobs [JOBS]
Default: `jobs.yml`. When used with `process` sub-
command, a <jobs.yml> is passed as the OUTPUT file
name instead of the default `jobs.yml`. When used with
`run` the provided jobs filename will override the
default `jobs.yml`.
-e [EMAIL], --email [EMAIL]
Override email address to receive SEQC run summary in
config.
-p [PEM], --pem [PEM]
Override path to AWS EC key pair file `.pem` in
config.
-dS [DOCKERIZEDSEQC], --dockerized-SEQC [DOCKERIZEDSEQC]
Override default path in config to root directory of
the dockerized SEQC.
-tp [TOOL_PATH], --tool_path [TOOL_PATH]
Provide path to root directory of the scata or sera.
This will override the default, assuming it is in
$HOME.
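The generated jobs.yml is what the run sub-command consumes. Its exact schema is defined by the package, so the field names and values below are a hypothetical sketch only:
# hypothetical jobs.yml layout; field names are assumptions, not the package schema
samples:
  - sample_name: sample_A
    s3_uri: s3://dp-lab-data/sc-seq/Project_10000/sample_A/
email: user@example.org
Inspect a jobs.yml produced by process before writing one by hand.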
Process¶
- A method to process raw sequencing data returned from IGO.
- Constructor for …
- Constructor for …
- Put an object on an Amazon S3 bucket, while verifying the integrity of the uploaded object.
- A constructor for …
- Constructor for a data frame with samples to be processed.
- To prevent accidental runs on samples, for each sample in the data frame, check that records of the sample exist in the database.
- Execute an external command.
- Return the path to Cromwell server credentials.
- Checks the integrity of files copied to …
- Compute the …
- Compute the …
- List the contents of an object on …
- A function to return a new sample name omitting …
Collect Sequencing Stats Parameters : upload_stats¶
Tools and methods to collect stats from successfully processed scRNAseq data.
The module can be called as follows:
$ scridb -o <output.csv> upload_stats <jobs.yaml> [<jobs.yaml> ...]
Having a .yaml file makes it easier to compose the command, but in the absence of such a file the command can be used by providing a single source path to the root directory of the project where the samples are stored. In this case, additional arguments are required to complete the action:
$ scridb upload_stats <path to root dir> [-s [SAMPLE_NAMES [SAMPLE_NAMES ...]]] [-i [SAMPLE_IDS [SAMPLE_IDS ...]]]
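For example, with a made-up project path, sample names, and sample ids:
$ scridb -o output.csv upload_stats s3://dp-lab-data/sc-seq/Project_10000/ -s sample_A sample_B -i 101 102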
$ scridb upload_stats -h
You are using the latest version 'v1.2.10-alpha.1' OK!
usage: scridb upload_stats [-h] [-s [SAMPLE_NAMES [SAMPLE_NAMES ...]]]
[-i [SAMPLE_IDS [SAMPLE_IDS ...]]]
[-r [RESULTS_FOLDER [RESULTS_FOLDER ...]]]
[--cellranger] [--hash_tags]
s3paths [s3paths ...]
positional arguments:
s3paths A single string or a list of paths to jobs.yml, OR a
single string to parent directory of project with
samples listed in `sample_names`. For HASHTAGS: a
single string or a list of paths to labels.json, OR a
single string to root directory of project with
samples listed in `sample_names` CONDITIONAL on using
the `--hash_tags` switch.
optional arguments:
-h, --help show this help message and exit
-s [SAMPLE_NAMES [SAMPLE_NAMES ...]]
A single string or a list of sample names. Used in the
case of uploading stats NOT using a jobs file.
-i [SAMPLE_IDS [SAMPLE_IDS ...]]
A single string or a list of sample ids. Used in the
case of uploading stats NOT using a jobs file.
-r [RESULTS_FOLDER [RESULTS_FOLDER ...]]
Folder containing SEQC summary results. Overrides the
default `seqc-results` folder.
--cellranger Cell Ranger or ATAC-seq stats. Used in the case of
uploading stats NOT using the `jobs.yml` template as
input.
--hash_tags Hash tags switch. Used in the case of uploading stats
using the `labels.json` template as input.
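For hashtag samples, the same command takes labels.json paths together with the --hash_tags switch; the path below is made up for illustration:
$ scridb -o output.csv upload_stats /path/to/Project_10000/sample_A/labels.json --hash_tags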
Read and upload sequencing parameter stats¶
- Collects sequencing parameters from successfully processed scRNAseq samples, and uploads the data to the database.
- The core method which collects stats from successfully processed …
- Process stats data and return a formatted structure, compatible with the processing pipeline, that will be uploaded to the database.
- A method to collect sequencing parameters from output …
- Filter keys of a bucket with objects matching a pattern.
Data transfer : data_transfer¶
Tools to share outputs of processed samples with collaborators and team members.
This module can be called as:
$ scridb -o <input.csv> data_transfer
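The columns of this file are produced by upload_stats; as a hypothetical sketch with made-up values, after the user has filled in the destination column it might look like:
sample_id,sample_name,s3_loc,destination
101,sample_A,s3://dp-lab-data/sc-seq/Project_10000/sample_A/,s3://collaborator-bucket/share/
The actual column names come from the upload_stats output and may differ from this sketch.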
The input csv is generated by the earlier upload_stats step.
In some cases, the module can be called without providing an input csv file, as follows:
$ scridb -o - data_transfer -i [SAMPLE_IDS [SAMPLE_IDS ...]] -t [TARGET]
$ scridb -o - data_transfer -h
You are using the latest version 'v1.2.10-alpha.1' OK!
usage: scridb data_transfer [-h] [-i [SAMPLE_IDS [SAMPLE_IDS ...]]]
[-m {all,hashtags,TCR}] [-t [TARGET]]
optional arguments:
-h, --help show this help message and exit
-i [SAMPLE_IDS [SAMPLE_IDS ...]]
A single string or a list of sample ids. Used in the
case of transferring data NOT using a jobs file.
-m {all,hashtags,TCR}, --mode {all,hashtags,TCR}
Default: all, a switch mode for hashtags.
-t [TARGET], --target [TARGET]
S3URI in the form of s3://bucket/key/, where data will
be copied to. In general, the expected `s3URI` is the
root path of `s3URI/<project_name>/<sample_name>/`,
i.e. `<project_name>/<sample_name>/` are omitted.
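For example, with made-up sample ids and a made-up destination bucket:
$ scridb -o - data_transfer -i 101 102 -t s3://collaborator-bucket/share/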
All methods in this module are also accessible from a Python console.
Collect samples and transfer to collaborators¶
- Move the processed samples to the destination folder on AWS, given the provided parameters, including the destination parent …
- Collect essential information for transferring samples to the destination folder on AWS.
- Compile a list of …
- A method to create an AWS user with credentials.