Documentation
Overview
Snakeobjects is a workflow management framework based on Snakemake that
uses an object-oriented abstraction of workflows. snakeobjects workflows
are easier to develop, maintain, and adopt than the equivalent
workflows written in Snakemake, but inherit all the powerful features of
Snakemake. These include portability, efficient resource usage, the
expressive power that comes from tight Python integration, and the large
community of Snakemake users.
Workflow object-oriented abstraction
Snakeobjects introduces an abstraction of workflows inspired by
object-oriented design that replaces the low-level input-output relationships
between files that are at the core of Snakemake’s rules. A pipeline (workflow)
in snakeobjects operates on projects, and projects are composed of objects. The
objects within a project are connected by dependency relationships organized
in a directed acyclic graph called the object graph, in which each object has a list
of the objects it depends on (dependency objects). Each object also has a
specified object type. An object type is characterized by a set of targets
that need to be created for each object of that type, together with
the rules for creating the targets. The rules for building targets for an
object type are included in a snakefile named after the object type and are written
using Snakemake’s syntax, where the inputs and outputs specify targets instead of
files. Crucially, inputs can refer to targets in the current object and to
targets in objects the current object depends on, as specified in the object
graph. Finally, projects and objects can be associated with a set of key-value
parameters.
Pipelines
In snakeobjects, pipelines reside in a pipeline directory. The pipeline
directory and its content are created by the workflow designer and define the
workflow. The pipeline directory usually contains a python script called
build_object_graph.py that uses meta data associated with a given project
to create the project’s object graph and one Snakemake file for each object type
used in the object graph. The pipeline directory can also contain scripts, conda
environment definitions, or other artifacts used by the pipeline.
build_object_graph.py script
The build_object_graph.py script should contain a function with the following
interface:
- run(project, OG[, *args])
Creates an object graph for the project.
- Parameters
project (snakeobjects.Project) – the snakeobjects project
OG (snakeobjects.ObjectGraph) – the newly created object graph
*args (list[str]) – command line arguments passed to
sobjects prepare or sobjects prepareTest.
The run() function usually obtains the location of the project meta data through
project.parameters, loads the meta data, and uses it to create the
corresponding object graph, using the add()
method to add objects to the OG.
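The steps above can be sketched as follows. This is only an illustration under stated assumptions: the project parameter samples_file and the table columns sampleId and gender are hypothetical names chosen for the sketch, and project.parameters is assumed to behave like a dictionary. The four-argument form of add() (object type, object id, parameters, dependency objects) follows the example given in the section on object types, objects, and object graph.

```python
import csv


def run(project, OG, *args):
    # "samples_file" is a hypothetical project parameter used only in this
    # sketch; a real pipeline chooses its own parameter names.
    samples_file = project.parameters["samples_file"]

    # Load the meta data: a tab-separated table with a header line and
    # the (hypothetical) columns sampleId and gender, one row per sample.
    with open(samples_file, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            # Arguments: object type, object id, parameters, dependency objects.
            OG.add("sample", row["sampleId"], {"gender": row["gender"]}, [])
```

Each row of the table becomes one object of type sample with no dependencies; a pipeline with several object types would add the dependency objects first and pass them in the last argument.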
Object-type snakefiles
Each object-type snakefile declares the list of targets for the object type and the rules for creating the targets.
The workflow designer uses the add_targets() function to declare the targets and uses
Snakemake’s syntax to create the rules.
For example, the following line in the object type snakefile sample.snakefile
add_targets("A.txt","B.txt")
declares that each object of type sample needs two targets
created (T("A.txt") and T("B.txt")).
All the rules are written using Snakemake’s syntax (https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html)
and use a set of snakeobjects extension functions (see
Snakemake extension functions) for referring to targets, parameters, and log files.
The example below demonstrates the main features of the snakeobjects rules:
rule create_B:
    input: a=T("A.txt"), r=DT("chrAll.fa",dot="reference")
    output: T("B.txt")
    params: g=P("gender")
    log: **LFS("B")
    shell: "some_command.py {input.a} {params.g} {input.r} > {output} 2> {log.E}"
A typical rule has a name, here create_B, and several attributes, such as
input, output, params, log, and shell. Attributes should be indented
relative to the keyword rule. Attribute values are strings or lists of strings
separated by commas. They may start on the same line as the attribute name or
on a separate line, in which case they are indented relative to the attribute
name. The first two attributes in this rule use the functions T() and
DT() to specify the input and output files. The values of
params and log are defined by the functions P() and LFS().
Attribute values can be named, as in a=T("A.txt") or g=P("gender"), and
these names can be used for reference in the shell command. The value of the shell
attribute is a valid shell command, or a list of commands, enclosed in quotation marks.
References to attribute values in the shell command are enclosed in curly braces. The complete
documentation for snakefile rules can be found in the Snakemake documentation.
Projects
In snakeobjects, a workflow user creates a project directory
and, inside it, a project configuration file called so_project.yaml.
The so_project.yaml file contains parameters that specify the pipeline operating on
the project, pointers to the input data and metadata associated with the project, and
parameters that control the processing.
The workflow user uses the sobjects
command line tool to initialize the project (usually with the sobjects prepare
command) and to execute (sobjects run) the associated
pipeline.
The sobjects prepare command performs the following steps:
1. creates an object graph (using the build_object_graph.py script from the pipeline) and stores it in the project directory (<project directory>/OG.json);
2. creates in the project directory an object directory for all objects in the object graph that have symlink.<name> parameters (<project directory>/<object type>/<object id>);
3. creates the <pipeline directory>/Snakefile that is subsequently used by Snakemake; and
4. creates the symbolic links for all objects that have symlink.<name> parameters.
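The handling of symlink.<name> parameters (the creation of object directories and of the symbolic links inside them) can be illustrated with a small stand-alone sketch. It is not the sobjects implementation: the function name link_symlink_parameters and the representation of objects as (object type, object id, parameters) tuples are assumptions made for the illustration.

```python
import os
from pathlib import Path

PREFIX = "symlink."


def link_symlink_parameters(project_dir, objects):
    """Create object directories and symbolic links for symlink.<name> parameters.

    `objects` is an iterable of (object type, object id, parameters) tuples;
    this representation is hypothetical and used only in this sketch.
    Returns the list of link paths, in the order they were processed.
    """
    links = []
    for obj_type, obj_id, params in objects:
        # Keep only the parameters whose names start with "symlink.".
        wanted = {k[len(PREFIX):]: v for k, v in params.items() if k.startswith(PREFIX)}
        if not wanted:
            continue  # no directory is created here for objects without links
        obj_dir = Path(project_dir) / obj_type / obj_id
        obj_dir.mkdir(parents=True, exist_ok=True)
        for name, source in wanted.items():
            link = obj_dir / name
            if not link.is_symlink():
                os.symlink(source, link)
            links.append(link)
    return links
```

For an object of type individual with id ann and the parameter symlink.sample.bam, the sketch creates <project directory>/individual/ann/sample.bam as a link to the parameter value and leaves objects without such parameters untouched.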
The targets and the log files created during the execution of the pipeline (sobjects run) are
stored in the object directories. If an object directory was not created in step 2 of the sobjects prepare command, Snakemake creates it automatically.
In addition, Snakemake creates its standard internal
subdirectory <project directory>/.snakemake.
so_project.yaml file
The so_project.yaml file contains the project parameters that configure the
project and may include:
- a so_pipeline parameter that points to the pipeline directory for the pipeline that will operate on the project (relative paths are resolved relative to the project directory);
- parameters pointing to extra directories with shell, python, etc. scripts;
- parameters pointing to the input data that will be used by the project;
- parameters pointing to the meta-data describing the project’s input;
- a default_snakemake_args parameter that specifies the command line arguments that are passed to Snakemake at every invocation of sobjects run.
Parameter values may contain the meta expressions [E:<env_variable_name>], [PP:parameter], [D:project], and [D:pipeline], which are replaced by the interpolation function:
- [E:<env_variable_name>] is replaced by the value of the environment variable called env_variable_name;
- [PP:parameter] is replaced by the value of the parameter called parameter in the so_project.yaml file;
- [D:project] is replaced by the project directory;
- [D:pipeline] is replaced by the pipeline directory.
Interpolation is applied to all project parameters. A parameter that does not contain any of the above meta expressions remains unaffected; parameters represented by lists and dictionaries are processed recursively by applying interpolation to all of their members. Parameters that are defined for objects in build_object_graph.py can be referred to in the snakefile rules with the expression [P:parameter] or with the function P().
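The interpolation rules can be illustrated with a simplified re-implementation. This is a sketch, not the snakeobjects code: the function name interpolate and its arguments are assumptions, it does not resolve meta expressions nested inside a referenced parameter, and it leaves the [PP:par_name:parent_name] form used with parent projects unchanged.

```python
import os
import re

# Matches [E:name], [PP:name], [D:project] and [D:pipeline].
_META = re.compile(r"\[(E|PP|D):([^\]:]+)\]")


def interpolate(value, parameters, project_dir, pipeline_dir):
    """Recursively replace meta expressions in a parameter value (sketch)."""
    if isinstance(value, list):
        return [interpolate(v, parameters, project_dir, pipeline_dir) for v in value]
    if isinstance(value, dict):
        return {k: interpolate(v, parameters, project_dir, pipeline_dir)
                for k, v in value.items()}
    if not isinstance(value, str):
        return value  # numbers, booleans, etc. are left as they are

    def substitute(match):
        kind, name = match.groups()
        if kind == "E":
            return os.environ[name]          # environment variable
        if kind == "PP":
            return str(parameters[name])     # another project parameter
        if name == "project":
            return project_dir
        if name == "pipeline":
            return pipeline_dir
        return match.group(0)                # unknown [D:...]: keep as written

    return _META.sub(substitute, value)
```

With project parameters {"ref": "hg38"} and project directory /proj, the value "[D:project]/in/[PP:ref].fa" becomes "/proj/in/hg38.fa", while a value without meta expressions is returned unchanged.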
Snakeobjects also allows for a hierarchy of projects. In this case, the so_project.yaml file of a nested project should have the parameter so_parent_projects, for example:
so_parent_projects:
    "projectA": "path to projectA directory"
    "projectB": "path to projectB directory"
This allows parameters of the nested project to be interpolated with the values of the parent projects’ parameters, for example:
par_nested: "[PP:par_name:parent_name]"
The value of the par_nested parameter will be the same as the value of the parameter par_name in the parent project parent_name, where parent_name is one of the names listed under so_parent_projects (for example, projectA).
If projectA has its own parent project projectC and we would like to transfer the par_name parameter from projectC to our project, then the line becomes:
par_nested: "[PP:par_name:projectA/projectC]"
The parent_name in PP is optional. If it is not present, the value of par_name in the current project will be used.
Apart from the conda environment and the so_pipeline directory, the project may need some extra directories with shell, python, etc. scripts. In this case, the so_project.yaml file should have parameters such as so_environment_*_set or so_environment_*_add, for example:
so_environment_bin_add: "<path to bin directory>"
so_environment_python_add: "<path to python directory>"
These paths are added to the PATH and PYTHONPATH environment variables when sobjects executes the run and submit commands. If the so_pipeline directory has the subdirectories bin and/or python, these directories are added to PATH and PYTHONPATH without the need to specify them in so_project.yaml.
objects subdirectories
The files related to snakeobjects targets have the following general name:
<project directory>/<object type>/<object id>/<target name>
For example, the target T("A.txt") of an object of object type sample and with
id i1232 will be stored in the file <project directory>/sample/i1232/A.txt.
The general forms for the log.O, log.E, and log.T log files referenced
using the LFS(<name>) function are:
<project directory>/<object type>/<object id>/log/<name>-out.txt
<project directory>/<object type>/<object id>/log/<name>-err.txt
<project directory>/<object type>/<object id>/log/<name>-time.txt
respectively. For example, the error log file (log.E)
named A for the sample object i1232 is
<project directory>/sample/i1232/log/A-err.txt.
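The naming scheme above can be written down as two small helper functions. The names target_path and log_paths are hypothetical and exist only in this sketch; they mirror the layout described in this section and are not part of the snakeobjects API.

```python
from pathlib import Path


def target_path(project_dir, object_type, object_id, target_name):
    """File that holds a target of the given object."""
    return Path(project_dir) / object_type / object_id / target_name


def log_paths(project_dir, object_type, object_id, name):
    """Log files for the given log name, keyed as in log.O, log.E, log.T."""
    log_dir = Path(project_dir) / object_type / object_id / "log"
    return {
        "O": log_dir / f"{name}-out.txt",   # standard output
        "E": log_dir / f"{name}-err.txt",   # standard error
        "T": log_dir / f"{name}-time.txt",  # timing information
    }
```

For a project in /proj, target_path("/proj", "sample", "i1232", "A.txt") gives /proj/sample/i1232/A.txt, and log_paths("/proj", "sample", "i1232", "A")["E"] gives /proj/sample/i1232/log/A-err.txt.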
Object types, objects, and object graph
Object types in snakeobjects typically correspond to object types in the
domain of the workflow. For example, in sequence analysis workflows, we can
have reference genome, library, sample, individual, family, or
population object types. snakeobjects object types are characterized by
the set of targets that will be created for each object of the object type.
For example, sample may have targets T("sample.bam"),
T("sample.bai"), T("sample.vcf"), and T("depth-histogram.png");
reference genome object type may have targets T("chr.fa"),
T("chr.fa.fai"), and T("bwa.index").
Each snakeobjects project is associated with one object graph, a
structure representing a directed acyclic graph of objects (the
ObjectGraph class is the snakeobjects implementation of the object
graph and the objects in the object graph are implemented by the
OGO class). Each of the objects belongs to one of the
pipeline’s object types and is assigned an object id that must
be a unique string among all objects of the same object type (i.e., there can be
only one object of type individual with object id john).
Each object is also associated with a list of dependency objects. The dependency
objects are objects whose targets will be used in the creation of the targets
of the current object. A target, T(t) of an object is created by the rule
from the snakefile of the object’s object type that has the target in its
output clause (i.e., output: T(t)). The input clause of the rule may contain
other targets from the same object type (T()), targets in a
dependency object (DT()), or other files.
In addition, each object is associated with parameters, a dictionary mapping parameter names to parameter value strings that provide important information for the creation of the object’s targets.
Objects are typically created by the pipeline’s build_object_graph.py
script with the add() method of the
ObjectGraph. The order of the dependency objects is preserved, and
the DT() and DP() functions will use the order of the
breadth-first traversal of the object graph.
For example:
def run(project,OG):
    ...
    OG.add("individual","ann",{"symlink.sample.bam":"/data/bamFiles/ann.bam","diagnosis":"none"}, [])
    OG.add("individual","tom",{"symlink.sample.bam":"/data/bamFiles/tom.bam","diagnosis":"schizophrenia"}, [])
    OG.add("individual","liz",{"symlink.sample.bam":"/data/bamFiles/liz.bam","diagnosis":"autism"}, [])
    ...
    OG.add("family","johns",{},[OG['individual','ann'],OG['individual','tom'],OG['individual','liz']])
    ...
    OG.add("individuals","all",{},OG['individual'])
    ...
shows the creation of five objects. Three of the objects are of type
individual and have object ids ann, tom, and liz. Each of the
three individuals has two parameters, symlink.sample.bam and diagnosis,
and is not dependent on other objects, as indicated by the last argument, [],
of the add function. The symlink.sample.bam parameter is a special
parameter that will lead to the creation of a symbolic link called sample.bam in
the objects’ directories pointing to the bam files associated with each individual
(provided as values of the symlink.sample.bam parameters).
The fourth object is of type family, has object id equal to johns, has no parameters, and is dependent on
the three individuals, ann, tom, and liz. The last object (all of type individuals) is dependent on all
individuals included in the graph. That will include ann, tom, and liz but may include many more
individuals created in the parts of the run function that are not shown.
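The breadth-first order of dependency objects mentioned above can be illustrated on a plain dictionary that stands in for the object graph. This is a sketch of the traversal order only, not of how DT() and DP() are implemented; the function name, the dictionary representation, and the reference object hg38 are assumptions added for the illustration.

```python
from collections import deque


def dependencies_breadth_first(deps, start):
    """Return the objects reachable from `start` in breadth-first order.

    `deps` maps an object key, here an (object type, object id) tuple, to the
    ordered list of keys of the objects it directly depends on.
    """
    order, seen, queue = [], {start}, deque([start])
    while queue:
        current = queue.popleft()
        for dep in deps.get(current, []):
            if dep not in seen:      # visit every object only once
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order


# The family from the example above; the reference object is hypothetical.
deps = {
    ("family", "johns"): [("individual", "ann"), ("individual", "tom"), ("individual", "liz")],
    ("individual", "ann"): [("reference", "hg38")],
    ("individual", "tom"): [("reference", "hg38")],
}
order = dependencies_breadth_first(deps, ("family", "johns"))
```

Here order lists ann, tom, and liz first, in the order in which they were given as dependencies, followed by the reference object they share.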
During the execution of the workflow, the targets of the project’s objects are created and stored
in files in the object subdirectories of the project directory.