Components¶
This part explains how to make new components. You need to know the following definitions first:
seed: a seed is a computer application, a program or in general a command line tool that performs a specific task. This can be a simple bash copy command or a software suite like MutationSeq or Strelka.
component: a component is a wrapper around a seed that makes the seed compatible with kronos so that the seed can then be used as part of a pipeline. In other words, components are the building blocks of the pipelines generated by kronos.
Develop a component¶
The purpose of components is to modularize workflows with reusable building blocks that require minimal development. Making a new component takes very few lines of code. The simple development instructions eliminate, for example, the need for Ruffus decorators, regex-based input/output management, and complicated dependency management in the code, which can easily become very complex as the number of tasks in a workflow grows. Furthermore, a large workflow can be divided into a set of small components, which makes workflow development much faster and more manageable.
All command line tools can be used as seeds and therefore wrapped as Kronos components. Regardless of how complicated they are, their corresponding components have a standard directory structure composed of specific wrappers and sub-directories. The wrappers are agnostic to the programming language used for developing the seed.
Components should be developed before the workflow is made. However, because they are developed independently and are reusable, a component is developed only once and can then be used in various pipelines.
In order to develop a component you need to:
- create a directory whose name is the name you want for the component, e.g. my_comp.
- create a directory called “component_seed” inside the my_comp directory and copy the seed source code into it.
- create the following files inside the my_comp directory:
  - __init__.py: an empty file.
  - component_main.py: the main python script that contains the Component class.
  - component_params.py: contains all the information about input/output parameters of the component.
  - component_reqs.py: contains all the information about the requirements of the component.
  - component_ui.py: an argparse UI for the component.
Tip
All the above files and directories are generated by the Kronos make_component command.
The user only needs to customize them.
Tip
The seed source code is not required to be copied into the component_seed directory. Instead, the seed can be used as a requirement for the component, which can be listed in component_reqs.py.
The component directory tree looks like the following:
|-- <component_name>
| |-- __init__.py
| |-- component_main.py
| |-- component_params.py
| |-- component_reqs.py
| |-- component_ui.py
| |-- component_seed
Note
<component_name> should be replaced with the actual name of the component, e.g. my_comp.
The rest of the file and directory names should be exactly as shown above.
Tip
It is recommended to add the following files and directories (not generated automatically) as well:
- component_test: a directory where all the test files exist.
- README: a readme file to provide more information about the component.
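The layout described above can also be reproduced by hand with a few lines of Python. The following is only an illustrative sketch, not the implementation of the Kronos make_component command: the helper name scaffold_component is hypothetical, and the files it creates are empty placeholders rather than the generated templates.

```python
import tempfile
from pathlib import Path

# File and directory names taken from the component directory tree above.
COMPONENT_FILES = (
    "__init__.py",
    "component_main.py",
    "component_params.py",
    "component_reqs.py",
    "component_ui.py",
)
SEED_DIR = "component_seed"


def scaffold_component(parent_dir, component_name):
    """Create an empty component skeleton (hypothetical helper)."""
    root = Path(parent_dir) / component_name
    # Creating the seed directory with parents=True also creates the root.
    (root / SEED_DIR).mkdir(parents=True, exist_ok=True)
    for file_name in COMPONENT_FILES:
        (root / file_name).touch()
    return root


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = scaffold_component(tmp, "my_comp")
        for path in sorted(root.iterdir()):
            print(path.name)
```

In practice the make_component command is the better choice, because it also fills the files with the TODO-marked templates shown in the following sections.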
Component_main¶
The core of a component is the component_main.py python script.
This module defines the Component class, which extends the ComponentAbstract class.
Using the make_component command, the following component_main.py file is generated:
"""
component_main.py
This module contains Component class which extends
the ComponentAbstract class. It is the core of a component.
Note the places you need to change to make it work for you.
They are marked with keyword 'TODO'.
"""
from kronos.utils import ComponentAbstract
import os


class Component(ComponentAbstract):
    """
    TODO: add component doc here.
    """

    def __init__(self, component_name="my_comp",
                 component_parent_dir=None, seed_dir=None):
        ## TODO: pass the version of the component here.
        self.version = "v0.99.0"

        ## initialize ComponentAbstract
        super(Component, self).__init__(component_name,
                                        component_parent_dir, seed_dir)

    ## TODO: write the focus method if the component is parallelizable.
    ## Note that it should return cmd, cmd_args.
    def focus(self, cmd, cmd_args, chunk):
        pass
        # return cmd, cmd_args

    ## TODO: this method should make the command and command arguments
    ## used to run the component_seed via the command line. Note that
    ## it should return cmd, cmd_args.
    def make_cmd(self, chunk=None):
        ## TODO: replace 'comp_req' with the actual component
        ## requirement, e.g. 'python', 'java', etc.
        cmd = self.requirements['comp_req']
        cmd_args = []
        args = vars(self.args)

        ## TODO: fill the following component params to seed params dictionary
        ## if the name of parameters of the seed are different than
        ## component parameter names.
        comp_seed_map = {
            #e.g. 'component_param1': 'seedParam1',
            #e.g. 'component_param2': 'seedParam2',
        }

        for k, v in args.items():
            if v is None or v is False:
                continue

            ## TODO: uncomment the next line if you are using
            ## comp_seed_map dictionary.
            # k = comp_seed_map[k]
            cmd_args.append('--' + k)

            if isinstance(v, bool):
                continue
            if isinstance(v, str):
                v = repr(v)
            if isinstance(v, (list, tuple)):
                cmd_args.extend(v)
            else:
                cmd_args.extend([v])

        if chunk is not None:
            cmd, cmd_args = self.focus(cmd, cmd_args, chunk)

        return cmd, cmd_args


## To run as stand alone
def _main():
    c = Component()
    c.args = component_ui.args
    c.run()


if __name__ == '__main__':
    import component_ui
    _main()
Note
The places you need to change in the generated file to make it work for you are marked with the keyword ‘TODO’.
There are two methods in this file that you need to customize:
focus method¶
Each parallelizable component requires a focus method.
The purpose of this method is to tell the component to process only one chunk of the input data rather than the entire file.
How this is done varies from component to component, but it basically adds to or alters the component command to this end.
For example, in the following implementation, the focus method simply passes the chunk to the --interval option in the command arguments cmd_args (most of the time, this implementation does the job):
def focus(self, cmd, cmd_args, chunk):
    cmd_args.append('--interval ' + chunk)
    return cmd, cmd_args
Note
You need to implement the focus method only if the component is parallelizable.
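To make the effect of focus concrete, the sketch below applies the same idea outside of Kronos: the base command stays fixed and each chunk yields its own argument list. The chunk names, the seed command, and the loop that calls focus are assumptions for illustration; in a real pipeline Kronos decides how chunks are produced and passed in. Unlike the method shown above, this stand-alone function copies the argument list, so arguments from one chunk do not accumulate in the next.

```python
def focus(cmd, cmd_args, chunk):
    """Return the command restricted to a single chunk."""
    # Copy the list so the arguments of one chunk do not leak into the next.
    focused_args = list(cmd_args) + ['--interval ' + chunk]
    return cmd, focused_args


if __name__ == "__main__":
    base_cmd = 'python my_seed_command.py'   # hypothetical seed command
    base_args = ['--foo', 'data1']
    for chunk in ['1', '2', 'X']:            # hypothetical chunk names
        cmd, args = focus(base_cmd, base_args, chunk)
        print(cmd, ' '.join(args))
```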
make_cmd method¶
All components should implement this method in their component_main.py.
This method essentially returns the command, together with its arguments, that one can use to run the seed on a command line.
For example, if the seed can be run using the following command:
python my_seed_command.py --foo data1 --bar data2
then the make_cmd method would look like this (note that we only need to change the first two lines of the default method body made by kronos):
def make_cmd(self, chunk=None):
    path = os.path.join(self.seed_dir, 'my_seed_command.py')
    cmd = self.requirements['python'] + ' ' + path
    cmd_args = []
    args = vars(self.args)

    ## TODO: fill the following component params to seed params dictionary
    ## if the name of parameters of the seed are different than
    ## component parameter names.
    comp_seed_map = {
        #e.g. 'component_param1': 'seedParam1',
        #e.g. 'component_param2': 'seedParam2',
    }

    for k, v in args.items():
        if v is None or v is False:
            continue

        ## TODO: uncomment the next line if you are using
        ## comp_seed_map dictionary.
        # k = comp_seed_map[k]
        cmd_args.append('--' + k)

        if isinstance(v, bool):
            continue
        if isinstance(v, str):
            v = repr(v)
        if isinstance(v, (list, tuple)):
            cmd_args.extend(v)
        else:
            cmd_args.extend([v])

    if chunk is not None:
        cmd, cmd_args = self.focus(cmd, cmd_args, chunk)

    return cmd, cmd_args
Tip
In the above example, python is a requirement for the component and should be added to the component_reqs.py of the component.
Also, the parameters foo and bar should be added to the component_params.py.
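The argument loop inside make_cmd can be tried on its own, without Kronos installed. The sketch below extracts it into a stand-alone function; the name build_cmd_args and the sample namespace are hypothetical. One deliberate difference from the generated code is the use of comp_seed_map.get(k, k), an assumption made here so that parameters missing from the map keep their component names instead of raising a KeyError.

```python
import argparse


def build_cmd_args(args, comp_seed_map=None):
    """Turn an argparse namespace into a flat list of command-line arguments."""
    comp_seed_map = comp_seed_map or {}
    cmd_args = []
    for k, v in vars(args).items():
        # Skip unset options and flags that are switched off.
        if v is None or v is False:
            continue
        # Assumption: fall back to the component name when no mapping exists.
        cmd_args.append('--' + comp_seed_map.get(k, k))
        # A flag that is switched on takes no value.
        if isinstance(v, bool):
            continue
        # Quote strings, as the generated make_cmd does.
        if isinstance(v, str):
            v = repr(v)
        if isinstance(v, (list, tuple)):
            cmd_args.extend(v)
        else:
            cmd_args.append(v)
    return cmd_args


if __name__ == "__main__":
    ns = argparse.Namespace(foo='data1', bar='data2', verbose=True, skip=None)
    print(build_cmd_args(ns))
    print(build_cmd_args(ns, {'foo': 'fooFile'}))
```

Options set to None or False are dropped, options set to True become bare switches, and every other option is emitted as a name followed by its value.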
ComponentAbstract class¶
This class comprises the following attributes and methods:
Attributes:
Attribute | Description |
---|---|
args | the argparse namespace containing all the input arguments from the Component_ui module |
components_dir | path to the directory where the component exists |
component_name | name of the component - specific |
component_params | Component_params module of the component |
component_reqs | Component_reqs module of the component |
env_vars | see Component_reqs |
memory | see Component_reqs |
parallel | see Component_reqs |
requirements | see Component_reqs |
seed_dir | path to the directory where the seed exists. Most of the time it is <component_name>/component_seed |
seed_version | version of the seed |
version | version of the component - specific |
Tip
specific means the attribute should be assigned a value when implementing a component.
Methods:
Method | Description |
---|---|
__init__ | initialize general attributes that each component must have |
run | run the component command locally |
focus | update the command and command arguments for each chunk - virtual |
make_cmd | make the command used to run the seed of the component. This returns the same command that one would use to run the component as a stand alone program via command line - virtual |
test | run unittest of the component - virtual |
Tip
The class can be imported from the utils module of the kronos package:
from kronos.utils import ComponentAbstract
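The division of labour between the base class and a component can be pictured with a small stand-in. The sketch below is not the Kronos ComponentAbstract: ToyComponentAbstract, EchoComponent, and the command_line helper are hypothetical and only mirror the contract described in the tables above, where the base class holds the shared attributes and each component overrides make_cmd (and focus when it is parallelizable).

```python
class ToyComponentAbstract(object):
    """Hypothetical stand-in mirroring the attributes and methods listed above."""

    def __init__(self, component_name, seed_dir=None):
        self.component_name = component_name
        self.seed_dir = seed_dir
        self.requirements = {}
        self.args = None

    def focus(self, cmd, cmd_args, chunk):
        # Virtual: only parallelizable components override this.
        raise NotImplementedError("focus is not implemented")

    def make_cmd(self, chunk=None):
        # Virtual: every component must override this.
        raise NotImplementedError("make_cmd is not implemented")

    def command_line(self, chunk=None):
        # Join what make_cmd returns into one printable command line.
        cmd, cmd_args = self.make_cmd(chunk)
        return ' '.join([cmd] + [str(a) for a in cmd_args])


class EchoComponent(ToyComponentAbstract):
    """Hypothetical component wrapping the shell 'echo' command."""

    def make_cmd(self, chunk=None):
        cmd = self.requirements['echo']
        cmd_args = ['hello']
        if chunk is not None:
            cmd, cmd_args = self.focus(cmd, cmd_args, chunk)
        return cmd, cmd_args

    def focus(self, cmd, cmd_args, chunk):
        return cmd, cmd_args + [chunk]


if __name__ == "__main__":
    comp = EchoComponent('echo_comp')
    comp.requirements = {'echo': '/bin/echo'}
    print(comp.command_line())
    print(comp.command_line(chunk='chunk1'))
```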
Component_params¶
This is a python module and contains the following information:
- input_files: a dictionary whose keys are the input file parameters and whose values are the default values or the proper flags based on the component UI. For example:
input_files={'samples':['tumour:__REQUIRED__',
'normal:__REQUIRED__',
'reference:__REQUIRED__',
'model:__REQUIRED__'
],
'config':'some_default.cfg',
'positions_file':None
}
Note
This dictionary includes only parameters that expect input files or directories.
- output_files: a dictionary whose keys are the output file parameters and whose values are the default values or the proper flags based on the component UI. For example:
output_files = {'export_features':None,
'log_file':'mutationSeq_run.log',
'out':None
}
Note
This dictionary includes only parameters that expect output files or directories.
- input_params: a dictionary whose keys are the input non-file parameters and whose values are the default values or the proper flags based on the component UI. For example:
input_params = {'all':'__FLAG__',
'buffer_size':'2G',
'coverage':4,
'deep':'__FLAG__',
'interval':None,
'no_filter':'__FLAG__'
}
Note
All other parameters that are not included in input_files and output_files should be listed in input_params.
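As a reading aid, the values in these dictionaries can be sorted into groups with a short helper. This sketch assumes that __REQUIRED__ marks a mandatory parameter and __FLAG__ marks a switch, and it treats every other value (including None) as a default; the function summarize_params is hypothetical and not part of Kronos.

```python
def summarize_params(params):
    """Group parameter names by the kind of value declared for them."""
    summary = {'required': [], 'flags': [], 'defaults': {}}
    for name, value in sorted(params.items()):
        if value == '__REQUIRED__':
            summary['required'].append(name)
        elif value == '__FLAG__':
            summary['flags'].append(name)
        else:
            # Anything else (including None) is treated as a default value.
            summary['defaults'][name] = value
    return summary


if __name__ == "__main__":
    input_params = {'all': '__FLAG__',
                    'buffer_size': '2G',
                    'coverage': 4,
                    'deep': '__FLAG__',
                    'interval': None,
                    'no_filter': '__FLAG__'}
    print(summarize_params(input_params))
```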
Using the make_component command, the following component_params.py file is generated:
"""
component_params.py
Note the places you need to change to make it work for you.
They are marked with keyword 'TODO'.
"""
## TODO: here goes the list of the input files. Use flags:
## '__REQUIRED__' to make it required
## '__FLAG__' to make it a flag or switch.
input_files = {
# 'input_file1' : '__REQUIRED__',
# 'input_file2' : None
}
## TODO: here goes the list of the output files.
output_files = {
# 'output_file1' : '__REQUIRED__',
# 'output_file2' : None
}
## TODO: here goes the list of the input parameters excluding input/output files.
input_params = {
# 'input_param1' : '__REQUIRED__',
# 'input_param2' : '__FLAG__',
# 'input_param3' : None
}
## TODO: here goes the return value of the component_seed.
## DO NOT USE, Not implemented yet!
return_value = []
Example:
This is an example showing the content of a component_params.py file:
input_files = {'tumour':'__REQUIRED__',
'normal':'__REQUIRED__',
'reference':'__REQUIRED__',
'model':'__REQUIRED__',
'config':'metadata.config',
'positions_file':None
}
output_files = {'export_features':None,
'log_file':'mutationSeq_run.log',
'out':'__REQUIRED__'
}
input_params = {'all':'__FLAG__',
'buffer_size':'2G',
'coverage':4,
'deep':'__FLAG__',
'interval':None,
'no_filter':'__FLAG__',
'normalized':'__FLAG__',
'normal_variant':25,
'purity':70,
'mapq_threshold':20,
'baseq_threshold':10,
'indl_threshold':0.05,
'manifest':'__OPTIONAL__',
'single':'__FLAG__',
'threshold':0.5,
'tumour_variant':2,
'features_only':'__FLAG__',
'verbose':'__FLAG__',
'titan_mode':'__FLAG__'
}
Component_reqs¶
This is a python module and contains the following information:
- env_vars: a dictionary whose keys are the names of environment variables and whose values are the paths/content to export. The values can be updated in the configuration file using env_var in the run subsection. Therefore, it is recommended not to include the paths as values in this file and instead to use an empty list, [], or None as a value.
- memory: specifies the minimum memory required by the component to run properly on a cluster. The format is nG, e.g. 30G.
- parallel: a boolean flag that specifies whether or not a component can run in parallel mode.
- requirements: a dictionary whose keys are usually the names of programs/software and whose values are None or the flag __REQUIRED__. The values will later be updated by kronos using the content of the __GENERAL__ section.
- seed_version: the version of the seed.
- version: the version of the component.
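The way requirement values get filled in can be pictured as a dictionary lookup. The sketch below is an assumption about the general idea, not the code Kronos runs: the function resolve_requirements, the general_section dictionary standing in for the __GENERAL__ section, and the path in it are all hypothetical.

```python
def resolve_requirements(requirements, general_section):
    """Fill requirement values from a mapping of program names to paths."""
    resolved = {}
    missing = []
    for name, value in requirements.items():
        if name in general_section:
            resolved[name] = general_section[name]
        elif value == '__REQUIRED__':
            # A required program with no configured path cannot be resolved.
            missing.append(name)
        else:
            resolved[name] = value
    if missing:
        raise ValueError('missing requirements: ' + ', '.join(sorted(missing)))
    return resolved


if __name__ == "__main__":
    requirements = {'java': '__REQUIRED__', 'samtools': None}
    general_section = {'java': '/usr/bin/java'}   # hypothetical path
    print(resolve_requirements(requirements, general_section))
```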
Using the make_component command, the following component_reqs.py file is generated:
"""
component_reqs.py
Note the places you need to change to make it work for you.
They are marked with keyword 'TODO'.
"""
## TODO: here goes the list of the environment variables, if any,
## required to export for the component to function properly.
env_vars = {
# 'env_var1' : ['value1', 'value2'],
# 'env_var2' : 'value3'
}
## TODO: here goes the max amount of the memory required.
memory = '5G'
## TODO: set this to True if the component is parallelizable.
parallel = False
## TODO: here goes the list of the required software/apps
## called by the component.
requirements = {
# 'python': '__REQUIRED__',
}
## TODO: here goes the version of the component seed.
seed_version = '0.99.0'
## TODO: here goes the version of the component itself.
version = '0.99.0'
Example:
This is an example showing the content of a component_reqs.py file:
env_vars = {'LD_LIBRARY_PATH': []}
memory = '4G'
parallel = True
requirements = {'java': '__REQUIRED__'}
seed_version = 'version 3.2'
version = 'v1.0.1'
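A quick sanity check of such values can catch typos before a pipeline is launched. The sketch below is hypothetical and not part of Kronos; it only encodes the conventions stated in this section (memory written as nG, parallel as a boolean, requirements and env_vars as dictionaries).

```python
import re

# The documented memory format: an integer followed by 'G', e.g. '30G'.
MEMORY_PATTERN = re.compile(r'^\d+G$')


def check_reqs(memory, parallel, requirements, env_vars):
    """Return a list of problems found in component_reqs-style values."""
    problems = []
    if not isinstance(memory, str) or not MEMORY_PATTERN.match(memory):
        problems.append("memory should look like '30G'")
    if not isinstance(parallel, bool):
        problems.append('parallel should be True or False')
    if not isinstance(requirements, dict):
        problems.append('requirements should be a dictionary')
    if not isinstance(env_vars, dict):
        problems.append('env_vars should be a dictionary')
    return problems


if __name__ == "__main__":
    print(check_reqs('4G', True, {'java': '__REQUIRED__'}, {'LD_LIBRARY_PATH': []}))
    print(check_reqs('4 GB', 'yes', {}, {}))
```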
Component_ui¶
This is a python module that contains an argparse UI for the component.
Using the make_component command, the following component_ui.py file is generated:
"""
component_ui.py
Note the places you need to change to make it work for you.
They are marked with keyword 'TODO'.
"""
import argparse
#==============================================================================
# make a UI
#==============================================================================
## TODO: pass the name of the component to the 'prog' parameter and a
## brief description of your component to the 'description' parameter.
parser = argparse.ArgumentParser(prog='my_comp',
description = """
brief description of your component goes here.""")
## TODO: create the list of input options here. Add as many as desired.
parser.add_argument(
"-x", "--xparam",
default = None,
help= """
help message goes here.
""")
## parse the argument parser.
args, unknown = parser.parse_known_args()
Example:
This is an example showing the content of a component_ui.py file:
import sys
import argparse
#==============================================================================
# make a UI
#==============================================================================
parser = argparse.ArgumentParser(prog='snpeff',
description='''Genetic variant annotation and effect
prediction toolbox. It annotates and predicts the
effects of variants on genes (such as amino acid changes)''',
epilog='''Input file: Default is STDIN''')
# required arguments
required_arguments = parser.add_argument_group("Required arguments")
required_arguments.add_argument("--out",
default=None,
required=True,
help='''specify the path/to/out.vcf to save output to a file''')
# mandatory / positional arguments
required_arguments.add_argument("genome_version",
choices=['GRCh37.66'],
help='''genomic build version''')
required_arguments.add_argument("variants_file",
help='''file containing variants''')
# optional options
optional_options = parser.add_argument_group("Options")
optional_options.add_argument("-a", "--around",
default=False, action="store_true",
help='''Show N codons and amino acids around change
(only in coding regions). Default is 0 codons.''')
args, unknown = parser.parse_known_args()
Warning
It is required to use parse_known_args instead of parse_args.
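The difference between the two calls is easy to see in isolation. parse_known_args returns the parsed namespace together with a list of the arguments it did not recognize, whereas parse_args exits with an error on any unrecognized argument. The option --other below is hypothetical, and the reason Kronos may pass extra arguments to a component is not covered here, so treat this only as a demonstration of the argparse behaviour.

```python
import argparse

parser = argparse.ArgumentParser(prog='my_comp')
parser.add_argument('-x', '--xparam', default=None, help='example option')

argv = ['--xparam', 'value', '--other', 'extra']   # '--other' is not declared

# parse_known_args tolerates the undeclared option and hands it back.
args, unknown = parser.parse_known_args(argv)
print(args.xparam)
print(unknown)

# parse_args reports an error and raises SystemExit for the same input.
try:
    parser.parse_args(argv)
except SystemExit as exc:
    print('parse_args exited with code', exc.code)
```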
component_seed¶
This is a directory within the component directory where all the source code of the actual program resides.
Examples¶
Please refer to our GitHub repositories for more examples. The repositories with the postfix _workflow are the pipelines and the rest are the components.