IGoR is C++ software designed to infer V(D)J recombination-related processes from sequencing data, such as:
- Recombination model probability distribution
- Hypermutation model
- Best candidate recombination scenarios
- Generation probabilities of sequences (even hypermutated ones)
The following article describes the methodology, performance tests and some new biological results obtained with IGoR:
High-throughput immune repertoire analysis with IGoR, Nature Communications, (2018) Quentin Marcou, Thierry Mora, Aleksandra M. Walczak
Its heavily object-oriented and modular style was designed to ensure long-term support and evolvability for new tasks in assessing T-cell and B-cell receptor (TCR and BCR) features using modern parallel architectures.
IGoR is a free (as in freedom) software released under the GNU-GPLv3 license.
Version
Latest released version: 1.4.0
Dependencies
- a C++ compiler supporting OpenMP 3.8 or higher and POSIX Threads (pthread), such as GCC (GNU Compiler Collection)
- GSL library: a subset of the library is shipped with IGoR and will be statically linked to IGoR’s executable to avoid dependencies
- jemalloc (optional, although recommended for full parallel proficiency) memory allocation library: also shipped with IGoR to avoid dependency issues (requires a pthreads-compatible compiler)
- bash
- autotools suite, asciidoctor, pygments (optional), doxygen and the LaTeX suite if building from unpackaged sources
Install
IGoR uses the autotools suite for compilation and installation in order to ensure portability to many systems.
Installing from packaged releases (recommended)
First download the latest released package on the Release page. Extract the files from the archive.
Installing from unpackaged sources (by cloning or direct download of the repository)
For this you will need git and all the other dependencies mentioned above. Note that this is the most convenient way to keep IGoR up to date but involves a few extra installation steps. Using git, clone the repository where you desire. Go to the created directory and run the autogen.sh bash script. This will create the configure script. From this stage on, the installation rules are the same as for packaged developer sources. From git you can choose between two branches: the master branch corresponds to the latest stable (packaged) release, while the dev branch is the most up-to-date branch, including current developments until they are issued in the next release. The dev branch is therefore more bug prone; however, it is the natural branch for people ready to help with development (even only by functionality testing).
A (sadly) non-exhaustive list of potential installation problems follows in the next section. If your problem is not referenced there please open a GitHub issue. If you end up finding a solution by yourself please help us append it to the following list and help the user community.
To upgrade IGoR, uninstall your previously installed version and install the new one.
Linux
Widely tested on several Debian related distros. Install gcc/g++ if not already installed (note that another compiler could be used). With the command line go to IGoR’s root directory and simply type ./configure. This will run various checks on your system and create makefiles compatible with your system configuration. Many options can be appended to ./configure, such as ./configure CC=gcc CXX=g++ to enforce the use of gcc as compiler. The full set of the configure script options can be found here.
Once over, type make to compile the sources (this will take a few minutes). IGoR’s executable will appear in the igor_src folder.
Finally, in order to access all of IGoR’s features, install IGoR by typing make install. This will install IGoR’s executable, supplied models and manual in your system’s default location (note that depending on this location you might require administrator privileges and need the sudo prefix). If you do not have administrator privileges, IGoR can be installed locally in the folder of your choice by passing --prefix=/your/custom/path upon calling the configure script (e.g. ./configure --prefix=$HOME). Other configure options can be accessed using ./configure -h.
As a brief summary for default installation use the following set of commands:
./configure (1)
make (2)
make install (3)
1 | Specify your custom installation options at this step. |
2 | Compile the sources before installation. |
3 | Install IGoR. |
Clean uninstallation of IGoR (e.g. before upgrading IGoR to a newer version) is obtained via the make uninstall command.
MacOS
MacOS ships with a different compiler (Clang, installed with Xcode) that is invoked when calling gcc (through name aliasing) and does not support OpenMP. In order to compile an OpenMP application with gcc you will first need to install Macports or Homebrew and install gcc from there.
First, if not already present on your system, install Xcode through the application store.
Macports can be found here. Download and install the version corresponding to your MacOS version.
Once installed, use Macports to install GCC:
sudo port selfupdate #Update macports database
sudo port install gcc6 #install gcc version 6
The full list of available GCC versions is available here, select a sufficiently recent one to get C++11 standards enabled. In order to set GCC as your default compiler use the following commands:
port select --list gcc #Will list the versions of gcc available on your system
sudo port select --set gcc mp-gcc6 #set the one you wish to have as default call upon using the gcc command
If you prefer to use Homebrew over Macports, it can be downloaded and installed here.
Then install GCC using the following command:
brew install gcc
Note: if you decide to use Homebrew you should apparently refrain from assigning the newly installed gcc to the gcc command (see this page for more details). You will thus have to pass the correct compiler instructions to the configure script with the CC and CXX flags.
Alternatively you could also install GCC directly from sources as described by this guide.
Once done, simply follow instructions from the Linux installation section to complete IGoR’s installation.
Windows (not tested)
The configure script relies on bash to work. A first step would be to
download a bash interpreter (such as Cygwin or MinGW) and a compiler.
Open the command line of the one of your choice and use
./configure;make
Troubleshoots
Here is a list of some installation problems that have been reported and their corresponding solutions:
Issue | Reason | Solution |
---|---|---|
In file included from Aligner.cpp:8: /n ./Aligner.h:19:10: fatal error: 'omp.h' file not found /n #include <omp.h> |
The compiler used is not supporting OpenMP |
Make sure you have an OpenMP compatible compiler installed (such as GCC). If such a compiler is installed, make sure the right compiler is called upon compiling. In order to specify a specific compiler to use (such as mc-gcc6 for macport installed gcc under MacOS) pass the following option upon executing the configure script: `./configure CC=mc-gcc6 CXX=mc-g++6`. The _CC_ option will specify the C compiler to use to compile jemalloc and gsl, while _CXX_ specifies the C++ compiler to use to compile IGoR sources. |
aclocal-1.15: command not found; WARNING: 'aclocal-1.15' is missing on your system.; make: *** [aclocal.m4] Error 127 |
The configure script relies on file timestamps to assess whether it is up to date. These time stamps might be compromised when extracting files from the archive. |
Run the following command in IGoR root directory:
|
.libs/sasum.o: No such file or directory error at compile time |
Unknown |
Running |
undefined reference to symbol 'clock_gettime@@GLIBC_2.2.5' at link time |
Jemalloc used an extra library to extract system time |
Run the last command printed to the screen (g++ -std=gnu++11 -I./../libs/jemalloc/include/jemalloc -I./../libs/gsl_sub -fopenmp …… -lpthread -ldl -fopenmp) and add -lrt after -ldl. This will be automated and fixed soon |
src/jemalloc.c:241:1: error: initializer element is not constant ; static malloc_mutex_t init_lock = MALLOC_MUTEX_INITIALIZER; |
Might be related to MacOS Sierra? |
Unknown |
Undefined symbols for architecture x86_64: "comp_nt_int(int const&, int const&)", referenced from: Deletion::iterate(double&, Enum_fast_memory_map<Seq_type, double>&,… |
Unknown issue with GCC8, cf issue #22 |
Downgrade your GCC version to a 7.X version. |
Workflow
As a preprocessing step IGoR first needs to align the genomic templates to the read (-align, see detailed commands in the Alignments commands section) before exploring all putative recombination scenarios for this read. After aligning, IGoR can be used to infer a recombination model (-infer, see the Inference/Evaluation section) or to evaluate sequence statistics (-evaluate) using an already inferred model. Synthetic sequences can be generated from a learned model (such as one supplied by IGoR, or one inferred de novo through the -infer command) with the -generate command (see the Sequence generation section).
Predefined genomic templates and models
IGoR is shipped with a set of genomic templates and already inferred models from [1].
In order to use the predefined models and demo IGoR must have been installed on your system.
Available options are listed below:
Species | Chains |
---|---|
human | TRA (or alpha), TRB (or beta), IGH (or heavy chain), IGL (or lambda light chain), IGK (or kappa light chain) |
mouse | TRB (or beta) |
If you are working on datasets not present in this list, refer to the Advanced Usage section and/or contact us for assistance. Help us fill this database for other users and share the resulting models with us!
Validity of the recombination and error models
Some text discussing the validity of error and recombination models
Runtimes
As runtimes may evolve with IGoR’s maturation, below is a table recapitulating the latest per sequence runtimes for different tasks on different chains:
Chain/Read | (Pre)Alignments time (seconds) | Probabilistic treatment time (seconds) |
---|---|---|
TRA 100bp | 0.3 | 10-4 |
TRB 60bp | 0.1 | 0.1 |
IGH 130bp | 0.2 | 0.2 |
Command line tools
Although the full flexibility of IGoR is reachable through high-level C++ functions (see the C++ section), we provide some command line options to perform the most frequent tasks on immune receptor sequences.
Command options are nested arguments; the general organization of the commands follows -arg1 --subarg1 ---subsubarg1 to reach the different levels.
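To make the nesting convention concrete, here is a minimal Python sketch that groups command-line tokens by their number of leading dashes. It only illustrates the convention described above and is not IGoR's own argument parser; the function name `group_nested_args` and the regular expression are assumptions introduced for this example.

```python
import re

# An argument token is 1 to 3 dashes followed by a name starting with a letter,
# so values such as "-1.5" or "/tmp/igor" are not mistaken for arguments.
ARG_PATTERN = re.compile(r"^(-{1,3})([A-Za-z_]\w*)$")


def group_nested_args(tokens):
    """Group tokens into (level, name, values) tuples.

    level is the number of leading dashes (1 = argument, 2 = sub-argument,
    3 = sub-sub-argument); values are the plain tokens that follow the name.
    """
    groups = []
    for token in tokens:
        match = ARG_PATTERN.match(token)
        if match:
            groups.append((len(match.group(1)), match.group(2), []))
        elif groups:
            groups[-1][2].append(token)
        else:
            raise ValueError(f"value {token!r} appears before any argument")
    return groups


# Print the arguments indented by nesting level.
tokens = "-batch foo -evaluate -output --scenarios 10".split()
for level, name, values in group_nested_args(tokens):
    print("  " * (level - 1) + name, values)
```

Here `--scenarios 10` is reported at level 2, i.e. as a sub-argument of the preceding level 1 argument `-output`.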
General
General commands summary
Command line argument | Description |
---|---|
|
Displays IGoR’s manual. Alternatively one could use
|
|
Displays IGoR’s installed version number. |
|
Sets the working directory to /path/to/dir/, default is /tmp. This should be an already existing directory and will not be created by IGoR |
|
Sets the number of OpenMP threads to N for alignments and inference/evaluation. By default IGoR will use the maximum number of threads. |
|
Redirects the standard output to the file /path/to/file |
|
Reads the input sequences file /path/to/file and reformats it in the working directory. This step is necessary for running any action on sequences using the command line. The input can be a fasta file, a csv file (with the sequence index as first column and the sequence in the second, separated by a semicolon ';') or a text file with one sequence per line (format recognition is based on the file extension). Providing this file will create a semicolon separated file with indexed sequences in the align folder. |
|
Sets the batch name. This name will be used as a prefix to alignment/indexed sequences files, output, infer, evaluate and generate folders. |
|
Selects a model and a set of genomic template
according to the value. Possible values for |
|
Selects a species from the set of predefined
species. Possible values are: |
|
Set a set of custom genomic templates for gene gene (possible values are --V,--D and --J) with a list of genomic templates contained in the file /path/to/file.fasta in fasta format. If the set of provided genomic templates is already fully contained (same name and same sequence) in the loaded model (default, custom, last_inferred), the missing ones will be set to zero probability keeping the ratios of the others. For instance providing only one already known genomic template will result in a model with the considered gene usage to be 1.0, all others set to 0.0. When using this option and introducing new/modified genomic templates, the user will need to re-infer a model since the genomic templates will no longer correspond to the ones contained in the reference models, the model parameters are thus automatically reset to a uniform distribution. |
|
Load a semicolon separated file containing the indices/offset of the CDR3 anchors for the gene (--V or --J). The index should correspond to the first letter of the cysteine (for V) or tryptophan/phenylalanine (for J) for the nucleotide sequence of the gene. Indices are 0 based. |
|
Use a custom model as a baseline for inference or evaluation. Note that this will override custom genomic templates for inference and evaluation. Alternatively, providing only the model parameters file will lead IGoR to create model marginals initialized to a uniform distribution. |
|
Using this command will load the last inferred model (folder inference/final_xx.txt) as a basis for a new inference, evaluation or generation of synthetic sequences |
|
Runs the demo code on 300 sequences of 60bp TCRs (mostly a sanity run check) |
|
Runs the code inside the custom section of the main.cpp file |
|
Perform actions on a random subsample of N sequences.
This flag will have different effects depending on the supplied
commands: if the |
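The renormalization described above for custom genomic templates (templates absent from the supplied set get zero probability, the ratios between the remaining ones are preserved) can be illustrated with a minimal Python sketch. This is not IGoR's code; the function name `restrict_gene_usage` and the gene names and probabilities are hypothetical.

```python
def restrict_gene_usage(usage, kept_genes):
    """Zero the probability of genes not in kept_genes and renormalize the rest.

    usage maps gene names to probabilities; the ratios between the kept
    genes are left unchanged.
    """
    kept_total = sum(p for gene, p in usage.items() if gene in kept_genes)
    if kept_total <= 0:
        raise ValueError("none of the kept genes has a positive probability")
    return {
        gene: (p / kept_total if gene in kept_genes else 0.0)
        for gene, p in usage.items()
    }


# Hypothetical gene usage, for illustration only.
usage = {"TRBV1": 0.5, "TRBV2": 0.3, "TRBV3": 0.2}
print(restrict_gene_usage(usage, {"TRBV1", "TRBV2"}))
print(restrict_gene_usage(usage, {"TRBV3"}))
```

Keeping a single known template gives it a usage of 1.0 and all others 0.0, which is the case mentioned in the table.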
Bash scripts
A set of bash scripts for common tasks using igor. We can use igor-compute_pgen to calculate the Pgen of a specific sequence.
igor-compute_pgen <specie> <chain> <ntsequence>
Example
igor-compute_pgen human beta actcagctttgtatttctgtgccagcagcgtagattgggacagggggcctcctacgagcagtacgtcgggccg
Working directory
This is where all IGoR outputs will appear. Specific folders will be created for alignments, inference, evaluation and outputs.
Alignments
Algorithm
Performs Smith-Waterman alignments of the genomic templates. Using a slight alteration of the Smith-Waterman score matrix, we enforce that V can only be deleted on the 3' side and J on the 5' side (thus enforcing the alignment on the other side until the end of the read or of the genomic template). D is aligned using a classical Smith-Waterman local alignment approach allowing gene deletions on both sides.
Alignments represent a critical step for a precise probabilistic characterization of your sequences. Make sure you use all the information you have on your sequences (e.g. use the knowledge of your primer positions to provide template specific alignment offsets) for optimal results.
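As an illustration of the classical local alignment used for D, here is a minimal Python sketch of the Smith-Waterman score with a linear gap penalty. It uses the default values quoted in the alignment commands table (5 for a match, -14 for a mismatch, gap penalty 50) and does not include the altered score matrix IGoR uses for V and J. The function name is hypothetical and this is not IGoR's implementation.

```python
def smith_waterman_score(read, template, match=5.0, mismatch=-14.0, gap=50.0):
    """Return the best local alignment score between read and template."""
    # previous[j] holds the best score of an alignment ending at the previous
    # read position and at template position j (0 = before the template).
    previous = [0.0] * (len(template) + 1)
    best = 0.0
    for read_nt in read:
        current = [0.0]
        for j, template_nt in enumerate(template, start=1):
            substitution = match if read_nt == template_nt else mismatch
            score = max(
                0.0,                              # start a new local alignment
                previous[j - 1] + substitution,   # match or mismatch
                previous[j] - gap,                # gap in the template
                current[j - 1] - gap,             # gap in the read
            )
            current.append(score)
            best = max(best, score)
        previous = current
    return best


print(smith_waterman_score("ACGTACGT", "CGTA"))
```

With these scores an exact match of 3 nucleotides already reaches the default D score threshold of 15.0, while a single mismatch costs almost as much as three matches gain.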
Alignment commands summary
Alignment of the sequences is performed upon detection of the -align switch in the command line. For each gene, alignment parameters can be set using --V, --D or --J. Specifying any of those three arguments will cause only the specified genes to be aligned. In order to specify a set of parameters for all genes or to force the alignment of all genes, the argument --all should be passed.
The complementarity-determining region (CDR3) of the aligned sequences is by default written to a file <batchname>_indexed_CDR3.csv in the aligns directory when --all is used. In case of separate alignments the CDR3 file can be generated by using the --feature ---ntCDR3 option.
The arguments for setting the different parameters are given in the table below.
If the considered sequences are nucleotide CDR3 sequences (delimited by their anchors on the 3' and 5' sides), using the command --ntCDR3 alignments will be performed using gene anchor information as offset bounds.
Command line argument | Description |
---|---|
|
Sets the score threshold for the considered gene alignments to X. Default is 50.0 for V, 15.0 for D and 15.0 for J |
|
Sets the substitution matrix to the one given in the file. Must be ',' delimited. Default is a NUC44 matrix with stronger penalty on errors (5,-14) |
|
Sets the alignment gap penalty to X. Default is 50.0 |
|
If true only keep the best alignment for each gene/allele. If false outputs all alignments above the score threshold. Default is true for V and J, and false for D. |
|
If true only keep alignments for best scoring gene candidate (or candidates if several genes have the same maximum score). If false outputs alignments for every aligned gene/allele. Default is false for V, D and J. |
|
Constrains the possible positions of the
alignments. The offset is defined as the position on the read to which
the first nucleotide of the genomic template aligns (can be negative,
e.g for V for which most of the V is on the 5' of the read and cannot be
seen). Default values are -inf and +inf. If the |
|
Constrains the possible positions of the alignments differently for each genomic template. The file should be a semicolon separated file formatted as follows: |
|
If true, the provided offsets are treated as reversed offsets. Reversed offsets are defined relative to the last nucleotide of the read instead of the first. Reversed offsets must be <= 0 by construction. |
Alignment output files summary
Upon alignment the alignment parameters/dates/filenames will be appended to the aligns/aligns_info.out file for easy traceability.
Alignment files are semicolon separated files. For each alignment of a genomic template to a sequence the following fields are given:
Field | Description |
---|---|
seq_index | The sequence index the alignment corresponds to in the indexed_sequences.csv file. |
gene_name | The gene name as provided in the genomic template file |
score | SW alignment score |
offset | The index of the first letter of the (undeleted) genomic template on the read as described in the previous section. Indices are 0 based. |
insertions | Indices of the alignment inserted nucleotides (relative to the read) |
deletions | Indices of the alignment deleted nucleotides (relative to the genomic template) |
mismatches | Indices of the alignment mismatches (relative to the read) |
length | Length of the SW alignment (including insertions and deletions) |
5_p_align_offset | Offset of the first nucleotide of the SW alignment (relative to the read) |
3_p_align_offset | Offset of the last nucleotide of the SW alignment (relative to the read) |
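As a hedged illustration of how such an alignment file could be processed, the Python sketch below keeps the best scoring alignment for each sequence. It assumes the file starts with a header row using the field names listed above and only reads the seq_index, gene_name, score and offset columns; the function name and the example rows are made up for this sketch. The pygor module shipped with IGoR provides its own parsers.

```python
import csv
import io


def best_alignment_per_sequence(handle):
    """Return {seq_index: (gene_name, score, offset)} for the top scoring alignment."""
    reader = csv.DictReader(handle, delimiter=";")
    best = {}
    for row in reader:
        seq_index = int(row["seq_index"])
        score = float(row["score"])
        # Keep the first alignment seen in case of equal scores.
        if seq_index not in best or score > best[seq_index][1]:
            best[seq_index] = (row["gene_name"], score, int(row["offset"]))
    return best


# Made-up rows standing in for an alignment file (assumed header names).
example = io.StringIO(
    "seq_index;gene_name;score;offset\n"
    "0;TRBV1*01;120;-45\n"
    "0;TRBV2*01;150;-40\n"
    "1;TRBV1*01;95;-52\n"
)
for seq_index, alignment in best_alignment_per_sequence(example).items():
    print(seq_index, alignment)
```

For a real file, replace the `io.StringIO` object with `open(path, newline="")`.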
The CDR3 files are semicolon separated files. For each IGoR indexed sequence the following fields are given:
Field | Description |
---|---|
seq_index | The sequence index the alignment corresponds to in the indexed_sequences.csv file. |
v_anchor | Position of the V anchor (first nucleotide of 2nd-CYS codon) relative to the read sequence. |
j_anchor | Position of the J anchor (last nucleotide of J-PHE or J-TRP codon) relative to the read sequence. |
CDR3nt | Nucleotide CDR3 sequence of the indexed sequence. |
CDR3aa | Amino acids CDR3 sequence of the indexed sequence. |
Anchor positions are 0 based indices relative to the read sequence.
Inference and evaluation
Inference and evaluation commands
The inference is reached using the command -infer. Logs and model parameter values for each iteration will be created in the folder inference of the working directory (or batchname_inference if a batchname was supplied).
Sequence evaluation is reached using the command -evaluate. This is the same as performing one iteration of the Expectation-Maximization on the whole dataset and thus accepts the same arguments as -infer for arguments related to the precision of the algorithm. The logs of the sequence evaluation are created in the folder evaluate (or batchname_evaluate if a batchname was supplied).
Note that -infer and -evaluate are mutually exclusive in the same command, since combining them would make it ambiguous which model should be used for each.
Optional parameters are the following:
Command line argument | Description | Available for |
---|---|---|
|
Sets the number of EM iterations for the inference to N |
inference |
|
Sets the sequence likelihood threshold to X. |
inference & evaluation |
|
Sets the probability ratio threshold to X. This influences how much the tree of scenarios is pruned. Setting it to 0.0 means exploring every possible scenario (exact but very slow), while setting it to 1.0 only explores scenarios that are more likely than the best scenario explored so far (very fast but inaccurate). This sets a trade-off between speed and accuracy: the best value is the largest one for which the likelihood of the sequences almost doesn’t change when decreasing it further. |
inference & evaluation |
|
Runs the algorithm in a 'Viterbi like' fashion. Accounts for the Most Likely Scenario Only (as fast as using a probability ratio threshold of 1.0) |
inference & evaluation |
|
During the inference only the parameters of the events with nicknames listed will be updated. Note that not passing any event nickname will fix all events. |
inference |
|
Opposite command to the one above, will fix the parameters of the listed events |
inference |
|
In the same vein as the two commands above, this one will fix the parameters related to the error rate. |
inference |
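The speed/accuracy trade-off set by the probability ratio threshold can be illustrated with a toy Python sketch. It is a deliberate simplification: scenarios are given as a flat list of probabilities, whereas IGoR prunes a tree of partial scenarios. The function name and the probabilities are hypothetical.

```python
def explore_scenarios(scenario_probabilities, ratio_threshold):
    """Sum the probabilities of the scenarios that survive pruning.

    A scenario is skipped when its probability is below ratio_threshold
    times the probability of the best scenario explored so far.
    Returns (estimated likelihood, number of explored scenarios).
    """
    best_so_far = 0.0
    likelihood = 0.0
    explored = 0
    for probability in scenario_probabilities:
        if probability < ratio_threshold * best_so_far:
            continue  # pruned: too unlikely compared to the best scenario so far
        likelihood += probability
        explored += 1
        best_so_far = max(best_so_far, probability)
    return likelihood, explored


# Toy scenario probabilities, for illustration only.
probabilities = [0.2, 0.05, 0.3, 0.01, 0.3]
for threshold in (0.0, 0.1, 1.0):
    likelihood, explored = explore_scenarios(probabilities, threshold)
    print(threshold, round(likelihood, 3), explored)
```

In this toy case a threshold of 0.0 explores all five scenarios, while larger thresholds explore fewer scenarios and return a smaller likelihood estimate.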
Inference and evaluation output
Upon inferring or evaluating several files will be created in the corresponding folder.
Model parameters files
*_parms.txt files contain information to create Model_Parms C++ objects. They encapsulate information on the individual model events, their possible realizations, the model’s graph structure encoding the events’ conditional dependences and the error model information. All fields are semicolon separated. The different sections of the files are delimited by an @ symbol, each further subdivided as follows:
- @Event_list introduces the section in which the recombination events (i.e. the Bayesian Network/graph nodes) are defined.
  - # introduces a new recombination event (or node). The line contains 5 fields:
    - the event type (GeneChoice, Deletion, Insertion, DinucMarkov)
    - the targeted genes (V_gene, VD_genes, D_gene, DJ_genes, J_gene, VJ_genes)
    - the gene side (Five_prime, Three_prime, Undefined_side)
    - the event priority: an integer influencing the order in which events are processed during the inference such that events with high priority are preferentially processed earlier.
    - the event nickname
  - % introduces a new event realization. Depending on the recombination event, the first fields will define the realization name and/or values (e.g. gene name and gene sequence for GeneChoice or number of deletions for Deletion) while the final field always denotes the realization’s index on the probability array. This index is automatically assigned by IGoR upon addition of an event realization; changing it will cause undefined behavior. See the Advanced usage section of this README for more information on how to add/remove event realizations.
- @Edges introduces the section in which the conditional dependencies (i.e. graph directed edges) are defined.
  - %parent;child introduces a new directed edge/conditional dependence between the parent and child event.
- @ErrorRate introduces the section in which the error model is defined.
  - # introduces a new error model, the first field defining the error model type and subsequent fields other meta parameters of the error model.
    - % introduces the parameter values linked to the actual error/mutation rate.
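The section layout described above can be read with a short Python sketch. It is only an illustration of the format as described in this README (sections opened by @, events by #, realizations and edges by %, semicolon separated fields); the function name, the dictionary layout and the example content, including the error model name, are hypothetical. The pygor module shipped with IGoR provides its own reader.

```python
def parse_model_parms(text):
    """Parse the @Event_list, @Edges and @ErrorRate sections of a parms file."""
    model = {"events": [], "edges": [], "error_rate": []}
    section = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        sigil, body = line[0], line[1:]
        if sigil == "@":
            section = body
        elif section == "Event_list" and sigil == "#":
            event_type, genes, side, priority, nickname = body.split(";")
            model["events"].append({
                "type": event_type, "genes": genes, "side": side,
                "priority": int(priority), "nickname": nickname,
                "realizations": [],
            })
        elif section == "Event_list" and sigil == "%":
            # The last field is the index on the probability array.
            *values, index = body.split(";")
            model["events"][-1]["realizations"].append((values, int(index)))
        elif section == "Edges" and sigil == "%":
            parent, child = body.split(";")
            model["edges"].append((parent, child))
        elif section == "ErrorRate" and sigil in "#%":
            model["error_rate"].append((sigil, body.split(";")))
        else:
            raise ValueError(f"unexpected line in section {section!r}: {line!r}")
    return model


# Hypothetical file content written for this example, not a shipped model.
example = """@Event_list
#GeneChoice;V_gene;Undefined_side;7;v_choice
%TRBV1;ACGT;0
%TRBV2;ACGA;1
#Deletion;V_gene;Three_prime;5;v_3_del
%0;0
%1;1
@Edges
%v_choice;v_3_del
@ErrorRate
#ExampleErrorModel
%0.001
"""
model = parse_model_parms(example)
print([event["nickname"] for event in model["events"]], model["edges"])
```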
Model marginals files
*_marginals.txt files contain information to create Model_Marginals C++ objects. They encapsulate the probabilities for each recombination event’s realization. As for the model parameters files, the marginals files are sectioned by special characters as follows:
- @ introduces the nickname of the recombination event that the following probabilities refer to.
- $Dim introduces the dimensions of the probability array of the event, including its conditional dimensions. By convention the last dimension refers to the considered event dimension.
- # introduces the indices of the realizations of the parent events, and their nicknames, corresponding to the following 1D probability array.
- % introduces the 1D probability array for all of the considered event realizations, for fixed realizations of the parent events whose indices were given in the previous line.
Python functions are provided to read such files along with the corresponding model parameters file within the GenModel object.
Inference information file
inference_info.out contains the inference parameters/date/time for traceability and potential error messages.
Inference logs file
inference_logs.txt contains some information on each sequence for each iteration. This is a useful tool to troubleshoot inference problems.
Model likelihood file
likelihoods.out contains the likelihood information for a given dataset.
Inference and evaluation Troubleshoots
Although the inference/evaluation generally runs smoothly, we list below some possible problems and corresponding solutions.
Issue | Putative solution |
---|---|
map_base::at() exception |
This exception is most likely thrown by a Gene_Choice event in the inference. Try/Catch handling is runtime costly thus some checks are not performed on the fly. Explanation: This is most likely the inference receiving a genomic template whose name does not exist in the model realizations. Solution: make sure the genomic templates (and their names) used for alignments correspond to those contained in your model file. |
All 0 output |
All marginal files contain 0 parameters after one iteration. All sequences have zero likelihood in the inference_logs.txt file. Explanation: none of the scenarios had a sufficiently high likelihood to reach the likelihood threshold. Solution: use the |
Extreme slowness |
Runtimes are very far from the ones given in the Runtimes section. Check the mean number of errors in the inference_logs.txt file. If these numbers are higher than you would expect from your data (e.g if you are not studying hypermutated data) check your alignments statistics. A possible explanation would be an incorrect setting of the alignment offsets bounds |
Outputs
Outputs (or Counters in the C++ interface) are scenario/sequence statistics, each individually presented below. They are all written in the output folder (or batchname_output if a batchname was supplied).
In order to specify outputs use the -output argument, and detail the desired list of outputs. Outputs are tied to the exploration of scenarios and thus require -infer or -evaluate in the same command. Note that although it might be interesting to track some outputs during the inference for debugging purposes, best practice would be to use them along with evaluation.
The different outputs are detailed in the next sections.
Python utility functions are provided to analyze these outputs in the pygor.counters submodule.
Best scenarios
Output the N best scenarios for each sequence
Use command --scenarios N
The output of this Counter is a semicolon separated values file with one field for each event realization, associated mismatches/errors/mutations indices on the read, the scenario rank, its associated probability and the sequence index.
Python functions to parse the output of the Best scenario counter can be found in the pygor.counters.bestscenarios submodule.
Generation probability
Estimates the probability of generation of the error-free/unmutated ancestor sequence. By default, only an estimator of the probability of generation of the ancestor sequence underlying each sequencing read is output. See IGoR’s paper for details.
Use command --Pgen
Coverage
Counts for each genomic nucleotide how many times it has been seen and how many times it was mutated/erroneous
Use command --coverage
Sequence generation
Using a recombination model and its associated probabilities IGoR can generate random sequences mimicking the raw product of the V(D)J recombination process.
Sequence generation commands
Reached using the command -generate N where N is the number of sequences to be generated. The number of sequences to generate must be passed before optional arguments. Optional parameters are the following:
Command line argument | Description |
---|---|
|
Generate sequences without sequencing error (the rate and the way those errors are generated is controlled by the model error rate) |
|
Outputs nucleotide CDR3 from generated sequences. The file
contains three fields: CDR3 nucleotide sequence, whether the CDR3
anchors were found (if erroneous/mutated) and whether the sequence is
inframe or not. Gene anchors are not yet defined for all the default
models shipped with IGoR, use |
|
Prefix for the generated sequences filenames. Note that setting the batchname will change the generated sequences folder name, while setting --name will change the file names. |
|
Impose X as a seed for the random sequence generator. By default a random seed is obtained from the system. |
Command examples
First, as a sanity check after installation, try running the demo code (this will run for a few minutes on all available cores):
igor -run_demo
Here we give an example with a few commands illustrating a typical workflow. In this example we assume to be executing IGoR from the directory containing the executable.
WDPATH=/path/to/your/working/directory #Let's define a shorthand for the working directory
#We first read the sequences contained in a text file inside the demo folder
#This will create the align folder in the working directory and the foo_indexed_sequences.csv file.
igor -set_wd $WDPATH -batch foo -read_seqs ../demo/murugan_naive1_noncoding_demo_seqs.txt
#Now let's align the sequences against the provided human beta chain genomic templates with default parameters
#This will create foo_V_alignments.csv, foo_D_alignments.csv and foo_J_alignments.csv files inside the align folder.
igor -set_wd $WDPATH -batch foo -species human -chain beta -align --all
#Now use the provided beta chain model to get the 10 best scenarios per sequence
#This will create the foo_output and foo_evaluate folders and the corresponding files inside
igor -set_wd $WDPATH -batch foo -species human -chain beta -evaluate -output --scenarios 10
#Now generate 100 synthetic sequences from the provided human beta chain model
#This will create the directory bar_generate with the corresponding files containing the generated sequences and their realizations
igor -set_wd $WDPATH -batch bar -species human -chain beta -generate 100
Since all these commands use the same arguments several times, here is some syntactic sugar using more Bash syntax for the exact same workflow with a lighter syntax:
WDPATH=/path/to/your/working/directory #Let's define a shorthand for the working directory
MYCOMMANDS="./igor -set_wd $WDPATH" #Quotes are required to store several words in one variable
$MYCOMMANDS -batch foo -read_seqs ../demo/murugan_naive1_noncoding_demo_seqs.txt #Read seqs
MYCOMMANDS="$MYCOMMANDS -species human -chain beta" #Add chain and species commands
$MYCOMMANDS -batch foo -align --all #Align
$MYCOMMANDS -batch foo -evaluate -output --scenarios 10 #Evaluate
$MYCOMMANDS -batch bar -generate 100 #Generate
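The same workflow can also be driven from Python. The sketch below only uses the command line arguments shown in the examples above; the function names are hypothetical, the working directory is a placeholder, and the igor executable is assumed to be on the PATH. The commands are built as lists of arguments and printed; `run_workflow` executes them with `subprocess.run` and is left for the user to call.

```python
import subprocess


def build_workflow(igor="igor", wd="/path/to/your/working/directory"):
    """Return the read/align/evaluate/generate commands as argument lists."""
    base = [igor, "-set_wd", wd]
    model = ["-species", "human", "-chain", "beta"]
    demo_seqs = "../demo/murugan_naive1_noncoding_demo_seqs.txt"
    return [
        base + ["-batch", "foo", "-read_seqs", demo_seqs],
        base + model + ["-batch", "foo", "-align", "--all"],
        base + model + ["-batch", "foo", "-evaluate", "-output", "--scenarios", "10"],
        base + model + ["-batch", "bar", "-generate", "100"],
    ]


def run_workflow(commands):
    """Run each command in order and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


# Dry run: print the commands instead of executing them.
for command in build_workflow():
    print(" ".join(command))
```

Passing the arguments as a list avoids the shell quoting pitfalls of the Bash variables above.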
Advanced usage
The set of command lines above allows the use of predefined models or their topology to study a new dataset. Additionally the user can define new models directly using the model parameters file interface. For instance, in order to investigate a conditional dependence between two recombination events, the user can simply add or remove an edge in the graph following the syntax defined earlier.
In order to change the set of realizations associated with an event the user can also directly modify a recombination parameters file. Adding or removing realizations should be done with great care as IGoR will use the associated indices to read the corresponding probabilities on the probability array. These indices should be contiguous ranging from 0 to the (total number of realizations -1).
Any change in these indices or to the graph structure will make the corresponding model marginals file void, and a new one should be automatically created by passing only the model parameters filename to the -set_custom_model command.
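Before handing an edited parameters file to IGoR, the contiguity requirement on realization indices stated above can be checked with a few lines of Python. This is a minimal sketch; the function name is hypothetical and the indices are expected to have been extracted from the last field of each % line of one event.

```python
def check_realization_indices(indices):
    """Raise ValueError unless indices are exactly 0 .. len(indices) - 1."""
    expected = set(range(len(indices)))
    seen = set(indices)
    if len(seen) != len(indices):
        raise ValueError("duplicated realization index")
    if seen != expected:
        missing = sorted(expected - seen)
        raise ValueError(f"indices are not contiguous, missing: {missing}")


check_realization_indices([0, 1, 2, 3])      # valid, order in the file does not matter here
try:
    check_realization_indices([0, 1, 3])     # index 2 was removed without renumbering
except ValueError as error:
    print(error)
```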
Note that changing the GeneChoice realizations can be done automatically (without manually editing the recombination parameter file) by supplying the desired set of genomic templates to IGoR using the -set_genomic command. This could be used e.g. to define a model for a chain in a species for which IGoR does not supply a model, starting from a model for this chain from another species.
C++
Although a few command line options are supplied for basic use of IGoR, its full modularity can be used through high-level C++ functions on which all previous command lines are built. A section of the main.cpp file is dedicated to accepting user supplied code and can be executed using the -run_custom command line argument when launching IGoR from the shell. An example of the high-level workflow is given in the run demo section and the full Doxygen generated documentation is available as PDF. For any question please contact us.
Good practice would be to append the C++ code in the dedicated area of the main.cpp file:
else{
//Write your custom procedure here
(1)
}
1 | Insert your custom code here. |
This part of the code is reachable using the -run_custom command line argument. This design aims at keeping the command line interface fully functional while still allowing some custom code to be appended.
Python
A set of Python modules is shipped with IGoR in order to parse IGoR’s outputs (alignments, models, etc.) as the pygor module. This module can be installed locally with pip using the included setup.py file (use the command pip install ./pygor from within the IGoR directory).
Contribute
- Your feedback is valuable: please send your comments about usability, bug reports and new features you would like to see.
- Code contribution: IGoR was designed to be modular and to evolve; please get in touch if you would like to do something new with your data and would like some more guidance on the code structure.
If you would like to share some improvements on IGoR’s code please file a pull request according to this logic:
- If you’d like to propose a change in the documentation of existing functions to provide clearer insights, please file your pull request on the master branch.
- If you’d like to propose a bug fix for a function already present in a release, file your pull request on the master branch.
- If you’d like to propose a new functionality and its associated documentation, please file your pull request on the dev branch.
Copying
Free use of IGoR is granted under the terms of the GNU General Public License version 3 (GPLv3).