I’ve been using Python programs to manage NGS sample pipelines for a while. Progress was slow at first, but the code is now much more reliable and I can work much faster.
A big part of it was due to some simple object modeling of projects and samples.
Here are the basics of what I implemented. See the full code on GitHub.
A Project object
In its simplest form, a project object holds attributes and defines and creates (if necessary) a directory structure.
Here’s how I chose to structure my projects:
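A minimal sketch of such a class could look like the following. The method names setProjectDirs and makeProjectDirs follow the description in this post; the `dirs` dictionary and the `data` and `results` sub-directories are assumptions for illustration, not necessarily the layout used in the full code.

```python
import os


class Project(object):
    """Holds project attributes and creates the project directory tree."""

    def __init__(self, name, parent):
        self.name = name
        self.parent = os.path.abspath(parent)
        self.setProjectDirs()
        self.makeProjectDirs()

    def setProjectDirs(self):
        """Define the project's directories as attributes, without touching disk."""
        self.dirs = {}
        self.dirs["root"] = os.path.join(self.parent, self.name)
        # Hypothetical sub-directories; adapt to your own layout
        self.dirs["data"] = os.path.join(self.dirs["root"], "data")
        self.dirs["results"] = os.path.join(self.dirs["root"], "results")

    def makeProjectDirs(self):
        """Create any directory that does not exist yet."""
        for path in self.dirs.values():
            if not os.path.exists(path):
                os.makedirs(path)
```

Creating the object, for example with `Project("test", "/tmp/projects")`, is then enough to get the whole tree on disk.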
So, all the Project object takes as arguments are name and parent. The structure is then created by __init__ (which is called automatically when the object is created), which in turn calls setProjectDirs and makeProjectDirs.
A Sample object
I decided to have my Sample objects created from a Pandas Series, since sample annotation sheets are often in tabular form and can easily be read with Pandas.
I wanted something like:
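As a minimal sketch of the intended usage, assume a Sample class that copies each entry of the Series onto itself as an attribute. The annotation fields (`cellLine`, `numberCells`, `ip`) are hypothetical.

```python
import pandas as pd


class Sample(object):
    """A plain object whose attributes are copied from a pandas Series."""

    def __init__(self, series):
        if not isinstance(series, pd.Series):
            raise TypeError("Provided object is not a pandas Series.")
        # Copy every (index, value) pair of the Series onto the object
        for key, value in series.items():
            setattr(self, key, value)


# Hypothetical annotation for one sample
series = pd.Series(
    ["K562", 1, "CTCF"],
    index=["cellLine", "numberCells", "ip"],
)
sample = Sample(series)
print(sample.cellLine)
```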
I first considered making Sample inherit from pandas.Series to take advantage of its existing methods, but in the end that approach lacked some features (tab-completion in IPython wasn’t showing the methods I defined). Also, compatibility with new Pandas versions was not guaranteed. Therefore, I simply assign the pandas Series attributes to a new Sample object.
The directory structure is sample-centric: all files from a sample live under a sample-specific directory, and sub-directories within it hold more specific files:
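One way to express such a layout is a small helper that only computes the paths. This is a sketch: the function name `sample_dirs` and the sub-directory names (`unmapped`, `mapped`, `peaks`) are assumptions, not the names used in the full code.

```python
import os


def sample_dirs(project_data_dir, sample_name):
    """Return the paths of a sample-centric directory layout (nothing is created)."""
    root = os.path.join(project_data_dir, sample_name)
    return {
        "root": root,
        # Hypothetical sub-directories for more specific files
        "unmapped": os.path.join(root, "unmapped"),
        "mapped": os.path.join(root, "mapped"),
        "peaks": os.path.join(root, "peaks"),
    }


dirs = sample_dirs(os.path.join("projects", "test", "data"), "K562_CTCF")
print(dirs["peaks"])
```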
Sample methods
I create some useful methods for the samples.
I check that a sample contains the required attributes and that none of them is nan:
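A sketch of such a check, assuming a hypothetical list of required attributes and a method named `checkValid` (both are illustrative, not taken from the full code):

```python
import pandas as pd


class Sample(object):
    # Hypothetical list of attributes every sample must carry
    required = ["cellLine", "numberCells", "technique"]

    def __init__(self, series):
        for key, value in series.items():
            setattr(self, key, value)

    def checkValid(self):
        """Raise ValueError if a required attribute is absent or NaN."""
        missing = [
            attr for attr in self.required
            if not hasattr(self, attr) or pd.isnull(getattr(self, attr))
        ]
        if missing:
            raise ValueError("Missing or empty attributes: %s" % ", ".join(missing))
        return True


good = Sample(pd.Series({"cellLine": "K562", "numberCells": 500000, "technique": "ChIP-seq"}))
good.checkValid()

bad = Sample(pd.Series({"cellLine": "K562", "numberCells": float("nan"), "technique": "ChIP-seq"}))
try:
    bad.checkValid()
except ValueError as error:
    print(error)
```

Failing early like this keeps a malformed row of the annotation sheet from surfacing as a confusing error halfway through a pipeline run.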
I build a name for a sample by joining every non-nan attribute it has from a specific list:
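This could be sketched as follows. The attribute list, its order, the underscore separator and the method name `generateName` are all assumptions for illustration.

```python
import pandas as pd


class Sample(object):
    # Hypothetical ordered list of attributes that may contribute to the name
    nameAttributes = ["cellLine", "numberCells", "technique", "ip", "replicate"]

    def __init__(self, series):
        for key, value in series.items():
            setattr(self, key, value)

    def generateName(self):
        """Join the non-NaN values of the listed attributes with underscores."""
        parts = []
        for attr in self.nameAttributes:
            value = getattr(self, attr, None)
            # Skip attributes that are absent or NaN
            if value is not None and not pd.isnull(value):
                parts.append(str(value))
        self.name = "_".join(parts)
        return self.name


series = pd.Series(
    {"cellLine": "K562", "technique": "ChIP-seq", "ip": "CTCF", "replicate": float("nan")}
)
sample = Sample(series)
print(sample.generateName())  # K562_ChIP-seq_CTCF
```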
A SampleSheet object
Obviously, creating a new Pandas Series by hand every time, just to pass it to Sample, does not make much sense.
I created a new class which loads a sample annotation sheet from a CSV file and creates samples from it.
SampleSheet methods
Obviously, there are methods to create samples from the SampleSheet (either from a single pandas Series or from the whole sheet):
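A minimal sketch of the class and these two methods, assuming the names `makeSample` and `makeSamples` and a small inlined sheet with hypothetical columns:

```python
import io

import pandas as pd


class Sample(object):
    def __init__(self, series):
        for key, value in series.items():
            setattr(self, key, value)


class SampleSheet(object):
    def __init__(self, csv):
        # csv may be a path or any file-like object accepted by pandas.read_csv
        self.df = pd.read_csv(csv)
        self.samples = []

    def makeSample(self, series):
        """Create a single Sample from one row (a pandas Series)."""
        return Sample(series)

    def makeSamples(self):
        """Create one Sample per row of the sheet."""
        self.samples = [self.makeSample(row) for _, row in self.df.iterrows()]
        return self.samples


# Hypothetical two-row annotation sheet, inlined so the example is self-contained
csv = io.StringIO("cellLine,technique,ip\nK562,ChIP-seq,CTCF\nHepG2,ChIP-seq,H3K27ac\n")
sheet = SampleSheet(csv)
sheet.makeSamples()
print([s.cellLine for s in sheet.samples])
```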
Two more methods write the sheet back to a CSV file (to_csv, like in a pandas.DataFrame) and build a new data frame from the already created samples (asDataFrame):
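The two methods could be sketched like this. Building the data frame from `vars()` of each sample and defaulting to `index=False` are assumptions of this sketch; the constructor here takes ready-made samples only to keep the example short.

```python
import io

import pandas as pd


class Sample(object):
    def __init__(self, series):
        for key, value in series.items():
            setattr(self, key, value)


class SampleSheet(object):
    def __init__(self, samples):
        self.samples = samples

    def asDataFrame(self):
        """Build a new DataFrame with one row per Sample, from its attributes."""
        return pd.DataFrame([vars(s) for s in self.samples])

    def to_csv(self, path_or_buf, **kwargs):
        """Write the current samples as CSV, mirroring pandas.DataFrame.to_csv."""
        kwargs.setdefault("index", False)
        return self.asDataFrame().to_csv(path_or_buf, **kwargs)


sample = Sample(pd.Series({"cellLine": "K562", "ip": "CTCF"}))
sample.name = "K562_CTCF"  # attribute added after creation is exported too

sheet = SampleSheet([sample])
buffer = io.StringIO()
sheet.to_csv(buffer)
print(buffer.getvalue())
```

Because the data frame is rebuilt from the samples, any attribute added to them after loading (such as a generated name) ends up in the exported sheet.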
Binding them all
Ideally one would:
create a Project;
add a CSV file to it through a new method that creates a SampleSheet object. This would:
Make new Sample objects for each sample, creating its attributes and directory structure;
Add the Sample objects to a container in Project.
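The steps above can be sketched as follows. The method name `addSampleSheet`, the `sampleName` column and the `data` directory are hypothetical, and the classes are reduced to the bare minimum needed to show how they connect.

```python
import io
import os
import tempfile

import pandas as pd


class Sample(object):
    def __init__(self, series):
        for key, value in series.items():
            setattr(self, key, value)

    def setDirs(self, data_dir):
        # Sample-specific directory under the project's data directory
        self.dir = os.path.join(data_dir, self.sampleName)

    def makeDirs(self):
        if not os.path.exists(self.dir):
            os.makedirs(self.dir)


class SampleSheet(object):
    def __init__(self, csv):
        self.df = pd.read_csv(csv)

    def makeSamples(self):
        return [Sample(row) for _, row in self.df.iterrows()]


class Project(object):
    def __init__(self, name, parent):
        self.name = name
        self.dataDir = os.path.join(parent, name, "data")  # hypothetical layout
        self.samples = []

    def addSampleSheet(self, csv):
        """Load a sheet, create its samples and their directories, and store them."""
        self.sheet = SampleSheet(csv)
        for sample in self.sheet.makeSamples():
            sample.setDirs(self.dataDir)
            sample.makeDirs()
            self.samples.append(sample)


# Hypothetical sheet, inlined; a path to a CSV file works the same way
csv = io.StringIO("sampleName,cellLine\nK562_CTCF,K562\nHepG2_CTCF,HepG2\n")
prj = Project("test", tempfile.mkdtemp())
prj.addSampleSheet(csv)
print([s.sampleName for s in prj.samples])
```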
Practical examples
Here’s a step in an example pipeline which runs FastQC on (unmapped) BAM files from all samples:
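A sketch of what such a step might look like. The Sample class here is a stand-in with hypothetical attribute names (`unmappedBam`, `fastqcDir`), and the commands are only built and printed, not executed.

```python
import os


class Sample(object):
    """Stand-in for the Sample object; all paths derive from the sample name."""

    def __init__(self, name, data_dir):
        self.name = name
        self.dir = os.path.join(data_dir, name)
        self.unmappedBam = os.path.join(self.dir, "unmapped", name + ".bam")
        self.fastqcDir = os.path.join(self.dir, "fastqc")


def fastqc_command(sample):
    """Build the FastQC command line for one sample's unmapped BAM file."""
    return ["fastqc", "--noextract", "--outdir", sample.fastqcDir, sample.unmappedBam]


data_dir = os.path.join("projects", "test", "data")
samples = [Sample(name, data_dir) for name in ("K562_CTCF", "HepG2_CTCF")]

for sample in samples:
    cmd = fastqc_command(sample)
    # In a real pipeline the command would be run or submitted to a cluster,
    # e.g. with subprocess.check_call(cmd)
    print(" ".join(cmd))
```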
Notice the absence of file paths in the pipeline. Although still pretty simple, it is now much easier to handle every file created by the pipeline for each sample.
These objects are also useful during analysis steps to quickly grab files produced by the pipeline and start an analysis right away.
Here I grab all ChIP-seq peak files from all samples and create a peak set by concatenating them all and merging:
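The concatenate-and-merge logic can be illustrated in pure Python. This is a simplified stand-in rather than the code used in the analysis: the peaks are hypothetical in-memory tuples instead of being read from each sample's peak file, and a dedicated tool such as bedtools would normally do the merging.

```python
def merge_peaks(peaks):
    """Merge overlapping or adjacent intervals given as (chrom, start, end) tuples."""
    merged = []
    for chrom, start, end in sorted(peaks):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps the previous interval on the same chromosome: extend it
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(peak) for peak in merged]


# Hypothetical peaks per sample; in practice these come from each sample's peak file
peaks_by_sample = {
    "K562_CTCF": [("chr1", 100, 200), ("chr1", 500, 600)],
    "HepG2_CTCF": [("chr1", 150, 250), ("chr2", 10, 50)],
}

# Concatenate the peaks of all samples, then merge them into one peak set
all_peaks = [peak for peaks in peaks_by_sample.values() for peak in peaks]
peak_set = merge_peaks(all_peaks)
print(peak_set)
# [('chr1', 100, 250), ('chr1', 500, 600), ('chr2', 10, 50)]
```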