I’ve been working with pySCENIC in the recent couples of months. It’s a tool that is meant to uncover gene sets that are controlled by transcription factors and suggests the underlying logic of how genes are regulated. It takes gene expression data from single-cell RNA sequencing and seems to be fairly easy to use.
So, what took me several months? I did my preliminary run with their R implementaion, which was released earlier in the original reference. It was well-documented and ran smoothly over a set of 10,000 cells. Nonetheless, it did not scale well. With 100,000 cells, it could take a month to run. The authors are very aware of this and kindly re-implemented it in Python, which is approximately more efficient by two orders of magnitude in my case.
Everything should be smooth in theory, but in practice, I encountered a
dependency issue at the first step when I
pip installed the package.
After finding a way out of the compilation issue, I experienced repeated
segmentation faults running the analysis. Since it ran successfully once, I
thought that it was something wrong with my main analysis, but it turned out
to be that later versions of
dask is not compatible with the package itself.
It was then I painfully learned that I should really manage my analysis with
virtual environments. I found the authors provided a YAML environment file in
the repository, so I tried to use it.
conda complained. It said that
mkl-random was not compatible with
openblas, so it could do nothing here.
I played with the versions of packages a little bit and was able to run my analysis with a bunch of warning messages, and after two weeks, after I tried to enable jupyter notebook in the environment, the dependency was broken again.
I started over, but I forgot what I did to the versions so I failed to restore
a setting that enables my analysis. Then I came across
a thread at StackOverflow saying
numpy from the default channel
was somehow creating a mess with
mkl. Per the suggestion in
the above therad, I added a line in the YAML, prioritizing
defaults, and magically the original environment file from the authors runs
without any additional adjustments.
The morale of the story:
Dependencies should always be managed and documented when possible. It’s much harder, if not possible, to trace it from error messages. I was able to find
daskcausing segmentation faults was sheer luck when I tried to start everything over with a clean environment and finally opted in to use
Even with dependencies provided (like a
condaenvironment file), there could be local settings that was not in it. In my case, the authors could have set
conda-forgeas the preferred channel in
$HOME/.condarcand forgot about it.
Maybe I should create a dependency tag because that seems like the main theme of this blog, right?