R

Strangled by factors

I am exaggerating, but sometimes stringsAsFactors is almost this deadly. I work with genomic data, and a common quest in my job is to identify interesting features (in most cases, genes) from a pool of 25,000+. To stand on the shoulders of giants, I find genes that were reported to be important in a certain process, say cell fate determination, invaluable, and these genes often come in the supplementary tables of the original papers.
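To make the danger concrete, here is a minimal sketch of one way a factor column can bite. The gene names and expression values are made up for illustration, and this is only one possible failure mode, not necessarily the one the full post describes.

```r
# Hypothetical expression values, named by gene symbol
expr <- c(SOX2 = 1.5, NANOG = 2.5, POU5F1 = 3.5)

# A gene list read in with stringsAsFactors = TRUE becomes a factor
hits <- data.frame(symbol = c("POU5F1", "NANOG"),
                   stringsAsFactors = TRUE)$symbol

# Indexing with a factor uses its integer codes, not its labels,
# so this silently returns the wrong genes
wrong <- expr[hits]

# Converting to character first looks the genes up by name
right <- expr[as.character(hits)]

print(names(wrong))
print(names(right))
```

Since R 4.0.0, `data.frame()` and `read.table()` default to `stringsAsFactors = FALSE`, so this mostly shows up in older code or when the argument is set explicitly.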

Migrating from Medium to Blogdown

After struggling for a while, I decided to move from Medium to blogdown. While Medium is a beautiful platform for blogging, its philosophy seems to fit less well when there is more than articles to host. I came to realize that I expect a blog not only to keep notes, but also to showcase projects and provide a brief biography. In short, I wanted a website instead of a blog, and blogdown seems to provide an appropriate amount of versatility.

Dealing with dependency without sudo like a dummy (again)

The more I work with Linux, the more I encounter dependency issues. This is of course not too big a surprise, but it can be painful, especially when you don't have sudo access, so the most obvious solution does not work for you. Don’t misunderstand me. sudo is definitely not something to play with when we don’t know what we are doing, and access control does a terrific job of preventing me from shooting myself in the foot (or worse, someone else in theirs) from time to time.

Reverse and find complement sequence in R

Recently, I have been continually amazed by how a seemingly simple task is actually implemented in a sophisticated way. I guess I take so many things for granted because they have been implemented and refined to the extent that I don’t even notice them. When someone asked me “how do you reverse and then find the complement sequence of some DNA?” I googled and found a couple of functions, and then I decided to challenge myself to re-invent this wheel.
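For readers who want a starting point, the following is a minimal base-R sketch of a reverse complement. The function name `reverse_complement` is my own, it handles a single string of plain A/C/G/T only, and it is not necessarily how the full post re-invents the wheel.

```r
# Reverse complement of one DNA string (A, C, G, T in either case).
# Ambiguity codes such as N are passed through unchanged.
reverse_complement <- function(dna) {
  stopifnot(is.character(dna), length(dna) == 1)
  # chartr() swaps each character in the first set for the character
  # at the same position in the second set
  complemented <- chartr("ACGTacgt", "TGCAtgca", dna)
  # Split into single characters, reverse them, and glue them back
  paste(rev(strsplit(complemented, "")[[1]]), collapse = "")
}

print(reverse_complement("ATGCC"))
```

For real work, Bioconductor's Biostrings package provides a `reverseComplement()` function that also copes with ambiguity codes and many sequences at once.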

Single or double?: AND operator and OR operator in R

One classmate complained about having trouble subsetting a data frame to keep non-zero rows, like:

    # I don't want rows of zero here!
    non_zero <- rna_seq[wt != 0 && mutant != 0 && resq !=0, ]

This did not work as expected. The culprit here was the logical operator &&. R has two versions each of AND and OR (&& and &, || and |), and just like my friend, I found them difficult to tell apart and suffered for a long time.
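A small sketch of the difference, using a made-up `rna_seq` table with the three columns named in the snippet above (the values are hypothetical). The single `&` works element by element, which is what row subsetting needs, while `&&` is meant for single TRUE/FALSE values such as the condition of an `if`.

```r
# Hypothetical toy data standing in for the rna_seq table
rna_seq <- data.frame(
  wt     = c(5, 0, 3, 0),
  mutant = c(2, 0, 0, 1),
  resq   = c(4, 0, 7, 2)
)

# `&` is vectorised: one TRUE/FALSE per row
keep <- rna_seq$wt != 0 & rna_seq$mutant != 0 & rna_seq$resq != 0
non_zero <- rna_seq[keep, ]
print(non_zero)

# `&&` expects length-one operands and returns a single value,
# so it belongs in control flow rather than in subsetting
if (nrow(non_zero) > 0 && all(non_zero$wt != 0)) {
  message("kept ", nrow(non_zero), " row(s)")
}
```

Note that the columns are referenced through `rna_seq$`, since bare names like `wt` are only found if such objects exist in the calling environment. How `&&` treats longer vectors depends on the R version: older releases silently used only the first element, and recent ones raise an error.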

Installing R package XML on MacOS 10.13.6

I updated my R packages the other day, and not surprisingly, one package failed to compile. This time, it was XML. The error message said configure: error: “libxml not found”, but homebrew reported that libxml2 was installed and up to date. By running ./configure manually and after some googling, I realized the compiler went to /usr/bin/xml2-config instead of the homebrew version, and thanks to this thread on StackOverflow, I learned that the compiler would be directed to the correct copy of libxml2 if I set the environment variable XML_CONFIG.

K-means exercise in R language

As a novice in genomic data analysis, one of my goals is to benchmark how well a clustering method works. I ran across this k-means practice at R-exercises the other day and felt it might be a nice start, because k-means is easy to perform and conceptually simple enough for me to follow what is happening behind the clustering machinery. It starts with manipulating the built-in iris dataset as usual.
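As a rough idea of what such an exercise looks like, here is a minimal sketch that clusters the four numeric iris columns with base R's `kmeans()` and compares the clusters with the known species. The choices of `centers = 3`, `nstart = 20`, and the seed are my own assumptions, not taken from the R-exercises set.

```r
# k-means starts from random centres, so fix the seed for repeatability
set.seed(42)

# Use only the four numeric measurements; drop the Species label
features <- iris[, 1:4]

# Ask for three clusters and try 20 random starts, keeping the best
fit <- kmeans(features, centers = 3, nstart = 20)

# Cross-tabulate cluster assignments against the true species
print(table(cluster = fit$cluster, species = iris$Species))
```

The cluster numbers themselves are arbitrary labels, so the table is read by looking at which species dominates each row rather than by matching cluster 1 to the first species.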

Using Limma to find differentially expressed genes

Ritchie, ME, Phipson, B, Wu, D, Hu, Y, Law, CW, Shi, W, and Smyth, GK (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43(7), e47.

limma is an R package hosted on Bioconductor which finds differentially expressed genes in RNA-seq or microarray data. Recently I’ve been working on a PCR-based low-density array and noticed that I had forgotten how to use limma for the hundredth time, so I decided to make a note.

Remote connection to Jupyter Notebook

Recently, I analyzed a few single-cell RNA-seq datasets and experimented with several new tools from recent publications. While it was fun, most datasets were just too large for my poor laptop to process, and I relied a lot on our server. I have to admit I am not too good an analyst and am spoiled by the freedom interpreted languages provide: trial and error, line by line. However, this freedom would be gone if I had to run my analysis like Rscript my-analysis.

Following up library dependency in R package compilation

Hello there. I had not expected to encounter this problem again so soon, but I did when I was installing yet another package on my own laptop.

    ld: warning: directory not found for option '-L/usr/local/gfortran/lib/gcc/x86_64-apple-darwin15/6.1.0'
    ld: warning: directory not found for option '-L/usr/local/gfortran/lib'
    ld: library not found for -lgfortran
    clang: error: linker command failed with exit code 1 (use -v to see invocation)

This prompted me to think again about the issues I grumbled about the other day.