Strangled by factors

I am exaggerating, but sometimes stringsAsFactors is almost this deadly. I work with genomic data, and a common task in my job is to identify interesting features (in most cases, genes) from a pool of 25,000+. To stand on the shoulders of giants, I rely on genes that were reported to be important in a certain process, say cell fate determination, and these genes often come in the supplementary tables of the original papers.

Migrating from Medium to Blogdown

After struggling for a while, I decided to move away from Medium and switch to blogdown. While Medium is a beautiful platform for blogging, its philosophy seems to fit less well when there is more than articles to host. I came to realize that I expect a blog not only to keep notes, but also to showcase projects and provide a brief biography. In short, I wanted a website instead of a blog, and blogdown seems to provide an appropriate amount of versatility.

Dealing with dependency without sudo like a dummy (again)

The more I work with Linux, the more I encounter dependency issues. This is of course not too big a surprise, but it can be painful, especially when you don’t have sudo privileges and the most obvious solution does not work for you. Don’t misunderstand me. sudo is definitely not something to play with when we don’t know what we are doing, and access control does a terrific job of preventing me from shooting myself in the foot (or worse, someone else’s) from time to time.

Reverse and find complement sequence in R

Recently, I have been continually amazed by how a seemingly simple task is actually implemented in a sophisticated way. I guess I take so many things for granted because they were implemented and refined to the extent that I don’t even notice them. When someone asked me “how do you reverse and then find the complement sequence of some DNA?” I googled and found a couple of functions, and then I decided to challenge myself to re-invent this wheel.
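For a sense of what such a wheel might look like, here is a minimal base-R sketch. The function name reverse_complement is hypothetical and this is not necessarily the approach the post ends up with; Bioconductor's Biostrings package also offers reverseComplement() for real work.

```r
# Hypothetical helper: reverse-complement DNA strings using base R only.
reverse_complement <- function(dna) {
  # chartr() swaps each character in the first set for the character
  # at the same position in the second set (A<->T, C<->G)
  complemented <- chartr("ACGTacgt", "TGCAtgca", dna)
  # Split each string into single characters, reverse them, and re-join
  vapply(
    strsplit(complemented, "", fixed = TRUE),
    function(bases) paste(rev(bases), collapse = ""),
    character(1)
  )
}

reverse_complement("ATGCCG")  # "CGGCAT"
```

Characters outside A, C, G, and T (for example N) pass through chartr() unchanged, which may or may not be what you want.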

Single or double?: AND operator and OR operator in R

One classmate complained about having trouble subsetting a data frame to keep non-zero rows, like:

    # I don't want rows of zero here!
    non_zero <- rna_seq[wt != 0 && mutant != 0 && resq != 0, ]

This did not work as expected. The culprit here was the logical operator &&. R has two versions each of AND (& and &&) and OR (| and ||), and just like my friend, I found them difficult to tell apart and suffered for a long time.
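The difference can be sketched with a small made-up table. The values below are invented for illustration, and the columns are referenced through rna_seq$ on the assumption that wt, mutant, and resq are columns of the data frame. The single & works element by element, which is what row subsetting needs, while && expects one TRUE/FALSE on each side (since R 4.3.0 it raises an error when given longer vectors).

```r
# Made-up expression table; column names follow the excerpt above
rna_seq <- data.frame(
  wt     = c(5, 0, 3, 0),
  mutant = c(2, 0, 0, 0),
  resq   = c(1, 0, 4, 7)
)

# `&` compares element by element and returns one logical value per row
keep <- rna_seq$wt != 0 & rna_seq$mutant != 0 & rna_seq$resq != 0
non_zero <- rna_seq[keep, ]

# `&&` takes a single TRUE/FALSE on each side, so it belongs in if() conditions
if (nrow(non_zero) > 0 && all(non_zero$wt != 0)) {
  message("kept ", nrow(non_zero), " row(s)")
}
```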

Installing R package XML on macOS 10.13.6

I updated my R packages the other day, and not surprisingly, one package failed to compile. This time, it was XML. The error message read configure: error: “libxml not found”, but Homebrew reported that libxml2 was installed and up to date. By running ./configure manually and after some googling, I realized the build was picking up /usr/bin/xml2-config instead of the Homebrew version, and thanks to this thread on Stack Overflow, I learned that it would be directed to the correct copy of libxml2 if I set the environment variable XML_CONFIG.

Seeking signal in the midst of noise with R

Oftentimes, the samples I deal with are full of noise or confounding factors that I am not interested in. For example, human specimens are bound to be noisy because the race, age, sex, occupation, or life story of the subject would have influenced the results. Carefully matching those variables and increasing the sample size help minimize known confounding factors and give a better chance of canceling out unknown ones, but sometimes the sample size is simply beyond our control.

K-means exercise in R language

As a novice in genomic data analysis, one of my goals is to benchmark how well a clustering method works. I ran across this k-means practice at R-exercises the other day and felt it might be a nice start, because k-means is easy to perform and conceptually simple enough for me to follow what is happening behind the clustering machinery. It starts with manipulating the built-in iris dataset as usual.
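A minimal sketch of that starting point, using kmeans() from the stats package that ships with R. The choices of centers = 3, nstart = 20, and the seed are my assumptions for illustration, not values taken from the exercise.

```r
# Keep the four numeric measurements; drop the Species label before clustering
measurements <- iris[, 1:4]

# k-means starts from random centers, so fix the seed for reproducibility
set.seed(42)
fit <- kmeans(measurements, centers = 3, nstart = 20)

# Cross-tabulate cluster assignments against the known species
table(cluster = fit$cluster, species = iris$Species)
```

Because the species labels are known, the cross-table gives a quick visual check of how well the clusters line up with them.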

Using Limma to find differentially expressed genes

Ritchie, ME, Phipson, B, Wu, D, Hu, Y, Law, CW, Shi, W, and Smyth, GK (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43(7), e47. limma is an R package hosted on Bioconductor that finds differentially expressed genes in RNA-seq or microarray data. Recently I’ve been working on a PCR-based low-density array and noticed that I had forgotten how to use limma for the hundredth time, so I decided to make a note.

Following up library dependency in R package compilation

Hello there. I had not expected to encounter this problem again so soon, but I did when I was installing yet another package on my own laptop.

    ld: warning: directory not found for option '-L/usr/local/gfortran/lib/gcc/x86_64-apple-darwin15/6.1.0'
    ld: warning: directory not found for option '-L/usr/local/gfortran/lib'
    ld: library not found for -lgfortran
    clang: error: linker command failed with exit code 1 (use -v to see invocation)

This prompted me to think again about the issues I grumbled about the other day.