Openly shared data is invaluable. It lets others test whether an analysis is reproducible and reduces the need to repeat screening experiments. Such data is also an excellent training ground for amateurs like me.
Sometimes, the dataset I want consists of multiple samples. I first clicked all the download links manually, but I soon got lost and forgot which ones I hadn’t downloaded. Thankfully, I realized repetitive tasks like this on a computer can often be automated.
What you are going to find here

- A minimal introduction to the awk command on Linux and Mac. (Mac users may need to install GNU awk, which adds some functions such as asort() for sorting an array.)
- An awk command that randomly subsamples k reads from a given FASTQ file from a paired-end sequencing run.

Why I am making this note

In single-cell RNA sequencing, there seems to date to be no good way of telling how deeply you should sequence.