Next-generation sequencing (NGS) represents a revolution in data generation for genetics. Compared to Sanger sequencing, NGS can capture the complete genomic content of a sample without the need to construct clone libraries, letting a researcher or clinician examine a genome in great detail with a single test. What once took weeks or months to perform can now be completed in a matter of days.

All of this capability comes with one major drawback: NGS datasets can grow to an unmanageable size, and the data has to be organized, assembled, and analyzed. Individual reads are typically 100 to 400 base pairs long (some platforms produce much longer ones), and a single run generates an enormous number of them, all of which need to be assembled. In the end, this translates into large volumes of data, often in the range of 120 to 600 gigabytes, that need to be stored. For many labs, that is a serious problem.
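
As a rough sanity check on those volumes, the sketch below estimates the size of an uncompressed FASTQ file from a read count and a read length. The run parameters (one billion 150-base-pair reads, roughly 50-byte headers) are assumptions chosen only for illustration, not a claim about any particular instrument.

    def fastq_gigabytes(num_reads: int, read_length: int, header_bytes: int = 50) -> float:
        """Rough uncompressed FASTQ size: sequence and quality lines plus per-read overhead."""
        per_read = 2 * read_length + header_bytes + 5  # bases, qualities, header, '+' line, newlines (approx.)
        return num_reads * per_read / 1e9              # bytes -> gigabytes (decimal units)

    # Assumed run: one billion reads of 150 base pairs each.
    print(f"~{fastq_gigabytes(1_000_000_000, 150):.0f} GB uncompressed")  # lands within the range cited above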

When investigating storage solutions, some labs look for outside assistance, only to be shocked at the price tags they find. After all, storage is cheap: you can go to Costco down the road and pick up a 1 terabyte hard drive for under $100. So why is there so much fuss over NGS data, and how do these companies justify the prices they charge?

In reality, it is not as simple as copying NGS data onto a hard drive and calling it done. Several factors must be considered when dealing with this unusual type of data.

First, a single hard drive is not a backup and does not properly secure your data. That drive is a single point of failure: if it is damaged, corrupted, or lost, you may be out of data that took a long time on a very expensive machine to generate. Even if the disk has its own onboard backup system, that does not protect it from physical damage. Mirroring the drive to a second one is better, but if both are stored in the same place, a single incident could take out both copies. Backups are also prone to failure, and verifying their integrity is no easy task.
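
One common way to confirm that a backup still matches the original is to compare cryptographic checksums of the two copies. The sketch below is a minimal example of that idea; the file paths are hypothetical, and real pipelines typically record checksums at write time and re-verify them on a schedule.

    import hashlib
    from pathlib import Path

    def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
        """Stream a file through SHA-256 so large FASTQ/BAM files are never loaded into memory."""
        digest = hashlib.sha256()
        with path.open("rb") as handle:
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # Hypothetical paths: the primary copy and its backup on a separate volume.
    primary = Path("/data/runs/run_042/sample_A.fastq.gz")
    backup = Path("/mnt/backup/runs/run_042/sample_A.fastq.gz")

    if sha256_of(primary) == sha256_of(backup):
        print("Backup verified: checksums match.")
    else:
        print("WARNING: backup does not match the primary copy.")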

Second, if you want to use these datasets for meaningful analysis, your data needs to be available, which means readily accessible and in the right format. Parsing and mapping the raw reads up front makes working with them far easier. If your data is spread across separate drives or stored in raw formats, every analysis starts with hours of prep work; if you constantly have to shuffle drives or rerun mapping whenever you need something from a dataset, the process becomes very cumbersome, very quickly. Centralized storage puts everything in one place, in one format, where it can be read easily and your analysis tools can interact with the data directly.
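
To make "interact directly with the data" concrete, here is a small sketch that pulls reads from one region of a coordinate-sorted, indexed BAM file instead of re-parsing or re-mapping the whole dataset. It assumes the pysam library, an existing .bai index, and a hypothetical file path and region.

    import pysam

    # Hypothetical path to a coordinate-sorted BAM file with an accompanying .bai index.
    bam_path = "/data/aligned/sample_A.sorted.bam"

    with pysam.AlignmentFile(bam_path, "rb") as bam:
        # The index allows random access to a region of interest,
        # so there is no need to scan the entire file.
        for read in bam.fetch("chr7", 55_000_000, 55_010_000):
            if not read.is_unmapped:
                print(read.query_name, read.reference_start, read.mapping_quality)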

Finally, collaboration can be a total headache with datasets as massive as the ones NGS machines produce. Sanger/SNP datasets (which range from about 1 to 40 megabytes) are small enough that you could attach several of them to an email if you really needed to. An NGS dataset, at 120 to 600 gigabytes, could take days to send over the internet. In many cases, the only practical option is to mail a hard drive to your collaborators, which brings obvious concerns about time, expense, and security.
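
To put "days" in perspective, here is a back-of-the-envelope calculation. The link speeds are assumptions chosen only to show the scale of the problem; real transfers are usually slower still once protocol overhead and shared bandwidth are factored in.

    def transfer_hours(size_gb: float, link_mbps: float) -> float:
        """Idealized transfer time for size_gb gigabytes over a link_mbps link."""
        size_megabits = size_gb * 1000 * 8        # gigabytes -> megabits (decimal units)
        return size_megabits / link_mbps / 3600   # seconds -> hours

    for size_gb in (120, 600):                    # dataset sizes cited above
        for link_mbps in (10, 100, 1000):         # assumed sustained upload speeds
            hours = transfer_hours(size_gb, link_mbps)
            print(f"{size_gb} GB at {link_mbps} Mbps: ~{hours:.0f} hours (~{hours / 24:.1f} days)")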

What this all boils down to is that you need access to a serious data storage operation to handle NGS datasets. Once you accumulate several sets (let alone hundreds or thousands of them), you need massive storage arrays to keep all of that data secure and readily accessible, along with the processing power to return data in a reasonable time, the security measures to protect it, and the staff to keep everything running correctly. The question is: will you store and manage your own data, or will you look to an outside solution?