Blog

  • May 7, 2019
    The shelf-life of bioinformatics software

    “Computing should be no more mystical a technique than any other laboratory method.”
    Robert J. Beynon, in the inaugural editorial (January 1, 1985) of CABIOS, the journal now published as Bioinformatics.

    We have come a long way since the 1980s, when life scientists first encountered sequence comparison tools. Today, many of us are bioinformatics-savvy and comfortable with complex computational tools. Advancements in mathematics, statistics, and computer science no longer remain locked in books or journal articles. Instead they make their way into applications in biology, which has brought new challenges. We are now inundated with thousands of bioinformatics tools: 20,918 by one account. Unfortunately, most of these are rarely used.

    Until the mid-1990s, advances in string comparison and sequence alignment, quantifying nucleotide and amino acid substitution rates, constructing evolutionary trees, and secondary/tertiary protein structure analysis drove bioinformatics. At the same time, troves of data became accessible to all. Margaret O. Dayhoff pioneered the storage, curation and distribution of biological data through the Protein Identification Resource (PIR), Frances C. Bernstein did the same for macromolecular structures through the Protein Data Bank (PDB), and Walter Goad for nucleotide sequences through the GenBank data repository. Their groundbreaking contributions encouraged new ways of looking at the data. After the mid-1990s, shotgun sequencing, microarrays, and next-generation sequencing each led to a flurry of activity in bioinformatics tool development.

    Shotgun sequencing—the efficient assembly of tens of thousands of independent, random sequence fragments into a complete genome—arrived with the publication of the 1.8 Mbp genome of Haemophilus influenzae Rd in 1995. Before then, we could not sequence beyond the scale of individual clones of roughly 40–50 kb, about the size of the bacteriophage lambda genome (48.5 kb). The new technique set off a flurry of activity, allowing scientists to publish many whole bacterial and archaeal genomes. Combined with increased computing power (the TIGR ASSEMBLER software took 30 hours on a single-core processor with 512 MB of RAM to assemble 24,304 sequence fragments), it also brought new tools to improve genome assemblies, annotate genomes, and understand the organization of genes and genomes.
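    The core idea behind shotgun assembly—repeatedly merging fragments by their longest overlaps—can be illustrated with a toy greedy assembler. This is a simplified sketch, not how TIGR ASSEMBLER or any production assembler works: real tools must also handle sequencing errors, repeats, reverse complements, and far larger inputs.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that matches a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)  # (overlap length, i, j)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no overlaps left; remaining reads stay as separate contigs
        merged = reads[i] + reads[j][n:]  # glue j onto i, dropping the overlap
        reads = [r for k, r in enumerate(reads) if k not in (i, j)]
        reads.append(merged)
    return reads

# Toy fragments of the target sequence "ATTAGACCTGCCGGAATAC"
fragments = ["ATTAGACCTG", "CCTGCCGGAA", "GGAATAC"]
print(greedy_assemble(fragments))  # → ['ATTAGACCTGCCGGAATAC']
```

    Even this toy version is quadratic in the number of reads per merge step, which hints at why assembling tens of thousands of fragments in 1995 took 30 hours of computing.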

    Another trend was the arrival of new high-throughput technologies—microarray, mass spectrometry, and screening techniques—for analyzing DNA, transcripts, proteins, metabolites, and their interactions. Microarrays especially benefited from new methods of data analysis. Around this time, many method developers settled on R as a language and environment for statistical computing and graphics, and their packages became popular through Bioconductor, an open-source platform built on R. Microarray datasets were typically on the order of tens of megabytes per sample.

    The mid-2000s brought next-generation sequencing (NGS), starting with 454 Life Sciences' GS20 pyrosequencing and Illumina's Genome Analyzer platforms. This approach was markedly different from the traditional Sanger method and produced datasets that were on the order of gigabytes per sample. This required researchers to focus on new ways to make sense of this data in molecular, cellular, physiological and organismal contexts.

    Bioinformatics tool development has also contributed to a shift in how biologists visualize data. At first, we visualized pairwise alignments, phylogenetic trees, multiple alignments, and the 3D structure of proteins. Now we also use interactive heatmaps with hierarchical clustering of gene expression data; networks of co-expressed genes; and multi-omic data sets (expression, regulatory elements, epigenetic changes, etc.) overlaid on a genome browser.

    Bioinformatics and computational biology tools and resources are published today in peer-reviewed journals like Bioinformatics, Nucleic Acids Research, BMC Bioinformatics, Database, Genome Research, and Genome Biology. As the number of tools exploded, journals like PeerJ and PeerJ Computer Science began curating them into collections. Datasets2Tools, from Avi Ma'ayan's lab at the Icahn School of Medicine at Mount Sinai, is another curated repository: it indexes more than 31,500 canned bioinformatics analyses applied to over 6,800 datasets, along with more than 4,900 published bioinformatics software tools and databases.

    Of the tools that go through peer review and get published in a journal, the majority are consigned forever to the journal's table of contents, but a small number have staying power. An issue of Nature Biotechnology probed the reasons, with interviews of the architects of several successful software tools, including Stephen Altschul, Barry Demchak, Richard Durbin, Robert Gentleman, Martin Krzywinski, Heng Li, Anton Nekrutenko, James Robinson, Wayne Rasband, James Taylor, and Cole Trapnell. They discussed the factors behind their tools' success, whether scientific software development differs from other software development, misconceptions in the research community about developing and using software, why so many new tools go unused, how software development in computational biology is evolving, and emerging trends in the field.

    Inspired by this issue, we reached out to Stephen Altschul. He kindly spent several hours giving us historical context around the development of BLAST, one of the most widely used bioinformatics tools of all time. In our next blog post, we will summarize our conversation and explore how BLAST became the standard tool for sequence search, demonstrating that scientific vision, collaboration, and timeliness are all essential to a software tool's success.