Blog

  • Aug 13, 2018
    Staying Power: The shelf-life of software

    Bioinformatics resources and tools have been developed for life scientists for over 30 years. Robert J. Beynon, writing in the inaugural editorial (January 1, 1985) of the journal Computer Applications in the Biosciences (CABIOS, later renamed Bioinformatics), explained why computers had until then assumed no central role in the activities of most life scientists and why that should change. Beynon's general philosophy of building bioinformatics tools to enable life scientists in their research and discovery holds even truer today. He wrote: "CABIOS is a journal for life scientists who wish to understand how computers can assist in their work. The emphasis is on application and their description in a scientifically rigorous format. Computing should be no more mystical a technique than any other laboratory method."

    We have come a long way since the publication of tools like BLAST and Clustal in the late 1980s. Today, most life scientists are bioinformatics-savvy and quite comfortable using several bioinformatics tools in their everyday research. Many algorithms and advances in mathematics, statistics, and computer science that once remained locked in books or journal articles have entered mainstream biology, embraced wholeheartedly by life scientists through the tools that implemented them and made them accessible. However, we now face a new set of challenges: the sheer proliferation of bioinformatics tools, petabytes of poorly managed and organized data, heterogeneous computing environments, and outdated ways of sharing data. All of these pose serious obstacles to organizing, managing, and creating knowledge.

    Until the mid-1990s, bioinformatics tool development was motivated by groundbreaking developments in the theory and practice of string comparison and sequence alignment, the quantification of nucleotide and amino acid substitution rates, the construction of evolutionary trees, and secondary/tertiary protein structure analysis. In parallel, computer archives for the storage, curation, and distribution of biological data were pioneered by Margaret O. Dayhoff for protein sequences through the Protein Information Resource (PIR), by Frances C. Bernstein for macromolecular structures through the Protein Data Bank (PDB), and later by Walter Goad for nucleotide sequences through the GenBank data repository. Their groundbreaking contributions made valuable data accessible to everyone and encouraged new ways of looking at the data.
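
    As a concrete illustration of the dynamic-programming idea behind those early string-comparison and alignment methods, here is a minimal sketch of Needleman-Wunsch global alignment scoring in base R (the language this post returns to below). The scoring scheme (match +1, mismatch -1, gap -1) is an illustrative stand-in, not the parameters of any particular published tool:

        # A minimal sketch of Needleman-Wunsch global alignment scoring in base R.
        # Scores are illustrative; real tools use empirically derived
        # substitution matrices such as PAM or BLOSUM.
        nw_score <- function(a, b, match = 1, mismatch = -1, gap = -1) {
          a <- strsplit(a, "")[[1]]
          b <- strsplit(b, "")[[1]]
          S <- matrix(0, length(a) + 1, length(b) + 1)
          S[1, ] <- gap * (0:length(b))  # first row: leading gaps in a
          S[, 1] <- gap * (0:length(a))  # first column: leading gaps in b
          for (i in seq_along(a)) {
            for (j in seq_along(b)) {
              sub <- if (a[i] == b[j]) match else mismatch  # substitution score
              S[i + 1, j + 1] <- max(S[i, j] + sub,       # align a[i] with b[j]
                                     S[i, j + 1] + gap,   # gap in b
                                     S[i + 1, j] + gap)   # gap in a
            }
          }
          S[length(a) + 1, length(b) + 1]  # optimal global alignment score
        }
        nw_score("GATTACA", "GCATGCU")  # the classic textbook pair; scores 0 here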

    The sequencing and publication of the 1.8 Mbp genome of Haemophilus influenzae Rd in 1995 marked the beginning of the development of tools for efficiently assembling a large number (tens of thousands) of independent, random sequence fragments into a single genome. Until then, despite advances in DNA sequencing technology, genome sequencing had not progressed beyond clones on the order of ~40 kb (roughly the size of the bacteriophage lambda genome). The TIGR ASSEMBLER software took 30 hours of CPU time on a single-core SPARCenter 2000 machine with 512 MB of RAM to assemble the 24,304 sequence fragments of H. influenzae. The mid-1990s saw a flurry of whole bacterial and archaeal genomes published using this strategy (shotgun sequencing). Its wide adoption, together with increased computing power, drove the development of a new set of tools to improve genome assemblies, annotate genomes, and understand the organization of genes and genomes.

    The mid-1990s also saw the development of new high-throughput technologies to qualitatively and quantitatively assay DNA, transcripts, proteins, metabolites, and their interactions through microarray, mass spectrometry, and screening techniques. Microarrays, especially, spurred new methods of data analysis. R, the language and environment for statistical computing and graphics, became a popular choice, with several authors writing packages for microarray data manipulation. These packages spread through Bioconductor, an open-source and open-development platform that provides tools written in R for the analysis and comprehension of high-throughput genomic data.

    Since the mid-2000s, starting with 454 Life Sciences' GS20 pyrosequencing platform and Illumina's Genome Analyzer platform, sequencing technologies markedly different from the traditional Sanger method, labeled next-generation sequencing (NGS), have unsurprisingly attracted the development of new algorithms and tools to analyze and make sense of the data at the molecular, cellular, physiological, and organismal levels.

    Bioinformatics tool development has also contributed to a shift in how biologists visualize data: from dot matrices of pairwise alignments, phylogenetic trees, multiple sequence alignments, and 3D protein structures in the early days, to interactive heatmaps from hierarchical clustering of microarray expression data, networks of co-expressed genes, and multi-omic data sets (expression, regulatory elements, epigenetic changes, etc.) overlaid on a genome browser.
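
    To give a flavor of that microarray-era shift, here is a minimal sketch in base R of a clustered expression heatmap. The data are simulated stand-ins, not a real experiment; an actual analysis would start from normalized expression values (for example, obtained through Bioconductor packages):

        # A minimal sketch of a clustered expression heatmap in base R.
        # The matrix is simulated stand-in data with hypothetical gene and
        # sample names; real input would be normalized expression values.
        set.seed(1)
        expr <- matrix(rnorm(200), nrow = 20,
                       dimnames = list(paste0("gene", 1:20),
                                       paste0("sample", 1:10)))
        # heatmap() hierarchically clusters rows and columns (hclust() on
        # Euclidean distances by default) and draws the reordered matrix
        # with dendrograms; scale = "row" z-scores each gene across samples.
        heatmap(expr, scale = "row")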

    Today, biologists are inundated with thousands of bioinformatics tools: 20,918 by one account. The journal PeerJ provides a curated collection of bioinformatics tools published in PeerJ and PeerJ Computer Science. Datasets2Tools, from Avi Ma'ayan's lab at the Icahn School of Medicine at Mount Sinai, is another repository; it lists over 31,500 canned bioinformatics analyses applied to over 6,800 datasets, along with over 4,900 published bioinformatics software tools and databases and all the analyzed datasets. It is worth mentioning that the first few issues of CABIOS had a section titled "Software Reviews", which was soon replaced with "Application Note", a format that continues to this day in the peer-reviewed journal Bioinformatics. Most bioinformatics and computational biology tools and resources are published today in Bioinformatics, Nucleic Acids Research, BMC Bioinformatics, Database, Genome Research, and Genome Biology.

    Of the tools that go through peer review and get published in a journal, the majority are likely to be consigned forever to that journal's table of contents, never gaining a wider audience, for various reasons. A good discussion of this was compiled by Nature Biotechnology (published October 8, 2013), which spoke with the creators of software widely used in bioinformatics and computational biology. Stephen Altschul, Barry Demchak, Richard Durbin, Robert Gentleman, Martin Krzywinski, Heng Li, Anton Nekrutenko, James Robinson, Wayne Rasband, James Taylor, and Cole Trapnell offer their views in the article. They discuss the factors that contributed to the success of their respective tools, whether scientific software development differs from other software development, the misconceptions the research community holds about developing and using software, why so many new software tools go unused, how the field of computational biology is evolving in terms of software development, and emerging trends in computational biology and software tools.

    On our blog, we are planning a series of posts that will provide a historical account of, and perspective on, some of the most successful tools used by biologists, and offer some insight into the reasons behind each tool's success and popularity. Our objective is to provide historical context for the various aspects of method and tool development. Our discussions with authors indicate that this series will also be a celebration of teamwork and collaboration in science.