The software behind the science
High-tech tools help Stowers scientists focus on discovery
By Anissa Orr
Just eight years ago, analyzing research data was a more primitive affair, remembers Hua Li, PhD, a biostatistician in the Computational Biology Core group at the Stowers Institute. At that time, research scientists could still crunch numbers from most experiments on personal computers and use traditional charts and graphs to highlight findings. But technological advances yielding vast amounts of biological data have forever changed the way research is conducted, reported, and shared.
“Now it is often impossible to analyze all that data on your own workstation,” Li says. “You need to have a room full of servers and good IT [information technology] support. Data storage and computational skills have become essential for biomedical research.”
That’s especially true for scientists at the Stowers Institute, many of whom work in genomics, the study of an organism’s complete set of DNA (its genome). An estimated 80 percent of data processed by the Computational Biology Core involves sequenced genomic data. Sequencing means figuring out the order of DNA bases in a genome: the As, Cs, Gs, and Ts that make up an organism’s genetic code. It has become more affordable and accessible for scientists thanks to high-throughput next-generation sequencing technologies, which also provide other important forms of genetic information.
To make sense of all that data, Stowers scientists increasingly rely on sophisticated computing technologies. The Institute backs their efforts by devoting a substantial portion of the scientific operating budget to providing and supporting computing resources.
The result is a culture that embraces creativity and technological innovation. In particular, new advances in scientific software programs and computing techniques and tools are boosting productivity and making it easier for researchers to focus on important scientific questions. Here’s a closer look at how Stowers researchers are using tech to drive discovery.
Always adapting in IT
Meeting the technology needs of scientists is a constant challenge in an age when new technologies emerge daily and hardware and software quickly become obsolete, says Mike Newhouse, head of Information Management at the Institute.
“The days of stagnant IT are gone,” he says. “Today’s approach to information management demands a continual fluid change of programs, hardware, and storage. Our job is to adapt and handle those changes as they come up.”
Newhouse joined the Institute in 1997, hired by co-founder James E. Stowers Jr. as its sixth staff member, and helped build the IT team from the ground up. Stowers had pioneered the application of computing power to investment management at American Century Investments, his renowned investment management firm, and sought to do the same with the Institute’s basic research.
Since then, Stowers’ information management has grown tremendously—from its humble beginnings in a double-wide trailer with two team members and two computer servers, to its current state-of-the-art offices and data center, housing seventeen team members and more than 250 servers. The rise in storage capacity alone astounds, soaring from just 40 gigabytes to 2.3 petabytes (one petabyte is one quadrillion bytes)—an increase of nearly 60,000-fold.
“Much of our growth is clearly based around the sequencing data and imaging data we collect now,” Newhouse says. “The data our researchers are creating in core groups like Molecular Biology (next-generation sequencing) and Microscopy is massive. The growth is increasing exponentially because of the technologies behind it.”
To keep up, Newhouse maintains a strong IT infrastructure that supports new technologies and provides investigators with up-to-date tools, including more than 350 software packages. “Giving scientists what they need is a challenge at many scientific institutions stymied by bureaucracy,” he says.
“Here there is an attitude of ‘Let’s get investigators what they need to do science. And let’s get it now,’” Newhouse explains.
Madelaine Gogol and Arnob Dutta, PhD
Visualizing data from all angles
While the Information Management team keeps technology running at the Institute, an array of programmers and analysts helps researchers process, analyze, and visualize data. Many of these adept data handlers can be found in the Institute’s Computational Biology Core group, which provides computational support to labs on projects lasting from days to years.
“Piles and piles of sequences don’t mean much, and tables of numbers are really hard to look at and interpret,” says Programmer Analyst Madelaine Gogol. “But seeing data distilled into a plot or figure will allow you to pull meaning from it much easier. Patterns emerge and help you understand what is going on. Like the saying says, ‘A picture is worth a thousand words.’”
Gogol and her Computational Biology colleagues create pictures that precisely illustrate complex data, using a variety of software and programming tools and their own custom scripts. The information revealed can be insightful or surprising, and may lead to more questions begging to be explored.
Gogol recently completed a year-long project with Arnob Dutta, PhD, a postdoctoral research associate in the laboratory of Jerry Workman, PhD. Dutta studied how the Swi/Snf chromatin remodeling complex, a group of proteins that work together to change the way DNA is packaged, regulates gene transcription. Gene transcription is the first step of the process by which information encoded in a gene directs the assembly of a protein molecule. Recent studies have found that 20 percent of all cancers have mutations in the Swi/Snf complex, a finding that has led scientists like Dutta to investigate the complex in more detail.
To help Dutta visualize his results, Gogol used packages written in R, an open-source programming language used for data analysis, to map individual sequence reads to their positions in the genome. She then sliced out the regions around the genes, retaining only the desired genetic area. Next, she clustered the genes by comparing the patterns in each row to one another and repeatedly placing the two closest rows together. Finally, she represented the numerical values with a color gradient to form a graphical image called a heat map.
The final visualized data pops in red and blue. The image gives an immediate global view of gene profiles across different experimental conditions as well as how genes cluster into groups with similar profiles. Dutta used the heat map to understand how a particular component of the chromatin remodeling complex associates with genes under different conditions, with the color gradient representing the degree of association.
“Looking at the colors, you can see that blue is low and red is high, and you immediately get the picture,” says Hua Li, Gogol’s colleague in Computational Biology. “With numbers it is really hard to see a pattern, but with colors you get it immediately.”
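The slicing, clustering, and coloring steps described above can be sketched in a few lines of code. The example below is only an illustration, not the analysis Gogol ran: it is written in Python rather than R, it skips read mapping entirely, and the function names, the average-linkage clustering rule, the Euclidean distance, and the toy numbers are all assumptions made for the sketch.

```python
import math

def slice_window(coverage, center, flank):
    """Keep only the signal within `flank` positions of `center`."""
    start = max(0, center - flank)
    return coverage[start:center + flank + 1]

def distance(a, b):
    """Euclidean distance between two equal-length row profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def linkage(rows, ca, cb):
    """Average pairwise distance between two clusters of row indices."""
    total = sum(distance(rows[i], rows[j]) for i in ca for j in cb)
    return total / (len(ca) * len(cb))

def cluster_order(rows):
    """Agglomerative clustering: repeatedly merge the two closest
    clusters, then return the row order of the final cluster."""
    if not rows:
        return []
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(rows, clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters[0]

def to_color(value, low, high):
    """Map a value onto a blue (low) -> white -> red (high) gradient."""
    if high == low:
        return "#ffffff"
    t = min(1.0, max(0.0, (value - low) / (high - low)))
    if t < 0.5:
        s = int(round(255 * t / 0.5))
        return "#{:02x}{:02x}ff".format(s, s)
    s = int(round(255 * (1 - t) / 0.5))
    return "#ff{:02x}{:02x}".format(s, s)

# Toy data (invented): one row per gene, one column per position.
genes = ["A", "B", "C", "D"]
rows = [[9, 8, 9], [1, 0, 1], [8, 9, 8], [0, 1, 0]]

order = cluster_order(rows)  # [0, 2, 1, 3]: high rows first, then low rows
for i in order:
    print(genes[i], [to_color(v, 0, 9) for v in rows[i]])
```

Reordering the rows this way is what makes blocks of similar color appear in the finished image. A real analysis would typically rely on an established clustering and plotting library rather than hand-written loops.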
Virtually wrapping up the whole package
In the laboratory of Julia Zeitlinger, PhD, Research Specialist Jeff Johnston uses virtual machines both to make sense of their research data and to allow other scientists to reproduce their results. Researchers in Zeitlinger’s lab are studying how an organism is able to turn on and off the correct genes during development, using the fruit fly Drosophila as a model system.
“We go through many different versions of a manuscript before settling on one for publication,” Johnston explains. “During this time, many of the software packages we use get updated, similar to how the apps on your phone or software programs on your laptop are regularly updated. Because of all these changes, we can use virtual machines to build a clean computational environment with specific versions of all the software we need, and then repeat our analysis to ensure it is reproducible.”
A virtual machine is a program that works as if it were a separate computer inside the main computer, allowing users to run multiple operating systems side by side without interfering with one another. For example, a virtual machine would allow a Windows program to run on a Mac.
The Zeitlinger team made one of their first virtual machines public in 2013, with the publication of a paper in eLife. The virtual machine linked from the study contained all the software packages, analysis code, raw data, and processed data used to create the figures and tables in the published manuscript.
“Since the virtual machine is essentially self-contained and frozen in time, it will always be able to reproduce our analysis, even years later when much of the underlying software code becomes obsolete,” Johnston says.
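The idea of fixing “specific versions of all the software we need” can be illustrated on a much smaller scale than a full virtual machine. The sketch below is not the Zeitlinger lab’s tooling: it only compares installed Python package versions against a list of pinned versions, and the function name, package names, and version numbers are hypothetical.

```python
from importlib import metadata

def check_pins(pins, get_version=metadata.version):
    """Return {package: (wanted, found)} for every pin that does not match.

    `found` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches

# Stand-in for a real environment so the example runs anywhere.
# The package names and versions here are invented.
installed = {"alpha": "1.0", "beta": "2.1"}

def fake_lookup(name):
    if name not in installed:
        raise metadata.PackageNotFoundError(name)
    return installed[name]

pins = {"alpha": "1.0", "beta": "2.0", "gamma": "3.0"}
drift = check_pins(pins, fake_lookup)
print(drift)
```

Called without the second argument, `check_pins` queries the packages actually installed in the running Python environment. A check like this only reports drift. A virtual machine goes further by freezing the operating system, the software, and the data together.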
Sharing data in this way is important because it advances research and paves the way for future developments in how data is analyzed and shared, he says. In this spirit, Johnston and his colleagues also use literate programming, a form of data analysis that mixes software code with descriptive text. When users click on a file, they see a more detailed description of the programming used to analyze data—a document that reads more like a research “how to” than a string of code.
“This makes the resulting analysis much more presentable, easier to follow, and more amenable to use as a teaching tool,” Johnston says.
The past decade has been one of immense change for biomedical research, and continual innovations in technology and genome engineering promise even more change. It’s a future that excites IT experts, analysts, and scientists alike, who look forward to the challenge of using the latest technology to further the Institute’s science.
“My basic goal is to help investigators understand and really see their data as quickly and thoroughly as possible, with the underlying hope that it will tell us something interesting and new about the processes of life,” Gogol says. “I hope to contribute in my own small way to the discoveries that researchers are making about these wonderful complex biological systems that are going on daily within and all around us.”
Information Management: Left to right, back row: Steve DeGennaro, Andrew Holden, Dustin Dietz, Chad Harvey, David Hahn, Mark Matson, Jay Casillas, Mike Newhouse, Samuel Burns, Dan Stranathan. Front row: Chris Locke, Jenny McGee, David Duerr, Amy Ubben, Jordan Hensley. (Not pictured: Shaun Price and Robert Reece.)