The birth of BPNet—a true interdisciplinary collaboration
A behind-the-paper glimpse at how scientific discoveries come to life
By Melissa Fryman
KANSAS CITY, MO—DNA is well-known for encoding proteins. However, an estimated 80 percent of disease-causing mutations actually occur in non-coding DNA, which makes up about 99 percent of many genomes including the human genome. Non-coding DNA can serve many regulatory purposes including turning on and off nearby genes that encode proteins. Understanding how mutations in regulatory elements cause disease is an extremely difficult task because the rules by which elements in the regulatory DNA encode gene regulatory activity are not well understood.
An innovative new tool, a neural network known as Base Pair Network (BPNet), described in a paper published online February 18, 2021, in Nature Genetics, promises to allow researchers to identify patterns within these regulatory elements to ultimately probe the rules by which they instruct genes to become active.
To understand the power of BPNet, it is important to understand how it came into being.
“We had these beautiful data from many studies,” explains Julia Zeitlinger, PhD, study lead and Stowers investigator. The Zeitlinger Lab had pioneered ChIP-nexus, a protocol that generates high-resolution profiles of what is happening on regulatory DNA. In particular, it reveals the footprints left where specialized proteins called transcription factors bind DNA, right down to the specific DNA base pairs. “We could see in the binding profiles of many experiments that transcription factors were interacting, but with the limited computational tools we had, we couldn’t identify the DNA sequence patterns that promoted these interactions. It was very frustrating.”
“We knew that there are patterns to be found,” adds Melanie Weilert, second author on the paper and a bioinformatician in the Zeitlinger Lab. “The problem is that these patterns are so complex, that one postdoc staring at a computer for four years is not even going to begin to uncover the complexity of the nature of some of these sequences.”
Then Anshul Kundaje, PhD, came into play. Zeitlinger attended a large conference in 2015 where Kundaje gave a talk laying out the idea, along with first results, of using neural networks to understand the regulatory code.
“It was a completely new approach. Its potential was hard to judge, but the premise made sense and intrigued me,” Zeitlinger says.
Then Kundaje’s student, Avanti Shrikumar, PhD, convinced her over lunch that this was indeed a worthwhile approach. “The clarity by which Avanti answered my questions and explained the approach deeply impressed me,” Zeitlinger recounts.
The next time Zeitlinger saw Kundaje at a conference in 2016, they began talking about collaborating, and in 2017, she sent him a ChIP-nexus data set from an analysis of gene regulation in mouse cells with the hope that his lab could analyze the data with neural networks. But nobody had ever modeled such high-resolution data, and it was unclear whether the project would take off.
For a while, not much progress was made, until Žiga Avsec, PhD, joined the project in 2018. Now a senior scientist at DeepMind in London, Avsec is the first author of this work and the creator of the original code for BPNet.
Zeitlinger considers Avsec’s contribution the keystone of the bridge between her lab and Kundaje’s lab, which made it a truly interdisciplinary collaboration. “Although both our labs work on regulatory DNA, our approaches are very different,” Zeitlinger says. “Avanti and others in the Kundaje lab have developed these amazing neural network tools, which is all about equations and the math behind it. That’s very far from what we do. We care a lot about what transcription factors do in biology, and this requires a lot of literature knowledge and careful hypothesis testing. Žiga was very communicative and receptive to ideas from both sides. He had this amazing intuition, discipline, and brilliance.”
At the time, Avsec was wrapping up his doctoral studies as a bioinformatics student in the lab of Julien Gagneur, PhD, at the Technical University of Munich, Germany. Two-thirds of the way through his thesis work on Kipoi, an online repository that serves as a platform for sharing trained neural networks in genomics, he decided to do a lab rotation across the Atlantic.
“I wanted to do something else just as a refresher. I had collaborated with Anshul, so I wanted to learn more about the neural network tools his lab had developed. Also, he was at Stanford University, so I figured I would do this in California on a beautiful American college campus,” Avsec recalls.
What was meant to be a three-month visit turned into much more: an intense collaboration lasting more than a year and the birth of BPNet. Shortly after Avsec’s arrival, Kundaje suggested he work on the mouse ChIP-nexus data from the Zeitlinger Lab.
At first, Avsec says, he tried to build a neural network to cluster the profiles of transcription factor binding footprints, which was a hard problem. “It didn’t really work. Then I thought, let me do something simpler. Maybe I just try to predict the footprints directly from the DNA sequence. By that evening, I had the model working.”
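To make "predicting from the DNA sequence" concrete, here is a minimal sketch of the first step such a model typically takes: converting DNA letters into numbers (one-hot encoding) and sliding a pattern detector along them, which is the basic operation of a convolutional neural network. This is not BPNet's code or architecture. The function names are illustrative, and the filter weights in `TOY_FILTER` are invented by hand, whereas a real network would learn them from data.

```python
# Toy illustration only: these names and weights are assumptions, not BPNet's code.
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of four numbers, one column per base (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def scan(seq, motif_filter):
    """Slide a filter along the sequence and return one score per window."""
    encoded = one_hot(seq)
    width = len(motif_filter)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = 0.0
        for offset in range(width):
            for channel in range(4):
                score += encoded[start + offset][channel] * motif_filter[offset][channel]
        scores.append(score)
    return scores


# Hypothetical hand-written filter that rewards the 4-base pattern "TGCA".
# Columns follow BASES order: A, C, G, T.
TOY_FILTER = [
    [-1.0, -1.0, -1.0, 1.0],  # position 1 prefers T
    [-1.0, -1.0, 1.0, -1.0],  # position 2 prefers G
    [-1.0, 1.0, -1.0, -1.0],  # position 3 prefers C
    [1.0, -1.0, -1.0, -1.0],  # position 4 prefers A
]

scores = scan("AATGCATT", TOY_FILTER)
best = max(range(len(scores)), key=scores.__getitem__)
print(best, scores[best])
```

In this toy sequence, the window starting at index 2 (the bases TGCA) receives the highest score. A trained network stacks many such learned filters and combines their outputs to predict the binding footprint at each base.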
Neural networks are a type of artificial intelligence composed of a large set of equations designed to recognize patterns. Usually, neural networks are trained for, and used to predict, a specific output. For example, an image classification neural network is trained to recognize a type of image, such as a face. After the training, the user can give the network any image as input, and the hidden layers of the neural network will make the calculations to predict whether the patterns match those of a specific face. These kinds of neural networks are typically highly accurate – however, they are considered black boxes.
“Finding out exactly how the face is recognized is very, very, hard. With pixels, interpreting the model is almost impossible. But Anshul and Avanti had realized that with genomics data, this is actually feasible. You just have to find the right math to do it,” says Zeitlinger. “And if you can open the black box and interpret the model, it opens a lot of possibilities.”
“When seeking patterns in the genome, assumptions can be limiting,” Avsec explains. “More traditional approaches in bioinformatics involve coming up with a very rigid, fixed model, with a pre-specified set of rules, and then trying to map the data on top of the rules. You can miss a lot of nuances when doing that. Here, we abandoned this paradigm, by first fitting the function and distilling the rules later.”
Zeitlinger agrees. “Traditionally, the analysis we did is done in a very hypothesis-driven manner. But deep learning does not require that. It’s complex enough that it can capture any patterns that exist in the data. The trick is to have sufficient data and a very good readout. By training the network with the base-resolution profile itself, we were able to get a much finer readout than ever before.”
“This is a perfect example of neural network training,” says Weilert. “The model converges on the solution inside, and the numbers all manage to match what they should be. Then, we open up the model. What we want to know is how the model got to the solution, because those rules of biology are what we’re trying to uncover.”
“I was fascinated when I first ran the code. I still remember sitting in my room coding this up and I saw these sequence motifs popping up beautifully,” says Avsec. “I didn’t have the knowledge to connect the dots biologically, so I just tried to share as much information as I could.”
“Žiga wrote a conference paper with Anshul that he sent to me, and this was like a dream come true for me,” says Zeitlinger. “The promise of being able to see how transcription factors influence each other—that was there.”
“I was shocked, in a positive way, that Julia was so excited,” Avsec says. “Because generally, biologists are skeptical of modeling. But Julia was extremely open to it, so I was very lucky that she immediately understood the potential of this.”
What followed were many weeks and weekends of intense collaboration—cycles of data processing, analysis, and feedback, all online. “We had the COVID-19 experience of working remotely even then!” Avsec jokingly remembers. “I’m really glad that Julia kept it all together and kept us on track. Without her, this project may not have seen the light of day, or at least not with such an exciting biological focus. I was also fortunate that my supervisor Julien Gagneur was very generous and let me work on this for the rest of my PhD.”
“This was the most fun period of doing science that I’ve experienced,” says Avsec. “It’s hard to say what came from me and what came from Anshul on the interpretation front. One of Avanti’s interpretation tools, TF-MoDISCo, was absolutely critical. Avanti and Amr Alexandari helped with ideas for distilling the rules by which motifs instruct the interactions between transcription factors. Melanie was extremely good at programming, and super helpful with data processing and following up the results with more analysis. I was impressed.”
“The enthusiasm of everyone involved kept the project going,” Zeitlinger says. “Usually, collaborations go much more slowly, but Žiga kept the results coming. He made them easy to access and responded very quickly to suggestions. It was additive.”
Both Zeitlinger and Avsec commented that their work encountered resistance. “People can be skeptical because it might look like magic to some,” says Avsec. “But once you dig deeper, it’s just that we now have enough of high-quality data, computing power, and good algorithms to be able to capture these phenomena well.”
Zeitlinger agrees. “A lot of computational scientists had built up prejudices against neural networks because five or ten years ago, they weren’t that powerful and there weren’t enough data. And biologists often just shrugged their shoulders when I gave talks. Some were even quite dismissive or gave back-handed compliments like ‘You are brave to present such preliminary data’.”
To that end, it was important to demonstrate that BPNet could predict the outcome of mutating sequences in the mouse genome, an effort that was led by Sabrina Krueger, PhD, in the Zeitlinger Lab. “We made very subtle single-base mutations with CRISPR gene-editing technology and our predictions were indeed amazingly accurate. It was really good to see that validation,” says Zeitlinger.
“Ultimately, this is where BPNet is so powerful,” Zeitlinger explains. “It can predict the effect of sequences that were not included in the experiment and, thus, guide future experiments. Before, we could not easily change regulatory sequences in the genome because the effect was too hard to predict. With BPNet, we can now make very accurate predictions, which will allow us to do more focused experiments in vivo. It should also allow us to identify disease mutations more easily in the future.”
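The idea of predicting what a mutation will do before making it in the lab can be sketched in a few lines: feed the model the original sequence, feed it a copy with one base changed, and compare the two predictions. The example below is an assumption-laden toy, not the study's method. `toy_binding_score` is a hypothetical stand-in for a trained model, and it returns a single number, whereas BPNet predicts a base-resolution binding profile.

```python
# Toy illustration only: toy_binding_score is a hypothetical stand-in for a trained model.
BASES = "ACGT"


def toy_binding_score(seq, motif="TGCA"):
    """Return the best fraction of motif bases matched by any window of the sequence."""
    width = len(motif)
    best = 0
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        matches = sum(1 for a, b in zip(window, motif) if a == b)
        best = max(best, matches)
    return best / width


def mutation_effects(seq, predict):
    """Map (position, new_base) to the change in prediction for every single-base substitution."""
    reference = predict(seq)
    effects = {}
    for pos, ref_base in enumerate(seq):
        for alt in BASES:
            if alt == ref_base:
                continue
            mutated = seq[:pos] + alt + seq[pos + 1:]
            effects[(pos, alt)] = predict(mutated) - reference
    return effects


effects = mutation_effects("AATGCATT", toy_binding_score)
most_damaging = min(effects, key=effects.get)
print(most_damaging, effects[most_damaging])
```

Here, substitutions inside the TGCA motif (positions 2 through 5) lower the toy score, while substitutions outside it leave the score unchanged. Ranking mutations this way is one simple route to choosing which edits are worth testing experimentally.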
“There are, of course, limitations,” cautions Zeitlinger. “The regulatory code is very different for each cell type, and neural networks are only good at making predictions for problems they were trained on. It will take a while to apply it to all sorts of data. But so far, it has worked very well for us, and we are making the software broadly available to other scientists.”
Avsec recalls, “We had a running joke—when will we fall off the cliff? In research, there’s no guarantee that it’s going to work, that’s just the reality. But then we were quite far in on the project, and we still hadn’t fallen off the cliff.” Luckily for all, the cliff never came, and BPNet emerged to see the light of day.