How OICR’s open source software tools are fostering a new era of global collaboration in cancer research
Genomics data play a vital role in cancer research because understanding the biology of cancer and how it evolves is essential for the development of new tools that will allow for the better diagnosis and treatment of disease. Huge international research projects such as the International Cancer Genome Consortium (ICGC) have sequenced tens of thousands of tumour samples and made them freely available to researchers around the world.
One result of this ambitious and groundbreaking work is that a major problem for cancer researchers is no longer a scarcity of genomic data, but an overabundance of it. Researchers need the right tools, such as well-curated databases and accessible software portals, to sort and find the data that projects like the ICGC have already shared with the research community, and to do so without losing so much time that it impedes their work.
Over the last decade, OICR researchers in the Genome Informatics group have built open source tools to solve this problem, providing researchers across the globe with quick and easy access to some of the largest databases of tumour genome information on the planet. These tools have helped to foster the collaborative ecosystem required for today’s data-intensive precision medicine initiatives in cancer research, ultimately helping to accelerate the pace of cancer research so it can benefit patients.
This year, OICR was awarded two major new projects: the ICGC-ARGO (ICGC Accelerating Research in Genomic Oncology) data portal, which will hold four times as much genomic data as ICGC and will also store related clinical data, and the Kids First Data Resource Portal (Kids First), which will link clinical and genomic information about birth defects and childhood cancer to help researchers better understand what the two have in common.
The software driving today’s science
“You can’t do genomics science without software,” says Dr. Vincent Ferretti, Director of Genome Informatics at OICR.
For researchers, a well-designed database is essential to conducting their work. Imagine a library with no classification system and branches all over the city; it could easily take days to track down a single book. Researchers often face a similar problem on a much larger scale, searching for large amounts of data that could be stored in any number of siloed research databases around the world.
“As data sizes grow, research is pretty much unmanageable without the right tools and prevents people from doing large-scale work,” says Junjun Zhang, Senior Bioinformatics Manager on the Genome Informatics team. “You need user-friendly tools that can make sense of it.”
That’s where Ferretti and his team step in. The software and databases they’ve built over the last decade have improved how cancer research is conducted, allowing researchers to search global databases with relative ease.
Ferretti and Zhang were both instrumental in the design and launch of the ICGC Data Portal in September 2013, the first major project for Ferretti’s group. OICR hosts both the Data Coordination Center and the Secretariat for the ICGC.
Today the ICGC Data Portal provides access to more than 1.3 petabytes of data and has more than 200 daily users. (A petabyte is one million gigabytes; for comparison, a typical high-end smartphone today has 64 gigabytes of storage.) Its success facilitated the accomplishments of ICGC as a whole and the ICGC’s ability to foster research projects around the world. Ferretti notes that the team “became kind of famous in the field” for their work on portal development, largely because of their focus on putting user experience first.
It also led to recognition from the National Cancer Institute (NCI) in the US, which asked the University of Chicago and OICR to build a similar portal for the NCI’s Genomic Data Commons (GDC), a unified data repository for cancer genomic studies that fosters collaboration in precision medicine.
The GDC is even bigger and more heavily used by the research community than the ICGC Data Portal. It has thousands of monthly users, with about two petabytes of data accessed each month. OICR designed and built the frontend user interface and data query API (application programming interface) middle layer for the project. “It is a smaller role for us than with ICGC, but it is critical, and super high impact,” says Ferretti. “Our expertise was recognized and we were recruited to create tools for a project that is much larger than ICGC.”
Moving to the cloud
The lessons learned from building ICGC and GDC led directly to the building of an academic cloud computing resource called the Cancer Genome Collaboratory. Because cancer datasets are so large, downloading them can take researchers weeks or months. Storing the data in the cloud lets more users access them and spares those users the time of downloading. The Collaboratory consists of 2,592 CPU cores and more than 7.7 petabytes of storage. It opens up about one terabyte of ICGC data to researchers anywhere in the world – whether they are working at institutions with high-powered supercomputers or not.
Many organizations use commercial cloud computing services such as Amazon for this work, but the Collaboratory is more of an academic resource, allowing researchers to test their work at a lower price.
“If an organization cannot process the ICGC data the way they need, the Collaboratory is available to them,” says Ferretti. The raw data within the Collaboratory, which has been sequenced by various ICGC projects in jurisdictions around the world, has been harmonized for further ease of use. This “allows researchers to compare apples with apples,” he says, making comparisons more scientifically significant.
Next generation software
Now the team has two new challenges: building the database behind ICGC-ARGO, and building the data portal and all other software tools behind Kids First.
ICGC-ARGO is the next phase of ICGC and it’s much larger in size, looking at biospecimens from 100,000 cancer patients and combining them with clinical data to make a resource even more powerful for researchers – and more impactful for patients.
“We are designing a whole new system for ARGO, which is strongly inspired by ICGC but expanded to include clinical data,” says Ferretti. “It will be bigger in size, collect more comprehensive longitudinal clinical information such as diagnoses, exposures, lifestyle, family history, treatment response and survival, have better annotation, and it will be searchable.” This means there will be more clinical fields, data points and clinical reports, which will make the data more complex to manage.
The team is also working in partnership with the Global Alliance for Genomics and Health to make the data interoperable with other systems around the world.
Kids First was announced in August 2017. OICR is currently building Kids First in partnership with the National Institutes of Health (NIH) Common Fund’s Gabriella Miller Kids First Pediatric Research Program and the Children’s Hospital of Philadelphia (CHOP). Kids First presents a different model for the future of how such open source databases can operate.
Kids First will bring together data from dozens of distinct, previously established cohorts focused on birth defects and childhood cancers. CHOP and OICR researchers will combine these data and make them available through a single, cloud-based database and discovery portal. One of the main goals of the project is to help researchers better understand the link between childhood cancers and birth defects, and to find strategies to stop or slow the development of childhood cancers.
Ferretti calls Kids First a ‘third generation’ portal, built on the knowledge and lessons learned from building ICGC and GDC. Unlike those projects, though, Kids First adds complexity and functionality by integrating features commonly found in social media: Kids First users, including both researchers and patients, will be able to connect with each other through the portal. Researchers will be able to browse and share their projects, and patients will be able to connect with researchers and ask questions. In short, the database will give users a platform to talk to one another.
For Ferretti, it’s an incredibly exciting opportunity to connect those who are doing the research with those who benefit from it most. “It’s more challenging, but more interesting, and it’s a whole new role for OICR,” says Ferretti. “I’ve never met an ICGC donor, despite working with their data for years. A project like this makes our work more concrete, and reminds us how our work has such large impact for patients.”
Open science. Open data. Open source.
Everything the Genome Informatics team at OICR has designed is built upon existing open source software. The tools for ICGC-ARGO and Kids First are being built as individual components that can be shared back with the open source community and reused to build more projects and resources.
“This is our new way of building things. This is an opportunity to give back to the community we have been a part of for many years. We’re doing it so the impact of what we do will be even bigger.”
Dr. Vincent Ferretti
For Zhang, the new generation of software tools will enhance OICR’s mission to foster more collaborative science. Together with the existing portals, he sees this software as an opportunity to build bridges between research sectors that were previously disconnected. “The initiatives we’ve worked on were all built to encourage sharing and to reduce silos,” says Zhang. “Bringing everyone together to work is very important and that’s why making the tools easy to use is so essential.”
The team’s work will also help drive cancer research closer to the clinic. While their efforts happen behind the scenes at computers and in server rooms, these tools are fostering a new, more collaborative era of cancer research that was impossible even a decade ago. “We’re not doing therapy. We’re not doing dry lab or drug work. But our contribution with software tools is nevertheless significant and fundamentally important to understanding cancer,” says Zhang.
“Everything the team works on helps the cancer research community do their work faster, better and smarter,” says Ferretti. “And this is the work that will lead to better diagnosis and treatment of cancer.”
As Ferretti transitions to a new position in Quebec, OICR is proud to welcome Dr. Christina Yung to lead the Genome Informatics team. Returning to OICR after a year at the University of Chicago, Yung brings a decade of experience in building infrastructure for large-scale genomic data sharing and analysis. At OICR, Yung will lead the development of the ICGC-ARGO database and access tools, while maintaining the ICGC Data Portal and the Cancer Genome Collaboratory.