The Synapse platform facilitates collaboration among scientific teams and integrates with analysis tools and programming environments. It is both a software environment and a data corpus that now includes over 10,000 datasets from GEO, TCGA, ArrayExpress, and other public databases.
Synapse is now being used as the official resource for hosting analysis-ready TCGA data for use in the TCGA Pan Cancer analysis working group, and will provide the versioned, citable datasets referenced in the TCGA Pan Cancer publications from the Nature Publishing Group to appear as a collection later this year. We have been highlighted by Amazon as an official case study in the use of their Simple Workflow technology, which we used to provide flexible use of both in-house and cloud-based IT resources to process large volumes of data in a cost-efficient and timely manner.
The past two decades have seen exponential growth in the ability to generate genetic and biomolecular data on patients in a variety of disease contexts. These breakthroughs have led both industry and academia to direct a growing share of resources at genomic research. However, with a few exceptions, these investments have failed to significantly improve prevention or treatment of many common human diseases.
A fundamental reason for this discrepancy between data generation and clinical improvement is the immaturity of the analytical techniques needed to meaningfully interpret these new data types. As with any new field, analytical methodologies need to be iteratively developed and refined in order to gain predictive power. The difficulty of accessing, evaluating, and reusing data, analysis methods, or models of disease across multiple labs with complementary fields of expertise is a major barrier to the effective interpretation of genomic data today. Additionally, much of the data relevant to a particular research question is spread among multiple public and private repositories. Because each research group protects its own data, tools, and results, the end result is enormous duplication of effort and missed opportunities across both industry and academia.
As a 501(c)(3) nonprofit organization, Sage Bionetworks’ mission is to catalyze a cultural transition from the traditional single-PI, single-lab, and single-company research paradigm to a model founded on broad precompetitive collaboration. This structure would benefit patients by accelerating development of disease treatments, and society as a whole by reducing the cost of health care and biological research. Sage Bionetworks is actively engaged with academic, industrial, governmental, and philanthropic collaborators in developing this distributed research model.
Part of Sage Bionetworks’ strategic solution is Synapse, an informatics platform for open data-driven collaborative research which Sage Bionetworks develops and operates as a public resource for the scientific community. Synapse helps scientists solve a series of problems:
- Finding and using relevant data – Currently, scientists have difficulty tracking down and gaining access to data and resources generated by others, even within the confines of the same organization. Synapse provides a central registry for scientific data, in which data can be annotated and queried even while the raw data resides in other systems.
- Understanding analysis workflows – Synapse is built with the understanding that most analytical research is experimental and ad hoc in nature, with hardened analysis methods only emerging over time. Tracking who has run what version of code on what version of the data immediately helps projects run more smoothly, and ultimately enables reproducible workflows that allow others to build on prior work.
- Supporting genome-scale analysis – Analyzing datasets with information on whole genomes is currently limited to those with access to large computational resources and significant IT support. Synapse facilitates a computational style where code and users can move to the data, wherever it is stored.
- Forming and maintaining productive collaborations – Most scientists tend to start a research effort from scratch rather than elaborate on work in an unknown state. Synapse helps scientists track what work has already been done in a particular area and create and sustain active collaborations in which research results are published online as they are generated.
The Synapse web portal is an online registry of living research projects that allows data scientists to discover and share data, models, and analysis methods. Any scientist can create a project hosted on Synapse and invite specific collaborators or the public to join it. These online workspaces then serve as the glue that helps distributed teams of researchers collaborate on complex scientific analyses. The web portal has a wiki-centric design, making it easy for scientists to build narratives that combine analysis results, figures, and prose to explain their work to collaborators, and providing a staging ground for eventual publication. Synapse projects range from individual research efforts to large distributed collaborations across dozens of sub-teams, such as TCGA working groups or Sage’s open DREAM challenges in computational biology.
Bioinformaticians apply a diverse and rapidly evolving set of tools to understand genomics and other biomolecular data, and leverage a variety of IT infrastructures for their analysis. Synapse works with many of these tools by providing integrations with R, Python, and the Linux shell, which allow analysts to communicate with Synapse via a common set of web services as they perform their analyses. These services also allow the history of multi-step analyses to be tracked through the generation of provenance relationships, which connect the analysis data, code, intermediate results, and figures together. These relationships can be visualized as interconnected graphs and explored from the Synapse web site, where they provide visual documentation of the analysis workflow and aid in reproducibility and reuse.
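The provenance relationships described above can be pictured as a small directed graph. The following is a minimal in-memory sketch of that idea, not the Synapse client or its web services: the `Activity` and `ProvenanceGraph` classes, their methods, and the file names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Activity:
    """One analysis step: which inputs it used and which code it executed."""
    name: str
    used: list = field(default_factory=list)      # IDs of input data or results
    executed: list = field(default_factory=list)  # IDs of the code that was run


class ProvenanceGraph:
    """Toy provenance store (hypothetical; not the Synapse API)."""

    def __init__(self):
        self._generated_by = {}  # entity ID -> Activity that produced it

    def record(self, output_id, activity):
        """Note that `output_id` was generated by `activity`."""
        self._generated_by[output_id] = activity

    def lineage(self, entity_id):
        """Return every upstream entity ID, nearest ancestors first."""
        seen, queue = [], [entity_id]
        while queue:
            current = queue.pop(0)
            activity = self._generated_by.get(current)
            if activity is None:
                continue  # a raw input or script with no recorded history
            for parent in activity.used + activity.executed:
                if parent not in seen:
                    seen.append(parent)
                    queue.append(parent)
        return seen


# A two-step analysis: normalize raw data in R, then plot the result in Python.
graph = ProvenanceGraph()
graph.record("normalized.tsv",
             Activity("normalize", used=["raw.tsv"], executed=["normalize.R"]))
graph.record("figure1.png",
             Activity("plot", used=["normalized.tsv"], executed=["plot.py"]))

print(graph.lineage("figure1.png"))
```

Walking the graph backwards from a figure recovers the intermediate result, the raw data, and both scripts, which is the kind of information a reader would need in order to reproduce or reuse the analysis.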
Synapse provides a registry of data, code, and other research resources hosted on the web. Files may either be stored in Synapse directly, or reside in a variety of external systems such as academic or industry data centers, cloud computing providers such as Amazon, or code repositories such as GitHub. In either case, the files may still be annotated, queried, and referenced in Synapse provenance records, so that researchers can track and share their workflows. The Synapse architecture allows diverse teams of researchers to analyze data on the IT infrastructure that makes sense for their project while calling Synapse web services to centrally aggregate and distribute the results of their work. The end result is a centralized registry of living research projects, with data storage and computation spread over a variety of back-end providers.
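One way to picture a registry that annotates and queries files without holding their bytes is the sketch below. It is an illustrative toy under stated assumptions, not Synapse's actual data model or API: the `RegistryEntry` and `Registry` classes, the annotation keys, and the example locations are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RegistryEntry:
    """Metadata about one file; the file itself may live elsewhere."""
    entry_id: str
    location: str   # where the bytes live (hypothetical URLs below)
    external: bool  # True if the file stays in another system
    annotations: dict = field(default_factory=dict)


class Registry:
    """Toy central registry (hypothetical; not the Synapse API)."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.entry_id] = entry

    def query(self, **criteria):
        """Return sorted IDs of entries whose annotations match every criterion."""
        return sorted(
            entry.entry_id
            for entry in self._entries.values()
            if all(entry.annotations.get(k) == v for k, v in criteria.items())
        )


registry = Registry()
# A file stored directly in the registry's own storage.
registry.register(RegistryEntry(
    "expr-001", "internal://storage/expr-001.tsv", external=False,
    annotations={"disease": "breast cancer", "dataType": "expression"}))
# A file that stays with a cloud provider; only its metadata is registered.
registry.register(RegistryEntry(
    "expr-002", "s3://example-bucket/expr-002.tsv", external=True,
    annotations={"disease": "breast cancer", "dataType": "expression"}))
# Code that stays in an external repository.
registry.register(RegistryEntry(
    "code-001", "https://example.org/repo/pipeline.py", external=True,
    annotations={"dataType": "code"}))

print(registry.query(disease="breast cancer", dataType="expression"))
```

Because queries run only against the annotations, the internally stored and externally hosted files are found in the same way, regardless of which back end holds the data.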