KnowEng

The NIH has developed a vision for establishing the Commons, a digital enterprise that supports the discovery, identification, and description of a variety of digital objects. As a first step towards this goal, we propose to explore a scalable cloud-hosted data publication system that provides core functionality required for the Commons. Our approach builds upon and integrates (a) data publication capabilities being developed for the Big Data for Discovery Science (BDDS) center and (b) data preparation capabilities developed in the Big Data to Knowledge (KnowEnG) center. It also leverages capabilities being developed within the National Data Service (NDS). Specifically, we propose to 1) deploy, operate, and pilot a self-service cloud-hosted data publication service; 2) develop automated and easy-to-use data publication pipelines that introspect data and extract semantically meaningful metadata and structure; and 3) explore linkages between the data publication service and the bioCADDIE index. Our approach fills an important void in the biomedical publication landscape, enabling the publication of any biomedical data, irrespective of its location, size, type, or format. It builds upon an innovative software-as-a-service model, through which sophisticated data publication capabilities are offered via always-available and easy-to-use web interfaces and APIs, thereby reducing the human and economic costs of data publication. These capabilities will be based on a flexible self-service publication model through which researchers can create and customize personalized data publication repositories over several dimensions (e.g., metadata schema, input forms, access control, data storage location, and persistent identifier providers).
To address the unique challenges associated with big data, the service will support the creation of repositories hosted on a variety of storage systems, such as storage provided by a specific BD2K project or institution, storage operated by NDS, or storage from a commercial cloud provider. At the conclusion of this project, we will demonstrate the value of a flexible cloud-hosted publication service as a crucial component of the Commons, disseminate new machine-learning-based metadata extraction and publication pipelines, and provide linkages with bioCADDIE.