Tom Narock¹ and Anirudh Prabhu²
¹ Goucher College, Center for Data, Mathematical, & Computational Sciences
² Carnegie Institution for Science, Earth and Planets Laboratory
McGranaghan and colleagues (McGranaghan et al., 2021) have highlighted the problem of our outdated data systems and have proposed an Earth and Space Data Knowledge Commons. The problem they describe is not one of information, but of access. Datasets, disciplines, people, and projects are all siloed. The envisioned knowledge commons is an open ecosystem, which the entire community evolves and maintains. The knowledge commons is both technical and social. To create a knowledge commons, we must first create knowledge graphs, connect those graphs to form a knowledge network, and develop a self-sustaining community around that knowledge network (McGranaghan et al., 2021). Past experiences in knowledge graphs and knowledge networks have contained both pain points and network effects. Highlighting these shared experiences is important for future development of a sustainable knowledge community.
This essay details previous efforts in community-based knowledge graph and knowledge network construction. We highlight the pain points as well as the network effects that can be leveraged for additional value as more people join the community. Our goal is to highlight exemplary projects in the context of community ontology development, interoperability of ontologies, governance, and collaboration. We quantify the benefits that knowledge graphs afford while also discussing the challenges of transitioning to knowledge networks. In doing so, we present lessons learned and strategies for future knowledge commons development. We aim to highlight challenges that may arise around such things as: shared terminology, maintaining and evolving ontologies, interoperability of ontologies, multi-level coordination and collaboration, and community engagement.
To facilitate a shared discussion, we reuse the terminology developed in McGranaghan et al. (2021) and restate it here for convenience.
Knowledge Graph (KG): information structured into a graph form by a specific data model/schema/ontology that defines entities (objects, events, situations or abstract concepts) and their relationships. It is a collection of interlinked descriptions of entities – objects, events or concepts.
Knowledge Network (KN): Connected knowledge graphs. Knowledge networks construct linkages between disparate knowledge bases. An example is the Linked Open Data Cloud (https://lod-cloud.net/).
Knowledge Community: The knowledge network provided in a manner to build community around it--connecting the traditional technological to a cultural technological component. An example is the full complement of Wikimedia Foundation projects and chapters: (https://www.wikimedia.org/).
A Knowledge Commons is a combination of intelligent information representation and the openness, governance, and trust required to create a participatory ecosystem whereby the whole community maintains and evolves this shared information space.
Although the term knowledge graph was popularized with the release of Google’s Knowledge Graph in 2012 (Hitzler, 2021), the definition of “knowledge graph” remains contentious (see Hogan et al., 2021 and references therein). We agree with the McGranaghan et al. (2021) definition and highlight the inclusive Hogan et al. (2021) definition that a knowledge graph is “a graph of data intended to accumulate and convey knowledge of the real world, whose nodes represent entities of interest and whose edges represent relations between these entities.” Much of the contention in defining knowledge graphs comes from the breadth and depth of their actual design and use. Knowledge graphs can be openly accessible, as in the case of Wikidata (Vrandečić and Krötzsch, 2014), or they can be enterprise knowledge graphs (e.g. efforts by Google, Facebook, and eBay) serving internal business needs and not entirely accessible outside of the company (Noy et al., 2019; Hitzler, 2021). Additionally, data can be structured as a graph in several different ways, with some of the more popular techniques including directed edge-labeled graphs and property graphs (Angles et al., 2017; Hogan et al., 2021). Implementations of the former are often, but not exclusively, encoded in the Resource Description Framework (RDF, Cyganiak et al., 2014) and Web Ontology Language (OWL, Hitzler et al., 2009) standards, while the latter are commonly implemented in graph databases such as Neo4j. Placed in this context, knowledge community members may all be implementing their own “knowledge graphs”, yet their encodings and query languages differ. We do not advocate for a specific implementation of knowledge graphs. We simply begin our discussion by highlighting the fuzziness of the term and with a reminder that interoperability is not intrinsically guaranteed.
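To make the distinction between the two graph models concrete, the following minimal sketch encodes the same small fact both ways using only plain Python structures. Every identifier in it (the `ex:` prefix, `cruise42`, `person7`, the property names, and the helper functions) is a hypothetical placeholder chosen for illustration, not drawn from any of the projects discussed here, and neither structure is meant to stand in for a real RDF store or graph database.

```python
from dataclasses import dataclass, field

# Directed edge-labeled graph: a set of (subject, predicate, object) triples.
# All identifiers are hypothetical and exist only for this illustration.
triples = {
    ("ex:cruise42", "rdf:type", "ex:Cruise"),
    ("ex:cruise42", "ex:hasChiefScientist", "ex:person7"),
    ("ex:cruise42", "ex:startYear", "2012"),
    ("ex:person7", "ex:name", "A. Researcher"),
}


# Property graph: nodes and edges each carry a label plus key/value properties.
@dataclass
class Node:
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: str
    target: str
    label: str
    properties: dict = field(default_factory=dict)


nodes = {
    "cruise42": Node("Cruise", {"startYear": 2012}),
    "person7": Node("Person", {"name": "A. Researcher"}),
}
edges = [Edge("cruise42", "person7", "hasChiefScientist", {"role": "lead"})]


def chief_scientist_from_triples(cruise):
    """Answer the question by pattern-matching over triples."""
    return [o for s, p, o in triples
            if s == cruise and p == "ex:hasChiefScientist"]


def chief_scientist_from_property_graph(cruise):
    """Answer the same question by scanning labeled edges."""
    return [e.target for e in edges
            if e.source == cruise and e.label == "hasChiefScientist"]
```

Both functions answer the same question, but the identifiers, the place where attributes live, and the lookup logic all differ. This is the sense in which two communities can each hold a "knowledge graph" and still not be interoperable without further work.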
Launched in 2009, the Deep Carbon Observatory (DCO) was a 10-year scientific initiative focused on the study of carbon in the deep Earth. Work over the next 2 years led to the emergence of four communities in the DCO: Deep Life, Deep Energy, Reservoirs and Fluxes, and Chemistry. As these teams were preparing a set of decadal goals to complete by 2019, researchers noticed a need for a cross-cutting team with expertise in data science to manage the large quantities of data generated by the scientific activities (Ausubel, 2019). Thus, the DCO Data Portal (led by the DCO Data Science Team) was established in 2010, with the goal of identifying and providing access to data, articles, people, projects, applications and other research products generated from the DCO activities (M. Parsons, personal communication, April 2019). This led to a steady rise in data science activities and collaborations in the DCO community.
One of the first major goals of the DCO Data Science Team was to create a Deep Carbon Virtual Observatory (DCVO). The DCVO provided: 1) a schema categorizing various concepts in DCO’s scientific works, 2) an ability to identify and annotate all key entities, agents and activities, 3) a repository for archiving data and associated metadata and 4) an integrated portal to manage diverse content that can be accessed at various levels (Ma et al., 2014). Over the years, the DCVO has been expanded, extended and improved, thus exemplifying the strength of the underlying framework used to build this virtual observatory. This framework used and connected existing services and platforms like Drupal, VIVO, CKAN and the Handle System in a modularized and highly reusable manner (Ma et al., 2017).
The DCO Data Science Team also put into practice the creation of an open knowledge network, where open science efforts in each of the four communities were connected, categorized and annotated, instead of being published as separate fragments (Ma et al., 2017). This knowledge network met the needs of a community of over 1000 scientists from over 35 countries.
The success of DCO shows the need for a data science team in any knowledge commons effort. This team will be key in connecting various parts of the wider scientific community, and is essential to consistent and continuous maintenance and upgrading of the data resources, the services and platforms used, and, if required, the underlying framework. Also, a modularized and reusable framework provides the flexibility and versatility for extended applications. For example, the Global Earth Mineral Inventory, which was created as a DCO data legacy, used the same framework and platforms as the larger DCO data portal (Prabhu et al., 2021).
While the existence of a data science team in a knowledge commons and the use of a modularized framework provide a pragmatic and efficient way to foster a knowledge commons, there are potential risks associated with this approach. The continued existence of a core data science team becomes essential to the sustainability of the knowledge commons, and the frequency and quality of the interactions between the data science team and other parts of the scientific community influence the successful development, maintenance, and expansion of the knowledge commons. Additionally, when building a platform like the DCVO from existing services and platforms, we need to choose the right components based on the needs and goals of the use case, because changes in the individual components of the platform ripple through the knowledge network. This risk may be solved, or at least curtailed, with consistent communication between the people managing the components and the team managing the created platform (like DCVO), thus expanding the membership of the knowledge commons.
OceanLink (Narock et al., 2014) was an online platform that addressed scholarly discovery and collaboration in the ocean sciences. At the time, a wide spectrum of maturing methods and tools, collectively characterized as the Semantic Web, were beginning to vastly improve the discovery and dissemination of scientific research. OceanLink leveraged the Semantic Web, in conjunction with web mining and crowdsourcing, to identify links between data centers, digital repositories, and professional societies.
An effective knowledge commons will integrate a broad collection of diverse resources. One often overlooked aspect of using such a collective is the need to inform users of who is asserting what. Let us illustrate this point with an example from the OceanLink project. Figure 1 shows the results of an OceanLink query in which a user was searching for a particular research cruise. The OceanLink portal used the underlying knowledge graph to aggregate results from several providers on the user’s behalf. The knowledge graph identified the chief scientist of the cruise and provided a link to their home page. Links were also provided to dataset and cruise details at BCO-DMO and R2R, which are repositories for oceanographic data. However, the key aspect of the system we’d like to highlight is the provenance.
Figure 1. An example results page from OceanLink in which a user asks for items related to a specific research cruise. This figure is adapted from Figure 5 in Narock et al., 2014.
The user is informed that BCO-DMO asserts that its data and R2R’s data are in fact referring to the same cruise despite using different identifiers. Moreover, the simple separation in the results page helps inform users which participant of the knowledge community is saying what. BCO-DMO, as a repository, is asserting a connection between its data and another repository’s data. The knowledge graph itself is asserting that cruises take place on vessels. The American Geophysical Union (AGU) is leveraging the knowledge graph’s inference to further assert that while the specific cruise was not found in any presentation abstracts, the vessel was.
We believe that provenance is vital to a successful knowledge commons. Discovery of information, resources, and connections is necessary, but not sufficient. Every member of a knowledge commons is a unique participant, and those unique participants need to have a unique voice. Participants first need the capability to assert statements independently of the other participants, as BCO-DMO does in the example above. It is important for quality assurance and decision making that the knowledge commons then provides a means of sharing these individual assertions with the community. For these reasons, we believe it is vital for a knowledge commons to invest in provenance technology such as the W3C PROV-O recommendation (Lebo et al., 2013).
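As a rough illustration of "who is asserting what", the sketch below attaches an attribution to every statement and then groups statements by their source, loosely mirroring the separation shown on the OceanLink results page. It is not OceanLink's implementation. The identifiers, the `ex:` predicates, and the choice of `owl:sameAs` for the cruise equivalence are all assumptions made for the example, and the `attributed_to` field only plays the role that `prov:wasAttributedTo` would play in a PROV-O encoding.

```python
from collections import defaultdict
from typing import NamedTuple


class Assertion(NamedTuple):
    subject: str
    predicate: str
    obj: str
    attributed_to: str  # stands in for a prov:wasAttributedTo link


# Hypothetical statements modeled on the cruise example in the text.
assertions = [
    Assertion("bcodmo:cruise/123", "owl:sameAs", "r2r:cruise/ABC", "BCO-DMO"),
    Assertion("bcodmo:cruise/123", "ex:tookPlaceOn", "ex:vessel/V1",
              "KnowledgeGraph"),
    Assertion("ex:vessel/V1", "ex:mentionedIn", "agu:abstract/XYZ", "AGU"),
]


def group_by_source(items):
    """Return {source: [(subject, predicate, object), ...]} for display."""
    grouped = defaultdict(list)
    for a in items:
        grouped[a.attributed_to].append((a.subject, a.predicate, a.obj))
    return dict(grouped)


if __name__ == "__main__":
    for source, statements in group_by_source(assertions).items():
        print(f"{source} asserts:")
        for s, p, o in statements:
            print(f"  {s} {p} {o}")
```

Keeping the attribution on each statement, rather than on the graph as a whole, is what lets a portal show one participant's claims separately from another's, and lets a user discount a single source without discarding the rest.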
In 2007, NASA commissioned several federated information systems. Five systems were funded to serve the solar, heliospheric, magnetospheric, radiation belt, and upper atmosphere communities. Each of these systems was responsible for providing uniform access to its underlying sources of heterogeneous and distributed data. These systems were broadly known as Virtual Observatories, a paradigm (Szalay and Gray, 2001) that began in astronomy and quickly spread to heliophysics, oceanography, volcanology, and other diverse scientific communities. Specifically, the Virtual Observatory paradigm unites large quantities of disparate and heterogeneous data, usually under one web-based portal. The underlying data remain heterogeneous and distributed, yet common metadata, access protocols, and terminology provide transparent access to users. Within the NASA Heliophysics domain, the five Virtual Observatories were supported by numerous data analysis and visualization capabilities that together provided a diverse Heliophysics Data Environment. The systems within this environment implemented search capabilities relevant to their domain. For example, the Virtual Solar Observatory, dealing primarily with solar images, focused on optical search parameters such as wavelength and intensity. The Virtual Heliospheric Observatory, by contrast, dealt primarily with in situ time series data. In a similar manner, the remaining NASA Virtual Observatories implemented search capabilities suited to the types of data they contained while leveraging a common underlying hierarchy of terms and relationships.
This unified collection of terms and relationships was managed by a group known as the Space Physics Archive Search and Extract (SPASE) consortium (Harvey et al., 2008). In an effort to serve the entire NASA Heliophysics Data Environment, the consortium imposed a formal governance structure consisting of regular meetings and community voting on any modifications and additions. We applaud the openness and democratic nature of the consortium, and this critique is not intended as an indictment. We simply want to highlight two potential challenges for future knowledge commons. First, while this governance model is open and democratic, it is also very often tedious and slow moving. Proposed modifications need to wait until the next formal meeting to be debated - and may not be resolved in one session. Moreover, voting results, much like in our political systems, can depend on engagement and turnout. This is presented as a reminder that knowledge commons are a socio-technical endeavor. A sound technical infrastructure may still lead to a waning community.
Second, an initial understanding of the user community is vital. The governance structure and the diversity of the community often go hand in hand. A diverse user base will engage with a knowledge commons at different levels of granularity. Senior researchers will have more background knowledge - and need less integrating information - than citizen scientists. A more agile governance structure may be needed for a knowledge commons comprised of broad and deep knowledge graphs. The rigidity of the aforementioned SPASE consortium may suit a community with a common baseline of background knowledge; yet, this governance model is unlikely to work for a community with varying levels of background knowledge in need of regular and rapid updates.
The American Geophysical Union (AGU) is an Earth and space science professional society based in the United States. The AGU publishes scientific journals, sponsors meetings, and supports education and outreach efforts to promote public understanding of science. Research conducted by AGU members ranges from the Earth's deep interior to the outer planets of our solar system. Despite the American in its name, roughly 40% of the AGU's membership comes from outside of the U.S. Each year, the AGU hosts a Fall Meeting that draws tens of thousands of participants. The AGU Fall Meetings, dating back to the 2000 meeting, have been modeled as a collection of knowledge graphs (Rozell et al., 2012; Narock et al., 2012; Narock et al., 2019).
We present the AGU knowledge graphs to highlight the life cycle component of a knowledge commons. The AGU knowledge graphs have been in operation for nearly ten years now. At the time of their initial development, they utilized what were considered the leading ontologies/taxonomies. Looking at the graphs today, one would conclude, rightly, that they are a bit outdated. Knowledge graph semantics and development methodologies have advanced greatly in the ensuing decade, with a few projects going by the wayside. All of this is to say that a knowledge commons will likely be a conglomerate of knowledge graphs and associated projects in varying stages of development. Not all constituent graphs will be designed in the same way. Semantic harmonization is challenging enough. A knowledge commons will likely be further challenged by the need to integrate components in various stages of their life cycle. Rather than build a single comprehensive knowledge graph, a collection of modular “cross-walks” and mappings may be a more effective long-term solution.
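One way to picture a modular cross-walk is as a small, separately maintained mapping table that translates terms from an older vocabulary into a newer one at integration time, leaving the source graph itself untouched. The sketch below is a minimal version of that idea. The `old:` and `new:` vocabularies, the specific term pairs, and the function names are invented for illustration and do not correspond to the actual AGU ontologies.

```python
# Hypothetical cross-walk from a legacy vocabulary to a newer one.
CROSSWALK_OLD_TO_NEW = {
    "old:Person": "new:Agent",
    "old:hasAuthor": "new:creator",
    "old:Abstract": "new:Presentation",
}


def translate(triple, crosswalk):
    """Rewrite each term that has a mapping; leave other terms untouched."""
    return tuple(crosswalk.get(term, term) for term in triple)


def translate_graph(triples, crosswalk, legacy_prefix="old:"):
    """Translate a list of triples and report legacy terms with no mapping."""
    translated = [translate(t, crosswalk) for t in triples]
    unmapped = sorted({
        term
        for t in triples
        for term in t
        if term.startswith(legacy_prefix) and term not in crosswalk
    })
    return translated, unmapped


if __name__ == "__main__":
    legacy = [
        ("ex:a1", "rdf:type", "old:Abstract"),
        ("ex:a1", "old:hasAuthor", "ex:p1"),
        ("ex:a1", "old:hasSession", "ex:s1"),
    ]
    new_triples, gaps = translate_graph(legacy, CROSSWALK_OLD_TO_NEW)
    print(new_triples)
    print("terms still needing a mapping:", gaps)
```

Reporting the unmapped terms, rather than silently passing them through, gives maintainers a running list of where the cross-walk has fallen behind a graph that is at a different stage of its life cycle.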
A few other engineering challenges also became apparent in the AGU effort. First is the issue of co-reference resolution. The Person class in the AGU knowledge graph predates large-scale open science efforts around persistent identifiers such as ORCID iDs. As a result, integration with other knowledge graphs is difficult. Ad hoc methods for relating people across graphs have been attempted (Narock et al., 2014; Narock et al., 2019); yet, all such integration efforts come with some degree of uncertainty. A knowledge commons - whose knowledge integration will be far more expansive than that of the AGU graphs - will encounter this to a much larger degree. A knowledge commons needs to embrace, and ultimately encode, uncertainty.
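The sketch below shows one simple way that uncertainty could be carried along with a co-reference link instead of being discarded: each proposed match between people in two graphs keeps its similarity score. It uses plain string similarity from Python's standard `difflib` module, which is a stand-in technique and not the method used in the cited AGU work. The identifiers, names, and the 0.8 threshold are likewise assumptions for illustration.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop periods and commas, and collapse whitespace."""
    cleaned = name.lower().replace(".", " ").replace(",", " ")
    return " ".join(cleaned.split())


def match_score(name_a, name_b):
    """Similarity in [0, 1] between two normalized name strings."""
    return SequenceMatcher(None, normalize(name_a), normalize(name_b)).ratio()


def link_people(graph_a, graph_b, threshold=0.8):
    """Propose (id_a, id_b, score) links; the score is kept, not thrown away.

    graph_a and graph_b map a person identifier to a name string.
    The threshold is an arbitrary illustrative cutoff.
    """
    links = []
    for id_a, name_a in graph_a.items():
        for id_b, name_b in graph_b.items():
            score = match_score(name_a, name_b)
            if score >= threshold:
                links.append((id_a, id_b, round(score, 2)))
    return links


if __name__ == "__main__":
    meeting_people = {"a:1": "Jane Q. Doe"}
    repository_people = {"b:9": "jane q doe", "b:10": "Robert Smith"}
    print(link_people(meeting_people, repository_people))
```

Downstream consumers can then decide for themselves how much confidence a given use requires, which is one modest reading of what it means for a commons to encode uncertainty. Name similarity alone will of course confuse distinct people with similar names, which is exactly why the score should travel with the link.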
A second challenge is the desire to download exports of the AGU graph for external integration projects. Many prominent knowledge graph projects, e.g. Wikidata, provide regular “dumps” of their knowledge graph for public download. While this can be useful in one-off projects, we caution against it in a fully formed knowledge commons. Copying graphs, or aggregating them to a central location, can be costly and difficult to maintain. We feel the path to success is a decentralized approach with some form of overview monitoring capability. As an example, Figure 2 highlights an approach taken in the aforementioned OceanLink project. A monitoring service was created to regularly ping members of the collective and report their availability. This service had the secondary benefit of informing new users as to what each member was contributing to the aggregated knowledge graph. We do acknowledge, however, the many challenges of using SPARQL (Klarman, 2017) as a means of integrating knowledge graphs.
Figure 2. An example of how a knowledge commons could be distributed and decentralized, yet maintain regular status updates. Example taken from Figure 1 of Narock et al., 2014.
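A monitoring service of the kind described above can be quite small. The following sketch checks each member's endpoint and reports both its availability and what it contributes. It is an illustration under stated assumptions and not the OceanLink service itself: the member names, the `example` URLs, the registry layout, and the function names are all hypothetical, and a plain HTTP request is used as the availability check.

```python
import urllib.request


def http_probe(url, timeout=5.0):
    """Return True if the endpoint answers with an HTTP status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError, and timeouts;
        # ValueError covers malformed URLs.
        return False


def check_members(members, probe=http_probe):
    """Build a status report for every member of the collective.

    members maps a member name to a dict with 'endpoint' and 'contributes'.
    The probe is injectable so the report logic can be exercised offline.
    """
    report = {}
    for name, info in members.items():
        report[name] = {
            "available": probe(info["endpoint"]),
            "contributes": info["contributes"],
        }
    return report


if __name__ == "__main__":
    # Hypothetical registry; these endpoints are placeholders, not real services.
    registry = {
        "RepositoryA": {"endpoint": "https://repo-a.example/sparql",
                        "contributes": "cruise and dataset metadata"},
        "SocietyB": {"endpoint": "https://society-b.example/sparql",
                     "contributes": "meeting abstracts"},
    }
    for member, status in check_members(registry).items():
        state = "up" if status["available"] else "down"
        print(f"{member}: {state} ({status['contributes']})")
```

Run on a schedule, a report like this keeps the data where it lives while still giving the community a single place to see which members are reachable and what each one adds to the aggregate.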
Many applications naturally fit a graph data model and knowledge graphs seem an ideal basis for a knowledge commons. Historical knowledge graph projects have shown numerous practical benefits, such as decreasing time and effort needed for information discovery (Narock and Fox, 2012), analyzing knowledge networks to form scientific hypotheses (Prabhu and Fox, 2020), and applying network science to gain insights on unknown patterns and connections (Narock et al., 2019).
However, a knowledge commons will confront multiple challenges, including: data access, maintaining and evolving ontologies, interoperability of ontologies, multi-level coordination and collaboration, and public engagement. While no previous project encapsulates the entirety of what is envisioned with the Earth and space science knowledge commons, we hope that the aforementioned projects highlight some of the key socio-technical challenges and lessons learned from them. We conclude with thoughts on community incentives and avoiding the digital version of the tragedy of the commons.
The notion of a tragedy of the commons originated in an essay written in 1833 by the British economist William Lloyd. The essay described the effects of unregulated grazing on common land in Great Britain and Ireland. Garrett Hardin popularized the notion in a 1968 Science article, which he famously titled "The Tragedy of the Commons" (Hardin, 1968). The essence of the idea is a situation in which individual users, who have open access to a resource unhampered by formal rules that govern access and use, will act independently according to their own self-interest. This action is contrary to the common good of all users and causes depletion of the resource. In recent years, the notion has been extended to the digital age (see for example: Jayaraman, 2012) to describe issues in cyberinfrastructure development in which a small percentage of the population creates and the rest consume.
Contributing to the development of a knowledge commons requires a lot of time and resources. A lot of users do not contribute consistently because of a lack of recognition of their work by the community members or lack of other tangible incentives that would drive them to contribute actively (Jayaraman, 2012). While this is difficult - if not impossible - to avoid completely, there are a few strategies from those already building knowledge communities that we can learn from. The first is in a form of governance known as double-loop governance (see Caron, 2021 and references therein for an excellent summary).
Double-loop governance has four primary characteristics (adapted from Caron, 2021):
1. Distributed (shared) participation and control;
2. Free and informed choices;
3. Public testing of evaluations; and,
4. An ability to manage conflict.
As Caron (2021) put it, “Members in an organization with double-loop governance have the ability to redirect, refocus, and recommit to the values and the vision of their organization. Double-loop governance creates actual peers for a peer-to-peer network. Membership is well-defined, and provided with responsibilities and rewards.” A key notion of double-loop organizations is that they are “do-ocracies” and value contributions. An exemplary organization utilizing this model is the Earth Science Information Partners (ESIP). ESIP has an overarching governance (i.e. a President); yet, governance does not come from the top down. ESIP operates under the notion of Clusters. A Cluster is an informal working group devoted to achieving a specific goal. Any ESIP member can create a Cluster and any ESIP member can join an existing Cluster. The goal of the President is not to provide top-down leadership, but rather to drive a strategic vision and better enable both intra- and inter-Cluster work. Clusters are the real driving force of the organization, and they come into and out of existence as work is completed and new needs arise. Cluster activities are highly focused and ESIP members self-select one or more Clusters based on interests and availability.
Perhaps a knowledge commons would benefit from such a governance structure? It provides a distributed, yet shared, form of participation. Members dictate what portions of the commons are important to them and align their participation accordingly. Critical mass around a common need dictates a new working group. Each of these distributed entities self-governs their actions while the second loop of governance is in place to facilitate large-scale integration and conflict management (for example, bringing two mature knowledge graphs together to form a knowledge network).
The tragedy of the commons is an oft-cited economic concept. We conclude with one additional bit of economic jargon. The ultimate goal of any knowledge commons, like open science (Caron, 2021b), should be to create anti-rivalrous sharing. An anti-rival good is one in which consumption by one person does not reduce the amount available for others and in fact increases its value: anti-rivalrous sharing produces items that gain value when shared. Contributors to a knowledge commons need to be incentivized by having their data/information/resource gain in value by being connected to other resources. Anti-rivalness typically benefits from network effects. As Weber (2004) nicely summarizes it, “Under conditions of anti-rivalness, as the size of the Internet-connected group increases, there is a heterogeneous distribution of motivations with people who have a high level of interest and some resources to invest”.
Numerous people have contributed to the projects mentioned in this essay, which in turn have shaped our thinking on knowledge graphs, knowledge networks, and ultimately a knowledge commons. We are indebted to each and every one of them. We are thankful for the project teams and collaborators who have contributed ideas over the years that form the basis of this essay. And we look forward to future collaborations around an Earth and space science knowledge commons.
Angles, R., Arenas, M., Barceló, P., Hogan, A., Reutter, J. L., and Vrgoc, D., 2017. Foundations of Modern Query Languages for Graph Databases. ACM Computing Surveys 50, 5 (2017), 68:1–68:40. https://doi.org/10.1145/3104031
Ausubel, J. H., 2019. A brief organizational history of the Deep Carbon Observatory. https://phe.rockefeller.edu/wp-content/uploads/2019/11/history-of-DCO3.pdf
Caron, B. R. (2021). Open Science needs Online Double-Loop Organizations. In OSH (1st ed.). https://doi.org/10.21428/8bbb7f85.a5023346
Caron, B. R. (2021b). Demand Sharing: a Real Sharing Economy for the Academy. In OSH (1st ed.). https://doi.org/10.21428/8bbb7f85.4f8b38c0
Cyganiak, R., Wood, D., and Lanthaler, M., 2014. RDF 1.1 Concepts and Abstract Syntax, W3C Recommendation 25 February 2014. World Wide Web Consortium. https://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/
Hardin, G. (1968). The Tragedy of the Commons. Science, 162(3859), 1243–1248. https://doi.org/10.1126/science.162.3859.1243
Harvey, C., Gangloff, M., King, T., Perry, C., Roberts, D., Thieman, J., 2008. Virtual observatories for space and solar physics research. Earth Science Informatics 1 (1), 5–13.
Hitzler, P., Krötzsch, M., Parsia, B., Patel-Schneider, P. F., & Rudolph, S. (2009). OWL 2 web ontology language primer. W3C Recommendation, 27(1), 123.
Hitzler, P. (2021). A review of the semantic web field. Communications of the ACM, 64(2), 76–83.
Jayaraman, K. (2012), Tragedy of the Commons in the Production of Digital Artifacts, International Journal of Innovation, Management and Technology, Vol. 3, No. 5, October 2012
Klarman, S., (2017), Querying DBpedia with GraphQL, Medium blogpost, https://medium.com/@sklarman/querying-linked-data-with-graphql-959e28aa8013
Lebo, T., Sahoo, S., McGuinness, D., Belhajjame, K., Cheney, J., Corsar, D., Garijo, D., Soiland-Reyes, S., Zednik, S., Zhao, J., (2013), PROV-O: The PROV Ontology, W3C Recommendation 30 April 2013
Ma, X., Chen, Y., Wang, H., Erickson, J. S., West, P., & Fox, P. (2014). Deep Carbon Virtual Observatory: A cyber-enabled platform for linked science. Proceedings of the SciDataCon2014, New Delhi, India, 2-5.
Ma, X., West, P., Zednik, S., Erickson, J., Eleish, A., Chen, Y., … Fox, P. (2017). Weaving a Knowledge Network for Deep Carbon Science. Frontiers in Earth Science, 5. https://doi.org/10.3389/feart.2017.00036
McGranaghan, R., Klein, S. J., Cameron, A., Young, E., Schonfeld, S., Higginson, A., Ringuette, R., Halford, A., Bard, C., Narock, A., and Thompson, B. (2021). The need for a Space Data Knowledge Commons. Structuring Collective Knowledge. Retrieved from https://knowledgestructure.pubpub.org/pub/space-knowledge-commons
Narock, T.W. and Fox, P., (2012), From Science to e-Science to Semantic e-Science: a Heliophysics Case Study, Computers & Geosciences, Volume 46, September, 2012, Pages 248-254
Narock, T., Rozell, E., and Robinson, E., (2012), Facilitating Collaboration Through Linked Open Data, Abstract ED44A-02 presented at 2012 Fall Meeting, AGU, San Francisco, Calif., 3-7 Dec.
Narock, T., Krisnadhi, A., Hitzler, P., Cheatham, M., Arko, R., Carbotte, S., Shepherd, A., Chandler, C., Raymond, L., Wiebe, P., Finin, T., (2014), The OceanLink Project, International Workshop on Challenges and Issues on Scholarly Big Data Discovery and Collaboration, 2014 IEEE International Conference on Big Data, 27 October 2014, Washington DC, USA.
Narock, T., Hasnain, S., and Stephan, R., (2019), Identifying and improving AGU collaborations using network analysis and scientometrics, Geosci. Commun., 2, 55–67, https://doi.org/10.5194/gc-2-55-2019
Noy, N., Gao, Y., Jain, A., Narayanan, A., Patterson, A., and Taylor, J. Industry-scale knowledge graphs: lessons and challenges. Commun. ACM 62, 8 (Aug. 2019), 36–43.
Prabhu, A. and Fox, P., (2020), Insights from Knowledge Graphs: Introducing a new formalism, Abstract IN026-04 presented at 2020 Fall Meeting, AGU, Virtual, 1-17 Dec.
Prabhu, A., Morrison, S. M., Eleish, A., Zhong, H., Huang, F., Golden, J. J., … Fox, P. (2021). Global earth mineral inventory: A data legacy. Geoscience Data Journal, 8(1), 74–89. https://doi.org/10.1002/gdj3.106
Rozell, E., Narock, T., and Robinson, E., (2012), Creating a Linked Data Hub in the Geosciences, Abstract IN51C-1696 presented at 2012 Fall Meeting, AGU, San Francisco, Calif., 3-7 Dec.
Szalay, A., Gray, J., 2001. The World-Wide Telescope. Science 293, 2037–2038 (14 September 2001).
Vrandečić, D. and Krötzsch, M., 2014. Wikidata: A Free Collaborative Knowledgebase. Communications of the ACM 57, 10 (2014), 78–85.
Weber, S. (2004), The Success of Open Source, Harvard University Press, ISBN 978-0-674-01292-9