(02/28/22) The AAAS annual meeting, originally planned for Philadelphia, was once again held virtually due to the coronavirus pandemic. In addition to eliminating the need for travel, this offered the advantage that the contributions to the scientific panels were pre-produced as so-called "spotlight videos" of 25 minutes each and made available on the internet. As a result, the full 45-minute session scheduled in the program was available on conference days for discussing the contributions. Organized by the DFG's North America office, the panel was given a very good program slot on late Saturday morning -- clearly an expression of the AAAS's strong interest in infrastructure topics relevant to academic research.
In his introductory remarks, Johannes Fournier emphasized that datasets are an increasingly important part of this infrastructure. They are accumulated in academic research, private industry, and public administration, and -- once they are findable and available for systematic use -- can be of considerable value to all three sectors. For this to happen, said Fournier, metadata are needed, i.e., data that describe what is contained in the datasets. In 2016, the "FAIR Guiding Principles for Scientific Data Management and Stewardship" were agreed upon and published in the Nature journal Scientific Data. These are essential guidelines on how datasets and their metadata should be made "Findable, Accessible, Interoperable, and Re-usable (FAIR)" in order to enable significantly more efficient data use, also internationally and across sectors. In the first round of questions, Fournier discussed the incentives for various players to follow the FAIR principles, which are subject neither to the force of law nor to the threat of sanctions. The most important incentive, according to York Sure-Vetter, is learning from others in the up to 30 "communities of practice" being created in the National Research Data Infrastructure (NFDI).
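Fournier's notion of metadata -- data describing what a dataset contains -- can be illustrated with a minimal machine-readable record. The schema.org "Dataset" vocabulary in JSON-LD is one common way such FAIR-style metadata are published on the web; the sketch below uses that vocabulary, and all concrete values (identifier, URLs, names) are invented placeholders, not taken from the panel.

```python
import json

# A minimal, hypothetical FAIR-style metadata record using the
# schema.org "Dataset" vocabulary (JSON-LD). All values are placeholders.
record = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    # Findable: a globally unique, persistent identifier
    "identifier": "https://doi.org/10.0000/example-dataset",
    "name": "Example survey dataset",
    "description": "Placeholder description of what the dataset contains.",
    # Accessible: a defined retrieval route -- which, as Noy notes,
    # need not mean free of charge
    "distribution": {
        "@type": "DataDownload",
        "contentUrl": "https://example.org/data.csv",
        "encodingFormat": "text/csv",
    },
    # Interoperable: a shared vocabulary (schema.org) and open format;
    # Reusable: an explicit licence stating the terms of reuse
    "license": "https://creativecommons.org/licenses/by/4.0/",
}

print(json.dumps(record, indent=2))
```

Because such records use a shared vocabulary, generic tools -- crawlers, dataset search engines, repository software -- can index them without knowing anything about the discipline that produced the data.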
In the day-to-day work of the 19 consortia already established, said Fournier, it became apparent that the various disciplines had very different experiences and skills in dealing with data and (where applicable) data protection, and their dialogue about this led to a significantly improved overall "data literacy" among all participants. For a company like Google, according to Natasha Noy, the FAIR Principles are very welcome for a number of reasons -- for example, with regard to recruiting new employees from among the users of findable datasets. She also emphasized that "accessible" only means that the datasets are accessible in principle, not that they are accessible free of charge, and that this is therefore not really a problem for private industry (even if the pharmaceutical industry, for example, is unfortunately very reluctant to share its research data and prefers to let the competition run into dead ends). In conclusion, Robert Hanisch from the National Institute of Standards and Technology (NIST) emphasized that the FAIR Principles had freed a regulatory institution like the NIST, which reports directly to the US Department of Commerce, from some of the pressure to act; it was good to hear him say: "Let us avoid making too many rules!"
The second round of questions dealt with the technical difficulties of making datasets and their metadata interoperable, given that they often stem from the different "data cultures" of different communities. Hanisch cited the example of his own academic community, astronomy, where the task is comparatively simple and the datasets are indexed according to celestial coordinates. In the consortia supported by the NFDI, Sure-Vetter noted, the problem is more complex but still solvable. For him, the key term here is "federation," i.e., access to very disparate datasets, especially for interdisciplinary research questions. This could be achieved by merging and homogenizing a wide variety of datasets, or -- more simply, at least in the eyes of a computer scientist -- by using cleverly designed tools for data analysis. According to Noy, while industry has largely set the standards for data collections and their descriptive metadata to date, it has been unspecific regarding their use, pursuing no concrete goals and certainly none that run counter to use outside industry.
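Sure-Vetter's contrast between physically merging datasets and using clever tools can be sketched in a few lines: instead of homogenizing the sources, a thin federation layer translates each source's local field names into a shared schema at query time. The sources, field names, and mappings below are invented for illustration only.

```python
# Toy "federation" sketch: two datasets from different data cultures,
# each with its own field names (all values invented for illustration).
ASTRO = [{"ra": 10.68, "dec": 41.27, "mag": 3.4}]
CLIMATE = [{"lat": 52.52, "lon": 13.40, "temp_c": 9.5}]

# Per-source mappings from local field names to a shared vocabulary.
MAPPINGS = {
    "astro":   {"ra": "x_coord", "dec": "y_coord", "mag": "value"},
    "climate": {"lon": "x_coord", "lat": "y_coord", "temp_c": "value"},
}

def federated_query(sources):
    """Yield records from all sources, translated to the shared schema.

    The original datasets stay untouched; only the query layer knows
    how to map each source onto the common field names.
    """
    for name, records in sources.items():
        mapping = MAPPINGS[name]
        for rec in records:
            translated = {mapping[k]: v for k, v in rec.items()}
            translated["source"] = name
            yield translated

rows = list(federated_query({"astro": ASTRO, "climate": CLIMATE}))
for row in rows:
    print(row)
```

The design choice mirrors the panel's point: the heavy lifting happens in a small, reusable tool rather than in a one-off homogenization of the data, so new sources can join the federation by supplying only a mapping.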
A third round was devoted to the question of the quality of datasets and their metadata, an aspect which, according to Fournier, has a substantial impact on the reproducibility of research results and also on the (machine) interoperability of datasets. On this point, the panelists agreed that the best way to enforce the necessary discipline was to use the "carrots and sticks" method already available to research funding organizations and scientific publishers. This does not require further sanctioning bodies and certainly not a "data quality police," as Hanisch put it.
Finally, Fournier asked the three panelists for a "take-away message" or motto under which they would like to see the national projects on research data infrastructures and their international networking develop. For Noy, this was the urgent need for some kind of peer review process to ensure the quality of datasets and their metadata. Sure-Vetter did not want to exclude the need for a legal framework, as research data and its use are a common good with a corresponding need for regulation. Hanisch did not see a need for regulation to the same extent, remarking that he trusts in the common sense and discretion of the diverse users of datasets and data infrastructures, whose international networking holds enormous potential.
To help leverage this huge potential, Fournier concluded, research funding institutions, the public sector, private industry, and the scientific communities are all called upon to build up data infrastructures based on the FAIR Principles, to network them, and to make shared use of them.