Information for Researchers, No. 28 | April 3, 2025

Data Corpora for Artificial Intelligence

The DFG is supporting the development of data corpora for training artificial intelligence (AI) 

The DFG’s Committee on Scientific Library Services and Information Systems (AWBI) is responding to needs identified by the research community in connection with the Call for Ideas to Support AI in Research through Information Infrastructures. This call invites proposals for projects aimed at preparing and providing data as a foundation for the development or advancement of AI in research.

Background

Methods used in the context of artificial intelligence – such as machine learning (ML) and text and data mining (TDM) – are becoming increasingly relevant in many areas of digital research practice and in the provision of scientific information. Such methods are used, for instance, to analyse and process large volumes of data and for language processing and generation. The development and use of these methods rely on a wide range of multimodal data types, such as factual data, measurement data, behavioural data, monolingual and multilingual texts, image data, synthetic data, digitisation and indexing data. The quality and availability of these data types can vary significantly, and their aggregation and cleansing often require considerable effort. As such, there is a need within the research community for systematically prepared, curated, annotated and aggregated data corpora.

Objectives of the Call for Proposals

This call aims to support the establishment and development of high-quality, extensive data corpora that can serve as a robust and reliable foundation for the development/advancement and application of AI methods in research. The future use of methods and applications based on the funded corpora may take place both in research and within scientific information infrastructures. The quality, scope and composition of the corpora must be tailored to the relevant needs and must enable research beyond individual projects or locations, or contribute to improving the provision of scientific information. The data corpora should be provided in accordance with established principles and standards (such as FAIR and CARE) and via existing information infrastructures, in particular the National Research Data Infrastructure (NFDI).

Scope of Funding

The purpose of funding is to support the compilation and extension of data corpora for AI. Projects may also include the following aspects:

  • conception of selection and quality criteria, implementation of quality assurance measures
  • reuse, adaptation and in particular application of data cleansing, aggregation, annotation, curation and harmonisation procedures
  • provision of the compiled or extended data corpus through existing scientific information infrastructures

Proposal Submission

Proposals must be submitted in English so as to enable international review. Proposals must be structured in accordance with the Proposal Preparation Instructions – Project Proposals in the Area of Scientific Library Services and Information Systems (DFG form 12.01). In addition, applicants must consider the specific details of this call and the Guidelines and Supplementary Instructions (DFG form 12.14) relating to the “Information Infrastructures for Research Data” funding programme. The following instructions correspond to the LIS Proposal Preparation Instructions.

Project Description (LIS Proposal Preparation Instructions, Part B)

Please explicitly address the following aspects in your project description:

  • Explain in detail how and why the planned corpus will support the development/advancement and/or application of AI methods in research or within scientific information infrastructures. What existing data resources are already available in this area? What potential lies in aggregating and systematically preparing these data into a curated corpus?
  • Demonstrate the need for the proposed corpus. Describe one or more use cases to illustrate the research questions or improvements in AI-based information provision that will be enabled through the planned data corpus. What key tasks and activities are necessary to build the corpus? Describe how you will ensure that the corpus is appropriately prepared for its intended use.
  • Explain in detail the composition of the corpus and justify your selection: Which data are to be selected? From which sources will these data be collated? Where relevant, address the issue of bias and how you plan to handle it.
  • Outline the criteria you will use to assess data quality. Explain how these criteria align with commonly accepted standards. To what extent will FAIR and CARE principles be applied?
  • Specify the desired quality and format of the data. Describe the initial quality of the data and the targeted standard after preparation. What methods and measures will be used to achieve the targeted quality?
  • Provide details of how the corpus will remain accessible to researchers in the long term. If updates are expected after the project ends (e.g. due to planned versioning), explain how ongoing curation will be ensured.
  • Where available, the corpus should be made accessible via relevant subject-specific infrastructures, such as NFDI consortia or specialised information services (FID). In which structure (centralised or decentralised) will the corpus be made available on completion of the project? How will the corpus be technically and/or organisationally integrated into existing information infrastructures? Wherever possible and appropriate, cooperation with partners should be arranged.

Measures to Meet Funding Requirements and Handle Project Results (LIS Proposal Preparation Instructions Part B, 4.3)

  • The compiled data corpus is to be published under a licence that allows free use for research purposes. The chosen licence must be specified in the project description.
  • Availability of and access to the corpus should be as open as possible for scientific users. If open access cannot be granted, this must be explained in detail in the proposal. In general, access modalities for users must be clearly described.

Please confirm that the following actions will be taken:

  • Key interim results will be published after the first year of the project.
  • The data corpus will be made findable in both disciplinary and cross-disciplinary directories, registries and the like.
  • The corpus will be documented according to recognised quality standards.
  • Please also confirm explicitly that no duplicate funding is involved and that the corpus is not already being compiled under another project.

Attachments (LIS Proposal Preparation Instructions Part C)

Up to three letters of intent may be included from researchers based in Germany. These should demonstrate that the planned data corpus is relevant to research in a broad variety of ways. This can relate to both individual disciplines and multiple subject areas. If the purpose of the corpus is to support information provision using AI, letters of intent from infrastructure facilities or scientific users of the target application may also be included.

Applicants are encouraged to ensure that both researchers working in the subject area and infrastructure specialists are involved in their projects. If a researcher is not named as an applicant, the involvement of researchers on an advisory basis is recommended.

Requested Funds

Under this call, applicants may request funding to cover staff, direct project costs and project-specific workshops according to the Proposal Preparation Instructions – Project Proposals in the Area of Scientific Library Services and Information Systems (DFG form 12.01) up to a maximum of €400,000. It is not possible to request funding for investments. The total funding available under this call is up to €8 million.

Duration

Projects may run for a maximum of two years.

Deadline

Your proposal must be submitted to the DFG in English by 30 July 2025. Proposals are submitted exclusively via the elan portal, which records proposal-related data and transmits documents securely.

If this is the first time you are submitting a proposal to the DFG, please note that you must register in the elan portal before you can submit your proposal. You must do so by 23 July 2025. You will normally receive confirmation of your registration by the next working day.

An informal letter of intent is requested by 28 May 2025. Please use the form linked under “Further Information” for this purpose.

All selected project owners will be required to attend a joint kick-off workshop in the first half of 2026, which will support networking and dialogue between the funded projects.

Further Information 

A virtual information event will be held from 10:00 to 11:30 on 7 May 2025.

Please submit your informal letter of intent here (external link).

For details of the review criteria in the area of LIS, see DFG form 10.213.

Please use the elan portal to submit your proposal, and refer to DFG form 12.01 and DFG form 12.14.

Contact Persons at the DFG Head Office,
Scientific Library Services and Information Systems

Programme contacts: 

Dr. Stefanie Mewes, phone +49 228 885-2218

Dr. Matthias Katerbow, phone +49 228 885-2358

Administrative contact: 

Clara Grau, phone +49 228 885-2473

Privacy Policy

We, the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), take the protection of your personal data and its confidential treatment extremely seriously. Therefore, please refer to the DFG’s Privacy Policy. If you intend to transmit personal data of third parties, please make sure to do so only if the necessary legitimation under data protection law exists. Before transmitting data of third parties to the DFG, please forward the DFG’s Data Protection Notice to the individuals affected (data subjects). If there is a legitimate interest in not informing individuals beforehand (e.g. for reasons of secrecy or in the case of a nomination or candidate proposal), these individuals should be informed no later than at the time of publication.