2025-08-06: Paper Summary: "ETD-MS v2.0: A Proposed Extended Standard for Metadata of Electronic Theses and Dissertations"

Our paper, “ETD-MS v2.0: A Proposed Extended Standard for Metadata of Electronic Theses and Dissertations,” was accepted at the 27th International Symposium on Electronic Theses and Dissertations (ETD 2024), held in Livingstone, Zambia. ETD 2024 welcomed contributions on a wide range of topics related to Electronic Theses and Dissertations (ETDs), including digital libraries, institutional repositories, graduate education and training, open access, and open science. The symposium brought together global researchers, practitioners, and educators dedicated to advancing the creation, curation, and accessibility of ETDs.

As the number of ETDs in digital repositories continues to grow, the need for a metadata standard that aligns with the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles becomes increasingly important. Dublin Core and ETD-MS v1.1 are widely used metadata schemas for scholarly documents and ETDs. However, we identified several gaps that limit their ability to fully represent the structure and content of ETDs. In particular, content-level metadata, such as the individual components or “objects” within an ETD, has become increasingly important. This level of detail is essential for supporting machine learning applications that extract scientific knowledge and for enabling large-scale scholarly data services.

In this paper, we present ETD-MS v2.0, an extended metadata schema developed to address these limitations. ETD-MS v2.0 provides a comprehensive description of ETDs by representing both document-level and content-level metadata. The proposed schema includes a Core Component building on the existing ETD-MS v1.1 schema, and an Extended Component that captures objects, their provenance, and user interactions for ETDs.

Motivation

The motivation for ETD-MS v2.0 arises from three major limitations observed in current metadata standards. First, existing metadata standards lack the metadata elements to describe access rights and ETD file formats in detail. For example, the dc.rights field in ETD-MS v1.1 offers only three preset values for access. The dc.format field assumes a single MIME type per ETD, which is inadequate for ETDs that include multiple file types. Second, current standards lack metadata elements for representing internal components of ETDs such as chapters, figures, and tables. In our schema, these are referred to as “objects,” and they often have rich attributes of their own that require structured representation. Third, existing schemas do not support metadata originating from sources outside the original ETD submission, such as those generated by human catalogers or AI models. The absence of provenance information for such metadata further limits its utility.

Schema Design

ETD-MS v2.0 is composed of two main components: the Core Component and the Extended Component.

Figure 1: Relationships among Entities in the Core and Extended Components of ETD-MS v2.0. Blue represents the Extended Component, and green represents the Core Component.

Core Component

The Core Component focuses on document-level metadata and was developed using a top-down approach. We analyzed 500 ETDs sampled from a collection of over 500,000 ETDs (Uddin et al., 2021) spanning various disciplines and publication years. The Core Component comprises 10 entities and 73 metadata fields.

Some key improvements include the transformation of dc.rights into a dedicated “Rights” entity, with attributes such as rights_type, rights_text, and rights_date. Another major addition is the “ETD_File” entity, which captures metadata related to multiple file types, file descriptions, generation methods, and checksums. We also introduced a new “References” entity, missing in earlier schemas, to capture structured metadata for cited works, including the fields reference_text, author, title, year, and venue.
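The three entities described above can be sketched as relational tables. This is a minimal illustration rather than the published schema: the Rights and References column names come from the text, while every ETD_File column, the ETDs.title column, and all id and foreign-key columns are assumptions made for this example. It uses Python's built-in sqlite3 module so that it runs without a server, whereas the implementation described later in this post used MySQL.

```python
import sqlite3

# Illustrative DDL only. Rights and References columns follow the post;
# ETD_File columns, ETDs.title, and every *_id column are assumptions.
SCHEMA = """
CREATE TABLE ETDs (
    etd_id INTEGER PRIMARY KEY,
    title  TEXT
);
CREATE TABLE Rights (
    rights_id   INTEGER PRIMARY KEY,
    etd_id      INTEGER NOT NULL REFERENCES ETDs(etd_id),
    rights_type TEXT,
    rights_text TEXT,
    rights_date TEXT
);
CREATE TABLE ETD_File (
    file_id           INTEGER PRIMARY KEY,
    etd_id            INTEGER NOT NULL REFERENCES ETDs(etd_id),
    mime_type         TEXT,
    description       TEXT,
    generation_method TEXT,
    checksum          TEXT
);
CREATE TABLE "References" (
    reference_id   INTEGER PRIMARY KEY,
    etd_id         INTEGER NOT NULL REFERENCES ETDs(etd_id),
    reference_text TEXT,
    author         TEXT,
    title          TEXT,
    year           INTEGER,
    venue          TEXT
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database holding one ETD with two files."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO ETDs (etd_id, title) VALUES (1, 'Example thesis')")
    conn.executemany(
        "INSERT INTO ETD_File "
        "(etd_id, mime_type, description, generation_method, checksum) "
        "VALUES (?, ?, ?, ?, ?)",
        [
            (1, "application/pdf", "Main text", "born-digital", "placeholder-1"),
            (1, "text/csv", "Supplementary data", "born-digital", "placeholder-2"),
        ],
    )
    conn.commit()
    return conn


def mime_types_for(conn: sqlite3.Connection, etd_id: int) -> list:
    """Return every MIME type recorded for one ETD, sorted alphabetically."""
    rows = conn.execute(
        "SELECT mime_type FROM ETD_File WHERE etd_id = ? ORDER BY mime_type",
        (etd_id,),
    )
    return [row[0] for row in rows]
```

Storing one row per file is what lets a single ETD carry several MIME types, which the single-valued dc.format field could not express.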

The Core Component entities are categorized into two types: those that describe the ETD itself, such as “ETDs,” “Rights,” “Subjects,” “ETD_classes,” and “ETD_topics,” and those that capture relationships between ETDs or collections of ETDs, such as “References,” “ETD-ETD_neighbors,” “Collections,” and “Collection_topics”.

Extended Component

Figure 2: Relationships among Entities in the Extended Components of ETD-MS v2.0. Blue represents Category E.1, red represents Category E.2, and orange represents Category E.3.

The Extended Component focuses on content-level metadata and was developed using a bootstrap approach. It introduces 18 entities with 87 metadata fields, grouped into three categories:

Category E.1: Describes individual components such as figures, tables, and sections, using entities such as “Objects,” “Object_metadata,” “Object_summaries,” “Object_classes,” and “Object_topics”.
Category E.2: Stores metadata about how content was generated or classified, using entities such as “Classifications,” “Classification_entries,” “Classifiers,” “Topic_models,” and “Summarizers”.
Category E.3: Captures metadata about user behaviors and preferences, using entities such as “Users,” “User_queries,” “User_queries_clicks,” “User_topics,” “User_classes,” and “User-user_neighbors”.
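To make the link between Category E.1 and Category E.2 concrete, here is a minimal sketch of an object whose class labels point back to the classifier that produced them. The class names loosely follow the entity names above, but every attribute (object_type, page, version, and so on) and the provenance helper are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Classifier:
    """Category E.2: the model or person that produced a label (fields assumed)."""
    classifier_id: int
    name: str
    version: str


@dataclass
class ObjectClass:
    """Category E.1: one label on an object, tied to its classifier (fields assumed)."""
    label: str
    classifier_id: int


@dataclass
class EtdObject:
    """Category E.1: a figure, table, or section inside an ETD (fields assumed)."""
    object_id: int
    etd_id: int
    object_type: str
    page: int
    classes: list = field(default_factory=list)


def provenance(obj: EtdObject, classifiers: dict) -> list:
    """Pair each label on the object with the name of the classifier behind it."""
    return [(c.label, classifiers[c.classifier_id].name) for c in obj.classes]
```

For example, a figure labeled "bar chart" by a classifier registered under id 7 would yield the pair of that label and that classifier's name, so a consumer can tell machine-generated labels from human-assigned ones.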

Implementation

To evaluate the feasibility of ETD-MS v2.0, we implemented the schema in a MySQL database and populated it with data from a separate collection of 1,000 ETDs (distinct from the 500 ETDs used for schema development). These ETDs, sourced from 50 U.S. universities and published between 2005 and 2019, were used to simulate real-world metadata extraction. We used OAI-PMH APIs and HTML scraping to gather document-level metadata, and employed PyMuPDF and Pytesseract for text extraction from born-digital and scanned ETDs, respectively. We developed a GPT-3.5-based prompt to classify ETDs using the ProQuest subject taxonomy, and applied summarization models such as T5-Small and Pegasus to generate chapter and object summaries. For topic modeling, we used LDA, LDA2Vec, and BERT, while CNNs and YOLOv7 were used to detect and classify visual elements such as figures and tables. The user interaction entities were populated with dummy data. The full process of extracting, processing, and inserting metadata for all 1,000 ETDs was completed in approximately 11 minutes on a virtual machine with 32 CPU cores and 125 GB RAM, demonstrating the scalability of our approach.
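The OAI-PMH harvesting step can be sketched with the standard library alone. This is an illustration under stated assumptions, not the pipeline used in the paper: the repository URL in the usage note is hypothetical, SAMPLE_RESPONSE is a hand-written stand-in for a real ListRecords response, and the helper names are invented for this example. The verb, the oai_dc metadata prefix, and the two XML namespaces are the ones defined by OAI-PMH and Dublin Core.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hand-written sample standing in for a real repository response (assumption).
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.edu:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example Thesis</dc:title>
          <dc:format>application/pdf</dc:format>
          <dc:format>text/csv</dc:format>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def parse_records(xml_text: str) -> list:
    """Return one dict per record, mapping each Dublin Core element to its values."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter(f"{{{OAI_NS}}}record"):
        fields = {}
        for element in record.iter():
            # Keep only elements in the Dublin Core namespace.
            if element.tag.startswith(f"{{{DC_NS}}}"):
                name = element.tag.split("}", 1)[1]
                fields.setdefault(name, []).append((element.text or "").strip())
        records.append(fields)
    return records
```

Calling list_records_url("https://repository.example.edu/oai") produces the request URL for a hypothetical endpoint, and parse_records(SAMPLE_RESPONSE) returns a single record whose format list holds both MIME types. Keeping every value as a list preserves repeated elements such as dc:format, which matters when they are later mapped onto a multi-file entity.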

Interoperability and Mapping

To ensure interoperability and mitigate schema adoption challenges, we created a detailed mapping between ETD-MS v2.0 and the existing standards Dublin Core and ETD-MS v1.1. For example, the new field ETDs.owner_and_statement aligns with dc.rights, and ETDs.discipline maps to thesis.degree.discipline in ETD-MS v1.1. In some cases, our schema introduces new metadata fields with no equivalents in older standards, such as the detailed “References,” “ETD_File,” and “Object_metadata” entities.
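A crosswalk of this kind can be represented as a simple lookup table. The sketch below includes only the two field mappings named above; the full mapping lives in the project's GitHub repository. The function name, the record layout, and the sample values are assumptions made for illustration.

```python
# Only the two mappings mentioned in the post; the complete crosswalk is larger.
CROSSWALK = {
    "ETDs.owner_and_statement": "dc.rights",
    "ETDs.discipline": "thesis.degree.discipline",
}


def to_legacy(record: dict) -> tuple:
    """Split a v2.0 record into legacy-mapped fields and fields with no equivalent.

    Returns (mapped, unmapped): mapped is keyed by the older field name,
    unmapped lists the v2.0 field names that have no entry in CROSSWALK.
    """
    mapped = {}
    unmapped = []
    for field_name, value in record.items():
        target = CROSSWALK.get(field_name)
        if target is None:
            unmapped.append(field_name)
        else:
            mapped[target] = value
    return mapped, unmapped
```

Returning the unmapped field names separately, rather than silently dropping them, makes it visible which parts of a record (for example, fields from the “References” or “ETD_File” entities) would be lost when exporting to an older standard.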

Limitations and Future Work

The current version of the schema was developed using a sample of 500 ETDs, so it may not fully capture metadata found in ETDs outside that sample. For example, some ETDs contain multiple date fields, such as submission date and public release date, or include metadata such as a “peer reviewed” status. These elements are not represented in our current schema.

We view ETD-MS v2.0 as an evolving framework. In the future, we will refine the schema by including additional metadata elements. We will also collect feedback from ETD repository managers, librarians, and other stakeholders.

Conclusion

ETD-MS v2.0 is a comprehensive and extensible metadata schema developed to align ETD metadata with the FAIR principles. Our proposed schema extends existing standards by providing a more complete and detailed description and integrating content-level metadata. The proposed ETD-MS v2.0 schema, along with its mappings to both Dublin Core and ETD-MS v1.1, is available at the following GitHub link: https://github.com/lamps-lab/ETDMiner/tree/master/ETD-MS-v2.0.

References

Salsabil, L., Wu, J., Ingram, W. A., & Fox, E. (2024). ETD-MS v2.0: A Proposed Extended Standard for Metadata of Electronic Theses and Dissertations. In Proceedings of the 27th International Symposium on Electronic Theses and Dissertations (ETD 2024).

Uddin, S., Banerjee, B., Wu, J., Ingram, W. A., & Fox, E. A. (2021, December). Building A large collection of multi-domain electronic theses and dissertations. In 2021 IEEE International Conference on Big Data (Big Data) (pp. 6043-6045). IEEE. https://doi.org/10.1109/BigData52589.2021.9672058

-- Lamia Salsabil (@liya_lamia)
