About MPIWG Warm Data Storage

A central warm-storage system for managing, preserving, and providing controlled access to research data produced at the Max Planck Institute for the History of Science.

What is Warm Data Storage?

The MPIWG Warm Data Storage sits between active project storage and long-term cold archival. It provides:

  • Structured management of digital objects (images, audio, video, documents, datasets, transcripts, annotations)
  • Lifecycle tracking with auditable state transitions from creation through publication to archival
  • Classification-based access control to protect sensitive materials, especially oral history recordings involving human subjects
  • Consent management for documenting rights and permissions associated with research data
  • Integrity verification via SHA-256 checksums for all stored files

Core Entities

The system is organised around a hierarchy of entities that reflect how research data is produced and managed:

Project

A research project or institutional unit that produces digital objects. Each project has a status and a date range, and can be linked to Airtable for external tracking.

Statuses: Active · Completed · Archived

Collection

A logical grouping of digital objects within a project (e.g. "Interview recordings 2024", "Digitised manuscripts batch 3"). Types include general, digitisation batch, interview series, image collection, and dataset.

Digital Object

The central entity: a finished, reusable digital asset. It can be an image, an audio or video recording, a transcript, a dataset, an annotation, or a document. Each object has a lifecycle state, a classification level, and structured metadata described with the Archive Metadata Schema (AMS).

Types: Image · Audio · Video · Document · Dataset · Transcript · Annotation · Other

File Record

A physical file on disk belonging to a Digital Object. One object can have multiple files (original + derivatives). Each file record tracks the storage backend, path, MIME type, size, and SHA-256 checksum for integrity verification.
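As a minimal sketch of the integrity check described above, the following shows one common way to compute and verify a SHA-256 checksum using only the Python standard library. The function names `sha256_of_file` and `verify_file` are illustrative assumptions, not the system's actual API.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Chunked reads keep memory use flat for large audio/video files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path: Path, expected_sha256: str) -> bool:
    """Compare a freshly computed digest with the stored checksum."""
    return sha256_of_file(path) == expected_sha256.lower()
```

A file record would store the digest once at upload time; a later call to `verify_file` with the stored value reports whether the bytes on disk still match.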

Consent Record

Documents consent and rights associated with a Digital Object — critical for Oral History data involving human subjects. Types include Informed Consent, Release Form, Data Use Agreement, Copyright Clearance. Each record tracks its validity status (valid, expired, withdrawn, pending) and can link to scanned consent documents.

Entity Relationships

erDiagram
    Project ||--o{ Collection : has
    Project ||--o{ Object : owns
    Collection ||--o{ Object : groups
    Object ||--o{ File : has
    Object ||--o{ Consent : has
    Consent ||--o{ AuditEntry : logs
    Object ||--o{ Object : derives
    Object ||--o{ MetadataRecord : has
    MetadataRecord }o--|| MetadataSchema : uses

    Project {
        uuid id
        string name
        string status
        date start
        date end
    }
    Collection {
        uuid id
        string title
        string type
    }
    Object {
        uuid id
        string title
        string type
        string state
        string classification
    }
    MetadataRecord {
        uuid id
        json data
        string source_system
        datetime created
        datetime modified
    }
    MetadataSchema {
        int id
        string name
        string version
        json schema_definition
        bool is_active
    }
    File {
        uuid id
        string path
        string filename
        string mime
        bigint size
        string sha256
        bool primary
    }
    Consent {
        uuid id
        string type
        string status
        date obtained
        date expires
        uuid granted_by
        string notes
    }
    AuditEntry {
        uuid id
        string from_status
        string to_status
        uuid changed_by
        string notes
        datetime changed_at
    }
            

Digital Object Lifecycle

Every Digital Object moves through a defined lifecycle. Each transition is logged with the user who triggered it, a timestamp, and optional notes — creating a complete audit trail.

stateDiagram-v2
    direction LR
    [*] --> Draft
    Draft --> Ingested : ingest
    Ingested --> Reviewed : review
    Ingested --> Rejected : reject
    Reviewed --> Published : publish
    Reviewed --> Rejected : reject
    Reviewed --> Deleted : delete
    Published --> Archived : archive
    Published --> Reviewed : revise
    Published --> Withdrawn : withdraw
    Archived --> Published : recall
    Rejected --> Draft : resubmit
    Withdrawn --> Deleted : delete
    Deleted --> [*]
            

Lifecycle States

State | Description | Allowed Transitions
Draft | Newly created object, not yet submitted. The object is being prepared with metadata and files. | Ingested
Ingested | Data has been received into the storage system. Requires title, object type, and project assignment. | Reviewed, Rejected
Reviewed | Content has been reviewed and validated. Requires a classification level to be set. | Published, Rejected, Deleted
Published | Available to authorised institute members according to its classification level. Requires an AMS metadata record. | Archived, Reviewed, Withdrawn
Archived | Moved to long-term archive storage. Can be recalled to Published state if needed again. | Published
Rejected | Content was rejected during review. Can be sent back to Draft for corrections and resubmission. | Draft
Withdrawn | Removed from active access after publication (e.g. consent withdrawn, error discovered). | Deleted
Deleted | Permanently removed. This is a terminal state: no further transitions are possible. | None (terminal)
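The lifecycle above can be expressed as a small transition map. The sketch below is an in-memory illustration under stated assumptions: the class names `DigitalObject` and `TransitionLogEntry`, the lowercase state identifiers, and the `transition` helper are hypothetical and do not describe the system's actual models.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Allowed target states per current state, taken from the table above.
ALLOWED_TRANSITIONS: dict[str, set[str]] = {
    "draft": {"ingested"},
    "ingested": {"reviewed", "rejected"},
    "reviewed": {"published", "rejected", "deleted"},
    "published": {"archived", "reviewed", "withdrawn"},
    "archived": {"published"},
    "rejected": {"draft"},
    "withdrawn": {"deleted"},
    "deleted": set(),  # terminal state
}


@dataclass
class TransitionLogEntry:
    from_state: str
    to_state: str
    user: str
    notes: str
    timestamp: datetime


@dataclass
class DigitalObject:
    title: str
    state: str = "draft"
    history: list[TransitionLogEntry] = field(default_factory=list)


def transition(obj: DigitalObject, target: str, user: str, notes: str = "") -> None:
    """Move the object to `target` if allowed, and log who did it and when."""
    if target not in ALLOWED_TRANSITIONS[obj.state]:
        raise ValueError(f"Transition {obj.state} -> {target} is not allowed")
    obj.history.append(
        TransitionLogEntry(obj.state, target, user, notes, datetime.now(timezone.utc))
    )
    obj.state = target
```

Keeping the map as plain data makes it easy to compare against the state diagram, and every successful call leaves one log entry with user, timestamp, and notes.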

Transition Requirements

Certain transitions require specific metadata to be present before they are allowed:

Target State | Required Fields | Rationale
Ingested | title, object_type, project | Basic identification must be established before ingestion
Reviewed | classification | Access level must be determined before review is complete
Published | AMS record, valid consent | Content must be described via an AMS metadata record and have at least one valid consent record before publication
Archived | None | Published objects can be archived without additional requirements
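A requirement check of this kind could look like the sketch below. It treats an object as a plain dictionary; the keys `ams_record` and `valid_consent` and the function name `missing_requirements` are assumptions made for illustration only.

```python
# Required fields per target state, following the table above.
REQUIREMENTS: dict[str, tuple[str, ...]] = {
    "ingested": ("title", "object_type", "project"),
    "reviewed": ("classification",),
    "published": ("ams_record", "valid_consent"),
    "archived": (),
}


def missing_requirements(target_state: str, obj: dict) -> list[str]:
    """Return the required fields that are absent or empty for `target_state`."""
    return [name for name in REQUIREMENTS.get(target_state, ()) if not obj.get(name)]
```

An empty result means the transition may proceed; a non-empty result can be turned directly into an error message listing what is still missing.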

Consent Management

The consent subsystem tracks rights, permissions, and ethical approvals associated with digital objects. This is critical for Oral History data involving human subjects, where informed consent governs what can be stored, accessed, and published.

Every consent record follows a defined lifecycle and is linked to the digital object it governs. Changes to consent status are logged in an immutable audit trail.

Consent Lifecycle

stateDiagram-v2
    direction LR
    [*] --> Pending : created
    Pending --> Valid : approved
    Valid --> Expired : expiry passed
    Valid --> Withdrawn : revoked
    Expired --> Valid : renewed
    Withdrawn --> [*]
            

Consent Types & Statuses

Consent Types

Informed Consent

Consent from human subjects for participation in research, interviews, or recordings. Required for oral history materials.

Release Form

Authorisation to use, reproduce, or distribute specific materials (images, recordings, documents).

Data Use Agreement

Agreement governing how research data may be stored, processed, shared, or archived.

Copyright Clearance

Permission to use copyrighted material — manuscripts, images, published works, software.

Other

Any other consent or rights documentation that doesn't fit the categories above.

Consent Statuses

Status | Description | Effect on Object
Pending | Consent has been requested or recorded but not yet confirmed or verified. | Info banner shown. Cannot publish.
Valid | Active, verified consent. The object may be published and accessed according to its classification level. | Enables publishing.
Expired | The consent's expiry date has passed. The consent was once valid but needs renewal. | Warning banner shown. Cannot publish.
Withdrawn | Consent has been explicitly revoked by the grantor. This is a terminal status requiring immediate action. | Error banner. Published objects are auto-withdrawn.
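The consent lifecycle and the status effects above can be sketched as follows. The lowercase status identifiers and the helper names are hypothetical, and the rule that a valid consent counts as expired once its expiry date lies before today is an assumption about how "expiry passed" is evaluated.

```python
from datetime import date
from typing import Optional

# Allowed status changes, following the consent lifecycle diagram.
CONSENT_TRANSITIONS: dict[str, set[str]] = {
    "pending": {"valid"},
    "valid": {"expired", "withdrawn"},
    "expired": {"valid"},
    "withdrawn": set(),  # terminal status
}


def can_change_status(current: str, target: str) -> bool:
    """True if the lifecycle permits changing `current` to `target`."""
    return target in CONSENT_TRANSITIONS.get(current, set())


def effective_status(status: str, expires: Optional[date], today: date) -> str:
    """Treat a valid consent as expired once its expiry date has passed."""
    if status == "valid" and expires is not None and expires < today:
        return "expired"
    return status


def enables_publishing(status: str, expires: Optional[date], today: date) -> bool:
    """Only a currently valid consent enables publishing."""
    return effective_status(status, expires, today) == "valid"
```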

Consent & Lifecycle Enforcement

Consent is enforced at two critical points in the object lifecycle:

flowchart LR
    A["Reviewed\nObject"] -->|"publish"| B{"has_valid_consent?"}
    B -->|"Yes ✓"| C["Published\nInternal"]
    B -->|"No ✗"| D["❌ Transition\nBlocked"]
    C -->|"consent withdrawn"| E["Auto-withdraw"]
    E --> F["Withdrawn\nState"]

    style B fill:#fef3c7,stroke:#d97706,color:#92400e
    style C fill:#d1fae5,stroke:#059669,color:#065f46
    style D fill:#fee2e2,stroke:#dc2626,color:#991b1b
    style F fill:#fee2e2,stroke:#dc2626,color:#991b1b
            
Publish Gate

The reviewed → published_internal transition requires at least one consent record with status valid. Without valid consent, the transition is blocked with an error message explaining the requirement.

Auto-Withdrawal

When a consent record on a published object is changed to withdrawn, the system automatically transitions the object to the withdrawn lifecycle state and creates an audit note documenting the reason.
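The two enforcement points can be illustrated with a small in-memory sketch. `has_valid_consent` and `published_internal` appear in the text above; the classes `Consent` and `ArchiveObject` and the functions `publish` and `withdraw_consent` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Consent:
    status: str = "pending"


@dataclass
class ArchiveObject:
    state: str = "reviewed"
    consents: list[Consent] = field(default_factory=list)
    audit_notes: list[str] = field(default_factory=list)


def has_valid_consent(obj: ArchiveObject) -> bool:
    """True if at least one consent record has status 'valid'."""
    return any(consent.status == "valid" for consent in obj.consents)


def publish(obj: ArchiveObject) -> None:
    """Publish gate: block the transition unless valid consent exists."""
    if obj.state != "reviewed":
        raise ValueError("Only reviewed objects can be published")
    if not has_valid_consent(obj):
        raise PermissionError("Publishing requires at least one valid consent record")
    obj.state = "published_internal"


def withdraw_consent(obj: ArchiveObject, consent: Consent, reason: str) -> None:
    """Auto-withdrawal: withdrawing consent also withdraws a published object."""
    consent.status = "withdrawn"
    if obj.state == "published_internal":
        obj.state = "withdrawn"
        obj.audit_notes.append(f"Auto-withdrawn: {reason}")
```

Placing the auto-withdrawal inside the same function that changes the consent status means the object state and the consent status cannot drift apart.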

Consent Audit Trail

Every consent status change is recorded as an immutable ConsentAuditEntry. These entries cannot be edited or deleted, providing a complete history of consent decisions for compliance and governance:

Field | Description
from_status | Previous status (empty for initial creation)
to_status | New status after the change
changed_by | The user who made the change
notes | Reason or context for the status change
changed_at | Timestamp (automatically recorded)
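One way to picture an immutable audit entry is a frozen dataclass with the fields listed above. This is only an in-memory sketch: the `ConsentAuditTrail` container is a hypothetical helper, and the real system presumably enforces immutability at the database or model layer.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ConsentAuditEntry:
    from_status: str  # empty string for initial creation
    to_status: str
    changed_by: str
    notes: str = ""
    # Recorded automatically when the entry is created.
    changed_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class ConsentAuditTrail:
    """Append-only trail: entries can be added and read, never edited or removed."""

    def __init__(self) -> None:
        self._entries: tuple[ConsentAuditEntry, ...] = ()

    def record(self, from_status: str, to_status: str, changed_by: str,
               notes: str = "") -> ConsentAuditEntry:
        entry = ConsentAuditEntry(from_status, to_status, changed_by, notes)
        self._entries += (entry,)
        return entry

    @property
    def entries(self) -> tuple[ConsentAuditEntry, ...]:
        return self._entries
```

Assigning to a field of a recorded entry raises `dataclasses.FrozenInstanceError`, and the tuple returned by `entries` offers no way to delete items.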

Dashboard Monitoring

The dashboard includes a Consent Health card providing at-a-glance monitoring:

📊 Status Breakdown

Colour-coded counts of consent records by status — valid, pending, expired, withdrawn — for quick visual scanning.

⚠️ Missing Consent

Count of published objects that have no consent records at all, flagged as a warning that requires attention.

🔍 Object Detail

Each object detail page shows a consent summary badge in the header and warning banners for any consent issues.
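The figures on the Consent Health card amount to two simple aggregations. The sketch below assumes objects are represented as dictionaries with `state` and `consents` keys; this shape and the function name `consent_health` are illustrative, not the dashboard's actual query.

```python
from collections import Counter


def consent_health(objects: list[dict]) -> dict:
    """Summarise consent records by status and count published objects without any."""
    status_counts = Counter(
        consent["status"] for obj in objects for consent in obj["consents"]
    )
    published_without_consent = sum(
        1 for obj in objects if obj["state"] == "published" and not obj["consents"]
    )
    return {
        "status_counts": dict(status_counts),
        "published_without_consent": published_without_consent,
    }
```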

Dedicated Consent Views

Consent records have their own first-class UI accessible from the Consent link in the main navigation bar. These standalone views provide a cross-project overview of all consent activity:

📋 Consent List

A filterable, paginated list of all consent records across every project. Filter by status (valid, pending, expired, withdrawn), consent type, project, or search by object title and notes. Real-time AJAX filtering with debounced search input — no page reload needed.

🔎 Consent Detail

A rich detail view for each consent record showing the full consent information (type, status, dates, notes, structured details), its parent object context (title, state, classification, project, collection), the object's overall consent health summary, and the complete status change audit trail with timestamps and the user who made each change.

📝 Consent Form

An intuitive create/edit form with collapsible guidance on which consent type to choose, inline status meaning badges, a withdrawal warning when changing status to "withdrawn" on a published object, date pickers, and best-practices tips for proper consent documentation.

🗂️ Breadcrumb Navigation

All consent views include full breadcrumb trails (Dashboard → Consent Records → Project → Collection → Object → Consent) for easy navigation between the consent record, its parent object, and the project hierarchy.

Classification Levels

Every Digital Object has a classification level that controls who can access it. This is especially important for sensitive materials such as oral history interviews with living subjects.

Open

Freely accessible to all institute members. No special restrictions. Suitable for published research data, public documents, and openly licensed materials.

Internal

Institute members only, not for public distribution. This is the default level. Suitable for working data, internal reports, and preliminary research outputs.

Restricted

Named individuals or project members only. Used for sensitive interview material, personal data, and materials with limited consent scope.

Confidential

Requires explicit approval for each access. Highest sensitivity level for materials with strict consent conditions, embargoed content, or legally restricted data.
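The four levels can be read as an access decision along the following lines. Everything in this sketch is an assumption made for illustration: the classes `Viewer` and `ProtectedObject`, the `request_access` function, and in particular the modelling of "explicit approval for each access" as a one-time approval that is consumed when used.

```python
from dataclasses import dataclass, field


@dataclass
class Viewer:
    username: str
    is_institute_member: bool = True


@dataclass
class ProtectedObject:
    classification: str = "internal"  # the default level
    # Named individuals or project members allowed to see restricted material.
    authorised_users: set[str] = field(default_factory=set)
    # Users holding a one-time approval for a confidential object.
    approved_requests: set[str] = field(default_factory=set)


def request_access(viewer: Viewer, obj: ProtectedObject) -> bool:
    """Decide whether `viewer` may access `obj` under its classification level."""
    if not viewer.is_institute_member:
        return False
    if obj.classification in ("open", "internal"):
        return True
    if obj.classification == "restricted":
        return viewer.username in obj.authorised_users
    if obj.classification == "confidential":
        if viewer.username in obj.approved_requests:
            # Approval covers a single access, so it is consumed here.
            obj.approved_requests.discard(viewer.username)
            return True
        return False
    return False  # unknown levels are denied
```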

Typical Workflow

Here is the typical journey of a research asset through the system:

flowchart LR
    A["1. Create\nProject"] --> B["2. Add\nCollection"]
    B --> C["3. Create\nObject"]
    C --> D["4. Upload\nFiles"]
    D --> E["5. Ingest"]
    E --> F["6. Review"]
    F --> G["7. Record\nConsent"]
    G --> H["8. Publish"]
    H --> I["9. Archive"]

    style G fill:#fef3c7,stroke:#d97706,color:#92400e
            
1. Create a Project: Set up a project to represent the research initiative. Assign start/end dates and a description.
2. Organise with Collections: Optionally group objects into Collections within the project.
3. Create a Digital Object: Register a new object with a title, type, and classification. It starts in the Draft state.
4. Upload Files: Attach files to the object. Mark the primary file. SHA-256 checksums are computed automatically.
5. Ingest: Transition to Ingested. This confirms the data is in the system with basic identification.
6. Review: A curator reviews the object, checks metadata and classification, then transitions it to Reviewed.
7. Record Consent: Add at least one consent record (Informed Consent, Release Form, etc.) and ensure its status is Valid. This is required before the object can be published. See Consent Management for details on consent types, statuses, and audit trails.
8. Publish Internally: Once the AMS metadata record is complete and valid consent is in place, transition the object to Published. It is then accessible according to its classification. If consent is later withdrawn, the object is automatically moved to Withdrawn.
9. Archive: When no longer actively needed, move the object to Archived for long-term preservation. It can be recalled later.

Archive Metadata Schema (AMS)

The Archive Metadata Schema (AMS) is the institute's standard for describing digital assets across all storage tiers — from active project work (hot) through central storage (warm) to long-term archive (cold/HPSS). The schema is maintained in the archive-metadata-schema repository.

The AMS standard is integrated directly into this system. The schema is defined in a machine-readable YAML file (ams-standard.yaml) and used to:

  • Generate forms dynamically — form fields are built at runtime from the schema, not hard-coded
  • Validate metadata — against two compliance profiles (AMS minimal and MPIWG institute)
  • Import/export YAML — AMS-compliant YAML files can be imported or exported at any time
  • Enforce controlled vocabularies — fields like license, format, and access level use values defined in the standard
  • Track schema versions — multiple schema versions can coexist; the dashboard shows which records are on outdated versions

AMS Document Structure

Every AMS metadata record is a YAML document with four top-level sections:

metadata

Descriptive and administrative metadata, split into two sub-sections:

  • context: project title, investigators, description, duration, department
  • archive: license, access level, version, responsible person

files

A list of file entries, each describing a physical file: filename, format (MIME type), size, checksum, and optional actors (who created or contributed to the file).

content

Logical components of the dataset (e.g., "main", "supplement", "documentation"). Each component has a description and type, and maps to a Collection in the infrastructure.

keywords

Structured keyword entries with term, controlled vocabulary (e.g., AAT, LCSH), and optional URI for linked data compatibility.

Validation Profiles

AMS supports a two-tier validation system. Fields can be required at different levels depending on the profile:

Profile | Required Fields | Description
AMS | 9 fields | Minimal compliance: enough to identify and locate the dataset. Includes: title, description, investigators, license, access level, version, and at least one file.
MPIWG | 15 fields | Institute-level compliance: adds department, duration, responsible person, file checksums, content components, and keywords. Required for archival.
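Profile-based validation can be sketched as a lookup of required paths in the parsed YAML document. The field lists below only cover the names mentioned in the table, so they are not the complete 9 and 15 fields, and the dotted key names (for example `responsiblePerson`) are assumptions rather than the identifiers used in ams-standard.yaml.

```python
from typing import Any

# Illustrative subsets only; key names are assumed, not taken from the standard.
AMS_REQUIRED = [
    "metadata.context.title",
    "metadata.context.description",
    "metadata.context.investigators",
    "metadata.archive.license",
    "metadata.archive.accessLevel",
    "metadata.archive.version",
    "files",
]
MPIWG_EXTRA = [
    "metadata.context.department",
    "metadata.context.duration",
    "metadata.archive.responsiblePerson",
    "content",
    "keywords",
]
PROFILES = {"ams": AMS_REQUIRED, "mpiwg": AMS_REQUIRED + MPIWG_EXTRA}


def get_path(doc: dict, dotted: str) -> Any:
    """Follow a dotted path through nested dictionaries; None if absent."""
    node: Any = doc
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def missing_fields(doc: dict, profile: str) -> list[str]:
    """List required paths that are absent or empty under the given profile."""
    missing = [p for p in PROFILES[profile] if get_path(doc, p) in (None, "", [], {})]
    if profile == "mpiwg":
        # The institute profile additionally expects a checksum on every file.
        for index, entry in enumerate(doc.get("files") or []):
            if not entry.get("checksum"):
                missing.append(f"files[{index}].checksum")
    return missing
```

A document that passes the `ams` profile can still fail `mpiwg`, which matches the two-tier idea: the stricter profile only adds requirements.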

How AMS Maps to Infrastructure

AMS fields correspond to infrastructure entities. When importing or exporting, these mappings are applied automatically:

AMS Section | Infrastructure Entity | Key Mappings
metadata.context | Project | title → Project.name, description → Project.description, duration → start/end dates
metadata.archive | DigitalObject | accessLevel → classification, status → lifecycle_state
files[*] | FileRecord | filename, format → mime_type, size → size_bytes, checksum → checksum_sha256
content[*] | Collection | component → title, description, type → collection_type
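The mapping table can be read as a straightforward translation from a parsed AMS document to entity fields. The sketch below returns plain dictionaries instead of database objects; the function name `map_ams_to_entities` and the assumption that `duration` is a mapping with `start` and `end` keys are hypothetical.

```python
def map_ams_to_entities(doc: dict) -> dict:
    """Translate a parsed AMS document into entity-shaped dictionaries."""
    metadata = doc.get("metadata", {})
    context = metadata.get("context", {})
    archive = metadata.get("archive", {})
    duration = context.get("duration") or {}  # assumed shape: {"start": ..., "end": ...}
    return {
        "project": {
            "name": context.get("title"),
            "description": context.get("description"),
            "start": duration.get("start"),
            "end": duration.get("end"),
        },
        "digital_object": {
            "classification": archive.get("accessLevel"),
            "lifecycle_state": archive.get("status"),
        },
        "file_records": [
            {
                "filename": entry.get("filename"),
                "mime_type": entry.get("format"),
                "size_bytes": entry.get("size"),
                "checksum_sha256": entry.get("checksum"),
            }
            for entry in doc.get("files", [])
        ],
        "collections": [
            {
                "title": part.get("component"),
                "description": part.get("description"),
                "collection_type": part.get("type"),
            }
            for part in doc.get("content", [])
        ],
    }
```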

Using AMS in This System

There are four ways to work with AMS metadata:

📝 Web Form

AMS metadata is integrated into the Digital Object create and edit forms. Fields are generated from the schema with appropriate widgets for each field type.

📥 YAML Import

Upload an existing AMS YAML file from the Objects page. The importer creates all necessary infrastructure objects automatically.

📤 YAML Export

Export any metadata record as AMS-compliant YAML for transfer to other systems or cold storage. Available from any record's detail page or the project detail page.

📦 Load Schema

Upload an ams-standard.yaml file to load or update the schema definition used for form generation and validation.


Schema Versioning

The AMS standard evolves over time. When a new version of ams-standard.yaml is loaded, the system handles the transition safely:

📦 Version Coexistence

Each schema load creates a new MetadataSchema row. The new version is marked active; previous versions are deactivated. Existing records remain linked to the version they were created under — no silent data loss.

📊 Dashboard Health

The dashboard's AMS Schema Health card shows the active schema version, how many records sit on outdated versions, and a per-version breakdown with active/outdated badges.

⚠️ Outdated Warnings

When viewing a record whose schema version is no longer active, a warning banner appears at the top of the detail page. The source card also displays an active or outdated badge next to the schema version.

🔍 Version-Aware Validation

Validation can be run against the exact schema version a record was created under (using the stored schema_definition) rather than always validating against the latest version. This prevents false positives when schema fields change between versions.
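The versioning behaviour described in this section can be sketched in memory as follows. `schema_definition` and `is_active` are field names from the entity diagram; the classes `SchemaVersion`, `SchemaRegistry`, and `Record`, and the idea that a schema definition carries a `required` list, are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SchemaVersion:
    version: str
    schema_definition: dict
    is_active: bool = False


@dataclass
class SchemaRegistry:
    versions: list[SchemaVersion] = field(default_factory=list)

    def load(self, version: str, definition: dict) -> SchemaVersion:
        """Add a new version, mark it active, and deactivate all earlier ones."""
        for existing in self.versions:
            existing.is_active = False
        loaded = SchemaVersion(version, definition, is_active=True)
        self.versions.append(loaded)
        return loaded


@dataclass
class Record:
    data: dict
    schema: SchemaVersion  # the version the record was created under


def outdated_records(records: list[Record]) -> list[Record]:
    """Records whose schema version is no longer the active one."""
    return [record for record in records if not record.schema.is_active]


def validate(record: Record, schema: Optional[SchemaVersion] = None) -> list[str]:
    """Validate against the record's own version unless another is given."""
    target = schema or record.schema
    required = target.schema_definition.get("required", [])
    return [name for name in required if name not in record.data]
```

Validating a record against its own stored version returns no errors for fields that were only introduced later, which is the false-positive case the text describes.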

Technical Details

The system is built on:

  • Django 6.x — Python web framework with PostgreSQL (production) or SQLite (development)
  • REST API — Django REST Framework at /api/v1/ for programmatic access
  • AMS Integration — Dynamic form generation, YAML import/export, profile-based validation from ams-standard.yaml
  • HTMX — Dynamic form sections (add/remove file and content entries without page reload)
  • Pluggable Storage — Filesystem backend with support for future S3 or tape backends
  • Celery — Async task queue for background processing (checksum computation, metadata extraction)
  • DaisyUI + Tailwind CSS — Modern, accessible UI with the MPIWG brand colour theme
  • Docker — Multi-stage Dockerfile (dev/prod), PostgreSQL 16, Redis, Nginx reverse proxy