About MPIWG Warm Data Storage
A central warm-storage system for managing, preserving, and providing controlled access to research data produced at the Max Planck Institute for the History of Science.
What is Warm Data Storage?
The MPIWG Warm Data Storage sits between active project storage and long-term cold archival. It provides:
- Structured management of digital objects (images, audio, video, documents, datasets, transcripts, annotations)
- Lifecycle tracking with auditable state transitions from creation through publication to archival
- Classification-based access control to protect sensitive materials, especially oral history recordings involving human subjects
- Consent management for documenting rights and permissions associated with research data
- Integrity verification via SHA-256 checksums for all stored files
Core Entities
The system is organised around a hierarchy of entities that reflect how research data is produced and managed:
Project
A research project or institutional unit that produces digital objects. Projects have a status and a date range, and can be linked to Airtable for external tracking.
Collection
A logical grouping of digital objects within a project (e.g. "Interview recordings 2024", "Digitised manuscripts batch 3"). Types include general, digitisation batch, interview series, image collection, dataset.
Digital Object
The central entity — a finished, reusable digital asset. Can be an image, audio/video recording, transcript, dataset, annotation, or document. Each object has a lifecycle state, a classification level, and structured metadata via the Archive Metadata Schema (AMS).
File Record
A physical file on disk belonging to a Digital Object. One object can have multiple files (original + derivatives). Each file record tracks the storage backend, path, MIME type, size, and SHA-256 checksum for integrity verification.
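The checksum stored on each File Record can be computed and re-verified with the Python standard library. The following is a minimal sketch, assuming files are readable from a local filesystem path; the function names `sha256_of_file` and `verify_file` are illustrative and not taken from the system's code.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large recordings are never loaded fully into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path: Path, expected_sha256: str) -> bool:
    """Compare a stored checksum (hypothetical input) with a freshly computed one."""
    return sha256_of_file(path) == expected_sha256.lower()
```

Calling `verify_file(path, stored_checksum)` returns `False` when the file on disk no longer matches the recorded digest, which is the condition an integrity check would report.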
Consent Record
Documents consent and rights associated with a Digital Object — critical for Oral History data involving human subjects. Types include Informed Consent, Release Form, Data Use Agreement, Copyright Clearance. Each record tracks its validity status (valid, expired, withdrawn, pending) and can link to scanned consent documents.
Entity Relationships
erDiagram
Project ||--o{ Collection : has
Project ||--o{ Object : owns
Collection ||--o{ Object : groups
Object ||--o{ File : has
Object ||--o{ Consent : has
Consent ||--o{ AuditEntry : logs
Object ||--o{ Object : derives
Object ||--o{ MetadataRecord : has
MetadataRecord }o--|| MetadataSchema : uses
Project {
uuid id
string name
string status
date start
date end
}
Collection {
uuid id
string title
string type
}
Object {
uuid id
string title
string type
string state
string classification
}
MetadataRecord {
uuid id
json data
string source_system
datetime created
datetime modified
}
MetadataSchema {
int id
string name
string version
json schema_definition
bool is_active
}
File {
uuid id
string path
string filename
string mime
bigint size
string sha256
bool primary
}
Consent {
uuid id
string type
string status
date obtained
date expires
uuid granted_by
string notes
}
AuditEntry {
uuid id
string from_status
string to_status
uuid changed_by
string notes
datetime changed_at
}
Digital Object Lifecycle
Every Digital Object moves through a defined lifecycle. Each transition is logged with the user who triggered it, a timestamp, and optional notes — creating a complete audit trail.
stateDiagram-v2
direction LR
[*] --> Draft
Draft --> Ingested : ingest
Ingested --> Reviewed : review
Ingested --> Rejected : reject
Reviewed --> Published : publish
Reviewed --> Rejected : reject
Reviewed --> Deleted : delete
Published --> Archived : archive
Published --> Reviewed : revise
Published --> Withdrawn : withdraw
Archived --> Published : recall
Rejected --> Draft : resubmit
Withdrawn --> Deleted : delete
Deleted --> [*]
Lifecycle States
| State | Description | Allowed Transitions |
|---|---|---|
| Draft | Newly created object, not yet submitted. The object is being prepared with metadata and files. | Ingested |
| Ingested | Data has been received into the storage system. Requires title, object type, and project assignment. | Reviewed, Rejected |
| Reviewed | Content has been reviewed and validated. Requires a classification level to be set. | Published, Rejected, Deleted |
| Published | Available to authorised institute members according to its classification level. Requires an AMS metadata record. | Archived, Reviewed, Withdrawn |
| Archived | Moved to long-term archive storage. Can be recalled to Published state if needed again. | Published |
| Rejected | Content was rejected during review. Can be sent back to Draft for corrections and resubmission. | Draft |
| Withdrawn | Removed from active access after publication (e.g. consent withdrawn, error discovered). | Deleted |
| Deleted | Permanently removed. This is a terminal state — no further transitions are possible. | None (terminal) |
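The table above can be expressed as a lookup from each state to its allowed target states. This is a minimal sketch, not the system's implementation; the lowercase state identifiers and the names `ALLOWED_TRANSITIONS`, `can_transition`, and `is_terminal` are assumptions made for illustration.

```python
# Allowed transitions, copied from the Lifecycle States table.
# The lowercase identifiers are an assumption; the real system may name states differently.
ALLOWED_TRANSITIONS = {
    "draft": {"ingested"},
    "ingested": {"reviewed", "rejected"},
    "reviewed": {"published", "rejected", "deleted"},
    "published": {"archived", "reviewed", "withdrawn"},
    "archived": {"published"},
    "rejected": {"draft"},
    "withdrawn": {"deleted"},
    "deleted": set(),  # terminal state
}


def can_transition(current: str, target: str) -> bool:
    """Return True if the table permits moving from current to target."""
    return target in ALLOWED_TRANSITIONS.get(current, set())


def is_terminal(state: str) -> bool:
    """A state is terminal when it is known and has no outgoing transitions."""
    return state in ALLOWED_TRANSITIONS and not ALLOWED_TRANSITIONS[state]
```

Keeping the rules in one mapping means the diagram, the table, and the check share a single source of truth: `can_transition("draft", "published")` is `False` because a draft must be ingested and reviewed first.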
Transition Requirements
Certain transitions require specific metadata to be present before they are allowed:
| Target State | Required Fields | Rationale |
|---|---|---|
| Ingested | title, object_type, project | Basic identification must be established before ingestion |
| Reviewed | classification | Access level must be determined before review is complete |
| Published | AMS record, valid consent | Content must be described via an AMS metadata record and have at least one valid consent record before publication |
| Archived | None | Published objects can be archived without additional requirements |
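A pre-transition check for these requirements might look like the sketch below. It assumes a simplified in-memory view of an object; `ObjectSnapshot`, its attribute names, and `missing_requirements` are hypothetical and only mirror the fields listed in the table.

```python
from dataclasses import dataclass


@dataclass
class ObjectSnapshot:
    """Hypothetical, simplified view of a Digital Object used only for this check."""
    title: str = ""
    object_type: str = ""
    project: str = ""
    classification: str = ""
    has_ams_record: bool = False
    valid_consent_count: int = 0


def missing_requirements(obj: ObjectSnapshot, target_state: str) -> list[str]:
    """List what is still missing before obj may enter target_state."""
    missing: list[str] = []
    if target_state == "ingested":
        missing = [name for name in ("title", "object_type", "project")
                   if not getattr(obj, name)]
    elif target_state == "reviewed":
        if not obj.classification:
            missing.append("classification")
    elif target_state == "published":
        if not obj.has_ams_record:
            missing.append("AMS record")
        if obj.valid_consent_count < 1:
            missing.append("valid consent")
    # "archived" has no additional requirements, so the list stays empty.
    return missing
```

An empty list means the transition may proceed; a non-empty list can be shown to the user as the reason the transition is blocked.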
Consent Management
The consent subsystem tracks rights, permissions, and ethical approvals associated with digital objects. This is critical for Oral History data involving human subjects, where informed consent governs what can be stored, accessed, and published.
Every consent record follows a defined lifecycle and is linked to the digital object it governs. Changes to consent status are logged in an immutable audit trail.
Consent Lifecycle
stateDiagram-v2
direction LR
[*] --> Pending : created
Pending --> Valid : approved
Valid --> Expired : expiry passed
Valid --> Withdrawn : revoked
Expired --> Valid : renewed
Withdrawn --> [*]
Consent Types & Statuses
Consent Types
- Informed Consent: Consent from human subjects for participation in research, interviews, or recordings. Required for oral history materials.
- Release Form: Authorisation to use, reproduce, or distribute specific materials (images, recordings, documents).
- Data Use Agreement: Agreement governing how research data may be stored, processed, shared, or archived.
- Copyright Clearance: Permission to use copyrighted material such as manuscripts, images, published works, or software.
- Other: Any other consent or rights documentation that doesn't fit the categories above.
Consent Statuses
| Status | Description | Effect on Object |
|---|---|---|
| Pending | Consent has been requested or recorded but not yet confirmed or verified. | Info banner shown. Cannot publish. |
| Valid | Active, verified consent. The object may be published and accessed according to its classification level. | Enables publishing. |
| Expired | The consent's expiry date has passed. The consent was once valid but needs renewal. | Warning banner shown. Cannot publish. |
| Withdrawn | Consent has been explicitly revoked by the grantor. This is a terminal status requiring immediate action. | Error banner. Published objects are auto-withdrawn. |
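The consent statuses and their effect on publishing can be sketched as follows. The transition map follows the Consent Lifecycle diagram; the lowercase identifiers, the function names, and the rule that a consent remains valid through its expiry date are assumptions.

```python
from datetime import date
from typing import Optional

# Status changes permitted by the Consent Lifecycle diagram.
CONSENT_TRANSITIONS = {
    "pending": {"valid"},
    "valid": {"expired", "withdrawn"},
    "expired": {"valid"},   # renewal
    "withdrawn": set(),     # terminal
}


def can_change_status(current: str, target: str) -> bool:
    """Return True if the diagram allows this consent status change."""
    return target in CONSENT_TRANSITIONS.get(current, set())


def allows_publishing(status: str) -> bool:
    """Only a valid consent enables publishing, per the status table."""
    return status == "valid"


def expiry_passed(expires: Optional[date], today: date) -> bool:
    """Assumption: a consent stays valid through its expiry date and lapses the day after."""
    return expires is not None and today > expires
```

A periodic job could call `expiry_passed` for each valid record and, where it returns `True`, move the record to the expired status.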
Consent & Lifecycle Enforcement
Consent is enforced at two critical points in the object lifecycle:
flowchart LR
A["Reviewed\nObject"] -->|"publish"| B{"has_valid_consent?"}
B -->|"Yes ✓"| C["Published\nInternal"]
B -->|"No ✗"| D["❌ Transition\nBlocked"]
C -->|"consent withdrawn"| E["Auto-withdraw"]
E --> F["Withdrawn\nState"]
style B fill:#fef3c7,stroke:#d97706,color:#92400e
style C fill:#d1fae5,stroke:#059669,color:#065f46
style D fill:#fee2e2,stroke:#dc2626,color:#991b1b
style F fill:#fee2e2,stroke:#dc2626,color:#991b1b
The reviewed → published_internal transition requires at least one consent record with status valid. Without valid consent, the transition is blocked with an error message explaining the requirement.
When a consent record on a published object is changed to withdrawn, the system automatically transitions the object to the withdrawn lifecycle state and creates an audit note documenting the reason.
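Both enforcement points can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the system's code: the `DigitalObject` class, the `TransitionBlocked` exception, the function names, and the state identifier `published` are all hypothetical.

```python
from dataclasses import dataclass, field


class TransitionBlocked(Exception):
    """Raised when a lifecycle transition is not permitted."""


@dataclass
class DigitalObject:
    """Hypothetical stand-in holding only what the consent rules need."""
    state: str
    consent_statuses: list = field(default_factory=list)
    audit_notes: list = field(default_factory=list)


def publish(obj: DigitalObject) -> None:
    """Enforcement point 1: publishing needs at least one valid consent record."""
    if obj.state != "reviewed":
        raise TransitionBlocked("Only reviewed objects can be published.")
    if "valid" not in obj.consent_statuses:
        raise TransitionBlocked("At least one valid consent record is required.")
    obj.state = "published"


def withdraw_consent(obj: DigitalObject, index: int, reason: str) -> None:
    """Enforcement point 2: withdrawing consent on a published object withdraws it."""
    obj.consent_statuses[index] = "withdrawn"
    if obj.state == "published":
        obj.state = "withdrawn"
        obj.audit_notes.append(f"Auto-withdrawn: consent withdrawn ({reason})")
```

Following the description above, the sketch withdraws a published object as soon as any of its consent records is withdrawn, even if other valid records remain.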
Consent Audit Trail
Every consent status change is recorded as an immutable ConsentAuditEntry. These entries cannot be edited or deleted, providing a complete history of consent decisions for compliance and governance:
| Field | Description |
|---|---|
| from_status | Previous status (empty for initial creation) |
| to_status | New status after the change |
| changed_by | The user who made the change |
| notes | Reason or context for the status change |
| changed_at | Timestamp (automatically recorded) |
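One way to model an append-only audit trail with these fields is a frozen dataclass plus a log that only exposes a `record` method. This is a sketch; apart from the field names in the table, everything here (the `ConsentAuditLog` class, string user identifiers, UTC timestamps) is an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ConsentAuditEntry:
    """One status change; frozen so attributes cannot be reassigned after creation."""
    from_status: str   # empty string for initial creation
    to_status: str
    changed_by: str
    notes: str = ""
    changed_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class ConsentAuditLog:
    """Append-only log: entries can be added but not edited or removed."""

    def __init__(self) -> None:
        self._entries: list = []

    def record(self, from_status: str, to_status: str,
               changed_by: str, notes: str = "") -> ConsentAuditEntry:
        entry = ConsentAuditEntry(from_status, to_status, changed_by, notes)
        self._entries.append(entry)
        return entry

    @property
    def entries(self) -> tuple:
        # Return a tuple so callers cannot mutate the underlying list.
        return tuple(self._entries)
```

Attempting to assign to a field of an existing entry raises `dataclasses.FrozenInstanceError`. In a database-backed system, the same guarantee would also have to be enforced at the model or database level.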
Dashboard Monitoring
The dashboard includes a Consent Health card providing at-a-glance monitoring:
- Colour-coded counts of consent records by status (valid, pending, expired, withdrawn) for quick visual scanning.
- A count of published objects that have no consent records at all, flagged as a warning that requires attention.
Each object detail page shows a consent summary badge in the header and warning banners for any consent issues.
Dedicated Consent Views
Consent records have their own first-class UI accessible from the Consent link in the main navigation bar. These standalone views provide a cross-project overview of all consent activity:
- A filterable, paginated list of all consent records across every project. Filter by status (valid, pending, expired, withdrawn), consent type, or project, or search by object title and notes. Filtering runs over AJAX in real time with a debounced search input, so no page reload is needed.
- A detail view for each consent record showing the full consent information (type, status, dates, notes, structured details), its parent object context (title, state, classification, project, collection), the object's overall consent health summary, and the complete status change audit trail with timestamps and the user who made each change.
- A create/edit form with collapsible guidance on which consent type to choose, inline status meaning badges, a withdrawal warning when changing status to "withdrawn" on a published object, date pickers, and best-practice tips for proper consent documentation.
All consent views include full breadcrumb trails (Dashboard → Consent Records → Project → Collection → Object → Consent) for easy navigation between the consent record, its parent object, and the project hierarchy.
Classification Levels
Every Digital Object has a classification level that controls who can access it. This is especially important for sensitive materials such as oral history interviews with living subjects.
The four levels, from least to most restrictive:
- Freely accessible to all institute members. No special restrictions. Suitable for published research data, public documents, and openly licensed materials.
- Institute members only, not for public distribution. This is the default level. Suitable for working data, internal reports, and preliminary research outputs.
- Named individuals or project members only. Used for sensitive interview material, personal data, and materials with limited consent scope.
- Requires explicit approval for each access. Highest sensitivity level for materials with strict consent conditions, embargoed content, or legally restricted data.
Typical Workflow
Here is the typical journey of a research asset through the system:
flowchart LR
A["1. Create\nProject"] --> B["2. Add\nCollection"]
B --> C["3. Create\nObject"]
C --> D["4. Upload\nFiles"]
D --> E["5. Ingest"]
E --> F["6. Review"]
F --> G["7. Record\nConsent"]
G --> H["8. Publish"]
H --> I["9. Archive"]
style G fill:#fef3c7,stroke:#d97706,color:#92400e
Archive Metadata Schema (AMS)
The Archive Metadata Schema (AMS) is the institute's standard for describing digital assets across all storage tiers — from active project work (hot) through central storage (warm) to long-term archive (cold/HPSS). The schema is maintained in the archive-metadata-schema repository.
The AMS standard is integrated directly into this system. The schema is defined in a machine-readable YAML file (ams-standard.yaml) and used to:
- Generate forms dynamically — form fields are built at runtime from the schema, not hard-coded
- Validate metadata — against two compliance profiles (AMS minimal and MPIWG institute)
- Import/export YAML — AMS-compliant YAML files can be imported or exported at any time
- Enforce controlled vocabularies — fields like license, format, and access level use values defined in the standard
- Track schema versions — multiple schema versions can coexist; the dashboard shows which records are on outdated versions
AMS Document Structure
Every AMS metadata record is a YAML document with four top-level sections:
- metadata: Descriptive and administrative metadata, split into two sub-sections:
  - context: project title, investigators, description, duration, department
  - archive: license, access level, version, responsible person
- files: A list of file entries, each describing a physical file: filename, format (MIME type), size, checksum, and optional actors (who created or contributed to the file).
- content: Logical components of the dataset (e.g., "main", "supplement", "documentation"). Each component has a description and type, and maps to a Collection in the infrastructure.
- A keyword section: structured keyword entries with a term, a controlled vocabulary (e.g., AAT, LCSH), and an optional URI for linked data compatibility.
Validation Profiles
AMS supports a two-tier validation system. Fields can be required at different levels depending on the profile:
| Profile | Required Fields | Description |
|---|---|---|
| AMS | 9 fields | Minimal compliance — enough to identify and locate the dataset. Includes: title, description, investigators, license, access level, version, and at least one file. |
| MPIWG | 15 fields | Institute-level compliance — adds department, duration, responsible person, file checksums, content components, and keywords. Required for archival. |
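The two-tier idea can be sketched as one required-field list that extends another. The field paths below are hypothetical and cover only the fields named in the table, not the full sets of 9 and 15; the authoritative lists live in ams-standard.yaml. The `keywords` key and the `responsible` key are guesses at how the schema might name those fields.

```python
# Hypothetical dotted paths; the real required fields are defined in ams-standard.yaml.
AMS_REQUIRED = [
    "metadata.context.title",
    "metadata.context.description",
    "metadata.context.investigators",
    "metadata.archive.license",
    "metadata.archive.accessLevel",
    "metadata.archive.version",
    "files",
]
# The institute profile adds fields on top of the minimal profile.
MPIWG_REQUIRED = AMS_REQUIRED + [
    "metadata.context.department",
    "metadata.context.duration",
    "metadata.archive.responsible",
    "content",
    "keywords",
]
PROFILES = {"ams": AMS_REQUIRED, "mpiwg": MPIWG_REQUIRED}


def lookup(record: dict, dotted_path: str):
    """Walk nested dicts along a dotted path; return None if any key is missing."""
    node = record
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def missing_fields(record: dict, profile: str) -> list:
    """Return the required paths that are absent or empty for the given profile."""
    missing = [path for path in PROFILES[profile]
               if lookup(record, path) in (None, "", [], {})]
    if profile == "mpiwg":
        # The institute profile also expects a checksum on every file entry.
        for index, entry in enumerate(record.get("files") or []):
            if not entry.get("checksum"):
                missing.append(f"files[{index}].checksum")
    return missing
```

A record that passes `missing_fields(record, "ams")` with an empty result may still fail the `"mpiwg"` profile, which matches the table's note that the institute profile is required for archival.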
How AMS Maps to Infrastructure
AMS fields correspond to infrastructure entities. When importing or exporting, these mappings are applied automatically:
| AMS Section | Infrastructure Entity | Key Mappings |
|---|---|---|
| metadata.context | Project | title → Project.name, description → Project.description, duration → start/end dates |
| metadata.archive | DigitalObject | accessLevel → classification, status → lifecycle_state |
| files[*] | FileRecord | filename, format → mime_type, size → size_bytes, checksum → checksum_sha256 |
| content[*] | Collection | component → title, description, type → collection_type |
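The mapping table translates directly into a function over a parsed AMS document. This sketch assumes the YAML has already been loaded into a Python dict and that `duration` is a mapping with `start` and `end` keys; both the function name `map_ams_document` and that duration layout are assumptions, while the field names follow the table.

```python
def map_ams_document(doc: dict) -> dict:
    """Translate a parsed AMS document into plain dicts per infrastructure entity."""
    metadata = doc.get("metadata", {})
    context = metadata.get("context", {})
    archive = metadata.get("archive", {})
    # Assumption: duration is stored as {"start": ..., "end": ...}.
    duration = context.get("duration") or {}
    return {
        "project": {
            "name": context.get("title"),
            "description": context.get("description"),
            "start": duration.get("start"),
            "end": duration.get("end"),
        },
        "digital_object": {
            "classification": archive.get("accessLevel"),
            "lifecycle_state": archive.get("status"),
        },
        "file_records": [
            {
                "filename": entry.get("filename"),
                "mime_type": entry.get("format"),
                "size_bytes": entry.get("size"),
                "checksum_sha256": entry.get("checksum"),
            }
            for entry in doc.get("files", [])
        ],
        "collections": [
            {
                "title": part.get("component"),
                "description": part.get("description"),
                "collection_type": part.get("type"),
            }
            for part in doc.get("content", [])
        ],
    }
```

The result is a set of plain dictionaries; an importer would still need to create or look up the corresponding database rows, and an exporter would apply the same mapping in reverse.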
Using AMS in This System
There are four ways to work with AMS metadata:
- AMS metadata is integrated into the Digital Object create and edit forms. Fields are generated from the schema with appropriate widgets for each field type.
- Upload an existing AMS YAML file from the Objects page. The importer creates all necessary infrastructure objects automatically.
- Export any metadata record as AMS-compliant YAML for transfer to other systems or cold storage. Available from any record's detail page or the project detail page.
- Upload an ams-standard.yaml file to load or update the schema definition used for form generation and validation.
Schema Versioning
The AMS standard evolves over time. When a new version of ams-standard.yaml is loaded, the system handles the transition safely:
Each schema load creates a new MetadataSchema row. The new version is marked active; previous versions are deactivated. Existing records remain linked to the version they were created under — no silent data loss.
The dashboard's AMS Schema Health card shows the active schema version, how many records sit on outdated versions, and a per-version breakdown with active/outdated badges.
When viewing a record whose schema version is no longer active, a warning banner appears at the top of the detail page. The source card also displays an active or outdated badge next to the schema version.
Validation can be run against the exact schema version a record was created under (using the stored schema_definition) rather than always validating against the latest version. This prevents false positives when schema fields change between versions.
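The versioning behaviour described above can be modelled in a few lines. This is an in-memory sketch, not the system's Django models: the `SchemaRegistry` class and `count_outdated` function are hypothetical, while the `MetadataSchema` fields follow the entity diagram.

```python
from dataclasses import dataclass


@dataclass
class MetadataSchema:
    name: str
    version: str
    schema_definition: dict
    is_active: bool = False


@dataclass
class MetadataRecord:
    data: dict
    schema: MetadataSchema  # the version the record was created under


class SchemaRegistry:
    """Keeps every loaded schema version; only the newest one is active."""

    def __init__(self) -> None:
        self.schemas: list = []

    def load(self, name: str, version: str, definition: dict) -> MetadataSchema:
        # Deactivate earlier versions instead of deleting them,
        # so existing records keep a link to their original schema.
        for schema in self.schemas:
            if schema.name == name:
                schema.is_active = False
        new_schema = MetadataSchema(name, version, definition, is_active=True)
        self.schemas.append(new_schema)
        return new_schema


def count_outdated(records: list) -> int:
    """Number of records whose schema version is no longer active."""
    return sum(1 for record in records if not record.schema.is_active)
```

Because each record keeps a reference to its own `MetadataSchema`, validation can read `record.schema.schema_definition` rather than the active version, which is what avoids false positives after a schema change.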
Technical Details
The system is built on:
- Django 6.x: Python web framework with PostgreSQL (production) or SQLite (development)
- REST API: Django REST Framework at /api/v1/ for programmatic access
- AMS Integration: Dynamic form generation, YAML import/export, and profile-based validation from ams-standard.yaml
- HTMX: Dynamic form sections (add/remove file and content entries without page reload)
- Pluggable Storage: Filesystem backend with support for future S3 or tape backends
- Celery: Async task queue for background processing (checksum computation, metadata extraction)
- DaisyUI + Tailwind CSS: Modern, accessible UI with the MPIWG brand colour theme
- Docker: Multi-stage Dockerfile (dev/prod), PostgreSQL 16, Redis, Nginx reverse proxy