What this app does

  1. Downloads the German curriculum corpus CSV (~35k excerpts)
  2. Filters excerpts for each of the three focus concepts
  3. Encodes every excerpt with paraphrase-multilingual-mpnet-base-v2
    (768-dim, L2-normalised, float32)
  4. Saves per-concept artefacts:
    • embeddings.npy: shape (N, 768)
    • metadata.parquet: all CSV columns + row_id, concept, model, timestamps
    • metadata_preview.json: schema + first 5 rows
  5. Pushes all artefacts to
    huggingface.co/datasets/deirdosh/curriculum_embeddings
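Steps 3–4 can be sketched roughly as below. This is a minimal illustration, not the app's actual code: the array sizes and file names follow the list above, but the random vectors stand in for the real `SentenceTransformer("paraphrase-multilingual-mpnet-base-v2").encode(...)` call so the snippet stays self-contained.

```python
import json
import numpy as np

# Stand-in for model.encode(excerpts); the real pipeline uses
# sentence-transformers' paraphrase-multilingual-mpnet-base-v2 (768-dim).
rng = np.random.default_rng(0)
raw = rng.standard_normal((100, 768)).astype(np.float32)

# L2-normalise each row and keep float32, matching step 3.
emb = (raw / np.linalg.norm(raw, axis=1, keepdims=True)).astype(np.float32)

np.save("embeddings.npy", emb)  # shape (N, 768), float32

# metadata_preview.json: schema + first 5 rows (step 4).
# Column names here are illustrative, not the app's exact schema.
preview = {
    "columns": ["row_id", "concept", "model"],
    "rows": [
        [i, "example_concept", "paraphrase-multilingual-mpnet-base-v2"]
        for i in range(5)
    ],
}
with open("metadata_preview.json", "w") as f:
    json.dump(preview, f, indent=2)
```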

How to use

Step  Action
1     Paste your HF_TOKEN (needs write access to the dataset repo)
2     Click ▶ Run Pipeline
3     Watch the live log; each concept takes ~10 min on CPU
4     Download individual files or the full ZIP below
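Once downloaded, the embeddings are easy to query: because the rows are L2-normalised, cosine similarity between two excerpts reduces to a dot product. A sketch under that assumption (the random array stands in for the real embeddings.npy file):

```python
import numpy as np

# Stand-in for a downloaded embeddings.npy (real file: shape (N, 768)).
rng = np.random.default_rng(1)
emb = rng.standard_normal((50, 768)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# Unit-length rows make cosine similarity a plain matrix-vector product.
query = emb[0]
sims = emb @ query                 # shape (50,), one score per excerpt
top5 = np.argsort(sims)[::-1][:5]  # indices of the most similar excerpts
```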

All steps are individually cached. Re-running skips already-computed embeddings. The HF_TOKEN can also be set as a Space secret; leave the field blank if so.