What this app does
- Downloads the German curriculum corpus CSV (~35k excerpts)
- Filters excerpts for each of the three focus concepts
- Encodes every excerpt with `paraphrase-multilingual-mpnet-base-v2` (768-dim, L2-normalised, float32)
- Saves per-concept artefacts:
  - `embeddings.npy`: shape `(N, 768)`
  - `metadata.parquet`: all CSV columns + `row_id`, `concept`, `model`, timestamps
  - `metadata_preview.json`: schema + first 5 rows
- Pushes all artefacts to huggingface.co/datasets/deirdosh/curriculum_embeddings
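
The filtering and metadata steps above might look roughly like the following standard-library sketch. It is not the app's actual code: the keyword-based filter, the `text` column name, the `created_at` field, the concept name `democracy`, and the choice to number `row_id` by position within the filtered set (so that it lines up with the rows of the embedding matrix) are all assumptions made for illustration.

```python
import json
from datetime import datetime, timezone

MODEL_NAME = "paraphrase-multilingual-mpnet-base-v2"


def build_metadata(rows, concept, keywords, text_column="text"):
    """Keep rows whose text mentions any keyword and add bookkeeping columns.

    Assumption: concepts are matched by case-insensitive substring search.
    """
    created_at = datetime.now(timezone.utc).isoformat()
    lowered = [k.lower() for k in keywords]
    selected = []
    for row in rows:
        text = row.get(text_column, "").lower()
        if any(k in text for k in lowered):
            selected.append({
                **row,                      # all original CSV columns
                "row_id": len(selected),    # assumed: index into the embedding matrix
                "concept": concept,
                "model": MODEL_NAME,
                "created_at": created_at,   # hypothetical timestamp column name
            })
    return selected


def build_preview(metadata, n_rows=5):
    """Return a small JSON-serialisable summary: schema plus the first rows."""
    schema = {}
    if metadata:
        schema = {name: type(value).__name__ for name, value in metadata[0].items()}
    return {"schema": schema, "num_rows": len(metadata), "rows": metadata[:n_rows]}


if __name__ == "__main__":
    sample = [
        {"text": "Demokratie und Staat"},
        {"text": "Bruchrechnung in Klasse 6"},
    ]
    meta = build_metadata(sample, concept="democracy", keywords=["demokratie"])
    print(json.dumps(build_preview(meta), indent=2, ensure_ascii=False))
```

Keeping `row_id` aligned with the embedding row index would let a consumer look up the metadata for any vector without a join, which is one plausible reason to store both artefacts side by side.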
How to use
| Step | Action |
|---|---|
| 1 | Paste your HF_TOKEN (needs write access to the dataset repo) |
| 2 | Click ▶ Run Pipeline |
| 3 | Watch the live log; each concept takes ~10 min on CPU |
| 4 | Download individual files or the full ZIP below |
Each step is cached individually, so re-running skips embeddings that have already been computed. The HF_TOKEN can also be set as a Space secret; in that case, leave the field blank.
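
As a rough illustration of the caching and token behaviour described above, here is a minimal sketch. The function names `run_concept` and `resolve_token`, the one-directory-per-concept layout, and the use of file existence as the cache check are assumptions, not the app's confirmed implementation. Space secrets are exposed to the app as environment variables, which is what the fallback relies on.

```python
import os
from pathlib import Path


def run_concept(concept_dir, compute_and_save):
    """Compute embeddings for one concept unless they are already on disk.

    Assumption: a concept counts as cached when its embeddings.npy exists.
    `compute_and_save` is a hypothetical callable that writes the file.
    """
    concept_dir = Path(concept_dir)
    target = concept_dir / "embeddings.npy"
    if target.is_file():
        return "skipped"
    concept_dir.mkdir(parents=True, exist_ok=True)
    compute_and_save(target)
    return "computed"


def resolve_token(field_value):
    """Prefer the token pasted into the UI; fall back to the HF_TOKEN secret."""
    token = (field_value or "").strip()
    if token:
        return token
    return os.environ.get("HF_TOKEN")  # None if neither source is set
```

A file-existence check is simple but does not notice a changed corpus or model; a stricter variant could store a hash of the inputs next to each artefact.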