What this app does

  1. Downloads the German curriculum corpus CSV (~35k excerpts)
  2. Filters excerpts for each of the three focus concepts
  3. Encodes every excerpt with paraphrase-multilingual-mpnet-base-v2
    (768-dim, L2-normalised, float32)
  4. Saves per-concept artefacts:
    • embeddings.npy: shape (N, 768)
    • metadata.parquet: all CSV columns + row_id, concept, model, timestamps
    • metadata_preview.json: schema + first 5 rows
  5. Pushes all artefacts to
    huggingface.co/datasets/deirdosh/curriculum_embeddings
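Steps 3–4 can be sketched roughly as below. This is a minimal illustration, not the app's actual code: the array sizes and file names follow the list above, but the random vectors stand in for the real `SentenceTransformer("paraphrase-multilingual-mpnet-base-v2").encode(...)` call so the snippet stays self-contained.

```python
import json
import numpy as np

# Stand-in for model.encode(excerpts); the real pipeline uses
# sentence-transformers' paraphrase-multilingual-mpnet-base-v2 (768-dim).
rng = np.random.default_rng(0)
raw = rng.standard_normal((100, 768)).astype(np.float32)

# L2-normalise each row and keep float32, matching step 3.
emb = (raw / np.linalg.norm(raw, axis=1, keepdims=True)).astype(np.float32)

np.save("embeddings.npy", emb)  # shape (N, 768), float32

# metadata_preview.json: schema + first 5 rows (step 4).
# Column names here are illustrative, not the app's exact schema.
preview = {
    "columns": ["row_id", "concept", "model"],
    "rows": [
        [i, "example_concept", "paraphrase-multilingual-mpnet-base-v2"]
        for i in range(5)
    ],
}
with open("metadata_preview.json", "w") as f:
    json.dump(preview, f, indent=2)
```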

How to use

Step  Action
1     Paste your HF_TOKEN (needs write access to the dataset repo)
2     Click ▶ Run Pipeline
3     Watch the live log; each concept takes ~10 min on CPU
4     Download individual files or the full ZIP below
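Once downloaded, the embeddings are easy to query: because the rows are L2-normalised, cosine similarity between two excerpts reduces to a dot product. A sketch under that assumption (the random array stands in for the real embeddings.npy file):

```python
import numpy as np

# Stand-in for a downloaded embeddings.npy (real file: shape (N, 768)).
rng = np.random.default_rng(1)
emb = rng.standard_normal((50, 768)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# Unit-length rows make cosine similarity a plain matrix-vector product.
query = emb[0]
sims = emb @ query                 # shape (50,), one score per excerpt
top5 = np.argsort(sims)[::-1][:5]  # indices of the most similar excerpts
```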

All steps are individually cached. Re-running skips already-computed embeddings. The HF_TOKEN can also be set as a Space secret; leave the field blank if so.