Taiwan Civil Judgment → Vector DB (Qdrant) Ingestion
name: civil-judgment-taiwan-vectorstore
by alex02131926 · published 2026-04-01
$ claw add gh:alex02131926/alex02131926-civil-judgment-taiwan-vectorstore
---
name: civil-judgment-taiwan-vectorstore
description: Ingest Taiwan civil court judgments (HTML or PDF) — exclusively covering Taiwan civil cases — into Qdrant with Ollama embeddings, preserving traceability, deduplication, and incremental updates.
user-invocable: true
---
# Taiwan Civil Judgment → Vector DB (Qdrant) Ingestion
**Scope: Taiwan civil court judgments only** (民事判決). This skill ingests Taiwan civil cases (HTML **or PDF** files) into Qdrant. All parsing, chunking, and embedding logic lives in `scripts/ingest.py` — your job is to **run the script**, not to reimplement the pipeline.
---
## Quick Start (follow these steps in order)
### Step 1: Activate venv

    source {baseDir}/.venv/bin/activate

### Step 2: Identify the run folder
The user will provide an **absolute path** to a run folder.
Example: `/path/to/output/judicialyuan/20260305_142030`
Verify that it exists and contains HTML or PDF files:

    ls <RUN_FOLDER>/archive/ | grep -E '\.(html|pdf)$' | head -5

If there are no `archive/*.html` or `archive/*.pdf` files, **stop and tell the user** that the folder has no ingestible data.
### Step 3: Run ingestion
Use absolute paths throughout; no `cd` is needed:

    python3 {baseDir}/scripts/ingest.py \
      --run-folder <RUN_FOLDER>

The script handles everything: pre-flight checks, collection auto-creation (it creates `civil_case_doc` and `civil_case_chunk` if they don't exist), canonicalization, chunking, embedding, Qdrant upsert, and writing the manifest and report.
**Re-running the same command on the same folder is always safe** — deterministic IDs mean upsert = overwrite. No special `--resume` flag needed; just run the same command again.
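
As an illustration of why deterministic IDs make re-runs idempotent, here is a minimal sketch that derives point IDs from a stable key with UUIDv5. The namespace string, the key format, and the function names are assumptions for illustration only; `scripts/ingest.py` may derive its IDs differently.

```python
import uuid

# Hypothetical namespace; not taken from scripts/ingest.py.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "civil-judgment-taiwan-vectorstore")


def doc_point_id(source_key: str) -> str:
    """Return the same UUID string every time for the same document key."""
    return str(uuid.uuid5(NAMESPACE, f"doc:{source_key}"))


def chunk_point_id(source_key: str, chunk_index: int) -> str:
    """Return a stable UUID string for one chunk of a document."""
    return str(uuid.uuid5(NAMESPACE, f"chunk:{source_key}:{chunk_index}"))


# Two runs over the same file produce the same ID, so the second
# upsert overwrites the first point instead of adding a duplicate.
first_run = doc_point_id("fjud_detail_001.html")
second_run = doc_point_id("fjud_detail_001.html")
```

Because the ID depends only on the key, no separate "already ingested" bookkeeping is needed for a re-run to be safe.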
### Step 4: Check the result
**Successful output looks like:**

    OK files=42 processed=42 skipped=0 errored=0 doc_points=42 chunk_points=187
    manifest=<RUN_FOLDER>/ingest_manifest.jsonl
    report=<RUN_FOLDER>/ingest_report.md

**Read the report** (human-readable stats summary):

    cat <RUN_FOLDER>/ingest_report.md

If there are errors, check the **manifest** (machine-readable, one JSON line per file) for per-file diagnosis:

    grep -E '"status":"(skipped|error|partial)"' <RUN_FOLDER>/ingest_manifest.jsonl

### Step 5: Report to user
Tell the user:
**Done.** Do not proceed to additional steps unless the user asks.
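
When drafting the report to the user, the summary line from Step 4 can be read mechanically. The sketch below assumes the line always has the shape shown in the example output (a status word followed by `key=value` integer pairs); the function name is hypothetical.

```python
def parse_summary(line: str) -> dict:
    """Split 'OK files=42 ...' into a status word and integer counters."""
    status, *pairs = line.split()
    result = {"status": status}
    for pair in pairs:
        key, _, value = pair.partition("=")
        result[key] = int(value)
    return result


summary = parse_summary(
    "OK files=42 processed=42 skipped=0 errored=0 doc_points=42 chunk_points=187"
)
# Any skipped or errored file means the manifest is worth a look.
needs_attention = summary["skipped"] > 0 or summary["errored"] > 0
```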
---
## DO NOT rules (critical)
---
## Hard constraints
---
## Troubleshooting
### `PREFLIGHT_FAILED: Qdrant not reachable`
Qdrant is down or unreachable at the default or configured URL.

    # Check if Qdrant is running
    curl -s http://localhost:6333/collections | head -1
    # If not running, start it (or ask the user)

### `PREFLIGHT_FAILED: Ollama not reachable`

    # Check Ollama
    curl -s http://localhost:11434/api/tags | head -5

### `PREFLIGHT_FAILED: Ollama model missing: bge-m3:latest`

    ollama pull bge-m3:latest

Then re-run Step 3.
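
To confirm the pull worked before re-running, the JSON returned by Ollama's `/api/tags` endpoint can be inspected for the model name. This is a minimal sketch, not the script's own pre-flight logic; the helper name and the sample payload are illustrative, and only the `models[].name` field of the response is relied on.

```python
import json


def model_available(tags_json: str, model: str = "bge-m3:latest") -> bool:
    """Return True if `model` is listed in an /api/tags response body."""
    payload = json.loads(tags_json)
    names = {entry.get("name") for entry in payload.get("models", [])}
    return model in names


# Illustrative payload; a real one can be captured with
#   curl -s http://localhost:11434/api/tags
sample_tags = '{"models": [{"name": "bge-m3:latest"}, {"name": "llama3:8b"}]}'
has_model = model_available(sample_tags)
```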
### `PREFLIGHT_FAILED: No archive/*.html or archive/*.pdf found`
The run folder exists but has no archived detail pages. Check:
### Output shows `skipped > 0` or `errored > 0`
Check `ingest_manifest.jsonl` for per-file details:

    grep -E '"status":"(skipped|error|partial)"' "<RUN_FOLDER>/ingest_manifest.jsonl"

| Manifest status | Meaning | Action |
|-----------------|---------|--------|
| `ok` | Doc + all chunks ingested | None |
| `partial` | Doc upserted, but some section chunks failed embedding | Check Ollama stability; can re-run safely |
| `skipped` | Doc-level embedding failed — nothing upserted for this doc | Check Ollama; re-run safely |
| `error` | HTML read/parse failed | Check if the HTML file is corrupted |
Re-running is always safe — use the exact same command. No special flags needed; deterministic IDs → upsert/overwrite.
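
For a quick tally of manifest statuses, a few lines of Python can stand in for the `grep` above. This sketch assumes only that each manifest line is a JSON object with a `status` field, as the grep pattern implies; the `file` field in the sample lines is hypothetical.

```python
import json
from collections import Counter


def tally_statuses(lines):
    """Count manifest records per status and collect the non-ok ones."""
    counts = Counter()
    problems = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        status = record.get("status", "unknown")
        counts[status] += 1
        if status != "ok":
            problems.append(record)
    return counts, problems


# Sample lines; with a real manifest, pass an open file handle instead.
sample_manifest = [
    '{"file": "fjud_detail_001.html", "status": "ok"}',
    '{"file": "fjud_detail_002.html", "status": "partial"}',
    '{"file": "fjud_detail_003.pdf", "status": "skipped"}',
]
counts, problems = tally_statuses(sample_manifest)
```

Unlike the grep pattern, parsing the JSON does not depend on whether the writer puts a space after the colon in `"status": ...`.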
### Override service endpoints

    # Via environment variables
    OLLAMA_URL=http://localhost:11434 QDRANT_URL=http://localhost:6333 \
      python3 scripts/ingest.py --run-folder "..."
    # Via CLI flags (take precedence over env vars)
    python3 scripts/ingest.py --run-folder "..." \
      --ollama http://localhost:11434 --qdrant http://localhost:6333

Default endpoints:
| Service | Default | Env override |
|---------|---------|--------------|
| Ollama | `http://localhost:11434` | `$OLLAMA_URL` |
| Qdrant | `http://localhost:6333` | `$QDRANT_URL` |
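
The precedence described above (CLI flag, then environment variable, then built-in default) can be sketched as follows. This illustrates the resolution order only; the function name is hypothetical and the script's actual argument handling may differ.

```python
import argparse

DEFAULTS = {
    "ollama": "http://localhost:11434",
    "qdrant": "http://localhost:6333",
}


def resolve_endpoints(argv, env):
    """Resolve each endpoint as: CLI flag > environment variable > default."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--ollama", default=None)
    parser.add_argument("--qdrant", default=None)
    # Ignore other flags such as --run-folder in this sketch.
    args, _unknown = parser.parse_known_args(argv)
    return {
        "ollama": args.ollama or env.get("OLLAMA_URL") or DEFAULTS["ollama"],
        "qdrant": args.qdrant or env.get("QDRANT_URL") or DEFAULTS["qdrant"],
    }


# The flag wins over the environment variable for Qdrant;
# Ollama falls back to the environment variable.
endpoints = resolve_endpoints(
    ["--qdrant", "http://127.0.0.1:6333"],
    {"QDRANT_URL": "http://ignored:6333", "OLLAMA_URL": "http://127.0.0.1:11434"},
)
```

In a real entry point, `argv` would be `sys.argv[1:]` and `env` would be `os.environ`.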
### Test with a small batch first

    python3 scripts/ingest.py --run-folder "..." --limit 5

---
## Input folder structure (expected)

    <run_folder>/
      archive/
        fjud_detail_001.html    ← HTML input
        fjud_detail_002.html
        fjud_detail_003.pdf     ← PDF input (also supported)
        fint_detail_001.html    (if system=both)
      results_fjud.jsonl        (optional)
      results_fint.jsonl        (optional)

The script discovers all `archive/*.html` and `archive/*.pdf` files automatically (sorted by filename). HTML and PDF files can coexist in the same run folder.
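
The discovery rule can be pictured with a short sketch: collect `.html` and `.pdf` files under `archive/`, sort them by filename, and optionally keep only the first N (as `--limit` does). The function name is hypothetical and the demo builds a throwaway folder; the script's own discovery code may differ in detail.

```python
import tempfile
from pathlib import Path

SUFFIXES = {".html", ".pdf"}


def discover_inputs(run_folder, limit=0):
    """Return archive/*.html and archive/*.pdf sorted by filename."""
    archive = Path(run_folder) / "archive"
    if not archive.is_dir():
        return []
    files = sorted(
        (p for p in archive.iterdir() if p.is_file() and p.suffix in SUFFIXES),
        key=lambda p: p.name,
    )
    # limit=0 means "no limit", mirroring the --limit default.
    return files[:limit] if limit > 0 else files


# Demo on a temporary run folder with mixed inputs and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    demo_archive = Path(tmp) / "archive"
    demo_archive.mkdir()
    for name in ["fjud_detail_002.html", "fjud_detail_001.html",
                 "fjud_detail_003.pdf", "notes.txt"]:
        (demo_archive / name).write_text("", encoding="utf-8")
    found = [p.name for p in discover_inputs(tmp)]
    first_only = [p.name for p in discover_inputs(tmp, limit=1)]
```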
**v1 limitation**: The `system` metadata field is currently hardcoded to `FJUD`. If a run folder contains both FJUD and FINT files, FINT files will be ingested but mislabeled as `FJUD`. This does not affect chunking or embeddings — only the `system` metadata field on the resulting Qdrant points.
---
## CLI reference

    python3 scripts/ingest.py --run-folder <PATH> [options]

| Flag | Default | Description |
|------|---------|-------------|
| `--run-folder` | (required) | Path to an input folder |
| `--ollama` | `$OLLAMA_URL` or `http://localhost:11434` | Ollama endpoint |
| `--qdrant` | `$QDRANT_URL` or `http://localhost:6333` | Qdrant endpoint |
| `--embed-model` | `bge-m3:latest` | Ollama embedding model |
| `--vector-size` | `1024` | Vector dimension |
| `--max-chars` | `900` | Max chars per chunk (500–1000) |
| `--overlap-chars` | `150` | Overlap between chunks (10–20% of max-chars) |
| `--limit` | `0` (no limit) | Process only first N files sorted by filename (lexicographic order); for testing |
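
To make the interaction between `--max-chars` and `--overlap-chars` concrete, here is a minimal sliding-window sketch. It ignores the section splitting that the real pipeline performs first (see `references/internals.md`) and is not the script's implementation; the function name is hypothetical.

```python
def chunk_text(text, max_chars=900, overlap_chars=150):
    """Split text into windows of max_chars that share overlap_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap_chars < max_chars:
        raise ValueError("overlap_chars must be in [0, max_chars)")
    step = max_chars - overlap_chars
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


# 2000 characters with the defaults: windows start at 0, 750 and 1500.
sample_text = "".join(chr(0x4E00 + i % 500) for i in range(2000))
sample_chunks = chunk_text(sample_text)
```

With the defaults, each window advances by 750 characters, so the last 150 characters of one chunk reappear at the start of the next.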
---
## Outputs
---
## Roadmap
---
## Internal details
For metadata schema, canonicalization rules, section-splitting patterns, and chunking implementation, see [`references/internals.md`](references/internals.md).
---
## Lessons learned / operational gotchas