
// Skill profile

Taiwan Civil Judgment → Vector DB (Qdrant) Ingestion

Name: Taiwan Civil Judgment → Vector DB (Qdrant) Ingestion
Author: alex02131926

name: civil-judgment-taiwan-vectorstore

by alex02131926 · published 2026-04-01

Data processing · API integration


Last updated

2026-04

// Install command

$ claw add gh:alex02131926/alex02131926-civil-judgment-taiwan-vectorstore


// Full documentation

---

name: civil-judgment-taiwan-vectorstore

description: Ingest Taiwan civil court judgments (HTML or PDF) — exclusively covering Taiwan civil cases — into Qdrant with Ollama embeddings, preserving traceability, deduplication, and incremental updates.

user-invocable: true

---

# Taiwan Civil Judgment → Vector DB (Qdrant) Ingestion

**Scope: Taiwan civil court judgments only** (民事判決). This skill ingests Taiwan civil cases (HTML **or PDF** files) into Qdrant. All parsing, chunking, and embedding logic lives in `scripts/ingest.py` — your job is to **run the script**, not to reimplement the pipeline.

---

Quick Start (follow these steps in order)

Step 1 — Activate venv

source {baseDir}/.venv/bin/activate

Step 2 — Identify the run folder

The user will provide an **absolute path** to a run folder.

Example: `/path/to/output/judicialyuan/20260305_142030`

Verify it exists and has HTML or PDF files:

ls <RUN_FOLDER>/archive/ | grep -E '\.(html|pdf)$' | head -5

If the folder contains no `archive/*.html` or `archive/*.pdf` files, **stop and tell the user** that the folder has no ingestible data.
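The same check can be expressed in Python when a shell is not convenient. This is a minimal sketch, not code taken from `ingest.py`; the function name `find_ingestible` is hypothetical, and the case-sensitive suffix match mirrors the `grep` pattern above.

```python
from pathlib import Path
from typing import List


def find_ingestible(run_folder: str) -> List[Path]:
    """Return archive/*.html and archive/*.pdf under run_folder, sorted by filename.

    Hypothetical helper: it mirrors the shell check above, not ingest.py itself.
    """
    archive = Path(run_folder) / "archive"
    if not archive.is_dir():
        return []
    files = [
        p for p in archive.iterdir()
        if p.is_file() and p.suffix in {".html", ".pdf"}
    ]
    return sorted(files, key=lambda p: p.name)
```

An empty list corresponds to the "stop and tell the user" case.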

Step 3 — Run ingestion

Use absolute paths throughout — no `cd` needed:

python3 {baseDir}/scripts/ingest.py \
  --run-folder <RUN_FOLDER>

The script handles everything: pre-flight checks, collection auto-creation (creates `civil_case_doc` / `civil_case_chunk` if they don't exist), canonicalization, chunking, embedding, Qdrant upsert, manifest + report writing.

**Re-running the same command on the same folder is always safe** — deterministic IDs mean upsert = overwrite. No special `--resume` flag needed; just run the same command again.
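To see why deterministic IDs make re-runs idempotent, consider the sketch below. It is an illustration under assumptions, not the script's actual ID scheme: the namespace value, the `point_id` helper, and the way a chunk index is mixed into the key are all hypothetical. The only parts taken from this document are the chain "canonical text → SHA-256 → point ID" and the use of UUIDs.

```python
import hashlib
import uuid
from typing import Optional

# Hypothetical namespace; ingest.py may derive its UUIDs differently.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "civil-judgment-taiwan-vectorstore")


def content_hash(canonical_text: str) -> str:
    """SHA-256 hex digest of the canonical text."""
    return hashlib.sha256(canonical_text.encode("utf-8")).hexdigest()


def point_id(canonical_text: str, chunk_index: Optional[int] = None) -> str:
    """Derive a stable UUID from the content hash (and chunk index, if any)."""
    key = content_hash(canonical_text)
    if chunk_index is not None:
        key = f"{key}:{chunk_index}"
    return str(uuid.uuid5(NAMESPACE, key))
```

Because `uuid.uuid5` is a pure function of its inputs, ingesting the same canonical text twice yields the same ID, so the second upsert overwrites the first point instead of creating a duplicate.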

Step 4 — Check the result

**Successful output looks like:**

OK files=42 processed=42 skipped=0 errored=0 doc_points=42 chunk_points=187
manifest=<RUN_FOLDER>/ingest_manifest.jsonl
report=<RUN_FOLDER>/ingest_report.md

**Read the report** (human-readable stats summary):

cat <RUN_FOLDER>/ingest_report.md

If there are errors, check the **manifest** (machine-readable, one JSON line per file) for per-file diagnosis:

grep -E '"status":"(skipped|error|partial)"' <RUN_FOLDER>/ingest_manifest.jsonl

Step 5 — Report to user

Tell the user:

How many docs were ingested (`doc_points`)

How many chunks were created (`chunk_points`)

Whether any were skipped or errored

Where the report file is
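The four items above can be read off the summary line printed in Step 4. A small sketch, assuming the `key=value` layout shown in the sample output; `parse_summary` and `user_report` are hypothetical helper names, not part of the skill.

```python
from typing import Dict


def parse_summary(line: str) -> Dict[str, int]:
    """Parse 'OK files=42 processed=42 ...' into {'files': 42, 'processed': 42, ...}."""
    counts: Dict[str, int] = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep and value.isdigit():
            counts[key] = int(value)
    return counts


def user_report(line: str, report_path: str) -> str:
    """Compose the short message for the user from the summary line."""
    c = parse_summary(line)
    return (
        f"Ingested {c.get('doc_points', 0)} docs and "
        f"{c.get('chunk_points', 0)} chunks; "
        f"{c.get('skipped', 0)} skipped, {c.get('errored', 0)} errored. "
        f"Report: {report_path}"
    )
```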

**Done.** Do not proceed to additional steps unless the user asks.

---

DO NOT rules (critical)

**DO NOT** write your own HTML parsing, chunking, or embedding code. `ingest.py` handles all of this.

**DO NOT** modify parsing/chunking logic casually. Only change heading detection or chunk fallback when the user explicitly asks to improve PDF/OCR robustness, and validate on a small sample before re-running a large batch.

**DO NOT** call Qdrant or Ollama APIs directly. The script does this.

**DO NOT** use `verify=False` or skip SSL verification for any HTTP request.

**DO NOT** modify or delete files under `archive/`. Raw HTML/PDF is the immutable source of truth.

**DO NOT** change chunking defaults (`--max-chars`, `--overlap-chars`) unless the user explicitly asks.

---

Hard constraints

**Raw HTML/PDF is source of truth**; never overwrite it.

**Deterministic**: same input → same canonical text → same SHA-256 → same Qdrant point IDs. Safe to re-run.

**Traceability**: every Qdrant point carries `doc_url` + `local_path`.

**Batched upserts** (≤ 64 points/batch) to stay under Qdrant's 32 MB payload limit.

**`parser_version`** in every point's metadata. Current: `v3.5-sentence-boundary`.
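The batching constraint amounts to slicing the point list into groups of at most 64 before each upsert. A minimal sketch; `batched` is a hypothetical helper and the script's own batching code may look different.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")
MAX_BATCH = 64  # limit stated in the constraint above


def batched(points: Sequence[T], size: int = MAX_BATCH) -> Iterator[List[T]]:
    """Yield consecutive slices of at most `size` points."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(points), size):
        yield list(points[start:start + size])
```

With the 187 chunk points from the sample output in Step 4, this yields three batches of 64, 64, and 59 points.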

---

Troubleshooting

`PREFLIGHT_FAILED: Qdrant not reachable`

Qdrant is down or unreachable at the default/configured URL.

# Check if Qdrant is running
curl -s http://localhost:6333/collections | head -1

# If not running, start it (or ask the user)

`PREFLIGHT_FAILED: Ollama not reachable`

# Check Ollama
curl -s http://localhost:11434/api/tags | head -5

`PREFLIGHT_FAILED: Ollama model missing: bge-m3:latest`

ollama pull bge-m3:latest

Then re-run Step 3.
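For diagnosis only, the three pre-flight conditions above can be probed from Python with the standard library. This is a sketch, not the script's pre-flight code: the function names are hypothetical, and it relies on the same two endpoints as the `curl` commands (`/collections` for Qdrant, `/api/tags` for Ollama, whose response lists installed models under `models[].name`). It does not replace running `ingest.py`.

```python
import http.client
import json
import urllib.request
from typing import List, Optional


def fetch_json(url: str, timeout: float = 5.0) -> Optional[dict]:
    """Return the parsed JSON body, or None if the service is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except (OSError, ValueError, http.client.HTTPException):
        return None


def has_model(tags: dict, model: str) -> bool:
    """True if `model` appears in an Ollama /api/tags response."""
    return any(m.get("name") == model for m in tags.get("models", []))


def preflight(qdrant: str, ollama: str, model: str) -> List[str]:
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    if fetch_json(f"{qdrant}/collections") is None:
        problems.append("Qdrant not reachable")
    tags = fetch_json(f"{ollama}/api/tags")
    if tags is None:
        problems.append("Ollama not reachable")
    elif not has_model(tags, model):
        problems.append(f"Ollama model missing: {model}")
    return problems
```

Calling `preflight("http://localhost:6333", "http://localhost:11434", "bge-m3:latest")` returns an empty list when all three conditions hold.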

`PREFLIGHT_FAILED: No archive/*.html or archive/*.pdf found`

The run folder exists but has no archived detail pages. Check:

Is this the correct run folder?

Output shows `skipped > 0` or `errored > 0`

Check `ingest_manifest.jsonl` for per-file details:

grep -E '"status":"(skipped|error|partial)"' "<RUN_FOLDER>/ingest_manifest.jsonl"

| Manifest status | Meaning | Action |
|-----------------|---------|--------|
| `ok` | Doc + all chunks ingested | None |
| `partial` | Doc upserted, but some section chunks failed embedding | Check Ollama stability; can re-run safely |
| `skipped` | Doc-level embedding failed — nothing upserted for this doc | Check Ollama; re-run safely |
| `error` | HTML read/parse failed | Check if the HTML file is corrupted |

Re-running is always safe — use the exact same command. No special flags needed; deterministic IDs → upsert/overwrite.
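The `grep` pattern above only matches when the manifest is written without a space after the colon. If it returns nothing even though the summary reports failures, parsing each line as JSON is more robust. A sketch, assuming only that every record has a `status` field as described in the table; `summarize_manifest` is a hypothetical helper.

```python
import json
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def summarize_manifest(lines: Iterable[str]) -> Tuple[Counter, List[Dict]]:
    """Count records per status and collect every record whose status is not 'ok'."""
    counts: Counter = Counter()
    problems: List[Dict] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        status = record.get("status", "unknown")
        counts[status] += 1
        if status != "ok":
            problems.append(record)
    return counts, problems
```

Usage: open `<RUN_FOLDER>/ingest_manifest.jsonl` with `encoding="utf-8"` and pass the file object to `summarize_manifest`; the returned list holds the full records for the files that need attention.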

Override service endpoints

# Via environment variables
OLLAMA_URL=http://localhost:11434 QDRANT_URL=http://localhost:6333 \
  python3 scripts/ingest.py --run-folder "..."

# Via CLI flags (take precedence over env vars)
python3 scripts/ingest.py --run-folder "..." \
  --ollama http://localhost:11434 --qdrant http://localhost:6333

Default endpoints:

| Service | Default | Env override |
|---------|---------|--------------|
| Ollama | `http://localhost:11434` | `$OLLAMA_URL` |
| Qdrant | `http://localhost:6333` | `$QDRANT_URL` |
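The precedence described here (CLI flag, then environment variable, then built-in default) can be sketched with `argparse`. This illustrates the rule rather than the script's actual argument handling; `resolve_endpoints` is a hypothetical name, and only the flag names, variable names, and defaults come from this document.

```python
import argparse
import os
from typing import Dict, Mapping, Optional, Sequence

DEFAULTS = {"ollama": "http://localhost:11434", "qdrant": "http://localhost:6333"}
ENV_VARS = {"ollama": "OLLAMA_URL", "qdrant": "QDRANT_URL"}


def resolve_endpoints(
    argv: Sequence[str], env: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Resolve each endpoint as: CLI flag > environment variable > default."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--ollama")
    parser.add_argument("--qdrant")
    # Ignore flags this sketch does not model, such as --run-folder.
    args, _unknown = parser.parse_known_args(list(argv))
    return {
        name: getattr(args, name) or env.get(ENV_VARS[name]) or DEFAULTS[name]
        for name in DEFAULTS
    }
```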

Test with a small batch first

python3 scripts/ingest.py --run-folder "..." --limit 5

---

Input folder structure (expected)

<run_folder>/
  archive/
    fjud_detail_001.html               ← HTML input
    fjud_detail_002.html
    fjud_detail_003.pdf                ← PDF input (also supported)
    fint_detail_001.html               (if system=both)
  results_fjud.jsonl                   (optional)
  results_fint.jsonl                   (optional)

The script discovers all `archive/*.html` and `archive/*.pdf` files automatically (sorted by filename). HTML and PDF files can coexist in the same run folder.

**v1 limitation**: The `system` metadata field is currently hardcoded to `FJUD`. If a run folder contains both FJUD and FINT files, FINT files will be ingested but mislabeled as `FJUD`. This does not affect chunking or embeddings — only the `system` metadata field on the resulting Qdrant points.
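To gauge how many points in a mixed folder carry the wrong label, the intended system can be inferred from the filename prefix. This is an auditing sketch only: it assumes the `fjud_` / `fint_` naming shown in the folder structure above, `infer_system` is a hypothetical helper, and nothing here changes what `ingest.py` writes.

```python
from pathlib import Path

# Prefixes taken from the example filenames in the folder structure above.
KNOWN_SYSTEMS = {"fjud": "FJUD", "fint": "FINT"}


def infer_system(path: str, default: str = "FJUD") -> str:
    """Guess the source system from a filename such as 'fint_detail_001.html'."""
    prefix = Path(path).name.split("_", 1)[0].lower()
    return KNOWN_SYSTEMS.get(prefix, default)
```

Files for which `infer_system` returns `FINT` are the ones whose Qdrant points are labeled `FJUD` under the v1 limitation.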

---

CLI reference

python3 scripts/ingest.py --run-folder <PATH> [options]

| Flag | Default | Description |
|------|---------|-------------|
| `--run-folder` | (required) | Path to an input folder |
| `--ollama` | `$OLLAMA_URL` or `http://localhost:11434` | Ollama endpoint |
| `--qdrant` | `$QDRANT_URL` or `http://localhost:6333` | Qdrant endpoint |
| `--embed-model` | `bge-m3:latest` | Ollama embedding model |
| `--vector-size` | `1024` | Vector dimension |
| `--max-chars` | `900` | Max chars per chunk (500–1000) |
| `--overlap-chars` | `150` | Overlap between chunks (10–20% of max-chars) |
| `--limit` | `0` (no limit) | Process only first N files sorted by filename (lexicographic order); for testing |
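To make the interaction between `--max-chars` and `--overlap-chars` concrete, here is a plain sliding-window chunker. It is a deliberate simplification: the parser version named in this document (`v3.5-sentence-boundary`) suggests the real chunker also respects sentence boundaries, which this sketch does not attempt. `chunk_text` is a hypothetical name.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 900, overlap_chars: int = 150) -> List[str]:
    """Split text into windows of max_chars that overlap by overlap_chars."""
    if max_chars <= 0 or not 0 <= overlap_chars < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap_chars < max_chars")
    step = max_chars - overlap_chars  # how far each window advances
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks
```

With the defaults, a 2,000-character section becomes three chunks of 900, 900, and 500 characters, and the last 150 characters of each chunk repeat at the start of the next.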

---

Outputs

**Qdrant collections**: `civil_case_doc` (1 point/doc), `civil_case_chunk` (many points/doc). Auto-created if they don't exist.

**`ingest_report.md`**: human-readable summary (doc/chunk counts, error counts). **Read this first** after ingestion.

**`ingest_manifest.jsonl`**: machine-readable, one JSON line per doc with status (`ok` / `partial` / `skipped` / `error`). **Read this to diagnose specific file failures** (grep for non-`ok` statuses). Both files overlap on aggregate counts; the manifest adds per-file detail.

---

Roadmap

**v1** (current): doc + section-aware chunks

**v2**: candidate issue extraction (爭點抽取)

**v3**: issue-level index (`civil_case_issue` collection)

---

Internal details

For metadata schema, canonicalization rules, section-splitting patterns, and chunking implementation, see [`references/internals.md`](references/internals.md).

---

Lessons learned / operational gotchas

Qdrant rejects non-UUID/non-integer point IDs (`400 Bad Request`). The script uses deterministic UUIDs — do not change the ID generation logic.

Qdrant rejects payloads > 32MB. The script batches at 64 points — do not increase batch size.

Re-running on the same folder is safe: deterministic IDs mean upsert = overwrite.

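Section headings in Taiwan judgments are not formatted consistently (e.g. 「理　由」 with a fullwidth space, or compatibility characters such as 「⽂」). The parser now normalizes headings first; if it still cannot split the text into sections, it falls back to chunking the `full` text, so that a document is not left with only doc-level points.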
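One way to handle the heading variants mentioned above is Unicode NFKC normalization followed by whitespace removal. This is a sketch of the idea under assumptions, not the parser's actual rules (those are documented in `references/internals.md`); `normalize_heading`, `is_section_heading`, and the heading set are hypothetical.

```python
import re
import unicodedata

# Illustrative heading set only; the real section patterns live in the parser.
EXAMPLE_HEADINGS = {"主文", "事實", "理由", "事實及理由"}


def normalize_heading(line: str) -> str:
    """Fold compatibility characters and drop all whitespace.

    NFKC maps the fullwidth space U+3000 to an ASCII space and Kangxi
    radicals such as U+2F42 to their unified ideographs (here U+6587).
    """
    folded = unicodedata.normalize("NFKC", line)
    return re.sub(r"\s+", "", folded)


def is_section_heading(line: str) -> bool:
    """True if the normalized line matches one of the example headings."""
    return normalize_heading(line) in EXAMPLE_HEADINGS
```

After normalization, a line containing 理, a fullwidth space, and 由 compares equal to 理由, and 主 followed by the Kangxi radical U+2F42 compares equal to 主文.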
