A reproducible LLM pipeline for classifying open-ended text.
An open-source Python and R toolkit for classifying open-ended text with large language models. For survey classification, defaults are calibrated against the consensus of double-blind coding by sociologists and demographers across multiple datasets, yielding 88% to 99% agreement with human coders across 30+ LLMs depending on the task (Soria 2026). The same engine underlies separate packages for other text domains; the toolkit has been used in several academic research papers. It works with GPT, Gemini, and Claude, with the Hugging Face Inference API for hosted open-weight models, and with Ollama for fully local inference.
pip install cat-llm
Each design choice — model selection, prompt structure, ensembling, multi-label handling — is documented and reproducible, so methodological decisions can be inspected, modified, and reported in a paper.
From raw text to coded data. The pipeline can either consume an existing codebook or induce one from a sample of responses; categories are not required up front.
Automatically discover categories from your data. Sample responses, let the model surface recurring themes, then merge semantically similar labels into a clean taxonomy.
Extract structured classifications across full dataframe columns. Batches are processed with configurable model selection and provider routing.
Assign categories with multi-label support and ensemble voting. Verbose definitions with inclusion/exclusion criteria reduce spurious classification, as documented in Soria 2026.
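The multi-label ensemble step can be pictured as a vote over per-model label sets. The sketch below is illustrative only: `ensemble_vote`, its `threshold` parameter, and the sample categories are hypothetical names, not the cat-llm API. Each model contributes the set of categories it assigned to one response, and a category is kept when the share of models assigning it meets the threshold (1.0 corresponds to a unanimous vote).

```python
from collections import Counter
from typing import Iterable


def ensemble_vote(model_labels: Iterable[set[str]], threshold: float = 1.0) -> set[str]:
    """Combine multi-label outputs from several models for one response.

    A category is kept when the fraction of models that assigned it is at
    least `threshold` (1.0 = unanimous, 0.5 = at least half of the models).
    """
    votes = list(model_labels)
    if not votes:
        raise ValueError("need at least one model's labels")
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must be in (0, 1]")

    # Count how many models assigned each category.
    counts = Counter(label for labels in votes for label in labels)
    n_models = len(votes)
    return {label for label, n in counts.items() if n / n_models >= threshold}


# Hypothetical outputs from three models for a single survey response.
per_model = [
    {"cost", "family"},
    {"cost", "family", "job"},
    {"cost"},
]

unanimous = ensemble_vote(per_model)                # {"cost"}
majority = ensemble_vote(per_model, threshold=0.5)  # {"cost", "family"}
```

A stricter threshold trades recall for precision: unanimous voting keeps only the categories every model agrees on, which is one way to reduce spurious assignments at the cost of dropping borderline ones.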
Six domain packages share a common cat-stack engine. Each ships with domain-tuned prompts, specialized parameters, and any built-in data sources relevant to that field, while exposing the same classify() / extract() API.
Classify open-ended survey responses at scale. Handles ambiguity with verbose category definitions and ensemble voting.
pip install cat-survey
17 built-in data sources: municipal ordinances, federal laws, executive orders, presidential speeches, and more. Updated weekly.
pip install cat-pol
Classify social media posts, comments, and short-form text with domain-tuned prompts for informal language.
pip install cat-vader
Classify and summarize academic abstracts, full texts, and research documents across disciplines.
pip install cat-ademic
Specialized tools for cognitive assessment scoring, including CERAD drawing evaluation for dementia research.
pip install cat-cog
Classify scraped web pages, articles, and HTML content with domain-tuned prompts for long-form online text.
pip install cat-web
A desktop build of the same cat-llm pipeline, distributed as a self-contained application. Outputs and the generated reproducibility scripts are the same as the library; the audience is collaborators who do not maintain a local Python environment.
Apple Silicon builds are available below. Intel and Windows builds are planned.
LLMs are already accessible to most researchers through ChatGPT, Claude, and Gemini. The relevant question is not whether LLMs can classify text, but whether the workflow around them meets ordinary standards of empirical research — reproducible, transparent, validated, and scalable.
| | Manual coding | ChatGPT / Claude / Gemini chat | CatLLM |
|---|---|---|---|
| Reproducible | Depends on coder consistency; inter-rater drift is common | No — outputs vary run-to-run; prompts live in chat history | Yes — versioned prompts, pinned models, deterministic config |
| Scales to thousands of responses | No — hours of work per few hundred responses | Limited — copy/paste workflow, no batch processing | Yes — pandas / dataframe input, batch APIs, parallel execution |
| Standardized output | Varies by coder | Free-form prose that needs to be re-parsed | Structured DataFrame; CSV-ready for Stata, R, Python |
| Transparent prompts | — | Buried in conversation; not version-controlled | Inspectable, modifiable, committable to your repo |
| Validated defaults | Codebook-dependent | None — relies on the user's prompt design | Defaults tuned against expert human coders in published benchmarks |
| Multi-label & ensemble support | Manual aggregation | Inconsistent across runs and models | Native multi-label; unanimous-vote ensembles across providers |
| Best for | Small, exploratory studies; rich qualitative interpretation | One-off exploration; drafting a candidate taxonomy | Production research datasets that need to be defensible at peer review |
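
One way to make "versioned prompts, pinned models, deterministic config" concrete is to hash the full run configuration and report the digest alongside the results. The following is a minimal sketch under stated assumptions: `config_fingerprint`, the field names, and the model identifier are placeholders chosen for illustration and do not reflect cat-llm's actual configuration format.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Return a stable SHA-256 hex digest of a run configuration.

    Keys are sorted before serialization, so two dicts with the same content
    produce the same digest regardless of insertion order.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical run configuration; every field name here is illustrative.
run_config = {
    "model": "example-model-2025-01-01",  # a pinned, dated model identifier
    "temperature": 0.0,
    "prompt": "Assign each response to one or more of the categories below.",
    "categories": ["cost", "family", "job"],
}

fingerprint = config_fingerprint(run_config)
```

Committing the configuration file and quoting its digest in a methods section lets a reader confirm that a rerun used the same prompt, model, and parameters.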
A minimal end-to-end pipeline in Python. The same workflow is available in R through the cat.llm package on R-universe.
Defaults are grounded in a systematic evaluation of 21 LLMs across six providers and four open-ended survey coding tasks, benchmarked against expert human coders.
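Benchmarking against human coders comes down to comparing two sets of codes for the same responses. The source does not specify which agreement statistic was used, so the sketch below assumes simple percent agreement on one binary category; `percent_agreement` and the sample codes are hypothetical.

```python
def percent_agreement(human: list[int], model: list[int]) -> float:
    """Share of responses on which model and human codes match (0.0 to 1.0)."""
    if len(human) != len(model):
        raise ValueError("code lists must have the same length")
    if not human:
        raise ValueError("need at least one coded response")
    matches = sum(h == m for h, m in zip(human, model))
    return matches / len(human)


# Hypothetical binary codes (1 = category applies) for ten responses.
human_codes = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
model_codes = [1, 0, 0, 1, 0, 0, 1, 0, 1, 1]

agreement = percent_agreement(human_codes, model_codes)  # 8 of 10 codes match
```

Percent agreement does not correct for agreement expected by chance, so it can look high for rare categories; a chance-corrected statistic may be worth reporting alongside it.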