CatLLM

A reproducible LLM pipeline for classifying open-ended text.

An open-source Python and R toolkit for classifying open-ended text with large language models. For survey classification, defaults are calibrated against the consensus of double-blind coding by sociologists and demographers across multiple datasets, yielding 88% to 99% agreement with human coders across 30+ LLMs depending on the task (Soria 2026). The same engine underlies separate packages for other text domains; the toolkit has been used in several academic research papers. Compatible with GPT, Gemini, and Claude, with the Hugging Face Inference API for hosted open-weight models, or with Ollama for fully local inference.

Soria, C. · UC Berkeley · SocArXiv preprint, 2026 · JOSS DOI
pip install cat-llm

Defaults grounded in systematic evaluation

Each design choice — model selection, prompt structure, ensembling, multi-label handling — is documented and reproducible, so methodological decisions can be inspected, modified, and reported in a paper.

Multi-model ensembles
Aggregate predictions across models with unanimous voting. Open-weight ensembles can reach agreement comparable to individual frontier models at lower per-call cost (Soria 2026).
Provider-agnostic
Compatible with GPT, Gemini, and Claude. Open-weight Hugging Face models can be accessed through the Hugging Face Inference API or served locally via Ollama. REST-based with no SDK dependencies; switching providers does not require changes to analysis code.
Local-only deployment
Runs entirely on local hardware via open-weight models. Sensitive data — survey responses, clinical text, PII — never leaves the machine, supporting IRB and HIPAA constraints.
Unified multimodal interface
Text, images, and PDFs share one consistent API. Automatic input detection handles mixed datasets without per-modality configuration.
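
The unanimous-vote rule described under "Multi-model ensembles" can be sketched in a few lines. This is a minimal illustration, not the package's internal implementation: the function name `unanimous_vote`, the model names, and the labels are all hypothetical, and it assumes each model returns one set of labels per response.

```python
from typing import Dict, List, Set


def unanimous_vote(predictions: Dict[str, List[Set[str]]]) -> List[Set[str]]:
    """Keep a label for a response only if every model assigned it."""
    per_model = list(predictions.values())
    if not per_model:
        return []
    n = len(per_model[0])
    if any(len(p) != n for p in per_model):
        raise ValueError("every model must label the same number of responses")
    # For response i, intersect the label sets produced by all models.
    return [set.intersection(*(p[i] for p in per_model)) for i in range(n)]


# Hypothetical model names and labels, for illustration only.
predictions = {
    "model_a": [{"job", "family"}, {"housing"}, {"school"}],
    "model_b": [{"job"}, {"housing", "cost"}, {"family"}],
    "model_c": [{"job", "family"}, {"housing"}, set()],
}
consensus = unanimous_vote(predictions)
print(consensus)
```

Unanimity trades recall for precision: a label survives only when no model dissents, so a response on which the models disagree entirely ends up with an empty label set.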

A three-stage classification pipeline

From raw text to coded data. The pipeline can either consume an existing codebook or induce one from a sample of responses; categories are not required up front.

01
Explore

Automatically discover categories from your data. Sample responses, let the model surface recurring themes, then merge semantically similar labels into a clean taxonomy.

02
Extract

Extract structured classifications across full dataframe columns. Batches are processed with configurable model selection and provider routing.

03
Classify

Assign categories with multi-label support and ensemble voting. Verbose definitions with inclusion/exclusion criteria reduce spurious classification, as documented in Soria 2026.
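
To make "verbose definitions with inclusion/exclusion criteria" concrete, here is a rough sketch of how a codebook might be rendered into the text a model sees. The `Category` class, `render_codebook` function, and example wording are assumptions made for illustration; the prompts the package actually sends are not shown in this page.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Category:
    """One codebook entry (hypothetical structure)."""
    name: str
    definition: str
    include: List[str] = field(default_factory=list)
    exclude: List[str] = field(default_factory=list)


def render_codebook(categories: List[Category]) -> str:
    """Format each category with its definition and boundary criteria."""
    lines = []
    for i, c in enumerate(categories, start=1):
        lines.append(f"{i}. {c.name}: {c.definition}")
        if c.include:
            lines.append("   Include: " + "; ".join(c.include))
        if c.exclude:
            lines.append("   Exclude: " + "; ".join(c.exclude))
    return "\n".join(lines)


# Example codebook for the question "Why did you move?" (invented entries).
codebook = [
    Category(
        name="Employment",
        definition="The move was motivated by the respondent's own job.",
        include=["new job", "job transfer", "shorter commute"],
        exclude=["a partner's job (code as Family)"],
    ),
    Category(
        name="Family",
        definition="The move was motivated by family or household ties.",
        include=["marriage", "moving closer to relatives"],
    ),
]
prompt_block = render_codebook(codebook)
print(prompt_block)
```

Spelling out the exclusion rule ("a partner's job") is the kind of boundary statement that keeps an ambiguous response from being assigned to both categories.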

pipeline.py
EXPLORE Discover categories from 50 samples
EXTRACT Batch classify full dataset
CLASSIFY Multi-label + ensemble vote
OUTPUT CSV ready for statistical analysis
Optional extras
pip install "cat-llm[pdf]" "cat-llm[embeddings]"
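
For the OUTPUT step of the pipeline, one common CSV layout for multi-label codes is a 0/1 indicator column per category, which loads directly into Stata, R, or Python. The sketch below assumes that layout; the package's actual output columns may differ, and `to_indicator_rows` and the sample data are hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, List, Set


def to_indicator_rows(
    ids: Iterable[int], labels: Iterable[Set[str]], categories: List[str]
) -> List[Dict[str, int]]:
    """Turn one label set per response into one 0/1 column per category."""
    rows = []
    for rid, assigned in zip(ids, labels):
        row = {"id": rid}
        for cat in categories:
            row[cat] = 1 if cat in assigned else 0
        rows.append(row)
    return rows


categories = ["job", "family", "housing"]
rows = to_indicator_rows([101, 102], [{"job", "family"}, set()], categories)

# Write to an in-memory buffer here; swap in open("coded.csv", "w", newline="") for a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", *categories], lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```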

Domain-specific packages

Six domain packages share a common cat-stack engine. Each ships with domain-tuned prompts, specialized parameters, and any built-in data sources relevant to that field, while exposing the same classify() / extract() API.

Package
cat-survey
Survey Responses

Classify open-ended survey responses at scale. Handles ambiguity with verbose category definitions and ensemble voting.

pip install cat-survey
Package
cat-pol
Political Text

17 built-in data sources: municipal ordinances, federal laws, executive orders, presidential speeches, and more. Updated weekly.

pip install cat-pol
Package
cat-vader
Social Media

Classify social media posts, comments, and short-form text with domain-tuned prompts for informal language.

pip install cat-vader
Package
cat-ademic
Academic Papers

Classify and summarize academic abstracts, full texts, and research documents across disciplines.

pip install cat-ademic
Package
cat-cog
Cognitive Assessment

Specialized tools for cognitive assessment scoring, including CERAD drawing evaluation for dementia research.

pip install cat-cog
Package
cat-web
Web Content

Classify scraped web pages, articles, and HTML content with domain-tuned prompts for long-form online text.

pip install cat-web

For collaborators without a Python environment

A desktop build of the same cat-llm pipeline, distributed as a self-contained application. Outputs and the generated reproducibility scripts are the same as the library; the audience is collaborators who do not maintain a local Python environment.

[Screenshot: CatLLM desktop app, General Classify view in dark mode]

Apple Silicon builds are available below. Intel and Windows builds are planned.

Apple Silicon: .dmg · 286 MB
Intel Mac: coming soon
Windows: coming soon
First launch only: macOS may say "Apple cannot check it for malicious software." Right-click the app in Applications, choose Open, then click Open again in the dialog. Verify your download against the matching SHA-256 checksum.
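
Checking a download against a published SHA-256 takes only the Python standard library. The sketch below is generic: the helper names are made up for illustration, and the commented file name and checksum at the end are placeholders, not the real artifact name or hash.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large installers do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_hex: str) -> bool:
    """Compare the file's digest with the published one, ignoring case and whitespace."""
    return hmac.compare_digest(sha256_of(path), published_hex.strip().lower())


# Self-check on a throwaway file: the bytes b"abc" have a well-known SHA-256 digest.
ABC_DIGEST = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
with tempfile.TemporaryDirectory() as tmp_dir:
    sample = Path(tmp_dir) / "sample.bin"
    sample.write_bytes(b"abc")
    demo_digest = sha256_of(sample)
    demo_ok = matches_published(sample, ABC_DIGEST + "\n")

# Real use (placeholder file name and checksum):
# matches_published("CatLLM.dmg", "<checksum from the download page>")
```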

When CatLLM is the right tool

LLMs are already accessible to most researchers through ChatGPT, Claude, and Gemini. The relevant question is not whether LLMs can classify text, but whether the workflow around them meets ordinary standards of empirical research — reproducible, transparent, validated, and scalable.

| | Manual coding | ChatGPT / Claude / Gemini chat | CatLLM |
|---|---|---|---|
| Reproducible | Depends on coder consistency; inter-rater drift is common | No: outputs vary run-to-run; prompts live in chat history | Yes: versioned prompts, pinned models, deterministic config |
| Scales to thousands of responses | No: hours of work per few hundred responses | Limited: copy/paste workflow, no batch processing | Yes: pandas / dataframe input, batch APIs, parallel execution |
| Standardized output | Varies by coder | Free-form prose that needs to be re-parsed | Structured DataFrame; CSV-ready for Stata, R, Python |
| Transparent prompts | | Buried in conversation; not version-controlled | Inspectable, modifiable, committable to your repo |
| Validated defaults | Codebook-dependent | None: relies on the user's prompt design | Defaults tuned against expert human coders in published benchmarks |
| Multi-label & ensemble support | Manual aggregation | Inconsistent across runs and models | Native multi-label; unanimous-vote ensembles across providers |
| Best for | Small, exploratory studies; rich qualitative interpretation | One-off exploration; drafting a candidate taxonomy | Production research datasets that need to be defensible at peer review |
When the alternatives are right: general-purpose chat tools are useful for ad-hoc exploration, one-off summarization, and drafting an initial codebook. Manual coding remains the gold standard for small samples where qualitative depth matters more than scale. CatLLM is intended for the case in between: research datasets where the analyst wants LLM fluency together with the reproducibility expected of statistical software.
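
One way to make "versioned prompts, pinned models, deterministic config" reportable is to fingerprint the run configuration and cite the fingerprint alongside results. This is a generic sketch under stated assumptions: `config_fingerprint` and every key in `run_config` are hypothetical and do not reflect the package's own configuration schema.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Hash a canonical JSON rendering so key order does not affect the result."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical run settings; the keys are illustrative only.
run_config = {
    "models": ["gpt-4o", "gemini-2.5-flash"],
    "consensus_threshold": "unanimous",
    "multi_label": True,
    "prompt_version": "v1",
}
fingerprint = config_fingerprint(run_config)
print(fingerprint)
```

Any change to the prompt version, model list, or voting rule yields a different fingerprint, so two result files can be checked for having come from the same settings.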

From raw responses to coded data

A minimal end-to-end pipeline in Python. The same workflow is available in R through the cat.llm package on R-universe.

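```python
import os

import pandas as pd
import catsurvey as cat

# Setup (placeholders): a DataFrame with a 'responses' column and API keys
# read from environment variables. The file and variable names are illustrative.
df = pd.read_csv("responses.csv")
openai_key = os.environ["OPENAI_API_KEY"]
gemini_key = os.environ["GEMINI_API_KEY"]
api_key = openai_key  # assumption: the key for the provider used in Step 1

# Step 1: Discover categories
cats = cat.extract(
    input_data=df['responses'],
    survey_question="Why did you move?",
    max_categories=12,
    api_key=api_key
)

# Step 2: Classify with a multi-model ensemble
results = cat.classify(
    input_data=df['responses'],
    categories=cats['top_categories'],
    survey_question="Why did you move?",
    multi_label=True,
    models=[
        ("gpt-4o", "openai", openai_key),
        ("gemini-2.5-flash", "google", gemini_key),
    ],
    consensus_threshold="unanimous",
)

# Step 3: Export
results.to_csv("coded_responses.csv")
```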

Defaults are grounded in a systematic evaluation of 21 LLMs across six providers and four open-ended survey coding tasks, benchmarked against expert human coders.
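
Benchmarking against human coders comes down to comparing two sets of codes for the same responses. The sketch below computes raw percent agreement and Cohen's kappa for a single category; it is a generic illustration with invented data, and the page does not state which agreement statistic the published evaluation reports.

```python
from typing import List


def percent_agreement(human: List[int], model: List[int]) -> float:
    """Share of responses on which the two coders give the same code."""
    if not human or len(human) != len(model):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(h == m for h, m in zip(human, model)) / len(human)


def cohens_kappa(human: List[int], model: List[int]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    p_o = percent_agreement(human, model)
    labels = set(human) | set(model)
    p_e = sum((human.count(l) / n) * (model.count(l) / n) for l in labels)
    if p_e == 1.0:  # both coders used a single identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Invented codes for one category (1 = assigned, 0 = not assigned).
human = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
model = [1, 1, 0, 0, 1, 0, 0, 0, 0, 1]

agreement = percent_agreement(human, model)
kappa = cohens_kappa(human, model)
print(f"agreement={agreement:.2f}, kappa={kappa:.2f}")
```

Here the coders disagree on one of ten responses, so raw agreement is 0.90, while kappa is lower because half of that agreement would be expected by chance given the label frequencies.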

Empirical validation: Soria, C. (2026). Scaling Open-Ended Survey Coding: An LLM Pipeline Where Definitions Do the Heavy Lifting. SocArXiv. osf.io/preprints/socarxiv/gjvcf
Software paper: Soria, C. (2026). Scaling Open-Ended Survey Coding: An LLM Pipeline Where Definitions Do the Heavy Lifting. Journal of Open Source Software. doi.org/10.21105/joss.09678