CatLLM

A reproducible LLM pipeline for classifying open-ended text.

An open-source Python and R toolkit for classifying open-ended text with large language models. For survey classification, defaults are calibrated against the consensus of double-blind coding by sociologists and demographers across multiple datasets, yielding 88% to 99% agreement with human coders across 30+ LLMs depending on the task (Soria 2026). The same engine underlies separate packages for other text domains, and the toolkit has been used in several academic research papers. It works with GPT, Gemini, and Claude, with hosted open-weight models through the Hugging Face Inference API, and with Ollama for fully local inference.

Soria, C. · UC Berkeley · SocArXiv preprint, 2026

pip install cat-llm

Design

Defaults grounded in systematic evaluation

Each design choice — model selection, prompt structure, ensembling, multi-label handling — is documented and reproducible, so methodological decisions can be inspected, modified, and reported in a paper.

Multi-model ensembles: Aggregate predictions across models with unanimous voting. Open-weight ensembles can reach agreement comparable to individual frontier models at lower per-call cost (Soria 2026).
Provider-agnostic: Compatible with GPT, Gemini, and Claude. Open-weight Hugging Face models can be accessed through the Hugging Face Inference API or served locally via Ollama. REST-based with no SDK dependencies; switching providers does not require changes to analysis code.
Local-only deployment: Runs entirely on local hardware via open-weight models. Sensitive data — survey responses, clinical text, PII — never leaves the machine, supporting IRB and HIPAA constraints.
Unified multimodal interface: Text, images, and PDFs share one consistent API. Automatic input detection handles mixed datasets without per-modality configuration.
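
One way to read "unanimous voting" in a multi-label setting is that a category is assigned to a response only when every model in the ensemble assigned it. The sketch below illustrates that rule on toy predictions. The function name `unanimous_vote` and the data layout are assumptions made for illustration, not CatLLM's internal implementation.

```python
from typing import Dict, List, Set


def unanimous_vote(predictions: Dict[str, List[Set[str]]]) -> List[Set[str]]:
    """Keep a label for a response only if every model assigned it.

    `predictions` maps a model name to one set of labels per response.
    """
    per_model = list(predictions.values())
    if not per_model:
        return []
    n_responses = len(per_model[0])
    if any(len(p) != n_responses for p in per_model):
        raise ValueError("all models must label the same number of responses")
    # Intersect the label sets across models, response by response.
    return [set.intersection(*(p[i] for p in per_model)) for i in range(n_responses)]


# Toy predictions from two hypothetical models for three responses.
toy = {
    "model_a": [{"job", "family"}, {"housing"}, {"job"}],
    "model_b": [{"job"}, {"housing", "school"}, set()],
}
consensus = unanimous_vote(toy)
print(consensus)
```

Under this rule, adding a model to the ensemble can only remove labels, never add them, so the ensemble is at least as conservative as its most conservative member.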

Method

A three-stage classification pipeline

From raw text to coded data. The pipeline can either consume an existing codebook or induce one from a sample of responses; categories are not required up front.

Explore

Automatically discover categories from your data. Sample responses, let the model surface recurring themes, then merge semantically similar labels into a clean taxonomy.
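
The merge step can be sketched as follows. This example uses plain string similarity from the standard library (`difflib`) as a stand-in for semantic similarity, which in practice would more likely come from a language model or embeddings. The function name `merge_similar_labels` and the 0.8 threshold are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import Dict, List


def merge_similar_labels(labels: List[str], threshold: float = 0.8) -> Dict[str, int]:
    """Greedily fold near-duplicate labels into the most frequent variant."""
    counts = Counter(label.strip().lower() for label in labels)
    merged: Dict[str, int] = {}
    # Visit frequent labels first so they become the canonical names.
    for label, n in counts.most_common():
        for canonical in merged:
            if SequenceMatcher(None, label, canonical).ratio() >= threshold:
                merged[canonical] += n
                break
        else:
            # No existing canonical label was close enough: start a new one.
            merged[label] = n
    return merged


raw = ["Job opportunity", "job opportunities", "Family", "family ", "Housing cost"]
taxonomy = merge_similar_labels(raw)
print(taxonomy)
```

String similarity only catches spelling variants; labels such as "work" and "employment" would need a semantic comparison to be merged.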

Extract

Extract structured classifications across full dataframe columns. Batches are processed with configurable model selection and provider routing.
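
Batch processing over a column can be illustrated with a small sketch. Here `fake_classifier` is a stub standing in for a model call, and the helper names and batch size are assumptions rather than CatLLM parameters.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def run_in_batches(
    items: Sequence[str],
    classify_batch: Callable[[List[str]], List[str]],
    batch_size: int = 2,
) -> List[str]:
    """Apply `classify_batch` to each batch and return results in input order."""
    results: List[str] = []
    for batch in iter_batches(items, batch_size):
        results.extend(classify_batch(batch))
    return results


def fake_classifier(batch: List[str]) -> List[str]:
    # Stub standing in for a model call: flags responses that mention "job".
    return ["job" if "job" in text.lower() else "other" for text in batch]


responses = ["New job", "Closer to family", "Job transfer", "Cheaper rent", "School"]
labels = run_in_batches(responses, fake_classifier, batch_size=2)
print(labels)
```

A pandas column can be passed to a helper like this after converting it with `df['responses'].tolist()`.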

Classify

Assign categories with multi-label support and ensemble voting. Verbose definitions with inclusion/exclusion criteria reduce spurious classification, as documented in Soria 2026.
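
A verbose definition with inclusion and exclusion criteria might be represented and rendered into prompt text as in the sketch below. The `CategoryDefinition` class, the `render_codebook` helper, and the output layout are hypothetical; the prompt format CatLLM actually uses is not shown here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CategoryDefinition:
    name: str
    description: str
    include: List[str] = field(default_factory=list)
    exclude: List[str] = field(default_factory=list)


def render_codebook(categories: List[CategoryDefinition]) -> str:
    """Render category definitions as a numbered plain-text block for a prompt."""
    lines = []
    for i, category in enumerate(categories, start=1):
        lines.append(f"{i}. {category.name}: {category.description}")
        if category.include:
            lines.append("   Include: " + "; ".join(category.include))
        if category.exclude:
            lines.append("   Exclude: " + "; ".join(category.exclude))
    return "\n".join(lines)


codebook = [
    CategoryDefinition(
        name="Employment",
        description="The respondent moved for their own job or job search.",
        include=["new job", "job transfer"],
        exclude=["a partner's or relative's job"],
    ),
    CategoryDefinition(
        name="Family",
        description="The respondent moved to be near or with family members.",
    ),
]
prompt_block = render_codebook(codebook)
print(prompt_block)
```

Keeping definitions in a structured form like this also makes them easy to commit to a repository and report in a methods section.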

pipeline.py

EXPLORE: Discover categories from 50 samples
↓
EXTRACT: Batch classify full dataset
↓
CLASSIFY: Multi-label + ensemble vote
↓
OUTPUT: CSV ready for statistical analysis

Optional extras

pip install "cat-llm[pdf,embeddings]"

Packages

Domain-specific packages

Six domain packages share a common cat-stack engine. Each ships with domain-tuned prompts, specialized parameters, and any built-in data sources relevant to that field, while exposing the same classify() / extract() API.
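
The idea of several thin domain packages over one engine can be pictured with the sketch below. Every name in it (`Engine`, `DomainPackage`, `first_category_model`) is hypothetical and chosen for illustration; it shows the general pattern of domain-specific prompts delegating to shared classification logic, not the actual cat-stack code.

```python
from dataclasses import dataclass
from typing import Callable, List

# A "model" here is any callable that maps a prompt string to a label string.
Model = Callable[[str], str]


@dataclass
class Engine:
    """Hypothetical shared engine: owns the classification loop."""
    model: Model

    def classify(self, texts: List[str], categories: List[str], preamble: str = "") -> List[str]:
        labels = []
        for text in texts:
            prompt = f"{preamble}\nCategories: {', '.join(categories)}\nText: {text}"
            labels.append(self.model(prompt))
        return labels


@dataclass
class DomainPackage:
    """Hypothetical thin wrapper: adds a domain preamble, delegates to the engine."""
    engine: Engine
    preamble: str

    def classify(self, texts: List[str], categories: List[str]) -> List[str]:
        return self.engine.classify(texts, categories, preamble=self.preamble)


def first_category_model(prompt: str) -> str:
    # Stub model: returns the first category listed in the prompt.
    return prompt.split("Categories: ")[1].split(",")[0].split("\n")[0].strip()


engine = Engine(model=first_category_model)
survey = DomainPackage(engine, preamble="You are coding open-ended survey responses.")
social = DomainPackage(engine, preamble="You are coding informal social media posts.")
print(survey.classify(["moved for work"], ["job", "family"]))
```

In a layout like this, a fix to the shared loop reaches every domain wrapper at once, while each wrapper stays free to change only its prompts and defaults.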

cat-survey

Survey Responses

Classify open-ended survey responses at scale. Handles ambiguity with verbose category definitions and ensemble voting.

pip install cat-survey

cat-pol

Political Text

17 built-in data sources: municipal ordinances, federal laws, executive orders, presidential speeches, and more. Updated weekly.

pip install cat-pol

cat-vader

Social Media

Classify social media posts, comments, and short-form text with domain-tuned prompts for informal language.

pip install cat-vader

cat-ademic

Academic Papers

Classify and summarize academic abstracts, full texts, and research documents across disciplines.

pip install cat-ademic

cat-cog

Cognitive Assessment

Specialized tools for cognitive assessment scoring, including CERAD drawing evaluation for dementia research.

pip install cat-cog

cat-web

Web Content

Classify scraped web pages, articles, and HTML content with domain-tuned prompts for long-form online text.

pip install cat-web

Desktop application

For collaborators without a Python environment

A desktop build of the same cat-llm pipeline, distributed as a self-contained application. Outputs and the generated reproducibility scripts are the same as the library; the audience is collaborators who do not maintain a local Python environment.

Screenshot: CatLLM desktop app, General Classify view in dark mode.

Apple Silicon builds are available below. Intel and Windows builds are planned.

Apple Silicon: .dmg · 286 MB
Intel Mac: Coming soon
Windows: Coming soon

First launch only: macOS may say "Apple cannot check it for malicious software." Right-click the app in Applications → Open → Open. Verify your download with the matching SHA-256.
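
Checking a download against a published SHA-256 can be done with the Python standard library, as in the sketch below. The helper names are illustrative, and the demo runs on a throwaway file rather than the real installer.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, published_hex: str) -> bool:
    """Compare the file's digest with a published value, ignoring case and whitespace."""
    return hmac.compare_digest(sha256_of_file(path), published_hex.strip().lower())


# Demo on a throwaway file. For a real download, pass the path to the .dmg
# and the SHA-256 string published alongside it.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"abc")
    demo_digest = sha256_of_file(sample)
    demo_ok = matches_published_hash(sample, demo_digest.upper())
print(demo_digest, demo_ok)
```

Reading in chunks keeps memory use flat, which matters for an installer of a few hundred megabytes.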

Comparison

When CatLLM is the right tool

LLMs are already accessible to most researchers through ChatGPT, Claude, and Gemini. The relevant question is not whether LLMs can classify text, but whether the workflow around them meets ordinary standards of empirical research — reproducible, transparent, validated, and scalable.

	Manual coding	ChatGPT / Claude / Gemini chat	CatLLM
Reproducible	Depends on coder consistency; inter-rater drift is common	No — outputs vary run-to-run; prompts live in chat history	Yes — versioned prompts, pinned models, deterministic config
Scales to thousands of responses	No — hours of work per few hundred responses	Limited — copy/paste workflow, no batch processing	Yes — pandas / dataframe input, batch APIs, parallel execution
Standardized output	Varies by coder	Free-form prose that needs to be re-parsed	Structured DataFrame; CSV-ready for Stata, R, Python
Transparent prompts	—	Buried in conversation; not version-controlled	Inspectable, modifiable, committable to your repo
Validated defaults	Codebook-dependent	None — relies on the user's prompt design	Defaults tuned against expert human coders in published benchmarks
Multi-label & ensemble support	Manual aggregation	Inconsistent across runs and models	Native multi-label; unanimous-vote ensembles across providers
Best for	Small, exploratory studies; rich qualitative interpretation	One-off exploration; drafting a candidate taxonomy	Production research datasets that need to be defensible at peer review

When the alternatives are right: general-purpose chat tools are useful for ad-hoc exploration, one-off summarization, and drafting an initial codebook. Manual coding remains the gold standard for small samples where qualitative depth matters more than scale. CatLLM is intended for the case in between: research datasets where the analyst wants LLM fluency together with the reproducibility expected of statistical software.
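
"Agreement with human coders" can be measured in more than one way. The sketch below computes simple percent agreement and Cohen's kappa for a single binary category on toy data. Which statistic the cited benchmarks report is not stated on this page, so this is an illustration of the general idea rather than a reproduction of the published method.

```python
from typing import Sequence


def percent_agreement(human: Sequence[int], model: Sequence[int]) -> float:
    """Share of responses where the human and model codes match."""
    if len(human) != len(model) or len(human) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(h == m for h, m in zip(human, model)) / len(human)


def cohens_kappa(human: Sequence[int], model: Sequence[int]) -> float:
    """Chance-corrected agreement for two raters on a binary (0/1) code."""
    n = len(human)
    p_observed = percent_agreement(human, model)
    p_human_1 = sum(human) / n
    p_model_1 = sum(model) / n
    # Agreement expected if both raters coded independently at their base rates.
    p_expected = p_human_1 * p_model_1 + (1 - p_human_1) * (1 - p_model_1)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical code throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Toy codes for one category across ten responses (1 = category applies).
human_codes = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
model_codes = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]
agreement = percent_agreement(human_codes, model_codes)
kappa = cohens_kappa(human_codes, model_codes)
print(round(agreement, 3), round(kappa, 3))
```

Percent agreement can look high when a category is rare, because both raters mostly agree on its absence; a chance-corrected statistic makes that visible.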

Example

From raw responses to coded data

A minimal end-to-end pipeline in Python. The same workflow is available in R through the cat.llm package on R-universe.

python
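import os

import pandas as pd
import catsurvey as cat

# Assumed setup, not part of the package: a CSV with a 'responses' column
# and API keys read from environment variables.
df = pd.read_csv("responses.csv")
openai_key = os.environ["OPENAI_API_KEY"]
gemini_key = os.environ["GEMINI_API_KEY"]
api_key = openai_key  # assumption: Step 1 runs on an OpenAI model

# Step 1: Discover categories
cats = cat.extract(
    input_data=df['responses'],
    survey_question="Why did you move?",
    max_categories=12,
    api_key=api_key
)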

# Step 2: Classify with a multi-model ensemble
results = cat.classify(
    input_data=df['responses'],
    categories=cats['top_categories'],
    survey_question="Why did you move?",
    multi_label=True,
    models=[
        ("gpt-4o", "openai", openai_key),
        ("gemini-2.5-flash", "google", gemini_key),
    ],
    consensus_threshold="unanimous",
)

# Step 3: Export
results.to_csv("coded_responses.csv")
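
Once exported, the coded data can be summarized with ordinary tools. The sketch below assumes a hypothetical layout with one 0/1 column per category; the column names and layout of CatLLM's actual output may differ, and the inline CSV text stands in for coded_responses.csv.

```python
import csv
import io
from collections import Counter
from typing import Dict

# Hypothetical stand-in for coded_responses.csv: one 0/1 column per category.
sample_csv = """response,job,family,housing
New job in Denver,1,0,0
Closer to my parents and cheaper rent,0,1,1
Job transfer,1,0,0
Rent went up,0,0,1
"""


def category_shares(csv_text: str, text_column: str = "response") -> Dict[str, float]:
    """Return the share of responses assigned to each category column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {}
    totals: Counter = Counter()
    for row in rows:
        for column, value in row.items():
            if column != text_column:
                totals[column] += int(value)
    return {column: count / len(rows) for column, count in totals.items()}


shares = category_shares(sample_csv)
print(shares)
```

With multi-label coding the shares can sum to more than 1, since a single response may fall into several categories.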

Citation

Defaults are grounded in a systematic evaluation of 21 LLMs across six providers and four open-ended survey coding tasks, benchmarked against expert human coders.

Empirical validation

Soria, C. (2026). Scaling Open-Ended Survey Coding: An LLM Pipeline Where Definitions Do the Heavy Lifting. SocArXiv. osf.io/preprints/socarxiv/gjvcf

Software paper

Soria, C. (2026). CatLLM: A Python package for Generating, Assigning, and Scoring Open-Ended Survey Data and Images. Journal of Open Source Software. doi.org/10.21105/joss.09678

Studies using CatLLM

Soria, C. (2026). High Agreement, Different Stories: How LLM Classifiers Reshape Demographic Patterns in Survey Data. SocArXiv. osf.io/preprints/socarxiv/85kyd

Soria, C. (2026). Model Diversity Over Model Size: Unanimous LLM Ensembles Correct Over-Classification in Survey Coding. SocArXiv. osf.io/preprints/socarxiv/er6mz
