SocioRAG: Retrieval Built for Reading Social Dynamics, Not Just Documents

Say you have two thousand pages of reporting on a contested election, a conflict, or a corruption case. You ask an ordinary retrieval system: "what does this corpus say about voter intimidation?" It finds the passages that mention voter intimidation and summarizes them. Useful, and also the easy half of the question.

The hard half is the one social research actually cares about: who is doing what to whom, which actors recur across documents, how the relationships connect, and where the same entity appears under three different spellings. A standard RAG pipeline treats your corpus as a bag of paragraphs to be matched against a query. Social analysis treats it as a web of actors and relationships that happens to be stored as text. Those are different problems, and a tool built for the first does the second badly.

SocioRAG is built for the second. It is a system for analyzing social dynamics in text: it extracts the entities and relationships, retrieves on both meaning and keyword, and grounds every answer in cited sources, across English and Arabic. It is the infrastructure underneath the kind of corpus work I do, and this is what makes it different from a generic RAG stack.

Retrieval that does not trust a single signal

The core decision in any retrieval system is how you decide what is relevant, and the lazy answer is to pick one method and live with its blind spots. Pure vector search captures meaning but misses exact terms, so it whiffs on a proper name or a statute number that shares no semantic neighborhood with the query. Pure keyword search catches the proper name but is blind to paraphrase, so it misses the passage that describes the thing without naming it.

SocioRAG runs both and reconciles them. Vector similarity for meaning, BM25 keyword matching for exact terms, then a cross-encoder reranking pass that re-scores the merged candidates against the query, plus a source-diversity step so the top results are not five paragraphs from the same document. The point is not to be clever. The point is that social corpora are full of named entities, dates, and statute references where keyword precision matters, and full of euphemism and paraphrase where semantic recall matters. A tool that honors only one of those signals will quietly fail on the other half of every corpus.

There is a second retrieval layer that a document-only tool does not have: entities get their own embeddings, separate from the chunk embeddings. So you can retrieve at the level of "this person" or "this organization" rather than only "this passage," which is what makes entity-level and relationship-level questions answerable at all.

Multilingual because the sources are

A detail that is easy to underweight until the corpus forces it on you: SocioRAG handles English and Arabic, with translation built into the query path. This is not a feature checkbox. If you work on the regions I work on, a meaningful share of the primary sources are in Arabic, and a system that can only read the English coverage is reading the world's summary of the story rather than the story. Building the multilingual path in from the start is the difference between analyzing the sources and analyzing the wire service's translation of them.

The model is a per-task decision, not a default

One architectural choice runs through everything I build, and SocioRAG is no exception. The system talks to language models through OpenRouter rather than bonding to a single vendor's API, and model selection is configurable rather than hardcoded. You assign the model the task needs and switch when the trade-off changes.

The reason is plain once you have run a pipeline at scale. Entity extraction, reranking, and answer generation are different jobs with different cost and quality profiles, and the best model for one is rarely the best for all three. Hardwiring one provider's flagship into every stage is how you end up paying premium rates for work a cheaper model does identically well, or accepting a weak result because switching meant a rewrite. Keeping model assignment a configuration decision means the architecture answers "what does this task need" per task, instead of forcing one answer on the whole pipeline. The result tends to be cheaper, but it is cheaper as a consequence of being right, not as the goal.

Grounded, or it does not ship

Every answer SocioRAG generates carries citations back to the source passages it drew on. This is the same conviction that runs through the rest of my work: a claim a reader cannot trace is a claim they cannot check, and in social research an uncheckable claim is worse than no claim. The retrieval pipeline exists precisely so that the generation step has real evidence to cite rather than plausible-sounding inference to dress up. Citation management is not a polish layer bolted on at the end. It is the reason the retrieval is built as carefully as it is.

What it is, concretely

For the engineers: a FastAPI backend on Python 3.12, a lightweight Preact frontend, SQLite-vec for vector storage and SQLite for entity relationships, spaCy plus an LLM pipeline for entity extraction, and Playwright for styled PDF report export. It runs locally with a single startup script, with query history, performance metrics, and a configurable retrieval setup exposed in the UI. The whole thing is designed to be stood up on your own machine and pointed at your own corpus, because the documents social researchers work with are frequently not ones you want to hand to a hosted service.

Where this fits

SocioRAG and the studies I publish are two halves of one practice. The studies show the analysis; SocioRAG is part of what makes that analysis repeatable on a new corpus without rebuilding the machinery each time. The retrieval design, the multilingual path, the per-task model routing, and the citations are not independent features. They are four expressions of the same idea: that reading social dynamics out of text is a specific problem, and it deserves tools built for it rather than borrowed from document search.

The open question I am still working is entity resolution across messy real-world sourcing, where the same actor appears under variant spellings, transliterations, and aliases. That is the hard edge of entity-level retrieval, and it is where the next round of work goes.

Repo and full documentation: https://github.com/Muhanad-husn/sociorag