Pixels and Patterns

Vision-Language Models are Weirdly Good at Old Korean

Aron van de Pol — Fri, 26 Jun 2026 16:24:41 GMT

OCR has changed substantially in the last few years. For clean, modern, printed English, the task no longer usually requires a dedicated OCR engine, a domain-specific training set, or extensive preprocessing. Current vision-language models can often transcribe such material directly from a prompt, sometimes zero-shot and sometimes with a single example. In that limited sense, printed Latin-script OCR for high-resource, modern material has become a largely infrastructural problem rather than a central research challenge.

Historical documents are still harder. Old paper, damaged scans, unusual layouts, handwriting, and older typefaces all push modern models well outside their comfort zone. But even here the progress has been striking. Platforms like Transkribus routinely read European manuscripts that would have seemed hopeless a decade ago, and projects such as Loghi report similarly strong results on historical print. OCR has become much more accessible.

Korean, and by extension CJK1 and other non-Western scripts, is a different case, and early-twentieth-century Korean is different again. Pre-1933, Korean was written in an orthography that has since fallen out of use. It spelled words closer to how they sounded than to a fixed stem, and it kept letters the modern keyboard has dropped, among them the dot vowel arae-a (ㆍ) and a set of old consonant clusters. The spelling we read today is the Unified Orthography of 1933 (한글 맞춤법 통일안), the morphophonemic standard carried by Chu Sigyŏng’s lineage in the Korean Language Society, and it became the convention only after a long argument that ran through the colonial period and resurfaced in the spelling crisis of the early 1950s.2

Modern Korean OCR remains less stable than the English and European-script cases above. On the Korean portion of the PM⁴Bench benchmark the leading system changes with each new model release, and in the current version Gemini 3 Pro ranks ahead of the newest open model, Qwen3-VL, which does not improve on its own predecessor for this task. This should be read as a relative ranking within a weak field rather than as evidence of reliable transcription. The absolute scores remain low.

In one of my projects, I need OCR labels for about ninety-four million individual character images. Because I study the appearance of historical type rather than the content of the text, it is important that older Korean characters are transcribed as they were printed, rather than silently modernized. This post explores how well current models actually do.

A few tests pages

The page is from Sonyŏn (소년), a children’s magazine published in 1908, and contains a chain-rhyme word game. It is printed in vertical columns read from right to left, combining (Old-)Korean Hangul and Hanja on the same page. Material like this has traditionally been difficult for OCR systems.

The part I cared about was not whether the models could produce readable Korean. (Though even this would be helpful for many historians.) It was whether they would OCR Korean letters as printed. Before the 1933 spelling reform, Korean used forms that later disappeared or changed in ordinary writing. These include the dot-vowel arae-a (ㆍ) and several old consonant clusters. A model can read a page fluently and still normalize those forms into modern Korean. For most readers the output still looks right. For my purposes, it is wrong.

The first page was a narrow test. It happened to contain two old cluster forms, ᄭᅡ and ᄯᅱ. A modern transcription would normally write these as 까 and 뛰. GPT-5.5 kept all six occurrences of those old clusters on every run. The previous GPT model had modernized all six. That was already a meaningful change, but it did not prove much beyond this particular page.

So I tried five more pages from Sonyŏn, using prose and dialogue rather than word games. These pages contained a wider range of old Korean forms: arae-a in forms such as ᄒᆞ and ᄂᆞᆫ, several ㅅ-cluster groups such as ㅺ, ㅼ, and ㅽ, and even the rarer triple cluster ㅴ. This was the real test. The question was no longer whether a model could preserve two repeated forms on one page, but whether it could keep a broader old-orthography repertoire across different pages.

The current models can preserve old Korean forms, but not with equal reliability. Preservation is still something to check, not something to assume. In practice I treated an old form as trustworthy only when different models produced it, because a single model can also hallucinate plausible old-looking Korean.

The dedicated OCR engines were not trained on any form of old-Korean and not surprising that they performed the poorest. Mistral OCR, DeepSeek-OCR, Baidu’s Unlimited-OCR, GLM-OCR, all read the page and then modernized the old letters or fell apart, one of them looping a few character lines a few hundred times, another reading the Hangul as Chinese and drifting into a gibberish. Qwen3-VL was different. It read the Korean fluently, but silently regularized every older form into modern Korean. If your goal is simply to recover the text, that is a respectable result. If you care about preserving historical orthography, it is not.

Going back further, to 1896

After Sonyŏn, I wanted to know how the models handled even earlier Korean print. As a final test, I tried the front page of the Tongnip Sinmun (독립신문, The Independent) from 1896, the first privately published Korean newspaper, set entirely in Hangul in the full old orthography with arae-a in almost every word. The only scan I could find is small, a single 960-pixel page, which makes it a hard test on top of an old one. For once there was something solid to check it against. The Korean Wikisource was manually transcribed, so this is the one page in the whole exercise I could measure against ground truth rather than judge by eye. It should also be noted that this newspaper image is denser and has a more difficult layout then the previous magazine had.

Most models struggled. Claude Opus 4.8 transcribed only about a sixth of the page, even reversing the masthead. Qwen3-VL reached roughly two thirds, but regularized the spelling throughout. Both ended up at around ninety percent character error against the Wikisource transcription.

Gemini 3.1 Pro was the clear exception. It read essentially the entire page, from the masthead through the advertisements and price table to the end of the editorial, about ninety-six percent of the page by length. Compared against the Wikisource transcription, it reached roughly seventeen percent character error while preserving 176 of the page’s 182 arae-a. On a 960-pixel scan of an 1896 newspaper, that amounts to a complete transcription of a full old-orthography page.

GPT-5.5 failed for a different reason. It did not misread the page so much as fail to read it at all. Tongnip Sinmun is one of the most famous newspapers in Korean history, and its first issue has been reproduced countless times. When shown the thirteenth issue, GPT-5.5 instead reproduced the first: its date, issue number, and opening lines, at both image resolutions. Against the actual page, the result is about eighty percent character error. It is a useful reminder that, for well-known historical documents, a model can sometimes retrieve what it already knows instead of reading what is actually in front of it.

Subscribe now

Why this matters to someone working with the sources

The interesting part is not the benchmark. It is what happens to the historical record.

When the National Institute of Korean History transcribes historical print, it modernizes the spelling. I checked the roughly six hundred thousand colonial-era records they have published, and almost none preserve the older letters. For anyone searching the archive, the original orthography has effectively disappeared.

That makes preservation more than a technical curiosity. A model that keeps the old letters is not simply producing another transcription. It is recovering information that was lost during digitization. For anyone working with pre-1933 Korean print, this may be of importance.

The preservation results only matter if the models can also read the page accurately. On that front, the picture is encouraging. Against four pages of the 1921 magazine Kaebyŏk, where a modern transcription exists, the best current models matched roughly ninety-seven percent of the printed characters directly from the low-resolution scans. That measures reading rather than preservation, since the 1921 transcription already modernizes the spelling. Taken together, though, the results point in the same direction. Current models can read these pages, and on sufficiently old material, the best of them also preserve what makes them historically distinctive.

A note on scale

For a single page, I would use a frontier vision-language model every time.

My own project is different. It involves millions of character crops. Running a frontier model over that volume would incur substantial API costs.

Instead, I built a small voting system with more traditional OCR engines. EasyOCR and a lightweight classifier trained on historical Korean type recognize the Old-Korean, PaddleOCR reads the Hanja, and a simple voting rule chooses between them. The entire corpus finished in about a week on hardware we already owned.

The ensemble has one advantage over a frontier model besides cost. It is explicit about uncertainty. When its engines disagree, it leaves the character blank instead of guessing. A frontier model often does the opposite. It produces a perfectly plausible modern character, and unless you compare it with the original page you may never realize the historical form has disappeared. A blank is an error you can revisit. Silent modernization is much harder to detect.

What this adds up to

For anyone working with pre-1933 Korean print, the practical picture has changed. Current vision-language models can do more than transcribe these pages. The strongest of them can preserve historical orthography that is absent from most digital editions. It can also help one quickly digitize small (or large) amount of works. However for very large datasets cost can quickly skyrocket. It might be worth distilling some of this knowledge into a smaller, more specific model to save on API costs.

That does not mean OCR has been fully solved. The results are model-dependent, page-dependent, and, as far as I can tell, language-dependent. But for this particular corner of the problem, the gap between what was possible a few years ago and what is possible today is surprisingly large.

Chinese-Korean-Japanese

Kim, Michael. “The Han’gŭl Crisis and Language Standardization: Clashing Orthographic Identities and the Politics of Cultural Construction.” Journal of Korean Studies 22, no. 1 (March 2017): 5–31. https://doi.org/10.1215/21581665-4153412.

A council of models

Steven Denney — Fri, 26 Jun 2026 13:39:34 GMT

This is an extension of part of the presentation on using LLMs in research pipelines I gave for Machine Collaborators, series of talks organized by Charles Crabtree (Monash University) where researchers walk through how AI is being used in their work.

OpenRouter recently announced a product called Fusion, which routes a single question through a panel of cheaper models at once and reports that such a panel can come within a point of a frontier model's score at about half the cost. The marketing is new, but the idea a bit less so. Pitting several models against one another has become a standard way to use language models for finding optimal ideas and outcomes. For empirical research, what use might it have?

Subscribe now

I have used a similar setup in two of my own studies. In one, six models read roughly four thousand open-ended survey answers and sorted each into a code, helping me determine what respondents were actually thinking when they navigated a survey experiment. In the other, four models read sixty-seven Korean history textbooks and, using a term extraction technique I created, determined which words best reflected national identity. Sorting answers into fixed codes and discovering which words matter are different problems, but the instrument is the same. It is a panel of open-weight language models, treated as a team of independent coders, voting.

In this piece I reflect on how it works, why I run it mostly on open-weight models, how it relates to the “councils” that Fusion and others have started to build, and what it does and does not support you, as a researcher, doing. The short version is that it works well, that for coding tasks it holds up against the human labor it replaces, and that its central value is scale rather than certainty.

The method: open models as a panel of coders

Take a judgment a trained research assistant would make, such as reading a piece of text and deciding what it is about, and give it to several models at once, each with the same codebook and the same instructions. Each model receives the same text, the same instructions, and the same codebook, then returns a single labeled decision. The panel's majority vote becomes the classification, while the number of models supporting that decision is recorded as a measure of agreement.

Two features of this design matter for the steps that follow. First, I treat each model as a coder. I ask it to assign a label, much as I would ask a trained research assistant, rather than relying on its probability distribution or confidence estimates. (More on this below.) I ask it for a labeled decision, the way I would ask a research assistant, and I read the spread of decisions across the panel the way content analysts read inter-coder reliability, with chance-corrected statistics such as Cohen’s and Fleiss’ kappa. Second, the models vote independently. They run at temperature zero (i.e., no “creativity” permitted), they do not see one another’s answers, and there is no deliberation. Any sense in which the models are played off against one another happens afterward when I line the votes up and read where they part ways.

I run the panel mostly on open-weight models. A proprietary model behind an API can change between versions, and its outputs cannot be reproduced exactly by another researcher. Open-weight models, run locally at a fixed temperature and seed, return the same answers when re-run in the same environment, which is what reproducible measurement requires. In the classification study I still included one proprietary model, GPT-4o-mini, as a convenient reference coder, because it is inexpensive, fast, and unusually capable on the Korean and Traditional Chinese text those projects involved. The term-extraction panel below is fully open-weight. But GPT-4o-mini is one voice on a panel that is otherwise open and re-runnable, not the authority on it.1

An example of the method. One open-text response from a survey is read by six models that vote independently. The majority becomes the code, and the size of the majority is recorded as a reliability signal.

Three ways to convene a council

The instinct to put models in concert has been built in several distinct ways, and they are worth separating, because they are not after the same thing.

Fusion is the production version. A panel of models answers in parallel, a judge model reads every answer and writes a structured account of where they agree, where they contradict, and what they all missed, and the calling model writes the final answer grounded in that account. What Fusion optimizes is the quality of one answer against its cost, which is why its own documentation says it is not a drop-in replacement and is worth the expense only when a question warrants several perspectives. The models often arrive at different conclusions, but those differences are resolved internally rather than exposed to the user.

Sakana AI’s Fugu uses the same basic approach. Rather than sending every query to a fixed panel of models in parallel, it uses a trained orchestration model to decide whether and when to delegate, select and coordinate models from an agent pool, verify their work, and synthesize the results into a single answer. The system is exposed through one model API, so the complexity of multi-model coordination never reaches your code.

Two other versions come from computer and political science. Andrej Karpathy's llm-council has several models answer, then rank one another with their identities hidden, before a chairman model writes a final answer. Andrew Hall's llm-council-governance extends that logic into an experiment on the decision rule itself, comparing seven governance procedures, from simple majority voting to deliberate-then-vote, on problems with verifiable answers. Deliberation followed by a vote performed best, at 80.9 percent against 71.7 for the strongest single model. There is also an academic Language Model Council in which twenty models write tests, answer them, and grade one another to produce rankings that track human judgment more closely than any single model does.

Three ways to wire a panel of models. All draw on the same lineage. They differ in what they do with the disagreement.

The resemblance is close enough that the model lineups overlap. Hall's council runs Qwen, Llama, Gemma, and Mistral. The classification panel below runs Qwen, Llama, Gemma, and three others. What separates my use from the rest is the last step. Fusion and the councils collapse the panel into a single output, an answer or a decision, and discard the spread behind it. Hall's experiment even finds that deliberation before the vote produces the best single answer, which is the right goal when you want one answer and the wrong one when the spread is the thing you are trying to measure. So I keep the spread, because in a measurement task the agreement, or the lack of it, is itself a quantity of interest. It helps to think of each model as a coder rather than a black box. Each one assigns a label, the panel may agree or divide, and each carries some uncertainty of its own. Those are the qualities I want to read, not average away.

Application one: classification

Take classification first. My thinking on this grew out of my working paper, "What Were They Thinking?", which asks how to validate survey constructs in conjoint designs using the open-text data usually collected only for manipulation checks. A conjoint experiment tells you what respondents chose, but rarely whether they reasoned about the choice the way the design assumes. To find out, I embedded an open-ended question after the conjoint task ("Why did you choose this person?") and coded the answers into a small set of categories. The corpus is about four thousand short answers in Korean and Traditional Chinese, from naturalization experiments in South Korea and Taiwan, where respondents weighed who deserves to be prioritized for citizenship. The codebook the models worked from had five codes: civic commitment, functional integration, economic focus, identity concern, and a none-of-the-above residual.

Six models coded every answer: GPT-4o-mini together with five open-weight models run locally through Ollama, spanning Qwen 2.5 at 3B, 32B, and 72B, Llama 3.1 8B, and Gemma 3 12B. A response is granted a classification when at least four of the six agree. Across the corpus the panel’s chance-corrected agreement is a Fleiss’ kappa of about 0.74, which content analysts would call substantial. As more of the six models converge on a label, each model is also individually more confident in it, and the unanimous cases are, likely, the ones a researcher could reasonably trust. The split cases, where the panel divides three against three, are the cases worth handing to a human reader or dealing with in another way.

As more of the six models agree on a code, each model is also individually more confident in it. Most answers sit at high agreement. The split cases are the ones to review by hand.

Two things make this more than a cost-saving trick. The first is scale. Hand-coding four thousand multilingual answers to a publishable standard, with two trained coders and an adjudication protocol, is a substantial project on its own, and it does not scale to the next survey or the one after that. A panel of models codes the full corpus in an afternoon and recodes it with ease, assuming a replicable setup, if the codebook is updated.

The second is quality. I do not treat the models as a cheap approximation of a human coder so much as a fast one that reasons about a short, context-dependent answer in much the way a careful human reader would. Where I had human-coded ground truth and the panel disagreed with it, I read the disputed cases myself, and more often than not I judged the model's label the better one.2 This is consistent with Ryan Briggs and colleagues, who used language models to extract validated data from roughly one hundred thousand political science articles, putting model coding to work at a scale hand-coding could not reach.3

Not every disagreement is the model's mistake to absorb, though. Some are a sign that the codebook is unclear, and those are the ones to chase, because the codebook is the part you actually tune. When I first checked the panel against human-coded ground truth, the agreement was only fair. Reading the misses in a batch, rather than one at a time, showed the models were applying two of my categories the way I had written them and not the way I meant them: one defined too narrowly, the catch-all residual left too broad. I rewrote those definitions, ran the panel again, and looked again, and over two rounds the agreement with the human codes rose from a fair kappa near 0.40 to a substantial one near 0.67. That gain was larger than anything I got from swapping in a bigger model. The model is the coder. The codebook is its instructions, and calibrating the instructions against a few hundred of your own annotations is where most of the accuracy is won.

Application two: finding the words that matter

The second use case approaches the use of multiple LLMs a bit differently. In "Constructing the Nation," written with Aron van de Pol, I study how South Korean history textbooks portray national identity across the postwar decades. This time there is no codebook to apply. The categories themselves are the quantity of interest. We set out to find which words carry the weight of national identity in the corpus, and how their use changes over time.

The standard tool for extracting terms, concepts, or words from a corpus is topic modeling. The standard topic model is Latent Dirichlet Allocation. LDA treats each document as a bag of words and groups terms that co-occur. It is fast and unsupervised, and it is indifferent to context. A word that means one thing in a sentence about colonial resistance and another in a sentence about economic development is the same token to LDA. For a corpus of long, discursive textbook chapters, where the meaning of a term such as minjok (nation, or ethnos) gets its meaning from the sentences around it, the lack of context being taken into account is a real limitation.

An LLM approach handles it differently. Four large models, each from a different country's lab, read the corpus in long passages and propose the identity-related terms they find: EXAONE from Korea, Aya Expanse from Canada, Qwen from China, and Gemma from the United States. A term enters the final set only when at least three of the four propose it, after which each candidate has to clear a corpus-frequency and semantic-similarity check. The models were chosen from different training traditions on purpose, so that agreement across them would mean more than agreement among near-copies. The result is a set of nine defining terms.

Of the nine terms the four-model consensus defines, an embedding-based topic model identifies all nine, while classic LDA identifies five. Context-aware methods find terms that bag-of-words co-occurrence misses.

As a check, we ran an embedding-based topic model, BERTopic, which reads words in context. It identified all nine, with far fewer edge cases than the bag-of-words baseline. BERTopic does not read a textbook whole, though. Its embedding model takes in short passages at a time, because its context window holds only so many tokens.

An LLM faces the same limit, only a much larger one. It cannot take in a whole book at once either, since the text would exceed its context window and be rejected or truncated, but its window is wide enough to read a term against far more of the surrounding passage, and you can window across a long document in overlapping batches. That extra room is where the real difference shows up.

Classic LDA, for what it is worth, identifies only five terms. The panel of LLMs does what topic modeling is meant to do: name the vocabulary that organizes a corpus. But it reads each term in the context of the passage around it, the way a human reader of these textbooks would, and it identifies terms that word co-occurrence alone misses. For a corpus of long documents, that is the difference between a list of frequent words and a list of meaningful ones.

What LLM voting does and does not measure

There are, of course, limits. Cross-model agreement is a measure of reliability, not of validity, and it is not a calibrated estimate of uncertainty. The panel can agree because the answer is clear, and it can agree because the models share training data and therefore share blind spots. The Condorcet jury theorem, which is the formal reason a majority of independent voters grows more accurate as the panel grows, assumes the voters are independent, and language models are not. Their errors are correlated, so a panel of nine carries far less independent information than nine human coders would.4 Agreement should therefore be reported as inter-coder consistency among non-independent coders, rather than accuracy or probability.

This limit is notable, but it is not, in my experience, the main thing. On the one task where I had adjudicated human codes to check against, which came from a separate corpus on North Korean entrepreneurs rather than the citizenship answers above, the models agreed with the human coders at a substantial level, short of the agreement between the humans themselves, and in the cases where they differed I usually sided with the model. The limit imposes modest discipline. Keep some human-coded ground truth in the project, if you can, even a few hundred cases, so that the panel can be checked rather than trusted on faith. Where a within-model confidence is needed, read it from the model's own token probabilities, which come from its predictive distribution, rather than from the agreement count or a model's self-reported confidence.

That last point is worth dwelling on, because if a model is going to stand in for a coder, its uncertainty has to be handled like a coder’s. A careful human coder is not equally certain of every judgment, and neither is a language model. Even with a fixed codebook and temperature set to zero, the model assigns different degrees of confidence to different classifications. That confidence is reflected in the log-probability it assigns to the label it ultimately selects, a quantity derived from the model’s predictive distribution at the moment of the decision. By contrast, a HIGH or LOW confidence tag the model produces if asked is simply another generated response, not a direct measurement of its uncertainty.

I therefore treat the log-probability as an indicator of uncertainty. It records how much probability the model assigned to the label it ultimately selected, according to its predictive distribution. A response that receives a label but a relatively low log-probability is one that merits closer inspection. It captures something the panel does not. The log-probability reflects how certain each model is about its own decision. Agreement across the panel reflects whether different models reach the same decision. The two measures complement one another rather than serving the same purpose.

Is it worth it

It is, if the claim stays within the evidence. A panel of open-weight models is a practical research instrument, and the two applications here show its range. For classification it does the work of a team of coders at a scale no team could match, and codes nearly as well, a little below expert human agreement. For term discovery it does the work of a topic model while reading each term in context, which lets it identify defining vocabulary that bag-of-words methods miss. In both cases the design that makes it trustworthy is the same. Use several models rather than one, choose them to be as different from one another as possible, run them on open weights so the result can be reproduced, and treat a split vote as a flag for human attention rather than as noise to be averaged away.

A panel of LLMs can outperform a single model. But it does not follow that high agreement means the panel is correct. High agreement tells us that these particular models concur, not that they have reached the truth. The two coincide only when the panel is both accurate and sufficiently diverse. Build the panel and pay attention to where it disagrees. But the human stays very much in the loop here, refining the codebook before the run and adjudicating the cases the panel cannot resolve afterward. The panel supplies scale, consistency, and a record of disagreement. The researcher still has to decide what the categories mean, whether the instrument is valid, and when the models and their output should not be trusted.

Using Claude Code

If you want to run something like this yourself, the steps are codified as a set of open-source Claude Code skills I maintain and reach for when working with agents, so each run stays reproducible. The ones closest to this post are model-council-voting, for running a panel and reading its agreement; text-classification, for designing and validating a codebook; and llm-calibration-logprobs, for reading per-decision confidence from token log-probabilities.

Subscribe now

The full classification panel is GPT-4o-mini plus five open-weight models run locally via Ollama at temperature zero with a fixed seed: Qwen 2.5 (3B, 32B, 72B), Llama 3.1 (8B), and Gemma 3 (12B). The case for running open weights rather than depending on a proprietary API is reproducibility: API models drift between versions and cannot be rerun deterministically (Barrie et al. 2025). GPT-4o-mini earns its place as a reference coder on grounds of cost and strong multilingual performance, not authority, and agreement with it rises monotonically with open-model size (Qwen 2.5: 0.73 at 3B, 0.83 at 32B, 0.86 at 72B), which is a useful sanity check on the smaller models.

The check comes from a separate corpus coding why North Korean entrepreneurs make the choices they do, where two trained humans coded every response. The two humans agree at kappa 0.88. Each model agrees with the humans at about 0.61 to 0.67, roughly what one would expect of a competent third coder. The models are also overconfident against human codes when their token probabilities are read as calibrated estimates (expected calibration error of 0.15 to 0.26, against 0.01 to 0.10 when scored against the panel’s own majority), which is the technical form of the point that agreement is not the same as accuracy.

Briggs, Mellon, and Arel-Bundock, “It must be very hard to publish null results” (working paper). They code a large corpus of articles for whether results are null, using LLMs validated against human coding, and document a steep gap between the share of published abstracts reporting non-null results and the share reporting nulls.

A recent study, “Nine Judges, Two Effective Votes: Correlated Errors Undermine LLM Evaluation Panels” (Kohli 2026), estimates that a panel of nine frontier models from seven families carries only about two independent votes’ worth of information, because the models tend to be wrong on the same items. The remedy is real diversity among the voters, which is the reason the term-discovery panel draws its four models from four different national labs.