How I started teaching PDF.js to really edit PDF text

In the first versions of CrabPDF, I was not really editing PDF text.

I was doing something much more practical.

PDF.js rendered the page. I put an editable layer on top. When the user changed a word, I covered the old text and drew the new one.

That approach is useful. It is how many small PDF editors start, and for a lot of cases it is good enough.

But it always bothered me.

The edit looked real, but the original PDF text could still be there. The new text was often just a visual patch. If you ran text extraction on the PDF, copied text out of it, or looked at the content stream, the illusion started to fall apart.

So I wanted to try the harder version:

Can I make PDF.js edit the original PDF text source, not just paint over it?

This is the experiment I have been working on.

It is very early. It breaks constantly. It is not a product. It has only been tested manually on one PDF:

test/pdfs/tracemonkey.pdf

I published the fork anyway because this is exactly the kind of project that needs more eyes and more broken PDFs.


Why I forked PDF.js instead of making a new project

At first I thought this could become a separate library.

Something like:

pdf-source-text-edit

Nice name. Clean API. Small package.

Then I looked at what the prototype actually needed.

It needed the text layer. It needed the display API. It needed worker messages. It needed font objects. It needed the content stream parser. It needed saveDocument(). It needed the viewer UI. It needed canvas rendering for local preview.

That is not a separate library yet.

That is a PDF.js fork.

So I stopped pretending. For now, the honest shape of the project is:

experimental fork of pdf.js

Maybe later there will be a clean package inside it. But today the useful thing is to keep the whole experiment close to the PDF.js internals it depends on.


The thing I wanted to edit

PDF text is not stored like HTML.

There is no:

<p>Abstract</p>

Inside a PDF content stream you get drawing instructions.

Sometimes text is simple:

(Abstract) Tj

Sometimes it is split into pieces:

[(Trace) -20 (Monkey)] TJ

Sometimes it is transformed, encoded with a custom font, split across multiple operators, or arranged in a way that makes sense visually but not semantically.
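
To make the two operator shapes concrete, here is a minimal sketch of how the visible string falls out of a TJ operand. It assumes the operand has already been tokenized into JavaScript strings and numbers, and that the font maps bytes straight to characters, which many real fonts do not. The name visibleTextFromTJ is hypothetical and not part of PDF.js.

```javascript
// Hypothetical helper, not PDF.js API.
// Assumes the TJ array is already tokenized: strings are shown glyph runs,
// numbers are positioning adjustments between runs.
function visibleTextFromTJ(operand) {
  return operand
    .filter(item => typeof item === "string") // drop the numeric adjustments
    .join("");
}

// [(Trace) -20 (Monkey)] TJ, after tokenizing:
console.log(visibleTextFromTJ(["Trace", -20, "Monkey"])); // prints "TraceMonkey"
```

The numbers matter for layout but not for the text itself, which is why one visual word can be spread over several string pieces.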

So “edit this word” really means:

  1. Find the visual text the user clicked.
  2. Find the original PDF operator that produced it.
  3. Prove that the operator still matches what we think it matches.
  4. Encode the replacement with the same font.
  5. Rewrite the content stream.
  6. Save the PDF.
  7. Re-render the page and make sure the text layer is not stale.

That is a lot of machinery for changing one word.

But that is the difference between drawing over a PDF and editing the PDF.
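
Even step 5 has small traps. In a PDF literal string, the backslash is the escape character and parentheses delimit the string, so a replacement cannot be pasted between ( and ) as-is. A minimal sketch, assuming the replacement has already been encoded to the font's character codes and is held as a JavaScript string; both function names are hypothetical:

```javascript
// Hypothetical helpers, not PDF.js API.
// Input is assumed to be already encoded with the original font.
function escapePdfLiteral(encoded) {
  return encoded.replace(/[\\()\r\n]/g, ch => {
    if (ch === "\r") return "\\r"; // keep line endings from being normalized
    if (ch === "\n") return "\\n";
    return "\\" + ch; // backslash and parentheses
  });
}

// Builds the replacement for a simple Tj operator.
function buildTjOperator(encoded) {
  return `(${escapePdfLiteral(encoded)}) Tj`;
}

console.log(buildTjOperator("Abstract mi piace")); // prints "(Abstract mi piace) Tj"
```

Escaping every parenthesis is more than the format strictly requires, since balanced pairs are allowed unescaped, but it is always safe and keeps the rewriter simple.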


Source refs: the first real piece

The first big change was adding source references to text items.

Normally the viewer cares about what text should be shown and where it should be placed.

For editing, that is not enough.

I need to know where the text came from.

So the extraction path now attaches a textEditSource to editable text items. Conceptually it looks like this:

{
  editable: true,
  operatorName: "Tj",
  operatorIndex: 42,
  sourceText: "Abstract",
  segments: [
    { kind: "text", text: "Abstract", rawKind: "literal" }
  ]
}

The important part is not the exact object shape.

The important part is identity.

When the user clicks a span in the text layer, the editor can ask:

which PDF source operator produced this?

Without that, every edit is just guessing.
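
One way to picture that lookup is a side table from rendered spans to their sources. This is only a sketch under assumptions: the fork may store the reference differently, and attachSource and sourceForSpan are hypothetical names.

```javascript
// Hypothetical sketch: a WeakMap keeps the source ref next to the span
// without adding properties to DOM nodes or keeping removed spans alive.
const sourceBySpan = new WeakMap();

function attachSource(span, textEditSource) {
  // Only editable items get an identity; everything else stays read-only.
  if (textEditSource && textEditSource.editable) {
    sourceBySpan.set(span, textEditSource);
  }
}

function sourceForSpan(span) {
  return sourceBySpan.get(span) ?? null;
}
```

In the viewer the keys would be text layer span elements; a click handler would call sourceForSpan(event.target) and refuse to open the editor when it gets null.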


Planning before writing

One rule I added early was: do not write first.

Every edit has to produce a plan.

The viewer asks PDF.js something like:

const plan = await page.planTextSourceEdit({
  textEditSource,
  expectedSourceText: "Abstract",
  replacementText: "Abstract mi piace",
  includeDecodedStreamPatch: true
});

The planner checks:

  • is this source still present?
  • does the source text still match?
  • is this operator supported?
  • can the replacement be encoded with the original font?
  • can the operand be rewritten without touching unrelated PDF bytes?

If any of those checks fail, the edit should fail.

That sounds harsh, but it is the point of the experiment.

The old overlay editor tried to make something appear on screen no matter what. This path should only edit when it can explain what it is editing.
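
The fail-closed shape of those checks can be sketched in a few lines. This is not the fork's planner: the operator list, the encodableChars set, and the reason strings are all assumptions made for illustration, and the last check (rewriting without touching unrelated bytes) is left out.

```javascript
// Hypothetical sketch of a fail-closed planner. Every rejection has a reason.
const SUPPORTED_OPERATORS = new Set(["Tj", "TJ"]);

function planTextEditSketch({ operators, textEditSource, expectedSourceText,
                              replacementText, encodableChars }) {
  const reject = reason => ({ ok: false, reason });

  // Is the source still present?
  const op = operators[textEditSource.operatorIndex];
  if (!op) return reject("source-missing");

  // Does the source text still match?
  if (op.text !== expectedSourceText) return reject("source-text-mismatch");

  // Is this operator supported?
  if (!SUPPORTED_OPERATORS.has(op.name)) return reject("unsupported-operator");

  // Can the replacement be encoded with the original font?
  for (const ch of replacementText) {
    if (!encodableChars.has(ch)) return reject("unencodable-character");
  }

  return { ok: true, operatorIndex: textEditSource.operatorIndex, replacementText };
}
```

Nothing is written in this function. The caller only gets a plan or a reason, and saving is a separate step.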


I tried blocks, then backed away

At some point I wanted bigger editing boxes.

Paragraphs. Regions. Multiline blocks. Something closer to a normal text editor.

It was tempting because the UI feels more powerful that way.

It was also too early.

PDF text extraction is already weird. If I also ask the prototype to detect paragraphs, columns, headings, line continuations, authors, and reflow behavior, then every bug becomes ambiguous.

Was the source patch wrong?

Was the block detection wrong?

Was the layout wrong?

Was the preview rectangle wrong?

Too many moving pieces.

So I reduced the scope to something much more boring:

click one visual line
edit one visual line
commit one source-backed patch

This is less ambitious, but it is much easier to debug.

Multiple PDF.js spans can still be grouped when they are on the same visual row. For example:

[Abstract] [ mi piace]

can be treated as one visual line.

But the editor does not pretend it has a real paragraph model yet.

That one decision removed a surprising amount of noise.
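
Grouping spans into a visual row can be as simple as comparing vertical centers. A minimal sketch, assuming each span is a plain { text, left, top, height } object in page pixels; the function name and the tolerance value are hypothetical, not taken from the fork:

```javascript
// Hypothetical sketch: spans whose vertical centers are close enough
// are treated as one visual line, ordered left to right.
function groupIntoVisualLines(spans, tolerance = 0.5) {
  const lines = [];
  const sorted = [...spans].sort((a, b) => a.top - b.top || a.left - b.left);
  for (const span of sorted) {
    const mid = span.top + span.height / 2;
    // Tolerance is a fraction of the span height, so it scales with font size.
    const line = lines.find(l => Math.abs(l.mid - mid) <= tolerance * span.height);
    if (line) {
      line.spans.push(span);
    } else {
      lines.push({ mid, spans: [span] });
    }
  }
  return lines.map(l => l.spans.sort((a, b) => a.left - b.left));
}
```

This deliberately knows nothing about paragraphs or columns. Two spans at the same height in different columns would be merged, which is one of the cases a real line model has to rule out.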


The galley

The active editor is a small single-line galley.

It has three jobs:

  1. receive input
  2. show the local preview
  3. draw a reliable caret

The input is still a contenteditable element. I want the browser to handle typing, selection, paste, and IME as much as possible.

But I stopped trusting the browser caret visually.

The text in the galley is transparent because the preview is rendered through a canvas. On some characters and boundaries, the native caret was hard to see or disappeared completely.

So now the native caret is hidden:

input.style.caretColor = "transparent";

and the galley draws its own caret using insertion positions from the PDF layout.

This is not a full custom text editor.

It is still using the DOM selection underneath.

But the visible caret is now under my control, which makes editing feel much less random.
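
The caret math itself is small once the layout provides insertion positions. A minimal sketch, assuming the layout yields one x coordinate per caret slot (so n characters give n + 1 positions); both function names are hypothetical:

```javascript
// Hypothetical sketch: insertionXs[i] is the x position of the caret
// when it sits before character i (the last entry is the end of the line).
function caretXForOffset(insertionXs, offset) {
  const last = insertionXs.length - 1;
  const clamped = Math.min(Math.max(offset, 0), last);
  return insertionXs[clamped];
}

// Inverse direction: map a click x to the nearest caret slot.
function offsetForX(insertionXs, x) {
  let best = 0;
  for (let i = 1; i < insertionXs.length; i++) {
    if (Math.abs(insertionXs[i] - x) < Math.abs(insertionXs[best] - x)) {
      best = i;
    }
  }
  return best;
}
```

In the browser, the offset would come from the DOM selection, and the returned x would position a thin absolutely placed element that stands in for the hidden native caret.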


The “Abstract mi piace” bug

One of the useful bugs happened while editing the word Abstract.

I changed it to:

Abstract mi piace

The HTML contained the whole string.

The editor state contained the whole string.

But after commit the page showed only:

Abstract m

That was confusing for a moment because the input was not the problem.

The problem was the rectangle.

During typing, the galley had expanded to fit the longer text. But the final page preview was still using the original text span rectangle.

So PDF.js rendered the full replacement, but I copied only the left part of the rendered region back onto the visible page canvas.

The fix was architectural, not just a one-line patch.

I split out geometry and preview surface helpers:

native_text_edit_geometry.js
native_text_edit_surface.js

Now the commit preview can use the expanded galley rectangle instead of the original text span rectangle.

This is the kind of bug that tells you where the architecture is lying.
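
The failure is easy to reproduce with plain rectangles. A minimal sketch of the idea, assuming rectangles are { left, top, width, height } objects; the helper name and the choice of taking the union of the two rectangles are illustrative assumptions, and the fork's geometry helpers may look different:

```javascript
// Hypothetical sketch: the region copied back to the page canvas must cover
// both the original span and the (possibly wider) galley.
function unionRect(a, b) {
  const left = Math.min(a.left, b.left);
  const top = Math.min(a.top, b.top);
  const right = Math.max(a.left + a.width, b.left + b.width);
  const bottom = Math.max(a.top + a.height, b.top + b.height);
  return { left, top, width: right - left, height: bottom - top };
}

const spanRect = { left: 10, top: 20, width: 60, height: 12 };    // "Abstract"
const galleyRect = { left: 10, top: 20, width: 130, height: 12 }; // "Abstract mi piace"

// Copying only spanRect keeps 60 of 130 pixels: the "Abstract m" symptom.
console.log(unionRect(spanRect, galleyRect).width); // prints 130
```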


The current viewer-side shape

The experimental viewer code now has a more explicit structure.

Roughly:

web/native_text_edit_app_controller.js
  toolbar mode and page-change lifecycle

web/native_text_edit_service.js
  page marking, hover targets, plan/save/preview/reconcile

web/native_text_edit_block_builder.js
  builds editable candidates from text layer spans

web/native_text_edit_line_candidate.js
  makes the "single editable line" model explicit

web/native_text_edit_geometry.js
  shared DOM rect and canvas rect logic

web/native_text_edit_controller.js
  active edit state and input events

web/native_text_edit_galley.js
  galley UI, custom caret, input sink

web/native_text_edit_surface.js
  page and overlay preview capture

The commit path is now deliberately boring:

planVisualEditCommit
  -> save
  -> renderVisualEditCommitPreview
  -> applyVisualEditTextLayerResult
  -> reconcilePageTextLayer

I like boring here.

Before this split, preview rendering, DOM cleanup, source planning, and canvas rectangles were tangled together. Every fix risked breaking a different part of the edit lifecycle.

Now at least each part has a name.
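
A boring commit path can be modeled as a list of named async steps that stops at the first failure. This is a sketch under assumptions, not the fork's code: runCommitPipeline is a hypothetical name, and the real steps take arguments that are not shown here.

```javascript
// Hypothetical sketch: run named async steps in order, passing state along.
// The first step that throws ends the pipeline and is reported by name.
async function runCommitPipeline(steps, initialState) {
  let state = initialState;
  for (const [name, step] of steps) {
    try {
      state = await step(state);
    } catch (error) {
      return { ok: false, failedAt: name, error };
    }
  }
  return { ok: true, state };
}
```

With steps named after the commit path (planVisualEditCommit, save, renderVisualEditCommitPreview, applyVisualEditTextLayerResult, reconcilePageTextLayer), a failure report says which stage broke instead of leaving a half-updated page to interpret.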


How it is connected to PDF.js

The prototype is not bolted on from the outside.

It touches PDF.js in a few layers.

In core, it tokenizes and rewrites content streams:

src/core/content_stream_tokenizer.js
src/core/text_edit_planner.js
src/core/text_edit_rewriter.js
src/core/text_edit_content_stream.js

In the display API, it exposes planning and saving:

PDFPageProxy.beginTextEditLayout(...)
PDFPageProxy.planTextSourceEdit(...)
PDFDocumentProxy.saveDocument({ textEditPatches })

In the viewer, it uses the text layer as the interaction surface:

web/text_layer_builder.js
web/native_text_edit_service.js
web/native_text_edit_galley.js

That is why this is a fork.

The interesting part is exactly the connection between the rendered span and the PDF source operator behind it.


What works right now

Very little.

The best current demo is still editing simple source-backed text in:

test/pdfs/tracemonkey.pdf

The happy path is:

  1. Open the viewer.
  2. Enable Edit.
  3. Hover an editable line.
  4. Click it.
  5. Type a replacement.
  6. Press Enter.
  7. Let PDF.js save and reconcile the page.

When it works, the PDF source is patched instead of visually covered.

That is the whole point.


What does not work

Most PDFs.

That is the honest answer.

Current limitations include:

  • only narrow Tj and TJ cases are supported;
  • many font encodings are rejected;
  • no paragraph reflow;
  • no multiline editing;
  • no OCR path in this prototype;
  • no visual fallback writer;
  • transformed text may fail;
  • fragmented/non-contiguous source text may fail;
  • Type 3 and custom font cases are not broadly handled;
  • selection visuals still need work.

This is not ready to replace the old CrabPDF editing path.

It is a research branch for a stricter kind of editing.


Why publish it now

Because PDF editing needs ugly PDFs.

I can keep improving the prototype against one file forever, but that will only teach it to edit one file.

The hard cases are out there:

  • PDFs from Word;
  • PDFs from LaTeX;
  • PDFs from browsers;
  • scanned PDFs with hidden OCR layers;
  • weird invoices;
  • old government forms;
  • custom embedded fonts;
  • text split into tiny operators for no obvious reason.

Publishing the fork makes it possible for people to send failures, add fixtures, and improve support one PDF class at a time.

I do not want to present this as finished.

I want to present it as a starting point.


What I learned

The hard part is not changing bytes.

The hard part is knowing which bytes are safe to change.

That is the real difference between the overlay editor and this PDF.js fork.

The overlay editor asks:

Can I make the page look right?

This prototype asks:

Can I prove what source produced this text, and can I safely rewrite it?

That second question is much harder.

It is also more interesting.

PDFs are not designed to be edited like documents. They are closer to frozen drawing programs. If I want CrabPDF to become a better editor, it needs both paths:

  • a pragmatic visual/OCR path for messy cases;
  • a source-backed path for PDFs where real text editing can be proven.

This fork is the beginning of the second path.

It barely works.

But when it does, it feels like the right direction.