diff --git a/CLAUDE.md b/CLAUDE.md index 3c6f272..640228f 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -26,6 +26,14 @@ npx vitest run src/routes/admin.test.ts kill-the-news is a Cloudflare Worker that ingests email newsletters and exposes them as private RSS/Atom feeds. Self-hosted, free-tier-friendly (Cloudflare + ForwardEmail). +## Development approach + +Work **test-first (TDD)** and **domain-driven (DDD)** in this repo — both are first-class, not optional. + +**TDD.** Write or extend a test before/with the change, then make it pass. Mirror the existing test layout (`*.test.ts` next to the source, `createMockEnv()` from `src/test/setup.ts`, MSW for outbound HTTP). End every change green: `npx tsc --noEmit`, `npm test`, and `npm run build` (dry-run deploy) must all pass before declaring done. + +**DDD.** Before adding logic, check whether the domain already models the concept — reach for the value objects in `src/domain/value-objects/` (`EmailAddress`, `Domain`, `FeedId`, `Lifetime`, `SenderPolicy`) and the `Feed` aggregate rather than re-deriving things ad hoc. New behavior belongs on the type that owns the data (e.g. "sender site URL" lives on `EmailAddress`, not in a helper). Respect the layering and aggregate rules below — imports point inward (routes → application → domain; infrastructure implements ports), and never reach across a layer for convenience (e.g. importing a favicon/infra helper just to parse a domain). When the same derivation appears twice, that's the signal to push it onto a domain type. + ## Architecture Single Cloudflare Worker built with Hono. Routes: @@ -185,9 +193,13 @@ MSW (`msw/node`) handles external HTTP mocks. Tests that hit validation paths in ## When changing behavior -Update together: +**Always document evolutions** — treat docs as part of the change, not a follow-up. 
When you add or change a feature, update the relevant docs in the same change: - `README.md` - `INSTALL.md` (setup, deployment, and configuration guide) - `setup.sh` (if setup/deploy assumptions changed) - Tests under `src/routes/*.test.ts` and `src/test/setup.ts` + +Keep it proportionate: user-facing or config changes warrant doc updates; purely internal refactors usually don't. + +**Marketing landing page (`docs/index.html`).** This is the public GH Pages site (served at the `CNAME` domain), not the in-app status page (`src/routes/home.tsx`). When a feature is also a selling point — something a prospective self-hoster would care about (privacy guarantees, full-body capture, burnable aliases, reader compatibility, automation/API, AI features…) — surface it there too (hero copy or a feature card), matching the existing section/card style. Internal correctness fixes don't belong on the landing page; differentiators do. diff --git a/README.md b/README.md index 60e488a..0b077e9 100644 --- a/README.md +++ b/README.md @@ -22,6 +22,7 @@ kill-the-news keeps the same workflow while avoiding shared domains and shared d - Optional per-feed sender allowlist (`email@domain.com` or `domain.com`) - RSS generation on demand (`/rss/:feedId`) - Atom feed at `/atom/:feedId` +- Reader-friendly output: relative links/images absolutized to the sender's site, lazy-loaded images promoted (`data-src` → `src`), plain-text feed titles, and XML-illegal control characters stripped so feeds parse in strict readers - Per-feed favicon derived from the last sender's domain (`/favicon/:feedId`), cached and shown in feeds + admin - Automatic RFC 8058 one-click unsubscribe when a feed is deleted — stops newsletters from mailing the now-dead address - Email attachments stored in Cloudflare R2 and exposed as RSS enclosures (optional) @@ -131,6 +132,7 @@ Then enable email ingestion (Cloudflare Email Workers or ForwardEmail) and open - When using Option B (ForwardEmail), inbound webhook access is 
IP-restricted to ForwardEmail MX sources. - Admin auth uses a signed, `HttpOnly`, `Secure`, `SameSite=Strict` cookie. - Admin responses are `no-store` to avoid cache leakage. +- Feed, entry, and attachment responses send `X-Robots-Tag: noindex`, and `/robots.txt` disallows `/rss`, `/atom`, `/entries`, `/files`, and `/admin`, so private feeds and emails are kept out of search engines. - For high-value feeds, set `Allowed senders` so only known sender addresses/domains are accepted. - You should use a strong admin password and rotate periodically. - All secret comparisons (admin password, proxy secret) use constant-time comparison to prevent timing attacks. diff --git a/TODO.md b/TODO.md index 734f4ee..676ec71 100644 --- a/TODO.md +++ b/TODO.md @@ -62,7 +62,7 @@ Gaps found by reading every open/closed issue + PR on [kill-the-newsletter](http - [ ] `P2·S` **Detect a newsletter's native Atom/RSS feed** — _top item on upstream's own [TODO](https://github.com/leafac/kill-the-newsletter/blob/main/TODO.md), not yet built there_. When an incoming email's HTML contains `` (or `application/rss+xml`), surface it: "this newsletter already publishes a feed — subscribe to it directly instead." We already parse HTML with linkedom in `src/infrastructure/html-processor.ts`, so detection is cheap; store the discovered URL on the feed and show it in the admin UI / a feed entry. A genuine differentiator — we'd ship it before upstream. -- [ ] `P1·S` **`X-Robots-Tag: none` on feed + entry routes** ([#33](https://github.com/leafac/kill-the-newsletter/issues/33)). Private feeds/emails should never be search-indexed. Upstream sets `X-Robots-Tag: none` on its responses; we set a CSP on `/entries` but **no** robots header anywhere. Add `X-Robots-Tag: noindex` to `rss.ts`, `atom.ts`, `entries.ts`, `files.ts` (and optionally a `/robots.txt`). Low effort, real privacy gap. 
+- [x] `P1·S` **`X-Robots-Tag: none` on feed + entry routes** ([#33](https://github.com/leafac/kill-the-newsletter/issues/33)). Private feeds/emails should never be search-indexed. Upstream sets `X-Robots-Tag: none` on its responses; we set a CSP on `/entries` but **no** robots header anywhere. Add `X-Robots-Tag: noindex` to `rss.ts`, `atom.ts`, `entries.ts`, `files.ts` (and optionally a `/robots.txt`). Low effort, real privacy gap. ## From similar projects & RSS readers (2026-05-24 review) @@ -152,15 +152,15 @@ Two final angles: (1) less-common RSS/Atom namespaces that visibly improve feeds ### Reader-rendering correctness (turn these into hardening tasks) -- [ ] `P1·S` **Rewrite relative URLs in content to absolute** **[correctness]** — most readers ignore `xml:base`; relative `src`/`href` in `content:encoded` break in Miniflux/NetNewsWire. Absolutize every link/image before emitting (`src/infrastructure/html-processor.ts`). — _origin: [W3C ContainsRelRef](https://validator.w3.org/feed/docs/warning/ContainsRelRef.html)_ +- [x] `P1·S` **Rewrite relative URLs in content to absolute** **[correctness]** — most readers ignore `xml:base`; relative `src`/`href` in `content:encoded` break in Miniflux/NetNewsWire. Absolutize every link/image before emitting (`src/infrastructure/html-processor.ts`). — _origin: [W3C ContainsRelRef](https://validator.w3.org/feed/docs/warning/ContainsRelRef.html)_ -- [ ] `P1·S` **Promote lazy-loaded images (`data-src` → `src`, strip `loading="lazy"`)** **[correctness]** — newsletters with lazy images render blank in readers. — _origin: [Hugo RSS & lazy images](https://brainbaking.com/post/2021/01/hugo-rss-feeds-and-lazy-image-loading/)_ +- [x] `P1·S` **Promote lazy-loaded images (`data-src` → `src`, strip `loading="lazy"`)** **[correctness]** — newsletters with lazy images render blank in readers. 
— _origin: [Hugo RSS & lazy images](https://brainbaking.com/post/2021/01/hugo-rss-feeds-and-lazy-image-loading/)_ -- [ ] `P1·S` **Strip XML-illegal control chars + guarantee valid UTF-8** **[correctness]** — a single bad codepoint fails the _whole_ feed parse in strict readers (newsboat). Sanitize before serialization. — _origin: [newsboat #2328](https://github.com/newsboat/newsboat/issues/2328), [W3C SAXError](https://validator.w3.org/feed/docs/error/SAXError.html); upstream hit this too ([ktn#1](https://github.com/leafac/kill-the-newsletter/issues/1) cyrillic, [ktn#9](https://github.com/leafac/kill-the-newsletter/issues/9) invalid XML char)_ +- [x] `P1·S` **Strip XML-illegal control chars + guarantee valid UTF-8** **[correctness]** — a single bad codepoint fails the _whole_ feed parse in strict readers (newsboat). Sanitize before serialization. — _origin: [newsboat #2328](https://github.com/newsboat/newsboat/issues/2328), [W3C SAXError](https://validator.w3.org/feed/docs/error/SAXError.html); upstream hit this too ([ktn#1](https://github.com/leafac/kill-the-newsletter/issues/1) cyrillic, [ktn#9](https://github.com/leafac/kill-the-newsletter/issues/9) invalid XML char)_ - [ ] `P2·S` **Real `enclosure` byte length + correct type (never `length="0"`)** **[correctness]** — zero/missing length makes podcast clients reject the enclosure; use the actual R2 object size. — _origin: [AzuraCast #7809](https://github.com/AzuraCast/AzuraCast/issues/7809)_ -- [ ] `P1·S` **Plain-text `<title>` (strip HTML, decode entities)** **[correctness]** — raw tags in titles show literally in readers; keep markup only in `content`. — _origin: [RSS.app feed output guide](https://help.rss.app/en/articles/10769849-guide-to-feed-output); upstream [ktn#11](https://github.com/leafac/kill-the-newsletter/issues/11) (subject placed as link)_ +- [x] `P1·S` **Plain-text `<title>` (strip HTML, decode entities)** **[correctness]** — raw tags in titles show literally in readers; keep markup only in `content`. 
— _origin: [RSS.app feed output guide](https://help.rss.app/en/articles/10769849-guide-to-feed-output); upstream [ktn#11](https://github.com/leafac/kill-the-newsletter/issues/11) (subject placed as link)_ ## Per-feed favicon — design notes diff --git a/package.json b/package.json index 34a5f6d..d746916 100644 --- a/package.json +++ b/package.json @@ -1,6 +1,6 @@ { "name": "kill-the-news", - "version": "0.1.0", + "version": "0.2.1", "description": "Convert email newsletters into private RSS feeds using Cloudflare Workers", "main": "dist/worker.js", "scripts": { diff --git a/src/domain/value-objects/email-address.test.ts b/src/domain/value-objects/email-address.test.ts index a5b84d2..52fd4ca 100644 --- a/src/domain/value-objects/email-address.test.ts +++ b/src/domain/value-objects/email-address.test.ts @@ -24,4 +24,10 @@ describe("EmailAddress", () => { expect(EmailAddress.parse("not an email")).toBeNull(); expect(EmailAddress.parse("")).toBeNull(); }); + + it("derives the sender site base URL from the domain", () => { + expect(EmailAddress.parse("News <a@Example.com>")?.siteBaseUrl()).toBe( + "https://example.com/", + ); + }); }); diff --git a/src/domain/value-objects/email-address.ts b/src/domain/value-objects/email-address.ts index 779d6de..db028e8 100644 --- a/src/domain/value-objects/email-address.ts +++ b/src/domain/value-objects/email-address.ts @@ -20,6 +20,15 @@ export class EmailAddress { return new EmailAddress(`${local}@${domain.value}`, domain); } + /** + * Best-effort website origin implied by the sender's domain + * (e.g. `https://example.com/`). Used to absolutize relative links in the + * email body — the sender's site is the only base we can infer. 
+ */ + siteBaseUrl(): string { + return `https://${this.domain.value}/`; + } + toString(): string { return this.normalized; } diff --git a/src/index.test.ts b/src/index.test.ts index d6c316c..9093a07 100644 --- a/src/index.test.ts +++ b/src/index.test.ts @@ -54,3 +54,17 @@ describe("CORS middleware", () => { expect(res.headers.get("Access-Control-Allow-Origin")).toBe("*"); }); }); + +describe("GET /robots.txt", () => { + it("returns 200 and disallows the private feed/entry paths", async () => { + const res = await worker.fetch(req("/robots.txt"), env as unknown as Env); + expect(res.status).toBe(200); + const body = await res.text(); + expect(body).toContain("User-agent: *"); + expect(body).toContain("Disallow: /rss/"); + expect(body).toContain("Disallow: /atom/"); + expect(body).toContain("Disallow: /entries/"); + expect(body).toContain("Disallow: /files/"); + expect(body).toContain("Disallow: /admin/"); + }); +}); diff --git a/src/index.ts b/src/index.ts index 75b8420..e49fc55 100644 --- a/src/index.ts +++ b/src/index.ts @@ -184,6 +184,14 @@ app.get("/health", (c) => c.json({ status: "ok", timestamp: Date.now() })); // Public status page (counters + link to admin) app.get("/", handleHome); +// Keep private feeds/emails out of search engines (defense in depth alongside +// the X-Robots-Tag headers on the feed/entry/file responses). 
+app.get("/robots.txt", (c) => + c.text( + "User-agent: *\nDisallow: /rss/\nDisallow: /atom/\nDisallow: /entries/\nDisallow: /files/\nDisallow: /admin/\n", + ), +); + // Catch-all for 404s app.all("*", (c) => c.text("Not Found", 404)); diff --git a/src/infrastructure/feed-generator.test.ts b/src/infrastructure/feed-generator.test.ts index f790605..25fac5f 100644 --- a/src/infrastructure/feed-generator.test.ts +++ b/src/infrastructure/feed-generator.test.ts @@ -313,6 +313,66 @@ describe("generateAtomFeed", () => { expect(result).toContain("Bob"); }); + it("renders the subject as plain text in <title> (strips tags, decodes entities)", () => { + const emailWithHtmlSubject: EmailData = { + ...mockEmails[0], + subject: "<b>Sale</b> Tom & Jerry", + }; + const result = generateAtomFeed( + mockFeedConfig, + [emailWithHtmlSubject], + BASE_URL, + FEED_ID, + ); + // Tags are stripped and entities decoded; markup must not survive. + expect(result).toContain("Sale Tom & Jerry"); + expect(result).not.toContain("<b>Sale</b>"); + }); + + it("strips XML-illegal control characters from the output", () => { + const emailWithControlChar: EmailData = { + ...mockEmails[0], + subject: "Bad\x00\x1Fchar", + content: "<p>body\x0Bhere</p>", + }; + const result = generateAtomFeed( + mockFeedConfig, + [emailWithControlChar], + BASE_URL, + FEED_ID, + ); + expect(result).not.toMatch(/[\x00\x0B\x1F]/); + }); + + it("preserves emoji (surrogate pairs) in the output", () => { + const emailWithEmoji: EmailData = { + ...mockEmails[0], + subject: "Launch 🚀 today", + }; + const result = generateAtomFeed( + mockFeedConfig, + [emailWithEmoji], + BASE_URL, + FEED_ID, + ); + expect(result).toContain("🚀"); + }); + + it("absolutizes relative content URLs against the sender domain", () => { + const emailWithRelative: EmailData = { + ...mockEmails[0], + from: "News <news@acme.com>", + content: '<body><a href="/article">read</a></body>', + }; + const result = generateAtomFeed( + mockFeedConfig, + 
[emailWithRelative], + BASE_URL, + FEED_ID, + ); + expect(result).toContain("https://acme.com/article"); + }); + it("includes enclosure link for email with attachment in Atom feed", () => { const result = generateAtomFeed( mockFeedConfig, diff --git a/src/infrastructure/feed-generator.ts b/src/infrastructure/feed-generator.ts index e959b3f..2850924 100644 --- a/src/infrastructure/feed-generator.ts +++ b/src/infrastructure/feed-generator.ts @@ -1,9 +1,18 @@ import { Feed } from "feed"; import { FeedConfig, EmailData } from "../types"; -import { processEmailContent } from "./html-processor"; +import { processEmailContent, htmlToText } from "./html-processor"; +import { EmailAddress } from "../domain/value-objects/email-address"; export { processEmailContent as extractBodyContent }; +// XML 1.0 valid chars: #x9 #xA #xD #x20-#xD7FF #xE000-#xFFFD #x10000-#x10FFFF. +// A single illegal codepoint fails the whole feed parse in strict readers, so +// strip the complement before returning. The `u` flag iterates by code point, so +// valid surrogate pairs (emoji, …) survive while lone surrogates are removed. +function stripInvalidXmlChars(xml: string): string { + return xml.replace(/[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u{10000}-\u{10FFFF}]/gu, ""); +} + function parseFromAddress(from: string): { name: string; email?: string } { const match = from.match(/^(.*?)\s*<([^>]+)>\s*$/); if (match) { @@ -60,9 +69,10 @@ function buildFeed( email.content, email.attachments, baseUrl, + EmailAddress.parse(email.from)?.siteBaseUrl() ?? "", ); feed.addItem({ - title: email.subject, + title: htmlToText(email.subject), id: entryUrl, link: entryUrl, description: bodyContent, @@ -89,13 +99,15 @@ export function generateRssFeed( feedId: string, selfUrl?: string, ): string { - return buildFeed( - feedConfig, - emails, - baseUrl, - feedId, - selfUrl ? { rss: selfUrl } : undefined, - ).rss2(); + return stripInvalidXmlChars( + buildFeed( + feedConfig, + emails, + baseUrl, + feedId, + selfUrl ? 
{ rss: selfUrl } : undefined, + ).rss2(), + ); } export function generateAtomFeed( @@ -105,11 +117,13 @@ export function generateAtomFeed( feedId: string, selfUrl?: string, ): string { - return buildFeed( - feedConfig, - emails, - baseUrl, - feedId, - selfUrl ? { atom: selfUrl } : undefined, - ).atom1(); + return stripInvalidXmlChars( + buildFeed( + feedConfig, + emails, + baseUrl, + feedId, + selfUrl ? { atom: selfUrl } : undefined, + ).atom1(), + ); } diff --git a/src/infrastructure/html-processor.test.ts b/src/infrastructure/html-processor.test.ts index 2311866..9dba7cf 100644 --- a/src/infrastructure/html-processor.test.ts +++ b/src/infrastructure/html-processor.test.ts @@ -1,5 +1,9 @@ import { describe, it, expect } from "vitest"; -import { processEmailContent, extractInlineCids } from "./html-processor"; +import { + processEmailContent, + extractInlineCids, + htmlToText, +} from "./html-processor"; import type { AttachmentData } from "../types"; describe("processEmailContent — body extraction", () => { @@ -197,6 +201,105 @@ describe("processEmailContent — inline cid: rewriting", () => { }); }); +describe("processEmailContent — lazy image promotion", () => { + it("promotes data-src to src when src is missing", () => { + const html = '<body><img data-src="https://x.com/a.png"/></body>'; + const result = processEmailContent(html); + expect(result).toContain('src="https://x.com/a.png"'); + }); + + it("promotes data-src over a data: placeholder src", () => { + const html = + '<body><img src="data:image/gif;base64,AAAA" data-src="https://x.com/a.png"/></body>'; + const result = processEmailContent(html); + expect(result).toContain('src="https://x.com/a.png"'); + expect(result).not.toContain("data:image/gif"); + }); + + it("does not clobber a real src with data-src", () => { + const html = + '<body><img src="https://real.com/a.png" data-src="https://lazy.com/b.png"/></body>'; + const result = processEmailContent(html); + 
expect(result).toContain('src="https://real.com/a.png"'); + }); + + it("promotes data-srcset when srcset is absent", () => { + const html = '<body><img data-srcset="https://x.com/a.png 2x"/></body>'; + const result = processEmailContent(html); + expect(result).toContain('srcset="https://x.com/a.png 2x"'); + }); + + it("strips loading=lazy", () => { + const html = '<body><img src="https://x.com/a.png" loading="lazy"/></body>'; + const result = processEmailContent(html); + expect(result).not.toContain("loading"); + }); +}); + +describe("processEmailContent — relative URL absolutization", () => { + const base = "https://news.example.com/"; + + it("absolutizes a root-relative href against the sender base", () => { + const html = '<body><a href="/path">link</a></body>'; + const result = processEmailContent(html, undefined, "", base); + expect(result).toContain('href="https://news.example.com/path"'); + }); + + it("absolutizes a relative img src against the sender base", () => { + const html = '<body><img src="img/a.png"/></body>'; + const result = processEmailContent(html, undefined, "", base); + expect(result).toContain('src="https://news.example.com/img/a.png"'); + }); + + it("resolves protocol-relative URLs using https", () => { + const html = '<body><img src="//cdn.example.com/a.png"/></body>'; + const result = processEmailContent(html, undefined, "", base); + expect(result).toContain('src="https://cdn.example.com/a.png"'); + }); + + it("leaves absolute URLs unchanged", () => { + const html = '<body><a href="https://other.com/x">l</a></body>'; + const result = processEmailContent(html, undefined, "", base); + expect(result).toContain('href="https://other.com/x"'); + }); + + it("does not touch relative URLs when no sender base is given", () => { + const html = '<body><a href="/path">link</a></body>'; + const result = processEmailContent(html); + expect(result).toContain('href="/path"'); + }); + + it("does not absolutize mailto: or anchors", () => { + const html = + 
'<body><a href="mailto:x@y.com">m</a><a href="#top">t</a></body>'; + const result = processEmailContent(html, undefined, "", base); + expect(result).toContain('href="mailto:x@y.com"'); + expect(result).toContain('href="#top"'); + }); +}); + +describe("htmlToText", () => { + it("strips HTML tags", () => { + expect(htmlToText("<b>Bold</b> text")).toBe("Bold text"); + }); + + it("decodes HTML entities", () => { + expect(htmlToText("Tom &amp; Jerry &lt;3")).toBe("Tom & Jerry <3"); + }); + + it("collapses whitespace and trims", () => { + expect(htmlToText(" a\n\n b ")).toBe("a b"); + }); + + it("returns empty string for empty input", () => { + expect(htmlToText("")).toBe(""); + }); + + it("leaves plain text untouched", () => { + expect(htmlToText("Just a subject")).toBe("Just a subject"); + }); +}); + describe("extractInlineCids", () => { it("collects normalized cids referenced by cid: image sources", () => { const html = '<body><img src="cid:ii_abc"/><img src="CID:ii_def"/></body>'; diff --git a/src/infrastructure/html-processor.ts b/src/infrastructure/html-processor.ts index 05b8564..ae1d948 100644 --- a/src/infrastructure/html-processor.ts +++ b/src/infrastructure/html-processor.ts @@ -2,6 +2,8 @@ import { parseHTML } from "linkedom"; import escapeHtml from "escape-html"; import type { AttachmentData } from "../types"; +type ParsedDocument = ReturnType<typeof parseHTML>["document"]; + // Strip surrounding angle brackets and whitespace from a Content-ID so that a // stored value like "<ii_mpi85rqy0>" matches an HTML reference "cid:ii_mpi85rqy0". export function normalizeCid( @@ -28,6 +30,66 @@ export function extractInlineCids(content: string): Set<string> { return cids; } +// Render an HTML fragment (or already-plain string) down to plain text: strips +// tags and decodes entities. Used for feed <title>s, which must be plain text — +// raw markup/entities show literally in readers. 
+export function htmlToText(value: string): string { + if (!value) return ""; + const { document } = parseHTML(`<body>${value}</body>`); + return (document.documentElement?.textContent ?? "") + .replace(/\s+/g, " ") + .trim(); +} + +// Newsletters frequently defer images via data-src/loading="lazy"; readers don't +// run the lazy-loader, so the image renders blank. Promote the real source. +function promoteLazyImages(document: ParsedDocument): void { + document.querySelectorAll("img").forEach((img: Element) => { + const lazySrc = + img.getAttribute("data-src") || + img.getAttribute("data-original") || + img.getAttribute("data-lazy-src"); + if (lazySrc) { + const current = (img.getAttribute("src") ?? "").trim(); + if (!current || /^data:/i.test(current)) { + img.setAttribute("src", lazySrc); + } + } + const lazySrcset = img.getAttribute("data-srcset"); + if (lazySrcset && !img.getAttribute("srcset")) { + img.setAttribute("srcset", lazySrcset); + } + img.removeAttribute("loading"); + }); +} + +// Resolve a single URL against the sender base. Returns null for values that are +// already absolute or should never be rewritten (mailto:, data:, cid:, anchors). +function toAbsolute(value: string, base: string): string | null { + const v = value.trim(); + if (!v || /^(https?:|mailto:|tel:|data:|cid:|#)/i.test(v)) return null; + try { + return new URL(v, base).href; + } catch { + return null; + } +} + +// Most readers ignore xml:base, so relative href/src in content break. Absolutize +// them against the sender's site (best-effort, derived from its email domain). +// Protocol-relative //host/x are resolved too (they pick up the base's https:). +function absolutizeUrls(document: ParsedDocument, base: string): void { + if (!base) return; + document.querySelectorAll("a[href], area[href]").forEach((el: Element) => { + const abs = toAbsolute(el.getAttribute("href") ?? 
"", base); + if (abs) el.setAttribute("href", abs); + }); + document.querySelectorAll("img[src]").forEach((el: Element) => { + const abs = toAbsolute(el.getAttribute("src") ?? "", base); + if (abs) el.setAttribute("src", abs); + }); +} + function cleanMsoStyles(style: string): string { return style .split(";") @@ -98,11 +160,15 @@ function sanitizeElement(el: Element): void { * - Rewrites inline cid: image refs to the stored attachment URL. baseUrl="" * yields relative URLs (entry page, same origin); a baseUrl yields absolute * URLs (feeds, for external RSS readers). + * - Promotes lazy-loaded images (data-src → src, strips loading="lazy"). + * - Absolutizes relative href/src against senderBaseUrl (the sender's site, + * best-effort) so links/images don't break in readers that ignore xml:base. */ export function processEmailContent( content: string, attachments?: AttachmentData[], baseUrl = "", + senderBaseUrl = "", ): string { if (!content) return ""; @@ -124,6 +190,11 @@ export function processEmailContent( document.querySelectorAll("*").forEach((el: Element) => sanitizeElement(el)); + promoteLazyImages(document); + // Absolutize first: cid: refs are skipped here (not http(s)), then rewritten + // below to our /files/ URL — which must NOT be absolutized to the sender. 
+ absolutizeUrls(document, senderBaseUrl); + if (cidMap.size > 0) { document .querySelectorAll("[src]") diff --git a/src/routes/atom.test.ts b/src/routes/atom.test.ts index d465978..e1edd8a 100644 --- a/src/routes/atom.test.ts +++ b/src/routes/atom.test.ts @@ -47,6 +47,11 @@ describe("Atom Feed Route", () => { const res = await testApp.request("/empty-feed", {}, mockEnv); expect(res.headers.get("Cache-Control")).toBe("max-age=1800"); }); + + it("sets X-Robots-Tag: noindex", async () => { + const res = await testApp.request("/empty-feed", {}, mockEnv); + expect(res.headers.get("X-Robots-Tag")).toBe("noindex"); + }); }); describe("valid feed with emails", () => { diff --git a/src/routes/atom.ts b/src/routes/atom.ts index 66da266..f024ff8 100644 --- a/src/routes/atom.ts +++ b/src/routes/atom.ts @@ -40,6 +40,7 @@ export async function handle(c: Context<{ Bindings: Env }>): Promise<Response> { headers: { "Content-Type": "application/atom+xml", "Cache-Control": "max-age=1800", + "X-Robots-Tag": "noindex", Link: linkHeader, }, }); diff --git a/src/routes/entries.test.ts b/src/routes/entries.test.ts index 3f6a819..a0b8b77 100644 --- a/src/routes/entries.test.ts +++ b/src/routes/entries.test.ts @@ -170,4 +170,11 @@ describe("GET /entries/:feedId/:entryId", () => { "default-src 'none'", ); }); + + it("sets X-Robots-Tag: noindex", async () => { + await seedFeed(env); + const app = makeApp(); + const res = await app.request(`/${FEED_ID}/${RECEIVED_AT}`, {}, env as any); + expect(res.headers.get("X-Robots-Tag")).toBe("noindex"); + }); }); diff --git a/src/routes/entries.ts b/src/routes/entries.ts index a137452..d1b24d7 100644 --- a/src/routes/entries.ts +++ b/src/routes/entries.ts @@ -2,6 +2,7 @@ import { Context } from "hono"; import { html, raw } from "hono/html"; import { Env } from "../types"; import { processEmailContent } from "../infrastructure/html-processor"; +import { EmailAddress } from "../domain/value-objects/email-address"; import { formatBytes } from 
"../domain/format"; import { FeedRepository } from "../infrastructure/feed-repository"; import { FeedId } from "../domain/value-objects/feed-id"; @@ -46,6 +47,14 @@ export async function handle(c: Context<{ Bindings: Env }>): Promise<Response> { "Content-Security-Policy", "default-src 'none'; style-src 'unsafe-inline'; img-src *; frame-src 'none'", ); + c.header("X-Robots-Tag", "noindex"); + + const bodyContent = processEmailContent( + emailData.content, + emailData.attachments, + "", + EmailAddress.parse(emailData.from)?.siteBaseUrl() ?? "", + ); // Inline images render in place (cid: refs are rewritten by processEmailContent); // only genuine, downloadable attachments belong in the list below. @@ -92,11 +101,7 @@ export async function handle(c: Context<{ Bindings: Env }>): Promise<Response> { <dt>Date:</dt> <dd>${new Date(emailData.receivedAt).toUTCString()}</dd> </dl> - <div class="content"> - ${raw( - processEmailContent(emailData.content, emailData.attachments), - )} - </div> + <div class="content">${raw(bodyContent)}</div> ${attachmentsSection} </body> </html>`, diff --git a/src/routes/files.test.ts b/src/routes/files.test.ts index dd20616..ccb1da4 100644 --- a/src/routes/files.test.ts +++ b/src/routes/files.test.ts @@ -72,6 +72,16 @@ describe("GET /files/:attachmentId/:filename", () => { ); }); + it("sets X-Robots-Tag: noindex", async () => { + const content = new TextEncoder().encode("data").buffer as ArrayBuffer; + await mockR2.put("robots-uuid", content, { + httpMetadata: { contentType: "application/pdf" }, + }); + + const res = await request(envWithR2, "/files/robots-uuid/doc.pdf"); + expect(res.headers.get("X-Robots-Tag")).toBe("noindex"); + }); + it("sets Content-Disposition from httpMetadata when present", async () => { const content = new TextEncoder().encode("data").buffer as ArrayBuffer; await mockR2.put("disp-uuid", content, { diff --git a/src/routes/files.ts b/src/routes/files.ts index e3b7db4..8b23570 100644 --- a/src/routes/files.ts +++ 
b/src/routes/files.ts @@ -25,6 +25,7 @@ export async function handle(c: Context<{ Bindings: Env }>): Promise<Response> { object.writeHttpMetadata(headers); headers.set("etag", object.httpEtag); headers.set("Cache-Control", "public, max-age=31536000, immutable"); + headers.set("X-Robots-Tag", "noindex"); if (!headers.get("Content-Disposition")) { headers.set( diff --git a/src/routes/rss.test.ts b/src/routes/rss.test.ts new file mode 100644 index 0000000..68fd480 --- /dev/null +++ b/src/routes/rss.test.ts @@ -0,0 +1,56 @@ +import { describe, it, expect, beforeEach } from "vitest"; +import { Hono } from "hono"; +import { handle } from "./rss"; +import { createMockEnv } from "../test/setup"; +import { Env } from "../types"; + +describe("RSS Feed Route", () => { + let testApp: Hono; + let mockEnv: Env; + + beforeEach(() => { + mockEnv = createMockEnv() as unknown as Env; + testApp = new Hono(); + testApp.get("/:feedId", handle); + }); + + describe("unknown feed", () => { + it("returns 404 when no metadata exists in KV", async () => { + const res = await testApp.request("/nonexistent-feed", {}, mockEnv); + expect(res.status).toBe(404); + expect(await res.text()).toBe("Feed not found"); + }); + }); + + describe("valid feed with no emails", () => { + beforeEach(async () => { + await mockEnv.EMAIL_STORAGE.put( + "feed:empty-feed:metadata", + JSON.stringify({ emails: [] }), + ); + }); + + it("returns 200 with application/rss+xml content type", async () => { + const res = await testApp.request("/empty-feed", {}, mockEnv); + expect(res.status).toBe(200); + expect(res.headers.get("Content-Type")).toContain("application/rss+xml"); + }); + + it("includes Cache-Control header", async () => { + const res = await testApp.request("/empty-feed", {}, mockEnv); + expect(res.headers.get("Cache-Control")).toBe("max-age=1800"); + }); + + it("sets X-Robots-Tag: noindex", async () => { + const res = await testApp.request("/empty-feed", {}, mockEnv); + 
expect(res.headers.get("X-Robots-Tag")).toBe("noindex"); + }); + + it("Link header advertises hub and self for WebSub discovery", async () => { + const res = await testApp.request("/empty-feed", {}, mockEnv); + const link = res.headers.get("Link") ?? ""; + expect(link).toContain(`rel="hub"`); + expect(link).toContain(`rel="self"`); + }); + }); +}); diff --git a/src/routes/rss.ts b/src/routes/rss.ts index 4d85274..c90f6ab 100644 --- a/src/routes/rss.ts +++ b/src/routes/rss.ts @@ -40,6 +40,7 @@ export async function handle(c: Context<{ Bindings: Env }>): Promise<Response> { headers: { "Content-Type": "application/rss+xml", "Cache-Control": "max-age=1800", + "X-Robots-Tag": "noindex", Link: linkHeader, }, });
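
Review note: the XML sanitizing step in `src/infrastructure/feed-generator.ts` is easy to misread in a flattened diff, so here is a minimal standalone sketch of the same idea. It assumes the character ranges listed in the diff's own comment (`#x9 #xA #xD #x20-#xD7FF #xE000-#xFFFD #x10000-#x10FFFF`) and writes them with explicit escapes. The sample strings are hypothetical, and the snippet assumes an ES2015+ target so the `u` flag is available. It is not necessarily byte-for-byte what the repository ships.

```typescript
// Standalone sketch: keep only code points that XML 1.0 allows and drop the rest.
// The `u` flag makes the regex iterate by code point, so a valid surrogate pair
// (for example an emoji) is treated as one character in #x10000-#x10FFFF and kept,
// while a lone surrogate falls outside every allowed range and is removed.
function stripInvalidXmlChars(xml: string): string {
  return xml.replace(
    /[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u{10000}-\u{10FFFF}]/gu,
    "",
  );
}

// Hypothetical input: NUL, unit separator and vertical tab are illegal in XML 1.0.
const cleaned = stripInvalidXmlChars("Bad\x00\x1Fchar \u{1F680} ok\x0B");
console.log(JSON.stringify(cleaned)); // prints "Badchar 🚀 ok"
```

Running the sanitizer over the serialized feed rather than over each field, as the diff does in `generateRssFeed` and `generateAtomFeed`, covers every field with a single pass.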