Show HN: Docsingest – Turn developer docs into clean, LLM-friendly Markdown

docsingest.com

2 points by swiftlysingh 8 days ago

Hi HN,

I'm Pushpinder, and I built Docsingest (https://docsingest.com/) to solve a common pain point: preparing online developer documentation for LLM workflows such as RAG or fine-tuning.

Scraping developer doc webpages is always messy: you often end up with navigation menus, footers, sidebars, and even ads mixed in with your core content. Worse, extracting code blocks and preserving the document’s structure (headings, lists, tables) is challenging, and manually cleaning it up is tedious.

Inspired by tools like [GitIngest](https://gitingest.com/) that effectively process code repositories, I wanted to build a similar tool focused on dev docs directly from the web. Docsingest takes a URL as input and:

1. *Isolates the main content:* It intelligently strips away the boilerplate (like headers, footers, and sidebars) that you don’t need.

2. *Preserves rich formatting:* It retains code formatting, proper heading hierarchies, lists, and tables to ensure that the Markdown output is structured and LLM-friendly.

3. *Handles JS-rendered pages:* To accurately capture content from modern sites, we use a headless browser (powered by [Browserless](https://account.browserless.io/)) to render JavaScript before extraction.
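To make steps 1 and 2 concrete, here is a rough, standard-library-only Python sketch of the general idea. It is not Docsingest's actual implementation: the names `MarkdownExtractor`, `html_to_markdown`, and `SKIP_TAGS` are hypothetical, the list of boilerplate tags is an assumption, and it expects HTML that has already been rendered (step 3 is out of scope here). Tables and nested lists are not handled.

```python
from html.parser import HTMLParser

# Assumption for this sketch: these tags are treated as boilerplate.
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}
HEADINGS = {f"h{n}": "#" * n for n in range(1, 7)}
FENCE = "`" * 3  # Markdown code fence, built here to keep this listing tidy


class MarkdownExtractor(HTMLParser):
    """Hypothetical helper: keeps headings, paragraphs, list items, and code."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []     # finished Markdown blocks
        self.buf = []        # raw text of the block being built
        self.prefix = ""     # Markdown prefix for the current block
        self.skip_depth = 0  # > 0 while inside a boilerplate element
        self.in_pre = False  # True while inside <pre>

    def flush(self):
        # Collapse whitespace and emit the current block, if it has any text.
        text = " ".join("".join(self.buf).split())
        self.buf = []
        if text:
            self.blocks.append(self.prefix + text)
        self.prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth:
            return
        elif tag in HEADINGS:
            self.flush()
            self.prefix = HEADINGS[tag] + " "
        elif tag == "p":
            self.flush()
        elif tag == "li":
            self.flush()
            self.prefix = "- "
        elif tag == "pre":
            self.flush()
            self.in_pre = True
        elif tag == "code" and not self.in_pre:
            self.buf.append("`")  # inline code span

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth:
            return
        elif tag == "pre":
            # Keep code verbatim, including its line breaks.
            code = "".join(self.buf).strip("\n")
            self.buf = []
            self.blocks.append(FENCE + "\n" + code + "\n" + FENCE)
            self.in_pre = False
        elif tag in HEADINGS or tag in ("p", "li"):
            self.flush()
        elif tag == "code" and not self.in_pre:
            self.buf.append("`")

    def handle_data(self, data):
        if not self.skip_depth:
            self.buf.append(data)


def html_to_markdown(html: str) -> str:
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()
    return "\n\n".join(parser.blocks)


sample = """
<html><body>
<nav><a href="/">Home</a></nav>
<main>
<h1>Install</h1>
<p>Run <code>pip install foo</code> first.</p>
<pre><code>import foo
foo.run()</code></pre>
<ul><li>Fast</li><li>Small</li></ul>
</main>
<footer>Copyright</footer>
</body></html>
"""

print(html_to_markdown(sample))
```

On the sample page, the nav link and footer text are dropped, while the heading, the inline code span, the code block, and the list items come through as Markdown. A real extractor needs far more than a fixed tag list (content scoring, table handling, nested structures), so treat this only as an illustration of the shape of the problem.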

I built this after spending far too much time fighting with generic scrapers that failed to produce clean, usable Markdown. With Docsingest, you can quickly transform a messy developer docs webpage into a clean, structured format optimized for LLM ingestion.

I'm eager for any feedback from the community.

Thanks for taking a look!