v1.0.9 #66
Replies: 3 comments
I'm eagerly looking forward to the release of the Linux x64 AppImage of the v1.0.9 GUI version (Linux Mint user here)! Thank you for continuing to improve and provide this awesome website crawler/exporter.
Newbie question: what's the best/recommended way to update 1.0.8 to 1.0.9 under Linux Mint without trashing existing configs? Thanks.
Hey 👋🏼 @janreges. I spent some time today with this crawler's command line and GUI, and I just wanted to say: really nice work. Thank you!
This release introduces a powerful new Website to Markdown converter, allowing you to export entire websites into clean, single or multiple Markdown files, which is ideal for AI context or documentation purposes. We've also added the ability to start crawling directly from a `sitemap.xml` file, and we significantly enhanced the Offline Website Exporter with more granular control and better handling of international characters. Numerous new command-line options have been added for greater flexibility in crawling, filtering, and reporting, alongside many other improvements and bug fixes.

### New Features
- New Website to Markdown converter (`html2markdown`).
- `--markdown-export-single-file` to combine all website content into a single, organized Markdown file, with smart removal of duplicate headers/footers (see the sketch after this list).
- Pass a `sitemap.xml` or sitemap index file directly to the `--url` parameter to crawl all listed URLs.
- `--resolve` option (like `curl`) to provide custom IP addresses for specific domains and ports.
- New `--extra-columns` option.
- `--max-depth` parameter for limiting how deep the crawler goes (for pages, not assets).
- `--html-report-options` to select which sections to include in the final HTML report.
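As a quick illustration of how a few of these options fit together, here is a minimal sketch. The `./crawler` launcher name, the `--option=value` syntax, and the `host:port:address` value format for `--resolve` (mirroring `curl`) are assumptions, not confirmed by these notes; check `--help` on your install.

```bash
# Export a whole site into a single, organized Markdown file
# (whether the flag also accepts an output path is not stated in the notes):
./crawler --url=https://example.com/ --markdown-export-single-file

# Start the crawl from a sitemap (or sitemap index) and limit page depth:
./crawler --url=https://example.com/sitemap.xml --max-depth=2

# Map a domain to a custom IP address and port, like curl's --resolve:
./crawler --url=https://example.com/ --resolve=example.com:443:127.0.0.1
```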
### Improvements

- New `--offline-export-remove-unwanted-code` option to automatically strip analytics, cookie consents, and other non-essential scripts (see the sketch after this list).
- New `--offline-export-no-auto-redirect-html` flag to prevent the creation of meta-refresh redirect files.
- `--transform-url` to internally change request URLs, useful for crawling sites that serve content from a different domain (e.g., a local instance).
- New `--max-non200-responses-per-basename` option to prevent getting stuck in loops with dynamically generated error pages.
- `--timezone` for all dates and times displayed in reports and used in exported filenames.
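A similarly hedged sketch of the enhanced Offline Website Exporter. The `--offline-export-dir` parameter and the exact value syntax of the new flags are assumptions based on the tool's usual `--option=value` style, not taken from these notes.

```bash
# Offline export that strips analytics/cookie-consent scripts, skips the
# creation of meta-refresh redirect files, and renders report times in a
# chosen timezone (Europe/Prague is just an example IANA timezone name):
./crawler --url=https://example.com/ \
  --offline-export-dir=./example.com-offline \
  --offline-export-remove-unwanted-code \
  --offline-export-no-auto-redirect-html \
  --timezone=Europe/Prague
```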
This discussion was created from the release v1.0.9.