[Question]: Preserving hyperlinks when converting PDF to markdown or json #20308

@sarah114tran

Question Validation

  • I have searched both the documentation and discord for an answer.

Question

Hi! I am trying to extract the citations from a series of PDFs. I need to convert each original PDF into either markdown or JSON, but in both output formats the hyperlinks are not preserved. Please let me know if you have any suggestions!

This is my current code for parsing the PDFs:

import os

from llama_parse import LlamaParse

# folder_path, output_folder, and LLAMA_CLOUD_API_KEY are assumed to be defined earlier.
all_markdown = []

# Create the parser once and reuse it for every file.
parser = LlamaParse(
    api_key=LLAMA_CLOUD_API_KEY,
    result_type="markdown",
    annotate_links=True,
)

for filename in os.listdir(folder_path):
    # Skip anything that is not a PDF.
    if not filename.lower().endswith(".pdf"):
        continue

    file_path = os.path.join(folder_path, filename)

    try:
        docs = parser.load_data(file_path)
        md_text = docs[0].text

        # Save the markdown next to the other outputs, named after the PDF.
        base_name = os.path.splitext(filename)[0]
        md_filename = base_name + ".md"
        md_path = os.path.join(output_folder, md_filename)

        with open(md_path, "w", encoding="utf-8") as f:
            f.write(md_text)

        print(f"Saved markdown: {md_path}")
        all_markdown.append({"filename": base_name, "markdown": md_text})

    except Exception as e:
        print(f"Error processing {file_path}: {e}")
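
To confirm whether any hyperlinks survive in the parsed output, one option is to scan the markdown text for inline link syntax. The following is a minimal sketch, not part of LlamaParse: `find_markdown_links` is a hypothetical helper, and it assumes that preserved links would appear in the inline form `[label](url)` with an `http` or `https` URL.

```python
import re
from typing import List, Tuple

# Matches inline markdown links such as [label](https://example.com).
# Assumption: preserved links, if any, use this inline form.
MD_LINK = re.compile(r"\[([^\]]+)\]\((https?://[^\s)]+)\)")


def find_markdown_links(md_text: str) -> List[Tuple[str, str]]:
    """Return (label, url) pairs for every inline link found in md_text."""
    return MD_LINK.findall(md_text)


# Illustrative input only; in the loop above this would be md_text.
sample = "See [the paper](https://example.com/paper.pdf) and some plain text."
links = find_markdown_links(sample)
print(links)
```

An empty result for a PDF that is known to contain links would suggest the links were dropped during parsing rather than lost when writing the file.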
