Open
Labels
question: Further information is requested
Description
Question Validation
- I have searched both the documentation and discord for an answer.
Question
Hi! I am trying to extract the citations from a series of PDFs. To do that, I need to convert each original PDF to either markdown or JSON, but in both output formats the hyperlinks from the PDF are not preserved. Please let me know if you have any suggestions!
This is my current code for parsing the PDFs:
    import os

    from llama_parse import LlamaParse

    # folder_path, output_folder, and LLAMA_CLOUD_API_KEY are defined elsewhere.
    # Create the parser once and reuse it for every file.
    parser = LlamaParse(
        api_key=LLAMA_CLOUD_API_KEY,
        result_type="markdown",
        annotate_links=True,  # request link annotation
    )

    all_markdown = []
    for filename in os.listdir(folder_path):
        # Skip anything that is not a PDF.
        if not filename.lower().endswith(".pdf"):
            continue
        file_path = os.path.join(folder_path, filename)
        try:
            docs = parser.load_data(file_path)
            md_text = docs[0].text  # text of the first returned document
            base_name = os.path.splitext(filename)[0]
            md_filename = base_name + ".md"
            md_path = os.path.join(output_folder, md_filename)
            # Write the parsed markdown next to the other outputs.
            with open(md_path, "w", encoding="utf-8") as f:
                f.write(md_text)
            print(f"Saved markdown: {md_path}")
            all_markdown.append({"filename": base_name, "markdown": md_text})
        except Exception as e:
            print(f"Error processing {file_path}: {e}")
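To check whether any hyperlinks actually survive in the parsed output, one option is to scan each `md_text` for link syntax. The following is a minimal sketch, not part of LlamaParse: `extract_links`, `MD_LINK`, and `BARE_URL` are hypothetical names introduced here, and the sketch assumes links would appear either as inline markdown links (`[text](url)`) or as bare `http(s)` URLs.

```python
import re

# Inline markdown links: [text](url). The URL part stops at whitespace or ")".
MD_LINK = re.compile(r"\[([^\]]+)\]\((https?://[^\s)]+)\)")
# Bare URLs that are not wrapped in markdown link syntax.
BARE_URL = re.compile(r"https?://[^\s)\]>]+")


def extract_links(md_text):
    """Return (anchor_text, url) pairs found in a markdown string.

    Bare URLs are returned with an empty anchor text.
    """
    links = [(m.group(1), m.group(2)) for m in MD_LINK.finditer(md_text)]
    seen = {url for _, url in links}
    for m in BARE_URL.finditer(md_text):
        # Drop sentence punctuation that the pattern may have swallowed.
        url = m.group(0).rstrip(".,;")
        if url not in seen:
            seen.add(url)
            links.append(("", url))
    return links


sample = "See [Smith 2020](https://example.org/smith) and https://example.org/raw."
print(extract_links(sample))
```

If this returns an empty list for a PDF that is known to contain hyperlinks, the links were most likely dropped at parse time rather than during any later processing of the markdown.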