[Python] PDF to text

Posted by Albert [2025-11-05]

Use this when you need to extract the text content of a PDF file.

For this, let's try the PyMuPDF library.


1. Install: pip install PyMuPDF

PS C:\Users\visio\IdeaProjects\llm> pip install PyMuPDF
Collecting PyMuPDF
Obtaining dependency information for PyMuPDF from https://files.pythonhosted.org/packages/c6/96/fd59c1532891762ea4815e73956c532053d5e26d56969e1e5d1e4ca4b207/pymupdf-1.26.5-cp39-abi3-win_amd64.whl.metadata
Downloading pymupdf-1.26.5-cp39-abi3-win_amd64.whl.metadata (3.4 kB)
Downloading pymupdf-1.26.5-cp39-abi3-win_amd64.whl (18.7 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.7/18.7 MB 11.3 MB/s eta 0:00:00
Installing collected packages: PyMuPDF
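
The log above stops at the "Installing collected packages" line, so it can be worth confirming that the package actually landed in the current environment. A minimal sketch using only the standard library; the helper name installed_version is made up for this example, and it assumes the distribution is registered under the name "PyMuPDF" as in the pip command:

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent.

    installed_version is a hypothetical helper name, not part of PyMuPDF.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the install succeeded, otherwise None.
print(installed_version("PyMuPDF"))
```

If this prints None, the pip command probably ran against a different Python interpreter than the one running the script.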


2. Usage

import pymupdf
import os

# Path of the PDF to extract text from
pdf_file_path = "data/sample.pdf"
doc = pymupdf.open(pdf_file_path)

full_text = ''

for page in doc:  # loop over the pages of the document
    text = page.get_text()  # extract the text of this page
    full_text += text

pdf_file_name = os.path.basename(pdf_file_path)
pdf_file_name = os.path.splitext(pdf_file_name)[0]

# Create the output folder if it does not exist yet
os.makedirs("output", exist_ok=True)
txt_file_path = f"output/{pdf_file_name}.txt"
with open(txt_file_path, 'w', encoding='utf-8') as f:
    f.write(full_text)
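
The extraction loop above can also be wrapped in a small function. The following is only a sketch of one way to do it: the names join_page_texts and pdf_to_text are made up for this example, and pymupdf is imported inside the function on purpose so that the joining helper can be tried without the library or a PDF at hand. It joins the page texts in one go instead of using += in a loop, and opens the document in a with block so it is closed afterwards.

```python
def join_page_texts(pages, separator=""):
    """Concatenate the text of each page-like object.

    A "page" here is anything with a get_text() method, which is what
    PyMuPDF pages provide. join_page_texts is a hypothetical helper name.
    """
    return separator.join(page.get_text() for page in pages)


def pdf_to_text(pdf_path):
    """Return the text of every page of the PDF at pdf_path as one string.

    pdf_to_text is a hypothetical name. The import sits inside the function
    so that join_page_texts stays usable without PyMuPDF installed.
    """
    import pymupdf

    # The document is closed automatically when the with block ends.
    with pymupdf.open(pdf_path) as doc:
        return join_page_texts(doc)
```

Calling pdf_to_text("data/sample.pdf") should then give the same string as full_text in the script above. Passing a separator such as "\n\n" to join_page_texts is one way to keep page boundaries visible in the output.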


3. Result: the text extracted from the PDF is saved as sample.txt in the output folder.
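
How data/sample.pdf turns into output/sample.txt can be sketched with pathlib as an alternative to the os.path calls. The names txt_path_for and save_text are made up for this example, and the default folder "output" simply mirrors the one used in this post:

```python
from pathlib import Path


def txt_path_for(pdf_path, out_dir="output"):
    """Map a PDF path to its .txt path inside out_dir (hypothetical helper)."""
    # Path.stem is the file name without its last extension: "sample.pdf" -> "sample"
    return Path(out_dir) / (Path(pdf_path).stem + ".txt")


def save_text(text, pdf_path, out_dir="output"):
    """Write text next to the derived name, creating out_dir if needed."""
    target = txt_path_for(pdf_path, out_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return target


print(txt_path_for("data/sample.pdf").name)  # sample.txt
```

Because the output name comes only from the file name, two PDFs with the same name in different folders would overwrite each other's .txt file.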


End




