Update pdf.py PageObject.extractText() #334

Merged
MartinThoma merged 3 commits into py-pdf:master from jusdino:patch-1
Apr 7, 2022
@jusdino (Contributor) commented Mar 19, 2017

These changes allow for an optional text separator for TJ and Tj operators.

These source alterations were originally suggested in StackOverflow at:
http://stackoverflow.com/questions/11017379/pypdf-ignores-newlines-in-pdf-file
by DSM

I'm just passing along the good suggestion in hopes that the change may become standard in some future version.

jusdino and others added 2 commits March 19, 2017 10:41
@MartinThoma added the PdfReader (The PdfReader component is affected) and Feature labels Apr 6, 2022

@MartinThoma (Member)

Do you have an example where something other than a single whitespace would be desired?

@MartinThoma (Member)

By the way: Sorry that it took so long to respond! I realize that you probably don't even remember this PR.

Also: Don't worry about the failing tests; that is expected for this PR.

@jusdino (Contributor, Author) commented Apr 7, 2022

Yeah, this was a while ago... Ok, I resurrected the project I was working on.

I was trying to extract text from some form-style PDF pages in which newlines separated the pieces of text I was interested in, so I called page.extractText(Tj_sep='\n') to get the output organized the way I needed.
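
To illustrate why a configurable separator helps, here is a minimal sketch of the idea. It is not the actual PyPDF2 code: the helper `join_text_operations` and the list of `(operator, operand)` pairs are hypothetical stand-ins for a parsed content stream, and both the default values and the choice to put the separator before each string are assumptions made for illustration.

```python
def join_text_operations(operations, Tj_sep="", TJ_sep=" "):
    """Concatenate the strings shown by Tj and TJ operators.

    `operations` is a simplified, hypothetical representation of a page's
    content stream: a list of (operator, operand) pairs. The defaults for
    `Tj_sep` and `TJ_sep` are assumptions, not confirmed library defaults.
    """
    text = ""
    for operator, operand in operations:
        if operator == "Tj":
            # Tj shows a single string.
            text += Tj_sep + operand
        elif operator == "TJ":
            # TJ shows an array that mixes strings with numeric
            # position adjustments; only the strings carry text.
            for item in operand:
                if isinstance(item, str):
                    text += TJ_sep + item
    return text


# A made-up "form" page: two Tj strings followed by one TJ array.
form_page = [
    ("Tj", "Name:"),
    ("Tj", "Jane Doe"),
    ("TJ", ["Springfield,", -250, "USA"]),
]

default_text = join_text_operations(form_page)
newline_text = join_text_operations(form_page, Tj_sep="\n")
```

With the assumed defaults, the two Tj strings run together; passing `Tj_sep="\n"` puts each one on its own line, which is the behavior I needed for form-style pages.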

@MartinThoma merged commit 12c7047 into py-pdf:master Apr 7, 2022
MartinThoma added a commit that referenced this pull request Apr 7, 2022
Features:

 - Add alpha channel support for png files in Script (#614)

Bug fixes (BUG):

 - Fix formatWarning for filename without slash (#612)
 - Add whitespace between words for extractText() (#569, #334)
 - "invalid escape sequence" SyntaxError (#522)
 - Avoid error when printing warning in pythonw (#486)
 - Stream operations can be List or Dict (#665)

Documentation (DOC):

 - Added Scripts/pdf-image-extractor.py
 - Documentation improvements (#550, #538, #324, #426, #394)

Tests and Test setup (TST):

 - Add GitHub Action which automatically runs unit tests via pytest and
   static code analysis with Flake8 (#660)
 - Add several unit tests (#661, #663)
 - Add .coveragerc to create coverage reports

Developer Experience Improvements (DEV):

 - Pre commit: Developers can now `pre-commit install` to avoid tiny issues
               like trailing whitespaces

Miscellaneous:

 - Add the LICENSE file to the distributed packages (#288)
 - Use setuptools instead of distutils (#599)
 - Improvements for the PyPI page (#644)
 - Python 3 changes (#504, #366)

You can see the full changelog at: 1.26.0...1.27.0
@jusdino deleted the patch-1 branch April 8, 2022 02:32
@MartinThoma added the is-feature (A feature request) label and removed the Feature label Jun 26, 2022