Update pdf.py PageObject.extractText() #334
Merged
MartinThoma merged 3 commits into py-pdf:master from jusdino:patch-1 on Apr 7, 2022
Conversation
These changes allow for an optional text separator for TJ and Tj operators. These source alterations were originally suggested by DSM on Stack Overflow at: http://stackoverflow.com/questions/11017379/pypdf-ignores-newlines-in-pdf-file. I'm just passing along the good suggestion in hopes that the change may become standard in some future version.
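For illustration only, here is a minimal sketch of the idea behind an optional separator. It is not the merged diff: the function name `join_text_operations`, the parameters `tj_sep` and `tj_array_sep`, and the `(operands, operator)` tuples are hypothetical stand-ins for the parsed content stream, and plain `str` is used in place of the library's text string type.

```python
from typing import Iterable, List, Sequence, Tuple

# Hypothetical stand-in for one parsed content-stream operation:
# (operands, operator), e.g. (["Hello"], "Tj").
Operation = Tuple[Sequence[object], str]


def join_text_operations(
    operations: Iterable[Operation],
    tj_sep: str = "",
    tj_array_sep: str = " ",
) -> str:
    """Concatenate text from Tj/TJ operations, adding a separator after each.

    Both separator parameters are assumptions made for this sketch.
    """
    parts: List[str] = []
    for operands, operator in operations:
        if operator == "Tj":
            # Tj shows a single string operand.
            text = operands[0]
            if isinstance(text, str):
                parts.append(text + tj_sep)
        elif operator == "TJ":
            # TJ takes an array mixing strings with numeric position
            # adjustments; only the strings contribute text.
            strings = [item for item in operands[0] if isinstance(item, str)]
            parts.append("".join(strings) + tj_array_sep)
    return "".join(parts)


# Made-up sample data resembling fields on a form-like page.
SAMPLE_OPS = [
    (["Name:"], "Tj"),
    (["Jane"], "Tj"),
    ([["Do", -250, "e"]], "TJ"),
]

if __name__ == "__main__":
    # Passing "\n" for both separators keeps each text run on its own line.
    print(repr(join_text_operations(SAMPLE_OPS, tj_sep="\n", tj_array_sep="\n")))
```

Keeping the separators as keyword arguments with defaults means existing callers would see no change unless they opt in.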
Member
Do you have an example where something other than a single space would be desired?
Member
By the way: Sorry that it took so long to react! I do realize that you probably don't even remember this PR. Also: Don't worry about the failing tests; that is expected for this PR.
Contributor
Author
Yeah, this was a while ago... OK, I resurrected the project I was working on. I was trying to extract text from some form-formatted PDF pages which had newlines separating the text I was interested in, so I used
MartinThoma reviewed Apr 7, 2022
MartinThoma added a commit that referenced this pull request Apr 7, 2022
Features:
- Add alpha channel support for png files in Script (#614)

Bug fixes (BUG):
- Fix formatWarning for filename without slash (#612)
- Add whitespace between words for extractText() (#569, #334)
- "invalid escape sequence" SyntaxError (#522)
- Avoid error when printing warning in pythonw (#486)
- Stream operations can be List or Dict (#665)

Documentation (DOC):
- Added Scripts/pdf-image-extractor.py
- Documentation improvements (#550, #538, #324, #426, #394)

Tests and Test setup (TST):
- Add GitHub Action which automatically runs unit tests via pytest and static code analysis with Flake8 (#660)
- Add several unit tests (#661, #663)
- Add .coveragerc to create coverage reports

Developer Experience Improvements (DEV):
- Pre commit: Developers can now `pre-commit install` to avoid tiny issues like trailing whitespaces

Miscellaneous:
- Add the LICENSE file to the distributed packages (#288)
- Use setuptools instead of distutils (#599)
- Improvements for the PyPI page (#644)
- Python 3 changes (#504, #366)

You can see the full changelog at: 1.26.0...1.27.0