Raise error when id tag doesn't match filename book id#141
Raise error when id tag doesn't match filename book id#141mshannon-sil merged 3 commits intomainfrom
Conversation
ddaspit
left a comment
There was a problem hiding this comment.
Reviewable status: 0 of 11 files reviewed, 1 unresolved discussion (waiting on @mshannon-sil)
machine/corpora/paratext_backup_text_corpus.py line 33 at r1 (raw file):
settings.name, ) with text.get_rows() as rows:
We purposefully avoid parsing the book in the constructor. We want to avoid parsing errors in books that we filter out when actually iterating over the corpus. Can we perform this check when iterating over the corpus, i.e. when get_rows is called?
mshannon-sil
left a comment
There was a problem hiding this comment.
Reviewable status: 0 of 11 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @mshannon-sil)
machine/corpora/paratext_backup_text_corpus.py line 33 at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
We purposefully avoid parsing the book in the constructor. We want to avoid parsing errors in books that we filter out when actually iterating over the corpus. Can we perform this check when iterating over the corpus, i.e. when
get_rowsis called?
That's fair, and yes I took a look and we should be able to do a similar check in get_rows, comparing the ref book to the text_id for each row. That means the check would happen for each row rather than once per SFM file, but it sounds like that's worth it to avoid parsing books in the constructor that we would have filtered out.
ddaspit
left a comment
There was a problem hiding this comment.
Reviewable status: 0 of 11 files reviewed, 1 unresolved discussion (waiting on @mshannon-sil)
machine/corpora/paratext_backup_text_corpus.py line 33 at r1 (raw file):
Previously, mshannon-sil wrote…
That's fair, and yes I took a look and we should be able to do a similar check in
get_rows, comparing the ref book to the text_id for each row. That means the check would happen for each row rather than once per SFM file, but it sounds like that's worth it to avoid parsing books in the constructor that we would have filtered out.
It would be ideal if we could just perform the check when we hit the \id marker.
mshannon-sil
left a comment
There was a problem hiding this comment.
Reviewable status: 0 of 14 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @mshannon-sil)
machine/corpora/paratext_backup_text_corpus.py line 33 at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
It would be ideal if we could just perform the check when we hit the
\idmarker.
Done.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #141 +/- ##
==========================================
+ Coverage 88.30% 88.33% +0.03%
==========================================
Files 275 275
Lines 16171 16192 +21
==========================================
+ Hits 14280 14304 +24
+ Misses 1891 1888 -3 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
ddaspit
left a comment
There was a problem hiding this comment.
Reviewed 14 of 14 files at r2, all commit messages.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on @mshannon-sil)
60c78d6 to
894e2ba
Compare
This PR addresses sillsdev/silnlp#574. ParatextTextCorpus and ParatextBackupTextCorpus now raise a ValueError if the book id in the filename and the \id tag inside the file don't match for a given SFM file in the Paratext project. For example, if the filename is 07JDG.SFM but the \id tag is JUD, this will now raise an error during initialization, whereas before initialization would succeed without any message to the user and would use the incorrect \id tag for that book's verse refs. I added two error messages, one for if the \id tag itself is invalid, and another for if the \id tag is valid but does not match the book id in the filename.
This change is