Raise error when id tag doesn't match filename book id by mshannon-sil · Pull Request #141 · sillsdev/machine.py

mshannon-sil · 2024-11-11T23:43:18Z

This PR addresses sillsdev/silnlp#574. ParatextTextCorpus and ParatextBackupTextCorpus now raise a ValueError if, for any SFM file in the Paratext project, the book id in the filename does not match the \id tag inside the file. For example, if the filename is 07JDG.SFM but the \id tag is JUD, initialization now raises an error. Previously, initialization succeeded silently and the incorrect \id tag was used for that book's verse refs. I added two error messages: one for when the \id tag itself is invalid, and another for when the \id tag is valid but does not match the book id in the filename.
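
To make the two error cases concrete, here is a minimal sketch of the check. It is not the code in this PR: the helper names, the filename pattern (two digits followed by a three-character book id, as in 07JDG.SFM), the error wording, and the small set of book ids are all assumptions made for illustration.

```python
import re

# Illustrative subset only; a real check would use the full list of book ids.
VALID_BOOK_IDS = {"GEN", "EXO", "JOS", "JDG", "RUT", "JUD"}


def book_id_from_filename(filename: str) -> str:
    """Hypothetical helper: assumes names shaped like '07JDG.SFM'."""
    match = re.match(r"\d{2}([A-Z0-9]{3})", filename.upper())
    if match is None:
        raise ValueError(f"Cannot determine the book id from filename {filename!r}.")
    return match.group(1)


def read_id_tag(sfm_text: str) -> str:
    """Return the book id that follows the first \\id marker, or '' if absent."""
    for line in sfm_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "\\id":
            return parts[1].upper() if len(parts) > 1 else ""
    return ""


def check_book_id(filename: str, sfm_text: str) -> str:
    """Raise ValueError for an invalid \\id tag or one that disagrees with the filename."""
    expected = book_id_from_filename(filename)
    found = read_id_tag(sfm_text)
    if found not in VALID_BOOK_IDS:
        # Case 1: the \id tag itself is not a recognized book id.
        raise ValueError(f"{filename}: the \\id tag {found!r} is not a valid book id.")
    if found != expected:
        # Case 2: the \id tag is valid but names a different book than the filename.
        raise ValueError(
            f"{filename}: the \\id tag {found} does not match the filename book id {expected}."
        )
    return found
```

With these assumptions, `check_book_id("07JDG.SFM", "\\id JUD")` hits the second case, because JUD is a valid book id but differs from JDG.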

ddaspit

Reviewable status: 0 of 11 files reviewed, 1 unresolved discussion (waiting on @mshannon-sil)

machine/corpora/paratext_backup_text_corpus.py line 33 at r1 (raw file):

                        settings.name,
                    )
                    with text.get_rows() as rows:

We purposefully avoid parsing the book in the constructor. We want to avoid parsing errors in books that we filter out when actually iterating over the corpus. Can we perform this check when iterating over the corpus, i.e. when get_rows is called?
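
As a rough illustration of why deferring the parse matters, the sketch below uses hypothetical stand-in classes (LazyText and LazyCorpus are not machine.py's classes). The constructor only stores its inputs, so a broken book raises an error only if it is actually iterated.

```python
from typing import Iterable, Iterator, Optional, Set


class LazyText:
    """Hypothetical per-book text: nothing is parsed in the constructor."""

    def __init__(self, book_id: str, content: str) -> None:
        self.id = book_id
        self._content = content

    def get_rows(self) -> Iterator[str]:
        # Parsing, and therefore any parse error, happens only on iteration.
        for line in self._content.splitlines():
            if not line.startswith("\\"):
                raise ValueError(f"Unparseable line in {self.id}: {line!r}")
            yield line


class LazyCorpus:
    """Hypothetical corpus that filters books before touching their content."""

    def __init__(self, texts: Iterable[LazyText]) -> None:
        self._texts = {text.id: text for text in texts}

    def get_rows(self, text_ids: Optional[Set[str]] = None) -> Iterator[str]:
        for text_id, text in self._texts.items():
            if text_ids is not None and text_id not in text_ids:
                continue  # filtered-out books are never parsed
            yield from text.get_rows()


corpus = LazyCorpus(
    [
        LazyText("GEN", "\\id GEN\n\\v 1 In the beginning"),
        LazyText("JDG", "this line has no marker"),  # deliberately broken
    ]
)
# Constructing the corpus did not raise, and neither does iterating only GEN.
gen_rows = list(corpus.get_rows({"GEN"}))
```

Iterating the whole corpus, or just JDG, would raise the ValueError at that point rather than at construction time.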

mshannon-sil

Reviewable status: 0 of 11 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @mshannon-sil)

machine/corpora/paratext_backup_text_corpus.py line 33 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

We purposefully avoid parsing the book in the constructor. We want to avoid parsing errors in books that we filter out when actually iterating over the corpus. Can we perform this check when iterating over the corpus, i.e. when get_rows is called?

That's fair, and yes I took a look and we should be able to do a similar check in get_rows, comparing the ref book to the text_id for each row. That means the check would happen for each row rather than once per SFM file, but it sounds like that's worth it to avoid parsing books in the constructor that we would have filtered out.

ddaspit

Reviewable status: 0 of 11 files reviewed, 1 unresolved discussion (waiting on @mshannon-sil)

machine/corpora/paratext_backup_text_corpus.py line 33 at r1 (raw file):

Previously, mshannon-sil wrote…

That's fair, and yes I took a look and we should be able to do a similar check in get_rows, comparing the ref book to the text_id for each row. That means the check would happen for each row rather than once per SFM file, but it sounds like that's worth it to avoid parsing books in the constructor that we would have filtered out.

It would be ideal if we could just perform the check when we hit the \id marker.

mshannon-sil

Reviewable status: 0 of 14 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @mshannon-sil)

machine/corpora/paratext_backup_text_corpus.py line 33 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

It would be ideal if we could just perform the check when we hit the \id marker.

Done.
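
For readers following the thread, the approach settled on here (check once, at the moment the \id marker is encountered during iteration) might look roughly like the sketch below. The function name, the line-based parsing, and the error message are assumptions for illustration and are not taken from the merged code.

```python
from typing import Iterator, Tuple


def iter_verses(expected_book_id: str, sfm_text: str) -> Iterator[Tuple[str, str]]:
    """Hypothetical row iterator that validates the book id at the \\id marker."""
    book_id = None
    for line in sfm_text.splitlines():
        parts = line.strip().split(maxsplit=1)
        if not parts:
            continue
        marker = parts[0]
        text = parts[1] if len(parts) > 1 else ""
        if marker == "\\id":
            # The check runs once per file, here, rather than once per row.
            found = text.split()[0].upper() if text else ""
            if found != expected_book_id:
                raise ValueError(
                    f"The \\id tag {found!r} does not match the expected book id "
                    f"{expected_book_id!r}."
                )
            book_id = found
        elif marker == "\\v" and book_id is not None:
            yield book_id, text


rows = list(iter_verses("JDG", "\\id JDG\n\\c 1\n\\v 1 After the death of Joshua"))
```

Because this is a generator, calling `iter_verses("JDG", "\\id JUD ...")` does not raise by itself; the ValueError appears only when the rows are consumed, which keeps construction free of parsing.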

codecov-commenter · 2024-11-14T21:31:49Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.33%. Comparing base (183fdfb) to head (894e2ba).
⚠️ Report is 96 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #141      +/-   ##
==========================================
+ Coverage   88.30%   88.33%   +0.03%     
==========================================
  Files         275      275              
  Lines       16171    16192      +21     
==========================================
+ Hits        14280    14304      +24     
+ Misses       1891     1888       -3


ddaspit

Reviewed 14 of 14 files at r2, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @mshannon-sil)

mshannon-sil added the bug label Nov 11, 2024

mshannon-sil requested a review from ddaspit November 11, 2024 23:43

mshannon-sil self-assigned this Nov 11, 2024

ddaspit requested changes Nov 12, 2024

mshannon-sil commented Nov 12, 2024

ddaspit reviewed Nov 12, 2024

mshannon-sil commented Nov 14, 2024

ddaspit approved these changes Nov 14, 2024

mshannon-sil added 3 commits November 15, 2024 07:21

raise error when id tag doesn't match filename book id

84a6a34

Revert "raise error when id tag doesn't match filename book id"

bc41e28

This reverts commit 8679b78.

raise error on invalid and mismatched book ids, take 2

894e2ba

johnml1135 force-pushed the id_mismatch branch from 60c78d6 to 894e2ba Compare November 15, 2024 12:21

mshannon-sil merged commit bd8707f into main Nov 15, 2024

mshannon-sil deleted the id_mismatch branch November 15, 2024 14:30

ddaspit moved this from 👀 In review to ✅ Done in SIL-NLP Research Aug 29, 2025

Enkidu93 mentioned this pull request Feb 6, 2026

Port recent Machine updates #264

Merged

mshannon-sil merged 3 commits into main from id_mismatch
