Skip to content

Raise error when id tag doesn't match filename book id#141

Merged
mshannon-sil merged 3 commits intomainfrom
id_mismatch
Nov 15, 2024
Merged

Raise error when id tag doesn't match filename book id#141
mshannon-sil merged 3 commits intomainfrom
id_mismatch

Conversation

@mshannon-sil
Copy link
Copy Markdown
Collaborator

@mshannon-sil mshannon-sil commented Nov 11, 2024

This PR addresses sillsdev/silnlp#574. ParatextTextCorpus and ParatextBackupTextCorpus now raise a ValueError if the book id in the filename and the \id tag inside the file don't match for a given SFM file in the Paratext project. For example, if the filename is 07JDG.SFM but the \id tag is JUD, this will now raise an error during initialization, whereas before initialization would succeed without any message to the user and would use the incorrect \id tag for that book's verse refs. I added two error messages, one for if the \id tag itself is invalid, and another for if the \id tag is valid but does not match the book id in the filename.


This change is Reviewable

@mshannon-sil mshannon-sil requested a review from ddaspit November 11, 2024 23:43
@mshannon-sil mshannon-sil self-assigned this Nov 11, 2024
Copy link
Copy Markdown
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 11 files reviewed, 1 unresolved discussion (waiting on @mshannon-sil)


machine/corpora/paratext_backup_text_corpus.py line 33 at r1 (raw file):

                        settings.name,
                    )
                    with text.get_rows() as rows:

We purposefully avoid parsing the book in the constructor. We want to avoid parsing errors in books that we filter out when actually iterating over the corpus. Can we perform this check when iterating over the corpus, i.e. when get_rows is called?

Copy link
Copy Markdown
Collaborator Author

@mshannon-sil mshannon-sil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 11 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @mshannon-sil)


machine/corpora/paratext_backup_text_corpus.py line 33 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

We purposefully avoid parsing the book in the constructor. We want to avoid parsing errors in books that we filter out when actually iterating over the corpus. Can we perform this check when iterating over the corpus, i.e. when get_rows is called?

That's fair, and yes I took a look and we should be able to do a similar check in get_rows, comparing the ref book to the text_id for each row. That means the check would happen for each row rather than once per SFM file, but it sounds like that's worth it to avoid parsing books in the constructor that we would have filtered out.

Copy link
Copy Markdown
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 11 files reviewed, 1 unresolved discussion (waiting on @mshannon-sil)


machine/corpora/paratext_backup_text_corpus.py line 33 at r1 (raw file):

Previously, mshannon-sil wrote…

That's fair, and yes I took a look and we should be able to do a similar check in get_rows, comparing the ref book to the text_id for each row. That means the check would happen for each row rather than once per SFM file, but it sounds like that's worth it to avoid parsing books in the constructor that we would have filtered out.

It would be ideal if we could just perform the check when we hit the \id marker.

Copy link
Copy Markdown
Collaborator Author

@mshannon-sil mshannon-sil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 14 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @mshannon-sil)


machine/corpora/paratext_backup_text_corpus.py line 33 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

It would be ideal if we could just perform the check when we hit the \id marker.

Done.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Nov 14, 2024

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.33%. Comparing base (183fdfb) to head (894e2ba).
⚠️ Report is 96 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #141      +/-   ##
==========================================
+ Coverage   88.30%   88.33%   +0.03%     
==========================================
  Files         275      275              
  Lines       16171    16192      +21     
==========================================
+ Hits        14280    14304      +24     
+ Misses       1891     1888       -3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 14 of 14 files at r2, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @mshannon-sil)

@mshannon-sil mshannon-sil merged commit bd8707f into main Nov 15, 2024
@mshannon-sil mshannon-sil deleted the id_mismatch branch November 15, 2024 14:30
@ddaspit ddaspit moved this from 👀 In review to ✅ Done in SIL-NLP Research Aug 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Archived in project
Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

3 participants