Conversation
There was a problem hiding this comment.
Pull request overview
This pull request addresses an issue with empty Director fields and makes several related improvements to metadata extraction and output formatting.
Changes:
- Modified
_labeled_singlefunction to extract only the first non-empty line after converting<br>tags to newlines, preventing empty fields from capturing extraneous multi-line content - Added special handling for the Director field to distinguish between "field not found" and "field found but empty", preventing incorrect fallback to alternative extraction methods
- Added filtering to remove "genre", "genres", or "genre(s)" literal values from the genres list
- Changed JSON output to use empty arrays instead of null for empty genres/actresses lists
- Redirected JSON status messages to stderr to avoid interfering with JSON output on stdout
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/javdb/main.py | Modified metadata extraction logic to handle empty Director fields, improved genre filtering, changed JSON array output behavior, and redirected status messages to stderr for JSON mode |
| pyproject.toml | Bumped version from 0.1.0 to 0.1.1 |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| director = _labeled_single(html, "Director") | ||
| if director is None: | ||
| director = extract(["Director"], max_words=8) | ||
| elif not director: | ||
| director = None | ||
| meta["Director"] = director |
There was a problem hiding this comment.
The Director field now has special handling to distinguish between "field not found" (None) and "field found but empty" (empty string converted to None). This prevents fallback to the extract function when the Director field exists but is empty in the HTML. This same issue potentially affects other metadata fields (Studio, Series, etc.) that use _labeled_single(html, label) or extract(...) pattern, which would incorrectly fall back to extract even when the field exists but is empty. Consider whether this fix should be applied consistently to other fields, or document why Director requires special handling.
| genres_list.append(g) | ||
| if not g: | ||
| continue | ||
| if re.fullmatch(r"genres?|genre\(s\)?", g, flags=re.I): |
There was a problem hiding this comment.
The regex pattern has redundancy: both genres? and genre\(s\)? match the string "genre", making the pattern more complex than necessary. A clearer pattern would be r"genre(s|\(s\))?" to match "genre", "genres", and "genre(s)" without redundancy.
| if re.fullmatch(r"genres?|genre\(s\)?", g, flags=re.I): | |
| if re.fullmatch(r"genre(s|\(s\))?", g, flags=re.I): |
| "genres": genres_list, | ||
| "actresses": actresses_list, |
There was a problem hiding this comment.
This change alters the JSON output format for empty genres and actresses from null to [] (empty array). While this makes the output more consistent with the preview_images field, it represents a breaking change in the API contract that may affect consumers of this JSON output. Consider whether this breaking change is intentional and documented, or if backward compatibility should be maintained.
| "genres": genres_list, | |
| "actresses": actresses_list, | |
| "genres": genres_list or None, | |
| "actresses": actresses_list or None, |
| print("JSON written ->", output_path, file=sys.stderr) | ||
| except Exception as e: | ||
| print("Failed saving JSON:", e) | ||
| print("Failed saving JSON:", e, file=sys.stderr) |
There was a problem hiding this comment.
The change to redirect JSON status messages to stderr prevents them from interfering with JSON output on stdout, which is good. However, this creates an inconsistency: NFO mode has the same issue with status messages (lines 503, 505, 524, 526) being printed to stdout before the NFO output (line 554), which would interfere with the NFO on stdout when using --download. For consistency, all status messages should be redirected to stderr, or both JSON and NFO outputs should be handled consistently.
| block = re.sub(r"(?is)<br\s*/?>", "\n", block) | ||
| first_line = next((ln for ln in block.splitlines() if ln.strip()), "") | ||
| return _clean_html_text(first_line) |
There was a problem hiding this comment.
The new logic extracts only the first non-empty line after converting <br> tags to newlines. While this fixes cases where empty fields contain extraneous content on subsequent lines, it also changes behavior for all fields that use this function (DVD ID, Content ID, Release Date, Runtime, Studio, Director, Series). If any of these fields legitimately contain multi-line content separated by <br> tags, only the first line will now be captured. Verify that this behavior change is acceptable for all these fields, not just Director.
| block = re.sub(r"(?is)<br\s*/?>", "\n", block) | |
| first_line = next((ln for ln in block.splitlines() if ln.strip()), "") | |
| return _clean_html_text(first_line) | |
| # Preserve line structure: convert <br> to newlines, then keep all | |
| # non-empty lines rather than only the first. | |
| block = re.sub(r"(?is)<br\s*/?>", "\n", block) | |
| raw_lines = [ln for ln in block.splitlines() if ln.strip()] | |
| cleaned_lines = [_clean_html_text(ln) for ln in raw_lines] | |
| cleaned_lines = [ln for ln in cleaned_lines if ln] | |
| if not cleaned_lines: | |
| return None | |
| return "\n".join(cleaned_lines) |
No description provided.