Hi @remram44 any interest in pursuing this? ROCK seems to be under active development - new links are:
I restarted and the error went away, so perhaps something got corrupted. I had left the server running while I upgraded Homebrew packages, which might have interfered with Python. I also set include-system-site-packages = true in pyvenv.cfg. At any rate it now works, thanks.
I am using a taguette-config.py file. My db is named sqlite:///taguette.sqlite3 (local to that directory). This is on a Mac M2 with Taguette 1.4.1 from pip.
If I go to the UI and select Project->Export Project, it throws an exception:
2024-04-25 13:12:15,600 INFO: Connecting to SQL database 'sqlite:////var/folders/ht/l1bt4z0x5fb5j5crxp4_wv5m0000gn/T/taguette_export_o_rom58_/db.sqlite3'
2024-04-25 13:12:15,603 WARNING: The tables don't seem to exist; creating
2024-04-25 13:12:15,624 INFO: Context impl SQLiteImpl.
2024-04-25 13:12:15,624 INFO: Will assume non-transactional DDL.
2024-04-25 13:12:15,633 INFO: Running stamp_revision -> db5e31a0233d
2024-04-25 13:12:15,635 INFO: Context impl SQLiteImpl.
2024-04-25 13:12:15,635 INFO: Will assume non-transactional DDL.
2024-04-25 13:12:15,659 ERROR: Uncaught exception GET /project/2/export/project.sqlite3 (127.0.0.1)
HTTPServerRequest(protocol='http', host='localhost:7465', method='GET', uri='/project/2/export/project.sqlite3', version='HTTP/1.1', remote_ip='127.0.0.1')
Traceback (most recent call last):
File "/Users/nernst/Documents/projects/sloan-td/taguette.virtualenv.arm/lib/python3.11/site-packages/tornado/web.py", line 1790, in _execute
result = await result
^^^^^^^^^^^^
File "/Users/nernst/Documents/projects/sloan-td/taguette.virtualenv.arm/lib/python3.11/site-packages/taguette/web/export.py", line 306, in get
database.copy_project(
File "/opt/homebrew/Cellar/python@3.11/3.11.6/Frameworks/Python.framework/Versions/3.11/lib/python3.11/contextlib.py", line 81, in inner
File "/Users/nernst/Documents/projects/sloan-td/taguette.virtualenv.arm/lib/python3.11/site-packages/taguette/database/copy.py", line 62, in copy_project
mapping_document = copy(
^^^^^
File "/Users/nernst/Documents/projects/sloan-td/taguette.virtualenv.arm/lib/python3.11/site-packages/taguette/database/copy.py", line 24, in copy
return copy_table(
^^^^^^^^^^^
File "/Users/nernst/Documents/projects/sloan-td/taguette.virtualenv.arm/lib/python3.11/site-packages/taguette/database/copy.py", line 251, in copy_table
if not validators[key](value):
^^^^^^^^^^^^^^^^^^^^^^
File "/Users/nernst/Documents/projects/sloan-td/taguette.virtualenv.arm/lib/python3.11/site-packages/taguette/convert.py", line 256, in is_html_safe
cleaned = bleach.clean(
^^^^^^^^^^^^^
File "/Users/nernst/Documents/projects/sloan-td/taguette.virtualenv.arm/lib/python3.11/site-packages/bleach/__init__.py", line 74, in clean
cleaner = Cleaner(
^^^^^^^^
File "/Users/nernst/Documents/projects/sloan-td/taguette.virtualenv.arm/lib/python3.11/site-packages/bleach/sanitizer.py", line 132, in __init__
self.walker = html5lib_shim.getTreeWalker("etree")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/nernst/Documents/projects/sloan-td/taguette.virtualenv.arm/lib/python3.11/site-packages/bleach/_vendor/html5lib/treewalkers/__init__.py", line 57, in getTreeWalker
from . import etree
File "/Users/nernst/Documents/projects/sloan-td/taguette.virtualenv.arm/lib/python3.11/site-packages/bleach/_vendor/html5lib/treewalkers/etree.py", line 8, in <module>
from . import base
File "/Users/nernst/Documents/projects/sloan-td/taguette.virtualenv.arm/lib/python3.11/site-packages/bleach/_vendor/html5lib/treewalkers/base.py", line 3, in <module>
from xml.dom import Node
ModuleNotFoundError: No module named 'xml.dom'
2024-04-25 13:12:15,662 ERROR: 500 GET /project/2/export/project.sqlite3 (127.0.0.1) 75.84ms (admin) lang=en-CA,en-US;q=0.9,en;q=0.8
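For anyone hitting the same `ModuleNotFoundError`, here is a minimal diagnostic sketch to run with the virtualenv's own interpreter. It only reports where (or whether) the standard-library `xml` modules resolve. The helper name `describe_module` is made up for illustration, and the script does not establish the cause of the failure above.

```python
import importlib.util
import sys


def describe_module(name):
    """Return a one-line description of where `name` would be imported from."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError) as exc:
        # Raised when a parent package is missing or is not a real package.
        return f"{name}: lookup failed ({exc})"
    if spec is None:
        return f"{name}: not found"
    return f"{name}: {spec.origin}"


if __name__ == "__main__":
    # Show which interpreter and environment are actually in use.
    print("interpreter:", sys.executable)
    print("prefix:", sys.prefix, "| base prefix:", sys.base_prefix)
    for module_name in ("xml", "xml.dom", "xml.etree.ElementTree"):
        print(describe_module(module_name))
```

If `xml.dom` is reported as not found, or resolves somewhere outside the Python installation the virtualenv was created from, recreating the virtualenv is probably the simplest next step.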
There might be people relying on the heuristic I guess?
Honestly, from an effort point of view it's probably simpler to make people do the conversion outside Taguette, using Calibre, Pandoc, etc. Then you could support only HTML and plain text formats and remove a major source of hassle. I guess it's a question of where to put resources. From my point of view, tag editing, export, and reporting are more important features.
Thanks for the tool!
Transcriptions exported from e.g. Otter.ai are usually lightly formatted plain text (formatting = new lines + timestamps). However, when I import such a file into Taguette, Taguette makes Calibre use heuristics to identify structure and add <h2> and similar tags. I don't actually want any of this; ideally Taguette would preserve the look and feel of the original text document.
The way to do this seems to be passing --formatting-type plain to Calibre when converting a plain text document. It would be helpful if the import dialog had an advanced interface that let Taguette pass options through to Calibre. Alternatively, just assume that a plain text file should not be enhanced with heuristics, since most people likely don't understand what those are (and we aren't in an ebook context anyway).
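As a rough sketch of what passing that option could look like from Python, assuming Calibre's `ebook-convert` is on the PATH: the function names and the file paths are placeholders of mine, not Taguette's actual code, and I have not checked how this interacts with the options Taguette already passes.

```python
import shutil
import subprocess


def build_convert_command(src, dst, executable="ebook-convert"):
    """Build an ebook-convert command that treats the input as plain text."""
    # --formatting-type plain asks Calibre's TXT input not to guess at markup.
    return [executable, src, dst, "--formatting-type", "plain"]


def convert_plain_text(src, dst):
    """Run the conversion; raises if Calibre is missing or the run fails."""
    executable = shutil.which("ebook-convert")
    if executable is None:
        raise FileNotFoundError("ebook-convert not found on PATH")
    subprocess.run(build_convert_command(src, dst, executable), check=True)
```

Calibre infers the output format from the destination path, so `dst` should be whatever target you intend to import afterwards.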
MWE:
original file (snippet) from Otter.ai:
A 45:02
Right? And I guess, you know, your career will be similar, right? You'll be looking for permanent job and you don't want to be postdoc forever.
P10 45:15
Yeah. Yeah. Same thing. Right.
Calibre conversion through Taguette, which uses --enable-heuristics:
<p class="calibre1">A 45:02</p>
<p class="calibre1">Right? And I guess, you know, your career will be similar, right? You’ll be looking for permanent job and you don’t want to be postdoc forever.</p>
<h2 class="calibre2">P10 45:15</h2>
<p class="calibre1">Yeah. Yeah. Same thing. Right.</p>
<p class="whitespace"> </p>
The workaround is to convert the text file to HTML with Calibre directly, and then import the resulting HTML file into Taguette instead of the text file.
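For transcripts as simple as the snippet above, a different route that skips Calibre entirely is to wrap each non-empty line in a paragraph yourself. This is only a sketch under that assumption (one paragraph per line, no other structure), and `transcript_to_html` is a name I made up for illustration.

```python
import html


def transcript_to_html(text):
    """Wrap each non-empty line of a plain-text transcript in a <p> element."""
    paragraphs = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            # Escape &, < and > so the text cannot be read as markup.
            paragraphs.append(f"<p>{html.escape(line, quote=False)}</p>")
    body = "\n".join(paragraphs)
    return (
        "<!DOCTYPE html>\n<html>\n<head><meta charset=\"utf-8\"></head>\n"
        f"<body>\n{body}\n</body>\n</html>\n"
    )


if __name__ == "__main__":
    sample = "A 45:02\nRight? And I guess, you know.\n\nP10 45:15\nYeah. Same thing.\n"
    print(transcript_to_html(sample))
```

Speaker lines such as "P10 45:15" stay ordinary paragraphs, so nothing is promoted to a heading.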