Categories: geek » emacs » subed

View topic page - RSS - Atom - Subscribe via email

Comparing pronunciation recordings across time

Posted: - Modified: | french, emacs, org, subed
  • : Added reference audio for the second set.
  • : I added pronunciation segments for the new set of tongue-twisters I got on Mar 13.
  • : I added a column for Feb 20, the first session with the sentences. I also added keyboard shortcuts (1..n) for playing the audio of the row that the mouse is on.

2026-02-20: First set: Maman peint un grand lapin blanc, etc.

My French tutor gave me a list of sentences to help me practise pronunciation.

I can fuzzy-match these with the word timing JSON from WhisperX, like this.

Extract all approximately matching phrases
(subed-record-extract-all-approximately-matching-phrases
   sentences
   "/home/sacha/sync/recordings/2026-02-20-raphael.json"
   "/home/sacha/proj/french/analysis/virelangues/2026-02-20-raphael-script.vtt")
Sentences
  • Maman peint un grand lapin blanc.
  • Un enfant intelligent mange lentement.
  • Le roi croit voir trois noix.
  • Le témoin voit le chemin loin.
  • Moins de foin au loin ce matin.
  • La laine beige sèche près du collège.
  • La croquette sèche dans l'assiette.
  • Elle mène son frère à l'hôtel.
  • Le verre vert est très clair.
  • Elle aimait manger et rêver.
  • Le jeu bleu me plaît peu.
  • Ce neveu veut un jeu.
  • Le feu bleu est dangereux.
  • Le beurre fond dans le cœur chaud.
  • Les fleurs de ma sœur sentent bon.
  • Le hibou sait où il va.
  • L'homme fort mord la pomme.
  • Le sombre col tombe.
  • L'auto saute au trottoir chaud.
  • Le château d'en haut est beau.
  • Le cœur seul pleure doucement.
  • Tu es sûr du futur ?
  • Trois très grands trains traversent trois trop grandes rues.
  • Je veux deux feux bleus, mais la reine préfère la laine beige.
  • Vincent prend un bain en chantant lentement.
  • La mule sûre court plus vite que le loup fou.
  • Luc a bu du jus sous le pont où coule la boue.
  • Le frère de Robert prépare un rare rôti rouge.
  • La mule court autour du mur où hurle le loup.

Then I can use subed-record to manually tweak them, add notes, and so on. I end up with VTT files like 2026-03-06-raphael-script.vtt. I can assemble the snippets for a session into a single audio file, like this:

I wanted to compare my attempts over time, so I wrote some code to use Org Mode and subed-record to build a table with little audio players that I can use both within Emacs and in the exported HTML. This collects just the last attempts for each sentence during a number of my sessions (both with the tutor and on my own). The score is from the Microsoft Azure pronunciation assessment service. I'm not entirely sure about its validity yet, but I thought I'd add it for fun. * indicates where I've added some notes from my tutor, which should be available as a title attribute on hover. (Someday I'll figure out a mobile-friendly way to do that.)

Calling it with my sentences and files
(my-lang-summarize-segments
 sentences
 '(("/home/sacha/proj/french/analysis/virelangues/2026-02-20-raphael-script.vtt" . "Feb 20")
 ;("~/sync/recordings/processed/2026-02-20-raphael-tongue-twisters.vtt" . "Feb 20")
        ("~/sync/recordings/processed/2026-02-22-virelangues-single.vtt" . "Feb 22")
        ("~/proj/french/recordings/2026-02-26-virelangues-script.vtt" . "Feb 26")
        ("~/proj/french/recordings/2026-02-27-virelangues-script.vtt" . "Feb 27")
        ("~/proj/french/recordings/2026-03-03-virelangues.vtt" . "Mar 3")
        ("/home/sacha/sync/recordings/processed/2026-03-03-raphael-reference-script.vtt" . "Mar 3")
        ("~/proj/french/analysis/virelangues/2026-03-06-raphael-script.vtt" . "Mar 6")
        ("~/proj/french/analysis/virelangues/2026-03-12-virelangues-script.vtt" . "Mar 12"))
 "clip"
 #'my-lang-subed-record-get-last-attempt
 #'my-lang-subed-record-cell-info
 t
 )
Feb 20 Feb 22 Feb 26 Feb 27 Mar 3 Mar 3 Mar 6 Mar 12 Text
▶️ 63* ▶️ 96 ▶️ 95 ▶️ 94 ▶️ 83 ▶️ 83* ▶️ 81* ▶️ 88 Maman peint un grand lapin blanc.
▶️ 88* ▶️ 95 ▶️ 99 ▶️ 99 ▶️ 96 ▶️ 89* ▶️ 92* ▶️ 83 Un enfant intelligent mange lentement.
▶️ 84* ▶️ 97 ▶️ 97 ▶️ 96 ▶️ 94 ▶️ 95* ▶️ 98* ▶️ 99 Le roi croit voir trois noix.
▶️ 80* ▶️ 85 ▶️ 77 ▶️ 94 ▶️ 97   ▶️ 92* ▶️ 88 Le témoin voit le chemin loin.
▶️ 72* ▶️ 97 ▶️ 95 ▶️ 77 ▶️ 92   ▶️ 89* ▶️ 86 Moins de foin au loin ce matin.
▶️ 79* ▶️ 95 ▶️ 76 ▶️ 95 ▶️ 76 ▶️ 90* ▶️ 90* ▶️ 79 La laine beige sèche près du collège.
▶️ 67* ▶️ 99 ▶️ 85 ▶️ 81 ▶️ 85 ▶️ 99* ▶️ 97* ▶️ 97 La croquette sèche dans l'assiette.
▶️ 88* ▶️ 99 ▶️ 100 ▶️ 100 ▶️ 98 ▶️ 100* ▶️ 99* ▶️ 100 Elle mène son frère à l'hôtel.
▶️ 77* ▶️ 87 ▶️ 99 ▶️ 93 ▶️ 87   ▶️ 87* ▶️ 99 Le verre vert est très clair.
▶️ 100* ▶️ 94 ▶️ 100 ▶️ 99 ▶️ 99 ▶️ 99* ▶️ 100* ▶️ 100 Elle aimait manger et rêver.
▶️ 78* ▶️ 98 ▶️ 99 ▶️ 98 ▶️ 98 ▶️ 92*   ▶️ 88 Le jeu bleu me plaît peu.
▶️ 78* ▶️ 97 ▶️ 85 ▶️ 95 ▶️ 85     ▶️ 85 Ce neveu veut un jeu.
▶️ 73* ▶️ 95 ▶️ 95 ▶️ 96 ▶️ 97     ▶️ 100 Le feu bleu est dangereux.
▶️ 87* ▶️ 76 ▶️ 65 ▶️ 97 ▶️ 85 ▶️ 74* ▶️ 85* ▶️ 96 Le beurre fond dans le cœur chaud.
▶️ 84* ▶️ 43 ▶️ 85 ▶️ 79 ▶️ 75     ▶️ 98 Les fleurs de ma sœur sentent bon.
▶️ 70* ▶️ 86 ▶️ 79 ▶️ 76 ▶️ 87 ▶️ 84   ▶️ 98 Le hibou sait où il va.
▶️ 92* ▶️ 95 ▶️ 86 ▶️ 92 ▶️ 98 ▶️ 99*   ▶️ 94 L'homme fort mord la pomme.
▶️ 83* ▶️ 73 ▶️ 69 ▶️ 81 ▶️ 60 ▶️ 96*   ▶️ 81 Le sombre col tombe.
▶️ 39* ▶️ 49 ▶️ 69 ▶️ 56 ▶️ 69 ▶️ 96*   ▶️ 94 L'auto saute au trottoir chaud.
▶️ 82 ▶️ 84 ▶️ 85 ▶️ 98 ▶️ 94 ▶️ 96*   ▶️ 99 Le château d'en haut est beau.
▶️ 89 ▶️ 85 ▶️ 75 ▶️ 91 ▶️ 52 ▶️ 75* ▶️ 70* ▶️ 98 Le cœur seul pleure doucement.
▶️ 98*   ▶️ 99 ▶️ 99 ▶️ 95 ▶️ 93* ▶️ 97* ▶️ 99 Tu es sûr du futur ?
    ▶️ 97 ▶️ 93 ▶️ 92 ▶️ 85*   ▶️ 90 Trois très grands trains traversent trois trop grandes rues.
    ▶️ 94 ▶️ 85 ▶️ 97 ▶️ 82*   ▶️ 92 Je veux deux feux bleus, mais la reine préfère la laine beige.
    ▶️ 91 ▶️ 79 ▶️ 87 ▶️ 82*   ▶️ 94 Vincent prend un bain en chantant lentement.
    ▶️ 89 ▶️ 91 ▶️ 91 ▶️ 84*   ▶️ 92 La mule sûre court plus vite que le loup fou.
    ▶️ 91 ▶️ 93 ▶️ 93 ▶️ 92*   ▶️ 96 Luc a bu du jus sous le pont où coule la boue.
    ▶️ 88 ▶️ 71 ▶️ 94 ▶️ 86*   ▶️ 92 Le frère de Robert prépare un rare rôti rouge.
    ▶️ 81 ▶️ 84 ▶️ 88 ▶️ 67*   ▶️ 94 La mule court autour du mur où hurle le loup.

Pronunciation still feels a bit hit or miss. Sometimes I say a sentence and my tutor says "Oui," and then I say it again and he says "Non, non…" The /ʁ/ and /y/ sounds are hard.

I like seeing these compact links in an Org Mode table and being able to play them, thanks to my custom audio link type. It should be pretty easy to write a function that lets me use a keyboard shortcut to play the audio (maybe using the keys 1-9?) so that I can bounce between them for comparison.

If I screen-share from Google Chrome, I can share the tab with audio, so my tutor can listen to things at the same time. Could be fun to compare attempts so that I can try to hear the differences better. Hmm, actually, let's try adding keyboard shortcuts that let me use 1-8, n/p, and f/b to navigate and play audio. Mwahahaha! It works!

2026-03-14: Second set: Mon oncle peint un grand pont blanc, etc.

Update 2026-03-14: My tutor gave me a new set of tongue-twisters. When I'm working on my own, I find it helpful to loop over an audio reference with a bit of silence after it so that I can repeat what I've heard. I have several choices for reference audio:

  • I can generate an audio file using text-to-speech, like a local instance of Kokoro TTS, or a hosted service like Google Translate (via gtts-cli), ElevenLabs, or Microsoft Azure.
  • I can extract a recording of my tutor from one of my sessions.
  • I can extract a recording of myself from one of my tutoring sessions where my tutor said that the pronunciation is alright.

Here I stumble through the tongue-twisters. I've included reference audio from Kokoro, gtts, and ElevenLabs for comparison.

(my-subed-record-analyze-file-with-azure
 (subed-record-keep-last
  (subed-record-filter-skips
   (subed-parse-file
    "/home/sacha/proj/french/analysis/virelangues/2026-03-13-raphael-script.vtt")))
 "~/proj/french/analysis/virelangues-2026-03-13/2026-03-13-clip")
Kk Gt Az Me ID Comments All Acc Flu Comp Conf  
👂🏼 👂🏼 👂🏼 ▶️ 1 X: pont 93 99 90 100 86 Mon oncle peint un grand pont blanc. {pont}
👂🏼 👂🏼 👂🏼 ▶️ 2 C'est mieux 68 75 80 62 87 Un singe malin prend un bon raisin rond.
👂🏼 👂🏼 👂🏼 ▶️ 3 Ouais, c'est ça 83 94 78 91 89 Dans le vent du matin, mon chien sent un bon parfum.
👂🏼 👂🏼 👂🏼 ▶️ 4 ok 75 86 63 100 89 Le soin du roi consiste à joindre chaque coin du royaume.
👂🏼 👂🏼 👂🏼 ▶️ 5 Ouais, c'est ça, parfait 83 94 74 100 88 Dans un coin du bois, le roi voit trois points noirs.
👂🏼 👂🏼 👂🏼 ▶️ 6 Ouais, parfait 90 92 87 100 86 Le feu de ce vieux four chauffe peu.
👂🏼 👂🏼 👂🏼 ▶️ 7 Ouais 77 85 88 71 86 Deux peureux veulent un peu de feu.
👂🏼 👂🏼 👂🏼 ▶️ 8   77 78 75 83 85 Deux vieux bœufs veulent du beurre.
👂🏼 👂🏼 👂🏼 ▶️ 9 Ouais, parfait 92 94 89 100 89 Elle aimait marcher près de la rivière.
👂🏼 👂🏼 👂🏼 ▶️ 10 Ok, c'est bien 93 98 89 100 90 Je vais essayer de réparer la fenêtre.
👂🏼 👂🏼 👂🏼 ▶️ 11 Okay 83 87 76 100 89 Le bébé préfère le lait frais.
👂🏼 👂🏼 👂🏼 ▶️ 12   77 92 70 86 90 Charlotte cherche ses chaussures dans la chambre.
👂🏼 👂🏼 👂🏼 ▶️ 13 Okay 91 90 94 91 88 Un chasseur sachant chasser sans son chien est-il un bon chasseur ?
👂🏼 👂🏼 👂🏼 ▶️ 14 Ouais 91 88 92 100 91 Le journaliste voyage en janvier au Japon.
👂🏼 👂🏼 👂🏼 ▶️ 15 C'est bien (X: dans un) 91 88 94 100 88 Georges joue du jazz dans un grand bar. {dans un}
👂🏼 👂🏼 👂🏼 ▶️ 16 C'est bien 88 87 94 88 85 Un jeune joueur joue dans le grand gymnase.
👂🏼 👂🏼 👂🏼 ▶️ 17   95 94 96 100 91 Le compagnon du montagnard soigne un agneau.
👂🏼 👂🏼 👂🏼 ▶️ 18   85 88 84 86 89 La cigogne soigne l’agneau dans la campagne.
👂🏼 👂🏼 👂🏼 ▶️ 19 grenouille 71 80 68 75 86 La grenouille fouille les feuilles dans la broussaille.

The code

Code for summarizing the segments
(defun my-lang-subed-record-cell-info (item file-index file sub)
  (let* ((sound-file (expand-file-name (format "%s-%s-%d.opus"
                                               prefix
                                               (my-transform-html-slugify item)
                                               (1+ file-index))))
         (score (car (split-string
                      (or
                       (subed-record-get-directive "#+SCORE" (elt sub 4)) "")
                      ";")))
         (note (replace-regexp-in-string
                (concat "^" (regexp-quote (cdr file))
                        "\\(: \\)?")
                ""
                (or (subed-record-get-directive "#+NOTE" (elt sub 4)) ""))))
    (when (or always-create (not (file-exists-p sound-file)))
      (subed-record-extract-audio-for-current-subtitle-to-file sound-file sub))
    (org-link-make-string
     (concat "audio:" sound-file "?icon=t"
             (format "&source=%s&source-start=%s" (car file) (elt sub 1))
             (format "&title=%s"
                     (url-hexify-string
                      (if (string= note "")
                          (cdr file)
                        (concat (cdr file) ": " note)))))
     (concat
      "▶️"
      (if score (format " %s" score) "")
      (if (string= note "") "" "*")))))

(defun my-lang-subed-record-get-last-attempt (item file)
  "Return the last subtitle matching ITEM in FILE."
  (car
   (last
    (seq-remove
     (lambda (o) (string-match "#\\+SKIP" (or (elt o 4) "")))
     (learn-lang-subed-record-collect-matching-subtitles
      item
      (list file)
      nil
      nil
      'my-subed-simplify)))))

(defun my-lang-summarize-segments (items files prefix attempt-fn cell-fn &optional always-create)
  (cons
   (append
    (seq-map 'cdr files)
    (list "Text"))
   (seq-map
    (lambda (item)
      (append
       (seq-map-indexed
        (lambda (file file-index)
          (let* ((sub (funcall attempt-fn item file)))
            (if sub
                (funcall cell-fn item file-index file sub)
              "")))
        files)
       (list item)))
    items)))

(defun my-subed-record-analyze-file-with-azure (subtitles prefix &optional always-create)
  (cons
   '("Kk" "Gt" "Az" "Me" "ID" "Comments" "All" "Acc" "Flu" "Comp" "Conf")
   (seq-map-indexed
    (lambda (sub i)
      (let ((sound-file (expand-file-name (format "%s-%02d.opus"
                                                  prefix
                                                  (1+ i))))
            (tts-services
             '(("kokoro" . learn-lang-tts-kokoro-fastapi-say)
               ("gtts" . learn-lang-tts-gtts-say)
               ("azure" . learn-lang-tts-azure-say)))
            tts-files
            (note (subed-record-get-directive "#+NOTE" (elt sub 4))))
        (when (or always-create (not (file-exists-p sound-file)))
          (subed-record-extract-audio-for-current-subtitle-to-file sound-file sub))
        (setq
         tts-files
         (mapcar
          (lambda (row)
            (let ((reference (format "%s-%s-%02d.opus" prefix (car row) (1+ i) )))
              (when (or always-create (not (file-exists-p reference)))
                (funcall (cdr row)
                         (subed-record-simplify (elt sub 3))
                         'sync
                         reference))
              (org-link-make-string
               (concat "audio:" reference "?icon=t&note=" (url-hexify-string (car row)))
               "👂🏼")))
          tts-services))
        (append
         tts-files
         (list
          (org-link-make-string
           (concat "audio:" sound-file "?icon=t"
                   (format "&source-start=%s" (elt sub 1))
                   (if (and note (not (string= note "")))
                       (format "&title=%s"
                               (url-hexify-string note))
                     ""))
           "▶️")
          (format "%d" (1+ i))
          (or note ""))
         (learn-lang-azure-subed-record-parse (elt sub 4))
         (list
          (elt sub 3)))))
    subtitles)))

Some code for doing this stuff is in sachac/learn-lang on Codeberg.

View Org source for this post

Improving subed-vtt parsing; using dedicated windows in Emacs; training my intuition

| emacs, subed

While putting together some notes on how to use subed.el with auto-generated YouTube captions, I decided to get subed-word-data working with the Youtube VTT format. The sample file I downloaded from one of my YouTube videos had a cue whose text started with a blank line, so I ended up redoing the way subed-vtt.el parsed cues. Twice, actually. The first time, I changed it to handle multiple blocks in cue text (separated by blank lines). Then I came across this test suite for WebVTT parsing and found out that the timing line for a cue doesn't have to be preceded by a blank line, so then I needed to change the VTT parsing again.

Fortunately, I inherited a large Buttercup test suite from subed.el's original author. I've been adding to it over the years. As I learned more about how I wanted subed.el to behave, I added more cases. Then I worked on shifting the code to behave the way I wanted it to. At first, I didn't quite understand what I wanted the code to do, but as I pinned down more test cases, I was able to figure it out.

This was the first time I used toggle-window-dedicated (C-x w d) extensively. This function keeps the same buffer displayed in the window instead of letting Emacs Lisp functions replace it. First, I set up a large window for my subed-vtt.el. I split the other side into a smaller window for my test-subed-vtt.el, another window for a temporary VTT file, and a window for output from whatever command I'm running. I set most of the windows to dedicated except for my temporary output window. I also enabled winner-mode just in case I messed things up so that I could restore with winner-undo. Dedicating the windows meant I didn't have to keep fussing around with my windows and buffers to get things back to where I wanted them to be. I did use C-x o (other-window) a lot, so maybe it'll be worth getting the hang of ace-window.

2025-01-28_12-40-04.png
Figure 1: Dedicated windows for the code, the test file, and a VTT buffer for interactive testing; a spare window for other output

Sometimes I wanted to focus on one of those small windows. prot/window-single-toggle was helpful for maximizing a window and then returning to the previous configuration.

I mostly evaluated or edebugged my code and then used my-buttercup-run-dwim to run a suite in my test-subed-vtt.el. sp-backward-up-sexp and sp-narrow-to-sexp were also helpful for navigating the suites and focusing on whatever I was working on.

I'm looking forward to exploring the other test cases from that repository. It feels good to get better as a coder.

I just finished reading Mathematica: a Secret World of Imagination and Curiosity, by David Bessis. The idea of training your intuition echoed in my mind as I wrote test cases and changed the code. I started to look forward to the gaps between my understanding of the spec and my understanding of the test cases, and the gaps between the test cases and my implementation. It felt almost like a conversation. Sometimes it was hard to translate an idea into code. I felt myself getting muddled and turned around. Whenever I noticed that, I just had to back up and start from something I understood, figure out what I was uncertain about, and then go from there.

I like the way that subed.el gives me tools for thinking about text, times, and metadata together. That abstraction has been helpful for editing audio and making videos. The more solid I can make it, the easier it will be to imagine other things that use those ideas.

View org source for this post

Automatically correcting phrasing and misrecognized words in speech-to-text captions by using a script

| speechtotext, subed, emacs

I usually write my scripts with phrases that could be turned into the subtitles. I figured I might as well combine that information with the WhisperX transcripts which I use to cut out my false starts and oopses. To do that, I use the string-distance function, which calculates how similar strings are, based on the Levenshtein [distance] algorithm. If I take each line of the script and compare it with the list of words in the transcription, I can add one transcribed word at a time, until I find the number with the minimum distance from my current script phrase. This lets me approximately match strings despite misrecognized words. I use oopses to signal mistakes. When I detect those, I look for the previous script line that is closest to the words I restart with. I can then skip the previous lines automatically. When the script and the transcript are close, I can automatically correct the words. If not, I can use comments to easily compare them at that point. Even though I haven't optimized anything, it runs well enough for my short videos. With these subtitles as a base, I can get timestamps with subed-align and then there's just the matter of tweaking the times and adding the visuals.

Text from sketch

Matching a script with a transcript 2025-01-09-01

  • script
  • record on my phone
  • WhisperX transcript (with false starts and recognition errors)

My current implementation is totally unoptimized (n²) but it's fine for short videos.

Process:

  • While there are transcript words to process
    • Find the script line that has the minimum distance to the words left in the transcript. restart after oopses
  • Script
  • Transcript: min. distance between script phrase & transcript
  • Restarting after oops: find script phrase with minimum distance
  • Ex. script phrase: The Emacs text editor
  • Transcript: The Emax text editor is a…
  • Bar graph of distance decreasing, and then increasing again
  • Minimum distance
  • Oops?
    • N: Use transcript words, or diff > threshold?
      • Y: Add script words as comment
      • N: Correct minor errors
    • Y: Mark caption for skipping and look for the previous script line with minimum distance.

Result:

  • Untimed captions with comments
  • Aeneas
  • Timed captions for editing

This means I can edit a nicely-split, mostly-corrected file.

I've included the links to various files below so you can get a sense of how it works. Let's focus on an excerpt from the middle of my script file.

it runs well enough for my short videos.
With these subtitles as a base,
I can get timestamps with subed-align

When I call WhisperX with large-v2 as the model and --max_line_width 50 --segment_resolution chunk --max_line_count 1 as the options, it produces these captions corresponding to that part of the script.

01:25.087 --> 01:29.069
runs well enough for my short videos. With these subtitles

01:29.649 --> 01:32.431
as a base, I can get... Oops. With these subtitles as a base, I

01:33.939 --> 01:41.205
can get timestamps with subedeline, and then there's just

Running subed-word-data-use-script-file results in a VTT file containing this excerpt:

00:00:00.000 --> 00:00:00.000
it runs well enough for my short videos.

NOTE #+SKIP

00:00:00.000 --> 00:00:00.000
With these subtitles as a base,

NOTE #+SKIP

00:00:00.000 --> 00:00:00.000
I can get... Oops.

00:00:00.000 --> 00:00:00.000
With these subtitles as a base,

NOTE
#+TRANSCRIPT: I can get timestamps with subedeline,
#+DISTANCE: 0.14

00:00:00.000 --> 00:00:00.000
I can get timestamps with subed-align

There are no timestamps yet, but subed-align can add them. Because subed-align uses the Aeneas forced alignment tool to figure out timestamps by lining up waveforms for speech-synthesized text with the recorded audio, it's important to keep the false starts in the subtitle file. Once subed-align has filled in the timestamps and I've tweaked the timestamps by using the waveforms, I can use subed-record to create an audio file that omits the subtitles that have #+SKIP comments.

The code is available as subed-word-data-use-script-file in subed-word-data.el. I haven't released a new version of subed.el yet, but you can get it from the repository.

In addition to making my editing workflow a little more convenient, I think it might also come in handy for applying the segmentation from tools like sub-seg or lachesis to captions that might already have been edited by volunteers. (I got sub-seg working on my system, but I haven't figured out lachesis.) If I call subed-word-data-use-script-file with the universal prefix arg C-u, it should set keep-transcript-words to true and keep any corrections we've already made to the caption text while still approximately matching and using the other file's segments. Neatly-segmented captions might be more pleasant to read and may require less cognitive load.

There's probably some kind of fancy Python project that already does this kind of false start identification and script reconciliation. I just did it in Emacs Lisp because that was handy and because that way, I can make it part of subed. If you know of a more robust or full-featured approach, please let me know!

View org source for this post

#YayEmacs 9: Trimming/adding silences to get to a target; subed-record-sum-time

| audio, subed, yay-emacs, emacs, video

New in this video: subed-record-sum-time, #+PAD_LEFT and #+PAD_RIGHT

I like the constraints of a one-minute video, so I added a subed-record-sum-time command. That way, when I edit the video using Emacs, I can check how long the result will be. First, I split the subtitles, align it with the audio to fix the timestamps, and double check the times. Then I can skip my oopses. Sometimes WhisperX doesn't catch them, so I also look at waveforms and characters per second. I already talk quickly, so I'm not going to speed that up but I can trim the pauses in between phrases which is easy to do with waveforms. Sometimes, after reviewing a draft, I realize I need a little more time. If the original audio has some silence, I can just copy and paste it. If not, I can pad left or pad right to add some silence. I can try the flow of some sections and compile the video when I'm ready. Emacs can do almost anything. Yay Emacs!

You can watch this on YouTube, download the video, or download the audio.

Play by play:

  • I like the constraints of a one-minute video, so I added a subed-record-sum-time command. That way, when I edit the video using Emacs, I can check how long the result will be.
    • subed-record uses subtitles and directives in comments in a VTT subtitle file to edit audio and video. subed-record-sum-time calculates the resulting duration and displays it in the minibuffer.
  • First, I split the subtitles, align it with the audio to fix the timestamps, and double check the times.
    • I'm experimenting with an algorithmic way to combine the breaks from my script with the text from the transcript. subed-align calls the aeneas forced alignment tool to match up the text with the timestamps. I use subed-waveform-show-all to show all the waveforms.
  • Then I can skip my oopses.
    • Adding a NOTE #+SKIP comment before a subtitle makes subed-record-compile-video and subed-record-compile-flow skip that part of the audio.
  • Sometimes WhisperX doesn't catch them,
    • WhisperX sometimes doesn't transcribe my false starts if I repeat things quickly.
  • so I also look at waveforms
    • subed-waveform-show-all adds waveforms for all the subtitles. If I notice there's a pause or a repeated shape in the waveform, or if I listen and notice the repetition, I can confirm by middle-clicking on the waveform to sample part of it.
  • and characters per second.
    • Low characters per second is sometimes a sign that the timestamps are incorrect or there's a repetition that wasn't transcribed.
  • I already talk quickly, so I'm not going to speed that up
    • Also, I already sound like a chipmunk; mechanically speeding up my recording to fit in a certain time will make that worse =)
  • but I can trim the pauses in between phrases which is easy to do with waveforms.
    • left-click to set the start, right-click to set the stop. If I want to adjust the previous/next one at the same time, I would use shift-left-click or shift-right-click, but here I want to skip the gaps between phrases, so I adjust the current subtitle without making the previous/next one longer.
  • Sometimes, after reviewing a draft, I realize I need a little more time.
    • I can specify visuals like a video, animated GIF, or an image by adding a [[file:...]] link in the comment for a subtitle. That visual will be used until the next visual is specified in a comment on a different subtitle. subed-record-compile-video can automatically speed up video clips to fit in the time for the current audio segment, which is the set of subtitles before the next visual is defined. After I compile and review the video, sometimes I notice that something goes by too quickly.
  • If the original audio has some silence, I can just copy and paste it.
    • This can sometimes feel more natural than adding in complete silence.
  • If not, I can pad left or pad right to add some silence.
    • I added a new feature so that I could specify something like #+PAD_RIGHT: 1.5 in a comment to add 1.5 seconds of silence after the audio specified by that subtitle.
  • I can try the flow of some sections
    • I can select a region and then use M-x subed-record-compile-try-flow to play the audio or C-u M-x subed-record-compile-try-flow to play the audio+video for that region.
  • and compile the video when I'm ready.
    • subed-record-compile-video compiles the video to the file specified in #+OUTPUT: filename. ffmpeg is very arcane, so I'm glad I can simplify my use of it with Emacs Lisp.
  • Emacs can do almost anything. Yay Emacs!
    • Non-linear audio and video editing is actually pretty fun in a text editor, especially when I can just use M-x vundo to navigate my undo history.

Links:

Related:

View org source for this post

Editing videos with Emacs and subed-record.el

| emacs, subed, video

I want to document more of my Minecraft adventures with A+. Video is a natural way to do this. It builds on her familiarity with the tutorials and streams she enjoys watching. I set up OBS on her laptop and plugged in my Blue Yeti microphone. We did our first interview yesterday. I edited and subtitled it (because why not!), uploaded it as an unlisted YouTube video, and shared it with her dad, sister, and cousins.

I did the video editing in Emacs with subed-record. First, I used WhisperX to transcribe the video, and I used subed-align to fix the timestamps with aeneas. I normalized the audio with Audacity and I exported the .opus file for use in subed-record.el. Then I added NOTE #+SKIP before times I wanted to remove, like when she asked for a retake. Here's what that subtitle markup looks like:

WEBVTT

NOTE #+SKIP

00:00:00.000 --> 00:00:16.679
And then I'll record in my side also
and we'll just put it in somehow.
Somehow. Okay. We can edit that, right?
Yeah, we'll learn how to edit things.
It'll be great.

NOTE
Introduction
#+AUDIO: cuberventures-001.opus
[[file:intro.webm]]
#+OUTPUT: cuberventures-001-fxnt-create-2-windmill-home-cafe-trains-hotel-half-underwater.webm

00:00:16.680 --> 00:00:19.399
Okay, so now we're here with <username>.

00:00:19.400 --> 00:00:23.039
I want to find out what you like about Minecraft and

00:00:23.040 --> 00:00:26.079
all the cool things that you have been building lately.

This was a little different from my usual video creation workflow, where I record the audio and the video separately. When I wrote subed-record.el, I assumed I'd edit the audio first, choose images/GIFs/videos that were already ready to go, and then combine those visuals with sections of audio, speeding things up or slowing things down as needed. Now I wanted to apply the same edits to the video as I did to the audio. A+ did a great job of looking at stuff in Minecraft while talking about them, so I wanted to keep her narration in sync. I added some code to allow me to specify a same-edits keyword for the visuals. That meant that I would use the same selection list that I used for cutting the audio. Here's what that subtitle markup looks like:

NOTE
[[file:2024-12-31 10-35-14.mkv]]
#+OPTIONS: same-edits

00:00:43.860 --> 00:00:45.941
Shall we take a tour of my world?

00:00:45.942 --> 00:00:50.079
Sure, let's tell people which mod pack this is.

00:00:50.080 --> 00:00:55.639
This is FXNT Create 2, also known as FoxyNoTail Create 2.

NOTE Windmill

00:00:55.640 --> 00:00:58.239
I've got this little bit of path leading to the interview

00:00:58.240 --> 00:01:01.839
room. This is my unfinished windmill. I've been meaning to

This workflow lets me cut out segments in the middle of the video, like this:

00:17:30.200 --> 00:17:33.119
great start for a tour. I'm looking forward to seeing what

00:17:33.120 --> 00:17:34.112
you will build next.

NOTE #+SKIP

00:17:34.113 --> 00:18:02.379
Do you have any last words before
we try to figure out this video editing thing?
Yeah. We'll cut that last part out.
Let's just do a retake on that last part.
Someday. Out here. Okay. There you go.
This is a beautiful view.

00:18:02.380 --> 00:18:08.119
The last things I want to say about this world is there'll be

I also wanted to start the video with a segment from my recording, so we could see her avatar on screen during the introduction. She kept her computer on first-person POV instead of changing the camera. I used mpv to figure out the timestamps for the start and end of the part that I wanted to use, then I used ffmpeg to cut that clip. I added a comment with a link to that video in order to use it before the main video. That's the [[file:intro.webm]] in the first section's comments.

After testing a small section of the transcript by selecting a region and using subed-record-compile-video, I deselected the region and used subed-record-compile-video to produce the whole video.

I also modified subed-record-compile-subtitles to include the other non-directive comments, so I can include the section headings in the raw VTT file and have them turn up in the exported version. Then I can use the new subed-section-comments-as-chapters command to copy those as chapters for the YouTube description.

We're not going to share that particular video yet, but I'm looking forward to trying that technique with videos about stuff I'm figuring out in Minecraft or Emacs. It's also tempting me to think about ways to specify transitions like crossfades and other fancy effects like overlays.

I like using the transcript as the starting point for video editing. It just makes sense to me to work with it as text. I also like this experiment with documenting more of our Minecraft experimentation. It seems to get her talking and encourages her to build more. I'm looking forward to learning more about Minecraft and making videos too.

We did another video today using the new shortcuts I've just set up for toggling OBS recording. This time we didn't even need to do any editing. I used Org Export to make her a little HTML file that had the two videos on it, so she can review it any time. Onward!

View org source for this post

subed.el: Tweaking subtitle times

| emacs, subed

When subtitle times are too far off from the video or audio, people start worrying if their video has frozen or jumped ahead. It's good to keep subtitles roughly in time with the audio.

For EmacsConf, we can get timing information from two places. WhisperX produces a JSON file with word data in the process of doing the speech recognition, and the aeneas forced alignment tool can use synthesized text-to-speech to figure out the timestamps for each line of text compared to a media file.

Aeneas timestamps are more helpful once we start editing, but it can be confused by long silences, extraneous noises, multiple speakers, and inaccurate transcripts (words added or removed).

When I combine the WhisperX word data with subtitles, I can see where the times might need a closer look because matching words weren't found.

2024-12-18_12-27-37.png
Figure 1: Screenshot with word data loaded

Loading word data requires a pretty close match at the moment, but since we change only about 4% of the subtitle text when editing, those cues are still helpful. (I measured this by the Levenshtein distance between the combined cue texts of edited subtitles versus the original WhisperX transcripts, using string-distance to approximate the editing percentage.)

Calculating how much we edited
(let ((sum-original 0)
      (sum-dist 0))
  (append
   (seq-keep
    (lambda (talk)
      (when (and (emacsconf-talk-file talk "--main.vtt")
                 (emacsconf-talk-file talk "--reencoded.json"))
        (let* ((json-object-type 'alist)
               (json-array-type 'list)
               (edited-text
                (mapconcat (lambda (sub) (elt sub 3))
                           (subed-parse-file (emacsconf-talk-file talk "--main.vtt"))
                           " "))
               (original-text
                (mapconcat
                 (lambda (word)
                   (assoc-default 'word word))
                 (assoc-default
                  'word_segments
                  (json-read-file (emacsconf-talk-file talk "--reencoded.json")))
                 " "))
               (dist (string-distance original-text edited-text)))
          (setq sum-original (+ sum-original (length original-text)))
          (setq sum-dist (+ sum-dist dist))
          (list
           (length original-text)
           (length edited-text)
           dist))))
    (emacsconf-get-talk-info))
   '(hline)
   (list
    (list
     sum-original
     (format "%d%%" (/ (* 100.0 sum-dist) sum-original))
     sum-dist))))

To make it easier to correct subtitle timing, I added a few ways to tweak subtitle timing for a region of subtitles.

WhisperX: subed-word-data-fix-subtitle-timing in subed-word-data.el tries to match the word data from WhisperX against the text of the current subtitle, using string-distance for approximate matches. I start at about two words shorter than what's in the subtitle, and then increase the number of words taken from the data while the string distance decreases. I skip the data for words before the beginning of the first subtitle in the region.

Screencast of subed-word-data-fix-subtitle-timing

Aeneas: subed-align-region uses Aeneas to realign the subtitles from the region using the section of the media file between the start of the first subtitle and the end of the last subtitle in the region. When I notice that the times are off, I skim the subtitles (or just skim them visually) to find the last well-timed subtitle. Then I pick a subtitle that's in the incorrectly-timed section. I use subed-mpv-jump-to-current-subtitle (M-j) to jump to that position, and I play back that subtitle. It usually belongs to some text further down, so I reset to that position with M-j, set my mark before the previous correctly-timed subtitle with C-SPC, go to the subtitle that matches that time, and use subed-copy-player-pos-to-start-time (C-c [) to set the proper timestamp. Then I can go to the previous incorrectly-timed subtitle and use M-x subed-align-region. This runs the Aeneas forced alignment tool using just the subtitle text in the region, the starting timestamp of the first subtitle, and the ending timestamp of the last subtitle, making it easy to adjust that section. subed-align-region is in subed-align.el

Retiming by pressing SPC after each subtitle: As an experiment, I've also added a subed-retime-subtitles command that plays through the subtitles so that I can press SPC when the next subtitle starts. It begins with the current subtitle and stops when you press a key that's not in its keymap.

Screencast with audio: subed-retime-subtitles

Manual adjustments: For fine-tuning timestamps, I usually turn on subed-waveform-show-all and shift-left-click (subed-waveform-set-start-and-copy-to-previous) or shift-right-click (subed-waveform-set-stop-and-copy-to-next) on the waveforms because it's easy to see where the words and pauses are. When I'm not sure, I can use middle-click (subed-waveform-play-sample) to play part of the file without changing the subtitle start/stop or the MPV playback position.

Screencast with audio of using the waveforms

I'm experimenting with adding repeating keybindings. There was a subed-mpv-frame-step-map that was bound to C-c C-f, so I've renamed it to subed-mpv-control, added a whole bunch of keybindings to the subed-mpv-control-map based on MPV and Aegisub shortcuts, and made it a repeating transient map.

Screencast with audio, experimenting with the mpv control map

Ideas for next steps:

Gotta get the hang of all these new capabilities through practice! =)

To make my subed-align-region workflow even more convenient, I could use completing-read to let me select a future subtitle with completion, and then Emacs could automatically fix the subtitle start time, go to the previous subtitle, and realign the region.

Also, I think switching the waveforms from overlays to text properties could be a good idea. When I cut text, the overlays get left behind, but I want the waveforms to go away too.

While writing this post and fiddling with subed, I ended up adding a bunch of keybindings and a menu. I figured this was as good a time as any to stop tweaking it and finally publish. (But it's fun! Just one more idea…)

View org source for this post

Checking caption timing by skimming with Emacs Lisp or JS

| js, emacs, subed

Sometimes automatic subtitle timing tools like Aeneas can get confused by silences, extraneous sounds, filler words, mis-starts, and text that I've edited out of the raw captions for easier readability. It's good to quickly check each caption. I used to listen to captions at 1.5x speed, watching carefully as each caption displayed. This took a fair bit of time and focus, so… it usually didn't happen. Sampling the first second of each caption is faster and requires a little less attention.

Skimming with subed.el

Here's a function that I wrote to play the first second of each subtitle.

(defvar my-subed-skim-msecs 1000 "Number of milliseconds to play when skimming.")
(defun my-subed-skim-starts ()
  (interactive)
  (subed-mpv-unpause)
  (subed-disable-loop-over-current-subtitle)
  (catch 'done
    (while (not (eobp))
      (subed-mpv-jump-to-current-subtitle)
      (let ((ch
             (read-char "(q)uit? " nil (/ my-subed-skim-msecs 1000.0))))
        (when ch
          (throw 'done t)))
      (subed-forward-subtitle-time-start)
      (when (and subed-waveform-minor-mode
                 (not subed-waveform-show-all))
        (subed-waveform-refresh))
      (recenter)))
  (subed-mpv-pause))

Now I can read the lines as the subtitles play, and I can press any key to stop so that I can fix timestamps.

Skimming with Javascript

I also want to check the times on the Web in case there have been caching issues. Here's some Javascript to skim the first second of each cue in the first text track for a video, with some code to make it easy to process the first video in the visible area.

function getVisibleVideo() {
  const videos = document.querySelectorAll('video');
  for (const video of videos) {
    const rect = video.getBoundingClientRect();
    if (
      rect.top >= 0 &&
      rect.left >= 0 &&
      rect.bottom <= (window.innerHeight || document.documentElement.clientHeight) &&
      rect.right <= (window.innerWidth || document.documentElement.clientWidth)
    ) {
      return video;
    }
  }
  return null;
}

async function skimVideo(video=getVisibleVideo(), msecs=1000) {
  // Get the first text track (assumed to be captions/subtitles)
  const textTrack = video.textTracks[0];
  if (!textTrack) return;
  const remaining = [...textTrack.cues].filter((cue) => cue.endTime >= video.currentTime);
  video.play();
  // Play the first 1 second of each visible subtitle
  for (let i = 0; i < remaining.length && !video.paused; i++) {
    video.currentTime = remaining[i].startTime;
    await new Promise((resolve) => setTimeout(resolve, msecs));
  }
}

Then I can call it with skimVideo();. Actually, in our backstage area, it might be useful to add a Skim button so that I can skim things from my phone.

function handleSkimButton(event) {
   const vid = event.target.closest('.vid').querySelector('video');
   skimVideo(vid);
 }

document.querySelectorAll('video').forEach((vid) => {
   const div = document.createElement('div');
   const skim = document.createElement('button');
   skim.textContent = 'Skim';
   div.appendChild(skim);
   vid.parentNode.insertBefore(div, vid.nextSibling);
   skim.addEventListener('click', handleSkimButton);
});

Results

How much faster is it this way?

Some code to help figure out the speedup
(-let* ((files (directory-files "~/proj/emacsconf/2024/cache" t "--main\\.vtt"))
        ((count-subs sum-seconds)
         (-unzip (mapcar
                  (lambda (file)
                    (list
                     (length (subed-parse-file file))
                     (/ (compile-media-get-file-duration-ms
                         (concat (file-name-sans-extension file) ".webm")) 1000.0)))
                  files)))
        (total-seconds (-reduce #'+ sum-seconds))
        (total-subs (-reduce #'+ count-subs)))
  (format "%d files, %.1f hours, %d total captions, speed up of %.1f"
          (length files)
          (/ total-seconds 3600.0)
          total-subs
          (/ total-seconds total-subs)))

It looks like for EmacsConf talks where we typically format captions to be one long line each (< 60 characters), this can be a speed-up of about 4x compared to listening to the video at normal speed. More usefully, it's different enough to get my brain to do it instead of putting it off.

Most of the automatically-generated timestamps are fine. It's just a few that might need tweaking. It's nice to be able to skim them with fewer keystrokes.

View org source for this post