Sacha Chua - category - speech

Using speech recognition for on-the-fly translations in Emacs and faking in-buffer completion for the results

2026-02-27T20:11:58Z

When I'm writing a journal entry in French, I sometimes want to translate a phrase that I can't look up word by word using a dictionary. Instead of switching to a browser, I can use an Emacs function to prompt me for text and either insert or display the translation. The plz library makes HTTP requests slightly neater.

(defun my-french-en-to-fr (text &optional display-only)
  (interactive (list (read-string "Text: ") current-prefix-arg))
  (let* ((url "https://translation.googleapis.com/language/translate/v2")
         (params `(("key" . ,(getenv "GOOGLE_API_KEY"))
                   ("q" . ,text)
                   ("source" . "en")
                   ("target" . "fr")
                   ("format" . "text")))
         (query-string (mapconcat
                        (lambda (pair)
                          (format "%s=%s"
                                  (url-hexify-string (car pair))
                                  (url-hexify-string (cdr pair))))
                        params
                        "&"))
         (full-url (concat url "?" query-string)))
    (let* ((response (plz 'get full-url :as #'json-read))
           (data (alist-get 'data response))
           (translations (alist-get 'translations data))
           (first-translation (car translations))
           (translated-text (alist-get 'translatedText first-translation)))
      (when (called-interactively-p 'any)
        (if display-only
            (message "%s" translated-text)
          (insert translated-text)))
      translated-text)))

I think it would be even nicer if I could use speech synthesis, so I can keep it a little more separate from my typing thoughts. I want to be able to say "Okay, translate …" or "Okay, … in French" to get a translation. I've been using my fork of natrys/whisper.el for speech recognition in English, and I like it a lot. By adding a function to whisper-after-transcription-hook, I can modify the intermediate results before they're inserted into the buffer.

(defun my-whisper-translate ()
  (goto-char (point-min))
  (let ((case-fold-search t))
    (when (re-search-forward "okay[,\\.]? translate[,\\.]? \\(.+\\)\\|okay[,\\.]? \\(.+?\\) in French" nil t)
      (let* ((s (or (match-string 1) (match-string 2)))
             (translation (save-match-data (my-french-en-to-fr s))))
        (replace-match
         (propertize translation
                     'type-hint translation
                     'help-echo s))))))

(with-eval-after-load 'whisper
  (add-hook 'whisper-after-transcription-hook 'my-whisper-translate 70))

But that's too easy. I want to actually type things myself so that I get more practice. Something like an autocomplete suggestion would be handy as a way of showing me a hint at the cursor. The usual completion-at-point functions are too eager to insert things if there's only one candidate, so we'll just fake it with an overlay. This code works only with my whisper.el fork because it supports using a list of functions for whisper-insert-text-at-point.

(defun my-whisper-maybe-type-with-hints (text)
  "Add this function to `whisper-insert-text-at-point'."
  (let ((hint (and text (org-find-text-property-in-string 'type-hint text))))
    (if hint
        (progn
          (my-type-with-hint hint)
          nil)
      text)))

(defvar-local my-practice-overlay nil)
(defvar-local my-practice-target nil)
(defvar-local my-practice-start nil)

(defun my-practice-cleanup ()
  "Remove the overlay and stop monitoring."
  (when (overlayp my-practice-overlay)
    (delete-overlay my-practice-overlay))
  (setq my-practice-overlay nil
        my-practice-target nil
        my-practice-start nil)
  (remove-hook 'post-command-hook #'my-practice-monitor t))

(defun my-practice-monitor ()
  "Updates hint or cancels."
  (let* ((pos (point))
         (input (buffer-substring-no-properties my-practice-start pos))
         (input-len (length input))
         (target-len (length my-practice-target)))
    (cond
     ((or (< pos my-practice-start)
          (> pos (+ my-practice-start target-len))
          (string-match "[\n\t]" input)
          (string= input my-practice-target))
      (my-practice-cleanup))
     ((string-prefix-p (downcase input) (downcase my-practice-target))
      (let ((remaining (substring my-practice-target input-len)))
        (move-overlay my-practice-overlay pos pos)
        (overlay-put my-practice-overlay 'after-string
                     (propertize remaining 'face 'shadow))))
     (t                                 ; typo
      (move-overlay my-practice-overlay pos pos)
      (overlay-put my-practice-overlay 'after-string
                   (propertize (substring my-practice-target input-len) 'face 'error))))))

(defun my-type-with-hint (string)
  "Show hints for STRING."
  (interactive "sString to practice: ")
  (my-practice-cleanup)
  (setq-local my-practice-target string)
  (setq-local my-practice-start (point))
  (setq-local my-practice-overlay (make-overlay (point) (point) nil t t))
  (overlay-put my-practice-overlay 'after-string (propertize string 'face 'shadow))
  (add-hook 'post-command-hook #'my-practice-monitor nil t))

Here's a demonstration of me saying "Okay, this is a test, in French.":

Download the video

Screencast of using speech recognition to translate into French and provide a hint when typing

Since we're faking in-buffer completion here, maybe we can still get away with considering this as an entry for Emacs Carnival February 2026: Completion ? =)

This is part of my Emacs configuration.

View Org source for this post

You can e-mail me at [email protected].

Using whisper.el to convert speech to text and save it to the currently clocked task in Org Mode or elsewhere

2026-01-13T18:38:54Z

[2026-01-30 Fri]: Major change: I switched to my fork of natrys/whisper.el so that I can specify functions that change the window configuration etc.
[2026-01-13 Tue]: Change main function to my-whisper-run, use seq-reduce to go through the functions.
[2026-01-09 Fri]: Added code for automatically capturing screenshots, saving text, working with a list of functions.
[2026-01-08 Thu]: Added demo, fixed some bugs.
[2026-01-04 Sun]: Added note about difference from MELPA package, fixed :vc

I want to get my thoughts into the computer quickly, and talking might be a good way to do some of that. OpenAI Whisper is reasonably good at recognizing my speech now and whisper.el gives me a convenient way to call whisper.cpp from Emacs with a single keybinding. (Note: This is not the same whisper package as the one on MELPA.) Here is how I have it set up for reasonable performance on my Lenovo P52 with just the CPU, no GPU.

I've bound to the command whisper-run. I press to start recording, talk, and then press to stop recording. By default, it inserts the text into the buffer at the current point. I've set whisper-return-cursor-to-start to nil so that I can keep going.

(use-package whisper
  :vc (:url "https://github.com/natrys/whisper.el")
  :load-path "~/vendor/whisper.el"
  :config
  (setq whisper--mode-line-recording-indicator "⏺")
  (setq whisper-quantize "q4_0")
  (setq whisper-install-directory "~/vendor")
  (setq whisper--install-path (concat
     (expand-file-name (file-name-as-directory whisper-install-directory))
     "whisper.cpp/"))
  ;; Get it running with whisper-server-mode set to nil first before you switch to 'local.
  ;; If you change models,
  ;; (whisper-install-whispercpp (whisper--check-install-and-run nil "whisper-start"))
  (setq whisper-server-mode 'local)
  (setq whisper-model "base")
  (setq whisper-return-cursor-to-start nil)
  ;(setq whisper--ffmpeg-input-device "alsa_input.usb-Blue_Microphones_Yeti_Stereo_Microphone_REV8-00.analog-stereo")
  (setq whisper--ffmpeg-input-device "VirtualMicSink.monitor")
  (setq whisper-language "en")
  (setq whisper-recording-timeout 3000)
  (setq whisper-before-transcription-hook nil)
  (setq whisper-use-threads (1- (num-processors)))
  (setq whisper-transcription-buffer-name-function 'whisper--simple-transcription-buffer-name)
  (add-hook 'whisper-after-transcription-hook 'my-subed-fix-common-errors-from-start -100)
  :bind
  (("" . whisper-run)
   ("C-" . my-whisper-run)
   ("S-" . my-whisper-replay)
   ("M-" . my-whisper-toggle-language)))

Let's see if we can process "Computer remind me to…":

(defvar my-whisper-org-reminder-template "t")

(defun my-whisper-org-process-reminder ()
  (let ((text (buffer-string))
        reminder)
    (when (string-match "computer[,\.]? reminds? me to \\(.+\\)" text)
      (setq reminder (match-string 1 text))
      (save-window-excursion
        (with-current-buffer (if (markerp whisper--marker) (marker-buffer whisper--marker) (current-buffer))
          (when (markerp whisper--marker) (goto-char whisper--marker))
          (org-capture nil my-whisper-org-reminder-template)
          (insert reminder)
          (org-capture-finalize)))
      (erase-buffer))))

(with-eval-after-load 'whisper
  (add-hook 'whisper-after-transcription-hook 'my-whisper-org-process-reminder 50))

Disk space is inexpensive and backups are great, so let's save each file using the timestamp.

(defvar my-whisper-dir "~/recordings/whisper/")
(defun my-whisper-set-temp-filename ()
  (setq whisper--temp-file (expand-file-name
                            (format-time-string "%Y-%m-%d-%H-%M-%S.wav")
                            my-whisper-dir)))

(with-eval-after-load 'whisper
  (add-hook 'whisper-before-transcription-hook #'my-whisper-set-temp-filename))

The technology isn't quite there yet to do real-time audio transcription so that I can see what it understands while I'm saying things, but that might be distracting anyway. If I do it in short segments, it might still be okay. I can replay the most recently recorded snippet in case it's missed something and I've forgotten what I just said.

(defun my-whisper-replay (&optional file)
  "Replay the last temporary recording."
  (interactive (list
                (when current-prefix-arg
                  (read-file-name "File: " my-whisper-dir))))
  (setq whisper--temp-file (or file whisper--temp-file))
  (mpv-play whisper--temp-file))

(defun my-whisper-insert-retry (&optional file)
  (interactive (list
                (when current-prefix-arg
                  (read-file-name "File: " my-whisper-dir))))
  (whisper--cleanup-transcription)
  (setq whisper--marker (point-marker)
        whisper--temp-file (or file whisper--temp-file))
  (whisper--transcribe-audio))

Il peut aussi comprendre le français.

(defun my-whisper-toggle-language ()
  "Set the language explicitly, since sometimes auto doesn't figure out the right one."
  (interactive)
  (setq whisper-language (if (string= whisper-language "en") "fr" "en"))
  ;; If using a server, we need to restart for the language
  (when (process-live-p whisper--server-process) (kill-process whisper--server-process))
  (message "%s" whisper-language))

I could use this with org-capture, but that's a lot of keystrokes. My shortcut for org-capture is C-c r. I need to press at least one key to set the template, to start recording, to stop recording, and C-c C-c to save it. I want to be able to capture notes to my currently clocked in task without having an Org capture buffer interrupt my display.

To clock in, I can use C-c C-x i or my ! speed command. Bonus: the modeline displays the current task to keep me on track, and I can use org-clock-goto (which I've bound to C-c j) to jump to it.

Then, when I'm looking at something else and I want to record a note, I can press to start the recording, and then C- to save it to my currently clocked task along with a link to whatever I'm looking at. (Update: Ooh, now I can save a screenshot too.)

(defun my-whisper-reset (text)
  (setq my-whisper-skip-annotation nil)
  (remove-hook 'whisper-insert-text-at-point #'my-whisper-org-save-to-clocked-task)
  text)

;; Only works with my tweaks to whisper.el
;; https://github.com/sachac/whisper.el/tree/whisper-insert-text-at-point-function
(with-eval-after-load 'whisper
  (setq whisper-insert-text-at-point
        '(my-whisper-handle-commands
          my-whisper-save-text
          my-whisper-save-to-file
          my-whisper-maybe-expand-snippet
          my-whisper-maybe-type
          my-whisper-maybe-type-with-hints
          my-whisper-insert
          my-whisper-reset)))

(defvar my-whisper-last-annotation nil "Last annotation so we can skip duplicates.")
(defvar my-whisper-skip-annotation nil)
(defvar my-whisper-target-markers nil "List of markers to send text to.")

(defun my-whisper-insert (text)
  (let ((markers
         (cond
          ((null my-whisper-target-markers)
           (list whisper--marker)) ; current point where whisper was started
          ((listp my-whisper-target-markers)
           my-whisper-target-markers)
          ((markerp my-whisper-target-markers)
           (list my-whisper-target-markers))))
        (orig-point (point))
        (orig-buffer (current-buffer)))
    (when text
      (mapcar (lambda (marker)
                (with-current-buffer (marker-buffer marker)
                  (save-restriction
                    (widen)
                    (when (markerp marker) (goto-char marker))
                    (when (and (derived-mode-p 'org-mode) (org-at-drawer-p))
                      (insert "\n"))
                    (whisper--insert-text
                     (concat
                      (if (looking-back "[ \t\n]\\|^")
                          ""
                        " ")
                      (string-trim text)))
                    ;; Move the marker forward here
                    (move-marker marker (point)))))
              markers)
      (when my-whisper-target-markers
        (goto-char orig-point))
      nil)))

(defun my-whisper-maybe-type (text)
  (when text
    (if (frame-focus-state)
        text
      (make-process :name "xdotool" :command
                    (list "xdotool" "type"
                          text))
      nil)))

(defun my-whisper-clear-markers ()
  (interactive)
  (setq my-whisper-target-markers nil))

(defun my-whisper-use-current-point (&optional add)
  (interactive (list current-prefix-arg))
  (if add
      (push (point-marker) my-whisper-target-markers)
    (setq my-whisper-target-markers (list (point-marker)))))

(defun my-whisper-run-at-point (&optional add)
  (interactive (list current-prefix-arg))
  (my-whisper-clear-markers)
  (whisper-run))

(keymap-global-set "" #'my-whisper-run-at-point)
(keymap-global-set "" #'whisper-run)

(defun my-whisper-jump-to-marker ()
  (interactive)
  (with-current-buffer (marker-buffer (car my-whisper-target-markers))
    (goto-char (car my-whisper-target-markers))))

(defun my-whisper-use-currently-clocked-task (&optional add)
  (interactive (list current-prefix-arg))
  (save-window-excursion
    (save-restriction
      (save-excursion
        (org-clock-goto)
        (org-end-of-meta-data)
        (org-end-of-subtree)
        (if add
            (push (point-marker) my-whisper-target-markers)
          (setq my-whisper-target-markers (list (point-marker))))))))

(defun my-whisper-run (&optional skip-annotation)
  (interactive (list current-prefix-arg))
  (require 'whisper)
  (add-hook 'whisper-insert-text-at-point #'my-whisper-org-save-to-clocked-task -10)
  (whisper-run)
  (when skip-annotation
    (setq my-whisper-skip-annotation t)))

(defun my-whisper-save-text (text)
  "Save TEXT beside `whisper--temp-file'."
  (when text
    (let ((link (org-store-link nil)))
      (with-temp-file (concat (file-name-sans-extension whisper--temp-file) ".txt")
        (when link
          (insert link "\n"))
        (insert text)))
    text))

(defun my-whisper-org-save-to-clocked-task (text)
  (when text
    (save-window-excursion
      (with-current-buffer (if (markerp whisper--marker) (marker-buffer whisper--marker) (current-buffer))
        (when (markerp whisper--marker) (goto-char whisper--marker))
        ;; Take a screenshot maybe
        (let* ((link (and (not my-whisper-skip-annotation)
                          (org-store-link nil)))
               (region (and (region-active-p) (buffer-substring (region-beginning) (region-end))))
               (screenshot-filename
                (when (or
                       (null link)
                       (not (string= my-whisper-last-annotation link))
                       (not (frame-focus-state))) ; not in focus, take a screenshot
                  (my-screenshot-current-screen (concat (file-name-sans-extension whisper--temp-file) ".png")))))
          (if (org-clocking-p)
              (save-window-excursion
                (save-restriction
                  (save-excursion
                    (org-clock-goto)
                    (org-end-of-subtree)
                    (unless (bolp)
                      (insert "\n"))
                    (insert "\n")
                    (if (and link (not (string= my-whisper-last-annotation link)))
                        (insert
                         (if screenshot-filename
                             (concat "(" (org-link-make-string
                                          (concat "file:" screenshot-filename)
                                          "screenshot") ") ")
                           "")
                         link
                         "\n")
                      (when screenshot-filename
                        (insert (org-link-make-string
                                 (concat "file:" screenshot-filename)
                                 "screenshot")
                                "\n")))
                    (when region
                      (insert "#+begin_example\n" region "\n#+end_example\n"))
                    (insert text "\n")
                    (setq my-whisper-last-annotation link)))
                (run-at-time 0.5 nil (lambda (text) (message "Added clock note: %s" text)) text))
            ;; No clocked task, prompt for a place to capture it
            (kill-new text)
            (setq org-capture-initial text)
            (call-interactively 'org-capture)
            ;; Delay the window configuration
            (let ((config (current-window-configuration)))
              (run-at-time 0.5 nil
                           (lambda (text config)
                             (set-window-configuration config)
                             (message "Copied: %s" text))
                           text config))))))))

(with-eval-after-load 'org
  (add-hook 'org-clock-in-hook #'my-whisper-org-clear-saved-annotation))

(defun my-whisper-org-clear-saved-annotation ()
  (setq my-whisper-org-last-annotation nil))

Here's an idea for a function that saves the recognized text with a timestamp.

(defvar my-whisper-notes "~/sync/stream/narration.org")
(defun my-whisper-save-to-file (text)
  (when text
    (let ((link (org-store-link nil)))
      (with-current-buffer (find-file-noselect my-whisper-notes)
        (goto-char (point-max))
        (insert "\n\n" (format-time-string "%H:%M ") text "\n" (if link (concat link "\n") ""))
        (save-buffer)
        (run-at-time 0.5 nil (lambda (text) (message "Saved to file: %s" text)) text)))
    text))

And now I can redo things if needed:

(defun my-whisper-redo ()
  (interactive)
  (setq whisper--marker (point-marker))
  (whisper--transcribe-audio))

I think I've just figured out my Pipewire setup so that I can record audio in OBS while also being able to do speech to text, without the audio stuttering. qpwgraph was super helpful for visualizing the Pipewire connections and fixing them.

systemctl --user restart pipewire
sleep 2
pactl load-module module-null-sink \
  sink_name="VirtualMicSink" sink_properties=device.description=VirtualMicSink
pactl load-module module-null-sink \
  sink_name="CombinedSink" sink_properties=device.description=CombinedSink
if pactl list short sources | grep -i pci-0000; then
  pactl load-module module-loopback \
    source="alsa_input.pci-0000_00_1f.3.analog-stereo" \
    sink="VirtualMicSink" \
    latency_msec=100 \
    adjust_time=1 \
    source_output_properties="node.description='SysToVMic' node.name='SysToVMic' media.name='SysVToMic'" \
    sink_input_properties="node.description='SysToVMic' node.name='SysToVMic' media.role='filter'"
    sink_input_properties=media.role=filter
  pactl load-module module-loopback \    source="alsa_output.pci-0000_00_1f.3.analog-stereo.monitor" \
    sink="CombinedSink" \
    node_name="SystemOutToCombined" \
    source_output_properties="node.description='SysOutToCombined' node.name='SysOutToCombined'" \
    sink_input_properties="node.description='SysOutToCombined' node.name='SysOutToCombined' media.role='filter'" \
    latency_msec=100 adjust_time=1
fi
if pactl list short sources | grep -i yeti; then
  pactl load-module module-loopback \
    source="alsa_input.usb-Blue_Microphones_Yeti_Stereo_Microphone_REV8-00.analog-stereo" \
    sink="VirtualMicSink" \
    latency_msec=100 \
    adjust_time=1 \
    source_output_properties="node.description='YetiToVMic' node.name='YetiToVMic' media.name='YetiToVMic'" \
    sink_input_properties="node.description='YetiToVMic' node.name='YetiToVMic' media.role='filter'"
  pactl load-module module-loopback \    source="alsa_output.usb-Blue_Microphones_Yeti_Stereo_Microphone_REV8-00.analog-stereo.monitor" \
    sink="CombinedSink" \
    source_output_properties="node.description='YetiOutToCombined' node.name='YetiOutToCombined' media.name='YetiOutToCombined' " \
    sink_input_properties="node.description='YetiOutToCombined' node.name='YetiOutToCombined' media.role='filter'" \
    latency_msec=100 adjust_time=1
fi
pactl load-module module-loopback \
  source="VirtualMicSink.monitor" \
  sink="CombinedSink" \
  source_output_properties="node.description='VMicToCombined' node.name='VMicToCombined' media.name='VMicToCombined'" \
  sink_input_properties="node.description='VMicToCombined' node.name='VMicToCombined' media.role='filter'" \
  latency_msec=100 adjust_time=1

pactl load-module module-null-sink \
  sink_name="ExtraSink1" sink_properties=device.description=ExtraSink1

pactl load-module module-loopback \
  source="ExtraSink1.monitor" \
  sink="CombinedSink" \
  source_output_properties="node.description='ExtraSink1ToCombined' node.name='ExtraSink1ToCombined' media.name='ExtraSink1ToCombined'" \
  sink_input_properties="node.description='ExtraSink1ToCombined' node.name='ExtraSink1ToCombined' media.role='filter'" \
  latency_msec=100 adjust_time=1

Here's a demo:

Download the video

Screencast of using whisper.el to do speech-to-text into the current buffer, clocked-in task, or other function

Transcript

Controlling my Android phone by voice

2025-01-30T21:27:02Z

[2025-01-30 Thu]: Fix timestamp format in toggle recording task.

I want to be able to use voice control to do things on my phone while I'm busy washing dishes, putting things away, knitting, or just keeping my hands warm. It'll also be handy to have a way to get things out of my head when the kiddo is koala-ing me. I've been using my Google Pixel 8's voice interface to set timers, send text messages, and do quick web searches. Building on my recent thoughts on wearable computing, I decided to spend some more time investigating the Google Assistant and Voice Access features in Android and setting up other voice shortcuts.

Tasker routines

I switched back to Google Assistant from Gemini so that I could run Tasker routines. I also found out that I needed to switch the language from English/Canada to English/US in order for my Tasker scripts to run instead of Google Assistant treating them as web searches. Once that was sorted out, I could run Tasker tasks with "Hey Google, run {task-name} in Tasker" and parameterize them with "Hey Google, run {task-name} with {parameter} in Tasker."

Voice Access

Learning how to use Voice Access to navigate, click, and type on my phone was straightforward. "Scroll down" works for webpages, while "scroll right" works for the e-books I have in Libby. Tapping items by text usually works. When it doesn't, I can use "show labels", "show numbers", or "show grid." The speech-to-text of "type …" isn't as good as Whisper, so I probably won't use it for a lot of dictation, but it's fine for quick notes. I can keep recording in the background so that I have the raw audio in case I want to review it or grab the WhisperX transcripts instead.

For some reason, saying "Hey Google, voice access" to start up voice access has been leaving the Assistant dialog on the screen, which makes it difficult to interact with the screen I'm looking at. I added a Tasker routine to start voice access, wait a second, and tap on the screen to dismiss the Assistant dialog.

Start Voice.tsk.xml - Import via Taskernet

Start Voice.tsk.xml

Updating my audio braindump workflow to take advantage of WhisperX

2024-11-19T13:33:59Z

I get word timestamps for free when I transcribe with WhisperX, so I can skip the Aeneas alignment step. That means I can update my previous code for handling audio braindumps . Breaking the transcript up into sections Also, I recently updated subed-word-data to colour words based on their transcription score, which draws my attention to things that might be uncertain.

Here's what it looks like when I have the post, the transcript, and the annotated PDF.

Figure 1: Screenshot of draft, transcript, and PDF

Here's what I needed to implement my-audio-braindump-from-whisperx-json (plus some code from my previous audio braindump workflow):

(defun my-whisperx-word-list (file)
  (let* ((json-object-type 'alist)
         (json-array-type 'list))
    (seq-mapcat (lambda (seg)
                  (alist-get 'words seg))
                (alist-get 'segments (json-read-file file)))))

;; (seq-take (my-whisperx-word-list (my-latest-file "~/sync/recordings" "\\.json")) 10)
(defun my-whisperx-insert-word-list (words)
  "Inserts WORDS with text properties."
  (require 'subed-word-data)
  (mapc (lambda (word)
            (let ((start (point)))
              (insert
               (alist-get 'word word))
              (subed-word-data--add-word-properties start (point) word)
              (insert " ")))
        words))

(defun my-audio-braindump-turn-sections-into-headings ()
  (interactive)
  (goto-char (point-min))
  (while (re-search-forward "START SECTION \\(.+?\\) STOP SECTION" nil t)
    (replace-match
     (save-match-data
       (format
        "\n*** %s\n"
        (save-match-data (string-trim (replace-regexp-in-string "^[,\\.]\\|[,\\.]$" "" (match-string 1))))))
     nil t)
    (let ((prop-match (save-excursion (text-property-search-forward 'subed-word-data-start))))
      (when prop-match
        (org-entry-put (point) "START" (format-seconds "%02h:%02m:%02s" (prop-match-value prop-match)))))))

(defun my-audio-braindump-split-sentences ()
  (interactive)
  (goto-char (point-min))
  (while (re-search-forward "[a-z]\\. " nil t)
    (replace-match (concat (string-trim (match-string 0)) "\n") )))

(defun my-audio-braindump-restructure ()
  (interactive)
  (goto-char (point-min))
  (my-subed-fix-common-errors)
  (org-mode)
  (my-audio-braindump-prepare-alignment-breaks)
  (my-audio-braindump-turn-sections-into-headings)
  (my-audio-braindump-split-sentences)
  (goto-char (point-min))
  (my-remove-filler-words-at-start))

(defun my-audio-braindump-from-whisperx-json (file)
  (interactive (list (read-file-name "JSON: " "~/sync/recordings/" nil nil nil (lambda (f) (string-match "\\.json\\'" f)))))
  ;; put them all into a buffer
  (with-current-buffer (get-buffer-create "*Words*")
    (erase-buffer)
    (fundamental-mode)
    (my-whisperx-insert-word-list (my-whisperx-word-list file))
    (my-audio-braindump-restructure)
    (goto-char (point-min))
    (switch-to-buffer (current-buffer))))

(defun my-audio-braindump-process-text (file)
  (interactive (list (read-file-name "Text: " "~/sync/recordings/" nil nil nil (lambda (f) (string-match "\\.txt\\'" f)))))
  (with-current-buffer (find-file-noselect file)
    (my-audio-braindump-restructure)
    (save-buffer)))
;; (my-audio-braindump-from-whisperx-json (my-latest-file "~/sync/recordings" "\\.json"))

Ideas for next steps:

I can change my processing script to split up the Whisper TXT into sections and automatically make the PDF with nice sections.
I can add reminders and other callouts. I can style them, and I can copy reminders into a different section for easier processing.
I can look into extracting PDF annotations so that I can jump to the next highlight or copy highlighted text.

This is part of my Emacs configuration.

View org source for this post

You can e-mail me at [email protected].

Yay, I can get live speech recognition results from Emacs to Etherpad

2023-12-26T15:02:47Z

I want to see if we can caption EmacsConf live presentations and Q&A sessions, even if the automated captions need help with misrecognized words. Now that I can get live speech into Emacs using the Deepgram streaming API, I can process that information and send it to other places. Here's a quick demonstration of appending live speech captions to Etherpad:

Download the video

I added an emacsconf-pad-append-text function to emacsconf-pad.el that uses the appendText function.

You can e-mail me at [email protected].

Audio braindump workflow tweaks: Adding Org Mode hyperlinks to recordings based on keywords

2023-12-22T21:47:57Z

[2023-12-24 Sun] Added a quick video!

Download the video

Summary sketch

Getting live speech into Emacs with Deepgram's streaming API

2023-12-19T15:11:56Z

[2023-12-26 Tue]: Reorganized code to call a list of functions and pass the recognition results. Added Etherpad. Took out the mode; will just use the functions. Related: getting live speech from Emacs into Etherpad

This is a quick demonstration of using Deepgram's streaming API to do speech recognition live. It isn't as accurate as OpenAI Whisper but since Whisper doesn't have a streaming API, it'll do for now. I can correct misrecognized words manually. I tend to talk really quickly, so it displays the words per minute in my modeline. I put the words into an Org Mode buffer so I can toggle headings with avy and cycle visibility. When I'm done, it saves the text, JSON, and WAV for further processing. I think it'll be handy to have a quick way to take live notes during interviews or when I'm thinking out loud. Could be fun!

Download the video

I'm still getting some weirdness when the mode turns on when I don't expect it, so that's something to look into. Maybe I won't use it as a mode for now. I'll just use my-live-speech-start and my-live-speech-stop.

General code

(defvar my-live-speech-buffer "*Speech*")
(defvar my-live-speech-process nil)
(defvar my-live-speech-output-buffer "*Speech JSON*")

(defvar my-live-speech-functions
  '(my-live-speech-display-in-speech-buffer
    my-live-speech-display-wpm
    my-live-speech-append-to-etherpad)
  "Functions to call with one argument, the recognition results.")

(defun my-live-speech-start ()
  "Turn on live captions."
  (interactive)
  (with-current-buffer (get-buffer-create my-live-speech-buffer)
    (unless (process-live-p my-live-speech-process)
      (let ((default-directory "~/proj/deepgram-live"))
        (message "%s" default-directory)
        (with-current-buffer (get-buffer-create my-live-speech-output-buffer)
          (erase-buffer))
        (setq my-live-speech-recent-words nil
              my-live-speech-wpm-string "READY ")
        (setq my-deepgram-process
              (make-process
               :command '("bash" "run.sh")
               :name "speech"
               :filter 'my-live-speech-json-filter
               :sentinel #'my-live-speech-process-sentinel
               :buffer my-live-speech-output-buffer)))
      (org-mode))
    (display-buffer (current-buffer))))

(defun my-live-speech-stop ()
  (interactive)
  (if (process-live-p my-live-speech-process)
      (kill-process my-live-speech-process))
  (setq my-live-speech-wpm-string nil))

;; (define-minor-mode my-live-speech-mode
;;  "Show live speech and display WPM.
;; Need to check how to reliably turn this on and off."
;;  :global t :group 'sachac
;;  (if my-live-speech-mode
;;      (my-live-speech-start)
;;    (my-live-speech-stop)
;;    (setq my-live-speech-wpm-string nil)))

;; based on subed-mpv::client-filter
(defun my-live-speech-handle-json (line-object)
  "Process the JSON object in LINE."
  (run-hook-with-args 'my-live-speech-functions (json-parse-string line :object-type 'alist)))

(defun my-live-speech-process-sentinel (proc event)
  (when (string-match "finished" event)
    (my-live-speech-stop)
    ;(my-live-speech-mode -1)
    ))

(defun my-live-speech-json-filter (proc string)
  (when (buffer-live-p (process-buffer proc))
    (with-current-buffer (process-buffer proc)
      (let* ((proc-mark (process-mark proc))
             (moving (= (point) proc-mark)))
        ;;  insert the output
        (save-excursion
          (goto-char proc-mark)
          (insert string)
          (set-marker proc-mark (point)))
        (if moving (goto-char proc-mark))
        ;; process and remove all complete lines of JSON (lines are complete if ending with \n)
        (let ((pos (point-min)))
          (while (progn (goto-char pos)
                        (end-of-line)
                        (equal (following-char) ?\n))
            (let* ((end (point))
                   (line (buffer-substring pos end)))
              (delete-region pos (+ end 1))
              (with-current-buffer (get-buffer my-live-speech-buffer)
                (my-live-speech-handle-json line)))))))))

Python code based on the Deepgram streaming test suite:

Very rough app.py