#+TITLE: Sacha Chua's Emacs configuration
#+ELEVENTY_COLLECTIONS: _posts
#+ELEVENTY_BASE_DIR: ~/proj/static-blog/
#+ELEVENTY_CATEGORIES+: emacs
#+ELEVENTY_LAYOUT: layouts/post
#+ELEVENTY_BASE_URL: https://sachachua.com

* Using whisper.el to convert speech to text and save it to the currently clocked task in Org Mode or elsewhere  :speech:
:PROPERTIES:
:CUSTOM_ID: multimedia-whisper
:EXPORT_DATE: 2026-01-03T20:23:57-0500
:EXPORT_ELEVENTY_PERMALINK: /blog/2026/01/using-whisper-el-to-capture-text-to-speech-in-emacs/
:EXPORT_ELEVENTY_FILE_NAME: blog/2026/01/using-whisper-el-to-capture-text-to-speech-in-emacs/
:EXPORT_MASTODON: https://social.sachachua.com/@sacha/statuses/01KE3D9XENS7FKCEJHF7SKR3TX
:EXPORT_MODIFIED: 2026-01-13T13:38:54-0500
:END:

#+begin_update
- [2026-01-30 Fri]: Major change: I switched to [[https://github.com/sachac/whisper.el/tree/whisper-insert-text-at-point-function][my fork]] of natrys/whisper.el so that I can specify functions that change the window configuration etc.
- [2026-01-13]: Change main function to ~my-whisper-run~, use seq-reduce to go through the functions.
- [2026-01-09]: Added code for automatically capturing screenshots, saving text, working with a list of functions.
- [2026-01-08]: Added demo, fixed some bugs.
- [2026-01-04]: Added note about difference from MELPA package, fixed :vc
#+end_update

I want to get my thoughts into the computer quickly, and talking might be a good way to do some of that. [[https://github.com/openai/whisper][OpenAI Whisper]] is reasonably good at recognizing my speech now and [[https://github.com/natrys/whisper.el][whisper.el]] gives me a convenient way to call [[https://github.com/ggml-org/whisper.cpp][whisper.cpp]] from Emacs with a single keybinding. (Note: This is not the same [[https://melpa.org/#/whisper][whisper]] package as the one on MELPA.)

Here is how I have it set up for reasonable performance on my Lenovo P52 with just the CPU, no GPU. I've bound ~<f9>~ to the command ~whisper-run~.
I press ~<f9>~ to start recording, talk, and then press ~<f9>~ to stop recording. By default, it inserts the text into the buffer at the current point. I've set ~whisper-return-cursor-to-start~ to ~nil~ so that I can keep going.

#+begin_src emacs-lisp
(use-package whisper
  :vc (:url "https://github.com/natrys/whisper.el")
  :load-path "~/vendor/whisper.el"
  :config
  (setq whisper--mode-line-recording-indicator "⏺")
  (setq whisper-quantize "q4_0")
  (setq whisper-install-directory "~/vendor")
  (setq whisper--install-path (concat
                               (expand-file-name (file-name-as-directory whisper-install-directory))
                               "whisper.cpp/"))
  ;; Get it running with whisper-server-mode set to nil first before you switch to 'local.
  ;; If you change models,
  ;; (whisper-install-whispercpp (whisper--check-install-and-run nil "whisper-start"))
  (setq whisper-server-mode 'local)
  (setq whisper-model "base")
  (setq whisper-return-cursor-to-start nil)
  ;(setq whisper--ffmpeg-input-device "alsa_input.usb-Blue_Microphones_Yeti_Stereo_Microphone_REV8-00.analog-stereo")
  (setq whisper--ffmpeg-input-device "VirtualMicSink.monitor")
  (setq whisper-language "en")
  (setq whisper-recording-timeout 3000)
  (setq whisper-before-transcription-hook nil)
  (setq whisper-use-threads (1- (num-processors)))
  (setq whisper-transcription-buffer-name-function 'whisper--simple-transcription-buffer-name)
  (add-hook 'whisper-after-transcription-hook 'my-subed-fix-common-errors-from-start -100)
  :bind
  (("<f9>" . whisper-run)
   ("C-<f9>" . my-whisper-run)
   ("S-<f9>" . my-whisper-replay)
   ("M-<f9>" . my-whisper-toggle-language)))
#+end_src

Let's see if we can process "Computer remind me to...":

#+begin_src emacs-lisp
(defvar my-whisper-org-reminder-template "t")

(defun my-whisper-org-process-reminder ()
  (let ((text (buffer-string))
        reminder)
    (when (string-match "computer[,\.]? reminds? me to \\(.+\\)" text)
      (setq reminder (match-string 1 text))
      (save-window-excursion
        (with-current-buffer (if (markerp whisper--marker) (marker-buffer whisper--marker) (current-buffer))
          (when (markerp whisper--marker) (goto-char whisper--marker))
          (org-capture nil my-whisper-org-reminder-template)
          (insert reminder)
          (org-capture-finalize)))
      (erase-buffer))))

(with-eval-after-load 'whisper
  (add-hook 'whisper-after-transcription-hook 'my-whisper-org-process-reminder 50))
#+end_src

Disk space is inexpensive and backups are great, so let's save each file using the timestamp.

#+begin_src emacs-lisp
(defvar my-whisper-dir "~/recordings/whisper/")
(defun my-whisper-set-temp-filename ()
  (setq whisper--temp-file (expand-file-name
                            (format-time-string "%Y-%m-%d-%H-%M-%S.wav")
                            my-whisper-dir)))

(with-eval-after-load 'whisper
  (add-hook 'whisper-before-transcription-hook #'my-whisper-set-temp-filename))
#+end_src

The technology isn't quite there yet to do real-time audio transcription so that I can see what it understands while I'm saying things, but that might be distracting anyway. If I do it in short segments, it might still be okay. I can replay the most recently recorded snippet in case it's missed something and I've forgotten what I just said.

#+begin_src emacs-lisp
(defun my-whisper-replay (&optional file)
  "Replay the last temporary recording."
  (interactive (list
                (when current-prefix-arg
                  (read-file-name "File: " my-whisper-dir))))
  (setq whisper--temp-file (or file whisper--temp-file))
  (mpv-play whisper--temp-file))

(defun my-whisper-insert-retry (&optional file)
  (interactive (list
                (when current-prefix-arg
                  (read-file-name "File: " my-whisper-dir))))
  (whisper--cleanup-transcription)
  (setq whisper--marker (point-marker)
        whisper--temp-file (or file whisper--temp-file))
  (whisper--transcribe-audio))
#+end_src

It can also understand French.
#+begin_src emacs-lisp
(defun my-whisper-toggle-language ()
  "Set the language explicitly, since sometimes auto doesn't figure out the right one."
  (interactive)
  (setq whisper-language (if (string= whisper-language "en") "fr" "en"))
  ;; If using a server, we need to restart it for the language change to take effect
  (when (process-live-p whisper--server-process) (kill-process whisper--server-process))
  (message "%s" whisper-language))
#+end_src

I could use this with ~org-capture~, but that's a lot of keystrokes. My shortcut for org-capture is ~C-c r~. I need to press at least one key to set the template, ~<f9>~ to start recording, ~<f9>~ to stop recording, and ~C-c C-c~ to save it. I want to be able to capture notes to my currently clocked in task without having an Org capture buffer interrupt my display.

To clock in, I can use ~C-c C-x i~ or my ~!~ [[dotemacs:org-mode-keyboard-shortcuts-other-speed-commands][speed command]]. Bonus: the modeline displays the current task to keep me on track, and I can use ~org-clock-goto~ (which I've bound to ~C-c j~) to jump to it.

Then, when I'm looking at something else and I want to record a note, I can press ~<f9>~ to start the recording, and then ~C-<f9>~ to save it to my currently clocked task along with a link to whatever I'm looking at. (Update: Ooh, now I can save a screenshot too.)
#+begin_src emacs-lisp
(defun my-whisper-reset (text)
  (setq my-whisper-skip-annotation nil)
  (remove-hook 'whisper-insert-text-at-point #'my-whisper-org-save-to-clocked-task)
  text)
#+end_src

#+NAME: whisper-insert-text-at-point-functions
#+begin_src emacs-lisp
;; Only works with my tweaks to whisper.el
;; https://github.com/sachac/whisper.el/tree/whisper-insert-text-at-point-function
(with-eval-after-load 'whisper
  (setq whisper-insert-text-at-point
        '(my-whisper-handle-commands
          my-whisper-save-text
          my-whisper-save-to-file
          my-whisper-maybe-expand-snippet
          my-whisper-maybe-type
          my-whisper-maybe-type-with-hints
          my-whisper-insert
          my-whisper-reset)))
#+end_src

#+begin_src emacs-lisp
(defvar my-whisper-last-annotation nil "Last annotation so we can skip duplicates.")
(defvar my-whisper-skip-annotation nil)
(defvar my-whisper-target-markers nil "List of markers to send text to.")

(defun my-whisper-insert (text)
  (let ((markers
         (cond
          ((null my-whisper-target-markers)
           (list whisper--marker)) ; current point where whisper was started
          ((listp my-whisper-target-markers)
           my-whisper-target-markers)
          ((markerp my-whisper-target-markers)
           (list my-whisper-target-markers))))
        (orig-point (point))
        (orig-buffer (current-buffer)))
    (when text
      ;; mapc rather than mapcar: we only want the side effects
      (mapc (lambda (marker)
              (with-current-buffer (marker-buffer marker)
                (save-restriction
                  (widen)
                  (when (markerp marker) (goto-char marker))
                  (when (and (derived-mode-p 'org-mode) (org-at-drawer-p))
                    (insert "\n"))
                  (whisper--insert-text
                   (concat
                    (if (looking-back "[ \t\n]\\|^")
                        ""
                      " ")
                    (string-trim text)))
                  ;; Move the marker forward here
                  (move-marker marker (point)))))
            markers)
      (when my-whisper-target-markers
        (goto-char orig-point))
      nil)))

(defun my-whisper-maybe-type (text)
  (when text
    (if (frame-focus-state)
        text
      (make-process :name "xdotool" :command
                    (list "xdotool" "type"
                          text))
      nil)))

(defun my-whisper-clear-markers ()
  (interactive)
  (setq my-whisper-target-markers nil))

(defun my-whisper-use-current-point (&optional add)
  (interactive (list current-prefix-arg))
  (if add
      (push (point-marker) my-whisper-target-markers)
    (setq my-whisper-target-markers (list (point-marker)))))

(defun my-whisper-run-at-point (&optional add)
  (interactive (list current-prefix-arg))
  (my-whisper-clear-markers)
  (whisper-run))

(keymap-global-set "" #'my-whisper-run-at-point)
(keymap-global-set "" #'whisper-run)

(defun my-whisper-jump-to-marker ()
  (interactive)
  (with-current-buffer (marker-buffer (car my-whisper-target-markers))
    (goto-char (car my-whisper-target-markers))))

(defun my-whisper-use-currently-clocked-task (&optional add)
  (interactive (list current-prefix-arg))
  (save-window-excursion
    (save-restriction
      (save-excursion
        (org-clock-goto)
        (org-end-of-meta-data)
        (org-end-of-subtree)
        (if add
            (push (point-marker) my-whisper-target-markers)
          (setq my-whisper-target-markers (list (point-marker))))))))

(defun my-whisper-run (&optional skip-annotation)
  (interactive (list current-prefix-arg))
  (require 'whisper)
  (add-hook 'whisper-insert-text-at-point #'my-whisper-org-save-to-clocked-task -10)
  (whisper-run)
  (when skip-annotation
    (setq my-whisper-skip-annotation t)))

(defun my-whisper-save-text (text)
  "Save TEXT beside `whisper--temp-file'."
  (when text
    (let ((link (org-store-link nil)))
      (with-temp-file (concat (file-name-sans-extension whisper--temp-file) ".txt")
        (when link
          (insert link "\n"))
        (insert text)))
    text))

(defun my-whisper-org-save-to-clocked-task (text)
  (when text
    (save-window-excursion
      (with-current-buffer (if (markerp whisper--marker) (marker-buffer whisper--marker) (current-buffer))
        (when (markerp whisper--marker) (goto-char whisper--marker))
        ;; Take a screenshot maybe
        (let* ((link (and (not my-whisper-skip-annotation)
                          (org-store-link nil)))
               (region (and (region-active-p) (buffer-substring (region-beginning) (region-end))))
               (screenshot-filename
                (when (or
                       (null link)
                       (not (string= my-whisper-last-annotation link))
                       (not (frame-focus-state))) ; not in focus, take a screenshot
                  (my-screenshot-current-screen (concat (file-name-sans-extension whisper--temp-file) ".png")))))
          (if (org-clocking-p)
              (save-window-excursion
                (save-restriction
                  (save-excursion
                    (org-clock-goto)
                    (org-end-of-subtree)
                    (unless (bolp)
                      (insert "\n"))
                    (insert "\n")
                    (if (and link (not (string= my-whisper-last-annotation link)))
                        (insert
                         (if screenshot-filename
                             (concat "(" (org-link-make-string
                                          (concat "file:" screenshot-filename)
                                          "screenshot") ") ")
                           "")
                         link
                         "\n")
                      (when screenshot-filename
                        (insert (org-link-make-string
                                 (concat "file:" screenshot-filename)
                                 "screenshot")
                                "\n")))
                    (when region
                      (insert "#+begin_example\n" region "\n#+end_example\n"))
                    (insert text "\n")
                    (setq my-whisper-last-annotation link)))
                (run-at-time 0.5 nil (lambda (text) (message "Added clock note: %s" text)) text))
            ;; No clocked task, prompt for a place to capture it
            (kill-new text)
            (setq org-capture-initial text)
            (call-interactively 'org-capture)
            ;; Delay the window configuration
            (let ((config (current-window-configuration)))
              (run-at-time 0.5 nil
                           (lambda (text config)
                             (set-window-configuration config)
                             (message "Copied: %s" text))
                           text config))))))))

(with-eval-after-load 'org
  (add-hook 'org-clock-in-hook #'my-whisper-org-clear-saved-annotation))

(defun my-whisper-org-clear-saved-annotation ()
  ;; Reset the variable that `my-whisper-org-save-to-clocked-task' compares against
  (setq my-whisper-last-annotation nil))
#+end_src

Here's an idea for a function that saves the recognized text with a timestamp.

#+begin_src emacs-lisp
(defvar my-whisper-notes "~/sync/stream/narration.org")
(defun my-whisper-save-to-file (text)
  (when text
    (let ((link (org-store-link nil)))
      (with-current-buffer (find-file-noselect my-whisper-notes)
        (goto-char (point-max))
        (insert "\n\n" (format-time-string "%H:%M ") text "\n" (if link (concat link "\n") ""))
        (save-buffer)
        (run-at-time 0.5 nil (lambda (text) (message "Saved to file: %s" text)) text)))
    text))
#+end_src

And now I can redo things if needed:

#+begin_src emacs-lisp
(defun my-whisper-redo ()
  (interactive)
  (setq whisper--marker (point-marker))
  (whisper--transcribe-audio))
#+end_src

I think I've just figured out my Pipewire setup so that I can record audio in OBS while also being able to do speech to text, without the audio stuttering. [[https://github.com/rncbc/qpwgraph][qpwgraph]] was super helpful for visualizing the Pipewire connections and fixing them.
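Since ~whisper--ffmpeg-input-device~ is set to ~VirtualMicSink.monitor~, it can help to confirm from Emacs that the source exists before recording. Here is a minimal sketch of such a check. It assumes ~pactl~ is on the path and that ~pactl list short sources~ prints tab-separated fields with the source name in the second field. The ~my-demo-~ names are hypothetical and are not part of whisper.el or my config.

#+begin_src emacs-lisp
(require 'seq)

(defun my-demo-source-listed-p (name lines)
  "Return non-nil if NAME is the source name on one of LINES.
LINES are lines in the assumed `pactl list short sources' format:
tab-separated fields, with the source name as the second field."
  (seq-some (lambda (line)
              (equal name (nth 1 (split-string line "\t"))))
            lines))

(defun my-demo-check-whisper-source (&optional name)
  "Report whether the audio source NAME is currently available.
NAME defaults to \"VirtualMicSink.monitor\"."
  (interactive)
  (let ((name (or name "VirtualMicSink.monitor")))
    ;; `process-lines' signals an error if pactl is missing or fails.
    (if (my-demo-source-listed-p
         name (process-lines "pactl" "list" "short" "sources"))
        (message "Found source %s" name)
      (user-error "Source %s not found" name))))
#+end_src

Keeping the parsing in its own function means it can be tried on sample lines without a running Pipewire session.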
#+begin_src sh :eval no :tangle "~/bin/setup-mic" :shebang "#!/bin/bash"
systemctl --user restart pipewire
sleep 2
pactl load-module module-null-sink \
  sink_name="VirtualMicSink" sink_properties=device.description=VirtualMicSink
pactl load-module module-null-sink \
  sink_name="CombinedSink" sink_properties=device.description=CombinedSink
if pactl list short sources | grep -i pci-0000; then
  pactl load-module module-loopback \
    source="alsa_input.pci-0000_00_1f.3.analog-stereo" \
    sink="VirtualMicSink" \
    latency_msec=100 \
    adjust_time=1 \
    source_output_properties="node.description='SysToVMic' node.name='SysToVMic' media.name='SysVToMic'" \
    sink_input_properties="node.description='SysToVMic' node.name='SysToVMic' media.role='filter'"
  pactl load-module module-loopback \
    source="alsa_output.pci-0000_00_1f.3.analog-stereo.monitor" \
    sink="CombinedSink" \
    node_name="SystemOutToCombined" \
    source_output_properties="node.description='SysOutToCombined' node.name='SysOutToCombined'" \
    sink_input_properties="node.description='SysOutToCombined' node.name='SysOutToCombined' media.role='filter'" \
    latency_msec=100 adjust_time=1
fi
if pactl list short sources | grep -i yeti; then
  pactl load-module module-loopback \
    source="alsa_input.usb-Blue_Microphones_Yeti_Stereo_Microphone_REV8-00.analog-stereo" \
    sink="VirtualMicSink" \
    latency_msec=100 \
    adjust_time=1 \
    source_output_properties="node.description='YetiToVMic' node.name='YetiToVMic' media.name='YetiToVMic'" \
    sink_input_properties="node.description='YetiToVMic' node.name='YetiToVMic' media.role='filter'"
  pactl load-module module-loopback \
    source="alsa_output.usb-Blue_Microphones_Yeti_Stereo_Microphone_REV8-00.analog-stereo.monitor" \
    sink="CombinedSink" \
    source_output_properties="node.description='YetiOutToCombined' node.name='YetiOutToCombined' media.name='YetiOutToCombined'" \
    sink_input_properties="node.description='YetiOutToCombined' node.name='YetiOutToCombined' media.role='filter'" \
    latency_msec=100 adjust_time=1
fi
pactl load-module module-loopback \
  source="VirtualMicSink.monitor" \
  sink="CombinedSink" \
  source_output_properties="node.description='VMicToCombined' node.name='VMicToCombined' media.name='VMicToCombined'" \
  sink_input_properties="node.description='VMicToCombined' node.name='VMicToCombined' media.role='filter'" \
  latency_msec=100 adjust_time=1

pactl load-module module-null-sink \
  sink_name="ExtraSink1" sink_properties=device.description=ExtraSink1

pactl load-module module-loopback \
  source="ExtraSink1.monitor" \
  sink="CombinedSink" \
  source_output_properties="node.description='ExtraSink1ToCombined' node.name='ExtraSink1ToCombined' media.name='ExtraSink1ToCombined'" \
  sink_input_properties="node.description='ExtraSink1ToCombined' node.name='ExtraSink1ToCombined' media.role='filter'" \
  latency_msec=100 adjust_time=1
#+end_src

Here's a demo:

#+begin_media-post
[[video:https://sachachua.com/blog/2026/01/using-whisper-el-to-capture-text-to-speech-in-emacs/2026-01-08_11.17.22.webm?caption=Screencast of using whisper.el to do speech-to-text into the current buffer, clocked-in task, or other function&captions=t]]

#+begin_my_details Transcript :open t
captions:~/recordings/2026-01-08_11.17.22.vtt
#+end_my_details
#+end_media-post

And then I define a global shortcut in KDE that runs:

#+begin_src sh :eval no :tangle no
/home/sacha/bin/xdotool-emacs key --clearmodifiers F9
#+end_src

So now I can dictate into other applications or save into Emacs. Which suggests of course that I should get it working with C-f9 as well, if I can avoid the keyboard shortcut loop...
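
The choice between typing into another application and saving into Emacs comes from passing the recognized text through a list of handler functions. Each handler returns the text for the next one, or nil once the text has been dealt with. Here is a minimal sketch of that idea using ~seq-reduce~. The ~my-demo-~ names are hypothetical, and the actual code in my whisper.el fork may differ in its details.

#+begin_src emacs-lisp
(require 'seq)
(require 'subr-x)

(defun my-demo-run-handlers (handlers text)
  "Pass TEXT through HANDLERS in order and return the final value.
Each handler is called with the previous handler's return value,
so a handler that returns nil hands nil to everything after it."
  (seq-reduce (lambda (acc handler) (funcall handler acc))
              handlers
              text))

(defun my-demo-trim (text)
  "Hypothetical handler: trim whitespace around TEXT."
  (when text (string-trim text)))

(defun my-demo-consume-commands (text)
  "Hypothetical handler: swallow TEXT that starts with \"computer\".
Return nil for commands so that later handlers skip them."
  (when (and text (not (string-prefix-p "computer" (downcase text))))
    text))

(defvar my-demo-log nil
  "Texts seen by `my-demo-record', most recent first.")

(defun my-demo-record (text)
  "Hypothetical handler: remember TEXT, then pass it along unchanged."
  (when text
    (push text my-demo-log)
    text))
#+end_src

For example, ~(my-demo-run-handlers '(my-demo-trim my-demo-consume-commands my-demo-record) "  hello  ")~ returns ~"hello"~ and records it, while text starting with "Computer" comes back as nil and is never recorded. Because every handler checks for nil first, the order of the list decides which handler gets the first chance to consume the text.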