Create a system for auto-applying spelling fixes

Why didn't I do something like this years ago?
TEC 2024-03-25 17:03:30 +08:00
parent df76ad127d
commit 1907b9bd27
Signed by: tec
SSH Key Fingerprint: SHA256:eobz41Mnm0/iYWBvWThftS0ElEs1ftBr6jamutnXc/A
1 changed files with 262 additions and 0 deletions

@ -4101,6 +4101,268 @@ tweaks.
(advice-add 'jinx-next :after (lambda (_) (left-word))))
#+end_src
**** Autocorrect
#+call: confpkg("autocorrect", prefix="", after="jinx")
If you want to write without looking like you skipped a chunk of
primary/secondary school (as I do), then autocorrect is a handy thing to have.
Beyond just misspellings, it can also help with typos, and lazy capitalisation
(can you really be bothered to consistently type "LuaLaTeX" instead of
"lualatex" and "SciFi" over "scifi"?). However, primarily thanks to smartphones,
I more often hear people cursing autocorrect than praising it. With that in
mind, I think it's worth giving some thought to how smartphone autocorrect gets
its bad reputation (despite largely doing a decent job):
1. Typing is harder on smartphones, and so autocorrect makes bigger (more speculative) guesses
2. People type (and mistype) differently, but autocorrect tries to have a "one
size fits all" profile that is refined over time
3. As soon as you accept a particular correction, autocorrect can start applying
that even when the original typo is ambiguous and has multiple "corrected" forms
4. It's hard to tell the phone to stop doing a particular autocorrect (see
"Emacs" recapitalised as "eMacs" on Apple devices)
I think we can largely alleviate these problems by:
1. Being mainly used on devices with actual keyboards
2. Starting with an empty autocorrect "profile", built up by the user over time
3. Having a customisable threshold before a repeated correction is made into an
autocorrection, and blacklisting misspellings with multiple distinct corrections.
4. Making it easy to blacklist certain words from becoming autocorrections
Another complaint about autocorrect is that it lets you develop bad habits, and
if anything a tool that got you to retype the correct spelling several times
would be more valuable in the long run. I think this is a pretty reasonable
complaint, and have two different trains of thought that both justify tracking
corrections made:
+ I almost never leave Emacs for writing more than a text message, so what if I
type worse outside of it?
+ By tracking corrections made, you can also make a personal "most common
misspellings" training list to run through at your leisure. Just set the
"minimum replacement count" to a stupidly high number.
For starters, let's write a record of all corrections made.
#+begin_src emacs-lisp
(defvar autocorrect-history-file
  (file-name-concat (or (getenv "XDG_STATE_HOME") "~/.local/state")
                    "emacs" "spelling-corrections.txt")
  "File where a spell check record will be saved.")
#+end_src
For simplicity of operation, I think we can just append each correction to the file
as =<misspelled> <corrected>= lines. This has a number of advantages, such as
avoiding recalculations while typing, avoiding race conditions with multiple
Emacs sessions, and making merging data on different machines trivial.
In the Emacs session though, I think we'll want to have a hash table of the
counts of each correction. We can have the misspelled words as the keys, and
then have each value be an alist of src_elisp{(correction . count)} pairs. This
table can be lazily built and processed after startup.
#+begin_src emacs-lisp
(defvar autocorrect-record-table (make-hash-table :test #'equal))
#+end_src
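To make the shape of that table concrete, here's a small sketch using a
throwaway table. The words and counts are invented for illustration,
=demo-table= is just a local name, and the block is marked =:tangle no= since
it isn't meant to end up in the config.
#+begin_src emacs-lisp :tangle no
(let ((demo-table (make-hash-table :test #'equal)))
  ;; "teh" has only ever been corrected to "the" (three times).
  (puthash "teh" (list (cons "the" 3)) demo-table)
  ;; "wich" is ambiguous: it has been corrected to two different words.
  (puthash "wich" (list (cons "which" 2) (cons "witch" 1)) demo-table)
  ;; Look up the recorded corrections for one misspelling.
  (gethash "wich" demo-table))
;; ⇒ (("which" . 2) ("witch" . 1))
#+end_src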
We probably want to also specify a threshold number of misspellings that trigger
entry to the abbrev table, both on load and when made during the current Emacs
session. For now, I'll try a value of three for on-load and two for misspellings
made in the current Emacs session. I think I want to avoid a value of one since
that makes it easy for a misspelling with multiple valid corrections to become
associated with a single correction too soon. This is a rare concern, but it
would be annoying enough to run into that I think it's worth requiring a second
misspelling.
#+begin_src emacs-lisp
(defvar autocorrect-count-threshold-history 3
  "The number of recorded identical misspellings to create an abbrev.
This applies to misspellings read from the history file.")
(defvar autocorrect-count-threshold-session 2
  "The number of identical misspellings to create an abbrev.
This applies to misspellings made in the current Emacs session.")
#+end_src
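As a rough sketch of the rule these thresholds feed into (exactly one distinct
correction, made at least the threshold number of times), consider the
predicate below. The name =my/autocorrect-eligible-p= is hypothetical and
purely illustrative; the block is marked =:tangle no= to keep it out of the
config.
#+begin_src emacs-lisp :tangle no
(defun my/autocorrect-eligible-p (corrections threshold)
  "Return non-nil if CORRECTIONS justify an automatic correction.
CORRECTIONS is an alist of (CORRECTION . COUNT) pairs for a single
misspelling, and THRESHOLD is the minimum count required."
  (and (= (length corrections) 1)          ; exactly one distinct correction
       (>= (cdar corrections) threshold))) ; made often enough

;; One correction, seen three times, threshold of three: eligible.
(my/autocorrect-eligible-p '(("the" . 3)) 3)                  ; ⇒ t
;; Two distinct corrections: never eligible, however high the counts.
(my/autocorrect-eligible-p '(("which" . 5) ("witch" . 1)) 3)  ; ⇒ nil
;; A single occurrence doesn't meet a threshold of two.
(my/autocorrect-eligible-p '(("receive" . 1)) 2)              ; ⇒ nil
#+end_src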
At this point we need to actually implement this functionality, starting with
updating the table when a correction is either read from the history file or
occurs live.
#+begin_src emacs-lisp
(defun autocorrect-update-table (misspelling corrected)
  "Update the MISSPELLING to CORRECTED entry in the table.
Returns the number of times this correction has occurred."
  (if-let ((correction-counts
            (gethash misspelling autocorrect-record-table)))
      (if-let ((record-cons (assoc corrected correction-counts)))
          ;; Known correction: bump its count (`setcdr' returns the new count).
          (setcdr record-cons (1+ (cdr record-cons)))
        ;; Known misspelling, but a new correction for it.
        (puthash misspelling
                 (push (cons corrected 1) correction-counts)
                 autocorrect-record-table)
        1)
    ;; First time this misspelling has been seen.
    (puthash misspelling
             (list (cons corrected 1))
             autocorrect-record-table)
    1))
#+end_src
We could call ~define-abbrev~ directly, but since we'll be doing so in multiple
places, I think it's nice to have a single place where the abbrev table is set
up, so any changes to the abbrev table (or similar) only need to be made in one place.
We could use the global abbrev table, but I'd rather have one dedicated to
spelling corrections. Let's manage this entirely separately to the global abbrev
file too.
#+begin_src emacs-lisp
(defvar autocorrect-abbrev-file
  (file-name-concat (or (getenv "XDG_STATE_HOME") "~/.local/state")
                    "emacs" "spelling-abbrevs.el")
  "File to save the spelling abbrevs in.")
(defvar autocorrect-abbrev-table nil
  "The spelling abbrev table.")
(defvar autocorrect-abbrev-table--saved-version 0
  "The version of `autocorrect-abbrev-table' saved to disk.")
(defun autocorrect--setup-abbrevs ()
  "Setup `autocorrect-abbrev-table'.
Also set it as a parent of `global-abbrev-table'."
  (unless autocorrect-abbrev-table
    (setq autocorrect-abbrev-table (make-abbrev-table))
    (abbrev-table-put
     global-abbrev-table :parents
     (cons autocorrect-abbrev-table
           (abbrev-table-get global-abbrev-table :parents)))
    (add-hook 'kill-emacs-hook #'autocorrect-save-abbrevs))
  (when (file-exists-p autocorrect-abbrev-file)
    (read-abbrev-file autocorrect-abbrev-file t)
    (setq autocorrect-abbrev-table--saved-version
          (abbrev-table-get autocorrect-abbrev-table
                            :abbrev-table-modiff))))
(defun autocorrect-save-abbrevs ()
  "Write `autocorrect-abbrev-table'."
  (when (> (abbrev-table-get autocorrect-abbrev-table
                             :abbrev-table-modiff)
           autocorrect-abbrev-table--saved-version)
    (unless (file-exists-p autocorrect-abbrev-file)
      (make-directory (file-name-directory autocorrect-abbrev-file) t))
    (let ((coding-system-for-write 'utf-8))
      (with-temp-buffer
        (insert-abbrev-table-description 'autocorrect-abbrev-table nil)
        (when (unencodable-char-position (point-min) (point-max) 'utf-8)
          (setq coding-system-for-write 'utf-8-emacs))
        (goto-char (point-min))
        (insert (format ";;-*-coding: %s;-*-\n\n" coding-system-for-write))
        ;; APPEND must be nil so that the file is truncated before writing
        ;; (an integer there means "seek to this offset", which leaves stale
        ;; bytes behind when the table shrinks).  The VISIT value of 0
        ;; suppresses the "Wrote file" message.
        (write-region nil nil autocorrect-abbrev-file nil 0)))
    (setq autocorrect-abbrev-table--saved-version
          (abbrev-table-get autocorrect-abbrev-table
                            :abbrev-table-modiff))))
#+end_src
Now we can write the update function that's run on a live spelling correction.
#+begin_src emacs-lisp
(defun autocorrect-record-correction (misspelling corrected)
  "Record the correction of MISSPELLING to CORRECTED."
  (let ((write-region-inhibit-fsync t) ; Quicker writes
        (coding-system-for-write 'utf-8)
        (inhibit-message t))
    (write-region
     (concat misspelling " " corrected "\n") nil
     autocorrect-history-file t))
  (when (and (>= (autocorrect-update-table misspelling corrected)
                 autocorrect-count-threshold-session)
             (= (length (gethash misspelling autocorrect-record-table))
                1))
    (define-abbrev autocorrect-abbrev-table misspelling corrected)
    (message "Created new autocorrection: %s ⟶ %s"
             (propertize misspelling 'face 'warning)
             (propertize corrected 'face 'success))))
#+end_src
The only thing left to be done now is load the history file. I think I'd like to
split the actual reading and the abbrev generation into two parts though.
#+begin_src emacs-lisp
(defun autocorrect--read-history ()
  "Read the history file into the correction table."
  (if (file-exists-p autocorrect-history-file)
      (with-temp-buffer
        (insert-file-contents autocorrect-history-file)
        (goto-char (point-min))
        (while (not (eobp))
          ;; Each line is "<misspelled> <corrected>".  Split on the first
          ;; space rather than by word motion, so that corrections
          ;; containing apostrophes or hyphens (e.g. "don't") are read whole.
          (let* ((line (buffer-substring (line-beginning-position)
                                         (line-end-position)))
                 (sep (string-search " " line)))
            ;; Skip malformed lines where either half would be empty.
            (when (and sep (< 0 sep (1- (length line))))
              (autocorrect-update-table (substring line 0 sep)
                                        (substring line (1+ sep)))))
          (forward-line 1)))
    ;; Pass PARENTS as t so this neither fails when the directory already
    ;; exists nor when its parent directories are missing.
    (make-directory (file-name-directory autocorrect-history-file) t)
    (write-region "" nil autocorrect-history-file)))
(defun autocorrect--remove-invalid-abbrevs ()
  "Ensure that all entries of the abbrev table are valid."
  (obarray-map
   (lambda (abbrev)
     ;; `obarray-map' passes symbols, so take the symbol's name to get the
     ;; misspelling.  Abbrev obarrays also hold an empty-named symbol
     ;; carrying the table's properties, which must be skipped.
     (let ((misspelling (symbol-name abbrev)))
       (unless (string= misspelling "")
         (let ((corrections (gethash misspelling autocorrect-record-table)))
           (unless (and (= (length corrections) 1)
                        (>= (cdar corrections)
                            autocorrect-count-threshold-history))
             (define-abbrev autocorrect-abbrev-table misspelling nil))))))
   autocorrect-abbrev-table))
(defun autocorrect--create-history-abbrevs ()
  "Apply the history threshold to the current correction table."
  (maphash
   (lambda (misspelling corrections)
     (when (and (= (length corrections) 1)
                (>= (cdar corrections)
                    autocorrect-count-threshold-history))
       (unless (obarray-get autocorrect-abbrev-table misspelling)
         (define-abbrev autocorrect-abbrev-table
           misspelling (caar corrections)))))
   autocorrect-record-table))
(defun autocorrect-setup ()
  "Read and process the history file into abbrevs."
  (autocorrect--read-history)
  (autocorrect--setup-abbrevs)
  (autocorrect--remove-invalid-abbrevs)
  (autocorrect--create-history-abbrevs))
#+end_src
We don't want to load the history eagerly, but we do want it available soon
after startup. I think an idle timer would be a good way to do this.
#+begin_src emacs-lisp
(run-with-idle-timer 0.5 nil #'autocorrect-setup)
#+end_src
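The personal "most common misspellings" training list mentioned earlier could
be built from the same table. Here's a minimal sketch of how that might look:
the function name =my/autocorrect-top-misspellings= is hypothetical, it takes
the table as an argument so it can be tried on any hash table of the same
shape, and it's marked =:tangle no= as it's only an illustration.
#+begin_src emacs-lisp :tangle no
(require 'seq)

(defun my/autocorrect-top-misspellings (table &optional limit)
  "Return the most frequently corrected misspellings in TABLE.
TABLE maps each misspelling to an alist of (CORRECTION . COUNT) pairs.
The result is a list of (MISSPELLING . TOTAL) pairs, most frequent
first, truncated to LIMIT entries when LIMIT is non-nil."
  (let (totals)
    (maphash
     (lambda (misspelling corrections)
       ;; Sum the counts over every recorded correction.
       (push (cons misspelling
                   (apply #'+ (mapcar #'cdr corrections)))
             totals))
     table)
    ;; Sort by total count, highest first.
    (setq totals (sort totals (lambda (a b) (> (cdr a) (cdr b)))))
    (if limit (seq-take totals limit) totals)))
#+end_src
Once the history has been loaded, something like
src_elisp{(my/autocorrect-top-misspellings autocorrect-record-table 20)} would
then give the twenty misspellings most in need of practice.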
-----
There we go, that's a complete self-managing abbrev-run frequent-misspelling
correction system. We can hook this up to Jinx by taking note of a helpful [[https://github.com/minad/jinx/wiki#save-misspelling-and-correction-as-abbreviation][code
snippet]] in the Jinx wiki for immediately saving all corrected misspellings into
the global abbrev list.
#+begin_src emacs-lisp
(defun autocorrect-record-jinx-correction (overlay corrected)
  "Record that the text under the Jinx OVERLAY was replaced by CORRECTED."
  (let ((text
         (buffer-substring-no-properties
          (overlay-start overlay)
          (overlay-end overlay))))
    (autocorrect-record-correction text corrected)))
(advice-add 'jinx--correct-replace :before #'autocorrect-record-jinx-correction)
#+end_src
**** Downloading dictionaries
Let's get a nice big dictionary from [[http://app.aspell.net/create][SCOWL Custom List/Dictionary Creator]] with