Create a system for auto-applying spelling fixes
Why didn't I do something like this years ago?
parent
df76ad127d
commit
1907b9bd27
262
config.org
@@ -4101,6 +4101,268 @@ tweaks.
(advice-add 'jinx-next :after (lambda (_) (left-word))))
|
||||
#+end_src
|
||||
|
||||
**** Autocorrect
|
||||
|
||||
#+call: confpkg("autocorrect", prefix="", after="jinx")
|
||||
|
||||
If you want to write without looking like you skipped a chunk of
primary/secondary school (as I do), then autocorrect is a handy thing to have.
Beyond just misspellings, it can also help with typos, and lazy capitalisation
(can you really be bothered to consistently type "LuaLaTeX" instead of
"lualatex" and "SciFi" over "scifi"?). However, primarily thanks to smartphones,
I more often hear people cursing autocorrect than praising it. With that in
mind, I think it's worth giving some thought to how smartphone autocorrect gets
its bad reputation (despite largely doing a decent job):
1. Typing is harder on smartphones, and so autocorrect makes bigger (more speculative) guesses
2. People type (and mistype) differently, but autocorrect tries to have a "one
   size fits all" profile that is refined over time
3. As soon as you accept a particular correction, autocorrect can start applying
   that even when the original typo is ambiguous and has multiple "corrected" forms
4. It's hard to tell the phone to stop doing a particular autocorrect (see
   "Emacs" recapitalised as "eMacs" on Apple devices)
|
||||
|
||||
I think we can largely alleviate these problems by:
1. Being mainly used on devices with actual keyboards
2. Starting with an empty autocorrect "profile", built up by the user over time
3. Having a customisable threshold before a repeated correction is made into an
   autocorrection, and blacklisting misspellings with multiple distinct corrections
4. Making it easy to blacklist certain words from becoming autocorrections
|
||||
|
||||
Another complaint about autocorrect is that it lets you develop bad habits, and
if anything a tool that got you to retype the correct spelling several times
would be more valuable in the long run. I think this is a pretty reasonable
complaint, and have two different trains of thought that both justify tracking
corrections made:
+ I almost never leave Emacs for writing more than a text message, so what if I
  type worse outside of it?
+ By tracking corrections made, you can also make a personal "most common
  misspellings" training list to run through at your leisure. Just set the
  "minimum replacement count" to a stupidly high number.
|
||||
|
||||
For starters, let's write a record of all corrections made.
|
||||
|
||||
#+begin_src emacs-lisp
(defvar autocorrect-history-file
  (file-name-concat (or (getenv "XDG_STATE_HOME") "~/.local/state")
                    "emacs" "spelling-corrections.txt")
  "File where a spell check record will be saved.")
#+end_src
|
||||
|
||||
For simplicity of operation, I think we can just append each correction to the
file as =<misspelled> <corrected>= lines. This has a number of advantages, such
as avoiding recalculations while typing, avoiding race conditions with multiple
Emacs sessions, and making merging data on different machines trivial.
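As a rough illustration of why the flat format keeps things simple, here is a
minimal sketch of parsing such lines. The =my/demo-parse-history= name is
hypothetical and not part of this config, and I'm assuming =:tangle no= is
enough to keep the block out of the tangled output.

#+begin_src emacs-lisp :tangle no
;; Hypothetical helper, for illustration only.
(defun my/demo-parse-history (text)
  "Return a list of (MISSPELLED . CORRECTED) pairs parsed from TEXT.
Lines that are not exactly two space-separated fields are skipped."
  (delq nil
        (mapcar (lambda (line)
                  (let ((fields (split-string line " " t)))
                    (when (= (length fields) 2)
                      (cons (car fields) (cadr fields)))))
                (split-string text "\n" t))))

;; Merging the records of two machines is plain concatenation.
(my/demo-parse-history (concat "teh the\n" "recieve receive\nteh the\n"))
;; => (("teh" . "the") ("recieve" . "receive") ("teh" . "the"))
#+end_src

Since every line is an independent record, the order of the lines doesn't
matter for the eventual counts, which is what makes merging trivial.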
|
||||
|
||||
In the Emacs session though, I think we'll want to have a hash table of the
counts of each correction. We can have the misspelled words as the keys, and
then have each value be an alist of src_elisp{(correction . count)} pairs. This
table can be lazily built and processed after startup.
|
||||
|
||||
#+begin_src emacs-lisp
(defvar autocorrect-record-table (make-hash-table :test #'equal))
#+end_src
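To make the shape of that table concrete, here is a small sketch that fills a
throwaway table by hand and looks up counts. The =my/demo-correction-count=
helper and the sample words are made up for illustration, and the block is
marked =:tangle no= on the assumption that this keeps it out of the config.

#+begin_src emacs-lisp :tangle no
;; Hypothetical helper, for illustration only.
(defun my/demo-correction-count (table misspelling correction)
  "Return how often MISSPELLING was corrected to CORRECTION in TABLE, or 0."
  (alist-get correction (gethash misspelling table) 0 nil #'equal))

(let ((table (make-hash-table :test #'equal))) ; `equal' so string keys match
  ;; One misspelling with a single correction, seen four times.
  (puthash "teh" (list (cons "the" 4)) table)
  ;; An ambiguous misspelling with two distinct corrections.
  (puthash "wich" (list (cons "which" 2) (cons "witch" 1)) table)
  (list (my/demo-correction-count table "teh" "the")
        (my/demo-correction-count table "wich" "witch")
        (my/demo-correction-count table "adn" "and")))
;; => (4 1 0)
#+end_src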
|
||||
|
||||
We probably want to also specify a threshold number of misspellings that trigger
entry to the abbrev table, both on load and when made during the current Emacs
session. For now, I'll try a value of three for on-load and two for misspellings
made in the current Emacs session. I think I want to avoid a value of one since
that makes it easy for a misspelling with multiple valid corrections to become
associated with a single correction too soon. This is a rare concern, but it
would be annoying enough to run into that I think it's worth requiring a second
misspelling.
|
||||
|
||||
#+begin_src emacs-lisp
(defvar autocorrect-count-threshold-history 3
  "The number of recorded identical misspellings to create an abbrev.
This applies to misspellings read from the history file.")
(defvar autocorrect-count-threshold-session 2
  "The number of identical misspellings to create an abbrev.
This applies to misspellings made in the current Emacs session.")
#+end_src
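The rule these thresholds feed into can be sketched as a small predicate: a
misspelling only qualifies when it has exactly one recorded correction, and
that correction has been seen often enough. The name
=my/demo-autocorrect-eligible-p= is hypothetical (the config itself inlines
this check), and =:tangle no= is assumed to keep the sketch out of the tangled
output.

#+begin_src emacs-lisp :tangle no
;; Hypothetical helper, for illustration only.
(defun my/demo-autocorrect-eligible-p (corrections threshold)
  "Return non-nil if CORRECTIONS justify an autocorrection.
CORRECTIONS is an alist of (CORRECTION . COUNT) pairs for one misspelling,
and THRESHOLD is the minimum count required."
  (and (= (length corrections) 1)           ; unambiguous
       (>= (cdar corrections) threshold)))  ; seen often enough

(list (my/demo-autocorrect-eligible-p '(("the" . 3)) 3)
      (my/demo-autocorrect-eligible-p '(("the" . 2)) 3)
      (my/demo-autocorrect-eligible-p '(("which" . 5) ("witch" . 1)) 3))
;; => (t nil nil)
#+end_src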
|
||||
|
||||
At this point we need to actually implement this functionality, starting with
updating the table when a correction is either read from the history file or
occurs live.
|
||||
|
||||
#+begin_src emacs-lisp
(defun autocorrect-update-table (misspelling corrected)
  "Update the MISSPELLING to CORRECTED entry in the table.
Returns the number of times this correction has occurred."
  (if-let ((correction-counts
            (gethash misspelling autocorrect-record-table)))
      (if-let ((record-cons (assoc corrected correction-counts)))
          ;; Known correction: bump its count (`setcdr' returns the new count).
          (setcdr record-cons (1+ (cdr record-cons)))
        ;; Known misspelling, but a new correction for it.
        (puthash misspelling
                 (push (cons corrected 1) correction-counts)
                 autocorrect-record-table)
        1)
    ;; First time this misspelling has been seen.
    (puthash misspelling
             (list (cons corrected 1))
             autocorrect-record-table)
    1))
#+end_src
|
||||
|
||||
We could call ~define-abbrev~ directly, but since we'll be doing so in multiple
places, I think it's nice to have a single place where the abbrev table is set
up, so any changes to the abbrev table (or similar) only need to be made in one
place.

We could use the global abbrev table, but I'd rather have one dedicated to
spelling corrections. Let's manage this entirely separately to the global abbrev
file too.
|
||||
|
||||
#+begin_src emacs-lisp
(defvar autocorrect-abbrev-file
  (file-name-concat (or (getenv "XDG_STATE_HOME") "~/.local/state")
                    "emacs" "spelling-abbrevs.el")
  "File to save spell check records in.")

(defvar autocorrect-abbrev-table nil
  "The spelling abbrev table.")

(defvar autocorrect-abbrev-table--saved-version 0
  "The version of `autocorrect-abbrev-table' saved to disk.")

(defun autocorrect--setup-abbrevs ()
  "Setup `autocorrect-abbrev-table'.
Also set it as a parent of `global-abbrev-table'."
  (unless autocorrect-abbrev-table
    (setq autocorrect-abbrev-table (make-abbrev-table))
    (abbrev-table-put
     global-abbrev-table :parents
     (cons autocorrect-abbrev-table
           (abbrev-table-get global-abbrev-table :parents)))
    (add-hook 'kill-emacs-hook #'autocorrect-save-abbrevs))
  (when (file-exists-p autocorrect-abbrev-file)
    (read-abbrev-file autocorrect-abbrev-file t)
    (setq autocorrect-abbrev-table--saved-version
          (abbrev-table-get autocorrect-abbrev-table
                            :abbrev-table-modiff))))

(defun autocorrect-save-abbrevs ()
  "Write `autocorrect-abbrev-table'."
  (when (> (abbrev-table-get autocorrect-abbrev-table
                             :abbrev-table-modiff)
           autocorrect-abbrev-table--saved-version)
    (unless (file-exists-p autocorrect-abbrev-file)
      (make-directory (file-name-directory autocorrect-abbrev-file) t))
    (let ((coding-system-for-write 'utf-8))
      (with-temp-buffer
        (insert-abbrev-table-description 'autocorrect-abbrev-table nil)
        (when (unencodable-char-position (point-min) (point-max) 'utf-8)
          (setq coding-system-for-write 'utf-8-emacs))
        (goto-char (point-min))
        (insert (format ";;-*-coding: %s;-*-\n\n" coding-system-for-write))
        (write-region nil nil autocorrect-abbrev-file 0)))
    (setq autocorrect-abbrev-table--saved-version
          (abbrev-table-get autocorrect-abbrev-table
                            :abbrev-table-modiff))))
#+end_src
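The parent-table mechanism used here can be tried in isolation. This is only a
sketch with two throwaway tables rather than the real global one, and
=my/demo-parent-lookup= is a hypothetical name. I'm assuming =:tangle no= keeps
it out of the tangled config.

#+begin_src emacs-lisp :tangle no
;; Hypothetical helper, for illustration only.
(defun my/demo-parent-lookup ()
  "Return the expansion of \"teh\" as found through a parent abbrev table."
  (let ((abbrevs-changed abbrevs-changed) ; don't flag real abbrevs as modified
        (spelling (make-abbrev-table))
        (main (make-abbrev-table)))
    (define-abbrev spelling "teh" "the")
    (abbrev-table-put main :parents (list spelling))
    ;; MAIN has no entry for "teh" itself; the lookup falls through to
    ;; its parent table.
    (abbrev-expansion "teh" (list main))))

(my/demo-parent-lookup)
;; => "the"
#+end_src

Keeping the spelling abbrevs in their own table means they can be saved,
reloaded, and pruned without touching any hand-written global abbrevs.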
|
||||
|
||||
Now we can write the update function that's run on a live spelling correction.
|
||||
|
||||
#+begin_src emacs-lisp
(defun autocorrect-record-correction (misspelling corrected)
  "Record the correction of MISSPELLING to CORRECTED."
  ;; Append the correction to the history file.
  (let ((write-region-inhibit-fsync t) ; Quicker writes
        (coding-system-for-write 'utf-8)
        (inhibit-message t))
    (write-region
     (concat misspelling " " corrected "\n") nil
     autocorrect-history-file t))
  ;; Only create an abbrev once the correction has been made often
  ;; enough, and only if it is the sole correction for this misspelling.
  (when (and (>= (autocorrect-update-table misspelling corrected)
                 autocorrect-count-threshold-session)
             (= (length (gethash misspelling autocorrect-record-table))
                1))
    (define-abbrev autocorrect-abbrev-table misspelling corrected)
    (message "Created new autocorrection: %s ⟶ %s"
             (propertize misspelling 'face 'warning)
             (propertize corrected 'face 'success))))
#+end_src
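The appending behaviour relies on ~write-region~ accepting a string in place of
a buffer region, with a non-nil =APPEND= argument. A minimal sketch of just
that part, using a temporary file instead of the real history file
(=my/demo-append-lines= is a hypothetical name, and =:tangle no= is assumed to
keep the block out of the tangled config):

#+begin_src emacs-lisp :tangle no
;; Hypothetical helper, for illustration only.
(defun my/demo-append-lines (lines)
  "Append each of LINES to a temporary file and return the file's contents."
  (let ((file (make-temp-file "autocorrect-demo"))
        (inhibit-message t)) ; silence the "Wrote ..." messages
    (unwind-protect
        (progn
          (dolist (line lines)
            ;; A string as the first argument is written as-is; the final
            ;; t appends instead of overwriting.
            (write-region (concat line "\n") nil file t))
          (with-temp-buffer
            (insert-file-contents file)
            (buffer-string)))
      (delete-file file))))

(my/demo-append-lines '("teh the" "adn and"))
;; => "teh the\nadn and\n"
#+end_src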
|
||||
|
||||
The only thing left to be done now is load the history file. I think I'd like to
split the actual reading and the abbrev generation into two parts though.
|
||||
|
||||
#+begin_src emacs-lisp
|
||||
(defun autocorrect--read-history ()
  "Read the history file into the correction table."
  (if (file-exists-p autocorrect-history-file)
      (with-temp-buffer
        (insert-file-contents autocorrect-history-file)
        (goto-char (point-min))
        (while (< (point) (point-max))
          (let ((pt (point))
                misspelling corrected)
            (setq misspelling
                  (and (forward-word)
                       (buffer-substring pt (point)))
                  pt (1+ (point))) ; Skip the separating space
            (setq corrected
                  (and (forward-word)
                       (buffer-substring pt (point))))
            (when (and misspelling corrected)
              (autocorrect-update-table misspelling corrected))
            (forward-line 1))))
    ;; No history yet: create the directory, then an empty history file.
    ;; The PARENTS argument creates missing parent directories and avoids
    ;; an error when the directory already exists.
    (make-directory (file-name-directory autocorrect-history-file) t)
    (write-region "" nil autocorrect-history-file)))
|
||||
|
||||
(defun autocorrect--remove-invalid-abbrevs ()
  "Ensure that all entries of the abbrev table are valid."
  (obarray-map
   (lambda (abbrev-sym)
     ;; `obarray-map' passes symbols, not strings.  The symbol with an
     ;; empty name holds the table's own properties, so skip it.
     (let ((misspelling (symbol-name abbrev-sym)))
       (unless (string= misspelling "")
         (let ((corrections (gethash misspelling autocorrect-record-table)))
           (unless (and (= (length corrections) 1)
                        (>= (cdar corrections)
                            autocorrect-count-threshold-history))
             ;; Defining an abbrev with a nil expansion undefines it.
             (define-abbrev autocorrect-abbrev-table misspelling nil))))))
   autocorrect-abbrev-table))
|
||||
|
||||
(defun autocorrect--create-history-abbrevs ()
  "Apply the history threshold to the current correction table."
  (maphash
   (lambda (misspelling corrections)
     (when (and (= (length corrections) 1)
                (>= (cdar corrections)
                    autocorrect-count-threshold-history))
       (unless (obarray-get autocorrect-abbrev-table misspelling)
         (define-abbrev autocorrect-abbrev-table
           misspelling (caar corrections)))))
   autocorrect-record-table))
|
||||
|
||||
(defun autocorrect-setup ()
  "Read and process the history file into abbrevs."
  (autocorrect--read-history)
  (autocorrect--setup-abbrevs)
  (autocorrect--remove-invalid-abbrevs)
  (autocorrect--create-history-abbrevs))
|
||||
#+end_src
|
||||
|
||||
We don't want to load the history eagerly, but we do want it available soon
after startup. I think an idle timer would be a good way to do this.
|
||||
|
||||
#+begin_src emacs-lisp
|
||||
(run-with-idle-timer 0.5 nil #'autocorrect-setup)
|
||||
#+end_src
|
||||
|
||||
-----
|
||||
|
||||
There we go, that's a complete self-managing abbrev-run frequent-misspelling
correction system. We can hook this up to Jinx by taking note of a helpful
[[https://github.com/minad/jinx/wiki#save-misspelling-and-correction-as-abbreviation][code snippet]]
in the Jinx wiki for immediately saving all corrected misspellings into
the global abbrev list.
|
||||
|
||||
#+begin_src emacs-lisp
(defun autocorrect-record-jinx-correction (overlay corrected)
  "Record that Jinx replaced the text under OVERLAY with CORRECTED."
  (let ((text
         (buffer-substring-no-properties
          (overlay-start overlay)
          (overlay-end overlay))))
    (autocorrect-record-correction text corrected)))

(advice-add 'jinx--correct-replace :before #'autocorrect-record-jinx-correction)
#+end_src
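The =:before= advice matters here: it runs with the same arguments as the
advised function and before the replacement happens, so the misspelled text is
still in the buffer to be read. A toy sketch of that ordering, where every
=my/demo-= name is hypothetical and a stand-in function takes the place of
Jinx (again assuming =:tangle no= keeps this out of the config):

#+begin_src emacs-lisp :tangle no
;; Hypothetical stand-ins, for illustration only.
(defvar my/demo-seen nil
  "Pairs recorded by `my/demo-record', newest first.")

(defun my/demo-replace (old new)
  "Stand-in for a replacement command; just describe the change."
  (format "%s -> %s" old new))

(defun my/demo-record (old new)
  "Record the OLD to NEW replacement in `my/demo-seen'."
  (push (cons old new) my/demo-seen))

;; The recorder is called with the same arguments, before the original.
(advice-add 'my/demo-replace :before #'my/demo-record)

(my/demo-replace "teh" "the") ; => "teh -> the"
my/demo-seen                  ; => (("teh" . "the"))
#+end_src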
|
||||
|
||||
**** Downloading dictionaries
|
||||
|
||||
Let's get a nice big dictionary from [[http://app.aspell.net/create][SCOWL Custom List/Dictionary Creator]] with
|
||||