Run this code: shell:pushd ~/Code/koreader-to-org && nix-shell shell.nix --run "fennel command.fnl" &
Locally-synced, transparent files can encourage diverse user-agents
This document serves as Literate Programming for a script to extract my notes from my KOReader sync directory and output them as org-mode[1]. These notes are then accessible and linkable from my org-roam Knowledge Base and can be directly integrated into my thinking, or indirectly through the creation of Topic Files for each book with SRS cards in them.
I've been using the highlights JSON export plugin in KOReader to get KOReader Notes extracted adjacent to my Archive. It's fine enough but a bit rickety, since on Android the notes exporter fucking crashes without a backtrace. But because I have my books directory managed by Syncthing instead of a Calibre or OPDS distribution, I can kludge around it: running KOReader in a container on My NixOS configuration and invoking the export plugin there, or abusing bind-mounts on my Linux system to trick the Android app. This is Fine, but it's oh so rickety and kludgey and basically doesn't work well enough.
But Syncthing, oh friendly Syncthing! Having files all in sync introduces the possibility of multiple user-agents. I don't even need to use KOReader to parse these!
I can start explaining by executing something like:
find ~/mobile-library/books -wholename "*sdr/metadata.epub.lua" -type f -print0 | head -z -n 1 | xargs -0 head -n1
-- we can read Lua syntax here!
"we can read Lua syntax here!" 😎
"And so can I" 😈
Let's write some Fennel. Fennel can read Lua syntax ;-) Fennel and Lua are of course in nixpkgs; I'll make a Nix Shell to set up an environment.
Nix Shell
The nix shell imports hashings, which uses nums, and creates a little environment with penlight, lua, hashings (defined below), and find.
{ pkgs ? import <nixpkgs> {}, ...}:
let
lua = pkgs.lua5_3;
myLuaPkgs = import ./pkgs.nix { inherit pkgs; inherit lua; };
myHashings = myLuaPkgs.hashings;
myLua = lua.withPackages (luapkgs: with luapkgs; [penlight http myHashings rapidjson]);
in
pkgs.mkShell {
  packages = [
    myLua.pkgs.fennel
    myLua
    pkgs.findutils # inotify-tools
  ];
}
Firing this up
It's easy enough to get a repl running with that nix-shell from this document, or from there with shell:nix-shell ~/Code/koreader-to-org/shell.nix. This relies on a patch to fennel-mode for now…
(setq fennel-program "nix-shell /home/rrix/Code/koreader-to-org/shell.nix --run \"fennel --repl\"")
(fennel-repl nil)
But with a REPL we can get a little bit weird. Without a repl I can execute fennel command.fnl to run all the code in this document.
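In a terminal that's the same invocation as the link at the top of this document:

pushd ~/Code/koreader-to-org && nix-shell shell.nix --run "fennel command.fnl"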
Libraries and Locals
This makes use of Penlight, a toolbox of Lua functions and patterns. It's included in the Lua distribution created by the Nix Shell above. It also uses lua-http to query Wallabag.
(local file (require :pl.file))
(local lapp (require :pl.lapp))
(local path (require :pl.path))
(local pretty (require :pl.pretty))
(local stringx (require :pl.stringx))
(local tablex (require :pl.tablex))
(local text (require :pl.text))
(local wallabag (require :wallabag))
(local api-prefix wallabag.api-prefix)
(local sha256 (require :hashings.sha256))
Including unpackaged dependency for sha256 sum in Nix Shell
I want to include two packages which aren't in nixpkgs… I should fix that, but for now here they are:
{ pkgs, lua }:
rec {
nums = lua.pkgs.buildLuarocksPackage {
pname = "nums";
version = "20130228-2";
knownRockspec = (pkgs.fetchurl {
url = "https://luarocks.org/manifests/user-none/lua-nums-scm-1.rockspec";
sha256 = "sha256-fxfcfiAgGGRhyCQZYYdUPs/WplMWVZH4QEPRlSW53uE=";
}).outPath;
src = pkgs.fetchFromGitHub {
repo = "lua-nums";
owner = "user-none";
rev = "fef161a940aaafdbb8d9c75fe073b8bb43152474";
sha256 = "sha256-coI8JHMx+6sikSndfbUIuo1jutHUnM3licI2s7I7fmQ=";
};
disabled = with lua; (lua.pkgs.luaOlder "5.3") || (lua.pkgs.luaAtLeast "5.5");
meta = {
homepage = "https://github.com/user-none/lua-nums";
description = "Pure Lua number library providing BigNum and fixed width unsigned integer types";
license.fullName = "MIT";
};
};
hashings = lua.pkgs.buildLuarocksPackage {
pname = "hashings";
version = "20130228-2";
knownRockspec = (pkgs.fetchurl {
url = "https://luarocks.org/manifests/user-none/lua-hashings-scm-1.rockspec";
sha256 = "sha256-SGx6kYhigTCmJQr/lFW6TARpM3na18M8lzgIDcOiCg0=";
}).outPath;
src = pkgs.fetchFromGitHub {
repo = "lua-hashings";
owner = "user-none";
rev = "89879fe79b6f3dc495c607494126ec9c3912b8e9";
sha256 = "sha256-/YagiUKAQKtHicsNE4amkHOJZvBEpDMs0qVjszkYnw4=";
};
disabled = with lua; (lua.pkgs.luaOlder "5.3") || (lua.pkgs.luaAtLeast "5.5");
propagatedBuildInputs = [ lua nums ];
meta = {
homepage = "https://github.com/user-none/lua-hashings";
description = "Pure Lua cryptographic hash library";
license.fullName = "MIT";
};
};
}
Collecting the Files
I didn't find an easy way to glob-match files like find can do (though pl.dir should get us close enough eventually; I'm lazy and hacking for now), so we'll cheat and use find 😜 with that command from earlier. This function collect-highlights gets passed a function converter-fn, which is called for each metadata.epub.lua (or whatever match-name is set to) with both the loaded table and the file name. The converter-fn used for the EPUBs' highlights may be different than for the PDFs', maybe… posix-egrep is used so that I can match things like (mobi|epub)…
(fn collect-highlights [base-path match-name converter-fn]
(let [proc (io.popen (.. "bash -c 'find " base-path " -regextype posix-egrep -regex \"" match-name "\" -type f'"))]
(local output [])
(each [line (proc:lines)]
(table.insert
output
(converter-fn ((loadfile line)) line)))
output))
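One detail worth calling out: loadfile only compiles the metadata file into a chunk, and calling that chunk evaluates its trailing return { ... } to hand back the table; hence the double parentheses above. A minimal repl sketch, with a hypothetical path:

;; loadfile compiles the Lua source; calling the chunk returns the table
(local md ((loadfile "/books/Novel.sdr/metadata.epub.lua")))
(print (?. md :doc_props :title))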
The stub data is returned in a structure with some simple metadata at (datastructures)… There are probably some other interesting doc properties I could add, should add, will add. Filename, too?
(fn init-datastructure [md-path from-md]
(let [props (?. from-md :doc_props)
stats (?. from-md :stats)]
{ "path" md-path
"authors" (.. (?. props :authors))
"title" (?. props :title)
"series" (?. props :series)
"md5" (or (?. stats :md5) (?. from-md :partial_md5_checksum))
"highlights" []}))
The entries are sorted by their datetime (for EPUBs) or by page number (for PDFs) within the chapter, but the chapters are lexically-sorted within the parent datastructure, so it'll all have to be sorted again when rendering… hmm. I'd like to sort EPUBs based on the XPath in the locs metadata, but lexically sorting those without parsing them is sort of infeasible, since [10] should sort after [2] but doesn't.
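A quick repl illustration of that problem:

;; string comparison goes character by character, so the "1" in "[10]"
;; loses to the "2" in "[2]" and the tenth fragment sorts first
(print (< "p[10]" "p[2]")) ;; => true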
(fn sort-by-datetime [first second]
(< (. first "datetime")
(. second "datetime")))
(fn sort-by-page-no [first second]
(< (. first "page")
(. second "page")))
NEXT swap collect-highlights to use pl.dir
Parsing Koreader metadata into a kludged-together tree structure
The metadata are basically the same, with some minor structure differences in the highlighter.
These could probably be a lot "prettier" if they were destructuring-based data-structure construction, but that can come in a future refactor. For now the goal is to move the data, basically, into a [book -> chapter -> list of highlights] topology mirroring the hierarchy of the org-mode document I want to dump.
process-one-book receives the metadata as directly loaded from the metadata.epub.lua file, and the file name it was loaded from. What we want to get out of this is an object restructured from the metadata's bookmarks key, structured by the chapter they are indexed by, ordered by the earliest highlight in that chapter. This is not so difficult, just finicky. Don't forget to return the output structure we are populating!
<<find-chapter-by-name>>
(fn process-one-book [metadata file-name]
(let [bookmarks (. metadata "bookmarks")]
(local output (init-datastructure file-name metadata)) ;; (ref:datastructures)
(print "processing..." (?. (?. metadata "doc_props") "title"))
(var sum (accumulate [total 0 _i1
inner-tbl (tablex.sortv bookmarks sort-by-datetime)] ;; (ref:iterate-bookmarks)
(do
<<process-bookmark-table>>
(+ total 1))))
(print "parsed" sum "highlights")
output))
find-chapter-by-name takes the sequential table of highlights and returns the index of the one with a matching :name property, using tablex. It's embedded in the process-one-book function instead of being hoisted up mostly because I am lazy and don't want to hoist all the requirements to global scope.
(fn find-chapter-by-name [highlights name]
(tablex.find_if highlights (lambda [v]
(= (. v :name) name))))
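For example:

;; find_if hands back the index of the first chapter table whose :name matches
(local chapters [{:name "Chapter 11"} {:name "Chapter 12"}])
(let [idx (find-chapter-by-name chapters "Chapter 12")]
  (print idx)) ;; => 2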
The meat of the process is a loop over the sorted bookmarks. In file:::table-sort-fns above there is a function which will blindly string-sort one of the bookmark entities based on the datetime property. Replacing this with a real datetime sort, or even a sort by the XPath of the note, is doable, but these datetimes already lexically sort well enough. And so in (iterate-bookmarks) they are sorted lexically by datetime, and then an "inner table" is processed. A chapter object is instantiated in file:::get-chapter-idx and each highlight is inserted into that chapter in file:::mk-highlight.
(let [chapter (or (?. inner-tbl "chapter") "")
chapter-idx-maybe
<<get-chapter-idx>>
]
<<mk-highlight>>
(table.insert
(. output.highlights chapter-idx-maybe)
(mk-highlight chapter inner-tbl)))
The chapter is either found or inserted anew. Note that instantiating this with a :name key makes this "technically" not a sequential table; care will have to be taken when iterating over it. It probably makes sense to stuff the name into a metatable eventually.
(or (find-chapter-by-name output.highlights chapter)
(do
(table.insert output.highlights {:name chapter})
(length output.highlights)))
And each highlight is constructed from the metadata table like so. The locations for PDF and EPUB are subtly different, since one has HTML XPaths and one has… well, rectangles. That will be an issue when we render them, but for now these can be "the same"-ish… Maybe just include a page-reference in the final render? These highlights get fed, ultimately, to file:::render-one-highlight.
(fn mk-highlight [chapter highlight-tbl]
{ "datetime" (?. highlight-tbl "datetime")
"chapter" chapter
"locs" [(?. highlight-tbl "pos0")
(?. highlight-tbl "pos1")]
"text" (or (?. highlight-tbl "text")
(?. highlight-tbl "notes"))})
A sample chapter will look like this according to Fennel's prettyprinter. PDFs will look similar, but the location elements will have more information:
{1 {:chapter "Chapter 11"
:datetime "2021-05-20 11:36:06"
:locs ["/body/DocFragment[154]/body/p[154]/text()[1].0"
"/body/DocFragment[154]/body/p[154]/text()[4].107"]
:text "The spy could hardly tightbeam the Uriel with the dangerous news that the crew of Father Captain de Soya’s Raphael had been going to confession too frequently, but that was precisely one of the causes of Liebler’s concern"}
2 {:chapter "Chapter 11"
:datetime "2021-05-20 13:04:57"
:locs ["/body/DocFragment[154]/body/p[158]/text()[1].234"
"/body/DocFragment[154]/body/p[158]/text()[1].389"]
:text "The crew did not like Hoag Liebler—he was used to being disliked by classmates and shipmates, it was the curse of his natural-born aristocracy, he knew—but"}
:name "Chapter 11"}
Rendering my Kludged Datastructure to Org
Each book comes out of process-one-book looking kind of like this:
{:authors "Nikole Hannah-Jones
The New York Times Magazine
Nikole Hannah-Jones
The New York Times Magazine"
:highlights {"Preface: Origins by Nikole Hannah-Jones" [{:chapter "Preface: Origins by Nikole Hannah-Jones"
:datetime "2022-01-17 00:02:03"
:locs ["/body/DocFragment[8]/body/div/p[12]/text()[1].284"
"/body/DocFragment[8]/body/div/p[12]/text()[3].6"]
:text "I was starting to figure out that the histories we learn in school or, more casually, through popular culture, monuments, and political speeches rarely teach us the facts but only certain facts"}
{:chapter "Preface: Origins by Nikole Hannah-Jones"
:datetime "2022-01-17 00:03:57"
:locs ["/body/DocFragment[8]/body/div/p[15]/text()[1].0"
"/body/DocFragment[8]/body/div/p[15]/a/text().1"]
:text "School curricula generally treat slavery as an aberration in a free society, and textbooks largely ignore the way that many prominent men, women, industries, and institutions profited from and protected slavery.6"}
{:chapter "Preface: Origins by Nikole Hannah-Jones"
:datetime "2022-01-17 00:04:46"
:locs ["/body/DocFragment[8]/body/div/p[16]/text()[3].1"
"/body/DocFragment[8]/body/div/p[16]/a[2]/text().1"]
:text "Even educators struggle with basic facts of history, the SPLC report found: only about half of U.S. teachers understand that enslavers dominated the presidency in the decades after the founding and would dominate the U.S. Supreme Court and the U.S. Senate until the Civil War.8"}]}
:title "The 1619 Project: A New Origin Story"}
It's the job of the rest of the doc to take the "tree" datastructure present in :highlights, along with the other metadata stored at the book level, and write those to disk as org-mode.
Aside: using Dead Simple Wallabag Fennel client to derive the URLs of read-it-later files
So with that tiny little dogshit API client, I can, uhh, capture a wallabag access token and make a single API request with it to extract links to documents sent from my browser or phone to KOReader via its built-in plugin. KOReader downloads the stories with a known file name [w-id_XXXX] where the XXXX is the ID of the story, so we fetch that and cram it into the datastructure here. munge-book-path is called below in render-one-book.
(local wallabag-token (->> ".wallabag"
(wallabag.load-client-credentials)
(wallabag.get-token (.. api-prefix "/oauth/v2/token"))))
(fn get-single-entry-from-wallabag [id]
(let [(headers body) (wallabag.api-req wallabag-token "GET" (.. api-prefix "/api/entries/" id ".json"))]
body))
(fn get-wallabag-url [id]
(. (get-single-entry-from-wallabag id) :url))
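Hypothetical usage, assuming valid credentials in .wallabag and 1234 standing in for a real story ID:

;; 1234 would be the ID embedded in a [w-id_1234] file name
(print (get-wallabag-url 1234))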
Templating the org documents
I use Penlight's pl.text.Template to render the tables that come out of process-one-book to strings, and smash them all together with the stringx module. These should be more-or-less self-evident based on the structure spat out by the collectors. Level 0 is the book, Level 1 is the chapter, Level 2 is the highlight within the chapter.
(local template (. text :Template))
<<render-one-highlight>>
<<render-one-chapter>>
<<render-one-book>>
We use noweb syntax to make sure the functions are defined in the correct order.
render-one-book
Each book is its own org-roam document. Books get their canonical ID from the md5 sum stored in the file.
(local book-tmpl (template
":PROPERTIES:
:ID: koreader-${md5}
:ROAM_REFS: \"${path}\"
:END:
#+TITLE: Notes from ${title}
#+AUTHORS: ${authors}
[[${path}][${path}]]
"))
(fn munge-book-path [book-md]
(let [path (. book-md :path)
(_ _ bag-id) (string.find path "%[w%-id_(%d+)%]")]
(if bag-id
(do
(print "bag id" bag-id)
(set book-md.path (get-wallabag-url bag-id)))
;; sickos.jpg
(set book-md.path (.. "file:"
(string.gsub path "sdr/metadata.([^.]+).lua" "%1"))))))
(fn render-one-book [book]
(let [authors (?. book "authors")
title (?. book "title")]
(munge-book-path book)
(.. (: book-tmpl :substitute book)
(stringx.join "\n"
(icollect [_i1 chapter-hls (pairs (?. book "highlights"))]
(render-one-chapter chapter-hls))))))
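The non-wallabag branch of munge-book-path just rewrites the sidecar path back into the book it sits alongside; a sketch with a hypothetical path:

;; the sdr sidecar path collapses back to the book file itself
(print (string.gsub "/books/Novel.sdr/metadata.epub.lua"
                    "sdr/metadata.([^.]+).lua" "%1"))
;; => /books/Novel.epub (plus gsub's substitution count, 1)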
render-one-chapter
A book contains multiple chapters; this is a level-1 heading basically just used for organization. We generate the Chapter template and cram a sequential table into it of each highlight's text. There is a surprise here: we copy the chapter table so that the :name added in (get-chapter-idx) in the collectors is removed before collecting each highlight.
(local chapter-tmpl (template
"* ${chapter}
"))
(fn render-one-chapter [chapter]
(let [name (?. chapter :name)
copy (tablex.deepcopy chapter)]
(tset copy :name nil)
(stringx.join "\n"
(tablex.insertvalues [(: chapter-tmpl :substitute {:chapter name})]
(icollect [_i2 hl (ipairs copy)]
(values (render-one-highlight hl)))))))
render-one-highlight
This uses the hashings library which I import at the top to generate a unique ID for each highlight, based on the datetime at which I took the note and the note itself. Should surely be high enough entropy + stable. I sure hope so!

The text goes through some gsub functions so that special characters are escaped. Other transformations of the notes could happen here.
(local highlight-tmpl (template
"** ${text}
:PROPERTIES:
:ID: ${id}
:LOC0: ${loc0}
:LOC1: ${loc1}
:PAGE: ${page}
:END:
[${datetime}]
"))
(fn render-one-highlight [hl]
(let [locs (. hl :locs)
digest (: sha256 :new (.. (. hl :text) (. hl :datetime)))
hexdigest (: digest :hexdigest)
fields (tablex.update
{ :datetime (. hl :datetime)
:text (-> (. hl :text)
(: :gsub "%$" "$ ")
(: :gsub "\n" "¶ ")
)
:id hexdigest }
(if (= (type (. locs 1)) "table")
{ :page (or (?. hl :page) (?. (. locs 1) :page) "")
:loc0 ""
:loc1 ""}
{ :page ""
:loc0 (or (. locs 1) "")
:loc1 (or (. locs 2) "")}))]
(: highlight-tmpl :substitute
fields)))
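A sketch of the ID derivation with hypothetical highlight data; the digest is stable so long as neither the text nor the datetime changes:

;; same text + datetime in, same org ID out
(let [digest (: sha256 :new (.. "some highlighted text" "2021-05-20 11:36:06"))]
  (print (: digest :hexdigest)))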
Rendering the templates to a file
Get one book by invoking process-one-book and then write it to file. The entire "flow" is built around this function and the metadata collectors below.
- Find all of the epubs and such by wrapping collect-highlights with the metadata collectors below
- Each metadata file is:
  - parsed into tables and fed into process-one-book to get a normalized tree structure
  - passed to render-one-book to create a string from the tree structure
  - checked for whether it was edited more recently than the notes and, if so, rendered in maybe-write-to-file below.
<<maybe-write-to-file>>
(fn write-one-book [book out-dir]
(let [book-path (?. book :path)
title (?. book :title)
notes-path (path.join out-dir (.. (: title :gsub "[:/\\ ]" "_") ".org"))]
(print "Maybe rendering" book-path)
(print "to" notes-path)
(maybe-write-to-file book-path
notes-path
(lambda []
(print "rendering" (accumulate [sum 0 i chap (ipairs (. book :highlights))]
(+ sum (length chap))) "highlights to string")
(render-one-book book)))))
(fn write-one-book-from-md [md filename out-dir]
(let [book (process-one-book md filename)]
(write-one-book book out-dir)
book))
Helper to Write books to Files
This is a helper function which accepts a source file for modification detection at (mod-test): if there either is no notes file, or it's older than the book metadata, it will call render-fn to return a string of the book notes, and then write that to dest-file. Passing in a lambda allows this to be a "lazy" helper; we only want to render the note template if the file has been modified.
(fn maybe-write-to-file [src-file dest-file render-fn]
(let [dest-file2 (path.expanduser dest-file)
book-mtime (file.modified_time src-file)
notes-mtime (file.modified_time dest-file2)]
(print "Targeting..." dest-file2)
(if (or (not notes-mtime) ;; (ref:mod-test)
(< notes-mtime book-mtime))
(match (io.output dest-file2)
(nil msg) (print "Could not write file... " msg)
f (let [text (render-fn)]
(io.write text)
(io.close f)
(print "Rendered..." (length text))))
(print "Skipping ... " dest-file2))))
DONE re-structure this document so that the books are parsed as they're rendered
Rather than having all of them parsed in Invoking the collectors and then rendered here, delay the parsing.
NEXT bubble modtime check further up into (write-one-book-from-md)
This would be so that the parsing of the notes does not even happen if the KOReader source has not been modified.
Koreader EPUB Metadata Collector
process-one-book is wrapped with collect-highlights into a simple interface which can be used to collect all the epubs' highlights into one big ol' table.
(fn collect-epub-highlights [books-path out-path]
(collect-highlights books-path ".*sdr/metadata.(epub|mobi).lua"
(lambda [book book-path]
(write-one-book-from-md book book-path out-path))))
Koreader PDF Metadata Collector
The PDF notes are a bit differently shaped than the epubs', but basically close enough… They're sorted by page number. Just gotta grab the correct files!
(fn collect-pdf-highlights [books-path out-path]
(collect-highlights books-path ".*sdr/metadata.pdf.lua"
(lambda [book book-path]
(write-one-book-from-md book book-path out-path))))
Invoking the collectors
So each of those collectors will go all the way down to the file-system. This is basically the "entrypoint" of the app:
(let [default-book-dir "~/mobile-library/"
default-note-dir "~/org/highlights/"
args (lapp (stringx.join
"\n"
["Parse koreader metadata files in to org-mode notes"
""
"-f,--file (optional string) only parse one metadata.*.lua file"
(.. "-e,--epubs (optional string) parse epubs from here, otherwise " default-book-dir)
(.. "-p,--pdfs (optional string) parse pdfs from here, otherwise " default-book-dir)
(.. "-n,--notes (optional string) write outputs to directory: " default-note-dir)
""]))
file-name (?. args :file)
file? (not (not file-name))
epub-dir (or (. args :epubs) default-book-dir)
pdf-dir (or (. args :pdfs) default-book-dir)
notes-dir (or (. args :notes) default-note-dir)]
(if file?
(write-one-book-from-md ((loadfile file-name)) file-name notes-dir)
(do
(collect-epub-highlights epub-dir notes-dir)
(collect-pdf-highlights pdf-dir notes-dir))))
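Under the nix-shell, invocations then look something like this (the metadata path is hypothetical):

fennel command.fnl
fennel command.fnl --file ~/mobile-library/books/Novel.sdr/metadata.epub.lua --notes ~/org/highlights/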
NEXT many entries don't have correct metadata…
NEXT generate a TOC
Note Index
(org-roam-db-sync)
(->>
(org-roam-db-query [:select [id title]
:from nodes"%/org/highlights/%")
:where (like file = level 0)])
:and (
(-map (pcase-lambda (`(,id ,title))format "- [[id:%s][%s]]" id title)))
("\n")) (s-join
Footnotes
[1] They're not included in my Archive though; that is, they're unlinked from the broader "sphere of thinking" and from my auto-complete database – they don't have IDs in the org-roam database, they aren't visible to Arroyo. I choose to work them in manually from the Note Index page. This is one of My Living Systems which I use to Remember Anything I Read.↩︎