The Arcology Garden

Local parsing of KOReader Notes to Org Roam


Run this code: shell:pushd ~/Code/koreader-to-org && nix-shell shell.nix --run "fennel command.fnl" &

Locally-synced, transparent files can encourage diverse user-agents

This document serves as Literate Programming for a script to extract my notes from my KOReader sync directory and output them as org-mode [1]. These notes are then accessible and linkable from my org-roam Knowledge Base and can be directly integrated into my thinking, or indirectly through the creation of Topic Files for each book with SRS cards in them.

I've been using the highlights JSON export plugin in KOReader to get KOReader Notes extracted adjacent to my Archive. It's fine enough but a bit rickety, since on Android the notes exporter fucking crashes without a backtrace. Because I have my books directory managed by Syncthing instead of a Calibre or OPDS distribution, I can kludge around that by running KOReader in a container on My NixOS configuration and running the export plugin there, or by abusing bind-mounts on my Linux system to trick the Android app. This is Fine, but it's oh so rickety and kludgey and basically doesn't work well enough.

But Syncthing, oh friendly Syncthing! Having files all in sync introduces the possibility of multiple user-agents. I don't even need to use KOReader to parse these!

I can start explaining by executing something like:

find ~/mobile-library/books -wholename "*sdr/metadata.epub.lua" -type f -print0 | head -z -n 1 | xargs -0 head -n1 

-- we can read Lua syntax here!

"we can read Lua syntax here!" 😎

"And so can I" 😈

Let's write some Fennel. Fennel can read Lua syntax ;-) Fennel and Lua are of course in nixpkgs, so I'll make a Nix Shell to set up an environment.

Nix Shell

The Nix shell imports hashings, which uses nums, and creates a little environment with Fennel, Lua, Penlight, lua-http, rapidjson, hashings (defined below), and find.

{ pkgs ? import <nixpkgs> {}, ...}:

let
  lua = pkgs.lua5_3;
  myLuaPkgs = import ./pkgs.nix { inherit pkgs; inherit lua; };
  myHashings = myLuaPkgs.hashings;
  myLua = lua.withPackages (luapkgs: with luapkgs; [penlight http myHashings rapidjson]);
in
pkgs.mkShell {
  packages = [
    myLua.pkgs.fennel
    myLua
    pkgs.findutils
    # inotify-tools
  ];
}

Firing this up

It's easy enough to get a repl running with that nix-shell from this document, or from there with shell:nix-shell ~/Code/koreader-to-org/shell.nix. This relies on a patch to fennel-mode for now…

(setq fennel-program "nix-shell /home/rrix/Code/koreader-to-org/shell.nix --run \"fennel --repl\"")
(fennel-repl nil)

But with a REPL we can get a little bit weird. Without a REPL I can run fennel command.fnl to execute all the code in this document.

Libraries and Locals

This makes use of Penlight, a toolbox of Lua functions and patterns. It's included in the Lua distribution created by the Nix Shell above. It also uses lua-http to query Wallabag.

(local file (require :pl.file))
(local lapp (require :pl.lapp))
(local path (require :pl.path))
(local pretty (require :pl.pretty))
(local stringx (require :pl.stringx))
(local tablex (require :pl.tablex))
(local text (require :pl.text))
(local wallabag (require :wallabag))
(local api-prefix wallabag.api-prefix)

(local sha256 (require :hashings.sha256))

Including unpackaged dependency for sha256 sum in Nix Shell

I want to include two packages which aren't in nixpkgs… I should fix that, but for now here they are:

{ pkgs, lua }:

rec {
  nums = lua.pkgs.buildLuarocksPackage {
    pname = "nums";
    version = "20130228-2";

    knownRockspec = (pkgs.fetchurl {
      url    = "https://luarocks.org/manifests/user-none/lua-nums-scm-1.rockspec";
      sha256 = "sha256-fxfcfiAgGGRhyCQZYYdUPs/WplMWVZH4QEPRlSW53uE=";
    }).outPath;

    src = pkgs.fetchFromGitHub {
      repo = "lua-nums";
      owner = "user-none";
      rev = "fef161a940aaafdbb8d9c75fe073b8bb43152474";
      sha256 = "sha256-coI8JHMx+6sikSndfbUIuo1jutHUnM3licI2s7I7fmQ=";
    };

    disabled = with lua; (lua.pkgs.luaOlder "5.3") || (lua.pkgs.luaAtLeast "5.5");

    meta = {
      homepage = "https://github.com/user-none/lua-nums";
      description = "Pure Lua number library providing BigNum and fixed width unsigned integer types";
      license.fullName = "MIT";
    };
  };

  hashings = lua.pkgs.buildLuarocksPackage {
    pname = "hashings";
    version = "20130228-2";

    knownRockspec = (pkgs.fetchurl {
      url    = "https://luarocks.org/manifests/user-none/lua-hashings-scm-1.rockspec";
      sha256 = "sha256-SGx6kYhigTCmJQr/lFW6TARpM3na18M8lzgIDcOiCg0=";
    }).outPath;

    src = pkgs.fetchFromGitHub {
      repo = "lua-hashings";
      owner = "user-none";
      rev = "89879fe79b6f3dc495c607494126ec9c3912b8e9";
      sha256 = "sha256-/YagiUKAQKtHicsNE4amkHOJZvBEpDMs0qVjszkYnw4=";
    };

    disabled = with lua; (lua.pkgs.luaOlder "5.3") || (lua.pkgs.luaAtLeast "5.5");
    propagatedBuildInputs = [ lua nums ];

    meta = {
      homepage = "https://github.com/user-none/lua-hashings";
      description = "Pure Lua cryptographic hash library";
      license.fullName = "MIT";
    };
  };
}

Collecting the Files

I didn't find an easy way to glob-match files like find can do (though pl.dir should get us close enough eventually, I'm lazy and hacking for now), so we'll cheat and use find 😜 with that command from earlier. The function collect-highlights gets passed a function converter-fn, which is called once per matching file (metadata.epub.lua, or whatever match-name matches) with two arguments: the loaded metadata table and the file name. The converter-fn used for the EPUBs' highlights may be different than for PDFs maybe…

posix-egrep is used so that I can match things like (mobi|epub)

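(fn collect-highlights [base-path match-name converter-fn]
  ;; Shell out to find. base-path is interpolated unquoted, so this assumes
  ;; paths without spaces or shell metacharacters.
  (let [proc (io.popen (.. "bash -c 'find " base-path " -regextype posix-egrep -regex \"" match-name "\" -type f'"))]
    (local output [])
    ;; Each output line names one metadata file: load it, call the chunk to get
    ;; the table it returns, and hand that plus the file name to converter-fn.
    (each [line (proc:lines)]
      (table.insert
       output
       (converter-fn ((loadfile line)) line)))
    (proc:close)
    output))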
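If the find call ever gets swapped for a pure-Lua directory walk, the (mobi|epub) alternation needs a replacement, because Lua patterns have no alternation. A minimal sketch, assuming the walk hands over one path string at a time: capture the extension with a Lua pattern and look it up in a set. The names wanted-extensions, sidecar-extension and wanted-sidecar? are hypothetical and not part of command.fnl.

```fennel
;; Hypothetical helpers, not part of command.fnl.
;; A set of extensions stands in for the (epub|mobi) alternation,
;; which Lua patterns cannot express.
(local wanted-extensions {:epub true :mobi true})

;; Return the book extension encoded in a sidecar path, or nil when the
;; path does not look like "<something>sdr/metadata.<ext>.lua".
(fn sidecar-extension [file-path]
  (string.match file-path "sdr/metadata%.([^./]+)%.lua$"))

;; True only for sidecar files whose extension is in the set above.
(fn wanted-sidecar? [file-path]
  (let [ext (sidecar-extension file-path)]
    (and (not= ext nil)
         (= (. wanted-extensions ext) true))))

;; The sample path is made up for illustration.
(print (wanted-sidecar? "/books/Endymion.sdr/metadata.epub.lua"))
```

Unlike the regex handed to find, the dots here are escaped, so metadata.epub.lua has to match literally.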

The stub data is returned in a structure with some simple metadata at (datastructures)… There are probably some other interesting doc properties I could add, should add, will add. Filename, too?

(fn init-datastructure [md-path from-md]
  (let [props (?. from-md :doc_props)
        stats (?. from-md :stats)]
    { "path" md-path
      "authors" (.. (?. props :authors))
      "title" (?. props :title)

      "series" (?. props :series)
      "md5" (or (?. stats :md5) (?. from-md :partial_md5_checksum))
      "highlights" []}))

The entries are sorted by their date time (for EPUBs) or by page number (for PDFs) within the chapter, but the chapters are lexically-sorted within the parent datastructure, so it'll all have to be sorted again when rendering… hmm. I'd like to sort EPUBs based on the XPath in the locs metadata, but lexically sorting those without parsing them is sort of infeasible, since as strings [10] sorts before [2].

(fn sort-by-datetime [first second]
  (< (. first "datetime")
     (. second "datetime")))

(fn sort-by-page-no [first second]
  (< (. first "page")
     (. second "page")))
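For the XPath idea, here is a rough sketch of what "parsing them" could look like: pull every run of digits out of the location string and compare the resulting number lists position by position. This is an assumption-heavy shortcut, since it ignores element names entirely and only looks at the numbers. xpath-key and xpath-before? are hypothetical names, and the sample locations are made up in the shape KOReader writes.

```fennel
;; Hypothetical helpers, not part of command.fnl.
;; Turn "/body/DocFragment[154]/body/p[12]/text()[1].0" into [154 12 1 0].
(fn xpath-key [xpath]
  (icollect [n (string.gmatch xpath "%d+")]
    (tonumber n)))

;; Compare two locations number by number; a shorter key sorts first
;; when every shared position is equal.
(fn xpath-before? [a b]
  (let [ka (xpath-key a)
        kb (xpath-key b)]
    (var answer nil)
    (for [i 1 (math.max (length ka) (length kb))]
      (when (= answer nil)
        (let [x (or (. ka i) -1)
              y (or (. kb i) -1)]
          (when (not= x y)
            (set answer (< x y))))))
    (or answer false)))

;; Made-up sample locations: a plain string sort would put [10] first.
(local locs ["/body/DocFragment[10]/body/p[1]/text()[1].0"
             "/body/DocFragment[2]/body/p[7]/text()[1].12"])
(table.sort locs xpath-before?)
(print (. locs 1))
```

A comparator like this could be handed to tablex.sortv in place of sort-by-datetime once it reads the pos0 field of each bookmark, but that wiring is left out here.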

NEXT swap collect-highlights to use pl.dir

Parsing Koreader metadata in to a kludged together tree structure

The metadata are basically the same, with some minor structure differences in the highlighter.

These could probably be a lot "prettier" if they were destructuring-based data-structure construction, but that can come in a future refactor. For now the goal is to move the data, basically in to a [book -> chapter -> list of highlights] topology mirroring the hierarchy of the org-mode document I want to dump.

process-one-book receives the metadata as directly loaded from the metadata.epub.lua file, and the file name it was loaded from. What we want to get out of this is an object restructured from the metadata's bookmarks key, structured by the chapter they are indexed by and ordered by the earliest highlight in that chapter. This is not so difficult, just finicky. Don't forget to return the output structure we are populating!

<<find-chapter-by-name>>

(fn process-one-book [metadata file-name]
  (let [bookmarks (. metadata "bookmarks")]
    (local output (init-datastructure file-name metadata)) ;; (ref:datastructures)

    (print "processing..." (?. (?. metadata "doc_props") "title"))
    (var sum (accumulate [total 0 _i1
                                 inner-tbl (tablex.sortv bookmarks sort-by-datetime)]       ;; (ref:iterate-bookmarks)
                      (do
                        <<process-bookmark-table>>
                        (+ total 1))))
    (print "parsed" sum "highlights")
    output))

find-chapter-by-name takes the sequential table of highlights and, using tablex, returns the index of the chapter with a matching :name property (or nil if there is none). It's embedded in the process-one-book function instead of being hoisted up mostly because I am lazy and don't want to hoist all the requirements to global scope.

(fn find-chapter-by-name [highlights name]
  (tablex.find_if highlights (lambda [v]
                               (= (. v :name) name))))

The meat of the process is a loop over the sorted bookmarks. In file:::table-sort-fns above there is a function which will blindly string sort one of the bookmark entities based on the datetime property. Replacing this with a real datetime sort or even by the XPath of the note is doable, but these datetimes already lexically sort well enough. And so in (iterate-bookmarks), they are sorted lexically by datetime and then an "inner table" is processed. A chapter object is instantiated in file:::get-chapter-idx and each highlight is inserted in to that chapter in file:::mk-highlight

(let [chapter (or (?. inner-tbl "chapter") "")
      chapter-idx-maybe
      <<get-chapter-idx>>
      ]

  <<mk-highlight>>

  (table.insert
   (. output.highlights chapter-idx-maybe)
   (mk-highlight chapter inner-tbl)))

The chapter is either found or inserted anew. Note that instantiating this with a :name key makes this "technically" not a sequential table; care will have to be taken when iterating over it. Probably makes sense to stuff the name in to a metatable eventually.

(or (find-chapter-by-name output.highlights chapter)
                           (do
                             (table.insert output.highlights {:name chapter})
                             (length output.highlights)))
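To see what "technically not a sequential table" means in practice, here is a tiny stand-alone sketch with made-up sample data shaped like one entry of output.highlights. count-visited is a hypothetical helper that counts how many entries an iterator factory walks over.

```fennel
;; Hypothetical sample data: a chapter table with a :name key plus
;; two highlights in the sequential part.
(local chapter {:name "Chapter 11"})
(table.insert chapter {:text "first highlight"})
(table.insert chapter {:text "second highlight"})

;; Count how many entries a given iterator factory (pairs or ipairs) visits.
(fn count-visited [iter tbl]
  (accumulate [n 0 _k _v (iter tbl)]
    (+ n 1)))

(print "length" (length chapter))               ; sequential part only
(print "ipairs" (count-visited ipairs chapter)) ; skips the :name key
(print "pairs" (count-visited pairs chapter))   ; visits :name as well
```

So length and ipairs are safe to use on a chapter, while anything built on pairs will also be handed the name string and has to cope with it.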

And each highlight is constructed from the metadata table like so; the locations for PDF and EPUB are subtly different since one has HTML XPaths and one has … well, rectangles. That will be an issue when we render them, but for now these can be "the same"-ish… Maybe just include a page-reference in the final render? These highlights get fed, ultimately, to file:::render-one-highlight

(fn mk-highlight [chapter highlight-tbl]
  { "datetime" (?. highlight-tbl "datetime")
    "chapter" chapter
    "locs" [(?. highlight-tbl "pos0")
            (?. highlight-tbl "pos1")] 
    "text" (or (?. highlight-tbl "text")
              (?. highlight-tbl "notes"))})

A sample chapter will look like this according to fennel's prettyprinter. PDFs will look similar but the location elements will have more information:

{1 {:chapter "Chapter 11"
    :datetime "2021-05-20 11:36:06"
    :locs ["/body/DocFragment[154]/body/p[154]/text()[1].0"
           "/body/DocFragment[154]/body/p[154]/text()[4].107"]
    :text "The spy could hardly tightbeam the Uriel with the dangerous news that the crew of Father Captain de Soya’s Raphael had been going to confession too frequently, but that was precisely one of the causes of Liebler’s concern"}
 2 {:chapter "Chapter 11"
    :datetime "2021-05-20 13:04:57"
    :locs ["/body/DocFragment[154]/body/p[158]/text()[1].234"
           "/body/DocFragment[154]/body/p[158]/text()[1].389"]
    :text "The crew did not like Hoag Liebler—he was used to being disliked by classmates and shipmates, it was the curse of his natural-born aristocracy, he knew—but"}
 :name "Chapter 11"}

Rendering my Kludged Datastructure to Org

Each book comes out of process-one-book looking kind of like this:

{:authors "Nikole Hannah-Jones
The New York Times Magazine
Nikole Hannah-Jones
The New York Times Magazine"
 :highlights {"Preface: Origins by Nikole Hannah-Jones" [{:chapter "Preface: Origins by Nikole Hannah-Jones"
                                                          :datetime "2022-01-17 00:02:03"
                                                          :locs ["/body/DocFragment[8]/body/div/p[12]/text()[1].284"
                                                                 "/body/DocFragment[8]/body/div/p[12]/text()[3].6"]
                                                          :text "I was starting to figure out that the histories we learn in school or, more casually, through popular culture, monuments, and political speeches rarely teach us the facts but only certain facts"}
                                                         {:chapter "Preface: Origins by Nikole Hannah-Jones"
                                                          :datetime "2022-01-17 00:03:57"
                                                          :locs ["/body/DocFragment[8]/body/div/p[15]/text()[1].0"
                                                                 "/body/DocFragment[8]/body/div/p[15]/a/text().1"]
                                                          :text "School curricula generally treat slavery as an aberration in a free society, and textbooks largely ignore the way that many prominent men, women, industries, and institutions profited from and protected slavery.6"}
                                                         {:chapter "Preface: Origins by Nikole Hannah-Jones"
                                                          :datetime "2022-01-17 00:04:46"
                                                          :locs ["/body/DocFragment[8]/body/div/p[16]/text()[3].1"
                                                                 "/body/DocFragment[8]/body/div/p[16]/a[2]/text().1"]
                                                          :text "Even educators struggle with basic facts of history, the SPLC report found: only about half of U.S. teachers understand that enslavers dominated the presidency in the decades after the founding and would dominate the U.S. Supreme Court and the U.S. Senate until the Civil War.8"}]}
 :title "The 1619 Project: A New Origin Story"}

It's the job of the rest of the doc to take the "tree" datastructure present in :highlights along with the other metadata stored at the book level and write those to disk as org-mode.

Aside: using Dead Simple Wallabag Fennel client to derive the URLs of read-it-later files

So with that tiny little dogshit API client, I can, uhh, capture a wallabag access token and make a single API request with it to extract links to documents sent from my browser or phone to KOReader via its built-in plugin. KOReader downloads the stories with a known file name [w-id_XXXX] where the XXXX is the ID of the story, so we fetch that and cram it in to the datastructure here.

get-wallabag-url is called below from munge-book-path, which render-one-book invokes.

(local wallabag-token (->> ".wallabag"
                           (wallabag.load-client-credentials)
                           (wallabag.get-token (.. api-prefix "/oauth/v2/token"))))

(fn get-single-entry-from-wallabag [id]
  (let [(headers body) (wallabag.api-req wallabag-token "GET" (.. api-prefix "/api/entries/" id ".json"))]
    body))

(fn get-wallabag-url [id]
  (. (get-single-entry-from-wallabag id) :url))

Templating the org documents

I use penlight's pl.text.Template to render the tables that come out of process-one-book to strings, and smash them all together with the stringx module. These should be more-or-less self-evident based on the structure spat out by the collectors. Level 0 is the book, Level 1 is the chapter, Level 2 is the highlight within the chapter.

(local template (. text :Template))

<<render-one-highlight>>
<<render-one-chapter>>
<<render-one-book>>

We use noweb syntax to make sure the functions are defined in the correct order.
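For anyone reading along without Penlight at hand, the ${name} substitution the templates rely on can be approximated in a few lines. This is a hypothetical stand-in for illustration, not pl.text.Template itself: the substitute function below is my own sketch, and filling missing keys with an empty string is a choice of the sketch, not a claim about Penlight's behaviour.

```fennel
;; Hypothetical stand-in for template substitution, not Penlight's code.
;; Replace every ${key} in tmpl with the matching entry of fields;
;; keys missing from fields become an empty string.
(fn substitute [tmpl fields]
  (let [(out _count)
        (string.gsub tmpl "%${([%w_]+)}"
                     (fn [key] (tostring (or (. fields key) ""))))]
    out))

;; Same shape as the chapter template used further down.
(print (substitute "* ${chapter}\n" {:chapter "Chapter 11"}))
```

Because the replacement is produced by a function, whatever text it returns is inserted verbatim, so a highlight containing % characters does not get reinterpreted by gsub.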

render-one-book

Each book is its own org-roam document.

Books get their canonical ID from the md5 sum stored in the file.

(local book-tmpl (template
                  ":PROPERTIES:
:ID: koreader-${md5}
:ROAM_REFS: \"${path}\"
:END:
#+TITLE: Notes from ${title}
#+AUTHORS: ${authors}
[[${path}][${path}]]
"))

(fn munge-book-path [book-md]
  (let [path (. book-md :path)
        (_ _ bag-id) (string.find path "%[w%-id_(%d+)%]")]
    (if bag-id
        (do
          (print "bag id" bag-id)
          (set book-md.path (get-wallabag-url bag-id)))
        ;; sickos.jpg
        (set book-md.path (.. "file:"
                                (string.gsub path "sdr/metadata.([^.]+).lua" "%1"))))))

(fn render-one-book [book]
  (let [authors (?. book "authors")
        title   (?. book "title")]
    (munge-book-path book)
    (.. (: book-tmpl :substitute book)
        (stringx.join "\n"
                      (icollect [_i1 chapter-hls (pairs (?. book "highlights"))]
                        (render-one-chapter chapter-hls))))))
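The two string tricks inside munge-book-path are easier to poke at in isolation. A minimal sketch, with hypothetical function names and made-up sample paths: one function maps a sidecar metadata path back to the book file next to it, the other pulls the Wallabag ID out of the [w-id_XXXX] marker. The dots in the pattern are escaped here, which is slightly stricter than the pattern in the document.

```fennel
;; Hypothetical helpers mirroring the patterns in munge-book-path.
;; "Book.sdr/metadata.epub.lua" collapses to "Book.epub": the capture
;; holds the extension and replaces the whole sidecar suffix.
(fn sidecar-to-book-path [md-path]
  (let [(book-path _count)
        (string.gsub md-path "sdr/metadata%.([^.]+)%.lua$" "%1")]
    book-path))

;; Return the digits inside a [w-id_1234] marker, or nil without one.
(fn wallabag-id [md-path]
  (string.match md-path "%[w%-id_(%d+)%]"))

;; Both sample paths are invented for illustration.
(print (sidecar-to-book-path "/books/Endymion.sdr/metadata.epub.lua"))
(print (wallabag-id "/wallabag/[w-id_1234] Some story.sdr/metadata.epub.lua"))
```

The extra let binding matters because string.gsub returns the new string and a replacement count; binding both keeps the count from leaking into whatever consumes the result.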

render-one-chapter

A book contains multiple chapters; this is a level-1 heading basically just used for organization. We generate the Chapter template and cram a sequential table in to it of each highlight's text. There is a surprise here: we copy the chapter table so that the :name added in (get-chapter-idx) in the collectors is removed before collecting each highlight.

(local chapter-tmpl (template
                     "* ${chapter}
"))

(fn render-one-chapter [chapter]
  (let [name (?. chapter :name)
        copy (tablex.deepcopy chapter)]
    (tset copy :name nil)
    (stringx.join "\n"
                  (tablex.insertvalues [(: chapter-tmpl :substitute {:chapter name})]
                                       (icollect [_i2 hl (ipairs copy)]
                                         (values (render-one-highlight hl)))))))

render-one-highlight

This uses the hashings library which I import at the top to generate a unique ID for each highlight based on the datetime which I took the note, and the note itself. Should surely be high enough entropy + stable. I sure hope so!

The text goes through some gsub functions so that special characters are escaped. Other transformations of the notes could happen here.

(local highlight-tmpl (template
                       "** ${text}
:PROPERTIES:
:ID: ${id}
:LOC0: ${loc0}
:LOC1: ${loc1}
:PAGE: ${page}
:END:
[${datetime}] 
"))

(fn render-one-highlight [hl]
  (let [locs (. hl :locs)
        digest (: sha256 :new (.. (. hl :text) (. hl :datetime)))
        hexdigest (: digest :hexdigest)
        fields (tablex.update
                { :datetime (. hl :datetime)
                  :text (-> (. hl :text)
                            (: :gsub "%$" "$ ")
                            (: :gsub "\n" "¶ ")
                            )
                  :id hexdigest }
                (if (= (type (. locs 1)) "table")
                    { :page (or (?. hl :page) (?. (. locs 1) :page) "")
                      :loc0 ""
                      :loc1 ""}
                    { :page ""
                      :loc0 (or (. locs 1) "")
                      :loc1 (or (. locs 2) "")}))]
    (: highlight-tmpl :substitute
       fields)))

Rendering the templates to a file

Get one book by invoking process-one-book and then write it to file. The entire "flow" is built around this function and the metadata collectors below.

<<maybe-write-to-file>>

(fn write-one-book [book out-dir]
  (let [book-path (?. book :path)
        title (?. book :title)
        notes-path (path.join out-dir (.. (: title :gsub "[:/\\ ]" "_") ".org"))]
    (print "Maybe rendering" book-path)
    (print "to" notes-path)
    (maybe-write-to-file book-path
                         notes-path
                         (lambda []
                           (print "rendering" (accumulate [sum 0 i chap (ipairs (. book :highlights))]
                                                (+ sum (length chap))) "highlights to string")
                           (render-one-book book)))))

(fn write-one-book-from-md [md filename out-dir]
  (let [book (process-one-book md filename)]
    (write-one-book book out-dir)
    book))

Helper to Write books to Files

This is a helper function which accepts a source file for modification detection at (mod-test) – if either there is no notes file, or it's older than the book metadata, it will call render-fn to return a string of the book notes, and then write that to dest-file. Passing in a lambda allows this to be a "lazy" helper: we only want to render the note template if the file has been modified.

(fn maybe-write-to-file [src-file dest-file render-fn]
  (let [dest-file2 (path.expanduser dest-file)
        book-mtime (file.modified_time src-file)
        notes-mtime (file.modified_time dest-file2)]
    (print "Targeting..." dest-file2)
    (if (or (not notes-mtime)              ;; (ref:mod-test)
            (< notes-mtime book-mtime))
        ;; io.open returns nil plus a message on failure, which the match
        ;; below expects; io.output would raise an error instead.
        (match (io.open dest-file2 :w)
          (nil msg) (print "Could not write file... " msg)
          f (let [text (render-fn)]
              (f:write text)
              (f:close)
              (print "Rendered..." (length text))))
        (print "Skipping ... " dest-file2))))
  1. DONE re-structure this document so that the books are parsed as they're rendered

    rather than having all of them parsed in Invoking the collectors, then rendered here, delay the parsing.

  2. NEXT bubble modtime check further up in to (write-one-book-from-md)

    this would be so that the parsing of the notes does not even happen if the koreader source has not been modified.
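Bubbling the check upward gets easier if the decision is split from the file I/O. A small sketch under that assumption, with hypothetical names (needs-render?, maybe-render, expensive-render) and plain numbers standing in for the mtimes that file.modified_time would return:

```fennel
;; Hypothetical helpers, not part of command.fnl.
;; Render when there is no notes file yet (nil mtime) or when the notes
;; are older than the book metadata. Assumes book-mtime is never nil.
(fn needs-render? [notes-mtime book-mtime]
  (or (= notes-mtime nil)
      (< notes-mtime book-mtime)))

;; Only call the (possibly expensive) render-fn when it is needed.
(fn maybe-render [notes-mtime book-mtime render-fn]
  (when (needs-render? notes-mtime book-mtime)
    (render-fn)))

;; A stand-in renderer that counts how often it actually runs.
(var calls 0)
(fn expensive-render []
  (set calls (+ calls 1))
  "* rendered notes")
```

With the predicate on its own, write-one-book-from-md could ask needs-render? before calling process-one-book at all, which is the point of the NEXT item above.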

Koreader EPUB Metadata Collector

process-one-book is wrapped with collect-highlights in to a simple interface which can be used to collect all the epubs' highlights in to one big ol' table.

(fn collect-epub-highlights [books-path out-path]
  (collect-highlights books-path ".*sdr/metadata.(epub|mobi).lua" 
                      (lambda [book book-path]
                        (write-one-book-from-md book book-path out-path))))

Koreader PDF Metadata Collector

The PDF notes are a bit differently shaped than the epubs', but basically close enough… They're sorted by page number. Just gotta grab the correct files!

(fn collect-pdf-highlights [books-path out-path]
  (collect-highlights books-path ".*sdr/metadata.pdf.lua" 
                      (lambda [book book-path]
                        (write-one-book-from-md book book-path out-path))))

Invoking the collectors

So each of those collectors will go all the way down to the file-system. This is basically the "entrypoint" of the app:

(let [default-book-dir "~/mobile-library/"
      default-note-dir "~/org/highlights/"
      args (lapp (stringx.join
                  "\n"
                  ["Parse koreader metadata files in to org-mode notes"
                   ""
                   "-f,--file (optional string) only parse one metadata.*.lua file"
                   (.. "-e,--epubs (optional string) parse epubs from here, otherwise " default-book-dir)
                   (.. "-p,--pdfs (optional string) parse pdfs from here, otherwise " default-book-dir)
                   (.. "-n,--notes (optional string) write outputs to directory: " default-note-dir)
                   ""]))
      file-name (?. args :file)
      file? (not (not file-name))
      epub-dir (or (. args :epubs) default-book-dir)
      pdf-dir (or (. args :pdfs) default-book-dir)
      notes-dir (or (. args :notes) default-note-dir)]
  (if file?
      (write-one-book-from-md ((loadfile file-name)) file-name notes-dir)
      (do
        (collect-epub-highlights epub-dir notes-dir)
        (collect-pdf-highlights pdf-dir notes-dir))))

NEXT many entries don't have correct metadata…

NEXT generate a TOC

Note Index

(org-roam-db-sync)
(->>
 (org-roam-db-query [:select [id title]
                     :from nodes
                     :where (like file "%/org/highlights/%")
                     :and (= level 0)])
 (-map (pcase-lambda (`(,id ,title))
         (format "- [[id:%s][%s]]" id title)))
 (s-join "\n"))

Footnotes


  1. They're not included in my Archive though, that is they're unlinked to the broader "sphere of thinking" and from my auto-complete database – they don't have IDs in the org-roam database, they aren't visible to Arroyo. I choose to work them in manually from the Note Index page. This is one of My Living Systems which I use to Remember Anything I Read.