Monday, August 2, 2021

Why do we need a database-based filesystem.

Note 1: there is a small TLMC-based demo at the end of this post, I recommend to check it out.
Note 2: these ideas are neither original nor new. I'm fairly sure there exist similar thoughts in written form from 20 years ago and at least three abandoned prototypes of this kind.

To illustrate the reasoning consider this toy example: suppose you are a food recipe collector. You started collecting various recipes and writing them down two years ago and was doing it since. You've managed to extract some original recipes from your aunt, mom and sister and so far you already have three broad categories of food: soup, tea and cake. Of course, you are keeping the recipes in digital form as files. As your library grows you need a good way to locate recipes instead of just dumping them all in one directory and combing through the entire list. What are your options?

1. Your disk filesystem suggests a natural way to do it. You create several directories, each corresponding to a certain mutually exclusive property of your files that interests you, for example a year you recorded the recipe. Once inside you choose another property, for example an author, then a food type and so on, creating a tree, until the end where you're left with only a small portion of files in every leaf. Uh, you immediately see not even one, but two problems, right?
First, you will need to create duplicate author directories for each year and then duplicate type directories for each author. Just 3 layers with 3 different properties per layer and you're already at 9x duplication ("\prod_{i=2}^{N} {n_i}" in the general case).
Second, this locks you in the order by which you can choose features which interest you. This scheme works as long as you go by year-author-type, but want to view all recipes by a particular author or all recipes of a certain food type? There is no easy way.

2. Abandon the idea of directories and store all files in one directory as a giant pile. However, give them names which encode all information about the metadata, such as "year=value1, author=value2, type=value3 recipename=value4.txt". Then use a search tool to choose whatever interests you. Unfortunately for Windows users the default search was semi-broken in XP and completely broken in 7 and onwards, but let's assume you use a working third-party one or some other OS where you have decent tools out of the box.
This is better, but still problems remain. If you want to filter on two or more features at once you need to remember their exact order in the name schema, else your naive search will return nothing (and an expression which would work with either order will get monstrous very quick). You also need to create new files with properties in correct order, but here you can just clone an empty template file with all fields set to nulls. If there are several authors for a single recipe then you need to put those authors in order in your query too. Can use alphabetic sort, sure, but these small inconveniences do add up.

3. Yes, finally, a proper solution. Just use an SQL database, dammit. Have a table for recipes, authors, dates and what not. Have as many of one-to-many or many-to-many relationships as reasonable to describe whatever you need. Then with a single query you get anything you wanted.

All our problems won't go away that easily.

First, there is a cost to set this up. The solution is obvious if you're familiar with computers, but a novice will have to learn a whole new query language with all relevant concepts.
Even more importantly, there is no integration with anything that expects to work with files/directories. If you have an application, a text editor, it wants to open files through a file open dialog, which reverts us back to the filesystem. Yes, you can drag and drop a file, but you can't meaningfully play previous/next track in the audio/video player. Saving a file is equally problematic, as you'll need to associate it with proper db structures later.

3+. FUSE to the rescue.
Have an application that performs a mapping between SQL tables/rows/statements and a filesystem: a path becomes a query and query results become files.
You don't need experience in developing kernel drivers to write a program that implements what you need. Userspace will cost you in performance, but it also means things are easier to pick apart and a mistake won't crash your system.

A demo.

I simply took the entire TLMC tree (snapshot of 11th of June... probably...) and looking just at its two upper layers pushed all available information into a database via a trivial regex. All fully automatic, of course, so entries with multiple circles are left as is (as compound pseudocircles), circle name hints, rename combos and decorating brackets are left as is, albums spanning multiple discs/codes are left as is and so on. You can download the resulting db and a FUSE module here: source , database , windows binary .
There is source code for all platforms and a prebuilt windows binary. Nothing is particularly interesting about the source (it is in fact quite messy), I'm just sharing it because a sole binary is always suspicious/requires extra trust and making an executable that works on all linuxes is a pain. Don't forget to grab the database.
To run this demo YOU WILL NEED : WinFSP on Windows, fuse3 packages on Linux, SQLite library for both platforms. Additionally, to build this you will need fairly recent gcc and friends (included in the latest MSYS2 on Windows), fuse3 library + headers (included in WinFSP on Windows).
How to run: mount your vfs somewhere as "mkdir /tmp/mnt && a.out /tmp/mnt" on linux or "a z:" on windows.
How does it work: when started the program examines the databases directory. For each database [with a fixed name] it looks at its table and column names and presents them as directories. Right now it assumes certain db structure (starlike acyclic schema), in general case it could be supplied some hints by the db author. When, by entering a directory, you choose a table.column pair the program searches the db for all distinct values of this combination and lists them as the next layer of directories. You can choose a value then, this creates a constraint on results. Next you can choose another table.column, see what options you have there and so on. Once there is enough information to uniquely identify an item the full implementation is supposed to look up its corresponding file(s) and provide them as a passthrough fs (mine just returns a dummy).
This is all good, however you could notice two annoyances:
- Picking properties one by one is slow, you might want to navigate the directory tree according to a preset combination of properties in the order that you remember without explicitly naming every single one along the way.
- Picking properties one by one is slow, you might want to see what's available in concise format where multiple properties are combined into a single choice.
Both of these are easy to solve. TLMC database has two extra invisible tables. A custom field is what TLMC always used for its album directory names : the "yyyy.mm.dd [code] albumname [event]" pattern. And a preset path is a path which uses circle name as its first component and a custom field for the second. Of course you can also define your own to create any view you'd want.
By the way, there are no files, this is just a demonstration of directory structure a database-based filesystem could have. Also, the whole thing is a quick hack job, so it's relatively fragile and could break if you try to hand-feed it deliberately crafted confusing inputs. It is also criminally slow to list some virtual directories in comparison to regular filesystems, but as long as it's faster than human reaction time you can afford to pay that price.

Q: So, after all this we can do the same thing we always could, just in a more complicated way?
A: Not exactly. We can do the same thing and much more. For example you can trivially add transliteration for circle names (or album names or event names, but I skipped that). Well, more like translation. Well, even more like whatever google thinks you think you wanted, but you get the idea, this was also automatic. Then instead of seeing a list of squiggly blobs [in case you can't read japanese] you'll see familiar english letters. Or you can filter on any property and browse in any order you'd like:
- Pick "event.name" first, "circle.name" second and see which circles participated in the event you chose.
- Pick "album.year" first, "event.name" second and see which events happened in the year you chose. Strictly speaking this abuses the fact that album release date almost always coincides with event date, but date fields could be added to the event table to disambiguate.
- Pick "circle.name" first, "album.year" second, "custom field" third and see all albums by a circle released in a certain year.
Okay, there's not a whole lot of combinations now, but it does add some convenient options. Pick "prepared path 1" to view albums in historical-TLMC-like way.

6 comments:

  1. Please add the following files on TLMC V.20...

    [SWING HOLIC]
    ・SOUND HOLIC VLOL.13~16 [SWHC-0013~0016]
    ・COOL JAZZ TOHO I・II [SWHC-1002・1003]

    ReplyDelete
  2. sql is cringe kys jk

    ReplyDelete
    Replies
    1. Was this supposed to be funny? Sadly, it was just utterly inane.

      Delete
  3. It's a shame I didn't look at this until now. I've actually been working on-and-off on kind of a similar idea on my own for the past several years. My goal was to organize a large collection of image files; for example, a collection of comic scans, or loose one-off drawings from various artists. Ideally you could type some kind of search query in a text box and the database would return all files/subcollections that match it. Instead of using a FUSE module I wrote a client-server system and a custom file browser.
    I've figured out that unless you have some reliable source of metadata, it quickly becomes unmanageable for an end-user. Right now my collection has half a million items. Imagine adding meaningful tags to each of those. Instead, what worked for me was to realize that well, unlike a normal file system the user is not ever going to be changing the contents of these files, so if the user "copies" this file from here to here, there's no reason to actually do such a thing. The files that are displayed in the browser can just be links to the data, and copying a link just creates a new link to the same datum. Then you can keep the old hierarchical system of files and directories, and have duplicate links to the same file in different directory structures that organize the same files by different criteria. E.g. one by author/genre/title, and another by genre/author/title. The only disadvantage being that it's not dynamic. Meaning, new entries have to be organized explicitly.
    I've been using my system for 18 months and it works quite well. In fact I'm working on version 2.0 already, where I'll apply some of what I learned on version 1.0.

    ReplyDelete
  4. Bruh does anyone her have the seed for this album? 東方想幽森雛 (Touhou Such A Mystery)

    ReplyDelete