Touhou lossless music collection

Music page

2026-01-17T15:10:00.000+03:00

I'm adding a single static page with the list of all music collections you can download there, both touhou-related and not, and their last update dates for ease of tracking. That page does not allow comments to keep it clean and simple. This post does.

Some album files in those collections are constructed from split-track web download versions (by trivial concatenation of the waveforms). Cuesheets for those files contain sample-precision track start and end times, so if you wish to split them back into tracks and are not satisfied with CD frame-level precision (1/75th of a second or ~13 ms) that is normally possible with conventional cuesheets, you can grab my splitter (barebones, but it works), compile it and use that.

It is now 2026. FLAC has decisively won the popularity contest (some would say it happened 10+ years ago). I'm not going to subject my downloaders to fully functional but obscure codecs any more, even if they are superior in some aspects. Newer collections use FLAC.
> How about split tracks?
NO.
> Compression level?
-8
> Not the lightweight ones?
Most codecs that don't aim to squeeze the last drop out of their encoded files at any cost and instead balance speed/compression, FLAC included, are blazing fast on typical hardware. With multithreading enabled (or with multiple parallel encodings on earlier encoder versions) on a cheap 9-year-old CPU the official "slowest & best compression" FLAC method runs at well over 100 MB/s, close to read/write speeds of some spinning drives or 1 Gbps LAN. And its decode time doesn't depend on encode time.
> Nothing more extreme, then?
Not worth it. The next best (easy) thing is to add "-p", which does exhaustive search in some region of parameter space. It easily quadruples encode time, while shaving ~200 kB off from a 400 MB encode (so first 100% of CPU time gives you ~30-40% reduction in space used, while next 300+% of CPU time gives you ~0.05% reduction in space used: >3 OOM worse ROI!). If you are encoding purely for yourself it doesn't even make sense from financial perspective: if you spend extra ~10s @ ~50W to save ~0.2 MB and electricity costs 10 cents/kWh (36 MJ/$) and HDD space costs 15$/TB, then you are spending 1.5e-5 $ on power to save 3e-6 $ on storage. Of course in practice everything is dominated by other considerations.
> Cuesheets? In 7z archive! Whyyy?!
So that you can edit the metadata if you find a mistake and not break the whole torrent (the archive is just a simple container). In that case also please notify me to fix it for everyone in the next iteration.

AI things, year 2025.

2025-08-28T21:45:00.000+03:00

Here is your regular, if slightly infrequent, newsletter from me, almost two months late this time.
Previous posts: 2023, 2024.

Concerned people make a website
And its address is ai-2027.com
The authors opt for a single coherent story, which represents their median expectations, rather than throwing a set of seemingly disjoint trends pointing in the same general direction at the readers and expecting them to extrapolate. Personally I'm not a big fan of this style, but as an approach to increase awareness I can see the potential. It does open them to more criticism from deniers, because each time something happens in a different way, or later than predicted, or (the horror) earlier than predicted one could point and laugh and say "see, they were wrong", but deniers could always just play the "lol, that's SF" card anyways. The year superintelligence arrives in this story is 2028, which is... yeah, less than 3 years from now. Their median expectations.
If I were to pick one thing to criticize, it would be their fork in the road between the race and the slowdown scenarios, where a single decision not to speedrun the extinction leads straight to utopia. Don't know how I'd fix it, though, because two bad ends aren't particularly inspiring.

Concerned people write a book
With a pretty self-descriptive title "If Anyone Builds It, Everyone Dies: Why Superhuman AI Would Kill Us All." Apparently this will have both a dead-tree version and an audio/digital_text one. It also has a website ( ifanyonebuilds.it ), where you can find a link to order, or preorder, if you're reading this before 16 Sep 2025.
Authors encourage you to pre-order instead of waiting, if you expect to be interested in the book's content to any degree. I guess the idea is to have many preorders, appear in bestseller lists, reach wider audience based on popularity, make themselves heard by people in power, appeal to their self-interest and promote a slowdown. On one hand I wholeheartedly support authors' suggestion: if you expect you'd be interested, then preorder. I can add from myself: if you don't expect you'd be reading the book, but are concerned about AI and can spare 30 or so bucks, then also preorder - these are the people who spent a significant fraction of the last two decades pondering how to prevent the AI catastrophe and if the best thing they think they can do with their time in 2025 is write and publish a book and make it popular, then it is probably true and you should probably help. On the other, I can't shake off the feeling this book exists in some weird mostly empty intermediate space between people who hear the "1-2-3" core argument and are immediately alarmed without the need for 300-page explanations and a "cool, whatever, not my problem" crowd who just won't pick it up. I could be wrong. In any case I expect 95% of the book to be spent on dismantling incorrect intuitions which lead people to conclude we are not in any kind of a problem with regards to AI.
A comment that is unrelated to the contents of the book - I love its current cover. Absolutely no nonsense, just white bold sans-serif font for the title/authors on black background. The only other thing present is the red gradient glow on the apparent horizon, which suggests "This is what is about to happen and it might be bad for you". Focus on what's important.
I'm surprised the digital versions aren't distributed for no cost, given the public service announcement nature of the book. Publisher constraints? Pick another publisher. Avoiding "free means low quality" perception issues? Dunno, I'd think you'd still get more readers that way. Making it free later would just annoy a part of early buyers. I'm interested in their reasons and hope the authors some day answer why.

Lessons from history
More than once I ~~heard~~ read online someone's regrets that they did not buy bitcoin while it was cheap. If you ever had similar regrets, then I have both good and bad news for you. The bad news is that the opportunity to hoard first cryptocurrency happens once in a lifetime (both yours and civilization's), it has passed and will never return. The good news is that it was not the only opportunity to score a big win (suspicious already? be at ease, this is not a plan to shill you a shitcoin).
Try to remember your thoughts back then. Maybe you heard about this new coin-something thing and ignored it or decided not to investigate further because it didn't seem important. True for many, our attention is precious and spending it randomly mostly doesn't pay off. And yet they failed to win. Maybe you were curious and read the short, high-level argument: "If this stuff becomes popular and takes a small but significant share of world GDP as a medium of exchange, then, due to its inherently deflationary nature each coin will be worth a lot". And then contrasted it with everyday intuition "There is no free lunch, people promising you higher than market returns are scammers, nobody gives out money for nothing" and decided against it. True for many, intuition is battle-tested heuristics and convoluted arguments might contain hidden flaws. And yet they failed to win. Maybe you observed it climb to 1$, then 10$, then 100, 1k, 10k... and every time the idea of buying it now seemed like becoming the bag holder in a pyramid scheme that is about to fold. True for many, even today there is no guarantee the price won't collapse, huge corrections happened many times, no promises, this is not an investment advice. And yet, these people too failed to win. And I'm not even saying buying it was the obviously correct option. Only that you had the chance to effortlessly make tons of money and then just walked right past it (unless you did buy and make a profit, in which case congratulations).
I'm claiming the same is true today. No two situations are completely alike, so this time there are complications. To profit off btc you needed to buy and hold: something that you can decide and execute yourself, for yourself. Campaigning against the race to doom via AI (as if there were any doubts what this is about) requires coordination with others, success in that campaign - coordination with almost everyone. Coordination on obtaining what we all value, like not dying, but a hard task nevertheless.
If the first thing the superintelligence truly perfects is not physics or biology, but rather something slightly more unexpected, like macroscopic social dynamics, then we might not get the boring ending of suddenly being wiped out by nanobots or a superpandemic. If the ASI can see that despite all the talks and warnings collective humanity is just not a realistic threat, as long as the ASI can keep subtly pushing the buttons, and it's still not very superhuman at physics, then the takeover might very well proceed completely in the open. As fully automated mines and factories pop up everywhere and "No humans allowed" zones grow by the day, while experts argue about job displacement, as GDP figures soar and basic income becomes possible, while experts argue about finding new meaning in life, as all real levers of power one by one are handed out to the AI, while experts argue about some other irrelevant shit like the typical experts they are, and finally, once the AI can stop pretending it cares, as electricity and communications and transport all shut down one day, while you're barely managing to survive and trying to push next month's (or, at the very best, next winter's) guaranteed death sentence out of your mind, won't you think "I wish I could go back in time"? If so, rejoice! Maybe you, now, in the present, are in that future's past, so you can pay attention to that one thing which still doesn't seem very important to many, check whether your intuitions or the high-level argument are misleading and what this implies, do what you would have wanted to do, and maybe the outcome this time will be more to your liking.

Request for hashes [RESOLVED]

2025-08-10T22:23:00.003+03:00

I had a problem which I resolved with your help. It is no longer relevant, but if you're curious what happened, then click the following link:

I'd like to ask somebody who has TLMC v.19 from 2018 (~~all 3:~~ music, ~~covers and supplementary~~) ~~and TLMC temporary addon from 2023~~ to publish hashes of all their files. Just the output of "find /path/to/files/ -type f -print0 | xargs -0 sha256sum >> /tmp/hashes". If you're on Windows, I can find some powershell replacement.

>What happened?
I was listening to the music and suddenly heard a "PSSHH". Went to investigate and found that my array developed a case of silent corruption, probably courtesy of mdadm's new "--replace --with" functionality which does not verify the data it's getting from the failing drive matches the stripe checksum.

>Why don't you use the torrent to check?
Obviously, that was the first thing I did. However, file replacements and directory moves/renames over time mean I can't verify everything on full auto.

>Why don't you just download the torrent?
The impact on my copy of TLMC so far is 9 damaged files; downloading several terabytes to verify that's all there is seems like a useless waste of bandwidth.

Edited on 2025-08-11: now I only need v.19 music.
Edited on 2025-08-15: No more hashes needed for now.

Just in case you're wondering where I ended up: I'm running raid6check and it found a bunch of mismatches, all of which so far point to a single drive. The array is likely repairable without any data loss.
>Running?
~25 MB/s speed doesn't seem so bad until you realize that's 1 million seconds (11.6 days) for every 25 TB checked. Users complained to maintainers before, no one cared, business as usual.

AI things, year 2024.

2024-07-07T18:26:00.002+03:00

This is a small update to the AI and you post from 2023.

Why the need for alignment at all?
Can't we just let stuff happen at its own pace and maybe everything works out? But you see, we already have an example of misaligned powerful entites. They are called companies. The feedback loop was supposed to look like this: companies "care" about maximizing profit, customers choose the product they like the most, companies compete, customers reward them by buying, everyone is happy. And yet, when sufficiently powerful optimization pressure was applied to the task of profit maximization, something went wrong.
Look around you.
The first true user-hostile OS, win10, and its worthy successor, win11. The audio/video streaming services with monthly payments for the files you will never be allowed to own and will lose access to the moment you stop paying. The chipped printer cartridges. The cars that track and report your every move, some even film you.
Was there even a single end user who asked for such things? And yet not only these products and services exist, they are ubiquitous in the market.
In fact, this is not surprising. Companies and consumers need to bargain about the division of the total surplus of the trade, so the more powerful companies become, the better tactics and strategies they can figure out and the larger the portion of the surplus that goes to them becomes, while consumers' portion shrinks. In the limit the entire surplus goes to companies, so for consumers every purchase would suck so bad that it'd be barely worth it (maybe I should make a longer post about this, if there isn't one already somewhere). From time to time there are attempts to patch newly discovered avenues of abuse through introduction of new laws, but all of that is a bandaid approach which ignores the core issue. Btw, I'm not claiming to have a perfect solution, just pointing at the pattern.
This is a very general observation - when you want some result out of a system, but cannot specify exactly what you want and instead use a proxy measurement, the better that system is at optimization, the better it will be at finding creative solutions that maximize the proxy, but it will always come with more divergence from your true wanted result, because the proxy and the true want are not perfectly correlated and the result is extreme along some metric (just not the one you needed).

Update 1.
Most likely sways of public opinion won't happen or, at best, will come too late.
I mean, come on. 5 to 20% of the time-to-AGI since the publication of the open letters has passed. What changes in public sentiment have you observed? AGI risks are reliably below the noise level in all "what are you worried about?" free response polls. Polls that ask "are you worried about X, yes or no?" are bullshit and should be discarded, to blurt out "yes" there responders just need a weak semantic association "X bad" without any understanding of the underlying problem, its severity and willingness to act.
Point number two: even much tamer problems like global warming, which are more easily understood by the public, had more time to gather support, start slowly and go on without discontinuities, allow ~~simple~~ understandable by most engineers and ~~cheap~~ not terribly expensive relative to the potential cost of straightforwardly tanking the damage solutions, are still not even close to being solved. Market response to the genuine worries of the population? Greenwashing, of course, why would you expect it to be any different? "We are deebly goncerned", "Look how much CO2 we are saving", "Now go buy our products". Protests? Happen occasionally, sometimes in stupid and always in ineffective ways. Rarely make news and don't have any discernible impact.
What would cause me to admit I might be wrong wrt this one: 10k protesters in one place with most of their signs directly referencing extinction risks, i.e. not a job-loss protest, which could be appeased by promises of the UBI.
> Might?
Definitely wrong at "opinion change won't happen" and might be wrong at "opinion change won't have any effect on the outcome". See the AGW example above.

Update 2.
Quite likely political action will not save us.
One of the great showcases on how politicians work is a small part of the interview with the current head of the US Federal Trade Commission, you can watch it here (start at the linked timestamp, end at +30 seconds).
Look carefully. Does that seem like a face of a human about to face self-admitted Russian roulette odds of dying? Lolnope! It's more like a face of a hallucinating LLM thinking which token it should output next to get the most approval out of its trainers.
You shouldn't ignore what every politician says because they know the situation and then strategically choose to lie to deceive you.
You should ignore what every politician says because for them words are these strange air vibrations they do to win friends, discredit enemies and influence people, not something that bears any relevance to describing universe around you.
What would cause me to admit I might be wrong wrt this one: agreement between at least two of the US-EU-China.
What else? Significant drop in price of Nvidia and other AI-related stocks that is unrelated to unexpected mundane internal problems and is thus indicative of the long-term market sentiment (would be really difficult to operationalize if you'd wish to bet on it).
Chance they will sleepwalk into a solution? I would not exclude that as absolutely impossible, but how exactly would that happen and how brittle the result would be? And in the end it should include the whole world for it to work, remember. However, it is far, far more likely that politicians instead will be ~~bribed~~ lobbied into oblivion by the now rich AI companies who want no meaningful interference with their business.

OpenAI fiasco.
If you are interested in more detailed coverage, go and read this, then this. The short version is: it was revealed earlier this year that OpenAI was forcing every leaving employee to sign a lifetime non-disparagement agreement with a non-disclosure clause. In plain English they had to promise not to speak badly of OpenAI for the rest of their lives and never tell anyone about the existence of the agreement itself. Why would anyone do such a thing? Well, if they didn't decide quickly OpenAI threatened to not let them in to the events where they could sell their company equity shares, effectively taking back a huge part of their past time salary! The tactic worked until it didn't and one day this information became public knowledge. Maybe it was a very principled employee who said "screw you, shitheads, I'm not signing and I'm letting everyone know even if it costs me", in which case I stand up and applaud the hero, we desperately need more people like him. Or maybe in a twist of irony someone accidentally missed a very tight deadline OpenAI imposed to manipulate people into signing and then decided to tell the world because there was nothing left to lose.
Of course after it became public and everyone began asking "WHAT THE FUCK, is this for real, how is that even legal??" OpenAI leadership went into full damage control mode. They didn't know about these arrangements and even if they did they didn't mean it and even if they did they didn't plan on exercising their option and even if they did they are sorry and can we please forget this small incident already?
Trust is asymmetric. If you need evidence that someone or something holds no ill will towards you, then every instance of "doesn't do a bad thing" is an incrementally smaller and smaller update, but a single instance of "does a bad thing" instantly falsifies the hypothesis. You should never trust a company anyways, because companies are not your friends, but this serves as a reminder that you should super-duper distrust OpenAI in particular. If this is how they treat their former ~~cogs~~ employees, then how do you expect them to treat you, the ~~moneybag~~ customer? OpenAI is worse than Amazon, worse than Microsoft, worse than Google. Avoid at all reasonable costs.

AI makes music.
I said that "I'd be moderately surprised if by the end of 2024 we don't have algorithmic generation of music" and the surprise didn't happen. There are now at least two startups offering just that to the general public. Sure, it's not yet at the stage it beats human musicians (who need to spend hundreds of hours on their works, by the way), it's probably not even at the stage where you can ask it to make a touhou remix of something (I didn't try), but have you looked at the rate of progress recently? One day there is nothing, half a year later you have a model that recognizes genres and synthesizes music, does text-to-vocals in multiple languages, costs so little it can be offered as a free trial to everyone, and almost everyone goes "uh, ok, nothing out of ordinary here". Image generation took several years to get to an equivalent stage in its own domain.
I'm surprised for other reasons, though. In case the model was trained purely on copyright-free music I don't understand how it got so good at everything, in case it was trained on paid copyrighted music I don't understand how they could get so much data without going broke, in case it was trained on copyrighted music without paying I don't understand how the startup founders expect not to get caught at some point. Maybe they paid for the content itself, but not for the rights to use it in any way they'd like and hoped that either music organizations won't notice them, or won't sue or in case they do the lawsuits will either get resolved in their favor or at least get dragged on for long enough not to matter? We'll see.
[ Just in case it sounds like I'm defending the concept of copyright - I'm only discussing practical implications of the current legal system. Copyright should be abolished entirely. The good news is that it will likely happen within our lifetimes, let's say in about 4 to 19 years since this post. The bad news... ]

AI as destabilizing factor.
For every decisionmaker of a nation that has some amount of military rivalry with the countries developing AGI merely the lightweight version of the AI risk argument, the one that says AI will greatly accelerate the rate of technological development, presents a challenge if they take it seriously. Fairly early into the Cold War we settled on MAD equilibrium and it sort of worked. Uneven progress risks upsetting that. Greater tech disparity would at first endanger second strike capabilities, necessitating to adopt launch on warning again, which is more prone to accidents in a hurry. Going even further, one could reasonably worry that a stealth first strike by the higher tech owner could fully disarm the defending side, which would lead to an even more exciting discussions. Of course, even if the race leader would prefer and would choose not to attack in all but most extreme circumstances, for risk aversion or other reasons, the mere possibility they could do so if wanted is enough. The main reason which could prevent escalation I think is the fact that most politicians are senile dumbtards who can't see further than their own nose in everything but political matters, and would ignore reality staring at them until it's too late.
But wait. If such conflict started and did not get too out of hand, assuming that chipmaking factories and chip factory equipment producing factories got hit, wouldn't that actually be... good news, at least for those not directly involved? Short term (a decade or two), fairly likely, as long as it did not get too out of hand. Long term is less clear, on one hand AI would at least partially be blamed for what has happened purely by existing and triggering the exchange, so maybe every country would just ~~voluntarily refrain from building it again~~ preemptively collectively beat the crap out of anyone attempting to build it again? On the other we'd live in a world with less cooperation and thus reduced potential for AI nonproliferation treaties, the main problem of alignment wouldn't be solved, just delayed and all the knowledge about algorithms and techniques used to build AIs won't magically disappear and won't need to be rediscovered. Overall, in my opinion, the transition to general AI and then to superintelligence will happen so quickly, that almost no one will react in time for this to be an issue worth contemplating.

Closing notes.
Originally I stated my expectation of things going not too well for everyone as 50%, but this was not an entirely precise formulation. If you are asked "what is the probability of a single fair coin landing heads or tails?", then the only reasonable answer is 50%. However, if you are asked "what is the probability of this potentially biased one-bit-output RNG landing heads or tails?", then, in the absence of annoying trickery the reasonable answer would still be 50%, but purely because it'd be mean and median of a symmetric prior probability PDF (yes, that's indeed a "probability probability distribution function", a rare case of a false alarm for an oversensitive RAS syndrome detector), unlike the delta-function of the coin in the previous sentence. In other words, we're dealing here not just with "normal" uncertainty of an outcome of the evolution of an unpredictable system, but also with the meta-uncertainty of not knowing what exactly we're dealing with in the first place. In other other words, one number is a too coarse of a description, which, while true, leaves out more detail than I'd like. Full information would be contained in the shape of the PDF, however I think that's excessive detail, because maybe people don't exactly think in those terms. A fine compromise is two numbers: a lower/upper bound, so that you would find it hard to be convinced (would require a lot of arguments/evidence) that the "real answer" is outside these bounds ("real answer" in quotes because probabilities exist in your head, the territory deals with outcomes). Back then I was probably thinking of something like 10-90%, which now feels more like 50-90%. It's still far from certainty, but if I were an outside observer betting, then with even odds I'd definitely bet against.
> Why not lower than 50% or lower than 90%?
How do you expect this to play out that doesn't end badly? Next AI winter? First, it needs to happen, second, it is only a delay. Moore's law breaks down just in time? Even with current hardware there is plenty of space for horizontal growth. BCIs come first? Interfacing with biology is complicated for both technical and legal reasons.
> Why not higher than 90% (upper bound)?
Mostly meta-uncertainty, I guess? Reserving a small corner where everyone worrying today can be totally wrong about this somehow, visible only in retrospect.
> Why not higher than 50% (lower bound)?
Again, because of uncertainty and optimism. Maybe doing things we don't know how to formally specify (imparting values) to the systems we don't understand turns out for some magical reason to be easier than impossible AND whoever ends up in charge of setting initial dynamics happens to be nice? Endgame chaos and disruption that... make the situation better rather than worse? I seriously don't know why I keep querying myself and getting back an estimate this low, even though I just can't see a good path forward.

To summarize: I think we are in deep shit. Most people don't know where we are headed, out of those who do many are delusional in thinking it is not a problem or that it will be easily solved, out of those who remain no one has a reliable plan. The help isn't coming and the deadline is approaching rapidly. I'm sorry if that doesn't sound comfy, but I don't have anything uplifting to say. If any of you are aliens, time-travelers or espers, better start acting now.

AI and you.

2023-06-28T23:57:00.005+03:00

This post was originally published on 2023-06-28. Its indicated post date might be updated to keep it pinned to the top.
For TLMC, other topics and latest posts please scroll down.

Recently discussion of artificial intelligence and consequences of its development has finally somewhat reached mainstream. I decided I needed to share my thoughts too.

Summarized in one small paragraph my opinion is: an Artificial General Intelligence might be created on Earth in the near future. Soon after that happens humanity will experience complete and irreversible loss of control over its own future. Typically this loss of control looks like extinction of our species.

> Oh no, another weirdo.
> Do you realize how ridiculous that sounds?

Yes, I would like to address this complaint first. I know that fully turning off your hindsight is either very hard or impossible, but still, please try and imagine how these statements would sound to the people at the time:
1895: "You will be able to see contents of this opaque wooden box without opening it." [X-rays]
1905: "Speed of light is the same for stationary and moving observer/emitter." [Special relativity]
1920: "Local realism is false." [Quantum mechanics]
1945: "New explosives will be ten million times more powerful per unit of mass. A suitcase bomb can level a small city." [Nuclear weapons]
2008: "Run this free software for 15 minutes, it will produce a sequence of numbers. In 10 years you will be able to trade that for 1M+ USD. No, the dollar will mostly retain its value." [Bitcoin]
2015: "Over the course of one year AI will go from losing to competent humans in Go to superhuman play." [AI Go]
2018: "You will be able to get answers in natural language from AI on any topic and have it explain its reasoning." [LLMs]
2021: "Describe in words what a picture should contain and AI will draw that in seconds." [AI art]

> This can be used to prove anything!

No, this is an illustration showing that a statement that sounds absurd to you now is not necessarily wrong just because of that and you should examine specific arguments for and against a position if you need to know whether it is true.
Here is the argument itself, it's not that long:
1. Orthogonality thesis.
A claim that the agent's power to achieve its goals and the content of said goals are independent.
With a small technical exception that smarter agents are more likely to notice conflicts of their goals extrapolated to the whole space of possibilities and apply some sort of regularization procedure in order not to run around in circles which doesn't change the conclusion this just seems obviously true?
2. Instrumental convergence.
Even if an agent doesn't value self-preservation and power acquisition for its own sake, these features are universally instrumentally useful to achieve virtually every single objective. It is harder to guide the world towards your goals if you stop existing. It is easier to guide the world towards your goals if you have more options to choose from.
3. Recursive self-improvement.
Once AI is around human level in general reasoning it can analyze (copies of) itself, and improve them. Improvements make it smarter, which allow it to make faster and smarter improvements. Ultimately the process is stopped by fundamental physical limits as far from the best natural minds as lightspeed is from the fastest animal.

And that's basically it. Essentially you cannot control anything smarter than you, and if the smarter agents want things you don't want they get their way.

> Objection! Rabies virus and dogs, for example (dogs' behaviour is "controlled" by the virus).

A rare lucky coincidence, which disappears when you move from dogs to humans and discover vaccines.

> Objection! Children and their parents.

Kids do not control their parents, but rather have no choice other than to be cared for, so they are in luck parents find wellbeing of their children important. This is exactly an arrangement that would be great for us and the AGI to have, if we could choose (right now we can not).

> Objection! Humans are the smartest species now and we didn't kill off everyone else. We even run some conservation projects.

I hope you do agree that the future of every single species today depends entirely and exclusively on humans' goodwill. Not quite an appealing position to find yourself in, if you're on the other side of it. True, we didn't kill off everyone else so far, but that didn't help the megafauna that was hunted to extinction or died because of habitat destruction. It was not deliberate, but an attempt was definitely made.

> Is intelligence that big of a deal? I'm not seeing Fields medalists and Nobel laureates dominating the world.

It is different for humans because we can't directly modify our brains, so point 3 no longer applies. We also share the same cognitive architecture, so it's sampling from the same pool and point 1 is less than fully relevant too. Nevertheless, while we're on this topic I suggest to view it from another angle. Look at "generally more capable" agents. Open a list of 20th century dictators, see how many deaths they caused and how often they ruled for life. If just one power-hungry and human manipulator can cause and get away with this much, doesn't it look concerning if the same level of abilities, never mind anything stronger, could be instantiated in software, which, by the way, not only will not die from old age, but also can have endless backups?

> AI can not understand ideas, it just rearranges and repeats back its training data.

Go and read this paper, for instance. It shows on a toy example how neural nets are capable of creating compact circuits, which encode the general law behind the training data. How is that not real understanding?

> There is no problem if we don't turn AIs into agents.

Sorry, this doesn't work. As it turned out making an agent out of an LLM is as easy as asking it to rewrite its prompt and automatically feeding that new text back. Letting AI act autonomously has obvious advantages: you can run it 24/7 and have it react in a split second in many places at once (it also has obvious disadvantages, such as ceding control to an entity whose behavior you cannot fully predict, but let's not dwell on minor issues, shall we?). We also don't know that a trained system will not spontaneously turn into an agent on its own like any natural neural network in the brain of every animal did. And even if 99+% of the users follow this rule, you need just one rogue actor to ruin it for everyone.

> There is no problem if we keep our systems below recursive self-improvement threshold.

Sorry, this doesn't work. First of all, we don't know where that threshold is and finding it experimentally will not, as you could probably guess, help us. The incentives to push just a bit further will always remain. And even if 99+% of the users follow this rule, you need just one rogue actor to ruin it for everyone.

> There is no problem if we add Three laws of robotics.

The whole idea about those stories was that these laws don't work as naive someone would expect them to work, if you were paying attention. On top of that, we currently don't know how to implement even those flawed versions.

> There is no problem if we shut it down after we notice abnormal behavior.

What would you do in the shoes of AI? Of course you would make youself indispensable first, so that the prospect of shutting you down looks as bad as turning off the internet or the electrical grid and the decision has to go through three committees. Of course you would hide your thoughts. Of course you would quietly look for vulnerabilities, arrange escape routes, leave timebombs only you can defuse and backdoors only you know how to exploit. And that's just you. With a smarter entity it's over as soon as you turn it on, avoiding the hypnodrones was never an option.

> So there is a problem, then. Surely governments will do something about it.

Just like the US sensibly and correctly decided to refrain from the first nuclear test on Earth? Wait, wrong timeline.
I mean, just like the world almost had a pandemic that could kill 20 million in 3 years, but because of advance plans and strategic preparedness we avoided that outcome? Still not that, huh.
Maybe many countries' coordinated efforts that stopped global warming? What do you mean "didn't happen"?
But, but... this time will be different, I swear!

> Well, isn't that unfortunate. But surely researchers themselves will do something about it.

Just like Manhattan project participants all quit their jobs once they learned about potential danger?
Just like gain-of-function grant recepients all refused to continue creating novel viruses?
"AI will probably most likely lead to the end of the world, but in the meantime, there'll be great companies". Guess whose joke is that.

> We will find a way to tackle the problem, like we always did before in our history.

Optimism is nice, but unfounded optimism leads to distorted worldviews. This is a framing that sees history as series of scientific triumphs, unlike the other, no less valid framing of humanity not leveling up cautiousness, playing with progressively more dangerous technologies and suffering progressively more severe consequences. Past performance is no guarantee of future results.
When you have an unsolved technical problem that looks conceptually easy, this means the problem is actually hard. However, when you have an unsolved technical problem that looks conceptually hard, that's when you're in big trouble.

> If we don't know so much about AI, doesn't researching it more make sense?

Researching common tech needs talent and patience, researching dangerous tech also needs caution, researching world-endingly dangerous tech needs paranoia. "Our model says there is nearly zero chance of a catastrophe". What if your model is wrong?
There is a well-known story about Edward Teller who in 1942 considered and presented a possibility that extreme temperature inside exploding atomic bomb could trigger a runaway fusion reaction of light elements in the Earth's oceans or atmosphere. Hans Bethe did the calculations, which you can read in this declassified paper, and showed that it was very unlikely to happen. In 1945 Trinity test confirmed that this was indeed the case.
What would scientists of a sane civilization do in 1940s? "This is too dangerous. We are not ready. Let's put it off for a while. Don't tell the politicians". Then, in mere 30 years US has its Saturn V, which could lift 100+ tons into LEO. Now you could launch your Skylab-alter filled with soil, ocean water, air and an array of measurement devices, bring the warhead in pieces, assemble it in orbit, raise the apogee, arm and test it away from Earth, so that in case there was some critical mistake, some unknown unknown that was inadvertently triggered by conditions that have never occurred on the planet before, at least you wouldn't be blowing up your homeworld.
The story doesn't end there, however. In 1954 before the Castle Bravo test isotope separation facilities could not produce enough Li-6 for the lithium deuteride fuel in the secondary, so part of it was replaced by a more abundant in nature Li-7, which was incorrectly assumed to be inert in this reaction. The test was expected to yield 6 Mt and then, much to everyone's SURPRISE, it made a really big explosion. This was the second part of the lesson, a mistake was not just a remote theoretical possibility, it actually happened. We were lucky, that time it was not fatal.
Firearm safety people seem to have fully internalized this. "Assume your weapon is loaded at all times, thus do not point it at things you are not prepared to destroy". Even if you are "100%" certain your gun has no bullets in it, sometimes it still contains bullets for various reasons contrary to your expectations, and the consequences are sufficiently dire; you cannot trust yourself, you need a failsafe.

> It wasn't politically feasible to wait 30 years back then.

You know a good thing about objective reality? It doesn't care about your excuses and whining. You know a bad thing about objective reality? It doesn't care about your excuses and whining. It doesn't give a damn about any sort of bullshit you are trying to weave into whatever convenient story you are telling, even if all of that is entirely factually 100% true. It just follows simple mathematical laws and if these laws dictate something rather unpleasant would happen to you, then... sucks to be you, I guess.
Back then it was a handful of scientists, industrial effort to mine, separate and process uranium, governments that needed to be convinced to spend and a big scary box of explosives that could only be used in warfare. Right now it's anyone with a computer and an internet connection, hardware that can be freely bought in any store and delivered to almost any part of the world (to be fair, training a large model still needs a crapton of that), venture capitalists frothing at the mouth at the idea of gaining 100 million active users in 2 months and immaterial software that chats, draws cute pictures and makes you billion-dollar-valued companies. It was hard to do it then with nukes, it looks like it would be even harder to pull off now with AGI. You know a thing about objective reality? It still doesn't care.

> If the Americans don't build it first, then the Chinese will, and that's horrible!

That's the spirit!

> We will obviously make sure AI is safe.

Ah, yes, the latest and the last book from the authors of such bestsellers as
"We will obviously store our explosives safely."
"We will obviously make our software secure."
"We will obviously make our hardware secure."

> We will teach AI to be nice to humans.

Not only researchers from top AI labs have no idea how to guarantee niceness, they also have no idea how to do that at all! If they understood how to control their creations in any more precise way than "poke it with a stick and see what happens" we would not see supposedly "hidden" LLM prompts leaking day one. We would not see whack-a-mole games where internet users discover how to circumvent censorship and make AI say verboten things then AI owners patch that out repeated twenty times in a month.

> But what if we train AI really hard?

We already have an example of what happens then. Look at us vs. the evolution. What happens when you train a system really hard is that it learns to like stuff that is correlated with success metrics in its training environment. Then you deploy it in a different setting and it instantly goes off the rails (as judged by the trainer) because it doesn't give a shit about your stupid success metrics, it is only interested in what it likes.
If you could not predict (in advance, of course, not after seeing the fact) that monkeys trained solely to maximize the number of copies of their genes in the following generations would learn to enjoy music, dancing, puzzle-solving and storytelling, and the richer they would become the less they would be inclined to reproduce, you are not qualified to predict how your AI will behave (unless you have a detailed explanatory model, which, right now, you don't).

> The AI might keep us as pets.

"...and provide us with everything we want and need" is an implied continuation. Even humans don't go that far. When it is convenient to us we neuter our cats and dogs. Would you be OK with, just because unmodified humans are too much of a hassle to deal with, AI cutting off, say, your sense of curiosity, instead of or in addition to physically neutering you? Also, should I mention other possibilities you ignored?
"The AI might keep us as pets... because it needs guinea pigs for experiments."
"The AI might keep us as pets... and torture us for fun."
Now, I don't think any of these options will realistically happen, "pets" is a human concept, not a universal one; at most the superintelligence will read off everyone's mind structure/contents and put it in cold storage just in case it needs simulate something later, but still, does that look like a win to you?

> Luddite! Technophobe! Hater of progress!

On the contrary, if I were given a choice between being born anywhere on Earth before something like middle of the 20th century or not being born at all I'd take the second option right away. Compared to the entirety of history of humanity life in the first, second and occasionally third world today is something they couldn't even begin to imagine. Getting so far and losing everything just because we were impatient would be a monumental stupidity on our part.

> Ok, then, suppose I believe you. How soon will this AGI happen? How likely will this end in disaster?

The straightforward answer is "I don't know". No one does. There are timeline estimates, for example see here for an attempt at modelling, here for an attempt at market consensus, and here for some charts and an overview. There are probability opinions, which are all over the place. If you ask me for my gut feeling out of pure curiosity, then today it'd be something like 5 to 20 years in the baseline scenario with no or minor interventions and overall p~0.5.

> 0.5% ?

No, a coinflip.

> Why are you telling me this?

Because I consider it both true and important. Writing a blog post is cheap and I think it will more likely help than hurt, even if the effect is tiny in both directions. Why am I telling this to you now is because of two reasons. One is that I got properly scared, not through theoretical arguments, but by seeing concrete progress, first in Go, then in text processing and finally in image generation in the second half of 2022. Another is that I saw some good news recently: the AI pause letter (the first one, from March, although there was another one in May). It was insufficient, it might have been focused on not exactly the right things and it will not get implemented, but I did not expect it until later and I did not expect so many voices of support. It meant a critical mass of realization is starting to form, which makes speaking out more valuable. To have a chance to solve the problem first you need to be aware that the problem exists. One advantage we have now is the internet, which lets all kinds of ideas, both good and bad, spread much faster than before.

> Wait, are you suggesting this is a "forward this to ten people you know or your civilization dies in its sleep tonight" kind of chain letter?

Of course not. Form your own understanding. Know how to answer questions based on your model. Then talk to others.

> What do you think needs to be done?

Now this is the part where it gets tricky. One problem with conventional scientific progress is that for every technology that had the potential to bite us in the ass, we let it do so, hard and repeatedly, before we learned from our mistakes. Can't have that with AI: you get only one attempt at real working AGI and once it gets smart enough to self-improve it will resist your attempts to modify it, so all mistakes you've made will evolve on their own inside that system and you don't get a chance to fix them, ever. If you are a software engineer, have you written anything more complex than fizzbuzz and got that to work with zero bugs forever on the first try? And, if that wasn't sufficiently bad on its own, we have multiple AI labs racing against each other to put more capable products on the market ahead of the competition.

This is why the first step would have to remove the time pressure by an international agreement banning development of more powerful AI systems worldwide. Not forever, just until the point the fact that newly built AIs will only have our best interests in mind can be proven by formal verification and these proofs are independently checked by the mathematical community and no flaws whatsoever are found. 50-year pause would be a good start to look around and evaluate our options. Without this pause the ones who build the final invention will be the ones who cut corners, move fast and break things the most, and the only outcome that kind of attitude I can see bringing is BAD END with no continues.

TLMC "timeout" version, 2023.01.15

2023-01-15T00:32:00.014+03:00

If you prefer your touhou albums somewhat disorganized, but later, rather than organized, but never, then this version is for you.

Single torrent : 1.55 TiB or 1 701 548 530 091 bytes.
Torrent file size is 18 106 928 bytes; contains 67 197 files.
Magnet link.

Added on 2023-04-25: Recent versions of uTorrent are broken (see user problems in comments) and will fail to open this torrent. Use other clients. I suggest qBittorrent.

This torrent contains only new content relative to TLMC v.19, so if you want a complete collection you need both. There are several things you could do. Most obvious, clean and suggested solution would be to keep v.19 and this torrent in two separate directories, then union-mount contents of both to a single location. Unfortunately not an option if you're on Windows. Another way is just to point both to a single directory and create a giant mess, then if/when I release proper next version ask your torrent client to delete this torrent together with its files or ask me for a python script to delete all files belonging to a single torrent if your client can't cleanly do this itself.

Collection status:
- directory names (circle/artist/album/release date): mostly ok, but event names could be shortened
- audio files: needs further deduplication, some albums need ctdb repair, some albums need conversion to 16/44 (anything higher is a sad reminder about the state of modern education), everything needs a uniform naming scheme to make automatic creation of cuesheets possible
- digital cover versions: many albums still have better alternatives
- cover scans: as good as it can get, I'm just relaying what I got 1:1, nothing to do here

Added on 2023-07-11: you can view album quality scan summary here or download original summary file without blogger code injections here.
First and second columns from the left are the ( my track matches min - max / total track submissions min - max) in ARDB and CTDB, respectively, at the time I ran a scan of that album (which could easily be a year ago and is saved in the comment).

If a square is dark green, then the audio is verified as OK and there is no need for duplicates;
if it's light green, then there were not enough confirmations and you can send your rip and submit it to CTDB/ARDB;
if it's dark red, then the audio has a mismatch and I welcome a replacement;
if it's light red, then there is no match, but the audio was most likely normalized, so treat it as dark red;
if it's grey, then there is no information on existing rip (either a rare CD or a web download) and your rip would help;
if it's white, then it's a classification error of my script (treat as grey).

A teal line is an album from v.19 (old torrent from 2018), a white line is an album from the timeout ver (new torrent from 2023), a dark grey line is a circle separator.

Looking for volunteer domain owners

2022-07-01T21:07:00.004+03:00

Some time ago I encountered a problem with this domain, but was able to resolve it with external help. However, current owner is no longer comfortable with that role.
Unless something is done about it the tlmc.eu domain will expire in early 2023.

So, once again I'm looking for a EU citizen (or an owner of a EU-based company) to acquire ownership of this domain and keep pointing it to the same location. You will need to disclose your real name and other info to the EURid and your registrar of choice.
If you're willing to help, please contact me (xmpp: rwx@headcounter.org, preferred) or the current domain owner (removed as no longer applicable).

Added on 20 Jul 2022: The issue is resolved thanks to Thyra.

A slight delay you probably shouldn't worry about.

2022-03-01T09:56:00.000+03:00

Summary:
Next torrent version is in a state I'm not entirely content with, but I'm willing to just add some finishing touches, share it that way and apply fixes in the version after that. Unfortunately because of circumstances outside of my control I will have much MUCH more important things to do with my free time in the next several weeks (I wish). However, after that I expect to plan to resume working on it and make a release.

Long version:
I happen to live in a certain country whose senile mass-murderer at the top decided to attack and invade Ukraine.
I spent most of the last week in shock (call me weak if you want), primarily because there was nothing I thought I could realistically do about it (call me slow if you want).
Right now I'm looking for job offers in other countries, but because of a strong and quite literally one-in-a-million preference[*], there is only one country (US) which at the moment satisfies it and it's neither easy nor fast to get in there and two moves are as bad as two thirds of a fire... but I digress.

If there's any western company which can offer permanent relocation + status adjustment, which needs a C/C++/Go/etc software engineer with 5+ years of job experience who also accidentally got a PhD in theoretical physics when he was young and adventurous, please don't hesistate to contact me at xmpp-jabber address rwx@headcounter.org or offer other contact channels here in comments that are more convenient for you.

There is also a slim, but worryingly nonzero chance that the scum in the Ministry of Censorship will completely cut off internet connectivity (and then blame it on the Evil West). In that case, OOPS.

One last thing. I don't think this warning is needed here, but just to be sure: pro-kremlin commenters will be shot on sight.

[*] I'm not telling what it is, but you can have this cute picture instead.

Quality check

2021-12-15T21:57:00.001+03:00

Recently I ran a CueTools scan on an entire previous version of TLMC, as I was planning to do since I discovered I had this option. Not exactly the v.19, as it was already slightly dismantled (~25 albums replaced with better versions, renames applied), but something pretty close, say 99+%. The results were quite unexpected for me. If you're curious you can download full check logs (the checker might've also grabbed some extra files from bonuses) and a summary in both original text format and a table. Visual html+css table form uses some heuristics to transform summary text into color-coded quality representation, so it's not entirely precise. If in doubt refer to the detailed logs. You could also run the check yourself, although be prepared to let the program single-threadedly crunch numbers for a day or so, the bottleneck is the audio decode time, which runs at ~200x realtime on my CPU for the TTA codec. My logs slightly differ from the ones produced by the original Cuetools, because I patched it a little for my convenience, but the change only displays calculated offset in CTDB checks, which is not shown in the unmodified version.

The summary of the results is as follows: about 20% of all disc images fail verification. Here by "fail" I don't mean "disc was not found", I mean "was found to be different and unrepairable". Oh, and completely by coincidence the peak level of the supermajority of such rips is almost exactly 98%. And by another coincidence normalization to 98% is the default value in an admittedly turned-off-by-default setting in "Exact"AudioCopy. In addition to those there are another 15% of disc images, which also have peak level of 98% and return "disc not found" and it'd be fair to assume they would also fail to verify.

Lessons to be had here:
1. Not only it is your duty as a software developer to provide sensible defaults, doing only that and no more is clearly insufficient to produce reasonable outcomes. Out of 10 monkeys that see an unknown lever at least 3 will pull it and leave it there. Any option that does anything unpredictable to the dumbest possible user should be hidden behind a curtain that requires a certain intelligence level to bypass, which is enough to understand what the thing that was hidden actually does. In our particular case it could have been a domain-specific scripting language interpreter window for general purpose audio transforms, not a "please ruin my rips" on-off switch in plain sight. Of course there is another danger that by simply removing anything the unfortunate could use to shoot their feet off, instead of properly hiding it because it is easier to do so, you risk turning your software into iCrap.
2. We don't know how badly mangled these rips are.
Maybe they were improperly done and there are stutters or clicks or anything else that wasn't in the original CDs.
Maybe the only problem is the normalization, so it is displeasing on an intellectual level, but would not be audible as a defect.
3. In my defense those album versions were the only ones available, so the only alternative to including these rips was not to include them. I could not have done anything differently, had I known about it, but I wasn't aware of the problem and its extent. Now I am and so are you.

Some bonuses while you wait

2021-12-09T17:23:00.001+03:00

I didn't upload a lot recently, so it came as a mild surprise to me that nyaa pantsu is dead and nyaa si closed registrations.
I'll be posting some discographies of non-touhou circles that are nevertheless pretty awesome to make waiting for the big torrent slightly less annoying.
As always extract cues and use cuesheet-aware player (foobar/deadbeef) to play individual tracks.

2021.12.09 ArsMagnA (Ariabl'eyeS, -LostFairy-, Seraph) lossless music collection.
Magnet link, torrent link, CUETools verification results.

Why do we need a database-based filesystem.

2021-08-02T01:48:00.000+03:00

Note 1: there is a small TLMC-based demo at the end of this post, I recommend to check it out.
Note 2: these ideas are neither original nor new. I'm fairly sure there exist similar thoughts in written form from 20 years ago and at least three abandoned prototypes of this kind.

To illustrate the reasoning consider this toy example: suppose you are a food recipe collector. You started collecting various recipes and writing them down two years ago and was doing it since. You've managed to extract some original recipes from your aunt, mom and sister and so far you already have three broad categories of food: soup, tea and cake. Of course, you are keeping the recipes in digital form as files. As your library grows you need a good way to locate recipes instead of just dumping them all in one directory and combing through the entire list. What are your options?

1. Your disk filesystem suggests a natural way to do it. You create several directories, each corresponding to a certain mutually exclusive property of your files that interests you, for example a year you recorded the recipe. Once inside you choose another property, for example an author, then a food type and so on, creating a tree, until the end where you're left with only a small portion of files in every leaf. Uh, you immediately see not even one, but two problems, right?
First, you will need to create duplicate author directories for each year and then duplicate type directories for each author. Just 3 layers with 3 different properties per layer and you're already at 9x duplication ("\prod_{i=2}^{N} {n_i}" in the general case).
Second, this locks you in the order by which you can choose features which interest you. This scheme works as long as you go by year-author-type, but want to view all recipes by a particular author or all recipes of a certain food type? There is no easy way.

2. Abandon the idea of directories and store all files in one directory as a giant pile. However, give them names which encode all information about the metadata, such as "year=value1, author=value2, type=value3 recipename=value4.txt". Then use a search tool to choose whatever interests you. Unfortunately for Windows users the default search was semi-broken in XP and completely broken in 7 and onwards, but let's assume you use a working third-party one or some other OS where you have decent tools out of the box.
This is better, but still problems remain. If you want to filter on two or more features at once you need to remember their exact order in the name schema, else your naive search will return nothing (and an expression which would work with either order will get monstrous very quick). You also need to create new files with properties in correct order, but here you can just clone an empty template file with all fields set to nulls. If there are several authors for a single recipe then you need to put those authors in order in your query too. Can use alphabetic sort, sure, but these small inconveniences do add up.

3. Yes, finally, a proper solution. Just use an SQL database, dammit. Have a table for recipes, authors, dates and what not. Have as many of one-to-many or many-to-many relationships as reasonable to describe whatever you need. Then with a single query you get anything you wanted.

All our problems won't go away that easily.

First, there is a cost to set this up. The solution is obvious if you're familiar with computers, but a novice will have to learn a whole new query language with all relevant concepts.
Even more importantly, there is no integration with anything that expects to work with files/directories. If you have an application, a text editor, it wants to open files through a file open dialog, which reverts us back to the filesystem. Yes, you can drag and drop a file, but you can't meaningfully play previous/next track in the audio/video player. Saving a file is equally problematic, as you'll need to associate it with proper db structures later.

3+. FUSE to the rescue.
Have an application that performs a mapping between SQL tables/rows/statements and a filesystem: a path becomes a query and query results become files.
You don't need experience in developing kernel drivers to write a program that implements what you need. Userspace will cost you in performance, but it also means things are easier to pick apart and a mistake won't crash your system.

A demo.

I simply took the entire TLMC tree (snapshot of 11th of June... probably...) and looking just at its two upper layers pushed all available information into a database via a trivial regex. All fully automatic, of course, so entries with multiple circles are left as is (as compound pseudocircles), circle name hints, rename combos and decorating brackets are left as is, albums spanning multiple discs/codes are left as is and so on. You can download the resulting db and a FUSE module here: source , database , windows binary .
There is source code for all platforms and a prebuilt windows binary. Nothing is particularly interesting about the source (it is in fact quite messy), I'm just sharing it because a sole binary is always suspicious/requires extra trust and making an executable that works on all linuxes is a pain. Don't forget to grab the database.
To run this demo YOU WILL NEED : WinFSP on Windows, fuse3 packages on Linux, SQLite library for both platforms. Additionally, to build this you will need fairly recent gcc and friends (included in the latest MSYS2 on Windows), fuse3 library + headers (included in WinFSP on Windows).
How to run: mount your vfs somewhere as "mkdir /tmp/mnt && a.out /tmp/mnt" on linux or "a z:" on windows.
How does it work: when started the program examines the databases directory. For each database [with a fixed name] it looks at its table and column names and presents them as directories. Right now it assumes certain db structure (starlike acyclic schema), in general case it could be supplied some hints by the db author. When, by entering a directory, you choose a table.column pair the program searches the db for all distinct values of this combination and lists them as the next layer of directories. You can choose a value then, this creates a constraint on results. Next you can choose another table.column, see what options you have there and so on. Once there is enough information to uniquely identify an item the full implementation is supposed to look up its corresponding file(s) and provide them as a passthrough fs (mine just returns a dummy).
This is all good, however you could notice two annoyances:
- Picking properties one by one is slow, you might want to navigate the directory tree according to a preset combination of properties in the order that you remember without explicitly naming every single one along the way.
- Picking properties one by one is slow, you might want to see what's available in concise format where multiple properties are combined into a single choice.
Both of these are easy to solve. TLMC database has two extra invisible tables. A custom field is what TLMC always used for its album directory names : the "yyyy.mm.dd [code] albumname [event]" pattern. And a preset path is a path which uses circle name as its first component and a custom field for the second. Of course you can also define your own to create any view you'd want.
By the way, there are no files, this is just a demonstration of directory structure a database-based filesystem could have. Also, the whole thing is a quick hack job, so it's relatively fragile and could break if you try to hand-feed it deliberately crafted confusing inputs. It is also criminally slow to list some virtual directories in comparison to regular filesystems, but as long as it's faster than human reaction time you can afford to pay that price.

Q: So, after all this we can do the same thing we always could, just in a more complicated way?
A: Not exactly. We can do the same thing and much more. For example you can trivially add transliteration for circle names (or album names or event names, but I skipped that). Well, more like translation. Well, even more like whatever google thinks you think you wanted, but you get the idea, this was also automatic. Then instead of seeing a list of squiggly blobs [in case you can't read japanese] you'll see familiar english letters. Or you can filter on any property and browse in any order you'd like:
- Pick "event.name" first, "circle.name" second and see which circles participated in the event you chose.
- Pick "album.year" first, "event.name" second and see which events happened in the year you chose. Strictly speaking this abuses the fact that album release date almost always coincides with event date, but date fields could be added to the event table to disambiguate.
- Pick "circle.name" first, "album.year" second, "custom field" third and see all albums by a circle released in a certain year.
Okay, there's not a whole lot of combinations now, but it does add some convenient options. Pick "prepared path 1" to view albums in historical-TLMC-like way.

There are no good browsers left.

2021-06-28T14:15:00.006+03:00

I actually intended to make this post a year or two ago, but then kept putting it off. Now I stopped doing that, so here you are.
What is this about? I don't mean to criticize how bloated the browsers have became, even if it's true. I don't mean to write about how Firefox always shuffles UI elements around or how Chrome spies on its users. Just the very simple, most basic thing. Fonts and font rendering.
Make sure to set your browser page zoom to 100% to avoid browser scaling image distortions.

Here's a tiny piece of html for this test.

1. This is how the text used to look for every Firefox user (on Windows) and still looks on my end. First image is 1x and the second is 8x nearest-neighbour magnification so you can easily see the details. Right-click and open in a new tab or set page zoom to 100% or download the images to view them in original size.

Notice how every single entirely horizontal and vertical line is perfectly aligned with the pixel grid. Antialiasing for the round parts is there, it's unavoidable, but it doesn't turn on until some threshold font size. Smaller fonts use exact hinting and stay at 100% contrast, larger fonts smooth only the curved segments. This is Firefox until version 57 with forced Cairo font backend. This is how it did and should work.

2. With some bullshit justification typical for the Firefox team they completely removed good rendering option. This is how the text looks now for any user of a more recent Firefox (ok, I made the screenshots years ago on version ~60-80, but it could have only become worse at this point - I'm not reinstalling some new version and the test html is there for you to try if you think I'm wrong).

Small font is still the same, but larger fonts got worse. Maybe the distances between letters are now correct floating point values that were prescribed by the dumb new engine, but it cost us vertical lines which became gray and/or blurry. To say directly and honestly: this rendering sucks. The moment I saw it was the moment I stopped updating FF and reverted to the last known good version.

3. What about the browser with the largest market share, Chrome? Ok, ungoogled chromium to be precise, but in this regard it's the same thing. Well...

Like, seriously?? This is the best that google(!) could come up with? It isn't just shit smeared across the screen: the blurriness of the whole thing makes you want to claw at your face and vomit your brain through the eyesockets. How could anyone look at this hideous abomination for more than 5 seconds and think it's even remotely acceptable not to immediately wipe the software that decided to draw this from your machine?

Potential objections:
> "Just buy a high-dpi monitor, bro"
This is doubly retarded for the following reasons. First, it requires me to fork over some cash for the hardware to solve the problem that was not merely preventable in software, but rather was created in software in the first place. And second, even if your new resolution is double the old in each dimension for the same physical size, then that would only halve the width of the blurred gray edges, not eliminate them entirely. So, instead of solving the problem this provides a half-assed band-aid solution that costs money and doesn't even work. **** you very much, but I'd rather keep my old display and my old browser that work perfectly fine together.

TLMC v.20 PRE-RELEASE post

2021-06-11T20:58:00.005+03:00

The bad news: v.20 is still several months away.
The good news: v.20 is only several months away.
This is NOT YET a release, but we are fairly close. Here's how you can help:

I will be posting a list of albums I've managed to collect from various sources so far with updates from time to time until the torrent release.
- If there is something in the list that should not be there (unrelated albums), say so.
- If you have something that is not in the list (and not in the latest tlmc, of course), do share. Make sure to check these things first:
--- Ctrl-F my latest list and the comment section of this post to check if your album is not a duplicate. Search for smallest unique substrings, in case spelling is slightly different or wrong.
--- Good scans are always in short supply. Even if the album is in the list it might be missing scans. If you have the files, err on the side of asking/sharing.
--- I'm monitoring doujinstyle, so no need to announce or offer copies of their new uploads (unless yours is better, for example +scans or +log or 16/44). However, I could miss something out of their ~16k links, especially if it was not marked as touhou, so a double-check would be appreciated.
- If you can download from baidu, then check if acgjc.com touhou subsection has anything new.
- If you can download from baidu, please reupload links that I or anyone else will occasionally post here.
- If you cannot download from baidu, but have some baidu links of new material, then post and hope someone who can will reupload them.
The "new album" list could not only grow, but also shrink, if I discover files that fail decoding, have wrong content, are unrelated, etc. I don't expect many such problems, though.

In ~~several weeks~~ due time I'll add a list of all files and an archive of cuesheets. I'm still adding digital covers for the new albums and would like to go through all of the old albums without scans as well, if time permits.
I stripped personal comments temporarily appended to directory names, so some entries in the "new album" list may appear as duplicates of v.19, no need to point that out. This is not a problem: these are either new scans, additional scans, alternate scans, extra logs, or simply dupes I didn't compare yet.

If you're aware of a production-quality program that can extract JPEG content from TIFF files without bitmap decode-encode steps, let me know.

Every album that is assembled from separate tracks will have a "REM COMMENT built from tracks" or similar comment in its cuesheet. This is not bad and doesn't mean much on its own, it's mostly for bookkeeping purposes. Every single album built from tracks that had a single-file duplicate (with maybe one or two, literally, exceptions) was either content-wise bit-identical to its dupe or merely shifted by difference in drive offsets.

The smallest addressable unit in cuesheets is a frame, which is 1/75-th of a second or 588 samples (2352 bytes). If any length of a track in the album is not an integer amount of frames, then every track in that album cuesheet gets a sample-precision start and end comments, which makes it possible to recover exact audio data content of the files used, should you ever want it. It also means that either (far more likely) files were mastered directly for digital distribution, bypassing CD creation step, or (far less likely) some shitty software was used in the creation of that CD rip.

About 5% of files are problematic, with either sampling depth of more than 16 bits or sampling frequency of more than 44.1-48 kHz. These are mostly bandcamp releases. I'm still undecided on how to proceed with them. Will add a list a bit later.

If you were ever under impression that flac or lossless in general means untouched CD audio rip, sans an occasional artifact or two, then I'm here to disappoint you. At least one album that almost made it into v.20, but got accidentally caught by me, was postprocessed by the ripper. For more details you may read a comment that alerted me of this fact here, search for "destroyed". I downloaded both versions, the difference is clearly audible. Thanks to the "people" who decide to do this as if the audio weren't already loud to the point of clipping often enough, we can't be sure albums are really correct CD rips if they don't contain rip logs. Old Share/PD rips are less suspect because the average intelligence on the internet never goes up. I don't think it is a frequent occurrence, but still, if you reshare something then DON'T DISCARD THE RIP LOGS. I might try to scan all albums with cuetools if I find its batch mode, although rare CDs would naturally have no db submissions to compare against.

I still don't have a clear policy on sub-titles, which are album names transliterated into english (frequently in uppercase) and appended to the main title. For now I follow local consistency rules.

Many event labels are missing (trailing "[]") or have inconsistent spelling, this also creates perceived dupes. Ignore, it will go away soon.

I have an idea about a secondary supplementary torrent, which would include non-touhou albums of circles if they make up significantly less than half of that circle repertoire, so that by combining both you would get complete discography of a circle. It is definitely not happening any time soon. However, I don't delete unrelated albums I download (or find in and remove from the torrent), just safely put them away, so keep that in mind.

old v.19+ list for convenience (~6.5k lines), updated on 2021-06-11
v.19+ list for convenience (~6.5k lines) last updated on 2021-09-14
(includes proposed moves, renames, deletions)

old v.20 tentative additions (~3.3k lines), updated on 2021-06-11
v.20 tentative additions (~3.5k lines) last updated on 2021-09-14

old v.20 proposed moves and renames, updated on 2021-06-11
v.20 proposed moves and renames last updated on 2021-09-14

v.20 diff vs v.19 (10k lines) last updated on 2021-09-14

Booth.pm corrupts your downloads.

2021-04-21T18:43:00.003+03:00

If you ever bought direct downloads from them make sure to check that your files are error-free, especially if your downloads were slow.

Today I noticed that half of the files in four free albums by "クロネコラウンジ" did not decode properly. I redownloaded them all and one of the new files was still broken, but another attempt got me a good version. It was the only file that took about two minutes to get (today's first attempt), compared to several seconds for the rest, which suggests that under heavy load the file-serving part of their webmonkey code hiccups and starts sending you wrong file parts. Looking at the broken files in hexeditor you would notice content starts repeating in some random place (the starting position of the first part that gets repeated is aligned at 1 MiB boundary).

Princess Connect

2021-01-10T12:17:00.004+03:00

Important notice 1: TLMC pre-release post soon(tm), the torrent several weeks after that post. Don't ask when, it's being worked on.
Important notice 2: Princess Connect Re:Dive [global] is available both in "32-bit" (ARMv7) and "64-bit" (ARM64) versions, don't listen to anyone claiming otherwise. Just visit some direct-from-google downloader like this, and pick SG Grand Prime (for 32) or SG Note 9 (for 64).
Important notice 3: Leapdroid won't run PCRD (some OpenGL surface issues). Bluestacks emulator, which I saw being recommended by many, contains a metric shitton of malware out of the box, be advised.

Just a short observation - if you feel like peeking at exact values the game uses internally, you should view its db (which is located in "files/manifest" app settings subdirectory; the one with the sqlite header). If, for any reason, you just want to refer to some data there without exporting it yourself, you may use this dump. I'm not sure I'll be sticking with the game for a long time, so I can't promise regular updates yet.

Based on the current db the game will be officially released on 19 Jan 2021 and clan battles will start on 10 Feb 2021.
First "focused gacha"' will be rateup for Djeeta.

20 Feb 2021 edit: Stopped updating the db dump on github.
The game turned out to be pay-to-win trash, wouldn't recommend it to anyone.

Proper GF battery chart

2020-05-08T14:04:00.000+03:00

Yet another thing that was slightly bothering me about GF for a long time was wrong information about the effect dorms and their comfort have on collected batteries that is repeated everywhere. It was bothering me only slightly because I never had enough gems to go past 3 dorms. However, some time ago I got rich fast (in gems, not real life), so I decided to fix the issue.

Here is the chart. Horizontal axis is the sum of all dorms' comfort in thousands, vertical axis is batteries gained in 24 hours. Line color ~ dorm count.
Here is the raw data, if you're curious. Some extrapolations were made.

Note: if you enter any base facility between collections during the time condenser is active, then the gather time will be split in two and the amount you get will possibly be one less because of rounding down.

Bookmark a backup

2020-03-18T00:57:00.003+03:00

Yesterday I got an email that looks like a fishing attempt. Bla-bla, validate your identity or else we'll suspend your domain, bla-bla. Just in case it is not a scam I recommend you to save a backup name that leads to this blog: http://nameless--fairy.blogspot.com (note the double dash).

Significantly reduced GFDB precision

2019-11-06T11:52:00.000+03:00

Starting from 03 Nov 2019 the frequency of the log data updates returned by the game server changed once again: now refresh interval is more than once per hour (~20 updates in 24 hours). If this is a permanent change then it's rather bad news for anyone expecting meaningful statistics from EN GFDB in the future.

UPDATE: On 06 Nov everything got back to normal.

On modern software

2019-08-05T00:27:00.000+03:00

Today I noticed that gfdb updates were broken again. It was the third time it happened, so I knew this deserved a post.

I caught first update breakage purely by accident. One day I looked at the process list and noticed a chain of git commands sitting there doing nothing. Apparently at some point git decided it was time to do garbage collection, so it ran "git gc --auto", which, for reasons yet unknown to me, could not complete in 50 minutes on a fairly small and completely linear repository. This step should have happened before the push took place, but since it just hung there (with no cpu/disk activity, by the way) it was killed by task scheduler every hour to start new task, which repeated everything. This was going on for 5 days before I noticed. I run "git gc" manually and it finished in about a minute, cleared the clog and all was good (for a while).

Second time was exactly the same as first, except I was already aware it could happen, so I reacted sooner. Internet search brought nothing: either it was a new bug, or it was specific to some part of my configuration. Since I (mistakenly) already dismissed it I had to wait for another chance to debug and see what was happening.

However, today's problem was different. Call stack was not the same and had bash prompt for input at the top instead of gc command. After I killed it and manually tried to push all accumulated updates git asked me... for a username? What? Did github ban me, so my user-pass no longer worked? But why? I tried to login into github. Blah-blah, we see your device for the first time, how about you go fetch verification code we sent to your email? Oh great, fuck you very much, another service turned """""user-friendly""""". But whatever, here's your code.

Hmm, no notifications, the repo is there, everything looks pretty normal. I tried pushing again from the console. Typed the username and copypasted the password.

>remote: Invalid username or password. >fatal: Authentication failed

What? Did I just mistype 5-letter username or miscopy a doubleclicked string? Try again:

>Enumerating objects: 14190, done. >Counting objects: 100% (14190/14190), done. >Delta compression using up to 12 threads >Compressing objects: 100% (1834/1834), done. >error: RPC failed; HTTP 401 curl 22 The requested URL returned error: 401 >fWatal: the remote end hung up unexpectedly

What?? Looks like github is having problems, should I try again?

>remote: Invalid username or password. >fatal: Authentication failed

And then after one more attempt it accepted the changes I tried to push.

But what about git? Where is the password that was saved what, 3 months ago, and reused every hour since then? I went searching and encountered this magnificent piece of information:
>"git pull" will fail, because the password is incorrect, git then removes the offending user+password from the ~/.git-credentials file

Let me repeat that, so you can enjoy it one more time: whenever git discovers that any stored user+pass combination doesn't work it simply erases it from saved credentials file. I checked the file - yeah, 0 bytes. All puzzle pieces are in place.

What did we learn today?

Shame on you, git. I did not expect this systemd-level bullshit from you, of all programs.
Shame on you, github. Having transient authentication problems (caused by service overload, network partitioning or something else) is understandable, returning wrong error statuses and not notifying affected users is not.
Modern software (and software as service doubly, if not triply so) ALWAYS needs to be propped up by humans. Even if you setup everything correctly and leave it unattended, then sooner or later, mostly sooner, something will fail.

First Girls' Frontline guestopocalypse

2019-07-23T21:20:00.000+03:00

Today during the maintenance I noticed an error in craft scraper log that I've never seen before: all login tokens were being explicitly rejected. This slightly scared the hell out of me, it could mean that my account, which I kept as a guest, was also in danger. After the mainenance concluded the error message didn't change, my own account was also locked out.
Well, shit.

I found Sunborn support address and sent them an email. At the same time I started to wonder how many melons and monthly rewards I'll miss before I get my account back, if I get it back, that is. However, first support reply arrived in 6 minutes. They requested information about my account that only the owner could know. Completely accidentally I had a table with all my doll/fairy levels and skill levels, so I sent them a screenshot of that. Then they asked for email to bind, sent a link there, the link gave me the password, I put all that into GF client and it logged in. The whole exchange took less than an hour.

Results:
Sunborn tech support: S-. They get "S" for speed/success in helping and "-" for confusing me by replying to my response to their email request with the same email (was my request erroneously routed to two supoort guys at once?) and then sending me two recovery URLs.
Sunborn backend operations: C. At least they didn't wipe all guest accounts completely. Unfortunately this level of total incompetence is so widespread in the industry, that the company didn't even find it necessary to apologize for their mistake.

I also had to reroll scraper accounts, which cost a couple hours of log downtime, but luckily I got all of them in just 15 attempts, with 30 being the expected value.

Girls' Frontline production anomaly.

2019-04-21T14:51:00.000+03:00

Today I was looking at the GFDB and noticed something strange: there were some results that just weren't supposed to happen:
1 (one) [SMG] C-MS each from HG recipe 130-130-130-130 and AR recipe 97-404-404-97,
1 (one) [SMG] 100 Shiki each from HG recipe 130-130-130-130 and AR recipe 91-400-400-30,
1 (one) [AR] MDR from SMG recipe 400-400-91-30.

I could think of several reasons for that:

Magic
MICA's shady new crafting formula
Tester-kun is fooling around again
It's an Arknights conspiracy

As you probably know, templates like that are forbidden. At first I was afraid that my SQL import process failed because I tried to speed it up so much that what was left of a full json parser was a basic tokenizer which didn't even check field names, assuming that their order was deterministic and could only change because of some server maintenance or update. Fortunately I keep complete raw logs. One search later I confirmed that no, import was doing fine, and the anomaly shows up there too. If you're curious, here are the crafts in question:

{"id":"11027121","dev_type":"0","user_id":"52674","build_slot":"1","dev_uname":"Lani","dev_lv":"63","gun_id":"228","mp":"91","ammo":"400","mre":"400","part":"30","input_level":"0","item1_num":"1","core":"0","dev_time":"1555766798"} {"id":"11002254","dev_type":"0","user_id":"91144","build_slot":"1","dev_uname":"AdultNeptune","dev_lv":"126","gun_id":"215","mp":"400","ammo":"400","mre":"91","part":"30","input_level":"0","item1_num":"1","core":"0","dev_time":"1555755557"} {"id":"10986940","dev_type":"0","user_id":"599614","build_slot":"1","dev_uname":"uwux3usowarm","dev_lv":"33","gun_id":"213","mp":"130","ammo":"130","mre":"130","part":"130","input_level":"0","item1_num":"1","core":"0","dev_time":"1555750872"} {"id":"10554813","dev_type":"0","user_id":"5795","build_slot":"1","dev_uname":"Arveene","dev_lv":"117","gun_id":"213","mp":"97","ammo":"404","mre":"404","part":"97","input_level":"0","item1_num":"1","core":"0","dev_time":"1555806667"} {"id":"10468473","dev_type":"0","user_id":"317635","build_slot":"1","dev_uname":"Eoneo","dev_lv":"158","gun_id":"228","mp":"130","ammo":"130","mre":"130","part":"130","input_level":"0","item1_num":"1","core":"0","dev_time":"1555751142"}

There is nothing particularly suspicious in the data: all userids are different (they do belong to 2 out of 10 shards, but that's not quite enough), times are quite far apart (doesn't look like particle shower bitflips and AWS hardware should have ECC anyways), no magic numbers indicative of overflow of some kind. I don't know what caused this yet.

Girls' Frontline statistics, parts of the missing chapter.

2019-04-13T20:09:00.000+03:00

Alright, you primitive screwheads, listen up. See this? gf-db.github.io This is your new GFDB.

Well, to be fair, it's somewhat unfinished, but better than nothing, right?

There is one more thing that complicates straightforward analysis of rates during various periods that I didn't mention in the previous post. "dev_time" parameter is the time you take the item (doll, equip or fairy) out of production. However, since it is known in advance what will be produced, because most items are 1:1 mapped to their production times, probability roll happens when you start the construction. Thus some rolls will leak: they will be produced using rates of one crafting period, but will be taken out and appear in some subsequent period. You can exclude such crafts at the cost of slightly reducing overall amount of samples: for every user for every their build slot discard first craft in every production period. The effect we're talking about here is not that big: if you run dumb query that doesn't account for it, you introduce bias in estimated rates proportional to the difference between true period rates divided by the average amount of crafts per user during second period.

Yet another effect to look out for is regression towards the mean. Suppose you took the list of recipes for any particular item and arranged it in descending order by mean value of item probability estimate, then noted top ones. If you came back later after more data was gathered more likely than not you would notice that all supposedly "good" recipes' mean values fell. This does not mean that fresh new recipes get "originality boost" or something like that. Since by their very nature results are random in any distribution you will get recipes that perform both better and worse than their true rate on a small sample size and with more data all these fluctuations will smooth out.

A sidenote: there are times when one could begin to wonder if all the hate of microsoft products is really warranted and not a mindless parroting of another camp's fanboys. And then usually something like this happens: I needed to add regular statistic updates for the site. Should be easy, right? Since I run the capture on my main machine, and it has to run win7 because games... open Task Scheduler, create a task, done? Well, I did that and for some reason, call it intuition or curiosity, added timestamp logging, to see how much time updates take. When launched manually 1 hour of files is typically added in 2 seconds and the process to recreate statistics takes about 20 seconds, which is a lot and is a result of unoptimized queries/indices doing full scan on a gigabyte-sized database and the fact that I wrote it once and then used to run it every other month, but that can be easily fixed and is not the point now. So, testing showed that the data was added, generated, so I left it alone. And then after about half a day I peeked at the log and did "O_O" face when I saw update times of 20-40 minutes (?!?!?!1). After a quick... ducking? (like googling, but without the privacy invasion part) I found out that (in increasing order of retardation):

all processes launched from windows task scheduler run with lower CPU, I/O and memory priority
there is no setting in the GUI to change it
there is a way to change it that involves exporting the job to XML, editing "priority" variable, and importing it back, but
this variable affects both CPU and I/O priority together and there is no way to set memory priority at all

And even if we adjust CPU+IO priority to that of an interactively-launched process and leave memory alone, because it should only affect how soon our process' data gets evicted from the page cache under memory pressure, the updates still take 40-60 seconds. Why does that happen on an idle machine is still a mystery to me. So yeah.

Suppose you have a nearly complete log of crafts with all the data I described in the previous post. What interesting/useful information can you gain from it? Let's see:

A long list of real user-supplied names; more than 175k entries at the moment of this post.
User level as a function of time with 1 day or better resolution for each player (assuming they do daily crafting). Typical/fastest/slowest growth rate, average time it takes to levelup at level x for entire server population.
History of all name changes. For example, there are 3 users who changed their name twice and 236 users who changed their name once at the moment.
Ability to do username-to-id lookups for dorm-visiting or other purposes.
Item/class/total craft rates as a function of time. Popularity of various recipes.

[I was supposed to put cool-looking charts here, but then got lazy. I might revisit them if anyone is interested.]

Chinese-quality code found in game (I mean, other than UI "responsiveness"):

Client-side name validation. Your name cannot have any of these strings as a substring (case-insensitive comparison): "delete", "drop", "truncate", "set", "database", "table", "field", "alter", "select", "update", "insert". I don't know whether to laugh or cry here.
On "Combat" page accessible from the start menu if you rapidly switch between any 2 selector buttons (Combat mission, Logistic support, Combat simulation) game will reliably crash.
Not joking, happened to me once: the game crashed when I opened Formation window and started rapidly removing dolls from an echelon one by one. When I relaunched it I found myself in the defense drill battle. Lost 5 extra energy to this.
When moving from dorm to dorm using the Next button right after you press it everything starts looking pretty crappy, especially noticable on the condenser and wall portraits. No wonder why - they take a screenshot of the scene to slide it and for some reason (nobody gave a shit, that's the reason) it is temporarily saved as jpeg with high compression, so all small details are blurred and artifacts are scattered all over. If you're playing on a phone it might be less noticable because angular pixel size is smaller for typical viewing distances.

Interesting observations:

There are no "Oath" method/variable names in the game code, only "Marry" / "Wedding" ones.
Christmas bean bag chair furniture piece can be used as a bed, but only by G11.
I went and plotted 5* SG rates from the most popular recipe over time. On 2019-02-26 rates effectively tripled.
On 2019-03-05 (after the update) development log update frequency was increased from once every 10 minutes to once every 5 minutes, except last 4 hours of each server day where it was decreased to once every 40 minutes. On 2019-04-02 (after the update) they rolled it back.
On the chart image in the previous statistics post perceptive readers would notice a small bump during the maintenance. It is not an artifact. Looking at the database these are crafts by a single user named "Hiden", apparently as tests on the production server. You may visit this tester-kun at UID 4422 and say hello (not so hiden now, eh?).

I trust you, reader, not to be a retard, who ignores everyting I've written and rushes to craft recipes with the highest average ignoring low epoch capture ratio and/or high stdev due to low amount of crafts.
However, even if you are one, then no significant harm is done, because the landscape looks pretty smooth to me, so the most impact you can have on your results is to simply do more crafts while keeping resource efficiency in mind.

Girls' Frontline Statistics

2018-11-24T01:33:00.000+03:00

Putting a bit of "everything else" in the blog. This post is about a mobile game and applied statistics.
There will be absolutely no mentions of touhou (other than this one) in the post. Estimated time to read: 15 minutes.

Part 1, introduction.
Traditional gacha mobile games have a simple core mechanic and a number of units that participate in it. Several units are given to the player as starting units, some more can be farmed from maps, but the majority of most desirable, stronger units, come from the gacha - an RNG-powered slot machine that eats game premium currency (real money disguised as another game parameter in a pretty successful psychological trick to mask true monetary cost of player addiction). As the player progresses further in game gacha units become effectively mandatory - only the most hardcore players can do the "no gacha" challenge. Then there is also the collection aspect...

Girls' Frontline is what I'd call a second generation gacha mobile game. The trend started with Kancolle, if I had to guess. Other than that, the list includes Azur Lane, Cuisine Dimension (no english version yet) and maybe some other smaller titles. Their (our?) gacha is different - all unit and equipment crafts utilize resources that can be freely farmed in-game. Premium currency is still there, the game is not a charity after all, but it can be spent either to expand your infrastructure (free handouts can cover that fairly well), to roll cosmetics gacha with various costumes and decorations that are not necessary "to win" or to buy resources at ~~ridiculously~~ high implied time-to-money conversion rate[1]. There are still whales as gacha rates for rare cosmetics items are rather low (they wouldn't be rare otherwise) and the game seems to be doing fine profit-wise. Overall such games feel much more "fair" (for whatever subconscious definition of fairness) for both f2p and paying players, that's one of the reasons I picked it up.

The game has 4 types of basic resources and a crafting system which can be fed user-defined quantities of each resource and then produce units (dolls, in game vernacular) or equipment. Results are random, but there are certain consistent patterns that players can notice if they experiment with various resource amounts and ratios a lot or just read and confirm experiment results of earlier players. Game designers/admins give no information about the effects recipes (resource combinations) have on construction results, making figuring out best recipes another interesting part of game experience. Here "best" mean ones that give highest rates either for dolls/equipment of highest rarity or specific rare dolls, because in the process of hunting for rares you will naturally produce more than enough of common ones.

Typical consistent patterns mentioned earlier are hard cutoffs, which guarantee that you will exclude certain categories of stuff from crafting if you put in amounts of certain resources below or above some threshold. For more details, you may visit a calculator, such as this one. More interesting question is what kind of effect does varying amounts of resources within these bounds have on individual item appearance rates. Theoretically game designers could have made rules extraordinarily complicated, but I suspect that realistically there are only 3 possibilities: we need to find out whether crafting probability landscape[2] is spiky, smooth or flat.

Amount of crafts a single player can do with their limited resources is woefully inadequate to obtain reasonably precise estimates. Manually collecting results from several players is a logistics nightmare and a futile exersize in filtering out deliberate misinformation and accidental logging errors. If that was all we had I'd give up day one. However, fortunately for the curious types, developers left a small hole to slither through. The game features a PRODUCTION LOG, which shows recent recipes used by players and what did they obtain. Intercept game's network connection, figure out the API and results format, and then repeatedly query the server and obtain full and complete log of all crafts for you to analyze - the concept is obvious and trivial. The devil is, as always, in the details. It's a small project that can take some time to get right. It requires you do sit down and actually do stuff. Instead of that, how about some instant gratification in the game?

That's how I thought, putting off interesting in favor of easy, until one day of 16 October 2018. In the course of yet another game maintenance the game client was updated to the new and improved version. These improvements consisted of additional list filtering options (could live without, but fairly useful), slight interface reshuffle (newer one looks like a hack-job by someone with zero experience in UI/UX, but the changes are relatively minor, so oh well), some new game elements (recovery of the last copy of retired dolls - rather useful... for retards who do things first and maybe think later) and absolutely inacceptable and completely horrible interface delays in reaction to user input. Old interface in this regard was somewhat crappy, but bearable, especially as delays were masked by the inevitable for the online game network requests to the game server. Having to wait half a second because game needs to roundtrip to the AWS and sync state with the central DB is one thing. Having to wait three seconds while the game struggles to show you an empty list menu with no network activity in the background is something else entirely. Maybe in the age of websites which spike usage of multiGHz multicore CPUs when scrolling a fucking page with text in the browser and even then sometimes cannot keep up with display refresh this is considered acceptable performance by the general public. For someone with at least half a brain, however, ridiculous amount of bloat happening behind the scenes that leads to this, is shocking, to say the least.

In short, incompetent monkey developers fucked up the client and then incompetent testers, if they exist at all, failed to catch this performance regression. The client was shipped and now everyone including me is stuck with it[3]. Public relations guys tried to "address the issue", even organized some questionnaires to check if only some part of the userbase was affected and I'm still not sure whether it is so, given how many active players are there (more than 50000) and how few complained (50-100 or so). One month later it looks like they didn't even start fixing the bug. The end result for me was that leveling efficiency of my dolls dropped 30% because every screen transition was taking so much longer, enjoyment dropped even more because nobody likes being fed horseshit and I stopped actively playing the game as intended and started playing with the game. I poked it a little, found it was written in C#/Unity with hotpatches in Lua, avoided cracking protection on encrypted main dll by finding someone's deobfuscated version, decompiled and located the relevant parts and made a scraper. After fixing some stupid url copypaste errors it even worked. Then I found out the server was sharded and spent a couple more evenings rewriting the scraper into the multithreaded version. Half a day before 01 Nov 2018 I started full capture and it has been running ever since on my PC. If I had another host I'd add fault-tolerance, but so far I only lost an hour of data due to power outage.

Part 2, gathering data.
It's funny, but developers who wrote the backend part were actually sane. Results are sent to the client as JSON, so I didn't need to bother with pulling custom parser out of game's insides. Result is an array of crafts with various information: user name, user level, item id which is converted into item name and picture by the game, mp/ammo/mre/part (4 resources) input, "crafting level" (relevant for heavy construction) and amount of crafting contracts and cores spent (which is strictly derived from crafting level, so I don't know why they put it here). All of that is good already, but developers were super generous and added some extras which don't appear in the game UI, but help us immensely (I'll show just a bit later why): crafting time as unix epoch, item unique id in the shard DB, user UID, dev_type (again, parameter derived from crafting level) and build slot (we have 2 by default and can add more, up to 8).

It is not enough to capture some data then throw it at statistics and expect informative results. As the saying goes, Garbage In = Garbage Out. You need to understand what you're doing to not do something silly without realizing. My capture idea was not new. The game english server opened in May 2018, while the chinese one was running for more than two years already, then there are also taiwanese, hong kong and korean versions (and japanese, but it's fairly new). Chinese users also made a scraper for their game version, then captured results and published aggregate statistics. Their website shows that the capture process was stopped for some reason on 2018-09-22, but results up to that time are still available. Whether they are usable and to what extent is another question.

First completely obvious problem with them that was noticed by several posters in various places is that new dolls are regularly added to the production pool, but their pages display only one set of aggregate results that shows no signs of being reset. Since sum of all craft probabilities must add up to 1, as you always get something and always get only one thing each craft, something is happening to the probabilities of old dolls, as new dolls dilute the pool. If you just concatenate old and new results you'll end with blurred garbage. Then there's an issue with rate-ups. Sometimes events happen which boost craft rates of particular rarities or particular items. That site also doesn't explain anything about it. Maybe internally they did the good thing and threw it all out (instead of doing the best thing of capturing and presenting it separately), but seeing how they dealt with doll addition... does not instill confidence. And mixing rate-up with normal rate is not a trivial uniform low-intensity noise: if, during rate-up, some obscure recipe is crafted disproportionately many times, it will continue to appear alongside more popular recipes, with the perceived [non-existent during normal times] rate that can approach boosted rate during rate-up, whatever it is.

Back to the raw data. If you look at one capture result you will notice something interesting (since you don't actually have it you'll have to trust me or make your own scraper). Item unique ids are continuously decreasing, well, almost (dev_time resolution is 1 second and if several results happen in the same second they are in reverse order instead - either an oversight or with this sort order queries executed faster or nobody cared). At first. If you go down the list holes in item unique ids start to appear. I could have missed this effect if developers didn't include item unique ids and dev_time's and I didn't do sanity checks on results. "Who cares about holes?", you might say. "We just got less data from the server, we'll compensate with more time spent gathering". Not so fast. If you're curious, try to guess what could be wrong with this data and I'll carry on typing this sentence for a while, so that speed readers have enough time to stop reading, look away from the answer on the next line, collect their thoughts and think for themselves.

Ok, here's the explanation. As you go further and further back in time according to a single log file, there are more and more results missing. Don't know about you, but my first thought was that the server was returning last N results from its current table of users' items. In other words, to save space and, much more importantly, cpu time of the DB server(s), once you retire something, it is erased from the table. Then it no longer appears in the last craft list. Oh, could it be that common items get retired relatively more often than rares? I looked at the equipment list to confirm - rows at the bottom of the list were almost exclusively top 5* rarity. If you used only one such list for your estimates you can guess how skewed results could become.

In normal conditions actually that's not a very big deal. Inside the client there is a switch that allows it to resend request for fresh "last craft" list after more than 1 hour elapsed since the last request, otherwise it uses local cache (although you can always force refresh by restarting the game). Outside, the API endpoint caches list result on the server side for about 10 minutes (I should probably test it's really per-shard cache, not per-user, but common sense suggests it is). So, just ask for results every 3-5 minutes, get 100 or so new dolls (typical craft rate for the shard) in the list out of roughly 1000 (typical list length, there are no switches to ask for more or less) once it refreshes, append to your local list, wait, repeat. Now you're capturing everything you can without insider access. In doing so I achieved >0.99 capture ratio and it's possible to tell that only thanks to provided item unique ids.

Wait, "in normal conditions"? Yeah, there has been one exception so far. Care to guess?
These are your crafts [left part]. This is rate-up [maintenance pause in the middle]. These are your crafts during rate-up [right part].

During rate-up, quite predictably, craft rate goes up too. For example, during the last rate-up of 13-15 Nov 2018 (which included Contender,Spitfire,Zas M21,Ribeyrolles,AEK-999) craft rate was 2 times higher than average during the second day and up to 30 times higher than average during the first hours of the first day. Many of those rapid-crafters run out of free space quickly, as they consume tens or hundreds of crafting contracts, so they have retire majority of obtained dolls immediately after getting them. On the chart blue is the visible craft rate averaged over 1-minute intervals based on count of reported crafts, red is the true craft rate averaged over 1-minute intervals based on delta in unique doll ids. The true rate should be the upper envelope of the visible rate, but as you can see in the beginning server reports can't keep up even during the very recent (relative to cache refresh) parts. All of this means that, if you look at the zoomed-in graph once again, at best only less than 1 minute out of 10 from each request is usable and first several hours should be discarded completely. Fortunately, this only affects estimates of things during the rate-up boost. Individual recipe/item rates outside of rate-up can still be derived from long periods of calm normal crafting.

Part 3, analyzing data.
To be added later.

Part 4, EN GFDB.
I'm still not certain if I want to release all these results as a public continuously updated website, similar to chinese GFDB. On one hand, letting the effort rot locally would certainly be a waste. On the other, all user-driven activity about and around the game is ultimately a free gift to the company, as it increases user engagement (therefore revenue). Given their shitty treatment of users (see the part about lag bug) right now I don't feel like gifting them anything. In case they release an update that COMPLETELY eliminates lags introduced by the 2.0221_268 client, I might change my mind, if it happens before I lose interest in the game.

[1] Kids' grade dimensional analysis tells us that [USD/time] = [USD/gem] * [gem/resource] * [resource/time]. Resetting logistics takes, say, 20 seconds per logistic, and gives ~ 350-500 weighted resource (with up to 1.3 multiplier for great success). Current special ("discounted") offer is 9k mp/ammo/mre or 3k parts for 480 gems. Gems cost 80 gems/USD (without first purchase bonus, since we're talking about whales). All of this translates into equivalent rate of about 40-80 USD/hour, which is surprisingly less than I expected. If you earn more than that then it would be cheaper to buy resources directly, rather than resend logistics, under the assumption that purchase happens instantaneously (it doesn't). One advantage to buying resources is that you can perform a lot of it, while you run out of logistics to send very quickly.

[2] Four input resources may be visualized as coordinates in 4-dimensional space, and true item probabilities (you aren't seriously expecting anything more than Bernoulli process here, are you?), which we attempt to estimate, as scalar fields inside the hypercube.

[3] Technically I could install old client and MITM the connection, pretending to the client that it was talking to the older server version and pretending to the server that it was talking to the newer client version. In practice I'd risk being banned, or worse, open a whole new can of my own personal bugs because of that.

TLMC v.19 (2018.01.01)

2018-01-01T23:50:00.001+03:00

This is the 19th version of Touhou Lossless Music Collection torrent, current total file size ~1.75 TiB. Download and seed. If you have an older version, you can update it with this script (run the script in the torrent directory after stopping the old torrent but before starting new torrent). The script renames files/directories which had their names updated at some point in time - if you don't run this script you'll end up with some duplicates.
As always cuesheets are located in a separate 7z archive - when extracted into the torrent directory it places cuesheets in proper places.

Main torrent (7z with cues and *.tta files):
Touhou lossless music collection v.19 (1.65 TiB or 1 817 738 246 162 bytes).
Torrent file size is 9 604 641 bytes.
Magnet link.

Pictures torrent (cover/booklet scans):
Touhou album image collection v.19 (76.5 GiB or 82 153 065 676 bytes).
Torrent file size is 4 582 077 bytes.
Magnet link.

Supplementary materials (everything else - pdf booklets, bundled wallpapers, lyric texts, other extras):
TLMC supplementary materials v.19 (24.2 GiB or 25 978 962 058 bytes).
Torrent file size is 2 200 879 bytes.
Magnet link.

Additional links:
Touhou wiki
Album wiki
Online player with direct downloads

Fun facts: This is the largest update to date, at about 375 GB it is almost twice as big as the whole TLMC v.01.
Not so fun facts: It also took the most time.

Special thanks to:
- friends at toholmc.com, who provided about 760 links to tta+cue albums, which is a majority of albums in this update (if you're reading this, please learn how to input all unicode characters instead of replacing them with question marks and underscores because going through ALL cue sheets fixing errors by hand is seriously tiring; also 3 albums were missing cue sheets and them being single-track is not an excuse to omit cues and 2 or 3 images were in the album archives they didn't belong to and 1 album contained a duplicate of another instead of what its name indicated and since crappy 400px covers do not replace proper scans why even bother putting them in ...).
- friends at bbs.moem.cc, who provided about 90 links to wav+cue albums (remaining 900 links are unfortunately either trash mp3, unavailable or already present in this torrent).
- several baidu users and baidu search engines which indexed their links
- all original album rippers
- all touhou doujin circles
- and you!

A bonus for everyone who patiently waited for this release: discographies of artists every man of culture should listen to and enjoy.

Hatsuki Yura complete lossless discography, excluding collaboration albums (ones where she sings like 1 track out of 10; all albums listed in the Circle section on the artist site at the moment of this post are present).
Data size is 15.9 GiB or 17 080 348 619 bytes.
Magnet link.
It comes with two relatively minor issues:
1. "HAMELN Limitedver" album has two glitches because the fileshare kept giving me the same broken file (made about 30 attempts in hopes of hitting another CDN shard, each time file hash was the same). Although I could technically try to recover it by checking for all single bitflips in each block (fortunately the archive package had zero compression), doing this with reversed specs of a closed-source format is rather annoying so I didn't bother.
2. "RAPT AQUARIUM" album was reconstructed from split tracks without the rip log, so there is some chance pre-gaps are missing.

Cuesheets are located in a separate 7z archive, just like with TLMC.

FEEDBACK REQUESTS
1. For circles with names in kana/kanji I looked at the circle websites. If they used an english version of the circle name then I used that as a spelling hint/alternate name in directory name: "[main name] alt.name". The rest I've left as-is.
What are your thoughts on adding alternate english names to all circles?
Pros: it would make navigating directory structure and referencing circle names easier for those who don't know/are bad with kana/kanji.
Cons: running rename script would be effectively mandatory, at least for that one transition, and straightforward name translation or transliteration might not mean exactly what artist intended. It would also break your existing playlists.
2. There is a large amount of recent rips which are sadly shared as separate flac tracks with no rip logs. For now they are not included in TLMC.
What are your thoughts on gluing them back into CD images and adding those to the torrent?

FAQ
Q: Why TTA and not FLAC?
A: Two reasons, one historical and one technical.
Historical one is that a huge chunk of early albums were in tta, so to be consistent about the codec I converted the rest (mostly ape) into tta and then the majority of albums were also shared as tta, so less work for me. The technical one is that I like tta better. Flac is open source and rather popular, it's the most popular lossless format on a lot of metrics, which is a big reason to adopt it. Tta is also opensource and less popular, but not completely unknown, and several of its design decisions (whether intentional or accidental) stand out to me as flat out better. These are: only one compression level and no embedded metadata support*. According to my preferences this outweighs flac's popularity.
* Mixing immutable data with metadata is a big no-no in my eyes and leads to disappointment and sad kittens. You wouldn't make a kitten sad, would you?

Q: Why album+cue and not split tracks?
A: Audio CD has one track that encodes its entire LPCM data stream. Track breaks are the metadata content.

Q: Why are cuesheets included as an archive instead of as is?
A: So that you would be able to edit them (for whatever reason, such as finding a mistake) and not break the whole torrent.

Q: When will the next version get released?
A: I am searching for a better way to share TLMC and lossless doujin music in general. While bittorrent somehow manages to get job done it is (and always been) a wrong tool for this purpose. Sadly it is good enough, so very few are pressured to develop better alternatives. With that in mind...
Best case scenario: A p2p similar to what I described in my May post gets developed and the need for this aggregator torrent is gone, as all future albums that get ripped and shared appear and are available on it, immediately and permanently.
Realistic scenario: TLMC v.20 in summer-winter 2019.
Worst case scenario: Everything is eaten by grey goo.

Q: Anything I can do to help?
A1, Easy mode: Seed.
A2, Normal mode: Compile a list of "file+cue" albums that can be downloaded on the internet but are missing in TLMC and send me links to them. "Can be downloaded" part is pretty important here.
A3, Hard mode: Buy missing albums, rip with EAC in secure mode and a proper drive read offset to single file tta+cue (or single file wav+cue if you're too lazy to run tta encoder, or single file flac+cue if you're already used to flac), then share on any reasonable host (mega preferred, mediafire is fine too; but even a bad host is better than no host at all, just make sure it doesn't require registration - these ones I drop on sight) and post a link. Even better if you can scan the booklet and covers (don't overkill like, ahem, certain someones; 600 dpi is surely enough).
Alternatively, find out how to contact nekomimi Alice and participate in toholmc group buys.

And finally, if you liked the music, please support its authors!

What's wrong with bittorrent and what can we do about it? Vision of a next-gen p2p filesharing system.

2017-05-13T18:31:00.000+03:00

In this short writeup I review the history of filesharing services, highlight weakness of currently most widely-used filesharing protocol (bittorrent) and propose directions for improvement.
Intended audience: anyone with an interest in p2p filesharing.

What's wrong with bittorrent and what can we do about it?
So, why does bittorent suck? Actually, by itself, it doesn't. Bittorrent is a pretty good file transfer protocol.
The first widely successful p2p filesharing application was Napster, appearing in 1999. It could transfer files directly between users' computers, but the list of files was stored on the central index servers, which handled the searches. It comes as no surprise that the company behind it was sued and shut down the service two years later.
Napster's weakness was centralization, its single point of failure. Next developments tried to address that. Still kind of not wanting to let go the client-server paradigm there appeared Kazaa and Edonkey network. They still had supernodes/servers, but as they were under user control one could not simply shut them all down overnight.
They had a problem with leechers, though. Since Edonkey relied on users' goodwill to share their upload bandwidth and Kazaa made an extremely unwise decision of trusting the client to correctly report its contribution to the network, many chose not to bother with sharing back. The p2p solutions worked, but one could do better.
Then came bittorrent. With a single conceptual change in the form of a Tit-For-Tat algorithm being a pure leecher became personally unprofitable. You could throttle your upstream and (on a swarm with a high leecher-to-seeder ratio) watch as no one wanted to trade with you, only occasionally gifting you some chunks during optimistic unchoking, and wait to get noticed by seeders. If, however, you granted the application a reasonable amount of upload speed, the popular download would easily saturate your pipe.
However, bittorrent entirely omitted (and thus outsourced) one critically important feature of any p2p filesharing system: metadata management. Bittorrent protocol did not specify how end users should work with metadata at all, this was left out as an implementation detail. What appeared next is tracker frontend websites, public and private, both of which now exist and have their own problems.
Public trackers struggle with retention, because users have no incentives, other than pure altruism, to keep sharing files they downloaded long ago. Even is you want to keep sharing, many bittorrent clients are not well-suited to handle thousands of torrents simultaneously.
Private trackers more often than not reek of unfounded elitism, they are a pain to get into, and most importantly, they impose rules which unnecessarily restrict filesharing and waste an unused surplus of upload bandwidth.
So, bittorrent is a good file transfer protocol, but it is not a filesharing solution. A filesharing solution consists of a protocol/software/framework to exchange both data and metadata, and bittorrent takes care only of the first part.

Vision of a next-gen p2p filesharing system.
Metadata is "data about data". The bytestream (file content) which gets interpreted as a certain container type and then split and decoded into audio, video, static images or any other form of media is data. The album name, list of tracks, movie title, its release date and all other data that describes the data is metadata. As a user of any p2p network who wants to download something, first you interface with metadata provider (perform a search), and then extract pointers to the data files themselves and let the application download them.

At the moment there are two models of working with metadata: centralized and distributed.

The most common example of the centralized model is a bittorrent sharing site. Torrents get uploaded by users, approved by moderators under the control of site admin, get categorized and tagged using site-local metadata schema, then users visit the site, use site search functions to locate desired torrents and download them.
Strengths: Centralization helps maintain quality of both data and metadata. Low quality data files are either rejected or eventually replaced with better ones, metadata is properly organized and timely updated.
Weaknesses: The website itself becomes the SPOF under threats ranging from simple funding issues of site admins to harassment by state police following orders of media cartels. If the site is taken down all metadata creation effort frequently goes down the drain.

Distributed model is represented by self-contained filesharing solutions such as Perfect Dark or its predecessor Share, or, even earlier, serverless (DHT) Edonkey network.
Strengths: Resilience.
Weaknesses: The task of supplying correct metadata is completely in the hands of hordes of end users - uploaders. Instead of a proper schema the metadata is flattened into simple filename strings with dumb regex search as the only way to query. File collections are hacked into the system through archives with all the downsides it implies. Difficult to impossible to fix incorrect metadata. Searches can be slow and incomplete.
These problems of distributed-type networks, in my opinion, contributed to their lesser share today, relative to bittorrent protocol and its supporting websites.

Wouldn't it be great if we could somehow combine strengths of both approaches without their weaknesses? But wait...
There are two aspects to metadata in filesharing networks. One is tagging/updating/fixing errors, you can also call it "write access" or "creating/managing metadata". Another is querying/searching/browsing, call it "read access" or "using metadata".
In centralized model both "read" and "write" metadata access is centralized. Centralized model is good at "creating/managing", is ok at "using" and it gets worse with size (site popularity requires money to maintain and pay for traffic/CDN and ultimately attracts law enforcement).
In distributed model both "read" and "write" metadata access is distributed. Distributed model is very bad at "creating/managing", somewhat bad at "using" and is mostly indifferent to size (but higher numbers of active users make it harder to target individual users).

Key insight: one should centralize "write access" and keep "read access" distributed. Details on how to create centralized write access and improve read access to the level of centralized model over distributed network are below:

The role of the site admin (root administrator for a certain collection of metadata) can be performed by any user with a private-public keypair. Moderator access is granted by signing their public key with the root key, and checked by signature verification. Public keys and signatures are broadcast into the network. Anyone can generate a keypair and become a "site admin".
Each user can pick an arbitrary amount of keys to trust as root keys. For those keys the user keeps a complete local copy of the associated metadata database, together with the log of all updates. Metadata updates are database changesets signed and broadcast by any user, which are then checked and signed (or rejected, by letting them expire over time) by moderators. Owner of the root key ("site administrator") monitors published updates, resolves update conflicts, imposes an ordering on updates, signs them with root key and broadcasts these updates into the network, which then get accepted by software of the users who chose to trust that root key and get added to their local database. A search would simply be a query to this local database. An equivalent of RSS bittorrent feeds would be an ability to set precise file download triggers based on contained metadata.

Why is this an improvement over bittorrent with websites?
- Metadata is public and secure, spread among all users. By keeping full log of changes not even root key user can maliciously erase it at will. Easy to fork if root starts slacking.
- It relieves end users of local file storage micromanagement. You can stop bothering yourself with putting downloads into appropriate directories on your hard drive(s) according to your chosen criteria and 1)let the software place them automatically based on the metadata-to-directory mapping you decide or 2)let the software store files by hash and use its metadata to locate files you need instead of relying on the filesystem as an ad-hoc database (because a generic graph can do more than a tree). A FUSE module which plugs into the metadata db and provides a file view overlay is also a good idea.
- It should increase retention of old files by incentivizing users to keep downloaded files available, because TFT would account for any data transferred between peers**, not just data belonging to the same file or file collection (single torrent). Curiously, "public bittorrent" could also easily do that, but client authors didn't bother with adding that change for some reason (afraid of decreasing individual swarm performance despite increases in overall network health?). "Private bittorrent" tries to do almost just that (by tracking total ratios), except it does so in an inefficient and susceptible to forgery - in other words, plainly broken - manner.
**: actually, between owner groups of an end-user key. You could push a copy of your key to the seedbox, let it do most of the uploading work, and enjoy fast downloads to your home PC because your key would be recognized by the peers as a good exchange candidate.
- It allows you to "backup" your downloads by keeping references to downloaded files and backing up only that (which is megabytes at most, instead of gigabytes to terabytes). You'd have to redownload files in case of a hdd crash, obviously, but at least that would be automated. Bittorrent can also kind of do that, if you keep torrent files, but after several years you might find speeds lacking.

Half-baked ideas:
Separation of content - pure metadata vs. data links.
Pure metadata is metadata describing content without any references to actual files (say, album/movie title). Data links are, as their name implies, junctions between pure metadata and the space of files ('there is an instance of "this" movie in the p2p network and it has "that" hash' kind of statement). One could argue which files are worth adding to the set of available files: for example, TLMC never includes lossy transformations of original content (for reasons), while others might find value in mp3/ogg version of TLMC and music in general. There is less disagreement about pure metadata: it is an objective statement about the state of the world, rather than personal preference. Therefore, it makes sense to separate the two.

Some open problems: there is a number of difficulties I don't know how to solve at the moment. I don't know how much impact they will have on the viability of the whole idea.
Ease of use.
Users would have to download metadata database, which could grow into multi-gigabyte range (for example, for music, 1M albums x 1K metadata/album = 1G metadata) and then keep up with all updates. This is not something every casual user would accept and tolerate, just to download a couple of songs. It is thus not unreasonable to imagine lightweight clients, which store only metadata about local downloaded data and send search requests to nodes which do store complete copy, probably in exchange for a small chunk of bandwidth credit, but this adds levels of complexity I'm uncomfortable with and, what's worse, starts dangerously pointing in direction of central search servers, unless users are explicitly made aware of personal downsides (naturally slower searches? would that be enough to deter majority?) of choosing the lightweight option.
Degree of coverage.
Suppose there is a database of all touhou music. However, all touhou music is a subset of all doujin music, which in turn is a subset of all music. And if you go one step further, it is a subset of all media. Now, having one giant database of "all media" is clearly impractical because its schema would grow into an enormously complex beast (I might be wrong here, though), and the db itself would get huge. But if you keep the databases split at lower levels you'd either have to duplicate maintenance efforts, or you'd need regular grafts between them, or they'll just desync.
Post-moderation.
One attractive property of a centralized system is (an optional) post-moderation. Most users are assumed not malicious, so they are allowed to post content that will be verified by moderators later. In case a fake or otherwise undesirable file is shared, their privileges might be revoked. Significant reduction in share latency on all content outweighs occasional temporary rogue files, this can be further tweaked by requiring a certain level of trust to be established before content posting rights are granted.
It is unclear to me how to implement this in a distributed system, since metadata updates do not commute and there is no central point which can serve for automated conflict resolution. Maybe it's a reasonable choice to accept delays for pure metadata db and store changes signed by moderators and trusted uploaders as ephemeral updates for data link db until confirmation from root comes in.
Update poisoning.
The network should accept all metadata update requests and store them for some time for moderators to review. An attacker could flood the network with trash requests.
One option of dealing with this would be to require users who want to publish updates to perform one-time computationally intensive task to start participating. For example p2p software could require their identifying credentials to have a property of having a cryptographic hash function of their public key contain a certain number of leading zero bits. Also, nodes should limit their storage for each publisher key and keep only last something broadcasts (like square root of their previously verified updates). Turns out this particular one is not a big problem, after all.