POW NFT Generative Music
After many months of hard work and development, POW NFT has entered the exclusive pool of generative music NFTs. This massive undertaking, the joint effort of a musician and a coder, was designed to be provably generative in a lossless format that has the best chance of surviving decades of technological advancement and deprecation.
Tech stack for the future
A number of considerations went into choosing the tech stack for this project. I'm a firm believer that if you're going to make a generative art NFT based on on-chain data, you need to show your work. It would be very easy to create or curate artwork behind a curtain and then just claim it was generative. So whatever process we landed on, it was important that it be transparent (and therefore repeatable). In a doomsday scenario, even if every copy of every Atom is destroyed and the metadata servers melt, the artwork should be re-creatable with just the ruleset and the token's hash.
As NFTs, the permanence of the tokens is another major consideration. For better or worse, as non-burnable NFTs, POW NFT Atoms will be around forever (or as long as the Ethereum blockchain), and due to the unique nature of POW NFT’s mining process, new ones will theoretically be mintable 100 years from now. So a tech stack that had the best chance of being supported in some form long into the future was crucial.
These criteria immediately rule out a lot of options. Even if we'd written a highly complex macro to interact with some existing audio software, the whole process would have had a massive point of failure: it would require that software to remain available, supported, and usable for decades to come. I have no doubt that if NFTs had been a thing 15 years ago, the majority would have been made with Adobe Flash. For a time, literally every page you visited had embedded Flash elements, but these days Flash has been end-of-lifed by Adobe and most browsers block Flash files by default.
So when thinking long term, the task isn’t about picking the current shiny leader, it’s about picking a format that is most resistant to the natural cycles of decay and innovation that are inherent in digital technology. This means using the thinnest possible stack, ideally something with broad compatibility, and open-source (or at the very least, not a proprietary format that is subject to the whims of managerial budget cuts).
For these reasons, it seemed obvious that something using current web technology was the solution. Ever since Tim Berners-Lee's brilliant invention of the World Wide Web back in the late 80s, we've essentially had a platform, in the form of HTML, that's actively maintained to be universally compatible across all devices. As software devs, some of us have a tendency to dismiss HTML because of its relative simplicity and front-facing nature, but if you'd told someone in 1985 that someday you'd be able to write a line of code and have it work on literally any system anywhere in the world, made by any company, in any (human) language, their head would almost certainly have exploded.
Because the web is open and works so well for everybody, everybody uses it. And because of this, every browser and OS vendor is motivated to keep supporting it. If you're going to launch a new business, you need a website, and if you're creating a new device, that device had better have a working web browser or nobody will buy it. Basically, our modern economy has the web as a dependency, so there's an incentive for everyone to support and maintain it. I would say that using a platform that piggybacks on the collective efforts of the entire global economy for compatibility is a good place to start when choosing a stack. There is no guarantee any tech will be around 100 years from now, but for the time being the web isn't going anywhere.
So that's fine for today, but what about the future? Web standards change, just like anything else. However, this is once again a case where the ubiquity of the web works in our favor. Projects that aim to preserve our digital culture already exist, and there are active efforts to archive the web underway. So I think it's safe to assume that 50 years from now, if anything digital has been preserved at all, it will be something from the web. And if our late-21st-century descendants are putting in the work to keep 2021 web tech compatible, then we might as well jump on board that train and build something that takes advantage of the fact.
Web tech for the future
Okay, so we want to build using only web tech so our NFTs can live forever, but that raises another important question… how do you make music with just a browser?
Again, we need to immediately throw out a lot of options. Any process that loads and mixes pre-recorded or pre-generated samples is out. This would add a whole new file type to our dependency list, but more importantly, it would require preserving all of those samples in exactly the form the synth expects. So whatever we made needed to run in the browser using code alone.
I also immediately ruled out using any framework or library. When keeping a thin stack, you don't want to throw someone else's potentially messy, potentially precarious code in there without knowing exactly what it's doing. I could write tomes about how half the time frameworks cause more problems than they solve, but for the sake of this article just consider that since we need our NFTs to last forever, we need to understand every line of code we commit, because time has a way of amplifying mistakes. I can guarantee that the dev who wrote the package you're importing wasn't thinking "is this going to work a decade from now?", so by using their framework you might save a few hours of dev time and lose a few decades of NFT lifespan.
In fact, this is why POW NFT Atoms don't use any sort of graphics framework or library, just vanilla JS interacting with the HTML5 canvas. The Live Models can run indefinitely, and the only requirement for users is an up-to-date browser. I don't consider this constraint a drawback either; anyone who enjoys creating in any field will agree that working within limitations often yields better results than an unbounded blank page.
AudioContext
This is all great in theory, but how do you make music play in a browser without loading sound files? The answer lies in the Web's relatively unknown AudioContext API. I'll admit I had no idea this existed until I needed to make an alarm for the POW NFT miner, and from looking around the web, very few people actually use it. On the rare occasion someone does want to create music using just a browser, there are a couple of decent-looking frameworks that wrap AudioContext, but I think I've made my thoughts on frameworks pretty clear. As it turns out, AudioContext has all the core components of a modern digital synthesizer, you just need to know how to use them. So we were going to build this from scratch.
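To give a sense of how little is needed, here's a minimal sketch (not POW NFT's actual code) of playing a single note with nothing but the bare API:

// Minimal example: one oscillator routed through a gain node and played.
// Note: most browsers only let an AudioContext start after a user gesture
// (e.g. a click), so in practice this would run inside an event handler.
const ctx = new AudioContext();

function beep(frequency = 440, duration = 0.5) {
  const osc = ctx.createOscillator();
  const gain = ctx.createGain();

  osc.type = 'sine';               // sine, triangle, square or sawtooth
  osc.frequency.value = frequency; // in Hz
  gain.gain.value = 0.2;           // keep the volume sensible

  osc.connect(gain);
  gain.connect(ctx.destination);

  osc.start();
  osc.stop(ctx.currentTime + duration);
}

beep(440, 0.5); // concert A for half a second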
POW NFT’s generative music is actually split into two parts: a Synth module (a fully configurable digital synthesizer built upon the AudioContext API), and a Composer module (which creates the compositions, configures the Synth’s instruments for any given track, and then gives the score to the Synth to play). The two modules are quite different in both function and design, so I’ll cover them separately, starting with the Synth.
The Synthesizer
Going into this, I knew very little about how synths work. A background in mechanical engineering and a decade and a half of writing code for a vibration analysis company gave me a pretty decent understanding of working with sine waves, but the knowledge required to make those waves pleasing to the ear was far outside my wheelhouse. Luckily, my collaborator, Skwid, had this kind of knowledge in spades, having worked with both digital and analog synths for as long as I've been throwing 1s and 0s together. This rig that he built illustrates his skill level far better than I ever could:
The first step was to find a common language so I could build what Skwid needed in the synth. Thankfully we both understand circuit diagrams, and so after we ran through what features AudioContext had to offer, Skwid sketched out this part-flow-chart-part-circuit-diagram explaining what components were needed and where they’d interact with each other:
What you're seeing is the layout for a single instrument within the synthesizer. The instruments actually have several LFO-oscillator-gain node sets (you need this if you want an instrument to be able to play multiple notes at once). It turns out that everything in this diagram is all you need to make any digital instrument, from a juicy bass guitar to a sparkly chime to all sorts of percussion; it's all just a matter of how you configure the different components.
Oscillator nodes
Our synth is actually a dual-oscillator synth, so the Osc node in the diagram represents two oscillators that can be configured differently but play the same note. AudioContext's OscillatorNode does a huge amount of work here and is relatively cheap (from a processor standpoint). It basically outputs a repeating wave which you can configure in all sorts of ways. Some of our synth's components borrow terminology from the AudioContext API, so whenever I'm referring to a component of the latter, I'll use its exact camel-case name (like OscillatorNode) to prevent confusion.
Each oscillator in the Osc node can be configured for a different wave type (sine, triangle, square, sawtooth) and can have a detune value (basically a tone offset). To accommodate the creation of certain percussive sounds, Osc nodes can also be configured to output white noise instead of a repeating wave, or to always play the same note if they're designed for a specific sound. They can also take a "max note" parameter, which downshifts any note higher than this into the octave below it. These functions could technically have been handled by a Composer module that was conscious of the ideal conditions for the instruments it was using, but this way the burden stays on the Synth to make sure it always sounds okay, and the Composer can just worry about writing music.
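As an illustration of the idea (the function name and config shape here are hypothetical, not the project's actual code), a dual-oscillator note with per-oscillator wave type and detune, plus an octave downshift above a frequency ceiling, might look roughly like this:

// Hypothetical sketch of an Osc node: two differently configured oscillators
// that play the same note, with an optional downshift for notes above a ceiling.
function createOscPair(ctx, config, frequency) {
  // Downshift anything above the configured ceiling by whole octaves
  while (config.maxFrequency && frequency > config.maxFrequency) {
    frequency /= 2;
  }

  return [config.oscA, config.oscB].map(settings => {
    const osc = ctx.createOscillator();
    osc.type = settings.type;            // 'sine' | 'triangle' | 'square' | 'sawtooth'
    osc.detune.value = settings.detune;  // offset in cents
    osc.frequency.value = frequency;
    return osc;
  });
}

// e.g. a slightly detuned sawtooth pair for a fatter lead sound
const pair = createOscPair(new AudioContext(), {
  oscA: { type: 'sawtooth', detune: 0 },
  oscB: { type: 'sawtooth', detune: 12 },
  maxFrequency: 2000,
}, 880);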
LFOs (Low-frequency Oscillators)
LFOs are basically just oscillator nodes, but they’re used in places where it makes sense for them to have low-frequency settings. In our synth, each instrument has its own LFO which can modulate the OSC frequency, and there’s also one in each filter, and one in each panner.
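Wiring an LFO up in AudioContext is just a matter of routing a slow oscillator through a gain node (which sets the modulation depth) into the target AudioParam. A minimal vibrato sketch, not the actual module:

// A slow oscillator's output, scaled by a gain node, is added to the audible
// oscillator's frequency parameter, producing vibrato.
const ctx = new AudioContext();

const osc = ctx.createOscillator();
osc.frequency.value = 440;

const lfo = ctx.createOscillator();
lfo.frequency.value = 5;          // 5 Hz wobble

const lfoDepth = ctx.createGain();
lfoDepth.gain.value = 10;         // +/- 10 Hz of modulation

lfo.connect(lfoDepth);
lfoDepth.connect(osc.frequency);  // connect to the AudioParam, not the signal path

const out = ctx.createGain();
out.gain.value = 0.2;
osc.connect(out);
out.connect(ctx.destination);

lfo.start();
osc.start();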
ADSR
A huge part of making a synth that sounds like anything other than 8-bit chiptunes is ADSR envelopes. If you haven't heard of the term (as I hadn't at the start of this project), it stands for Attack-Decay-Sustain-Release, and it's the cornerstone of all synthesizer sounds. We actually have 3 ADSRs in the synth, but for the newbies, I'll first explain what it does with regard to Osc amplitude.
Basically, rather than just starting the wave when a note starts and then stopping it after, the gain (volume) starts at zero, then goes up to a peak, comes down again and holds there for a while, and then goes back to zero. This mimics the way all sounds work, not just digital synths. The height and lengths of the different sections may vary (and the tones themselves will vary), but the shape of the ADSR envelope will determine how hard/soft/sharp/smooth the sound is. If you imagine a piano, the ADSR begins when you hit a key, the attack and decay are determined by the piano’s material properties. The sustaining part lasts until you take your finger off the key, and then the release is how quickly the sound falls away following that.
Each instrument in our synth has its own ADSR envelope configuration, and when it receives a message saying “Play C#2”, it first sets an oscillator node to the correct frequency and then runs through the ADSR envelope. AudioContext doesn’t have built-in ADSR functionality, but it does allow you to set different properties to change to specific values over time, so with a bit of precision coding, you can make your own ADSR.
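A rough sketch of that kind of hand-rolled ADSR, using AudioContext's scheduled ramps on a gain parameter (the helper and its envelope values are illustrative, and it assumes the note is held longer than attack + decay):

// Schedule an ADSR envelope on any AudioParam (typically a gain node's gain).
function applyADSR(param, startTime, { attack, decay, sustain, release }, holdTime) {
  param.setValueAtTime(0, startTime);
  param.linearRampToValueAtTime(1, startTime + attack);                // Attack up to peak
  param.linearRampToValueAtTime(sustain, startTime + attack + decay);  // Decay down to sustain level
  param.setValueAtTime(sustain, startTime + holdTime);                 // Hold until the note ends
  param.linearRampToValueAtTime(0, startTime + holdTime + release);    // Release back to silence
}

// e.g. with the gain node from the earlier example:
// applyADSR(gain.gain, ctx.currentTime, { attack: 0.02, decay: 0.1, sustain: 0.6, release: 0.4 }, 1.0);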
Our synth actually uses ADSR in two other places: oscillator pitch and filter frequency. I’ll discuss filters in a bit, but the pitch ADSR means it actually bends the pitch of the note while it’s playing. The pitch-ADSR envelope can have a shape entirely different to that of the gain-ADSR envelope.
Oscillator reassignment
Because our synth is doing a lot of work with limited resources, we had to be smart about how many oscillator pairs each instrument can have. If an instrument has a long ADSR envelope and is playing a lot of different notes in rapid succession, then you can end up with hundreds of oscillators. I made the decision to allow each instrument a max of five oscillator pairs. At the start of the track, it looks at the score and works out which notes will be assigned to which OSC. If it’s got 5 notes ringing and it wants to play another, it will re-assign the least-recently-played OSC, cut off the ADSR envelope and play the new one.
Actually, because of some quirks in the way AudioContext does and doesn't handle interruptions to those gradually changing values I used for the ADSR envelopes, the synth has to map out the entire ADSR event chain for each oscillator at the start of the track, calculate the cutoff values, and encode them accordingly. But all this means the Composer can throw a rapid 180 bpm tune at instruments with long-tailed ADSRs and know the Synth is never going to jam up. Again, it's the Synth's job to make sure it always sounds okay.
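For illustration only, the least-recently-played reassignment can be sketched like this (the real synth pre-computes envelope cutoffs before the track starts rather than deciding live):

// A fixed pool of five voices; when all are busy, steal the one that was
// played longest ago (the real synth would also cut its envelope short).
class VoicePool {
  constructor(size = 5) {
    this.voices = Array.from({ length: size }, () => ({ note: null, lastPlayed: 0 }));
  }

  assign(note, time) {
    // Prefer a free voice; otherwise reassign the least-recently-played one
    let voice = this.voices.find(v => v.note === null);
    if (!voice) {
      voice = this.voices.reduce((a, b) => (a.lastPlayed <= b.lastPlayed ? a : b));
    }
    voice.note = note;
    voice.lastPlayed = time;
    return voice;
  }
}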
Filters
Filters are another ingredient that 100x'ed the power of our Synth. For the non-synth-savvy: they amplify or remove different frequencies from the incoming signal, which is important for creating certain sounds. AudioContext's BiquadFilterNode does all the heavy lifting here; our synth just wraps around it and adds some ADSR functionality or an LFO on the cutoff frequency. A key feature of the filter ADSR envelope is that it can be configured with negative values, meaning it bends the cutoff frequency down from a baseline rather than up from a 0 Hz floor.
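A sketch of the general idea: a lowpass BiquadFilterNode with an LFO sweeping the cutoff around a baseline (the routing and values here are illustrative, not the project's filter module):

// A sawtooth source through a lowpass filter whose cutoff is swept by an LFO.
const ctx = new AudioContext();

const source = ctx.createOscillator();
source.type = 'sawtooth';
source.frequency.value = 110;

const filter = ctx.createBiquadFilter();
filter.type = 'lowpass';
filter.frequency.value = 800;   // baseline cutoff in Hz
filter.Q.value = 5;

const lfo = ctx.createOscillator();
lfo.frequency.value = 0.5;      // one sweep every two seconds

const lfoDepth = ctx.createGain();
lfoDepth.gain.value = 400;      // +/- 400 Hz around the baseline cutoff

lfo.connect(lfoDepth);
lfoDepth.connect(filter.frequency);

source.connect(filter);
filter.connect(ctx.destination);

lfo.start();
source.start();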
Other components
Every instrument also has the option of a delay, panner, and reverb module. The delay is just AudioContext's DelayNode on a configurable feedback loop, and the panner wraps a StereoPannerNode with an LFO. Our reverb module is a little more complex, doing some magic with a BiquadFilterNode and a ConvolverNode with a custom impulse-response function.
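For the curious, a delay on a feedback loop in AudioContext is just a DelayNode with a gain node feeding its output back into its input; a minimal sketch (not the project's module):

// Connect an oscillator or any other source into `input`.
const ctx = new AudioContext();

const input = ctx.createGain();         // dry signal comes in here
const delay = ctx.createDelay(5.0);     // max delay time in seconds
delay.delayTime.value = 0.375;          // e.g. a dotted-eighth echo

const feedback = ctx.createGain();
feedback.gain.value = 0.4;              // < 1 so the echoes die away

input.connect(delay);
delay.connect(feedback);
feedback.connect(delay);                // the feedback loop
input.connect(ctx.destination);         // dry path
delay.connect(ctx.destination);         // wet path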
Generalising synth controls
When making the synth, it was important to generalise its capabilities so the composer had everything it needed to work with. For this reason, all properties with a time-based unit (ie, seconds or Hz), have the option of being configured in terms of beats. For example, an LFO can be set to oscillate once per beat, rather than a set number of times per second (Hz). On top of this, most instrument properties can also be reconfigured by the composer mid-track if needed — the synth effectively exposes an API to the composer, giving it the ability to adjust the sound of any instrument at any point in the track.
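The beat-based conversion itself is simple arithmetic, along these lines (hypothetical helpers, just to show the idea, not the Synth's actual API):

// Convert beat-based settings into the seconds/Hz that AudioContext expects.
function beatsToSeconds(beats, bpm) {
  return beats * (60 / bpm);
}

function beatsToHz(beatsPerCycle, bpm) {
  // an LFO that completes one cycle every `beatsPerCycle` beats
  return 1 / beatsToSeconds(beatsPerCycle, bpm);
}

// e.g. at 120 bpm, an LFO set to "once per beat" runs at 2 Hz:
// beatsToHz(1, 120) === 2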
All of this is to say, we created a fully functioning digital synthesiser. At no point in development did I have to say "no, we can't do that" to any of Skwid's design parameters, which put us in an ideal position for the second stage of this build. The only thing close to a compromise on tech was adding a sampleRate toggle. Essentially, this lowers the sound quality, as a way of accommodating current systems that might not be able to handle real-time generation of CD-quality music in a browser. Ironically, assuming devices continue their general trend of improvement, this feature will most likely become vestigial. If you're reading this in 10 years' time, just ignore that button.
The Composer
Now that we had a fully loaded Synth, it was time to write the code that makes it sing. This was a mammoth task that required finding a delicate balance between being overly constrained and sounding too crazy and non-musical. Ironically, given that it was code writing this music, we found it needed to meet a higher bar for what the average listener would consider "okay music" than if it had been created by a human. If a human composer does something unusual, it can be received as experimental, but if a machine that's supposed to make music makes something that doesn't sound like your idea of music, you'll probably just think it sucks.
Randomness from hash
Before going into the details of the Composer, it's important to cover how its decisions are made based on the token's hash. In POW NFT's visual layer, individual bytes of the hash are assigned to determine different characteristics of the Atom (a few for atomic number, a few for colour, etc.). When it came to designing the Composer, however, it became clear that there would be more decisions to make than there are bytes in the hash, and even then, we would want more than 256 degrees of variety in some of them.
I recalled a weird thing that happened long ago in a high-school maths class. Two friends with the same make of calculator were pressing the "random" key and getting identical results. We realized the "randomness" was built into the calculator somehow, and their calculators happened to be at the same point in whatever process generated it. It didn't matter, because all that was needed from a calculator's randomness function was that the user didn't know what the next value would be; it didn't require some sort of quantum randomness.
So the simple solution to this was just to write a pseudo-random function that rehashes the token’s hash and returns a random number based on the result. As long as you are always starting from the same hash, and as long as the order of instructions in the composer doesn’t change, you will always get identical results.
For this part, I used a cheap implementation of the Fowler-Noll-Vo hash function, seen in its entirety:
function fnv32a() {
  // Hashes whatever arguments are passed in, serialised as a single string
  let str = JSON.stringify(arguments);
  let hval = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < str.length; ++i) {
    hval ^= str.charCodeAt(i);
    // Multiply by the FNV prime (16777619) using shifts and adds
    hval += (hval << 1) + (hval << 4) + (hval << 7) + (hval << 8) + (hval << 24);
  }
  return hval >>> 0; // force an unsigned 32-bit result
}
This isn’t something you’d want to use for cryptography, but for the purposes of rehashing our token hash for generative music, it’s perfect.
Our actual randomness function looks like this:
function random() {
  const limit = 1000000000;
  // Fold the previous result back in along with the token's id and hash,
  // so the sequence is deterministic for a given token
  last_rehash = fnv32a(tokenId, hash, last_rehash);
  return (last_rehash % limit) / limit; // a float in [0, 1)
}
This effectively gives us a random number with 1,000,000,000 (one billion) possible values, more than enough for a project which can't ever really have more than 16,000 tokens. The reason for re-including tokenId and hash in every subsequent rehash is that one billion is significantly smaller than the max value of our token's hash. So in the unlikely (but not impossible) situation that two tokens hash to the same value, the collision will only occur once, rather than propagating to all subsequent hashes. Effectively, it retains the 32-bit hash's resolution for the randomness.
I wrote a few little helper functions that wrap around this random function for different variable types, and different probabilistic conditions, but this is effectively the backbone of how the composer makes all its decisions.
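They might look something like this (the names and ranges here are hypothetical, just to show the pattern of wrapping random()):

// Thin wrappers over random() for the kinds of decisions the Composer makes.
function randomInt(min, max) {
  // integer in [min, max] inclusive
  return min + Math.floor(random() * (max - min + 1));
}

function randomChoice(options) {
  return options[randomInt(0, options.length - 1)];
}

function chance(probability) {
  // true with the given probability, e.g. chance(0.25)
  return random() < probability;
}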
Track structure
An early decision we made was that our non-fungible tunes needed to adhere to some kind of internally-logical structure. That’s not to say that they should all be structured in the same way, but that a track should have different sections with different moods/tones, and that the order and possible repetition of those sections should not be jarring to someone who’s paying attention.
For this reason, we defined several sections, (eg. chorus, bridge, tension, release, drop, etc…) and gave the composer rules for deciding how many sections to include, and what order to put them in. This ensures that a track has a natural-sounding flow, with a balance between repeating themes and variety.
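Purely as an illustration of the approach (the real Composer's section types, counts, and ordering rules are its own), a structure builder using the helpers above might look something like this:

// Hypothetical sketch: assemble a section order with a bias towards
// returning to the chorus so the track has a recurring theme.
const SECTIONS = ['intro', 'chorus', 'bridge', 'tension', 'release', 'drop', 'outro'];

function buildStructure() {
  const structure = ['intro'];
  const sectionCount = randomInt(4, 8);
  for (let i = 0; i < sectionCount; i++) {
    const middle = SECTIONS.slice(1, -1); // everything except intro/outro
    structure.push(chance(0.4) ? 'chorus' : randomChoice(middle));
  }
  structure.push('outro');
  return structure;
}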
Track properties
Once the Composer has mapped out the track’s overall structure, it must decide on a number of properties including tempo, root key, key changes and scale. It also has an algorithm for determining the length of the different sections, which provides additional variety between tracks.
The tempo is actually the one property that isn’t purely determined by the randomness function. The majority of a track’s tempo is correlated to the Atom’s atomic number (number of electrons), with only a small bit of variation coming from randomness.
Rather than arbitrarily picking notes, the Composer selects a musical scale from a curated list. This was an important and highly subjective process: we had to rule out scales that could yield too many crazy results, but we didn't want to over-constrain it to just the super-friendly major scales either.
Instruments
For each track, the Composer effectively assembles a band of instrument definitions and then creates them on the Synth. The instrument definitions were predefined by Skwid, which was a creative decision designed to ensure that all the Synth’s sounds were audibly pleasing. It’s important to remember all generative art has a level of curation. Whether it’s the color palette, the shape algorithm, or in this case the instrument library, the generative art is a result of rules laid out by an artist who understood the many possibilities but was unaware of what the final outcome would be.
Skwid defined several instruments and assigned them to a number of categories (lead, bass, pads, FX, etc), then for each track the Composer selected a set number from each category. This was a way of ensuring the Synth wasn’t playing 5 leads at once, or trying to use a bass guitar as a snare drum.
The instrument preset process is where the generative music gets a lot of its feel, and it’s why we spent so much time perfecting the Synth, and why Skwid spent so much time refining each instrument so they all go well together, regardless of selection.
Melody and rhythm elements
The crunchiest part of the Composer module, and the thing that gives the tracks the most flavour besides the instrument definitions, are the melodies and rhythms. Skwid defined several matrices which determined the possible re-use of these elements in different sections of a track, as well as how they should be generated.
There may be a certain likelihood that, for a given track, lead #2 plays the same melody in the intro and the chorus, or that two instruments play a call-and-response, or that percussive instruments take a break and play fills, or that a certain beat plays in all the track-defining sections. Once again, this approach means that any given track will have added internal consistency and recognisable patterns, rather than just being a mish-mash of random melodies.
The rhythms and melodies themselves are generated using several different algorithms, including arpeggiators, Euclidean sequencers, a chord generator, an algorithm for walking bass-lines, and a few others. It was also important that these algorithms output melodies in a key-agnostic form; that is to say, rather than outputting melodies with specific notes, they output shifts relative to some arbitrary base note, meaning they can be repurposed for key changes or for octave transposing (handy if a bass and lead are jamming).
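As a toy example of the key-agnostic idea (not the project's actual arpeggiator), a generator can emit semitone offsets relative to a base note and leave transposition to the score-writing step:

// Emit a pattern of semitone offsets; the caller decides the root and octave.
function arpeggio(chordOffsets, steps) {
  // e.g. chordOffsets = [0, 4, 7] for a major-triad shape
  const pattern = [];
  for (let i = 0; i < steps; i++) {
    const octave = Math.floor(i / chordOffsets.length) * 12;
    pattern.push(chordOffsets[i % chordOffsets.length] + octave);
  }
  return pattern;
}

// Transpose only when writing the score, e.g. for a bass an octave down:
// const notes = arpeggio([0, 4, 7], 8).map(offset => rootNote - 12 + offset);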
The Composer is also aware of which elements will be used in which sections of the track, so it writes them with the length in mind. If your track has a 16-bar chorus, it won't just repeat a 1-bar melody for 16 bars, but it also won't have a 16-bar solo with no repetition. If a melody is being reused in two sections of different lengths, it will account for that too.
Bringing it all together
Once the Composer has created all the pieces, it must actually write the score and send it to the Synth. It steps through each section of the track and composes the relevant bars for each instrument based on whatever melody or rhythm it should be playing, staying mindful of key changes and mixer settings, and making a few extra decisions about which instruments will play at what time.
The end result isn’t really human-readable, but it’s our Synth’s equivalent of sheet music. It contains every note and instruction the Synth needs, and the Synth will blindly do what it’s told, when it’s supposed to.
Music for eternity
All these rules, built in this future-facing, open format, meant we were able to add a unique track to all 5200+ existing POW NFT Atoms. It also maximises the chance that if someone mints an Atom a decade or two from now, they’ll be able to hear their own unique track, played live by their browser, based on rules written decades ago, that’s never been heard before.
Writing 5200 tracks in 2 months is probably a record for Skwid, but having the knowledge and skill to intuit a system that could theoretically compose 1,000,000 more is probably a greater achievement.
Time will make up its own mind about this project’s legacy and it is impossible to know what the future brings, but we’ve put it on a path that leads far beyond the horizon.