You may already be familiar with blockchain as an immutable public ledger: a way to capture and store information as a shared source of truth. But have you ever asked exactly what type of information you can pull from the blockchain? What kinds of questions can blockchain-backed data answer?
In this post, we want to explore those questions, highlighting which kinds of questions are naturally and easily answered by blockchain-backed storage and which are more easily answered by other storage formats.
At Flipside Crypto, our customers (crypto projects that want to deeply understand their users) need to answer a whole series of questions that are pretty much impossible to answer by pulling data directly from the blockchain. Instead, we transform, decode, and aggregate data into more traditional relational formats in order to calculate metrics like network activity or visualize trends in usage and retention.
We’ll get into those metrics below, but first we should explore the data that we can pull directly from the chain.
What data can you actually pull from the chain?
While any particular blockchain might have its own implementation, at a high level you can generalize the shape of blockchain data as a series of linked blocks, each composed of a group of transactions, where the transactions are interactions between “addresses”: cryptographically controlled identifiers on the chain.
Common libraries allow you to query for things like “the latest block in the chain” or “balance for a specific address”. The query interfaces all make sense when you consider how the chain gets built up. Users, who control addresses, submit transactions they would like to record to a pool of potential transactions. These transactions are then “mined”, where the miners attempt to group them into a valid block that will eventually be recorded on the chain.
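To make that shape concrete, here is a minimal toy sketch of linked blocks holding transactions between addresses. The class names, fields, and the JSON-plus-SHA-256 hashing are all assumptions for illustration; no real chain is implemented this way.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class Transaction:
    sender: str     # address sending value
    recipient: str  # address receiving value
    value: int      # amount transferred, in the chain's smallest unit


@dataclass
class Block:
    number: int
    parent_hash: str  # link to the previous block
    transactions: List[Transaction] = field(default_factory=list)

    def hash(self) -> str:
        # Toy hash over the block contents; real chains hash a binary header.
        payload = json.dumps(
            {
                "number": self.number,
                "parent_hash": self.parent_hash,
                "transactions": [vars(tx) for tx in self.transactions],
            },
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode()).hexdigest()


def append_block(chain: List[Block], transactions: List[Transaction]) -> Block:
    """Group pending transactions into a new block linked to the current tip."""
    parent_hash = chain[-1].hash() if chain else "0" * 64
    block = Block(number=len(chain), parent_hash=parent_hash, transactions=transactions)
    chain.append(block)
    return block


def is_linked(chain: List[Block]) -> bool:
    """Check that every block points at the hash of the block before it."""
    return all(b.parent_hash == a.hash() for a, b in zip(chain, chain[1:]))


chain: List[Block] = []
append_block(chain, [Transaction("0xaaa", "0xbbb", 5)])
append_block(chain, [Transaction("0xbbb", "0xccc", 2), Transaction("0xaaa", "0xccc", 1)])
latest = chain[-1]  # the "latest block" point lookup
```

Because each block stores the hash of its parent, changing an old transaction changes that block's hash and breaks the link from the next block, which is what makes the record tamper-evident.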
Using the Python web3 library as an example, we can explore the Ethereum blockchain over JSON-RPC. Once you have access to an Ethereum node, an easy entry point is pulling the latest block:
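```python
from web3 import Web3
from web3.providers.rpc import HTTPProvider

# YOUR_ETH_NODE_ENDPOINT is a placeholder for your node's JSON-RPC URL
web3 = Web3(HTTPProvider(YOUR_ETH_NODE_ENDPOINT))
# get_block is the current web3.py name; older releases call it getBlock
block = web3.eth.get_block('latest')
```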
This returns the block’s data. At the block level, there is:
- context metadata about how and when the block was mined
- who mined it
- how it can be validated
- and for our parsing needs: an array of transaction hashes under the `transactions` key.
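As a rough sketch of reading those block-level fields, the helper below picks a few of them out of a block mapping. The key names (`number`, `timestamp`, `miner`, `parentHash`, `transactions`) follow the Ethereum JSON-RPC block object; the helper name and the sample values are made up for illustration and do not describe a real block.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Mapping


def summarize_block(block: Mapping[str, Any]) -> Dict[str, Any]:
    """Pick out a few block-level fields from a block mapping."""
    return {
        "number": block["number"],
        # Ethereum block timestamps are seconds since the Unix epoch
        "mined_at": datetime.fromtimestamp(block["timestamp"], tz=timezone.utc),
        "miner": block["miner"],
        "parent_hash": block["parentHash"],
        "tx_count": len(block["transactions"]),
    }


# Made-up values for illustration only; this is not a real block.
sample_block = {
    "number": 100,
    "timestamp": 1556291440,
    "miner": "0xMinerAddress",
    "parentHash": "0xparent",
    "transactions": ["0xtx1", "0xtx2"],
}
summary = summarize_block(sample_block)
```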
We can in turn ask for the transaction information using the transaction hash:
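```python
# Take the first transaction hash from the block fetched above
tx_hash = block['transactions'][0]
# get_transaction is the current web3.py name; older releases call it getTransaction
tx = web3.eth.get_transaction(tx_hash)
```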
This returns the transaction’s data. The transaction includes:
- block-level reference data
- fees paid to execute and record the transaction
- sender and recipient data
- input data which becomes important in Ethereum for more complex transactions (in our example, it’s actually an empty value)
You can see a cleanly decoded version of this transaction.
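One way to turn a raw transaction into something easier to analyze is to flatten it into a plain record. The sketch below assumes the field names of the Ethereum JSON-RPC transaction object (`hash`, `blockNumber`, `from`, `to`, `value`, `gas`, `gasPrice`, `input`); the helper name and sample values are hypothetical. Note that `gas * gasPrice` is only an upper bound on the fee, since the gas actually used is reported in the transaction receipt, not the transaction itself.

```python
from decimal import Decimal
from typing import Any, Dict, Mapping

WEI_PER_ETHER = Decimal(10) ** 18  # 1 ether = 10^18 wei


def flatten_transaction(tx: Mapping[str, Any]) -> Dict[str, Any]:
    """Convert a raw transaction mapping into a flat, analysis-friendly record."""
    return {
        "tx_hash": tx["hash"],
        "block_number": tx["blockNumber"],
        "sender": tx["from"],
        "recipient": tx["to"],
        "value_eth": Decimal(tx["value"]) / WEI_PER_ETHER,
        # Upper bound only: the gas limit times the offered gas price
        "max_fee_eth": Decimal(tx["gas"] * tx["gasPrice"]) / WEI_PER_ETHER,
        # Empty input data means a plain value transfer with no contract call
        "is_plain_transfer": tx["input"] in ("0x", ""),
    }


# Made-up values for illustration only; this is not a real transaction.
sample_tx = {
    "hash": "0xtx1",
    "blockNumber": 100,
    "from": "0xaaa",
    "to": "0xbbb",
    "value": 1_500_000_000_000_000_000,  # 1.5 ether, in wei
    "gas": 21_000,
    "gasPrice": 2_000_000_000,           # 2 gwei, in wei
    "input": "0x",
}
record = flatten_transaction(sample_tx)
```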
You could continue pulling block after block and decoding transactions, which is what we do in our ETL flow to parse each chain, but let’s step back and ask what we’re doing here from a data access perspective.
In order to record transactions on the blockchain, miners need to be able to perform a couple of simple lookups:
- “give me the latest block” — since they will be adding to the chain
- “does this user have enough currency to spend on this transaction” — to validate each transaction that can form the next block of transactions
And when they mine the block, they will record inputs and outputs for each transaction. On a smart contract platform like Ethereum, these outputs can get very complicated because they contain the results of executing any arbitrary contract call. Think of it like the output of any function call in any given programming language, and think about what it would take to encapsulate all those possible outcomes to be decoded later on. Feel bad for our data team yet?
So “the truth” of each of these transactions is forever recorded on the blockchain… but what have we not done here?
Miners do not need to compute:
- how the usage of this token has declined or increased in the last thirty days
- whether the user base on this platform is growing
- the rate at which apps are launching on the platform
In other words, the data stored on the chain and the interfaces built to access it serve to mostly answer point-lookup-style queries, not aggregate analytics.
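The difference can be sketched with a toy in-memory chain. Everything here (the balances, the blocks, and the function names) is made up for illustration. The two lookups a miner needs touch a single item, while a question like "how did usage change over the last thirty days" has to scan every block in the window.

```python
from datetime import datetime, timedelta, timezone

# Toy state and chain; every value here is made up for illustration.
balances = {"0xaaa": 10, "0xbbb": 3}
now = datetime(2019, 4, 26, tzinfo=timezone.utc)
blocks = [
    {"number": 0, "mined_at": now - timedelta(days=45), "transactions": ["t1", "t2", "t3"]},
    {"number": 1, "mined_at": now - timedelta(days=10), "transactions": ["t4"]},
    {"number": 2, "mined_at": now - timedelta(days=1), "transactions": ["t5", "t6"]},
]


def latest_block(blocks):
    """Point lookup: only the tip of the chain is needed."""
    return blocks[-1]


def can_afford(balances, address, amount):
    """Point lookup: only one address's balance is needed."""
    return balances.get(address, 0) >= amount


def transaction_count_since(blocks, cutoff):
    """Aggregate: every block mined at or after the cutoff must be visited."""
    return sum(len(b["transactions"]) for b in blocks if b["mined_at"] >= cutoff)


recent = transaction_count_since(blocks, now - timedelta(days=30))
previous = transaction_count_since(blocks, now - timedelta(days=60)) - recent
```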
What can you do with on-chain data?
Since our customers want to answer questions about user behavior and usage trends, and not just do point lookups, we have to go beyond just crawling the chain. We have to store transaction data in a way that lets us flexibly segment and aggregate it. The good news is that for many of the questions we end up asking at Flipside Crypto, a straightforward relational data model is all we need.
For each block, we decode all the transactions and then store them as one or more individual records. An Ethereum transaction record looks pretty similar to the RPC response above, with fields pulled from the block, the transaction, some additional metadata, and decoded outputs from any potential contract calls.
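A minimal sketch of what such a relational layout could look like, using an in-memory SQLite database. The table name, column names, and rows are assumptions for illustration and are not Flipside Crypto's actual schema. The second transaction produces two records, reflecting that one transaction can decode into more than one row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE transfers (
        tx_hash       TEXT NOT NULL,
        log_index     INTEGER NOT NULL,  -- position of this record within the transaction
        block_number  INTEGER NOT NULL,
        block_time    TEXT NOT NULL,     -- ISO 8601 timestamp taken from the block
        project       TEXT NOT NULL,     -- metadata added during the ETL step
        from_address  TEXT NOT NULL,
        to_address    TEXT NOT NULL,
        amount        REAL NOT NULL,
        PRIMARY KEY (tx_hash, log_index)
    )
    """
)

# Made-up rows for illustration only.
rows = [
    ("0xtx1", 0, 100, "2019-04-01T12:00:00Z", "project_a", "0xaaa", "0xbbb", 2.5),
    ("0xtx2", 0, 101, "2019-04-01T12:00:15Z", "project_b", "0xccc", "0xddd", 40.0),
    ("0xtx2", 1, 101, "2019-04-01T12:00:15Z", "project_b", "0xccc", "0xeee", 10.0),
]
conn.executemany("INSERT INTO transfers VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)

record_count = conn.execute("SELECT COUNT(*) FROM transfers").fetchone()[0]
tx_count = conn.execute("SELECT COUNT(DISTINCT tx_hash) FROM transfers").fetchone()[0]
```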
Metric Example 1: Economic Throughput
With these values, we can slice and dice the network data by project to show larger trends, like economic throughput.
Our individual transactions easily roll up by project, and so we can calculate average transaction size and number of transactions over time. If we plot average transaction size against number of transactions on a project-by-project basis, we can start to see how different projects are used:
Projects farther to the left are more similar to payment systems like PayPal, with a high number of smaller-value transactions. In contrast, projects with larger transfer sizes act more like stores of value. A project’s placement in this plot could confirm whether users are doing what the creators intended.
While “high-level averages across all projects” is interesting to help us get oriented to the ecosystem as a whole, our customers want to go deeper, to better understand their users. Since we start at the individual transaction level, we can refine the aggregate stats and break down transfer metrics in other ways.
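The per-project roll-up behind a plot like this can be sketched in a few lines. The input format (pairs of project name and transfer amount), the function name, and the sample values are assumptions for illustration; a production version would more likely be a GROUP BY over a transactions table and would also bucket by time.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def throughput_by_project(transfers: Iterable[Tuple[str, float]]) -> Dict[str, dict]:
    """Return the transaction count and average transaction size per project."""
    totals = defaultdict(lambda: [0, 0.0])  # project -> [count, summed amount]
    for project, amount in transfers:
        totals[project][0] += 1
        totals[project][1] += amount
    return {
        project: {"tx_count": count, "avg_size": total / count}
        for project, (count, total) in totals.items()
    }


# Made-up transfers: many small ones for project_a, one large one for project_b.
sample_transfers = [
    ("project_a", 2.0),
    ("project_a", 4.0),
    ("project_a", 3.0),
    ("project_b", 500.0),
]
stats = throughput_by_project(sample_transfers)
```

In this toy data, project_a has the payment-like profile (more, smaller transactions) and project_b the store-of-value profile (fewer, larger ones).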
Metric Example 2: Concentration of User Activity
For a single project, we can start to understand how active their user base is over time. In the plot below, we are looking at average transfer size again, but this time, we can compare it to the number of transfers sent per user per month to reveal the concentration of user activity:
Often, crypto projects we work with have a profile of their ideal user behavior. By looking at these behavioral groupings over time, they can better understand if usage of their project is trending towards or away from their ideal.
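As a rough sketch of the grouping behind this kind of view, the function below counts transfers sent and averages transfer size per user per calendar month for a single project. The input tuple layout, the function name, and the sample data are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Tuple


def activity_by_user_month(
    transfers: Iterable[Tuple[str, date, float]]
) -> Dict[Tuple[str, str], dict]:
    """Group (sender, day, amount) transfers by sender and calendar month."""
    grouped = defaultdict(list)
    for sender, day, amount in transfers:
        grouped[(sender, day.strftime("%Y-%m"))].append(amount)
    return {
        key: {
            "transfers_sent": len(amounts),
            "avg_transfer_size": sum(amounts) / len(amounts),
        }
        for key, amounts in grouped.items()
    }


# Made-up transfers for illustration only.
sample_transfers = [
    ("0xaaa", date(2019, 1, 3), 10.0),
    ("0xaaa", date(2019, 1, 20), 30.0),
    ("0xbbb", date(2019, 1, 5), 200.0),
    ("0xaaa", date(2019, 2, 1), 5.0),
]
activity = activity_by_user_month(sample_transfers)
```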
Metric Example 3: User Retention
Finally, our analytics become even more powerful when we let projects map their off-chain efforts to on-chain activity.
In this user retention chart, we’re looking at the percentage of users from each month that are active six months later:
A project can look at this chart and better answer questions like: Did a marketing campaign one month pay off by recruiting the right new users? How did a product redesign impact long-term usage?
By capturing retention from on-chain data, we can help them track and measure their off-chain engagement efforts in a much more concrete way.
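A retention number like this can be sketched as a cohort calculation. The definition used here is an assumption, since the post does not spell it out: a user's cohort is the month of their first on-chain activity, and the user counts as retained if they are active in the calendar month exactly six months later. The function name, input layout, and sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def six_month_retention(
    activity: Iterable[Tuple[str, int, int]], horizon: int = 6
) -> Dict[Tuple[int, int], float]:
    """Map each cohort (year, month) to the percent of its users active `horizon` months later.

    `activity` holds (user, year, month) rows, one per month a user was active.
    """
    active = defaultdict(set)  # month index -> users active in that month
    first_seen = {}            # user -> month index of first activity
    for user, year, month in activity:
        idx = year * 12 + (month - 1)
        active[idx].add(user)
        first_seen[user] = min(first_seen.get(user, idx), idx)

    cohorts = defaultdict(set)  # month index -> users first seen in that month
    for user, idx in first_seen.items():
        cohorts[idx].add(user)

    retention = {}
    for idx, users in cohorts.items():
        retained = users & active.get(idx + horizon, set())
        retention[(idx // 12, idx % 12 + 1)] = 100.0 * len(retained) / len(users)
    return retention


# Made-up activity for illustration only.
sample_activity = [
    ("a", 2018, 10), ("b", 2018, 10),  # October 2018 cohort
    ("a", 2019, 4),                    # only "a" is active six months later
    ("c", 2018, 11), ("c", 2019, 5),   # November 2018 cohort, fully retained
]
retention = six_month_retention(sample_activity)
```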
Chaining it all together
As a data engineer, it’s been fascinating to see how a transformation in data storage, from blockchain to relational, has enabled so much insight for our customers. For all the merits of blockchain as a distributed datastore, capturing high-level analytics isn’t one of them!
Just last night we were discussing some of these metrics with a data scientist who looked at us and said “What are you talking about? There are no […] on the blockchain!” We hope you’re a little surprised by these insights. We think it is possible to explain how people are using blockchain-based projects in a way that helps projects refine their missions and ultimately leads to a healthier ecosystem.
We have many more data transformations ahead of us too. Each time we experiment with a new format, we find new questions we can answer.
If you’re interested in answering these questions and exploring how to provide more insight into what’s being built in the space, feel free to reach out.
Find me on Twitter or email data@flipsidecrypto.com.
Published at Fri, 26 Apr 2019 15:10:40 +0000