In Ethereum’s protocol development roadmap released by Vitalik, one interesting module states that clients will not need to store data older than one year. The data pruning module is planned for launch after the network transitions to Proof of Stake (PoS), implements sharding, and moves towards statelessness.
This module:
- eases the hardware requirements for nodes,
- allows nodes to remove code that deals exclusively with legacy transactions, and
- reduces bandwidth on the network – nodes need to sync less data.
EIP-4444 proposes these changes and sets the rationale and the framework for the data pruning module. For those wondering what an “EIP” is, please check out the first half of our blog on EIP-1559 here.
Current State of Ethereum
As a general rule, letting go of the past is hard. We hold onto our memories for a long time. If someone turned up one day and told you that they would delete all your memories except those of the past year, you would probably be angry. Why, then, would Ethereum take such a step? Because memory has become a huge burden that drags down efficiency.
EIP-4444 is like the treadmill that Ethereum will use to shed the extra weight it has accumulated over the past few years. Before seeing how exactly it does that, let’s see how Ethereum currently deals with historical data, and where this burden falls.
Nodes – they bear the burden
In Ethereum, a “Node” refers to a running piece of client software. In simple terms, any computer that has a copy of the Ethereum blockchain, and is part of the network, can be considered a node. A node has to receive blocks, validate them, and relay them to its peers. This requires resources, namely:
- Network Bandwidth
The aim of EIP-4444 is to reduce the burden on the nodes.
Historical Data – is the burden
Historical data refers to data such as the storage slot data in block N, the block header of block N, transaction receipts, etc. Ethereum full nodes and archive nodes are the two types of nodes that store data. Historical data is a major part of what they store. Full nodes and archive nodes store every single block, right from genesis. The hardware requirements for the nodes are as follows:
| Parameter | Full Node | Archive Node |
|---|---|---|
| CPU | 4+ cores | 4+ cores |
| Storage | 500 GB+ SSD | 6 TB+ SSD |
Ethereum full nodes stored 624.17 GB of data as of 31 March 2022. That is equivalent to storing more than 250 HD movies on a hard disk. A 1 TB hard drive would be appropriate for a local full node setup and is priced around $70. If the node is to be set up on AWS, an i3.xlarge instance can be used for syncing at an hourly rate of $0.312. Given that the setup will take around 18 hours, the total cost will be around $5.60. This is the fixed upfront cost. The running cost of the node, using EBS disk space ($105/month) and a cheaper EC2 instance such as t4g.medium ($25/month), will be around $130/month. Hence, on AWS, the upfront cost will be around $5.60, and the running cost of the node will be around $130/month.
Historical data is a major chunk of this pricing estimation. The historical headers, bodies, and receipt lists amount to about 300 GB, which is almost half of the entire data stored. If this burden were removed from the nodes, storage requirements would be cut roughly in half, and network participants could store the data of two nodes for about the price of one.
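The arithmetic behind these estimates is easy to reproduce. The short Python sketch below only restates the figures quoted above (the prices and sizes are this article’s estimates, not live AWS or chain data), and the function names are made up for illustration.

```python
# Figures quoted in this article (as of 31 March 2022); not live pricing.
SYNC_HOURLY_RATE_USD = 0.312   # i3.xlarge hourly rate used for the initial sync
SYNC_HOURS = 18                # approximate duration of the initial sync
EBS_MONTHLY_USD = 105          # EBS disk space per month
EC2_MONTHLY_USD = 25           # t4g.medium per month
FULL_NODE_DATA_GB = 624.17     # total data stored by a full node
HISTORICAL_DATA_GB = 300       # headers, bodies, and receipts (approximate)


def upfront_sync_cost():
    """One-off cost of renting the sync instance for the whole sync."""
    return SYNC_HOURLY_RATE_USD * SYNC_HOURS


def monthly_running_cost():
    """Recurring cost: disk space plus the cheaper always-on instance."""
    return EBS_MONTHLY_USD + EC2_MONTHLY_USD


def historical_share():
    """Fraction of a full node's data that is historical data."""
    return HISTORICAL_DATA_GB / FULL_NODE_DATA_GB


if __name__ == "__main__":
    print(f"Upfront sync cost: ${upfront_sync_cost():.2f}")
    print(f"Monthly running cost: ${monthly_running_cost()}")
    print(f"Historical share of stored data: {historical_share():.1%}")
```

Note that only the disk portion of the bill shrinks when historical data is dropped; the instance cost stays the same, so the overall saving is smaller than the storage saving.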
Long story short, storing historical data takes disk space and time, which makes operating a node expensive. Expensive nodes are bad because:
- only resource-rich users will be able to run nodes,
- which means there will be fewer nodes, increasing the risk of centralization and weakening security.
Why do something expensive when you can do it for less? This is where EIP-4444 comes into play.
EIP-4444 – reduces the burden
EIP-4444 proposes that nodes prune data that is over one year old. It specifies that nodes should not serve headers, block bodies, and receipts older than one year on the p2p network. Hence, nodes will no longer be responsible for serving historical data older than one year.
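As a rough illustration of the rule, the sketch below filters blocks by age before serving them. It is a minimal sketch, not client code: the `BlockRecord` type and both function names are hypothetical, and "one year" is approximated as 365 days of wall-clock time, which may differ from the units the EIP itself uses.

```python
from dataclasses import dataclass

# Assumption: "one year" is approximated as 365 days in seconds.
PRUNE_WINDOW_SECONDS = 365 * 24 * 60 * 60


@dataclass(frozen=True)
class BlockRecord:
    """Hypothetical stand-in for a stored header, body, and receipts."""
    number: int
    timestamp: int  # Unix time in seconds


def is_prunable(block, now):
    """True when the block is strictly older than the pruning window."""
    return now - block.timestamp > PRUNE_WINDOW_SECONDS


def servable_blocks(blocks, now):
    """Blocks a node would still serve to peers under the one-year rule."""
    return [b for b in blocks if not is_prunable(b, now)]
```

A node applying this check would answer p2p requests only from the list returned by `servable_blocks`, and would be free to delete everything else from disk.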
Impact of EIP-4444
The hardware requirement reduction improves the network’s decentralization in the short term. It wouldn’t be astronomically expensive to create and run a node anymore. So even consumer-grade equipment could run a full node. Similarly, faster nodes and a lightweight sync process will reduce strain on the network, making the protocol better at processing transactions at the tip of the chain.
This would allow the protocol to focus on consensus and agree on the global state faster. The separation of concerns between consensus and archival is a step in the right direction, leaving only the essentials within.
But how can we delete all that data? Isn’t it important? Yes. It is. In fact, if it weren’t for the technologies that exist right now, it would be a bad idea just to prune the data. To understand why this can be done, let’s go into where and why historical data is used and then see how each use case will still be addressed in the “post-pruning” world.
The Importance of Historical Data & How Pruning Will Affect It
Historical data has three main categories of usage:
- Syncing is the process of downloading the blockchain from one or more connected peers that hold the historical data (full nodes) to one’s own node. The node needs to request historical data to build its view of the chain.
  - PoS changes the above. It is vulnerable to long-range attacks, which is why it relies on weak subjectivity (WS). When a new node comes online, it will be given the genesis block and presented with all of the currently published branches of the blockchain. But it won’t be able to tell which branch is the main chain. In PoW, the longest chain would be the right chain because creating a malicious longer chain would require tremendous computational power. In PoS, the longest chain can’t be assumed to be the right chain because computational power is insignificant; the validators don’t have to solve any complex problem and just have to publish transactions as blocks.
  - Given this, long-range attacks, in which malicious chains are created to dupe nodes, become possible.
  - To mitigate this, WS checkpoints are used, which means that only the latest X blocks of the chain can be reorganized.
  - A WS checkpoint allows nodes to skip the step of requesting historical data over the p2p network, since blocks before the checkpoint are treated as valid.
  - But even here, historical data for the blocks after the checkpoint will be required to build the view of the blockchain from there.
- User Requests over JSON-RPC
  - Currently, the Ethereum JSON-RPC responds with an empty response when a requested piece of data does not exist. When a user requests a transaction receipt, historical data lets them know whether the receipt exists or not.
  - Once old data starts getting pruned, the node won’t be able to tell whether the receipt was pruned or never existed at all.
- Decentralized Applications (dApps)
  - The main consumers of historical data are dApp developers. Many dApps populate their databases with historical information to serve their users via frontends. For these dApps, it is vital to be able to iterate through all transactions and logs, or to make multiple JSON-RPC calls to get the data directly from a full node as a service.
  - Many dApps require access to historical data from Ethereum to show past user behavior, such as account balances, transactions, and votes from the distant past.
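The JSON-RPC ambiguity described in the list above can be shown with a toy model. In the sketch below, `ToyNode`, `Lookup`, and `classify` are hypothetical names and not part of any Ethereum client; the only behavior taken from the real API is that a receipt lookup for an unknown hash comes back empty. The `archive_hashes` set stands in for some external source of historical data, which is an assumption of this example.

```python
from enum import Enum


class Lookup(Enum):
    FOUND = "found"
    PRUNED = "pruned"
    NEVER_EXISTED = "never_existed"


class ToyNode:
    """Hypothetical node that keeps receipts only for retained blocks."""

    def __init__(self, receipts, oldest_retained_block):
        # receipts maps tx hash -> (block number, receipt payload).
        # Entries below the retention boundary are dropped, i.e. pruned.
        self.receipts = {
            tx_hash: entry
            for tx_hash, entry in receipts.items()
            if entry[0] >= oldest_retained_block
        }

    def get_receipt(self, tx_hash):
        """Returns the receipt payload, or None (an empty response)."""
        entry = self.receipts.get(tx_hash)
        return entry[1] if entry is not None else None


def classify(node, tx_hash, archive_hashes):
    """Resolves the ambiguity using an external set of known tx hashes."""
    if node.get_receipt(tx_hash) is not None:
        return Lookup.FOUND
    if tx_hash in archive_hashes:
        return Lookup.PRUNED
    return Lookup.NEVER_EXISTED


if __name__ == "__main__":
    node = ToyNode({"0xaaa": (100, "old"), "0xbbb": (900, "recent")}, 500)
    # Both calls return None: the node alone cannot tell the cases apart.
    print(node.get_receipt("0xaaa"), node.get_receipt("0xccc"))
```

On its own, the node gives the same empty answer for a pruned receipt and for one that never existed; only the extra lookup against an outside data source separates the two.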
All right, so we can see that historical data has important uses. One cannot simply do away with it and expect everything to run fine. We need access to this data somehow.
But the solution to this problem isn’t far off. The Web 3.0 space is developing rapidly, and there are projects actively working to solve this problem. Covalent, one of our portfolio companies, is making rapid strides in the relatively unexplored data layer of the blockchain.
Enter Covalent: The Data Indexer
To access historical data, users will need to do one of the following: query from storage services (like Arweave, Filecoin, Internet Archive) that will store the data, or use indexing solutions. While using storage services to fetch data is highly compute-intensive, fetching data from indexing solutions that store and index full node data is cheaper and faster. Indexing solutions are faster because they create indexes for data, and hence the lookups and aggregations are faster. Data availability and indexing are basic yet crucial factors for any blockchain and its users/developers.
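To see why an index makes lookups cheaper, compare a full scan with a prebuilt index over the same records. This is a generic illustration in Python, not a description of how any particular indexing service is built; the transaction dictionaries and function names are made up for the example.

```python
from collections import defaultdict


def scan_for_sender(transactions, sender):
    """Without an index: every query walks the whole transaction list."""
    return [tx for tx in transactions if tx["from"] == sender]


def build_sender_index(transactions):
    """One pass over the data builds a sender -> transactions mapping."""
    index = defaultdict(list)
    for tx in transactions:
        index[tx["from"]].append(tx)
    return index


def lookup_sender(index, sender):
    """With an index: a single dictionary lookup per query."""
    return index.get(sender, [])


if __name__ == "__main__":
    txs = [
        {"hash": "0x01", "from": "alice", "value": 5},
        {"hash": "0x02", "from": "bob", "value": 7},
        {"hash": "0x03", "from": "alice", "value": 1},
    ]
    index = build_sender_index(txs)
    print(lookup_sender(index, "alice") == scan_for_sender(txs, "alice"))
```

The index costs one pass over the data and extra storage up front, after which each lookup no longer depends on the total number of transactions.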
One of the risks mentioned in EIP-4444 is that centralized services could choose not to provide the data, since only they can afford to serve it and/or own it. Serving the data in a decentralized manner through the Covalent Network, with the community’s best interests in mind, addresses this risk.
Covalent is a decentralized data indexing and querying protocol for blockchain data, which uses its unified API (application programming interface) to service any data request.
To know more about Covalent, check our Investment thesis document on Covalent here.
How does Covalent solve the historical data storage problem of EIP-4444?
Covalent is the only project to index the Ethereum Mainnet and the Kovan Testnet blockchains in their entirety – this means every single contract, every single wallet address, and every single transaction. That is billions of rows and terabytes of data made available through an API for any use case. Having the entire Ethereum Mainnet and Kovan Testnet indexed also means access to all the DeFi data from top protocols such as Uniswap, Aave, Balancer, and Compound, to name a few.
Hence, anyone can access all the historical data as per their needs.
How does Covalent solve the JSON-RPC problem?
We saw that the Ethereum JSON-RPC would no longer return reliable results on querying once historical data is pruned. Covalent completely changes the game by presenting a proof-based method that overcomes the cumbersome issues of querying the JSON-RPC layer. And the solution is Covalent’s Block Specimen. It is a cryptographically secure representation of a block and its constituent elements. Together, these specimens form a canonical representation of a blockchain’s full historical state. It provides a unified data object schema for blockchain data representation in a multi-chain world. The block specimens will feed the Covalent network with data that is required to answer user queries.
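The general idea of a tamper-evident block representation can be sketched as follows. This is emphatically not Covalent’s actual Block Specimen schema or proof system, which this article does not specify; the field names, the canonical-JSON encoding, and the choice of SHA-256 are all assumptions made purely for illustration.

```python
import hashlib
import json


def _canonical_bytes(payload):
    """Deterministic encoding so the same content always hashes the same."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def make_specimen(header, transactions, receipts):
    """Bundles a block's parts and attaches a digest over the bundle."""
    payload = {
        "header": header,
        "transactions": transactions,
        "receipts": receipts,
    }
    digest = hashlib.sha256(_canonical_bytes(payload)).hexdigest()
    return {"payload": payload, "digest": digest}


def verify_specimen(specimen):
    """Recomputes the digest; any change to the payload makes this fail."""
    expected = hashlib.sha256(_canonical_bytes(specimen["payload"])).hexdigest()
    return expected == specimen["digest"]
```

Anyone holding a specimen can recompute the digest and detect modification, which is the property that lets independent operators check each other’s data instead of trusting a single provider.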
There are censorship risks if only one organization is assigned to handle the data. There are availability risks too if the organization is unreliable.
Hence, it is important that Ethereum’s historical data is preserved and seeded by independent organizations that check data availability frequently. This can be made possible by creating an incentivization layer for these organizations.
Covalent is mitigating those risks by making things less centralized and by implementing incentive structures for community ownership. Covalent is also providing economic incentives within the Covalent network for independent operators to validate and create block specimens, which can be sharded by a single block or a range of blocks depending on the time frames and granularity of data requested or required by clients.
The EIP-4444 update is likely to take some time to actually go live. It is still young, and there are plenty of discussions to be had. Prototyping, gathering feedback, and making modifications are all part of the process. Moreover, with the Merge and the transition to Proof of Stake, the EIP will also have to deal with weak subjectivity. Nevertheless, the reduction of the data load on nodes, which EIP-4444 enables, is yet another step in Ethereum’s effort to become more decentralized and to provide the masses with transaction write access, as opposed to transaction read access. It is thus a welcome and necessary change, but one that doesn’t completely resolve the concerns or newer issues mentioned above.
Credits – Amit Hedge, Technology Team, Woodstock Fund
Woodstock Fund is an investor in Covalent. Every financial product, asset class, or investment has risk. A cryptocurrency (also known as digital tokens, digital coins, or crypto(s)) is no different. That is why it is important for users to be aware of the potential risks present in cryptocurrency and blockchain projects. You should not invest funds in the cryptocurrency market that you are not prepared to completely lose; i.e., only allocate risk capital to digital tokens. Furthermore, we will not accept liability for any loss or damage that may arise directly or indirectly from any such investments.