DOVU - Scaling Infrastructure: How we did it

30 Mar 2022

DOVU CTO Matt Smithies writes about our recent public performance test and shares insights for others looking to run similar experiments.

In Q1 2022 we ran a public performance test on the Hedera Hashgraph testnet. We wanted to prove that we are able to scale in line with our vision.

What does vision mean to us?

By vision, we mean our ambitions. Not just surrounding our platform and staking work, but also DOVU’s ability to generate carbon credits. In the future, we want to start utilising more third-party data and potentially Internet of Things (IoT) devices.

At a glance these are the stats:

Over 30 million transactions in 24 hours
Averaging 1.25 million transactions an hour, the maximum was around 1.35 million transactions an hour
This averaged out at 347 transactions per second

It goes without saying that this kind of scale isn’t possible on the majority of current web3 technologies. We looked at the activity on various other chains, like Cardano, which ran 20 million transactions in 4 years. This inspired us to run this test.

This explainer will help you understand what it takes to reach these standards.

As a note, we acknowledge that other entities out there have pushed more throughput through the network. However, we feel that every single one of our transactions generated value, rather than providing limited long-term value by simply using the consensus service to send arbitrary messages to the network.

In this test, each transaction was real data which had value connected to it. Throughout, off-chain storage manipulated this value every step of the way.

The tl;dr

Well, we achieved our goals. In fact, we surpassed them by approximately 20% across the board. Below are some images demonstrating the TPS we hit, resulting in over 30 million transactions in 24 hours.

In terms of hourly transactions, we managed to max out just under the 1.35 million mark.

We averaged around 347 TPS, with occasional spikes higher still.

However, it wasn’t an entirely smooth process. We learned a lot. And we urge you to read on so that you too can learn more about our experiments for the network.

DOVU’s TPS history on Hedera

Back in 2020 we first utilised the Trust Enterprises API (a tool I developed) to create a note using our proof of carbon mechanism. This was the first foray into building on Hedera and employing the consensus service to record every single transaction that happened within DOVU.

While this was an achievement and it was easy to create, our initial aim was to record every transaction through our legacy system, almost acting like a side chain.

Throughout this period we were able to push approximately 40-50 TPS on the testnet.

Understanding our current landscape

As the marketplace and staking platform evolved, we needed to increase our capability and bandwidth to drive future transaction volume.

The demands of ESG initiatives at a global scale are getting tougher. Given both the need for auditability and the demand from enterprises for verifiable carbon, we need to be able to scale quickly and with less downtime.

Furthermore, as one of the pioneering staking platforms on native Hedera, which is completely non-custodial and lower risk, we need to expect an influx of at least 20,000 users in the medium term.

The plan of action

In preparation, we set a couple of success metrics that we wanted to capture throughout the 24-hour testing period:

Averaging 1 million transactions an hour
Reaching 20-25 million transactions in 24 hours
Having bursts to hit 250-300 TPS
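
To compare these targets like for like, it helps to convert each one into an average per-second rate. The snippet below is only a quick conversion of the figures listed above; the function and variable names are illustrative.

```python
SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR


def per_second(transactions: int, window_seconds: int) -> float:
    """Average transactions per second needed to hit a total within a window."""
    return transactions / window_seconds


# Targets taken from the list above.
hourly_target_tps = per_second(1_000_000, SECONDS_PER_HOUR)
daily_target_tps_low = per_second(20_000_000, SECONDS_PER_DAY)
daily_target_tps_high = per_second(25_000_000, SECONDS_PER_DAY)

print(f"1 million an hour      -> about {hourly_target_tps:.0f} TPS")
print(f"20 million in 24 hours -> about {daily_target_tps_low:.0f} TPS")
print(f"25 million in 24 hours -> about {daily_target_tps_high:.0f} TPS")
```

In other words, the hourly and daily targets both require a sustained rate in roughly the same range as the burst target.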

But sending transactions to the Hedera network wasn’t the most important aspect of this. We wanted to link every single transaction to mock data and process that in our primary system.

To facilitate this we decided to model the test on our current production infrastructure, but taken to a level that allows for increased scale.

Understanding our infrastructure

We use Google Cloud Services (GCS) for our hosted database, as well as smaller features of our ecosystem like serverless functions.

For our main backend application, written in Laravel, and our Hedera API servers, we utilise Digital Ocean.

Laravel Forge manages the infrastructure for our Laravel application.

Our Hedera REST API, Trust Enterprises, uses the App Platform. We could save money on this side of things, but we try to opt for simplicity and ease of use.

Driving Hedera transactions through an API, a note

Since working on Hedera, there has always been one item that has differentiated it from other platforms. This is the accessibility for any developer to build upon the network. The Consensus and Token Services are easy to work with, even if you have limited knowledge of JavaScript or other languages.

I developed Trust Enterprises with the view that any developer should be able to access these services through a REST API with a client node that is simple to deploy, meaning that regardless of any specific language or infrastructure it would be easy to start working with the system.

However, due to the nature of using an API and HTTP request timings, there will always be a bottleneck of throughput.

To combat this, in our core application we take advantage of the job queueing system that is present in Laravel. Every time we send a transaction to Hedera we utilise jobs as a means of parallelising our throughput. Because of this, we can have sudden bursts of high TPS.

However, the problem with this is that you need to make sure that the API that handles connections to the Hedera network is capable of dealing with large amounts of traffic, especially given that particular nodes frequently throttle connections in testnet conditions.
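
The idea of fanning requests out across queue workers can be sketched in a few lines. This is not our Laravel code: it is a minimal Python analogy using a thread pool, and `send_transaction` is a hypothetical stand-in for one HTTP call to the Hedera-facing REST API (the sleep simply simulates request latency).

```python
import time
from concurrent.futures import ThreadPoolExecutor


def send_transaction(payload: dict) -> dict:
    """Hypothetical stand-in for one HTTP request to the Hedera-facing API."""
    time.sleep(0.01)  # simulated request latency, not a real network call
    return {"id": payload["id"], "status": "sent"}


def dispatch(payloads: list, workers: int) -> list:
    """Run the jobs across a pool of workers instead of one after another."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the order of the input payloads.
        return list(pool.map(send_transaction, payloads))


jobs = [{"id": i} for i in range(100)]
start = time.perf_counter()
results = dispatch(jobs, workers=20)
elapsed = time.perf_counter() - start
print(f"{len(results)} jobs sent in {elapsed:.2f}s using 20 workers")
```

Each request still takes just as long; the gain comes from having many in flight at once, which is also why the API on the receiving end has to cope with the resulting bursts.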

What causes throttling on testnet?

This throttling on testnet has been a particular ecosystem-wide issue. In many cases, when a Hedera node is having issues processing transactions, it can trigger a cascade effect of the underlying SDK becoming “stuck”. At this point, there’s seemingly no obvious way to pivot to another node to send transactions to.

With this in mind, there are a couple of issues you have to be aware of:

In testnet conditions, you might hit throttling issues if you try to push more than 30-40 TPS consistently.
In some cases, you have to reset your infrastructure before transactions are processed once again.

We quickly learned that having a single large server to process all transactions from our core API to Hedera was prohibitive. It triggered these bottlenecks more frequently than we would have liked.

Thus, in order to drive transaction volume linked to meaningful data, the approach is to horizontally scale systems, rather than vertically.

Vertical scaling in this case is a way to increase throughput by increasing the power of your server directly.

Horizontal scaling, on the other hand, relies on redundancy: having multiple lower-powered nodes that share the load.
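
To make the horizontal approach concrete, here is a minimal sketch of spreading work across several small API containers while skipping any that have become stuck. It is an illustration under assumptions, not how our platform routes traffic: the `ContainerPool` class and the container names are hypothetical, and marking a container as stuck is done by hand here.

```python
from itertools import cycle


class ContainerPool:
    """Round-robin over API containers, skipping ones flagged as stuck."""

    def __init__(self, names):
        self.names = list(names)
        self.stuck = set()
        self._ring = cycle(self.names)

    def mark_stuck(self, name: str) -> None:
        self.stuck.add(name)

    def reset(self, name: str) -> None:
        """Equivalent to restarting a container so it can take traffic again."""
        self.stuck.discard(name)

    def next_container(self) -> str:
        # Try each container at most once per call before giving up.
        for _ in range(len(self.names)):
            name = next(self._ring)
            if name not in self.stuck:
                return name
        raise RuntimeError("every container is stuck; a reset is needed")


pool = ContainerPool(["api-1", "api-2", "api-3", "api-4"])
pool.mark_stuck("api-2")
assigned = [pool.next_container() for _ in range(6)]
print(assigned)
```

With one large server, a single stuck connection stalls everything. With many small nodes, the same failure only removes a fraction of the capacity.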

Breaking down the costs

For both primary infrastructure elements, we chose to duplicate resources that were already present in our production system. For this round we named our infrastructure with the stardestroyer prefix, to unleash the beast.

Comprising the two elements, below is the breakdown of the costs, specifications, and additional details.

We took the view that spending approximately $1,200 a month on infrastructure was reasonable in the medium term, so we provisioned infrastructure to match. However, we only wanted to run the test for approximately five days, so the actual cost would be around the $200 mark.
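
The $200 figure is simply the monthly budget pro-rated over the test window. A quick check, assuming a 30-day billing month (the exact billing granularity is an assumption here):

```python
MONTHLY_BUDGET_USD = 1_200
DAYS_IN_BILLING_MONTH = 30  # assumption: a flat 30-day month
TEST_DAYS = 5

# Pro-rate the monthly budget over the days the test infrastructure ran.
test_cost_usd = MONTHLY_BUDGET_USD * TEST_DAYS / DAYS_IN_BILLING_MONTH
print(f"Estimated cost of a {TEST_DAYS}-day test: ${test_cost_usd:.0f}")
```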

Core Laravel App

It was vital that we used dedicated CPUs with a high level of memory. This was due to the nature of utilising Laravel’s Job queue mechanism with a number of different workers.

We decided on the specification for approximately $640 a month, excluding tax.

In this case, we just had a single node that our main application ran on to make it simple. We could further extend this by having a load balancer with multiple backend nodes and separating out databases for distributing different read and write functions. This, I believe, would be overkill for this test. If we needed that infrastructure demand, it would be within the view of the long-term – after two or three years.

Trust Enterprises Hedera API

We uncovered a lot about Hedera through provisioning our API over the course of the week. Initially, we decided to use a single node to keep things simple. We opted to use Digital Ocean’s App Platform as it enabled us to connect a GitHub repository to a system that would detect a given language and then deploy through Docker in an automated process.

We opted for an initial container of the largest spec the platform had to offer. Why? Because on our staging infrastructure we utilise a single container to keep costs down. For this round, we decided to increase the node size… at least to begin with.

Obviously, we decided this when we planned to use a single container for a vertically scaled system. When we pivoted to a horizontally scaled approach, we switched to having lots of small containers to deal with a much lower volume of transactions per node.

From a cost perspective, nothing changed. But we ended up building our infrastructure up to 25 containers, to spread the transaction load between them all.

Understanding DOVU’s Staking platform

There are a number of things that happened throughout this test. At first, we decided to slowly build up the users in our staging environment. This ensured that we could handle the expected throughput we were aiming for. As part of our APIs and tooling, we have tools to generate users and to send tokens. For a test environment you have to complete a few steps:

Generate a Hedera account
Associate the account with a given token ID, which reflects a mock token for staking (tDOV)
Send a small amount of that token to the account

DOVU’s staking platform relies on a key principle of Hedera, that is:

Tokens must be associated to an account; if an account holds a token, it must have been associated.

From here we can infer that, if a user has a particular balance, we can take that as gospel from the network and calculate the share of a given reward for a particular hour.
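
As a rough illustration of that calculation, the sketch below splits an hourly reward across accounts in proportion to their balances. The proportional split, the rounding-down rule, the account IDs and the amounts are all assumptions made for the example; the six decimal places match the precision mentioned later in this post.

```python
from decimal import Decimal, ROUND_DOWN

SIX_PLACES = Decimal("0.000001")


def hourly_rewards(balances: dict, reward_pool: Decimal) -> dict:
    """Split reward_pool across accounts in proportion to their balances."""
    total = sum(balances.values(), Decimal(0))
    if total == 0:
        return {account: Decimal(0) for account in balances}
    return {
        # Rounding down means the payouts can never exceed the pool.
        account: (reward_pool * balance / total).quantize(
            SIX_PLACES, rounding=ROUND_DOWN
        )
        for account, balance in balances.items()
    }


# Hypothetical account IDs and balances, as read from the network.
balances = {
    "0.0.1001": Decimal("150"),
    "0.0.1002": Decimal("50"),
    "0.0.1003": Decimal("100"),
}
rewards = hourly_rewards(balances, Decimal("10"))
for account, amount in rewards.items():
    print(account, amount)
```

Using `Decimal` rather than floats keeps the amounts exact at six decimal places, which matters when the same calculation runs for thousands of wallets every hour.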

As part of the initial test, we generated over 20,000 accounts. Our goal was to send a tDOV to all users every 1-2 minutes, so we could reach our target TPS.
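
The link between the account count and the target TPS is simple arithmetic. Using a round 20,000 accounts and one transfer per account per interval:

```python
ACCOUNTS = 20_000  # round figure; we generated slightly more than this


def transfers_per_second(accounts: int, interval_seconds: int) -> float:
    """Average TPS when every account receives one transfer per interval."""
    return accounts / interval_seconds


tps_every_minute = transfers_per_second(ACCOUNTS, 60)
tps_every_two_minutes = transfers_per_second(ACCOUNTS, 120)
print(f"Every minute:      about {tps_every_minute:.0f} TPS")
print(f"Every two minutes: about {tps_every_two_minutes:.0f} TPS")
```

So the shorter the interval between payouts, the closer the system gets to the upper end of the range.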

Ramping up

Once we drilled in our infrastructure and we were gaining reasonable consistency for sending the tokens every two minutes, we kept an eye on it.

In order to drive these transactions we used Laravel Horizon, the core system for managing our Redis Queues. It provided us with a helpful panel and gave us all the necessary details. This way we could keep on top of any failing jobs, and adjust our infrastructure accordingly.

At our peak, we were pushing through just under 5000 jobs a minute. This comprised a maximum of 500 processes. You could effectively consider these threads for concurrency.

DOVU Carbon Marketplace on Laravel Horizon

Between our queue dashboard, Dragonglass, and HederaTxns we had a good view of the analytics needed to diagnose any issues during the stress test.

Interestingly enough, the number of jobs in a queue became a good indicator of the current throughput we could push through Hedera. When the network was failing to process around 200-250 TPS, the number of jobs waiting to be processed would increase on the “Current Workload” panel.
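
That signal is easy to automate. The sketch below flags congestion when the pending-job count rises across several consecutive readings. The sample values and the threshold of three intervals are made up for illustration, and how you read the queue depth (from Horizon, Redis or elsewhere) is left out.

```python
def backlog_is_growing(depth_samples: list, min_consecutive: int = 3) -> bool:
    """True if pending jobs rose across each of the last min_consecutive intervals."""
    if len(depth_samples) < min_consecutive + 1:
        return False  # not enough readings to judge a trend
    recent = depth_samples[-(min_consecutive + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


# Hypothetical readings of pending jobs, taken at a fixed interval.
healthy = [120, 90, 110, 100, 95]
congested = [100, 140, 260, 410, 700]

healthy_flag = backlog_is_growing(healthy)
congested_flag = backlog_is_growing(congested)
print("healthy queue growing?  ", healthy_flag)
print("congested queue growing?", congested_flag)
```

A queue that wobbles up and down is keeping pace; one that climbs on every reading means jobs are arriving faster than the network is accepting them.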

Part of our core philosophy at DOVU is to have redundancy at the heart of our systems. This means, as part of our admin panel, if tasks fail we have fine-grained control to retry a given task. Task failures include tokens failing to be transferred, or other jobs failing to complete. In the images below you can see an example of “Retry Token Transfer”.

Hedera testnet throttling

Within the ecosystem there have been complaints about how particular SDKs handle the selection of nodes when sending messages to them. Generally, when you try to push too many transactions to a particular node too quickly in a given SDK (we use the JavaScript one), it can be a challenge to recover from this, to bypass the node and to reattempt the transaction on another node.

It’s almost like a given container is unable to retroactively step back and choose a different node to send a transaction to.

I’ve been aware of this issue for some time, and it happened once during the test. It’s really hard to illustrate but here’s an example of the issue:

This is a screenshot of our Hedera API continually processing transactions over a six-hour period. At approximately 11:45 we have a dip in transaction throughput. After that point, only a fraction of the containers are able to process current transactions to the system. It stands to reason that a number of containers did get stuck; we needed to reset the app to get them to start running again.

It is worth noting that we were public about this stress test. During the day, there was another mysterious actor that was pushing 5,000-8,000 transactions through the system periodically.

Enter: the enigmatic step function

It seems that the more throughput there is going through Hedera, the higher the chance of throttling on testnet. We need to diagnose whether this is an issue with DOVU’s system or with the network itself, and then remedy it. But even in periods of congestion, we still managed to maintain 100 TPS.

Every single one of our transactions represented some state within our system; primarily staking payments at scale.

Naturally, you can see by the reward that the value differs for every single wallet. This means we can calculate the exact amount of value (to 6 decimal places) to send to a given user.

Outcome of the test

It’s unlikely that we’ll hit 250+ TPS in the near future. But this performance test was absolutely vital for our learnings, including:

What we need to do to expand
If there are easy optimisations at play
If additional infrastructure is required to be provisioned

There are still a few congestion issues surrounding consistently driving anything above 200 TPS without having to reset an app. At this point, it’s unknown whether this is a testnet issue, or if it would affect mainnet as well.

We were impressed with the amount of throughput we pushed to the network during the test – hitting over 20% more than our estimations. You can slowly scale up and meet your own internal system’s demands, provided you have enough infrastructure and budget.

One process challenge is that we currently use an API to send transactions to Hedera. If we used a native gRPC library, such as the JS SDK in a Node.js app, this would remove a bottleneck for maximising the use of the network. We use Laravel as it provides so much out of the box for us as the foundation to develop against, and we can bolt on additional microservices as required.

What can you do to drive this TPS yourself?

Our entire Hedera API is based on the Trust Enterprises API, which is an open source project. It can be forked and configured for any Hedera SDK method for your individual needs. It’s fully tested and comes with CI/CD tooling by default.

I implore you to follow this guide and to take a look at Digital Ocean. I particularly recommend looking at the App Platform and Droplets.

It’s possible to mirror what we have done starting from less than $80 a month. As a starting point, this could be your initial specification for your Hedera-powered infrastructure.

A droplet that costs $20-$40 a month for the core production app
Combined with the App Platform with 2 containers for $12 each, making $24 in total

From here you can continually test the demands of your system. But you should be able to consistently push 40-50 TPS to the Hedera testnet from this starting point.

If you’ve found this useful – and especially if you’ve built something using some of these insights – get in touch. I’d love to hear how you get on.

Join our Discord and find me there!

Be the first in the know: https://discord.gg/wW7rg6dDN4
Browse: https://dovu.earth/en/
Connect: https://twitter.com/dovuofficial
Read: https://medium.com/dovu-earth