A Byzantine Fault-Tolerant Consensus Library .pdf
Document Details
Uploaded by Deleted User
Full Transcript
A Byzantine Fault-Tolerant Consensus Library for Hyperledger Fabric Artem Barger Yacov Manevich Hagar Meir Yoav Tock...
A Byzantine Fault-Tolerant Consensus Library for Hyperledger Fabric Artem Barger Yacov Manevich Hagar Meir Yoav Tock IBM Research IBM Research IBM Research IBM Research Haifa, Israel Haifa, Israel Haifa, Israel Haifa, Israel [email protected] [email protected] [email protected] [email protected] Abstract— The latest attempt to provide Fabric with a BFT ordering arXiv:2107.06922v1 [cs.DC] 14 Jul 2021 Hyperledger Fabric is an enterprise grade permissioned dis- service, by Sousa et al. in 2018 , adapted the Kafka-based tributed ledger platform that offers modularity for a broad set ordering service of Fabric v1.1 and replaced Kafka with a of industry use cases. One modular component is a pluggable ordering service that establishes consensus on the order of cluster of BFT-SM A RT servers. That attempt was not adopted transactions and batches them into blocks. However, as of the by the community because of various reasons (elaborated time of this writing, there is no production grade Byzantine Fault- in Section VII). The main problem was that it was built on Tolerant (BFT) ordering service for Fabric, with the latest version the Kafka architecture and used an external “general-purpose” (v2.3) supporting only Crash Fault-Tolerance (CFT). monolith BFT cluster. The implications are that it did not In this work we describe the design and implementation of a BFT ordering service for Fabric, employing a new BFT consensus address some of the more difficult, configuration related work library. The new library, based on the BFT-SM A RT protocol flows of Fabric, and that the BFT cluster followers were not and written in Go, is tailored to the blockchain use-case, yet able to validate the transactions proposed by the leader during is general enough to cater to a wide variety of other uses. The the consensus phase. In addition, it missed out opportunities BFT library’s design and integration into Fabric address crucial to perform several much needed optimizations that exploit aspects that were left unsolved in all prior work, making them unfit for production use. We evaluate the new BFT ordering the unique blockchain use-case. Moreover, a two process service by comparing it with the currently supported Raft-based two language solution increases the complexity and cost of CFT ordering service in Hyperledger Fabric. deploying, operating and maintaining the service. Index Terms—Blockchain, Distributed Ledger In the time that passed since then, Fabric incorporated the Raft protocol as the core of the ordering service, signifi- I. I NTRODUCTION cantly changing the Ordering Service Node (OSN) in the Hyperledger Fabric (or just Fabric) is an open source project process. Our goal was to implement a BFT library in the dedicated to the development of an enterprise grade permis- Go programming language, that would be fit to use as an sioned blockchain platform. Fabric employs the execute- upgrade for Raft. Our prime candidates were BFT-SM A RT order-validate paradigm for distributed execution of smart and PBFT. We soon realized that simply re-writing contracts. In Fabric, transactions are first tentatively executed, the Java (or C) implementation in Go will not make the cut. or endorsed, by a subset of peers. Transactions with tentative Fabric itself presents some unique requirements that are not results are then grouped into blocks and ordered. Finally, a addressed by traditional protocol implementations. In addition, validation phase makes sure that transactions were properly the blockchain use case offers many opportunities for opti- endorsed and are not in conflict with other transactions. All mizations that are absent from a general-purpose transaction- transactions are stored in the ledger, but valid transactions ordering reference scenario. We therefore set out to design and are then committed to the state database, whereas invalid implement a BFT library that on the one hand addresses the transactions are omitted from the state. special needs and opportunities of a blockchain platform, but on the other hand is general and customizable enough to be In the heart of Fabric is the ordering service, which receives useful for other use cases. endorsed transactions from the clients, and emits a stream of One of our goals was to provide an end-to-end Fabric system blocks. At the time of writing, the latest Fabric release (v2.3) that addresses all the concerns that a BFT system must face. uses the Raft , protocol which is Crash Fault Tolerant This forced us to tackle issues that span the entire Fabric (CFT). Despite previous efforts to do so, Fabric still does not transaction flow: the client, the ordering service, and the peers. have a Byzantine Fault-Tolerant (BFT) ordering service. In this The result is the first fully functional BFT-enabled Fabric paper we describe our efforts to transform Fabric into a end- platform. Our key contributions are: to-end BFT system, and contrast our approach with previous A stand-alone Byzantine fault-tolerant consensus library, attempts. based on BFT-SM A RT. The code is open source and is written in the Go programming language. 978-0-7381-1420-0/21/$31.00 ©2021 IEEE An easy to integrate interface of a consensus library, suited for blockchains. This interface captures the special that maintain a record of transactions using an append-only needs of a blockchain application, but is fully customiz- ledger and are responsible for the execution of the chaincode able for other use cases as well. (smart contracts) and its life-cycle. These nodes also maintain A full integration of the library with Hyperledger Fab- a “state” in the form of a versioned key-value store. Not all ric , which addresses BFT concerns of all its compo- peers are responsible for execution of the chaincode, but only a nents. This BFT version of Fabric is publicly available subset of peers called endorsing peers ,. (3) Ordering and open source , with an accompanying SDK. nodes are platform nodes that form a cluster that exposes an An evaluation of our BFT version of Fabric versus abstraction of atomic broadcast in order to establish total order Fabric based on Raft. The evaluation demonstrates that between all transactions and to batch them into blocks. our implementation is comparable in performance to In order to address scalability and privacy, Fabric introduces the earlier BFT-SM A RT based implementation , but the concept of channels. A channel in Fabric allows a well slower than Raft, mainly due to the lack of pipelining. defined group of organizations that form a consortium to The library and BFT-Enabled Fabric system presented in privately transact with each other. Each channel is essentially this work form the basis of an industrial asset tokenization an independent private blockchain, with its own ledger, smart platform (see press releases , ), and is currently a contacts, and a well defined set of participants. candidate Fabric RFC , discussed for inclusion in a future Fabric’s transaction flow is: (1) A client uses an SDK to Fabric release. form and sign a transaction proposal, and sends the transaction The rest of the paper is organized as follows. Section II proposal to a set of endorsing peers. (2) Endorsing peers introduces some background on consensus protocols and Hy- simulate the transaction by invoking the chaincode, and send perledger Fabric. In Section III we provide a high level view signed endorsements back to the client. (3) The client collects of Fabric’s new BFT Ordering Service Node, describing the responses from all endorsing peers, validates their confor- main internal components and their relation to the library. mance, and packs the responses creating a transaction. (4) Section IV provides a more detailed exposition of the BFT The client then submits the transaction to the ordering service. consensus library we developed, whereas Section V describes (5) The ordering service collects incoming transactions, packs the additions and modifications we had to perform to Fabric’s them into blocks, and then orders the blocks to impose total orderer, peer, and client, in order to turn it into an end-to-end order of transactions within a channel context. (6) Blocks BFT system. In Section VI we evaluate the fully integrated are delivered to all the peers, for example by peers pulling BFT Fabric system, and compare it to the current Raft-based blocks directly from the ordering service. (7) Upon receiving implementation. Finally, Sections VII and VIII discuss related a new block, a peer iterates over the transactions in it and work and summarize our conclusions, respectively. validates: a) the endorsement policy, and b) performs multi- version concurrency control checks. (8) Once the transaction II. BACKGROUND validation has finished, the peer appends the block to the ledger A. Consensus and updates its state. Most enterprise blockchain platforms use quorum-based Fabric is a modular blockchain system supporting multiple consensus protocols to order transactions. Loosely speaking, types of ordering services. In Fabric’s first release (v1.0) crash fault tolerant (CFT) protocols like Raft need a simple the ordering service was based on Kafka , a replicated, majority Q = 12 N , whereas “mainstream” Byzantine fault (crash) fault tolerant messaging platform. The ordering service tolerant (BFT) protocols like PBFT require a Q = 32 N nodes (OSNs) sent transactions to a Kafka topic (one for each majority (N size of cluster, Q size of quorum). PBFT is channel), and consumed from it an ordered transaction stream. more expensive than Raft: it requires more replicas to protect Then the OSNs employed a deterministic function to partition against the same number of faults (F = N − Q), executes an the transactions into identical blocks across all nodes. In this additional communication round (3 vs. 2), and requires the use architecture the ordering nodes did not communicate between of cryptographic primitives. The benefit is protection against them directly; they only acted as producers and consumers of arbitrary faults, including malicious behavior and collusion. If a Kafka service. Moreover, every node was servicing all the the number of faults is no more than F , both protocols provide channels. safety (all correct replicas agree on the same sequence of valid On release v1.4.1 Fabric introduced an ordering service values) and liveness (if a correct node delivers a value, then based on a Go implementation of the Raft consensus eventually all other correct nodes will deliver it as well). algorithm from etcd. This significantly changed the archi- tecture of the OSN. Each channel now operates an independent B. Hyperledger Fabric cluster of Raft nodes. An OSN can serve multiple channels, The Fabric blockchain network is formed by nodes but is not required to serve all of them. This permits linear which could be classified into three categories based on scalability in the number of channels by spreading channels their roles: (1) Clients are network nodes running the appli- across OSNs. Raft is a leader-based protocol. The leader of cation code, which coordinate transaction execution. Client each cluster (channel) batches incoming transactions into a application code typically uses the Fabric SDK in order to block, and then proposes that block to the consensus protocol. communicate with the platform. (2) Peers are platform nodes The result is a totally ordered stream of blocks replicated across all OSNs that serve the channel. Clients are required component and sent to all the nodes. Every node that receives to submit transactions to a single node, preferably the leader. a commit uses the crypto component to verify the signature However, transactions may be submitted to non-leader nodes, on it. When enough valid commit messages accumulate, the which then forward them to the leader. At the time of this block along with all the Q = 2F + 1 commit signatures is writing Fabric’s latest version (v2.3) offers the Raft-base delivered to the committer component that saves it into the ordering service as default, and deprecated the Kafka-base block store. We call the block and commit signatures the option. “decision”. Each OSN has a block delivery service that allows peers, C. The BFT-SM A RT Protocol orderers and clients to request blocks from it. This is how peers Parts of the library presented in this paper were designed get their blocks, and this is how an OSN that was left behind based on the BFT-SM A RT consensus library ,. In the quorum catches up. If the consensus library suspects that BFT-SM A RT the message pattern in the normal case is it is behind the frontier of delivered decisions, it will call similar to the PBFT protocol , i.e. PRE - PREPARE, PRE - the sync component, which in turn uses the block delivery PARE , and COMMIT phases/messages. If the current leader is service of other OSNs in order to pull the missing blocks. faulty, a new leader is elected using a view change protocol, Every block that it receives goes through a block validation which we implement with the synchronization phase of BFT- component that makes sure it is properly signed by Q OSNs, as SM A RT in mind. The view change process uses three required by the BFT protocol. When it finishes, it will provide types of messages: VIEW- CHANGE, VIEW- DATA, and NEW- the library with the most recent decision. VIEW. We chose to implement the BFT-SM A RT protocol This architecture resembles the Raft-based OSN architecture because of its simplicity and elegance. This protocol is sig- in the sense that the consensus library is embedded within the nificantly simpler than PBFT, because it does not allow for OSN process. However, there are important differences: a transaction pipeline. In BFT-SM A RT there is only a single In a Raft-OSN the block is cut before ordering and proposed transaction by a given leader at any point in time, is then submitted to the raft library for ordering. Here which dramatically simplifies the view change sub-protocol. transactions are submitted to the BFT library, because the library monitors against transaction censorship. This III. A RCHITECTURE required the addition of the assembler to the OSN and We now describe the architecture of the new BFT OSN, respective APIs to the library. following the main components depicted in Figure 1. We In a Raft-OSN the followers trust the leader’s block present the novel library interface in Sec. IV, and the BFT proposal (as it can only crash fail), whereas here the related improvements to other Fabric components in Sec. V. followers must revalidate the transactions within the After a client collects endorsements and assembles a trans- block proposal. This required adding to the library APIs action, it submits it to all the OSNs (see Sec. V-E). An for transaction verification during the consensus protocol. OSN will first filter incoming transactions using several rules In a Raft-OSN the delivered block is signed after consen- encoded in the validator; for example, verifying the client’s sus by the node, right before it is saved to the block store. signature against the consortium’s definitions. Valid transac- Here the block is signed during the consensus protocol, tions are then submitted to the BFT consensus library for in the commit phase, by Q nodes. ordering. Submitted transactions are queued in a request pool within the library. IV. T HE CONSENSUS LIBRARY AND API The consensus library is leader based, and at any given mo- We implemented a BFT consensus algorithm in a stand- ment the node will operate either as a leader or a follower. A alone open-source library. The algorithm is mostly based leader OSN will batch incoming transactions, and then call the on the well known PBFT , and BFT-SM A RT , assembler to assemble these transactions into a block. The protocols. However, the library interface differs in several assembler is in charge of ensuring that block composition important ways from the traditional BFT library API. adheres to the rules and requirements of Fabric, and returns a block that may contain some or all of the transactions in A. Request life-cycle the batch, and may even reorder them. Transactions that were As in most consensus protocols, the request life-cycle, not included in the block will be included in a later batch. shown in Fig. 2, starts with “submit” and ends with “deliver”: The leader will then propose the block, starting the consensus SubmitRequest(req []byte) error - starts the request protocol. A follower node will receive this block proposal from life-cycle. Clients should submit requests to all nodes and the comm component and will revalidate every transaction not just to the leader. Whenever a request is not included in it using the same validator used by the filter. This is in a proposal after a first timeout, every non-faulty follower done because in a BFT environment the follower cannot trust forwards the request to the leader. This protects against the leader – it must therefore validate the proposal against malicious clients sending messages only to followers, trying to the application running on top of it – the Fabric blockchain. force a view change. If a request is not included in a proposal After a block proposal passes through the first two phases of after a second timeout, every non-faulty follower will initiate consensus the commit message is signed using the crypto a view change. This protects against faulty leaders. validator assembler crypto block pull blocks delivery called by: validate assemble block sign & peers, orderers, read tx from batch verify sig. & clients submit submit deliver tx tx consensus block write block client filter committer library store sync & send recv get last decision write Ordering validate block Service comm sync validation Node block comm to other orderers to other orderers delivery Fig. 1. The architecture of a Fabric BFT-based Ordering Service Node. Deliver(Proposal,[]Signature) Reconfig - is the end it is behind by several commits (e.g. recovering after a long of the request life-cycle. It is invoked by the library when a down-time). This will trigger a BFT-compliant replication node receives a quorum of COMMIT messages, and delivers protocol (see Sec. V) and pull the missing blocks. The library a committed proposal (a block) along with Q signatures to will be provided with a R ECONFIG object that tells it whether the application. The application is responsible to store the any of the transactions in the pulled history reconfigured the delivered decision, as the library may dispose of this data once library (see below). the deliver call returns. The application returns a R ECONFIG VerifyRequest(req []byte) error - The library assumes object that tells the library whether any of the transactions in the application submits only valid client requests. However, the P ROPOSAL reconfigured the library (see below). there are cases where a client’s request must be reverified: The requirements of the Fabric OSN guided us into a library (1) if a follower forwards a request to the leader, the leader API that extends this pattern, and requires the application to calls VerifyRequest; (2) if the configuration changed, as implement several additional functions. The invocation flow is elaborated next, then all queued client requests must be depicted in Figure 2. This extension makes the API perfectly reverified. suited to the blockchain domain. Assemble(metadata []byte, reqs [][]byte) Proposal - B. Reconfiguration is called by the leader with a batch of requests along with The library supports dynamic (and non-dynamic) reconfig- some metadata (which are SEQUENCE and VIEW- NUMBER). uration. Reconfiguration is done implicitly and not explicitly. This allows a blockchain application to construct a block Instead of the application sending an explicit reconfiguration (P ROPOSAL) in any form it wishes. For example, this enables transaction to the library, the library infers its configuration adding block and chain specific information, such as hash of status after each commit. To that end, the Deliver and Sync the previous block, merkle-tree, etc. Moreover, in Fabric, a calls return a R ECONFIG object that defines a new configura- configuration transaction must reside in a block by itself. tion, if there was any. If there was a reconfiguration, the library VerifyProposal(Proposal) error - is called by a follower restarts all of its components with the new configuration. that receives a proposal in a PRE - PREPARE message from the leader (since the leader may be malicious). In Fabric, C. Communication we check that the proposal is constructed as a valid and The library assumes the existence of a communication correctly numbered block, its header contains the hash of the infrastructure that implements point to point authenticated previous block, and all transactions in the block are valid channels (like Raft). The library is notified when a new mes- Fabric transactions. sage arrives; messages carry the identifier of the node that sent SignProposal(Proposal) Signature - is called during them. The communication infrastructure is used to send both the commit phase when nodes sign the current P ROPOSAL and Consensus protocol messages (i.e. PRE - PREPARE, PREPARE, send the signature as part of the COMMIT message. Nodes also COMMIT , view-change protocol messages, and heartbeats), as sign the VIEW- DATA message during the view change process well as Forwarded clients’ requests. (not shown). VerifySignature(Signature) error - is called to verify D. Write ahead log incoming COMMIT messages, as part of the commit phase, The WAL (write-ahead log) is a pluggable component, and where each node collects a quorum of COMMIT messages. we provide a simple implementation that maintains only the Sync() Reconfig - is called by the library when it suspects latest state. The WAL can be kept that simple because a Client SubmitRequest(req []byte) error Tx Sign Verify Leader Assemble(metadata []byte, Assemble Proposal Signature Deliver requests [][]byte) Proposal Application Tx VerifyProposal(Proposal) error Library Submit SignProposal( Request Proposal) Signature Follower Verify Sign Verify Proposal Proposal Signature Deliver VerifySignature( Application Signature) error Tx Library Deliver(Proposal, Pre-prepare Prepare Commit []Signature) Reconfig Fig. 2. Normal case library-application flow (left), and API (right). blockchain application will keep a history of all delivered that compose the ordering service consortium. For a BFT decisions. In contrast, a general purpose SMR may apply service, this is obviously not enough. decisions to the state and discard them, and therefore needs We implemented a new BFT validation policy that requires to keep all decisions to the latest state snapshot. that the block be signed by Q out-of N nodes. Moreover, we require that these signatures are from the consenter set (i.e. E. Summary the set of N nodes) defined in the last channel configuration The transaction flow in the monolith BFT-SM A RT cluster transaction. This means that the new validation policy we used in starts with SUBMIT and ends with DELIVER, with defined must be dynamically updated every time the consenter no means for efficiently implementing the requirements of a set is updated with a configuration transaction, which was blockchain system. The design of our library addresses this gap previously not supported. by defining API functions that need to be implemented by the application using the library. This allows us to address the pain C. The orderer node points that had failed previous attempts to implement a BFT ordering service for Fabric. In less demanding applications Changing the orderer node to work with our library amounts (e.g without blocks, no need for transaction validation) these to implementing the logic that initializes the library, submits “hooks” into the consensus protocol can be filled by simple transactions to the library, and handles invocations from the “empty” implementations. library. Synchronization happens when an orderer node falls behind the cluster. This is done by requesting blocks from V. B YZANTINE FAULT- TOLERANT FABRIC other cluster members. We changed the default synchroniza- In order to turn Fabric into an end-to-end BFT system tion protocol to tolerate Byzantine faults. The Block assembler we had to make changes that go beyond replacing the Raft is in charge of inspecting a batch of incoming transactions consensus library with our BFT library. offered to it by the leader, and returning a block of transactions that is built in accordance with the semantics of Fabric, for A. Block structure example, that a configuration transaction must be packed in Fabric’s block is composed of a header, a payload, and a block by itself. The Signer and Verifier implementations metadata. One of the metadata fields is consensus specific simply wrap the already existing capabilities of Fabric to sign information. We changed the type of information in it to the and verify data. ViewID and Sequence number from the consensus library. B. Signatures and validation D. The peer Every time a Fabric node (peer or orderer) receives a block In Fabric, a peer will randomly select an ordering node and outside of the consensus process, it must validate that the block request a stream of blocks from it. In our BFT implementa- is properly signed. This happens in three different scenarios: tion, the blocks that are received from the ordering service 1) when a peer receives a block from the ordering service, cannot be forged because they are signed by at least 2F + 1 2) when a peer receives a block from another peer during ordering nodes. However, receiving the stream of blocks from the gossip protocol, and 3) when an ordering node receives a a single orderer (or up to F orderers) might expose the peer block from another ordering node during the synchronization to a censorship attack. To protect against this scenario we protocol. The default policy in Fabric is to require a single implemented a block delivery service that is resilient to this signature from a node that belongs to any of the organizations sort of attack. A naive expensive solution is for the peer to ask for the a total fan-out of 720 Mb per second. In contrast, for Raft the block stream from F + 1 ordering nodes. Instead, we augment situation is different, and at a throughput of 12,000 tx/sec, the the delivery service and allow a peer to request a new kind leader sends 375Mb to each of the 10 followers which starts of delivery – a stream that contains only the header and hindering the throughput. metadata of each new block, enough to verify the signatures. Since the latencies in our LAN setup are negligible, we The peer requests a stream of full blocks from a randomly attribute the differences in throughput mainly to the high CPU selected orderer, and a Header-Metadata stream from all the processing overhead that is mandatory in BFT, but doesn’t rest. The peer then monitors the front of delivered blocks: if exist in Raft. We verify this conclusion by observing the F or more Header-Metadata streams are ahead of the full CPU consumption and measuring the time it took to perform block stream for more than a threshold period, then the peer local CPU cryptographic computations for block verification, suspects the orderer that delivers the stream of full blocks as well as monitoring I/O (omitted due to space limitations). applies censorship, and replaces it with another one. C. WAN performance E. The client Figure 4 exhibits the results of an identical experiment, In the Raft CFT setting a client is only required to submit except the servers are now deployed across 10 different data a transaction proposal to a single orderer. We modified the centers across the globe. Throughput is significantly reduced client in the Fabric Java SDK to submit every transaction both for BFT-OS and Raft-OS. With 1,000 tx/block, BFT-OS to all the orderers, in order to prevent censorship attacks by throughput decreases down to 40% of the LAN setup, whereas malicious orderers. This is the simplest solution to always hit RAFT-OS throughput decreases down to 20% of the LAN the leader, which minimizes forwarding between orderers. setup. In our BFT-OS, the leader starts consensus on a block VI. E VALUATION only after the previous block has been committed. Therefore, We evaluated the performance of our BFT ordering service throughput improves significantly with block size, as the WAN (BFT-OS) for Fabric that integrates our consensus library, and bandwidth is not saturated due to the lack of pipelining and compared it with the existing state-of-the-art Raft ordering ser- high latencies. In contrast, the Raft-OS can pipeline consent on vice (Raft-OS, from Fabric v2.3). Here we report a summary blocks, saturating the WAN bandwidth even for small blocks, of our findings. and therefore does not benefit from increasing block size. The cost of increasing the block size is added average latency due A. Setup and Workload to batching. In cases where the throughput is less sensitive to We consider both LAN and WAN setups, different cluster block size (BFT-LAN, Raft-WAN), or when latency is more sizes (4, 7, 10 for BFT, 5, 7, 11 for Raft) and various important than throughput, smaller blocks may be used. configurations of batch sizes (100, 250, 500, 1000 transac- D. Summary tions/block). We pre-allocate 1.4 million Fabric transactions each sized 4KB (typical of a transaction which contains 3 Using large blocks (1000 txs/block), and with resilience to certificates of endorsing peers) and send it to the leader from two faults (F = 2 , cluster size 7 & 5, for BFT-OS & Raft- 700 concurrent workers. This ensures the leader always fills OS, resp.): in a LAN, BFT-OS achieves 20% the performance blocks with as many transactions it can (resulting in block of a Raft-OS (2,500 vs. 13,000 tx/sec, resp.); in a WAN, sizes ∼400KB, ∼1MB, ∼2MB, ∼4MB resp.). We used a BFT-OS achieves 40% the performance of a Raft-OS (1,200 homogeneous virtual machine specification for all nodes with vs. 3,000 tx/sec, resp.). We attribute this gap mainly to the a 16 core Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz lack of pipelining in the BFT-SM A RT protocol, and to the with 32GB RAM and SSD hard drives. In our WAN setup cryptographic computations required in all BFT protocols. we spread the servers across the entire globe to maximize VII. R ELATED WORK the latency (Dallas, London, Washington, San Jose, Toronto, Sydney , Milan, Chennai, Hong Kong, and Tokyo). The leader A. Blockchain and BFT is always located in London. Latency to the followers is: In general, the increasing interest in blockchain lead to average = 133ms, min = 20ms (Milan), max = 250ms the development of algorithms such as Tendermint , Hot- (Tokyo). Each test is repeated 10 times. Stuff and SBFT , where the key focus is on improving the view change sub-protocol or replacement of a faulty leader. B. LAN performance Many decades of research lead to BFT protocols ex- Figure 3 depicts the throughput (in tx/sec) of both BFT-OS ploiting randomization to provide consensus solution in the (left) and Raft-OS (right) with various block sizes and cluster asynchronous model, with well known classical results such sizes. as –. However, these protocols are far from being For both BFT-OS and Raft-OS, throughput improves with practical due to their poor performance. Only recently Miller et block size (i.e. the number of txs per block). For BFT-OS, al. suggested a leaderless randomized algorithm with rea- a rate of 2,500 tx/sec means sending 80Mb of traffic per sonable and promising performance results. Recent advances second per node, and even in a cluster of 10 nodes, it means include also and. (a) BFT in a LAN (b) Raft in a LAN Fig. 3. Cluster throughput as a function of block size, in a LAN, for BFT (a) and Raft (b). Transaction size is ∼4KB. (a) BFT in a WAN (b) Raft in a WAN Fig. 4. Cluster throughput as a function of block size, in a WAN, for BFT (a) and Raft (b). Transaction size is ∼4KB. Despite the renaissance of research around BFT consen- B. BFT-Smart and Hyperledger Fabric sus algorithms, triggered primarily by increased interest in blockchain, there are only a few openly available imple- In its first release, the Fabric ordering service was based mentations suitable for production-grade exploitation – on the Kafka messaging service. In this implementation, the. Unfortunately, existing libraries lack a practical reusable OSNs receive transactions from clients, and submit them to interface which is flexible and generic enough to implement a a Kafka topic, one topic for each Fabric channel. All OSNs BFT-enabled ordering service for Hyperledger Fabric. Any im- would consume from the topics – again, one topic per channel plementation based on a programming language other than Go – and therefore receive a totally ordered stream of transactions would have exhibited the same drawbacks as the Kafka based per channel. Each OSN would then deterministically cut the ordering service architecture, where the consensus component transaction stream into blocks. will have been deployed as an external service. In the quest In 2018 Sousa et al. made an attempt to convert this to develop a clustered ordering service based on an embedded implementation into a BFT ordering service. They replaced consensus library, we had to write our own library. the Kafka service with a cluster of BFT-Smart based servers, where each server consisted of a BFT-Smart core wrapped with a thin layer that allowed it to somewhat “understand” Fabric transactions. The proof of concept presented in the paper exhibited promising performance, but was eventually correctness of the BFT protocol. The corresponding block not adopted by the community. The community discussions validation policy was added to the peer and orderer for that reveal a mix of fundamental and technical reasons for that effect, and the SDK client was augmented to interact properly. with a BFT service. The solution presented was composed of two processes (or- derer front-end and BFT server), written in two languages (Go C. Tendermint and Hyperledger Fabric & Java, resp.). That was not well received as it complicates Another option we considered was to re-use the Tendermint the development, maintenance, and deployment of the ordering Core as an embedded consensus library. There is an service. The experience gained with the Kafka-based service Application Blockchain Interface (ABCI) which defines an motivated the community to move towards a single process API between the BFT protocol and state machine replication that embeds a consensus library, as eventually happened with layer (application). However, the consensus protocol itself ac- the introduction of the Raft-based ordering service in Fabric tually implements the blockchain, with batches of transactions v1.4.1. chained together forming the ledger. The ABCI only provides There were, however, more fundamental reasons. The code an interface for the application to validate and execute trans- that wrapped the BFT-Smart core did not include Fabric’s actions. That implies that using Tendermint as a consensus membership service provider (the MSP), which defines iden- library in the Fabric OSN would have resulted in a ledger tities, organizations, and certificate authorities (CAs), and in- with Tendermint blocks in addition to the ledger maintained cludes the cryptographic tools for the validation of signatures. by Fabric. In addition, Tendermint implements a custom peer- Therefore, the BFT cluster signatures were not compliant with to-peer communication protocol inspired by the station-to- Fabric’s, and the identities of the BFT cluster servers were station protocol. This protocol significantly deviates from the not part of Fabric’s configuration. In Fabric, configuration is communication model of Fabric. Unfortunately the current part of the blockchain and must be agreed upon. Incorporating implementation is too tightly coupled into the Tendermint Core the Java BFT cluster endpoints and identities (certificates & and would have required substantial refactoring. Overall, the CAs) into the configuration flow would have meant providing a lack in configuration flexibility and the inability to replace the Java implementation to an already highly sensitive component. communication layer led us towards the decision to implement This shortcoming also meant that the front-end orderers had to our own BFT library instead. collect 2F + 1 signed messages from the BFT cluster servers, increasing the number of communication rounds to four. VIII. C ONCLUSION The blocks produced by the front-end servers included only In this paper we described the design, implementation, the signature of a single front-end orderer. This does not allow and evaluation of a BFT ordering service for Fabric. In the an observer of the blockchain (or a peer) to be convinced that heart of this implementation lies a new consensus library, the block was properly generated by a BFT service. Moreover, based on the BFT-SM A RT protocol, and written in Go. The even if the 2F +1 BFT cluster signatures were included in the library presents a new API suited for permissioned blockchain block metadata, that does not help an observer, as the identities applications, such as Fabric. It delegates many of the core of said servers are not included in Fabric’s configuration. functions that any such library must use to the application Moreover, peers and clients did not have the policies that employing it, allowing for maximal flexibility and generality. correspond to the BFT service they consumed from. For example, cryptographic functions, identity management, Another subtle problem with a monolithic BFT cluster is as well as point to point communication are not embedded that it does not allow a follower to properly validate the but are exposed through proper interfaces, to be implemented transactions proposed by the leader against the semantics of by the application using it. This allowed us to re-use some of Fabric – again – without pulling in a significant amount of the sophisticated mechanisms that Fabric already possessed. In Fabric’s code. order to make Fabric an end-to-end BFT system, we ensured BFT-Smart owes much of its performance to the internal that the peer and the client SDK interact properly with the batching of requests. However, those batches are not consistent BFT ordering service. with the blocks of Fabric, so had to be un-packed, multiplexed We chose to implement the BFT-SM A RT protocol because into different channels, and then re-packed into Fabric blocks. of its simplicity. This protocol is significantly simpler than We confronted these problems when we designed the library PBFT, because it does not allow for a transaction pipeline. and its integration with Fabric. The interface of the library In BFTSmart there is only a single proposed transaction by a allowed us to seamlessly integrate with Fabric’s MSP and given leader at any point in time, which dramatically simplifies configuration flow. Our implementation allows the leader to the view change sub-protocol. This simplicity greatly increases assemble blocks according to Fabric’s rules, so transactions are our confidence in the correctness of the implementation. batched once. It allows followers to validate the transactions However, these advantages come with a cost – reduced per- against Fabric’s semantics during the consensus protocol, and formance. This is especially salient when comparing against it collects the quorum signatures during the commit phase. the highly mature and optimized etcd/raft library, which This reduces the number of communication rounds to three, uses pipelining extensively. Nevertheless, our implementation and allows the observer of a single block to validate the exhibits levels of performance that are sufficient for most permissioned blockchain applications: a 7 node BFT ordering M. Yin, D. Malkhi, M. K. Reiter, G. G. Gueta, and I. Abraham, “Hot- service (F = 2) can support ∼ 2500 TPS in a LAN, and stuff: Bft consensus with linearity and responsiveness,” in Proceedings of the 2019 ACM Symposium on Principles of Distributed Computing, ∼ 1000 TPS in a WAN. These numbers are for a single 2019, pp. 347–356. channel; a Fabric network can scale horizontally by adding G. G. Gueta, I. Abraham, S. Grossman, D. Malkhi, B. Pinkas, M. K. channels. Reiter, D.-A. Seredinschi, O. Tamir, and A. Tomescu, “Sbft: a scal- able decentralized trust infrastructure for blockchains,” arXiv preprint arXiv:1804.01626, 2018. M. Ben-Or, “Another advantage of free choice (extended abstract) R EFERENCES completely asynchronous agreement protocols,” in Proceedings of the second annual ACM symposium on Principles of distributed computing, “Hyperledger fabric,” https://www.hyperledger.org/use/fabric, 2020. 1983, pp. 27–30. E. Androulaki, A. Barger, V. Bortnikov, C. Cachin, K. Christidis, G. Bracha, “An asynchronous [(n-1)/3]-resilient consensus protocol,” A. De Caro, D. Enyeart, C. Ferris, G. Laventman, Y. Manevich, in Proceedings of the third annual ACM symposium on Principles of S. Muralidharan, C. Murthy, B. Nguyen, M. Sethi, G. Singh, K. Smith, distributed computing, 1984, pp. 154–162. A. Sorniotti, C. Stathakopoulou, M. Vukolić, S. W. Cocco, and J. Yellick, S. Toueg, “Randomized byzantine agreements,” in Proceedings of the “Hyperledger Fabric: A distributed operating system for permissioned third annual ACM symposium on Principles of distributed computing, blockchains,” in Proceedings of the Thirteenth EuroSys Conference, ser. 1984, pp. 163–178. EuroSys ’18, 2018, pp. 1–30. C. Cachin, K. Kursawe, and V. Shoup, “Random oracles in constantino- D. Ongaro and J. Ousterhout, “In search of an understandable consensus ple: Practical asynchronous byzantine agreement using cryptography,” algorithm,” in Proceedings of the 2014 USENIX Conference on USENIX Journal of Cryptology, vol. 18, no. 3, pp. 219–246, 2005. Annual Technical Conference, ser. USENIX ATC’14. USA: USENIX A. Miller, Y. Xia, K. Croman, E. Shi, and D. Song, “The honey badger Association, 2014, p. 305–320. of bft protocols,” in Proceedings of the 2016 ACM SIGSAC Conference “The raft consensus algorithm,” https://raft.github.io, 2020. on Computer and Communications Security, ser. CCS ’16. New York, J. Sousa, A. Bessani, and M. Vukolic, “A byzantine fault-tolerant NY, USA: Association for Computing Machinery, 2016, p. 31–42. ordering service for the hyperledger fabric blockchain platform,” in I. Abraham, S. Devadas, D. Dolev, K. Nayak, and L. Ren, “Synchronous 2018 48th Annual IEEE/IFIP International Conference on Dependable byzantine agreement with expected o(1) rounds, expected o(n2 ) com- Systems and Networks (DSN), June 2018, pp. 51–58. munication, and optimal resilience,” in Financial Cryptography and “Regarding byzantine fault tolerance in hyperleder fabric.” https://lists. Data Security, ser. FC ’19, I. Goldberg and T. Moore, Eds. Springer hyperledger.org/g/fabric/topic/17549966#3135, March 2018. International Publishing, 2019, pp. 320–334. A. Bessani, J. Sousa, and E. E. P. Alchieri, “State machine replication T. Locher, “Fast byzantine agreement for permissioned distributed for the masses with BFT-SMaRt,” in 2014 44th Annual IEEE/IFIP ledgers,” in Proceedings of the 32nd ACM Symposium on Parallelism International Conference on Dependable Systems and Networks, June in Algorithms and Architectures, ser. SPAA ’20. New York, NY, USA: 2014, pp. 355–362. Association for Computing Machinery, 2020, p. 371–382. [Online]. M. Castro and B. Liskov, “Practical byzantine fault tolerance,” in Available: https://doi.org/10.1145/3350755.3400219 Proceedings of the Third Symposium on Operating Systems Design and “Source code, practical byzantine fault tolerance,” http://www.pmg.csail. Implementation, ser. OSDI ’99. USA: USENIX Association, 1999, p. mit.edu/bft/, 1999. 173–186. “Tendermint core: Byzantine-fault tolerant state machines,” https:// “The smartbft-go library open-source repository,” https://github.com/ github.com/tendermint/tendermint, 2016. SmartBFT-Go/consensus, 2019. “The honey badger of bft protocols,” https://github.com/initc3/ HoneyBadgerBFT-Python/, 2016. “Hyperledger Fabric BFT open-source repository,” https://github.com/ SmartBFT-Go/fabric, 2019. “The smartbft-go Java SDK library open-source repository,” https:// github.com/SmartBFT-Go/fabric-sdk-java, 2019. Atomyze, “Atomyze launches industrial asset tokeniza- tion platform,” https://www.prnewswire.com/news-releases/ atomyze-launches-industrial-asset-tokenization-platform-301010804. html, February 2020. N. day crypto, “Norilsk nickel introduced atomyze platform based on hyperledger blockchain,” https://newdaycrypto.com/norilsk-nickel- introduced-atomyze-platform-based-on-hyperledger-blockchain/, Febru- ary 2020. “Hyperledger Fabric RFCs,” https://github.com/hyperledger/fabric-rfcs, 2020. “An RFC proposal for a BFT ordering service for Hyperledger Fabric,” https://github.com/hyperledger/fabric-rfcs/pull/33, 2020. E. Androulaki, A. De Caro, M. Neugschwandtner, and A. Sorniotti, “Endorsement in hyperledger fabric,” in 2019 IEEE International Con- ference on Blockchain (Blockchain), 2019, pp. 510–519. Y. Manevich, A. Barger, and Y. Tock, “Endorsement in Hyperledger Fab- ric via service discovery,” IBM Journal of Research and Development, vol. 63, no. 2/3, pp. 2:1–2:9, 2019. “Apache kafka,” https://kafka.apache.org. “etcd: A distributed, reliable key-value store for the most critical data of a distributed system,” https://etcd.io and https://github.com/etcd-io/etcd. J. Sousa and A. Bessani, “From byzantine consensus to bft state machine replication: A latency-optimal transformation,” in 2012 Ninth European Dependable Computing Conference, May 2012, pp. 37–48. M. Castro and B. Liskov, “Practical byzantine fault tolerance and proactive recovery,” ACM Trans. Comput. Syst., vol. 20, no. 4, p. 398–461, Nov. 2002. E. Buchman, “Tendermint: Byzantine fault tolerance in the age of blockchains,” Ph.D. dissertation, 2016.