Ztachip Accelerates Tensorflow and Image Workloads

[Vuong Nguyen] clearly knows his way around AI accelerator hardware, creating ztachip: an open source implementation of an accelerator platform for AI and traditional image processing workloads. Ztachip (pronounced “zeta-chip”) contains a set of custom processors and is not tied to any particular architecture. Ztachip implements a new tensor programming paradigm that [Vuong] created, which can speed up TensorFlow tasks but is not limited to them. In fact, it can process TensorFlow workloads in parallel with non-AI tasks, as shown in the video below.

A RISC-V core, based on the VexRiscv design, is used as the host processor to handle application distribution. VexRiscv itself is quite interesting. Written in SpinalHDL (a Scala-based HDL), it is highly configurable, producing a Verilog core ready to be integrated into the design.

A Digilent Arty-A7, Arducam and a PMOD VGA are all you need

From a hardware design perspective, the RISC-V core connects to an AXI crossbar, with the AXI-Lite buses multiplexed as is usual in the AMBA AXI ecosystem. The Ztachip core and a DDR3 controller hang off the crossbar as well, alongside a camera interface and VGA video output.

Aside from an FPGA-specific DDR3 controller and an AXI crossbar IP, the rest of the design is generic RTL, which is good news for portability. The demo below deploys on the Artix-7-based Digilent Arty-A7 with an Arducam and a PMOD VGA module, and nothing else is needed. Pre-built Xilinx IP is provided, but targeting a different FPGA shouldn’t be a huge task for the experienced FPGA ninja.

Ztachip high-level architecture

The magic happens in the Ztachip core, which is basically an array of Pcores. Each Pcore has both vector and scalar processing capability, making it extremely flexible. The Tensor Engine (internally, the “dataplane processor”) is in charge here, sending instructions from the RISC-V core into the Pcore array, along with streaming image data in and results back out. The camera is only a 0.3 MP Arducam and the video output is limited to VGA resolution, but give it a bigger FPGA and those limits could be increased.

This domain-specific approach uses a highly modified C-like language (with a custom compiler) to describe the application to be distributed across the accelerator array. We couldn’t find any documentation on this, but there are some sample algorithms.

The demo video shows a real-time mix of four algorithms running in parallel: object classification (using Google’s TensorFlow MobileNet-SSD, a pre-trained AI model), nifty edge detection, Harris corner detection, and optical flow, which gives it motion vision like a Predator.
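For a feel of what two of those classic algorithms compute, here is a minimal NumPy sketch of Sobel-style edge detection and the Harris corner response. This is just the textbook per-pixel math on a host CPU, not ztachip’s Pcore implementation or its tensor language:

```python
import numpy as np

def filt3x3(img, k):
    """3x3 sliding-window filter (cross-correlation); borders stay zero."""
    out = np.zeros_like(img, dtype=float)
    h, w = img.shape
    for i in range(3):
        for j in range(3):
            out[1:-1, 1:-1] += k[i, j] * img[i:h - 2 + i, j:w - 2 + j]
    return out

def sobel_edges(img):
    """Edge strength as the magnitude of horizontal/vertical Sobel gradients."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
    gx = filt3x3(img, kx)
    gy = filt3x3(img, kx.T)
    return np.hypot(gx, gy)

def harris_response(img, k=0.04):
    """Harris corner score R = det(M) - k*trace(M)^2 from the structure tensor M."""
    sob = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
    gx = filt3x3(img, sob)
    gy = filt3x3(img, sob.T)
    box = np.ones((3, 3)) / 9.0  # simple box smoothing of the tensor entries
    ixx = filt3x3(gx * gx, box)
    iyy = filt3x3(gy * gy, box)
    ixy = filt3x3(gx * gy, box)
    return ixx * iyy - ixy**2 - k * (ixx + iyy)**2

# A white square on black: edges fire along its border, corners at its vertices.
img = np.zeros((16, 16))
img[4:12, 4:12] = 1.0
edges = sobel_edges(img)
corners = harris_response(img)
```

On real hardware these per-pixel loops are exactly the kind of data-parallel work that maps onto a Pcore array; the NumPy version is only meant to show the arithmetic involved.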

By [Vuong]’s account, in terms of efficiency, it is 5.5 times more computationally efficient than a Jetson Nano and 37 times more so than Google’s Edge TPU. These are bold claims, to say the least, but who are we to argue with a clearly incredibly talented engineer?

We cover a lot of AI-related topics, like this AI-assisted typing gadget, to start with. And not wanting to forget the original AI hardware, the good old-fashioned neuron, we’ve got that covered too!
