Efficient Accelerators for Matrix Multiplication in Machine Learning

IP.com Disclosure Number: IPCOM000243676D
Publication Date: 2015-Oct-09

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is an efficient accelerator for machine learning applications that leverages their robustness to inexact computation. The approach focuses on performing multiplication of large matrices, as this is the dominant kernel in a number of machine learning applications.



Computations involving large matrices are pervasive in data analytics and machine learning workloads. These computations are generally slow on general-purpose processors and benefit significantly from special-purpose hardware accelerators. Such accelerators may be implemented as application-specific integrated circuits (ASICs) or on Field Programmable Gate Arrays (FPGAs). Some machine learning workloads provide opportunities for realizing more power- and/or area-efficient accelerators by relaxing precision requirements without any impact on the performance of the application itself.

This article describes an efficient accelerator architecture that leverages the robustness of the applications to inexact computation. The approach focuses on performing multiplication of large matrices, as this is the dominant kernel in a number of machine learning applications.

The floating-point computation typically performed in central processing units (CPUs) and graphics processing units (GPUs) is overkill for most machine learning applications.

A simple migration from floating-point to fixed-point computation yields significant gains in power efficiency, speed of operation, and multiplier density on the chip. Moreover, most modern FPGAs have a large number of hard-wired fixed-point multipliers and adders available, which makes fixed-point computation even more attractive.
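As a rough illustration of this migration, the following C sketch contrasts a floating-point dot product with a fixed-point version using an assumed Q1.15 format (16-bit values, 15 fractional bits); the format, scale factor, and function names are illustrative choices, not taken from the disclosure.

    #include <stdint.h>
    #include <stdio.h>

    #define FRAC_BITS 15  /* assumed Q1.15 format: 1 sign bit, 15 fractional bits */

    /* Convert a float in roughly [-1, 1) to Q1.15 fixed point. */
    static int16_t to_fixed(float x) {
        return (int16_t)(x * (1 << FRAC_BITS));
    }

    /* Fixed-point dot product: 16x16-bit multiplies into a 32-bit accumulator,
     * the kind of operation a hard-wired FPGA multiplier block performs natively. */
    static int32_t dot_fixed(const int16_t *a, const int16_t *b, int n) {
        int32_t acc = 0;
        for (int i = 0; i < n; i++)
            acc += (int32_t)a[i] * (int32_t)b[i];  /* product carries 2*FRAC_BITS fraction bits */
        return acc >> FRAC_BITS;                   /* rescale back to Q1.15 */
    }

    int main(void) {
        float af[4] = {0.50f, -0.25f, 0.75f, 0.10f};
        float bf[4] = {0.30f,  0.60f, -0.20f, 0.90f};
        int16_t a[4], b[4];
        float ref = 0.0f;
        for (int i = 0; i < 4; i++) {
            a[i] = to_fixed(af[i]);
            b[i] = to_fixed(bf[i]);
            ref += af[i] * bf[i];  /* floating-point reference result */
        }
        printf("float: %f  fixed: %f\n", ref,
               (double)dot_fixed(a, b, 4) / (1 << FRAC_BITS));
        return 0;
    }

On an FPGA, each such 16-bit multiply-accumulate typically maps onto a single hard-wired DSP block, whereas a floating-point equivalent generally consumes several blocks plus additional fabric logic.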

However, building an accelerator for matrix multiplication that uses fixed-point multipliers encounters the following problems:

• Fixed-point numbers have a range/resolution trade-off that can adversely impact the quality of the results obtained at the application level (a sketch illustrating this follows the list)

• The elements of the input matrices need to be distributed to many multipliers, which can increase the complexity of the on-chip wiring and reduce the maximum operating frequency

• Off-chip communication bandwidth limitations make it challenging to keep the on-chip computation resources running at full utilization
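To make the first problem concrete, the sketch below quantizes a few values under different assumed Q formats: adding fractional bits shrinks the quantization step (better resolution) but also shrinks the representable range, so large values saturate. The formats and sample values are illustrative, not taken from the disclosure.

    #include <stdint.h>
    #include <stdio.h>

    /* Quantize x to a signed 16-bit fixed-point value with frac_bits fractional
     * bits, saturating at the limits of the representable range. */
    static int16_t quantize(float x, int frac_bits) {
        float scaled = x * (float)(1 << frac_bits);
        if (scaled >  32767.0f) scaled =  32767.0f;  /* saturate on overflow */
        if (scaled < -32768.0f) scaled = -32768.0f;
        return (int16_t)scaled;
    }

    int main(void) {
        float values[] = {0.0001f, 0.5f, 3.75f, 200.0f};
        int frac_bits[] = {15, 12, 8};  /* Q1.15, Q4.12, Q8.8 (illustrative) */
        for (int f = 0; f < 3; f++) {
            int fb = frac_bits[f];
            printf("frac_bits=%2d  range=[%g, %g)  step=%g\n", fb,
                   -32768.0 / (1 << fb), 32768.0 / (1 << fb), 1.0 / (1 << fb));
            for (int i = 0; i < 4; i++)
                printf("  %10g -> %10g\n", values[i],
                       (double)quantize(values[i], fb) / (1 << fb));
        }
        return 0;
    }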

The most commonly used alternative to custom hardware-based acceleration is GPU-based acceleration. GPUs perform only floating-point computations and are an order of magnitude less power efficient (at comparable throughput) than ASICs/FPGAs.

The proposed solution to these challenges is an accelerator for machine learning applications. The novel accelerator comprises a new cognitive computing architecture that unifies the ideas of systolic-array-based multipliers and randomized rounding of fixed-point numbers, together with the innovation of prefetching as much data as possible to keep the processing units running at full utilization, ensuring that the problem remains compute bound (the notion of p sets is described below). In addition, the solution organizes the available processing unit resources as multiple smaller and/or rectangular systolic arrays (as opposed to...
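The randomized rounding of fixed-point numbers mentioned above is commonly realized as stochastic rounding, in which a value is rounded up with probability equal to its fractional remainder so that the rounding error is zero in expectation. The following C sketch illustrates that idea under the same assumed Q1.15 format; it is an illustrative interpretation, not code from the disclosure.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define FRAC_BITS 15   /* assumed Q1.15 format, as in the earlier sketches */

    /* Round x to Q1.15 stochastically: the fractional remainder determines the
     * probability of rounding up, so the rounding error is zero in expectation. */
    static int16_t stochastic_round(float x) {
        float scaled = x * (float)(1 << FRAC_BITS);
        float floor_val = (float)(int32_t)scaled;        /* truncate toward zero */
        if (scaled < floor_val) floor_val -= 1.0f;       /* correct to floor for negatives */
        float remainder = scaled - floor_val;            /* in [0, 1) */
        float r = (float)rand() / (float)RAND_MAX;       /* uniform in [0, 1] */
        int32_t rounded = (int32_t)floor_val + (r < remainder ? 1 : 0);
        if (rounded >  32767) rounded =  32767;          /* saturate */
        if (rounded < -32768) rounded = -32768;
        return (int16_t)rounded;
    }

    int main(void) {
        /* Averaged over many trials, stochastic rounding preserves the mean even
         * when the input falls between two representable fixed-point values. */
        float x = 0.300007f;
        double sum = 0.0;
        for (int i = 0; i < 100000; i++)
            sum += (double)stochastic_round(x) / (1 << FRAC_BITS);
        printf("input %f, mean of stochastic roundings %f\n", x, sum / 100000.0);
        return 0;
    }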