The Bi-i camera (2002) (Figure 35) was the first professional camera with integrated CNN technology. It applied the ACE16k chip, a 128×128 CNN chip designed by Gustavo Liñán from Ángel Rodríguez-Vázquez's team (IMSE-CNM, Seville, Spain). The Bi-i is an embedded camera computer, designed to make high-speed visual decisions in standalone mode, within the camera head. This means that it can capture images, make decisions, and send only a decision report, rather than entire image flows. To meet these requirements, the Bi-i camera comprises four major components (Figure 36).
• ACE16k CNN chip. It is a topographic sensor-processor array, which can capture and process images at ultra-high speed. Its special feature is that each pixel element is paired with a mixed-signal processing element. This high number (above 16,000) of fully programmable processing elements delivers an extraordinary computational capability. It reaches its top performance (above 10,000 FPS) when it both captures and processes the images; since its full-frame grayscale I/O bandwidth is limited to 4,000 FPS, it can also be used as a co-processor. In co-processor mode, the image is received electrically, rather than optically.
• Megapixel CMOS sensor. In a foveal camera system, the high-resolution sensor should support multi-scale region of interest (ROI) readout. This means that an arbitrarily sized and positioned window of the image can be read out, rather than the whole frame. Moreover, the image can be subsampled at the sensor readout level. This is very important, because image sensors are read out at a fixed pixel rate, hence the frame rate is inversely proportional to the number of pixels read out. A full frame can typically be read out only below video speed; the only way to reach higher frame rates is to reduce the pixel count.
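The fixed-pixel-rate relation above can be sketched as follows. The pixel rate is an illustrative assumption, not an IBIS 5 datasheet value:

```python
# Frame rate of a fixed-pixel-rate sensor is inversely proportional
# to the number of pixels read out. The pixel rate below is an
# assumed, illustrative figure.
PIXEL_RATE = 25_000_000  # pixels/s, assumed fixed readout rate


def frame_rate(width: int, height: int, subsample: int = 1) -> float:
    """Achievable frame rate for a width x height ROI, optionally
    subsampled 1:subsample in both directions."""
    pixels = (width // subsample) * (height // subsample)
    return PIXEL_RATE / pixels


full = frame_rate(1280, 1024)     # full megapixel frame: below video speed
roi = frame_rate(128, 128)        # small ROI window: far higher frame rate
```

Halving the resolution in both directions (1:2 subsampling) quarters the pixel count and therefore quadruples the frame rate.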
• High-end DSP with memory. The highest-performance Texas Instruments DSP available was applied in the Bi-i. It has three independent communication channels, allowing it to communicate with the three other major components of the Bi-i. In the typical operation mode of the system, the high-performance ACE16k chip calculates the computationally demanding parts of the image processing. During this phase, the 2D image is converted to 1D feature vectors. The DSP is used to evaluate this reduced-dimension data and to make the final decisions.
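The 2D-to-1D reduction and the DSP-side decision can be illustrated with a minimal sketch; the particular features (area and centroid) and the threshold are our assumptions, not the Bi-i's actual feature set:

```python
import numpy as np

# Sketch of the division of labor: the array processor reduces a 2D
# image to a short 1D feature vector, and the DSP only evaluates
# that vector. The chosen features are illustrative assumptions.

def feature_vector(binary: np.ndarray) -> np.ndarray:
    """Reduce a binary image to [area, centroid_row, centroid_col]."""
    ys, xs = np.nonzero(binary)
    if ys.size == 0:
        return np.zeros(3)
    return np.array([ys.size, ys.mean(), xs.mean()])


def decide(features: np.ndarray, min_area: float = 10) -> bool:
    """DSP-side decision on the reduced-dimension data."""
    return features[0] >= min_area
```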
• Communication processor. A standalone embedded system needs to communicate with remote computers. However, such communication requires a complete operating system (OS). If an OS is implemented on the DSP, the DSP loses its real-time capability, which is essential in ultra-high-speed applications; moreover, it loses significant computational performance. Therefore, a separate communication processor was applied for handling the external communication.
Such a communication processor is an entire low-performance computer, including a processor, memory, flash, Ethernet, and other communication peripherals.
Figure 35. The Bi-i camera. It has two optical inputs: a low-resolution, ultra-high-speed focal-plane sensor-processor device (the ACE16k chip) and a high-resolution CMOS sensor with ROI readout.
The Bi-i supports three operation modes: a high-resolution (megapixel), low-speed (below video speed) mode; a low-resolution (128×128), ultra-high-speed (above 10,000 FPS) mode; and a virtual high-resolution (megapixel), high-speed (100-1000 FPS) mode, which combines the previous two by applying the virtual foveal processor array concept. These three modes are briefly summarized in the following subsections.
Figure 36. Block diagram of the Bi-i camera: megapixel CMOS sensor with ROI (IBIS 5); CNN sensor and processor (128×128, ACE16k chip); control unit (Texas C6415 DSP); power supply module (analog and digital power supplies); communication unit (ETRAX 100 processor, LAN module, serial line).
3.4.1 Low resolution (128×128) ultra high speed mode of the Bi-i
The first operation mode of the Bi-i is a low-resolution (128×128), ultra-high-speed (above 10,000 FPS, i.e., less than 100 μs per frame) mode. In this case, it uses the ACE16k chip both for capturing and for processing the images. This CNN chip can perform an image processing operation (template) in 2-10 μs, while its external grayscale image I/O takes 250 μs. Besides the relatively slow grayscale I/O, the ACE16k chip has two other output channels. The first is a single-bit output, which tells whether a binary image is pure black or contains at least one white pixel. The other is a vector output, which delivers the coordinates of the black pixels against a white background. The first can be used for present/absent kinds of decisions, while the second also provides position information.
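These timing figures imply a simple frame-time budget; the sketch below uses the numbers quoted above, with the template count as an illustrative assumption:

```python
# Rough frame-time budget for the ACE16k, using the figures from the
# text: one template takes 2-10 us (we assume the 5 us midpoint), and
# a full-frame grayscale readout takes 250 us.
TEMPLATE_US = 5.0
GRAYSCALE_IO_US = 250.0


def fps(n_templates: int, grayscale_io: bool) -> float:
    """Frames per second for a program of n_templates operations,
    with or without a full-frame grayscale readout per frame."""
    total_us = n_templates * TEMPLATE_US
    if grayscale_io:
        total_us += GRAYSCALE_IO_US
    return 1e6 / total_us


standalone = fps(15, grayscale_io=False)  # decisions stay on-chip
with_io = fps(15, grayscale_io=True)      # grayscale readout dominates
```

With an assumed 15-template program, the on-chip mode stays above 10,000 FPS, while adding one grayscale readout per frame pulls the rate below the 4,000 FPS I/O limit, which is why ultra-high-speed operation requires decisions to be made inside the chip.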
To reach ultra-high speed, all the processing steps, including decisions, are supposed to be executed inside the ACE16k chip, and only the present/absent or position readouts are used. Some typical problems which can be efficiently solved with this technology are as follows:
• Event detection;
• Presence or absence detection in non-trivial cases;
• Size, shape and orientation classification;
• Position detection;
• Single or multiple object tracking.
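The two fast output channels that make these decisions possible can be emulated off-chip; the function names are our illustration, not the chip's API:

```python
import numpy as np

# Emulation of the ACE16k's two fast output channels on a binary
# image (True = white pixel). Names are illustrative, not chip API.

def any_white(binary: np.ndarray) -> bool:
    """Single-bit channel: does the image contain any white pixel?"""
    return bool(binary.any())


def black_coords(binary: np.ndarray) -> np.ndarray:
    """Vector channel: (row, col) coordinates of the black pixels."""
    return np.argwhere(~binary)
```

A present/absent decision reads only the single bit; position detection and tracking read the short coordinate vector, so neither requires the slow grayscale readout.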
Figure 37 shows an example, where the Bi-i classifies small objects based on their size and shape at above 10,000 FPS.
3.4.2 High resolution (megapixel) video speed mode of the Bi-i
In the second operation mode the Bi-i camera uses its high-resolution CMOS sensor as the input device. The captured images are processed jointly by the ACE16k and the DSP. To use the ACE16k chip, the megapixel image has to be cut into 64 (8×8) or 81 (9×9) overlapping segments, and the segments are processed one after the other. Due to the relatively long I/O time, processing a segment takes 600-700 μs. Hence, the overall processing speed will be in the 15-25 FPS range, which is balanced with the input rate of the megapixel sensor.
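The segmentation step can be sketched as follows; the even spreading of the overlapping tiles is our assumption, not the Bi-i firmware's exact layout:

```python
import numpy as np

def tile_image(image: np.ndarray, tile: int = 128, grid: int = 9):
    """Cut a square image into grid x grid overlapping tile x tile
    segments, spreading the tile origins evenly (our sketch, not the
    Bi-i firmware's exact layout)."""
    h, w = image.shape
    ys = np.linspace(0, h - tile, grid).round().astype(int)
    xs = np.linspace(0, w - tile, grid).round().astype(int)
    return [image[y:y + tile, x:x + tile] for y in ys for x in xs]


img = np.zeros((1024, 1024), dtype=np.uint8)
segments = tile_image(img, grid=9)   # 81 overlapping 128x128 segments
# At ~650 us per segment: 81 * 0.65 ms ~= 53 ms per frame, i.e. ~19 FPS,
# inside the 15-25 FPS range quoted above.
```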
3.4.3 Virtual high resolution (megapixel) high speed mode of the Bi-i
The third mode of the Bi-i camera is the foveal mode. In this mode, the Bi-i uses its megapixel optical sensor, but it does not read out the full frame, only some regions of interest (ROIs). These regions can be at arbitrary positions. Their size can be arbitrary too; however, we have to keep in mind that the ACE16k chip will have to process them, hence we should be able to assemble them into 128×128-sized images.
The windows can also be scaled. This is practically a subsampling of the image already at the sensor level. In the case of 1:2 scaling, technically this means that every odd pixel is read out from every odd row. The even pixels of the odd rows and the entire even rows are discarded; hence, the readout time is reduced to one quarter. Certainly, the image is less detailed, but in most cases these scaled images are perfectly suitable for identifying the locations of the regions of interest.
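The 1:2 readout pattern described above corresponds exactly to strided slicing:

```python
import numpy as np

def subsample(image: np.ndarray, scale: int = 2) -> np.ndarray:
    """1:scale subsampling as the sensor would do it: keep every
    scale-th pixel of every scale-th row, discard the rest."""
    return image[::scale, ::scale]


img = np.arange(16).reshape(4, 4)
small = subsample(img)   # reads out 1/4 of the pixels for scale 2
```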
A typical application is following a few different moving objects in the scene. In this case, one or a few downscaled images initially identify the locations of the objects. Then we zoom into the picture, exactly to those locations where the objects are, and read out 1:1 scale windows with the objects in central position. The windows are processed one after the other. During this process, besides the exact locations of the objects, their characteristic features (grayness, size, various shape descriptors, orientation, etc.) are extracted too. Then, the DSP builds a database from these feature vectors, which makes it possible to identify the objects, calculate their kinetic parameters, and predict their next locations. If these predictions are accurate enough, there is no need to perform a multi-scale search for the objects in each period.
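One period of this tracking cycle can be sketched as below; the constant-velocity predictor and all names are our own illustration, not the Bi-i firmware:

```python
# Sketch of one period of the multi-fovea tracking cycle: predict
# each object's next position, read a 1:1 ROI around it, extract
# features, and extend the track. Names are illustrative assumptions.

def predict(positions):
    """Constant-velocity prediction of the next (x, y) from the last
    two observed positions."""
    (x0, y0), (x1, y1) = positions[-2], positions[-1]
    return (2 * x1 - x0, 2 * y1 - y0)


def tracking_step(tracks, read_roi, extract):
    """One frame period: read a window around each predicted position,
    extract the object's features, and append the measured position."""
    for track in tracks:
        cx, cy = predict(track["positions"])
        window = read_roi(cx, cy, size=128)   # ROI around the prediction
        pos, features = extract(window)
        track["positions"].append(pos)
        track["features"].append(features)
```

As long as the predictions stay accurate, only these small 1:1 windows are read out, which is what sustains the 100-1000 FPS virtual-megapixel rate.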
This multi-scale, multi-fovea virtual processing approach makes it possible to maintain both a high frame rate and a high resolution without losing relevant information. Some examples, where we reached 100-1000 FPS by applying this method, are described in  and .