If a large number of processors are integrated onto a single chip, the processors should be equipped with local memories, because they cannot access external data sources in parallel, due to obvious pin-count limitations. In the case of a 2D processor array, the processors are usually arranged on a regular grid, which makes the topographic mapping of an image straightforward. Figure 18 shows the mapping in the situation when the data array size is equal to the fine-grain processor array size.
n×m resolution pixel array (image)
n×m sized processor
Figure 18. Topographic mapping of an image onto a fine-grain 2D processor array
Though these 2D fine-grain CNN-type topographic engines can process images at extraordinarily high frame rates (above 10,000 FPS, i.e., less than 100 μs/frame), the image resolution cannot exceed the dimensions of the physical processor array. This means that the physical processor cell size on silicon (from 35×35 up to 75×75 microns) limits the largest feasible array size to about 50k cells. This limit is well below standard analog video frame sizes, and very far from today's standard megapixel digital video formats.
Moreover, in many cases there is no need to process these large-format images at ultra-high frame rates, because they are provided serially by high-resolution imagers, which typically deliver images at video rate (30 FPS) anyway. The question is how to trade resolution for speed, or in other words, how to convert (silicon) space into (execution) time.
We have to clearly distinguish two different processing-requirement scenarios. The first addresses the early image processing problem. In this case, there is no a priori information about the image, hence all parts of the image have to be handled the same way, because the locations of the relevant parts are not known. In this scenario, typically image enhancement and/or identification of areas of interest is performed.
In the second scenario, the goal is to perform deep post-analysis on certain regions of the image only, because it is assumed that these regions carry all the relevant information of the image. The position and the size of the processed regions (regions of interest, ROIs) are derived from the results of the early image processing. This is an economical solution, because it does not require deep analysis of the entire high-resolution image. On the other hand, assuming an accurate region-of-interest selection mechanism, we do not lose relevant information. This post-processing approach is called foveal processing, because it mimics the operation of the foveal vision system of primates.
3.1.1 Virtual processor arrays for early vision applications
As we have seen, the root of the problem is that we cannot implement a processor array large enough to handle full video frames in one piece. However, we can overcome the problem by introducing a video-frame-sized virtual processor array, which virtually processes the whole image in parallel. Behind the high-resolution virtual processor array, a physical processor array of affordable size performs the calculations. Therefore, only a part of the high-resolution image is topographically mapped and processed at a time.
When a small topographic array processor processes a high-resolution image piece-by-piece, we have to consider two issues. First, the data transfer should be optimized. Second, if neighborhood operators are executed, the boundary conditions should be handled properly. This means that we have to be aware of the radius of information required to complete the operations, and use an overlap in the mapping phase at least as large as this radius. However, in some cases, when a global propagating-type operator is executed (e.g., hole filling), a simple overlap is not sufficient, and multiple scans are required. These issues are analyzed in a later section.
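The overlap rule above can be illustrated with a minimal Python sketch (not part of the original architectures; the function names `process_tiled` and `local_max3` are my own, and the 3×3 local maximum merely stands in for an arbitrary radius-1 neighborhood operator). Each tile is extended by the operator's radius of information on every side, processed, and only its valid interior is kept, so the stitched result matches processing the whole image at once:

```python
import numpy as np

def local_max3(a):
    """Example radius-1 neighborhood operator: 3x3 local maximum
    (edge-replicated at the image border)."""
    p = np.pad(a, 1, mode='edge')
    h, w = a.shape
    return np.max([p[i:i + h, j:j + w] for i in range(3) for j in range(3)],
                  axis=0)

def process_tiled(image, tile_h, tile_w, radius, op):
    """Emulate a small physical array processing a large image tile-by-tile.
    Tiles are extended by `radius` pixels (the radius of information) so that
    a neighborhood operator sees valid data across tile boundaries."""
    H, W = image.shape
    out = np.zeros_like(image)
    for y0 in range(0, H, tile_h):
        for x0 in range(0, W, tile_w):
            # extend the tile by the radius on every side, clipped at edges
            ys, ye = max(0, y0 - radius), min(H, y0 + tile_h + radius)
            xs, xe = max(0, x0 - radius), min(W, x0 + tile_w + radius)
            result = op(image[ys:ye, xs:xe])
            # keep only the valid interior of the processed, extended tile
            yk, xk = min(H, y0 + tile_h), min(W, x0 + tile_w)
            out[y0:yk, x0:xk] = result[y0 - ys:yk - ys, x0 - xs:xk - xs]
    return out
```

Note that this covers local neighborhood operators only; a global propagating operator such as hole filling would still require multiple scans over the tiles, as stated above.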
The properties of the physical processor array depend on the image-sequence type and the required operations. Here we consider early image processing of medium- or high-resolution analog or digital video images. The images are read out from an imager sequentially, line-by-line. The read-out pixel train constitutes a standard analog or digital video flow. By applying an elongated physical processor array, exactly as long as one image line, the array can be fed directly with the pixel train coming from the imager. In this way, a long and narrow segment of the image, constructed from a few consecutive image lines, is mapped onto the processor array at a time (Figure 19). The processed image segment moves from top to bottom in discrete steps. Neighboring segments overlap each other to handle the boundary conditions. In this way, both the I/O and the boundary problems are handled properly.
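The segment stepping can be sketched as follows (a simplified Python model of my own, not the actual hardware; the name `stream_bands` is an assumption). The line-by-line pixel stream is grouped into bands of `band_h` lines, with consecutive bands sharing `2*radius` lines, so that after processing, each band contributes `band_h - 2*radius` valid output lines:

```python
def stream_bands(lines, band_h, radius):
    """Group a line-by-line pixel stream into overlapping horizontal bands,
    emulating the elongated processor array fed directly by the imager.
    Consecutive bands overlap by 2*radius lines; the first and last band of
    a frame need separate boundary handling at the image border."""
    buf = []
    step = band_h - 2 * radius   # vertical step of the processed segment
    for line in lines:
        buf.append(line)
        if len(buf) == band_h:
            yield list(buf)
            buf = buf[step:]     # keep the overlapping tail for the next band
```

For example, with 4-line bands and a radius of 1, a 12-line frame is covered by bands of lines 0–3, 2–5, 4–7, and so on, each step reusing two already-buffered lines.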
High resolution image sensor
Processed image segment
Physical processor array
Figure 19. Mapping the high resolution image onto an elongated physical processor for performing early image processing
The first architecture, described in Section 3.2, applies mixed-signal cores. Its specialty is that it can process analog video signals on-the-fly, without digitization. The architecture was proposed and patented by me.
The second architecture, called CASTLE, is described in Section 3.3. It is based on emulated digital CNN cores. This scalable architecture is designed to process high-resolution digital video flows on-the-fly, or to process images stored in a memory. The architecture was proposed by me. One of its versions was implemented as a full-custom digital ASIC. Its derivatives are still used both in academia and in industry.
3.1.2 Virtual processor arrays using foveal approach
It is a well-known phenomenon that in most scenes the high-level information content of an image is concentrated in one or a few areas (regions of interest, ROIs), rather than being equally distributed all over the image. Foveal processing takes advantage of this fact by focusing attention (spending computational resources) on the relevant areas only. Naturally, it assumes an appropriate ROI identification strategy in the early image processing phase.
Human vision is also based on this phenomenon. Our eyes have roughly a 210° visual field with varying sensor density (Figure 20). The periphery of the retina is a low-density monochromatic area. In the periphery, the human visual system can identify spatial and temporal irregularities (high-contrast patterns, sudden movements) even under low-light conditions.
The fovea is located in the central area of the visual field. Only this part of the visual field is sensed in colorful, high-contrast, high-resolution detail. It occupies roughly 3° of the visual field and contains high-density color sensors. The output of the fovea is thoroughly analyzed by the further stages of the human visual pathway.
Visual field: 210°
Figure 20. Simplified view of human visual field
We feel that we can see a highly detailed, colorful image in our entire visual field. This is achieved in such a way that the fovea – the only area which can perceive this information – jumps from one point of interest to another. This is called saccadic eye movement. In contrast, artificial vision systems cannot physically move this quickly due to mechanical limitations, or in many cases they do not move at all. To be able to process foveal areas, these vision systems use high-resolution, addressable, zoomable CMOS image sensors, which make it possible to read out multiple windows of different sizes, even with different resolutions (scales). In this way, a vision system applying the foveal approach can identify and track moving objects by zooming in or out on the relevant image parts, according to the scene changes (Figure 21).
Figure 21 labels: ADCs; navigating, zoomable active fovea
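The ROI selection feeding such a windowed readout can be sketched in Python (a simplified illustration of my own, not the sensor's actual interface; the names `salient_bbox` and `read_window` are assumptions, and a real system would extract one box per detected object rather than a single box). A coarse saliency map from the early processing phase is thresholded, a bounding box is grown by a safety margin, and only that window is read out at full resolution:

```python
import numpy as np

def salient_bbox(saliency, thresh, margin, shape):
    """Bounding box (y0, y1, x0, x1) of all saliency values above `thresh`,
    grown by `margin` pixels and clipped to the image `shape`.
    Returns None if nothing is salient."""
    ys, xs = np.nonzero(saliency > thresh)
    if ys.size == 0:
        return None
    H, W = shape
    return (max(0, ys.min() - margin), min(H, ys.max() + 1 + margin),
            max(0, xs.min() - margin), min(W, xs.max() + 1 + margin))

def read_window(frame, box):
    """Emulate the addressable sensor readout of a single ROI window."""
    y0, y1, x0, x1 = box
    return frame[y0:y1, x0:x1]
```

Tracking then amounts to recomputing the box on each frame and re-centering (or rescaling) the readout window accordingly.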