
Geocoding as Accumulating Intangible Spatial Asset

workplaces of a certain number of employees. We are aware that all data measured, sensed, registered and published carry varying levels of uncertainty. A method for aggregating additional data of different types from different sources is also outlined. In this paper, however, we use only a single source, so that the method is examined on data of uniform reliability. Our area of interest is the municipality of Budapest, partitioned into 603 hexagonal cells of ca. 1 km² each.

Materials and methods

The starting point of our investigation is a publicly available address list of the registered offices and premises of VAT-registered organizations (NAV 2017). The list also contains the number of employees registered per registered office. For organizations with a single headquarters and several premises, we assume an average number of employees per premises. Hence, where no other premises are registered, we take the registered office as the workplace and assume that all employees work at the headquarters. In every other case the registered premises are the workplaces, and the number of employees at each is the total number of employees divided by the number of premises registered.
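The workplace rule described above can be sketched as follows. This is a minimal illustration, not the transformation actually used in the study; the names `Workplace` and `derive_workplaces` and the input layout are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Workplace:
    address: str
    employees: float


def derive_workplaces(office_address, premises_addresses, total_employees):
    """Apply the workplace rule to one organization (hypothetical helper).

    No premises registered: the registered office is the single workplace
    and holds all employees. Otherwise every registered premises is a
    workplace with an equal (average) share of the employees.
    """
    if not premises_addresses:
        return [Workplace(office_address, float(total_employees))]
    share = total_employees / len(premises_addresses)
    return [Workplace(address, share) for address in premises_addresses]
```

With 12 employees and three premises, each premises is assigned 4 employees; with no premises, all 12 are assigned to the registered office.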

We evaluated ca. 92000 workplaces of ca. 81000 VAT-registered organizations with ca. 707000 employees altogether. Organizations with an unknown or zero number of employees were excluded. Source data were prepared with custom transformation jobs in the Pentaho Data Integration application, mainly using regular expressions in JavaScript to extract relevant text and eliminate common inconsistencies and data quality errors.
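As an illustration of the kind of regular-expression cleaning mentioned above, a minimal sketch follows. It is written in Python rather than in the JavaScript steps of Pentaho Data Integration used in the study, and both rules (collapsing whitespace, expanding the abbreviation "u." to "utca") are hypothetical examples, not the actual rule set.

```python
import re


def normalize_address(raw):
    """Hypothetical cleaning step for a single address string."""
    # Collapse runs of whitespace and trim the ends.
    cleaned = re.sub(r"\s+", " ", raw).strip()
    # Expand the standalone abbreviation "u." (utca = street).
    cleaned = re.sub(r"\bu\.(?=\s|$)", "utca", cleaned)
    return cleaned
```

A real rule set would need many more patterns, and each would have to be checked against the register to avoid rewriting valid street names.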

Offline sources were also employed (Ráday M. 2013). Addresses were geocoded with a modified MMQGIS plugin (Jankó J. A. 2016). However, approximately 1% of addresses could not be geocoded due to several types of syntactical discrepancies.

The resulting 2D point cloud (representing both the location of each workplace and the number of employees there) formed the basis of the spatial analysis. The number of workplaces and the sum of employees were aggregated for each hexagonal cell (Fig. 1.), and the attributes are presented in natural breaks classes (Jenks G. F. 1967).
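The per-cell aggregation can be sketched as below. The sketch assumes that each geocoded point has already been assigned the identifier of the hexagonal cell containing it (for example by a point-in-polygon join in a GIS); the function name and input layout are illustrative only, and the natural breaks classification is not shown.

```python
from collections import defaultdict


def aggregate_by_cell(points):
    """Count workplaces and sum employees per hexagonal cell.

    points: iterable of (cell_id, employees) pairs, one per workplace.
    Returns two dicts keyed by cell_id: workplace counts and employee sums.
    """
    counts = defaultdict(int)
    sums = defaultdict(float)
    for cell_id, employees in points:
        counts[cell_id] += 1
        sums[cell_id] += employees
    return dict(counts), dict(sums)
```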

The resulting features also hold a geocoding quality attribute based on the consistency and completeness of the address in the register. This attribute helps both in manually improving quality and in automatically assigning a level of reliability to every derivative, based on the actual level of certainty.

In the next step, a distance matrix to be used as a weight was created for numerically describing the hexagonal cells.

where D_hw is the matrix of linear distances l_hw between the hexagon centroids (h) and all workplaces (w), in ascending order (after the inverse probability P_hw^-1 is applied); k is either 5000 or 3000 and bounds the index i of the k workplaces w_i closest to hexagon centroid h_j. The probability P_hw^-1 represents factors that bring the linear distance closer to the real-world distance. Such factors could be natural or artificial barriers such as watercourses and water surfaces, hedges, parks, large industrial sites etc., where means of transportation (for example) are limited or absent. The value matrix (P_hw) could also be determined from other GIS sources such as topographic maps, zoning, ferry routes etc. In our study we used the uniform value P=1. A more realistic distance matrix could also be compiled from routes based on a transportation database, but that is beyond the scope of our current study.
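A minimal sketch of selecting the k closest workplaces for one hexagon centroid follows. It assumes planar (projected) coordinates in metres and reads the inverse probability as a division of the linear distance by P, which leaves the distance unchanged for the uniform value P=1 used in the study; the function name and this reading are assumptions.

```python
import math


def k_nearest(centroid, workplaces, k, p=1.0):
    """Return the k smallest (distance, workplace_index) pairs, ascending.

    centroid: (x, y) of the hexagon centroid.
    workplaces: list of (x, y) workplace coordinates.
    p: uniform probability factor; distance is divided by p (assumption).
    """
    cx, cy = centroid
    ranked = sorted(
        (math.hypot(x - cx, y - cy) / p, index)
        for index, (x, y) in enumerate(workplaces)
    )
    return ranked[:k]
```

Sorting all ca. 92000 workplaces for each of the 603 centroids is simple but not the fastest option; a spatial index would reduce the work if performance mattered.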

We intended to numerically describe each hexagonal cell based on the utilization of the transport network by employees commuting to workplaces. This directly indicates the necessary level of transportation development, and it also indicates the need for further socio-economic infrastructure, catering being one of many examples.

where the Detailed Index value I of hexagon cell j in scenario s is the sum of the products of the number of employees at workplace w_i and the inverse distance between workplace w_i and hexagon centroid h_j. Base data E is the number of employees at workplace w_i. The inverse distance (weight) expresses the relevance of that number of employees: the greater the distance from the cell to workplace w_i, the lower the relevance. The quality factor Q_i depends, for example, on the reliability of the employee information of each workplace w_i. Such data may not be 100% accurate; however, we used the uniform value Q=1.
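The verbal definition above can be sketched in code as a sum of employees weighted by inverse distance over the k closest workplaces. The function name is illustrative, and the lower bound `min_dist` (which avoids division by zero when a workplace coincides with the centroid) is an assumption added for the example, not part of the definition.

```python
import math


def detailed_index(centroid, workplaces, k, q=1.0, min_dist=1.0):
    """Detailed Index of one cell: sum of Q * E / D over the k closest workplaces.

    centroid: (x, y) of the hexagon centroid h_j.
    workplaces: list of (x, y, employees) tuples.
    q: uniform quality factor (Q=1 in the study).
    min_dist: assumed floor on the distance, in coordinate units.
    """
    cx, cy = centroid
    ranked = sorted(
        (max(math.hypot(x - cx, y - cy), min_dist), employees)
        for x, y, employees in workplaces
    )
    return sum(q * employees / dist for dist, employees in ranked[:k])
```

For workplaces with 100 employees at distances 10 and 20 and one with 5000 employees at distance 1000, k=2 gives 100/10 + 100/20 = 15, and k=3 adds 5000/1000 = 5 to give 20, which shows how a larger k lets distant but large workplaces contribute.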

The cell containing the highest number of workplaces has ca. 2400 of them.

In the first scenario (a) we counted 5000 neighboring workplaces to determine each cell’s Detailed Index value. In this case the Index is composed of the workplaces within the cell, with a significant effect from the nature of neighboring cells. In the second scenario (b) we counted only 3000 neighboring workplaces to determine the cell Index values. In this case the Index is composed of the workplaces within the cell, with less effect on dense cells from the nature of neighboring cells. Note also that the Index values of cells with very few workplaces are still largely determined by neighboring cells.

Phenomena happening in space also happen in time, although the time scale may differ. In our example commuting is probably more frequent in the early morning and late afternoon, but the same probability does not apply to workplaces running three shifts with overlapping rotation. Therefore, a probability determined by a factor such as the (unified) sectoral classification of the organization’s economic activities may apply.

If we also employ time schemes in the process of diagnosing the cells, and we assume that time schemes depend on the nature of the socio-economic activity, then we can expand equation (2):

where probability P depends on both the socio-economic activity w_i and the personal behavior of individual employees e; T is the time scheme vector of socio-economic activity w_i. The intensity and impact of commuting differ not only within a daily cycle but may also change over the week and from season to season. The value of T can be almost 1 if w_i represents a general office open nine to five and t represents a summertime weekday morning. But T becomes 0 for the same workplace w_i if t represents a weekend morning, or if workplace w_i represents a skiing facility.
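One way to read the time scheme is as a lookup of T by activity type and time slot. The sketch below assumes that P and T enter the index as multiplicative factors next to the inverse distance; that form, the function name, and the activity and time labels are assumptions for illustration only.

```python
def time_weighted_index(workplaces, time_scheme, t, p=1.0, q=1.0):
    """Sum of E * Q * P * T / D over workplaces (assumed multiplicative form).

    workplaces: list of (distance, employees, activity) tuples.
    time_scheme: dict mapping (activity, time_slot) to a T value in [0, 1];
                 missing combinations are treated as T = 0.
    t: the time slot examined, e.g. "weekday_morning" (hypothetical label).
    """
    total = 0.0
    for distance, employees, activity in workplaces:
        t_value = time_scheme.get((activity, t), 0.0)
        total += employees * q * p * t_value / distance
    return total
```

With T = 1 for an office on a weekday morning and T = 0 for a skiing facility in the same slot, only the office contributes; on a weekend morning with no entry for the office, the index drops to 0.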

So far we have discussed cells as destinations of employees. However, commuting also has a starting location, which has an impact on the transportation network as well. Hence we could introduce an additional Base Index for numerically describing cells in the context of commuting.

where s’ indicates that the cell is treated as a starting location instead of a destination.

Base data e represents inhabitant x of the cell, where the total number of inhabitants is y. T is the time scheme vector of the phenomenon (z) examined, w_i in our case when eS represents employees. Weight S is a societal factor of inhabitant e depending on the phenomenon z. S=1 when commuting to the workplace is examined and e_x represents a full-time employee; S becomes almost 0 when e_x represents a jobseeker, and 0 for children if child employment is prohibited and the prohibition is enforced. However, S remains 1 for children if examination scheme z represents commuting to school as pupils. T also differs depending on whether t represents a spring weekday morning or a summertime weekend lunchtime.
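The societal factor S can be illustrated with a small lookup per examination scheme. The sketch assumes the Base Index of a starting cell is the sum of S over its inhabitants multiplied by a single T value; that form, the status labels, and the numeric weights (in particular 0.05 for "almost 0") are illustrative assumptions.

```python
def origin_base_index(inhabitants, societal_weight, t_value=1.0):
    """Assumed Base Index of a starting cell: T * sum of S over inhabitants.

    inhabitants: list of status labels, one per inhabitant of the cell.
    societal_weight: dict mapping status label to S for the scheme examined;
                     statuses not listed are treated as S = 0.
    """
    return t_value * sum(societal_weight.get(status, 0.0) for status in inhabitants)


# Hypothetical weights for two examination schemes.
COMMUTE_TO_WORK = {"full_time": 1.0, "jobseeker": 0.05, "child": 0.0}
COMMUTE_TO_SCHOOL = {"child": 1.0}
```

The same four inhabitants (two full-time employees, a jobseeker and a child) then yield roughly 2.05 under the work scheme and 1.0 under the school scheme.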

We now have a more realistic picture of the transportation network load of a cell:

where probability P indicates that departing for and arriving at workplaces do not occur at exactly the same time. When z represents commuting to the workplace at t weekday morning, leaving cell j for workplace w_i happens some time earlier than arriving at workplaces within the same cell j. This time difference (traffic load balance) in cell j depends on the distance between cell j and the remote home/workplace, if it is not the same. j’ and z’ indicate that this probability is not exclusively determined by the nature of the cell or simply by the phenomenon (z) examined, but is a complex behavioural, societal, socio-economic and infrastructural factor that differs somewhat from cell to cell and from investigation to investigation.

Results

We studied a single phenomenon (commuting of employees) that has an effect on the transportation network. We assumed uniform quality of source data, a uniform time scheme, uniform societal factors and uniform probability. We employed two examination schemes (s = a, b). Scheme a) (with the 5000 closest workplaces counted) resulted in cell Indices that are rather uniformly distributed (Fig. 2. a). Scheme b) (with the 3000 closest workplaces counted) emphasized local centres of workplaces more strongly (Fig. 2. b).

If we study the load on the transportation network, we must recognize that it is not determined by commuting employees only. Hence our Base Index (5) may change, taking a more aggregated form to incorporate additional determining factors such as: children commuting to schools; tourism and recreation, miscellaneous aspects of POIs; health facilities, administrative institutions; service, commerce, goods delivery; infrastructure maintenance, network development; adaptation to urbanistic, socio-economic, political changes; weather conditions; emergency, catastrophe; and the financing of all of these:

where examination scheme s may represent the best known method of spatial analysis, or may be a set of good methods containing one (sg) that best suits the goal. z’ is a subset of m appropriate phenomena having an effect on phenomenon n, while Z represents all phenomena in the comparative diagnosis of the area of interest (Jankó J. A. – Szabó Gy. 2013).

Equations (3) and (4) can be written in the following baseline form:

where B is the base data of phenomenon n. It is monetary data if phenomena of finance are investigated; it is a customized, quantitative and scaled numeric value if phenomenon n is a weather condition. W represents a weight depending on the nature of phenomenon n and the method of investigation s. Spatial distance, social situation and many other priorities can give rise to a weight. Probability P indicates that the actual value of base data B_ni may be altered by coincidence with another set of phenomena (n’) among all m included in equation (6).

Indices of single cells can indicate the priority and necessity of further, more detailed investigation. In addition, further aggregation of indices, along boulevards and avenues for example, can provide additional guidance in making decisions about solving issues raised by citizens, academic and professional contributors, and policy makers (Fig. 3. a). Budapest contains ca. 8200 streets with a total length of more than 4000 km. The large number of accumulated issues can then be handled in the order of the resulting priorities.

Incorporating sets of historical data may provide a better understanding not only of the current state but of current trends as well. Comparing recent data on registered offices with the data of year 2009 (KSH 2009) reveals a general decline in entrepreneurship. Despite the general decline, suburban areas still show growth, at least in the number of registered offices (Fig. 3. b).

If we also incorporated the procedure of data collection and registration on a regular basis, or even in real time, the method discussed above might deliver a dynamic, responsive and efficient corroboration machine.

Conclusion

Commuting to and within Budapest (if junior and senior citizens are excluded) affects approximately 40% of the population of Budapest. Commuting can take up roughly 2 hours daily, which is 8% of a day and 13% of the active period. If this significant amount of working time could be spent on taking care of junior and senior citizens, for example, quality of life (and hence quality of work) would presumably increase. Our study did not investigate the metropolitan area or the national level, but the proportion (40% employees, 60% family members of other demographic and social backgrounds) does not change significantly nationwide. We can conclude that geocoding and reasonable derivatives are capable of contributing to the national economy.
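The two percentages can be checked with simple arithmetic. The sketch assumes an active (waking) period of 16 hours per day, which is not stated in the text but is consistent with the 13% figure.

```python
COMMUTE_HOURS = 2.0
HOURS_PER_DAY = 24.0
ACTIVE_HOURS = 16.0  # assumption: 24 hours minus 8 hours of sleep

share_of_day = COMMUTE_HOURS / HOURS_PER_DAY    # 2 / 24
share_of_active = COMMUTE_HOURS / ACTIVE_HOURS  # 2 / 16

print(f"{share_of_day:.1%} of a day, {share_of_active:.1%} of the active period")
```

Under this assumption the shares are about 8.3% and 12.5%, in line with the rounded 8% and 13% quoted above.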

Following the trends among international stakeholders of the geographic information community, we can point out four observations. These observations relate to a) human, individual behavior; b) the nature of the data being accumulated; c) the capability of technological novelties and economic power; d) the contribution and responsibility of the technical and legal framework.

People in general tend to seek more and more profitable work to cover the expenses of basic needs and to pay for prestigious liabilities. People spend 60-70% of their waking hours (including unpaid commuting and unpaid lunchtime) earning what they then spend on their own (or their family members’ and fellows’) real or assumed needs, and on sleeping well. Other people live off the activity of the remaining 30-40% of the active part of the day, when shopping, entertainment and recreation take place. We can confidently state that every time and every place is a means of profit for one person or another. The reasonable sustainability of this phenomenon depends on savings, if any, and/or the earning capability of the well-educated next generation. Human activities generate vast amounts of data in pursuit of a sweet home in the proximity of an excellent school, a good workplace and an acceptable marketplace, where commuting is reduced while travelling is recreational and entertaining. Our everything that is called digital something proof of our mental, social, professional personality transmitted, collected,

We are not talking about big data here; we emphasize very big, very heterogeneous data, a fingerprint of a very hectic concatenation of affairs. It is secret and private while open and obligatory, both insular and linked. It is virtual and tangible; real and imaginary. It is machine readable, but not auto-comprehensible. Heterogeneity comes from many different directions. Phenomena not only occur but are perceived by observers of different standing, who register their perceptions with different mapping algorithms.

We can take it for granted that data is the new fuel of the 21st century. Finding the key may seem hopeless, but there are long-standing performers and newcomers alike who have already launched utilities for intelligent analytics. Giant media and online service providers, and IT and GIS consortiums, have already demonstrated that no data is too big to be handled in one way or another. But the question remains: what is it good for? Furthermore, who is it good for?

It is the responsibility of law makers to have private data kept secret and to enforce that public data is publicly available. It is the responsibility of experts to develop procedures that make data of a transitional type anonymous while keeping it applicable. We presented in our paper that aggregation of local data at different scales and different levels is suitable for meeting these requirements.

References

Jankó, J. A. (2016): MMQGIS Based Address Geocoding with Features Added (2.4). http://nostra.org/r/nf4z2rs

Jankó, J. A. – Szabó, Gy. (2013): Comparative Settlement Diagnosis. In: Lóki, J. (Ed.): Az elmélet és a gyakorlat találkozása a térinformatikában IV.: Térinformatika Konferencia és Szakkiállítás. Debreceni Egyetemi Kiadó, Debrecen. pp. 237–243.

Jenks, G. F. (1967): The Data Model Concept in Statistical Mapping. In: International Yearbook of Cartography 7. pp. 186–190.

KSH (2009): Gazdasági Szervezetek Regisztere. Központi Statisztikai Hivatal. http://portal.ksh.hu/pls/portal/vb.teaor_main.gszr_main2

NAV (2017): Adatbázisok. Nemzeti Adó- és Vámhivatal. http://nav.gov.hu/

Ráday, M. (Ed.) (2013): Budapesti utcanevek A-Z, Corvina Kiadó, Budapest.

Sikos, T. T. (2000): Marketingföldrajz, Területfejlesztési szakkönyvek, VÁTI, Budapest.