Exploring the Status of Languages of France on the Internet:

3. Language recognition algorithms have a tendency to overestimate English. For an overview of possible linguistic biases of this study, it should be noted that Czech would have more pages than Korean or that Chinese, with an online population almost 10 times larger than German or Russian, would have fewer pages than either.

In spite of its limitations, W3Techs represents the most attractive source of indicators available today, and one has to acknowledge with satisfaction the progress it represents.

42 According to http://news.netcraft.com/archives/2012/11/01/november-2012-web-server-survey.html, 650 million websites would be active.

Evolution of Search Engines since 2008

Since 2008, the variety of search engines has been on the decrease and the generic search engines remaining on the market (Google, Yahoo, Bing/Live Search, Ask, AOL, Lycos, Excite, Exalead, Teoma⁴³) have all evolved in the same way:

• significant reduction of the percentage of the indexed part of the Web (from over 80% to less than 10% of the total space);

• total loss of credibility of the published figures of the number of occurrences of a given keyword;

• increased “intelligence” in the handling of search keywords, which has led to the loss of the direct association between keyword and results (through the introduction of automatic translation, of synonyms or of supposed spelling correction).

Since 2008, and increasingly as time has passed, the size of the Web has become uncontrollable and can be considered, in practical terms⁴⁴, as approaching infinity. As a result, search engines can no longer afford to conduct a comprehensive, systematic crawl of the entire Web⁴⁵, and they now cover an estimated less than 5% of the total pages⁴⁶.

Together with the rise of Web 2.0, the nature of the Web has changed, and static pages (simple HTML) have left more room for dynamic pages. In the same period, the Internet topology for languages has changed radically, with the stabilization of the relative growth of the initially well-represented languages (Western languages, in particular) and the rise of Asian languages and, more recently, of Arabic. In parallel, the nature of content has evolved, with a reduced proportion of text data and an increased share of audio and especially video (by the end of 2012 video traffic accounted for over 50% of the total, with forecasted growth of this percentage⁴⁷).

43 According to http://en.wikipedia.org/wiki/Web_search_engine, Google would have a little more than 80% of the market, although with a downward trend since 2010.

44 That is, in terms of the computing cost of systematic crawling.

45 In 2008, the figure of 127 billion pages was provided by various sources (especially the search engine CUIL, now gone, which claimed to crawl the entire Web). See the webpage maintained by archive.org: http://web.archive.org/web/20100916001435/http://www.cuil.com/.

46 If there is one area where lack of transparency is the rule, it is the size of the indexes. Apparently several tricks are used (especially not exploring all pages within a site) to hide this limitation, which of course does not apply in the same way to all languages and is a handicap for minority languages.

47 “Cisco Visual Networking Index - Forecast and Methodology, 2010–2015”, http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-481360_ns827_Networking_Solutions_White_Paper.html.

Under these conditions, the percentage of pages of fixed text in a given language may remain an indicator of some significance but, faced with a more complex reality, one must create other indicators that better reflect this complexity, and accept working with the partial elements of a mosaic rather than with limited integral indicators.

In 2010, Union Latine, in collaboration with FUNREDES, made a first attempt to capture this complex reality by collecting the elements on the state of languages that are dispersed across the Web. This first experiment did not lead to a specific publication, given the difficulty of relying on a broad base of data and the approximate nature of most of the data collected; however, it contributed indirectly to a number of publications⁴⁸.

The 2010 work looked in all directions across languages so as to detect signals of change, and explored the best-known applications (social networks, blogosphere, peer-to-peer, search engines, and of course VOIP⁴⁹, Wikipedia and YouTube). Often, in the absence of alternatives, the data were built from the geographical origin of the traffic to those applications (as opposed to working directly on languages). These principles, which helped to outline a different methodological approach, were amplified and extended in this new study.

Methodology for French

The proposed methodology is based on and expanded from the 2010 approach and applied to a given language, French⁵⁰, which is compared to other major languages used on the Internet. In this new study the approach is extended to the largest possible number of recognized application areas. The approach based on the geographic origin of traffic, which gathers data on the linguistic communities that use a given application, is supplemented by intensive and systematic research on the linguistic content of those applications.

This approach is therefore based on a substantial and open-ended effort: collecting information about languages in Internet applications, compiling and organizing the data, assessing and validating them, carefully cross-checking them against various specific studies and, ultimately, putting them into perspective in order to derive trends and composite indicators that report on emerging developments⁵¹.

48 See [Pimienta 2012] and [ITU 2010].

49 Voice over IP like, for example, Skype.

50 This does not completely rule out working with other languages, because it will sometimes be necessary to offer comparisons.

The key methodological elements of this approach are:

• Considered spaces and applications;

• The selection and analysis of the sources of information on the place of French related to the selected spaces and applications;

• Demolinguistic data that are used to put into perspective the data collected;

• The attempt to bring together the scattered data into a synthesis that meaningfully informs on the place of French on the Internet and puts it in perspective.

Considered Spaces and Applications

Around one hundred applications and spaces of the Internet have been identified as likely to shed light on the role of language on the Internet. Thirty of them could not provide consistently reliable and usable data; these have been temporarily excluded and will be integrated as soon as the available data permit.

The chosen initial list (including those that were discarded) is presented and organized in the following tables, by application type or space.

Infrastructures: Internet users per language; Computers per country; Websites per population; Websites per users; Internet penetration; High bandwidth penetration

Online Books: Virtual library; GoogleBooks; Amazon; GoogleScholar

Telephones & Tablets: Smartphones; Tablets G3; Data Sims; 3G

Messaging & IP Telephone: Skype; QQ; AIM; ICQ; Yahoo!

Automatic translation traffic

Online linguistic tools

51 To give an order of magnitude, the 2010 work made it possible to reference some thirty links (e.g. http://socialmediastatistics.wikidot.com/, which gave the distribution of “tweets” or “Facebook” pages per language). The goal here was to collect, compare and evaluate at least a hundred links as potential sources so as to expand coverage. In fact, hundreds of possible sources were evaluated and nearly 200 of them were selected, analyzed and classified.

File Download & P2P: Megaupload; Rapidshare; Filefactory; Depositfiles; Hotfile; Uploading; Uploaded; Fileserve; Mediafire; Gigasize; Bitshare; 4shared

Social Networks: Wikipedia; Facebook; Twitter; LinkedIn; Viadeo; Xing; Yahoo; Google+; Windows Live Profile; MySpace; Livejournal; Secondlife; Ning; Tuenti; Hi5; Orkut; Badoo; Tumblr; Instagram; Sonico; Qzone; YouTube; Googleplay

Blogs: Technorati; Blogs; WordPress; Google Blogs; Blogger; Blogspot; Sina Weibo; Technorati

Webpages Counting: W3 Counters; WebBoar; Internet Archive; Google; Baidu; Wolfram Alpha; MSN; Bing; Foofind; Rtbot; CC Search; Altavista; Yandex; Wikia Search; Open Directory Project (DMOZ)

Email: Gmail; Hotmail; Yahoo; Yandex Mail; Icloud; Outlook

Search Engines: Google; Bing; Yahoo!; Yandex; Baidu; Others

Browsers: Chrome; Firefox; IE; Safari; Opera; Others

Operating Systems and Application Suite: Windows; Linux; Mac; Ios; Android; Openoffice; Microsoft Office; Others

Analyzed Sources

A significant effort was made to collect sources on the presence of languages on the Internet in general, and of French in particular. This chapter is dedicated to the sources that fed the results presented.

Considerations about Sources

The lack of output from FUNREDES and LOP has left a landscape in which the vast majority of sources are business intelligence or online marketing companies, which release some partial information for free (usually without revealing the methodology) as a means of promoting their paid services. Alongside the traditional and reliable statistical sources that provide information useful for work on languages (UN, UNESCO, ITU, OECD, EU, etc.), which nevertheless have the same difficulty in addressing the issue of languages on the Internet, some consultants or experts dedicated to gathering as much information as possible about a given space or application share their results on a website in order to promote their expertise. Analysis of those professional sites is often very useful. Gathering this set of broad but imperfect information makes it possible, with method and with some difficulty, to take stock by segmenting the problem by application (Wikipedia, Twitter, YouTube, Facebook, ...) or by space (search engines, email, etc.).

Another limitation to consider in this type of analysis, however, is that the degree of globalization of spaces and applications is becoming more variable: increasingly, countries or regions adopt their own applications to the detriment of the major world applications (such as Baidu instead of Google in China, and Vkontakte and Odnoklassniki instead of Facebook in Russia). This factor must be taken into account when quantitative data on the use of languages (or countries) are established for a given application or space. Thus, concluding that French is, for example, the fourth or the first language in non-professional social networks on the basis of its respective results in Facebook or Viadeo does not make sense, since none of these applications has even penetration across all geographic areas, and thus across languages. To obtain a credible indicator of the ranking of French, each application of the non-professional social network type should be weighted according to its distribution in the world (total weight and relative presence in each country), a task that is beyond the reach of this study.

A simple overview would emphasize that the most globalized applications are Wikipedia, Twitter and YouTube⁵²; Facebook and Google are relatively globalized, with exceptions for some countries that use local search engines (China, Russia, Kazakhstan, Korea, etc.) or show different habits (Yahoo! is much more used than Google in Japan).

For most other applications and spaces, one needs to be careful and weigh the linguistic conclusions drawn from the results against possible biases in usage.

Source Selection

Since 2010, several hundred potential sources have been detected, analyzed and stored, and constant monitoring is performed through keyword searches and by following the external links of the sources.

Many of them were excluded for one of the following reasons:

• The field of study was too small or partial.

• They appeared too biased.

• The figures were out of date, or the dates within a given source differed significantly.

• The methodology used did not allow the data to be compared.

• The methodology used did not seem relevant, adequate or credible.

• There were doubts about the reliability of the source.

52 This is less true of YouTube, since its second-ranked competitor, Dailymotion, has a strong francophone presence; France Telecom is its majority shareholder, even if its name is not indicated, and it may in the future be taken over by Yahoo.

Around 200 sources of information (URL, articles, books or other) that could identify indicators of the presence of languages in different areas were finally selected, classified, evaluated and rated.

Source Classification

Each element of this sample was rated from 10 (excellent) down to 5 (average); those scoring less were automatically rejected. The ratings were based on several criteria: relevance, confidence, reach, transparency of method, etc.

The rating results are the following:

Rate      10    9    8    7    6    5   <5

Number     6    4   19   37   38   25   59

59 sources rated less than 5 have been kept for later evaluation.
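As a quick arithmetic check on the rating table, the counts can be tallied in a few lines of Python. This is only an illustrative sketch: the variable names are ours, and the only data used are the counts shown in the table.

```python
# Counts per rating, copied from the rating table; "<5" groups every lower score.
rating_counts = {"10": 6, "9": 4, "8": 19, "7": 37, "6": 38, "5": 25, "<5": 59}

BELOW_THRESHOLD = "<5"  # label for sources that scored less than 5

# Sources rated 5 or more, and sources set aside for later evaluation.
retained = sum(n for label, n in rating_counts.items() if label != BELOW_THRESHOLD)
set_aside = rating_counts[BELOW_THRESHOLD]
evaluated = retained + set_aside

print(retained, set_aside, evaluated)  # prints: 129 59 188
```

The total of 188 rated sources is in line with the figure of around 200 sources given above.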

For each source the following parameters were recorded:

• The last year of publication.

• Target (global, Francophonie, etc.).

• Whether the source is updated frequently or not.

• Type of source (e.g. meta-information).

• Area of application of the source (e.g. Facebook).

• Whether it is specific to the language or not.
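One possible way to hold these parameters, together with the rating, is a small record type. The sketch below is an illustration under assumptions, not the structure actually used for the study: the class name, the field names and the two sample entries are hypothetical placeholders.

```python
from dataclasses import dataclass

MIN_RATING = 5  # sources scoring less than 5 are set aside, as stated in the text


@dataclass(frozen=True)
class SourceRecord:
    """One evaluated source; every field name here is a hypothetical choice."""
    label: str                # free-text identifier of the source
    rating: int               # score on the 10-to-5 scale (lower means set aside)
    last_year: int            # last year of publication
    target: str               # e.g. "global" or "Francophonie"
    updated_frequently: bool  # whether the source is updated frequently
    source_type: str          # e.g. "meta-information"
    application_area: str     # e.g. "Facebook"
    language_specific: bool   # whether the source is specific to the language


def retained(records):
    """Return the records whose rating reaches the minimum threshold."""
    return [r for r in records if r.rating >= MIN_RATING]


# Placeholder entries, invented purely to exercise the code.
examples = [
    SourceRecord("placeholder A", 8, 2012, "global", True,
                 "meta-information", "Facebook", False),
    SourceRecord("placeholder B", 4, 2010, "Francophonie", False,
                 "article", "Wikipedia", True),
]

kept = retained(examples)
print([r.label for r in kept])  # prints: ['placeholder A']
```

A flat record of this kind would map naturally onto one row of the planned database of web links, but that mapping is our assumption.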

The sources will be made available in the future in a clearinghouse, a kind of database of web links, with the corresponding parameters, in order to maintain an observatory. Meanwhile, sources become obsolete so rapidly, and Internet pages are created and removed so dynamically, that it is not appropriate to mention all sources in this article.

Demographics and Demolinguistic Data

Truly monolingual countries are an exception; multilingualism is rather the rule, as for example in France, the United States, China or Cameroon (one of the countries with the greatest linguistic diversity in the world), or even in micro-states like Monaco, Malta or Singapore.

In almost every country in the world, the speakers of the different languages must therefore be counted if one wishes to transform data by country (which are the natural sources in most cases) into data per language (which are those needed for this study).

Most studies provide data by country or region, which is a weakness where language is concerned, and tend to extrapolate the results to the official language of each country⁵³ wherever language details are given. Such a simplification introduces significant errors⁵⁴.
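The difference between the two simplifications discussed here (attributing a country's figure entirely to its official language, or spreading it in proportion to the number of speakers) can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in this study: the function names are ours, the country and language labels and all numbers are invented placeholders, and normalizing the shares within each country is our own choice.

```python
from collections import defaultdict


def by_official_language(country_values, official_language):
    """Attribute each country's whole value to its single official language."""
    totals = defaultdict(float)
    for country, value in country_values.items():
        totals[official_language[country]] += value
    return dict(totals)


def by_speaker_share(country_values, speaker_shares):
    """Spread each country's value across languages in proportion to speakers.

    Shares are normalized per country (an assumption), so they need not sum to 1.
    """
    totals = defaultdict(float)
    for country, value in country_values.items():
        shares = speaker_shares[country]
        total_share = sum(shares.values())
        for language, share in shares.items():
            totals[language] += value * share / total_share
    return dict(totals)


# Placeholder data: two fictional countries and two fictional languages.
country_values = {"country_A": 100.0, "country_B": 50.0}
official_language = {"country_A": "lang_x", "country_B": "lang_y"}
speaker_shares = {
    "country_A": {"lang_x": 0.75, "lang_y": 0.25},
    "country_B": {"lang_y": 1.0},
}

official = by_official_language(country_values, official_language)
proportional = by_speaker_share(country_values, speaker_shares)
print(official)
print(proportional)
```

With these placeholder figures the official-language shortcut credits lang_x with 100 and lang_y with 50, whereas the proportional spread credits each with 75, which shows how strongly the choice of simplification can shift a per-language result.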

To compare French to other languages, it is necessary to establish reliable statistics for all languages of the world, or at least for those spoken by the great majority of people. Reliable statistics exist for some languages that are spoken in well-defined territories and enjoy significant development, provided also that they have an official status or some sort of protection and that their diaspora still speaking the language is well studied. This is the case for many languages spoken in Europe, for example Italian, Lithuanian, Polish and Catalan. But the task becomes complicated for languages that meet none of those conditions.

Difficulties with Languages without Supervision

To take examples that are meaningful to the French public, languages like Occitan or Franco-Provençal are spoken in developed areas and on fairly well-defined territories, yet the lack of a supervising institution affects the quality of the figures for the number of speakers.

Difficulties with Languages Occupying Large Territories

For some languages occupying extensive territories with diverse socio-economic conditions, there are relatively reliable statistics; this is the case of French or Spanish, for example, thanks to OIF and the Cervantes Institute respectively.

But for other languages with similar characteristics, such as English, Chinese or Portuguese⁵⁵, there are very significant differences between sources, particularly on “L2” speakers⁵⁶.

Demolinguistic Conflicting Sources

Divergent methodologies for counting speakers in the many demolinguistic studies add to the above-mentioned complications. Comparing all available studies on the number of speakers of all languages in the world in order to create a comparative “super-grid” is a huge task, beyond the resources of this study. Thus, it is unthinkable to set the results of different studies side by side without detailed analysis, because this would put on the same level figures obtained with non-comparable definitions or inhomogeneous methods.

53 One of the rare counterexamples is Wikipedia, which provides remarkable linguistic data: http://meta.wikimedia.org/wiki/List_of_Wikipedias/sortable.

54 Spreading the data in proportion to the number of speakers is of course another simplification (as the digital divide is not evenly spread in the population and migrants often have less access to the digital world); however, the errors produced by this simplification are lower by an order of magnitude.

55 Camões Institute, however, is starting to propose figures with an acceptable margin of error.

56 L1 is the notation used for mother tongue speakers. L2 is the notation used for speakers who master or fluently use a language without having it as their mother tongue.

Language Typology

An additional problem to be solved concerns the typology of languages. To come back to the example of Occitan, the question is: should speakers of Gascon be counted as speakers of Occitan or separately? If separately, should Provençal, Auvergnat, Languedocian, etc. also be counted separately? Opinions differ. Should the German language be taken as a whole or as a set of various dialects, sometimes very distant from each other? What to do with the Arabic language, also characterised by a large dispersion of dialects? Should we consider literary (or classical) Arabic only, or all of dialectal Arabic (Algerian Arabic, Levantine Arabic, Egyptian Arabic, etc.)? Should Calabrian be associated with Neapolitan, as Ethnologue suggests, or only northern Calabrian, with southern Calabrian attached to Sicilian, as Wikipedia suggests?

Second Language Speakers (L2)

But without doubt, the most acute dilemma is how to take into account “L2” speakers of vehicular languages. These people do not have that language as their mother tongue, but master it well enough for everyday use. While for some languages the number of L2 speakers may not have a significant impact on this study⁵⁷, it is clear that L2 speakers of English, French, Spanish or Russian greatly influence the results. Even if there are no L1 (native) speakers of English in Ghana, according to Ethnologue, the one million L2 speakers of English in this country seem to produce more content on the Internet than the 2.5 to 3 million speakers of Ewe living there. Likewise, although the population of Paraguay is mostly Guarani-speaking, the general population's fluency in Spanish (L2) has relegated Guarani behind Spanish on the Internet.

But there are many cases where more than one language has a vehicular role for certain populations (English and Swahili in East Africa, English and Hindi in India, French and Hausa in francophone Africa, etc.). In this case, what is the preferred vehicular medium for using the Internet? Hausa and Swahili seem to be relegated behind French and English, according to LOP studies and [Diki-Kidiri 2007], but this would be less and less the case for Hindi, for example.

57 As it relates to technologies that are not always in use among some populations of less developed countries; this is the case of Quechua or Swahili, which are also used as L2 languages.

Although there are some reliable data on L2 speakers of French or Spanish in countries where these languages are official (de jure or de facto), it is impossible to obtain statistics that are both accurate and mutually comparable on the use of the major languages of communication⁵⁸ outside the countries where these languages are official. However, it is common to see websites in English, or with an English version, in many countries where English is not an official language. Likewise, many websites are in French (or include pages in French) in Spain, Germany or the United States, so much so that previous studies from FUNREDES/Union Latine had shown that Spain or the United States produced more French content than all the countries of Francophone Africa combined.

Thus, given the complexity of the panorama, some choices have been made that are not entirely satisfactory, but that at least make it possible to achieve consistent and measurable results.

Demolinguistic Choices

To take the best account of all these elements, a number of choices have been made to reach the best trade-off between the homogeneity and the reliability of demolinguistic data.

In document Linguistic and Cultural Diversity in Cyberspace (pages 145-155)