Preliminary Discussion Forum for the 3rd MIC Sorbonne workshop (Paris, Nov. 15-16, 2012)

New Standards for Language Studies
Nouveaux Standards pour les Sciences du Langage

Researchers (linguists, computer scientists, neurologists, psychologists, sociologists, logicians and philosophers) who are interested in the topics specified in this forum are welcome to join this preliminary discussion.

#1 27/03/12 20:44

Wlodarczyk A

Motivations - Methods - Goals


     More and more linguists today are developing an interest in applying computational intelligence to their research on languages. The methods of Interactive Linguistics aim at describing natural languages using data mining techniques elaborated within the framework of the new paradigm of computation known as Knowledge Discovery in Databases (KDD). Indeed, it is important to build or logically reconstruct (enhance, integrate and formalise) theories of language in order to conceive the meta-theoretical foundations which are necessary for making further progress in language studies. Interactive Linguistics is an attempt to provide the best research standards for linguistic science, following the example of the building of the semantic web in the field of information technology (IT).


      Two complementary approaches, mono- and multi-lingual, have been adopted in order to enhance the possibility of finding semantic features which would reveal knowledge general enough for building the future meta-ontology for linguistic semantics. Interactive Linguistics assumes interdisciplinary synergy involving scientific cooperation of linguists, psychologists, neurologists, logicians and computer science engineers. The task of linguists consists in an interactive (computer-aided) discovery of ontology-based definitions of feature structures. The KDD methods which have already been selected make it possible to render analysis more precise (fine-grained) using advanced technologies (and their combinations) such as algorithms of Decision Logic, Rough Set Theory and Formal Concept Analysis for symbolic data processing, on the one hand, and algorithms of Cluster and Factor Analyses for statistical data processing, on the other hand. The above-mentioned algorithms, together with database building tools, were implemented in the Semana software, which was designed especially for linguists.
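
As a rough illustration of the symbolic side, the sketch below computes rough-set lower and upper approximations over a toy table of annotated expressions. The attribute names, the data and the function names are invented for illustration only; nothing here is taken from Semana or is meant to describe how it is implemented.

```python
from collections import defaultdict


def indiscernibility_classes(table, attributes):
    """Group objects that share the same values on the chosen attributes."""
    classes = defaultdict(set)
    for obj, row in table.items():
        key = tuple(row[a] for a in attributes)
        classes[key].add(obj)
    return list(classes.values())


def approximations(table, attributes, target):
    """Return the (lower, upper) rough-set approximations of the target set."""
    lower, upper = set(), set()
    for cls in indiscernibility_classes(table, attributes):
        if cls <= target:   # class lies entirely inside the target
            lower |= cls
        if cls & target:    # class overlaps the target
            upper |= cls
    return lower, upper


# Hypothetical toy data: four expressions described by two invented features.
TABLE = {
    "e1": {"telic": "yes", "durative": "yes"},
    "e2": {"telic": "yes", "durative": "yes"},
    "e3": {"telic": "yes", "durative": "no"},
    "e4": {"telic": "no", "durative": "yes"},
}
PERFECTIVE = {"e1", "e3"}  # invented annotation of "perfective" expressions

lower, upper = approximations(TABLE, ["telic", "durative"], PERFECTIVE)
boundary = upper - lower
```

With this toy data, e1 and e2 cannot be told apart by the two features, so they fall into the boundary region: the chosen descriptors are not fine-grained enough to define the target set exactly, which is the kind of signal a linguist could use to refine a set of descriptors.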

Two complementary approaches:
      (A) Corpus linguistics (Text Mining), (B) Interactive Linguistics (KDD)

     The first attempt to use Semana was made in the framework of a two-year bilateral French-Polish project, CASK (Computer-aided Acquisition of Semantic Knowledge). Though very brief, this project brought positive results, making it possible to determine which data mining algorithms should be implemented and, in some cases, how they can be used by linguists. Some participants were able to collect a sufficiently representative number of examples of linguistic expressions and to develop sets of semantic descriptors interactively. It is therefore reasonable to expect that the ontological inquiry will make it possible to conceive more powerful description devices.

Case study reference:

   Ontological Issues for Modelling Aspect


#2 17/10/12 00:06

Wlodarczyk A

Re: Motivations - Methods - Goals

Dr Zielinska D wrote:

For Włodarczyk and for myself, the System_w is the set of statistical laws (distributions) found in language, which can be established objectively; finding them belongs to science. The Process_w is the mechanism which underlies linguistic production, which, in a given environment, has resulted in the creation of the System_w, and keeps creating (adjusting) it in a way which can be objectively measured, albeit in statistical terms. The outcome of the Process_w, as it turns out, depends on the current state of the System_w and the environment. Therefore Process_w is not an idealized System_w the way competence is idealized performance. Additionally, since language “is an interface to cognitive processes” (and has been constituted with them), the Process_w and (consequently) the System_w can eventually be implied by bio-cognitive-social laws, along with the history of its creation.

In your contribution, you mentioned statistics. Indeed, Corpus Linguistics, together with Quantitative Linguistics, is a continuation of the "data-oriented" trend in language studies. Data-oriented approaches in linguistics characterise the period when the proponents of the Generative hypothesis claimed to represent the properly "theory-oriented" (because "formal") approach. And I agree: statistically discovered knowledge is most relevant. Nonetheless, it does not seem to me that we have learned much about language structure itself within the Quantitative Linguistics framework. Corpus Linguistics is more promising because it utilises NLP-theoretically defined tasks and very powerful computer science tools. Hence, it has become possible to collect data in huge databases and to retrieve interesting information from these collections of data.

Interactive Linguistics is a natural extension of Corpus Linguistics. It starts at the point where text data (material knowledge) need to be transformed, mostly manually, into symbolic meta-data (attributive knowledge). In this framework, we use meta-theories and tools from research on Data Mining or, more generally, KDD (see also my previous post above). Obviously, we have also adopted very powerful statistical algorithms, which are implemented in Semana, but we are also using symbolic processing techniques such as
(a) Rough Set Theory (RST) together with Decision Logic (DL) devices for approximating objects and verifying our assumptions, and
(b) Formal Concept Analysis (FCA) algorithms for modelling and visualising what we cannot see without them.
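
To give an idea of what (b) involves, here is a minimal sketch of Formal Concept Analysis on a tiny, invented formal context (three verbs described by three made-up features). It enumerates concepts by brute force over attribute subsets, which is only workable for very small contexts; it is an illustration under these assumptions and not the algorithm used in Semana.

```python
from itertools import combinations


def extent(context, attrs):
    """Objects that have every attribute in attrs."""
    return frozenset(o for o, has in context.items() if attrs <= has)


def intent(context, objs, all_attrs):
    """Attributes shared by every object in objs."""
    shared = set(all_attrs)
    for o in objs:
        shared &= context[o]
    return frozenset(shared)


def formal_concepts(context):
    """Enumerate all (extent, intent) pairs by closing each attribute subset."""
    all_attrs = frozenset().union(*context.values())
    concepts = set()
    for r in range(len(all_attrs) + 1):
        for combo in combinations(sorted(all_attrs), r):
            objs = extent(context, frozenset(combo))
            concepts.add((objs, intent(context, objs, all_attrs)))
    return concepts


# Hypothetical formal context: objects (verbs) and their invented features.
CONTEXT = {
    "run": {"dynamic", "durative"},
    "know": {"durative"},
    "arrive": {"dynamic", "telic"},
}

concepts = formal_concepts(CONTEXT)
# Order concepts from the most general (largest extent) to the most specific.
for objs, attrs in sorted(concepts, key=lambda c: -len(c[0])):
    print(sorted(objs), sorted(attrs))
```

Each printed pair is a formal concept: a maximal set of objects together with exactly the features they all share. Ordering these pairs by inclusion of their extents yields the concept lattice that FCA tools typically draw as a diagram.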


