This project develops a computational modeling framework for inducing logical representations of linguistic meanings by synthesizing large-scale behavioral datasets, containing acceptability and inference judgments, with continuous distributional representations derived from contemporary large language models. The framework is formulated as a neurosymbolic typed logic induction system that gives the analyst not only fine-grained control over different aspects of the grammar (what types of meaning are assumed to exist, how those meanings can be constructed, and how they can be combined) but also the ability to compare different grammatical assumptions in a rigorous, quantitative way. Beyond accelerating theory development through these capabilities, the framework provides a theoretically grounded route to understanding the inner workings of large language models, which are notoriously opaque and difficult to interpret.
To support the development of this framework, the project constructs and publicly releases a multilingual, lexicon-scale dataset capturing the argument-taking behavior of predicative expressions as well as lexical entailments among those expressions. This dataset will provide a foundation for inducing logical representations shared across languages that can be naturally extended (i) to cover phenomena beyond predicate-argument structure and/or (ii) to cover additional languages. This extensibility is made possible by a novel active learning approach developed under the project that interleaves state-of-the-art deep learning models of acceptability and inference with human annotation, efficiently capturing judgments across the entirety of a language's open-class lexical items. The dataset builds on the successes of the aforementioned MegaAttitude datasets developed by the PI, while moving well beyond what was possible during their initial development in terms of lexical and crosslinguistic coverage.