Probabilistic Language of Thought¶
This tutorial implements the FullBoolean grammar from Piantadosi et al. (2016) to demonstrate concept learning using probabilistic programming.
Prerequisites¶
This tutorial assumes you have completed the Introduction to FlipPy tutorial.
Learning Objectives¶
By the end of this tutorial, you will be able to:
- Understand how compositional concepts can be represented as programs
- Implement a probabilistic context-free grammar (PCFG) in FlipPy
- Perform Bayesian concept learning from labeled examples
What is the Language of Thought?¶
The Language of Thought (LoT) hypothesis proposes that human concepts are represented as structured, compositional expressions—similar to programs in a programming language. For example, the concept "yellow things" might be represented as a simple predicate:
λx. color(x) == yellow
While more complex concepts combine primitives using logical operators:
λx. (color(x) == yellow) AND (shape(x) == circle) # "yellow circles"
λx. NOT (size(x) == small) # "non-small things"
This tutorial shows how to:
- Define a grammar of possible concepts
- Specify a prior over concepts (preferring simpler ones)
- Learn concepts from examples using Bayesian inference
from flippy import infer, condition, keep_deterministic, mem, \
Categorical
from flippy.distributions import Dirichlet
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from frozendict import frozendict
The Prior over Concepts¶
We define concepts using a probabilistic context-free grammar (PCFG). The grammar specifies:
- What primitive predicates are available (color, shape, size)
- How predicates can be combined (AND, OR, NOT)
- The probability of each production rule
Here's the grammar in BNF notation:
START → "λx." BOOL
BOOL → "(" BOOL "or" BOOL ")" # Disjunction
| "(" BOOL "and" BOOL ")" # Conjunction
| "not" BOOL # Negation
| PREDICATE "(x)" # Base predicate
PREDICATE → COLOR_PRED | SHAPE_PRED | SIZE_PRED
COLOR_PRED → "color == yellow" | "color == green" | "color == blue"
SHAPE_PRED → "shape == triangle" | "shape == rectangle" | "shape == circle"
SIZE_PRED → "size == large" | "size == medium" | "size == small"
The probabilities for each rule are drawn from a Dirichlet distribution, which creates a bias toward simpler expressions (since rules that branch have multiple children, they're less likely to terminate quickly).
Key parameters:
inv_temp: Controls how strongly we prefer simpler conceptsmax_depth: Prevents infinite recursion by limiting expression depth
# The inverse temperature controls the sharpness of the right-hand side
# expansion probabilities. When rules that branch have lower Dirichlet weights
# than rules that do not, a higher inverse temperature leads to a bias towards
# shorter rules.
inv_temp = 1.0
# We set a maximum depth for expression, to avoid the generation of extremely large programs.
max_depth = 5
@mem
def expansion_dist(rhs_probs: dict) -> Categorical:
dirichlet_weights = [v**inv_temp for v in list(rhs_probs.values())]
probs = Dirichlet(dirichlet_weights).sample()
return Categorical(list(rhs_probs.keys()), probabilities=probs)
def simple_boolean():
def PREDICATE():
feature_dist = expansion_dist({
"SIZE": 2.0,
"SHAPE": 1.4,
"COLOR": 1.9,
})
attribute = feature_dist.sample()
if attribute == 'COLOR':
color_dist = expansion_dist({
"yellow": .6,
"green": .5,
"blue": 1.5
})
return f"(lambda obj: obj['color'] == '{color_dist.sample()}')"
elif attribute == 'SHAPE':
shape_dist = expansion_dist({
"triangle": 1.0,
"rectangle": 0.4,
"circle": 1.9
})
return f"(lambda obj: obj['shape'] == '{shape_dist.sample()}')"
elif attribute == 'SIZE':
size_dist = expansion_dist({
"large": 1.0,
"medium": 0.3,
"small": 1,
})
return f"(lambda obj: obj['size'] == '{size_dist.sample()}')"
def BOOL(depth):
condition(depth < max_depth)
bool_dist = expansion_dist({
"or": 1.,
"and": 0.25,
"not": 2.0,
"PREDICATE": 1.6,
})
symbol = bool_dist.sample()
if symbol in ("or", "and"):
expr = f"({BOOL(depth+1)} {symbol} {BOOL(depth+1)})"
elif symbol == "not":
expr = f"not {BOOL(depth+1)}"
elif symbol == "PREDICATE":
expr = f"{PREDICATE()}(x)"
# Avoid some trivial cases
condition("not not" not in expr)
return expr
def START(depth):
return f"lambda x: {BOOL(depth+1)}"
expr = START(0)
return expr
We can look at some examples from the above prior.
Note: The above prior can generate very large programs if the probabilities of branching non-terminals like or/and are high enough. For that reason, we use LikelihoodWeighting instead of SamplePrior, which we would ordinarily use.
def print_sorted_dist(dist, top=None):
sorted_dist = sorted(
[(s, dist.prob(s)) for s in dist.support],
key=lambda pair: pair[-1],
reverse=True
)
print("Prob\tExpression")
if top is not None:
sorted_dist = sorted_dist[:top]
for expr, prob in sorted_dist:
print(f"{prob:.4f}\t{expr}")
dist = infer(method="LikelihoodWeighting", samples=30, seed=42)(simple_boolean)()
print_sorted_dist(dist)
Prob Expression 0.2941 lambda x: (lambda obj: obj['color'] == 'blue')(x) 0.1765 lambda x: not (lambda obj: obj['color'] == 'blue')(x) 0.1765 lambda x: not (lambda obj: obj['size'] == 'large')(x) 0.0588 lambda x: (lambda obj: obj['size'] == 'medium')(x) 0.0588 lambda x: not (lambda obj: obj['shape'] == 'circle')(x) 0.0588 lambda x: (lambda obj: obj['shape'] == 'circle')(x) 0.0588 lambda x: ((lambda obj: obj['color'] == 'blue')(x) or ((lambda obj: obj['color'] == 'blue')(x) and ((lambda obj: obj['shape'] == 'triangle')(x) and (lambda obj: obj['size'] == 'small')(x)))) 0.0588 lambda x: (lambda obj: obj['size'] == 'large')(x) 0.0588 lambda x: ((lambda obj: obj['size'] == 'small')(x) or (lambda obj: obj['shape'] == 'triangle')(x))
Performing Inference¶
Now we can perform Bayesian concept learning. Given labeled examples (objects marked as positive or negative instances of a concept), we infer which concept best explains the data.
The inference process:
- Sample a concept from the grammar prior
- Evaluate the concept on each example
- Condition on the concept matching the labels (with some noise tolerance)
We use MetropolisHastings for inference because the space of possible concepts is large and continuous exploration works better than enumeration.
# To make the concept executable, we need a deterministic transformation of it
@keep_deterministic
def convert_to_executable(func):
# Because we are returning a function, it similarly must be marked as deterministic.
return keep_deterministic(eval(func))
@infer(method="MetropolisHastings", samples=2000, seed=1234, burn_in=2000)
def posterior(pcfg, data):
p_correct_label = 0.95
concept = pcfg()
executable_concept = convert_to_executable(concept)
for example, value in data:
condition(p_correct_label if executable_concept(example) == value else 1 - p_correct_label)
return concept
The concept with highest probability for this example is color==yellow, the intended concept. There are a number of other more complex hypotheses that are also consistent with the data -- you might see color==yellow and color==yellow or extraneous predicates like color==yellow or size==large. You'll also see simple hypotheses that are inconsistent with the data, like size==large.
examples = (
(frozendict({'color': 'yellow', 'shape': 'circle', 'size': 'small'}), True),
(frozendict({'color': 'blue', 'shape': 'rectangle', 'size': 'medium'}), False),
(frozendict({'color': 'yellow', 'shape': 'triangle', 'size': 'large'}), True),
(frozendict({'color': 'green', 'shape': 'circle', 'size': 'small'}), False)
)
dist = posterior(simple_boolean, examples)
print_sorted_dist(dist, top=10)
Prob Expression 0.5155 lambda x: (lambda obj: obj['color'] == 'yellow')(x) 0.1485 lambda x: (lambda obj: obj['size'] == 'large')(x) 0.1040 lambda x: ((lambda obj: obj['size'] == 'large')(x) or (lambda obj: obj['color'] == 'yellow')(x)) 0.0590 lambda x: (lambda obj: obj['shape'] == 'triangle')(x) 0.0425 lambda x: ((lambda obj: obj['color'] == 'yellow')(x) or (lambda obj: obj['color'] == 'yellow')(x)) 0.0260 lambda x: not (lambda obj: obj['color'] == 'blue')(x) 0.0245 lambda x: (lambda obj: obj['shape'] == 'circle')(x) 0.0225 lambda x: not (not (lambda obj: obj['color'] == 'yellow')(x) or not (lambda obj: obj['color'] == 'yellow')(x)) 0.0100 lambda x: (lambda obj: obj['size'] == 'small')(x) 0.0080 lambda x: not (not (lambda obj: obj['color'] == 'yellow')(x) and not (lambda obj: obj['color'] == 'yellow')(x))
Interpreting the posterior:
The results show a distribution over hypothesized concepts, ranked by probability:
color == 'yellow'(~52%): The simplest concept that explains all examplessize == 'large'(~15%): Incorrect but somewhat consistent (overfits to one example)- More complex concepts: Logically equivalent to simpler ones or partial fits
The model correctly identifies "yellow" as the most likely concept because:
- It's simple (preferred by the prior)
- It explains all examples (yellow objects are positive, non-yellow are negative)
Notice that some complex but equivalent concepts also appear (e.g., yellow or yellow). In a fuller model, we might add transformations to simplify these.
Summary¶
In this tutorial, you learned:
- Language of Thought represents concepts as compositional programs
- PCFGs define a prior over concept complexity and structure
- Bayesian inference finds concepts that balance simplicity with data fit
- The model exhibits Occam's razor: preferring simpler explanations
This framework has been used to model human concept learning, rule discovery, and language acquisition. Extensions include:
- Noisy memory and attention
- Learning the grammar itself (grammar induction)
- Hierarchical concepts with abstraction