What is TABERTA?



I’d like to start a discussion around table retrieval (TLR) and how we can better leverage structured datasets (not just text) to improve language models.

Most recent progress in retrieval has focused on unstructured corpora. In contrast, large collections of relational tables (e.g., WikiDBs) were originally curated to support table search and dataset discovery, yet are often reduced to flattened text when used to train or fine-tune language models. This largely discards schema structure, relational cues, and the inductive biases that make tables valuable in the first place.

In our recent work, we revisit table retrieval by treating structure exposure as a first-class modeling choice. Using WikiDBs, we fine-tune bi-encoder models with different serialization views (schema-focused, content-aware, and hybrid), combined with retrieval-oriented objectives. The goal is not task-specific execution, but learning general table representations that support dataset discovery, evidence selection, and schema grounding.
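
The exact serialization formats aren’t reproduced here, but to make the idea concrete, here is a minimal Python sketch of how the three views might differ for a toy table. The `Table` dataclass, separators, and field names are illustrative assumptions, not the format used in the paper:

```python
from dataclasses import dataclass

@dataclass
class Table:
    """Toy relational table; names and structure are illustrative only."""
    name: str
    columns: list[str]
    rows: list[list[str]]

def schema_view(t: Table) -> str:
    # Expose only the table name and column headers (structure, no values).
    return f"table: {t.name} | columns: {', '.join(t.columns)}"

def content_view(t: Table, max_rows: int = 3) -> str:
    # Expose sampled cell values so relevance can be grounded in content.
    rows = [" ; ".join(r) for r in t.rows[:max_rows]]
    return f"table: {t.name} | rows: " + " || ".join(rows)

def hybrid_view(t: Table, max_rows: int = 3) -> str:
    # Combine schema structure and sampled values in one input string.
    return schema_view(t) + " | " + content_view(t, max_rows)

t = Table("city_population",
          ["city", "country", "population"],
          [["Berlin", "Germany", "3645000"],
           ["Amman", "Jordan", "4007526"]])
print(schema_view(t))
print(hybrid_view(t))
```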

What we observe:

  1. Schema-aware views provide strong, robust signals for ad-hoc table search.
  2. Content-aware views are necessary when relevance depends on value grounding or distributed evidence.
  3. A single hybrid encoder can generalize across multiple table-centric tasks without retraining.

This suggests that improvements in TLR do not necessarily come from larger models alone, but from respecting the structural nature of the data they are trained on. In other words, relevance in table search seems to emerge from structured decomposition, not from treating tables as long text passages.
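
To make the bi-encoder setup concrete, below is a minimal fine-tuning sketch. It uses sentence-transformers with an in-batch-negatives contrastive loss (`MultipleNegativesRankingLoss`) as a stand-in for the retrieval-oriented objective; the base model, the toy (query, serialized table) pairs, and the hyperparameters are placeholder assumptions, not our actual recipe:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses, util
from torch.utils.data import DataLoader

# Placeholder (query, serialized-table) pairs; in practice these would come
# from WikiDBs-derived relevance labels, which are not reproduced here.
pairs = [
    ("population of european capitals",
     "table: city_population | columns: city, country, population"),
    ("gdp by country 2020",
     "table: gdp_annual | columns: country, year, gdp_usd"),
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # base model is an assumption
examples = [InputExample(texts=[q, t]) for q, t in pairs]
loader = DataLoader(examples, shuffle=True, batch_size=2)

# In-batch negatives: every other table in the batch acts as a negative
# for each query, which is what makes this a retrieval-oriented objective.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=0)

# At retrieval time, rank serialized tables by cosine similarity to the query.
q_emb = model.encode("population of european capitals")
t_embs = model.encode([t for _, t in pairs])
print(util.cos_sim(q_emb, t_embs))
```

The same encoder can then be pointed at any of the serialization views above, which is what allows a single hybrid model to serve multiple table-centric tasks without retraining.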

I’m curious to hear thoughts from the community:
  1. Should table retrieval be treated as a distinct retrieval problem rather than a variant of document retrieval?
  2. Are current language models under-utilizing structured datasets like WikiDBs?
  3. What other structural signals (e.g., keys, constraints, inter-table links) should be exposed during training? A toy sketch of what this might look like follows this list.
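
As a purely illustrative starting point for the third question, one could extend a schema view to expose key constraints and inter-table links as explicit tokens. The `pk:`/`fk:` markers and the function below are hypothetical, not an established or proposed format:

```python
def schema_view_with_keys(name, columns, primary_key, foreign_keys):
    """Serialize a table schema with key constraints exposed.

    foreign_keys: list of (local_column, referenced_table.column) pairs.
    The "pk:"/"fk:" markers are illustrative assumptions only.
    """
    cols = ", ".join(columns)
    fks = "; ".join(f"{c} -> {ref}" for c, ref in foreign_keys)
    return (f"table: {name} | columns: {cols}"
            f" | pk: {primary_key} | fk: {fks}")

print(schema_view_with_keys(
    "city_population",
    ["city_id", "country_id", "population"],
    primary_key="city_id",
    foreign_keys=[("country_id", "countries.country_id")],
))
```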
Happy to discuss, and very open to criticism or alternative perspectives.
