Key Takeaways
- AI does not learn about your firm from a single source. It assembles a profile from at least seven distinct inputs: your website, structured data markup, legal directories, online reviews, court records, public social and press, and brand mentions across the open web.
- Multiple AI crawlers visit your site, each with a different job. Googlebot, GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot, and Google-Extended each serve different products. Most firms have configured controls for none of them.
- The technical pipeline is consistent across vendors. Content is discovered, normalized, chunked, embedded as numerical vectors, retrieved at query time, and synthesized into an answer. Citation failure can occur at any stage.
- Citability — not content volume — determines visibility. Firms with structured, attributable, jurisdiction-specific content outperform firms with more content but weaker entity signals.
- The American Bar Association has already published governing guidance. Formal Opinion 512 (July 2024) establishes the ethical framework for AI use in legal practice, and at least two attorneys have been sanctioned for misuse of generative AI.
When a prospective client asks ChatGPT, Perplexity, or Google’s AI Overview to identify the best lawyer in their city, an entire technical pipeline determines whether your firm appears in the answer. That pipeline runs through specialized web crawlers, vector databases, retrieval systems, and language models — and most law firms have no working knowledge of how any of it operates.
AI Overviews now appear on roughly half of all Google search results, and AI-referred visitors convert at higher rates than traditional organic traffic. Whether your firm shows up in those answers is no longer a marketing curiosity. It is a business question with measurable revenue implications.
This article explains how AI systems actually learn about your firm, what signals they weigh when deciding whom to cite, and what your firm can do to influence that process. At Matador, we work with more than 175 law firms across the country, and the analysis below reflects what we have learned from optimizing legal websites for both classic search and generative engines.

Why Does AI Search Visibility Matter Right Now?
Two data points define the urgency. First, AI Overviews now appear on approximately 48% of all tracked Google search queries, up from 31% in February 2025. Second, brands cited within AI Overviews earn 35% more organic clicks and 91% more paid clicks than those that are not. The top of the search results page is increasingly an AI-generated summary, and inclusion in that summary carries significant downstream traffic value. (For more on this trajectory, see Dataslayer’s analysis of AI Overview impact on CTR and Averi’s 2026 citation playbook.)
The legal industry is particularly affected for three reasons:
- High trigger rate on legal queries. AI Overviews appear on 23.6% of legal queries, with the highest trigger rates on question-style searches (57.9%) and longer queries of seven or more words (46.4%). These are precisely the query patterns that prospective clients use when researching legal representation.
- Higher conversion quality of AI traffic. Independent studies report AI search visitors converting at 14.2% versus 2.8% for traditional organic traffic, a multiple that reflects the pre-qualified state of users who arrive after consulting an AI summary.
- Compounding effects on local search. Legal services remain a local-search-dominated category, and the firms cited by AI systems also tend to benefit from increased visibility in map pack results and traditional rankings.
For law firms operating in competitive markets, the implication is straightforward: classic SEO performance is no longer sufficient on its own, but it remains the foundation that determines whether AI systems consider your firm at all.
The Four Layers of AI Learning
Rather than a single mechanism, AI systems acquire knowledge about your firm through four distinct layers. Each layer has its own controls, its own failure modes, and its own optimization levers.
Layer 1: Source Discovery
The first layer is the universe of content that AI systems can see. For a law firm, this typically includes:
- Practice area pages, attorney biographies, and FAQs published on your firm’s website
- Google Business Profile listings, including hours, services, and reviews
- Third-party legal directories such as Avvo, Justia, FindLaw, Martindale-Hubbell, and Super Lawyers
- Court records accessible through PACER (federal) and state court systems
- Press coverage, podcast appearances, video content, and public social profiles
- Brand mentions across forums, news sites, and professional networks such as LinkedIn and Reddit
The most common failure at this layer is inconsistency. When your homepage claims service in three cities, your directory listing reflects two, and an archived bio references a fourth, AI systems do not interpret the discrepancy as nuance. They discount the firm as an unreliable source.
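The inconsistency described above can be checked mechanically. The following is a minimal sketch, assuming the firm has already collected its public details from each source into plain dictionaries. The source names, field names, and sample values are hypothetical placeholders, not data from any real listing.

```python
from collections import defaultdict


def normalize(value):
    """Lowercase strings and collapse whitespace; turn collections into frozensets."""
    if isinstance(value, str):
        return " ".join(value.lower().split())
    return frozenset(normalize(item) for item in value)


def find_conflicts(profiles):
    """Return every field whose normalized value differs between sources."""
    seen = defaultdict(dict)
    for source, fields in profiles.items():
        for field, value in fields.items():
            seen[field][source] = normalize(value)
    return {
        field: by_source
        for field, by_source in seen.items()
        if len(set(by_source.values())) > 1
    }


# Hypothetical sample data: three sources that disagree on service area.
PROFILES = {
    "website": {"phone": "310-555-0100", "cities": ["Santa Monica", "Los Angeles", "Pasadena"]},
    "directory": {"phone": "310-555-0100", "cities": ["Santa Monica", "Los Angeles"]},
    "archived_bio": {"phone": "310-555-0100", "cities": ["Santa Monica", "Long Beach"]},
}

conflicts = find_conflicts(PROFILES)
for field, by_source in conflicts.items():
    print(field, {source: sorted(value) for source, value in by_source.items()})
```

In this sample only the `cities` field is flagged, because the phone number matches everywhere. A real audit would also need to normalize phone and address formatting before comparing.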
Layer 2: Crawler Discovery
This is the layer where most firms have neither visibility nor strategy. Multiple AI crawlers visit your website on different schedules, for different purposes, with different rendering capabilities.

OpenAI alone operates three crawlers, each independently controllable through robots.txt:
- GPTBot crawls content to train future foundation models.
- OAI-SearchBot powers live search citations within ChatGPT.
- ChatGPT-User fetches a specific page when a user asks ChatGPT to read it directly.
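Because each crawler is controlled independently, a firm can check its current position on each one with a short script. The sketch below uses Python's standard-library `urllib.robotparser`. The sample robots.txt content and the `example-firm.test` URL are assumptions made for illustration, and the standard-library parser applies rules in file order, which may differ from how a given crawler resolves conflicting rules.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens named in this article.
AI_CRAWLERS = [
    "GPTBot",
    "OAI-SearchBot",
    "ChatGPT-User",
    "ClaudeBot",
    "PerplexityBot",
    "Google-Extended",
]


def audit_robots(robots_txt, url):
    """Report, per crawler token, whether robots_txt permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in AI_CRAWLERS}


# Hypothetical robots.txt: block training, allow search, protect the portal.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Disallow: /client-portal/
"""

report = audit_robots(SAMPLE_ROBOTS, "https://example-firm.test/practice-areas/")
for agent, allowed in report.items():
    print(f"{agent:16} {'allowed' if allowed else 'blocked'}")
```

Crawlers without a named entry fall through to the `User-agent: *` group, so a firm that has never edited its robots.txt has taken a position by default.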
According to OpenAI’s published documentation and independent analyses of crawler behavior, GPTBot does not render JavaScript. This creates a silent invisibility problem for law firm websites that render content client-side: the page may appear correctly to human visitors and to Googlebot, while remaining functionally blank to OpenAI’s training crawler. (See also Prerender’s comparison of AI crawler behaviors.)
The same documentation notes that GPTBot’s revisit frequency is low. Most pages may be recrawled only once over a period of weeks, which means recent updates can take significant time to influence the underlying training corpus. Live retrieval via OAI-SearchBot operates on a faster cadence, but the model’s parametric memory of your firm lags accordingly.
The practical implications for law firm configuration:
- Audit the firm’s robots.txt file and document an explicit position on each major AI crawler.
- Ensure server-side rendering or static HTML for any page that should be readable by training crawlers.
- Recognize that blocking AI crawlers carries an opportunity cost as well as a privacy benefit.
This is one reason technical SEO infrastructure — site speed, schema, rendering strategy, crawl budget — now functions as AI infrastructure rather than as a separate concern.
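One way to approximate what a non-rendering crawler sees is to extract the text from the raw HTML response, ignoring scripts, and check whether key phrases are present. The sketch below uses the standard-library `html.parser`. The two sample pages and the phrase list are hypothetical, and it does not claim to reproduce any specific crawler's parsing.

```python
from html.parser import HTMLParser


class RawText(HTMLParser):
    """Collect text that exists in the HTML without executing JavaScript."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def missing_phrases(html, phrases):
    """Return the phrases that do not appear in the page's raw text."""
    parser = RawText()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    return [phrase for phrase in phrases if phrase.lower() not in text]


# Hypothetical pages: one rendered in the browser, one rendered on the server.
CLIENT_SIDE = '<html><body><div id="root"></div><script>render("Car accident lawyer in Santa Monica")</script></body></html>'
SERVER_SIDE = "<html><body><h1>Car accident lawyer in Santa Monica</h1></body></html>"

print(missing_phrases(CLIENT_SIDE, ["Santa Monica"]))
print(missing_phrases(SERVER_SIDE, ["Santa Monica"]))
```

In practice the HTML would come from a plain HTTP fetch of the live page, and the phrase list would hold the attorney names, practice areas, and cities the page is supposed to establish.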
Layer 3: Content Representation
Once a crawler has fetched a page, the system converts it into a representation a model can reason over. This conversion proceeds in two steps:
- Chunking divides long-form content into smaller passages, typically a few hundred tokens each, that fit within the model’s working memory.
- Embedding converts each chunk into a high-dimensional numerical vector, in which semantically similar passages occupy nearby positions.
The embedding step is what allows a prospective client’s query for “best lawyer for serious crash injuries in Santa Monica” to retrieve a page that uses different terminology, such as “catastrophic motor vehicle injury representation in West Los Angeles.” Semantic similarity, not keyword matching, drives retrieval.
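The two steps can be illustrated with a toy sketch. It makes several assumptions: chunks are measured in words as a rough stand-in for tokens, the overlap between chunks is a design choice this article does not discuss, and the three-dimensional vectors are hand-assigned for illustration. A production system would obtain vectors with hundreds or thousands of dimensions from a trained embedding model.

```python
import math


def chunk_words(text, size=200, overlap=40):
    """Split text into overlapping passages of at most `size` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]


def cosine(a, b):
    """Cosine similarity: values near 1.0 mean the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


chunks = chunk_words("a b c d e f g h i j", size=4, overlap=1)

# Hand-assigned toy vectors (hypothetical; not output from any real model).
query_vec = [0.9, 0.8, 0.1]    # "best lawyer for serious crash injuries"
injury_vec = [0.8, 0.9, 0.2]   # "catastrophic motor vehicle injury representation"
estate_vec = [0.1, 0.2, 0.9]   # "wills and trusts"

print(chunks)
print(round(cosine(query_vec, injury_vec), 3), round(cosine(query_vec, estate_vec), 3))
```

The injury passage scores far closer to the query than the estate passage even though the two phrasings share almost no words, which is the behavior a real embedding model is trained to produce.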
The technical foundation for this approach traces to a 2020 research paper by Lewis et al., written primarily at Facebook AI Research (now Meta AI). The authors demonstrated that retrieval-augmented generation models, which combine pre-trained parametric memory with non-parametric memory drawn from an external index, achieve substantially better performance on knowledge-intensive tasks than parametric-only models. The original paper is available on Hugging Face’s paper archive, and a useful summary appears on Wikipedia’s RAG entry.
The practical consequence for law firms: when an AI system uses retrieval (rather than relying solely on its training data), updates to your website can influence answers as soon as the relevant page is recrawled. The model itself does not need to be retrained. This is why the citation labels visible at the bottom of an AI Overview or a Perplexity answer are not a cosmetic feature; they are evidence that retrieval, not pure generation, produced the answer.
Layer 4: Answer Generation
The final layer activates when a user submits a query. The system identifies the most semantically relevant chunks from its index, passes those chunks together with the query to a language model, and the model produces an answer with attribution.
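The retrieve-then-generate step described above can be sketched as follows. This is an illustration under stated assumptions rather than any vendor's implementation: the index is a small in-memory list, the vectors are hand-assigned toy values, the URLs use a placeholder domain, and the final call to a language model is omitted.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    url: str             # source page, kept so the answer can cite it
    text: str
    vector: list[float]  # toy embedding (hypothetical values)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def retrieve(query_vector, index, k=2):
    """Return the k chunks most similar to the query."""
    ranked = sorted(index, key=lambda c: cosine(query_vector, c.vector), reverse=True)
    return ranked[:k]


def build_prompt(question, chunks):
    """Pack the retrieved chunks and the question into one grounded prompt."""
    sources = "\n\n".join(
        f"[{n}] {c.url}\n{c.text}" for n, c in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the numbered sources below and cite them by number.\n\n"
        f"{sources}\n\nQuestion: {question}"
    )


INDEX = [
    Chunk("https://example-firm.test/car-accidents",
          "We represent crash victims in West Los Angeles.", [0.8, 0.9, 0.2]),
    Chunk("https://example-firm.test/estate-planning",
          "We draft wills and trusts.", [0.1, 0.2, 0.9]),
    Chunk("https://example-firm.test/attorneys/jane-doe",
          "Jane Doe handles catastrophic injury cases.", [0.7, 0.7, 0.3]),
]

top = retrieve([0.9, 0.8, 0.1], INDEX, k=2)
prompt = build_prompt("Who handles serious crash injuries in Santa Monica?", top)
print(prompt)
```

Only the chunks that survive retrieval reach the model, so a page that is never retrieved cannot be cited no matter how well it is written.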
Two failure modes are common at this layer:
- Retrieval pulls the wrong source. Stale or contradictory content from your firm may be retrieved instead of authoritative current content. The resulting answer will be confidently wrong about your firm.
- The model hallucinates content the retrieval did not support. In legal contexts, this is the most consequential failure mode, and one that has produced documented professional discipline.
The leading cautionary case is Mata v. Avianca, Inc., 678 F. Supp. 3d 443 (S.D.N.Y. 2023). The plaintiff’s attorneys submitted a brief containing citations to non-existent cases generated by ChatGPT, including fabricated quotations and internal citations. Judge P. Kevin Castel imposed a $5,000 sanction under Federal Rule of Civil Procedure 11, finding that the attorneys had acted with “subjective bad faith.” (Contemporaneous coverage from CNN Business and a detailed analysis from the Association of Corporate Counsel trace the procedural history. The New York State Bar Association published a thorough postmortem, and the Wikipedia entry on Mata v. Avianca summarizes the broader facts.)
The deeper architectural lesson, however, is that the attorneys in Mata used ChatGPT in its default mode, which at the time relied entirely on parametric memory and lacked any retrieval grounding. The model produced citation-shaped output because it had been trained on millions of real citations, but without a connection to legal databases, it could not verify whether any specific citation existed. Retrieval-grounded AI systems, used with human verification, present a different risk profile than pure-generation systems used without verification.
The Signals AI Search Actually Weighs
Public research on AI Overview citations now permits a reasonably precise account of which signals matter most when AI systems decide whom to cite. The chart below synthesizes findings from multiple large-scale studies.

The most heavily weighted signals fall into the following categories:
- Entity clarity and E-E-A-T signals (correlation with citation: approximately 0.96). AI systems require unambiguous identification of who authored or is responsible for legal content. This means named attorneys with bar numbers, jurisdictions, law schools, admission dates, and verifiable credentials.
- Top-10 organic ranking (approximately 92% of AI Overview citations). Research summarized by Wellows and Gorilla Web Tactics shows that the overwhelming majority of AI Overview citations originate from domains ranking in the top ten organic results for the underlying query. Classic SEO performance functions as a gating mechanism for AI visibility.
- Verifiable factual citations (approximately 89% probability lift). Pages that cite primary sources — statutes, case law, regulatory texts — are substantially more likely to be selected as citation sources themselves.
- Semantic completeness (correlation approximately 0.87). A page that fully answers a single question, without requiring the reader or the AI to consult prior context, is more readily extractable as a citation.
- Structured data markup (approximately 73% selection rate lift). Schema markup remains the most direct way to communicate entity information to AI systems.
- Multi-modal content integration (approximately 156% selection rate lift). Pages combining text, images, and video are substantially more likely to be cited than text-only pages.
A 2026 analysis of AI Mode citations from ALM Corp introduces a complication. It reports an Ahrefs comparison of 540,000 query pairs, which found that Google AI Mode and Google AI Overviews cited the same URLs only 13.7% of the time, despite reaching similar conclusions in approximately 86% of cases. A separate Moz study of 40,000 keywords found that only 12% of AI Mode citations matched URLs appearing in the top ten organic search results. In other words, AI Mode draws meaningfully from URLs that do not rank on page one of traditional Google Search, suggesting that AI visibility now requires a parallel optimization track rather than a single integrated strategy. The Digital Bloom’s 2026 AI Citation Position & Revenue Report develops this analysis in more depth.
Critical Source Categories Beyond Your Website
Your firm’s website is one input among several. For law firms, four additional source categories deserve specific attention.
- Legal directories. Avvo, Justia, FindLaw, Martindale-Hubbell, and Super Lawyers are crawled regularly by all major AI bots. These directories serve as third-party corroboration of your firm’s existence, credentials, and reviews. When directory information conflicts with information on your website, AI systems often default to the directory. Directories play a significant role in personal injury lawyer SEO, which is highly competitive. Maintaining synchronized firm data across directories is therefore a citation-protection measure.
- Online reviews. Reviews function simultaneously as a local ranking signal, a conversion signal, and a citation signal. AI systems frequently quote review text directly in synthesized answers, which means the language patterns within your firm’s review corpus increasingly shape how AI describes your practice.
- Court records. PACER and state court dockets provide public-record corroboration of who your firm represents and what types of cases it actually handles. AI systems use this corroboration when evaluating whether stated practice areas match documented activity.
- Press, podcasts, and video content. Off-site brand mentions in authoritative contexts compound over time. Mentions on Reddit, LinkedIn, legal publications, and third-party media correlate with AI citation increases on a horizon measured in weeks rather than months.
Matador’s content strategy work addresses each of these categories systematically rather than treating the firm’s website as a stand-alone asset.
Training Versus Retrieval: A Critical Distinction
A common question from law firm marketing teams is whether ChatGPT can “see” a recently published page. The question conflates two distinct mechanisms that AI systems use to know about your firm.
| Mechanism | How It Works | Freshness | Law Firm Leverage |
|---|---|---|---|
| Training (parametric) | Content absorbed into model weights during pretraining | Months to years stale | Limited; waits for retraining cycle |
| Retrieval (non-parametric) | Content fetched at query time from a live index | Days, sometimes hours | High; new content can be cited rapidly |
The mechanism your firm should care about depends on which AI surface the prospective client is using:
- ChatGPT default mode relies primarily on parametric memory; your recent updates will not appear quickly.
- ChatGPT search mode uses retrieval; your updates can appear within days.
- Perplexity uses live retrieval as its primary mode; citation timelines are near-real-time.
- Google AI Overviews use retrieval grounded in the Google index; behavior tracks classic SEO patterns.
- Google AI Mode uses retrieval but draws meaningfully from sources outside the top-10 organic results.
- Claude default mode uses parametric memory; live retrieval is available when web access is enabled.
The strategic conclusion is that generative engine optimization is a separate discipline from classic SEO. The two share a substrate — your website’s authority, structure, and content — but the surfaces they optimize for behave differently. Firms running only a classic SEO program leave significant AI visibility unrealized. This is the architectural basis for Matador’s GEO services for law firms.
The Professional Responsibility Layer
While most firms are still working out the marketing implications of AI search, the American Bar Association has already published the governance framework. ABA Formal Opinion 512, issued on July 29, 2024, by the Standing Committee on Ethics and Professional Responsibility, is the controlling guidance. (The ABA has published an official release, and Foley’s summary at foley.com is a useful starting point.)
Formal Opinion 512 addresses six categories of ethical obligation, each of which has direct implications for AI search strategy:
- Competence (Model Rule 1.1). Lawyers must maintain “a reasonable understanding” of AI tools they use, including both capabilities and limitations.
- Confidentiality (Model Rule 1.6). Lawyers must evaluate whether use of an AI tool creates a risk of disclosing protected client information.
- Communication with clients (Model Rule 1.4). Clients may need to be informed about AI use in their matters.
- Candor toward tribunals (Model Rule 3.3). Submissions to courts must remain accurate regardless of AI involvement.
- Supervisory responsibilities (Model Rules 5.1 and 5.3). Lawyers retain responsibility for work product produced with AI assistance, including by non-lawyers.
- Reasonable fees (Model Rule 1.5). Billing practices must reflect actual time spent and must not include vendor overhead.
The marketing implications are sometimes overlooked. Any system that feeds client information, case details, or matter content into a generative AI tool — including for content production, intake automation, or analytics — must satisfy Model Rule 1.6. A firm that allows its marketing team to use client testimonials, case results, or transcripts in public AI tools may create a confidentiality problem that no marketing benefit could justify.
A Practical Operating Model
The technical and ethical analysis above resolves into three operating layers that a firm can execute on directly.
Layer 1: Public Visibility
The goal is to make your firm maximally legible to AI systems on the public web. Specific actions include:
- Audit your robots.txt file and document an explicit position for each major AI crawler (GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, PerplexityBot, Google-Extended).
- Confirm that all important content renders in raw HTML and does not depend on client-side JavaScript.
- Implement and validate LegalService, Organization, Person, and FAQPage schema markup on the appropriate pages. (Schema.org’s LegalService definition is the canonical reference.)
- Synchronize firm data (name, address, phone, hours, attorney roster, practice areas) across your website, Google Business Profile, and all major legal directories.
- Build attorney biographies that include full names, bar numbers, jurisdictions, law schools, admission dates, named practice areas, and professional photos.
- Refresh high-performing pages on a quarterly cadence, with particular attention to date stamps and updated statistics.
- Restructure FAQ content with question-formatted headings, direct answers immediately following, and elaboration further down the page.
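For the schema markup item above, the following sketch generates a LegalService JSON-LD block from the same firm record used elsewhere on the site, which helps keep the markup consistent with the visible page. The firm details are hypothetical placeholders. The property names (`name`, `url`, `telephone`, `address`, `areaServed`) are standard Schema.org properties, but the output should still be run through a structured-data validator before deployment.

```python
import json


def legal_service_jsonld(firm):
    """Build a Schema.org LegalService description from a firm record."""
    return {
        "@context": "https://schema.org",
        "@type": "LegalService",
        "name": firm["name"],
        "url": firm["url"],
        "telephone": firm["phone"],
        "address": {
            "@type": "PostalAddress",
            "streetAddress": firm["street"],
            "addressLocality": firm["city"],
            "addressRegion": firm["state"],
            "postalCode": firm["zip"],
        },
        "areaServed": firm["cities_served"],
    }


def as_script_tag(data):
    """Wrap the JSON-LD in the script element that pages embed."""
    return '<script type="application/ld+json">\n' + json.dumps(data, indent=2) + "\n</script>"


# Hypothetical firm record (placeholder values only).
FIRM = {
    "name": "Example Firm LLP",
    "url": "https://example-firm.test/",
    "phone": "+1-310-555-0100",
    "street": "100 Example Street",
    "city": "Santa Monica",
    "state": "CA",
    "zip": "90401",
    "cities_served": ["Santa Monica", "Los Angeles"],
}

markup = legal_service_jsonld(FIRM)
print(as_script_tag(markup))
```

Generating the markup from a single record, rather than hand-editing it per page, also supports the data synchronization item in the same list.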
Layer 2: Private Knowledge
For any internal AI system that ingests firm content — Microsoft Copilot configured against firm data, document management AI, legal research copilots, custom retrieval systems — the operating standards are different and more conservative:
- Default-deny access for authenticated, privileged, and matter-scoped content.
- Permission-aware (ACL-aware) retrieval that respects existing access controls.
- Documented ethical walls for conflict-of-interest situations.
- Data loss prevention rules and pre-embedding redaction of personally identifiable information.
- Vendor agreements that specify retention terms and prohibit use of firm data for vendor model training.
- Audit logging of AI interactions sufficient to demonstrate compliance.
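Two of the controls above, default-deny access and pre-embedding redaction, can be sketched in a few lines. This is an illustration only: the group names and documents are hypothetical, the regular expressions catch only simple email and US phone patterns, and a real deployment would rely on the access controls and data loss prevention tooling of the firm's document management system.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str
    # Groups allowed to retrieve this document. Empty means nobody (default-deny).
    allowed_groups: frozenset = frozenset()


def visible_docs(user_groups, docs):
    """Keep only documents that share at least one group with the user."""
    return [d for d in docs if d.allowed_groups & user_groups]


EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact(text):
    """Mask simple email and phone patterns before the text is embedded."""
    return PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", text))


# Hypothetical documents and groups.
DOCS = [
    Doc("memo-1", "Call 310-555-0100 or email jane@example.test", frozenset({"matter-1042"})),
    Doc("memo-2", "Settlement strategy notes", frozenset({"matter-2077"})),
    Doc("memo-3", "Unclassified draft"),  # no groups assigned, so nobody can see it
]

allowed = visible_docs({"matter-1042"}, DOCS)
print([d.doc_id for d in allowed])
print(redact(allowed[0].text))
```

The filter runs before any similarity ranking, so a document the user cannot open never reaches the model's context, and the unclassified draft stays hidden until someone explicitly grants access.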
Layer 3: Evaluation
This is the least technical layer and the most frequently neglected. A firm should:
- Identify the top 25 to 50 queries a prospective client might use in the firm’s practice areas and geographies.
- Run those queries through ChatGPT, Google AI Overviews, Perplexity, Claude, and Gemini on a monthly basis.
- Document which firms are cited, what is said about your firm, and where AI descriptions are inaccurate.
- Track citation frequency as a metric alongside traditional ranking and traffic metrics.
This monitoring is the only direct feedback mechanism on whether the work in Layers 1 and 2 is producing measurable improvement in AI visibility.
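Citation frequency can be tracked with a simple log and a few lines of code. The sketch below assumes each monthly test is recorded by hand as one entry per query and engine. The record layout, firm names, and sample observations are hypothetical.

```python
from collections import Counter


def citation_rates(observations, firm):
    """Share of logged queries, per (month, engine), in which the firm was cited."""
    totals, hits = Counter(), Counter()
    for obs in observations:
        key = (obs["month"], obs["engine"])
        totals[key] += 1
        if firm in obs["cited_firms"]:
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}


# Hypothetical log entries from one month of manual testing.
LOG = [
    {"month": "2025-06", "engine": "Perplexity",
     "query": "best car accident lawyer santa monica",
     "cited_firms": ["Example Firm LLP", "Other Firm PC"]},
    {"month": "2025-06", "engine": "Perplexity",
     "query": "how long do i have to file an injury claim in california",
     "cited_firms": ["Other Firm PC"]},
    {"month": "2025-06", "engine": "ChatGPT",
     "query": "best car accident lawyer santa monica",
     "cited_firms": []},
]

rates = citation_rates(LOG, "Example Firm LLP")
for (month, engine), rate in sorted(rates.items()):
    print(month, engine, f"{rate:.0%}")
```

Keeping the query list fixed from month to month is what makes the resulting rates comparable over time.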
Common Misconceptions Worth Setting Aside
Several beliefs about AI search circulate widely but do not survive contact with the evidence:
- “AI-generated content will be penalized.” AI systems and search engines penalize thin or inaccurate content regardless of authorship. Substantive content with verifiable legal expertise and attorney oversight performs comparably whether or not AI tools were used in production.
- “Volume is the answer.” Publishing more content without entity clarity, structured data, or jurisdiction specificity consumes crawl budget without producing citation lift. Citability per page outweighs page count.
- “Blocking AI crawlers protects the firm.” Blocking AI crawlers eliminates the firm from the surfaces those crawlers serve. Selective allowance, documented and aligned with the firm’s content strategy, is generally a better posture than blanket blocking.
- “Social media presence drives AI citation.” Citation evidence weights authoritative third-party mentions, legal directory data, and press coverage well above social media volume. Reddit and LinkedIn carry signal; lower-authority platforms generally do not.
Conclusion: Treating AI Visibility as Infrastructure
AI search has progressed beyond a marketing experiment. For law firms competing in any meaningful geography, AI citation now functions as a discrete distribution channel, with measurable conversion characteristics and identifiable optimization levers. The firms that will dominate that channel over the next several years are those that treat their public-facing knowledge — practice pages, attorney bios, schema, directory data, reviews, court records, and brand mentions — as a deliberate, governed asset.
The technical architecture is no longer obscure: discovery, normalization, chunking, embedding, retrieval, and synthesis. The governing professional responsibility framework is no longer in draft: ABA Formal Opinion 512 establishes clear obligations. The performance signals are no longer mysterious: entity clarity, semantic completeness, ranking position, structured data, and freshness account for the substantial majority of citation variance.
What remains is execution. Firms that implement the operating model outlined above, in coordination with their existing SEO and content programs, can expect measurable improvement in AI visibility within two to four months and compounding returns thereafter. Firms that defer the work will find that AI systems have already developed a description of their practice, assembled from whatever sources happened to be available — directories, archived bios, third-party reviews, and competitor content — without the firm’s input.
Matador works with more than 175 law firms nationally on exactly this set of problems, including traditional SEO and GEO. If your firm would like a structured assessment of how AI systems currently describe your practice, we are available to conduct one.