
Building Libraries

Create optimized RAG libraries from your documentation

This guide covers advanced options for building high-quality RAG libraries.

Basic Build

libragen build ./docs --name my-docs

This processes all supported files (Markdown, MDX, plain text, and HTML), generating embeddings and a full-text search index.
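
Once the build finishes, you can sanity-check the result with the list and query commands covered later in this guide (the query text below is only an illustration; ask something from your own docs):

libragen list
libragen query -l my-docs "how do I get started"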

Supported File Types

Extension   Format
.md         Markdown
.txt        Plain text
.html       HTML (text extracted)
.mdx        MDX (treated as markdown)

Customize with --include:

libragen build ./docs --name my-docs --include "**/*.md" "**/*.rst" "**/*.txt"

Chunking Strategy

Documents are split into chunks for indexing. The chunking parameters affect search quality. The defaults work well for most cases, but you can tune them for specific content types.

When to Adjust Chunk Settings

Use defaults (chunk-size: 1000, overlap: 100) when:

  • You have typical documentation with mixed content types
  • You’re not sure what settings to use
  • Your docs have clear headings and sections

Use smaller chunks (256-384) when:

  • Content is dense with distinct concepts (API references, glossaries)
  • Users ask very specific questions (“What does X parameter do?”)
  • Each paragraph covers a different topic

Use larger chunks (768-1024) when:

  • Content is narrative or tutorial-style
  • Context matters more than precision (architecture guides, explanations)
  • You want results to include more surrounding context

Chunk Size

Target size in characters.

# Smaller chunks = more precise, more results
libragen build ./docs --name my-docs --chunk-size 256
# Larger chunks = more context per result
libragen build ./docs --name my-docs --chunk-size 1024

By content type:

Content Type          Chunk Size   Why
API reference         256-384      Each function/method is self-contained
Configuration docs    384-512      Options are usually independent
Tutorials             512-768      Steps need some context
Architecture guides   768-1024     Concepts span multiple paragraphs
FAQs                  256-384      Each Q&A is standalone
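
For example, a tutorial-heavy doc set could use a chunk size from the middle of the 512-768 range in the table above (the directory, library name, and exact value here are illustrative):

libragen build ./tutorials --name my-tutorials --chunk-size 640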

Chunk Overlap

Characters shared between adjacent chunks. Overlap helps when:

  • Important context spans chunk boundaries
  • You have long sentences or paragraphs
  • Search queries might match text split across chunks
libragen build ./docs --name my-docs --chunk-overlap 100

Guidelines:

  • 10% overlap (default): Good for well-structured docs with clear sections
  • 15-20% overlap: Better for prose-heavy content or long paragraphs
  • 25%+ overlap: Use when search quality suffers from split context (rare)

Higher overlap increases library size and build time, so only increase if needed.
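
As a concrete example, the 15-20% guideline works out to roughly 100 characters of overlap for a 512-character chunk size (the numbers and library name here are illustrative, not a recommendation):

libragen build ./docs --name prose-docs --chunk-size 512 --chunk-overlap 100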

Versioning Content

Track documentation versions to match your releases:

libragen build ./docs \
  --name my-api \
  --content-version 2.1.0

Query specific versions later:

libragen query -l my-api --content-version 2.1.0 "authentication"

Excluding Files

Skip files matching glob patterns:

libragen build ./docs \
  --name my-docs \
  --exclude "**/node_modules/**" \
  --exclude "**/drafts/**" \
  --exclude "**/*.test.md"

Adding Metadata

Provide a description for better discoverability:

libragen build ./docs \
  --name react-docs \
  --description "Official React documentation including hooks, components, and API reference"

The description appears in libragen list and helps AI tools understand what the library contains.

License Tracking

When building from git repositories, licenses are automatically detected from LICENSE files. You can also specify licenses explicitly:

# Explicit license
libragen build ./docs --name my-docs --license MIT
# Multiple licenses (dual licensing)
libragen build ./docs --name my-docs --license MIT Apache-2.0

View license information with:

libragen inspect my-docs.libragen

Supported licenses: MIT, Apache-2.0, GPL-3.0, GPL-2.0, BSD-3-Clause, BSD-2-Clause, ISC, Unlicense, and more.

CI/CD Integration

Automate library builds in your pipeline. See the CI Integration guide for complete examples with GitHub Actions, GitLab CI, CircleCI, and Azure Pipelines.
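
As a rough sketch, a CI job typically re-runs the same build command on each release, passing the release tag as the content version. The commands below are a generic shell sketch only; they assume the libragen CLI is already installed on the runner and that $RELEASE_TAG holds the release version (see the CI Integration guide for real, per-platform examples):

# Rebuild the library for the current release (sketch; flags are documented above)
libragen build ./docs \
  --name my-api \
  --content-version "$RELEASE_TAG" \
  --output ./artifacts/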

Output Location

By default, libraries are saved to the current directory. Specify a different location:

libragen build ./docs \
  --name my-docs \
  --output ~/.libragen/libraries/

Performance Tips

Large Documentation Sets

For very large doc sets (>10,000 files):

  1. Use larger chunk sizes to reduce total chunks
  2. Exclude non-essential files such as changelogs and drafts (a combined example follows this list)
  3. Build incrementally if possible
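
A single build combining the first two tips might look like this (the library name and exclude patterns are illustrative):

libragen build ./docs \
  --name big-docs \
  --chunk-size 1024 \
  --exclude "**/CHANGELOG*.md" \
  --exclude "**/drafts/**"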

Optimizing for Search Quality

  1. Structure your docs - Use clear headings and sections
  2. Front-load important content - Key information at the start of sections
  3. Use consistent terminology - Same terms across related docs
  4. Include examples - Code examples improve retrieval for technical queries

Programmatic Building

Use the Builder class from @libragen/core to build libraries programmatically:

import { Builder } from '@libragen/core';

const builder = new Builder();

const result = await builder.build('./docs', {
  name: 'my-docs',
  version: '1.0.0',
  description: 'My documentation',
  chunkSize: 1000,
  chunkOverlap: 100,
});

console.log(`Built: ${result.outputPath}`);
console.log(`Chunks: ${result.stats.chunkCount}`);
console.log(`Time: ${result.stats.embedDuration}s`);

Build from git repositories:

const result = await builder.build('https://github.com/user/repo', {
  gitRef: 'v2.0.0',
  include: ['docs/**/*.md'],
});

if (result.git) {
  console.log(`Commit: ${result.git.commitHash}`);
  console.log(`License: ${result.git.detectedLicense?.identifier}`);
}

Track progress during builds:

await builder.build('./docs', { name: 'my-docs' }, (progress) => {
  console.log(`[${progress.progress}%] ${progress.phase}: ${progress.message}`);
});

Custom Embedders

Use a custom embedding provider by implementing the IEmbedder interface:

import { Builder } from '@libragen/core';
import type { IEmbedder } from '@libragen/core';

class OpenAIEmbedder implements IEmbedder {
  dimensions = 1536;

  async initialize() { /* setup */ }
  async embed(text: string) { /* call OpenAI */ }
  async embedBatch(texts: string[]) { /* batch call */ }
  async dispose() { /* cleanup */ }
}

const builder = new Builder({ embedder: new OpenAIEmbedder() });
const result = await builder.build('./docs', { name: 'my-docs' });

See the API Reference for the full interface specification.

Need Help?

See the Troubleshooting guide for solutions to common build issues like slow builds, memory errors, and poor search results.