
Building Libraries

Create optimized RAG libraries from your documentation

This guide covers advanced options for building high-quality RAG libraries.

Basic Build

libragen build ./docs --name my-docs

This processes all supported files (Markdown, MDX, plain text, and HTML), generating embeddings and a full-text search index.
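
Once the build finishes, you can sanity-check the result with the list and query commands covered later in this guide (the query text below is only an illustration; ask something from your own docs):

libragen list
libragen query -l my-docs "how do I get started"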

Supported File Types

Extension   Format
.md         Markdown
.txt        Plain text
.html       HTML (text extracted)
.mdx        MDX (treated as markdown)

Customize with --include:

libragen build ./docs --name my-docs --include "**/*.md" "**/*.rst" "**/*.txt"

Chunking Strategy

Documents are split into chunks for indexing. The chunking parameters affect search quality. The defaults work well for most cases, but you can tune them for specific content types.

When to Adjust Chunk Settings

Use defaults (chunk-size: 1000, overlap: 100) when:

  • You have typical documentation with mixed content types
  • You’re not sure what settings to use
  • Your docs have clear headings and sections

Use smaller chunks (256-384) when:

  • Content is dense with distinct concepts (API references, glossaries)
  • Users ask very specific questions (“What does X parameter do?”)
  • Each paragraph covers a different topic

Use larger chunks (768-1024) when:

  • Content is narrative or tutorial-style
  • Context matters more than precision (architecture guides, explanations)
  • You want results to include more surrounding context

Chunk Size

Target size in characters.

# Smaller chunks = more precise, more results
libragen build ./docs --name my-docs --chunk-size 256
# Larger chunks = more context per result
libragen build ./docs --name my-docs --chunk-size 1024

By content type:

Content Type          Chunk Size   Why
API reference         256-384      Each function/method is self-contained
Configuration docs    384-512      Options are usually independent
Tutorials             512-768      Steps need some context
Architecture guides   768-1024     Concepts span multiple paragraphs
FAQs                  256-384      Each Q&A is standalone
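
For example, a tutorial-heavy doc set could use a chunk size from the middle of the 512-768 range in the table above (the directory, library name, and exact value here are illustrative):

libragen build ./tutorials --name my-tutorials --chunk-size 640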

Chunk Overlap

Characters shared between adjacent chunks. Overlap helps when:

  • Important context spans chunk boundaries
  • You have long sentences or paragraphs
  • Search queries might match text split across chunks
libragen build ./docs --name my-docs --chunk-overlap 100

Guidelines:

  • 10% overlap (default): Good for well-structured docs with clear sections
  • 15-20% overlap: Better for prose-heavy content or long paragraphs
  • 25%+ overlap: Use when search quality suffers from split context (rare)

Higher overlap increases library size and build time, so only increase if needed.
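
As a concrete example, the 15-20% guideline works out to roughly 100 characters of overlap for a 512-character chunk size (the numbers and library name here are illustrative, not a recommendation):

libragen build ./docs --name prose-docs --chunk-size 512 --chunk-overlap 100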

Versioning Content

Track documentation versions to match your releases:

libragen build ./docs \
  --name my-api \
  --content-version 2.1.0

Query specific versions later:

libragen query -l my-api --content-version 2.1.0 "authentication"

Excluding Files

Skip files matching glob patterns:

libragen build ./docs \
  --name my-docs \
  --exclude "**/node_modules/**" \
  --exclude "**/drafts/**" \
  --exclude "**/*.test.md"

Adding Metadata

Provide a description for better discoverability:

libragen build ./docs \
  --name react-docs \
  --description "Official React documentation including hooks, components, and API reference"

The description appears in libragen list and helps AI tools understand what the library contains.

License Tracking

When building from git repositories, licenses are automatically detected from LICENSE files. You can also specify licenses explicitly:

# Explicit license
libragen build ./docs --name my-docs --license MIT
# Multiple licenses (dual licensing)
libragen build ./docs --name my-docs --license MIT Apache-2.0

View license information with:

libragen inspect my-docs.libragen

Supported licenses: MIT, Apache-2.0, GPL-3.0, GPL-2.0, BSD-3-Clause, BSD-2-Clause, ISC, Unlicense, and more.

CI/CD Integration

Automate library builds in your pipeline. See the CI Integration guide for complete examples with GitHub Actions, GitLab CI, CircleCI, and Azure Pipelines.
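
As a rough sketch, a CI job typically re-runs the same build command on each release, passing the release tag as the content version. The commands below are a generic shell sketch only; they assume the libragen CLI is already installed on the runner and that $RELEASE_TAG holds the release version (see the CI Integration guide for real, per-platform examples):

# Rebuild the library for the current release (sketch; flags are documented above)
libragen build ./docs \
  --name my-api \
  --content-version "$RELEASE_TAG" \
  --output ./artifacts/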

Output Location

By default, libraries are saved to the current directory. Specify a different location:

libragen build ./docs \
  --name my-docs \
  --output ~/.libragen/libraries/

Performance Tips

Large Documentation Sets

For very large doc sets (>10,000 files):

  1. Use larger chunk sizes to reduce total chunks
  2. Exclude non-essential files such as changelogs and drafts (a combined example follows this list)
  3. Build incrementally if possible
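
A single build combining the first two tips might look like this (the library name and exclude patterns are illustrative):

libragen build ./docs \
  --name big-docs \
  --chunk-size 1024 \
  --exclude "**/CHANGELOG*.md" \
  --exclude "**/drafts/**"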

Optimizing for Search Quality

  1. Structure your docs - Use clear headings and sections
  2. Front-load important content - Key information at the start of sections
  3. Use consistent terminology - Same terms across related docs
  4. Include examples - Code examples improve retrieval for technical queries

Programmatic Building

Use the Builder class from @libragen/core to build libraries programmatically:

import { Builder } from '@libragen/core';

const builder = new Builder();

const result = await builder.build('./docs', {
  name: 'my-docs',
  version: '1.0.0',
  description: 'My documentation',
  chunkSize: 1000,
  chunkOverlap: 100,
});

console.log(`Built: ${result.outputPath}`);
console.log(`Chunks: ${result.stats.chunkCount}`);
console.log(`Time: ${result.stats.embedDuration}s`);

Build from git repositories:

const result = await builder.build('https://github.com/user/repo', {
  gitRef: 'v2.0.0',
  include: ['docs/**/*.md'],
});

if (result.git) {
  console.log(`Commit: ${result.git.commitHash}`);
  console.log(`License: ${result.git.detectedLicense?.identifier}`);
}

Track progress during builds:

await builder.build('./docs', { name: 'my-docs' }, (progress) => {
  console.log(`[${progress.progress}%] ${progress.phase}: ${progress.message}`);
});

Custom Embedders

Use a custom embedding provider by implementing the IEmbedder interface:

import { Builder } from '@libragen/core';
import type { IEmbedder } from '@libragen/core';

class OpenAIEmbedder implements IEmbedder {
  dimensions = 1536;

  async initialize() { /* setup */ }
  async embed(text: string) { /* call OpenAI */ }
  async embedBatch(texts: string[]) { /* batch call */ }
  async dispose() { /* cleanup */ }
}

const builder = new Builder({ embedder: new OpenAIEmbedder() });
const result = await builder.build('./docs', { name: 'my-docs' });

See the API Reference for the full interface specification.

Need Help?

See the Troubleshooting guide for solutions to common build issues like slow builds, memory errors, and poor search results.