How to Run

  • First, download the model weights if you do not already have them. There are a few torrents floating around as well as some Hugging Face repositories. Once you have the weights, copy them into the models folder. Any quantized model that runs in llama.cpp should work: Llama, Alpaca, GPT4All, Vicuna...
  • Install the requirements listed in requirements.lock. Using a virtual environment is recommended.
    python -m venv env
    source env/bin/activate
    pip install -r requirements.lock


There are two simple CLI applications:

  • ingest: create an embedding vector store from a folder of documents

    ❯ python -m src.ingest --help
     Usage: python -m src.ingest [OPTIONS] DOCUMENTATION_PATH MODEL_PATH
     Ingest all the markdown files from documentation_path and create a vector store. It will
     create a new folder with the embedding index.
    ╭─ Arguments ───────────────────────────────────────────────────────────────────────────────╮
    │ *    documentation_path      TEXT  Folder containing the documents. [default: None]       │
    │                                    [required]                                             │
    │ *    model_path              TEXT  Folder containing the model. [default: None]           │
    │                                    [required]                                             │
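    ╰───────────────────────────────────────────────────────────────────────────────────────────╯
    ╭─ Options ─────────────────────────────────────────────────────────────────────────────────╮
    │ --help          Show this message and exit.                                               │
    ╰───────────────────────────────────────────────────────────────────────────────────────────╯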

  • query: ask the LLM a question, optionally using the embedding index

    ❯ python -m src.query --help
     Usage: python -m src.query [OPTIONS] QUESTION MODEL_PATH [INDEX_PATH]
     Ask a question to a LLM using an index of embeddings containing the knowledge. If no
     index_path is specified it will only use the LLM to answer the question.
    ╭─ Arguments ───────────────────────────────────────────────────────────────────────────────╮
    │ *    question        TEXT          Question to answer. [default: None] [required]         │
    │ *    model_path      TEXT          Folder containing the model. [default: None]           │
    │                                    [required]                                             │
    │      index_path      [INDEX_PATH]  Folder containing the vector store with the            │
    │                                    embeddings. If none provided, only LLM is used.        │
    │                                    [default: None]                                        │
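    ╰───────────────────────────────────────────────────────────────────────────────────────────╯
    ╭─ Options ─────────────────────────────────────────────────────────────────────────────────╮
    │ --help          Show this message and exit.                                               │
    ╰───────────────────────────────────────────────────────────────────────────────────────────╯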
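
If you want to drive the two CLIs from a script, the argument order shown in the help output above can be sketched as follows. This is only an illustration: the helper names `build_ingest_command` and `build_query_command` are hypothetical and not part of the project, and the paths in the usage comment are placeholders.

```python
import sys
from typing import List, Optional


def build_ingest_command(documentation_path: str, model_path: str) -> List[str]:
    """Build the argv for `python -m src.ingest DOCUMENTATION_PATH MODEL_PATH`."""
    return [sys.executable, "-m", "src.ingest", documentation_path, model_path]


def build_query_command(
    question: str, model_path: str, index_path: Optional[str] = None
) -> List[str]:
    """Build the argv for `python -m src.query QUESTION MODEL_PATH [INDEX_PATH]`."""
    command = [sys.executable, "-m", "src.query", question, model_path]
    # INDEX_PATH is optional: without it, only the LLM answers the question.
    if index_path is not None:
        command.append(index_path)
    return command


# Example usage (run from the repository root; the paths are placeholders):
#   import subprocess
#   subprocess.run(build_ingest_command("docs", "models/my-model"), check=True)
#   subprocess.run(build_query_command("who is the best?", "models/my-model"), check=True)
```

Passing the arguments as a list avoids shell quoting problems when the question contains spaces or punctuation.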


To run the uvicorn server:

uvicorn main:app --reload --host --port 8383
Then you can call the two available endpoints:


curl ""
curl -X POST -H "Content-Type: application/json" --data '{
    "question":"who is the best?",
Note that when hitting the query endpoint:

  • The index_path is optional. If it is not provided, only the LLM is used to answer the question.
  • The model_path is mandatory only on the first request. Once specified, the model stays loaded in memory, so later requests can omit model_path. If a different model_path is provided, that model is loaded instead.
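
The model_path behavior described above can be pictured as a small cache. The sketch below is an assumption about how such logic might look, not the server's actual code: `ModelCache` and the `loader` callable are hypothetical names introduced here for illustration.

```python
from typing import Any, Callable, Optional


class ModelCache:
    """Keep one model in memory and reload only when model_path changes."""

    def __init__(self, loader: Callable[[str], Any]) -> None:
        self._loader = loader  # hypothetical function that loads a model from a path
        self._model_path: Optional[str] = None
        self._model: Any = None

    def get(self, model_path: Optional[str] = None) -> Any:
        if model_path is None:
            # No path given: reuse the loaded model, or fail on the first request.
            if self._model_path is None:
                raise ValueError("model_path is required on the first request")
            return self._model
        if model_path != self._model_path:
            # A different path was given: load that model and replace the old one.
            self._model = self._loader(model_path)
            self._model_path = model_path
        return self._model
```

With this shape, the expensive load happens once per distinct model_path, which matches the note that only the first request needs to name the model.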