## How to Run
- The first step is to download the model weights if you do not already have them. There are a few torrents floating around, as well as some Hugging Face repositories (e.g. https://huggingface.co/nyanko7/LLaMA-7B/). Once you have them, copy them into the `models` folder. Any quantized model that can run in llama.cpp should work: LLaMA, Alpaca, GPT4All, Vicuna...
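  For example, if you downloaded a GPT4All checkpoint, copying it into place could look like the following sketch (the source path is purely illustrative; use wherever your download actually lives):

  ```shell
  # Illustrative paths: adjust the source to wherever you downloaded the weights
  mkdir -p models/gpt4all
  cp ~/Downloads/gpt4all-lora-quantized-new.bin models/gpt4all/
  ```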
- Install the necessary requirements listed in `requirements.lock`. Using a virtual environment is recommended.
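  For example, with a standard virtual environment (assuming `requirements.lock` is a pip-compatible requirements file):

  ```shell
  # Create and activate a virtual environment, then install the pinned dependencies
  python -m venv .venv
  source .venv/bin/activate
  pip install -r requirements.lock
  ```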
### CLI
There are two simple CLI applications:
- `ingest`: create an embedding vector store given a folder with a bunch of documents.

```shell
❯ python -m src.ingest --help

 Usage: python -m src.ingest [OPTIONS] DOCUMENTATION_PATH MODEL_PATH

 Ingest all the markdown files from documentation_path and create a vector store.
 It will create a new folder with the embedding index.

╭─ Arguments ──────────────────────────────────────────────────────────────────╮
│ *    documentation_path      TEXT  Folder containing the documents.          │
│                                    [default: None] [required]                │
│ *    model_path              TEXT  Folder containing the model.              │
│                                    [default: None] [required]                │
╰───────────────────────────────────────────────────────────────────────────────╯
╭─ Options ─────────────────────────────────────────────────────────────────────╮
│ --help          Show this message and exit.                                   │
╰───────────────────────────────────────────────────────────────────────────────╯
```
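For instance, to index the documents in the `data` folder with the GPT4All model used in the API examples below (the positional order is `documentation_path`, then `model_path`):

```shell
python -m src.ingest data models/gpt4all/gpt4all-lora-quantized-new.bin
```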
- `query`: ask a question to the LLM given the embedding path.

```shell
❯ python -m src.query --help

 Usage: python -m src.query [OPTIONS] QUESTION MODEL_PATH [INDEX_PATH]

 Ask a question to a LLM using an index of embeddings containing the knowledge.
 If no index_path is specified it will only use the LLM to answer the question.

╭─ Arguments ──────────────────────────────────────────────────────────────────╮
│ *    question      TEXT          Question to answer. [default: None]         │
│                                  [required]                                  │
│ *    model_path    TEXT          Folder containing the model.                │
│                                  [default: None] [required]                  │
│      index_path    [INDEX_PATH]  Folder containing the vector store with the │
│                                  embeddings. If none provided, only LLM is   │
│                                  used. [default: None]                       │
╰───────────────────────────────────────────────────────────────────────────────╯
╭─ Options ─────────────────────────────────────────────────────────────────────╮
│ --help          Show this message and exit.                                   │
╰───────────────────────────────────────────────────────────────────────────────╯
```
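For instance, reusing the values from the API query example below (the index folder is the one created by `ingest`):

```shell
python -m src.query "who is the best?" models/gpt4all/gpt4all-lora-quantized-new.bin index/datadog_documentation
```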
### API
To run the uvicorn server:
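A minimal sketch, assuming the FastAPI app is exposed as `src.api:app` (this module path is an assumption; adjust it to the project's actual entrypoint) and listens on port 8383 to match the examples below:

```shell
# src.api:app is an assumed module path; point uvicorn at the project's actual app object
uvicorn src.api:app --host 0.0.0.0 --port 8383
```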
Then you can hit the two endpoints available:

- Ingest:

```shell
curl "http://0.0.0.0:8383/ingest?documentation_path=data&model_path=models/gpt4all/gpt4all-lora-quantized-new.bin"
```
- Query:

```shell
curl -X POST -H "Content-Type: application/json" --data '{
    "question":"who is the best?",
    "index_path":"index/datadog_documentation",
    "model_path":"models/gpt4all/gpt4all-lora-quantized-new.bin"}' http://0.0.0.0:8383/query
```
- The `index_path` is not mandatory. If not provided, only the LLM will be used to answer the question.
- The `model_path` is only mandatory for the first request. Once specified, the model stays loaded in memory, so you do not need to provide it again. If another `model_path` is provided, a different model will be loaded.
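For example, once a model has been loaded by an earlier request, a follow-up query can omit `model_path`:

```shell
# Assumes a previous request already loaded the model into memory
curl -X POST -H "Content-Type: application/json" --data '{
    "question":"who is the best?",
    "index_path":"index/datadog_documentation"}' http://0.0.0.0:8383/query
```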