Reader
get_sources(documentation_path)
Get the plain text of all the documents in the given folder and split them into smaller chunks.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
documentation_path | str | Path to the directory containing the documents. | required |
Returns:
Type | Description |
---|---|
List[str] | A list of all the documents, split into smaller chunks. |
Source code in src/utils/reader.py
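To make the description above concrete, here is a minimal sketch of a folder-to-chunks pipeline. It is not the code in src/utils/reader.py: the name `get_sources_sketch`, the `chunk_size` parameter, the recursive directory walk, the UTF-8 assumption, and the fixed-width slicing (a stand-in for `split_text`) are all assumptions made for illustration.

```python
from pathlib import Path
from typing import List


def get_sources_sketch(documentation_path: str, chunk_size: int = 512) -> List[str]:
    """Hypothetical sketch: read every file under a folder and chunk its text."""
    chunks: List[str] = []
    # sorted() keeps the output order stable across runs.
    for path in sorted(Path(documentation_path).rglob("*")):
        if not path.is_file():
            continue
        # Assumption: every file is UTF-8 plain text. The documented function
        # converts each file to plain text first (see get_text).
        text = path.read_text(encoding="utf-8")
        # Stand-in for split_text: fixed-width slices of at most chunk_size characters.
        chunks.extend(text[i:i + chunk_size] for i in range(0, len(text), chunk_size))
    return chunks
```

Calling `get_sources_sketch("docs")` on a folder of text files would return one flat list holding the chunks of every file, in path order.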
get_text(input_file)
Determines the file extension and converts the file to plain text.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_file | Path | The path to the file to convert. | required |
Returns:
Type | Description |
---|---|
str | The plain text content of the file. |
Raises:
Type | Description |
---|---|
Exception | If the file extension is not supported or the file cannot be read. |
Source code in src/utils/reader.py
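The extension-based dispatch described above could look roughly like the following sketch. The names `get_text_sketch`, `read_plain`, and `CONVERTERS` are hypothetical, and the set of supported extensions is an assumption; this reference does not list which extensions the real function accepts.

```python
from pathlib import Path
from typing import Callable, Dict


def read_plain(path: Path) -> str:
    """Hypothetical stand-in converter: return the file contents unchanged."""
    return path.read_text(encoding="utf-8")


# Hypothetical extension-to-converter table. The real function presumably
# routes to converters such as markdown_to_text and pdf_to_text.
CONVERTERS: Dict[str, Callable[[Path], str]] = {
    ".txt": read_plain,
    ".md": read_plain,
}


def get_text_sketch(input_file: Path) -> str:
    """Pick a converter from the file extension and return plain text."""
    suffix = input_file.suffix.lower()
    converter = CONVERTERS.get(suffix)
    if converter is None:
        raise Exception(f"Unsupported file extension: {suffix!r}")
    try:
        return converter(input_file)
    except OSError as err:
        # Missing or unreadable files surface as a generic Exception,
        # matching the Raises table above.
        raise Exception(f"Could not read {input_file}") from err
```

Keeping the converters in a dictionary means supporting a new format only requires registering one more entry.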
markdown_to_text(text)
Convert a Markdown string to plain text.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | The Markdown string to convert. | required |
Returns:
Type | Description |
---|---|
str | The plain text content of the Markdown string. |
Source code in src/utils/reader.py
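As a rough illustration of what converting Markdown to plain text involves, the sketch below strips a few common constructs with regular expressions. This is an assumption-laden stand-in, not the implementation in src/utils/reader.py, which may well render the Markdown first and strip the resulting markup instead. The name `markdown_to_text_sketch` is hypothetical, and the patterns deliberately cover only links, images, headings, asterisk emphasis, and inline code.

```python
import re


def markdown_to_text_sketch(text: str) -> str:
    """Hypothetical, deliberately crude Markdown stripper."""
    # Replace images and links with their visible text: ![alt](url), [label](url).
    text = re.sub(r"!?\[([^\]]*)\]\([^)]*\)", r"\1", text)
    # Drop heading markers at the start of a line.
    text = re.sub(r"^#{1,6}\s+", "", text, flags=re.MULTILINE)
    # Remove asterisk emphasis and inline-code markers, keeping the enclosed text.
    # Underscore emphasis is left alone so identifiers like chunk_size survive.
    text = re.sub(r"(\*\*|\*|`)(.+?)\1", r"\2", text)
    return text.strip()
```

For example, `markdown_to_text_sketch("# Title\n\nSome **bold** text.")` yields the title and the sentence without the `#` and `**` markers.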
pdf_to_text(input_file)
Convert a PDF file to plain text.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_file | Path | The path to the input PDF file. | required |
Returns:
Type | Description |
---|---|
str | The plain text content of the PDF file. |
Source code in src/utils/reader.py
split_text(text, separator=' ', chunk_size=512)
Splits a text string into chunks of at most chunk_size characters, using the specified separator character (default is space). Returns a list of strings.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | The text string to split. | required |
separator | str | The character to use as a separator for splitting. | ' ' |
chunk_size | int | The maximum length of each chunk, in characters. | 512 |
Returns:
Type | Description |
---|---|
List[str] | A list of string chunks of the original text. |
Raises:
Type | Description |
---|---|
Exception | If the text cannot be split into chunks of chunk_size characters using the specified separator. |
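One way to read the contract above is as a greedy packer: split on the separator, then join pieces back together while the result stays within `chunk_size`. The sketch below follows that reading. It is an assumption about the behavior, not the code in src/utils/reader.py, and the name `split_text_sketch` is hypothetical.

```python
from typing import List


def split_text_sketch(text: str, separator: str = " ", chunk_size: int = 512) -> List[str]:
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size."""
    chunks: List[str] = []
    current = ""
    for piece in text.split(separator):
        if len(piece) > chunk_size:
            # A single piece longer than chunk_size cannot fit in any chunk.
            raise Exception(
                f"Cannot split text into chunks of {chunk_size} characters "
                f"using separator {separator!r}"
            )
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            # The current chunk is full: store it and start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

With `chunk_size=7`, the input `"aaa bbb ccc ddd"` packs into `["aaa bbb", "ccc ddd"]`, while a ten-character word with `chunk_size=4` raises, since no separator falls inside it.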