Class: HTMLReader

Extracts the significant text from an arbitrary HTML document. The contents of any head, script, style, and xml tags are removed completely. For a[href] tags, the URL is extracted along with the tag's inner text. All other tags are removed and their inner text is kept intact. HTML entities (e.g., &amp;) are not decoded.
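The rules above can be illustrated with a small, self-contained sketch. This is not the string-strip-html implementation and not the reader's actual code: stripHtmlSketch is a hypothetical regex-based stand-in, and the way it places the link URL after the link text is an assumption made only for illustration.

```typescript
// Illustrative sketch only: NOT the string-strip-html implementation.
// Tags whose contents are dropped entirely, per the class description.
const REMOVE_WITH_CONTENTS = ["head", "script", "style", "xml"];

function stripHtmlSketch(html: string): string {
  let text = html;
  // Drop each listed tag together with everything inside it.
  for (const tag of REMOVE_WITH_CONTENTS) {
    const block = new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?</${tag}\\s*>`, "gi");
    text = text.replace(block, " ");
  }
  // Keep the link text and append the href after it.
  // Assumption: this output format is chosen for illustration only,
  // and only double-quoted href attributes are handled.
  text = text.replace(
    /<a\b[^>]*\bhref\s*=\s*"([^"]*)"[^>]*>([\s\S]*?)<\/a\s*>/gi,
    "$2 $1",
  );
  // Remove every remaining tag but keep its inner text.
  text = text.replace(/<[^>]+>/g, " ");
  // Entities such as &amp; are deliberately left undecoded.
  return text.replace(/\s+/g, " ").trim();
}

const sampleHtml =
  "<html><head><title>T</title></head><body><script>var x = 1;</script>" +
  '<p>Fish &amp; chips: <a href="https://example.com/menu">the menu</a></p>' +
  "</body></html>";
const strippedText = stripHtmlSketch(sampleHtml);
console.log(strippedText);
```

A regex pass like this is fragile on malformed markup, which is presumably one reason the reader delegates the work to a dedicated library.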

Extends

FileReader

Constructors

new HTMLReader()

new HTMLReader(): HTMLReader

Returns

HTMLReader

Inherited from

FileReader.constructor

Methods

getOptions()

getOptions(): object

Returns the configuration options that this reader passes to the string-strip-html library.

Returns

object

An object of options for the underlying library

skipHtmlDecoding

skipHtmlDecoding: boolean = true

stripTogetherWithTheirContents

stripTogetherWithTheirContents: string[]

See

https://codsen.com/os/string-strip-html/examples

Defined in

packages/llamaindex/src/readers/HTMLReader.ts:43
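As a rough sketch of the object described above: the two field names and the default of true come from this reference, while the tag list is inferred from the class description rather than copied from the library source. StripOptionsSketch, defaultOptionsSketch, and withExtraTags are hypothetical names used only for illustration.

```typescript
// Shape of the options object, based on the two documented fields.
interface StripOptionsSketch {
  skipHtmlDecoding: boolean;
  stripTogetherWithTheirContents: string[];
}

// Assumed values: the tag list is inferred from the class description,
// not taken from the library source.
function defaultOptionsSketch(): StripOptionsSketch {
  return {
    skipHtmlDecoding: true,
    stripTogetherWithTheirContents: ["head", "script", "style", "xml"],
  };
}

// Hypothetical helper: returns a new options object with extra tags added,
// skipping tags already listed in base and leaving base unmodified.
function withExtraTags(
  base: StripOptionsSketch,
  extra: string[],
): StripOptionsSketch {
  const merged = base.stripTogetherWithTheirContents.concat(
    extra.filter(
      (tag) => base.stripTogetherWithTheirContents.indexOf(tag) === -1,
    ),
  );
  return { ...base, stripTogetherWithTheirContents: merged };
}

const defaultOptions = defaultOptionsSketch();
const extendedOptions = withExtraTags(defaultOptions, ["nav", "script"]);
console.log(extendedOptions.stripTogetherWithTheirContents.join(","));
```

Because parseContent() accepts an options argument, an object built this way could plausibly be passed to it in place of the defaults; whether that is the intended extension point is not stated in this reference.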

loadData()

loadData(filePath): Promise<Document<Metadata>[]>

Parameters

• filePath: string

loadDataAsContent()

loadDataAsContent(fileContent): Promise<Document<Metadata>[]>

Public entry point for this reader. Required by the BaseReader interface.

Parameters

• fileContent: Uint8Array

Returns

Promise<Document<Metadata>[]>

A Promise that eventually yields zero or one Document parsed from the HTML content of the specified file.

Overrides

FileReader.loadDataAsContent

Defined in

packages/llamaindex/src/readers/HTMLReader.ts:18
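Since this method takes raw bytes rather than a string, a caller has to supply a Uint8Array. The sketch below shows one plausible decode, strip, and wrap flow under stated assumptions: the bytes are assumed to be UTF-8, DocumentSketch is a hypothetical stand-in for the library's Document type, and the strip step is passed in as a function so the example does not depend on string-strip-html. It is synchronous for simplicity, whereas the documented method returns a Promise.

```typescript
// Hypothetical stand-in for the library's Document type.
interface DocumentSketch {
  text: string;
}

// Assumption: the file bytes are UTF-8 encoded HTML.
function decodeHtmlBytes(fileContent: Uint8Array): string {
  return new TextDecoder("utf-8").decode(fileContent);
}

// Hypothetical flow: decode the bytes, strip the markup, wrap the result.
// This sketch always returns exactly one document; the conditions under
// which the real method yields none are not stated in this reference.
function bytesToDocumentsSketch(
  fileContent: Uint8Array,
  strip: (html: string) => string,
): DocumentSketch[] {
  const text = strip(decodeHtmlBytes(fileContent));
  return [{ text }];
}

// Build the Uint8Array input from an in-memory HTML string.
const htmlBytes = new TextEncoder().encode("<p>caf\u00e9 &amp; bar</p>");
const sketchDocs = bytesToDocumentsSketch(htmlBytes, (html) =>
  html.replace(/<[^>]+>/g, ""),
);
console.log(sketchDocs[0].text);
```

With the real class, the same htmlBytes value would be the argument to loadDataAsContent(fileContent), going by the signature documented above.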

parseContent()

parseContent(html, options): Promise<string>

Wraps the call to the string-strip-html library.

Parameters

• html: string

Raw HTML content to be parsed.

• options: any = {}

An object of options for the underlying library

Returns

Promise<string>

The HTML content, stripped of unwanted tags and attributes

See

getOptions

Defined in

packages/llamaindex/src/readers/HTMLReader.ts:33

addMetaData()

static addMetaData(filePath): (doc, index) => void

Parameters

• filePath: string

Returns

Function

Parameters

• doc: Document<Metadata>

• index: number

Returns

void

Inherited from

FileReader.addMetaData

Defined in

packages/llamaindex/src/readers/type.ts:29
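The signature above describes a function that takes a file path and returns a callback suitable for Array.prototype.forEach. A minimal sketch of that pattern follows. DocSketch is a hypothetical stand-in for Document<Metadata>, and the metadata keys (file_path, file_name, doc_index) are assumptions chosen for illustration, not the keys the library is documented to write.

```typescript
// Hypothetical stand-in for Document<Metadata>.
interface DocSketch {
  text: string;
  metadata: Record<string, unknown>;
}

// Same shape as the documented signature: (filePath) => (doc, index) => void.
// The metadata keys below are assumptions chosen for illustration.
function addMetaDataSketch(
  filePath: string,
): (doc: DocSketch, index: number) => void {
  // Take the last path segment, accepting both "/" and "\" separators.
  const segments = filePath.split(/[\\/]/);
  const fileName = segments[segments.length - 1];
  return (doc, index) => {
    doc.metadata["file_path"] = filePath;
    doc.metadata["file_name"] = fileName;
    doc.metadata["doc_index"] = index;
  };
}

const loadedDocs: DocSketch[] = [{ text: "hello", metadata: {} }];
// The returned callback mutates each document in place.
loadedDocs.forEach(addMetaDataSketch("pages/index.html"));
console.log(loadedDocs[0].metadata);
```

Returning a closure keeps the file path out of the per-document call, so the same callback can be applied to every document loaded from one file.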
