text_dataset_from_directory function
tf.keras.preprocessing.text_dataset_from_directory(
directory,
labels="inferred",
label_mode="int",
class_names=None,
batch_size=32,
max_length=None,
shuffle=True,
seed=None,
validation_split=None,
subset=None,
follow_links=False,
)
Generates a tf.data.Dataset from text files in a directory.
If your directory structure is:
main_directory/
...class_a/
......a_text_1.txt
......a_text_2.txt
...class_b/
......b_text_1.txt
......b_text_2.txt
Then calling text_dataset_from_directory(main_directory, labels='inferred') will return a tf.data.Dataset that yields batches of texts from the subdirectories class_a and class_b, together with labels 0 and 1 (0 corresponding to class_a and 1 corresponding to class_b).
Only .txt files are supported at this time.
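For instance, a minimal sketch of loading and inspecting such a directory (assuming the main_directory layout above exists on disk):

import tensorflow as tf

dataset = tf.keras.preprocessing.text_dataset_from_directory(
    "main_directory",
    labels="inferred",
    label_mode="int",
    batch_size=32,
)
for texts, labels in dataset.take(1):
    print(texts.shape)   # (32,): a batch of raw text strings
    print(labels.shape)  # (32,): int32 labels, 0 for class_a, 1 for class_b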
Arguments
- directory: Directory where the data is located. If labels is "inferred", it should contain subdirectories, each containing text files for a class. Otherwise, the directory structure is ignored.
- labels: Either "inferred" (labels are generated from the directory structure), or a list/tuple of integer labels of the same size as the number of text files found in the directory. Labels should be sorted according to the alphanumeric order of the text file paths (obtained via os.walk(directory) in Python).
- label_mode:
  - 'int' means that the labels are encoded as integers (e.g. for sparse_categorical_crossentropy loss).
  - 'categorical' means that the labels are encoded as a categorical vector (e.g. for categorical_crossentropy loss).
  - 'binary' means that the labels (there can be only 2) are encoded as float32 scalars with values 0 or 1 (e.g. for binary_crossentropy).
  - None (no labels).
- class_names: Only valid if labels is "inferred". This is the explicit list of class names (must match names of subdirectories). Used to control the order of the classes (otherwise alphanumerical order is used).
- batch_size: Size of the batches of data. Default: 32.
- max_length: Maximum size of a text string. Texts longer than this will be truncated to max_length.
- shuffle: Whether to shuffle the data. Default: True. If set to False, sorts the data in alphanumeric order.
- seed: Optional random seed for shuffling and transformations.
- validation_split: Optional float between 0 and 1, fraction of data to reserve for validation.
- subset: One of "training" or "validation". Only used if validation_split is set.
- follow_links: Whether to visit subdirectories pointed to by symlinks. Defaults to False.
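For example, a common pattern for building matched training and validation splits (a sketch, assuming the same main_directory as above; both calls must use the same seed so the splits do not overlap):

train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    "main_directory",
    validation_split=0.2,
    subset="training",
    seed=1337,  # must match the seed of the validation split
)
val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    "main_directory",
    validation_split=0.2,
    subset="validation",
    seed=1337,
)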
Returns
A tf.data.Dataset object.
- If label_mode is None, it yields string tensors of shape (batch_size,), containing the contents of a batch of text files.
- Otherwise, it yields a tuple (texts, labels), where texts has shape (batch_size,) and labels follows the format described below.
Rules regarding labels format:
- if label_mode is int, the labels are an int32 tensor of shape (batch_size,).
- if label_mode is binary, the labels are a float32 tensor of 1s and 0s of shape (batch_size, 1).
- if label_mode is categorical, the labels are a float32 tensor of shape (batch_size, num_classes), representing a one-hot encoding of the class index.
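As an illustration of these formats, a sketch using label_mode='categorical' with the two-class directory from above (the printed shapes assume the default batch_size of 32):

ds = tf.keras.preprocessing.text_dataset_from_directory(
    "main_directory",
    label_mode="categorical",
)
texts, labels = next(iter(ds))
print(texts.shape)   # (32,)
print(labels.shape)  # (32, 2): one-hot float32 rows, e.g. [1., 0.] for class_a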
Tokenizer class
tf.keras.preprocessing.text.Tokenizer(
num_words=None,
filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
lower=True,
split=" ",
char_level=False,
oov_token=None,
document_count=0,
**kwargs
)
Text tokenization utility class.
This class allows vectorizing a text corpus, by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or into a vector where the coefficient for each token can be binary, based on word count, or based on tf-idf.
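A minimal usage sketch (the corpus strings below are invented for illustration):

from tensorflow.keras.preprocessing.text import Tokenizer

corpus = ["the cat sat on the mat", "the dog ate my homework"]

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(corpus)  # builds the word index from the corpus
sequences = tokenizer.texts_to_sequences(corpus)  # each text becomes a list of integer indices
matrix = tokenizer.texts_to_matrix(corpus, mode="tfidf")  # each text becomes a single vector

print(tokenizer.word_index)  # e.g. {'the': 1, 'cat': 2, ...}; index 0 is reserved
print(sequences)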
Arguments
- num_words: the maximum number of words to keep, based on word frequency. Only the most common num_words-1 words will be kept.
- filters: a string where each element is a character that will be filtered from the texts. The default is all punctuation, plus tabs and line breaks, minus the ' character.
- lower: boolean. Whether to convert the texts to lowercase.
- split: str. Separator for word splitting.
- char_level: if True, every character will be treated as a token.
- oov_token: if given, it will be added to word_index and used to replace out-of-vocabulary words during text_to_sequence calls.
By default, all punctuation is removed, turning the texts into space-separated sequences of words (words may include the ' character). These sequences are then split into lists of tokens. They will then be indexed or vectorized.
0 is a reserved index that won't be assigned to any word.
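To illustrate how num_words and oov_token interact (a sketch; the toy corpus is invented for this example):

from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=3, oov_token="<OOV>")
tokenizer.fit_on_texts(["apple apple banana cherry"])
# word_index records every word, but texts_to_sequences only emits indices
# below num_words; rarer words are replaced by the oov_token's index.
print(tokenizer.word_index)  # {'<OOV>': 1, 'apple': 2, 'banana': 3, 'cherry': 4}
print(tokenizer.texts_to_sequences(["apple banana cherry"]))  # [[2, 1, 1]]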