This method is about one-third faster than htmlParse, but it is still quite slow.
Usage
standardize_dehtmlize(
x,
as_single_string = FALSE,
as_single_string_sep = "#_|",
use_read_xml = FALSE,
...
)
Arguments
- x
Object (table).
- as_single_string
If set, collapse the character values in the main column of x (i.e., x.col) into a single string. This improves performance (at least for relatively short tables). Default is FALSE.
- as_single_string_sep
Delimiter used to join the collapsed strings so that they can be split apart again later. Default is "#_|".
- use_read_xml
If set, the input is parsed as XML. Default is FALSE, which means it is parsed as HTML.
- ...
Arguments passed on to standardize_options
col
Column of interest (the one to standardize) in the x object (if it is data.frame-like).
rows
Logical vector used to filter the records of interest. Default is NULL, which means no records are filtered out.
omitted_rows_value
If the rows parameter is set, merge omitted_rows_value with the results (which are filtered by rows). Either a single string or a character vector of length nrow(x). If NULL (the default), the original values of col are merged with the results.
output_placement
Where to insert the results (the standardized vector) in the x object. The default option is 'replace_col', which overwrites col in x with the results. Other options:
'omit' :: do not write the results back to the table (usually used when append_output_copy is set, for temporary values)
'prepend_to_col' :: prepend to col
'append_to_col' :: append to col
'prepend_to_x' :: prepend to the x data.frame-like object
'append_to_x' :: append to the x data.frame-like object
x_atomic_name
If x is a vector, use this name for the original column if it appears in the results. Default is "x". If x is a table, the name of col is used.
output_col_name
Use this name for the column with the results (standardized values). Parts in curly brackets are substitution strings. Options for substitutions are:
append_output_copy
Whether to append a copy of the result vector to the x object.
output_copy_col_name
How the appended copy will be named.
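Examples
The as_single_string and rows arguments describe two general ideas: join all strings with a separator so the expensive step runs once, and process only selected records while the rest keep their original values. The base R sketch below illustrates both ideas under stated assumptions. strip_tags is a hypothetical stand-in for the parsing step (it only removes tags with a regular expression) and is not how standardize_dehtmlize is implemented. The variable names are likewise illustrative.

```r
# Hypothetical stand-in for the HTML-decoding step. It only strips tags with a
# regular expression and is NOT the implementation used by standardize_dehtmlize.
strip_tags <- function(s) gsub("<[^>]+>", "", s)

x <- c("<b>Acme</b> Ltd", "Foo <i>Bar</i>", "Baz")

# as_single_string idea: join with the separator, process once, split back.
sep <- "#_|"
collapsed <- paste(x, collapse = sep)
cleaned <- strip_tags(collapsed)
# fixed = TRUE treats the separator literally ("|" is a regex metacharacter).
uncollapsed <- strsplit(cleaned, sep, fixed = TRUE)[[1]]

# rows idea: process only the selected records. With omitted_rows_value = NULL
# the remaining records keep their original values.
rows <- c(TRUE, FALSE, TRUE)
merged <- x
merged[rows] <- strip_tags(x[rows])

print(uncollapsed)
print(merged)
```

In this sketch the separator must not occur in the data, and strsplit drops a trailing empty string, so a final empty element would be lost when splitting. Whether the function itself guards against these cases is not stated in this page.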