
The method is about one third faster than htmlParse, but it is still quite slow.

Usage

standardize_dehtmlize(
  x,
  as_single_string = FALSE,
  as_single_string_sep = "#_|",
  use_read_xml = FALSE,
  ...
)

Arguments

x

object (table)

as_single_string

If set, collapse the character values in the main column of x (i.e., x.col) into a single string. This improves performance, at least for relatively short tables. Default is FALSE.

as_single_string_sep

Delimiter placed between the collapsed strings so that they can be split apart again later. Default is "#_|".
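The collapse-and-split idea behind as_single_string and as_single_string_sep can be sketched in base R. This is only an illustration of the technique, not the function's actual implementation; strip_tags is a hypothetical stand-in for the real HTML parsing step.

```r
# Hypothetical stand-in for an expensive cleaning step (not the package's code)
strip_tags <- function(s) gsub("<[^>]+>", "", s)

x <- c("<b>Acme</b> Ltd", "Foo <i>Bar</i>", "plain")
sep <- "#_|"

# Collapse all values into one string so the cleaning step runs only once
collapsed <- paste(x, collapse = sep)
cleaned <- strip_tags(collapsed)

# Split on the literal separator; fixed = TRUE stops "|" being read as regex
result <- strsplit(cleaned, sep, fixed = TRUE)[[1]]

# The round trip must return one element per input element
stopifnot(length(result) == length(x))
```

The separator must not occur in the data itself, otherwise the split returns more elements than went in.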

use_read_xml

If set, the input is parsed as XML. Default is FALSE, which means it is parsed as HTML.

...

Arguments passed on to standardize_options

col

Column of interest (the one to standardize) in the x object, if x is data.frame-like.

rows

Logical vector to filter records of interest. Default is NULL, which means records are not filtered.

omitted_rows_value

If the rows parameter is set, merge omitted_rows_value with the results (filtered by rows). Either a single string or a character vector of length nrow(x). If NULL (the default), the original values of col are merged with the results.
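The interaction between rows and omitted_rows_value, as described above, can be sketched in base R. This is an assumption-laden illustration of the documented semantics, not the package's code; strip_tags is a hypothetical stand-in for the standardization step.

```r
# Hypothetical stand-in for the real cleaning step (not the package's code)
strip_tags <- function(s) gsub("<[^>]+>", "", s)

x <- c("<b>a</b>", "<i>b</i>", "c")
rows <- c(TRUE, FALSE, TRUE)

# omitted_rows_value = NULL: skipped rows keep their original values
out_default <- x
out_default[rows] <- strip_tags(x[rows])

# omitted_rows_value = "omitted": skipped rows receive the single string
out_fill <- rep("omitted", length(x))
out_fill[rows] <- strip_tags(x[rows])
```

In both cases the result has the same length as x, so it can be written back to the table without realignment.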

output_placement

Where to insert the results (standardized vector) in the x object. The default option is 'replace_col', which overwrites col in x with the results. Other options:

  • 'omit' :: do not write results back to table (usually used when append_output_copy is set for temporary values)

  • 'prepend_to_col' :: prepend to col

  • 'append_to_col' :: append to col

  • 'prepend_to_x' :: prepend to x data.frame like object

  • 'append_to_x' :: append to x data.frame like object
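Three of the placements listed above can be sketched on a plain data.frame in base R. This only illustrates the resulting column layouts under the description given here; the column name name_clean and the cleaning step are hypothetical, and the function's real output naming is controlled by output_col_name.

```r
x <- data.frame(id = 1:2, name = c("<b>A</b>", "B"), stringsAsFactors = FALSE)

# Hypothetical result vector standing in for the standardized values
res <- gsub("<[^>]+>", "", x$name)

# 'replace_col': overwrite the column of interest with the results
replaced <- x
replaced$name <- res

# 'append_to_x': add the results as the last column of x
appended <- data.frame(x, name_clean = res, stringsAsFactors = FALSE)

# 'prepend_to_x': add the results as the first column of x
prepended <- data.frame(name_clean = res, x, stringsAsFactors = FALSE)
```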

x_atomic_name

If x is a vector, use this name for the original column if it is in the results. Default is "x". If x is a table, the name of col is used.

output_col_name

Use this name for the column with results (standardized values). Parts in curly brackets are substitute strings. Options for substitutions are:

append_output_copy

Whether to append a copy of the result vector to the x object.

output_copy_col_name

How the appended copy will be named.

Value

updated object

References

http://stackoverflow.com/questions/5060076