File encoding detection in R
Posted on October 24, 2017
by Stéphane Laurent
Tags: R
There is a Java port of universalchardet, the encoding detector library of Mozilla. It is called juniversalchardet. I’m going to show how to use it with the rJava package.
First, download the jar file here: https://code.google.com/archive/p/juniversalchardet/downloads. Then, proceed as follows:
library(rJava)
.jinit()
.jaddClassPath("path/to/juniversalchardet-1.0.3.jar")
# create the detector object
detector <- new(J("org/mozilla/universalchardet/UniversalDetector"), NULL)
f <- "juniversalchardet.Rmd" # file whose encoding you want to know
flength <- as.integer(file.size(f))
# feed the whole file to the detector as raw bytes
.jcall(detector, "V", "handleData",
       readBin(f, what="raw", n=flength), 0L, flength)
.jcall(detector, "V", "dataEnd")
.jcall(detector, "S", "getDetectedCharset")
## [1] "UTF-8"
.jcall(detector, "V", "reset")
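If you need to check several files, the sequence above can be wrapped in a small function. The following is only a sketch: the name `detect_encoding` and its arguments are hypothetical, and it assumes that the jar is already on the class path and that `detector` was created as shown above. It uses nothing but the calls already seen in this post.

```r
library(rJava)

# Hypothetical helper, not part of juniversalchardet or rJava: it wraps the
# handleData / dataEnd / getDetectedCharset / reset sequence in a function.
# `detector` is assumed to be a UniversalDetector instance created beforehand.
detect_encoding <- function(path, detector) {
  # reset the detector on exit, so that it can be reused for the next file
  on.exit(.jcall(detector, "V", "reset"), add = TRUE)
  n <- as.integer(file.size(path))
  bytes <- readBin(path, what = "raw", n = n)
  # feed the whole file to the detector, then signal the end of the data
  .jcall(detector, "V", "handleData", bytes, 0L, n)
  .jcall(detector, "V", "dataEnd")
  .jcall(detector, "S", "getDetectedCharset")
}

# Usage, assuming `detector` exists:
# detect_encoding("juniversalchardet.Rmd", detector)
```

Putting the reset in `on.exit` means the detector is cleared even when reading the file fails, so a single detector object can be shared across calls.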
Update 2020
Nowadays one can achieve the same result with the R package uchardet (which does not use Java).
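Here is a minimal sketch of the uchardet approach. It assumes the package is installed from CRAN and uses its `detect_file_enc` function. The sample file is made up for the illustration, so replace `f` with the path of your own file.

```r
# Assumes the CRAN package 'uchardet' is installed:
# install.packages("uchardet")
library(uchardet)

# Write a small sample file containing UTF-8 encoded accented characters
# (illustrative only; the \u escapes give a UTF-8 string on any platform)
f <- tempfile(fileext = ".txt")
writeBin(charToRaw("St\u00e9phane \u00e9crit en fran\u00e7ais.\n"), f)

# Guess the encoding of the file; the result is a character vector
enc <- detect_file_enc(f)
print(enc)
```

The guess is a heuristic, so it is worth checking the result on very short files or files with few non-ASCII characters.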