File encoding detection in R

Posted on October 24, 2017 by Stéphane Laurent
Tags: R

There is a Java port of universalchardet, Mozilla’s encoding detection library. It is called juniversalchardet. I’m going to show how to use it from R with the rJava package.

First, download the jar file from https://code.google.com/archive/p/juniversalchardet/downloads. Then proceed as follows:

library(rJava)
.jinit() # start the JVM
.jaddClassPath("path/to/juniversalchardet-1.0.3.jar")
# create the detector; NULL means no charset listener
detector <- new(J("org/mozilla/universalchardet/UniversalDetector"), NULL)
f <- "juniversalchardet.Rmd" # file whose encoding you want to know
flength <- as.integer(file.size(f))
# feed the whole file to the detector as a byte array
.jcall(detector, "V", "handleData",
       readBin(f, what="raw", n=flength), 0L, flength)
.jcall(detector, "V", "dataEnd") # signal the end of the data
.jcall(detector, "S", "getDetectedCharset")
## [1] "UTF-8"
.jcall(detector, "V", "reset") # make the detector ready for another file
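
Once the charset name is known, it can be used to decode the file. Here is a minimal sketch with base R only. The Latin-1 sample bytes and the variable names are made up for illustration, and charset merely stands in for whatever getDetectedCharset returned.

```r
# Hypothetical example: 'charset' stands in for the detector's answer.
charset <- "ISO-8859-1"

# Write the Latin-1 bytes of "cafe" with an acute accent on the "e"
# to a temporary file (0xe9 is e-acute in Latin-1).
tmp <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x63, 0x61, 0x66, 0xe9)), tmp)

# Read the raw bytes back and convert them from 'charset' to UTF-8.
bytes <- readBin(tmp, what = "raw", n = file.size(tmp))
txt <- iconv(rawToChar(bytes), from = charset, to = "UTF-8")
unlink(tmp)
```

Converting with iconv keeps the example independent of the session locale; for line-oriented reading, passing the charset to the encoding argument of file() should work as well.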

Update 2020

Nowadays one can achieve the same result with the R package uchardet (which does not use Java).
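
For instance, here is a small sketch, assuming the uchardet package is installed from CRAN and that it exports detect_file_enc, as far as I know. The temporary file and its content are invented for the example.

```r
# Assumes the CRAN package 'uchardet' is installed:
# install.packages("uchardet")
library(uchardet)

# Hypothetical sample file containing UTF-8 encoded text
# (escapes are used so that the script itself is plain ASCII).
tmp <- tempfile(fileext = ".txt")
writeLines("St\u00e9phane \u00e9crit en fran\u00e7ais, d\u00e9j\u00e0 vu",
           tmp, useBytes = TRUE)

# detect_file_enc takes file paths and returns the guessed encoding names
enc <- detect_file_enc(tmp)
print(enc)
unlink(tmp)
```

There is no JVM to start and no jar file to download, so this is much lighter than the rJava approach.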