Remove password protection from Excel sheets using R

06May18

Most data scientists wished that all data lived neatly managed in some DB. However, in reality, Excel files are ubiquitous and often a common way to disseminate results or data within many companies. Every now and then I found myself in the situation where I wanted to protect Excel sheets against users accidentally changing them. A few months later, however, I found that I sometimes had forgotten the password I used. The “good” thing is that protecting Excel sheets by password is far from safe and access can be recovered quite easily. The following works for .xlsx files only (tested with Excel 2016 files), not the older .xls files.

Before implementing the steps in R, I will outline how to remove the password protection “by hand”. The R way is simply the automation of these steps. The first thing one needs to understand is that a .xlsx file is just a collection of folders and files in a zip container. If you unzip a .xlsx file (e.g. using 7-Zip) you get the following folder structure (sorry, German UI):

In the folder ./xl/worksheets we find one XML file for each Excel sheet. The sheet’s password protection is encoded directly in the sheet. While there used to be the plain password text in the XML in former versions, now, we find the hashed password (see part marked in yellow below). In order to get rid of the password protection, we simply can remove the whole sheetProtection node from the XML. We can do that in any text editor and save the file.

As the last step, we need to recreate the .xlsx file by creating a zip folder that contains our modified XML file (German UI again).

Finally, we change the file extension from .zip back to .xlsx and voilá, we get an Excel file without password protected sheets. Programming the steps outlined above in R is quite straightforward. The steps are commented in the GitHub gist below.


# remove sheet protection in Excel
# Sample file: https://www.dropbox.com/s/4ul0kowrscyr8cz/excel_protected.xlsx?dl=0
library(stringr)
library(zip)
# file with protected sheets
file <- "data/excel_protected.xlsx"
# file name and path after removing protection
file_unlocked <- str_replace(basename(file), ".xlsx$", "_unlocked.xlsx")
file_unlocked_path <- file.path(getwd(), "data", file_unlocked)
# create temporary directory in project folder
# so we see what is going on
temp_dir <- "_tmp"
# remove and recreate _tmp folder in case it already exists
unlink(temp_dir, recursive = T)
dir.create(temp_dir)
# unzip Excel file into temp folder
unzip(file, exdir = temp_dir)
# get full path to XML files for all worksheets
worksheet_paths <- list.files(
paste0(temp_dir, "/xl/worksheets"),
full.name = TRUE,
pattern = ".xml")
# remove the XML node which contains the sheet protection
# We might of course use e.g. xml2 to parse the XML file, but this simple approach will suffice here
for (ws in worksheet_paths) {
x <- readLines(ws, encoding = "windows1")
# the "sheetProtection" node contains the hashed password "<sheetProtection SOME INFO />"
# we simply remove the whole node
out <- str_replace(x, "<sheetProtection.*?/>", "")
writeLines(out, ws)
}
# create a new zip, i.e. Excel file, containing the modified XML files
old_wd <- setwd(temp_dir)
f <- list.files(recursive = T, full.names = F, all.files = T, no..=T)
# as the Excel file is a zip file, we can directly replace the .zip extension by .xlsx
zip::zip(file_unlocked_path, files = f) # utils::zip does not work for some reason
setwd(old_wd)
# clean up and remove temporary directory
unlink(temp_dir, recursive = T)

 
 
 
 
 
 
 
 
 
 
 
 



8 Responses to “Remove password protection from Excel sheets using R”

  1. Hi! Cool article. Is there any similar way to remove password from the entire workbook as well ?
    Awaiting your reply
    Regards

  2. 2 markheckmann

    Hi Yashodhan,
    frankly, I do not know. I have not had the problem yet :)
    Bests
    Mark

  3. Thank you, very interesting. It works, I get a warning though: “Warning message:
    In readLines(ws, encoding = “windows1″) :
    incomplete final line found on ‘_tmp/xl/worksheets/sheet1.xml'”

    Any ideas?

    best
    h

  4. 4 markheckmann

    Hi Holger,
    it is a common warning when using readLines. The reason is that R does not find an end-of-line (EOL) character in the file. R warns you that the file may thus be incomplete. That’s all and not the case here. You can read more about this common warning in many places, e.g. here https://stackoverflow.com/questions/5990654/incomplete-final-line-warning-when-trying-to-read-a-csv-file-into-r or https://stackoverflow.com/questions/47357125/incomplete-final-line-found-error-when-reading-url-using-readlines-in-r?rq=1
    Bests
    Mark

  5. after I unzipped the xlsx file I found these folder / files “[6]DataSpaces”, “EncryptedPackage”, “EncryptionInfo”. No xml file to be found. I guess Microsoft improved their encryption after this post.

  6. 6 markheckmann

    Hi Paul.
    Haha. That is because you have an a password protected workbook, not sheets. With a protected workbook, I don’t think there is no way easy way to remove a password. It is solid encryption, I suppose.
    Bests
    Mark


  1. 1 Remove password protection from Excel sheets using R | R-bloggers
  2. 2 Data Import and Export | Pearltrees

Leave a comment