Remove password protection from Excel sheets using R
Most data scientists wished that all data lived neatly managed in some DB. However, in reality, Excel files are ubiquitous and often a common way to disseminate results or data within many companies. Every now and then I found myself in the situation where I wanted to protect Excel sheets against users accidentally changing them. A few months later, however, I found that I sometimes had forgotten the password I used. The “good” thing is that protecting Excel sheets by password is far from safe and access can be recovered quite easily. The following works for .xlsx files only (tested with Excel 2016 files), not the older .xls files.
Before implementing the steps in R, I will outline how to remove the password protection “by hand”. The R way is simply the automation of these steps. The first thing one needs to understand is that a .xlsx file is just a collection of folders and files in a zip container. If you unzip a .xlsx file (e.g. using 7-Zip) you get the following folder structure (sorry, German UI):
In the folder ./xl/worksheets we find one XML file for each Excel sheet. The sheet’s password protection is encoded directly in the sheet. While there used to be the plain password text in the XML in former versions, now, we find the hashed password (see part marked in yellow below). In order to get rid of the password protection, we simply can remove the whole sheetProtection node from the XML. We can do that in any text editor and save the file.
As the last step, we need to recreate the .xlsx file by creating a zip folder that contains our modified XML file (German UI again).
Finally, we change the file extension from .zip back to .xlsx and voilá, we get an Excel file without password protected sheets. Programming the steps outlined above in R is quite straightforward. The steps are commented in the GitHub gist below.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# remove sheet protection in Excel | |
# Sample file: https://www.dropbox.com/s/4ul0kowrscyr8cz/excel_protected.xlsx?dl=0 | |
library(stringr) | |
library(zip) | |
# file with protected sheets | |
file <- "data/excel_protected.xlsx" | |
# file name and path after removing protection | |
file_unlocked <- str_replace(basename(file), ".xlsx$", "_unlocked.xlsx") | |
file_unlocked_path <- file.path(getwd(), "data", file_unlocked) | |
# create temporary directory in project folder | |
# so we see what is going on | |
temp_dir <- "_tmp" | |
# remove and recreate _tmp folder in case it already exists | |
unlink(temp_dir, recursive = T) | |
dir.create(temp_dir) | |
# unzip Excel file into temp folder | |
unzip(file, exdir = temp_dir) | |
# get full path to XML files for all worksheets | |
worksheet_paths <- list.files( | |
paste0(temp_dir, "/xl/worksheets"), | |
full.name = TRUE, | |
pattern = ".xml") | |
# remove the XML node which contains the sheet protection | |
# We might of course use e.g. xml2 to parse the XML file, but this simple approach will suffice here | |
for (ws in worksheet_paths) { | |
x <- readLines(ws, encoding = "windows1") | |
# the "sheetProtection" node contains the hashed password "<sheetProtection SOME INFO />" | |
# we simply remove the whole node | |
out <- str_replace(x, "<sheetProtection.*?/>", "") | |
writeLines(out, ws) | |
} | |
# create a new zip, i.e. Excel file, containing the modified XML files | |
old_wd <- setwd(temp_dir) | |
f <- list.files(recursive = T, full.names = F, all.files = T, no..=T) | |
# as the Excel file is a zip file, we can directly replace the .zip extension by .xlsx | |
zip::zip(file_unlocked_path, files = f) # utils::zip does not work for some reason | |
setwd(old_wd) | |
# clean up and remove temporary directory | |
unlink(temp_dir, recursive = T) |
Filed under: R / R-Code | 8 Comments
Hi! Cool article. Is there any similar way to remove password from the entire workbook as well ?
Awaiting your reply
Regards
Hi Yashodhan,
frankly, I do not know. I have not had the problem yet :)
Bests
Mark
Thank you, very interesting. It works, I get a warning though: “Warning message:
In readLines(ws, encoding = “windows1″) :
incomplete final line found on ‘_tmp/xl/worksheets/sheet1.xml'”
Any ideas?
best
h
Hi Holger,
it is a common warning when using readLines. The reason is that R does not find an end-of-line (EOL) character in the file. R warns you that the file may thus be incomplete. That’s all and not the case here. You can read more about this common warning in many places, e.g. here https://stackoverflow.com/questions/5990654/incomplete-final-line-warning-when-trying-to-read-a-csv-file-into-r or https://stackoverflow.com/questions/47357125/incomplete-final-line-found-error-when-reading-url-using-readlines-in-r?rq=1
Bests
Mark
after I unzipped the xlsx file I found these folder / files “[6]DataSpaces”, “EncryptedPackage”, “EncryptionInfo”. No xml file to be found. I guess Microsoft improved their encryption after this post.
Hi Paul.
Haha. That is because you have an a password protected workbook, not sheets. With a protected workbook, I don’t think there is no way easy way to remove a password. It is solid encryption, I suppose.
Bests
Mark