or "Faster and More Flexible Enterprise-Scale Approach to Sensitive Content Identification on SMB Shares"
Date of the Research: 06/25/2023
Author: Danil Karandin
Abstract
Sensitive files and information scattered across seas of corporate SMB shares attract the attention of threat actors once they gain initial access. Confidential data and credentials found in SMB shares can be a pivotal point during an attack. Organizations want to conduct proactive self-assessments to identify and mitigate such issues. However, this is difficult to accomplish at an enterprise scale, where seas become oceans in terms of the number of systems, SMB share sizes, and file type diversity. This research explores the use of open-source software (ripgrep and Man-Spider) to achieve high speed and reliability during sensitive content discovery across hundreds of SMB-connected systems. Various parameters, such as execution time, CPU load, and identification accuracy, were tested across multiple hardware platforms, the two most prominent virtualization products, and the five most used operating systems. These five systems hosted slightly less than 20 GB of SMB share files combined. The results suggested a multistage implementation, which achieved up to 11 times faster speeds, almost doubled the accuracy of sensitive file identification, and avoided system crashes. The new multistage approach allows large organizations to identify sensitive content much more quickly while leaving room to accommodate each organization's search and setup specifics. Moreover, there is still potential to increase speed and accuracy further through infrastructure and technological adjustments.
1. Introduction
Open SMB shares are a valuable source of sensitive information such as insecurely stored credentials, keys, certificates, hostnames, usernames, etc. All these data types are of interest to malicious actors and pose a risk to organizations.
1.1. Relevant MITRE SMB Share Techniques used by Threat Actors
As a result of the factors above, many threat actors use the following related MITRE ATT&CK techniques across numerous attacks:
T1135 Network Share Discovery (MITRE, 2021)
T1083 File and Directory Discovery (MITRE, 2022)
T1039 Data from Network Shared Drive (MITRE, 2022)
T1552.001 Unsecured Credentials: Credentials In Files (MITRE, 2022)
T1552.004 Unsecured Credentials: Private Keys (MITRE, 2022)
1.2. Attacks In the Wild: Uber Breach 2022
There are multiple examples of attackers leveraging sensitive information found on SMB shares for their own profit. However, one of the most recent and prominent cases was the Uber breach in 2022. The presence of high-privilege, hard-coded credentials in a misconfigured network share was a critical point during the attack. The credentials were embedded in a PowerShell script. With those credentials’ help, attackers were granted administrative access to a privileged access management solution, which provided them with more high-risk access options (CyberArk Blog Team, 2022).
1.3. Purpose of the research paper
Current solutions for searching SMB share files are slow or unreliable at the scale of thousands of shares with an average size of 100GB+. In addition, several parameters must be specified for most tools, such as file types, directory search depth, or file-size limits that may be unknown due to the enterprise's size and complexity.
Other solutions may consume excessive resources on a testing endpoint or download massive amounts of data to a testing system. On top of that, internal regulations and standards often make it difficult to install those solutions on the systems available inside an organization. Moreover, few tools support flexible searching of both file names and file content.
Joshua Wright has done research called "Searching SMB Share Files." It displays and compares various free and commercial tools. Unfortunately, he stated he could not find anything that met his goals (Wright, 2022).
After a small amount of research and conversations with several professionals in the penetration testing field, it was concluded that Man-Spider is currently the best tool for searching sensitive content on SMB shares. However, while using Man-Spider, it was observed that processing speed and resource consumption could be problematic depending on the scale and scope. So, before conducting this research, quick and straightforward Man-Spider tests were executed as a proof of concept on a 483 KB JMX file with 11 occurrences of the word "password" in a share's root directory. The results showed that, despite Man-Spider's setting to exclude files larger than 10 MB and with the same number of CPU threads, it was significantly slower and consumed remarkably more CPU processing power than ripgrep, another open-source search tool that is not SMB-specialized. In addition, later in the research, I had an opportunity to talk to Andrew Gallant, the author of ripgrep, and to consider his suggestions for increasing performance while developing the new approach.
Thus, the solution was not to replace or create the best tool for everything, but to figure out how to utilize these great tools in symbiosis.
1.4. Thesis
A dedicated system with basic open-source tooling that can be installed on various system types will achieve a fast, reliable, and flexible solution for scanning the content of SMB share files at an enterprise scale.
2. Research Method
To prove the thesis, test searches based on content, filename, and extensions were used for performance and reliability analysis. They were conducted from four different testing systems located on three different hardware sets against five virtualized endpoints on the same hardware set with enabled SMB shares. Potential solutions were tested against the same datasets with different complexity levels to determine the best strategy. Those complexity levels include search words, tested systems, file share sizes, file types, performance settings, and capabilities.
Solutions were tested in both a corporate environment and a simulated smaller-scale testing environment; the latter makes it possible to share more detailed results.
2.1. Testing Data and Systems
The testing environment comprises a dataset, five virtualized share systems, three hardware instances, and four virtualized scanning machines. All systems are in the same subnetwork.
2.1.1. Dataset
The dataset has 46 files across six folders totaling 3.81 GB in size. All the files combined contain 2,685 occurrences of the word “password” in a single dataset. The Mockaroo_MOCK_DATA.csv file was generated on the https://www.mockaroo.com website with the parameters shown in Appendix Figure 1.1. The web server access log file access.log (3.3 GB) was downloaded from https://www.kaggle.com/datasets/eliasdabbas/web-server-access-logs?resource=download and contains 2,670 occurrences of the word “password”.
Files auxiliary-material.tgz (2.0 MB), cz.muni.csirt.IPFlowEntry.tgz (41.9 MB), cz.muni.csirt.SyslogEntry.tgz (157.4 MB), cz.muni.csirt.WinlogEntry.tgz (73.1 MB) totaling 274.5 MB were taken from https://zenodo.org/record/3746129#.ZEcB89LMIUE and placed into DefenseExercise-Compressed folder (261.7 MB) in a compressed form and into DefenseExercise-Extracted folder (305.2 MB) in an uncompressed form.
Basic MockDOCfile.doc (29 KB) and MockDOCXfile.docx (14.3 KB) files were created in Microsoft Word with the line “Password:Super$01” placed in their content. MockDataXLS.xls (26 KB) and MockDataXLSX.xlsx (10.3) files were created in Microsoft Excel from a CSV file generated at https://generatedata.com/generator with the parameters displayed in Appendix Figure 1.2; the word “password” was later placed in the third column header. Files mockPowerPointPasswordTyped.ppt (83.5 KB) and mockPowerPointPasswordTyped.pptx (32.4 KB) were created in Microsoft PowerPoint with one slide carrying a payload similar to that of the Microsoft Word files, presented as shown in Figure 1.
Picture files mockPicturePasswordTyped.png (6.8 KB), mockPicturePasswordTyped.jpg (20.6 KB), mockPicturePasswordWritten.png (18.7 KB), and mockPicturePasswordWritten.jpg (40.5 KB) were all created with Microsoft Paint. The files mockPicturePasswordTyped.png and mockPicturePasswordTyped.jpg have the typed content “Password:5Yu*G1n@p%”, as shown in Figure 2. The files mockPicturePasswordWritten.png and mockPicturePasswordWritten.jpg have the handwritten content “Password:Super$01”, which can be seen in Figure 3.
File mockPictures.zip (66.5 KB) comprises all four pictures. Upon creation of PowerPoint files, the slide was also printed into mockPDFPasswordTyped.pdf (23.6 KB) file. Microsoft Outlook message file Email.msg (468 KB) is a saved Microsoft advertisement email sent to the author’s personal email.
SSH key files MockPrivateSSHKey-Linux (2.5 KB) and MockPrivateSSHKey-Linux.pub (563 B) were created with the standard ssh-keygen command. MockPrivateSSHKey-Putty.ppk (1.4 KB) and MockPublicSSHKey-Putty (477 B) were created with PuTTY on Windows.
2.1.2. Hardware
Available physical research hardware consisted of three sets with the following specifications:
· Hardware Set 1 - Intel Core i7-7700K @ 4.20 GHz CPU, 32 GB RAM, SSD Storage, Linux Debian 10 Host OS.
· Hardware Set 2 - Intel Core i5-1245U @1.60 GHz CPU, 32 GB RAM, SSD Storage, Windows 10 Enterprise Host OS.
· Hardware Set 3 - Intel Core i5-7200U @2.50 GHz CPU, 12 GB RAM, SSD Storage, Windows 10 Home Host OS.
2.1.3. Software
The software used during the research consisted of VMware Workstation 15.5 PRO and 16 PRO, VirtualBox 7.0, official Kali Linux 2022.4 VMware and VirtualBox images, CentOS 7, Ubuntu 20.04.6, Windows 10 Enterprise, Windows Server 2019 and 2022 Standard Evaluation images, Advanced Port Scanner, smbmap, Advanced IP Scanner, nmap, Man-Spider (Git version e10bb6a from Feb 18, 2022), and ripgrep 13.0.0. Links are provided in section six of the Appendix.
2.1.4. Testing Systems
· Testing System 1 had Kali Linux 2022.4 running on Hardware Set 1 with VMware Workstation 15.5 PRO. Originally, it had 4 Processor Cores (2 Processors, 2 Cores each) and 4 GB RAM. However, when Man-Spider system crashes were observed, the system was upgraded to 8 Processor Cores (4 Processors, 2 Cores each) and 8 GB RAM.
· Testing System 2 had Kali Linux 2022.4 on Hardware Set 2 with VirtualBox 7.0, 10 CPUs with 100% Execution Cap, and 16 GB RAM.
· Testing System 3 had Kali Linux 2022.4 on Hardware Set 3 with VMware Workstation 16 PRO, 4 Processor Cores (2 Processors, 2 Cores each), and 8 GB RAM.
· Testing System 4 had Kali Linux 2022.4 on Hardware Set 2 with VMware Workstation 16 PRO, 8 Processor Cores (4 Processors, 2 Cores each), and 16 GB RAM.
2.1.5. Share Systems
All the following share systems were located in the testing environment and hosted on Hardware Set 1 with VMware Workstation 15.5 PRO.
· Share System 1 had CentOS 7 (x86_64-Everything-2009.iso) with 1 Processor Core, 512 MB RAM, shared directory path: /srv/samba/public/, and 192.168.2.54 IP.
· Share System 2 had Ubuntu 20.04.6 with 1 Processor Core, 2 GB RAM, shared directory path: /srv/samba/public/DataSet, and 192.168.2.195 IP.
· Share System 3 had Windows Server 2019 (Standard Evaluation 17763.rs5_release.180914-1434) with 2 Processor Cores, 2 GB RAM, shared directory path: C:\DataSetShare, and 192.168.2.149 IP.
· Share System 4 had Windows Server 2022 (Standard Evaluation 21H2 20348.587) with 2 Processor Cores, 2 GB RAM, shared directory path: C:\DataSetShare, and 192.168.2.138 IP.
· Share System 5 had Windows 10 (Enterprise Evaluation 17763.rs5_release.180914-1434) with 2 Processor Cores, 2 GB RAM, shared directory path: C:\DataSetShare, and 192.168.2.191 IP.
2.2. Scoping in Enterprise Environments
Proper scoping is an essential part of any engagement. Since this research is focused on enterprise environments, the scope must be well thought out before starting the discovery and assessment steps. The critical piece here is to conduct the assessment effectively and efficiently. To do so, the enormous scale of enterprise systems has to be divided by some parameters into manageable segments. Several strategies can be used to divide the testing scope: by business units, system types, sensitive data location, geographic regions, and compliance requirements. In addition, testing account privileges (no account, regular user account, or admin-level account) need to be chosen, as they may dictate the assessment depth and the exposure of risk findings. The selection of one or a combination of strategies and account privileges should be based on the organization's specific needs and risks.
2.3. Testing Process
The testing process for identifying sensitive files on SMB shares consists of two parts. The first part is Discovery, as we need to find the systems with exposed SMB shares. The second part is Assessment, during which we try to access and search exposed SMB shares for sensitive information.
2.3.1. Discovery
There are numerous ways to get a list of systems that have SMB shares enabled. The most popular approaches are network scanning, Active Directory discovery, PowerShell scripts, and built-in and commercial tools. There is no single best approach, so several methods may need to be combined, depending on the organization’s specifics, to obtain the desired result.
As for network scanning, it was observed that nmap’s smb-enum-shares.nse script found only one system with shares, CentOS (Appendix 2.1, 2.2). The most probable reason is that the NetShareEnumAll MSRPC function works anonymously only against Windows 2000 and requires a user-level account on any other Windows version (Bowes). However, Ubuntu (Share System 2) was not found either. In contrast, the smbmap tool with a fake username and password worked perfectly, displaying all available shares with their access permissions, as seen in Figure 4.
Other popular tools, such as Advanced IP Scanner and Advanced Port Scanner, can be used to find systems and shares. Unfortunately, these tools do not indicate available access permissions as smbmap does.
2.3.2. Assessment
The next step after discovery is assessment. Usually, there is a lot of data and many systems. Thus, search parameters must be considered carefully, as they can depend significantly on the objective. During an assessment, two things are pursued: sensitive directory or file names (or both), found by inspecting full file paths, and sensitive information, found by searching file content. Full path searches can lead to discovering other interesting files that were not considered before. However, it is important to note that directory or file name matches do not always lead to the desired targets. Full content parsing of binary and text-based files is often needed because, for example, renamed credential materials such as SSH keys can be located in unusual directories and have modified, non-standard filenames. General content search terms include words like password, key, secret, access, or specific values related to financial, medical, proprietary, authentication, or other types of data.
To save time and resources, it was decided to avoid downloading any files and instead connect (mount) all discovered shares to a testing system. Simple proof-of-concept bash scripts that convert a normalized Advanced Port Scanner CSV output file and use it to connect all found shares to a system can be found in Appendix 3.1 and 3.2.
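As an illustration of the mounting idea only (the actual scripts are the bash ones in Appendix 3.1 and 3.2), the following Python sketch turns a normalized scanner CSV into read-only mount commands. The column names ip and share, the guest option, and the sample rows are assumptions made for this example; the mount point naming mirrors the /mnt/smb/--192.168.2.191-DataSetShare/ path used later in this paper.

```python
import csv
import io
import shlex


def mount_commands(csv_text, mount_root="/mnt/smb"):
    """Return one read-only `mount -t cifs` command per (ip, share) row.

    Assumes a header row with the hypothetical columns `ip` and `share`.
    """
    commands = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        ip, share = row["ip"].strip(), row["share"].strip()
        # Mount point naming follows the paths used in this paper,
        # e.g. /mnt/smb/--192.168.2.191-DataSetShare
        mount_point = f"{mount_root}/--{ip}-{share}"
        commands.append(
            "mount -t cifs "
            + shlex.quote(f"//{ip}/{share}")
            + " "
            + shlex.quote(mount_point)
            + " -o ro,guest"  # read-only guest access is an assumption
        )
    return commands


# Illustrative rows; the share names are assumed for the example.
sample = "ip,share\n192.168.2.191,DataSetShare\n192.168.2.149,DataSetShare\n"
for command in mount_commands(sample):
    print(command)
```

The commands are only printed, not executed. Each mount point directory would have to exist before mounting, and credentials would replace the guest option where a test account is in scope.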
3. Findings and Discussion
The focus of this research was on the performance and reliability of ripgrep and Man-Spider tools. Test cases included content, filenames, extensions, and directory searches to find the best and worst parts of each instrument. The installation component of these tools was considered as well.
Numerous tests indicated that both Man-Spider and ripgrep had many positive aspects in different parts of the test cases, which allowed the design of a better and more flexible approach for identifying sensitive content across hundreds of SMB-connected systems with high speed and reliability. All aspects are discussed in detail below.
3.1. Installation
While Man-Spider installation via pipx on popular Linux security distributions is easy, it can be difficult to obtain and set up the libraries Man-Spider needs to function properly on other assets, depending on the organization's controls and policies. It was observed that a default Man-Spider installation displays errors related to a missing textract component (pdf2text.py). Man-Spider reported that pdf2text.py was not installed on the system. This was considered as a potential reason for the virtual machine crashes discussed below, but it turned out not to be the cause.
Therefore, the line “export PATH="/home/kali/.local/pipx/venvs/man-spider/bin:$PATH"” was added to ~/.bashrc to resolve these errors. The same behavior was observed across all four testing systems.
A simple “sudo apt-get install ripgrep” or “sudo yum install ripgrep”, depending on the distribution, installs ripgrep without any problems. Installers and precompiled binaries are also available for Linux, Windows, and macOS (Gallant, 2023).
3.2. Performance and Reliability Analysis
During performance and reliability analysis, both Man-Spider and ripgrep were executed multiple times with various configurations to find the strengths and weaknesses relevant to sensitive content identification. The Linux time command was used to measure performance by comparing CPU load and how long a given command takes to execute.
3.2.1. Man-Spider Content Search, VM crashing, and Performance Sag
Man-Spider “password” word content scanning of the Share System 5 (Windows 10) was launched from Testing System 1 with original power specifications of VMware Kali Linux 2022.4 with 4 GB RAM and 4 Processor Cores (2 Processors with 2 Cores each). The following command was used:
“time manspider /mnt/smb/--192.168.2.191-DataSetShare/ -c password”
The file share size of Share System 5 is 3.81 GB. It was observed that after Man-Spider starts finding files, it crashes the virtual machine. This was tested multiple times, and the behavior was observed repeatedly. The root cause of the system crash was not investigated further, so the reasons remain inconclusive. Meanwhile, ripgrep performed basic scanning with no problems (Figure 14). It was decided to upgrade the Testing System 1 virtual machine to 8 GB of RAM and 8 Processor Cores (4 Processors, 2 Cores each). Unfortunately, this upgrade did not prevent the crashes.
Thus, the same Man-Spider test on the same data was executed from Testing System 2 (Hardware Set 2, VirtualBox Kali Linux 2022.4, 16 GB RAM, and 10 CPUs with 100% Execution Cap). It completed without any crashes, with the following results: real 150.92s, user 61.03s, sys 116.99s, cpu 117%.
A quick note regarding CPU percentages higher than 100%. The time command calculates the percentage of the CPU that the job received with the following formula: (%U + %S) / %E, where %U is the total number of CPU seconds that the process spent in user mode, %S is the number spent in kernel mode, and %E is the elapsed real time (Kerrisk, 2019). As can be seen, the real time is less than the user and sys times combined. This is because the program is executed by multiple threads, and the time command reports the summed CPU time of all threads working on it.
MockDataXLS.xls, MockDOCfile.doc, MockDataXLSX.xlsx, mockPDFPasswordTyped.pdf, MockDOCXfile.docx, mockPowerPointPasswordTyped.pptx are the six files that were returned with one match of the word “password” in each. In access.log, Man-Spider found all 2,670 occurrences. Files larger than 10 MB were not analyzed during this default scan. In addition, it was discovered that analysis of files with the ppt extension is not supported. Image files were also not discovered during this search.
By default, Man-Spider limits the number of threads to five, the maximum scanning depth to ten levels, and the file size to 10 MB. These settings were adjusted to make the run more comparable to the default settings of ripgrep, even though Man-Spider was already considering the 3.3 GB access.log file during the assessment. The command was: “time manspider -s 10GB -t 10 -m 100 /mnt/smb/--192.168.2.191-DataSetShare/ -c password”. The new run found the same files and had the following performance characteristics: real 280.73s, user 99.41s, sys 232.39s, cpu 118% (Appendix 4.2).
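The formula can be checked against the reported numbers with a short worked example. This is a minimal sketch; truncating to a whole percent is an assumption that happens to match the values reported in this paper.

```python
def cpu_percent(user_s, sys_s, real_s):
    """CPU share as computed by `time`: (user + sys) / real, in whole percent.

    Truncation to an integer is an assumption that matches the figures
    reported in this paper.
    """
    return int((user_s + sys_s) / real_s * 100)


# Testing System 2 Man-Spider run: user 61.03s, sys 116.99s, real 150.92s.
# CPU time (178.02s) exceeds wall-clock time (150.92s) because several
# threads accumulate CPU seconds in parallel.
print(cpu_percent(61.03, 116.99, 150.92))  # 117
```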
Because of the time parameters above, a potential performance sag was suspected in Testing System 2. Moreover, it was observed later, during ripgrep testing, as well (Figure 15, Figure 20, and Figure 24). To compare the results with a testing system of similar performance, the analysis was moved to Testing System 4, which had the same hardware as Testing System 2 but different virtualization software. The same command executed faster with fewer threads (Appendix 4.3), indicating that the performance problem could be related to the system or software. The command was: “time manspider -s 10GB -t 8 -m 100 /mnt/smb/--192.168.2.191-DataSetShare/ -c password”.
Even though Man-Spider states that all matching files will be downloaded to the /home/kali/.manspider/loot folder, which could have impacted the performance results, it was not doing so, primarily because the mounted shares appeared as local files to Man-Spider.
3.2.2. Man-Spider Additional Content Search for Images
An additional search with the command “time manspider -s 10GB -t 8 -m 1000 /mnt/smb -e jpg png -c password -v” had to be executed to search images. Man-Spider found all jpg and png files containing the typed word “password”, but not the handwritten ones. In addition, this process appeared to be very resource-intensive (Figure 13).
3.2.3. Ripgrep Content Search and System Performance Sag Validation
A basic ripgrep search for the word “password” was launched from Testing System 1 (VMware Kali 2022.4, 8 GB RAM, 8 Processor Cores (4 Processors, 2 Cores each)) and showed much better performance values (real 41.71s, user 3.70s, sys 8.14s, cpu 28%) on the system where Man-Spider had previously crashed the VM. Ripgrep found the same number of “password” occurrences in the access.log file as Man-Spider (2,670), but nothing in other files.
However, when the same ripgrep testing command was executed from Testing System 2 (VirtualBox Kali 2022.4 with 16 GB RAM and 10 CPUs with 100% Execution Cap), the results were disappointing (real 335.80s, user 15.64s, sys 463.02s, cpu 142%), much as with Man-Spider before (Figure 11).
So, the command was re-executed later to see if the performance sag would repeat. The new results (real 45.84s, user 9.58s, sys 60.85s, cpu 153%) were much closer to the previous results from Testing System 1. Nevertheless, a high CPU utilization percentage remained.
Thus, to test it more, the same ripgrep command was run on Testing System 3 against the same Share System 5, resulting in real 60.64s, user 1.76s, sys 27.12s, cpu 47% parameters. This was to confirm that Testing System 2’s setup specifics were influencing the performance as Testing System 3 had lower specifications than Testing System 2 or Testing System 4.
3.2.4. Ripgrep Additional Content Search for Binary Files
Since ripgrep found nothing in other files by default, the next step was to explore binary file content search with ripgrep and its performance. The reason is that, by default, ripgrep operates in a mode that attempts to exclude binary files from a search completely (Gallant, 2022). To consider binary files, text mode was selected with the -a and --no-ignore flags. They make ripgrep parse all binary files as if they were text while disregarding any ignore-rule files it finds (Gallant, 2022b). The search was executed on Testing System 1 and took slightly longer (real 47.24s, user 3.40s, sys 8.66s, cpu 25%) than the regular content search (Figure 14).
The speed comes at a readability cost, as binary data emits control characters to the executing terminal (Gallant, 2022b), and further grepping through the data is not readily available. During the text mode search ripgrep found the following six files: mockPowerPointPasswordTyped.ppt, mockPictures.zip, mockPDFPasswordTyped.pdf, MockDOCfile.doc, MockDataXLS.xls, access.log. Moreover, in this text mode, zipped filenames were still parsable.
When the same test was executed on Testing System 2, it displayed a performance sag again, as it did during the basic file content search in Figure 15.
In addition to the text mode, ripgrep has a binary search mode. After trying it, it was observed to make results "greppable" while working with similar performance (Figure 22). The only caveat is that it does not count all match occurrences in binary files (Gallant, 2022).
To further improve the completeness of ripgrep searches, the -uuu flag was used. This flag enables an unrestricted mode that disables ignore-file handling (such as .gitignore), searches hidden files and directories, and searches binary files as if the --binary flag were used. To follow symlinks, the -L flag was added without a significant impact on performance (Gallant, 2022b).
Execution on Testing System 2 displayed performance problems (Figure 24) similar to those seen at the very beginning of the testing (Figure 15). A second attempt was made to see whether the previous results were caused by a random issue. However, the observed results did not confirm it.
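The search modes discussed in this section can be summarized as a small command builder. This is a sketch only: the function and mode names are invented for illustration, while the flags themselves (-a, --no-ignore, --binary, -uuu, -L, -i) are the ripgrep flags used in this paper.

```python
def rg_content_command(pattern, root, mode="unrestricted", follow_symlinks=True):
    """Build an argument list for a ripgrep content search.

    The mode names are hypothetical labels for the flag sets discussed above.
    """
    flags_by_mode = {
        "default": [],                  # binary files are skipped
        "text": ["-a", "--no-ignore"],  # treat binary files as text
        "binary": ["--binary"],         # search binary files, greppable output
        "unrestricted": ["-uuu"],       # no ignore rules, hidden files, binary
    }
    argv = ["rg", *flags_by_mode[mode]]
    if follow_symlinks:
        argv.append("-L")
    argv += ["-i", pattern, root]
    return argv


print(rg_content_command("password", "/mnt/smb"))
```

The resulting list could be handed to subprocess.run; it is only printed here so that the sketch has no side effects.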
As it was mentioned earlier, to find out if that performance sag was virtual machine software dependent or system dependent, the same test was executed on Testing System 3 and then later on Testing System 4 in order to replicate Testing System 2 but with different virtualization software.
The same Hardware Set 2 and Kali Linux 2022.4 were used, but this time with VMware Workstation 16 PRO, 16 GB RAM, and 8 Processor Cores (4 Processors, 2 Cores each), which is 2 CPU cores fewer than Testing System 2 has. The results (real 42.54s, user 5.28s, sys 8.39s, cpu 32%) were vastly different from what was observed on Testing System 2 in Figure 24.
3.2.5. Man-Spider and Ripgrep Content Search Performance and Reliability
A fresh round of tests for both Man-Spider and ripgrep was performed from Testing System 4. For each Share System, Man-Spider was executed first, and ripgrep was executed upon its completion. All Share Systems had the same 3.81 GB dataset, bringing the total to 19.05 GB of data for all five shares combined. During the last test, where all shares were scanned at once, Man-Spider crashed Testing System 4.
System | Man-Spider | Ripgrep |
Share System 1: CentOS 7 | Found: 7 files; real 57.05s, user 55.28s, sys 21.31s, cpu 134% | Found: 6 files; real 33.53s, user 2.18s, sys 4.27s, cpu 19% |
Share System 2: Ubuntu 20.04.6 | Found: 7 files; real 57.89s, user 56.05s, sys 20.75s, cpu 132% | Found: 6 files; real 35.88s, user 2.46s, sys 4.58s, cpu 19% |
Share System 3: Windows Server 2019 | Found: 7 files; real 66.84s, user 61.28s, sys 22.80s, cpu 125% | Found: 6 files; real 36.26s, user 2.42s, sys 4.64s, cpu 19% |
Share System 4: Windows Server 2022 | Found: 7 files; real 54.85s, user 55.45s, sys 19.59s, cpu 136% | Found: 6 files; real 36.81s, user 2.20s, sys 4.51s, cpu 18% |
Share System 5: Windows 10 | Found: 7 files; real 54.93s, user 54.07s, sys 19.40s, cpu 133% | Found: 6 files; real 37.34s, user 2.43s, sys 4.72s, cpu 19% |
All Share Systems at once | Crashed Testing System 4 | Found: 30 files; real 172.42s, user 10.39s, sys 16.35s, cpu 15% |
Figure 27. Man-Spider and Ripgrep performance results table
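The real-time values in Figure 27 can be turned into per-share speed ratios with a short calculation. The sketch below only restates the table data; note that the two tools did not find the same set of files, so the ratio compares elapsed time rather than equivalent work.

```python
# (Man-Spider real seconds, ripgrep real seconds) per share, from Figure 27.
real_times = {
    "CentOS 7": (57.05, 33.53),
    "Ubuntu 20.04.6": (57.89, 35.88),
    "Windows Server 2019": (66.84, 36.26),
    "Windows Server 2022": (54.85, 36.81),
    "Windows 10": (54.93, 37.34),
}


def speedup(manspider_s, ripgrep_s):
    """How many times longer the Man-Spider run took than the ripgrep run."""
    return manspider_s / ripgrep_s


for share, (ms, rg) in real_times.items():
    print(f"{share}: Man-Spider took {speedup(ms, rg):.2f}x as long as ripgrep")
```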
3.2.6. Man-Spider and Ripgrep Single Filename Search
Initially, when a single filename Man-Spider search was executed against all five Share Systems, the tool did not report any files. Nothing was in Man-Spider’s loot directory either. The initial execution time of almost 21 seconds is an anomaly in this case, as it was not observed at any time after. Under one and a half seconds is the average, typical time.
It was suspected that the problem was in the syntax, and verbose execution mode was added. After that, Man-Spider started showing all 35 found files but did not explicitly mark them as matches. To see them, Man-Spider has to be executed in verbose mode and its output filtered further, because there is no built-in match indication. The following command can be used to filter down to the results: time manspider -s 10GB -t 8 -m 1000 /mnt/smb --filenames password -v | grep -v "Skipping" | grep -v "does not match" | grep -v "undesirable extension".
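The chained grep -v filtering can also be expressed as a single pass over the output, which is convenient when the results are post-processed in a script. This is a minimal sketch: the three substrings come from the command above, while the sample lines are invented placeholders and not real Man-Spider output.

```python
# Substrings dropped by the grep -v pipeline quoted above.
NOISE_MARKERS = ("Skipping", "does not match", "undesirable extension")


def keep_candidate_lines(lines):
    """Return only the lines that contain none of the noise markers."""
    return [
        line for line in lines
        if not any(marker in line for marker in NOISE_MARKERS)
    ]


# Invented placeholder lines, NOT real Man-Spider output.
sample = [
    "placeholder: Skipping some file",
    "placeholder: name does not match",
    "placeholder: undesirable extension",
    "placeholder: /mnt/smb/share/password-notes.txt",
]
print(keep_candidate_lines(sample))
```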
Ripgrep found all 35 files in under one-twentieth of a second. OS Error 13s were related to the absence of write permissions for some folders and were adjusted later.
3.2.7. Man-Spider and Ripgrep Multiple Filenames Search
The next testing case was finding multiple filenames in one search. Man-Spider found all 35 files and performed with similar numbers to the previous single filename search. Figure 31 implements manual highlighting for visibility and displays how Man-Spider verbose results look without filtering.
Ripgrep executed the same search in under one-tenth of a second while finding all files as well.
3.2.8. Man-Spider and Ripgrep Multiple Extensions Search
Two extensions, ppk and xlsx, were chosen for the multiple extensions test. Man-Spider search performance was similar to that of its multiple filename search.
Ripgrep performed the same search in 0.08 seconds, which is 15 times faster than Man-Spider. Double matching of extensions within file names was fixed later during the development of the 3-stage approach, with no observed effect on performance.
During directory name search attempts, it was noticed that Man-Spider does not execute just a directory name search without specifying file names, extensions, or content that needs to be searched along. Meanwhile, ripgrep searched multiple directories and filenames in a similar manner and performance as it did with filenames and extensions.
3.2.9. Overall Performance and Reliability Analysis Table
The following table was constructed to compare the capabilities of each tool and to design an approach that combines the strengths of both.
Section | Man-Spider | Ripgrep |
Installation | Issues may arise depending on the system compatibility | No issues |
Content Scanning Speed | More than 88 seconds without considering pictures for share size of 3.81 GB | Under 50 seconds for share size of 3.81 GB |
Content Parsing Issues | Did not analyze compressed files and PPT | Did not analyze DOCX, XLSX, PPTX, and pictures |
Filenames, Extensions, and Directory Search | No directory name search without specifying names, extensions, or content. Multiple Extensions search is 1.15 seconds for share size of 3.81 GB | Flexible search parameters. Multiple Extensions search is 0.1 second for share size of 3.81 GB (11 times faster) |
Reliability | Crashes systems upon large files. Heavy system usage, especially during pictures scanning (Figure 13) | Reliable execution. Much lower processing power consumption overall |
User Experience | Automatically displays and counts match occurrences during content search, but does not display all values in big files. Other searches have to be executed in verbose mode and manually filtered down. Good support of common and modern binary document file types | Does not count occurrences by default. Flexibility of filtering and directing output. Custom pre-processors need to be developed in order to support a variety of modern binary document files |
Figure 36. Performance and Reliability Analysis Results
3.3. 3-Stage Approach
Because of the observations above, it makes sense that sensitive files and content can be found efficiently with the following 3-stage combination approach.
3.3.1. Stage 1 - Filename Indexing and Ripgrep Scanning
First, a file and directory index is created with ripgrep. A positive aspect of this step is that the index can be reused later for other directory and file name searches. Because the index lists all file names, which can themselves contain sensitive information, it must be appropriately protected. This command was used to create the file and directory index: time rg -uuu -L --files /mnt/smb | sort > 1-rg-output-fileIndex.txt. The contents of the 1-rg-output-fileIndex.txt file can be seen in Figure 38 (omitted for brevity).
The second step is identifying the files that ripgrep is not able to parse so that they can be passed to Man-Spider later. We already know that picture files, along with the newer Office document formats, are not searchable by ripgrep. The older PowerPoint extension (.ppt) is not listed here because Man-Spider has yet to support it. So, these files are filtered with the following command: time rg -i '\.jpg|\.png|\.docx|\.xlsx|\.pptx' 1-rg-output-fileIndex.txt > 2-ms-inputFiles.txt.
Now in the third step, all other files can be searched for the word “password” in them with the following command: time rg -uuu -L -i password -g '!*.jpg' -g '!*.png' -g '!*.docx' -g '!*.xlsx' -g '!*.pptx' /mnt/smb > 1-rg-output-files_w_password.txt.
Previously identified binary files with the word “password” matches can be added to the Man-Spider search list for ease of displaying values if desired (cat 1-rg-output-files_w_password.txt | cut -d ":" -f 1 | sort -u | rg -i '\.doc|\.xls|\.pdf' >> 2-ms-inputFiles.txt).
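The extension patterns in steps two and four match anywhere in a path, so a directory such as old.png.bak would also be selected. The following is a minimal sketch of an anchored variant, assuming the intent is to match only the file extension at the end of each path. The function names are hypothetical, and grep -E is used only so that the sketch runs without ripgrep installed; the same patterns can be passed to rg -i.

```shell
#!/usr/bin/env bash

# Hypothetical helper: pick the files from the filename index that ripgrep
# cannot parse and that will be handed to Man-Spider in Stage 2.
# $1 = index file with one path per line; matching paths go to stdout.
select_manspider_inputs() {
    # The trailing $ anchors the extension to the end of the path, so
    # "/share/a.png.bak/notes.txt" is not selected by accident.
    grep -Ei '\.(jpg|png|docx|xlsx|pptx)$' "$1"
}

# Hypothetical helper: from ripgrep content-search output ("path:..."),
# keep the unique paths of older binary document types.
# Assumption: paths do not contain ":" themselves, as in the original cut.
select_binary_hits() {
    cut -d ':' -f 1 "$1" | sort -u | grep -Ei '\.(doc|xls|pdf)$'
}

# Usage, with the file names from the text:
#   select_manspider_inputs 1-rg-output-fileIndex.txt > 2-ms-inputFiles.txt
#   select_binary_hits 1-rg-output-files_w_password.txt >> 2-ms-inputFiles.txt
```

Anchoring trades a little recall for precision: files with unusual double extensions would need their own pattern.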
3.3.2. Stage 2 - Feeding into Man-Spider
Unfortunately, Man-Spider cannot currently scan individual files directly. So, 2-ms-inputFiles.txt has to be converted into 2-ms-inputDirectories.txt with this command: sed 's:[^/]*$::' 2-ms-inputFiles.txt | sort -u > 2-ms-inputDirectories.txt
Man-Spider is then executed in the following way: for n in $(cat 2-ms-inputDirectories.txt); do manspider -s 10GB -t 8 -m 1 $n -e jpg png doc docx xls xlsx pptx pdf -c password; done > 2-ms-output.txt
The results can then be viewed with rg -i matched -A 1 2-ms-output.txt. Please note that this command may need to be adjusted depending on the number of matched occurrences per file in the Man-Spider output, since ripgrep prints only one line after each match (-A 1) in the example above.
The 2-ms-inputFiles.txt file can be reused later if Man-Spider gains the ability to search individual files, or it can be fed into other tools that work with those selected binary files.
The overall time is a little under 278 seconds to create the directory and file name index and to search the content of five shares totaling 19.05 GB in size, while finding 11 out of 13 files containing the word “password” in each share.
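The for loop over $(cat ...) splits on whitespace, so a directory such as "HR docs" would be passed to Man-Spider as two separate arguments. Below is a minimal sketch of the same Stage 2 steps that reads the list line by line instead. The function names and the SCANNER variable are hypothetical; the Man-Spider flags are the ones used in the text.

```shell
#!/usr/bin/env bash

# Scanner command; overridable so the loop can be dry-run with "echo".
SCANNER="${SCANNER:-manspider}"

# Convert a list of file paths into a unique list of parent directories
# (same sed expression as in the text).
files_to_dirs() {
    sed 's:[^/]*$::' "$1" | sort -u
}

# Run the scanner once per directory. Reading line by line keeps paths
# that contain spaces intact, unlike "for n in $(cat ...)".
scan_dirs() {
    local dir
    while IFS= read -r dir; do
        [ -n "$dir" ] || continue
        # stdin is redirected so the scanner cannot consume the directory list.
        "$SCANNER" -s 10GB -t 8 -m 1 "$dir" \
            -e jpg png doc docx xls xlsx pptx pdf -c password </dev/null
    done < "$1"
}

# Usage, with the file names from the text:
#   files_to_dirs 2-ms-inputFiles.txt > 2-ms-inputDirectories.txt
#   scan_dirs 2-ms-inputDirectories.txt > 2-ms-output.txt
```

Setting SCANNER=echo prints the commands that would be run, which is a cheap way to review the directory list before a long scan.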
3.3.3. Stage 3 - Manual Binary Findings Review
A manual review of the remaining binary files is needed at this stage if the actual matched values need to be seen. To display the list of those files, the following command can be used:
cat 1-rg-output-files_w_password.txt | rg -v '\.doc|\.xls|\.pdf' | rg -i binary | cut -d ":" -f 1
In the test dataset, those files have the ppt and zip extensions. PowerPoint (.ppt) files can first be converted into rtf or pdf files and then scanned again. For compressed files, ripgrep has a dedicated flag (-z), but it does not work with all of them: our tests included a zip file, and no difference was observed with or without the -z flag. This is likely because ripgrep currently supports only certain compression formats. However, ripgrep still found the word “password” in a zipped filename in both cases. Additional caution regarding disk space usage and data security is needed when adding file conversion.
Using this three-stage approach, all provided files containing the word “password”, except images with the written word, were found at significantly faster speeds and without any system crashes.
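To plan the manual review, it can help to see how many of the remaining binary hits fall under each extension. The sketch below is one way to summarize them, assuming ripgrep's "path: binary file matches ..." output format that the cut command above relies on. The function name is hypothetical, and grep is used in place of rg only to keep the sketch dependency-free.

```shell
#!/usr/bin/env bash

# Hypothetical helper: count the binary hits left for manual review, grouped
# by file extension. $1 = ripgrep content-search output from Stage 1.
remaining_binary_by_ext() {
    # 1. Drop the types already handed to Man-Spider (doc, xls, pdf).
    # 2. Keep ripgrep's binary-file notices.
    # 3. Reduce each line to its path, then to the text after the last dot.
    # 4. Count occurrences per lowercase extension, most frequent first.
    grep -Evi '\.(doc|xls|pdf):' "$1" |
        grep -i 'binary' |
        cut -d ':' -f 1 |
        sort -u |
        sed 's/.*\.//' |
        tr '[:upper:]' '[:lower:]' |
        sort | uniq -c | sort -rn
}

# Usage: remaining_binary_by_ext 1-rg-output-files_w_password.txt
```

On the zip observation: ripgrep's documentation lists gzip, bzip2, xz, LZ4, LZMA, Brotli, and Zstd as the formats handled by -z. Zip archives are not in that list, which would be consistent with the behavior seen in the tests.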
3.4. Enterprise Testing and Aspects
When testing moved into an enterprise environment, various parameters started to negatively affect the testing speed. These include the search terms and their number; system types and remoteness; network reliability and bandwidth; the diversity of file quantities, sizes, and types; and directory structure specifics such as languages and special characters.
For example, a remote enterprise share of just 43.7 MB containing exe, config, dll, pdb, xml, and log files was scanned (first stage, third step) in under 44 seconds (real 43.59s, user 0.30s, sys 0.35s, cpu 1%) using hardware specifications identical to Testing System 4; meanwhile, we were able to go through an almost 4 GB lab environment share in the upper 35-second range on average. The command that was used: time rg -uuu -L -i password -g '!*.jpg' -g '!*.png' -g '!*.docx' -g '!*.xlsx' -g '!*.pptx' <Share Name> > <OutputFile.txt>. As can be seen, an enterprise environment will be much slower to test than a laboratory environment.
An interesting finding occurred during another test of the same file-share system. When a broader search with five keywords was launched, it finished nine seconds faster than the single-word search. We could not explain this behavior. The command looked as follows: time rg -uuu -L -i 'password|login|<CompanyName>|secret|key' -g '!*.jpg' -g '!*.png' -g '!*.docx' -g '!*.xlsx' -g '!*.pptx' <Share Name> > <OutputFile.txt>.
Another critical aspect to consider is the manual analysis and filtering of false positives out of the results. These activities can be as time-consuming as the initial assessment itself, depending on search parameters and scope. In the previous example of a share under 50 MB, the single-word "password" search yielded 57 findings that had to be analyzed and filtered. When the five-word search was executed against the same share system, it returned 761 matches (13 times more) that needed to be scrutinized. Thus, this small example clearly illustrates the importance of considering scope and search parameters when projecting assessment workload.
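The timing above can be turned into a throughput figure to make the gap concrete. The sketch below divides the share size by the wall-clock time; the function name is hypothetical and the inputs are the enterprise values quoted in the text.

```shell
#!/usr/bin/env bash

# Hypothetical helper: throughput in MB/s from size (MB) and wall time (s).
mb_per_s() {
    awk -v mb="$1" -v s="$2" 'BEGIN { printf "%.2f\n", mb / s }'
}

# Enterprise share from the text: 43.7 MB in 43.59 s of wall-clock time.
#   mb_per_s 43.7 43.59    # prints 1.00
```

At roughly 1 MB/s, and with user plus sys time adding up to only 0.65 of the 43.59 seconds, the run appears to have spent almost all of its time waiting on the network and the remote system rather than on matching. The same helper can be applied to the lab measurements for a like-for-like comparison.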
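One way to make the review workload more manageable is to rank files by the number of findings they contain before reading individual matches. A minimal sketch follows, assuming the "path:match" output format produced by the content-search command above. The function name is hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: count findings per file, most findings first.
# $1 = ripgrep content-search output with one "path:match" line per finding.
# Assumption: paths do not contain ":" themselves.
count_hits_per_file() {
    cut -d ':' -f 1 "$1" | sort | uniq -c | sort -rn
}

# Usage: count_hits_per_file <OutputFile.txt> | head -n 20
```

Files that dominate the list, such as verbose logs, are often candidates for an extra exclusion glob in the next run. ripgrep's -c flag reports per-file counts of matching lines directly, at the cost of another scan.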
4. Recommendations and Implications
4.1. Improving Speed
Even though the researched approach increases speed during sensitive content discovery on SMB shares, the speed can be improved even further. From an architectural perspective, a more powerful remote server can be used to help increase speed. However, several things need to be accounted for, such as the quality of the network cards involved, bandwidth, cabling, and connection speed limits in protocols and in the architectural solutions themselves (for example, AWS EBS). All of them will affect the speed of the assessment. Placing multiple scanners closer to targets and executing them in parallel can improve overall time. When working with remote servers, it is important to account for connection timeouts and keep them monitored. For example, SSH connection timeouts are a frequent event, and even with keep-alive packets, sessions break occasionally.
Selecting an appropriate number of CPU threads can help with the speed. Ripgrep automatically selects what it considers a good number. However, “ripgrep cannot know a priori what the optimal number of threads is for every workload. That's not possible.” - says the author (Gallant, 2023b) of the tool and he adds, “its heuristics are probably not suitable for, say, a 64 core CPU. In that case, you probably want to use -j to increase the number of threads ripgrep uses” (Gallant, 2023b). Man-Spider uses 5 threads by default, so manually increasing the thread count, depending on available hardware, will benefit the assessment’s speed as well.
Ripgrep preprocessors can make a pronounced difference when executed at scale and when various supported file types are present. However, quick testing on the lab dataset did not show a significant difference (Appendix Figure 5.1).
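The thread count can be set explicitly with the -j flag mentioned in the quote. The sketch below wraps the content-search command from Stage 1 and defaults to the number of online CPU cores; this default is an assumption for illustration and not a recommendation from the ripgrep author, so the best value should be measured with time on the target hardware. The function names are hypothetical.

```shell
#!/usr/bin/env bash

# Number of online CPU cores; falls back to 1 if it cannot be determined.
cpu_count() {
    getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
}

# Hypothetical wrapper: run the content search with an explicit thread count.
# $1 = share path, $2 = threads (defaults to the number of online cores).
search_with_threads() {
    local share="$1" threads="${2:-$(cpu_count)}"
    rg -j "$threads" -uuu -L -i password \
        -g '!*.jpg' -g '!*.png' -g '!*.docx' -g '!*.xlsx' -g '!*.pptx' "$share"
}

# Usage: compare a few values and keep the fastest one.
#   time search_with_threads /mnt/smb 8  > /dev/null
#   time search_with_threads /mnt/smb 16 > /dev/null
```

For network shares, more threads than cores may still help because workers spend much of their time waiting on I/O, which is another reason to measure rather than assume.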
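As an illustration of what such a preprocessor could look like, the sketch below extracts text from the newer Office formats, which are zip containers holding XML. It is an assumption-laden example and not the preprocessor tested in this research: it relies on the unzip utility being installed, reads only the main text parts of each format, and uses hypothetical function names. ripgrep's --pre flag runs the given command with the file path as its first argument and searches the command's output instead of the file.

```shell
#!/usr/bin/env bash
# Hypothetical ripgrep preprocessor (save as e.g. ooxml-pre.sh, chmod +x).

# Replace XML tags with spaces so that only the document text remains.
strip_xml_tags() {
    sed 's/<[^>]*>/ /g'
}

# Print searchable text for one file, chosen by its extension.
extract_text() {
    case "$1" in
        *.docx) unzip -p "$1" 'word/document.xml'    | strip_xml_tags ;;
        *.xlsx) unzip -p "$1" 'xl/sharedStrings.xml' | strip_xml_tags ;;
        *.pptx) unzip -p "$1" 'ppt/slides/*.xml'     | strip_xml_tags ;;
        *)      cat "$1" ;;
    esac
}

# When saved as a standalone script, the last line would be:
#   extract_text "$1"
```

A possible invocation is rg --pre ./ooxml-pre.sh --pre-glob '*.docx' --pre-glob '*.xlsx' --pre-glob '*.pptx' -i password /mnt/smb, with the corresponding -g exclusions removed. Restricting the preprocessor with --pre-glob matters for speed, since spawning a process for every file is expensive.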
In addition, a file hash comparison system can be integrated into this approach to minimize potential re-scanning of already assessed files and thus increase overall speed.
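A minimal sketch of such a hash comparison is shown below, assuming SHA-256 via the sha256sum utility and a plain text file as the cache of already assessed hashes. The function name and the cache format are hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print only the files whose content hash is not yet in
# the cache, and record the new hashes.
# $1 = list of file paths (one per line), $2 = cache file of known hashes.
filter_unseen() {
    local list="$1" cache="$2" path hash
    touch "$cache"
    while IFS= read -r path; do
        [ -f "$path" ] || continue
        hash=$(sha256sum < "$path" | cut -d ' ' -f 1)
        # -x matches the whole line, -F treats the hash as a fixed string.
        if ! grep -qxF "$hash" "$cache"; then
            printf '%s\n' "$hash" >> "$cache"
            printf '%s\n' "$path"
        fi
    done < "$list"
}

# Usage: filter_unseen 2-ms-inputFiles.txt seen-hashes.txt > 2-ms-newFiles.txt
```

Hashing still reads every file over the network, so the saving applies mainly where the scan itself is the expensive part, for example Man-Spider parsing. The cache also reveals which content has been seen before and should be protected like the filename index.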
Indexing the content of the files prior to scanning is another technique that can improve speed, especially if there is a need to conduct multiple separate content searches. However, privacy and data security require extra vigilance when implementing this technique. Indexing the content of files was not considered as an approach in this paper.
Thus, an ultimate solution for improving speed could look as follows: a bare-metal Linux server with overall performance specifications exceeding those of Testing System 4, located close to the targeted systems. An appropriate number of threads should be selected during software execution. All data that is indexed and stored would benefit from being physically located on PCIe SSD storage.
4.2. Image Recognition
Pictures containing the written word “password” were not identified during the tests. One potential way to address this problem would be to implement tools that specialize in recognizing text in images (OCR), including machine-learning-based ones. However, even though machine learning systems have made significant progress in OCR, there may still be challenges in accurately recognizing complex layouts and handwriting.
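As a rough illustration, an OCR pass could be added over the picture files collected in the filename index. The sketch below assumes the open-source Tesseract CLI, which was not evaluated in this research, and its "tesseract <image> stdout" invocation. The OCR_CMD variable and the function name are hypothetical.

```shell
#!/usr/bin/env bash

# OCR command; assumed to accept "<image> stdout" like the Tesseract CLI.
OCR_CMD="${OCR_CMD:-tesseract}"

# Print the image paths whose recognized text contains the keyword.
# $1 = list of image paths (one per line), $2 = keyword (case-insensitive).
ocr_find() {
    local list="$1" word="$2" img
    while IFS= read -r img; do
        # stdin is redirected so the OCR tool cannot consume the path list.
        if "$OCR_CMD" "$img" stdout 2>/dev/null </dev/null |
            grep -qiF -- "$word"; then
            printf '%s\n' "$img"
        fi
    done < "$list"
}

# Usage: ocr_find image-paths.txt password > 3-ocr-output.txt
```

OCR is considerably slower than text search, so limiting it to a pre-filtered list of images keeps the extra cost contained.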
4.3. Legacy Systems
Popular legacy systems like Windows XP and Windows 7 were not measured during the research. However, based on the test data, no significant performance impact is expected, because the Linux share systems used SMB versions 1 and 2 (these older versions of SMB are commonly found in older operating systems such as Windows XP and Windows 7).
4.4. New systems and protocol implementations
With the development of new systems and protocols, aspects such as scope and testing may need to be reconsidered, and additional assessment steps may be required. One example is the new SMB over QUIC protocol implementation, available in Windows Server 2022 Datacenter: Azure Edition and Windows 11, which survives a change in the client’s IP address or port and allows connectivity over the Internet.
5. Conclusion
Sensitive corporate information located on thousands of SMB shares is a big problem in many organizations. Prompt proactive assessments and remediation in this space are necessary to minimize risks related to adversaries pivoting inside of an organization. Of course, the difficulties differ from one organization to another when dealing with problems such as this. Nonetheless, there is always a way to tackle these tasks, and the flexibility of many open-source tools helps. Big appreciation to Andrew Gallant for his prompt and insightful answers, which supported the ideas regarding improving ripgrep’s speed. I hope the suggested approach and researched information will benefit your organization’s security posture.
References
CyberArk Blog Team. (2022, December 9). Unpacking the Uber breach. Identity Security and Access Management Leader. CyberArk. https://www.cyberark.com/resources/blog/unpacking-the-uber-breach
Bowes, R. (n.d.). Script SMB-enum-shares. smb-enum-shares NSE script - Nmap Scripting Engine documentation. Nmap. https://nmap.org/nsedoc/scripts/smb-enum-shares.html
Gallant, A. (2023, March 24). How do I install ripgrep on linux if I don’t have root permissions? GitHub. https://github.com/BurntSushi/ripgrep/discussions/2477
Gallant, A. (2023b, May). Increasing ripgrep speed for parsing large content size of various binary and text files. GitHub. https://github.com/BurntSushi/ripgrep/discussions/2507
Gallant, A. (2022, August 10). User Guide - Binary data. GitHub. https://github.com/BurntSushi/ripgrep/blob/master/GUIDE.md#binary-data
Gallant, A. (2022b, August 10). User Guide - Automatic filtering. GitHub. https://github.com/BurntSushi/ripgrep/blob/master/GUIDE.md#automatic-filtering
Wright, J. (2022, August 4). Searching SMB Share Files. SANS Institute. https://sans.org/blog/searching-smb-share-files/
Kerrisk, M. (2019, March 6). Time(1) — Linux manual page. Man7. https://man7.org/linux/man-pages/man1/time.1.html
Microsoft. (2023, May 18). SMB over QUIC. Microsoft Learn. https://learn.microsoft.com/en-us/windows-server/storage/file-server/smb-over-quic
MITRE ATT&CK®. (2021, December 13). Network Share Discovery, Technique T1135 - Enterprise. ATT&CK Matrix for Enterprise. https://attack.mitre.org/techniques/T1135/
MITRE ATT&CK®. (2022, September 6). File and Directory Discovery, Technique T1083 - Enterprise. ATT&CK Matrix for Enterprise. https://attack.mitre.org/techniques/T1083/
MITRE ATT&CK®. (2022, June 16). Data from Network Shared Drive, Technique T1039 - Enterprise. ATT&CK Matrix for Enterprise. https://attack.mitre.org/techniques/T1039/
MITRE ATT&CK®. (2022, April 12). Unsecured Credentials: Credentials In Files, Sub-technique T1552.001. ATT&CK Matrix for Enterprise. https://attack.mitre.org/techniques/T1552/001/
MITRE ATT&CK®. (2022, March 29). Unsecured Credentials: Private Keys, Sub-technique T1552.004 - Enterprise. ATT&CK Matrix for Enterprise. https://attack.mitre.org/techniques/T1552/004/
Appendix
For appendix materials, please reach out to contact@discerningsec.com