Search and you shall find. On a Linux system, there are numerous search tools for quickly and precisely finding certain local data.
We could use the locate and find commands to find files by their name, type, timestamps, owner, or size. The find command can also search the file contents, but in most cases, there is an easier tool for that called grep. If we wanted to search a file or directory for some relevant content string, we could use the grep command, or its newer alternative ack.
The name “grep” stands for “global / regular expression / print.” It comes from the g/re/p command of the Unix editor ed, which globally searches for a regular expression and prints the matching lines. Grep checks whether the input it receives matches a specified pattern; such patterns are called regular expressions, and you have likely seen some of them before in other software tools. In this tutorial, we will only use the basics of regular expressions, but be sure to explore their “deeper waters” if needed.
The full power of grep and similar tools really starts to show when we combine its search and filtering operations with other Linux commands.
Step 1: Get Some Sample Data Files
First, we need to install Git, so that we can download projects from GitHub:
sudo apt-get install git
Now we can download the jQuery source code to our home directory:
cd ~
git clone https://github.com/jquery/jquery.git
Then, go into the directory we just downloaded:
cd jquery
Let’s have a look at the files in this directory using the ls command:
ls
We see a list of different file types and a few directories:
AUTHORS.txt bower.json build CONTRIBUTING.md external Gruntfile.js LICENSE.txt package.json README.md src test
Let’s see how we could find content in this source code.
Step 2a: Using Grep
Grep comes preinstalled on virtually every Linux system, so there is no need for manual installation.
Grep Command Options
This is a summary of the grep command options we will use in this tutorial:
- -i does case-insensitive character matching
- -r reads all files under each directory recursively
- -n shows the line number of each match
- -c shows the match count
- -v inverts the matching by selecting the non-matching lines
- -o prints only the matched parts of a matching line, with each part on a separate output line
- -w only matches on whole words
To find the lines that contain the string “John Resig” in every file in the current directory, you would type:
grep 'John Resig' *
The resulting output would be:
grep: build: Is a directory
grep: external: Is a directory
grep: src: Is a directory
grep: test: Is a directory
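The “*” is expanded by the shell to the name of every file and directory in the current directory, which is why grep reports that the directories cannot be read as files. If our search pattern contains any spaces, we need to put quotes around the search string (single quotes or double quotes).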
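As a minimal sketch of why the quotes matter, the following uses a throwaway directory and a made-up names.txt file (both are assumptions for illustration, not part of the jQuery tree):

```shell
# Create hypothetical sample data in a temporary directory.
demo_dir=$(mktemp -d)
printf 'John Resig\nJohn Smith\nResig John\n' > "$demo_dir/names.txt"

# Quoted: the shell passes "John Resig" to grep as one single pattern.
quoted_match=$(grep 'John Resig' "$demo_dir/names.txt")
echo "$quoted_match"

# Unquoted, grep would receive "John" as the pattern and "Resig" as a file name:
#   grep John Resig "$demo_dir/names.txt"

# Clean up the temporary directory.
rm -r "$demo_dir"
```

Only the line containing the exact phrase is printed; the lines that merely contain “John” or “Resig” are not.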
To find the lines that contain the string “Authors” in every file in the current directory, you would type:
grep Authors *
The resulting output would be:
AUTHORS.txt:Authors ordered by first contribution.
grep: build: Is a directory
grep: external: Is a directory
grep: src: Is a directory
grep: test: Is a directory
Grep found one matching file and printed the line that matched our “Authors” pattern. Note that grep is not matching the file name here, only the content of the file.
If we had typed this instead:
grep authors *
We would see a different matched file, because grep is sensitive to character casing by default.
We can use grep’s -i option to make the matching case-insensitive:
grep -i authors *
Now we can see all matches regardless of any character casing combination we could have used in our search pattern.
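To make the effect of -i concrete, here is a small sketch on a hypothetical sample.txt file (the file and its contents are invented for this example). It uses the -c option from the summary above to count matching lines:

```shell
# Create hypothetical sample data in a temporary directory.
demo_dir=$(mktemp -d)
printf 'Authors ordered by first contribution.\nSee the AUTHORS file.\nlist of authors\n' > "$demo_dir/sample.txt"

# Default (case-sensitive): only the line containing lowercase "authors" matches.
sensitive_count=$(grep -c authors "$demo_dir/sample.txt")

# With -i: "Authors", "AUTHORS" and "authors" all match.
insensitive_count=$(grep -ic authors "$demo_dir/sample.txt")

echo "case-sensitive: $sensitive_count, case-insensitive: $insensitive_count"

# Clean up the temporary directory.
rm -r "$demo_dir"
```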
To do the same search throughout all the directories (in our current directory), we can add the -r recursive option:
grep -i -r authors *
Now grep will search every directory and all of its subdirectories, however deeply they are nested.
This same command can be shortened by combining the options, producing the same result:
grep -ir authors *
To see the line numbers of the matching results, we add the -n option:
grep -irn authors *
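The sketch below shows what the combined -irn output looks like on a tiny, made-up directory tree (the directory layout and file contents are assumptions for illustration). Each result has the form file:line-number:matching-line:

```shell
# Build a hypothetical two-level directory tree.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/src/core"
printf 'first line\nAuthors: top level\n' > "$demo_dir/README.txt"
printf 'var x;\n// authors: nested\n' > "$demo_dir/src/core/init.js"

# -i ignores case, -r descends into src/core, -n adds the line number.
# The output is sorted only to make the order predictable.
recursive_hits=$(cd "$demo_dir" && grep -irn authors * | sort)
echo "$recursive_hits"

# Clean up the temporary directory.
rm -r "$demo_dir"
```

Both the top-level file and the file nested two directories deep are reported, each with the number of the line that matched.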
To search the AUTHORS.txt file for lines with a “gmail.com” domain:
grep -i gmail.com AUTHORS.txt
If we wanted to count all the matches of the previous search, we would add the -c option:
grep -ic gmail.com AUTHORS.txt
We would see a number printed, indicating the number of matched lines.
To invert our previous “gmail.com” search, we would use the -v option:
grep -iv gmail.com AUTHORS.txt
Now we see all the lines without the “gmail.com” string, which is a pretty handy feature.
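A quick way to convince ourselves that -c and -v complement each other is to count both sides on a small file. This is a sketch on invented data (the names and addresses are hypothetical). It also escapes the dot as `\.`, because an unescaped “.” in a regular expression matches any character:

```shell
# Create a hypothetical author list in a temporary directory.
demo_dir=$(mktemp -d)
printf 'Ann <ann@gmail.com>\nBob <bob@example.org>\nCid <cid@GMAIL.com>\nDee <dee@example.net>\n' > "$demo_dir/authors.txt"

# Lines that match, lines that do not match, and the total number of lines.
matching=$(grep -ic 'gmail\.com' "$demo_dir/authors.txt")
non_matching=$(grep -icv 'gmail\.com' "$demo_dir/authors.txt")
total=$(wc -l < "$demo_dir/authors.txt")

echo "$matching matching + $non_matching non-matching = $((matching + non_matching)) of $((total)) lines"

# Clean up the temporary directory.
rm -r "$demo_dir"
```

Every line lands in exactly one of the two groups, so the two counts always add up to the number of lines in the file.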
We can search for whole word matches as well. Let’s search, case-insensitively, for the word “bug”:
grep -i -w bug *
The -w option forces our pattern to only be matched on whole words, so words containing the string “bug” (e.g., “bugs”) would not be a valid match.
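The difference between a substring match and a whole-word match can be sketched on a few invented lines (the notes.txt file is hypothetical):

```shell
# Create hypothetical sample data in a temporary directory.
demo_dir=$(mktemp -d)
printf 'Fix bug in parser\nSeveral bugs remain\nDebugging tips\nBug: crash on load\n' > "$demo_dir/notes.txt"

# Without -w, "bug" also matches inside "bugs" and "Debugging".
substring_count=$(grep -ic bug "$demo_dir/notes.txt")

# With -w, only the standalone word matches ("bug" and "Bug:").
whole_word_count=$(grep -icw bug "$demo_dir/notes.txt")

echo "substring: $substring_count, whole word: $whole_word_count"

# Clean up the temporary directory.
rm -r "$demo_dir"
```

Punctuation such as the colon in “Bug:” does not count as part of the word, so that line still matches with -w.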
If we wanted to find out how many times the word “jquery” is mentioned throughout the source code, we would pipe (“|”) the grep output into the wc word-count command with the -l option, so that it counts only lines rather than words or characters. The -o option prints each matching part on a separate output line; without it, a line containing several matches would be counted only once and our count would not be correct.
grep -iro jquery * | wc -l
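To see why -o is needed here instead of -c, consider this sketch on a made-up file in which one line contains the word twice:

```shell
# Create hypothetical sample data in a temporary directory.
demo_dir=$(mktemp -d)
printf 'jQuery uses jquery.js\nno match here\nJQUERY\n' > "$demo_dir/sample.txt"

# -c counts matching lines: the first line is counted once.
matching_lines=$(grep -ic jquery "$demo_dir/sample.txt")

# -o prints every match on its own line, so wc -l counts occurrences.
total_matches=$(grep -io jquery "$demo_dir/sample.txt" | wc -l)

echo "matching lines: $matching_lines, total matches: $((total_matches))"

# Clean up the temporary directory.
rm -r "$demo_dir"
```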
If we do a search that returns many matches, we can pipe the grep output to less. Less is a paging tool that makes it easy to scroll through all the output using either the ↓, ↑, “page-up,” or “page-down” keys, or the SPACE bar.
grep -ir jquery * | less
We can also chain several grep commands together to do easy filtering of the results of each previous command.
grep -ir jquery * | grep -i json | less
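One detail worth knowing about such chains is that the second grep filters the complete output lines of the first, including the file-name prefix. A minimal sketch on invented files (names and contents are hypothetical):

```shell
# Create hypothetical sample files in a temporary directory.
demo_dir=$(mktemp -d)
printf '{"name": "jquery"}\n' > "$demo_dir/package.json"
printf 'jQuery core\n' > "$demo_dir/core.js"
printf 'no hits\n' > "$demo_dir/README.md"

# The first grep finds "jquery" in two files; the second keeps only the
# output lines that also contain "json" (here, via the file name prefix).
filtered=$(cd "$demo_dir" && grep -ir jquery * | grep -i json)
echo "$filtered"

# Clean up the temporary directory.
rm -r "$demo_dir"
```

The match in core.js is dropped because neither its file name nor its content contains “json.”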
To create much more precise matching patterns, we will need to use regular expressions.
For example, say we wanted to find the authors with a first name of “Chris” or “John,” but not “Christopher,” “Christian” or any other first name pattern.
grep -E "(^Chris )|(^John )" AUTHORS.txt
And voilà, we see all the authors with a first name of Chris or John.
The -E option tells grep to interpret our search pattern as an extended regular expression. This pattern contains two parts, “(^Chris )” and “(^John ),” separated by the pipe symbol “|”, which represents a logical OR. If either of the two parts matches, the line is printed. To match only first names, we use the caret “^” symbol, which anchors the pattern to the start of the line, and the trailing space after each name keeps longer names such as “Christopher” from matching.
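Here is a small sketch of that pattern at work on an invented author list (all names are made up for illustration). It also tries an equivalent, slightly shorter form that factors out the caret and the trailing space:

```shell
# Create a hypothetical author list in a temporary directory.
demo_dir=$(mktemp -d)
printf 'Chris Alpha\nChristopher Doe\nJohn Beta\nAnn Johnson\nJohn-Paul Roe\nChristian Poe\n' > "$demo_dir/authors.txt"

# The pattern from the tutorial: either alternative must start the line
# and be followed by a space.
first_names=$(grep -E '(^Chris )|(^John )' "$demo_dir/authors.txt")
echo "$first_names"

# An equivalent form with a single anchor around a grouped alternation.
grouped=$(grep -E '^(Chris|John) ' "$demo_dir/authors.txt")

# Clean up the temporary directory.
rm -r "$demo_dir"
```

Only “Chris Alpha” and “John Beta” are printed; “Christopher,” “Christian,” “John-Paul,” and “Ann Johnson” all fail either the start-of-line anchor or the trailing space.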
If you would like to learn more about using grep with regular expressions, see this tutorial. Mastering regular expressions is a skill worth working on.
Step 2b: Using Ack
Ack is a search tool just like grep, but it’s optimized for searching in source code trees. Ack does almost all that grep does, but it differs in the following ways.
Ack was designed to:
- Search directories recursively by default
- Easily exclude certain file types or only search for certain file types
- Ignore the common version control directories by default; these are directories with names like: .git, .hg, .svn
- Ignore binary files by default; these are files like: binary executables, image/music/video files, gzip/zip/tar archive files
- Have better highlighting of matches and also to format the output a bit more cleanly
That being said, one case in which grep is often quicker than ack is when you are searching through very big files using regular expressions.
To get started, the first step is to install the ack tool on your machine.
On an Ubuntu or Debian machine, this is as simple as installing the utility from the default repositories. The package is called ack-grep:
sudo apt-get update
sudo apt-get install ack-grep
Is the program called ack-grep or ack?
The name of the program is “ack.” Some packagers have called it “ack-grep” when creating packages, because there’s already a package out there called “ack” that has nothing to do with this ack. We can tell our Linux system to shorten this command to “ack” if we would like by typing this command:
sudo dpkg-divert --local --divert /usr/bin/ack --rename --add /usr/bin/ack-grep
Now, the tool will respond to the name “ack” instead of “ack-grep.”
Ack Command Options
This is a summary of the ack command options we will use in this tutorial:
- -i does case-insensitive character matching
- -f --X only prints the files that would be searched, without actually doing any searching, where “X” denotes the filetype (e.g., “--html”)
- -n does not descend into any subdirectories.
- -w only matches whole words
- --type=noX excludes certain filetypes from the search, where “X” denotes the filetype to be excluded (e.g., “--type=nophp” to exclude PHP files)
Let’s do some searching on our jQuery source tree again to see how ack optimizes code searching.
ack -i Authors *
We see this result:
AUTHORS.txt
1:Authors ordered by first contribution.
bower.json
12: "AUTHORS.txt",
external/sizzle/MIT-LICENSE.txt
18:NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
external/qunit/MIT-LICENSE.txt
18:NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LICENSE.txt
27:NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
package.json
10: "url": "https://github.com/jquery/jquery/blob/master/AUTHORS.txt"
41: "grunt-git-authors": "1.2.0",
Compare the above output to the grep version of this search:
grep -i Authors *
We see this result:
AUTHORS.txt:Authors ordered by first contribution.
bower.json: "AUTHORS.txt",
grep: build: Is a directory
grep: external: Is a directory
LICENSE.txt:NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
package.json: "url": "https://github.com/jquery/jquery/blob/master/AUTHORS.txt"
package.json: "grunt-git-authors": "1.2.0",
grep: src: Is a directory
grep: test: Is a directory
Note how the ack search is done recursively by default, and each match is printed on its own line with a line number by default. The formatting is a bit easier to read, especially when there are many matches.
These defaults and formatting are nice when you often search through code trees.
Ack can do more than that, though. Let’s find all HTML files in the source tree.
ack -f --html
The -f option only prints the files that would be searched without actually doing any searching. The --html option is a special feature of ack. Ack understands many file types, and by specifying this option, you ask it to consider only HTML files.
In the same way, to search only the JavaScript files, case-insensitively, for the whole word “bug,” we type:
ack -i -w --js bug
Sometimes we don’t want to do a recursive search. To search in the current directory only for the word “bug,” we type:
ack -n -w bug
The -n option tells ack not to descend into any subdirectories.
To search for the whole word “css” in all files except JavaScript files, we type:
ack -w --type=nojs css
The --type=noX option allows for the exclusion of file types known by ack, where “X” denotes the file type to be excluded.
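For comparison, grep has no notion of file types, but a rough counterpart can be sketched with file-name globs. This example assumes GNU grep, whose --include and --exclude options filter files by name during a recursive search; it also uses -l, an option not covered above, to print only the names of matching files. The sample files are invented:

```shell
# Create hypothetical sample files in a temporary directory.
demo_dir=$(mktemp -d)
printf '/* css helper */\n' > "$demo_dir/style.js"
printf 'the css module\n' > "$demo_dir/notes.txt"
printf '<p>css demo</p>\n' > "$demo_dir/index.html"

# Roughly like "ack -w --type=nojs css": skip files named *.js.
without_js=$(cd "$demo_dir" && grep -rlw --exclude='*.js' css . | sort)

# Roughly like "ack -w --html css": search only files named *.html.
only_html=$(cd "$demo_dir" && grep -rlw --include='*.html' css . | sort)

echo "without JavaScript files:"
echo "$without_js"
echo "HTML files only:"
echo "$only_html"

# Clean up the temporary directory.
rm -r "$demo_dir"
```

The globs only look at file names, whereas ack’s type definitions can cover several extensions per type, so this is an approximation rather than an exact equivalent.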
The same regular expression that we used with grep will also work for ack:
ack "(^Chris )|(^John )" AUTHORS.txt
Ack has a lot more to offer than what was shown here. See the official documentation for a more in-depth look at using ack.
Other grep-like Tools
Here are some other great search tools that are worth exploring:
- zgrep – Grep tool that can search compressed files (e.g., compressed log files)
- agrep – Grep-like tool with support for approximate patterns
- jq – Command line tool to search in JSON files and structure the resulting output (as valid JSON)
- xgrep, xmlgrep, xmlstarlet – These are similar command line tools to search the content of XML files
- pdfgrep – Command line tool to search the content of PDF files
- git grep – Built-in search tool of the Git versioning system