Bo2SS

9 Data Extraction, Soft and Hard Links, Linear Sieve, SED

Course Content#

Data Extraction Operations#

| Command | Function | Command | Function |
|:----:|:----|:----:|:----|
| cut | Split | grep | Search |
| sort | Sort | wc | Count characters, words, lines |
| uniq | Remove duplicates | tee | Bidirectional redirection |
| split | File splitting | xargs | Parameter substitution |
| tr | Replace, compress, delete | | |

cut#

——The "kitchen knife" of text extraction: slices input by field, byte, or character

  • -d c Split by character c
    • Can use "" or '' to surround c, very flexible
    • Default delimiter is TAB
  • -f [n/n-/n-m/-m] Select the nth block / nth-end block / nth-m block / 1-m block
  • -b [n-/n-m/-m] Bytes
  • -c [n-/n-m/-m] Characters
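A quick sketch of these options on a small sample string:

```shell
# -d sets the delimiter, -f picks fields; -c picks character positions
echo "a:b:c" | cut -d ":" -f 2     # → b
echo "a:b:c" | cut -d ":" -f 2-    # → b:c
echo "hello" | cut -c 2-4          # → ell
```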

grep#

For common operations, see: tldr grep

  • -c Count matching lines [per file]
  • -n Prefix each matching line with its line number
  • -v Invert: output non-matching lines [the complement]
  • -C 3 Output 3 lines of context around each match
  • -i Case-insensitive matching
  • [PS] -n and -c conflict; when combined, only the -c count is shown
  • Example
    • Count the number of lines containing doubleliu3 in the path
      • Use cut+awk or directly use awk
    • View the currently running processes related to hz
      • ps -ef: Similar to Task Manager in Windows
      • By comparison, ps aux shows more detail; it follows the older BSD-style UNIX option syntax
      • [PS] In ps -ef, PPID is the parent process ID
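The two screenshotted examples above can be reconstructed roughly as follows (doubleliu3 and hz are the names from the lecture screenshots; substitute your own):

```shell
# count PATH entries containing a username
echo "$PATH" | tr ":" "\n" | grep -c "doubleliu3"

# view running processes related to "hz"; the second grep removes
# the grep process itself from the listing
ps -ef | grep "hz" | grep -v grep
```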

sort#

Similar to Excel's sort: by default it sorts whole lines starting from the first character [strict byte/ASCII order only under the C locale; other locales apply their own collation rules]

  • -t Separator, default is TAB
  • -k Sort by which field
  • -n Numeric sort
  • -r Reverse sort
  • -u uniq [but inconvenient for counting]
  • -f Fold case when comparing [treat lowercase as uppercase]; -V Version sort [compares embedded numbers numerically]

【Example】Sort user information by uid in pure numeric order

cat /etc/passwd | sort -t : -k 3 -n

【Tested】

  • Default behavior observed while sorting
  • Skips special characters such as underscore "_", "*", "/" [locale collation ignores most punctuation]
  • Also effectively ignores case, placing lowercase before uppercase [-V, or LC_ALL=C, gives strict byte order instead]

wc#

  • -l Line count
  • -w Word count
  • -m Character count; -c Byte count

【Example】

① Total number of recent logins to the system

last | grep -v "^$" | grep -v "begins" | wc -l

② PATH statistics: character count, word count, number of entries, and the last entry
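A sketch of the screenshotted commands (output depends on your PATH):

```shell
echo "$PATH" | wc -m                  # character count (includes the newline)
echo "$PATH" | wc -w                  # word count
echo "$PATH" | tr ":" "\n" | wc -l    # number of PATH entries
echo "$PATH" | tr ":" "\n" | tail -1  # the last entry
```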

uniq#

Only consecutive duplicates count as duplicates, generally used with sort

  • -i Ignore case
  • -c Count

【Example】Count the number of recent user logins, sorted from largest to smallest

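The screenshotted pipeline can be reconstructed as below (a sketch; actual output depends on the system's login history):

```shell
# per-user login counts, largest first; sort must come before uniq -c
# so that duplicate usernames sit on adjacent lines
last | grep -v "^$" | grep -v "begins" | cut -d " " -f 1 | sort | uniq -c | sort -n -r
```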

tee#

Displays in the terminal and writes to a file

  • Defaults to overwrite the original file
  • -a Append without overwriting

【Example】

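A minimal demonstration (the /tmp file name is illustrative):

```shell
# write to the terminal and to a file at the same time
echo "hello" | tee /tmp/tee_demo.txt      # prints hello, (over)writes the file
echo "world" | tee -a /tmp/tee_demo.txt   # -a appends instead of overwriting
cat /tmp/tee_demo.txt                     # hello, then world
```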

split#

Splits a file, suitable for handling large files

  • -l num Split by num lines
  • -b size Split by size, 512 [default byte], 512k, 512m
    • May cut a line in half at chunk boundaries, so prefer the option below👇
    • ❗ -C size Split into chunks of at most size, without breaking lines!
  • -n num Split evenly into num parts

【Example】Split the file list in /etc into a file every 20 lines

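A sketch of the exercise (the prefix etc_part_ is my own choice; split's default names are xaa, xab, ...):

```shell
cd /tmp
# '-' tells split to read from stdin; etc_part_ is the output prefix
ls /etc | split -l 20 - etc_part_
ls etc_part_*
```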

xargs#

Parameter substitution: builds command lines from standard input, for commands that do not read stdin themselves [an alternative to command substitution]

  • -exxx Stop reading input once the string "xxx" is seen [newer GNU xargs spells this -E xxx]


  • -p Ask before executing the entire command
  • -n num Specify the number of parameters received each time⭐, rather than passing all at once
    • Suitable for commands that can only read one parameter [like id] or related scenarios
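The screenshotted examples amount to the following sketch (head -3 just keeps the demo short):

```shell
# -n 1 passes one argument per invocation: three separate echo calls here
echo "a b c" | xargs -n 1 echo

# id accepts a single username, so feed names to it one at a time
cut -d : -f 1 /etc/passwd | head -3 | xargs -n 1 id
```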

tr#

Character replacement, compression, deletion for standard input, see tldr tr

tr [options] <char/charset1> <char/charset2>

  • Defaults to replace charset1👉charset2, one-to-one correspondence
    • Characters in charset1 that exceed charset2 are replaced by the last character of charset2
  • -c Replace all characters not belonging to charset1👉charset2
  • -s Squeeze: runs of a repeated character from charset1👉one copy 【❗When combined with -c, the squeeze set is charset2】
  • -d Delete all characters belonging to charset1
  • -t First delete characters in charset1 that exceed charset2, then replace one-to-one correspondence【❗ Note the difference from the default method】

【Example】

① Simple usage


  • -s also requires parameters, when used with -c defaults to using charset2
  • -t only cares about the characters that are one-to-one corresponding, ignoring the characters in charset1 that exceed charset2
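The simple-usage screenshot boils down to cases like these (GNU tr behavior):

```shell
echo "hello"  | tr "l" "L"           # → heLLo   (one-to-one replacement)
echo "aaabbb" | tr -s "a"            # → abbb    (squeeze repeats)
echo "ab123"  | tr -d "0-9"          # → ab      (delete digits)
echo "abc"    | tr "abc" "xy"        # → xyy     (excess chars map to last of set2)
echo "abc"    | tr -t "abc" "xy"     # → xyc     (-t truncates set1 first)
```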

② Word frequency statistics


  • tr replace👉sort👉remove duplicates count👉sort👉display top few
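The pipeline described by that bullet, on a tiny inline sample:

```shell
# split words onto their own lines, group duplicates, count, rank, show top few
echo "nihao hello hello nihao nihao" \
  | tr -s " " "\n" | sort | uniq -c | sort -n -r | head -3
```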

Soft and Hard Links#

Hard links save storage by reusing the original file's inode; a soft link costs an extra inode and block of its own

【Background Introduction】

  • ext4 file system——three components——inode, block, superblock
  • inode: File node
    • One inode corresponds to one file
    • Stores file information [file permissions, owner, time...] and the real location of the file [blocks location]
    • If the inode cannot hold all block locations directly, it points through multi-level [indirect] blocks
  • block: The real storage location of the file, one block is generally 4096 bytes
  • superblock: At least two, storing overall information of the file system [like inode, block...]

【Hard Link】 Equivalent to an alias

  • Shares the original file's inode, so no new inode or block is consumed
  • Cannot span file systems, and directories cannot be hard-linked

【Soft Link】 Equivalent to a shortcut

  • Creates a new file, has its own inode, points to a block [storing the real location of the file]
  • File type is link
  • [PS] More commonly used than hard links
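A minimal sketch showing the inode difference (file names under /tmp are illustrative):

```shell
echo "data" > /tmp/orig.txt
ln /tmp/orig.txt /tmp/hard.txt        # hard link: same inode, link count becomes 2
ln -s /tmp/orig.txt /tmp/soft.txt     # soft link: new inode, file type 'l'
ls -li /tmp/orig.txt /tmp/hard.txt /tmp/soft.txt
```

In the ls -li output, the hard link shows the same inode number as the original, while the soft link gets its own.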

Linear Sieve#

  • Arrays in shell need no initialization; they start out empty

    • In fact every shell variable starts out empty and untyped
  • Note the empty-check idiom 【[ "${x}"x == x ]】: the appended x keeps the test valid even when the variable is unset

  • Inside arithmetic contexts, array indices can use bare variable names without ${}
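A minimal bash sketch of the sieve shown in the screenshot (n and the variable names are my own):

```shell
#!/bin/bash
# Linear (Euler) sieve: every composite is marked exactly once,
# by its smallest prime factor.
n=100
primes=()
for ((i = 2; i <= n; i++)); do
    # the lecture's empty-check idiom: unset array slots compare equal to ""
    if [ "${is_comp[i]}"x == x ]; then
        primes+=($i)
    fi
    for p in "${primes[@]}"; do
        ((i * p > n)) && break
        is_comp[i*p]=1            # subscripts are evaluated arithmetically, no ${} needed
        ((i % p == 0)) && break   # p is i's smallest prime factor: stop here
    done
done
echo "${primes[@]}"
```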

【Comparison with Prime Sieve Effect】

Find the sum of primes from 2 to 20000

  • In shell the linear sieve performs worse than the plain [Eratosthenes] sieve; likely reasons:

    • Shell incurs many system calls, so counting arithmetic operations alone does not predict runtime
    • Running the script saturates the CPU, so results are not comparable to C
    • The modulus operation in the inner loop is comparatively expensive

➕SED Script Editing#

sed is a stream editor, mainly used for non-interactive text editing inside scripts; its substitution syntax resembles vim's

【Common Operations】 Replace [batch, by line, match], delete

# Replace the first match of the pattern on each line [regex supported]
sed 's/{{regex}}/{{replace}}/' {{file}}
# On lines matching line_pattern, replace the first match
sed '/{{line_pattern}}/s/{{regex}}/{{replace}}/' {{file}}
# Delete matching lines
sed '/{{line_pattern}}/d' {{file}}
# Other delimiters such as '#' avoid escaping when the pattern contains '/'
sed 's#{{regex}}#{{replace}}#' {{file}}
# Delete all lines between two matching patterns [inclusive]
sed '/{{regex}}/,/{{regex}}/d' {{file}}
# Delete from the matching line to the end of file / everything outside that range
sed '/{{regex}}/,$d' {{file}}
sed '/{{regex}}/,$!d' {{file}}
  • sed -i Write modifications to the file
  • sed '.../g' Adding g at the end can perform globally
  • sed '.../d' Adding d at the end is a delete operation
  • sed '/…/,/…/' A comma joins two patterns into an address range [matches a whole block]

【Command Demonstration】

① The above common commands operate in sequence


② Used to replace certain configurations in configuration files

  • Delete 👉 Add

  • Backup before deletion

  • When adding, include identifier #newadd for easier subsequent sed operations

  • Directly delete and then add, rather than replacing directly, to avoid cumbersome pattern matching [parameter values]
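The delete-then-add workflow described above can be sketched like this (the config file and Port values are illustrative, not from the lecture):

```shell
conf=/tmp/demo.conf
printf 'Port 22\nUseDNS yes\n' > "$conf"

cp "$conf" "$conf.bak"                  # back up before deleting
sed -i '/^Port /d' "$conf"              # delete the old setting
echo "Port 2222 #newadd" >> "$conf"     # re-add, tagged for easy future sed matching
grep "^Port" "$conf"
```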

In-Class Exercises#

  1. Find the sum of all numbers in the string: "1 2 3 4 5 6 7 9 a v 你好 . /8"
  2. Convert all uppercase letters in the file to lowercase: echo "ABCefg" >> test.log
  3. Find the last path in the PATH variable
  4. Use the last command to output all reboot information
  5. Sort the contents of /etc/passwd by username
  6. Sort the contents of /etc/passwd by uid
  7. Find the total number of system login users in the cloud host over the past 2 months
  8. Sort all usernames logged into the cloud host over the past 2 months by frequency and output the count
  9. Save every ten file and directory names in the local /etc directory into a file
  10. Output the uid, gid, and groups of the 10th to 20th users stored in /etc/passwd
  11. View users in /etc/passwd by username, ending when reading the 'sync' user, and output the user's uid, gid, and groups
  12. Word frequency statistics

① Use the following command to generate a text file test.txt

cat >> test.txt << xxx
nihao hello hello 你好 
nihao
hello 
ls
cd
world
pwd
xxx

② Count the word frequency in a.txt and output in descending order.
Answer

【1】
  [Clever solution, needs to be run in bash]
    echo "1 2 3 4 5 6 7 9 a v 你好 . /8" | tr -s -c 0-9 "\n" | echo $[`tr "\n" "+"`0]
  [for loop]
    sum=0
    for i in `echo "1 2 3 4 5 6 7 9 a v 你好 . /8" | tr -s -c 0-9 "\n"`; do
      sum=$[$sum+$i]
    done
    echo $sum
  [awk solution 1]
    echo "1 2 3 4 5 6 7 9 a v 你好 . /8" | tr -s -c 0-9 "\n" | awk -v sum=0 '{sum += $1} END { print sum }'
  [awk solution 2]
    echo "1 2 3 4 5 6 7 9 a v 你好 . /8" | tr -s -c 0-9 " " | awk -v sum=0 '{for (i = 1; i <= NF; i++) {sum += $i} } END{print sum}'
【2】 tr A-Z a-z < test.log > test2.log    # redirecting back into test.log would truncate it before it is read, so write to a new file
【3】 echo ${PATH} | tr ":" "\n" | tail -1
【4】 last | grep "reboot"
【5】 cat /etc/passwd | sort
【6】 cat /etc/passwd | sort -t : -k 3 -n
【7】 last -f /var/log/wtmp.1 -f /var/log/wtmp | grep -v "^$" | grep -v "begins" | grep -v "reboot" | grep -v "shutdown" | wc -l
【8】 last -f /var/log/wtmp.1 -f /var/log/wtmp | grep -v "^$" | grep -v "begins" | grep -v "reboot" | grep -v "shutdown" | cut -d " " -f 1 | sort | uniq -c | sort -n -r
【9】 ls /etc | split -l 10
【10】 cat /etc/passwd | head -20 | tail -10 | cut -d : -f 1 | xargs -n 1 id
【11】 cat /etc/passwd | cut -d : -f 1 | xargs -E "sync" -n 1 id    # -E sets the end-of-input string [legacy spelling: -esync]
【12】
  cat test.txt | tr -s " " "\n" | sort | uniq -c | sort -n -r
  [If you want the word to be first and the count to be after, use awk to reverse]
    cat test.txt | tr -s " " "\n" | sort | uniq -c | sort -n -r | awk '{print $2, $1}'

Tips#

  • Commonly tested in interviews: Word frequency statistics
    • tr replace👉sort👉remove duplicates count👉sort👉display top few
