Bo2SS

Bo2SS

4 Sorting and Searching

Course Content#

Sorting#

  • Four Quadrant Principle: Stable/Unstable, Internal/External
    • Stable: The relative position of identical elements remains unchanged before and after sorting
    • Internal: Requires loading all data into memory for sorting

[Consider sorting from smallest to largest]

Stable Sorting#

  • Insertion (Internal)
    • Front: Sorted area; Back: Unsorted area
    • The numbers at the back find their position in the front to insert
      • Compare and swap one by one to achieve the effect of [insertion]
      • Not directly swapping with the insertion position, as this would disrupt the order of the sorted area
    • Average time complexity: O(N^2)
  • Bubble (Internal)
    • Front: Unsorted area; Back: Sorted area
    • Move from the first element backward, find the maximum element and place it at the front of the sorted area and the end of the unsorted area
    • Perform N - 1 rounds of bubbling, adding one element to the sorted area each round
      • [Optimization] If a round of bubbling does not perform any swap operations, it can be terminated directly
    • Average time complexity: O(N^2)
      • Best time complexity: O(N) ← already sorted
      • Worst time complexity: O(N^2) ← completely reversed
    • Insertion sort can be understood as reverse bubbling
  • Merge (External)
    • Introduction: Merging two sorted arrays (of lengths m and n) into one sorted array
      • Time complexity: O(m + n)
    • First divide and conquer until pairs can be compared, then merge the sorted arrays in pairs
    • Time complexity (best, worst, average): O(NlogN)
      • A total of logN layers, with a merging time of O(N) per layer
    • [External Sorting] 2-way merge (above), k-way merge (multi-way merge, using heaps)
      • Refer to External Merge Sort - Wikipedia, the sorting method for each route can be arbitrary, ultimately merging
      • Whether this is external sorting depends on whether you want or need it, but it can be done

Unstable Sorting#

  • Selection (Internal)
    • Front: Sorted area; Back: Unsorted area
    • Each round selects the smallest element from the unsorted area and swaps it with the first element of the unsorted area
      • There may be cases of swapping itself with itself, so the XOR function cannot be used for swapping
    • Unstable: For example, 22'1, after the first swap, 2 moves behind 2', resulting in 12'2
    • Time complexity (best, worst, average): O(N^2)
    • Generally better than bubble sort, the number of comparisons is similar, but bubble sort has too many swaps
  • Quick Sort (Internal)
    • [Pivot, partition]
      • Take the first element as the pivot
      • [Tail pointer] first looks for elements less than the pivot from the back to place in the position of the first element (which is empty), [Head pointer] then looks for values greater than the pivot from the front to place in the newly emptied position, looping
      • Finally, the head and tail pointers coincide, pointing to an empty position, where the pivot is placed
      • Then perform the above operation on the left and right parts of the pivot
    • Time complexity: O(NlogN)
      • When the first element is chosen as the pivot in a reverse order sequence, it degenerates into selection sort, O(N^2)
    • [Optimization]
      • Randomly select the pivot
      • Reduce the use of recursion, use loops instead
      • Swap only after finding a value to swap on both sides

Searching#

  • Naive Binary Search
    • Premise: Monotonic sequence
    • If the value to be searched exists multiple times in the sequence, binary search cannot distinguish which one is found
  • Special Binary Search
    • 11110000
      • Introduce a virtual head pointer with an index of -1
      • Mid calculation needs to add 1 to avoid infinite loops, such as in the case of 10
    • 00001111
      • Introduce a virtual tail pointer with an index of n [array range from 0 to n - 1]
    • Refer to “Interview and Written Test Algorithms” - 2. Binary Search Topic
    • Here, an additional concept of virtual pointer is introduced, which reflects the situation when a value cannot be found by changing the search boundaries [-1 ~ n -1 or 0 ~ n]
  • Ternary Search
    • Image
    • Solves the problem of [finding the extreme points of concave and convex functions]
    • Divides the scale of the original problem into one-third
    • Update strategy: Retain the smaller value element and the interval of the other end; stopping condition: |m1 - m2| < ESPL
    • Time complexity: O(log[3]N), slightly slower than binary search O(log[2]N)

Hash Table#

  • A [data structure] used for searching
  • Structure definition
    • Requires a contiguous space to store elements
    • Consistent with sequential lists, element types can vary
  • Structure operations
    • Hash function: Can map any type of element to an integer value [array index]
      • Then a certain value can be stored at the corresponding array index
      • Array indices can only be integers
    • Conflict handling methods
      • Open addressing [commonly used]: Look for empty positions ahead, such as quadratic probing... but can easily lead to data clustering issues
      • Rehashing: Also known as hashing method, using multiple hash functions
      • Chaining method [commonly used]: Use a linked list structure to store elements at each position
      • Establish a common overflow area: Place conflicting elements in a specific area
  • Time complexity for searching: Approaches O(1)
    • Other time-consuming factors: Hash function conversion, chaining method [searching linked list elements] or other conflict handling methods
    • Only arrays and sequential lists are O(1)
  • The space utilization rate of hash tables is generally 50-90%
    • Hash functions will always have conflicts, so space needs to be reserved when opening
    • When the utilization rate reaches 70%, it can be used industrially; fewer conflicts lead to higher utilization, and better hash functions

Code Demonstration#

Stable Sorting#

  • Image
  • Image
  • Image
  • See comments and red box markings for details

  • To ensure the stability of stable sorting, pay attention to the case of ==, do not swap [or ensure the original order]!

  • [Note] When calling the TEST macro, the latter array is num, which is an array obtained by copying arr in the macro definition!

Unstable Sorting#

V1.0 Version: Selection Sort & Ordinary Quick Sort#

  • Image
  • Image
  • Focus on the quick sort algorithm, which has many optimization points

    • [Self-optimization] The conditions num[] > z or num[] < z on lines 46 and 48 can add == conditions, so that when encountering equal elements, there will be no swap, which is unnecessary and can help ensure algorithm stability [although quick sort is unstable]
    • [Points to optimize] Pivot selection, recursion → loop, swapping method

V2.0 Version: Optimized Quick Sort#

  • Image
  • Implements the three points mentioned above for optimization

  • Pay attention to several boundary ==: x <= y, num[] < z, num[] > z

    • ❓ How to understand the judgment of ==? See the following thinking point 2
  • Lines 39 and 40: Sort the right interval, and after sorting, reduce the right boundary

Speed Comparison of Two Versions of Quick Sort#

  • Test both random sequences and reverse sequences
    • Random sequence of size 200,000: The difference between the two is minimal
    • Reverse sequence of size 100,000: The difference is significant
      • Image
      • For a reverse sequence of 200,000, the original version of quick sort will cause a stack overflow

Establishing a Hash Table for Strings#

  • Hash function: BKDR-Hash, conflict handling method: Chaining method

  • image-20201205171805733
  • Image
  • The head insertion method is used when inserting values

  • The utilization rate of the hash table structure cannot reach 100%, so the allocated space needs to be doubled

  • Using calloc to allocate hash table space can ensure that the head address of each position's linked list is set to null, ensuring safety

  • Hash functions can be designed by oneself

Thinking Points#

  • [My understanding] To ensure the stability of stable sorting, pay attention to the case of ==, do not swap [or ensure the original order]!
  • ❓ Thoughts on the boundaries in the optimized version of quick sort
    • Image
    • Similarly, according to the ordinary version of quick sort to optimize boundary judgments, the following optimization ideas have problems, while the ordinary version can be implemented
      • Lines 32 - 33: Changing num[] < z and num[] > z to <= and >= will cause [segmentation fault or infinite loop]
        • For num[] <= z [infinite loop]
          • Example: 2 1, with the pivot value of 2, the left pointer x keeps moving to the right end, going out of bounds, while the right pointer y remains unchanged, line 27 while (l < r) holds, x returns to l, resulting in an infinite loop
        • For num[] >= z [segmentation fault - stack overflow]
          • Example: 1 2, with the pivot value of 1, the right pointer y keeps moving to the left end, becoming -1, while the left pointer x and r remain unchanged, line 39 quick_sort(num, x, r) keeps recursively calling, eventually causing a stack overflow
        • Analysis of the reason: The ordinary version takes the pivot value out, while the optimized version keeps the pivot value inside, which needs to be considered
      • Lines 32 - 34: Changing x <= y to x < y will also cause [segmentation fault]
        • Example: 2 3, with the pivot value of 2, at this point x and y point to 2 and 3 respectively, after looping, both x and y point to 2, while x and r remain unchanged, line 39 quick_sort(num, x, r) keeps recursively calling, eventually causing a stack overflow
        • Analysis of the reason: x and y need to move apart, otherwise it resembles the special binary search 111000 not +1 situation
      • Summary: Both x and y need to move; if x does not move, it leads to stack overflow, and if y does not move, it leads to infinite loop
      • [Personal understanding, welcome to discuss]
  • Ternary Search - Practice Problem
    • Image
    • First determine the boundaries: INT32_MIN ~ INT32_MAX
    • Then follow the steps of ternary search
    • Boundary considerations
      • Estimate based on the formula - b / 2a, but it seems a bit suspicious
      • Determine the range based on the two solutions of the quadratic function = 0, but it complicates things
      • The efficiency of ternary search is at the logN level, and the size of the boundary range does not significantly affect it

Tips#


Course Notes#

  • Preview heap and priority queue, as well as the content of disjoint sets in advance
Loading...
Ownership of this post data is guaranteed by blockchain and smart contracts to the creator alone.