| /*! |
| An implementation of the |
| [Aho-Corasick string search algorithm](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm). |
| |
| The Aho-Corasick algorithm is principally useful when you need to search many |
| large texts for a fixed (possibly large) set of keywords. In particular, the |
| Aho-Corasick algorithm preprocesses the set of keywords by constructing a |
| finite state machine. The search phase is then a quick linear scan through the |
| text. Each character in the search text causes a state transition in the |
| automaton. Matches are reported when the automaton enters a match state. |
| |
| # Examples |
| |
| The main type exposed by this crate is `AcAutomaton`, which can be constructed |
| from an iterator of pattern strings: |
| |
| ```rust |
| use aho_corasick::{Automaton, AcAutomaton}; |
| |
| let aut = AcAutomaton::new(vec!["apple", "maple"]); |
| |
| // AcAutomaton also implements `FromIterator`: |
| let aut: AcAutomaton<&str> = ["apple", "maple"].iter().cloned().collect(); |
| ``` |
| |
| Finding matches can be done with `find`: |
| |
| ```rust |
| use aho_corasick::{Automaton, AcAutomaton, Match}; |
| |
| let aut = AcAutomaton::new(vec!["apple", "maple"]); |
| let mut it = aut.find("I like maple apples."); |
| assert_eq!(it.next(), Some(Match { |
| pati: 1, |
| start: 7, |
| end: 12, |
| })); |
| assert_eq!(it.next(), Some(Match { |
| pati: 0, |
| start: 13, |
| end: 18, |
| })); |
| assert_eq!(it.next(), None); |
| ``` |
| |
| Use `find_overlapping` if you want to report all matches, even if they |
| overlap with each other. |
| |
| ```rust |
| use aho_corasick::{Automaton, AcAutomaton, Match}; |
| |
| let aut = AcAutomaton::new(vec!["abc", "a"]); |
| let matches: Vec<_> = aut.find_overlapping("abc").collect(); |
| assert_eq!(matches, vec![ |
| Match { pati: 1, start: 0, end: 1}, Match { pati: 0, start: 0, end: 3 }, |
| ]); |
| |
| // Regular `find` will report only one match: |
| let matches: Vec<_> = aut.find("abc").collect(); |
| assert_eq!(matches, vec![Match { pati: 1, start: 0, end: 1}]); |
| ``` |
| |
| Finally, there are also methods for finding matches on *streams*. Namely, the |
| search text does not have to live in memory. It's useful to run this on files |
| that can't fit into memory: |
| |
| ```no_run |
| use std::fs::File; |
| |
| use aho_corasick::{Automaton, AcAutomaton}; |
| |
| let aut = AcAutomaton::new(vec!["foo", "bar", "baz"]); |
| let rdr = File::open("search.txt").unwrap(); |
| for m in aut.stream_find(rdr) { |
| let m = m.unwrap(); // could be an IO error |
| println!("Pattern '{}' matched at: ({}, {})", |
| aut.pattern(m.pati), m.start, m.end); |
| } |
| ``` |
| |
| There is also `stream_find_overlapping`, which is just like `find_overlapping`, |
| but it operates on streams. |
| |
| Please see `dict-search.rs` in this crate's `examples` directory for a more |
| complete example. It creates a large automaton from a dictionary and can do a |
| streaming match over arbitrarily large data. |
| |
| # Memory usage |
| |
| A key aspect of an Aho-Corasick implementation is how the state transitions |
| are represented. The easiest way to make the automaton fast is to store a |
| sparse 256-slot map in each state. It maps an input byte to a state index. |
| This makes the matching loop extremely fast, since it translates to a simple |
| pointer read. |
| |
| The problem is that as the automaton accumulates more states, you end up paying |
| a `256 * 4` (`4` is for the `u32` state index) byte penalty for every state |
| regardless of how many transitions it has. |
| |
| To solve this, only states near the root of the automaton have this sparse |
| map representation. States near the leaves of the automaton use a dense mapping |
| that requires a linear scan. |
| |
| (The specific limit currently set is `3`, so that states with a depth less than |
| or equal to `3` are less memory efficient. The result is that the memory usage |
| of the automaton stops growing rapidly past ~60MB, even for automatons with |
| thousands of patterns.) |
| |
| If you'd like to opt for the less-memory-efficient-but-faster version, then |
| you can construct an `AcAutomaton` with a `Sparse` transition strategy: |
| |
| ```rust |
| use aho_corasick::{Automaton, AcAutomaton, Match, Sparse}; |
| |
| let aut = AcAutomaton::<&str, Sparse>::with_transitions(vec!["abc", "a"]); |
| let matches: Vec<_> = aut.find("abc").collect(); |
| assert_eq!(matches, vec![Match { pati: 1, start: 0, end: 1}]); |
| ``` |
| */ |
| |
| #![deny(missing_docs)] |
| |
| extern crate memchr; |
| #[cfg(test)] |
| extern crate quickcheck; |
| #[cfg(test)] |
| extern crate rand; |
| |
| use std::collections::VecDeque; |
| use std::fmt; |
| use std::iter::FromIterator; |
| use std::mem; |
| |
| pub use self::autiter::{ |
| Automaton, Match, |
| Matches, MatchesOverlapping, StreamMatches, StreamMatchesOverlapping, |
| }; |
| pub use self::full::FullAcAutomaton; |
| |
| // We're specifying paths explicitly so that we can use |
| // these modules simultaneously from `main.rs`. |
| // Should probably make just make `main.rs` a separate crate. |
| #[path = "autiter.rs"] |
| mod autiter; |
| #[path = "full.rs"] |
| mod full; |
| |
| /// The integer type used for the state index. |
| /// |
| /// Limiting this to 32 bit integers can have a big impact on memory usage |
| /// when using the `Sparse` transition representation. |
| pub type StateIdx = u32; |
| |
| // Constants for special state indexes. |
| const FAIL_STATE: u32 = 0; |
| const ROOT_STATE: u32 = 1; |
| |
| // Limit the depth at which we use a sparse alphabet map. Once the limit is |
| // reached, a dense set is used (and lookup becomes O(n)). |
| // |
| // This does have a performance hit, but the (straight forward) alternative |
| // is to have a `256 * 4` byte overhead for every state. |
| // Given that Aho-Corasick is typically used for dictionary searching, this |
| // can lead to dramatic memory bloat. |
| // |
| // This limit should only be increased at your peril. Namely, in the worst |
| // case, `256^DENSE_DEPTH_THRESHOLD * 4` corresponds to the memory usage in |
| // bytes. A value of `1` gives us a good balance. This is also a happy point |
| // in the benchmarks. A value of `0` gives considerably worse times on certain |
| // benchmarks (e.g., `ac_ten_one_prefix_byte_every_match`) than even a value |
| // of `1`. A value of `2` is slightly better than `1` and it looks like gains |
| // level off at that point with not much observable difference when set to |
| // `3`. |
| // |
| // Why not make this user configurable? Well, it doesn't make much sense |
| // because we pay for it with case analysis in the matching loop. Increasing it |
| // doesn't have much impact on performance (outside of pathological cases?). |
| // |
| // N.B. Someone else seems to have discovered an alternative, but I haven't |
| // grokked it yet: https://github.com/mischasan/aho-corasick |
| const DENSE_DEPTH_THRESHOLD: u32 = 1; |
| |
| /// An Aho-Corasick finite automaton. |
| /// |
| /// The type parameter `P` is the type of the pattern that was used to |
| /// construct this AcAutomaton. |
| #[derive(Clone)] |
| pub struct AcAutomaton<P, T=Dense> { |
| pats: Vec<P>, |
| states: Vec<State<T>>, |
| start_bytes: Vec<u8>, |
| } |
| |
| #[derive(Clone)] |
| struct State<T> { |
| out: Vec<usize>, |
| fail: StateIdx, |
| goto: T, |
| depth: u32, |
| } |
| |
| impl<P: AsRef<[u8]>> AcAutomaton<P> { |
| /// Create a new automaton from an iterator of patterns. |
| /// |
| /// The patterns must be convertible to bytes (`&[u8]`) via the `AsRef` |
| /// trait. |
| pub fn new<I>(pats: I) -> AcAutomaton<P, Dense> |
| where I: IntoIterator<Item=P> { |
| AcAutomaton::with_transitions(pats) |
| } |
| } |
| |
| impl<P: AsRef<[u8]>, T: Transitions> AcAutomaton<P, T> { |
| /// Create a new automaton from an iterator of patterns. |
| /// |
| /// This constructor allows one to choose the transition representation. |
| /// |
| /// The patterns must be convertible to bytes (`&[u8]`) via the `AsRef` |
| /// trait. |
| pub fn with_transitions<I>(pats: I) -> AcAutomaton<P, T> |
| where I: IntoIterator<Item=P> { |
| AcAutomaton { |
| pats: vec![], // filled in later, avoid wrath of borrow checker |
| states: vec![State::new(0), State::new(0)], // empty and root |
| start_bytes: vec![], // also filled in later |
| }.build(pats.into_iter().collect()) |
| } |
| |
| /// Build out the entire automaton into a single matrix. |
| /// |
| /// This will make searching as fast as possible at the expense of using |
| /// at least `4 * 256 * #states` bytes of memory. |
| pub fn into_full(self) -> FullAcAutomaton<P> { |
| FullAcAutomaton::new(self) |
| } |
| |
| #[doc(hidden)] |
| pub fn num_states(&self) -> usize { |
| self.states.len() |
| } |
| |
| #[doc(hidden)] |
| pub fn heap_bytes(&self) -> usize { |
| self.pats.iter() |
| .map(|p| mem::size_of::<P>() + p.as_ref().len()) |
| .sum::<usize>() |
| + self.states.iter() |
| .map(|s| mem::size_of::<State<T>>() + s.heap_bytes()) |
| .sum::<usize>() |
| + self.start_bytes.len() |
| } |
| |
| // The states of `full_automaton` should be set for all states < si |
| fn memoized_next_state( |
| &self, |
| full_automaton: &FullAcAutomaton<P>, |
| current_si: StateIdx, |
| mut si: StateIdx, |
| b: u8, |
| ) -> StateIdx { |
| loop { |
| if si < current_si { |
| return full_automaton.next_state(si, b); |
| } |
| let state = &self.states[si as usize]; |
| let maybe_si = state.goto(b); |
| if maybe_si != FAIL_STATE { |
| return maybe_si; |
| } else { |
| si = state.fail; |
| } |
| } |
| } |
| } |
| |
| impl<P: AsRef<[u8]>, T: Transitions> Automaton<P> for AcAutomaton<P, T> { |
| #[inline] |
| fn next_state(&self, mut si: StateIdx, b: u8) -> StateIdx { |
| loop { |
| let state = &self.states[si as usize]; |
| let maybe_si = state.goto(b); |
| if maybe_si != FAIL_STATE { |
| si = maybe_si; |
| break; |
| } else { |
| si = state.fail; |
| } |
| } |
| si |
| } |
| |
| #[inline] |
| fn get_match(&self, si: StateIdx, outi: usize, texti: usize) -> Match { |
| let pati = self.states[si as usize].out[outi]; |
| let patlen = self.pats[pati].as_ref().len(); |
| let start = texti + 1 - patlen; |
| Match { |
| pati: pati, |
| start: start, |
| end: start + patlen, |
| } |
| } |
| |
| #[inline] |
| fn has_match(&self, si: StateIdx, outi: usize) -> bool { |
| outi < self.states[si as usize].out.len() |
| } |
| |
| #[inline] |
| fn start_bytes(&self) -> &[u8] { |
| &self.start_bytes |
| } |
| |
| #[inline] |
| fn patterns(&self) -> &[P] { |
| &self.pats |
| } |
| |
| #[inline] |
| fn pattern(&self, i: usize) -> &P { |
| &self.pats[i] |
| } |
| } |
| |
| // `(0..256).map(|b| b as u8)` optimizes poorly in debug builds so |
| // we use this small explicit iterator instead |
| struct AllBytesIter(i32); |
| impl Iterator for AllBytesIter { |
| type Item = u8; |
| #[inline] |
| fn next(&mut self) -> Option<Self::Item> { |
| if self.0 < 256 { |
| let b = self.0 as u8; |
| self.0 += 1; |
| Some(b) |
| } else { |
| None |
| } |
| } |
| } |
| |
| impl AllBytesIter { |
| fn new() -> AllBytesIter { |
| AllBytesIter(0) |
| } |
| } |
| |
| // Below contains code for *building* the automaton. It's a reasonably faithful |
| // translation of the description/psuedo-code from: |
| // http://www.cs.uku.fi/~kilpelai/BSA05/lectures/slides04.pdf |
| |
| impl<P: AsRef<[u8]>, T: Transitions> AcAutomaton<P, T> { |
| // This is the first phase and builds the initial keyword tree. |
| fn build(mut self, pats: Vec<P>) -> AcAutomaton<P, T> { |
| for (pati, pat) in pats.iter().enumerate() { |
| if pat.as_ref().is_empty() { |
| continue; |
| } |
| let mut previ = ROOT_STATE; |
| for &b in pat.as_ref() { |
| if self.states[previ as usize].goto(b) != FAIL_STATE { |
| previ = self.states[previ as usize].goto(b); |
| } else { |
| let depth = self.states[previ as usize].depth + 1; |
| let nexti = self.add_state(State::new(depth)); |
| self.states[previ as usize].set_goto(b, nexti); |
| previ = nexti; |
| } |
| } |
| self.states[previ as usize].out.push(pati); |
| } |
| { |
| let root_state = &mut self.states[ROOT_STATE as usize]; |
| for c in AllBytesIter::new() { |
| if root_state.goto(c) == FAIL_STATE { |
| root_state.set_goto(c, ROOT_STATE); |
| } else { |
| self.start_bytes.push(c); |
| } |
| } |
| } |
| // If any of the start bytes are non-ASCII, then remove them all, |
| // because we don't want to be calling memchr on non-ASCII bytes. |
| // (Well, we could, but it requires being more clever. Simply using |
| // the prefix byte isn't good enough.) |
| if self.start_bytes.iter().any(|&b| b > 0x7F) { |
| self.start_bytes.clear(); |
| } |
| self.pats = pats; |
| self.fill() |
| } |
| |
| // The second phase that fills in the back links. |
| fn fill(mut self) -> AcAutomaton<P, T> { |
| // Fill up the queue with all non-root transitions out of the root |
| // node. Then proceed by breadth first traversal. |
| let mut q = VecDeque::new(); |
| self.states[ROOT_STATE as usize].for_each_transition(|_, si| { |
| if si != ROOT_STATE { |
| q.push_front(si); |
| } |
| }); |
| |
| let mut transitions = Vec::new(); |
| |
| while let Some(si) = q.pop_back() { |
| self.states[si as usize].for_each_ok_transition(|c, u| { |
| transitions.push((c, u)); |
| q.push_front(u); |
| }); |
| |
| for (c, u) in transitions.drain(..) { |
| let mut v = self.states[si as usize].fail; |
| loop { |
| let state = &self.states[v as usize]; |
| if state.goto(c) == FAIL_STATE { |
| v = state.fail; |
| } else { |
| break; |
| } |
| } |
| let ufail = self.states[v as usize].goto(c); |
| self.states[u as usize].fail = ufail; |
| |
| fn get_two<T>(xs: &mut [T], i: usize, j: usize) -> (&mut T, &mut T) { |
| if i < j { |
| let (before, after) = xs.split_at_mut(j); |
| (&mut before[i], &mut after[0]) |
| } else { |
| let (before, after) = xs.split_at_mut(i); |
| (&mut after[0], &mut before[j]) |
| } |
| } |
| |
| let (ufail_out, out) = get_two(&mut self.states, ufail as usize, u as usize); |
| out.out.extend_from_slice(&ufail_out.out); |
| } |
| } |
| self |
| } |
| |
| fn add_state(&mut self, state: State<T>) -> StateIdx { |
| let i = self.states.len(); |
| self.states.push(state); |
| i as StateIdx |
| } |
| } |
| |
| impl<T: Transitions> State<T> { |
| fn new(depth: u32) -> State<T> { |
| State { |
| out: vec![], |
| fail: 1, |
| goto: Transitions::new(depth), |
| depth: depth, |
| } |
| } |
| |
| fn goto(&self, b: u8) -> StateIdx { |
| self.goto.goto(b) |
| } |
| |
| fn set_goto(&mut self, b: u8, si: StateIdx) { |
| self.goto.set_goto(b, si); |
| } |
| |
| fn heap_bytes(&self) -> usize { |
| (self.out.len() * usize_bytes()) |
| + self.goto.heap_bytes() |
| } |
| |
| fn for_each_transition<F>(&self, f: F) |
| where F: FnMut(u8, StateIdx) |
| { |
| self.goto.for_each_transition(f) |
| } |
| |
| fn for_each_ok_transition<F>(&self, f: F) |
| where F: FnMut(u8, StateIdx) |
| { |
| self.goto.for_each_ok_transition(f) |
| } |
| } |
| |
| /// An abstraction over state transition strategies. |
| /// |
| /// This is an attempt to let the caller choose the space/time trade offs |
| /// used for state transitions. |
| /// |
| /// (It's possible that this interface is merely good enough for just the two |
| /// implementations in this crate.) |
| pub trait Transitions { |
| /// Return a new state at the given depth. |
| fn new(depth: u32) -> Self; |
| /// Return the next state index given the next character. |
| fn goto(&self, alpha: u8) -> StateIdx; |
| /// Set the next state index for the character given. |
| fn set_goto(&mut self, alpha: u8, si: StateIdx); |
| /// The memory use in bytes (on the heap) of this set of transitions. |
| fn heap_bytes(&self) -> usize; |
| |
| /// Iterates over each state |
| fn for_each_transition<F>(&self, mut f: F) |
| where F: FnMut(u8, StateIdx) |
| { |
| for b in AllBytesIter::new() { |
| f(b, self.goto(b)); |
| } |
| } |
| |
| /// Iterates over each non-fail state |
| fn for_each_ok_transition<F>(&self, mut f: F) |
| where |
| F: FnMut(u8, StateIdx), |
| { |
| self.for_each_transition(|b, si| { |
| if si != FAIL_STATE { |
| f(b, si); |
| } |
| }); |
| } |
| } |
| |
| /// State transitions that can be stored either sparsely or densely. |
| /// |
| /// This uses less space but at the expense of slower matching. |
| #[derive(Clone, Debug)] |
| pub struct Dense(DenseChoice); |
| |
| #[derive(Clone, Debug)] |
| enum DenseChoice { |
| Sparse(Box<Sparse>), |
| Dense(Vec<(u8, StateIdx)>), |
| } |
| |
| impl Transitions for Dense { |
| fn new(depth: u32) -> Dense { |
| if depth <= DENSE_DEPTH_THRESHOLD { |
| Dense(DenseChoice::Sparse(Box::new(Sparse::new(depth)))) |
| } else { |
| Dense(DenseChoice::Dense(vec![])) |
| } |
| } |
| |
| fn goto(&self, b1: u8) -> StateIdx { |
| match self.0 { |
| DenseChoice::Sparse(ref m) => m.goto(b1), |
| DenseChoice::Dense(ref m) => { |
| for &(b2, si) in m { |
| if b1 == b2 { |
| return si; |
| } |
| } |
| FAIL_STATE |
| } |
| } |
| } |
| |
| fn set_goto(&mut self, b: u8, si: StateIdx) { |
| match self.0 { |
| DenseChoice::Sparse(ref mut m) => m.set_goto(b, si), |
| DenseChoice::Dense(ref mut m) => { |
| match m.binary_search_by_key(&b, |&(b, _)| b) { |
| Ok(i) => m[i] = (b, si), |
| Err(i) => m.insert(i, (b, si)), |
| } |
| } |
| } |
| } |
| |
| fn heap_bytes(&self) -> usize { |
| match self.0 { |
| DenseChoice::Sparse(_) => mem::size_of::<Sparse>(), |
| DenseChoice::Dense(ref m) => m.len() * (1 + 4), |
| } |
| } |
| |
| fn for_each_transition<F>(&self, mut f: F) |
| where F: FnMut(u8, StateIdx) |
| { |
| match self.0 { |
| DenseChoice::Sparse(ref m) => m.for_each_transition(f), |
| DenseChoice::Dense(ref m) => { |
| let mut iter = m.iter(); |
| let mut b = 0i32; |
| while let Some(&(next_b, next_si)) = iter.next() { |
| while (b as u8) < next_b { |
| f(b as u8, FAIL_STATE); |
| b += 1; |
| } |
| f(b as u8, next_si); |
| b += 1; |
| } |
| while b < 256 { |
| f(b as u8, FAIL_STATE); |
| b += 1; |
| } |
| } |
| } |
| } |
| fn for_each_ok_transition<F>(&self, mut f: F) |
| where |
| F: FnMut(u8, StateIdx), |
| { |
| match self.0 { |
| DenseChoice::Sparse(ref m) => m.for_each_ok_transition(f), |
| DenseChoice::Dense(ref m) => for &(b, si) in m { |
| f(b, si) |
| } |
| } |
| } |
| } |
| |
| /// State transitions that are always sparse. |
| /// |
| /// This can use enormous amounts of memory when there are many patterns, |
| /// but matching is very fast. |
| pub struct Sparse([StateIdx; 256]); |
| |
| impl Clone for Sparse { |
| fn clone(&self) -> Sparse { |
| Sparse(self.0) |
| } |
| } |
| |
| impl fmt::Debug for Sparse { |
| fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result { |
| f.debug_tuple("Sparse").field(&&self.0[..]).finish() |
| } |
| } |
| |
| impl Transitions for Sparse { |
| fn new(_: u32) -> Sparse { |
| Sparse([0; 256]) |
| } |
| |
| #[inline] |
| fn goto(&self, b: u8) -> StateIdx { |
| self.0[b as usize] |
| } |
| |
| fn set_goto(&mut self, b: u8, si: StateIdx) { |
| self.0[b as usize] = si; |
| } |
| |
| fn heap_bytes(&self) -> usize { |
| 0 |
| } |
| } |
| |
| impl<S: AsRef<[u8]>> FromIterator<S> for AcAutomaton<S> { |
| /// Create an automaton from an iterator of strings. |
| fn from_iter<T>(it: T) -> AcAutomaton<S> where T: IntoIterator<Item=S> { |
| AcAutomaton::new(it) |
| } |
| } |
| |
| // Provide some question debug impls for viewing automatons. |
| // The custom impls mostly exist for special showing of sparse maps. |
| |
| impl<P: AsRef<[u8]> + fmt::Debug, T: Transitions> |
| fmt::Debug for AcAutomaton<P, T> { |
| fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result { |
| use std::iter::repeat; |
| |
| try!(writeln!(f, "{}", repeat('-').take(79).collect::<String>())); |
| try!(writeln!(f, "Patterns: {:?}", self.pats)); |
| for (i, state) in self.states.iter().enumerate().skip(1) { |
| try!(writeln!(f, "{:3}: {}", i, state.debug(i == 1))); |
| } |
| write!(f, "{}", repeat('-').take(79).collect::<String>()) |
| } |
| } |
| |
| impl<T: Transitions> State<T> { |
| fn debug(&self, root: bool) -> String { |
| format!("State {{ depth: {:?}, out: {:?}, fail: {:?}, goto: {{{}}} }}", |
| self.depth, self.out, self.fail, self.goto_string(root)) |
| } |
| |
| fn goto_string(&self, root: bool) -> String { |
| let mut goto = vec![]; |
| for b in AllBytesIter::new() { |
| let si = self.goto(b); |
| if (!root && si == FAIL_STATE) || (root && si == ROOT_STATE) { |
| continue; |
| } |
| goto.push(format!("{} => {}", b as char, si)); |
| } |
| goto.join(", ") |
| } |
| } |
| |
| impl<T: Transitions> fmt::Debug for State<T> { |
| fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result { |
| write!(f, "{}", self.debug(false)) |
| } |
| } |
| |
| impl<T: Transitions> AcAutomaton<String, T> { |
| #[doc(hidden)] |
| pub fn dot(&self) -> String { |
| use std::fmt::Write; |
| let mut out = String::new(); |
| macro_rules! w { |
| ($w:expr, $($tt:tt)*) => { {write!($w, $($tt)*)}.unwrap() } |
| } |
| |
| w!(out, r#" |
| digraph automaton {{ |
| label=<<FONT POINT-SIZE="20">{}</FONT>>; |
| labelloc="l"; |
| labeljust="l"; |
| rankdir="LR"; |
| "#, self.pats.join(", ")); |
| for (i, s) in self.states.iter().enumerate().skip(1) { |
| let i = i as u32; |
| if s.out.is_empty() { |
| w!(out, " {};\n", i); |
| } else { |
| w!(out, " {} [peripheries=2];\n", i); |
| } |
| w!(out, " {} -> {} [style=dashed];\n", i, s.fail); |
| for b in AllBytesIter::new() { |
| let si = s.goto(b); |
| if si == FAIL_STATE || (i == ROOT_STATE && si == ROOT_STATE) { |
| continue; |
| } |
| w!(out, " {} -> {} [label={}];\n", i, si, b as char); |
| } |
| } |
| w!(out, "}}"); |
| out |
| } |
| } |
| |
| fn vec_bytes() -> usize { |
| usize_bytes() * 3 |
| } |
| |
| fn usize_bytes() -> usize { |
| let bits = usize::max_value().count_ones() as usize; |
| bits / 8 |
| } |
| |
| #[cfg(test)] |
| mod tests { |
| use std::collections::HashSet; |
| use std::io; |
| |
| use quickcheck::{Arbitrary, Gen, quickcheck}; |
| use rand::Rng; |
| |
| use super::{AcAutomaton, Automaton, Match, AllBytesIter}; |
| |
| fn aut_find<S>(xs: &[S], haystack: &str) -> Vec<Match> |
| where S: Clone + AsRef<[u8]> { |
| AcAutomaton::new(xs.to_vec()).find(&haystack).collect() |
| } |
| |
| fn aut_finds<S>(xs: &[S], haystack: &str) -> Vec<Match> |
| where S: Clone + AsRef<[u8]> { |
| let cur = io::Cursor::new(haystack.as_bytes()); |
| AcAutomaton::new(xs.to_vec()) |
| .stream_find(cur).map(|r| r.unwrap()).collect() |
| } |
| |
| fn aut_findf<S>(xs: &[S], haystack: &str) -> Vec<Match> |
| where S: Clone + AsRef<[u8]> { |
| AcAutomaton::new(xs.to_vec()).into_full().find(haystack).collect() |
| } |
| |
| fn aut_findfs<S>(xs: &[S], haystack: &str) -> Vec<Match> |
| where S: Clone + AsRef<[u8]> { |
| let cur = io::Cursor::new(haystack.as_bytes()); |
| AcAutomaton::new(xs.to_vec()) |
| .into_full() |
| .stream_find(cur).map(|r| r.unwrap()).collect() |
| } |
| |
| fn aut_findo<S>(xs: &[S], haystack: &str) -> Vec<Match> |
| where S: Clone + AsRef<[u8]> { |
| AcAutomaton::new(xs.to_vec()).find_overlapping(haystack).collect() |
| } |
| |
| fn aut_findos<S>(xs: &[S], haystack: &str) -> Vec<Match> |
| where S: Clone + AsRef<[u8]> { |
| let cur = io::Cursor::new(haystack.as_bytes()); |
| AcAutomaton::new(xs.to_vec()) |
| .stream_find_overlapping(cur).map(|r| r.unwrap()).collect() |
| } |
| |
| fn aut_findfo<S>(xs: &[S], haystack: &str) -> Vec<Match> |
| where S: Clone + AsRef<[u8]> { |
| AcAutomaton::new(xs.to_vec()) |
| .into_full().find_overlapping(haystack).collect() |
| } |
| |
| fn aut_findfos<S>(xs: &[S], haystack: &str) -> Vec<Match> |
| where S: Clone + AsRef<[u8]> { |
| let cur = io::Cursor::new(haystack.as_bytes()); |
| AcAutomaton::new(xs.to_vec()) |
| .into_full() |
| .stream_find_overlapping(cur).map(|r| r.unwrap()).collect() |
| } |
| |
| #[test] |
| fn one_pattern_one_match() { |
| let ns = vec!["a"]; |
| let hay = "za"; |
| let matches = vec![ |
| Match { pati: 0, start: 1, end: 2 }, |
| ]; |
| assert_eq!(&aut_find(&ns, hay), &matches); |
| assert_eq!(&aut_finds(&ns, hay), &matches); |
| assert_eq!(&aut_findf(&ns, hay), &matches); |
| assert_eq!(&aut_findfs(&ns, hay), &matches); |
| } |
| |
| #[test] |
| fn one_pattern_many_match() { |
| let ns = vec!["a"]; |
| let hay = "zazazzzza"; |
| let matches = vec![ |
| Match { pati: 0, start: 1, end: 2 }, |
| Match { pati: 0, start: 3, end: 4 }, |
| Match { pati: 0, start: 8, end: 9 }, |
| ]; |
| assert_eq!(&aut_find(&ns, hay), &matches); |
| assert_eq!(&aut_finds(&ns, hay), &matches); |
| assert_eq!(&aut_findf(&ns, hay), &matches); |
| assert_eq!(&aut_findfs(&ns, hay), &matches); |
| } |
| |
| #[test] |
| fn one_longer_pattern_one_match() { |
| let ns = vec!["abc"]; |
| let hay = "zazabcz"; |
| let matches = vec![ Match { pati: 0, start: 3, end: 6 } ]; |
| assert_eq!(&aut_find(&ns, hay), &matches); |
| assert_eq!(&aut_finds(&ns, hay), &matches); |
| assert_eq!(&aut_findf(&ns, hay), &matches); |
| assert_eq!(&aut_findfs(&ns, hay), &matches); |
| } |
| |
| #[test] |
| fn one_longer_pattern_many_match() { |
| let ns = vec!["abc"]; |
| let hay = "zazabczzzzazzzabc"; |
| let matches = vec![ |
| Match { pati: 0, start: 3, end: 6 }, |
| Match { pati: 0, start: 14, end: 17 }, |
| ]; |
| assert_eq!(&aut_find(&ns, hay), &matches); |
| assert_eq!(&aut_finds(&ns, hay), &matches); |
| assert_eq!(&aut_findf(&ns, hay), &matches); |
| assert_eq!(&aut_findfs(&ns, hay), &matches); |
| } |
| |
| #[test] |
| fn many_pattern_one_match() { |
| let ns = vec!["a", "b"]; |
| let hay = "zb"; |
| let matches = vec![ Match { pati: 1, start: 1, end: 2 } ]; |
| assert_eq!(&aut_find(&ns, hay), &matches); |
| assert_eq!(&aut_finds(&ns, hay), &matches); |
| assert_eq!(&aut_findf(&ns, hay), &matches); |
| assert_eq!(&aut_findfs(&ns, hay), &matches); |
| } |
| |
| #[test] |
| fn many_pattern_many_match() { |
| let ns = vec!["a", "b"]; |
| let hay = "zbzazzzzb"; |
| let matches = vec![ |
| Match { pati: 1, start: 1, end: 2 }, |
| Match { pati: 0, start: 3, end: 4 }, |
| Match { pati: 1, start: 8, end: 9 }, |
| ]; |
| assert_eq!(&aut_find(&ns, hay), &matches); |
| assert_eq!(&aut_finds(&ns, hay), &matches); |
| assert_eq!(&aut_findf(&ns, hay), &matches); |
| assert_eq!(&aut_findfs(&ns, hay), &matches); |
| } |
| |
| #[test] |
| fn many_longer_pattern_one_match() { |
| let ns = vec!["abc", "xyz"]; |
| let hay = "zazxyzz"; |
| let matches = vec![ Match { pati: 1, start: 3, end: 6 } ]; |
| assert_eq!(&aut_find(&ns, hay), &matches); |
| assert_eq!(&aut_finds(&ns, hay), &matches); |
| assert_eq!(&aut_findf(&ns, hay), &matches); |
| assert_eq!(&aut_findfs(&ns, hay), &matches); |
| } |
| |
| #[test] |
| fn many_longer_pattern_many_match() { |
| let ns = vec!["abc", "xyz"]; |
| let hay = "zazxyzzzzzazzzabcxyz"; |
| let matches = vec![ |
| Match { pati: 1, start: 3, end: 6 }, |
| Match { pati: 0, start: 14, end: 17 }, |
| Match { pati: 1, start: 17, end: 20 }, |
| ]; |
| assert_eq!(&aut_find(&ns, hay), &matches); |
| assert_eq!(&aut_finds(&ns, hay), &matches); |
| assert_eq!(&aut_findf(&ns, hay), &matches); |
| assert_eq!(&aut_findfs(&ns, hay), &matches); |
| } |
| |
| #[test] |
| fn many_longer_pattern_overlap_one_match() { |
| let ns = vec!["abc", "bc"]; |
| let hay = "zazabcz"; |
| let matches = vec![ |
| Match { pati: 0, start: 3, end: 6 }, |
| Match { pati: 1, start: 4, end: 6 }, |
| ]; |
| assert_eq!(&aut_findo(&ns, hay), &matches); |
| assert_eq!(&aut_findos(&ns, hay), &matches); |
| assert_eq!(&aut_findfo(&ns, hay), &matches); |
| assert_eq!(&aut_findfos(&ns, hay), &matches); |
| } |
| |
| #[test] |
| fn many_longer_pattern_overlap_one_match_reverse() { |
| let ns = vec!["abc", "bc"]; |
| let hay = "xbc"; |
| let matches = vec![ Match { pati: 1, start: 1, end: 3 } ]; |
| assert_eq!(&aut_findo(&ns, hay), &matches); |
| assert_eq!(&aut_findos(&ns, hay), &matches); |
| assert_eq!(&aut_findfo(&ns, hay), &matches); |
| assert_eq!(&aut_findfos(&ns, hay), &matches); |
| } |
| |
| #[test] |
| fn many_longer_pattern_overlap_many_match() { |
| let ns = vec!["abc", "bc", "c"]; |
| let hay = "zzzabczzzbczzzc"; |
| let matches = vec![ |
| Match { pati: 0, start: 3, end: 6 }, |
| Match { pati: 1, start: 4, end: 6 }, |
| Match { pati: 2, start: 5, end: 6 }, |
| Match { pati: 1, start: 9, end: 11 }, |
| Match { pati: 2, start: 10, end: 11 }, |
| Match { pati: 2, start: 14, end: 15 }, |
| ]; |
| assert_eq!(&aut_findo(&ns, hay), &matches); |
| assert_eq!(&aut_findos(&ns, hay), &matches); |
| assert_eq!(&aut_findfo(&ns, hay), &matches); |
| assert_eq!(&aut_findfos(&ns, hay), &matches); |
| } |
| |
| #[test] |
| fn many_longer_pattern_overlap_many_match_reverse() { |
| let ns = vec!["abc", "bc", "c"]; |
| let hay = "zzzczzzbczzzabc"; |
| let matches = vec![ |
| Match { pati: 2, start: 3, end: 4 }, |
| Match { pati: 1, start: 7, end: 9 }, |
| Match { pati: 2, start: 8, end: 9 }, |
| Match { pati: 0, start: 12, end: 15 }, |
| Match { pati: 1, start: 13, end: 15 }, |
| Match { pati: 2, start: 14, end: 15 }, |
| ]; |
| assert_eq!(&aut_findo(&ns, hay), &matches); |
| assert_eq!(&aut_findos(&ns, hay), &matches); |
| assert_eq!(&aut_findfo(&ns, hay), &matches); |
| assert_eq!(&aut_findfos(&ns, hay), &matches); |
| } |
| |
| #[test] |
| fn pattern_returns_original_type() { |
| let aut = AcAutomaton::new(vec!["apple", "maple"]); |
| |
| // Explicitly given this type to assert that the thing returned |
| // from the function is our original type. |
| let pat: &str = aut.pattern(0); |
| assert_eq!(pat, "apple"); |
| |
| // Also check the return type of the `patterns` function. |
| let pats: &[&str] = aut.patterns(); |
| assert_eq!(pats, &["apple", "maple"]); |
| } |
| |
| // Quickcheck time. |
| |
| // This generates very small ascii strings, which makes them more likely |
| // to interact in interesting ways with larger haystack strings. |
| #[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)] |
| pub struct SmallAscii(String); |
| |
| impl Arbitrary for SmallAscii { |
| fn arbitrary<G: Gen>(g: &mut G) -> SmallAscii { |
| use std::char::from_u32; |
| SmallAscii((0..2) |
| .map(|_| from_u32(g.gen_range(97, 123)).unwrap()) |
| .collect()) |
| } |
| |
| fn shrink(&self) -> Box<Iterator<Item=SmallAscii>> { |
| Box::new(self.0.shrink().map(SmallAscii)) |
| } |
| } |
| |
| impl From<SmallAscii> for String { |
| fn from(s: SmallAscii) -> String { s.0 } |
| } |
| |
| impl AsRef<[u8]> for SmallAscii { |
| fn as_ref(&self) -> &[u8] { self.0.as_ref() } |
| } |
| |
| // This is the same arbitrary impl as `String`, except it has a bias toward |
| // ASCII characters. |
| #[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)] |
| pub struct BiasAscii(String); |
| |
| impl Arbitrary for BiasAscii { |
| fn arbitrary<G: Gen>(g: &mut G) -> BiasAscii { |
| use std::char::from_u32; |
| let size = { let s = g.size(); g.gen_range(0, s) }; |
| let mut s = String::with_capacity(size); |
| for _ in 0..size { |
| if g.gen_bool(0.3) { |
| s.push(char::arbitrary(g)); |
| } else { |
| for _ in 0..5 { |
| s.push(from_u32(g.gen_range(97, 123)).unwrap()); |
| } |
| } |
| } |
| BiasAscii(s) |
| } |
| |
| fn shrink(&self) -> Box<Iterator<Item=BiasAscii>> { |
| Box::new(self.0.shrink().map(BiasAscii)) |
| } |
| } |
| |
| fn naive_find<S>(xs: &[S], haystack: &str) -> Vec<Match> |
| where S: Clone + Into<String> { |
| let needles: Vec<String> = |
| xs.to_vec().into_iter().map(Into::into).collect(); |
| let mut matches = vec![]; |
| for hi in 0..haystack.len() { |
| for (pati, needle) in needles.iter().enumerate() { |
| let needle = needle.as_bytes(); |
| if needle.len() == 0 || needle.len() > haystack.len() - hi { |
| continue; |
| } |
| if needle == &haystack.as_bytes()[hi..hi+needle.len()] { |
| matches.push(Match { |
| pati: pati, |
| start: hi, |
| end: hi + needle.len(), |
| }); |
| } |
| } |
| } |
| matches |
| } |
| |
| #[test] |
| fn qc_ac_equals_naive() { |
| fn prop(needles: Vec<SmallAscii>, haystack: BiasAscii) -> bool { |
| let aut_matches = aut_findo(&needles, &haystack.0); |
| let naive_matches = naive_find(&needles, &haystack.0); |
| // Ordering isn't always the same. I don't think we care, so do |
| // an unordered comparison. |
| let aset: HashSet<Match> = aut_matches.iter().cloned().collect(); |
| let nset: HashSet<Match> = naive_matches.iter().cloned().collect(); |
| aset == nset |
| } |
| quickcheck(prop as fn(Vec<SmallAscii>, BiasAscii) -> bool); |
| } |
| |
| |
| #[test] |
| fn all_bytes_iter() { |
| let all_bytes = AllBytesIter::new().collect::<Vec<_>>(); |
| assert_eq!(all_bytes[0], 0); |
| assert_eq!(all_bytes[255], 255); |
| assert!(AllBytesIter::new().enumerate().all(|(i, b)| b as usize == i)); |
| } |
| |
| #[test] |
| fn regression_full_order1() { |
| let matches1 = aut_findfo(&["inf", "ind"], "infind") |
| .into_iter() |
| .map(|m| (m.start, m.end)) |
| .collect::<Vec<(usize, usize)>>(); |
| let matches2 = aut_findfo(&["ind", "inf"], "infind") |
| .into_iter() |
| .map(|m| (m.start, m.end)) |
| .collect::<Vec<(usize, usize)>>(); |
| assert_eq!(matches1, matches2); |
| } |
| |
| // See: https://github.com/BurntSushi/aho-corasick/issues/37 |
| #[test] |
| fn regression_full_order2() { |
| let haystack = "libcore/char/methods.rs"; |
| |
| let matches1 = aut_findfo(&["libcore/", "libstd/"], haystack) |
| .into_iter() |
| .map(|m| (m.start, m.end)) |
| .collect::<Vec<(usize, usize)>>(); |
| let matches2 = aut_findfo(&["libstd/", "libcore/"], haystack) |
| .into_iter() |
| .map(|m| (m.start, m.end)) |
| .collect::<Vec<(usize, usize)>>(); |
| assert_eq!(matches1, matches2); |
| } |
| } |