Filter one file using two other files
$ cat arquivo1.txt
6|1000|121|999
1|1000|2000|3001
2|1000|2000|3001
3|2000|11|11
4| 100|22|1
5|1000|2000|4000
1000|10|11|12
$ cat arquivo2.txt
5
1000
7
$ cat arquivo3.txt
20
I want to output all lines from arquivo1.txt whose second field is not in arquivo2.txt and whose second field's first two characters are not in arquivo3.txt.
In this example, the output would be:
4| 100|22|1
1000|10|11|12
So I wrote the filter against arquivo2.txt:
$ awk -F'|' 'FNR==NR { a[$1]; next } !($2 in a)' arquivo2.txt arquivo1.txt
And the filter against arquivo3.txt:
$ awk -F'|' 'FNR==NR { a[$1]; next } !(substr($2,1,2) in a)' arquivo3.txt arquivo1.txt
Is it possible to have these commands together in one line of code?
All I need is performance, because these files are big: arquivo1.txt has 1 million lines, and arquivo2.txt and arquivo3.txt have 200k lines each. Is this the best approach to achieve the best response time?
3 solutions
#1
I have a solution of sorts, but it is for gawk (a plain awk solution is at the end of this post). Maybe it is usable.
Using a hash is a good idea, since lookups take constant time.
awk -F\| '
ARGIND == 1 {a[$1]=1;next}              # 1st file (arquivo2.txt): field-2 values to exclude
ARGIND == 2 {b[$1]=1;next}              # 2nd file (arquivo3.txt): 2-char prefixes to exclude
!($2 in a) && !(substr($2,1,2) in b)    # 3rd file (arquivo1.txt): print lines passing both tests
' arquivo2.txt arquivo3.txt arquivo1.txt
Output:
4| 100|22|1
1000|10|11|12
I did some measurements. I generated the 3 files with the following awk script:
time awk ' BEGIN {
for(i=0;i<1000000;++i) print i"|"i"|1000|123">"arquivo1.txt"
for(i=0;i<200000;++i) print (i*10)>"arquivo2.txt"
for(i=0;i<200000;++i) print (i*10+5)>"arquivo3.txt"
}' || exit 1
Then I measured the time needed to run the script above by adding time before awk, and I redirected the output to /dev/null so that printing to the screen is not measured. Here is the result of three independent runs:
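For reference, here is a minimal sketch of what such a test.sh wrapper could look like (the name test.sh appears in the runs below; the exact contents are my assumption):
#!/bin/sh
# test.sh - time the gawk filter; discard the output so printing is not measured
time awk -F\| '
ARGIND == 1 {a[$1]=1;next}
ARGIND == 2 {b[$1]=1;next}
!($2 in a) && !(substr($2,1,2) in b)
' arquivo2.txt arquivo3.txt arquivo1.txt > /dev/null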
$./test.sh
real 0m2.880s
user 0m2.816s
sys 0m0.044s
$./test.sh
real 0m2.931s
user 0m2.892s
sys 0m0.032s
$./test.sh
real 0m2.924s
user 0m2.864s
sys 0m0.040s
(Generating the test files took about 1.5 sec.) With 1 million rows in the input file and 2 x 200,000 rows in the filter files, the filter finishes in about 3 seconds and prints 809,999 lines (so both conditions are evaluated at least that many times).
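As a cross-check of that count (my own arithmetic on the generated test data, not from the original post): 100,000 of the 1,000,000 second fields are multiples of 10 and therefore appear in arquivo2.txt, 100,000 lines have a two-character prefix that appears in arquivo3.txt, and 9,999 lines satisfy both, so 1,000,000 - (100,000 + 100,000 - 9,999) = 809,999 lines pass the filter.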
Is this what you expected, or is it still too slow? My machine is a somewhat old laptop with a Pentium(R) Dual-Core T4300 @ 2.10GHz CPU.
ADDED
Here is a slightly faster, plain awk solution:
awk -F\| '
BEGIN {
    # preload both filter files into hash tables before processing arquivo1.txt
    while((getline<"arquivo2.txt")>0) a[$0];
    while((getline<"arquivo3.txt")>0) b[$0];
}
!($2 in a) && !(substr($2,1,2) in b)    # print lines whose second field passes both tests
' arquivo1.txt
For the big test files the run times are:
real 0m2.544s
user 0m2.452s
sys 0m0.048s
real 0m2.458s
user 0m2.420s
sys 0m0.032s
real 0m2.493s
user 0m2.448s
sys 0m0.036s
So this runs in 2.5 sec.
I hope this helps a bit!