Filter one file using two other files
$ cat arquivo1.txt
6|1000|121|999
1|1000|2000|3001
2|1000|2000|3001
3|2000|11|11
4| 100|22|1
5|1000|2000|4000
1000|10|11|12
$ cat arquivo2.txt
5
1000
7
$ cat arquivo3.txt
20
I want to output all lines from arquivo1.txt whose second field is not in arquivo2.txt and whose second field's first two characters are not in arquivo3.txt.
In this example, the output would be:
4| 100|22|1
1000|10|11|12
So I wrote the filter against arquivo2.txt:
$ awk -F'|' 'FNR==NR { a[$1]; next } !($2 in a)' arquivo2.txt arquivo1.txt
And the filter against arquivo3.txt:
$ awk -F'|' 'FNR==NR { a[$1]; next } !(substr($2,1,2) in a)' arquivo3.txt arquivo1.txt
Is it possible to have these commands together in one line of code?
All I need is performance, because these files are big: arquivo1.txt has 1 million lines, and arquivo2.txt and arquivo3.txt have 200k lines each. Is this the best approach to achieve the best response time?
3 solutions
#1
I have a solution of sorts, but it is for gawk (a plain awk solution is at the end of this post). Maybe it is usable.
Using a hash is a good idea, since lookups take constant time.
awk -F\| '
ARGIND == 1 {a[$1]=1;next}              # 1st file (arquivo2.txt): field-2 values to exclude
ARGIND == 2 {b[$1]=1;next}              # 2nd file (arquivo3.txt): 2-char prefixes to exclude
!($2 in a) && !(substr($2,1,2) in b)    # 3rd file (arquivo1.txt): print lines passing both tests
' arquivo2.txt arquivo3.txt arquivo1.txt
Output:
4| 100|22|1
1000|10|11|12
I did some measurements. I generated the 3 files with the following awk script:
time awk ' BEGIN {
for(i=0;i<1000000;++i) print i"|"i"|1000|123">"arquivo1.txt"
for(i=0;i<200000;++i) print (i*10)>"arquivo2.txt"
for(i=0;i<200000;++i) print (i*10+5)>"arquivo3.txt"
}' || exit 1
Then I measured the time needed to run the script above by adding time before awk, and I redirected the output to /dev/null so that printing to the screen is not measured. Here is the result of three independent runs:
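For reference, here is a minimal sketch of what such a test.sh wrapper could look like (the name test.sh appears in the runs below; the exact contents are my assumption):
#!/bin/sh
# test.sh - time the gawk filter; discard the output so printing is not measured
time awk -F\| '
ARGIND == 1 {a[$1]=1;next}
ARGIND == 2 {b[$1]=1;next}
!($2 in a) && !(substr($2,1,2) in b)
' arquivo2.txt arquivo3.txt arquivo1.txt > /dev/null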
$./test.sh
real 0m2.880s
user 0m2.816s
sys 0m0.044s
$./test.sh
real 0m2.931s
user 0m2.892s
sys 0m0.032s
$./test.sh
real 0m2.924s
user 0m2.864s
sys 0m0.040s
(Generating the test files took about 1.5 sec.) With 1 million rows in the input file and 2 x 200,000 rows in the filter files, the filter finishes in about 3 seconds and prints 809,999 lines (so both conditions are evaluated at least that many times).
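As a cross-check of that count (my own arithmetic on the generated test data, not from the original post): 100,000 of the 1,000,000 second fields are multiples of 10 and therefore appear in arquivo2.txt, 100,000 lines have a two-character prefix that appears in arquivo3.txt, and 9,999 lines satisfy both, so 1,000,000 - (100,000 + 100,000 - 9,999) = 809,999 lines pass the filter.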
Is this what you expected, or is it still too slow? My machine is a somewhat old laptop with a Pentium(R) Dual-Core T4300 @ 2.10GHz CPU.
ADDED
Here is a slightly faster, plain awk solution:
awk -F\| '
BEGIN {
    # preload both filter files into hash tables before processing arquivo1.txt
    while((getline<"arquivo2.txt")>0) a[$0];
    while((getline<"arquivo3.txt")>0) b[$0];
}
!($2 in a) && !(substr($2,1,2) in b)    # print lines whose second field passes both tests
' arquivo1.txt
For the big test files the run times are:
real 0m2.544s
user 0m2.452s
sys 0m0.048s
real 0m2.458s
user 0m2.420s
sys 0m0.032s
real 0m2.493s
user 0m2.448s
sys 0m0.036s
So this runs in 2.5 sec.
I hope this helps a bit!