[ Pobierz całość w formacie PDF ]
1
Callgraphpropertiesofexecutables
DanielBilar
WellesleyCollege
DepartmentofComputerScience
Wellesley,MA02481,USA
E-mail:dbilar@wellesley.edu
polymorphiccode)andheuristicssuchasemula-
tionaremoreresilient(butquiteslowandmay
hingeonenvironmentaltriggers),adetectionap-
proachthatcombinesthebestofbothworlds
wouldbedesirable.Thisisthephilosophybehind
astructural¯ngerprint.Structural¯ngerprintsare
statisticalinnature,andassucharepositionedas
`fuzzier'metricsbetweenstaticsignaturesanddy-
namicheuristics.Thestructural¯ngerprintinves-
tigatedinthispaperfordi®erentiationpurposesis
basedonsomepropertiesoftheexecutable'scall-
graph.Ialsoproposeagenerativemechanismfor
thecallgraphtopology.
Thispaperexaminesthecallgraphsof120malicious
and280non-maliciousexecutables.Paretomodels
were¯ttedtoin-degree,out-degreeandbasicblock
countdistribution,andastatisticallysigni¯cantdi®er-
enceshownforthederivedpowerlawexponent.Atwo-
stepoptimizationprocessishypothesizedtoaccount
forstructuralfeaturesoftheexecutable.
Keywords:Executables,Callgraph,HOTprocess,
Graph-structuralFingerprint,Malware
2.Generatingthecallgraph
1.Motivation
Primarytoolsusedaredescribedinmoredetails
intheappendix.
Allcommercialantivirus(AV)productsrelyon
signaturematching;thebulkofwhichconstitutes
strictbytesequencepatternmatching.Formod-
ern,evolving
polymorphic
and
metamorphic
mal-
ware,thisapproachisunsatifactory.Clementire-
centlychecked¯fteenstate-of-the-art,updatedAV
scanneragainsttenhighlypolymorphicmalware
samplesandfoundfalsenegativeratesfrom0-
90%,withanaverageof48%[5].Thisdevelop-
mentwasalreadypredictedin2001[29].Polymor-
phicmalwarecontaindecryptionroutineswhich
decryptencryptedconstantpartsofthemal-
warebody.Themalwarecanmutateitsdecryp-
torsinsubsequentgenerations,therebycomplicat-
ingsignature-baseddetectionapproaches.Thede-
cryptedbody,however,remainsconstant.Meta-
morphicmalwaregenerallydonotuseencryp-
tion,butareabletomutatetheirbodyinsub-
sequentgenerationusingvarioustechniques,such
asjunkinsertion,semanticNOPs,codetransposi-
tion,equivalentinstructionsubstitutionandreg-
isterreassignments[4][27].Thenetresultofthese
techniquesisashrinkingusable\constantbase"
forstrictsignature-baseddetectionapproaches.
Sincesignature-basedapproachesarequitefast
(butshowlittletoleranceformetamorphicand
2.1.Samples
Fornon-malicioussoftware,henceforthcalled
`goodware',samplingfollowedatwo-stepprocess:
IinventoriedallPEs(theprimary32-bitWindows
¯leformat)onaMicrosoftXPHomeSP2lap-
top,extracteduniformrandomly300samples,dis-
cardedoverlylargeandsmall¯les,yielding280
samples.Formalicioussoftware(malware),seven
classesofinterestwere¯xed:backdoor,hacking
tools,DoS,trojans,exploits,virus,andworms.
ThewormclasswasfurtherdividedintoPeer-
to-Peer(P2P),InternetRelayChat/InstantMes-
senger(IRC/IM),EmailandNetworkwormsub-
classes.Forannon-specialistintroductiontoma-
licioussoftware,see[26];foracanonicalreference,
see[28].Eachclass(subclass)containedatleast
15samples.SinceAVvendorswerehesitantfor
liabilityreasonstoprovidesamples,Igathered
themfromherm1t's(underground)collectionand
identi¯edcompilerand(potential)packermeta-
datausingPEiD.Practicallyallmalwaresamples
wereidenti¯edashavingbeencompiledbyMS
C++5.0/6.0,MSVisualBasic5.0/6.0orLCC,
AICommunications
ISSN0921-7126,IOSPress.Allrightsreserved
2
(a)Example:Callgraph (b)Example:ControlFlowGraph
(CFG)
(c)Example:BasicBlock
Fig.1.Graphstructuresofanexecutable
andaboutadozensampleswerepackedwithvar-
iousversionsofUPX(anexecutablecompres-
sionprogram).Malwarewasrunthroughbest-of-
breed,updatedopen-andclosed-sourceAVprod-
uctsyieldingafalsenegativerateof32%(open-
source)and2%(closed-source),respectively.Over-
all¯lesizesforbothmal-andgoodwareranged
from£(10kb)to£(1MB).Apreliminary¯lesize
distributioninvestigationyieldedalog-normaldis-
tribution;foraputativeexplanationoftheun-
derlyinggenerativeprocess,see[19]and[16].All
400sampleswereloadedintothede-factoindus-
trystandarddisassembler(IDAPro[12]),inter-
andintra-procedurallyparsedandaugmentedwith
symbolicmeta-informationgleanedprogrammati-
callyfromthebinaryviaFLIRTsignatures(when
applicable).Iexportedtheidenti¯edstructuresex-
portedviaIDAPythonintoaMySQLdatabase.
Thesestructuresweresubsequentlyparsedbya
disassemblyvisualizationtool(BinNavi[6])togen-
erateandinvestigatethecallgraph.
mallyusedtopasscontrolaroundwithinonefunc-
tionoftheprogram,whilefarbranchesareused
tocallotherfunctions.Asequenceofinstructions
thatiscontinuous(e.g.hasnobranchesjumping
intoitsmiddleandendsatabranchinstruction)is
calleda
basicblock
.Weconsiderthegraphformed
byhavingeachbasicblockasanode,andeach
shortbranchanedge.Theconnectedcomponents
inthisdirectedgraphcorrespondtothe°owgraphs
ofthefunctionsinthesourcecode.Foreachcon-
nectedcomponentinthepreviousgraph,wecre-
ateanodeinthecallgraph.Foreachfarbranch
intheconnectedcomponent,weaddanedgeto
thenodecorrespondingtotheconnectedcompo-
nentthisbranchistargeting.Fig.1illustratethese
concepts.
Formally,denoteacallgraph
CG
as
CG
=
G
(
V;E
),where
G
(
¢
)standsfor`Graph'.Let
V
=
S
F
,where
F2normal;import;library;thunk
.
ThisjustsaysthateachfunctioninCGiseithera
`library'function(fromanexternallibrariesstati-
callylinkedin),an`import'function(dynamically
importedfromadynamiclibrary),a`thunk'func-
tion(mostlyone-linewrapperfunctionsusedfor
callingconventionortypeconversion)ora`nor-
mal'function(canbeviewedastheexecutables
ownfunction).Followingmetricswereprogram-
maticallycollectedfrom
CG
{
jVj
isnumberofnodesin
CG
,i.ethefunction
countofthecallgraph
{
Forany
f2V
,let
f
=
G
(
V
f
;E
f
)where
b2
V
f
isablockofcode,i.eeachnodeinthe
callgraphisitselfagraph,a°owgraph,and
eachnodeonthe°owgraphisabasicblock
{
De¯ne
IC
:
B!N
where
B
isde¯nedtobe
setofblocksofcode,and
IC
(
b
)isthenumber
ofinstructionsin
b
.Wedenotethisfunction
shorthandas
jbj
IC
,thenumberofinstructions
inbasicblock
b
.
2.2.Callgraph
Following[7],wetreatanexecutableasa
graph
ofgraphs
.Thisfollowstheintuitionthatinany
procedurallanguage,thesourcecodeisstructured
into
functions
(whichcanbeviewedasa°owchart,
e.g.adirectedgraphwhichwecall
°owgraph
).
Thesefunctionscalleachother,thuscreatinga
largergraphwhereeachnodeisafunctionandthe
edgesarecalls-torelationsbetweenthefunctions.
Wecallthislargergraphthe
callgraph
.Were-
coverthisstructurebydiassemblingtheexecutable
intoindividualinstructions.Wedistinguishbe-
tween
short
and
far
branchinstructions:Short
branchesdonotsaveareturnaddresswhilefar
branchesdo.Intuitively,shortbranchesarenor-
3
{
Weextendthisnotation
j¢j
IC
toelementsof
class metric £(10) £(100) £(1000)
Goodware r 0.05 -0.017 -0.0366
IQR 12 44 36
Malware r 0.08 0.0025 0.0317
IQR 8 45 28
Table1
Correlation,IQRforinstructioncount
V
bede¯ning
jfj
IC
=
P
b2V
f
jbj
IC
.Thisgives
usthetotalnumberofinstructionsinanode
ofthecallgraph,i.einafunction.
{
Let
d
+
G
(
f
),
d
¡
G
(
f
)and
d
bb
G
(
f
)denotethein-
degree,outdegreeandbasicblockcountofa
function,respectively.
eratethe°owgraphforimportedfunctions,i.e.
animportedfunctionautomaticallyhasoutde-
gree0,andbutwillbecalledfrommanyother
functions).Additional,Isize-blockedbothsample
groupsintothreefunctioncountblocks,withblock
criteriachosenas£(10),£(100)and£(1000)func-
tioncountstoinvestigateacorrelationbetween
instructioncountinfunctionsandcomplexityof
theexecutable(withfunctioncountasaproxy).
Again,Ifoundnocorrelationatsigni¯cancelevel
·
0
:
001.Coe±cientvaluesandtheIQRforin-
structioncounts(aspreadmeasure,thedi®erence
betweenthe75thandthe25thpercentilesofthe
sample)aregiveninTable1.The¯rstresultcor-
roborateprevious¯ndings;thesecondresultatthe
phenomenologicallevelagreeswiththe`refactor-
ing'modelin[21],whichpositsthatexcessively
longfunctionsthattendtobedecomposedinto
smallerfunctions.Remarkably,thespreadisquite
low,ontheorderofafewdozeninstructions.Iwill
discussmodelsmoreinsection4.
(a)Malware:pvs
r
in;out
2.4.Functiontypes
EachpointinthescatterplotsinFig.3repre-
sentsthreemetricsforoneindividualexecutable:
Functioncount,andtheproportionsofnormal
function,staticlibrary+dynamicimportfunc-
tions,andthunks.Proportionsforanindividual
executableaddupto1.Thefoursubgraphsare
parsedthusly,usingFig.3(b)asanexample.The
x-axisdenotestheproportionof`normal'function,
andthey-axistheproportionof\thunk"func-
tionsinthebinaries.Thecolorofeachpointin-
dicates
jVj
,whichmayserveasaroughproxy
fortheexecutable'ssize.Thedarkredpointat
(
X;Y
)=
(0.87,0.007)
is
endnote.exe
,sinceit
istheonlygoodwarebinarywithfunctionscount
of£(10
4
).
Mostthunksarewrappersaroundimports,
henceinsmallexecutables,alargerproportionof
thefunctionswillbethunks.Thesameholdsforli-
braries:Thelargertheexecutable,thesmallerthe
(b)Goodware:pvs
r
in;out
Fig.2.CorrelationCoe±cient
r
in;out
2.3.Correlations
Icalculatedthecorrelationbetweeninand
outdegreeoffunctions.Prioranalysisofstatic
classcollaborationnetworks[24][21]suggestan
anti-correlation,characterizingsomefunctionsas
sourceorsinks.Ifoundnosigni¯cantcorrelation
betweeninandoutdegreeoffunctionsinthedis-
assembledexecutables(Fig.2).Correlationintu-
itivelyisunlikelytooccurexceptinthe`0out-
degree'case(theBinNavitoolsetdoesnotgen-
s
s
s
s
0
2
0
3
0
4
0
2
0
2
0
3
0
3
0
4
0
4
0
2
0
3
0
4
2
4
6
8
2
4
6
8
2
4
6
8
2
4
6
8
s
s
0
1
M
s
0
0
1
s
0
1
0
1
1
0
0
8
k
5
2
2
6
4
1
4
l
l
1
l
4
6
6
2
l
5
8
8
0
1
1
1
5
0
0
k
0
2
0
3
0
4
0
2
0
2
0
3
0
3
0
4
0
4
0
2
0
3
0
4
1
2
3
4
5
6
7
8
9
1
2
2
3
4
4
5
6
6
7
8
8
9
2
4
6
8
s
s
0
1
s
0
0
1
s
1
0
1
0
0
0
0
2
2
2
2
4
4
4
4
l
l
k
k
6
6
6
6
8
8
8
8
1
1
1
1
0
2
0
3
0
4
0
2
0
2
0
3
0
3
0
4
0
4
0
2
0
3
0
4
1
2
3
4
5
6
7
8
9
1
2
2
3
4
4
5
6
6
7
8
8
9
2
4
6
8
s
s
s
s
0
1
0
0
1
1
0
1
0
0
0
0
2
4
6
8
2
4
6
8
2
4
6
8
2
4
6
8
s
s
0
1
0
0
1
s
s
0
1
0
1
1
0
0
2
2
2
2
8
5
k
2
2
4
4
4
4
l
l
k
k
6
4
4
1
6
6
6
6
l
l
1
l
4
6
6
8
8
8
8
2
l
5
8
8
1
1
1
1
0
1
1
1
5
0
0
0
2
0
3
0
4
0
2
0
2
0
3
0
3
0
4
0
4
0
2
0
3
0
4
k
0
2
0
3
0
4
0
2
0
2
0
3
0
3
0
4
0
4
0
2
0
3
0
4
2
4
6
8
2
2
4
4
6
6
8
8
2
4
6
8
0
1
0
0
1
M
0
1
0
1
1
0
0
8
5
k
2
2
6
4
4
1
5
class BasicBlock Indegree Outdegree
GWN(1.634,0.3) N(2.02,0.3) N(1.69,0.307)
MWN(1.7,0.3) N(2.08,0.45) N(1.68,0.35)
t 2.57 1.04 -0.47
Table2
®
distribution¯ttingandtesting
HorrorPlot'[25],where^
®
sandthecorresponding
CIswereveryjittery.
The¯ttedpower-lawexponents
®
indeg
,
®
outdeg
,
®
bb
,togetherwithindividualfunctions'callgraph
sizeareshownFig.5.Forbothclasses,therange
extendsfor
®
indeg
¼
[1.5-3],
®
outdeg
¼
[1.1-2.5]and
®
bb
¼
[1.1-2.1],withaslightlygreaterspreadfor
malware.
(a)GWsample:Fittingfor
alpha
indeg
and
k
c
2.6.Testingfordi®erence
Inowcheckwhetherthereareanystatistically
signi¯cantdi®erencesbetween(
®
,
k
c
)¯tforgood-
wareandmalware,respectively.Followingproce-
duresin[34],I¯nd
®
indeg
,
®
outdeg
and
®
bb
dis-
tributedapproximatelynormal.Theexponential
cuto®parameters
k
c
arelognormallydistributed.
Applyingastandardtwo-tailedt-test(Table2),I
¯ndatsigni¯cancelevel0.05(
t
critical
=1.97)only
¹
(
®
bb;malware
)
¸¹
(
®
bb;goodware
).
Forthebasicblocks,
k
c
¼LogN
(59
:
1
;
52)
(goodware)and
¼LogN
(54
:
2
;
44)(malware)and
¹
(
k
c
(
bb;malware
))=
¹
(
k
c
(
bb;goodware
))wasre-
jectedviaWilcoxonRankSumwith
z
=13
:
4.
Thesteeperslopeofmalware's
®
bb
implythat
functionsinmalwaretendtohavealowerbasic
blockcount.Thiscanbeaccountedforbythefact
thatmalwaretendstobesimplerthanmostap-
plicationsandoperateswithoutmuchinteraction,
hencefewerbranches,hencefewerbasicblocks.
Malwaretendstohavelimitedfunctionality,and
operateindependentlyofinputfromuserandthe
operatingenvironment.Also,malwareisusually
notcompiledwithaggressivecompileroptimiza-
tionsettings.Sucharegimeleadstomoreinlin-
ingandthusincreasesthebasicblockcountofthe
individualfunctions.Itmaybepossible,too,that
malwareauthorstendtobreakfunctionsintosim-
plercomponentsthan`regular'programmers.The
smallercuto®pointformalwareseemstocorrob-
oratethis,aswell,inthatthepowerlawrelation-
shipholdsoverashorterrange.However,thisex-
planationshouldberegardedasspeculativepend-
ingfurtherinvestigation.
(b)MWsample:Fittingfor
alpha
bb
and
k
c
Fig.4.Pareto¯ttingECCDFs,shownwithHillestimator
inset
Totentativelycorroboratetheconsistencyof
ourpositedParetomodel,30(goodware)and21
(malware)indegree,outdegreeandbasicblockEC-
CDFplotswereuniformlysampledintothreefunc-
tioncountblocks,withblockcriteriachosenas
£(10),£(100)and£(1000)functioncounts,yield-
ingasamplingcoverageof10%(goodware)and
17%(malware).Visualinspectionindicatesthatfor
malware,themodelseemedmoreconsistentfor
outdegreethanindegreeatallfunctionsizes.For
basicblockcount,theconsistencytendstobebet-
terforsmallerexecutables.Iseethesetendencyfor
goodware,aswell,withtheobservationthatout-
degreewasmostconsistentinsizeblock£(100);
for£(10)and£(1000).Forbothmalwareand
goodware,indegreeseemedtheleastconsistent,
quiteafewsamplesdidexhibitaso-called`Hill
[ Pobierz całość w formacie PDF ]
zanotowane.pl doc.pisz.pl pdf.pisz.pl adbuxwork.keep.pl
1
Callgraphpropertiesofexecutables
DanielBilar
WellesleyCollege
DepartmentofComputerScience
Wellesley,MA02481,USA
E-mail:dbilar@wellesley.edu
polymorphiccode)andheuristicssuchasemula-
tionaremoreresilient(butquiteslowandmay
hingeonenvironmentaltriggers),adetectionap-
proachthatcombinesthebestofbothworlds
wouldbedesirable.Thisisthephilosophybehind
astructural¯ngerprint.Structural¯ngerprintsare
statisticalinnature,andassucharepositionedas
`fuzzier'metricsbetweenstaticsignaturesanddy-
namicheuristics.Thestructural¯ngerprintinves-
tigatedinthispaperfordi®erentiationpurposesis
basedonsomepropertiesoftheexecutable'scall-
graph.Ialsoproposeagenerativemechanismfor
thecallgraphtopology.
Thispaperexaminesthecallgraphsof120malicious
and280non-maliciousexecutables.Paretomodels
were¯ttedtoin-degree,out-degreeandbasicblock
countdistribution,andastatisticallysigni¯cantdi®er-
enceshownforthederivedpowerlawexponent.Atwo-
stepoptimizationprocessishypothesizedtoaccount
forstructuralfeaturesoftheexecutable.
Keywords:Executables,Callgraph,HOTprocess,
Graph-structuralFingerprint,Malware
2.Generatingthecallgraph
1.Motivation
Primarytoolsusedaredescribedinmoredetails
intheappendix.
Allcommercialantivirus(AV)productsrelyon
signaturematching;thebulkofwhichconstitutes
strictbytesequencepatternmatching.Formod-
ern,evolving
polymorphic
and
metamorphic
mal-
ware,thisapproachisunsatifactory.Clementire-
centlychecked¯fteenstate-of-the-art,updatedAV
scanneragainsttenhighlypolymorphicmalware
samplesandfoundfalsenegativeratesfrom0-
90%,withanaverageof48%[5].Thisdevelop-
mentwasalreadypredictedin2001[29].Polymor-
phicmalwarecontaindecryptionroutineswhich
decryptencryptedconstantpartsofthemal-
warebody.Themalwarecanmutateitsdecryp-
torsinsubsequentgenerations,therebycomplicat-
ingsignature-baseddetectionapproaches.Thede-
cryptedbody,however,remainsconstant.Meta-
morphicmalwaregenerallydonotuseencryp-
tion,butareabletomutatetheirbodyinsub-
sequentgenerationusingvarioustechniques,such
asjunkinsertion,semanticNOPs,codetransposi-
tion,equivalentinstructionsubstitutionandreg-
isterreassignments[4][27].Thenetresultofthese
techniquesisashrinkingusable\constantbase"
forstrictsignature-baseddetectionapproaches.
Sincesignature-basedapproachesarequitefast
(butshowlittletoleranceformetamorphicand
2.1.Samples
Fornon-malicioussoftware,henceforthcalled
`goodware',samplingfollowedatwo-stepprocess:
IinventoriedallPEs(theprimary32-bitWindows
¯leformat)onaMicrosoftXPHomeSP2lap-
top,extracteduniformrandomly300samples,dis-
cardedoverlylargeandsmall¯les,yielding280
samples.Formalicioussoftware(malware),seven
classesofinterestwere¯xed:backdoor,hacking
tools,DoS,trojans,exploits,virus,andworms.
ThewormclasswasfurtherdividedintoPeer-
to-Peer(P2P),InternetRelayChat/InstantMes-
senger(IRC/IM),EmailandNetworkwormsub-
classes.Forannon-specialistintroductiontoma-
licioussoftware,see[26];foracanonicalreference,
see[28].Eachclass(subclass)containedatleast
15samples.SinceAVvendorswerehesitantfor
liabilityreasonstoprovidesamples,Igathered
themfromherm1t's(underground)collectionand
identi¯edcompilerand(potential)packermeta-
datausingPEiD.Practicallyallmalwaresamples
wereidenti¯edashavingbeencompiledbyMS
C++5.0/6.0,MSVisualBasic5.0/6.0orLCC,
AICommunications
ISSN0921-7126,IOSPress.Allrightsreserved
2
(a)Example:Callgraph (b)Example:ControlFlowGraph
(CFG)
(c)Example:BasicBlock
Fig.1.Graphstructuresofanexecutable
andaboutadozensampleswerepackedwithvar-
iousversionsofUPX(anexecutablecompres-
sionprogram).Malwarewasrunthroughbest-of-
breed,updatedopen-andclosed-sourceAVprod-
uctsyieldingafalsenegativerateof32%(open-
source)and2%(closed-source),respectively.Over-
all¯lesizesforbothmal-andgoodwareranged
from£(10kb)to£(1MB).Apreliminary¯lesize
distributioninvestigationyieldedalog-normaldis-
tribution;foraputativeexplanationoftheun-
derlyinggenerativeprocess,see[19]and[16].All
400sampleswereloadedintothede-factoindus-
trystandarddisassembler(IDAPro[12]),inter-
andintra-procedurallyparsedandaugmentedwith
symbolicmeta-informationgleanedprogrammati-
callyfromthebinaryviaFLIRTsignatures(when
applicable).Iexportedtheidenti¯edstructuresex-
portedviaIDAPythonintoaMySQLdatabase.
Thesestructuresweresubsequentlyparsedbya
disassemblyvisualizationtool(BinNavi[6])togen-
erateandinvestigatethecallgraph.
mallyusedtopasscontrolaroundwithinonefunc-
tionoftheprogram,whilefarbranchesareused
tocallotherfunctions.Asequenceofinstructions
thatiscontinuous(e.g.hasnobranchesjumping
intoitsmiddleandendsatabranchinstruction)is
calleda
basicblock
.Weconsiderthegraphformed
byhavingeachbasicblockasanode,andeach
shortbranchanedge.Theconnectedcomponents
inthisdirectedgraphcorrespondtothe°owgraphs
ofthefunctionsinthesourcecode.Foreachcon-
nectedcomponentinthepreviousgraph,wecre-
ateanodeinthecallgraph.Foreachfarbranch
intheconnectedcomponent,weaddanedgeto
thenodecorrespondingtotheconnectedcompo-
nentthisbranchistargeting.Fig.1illustratethese
concepts.
Formally,denoteacallgraph
CG
as
CG
=
G
(
V;E
),where
G
(
¢
)standsfor`Graph'.Let
V
=
S
F
,where
F2normal;import;library;thunk
.
ThisjustsaysthateachfunctioninCGiseithera
`library'function(fromanexternallibrariesstati-
callylinkedin),an`import'function(dynamically
importedfromadynamiclibrary),a`thunk'func-
tion(mostlyone-linewrapperfunctionsusedfor
callingconventionortypeconversion)ora`nor-
mal'function(canbeviewedastheexecutables
ownfunction).Followingmetricswereprogram-
maticallycollectedfrom
CG
{
jVj
isnumberofnodesin
CG
,i.ethefunction
countofthecallgraph
{
Forany
f2V
,let
f
=
G
(
V
f
;E
f
)where
b2
V
f
isablockofcode,i.eeachnodeinthe
callgraphisitselfagraph,a°owgraph,and
eachnodeonthe°owgraphisabasicblock
{
De¯ne
IC
:
B!N
where
B
isde¯nedtobe
setofblocksofcode,and
IC
(
b
)isthenumber
ofinstructionsin
b
.Wedenotethisfunction
shorthandas
jbj
IC
,thenumberofinstructions
inbasicblock
b
.
2.2.Callgraph
Following[7],wetreatanexecutableasa
graph
ofgraphs
.Thisfollowstheintuitionthatinany
procedurallanguage,thesourcecodeisstructured
into
functions
(whichcanbeviewedasa°owchart,
e.g.adirectedgraphwhichwecall
°owgraph
).
Thesefunctionscalleachother,thuscreatinga
largergraphwhereeachnodeisafunctionandthe
edgesarecalls-torelationsbetweenthefunctions.
Wecallthislargergraphthe
callgraph
.Were-
coverthisstructurebydiassemblingtheexecutable
intoindividualinstructions.Wedistinguishbe-
tween
short
and
far
branchinstructions:Short
branchesdonotsaveareturnaddresswhilefar
branchesdo.Intuitively,shortbranchesarenor-
3
{
Weextendthisnotation
j¢j
IC
toelementsof
class metric £(10) £(100) £(1000)
Goodware r 0.05 -0.017 -0.0366
IQR 12 44 36
Malware r 0.08 0.0025 0.0317
IQR 8 45 28
Table1
Correlation,IQRforinstructioncount
V
bede¯ning
jfj
IC
=
P
b2V
f
jbj
IC
.Thisgives
usthetotalnumberofinstructionsinanode
ofthecallgraph,i.einafunction.
{
Let
d
+
G
(
f
),
d
¡
G
(
f
)and
d
bb
G
(
f
)denotethein-
degree,outdegreeandbasicblockcountofa
function,respectively.
eratethe°owgraphforimportedfunctions,i.e.
animportedfunctionautomaticallyhasoutde-
gree0,andbutwillbecalledfrommanyother
functions).Additional,Isize-blockedbothsample
groupsintothreefunctioncountblocks,withblock
criteriachosenas£(10),£(100)and£(1000)func-
tioncountstoinvestigateacorrelationbetween
instructioncountinfunctionsandcomplexityof
theexecutable(withfunctioncountasaproxy).
Again,Ifoundnocorrelationatsigni¯cancelevel
·
0
:
001.Coe±cientvaluesandtheIQRforin-
structioncounts(aspreadmeasure,thedi®erence
betweenthe75thandthe25thpercentilesofthe
sample)aregiveninTable1.The¯rstresultcor-
roborateprevious¯ndings;thesecondresultatthe
phenomenologicallevelagreeswiththe`refactor-
ing'modelin[21],whichpositsthatexcessively
longfunctionsthattendtobedecomposedinto
smallerfunctions.Remarkably,thespreadisquite
low,ontheorderofafewdozeninstructions.Iwill
discussmodelsmoreinsection4.
(a)Malware:pvs
r
in;out
2.4.Functiontypes
EachpointinthescatterplotsinFig.3repre-
sentsthreemetricsforoneindividualexecutable:
Functioncount,andtheproportionsofnormal
function,staticlibrary+dynamicimportfunc-
tions,andthunks.Proportionsforanindividual
executableaddupto1.Thefoursubgraphsare
parsedthusly,usingFig.3(b)asanexample.The
x-axisdenotestheproportionof`normal'function,
andthey-axistheproportionof\thunk"func-
tionsinthebinaries.Thecolorofeachpointin-
dicates
jVj
,whichmayserveasaroughproxy
fortheexecutable'ssize.Thedarkredpointat
(
X;Y
)=
(0.87,0.007)
is
endnote.exe
,sinceit
istheonlygoodwarebinarywithfunctionscount
of£(10
4
).
Mostthunksarewrappersaroundimports,
henceinsmallexecutables,alargerproportionof
thefunctionswillbethunks.Thesameholdsforli-
braries:Thelargertheexecutable,thesmallerthe
(b)Goodware:pvs
r
in;out
Fig.2.CorrelationCoe±cient
r
in;out
2.3.Correlations
Icalculatedthecorrelationbetweeninand
outdegreeoffunctions.Prioranalysisofstatic
classcollaborationnetworks[24][21]suggestan
anti-correlation,characterizingsomefunctionsas
sourceorsinks.Ifoundnosigni¯cantcorrelation
betweeninandoutdegreeoffunctionsinthedis-
assembledexecutables(Fig.2).Correlationintu-
itivelyisunlikelytooccurexceptinthe`0out-
degree'case(theBinNavitoolsetdoesnotgen-
s
s
s
s
0
2
0
3
0
4
0
2
0
2
0
3
0
3
0
4
0
4
0
2
0
3
0
4
2
4
6
8
2
4
6
8
2
4
6
8
2
4
6
8
s
s
0
1
M
s
0
0
1
s
0
1
0
1
1
0
0
8
k
5
2
2
6
4
1
4
l
l
1
l
4
6
6
2
l
5
8
8
0
1
1
1
5
0
0
k
0
2
0
3
0
4
0
2
0
2
0
3
0
3
0
4
0
4
0
2
0
3
0
4
1
2
3
4
5
6
7
8
9
1
2
2
3
4
4
5
6
6
7
8
8
9
2
4
6
8
s
s
0
1
s
0
0
1
s
1
0
1
0
0
0
0
2
2
2
2
4
4
4
4
l
l
k
k
6
6
6
6
8
8
8
8
1
1
1
1
0
2
0
3
0
4
0
2
0
2
0
3
0
3
0
4
0
4
0
2
0
3
0
4
1
2
3
4
5
6
7
8
9
1
2
2
3
4
4
5
6
6
7
8
8
9
2
4
6
8
s
s
s
s
0
1
0
0
1
1
0
1
0
0
0
0
2
4
6
8
2
4
6
8
2
4
6
8
2
4
6
8
s
s
0
1
0
0
1
s
s
0
1
0
1
1
0
0
2
2
2
2
8
5
k
2
2
4
4
4
4
l
l
k
k
6
4
4
1
6
6
6
6
l
l
1
l
4
6
6
8
8
8
8
2
l
5
8
8
1
1
1
1
0
1
1
1
5
0
0
0
2
0
3
0
4
0
2
0
2
0
3
0
3
0
4
0
4
0
2
0
3
0
4
k
0
2
0
3
0
4
0
2
0
2
0
3
0
3
0
4
0
4
0
2
0
3
0
4
2
4
6
8
2
2
4
4
6
6
8
8
2
4
6
8
0
1
0
0
1
M
0
1
0
1
1
0
0
8
5
k
2
2
6
4
4
1
5
class BasicBlock Indegree Outdegree
GWN(1.634,0.3) N(2.02,0.3) N(1.69,0.307)
MWN(1.7,0.3) N(2.08,0.45) N(1.68,0.35)
t 2.57 1.04 -0.47
Table2
®
distribution¯ttingandtesting
HorrorPlot'[25],where^
®
sandthecorresponding
CIswereveryjittery.
The¯ttedpower-lawexponents
®
indeg
,
®
outdeg
,
®
bb
,togetherwithindividualfunctions'callgraph
sizeareshownFig.5.Forbothclasses,therange
extendsfor
®
indeg
¼
[1.5-3],
®
outdeg
¼
[1.1-2.5]and
®
bb
¼
[1.1-2.1],withaslightlygreaterspreadfor
malware.
(a)GWsample:Fittingfor
alpha
indeg
and
k
c
2.6.Testingfordi®erence
Inowcheckwhetherthereareanystatistically
signi¯cantdi®erencesbetween(
®
,
k
c
)¯tforgood-
wareandmalware,respectively.Followingproce-
duresin[34],I¯nd
®
indeg
,
®
outdeg
and
®
bb
dis-
tributedapproximatelynormal.Theexponential
cuto®parameters
k
c
arelognormallydistributed.
Applyingastandardtwo-tailedt-test(Table2),I
¯ndatsigni¯cancelevel0.05(
t
critical
=1.97)only
¹
(
®
bb;malware
)
¸¹
(
®
bb;goodware
).
Forthebasicblocks,
k
c
¼LogN
(59
:
1
;
52)
(goodware)and
¼LogN
(54
:
2
;
44)(malware)and
¹
(
k
c
(
bb;malware
))=
¹
(
k
c
(
bb;goodware
))wasre-
jectedviaWilcoxonRankSumwith
z
=13
:
4.
Thesteeperslopeofmalware's
®
bb
implythat
functionsinmalwaretendtohavealowerbasic
blockcount.Thiscanbeaccountedforbythefact
thatmalwaretendstobesimplerthanmostap-
plicationsandoperateswithoutmuchinteraction,
hencefewerbranches,hencefewerbasicblocks.
Malwaretendstohavelimitedfunctionality,and
operateindependentlyofinputfromuserandthe
operatingenvironment.Also,malwareisusually
notcompiledwithaggressivecompileroptimiza-
tionsettings.Sucharegimeleadstomoreinlin-
ingandthusincreasesthebasicblockcountofthe
individualfunctions.Itmaybepossible,too,that
malwareauthorstendtobreakfunctionsintosim-
plercomponentsthan`regular'programmers.The
smallercuto®pointformalwareseemstocorrob-
oratethis,aswell,inthatthepowerlawrelation-
shipholdsoverashorterrange.However,thisex-
planationshouldberegardedasspeculativepend-
ingfurtherinvestigation.
(b)MWsample:Fittingfor
alpha
bb
and
k
c
Fig.4.Pareto¯ttingECCDFs,shownwithHillestimator
inset
Totentativelycorroboratetheconsistencyof
ourpositedParetomodel,30(goodware)and21
(malware)indegree,outdegreeandbasicblockEC-
CDFplotswereuniformlysampledintothreefunc-
tioncountblocks,withblockcriteriachosenas
£(10),£(100)and£(1000)functioncounts,yield-
ingasamplingcoverageof10%(goodware)and
17%(malware).Visualinspectionindicatesthatfor
malware,themodelseemedmoreconsistentfor
outdegreethanindegreeatallfunctionsizes.For
basicblockcount,theconsistencytendstobebet-
terforsmallerexecutables.Iseethesetendencyfor
goodware,aswell,withtheobservationthatout-
degreewasmostconsistentinsizeblock£(100);
for£(10)and£(1000).Forbothmalwareand
goodware,indegreeseemedtheleastconsistent,
quiteafewsamplesdidexhibitaso-called`Hill
[ Pobierz całość w formacie PDF ]